Design optimization
Design optimization covers several designs, data formats, and job optimization strategies to improve performance. This will be covered in the following sections.
Partition table design
Hive partitioning is one of the most effective ways to improve query performance on larger tables. A query with partition filtering will only load data from the specified partitions (sub-directories), so it can execute much faster than a normal query that filters by a non-partitioning field. The selection of the partition key is always an important factor for performance. It should always be a low-cardinal attribute to avoid so many sub-directories overhead. The following are some attributes commonly used as partition keys:
- Partitions by date and time: Use date and time, such as year, month, and day (even hours), as partition keys when data is associated with the date/time columns, such as
load_date
,business_date
,run_date
, and so on - Partitions by location: Use country, territory, state, and...