Buckets
Besides partitioning, bucketing is another technique for clustering a dataset into more manageable parts to optimize query performance. Unlike a partition, which maps to a directory, a bucket corresponds to a segment of files in HDFS. For example, the employee_partitioned table from the previous section uses year and month as its top-level partitions. If employee_id were added as a third partition level, it would create many partition directories. Instead, we can bucket the employee_partitioned table using employee_id as the bucket column. The value of this column is hashed into a user-defined number of buckets, so records with the same employee_id are always stored in the same bucket (segment of files). Bucket columns are defined with the CLUSTERED BY keywords. They differ from partition columns in that partition columns refer to directories, while bucket columns must be actual data columns of the table. By using buckets, an HQL query can easily and efficiently do sampling...
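
The hashing step described above can be pictured with a small Python sketch. It is an illustration under stated assumptions, not Hive's actual implementation: it assumes an integer employee_id whose hash is the value itself, and the names NUM_BUCKETS, bucket_for, and assign_buckets are hypothetical. The real hash function depends on the column type and the Hive version.

```python
# Illustrative sketch only; the hash Hive really applies depends on the
# column type and the Hive version.
NUM_BUCKETS = 2  # hypothetical, user-defined number of buckets


def bucket_for(employee_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the bucket index for an integer key.

    Assumption: the hash of an integer is the integer itself. The mask
    keeps the value non-negative before taking the modulus.
    """
    return (employee_id & 0x7FFFFFFF) % num_buckets


def assign_buckets(employee_ids, num_buckets: int = NUM_BUCKETS):
    """Group keys by bucket index, preserving input order within a bucket."""
    buckets = {i: [] for i in range(num_buckets)}
    for emp_id in employee_ids:
        buckets[bucket_for(emp_id, num_buckets)].append(emp_id)
    return buckets


if __name__ == "__main__":
    # The repeated key 101 lands in the same bucket both times.
    print(assign_buckets([100, 101, 102, 103, 101]))
```

Because the bucket index is a pure function of the key and the bucket count, every record with the same employee_id maps to the same bucket, which is the property the text relies on.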