Sampling
When the data volume is very large, we may need to work with a subset of the data to speed up analysis. This is sampling, a technique for selecting and analyzing a subset of data in order to discover patterns and trends in the whole dataset. In HQL, there are three ways of sampling data: random sampling, bucket table sampling, and block sampling.
Random sampling
Random sampling uses the rand() function and the LIMIT keyword to draw a sample of the data, as shown in the following example. The DISTRIBUTE BY and SORT BY clauses are used here so that the data is also randomly distributed among mappers and reducers efficiently. An ORDER BY rand() clause can achieve the same result, but it performs poorly because ORDER BY imposes a global sort through a single reducer:
> SELECT name FROM employee_hr
> DISTRIBUTE BY rand() SORT BY rand() LIMIT 2;
+--------+
|  name  |
+--------+
| Will   |
| Steven |
+--------+
2 rows selected (52.399 seconds)
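For comparison, the ORDER BY rand() alternative mentioned above can be sketched as follows, assuming the same employee_hr table. This is only an illustration of the simpler form. It returns a sample of the same size, and the rows returned will differ from run to run.

```sql
-- Random sample via a global sort: simpler to write than
-- DISTRIBUTE BY/SORT BY, but typically slower on large tables.
SELECT name FROM employee_hr
ORDER BY rand() LIMIT 2;
```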
Bucket table sampling
This is a special sampling method, optimized for bucket tables, as shown...