Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How Facebook uses HyperLogLog in Presto to speed up cardinality estimation

Save for later
  • 180 min read
  • 2018-12-18 06:02:07

article-image

Yesterday, Facebook shared how they use HyperLogLog (HLL) in Presto to do computational intensive operations like estimating distinct values within a huge dataset. With this implementation, they were able to achieve up to 1,000x speed improvements while working on count-distinct problems.

What is HyperLogLog?


HyperLogLog is an algorithm designed for estimating the number of unique values within a huge dataset, which is also known as cardinality. To produce an estimate of the cardinality it uses an auxiliary memory of m units and performs a single pass over the data. This algorithm is an improvised version of the previously known cardinality estimator, LogLog.

Facebook uses HypeLogLog in scenarios like determining the number of distinct people visiting Facebook in the past week using a single machine. To even further speed up these type of queries they implemented HLL in Presto, which is an open source distributed SQL query engine. Presto is designed for running interactive analytic queries against data sources of all sizes.

Using HLL, the same calculation can be performed in 12 hours with less than 1 MB of memory. Facebook highlights that they have seen great improvements, with some queries being run within minutes, including those used to analyze thousands of A/B tests.

Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at €14.99/month. Cancel anytime

Presto’s HLL implementation


The implementation of HLL data structures in Presto consists of two layout formats: sparse and dense. To save on memory the storage starts off with a sparse layout and when the input data structure goes over the prespecified memory limit for the sparse format, Presto switches to the dense layout automatically. The sparse layout is used to get almost the exact count in low-cardinality datasets, for instance, the number of distinct countries. The dense layout is used in the cases where the cardinality is high such as the number of distinct users.

There is an HYPERLOGLOG data type in Presto. Those users who prefer a single format so that they can process the output structure in other platforms such as Python, there is another data type called P4HYPERLOGLOG, which starts and stays strictly as a dense HLL.

To read more in detail about how Facebook uses HLL, check out their article.

Facebook open-sources PyText, a PyTorch based NLP modeling framework

Facebook contributes to MLPerf and open sources Mask R-CNN2Go, its CV framework for embedded and mobile devices

Australia’s ACCC publishes a preliminary report recommending Google Facebook be regulated and monitored for discriminatory and anti-competitive behavior