Hyperloglog: Obtaining the cardinality of a set in a probabilistic way

Gonzalo Bordanzi
Published in devartis
Jan 24, 2018 · 8 min read


Introduction

A few years ago, I was presented with a fairly simple request: to know how many different users were accessing our client’s website. A simple task on paper, but at the time the project I was working on used Redis exclusively, an in-memory database, to store all our data. That made sense, since our application was a big data-analysis board that was constantly being fed events coming from the client’s website. We had a few million events per minute and we only wanted to keep a month of data to display in our analytics board. Our data mostly consisted of daily, weekly and monthly counters, displayed on several graphs and tables.

Back to the task at hand: among the events coming into our application there was one type, called enter_website_event, that signaled a user entering the website. With this event we could store the user ids and then count the number of different ids we had at any given moment. Instead of saving every user id, we only wanted the ones that hadn’t already appeared, so we chose a set to store our user ids, a structure natively supported by Redis.
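For reference, this is roughly what that first approach looks like with the redis-py client; the connection details, key name and event shape are invented here just for illustration:

```python
import redis

# Hypothetical connection and key name, just for illustration.
r = redis.Redis(host="localhost", port=6379)

def handle_enter_website_event(event):
    # SADD stores the id only if it is not already in the set,
    # so duplicate ids are ignored automatically.
    r.sadd("unique_users:2018-01-24", event["user_id"])

def unique_users_today():
    # SCARD returns the exact cardinality of the set.
    return r.scard("unique_users:2018-01-24")
```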

At first it worked fine, but after a few days it started showing problems. The space needed to store a set is proportional to its cardinality, and in our case our sets held millions of different users. Essentially, each of our sets was using about 8 GB of memory, and that was not a viable situation.

After stumbling through different solutions for a few days, I found a blog post about a (then) new structure supported by Redis: the hyperloglog. But what is this hyperloglog, and how did it help me with this problem?

Hyperloglog — Why is it useful?

Hyperloglog (HLL from now on) refers to a probabilistic algorithm, derived from the LogLog algorithm, used to estimate the cardinality of a really large set without the necessity of storing all its values. As I said before, regular sets, or bitmaps, can be very resource intensive. HLL addresses this problem using fixed-size structures that, depending on the implementation, can be smaller than 16 KB. In exchange for the low resource requirement, the cardinality measurement is probabilistic, meaning it is bound to have an error, usually less than 2%.

Let’s put this error rate into numbers, to make it easier to see how much we are talking about:

  1. Let’s assume that we have a set with 1,000,000 unique users
  2. A 2% error means that, when calculating the cardinality, the estimate can miss the true 1,000,000 by up to 20,000 in either direction
  3. The two worst-case scenarios are then (1,000,000 − 20,000) = 980,000 and (1,000,000 + 20,000) = 1,020,000 (see the short calculation below)
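The same numbers as a tiny script, in case you want to plug in your own cardinality and error rate:

```python
# Worst-case bounds for a given true cardinality and error rate.
true_cardinality = 1_000_000
error_rate = 0.02  # the 2% from the example above

margin = int(true_cardinality * error_rate)            # 20_000
low, high = true_cardinality - margin, true_cardinality + margin
print(low, high)                                       # 980000 1020000
```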

We can comment on a few things about this little bit of math. First, we assumed an error of 2%, when most HLL implementations achieve error rates lower than 1%. Second, since there is an error rate, you should take it into account if you are going to use HLL: depending on the problem you are trying to address, it may prove to be too much, or perfectly tolerable.

Hyperloglog — How does the magic work?

Well, up until now we’ve talked about what this HLL algorithm is and why it is useful, but now let’s talk about how it does it. Basically, HLL estimates the cardinality of a given set by hashing every element and counting the number of 0s to the left (the leading 0s) in each hash.

Let’s put this into a practical example extracted from here. Imagine that you’ve been tossing a coin and keeping track of the longest uninterrupted run of heads you’ve gotten. If you tell me that your maximum was 3, I can figure that you haven’t tossed the coin many times. If you tell me it was 10 instead, I can be pretty sure you’ve tossed the coin a lot of times.

Basically this is what HLL does, but instead of counting ‘heads in a row’ it counts ‘0s to the left of the hash’. A good hashing algorithm ensures that every possible hash has roughly the same probability of appearing, with an even distribution. This allows the HLL algorithm to estimate the number of elements it has ‘seen’ based on the hash with the longest run of 0s to the left. For example, let’s say I have a hash function that, given an element, returns the binary representation of one of the numbers 0–15:

0000 — 0100 — 1000 — 1100
0001 — 0101 — 1001 — 1101
0010 — 0110 — 1010 — 1110
0011 — 0111 — 1011 — 1111

Looking at these numbers, and remembering that since it’s a hash result every number has the same probability of occurring, we can state the following probabilities for a hash starting with:

0 — 50% (1 in 2)
00 — 25% (1 in 4)
000 — 12.5% (1 in 8)
0000 — 6.25% (1 in 16)

So I need to see around 8 hashed elements in order to see a hash starting with ‘000’. Generalizing, I would need to see around ‘2^k’ elements to get one with ‘k’ 0s to the left, and inverting this: if we know that our highest number of 0s to the left is ‘k’, then it is probable that we’ve already seen about ‘2^k’ elements (our estimated cardinality).
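As a rough sketch of this single-counter idea, here is a toy version in Python; the choice of MD5 truncated to 32 bits is mine, purely for illustration, and not necessarily what any real HLL implementation uses:

```python
import hashlib

HASH_BITS = 32  # arbitrary width, chosen only for this sketch

def hash32(value: str) -> int:
    # Take the first 4 bytes of an MD5 digest as a 32-bit hash.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")

def leading_zeros(x: int, bits: int = HASH_BITS) -> int:
    # Number of 0 bits before the first 1 in a fixed-width representation.
    return bits - x.bit_length() if x else bits

# Keep only the longest run of leading 0s seen so far.
max_zeros = 0
for user_id in ("user-1", "user-2", "user-3"):
    max_zeros = max(max_zeros, leading_zeros(hash32(user_id)))

# Naive single-counter estimate: we have probably seen about 2^k elements.
estimate = 2 ** max_zeros
```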

But then what happens if I get a hash value with multiple 0s really early?

Well, let’s go back to our coin example. Imagine now that you got lucky, hit your 10 heads in a row within your first 10 tosses and immediately stopped, making my assumption of a lot of tosses very far from the actual number. Taking this into account, I may ask you to repeat the experiment, but this time using 10 coins and 10 pieces of paper to record the longest run of heads for each coin. This lets me observe much more data, and gives me a much more accurate estimate.

In a similar way, the HLL algorithm splits the hash result in two, using the first part to address a bucket (our coin/paper pair) and the second part to count the 0s to the left, storing the largest count seen so far (writing on the paper the longest run of heads we’ve encountered).

Let’s put this into a simple example. Imagine that we decide to divide our set into 8 subsets, or 8 buckets; that means we need 3 bits to address them, since 2³ is 8. Our buckets are then addressed by the following binary numbers:

000 – 001 – 010 – 011 – 100 – 101 – 110 – 111

Then, for every hash we get, we check the first 3 bits to know which bucket to put it in. Say, for example, that we have a hash function that returns 8-bit values. We use the first 3 bits to pick the bucket and the remaining 5 bits to count the 0s, as follows:

Hash 00101100 → first 3 bits 001 (bucket 1), remaining bits 01100, one leading 0
Hash 11000101 → first 3 bits 110 (bucket 6), remaining bits 00101, two leading 0s
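Written as a small helper, the split looks like this (the 3/5 bit split matches the example above):

```python
def split_hash(hash_bits: str):
    # First 3 bits select the bucket, the remaining 5 are used to
    # count the 0s to the left, exactly as in the example above.
    bucket = int(hash_bits[:3], 2)
    rest = hash_bits[3:]
    zeros = len(rest) - len(rest.lstrip("0"))
    return bucket, zeros

print(split_hash("00101100"))  # (1, 1): bucket 001, one leading 0
print(split_hash("11000101"))  # (6, 2): bucket 110, two leading 0s
```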

By spreading the hashes over many buckets we diminish the error that random lucky occurrences produce, and if we want to get our cardinality at any given point, we just need to estimate the cardinality of each bucket and take an average.

For the sake of simplicity I said “average”, but the actual HLL algorithm uses the harmonic mean instead of the arithmetic mean. This dampens the impact of large outliers, effectively increasing the accuracy of the estimate. The use of the harmonic mean is the main difference between the original LogLog algorithm and the newer Hyperloglog algorithm.
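Here is a toy version of that combining step; the bucket values below are invented, and the bias-correction constant alpha_m that the real algorithm multiplies in front is skipped:

```python
# Toy combining step: m buckets, each storing the longest run of
# leading 0s it has seen (the values below are invented).
buckets = [3, 2, 4, 2, 3, 1, 2, 3]
m = len(buckets)

# Harmonic mean of the per-bucket estimates 2^k: one lucky bucket
# with a huge k barely moves the result.
harmonic_mean = m / sum(2 ** -k for k in buckets)

# The real HyperLogLog estimate is roughly alpha_m * m * harmonic_mean,
# where alpha_m is a bias-correction constant skipped in this sketch.
raw_estimate = m * harmonic_mean
```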

About the Redis implementation

Redis implements its HLL structure using 16384 registers, or “buckets”, bringing the standard error down to 0.81%.

As for the hash function, the one used by Redis has a 64-bit output, meaning the first 14 bits address the 16384 registers and the remaining 50 bits are used to count the 0s to the left. As we’ve seen before, each bucket stores the longest run of 0s seen so far, and since the highest possible value is 50 (there are only 50 remaining bits that can be 0s), each bucket needs 6 bits to store values up to 50 (which is 110010 in binary). Adding it up, we need 98304 bits (6 for each of the 16384 buckets) to store one HLL structure, and translating those bits into bytes we get 12288 bytes, or 12 KB, which is the fixed size of any hyperloglog in the Redis implementation.
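From the client’s point of view, using it boils down to three commands: PFADD, PFCOUNT and PFMERGE. A minimal redis-py sketch, with made-up key names:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# PFADD hashes each element and updates the 16384 internal registers.
r.pfadd("hll:unique_users:2018-01-24", "user-1", "user-2", "user-3")
r.pfadd("hll:unique_users:2018-01-25", "user-2", "user-4")

# PFCOUNT returns the estimated cardinality (standard error ~0.81%).
daily = r.pfcount("hll:unique_users:2018-01-24")

# PFMERGE unions several HLLs, e.g. to build a weekly counter
# out of the daily ones, still within the fixed 12 KB.
r.pfmerge("hll:unique_users:2018-w04",
          "hll:unique_users:2018-01-24",
          "hll:unique_users:2018-01-25")
weekly = r.pfcount("hll:unique_users:2018-w04")
```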

Wrap up

So, we’ve discussed the theory and the implementation of this algorithm, but how does it work in practice?

In our case, we needed 5 counters for each day of the week, 5 for each week of the month, and one for the month, adding up to 56 counters (5 × 7 + 5 × 4 + 1) that needed to be available.

With the first solution we thought of, each of those counters would have been backed by a set’s cardinality. That would’ve meant 56 sets to account for our 56 counters, and since each set was using roughly 8 GB of memory, we would’ve needed around 450 GB available just to store the sets.

After implementing the HLL solution, each counter was represented by an HLL structure instead. Since each HLL structure has a fixed size of 12 KB in the Redis implementation, our 56 structures were using 672 KB in total, roughly 0.00014% of our previous solution’s memory requirement.
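The back-of-the-envelope comparison, if you want to check the numbers yourself:

```python
# Back-of-the-envelope memory comparison for the 56 counters.
counters = 56

sets_total_kb = counters * 8 * 1024 * 1024   # ~8 GB per set, expressed in KB
hll_total_kb = counters * 12                 # 12 KB per HLL structure

print(sets_total_kb)                         # 469762048 KB (~448 GB)
print(hll_total_kb)                          # 672 KB
print(hll_total_kb / sets_total_kb)          # ~1.4e-06, about 0.00014%
```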

On the other hand, we didn’t have exact values anymore; we had probabilistic values, which meant error rates. In our case a 1% error rate wasn’t a problem, since our results were feeding graphs and administration panels that didn’t need exact data, they needed representative data.

In conclusion, the Hyperloglog algorithm gives us the possibility of having really small structures capable of estimating the cardinality of really big sets for the price of a minimal error. As I said before, each problem will tell us whether we need exact data or whether we can tolerate this low error rate to take advantage of the tiny size this structure provides.

Visit Us!
