# HyperLogLog

During Netconomy Summit I learned about an interesting algorithm called HyperLogLog. The idea is like this — suppose you have an array of a billion strings. Some of them are unique, some are duplicates. The problem is to figure out the number of unique strings.

The algorithm goes like this: for each new string we calculate a pseudorandom hash and look at its binary form. We look how many leading zeroes the hash had. And save a maximum number we have seen so far.

In the end, we can simply calculate 2-power-<maximum number> and that is our approximation.

Of course, such an approximation can be way off so here is part two of the algorithm: instead of one hash we calculate a 1000 different hashes and average out the result.

By tweaking the parameters we can get a reasonable result.

Now why we would need that at all? Suppose you have a very popular site with millions of visits and what you’re interested in is how many unique users have it.

This site uses Akismet to reduce spam. Learn how your comment data is processed.