During Netconomy Summit I learned about an interesting algorithm called HyperLogLog. The idea is like this — suppose you have an array of a billion strings. Some of them are unique, some are duplicates. The problem is to figure out the number of unique strings.

The algorithm goes like this: for each new string we calculate a pseudorandom hash and look at its binary form. We look how many leading zeroes the hash had. And save a maximum number we have seen so far.

In the end, we can simply calculate 2-power-<maximum number> and that is our approximation.

Of course, such an approximation can be way off so here is part two of the algorithm: instead of one hash we calculate a 1000 different hashes and average out the result.

By tweaking the parameters we can get a reasonable result.

Now why we would need that at all? Suppose you have a very popular site with millions of visits and what you’re interested in is how many unique users have it.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.