Big Data Counting: How to count a billion distinct objects using only 1.5KB of Memory
This is a guest post by Matt Abrams (@abramsm), from Clearspring, discussing how they are able to accurately estimate the cardinality of sets with billions of distinct elements using surprisingly small data structures. Their servers receive well over 100 billion events per month.
At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is a challenge when the cardinality of the set is large.
To better understand the challenge of determining the cardinality of large sets, let's imagine that you have a 16-character ID and you'd like to count the number of distinct IDs that you've seen in your logs. Here is an example:
4f67bfc603106cb2
Stored as text, these 16 characters occupy 128 bits (16 bytes), so 65K IDs would require 1 megabyte of space. We receive over 3 billion events per day, and each event has an ID. Those IDs require 384,000,000,000 bits, or 45 gigabytes of storage. And that is just the space that the ID field requires! To get the cardinality of IDs in our daily events we could take a simplistic approach. The most straightforward idea is to use an in-memory hash set that contains the unique list of IDs seen in the input files. Even if we assume that only 1 in 3 records are unique, the hash set would still take 119 gigs of RAM, not including the overhead Java requires to store objects in memory. You would need a machine with several hundred gigs of memory to count distinct elements this way, and that is only to count a single day's worth of unique IDs. The problem only gets more difficult if we want to count weeks or months of data. We certainly don't have a single machine with several hundred gigs of free memory sitting around, so we needed a better solution.
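For illustration, a minimal sketch of that naive approach might look like the following (reading one ID per line from standard input is just an assumption for the example; the point is that the HashSet holds every unique ID in memory at once):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

// Naive exact counting: keep every distinct ID in memory.
// Memory grows linearly with the number of unique IDs, which is
// what makes this approach impractical at billions of elements.
public class NaiveDistinctCount {
    public static void main(String[] args) throws Exception {
        Set<String> seen = new HashSet<>();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String id;
        while ((id = in.readLine()) != null) {
            seen.add(id);                // one String object per unique ID
        }
        System.out.println("distinct IDs: " + seen.size());
    }
}
```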
One common approach to this problem is the use of bitmaps. Bitmaps can be used to quickly and accurately get the cardinality of a given input. The basic idea is to map the input dataset to a bit field using a hash function, where each input element ideally maps to its own bit in the field. With a one-to-one mapping there are no collisions, and the space required to count each unique element drops to a single bit. While bitmaps drastically reduce the space requirements of the naive set implementation described above, they are still problematic when the cardinality is very high and/or you have a very large number of different sets to count. For example, counting to one billion with a bitmap requires one billion bits, or roughly 120 megabytes per counter. Sparse bitmaps can be compressed to gain space efficiency, but that is not always helpful.
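As a rough sketch of the bitmap idea (java.util.BitSet and a plain hashCode() are simplifications for the example; a real deployment would size the map to the expected cardinality and use a stronger hash):

```java
import java.util.BitSet;

// Bitmap-based distinct counting: hash each ID to a bit position and set it.
// If the mapping were collision-free the count would be exact; with a real
// hash function the count slowly becomes an underestimate as bits collide.
public class BitmapCounter {
    private final BitSet bits;
    private final int size;

    public BitmapCounter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    public void offer(String id) {
        bits.set(Math.floorMod(id.hashCode(), size));
    }

    public int cardinality() {
        return bits.cardinality();   // number of bits set to 1
    }

    public static void main(String[] args) {
        BitmapCounter counter = new BitmapCounter(1 << 20);   // 2^20 bits = 128KB
        counter.offer("4f67bfc603106cb2");
        counter.offer("4f67bfc603106cb2");   // duplicate sets the same bit
        counter.offer("deadbeef00000000");
        System.out.println(counter.cardinality());   // 2, barring a hash collision
    }
}
```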
Luckily, cardinality estimation is a popular area of research. We've leveraged this research to provide an open source implementation of cardinality estimators, set membership detection, and top-k algorithms.
Cardinality estimation algorithms trade space for accuracy. To illustrate this point we counted the number of distinct words in all of Shakespeare's works using three different counting techniques. Note that our input dataset has extra data in it so the cardinality is higher than the standard reference answer to this question. The three techniques we used were Java HashSet, Linear Probabilistic Counter, and a Hyper LogLog Counter. Here are the results:
Counter     | Bytes Used | Count  | Error
HashSet     | 10,447,016 | 67,801 | 0%
Linear      | 3,384      | 67,080 | 1%
HyperLogLog | 512        | 70,002 | 3%
The table shows that we can count the words with a 3% error rate using only 512 bytes of space. Compare that to a perfect count using a HashSet that requires nearly 10 megabytes of space and you can easily see why cardinality estimators are useful. In applications where perfect accuracy is not paramount, which is true for most web-scale and network counting scenarios, using a probabilistic counter can result in tremendous space savings.
Linear Probabilistic Counter
The Linear Probabilistic Counter is space efficient and allows the implementer to specify the desired level of accuracy. This algorithm is useful when space efficiency is important but you need to be able to control the error in your results. It works in two steps. The first step allocates a bitmap in memory, initialized to all zeros, and applies a hash function to each entry in the input data; the result of the hash function maps the entry to a bit in the bitmap, and that bit is set to 1. The second step counts the number of empty bits and uses that number as input to the following equation to get the estimate:
n = -m ln(Vn)
In the equation, m is the size of the bitmap and Vn is the ratio of empty bits to the size of the map. The important thing to note is that the bitmap can be much smaller than the expected maximum cardinality; how much smaller depends on how much error you can tolerate in the result. Because the size of the bitmap, m, is smaller than the total number of distinct elements, there will be collisions. Those collisions are what make the approach space-efficient, but they are also the source of the error in the estimate. So by controlling the size of the original map we can estimate the number of collisions, and therefore the amount of error we will see in the end result.
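A bare-bones linear counter along these lines might look like the sketch below; the map size and hash function are illustrative choices for the example, not the ones used in stream-lib:

```java
import java.util.BitSet;

// Linear (probabilistic) counting: hash every element into an m-bit map,
// then estimate the cardinality from the fraction of bits still empty:
//     n ≈ -m * ln(Vn),  where Vn = (empty bits) / m
public class LinearCounter {
    private final BitSet map;
    private final int m;

    public LinearCounter(int m) {
        this.m = m;
        this.map = new BitSet(m);
    }

    public void offer(Object o) {
        map.set(Math.floorMod(o.hashCode(), m));
    }

    public long cardinality() {
        int emptyBits = m - map.cardinality();
        if (emptyBits == 0) {
            return m;   // map is saturated; the estimate is no longer usable
        }
        double vn = (double) emptyBits / m;
        return Math.round(-m * Math.log(vn));
    }

    public static void main(String[] args) {
        LinearCounter counter = new LinearCounter(1 << 16);   // 8KB bitmap
        for (int i = 0; i < 10_000; i++) {
            counter.offer("id-" + i);
        }
        System.out.println(counter.cardinality());   // roughly 10,000
    }
}
```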
Hyper LogLog
The Hyper LogLog Counter's name is self-descriptive. The name comes from the fact that you can estimate the cardinality of a set with maximum cardinality Nmax using registers of just loglog(Nmax) + O(1) bits each. Like the Linear Counter, the Hyper LogLog counter allows the designer to specify the desired accuracy tolerance; in Hyper LogLog's case this is done by defining the desired relative standard deviation and the max cardinality you expect to count. Most counters work by taking an input data stream M and applying a hash function to each element, which yields a multiset S = h(M) of uniformly distributed strings over {0,1}^∞ from which an observable is computed. Hyper LogLog extends this concept by splitting the hashed input stream into m substreams and maintaining one observable for each of them. Combining (taking a harmonic mean of) these m observables yields a counter whose accuracy improves as m grows, while still requiring only a constant number of operations per element of the input set. The result is that, according to the authors of the HyperLogLog paper, this counter can count one billion distinct items with a 2% error rate using only 1.5 kilobytes of space. Compare that to the roughly 120 megabytes a bitmap would require, let alone a HashSet, and the efficiency of this algorithm becomes obvious.
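To make the mechanics a little more concrete, here is a stripped-down HyperLogLog sketch. It is a teaching aid under simplifying assumptions rather than a usable implementation: it spends a full byte per register (the 1.5KB figure comes from packing registers into a few bits each), relies on a cheap hash mixer, and omits the small- and large-range corrections from the paper:

```java
// Minimal HyperLogLog core: split a 64-bit hash into a register index
// (the first b bits) and a remainder whose leading-zero run is recorded.
// The estimate is the bias-corrected harmonic mean over all registers.
public class TinyHyperLogLog {
    private final int b;            // number of index bits
    private final int m;            // number of registers, m = 2^b
    private final byte[] registers;

    public TinyHyperLogLog(int b) {
        this.b = b;
        this.m = 1 << b;
        this.registers = new byte[m];
    }

    public void offer(Object o) {
        long hash = mix(o.hashCode());
        int idx = (int) (hash >>> (64 - b));      // first b bits pick a register
        long rest = hash << b;                    // remaining bits
        int rank = Long.numberOfLeadingZeros(rest) + 1;
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;         // keep the maximum rank seen
        }
    }

    public long cardinality() {
        double sum = 0.0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);             // harmonic mean of 2^register
        }
        double alpha = 0.7213 / (1 + 1.079 / m);  // bias-correction constant
        return Math.round(alpha * m * m / sum);
    }

    // SplitMix64-style finalizer to spread the 32-bit hashCode over 64 bits.
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }
}
```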
Merging Distributed Counters
We've shown that using the counters described above we can estimate the cardinality of large sets. However, what can you do if your raw input dataset does not fit on a single machine? This is exactly the problem we face at Clearspring. Our data is spread out over hundreds of servers, and each server contains only a partial subset of the total dataset. This is where the fact that we can merge the contents of a set of distributed counters is crucial. The idea is a little mind-bending, but if you take a moment to think about it the concept is not that much different from basic cardinality estimation. Because the counters represent the cardinality as a set of bits in a map, we can take two compatible counters and merge their bits into a single map. The algorithms already handle collisions, so we can still get a cardinality estimation with the desired precision even though we never brought all of the input data to a single machine. This is terribly useful and saves us a lot of time and effort moving data around our network.
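Assuming the two counters being merged were built with the same size and the same hash function, the merge itself is a simple per-slot operation, roughly like this sketch:

```java
import java.util.BitSet;

// Merging distributed counters: because each counter is just a bit/register
// map keyed by the same hash function, two counters combine slot by slot,
// and the usual estimate can then be read off the merged map.
public class CounterMerge {

    // Bitmaps and linear counters merge with a bitwise OR.
    static BitSet mergeBitmaps(BitSet a, BitSet b) {
        BitSet merged = (BitSet) a.clone();
        merged.or(b);
        return merged;
    }

    // HyperLogLog registers merge by taking the per-register maximum.
    static byte[] mergeRegisters(byte[] a, byte[] b) {
        byte[] merged = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            merged[i] = (byte) Math.max(a[i], b[i]);
        }
        return merged;
    }
}
```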
Next Steps
Hopefully this post has helped you better understand the concept and application of probabilistic counters. If estimating the cardinality of large sets is a problem for you and you happen to use a JVM-based language, then you should check out the stream-lib project; it provides implementations of the algorithms described above as well as several other stream-processing utilities.
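As a rough idea of what using the library looks like (the class and method names here are recalled from the project and may differ between versions, so treat this as a sketch and check the documentation for the release you actually use):

```java
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

// Hypothetical stream-lib usage: build an estimator, offer elements, read
// the estimate. Duplicate elements do not increase the reported cardinality.
public class StreamLibExample {
    public static void main(String[] args) {
        HyperLogLog hll = new HyperLogLog(0.02);    // target ~2% relative error
        hll.offer("4f67bfc603106cb2");
        hll.offer("4f67bfc603106cb2");              // duplicate, no effect
        hll.offer("deadbeef00000000");
        System.out.println(hll.cardinality());      // ~2
    }
}
```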
Reader Comments (43)
The sentence "These 16 characters represent 128 bytes." was probably meant as "These 16 characters represent 128 bits.", since ASCII characters are 8 bits, and 16*8 = 128.
Good article
Probably worth mentioning alternatives: the obvious one (sort the set and count distinct elements), and Bloom Filters, which have similarities with the Linear Probabilistic Counter but use k hashes and a slightly different step 2.
You are right, I've submitted a fix. Embarrassing mistake!
I'm a little confused by the math in the beginning. Are there 128 bytes of additional information associated with an ID that you aren't displaying, or did you mean the 16 characters would take up 128 bits (not bytes), or are you converting those characters to some super inefficient data structure that uses 8 bytes to represent a character? If it is 128 additional bytes, wouldn't that mean you had 144 bytes of data per ID, and if it is 128 bits your math is about an order of magnitude off (you'd get 65536 IDs per MB). If the IDs are always hex, and you stored that as binary data, you'd only need 64 bits per ID, which would get you 131072 IDs per MB (assuming your example ID is the standard ID length).
Did you try using a bloom filter for computing the cardinality? I'd think that would let you tune memory usage vs acceptable error rates.
nice one
I was writing another comment, describing how the first number falsifies the others, but you have now corrected that. :)
What bothers me most is that "highscalability.com" DOES NOT PROOFREAD articles before they are published, which would help "save the face" of the well-intentioned authors. This is not just a one-off occurrence; maybe 1/4 of the articles I read here have at least one significant technical mistake.
For example, the article "Hazelcast 2.0: Big Data In-Memory" from the third of April says "The overall response time dramatically *decreases* if latency of each request is not consistent and low. " I'm pretty sure that they mean *increases*, as lower response times are better, and lower response times come from lower latency.
In the hyper log log section, I think you mean an accuracy of 98% or an error rate of 2%, right?
Sounds like a variant of a bloom filter. Curious if you looked at BFs and what were the comparative results.
You could use a trie instead of a HashSet which would be faster and reduce memory usage.
16 hexadecimal chars (0-9, a-f) is 64 bits or 8 bytes. 128 bits (e.g. an MD5 hash) is 32 hex characters.
I take it that the linear probabilistic counting algorithm is the one in the 1990 Transactions on Database paper "A Linear-Time Probabilistic Counting Algorithm for Database Applications" by Whang et al.
@nate, @sebastien - sorry about the mixup at the front of the post. That was unfortunate.
@nate, @greg, @mina - We do use bloom filters for membership detection and counting the number of occurrences of an individual key but we prefer to use counters when we are only interested in the total distinct count. I haven't tried but I do not think I could beat HyperLogLog in terms of memory usage with a bloom filter. I'd be interested to know if you think that would be possible, can you point me to any reference implementations? Hadoop's implementation only allows you to get a count for a given key, http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/util/bloom/CountingBloomFilter.html, as an example.
@Greg - you are right, a trie would be more efficient than a HashSet but it would still be much less efficient than bitmaps or the two probabilistic counters mentioned above.
@SR - yes, you are right. 2% accuracy would be pretty terrible. it should read 2% error rate.
For the people suggesting bloom filters, it's not actually possible to count the number of distinct elements in a bloom filter; you can only query it for membership. The techniques are reminiscent of how a BF works, but an actual BF does not support the operations needed here.
You can count using a bloom filter and there are variants using counters.
Surely you could use bloom filters and a counter to measure the cardinality - if a value is not in the bloom filter add it and increment the counter.
@george, you are right. That paper is linked to in the blog post. Also our stream-lib project cites that paper in our implementation of the linear probabilistic counter.
@Matt Abrams:
Google Guava added a Bloom filter recently. I'm not a data structure expert, but it might be worth looking at their implementation.
There's an implementation of a HyperLogLogCounter in the DSI Utilities, along with high performance bloom filters. The key distinction of this particular HyperLogLog is its ability to manage hundreds of millions of counters simultaneously. In other words, if you need to keep track of a few hundred million counters at the same time (like some kind of huge set of metrics), this is a good structure for the job.
http://dsiutils.dsi.unimi.it/docs/it/unimi/dsi/util/IntHyperLogLogCounterArray.html
@Lee - Your solution would only work on a single machine. There is no way to merge the counts from multiple blooms using this approach unless each bloom contains non-overlapping keys. So unless the data is partitioned so that you are guaranteed that unique keys are only ever processed by one node, you can't use this approach.
@Etienne - thanks for the pointer but that is just a standard bloom. I know the guava version has been updated so i'm not sure how it compares to the one we've implemented in the stream-lib project. I am not aware of any bloom implementation that would allow you to count the total number of elements in the set without doing something like @Lee suggested.
I think we should incorporate some of these algorithms in Guava.
Why can't we just build a trie kind of structure? If we reuse character nodes (I just tried implementing this), it amounts to 144,340 single-char nodes for the Shakespeare text pointed to above.
This works out to about 140KB, and with no loss of accuracy. Wouldn't that be a fair solution?
@Guruprasad - Nice work implementing the trie. The data set I used as an example in this blog is small enough to fit into a reasonable amount of space using a variety of data structures, tries included. It also only has a single dimension. Increasing the size and dimensions of the data set could cause the space requirements to jump significantly. The probabilistic solutions are for when it is either impossible or impractical to store the full data set by every dimension you'd like to count. In cases where you need perfect accuracy or your data set is small enough to fit in a reasonable amount of space, your solution is great.
@Dimitris - that is awesome. I'd love to help.
Hi Matt,
Nice post. This would be a good topic for the next Big Data meetup. :) You could expand upon how you take two compatible counters and merge their bits into a single map. Is this sort of a "divide and conquer" approach?
David
Note that some randomised data structures (e.g. Bloom filters) can be trivially parallelized by bitwise-ORing the cells with each other. You need to make sure that each task shares the same hash functions.
Hi Matt,
Have you considered implementing (approximate) histogramming in stream-lib?
For a review of methods see: http://www.cs.ucsb.edu/~suri/psdir/ency.pdf
Best regards,
Grimaldi