Class HyperLogLog
The HyperLogLog is a probabilistic algorithm for cardinality estimation by Flajolet et al. [1]. As with other probabilistic data structures, the algorithm trades space by accuracy. It trades the memory necessary to keep track of unique items by using "short bytes" to count elements.
The version implemented here takes inspiration from the version implemented in Redis [2]. Follows part of the Google paper [3], which makes a practical implementation, and the work on [4], with optimizations to the estimation accuracy.
How it works
From a bird's eye view, the algorithm uses a hash function that returns a value with (P + Q) bits. The P bits identify one bucket to place the count, and the Q we extract the number of consecutive trailing zeroes. Since the higher the number of zeroes, the lower the probability (N + 1 zeroes is half the N), the algorithm then collects the values from all the buckets to guess the cardinality.The implementation
We use theMurmurHash3
with 64 bits. Therefore, we have P=14
for the
number of buckets and Q=64-P
to count the number of trailing zeroes. In theory, this allows us to
count up to 264 elements!
We use two representations to store the elements. The first one, for small cardinalities (0 - 192), uses the explicit
hash return and a hash set, as seen on ExplicitSet
. The compact representation with
CompactSet
. The form the algorithm proposes with the "short bytes" and
occupies low memory.
The compact representation occupies roughly 12Kb and could count up to 264 elements. The explicit representation counts up to 192 because it starts using more than 12Kb of memory. After reaching the threshold, the representation changes from explicit to compact.
- Since:
- 15.0
- Author:
- José Bolina
- See Also: