DSA Challenge - Bloom Filter

A C++ implementation of a Bloom filter, built as a group DSA challenge, together with research notes answering a set of questions about Bloom filters.


Features

Research

Group Information

We split the work into 3 corresponding parts, done across 2 days (Wednesday 2/8/23 and Thursday 3/8/23):

Introduction

A Bloom filter is a probabilistic data structure, conceived by Burton Howard Bloom in 1970, that trades error-freeness for space efficiency. It is mainly used to check the membership of an element against a set. It is probabilistic because there is a small probability of a check yielding a false positive: it may return true regardless of the element's actual membership. False negatives, however, are impossible. In other words, it is a special filter whose checks can only answer Definitely not there or It's there, maybe.

The high-level view is that it is a surjective map taking every element $x \in X$ to some $y \in Y$ by means of one or more hash functions.

For simple Bloom filters, distinguishing a false positive from a true positive is impossible. They are mostly used to prevent unnecessary disk accesses. As a practical example, Google Chrome previously used a Bloom filter to identify malicious URLs: only if the filter yields a positive result does the browser warn the user; if not, the URL is definitely not malicious.

State the factors to estimate the optimal size of the bit array and the number of hash functions for a Bloom filter

At its very core, a Bloom filter is a data structure that trades false positives for space efficiency on large data sets. So the error rate (or false positive rate) is a very important factor in anything related to these filters. In addition, the number of elements we expect to add to the filter also plays an important role in how many slots we should allocate in the bitset. In summary, the optimal size of the bit array depends mostly on the error rate and the number of elements.

The optimal number of bits per element is $-\frac{\log_2\epsilon}{\ln 2}$. Therefore, let $n$ be the number of elements we wish to insert and $\epsilon$ be the error rate desired at $n$ elements; the optimal size can then be calculated with:

$$\frac{m}{n}=-\frac{\log_2\epsilon}{\ln 2} \implies m = -\frac{n\log_2\epsilon}{\ln 2} = -n\cdot \frac{\ln\epsilon}{\ln 2}\cdot \frac{1}{\ln 2} = -\frac{n\ln\epsilon}{\ln^2 2}$$

The count of hashers, or hash functions, again depends on the number of elements we expect the filter to hold at most, since a good set of hashers should distribute bits uniformly across the bitset. But too many hashers relative to the bitset size is redundant and produces more false positives, so it also has to depend on the number of bits. And since the number of bits itself depends on the error rate, it still boils down to the error rate being the most important factor. In summary, the optimal number of hashers depends on the number of elements and the number of bits.

With $m$ being the bitset size and $n$ being the element count, we can calculate the optimal number of hash functions:

$$k = \frac{m}{n}\ln 2 = -\log_2\epsilon$$
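As a quick sanity check, both formulas can be evaluated directly. The sketch below assumes an example element count and error rate (illustrative values, not the ones used later in the project):

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>

// Derive the optimal bitset size m and hash-function count k
// from the expected element count n and the target error rate epsilon.
struct BloomParams {
    std::size_t m; // bits in the bitset
    std::size_t k; // number of hash functions
};

BloomParams optimalParams(std::size_t n, double epsilon) {
    const double ln2 = std::log(2.0);
    const double m = -static_cast<double>(n) * std::log(epsilon) / (ln2 * ln2);
    const double k = -std::log2(epsilon); // equivalently (m / n) * ln 2
    return {static_cast<std::size_t>(std::ceil(m)),
            static_cast<std::size_t>(std::llround(k))};
}

int main() {
    const BloomParams p = optimalParams(15000, 0.01); // 15,000 elements, 1% error rate
    std::cout << "m = " << p.m << " bits, k = " << p.k << " hashers\n";
    // Prints roughly: m = 143776 bits, k = 7 hashers
}
```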

How do hash functions impact the performance and accuracy of a Bloom filter?

As insertions and queries require going through a set of hash functions, both operations have complexity $O(k)$, with $k$ being the number of hash functions. If one of these hash functions has, for lack of a better word, a worse complexity, it heavily affects both operations.

The number of hashers also dictates how many bits get flipped during an insertion. If there are far too few hashers, queries only check that many bits, and the fewer bits checked, the higher the rate of false positives. The false positive rate can be derived with the following formula:

$$\epsilon = \left( 1 - \left[ 1 - \frac{1}{m} \right]^{kn} \right)^{k}$$

For the same exemplary $m$ and $n$ (with $n$ being substantially smaller than $m$) of some arbitrarily large numbers, as $k$ tends downwards, false positive rates go up:

(Graph: false positive rate $\epsilon$ against the number of hash functions $k$, for $m = 10\text{k}$ and $n = 1\text{k}$.)

There is also a slight descent at first, since with $m = 10\text{k}$ and $n = 1\text{k}$, $k = 10$ is not yet the optimal number of hash functions. But the rate starts spiking up fast as $k$ trends towards 1.
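The exact data behind the graph isn't reproduced here, but the spike is easy to recompute. The sketch below assumes $m = 10000$ and $n = 1000$ (matching the "10k" and "1k" above) and evaluates the formula for each $k$:

```cpp
#include <cmath>
#include <cstdio>

// Tabulate epsilon = (1 - (1 - 1/m)^(k*n))^k for k = 1..10.
int main() {
    const double m = 10000.0; // bitset size, assumed from "10k"
    const double n = 1000.0;  // element count, assumed from "1k"
    for (int k = 1; k <= 10; ++k) {
        const double epsilon = std::pow(1.0 - std::pow(1.0 - 1.0 / m, k * n), k);
        std::printf("k = %2d  epsilon = %.5f\n", k, epsilon);
    }
}
```

With these numbers the rate bottoms out near $k \approx 7$ (close to the optimum $\frac{m}{n}\ln 2 \approx 6.9$) at roughly $0.8\%$, and climbs to about $9.5\%$ by $k = 1$.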

Do lookup operations on a Bloom filter always give correct results? If not, indicate the estimation method for the false positive rate.

No, lookup operations on Bloom filters have a chance of turning up a false positive. After all, it is a data structure that favors probability over error-freeness; otherwise, we would have gone with similarly purposed structures like hash tables.

Assume $m$ is the number of bits in the array; the probability that a given bit gets flipped to 1 by a single hash is $\frac{1}{m}$. So, after inserting one element through $k$ hashers, the probability that a given bit hasn't been flipped to 1 is $P(0)=\left(1-\frac{1}{m}\right)^k$, with $k$ being the number of hashers.

After inserting $n$ elements, the probability that a given bit is still 0 is $P(0)^n$, and the probability that it is already 1 is therefore $P(1)=1-P(0)^n$. If we want to query for an element, the query goes through the $k$ hash functions; if all of the corresponding bits are 1, it returns a Maybe there claim, and for an element that was never inserted the probability of that happening is $P(1)^k$. This is the false positive rate, with the formula written in full as:

$$\epsilon = P(1)^k = \left(1 - P(0)^n\right)^k = \left[1-\left(1-\frac{1}{m}\right)^{kn}\right]^k$$

The same as what we used in the previous section.

Does the insertion of elements affect the false positive rate in a Bloom filter?

Yes, as the Bloom filter fills up with 1s, the false positive rate is obviously going to go up. If at some point the Bloom filter is comprised only of 1s, every query yields a Maybe there.

This is why we usually want to estimate how many elements we would like in a Bloom filter, and the false positive rate to go with it. At exactly $n$ elements, $\epsilon$ is our desired rate. Going over $n$, $\epsilon$ naturally goes up; with far fewer than $n$ elements, $\epsilon$ hasn't reached our desired rate yet.

The main point of Bloom filters is to filter fast and reliably (most of the time), so two things must come to mind before implementing a filter: the number of elements we want it to contain, and the tolerable false positive rate.

Are there any known techniques or strategies to reduce the false positive rate in a Bloom filter?

Generally, the simple technique is just to increase $m$ to a very large number, approaching infinity, while $n$ stays finitely small and $k$ is substantially less than $m$. Taking the limit as $m\to \infty$ of the false positive rate equation, we can see that $\epsilon \to 0$.
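Spelling that limit out with the formula from the previous sections, holding $k$ and $n$ fixed:

$$\lim_{m\to\infty}\left(1-\frac{1}{m}\right)^{kn} = 1 \implies \epsilon = \left[1-\left(1-\frac{1}{m}\right)^{kn}\right]^k \to (1-1)^k = 0$$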

Although, there has been a paper published researching the construction of a false positive free zone for general filters. For a small number $d$ and a restricted, finite universe $U=\{1,\dots,n_d\}$, as long as at most $d$ elements are in the filter, it is guaranteed that there will be no false positives for queries of elements from $U$.

To put it simply, here's the lemma: the filter has a false positive free zone with at most $d$ elements in the filter for a universe $U=\{1,\dots,n_d=n\}$ if:

$$n \leq \sqrt[d]{\prod_{j=1}^{k}p_j}$$

where $n$ is the threshold, $p_j$ is the $j$-th prime number (so $p_1=2$, $p_2=3$, …), and $d$ is the level of the universe.
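As a quick numeric illustration (the primes and $d$ here are arbitrary picks, not values from the paper): with $k=3$ hash functions built on the primes $2$, $3$, $5$ and $d=2$, the bound gives

$$n \leq \sqrt[2]{2\cdot 3\cdot 5} = \sqrt{30} \approx 5.48$$

so the false positive free zone only covers a very small universe; practical parameters require much larger primes.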

Is it possible to estimate the number of elements stored in a Bloom filter?

Yes, to an extent. The limitation is similar to this analogy: with a ball on top of a hill, given the gravitational force and the force applied to the ball, it is relatively simple to work out where the ball will land after rolling down; but given only the destination and the force applied, it is nearly impossible to figure out exactly where it started, as there are too many "points" in space to consider, though we can make an educated guess. The same applies to Bloom filters: Swamidass & Baldi showed that the number of items in a Bloom filter can be approximated with:

$$n^*= -\frac{m}{k}\ln\left(1-\frac{x}{m}\right)$$

where $n^*$ is the estimate of the number of items inserted, $m$ is the length of the bitset, $k$ is the number of hashers, and $x$ is the number of bits that are "on", or set to one.
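A small sketch of that estimator in code; the bitset length, hasher count, and number of set bits below are arbitrary example values, not the project's:

```cpp
#include <bitset>
#include <cmath>
#include <iostream>

constexpr std::size_t M = 1 << 20; // bitset length m (example value)
constexpr std::size_t K = 7;       // number of hash functions k (example value)

// Swamidass & Baldi estimate: n* = -(m / k) * ln(1 - x / m),
// where x is the number of bits currently set to one.
double estimateCount(const std::bitset<M>& bits) {
    const double x = static_cast<double>(bits.count());
    return -(static_cast<double>(M) / K) * std::log(1.0 - x / M);
}

int main() {
    static std::bitset<M> bits; // static storage: 2^20 bits is a bit large for the stack
    // Pretend earlier insertions flipped 10,000 distinct bits.
    for (std::size_t i = 0; i < 10000; ++i) bits.set(i * 17 % M);
    std::cout << "estimated n* ~= " << estimateCount(bits) << '\n'; // about 1435
}
```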

Can a Bloom filter handle deletions of elements from the set it represents?

A simple Bloom filter cannot, per se, handle deletions of elements properly. If by deleting we mean just clearing one bit of the mapped indices to 0, that would effectively remove the element from the set, but we don't know whether other elements also happen to map to a set of bits that includes that bit. With a simple algorithm, we also have no way of knowing which bit to clear, as there are commonly multiple hash functions.

A more advanced variation of the Bloom filter, the Counting Bloom Filter, uses counters, similarly to Counting Sort, instead of just a boolean value of true or false. Each insertion increments the mapped counters and each deletion decrements them, so a deletion is possible provided that the element has actually been added to the filter.
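A minimal sketch of that counting idea (the hashing scheme and sizes are made up for illustration; the actual project uses a plain bitset):

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <string>

// Counting Bloom filter sketch: counters instead of bits,
// so deletions can decrement what insertions incremented.
class CountingBloomFilter {
    static constexpr std::size_t M = 100000; // number of counters (example value)
    static constexpr std::size_t K = 7;      // number of hash functions (example value)
    std::array<std::uint8_t, M> counters{};

    // Double hashing to simulate K hashers (illustrative only).
    std::size_t slot(const std::string& key, std::size_t i) const {
        const std::size_t h1 = std::hash<std::string>{}(key);
        const std::size_t h2 = std::hash<std::string>{}(key + "#salt");
        return (h1 + i * h2) % M;
    }

public:
    void insert(const std::string& key) {
        for (std::size_t i = 0; i < K; ++i) {
            std::uint8_t& c = counters[slot(key, i)];
            if (c < 255) ++c; // saturate instead of overflowing
        }
    }

    // Only call this for keys that were actually inserted.
    void remove(const std::string& key) {
        for (std::size_t i = 0; i < K; ++i) {
            std::uint8_t& c = counters[slot(key, i)];
            if (c > 0) --c;
        }
    }

    bool maybeContains(const std::string& key) const {
        for (std::size_t i = 0; i < K; ++i)
            if (counters[slot(key, i)] == 0) return false; // definitely not there
        return true; // maybe there
    }
};
```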

Can a Bloom filter be dynamically resized to accommodate more elements?

A Bloom filter only works as intended when you know the number of elements to be inserted in advance, along with the desired false positive rate. Technically, a Bloom filter can already accommodate an infinite number of elements, but then the false positive rate trends towards 1, which rather defeats the point of the filter.

If the threshold of the "optimal element count" is exceeded, a new Bloom filter with a larger capacity can be created if and only if we know exactly which elements have been added, not just which bits have been flipped, because moving from $m$ to $m'$ changes the hash functions' outputs. However, a Bloom filter only stores the probability of existence of any element; the elements themselves have to be stored externally. So for simple filters without that external storage, a dynamic resize is impossible.

Straying from the subject a bit, there has been research on the Scalable Bloom Filter, which is itself comprised of multiple small filters and allows arbitrary growth of the set. Each successive filter is created with a tighter maximum error probability on a geometric progression, so that the compounded probability over the whole series converges even as the number of filters tends towards infinity. Querying is then made by testing for presence in each filter.
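One way to see that convergence (a sketch, assuming the $i$-th filter is given an error budget of $P_0 r^i$ for some ratio $0 < r < 1$): the total error is bounded by the geometric series

$$P \leq \sum_{i=0}^{\infty} P_0\, r^{i} = \frac{P_0}{1-r}$$

so choosing, say, $r = \tfrac{1}{2}$ keeps the overall false positive rate within twice the first filter's budget, no matter how many filters get added.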

Compare between Bloom Filter and Hash Table

Assume that, for Bloom filters, we only care about the actual bit arrays (or integer arrays in the case of the counting variation), and that the actual elements do not get stored.

| Feature | Bloom Filter | Hash Table | Counting Bloom Filter |
| --- | --- | --- | --- |
| Space Efficiency | $1.44\log_2(1/\epsilon)$ bits per key | However many bits a key needs | $1.44\log_2(1/\epsilon)$ integers per key |
| Time Efficiency | Always $O(k)$ | The bigger the capacity, the closer to $O(1)$ | Always $O(k)$ |
| Insertion | Always $O(k)$ | Depends on $O(h)$, where $h$ is the complexity of the hash function | Always $O(k)$ |
| Deletion | Not possible | Depends on $O(h)$ again | Always $O(k)$, provided the element exists |
| Queries | Always $O(k)$, with a chance of false positives | Depends on $O(h)$, no chance of false results | Always $O(k)$, with a chance of false positives |
| Limitations | Cannot be resized, cannot know the exact number of elements, false positives always exist | Much larger space needed | Cannot be resized, cannot know the exact count, false positives always exist |

The integers that a counting filter uses usually only consist of 3 to 4 bits, so the space needed for them is usually about three to four times that of a simple filter.
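To put rough numbers on the space row: at, say, $\epsilon = 1\%$, a simple filter needs about $1.44\log_2(100) \approx 9.6$ bits per key, a counting filter with 3- to 4-bit counters lands around $29$ to $38$ bits per key, while a hash table storing, for example, 32-byte keys needs $256$ bits per key before any bookkeeping.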

With sufficient memory, a hash table is usually preferable; but when memory is limited compared to how large the data is, a Bloom filter is used to minimize the number of reads needed, since false positives are possible but false negatives are not. So if the filter says an element doesn't exist, there is no need to actually read anything to check.

Overall on Data Structures

The project is quite small and intuitive; the sources folder consists of 3 notable files:

We are utilizing a few C++ libraries to help with the project:

We declare a simple Account struct containing a username field and a password field, both of type string. Another Database struct holds a vector of Accounts. Then, the jam and butter: the BloomFilter struct holds a bitset with an arbitrarily large number of bits, as sketched below.
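A minimal sketch of how those pieces might fit together (field names and the hashing scheme are illustrative; the actual implementation lives in the sources folder):

```cpp
#include <bitset>
#include <functional>
#include <string>
#include <vector>

constexpr std::size_t M = 10'000'000; // bits in the filter, as chosen below
constexpr std::size_t K = 50;         // number of hash functions, as chosen below

struct Account {
    std::string username;
    std::string password;
};

struct Database {
    std::vector<Account> accounts; // the actual records live outside the filter
};

struct BloomFilter {
    std::bitset<M> bits; // ten million bits, all initially 0

    void insert(const std::string& key) {
        for (std::size_t i = 0; i < K; ++i)
            bits.set(hashAt(key, i));
    }

    bool maybeContains(const std::string& key) const {
        for (std::size_t i = 0; i < K; ++i)
            if (!bits.test(hashAt(key, i))) return false; // definitely not there
        return true; // maybe there
    }

private:
    // Stand-in for the project's 50 hashers, via double hashing.
    static std::size_t hashAt(const std::string& key, std::size_t i) {
        const std::size_t h1 = std::hash<std::string>{}(key);
        const std::size_t h2 = std::hash<std::string>{}(key + "*");
        return (h1 + i * h2) % M;
    }
};
```

A filter this size is about 1.25 MB of bits, so it is better kept in static storage or on the heap rather than on the stack.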

For the Bloom filters, we chose $m=10{,}000{,}000$ (ten million) bits to store and $k=50$ (fifty) hash functions. Although we expect about $15000$ elements to be inserted, the dudes that worked on the coding part deliberately chose an absurdly high $m$, so that the false positive rate is now approximately $9 \times 10^{-58}$.
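Plugging those numbers into the false positive formula from earlier:

$$\epsilon = \left[1-\left(1-\frac{1}{10^7}\right)^{50\cdot 15000}\right]^{50} \approx \left(1-e^{-0.075}\right)^{50} \approx 9\times 10^{-58}$$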

To grasp that number: our world today has about 8 billion people. Suppose we pick 1 person at random, and it happens to be Tom Holland. Then we do it again, and it again happens to be Tom Holland. Then again, and again, and again… In this imaginary scenario, if we do this random picking 5 times and every single time it just happens to be Tom Holland, the probability of that happening is still tens of millions of times more likely than a false positive turning up.

Program’s Lifecycle

A flowchart for easier grasping:

```mermaid
flowchart TD
   s1("Initializing")
   s2("Pull from local database (AccountFile.txt)")
   s1 --> s2
   s2 --> s3("Open menu for user")
   s3 --> c1{"If choice"}
   c1 --> |Exit| aExit("End program")
   c1 --> |Anything else| aThing("Do the action")
   aThing --> s3
```

References