Bloom Filters:

BFs are probabilistic hash codes that reduce the amount of space needed to represent a set. They offer a trade-off between the size of the filter and the prob. of a false positive. (I.e., when the BF says x is not in the set, the answer is always correct. But when it says x is in the set, it may be wrong with some small prob.) This is extremely useful when most x are NOT in the set of interest.

Many applications: web proxy caching, distributed naming services, query routing, network measurements, dictionaries, bad password files.

Some EXAMPLES:

1. Suppose a program needs to perform a computational process on a large number of distinct cases. For the large majority of cases, the process is quite simple. However, for a "difficult to identify" minority set of cases, the process is quite complicated and time consuming. An obvious method is to "hash code" the "features/identifiers" of the minority cases. The traditional hash method will store all identifiers, therefore requiring a large table. The BF will be much smaller, but may occasionally flag an easy case as hard. Whenever a case is identified as hard, we use a more careful diagnosis method.

2. Cooperative Web Caching (or naming service). Different servers need to maintain summaries of what others have. So, at any point, A wants to know what to send to B; i.e., what A has but B doesn't. Constantly sending whole lists of URLs, when only a few differences exist, is a huge waste of bandwidth. Instead, B sends A its BF. A checks this against its own content, and sends B the diff. The penalty of being wrong here is trivial.

Traditional Hash Table:

A hash area is organized into cells; each cell is large enough to store the key/entity plus perhaps a live/dead bit. (For instance, the hash cell should be big enough to store the dictionary words, if the goal is to check if a word is valid.)

INSERT: Given x, we generate a pseudo-random number k = h(x) (the hash value), and check if the k-th cell is empty. If so, we store x there.
If not, we continue to generate additional hash addresses until an empty cell is found.

LOOKUP: Same as insert; at each address, check if the entry stored there matches the lookup entry.

ANALYSIS: Hashing is an error-free lookup scheme. The memory is proportional to the number of entries and their size.

Bloom Filters:

A BF does not store the entire set. So, the filter size DOES NOT depend on the sizes of the entries. Rather, the number of bits the BF needs per entry is determined by the error probability you want.

PRINCIPLE: Suppose S = {x1, x2, ..., xn}.
    k hash functions h1, h2, ..., hk, each mapping keys to [1, m].
    k m-bit hash tables; all entries initially 0.
    (Example hash funcs: x mod p, (x^2 + a) mod p', (ax + b) mod p, etc.)
    (Alternatively, one long bit vector of length km.)

INSERT: For each x \in S, set bit hi(x) = 1 in table i, for i = 1, ..., k. (This can be viewed as the "signature" of x.) Note that each location can be set by many different x's; but only the first setting matters.

LOOKUP: Compute h1(y), h2(y), ..., hk(y). Output "y in S" if and only if all these bits are 1; otherwise, output "y NOT in S".

Note that if the BF says "y NOT in S" it MUST be correct. But when the BF says "y in S" it may be wrong.

We now consider that false-positive probability. (Here m counts the total number of bits in the filter, as when all k hash functions share one m-bit vector.)

Prob. that a particular bit is 0 after all n entries are hashed:
    (1 - 1/m)^{kn} \approx e^{-kn/m} = p.
Prob. of a false positive (none of the k bits for y is zero):
    (1 - p)^k = (1 - e^{-kn/m})^k.

Suppose n = 2^16 = 65K entries and k = 10 hash functions. With m = kn / ln 2 (about 945K bits, or roughly 14.4 bits per entry), p = 1/2, so the prob. of a false positive is about (1/2)^10 (one in a thousand).

APPLICATIONS:

1. Unix Spell Checker: when memory was small/expensive, Bloom suggested using BFs.

2. Password files: Spafford suggests using a BF of "bad password lists".

3. Distributed Caching: If a local cache doesn't have a web page, instead of going to the internet, or the source, first check if another nearby cache has it. But it is too expensive to query each (possibly hundreds) proxy. So, each node maintains a summary of what other nodes have.
To reduce network traffic, nodes do not want to send complete URL lists constantly. These lists do not change that much, so only the diffs matter. We use BFs instead. A false positive doesn't have any disastrous effect.

4. Set reconciliation. A sends to B its BF, and B then sends the diff.

FUNDAMENTAL PROPERTIES OF BLOOM FILTERS:

1. UNION. If B1, B2 are Bloom filters for sets S1, S2 (built with the same size and the same hash functions), then we can easily compute the filter for S1 union S2. It's just the bitwise OR.

2. Counting Filters: Notice that DELETION is not easy to do with BFs. Since each 1 bit can be set by many entries, we can't just reset all the bits of x. To implement deletion, we replace each bit with a small counter; often a 4-6 bit counter suffices. On insert, we increment the counters. On deletion, we decrement.

Application of COUNTING Bloom Filters: Traffic Measurement.

Network managers at ISPs are constantly trying to understand the traffic characteristics of the network. They want to know when certain suspicious traffic patterns arise. One obvious thing to look for is "flows that are contributing a lot of network traffic" at a router or link. Call a flow (source, source-dest, source-application, etc.) a Heavy Hitter if it accounts for more than x% of the router capacity.

Exact measurement is infeasible: there are millions of flows, and backbone routers (OC 48 and higher) process several million packets per second. So the per-packet processing budget is in nano-seconds. The router's fast memory (nano-sec access) is very expensive and small.

Bloom Filter Approach. Hash each incoming packet's header into a k-hash BF. Increment each location's counter by the packet's payload size. Whenever a packet from flow F is hashed, and the MIN of its k counters > threshold, we declare F to be a Heavy Hitter. Note that if F is a heavy hitter, we will catch it. False positives come from many small flows adding up to a large flow....

SUMMARY: Vanilla Hashing is a great tool.
But it is inflexible; Bloom Filters build on it to offer an interesting space-error tradeoff. In many applications, vanilla hashing can't be used because it would require too much space, so BFs become invaluable.
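
As a rough illustration of the INSERT, LOOKUP, and UNION operations and of the false-positive estimate (1 - e^{-kn/m})^k from these notes, here is a minimal Python sketch. It is not from the original notes, and several choices are assumptions made only for illustration: all k hash functions share a single m-bit vector, the hash family is SHA-256 salted with the index i (the notes suggest arithmetic hashes such as (ax + b) mod p), and the names BloomFilter and false_positive_rate are hypothetical.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter: one m-bit vector shared by k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        # One byte per bit: wasteful, but keeps the sketch easy to read.
        self.bits = bytearray(m)

    def _positions(self, item):
        # Assumed hash family: SHA-256 of the item prefixed by the index i.
        data = str(item).encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        # Set the k bits that form the "signature" of the item.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def lookup(self, item):
        # False is always correct; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

    def union(self, other):
        # Bitwise OR; only meaningful when both filters share m, k and hashes.
        if (self.m, self.k) != (other.m, other.k):
            raise ValueError("filters must use the same m and k")
        result = BloomFilter(self.m, self.k)
        result.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
        return result


def false_positive_rate(n, m, k):
    """Estimate (1 - e^{-kn/m})^k for n entries, m bits, k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=10_000, k=7)
for word in ["apple", "banana", "cherry"]:
    bf.insert(word)
print(bf.lookup("banana"))  # True: an inserted item is never reported missing
print(bf.lookup("durian"))  # usually False, though a false positive is possible

# Choosing m = kn / ln 2 makes e^{-kn/m} = 1/2, so the estimate is near (1/2)^k.
n, k = 2**16, 10
m = round(k * n / math.log(2))
print(m, false_positive_rate(n, m, k))
```

Storing one byte per bit keeps the code short; a space-conscious version would pack eight bits per byte, which is the whole point of the structure in practice.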
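
The counting-filter idea and the heavy-hitter rule (add the packet's size to each of the flow's k counters, flag the flow when the MIN of those counters exceeds a threshold) can be sketched in Python along the following lines. This is only a sketch under stated assumptions: the class name CountingFilter, the function heavy_hitters, the SHA-256-based hash family, and the sample flow identifiers are all hypothetical, and Python's unbounded integers stand in for the small fixed-width counters a router would actually use.

```python
import hashlib


class CountingFilter:
    """Illustrative counting Bloom filter: m counters, k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, key):
        # Assumed hash family: SHA-256 of the key prefixed by the index i.
        data = str(key).encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key, amount=1):
        # Insert: increment each of the key's counters.
        for pos in self._positions(key):
            self.counters[pos] += amount

    def remove(self, key, amount=1):
        # Delete: decrement. Only safe for keys that were actually added.
        for pos in self._positions(key):
            self.counters[pos] -= amount

    def estimate(self, key):
        # MIN of the k counters; never below the true total added for key.
        return min(self.counters[pos] for pos in self._positions(key))


def heavy_hitters(packets, threshold, m=1024, k=4):
    """Return the flows whose minimum counter exceeds threshold bytes."""
    cf = CountingFilter(m, k)
    flagged = set()
    for flow, size in packets:
        cf.add(flow, size)
        if cf.estimate(flow) > threshold:
            flagged.add(flow)
    return flagged


# One large flow (100 packets of 1500 bytes) among 200 small 40-byte flows.
packets = [("10.0.0.1->10.0.0.9", 1500)] * 100
packets += [(f"10.0.1.{i}->10.0.0.9", 40) for i in range(200)]
print(heavy_hitters(packets, threshold=100_000))
```

Because every counter a flow touches is at least that flow's own byte total, the large flow is always reported; any extra flows in the output would be false positives caused by collisions.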