HASHING: A great practical tool, with interesting & subtle theory too.
Basic setting: We have a set of items we want to store for efficient lookup.
Two versions:
- STATIC: Given a set S of items, store them for efficient lookup.
- DYNAMIC: The data structure processes a sequence of inserts, lookups,
and delete requests. Want to do these all efficiently.
In the static case, we can use a sorted array with binary search for lookups.
In the dynamic case, we can use a balanced search tree. Both those schemes
have O(log n) complexity.
Hashing gives an alternative approach that is often the fastest and most
convenient way of solving these problems.
Applications: An AI-search program may want to "cache" subproblems or
situations (board positions or elements of state-space) that the program
has already solved before; we don't want to redo the same computation when
we encounter them again. Many other uses in cryptography, networks,
complexity theory.
FORMAL SETUP
------------
- The keys come from a large universe U. (e.g., all < 50-character strings)
- The INSTANCE S is a subset of U (static or dynamic). N = |S|.
These are the keys a particular run of the program must process.
- The data structure is an array (table) A of size M, and a HASH FUNCTION
h: U -> {0,...,M-1}, which maps each key to a location in the table.
Given an element x, we store it in location A[h(x)].
Note: If U were small (like 3-character strings) then we could just store
x in A[x] as in bucket sort. But U is big; e.g. 50-char strings, or 32 bit
numbers. That's why we need the hash function.
- The hashing may produce collisions---two keys x and y may map to the
same location: h(x) = h(y). There are several ways to deal with such
collisions. The simplest one is called "separate chaining": make each entry
in A a linked list, which stores all the keys that map to that location.
DESIRABLE PROPERTIES:
- The hash table size should be small, e.g. O(N).
- The hash function should be fast and easy to compute.
A SIMPLE IDEA:
Suppose N = 100; that is, we expect to deal with at most 100 keys. We
create a table of size M = 100, and use the MOD function:
i.e. h(x) = x mod 100.
It has both the desired properties: small size and fast hashing.
An example: 91 -> 91, 2048 -> 48, 65000 -> 0, 547356 -> 56, 329 -> 29, etc...
What can go wrong? What if all keys have the last two digits the same?
e.g. 00! Then, all keys map to the same location, and the hash
table becomes an unordered linked list, with O(N) cost for lookup!!!
This motivates the 3rd desirable property:
- The hash function should spread keys out in the table. Ideally all buckets
(lists) should be about the same size.
Why not RANDOM NUMBERS: One idea would be to use randomization: map each key to
a random location in the array. This will help in evening out the load,
but "how do we do a FIND?" We would need to know, for each key, what
random number was used to map it! But it turns out that pseudo-randomness
does help. Before we come to that, let's first consider some BAD NEWS:
The worst-case behavior is bad!
CLAIM: For ANY hash function h, if |U| >= N*M, there exists a set S of size N
where all keys hash to the SAME location.
PROOF: By the pigeon-hole principle. Map all keys of U using h. At least one
table location must receive at least N keys; use those keys as the set S.
This is partly why hashing seems so mysterious -- how can you claim hashing
is good if for any hash function you can come up with ways of foiling it?
One answer is that there are a lot of simple hash functions that work well
in practice.
There is a formal concept of Universal Hashing and Perfect Hashing, which
we will briefly touch on later. For now, let's claim that hash functions of
the type:
h(x) = (ax + b) mod p, where p is a prime number larger than N
work very well in practice, and there is good theoretical justification
for that.
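A quick sketch of such a function in Python (the constants a, b, p, and the keys below are arbitrary illustrative choices, not prescribed by the notes; we reduce mod M at the end so the result fits a table of size M):

```python
def make_hash(a, b, p, M):
    """h(x) = ((a*x + b) mod p) mod M, with p a prime larger than N."""
    return lambda x: ((a * x + b) % p) % M

# Illustrative parameters: p = 101 is prime, table of size M = 10.
h = make_hash(a=34, b=7, p=101, M=10)
keys = [91, 2048, 65000, 547356, 329]
slots = [h(x) for x in keys]          # each key lands somewhere in 0..9
```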
Hashing with Separate Chaining:
==============================
Each entry of the hash table has a linked list, which stores all keys mapped
to that position.
We use hash functions of the form: h(x) = x mod p for simplicity.
Given a key x, search the linked list at position A[h(x)].
When inserting a new element x, add it to the head of the list at A[h(x)].
Delete is done by first searching and removing the element from the list.
Example: x mod 11
0:
1: 23 -> 1 -> 56
2: 24
3: 36 -> 14
4:
5: 16
6: 17
7: 7 -> 29
8:
9: 31 -> 20 -> 42
10:
Insert 53. 53 mod 11 -> 9
The key 53 is added at location 9, changing that list to:
9: 53 -> 31 -> 20 -> 42
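The chaining operations above can be sketched in Python (Python lists stand in for the linked lists, with index 0 as the head; the insertion order below is chosen so the table matches the example):

```python
class ChainedHashTable:
    """Separate chaining with h(x) = x mod p."""
    def __init__(self, p=11):
        self.p = p
        self.table = [[] for _ in range(p)]

    def insert(self, x):
        self.table[x % self.p].insert(0, x)   # add at the head of the list

    def find(self, x):
        return x in self.table[x % self.p]    # search only the list at h(x)

    def delete(self, x):
        bucket = self.table[x % self.p]
        if x in bucket:
            bucket.remove(x)

t = ChainedHashTable(p=11)
for k in [56, 1, 23, 24, 14, 36, 16, 17, 29, 7, 42, 20, 31]:
    t.insert(k)                               # reproduces the table above
t.insert(53)                                  # 53 mod 11 = 9
```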
ANALYSIS:
Worst-case: All keys hash to the same bucket.
Insert takes O(1), but
delete and find take O(N).
Average Case: If keys are uniformly distributed, then all buckets are
about the same size. Each linked list has expected size O(N/M).
Thus, insert is still O(1), but delete and find become O(1 + N/M).
N/M is called the load factor.
--------------------------------------
OPEN ADDRESSING:
================
This is another method for dealing with collisions. Rather than allocate
new memory dynamically, here we use the table itself. Ideally, we would have
a table at least a constant factor bigger than N, so we expect many empty cells.
When the location A[h(x)] is occupied, we use some fixed strategy to look for the
NEXT available open cell.
The simplest strategy is Linear Probing: sequentially try the next position until
an open cell is found.
Example: h(x) = x mod 11
(In this example, keys are inserted in increasing order.)
0: 42
1: 1
2: 24
3: 14
4:
5: 16
6: 28
7: 7
8:
9: 31
10: 9
Linear Probing: Insert 12 12 mod 11 = 1
Because slots 1, 2, 3 are occupied, the next available slot is 4, so put 12
in A[4].
0: 42
1: 1
2: 24
3: 14
4: 12
5: 16
6: 28
7: 7
8:
9: 31
10: 9
Linear Probing: Search 15 15 mod 11 = 4
We start at A[4], and continue sequentially until we either find 15, or reach
an OPEN SLOT. In this case, we reach the open slot A[8], and report NOT FOUND.
CODE for Search with Linear Probing:
hSearch (k) {
    hashVal = k % Tablesize;        // start at the home position h(k)
    j = hashVal;
    do {
        if (empty[j] || A[j] == k) return j;
        j = (j+1) % Tablesize;      // try the next slot, wrapping around
    } while (j != hashVal);
    return j;                       // table is full and k was not found
}

find (k, e) {
    b = hSearch (k);
    if (empty[b] || A[b] != k) return false;
    e = A[b];                       // copy the stored element out
    return true;
}
DELETION in Hashing with Linear Probing.
Because empty slots are used also to terminate search, we cannot simply delete
a key. It may invalidate other searches.
Instead, we simply mark the key as DELETED, without removing it.
We modify the search so that it skips past DELETED keys.
Advantage: Easy and Correct.
Disadvantage: Table can become full with dead items.
All these operations take O(1) plus the length of the probe sequence.
Deletion with Linear Probing (Lazy Deletion)
Delete 9 9 mod 11 = 9
0: 42
1: 1
2: 24
3: 14
4: 12
5: 16
6: 28
7: 7
8:
9: 31
10: 9 <-- Found
Change it to:
10: D <-- Deleted; otherwise search for 42 can fail
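A minimal Python sketch of linear probing with lazy deletion (sentinel objects mark the EMPTY and DELETED slot states; the keys reproduce the example above):

```python
EMPTY, DELETED = object(), object()   # sentinel markers for the two slot states

class LinearProbingTable:
    """Open addressing with linear probing and lazy deletion; h(x) = x mod M."""
    def __init__(self, M=11):
        self.M = M
        self.A = [EMPTY] * M

    def _probe(self, k):
        start = k % self.M
        for i in range(self.M):           # at most one full wrap around the table
            yield (start + i) % self.M

    def insert(self, k):
        for j in self._probe(k):
            if self.A[j] is EMPTY or self.A[j] is DELETED:
                self.A[j] = k
                return
        raise RuntimeError("table full")

    def find(self, k):
        for j in self._probe(k):
            if self.A[j] is EMPTY:        # a truly empty slot terminates the search
                return False
            if self.A[j] == k:            # DELETED markers are skipped over
                return True
        return False

    def delete(self, k):
        for j in self._probe(k):
            if self.A[j] is EMPTY:
                return                    # not present
            if self.A[j] == k:
                self.A[j] = DELETED       # lazy deletion: mark, don't empty
                return

t = LinearProbingTable(M=11)
for k in [1, 7, 9, 14, 16, 24, 28, 31, 42]:   # inserted in increasing order
    t.insert(k)
t.insert(12)     # slots 1, 2, 3 are occupied, so 12 lands in slot 4
t.delete(9)      # slot 10 becomes DELETED, so the search for 42 still works
```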
--------------------------------------------------------
Review:
Hash function. h: U -> {0, 1, ..., M-1}
Collision resolution:
Separate chaining
Open addressing
linear probing.
lazy deletion
On average, if M = O(N), avg search is O(1).
-------------------------------------------------------
Eager deletion is possible but expensive: we need to move data around
to make sure the deletion hole does not corrupt valid searches.
QUADRATIC PROBING: Open addressing with linear probing can have bad
search behavior due to clustered data: a few big blocks of the
table get filled. Quadratic probing attempts to circumvent this
by jumping non-linearly.
Check h(x). If collision, search h(x) + i^2
Start at h(x).
If key not there, search h(x) + 1
If key not there, search h(x) + 4
If key not there, search h(x) + 9
If key not there, search h(x) + 16, etc.
Example: Insert 12 12 mod 11 = 1
0: 42
1: 1 <- h(x)
2: 24 <- h(x) + 1
3: 14
4: <- h(x) + 25 INSERT 12 here
5: 16 <- h(x) + 4
6: 28 <- h(x) + 16
7: 7
8:
9: 31
10: 9 <- h(x) + 9
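The probe sequence above can be sketched as follows (None marks an empty cell; the starting table matches the example):

```python
def quadratic_probe_insert(A, k):
    """Insert k by probing h(k) + i^2 (mod len(A)) for i = 0, 1, 2, ..."""
    M = len(A)
    for i in range(M):
        j = (k % M + i * i) % M
        if A[j] is None:
            A[j] = k
            return j
    raise RuntimeError("no open cell found")

# The table from the example above (None marks an empty cell):
A = [42, 1, 24, 14, None, 16, 28, 7, None, 31, 9]
slot = quadratic_probe_insert(A, 12)    # probes offsets 0, 1, 4, 9, 16, 25
```

A classic fact (not proved here): if the table size M is prime, the first ceil(M/2) probes hit distinct slots, so a less-than-half-full table always has room; for general M, quadratic probing can miss open cells.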
--------------------------------------
DOUBLE HASHING.
==============
Another idea is to use a second hash function to deal with collisions.
Suppose R is the largest prime smaller than the table size.
Then, consider the hash function
h'(x) = R - (x mod R)
* Why not simply h'(x) = (x mod R) ? *
INSERTING 12 (table size 11, so R = 7):
h'(12) = 7 - (12 mod 7) = 7 - 5 = 2
Check h(x) = A[2].
If collision, check h(x) + h'(x)
If collision, check h(x) + 2 * h'(x)
If collision, check h(x) + 3 * h'(x)
i.e. check h(x) + i * h'(x).
Example:
0: 42 <- h(x) + 10
1: 1 <- h(x)
2: 24 <- h(x) + 12
3: 14 <- h(x) + 2
4: <- h(x) + 14 INSERT 12 here.
5: 16 <- h(x) + 4
6: 28
7: 7 <- h(x) + 6
8:
9: 31 <- h(x) + 8
10: 9
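The double-hashing probe sequence can be sketched as:

```python
def double_hash_probes(x, M, R):
    """Probe sequence h(x) + i*h'(x) (mod M), where h(x) = x mod M
    and h'(x) = R - (x mod R)."""
    start, step = x % M, R - (x % R)
    return [(start + i * step) % M for i in range(M)]

# Table size 11; R = 7 is the largest prime below 11.
probes = double_hash_probes(12, M=11, R=7)   # h(12) = 1, step h'(12) = 2
```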
------------------------------------------------------
Practice of hashing: REHASHING.
o. If table gets too full, operations take too long.
o. Build another table, TWICE as large. (Use a prime number as the size.)
o. Common advice is to rebuild when the table becomes 70% full.
------------------------------------------------------
A Little THEORY OF HASHING:
Recall the bad news:
CLAIM: For any hash function h, if |U| >= (N-1)*M + 1, there exists
a set S of size N where all keys hash to the same location.
PROOF: By the pigeon-hole principle.
An idea: Use randomization, like in randomized quicksort. We will use
some random numbers in our construction of h; however, h itself will be
a deterministic function. We will show that for *any* set S, if we pick
h in this random way, things will be good in expectation.
This is the idea of UNIVERSAL hashing. Later we will use universal hashing
to get PERFECT HASHING!
UNIVERSAL HASHING
=================
A SET of hash functions H is *universal* if, for h chosen uniformly at
random from H, we get Pr [ h(x) = h(y) ] <= 1/M, for all x != y in U.
THEOREM: If H is universal, then for any set S in U, for any x in S, the
*expected* number of collisions x has [for a random h in H] is <= (N-1)/M
PROOF: Each y != x has <= 1/M chance of colliding with x. Adding up over all
y in S, you get expected number <= (N-1)/M. Or, more formally:
- Let C(x,y) = 1 if x and y collide; and 0 otherwise.
Then, E[C(x,y)] = Pr(x and y collide) <= 1/M
- C(x) = total # collisions for x = sum_y C(x,y).
- E[C(x)] = sum_y E[ C(x,y) ] <= (N-1)/M
COROLLARY: For any *sequence* of L lookups, the expected *total* time
for the lookups if we use a random h in H is O(L(1 + N/M)).
That is, constant time per operation if M=N.
PROOF: By the linearity of expectation, the expected total time is just the
sum of expected times for each lookup. Lookup time is proportional
to length of list = 1 + # of collisions.
QUESTION: Can we actually construct a universal hash family? If not, this
is all pretty vacuous. Answer is YES.
We are not actually going to explicitly construct all of H and then pick one
h in H at random. Instead, we'll have a randomized procedure for constructing
h. In fact, it probably would have been more intuitive to define H as a
*probability distribution* over hash functions, and say H is a universal
probability distrib over hash function if [rest of defn here]. But
``universal sets'' is the way it's done in the literature. Analogy is
randomized quicksort: can think of it as picking an ordering on pivots at
random from the set of all n! orderings, but instead we find it more natural
to just say: each time you need to pick a pivot, pick one at random.
HOW TO CONSTRUCT: Matrix method
---------------------------------
Suppose the keys are u bits long, and suppose the hash table size is a power
of 2; that is, M = 2^b, so an index is b bits long.
We choose hash function h to be a random 0/1 matrix of dimensions b-by-u.
The hash function h(x) multiplies this matrix by the key x (viewed as a bit
vector), with addition modulo 2.
For instance, consider a table of size 8 (3-bit indices), and 4-bit keys.
      h        x     h(x)
  ---------    -     ---
  [1 0 0 0]    1      1
  [0 1 1 1]    0  =   1
  [1 1 1 0]    1      0
               0
CLAIM: For x != y, Pr[h(x) = h(y)] = 1/M = 1/2^b.
PROOF: First of all, what does it mean to multiply h by x? We can think of
it as adding some of the columns of h (mod 2) where the 1 bits in x
tell you which ones to add. (we added 1st and 3rd columns of h above)
Now, take arbitrary x and y, x != y. They must differ in some bit position,
so say they differ in ith coordinate; for concreteness let us assume
that x_i = 0 while y_i = 1. Imagine we first choose all of h except
the ith column. Over remaining choices of ith column, h(x) is fixed.
But, each of the 2^b different settings of the ith column gives a
different value of h(y) [every time we flip a bit in that column, we
flip the corresponding bit in h(y).] So there is exactly a 1/2^b
chance that h(x) = h(y).
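A sketch of the matrix method in Python (the column-as-integer representation and the LSB-first bit order are implementation choices of this sketch, not part of the method):

```python
import random

def random_matrix_hash(b, u, seed=None):
    """Pick a random b-by-u 0/1 matrix h; h(x) is the matrix-vector product mod 2.
    Each column is stored as a b-bit integer, so 'adding columns mod 2' is XOR."""
    rng = random.Random(seed)
    columns = [rng.getrandbits(b) for _ in range(u)]
    def h(x):
        out = 0
        for i, col in enumerate(columns):
            if (x >> i) & 1:        # the 1 bits of x select which columns to add
                out ^= col
        return out
    return h

h = random_matrix_hash(b=3, u=4, seed=1)    # table of size 8, 4-bit keys
values = [h(x) for x in range(16)]
```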
There are other methods based on multiplication modulo primes too.
=====================================================================
Question: If we fix the set S, can we find hash function h such that *all*
lookups are constant-time? Yes. This leads to...
PERFECT HASHING
===============
A Hash function is "perfect" for S if all lookups take O(1) time.
METHOD #1
---------
Suppose we are willing to have a table whose size is quadratic in the size
N of our dictionary S. Then, here is an easy method. Let H be universal and
set M = N^2. Pick a random hash function h from H and hash everything in S.
CLAIM: Pr [no collisions] >= 1/2.
[So, we just try h, and if we got any collisions, we try a new h]
PROOF:
- How many pairs (x,y) in S are there? {N \choose 2}
- For each pair, the chance they collide is <= 1/M by defn of universal.
- So, Pr [exists a collision] <= {N \choose 2}/M < 1/2.
This is like the "other side" to the "birthday paradox". If the number of
days is a lot *more* than (# people)^2, then there is a reasonable chance
that *no* pair has the same birthday.
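Method #1 can be sketched in Python, using the ((a*x + b) mod p) mod M functions from earlier in the notes as the universal family (the prime p here assumes keys fit in 31 bits; the key set is illustrative):

```python
import random

def build_quadratic_perfect(S, seed=0):
    """Method #1 sketch: set M = N^2, pick a random hash from the family
    ((a*x + b) mod p) mod M, and retry until S has no collisions.
    Each try succeeds with probability >= 1/2."""
    rng = random.Random(seed)
    N, M = len(S), len(S) ** 2
    p = 2**31 - 1                    # a prime larger than the keys (assumed < 2^31)
    while True:
        a, b = rng.randrange(1, p), rng.randrange(p)
        h = lambda x, a=a, b=b: ((a * x + b) % p) % M
        if len({h(x) for x in S}) == N:      # no collisions: h is perfect for S
            return h, M

S = [91, 2048, 65000, 547356, 329]
h, M = build_quadratic_perfect(S)
```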
BUT, what if we want to use just O(N) space?
METHOD #2:
---------
HISTORY: This was a BIG open problem for a long time -- posed as the question
"Should Tables be Sorted?". With linear space, we can get O(log N)
search using sorted tables. Is it possible to achieve O(1) worst-case
search time? There were many increasingly complicated schemes, coming
increasingly close to O(1) time and full generality. It was finally solved
using the nice idea of universal hash functions in a 2-level scheme.
PROPOSAL: Hash into table of size N. Will get some collisions. Then, for
each bin, rehash it using Method #1, squaring the size of the bin to
get zero collisions.
E.g., If 1st hash function maps {a,b,c,d,e,f} to [--,{a}, {b,c}, {d,e,f}],
then final result would look something like: [--,h_2, h_3, h_4]
h_2 would just be the function h_2(x) = 0. h_3(x) would be a function
to table of size 4, h_4 would be function to table of size 9.
The Point is: using Method #1, we can find h_i with no collisions by picking
from universal H and trying it -- if it doesn't work we try again
(each time, Pr [success] >= 1/2).
By *design* this is O(1) time for lookups, since just compute two hash
functions. Question is: How much space does it use?
Space used is O(N) for the 1st table (assume it takes a constant amount of
space to store each hash function), plus O(sum of |B_i|^2) for the other
tables (B_i is the ith bucket). So, all we need to do is prove:
THEOREM: If initial h is chosen from the universal set H, then
Pr[sum of |B_i|^2 > 4*N] < 1/2.
Proof: We'll prove this by showing that E[sum_i |B_i|^2] < 2*N
(Why does this give us the result we need? <-- "Markov's inequality")
The neat trick: we count this quantity by counting the number of ordered
pairs that collide, including each element colliding with itself.
E.g, if bucket has {d,e,f}, then d collides with each of d,e,f;
e collides with each of d,e,f; f collides with each of d,e,f; so get 9.
Now,
sum_i |B_i|^2 = # of ordered pairs that collide (allowing x=y)
= sum_x sum_y C(x,y) [C(x,y) = 1 if in same bucket, else 0]
So,
E[sum_i |B_i|^2] = sum_x sum_y Pr [x and y are in same bucket]
<= N + N(N-1) * (1/M), where the "1/M" comes from
definition of universal.
< 2*N. (since we set M = N)
So, final "hash function" is
1. compute i = h(x)
2. Now compute h_i(x) and use to lookup in table T_i.
Total space used is O(sum of |B_i|^2) = O(N).
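The whole 2-level scheme can be sketched in Python (again using the ((a*x + b) mod p) mod M family as the universal set; the prime p and the key set are illustrative assumptions):

```python
import random

def fks_build(S, seed=0):
    """Two-level perfect hashing sketch: hash S into N buckets with a first-level
    h, retried until sum |B_i|^2 <= 4N; then give each bucket its own
    collision-free table of size |B_i|^2, as in Method #1."""
    rng = random.Random(seed)
    p = 2**31 - 1                        # a prime larger than the keys (assumed)

    def pick_hash(M):
        a, b = rng.randrange(1, p), rng.randrange(p)
        return lambda x, a=a, b=b: ((a * x + b) % p) % M

    N = len(S)
    while True:                          # level 1: retry until space bound holds
        h1 = pick_hash(N)
        buckets = [[] for _ in range(N)]
        for x in S:
            buckets[h1(x)].append(x)
        if sum(len(B) ** 2 for B in buckets) <= 4 * N:
            break

    level2 = []
    for B in buckets:                    # level 2: one perfect table per bucket
        size = max(len(B) ** 2, 1)
        while True:
            h2 = pick_hash(size)
            if len({h2(x) for x in B}) == len(B):   # collision-free on B
                break
        table = [None] * size
        for x in B:
            table[h2(x)] = x
        level2.append((h2, table))

    def lookup(x):                       # exactly two hash evaluations
        h2, table = level2[h1(x)]
        return table[h2(x)] == x
    return lookup

lookup = fks_build([91, 2048, 65000, 547356, 329, 17])
```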
============================================================================
Other uses of hashing
=====================
Suppose we have a long sequence of items and we want to see how many
*different* items are in the list. What's a good way of doing that?
One way: Create hash table; do lookup and then insert if not in table; count
number of inserts.
Now what if the list is really huge, so we don't have space to store them all,
but we are OK with just an approximate answer. E.g., imagine we're a router
and watching a lot of packets go by, and we want to see (roughly) how many
different source IP addresses there are.
NEAT IDEA: Say we have a hash function h that behaves like a random function,
and let's think of h(x) as a real number between 0 and 1. One thing we can
do is just keep track of the *minimum* hash value produced so far (so we
won't have a table at all). E.g., if keys are 3,10,3,3,12,10,12 and
h(3)=0.4, h(10)=0.2, h(12)=0.7, then we get 0.2.
Point is: if we pick N random numbers in [0,1], the expected value of
the minimum is 1/(N+1). And, there's a good chance it is fairly close
(can improve the estimate by running several hash functions and taking
the median of the mins). So, from the observed MIN value m, we can
estimate the number of distinct elements as roughly 1/m - 1.
Question: Why use a hash function rather than just picking a random number
each time? That is because we care about the number of *different*
items, not just the number of items (that problem is a lot easier:
just keep a counter...)
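A rough Python sketch of this estimator (the (a*x + b) mod p hashes stand in for "random-looking" functions, keys are assumed to be integers, and the bias correction 1/m - 1 comes from E[min of N uniforms] = 1/(N+1)):

```python
import random

def estimate_distinct(stream, num_hashes=50, seed=0):
    """Estimate the number of distinct items by tracking, for each of several
    hash functions, the MINIMUM hash value seen (scaled into [0,1)).
    Each min m suggests roughly 1/m - 1 distinct items; the median over
    hash functions makes the estimate robust."""
    rng = random.Random(seed)
    P = 2**31 - 1
    params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(num_hashes)]
    mins = [1.0] * num_hashes
    for x in stream:
        for i, (a, b) in enumerate(params):
            v = ((a * x + b) % P) / P      # pseudo-random value in [0,1)
            if v < mins[i]:
                mins[i] = v
    median = sorted(mins)[num_hashes // 2]
    return round(1 / median - 1)

est = estimate_distinct([3, 10, 3, 3, 12, 10, 12] * 100)   # 3 distinct items
```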
=======================================================
Topics:
* search trees vs hashing,
* augmenting search tree data structures (if didn't do it last time)
* another way of getting universal hashing [if time]
==========================================================================
Search trees versus hashing:
---------------------------
Hashing is very convenient if you just need to do inserts and lookups, but
what are some operations that are better suited for search trees?
- lookup-by-prefix. E.g., type in first few letters of someone's
name and have it output all names in the database that begin with
those letters.
- find-in-range[key1, key2] (output all keys between key1 and key2)
- a version of lookup(x) that returns the closest key to x when x is
not in the database.
Now, what if we want to do the following operations:
- output the kth smallest element (let's assume all keys are distinct)
- do a version of lookup(x) that tells us the rank of x (how many
keys are smaller than x).
If we had a sorted array, these would be easy. To output the kth smallest
element, we just output A[k]. To get the rank of x, we just do binary search
to find it, and then output which location it's in.
So this brings us to the question: how could we do this with search trees?
Auxiliary information in search trees:
=====================================
By adding extra information to the nodes of a binary search tree, it's
often possible to dramatically expand what can be done. However, the data
you add has to have two properties:
(1) it's sufficient to solve the problem you want to solve
(2) the values of the data in a node can be maintained efficiently.
So, here would be a *bad* way of solving the problem: have each node store
the rank of its key. Why is this bad? Because if you insert a new key
(that, say, is smaller than everything else) you may have to update
*everyone's* rank. So we don't have property (2).
What's a better way? One thing we can do is store the size of the subtree
rooted at each node in that node.
- how do we use this to solve the problems?
- how do we maintain these when we do an insertion or tree rotation?
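As a sketch of the first question, here is a plain (unbalanced, no rotations) BST with subtree sizes, supporting kth-smallest and rank; maintaining sizes through rotations is left as the note above asks:

```python
class Node:
    """BST node augmented with the size of the subtree rooted there."""
    def __init__(self, key):
        self.key, self.left, self.right, self.size = key, None, None, 1

def insert(root, key):
    if root is None:
        return Node(key)
    root.size += 1                       # maintain sizes on the way down
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def kth_smallest(root, k):
    """The kth smallest key (1-indexed): compare k to the left subtree's size."""
    left = root.left.size if root.left else 0
    if k == left + 1:
        return root.key
    if k <= left:
        return kth_smallest(root.left, k)
    return kth_smallest(root.right, k - left - 1)

def rank(root, x):
    """Number of keys strictly smaller than x."""
    if root is None:
        return 0
    if x <= root.key:
        return rank(root.left, x)
    left = root.left.size if root.left else 0
    return left + 1 + rank(root.right, x)

root = None
for key in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, key)
```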
===================================================
Here's another method of doing universal hashing that requires less
computation than the one in lecture, and is closer to what you likely coded
up in 15-211.
The hashing scheme:
Let's assume keys in {0,1,...,U-1}. Hashing down to {0,1,...,M-1}
Pick prime p >= U. (Or, think of U as prime). Define
h_{a,b}(x) = ( (a*x + b) mod p ) mod M.
H: Pick a at random from 1,...,p-1.
Pick b at random from 0,...,p-1.
Claim: H is a 2-universal family.
Proof idea: the inner part will spread things out nicely in the range 0..p-1.
In particular, given x!= y, (1) x is equally likely to go anywhere, and
(2) given where x goes, y is equally likely to go to each of the p-1 places
x didn't go (there won't be any collisions at the mod p level). Then, this
allows us to show things will be nice when we take the results mod M.
Actual Proof: Fixing x!=y and two locations r and s, let's look at:
Pr[[(ax + b) = r (mod p)] AND [(ay + b) = s (mod p)]].
-> this means that ax-ay = r-s (mod p), so a = (r-s)/(x-y) mod p, which
has exactly one solution (mod p) since Z_p* is a field.
[We needed p >= U to ensure that x!=y mod p]
-> If r=s, then this means we need a=0 which wasn't allowed. So Pr=0.
-> If r!=s, then there is a 1/(p-1) chance that a has the right
value. Given this value of a, we need b = r-ax (mod p), and there is
a 1/p chance b gets this value, so the overall probability is
1/[p(p-1)].
Now, the probability x and y collide is equal to 1/[p(p-1)] times the
number of pairs r != s in {0,...,p-1} such that r = s (mod M).
We have p choices for r, and then at most ceiling(p/M)-1 <= (p-1)/M
choices for s ("-1" since we disallow s=r). The product is at most p(p-1)/M.
Putting this all together,
Pr[(a*x + b mod p) mod M = (a*y + b mod p) mod M]
<= p(p-1)/M * [1/(p(p-1))] = 1/M.
QED
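For tiny parameters the whole family is small enough to check the claim by brute force in Python (p = 13, M = 3, and the pair (2, 7) are arbitrary illustrative choices):

```python
from itertools import product

# Count, over ALL (a, b) in the family, how often a fixed pair x != y
# collides after the final mod M.
p, M = 13, 3
x, y = 2, 7

def h(a, b, z):
    return ((a * z + b) % p) % M

collisions = sum(h(a, b, x) == h(a, b, y)
                 for a, b in product(range(1, p), range(p)))
family_size = (p - 1) * p
fraction = collisions / family_size     # the claim: this is at most 1/M
```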