Hash Tables

COS 265 - Data Structures & Algorithms

“
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered.

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%.

—Donald E. Knuth
”

symbol table implementations: summary

implementation	search*	insert*	delete*	search\(^\dagger\)	insert\(^\dagger\)	delete\(^\dagger\)	ordered	ops on keys
seq search	\(N\)	\(N\)	\(N\)	\(N\)	\(N\)	\(N\)		`equals()`
binary search	\(\log N\)	\(N\)	\(N\)	\(\log N\)	\(N\)	\(N\)	✓	`compareTo()`
BST	\(N\)	\(N\)	\(N\)	\(\log N\)	\(\log N\)	\(\sqrtN\)	✓	`compareTo()`
2-3 tree	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	✓	`compareTo()`
LLRB	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	✓	`compareTo()`

\(^*\) guaranteed, \(^\dagger\) typical

Q. Can we do better?

Yes, but with different access to the data.

hashing: basic plan

Save items in a key-indexed table (index is a function of the key)

Hash function: method for computing array index from key

Issues

Computing the hash function
Equality test: Method for checking whether two keys are equal
Collision resolution: Algorithm and data structure to handle two keys that hash to the same array index.

hashing: basic plan

Classic space-time tradeoff:

No space limitation: trivial hash function with key as index
No time limitation: trivial collision resolution with sequential search
Space and time limitations: hashing (the real world)

Hash Tables

Hash Functions

Computing the hash function

Idealistic goal: Scramble the keys uniformly to produce a table index

Efficiently computable
Each table index equally likely for each key
- (thoroughly researched problem, still problematic in practical applications)

Ex: social security numbers

Bad: first three digits (573=California, 574=Alaska)
Better: last three digits

Practical challenge: need different approach for each key type

hash tables: quiz 1

[ live view, card view ]

Which is the last digit of your day of birth?

Example: The 4 in 2002.03.14

0, 1, or 2
3 or 4
5, 6, or 7
8 or 9

hash tables: quiz 2

[ live view, card view ]

Which is the last digit of your month of birth?

Example: The 3 in 2002.03.14

0, 1, or 2
3 or 4
5, 6, or 7
8 or 9

hash tables: quiz 3

[ live view, card view ]

Which is the last digit of your year of birth?

Example: The 2 in 2002.03.14

0, 1, or 2
3 or 4
5, 6, or 7
8 or 9

java's hash code conventions

All Java classes inherit a method hashCode(), returns a 32-bit int

Requirement: if x.equals(y), then x.hashCode() == y.hashCode()
Highly desirable: if !x.equals(y), then x.hashCode() != y.hashCode()
Default implementation: Memory address of x
Legal (but poor) implementation: Always return 17
Customized implementations: Integer, Double, String, File, URL, Date, ...
User-defined types: Users are on their own

implementing hash code: Integers, Booleans

Java library implementations

public final class Integer {
    private final int value;
    // ...
    public int hashCode() { return value; }
}

public final class Boolean {
    private final boolean value;
    // ...
    public int hashCode() {
        if(value) return 1231;
        else      return 1237;
    }
}

implementing hash code: Doubles

public final class Double {
    private final double value;
    // ...
    public int hashCode() {
        long bits = doubleToLongBits(value);
        return (int) (bits ^ (bits >>> 32));
    }
}

convert to IEEE 64-bit representation
xor most significant 32-bits with least significant 32-bits

Warning

Warning: \(-0.0\) and \(+0.0\) have different hash codes!

00000000000000000000000000000000b = \(0.0\)
10000000000000000000000000000000b = \(-0.0\)
32b float

[ wikipedia ]

implementing hash code: Strings

Treat string of length \(L\) as \(L\)-digit, base-31 number

\[\begin{array}{rcl} h & = & s[0]\cdot 31^{L-1} + \ldots + s[L-3]\cdot 31^{2} + \\ & & + s[L-2] \cdot 31^1 + s[L-1] \cdot 31^0 \end{array}\]

public final class String {
    private final char[] s;
    // ...
    public int hashCode() {
        int hash = 0;
        for(int i=0; i<length(); i++)
            hash = s[i] + (31 * hash);
        return hash;
    }
}

char	Unicode
...	...
`a`	97
`b`	98
`c`	99
...	...

Horner's method: only \(L\) multiples/adds to hash string of length \(L\)

implementing hash code: Strings

String s = "call";
s.hashCode();      // 3045982 = 99*31^3 + 97*31^2 +
                   //           + 108*31^1 + 108*31^0
                   //         = 108 + 31*(108 + 31*(97 + 31*(99)))

implementing hash code: Strings

Performance optimization

Cache the hash value in an instance variable
Return cached value

public final class String {
    private int hash = 0;               // cache of hash code
    private final char[] s;
    // ...
    public int hashCode() {
        int h = hash, i;
        if(h != 0) return h;            // return cached value
        for(i = 0; i < length(); i++)
            h = s[i] + (31 * h);
        hash = h;                       // store cache of hash code
        return hash;
    }
}

Q: What if hashCode() of string is \(0\)? (hashCode() of pollinating sandboxes is \(0\), link)

implementing hash code: user-defined types

public final class Transaction implements Comparable<Transaction>
{
    private final String who;
    private final Date   when;
    private final double amount;

    public Transaction(String who, Date when, double amount)
    { /*...*/ }

    public boolean equals(Object y) { /* ... */ }

    public int hashCode() {
        int hash = 17;     // non-zero constant
        hash = 31*hash + who.hashCode();
        hash = 31*hash + when.hashCode();
        hash = 31*hash + ((Double) amount).hashCode();
        return hash;
    }
}

Why 31? Typically any small prime will do.
For reference types, use hashCode()
For primitive types, use hashCode() of wrapper type

Hash code design

"Standard" recipe for user-defined types

Combine each significant field using the \(31x + y\) rule
If field is a primitive type, use wrapper type hashCode()
If field is null, use 0.
If field is reference type, use hashCode() (applies rule recursively)
If field is an array, apply to each entry (or use Arrays.deepHashCode())

Hash code design

In practice: Previous recipe works reasonably well; used in Java libraries

In theory: Keys are bitstring; "universal"\(^*\) family of hash functions exist

Basic rule: Need to use the whole key to compute hash code; consult an expert for state-of-the-art hash codes

\(^*\)"universal": awkward in Java since only one (deterministic) hashCode()

Hash tables: quiz 4

[ live view, card view ]

Which of the following is an effective way to map a hashable key to an integer between \(0\) and \(M-1\)?

private int hash(Key key)
{ return key.hashCode() % M; }

private int hash(Key key)
{ return Math.abs(key.hashCode()) % M; }

C. Both A and B

D. Neither A nor B

Modular hashing

Hash code: An int between \(-2^{31}\) and \(2^{31}-1\)

Hash function: An int between \(0\) and \(M-1\) (for use as array index, where \(M\) is typically a prime or power of 2)

Bug (link):

private int hash(Key key)
{ return key.hashCode() % M; }
// op % returns remainder in Java, not modulus

1-in-a-billion bug (link):

private int hash(Key key)
{ return Math.abs(key.hashCode()) % M; }
// hashCode() of "polygenelubricants" is -2^31

Correct:

private int hash(Key key)
{ return (key.hashCode() & 0x7ffffff) % M; }

Hash Tables

Collisions

uniform hashing assumption

Uniform hashing assumption: Each key is equally likely to hash to an integer between \(0\) and \(M-1\)

Bins and balls: Throw balls uniformly at random into \(M\) bins

Collisions: two distinct keys hashing to same index

Collisions play out in at least three ways:

uniformity: birthday problem

Birthday Problem: the probability that, in a set of \(n\) randomly chosen people, some pair of them will have the same birthday. By the pigeonhole principle, the probability reaches 100% when the number of people reaches 367 (since there are only 366 possible birthdays, including February 29). However, 99.9% probability is reached with just 70 people, and 50% probability with 23 people.

Put another way, if we have \(M\) bins and we start tossing balls into the bins, we can expect two balls in the same bin after \(\sim \sqrt{\pi M/2}\) tosses

[ Wikipedia ]

uniformity: coupon collector

Coupon Collector: describes the "collect all coupons and win" contests. It asks the following question: Suppose that there is an urn of \(n\) different coupons, from which coupons are being collected, equally likely, with replacement. What is the probability that more than \(t\) sample trials are needed to collect all n coupons? An alternative statement is: Given \(n\) coupons, how many coupons do you expect you need to draw with replacement before having drawn each coupon at least once?

Put another way, if we have \(M\) bins and we start tossing balls into the bins, we can expect every bin has \(\geq 1\) ball after \(\sim M \ln M\) tosses

[ Wikipedia ]

uniformity: load balancing

Load Balancing: improves the distribution of workloads across multiple computing resources, such as computers, a computer cluster, network links, central processing units, or disk drives. Load balancing aims to optimize resource use, maximize throughput, minimize response time, and avoid overload of any single resource.

Put another way, if we have \(M\) bins and we toss \(M\) balls into the bins, expect most loaded bin has \(\sim \ln M / \ln \ln M\) balls

[ Wikipedia ]

uniform hashing assumption

Uniform hashing assumption: Each key is equally likely to hash to an integer between \(0\) and \(M-1\)

Bins and balls: Throw balls uniformly at random into \(M\) bins

Java's String data uniformly distribute the keys of Tale of Two Cities

collisions

Collisions: two distinct keys hashing to same index

Birthday problem: can't avoid collisions unless you have a ridiculous (quadratic) amount of memory
Coupon collector: not too much wasted space
Load balancing: no index gets too many collisions

Challenge: deal with collisions efficiently

Hash Tables

Separate Chaining

separate-chaining symbol table

Use an array of \(M\) linked lists, where \(M<N\)

Hash: map key to integer \(i\) between \(0\) and \(M-1\)
Search: sequential search in \(i\)th chain
Insert: put at front of \(i\)th chain (if not already a chain)

[ [H.P.Luhn, IBM 1953] ]

hash(L) = 3
put(L, 11)

hash(L) = 3
put(L, 11)

hash(E) = 1
get(E)

separate-chaining st: java implementation

public class SeparateChainingHashST<Key, Value>
{
    private int M = 97;                 // number of chains
    private Node[] st = new Node[M];    // array of chains
    // array doubling and halving code omitted

    private static class Node
    {
        private Object key; // no generic array creation
        private Object val; // (declare key and value of type Object)
        private Node next;
        // ...
    }

    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    public Value get(Key key) {
        int i = hash(key);
        for(Node x = st[i]; x != null; x = x.next)
            if(key.equals(x.key)) return (Value) x.val;
        return null;
    }

    public void put(Key key, Value val) {
        int i = hash(key);
        for(Node x = st[i]; x != null; x = x.next)
            if(key.equals(x.key)) { x.val = val; return; }
        st[i] = new Node(key, val, st[i]);
    }
}

Analysis of separate chaining

Proposition: Under uniform hashing assumption, probability that the number of keys in a list is within a constant factor of \(N/M\) is extremely close to \(1\)

Pf sketch: Distribution of list size obeys a binomial distribution

Consequence: Number of probes (equals() and hashCode()) for search/insert is proportional to \(N/M\) (\(M\) times faster than sequential search)

\(M\) too large -> too many empty chains (memory)
\(M\) too small -> chains too long (time)
Typical choice: \(M \sim \frac{1}{4}N\) → constant-time ops

resizing in a separate-chaining hash table

Goal: Average length of list \(N/M = \text{ constant}\).

Double size of array \(M\) when \(N/M \geq 8\)
Halve size of array \(M\) when \(N/M \leq 2\)
Note: need to rehash all keys when resizing (x.hashCode() does not change, but hash(x) can change)

resizing in a separate-chaining hash table

Before resizing (\(N/M=8\)):

After resizing (\(N/M=4\))

deletion in a separate-chaining hash table

Q: How to delete a key (and its associated value)?
A: Easy! Need to consider only chain containing key

Before deleting C (hash(C) = 2)

After deleting C (hash(C) = 2)

symbol table implementations: summary

implementation	search*	insert*	delete*	search\(^\dagger\)	insert\(^\dagger\)	delete\(^\dagger\)	ordered	ops on keys
seq search	\(N\)	\(N\)	\(N\)	\(N\)	\(N\)	\(N\)		`equals()`
binary search	\(\log N\)	\(N\)	\(N\)	\(\log N\)	\(N\)	\(N\)	✓	`compareTo()`
BST	\(N\)	\(N\)	\(N\)	\(\log N\)	\(\log N\)	\(\sqrtN\)	✓	`compareTo()`
2-3 tree	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	✓	`compareTo()`
LLRB	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	✓	`compareTo()`
separate chaining	\(N\)	\(N\)	\(N\)	\(1^\ddagger\)	\(1^\ddagger\)	\(1^\ddagger\)		`equals()` `hashCode()`

\(^*\) guaranteed, \(^\dagger\) typical, \(^\ddagger\) under uniform hashing assumption

Hash Tables

Linear Probing

collision resolution: open addressing

Open addressing [Amdahl-Boehme-Rocherster-Samuel, IBM 1953]

Maintains keys and values in two parallel arrays
When a new key collides, find next empty slot, and put it there

//         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
keys[] = [ P, M,  ,  , A, C,  , H, L,  , E,  ,  ,  , R, X];
vals[] = [11,10,  ,  , 9, 5,  , 6,12,  ,13,  ,  ,  , 4, 8];

put(K, 14) // hash(K) = 7

linear-probing hash table summary

Use an array of size \(M>N\)

Hash: map key to integer \(i\) between \(0\) and \(M-1\)
Search: search table index \(i\); if occupied but no match, try \(i+1\), \(i+2\), etc.
Insert: put at table index \(i\) if free; if not, try \(i+1\), \(i+2\), etc.

//         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
keys[] = [ P, M,  ,  , A, C, S, H, L,  , E,  ,  ,  , R, X];

put(K, 14) // hash(K) = 7

linear-probing st: java implementation

public class LinearProbingHashST<Key, Value> {
    private int M = 30001;
    private Value[] vals = (Value[]) new Object[M];
    private Key[]   keys = (Key[])   new Object[M];

    // array doubling and halving code omitted

    private int hash(Key key) { /* as before */ }

    public Value get(Key key) {
        // sequential search starting at hash(key)
        for(int i = hash(key); keys[i] != null; i= (i+1) % M)
            if(key.equals(keys[i])) return vals[i];
        return null;
    }

    private void put(Key key, Value val) {
        // sequential search for empty or key starting at hash(key)
        int i;
        for(i = hash(key); keys[i] != null; i = (i+1) % M)
            if(keys[i].equals(key)) break;
        keys[i] = key;
        vals[i] = val;
    }
}

clustering

Cluster: A contiguous block of items

Observation: new keys likely to hash into middle of big clusters

insert("A") // no collision
insert("B") // collides: A
insert("C") // no collision
insert("D") // no collision
insert("E") // no collision
insert("F") // collides: E
insert("G") // collides: B,E,F
insert("H") // no collision
insert("I") // collides: H
insert("J") // collides: A-G
insert("K") // no collision
insert("L") // collides: C
insert("M") // collides: H,I
insert("N") // no collision
============================

|       N                                 |
| M M M                                   |
|                                     L L |
|                               K         |
|                 J J J J J J             |
| I I                                     |
| H                                       |
|                   G G G G               |
|                     F F                 |
|                     E                   |
|           D                             |
|                                     C   |
|                 B B                     |
|                 A                       |
|=========================================|
| H I M N - D - - A B E F G J - K - - C L |

knuth's parking problem

Model: Cars arrive at one-way street with \(M\) parking spaces. Each desires a random space \(i\): if space \(i\) is taken, try \(i+1\), \(i+2\), etc.

Q: What is mean displacement of a car?

Half-full: With \(M/2\) cars, mean displacement is \(\sim 5/2\)

Full: With \(M\) cars, displacement is \(\sim \sqrt{\pi M/8}\)

Key insight: Cannot afford to let linear-probing hash table get too full!

analysis of linear probing

Proposition: Under uniform hashing assumption, the average number of probes in a linear probing hash table of size \(M\) that contains \(N=\alpha M\) keys is

\[\text{search hit: } \sim \frac{1}{2}\left( 1 + \frac{1}{1-\alpha} \right)\]

\[\text{search miss/insert: } \sim \frac{1}{2}\left( 1 + \frac{1}{(1-\alpha)^2} \right)\]

Pf:

analysis of linear probing

Proposition: Under uniform hashing assumption, the average number of probes in a linear probing hash table of size \(M\) that contains \(N=\alpha M\) keys is

Parameters:

\(M\) too large -> too many empty array entries (memory)
\(M\) too small -> search time blows up (time)
Typical choice: \(\alpha = N/M \sim 1/2\)
- Num of probes for search hit is about \(3/2\)
- Num of probes for search miss is about \(5/2\)

resizing in a linear-probing hash table

Goal: Average length of list \(N/M \leq 1/2\)

Double size of array \(M\) when \(N/M \geq 1/2\)
Halve size of array \(M\) when \(N/M \leq 1/8\)
Need to rehash all keys when resizing

Before resizing:

//         0  1  2  3  4  5  6  7
keys[] = [  , E, S,  ,  , R, A,  ];
vals[] = [  , 1, 0,  ,  , 3, 2,  ];

After resizing:

//         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
keys[] = [  ,  ,  ,  , A,  , S,  ,  ,  , E,  ,  ,  , R,  ];
vals[] = [  ,  ,  ,  , 2,  , 0,  ,  ,  , 1,  ,  ,  , 3,  ];

deletion in a linear-probing hash table

Q: How to delete a key (and its associated value)?
A: Requires some care: can't just delete array entries.

Before deleting S:

//         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
keys[] = [ P, M,  ,  , A, C, S, H, L,  , E,  ,  ,  , R, X];
vals[] = [10, 9,  ,  , 8, 4, 0, 5,11,  ,12,  ,  ,  , 3, 7];

After deleting S:

//         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
keys[] = [ P, M,  ,  , A, C,  , H, L,  , E,  ,  ,  , R, X];
vals[] = [10, 9,  ,  , 8, 4,  , 5,11,  ,12,  ,  ,  , 3, 7];
//                           ^ doesn't work, e.g., if hash(H) = 4

symbol table implementations: summary

implementation	search*	insert*	delete*	search\(^\dagger\)	insert\(^\dagger\)	delete\(^\dagger\)	ordered	ops on keys
seq search	\(N\)	\(N\)	\(N\)	\(N\)	\(N\)	\(N\)		`equals()`
binary search	\(\log N\)	\(N\)	\(N\)	\(\log N\)	\(N\)	\(N\)	✓	`compareTo()`
BST	\(N\)	\(N\)	\(N\)	\(\log N\)	\(\log N\)	\(\sqrtN\)	✓	`compareTo()`
2-3 tree	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	✓	`compareTo()`
LLRB	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	\(\log N\)	✓	`compareTo()`
separate chaining	\(N\)	\(N\)	\(N\)	\(1^\ddagger\)	\(1^\ddagger\)	\(1^\ddagger\)		`equals()` `hashCode()`
linear probing	\(N\)	\(N\)	\(N\)	\(1^\ddagger\)	\(1^\ddagger\)	\(1^\ddagger\)		`equals()` `hashCode()`

\(^*\) guaranteed, \(^\dagger\) typical, \(^\ddagger\) under uniform hashing assumption

3-sum (revisited)

3-Sum: Given \(N\) distinct integers, find three such that \(a + b + c = 0\)

Goal: \(N^2\) expected time case, \(N\) extra space.

Hash Tables

Context

War story: algorithmic complexity attacks

Q: Is the uniform hashing assumption important in practice?
A: Obvious situations: aircraft control, nuclear reactor, pacemaker, HFT, ...
A: Surprising situations: denial-of-service (DOS) attacks!

Malicious adversary learns your hash function (e.g., by reading Java API) and causes a big pile-up in single slot that grinds performance to a halt

War story: algorithmic complexity attacks

Real-world exploits [Crosby-Wallach 2003]

Linux 2.4.20 kernel: save files with carefully chosen names
Bro server: send carefully chosen packets to DOS the server, using less bandwidth than a dial-up modem

War story: algorithmic complexity attacks

A Java bug report

algorithmic complexity attack on Java

Goal: Find family of strings with the same hashCode()
Solution: The base-31 hash code is part of Java's String API.

key	`hashCode()`	key	`hashCode()`
`Aa`	2112	`AaAaAaAa`	-540425984
`BB`	2112	`AaAaAaBB`	-540425984
		`AaAaBBAa`	-540425984
		`AaAaBBBB`	-540425984
		\(\vdots\)	\(\vdots\)
		`BBBBAaAa`	-540425984
		`BBBBAaBB`	-540425984
		`BBBBBBAa`	-540425984
		`BBBBBBBB`	-540425984

\(2^n\) strings of length \(2n\) that hash to same value!

diversion: one-way hash functions

One-way hash functions are "hard" to find a key that will hash to a desired value (or two keys that hash to same value).

Ex: MD4\(^*\), MD5\(^*\), SHA-0\(^*\), SHA-1\(^*\), SHA-2, SHA-256, WHIRLPOOL, ...

\(^*\) known to be insecure!

String password = args[0];
MessageDigest sha = MessageDigest.getInstance("SHA-256");
byte[] bytes = sha.digest(password);

/* prints bytes as hex string */

Applications: Crypto, message digests, passwords, Bitcoin, ...
Caveat: Too expensive for use in ST implementations :(

separate chaining vs. linear probing

Separate Chaining

Performance degrades gracefully
Clustering less sensitive to poorly-designed hash function

Linear Probing

Less wasted space
Better cache performance

hashing: variations on the theme

Many improved versions have been studied

Two-probe hashing (separate-chaining variant)

Hash to two positions, insert key in shorter of the two chains.
Reduces expected length of the longest chain to \(\sim \lg \ln n\)

Double hashing (linear-probing variant)

Use linear probing, but skip a variable amount, not just 1 each time
Effectively eliminates clustering
Can allow table to become nearly full
More difficult to implement delete

hashing: variations on the theme

Many improved versions have been studied

Cuckoo hashing (linear-probing variant)

Hash key to two positions; insert key into either postition; if occupied, reinsert displaced key into its alternative position (and recur).
Constant worst-case time for search

hash tables vs. balanced search trees

Hash tables

Simpler to code
No effective alternative for unordered keys
Faster for simple keys (a few arithmetic ops versus \(\log n\) compares)
Better system support in Java for String (e.g., cached hash code)

Balanced search trees

Stronger performance guarantee
Support for ordered ST operations
Easier to implement compareTo() than hashCode()

hash tables vs. balanced search trees

Java system includes both

Balanced search trees
- java.util.TreeMap
- java.util.TreeSet (red-black BST)
Hash tables
- java.util.HashMap (linear probing)
- java.util.IdentifyHashMap (separate chaining)