“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered.
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
Yet we should not pass up our opportunities in that critical 3%.
—Donald E. Knuth
”
| implementation | search* | insert* | delete* | search\(^\dagger\) | insert\(^\dagger\) | delete\(^\dagger\) | ordered | ops on keys |
|---|---|---|---|---|---|---|---|---|
| seq search | \(N\) | \(N\) | \(N\) | \(N\) | \(N\) | \(N\) | | equals() |
| binary search | \(\log N\) | \(N\) | \(N\) | \(\log N\) | \(N\) | \(N\) | ✓ | compareTo() |
| BST | \(N\) | \(N\) | \(N\) | \(\log N\) | \(\log N\) | \(\sqrt{N}\) | ✓ | compareTo() |
| 2-3 tree | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | ✓ | compareTo() |
| LLRB | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | ✓ | compareTo() |
\(^*\) guaranteed, \(^\dagger\) typical
Q. Can we do better?
Save items in a key-indexed table (index is a function of the key)
Issues

Classic space-time tradeoff:

Idealistic goal: Scramble the keys uniformly to produce a table index
Ex: social security numbers
Practical challenge: need different approach for each key type
Which is the last digit of your day of birth?
Example: The 4 in 2002.03.14
0, 1, or 2
3 or 4
5, 6, or 7
8 or 9
Which is the last digit of your month of birth?
Example: The 3 in 2002.03.14
0, 1, or 2
3 or 4
5, 6, or 7
8 or 9
Which is the last digit of your year of birth?
Example: The 2 in 2002.03.14
0, 1, or 2
3 or 4
5, 6, or 7
8 or 9
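All Java classes inherit a method hashCode(), which returns a 32-bit int
Requirement: If x.equals(y), then x.hashCode() == y.hashCode()
Highly desirable: If !x.equals(y), then x.hashCode() != y.hashCode()
Default implementation: typically derived from the identity (memory address) of x
Legal (but poor) implementation: always return 17
Customized implementations: Integer, Double, String, File, URL, Date, ...
Java library implementations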
public final class Integer {
private final int value;
// ...
public int hashCode() { return value; }
}
public final class Boolean {
private final boolean value;
// ...
public int hashCode() {
if(value) return 1231;
else return 1237;
}
}
public final class Double {
private final double value;
// ...
public int hashCode() {
long bits = doubleToLongBits(value);
return (int) (bits ^ (bits >>> 32));
}
}
Warning: \(-0.0\) and \(+0.0\) have different hash codes!
00000000000000000000000000000000b = \(0.0\)
10000000000000000000000000000000b = \(-0.0\)
Treat string of length \(L\) as \(L\)-digit, base-31 number
\[\begin{array}{rcl} h & = & s[0]\cdot 31^{L-1} + \ldots + s[L-3]\cdot 31^{2} \\ & & + s[L-2] \cdot 31^1 + s[L-1] \cdot 31^0 \end{array}\]
public final class String {
private final char[] s;
// ...
public int hashCode() {
int hash = 0;
for(int i=0; i<length(); i++)
hash = s[i] + (31 * hash);
return hash;
}
}
Horner's method: only \(L\) multiplies/adds to hash string of length \(L\)
String s = "call";
s.hashCode(); // 3045982 = 99*31^3 + 97*31^2 +
// + 108*31^1 + 108*31^0
// = 108 + 31*(108 + 31*(97 + 31*(99)))
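The arithmetic above can be checked with a minimal sketch of Horner's method. The class name `HornerHash` and the helper `hornerHash` are illustrative names, not part of the Java library; the only library call is `String.hashCode()`, whose base-31 formula is part of the documented `String` API.

```java
class HornerHash {
    // Horner's method: one multiply and one add per character.
    // int arithmetic wraps around on overflow, exactly as in String.hashCode().
    static int hornerHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++)
            hash = s.charAt(i) + 31 * hash;
        return hash;
    }

    public static void main(String[] args) {
        String s = "call";
        System.out.println(hornerHash(s));   // 3045982
        System.out.println(s.hashCode());    // 3045982
    }
}
```

Both lines print the same value because the hand-written loop and the library method evaluate the same polynomial.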
Performance optimization
public final class String {
private int hash = 0; // cache of hash code
private final char[] s;
// ...
public int hashCode() {
int h = hash, i;
if(h != 0) return h; // return cached value
for(i = 0; i < length(); i++)
h = s[i] + (31 * h);
hash = h; // store cache of hash code
return hash;
}
}
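Q: What if hashCode() of string is \(0\)? (hashCode() of "pollinating sandboxes" is \(0\))
A: A cached value of \(0\) is indistinguishable from "not yet computed", so the hash is recomputed on every call; the result is still correct, only slower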
public final class Transaction implements Comparable<Transaction>
{
private final String who;
private final Date when;
private final double amount;
public Transaction(String who, Date when, double amount)
{ /*...*/ }
public boolean equals(Object y) { /* ... */ }
public int hashCode() {
int hash = 17; // non-zero constant
hash = 31*hash + who.hashCode();
hash = 31*hash + when.hashCode();
hash = 31*hash + ((Double) amount).hashCode();
return hash;
}
}
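For reference types, use hashCode(); for primitive types, use hashCode() of the wrapper type
"Standard" recipe for user-defined types
Combine each significant field using the \(31x + y\) rule
If the field is a primitive type, use the wrapper type's hashCode()
If the field is null, use 0
If the field is a reference type, use hashCode() (applies rule recursively)
If the field is an array, apply the rule to each entry (or use Arrays.deepHashCode())
In practice: Previous recipe works reasonably well; used in Java libraries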
In theory: Keys are bitstrings; "universal"\(^*\) families of hash functions exist
Basic rule: Need to use the whole key to compute hash code; consult an expert for state-of-the-art hash codes
\(^*\)"universal": awkward in Java since only one (deterministic) hashCode()
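As a sketch of how the recipe might be applied to a type with a nullable reference field, two primitive fields, and an array field, consider the hypothetical class `Reading` below. The class and its field names are invented for illustration; `Integer.hashCode(int)`, `Double.hashCode(double)`, `Arrays.hashCode(int[])`, and `Objects.equals` are standard library methods (Java 8 or later).

```java
import java.util.Arrays;
import java.util.Objects;

final class Reading {
    private final String sensor;   // reference type, may be null
    private final int count;       // primitive type
    private final double value;    // primitive type
    private final int[] samples;   // array, may be null

    Reading(String sensor, int count, double value, int[] samples) {
        this.sensor = sensor;
        this.count = count;
        this.value = value;
        this.samples = samples;
    }

    // equals() compares exactly the fields that hashCode() combines.
    @Override public boolean equals(Object y) {
        if (y == this) return true;
        if (!(y instanceof Reading)) return false;
        Reading that = (Reading) y;
        return Objects.equals(sensor, that.sensor)
            && count == that.count
            && Double.compare(value, that.value) == 0
            && Arrays.equals(samples, that.samples);
    }

    // 31x + y rule, starting from a non-zero constant.
    @Override public int hashCode() {
        int hash = 17;
        hash = 31 * hash + (sensor == null ? 0 : sensor.hashCode());
        hash = 31 * hash + Integer.hashCode(count);
        hash = 31 * hash + Double.hashCode(value);
        hash = 31 * hash + Arrays.hashCode(samples);  // 0 for a null array
        return hash;
    }
}
```

Keeping equals() and hashCode() over the same set of fields is what preserves the requirement that equal objects have equal hash codes.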
Which of the following is an effective way to map a hashable key to an integer between \(0\) and \(M-1\)?
A. private int hash(Key key)
{ return key.hashCode() % M; }
B. private int hash(Key key)
{ return Math.abs(key.hashCode()) % M; }
C. Both A and B
D. Neither A nor B
Hash code: An int between \(-2^{31}\) and \(2^{31}-1\)
Hash function: An int between \(0\) and \(M-1\) (for use as array index, where \(M\) is typically a prime or power of 2)
Bug: private int hash(Key key)
{ return key.hashCode() % M; }
// op % returns remainder in Java, not modulus
1-in-a-billion bug: private int hash(Key key)
{ return Math.abs(key.hashCode()) % M; }
// hashCode() of "polygenelubricants" is -2^31
Correct: private int hash(Key key)
{ return (key.hashCode() & 0x7fffffff) % M; }
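The three variants can be compared side by side with a small sketch. The class `HashIndex` and its method names are illustrative; each method takes the raw hash code rather than a key so that the failing inputs are easy to supply.

```java
class HashIndex {
    // Variant A: % yields a negative result for a negative hash code.
    static int remainderHash(int hashCode, int M) { return hashCode % M; }

    // Variant B: Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE.
    static int absHash(int hashCode, int M) { return Math.abs(hashCode) % M; }

    // Correct: clear the sign bit first, then take the remainder.
    static int maskedHash(int hashCode, int M) { return (hashCode & 0x7fffffff) % M; }

    public static void main(String[] args) {
        int M = 97;
        int h = "polygenelubricants".hashCode();   // equals Integer.MIN_VALUE
        System.out.println(remainderHash(-1, M));  // negative: not a valid index
        System.out.println(absHash(h, M));         // negative: not a valid index
        System.out.println(maskedHash(h, M));      // in the range 0 .. M-1
    }
}
```

Masking with 0x7fffffff maps every 32-bit hash code to a non-negative int, so the remainder is always a valid array index.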
Uniform hashing assumption: Each key is equally likely to hash to an integer between \(0\) and \(M-1\)
Bins and balls: Throw balls uniformly at random into \(M\) bins

Collisions play out in at least three ways:
The birthday problem concerns the probability that, in a set of \(n\) randomly chosen people, some pair of them will have the same birthday. By the pigeonhole principle, the probability reaches 100% when the number of people reaches 367 (since there are only 366 possible birthdays, including February 29). However, 99.9% probability is reached with just 70 people, and 50% probability with 23 people.
Put another way, if we have \(M\) bins and we start tossing balls into the bins, we can expect two balls in the same bin after \(\sim \sqrt{\pi M/2}\) tosses
The coupon collector's problem describes "collect all coupons and win" contests. It asks the following question: Suppose that there is an urn of \(n\) different coupons, from which coupons are being collected, equally likely, with replacement. What is the probability that more than \(t\) sample trials are needed to collect all \(n\) coupons? An alternative statement is: Given \(n\) coupons, how many coupons do you expect you need to draw with replacement before having drawn each coupon at least once?
Put another way, if we have \(M\) bins and we start tossing balls into the bins, we can expect every bin has \(\geq 1\) ball after \(\sim M \ln M\) tosses
Load balancing improves the distribution of workloads across multiple computing resources, such as computers, a computer cluster, network links, central processing units, or disk drives. Load balancing aims to optimize resource use, maximize throughput, minimize response time, and avoid overload of any single resource.
Put another way, if we have \(M\) bins and we toss \(M\) balls into the bins, we can expect the most loaded bin to have \(\sim \ln M / \ln \ln M\) balls
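The three bins-and-balls estimates can be explored with a small simulation. This is a sketch under the uniform hashing assumption, with `java.util.Random` standing in for the hash function; the class and method names are illustrative. The estimates are asymptotic and drop lower-order terms, so simulated averages for moderate \(M\) only roughly match them.

```java
import java.util.Random;

class BinsAndBalls {
    // Birthday problem: tosses until some bin receives its second ball.
    static int tossesUntilCollision(int M, Random rng) {
        boolean[] used = new boolean[M];
        int tosses = 0;
        while (true) {
            int bin = rng.nextInt(M);
            tosses++;
            if (used[bin]) return tosses;
            used[bin] = true;
        }
    }

    // Coupon collector: tosses until every bin holds at least one ball.
    static int tossesUntilAllFilled(int M, Random rng) {
        boolean[] used = new boolean[M];
        int filled = 0, tosses = 0;
        while (filled < M) {
            int bin = rng.nextInt(M);
            tosses++;
            if (!used[bin]) { used[bin] = true; filled++; }
        }
        return tosses;
    }

    // Load balancing: largest bin count after tossing M balls into M bins.
    static int maxLoad(int M, Random rng) {
        int[] count = new int[M];
        int max = 0;
        for (int ball = 0; ball < M; ball++)
            max = Math.max(max, ++count[rng.nextInt(M)]);
        return max;
    }

    public static void main(String[] args) {
        int M = 1000, trials = 1000;
        Random rng = new Random(42);   // fixed seed so runs are repeatable
        double collision = 0, filled = 0, load = 0;
        for (int t = 0; t < trials; t++) {
            collision += tossesUntilCollision(M, rng);
            filled += tossesUntilAllFilled(M, rng);
            load += maxLoad(M, rng);
        }
        System.out.printf("first collision: %.1f (estimate %.1f)%n",
                collision / trials, Math.sqrt(Math.PI * M / 2));
        System.out.printf("all bins filled: %.0f (estimate %.0f)%n",
                filled / trials, M * Math.log(M));
        System.out.printf("max load:        %.2f (estimate %.2f)%n",
                load / trials, Math.log(M) / Math.log(Math.log(M)));
    }
}
```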
Java's String hashCode() distributes the keys of A Tale of Two Cities uniformly


Challenge: deal with collisions efficiently
Use an array of \(M\) linked lists, where \(M<N\)
public class SeparateChainingHashST<Key, Value>
{
private int M = 97; // number of chains
private Node[] st = new Node[M]; // array of chains
// array doubling and halving code omitted
private static class Node
{
private Object key; // no generic array creation
private Object val; // (declare key and value of type Object)
private Node next;
// ...
}
private int hash(Key key) {
return (key.hashCode() & 0x7fffffff) % M;
}
public Value get(Key key) {
int i = hash(key);
for(Node x = st[i]; x != null; x = x.next)
if(key.equals(x.key)) return (Value) x.val;
return null;
}
public void put(Key key, Value val) {
int i = hash(key);
for(Node x = st[i]; x != null; x = x.next)
if(key.equals(x.key)) { x.val = val; return; }
st[i] = new Node(key, val, st[i]);
}
}
Proposition: Under uniform hashing assumption, probability that the number of keys in a list is within a constant factor of \(N/M\) is extremely close to \(1\)
Pf sketch: Distribution of list size obeys a binomial distribution

Consequence: Number of probes (equals() and hashCode()) for search/insert is proportional to \(N/M\) (\(M\) times faster than sequential search)
Goal: Average length of list \(N/M = \text{ constant}\).
Need to rehash all keys when resizing (x.hashCode() does not change, but hash(x) can change)
Before resizing (\(N/M=8\)):

After resizing (\(N/M=4\))

Q: How to delete a key (and its associated value)?
A: Easy! Need to consider only chain containing key
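A compact, runnable sketch of separate chaining with resizing and deletion follows. It is not the omitted code from the class above: the class name `ChainingST`, the initial size of 4 chains, and the halving threshold (\(N/M \leq 2\)) are assumptions made for illustration. Doubling at \(N/M \geq 8\) matches the before/after ratios shown for resizing.

```java
class ChainingST<Key, Value> {
    private int M = 4;                 // number of chains (assumed initial size)
    private int N = 0;                 // number of key-value pairs
    private Node[] st = new Node[M];   // array of chains

    private static class Node {
        final Object key;
        Object val;
        Node next;
        Node(Object key, Object val, Node next) {
            this.key = key; this.val = val; this.next = next;
        }
    }

    private static int hash(Object key, int size) {
        return (key.hashCode() & 0x7fffffff) % size;
    }

    public int size()   { return N; }
    public int chains() { return M; }

    @SuppressWarnings("unchecked")
    public Value get(Key key) {
        for (Node x = st[hash(key, M)]; x != null; x = x.next)
            if (key.equals(x.key)) return (Value) x.val;
        return null;
    }

    public void put(Key key, Value val) {
        int i = hash(key, M);
        for (Node x = st[i]; x != null; x = x.next)
            if (key.equals(x.key)) { x.val = val; return; }
        st[i] = new Node(key, val, st[i]);
        if (++N >= 8 * M) resize(2 * M);          // keep N/M below 8
    }

    // Only the chain that the key hashes to needs to be searched.
    public void delete(Key key) {
        int i = hash(key, M);
        Node prev = null;
        for (Node x = st[i]; x != null; prev = x, x = x.next) {
            if (!key.equals(x.key)) continue;
            if (prev == null) st[i] = x.next;     // unlink first node of chain
            else prev.next = x.next;              // unlink interior node
            N--;
            if (M > 4 && N <= 2 * M) resize(M / 2);  // assumed halving rule
            return;
        }
    }

    // Rehash every node: hashCode() is unchanged but hash() depends on M.
    private void resize(int newM) {
        Node[] old = st;
        st = new Node[newM];
        for (Node head : old) {
            for (Node x = head; x != null; ) {
                Node next = x.next;
                int i = hash(x.key, newM);
                x.next = st[i];
                st[i] = x;
                x = next;
            }
        }
        M = newM;
    }
}
```

After either kind of resize the ratio \(N/M\) returns to about 4, so a put or delete immediately afterwards does not trigger another resize.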
| implementation | search* | insert* | delete* | search\(^\dagger\) | insert\(^\dagger\) | delete\(^\dagger\) | ordered | ops on keys |
|---|---|---|---|---|---|---|---|---|
| seq search | \(N\) | \(N\) | \(N\) | \(N\) | \(N\) | \(N\) | | equals() |
| binary search | \(\log N\) | \(N\) | \(N\) | \(\log N\) | \(N\) | \(N\) | ✓ | compareTo() |
| BST | \(N\) | \(N\) | \(N\) | \(\log N\) | \(\log N\) | \(\sqrt{N}\) | ✓ | compareTo() |
| 2-3 tree | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | ✓ | compareTo() |
| LLRB | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | ✓ | compareTo() |
| separate chaining | \(N\) | \(N\) | \(N\) | \(1^\ddagger\) | \(1^\ddagger\) | \(1^\ddagger\) | | equals(), hashCode() |
\(^*\) guaranteed, \(^\dagger\) typical, \(^\ddagger\) under uniform hashing assumption
Open addressing [Amdahl-Boehme-Rochester-Samuel, IBM 1953]
//         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
keys[] = [ P, M,  ,  , A, C,  , H, L,  , E,  ,  ,  , R, X];
vals[] = [11,10,  ,  , 9, 5,  , 6,12,  ,13,  ,  ,  , 4, 8];
put(K, 14) // hash(K) = 7
Use an array of size \(M>N\)
//         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
keys[] = [ P, M,  ,  , A, C, S, H, L,  , E,  ,  ,  , R, X];
put(K, 14) // hash(K) = 7
public class LinearProbingHashST<Key, Value> {
private int M = 30001;
private Value[] vals = (Value[]) new Object[M];
private Key[] keys = (Key[]) new Object[M];
// array doubling and halving code omitted
private int hash(Key key) { /* as before */ }
public Value get(Key key) {
// sequential search starting at hash(key)
for(int i = hash(key); keys[i] != null; i= (i+1) % M)
if(key.equals(keys[i])) return vals[i];
return null;
}
public void put(Key key, Value val) {
// sequential search for empty or key starting at hash(key)
int i;
for(i = hash(key); keys[i] != null; i = (i+1) % M)
if(keys[i].equals(key)) break;
keys[i] = key;
vals[i] = val;
}
}
Observation: new keys likely to hash into middle of big clusters
insert("A") // no collision
insert("B") // collides: A
insert("C") // no collision
insert("D") // no collision
insert("E") // no collision
insert("F") // collides: E
insert("G") // collides: B,E,F
insert("H") // no collision
insert("I") // collides: H
insert("J") // collides: A-G
insert("K") // no collision
insert("L") // collides: C
insert("M") // collides: H,I
insert("N") // no collision
Each row shows the slots probed when inserting that key (latest insert on top); the bottom row is the final table
|       N                                 |
| M M M                                   |
|                                     L L |
|                               K         |
|                 J J J J J J             |
| I I                                     |
| H                                       |
|                   G G G G               |
|                     F F                 |
|                     E                   |
|           D                             |
|                                     C   |
|                 B B                     |
|                 A                       |
|=========================================|
| H I M N - D - - A B E F G J - K - - C L |
Model: Cars arrive at one-way street with \(M\) parking spaces. Each desires a random space \(i\): if space \(i\) is taken, try \(i+1\), \(i+2\), etc.
Q: What is mean displacement of a car?

Half-full: With \(M/2\) cars, mean displacement is \(\sim 3/2\)
Full: With \(M\) cars, mean displacement is \(\sim \sqrt{\pi M/8}\)
Key insight: Cannot afford to let linear-probing hash table get too full!
Proposition: Under uniform hashing assumption, the average number of probes in a linear probing hash table of size \(M\) that contains \(N=\alpha M\) keys is
\[\text{search hit: } \sim \frac{1}{2}\left( 1 + \frac{1}{1-\alpha} \right)\]
\[\text{search miss/insert: } \sim \frac{1}{2}\left( 1 + \frac{1}{(1-\alpha)^2} \right)\]
Pf:

Parameters: Typical choice is \(\alpha = N/M \sim 1/2\), which gives about \(3/2\) probes for a search hit and about \(5/2\) probes for a search miss/insert
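To see how quickly the probe counts grow with the load factor, the two estimates from the proposition can be tabulated. This is a direct evaluation of the formulas, not a measurement; the class and method names are illustrative.

```java
class ProbeEstimates {
    // Search hit: ~ 1/2 (1 + 1/(1 - alpha))
    static double searchHit(double alpha) {
        return 0.5 * (1 + 1 / (1 - alpha));
    }

    // Search miss or insert: ~ 1/2 (1 + 1/(1 - alpha)^2)
    static double searchMiss(double alpha) {
        return 0.5 * (1 + 1 / ((1 - alpha) * (1 - alpha)));
    }

    public static void main(String[] args) {
        double[] loads = {0.25, 0.50, 0.75, 0.90};
        for (double alpha : loads)
            System.out.printf("alpha = %.2f  hit ~ %.2f  miss/insert ~ %.2f%n",
                    alpha, searchHit(alpha), searchMiss(alpha));
    }
}
```

At \(\alpha = 0.9\) the miss estimate is roughly twenty times the value at \(\alpha = 0.5\), which is why the table is resized well before it fills up.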
Goal: Load factor \(N/M \leq 1/2\)
Before resizing:
//         0  1  2  3  4  5  6  7
keys[] = [  , E, S,  ,  , R, A,  ];
vals[] = [  , 1, 0,  ,  , 3, 2,  ];
After resizing:
//         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
keys[] = [  ,  ,  ,  , A,  , S,  ,  ,  , E,  ,  ,  , R,  ];
vals[] = [  ,  ,  ,  , 2,  , 0,  ,  ,  , 1,  ,  ,  , 3,  ];
Q: How to delete a key (and its associated value)?
A: Requires some care: can't just delete array entries.
Before deleting S:
//         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
keys[] = [ P, M,  ,  , A, C, S, H, L,  , E,  ,  ,  , R, X];
vals[] = [10, 9,  ,  , 8, 4, 0, 5,11,  ,12,  ,  ,  , 3, 7];
After deleting S:
//         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
keys[] = [ P, M,  ,  , A, C,  , H, L,  , E,  ,  ,  , R, X];
vals[] = [10, 9,  ,  , 8, 4,  , 5,11,  ,12,  ,  ,  , 3, 7];
// Emptying slot 6 doesn't work: e.g., if hash(H) = 4, a search for H stops at the empty slot 6 and never reaches H in slot 7
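One common way to make deletion safe is to empty the slot and then reinsert every key in the remainder of the same cluster, so that no search is cut short by the new gap. The sketch below assumes that approach; the class name `ProbingST`, the initial size 16, and the grow-only resizing are choices made for illustration rather than the omitted code from the class above.

```java
class ProbingST<Key, Value> {
    private int M = 16;                       // table size (assumed initial size)
    private int N = 0;                        // number of keys
    private Object[] keys = new Object[M];
    private Object[] vals = new Object[M];

    private int hash(Object key) { return (key.hashCode() & 0x7fffffff) % M; }

    public int size() { return N; }

    @SuppressWarnings("unchecked")
    public Value get(Key key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % M)
            if (key.equals(keys[i])) return (Value) vals[i];
        return null;
    }

    public void put(Key key, Value val) {
        if (N >= M / 2) resize(2 * M);        // keep N/M <= 1/2
        insert(key, val);
    }

    // Probe from hash(key) until the key or an empty slot is found.
    private void insert(Object key, Object val) {
        int i;
        for (i = hash(key); keys[i] != null; i = (i + 1) % M)
            if (keys[i].equals(key)) { vals[i] = val; return; }
        keys[i] = key;
        vals[i] = val;
        N++;
    }

    public void delete(Key key) {
        int i = hash(key);
        while (keys[i] != null && !key.equals(keys[i])) i = (i + 1) % M;
        if (keys[i] == null) return;          // key is not in the table
        keys[i] = null; vals[i] = null; N--;
        // Reinsert the rest of the cluster so later searches still succeed.
        for (i = (i + 1) % M; keys[i] != null; i = (i + 1) % M) {
            Object k = keys[i], v = vals[i];
            keys[i] = null; vals[i] = null; N--;
            insert(k, v);
        }
    }

    private void resize(int capacity) {
        Object[] oldKeys = keys, oldVals = vals;
        M = capacity;
        N = 0;
        keys = new Object[M];
        vals = new Object[M];
        for (int i = 0; i < oldKeys.length; i++)
            if (oldKeys[i] != null) insert(oldKeys[i], oldVals[i]);
    }
}
```

With Integer keys 4, 20, and 36 and \(M = 16\), all three hash to slot 4 and form one cluster; deleting 4 moves 20 and 36 back toward slot 4, and both remain findable.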
| implementation | search* | insert* | delete* | search\(^\dagger\) | insert\(^\dagger\) | delete\(^\dagger\) | ordered | ops on keys |
|---|---|---|---|---|---|---|---|---|
| seq search | \(N\) | \(N\) | \(N\) | \(N\) | \(N\) | \(N\) | | equals() |
| binary search | \(\log N\) | \(N\) | \(N\) | \(\log N\) | \(N\) | \(N\) | ✓ | compareTo() |
| BST | \(N\) | \(N\) | \(N\) | \(\log N\) | \(\log N\) | \(\sqrt{N}\) | ✓ | compareTo() |
| 2-3 tree | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | ✓ | compareTo() |
| LLRB | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | \(\log N\) | ✓ | compareTo() |
| separate chaining | \(N\) | \(N\) | \(N\) | \(1^\ddagger\) | \(1^\ddagger\) | \(1^\ddagger\) | | equals(), hashCode() |
| linear probing | \(N\) | \(N\) | \(N\) | \(1^\ddagger\) | \(1^\ddagger\) | \(1^\ddagger\) | | equals(), hashCode() |
\(^*\) guaranteed, \(^\dagger\) typical, \(^\ddagger\) under uniform hashing assumption
3-Sum: Given \(N\) distinct integers, find three such that \(a + b + c = 0\)
Goal: \(N^2\) expected time, \(N\) extra space.
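One way to meet this goal, sketched here with `java.util.HashSet`, is to store all values in a hash set and, for each pair, look up the value that would complete the triple. The class name `ThreeSum` and the choice to count triples (rather than return one) are illustrative. The sketch relies on the integers being distinct.

```java
import java.util.HashSet;
import java.util.Set;

class ThreeSum {
    // Counts unordered triples of distinct values that sum to zero.
    static int count(int[] a) {
        Set<Integer> values = new HashSet<>();   // N extra space
        for (int x : a) values.add(x);

        int hits = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = i + 1; j < a.length; j++) {
                long c = -((long) a[i] + a[j]);  // long avoids int overflow
                if (c < Integer.MIN_VALUE || c > Integer.MAX_VALUE) continue;
                // The third value must differ from both members of the pair.
                if (c != a[i] && c != a[j] && values.contains((int) c)) hits++;
            }
        }
        return hits / 3;   // each triple is found once per pair it contains
    }

    public static void main(String[] args) {
        int[] a = {30, -40, -20, -10, 40, 0, 10, 5};
        System.out.println(count(a));   // 4
    }
}
```

There are about \(N^2/2\) pairs and each lookup takes constant expected time under the uniform hashing assumption, which gives the \(N^2\) expected running time.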
Q: Is the uniform hashing assumption important in practice?
A: Obvious situations: aircraft control, nuclear reactor, pacemaker, HFT, ...
A: Surprising situations: denial-of-service (DOS) attacks!

A malicious adversary learns your hash function (e.g., by reading the Java API) and causes a big pile-up in a single slot that grinds performance to a halt
Real-world exploits [Crosby-Wallach 2003]
A Java bug report

Goal: Find family of strings with the same hashCode()
Solution: The base-31 hash code is part of Java's String API.
| key | hashCode() | key | hashCode() |
|---|---|---|---|
| Aa | 2112 | AaAaAaAa | -540425984 |
| BB | 2112 | AaAaAaBB | -540425984 |
| | | AaAaBBAa | -540425984 |
| | | AaAaBBBB | -540425984 |
| | | \(\vdots\) | \(\vdots\) |
| | | BBBBAaAa | -540425984 |
| | | BBBBAaBB | -540425984 |
| | | BBBBBBAa | -540425984 |
| | | BBBBBBBB | -540425984 |
\(2^n\) strings of length \(2n\) that hash to same value!
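The family can be generated mechanically. Because the base-31 hash of a concatenation `s + block` equals `31^2 * hash(s) + hash(block)` for a two-character block, swapping "Aa" for "BB" (both hash to 2112) anywhere in the string leaves the result unchanged. The class name `CollidingStrings` and the method `family` are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class CollidingStrings {
    // All 2^n strings of length 2n built from the blocks "Aa" and "BB".
    static List<String> family(int n) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < n; i++) {
            List<String> next = new ArrayList<>();
            for (String s : result) {
                next.add(s + "Aa");
                next.add(s + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : family(4))
            System.out.println(s + " " + s.hashCode());  // every line ends in -540425984
    }
}
```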
One-way hash function: "hard" to find a key that will hash to a desired value (or two keys that hash to the same value).
Ex: MD4\(^*\), MD5\(^*\), SHA-0\(^*\), SHA-1\(^*\), SHA-2, SHA-256, WHIRLPOOL, ...
\(^*\) known to be insecure!
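import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
// ...
String password = args[0];
MessageDigest sha = MessageDigest.getInstance("SHA-256");
byte[] bytes = sha.digest(password.getBytes(StandardCharsets.UTF_8)); // digest() takes a byte[], not a String
/* prints bytes as hex string */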
Applications: Crypto, message digests, passwords, Bitcoin, ...
Caveat: Too expensive for use in ST implementations :(
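A self-contained sketch of computing a SHA-256 digest and printing it as a hex string is given below. The class name `Sha256Hex` is illustrative, and encoding the input as UTF-8 is an assumption; `MessageDigest.getInstance("SHA-256")` and `digest(byte[])` are standard `java.security` calls.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha256Hex {
    // Returns the SHA-256 digest of the UTF-8 bytes of text, as lowercase hex.
    static String sha256Hex(String text) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] bytes = sha.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes)
                hex.append(String.format("%02x", b));   // two hex digits per byte
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String input = args.length > 0 ? args[0] : "abc";
        System.out.println(sha256Hex(input));
    }
}
```

The digest is always 32 bytes (64 hex digits) regardless of input length, and computing it costs far more than the few arithmetic operations of hashCode().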
Separate Chaining
Many improved versions have been studied
Two-probe hashing (separate-chaining variant)
Double hashing (linear-probing variant)
Cuckoo hashing (linear-probing variant)
Hash tables
Better system support in Java for String (e.g., cached hash code)
Balanced search trees
Easier to implement compareTo() correctly than hashCode()
Java system includes both
java.util.TreeMap, java.util.TreeSet (red-black BST)
java.util.HashMap (separate chaining)
java.util.IdentityHashMap (linear probing)
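The practical difference between the two library families shows up in the operations they offer. The sketch below counts word frequencies with either map type and then uses ordered operations that only `TreeMap` provides; the class name `MapChoice` and the helper `count` are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class MapChoice {
    // Counts word frequencies into whichever map implementation is supplied.
    static <T extends Map<String, Integer>> T count(String[] words, T counts) {
        for (String w : words)
            counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        String[] words = {"it", "was", "the", "best", "of", "times",
                          "it", "was", "the", "worst", "of", "times"};

        // Hash table: needs equals() and hashCode(); iteration order is unspecified.
        Map<String, Integer> hashed = count(words, new HashMap<>());
        System.out.println(hashed.get("times"));        // 2

        // Red-black BST: needs compareTo(); supports ordered operations.
        TreeMap<String, Integer> ordered = count(words, new TreeMap<>());
        System.out.println(ordered.firstKey());         // best
        System.out.println(ordered.ceilingKey("p"));    // the
        System.out.println(ordered.headMap("of"));      // {best=1, it=2}
    }
}
```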