Chapter 10: Hashing
Subtopic: ADT Dictionary
Definition
-
ADT Dictionary (also called map, table, or associative array) is a data structure with entries that have two parts:
- A search key
- A value associated with the key
-
It organizes and identifies entries by their keys, allowing retrieval or removal using only the key.
Causes
- Not applicable to this topic
Goals / Objectives
- To enable efficient searching, retrieval, addition, and removal of data based on keys.
- To support dictionary-like applications where each entry must be quickly accessed by a unique identifier.
Importance
- Searching databases is common in many computer applications.
- The efficiency of searching within a dictionary is crucial.
- Hashing is introduced to improve search efficiency, ideally achieving O(1) search time.
Procedures
-
Dictionary Operations include:
add– Insert a key-value pairremove– Delete an entry using its keyretrieve– Get the value associated with a keycontains– Check if a key existsisFull– Test if the dictionary is fullget number of entries– Count entriesremove all entries– Clear the dictionary
-
Usage Example:
- A telephone directory where names (keys) map to phone numbers (values).
Advantages & Disadvantages
-
Advantages:
- Provides structured and organized key-based data access.
- Enables efficient operations when combined with hashing.
-
Disadvantages:
- Efficiency heavily depends on the underlying implementation (e.g., array, linked list, hashing).
- Without good hash functions or search mechanisms, performance may degrade.
Impact / Effect
-
Leads to faster searching in applications such as:
- Telephone directories
- Word frequency counters
- Concordance of words
Examples
-
Applications of ADT Dictionary:
- Directory of telephone numbers
- Frequency of words in a text
- Concordance of words (list of words with occurrences in a text)
-
In Java:
-
java.util.Map<K,V>is the dictionary interface. -
Implementations include:
HashMap<K,V>Hashtable<K,V>TreeMap<K,V>
-
Key Takeaways
- Dictionary = key-value data structure.
- Provides fast access using keys instead of searching entire lists.
- Core operations: add, remove, retrieve, check, clear.
- Widely used in real-world applications like phone directories and word counts.
Subtopic: Hashing – Overview
Definition
- Hashing is a technique that determines an index (or location) for storing an item in a data structure without searching.
- A hash function maps a search key to a specific location in a hash table (an array).
- The resulting location is called the hash index.
- A perfect hash function maps each search key to a unique index with no collisions.
Causes
- Not applicable to this topic
Goals / Objectives
- To allow fast access (ideally O(1)) to data by computing an index directly from the key.
- To reduce or eliminate the need for sequential or binary searches.
- To support dictionary operations efficiently.
Importance
- Improves efficiency of searching in data-heavy applications.
- Makes dictionary operations (add, retrieve, remove) much faster.
- Essential for implementing large-scale databases, indexing, and information systems.
Procedures
-
Two Steps of a Hash Function:
- Convert the search key into an integer hash code.
- Compress the hash code into the valid range of indices for the hash table.
-
Example Algorithm:
Algorithm getHashIndex(phoneNumber) i = last 4 digits of phoneNumber // Step 1 return i % tableSize // Step 2
Advantages & Disadvantages
-
Advantages:
- Direct computation of index → faster access.
- Reduces search cost compared to lists or trees.
- Ideal case: O(1) performance.
-
Disadvantages:
- Hash functions are rarely perfect → collisions can occur.
- Performance depends on quality of hash function and table size.
- Poor distribution leads to clustering.
Impact / Effect
- Provides fast access in ideal conditions.
- In practice, may require collision resolution methods when multiple keys map to the same index.
- Poor hashing strategy can lead to inefficiency and wasted space.
Examples
- Phone directory hashing: last 4 digits of phone number determine storage index.
- Word frequency counter: each word is hashed into an index, frequency updated at that index.
Key Takeaways
- Hashing = using a function to compute an item’s storage location directly.
- Works in two steps: generate hash code → compress into valid index.
- Best case = O(1) access; real case may involve collisions.
- Good hash function = uniform distribution + minimal collisions.
Subtopic: Hash Functions
Definition
- A hash function is a function that maps a search key into a location (index) in a hash table.
- The location returned is called the hash index.
- A good hash function distributes entries uniformly across the table and minimizes collisions.
Causes
-
Collisions occur because typical hash functions are not perfect, meaning:
- Multiple keys may map to the same index.
- Uneven distribution of values leads to clustering.
Goals / Objectives
- To generate a hash index from a search key efficiently.
- To minimize collisions while distributing entries evenly in the hash table.
- To ensure fast computation for practical use.
Importance
- Central to hashing efficiency – the performance of dictionary operations depends on the quality of the hash function.
- A poor hash function causes clustering, wasted space, and slower searches.
- A good hash function ensures close to O(1) average performance.
Procedures
-
Two Steps of a Hash Function
- Convert the search key into an integer (hash code).
- Compress the hash code into a valid index range (0 to n−1).
Example:
Algorithm getHashIndex(phoneNumber) i = last 4 digits of phoneNumber // Step 1 return i % tableSize // Step 2 -
Characteristics of a Good Hash Function
- Minimizes collisions
- Distributes entries uniformly
- Fast to compute
-
Hashing by Key Types
-
Primitive types: use the value directly, or apply folding/bitwise operations.
-
Objects: override the
hashCode()method. -
Strings: use Unicode values of characters with methods like:
-
Sum of character values
-
Weighted sum (char × position)
-
Horner’s method (preferred):
int hash = 0; for (int i = 0; i < s.length(); i++) hash = g * hash + s.charAt(i); // g = positive constant
-
-
-
Compression Technique
- Modulo operation:
hashIndex = hashCode % tableSize tableSizeis usually a prime number to reduce collisions.
- Modulo operation:
Advantages & Disadvantages
-
Advantages
- Allows direct and fast mapping from key → index.
- Flexible: works for primitive types and objects.
- With proper design, reduces clustering.
-
Disadvantages
- Imperfect hashing leads to collisions.
- Needs rehashing when load factor gets too high.
- Performance highly dependent on chosen hash function.
Impact / Effect
- Good hash functions → faster retrieval and insertion.
- Poor hash functions → frequent collisions, clustering, inefficient memory usage.
- Impacts load factor and overall cost of dictionary operations.
Examples
-
Phone number hashing:
h(555-1264) = 1264(last 4 digits as hash code).- Compressed index:
1264 % 101 = 52.
-
String hashing using Horner’s method for word frequency.
-
Student ID hashing:
hashCode = studentID.
Key Takeaways
- A hash function = (1) convert key → integer (2) compress into valid index.
- Good hash functions must be fast, uniform, and minimize collisions.
- Strings often use Horner’s method; primitive types can use folding or direct values.
- Choice of table size (prime number) is critical for effectiveness.
Subtopic: Collision and Collision Resolution
Definition
- A collision occurs when two different search keys are mapped by the hash function to the same index in the hash table.
- Collision resolution refers to techniques used to handle such situations so that all entries can still be stored and retrieved correctly.
Causes
- Non-perfect hash functions: Multiple keys produce the same hash index.
- Limited hash table size: The number of keys exceeds or approaches the number of table slots.
- Poor distribution of hash codes (clustering of values).
Goals / Objectives
- To ensure that all entries can be inserted and retrieved even when collisions occur.
- To minimize search time despite collisions.
- To reduce clustering and maintain efficiency of dictionary operations.
Importance
- Collisions are unavoidable in practice, so resolving them is essential.
- The choice of collision resolution strategy directly impacts efficiency of hashing.
- Without resolution, data could be lost or unretrievable.
Procedures
Two general approaches to resolving collisions:
-
Open Addressing
-
Finds another empty location within the same hash table.
-
Variants:
-
Linear Probing
- If collision at index k → check k+1, k+2, … (wrap around if needed).
- Requires packaging key with value.
-
Quadratic Probing
- Probes positions based on square increments: k+1², k+2², …
- Avoids primary clustering.
-
Double Hashing
-
Uses a second hash function to compute probe increments.
-
Example:
- h1(key) = key % 7
- h2(key) = 5 - (key % 5)
-
-
-
-
Separate Chaining
-
Each hash table location (bucket) stores a list or chain of multiple entries.
-
Types:
- Unsorted chain
- Sorted chain
- Linked list implementation
-
Advantages & Disadvantages
-
Open Addressing
- Advantage: Simple, no extra data structures.
- Disadvantage: Suffers from clustering, performance drops as load factor grows.
-
Separate Chaining
- Advantage: Simple to implement, fewer restrictions on load factor.
- Disadvantage: Requires extra memory for chains.
Impact / Effect
-
Proper collision resolution maintains O(1) average performance.
-
Poor resolution leads to:
- Longer probe sequences (linear probing).
- Clustering (primary and secondary).
- Increased search and insertion time.
Examples
-
Linear Probing Example:
- Keys that map to the same index are placed in the next available slots.
-
Double Hashing Example (table size = 7, key = 16):
- h1(16) = 16 % 7 = 2
- h2(16) = 5 - (16 % 5) = 4
- Probe sequence: 2, 6, 3, 0, 4, 1, 5
-
Separate Chaining Example:
- Keys 12 and 23 both hash to index 7.
- Both stored in a linked list at that index.
Key Takeaways
- Collisions are inevitable → must use resolution techniques.
- Two main strategies: Open Addressing (linear, quadratic, double hashing) and Separate Chaining.
- Open addressing keeps all entries in the table, but clustering is an issue.
- Separate chaining uses linked structures at each index, reducing clustering but using more memory.
Subtopic: Efficiency of Hashing (Load Factor & Costs)
Definition
- Efficiency of hashing refers to how well hashing performs in terms of time and space when handling dictionary operations (add, retrieve, remove).
- Load factor (λ) measures how full the hash table is:
[ \lambda = \frac{\text{Number of entries in dictionary}}{\text{Number of locations in hash table}} ]
Causes
-
Efficiency is reduced when:
- Load factor is too high, causing frequent collisions.
- Poor collision resolution strategy increases search/probe length.
- Improper table size (e.g., not prime, too small).
Goals / Objectives
- To keep hashing operations close to O(1) time.
- To maintain performance by controlling the load factor.
- To decide when rehashing (expanding table size) is needed.
Importance
- Efficiency determines whether hashing is practical for real-world applications like databases or large directories.
- Without efficiency management, hashing may degrade to O(n) performance.
Procedures
-
Managing Load Factor (λ)
- For open addressing: keep λ < 0.5 (table less than half full).
- For separate chaining: keep λ < 1 (average one entry per bucket).
-
Rehashing
-
When λ exceeds limits, expand table size:
- New size = prime number at least double current size.
-
Recompute indices for all existing entries using the new hash function.
-
-
Efficiency Cost Analysis
-
Open Addressing
- All operations depend on searching probe sequences.
- As λ increases, probe length grows significantly.
-
Separate Chaining
- Performance remains stable even as λ grows.
- Average chain length ≈ λ.
-
Advantages & Disadvantages
-
Open Addressing
- Advantage: Requires no extra memory beyond the table.
- Disadvantage: Performance degrades rapidly when λ > 0.5.
-
Separate Chaining
- Advantage: Stable performance; fewer rehashes required.
- Disadvantage: Uses more memory for linked chains.
Impact / Effect
- Proper management of load factor keeps hashing close to O(1).
- Ignoring load factor → many collisions → hashing slows down.
- Frequent rehashing can temporarily reduce efficiency but restores long-term performance.
Examples
- Empty table → λ = 0.
- 700 entries in 1000 slots → λ = 0.7 → too high for open addressing, acceptable for separate chaining.
- Rehashing: If a table of size 11 exceeds λ = 0.5 in open addressing, resize to next prime ≥ 22 (e.g., 23).
Key Takeaways
- Load factor λ is the key efficiency measure.
- Keep λ < 0.5 (open addressing) or < 1 (separate chaining).
- Rehash when λ is too high → new prime-sized table, reinsert entries.
- Open addressing = memory efficient but fragile; separate chaining = stable but uses extra memory.
Subtopic: Advantages & Disadvantages (Comparison of Methods)
Definition
- A summary evaluation of different collision resolution techniques in hashing, comparing their strengths and weaknesses.
Causes
- Not applicable to this topic
Goals / Objectives
- To analyze and compare the effectiveness of different collision resolution strategies.
- To help in selecting the best method based on application needs.
Importance
- Choosing the right method affects performance, memory use, and efficiency.
- Some methods are better for small tables, others for large, dynamic datasets.
Procedures
-
Methods compared:
- Linear Probing
- Quadratic Probing
- Double Hashing
- Separate Chaining
Advantages & Disadvantages
1. Linear Probing
-
Advantages:
- Easy to implement.
- Ensures every table location can eventually be used.
-
Disadvantages:
- Suffers from primary clustering (groups of occupied slots form).
- Performance drops quickly as table fills.
2. Quadratic Probing
-
Advantages:
- Reduces primary clustering by spreading probes further apart.
-
Disadvantages:
- Requires more computation than linear probing.
- Can still suffer from secondary clustering (different keys following the same probe sequence).
- Cannot always reach every location in the table.
3. Double Hashing
-
Advantages:
- Avoids both primary and secondary clustering.
- Can reach every location if table size is prime.
-
Disadvantages:
- Requires computing two hash functions → more time-consuming.
4. Separate Chaining
-
Advantages:
- Simple and efficient; fewer restrictions on table size.
- Handles high load factors well (λ can exceed 1).
- Reduces complexity in deletion.
-
Disadvantages:
- Requires extra memory for chains (linked lists or arrays).
- Average-case performance depends on chain length.
Impact / Effect
-
The choice of method determines average performance of dictionary operations:
- Linear probing → fast when table is sparse, poor with clustering.
- Quadratic probing → better distribution, but probe length can grow.
- Double hashing → best open addressing method, more costly to compute.
- Separate chaining → stable even with high load factor, but needs extra memory.
Examples (from notes)
- Linear Probing Example: Adding entries one by one to next available slot.
- Quadratic Probing Example: Probe sequence at increasing square increments (k+1², k+2², …).
- Double Hashing Example: h1(key)=key%7, h2(key)=5-(key%5), probe sequence based on second hash.
- Separate Chaining Example: Collided keys stored in linked list at same index.
Key Takeaways
- Linear probing: Simple but clustering is a big problem.
- Quadratic probing: Reduces primary clustering, but not secondary.
- Double hashing: Most effective open addressing method, avoids clustering, but slower due to dual hash functions.
- Separate chaining: Handles collisions gracefully with stable performance, but requires extra memory.