Chapter 10: Hashing

Subtopic: ADT Dictionary

Definition

ADT Dictionary (also called map, table, or associative array) is a data structure with entries that have two parts:
- A search key
- A value associated with the key
It organizes and identifies entries by their keys, allowing retrieval or removal using only the key.

Causes

Not applicable to this topic

Goals / Objectives

To enable efficient searching, retrieval, addition, and removal of data based on keys.
To support dictionary-like applications where each entry must be quickly accessed by a unique identifier.

Importance

Searching databases is common in many computer applications.
The efficiency of searching within a dictionary is crucial.
Hashing is introduced to improve search efficiency, ideally achieving O(1) search time.

Procedures

Dictionary Operations include:
- add – Insert a key-value pair
- remove – Delete an entry using its key
- retrieve – Get the value associated with a key
- contains – Check if a key exists
- isFull – Test if the dictionary is full
- get number of entries – Count entries
- remove all entries – Clear the dictionary
Usage Example:
- A telephone directory where names (keys) map to phone numbers (values).

Advantages & Disadvantages

Advantages:
- Provides structured and organized key-based data access.
- Enables efficient operations when combined with hashing.
Disadvantages:
- Efficiency heavily depends on the underlying implementation (e.g., array, linked list, hashing).
- Without good hash functions or search mechanisms, performance may degrade.

Impact / Effect

Leads to faster searching in applications such as:
- Telephone directories
- Word frequency counters
- Concordance of words

Examples

Applications of ADT Dictionary:
- Directory of telephone numbers
- Frequency of words in a text
- Concordance of words (list of words with occurrences in a text)
In Java:
- java.util.Map<K,V> is the dictionary interface.
- Implementations include:
  - HashMap<K,V>
  - Hashtable<K,V>
  - TreeMap<K,V>

Key Takeaways

Dictionary = key-value data structure.
Provides fast access using keys instead of searching entire lists.
Core operations: add, remove, retrieve, check, clear.
Widely used in real-world applications like phone directories and word counts.

Subtopic: Hashing – Overview

Definition

Hashing is a technique that determines an index (or location) for storing an item in a data structure without searching.
A hash function maps a search key to a specific location in a hash table (an array).
The resulting location is called the hash index.
A perfect hash function maps each search key to a unique index with no collisions.

Causes

Not applicable to this topic

Goals / Objectives

To allow fast access (ideally O(1)) to data by computing an index directly from the key.
To reduce or eliminate the need for sequential or binary searches.
To support dictionary operations efficiently.

Importance

Improves efficiency of searching in data-heavy applications.
Makes dictionary operations (add, retrieve, remove) much faster.
Essential for implementing large-scale databases, indexing, and information systems.

Procedures

Two Steps of a Hash Function:
1. Convert the search key into an integer hash code.
2. Compress the hash code into the valid range of indices for the hash table.

Example Algorithm:

Algorithm getHashIndex(phoneNumber)
    i = last 4 digits of phoneNumber   // Step 1
    return i % tableSize               // Step 2

Advantages & Disadvantages

Advantages:
- Direct computation of index → faster access.
- Reduces search cost compared to lists or trees.
- Ideal case: O(1) performance.
Disadvantages:
- Hash functions are rarely perfect → collisions can occur.
- Performance depends on quality of hash function and table size.
- Poor distribution leads to clustering.

Impact / Effect

Provides fast access in ideal conditions.
In practice, may require collision resolution methods when multiple keys map to the same index.
Poor hashing strategy can lead to inefficiency and wasted space.

Examples

Phone directory hashing: last 4 digits of phone number determine storage index.
Word frequency counter: each word is hashed into an index, frequency updated at that index.

Key Takeaways

Hashing = using a function to compute an item’s storage location directly.
Works in two steps: generate hash code → compress into valid index.
Best case = O(1) access; real case may involve collisions.
Good hash function = uniform distribution + minimal collisions.

Subtopic: Hash Functions

Definition

A hash function is a function that maps a search key into a location (index) in a hash table.
The location returned is called the hash index.
A good hash function distributes entries uniformly across the table and minimizes collisions.

Causes

Collisions occur because typical hash functions are not perfect, meaning:
- Multiple keys may map to the same index.
- Uneven distribution of values leads to clustering.

Goals / Objectives

To generate a hash index from a search key efficiently.
To minimize collisions while distributing entries evenly in the hash table.
To ensure fast computation for practical use.

Importance

Central to hashing efficiency – the performance of dictionary operations depends on the quality of the hash function.
A poor hash function causes clustering, wasted space, and slower searches.
A good hash function ensures close to O(1) average performance.

Procedures

Two Steps of a Hash Function
- Convert the search key into an integer (hash code).
- Compress the hash code into a valid index range (0 to n−1).
Example:
```
Algorithm getHashIndex(phoneNumber)
    i = last 4 digits of phoneNumber     // Step 1
    return i % tableSize                 // Step 2
```
Characteristics of a Good Hash Function
- Minimizes collisions
- Distributes entries uniformly
- Fast to compute
Hashing by Key Types
- Primitive types: use the value directly, or apply folding/bitwise operations.
- Objects: override the hashCode() method.
- Strings: use Unicode values of characters with methods like:
  - Sum of character values
  - Weighted sum (char × position)
  - Horner’s method (preferred):
```
int hash = 0;
for (int i = 0; i < s.length(); i++)
    hash = g * hash + s.charAt(i); // g = positive constant
```
Compression Technique
- Modulo operation: hashIndex = hashCode % tableSize
- tableSize is usually a prime number to reduce collisions.

Advantages & Disadvantages

Advantages
- Allows direct and fast mapping from key → index.
- Flexible: works for primitive types and objects.
- With proper design, reduces clustering.
Disadvantages
- Imperfect hashing leads to collisions.
- Needs rehashing when load factor gets too high.
- Performance highly dependent on chosen hash function.

Impact / Effect

Good hash functions → faster retrieval and insertion.
Poor hash functions → frequent collisions, clustering, inefficient memory usage.
Impacts load factor and overall cost of dictionary operations.

Examples

Phone number hashing:
- h(555-1264) = 1264 (last 4 digits as hash code).
- Compressed index: 1264 % 101 = 52.
String hashing using Horner’s method for word frequency.
Student ID hashing: hashCode = studentID.

Key Takeaways

A hash function = (1) convert key → integer (2) compress into valid index.
Good hash functions must be fast, uniform, and minimize collisions.
Strings often use Horner’s method; primitive types can use folding or direct values.
Choice of table size (prime number) is critical for effectiveness.

Subtopic: Collision and Collision Resolution

Definition

A collision occurs when two different search keys are mapped by the hash function to the same index in the hash table.
Collision resolution refers to techniques used to handle such situations so that all entries can still be stored and retrieved correctly.

Causes

Non-perfect hash functions: Multiple keys produce the same hash index.
Limited hash table size: The number of keys exceeds or approaches the number of table slots.
Poor distribution of hash codes (clustering of values).

Goals / Objectives

To ensure that all entries can be inserted and retrieved even when collisions occur.
To minimize search time despite collisions.
To reduce clustering and maintain efficiency of dictionary operations.

Importance

Collisions are unavoidable in practice, so resolving them is essential.
The choice of collision resolution strategy directly impacts efficiency of hashing.
Without resolution, data could be lost or unretrievable.

Procedures

Two general approaches to resolving collisions:

Open Addressing
- Finds another empty location within the same hash table.
- Variants:
  - Linear Probing
    - If collision at index k → check k+1, k+2, … (wrap around if needed).
    - Requires packaging key with value.
  - Quadratic Probing
    - Probes positions based on square increments: k+1², k+2², …
    - Avoids primary clustering.
  - Double Hashing
    - Uses a second hash function to compute probe increments.
    - Example:
      - h1(key) = key % 7
      - h2(key) = 5 - (key % 5)
Separate Chaining
- Each hash table location (bucket) stores a list or chain of multiple entries.
- Types:
  - Unsorted chain
  - Sorted chain
  - Linked list implementation

Advantages & Disadvantages

Open Addressing
- Advantage: Simple, no extra data structures.
- Disadvantage: Suffers from clustering, performance drops as load factor grows.
Separate Chaining
- Advantage: Simple to implement, fewer restrictions on load factor.
- Disadvantage: Requires extra memory for chains.

Impact / Effect

Proper collision resolution maintains O(1) average performance.
Poor resolution leads to:
- Longer probe sequences (linear probing).
- Clustering (primary and secondary).
- Increased search and insertion time.

Examples

Linear Probing Example:
- Keys that map to the same index are placed in the next available slots.
Double Hashing Example (table size = 7, key = 16):
- h1(16) = 16 % 7 = 2
- h2(16) = 5 - (16 % 5) = 4
- Probe sequence: 2, 6, 3, 0, 4, 1, 5
Separate Chaining Example:
- Keys 12 and 23 both hash to index 7.
- Both stored in a linked list at that index.

Key Takeaways

Collisions are inevitable → must use resolution techniques.
Two main strategies: Open Addressing (linear, quadratic, double hashing) and Separate Chaining.
Open addressing keeps all entries in the table, but clustering is an issue.
Separate chaining uses linked structures at each index, reducing clustering but using more memory.

Subtopic: Efficiency of Hashing (Load Factor & Costs)

Definition

Efficiency of hashing refers to how well hashing performs in terms of time and space when handling dictionary operations (add, retrieve, remove).
Load factor (λ) measures how full the hash table is:

[ \lambda = \frac{\text{Number of entries in dictionary}}{\text{Number of locations in hash table}} ]

Causes

Efficiency is reduced when:
- Load factor is too high, causing frequent collisions.
- Poor collision resolution strategy increases search/probe length.
- Improper table size (e.g., not prime, too small).

Goals / Objectives

To keep hashing operations close to O(1) time.
To maintain performance by controlling the load factor.
To decide when rehashing (expanding table size) is needed.

Importance

Efficiency determines whether hashing is practical for real-world applications like databases or large directories.
Without efficiency management, hashing may degrade to O(n) performance.

Procedures

Managing Load Factor (λ)
- For open addressing: keep λ < 0.5 (table less than half full).
- For separate chaining: keep λ < 1 (average one entry per bucket).
Rehashing
- When λ exceeds limits, expand table size:
  - New size = prime number at least double current size.
- Recompute indices for all existing entries using the new hash function.
Efficiency Cost Analysis
- Open Addressing
  - All operations depend on searching probe sequences.
  - As λ increases, probe length grows significantly.
- Separate Chaining
  - Performance remains stable even as λ grows.
  - Average chain length ≈ λ.

Advantages & Disadvantages

Open Addressing
- Advantage: Requires no extra memory beyond the table.
- Disadvantage: Performance degrades rapidly when λ > 0.5.
Separate Chaining
- Advantage: Stable performance; fewer rehashes required.
- Disadvantage: Uses more memory for linked chains.

Impact / Effect

Proper management of load factor keeps hashing close to O(1).
Ignoring load factor → many collisions → hashing slows down.
Frequent rehashing can temporarily reduce efficiency but restores long-term performance.

Examples

Empty table → λ = 0.
700 entries in 1000 slots → λ = 0.7 → too high for open addressing, acceptable for separate chaining.
Rehashing: If a table of size 11 exceeds λ = 0.5 in open addressing, resize to next prime ≥ 22 (e.g., 23).

Key Takeaways

Load factor λ is the key efficiency measure.
Keep λ < 0.5 (open addressing) or < 1 (separate chaining).
Rehash when λ is too high → new prime-sized table, reinsert entries.
Open addressing = memory efficient but fragile; separate chaining = stable but uses extra memory.

Subtopic: Advantages & Disadvantages (Comparison of Methods)

Definition

A summary evaluation of different collision resolution techniques in hashing, comparing their strengths and weaknesses.

Causes

Not applicable to this topic

Goals / Objectives

To analyze and compare the effectiveness of different collision resolution strategies.
To help in selecting the best method based on application needs.

Importance

Choosing the right method affects performance, memory use, and efficiency.
Some methods are better for small tables, others for large, dynamic datasets.

Procedures

Methods compared:
1. Linear Probing
2. Quadratic Probing
3. Double Hashing
4. Separate Chaining

Advantages & Disadvantages

1. Linear Probing

Advantages:
- Easy to implement.
- Ensures every table location can eventually be used.
Disadvantages:
- Suffers from primary clustering (groups of occupied slots form).
- Performance drops quickly as table fills.

2. Quadratic Probing

Advantages:
- Reduces primary clustering by spreading probes further apart.
Disadvantages:
- Requires more computation than linear probing.
- Can still suffer from secondary clustering (different keys following the same probe sequence).
- Cannot always reach every location in the table.

3. Double Hashing

Advantages:
- Avoids both primary and secondary clustering.
- Can reach every location if table size is prime.
Disadvantages:
- Requires computing two hash functions → more time-consuming.

4. Separate Chaining

Advantages:
- Simple and efficient; fewer restrictions on table size.
- Handles high load factors well (λ can exceed 1).
- Reduces complexity in deletion.
Disadvantages:
- Requires extra memory for chains (linked lists or arrays).
- Average-case performance depends on chain length.

Impact / Effect

The choice of method determines average performance of dictionary operations:
- Linear probing → fast when table is sparse, poor with clustering.
- Quadratic probing → better distribution, but probe length can grow.
- Double hashing → best open addressing method, more costly to compute.
- Separate chaining → stable even with high load factor, but needs extra memory.

Examples (from notes)

Linear Probing Example: Adding entries one by one to next available slot.
Quadratic Probing Example: Probe sequence at increasing square increments (k+1², k+2², …).
Double Hashing Example: h1(key)=key%7, h2(key)=5-(key%5), probe sequence based on second hash.
Separate Chaining Example: Collided keys stored in linked list at same index.

Key Takeaways

Linear probing: Simple but clustering is a big problem.
Quadratic probing: Reduces primary clustering, but not secondary.
Double hashing: Most effective open addressing method, avoids clustering, but slower due to dual hash functions.
Separate chaining: Handles collisions gracefully with stable performance, but requires extra memory.

Note51

Explorer

0.0 by Claude

Chapter 10: Hashing

Subtopic: ADT Dictionary

Definition

Causes

Goals / Objectives

Importance

Procedures

Advantages & Disadvantages

Impact / Effect

Examples

Key Takeaways

Subtopic: Hashing – Overview

Definition

Causes

Goals / Objectives

Importance

Procedures

Advantages & Disadvantages

Impact / Effect

Examples

Key Takeaways

Subtopic: Hash Functions

Definition

Causes

Goals / Objectives

Importance

Procedures

Advantages & Disadvantages

Impact / Effect

Examples

Key Takeaways

Subtopic: Collision and Collision Resolution

Definition

Causes

Goals / Objectives

Importance

Procedures

Advantages & Disadvantages

Impact / Effect

Examples

Key Takeaways

Subtopic: Efficiency of Hashing (Load Factor & Costs)

Definition

Causes

Goals / Objectives

Importance

Procedures

Advantages & Disadvantages

Impact / Effect

Examples

Key Takeaways

Subtopic: Advantages & Disadvantages (Comparison of Methods)

Definition

Causes

Goals / Objectives

Importance

Procedures

Advantages & Disadvantages

1. Linear Probing

2. Quadratic Probing

3. Double Hashing

4. Separate Chaining

Impact / Effect

Examples (from notes)

Key Takeaways

Graph View

Table of Contents

Backlinks