Chapter 10: Hashing

Subtopic: ADT Dictionary

Definition

  • ADT Dictionary (also called map, table, or associative array) is a data structure with entries that have two parts:

    • A search key
    • A value associated with the key
  • It organizes and identifies entries by their keys, allowing retrieval or removal using only the key.


Causes

  • Not applicable to this topic

Goals / Objectives

  • To enable efficient searching, retrieval, addition, and removal of data based on keys.
  • To support dictionary-like applications where each entry must be quickly accessed by a unique identifier.

Importance

  • Searching databases is common in many computer applications.
  • The efficiency of searching within a dictionary is crucial.
  • Hashing is introduced to improve search efficiency, ideally achieving O(1) search time.

Procedures

  • Dictionary Operations include:

    • add – Insert a key-value pair
    • remove – Delete an entry using its key
    • retrieve – Get the value associated with a key
    • contains – Check if a key exists
    • isFull – Test if the dictionary is full
    • get number of entries – Count entries
    • remove all entries – Clear the dictionary
  • Usage Example:

    • A telephone directory where names (keys) map to phone numbers (values).

Advantages & Disadvantages

  • Advantages:

    • Provides structured and organized key-based data access.
    • Enables efficient operations when combined with hashing.
  • Disadvantages:

    • Efficiency heavily depends on the underlying implementation (e.g., array, linked list, hashing).
    • Without good hash functions or search mechanisms, performance may degrade.

Impact / Effect

  • Leads to faster searching in applications such as:

    • Telephone directories
    • Word frequency counters
    • Concordance of words

Examples

  • Applications of ADT Dictionary:

    • Directory of telephone numbers
    • Frequency of words in a text
    • Concordance of words (list of words with occurrences in a text)
  • In Java:

    • java.util.Map<K,V> is the dictionary interface.

    • Implementations include:

      • HashMap<K,V>
      • Hashtable<K,V>
      • TreeMap<K,V>

Key Takeaways

  • Dictionary = key-value data structure.
  • Provides fast access using keys instead of searching entire lists.
  • Core operations: add, remove, retrieve, check, clear.
  • Widely used in real-world applications like phone directories and word counts.

Subtopic: Hashing – Overview

Definition

  • Hashing is a technique that determines an index (or location) for storing an item in a data structure without searching.
  • A hash function maps a search key to a specific location in a hash table (an array).
  • The resulting location is called the hash index.
  • A perfect hash function maps each search key to a unique index with no collisions.

Causes

  • Not applicable to this topic

Goals / Objectives

  • To allow fast access (ideally O(1)) to data by computing an index directly from the key.
  • To reduce or eliminate the need for sequential or binary searches.
  • To support dictionary operations efficiently.

Importance

  • Improves efficiency of searching in data-heavy applications.
  • Makes dictionary operations (add, retrieve, remove) much faster.
  • Essential for implementing large-scale databases, indexing, and information systems.

Procedures

  • Two Steps of a Hash Function:

    1. Convert the search key into an integer hash code.
    2. Compress the hash code into the valid range of indices for the hash table.
  • Example Algorithm:

    Algorithm getHashIndex(phoneNumber)
        i = last 4 digits of phoneNumber   // Step 1
        return i % tableSize               // Step 2
    

Advantages & Disadvantages

  • Advantages:

    • Direct computation of index → faster access.
    • Reduces search cost compared to lists or trees.
    • Ideal case: O(1) performance.
  • Disadvantages:

    • Hash functions are rarely perfect → collisions can occur.
    • Performance depends on quality of hash function and table size.
    • Poor distribution leads to clustering.

Impact / Effect

  • Provides fast access in ideal conditions.
  • In practice, may require collision resolution methods when multiple keys map to the same index.
  • Poor hashing strategy can lead to inefficiency and wasted space.

Examples

  • Phone directory hashing: last 4 digits of phone number determine storage index.
  • Word frequency counter: each word is hashed into an index, frequency updated at that index.

Key Takeaways

  • Hashing = using a function to compute an item’s storage location directly.
  • Works in two steps: generate hash code → compress into valid index.
  • Best case = O(1) access; real case may involve collisions.
  • Good hash function = uniform distribution + minimal collisions.

Subtopic: Hash Functions

Definition

  • A hash function is a function that maps a search key into a location (index) in a hash table.
  • The location returned is called the hash index.
  • A good hash function distributes entries uniformly across the table and minimizes collisions.

Causes

  • Collisions occur because typical hash functions are not perfect, meaning:

    • Multiple keys may map to the same index.
    • Uneven distribution of values leads to clustering.

Goals / Objectives

  • To generate a hash index from a search key efficiently.
  • To minimize collisions while distributing entries evenly in the hash table.
  • To ensure fast computation for practical use.

Importance

  • Central to hashing efficiency – the performance of dictionary operations depends on the quality of the hash function.
  • A poor hash function causes clustering, wasted space, and slower searches.
  • A good hash function ensures close to O(1) average performance.

Procedures

  1. Two Steps of a Hash Function

    • Convert the search key into an integer (hash code).
    • Compress the hash code into a valid index range (0 to n−1).

    Example:

    Algorithm getHashIndex(phoneNumber)
        i = last 4 digits of phoneNumber     // Step 1
        return i % tableSize                 // Step 2
    
  2. Characteristics of a Good Hash Function

    • Minimizes collisions
    • Distributes entries uniformly
    • Fast to compute
  3. Hashing by Key Types

    • Primitive types: use the value directly, or apply folding/bitwise operations.

    • Objects: override the hashCode() method.

    • Strings: use Unicode values of characters with methods like:

      • Sum of character values

      • Weighted sum (char × position)

      • Horner’s method (preferred):

        int hash = 0;
        for (int i = 0; i < s.length(); i++)
            hash = g * hash + s.charAt(i); // g = positive constant
  4. Compression Technique

    • Modulo operation: hashIndex = hashCode % tableSize
    • tableSize is usually a prime number to reduce collisions.

Advantages & Disadvantages

  • Advantages

    • Allows direct and fast mapping from key → index.
    • Flexible: works for primitive types and objects.
    • With proper design, reduces clustering.
  • Disadvantages

    • Imperfect hashing leads to collisions.
    • Needs rehashing when load factor gets too high.
    • Performance highly dependent on chosen hash function.

Impact / Effect

  • Good hash functions → faster retrieval and insertion.
  • Poor hash functions → frequent collisions, clustering, inefficient memory usage.
  • Impacts load factor and overall cost of dictionary operations.

Examples

  • Phone number hashing:

    • h(555-1264) = 1264 (last 4 digits as hash code).
    • Compressed index: 1264 % 101 = 52.
  • String hashing using Horner’s method for word frequency.

  • Student ID hashing: hashCode = studentID.


Key Takeaways

  • A hash function = (1) convert key → integer (2) compress into valid index.
  • Good hash functions must be fast, uniform, and minimize collisions.
  • Strings often use Horner’s method; primitive types can use folding or direct values.
  • Choice of table size (prime number) is critical for effectiveness.

Subtopic: Collision and Collision Resolution

Definition

  • A collision occurs when two different search keys are mapped by the hash function to the same index in the hash table.
  • Collision resolution refers to techniques used to handle such situations so that all entries can still be stored and retrieved correctly.

Causes

  • Non-perfect hash functions: Multiple keys produce the same hash index.
  • Limited hash table size: The number of keys exceeds or approaches the number of table slots.
  • Poor distribution of hash codes (clustering of values).

Goals / Objectives

  • To ensure that all entries can be inserted and retrieved even when collisions occur.
  • To minimize search time despite collisions.
  • To reduce clustering and maintain efficiency of dictionary operations.

Importance

  • Collisions are unavoidable in practice, so resolving them is essential.
  • The choice of collision resolution strategy directly impacts efficiency of hashing.
  • Without resolution, data could be lost or unretrievable.

Procedures

Two general approaches to resolving collisions:

  1. Open Addressing

    • Finds another empty location within the same hash table.

    • Variants:

      • Linear Probing

        • If collision at index k → check k+1, k+2, … (wrap around if needed).
        • Requires packaging key with value.
      • Quadratic Probing

        • Probes positions based on square increments: k+1², k+2², …
        • Avoids primary clustering.
      • Double Hashing

        • Uses a second hash function to compute probe increments.

        • Example:

          • h1(key) = key % 7
          • h2(key) = 5 - (key % 5)
  2. Separate Chaining

    • Each hash table location (bucket) stores a list or chain of multiple entries.

    • Types:

      • Unsorted chain
      • Sorted chain
      • Linked list implementation

Advantages & Disadvantages

  • Open Addressing

    • Advantage: Simple, no extra data structures.
    • Disadvantage: Suffers from clustering, performance drops as load factor grows.
  • Separate Chaining

    • Advantage: Simple to implement, fewer restrictions on load factor.
    • Disadvantage: Requires extra memory for chains.

Impact / Effect

  • Proper collision resolution maintains O(1) average performance.

  • Poor resolution leads to:

    • Longer probe sequences (linear probing).
    • Clustering (primary and secondary).
    • Increased search and insertion time.

Examples

  • Linear Probing Example:

    • Keys that map to the same index are placed in the next available slots.
  • Double Hashing Example (table size = 7, key = 16):

    • h1(16) = 16 % 7 = 2
    • h2(16) = 5 - (16 % 5) = 4
    • Probe sequence: 2, 6, 3, 0, 4, 1, 5
  • Separate Chaining Example:

    • Keys 12 and 23 both hash to index 7.
    • Both stored in a linked list at that index.

Key Takeaways

  • Collisions are inevitable → must use resolution techniques.
  • Two main strategies: Open Addressing (linear, quadratic, double hashing) and Separate Chaining.
  • Open addressing keeps all entries in the table, but clustering is an issue.
  • Separate chaining uses linked structures at each index, reducing clustering but using more memory.

Subtopic: Efficiency of Hashing (Load Factor & Costs)

Definition

  • Efficiency of hashing refers to how well hashing performs in terms of time and space when handling dictionary operations (add, retrieve, remove).
  • Load factor (λ) measures how full the hash table is:

[ \lambda = \frac{\text{Number of entries in dictionary}}{\text{Number of locations in hash table}} ]


Causes

  • Efficiency is reduced when:

    • Load factor is too high, causing frequent collisions.
    • Poor collision resolution strategy increases search/probe length.
    • Improper table size (e.g., not prime, too small).

Goals / Objectives

  • To keep hashing operations close to O(1) time.
  • To maintain performance by controlling the load factor.
  • To decide when rehashing (expanding table size) is needed.

Importance

  • Efficiency determines whether hashing is practical for real-world applications like databases or large directories.
  • Without efficiency management, hashing may degrade to O(n) performance.

Procedures

  1. Managing Load Factor (λ)

    • For open addressing: keep λ < 0.5 (table less than half full).
    • For separate chaining: keep λ < 1 (average one entry per bucket).
  2. Rehashing

    • When λ exceeds limits, expand table size:

      • New size = prime number at least double current size.
    • Recompute indices for all existing entries using the new hash function.

  3. Efficiency Cost Analysis

    • Open Addressing

      • All operations depend on searching probe sequences.
      • As λ increases, probe length grows significantly.
    • Separate Chaining

      • Performance remains stable even as λ grows.
      • Average chain length ≈ λ.

Advantages & Disadvantages

  • Open Addressing

    • Advantage: Requires no extra memory beyond the table.
    • Disadvantage: Performance degrades rapidly when λ > 0.5.
  • Separate Chaining

    • Advantage: Stable performance; fewer rehashes required.
    • Disadvantage: Uses more memory for linked chains.

Impact / Effect

  • Proper management of load factor keeps hashing close to O(1).
  • Ignoring load factor → many collisions → hashing slows down.
  • Frequent rehashing can temporarily reduce efficiency but restores long-term performance.

Examples

  • Empty table → λ = 0.
  • 700 entries in 1000 slots → λ = 0.7 → too high for open addressing, acceptable for separate chaining.
  • Rehashing: If a table of size 11 exceeds λ = 0.5 in open addressing, resize to next prime ≥ 22 (e.g., 23).

Key Takeaways

  • Load factor λ is the key efficiency measure.
  • Keep λ < 0.5 (open addressing) or < 1 (separate chaining).
  • Rehash when λ is too high → new prime-sized table, reinsert entries.
  • Open addressing = memory efficient but fragile; separate chaining = stable but uses extra memory.

Subtopic: Advantages & Disadvantages (Comparison of Methods)

Definition

  • A summary evaluation of different collision resolution techniques in hashing, comparing their strengths and weaknesses.

Causes

  • Not applicable to this topic

Goals / Objectives

  • To analyze and compare the effectiveness of different collision resolution strategies.
  • To help in selecting the best method based on application needs.

Importance

  • Choosing the right method affects performance, memory use, and efficiency.
  • Some methods are better for small tables, others for large, dynamic datasets.

Procedures

  • Methods compared:

    1. Linear Probing
    2. Quadratic Probing
    3. Double Hashing
    4. Separate Chaining

Advantages & Disadvantages

1. Linear Probing

  • Advantages:

    • Easy to implement.
    • Ensures every table location can eventually be used.
  • Disadvantages:

    • Suffers from primary clustering (groups of occupied slots form).
    • Performance drops quickly as table fills.

2. Quadratic Probing

  • Advantages:

    • Reduces primary clustering by spreading probes further apart.
  • Disadvantages:

    • Requires more computation than linear probing.
    • Can still suffer from secondary clustering (different keys following the same probe sequence).
    • Cannot always reach every location in the table.

3. Double Hashing

  • Advantages:

    • Avoids both primary and secondary clustering.
    • Can reach every location if table size is prime.
  • Disadvantages:

    • Requires computing two hash functions → more time-consuming.

4. Separate Chaining

  • Advantages:

    • Simple and efficient; fewer restrictions on table size.
    • Handles high load factors well (λ can exceed 1).
    • Reduces complexity in deletion.
  • Disadvantages:

    • Requires extra memory for chains (linked lists or arrays).
    • Average-case performance depends on chain length.

Impact / Effect

  • The choice of method determines average performance of dictionary operations:

    • Linear probing → fast when table is sparse, poor with clustering.
    • Quadratic probing → better distribution, but probe length can grow.
    • Double hashing → best open addressing method, more costly to compute.
    • Separate chaining → stable even with high load factor, but needs extra memory.

Examples (from notes)

  • Linear Probing Example: Adding entries one by one to next available slot.
  • Quadratic Probing Example: Probe sequence at increasing square increments (k+1², k+2², …).
  • Double Hashing Example: h1(key)=key%7, h2(key)=5-(key%5), probe sequence based on second hash.
  • Separate Chaining Example: Collided keys stored in linked list at same index.

Key Takeaways

  • Linear probing: Simple but clustering is a big problem.
  • Quadratic probing: Reduces primary clustering, but not secondary.
  • Double hashing: Most effective open addressing method, avoids clustering, but slower due to dual hash functions.
  • Separate chaining: Handles collisions gracefully with stable performance, but requires extra memory.