LeetCode (1): Hash Tables - Patterns, Pitfalls & Three Classic Problems Deep Dive
Chen Kai

Hash tables are one of the highest ROI data structures in both interviews and real systems: by trading a small amount of memory overhead for fast membership queries and lookups, you can transform "scan all elements" solutions into near-linear time workflows. This guide builds reusable hash table thinking patterns through three LeetCode classics (Two Sum, Longest Consecutive Sequence, and Group Anagrams) to teach you what to store, how to design keys, and how to avoid common boundary-case bugs. We'll also cover advanced patterns (complement search, frequency counting, sliding window), performance comparisons, interview tips, and a debugging checklist.

Series Navigation

📚 LeetCode Algorithm Masterclass Series (10 Parts):

  1. Hash Tables (Two Sum, Longest Consecutive, Group Anagrams) ← You are here
  2. Two Pointers Techniques (Collision pointers, fast-slow, sliding window)
  3. Linked List Operations (Reverse, cycle detection, merge)
  4. Binary Tree Traversal & Recursion (Inorder/Preorder/Postorder, LCA)
  5. Dynamic Programming Intro (1D/2D DP, state transition)
  6. Backtracking Algorithms (Permutations, combinations, pruning)
  7. Binary Search Advanced (Integer/Real binary search, answer binary search)
  8. Stack & Queue (Monotonic stack, priority queue, deque)
  9. Graph Algorithms (BFS/DFS, topological sort, union-find)
  10. Greedy & Bit Manipulation (Greedy strategies, bitwise tricks)


Quick Refresher: What Hash Tables Provide

For a good hash function and typical workload, hash tables offer:

  • Insertion: O(1) average
  • Lookup: O(1) average
  • Deletion: O(1) average

Worst-case can degrade to O(n) (e.g., adversarial collisions), but for interview problems and most real use cases, the average case is what matters.

In Python:

  • dict is a hash map (key → value)
  • set is a hash set (membership only)

In Java:

  • HashMap<K,V> for key-value pairs
  • HashSet<E> for unique elements

In C++:

  • unordered_map<K,V> for key-value
  • unordered_set<T> for membership
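As a quick Python sanity check, here's a minimal sketch exercising all three operations on both containers (the keys and values are purely illustrative):

```python
# Hash map (dict): key → value
index_of = {}
index_of["apple"] = 0        # insertion: O(1) average
index_of["banana"] = 1
found = "apple" in index_of  # lookup: O(1) average
del index_of["banana"]       # deletion: O(1) average

# Hash set: membership only, no associated values
seen = set()
seen.add(42)
has_42 = 42 in seen
seen.discard(42)             # no-op if absent, never raises

print(found, has_42, index_of, seen)
```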

LeetCode 1: Two Sum

Problem Analysis

Core Challenge: The key difficulty lies in finding two numbers that sum to the target value in a single pass. A brute-force approach requires O(n²) time to check all pairs, which is inefficient for large datasets.

Solution Approach: Use a hash table to store "visited value → index" mappings. As we traverse the array, for each current element num, the partner we need is target - num (the complement). If this complement is already in the hash table, we've found the answer; otherwise, store the current value for future lookups. This "space-for-time" trade-off reduces time complexity from O(n²) to O(n).

Complexity Analysis:

  • Time Complexity: O(n). Single pass through the array; hash table lookup and insertion are O(1) on average.
  • Space Complexity: O(n). Worst case requires storing n elements in the hash table.

Key Insight: The essence of the problem is "finding pairs", and hash tables provide O(1) average lookup time, perfectly suited for this need. The key is understanding the "complement search" pattern: we're not looking for num itself, but for target - num.

Problem Statement

Given an array of integers nums and an integer target, return the indices of the two numbers such that they add up to target. Each input has exactly one solution, and you may not use the same element twice.

Example:

  • Input: nums = [2,7,11,15], target = 9
  • Output: [0,1] (because nums[0] + nums[1] = 9)

Constraints:

  • 2 <= nums.length <= 10^4
  • -10^9 <= nums[i] <= 10^9
  • -10^9 <= target <= 10^9
  • Only one valid answer exists

Brute Force Approach (For Comparison)

Naive solution: Check all pairs

def twoSum_bruteforce(nums, target):
    n = len(nums)
    for i in range(n):
        for j in range(i + 1, n):
            if nums[i] + nums[j] == target:
                return [i, j]
    return []

Complexity: O(n²) time, O(1) space

Why it's bad: For nums.length = 10,000, that's 50 million operations!

Hash Table Solution: The Complement Pattern

Core insight: When scanning left to right, maintain a mapping:

value → index

When you see num, the partner you need is target - num (the complement). If it's already in the map, you're done; otherwise, store the current value and index.

Python Implementation

from typing import List

class Solution:
    def twoSum(self, nums: List[int], target: int) -> List[int]:
        # Hash table: stores value → index mapping
        # This enables O(1) lookup to check if a value has appeared
        seen = {}  # value → index

        # Traverse the array once to find the answer
        for i, num in enumerate(nums):
            # Calculate the complement needed for current element
            # If num + complement = target, then complement = target - num
            complement = target - num

            # Check if complement is already in hash table
            # If yes, we found the pair, return immediately
            if complement in seen:
                # Return [earlier index, current index]
                # seen[complement] is the first occurrence index of complement
                return [seen[complement], i]

            # If complement not in hash table, store current value
            # Note: We store AFTER checking to avoid using same element twice
            seen[num] = i

        # According to constraints, there should always be a solution
        # Return empty list as defensive programming
        return []  # No solution (shouldn't happen per constraints)


Java Implementation

import java.util.HashMap;
import java.util.Map;

class Solution {
    public int[] twoSum(int[] nums, int target) {
        Map<Integer, Integer> seen = new HashMap<>();
        for (int i = 0; i < nums.length; i++) {
            int complement = target - nums[i];
            if (seen.containsKey(complement)) {
                return new int[]{seen.get(complement), i};
            }
            seen.put(nums[i], i);
        }
        return new int[]{};
    }
}

Why This Works

One-pass logic:

  1. At index i, we check if target - nums[i] was seen before
  2. If yes, we found the pair immediately
  3. If no, store nums[i] → i for future lookups

Time evolution (for nums = [2, 7, 11, 15], target = 9):

Step i num complement seen Action
1 0 2 7 {} Not found, store {2: 0}
2 1 7 2 {2: 0} Found! Return [0, 1]

Complexity Analysis

  • Time: O(n) (single pass)
  • Space: O(n) (hash map stores up to n entries)

Space-time tradeoff: We sacrifice O(n) memory to reduce time from O(n²) to O(n).

Deep Dive

Algorithm Principle: Why Does This Algorithm Work?

The core of the hash table solution lies in the complement search pattern. When we see element nums[i], we're not looking for it itself, but for its "other half" — target - nums[i]. If this complement has appeared before (stored in the hash table), we've found the complete pair.

Mathematical Proof: Suppose a solution (i, j) exists such that nums[i] + nums[j] = target, with i < j. When the algorithm reaches index j:

  • nums[i] has already been processed and stored as seen[nums[i]] = i
  • The current element is nums[j], so complement = target - nums[j]
  • Since nums[i] + nums[j] = target, we have complement = nums[i]
  • nums[i] is in seen, so the algorithm returns [i, j]

Time Complexity Proof: Why O(n)?

  • Outer loop: Traverse array once, O(n)
  • Hash table operations: Each in check and = assignment is O(1) on average
  • Total time: O(n)

Worst case: If all elements hash to the same bucket (adversarial input), hash table operations degrade to O(n), making total time O(n²). However, this is extremely rare in practice, and the average case is what matters.

Space Optimization: Is There a More Space-Efficient Method?

For this problem, no. Reasons: 1. Need to return indices: Cannot sort (would destroy indices), must store index information 2. Unsorted array: Cannot use two pointers (requires sorting) 3. Single pass requirement: Must store information about visited elements

Variant Extension: If the problem asked to "return the two values" (no indices needed), we could: - Sort first, then use two pointers: O(n log n) time, O(1) extra space - Or still use a hash table: O(n) time, O(n) space

Common Mistakes Analysis

Mistake Type Wrong Code Problem Correct Approach
Check before insert seen[num] = i; if complement in seen May use same element twice Check first, then insert
Return order return [i, seen[complement]] Wrong index order return [seen[complement], i]
Sort destroys indices Sort then two pointers Lose original indices Use hash table or store (value, index) tuples

Variant Extensions

  1. Three Sum: Fix first number, use two sum on remaining part
  2. Four Sum: Fix first two numbers, use two sum on remaining part
  3. Two Sum II (Sorted Array): Can use two pointers, O(1) space
  4. Two Sum (BST): Use inorder traversal + two pointers
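As a hedged sketch of variant 1: after sorting, fix the first number and run a two-sum (here with two pointers) on the rest, skipping duplicates. The function name and dedup details are one common arrangement, not the only one:

```python
def threeSum(nums):
    # Sort so duplicates are adjacent and two pointers apply
    nums = sorted(nums)
    res = []
    for i in range(len(nums) - 2):
        if i > 0 and nums[i] == nums[i - 1]:
            continue  # skip duplicate first numbers
        # Two-sum on nums[i+1:] looking for -nums[i]
        lo, hi = i + 1, len(nums) - 1
        while lo < hi:
            s = nums[i] + nums[lo] + nums[hi]
            if s < 0:
                lo += 1
            elif s > 0:
                hi -= 1
            else:
                res.append([nums[i], nums[lo], nums[hi]])
                lo += 1
                while lo < hi and nums[lo] == nums[lo - 1]:
                    lo += 1  # skip duplicate second numbers
    return res
```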

Common Pitfalls & Edge Cases

Pitfall 1: Duplicate Values

Case: nums = [3, 3], target = 6

Question: Does storing value → index work if values repeat?

Answer: Yes, because we check before overwriting:

# First iteration: i=0, num=3
complement = 6 - 3 = 3 # Not in seen yet
seen[3] = 0 # Store {3: 0}

# Second iteration: i=1, num=3
complement = 6 - 3 = 3 # Found in seen!
return [seen[3], 1] # [0, 1]

Pitfall 2: "Same Element Twice" Misunderstanding

Constraint says: "You may not use the same element twice"

Confusion: Does this mean the same value or the same index?

Clarification: It means the same index. So [3, 3] with target = 6 is valid (different indices).

Pitfall 3: Don't Sort!

Temptation: "Let me sort first, then use two pointers"

Problem: Sorting destroys original indices, but the problem asks for original indices!

Solution: If you must use two-pointer approach, store (value, original_index) tuples and sort by value.
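A minimal sketch of that workaround, sorting (value, index) pairs and then squeezing with two pointers (the function name is mine):

```python
def twoSum_sorted_pairs(nums, target):
    # Keep original indices alongside values before sorting
    pairs = sorted((v, i) for i, v in enumerate(nums))
    lo, hi = 0, len(pairs) - 1
    while lo < hi:
        s = pairs[lo][0] + pairs[hi][0]
        if s == target:
            # Return original indices, smaller first
            return sorted([pairs[lo][1], pairs[hi][1]])
        elif s < target:
            lo += 1
        else:
            hi -= 1
    return []
```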

Pitfall 4: Off-by-One in Return Order

Mistake:

return [i, seen[complement]]  # Wrong order!

Correct:

return [seen[complement], i]  # Earlier index first

Real-World Analogy

  • E-commerce: "I have $100. Can these two items fit my budget?"
  • Trading: "Do any two orders in this stream cancel out to the target exposure?"
  • Chemistry: "Which two reactants combine to this molecular weight?"

LeetCode 128: Longest Consecutive Sequence

Problem Analysis

Core Challenge: The key difficulty is finding the longest consecutive sequence in O(n) time without using sorting (which requires O(n log n)). Sequence elements don't need to be consecutive in the original array, suggesting we need fast existence checks.

Solution Approach: Use a hash set to store all numbers, enabling O(1) membership queries. The key insight is: only count from sequence starts. A number x is a sequence start if and only if x - 1 is not in the set. For each sequence start, we expand rightward counting consecutive numbers. This ensures each number is visited at most twice (once as a start candidate, once as part of a sequence), guaranteeing O(n) time complexity.

Complexity Analysis:

  • Time Complexity: O(n). Although there's an inner while loop, each number is visited at most twice.
  • Space Complexity: O(n). The hash set stores all numbers.

Key Insight: Avoiding duplicate counting is the core optimization. If we count from every number, we'd count the same sequence multiple times. By counting only from starts, each sequence is counted exactly once.

Problem Statement

Given an unsorted array of integers, find the length of the longest consecutive elements sequence. Your algorithm must run in O(n) time.

Example:

  • Input: nums = [100, 4, 200, 1, 3, 2]
  • Output: 4 (sequence is [1, 2, 3, 4])

Constraints:

  • 0 <= nums.length <= 10^5
  • -10^9 <= nums[i] <= 10^9

Why Not Just Sort?

Naive approach: Sort then scan for runs

def longestConsecutive_sort(nums):
    if not nums:
        return 0
    nums.sort()
    longest = 1
    current = 1
    for i in range(1, len(nums)):
        if nums[i] == nums[i-1]:
            continue
        elif nums[i] == nums[i-1] + 1:
            current += 1
        else:
            longest = max(longest, current)
            current = 1
    return max(longest, current)

Complexity: O(n log n) (sorting dominates)

Problem: Violates the O(n) constraint!

Hash Set Solution: Start from Sequence Beginnings

Key insight: A number x is a sequence start if and only if x - 1 is not in the set. Only count from sequence starts; otherwise, you'd count the same sequence multiple times.

Algorithm

  1. Put all numbers in a set for O(1) membership queries

  2. For each number x:

    • If x - 1 is in the set, skip (x is not a start)
    • If x - 1 is not in the set, count consecutive numbers starting from x

Python Implementation

def longestConsecutive(nums):
    # Edge case: empty array returns 0
    if not nums:
        return 0

    # Convert array to set for O(1) membership queries
    # Set automatically removes duplicates, but this doesn't affect the algorithm
    num_set = set(nums)
    longest = 0  # Track longest sequence length

    # Iterate through each number in the set
    for num in num_set:
        # Key optimization: only count from sequence starts
        # A number is a sequence start if and only if num-1 is not in the set
        # This avoids counting the same sequence multiple times
        if num - 1 not in num_set:
            current_num = num   # Starting number of current sequence
            current_length = 1  # Current sequence length (at least num itself)

            # Expand rightward, counting consecutive numbers
            # Continue as long as current_num + 1 is in the set
            while current_num + 1 in num_set:
                current_num += 1     # Move to next number
                current_length += 1  # Increment sequence length

            # Update longest sequence length
            longest = max(longest, current_length)

    return longest

Why Is This O(n)?

Misconception: "The inner while loop makes it O(n²)"

Reality: Each number is visited at most twice: 1. Once in the outer loop 2. Once as part of a sequence (inner loop)

Proof: The inner while only runs when num - 1 is not in the set. So each sequence [a, a+1, ..., b] is scanned exactly once (starting from a).

Total operations: O(n) (outer) + O(n) (inner across all sequences) = O(n)


Complexity Analysis

  • Time: O(n) (see proof above)
  • Space: O(n) (hash set)

Deep Dive

Algorithm Principle: Why Count Only from Starts?

Key Insight: If we count from every number, we'd count the same sequence multiple times. For example, for sequence [1, 2, 3, 4]: - Counting from 1: count 1, 2, 3, 4 → length 4 - Counting from 2: count 2, 3, 4 → length 3 (duplicate) - Counting from 3: count 3, 4 → length 2 (duplicate) - Counting from 4: count 4 → length 1 (duplicate)

By counting only from starts (num-1 not in set), each sequence is counted exactly once.

Time Complexity Proof: Why O(n)?

Misconception: "The inner while loop makes it O(n²)"

Reality: Each number is visited at most twice: 1. Once in outer loop: checked as num whether it's a start 2. Once in inner loop: counted as part of some sequence

Proof:

  • The outer loop visits all n numbers: O(n)
  • The inner while only runs when num - 1 is not in the set
  • Each sequence [a, a+1, ..., b] is scanned exactly once (starting from a)
  • Total length of all sequences ≤ n (each number belongs to at most one sequence)
  • Total inner loop operations: O(n)

Total complexity: O(n)

Space Optimization: Is There a More Space-Efficient Method?

Method 1: Use Array as Hash Table (Only for Small Range) If number range is known and small (e.g., [0, 1000]), use boolean array:

# Only works for small number ranges
max_num = max(nums)
min_num = min(nums)
present = [False] * (max_num - min_num + 1)
# Space: O(range size), may be larger or smaller than O(n)

Method 2: In-place Marking (Requires Modifying Input) If input array can be modified, use sign marking:

# Mark visited numbers as negative (requires extra handling)
# Space: O(1) extra space, but destroys input

Trade-off: For general cases, the hash set's O(n) space is optimal because: - Number range can be huge (-10^9 to 10^9) - Cannot modify input array - Need fast O(1) queries

Common Mistakes Analysis

Mistake Type Wrong Code Problem Correct Approach
Count from every number for num in nums: while num+1 in set Duplicate counting, O(n²) Count only from starts
Use list instead of set num_list = list(nums) Lookup is O(n) Use set()
Forget deduplication Iterate nums directly Duplicate elements cause duplicate counting Iterate set(nums)

Variant Extensions

  1. Longest Consecutive Increasing Sequence (must be consecutive in array): Use dynamic programming
  2. Longest Consecutive Sequence (Allow Duplicates): Need to count frequencies
  3. Longest Consecutive Sequence (2D): Extend to 2D arrays
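As a minimal sketch of variant 1 (a contiguous increasing run, in the style of LeetCode 674), the "DP" collapses to a running count of the increasing run ending at each index; the function name is illustrative:

```python
def longest_increasing_run(nums):
    # DP-style running count: length of the increasing run ending at i
    best = cur = 1 if nums else 0
    for i in range(1, len(nums)):
        # Extend the run if strictly increasing, else restart at 1
        cur = cur + 1 if nums[i] > nums[i - 1] else 1
        best = max(best, cur)
    return best
```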

Example Walkthrough

Input: nums = [100, 4, 200, 1, 3, 2]

Step 1: Build set {100, 4, 200, 1, 3, 2}

Step 2: Iterate through set

num num-1 in set? Action current_length
100 No (99 not in set) Start sequence Count 100 → length 1
4 Yes (3 in set) Skip -
200 No (199 not in set) Start sequence Count 200 → length 1
1 No (0 not in set) Start sequence Count 1,2,3,4 → length 4
3 Yes (2 in set) Skip -
2 Yes (1 in set) Skip -

Output: 4

Common Edge Cases

Edge Case 1: Empty Array

nums = []
# Output: 0

Edge Case 2: All Duplicates

nums = [1, 1, 1, 1]
# Set becomes {1}, output: 1

Edge Case 3: No Consecutive Numbers

nums = [1, 3, 5, 7, 9]
# Each number is its own sequence, output: 1

Edge Case 4: Entire Array is Consecutive

nums = [5, 1, 3, 4, 2]
# Sequence [1,2,3,4,5], output: 5

Edge Case 5: Negative Numbers

nums = [-1, -2, 0, 1, 2]
# Sequence [-2,-1,0,1,2], output: 5

Real-World Applications

  • Version control: Find longest sequence of consecutive commits
  • Gaming: Detect longest win streak
  • Time series: Find longest period without missing data

LeetCode 49: Group Anagrams

Problem Analysis

Core Challenge: The key difficulty lies in designing a hash key such that all anagrams map to the same key, while non-anagrams map to different keys. Anagrams have the same characters but different orders, so we need a canonical form to represent them uniformly.

Solution Approach: There are three main methods to design hash keys: 1. Sorted String: Use the sorted string as key ("eat" → "aet") 2. Character Count Tuple: Count the frequency of each character, convert to a tuple as key 3. Fixed-Size Array: For lowercase English letters (26), use a length-26 count array

Complexity Analysis: - Method 1 (Sort): O(n·k log k) time, O(n·k) space - Method 2 (Count + Sort): O(n·k) time, O(n·k) space - Method 3 (Fixed Array): O(n·k) time, O(n·k) space (fastest)

Key Insight: Choosing the hash key is the core of grouping problems. Sorting is simple and general, but character counting is faster because counting is O(k) while sorting is O(k log k).

Problem Statement

Given an array of strings, group the anagrams together. The order of groups doesn't matter.

Example:

  • Input: strs = ["eat","tea","tan","ate","nat","bat"]
  • Output: [["bat"],["nat","tan"],["ate","eat","tea"]]

Constraints:

  • 1 <= strs.length <= 10^4
  • 0 <= strs[i].length <= 100
  • strs[i] consists of lowercase English letters

Key Insight: What Makes a Good Hash Key?

Anagrams have the same characters but different orders. We need a canonical form that's identical for all anagrams.

Option 1: Sorted String as Key

from collections import defaultdict

def groupAnagrams(strs):
    # Use defaultdict to avoid manual empty list initialization
    groups = defaultdict(list)

    # Iterate through each string
    for s in strs:
        # Sort string and use as hash key
        # Anagrams sort to the same string
        # Example: "eat" → "aet", "tea" → "aet", "ate" → "aet"
        key = ''.join(sorted(s))  # Canonical form

        # Add current string to corresponding group
        groups[key].append(s)

    # Return all groups (list of lists)
    return list(groups.values())

Example:

  • "eat" → sorted → "aet"
  • "tea" → sorted → "aet"
  • "ate" → sorted → "aet"

All map to the same key!

Complexity:

  • Time: O(n·k log k), where n = number of strings, k = maximum string length
  • Space: O(n·k)

Option 2: Character Count as Key (Faster!)

Instead of sorting, count character frequencies:

from collections import defaultdict, Counter

def groupAnagrams_optimized(strs):
    groups = defaultdict(list)
    for s in strs:
        # Count characters: {'e':1, 'a':1, 't':1}
        count = Counter(s)
        # Convert to tuple (hashable): (('a',1), ('e',1), ('t',1))
        key = tuple(sorted(count.items()))
        groups[key].append(s)
    return list(groups.values())

Complexity:

  • Time: O(n·k) (counting is O(k); sorting the at most 26 distinct letters is O(1))
  • Space: O(n·k)

Faster because: counting k characters is O(k), but sorting them is O(k log k).

Option 3: Fixed-Size Array as Key (Best for Lowercase English)

For lowercase English letters (26 total), use a fixed-size count array:

from collections import defaultdict

def groupAnagrams_fastest(strs):
    groups = defaultdict(list)

    for s in strs:
        # Create count array of length 26
        # Index 0 corresponds to 'a', index 25 to 'z'
        count = [0] * 26

        # Count frequency of each character in string
        for c in s:
            # ord(c) - ord('a') converts character to 0-25 index
            # Example: 'a' → 0, 'b' → 1, 'z' → 25
            count[ord(c) - ord('a')] += 1

        # Convert list to tuple as hash key (lists are not hashable)
        # Anagrams will have the same count array
        key = tuple(count)

        # Add string to corresponding group
        groups[key].append(s)

    return list(groups.values())

Complexity:

  • Time: O(n·k)
  • Space: O(n·k)

Why fastest: No sorting at all, just one pass through each string.

Deep Dive

Algorithm Principle: Why Do These Methods Work?

Core Idea: The essence of anagrams is that they have the same character frequencies but different orders. Therefore, we need an order-independent representation as the hash key.

Method 1 (Sort): - After sorting, all anagrams become the same string - "eat", "tea", "ate" all sort to "aet" - Time complexity: sorting each string requires O(k log k)

Method 2 (Character Count): - Count frequency of each character; anagrams have the same frequency distribution - "eat" → {'e':1, 'a':1, 't':1} → (('a',1), ('e',1), ('t',1)) - Time complexity: counting is O(k); sorting the at most 26 entries is O(1) (constant)

Method 3 (Fixed Array): - Use a fixed-size array where the index corresponds to the character position - "eat" → [1,0,0,0,1,...,1,...] (a=1, e=1, t=1) - Time complexity: only one pass needed, O(k)

Time Complexity Proof: Why Is Method 3 Fastest?

Method 1: - Sort each string: O(k log k) - n strings: O(n·k log k)

Method 2: - Count: O(k) - Sort count items (at most 26): O(1) - n strings: O(n·k)

Method 3: - Count: O(k) (one pass) - No sorting - n strings: O(n·k)

Conclusion: Method 3 is fastest because it avoids sorting overhead entirely.

Space Optimization: Is There a More Space-Efficient Method?

Optimization 1: Use String as Key (Method 1) - Space: O(n·k) (store sorted strings) - Pros: keys are readable, easy to debug

Optimization 2: Use Tuple as Key (Methods 2, 3) - Space: O(n·k) overall (Method 3 keys are fixed length-26 tuples, so keys alone are O(n)) - Pros: more compact, but less readable

Trade-off: For lowercase English letters, Method 3 is optimal: - Fixed 26 integers, small space overhead - No sorting, fastest time - Clean code

Common Mistakes Analysis

Mistake Type Wrong Code Problem Correct Approach
Use list as key groups[count].append(s) Lists are not hashable Use tuple(count)
Forget to sort count items key = tuple(count.items()) Order may differ tuple(sorted(count.items()))
Index calculation error count[ord(c)] Index out of range count[ord(c) - ord('a')]

Variant Extensions

  1. Anagram Detection: Check if two strings are anagrams

    # Method: Compare sorted strings or character counts
    return sorted(s1) == sorted(s2)

  2. Find All Anagrams: Find all anagrams of a pattern in a string

    • Use sliding window + character counting
    • Time complexity: O(n)
  3. Group Anagrams (Case Sensitive):

    • Need to distinguish case, extend character set to 52
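Variant 2 (Find All Anagrams, in the style of LeetCode 438) can be sketched with a fixed-size window and count comparison; this is one common arrangement, not the only one:

```python
from collections import Counter

def findAnagrams(s, p):
    # Slide a window of fixed size len(p) over s; a window is an
    # anagram of p exactly when its character counts match p's
    n, m = len(s), len(p)
    if m > n:
        return []
    need = Counter(p)
    window = Counter(s[:m])
    res = [0] if window == need else []
    for right in range(m, n):
        window[s[right]] += 1          # char entering the window
        left_char = s[right - m]       # char leaving the window
        window[left_char] -= 1
        if window[left_char] == 0:
            del window[left_char]      # keep counts clean for ==
        if window == need:
            res.append(right - m + 1)
    return res
```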

Complexity Comparison

Method Time Space Notes
Sorted string O(n·k log k) O(n·k) Simple, general
Counter + sort O(n·k) O(n·k) Faster, works for any charset
Fixed array O(n·k) O(n·k) Fastest, limited to lowercase English

Example Walkthrough

Input: strs = ["eat", "tea", "tan", "ate", "nat", "bat"]

Using sorted string keys:

String Sorted Key Group
"eat" "aet" Group A
"tea" "aet" Group A
"tan" "ant" Group B
"ate" "aet" Group A
"nat" "ant" Group B
"bat" "abt" Group C

Output: [["eat","tea","ate"], ["tan","nat"], ["bat"]]

Edge Cases

Edge Case 1: Empty String

strs = [""]
# Output: [[""]]

Edge Case 2: Single Character

strs = ["a"]
# Output: [["a"]]

Edge Case 3: No Anagrams

strs = ["abc", "def", "ghi"]
# Output: [["abc"], ["def"], ["ghi"]]

Edge Case 4: All Anagrams

strs = ["abc", "bca", "cab"]
# Output: [["abc", "bca", "cab"]]

Real-World Applications

  • Spell checking: Group typos by letter composition
  • Genomics: Cluster DNA sequences with same nucleotide counts
  • Data deduplication: Find records with same attributes (different order)

Advanced Hash Table Patterns

Pattern 1: Complement Search (Two Sum Variant)

Use when: You need to find pairs satisfying a condition

Examples:

  • Three Sum (extend to triplets)
  • Four Sum
  • Count pairs with given difference
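As a sketch of the last example, counting distinct value pairs at a fixed difference k uses the same complement idea (the function name and the assumptions — k > 0, pairs counted over distinct values — are mine):

```python
def count_pairs_with_diff(nums, k):
    # For each value v in the set, its "complement" is v + k;
    # assumes k > 0 and counts each distinct value pair once
    values = set(nums)
    return sum(1 for v in values if v + k in values)
```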

Pattern 2: Frequency Counting

Use when: You need to track occurrences

Examples:

  • Top K Frequent Elements
  • First Unique Character
  • Valid Anagram (simpler than sorting)
from collections import Counter

def isAnagram(s, t):
    return Counter(s) == Counter(t)

Pattern 3: Sliding Window + Hash Map

Use when: Fixed-size window with tracking

Example: Longest Substring with At Most K Distinct Characters

from collections import defaultdict

def lengthOfLongestSubstringKDistinct(s, k):
    left = 0
    char_count = defaultdict(int)
    max_len = 0

    for right in range(len(s)):
        char_count[s[right]] += 1

        # Shrink window if too many distinct chars
        while len(char_count) > k:
            char_count[s[left]] -= 1
            if char_count[s[left]] == 0:
                del char_count[s[left]]
            left += 1

        max_len = max(max_len, right - left + 1)

    return max_len

Pattern 4: Prefix Sum with Hash Map

Use when: You need subarray sums

Example: Subarray Sum Equals K

from collections import defaultdict

def subarraySum(nums, k):
    prefix_sum = 0
    count = 0
    sum_count = defaultdict(int)
    sum_count[0] = 1  # Base case: empty prefix

    for num in nums:
        prefix_sum += num
        # Check if (prefix_sum - k) exists
        if prefix_sum - k in sum_count:
            count += sum_count[prefix_sum - k]
        sum_count[prefix_sum] += 1

    return count

Hash Table Internals (Bonus)

How Python dict Works

Under the hood: 1. Hash function: Converts key to integer hash(key) 2. Bucket selection: index = hash(key) % table_size 3. Collision handling: Python uses open addressing with probing

Load factor: When num_entries / table_size > 2/3, Python resizes the table (doubles size).
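A toy sketch of steps 1-2 (illustrative only: real CPython uses the hash's low bits plus a perturbation sequence rather than a plain modulo):

```python
def bucket_index(key, table_size=8):
    # Step 1: hash the key; Step 2: map the hash into the table
    return hash(key) % table_size

# Deterministic: the same key always lands in the same bucket
# (string hashes are randomized per process, but stable within one)
print(bucket_index("alice") == bucket_index("alice"))

# Small integers hash to themselves in CPython, so the modulo is visible
print(bucket_index(10))  # 10 % 8 == 2
```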

Collision Resolution

Open Addressing (Python)

When bucket is occupied, probe next slots:

  • Linear probing: Try index+1, index+2, ...
  • Quadratic probing: Try index+1^2, index+2^2, ...

Chaining (Java HashMap)

Each bucket stores a linked list (or tree if too long) of colliding entries.

Hash Function Quality

Good hash function:

  • Deterministic (same key → same hash)
  • Uniform distribution (avoids clustering)
  • Fast to compute

Python's hash():

  • Integers: hash(x) == x (for small x)
  • Strings: Uses SipHash (cryptographic quality)
  • Custom objects: Override __hash__ and __eq__
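A minimal sketch of the custom-object case (the Point class is illustrative). The contract to uphold: objects that compare equal must also hash equal, so __hash__ should be built from the same fields __eq__ compares:

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Hash the same tuple that __eq__ compares:
        # equal objects must have equal hashes
        return hash((self.x, self.y))

points = {Point(1, 2), Point(1, 2), Point(3, 4)}
print(len(points))  # → 2 (the duplicate Point(1, 2) collapses)
```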

Common Mistakes & Debugging Checklist

Mistake 1: Forgetting to Check Before Inserting

Wrong:

seen[num] = i
if target - num in seen:
    return [seen[target - num], i]

Right: Check before inserting (to avoid self-pairing)

if target - num in seen:
    return [seen[target - num], i]
seen[num] = i

Mistake 2: Using Unhashable Keys

Error:

key = [1, 2, 3]  # List is not hashable!
groups[key].append(value) # TypeError

Fix: Convert to tuple

key = tuple([1, 2, 3])

Mistake 3: Mutating Hash Keys

Danger:

key = [1, 2]
groups[tuple(key)] = "value"
key.append(3) # Don't mutate after hashing!

Rule: Treat hash keys as immutable.

Mistake 4: Off-by-One in Index Calculations

Double-check:

  • Are you returning [i, j] or [j, i]?
  • Does the problem ask for 0-indexed or 1-indexed output?

Debugging Checklist

Before coding: 1. What should the key be? (value, count, sorted form?) 2. What should the value be? (index, list, count?) 3. Do I need dict or set?

After coding: 1. Test with empty input 2. Test with single element 3. Test with duplicates 4. Test with all unique elements 5. Print hash table contents at each step (for small inputs)


Interview Tips

Tip 1: Verbalize Your Thought Process

Template: > "I notice we're looking for pairs/groups/matches, which suggests a hash table. I'll store [X] as keys and [Y] as values. The lookup will be [Z]."

Tip 2: Start with Brute Force

Even if you know the hash table solution, say: > "The brute force is O(n²) with nested loops. We can optimize to O(n) using a hash map."

Tip 3: Clarify Constraints

Ask:

  • Can the array have duplicates?
  • Is the array sorted?
  • What's the expected size? (Affects space complexity concerns)
  • Are there negative numbers?

Tip 4: Analyze Trade-offs

After your solution, mention: > "This trades O(n) space for O(n) time. If memory is tight, we could fall back to the O(n log n) sorting approach."

Tip 5: Handle Edge Cases Out Loud

Walk through:

  • Empty array
  • Single element
  • All duplicates
  • All unique

Time-Saving Tricks

Trick 1: Use collections.Counter

Instead of:

freq = {}
for x in nums:
    freq[x] = freq.get(x, 0) + 1

Write:

from collections import Counter
freq = Counter(nums)

Trick 2: Use collections.defaultdict

Instead of:

groups = {}
for item in items:
    # key: the grouping key derived from item (e.g., a sorted string)
    if key not in groups:
        groups[key] = []
    groups[key].append(item)

Write:

from collections import defaultdict
groups = defaultdict(list)
for item in items:
    # key: the grouping key derived from item
    groups[key].append(item)

Trick 3: Use enumerate for Index+Value

Instead of:

for i in range(len(nums)):
    num = nums[i]

Write:

for i, num in enumerate(nums):

Performance Comparison: Hash Table vs Alternatives

Problem Brute Force Sorting Hash Table Winner
Two Sum O(n²) O(n log n) O(n) Hash
Longest Consecutive O(n³) O(n log n) O(n) Hash
Group Anagrams O(n²·k) O(n·k log k) O(n·k) Hash
Find Duplicates O(n²) O(n log n) O(n) Hash

When sorting wins:

  • Need sorted output anyway
  • Space is extremely constrained (O(1) required)
  • Data already partially sorted

Practice Problems (10 Recommended)

Easy

  1. Contains Duplicate (LeetCode 217)
  2. Valid Anagram (LeetCode 242)
  3. Intersection of Two Arrays (LeetCode 349)

Medium

  1. Group Anagrams (LeetCode 49) ← Covered in this guide
  2. Top K Frequent Elements (LeetCode 347)
  3. Subarray Sum Equals K (LeetCode 560)
  4. Longest Substring Without Repeating Characters (LeetCode 3)

Hard

  1. Longest Consecutive Sequence (LeetCode 128) ← Covered in this guide
  2. Substring with Concatenation of All Words (LeetCode 30)
  3. First Missing Positive (LeetCode 41) ← Tricky: use array as hash table!

Summary: Hash Table in One Page

When to use:

  • Need fast lookups (O(1) average)
  • Finding pairs/complements
  • Grouping by key
  • Frequency counting
  • Checking membership

Common patterns: 1. Complement search: Store seen values, check for target - current 2. Canonical form: Group by sorted/normalized key 3. Frequency map: Count occurrences with Counter or defaultdict(int) 4. Sliding window: Track state in fixed-size window

Pitfalls to avoid:

  • Check before inserting (avoid self-pairs)
  • Use immutable keys (tuples, not lists)
  • Handle duplicates correctly
  • Don't forget edge cases (empty, single element)

Interview template: 1. Identify need for fast lookup 2. Decide key and value types 3. Walk through example 4. Code with dict/set 5. Analyze complexity 6. Test edge cases

Memory cheat: > Hash tables trade space for time: O(n) memory for O(1) lookups. Perfect for "find pair/group/match" problems.


What's Next?

In Part 2 (Two Pointers Techniques), we'll explore:

  • Collision pointers: Squeeze from both ends (Container With Most Water)
  • Fast-slow pointers: Detect cycles (Floyd's algorithm)
  • Sliding window: Dynamic window sizing (Longest Substring)
  • When to use hash table vs. two pointers

Preview question: How would you solve Three Sum without using O(n) extra space? See you in Part 2!


❓ Q&A: Hash Table Interview Tips

Q1: When should I use a hash table instead of an array for lookups?

Answer: Use a hash table when you need non-sequential key access or when keys are sparse (not consecutive integers starting from 0).

Array advantages:

  • Direct indexing: arr[i] is O(1) guaranteed
  • Better cache locality (contiguous memory)
  • Lower memory overhead (no hash function, no collision handling)
  • Predictable performance (no worst-case degradation)

Hash table advantages:

  • Flexible keys (strings, tuples, custom objects)
  • Sparse keys without wasting memory
  • Dynamic resizing without reindexing

Comparison table:

Scenario | Array | Hash Table | Winner
Consecutive integer keys | [0..n-1] direct access, O(1) | O(1) average | Array (simpler, faster)
Sparse integer keys | [1, 100, 1000] possible but wastes memory | O(1) average | Hash Table (space efficient)
String keys | Not possible | O(1) average | Hash Table (only option)
Fixed-size, known range | O(1) guaranteed | O(1) average | Array (predictable)
Unknown/dynamic keys | Not practical | O(1) average | Hash Table (flexible)

Example:

# Array: Perfect for consecutive indices
scores = [85, 90, 78, 92]  # Student 0, 1, 2, 3
print(scores[2])  # O(1), direct access

# Hash table: Perfect for sparse/non-integer keys
student_scores = {"Alice": 85, "Bob": 90, "Charlie": 78}
print(student_scores["Bob"])  # O(1) average, flexible keys

Interview tip: If the problem mentions "indices" or "positions", consider arrays first. If it mentions "values" or "identifiers", hash tables are likely better.


Q2: What are the different collision handling strategies, and when should I use each?

Answer: There are two main approaches: chaining and open addressing. Each has trade-offs.

Chaining (Separate Chaining):

  • Each bucket stores a linked list (or tree) of colliding entries
  • Used by: Java HashMap, C++ std::unordered_map (default)
  • Pros: Simple, handles high load factors well, no clustering
  • Cons: Extra memory for pointers, cache misses from linked lists
# Conceptual representation of separate chaining
buckets = [
    [("key1", value1), ("key2", value2)],  # Bucket 0: collision chain
    [("key3", value3)],                    # Bucket 1: single entry
    [],                                    # Bucket 2: empty
]

Open Addressing:

  • When collision occurs, probe next available slot
  • Used by: Python dict (CPython)
  • Probing methods:
    • Linear probing: (hash(key) + i) % size for i = 0, 1, 2, ...
    • Quadratic probing: (hash(key) + i^2) % size
    • Double hashing: (hash1(key) + i * hash2(key)) % size
  • Pros: Better cache locality, no extra pointers
  • Cons: Performance degrades with high load factor, clustering issues
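
To make linear probing concrete, here is a toy open-addressing table — a teaching sketch only (it omits resizing and deletion tombstones, which real implementations must handle):

```python
class LinearProbingTable:
    """Toy open-addressing hash table using linear probing."""

    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size  # each slot: None or (key, value)

    def _probe(self, key):
        # Visit slots in linear-probe order: h, h+1, h+2, ... (mod size)
        start = hash(key) % self.size
        for i in range(self.size):
            yield (start + i) % self.size

    def put(self, key, value):
        for idx in self._probe(key):
            if self.slots[idx] is None or self.slots[idx][0] == key:
                self.slots[idx] = (key, value)
                return
        raise RuntimeError("table full")

    def get(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is None:
                raise KeyError(key)
            if self.slots[idx][0] == key:
                return self.slots[idx][1]
        raise KeyError(key)
```

Colliding keys simply walk forward to the next free slot, which is why performance degrades as the table fills up.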

Comparison:

Aspect | Chaining | Open Addressing
Load factor tolerance | High (0.8-1.0) | Lower (0.5-0.7)
Memory overhead | Higher (pointers) | Lower (no pointers)
Cache performance | Worse (scattered) | Better (contiguous)
Deletion complexity | Simple | Complex (tombstones)
Worst-case guarantee | O(n) per operation | O(n) per operation

When to use:

  • Chaining: High load factors expected, many deletions, simple implementation
  • Open Addressing: Memory constrained, cache performance critical, few deletions

Interview note: You rarely implement collision handling yourself. Understanding the trade-offs helps explain why hash table performance can degrade and when to consider alternatives.


Q3: When should I use dict vs set in Python?

Answer: Use dict when you need key-value pairs, and set when you only need membership testing.

dict (dictionary):

  • Stores key-value mappings: {key: value}
  • Use cases: Counting frequencies, storing metadata, mapping relationships
# Counting frequencies
freq = {}
for num in [1, 2, 2, 3, 3, 3]:
    freq[num] = freq.get(num, 0) + 1
# Result: {1: 1, 2: 2, 3: 3}

# Storing indices
index_map = {}
for i, val in enumerate([10, 20, 30]):
    index_map[val] = i
# Result: {10: 0, 20: 1, 30: 2}

set (hash set):

  • Stores only unique keys: {key1, key2, key3}
  • Use cases: Deduplication, membership testing, set operations
# Deduplication
unique_nums = set([1, 2, 2, 3, 3, 3])
# Result: {1, 2, 3}

# Fast membership testing
seen = set()
for num in nums:
    if num in seen:  # O(1) average
        print("Duplicate!")
    seen.add(num)

# Set operations
set1 = {1, 2, 3}
set2 = {2, 3, 4}
intersection = set1 & set2  # {2, 3}
union = set1 | set2  # {1, 2, 3, 4}

Decision tree:

Need to store values with keys?
├─ Yes → Use dict
│   └─ Examples: frequency counting, index mapping, caching
└─ No → Use set
    └─ Examples: deduplication, membership testing, set operations

Performance comparison:

Operation | dict | set | Notes
Insertion | O(1) average | O(1) average | Similar performance
Lookup | O(1) average | O(1) average | Similar performance
Memory | Higher (stores values) | Lower (keys only) | set saves ~30-40% memory
Use case | Key-value mapping | Membership testing | Different purposes

Common mistake: Using dict when you only need set:

# ❌ Inefficient: storing dummy values
seen = {}
for num in nums:
    if num not in seen:
        seen[num] = True  # Wasteful!

# ✅ Efficient: use set
seen = set()
for num in nums:
    if num not in seen:
        seen.add(num)

Q4: What makes a good hash function, and what are common pitfalls?

Answer: A good hash function should be deterministic, uniform, and fast. Poor hash functions cause collisions and degrade performance.

Properties of a good hash function:

  1. Deterministic: Same input → same output (always)
  2. Uniform distribution: Keys should map evenly across buckets
  3. Fast computation: Should be O(1) or O(k), where k is the key size
  4. Avalanche effect: Small input changes → large hash changes

Example: String hashing:

# ❌ Bad: Simple sum (many collisions)
def bad_hash(s):
    return sum(ord(c) for c in s) % 1000
# Problem: "abc" and "cba" hash to the same value!

# ✅ Good: Polynomial rolling hash
def good_hash(s):
    hash_val = 0
    prime = 31
    for c in s:
        hash_val = (hash_val * prime + ord(c)) % (2**31)
    return hash_val
# Better distribution, fewer collisions

Common pitfalls:

Pitfall 1: Non-uniform distribution

# ❌ Bad: All keys hash to the same bucket
def terrible_hash(key):
    return 0  # Everything collides!

# ✅ Good: Use built-in hash() or a proper algorithm
hash_val = hash(key)

Pitfall 2: Ignoring key characteristics

# For integers: identity hash is fine
hash(42) == 42  # Works well

# For strings: need to consider all characters
# Python's hash() uses SipHash for strings (collision-resistant)

Pitfall 3: Hash function that's too slow

# ❌ Bad: Expensive computation per hash
def slow_hash(s):
    import hashlib
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)
# Too slow for frequent lookups!

# ✅ Good: Fast polynomial hash
def fast_hash(s):
    h = 0
    for c in s:
        h = h * 31 + ord(c)
    return h

Hash function quality metrics:

Metric | Good | Bad
Collision rate | Low (< 5% for random keys) | High (> 20%)
Distribution | Uniform across buckets | Clustered
Speed | O(k), where k is key size | Slower than O(k)
Avalanche | Small input change → large hash change | Small change → small hash change

Interview tip: You rarely implement hash functions. Python's hash() handles most cases. For custom objects, override __hash__() and __eq__():

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        return hash((self.x, self.y))  # Use tuple hash

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

# Now Point can be used as a dict key
points = {Point(1, 2): "A", Point(3, 4): "B"}

Q5: What is the memory overhead of hash tables compared to arrays?

Answer: Hash tables have significant memory overhead due to:

  1. Hash table structure (buckets, metadata)
  2. Collision handling (chaining pointers or extra slots)
  3. Load factor (typically 50-70% full to maintain performance)
  4. Hash function storage (for some implementations)

Memory breakdown:

Array:

  • Stores only the data: O(n) bytes, no per-entry metadata
  • Example: [1, 2, 3, 4, 5] → 5 integers = 20 bytes (assuming 4-byte ints)

Hash table (Python dict):

  • Overhead per entry: ~24-48 bytes (key, value, hash, pointers)
  • Load factor: ~50-66% (table is 1.5-2x larger than entries)
  • Example: {1: "a", 2: "b", 3: "c"} → ~200-300 bytes (vs 12 bytes for array)

Comparison table:

Data Structure | Memory for n integers | Overhead Factor
Array | ~4n bytes | 1x (baseline)
Hash Set | ~24n to ~48n bytes | 6-12x
Hash Map | ~32n to ~64n bytes | 8-16x

Example calculation:

import sys

# Array: minimal overhead
arr = [i for i in range(1000)]
print(sys.getsizeof(arr))  # ~8040 bytes (1000 ints + small overhead)

# Set: significant overhead
s = set(range(1000))
print(sys.getsizeof(s))  # ~32968 bytes (~4x more!)

# Dict: even more overhead
d = {i: i for i in range(1000)}
print(sys.getsizeof(d))  # ~36968 bytes (~4.5x more!)

When memory matters:

Use arrays when:

  • Keys are consecutive integers [0..n-1]
  • Memory is extremely constrained
  • You need predictable memory usage

Use hash tables when:

  • Keys are sparse or non-integer
  • Memory overhead is acceptable
  • Fast lookups are critical

Space optimization techniques:

  1. Use set instead of dict when possible:

    # ❌ Wasteful
    seen = {num: True for num in nums}

    # ✅ Efficient
    seen = set(nums)

  2. Pre-allocate size if known:

    # Python dicts resize automatically; creating all keys
    # up front avoids incremental resizing
    d = dict.fromkeys(range(1000))  # 1000 keys, values None

  3. Consider array-based solutions for dense integer keys:

    # For keys [0..999], an array is better
    arr = [0] * 1000                 # ~4000 bytes of int storage
    d = {i: 0 for i in range(1000)}  # ~37000 bytes

Interview tip: Always mention the space-time trade-off. Hash tables trade memory for speed. If memory is constrained, consider sorted arrays with binary search (O(log n) lookup) or other alternatives.


Q6: What's the difference between OrderedDict and regular dict in Python?

Answer: OrderedDict (Python 3.7+) maintains insertion order, while older dict implementations didn't guarantee order. In Python 3.7+, regular dict also maintains insertion order, but OrderedDict offers additional features.

Historical context:

  • Python < 3.7: dict had arbitrary order (implementation detail)
  • Python 3.7+: dict maintains insertion order (language guarantee)
  • OrderedDict: Always maintained order, even in older Python versions

Current behavior (Python 3.7+):

# Regular dict: maintains insertion order
d = {}
d['first'] = 1
d['second'] = 2
d['third'] = 3
print(list(d.keys()))  # ['first', 'second', 'third']

# OrderedDict: also maintains insertion order
from collections import OrderedDict
od = OrderedDict()
od['first'] = 1
od['second'] = 2
od['third'] = 3
print(list(od.keys()))  # ['first', 'second', 'third']

When to use OrderedDict:

Use OrderedDict when you need:

  1. Reordering operations:

    from collections import OrderedDict

    od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])

    # Move to end
    od.move_to_end('a')
    print(list(od.keys()))  # ['b', 'c', 'a']

    # Move to beginning
    od.move_to_end('a', last=False)
    print(list(od.keys()))  # ['a', 'b', 'c']

  2. Pop from specific end:

    # Pop last item (LIFO)
    last = od.popitem(last=True)  # ('c', 3)

    # Pop first item (FIFO)
    first = od.popitem(last=False)  # ('a', 1)

  3. Equality based on order:

    # OrderedDict considers order in equality
    od1 = OrderedDict([('a', 1), ('b', 2)])
    od2 = OrderedDict([('b', 2), ('a', 1)])
    print(od1 == od2)  # False (different order)

    # Regular dict doesn't care about order (Python 3.7+)
    d1 = {'a': 1, 'b': 2}
    d2 = {'b': 2, 'a': 1}
    print(d1 == d2)  # True (same items)

Performance comparison:

Operation | dict | OrderedDict | Notes
Insertion | O(1) average | O(1) average | Similar
Lookup | O(1) average | O(1) average | Similar
Memory | Lower | Higher (~20% more) | OrderedDict stores extra pointers
Reordering | Not supported | O(1) move_to_end() | Only in OrderedDict

Common use cases:

  1. LRU Cache (Least Recently Used):

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.cache = OrderedDict()
            self.capacity = capacity

        def get(self, key):
            if key not in self.cache:
                return -1
            # Move to end (most recently used)
            self.cache.move_to_end(key)
            return self.cache[key]

        def put(self, key, value):
            if key in self.cache:
                self.cache.move_to_end(key)
            self.cache[key] = value
            if len(self.cache) > self.capacity:
                # Remove least recently used (first item)
                self.cache.popitem(last=False)

  2. Maintaining insertion order (though dict does this now):

    # Both work, but OrderedDict is more explicit
    def process_items(items):
        result = OrderedDict()  # Clear intent: order matters
        for item in items:
            result[item.id] = item.process()
        return result

Interview tip: In Python 3.7+, regular dict is usually sufficient. Use OrderedDict only when you need move_to_end() or popitem(last=False), or when you need to support older Python versions with guaranteed ordering.


Q7: How do hash tables behave in multi-threaded environments?

Answer: Most hash table implementations are not thread-safe by default. Concurrent modifications can cause data corruption, infinite loops, or crashes. Use synchronization primitives or thread-safe alternatives.

The problem:

Race condition example:

import threading

# Shared hash table
counter = {}

def increment(key):
    if key not in counter:
        counter[key] = 0
    counter[key] += 1  # Not atomic!

# Multiple threads modifying simultaneously
threads = []
for i in range(10):
    t = threading.Thread(target=increment, args=('count',))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

print(counter['count'])  # Might not be 10! Could be 5, 7, etc.

What can go wrong:

  1. Lost updates: Two threads read, both increment, both write → one update lost
  2. Inconsistent state: Hash table resizing during concurrent access → corruption
  3. Infinite loops: Iteration while another thread modifies → undefined behavior

Solutions:

Option 1: Locks (synchronization):

import threading

counter = {}
lock = threading.Lock()

def increment(key):
    with lock:  # Acquire lock
        if key not in counter:
            counter[key] = 0
        counter[key] += 1
    # Lock released automatically

Option 2: Thread-safe data structures:

Python: Use queue.Queue or external libraries:

# For simple cases, use a queue
from queue import Queue
q = Queue()  # Thread-safe

# For dict-like structures, wrap access in a lock
from threading import Lock

class ThreadSafeDict:
    def __init__(self):
        self._dict = {}
        self._lock = Lock()

    def __getitem__(self, key):
        with self._lock:
            return self._dict[key]

    def __setitem__(self, key, value):
        with self._lock:
            self._dict[key] = value

Java: Use ConcurrentHashMap:

import java.util.concurrent.ConcurrentHashMap;

ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
// Thread-safe operations
map.put("key", 1);
map.get("key");

C++: Use std::shared_mutex or external libraries:

#include <shared_mutex>
#include <unordered_map>

std::unordered_map<std::string, int> map;
std::shared_mutex mtx;

// Read lock (multiple readers allowed)
{
    std::shared_lock<std::shared_mutex> lock(mtx);
    int value = map.at("key");  // at() doesn't insert; operator[] would
}

// Write lock (exclusive)
{
    std::unique_lock<std::shared_mutex> lock(mtx);
    map["key"] = value;
}

Performance trade-offs:

Approach | Thread Safety | Performance | Complexity
No synchronization | ❌ Unsafe | ⚡ Fastest | ✅ Simple
Coarse-grained lock | ✅ Safe | 🐌 Slow (serialized) | ✅ Simple
Fine-grained locks | ✅ Safe | ⚡ Faster (parallel reads) | ❌ Complex
Lock-free structures | ✅ Safe | ⚡⚡ Fastest | ❌❌ Very complex

Best practices:

  1. Avoid shared mutable state when possible:

    # ✅ Good: Each thread has its own dict
    def process_chunk(chunk):
        local_dict = {}  # Thread-local
        for item in chunk:
            local_dict[item] = process(item)
        return local_dict

    # ❌ Bad: Shared dict
    shared_dict = {}
    def process_chunk(chunk):
        for item in chunk:
            shared_dict[item] = process(item)  # Race condition!

  2. Use immutable data structures:

    # Instead of modifying a shared dict, return a new dict
    def merge_results(results):
        # Combine thread results (no concurrent modification)
        final = {}
        for result in results:
            final.update(result)
        return final

  3. Read-only access is usually safe (but verify implementation):

    # Multiple threads reading is typically safe
    def lookup(key):
        return shared_dict.get(key)  # Read-only, usually safe

Interview tip: Mention that hash tables are not thread-safe by default. If asked about concurrent access, discuss locks, thread-safe alternatives, or architectural solutions (avoiding shared state).


Q8: What are common interview follow-up questions after solving a hash table problem?

Answer: Interviewers often probe deeper into your understanding. Here are common follow-ups and how to handle them:

Follow-up 1: "Can you optimize the space complexity?"

Example: After solving Two Sum with O(n) space:

# Original: O(n) space
def twoSum(nums, target):
    seen = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i

Optimization: If the array is sorted, use two pointers (O(1) space):

# Optimized: O(1) space (but requires sorted array)
def twoSum_sorted(nums, target):
    left, right = 0, len(nums) - 1
    while left < right:
        total = nums[left] + nums[right]
        if total == target:
            return [left, right]
        elif total < target:
            left += 1
        else:
            right -= 1

Follow-up 2: "What if we need to return all pairs, not just one?"

Example: Return all pairs that sum to target:

def twoSum_all_pairs(nums, target):
    seen = {}
    result = []
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            # Add all previous indices
            for prev_idx in seen[complement]:
                result.append([prev_idx, i])
        # Store current index
        if num not in seen:
            seen[num] = []
        seen[num].append(i)
    return result

Follow-up 3: "What if the array can have duplicates?"

Answer: The hash table solution handles duplicates correctly (as shown in the Two Sum problem). Explain that you check before inserting to avoid self-pairing.
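
A quick trace shows why checking before inserting handles duplicates such as nums = [3, 3], target = 6:

```python
def twoSum(nums, target):
    seen = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:   # check BEFORE inserting nums[i]
            return [seen[complement], i]
        seen[num] = i            # so an element never pairs with itself
    return []

# [3, 3], target 6: at i=0 nothing is seen yet; at i=1,
# complement 3 is already in seen (index 0) -> returns [0, 1]
```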

Follow-up 4: "How would you handle very large datasets that don't fit in memory?"

Answer: Use external hashing or streaming algorithms:

# Streaming approach: process in chunks
def twoSum_streaming(stream, target):
    seen = {}
    chunk_size = 10000
    current_index = 0

    for chunk in read_in_chunks(stream, chunk_size):
        for num in chunk:
            complement = target - num
            if complement in seen:
                return [seen[complement], current_index]
            seen[num] = current_index
            current_index += 1
        # Optionally: evict old entries if memory constrained

Follow-up 5: "What's the worst-case time complexity?"

Answer: Hash table operations are O(1) average case, but O(n) worst case due to:

  • All keys hashing to the same bucket (adversarial input)
  • Hash table resizing (O(n) for that insert, O(1) amortized)

Follow-up 6: "Can you solve this without extra space?"

Answer: Sometimes yes (if array is sorted or has constraints):

  • Sorted array: Two pointers (O(1) space)
  • Array with constraints: Use the array itself as a hash table (First Missing Positive pattern)
  • Otherwise: Usually needs O(n) space for O(n) time

Follow-up 7: "How would you test this solution?"

Answer: Cover these cases:

test_cases = [
    ([2, 7, 11, 15], 9, [0, 1]),         # Normal case
    ([3, 3], 6, [0, 1]),                 # Duplicates
    ([3, 2, 4], 6, [1, 2]),              # Not at start
    ([-1, -2, -3, -4, -5], -8, [2, 4]),  # Negative numbers
    ([1, 2], 3, [0, 1]),                 # Edge: two elements
]

Follow-up 8: "What if we need to find three numbers that sum to target?"

Answer: Extend the pattern:

def threeSum(nums, target):
    nums.sort()  # O(n log n)
    result = []

    for i in range(len(nums) - 2):
        if i > 0 and nums[i] == nums[i-1]:
            continue  # Skip duplicates

        left, right = i + 1, len(nums) - 1
        while left < right:
            total = nums[i] + nums[left] + nums[right]
            if total == target:
                result.append([nums[i], nums[left], nums[right]])
                # Skip duplicates
                while left < right and nums[left] == nums[left+1]:
                    left += 1
                while left < right and nums[right] == nums[right-1]:
                    right -= 1
                left += 1
                right -= 1
            elif total < target:
                left += 1
            else:
                right -= 1

    return result

Interview strategy:

  1. Acknowledge the trade-off: "The hash table solution uses O(n) space for O(n) time. If space is constrained..."
  2. Show alternatives: "We could use sorting + two pointers for O(1) space but O(n log n) time."
  3. Ask clarifying questions: "Are we optimizing for time or space? Is the array sorted?"


Q9: What are edge cases in time complexity analysis for hash tables?

Answer: Hash table operations are O(1) amortized, but several factors can affect actual performance:

Edge case 1: Hash collisions (worst-case)

Scenario: All keys hash to the same bucket:

# Adversarial input: all keys collide
class BadHash:
    def __hash__(self):
        return 42  # Always the same hash!

bad_keys = [BadHash() for _ in range(1000)]
d = {key: i for i, key in enumerate(bad_keys)}

# Lookup becomes O(n) - linear search through the chain
value = d[bad_keys[500]]  # O(n) worst case!

Mitigation: Good hash function + load factor management

Edge case 2: Hash table resizing (amortized analysis)

Scenario: When load factor exceeds threshold, table doubles in size:

# Inserting n elements
d = {}
for i in range(1000000):
    d[i] = i  # Triggers resizing at ~66k, ~133k, ~266k, etc.

Cost analysis:

  • Individual insert: usually O(1), but O(n) when a resize is triggered
  • Amortized cost: O(1) per insert (resizing happens rarely)

Proof sketch: if the table doubles whenever it fills, then over n inserts:

  • Resizes happen at sizes 1, 2, 4, 8, ..., n
  • A resize at size k costs O(k) (rehash all entries)
  • Total resize cost: 1 + 2 + 4 + ... + n < 2n = O(n)
  • Adding the n O(1) insertions gives O(n) total, i.e. O(1) amortized per insert

Edge case 3: Iteration complexity

Scenario: Iterating through hash table:

d = {i: i*2 for i in range(1000)}

# Iteration: O(n) time
for key in d:  # O(n)
    print(key, d[key])

Complexity: O(n) time; O(n) extra space only if you materialize the keys/items into a list

Edge case 4: String keys (hash computation cost)

Scenario: Long strings as keys:

# Hash computation: O(k) where k is string length
long_strings = ["a" * 10000 for _ in range(1000)]
d = {s: i for i, s in enumerate(long_strings)}

# Lookup: O(1) + O(k) = O(k) where k is key length
value = d[long_strings[500]]  # O(10000) to compute hash!

Complexity: O(k), where k is the key length (not O(1) if keys are large)

Edge case 5: Custom hash functions

Scenario: Expensive hash computation:

import hashlib

def expensive_hash(obj):
    # Cryptographic hash: slow!
    return int(hashlib.sha256(str(obj).encode()).hexdigest(), 16)

class ExpensiveKey:
    def __hash__(self):
        return expensive_hash(self)

# Every lookup recomputes the expensive hash
keys = [ExpensiveKey() for _ in range(1000)]
d = {k: i for i, k in enumerate(keys)}
value = d[keys[500]]  # Slow due to hash computation

Mitigation: Cache hash values or use faster hash functions

Complexity summary table:

Operation | Average | Worst Case | Amortized | Notes
Insert | O(1) | O(n) | O(1) | Worst: all collisions
Lookup | O(1) | O(n) | N/A | Worst: all collisions
Delete | O(1) | O(n) | N/A | Worst: all collisions
Iteration | O(n) | O(n) | N/A | Always linear
Resize | O(n) | O(n) | O(1) per insert | Amortized analysis

Interview tip: Always mention "average case" and acknowledge worst-case. Explain that worst-case is rare with good hash functions and proper load factor management.


Q10: What are space optimization techniques for hash table problems?

Answer: Several techniques can reduce memory usage while maintaining performance:

Technique 1: Use array as hash table (when keys are dense integers)

Scenario: Keys are integers in a known range [0..n-1]:

# ❌ Wasteful: O(n) extra space for a hash table
def find_duplicate_dict(nums):
    seen = {}
    for num in nums:
        if num in seen:
            return num
        seen[num] = True

# ✅ Efficient: Use an array as the hash table
def find_duplicate_array(nums):
    # Array indices act as hash keys
    seen = [False] * (len(nums) + 1)
    for num in nums:
        if seen[num]:
            return num
        seen[num] = True

Space savings: Both use O(n) space, but the array has a much lower constant factor (~4x less memory)

Technique 2: Bit manipulation for boolean flags

Scenario: Tracking presence/absence (boolean values):

# ❌ Wasteful: dict stores full integers
def has_duplicate_dict(nums):
    seen = {}
    for num in nums:
        if num in seen:
            return True
        seen[num] = True  # Stores key + value
    return False

# ✅ Efficient: set stores only keys
def has_duplicate_set(nums):
    seen = set()
    for num in nums:
        if num in seen:
            return True
        seen.add(num)
    return False

# ✅✅ Most efficient: Bit vector (if range is small)
def has_duplicate_bitvector(nums):
    # Assuming nums are in range [1..32]
    bits = 0
    for num in nums:
        bit = 1 << num
        if bits & bit:
            return True
        bits |= bit
    return False

Space comparison:

  • dict: ~48 bytes per entry
  • set: ~24 bytes per entry
  • bit vector: 1 bit per entry (32 entries = 4 bytes!)

Technique 3: In-place modification (use input array as hash table)

Scenario: First Missing Positive (LeetCode 41):

# Problem: Find smallest missing positive integer
# Constraint: O(1) extra space

def firstMissingPositive(nums):
    n = len(nums)

    # Step 1: Replace negatives/zeros with n+1
    for i in range(n):
        if nums[i] <= 0:
            nums[i] = n + 1

    # Step 2: Use array indices as hash keys
    # Mark presence by making the value negative
    for i in range(n):
        num = abs(nums[i])
        if num <= n:
            nums[num - 1] = -abs(nums[num - 1])

    # Step 3: Find first positive index
    for i in range(n):
        if nums[i] > 0:
            return i + 1

    return n + 1

Key insight: Use array indices [0..n-1] to represent numbers [1..n]. Mark presence by negating values.

Technique 4: Sliding window with single pass

Scenario: Longest Substring Without Repeating Characters:

# ❌ Wasteful: Store all seen characters per starting index
def lengthOfLongestSubstring_bad(s):
    max_len = 0
    for i in range(len(s)):
        seen = set()  # New set for each starting position
        for j in range(i, len(s)):
            if s[j] in seen:
                break
            seen.add(s[j])
        max_len = max(max_len, len(seen))
    return max_len

# ✅ Efficient: Sliding window, reuse hash table
def lengthOfLongestSubstring_good(s):
    char_index = {}  # Reused across windows
    left = 0
    max_len = 0

    for right in range(len(s)):
        if s[right] in char_index:
            # Move left pointer past last occurrence
            left = max(left, char_index[s[right]] + 1)
        char_index[s[right]] = right
        max_len = max(max_len, right - left + 1)

    return max_len

Space savings: one hash table of size O(min(n, alphabet)) reused across the window, vs a fresh set per starting index

Technique 5: Frequency counting with array instead of dict

Scenario: Counting characters (limited alphabet):

# ❌ General but wasteful for a small alphabet
from collections import Counter
freq = Counter(s)  # Dict overhead

# ✅ Efficient for lowercase English (26 letters)
freq = [0] * 26
for c in s:
    freq[ord(c) - ord('a')] += 1

Space comparison:

  • Counter (dict): ~26 × 48 bytes = ~1248 bytes
  • Array: 26 × 4 bytes = 104 bytes (~12x less!)

Technique 6: Two-pass instead of storing all data

Scenario: Finding first unique character:

# ❌ Store all indices
def firstUniqChar_bad(s):
    char_indices = {}
    for i, c in enumerate(s):
        if c not in char_indices:
            char_indices[c] = []
        char_indices[c].append(i)

    for c, indices in char_indices.items():
        if len(indices) == 1:
            return indices[0]
    return -1

# ✅ Two passes: count first, then find
def firstUniqChar_good(s):
    freq = {}
    for c in s:
        freq[c] = freq.get(c, 0) + 1

    for i, c in enumerate(s):
        if freq[c] == 1:
            return i
    return -1

Space savings: Store counts (integers) instead of lists of indices

Summary table:

Technique | When to Use | Space Savings | Trade-off
Array as hash | Dense integer keys [0..n-1] | 4-8x less | Requires known range
Bit vector | Boolean flags, small range | 8-32x less | Limited to small ranges
In-place mod | Can modify input array | O(1) extra space | Destroys input
Sliding window | Substring/subarray problems | Reuse vs recreate | More complex logic
Array vs dict | Small, fixed alphabet | 10-20x less | Less flexible
Two-pass | Can avoid storing indices | Counts vs index lists | Extra pass

Interview tip: Always consider whether you can use the input array itself as a hash table, especially when the problem asks for O(1) extra space. This is a common pattern in "hard" LeetCode problems.


🎓 Summary: Hash Table Patterns Cheat Sheet

Quick Reference

When to use hash tables:

  • ✅ Fast lookups needed (O(1) average)
  • ✅ Finding pairs/complements
  • ✅ Grouping by key
  • ✅ Frequency counting
  • ✅ Membership testing
  • ✅ Sparse or non-integer keys

When NOT to use hash tables:

  • ❌ Dense consecutive integer keys [0..n-1] → Use array
  • ❌ Memory extremely constrained → Consider sorted array + binary search
  • ❌ Need sorted order → Use sorted data structure
  • ❌ Thread-safe required → Use synchronization or thread-safe alternatives
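
For the memory-constrained case, the standard fallback is a sorted array plus binary search via the stdlib bisect module — O(log n) lookups with essentially no per-entry overhead (a minimal sketch):

```python
import bisect

def contains(sorted_list, x):
    """O(log n) membership test on a sorted list."""
    i = bisect.bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

data = sorted([42, 7, 19, 3, 88])  # sort once: O(n log n)
```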

Pattern Library

Pattern | Key Insight | Example Problem
Complement Search | Store seen values, check for target - current | Two Sum, Three Sum
Canonical Form | Group by sorted/normalized key | Group Anagrams
Frequency Counting | Count occurrences with Counter | Top K Frequent
Sliding Window | Track state in a moving window | Longest Substring
Prefix Sum | Store prefix sums, find subarray sums | Subarray Sum Equals K
Sequence Detection | Check num-1 not in set to find starts | Longest Consecutive
Array as Hash | Use indices as keys for dense integers | First Missing Positive

Complexity Cheat Sheet

Operation | Average | Worst | Space
Insert | O(1) | O(n) | O(n) total
Lookup | O(1) | O(n) | —
Delete | O(1) | O(n) | —
Iterate | O(n) | O(n) | —

Python Quick Reference

# Basic operations
d = {}                       # Create dict
d[key] = value               # Insert/update
value = d[key]               # Lookup (KeyError if missing)
value = d.get(key, default)  # Safe lookup
del d[key]                   # Delete
key in d                     # Membership test

# Useful imports
from collections import Counter, defaultdict, OrderedDict
freq = Counter(items)        # Frequency counting
groups = defaultdict(list)   # Group by key
od = OrderedDict()           # Maintain order

# Common patterns
seen = set()                 # Membership only
seen.add(item)               # Add to set
if item in seen: ...         # Check membership

# Key design
key = tuple(sorted(items))   # Immutable, hashable
key = ''.join(sorted(s))     # String canonical form
key = tuple(count_array)     # Array → tuple for hashing

Common Pitfalls Checklist

  • Check before inserting (avoid self-pairs)
  • Use immutable keys (tuples, not lists)
  • Handle duplicates correctly
  • Don't forget edge cases (empty, single element)

Interview Template

  1. Identify: "This needs fast lookup → hash table"

  2. Design key: "I'll use [X] as the key because..."

  3. Design value: "I'll store [Y] to track..."

  4. Walk through: Trace example with small input

  5. Code: Implement with dict/set

  6. Analyze: State the time and space complexity

  7. Optimize: "If space constrained, we could..."

  8. Test: Edge cases (empty, duplicates, etc.)

Memory Optimization Quick Wins

  1. Use set instead of dict when values aren't needed → ~50% memory savings
  2. Use array instead of dict for dense integer keys → ~75% memory savings
  3. Use bit vector for boolean flags (small range) → ~95% memory savings
  4. Reuse hash table in sliding window → Avoid recreation overhead
  5. Two-pass approach → Store counts instead of indices when possible

Further Reading

  • Books:
    • "Cracking the Coding Interview" (Chapter 1: Hash Tables)
    • "Elements of Programming Interviews" (Chapter 13: Hash Tables)
  • Online:
    • Python dict implementation: https://github.com/python/cpython/blob/main/Objects/dictobject.c
    • Hash table visualization: https://visualgo.net/en/hashtable
  • LeetCode: Hash Table tag (200+ problems)

Hash tables aren't magic — they're strategic memory allocation. Master the patterns, and you'll unlock dozens of O(n) solutions!

  • Post title: LeetCode (1): Hash Tables - Patterns, Pitfalls & Three Classic Problems Deep Dive
  • Post author: Chen Kai
  • Create time: 2023-03-25 00:00:00
  • Post link: https://www.chenk.top/en/leetcode-hash-tables/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stating additionally.