Find Median from Data Stream
Design a data structure that supports adding new integers and finding the median efficiently. Use two Heaps (Min-Heap and Max-Heap).
Why Interviewers Ask This
IBM evaluates candidates on their ability to optimize real-time data processing, a core competency for their cloud and cognitive systems. This specific problem tests your mastery of heap data structures, understanding of time complexity trade-offs between insertion and retrieval, and capacity to design balanced algorithms for streaming data scenarios common in enterprise environments.
How to Answer This Question
1. Clarify Requirements: Confirm input constraints, whether the stream is infinite, and if duplicates are allowed. IBM values precision before coding. 2. Define Strategy: Propose using two heaps—a max-heap for the lower half and a min-heap for the upper half—to maintain median access in O(1) time. 3. Explain Balancing Logic: Detail how you will ensure the size difference between heaps never exceeds one, ensuring the median is always at the root of one or both heaps. 4. Walk Through Operations: Step-by-step trace adding a number: push to appropriate heap, rebalance if sizes differ, then show how to calculate the median based on current heap roots. 5. Analyze Complexity: Explicitly state O(log n) for addNum and O(1) for findMedian, discussing space complexity as well.
Key Points to Cover
- Explicitly defining the invariant that heap sizes must differ by no more than one
- Demonstrating clear logic for determining which heap receives the new element
- Explaining the rebalancing mechanism to maintain O(log n) insertion time
- Correctly handling edge cases like empty streams or single elements
- Articulating the final time and space complexity analysis clearly
Sample Answer
To solve this efficiently, I would use a dual-heap approach. We maintain a max-heap for the smaller half of numbers and a min-heap for the larger half. The key invariant is that the size of these heaps differs by at most one. When adding a new integer, we first decide which heap it belongs to based on its value relative to the current max of the left heap. After pushing, we immediately check the sizes. If one heap grows too large, we move its root to the other heap to restore balance. For finding the median, if both heaps are equal size, the answer is the average of the two roots. If one is larger, the median is simply the root of that larger heap. For example, with stream [1, 3, 2], after adding 1, max-heap has [1]. Adding 3 goes to min-heap [3]. Adding 2 compares to 1, goes to min-heap making it [2,3], then we rebalance by moving 2 to max-heap. Now max-heap is [2,1] and min-heap is [3], so median is 2. This ensures O(log n) insertion and O(1) retrieval, ideal for high-throughput streams IBM often handles.
Common Mistakes to Avoid
- Using a sorted array instead of heaps, resulting in O(n) insertion time
- Failing to rebalance heaps after every insertion, breaking the median logic
- Not handling the case where total elements are even versus odd correctly
- Overlooking duplicate values when comparing against heap roots
Practice This Question with AI
Answer this question orally or via text and get instant AI-powered feedback on your response quality, structure, and delivery.