Design a Data Stream Median Finder
Design a data structure that supports adding integers from a data stream and finding the median of all elements added so far. Use two heaps (Max-Heap for the lower half, Min-Heap for the upper half).
Why Interviewers Ask This
Microsoft interviewers ask this to evaluate a candidate's ability to maintain real-time data invariants under dynamic conditions. They specifically test proficiency with heap properties, the logic of balancing two data structures, and the capacity to optimize time complexity from O(N) to O(log N) for streaming operations.
How to Answer This Question
1. Clarify requirements: Confirm if the stream is infinite, if elements are integers, and define median behavior for even/odd counts. 2. Propose the Two-Heap strategy: Explain using a Max-Heap for the lower half and a Min-Heap for the upper half to access the middle efficiently. 3. Define balancing rules: State that heaps must differ in size by at most one element to ensure O(1) median retrieval. 4. Walk through insertion logic: Describe comparing new values against current heap tops before pushing and rebalancing. 5. Analyze complexity: Explicitly state O(log N) for add and O(1) for findMedian, demonstrating awareness of performance trade-offs typical in Microsoft system design questions.
Key Points to Cover
- Explicitly defining the invariant that heap sizes differ by no more than one
- Correctly identifying the max-heap for the lower half and min-heap for the upper half
- Demonstrating O(log N) insertion and O(1) retrieval time complexities
- Handling edge cases like empty streams or single-element streams clearly
- Using clear variable names and logical flow consistent with Microsoft's emphasis on clean code
Sample Answer
To solve the Data Stream Median problem, I propose using two priority queues: a max-heap to store the smaller half of the numbers and a min-heap for the larger half. This structure allows us to access the median in constant time while maintaining logarithmic time complexity for insertions. First, I would initialize both heaps. When a new number arrives, I compare it to the top of the max-heap. If it is smaller or equal, I push it to the max-heap; otherwise, I push it to the min-heap. The critical step is rebalancing. After every insertion, I check the sizes of the heaps. If the difference exceeds one, I move the root of the larger heap to the smaller one. For example, if we have added [1, 3, 2], the max-heap holds [1] and the min-heap holds [2, 3]. Moving 2 to the max-heap balances them as [1, 2] and [3], making the median 2. This approach ensures that the median is always either the top of the larger heap or the average of both tops. In terms of complexity, each add operation involves heap push and pop operations, resulting in O(log N) time. Finding the median simply requires accessing the roots, which is O(1). This solution effectively handles the streaming nature of the data without needing to sort the entire collection repeatedly.
Common Mistakes to Avoid
- Attempting to sort the entire list on every query, leading to inefficient O(N log N) per operation
- Forgetting to rebalance the heaps after insertion, causing incorrect median calculations
- Confusing the roles of the max-heap and min-heap, swapping their definitions for lower vs upper halves
- Neglecting to handle the case where the total number of elements is even versus odd correctly
Practice This Question with AI
Answer this question orally or via text and get instant AI-powered feedback on your response quality, structure, and delivery.
Related Interview Questions
Convert Binary Tree to Doubly Linked List in Place
Hard
MicrosoftShortest Distance from All Buildings (BFS)
Hard
MetaDesign a Set with $O(1)$ `insert`, `remove`, and `check`
Easy
CiscoFind K Closest Elements (Heaps)
Medium
MetaDiscuss ACID vs. BASE properties
Easy
MicrosoftExperience with Monitoring Dashboards
Medium
Microsoft