Design a Twitter Feed (Conceptual Data Storage)

Data Structures
Medium
Microsoft

Describe the data structure required for a user's chronological Twitter/X feed, supporting billions of posts. Focus on the fan-out-on-write model using specialized storage like Redis or Cassandra.

Why Interviewers Ask This

Interviewers ask this to evaluate your ability to design scalable data systems that handle massive write-throughput. Specifically, they test if you understand the trade-offs between fan-out-on-write and fan-out-on-read models when serving billions of users with low latency requirements.

How to Answer This Question

1. Clarify constraints immediately: assume 500 million daily active users and strict read/write latency targets typical of Microsoft's scale.
2. Define the core problem: fetching a chronological feed without scanning the entire database for every request.
3. Propose the Fan-Out-On-Write model as the primary strategy, explaining how posts are pushed to follower caches upon creation.
4. Detail the storage architecture: use Redis for hot follower timelines due to sub-millisecond access speeds, and Cassandra or HBase for persistent post storage.
5. Address edge cases like 'ghost followers' (users who follow but don't see new posts) and handle high-profile accounts with millions of followers by switching to fan-out-on-read for them only.
6. Conclude by summarizing the consistency vs. availability trade-off inherent in this distributed design.
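The write path described in steps 3 and 4 can be sketched in a few lines. This is a minimal illustration using plain Python dicts as hypothetical stand-ins for the Redis timeline cache and the Cassandra post store; the names (`POSTS`, `TIMELINES`, `FOLLOWERS`, `publish`) and the 800-entry cap are assumptions for the sketch, not a real API.

```python
import time
from collections import defaultdict

# Hypothetical in-memory stand-ins for Redis (timelines) and Cassandra (posts).
POSTS = {}                      # post_id -> post record (persistent store)
TIMELINES = defaultdict(list)   # follower_id -> newest-first list of post IDs
FOLLOWERS = defaultdict(set)    # author_id -> set of follower IDs
TIMELINE_CAP = 800              # bound memory: keep only the hottest N entries

def publish(author_id: str, post_id: str, text: str) -> None:
    """Fan-out-on-write: persist the post once, then push its ID into every
    follower's cached timeline so a feed read is a single list lookup."""
    POSTS[post_id] = {"author": author_id, "text": text, "ts": time.time()}
    for follower in FOLLOWERS[author_id]:
        timeline = TIMELINES[follower]
        timeline.insert(0, post_id)   # newest first
        del timeline[TIMELINE_CAP:]   # trim old entries past the cap

# Usage: bob and carol follow alice; her post lands in their caches at write time.
FOLLOWERS["alice"] = {"bob", "carol"}
publish("alice", "p1", "hello world")
print(TIMELINES["bob"])  # ['p1']
```

Note that only the post ID is fanned out; the full content is written exactly once, which is what keeps the write amplification tolerable.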

Key Points to Cover

  • Explicitly choosing Fan-Out-On-Write over Fan-Out-On-Read for standard users
  • Justifying Redis usage for hot data caching to meet latency SLAs
  • Identifying the celebrity account bottleneck and proposing a hybrid solution
  • Explaining how to separate metadata from full content storage
  • Acknowledging the trade-off between strong consistency and system scalability
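The fourth point above, separating references from content, shows up most clearly on the read path: the cached timeline holds only post IDs, and full records are hydrated per page from the content store. A minimal sketch, with `POSTS` and `read_feed` as hypothetical names:

```python
# Hypothetical content store and timeline; the timeline holds references only.
POSTS = {
    "p1": {"author": "alice", "text": "hello world"},
    "p2": {"author": "alice", "text": "second post"},
}
TIMELINE = ["p2", "p1"]  # newest-first list of post IDs, not full content

def read_feed(timeline: list, page: int = 0, page_size: int = 20) -> list:
    """Hydrate one page of IDs into full posts from the content store."""
    ids = timeline[page * page_size:(page + 1) * page_size]
    return [POSTS[pid] for pid in ids]

print(read_feed(TIMELINE, page=0, page_size=2))
```

Keeping timelines ID-only means an edit or delete touches one row in the content store rather than millions of cached copies.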

Sample Answer

To design a Twitter feed capable of handling billions of posts, I would prioritize the Fan-Out-On-Write model to ensure low-latency reads. When a user posts, we push that content ID into the pre-computed timeline lists of all their followers. We store these timelines in Redis because it offers the extremely fast retrieval times essential for a scrollable feed. For the actual tweet content, we would persist it in a wide-column store like Cassandra, which scales horizontally for massive write volumes.

However, pushing to everyone fails for celebrities with millions of followers. For these accounts, we switch to a hybrid approach: we either push to only a subset of followers or use fan-out-on-read, fetching recent tweets dynamically and merging them with the cached list. This prevents system overload during viral events.

We also need to handle pagination efficiently. Since Redis lists can grow large, we might split timelines into shards based on time windows, allowing us to fetch specific ranges without loading the entire history. Finally, we must ensure eventual consistency: if a user follows someone and doesn't immediately see their latest post, that is acceptable as long as the data converges quickly. This architecture balances the heavy write load of posting with the critical read performance needed for user experience.
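The hybrid read path in the answer, merging the precomputed timeline with celebrity posts pulled at read time, can be sketched as a timestamp merge. This is an illustrative sketch with made-up data; `CACHED`, `CELEB_RECENT`, and `read_feed` are hypothetical names, and both inputs are assumed to already be sorted newest-first, as a real timeline cache would keep them.

```python
import heapq

# Normal posts were pushed at write time; celebrity posts are pulled at read
# time and merged by timestamp before serving. Entries are (post_id, ts),
# and each list is already sorted newest-first.
CACHED = [("p3", 300), ("p1", 100)]
CELEB_RECENT = {"celeb": [("c2", 250), ("c1", 50)]}

def read_feed(cached: list, followed_celebs: list, limit: int = 10) -> list:
    """Merge the precomputed timeline with pull-based celebrity posts."""
    pulled = [p for c in followed_celebs for p in CELEB_RECENT.get(c, [])]
    # heapq.merge keeps the newest-first order without a full re-sort.
    merged = heapq.merge(cached, pulled, key=lambda p: p[1], reverse=True)
    return [pid for pid, _ in list(merged)[:limit]]

print(read_feed(CACHED, ["celeb"]))  # ['p3', 'c2', 'p1', 'c1']
```

Because the merge only touches the head of each list, the cost at read time stays proportional to the page size rather than to the celebrity's full history.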

Common Mistakes to Avoid

  • Suggesting a simple SQL join for every feed request, which would cause database collapse at scale
  • Ignoring the 'celebrity problem' where one user has millions of followers
  • Failing to distinguish between storing the tweet content versus just the reference ID
  • Overlooking the need for sharding strategies to distribute the write load across servers
