Design a Pastebin Service
Design a service like Pastebin. Focus on data expiration, storage, and the design trade-offs for handling potentially malicious or very large text inputs.
Why Interviewers Ask This
Interviewers ask this to evaluate your ability to balance simplicity with scalability while addressing specific constraints like data expiration and input validation. They want to see if you can design a system that handles malicious inputs efficiently without over-engineering, reflecting Uber's focus on robust, cost-effective infrastructure for high-traffic services.
How to Answer This Question
1. Clarify requirements immediately: Ask about expected read/write ratios, retention policies, and maximum payload sizes to set boundaries.
2. Define the core API: Propose endpoints for creating, reading, and deleting pastes, ensuring unique ID generation is discussed.
3. Address storage strategy: Choose between object storage (S3) or key-value stores (Redis/DynamoDB), justifying the choice based on latency and cost.
4. Tackle security and limits: Detail how you will sanitize inputs to prevent XSS attacks and implement rate limiting to stop abuse.
5. Discuss expiration: Explain the mechanism for auto-deletion using TTLs or background cron jobs to manage storage costs effectively.
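Unique ID generation is worth making concrete in the interview. The following is a minimal sketch, assuming random base62 IDs with a collision check against the metadata store; the function names, the 8-character length, and the `exists` callback are illustrative assumptions, not part of any specific Pastebin implementation.

```python
import secrets
import string

# 62 URL-safe characters; 8 of them give 62**8 (about 2.18e14) possible IDs.
ALPHABET = string.ascii_letters + string.digits


def new_paste_id(length: int = 8) -> str:
    """Return a random, hard-to-guess ID drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def create_unique_id(exists, length: int = 8, max_attempts: int = 5) -> str:
    """Retry until `exists(candidate)` reports the ID is free.

    `exists` is a hypothetical callback standing in for a lookup
    in the metadata store.
    """
    for _ in range(max_attempts):
        candidate = new_paste_id(length)
        if not exists(candidate):
            return candidate
    raise RuntimeError("could not find a free paste ID")
```

Random IDs avoid the enumeration risk of sequential counters, at the cost of an existence check on each write. A counter encoded in base62 is a reasonable alternative if guessable URLs are acceptable.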
Key Points to Cover
- Proposing a hybrid storage model separating metadata from large content blobs
- Explicitly detailing input sanitization strategies to mitigate XSS risks
- Implementing time-to-live (TTL) mechanisms for automatic data expiration
- Setting clear rate limits and payload size constraints to prevent abuse
- Justifying technology choices based on read-heavy traffic patterns
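The rate-limit and payload-size points above can be sketched as follows. This is a minimal in-process token bucket plus a size check, assuming one bucket per client; `TokenBucket`, `validate_payload`, and the 5 MB cap are hypothetical names and values chosen for illustration. A production system would more likely keep the counters in a shared store.

```python
import time

MAX_PASTE_BYTES = 5 * 1024 * 1024  # assumed cap, for illustration only


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request passes."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def validate_payload(text: str, max_bytes: int = MAX_PASTE_BYTES) -> bytes:
    """Encode the paste as UTF-8 and reject it if it exceeds the size cap."""
    data = text.encode("utf-8")
    if len(data) > max_bytes:
        raise ValueError(f"paste exceeds {max_bytes} bytes")
    return data
```

Checking the encoded byte length rather than the character count matters because multi-byte characters can make a paste several times larger on disk than its length suggests.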
Sample Answer
To design a Pastebin service, I'd start by defining the scope. We need to support short-lived text snippets with unique IDs, likely serving millions of reads daily. For the API, we need POST /paste, GET /:id, and DELETE /:id.

Given the simple key-value access pattern, I'd propose a hybrid storage approach. The content itself, which can be large, should go into an object store like S3 for durability and low cost, while metadata (ID, creation time, expiry, and the object's location) sits in a fast KV store like Redis for sub-millisecond reads.

To handle potentially malicious inputs, we must enforce strict input validation at the gateway level and make sure pasted HTML can never execute in a viewer's browser, by sanitizing or escaping it, to prevent XSS. For very large inputs, we'd reject payloads exceeding a reasonable limit, say 5 MB, to protect backend resources.

Data expiration is critical for cost control. Redis TTLs can expire the metadata for temporary pastes, and a background worker deletes the corresponding expired objects from S3 so they do not linger as orphans. This architecture balances high availability for reads with secure, scalable writes, fitting the needs of a platform like Uber where reliability and speed are paramount.
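A rough sketch of the storage and expiration ideas in this answer, under several assumptions: plain dictionaries stand in for the KV store and the object store, expiry is checked lazily on read and cleaned up by a `sweep` method that a background worker would call, and HTML is escaped at render time with the standard library's `html.escape`, which is one common way to neutralize XSS and not necessarily the approach the answer has in mind. All class and method names are hypothetical.

```python
import html
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class PasteMeta:
    blob_key: str                 # where the content lives in the blob store
    expires_at: Optional[float]   # epoch seconds, or None for no expiry


class PasteStore:
    def __init__(self, clock=time.time):
        self.meta = {}    # stand-in for a KV store such as Redis
        self.blobs = {}   # stand-in for an object store such as S3
        self.clock = clock

    def create(self, paste_id: str, text: str,
               ttl_seconds: Optional[float] = None) -> None:
        blob_key = f"pastes/{paste_id}"
        self.blobs[blob_key] = text.encode("utf-8")
        expires_at = None if ttl_seconds is None else self.clock() + ttl_seconds
        self.meta[paste_id] = PasteMeta(blob_key, expires_at)

    def read(self, paste_id: str) -> Optional[str]:
        meta = self.meta.get(paste_id)
        if meta is None:
            return None
        # Lazy expiry: an expired paste is invisible even before cleanup runs.
        if meta.expires_at is not None and self.clock() >= meta.expires_at:
            return None
        return self.blobs[meta.blob_key].decode("utf-8")

    def sweep(self) -> int:
        """Delete expired metadata and blobs; returns how many were removed."""
        now = self.clock()
        expired = [pid for pid, m in self.meta.items()
                   if m.expires_at is not None and m.expires_at <= now]
        for pid in expired:
            self.blobs.pop(self.meta.pop(pid).blob_key, None)
        return len(expired)


def render_html(text: str) -> str:
    """Escape the raw paste so any markup is displayed, never executed."""
    return "<pre>" + html.escape(text) + "</pre>"
```

Combining a lazy check on read with a periodic sweep means readers never see stale pastes, while the storage cost of expired content is bounded by the sweep interval.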
Common Mistakes to Avoid
- Over-engineering the solution by adding complex features like user authentication when not requested
- Ignoring the security implications of storing raw user-generated text without sanitization
- Failing to define a concrete strategy for handling data expiration and cleanup
- Not considering the trade-offs between consistency and availability for global scaling