Design a Time Series Database (TSDB)
Design a database optimized for storing and querying time-series data (e.g., sensor readings, stock prices). Discuss compression and indexing strategies.
Why Interviewers Ask This
Tesla evaluates this question to assess your ability to architect systems for high-velocity IoT data from vehicles. They specifically test your understanding of write-heavy workloads, efficient compression algorithms like Delta-of-Delta or Gorilla, and time-based indexing strategies that enable rapid aggregation while keeping storage costs under control.
How to Answer This Question
1. Clarify requirements: Define write throughput (millions of events per second across the fleet), retention policies, and query patterns like range scans or aggregations over specific time windows.
2. Propose a schema: Suggest a columnar storage format optimized for time-series, separating metadata from metrics to maximize compression ratios.
3. Detail ingestion: Describe a write-ahead log followed by a memory buffer (memtable) that flushes to immutable disk segments to handle burst traffic.
4. Explain compression: Discuss encoding techniques such as run-length encoding for constant values and bit-packing for sensor IDs to reduce Tesla's massive fleet storage costs.
5. Address querying: Outline an inverted index on tags (e.g., VIN, sensor type) combined with sorted timestamp indexes to accelerate point-in-time lookups.
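The ingestion path in step 3 (write-ahead log, in-memory buffer, immutable segments) can be sketched as follows. This is a minimal illustration, not a production design; the class names, the flush threshold, and the in-memory "segments" standing in for on-disk files are all assumptions made for the example.

```python
import bisect

class MemTable:
    """In-memory buffer of (timestamp, value) points, kept sorted for range scans."""
    def __init__(self, max_points=4):
        self.points = []
        self.max_points = max_points

    def append(self, ts, value):
        bisect.insort(self.points, (ts, value))
        return len(self.points) >= self.max_points  # signal that a flush is due

class SegmentStore:
    """Immutable segments; modeled here as frozen tuples rather than disk files."""
    def __init__(self):
        self.segments = []

    def flush(self, memtable):
        self.segments.append(tuple(memtable.points))
        memtable.points = []

wal = []  # write-ahead log: durable record of every write before it is acknowledged
mem = MemTable()
store = SegmentStore()

for ts, v in [(1, 3.7), (2, 3.8), (3, 3.8), (4, 3.9)]:
    wal.append((ts, v))    # 1. durability first
    if mem.append(ts, v):  # 2. buffer in memory
        store.flush(mem)   # 3. flush the full buffer as an immutable segment
```

Because segments are immutable once flushed, writes never contend with background compaction or reads, which is what keeps write latency low under burst traffic.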
Key Points to Cover
- Explicitly mention compression algorithms like Delta-of-Delta or Gorilla relevant to sensor data
- Propose a columnar storage architecture rather than row-based SQL tables
- Address the write-heavy nature of IoT telemetry with memtables and immutable segments
- Explain how to balance latency for real-time monitoring versus cost for long-term storage
- Design a partitioning strategy based on unique identifiers like VINs for data isolation
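To make the Delta-of-Delta point concrete, here is a simplified sketch of the idea for timestamps. It keeps the deltas-of-deltas as a plain list rather than bit-packing them as a real TSDB would; the function names and the sample timestamps are illustrative assumptions.

```python
def delta_of_delta(timestamps):
    """Encode timestamps as (first, first_delta, deltas-of-deltas).
    Regularly sampled sensors produce mostly zeros, which bit-pack well."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

def decode(first, first_delta, dods):
    """Invert the encoding by re-accumulating deltas, then timestamps."""
    deltas = [first_delta]
    for d in dods:
        deltas.append(deltas[-1] + d)
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out

# A sensor sampled every ~10 time units, with one point arriving late:
ts = [1000, 1010, 1020, 1030, 1041]
enc = delta_of_delta(ts)  # (1000, 10, [0, 0, 1])
assert decode(*enc) == ts
```

The mostly-zero deltas-of-deltas are why this encoding shines on regularly sampled telemetry: each zero can be stored in a single bit.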
Sample Answer
To design a TSDB for Tesla's fleet, I would prioritize write throughput and storage efficiency given the volume of telemetry from millions of vehicles. First, I'd define the schema using a column-oriented store where each column represents a metric like battery voltage or motor RPM. This allows us to apply highly effective compression algorithms independently per column.
For ingestion, data would flow into a high-speed in-memory structure before being flushed to disk as immutable SSTables. This ensures low-latency writes even during peak data bursts. Crucially, I would implement specialized compression: using Delta-of-Delta encoding for timestamps since they are sequential, and Gorilla XOR compression for floating-point sensor readings, which typically exhibit small changes between samples. Together these can reduce storage needs by roughly an order of magnitude compared to storing raw, uncompressed samples.
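The core observation behind Gorilla's float compression can be shown with a few lines of Python. This sketch only computes the XOR stream; the real format additionally encodes the leading/trailing zero counts of each XOR into a handful of bits, which is omitted here for brevity.

```python
import struct

def float_bits(x):
    """Reinterpret a Python float as its 64-bit IEEE 754 bit pattern."""
    return struct.unpack('>Q', struct.pack('>d', x))[0]

def xor_stream(values):
    """XOR each 64-bit float pattern with its predecessor.
    Slowly changing sensor readings yield XORs with long runs of zero
    bits, which the full Gorilla encoding stores in very few bits."""
    prev = float_bits(values[0])
    out = [prev]
    for v in values[1:]:
        bits = float_bits(v)
        out.append(prev ^ bits)
        prev = bits
    return out

# A battery-voltage-like series that changes slowly between samples:
xors = xor_stream([3.70, 3.70, 3.71])
assert xors[1] == 0  # identical consecutive readings XOR to exactly zero
```

An unchanged reading compresses to a single bit in the full scheme, which is why near-constant sensor channels become almost free to store.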
Regarding indexing, a global partition key based on Vehicle ID (VIN) is essential for isolation. Within partitions, we maintain a sorted index on timestamps. For queries requiring filtering across multiple cars, I'd layer a secondary inverted index on tags like 'model' or 'region'. This hybrid approach supports both fast single-vehicle diagnostics and broad fleet-level analytics required for OTA updates and safety monitoring.
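The hybrid indexing scheme above can be sketched with an in-memory model: an inverted index maps tag pairs to series IDs, and each series keeps its points sorted by timestamp so range scans are binary searches. The series IDs, tag names, and data are invented for the example.

```python
from collections import defaultdict
import bisect

# Each series: tags for filtering, plus timestamp-sorted (ts, value) points.
series = {
    "vin1": {"tags": {"model": "S", "region": "EU"}, "points": [(1, 3.7), (5, 3.8)]},
    "vin2": {"tags": {"model": "3", "region": "EU"}, "points": [(2, 3.6), (6, 3.9)]},
}

# Inverted index: (tag_key, tag_value) -> set of matching series IDs.
inverted = defaultdict(set)
for sid, s in series.items():
    for k, v in s["tags"].items():
        inverted[(k, v)].add(sid)

def query(tag_key, tag_value, t_start, t_end):
    """Resolve the tag filter via the inverted index, then binary-search
    each matching series' sorted points for the time range."""
    results = {}
    for sid in inverted[(tag_key, tag_value)]:
        pts = series[sid]["points"]
        lo = bisect.bisect_left(pts, (t_start,))
        hi = bisect.bisect_right(pts, (t_end, float("inf")))
        results[sid] = pts[lo:hi]
    return results

res = query("region", "EU", 1, 5)  # fleet-level filter + time-range scan
```

A single-vehicle diagnostic query skips the inverted index entirely and goes straight to that VIN's partition, which is the fast path the partitioning scheme is designed for.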
Common Mistakes to Avoid
- Focusing solely on relational database features like ACID transactions instead of write optimization
- Ignoring the massive scale of data ingestion expected from a fleet of autonomous vehicles
- Suggesting generic compression methods like ZIP instead of domain-specific time-series encodings
- Overlooking the need for automatic data expiration or tiered storage for old telemetry data