Book Summaries | System Design Interview (Part 1) - Design a Key-Value Store
February 15th, 2024
Chapter 6: Design a Key-Value Store
This summary will serve to further cement my learnings from reading the sixth chapter of System Design Interview - An Insider's Guide (Part 1), titled Design a Key-Value Store, and I hope it will provide some learnings to you as well.
Introduction: Key-Value Store Basics
- A key-value store is a non-relational database using unique identifiers as keys.
- Key-value pairs consist of unique keys and associated values.
- Operations include put(key, value) and get(key).
- Design characteristics involve small pair size, big data storage, high availability, and scalability.
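The put/get interface above can be sketched in a few lines of Python. This is a toy illustration backed by an in-memory dict, not the book's design; the class name is an assumption.

```python
class KeyValueStore:
    """Toy single-node key-value store: a thin wrapper around a dict."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Insert or overwrite the value associated with key.
        self._data[key] = value

    def get(self, key):
        # Return the value for key, or None if absent.
        return self._data.get(key)

store = KeyValueStore()
store.put("user:42", {"name": "Ada"})
assert store.get("user:42") == {"name": "Ada"}
```

A real store layers persistence, replication, and the consistency trade-offs discussed below on top of this interface.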
Operations and Design Characteristics
- When designing a key-value store, the design should support key-value operations like put and get.
- Characteristics include small pair size, big data storage, and high scalability.
- The system supports automatic scaling and tunable consistency.
- A key-value store also has an emphasis on low-latency operations.
Single Server Key-Value Store
- One key-value store design can use storage achieved through a hash table in a single server.
- Optimizations include data compression and selective in-memory storage.
- However, challenges arise with limited capacity in a single server.
- Distribution necessitates a move to a distributed key-value store.
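A hedged sketch of the two single-server optimizations mentioned, value compression and bounded in-memory storage. The class name and the hard capacity cap are illustrative assumptions; a real server would spill cold data to disk rather than refuse writes.

```python
import pickle
import zlib

class CompressedStore:
    """Toy single-server store: compresses values, caps in-memory pairs."""

    def __init__(self, max_in_memory=1000):
        self._data = {}
        self._max = max_in_memory

    def put(self, key, value):
        # A single server has finite capacity; here we simply refuse
        # new keys once the cap is hit (a real system would evict or spill).
        if len(self._data) >= self._max and key not in self._data:
            raise RuntimeError("single server capacity exceeded")
        # Data compression shrinks the stored footprint of each value.
        self._data[key] = zlib.compress(pickle.dumps(value))

    def get(self, key):
        blob = self._data.get(key)
        return None if blob is None else pickle.loads(zlib.decompress(blob))
```

Hitting the capacity error is exactly the pressure that motivates the distributed design in the next section.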
Distributed Key-Value Store
- For a more scalable design, a distributed hash table distributes key-value pairs across multiple servers.
- CAP theorem principles: trade-offs between consistency, availability, and partition tolerance.
- The decision to use a distributed key-value store vs. a single server key-value store is based on the desired system characteristics.
- Examples of distributed key-value stores include Dynamo and Cassandra.
Consistency Models
- Consistency models: strong, weak, and eventual consistency.
- Dynamo and Cassandra prefer eventual consistency.
- The chosen consistency model directly affects system performance and availability.
- Which model is preferred depends on the specific use case requirements.
- An example:
- N = The number of replicas
- W = A write quorum of size W. For a write operation to be considered successful, the write operation must be acknowledged from W replicas.
- R = A read quorum of size R. For a read operation to be considered successful, the read operation must wait for responses from at least R replicas.
- If R = 1 and W = N, the system is optimized for a fast read.
- If W = 1 and R = N, the system is optimized for a fast write.
- If W + R > N, strong consistency is guaranteed (Usually N = 3, W = R = 2).
- If W + R <= N, strong consistency is not guaranteed.
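The quorum rules above can be expressed as a small helper. The function name is an illustrative assumption; the conditions mirror the four bullets directly.

```python
def describe_quorum(n, w, r):
    """Classify an (N, W, R) configuration using the rules above."""
    notes = []
    if r == 1 and w == n:
        notes.append("optimized for fast reads")
    if w == 1 and r == n:
        notes.append("optimized for fast writes")
    # W + R > N forces read and write quorums to overlap in at least
    # one replica, so a read always sees the latest acknowledged write.
    if w + r > n:
        notes.append("strong consistency guaranteed")
    else:
        notes.append("strong consistency not guaranteed")
    return notes

# The book's typical configuration: N = 3, W = R = 2.
assert describe_quorum(3, 2, 2) == ["strong consistency guaranteed"]
```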
Inconsistency Resolution: Versioning
- In a distributed key-value store, replication introduces inconsistencies, which can be resolved by versioning.
- Vector clocks assist in conflict resolution among replicas.
- Vector clocks track server and version pairs for conflict detection.
- Some of the complexities and potential downsides associated with vector clocks are listed below:
- Client Complexity: Implementation of conflict resolution logic using vector clocks adds complexity to client-side applications.
- Rapid Growth: The [server: version] pairs in vector clocks can grow rapidly, leading to inefficiencies in storage and reconciliation.
- Threshold Limitation: Setting a threshold on vector clock length removes older pairs once it is exceeded, which can make descendant relationships impossible to determine accurately and weaken conflict resolution.
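A minimal vector clock sketch, assuming the usual {server: version} representation; the function names are illustrative, not the book's code.

```python
def increment(clock, server):
    # Return a new clock with this server's version bumped by one.
    clock = dict(clock)
    clock[server] = clock.get(server, 0) + 1
    return clock

def descends(a, b):
    """True if clock a is equal to or a descendant of clock b."""
    return all(a.get(s, 0) >= v for s, v in b.items())

def in_conflict(a, b):
    # Siblings: neither clock descends from the other, so the
    # client must reconcile the two versions.
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "s0")    # {"s0": 1}
v2 = increment(v1, "s1")    # {"s0": 1, "s1": 1} — descends from v1
v3 = increment(v1, "s2")    # {"s0": 1, "s2": 1} — sibling of v2
assert descends(v2, v1)
assert in_conflict(v2, v3)
```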
Handling Failures
- Several failure detection methods for a distributed key-value store are discussed below.
- All-to-All Multicasting and Gossip Protocols:
- Failure Detection: Both all-to-all multicasting and gossip protocols are employed for failure detection in distributed systems.
- Communication Strategy: While all-to-all multicasting involves each node communicating with every other node in the system, gossip protocols rely on nodes periodically sending and receiving information about the health and status of other nodes.
- Efficiency Tradeoff: All-to-all multicasting, although straightforward, may become inefficient as the number of servers increases, making gossip protocols a preferred choice for decentralized failure detection in larger systems.
- Sloppy Quorum Technique:
- Quorum Relaxation: The sloppy quorum technique deviates from strict quorum requirements during temporary failures, allowing for continued read and write operations without enforcing the usual quorum constraints.
- Enhanced Availability: By selecting the first available healthy servers for reads and writes, ignoring temporarily offline servers, the system can maintain higher availability during temporary failures.
- Hinted Handoff Integration: Sloppy quorum often integrates with hinted handoff, where tasks are temporarily reassigned to functioning nodes during non-availability, ensuring that operations continue seamlessly.
- Hinted Handoff Mechanism:
- Temporary Task Reassignment: Hinted handoff involves temporarily reassigning tasks, such as reads and writes, to alternative nodes when the original node is temporarily unavailable.
- Maintaining Operations: In scenarios where a node is offline or unreachable, hinted handoff allows another node to process requests on its behalf, ensuring that operations can continue without disruption.
- Data Consistency: Once the original node becomes available again, changes made by the temporary node are synchronized back to maintain data consistency in the system.
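Of the failure detection methods above, gossip is the most compact to sketch. Below is a toy membership table, assuming each node tracks (heartbeat, last-updated) pairs and suspects peers whose heartbeats stop advancing; the class and method names are illustrative.

```python
import time

class GossipNode:
    """Toy gossip-style failure detector (illustrative, not the book's code)."""

    def __init__(self, node_id, fail_after=5.0):
        self.node_id = node_id
        self.fail_after = fail_after  # seconds without heartbeat progress
        self.table = {node_id: (0, time.monotonic())}

    def tick(self):
        # Periodically increment our own heartbeat counter.
        hb, _ = self.table[self.node_id]
        self.table[self.node_id] = (hb + 1, time.monotonic())

    def merge(self, peer_table):
        # On receiving a peer's table, adopt any entry whose
        # heartbeat is newer than what we have.
        now = time.monotonic()
        for node, (hb, _) in peer_table.items():
            local = self.table.get(node)
            if local is None or hb > local[0]:
                self.table[node] = (hb, now)

    def suspected_failed(self):
        # A peer whose heartbeat has not advanced within the
        # timeout is marked as suspected offline.
        now = time.monotonic()
        return [n for n, (_, seen) in self.table.items()
                if n != self.node_id and now - seen > self.fail_after]
```

In a real deployment each node would periodically send its table to a few random peers, so failure information spreads without all-to-all traffic.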
Handling Permanent Failures
- Anti-Entropy Protocol:
- Handling Permanent Failures: The anti-entropy protocol is designed to address situations where a replica becomes permanently unavailable.
- Syncing Replicas: It involves a systematic approach to keeping replicas in sync by comparing each piece of data and updating each replica to the newest version.
- Merkle Tree Integration: In cases of permanent replica unavailability, the anti-entropy protocol often integrates Merkle trees to efficiently compare and synchronize data, ensuring consistency across replicas.
- Merkle Trees for Data Transfer:
- Detecting Inconsistencies: Merkle trees play a crucial role in detecting inconsistencies between replicas by hashing labels or values of child nodes at different levels.
- Minimizing Data Transfer: During synchronization, Merkle trees help minimize the amount of data transferred between replicas by focusing on the differences between the two replicas rather than the entire dataset.
- Efficient Comparison: The structure of Merkle trees allows for an efficient comparison, ensuring that only the portions of data with differences need to be synchronized.
- Hash Tree or Merkle Tree Structure:
- Label Hashing: The hash tree or Merkle tree structure involves associating a hash with each non-leaf node, representing the hash of labels or values of its child nodes.
- Verification Mechanism: It provides an efficient and secure verification mechanism for the contents of large data structures, ensuring the integrity and consistency of the data.
- Hierarchical Organization: Hash tree structures organize data hierarchically, allowing for effective detection of inconsistencies and facilitating the synchronization process during anti-entropy.
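The Merkle tree comparison described above might look like the following sketch, assuming keys are already grouped into a power-of-two number of buckets (real systems first hash key ranges into buckets).

```python
import hashlib

def leaf_hash(bucket):
    # Hash a bucket's key-value pairs deterministically.
    return hashlib.sha256(repr(sorted(bucket.items())).encode()).hexdigest()

def build_tree(buckets):
    # Bottom level: one hash per bucket; each parent node hashes
    # the concatenation of its two children's hashes.
    level = [leaf_hash(b) for b in buckets]
    tree = [level]
    while len(level) > 1:
        level = [hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def diff_buckets(tree_a, tree_b):
    # Equal roots mean the replicas match and no data is transferred;
    # otherwise only the differing leaf buckets need synchronization.
    if tree_a[-1] == tree_b[-1]:
        return []
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]
```

Comparing roots first is what keeps the data transferred proportional to the differences, not to the total dataset size.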
Data Center Outage
- Replication across multiple data centers ensures continued access during a data center outage.
- Users can access data through other operational data centers.
- This shows the importance of distributing data across geographically diverse locations.
- Essential for maintaining system availability during data center-related disruptions.
Write Path
- Write requests are persisted in a commit log file.
- Data is saved in the memory cache for quick retrieval.
- Periodic flushing of data to SSTable on disk.
- Memory cache management ensures efficient write operations.
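The write path steps above can be sketched as follows; the in-memory lists standing in for the on-disk commit log and SSTable files are illustrative assumptions.

```python
class WritePath:
    """Toy write path: commit log -> memtable -> flush to SSTable."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []            # stand-in for an append-only log on disk
        self.memtable = {}              # in-memory cache of recent writes
        self.sstables = []              # immutable sorted tables, newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.commit_log.append((key, value))  # 1. persist to the commit log
        self.memtable[key] = value            # 2. save in the memory cache
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush when the cache fills

    def flush(self):
        # Write the memtable out as a sorted, immutable SSTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable.clear()
```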
Read Path
- Read requests first check if data is present in the memory cache.
- Bloom filters are used for the efficient identification of relevant SSTables.
- Data is retrieved from disk if it is not present in the memory cache.
- This ensures an efficient read path design with an emphasis on quick data access.
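A toy sketch of the read path, including a minimal Bloom filter; the filter size, hash count, and function names are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

def read(key, memtable, sstables):
    if key in memtable:                       # 1. check the memory cache first
        return memtable[key]
    for bloom, table in reversed(sstables):   # 2. newest SSTable first
        # 3. the Bloom filter skips SSTables that cannot hold the key
        if bloom.might_contain(key) and key in table:
            return table[key]
    return None
```

Because a Bloom filter can return false positives but never false negatives, the `key in table` check remains necessary, but most irrelevant SSTables are skipped without touching them.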