Interview Preparation | Building Blocks - Databases

December 23rd, 2023

Introduction

This summary will serve to further cement my learnings taken when reviewing the Databases module in the Grokking the Modern System Design Interview course, and I hope will provide some learnings to you as well.

Without Databases:

Simple file storage for applications like WhatsApp produce many limitations.

Limitations of File Storage:

Inability to offer concurrent management to users.
Lack of different access rights.
Challenges in scaling with thousands of entries.
Inefficient content search for different users.

Solution:

Databases act as organized collections to address these limitations.

Databases facilitate easy management, retrieval, modification, and deletion of data.
Essential for various applications like banking systems, online shopping stores, etc.
Two basic types: SQL (relational databases) and NoSQL (non-relational databases).

Advantages:

Essential for business, storing personnel records, transactions, salary information, etc.
Manages large data efficiently.
Ensures data consistency and accuracy.
Facilitates easy data updating (DML).
Provides security by allowing authorized access.
Ensures data integrity through constraints.
Enables availability through data replication.
Enhances scalability through data partitioning.

Relational Databases

Structure: Follows predefined schemas, organizing data into tables with unique keys.
Query Language: Utilizes SQL for data manipulation (insertion, deletion, retrieval).
ACID Properties: Ensures Atomicity, Consistency, Isolation, and Durability for data integrity.

Atomicity: A transaction is considered an atomic unit. Therefore, either all the statements within a transaction will successfully execute, or none of them will execute. If a statement fails within a transaction, it should be aborted and rolled back.

Consistency: At any given time, the database should be in a consistent state, and it should remain in a consistent state after every transaction. For example, if multiple users want to view a record from the database, it should return a similar result each time.

Isolation: In the case of multiple transactions running concurrently, they shouldn’t be affected by each other. The final state of the database should be the same as the transactions were executed sequentially.

Durability: The system should guarantee that completed transactions will survive permanently in the database even in system failure events.
Popular DBMS: MySQL, Oracle, SQL Server, DB2, Postgres, SQLite.

Advantages

Flexibility: DDL allows easy modification of the database.
Reduced Redundancy: Eliminates data redundancy through normalization.
Concurrency: Handles multiple users reading and writing concurrently.
Integration: Supports aggregation of data from multiple sources.
Backup and Recovery: Ensures consistent data states for easier backup and restoration.

Drawback

Impedance Mismatch: Relational model limitations when dealing with complex in-memory structures.

Non-Relational (NoSQL) Databases

Design Philosophy: Offers various data models for flexibility and scalability.
Types: Includes key-value, document, graph, and columnar databases.
Characteristics: Simple design, horizontal scaling, availability, support for unstructured data.
Examples: DynamoDB, MongoDB, Neo4J, Cassandra, HBase.

Use Cases for NoSQL Types

Key-Value Databases: Efficient for session-oriented applications.
Document Databases: Suitable for unstructured catalog data, e.g., e-commerce product attributes.
Graph Databases: Ideal for social applications and data analysis based on relationships.
Columnar Databases: Efficient for data analytics queries, reducing disk I/O requirements.

Drawbacks

Lack of Standardization: NoSQL lacks a universal standard, complicating application portability.
Consistency: May sacrifice strong data integrity for eventual consistency.

Data Replication Models

Single Leader/Primary-Secondary Replication

Overview: Data is replicated across nodes with one designated primary node handling writes and syncing data with secondary nodes.
Suitability: Ideal for read-heavy workloads; scalability achieved by adding followers.
Advantages: Read resilience, scalability with increased readers.
Challenges: Write-heavy workloads may lead to primary node bottleneck; inconsistency with asynchronous replication.
Leader Elections: In the event of a primary node failure, the secondary nodes can perform a leader election to select a secondary node that automatically replaces the primary node in the architecture.

Primary-Secondary Replication Methods:

Statement-based Replication:
- Logs and sends executed statements to secondary nodes.
- Disadvantages include nondeterministic function issues and uncertain outcomes with dependencies.
Write-ahead Log (WAL) Shipping:
- Primary node logs queries before execution for replication.
- Tightly coupled with database engine structure; complicates software upgrades.
Logical (Row-Based) Log Replication:
- Secondary nodes replicate actual data changes.
- Avoids issues present in WAL by not requiring database engine structure details.

Multi-Leader Replication

Overview: Introduces multiple primary nodes processing writes, sent to all nodes for replication.
Use Cases: Offline-capable applications, offering flexibility when online.
Challenges: Potential conflicts due to concurrent writes on different primary nodes.

Handling Conflicts:

Conflict Avoidance: Prevent conflicts by ensuring all writes for a record go through the same leader.
Last-Write-Wins: Assign timestamps to updates; conflict resolved with the latest timestamp.
Custom Logic: Implement custom logic for conflict resolution, executed on both reads and writes.

Replication Topologies:

All-to-All Topology: Commonly used; minimizes single node failure impact.

Peer-to-Peer/Leaderless Replication

Overview: Eliminates primary node bottleneck; all nodes have equal weight, accepting read and write requests.
Implementation: Found in databases like Cassandra.
Consistency Challenges: Potential inconsistencies due to concurrent writes.

Quorums:

Definition: A strategy ensuring a specified number of nodes return successful updates for read and write operations.
Configuration: Parameters include total nodes (n), required write nodes (w), and required read nodes (r).

Data Partitioning

Why Partition Data:
- Increasing data and read/write traffic affects scalability in traditional databases.
- Traditional databases face challenges in providing single-node database-like properties in a distributed environment.
- Data partitioning or sharding is a solution to distribute data over multiple nodes for balanced partitions and load.

Sharding

Vertical Sharding

Splitting tables or breaking them into multiple tables across different database instances.
Useful for tables with wide text or binary large objects (blobs).

Horizontal Sharding

Key-range based Sharding

Assigning continuous ranges of keys to partitions.
Advantages: Easy range-query-based implementation.
Disadvantages: Limited to partitioning keys, potential uneven distribution.

Hash-based Sharding

Using a hash function on an attribute for partitioning.
Advantages: Uniform distribution of keys.
Disadvantages: No range queries, keys spread across all partitions.

Consistent Hashing

Assigning each server a place on a ring, facilitating horizontal scaling.
Advantages: Horizontal scaling, improved throughput.
Disadvantages: Potential non-uniform distribution.

Re-balancing Partitions

Strategies to balance query load and handle data distribution changes:
- Avoid hash mod n to minimize data movement.
- Fixed number of partitions, but careful selection is crucial.
- Dynamic partitioning, splitting partitions when reaching a threshold.
- Partition proportionally to nodes, adapting to overall data amount.

Partitioning and Secondary Indexes

Strategies for accessing records through secondary indexes:
- Partition secondary indexes by document.
- Partition secondary indexes by the term.

Request Routing

Challenge: How does a client know which node to connect to for a request?
Approaches:
- Clients request any node, which forwards requests if data is not present.
- Routing tier determines the node to connect to.
- Clients directly contact nodes with the needed data.

ZooKeeper

A separate management server to track changes in the cluster.
Keeps mappings and notifies about changes.
Used by distributed systems like HBase, Kafka, and SolrCloud.

Trade-offs in Databases

Best Database Sharding Approach

Considerations:
- Choosing between horizontal and vertical sharding.
- Scaling resources for organization growth, preventing downtime, and reducing latency.

No Sharding vs. Sharding

Advantages and Disadvantages of Centralized Database

Advantages

Easy data maintenance (updates, backups).
Strong consistency and ACID transactions.
Simpler programming model for small data.

Disadvantages

Slows down with high query loads.
Single point of failure.

Advantages and Disadvantages of Distributed Database

Advantages

Fast access to data, retrieval from nearest shard.
Different levels of distribution transparency.
Parallel processing for intensive transactions.

Disadvantages

Delay when data is required from multiple sites.
Complex operations like joins in partitioned relations.
Challenges in maintaining consistency and synchronization.

Query Optimization in a Distributed Database Example

Parameters Assumption:
- Data is needed from three tables, Store, Product and Sales
- Store has 10,000 tuples stored at Site A
- Product has 100,000 tuples stored at Site B
- Sales has 1,000,000 tuples stored at Site A
- Every stored tuple is 200 bits (25 bytes) long
- Data rate = 50M bits per second
- Access delay = 0.1 seconds
- Query to be performed:
```
 Select Store_key from (Store JOIN Sales JOIN Product)
 where Region= 'East' AND Brand='Wolf';
```
- The number of the Wolf brand is 10.
- The number of East region stores is 3000 (since there are 10,000 rows in the store table, and 3000 have region as east).
- Total communication time T = a + v / b where a = access delay, v = total data volume and b = data rate

Possible Approaches:

Move Product table to site A and process the query.

  T = a + v / b = 0.1 seconds + (100,000\*200)/(50,000,000) = 0.5 seconds

Move Store and Sales to site B and process the query.

  T = a + v / b = 0.2 seconds + (10,000*200)/50,000,000 + (1,000,000 * 200)/50,000,000 = 4.24 seconds

Restrict Brand at site B to Wolf and move the result to site A.

  T = a + v / b = 0.1 seconds + (10\*200)/50,000,000 = 0.1 seconds