Interview Preparation | Abstractions and Non-functional System Characteristics

December 13th, 2023

##Introduction

This summary covers what I learned from two modules of the Grokking the Modern System Design Interview course on Educative: Abstractions and Non-functional System Characteristics. Hopefully these notes are useful to you as well.

##Abstractions in the Network

What is an RPC?

RPC (remote procedure call) is an interprocess communication protocol used in distributed systems. In the OSI model it spans the transport and application layers. It lets a program execute a procedure in a separate address space without the programmer explicitly coding the remote interaction, which gives a high-level abstraction over the network.

How does RPC work?

When a remote procedure call is made, the calling environment pauses and the procedure’s parameters are sent over the network to the environment where the procedure will execute. When the procedure finishes, the results are returned to the calling environment, where execution resumes as if a regular procedure call had returned. Client and server stubs act as abstractions over the underlying communication details.

The workflow of an RPC

In the RPC workflow, the client calls a client stub, which converts the parameters into a message. The RPC runtime delivers that message over the network to the server, where the server stub unpacks it and invokes the procedure. After execution, the result goes through the same process in reverse to reach the client. The caller never deals with the communication mechanism directly.
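The workflow above can be sketched in a few lines. This is a minimal, in-process illustration rather than a real RPC framework: the names `client_stub`, `server_stub`, `transport`, and `remote_add` are hypothetical, JSON stands in for the wire format, and a plain function call stands in for the RPC runtime and the network.

```python
import json

# Hypothetical procedure that lives on the "server" side.
def add(a, b):
    return a + b

# Procedures the server is willing to expose, looked up by name.
PROCEDURES = {"add": add}

def server_stub(request_bytes):
    """Unpack the request, call the local procedure, pack the result."""
    request = json.loads(request_bytes.decode("utf-8"))
    result = PROCEDURES[request["method"]](*request["params"])
    return json.dumps({"result": result}).encode("utf-8")

def transport(request_bytes):
    """Stand-in for the RPC runtime and network: hands bytes to the server stub."""
    return server_stub(request_bytes)

def client_stub(method, *params):
    """Pack the call into a message, send it, and unpack the reply."""
    message = json.dumps({"method": method, "params": list(params)}).encode("utf-8")
    reply = transport(message)
    return json.loads(reply.decode("utf-8"))["result"]

def remote_add(a, b):
    # To the caller this looks like an ordinary local function.
    return client_stub("add", a, b)

print(remote_add(2, 3))  # 5
```

The caller only sees `remote_add`; everything about message packing and delivery is hidden in the stubs, which is the abstraction RPC provides.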

##Spectrum of Consistency Models

Types of Consistency Models

The consistency models covered here, ordered from weakest to strongest, are eventual consistency, causal consistency, sequential consistency, and strict consistency (linearizability). Each provides different guarantees and suits different applications, and together they form a spectrum of increasingly strong consistency guarantees.

Eventual Consistency

Eventual consistency is the weakest model and suits applications without strict ordering requirements. It guarantees that all replicas converge to the same final value after a finite time, provided no new writes occur. The domain name system (DNS) and Cassandra are examples.
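One common way replicas converge is a last-write-wins merge, sketched below. This is an illustrative toy, not how DNS or Cassandra is implemented in detail: the `LwwReplica` class and its `sync_with` method are hypothetical names, and the timestamps are supplied by the caller.

```python
class LwwReplica:
    """A single-value replica that keeps whichever write has the newest timestamp."""

    def __init__(self):
        self.timestamp = 0
        self.value = None

    def write(self, value, timestamp):
        # Ignore writes that are older than what this replica already holds.
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value

    def sync_with(self, other):
        # Exchange state in both directions; the newer timestamp wins on each side.
        self.write(other.value, other.timestamp)
        other.write(self.value, self.timestamp)


a, b = LwwReplica(), LwwReplica()
a.write("old", timestamp=1)
b.write("new", timestamp=2)
print(a.value, b.value)  # old new  (replicas temporarily disagree)

a.sync_with(b)
print(a.value, b.value)  # new new  (converged once writes stopped)
```

Between the writes and the sync, a reader may see either value, which is exactly what eventual consistency permits.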

Causal Consistency

Causal consistency categorizes operations as dependent (causally related) or independent. It preserves the order of causally related operations, while operations that are not causally related may appear in different orders to different clients. A commenting system is a typical example: a reply depends on the comment it answers, so showing the reply before the comment would be non-intuitive, and causal consistency prevents this.
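For the commenting example, a replica can enforce causal order by holding back a reply until the comment it depends on has arrived. The sketch below assumes each comment carries the id of its parent; the `CommentReplica` class and its field names are hypothetical.

```python
class CommentReplica:
    """Makes a comment visible only after its parent (its causal dependency) is visible."""

    def __init__(self):
        self.visible = []   # comment ids in the order they became visible
        self.pending = {}   # parent id -> ids of replies waiting for that parent

    def receive(self, comment_id, parent_id=None):
        if parent_id is not None and parent_id not in self.visible:
            # The dependency has not arrived yet, so buffer the reply.
            self.pending.setdefault(parent_id, []).append(comment_id)
            return
        self._apply(comment_id)

    def _apply(self, comment_id):
        self.visible.append(comment_id)
        # Release any replies that were waiting for this comment.
        for child in self.pending.pop(comment_id, []):
            self._apply(child)


replica = CommentReplica()
replica.receive("reply", parent_id="post")  # arrives early, is buffered
replica.receive("unrelated")                # independent, shown immediately
replica.receive("post")                     # releases the buffered reply
print(replica.visible)  # ['unrelated', 'post', 'reply']
```

Independent comments may become visible in any order, but a reply never appears before the comment it answers.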

Sequential Consistency

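Sequential consistency is stronger than causal consistency. It preserves the order of operations specified by each client’s program, but it does not guarantee that writes become visible instantaneously or in the order in which they occurred according to a global clock. Social networking applications are an example: a single friend’s posts should appear in the order they were written, while the relative order of posts from different friends matters less.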

Strict Consistency (Linearizability)

Strict consistency, or linearizability, is the strongest model. It ensures that a read from any replica returns the latest written value as soon as the write has been acknowledged. This is hard to achieve in a distributed system because of network delays and failures. Applications that require strong consistency may use techniques such as quorum-based replication.
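The idea behind quorum-based replication can be sketched as follows: with `n` replicas, a write must reach `w` of them and a read must consult `r` of them, and choosing `w + r > n` forces every read quorum to overlap every write quorum. The `QuorumRegister` class is a hypothetical toy that assumes a single coordinator issuing version numbers and no concurrent writes; overlap alone is one ingredient of strong consistency, not a complete linearizable protocol.

```python
class QuorumRegister:
    """Toy replicated register with write quorum w and read quorum r over n replicas."""

    def __init__(self, n, w, r):
        if w + r <= n:
            raise ValueError("need w + r > n so read and write quorums overlap")
        self.n, self.w, self.r = n, w, r
        self.replicas = [(0, None) for _ in range(n)]  # (version, value) per replica
        self.version = 0  # simplification: one coordinator hands out versions

    def write(self, value, replica_ids):
        """Apply the write to the replicas that are reachable (replica_ids)."""
        if len(replica_ids) < self.w:
            raise RuntimeError("write quorum not reached")
        self.version += 1
        for i in replica_ids:
            self.replicas[i] = (self.version, value)

    def read(self, replica_ids):
        """Ask the reachable replicas and return the value with the highest version."""
        if len(replica_ids) < self.r:
            raise RuntimeError("read quorum not reached")
        newest = max((self.replicas[i] for i in replica_ids), key=lambda pair: pair[0])
        return newest[1]


store = QuorumRegister(n=3, w=2, r=2)
store.write("x=1", replica_ids=[0, 1])  # replica 2 misses the write
print(store.read(replica_ids=[1, 2]))   # x=1, because replica 1 is in both quorums
```

Replica 2 is stale, yet the read still returns the latest acknowledged value because at least one replica in the read quorum saw the write.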

##Non-Functional System Requirements

Availability

Availability is the percentage of time that a service or piece of infrastructure is accessible to clients and operates under normal conditions. For example, a service with 100% availability functions and responds as intended (operates normally) all the time.
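As a worked example, availability can be computed as the share of a period during which the service was up, and an availability target can be turned into a downtime budget. The function names below are hypothetical, and the 99.9% figure is only an illustration.

```python
def availability_percent(total_seconds, downtime_seconds):
    """Percentage of the period during which the service was up."""
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds

def allowed_downtime_seconds(availability_pct, period_seconds):
    """Downtime budget implied by an availability target over a period."""
    return period_seconds * (1 - availability_pct / 100.0)

SECONDS_PER_YEAR = 365 * 24 * 3600

# 10 seconds of downtime in a 1000-second window.
print(availability_percent(1000, 10))  # 99.0

# A 99.9% target leaves about 8.76 hours of downtime per year.
print(round(allowed_downtime_seconds(99.9, SECONDS_PER_YEAR) / 3600, 2))
```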

Reliability

Reliability, denoted R, is the probability that a service will perform its functions for a specified time. It measures how the service performs under varying operating conditions and is often quantified with metrics such as mean time between failures (MTBF) and mean time to repair (MTTR).

Metrics for Reliability

Mean time between failures (MTBF) and mean time to repair (MTTR) are the key metrics for measuring reliability. The goal is a higher MTBF and a lower MTTR, since both improve the overall reliability of the service.
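Both metrics can be computed from an outage log. The sketch below uses the usual definitions: MTBF is the total uptime divided by the number of failures, and MTTR is the total repair time divided by the number of repairs. The function names and sample numbers are made up for illustration.

```python
def mtbf(total_hours, outage_hours):
    """Mean time between failures: total uptime divided by the number of failures."""
    return (total_hours - sum(outage_hours)) / len(outage_hours)

def mttr(outage_hours):
    """Mean time to repair: total repair time divided by the number of repairs."""
    return sum(outage_hours) / len(outage_hours)

# Hypothetical log: three outages of 2 h, 1 h, and 3 h over 1000 hours of operation.
outages = [2, 1, 3]
print(round(mtbf(1000, outages), 1))  # about 331.3 hours between failures
print(mttr(outages))                  # 2.0 hours per repair
```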

Reliability and Availability

Reliability and availability are both crucial metrics for assessing whether a service complies with its service level objectives (SLOs). They are distinct but related concepts: availability (A) is a function of reliability (R), so A depends on R. A service can show any combination of low or high availability and low or high reliability, and high A together with high R is the desirable state.

Scalability

Scalability is a system’s ability to handle an increasing workload without compromising performance. It is vital for services such as search engines, which must cope with growing numbers of users and growing amounts of data.

Workload Dimensions

Workload comes in various types, including request workload (the number of requests served) and data/storage workload (the amount of data stored). Each type shapes a system’s scalability requirements.

Dimensions of Scalability

Scalability encompasses size scalability (adding users and resources), administrative scalability (ease of sharing among organizations or users), and geographical scalability (ability to serve different regions while maintaining performance).

Different Approaches to Scalability

Vertical scaling (scaling up) adds capability, such as more CPU or RAM, to an existing machine and is limited by how far a single server can be upgraded. Horizontal scaling (scaling out) adds machines to the network and requires a system design in which multiple nodes can function collectively.

Maintainability

Maintainability covers the work that keeps a system running smoothly after development, such as fixing bugs, adding functionality, and updating the platform. It has three key aspects: operability, lucidity (code simplicity), and modifiability.

Measuring Maintainability

Maintainability (M) is the probability that a service will restore its functions within a specified time after a fault. It is measured with metrics such as mean time to repair (MTTR) and indicates how well the system can undergo repairs and modifications while it is operational.

MTTR as a Metric

Mean time to repair (MTTR) is the average time required to repair and restore a failed component, with the goal of achieving as low an MTTR value as possible for efficient system maintenance.

Maintainability and Reliability

Maintainability is closely related to reliability. Maintainability focuses on time-to-repair, whereas reliability considers both time-to-repair and time-to-failure. Analyzing the two together provides insight into availability, downtime, and uptime.
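One widely used way to combine the two metrics is the steady-state availability formula A = MTBF / (MTBF + MTTR). The sketch below applies it to made-up numbers; the function name is hypothetical.

```python
def availability_from_metrics(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical service: fails every 998 hours on average, takes 2 hours to repair.
a = availability_from_metrics(998, 2)
print(a)  # 0.998

# Halving the repair time raises availability without changing the failure rate.
print(availability_from_metrics(998, 1) > a)  # True
```

The second call shows why a low MTTR matters: availability improves even if failures are just as frequent.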

Fault Tolerance

Fault tolerance is a system’s ability to keep executing despite the failure of one or more of its components. In large-scale applications it aims to prevent single points of failure and to keep data safe.

Necessity for Fault Tolerance

Fault tolerance is essential for availability, which means the system is accessible to clients 24/7, and for reliability, which means the system responds to client requests with the specified actions.

Fault Tolerance Techniques

Replication-based fault tolerance replicates both services and data so that a failed component can be swapped for a healthy one without the client noticing. Keeping the replicas up to date involves a trade-off between consistency approaches when failures occur, as outlined in the CAP theorem.
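Switching from a failed replica to a healthy one can be sketched as a simple failover loop. The `ServiceReplica` class and `call_with_failover` function are hypothetical, and a raised `ConnectionError` stands in for whatever failure detection a real system would use.

```python
class ServiceReplica:
    """Stand-in for one copy of a replicated service."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue  # this replica failed, move on to the next one
    raise RuntimeError("all replicas failed")


replicas = [ServiceReplica("primary", healthy=False), ServiceReplica("backup")]
print(call_with_failover(replicas, "GET /status"))  # backup handled GET /status
```

The client gets a response even though the primary is down; the request only fails when every replica is unavailable.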

Checkpointing

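Checkpointing saves a system’s state to stable storage at regular intervals so that, after a failure, computation can resume from the last checkpoint rather than from the beginning. The difficulty, particularly with synchronous checkpointing across several processes, is that the saved state may be consistent or inconsistent, and recovering from an inconsistent state compromises data consistency.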
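A single-process version of the idea is sketched below: the job saves its state every few steps and, after a crash, resumes from the last saved state. The function names, the JSON file format, and the simulated crash are all assumptions made for illustration; coordinating checkpoints across multiple processes is not covered.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    # Write to a temporary file and rename it, so a crash mid-write
    # never leaves a partially written checkpoint behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)

def load_checkpoint(path, default):
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)

def run_job(path, total_steps, checkpoint_every, crash_at=None):
    """Sum the step numbers 0..total_steps-1, checkpointing periodically."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]

with tempfile.TemporaryDirectory() as workdir:
    checkpoint_path = os.path.join(workdir, "job.json")
    try:
        run_job(checkpoint_path, total_steps=10, checkpoint_every=3, crash_at=7)
    except RuntimeError:
        pass  # work done since the last checkpoint (step 6) is lost
    # The restart resumes from step 6 instead of step 0.
    print(run_job(checkpoint_path, total_steps=10, checkpoint_every=3))  # 45
```

A shorter checkpoint interval loses less work on a crash but spends more time writing state, which is the basic trade-off when choosing the interval.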