Interview Preparation | Building Twitter

January 3rd, 2024

Introduction to Building Twitter

This summary is meant to cement what I learned while reviewing the Building Twitter module of the Grokking the Modern System Design Interview course, and I hope it offers some useful takeaways for you as well.

Overview of Twitter Design Requirements

Functional Requirements

  1. Post Tweets:

    • Registered users can post one or more Tweets on Twitter.
  2. Delete Tweets:

    • Registered users can delete one or more of their Tweets on Twitter.
  3. Like or dislike Tweets:

    • Registered users can like and dislike public Tweets and their own Tweets on Twitter.
  4. Reply to Tweets:

    • Registered users can reply to the public Tweets on Twitter.
  5. Search Tweets:

    • Registered users can search Tweets by typing keywords, hashtags, or usernames in the search bar on Twitter.
  6. View user or home timeline:

    • Registered users can view their user timeline, which contains their own Tweets. They can also view the home timeline, which contains Tweets from the accounts they follow on Twitter.
  7. Follow or unfollow the account:

    • Registered users can follow or unfollow other users on Twitter.
  8. Retweet a Tweet:

    • Registered users can Retweet public Tweets of other users on Twitter.

Non-functional Requirements

  1. Availability:

    • The service must be highly available, since the information it serves is time-sensitive and loses value quickly during downtime.
  2. Latency:

    • Near real-time services, such as Tweet distribution, must have low latency.
  3. Scalability:

    • The design should accommodate a read-heavy workload with high scalability, given the 1:1000 write-to-read ratio.
  4. Reliability:

    • Twitter must ensure a reliable strategy to prevent the loss or damage of uploaded content.
  5. Consistency:

    • Effective techniques are needed to ensure consistency in status updates across different regions for user interactions.

Building Blocks in Twitter Design

  1. DNS:

    • Maps human-friendly Twitter domain names to machine-readable IP addresses.
  2. Load Balancers:

    • Distribute read/write requests among respective services.
  3. Sequencers:

    • Generate unique IDs for Tweets (a sketch of one common approach follows this list).
  4. Databases:

    • Store metadata of Tweets and users.
  5. Blob Stores:

    • Store images and video clips attached with Tweets.
  6. Key-value Stores:

    • Used for indexing, storing counters, and managing user-specific data.
  7. Pub-sub:

    • Facilitates real-time event processing and the elimination of duplicate data.
  8. Sharded Counters:

    • Handle counts of various features for accounts with millions of followers.
  9. Cache:

    • Stores frequently requested data in RAM for quick user response.
  10. CDN:

    • Helps end users access data with low latency.
  11. Monitoring:

    • Analyzes incoming and outgoing traffic, identifies storage redundancies, and detects system failures.
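
To make the sequencer building block concrete, below is a minimal sketch of a Snowflake-style unique-ID generator (41-bit millisecond timestamp, 10-bit worker ID, 12-bit per-millisecond sequence). The bit layout and epoch follow the widely published Snowflake scheme; treat the class as an illustration rather than Twitter's production service.

```python
import time
import threading

class SnowflakeSequencer:
    """Minimal Snowflake-style ID generator: 41-bit timestamp,
    10-bit worker ID, and a 12-bit per-millisecond sequence."""

    EPOCH_MS = 1288834974657  # Snowflake's custom epoch (Nov 2010)

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # wraps at 4096
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond; wait for the next one.
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return ((now_ms - self.EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence

# tweet_id = SnowflakeSequencer(worker_id=7).next_id()
```

Because the timestamp occupies the high bits, IDs sort roughly by creation time, which is convenient when assembling timelines.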

User-System Interaction

In the high-level design of Twitter, the user-system interaction involves several components:

  • Users Post Tweets: Tweets from registered users are delivered to the server through a load balancer and stored in persistent storage.

  • DNS: Maps human-friendly Twitter domain names to machine-readable IP addresses.

  • CDN: Situated near users to provide requested data with low latency.

  • Load Balancer: Chooses operational application servers based on traffic load and user requests.

  • Storage System: Various types of storage, including SQL-based and NoSQL-based.

  • Application Servers: Provide various services and orchestrate business logic between different components.

High-Level API Design

This section focuses on designing various APIs for Twitter functionalities:

Post Tweet

  • API: /postTweet
  • Parameters:
    • user_id: Unique ID of the user who posted the Tweet.
    • access_type: Indicates whether the Tweet is protected (visible only to followers) or public.
    • tweet_type: Indicates whether the Tweet is text-based, video-clip-based, image-based, or a combination of these.
    • content: Actual content (text) of the Tweet.
    • tweet_length: Represents the text length in the Tweet; for video, indicates duration and size.
    • media_field: Specifies the type of media (image, video, GIF, etc.) delivered in each Tweet.
    • post_time: Time when the Tweet is posted.
    • tweet_location: Location associated with the Tweet.
    • list_of_used_hashtags: List of hashtags used in the Tweet.
    • list_of_tagged_people: List of people tagged in the Tweet.
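
To make the parameter list above concrete, here is a hedged sketch of what a client call to /postTweet might look like over HTTP. The host api.example.com, the bearer-token header, and the tweet_id field in the response are illustrative assumptions; only the parameter names come from the list above.

```python
import requests

text = "Shipping the new timeline service today!"

# Hypothetical request body; field names mirror the /postTweet parameters above.
payload = {
    "user_id": 4025167283,
    "access_type": "public",                 # or "protected"
    "tweet_type": "text_with_image",
    "content": text,
    "tweet_length": len(text),
    "media_field": "image",
    "post_time": "2024-01-03T10:15:00Z",
    "tweet_location": "San Francisco, CA",
    "list_of_used_hashtags": ["#systemdesign"],
    "list_of_tagged_people": [1189021934],
}

response = requests.post(
    "https://api.example.com/postTweet",     # placeholder host, not a real endpoint
    json=payload,
    headers={"Authorization": "Bearer <token>"},
    timeout=5,
)
# Assumes the service responds with the sequencer-generated ID of the new Tweet.
print(response.status_code, response.json().get("tweet_id"))
```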

Like or Dislike Tweet

  • API: /likeTweet
  • Parameters:
    • user_id: Unique ID of the user who liked the Tweet.
    • tweet_id: Unique ID of the Tweet.
    • tweeted_user_id: Unique ID of the user who posted the Tweet.
    • user_location: Location of the user who liked the Tweet.

Reply to Tweet

  • API: /replyTweet
  • Parameters:
    • user_id: Unique ID of the user who replied to the Tweet.
    • tweet_id: Unique ID of the Tweet.
    • tweeted_user_id: Unique ID of the user who posted the Tweet.
    • reply_type: Type of the reply (text, video, image, etc.).
    • reply_content: Content of the reply.
    • reply_length: Length of the reply content.

Search Tweet

  • API: /searchTweet
  • Parameters:
    • user_id: Unique ID of the user searching for Tweets.
    • search_term: Keyword or phrase to search.
    • max_result: Number of Tweets returned per response page.
    • exclude: Specifies what to exclude from returned Tweets (replies, retweets, etc.).
    • media_field: Specifies the media (image, video, GIF) delivered in each returned Tweet.
    • expansions: Enables requesting additional data objects in returned Tweets.
    • sort_order: Specifies the order in which Tweets are returned.
    • next_token: Used to get the next page of results.
    • user_location: Location of the user searching for Tweets.
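
The max_result and next_token parameters imply cursor-based pagination. The sketch below shows how a client might page through every result; the endpoint host and the response fields tweets and next_token are assumptions made for illustration.

```python
import requests

def search_all(search_term: str, user_id: int, page_size: int = 50):
    """Page through /searchTweet results using the next_token cursor (hypothetical endpoint)."""
    params = {
        "user_id": user_id,
        "search_term": search_term,
        "max_result": page_size,
        "exclude": "retweets",
        "sort_order": "recency",
    }
    while True:
        resp = requests.get("https://api.example.com/searchTweet", params=params, timeout=5)
        body = resp.json()
        yield from body.get("tweets", [])         # assumed response field
        token = body.get("next_token")            # absent on the last page (assumed)
        if not token:
            break
        params["next_token"] = token              # request the next page

# for tweet in search_all("#systemdesign", user_id=4025167283):
#     print(tweet)
```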

View Home Timeline

  • API: /viewHome_timeline
  • Parameters:
    • user_id: Unique ID of the user viewing the home timeline.
    • tweets_count: Number of Tweets to be shown.
    • max_result: Number of Tweets returned per response.
    • exclude: Specifies what to exclude from returned Tweets.
    • next_token: Used to get the next page of results.
    • user_location: Location of the user viewing the home timeline.

Follow or Unfollow Account

  • API: /followAccount
  • Parameters:
    • account_id: Unique ID of the user who follows the account.
    • followed_account_id: Unique ID of the account being followed.

Retweet a Tweet

  • API: /retweet
  • Parameters:
    • user_id: Unique ID of the user who retweeted.
    • tweet_id: Unique ID of the Tweet being retweeted.
    • retweet_user_id: Unique ID of the user who posted the original Tweet.

Storage System

Google Cloud

  • Twitter hosted over 300PB of data in HDFS (Hadoop Distributed File System) clusters and has been moving much of it to Google Cloud.
  • Data was shifted from the Hadoop clusters to Google Cloud’s BigQuery for better analysis and management.
  • Ad-hoc clusters and cold storage clusters were migrated first.

Manhattan

  • Twitter’s real-time distributed key-value store, replacing Cassandra in 2014.
  • Employs RocksDB as a storage engine for storing and retrieving data within a node.
  • Manages backend for Tweets, Twitter accounts, direct messages, etc.

Blobstore

  • Constructed in 2012 to store photos attached to Tweets.
  • Now stores videos, binary files, and other objects.
  • Utilizes server checkpoints for durable storage.

SQL-based Databases

  • Uses MySQL and PostgreSQL for workloads that need strong consistency, such as the ad exchange and ad campaign management.
  • Incorporates Vertica for querying aggregated datasets and Tableau dashboards.
  • Gizzard framework built on MySQL for sharding.

Kafka and Cloud Dataflow

  • Processes around 400 billion real-time events daily using Kafka on-premise.
  • Google Dataflow jobs handle deduping and real-time aggregation on Google Cloud.
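
As a rough illustration of what deduping and real-time aggregation mean here (this is neither the Kafka nor the Dataflow API, and the event tuple shape is an assumption), the toy function below drops duplicate deliveries by event ID and rolls the remaining events up into per-minute counts.

```python
from collections import defaultdict

def dedupe_and_aggregate(events):
    """events: iterable of (event_id, tweet_id, kind, epoch_seconds) tuples;
    duplicate deliveries share the same event_id."""
    seen_ids = set()
    counts = defaultdict(int)          # (tweet_id, kind, minute bucket) -> count
    for event_id, tweet_id, kind, ts in events:
        if event_id in seen_ids:       # drop redundant deliveries
            continue
        seen_ids.add(event_id)
        counts[(tweet_id, kind, ts // 60)] += 1
    return dict(counts)

stream = [("e1", 42, "like", 1704276000),
          ("e1", 42, "like", 1704276000),   # duplicate delivery
          ("e2", 42, "view", 1704276030)]
print(dedupe_and_aggregate(stream))
# {(42, 'like', 28404600): 1, (42, 'view', 28404600): 1}
```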

FlockDB

  • Graph database for storing user relationships.
  • Tuned for huge adjacency lists, rapid reads and writes, and graph-traversal operations.
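
The adjacency-list idea is easy to see in miniature. The sketch below is not FlockDB's actual API; it just shows why follow/unfollow writes and follower lookups map naturally onto per-user adjacency sets.

```python
from collections import defaultdict

class FollowGraph:
    """Toy adjacency-list follow graph (illustrative only)."""
    def __init__(self):
        self.following = defaultdict(set)   # user -> accounts they follow
        self.followers = defaultdict(set)   # user -> accounts following them

    def follow(self, follower: int, followee: int):
        self.following[follower].add(followee)
        self.followers[followee].add(follower)

    def unfollow(self, follower: int, followee: int):
        self.following[follower].discard(followee)
        self.followers[followee].discard(follower)

    def mutual_followees(self, a: int, b: int) -> set:
        # Simple one-hop graph query: accounts both a and b follow.
        return self.following[a] & self.following[b]

g = FollowGraph()
g.follow(1, 2); g.follow(1, 3); g.follow(4, 3)
print(sorted(g.followers[3]))           # [1, 4]
print(sorted(g.mutual_followees(1, 4))) # [3]
```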

Apache Lucene

  • Used for real-time search, indexing approximately a trillion records.
  • Employs an inverted index and stores a real-time index in RAM for low latency.
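
To illustrate the inverted-index idea (Lucene itself adds relevance scoring, positional data, segment merging, and much more), here is a toy index that maps each token to the set of Tweet IDs containing it, with set intersection serving as an AND query.

```python
from collections import defaultdict

def build_inverted_index(tweets: dict) -> dict:
    """Toy inverted index: token -> set of Tweet IDs containing that token."""
    index = defaultdict(set)
    for tweet_id, text in tweets.items():
        for token in text.lower().split():
            index[token].add(tweet_id)
    return index

tweets = {101: "sharded counters for trends",
          102: "counters keep like counts fresh",
          103: "timeline ranking and trends"}
index = build_inverted_index(tweets)
print(sorted(index["trends"]))                       # [101, 103]
print(sorted(index["counters"] & index["trends"]))   # AND query -> [101]
```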

Cache

Pelikan Cache

  • Replaced Twemcache and Nighthawk due to performance issues.
  • Utilizes various back-end servers for high throughput and low latency.
  • Introduced Segcache for scalability and memory efficiency.

Observability

  • Uses Zipkin for distributed tracing of requests.
  • ZooKeeper is employed for critical data storage, distributed locking, and leader election.

Real-World Complex Problems

Sharded Counters

  • Addresses heavy hitter problems with millions of interactions on Tweets.

  • Efficiently manages burst write requests against various interactions on celebrities’ Tweets.

  • Resolves Top-k problems in trends and the timeline through distributed counters.

    The Top-k problem refers to the challenge of efficiently identifying and retrieving the “top” or “highest-ranked” elements from a dataset based on a specific criterion. The “k” in Top-k represents the number of items to be retrieved. This problem is prevalent in various applications, especially in the context of real-time systems and data analysis. Here are two common scenarios where the Top-k problem arises:

    1. Trends (e.g., Twitter Trends):

      • Scenario: In social media platforms like Twitter, identifying the top trends involves determining the most frequently used hashtags or keywords in a given time frame.
      • Challenge: Locally and globally, determining the top trends requires analyzing a massive volume of user-generated content in real-time. The goal is to surface the most popular or relevant topics.
      • Solution: Systems employ algorithms and data structures to efficiently track the frequency of keywords and identify the top-k trends, considering factors like geographical relevance.
    2. Timelines (e.g., Twitter Timeline):

      • Scenario: Constructing user timelines involves curating a stream of content (e.g., tweets) that is most relevant or engaging to a user.
      • Challenge: With a vast amount of content being generated continuously, selecting the most relevant tweets for a user’s timeline requires efficient algorithms and data structures.
      • Solution: Sharded counters and distributed systems are often used to track interactions (likes, views) on tweets. This data is then used to determine the top-k tweets for a user’s timeline.

    In essence, the Top-k problem is about efficiently finding and presenting the most significant elements from a large dataset. The specific criteria for what makes an element “top” can vary based on the context of the application, such as frequency, popularity, relevance, or engagement. Efficient solutions to the Top-k problem are crucial for providing real-time, personalized, and contextually relevant information to users in applications like social media, search engines, and recommendation systems.
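
A minimal sketch of the sharded-counter idea follows: writes for a hot key are spread across N shards so no single row becomes a hotspot, reads sum the shards, and a Top-k query ranks keys by their aggregated counts. The random shard selection, the shard count, and the in-memory storage are simplifying assumptions.

```python
import heapq
import random
from collections import defaultdict

class ShardedCounter:
    """Minimal sharded counter: each key's count is split across num_shards slots."""
    def __init__(self, num_shards: int = 16):
        self.num_shards = num_shards
        self.shards = defaultdict(lambda: [0] * num_shards)

    def increment(self, key: str, amount: int = 1):
        shard = random.randrange(self.num_shards)   # spread writes across shards
        self.shards[key][shard] += amount

    def value(self, key: str) -> int:
        return sum(self.shards[key])                # reads aggregate all shards

    def top_k(self, k: int):
        # Top-k keys by aggregated count (e.g., trending hashtags).
        return heapq.nlargest(k, self.shards, key=self.value)

likes = ShardedCounter()
for tag, hits in [("#ai", 5000), ("#systemdesign", 1200), ("#python", 300)]:
    for _ in range(hits):
        likes.increment(tag)
print(likes.value("#ai"))   # 5000
print(likes.top_k(2))       # ['#ai', '#systemdesign']
```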

The Complete Design Overview

  1. End User Requests:

    • Users get the nearest load balancer address from the local DNS.
  2. Load Balancer: Routes requests to the appropriate servers for the Tweet, timeline, and search services.

  3. Tweet Service:

    • Handles operations like posting Tweets, storing attachments in Blobstore, and real-time processing using Kafka.
    • Data is moved to cloud pub-sub, then to BigQuery for deduping and aggregation, and finally to Google Cloud Bigtable.
  4. Timeline Service:

    • Fetches data from different databases or stores for home timeline requests.
    • Uses sharded counters for real-time counts and returns Top-k Tweets.
  5. Search Service:

    • Processes search requests, fetching real-time Tweets from Apache Lucene.
    • Ranks discovered Tweets based on factors like time or location and returns the top Tweets.
  6. Observability:

    • Utilizes Zipkin for tracing sampled requests.
    • ZooKeeper maintains different data, including configuration information and distributed synchronization.

Client-side Load Balancer

Twitter utilizes client-side load balancing with deterministic aperture as part of the Finagle RPC framework. It uses the Power of Two Random Choices (P2C) technique for both request and session distribution.

  • Request Distribution (P2C): Randomly selects two instances and picks the one with the least load.

  • Session Distribution:

    • Mesh Topology: Each client opens a session with every instance; fair, but it does not scale.
    • Random Aperture: Each client randomly chooses a subset of servers for sessions; this scales better but can distribute load unfairly.
    • Deterministic Aperture (Continuous Ring): Ensures fairness and scalability by mapping clients to servers using continuous ring coordinates and P2C.
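
The request-distribution half of this scheme is straightforward to sketch: sample two instances from the client's aperture and send the request to the one with fewer outstanding requests. The class below is a toy illustration, not Finagle's implementation, and the load signal (in-flight request count) is an assumption.

```python
import random

class P2CBalancer:
    """Power of Two Random Choices over the client's aperture of server instances."""
    def __init__(self, instances):
        self.instances = list(instances)
        self.outstanding = {i: 0 for i in self.instances}  # in-flight requests per instance

    def pick(self) -> str:
        a, b = random.sample(self.instances, 2)            # two random candidates
        return a if self.outstanding[a] <= self.outstanding[b] else b

    def send(self, request) -> str:
        target = self.pick()
        self.outstanding[target] += 1    # decrement when the response arrives (not shown)
        return target

lb = P2CBalancer(["10.0.0.1:443", "10.0.0.2:443", "10.0.0.3:443"])
print(lb.send({"path": "/viewHome_timeline"}))
```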