
System Design

Fundamentals

By: Ahmed Khalafallah


The #1 System Design Method consists of three key stages:
Architectural Design, Logical Design, and Physical Design.

1. Architectural Design (High-Level Design)


This phase defines the big picture of the system, focusing on the
system's overall structure and interaction between components.
Key Components:
• System Components: Identify core elements (e.g.,
frontend, backend, databases, caching, load balancers).
• Scalability Strategy: Choose monolithic or microservices
architecture.
• Communication Model: Define interactions between
services (REST, gRPC, WebSockets).
• High-Level Diagram: Sketch the system's flow, showing
how components interact.
Example: Designing a Scalable Social Media Platform
• Clients (Web, Mobile Apps)
• Load Balancer (Nginx, AWS ELB)
• Application Servers (Node.js, NestJS, Express)
• Databases (PostgreSQL, MongoDB, Redis for caching)
• Message Queue (Kafka, RabbitMQ for async processing)
• CDN (Cloudflare for static content delivery)
2. Logical Design (Detailed Component-Level Design)
This phase focuses on data flow, APIs, database schema, and
system interactions.
Key Aspects:
• Database Schema Design: Tables, relationships, indexing,
normalization.
• API Endpoints & Data Flow: Define RESTful APIs, GraphQL,
or gRPC endpoints.
• Caching Strategy: Plan Redis/Memcached usage for fast
data retrieval.
• Security Considerations: Authentication (OAuth, JWT),
authorization (RBAC).
Example: Database Design for a Social Media Platform
• Users Table (id, name, email, password, profile picture)
• Posts Table (id, user_id, content, timestamp, likes_count)
• Comments Table (id, post_id, user_id, text, timestamp)
• Indexes & Sharding Strategy (indexing on frequently
queried fields like user_id).
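The tables above can be sketched as SQL DDL. A minimal runnable illustration using Python's built-in sqlite3 (the in-memory database stands in for PostgreSQL; column names follow the lists above):

```python
import sqlite3

# In-memory database for illustration; a production system would use
# PostgreSQL or similar, but the schema shape is the same.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT NOT NULL UNIQUE,
    password_hash TEXT NOT NULL,
    profile_picture TEXT
);
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    content TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    likes_count INTEGER DEFAULT 0
);
CREATE TABLE comments (
    id INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES posts(id),
    user_id INTEGER NOT NULL REFERENCES users(id),
    text TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Index the frequently queried foreign keys, as suggested above.
CREATE INDEX idx_posts_user_id ON posts(user_id);
CREATE INDEX idx_comments_post_id ON comments(post_id);
""")
conn.execute("INSERT INTO users (name, email, password_hash) VALUES (?, ?, ?)",
             ("alice", "alice@example.com", "hashed"))
conn.execute("INSERT INTO posts (user_id, content) VALUES (1, 'hello world')")
# The index on user_id makes this lookup cheap at scale.
rows = conn.execute("SELECT content FROM posts WHERE user_id = 1").fetchall()
```

Sharding (splitting these tables across instances by user_id ranges or hashes) layers on top of the same schema.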
3. Physical Design (Infrastructure & Deployment)
This phase defines the hardware, cloud services, and
networking details to implement the logical design.
Key Aspects:
• Hosting & Deployment: AWS, GCP, Azure, Kubernetes,
Docker.
• Database Deployment: PostgreSQL (RDS), MongoDB Atlas,
Redis.
• Load Balancing & Auto Scaling: AWS ALB, Kubernetes
Horizontal Pod Autoscaler.
• Logging & Monitoring: Prometheus, Grafana, ELK Stack.
• Fault Tolerance & Disaster Recovery: Replication,
backups, multi-region deployment.
Example: Deploying a Scalable System on AWS
• Frontend: Hosted on S3 with CloudFront CDN.
• Backend: Deployed on AWS EC2 with auto-scaling.
• Database: PostgreSQL on AWS RDS with read replicas.
• Caching: Redis on AWS ElastiCache.
• Logging & Monitoring: CloudWatch, Prometheus, Grafana.
The #1 system design method follows a structured approach to designing scalable, reliable, and efficient systems. Here's a step-by-step systematic method to tackle any system design problem:

1. Understand the Requirements (Functional & Non-Functional)
• Clarify: Ask questions to remove ambiguity (e.g., "How
many users?", "What is the expected latency?")
• Define Scope: Prioritize core features before optimizations.
• Identify Constraints: Storage limits, number of requests
per second, response time, etc.

2. Establish High-Level Design


• Choose between Monolithic vs. Microservices
Architecture.
• Identify data flow (how requests move from client to
database).
• Sketch a simple block diagram (Client → Load Balancer →
App Servers → Database).

3. Define Database Design


• Choose SQL or NoSQL based on requirements.
• Design ER diagrams, define indexes, and apply
normalization/denormalization if needed.
• Consider Sharding, Replication, and Partitioning
strategies.

4. Caching Strategy
• Use Redis/Memcached to reduce database load.
• Define cache eviction policies (LRU, LFU, TTL).
• Determine what to cache (e.g., user sessions, API
responses, frequently accessed data).
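As a sketch of the eviction policies named above (LRU plus TTL), here is a minimal in-process cache; in practice Redis or Memcached provide the same policies out of process, shared across servers:

```python
import time
from collections import OrderedDict

class LRUCache:
    """Minimal in-process LRU cache with a per-entry TTL."""
    def __init__(self, capacity=128, ttl=60.0):
        self.capacity = capacity
        self.ttl = ttl
        self._data = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() > expires_at:   # TTL expired: treat as a miss
            del self._data[key]
            return None
        self._data.move_to_end(key)         # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (value, time.monotonic() + self.ttl)
        if len(self._data) > self.capacity:  # evict least recently used
            self._data.popitem(last=False)

cache = LRUCache(capacity=2, ttl=60.0)
cache.put("user:1", {"name": "alice"})
cache.put("user:2", {"name": "bob"})
cache.get("user:1")                      # touch user:1, so user:2 is now LRU
cache.put("user:3", {"name": "carol"})   # over capacity: evicts user:2
```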

5. Load Balancing & Scaling


• Vertical Scaling vs. Horizontal Scaling.
• Use Load Balancers (Round Robin, Least Connections).
• Auto-scaling based on traffic spikes.
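The two balancing strategies named above can be sketched in a few lines (server names are illustrative):

```python
import itertools

class RoundRobinBalancer:
    """Round Robin: cycle through servers in a fixed order."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)
    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Least Connections: pick the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}
    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server
    def release(self, server):
        self.active[server] -= 1

rr = RoundRobinBalancer(["app1", "app2", "app3"])
picks = [rr.pick() for _ in range(4)]    # wraps around after app3

lc = LeastConnectionsBalancer(["app1", "app2"])
first, second = lc.pick(), lc.pick()     # one connection to each
lc.release(first)                        # first server frees up...
third = lc.pick()                        # ...so it gets the next request
```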

6. Asynchronous Processing & Queues


• Implement message queues (Kafka, RabbitMQ, SQS) for
event-driven architecture.
• Use background workers for non-blocking tasks (e.g.,
sending emails, processing videos).
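A minimal sketch of the background-worker pattern using Python's standard library; the in-process queue stands in for Kafka/RabbitMQ/SQS, and the job payloads are illustrative:

```python
import queue
import threading

# The web request enqueues a job and returns immediately;
# worker threads process jobs off the queue in the background.
jobs = queue.Queue()
results = []

def worker():
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut this worker down
            break
        kind, payload = job
        results.append(f"processed {kind}:{payload}")  # e.g. actually send the email
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()

jobs.put(("email", "welcome@example.com"))   # non-blocking "send email"
jobs.put(("video", "clip-42"))               # non-blocking "process video"
jobs.join()                                  # wait until both jobs are done

for _ in threads:                            # one sentinel per worker
    jobs.put(None)
for t in threads:
    t.join()
```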

7. Security & Fault Tolerance


• Implement rate limiting (Token Bucket, Leaky Bucket).
• Use CDNs for static content delivery.
• Apply authentication & authorization (OAuth, JWT, role-based access control).
• Plan for disaster recovery & backups.

8. Monitoring & Logging


• Use logging & monitoring tools (ELK Stack, Prometheus,
Grafana).
• Implement distributed tracing for debugging in
microservices.
• Set up alerting mechanisms for failures.

9. Performance Optimization
• Reduce latency (optimize database queries, use CDNs).
• Minimize network bandwidth usage (compression,
efficient payloads).
• Optimize database writes (batch processing, eventual
consistency).

10. Trade-Off Analysis & Final Refinement


• Evaluate CAP theorem (Consistency, Availability, Partition
Tolerance).
• Choose the right compromises (latency vs. consistency,
cost vs. scalability).
• Prepare for future scaling needs.
System Design Steps: here is the step-by-step breakdown of the #1 System Design Method, following the structure above:

Step 1: Requirements Clarification


Before designing anything, clarify the requirements to remove
ambiguity and understand the scope.
1.1 Functional Requirements (What the system should do)
• Identify core features (e.g., for Twitter: users should post
tweets, follow others, like/retweet).
• Define API endpoints & expected behavior.
• Identify real-time, batch processing, or background tasks.
1.2 Non-Functional Requirements (How the system should
perform)
• Scalability: How many users? Expected requests per
second?
• Latency: Does the system need responses in milliseconds?
• Availability: 99.99% uptime? Disaster recovery strategy?
• Consistency vs. Availability Trade-offs (CAP Theorem):
Strong or eventual consistency?
• Security & Compliance: OAuth, GDPR compliance, rate
limiting.
1.3 Constraints & Edge Cases
• Storage capacity? (e.g., how much data is stored per user?)
• Network bandwidth limitations?
• Handling high traffic (Black Friday, viral posts, etc.)

Example: Designing a URL Shortener (like Bitly)


Functional Requirements:
• Generate short URLs
• Redirect users to the original URL
• Track analytics (click count, country, browser info)
Non-Functional Requirements:
• Must handle 1 million URL generations per day
• Response time < 100ms
• High availability (99.99% uptime)

Step 2: Estimations of Key Components

Estimating critical system metrics helps understand scalability requirements.

2.1 Traffic Estimation

• Determine the number of users and requests per second (RPS).


• Example: If there are 1 million active users, and each user
makes 5 requests per day:
o Total requests per day = 1M × 5 = 5M
o Requests per second (RPS) ≈ 60

2.2 Storage Estimation

• Formula: Number of records × size per record


• Example: If a URL shortener stores 1 billion URLs, each taking
100 bytes:
o Total storage required = 100GB

2.3 Bandwidth Estimation

• Formula: Requests per second × response size


• Example: If a URL shortener sends 500 bytes per response with
1000 RPS:
o Bandwidth required ≈ 500KB/sec
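The three estimates above are plain arithmetic; as a sketch in Python:

```python
# Back-of-envelope numbers from the examples above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86_400

# 2.1 Traffic: 1M active users x 5 requests/day
requests_per_day = 1_000_000 * 5
rps = requests_per_day / SECONDS_PER_DAY      # ~58, rounded to ~60 above

# 2.2 Storage: 1B URLs x 100 bytes per record
storage_bytes = 1_000_000_000 * 100
storage_gb = storage_bytes / 10**9            # 100 GB

# 2.3 Bandwidth: 1000 RPS x 500 bytes/response
bandwidth_bytes_per_sec = 1000 * 500          # 500 KB/sec
```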

2.4 Database Estimation

• Read vs. Write ratio (e.g., 80% reads, 20% writes).


• Indexing strategy for efficient retrieval.

Example: Video Streaming Service (YouTube-like system)

• 10 million daily active users.


• Each user watches 5 videos/day, with an average video size of
100MB.
• Data streamed per day = 10M × 5 × 100MB = 5 Petabytes/day (storage for the video catalog itself is estimated separately).

Step 3: Data Flow Design

This step defines how data moves through the system.

3.1 Identify the Actors

• Users → Interact with the system via web or mobile apps.


• Backend Services → API Gateway, Authentication, Database,
Caching.
• External Services → CDN, Payment Gateway, Third-party
APIs.

3.2 Define Request-Response Flow


Example: URL Shortener

1. User enters a long URL, frontend sends a request to the backend.
2. Backend checks cache (Redis) for an existing short URL.
3. If not found, backend generates a unique hash and saves it in
the database.
4. Backend sends a response to the user with the short URL.
5. When a user visits the short URL, the system redirects them to
the original URL.
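Steps 2-5 above can be sketched in Python. The dictionaries standing in for Redis and the database are hypothetical, and base62-encoding a database sequence number is one common way to generate the unique short code:

```python
import string

# 0-9, a-z, A-Z: 62 characters, so short codes stay compact.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_base62(n: int) -> str:
    """Turn a unique numeric ID (e.g. a DB sequence) into a short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

# Hypothetical in-memory stores standing in for Redis and the database.
cache, db = {}, {}
next_id = 125

def shorten(long_url: str) -> str:
    global next_id
    if long_url in cache:                 # step 2: check the cache first
        return cache[long_url]
    code = encode_base62(next_id)         # step 3: generate a unique code...
    next_id += 1
    db[code] = long_url                   # ...and persist the mapping
    cache[long_url] = code
    return code

def redirect(code: str) -> str:
    return db[code]                       # step 5: resolve code -> original URL
```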

3.3 API Call Sequence

Example: User Posts a Tweet (Twitter-like system)

1. Client sends a POST /tweet request.


2. API Gateway routes the request to the Tweet Service.
3. Tweet Service stores data in the database (PostgreSQL).
4. Caching Layer (Redis) stores recent tweets for fast retrieval.
5. Message Queue (Kafka) notifies followers asynchronously.
6. Users see the tweet on their timeline almost instantly.

Example: E-commerce Checkout Flow (Amazon-like system)

1. User clicks "Checkout," frontend sends a request.


2. Backend validates cart & inventory.
3. Backend calls the Payment Service (Stripe, PayPal).
4. If successful, Order Service updates the order status.
5. Email Service sends a confirmation email.
6. Background workers notify the warehouse for packaging.
Step 4: High-Level Components Design

Now that we have clarified requirements, estimated key components, and mapped the data flow, it's time to design the high-level architecture of the system.

4.1 Identify the Core Components

Every system consists of different components interacting with each other. The goal is to break down the system into independent, scalable, and fault-tolerant services.

Common components in most systems:

• Clients → Web, Mobile Apps


• API Gateway → Routes requests to appropriate services
• Authentication & Authorization Service → Handles OAuth,
JWT, RBAC
• Microservices / Monolithic Backend → Business logic handling
• Database (SQL / NoSQL) → Data storage
• Caching Layer (Redis, Memcached) → Faster data retrieval
• Load Balancer → Distributes traffic across servers
• Message Queue (Kafka, RabbitMQ) → Handles async tasks
• CDN (Cloudflare, AWS CloudFront) → Serves static content
• Logging & Monitoring (Prometheus, ELK) → Tracks system
health

Example: High-Level Components for a URL Shortener

1. Client (Browser/Mobile App) → Sends long URL


2. API Gateway → Forwards request to backend
3. URL Service → Generates & stores short URL
4. Database (PostgreSQL, DynamoDB) → Stores URL mappings
5. Cache (Redis) → Fast lookup for short URLs
6. Analytics Service → Tracks click count, location
7. CDN (Cloudflare) → Reduces load on origin servers

4.2 Define the Interaction Between Components

• User → API Gateway → Authentication, Request Routing


• API Gateway → Backend Services → Routes requests to
microservices
• Backend Services → Database & Cache → Reads/writes data
efficiently
• Backend Services → Message Queues → Handles background
tasks (e.g., notifications)
• Load Balancer → Backend Instances → Distributes load across
servers

Example: High-Level Architecture for a Twitter-like System

1. Clients (Web/Mobile) → Users post tweets


2. Load Balancer → Routes traffic
3. API Gateway → Manages authentication, rate limiting
4. User Service → Manages users & profiles
5. Tweet Service → Handles tweet creation & storage
6. Timeline Service → Fetches & caches relevant tweets
7. Database (PostgreSQL, MongoDB, Redis) → Stores tweets &
user data
8. Kafka (Message Queue) → Notifies followers asynchronously
9. CDN → Stores images & videos
4.3 Choose System Architecture (Monolith vs. Microservices)

Monolithic Architecture

• Everything is in one large codebase


• Easier to develop initially but harder to scale
• Best for small apps

Microservices Architecture

• Splits system into independent services


• Each service scales independently
• Requires API Gateway, service discovery, monitoring

Example: Uber System Design

• Monolithic (Initial) → Single backend for users, rides, payments
• Microservices (Scaling) →
o User Service (Handles authentication & profiles)
o Ride Matching Service (Assigns drivers)
o Payment Service (Handles billing, refunds)
o Notification Service (Sends SMS, push notifications)

4.4 Example: High-Level Design for an E-Commerce Platform

Scenario: Building an Amazon-like Store

Components & Responsibilities:

1. Clients → Web & Mobile apps


2. API Gateway → Manages authentication, request routing
3. User Service → Handles sign-ups, authentication
4. Product Service → Stores product details, inventory
5. Cart Service → Manages shopping cart
6. Order Service → Handles order processing
7. Payment Service → Connects to Stripe/PayPal
8. Notification Service → Sends order confirmations
9. Database → PostgreSQL for transactions, Redis for caching
10. CDN (Cloudflare) → Serves product images

Data Flow:

1. User searches for a product → API Gateway → Product Service → Database
2. User adds product to cart → Cart Service → Database
3. User proceeds to checkout → Order Service → Payment
Service
4. Payment is confirmed → Order is placed → Notification
Service sends email
5. Order is processed → Warehouse receives shipment request

Step 5: Identifying and Addressing Bottlenecks

Bottlenecks are performance limitations in a system that can slow down response times, reduce scalability, or cause failures under high load. Identifying and mitigating them is crucial for building a robust system.

5.1 Common Bottlenecks in System Design

1. Database Bottlenecks

• Issue: Slow queries, high read/write latency, database locking, or limited scalability.
• Solutions:
o Index frequently queried fields to speed up lookups.
o Use read replicas for distributing read traffic.
o Implement sharding to distribute data across multiple
instances.
o Use caching (Redis, Memcached) to reduce database
hits.

2. High Latency in API Calls

• Issue: Slow response times from external APIs or internal microservices.
• Solutions:
o Use asynchronous processing with message queues
(Kafka, RabbitMQ).
o Implement circuit breakers to prevent cascading
failures.
o Use gRPC instead of REST for high-performance
service-to-service communication.
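A circuit breaker, mentioned in the solutions above, can be sketched as follows; the thresholds and fail-fast exception are illustrative:

```python
import time

class CircuitBreaker:
    """After max_failures consecutive failures the circuit opens and calls
    fail fast until reset_after seconds pass, preventing cascading failures."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: let one call through
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0                  # success resets the count
        return result

breaker = CircuitBreaker(max_failures=2, reset_after=30.0)

def flaky():                               # stands in for a failing upstream
    raise ConnectionError("upstream down")

tripped = 0
for _ in range(3):
    try:
        breaker.call(flaky)
    except (ConnectionError, RuntimeError):
        tripped += 1
# The third call never reaches the upstream: the open circuit fails fast.
```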

3. Caching Issues

• Issue: Inefficient or stale cache leading to performance drops.


• Solutions:
o Implement cache eviction policies (LRU, LFU) to
manage memory effectively.
o Use write-through caching to keep cache and database
in sync.
o Optimize cache keys to avoid unnecessary cache misses.

4. Load Balancing Issues

• Issue: Uneven traffic distribution leading to overloading certain servers.
• Solutions:
o Use Round Robin, Least Connections, or IP Hashing
load balancing strategies.
o Implement sticky sessions if required for user session
consistency.
o Deploy multiple load balancers to avoid single points of
failure.

5. Network Bottlenecks

• Issue: Slow data transmission, high packet loss, or network congestion.
• Solutions:
o Use CDNs (Cloudflare, Akamai) to serve static content
from edge locations.
o Optimize API payloads using compression (Gzip,
Brotli).
o Use WebSockets for real-time bidirectional
communication instead of polling.

6. Storage & File System Issues

• Issue: File system limitations causing delays in retrieving or storing large files.
• Solutions:
o Use distributed file storage (Amazon S3, Google Cloud
Storage).
o Implement data partitioning for efficient file retrieval.
o Use streaming processing instead of batch processing
for large files.

7. Scalability Challenges

• Issue: System struggles under high traffic due to monolithic architecture.
• Solutions:
o Migrate to microservices to distribute workloads.
o Implement auto-scaling to dynamically add/remove
instances based on load.
o Use event-driven architecture to handle spikes in
traffic.

5.2 Example: Bottlenecks in a Video Streaming Service (YouTube-like System)

• Database Load: slow video metadata retrieval → use read replicas and NoSQL databases for metadata storage.
• High API Latency: slow video recommendations → precompute recommendations and cache them using Redis.
• High Network Load: streaming delays → use CDNs to deliver videos from edge servers.
• Storage Scalability: large video storage requirements → use distributed object storage like S3.
• Load Balancer Overload: uneven traffic distribution → implement multiple regional load balancers.

5.3 Monitoring & Optimization

To detect and fix bottlenecks, use:

• Monitoring tools: Prometheus, Grafana, Datadog.


• Profiling tools: New Relic, Jaeger for tracing service calls.
• Performance testing: Load tests using Apache JMeter or k6.
Key System Design Concepts
When designing scalable, efficient, and maintainable systems,
you need to consider various architectural concepts. Here are
some of the most important ones:

1. Horizontal vs. Vertical Scaling


Scaling is about increasing a system’s capacity to handle more
load.
• Vertical Scaling (Scaling Up)
o Upgrading an existing server (more RAM, CPU, or
storage).
o Easier to implement but has hardware limits.
o Example: Adding more memory and CPUs to a single
database server to handle more queries.
• Horizontal Scaling (Scaling Out)
o Adding more machines to distribute the load.
o Requires a load balancer to direct traffic efficiently.
o More complex but offers better fault tolerance and
unlimited scalability.
o Example: Instead of upgrading one database server,
deploy multiple database servers and distribute the
queries across them.
2. Microservices Architecture
Instead of building a monolithic application where everything is
tightly coupled, microservices break the system into smaller,
independent services that communicate via APIs.
• Advantages:
o Easier to scale specific components individually.
o Each service can use different technologies (polyglot
architecture).
o Faster development and deployment cycles.
• Challenges:
o Requires strong API design and service
communication (REST, gRPC, or event-driven
messaging).
o More complex to monitor and manage compared to
monoliths.
Example:
A large e-commerce platform might have separate microservices
for user authentication, product catalog, orders, and payments,
each running independently.

3. Proxy Servers
A proxy server acts as an intermediary between clients and
backend servers, improving security, performance, and
scalability.
• Forward Proxy
o Used by clients to access external servers.
o Common in corporate networks for filtering web
access.
o Example: A company’s proxy that blocks access to
social media sites.
• Reverse Proxy
o Sits in front of backend servers to distribute traffic.
o Helps with load balancing, caching, and security
(e.g., preventing direct exposure of backend servers).
o Example: Nginx or HAProxy acting as a reverse proxy
for multiple backend microservices.

4. Load Balancing
Distributes incoming network traffic across multiple servers to
prevent overload and improve reliability.
• Types of Load Balancing:
o DNS Load Balancing – Different IP addresses for a
domain, resolved randomly.
o Application Layer (L7) Load Balancer – Routes traffic
based on request content (e.g., API gateway).
o Network Layer (L4) Load Balancer – Routes traffic at
the transport level (IP, TCP, UDP).
• Example:
o A web application with millions of users can use a
load balancer to distribute traffic across multiple
backend servers, preventing downtime if one server
fails.

5. Caching
Stores frequently accessed data in memory to reduce database
load and improve response times.
• Types of Caching:
o Client-Side Caching – Browsers cache static assets
like images, CSS, and JavaScript.
o Server-Side Caching – Databases and APIs cache
responses (e.g., Redis or Memcached).
o CDN (Content Delivery Network) – Caches static
content at edge locations worldwide.
• Example:
o A social media app caches frequently viewed user
profiles in Redis to avoid repeatedly querying the
database.

6. Database Sharding
Splitting a large database into smaller, more manageable parts
(shards) to improve performance and scalability.
• Horizontal Sharding (Range-Based or Hash-Based)
o Example: A user database is divided based on user
IDs, where users with IDs 1-100K go to one shard, and
100K-200K go to another.
• Vertical Sharding
o Example: Storing user profiles in one database and
user transactions in another.
Challenge:
• Requires a shard key to efficiently route queries.
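Hash-based shard routing with a shard key can be sketched in a few lines (the shard names and the choice of MD5 are illustrative):

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(user_id: int) -> str:
    """Hash the shard key (user_id) and take it modulo the shard count.
    Deterministic, so every query for the same user hits the same shard."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always routes to the same shard:
a = shard_for(42)
b = shard_for(42)
```

Note that plain modulo sharding remaps most keys when the shard count changes; consistent hashing (below in this list of concepts) addresses that.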

7. Event-Driven Architecture
Instead of synchronous request-response interactions,
components communicate asynchronously using events.
• Message Brokers (e.g., Kafka, RabbitMQ) handle event
distribution.
• Example:
o An e-commerce system publishes an "Order Placed"
event, which different services (inventory, shipping,
billing) consume asynchronously.
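A minimal in-process publish-subscribe sketch of the "Order Placed" example; in a real system Kafka or RabbitMQ would sit between the publisher and the consumers:

```python
from collections import defaultdict

# Tiny in-process event bus: services subscribe to an event name,
# and publishing fans the event out to every subscriber.
subscribers = defaultdict(list)
handled = []

def subscribe(event_name, handler):
    subscribers[event_name].append(handler)

def publish(event_name, payload):
    for handler in subscribers[event_name]:
        handler(payload)

# Inventory, shipping, and billing each consume the same event.
subscribe("OrderPlaced", lambda o: handled.append(("inventory", o["order_id"])))
subscribe("OrderPlaced", lambda o: handled.append(("shipping", o["order_id"])))
subscribe("OrderPlaced", lambda o: handled.append(("billing", o["order_id"])))

publish("OrderPlaced", {"order_id": 1001})
```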

8. Rate Limiting
Controls the number of requests a user or client can make in a
given time period to prevent abuse (e.g., API rate limiting).
• Example:
o A login system limits failed attempts to 5 per minute
to prevent brute-force attacks.
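The Token Bucket algorithm can be sketched as follows, configured like the login example above (5 attempts per minute):

```python
import time

class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`;
    each request spends one token or is rejected."""
    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 5 attempts per minute: the sixth immediate attempt is rejected.
bucket = TokenBucket(capacity=5, rate=5 / 60)
allowed = [bucket.allow() for _ in range(6)]
```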

9. Consistent Hashing
A technique for distributing data across multiple servers in a way
that minimizes key redistribution when adding or removing
servers.
• Example:
o A caching system (e.g., Redis cluster) uses consistent
hashing to distribute user sessions across multiple
nodes efficiently.
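A minimal consistent-hash ring in Python; the node names and virtual-node count are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Nodes are placed at several points on a hash ring (virtual nodes);
    a key maps to the first node clockwise from its hash. Removing a node
    only remaps the keys that pointed at that node."""
    def __init__(self, nodes, vnodes=100):
        self.ring = []                   # sorted list of (hash, node)
        for node in nodes:
            self.add(node, vnodes)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node, vnodes=100):
        for i in range(vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
before = {f"session:{i}": ring.get(f"session:{i}") for i in range(100)}
ring.remove("cache-b")
after = {k: ring.get(k) for k in before}
# Only the keys that lived on cache-b moved; the rest stayed put.
moved = [k for k in before if before[k] != after[k]]
```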

10. CAP Theorem


When designing distributed databases, you can choose only two
of the following three guarantees:
1. Consistency (C) – Every read gets the latest write.
2. Availability (A) – Every request receives a response, even if
some nodes are down.
3. Partition Tolerance (P) – The system continues operating
despite network failures.
• Example:
o Traditional relational databases (SQL) prioritize
Consistency over Availability.
o NoSQL databases (e.g., Cassandra) prioritize
Availability and Partition Tolerance over strict
Consistency.
Communicating between services
1. Messaging Queues
A message queue is a system that allows services to
communicate asynchronously by sending and receiving messages
via a queue. It ensures that messages are processed reliably, even
if some services are temporarily unavailable.
How It Works
1. Producers send messages to a queue.
2. The message broker stores these messages.
3. Consumers (workers) pull messages from the queue and
process them.
4. The broker ensures messages are processed at least once
(or exactly once, depending on configuration).
Example Use Cases
• Task Processing: A web application offloads heavy
computations (e.g., video processing, email sending) to a
queue, where background workers process them
asynchronously.
• Order Processing: In an e-commerce app, when a user
places an order, a message is added to a queue, and
different services (payment, inventory, shipping) pick it up
for processing.
• Load Balancing: Distributing workload among multiple
workers prevents a single server from being overloaded.
Popular Messaging Queues
• RabbitMQ: Uses the AMQP protocol; supports complex
routing.
• Amazon SQS: Fully managed queue service in AWS.
• Apache ActiveMQ: Java-based queue system.
• Redis Lists (as a queue): Simple and fast, but lacks
advanced features.

2. Event Brokers (Event-Driven Architecture)


An event broker is a system that enables asynchronous, event-driven communication between services. It is commonly used in microservices architectures.
How It Works
1. Producers (Publishers) send events to an event broker.
2. The event broker distributes events to subscribed
consumers (Subscribers).
3. Consumers process the event and may trigger further
actions.
Difference Between Messaging Queues & Event Brokers

• Message Handling: a messaging queue is point-to-point (one consumer processes each message); an event broker is publish-subscribe (multiple consumers can receive the same event).
• Persistence: queue messages are removed once consumed; broker events may be stored for replay.
• Use Case: queues suit task processing (e.g., background jobs); brokers suit event-driven architecture (e.g., real-time notifications).

Example Use Cases


• User Signup Event: A "UserRegistered" event triggers
multiple services (e.g., sending a welcome email,
generating a profile, assigning default settings).
• IoT Data Streaming: Sensors publish temperature data to
an event broker, which distributes it to analytics services.
• Stock Market Updates: Real-time stock price updates are
published and consumed by trading applications.
Popular Event Brokers
• Apache Kafka: Distributed, scalable, and used for high-throughput event streaming.
• RabbitMQ (with Pub/Sub mode): Can work as both a
queue and event broker.
• Google Pub/Sub: Fully managed messaging service by
Google.
• Amazon EventBridge: Serverless event bus for AWS
services.
When to Use Messaging Queues vs. Event Brokers?

• Background processing (e.g., emails, reports) → Messaging Queue
• Real-time event notifications → Event Broker
• Task distribution (load balancing across workers) → Messaging Queue
• Logging and analytics → Event Broker
• Ensuring exactly one consumer processes a message → Messaging Queue
• Broadcasting updates to multiple services → Event Broker
