System Design
A system is a set of technologies that serves users to fulfill a set of requirements.
What is design?
"design" refers to the process of creating a detailed plan or blueprint for how the system will
be structured and how its components will work together to achieve the desired goals.
System design is the process of designing the elements of the system, such as the architecture,
modules and components, the interfaces between those components, and the data that flows
through the system.
Functional requirements for Twitter: post tweet, like tweet, follow, unfollow, etc.
Non-functional requirements for Twitter: reliable, highly available, secure, consistent, and low latency.
Architecture:
Monolithic architecture: the approach in which the front-end, back-end, and database of a web
application are written as a single unit, in a single codebase.
There is only a single codebase.
A monolithic system is also known as a centralized system.
Advantages:
In a monolithic architecture, all the modules are present in a single system, so they require
fewer network calls.
Integration testing is easier.
Less confusion.
Disadvantages:
If there is a bug in a single module, it can bring down the whole system.
Whenever a single module is updated, the whole system needs to be redeployed to reflect the
changes to the users.
Distributed System
A distributed system is a collection of multiple individual systems connected through a
network that share resources, communicate, and coordinate to achieve common goals.
Modules live in different codebases and are connected through a network; they share
resources, communicate, and coordinate to achieve common goals.
Advantages:
Scalable.
Low latency.
No single point of failure.
Disadvantages:
More complex to design, deploy, and debug; harder to keep data consistent across nodes; network partitions and partial failures must be handled.
Scalability
Increasing the performance of a system so it can handle a larger number of transactions or
serve more units of work.
A service is said to be scalable if, when we increase the resources in a system, it results in
increased performance in a manner proportional to the resources added. Increasing
performance in general means serving more units of work, but it can also mean handling larger
units of work, such as when datasets grow. Introducing redundancy is important for
defending against failures. An always-on service is said to be scalable if adding resources
to facilitate redundancy does not result in a loss of performance.
If you have a performance problem, your system is slow for a single user.
If you have a scalability problem, your system is fast for a single user but slow
under heavy load.
Latency vs throughput
Latency is the time to perform some action or to produce some result.
Throughput is the number of such actions or results per unit of time.
Generally, you should aim for maximal throughput with acceptable latency.
[Diagram: a user sending requests T1, T2, T3 to a server]
The following manufacturing example should clarify these two concepts:
An assembly line is manufacturing cars. It takes eight hours to manufacture a car, and the
factory produces one hundred and twenty cars per day. The latency is 8 hours per car; the
throughput is 120 cars per day.
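A quick sketch of the arithmetic, assuming (hypothetically) the line is split into 40 stations of 12 minutes each; the point is that pipelining lets throughput be much higher than one car per latency period:

```python
# Hypothetical breakdown of the assembly-line example (numbers assumed for illustration).
# A single car still takes 8 hours end to end (latency), but because the line is
# pipelined, a finished car rolls off the line at every station interval (throughput).

stations = 40                 # assumed number of pipeline stages
minutes_per_station = 12      # assumed time per stage

latency_hours = stations * minutes_per_station / 60    # 8.0 hours per car
throughput_per_day = 24 * 60 / minutes_per_station     # 120 cars per day

print(f"Latency: {latency_hours} hours per car")
print(f"Throughput: {throughput_per_day:.0f} cars per day")
```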
CAP theorem
1. Consistency: All nodes in the distributed system have the same data at the
same time. In other words, when a change is made to one part of the system,
all other parts immediately see that change.
2. Availability: Every request made to the system gets a response, either
successful or failed. The system is always up and operational, providing timely
responses to client requests.
3. Partition tolerance: The system continues to function even if network
communication between nodes is unreliable or completely breaks down. This
means the system can handle and recover from network failures and splits.
Consistency patterns
With multiple copies of the same data, we are faced with options on how to
synchronize them so clients have a consistent view of the data. Recall the definition
of consistency from the CAP theorem - Every read receives the most recent write or
an error.
Weak consistency
After a write, reads may or may not see it. A best effort approach is taken.
This approach is seen in systems such as Memcached. Weak consistency works well
in real time use cases such as VoIP, video chat, and real-time multiplayer games. For
example, if you are on a phone call and lose reception for a few seconds, when you
regain connection, you do not hear what was spoken during connection loss.
Weak consistency is suitable for use cases where high availability and low-latency
access are critical, and occasional data conflicts can be tolerated.
Eventual consistency
Eventual consistency is a consistency model that allows data replicas in a distributed
system to become consistent over time without the need for immediate
synchronization.
This approach is seen in systems such as DNS and email. Eventual consistency works
well in highly available systems.
Strong Consistency
Strong consistency is the strongest and most rigid consistency model. It ensures that
all read operations return the most recent write, and all replicas remain consistent at
all times.
Data is replicated synchronously. This approach is seen in file systems and RDBMS.
Strong consistency works well in systems that need transactions.
Availability patterns
There are two complementary patterns to support high availability:
1. Fail-over
Example 2
In a database solution with multiple IBM® Data Servers, if one database becomes
unavailable, the database manager can reroute database applications that were
connected to the database server that is no longer available to a secondary
database server.
The two most common failover strategies on the market are known as idle standby
and mutual takeover:
Idle Standby
In this configuration, a secondary (standby) system sits idle and takes over the workload only
when the primary system fails.
Mutual Takeover
In this configuration, there are multiple systems, and each system is the designated
secondary for another system. When a system fails, the secondary for the failed system
must continue to process its own workload as well as the workload of the failed system.
2. Replication
Master-slave replication
The master serves reads and writes, replicating writes to one or more slaves, which
serve only reads. Slaves can also replicate to additional slaves in a tree-like fashion. If
the master goes offline, the system can continue to operate in read-only mode until
a slave is promoted to a master or a new master is provisioned.
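A minimal sketch of how application code might route queries under master-slave replication; the connection strings and the helper below are hypothetical, not any specific driver's API:

```python
import random

# Hypothetical connection handles; in practice these come from your DB driver or pool.
master = "conn://db-master"
replicas = ["conn://db-replica-1", "conn://db-replica-2"]

def get_connection(query: str) -> str:
    """Route writes to the master and reads to a randomly chosen replica."""
    is_write = query.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}
    return master if is_write else random.choice(replicas)

print(get_connection("SELECT * FROM users"))            # served by a replica
print(get_connection("INSERT INTO users VALUES (1)"))   # served by the master
```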
Disadvantage(s): master-slave replication
Additional logic is needed to promote a slave to a master.
Master-Master replication
Both masters serve reads and writes and coordinate with each other on writes. If
either master goes down, the system can continue to operate with both reads and
writes.
Disadvantage(s): master-master replication
You'll need a load balancer or you'll need to make changes to your application
logic to determine where to write.
Most master-master systems are either loosely consistent (violating ACID) or
have increased write latency due to synchronization.
Conflict resolution comes more into play as more write nodes are added and
as latency increases.
Disadvantage(s) of replication (both master-master and master-slave):
There is a potential for loss of data if the master fails before any newly written data
can be replicated to other nodes.
Writes are replayed to the read replicas. If there are a lot of writes, the read replicas
can get bogged down with replaying writes and can't do as many reads.
The more read slaves, the more you have to replicate, which leads to greater
replication lag.
On some systems, writing to the master can spawn multiple threads to write in
parallel, whereas read replicas only support writing sequentially with a single thread.
Replication adds more hardware and additional complexity.
Availability in numbers
Sequence
If two components are in sequence, overall availability is the product of their availabilities.
For example, if both Foo and Bar each had 99.9% availability, their total availability in
sequence would be 99.8%.
Parallel
If two components are in parallel, the combination fails only if every component fails, so
overall availability increases. For example, if both Foo and Bar each had 99.9% availability,
their total availability in parallel would be 99.9999%.
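The standard formulas behind these numbers, shown as a small sketch (Foo and Bar are the example components from above):

```python
# Availability of two components combined in sequence vs. in parallel.
def in_sequence(a: float, b: float) -> float:
    # Both components must be up: multiply availabilities.
    return a * b

def in_parallel(a: float, b: float) -> float:
    # The combination is down only if both components are down.
    return 1 - (1 - a) * (1 - b)

foo = bar = 0.999  # 99.9% availability each

print(f"Sequence: {in_sequence(foo, bar):.4%}")  # ~99.80%
print(f"Parallel: {in_parallel(foo, bar):.4%}")  # 99.9999%
```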
Vertical scaling:
Increasing the capacity of a single machine or server by adding more resources to it. This can
involve upgrading the hardware components such as CPU, memory, storage, or network
capabilities of the existing machine.
Example: Increased Workload: When the demand on an application or system grows, the
existing hardware may not be able to handle the increased load efficiently. Vertical scaling
can be done.
Data Growth: As the size of the data being processed or stored increases, the existing
storage capacity may become insufficient.
Horizontal scaling:
Increasing the capacity of a system by adding more machines or servers to distribute the
workload.
Increased User Base: When the number of users or clients accessing an application or service
grows, horizontal scaling allows us to add more servers to distribute the load and ensure that
each user gets a responsive experience.
Big Data Processing: When dealing with large datasets or performing complex data
processing tasks, horizontal scaling can significantly improve performance.
What is a cache?
A cache is special, very high-speed memory. It stores frequently used data and reduces
operational latency.
Distributed cache:
A group of cache servers which collectively work to reduce latency.
In a system, the client makes a request and the service fulfills the request using data in the database.
A cache can store:
1. Precalculated results.
When do you load data into the cache? When do you evict data from the cache?
The rule for loading or evicting data from the cache is called a policy, so cache performance depends on your
cache policy. Two common eviction policies are LRU (Least Recently Used) and LFU (Least Frequently Used).
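A minimal sketch of LRU eviction using Python's OrderedDict; the capacity and keys are illustrative:

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return None                      # cache miss
        self.store.move_to_end(key)          # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" becomes most recently used
cache.put("c", 3)      # evicts "b"
print(cache.get("b"))  # None - it was evicted
```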
Problems with Caching
1. Extra calls: searching for data that is not present in the cache (a cache miss) adds an extra call and
increases latency.
2. Consistency: suppose you have updated a value that has not yet been updated in some cache; if we
then read the data from that stale cache, it causes serious problems.
3. Thrashing: repeatedly evicting and reloading data at a very high rate, resulting in poor
performance.
Write back: the data is updated only in the cache and written to main memory later, in a batch. Data
is written to memory only when the cache line is about to be replaced (cache line replacement is
done using Bélády's optimal algorithm, Least Recently Used, FIFO, LIFO, or other algorithms depending on
the application). There is a chance of inconsistent data. This approach is used when data is written
frequently.
Write back is a storage method in which data is written into the cache every time a change occurs,
but is written into the corresponding location in main memory only at specified intervals or under
certain conditions.
Write through: data is written into the cache and the corresponding main-memory location at the same time.
Problems:
Higher latency for write operations: write-through caching can introduce higher latency for write
operations, as the CPU has to wait for the write to complete in both the cache and the
main memory before proceeding.
In general, we use a combination of both approaches: write-through (when writes are less frequent) and write-back
(when writes are more frequent); see the sketch below.
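A rough sketch contrasting the two write policies; the `memory` and `cache` dictionaries stand in for main memory and the cache, and the flush trigger is simplified:

```python
# Simplified illustration of write-through vs. write-back (not a real cache implementation).
memory = {}          # stands in for main memory / the database
cache = {}           # the cache itself
dirty = set()        # keys written to the cache but not yet to memory (write-back only)

def write_through(key, value):
    # Write to the cache and memory at the same time: consistent, but higher write latency.
    cache[key] = value
    memory[key] = value

def write_back(key, value):
    # Write only to the cache now; memory is updated later (e.g., on eviction).
    cache[key] = value
    dirty.add(key)

def flush(key):
    # Called on eviction or at intervals: push the dirty value down to memory.
    if key in dirty:
        memory[key] = cache[key]
        dirty.discard(key)

write_back("views", 10)
print(memory.get("views"))  # None - memory is stale until the flush
flush("views")
print(memory.get("views"))  # 10
```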
We can place the cache close to the servers as well as close to the database. If you place the cache close to
the servers, how can you place it? You can place it in memory on the servers themselves. Yes,
it will be faster if we use an in-memory cache on the server, but there will be some problems. Take
this example:
1. The first problem is cache failure. Assume Application Server 1 fails; the cache fails with it,
and we lose the cached data on Application Server 1.
2. The second problem is consistency. Application Server 1's data and Application Server 2's data are
not the same; they are not in sync.
If we place the cache close to the database using a global cache, the benefit is that all servers hit
this global cache. If there is a miss, it queries the database; otherwise it returns data to the
servers. We can also maintain distributed caches here, and this maintains data consistency.
Sharding:
Dividing the database into smaller databases. Database sharding is the process of splitting up a
database across multiple machines to improve the scalability of an application.
In sharding, the data is broken into two or more smaller chunks, called logical shards.
The logical shards are then distributed across separate database nodes, referred to as physical
shards.
Need for sharding:
Suppose a database holds 100,000 student records. Without sharding, finding a student may require
scanning up to 100,000 records each time, which is very costly.
Benefits of Sharding:
• Scalable system.
Drawbacks of sharding:
• Rebalancing data: In a sharded database architecture, sometimes a shard outgrows the other
shards and becomes unbalanced, which is also known as a database hotspot. In this case any benefit
of sharding the database is cancelled out. The database would likely need to be re-sharded to
allow a more even data distribution.
Application sharding: the application itself contains an algorithm that routes each request to a particular shard (see the sketch below).
Dynamic sharding: a separate module keeps the location of every shard and tells the application where
the shard is.
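A sketch of application-level shard routing using a simple hash of the key; the shard names and count are hypothetical, and real systems often use consistent hashing instead to make rebalancing easier:

```python
import hashlib

# Hypothetical physical shards (e.g., separate database servers).
shards = ["students_db_0", "students_db_1", "students_db_2", "students_db_3"]

def shard_for(key: str) -> str:
    """Pick a shard deterministically from the key (e.g., a student ID)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]

print(shard_for("student:12345"))  # every lookup for this ID hits the same shard
```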
Asynchronous communication in system design refers to the exchange of data between two or more
systems where there is no need for all parties to respond immediately. This type of communication is
time-independent, meaning the receiver can respond to the message at its convenience.
Better reliability: The separation of interacting entities via a messaging queue makes the system
more fault tolerant. For example, a producer or consumer can fail independently without affecting
the others and restart later. Moreover, replicating the messaging queue on multiple servers ensures
the system's availability if one or more servers are down.
Granular scalability: Asynchronous communication makes the system more scalable. For example,
many processes can communicate via a messaging queue. In addition, when the number of requests
increases, we distribute the workload across several consumers.
Rate limiting: Messaging queues also help absorb load spikes and prevent services from
becoming overloaded, acting as a rudimentary form of rate limiting when there is a need to avoid
dropping any incoming request.
1. Sending many emails: Emails are used for numerous purposes, such as sharing information,
account verification, resetting passwords, marketing campaigns, and more. All of these emails
written for different purposes don't need immediate processing and, therefore, they don't disturb
the system's core functionality. A messaging queue can help coordinate a large number of emails
between different senders and receivers in such cases.
2. Recommender systems: Some platforms use recommender systems to provide preferred
content or information to a user. The recommender system takes the user's historical data, processes
it, and predicts relevant content or information. Since this is a time-consuming task, a messaging
queue can be incorporated between the recommender system and the requesting processes to improve
performance.
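A small in-process sketch of the producer/consumer idea using Python's standard queue module with one worker thread; a real system would use a broker such as RabbitMQ, Kafka, or SQS instead:

```python
import queue
import threading

email_queue = queue.Queue()  # stands in for a message broker

def producer():
    # e.g., a web request handler enqueues work and returns immediately.
    for user in ["alice@example.com", "bob@example.com"]:
        email_queue.put({"to": user, "subject": "Verify your account"})

def consumer():
    # A worker drains the queue at its own pace, decoupled from the producer.
    while True:
        msg = email_queue.get()
        if msg is None:          # sentinel value used to stop the worker
            break
        print("sending email to", msg["to"])
        email_queue.task_done()

worker = threading.Thread(target=consumer)
worker.start()
producer()
email_queue.put(None)            # signal shutdown
worker.join()
```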
DNS stands for Domain Name System. It is a hierarchical and distributed naming system that
translates domain names to their corresponding IP addresses.
DNS works like a phonebook, where names are matched with phone numbers. In the same
way, DNS translates domain names such as google.com into IP addresses that computers
can understand, such as 216.58.192.14.
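A quick way to see this translation from code, using Python's standard library (the returned IP varies by region and over time):

```python
import socket

# Ask the system's DNS resolver for the address of a domain name.
ip = socket.gethostbyname("google.com")
print(ip)  # e.g. 216.58.192.14 - the actual address varies
```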
A load balancer is a device that acts as a reverse proxy and is responsible for evenly distributing
network traffic across multiple servers. Load balancers smooth out the concurrent user experience of
the application and improve reliability.
The requests received by a load balancer are distributed among multiple servers using a configured
algorithm that could be based on:
• Round-robin
• Weighted round-robin
• Least response time
There are some other load balancing algorithms which I need to read and understand separately.
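A minimal sketch of the first two algorithms from the list above; the server names and weights are made up for illustration:

```python
import itertools

servers = ["server-1", "server-2", "server-3"]

# Round-robin: hand out servers in a fixed rotating order.
round_robin = itertools.cycle(servers)

# Weighted round-robin: servers with a higher weight appear more often in the rotation.
weights = {"server-1": 3, "server-2": 1, "server-3": 1}
weighted = itertools.cycle([s for s, w in weights.items() for _ in range(w)])

for _ in range(5):
    print("round-robin ->", next(round_robin))
for _ in range(5):
    print("weighted    ->", next(weighted))
```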
Load balancers can ensure that transaction loads are distributed evenly across all servers, preventing
any single server from becoming a bottleneck and ensuring smooth and efficient service to all users.
Load balancers ensure reliability and availability of servers around the clock by constantly monitoring
the load that each server is under and only sending requests to servers and applications that can
respond in a timely manner.
Hardware load balancers are quite expensive and are used in big organizations, while software load
balancers are less expensive but less powerful than hardware. Examples of software (cloud) load
balancing: AWS, Azure, and GCP.
Database: a structured set of data such as tables, queries, forms, and reports (not a formal definition).
Advantages:
• Consistency.
Various types of databases: hierarchical, graph, network, relational (SQL) and non-relational
(NoSQL). We will discuss only SQL vs NoSQL.
SQL VS NOSQL
SQL (RDBMS):
1. Structure: RDBMSs structure data in a 'relation' or table format of rows and columns. The
rows represent records, and the columns represent attributes.
2. Schema: RDBMSs require a pre-defined schema set up before you can add data. This means
structure and data types must be specified beforehand.
3. Scalability: RDBMSs are typically scaled vertically by adding more powerful hardware
resources.
4. ACID Transactions: RDBMSs follow the ACID properties (Atomicity, Consistency, Isolation,
Durability), which ensures that transactions are processed reliably.
5. SQL: RDBMSs use Structured Query Language (SQL) for defining and manipulating the data,
which is very powerful.
NoSQL:
1. Structure: NoSQL databases can store data in multiple ways - key-value pairs, wide-column
stores, graph databases, or document-based, within the same database.
2. Schema: NoSQL databases are schema-less, meaning you can add any type of data you want
on the fly. There is no need to define your data types before you insert data.
3. Scalability: NoSQL databases are scaled horizontally, meaning one can add more servers to
the pool to handle the increased load.
4. CAP Theorem: NoSQL databases follow the CAP theorem (Consistency, Availability, Partition
tolerance), which allows you to choose two out of three.
5. Query Language: Most NoSQL systems do not offer a language equivalent to SQL in power;
instead, queries are typically made through a custom API to the database.
Cohesion: the degree to which the elements within a module are functionally related; the degree
of related behaviors and data.
In a microservice architecture, multiple loosely coupled services work together. Each service focuses
on a single purpose and has high cohesion.
Advantages
Disadvantages
Distributed Monolith is a system that resembles the microservices architecture but is tightly
coupled within itself like a monolithic application. Adopting microservices architecture
comes with a lot of advantages. But while making one, there are good chances that we
might end up with a distributed monolith.
NOTE: The aim of microservices is not small codebases; it is to make the services loosely coupled.
API
API stands for "Applica on Programming Interface." An API acts as an intermediary that
enables different so ware systems to talk to each other and share data or func onality. It
provides a way for developers to access specific features or services of a so ware
applica on without needing to understand the internal workings of that applica on.
API gateway
An API gateway is a server that acts as a single point of entry for a set of microservices. It
receives client requests, forwards them to the appropriate microservice, and then returns the
server’s response to the client. The API gateway is responsible for tasks such as routing,
authentication, rate limiting, load balancing, caching, monitoring, and transformation.
API gateways are used for a variety of purposes in microservice architectures, including the
following:
Routing: The API gateway receives requests from clients and routes them to the appropriate
microservice. This enables clients to access the various microservices through a single entry
point, simplifying the overall system design.
Authentication and Authorization: The API gateway can be used to authenticate clients. This
helps to ensure that only authorized clients can access the microservices and helps to
prevent unauthorized access.
Rate limiting: You can rate limit client access to microservices with an API gateway. This can
help prevent denial of service attacks and other types of malicious behaviour.
Load balancing: The API gateway can distribute incoming requests among multiple instances
of a microservice, enabling the system to handle a larger number of requests and improving
its overall performance and scalability.
Caching: The API gateway can cache responses from the microservices, reducing the number
of requests that need to be forwarded to the microservices and improving the overall
performance of the system.
Monitoring: The API gateway can collect metrics and other data about requests and
responses, providing valuable insights into the performance and behaviour of the
microservices. This can help to identify and diagnose problems, and improve the overall
reliability and resilience of the system.
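A toy sketch of the routing and rate-limiting responsibilities in one place; the route table, service URLs, and limits below are all hypothetical:

```python
import time
from collections import defaultdict

# Hypothetical route table: path prefix -> backend microservice.
routes = {
    "/users": "http://user-service:8001",
    "/tweets": "http://tweet-service:8002",
}

RATE_LIMIT = 5                      # requests per client per minute (assumed)
request_log = defaultdict(list)     # client id -> recent request timestamps

def handle(client_id: str, path: str) -> str:
    now = time.time()
    # Rate limiting: drop requests beyond the per-minute budget.
    recent = [t for t in request_log[client_id] if now - t < 60]
    if len(recent) >= RATE_LIMIT:
        return "429 Too Many Requests"
    request_log[client_id] = recent + [now]

    # Routing: find the microservice that owns this path prefix.
    for prefix, backend in routes.items():
        if path.startswith(prefix):
            return f"forwarding {path} to {backend}"
    return "404 Not Found"

print(handle("client-1", "/users/42"))
print(handle("client-1", "/orders/7"))
```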
API Design
API design, which stands for Application Programming Interface design, refers to the process
of defining the rules and conventions that allow different software applications to
communicate and interact with each other.
Designing a high-quality API is crucial for ensuring its usability, flexibility, and maintainability.
Here are some best practices to consider when designing APIs:
Use Nouns for Resource Names: When defining endpoints, use nouns to represent
resources rather than verbs. For example, use /users instead of /getUsers.
Request and Response Formats: Use widely adopted data formats such as JSON or XML for
request and response payloads.
Error Handling: Provide meaningful error messages and appropriate HTTP status codes for
different scenarios to help developers troubleshoot issues effectively.
In general, there are three possible outcomes when using your API:
1. The client application behaved erroneously (client error - 4xx response code)
2. The API behaved erroneously (server error - 5xx response code)
3. The client and API worked (success - 2xx response code)
Pagination and Filtering: Use pagination to handle large result sets and filtering options to
allow clients to request specific data.
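For example, a client might request `GET /users?page=2&per_page=50&status=active`; a rough sketch of how the server side could interpret those parameters (field and parameter names assumed):

```python
def paginate_and_filter(users, page=1, per_page=50, status=None):
    """Apply the filter first, then slice out the requested page."""
    if status is not None:
        users = [u for u in users if u.get("status") == status]
    start = (page - 1) * per_page
    return users[start:start + per_page]

users = [{"id": i, "status": "active" if i % 2 else "inactive"} for i in range(200)]
print(len(paginate_and_filter(users, page=2, per_page=50, status="active")))  # 50
```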
Caching: Support caching of responses where applicable to reduce server load and improve
API performance.
Polling: checking for new data at a fixed interval of time by making API calls to the server at
regular intervals.
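A minimal polling loop, assuming a hypothetical `fetch_updates()` call against the server's API:

```python
import time

def fetch_updates():
    """Placeholder for an API call, e.g. GET /updates?since=<last_seen>."""
    return []   # pretend there is nothing new

POLL_INTERVAL_SECONDS = 30   # fixed interval between checks

def poll(iterations=3):
    for _ in range(iterations):
        updates = fetch_updates()
        if updates:
            print("got new data:", updates)
        time.sleep(POLL_INTERVAL_SECONDS)

# poll()  # would check the server every 30 seconds
```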
Components
Event-driven architectures have three key components:
• Event producers: Publishes an event to the router.
• Event routers: Filters and pushes the events to consumers.
• Event consumers: Uses events to reflect changes in the system.
Producer services and consumer services are decoupled, which allows them to be scaled,
updated, and deployed independently.
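A tiny in-process illustration of the three components (producer, router, consumer); the event names and the filtering rule are invented for the example:

```python
# Event router: keeps subscriptions and pushes matching events to consumers.
subscribers = {}   # event type -> list of consumer callbacks

def subscribe(event_type, consumer):
    subscribers.setdefault(event_type, []).append(consumer)

def publish(event_type, payload):
    # The router filters by event type and pushes only to interested consumers.
    for consumer in subscribers.get(event_type, []):
        consumer(payload)

# Event consumers: react to changes in the system.
subscribe("order_placed", lambda e: print("billing service charges", e["user"]))
subscribe("order_placed", lambda e: print("email service notifies", e["user"]))

# Event producer: publishes an event to the router and moves on.
publish("order_placed", {"user": "alice", "order_id": 42})
```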
Benefits of an event-driven architecture
Scale and fail independently
By decoupling your services, they are only aware of the event router, not each other. This
means that your services are interoperable, but if one service has a failure, the rest will keep
running. The event router acts as an elastic buffer that will accommodate surges in
workloads.
Develop with agility
You no longer need to write custom code to poll, filter, and route events; the event router
will automatically filter and push events to consumers. The router also removes the need for
heavy coordination between producer and consumer services, speeding up your
development process.
Audit with ease
An event router acts as a centralized location to audit your application and define policies.
These policies can restrict who can publish and subscribe to a router and control which users
and resources have permission to access your data. You can also encrypt your events both in
transit and at rest.
Cut costs
Event-driven architectures are push-based, so everything happens on demand as the event
presents itself in the router. This way, you're not paying for continuous polling to check for
an event. This means less network bandwidth consumption, less CPU utilization, less idle
fleet capacity, and fewer SSL/TLS handshakes.
Example architecture:
https://d1.awsstatic.com/product-marketing/EventBridge/1-SEO-Diagram_Event-Driven-
Architecture_Diagram.b3 c18f8cd65e3af3ccb4845dce735b0b9e2c54.png