HLD_Interview

The document discusses strategies for efficiently accessing large datasets, emphasizing the importance of caches, proxies, indexes, and load balancers. Caches improve speed by storing frequently accessed data, while proxies optimize requests to reduce server load. Load balancers distribute traffic among servers to prevent overload, and indexes enhance data retrieval speed at the cost of increased storage overhead.

https://aosabook.org/en/v2/distsys.html

Let's assume you have many terabytes (TB) of data and you want to allow users to
access small portions of that data at random. This is similar to locating an image file
somewhere on the file server in the image application example.

This is particularly challenging because it can be very costly to load TBs of data into
memory; in practice most accesses translate directly into disk I/O, and reading from disk
is many times slower than reading from memory.

Thankfully there are many options that you can employ to make this easier; four of
the more important ones are caches, proxies, indexes and load balancers.

Caches

Caches take advantage of the locality of reference principle: recently requested data
is likely to be requested again. They are used in almost every layer of computing:
hardware, operating systems, web browsers, web applications and more. A cache is
like short-term memory: it has a limited amount of space, but is typically faster than
the original data source and contains the most recently accessed items. Caches can
exist at all levels in architecture, but are often found at the level nearest to the front
end, where they are implemented to return data quickly without taxing downstream
levels.
Global Cache:

All the nodes use the same single cache space. This involves adding a
server, or file store of some sort, faster than your original store and
accessible by all the request layer nodes. Each of the request nodes
queries the cache in the same way it would a local one. This kind of
caching scheme can get a bit complicated because it is very easy to
overwhelm a single cache as the number of clients and requests increases,
but it is very effective in some architectures (particularly ones with
specialized hardware that make this global cache very fast, or that have a
fixed dataset that needs to be cached).

There are two common forms of global caches depicted in the diagrams. In Figure
1.10, when a cached response is not found in the cache, the cache itself becomes
responsible for retrieving the missing piece of data from the underlying store.
The majority of applications leveraging global caches tend to use
the first type, where the cache itself manages eviction and fetching
data to prevent a flood of requests for the same data from the
clients.
However, there are some cases where the Figure 1.11 implementation
(where the request nodes, not the cache, are responsible for retrieving
data on a miss) makes more sense. For example, if the cache is being used
for very large files, a low cache hit percentage would cause the cache buffer
to become overwhelmed with cache misses; in this situation it helps
to have a large percentage of the total data set (or hot data set) in
the cache.

Another example is an architecture where the files stored in the
cache are static and shouldn't be evicted. (This could be because of
application requirements around data latency: certain pieces
of data might need to be very fast for large data sets, and the
application logic understands the eviction strategy or hot spots
better than the cache.)
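As a rough illustration of the first style, here is a minimal read-through cache sketch in Python (the class name, store interface, and eviction policy are assumptions for the sketch, not anything prescribed by the text):

```python
# Read-through global cache: request nodes only talk to the cache, and the
# cache itself fetches from the underlying store on a miss (Figure 1.10 style).
class ReadThroughCache:
    def __init__(self, origin_store, max_items=10_000):
        self.origin_store = origin_store   # assumed to expose .get(key)
        self.data = {}
        self.max_items = max_items

    def get(self, key):
        if key in self.data:
            return self.data[key]                 # cache hit
        value = self.origin_store.get(key)        # the cache handles the miss itself
        if len(self.data) >= self.max_items:
            self.data.pop(next(iter(self.data)))  # naive eviction for the sketch
        self.data[key] = value
        return value
```

In the Figure 1.11 style, `get` would simply return nothing on a miss and each request node would fetch from the origin and decide what to place in the cache.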
Distributed Cache

In a distributed cache (Figure 1.12), each of its nodes owns part of the cached data.

The great thing about caches is that they usually make things much faster
(implemented correctly, of course!). However, all this caching comes at the cost
of having to maintain additional storage space, typically in the form of expensive
memory; nothing is free. Caches are wonderful for making things generally faster,
and moreover they provide system functionality under high load conditions when
otherwise there would be complete service degradation.

Memcached can work as both a local cache and a distributed cache.
Memcached is used in many large web sites, and even though it can be very
powerful, it is simply an in-memory key-value store, optimized for arbitrary
data storage and fast lookups (O(1)).
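A hedged usage sketch with the pymemcache client (the host, port, key, and TTL below are assumptions for illustration):

```python
from pymemcache.client.base import Client

# Connect to a memcached node assumed to run locally on the default port.
client = Client(("localhost", 11211))

# O(1) set/get of arbitrary bytes under a key, with a 5-minute expiry.
client.set("user:42:profile", b'{"name": "Pavani"}', expire=300)
profile = client.get("user:42:profile")   # returns bytes, or None on a miss
```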

Typically the cache is divided up using a consistent hashing function, such that if a
request node is looking for a certain piece of data it can quickly know where to look within
the distributed cache to determine if that data is available. In this case, each node has
a small piece of the cache, and will then send a request to another node for the data before
going to the origin. Therefore, one of the advantages of a distributed cache is the
increased cache space that can be had just by adding nodes to the request pool.
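A toy sketch of the consistent hashing idea (a simplified ring with no virtual nodes; the node names are invented):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: a key maps to the first node at or after its hash."""
    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("image:12345"))   # the same key always routes to the same node
```

Adding a node only remaps the keys that fall between it and its neighbour on the ring, which is why the scheme grows cache space gracefully.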
A disadvantage of distributed caching is
remedying a missing node. Some distributed
caches get around this by storing multiple
copies of the data on different nodes;
however, you can imagine how this logic can
get complicated quickly, especially when you
add or remove nodes from the request layer.
Even if a node disappears and part
of the cache is lost, the requests will just pull
from the origin, so it isn't necessarily
catastrophic!
Use Case Example:

 A shared distributed cache for all services to store user session data, ensuring every service
can retrieve the user's session state when needed.
When you log into Facebook and load your news feed:

1. The server retrieves your user profile, preferences, and recent activities.
2. This data is fetched from the distributed cache (Memcached), pulling pieces from different
servers.
3. Intermediate computations (e.g., which posts to show you) are sped up by language-level
caching like APC or $GLOBALS.

The result is delivered to you almost instantly because the caching system avoids unnecessary
work and fetches data efficiently.

Imagine many users or clients requesting the same data from a server around the same
time (e.g., fetching a popular image, video, or API response).
Proxies

At a basic level, a proxy server is an intermediate piece of hardware/software that
receives requests from clients and relays them to the backend origin servers.
Typically, proxies are used to filter requests, log requests, or sometimes
transform requests (by adding/removing headers, encrypting/decrypting, or
compression).

Figure 1.13: Proxy server

Without optimization: The proxy or server must handle each request individually. This leads to
redundant work, as the same data is fetched multiple times from the backend servers. It increases
server load, network traffic, and response time.

How Collapsed Forwarding Works

 When the proxy detects multiple requests for the same data, instead of forwarding all those
requests to the backend, it:
1. Collapses the requests: Combines similar or identical requests into a single request.
2. Sends one request to the backend server to fetch the required data.
3. Caches the response: The proxy stores the result in memory or a local cache.
4. Sends the response to all clients: The cached result is returned to all the original
clients who made the request.
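A minimal, single-process sketch of that collapsing step (thread-based, with a hypothetical `fetch_from_backend` callable): concurrent requests for the same key share one backend call.

```python
import threading

class CollapsingProxy:
    """Collapse concurrent requests for the same key into a single backend fetch."""
    def __init__(self, fetch_from_backend):
        self.fetch = fetch_from_backend   # assumed: callable(key) -> data
        self.in_flight = {}               # key -> (event, result holder)
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            if key in self.in_flight:     # a fetch for this key is already running
                event, holder = self.in_flight[key]
                leader = False
            else:
                event, holder = threading.Event(), {}
                self.in_flight[key] = (event, holder)
                leader = True
        if leader:
            holder["value"] = self.fetch(key)   # the one backend request
            with self.lock:
                del self.in_flight[key]
            event.set()                         # wake the waiting followers
        else:
            event.wait()                        # followers reuse the leader's result
        return holder["value"]
```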

Why It Speeds Up Data Access


 Reduces Backend Load: Only one request is sent to the backend instead of many, which
reduces server processing and database queries.
 Lowers Network Traffic: A single request uses less bandwidth compared to multiple
redundant requests.
 Improves Client Response Time: The proxy can serve cached responses faster than making
new requests to the backend for each client.

Common Uses
 Content Delivery Networks (CDNs): CDNs use collapsed forwarding to reduce backend load
when serving popular content (e.g., videos, images, or webpages).
 Caching Proxies: Proxies like Varnish or Squid implement this to optimize responses for
repeated API calls or database queries.
 Distributed Systems: Systems with multiple microservices can use collapsed forwarding to
reduce internal communication overhead.

There is some cost associated with this design, since each request
can have slightly higher latency, and some requests may be slightly
delayed to be grouped with similar ones. But it will improve
performance in high load situations, particularly when that same
data is requested over and over. This is similar to a cache, but
instead of storing the data/document like a cache, it is optimizing
the requests or calls for those documents and acting as a proxy for
those clients.

It is easy to get confused here though, since many proxies are also caches (as it is a
very logical place to put a cache), but not all caches act as proxies.

It is worth noting that you can use proxies and caches together, but
generally it is best to put the cache in front of the proxy. This is
because the cache serves data from memory, so it is very fast, and
it doesn't mind multiple requests for the same result. But if the
cache were located on the other side of the proxy server, then there
would be additional latency with every request before reaching the cache,
and this could hinder performance.

Proxies can optimize requests using data locality.


Indexes: Using an index to access your data quickly is a well-known
strategy for optimizing data access performance; it is probably the best
known when it comes to databases. An index trades increased storage
overhead and slower writes (since you must both write the data and update
the index) for the benefit of faster reads.

Payload refers to the actual piece of data that you want to retrieve or work with.
Think of it as the useful or meaningful portion of the data you're searching for, as
opposed to all the surrounding data in the larger dataset.
The payload is the small, specific piece of data that you're interested in, such as:

A single record in a database.

Indexes are typically stored in memory (RAM), locally or somewhere very close to
the client making the request. This is because memory is much faster than other
storage mediums like disks. Storing indexes in memory ensures:

1. Speed: Fast access to the index allows quicker lookups.
2. Efficiency: When a client needs data, it doesn’t have to repeatedly access slower disk storage
for the index.
3. Proximity: Keeping the index local to the client reduces the need for network
communication, which adds latency.
In B-Trees and B+ Trees, the nodes do not directly contain the entire
row of data from the table. Instead, they store keys and pointers
to the actual data in the database.

Without Indexing

1. The database performs a full table scan.
2. For every query, the system checks every row to match the condition.
3. If the table grows to thousands or millions of rows, this becomes very slow.
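A hedged, self-contained illustration with SQLite (the table and column names are invented): without the index, the email lookup is a full table scan; with it, the query uses a B-tree search, at the cost of extra storage and slightly slower writes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(100_000)],
)

query = "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?"

# Without an index on email, the plan is a full scan of the table.
print(conn.execute(query, ("user42@example.com",)).fetchall())

# Add a secondary index: reads get faster, but every write now updates it too.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
print(conn.execute(query, ("user42@example.com",)).fetchall())   # now uses the index
```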
B+ Trees come into play, for example, when we need to use multiple layers of indexing.
Load Balancers:

There are many different algorithms that can be used to service
requests, including picking a random node, round robin, or even
selecting the node based on certain criteria, such as memory or CPU
utilization. Load balancers can be implemented as software or
hardware appliances. One open source software load balancer that
has received wide adoption is HAProxy.
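A quick sketch of the two simplest selection algorithms mentioned above (the backend addresses are invented):

```python
import itertools
import random

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical backend pool

def pick_random():
    """Random choice: pick any node in the pool."""
    return random.choice(servers)

_cycle = itertools.cycle(servers)
def pick_round_robin():
    """Round robin: walk through the pool so requests are spread evenly."""
    return next(_cycle)

for _ in range(4):
    print(pick_round_robin())   # 10.0.0.1, 10.0.0.2, 10.0.0.3, 10.0.0.1
```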

Software vs. Hardware Load Balancers

Load balancers can be implemented in software or as dedicated hardware appliances. Let’s understand each:

1. Software Load Balancers

 These are programs running on general-purpose servers.
 Examples: HAProxy, NGINX, Traefik.

2. Hardware Load Balancers

 These are physical devices specifically designed to perform load balancing.
 Examples: F5 Big-IP, Citrix ADC.

A proxy is an entity that has the authority to act on behalf of another.

A forward proxy is a server that acts on behalf of clients on a network. Think of a
proxy server as a middleman that sits between a private network and the public internet.

One of the most common uses of proxy servers is bypassing geographic
restrictions on websites and content.

Streaming services, for instance, often offer different content based on a
user’s location. With a proxy server based in the target region, you can
access that region’s content library as if you were a local user.

Suppose you’re in India and want to access the US library of a streaming
platform (e.g., Netflix). By connecting to a proxy server located in the US,
your request to the streaming platform will appear to be coming from the
US, allowing access to its content as if you were a US-based viewer.

Proxies can store cached versions of frequently accessed content,
enabling faster load times and reducing bandwidth usage.
An organization with hundreds of employees frequently accessing the
same online resources can deploy a caching proxy. This proxy caches
common websites in its cache, so subsequent requests are served
quickly from the proxy’s storage, saving time and bandwidth.

A reverse proxy can implement load balancing algorithms such as round-robin,
least connections, or IP hash, ensuring optimal distribution of traffic.

Nginx uses round robin by default. To change it, we can simply add the
required algorithm (e.g., ip_hash) in the upstream block. With this configuration,
Nginx will balance requests among backend1, backend2, and backend3, ensuring
no single server becomes overwhelmed.
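The Nginx config itself isn't reproduced here; the following Python sketch only illustrates the ip_hash idea (the backend names are assumptions): hashing the client IP gives every client a stable backend, which is what makes the scheme sticky.

```python
import hashlib

backends = ["backend1", "backend2", "backend3"]   # hypothetical upstream pool

def pick_by_ip_hash(client_ip):
    """Hash the client IP so the same client keeps landing on the same backend."""
    digest = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return backends[digest % len(backends)]

print(pick_by_ip_hash("203.0.113.7"))   # stable for a given client IP
print(pick_by_ip_hash("203.0.113.7"))   # same backend again
```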

· Proxy: Acts as an intermediary between the client and the server.
· Load Balancer: Distributes traffic across multiple servers to ensure they don’t get overloaded.
· Reverse Proxy: A proxy on the server side that forwards client requests to backend servers and
may also handle load balancing, caching, or security.

The most common load balancers are AWS Elastic Load Balancer, NGINX, and HAProxy.
 If a user is routed to a different server (e.g., due to server failure or load balancing), their
session data is still available in the shared session store. This eliminates the risk of losing
session data when one server goes down.
 Redundancy: Many distributed session stores, such as databases or caching systems (e.g.,
Redis or Memcached), allow for data replication. This means there can be multiple copies of
session data across different storage nodes, so if one storage node fails, others can provide
the data.

Scalability:
 Decoupling: The session data is decoupled from the individual servers, allowing you to add or
remove servers without impacting the user's session.
 Load Distribution: Since session data isn't tied to a specific server, the load balancer can
route users to any available server, distributing traffic evenly and enabling the system to
handle higher user volumes.

However, there is not necessarily more than one storage place for session data in
distributed session management. The centralized session store can consist of:
1. A single centralized database (though this might create a bottleneck or single point of failure
unless replicated).
2. A distributed database or caching system, where data is spread across multiple nodes to
ensure both scalability and fault tolerance.

If a system only has a couple of nodes, systems like round robin DNS may make more
sense, since load balancers can be expensive and add an unneeded layer of complexity.

Of course in larger systems there are all sorts of different scheduling
and load-balancing algorithms, including simple ones like random
choice or round robin, and more sophisticated mechanisms that
take things like utilization and capacity into consideration. All of
these algorithms allow traffic and requests to be distributed, and
can provide helpful reliability tools like automatic failover, or
automatic removal of a bad node (such as when it becomes
unresponsive). However, these advanced features can make
problem diagnosis cumbersome. For example, when it comes to
high load situations, load balancers will remove nodes that may be
slow or timing out (because of too many requests), but that only
exacerbates the situation for the other nodes. In these cases
extensive monitoring is important, because overall system traffic
and throughput may look like it is decreasing (since the nodes are
serving fewer requests) but the individual nodes are becoming maxed
out.

Load balancers are an easy way to allow you to expand system
capacity, and like the other techniques in this article, play an
essential role in distributed system architecture. Load balancers also
provide the critical function of being able to test the health of a
node, such that if a node is unresponsive or overloaded, it can be
removed from the pool handling requests, taking advantage of the
redundancy of different nodes in your system.

So far we have covered a lot of ways to read data quickly, but
another important part of scaling the data layer is effective
management of writes.

In complex systems, writes can take an almost non-deterministically long
time. For example, data may have to be written to several places on different
servers or indexes, or the system could just be under high load. In
cases where writes, or any task for that matter, may take a long time,
achieving performance and availability requires building asynchrony into
the system; a common way to do that is with queues.

When the server receives more requests than it can handle, each client
is forced to wait for the other clients' requests to complete before a
response can be generated. This is an example of a synchronous request,
depicted in the figure above.

To avoid this synchronous waiting and the risk of server failure affecting clients,
abstraction is needed. This means separating or decoupling the client’s request from
the actual work that needs to be done to fulfill that request. Here’s how it works:

Asynchronous Processing: Instead of making the client wait for the server to
complete the work, the server can handle the request asynchronously. This
means the server tells the client that it’s accepted the request and will process
it in the background, without the client needing to wait. The client can
continue doing other tasks, like browsing other products, while the server
works on the original request. When the work is done, the client is notified
with the results.
Queueing Work: Rather than each server immediately processing the client’s
request, a message queue can be used. When the client sends a request, it’s
placed in a queue. The work is picked up by available servers as they become
free. This avoids overloading any single server, and servers can handle
requests in a more balanced way.

Failover Systems: By abstracting the client request from the specific server
handling it, the system can ensure that if one server fails, another server can
take over the work. This ensures fault tolerance and improves reliability. The
client doesn’t have to know which server is processing the request, making the
system more resilient to server failures.
Abstraction between the client and the server’s work, such as
asynchronous processing, message queues, and failover systems,
helps to improve performance, distribute the load fairly, and ensure
reliability.
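A minimal in-process sketch of the accept, enqueue, acknowledge, and poll flow described above (the worker, task IDs, and the fake job are all invented for illustration; a real system would use a message broker rather than an in-memory queue):

```python
import queue
import threading
import uuid

tasks = queue.Queue()   # work queue shared by the request handlers and workers
results = {}            # task_id -> result, written by workers

def submit(job):
    """Accept the request, enqueue the work, and acknowledge immediately."""
    task_id = str(uuid.uuid4())
    tasks.put((task_id, job))
    return {"status": "accepted", "task_id": task_id}

def worker():
    while True:
        task_id, job = tasks.get()   # a free worker picks up the next job
        results[task_id] = job()     # do the slow work (e.g. resize an image)
        tasks.task_done()

def poll(task_id):
    """The client checks back later for the result."""
    return results.get(task_id, "pending")

threading.Thread(target=worker, daemon=True).start()
ack = submit(lambda: "thumbnail-ready")
# ... the client is free to do other work, then polls for the result:
tasks.join()
print(poll(ack["task_id"]))   # "thumbnail-ready"
```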

Failover systems are mechanisms or setups that ensure
continuity of service in the event that a server, service, or system
component fails. The primary goal of failover is to automatically
switch to a backup or redundant system without affecting the
client's experience or causing downtime.

Queues enable clients to work in an asynchronous manner,
providing a strategic abstraction of a client's request and its
response. On the other hand, in a synchronous system, there is no
differentiation between request and reply, and they therefore
cannot be managed separately. In an asynchronous system the
client requests a task, the service responds with a message
acknowledging the task was received, and then the client can
periodically check the status of the task, only requesting the result
once it has completed. While the client is waiting for an
asynchronous request to be completed it is free to perform other
work, even making asynchronous requests of other services. The
latter is an example of how queues and messages are leveraged in
distributed systems.

Queues also provide some protection from service outages and
failures. For instance, it is quite easy to create a highly robust queue
that can retry service requests that have failed due to transient
server failures. It is preferable to use a queue to enforce
quality-of-service guarantees than to expose clients directly to
intermittent service outages, requiring complicated and often-inconsistent
client-side error handling.

There are quite a few open source queues like RabbitMQ, ActiveMQ, and BeanstalkD,
but some also use services like Zookeeper, or even data stores like Redis.
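As a hedged example with RabbitMQ via the pika client (the queue name, durability choice, and message body are assumptions): the producer enqueues work and returns immediately, and a worker acknowledges only after it succeeds, so failed deliveries can be retried.

```python
import pika

# Connect to a RabbitMQ broker assumed to be running locally.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="image_uploads", durable=True)

# Producer side: the write path just enqueues the job and returns.
channel.basic_publish(exchange="", routing_key="image_uploads",
                      body=b"resize image 12345")

# Consumer side: a worker processes jobs at its own pace and acknowledges
# each one on success, so the broker can redeliver after a failure.
def handle(ch, method, properties, body):
    print("processing", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="image_uploads", on_message_callback=handle)
channel.start_consuming()
```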

Start from Here

Imagine a system where users are able to upload their images to a
central server, and the images can be requested via a web link or
API, just like Flickr or Picasa. For the sake of simplicity, let's assume
that this application has two key parts: the ability to upload (write)
an image to the server, and the ability to query for an image. While
we certainly want the upload to be efficient, we care most about
having very fast delivery when someone requests an image (for
example, images could be requested for a web page or other
application). This is very similar functionality to what a web server
or Content Delivery Network (CDN) edge server (a server a CDN uses
to store content in many locations so content is
geographically/physically closer to users, resulting in faster
performance) might provide.

Other important aspects of the system are:

 There is no limit to the number of images that will be stored,
so storage scalability, in terms of image count, needs to be
considered.
 There needs to be low latency for image downloads/requests.
 If a user uploads an image, the image should always be there
(data reliability for images).
 The system should be easy to maintain (manageability).
 Since image hosting doesn't have high profit margins, the
system needs to be cost-effective.

Services:

When considering scalable system design, it helps to decouple
functionality and think about each part of the system as its own
service with a clearly defined interface. In practice, systems
designed in this way are said to have a Service-Oriented
Architecture (SOA). For these types of systems, each service has its
own distinct functional context, and interaction with anything
outside of that context takes place through an abstract interface,
typically the public-facing API of another service.

Deconstructing a system into a set of complementary services
decouples the operation of those pieces from one another. Creating
these clear delineations can help isolate problems, but also allows
each piece to scale independently of one another. This sort of
service-oriented design for systems is very similar to object-oriented
design for programming.

In our example, all requests to upload and retrieve images are
processed by the same server; however, as the system needs to
scale it makes sense to break out these two functions into their own
services.

Reads will typically be served from cache, while writes will have to
go to disk eventually. Even if everything is in memory or read from
disks (like SSDs), database writes will almost always be slower than
reads.

Since reads can be asynchronous, or take advantage of other
performance optimizations like gzip compression or chunked
transfer encoding, the web server can serve reads faster and switch
between clients quickly, serving many more requests per second
than the max number of connections (with Apache and max
connections set to 500, it is not uncommon to serve several
thousand read requests per second). Writes, on the other hand, tend
to maintain an open connection for the duration of the upload, so
uploading a 1MB file could take more than 1 second on most home
networks, meaning that web server could only handle 500 such
simultaneous writes.

Asynchronous operations mean that the server doesn’t have to
wait for one task (e.g., reading data) to finish before starting the
next task. There are lots of ways to address these types of bottlenecks
(like splitting out different services here), and each has different
tradeoffs.

For example, Flickr solves this read/write issue by distributing users
across different shards such that each shard can only handle a set
number of users, and as users increase more shards are added to
the cluster.

In the solution where we split out read and write services separately
it is easier to scale hardware based on actual usage (the number of
reads and writes across the whole system).

In the former (the split-services solution) an outage or issue with one
of the services brings down functionality across the whole system
(no one can write files, for example), whereas an outage with one of
Flickr's shards will only affect those users. In the first example it is
easier to perform operations across the whole dataset (for example,
updating the write service to include new metadata, or searching
across all image metadata), whereas with the Flickr architecture
each shard would need to be updated or searched.
Alternatively, Flickr uses a search service that searches across all
shards and aggregates the results. This allows them to search
across all shards at once or update data in a more efficient way.

Redundancy:
If there is a core piece of functionality for an application, ensuring that multiple
copies or versions are running simultaneously can secure against the failure of a
single node.

Creating redundancy in a system can remove single points of failure and provide
a backup or spare functionality if needed in a crisis. For example, if there are two
instances of the same service running in production, and one fails or degrades,
the system can failover to the healthy copy. Failover can happen automatically or
require manual intervention.

Failover is the ability to switch automatically and seamlessly to a
reliable backup system. When a component or primary system fails,
either a standby operational mode or redundancy should achieve
failover and lessen or eliminate negative impact on users.

Failover automation in servers includes pulse or heartbeat
conditions. That is, heartbeat cables connect two or more servers
in a network, with the primary server always active. As long as the
secondary server perceives the heartbeat, or pulse, it merely rests.
However, should the secondary server perceive any change in the
pulse from the primary server, it will initiate its instances and take
over the primary server’s operations. It will also message the
technician or data center, requesting that they bring the primary
server back online. Some systems, called automated with manual
approval configurations, simply alert the technician or data center
instead, requesting that the change to the server take place manually.

Virtualization simulates a computer environment using a virtual machine (or pseudo
machine) running on host software. In this way, the failover process can be made
independent of the physical hardware components of computer server systems.
Active-active and active-passive or active-standby are the most common
configurations for high availability (HA). Each implementation technique achieves
failover in a different way, although both improve reliability.

In an active-active cluster, utilization of both nodes is typically around half each,
although each node must be able to handle the entire load alone. This also means that
a node failure can cause performance to degrade if one node in the active-active
configuration had consistently been handling more than half of the load.

Outage time during a failure is virtually zero with an active-active HA configuration,
because both paths are active. With an active-passive configuration, outage time has
the potential to be greater, as the system must switch from one node to the other,
which requires time.

Another key part of service redundancy is creating a shared-nothing
architecture. With this architecture, each node is able to
operate independently of one another and there is no central "brain"
managing state or coordinating activities for the other nodes. This
helps a lot with scalability since new nodes can be added without
special conditions or knowledge. However, and most importantly,
there is no single point of failure in these systems, so they are much
more resilient to failure.

For example, in our image server application, all images would have
redundant copies on another piece of hardware somewhere (ideally
in a different geographic location in the event of a catastrophe like
an earthquake or fire in the data center), and the services to access
the images would be redundant.

Partitions: To scale horizontally, on the other hand, is to add more nodes.
In the case of the large data set, this might be a second server to store
parts of the data set, and for the computing resource it would mean
splitting the operation or load across some additional nodes. To take full
advantage of horizontal scaling, it should be included as an intrinsic
design principle of the system architecture, otherwise it can be quite
cumbersome to modify and separate out the context to make this
possible.

When it comes to horizontal scaling, one of the more common
techniques is to break up your services into partitions, or shards. The
partitions can be distributed such that each logical set of functionality is
separate; this could be done by geographic boundaries, or by other
criteria like non-paying versus paying users. The advantage of these
schemes is that they provide a service or data store with added capacity.

In our image server example, it is possible that the single file server
used to store images could be replaced by multiple file servers,
each containing its own unique set of images. (See figure below)
Such an architecture would allow the system to fill each file server
with images, adding additional servers as the disks become full. The
design would require a naming scheme that tied an image's
filename to the server containing it. An image's name could be
formed from a consistent hashing scheme mapped across the
servers. Or alternatively, each image could be assigned an
incremental ID, so that when a client makes a request for an image,
the image retrieval service only needs to maintain the range of IDs
that are mapped to each of the servers (like an index).
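A small sketch of the incremental-ID variant (the ranges and server names are hypothetical): the retrieval service only keeps the first ID stored on each file server and uses it like an index.

```python
import bisect

# Each file server owns a contiguous range of image IDs (hypothetical layout);
# range_starts holds the first ID stored on each server, in ascending order.
range_starts = [1, 1_000_001, 2_000_001]
file_servers = ["files-01", "files-02", "files-03"]

def server_for(image_id):
    """Locate the range the ID falls into, and hence the server holding the image."""
    idx = bisect.bisect_right(range_starts, image_id) - 1
    return file_servers[idx]

print(server_for(42))          # files-01
print(server_for(1_500_000))   # files-02
```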

Of course there are challenges distributing data or functionality
across multiple servers. One of the key issues is data locality; in
distributed systems, the closer the data to the operation or point of
computation, the better the performance of the system. Therefore it
is potentially problematic to have data spread across multiple
servers, as any time it is needed it may not be local, forcing the
servers to perform a costly fetch of the required information across
the network.

Another potential issue comes in the form of inconsistency. When
there are different services reading and writing from a shared
resource, potentially another service or data store, there is the
chance for race conditions (where some data is supposed to be
updated, but the read happens prior to the update), and in those
cases the data is inconsistent.
Now go to Page 1

https://www.hellointerview.com/learn/system-design/in-a-hurry/delivery

 Requirements

1) Functional requirements are your "Users/Clients should be able to..." statements.

Many of these systems have hundreds of features, but it's your job to identify and prioritize the top 3. Having a long
list of requirements will hurt you more than it will help you, and many top FAANG companies directly evaluate you on
your ability to focus on what matters.

2) Non-functional requirements are statements about the system qualities that are important
to your users. These can be phrased as "The system should be able to..." or "The system
should be..." statements.
It's important that non-functional requirements are put in the context of the system and,
where possible, are quantified. For example, "the system should be low latency" is
obvious and not very meaningful—nearly all systems should be low latency. "The system
should have low latency search, < 500ms," is much more useful.

Here is a checklist of things(non functional requirements) to consider that might help you
identify the most important non-functional requirements for your system. You'll want to
identify the top 3-5 that are most relevant to your system.

 CAP Theorem: Should your system prioritize consistency or
availability? Note, partition tolerance is a given in distributed systems.
 Environment Constraints
 Scalability: All systems need to scale, but does this system have
unique scaling requirements? For example, does it have bursty traffic
at a specific time of day? Are there events, like holidays, that will
cause a significant increase in traffic? Also consider the read vs write
ratio here. Does your system need to scale reads or writes more?
 Latency: How quickly does the system need to respond to user
requests?
 Durability: How important is it that the data in your system is not
lost? For example, a social network might be able to tolerate some
data loss, but a banking system cannot.
 Security: How secure does the system need to be?
 Fault Tolerance: How well does the system need to handle
failures? Consider redundancy, failover, and recovery mechanisms.
 Compliance: Are there legal or regulatory requirements the
system needs to meet? Consider industry standards, data protection
laws, and other regulations.
3) Capacity Estimation: (DAU, QPS, etc.)

Our suggestion is to explain to the interviewer that you would like to skip estimations upfront
and that you will do the math while designing, when/if necessary. Perform calculations only if they
will directly influence your design.

When would it be necessary? Imagine you are designing a TopK system for trending topics in FB
posts. You would want to estimate the number of topics you would expect to see, as this will
influence whether you can use a single instance of a data structure like a min-heap or if you need
to shard it across multiple instances, which will have a big impact on your design.

 Core Entities: These are the core entities that your API will exchange and that your
system will persist in a Data Model.

In Twitter, the core entities are:

User
Tweet
Follow
Aim to choose good names for your entities.

A couple of useful questions to ask yourself to help identify core entities:
Who are the actors in the system? Are they overlapping?
What are the nouns or resources necessary to satisfy the functional
requirements?
 API or System Interface:
Before you get into the high-level design, you'll want to define the contract between your system
and its users.

You have a quick decision to make here -- do you want to design a
RESTful API or a GraphQL API?
RESTful API: The standard communication constraints of the
internet. Uses HTTP verbs (GET, POST, PUT, DELETE) to perform
CRUD operations on resources.
GraphQL API: A newer communication protocol that allows clients to
specify exactly what data they want to receive from the server.
Wire Protocol: If you're communicating over websockets or raw TCP
sockets, you'll want to define the wire protocol. This is the format of
the data that will be sent over the network, usually in the format of
messages.
Don't overthink this. Bias toward creating a REST API. Use GraphQL only if you really need clients
to fetch only the requested data (no over- or under-fetching). If you're going to use websockets,
you'll want to describe the wire protocol.

 [Optional] Data Flow (~5 minutes)


For some backend systems, especially data-processing systems, it can be helpful to describe the
high level sequence of actions or processes that the system performs on the inputs to produce the
desired outputs. If your system doesn't involve a long sequence of actions, skip this!

 High Level Design (~10-15 minutes)

Now that you have a clear understanding of the requirements,
entities, and API of your system, you can start to design the
high-level architecture. This consists of drawing boxes and arrows to
represent the different components of your system and how they
interact. Components are basic building blocks like servers,
databases, caches, etc.
In most cases, you can even go one-by-one through your API endpoints and build up your design
sequentially to satisfy each one.

It's incredibly common for candidates to start layering on complexity too early, resulting in them never arriving at a
complete solution. Focus on a relatively simple design that meets the core functional requirements, and then layer on
complexity to satisfy the non-functional requirements in your deep dives section. It's natural to identify areas where
you can add complexity, like caches or message queues, while in the high-level design. We encourage you to note
these areas with a simple verbal callout and written note, and then move on.

As you're drawing your design, you should be talking through your thought process with your
interviewer. Be explicit about how data flows through the system and what state (either in
databases, caches, message queues, etc.) changes with each request, starting from API requests
and ending with the response. When your request reaches your database or persistence layer, it's
a great time to start documenting the relevant columns/fields for each entity. You can do this
directly next to your database visually. This helps keep it close to the relevant components and
makes it easy to evolve as you iterate on your design.
 Deep Dives
(a) ensuring it meets all of your non-functional requirements, (b) addressing edge cases, (c)
identifying and addressing issues and bottlenecks, and (d) improving the design based on probes
from your interviewer.

So for example, one of our non-functional requirements for Twitter was that our system needs to
scale to >100M DAU. We could then lead a discussion oriented around horizontal scaling, the
introduction of caches, and database sharding -- updating our design as we go. Another was that
feeds need to be fetched with low latency. In the case of Twitter, this is actually the most
interesting problem. We'd lead a discussion about fanout-on-read vs fanout-on-write and the use
of caches.

Core Concepts:
Scaling:

If you can estimate your workload and determine that you can scale vertically for the
foreseeable future, this is often a better solution than horizontal scaling. Many
systems can scale vertically to a surprising degree.
The first challenge of horizontal scaling is getting the work to the right machine. This is often done
via a load balancer. For asynchronous jobs, this is often done via a queueing system.
However, in hybrid systems (real-time + asynchronous), you might see both load
balancers and queues used together!

Work distribution needs to try to keep load on the system as even as possible. For example, if
you're using a hash map to distribute work across a set of nodes, you might find that one node is
getting a disproportionate amount of work because of the distribution of incoming requests.

Data Distribution:

We can keep data in memory on the node that's processing the request, or in a database that's
shared across all nodes. Look for ways to partition your data such that a single node can access
the data it needs without needing to talk to another node. If you do need to talk to other nodes (a
concept known as "fan-out"), keep the number small.

A common antipattern is to have requests which fan out to many different nodes and then the
results are all gathered together. This "scatter gather" pattern can be problematic because it can
lead to a lot of network traffic, is sensitive to failures in each connection, and suffers from tail
latency issues.
How to mitigate Tail Latency:

Horizontal scaling on data introduces synchronization challenges: race conditions and consistency
challenges! Most database systems are built to resolve some of these problems directly (e.g.
by using transactions). In other cases, you may need to use a Distributed Lock.

Many candidates will try to make a decision on consistency across their entire system, but many systems will actually
blend strong and weak consistency in different parts of the system. For example, on an ecommerce site the item
details might be eventually consistent (if you update the description of an item, it's okay if it takes a few minutes for
the description to update on the product page) but the inventory count needs to be strongly consistent (if you sell an
item, it's not okay for another customer to buy the same unit).

Locks happen at every scale of computer systems: there are locks in your operating system
kernel, locks in your applications, locks in the database, and even distributed locks. Locks are
important for enforcing the correctness of our system but can be disastrous for performance.

Traditional databases with ACID properties use transaction locks to keep data consistent, which is
great for ensuring that while one user is updating a record, no one else can update it, but they're
not designed for longer-term locking. This is where distributed locks come in handy.

Distributed locks are perfect for situations where you need to lock something across different
systems or processes for a reasonable period of time. They're often implemented using a
distributed key-value store like Redis or Zookeeper. The basic idea is that you can use a key-value
store to store a lock and then use the atomicity of the key-value store to ensure that only one
process can acquire the lock at a time. For example, if you have a Redis instance with a key
ticket-123 and you want to lock it, you can set the value of ticket-123 to locked.
If another process tries to set the value of ticket-123 to locked, it will fail because the
value is already set to locked. Once the first process is done with the lock, it can set the value
of ticket-123 to unlocked and another process can acquire the lock.

Another handy feature of distributed locks is that they can be set to expire after a certain amount of
time. This is great for ensuring that locks don't get stuck in a locked state if a process crashes or is
killed. For example, if you set the value of ticket-123 to locked and then the process
crashes, the lock will expire after a certain amount of time (like after 10 minutes) and another
process can acquire the lock at that point.
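A hedged sketch of that acquire/expire/release flow with the redis-py client (the host, TTL, and the work inside the critical section are assumptions; the release is done with a delete rather than by writing "unlocked", so that the NX acquisition works again):

```python
import redis

r = redis.Redis(host="localhost", port=6379)   # assumed local Redis instance

# Acquire: SET with NX (only if the key doesn't exist) and EX (auto-expiry),
# so a crashed holder can't wedge the lock forever.
acquired = r.set("ticket-123", "locked", nx=True, ex=600)

if acquired:
    try:
        pass  # ... do the work that needs exclusive access to ticket-123 ...
    finally:
        r.delete("ticket-123")   # release the lock so others can acquire it
else:
    print("someone else holds the lock; retry later")
```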
Redlock is a distributed locking algorithm built on top of Redis,
designed to handle distributed locks in a fault-tolerant and
consistent way. The problem Redlock aims to solve is how to
ensure that a lock is acquired across a distributed system (where
multiple instances or nodes are involved) while avoiding race
conditions and ensuring that locks are properly released.
Why Use Multiple Redis Instances?

Fault tolerance: If one or more Redis nodes fail, Redlock still works as long
as the majority of the nodes are still available. This prevents the entire lock
mechanism from failing if a single Redis server goes down.

Consistency: Using multiple nodes helps to avoid scenarios where one node's
state is inconsistent due to network partitions or failures. The majority
consensus ensures that the lock is acquired safely.
When you horizontally scale data (distribute it across multiple servers):

1. Shared Database Issues: All servers must coordinate when accessing the database, leading to
potential race conditions.
2. Redundant Copies: If each server keeps a copy of the data, updates must be synchronized
across all servers. Example: If Server A updates "Name: John," Server B's copy must also be
updated to "Name: John," or inconsistencies occur.

Transactions and distributed locks help solve these challenges, but they come with
trade-offs like added complexity and performance overhead.
If your system design problem involves geography, there's a good chance you have the option to partition by some
sort of REGION_ID.
Replication ensures that the user's data is available in both the US-East and EU-West
regions. Here's how it works:

1. Data Duplication Across Regions

 The user's data is periodically copied from the primary database (e.g., in "US-East") to
secondary databases in other regions (e.g., "EU-West").
 This allows the system in "EU-West" to serve the user's requests locally.

Distributed Cache
What is a distributed cache and when should you use it?

As the system gets bigger, the cache size also gets bigger and a single-
node cache often falls short when scaling to handle millions of users and
massive datasets.
In such scenarios, we need to distribute the cache data across multiple
servers. This is where distributed caching comes into play.

https://blog.algomaster.io/p/distributed-caching

Dedicated Cache Servers vs. Co-located Cache

When designing a caching strategy for a distributed system, one of the
critical decisions you need to make is where to host the cache.

The two primary options are using dedicated cache servers or co-locating
the cache with application servers.
Don't forget to be explicit about what data you are storing in the cache, including the data structure you're using.
Remember, modern caches have many different data structures you can leverage; they are not just simple key-value
stores. So for example, if you are storing a list of events in your cache, you might want to use a sorted set so that you
can easily retrieve the most popular events. Many candidates will just say, "I'll store the events in a cache" and leave
it at that. This is a missed opportunity and may invite follow-up questions.
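For instance, a hedged redis-py sketch of the sorted-set idea (the key names and scores are invented): popularity is the score, so the top events can be read straight out of the cache.

```python
import redis

r = redis.Redis(host="localhost", port=6379)   # assumed cache instance

# Store events in a sorted set keyed by a popularity score, not as an opaque blob.
r.zadd("events:popular", {"event:101": 250, "event:102": 975, "event:103": 610})

# Bump a score as new views come in.
r.zincrby("events:popular", 1, "event:101")

# Retrieve the ten most popular events directly from the cache.
top10 = r.zrevrange("events:popular", 0, 9, withscores=True)
```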

The two most common in-memory caches are Redis and Memcached. Redis is a key-value store
that supports many different data structures, including strings, hashes, lists, sets, sorted sets,
bitmaps, and hyperloglogs. Memcached is a simple key-value store that supports strings and
binary objects.
A distributed cache is a system that stores data in memory across
multiple nodes (servers) in a distributed network, making it highly
available and scalable.

In a distributed cache, the data is spread across multiple nodes,
and these nodes work together to ensure that data is available to
the application with low latency, even under high load. A
distributed cache system often handles data replication and fault
tolerance to ensure high availability and reliability.

Consistency: Most real-life systems don't require strong consistency everywhere. For
example, a social media feed can be eventually consistent -- if you post a tweet, it's okay if it takes
a few seconds for your followers to see it. However, a banking system needs to be strongly
consistent -- if you transfer money from one account to another, it's not okay if the money
disappears from one account and doesn't appear in the other.


Locking: Locking is the process of ensuring that only one client can access a shared resource at
a time.

There are three things to worry about when employing locks:

Granularity of the lock
We want locks to be as fine-grained as possible. This means that we
want to lock as little as possible to ensure that we're not blocking
other clients from accessing the system. For example, if we're
updating a user's profile, we want to lock only that user's profile and
not the entire user table.

Duration of the lock
We want locks to be held for as short a time as possible. This means
that we want to lock only for the duration of the critical section. For
example, if we're updating a user's profile, we want to lock only for
the duration of the update and not for the entire request.

Whether we can bypass the lock
In many cases, we can avoid locking by employing an "optimistic"
concurrency control strategy, especially if the work to be done is
either read-only or can be retried. In an optimistic strategy we're
going to assume that we can do the work without locking and then
check to see if we were right.
Optimistic concurrency control makes the assumption that most of the time we won't have
contention (or multiple people trying to lock at the same time) in a system, which is a good
assumption for many systems! That said, not all systems can use optimistic concurrency control.
For example, if you're updating a user's bank account balance, you can't just assume that you can
do the update without locking.
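A small sketch of the optimistic approach using a version column (SQLite is used here purely to make the example runnable; the table and column names are invented): the update succeeds only if nobody changed the row since it was read, and otherwise the caller re-reads and retries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (user_id INTEGER PRIMARY KEY, bio TEXT, version INTEGER)")
conn.execute("INSERT INTO profiles VALUES (1, 'hello', 1)")

def update_bio(user_id, new_bio, expected_version):
    """Optimistic update: write only if the version we read is still current."""
    cur = conn.execute(
        "UPDATE profiles SET bio = ?, version = version + 1 "
        "WHERE user_id = ? AND version = ?",
        (new_bio, user_id, expected_version),
    )
    return cur.rowcount == 1   # False means someone else won the race; re-read and retry

bio, version = conn.execute("SELECT bio, version FROM profiles WHERE user_id = 1").fetchone()
print(update_bio(1, "new bio", version))   # True on the first attempt, and no lock was held
```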
Indexing in Databases: If you can do your indexing in your primary database, do it! Databases have
been battle-honed over decades for exactly this problem, so don't reinvent the wheel unless you have to.

Databases like DynamoDB make secondary indexing easy and
automatic, while with Redis you must design and maintain custom
structures to enable efficient queries based on fields other than the
primary key.
Secondary indexes are additional data structures that a database
uses to speed up queries based on fields (or attributes) other than
the primary key.
· Primary Key: The main unique identifier for each record (e.g., user_id in a user table).
· Secondary Index: Allows querying the database using non-primary fields (e.g., email,
last_name, etc.).
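A hedged redis-py sketch of what "design and maintain custom structures" means in practice (the key names are invented): the secondary index is just another key that you must keep in sync with the record on every write.

```python
import redis

r = redis.Redis(host="localhost", port=6379)   # assumed Redis instance

# Primary record keyed by user_id.
r.hset("user:42", mapping={"email": "pavani@example.com", "last_name": "Reddy"})

# Hand-maintained secondary index: email -> user_id. It must be updated
# together with the record on every write, or it drifts out of sync.
r.set("user_by_email:pavani@example.com", "42")

# Query by the non-primary field via the index, then fetch the full record.
user_id = r.get("user_by_email:pavani@example.com").decode()
profile = r.hgetall(f"user:{user_id}")
```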
ElasticSearch: ElasticSearch can be confusing at first because it's
not a traditional database, but rather a search and analytics
engine. ElasticSearch is a tool designed for searching, analyzing,
and storing data. It's not a full-fledged database like MySQL or
PostgreSQL, but it acts as a complement to databases by enabling
very fast searches and complex queries.
ElasticSearch is built on Lucene, which is a powerful library for
full-text search. It enhances Lucene by making it scalable,
distributed, and easier to use.

Technically, ElasticSearch is not a database in the same way MySQL or MongoDB is.
However:

 It can store data (in JSON format), so some people use it as a NoSQL database.
 Its primary purpose is search and indexing rather than managing transactional data (e.g.,
updating user account balances like traditional databases).

So, ElasticSearch is best thought of as a search layer or search engine for your
database, rather than a replacement for the database itself.

What are Secondary Indexes?

Secondary indexes are additional data structures created on top of your primary
database to enable faster searches. ElasticSearch can act as a secondary index for
databases by indexing your data and allowing for:
1. Full-text search: Searching large volumes of text very quickly (e.g., finding articles containing
"climate change").
2. Geospatial search: Finding nearby locations (e.g., restaurants within 5 km).
3. Vector search: Finding similar items (e.g., images or documents).

ElasticSearch can handle huge amounts of data by scaling
horizontally (adding more servers).
DO YOU ALWAYS NEED A VECTOR DATABASE
COMMUNICATION PROTOCOLS:

You've got two different categories of protocols to handle: internal and external.

Internally, for a typical microservice application, which constitutes 90%+ of system design
problems, either HTTP(S) or gRPC will do the job. Don't make things complicated.

Externally, you'll need to consider how your clients will communicate with your system:
who initiates the communication, what are the latency considerations, and how much
data needs to be sent.

Across choices, most systems can be built with a combination of HTTP(S), SSE or long
polling, and Websockets.

Use HTTP(S) for APIs with simple request and responses. Because
each request is stateless, you can scale your API horizontally by
placing it behind a load balancer. Make sure that your services aren't
assuming dependencies on the state of the client (e.g. sessions) and
you're good to go.

· HTTP(S) requests are stateless by nature.
This means each request from the client is treated as independent, and the server
doesn't need to remember anything about previous requests.

Because of this stateless nature, any server in your pool (behind
the load balancer) can handle any request. This makes it easy to
scale horizontally, since you can add more servers without worrying
about breaking existing client connections.
If you need to give your clients near-realtime updates, you'll need a
way for the clients to receive updates from the server. Long polling is
a great way to do this that blends the simplicity and scalability of
HTTP with the realtime updates of Websockets. With long polling, the
client makes a request to the server and the server holds the request
open until it has new data to send to the client. Once the data is sent,
the client makes another request and the process repeats. Notably,
you can use standard load balancers and firewalls with long polling -
no special infrastructure needed.
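A hedged sketch of the client side of long polling using the requests library (the endpoint URL and the 35-second timeout are assumptions): each request is held open by the server until data arrives, and the client immediately re-issues it.

```python
import requests

def handle_update(update):
    print("got update:", update)

def long_poll(url):
    """Client loop: the server holds each request open until it has new data."""
    while True:
        try:
            # The server may hold the connection for up to ~30s before answering.
            resp = requests.get(url, timeout=35)
            if resp.status_code == 200:
                handle_update(resp.json())
        except requests.exceptions.Timeout:
            pass   # nothing arrived in time; just issue the next poll

# long_poll("https://example.com/api/notifications/poll")   # hypothetical endpoint
```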

When a client makes a request and waits for data, the load
balancer simply routes the request to an available server. If that
server is processing a long-polling request, it continues to hold it
until it can respond with new data.

While it might seem like a new request could be independent of the
previous one, in a long-polling scenario it's important for
subsequent requests to go to the same server, because of the
state, session management, event handling, and resource allocation
required for maintaining a long-lived, open connection.
Using sticky sessions or session persistence on the load
balancer ensures that the client’s next request goes to the same
server that handled the initial request, which helps maintain
consistent behavior and ensures reliable long-polling connections.

Websockets are necessary if you need realtime, bidirectional
communication between the client and the server. From a system
design perspective, websockets can be challenging because you
need to maintain the connection between client and server. This can
be a challenge for load balancers and firewalls, and it can be a
challenge for your server to maintain many open connections. A
common pattern in these instances is to use a message broker to
handle the communication between the client and the server and for
the backend services to communicate with this message broker. This
ensures you don't need to maintain long connections to every service
in your backend.

Lastly, Server Sent Events (SSE) are a great way to send updates
from the server to the client. They're similar to long polling, but
they're more efficient for unidirectional communication from the
server to the client. SSE allows the server to push updates to the
client whenever new data is available, without the client having to
make repeated requests as in long polling. This is achieved through a
single, long-lived HTTP connection, making it more suitable for
scenarios where the server frequently updates data that needs to be
sent to the client. Unlike Websockets, SSE is designed specifically
for server-to-client communication and does not support client-to-
server messaging. This makes SSE simpler to implement and
integrate into existing HTTP infrastructure, such as load balancers
and firewalls, without the need for special handling.
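
A minimal SSE endpoint sketch, assuming Flask (the /events route and the payload are made up for illustration):

import json, time
from flask import Flask, Response  # third-party; pip install flask

app = Flask(__name__)

@app.route("/events")
def events():
    def stream():
        # One long-lived HTTP response; each yield pushes an event to the client.
        n = 0
        while True:
            n += 1
            yield f"data: {json.dumps({'tick': n})}\n\n"  # SSE frames are "data: ...\n\n"
            time.sleep(1)
    return Response(stream(), mimetype="text/event-stream")

# Browser side: new EventSource('/events').onmessage = (e) => console.log(e.data)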
Statefulness is a major source of complexity for systems. Where possible, relegating your state to a
message broker or a database is a great way to simplify your system. This enables your services to be
stateless and horizontally scalable while still maintaining stateful communication with your clients.

Understanding Real-Time Communication


Real-time communication refers to the ability of a server to push
information to a client as soon as it becomes available, without the client
having to request it explicitly.
This is in contrast to the traditional request-response model of HTTP,
where the client must always initiate communication.

Long-Polling and WebSockets are two strategies to overcome this limitation.
Message Routing and Sharding: Many message brokers, like Kafka or RabbitMQ,
use a concept called routing or sharding. This means that messages can be sent to
specific "partitions" or "queues" based on certain rules, such as:

Key-based routing: If you have a message that needs to go to a specific service instance, you can use a message key (e.g., a user ID or session ID) to determine which partition the message should go to. This ensures that all messages for a specific key are sent to the same partition.

For example, if you're building a system where every user has a dedicated
service instance, you might hash the user ID and route all messages for that
user to the same partition or queue. This way, the same service will process
those messages consistently.
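
A minimal sketch of this key-based routing (the partition count is illustrative; Kafka's default partitioner does something similar by hashing the message key):

import hashlib

NUM_PARTITIONS = 16  # illustrative partition count

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Use a stable hash; Python's built-in hash() is randomized per process.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# All messages for the same user land on the same partition,
# so the same consumer processes them in order.
print(partition_for("user-123"))
print(partition_for("user-123"))  # same partition every time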


Queue
The most common queueing technologies are Kafka and SQS. Kafka is a distributed streaming platform that can be used as a queue, while SQS is a fully managed queue service provided by AWS.

Streams / Event Sourcing

Sometimes you'll be asked a question that requires either processing vast amounts of data in real-time
or supporting complex processing scenarios, such as event sourcing.

Event sourcing is a technique where changes in application state are stored as a sequence of events. These events can
be replayed to reconstruct the application's state at any point in time, making it an effective strategy for systems that
require a detailed audit trail or the ability to reverse or replay transactions.

In either case, you'll likely want to use a stream. Unlike message queues, streams can retain data for a configurable period of time, allowing consumers to read and re-read messages from the same position or from a specified time in the past. Streams are a good choice for either of these scenarios.
The most common stream technologies are Kafka and Kinesis. Kafka can be configured to be both a
message queue and a stream, while Kinesis is a fully managed stream service provided by AWS. Both
are great choices.

Security

Authentication/Authorization
In many systems you'll expose an API to external users which needs to be locked down to only specific users. Delegating this work to either an API Gateway or a dedicated service like Auth0 is a great way to ensure that you're not reinventing the wheel. Often it's sufficient to say "My API Gateway will handle authentication and authorization".

An API Gateway is a server that acts as an entry point into a system, providing a single point of access to a variety of services,
often in a microservices architecture. It routes requests from
clients to the appropriate backend services, handles different types
of tasks (such as authentication, load balancing, rate limiting,
caching), and simplifies client communication by consolidating
multiple API calls into a single request.
Especially in a microservice architecture, an API gateway sits in front of your system and is
responsible for routing incoming requests to the appropriate backend service. For example, if the
system receives a request to GET /users/123, the API gateway would route that request to
the users service and return the response to the client.

The most common API gateways are AWS API Gateway, Kong, and Apigee.

An API gateway accepts API requests from a client, processes them based on
defined policies, directs them to the appropriate services, and combines the
responses for a simplified user experience. Typically, it handles a request by
invoking multiple microservices and aggregating the results. It can also
translate between protocols in legacy deployments.
https://www.nginx.com/blog/how-do-i-choose-api-gateway-vs-ingress-controller-vs-service-mesh/
Backend API Response:

 The backend API processes the request and sends a response back to the client. The API
Gateway may handle additional concerns like rate limiting, logging, or security headers.

Encryption:
You'll want to cover both the data in transit (e.g. via protocol encryption) and the data at rest (e.g. via storage encryption). HTTPS encrypts data in transit using SSL/TLS and is the standard for web traffic. If you're using gRPC, it supports SSL/TLS out of the box. For data at rest, you'll want to use a database that supports encryption or encrypt the data yourself before storing it.

For example, if you're building a system that stores user data, you might want to encrypt that data with a key that's
unique to each user. This way, even if your database is compromised, the data is still secure.
DATA PROTECTION:
A hacker could send a lot of requests to your endpoint, trying to
access information they shouldn't be able to. This is often called
"scraping," where someone collects data by sending many requests,
sometimes without permission.

MONITORING:
DATABASES:

the most common are relational databases (e.g. Postgres) and NoSQL databases (e.g.
DynamoDB) - we recommend you pick one of these for your interview. If you are taking
predominantly product design interviews, we recommend you pick a relational database. If you are
taking predominantly infrastructure design interviews, we recommend you pick a NoSQL database.
The great thing about relational databases is (a) their support for
arbitrarily many indexes, which allows you to optimize for different
queries and (b) their support for multi-column and specialized
indexes (e.g. geospatial indexes, full-text indexes.)

Transactions are a way of grouping multiple operations together into a single atomic operation. For example, if you have a users table
and a posts table, you might want to create a new user and a new
post for that user at the same time. If you do this in a transaction,
either both operations will succeed or both will fail. This is important
for maintaining data integrity.
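
A minimal sketch of that users/posts example using Python's built-in sqlite3 (the schema is invented for illustration); either both inserts commit or neither does:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        cur = conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
        conn.execute(
            "INSERT INTO posts (user_id, body) VALUES (?, ?)",
            (cur.lastrowid, "hello world"),
        )
except sqlite3.Error:
    # Neither the user nor the post was written.
    print("transaction rolled back")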
What are the most common NoSQL databases?

The most common NoSQL databases are DynamoDB and MongoDB.

Blob Storage:
Blob storage services are simple: you upload a blob of data, it is stored, and you get back a URL that you can later use to download the blob. Blob storage services often work in conjunction with CDNs, so you can get fast downloads from anywhere in the world: upload a file/blob to blob storage, which acts as your origin, and then use a CDN to cache the file/blob in edge locations around the world.
Avoid using blob storage like S3 as your primary data store unless you have a very good reason. In a typical setup
you will have a core database like Postgres or DynamoDB that has pointers (just a url) to the blobs stored in S3. This
allows you to use the database to query and index the data, while still getting the benefits of cheap blob storage.
Full-text search is the ability to search through a large amount of text
data and find relevant results. This is different from a traditional
database query, which is usually based on exact matches or ranges.
Without a search optimized database, you would need to run a query
that looks something like this:
SELECT * FROM documents WHERE document_text LIKE '%search_term%'

This query is slow and inefficient, and it doesn't scale well because it
requires a full table scan. Search optimized databases, on the other
hand, are specifically designed to handle full-text search. They use
techniques like indexing, tokenization, and stemming to make search
queries fast and efficient. In short, they work by building what are
called inverted indexes.
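
A toy illustration of an inverted index (real engines add tokenization rules, stemming, stop-word removal, and ranking on top of this):

from collections import defaultdict

def tokenize(text: str) -> list[str]:
    # Real engines also apply stemming, stop-word removal, etc.
    return text.lower().split()

def build_inverted_index(documents: dict[int, str]) -> dict[str, set[int]]:
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)  # token -> set of documents containing it
    return index

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dogs bark"}
index = build_inverted_index(docs)
print(index["quick"])  # {1, 3} -- found without scanning every document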

Examples of search optimized databases


The clear leader in this space is Elasticsearch. Elasticsearch is a
distributed, RESTful search and analytics engine that is built on top
of Apache Lucene.
Elasticsearch has hosted offerings like Elastic Cloud and AWS OpenSearch which make it easy
to get started.

Elastic Search:
Sharding in ElasticSearch:

When someone searches for "Harry Potter":

 Elasticsearch assigns each document to a shard by hashing its routing key (the document ID by default), not by the first letter of the text.
 By default, the search query is sent to all shards, and the results from all shards are combined and returned to the user. If a custom routing key is specified, the query can be directed to only the shard(s) holding the relevant documents.

What Are Primary Shards?

When you create an index in Elasticsearch, the data in that index is divided into
smaller, more manageable pieces called primary shards. Each primary shard is
responsible for storing and indexing a portion of your data.
CDN: CDNs are often used to deliver static content like images, videos, and HTML files, but
they can also be used to deliver dynamic content like API responses.

They work by caching content on servers that are close to users.


When a user requests content, the CDN routes the request to the
closest server. If the content is cached on that server, the CDN will
return the cached content. If the content is not cached on that server,
the CDN will fetch the content from the origin server, cache it on the
server, and then return the content to the user.
The most common application of a CDN in an interview is to cache
static media assets like images and videos. For example, if you have
a social media platform like Instagram, you might use a CDN to
cache user profile pictures. This would allow you to serve profile
pictures quickly to users all over the world.
Some of the most popular CDNs are Cloudflare, Akamai, and Amazon CloudFront. These
CDNs offer a range of features, including caching, DDoS protection, and web application firewalls.
They also have a global network of edge locations, which means that they can deliver content to
users around the world with low latency.

A CDN (Content Delivery Network) does cache content itself, but only under certain conditions:

1. Caching Behavior of CDNs


 What gets cached: CDNs typically cache static content like images, CSS, JavaScript, and other
files that don't change frequently. This is because static content is easier to cache and
doesn't require complex logic to determine freshness or validity.
 Dynamic content caching: Dynamic content, such as API responses, can also be cached by
CDNs, but caching dynamic content depends on specific rules and configurations. Since
dynamic content often changes based on user interactions or real-time data, CDNs need
more precise settings to determine when to serve cached responses or fetch fresh data.

Edge Servers: CDNs consist of multiple edge servers located globally. These servers store cached content so that users get faster responses by accessing the nearest server.
In summary, CDNs do cache content within themselves (on edge
servers), but what and how they cache is determined by
configurations, HTTP headers, and the type of content being
delivered.

CDNs (Content Delivery Networks) deliver content to users around the world efficiently by leveraging a global network of edge servers.

Why Do CDNs Have Global Reach?

CDNs like Cloudflare, Akamai, and Amazon CloudFront maintain thousands of edge
servers worldwide in regions like:

 North America, Europe, Asia-Pacific, Africa, South America, and the Middle East.
 They are strategically placed near major Internet Exchange Points (IXPs) to connect efficiently
to local ISPs (Internet Service Providers).
Patterns:

Simple DB-backed CRUD service with caching


Async job worker pool

This pattern is common in systems that need to process a lot of data, like a social network that needs to
process a lot of images or videos. You'll use a queue to store jobs, and a pool of workers to process them.

A popular option for the queue is SQS, and for the workers, you might use a pool of EC2 instances or
Lambda functions. SQS guarantees at least once delivery of messages and the workers will respond back
to the queue with heartbeat messages to indicate that they are still processing the job. If the worker fails
to respond with a heartbeat, the job will be retried on another host.
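
A minimal worker loop against SQS might look like the sketch below (assumes boto3 and a hypothetical queue URL; extending the visibility timeout as a heartbeat is omitted for brevity):

import boto3  # AWS SDK; assumes credentials and a queue already exist

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # hypothetical

def process_job(body: str):
    print("processing", body)

def run_worker():
    while True:
        # Long-poll for up to 20s to avoid hammering the API with empty receives.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            process_job(msg["Body"])
            # Delete only after successful processing; otherwise the message
            # becomes visible again after the visibility timeout and is retried.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])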

Another option is for your queue to be a log of events coming from something like Kafka. Kafka gives you
many of the same guarantees as SQS, but since the requests are written to an append-only log, you can
replay the log to reprocess events if something goes wrong.

Kafka is a system designed to handle large streams of data in real-time. It works like a message broker,
which means it helps applications send messages to each other. But unlike some other systems (like SQS),
Kafka stores messages in a log.
An append only log is like a diary or notebook where you keep writing new entries at the end. Once
written, entries are never erased or modified.

In Kafka:

1. Messages (or “events”) are written to this log in order.

2. These messages are stored for a certain period (or until they are manually deleted), even
after being read.

3. You can “replay” the log to process past messages again if needed

How does this help?

1. Replay Events: If something goes wrong while processing a message, you can go back to the
log and reprocess it. For example:

• Imagine you’re processing payments, and a server crashes.

• With Kafka, you can replay the log to process payments from the point of failure without
losing data.

2. Multiple Consumers: Many different systems can read the same log and process it
independently without interfering with each other

Example to clarify:

Imagine you are running an online store:

1. Every time a user makes a purchase, an “Order Placed” event is sent to Kafka.

2. This event is stored in the Kafka log.

3. Different systems read the log:

• One system sends an email to the customer.

• Another updates the inventory.

• A third handles payment processing.

4. If the payment system crashes, you can replay the log to reprocess payment events.
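
A replay consumer for such a log might look like this (a sketch assuming the kafka-python client and a hypothetical "orders" topic; reading with a fresh consumer group from the earliest offset replays the retained log):

from kafka import KafkaConsumer  # pip install kafka-python
import json

# auto_offset_reset='earliest' plus a fresh group_id means we read the topic
# from the beginning of the retained log, i.e. a full replay.
consumer = KafkaConsumer(
    "orders",                                   # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="payments-replay-1",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    order = message.value
    print(f"reprocessing order {order.get('order_id')} from offset {message.offset}")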
Two stage architecture
A common problem in system design is "scaling" an algorithm with poor performance characteristics. A two-stage architecture addresses this by splitting the work into two passes: a cheap, fast first stage (a filter, an index lookup, or an approximate model) narrows the full dataset down to a small set of candidates, and a more expensive, more accurate second stage runs only on those candidates. Because the expensive algorithm never touches the whole dataset, the overall cost stays manageable as the data grows.
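
A minimal sketch of the idea (the filter and scoring functions are invented placeholders):

def cheap_filter(item, query) -> bool:
    # Stage 1: fast, approximate check (e.g. keyword match, geohash prefix,
    # bloom filter, ANN index lookup). Cheap enough to run on everything.
    return query.lower() in item["title"].lower()

def expensive_score(item, query) -> float:
    # Stage 2: slow, precise ranking (e.g. ML model, exact distance).
    # Affordable because it only runs on the small candidate set.
    return sum(word in item["body"] for word in query.split())

def search(items, query, k=10):
    candidates = [it for it in items if cheap_filter(it, query)]   # stage 1
    ranked = sorted(candidates, key=lambda it: expensive_score(it, query), reverse=True)
    return ranked[:k]                                              # stage 2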
Event-Driven Architecture
Event-Driven Architecture (EDA) is a design pattern centered around
events. This architecture is particularly useful in systems where it is
crucial to react to changes in real-time. EDA helps in building
systems that are highly responsive, scalable, and loosely coupled.

The core components of an EDA are event producers, event routers (or brokers), and event
consumers. Event producers generate a stream of events which are sent to an event router. The
router, such as Apache Kafka or AWS EventBridge, then dispatches these events to appropriate
consumers based on the event type or content. Consumers process the events and take
necessary actions, which could range from sending notifications to updating databases or
triggering other processes.

An example use of EDA could be in an e-commerce system where an event is emitted every time
a new order is placed. This event can trigger multiple downstream processes like order
processing, inventory management, and notification systems simultaneously.

One of the more important design decisions in event-driven architectures is how to handle failures. Technologies like Kafka keep a durable log of their events with configurable retention, which allows processors to pick up where they left off. This can be a double-edged sword! If your system can only process N messages per second, you may quickly find yourself in a situation where it takes hours or even days to catch back up, with the service substantially degraded the entire time. Be careful about where this is used.
Key Point

The durable log in Kafka ensures no messages are lost, which is great for reliability.
However, if your system gets overwhelmed by a backlog, this reliability can become a
double-edged sword:

 The backlog grows and persists.


 Your system may struggle to catch up and may be degraded for hours or days.
DURABLE JOB PROCESSING

Some systems need to manage long-running jobs that can take hours or days to complete. For
example, a system that needs to process a large amount of data might need to run a job that takes
a long time to complete. If the system crashes, you don't want to lose the progress of the job. You
also want to be able to scale the job across multiple machines.
A common pattern is to use a log like Kafka to store the jobs, and then have a pool of workers that
can process the jobs. The workers will periodically checkpoint their progress to the log, and if a
worker crashes, another worker can pick up the job where the last worker left off. Another option is
to use something like Uber's Cadence (more popularly Temporal).

Setups like this can be difficult to evolve with time. For example, if you want to change the format of the job, you'll
need to handle both the old and new formats for a while.
How to handle these changes?
· Kafka and tools like Temporal ensure jobs are processed reliably, even if workers crash.
· The challenge lies in evolving the system over time, such as changing the job format.
· Solutions include versioning, backward compatibility, and careful updates to handle both old and
new job formats.

Proximity Based services:


Several systems like Design Uber or Design Gopuff will require you to search for entities by
location. Geospatial indexes are the key to efficiently querying and retrieving entities based on
geographical proximity. These services often rely on extensions to commodity databases like
PostgreSQL with PostGIS extensions or Redis' geospatial data type, or dedicated solutions
like Elasticsearch with geo-queries enabled.

The architecture typically involves dividing the geographical area into manageable regions and
indexing entities within these regions. This allows the system to quickly exclude vast areas that
don't contain relevant entities, thereby reducing the search space significantly.

Note that most systems won't require users to be querying globally. Often, when proximity is
involved, it means users are looking for entities local to them.

With a geospatial index, the database divides the area into smaller regions, so it can quickly exclude irrelevant regions and focus only on the relevant areas.
A common setup is PostgreSQL with the PostGIS extension.
How the Index Works
 The geospatial index divides the world into smaller regions (e.g., a grid or tree structure).
 When you query for nearby restaurants:
o The index quickly identifies regions near the user’s location.
o It only checks restaurants in those regions, skipping the rest.

For example:

 If the user is in NYC, the index skips all restaurants in San Francisco, significantly reducing the
search space.
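
As a toy illustration of the grid idea (in-memory only, not PostGIS; the cell size and data are made up):

from collections import defaultdict

CELL_SIZE = 0.01  # roughly 1 km per cell at the equator; illustrative only

def cell_of(lat: float, lng: float) -> tuple[int, int]:
    return (int(lat // CELL_SIZE), int(lng // CELL_SIZE))

grid = defaultdict(list)  # cell -> list of (name, lat, lng)

def add_restaurant(name, lat, lng):
    grid[cell_of(lat, lng)].append((name, lat, lng))

def nearby(lat, lng):
    # Only look at the user's cell and its 8 neighbours, skipping everything else.
    cx, cy = cell_of(lat, lng)
    results = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            results.extend(grid[(cx + dx, cy + dy)])
    return results

add_restaurant("NYC Pizza", 40.7128, -74.0060)
add_restaurant("SF Tacos", 37.7749, -122.4194)
print(nearby(40.7130, -74.0050))  # finds the NYC restaurant, never scans SF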

While geospatial indexes are great, they're only really necessary when you need to index hundreds of thousands or
millions of items. If you need to search through a map of 1,000 items, you're better off scanning all of the items than
the overhead of a purpose-built index or service.

System Design Interview guide Alex Xu:

How websites or apps actually work


An Internet Protocol (IP) address is returned to the browser or mobile app from the DNS server. Once the IP address is obtained, Hypertext Transfer Protocol (HTTP) requests are sent directly to your web server. The web server returns HTML pages or a JSON response for rendering.
• Web application: it uses a combination of server-side languages
(Java, Python, etc.) to handle business logic, storage, etc., and
client-side languages (HTML and JavaScript) for presentation.
• Mobile application: HTTP is the communication protocol between the mobile app and the web server. JavaScript Object Notation (JSON) is a commonly used API response format to transfer data due to its simplicity.
An example of the API response

A web application is a program that runs in a web browser and interacts with a server. It has two main parts:
· Web Server: Serves static files like HTML, CSS, and JS.
· Application Server: Handles server-side programming, connects to databases, and creates dynamic
content.
· Both servers often work together to deliver complete functionality in a web application.

 Web Server is designed to serve HTTP content. An App Server can also serve HTTP content but is not limited to just HTTP; it can also support other protocols such as RMI/RPC.
 Web Server is mostly designed to serve static content, though most Web
Servers have plugins to support scripting languages like Perl, PHP, ASP, JSP
etc. through which these servers can generate dynamic HTTP content.

Feature                | Web Server                             | Application Server
Purpose                | Serves static content (HTML, CSS, JS). | Handles dynamic content and logic.
Content Type           | Static (e.g., HTML files).             | Dynamic (e.g., data from a database).
Role                   | Responds to HTTP requests.             | Processes business logic and APIs.
Examples               | Nginx.                                 | Tomcat, Django, Spring Boot, Node.js
Connection to Database | No direct database interaction.        | Interacts with databases for data.

 Most application servers have a Web Server as an integral part of them, which means an App Server can do whatever a Web Server is capable of. Additionally, App Servers have components and features to support application-level services such as Connection Pooling, Object Pooling, Transaction Support, Messaging services, etc.
Transactions:

Messaging services allow different parts of an application or different systems to communicate with each other by sending and
receiving messages. These services make it easier to build
distributed applications where different components can work
together, even if they run on separate servers or platforms.

As web servers are well suited for static content and app servers for dynamic content, most
of the production environments have web server acting as reverse proxy to app server.
That means while servicing a page request, static contents (such as images/Static HTML)
are served by web server that interprets the request. Using some kind of filtering technique
(mostly extension of requested resource) web server identifies dynamic content request
and transparently forwards to app server.

· Reverse Proxy: When you need to improve backend security, enable SSL termination, or cache
content close to clients.

E.g : Nginx
· Forward Proxy: When clients need anonymity, content filtering, or access to restricted resources.
· Load Balancer: When you need to distribute traffic for better availability and performance.

A load balancer can:

1. Forward to the same service for scaling and fault tolerance.


2. Forward to different services to support microservices, path-based routing, or feature-
specific workflows.
· API Gateway: When managing microservices, securing APIs, or enabling protocol translations.

--> This interaction might differ


· A Reverse Proxy can sometimes act as a Load Balancer.
· An API Gateway often includes reverse proxy features like routing and SSL termination.

The order in which a Load Balancer, API Gateway, and Proxy are
placed in an architecture depends on the specific design and
purpose of each component.
Example Scenario

For a cloud-based architecture:

 Proxy: AWS CloudFront (for caching and edge delivery).


 Load Balancer: AWS Elastic Load Balancer (distributing requests across regions or instances).
 API Gateway: AWS API Gateway (handling API logic, security, and routing).

This order can vary:

 In some setups, the API Gateway itself might perform load balancing (e.g., AWS API
Gateway).
 A proxy might not be required if a load balancer or API Gateway already handles all
necessary functions.
Database: With the growth of the user base, one server is not
enough, and we need multiple servers: one for web/mobile traffic,
the other for the database. Separating web/mobile traffic (web tier)
and database (data tier) servers allows them to be scaled
independently.

A server has limited CPU power, memory (RAM), disk space, and network bandwidth. When we say a server has limited
bandwidth, it refers to the amount of data that can be
transferred between the server and its clients (users or other
systems) over the network in a given period of time. This
limitation is typically measured in terms of megabits per second
(Mbps), gigabits per second (Gbps).

Non-relational databases might be the right choice if:
• Your application requires super-low latency.
• Your data are unstructured, or you do not have any relational data.
• You only need to serialize and deserialize data (JSON, XML, YAML, etc.).
• You need to store a massive amount of data.
Non-relational databases are often the right choice when you only
need to serialize and deserialize data because they can
efficiently handle data formats like JSON, XML, or YAML without
requiring rigid schemas or complex relational mapping.


Let’s break down serialization and deserialization with a beginner-friendly


explanation and examples:

What is Serialization?
 Serialization is the process of converting data or objects (like those used in programming)
into a format that can be stored or transmitted.
 These formats are typically text-based or binary, such as JSON, XML, or YAML, which are
easy to send over a network or save in a file.

Think of it as packing data into a suitcase so it can be sent or stored.

Example of Serialization in JSON:

Imagine you have an object in Python:

person = {
    "name": "Alice",
    "age": 30,
    "city": "New York"
}

To send this over the internet or store it in a database, you serialize it into a JSON
string:

{
    "name": "Alice",
    "age": 30,
    "city": "New York"
}

This JSON string is the serialized version of the person object. It’s compact and easy
to send or save.

What is Deserialization?
 Deserialization is the reverse process: converting the serialized data (e.g., JSON or XML) back
into the original object or data structure that can be used by the application.

Think of it as unpacking the data from the suitcase so you can use it.
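
In Python, the standard json module handles both directions:

import json

person = {"name": "Alice", "age": 30, "city": "New York"}

serialized = json.dumps(person)        # Python dict -> JSON string ("packing")
restored = json.loads(serialized)      # JSON string -> Python dict ("unpacking")

print(serialized)        # {"name": "Alice", "age": 30, "city": "New York"}
print(restored["name"])  # Alice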

Users connect to the public IP of the load balancer directly. With this setup, web servers are no longer directly reachable by clients. For better security, private IPs are used for
communication between servers. A private IP is an IP address
reachable only between servers in the same network; however, it is
unreachable over the internet. The load balancer communicates
with web servers through private IPs.

If the website traffic grows rapidly, and two servers are not enough
to handle the traffic, the load balancer can handle this problem
gracefully. You only need to add more servers to the web server
pool, and the load balancer automatically starts to send requests to
them.

“Database replication can be used in many database management systems, usually with a master/slave relationship between the original (master) and the copies (slaves)”

A master database generally only supports write operations. A slave database gets copies of the data from the master database
and only supports read operations. All the data-modifying
commands like insert, delete, or update must be sent to the master
database. Most applications require a much higher ratio of reads to
writes; thus, the number of slave databases in a system is usually
larger than the number of master databases.

If only one slave database is available and it goes offline, read operations will be directed to the master database temporarily. As
soon as the issue is found, a new slave database will replace the old
one. In case multiple slave databases are available, read operations
are redirected to other healthy slave databases. A new database
server will replace the old one.

If the master database goes offline, a slave database will be promoted to be the new master. All the database operations will be
temporarily executed on the new master database. A new slave
database will replace the old one for data replication immediately.
In production systems, promoting a new master is more
complicated as the data in a slave database might not be up to
date. The missing data needs to be updated by running data
recovery scripts. Although some other replication methods like
multi-masters and circular replication could help, those setups are
more complicated.

Improving the load/response time can be done by adding a cache layer and shifting static content (JavaScript/CSS/image/video files) to a content delivery network (CDN).

Cache: Every time a new web page loads, one or more database
calls are executed to fetch data. The application performance is
greatly affected by calling the database repeatedly. The cache can
mitigate this problem.

After receiving a request, a web server first checks if the cache has
the available response. If it has, it sends data back to the client. If
not, it queries the database, stores the response in cache, and
sends it back to the client. This caching strategy is called a read-
through cache.
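
A minimal sketch of this caching strategy, with an in-memory dict standing in for a real cache like Redis (the names and TTL are illustrative):

import time

cache = {}          # in-memory stand-in for something like Redis
CACHE_TTL = 60      # seconds

def query_database(user_id):
    # Placeholder for the real (slow) database query.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    entry = cache.get(user_id)
    if entry and time.time() - entry["cached_at"] < CACHE_TTL:
        return entry["value"]                      # cache hit: no DB call
    value = query_database(user_id)                # cache miss: hit the database
    cache[user_id] = {"value": value, "cached_at": time.time()}
    return value

print(get_user(1))  # queries the "database" and fills the cache
print(get_user(1))  # served from the cache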

Inconsistency can happen because data-modifying operations on the data store and cache are not in a single transaction. When
scaling across multiple regions, maintaining consistency between
the data store and cache is challenging.

CDN: A CDN is a network of geographically dispersed servers used to deliver static content. CDN servers cache static content like images, videos, CSS, and JavaScript files. Dynamic content caching can also be done by a CDN.
Domain: A domain is the part of a URL that specifies the address
of a server, like cloudfront.net (for Amazon CloudFront) or
akamai.com (for Akamai).

A third-party provider in this context means a company that specializes in providing CDN services, such as Cloudflare, Akamai,
or Amazon CloudFront. These providers maintain a global network
of servers (also called edge servers) to deliver your content
efficiently.

Global Network of Servers (Edge Locations):

 CDNs like Cloudflare and Amazon CloudFront have servers distributed across the world.

Static assets (JS, CSS, images, etc.) are no longer served by web servers; they are fetched from the CDN for better performance. The database load is lightened by caching data.

State:

In the context of web applications, "state" refers to any data that is stored or tracked
during a user's interaction with the web application. This data can include things like:

 User session information: This might include things like the user’s login status, preferences,
or items in their shopping cart.
 Application data: This can be temporary or session-specific data related to the operations or
actions a user is performing on the site.

When you interact with a web application, the system needs to remember things about
you across different pages. For example, when you log in to an online store and add
items to your cart, the system needs to keep track of who you are and what you've
added to the cart while you navigate between pages. This is the "state" of your
session.

Stateless web tier: it is time to consider scaling the web tier horizontally. For this, we need to move state (for instance user
session data) out of the web tier. A good practice is to store session
data in the persistent storage such as relational database or
NoSQL. Each web server in the cluster can access state data from
databases. This is called stateless web tier.
1. To authenticate User A, HTTP requests must be routed to Server
1. If a request is sent to other servers like Server 2, authentication
would fail because Server 2 does not contain User A’s session data.
Similarly, all HTTP requests from User B must be routed to Server
2; all requests from User C must be sent to Server 3
The issue is that every request from the same client must be routed to the same server. This can be done with sticky sessions in most load balancers [10]; however, this adds overhead. Adding or removing servers is much more difficult with this approach. It is also challenging to handle server failures.
After the state data is removed out of web servers, auto-scaling of
the web tier is easily achieved by adding or removing servers based
on traffic load.

Data Centers:
Example setup with two data centers. In normal operation, users
are geoDNS-routed, also known as geo-routed, to the closest data
center, with a split traffic of x% in US-East and (100 – x)% in US-
West. geoDNS is a DNS service that allows domain names to be
resolved to IP addresses based on the location of a user.
In the event of any significant data center outage, we direct all
traffic to a healthy data center.

To further scale our system, we need to decouple different components of the system so they can be scaled independently. A messaging queue is a key strategy employed by many real-world distributed systems to solve this problem.
Message Queue:
Input services, called producers/publishers, create messages, and
publish them to a message queue. Other services or servers, called
consumers/subscribers, connect to the queue, and perform actions
defined by the messages.

With the message queue, the producer can post a message to the
queue when the consumer is unavailable to process it. The
consumer can read messages from the queue even when the
producer is unavailable.

The producer and the consumer can be scaled independently.


When the size of the queue becomes large, more workers are
added to reduce the processing time. However, if the queue is
empty most of the time, the number of workers can be reduced.
Types of Message Queues:

1. Point-to-Point (P2P) Queue

In this model, messages are sent from one producer to one consumer.

Used when a message needs to be processed by a single consumer, such as in task processing systems.
LOGGING, METRICS and AUTOMATION:
Logging: Monitoring error logs is important because it helps to
identify errors and problems in the system. You can monitor error
logs at per server level or use tools to aggregate them to a
centralized service for easy search and viewing.
Metrics: Collecting different types of metrics help us to gain
business insights and understand the health status of the system.
Some of the following metrics are useful:
• Host level metrics: CPU, Memory, disk I/O, etc.
• Aggregated level metrics: for example, the performance of the
entire database tier, cache tier, etc.
• Key business metrics: daily active users, retention, revenue, etc.
Automation: When a system gets big and complex, we need to
build or leverage automation tools to improve productivity.
Continuous integration is a good practice, in which each code
check-in is verified through automation, allowing teams to detect
problems early. Besides, automating your build, test, deploy
process, etc. could improve developer productivity significantly.
DATABASE SCALING:

Horizontal scaling:

Sharding separates large databases into smaller, more easily managed parts called shards. Each shard shares the same schema, though the actual data on each shard is unique to the shard.
The most important factor to consider when implementing a
sharding strategy is the choice of the sharding key. Sharding key
(known as a partition key) consists of one or more columns that
determine how data is distributed. As shown in Figure 1-22,
“user_id” is the sharding key. A sharding key allows you to retrieve
and modify data efficiently by routing database queries to the
correct database. When choosing a sharding key, one of the most important criteria is to choose a key that can evenly distribute data.
After denormalization, the Users table might look like this:

a summary of how we scale our system to support millions of users:

• Keep web tier stateless
• Build redundancy at every tier
• Cache data as much as you can
• Support multiple data centers
• Host static assets in CDN
• Scale your data tier by sharding
• Split tiers into individual services
• Monitor your system and use automation tools
BACK OF THE ENVELOPE ESTIMATION:
Dr. Dean from Google revealed the latency of typical computer operations in 2010.
By analyzing those numbers, we get the following conclusions:
• Memory is fast but the disk is slow.
• Avoid disk seeks if possible.
• Simple compression algorithms are fast.
• Compress data before sending it over the internet if possible.
• Data centers are usually in different regions, and it takes time to
send data between them.
Consensus in Distributed Systems:
A consensus algorithm, if it can handle Byzantine failure, can handle any type of consensus problem in a distributed system.

Consensus Algorithms

Voting-based Consensus Algorithms

1. Practical Byzantine Fault Tolerance

If more than two-thirds of all nodes in a system are honest then consensus can be reached.

The protocol is divided into three phases (pre-prepare, prepare, commit), and nodes are sequentially ordered with one node being the Primary node (or leader node) and the others as Secondary nodes (or backup nodes). The objective is that all non-faulty nodes help in achieving a consensus regarding the state of the system using the majority rule.
2. Other Notable Algorithms

There are other voting-based consensus algorithms like —

 HotStuff
 Paxos
 Raft etc…

Proof-based Consensus Algorithms

In this case, a participant must show adequate proof of something in order to contribute to decision-making.

There are several Proof-based Consensus algorithms —

 Proof of Work (PoW)


 Proof of Stake (PoS)

Consensus algorithms are crucial in decentralized networks, where there is no central authority to coordinate decisions.
https://blog.algomaster.io/p/batch-processing-vs-stream-processing

Batch Processing:

Example Use Cases:

Generating end-of-day reports


Processing payroll at the end of the month.
Performing large-scale ETL (Extract, Transform, Load) tasks

Batch Processing Frameworks and Tools: Apache Hadoop, Apache Spark, AWS Batch

Stream Processing:
Stream Processing Workflow:
Stream Processing Frameworks and Tools: Apache Kafka, Apache Flink, Amazon Kinesis

Real-time requirements: If your application requires immediate insights or actions based on incoming data, stream processing is the way to go.
Complexity: If your processing tasks require complex algorithms
and data transformations, batch processing might be more suitable.
Data nature: Is your data finite and predictable in size, or an
unbounded, ongoing flow? Batch processing is better suited for the
former, stream processing for the latter.

Hybrid Approach: Micro-Batch Processing

Some systems like Apache Spark Streaming employ a hybrid approach known as micro-batch processing.

This method bridges the gap between traditional batch and stream
processing by processing small chunks of data over short intervals.

This allows for near real-time processing with the simplicity of batch
processing.

What exactly is a Heartbeat?

In distributed systems, a heartbeat is a periodic message sent from one component to another to monitor each other's health and status.
This signal is usually a small packet of data transmitted at regular
intervals, typically ranging from seconds to minutes, depending on the
system's requirements.
Frequency: How often should heartbeats be sent? There needs to
be a balance. If they're sent too often, they'll use up too much
network resources. If they're sent too infrequently, it might take
longer to detect problems.
Timeout: How long should a node wait before it considers another
node 'dead'? This depends on expected network latency and
application needs. If it's too quick, it might mistake a live node for a
dead one, and if it's too slow, it might take longer to recover from
problems.
Payload: Heartbeats usually just contain a little bit of information
like a timestamp or sequence number. But, they can also carry
additional data like how much load a node is currently handling,
health metrics, or version information.
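
A toy sketch of a heartbeat monitor (the interval, timeout, and node names are illustrative):

import time

HEARTBEAT_INTERVAL = 5    # seconds between heartbeats (illustrative)
TIMEOUT = 15              # declare a node dead after roughly 3 missed heartbeats

last_seen = {}            # node_id -> timestamp of the most recent heartbeat

def receive_heartbeat(node_id, payload=None):
    # The payload could carry load or health metrics; here we only track the time.
    last_seen[node_id] = time.time()

def dead_nodes():
    now = time.time()
    return [node for node, ts in last_seen.items() if now - ts > TIMEOUT]

receive_heartbeat("node-a")
receive_heartbeat("node-b")
# In a real monitor this check runs periodically in the background.
print(dead_nodes())  # [] while both nodes are within the timeout window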

Types of Heart Beats:

Push heartbeats: Nodes actively send heartbeat signals to the monitor.


Pull heartbeats: The monitor periodically queries nodes for their status.
Circuit Breaker Pattern

The Circuit Breaker design pattern is used to stop the request and
response process if a service is not working.
You can leverage the Circuit Breaker Design Pattern to avoid such issues. The consumer will use this pattern to invoke a remote service through a proxy, and this proxy behaves as the circuit breaker.

When the number of failures reaches a certain threshold, the circuit breaker trips for a defined duration of time.

During this timeout period, any requests to the offline server will fail. When that time period is up, the circuit breaker will allow a limited number of test requests to pass, and if those requests are successful, the circuit breaker will return to normal operation. If there is a failure, the timeout period will start again.
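
A simplified sketch of the pattern (the thresholds are illustrative; a production implementation would model an explicit half-open state and keep one breaker per dependency):

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout     # how long the circuit stays open (seconds)
        self.failures = 0
        self.opened_at = None                  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")  # don't touch the service
            # Timeout elapsed: allow a trial request; a single failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
            self.failures = 0                  # success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # trip the breaker
            raise

# breaker = CircuitBreaker()
# breaker.call(call_remote_service)   # wrap every call to the flaky dependency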
--> Service Running out of threads

--> cascading failures


During the failure period, all the requests that come to consume service A are sent back immediately with an error message, so no queue of waiting requests builds up. When the service comes back online, it will be open for new traffic.
From the user’s perspective, having to wait a long time for a
response is not a good user experience. Rather than keeping the
consumer waiting for a longer duration, it is better to respond
quickly. It doesn’t matter if it’s a success or a failure; what counts is
that the user isn’t kept waiting.

The Aggregator Pattern is a design pattern used in distributed systems and microservices architectures. It allows a service or
component (called the aggregator) to gather data or responses
from multiple backend services and combine them into a single
response to send back to the client.

Idempotency:
This scenario highlights a common problem in distributed
systems: handling repeated operations gracefully.

The solution to this problem lies in the concept of idempotency.

For example, the absolute value function is idempotent: abs(abs(-5)) = abs(-5) = 5.

Idempotency is a property of certain operations whereby executing the same operation multiple times produces the same result as executing it once.

For example, if a request to delete an item is idempotent, all requests after the first will have no additional impact.
If the system handles retries without idempotency, every retry could
change the system’s state unpredictably. By designing operations to be
idempotent, engineers create a buffer against unexpected behaviors
caused by retries. This “safety net” prevents repeated attempts from
distorting the outcome, ensuring stability and reliability.
Each message has a unique messageId. Before processing, we check if
the messageId is already in processedMessages. If it is, the message is
ignored; otherwise, it’s processed and added to the set to avoid
duplicates.
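
A minimal sketch of that check, with an in-memory set standing in for a shared store such as Redis or a database table:

processed_messages = set()   # in production this would live in a shared store

def apply_side_effect(message):
    print("processing", message["payload"])   # e.g. charge a payment, update a balance

def handle_message(message):
    message_id = message["messageId"]
    if message_id in processed_messages:
        return                      # duplicate delivery or retry: safely ignored
    apply_side_effect(message)
    processed_messages.add(message_id)

msg = {"messageId": "abc-123", "payload": "charge $10"}
handle_message(msg)
handle_message(msg)   # second delivery has no effect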

HTTP IDEMPOTENT METHODS: GET, PUT, DELETE

HTTP NON-IDEMPOTENT METHODS: POST (and, in practice, PATCH)

Whether you're designing a distributed database, a payment processing system, or a simple web API, considering idempotency in your design can
save you (and your users) from many headaches down the road.

Web Sockets: Websockets are a communication protocol used to build real-time features by establishing a two-way connection between a client and a server.
Imagine an online multiplayer game where the leaderboard updates
instantly as players score points, showing real-time rankings of all
players.

WebSockets enable full-duplex, bidirectional communication between a client (typically a web browser) and a server over a single TCP connection.

Unlike the traditional HTTP protocol, where the client sends a request to
the server and waits for a response, WebSockets allow both the client and
server to send messages to each other independently and continuously
after the connection is established.

Full-duplex means that communication can occur simultaneously in both directions over a single connection.
Latency: Polling introduces delays because updates are only checked
periodically
Consistent Hashing:

Hotspot: A performance-degraded node in a distributed system due to a large share of data storage and a high volume of retrieval or storage requests.
The trade-off of cache replication: consistency between cache replicas is expensive to maintain.
The spread is the number of cache servers holding the same key-value pair (data object). The load is the number of distinct data objects assigned to a cache server. The optimal configuration for the high performance of a cache server is to keep the spread and the load at a minimum.
Partitioning and Replication are orthogonal because partitioning ensures scalability (data
distribution), while replication ensures reliability (data redundancy).
The reasons for partitioning are:

1) a cache server is memory bound
2) increased throughput
Partitioning

The data set is partitioned among multiple nodes to horizontally scale out. The different
techniques for partitioning the cache servers are the following :

Random assignment
Single global cache
Key range partitioning
Static hash partitioning
Consistent hashing

The basic gist behind the consistent hashing algorithm is to hash both node identifiers
and data keys using the same hash function. A uniform and independent hashing
function such as message-digest 5 (MD5) is used to find the position of the nodes and
keys (data objects) on the hash ring. The output range of the hash function must be of
reasonable size to prevent collisions.

There is a chance that nodes are not uniformly distributed on the consistent hash ring.
The nodes that receive a huge amount of traffic become hotspots resulting in cascading
failure of the nodes.

Consistent hashing: Virtual nodes

The nodes are assigned to multiple positions on the hash ring by hashing the node IDs
through distinct hash functions to ensure uniform distribution of keys among the nodes.
The technique of assigning multiple positions to a node is known as a virtual node.
The virtual nodes improve the load balancing of the system and prevent hotspots. The
number of positions for a node is decided by the heterogeneity of the node. In other
words, the nodes with a higher capacity are assigned more positions on the hash ring.

More space is needed to store data about virtual nodes. This is a tradeoff, and we can tune the number of virtual nodes to fit our system requirements.

The data objects can be replicated on adjacent nodes to minimize the data movement
when a node crashes or when a node is added to the hash ring. In conclusion, consistent
hashing resolves the problem of dynamic load.
The distributed NoSQL data stores such as Amazon DynamoDB, Apache Cassandra,
and Riak use consistent hashing to dynamically partition the data set across the set
of nodes. The data is partitioned for incremental scalability.
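
A minimal sketch of a consistent hash ring with virtual nodes (the node names and virtual-node count are illustrative):

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = []                                   # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes):                      # virtual nodes for uniformity
                pos = self._hash(f"{node}#{i}")
                self.ring.append((pos, node))
        self.ring.sort()
        self._positions = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        pos = self._hash(key)
        idx = bisect.bisect(self._positions, pos) % len(self.ring)  # walk clockwise
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("user:42"))   # the same key always maps to the same node
# Adding or removing a node only remaps keys in the affected ring segments.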

Examples of Consistent Hashing:

Discord chat application

Akamai content delivery network

Maglev network load balancer

Consistent hashing optimization


Some of the popular variants of consistent hashing are the following:

Multi-probe consistent hashing

Consistent hashing with bounded loads: The consistent hashing with bounded
load puts an upper limit on the load received by a node on the hash ring, relative to
the average load of the whole hash ring. The distribution of requests is the same as
consistent hashing as long as the nodes are not overloaded.

When a specific data object becomes extremely popular, the node hosting the data
object receives a significant amount of traffic resulting in the degradation of the
service. If a node is overloaded, the incoming request is delegated to a fallback
node. The list of fallback nodes will be the same for the same request hash. In
simple words, the same node(s) will consistently be the “second choice” for a
popular data object. The fallback nodes resolve the popular data object caching
problem.

If a node is overloaded, the list of the fallback nodes will usually be different for
different request hashes. In other words, the requests to an overloaded node are
distributed among the available nodes instead of a single fallback node.

Rate Limiting: https://blog.algomaster.io/p/rate-limiting-algorithms-explained-with-code

Service Discovery: https://blog.algomaster.io/p/0204da93-f0e9-49b9-a88a-cb20b9931575
Think about a massive system like Netflix, with hundreds of microservices
working together. Hardcoding the locations of these services isn’t
scalable. If a service moves to a new server or scales dynamically, it could
break the entire system.

Service discovery solves this by dynamically and reliably enabling services to locate and communicate with one another.
Disaster Recovery:
When creating your recovery strategy, it’s useful to consider your RTO and RPO values
and pick a DR pattern that will enable you to meet those values and your overall goals.
Typically, the smaller your values (or the faster your applications need to recover after an
interruption), the higher the cost to run your application.

Cloud disaster recovery can greatly reduce the costs of RTO and RPO when it comes to
fulfilling on-premises requirements for capacity, security, network infrastructure,
bandwidth, support, and facilities. A highly managed service on Google Cloud can help
you avoid most, if not all, complicating factors and allow you to reduce many business
costs significantly.

https://www.dynatrace.com/news/blog/what-is-distributed-tracing/

Distributed Tracing: Distributed tracing is a method of observing requests as they propagate through distributed cloud environments. It follows an
interaction and tags it with a unique identifier. This identifier stays with the
transaction as it interacts with microservices, containers, and infrastructure.
In turn, this identifier offers real-time visibility into user experience, from the
top of the stack to the application layer and the infrastructure beneath.
It provides a comprehensive view of the entire request lifecycle, including the time taken at each step, the services involved, and any errors or bottlenecks encountered along the way.

https://youtu.be/XYvQHjWJJTE?si=WEk6KLr9HFmvg82D

CCID --> correlation contextID

One widely adopted distributed tracing framework is Open Telemetry. Open Telemetry is a standardized way to instrument applications and collect trace data across various languages and platforms. Open Telemetry is vendor neutral.

With Open Telemetry, developers can capture and export trace data to observability tools for visualization, analysis and troubleshooting. Jaeger and Zipkin integrate with Open Telemetry, can correlate data from spans, and provide web-based visualization.

In addition to open telemetry there are other tools and platforms available that offer
distributed tracing capabilities along with additional features. Some of these include

APM (Application Performance Monitoring) solutions like new relic, splunk, datadog.

APM tools provide end to end visibility into applications performance and offers
features beyond distributed tracing. They collect data on errors, metrics and logs
providing a comprehensive view of the systems health. These tools typically offer
user friendly interfaces and advanced analytics capabilities making them suitable for
monitoring and optimizing complex distributed systems.

The Micrometer library is a metrics collection library that focuses on gathering and exporting metrics data from applications. While it doesn't provide native distributed tracing functionality, it can be used in conjunction with distributed tracing frameworks like Open Telemetry to capture and export metrics alongside trace data. While Open Telemetry also supports metrics collection, Micrometer provides a wider range of integrations out of the box and extensive support for various monitoring systems and frameworks like Prometheus, InfluxDB, etc. In addition, Micrometer provides advanced features for aggregating and analyzing metrics data.

Instead of Open Telemetry (vendor neutral), if we use a vendor-specific solution like New Relic APM for data collection, we are tightly coupled to that particular vendor. If we want to change vendors, it would require changes to instrumentation code and configuration. However, by leveraging Open Telemetry you can achieve a higher level of flexibility and portability: since Open Telemetry provides a standard way to collect and export telemetry data, you can switch between different vendors or tools without needing to modify your application code extensively.
open Telemetry exporters to send the data to new vendors platform without rewriting
or reconfiguring your application instrumentation. This flexibility gives you the
freedom to choose the best monitoring and observability solution for your needs and
easily adapt to changing requirements or preferences.

As far as logging is concerned, logs serve as a trail of breadcrumbs within our application, distinct from distributed tracing. They offer engineers valuable information regarding events within their services. In distributed systems, logs and traces work together: while logs furnish insights into activities within individual services, distributed traces supply information about inter-service interactions.

Bloom Filter: A Bloom filter is a space-efficient probabilistic data structure.

It is used to answer the question, is this element in the set? A Bloom filter would
answer with either a firm no or a probably yes. In other words, false positives are
possible. That is, the element is not there, but the Bloom filter says it is. While false
negatives are not possible. That is, the element is there, but the Bloom filter says it's
not. The probably yes part is what makes a Bloom filter probabilistic.
As with many things in software engineering, this is a trade-off. The trade-off here is
this. In exchange for providing sometimes incorrect false positive answers, a Bloom
filter consumes a lot less memory than a data structure, like a hash table that would
provide a perfect answer all the time.

We cannot remove an item from a bloom filter. It never forgets.

Many NoSQL databases use bloom filters to reduce the disk reads for keys that don't
exist. With an LSM tree-based database, searching for a key that doesn't exist requires
looking through many files and is very costly.

Content delivery networks like Akamai use Bloom Filter to prevent caching one-hit
wonders. These are web pages that are only requested once. According to Akamai,
75% of the pages are one-hit wonders. Using a Bloom Filter to track all the URLs
seen and only caching a page on the second request, it significantly reduces the
caching workload and increases the caching hit rate.

Web browsers like Chrome, used to use a Bloom filter to identify malicious URLs.
Any URL was first checked against a Bloom filter. It only performed a more
expensive full check of the URL if the Bloom filter returned a probably-yes answer.
This is no longer used, however, as the number of malicious URLs grows to the
millions, and a more efficient but complicated solution is needed.

Similarly, some password validators use bloom filter to prevent users from using
weak passwords. Sometimes a strong password will be a victim of a false positive.
But in this case, they could just ask the users to come up with another password.

Now let's discuss how a bloom filter works. A critical ingredient to a good bloom
filter is some good hash functions. These hash functions should be fast, and they
should produce outputs that are evenly and randomly distributed. Collisions are okay
as long as they are rare. A Bloom filter is a large set of buckets, with each bucket
containing a single bit, and they all start with a 0. Let's imagine we want to keep track
of the food I like. For this example, we'll use a Bloom filter with 10 buckets labeled
from 0 to 9. And we would use 3 hash functions. Let's start by putting ribs into the
Bloom filter.

The three hash functions return the numbers 1, 3, and 4, so the buckets at those
positions are set to 1. This can be done in constant time. Next, let's put potato into
the Bloom filter; this time the hash functions return the numbers 0, 4, and 8.
Now let's see if the Bloom filter thinks I like porkchop. In this case, the hash
functions return the buckets 0, 5, and 8. Even though buckets 0 and 8 are set to 1,
bucket 5 is 0, so the Bloom filter can confidently say no, I don't like porkchop.

But how do we get the Bloom filter to tell us something is there when it's not? Let's
walk through an example. Say lemon hashes to buckets 1, 4, and 8. Since all those
buckets are set to 1, the Bloom filter will return yes even though I don't like lemon.
That is a false positive.

We can control how often we see false positives by choosing the correct size for the
Bloom filter based on the expected number of entries. This is a trade-off between
space used and accuracy.
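
To make the walkthrough concrete, here is a minimal Python sketch of a Bloom filter with 10 single-bit buckets and 3 hash functions. The salted SHA-256 hashing is an illustrative choice only; real implementations use faster non-cryptographic hashes and size the bit array from the expected number of entries and the target false-positive rate.

import hashlib

class BloomFilter:
    def __init__(self, num_buckets=10, num_hashes=3):
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes
        self.bits = [0] * num_buckets              # every bucket starts at 0

    def _positions(self, item):
        # Derive k bucket indexes by salting a hash of the item with i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_buckets

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1                      # constant-time insert

    def might_contain(self, item):
        # "Probably yes" only if every bucket is 1; otherwise a firm no.
        return all(self.bits[pos] for pos in self._positions(item))

foods = BloomFilter()
foods.add("ribs")
foods.add("potato")
print(foods.might_contain("porkchop"))   # firm no unless we hit a false positive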

Throughput vs Latency:
Latency determines the delay that a user experiences when they send or receive data over the network.
Throughput determines how much data, and therefore how many users, the network can handle at the same time.

How to measure
You can measure network latency by measuring ping time: you transmit a small data packet
and receive confirmation that it arrived.

Most operating systems support a ping command that does this from your device. The round-trip time (RTT)
is displayed in milliseconds and gives you an idea of how long it takes your network to transfer data.

You can measure throughput either with network testing tools or manually. To test throughput manually,
you would send a file and divide the file size by the time it takes to arrive. However, both latency and
bandwidth impact throughput. Because of this, many people use network testing tools, as the tools report
throughput alongside other factors like bandwidth and latency.
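
A rough Python sketch of the manual approach, assuming a hypothetical test-file URL; a real measurement would use a large file and average several runs.

import time
import urllib.request

url = "https://example.com/testfile.bin"    # placeholder test file, not a real endpoint

start = time.monotonic()
with urllib.request.urlopen(url) as resp:
    data = resp.read()
elapsed = time.monotonic() - start

# Throughput = amount of data transferred divided by the time it took.
throughput_mbps = (len(data) * 8) / (elapsed * 1_000_000)
print(f"{len(data)} bytes in {elapsed:.2f}s -> {throughput_mbps:.1f} Mbit/s")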

Impacting factors: latency vs throughput


For Latency:

One of the most important factors is the location where data originates and its intended destination.
If your servers are in a different geographical region from your device, the data has to travel further,
which increases latency. This factor is called propagation.


For Throughput:
Synchronous vs Asynchronous communication:

https://blog.algomaster.io/p/aec1cebf-6060-45a7-8e00-47364ca70761
REST VS RPC : https://blog.algomaster.io/p/106604fb-b746-41de-88fb-60e932b2ff68

REST treats server data as resources that can be created, read, updated, or deleted (CRUD
operations) using standard HTTP methods (GET, POST, PUT, DELETE).

Data and Resources: Emphasizes resources, identified by URLs, with their state
transferred over HTTP in a textual representation like JSON or XML.

Example: A RESTful web service for a blog might provide a URL
like http://example.com/articles for accessing articles. A GET request to that URL would
retrieve articles, and a POST request would create a new article.
Advantages of REST
 Scalability: Stateless interactions improve scalability and visibility.
 Performance: Can leverage HTTP caching infrastructure.
 Simplicity and Flexibility: Uses standard HTTP methods, making it easy to understand
and implement.

RPC: Remote Procedure Calls --> It is designed to make a network call look just like
a local function call

RPC, or Remote Procedure Call, is a protocol that allows a program to
execute a procedure (subroutine) in another address space (commonly on
another physical machine) as if it were local.
RPC abstracts the complexity of the communication process, allowing
developers to focus on the logic of the procedure.

Example: A client invoking a method getArticle(articleId) on a remote
server. The server executes the method and returns the article's
details to the client.

gRPC --> we write the stub interface definition in protobuf format.

RPC can also use WebSockets.
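
A short Python sketch of the stylistic difference, reusing the blog example above. The endpoint URL, the ArticleServiceStub class, and its channel are hypothetical illustrations, not a real REST or gRPC API; gRPC would generate the stub from a .proto definition.

import json
import urllib.request

# REST style: the client manipulates resources identified by URLs.
def rest_get_article(article_id):
    url = f"http://example.com/articles/{article_id}"     # placeholder resource URL
    with urllib.request.urlopen(url) as resp:             # GET = read the resource
        return json.loads(resp.read())

# RPC style: the client calls what looks like a local function; a generated
# stub hides the network hop behind a channel object.
class ArticleServiceStub:
    def __init__(self, channel):
        self._channel = channel                            # hypothetical transport

    def getArticle(self, article_id):
        # The stub serializes the call, sends it over the channel, and
        # deserializes the response; the caller never sees the network.
        return self._channel.invoke("ArticleService.getArticle", article_id)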

Why RPC:
REST remains a solid choice for public APIs and web services due to its
scalability, flexibility, and widespread adoption.

RPC, especially modern implementations like gRPC, can be more efficient
for internal communication between microservices or for complex
operations.
Checksums:

Checksums are calculated by performing a mathematical operation on the
data, such as adding up all the bytes or running it through
a cryptographic hash function.
https://blog.algomaster.io/p/what-are-checksums
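
A minimal Python sketch of the idea: compute a digest when the data is stored or sent, recompute it on receipt, and compare the two. SHA-256 is just one possible choice of hash.

import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

payload = b"some file or message contents"
sent_digest = checksum(payload)

# ... payload travels over the network or sits on disk ...

received = payload                        # pretend this is what arrived
if checksum(received) == sent_digest:
    print("data intact")
else:
    print("corruption detected, re-request or re-read the data")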

Read Through vs Write Through cache: https://blog.algomaster.io/p/59cae60d-9717-4e20-a59e-759e370db4e5

Yes, a write-through cache provides strong consistency because every write operation
is immediately propagated to both the cache and the underlying data store (e.g., a database).
Cache-aside (Lazy Loading)
- The application queries the cache first.
- If the required data is not in the cache (cache miss), it fetches the data from the DB.
- The fetched data is stored in the cache for future requests and returned to the application.

- Key points:
- Data is loaded into the cache only when it is requested (on-demand).
- The cache is populated gradually, based on what the application needs.
- The cache doesn’t automatically know when data in the DB is updated, leading to potential stale
data.

- Best for:
- Systems where reads are infrequent but maintaining up-to-date data is crucial.
- Applications with unpredictable data access patterns.

- Drawback:
- Cache miss penalty on the first read, leading to slower initial response.
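
A minimal cache-aside sketch, assuming a local Redis (via redis-py) and a hypothetical load_user_from_db helper; key names and the TTL are illustrative.

import json
import redis

cache = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 300

def load_user_from_db(user_id):
    # Placeholder for a real database query.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:                              # cache hit
        return json.loads(cached)
    user = load_user_from_db(user_id)                   # cache miss: go to the DB
    cache.set(key, json.dumps(user), ex=TTL_SECONDS)    # populate for future reads
    return user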
Cache-through :
- The cache is tightly integrated with the DB.
- All reads and writes are routed through the cache.
- If a cache miss occurs, the cache fetches data from the DB, updates itself, and serves the data.

- Key points:
- The cache acts as a proxy for the DB.
- Changes to the DB are automatically synchronized with the cache.
- Ensures that data consistency between the cache and DB is maintained.

- Best for:
- Applications requiring a consistent caching layer that is closely tied to the DB.
- Systems with frequent reads and writes where maintaining the cache manually would be complex.
- Drawback:
- Increased complexity due to tight coupling between the cache and DB.
- If the cache layer is unavailable, the system may fail to handle requests efficiently.

Refresh-ahead :
- Cache entries are refreshed predictively before they expire.
- Uses patterns or demand forecasting to update data in advance.

- Key points:
- Ensures that frequently accessed data is always fresh in the cache.
- Minimizes cache misses by updating data before it’s requested.
- Works well with time-sensitive or regularly accessed data.

- Best for:
- Systems where cache misses are expensive (e.g., large-scale distributed systems).
- Apps with predictable data access patterns (e.g., stock prices, live scores).

- Drawback:
- Can lead to wasted resources by preloading data that may not be accessed.

Write-through (Synchronous) :
- Data is written to both the cache and the database simultaneously during a write operation.
- Ensures that the cache always contains the most up-to-date data.

- Key points:
- Writes are atomic, ensuring data consistency between the cache and the database.
- Slower write performance due to simultaneous writes to both layers.
- Useful when the cache needs to store the exact same data as the database.

- Best for:
- Systems where data consistency is critical, and the cache must always reflect the latest state.
- Applications with moderate write operations.

- Drawback:
- Increased latency in write operations as both the cache and database must be updated.

Write-behind (Asynchronous) :
- Data is first written to the cache.
- The cache asynchronously writes the data to the database later.

- Key points:
- Faster write performance as the database write is deferred.
- Reduces load on the database for write-heavy systems.
- Risk of data loss if the cache fails before syncing with the database.

- Best for:
- Write-heavy systems where immediate database consistency isn’t required.
- Scenarios where optimizing write latency is more important than immediate durability.

- Drawback:
- Increased complexity due to managing asynchronous writes.
- Possibility of stale or missing data if the cache fails before persisting.
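
A small sketch contrasting the two write paths described above. The cache_set/db_write functions are stand-ins for real cache and database calls, and the in-process queue stands in for whatever buffer a write-behind cache would actually use.

import queue
import threading

write_queue = queue.Queue()

def cache_set(key, value):
    print(f"cache <- {key}={value}")     # stand-in for a real cache write

def db_write(key, value):
    print(f"db    <- {key}={value}")     # stand-in for a real database write

# Write-through: cache and database are updated in the same request, so the
# cache always matches the store, at the cost of a slower write.
def write_through(key, value):
    cache_set(key, value)
    db_write(key, value)                 # caller waits for both writes

# Write-behind: acknowledge after the cache write; a background worker drains
# the queue into the database later (data is at risk until it is flushed).
def write_behind(key, value):
    cache_set(key, value)
    write_queue.put((key, value))        # caller returns immediately

def flush_worker():
    while True:
        key, value = write_queue.get()
        db_write(key, value)
        write_queue.task_done()

threading.Thread(target=flush_worker, daemon=True).start()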
Write-around:
- Data is written directly to the database, bypassing the cache during write operations.
- The cache gets updated only when the data is read again.

- Key points:
- Reduces cache churn by avoiding updates for data that isn’t frequently accessed.
- The cache can become inconsistent until the next read operation triggers an update.

- Best for:
- Systems with infrequent reads where caching write operations adds unnecessary overhead.
- Use cases where stale data is acceptable in the short term.

- Drawback:
- Cache inconsistency, as it won’t always reflect the latest writes until a subsequent read occurs.

Summary
- Read Strategies: Cache-aside, Cache-through, and Refresh-ahead focus on how and when data is
loaded into the cache.

- Write Strategies: Write-through, Write-behind, and Write-around deal with how changes in data
propagate between the cache and database.

Each caching strategy comes with trade-offs. Selecting the right one depends on your system’s
requirements for read/write performance, data consistency, and resource efficiency.

HLD Questions:

URL shortener

https://www.youtube.com/watch?v=iUU4O1sWtJA&t=65s

Anything under or equal to 200ms is perceived as real time

POST --> creating a new resource

PUT/PATCH --> when updating a resource

the endpoint /urls is plural because it represents a collection of URLs being managed
by the system.

Your goal is to simply go one-by-one through the core requirements and define the APIs that are
necessary to satisfy them. Usually, these map 1:1 to the functional requirements, but there are
times when multiple endpoints are needed to satisfy an individual functional requirement.

HLD design: we go through each of the APIs, and starting with an API, we draw out
the system that's necessary to satisfy that API.

https://www.youtube.com/watch?v=JQDHz72OA3c
https://www.hellointerview.com/learn/system-design/problem-breakdowns/bitly

Availability > Consistency for this system:

 In a URL shortening service, it ensures that when you use a short URL, it always maps to the
correct long URL, no matter which server or data center processes the request.

One way to guarantee we don't have collisions is to simply increment a counter for
each new URL. We can then take the output of the counter and encode it using base62
encoding to get a compact representation.

Redis is particularly well-suited for managing this counter because it's single-
threaded and supports atomic operations. Being single-threaded means Redis processes
one command at a time, eliminating race conditions. Its INCR command is atomic,
meaning the increment operation is guaranteed to execute completely without
interference from other operations. This is crucial for our counter: we need absolute
certainty that each URL gets a unique number, with no duplicates or gaps.

With proper counter management, the system can scale horizontally to handle
massive numbers of URLs.
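
A small Python sketch of this scheme, assuming a shared Redis reachable by every write-service instance; the counter key name is illustrative.

import redis

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n):
    if n == 0:
        return ALPHABET[0]
    out = []
    while n > 0:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))        # e.g. 125 -> "21"

r = redis.Redis(host="localhost", port=6379)

def next_short_code():
    counter = r.incr("url:counter")      # atomic: no duplicates across instances
    return base62_encode(counter)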

Horizontally scaling our write service introduces a significant issue! For our short code
generation to remain globally unique, we need a single source of truth for the counter. This
counter needs to be accessible to all instances of the Write Service so that they can all agree
on the next value.

We could solve this by using a centralized Redis instance to store the counter. This Redis
instance can be used to store the counter and any other metadata that needs to be shared
across all instances of the Write Service.

But should we be concerned about the overhead of an additional network request for each new
write request? Because every new request should go through this centralized redis server.

The reality is, this is probably not a big deal. Network requests are fast! In practice, the overhead
of an additional network request is negligible compared to the time it takes to perform other
operations in the system. That said, we could always use a technique called "counter batching"
to reduce the number of network requests.
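
A sketch of counter batching under the same assumptions: each write-service instance reserves a block of values with a single INCRBY round trip and hands them out locally. A crashed instance forfeits the unused part of its block, so batching trades occasional gaps for fewer network calls.

import redis

r = redis.Redis(host="localhost", port=6379)
BATCH_SIZE = 1000

class BatchedCounter:
    def __init__(self):
        self._next = 0
        self._ceiling = 0                    # exclusive upper bound of the reserved block

    def next_id(self):
        if self._next >= self._ceiling:
            # One network round trip reserves the next BATCH_SIZE values.
            self._ceiling = r.incrby("url:counter", BATCH_SIZE)
            self._next = self._ceiling - BATCH_SIZE
        value = self._next
        self._next += 1
        return value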

Rate Limiter: https://blog.algomaster.io/p/rate-limiting-algorithms-explained-with-code

From Alex Xu's book:

An API gateway is a fully managed service that supports rate limiting,
SSL termination, authentication, IP whitelisting, serving static
content, etc.

SSL Termination refers to the process of decrypting SSL/TLS-encrypted
traffic at the API gateway or a load balancer before passing the
unencrypted (plain HTTP) traffic to the backend servers.

· IP Whitelisting ensures only trusted clients can communicate with the API.

IP whitelisting is a security measure that allows access to a resource
(such as an API) only from specific, pre-approved IP addresses.
Requests from any other IP addresses are denied access.

Both Amazon and Stripe use the token bucket algorithm

Token Bucket:

How many buckets do we need? This varies, and it depends on the
rate-limiting rules. Here are a few examples.
• It is usually necessary to have different buckets for different API
endpoints. For instance, if a user is allowed to make 1 post per
second, add 150 friends per day, and like 5 posts per second, 3
buckets are required for each user.
• If we need to throttle requests based on IP addresses, each IP
address requires a bucket.
• If the system allows a maximum of 10,000 requests per second, it
makes sense to have a global bucket shared by all requests.
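
A minimal in-memory token bucket sketch with per-key buckets; the capacity and refill rate are illustrative, and a production limiter would keep this state in a shared store such as Redis so all API servers see the same buckets.

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity               # max tokens the bucket can hold
        self.refill_rate = refill_rate         # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill based on elapsed time, capped at the bucket's capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1                   # consume one token for this request
            return True
        return False                           # out of tokens: reject with HTTP 429

buckets = {}                                   # e.g. one bucket per user, IP, or endpoint

def is_allowed(key):
    bucket = buckets.setdefault(key, TokenBucket(capacity=5, refill_rate=1.0))
    return bucket.allow()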

Leaky Bucket:

Shopify, an ecommerce company, uses leaky buckets for rate limiting.

Con: • A burst of traffic fills up the queue with old requests, and if
they are not processed in time, recent requests will be rate limited.

Sliding window log algorithm

• The algorithm keeps track of request timestamps. Timestamp data is usually
kept in a cache, such as sorted sets in Redis.
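
A sketch of the sliding window log using a Redis sorted set, where the score is the request timestamp. Window size, limit, and key naming are illustrative; a production version would wrap the steps in a Lua script or pipeline to avoid races.

import time
import redis

r = redis.Redis(host="localhost", port=6379)
WINDOW_SECONDS = 60
MAX_REQUESTS = 100

def is_allowed(client_id):
    key = f"ratelimit:{client_id}"
    now = time.time()
    # Drop timestamps that have slid out of the window.
    r.zremrangebyscore(key, 0, now - WINDOW_SECONDS)
    if r.zcard(key) >= MAX_REQUESTS:
        return False                          # limit reached: reject (HTTP 429)
    r.zadd(key, {str(now): now})              # log this request's timestamp
    r.expire(key, WINDOW_SECONDS)             # let idle keys age out
    return True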

Sliding Window counter


• It only works for not-so-strict look back window. It is an
approximation of the actual rate because it assumes requests in the
previous window are evenly distributed. However, this problem
may not be as bad as it seems. According to experiments done by
Cloudflare, only 0.003% of requests are wrongly allowed or rate
limited among 400 million requests.

The client sends a request to the rate limiting middleware.

• The rate limiting middleware fetches the counter from the corresponding bucket in
Redis and checks whether the limit is reached.

• If the limit is reached, the request is rejected. If the limit is not reached, the
request is sent to the API servers. Meanwhile, the system increments the counter and
saves it back to Redis.
Exceeding the rate limit:

In case a request is rate limited, APIs return HTTP response code 429 (Too Many
Requests) to the client. Depending on the use case, we may enqueue the rate-limited
requests to be processed later. For example, if some orders are rate limited due to
system overload, we may keep those orders to be processed later.

Rate limiter in a distributed environment:

Building a rate limiter that works in a single-server environment is not difficult.
However, scaling the system to support multiple servers and concurrent threads is a
different story. There are two challenges: • Race condition • Synchronization issue
Locks are the most obvious solution for solving race condition.
However, locks will significantly slow down the system.

Two strategies are commonly used to solve the problem: Lua scripts and the sorted sets
data structure in Redis.

Synchronization issue: Synchronization is another important factor to consider in a
distributed environment. To support millions of users, one rate limiter server might
not be enough to handle the traffic. When multiple rate limiter servers are used,
synchronization is required.

Performance optimization: Performance optimization is a common topic in system design
interviews. We will cover two areas to improve. First, a multi-data-center setup is
crucial for a rate limiter because latency is high for users located far away from the
data center. Most cloud service providers build many edge server locations around the
world. Traffic is automatically routed to the closest edge server to reduce latency.

Second, synchronize data with an eventual consistency model.

As with any system design interview question, there are additional talking points you
can mention if time allows:

• Hard vs soft rate limiting:

Hard: The number of requests cannot exceed the threshold.

Soft: Requests can exceed the threshold for a short period.

• Rate limiting at different levels: In this chapter, we only talked about rate
limiting at the application level (HTTP: layer 7). It is possible to apply rate
limiting at other layers. For example, you can apply rate limiting by IP address
using iptables (IP: layer 3).

Ticket Master:

https://www.hellointerview.com/learn/system-design/problem-breakdowns/
ticketmaster

https://www.youtube.com/watch?v=fhdPyoO6aXI
Based on the above functional requirements we should do API design

From website

When it comes to purchasing/booking a ticket, we have a POST endpoint that takes the
list of tickets and payment details and returns a bookingId. Later in the design,
we'll evolve this into two separate endpoints - one for reserving a ticket and one for
confirming a purchase, but this is a good starting point.

--> we are using PUT here because we are not creating a new entry

Sweet, we now have the core functionality in place to view an event! But how are users supposed
to find events in the first place? When users first open your site, they expect to be able to search
for upcoming events. This search will be parameterized based on any combination of keywords,
artists/teams, location, date, or event type.
--> The highlighted query is really slow.

The main thing we are trying to avoid is two (or more) users paying for the same ticket. That would
make for an awkward situation at the event! To handle this consistency issue, we need to select a
database that supports transactions.

While anything from MySQL to DynamoDB would be fine choices (just needs ACID properties),
we'll opt for PostgreSQL. Additionally, we need to implement proper isolation levels and either row-
level locking or Optimistic Concurrency Control (OCC) to fully prevent double bookings.

Stripe handles payment asynchronously. It needs to call out to the credit/debit card
company to determine whether the payment can happen, which happens very quickly, and
it calls back to our system not by responding to a single request but via a webhook:
you register a callback URL, and we expose endpoints in our booking service for Stripe
to call back to, so we use two arrows instead of one.

The other issue with this approach is the payment-page timer. When a user clicks on a
ticket and goes to the payment page, the page has a timer of, say, 10 minutes. What
happens if the 10 minutes are exceeded, or the user simply closes their laptop? The
ticket status would stay reserved forever. When we show users the seat map, we query
the database for tickets that are available, which excludes reserved ones, so that
seat would be reserved for that user indefinitely. This is wrong and doesn't meet our
requirement that a reservation expire after 10 minutes.

So how do we handle it ?

1) We can add an additional column to the Ticket table with the timestamp of when the
ticket was reserved. We can then query the database for all available tickets plus any
reserved tickets whose timestamp exceeds the 10-minute limit.
2) We can introduce a cron job that runs every 10 minutes or so and is responsible for
querying the database for every ticket in a reserved status and checking its reserved
timestamp; if it's more than 10 minutes old, the job sets the status back to available.
The downside: if a ticket has been reserved for only 9 minutes when the cron job runs,
its status won't be changed on that run, so it will only be released about 19 minutes
after it was reserved.

3) Due to the limitation in 2), we need something closer to real time: we can use a
distributed lock.

We can use Redis or any in-memory cache: when a ticket gets reserved, instead of
updating our ticket table at all, we just keep track of the ticket in Redis with a
TTL. We can store a key-value pair with the ticket ID as the key and a boolean as the
value (the value doesn't matter here) and a TTL. After the TTL expires, the entry is
deleted automatically.

When a user tries to reserve a ticket, we are not going to write to the database at
all; instead, we simply put that ticket in our lock and lock it for 10 minutes by
setting the ticket ID with a TTL of 10 minutes.
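
A minimal sketch of that Redis hold, assuming redis-py; SET with NX and EX atomically claims the ticket only if nobody else holds it, and Redis expires the key after 10 minutes. Key naming is illustrative.

import redis

r = redis.Redis(host="localhost", port=6379)
RESERVATION_TTL = 600                        # 10 minutes, in seconds

def try_reserve(ticket_id, user_id):
    # nx=True -> only set if the key does not already exist (not yet reserved)
    # ex=...  -> Redis deletes the key automatically when the hold expires
    return bool(r.set(f"ticket:{ticket_id}:hold", user_id,
                      nx=True, ex=RESERVATION_TTL))

def is_reserved(ticket_id):
    return r.exists(f"ticket:{ticket_id}:hold") == 1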

If, through our event CRUD service, we want to view all available tickets for an
event, we first query our database for all tickets with a status of available, and
then for each of those ticket IDs we look them up in Redis to see if they are
reserved. If they are reserved, we remove them from the list of available tickets and
send the list back to our client.

We are using this Redis cache for locking instead of keeping the locks in memory in
the booking service because there are going to be multiple instances of the booking
service, and all of them need the same consistent, singular view of a lock; hence we
keep it separate.

If the lock goes down, then in theory, for a 10-minute window, users will be hitting
the database directly, and because of the ACID properties whoever submits the purchase
first wins and everyone else gets an error. This is a bad user experience for that
10-minute window, i.e. users can get to the booking page, type in their payment
details, and then find out that the ticket they wanted is no longer available.

Deep dives: usually Non Functional requirements

Low latency search: Elastic search

However, Elasticsearch cannot be used as our primary database due to durability
limitations and lack of support for complex transaction management.

We can use Elasticsearch in the following ways:

1) By directly connecting Elasticsearch to the Event CRUD service: the Event CRUD
service writes to both the Postgres DB and Elasticsearch --> this becomes complex
logic in our application code because you need to handle the case where the write to
the DB fails (we don't want that write to reach Elasticsearch), plus several more
cases to consider.
2) CDC (Change Data Capture): changes to the primary data store are put onto a stream,
and those changes can be consumed.

There is a limit on the number of writes Elasticsearch can absorb, so when we need to
do a lot of updates to Elasticsearch we should use something like a queue to buffer
the changes. In this problem, reads are going to be much larger than writes, so a
stream would suffice here.

In this system we are not doing any ranking or recommendation for users, so if two
users search for the same thing they get the same result. To speed up search queries
we can use caching.
If we choose AWS OpenSearch as our Elasticsearch offering, it provides caching on each
instance of our search cluster. Or we can use Redis for caching: we can take the
search term (or a normalized version of it) as the key and the search results as the
value.

We can use a CDN

A CDN can cache API calls, so if the search-term string is short enough we get very
few cache misses; however, if the search-term string is long, precision increases and
we see more cache misses.

We need to make the seat map real time for a good user experience. For this we can
use long polling, WebSockets, or SSE.

Long polling: the client sends an HTTP request and that request is kept open, usually
for 30 seconds to a minute or so, for the server to respond, and we can keep it in a
while loop. It requires no additional infrastructure and works especially well if
users are not on this page for a long time.

If users sit on this page for longer, we need a more sophisticated approach. That
approach could open a persistent connection like WebSockets, but here we can use SSE:
WebSockets are bidirectional, while SSE is unidirectional from the server side.

Every time there is a change to either the database or the ticket lock, we push the
change to the client.

For events like a Taylor Swift concert, the user experience would be: users open the
seat map, see all the available seats, and within a couple of milliseconds everything
goes black because everything gets booked; we have something like 10 million users
fighting for 10,000 seats. To fix that issue we need to introduce a choke point to
protect our backend services and improve the user experience. We can use a virtual
waiting queue for really popular events, which says: thank you for your interest, you
are in the queue and we'll let you know when you are out.

This queue could be Redis; that would probably be a cheap, lightweight implementation.
You can use a Redis sorted set, which gives you a priority queue based on the time
users arrived; other implementations make this random so it's a bit fairer and it's
not just the users closest to our company's servers that get in first. Then we have
some event-driven logic: we let in, say, 100 people, and once we have 100 seats booked
we let in the next 100 (or whatever the number may be), pulling those people off the
queue and notifying them through SSE (see the sketch below).
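
A sketch of that sorted-set waiting queue using redis-py; the key name, event ID, and batch size are illustrative.

import time
import redis

r = redis.Redis(host="localhost", port=6379)
QUEUE_KEY = "event:123:waiting"              # hypothetical event id

def join_queue(user_id):
    # Score = arrival timestamp, so the earliest arrivals are released first.
    r.zadd(QUEUE_KEY, {user_id: time.time()})

def admit_next(batch_size=100):
    # Pop the next batch off the front of the queue; the caller would then
    # notify these users (e.g. over SSE) that they can enter the booking flow.
    entries = r.zpopmin(QUEUE_KEY, batch_size)
    return [member.decode() for member, _score in entries]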

We can add Redis for the Event CRUD service and cache the events, venues, and
performers in Redis. We do not cache tickets because they are dynamic.
For scaling:

The API Gateway takes care of load balancing, and each service can have its own load
balancer.

Database sharding.

From Website:

We need to ensure that the ticket is locked for the user while they are checking out. We also need
to ensure that if the user abandons the checkout process, the ticket is released for other users to
purchase. Finally, we need to ensure that if the user completes the checkout process, the ticket is
marked as sold and the booking is confirmed.

Bad Solution: Pessimistic Locking

--> doubt
Design a Key Value Store:

The value in a key-value pair can be strings, lists, objects, etc. The
value is usually treated as an opaque object in key-value stores,
such as Amazon dynamo, Memcached, Redis.

Yes, a hashtable can be created using code in various programming languages, and it
typically resides in memory while the program is running.
Since network failure is unavoidable, a distributed system must tolerate network
partitions. Thus, a CA system cannot exist in real-world applications.
Partition tolerance is about handling network failures, while availability is about
ensuring the system can serve requests.

 Partition Tolerance Focus: Ensures the system doesn’t crash entirely during network issues.
 Availability Focus: Ensures users receive responses even when parts of the system are down
or unreachable.

No, partition tolerance means the system tolerates network failures and continues
operating within isolated partitions. Availability ensures the system responds to
requests, but a partition-tolerant system may prioritize consistency over availability
(CP systems).

Quorum Consensus:
A coordinator acts as a proxy between the client and the nodes.
Strong consistency is usually achieved by forcing a replica not to accept new
reads/writes until every replica has agreed on the current write. This approach is
not ideal for highly available systems because it could block new operations.
Dynamo and Cassandra adopt eventual consistency, which is our recommended
consistency model for our key-value store. With concurrent writes, eventual
consistency allows inconsistent values to enter the system and forces the client
to read the values and reconcile them.
Reconciliation can happen in two ways:
The system uses reconciliation (client-side or server-side) to detect
and resolve the conflict between valueC and valueD, ensuring the
replicas converge to a consistent state.

Replication gives high availability but causes inconsistencies among replicas.
Versioning and vector clocks are used to solve inconsistency problems. Versioning
means treating each data modification as a new immutable version of data.
Failure detection: In a distributed system, it is insufficient to believe that a
server is down because another server says so. Usually, it requires at least two
independent sources of information to mark a server down.

All-to-all multicasting is a straightforward solution. However, this is inefficient
when many servers are in the system.

Gossip Protocol:
• Node s0 maintains a node membership list shown on the left side.

• Node s0 notices that node s2's (member ID = 2) heartbeat counter has not increased
for a long time.

• Node s0 sends heartbeats that include s2's info to a set of random nodes. Once other
nodes confirm that s2's heartbeat counter has not been updated for a long time, node
s2 is marked down, and this information is propagated to other nodes.

After failures have been detected through the gossip protocol, the
system needs to deploy certain mechanisms to ensure availability.
In the strict quorum approach, read and write operations could be
blocked as illustrated in the quorum consensus section. A
technique called “sloppy quorum” is used to improve availability.
Instead of enforcing the quorum requirement, the system chooses
the first W healthy servers for writes and first R healthy servers for
reads on the hash ring. Offline servers are ignored.
Handling permanent failures: Hinted handoff is used to handle temporary failures.
What if a replica is permanently unavailable? To handle such a situation, we
implement an anti-entropy protocol to keep replicas in sync. Anti-entropy involves
comparing each piece of data on replicas and updating each replica to the newest
version. A Merkle tree is used for inconsistency detection and minimizing the amount
of data transferred.

To compare two Merkle trees, start by comparing the root hashes. If the root hashes
match, both servers have the same data. If the root hashes disagree, then the left
child hashes are compared, followed by the right child hashes. You can traverse the
tree to find which buckets are not synchronized, and synchronize only those buckets.
Using Merkle trees, the amount of data that needs to be synchronized is proportional
to the differences between the two replicas, not the amount of data they contain.

A hash tree or Merkle tree is a tree in which every non-leaf node is labeled with the
hash of the labels or values (in the case of leaves) of its child nodes. Hash trees
allow efficient and secure verification of the contents of large data structures.
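
A small Python sketch of the idea: hash fixed key-range buckets into leaves, build parent hashes up to a root, and compare two trees to find the buckets that differ. Bucketing, SHA-256, and the leaf-level comparison shortcut here are illustrative simplifications.

import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def bucket_hashes(buckets):
    # Leaves: one hash per bucket of key/value pairs (sorted for determinism).
    return [h(repr(sorted(b.items())).encode()) for b in buckets]

def build_levels(leaves):
    # Pair up hashes level by level until a single root hash remains.
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev, nxt = levels[-1], []
        for i in range(0, len(prev), 2):
            pair = prev[i] + (prev[i + 1] if i + 1 < len(prev) else prev[i])
            nxt.append(h(pair.encode()))
        levels.append(nxt)
    return levels                        # levels[-1][0] is the root hash

def diff_buckets(replica_a, replica_b):
    la, lb = bucket_hashes(replica_a), bucket_hashes(replica_b)
    if build_levels(la)[-1][0] == build_levels(lb)[-1][0]:
        return []                        # roots match: replicas are in sync
    # A full implementation walks down from the root and only descends into
    # mismatching subtrees; comparing leaves directly keeps the sketch short.
    return [i for i, (x, y) in enumerate(zip(la, lb)) if x != y]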
