
SYSTEM DESIGN NOTES

 Latency:
 How long it takes for data to get from one point in a system to another.
 When we talk about latency, we might be referring to many different parts of a system:
o Network requests
o Memory
o Disk: HDD or SSD

 An important fact: different parts of a system have different latencies.
o Reading data from memory is faster than reading data from disk.
o Network call latency increases as distance increases.

 When designing systems, you typically want to optimize them by lowering the overall latency.
 Some systems really care about low latency:
o Video games
o Video conferencing
 Some systems care less about low latency and more about other guarantees:
o Websites that prioritize accurate information
o Systems that must never go down
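As a rough illustration of the memory-vs-disk gap, here is a small Python sketch that times reading the same ~10 MB payload from memory and from a temporary file (the file name and payload size are arbitrary; the disk read may be served by the OS page cache, so treat the numbers as illustrative only):

```python
import os
import tempfile
import time

# Roughly compare memory access vs. disk access for the same payload.
payload = b"x" * 10_000_000  # ~10 MB

path = os.path.join(tempfile.gettempdir(), "latency_demo.bin")
with open(path, "wb") as f:
    f.write(payload)

start = time.perf_counter()
copy = payload[:]  # read the data out of memory
memory_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
with open(path, "rb") as f:
    from_disk = f.read()  # read the same data back from disk
disk_ms = (time.perf_counter() - start) * 1000

os.remove(path)
print(f"memory: {memory_ms:.3f} ms, disk: {disk_ms:.3f} ms")
```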

 Throughput:
 How much work a machine can perform in a given period of time.
 What we are actually referring to here is how much data can be transferred from one point of a system to another in a given amount of time.
 We typically measure throughput in bits per second: kilobits, megabits, or gigabits per second.
 For example, a network link of 1 Gbps.
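A quick back-of-the-envelope sketch in Python, using the 1 Gbps example above (idealized: it ignores protocol overhead and congestion):

```python
# How long does it take to move a 5 GB file over a 1 Gbps link?
file_size_gb = 5
file_size_bits = file_size_gb * 8 * 10**9  # gigabytes -> bits
link_bits_per_second = 1 * 10**9           # 1 Gbps

transfer_seconds = file_size_bits / link_bits_per_second
print(transfer_seconds)  # 40.0
```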

 Availability:
 How resistant a system is to failures.
 Does your system go down completely, or does it remain operational?
 Think of availability as the percentage of time, over a given period like a month or a year, that your service is operational enough that all of its primary functions are satisfied.

 There are varying degrees of availability that you might expect from different systems:
o YouTube
o Airplane systems
o Cloud providers
 We typically measure availability as the percentage of a system's uptime in a given year.
 If a system is up and operational for half of an entire year, then the system has 50% availability.
 In practice, you could imagine that 50% availability would be really bad for most services.
 Percentages can be pretty deceptive, because even an availability of 90% isn't really great.
 That allows an outage of roughly 36 days out of the year.
Nines:
 We measure availability not exactly in percentages but rather in what we call nines.
 Nines are effectively percentages, specifically percentages made up of the digit nine.
 If you have a system that has 99% availability, then in the industry we say that your system has two nines of availability.
 If it has 99.99%, then we say it has four nines of availability.
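The downtime each level of nines allows can be computed directly; a small Python sketch (the helper name is our own):

```python
def downtime_hours_per_year(availability_percent):
    """Maximum allowed downtime per year for a given availability."""
    hours_per_year = 365 * 24  # 8760
    return hours_per_year * (1 - availability_percent / 100)

for label, pct in [("two nines", 99.0), ("three nines", 99.9),
                   ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {downtime_hours_per_year(pct):.2f} hours/year")
```

Two nines allows about 87.6 hours of downtime per year; four nines allows under an hour.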

 Load Balancer:
 A load balancer is a server that sits in between your clients and your servers and, as its name suggests, has the job of balancing workloads across resources.

 Load balancing can happen at a lot of different places in your system:

o A load balancer in between your clients and your servers.
o A load balancer in between your servers and your databases.
o Or, when you're dealing with a website, you might even have load balancing at the DNS layer, where IP resolution happens.

 People who are in charge of the system can configure the load balancer and the servers to know about each other.
 When you add a new server, it registers itself with the load balancer; when you remove an old server, it deregisters itself.
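A minimal sketch of the idea in Python, assuming a simple round-robin selection strategy (the class and method names are made up for illustration):

```python
class RoundRobinBalancer:
    """Toy load balancer: hands out servers in round-robin order.
    Servers can register and deregister at runtime."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._i = 0

    def register(self, server):
        self.servers.append(server)

    def deregister(self, server):
        self.servers.remove(server)

    def next_server(self):
        # Cycle through whatever servers are currently registered.
        server = self.servers[self._i % len(self.servers)]
        self._i += 1
        return server

lb = RoundRobinBalancer(["A", "B", "C"])
print([lb.next_server() for _ in range(5)])  # ['A', 'B', 'C', 'A', 'B']
```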

 Server Selection Strategy:

 Proxies:
 Forward Proxy (generally referred to as just a proxy):
 A forward proxy is a server that sits in between a client (or a set of clients) and the servers they talk to.
 More specifically, a forward proxy is a server that acts on behalf of the client or clients.

 Reverse Proxy:
 Reverse proxies act on behalf of a server in an interaction between a client and a server.
 A popular example is Nginx.

 A reverse proxy can:
 Filter out requests that you want to ignore
 Handle logging
 Collect metrics
 Cache data (putting data in memory)
 Perform load balancing

 Caching:
 Caching is the process of storing copies of files in a cache, or temporary storage location, so that they can be accessed more quickly.
 The data in a cache is generally stored in fast-access hardware such as RAM (random-access memory).
 A cache's primary purpose is to increase data retrieval performance by reducing the need to access the underlying, slower storage layer.
 Caching is used to reduce, and thereby improve, the latency of a system.
 You can use caching in a bunch of different places in a system.
 Browser Cache (Client Level):
 When a user visits a new website, their browser needs to download data to load and display the content on the page.
 To speed up this process the next time the user visits the site, browsers cache the content on the page and save a copy of it.
 As a result, the next time the user goes to that website, the content is already stored on their device and the page loads faster.

 Server Level:
 Suppose a client interacts with the server 20 times to get a piece of data; the server doesn't always need to go to the database to retrieve it.
 Maybe the server only needs to go to the database once, and we can have some form of cache at the server level (in memory).
 You could also have a cache in between two components in a system. For example, you could have a cache in between a server and a database.

 The first instance where caching is really helpful is when you're doing a lot of network requests and want to avoid repeating them.

 Another instance where caching is very helpful is when you're doing some computationally long operation.
 Assume at the server level you run a very long algorithm, maybe one with poor time complexity.
 Cache that result, because you don't want to perform that very long operation multiple times.

 Let's say we have six additional servers.

 Suppose millions of clients are communicating with the servers to ask for info from the database.
 We don't want to read from the database a thousand or a million times, because that might overload it.
 So you would use caching to avoid reading from the database that many times.
 Each server would have its individual cache, and we would store the Instagram profile in memory at each server.
 That would make your system avoid having to do so many reads at the database level.
 Caching Strategies:

 Stale Cache:

 Caches can become stale if they haven't been updated properly.

 A solution would be to move our cache out of the servers and use a single cache instead, and this could be Redis.
 All of the servers would hit that cache, and we'd have a single source of truth for the caching mechanism.

 Sometimes, for certain parts of our system, or rather certain features, we might not care that much about the staleness of the data in our caches.
 As an example, take the view count on YouTube videos.
 If one user sees a slightly stale version of a view count on a video, that's probably not the end of the world.

 LRU Eviction Policy:

 The Least Recently Used policy.

 It means: get rid of the least recently used pieces of data in a cache, which requires some way of tracking which pieces of data were used least recently.
 Here we make the assumption that the piece of data that was used least recently is likely the one that we no longer care about, or that we least care about.
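A minimal LRU cache sketch in Python using `collections.OrderedDict`, which keeps insertion order and so can track recency (this is illustrative, not a production cache):

```python
from collections import OrderedDict

class LRUCache:
    """Cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now most recently used
cache.put("c", 3)      # over capacity: evicts "b"
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```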

 Hashing:
 Hashing is an action that you can perform to transform an arbitrary piece of data into a fixed-size value, typically an integer value.
 In the context of a systems design interview, that arbitrary piece of data can be an IP address, a username, an HTTP request, anything that can be hashed or transformed into an integer value.
 A few hashing functions and algorithms used in industry:
 MD5
 SHA-256
 Bcrypt

 Simple Hashing Example, or Why Consistent Hashing?
 With hashing, we can hash the requests that come in to the load balancer.
 Then, based on the hash, we can send the requests to servers according to their position.
 Let's walk through an example; for simplicity, we're just going to hash the names of our clients.
 Our goal is to get every client to have all of its requests rerouted to the same server.
 We are going to hash the clients' names themselves, that is C1, C2, C3, C4.
 Assume we've been given a hashing function, and that we pass C1, C2, C3, C4 through it.
 We get the following results: 11 for C1, 12 for C2, 13 for C3, and 14 for C4.
 Remember that a hashing function transforms your arbitrary pieces of data into some fixed-size value, typically an integer value.
 For now we will use the simplest hashing strategy in the context of systems design interviews:
 We mod these hashes by the number of servers that we have.
 11 % 4 = 3, 12 % 4 = 0, 13 % 4 = 1, 14 % 4 = 2
 Now we have the numbers corresponding to the four servers that these four clients should be associated with.
 C1 -> D, C2 -> A, C3 -> B, C4 -> C

 Problem with Simple Hashing

 We could very easily have server A fail on us and completely die.
 Similarly, our system could be experiencing a ton of traffic and we might need to add servers.
 So maybe server A doesn't die, but we need to add a new server, server E, to handle the new incoming traffic.
 We can't continue to mod all of our hashes by four, because we no longer have four servers. We have five.

 If we keep modding our clients' hashes by four, then all of our requests will only ever go to servers A, B, C, and D; they will never go to E.
 If we add a new server, we have to change some logic here: namely, we have to mod our hashes by the new number of servers.
 11 % 5 = 1, 12 % 5 = 2, 13 % 5 = 3, 14 % 5 = 4
 Previous values: 11 % 4 = 3, 12 % 4 = 0, 13 % 4 = 1, 14 % 4 = 2
 When we mod our hashes by five instead of four, we get completely different results for the servers that our clients are rerouted to.

 So our very simple strategy of hashing our clients' or requests' IP addresses, and then modding the hashes by the number of servers to figure out where to reroute things, just doesn't work.
 All of the in-memory caches in your system are no longer nearly as useful.
 Consistent hashing to the rescue.
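A few lines of Python make the damage concrete, reusing the hash values from the walkthrough: going from mod 4 to mod 5 remaps every single client.

```python
# Same hash values as before; compare mod-4 vs. mod-5 placement.
hashes = {"C1": 11, "C2": 12, "C3": 13, "C4": 14}

before = {c: h % 4 for c, h in hashes.items()}  # four servers
after = {c: h % 5 for c, h in hashes.items()}   # server E added

moved = [c for c in hashes if before[c] != after[c]]
print(moved)  # ['C1', 'C2', 'C3', 'C4'] -- every client is remapped
```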

 Consistent Hashing:
 Consistent hashing is a distributed hashing scheme that operates independently of the number of servers or objects.
 We assign each server and object a position on an abstract circle, or hash ring.

 Hash Ring:
 The number of locations is no longer fixed; the ring is considered to have an infinite number of points, and the server nodes can be placed at random locations on this ring.
 Of course, choosing these random positions can again be done using a hash function.
 The step of dividing by the number of available locations is skipped, as that number is no longer finite.
 We map the hash output range onto the edge of a circle.
 That means the minimum possible hash value, zero, corresponds to an angle of zero degrees,
 the maximum possible value (some big integer we'll call INT_MAX) corresponds to 360 degrees, and all other hash values fit linearly somewhere in between.
 The way you place the servers on the circle is by putting them through a hashing function.
 We pass the server names through a hashing function, get a value, and, depending on that value, position them on the circle.
 If the hashing function used is a good one with that uniformity property, the servers will be roughly evenly distributed.
 The exact same thing happens with your clients: they go through a hashing function and are then positioned on the circle.
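A toy hash ring sketch in Python, using MD5 to place nodes and keys on the circle and `bisect` for the clockwise lookup (the class name and details are our own, not from any particular library):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: a key belongs to the first node
    clockwise from its position, wrapping around at the end."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _pos(name):
        # Position on the circle = MD5 of the name as an integer.
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    def add(self, node):
        bisect.insort(self._ring, (self._pos(node), node))

    def remove(self, node):
        self._ring.remove((self._pos(node), node))

    def node_for(self, key):
        pos = self._pos(key)
        i = bisect.bisect(self._ring, (pos, ""))
        return self._ring[i % len(self._ring)][1]  # wrap around the circle

ring = HashRing(["server-A", "server-B", "server-C"])
print(ring.node_for("client-1"))
ring.add("server-D")  # only keys between D and its predecessor move
```

Note how adding `server-D` only remaps the keys that fall between D and its clockwise predecessor, unlike the mod-N scheme above.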

 Replication:
 Replication is the process of storing the same data in multiple locations to improve data availability and accessibility, and to improve system resilience and reliability.
 The idea behind replication is that you have a duplicate version of your main database: a replica of the main database.
 The main database handles all of the reads and writes coming to it, but it also updates the replica so that the replica is effectively the same as the main database.
 The replica can take over if the main database fails.
o When the main database goes down, the replica takes over and becomes the new main database.
 Once the original main database comes back up, it gets updated by the replica, and then eventually they can swap roles.
o This is one possible use case.
 Or, if your main database server is getting overloaded, you can split up your traffic between replicas, hence increased throughput.
 In order for this to work, your replica needs to always be exactly up to date with the main database.
 Whenever someone writes to or updates the main database, that update needs to also happen in the replica.
 If the write operation fails on the replica, there's an issue, and the write operation should not complete on the main database.
 In the scenario where you want your replica to be able to take over for your main database in the event of a failure, you never want the replica to be out of date with the main database.
 This means that your write operations are going to take a little bit longer, because they have to be done both on the main database and on the replica.
 Benefits of data replication:
 Improved reliability and availability: if one system goes down due to faulty hardware, a malware attack, or another problem, the data can be accessed from a different server.
 Improved network performance: having the same data in multiple locations can lower data access latency, since the required data can be retrieved closer to where the transaction is executing.
 Issues:
 What if our system has tons of data, say over a billion users? Do we really want all of that data replicated across a bunch of different databases? Maybe not.
 Keeping copies of the same data in multiple locations leads to higher storage and processing costs.
 Maintaining consistency across data copies requires new procedures and adds traffic to the network, hence increased bandwidth consumption.

 Sharding:
 Sharding is a database architecture in which we separate a table's rows into multiple different tables, known as partitions.
 Each partition has the same schema and columns, but entirely different rows.
 One part of the data is stored in one database server, another part in another database server, and so on.
 The main database is split into a bunch of smaller databases, which are called shards or data partitions.
 Key-Based Sharding:
 Key-based sharding is also known as hash-based sharding.
 It involves taking a value from newly written data, such as a customer's ID number, a client application's IP address, or a ZIP code,
 and plugging it into a hash function to determine which shard the data should go to.
 To ensure that entries are placed in the correct shards and in a consistent manner, the values are always passed through the same hash function.
 The main appeal of this strategy is that it can be used to evenly distribute data so as to prevent hotspots.
 Hotspots:
 The database hotspot problem arises when one shard is accessed much more than all the other shards.
 In that case, any benefits of sharding the database are cancelled out by the slowdowns and crashes.
 Drawbacks of Key-Based Sharding:
 It's challenging to dynamically add or remove a database server.
 Every time this happens, we need to re-shard the database, which means updating the hash function and rebalancing the data.
 If a database server goes down, consistent hashing alone will not help; we will probably need a replica of each shard.
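A minimal sketch of key-based sharding in Python (the shard count and function name are hypothetical):

```python
import hashlib

NUM_SHARDS = 4  # hypothetical fixed shard count

def shard_for(shard_key):
    """Hash the shard key (e.g. a customer ID) to pick a shard."""
    digest = hashlib.sha256(str(shard_key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always lands on the same shard.
print(shard_for(42) == shard_for(42))  # True
```

Note that the final mod by `NUM_SHARDS` is exactly what makes re-sharding painful, which is why consistent hashing is often used here instead.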

 Range-Based Sharding:
 In range-based sharding, the shard is chosen based on the range in which the shard key falls.
 Let's say we have a recommender system that stores all the information about a user and recommends movies based on the user's age.
 Range-based sharding is easy to implement: we just need to check the range in which our current data falls and insert/read data from the shard corresponding to that range.

 Drawback:
 The major drawback of this technique is that if our data is unevenly distributed, it can again lead to database hotspots.
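A minimal sketch of range-based sharding for the age-based recommender example (the age ranges and shard names are made up):

```python
# Hypothetical age ranges mapped to shards: [low, high) -> shard.
RANGES = [
    (0, 18, "shard-0"),
    (18, 35, "shard-1"),
    (35, 60, "shard-2"),
    (60, 200, "shard-3"),
]

def shard_for_age(age):
    """Pick the shard whose range contains the given age."""
    for low, high, shard in RANGES:
        if low <= age < high:
            return shard
    raise ValueError(f"no shard covers age {age}")

print(shard_for_age(25))  # shard-1
```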
 Directory-Based Sharding:
 In directory-based sharding, we have a lookup table.
 It stores the shard key and keeps track of which shard stores which entries.
 To read or write data, we first consult the lookup table to find the shard number for the corresponding data using the shard key, and then visit that particular shard to perform the operation.

 Drawback:
 The main issue with directory-based sharding is that we need to consult the lookup table before every read and write query, which can impact application performance.
 Also, the lookup table is prone to being a single point of failure.
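A minimal directory-based sharding sketch in Python (the lookup table contents are hypothetical):

```python
# Hypothetical lookup table: shard key (user region) -> shard.
LOOKUP_TABLE = {
    "north": "shard-0",
    "south": "shard-1",
    "east": "shard-2",
    "west": "shard-3",
}

def shard_for(shard_key):
    """Consult the lookup table before every read or write."""
    shard = LOOKUP_TABLE.get(shard_key)
    if shard is None:
        raise KeyError(f"no shard registered for key {shard_key!r}")
    return shard

print(shard_for("east"))  # shard-2
```

Unlike hash- or range-based schemes, the mapping here is arbitrary and can be changed by editing the table, at the cost of the extra lookup on every query.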

 Geo-Based Sharding:
 In geo-based sharding, the data is processed by a shard corresponding to the user's region or location.
 The obvious problem with this strategy is that if the majority of users come from one PIN code, city, or country, we will have hotspots.

 Where to put the sharding logic?

 At the server level
o The end user makes a request for data to the server, the server has some sort of sharding logic, and finally the server goes to the right shard.
 At a reverse proxy
o Your server would make a request to read data from or write data to the database, but this request would actually go to the reverse proxy.
o The reverse proxy then has the job of figuring out which shard the request should go to.
 Should I Shard?
 Shard when a data store is likely to need to scale beyond the resources available to a single storage node.
 To improve performance by reducing load.
 The primary focus of sharding is to improve the performance and scalability of a system, but as a by-product it can also improve availability.
 The data is divided into separate partitions; a failure in one partition doesn't necessarily prevent an application from accessing data held in other partitions.
 Shard when the volume of writes or reads to the database surpasses what a single node or its read replicas can handle, resulting in slowed response times or timeouts.
 There is no single sharding technique that fits all systems.

 Leader Election:
 Why Leader Election?
 Imagine that you're designing a system for a product that allows users to subscribe to the product on a recurring basis.
 You can think of Netflix or Amazon Prime, where users can subscribe on a monthly or annual basis.
 You will have a database in which you store information about user subscriptions.
 You might store whether or not a user is currently subscribed to the service that you're offering.
 You might store the date on which the user's subscription is supposed to renew.
 You might store the price that the user should be charged on a recurring basis.

 Then you would use a third party service that actually takes care of charging the users, or debiting funds from their bank accounts.
 Suppose your third party service is PayPal or Stripe.
 Of course, this means that the third party service needs to somehow communicate with your database,
 because it needs to know when a user should be charged again, how much they should be charged, etc.
 You don't want to have this third party service interact with your database directly.
 Your database is a pretty sensitive part of your system.
 It contains important information, and you may not want some seemingly random third party service connecting directly into it.
 So a reasonable thing to do would be to create a service in the middle, between the third party service and the database.
 This service would be in charge of talking to the database, maybe on a periodic basis.
 It will figure out when a certain user's subscription is going to renew and how much that user needs to be charged.
 Then the new service goes to the third party service, PayPal or Stripe, and actually tells it to charge the user.

 Leader Election:
 Leader election, as the name suggests: if you have a group of machines or servers in charge of doing the same thing, then instead of having all of them do it, one machine is selected as the leader and performs the actions.
 The leader is the one performing the business logic, or whatever needs to be done.
 As in our previous example, you definitely don't want to issue a charge for a given user multiple times.
 So in our case, we've got five servers that are all effectively responsible for this business logic.
 But instead of having all of them run the business logic, they elect a leader amongst themselves.
 The leader is the only server responsible for doing the business logic, and the four other servers in the example just sit there on standby, in case something happens to the leader.
 If something happens to the leader, then one of the other servers becomes the new leader.
 A new leader is elected, and one of the other servers becomes that leader and takes over.

 It turns out that this seemingly simple task of having

o multiple distributed machines elect a single leader,
o all be aware of who the leader is at any given time,
o and all be capable of re-electing a new leader if something happens to the original leader at any given point in time
o is actually a very difficult problem to solve.
 The reason it's difficult is that you've got multiple distributed machines that need to essentially share state.
 In this case the state is who the leader is. That's just difficult.
 You never know what can happen in a network. What if there's a network failure that makes some machines no longer able to communicate with other machines? What happens then?
 The concept of electing a leader is really non-trivial, and it's not so much the act of electing a leader itself.
 It's the act of having multiple machines gain consensus, or agree upon something together: that's the real difficulty.
 It's the act of gaining that consensus, of sharing some state, in this case who the leader is, that's difficult.
 Consensus Algorithms:
 Consensus algorithms are complicated algorithms that allow multiple servers in a group, or multiple nodes in a cluster, to reach consensus, or to agree on some single data value.
 In the case of leader election, that single data value is who the leader is in a given group of machines or servers, or a cluster of nodes.
 There are a lot of consensus algorithms out there. A couple of really popular ones are Paxos and Raft.
 You are never going to be expected to implement them yourself in the industry.
 In the industry, you typically use some third party service that itself might use Paxos or Raft under the hood.
 There are two tools, Zookeeper and etcd, which implement consensus algorithms.
 Zookeeper and etcd aren't necessarily primarily meant to be used for leader election, but they happen to let you implement your own leader election in a very easy way.
 etcd
 etcd is a key-value store: a database that allows you to store key-value pairs.
 You can think of it as a hash table that maps keys to values.
 etcd is highly available and strongly consistent.
 Strong consistency means that if you've got multiple machines, or even just one machine, reading and writing the same key-value pair in etcd at any given point in time, you're always guaranteed to be returned the correct value.
 How does etcd achieve high availability and strong consistency?
o It does so using leader election, and more precisely, by implementing a consensus algorithm.
 etcd implements Raft.
 We would have our multiple servers communicating with the etcd key-value store.
 At any given point in time, we would have one special key-value pair in the etcd key-value store.

 That key-value pair would represent who the leader is amongst our servers.
 It could just be a key-value pair where the key is "leader", or some special key that represents the leader,
 and the value is the name of your server, or the IP address of your server.
 Because etcd guarantees high availability and strong consistency, the value of the leader in that key-value pair at any given point in time is correct for any machine that's reading it.
 And just like that, you've got leader election implemented.
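To make the idea concrete, here is a toy Python sketch that simulates what etcd provides: a strongly consistent store with an atomic compare-and-swap, which is all a basic leader election needs. This is a single-process stand-in, not the real etcd API, and all names are our own:

```python
import threading

class TinyKV:
    """Stand-in for etcd: a key-value store whose compare-and-swap
    is atomic, so only one writer can win a contested update."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def compare_and_swap(self, key, expected, new):
        with self._lock:
            if self._data.get(key) == expected:
                self._data[key] = new
                return True
            return False

    def get(self, key):
        with self._lock:
            return self._data.get(key)

kv = TinyKV()

def try_become_leader(server_name):
    # Only the first server to set the "leader" key wins the election.
    return kv.compare_and_swap("leader", None, server_name)

print(try_become_leader("server-1"))  # True: server-1 is elected
print(try_become_leader("server-2"))  # False: a leader already exists
print(kv.get("leader"))               # server-1
```

In a real deployment you would also attach a lease or timeout to the leader key, so that if the leader dies, the key expires and the standby servers can run the election again.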

