Quastor Summaries
These are the full archives of the Quastor Newsletter until July of 2023.
Quastor Archives
Table of Contents
Measuring Availability
Service Level Agreement
Availability
Latency
Other Metrics
Load Balancing Strategies
The Purpose of Load Balancers
Load Balancer vs. API Gateway
Types of Load Balancers
Load Balancing Algorithms
Scaling Relational Databases with Replicas and Sharding
Optimizations
Vertical Scaling
Adding Read Replicas
Sharding
Horizontal Partitioning Strategies
Backend Caching
Downsides of Caching
Implementing Caching
Cache Aside
Write Through
Cache Eviction
API Gateways
API Gateway use cases
API Gateway Lifecycle
History of API Gateways
API Gateways
Real World Uses
Zuul, Netflix’s API Gateway
TAG, Tinder API Gateway
An Introduction to Compilers and LLVM
Introduction to Classical Compiler Design
LLVM’s Implementation of the Three-Phase Design
How WhatsApp served 1 billion users with only 50 engineers
Engineering Culture
These availability goals will affect how you design your system and what tradeoffs you
make in terms of redundancy, autoscaling policies, message queue guarantees, and
much more.
These SLAs provide monthly guarantees in terms of Nines (we’ll discuss these shortly). If
the provider doesn’t meet its availability agreement, it will refund a portion of the bill.
Service Level Agreements are composed of multiple Service Level Objectives (SLOs). An
SLO is a specific target level objective for the reliability of your service.
● handle 1,500 requests per second during peak periods, with a maximum
allowable response time of 200 milliseconds for 99% of requests.
SLOs are based on Service Level Indicators (SLI), which are specific measures
(indicators) of how the service is performing.
The SLIs you set will depend on the service that you’re measuring. For example, you
might not care about the response latency for a batch logging system that collects a
bunch of logging data and transforms it. In that scenario, you might care more about the
recovery point objective (maximum amount of data that can be lost during the recovery
from a disaster) and say that no more than 12 hours of logging data can be lost in the
event of a failure.
The Google SRE Workbook has a great table of the types of SLIs you’ll want depending
on the type of service you’re measuring.
Every service will need a measure of availability. However, the exact definition will
depend on the service.
You might define availability using the SLO of "successfully responds to requests within
100 milliseconds". As long as the service meets that SLO, it'll be considered available.
Availability is measured as a proportion, where it’s time spent available / total time.
You have ~720 hours in a month and if your service is available for 715 of those hours
then your availability is 99.3%.
It is usually conveyed in nines, where the nine represents how many 9s are in the
proportion.
If your service is available 92% of the time, then that’s 1 nine. 99% is two nines. 99.9% is
three nines. 99.99% is four nines, and so on. The gold standard is 5 Nines of availability,
or available at least 99.999% of the time.
When you talk about availability, you also need to talk about the unit of time that you’re
measuring availability in. You can measure your availability weekly, monthly, yearly,
etc.
If you’re measuring it weekly, then you give an availability score for the week and that
score resets every week.
So, if you measure downtime monthly, then you can have at most about 43 minutes of
downtime and still have 3 Nines.
However, if you were measuring availability monthly, then the moment you had that 5
minutes of downtime in the week of May 1st, your availability for the month would’ve
been at most 3 Nines. Having 4 Nines means less than 4 minutes, 21 seconds of
downtime, so that would’ve been impossible for the month.
Here’s a calculator that shows daily, weekly, monthly and yearly availability calculations
for the different Nines.
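To make the arithmetic concrete, here’s a small Python sketch (not from the original newsletter) that computes the allowed downtime for a given number of Nines over a few common windows. The window lengths are illustrative; a month is treated as 30 days.

WINDOW_HOURS = {"daily": 24, "weekly": 7 * 24, "monthly": 30 * 24, "yearly": 365 * 24}

def allowed_downtime_minutes(nines, window):
    # e.g. 3 nines -> 99.9% availability -> 0.1% of the window can be downtime
    availability = 1 - 10 ** (-nines)
    return WINDOW_HOURS[window] * 60 * (1 - availability)

for n in range(1, 6):
    print(f"{n} nine(s): {allowed_downtime_minutes(n, 'monthly'):.1f} minutes of downtime per month")
# 3 nines -> ~43.2 minutes/month, 4 nines -> ~4.3 minutes/month, 5 nines -> ~26 seconds/month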
Most services will measure availability monthly. At the end of every month, the
availability proportion will reset. You can read more about choosing an appropriate time
window here in the Google SRE workbook.
An important way of measuring the availability of your system is with the latency. You’ll
frequently see SLOs where the system has to respond within a certain amount of time in
order to be considered available.
There are different ways of measuring latency, but you’ll commonly see two:
● Averages - Take the mean or median of the response times. If you’re using the
mean, then tail latencies (extremely long response times due to network
congestion, errors, etc.) can throw off the calculation.
● Percentiles - You’ll frequently see this as P99 or P95 latency (99th percentile
latency or 95th percentile latency). If you have a P99 latency of 200 ms, then
99% of your responses are sent back within 200 ms.
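Here’s a rough Python sketch (with made-up numbers) showing why percentiles are usually more informative than the mean when there are tail latencies:

import random

# Made-up response times in milliseconds, with a few slow outliers ("tail latencies").
latencies_ms = [random.gauss(80, 15) for _ in range(990)] + [random.uniform(500, 2000) for _ in range(10)]

def percentile(values, pct):
    # Simple nearest-rank percentile; production code would use numpy.percentile or similar.
    ordered = sorted(values)
    index = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[index]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean_ms:.0f} ms")          # pulled up by the outliers
print(f"P95:  {percentile(latencies_ms, 95):.0f} ms")
print(f"P99:  {percentile(latencies_ms, 99):.0f} ms")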
Latency will typically go hand-in-hand with throughput, where throughput measures the
number of requests your system can process in a certain interval of time (usually
measured in requests per second). As the requests per second goes up, the latency will
go up as well. If you have a sudden spike in requests per second, users will experience a
spike in latency until your backend’s autoscaling kicks in and you get more machines
added to the server pool.
You can use load testing tools like JMeter, Gatling and more to put your backend under
heavy stress and see how the average/percentile latencies change.
The high percentile latencies (responses that are slower than 99.9% or 99.99% of all
other responses) might also be important to measure, depending on the application. These
latencies can be caused by network congestion, garbage collection pauses, packet loss,
contention, and more.
Other Metrics
There’s an infinite number of other metrics you can track, depending on what your use
case is. Your customer requirements will dictate this.
Some other examples of commonly tracked SLOs are MTTR, MTBM and RPO.
MTTR - Mean Time to Recovery measures the average time it takes to repair a failed
system. Given that the system is down, how long does it take to become operational
again? Reducing the MTTR is crucial to improving availability.
MTBM - Mean Time Between Maintenance measures the average time between
maintenance activities on your system. Systems may have scheduled downtime (or
degraded performance) for maintenance activities, so MTBM measures how often this
happens.
RPO - Recovery Point Objective measures the maximum amount of data that a company
can lose in the event of a disaster. It’s usually measured in time and it represents the
point in time when the data must be restored in order to minimize business impact. If a
company has an RPO of 2 hours, then that means that the company can tolerate the loss
of data up to 2 hours old in the event of a disaster. RPO goes hand-in-hand with MTTR,
as a short RPO means that the MTTR must also be very short. If the company can’t
tolerate a significant loss of data when the system goes down, then the Site Reliability
Engineers must be able to bring the system back up ASAP.
As your backend gets more traffic, you’ll eventually reach a point where vertically
scaling your web server (upgrading your hardware) becomes too costly. You’ll have to
scale horizontally and create a server pool of multiple machines that are handling
incoming requests.
A load balancer sits in front of that server pool and directs incoming requests to the
servers in the pool. If one of the web servers goes down, the load balancer will stop
sending it traffic. If another web server is added to the pool, the load balancer will start
sending it requests.
Load balancers can also handle other tasks like caching responses, handling session
persistence (send requests from the same client to the same web server), rate limiting
and more.
Typically, the web servers are hidden in a private subnet (keeping them secure) and
users connect to the public IP of the load balancer. The load balancer is the “front door”
to the backend.
When you’re using a services oriented architecture, you’ll have an API gateway that
directs requests to the corresponding backend service. The API Gateway will also
provide other features like rate limiting, circuit breakers, monitoring, authentication
and more. The API Gateway can act as the front door for your application instead of a
load balancer.
API Gateways can replace what a load balancer would provide, but it’ll usually be
cheaper to use a load balancer if you’re not using the extra functionality provided by the
API Gateway.
Here’s a great blog post that gives a detailed comparison of AWS API Gateway vs. AWS
Application Load Balancer.
When you’re adding a load balancer, there are two main types you can use: layer 4 load
balancers and layer 7 load balancers.
This is based on the OSI Model, where layer 4 is the transport layer and layer 7 is the
application layer.
The main transport layer protocols are TCP and UDP, so a L4 load balancer will make
routing decisions based on the packet headers for those protocols: the IP address and
the port. You’ll frequently see the terms “4-tuple” or “5-tuple” hash when looking at L4
load balancers.
● Source IP
● Source Port
● Destination IP
● Destination Port
● Protocol Type
With a 5 tuple hash, you would use all 5 of those to create a hash and then use that hash
to determine which server to route the request to. A 4 tuple hash would use 4 of those
factors.
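As a rough illustration (not tied to any particular load balancer), here’s how a 5-tuple hash could be used to pick a backend server in Python; a 4-tuple hash would simply drop one of the fields:

import hashlib

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical backend pool

def pick_server(src_ip, src_port, dst_ip, dst_port, protocol):
    # Hash the 5-tuple and map the digest onto the server pool.
    key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}:{protocol}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return servers[digest % len(servers)]

print(pick_server("203.0.113.7", 51332, "198.51.100.10", 443, "TCP"))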
Layer 7 load balancers operate at the application layer, so they have access to
the HTTP headers. They can read data like the URL, cookies, content type and other
headers. An L7 load balancer can consider all of these things when making routing
decisions.
Popular load balancers like HAProxy and Nginx can be configured to run in layer 4 or
layer 7. AWS Elastic Load Balancing service provides Application Load Balancer (ALB)
and Network Load Balancer (NLB) where ALB is layer 7 and NLB is layer 4 (there’s also
Classic Load Balancer which allows both).
The main benefit of an L4 load balancer is that it’s quite simple. It’s just using the IP
address and port to make its decision and so it can handle a very high rate of requests
per second. The downside is that it has no ability to make smarter load balancing
decisions. Doing things like caching requests is also not possible.
On the other hand, layer 7 load balancers can be a lot smarter and forward requests
based on rules set up around the HTTP headers and the URL parameters. Additionally,
you can do things like cache responses for GET requests for a certain URL to reduce
load on your web servers.
Therefore, most general purpose load balancers operate at layer 7. However, you’ll also
see companies use both L4 and L7 load balancers, where the L4 load balancers are
placed before the L7 load balancers.
Facebook has a setup like this where they use Shiv (an L4 load balancer) in front of
Proxygen (an L7 load balancer). You can see a talk about this setup here.
Round Robin - This is usually the default method chosen for load balancing where web
servers are selected in round robin order: you assign requests one by one to each web
server and then cycle back to the first server after going through the list. Many load
balancers will also allow you to do weighted round robin, where you can assign each
server weights and assign work based on the server weight (a more powerful machine
gets a higher weight).
An issue with Round Robin scheduling comes when the incoming requests vary in
processing time. Round robin scheduling doesn’t consider how much computational
time is needed to process a request, it just sends it to the next server in the queue. If a
server is next in the queue but it’s stuck processing a time-consuming request, Round
Robin will still send it another job anyway. This can lead to a work skew where some of
the machines in the pool are at a far higher utilization than others.
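Here’s a minimal Python sketch of plain and weighted round robin, assuming a hypothetical three-server pool:

import itertools

servers = ["server-a", "server-b", "server-c"]           # hypothetical pool
weights = {"server-a": 3, "server-b": 1, "server-c": 1}  # server-a is a beefier machine

# Plain round robin: cycle through the pool one request at a time.
round_robin = itertools.cycle(servers)

# Weighted round robin (naive version): repeat each server proportionally to its weight.
weighted_pool = [s for s in servers for _ in range(weights[s])]
weighted_round_robin = itertools.cycle(weighted_pool)

for _ in range(5):
    print(next(round_robin), next(weighted_round_robin))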
Least Connections (Least Outstanding Requests) - With this strategy, you look at the
number of active connections/requests a web server has and also look at server weights
(based on how powerful the server's hardware is). Taking these two into consideration,
you send your request to the server with the least active connections / outstanding
requests. This helps alleviate the work skew issue that can come with Round Robin.
Hashing - In some scenarios, you’ll want certain requests to always go to the same
server in the server pool. You might want all GET requests for a certain URL to go to a
certain server in the pool or you might want all the requests from the same client to
always go to the same server (session persistence). Hashing is a good solution for this.
You can define a key (like request URL or client IP address) and then the load balancer
will use a hash function to determine which server to send the request to. Requests with
the same key will always go to the same server, assuming the number of servers is
constant.
Consistent Hashing - The issue with the hashing approach mentioned above is that
adding or removing servers from the server pool will mess up the hashing scheme. Anytime a
server joins or leaves the pool, most keys will map to a different server. Consistent
hashing minimizes the number of keys that have to be remapped when the pool changes.
There are different consistent hashing algorithms that you can use and the most
common one is Ring hash. Maglev is another consistent hashing algorithm from Google;
the paper on it was published in 2016, but it has been serving Google’s traffic since 2008.
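Here’s a minimal Ring hash sketch in Python (real implementations add many more virtual nodes and use faster hash functions), just to show the idea of walking clockwise around the ring to find the server responsible for a key:

import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=100):
        self.ring = []  # sorted list of (hash, server); vnodes smooth out the distribution
        for server in servers:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get(self, key):
        h = self._hash(key)
        # Walk clockwise to the first server at or after the key's position on the ring.
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c"])
print(ring.get("client-ip-203.0.113.7"))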
For most applications, the read load will be far greater than the write load, so you’ll have
to deal with scaling reads first. This is done through adding read replicas to the database
so the original database can just handle write requests and the database replicas will
handle reads.
Eventually, the master won’t be able to handle all the write pressure, so a popular way of
dealing with that is through sharding. You split up your data into multiple databases, so
you can dedicate a machine to handling each chunk of data.
Optimizations
Before you think about adding replicas or sharding, you should first explore all your
options around database optimizations.
Using the right indexes on tables, batching writes, analyzing the execution plan for your
most expensive SQL queries, etc. should all be done first.
Vertical Scaling
After exploring possible optimizations, vertical scaling will likely give you the biggest
bang for your buck. Getting a beefier CPU, more RAM, faster disk etc. should be
considered before scaling with replicas and sharding.
Most web applications have far more read load than write load, so reducing that
pressure by adding read replicas can have a huge impact on your database’s scalability.
Your original database becomes the leader and you add additional machines that
function as replicas/followers. Database reads are completed by the follower nodes
while writes are executed on the master node. After being executed on the master, writes
are copied over asynchronously to follower nodes.
Due to replication lag between the master and follower nodes, you’ll be sacrificing
consistency and have to deal with some stale reads.
One way of fixing this is to implement different read modes with strongly consistent
reads and eventually consistent reads.
Eventually consistent reads go to a follower node whereas strongly consistent reads will
be handled by the leader node. This lets you reduce read load on the leader database
while also having the ability to run strongly consistent read queries.
Another way to make reads strongly consistent is to wait until all the replicas have
gotten the write before you acknowledge it as being completed to the user.
Here’s a great blog post from Box on how they dealt with consistency issues with read
replicas.
However, adding read replicas will not help to scale the write load in any way. For that,
you’ll have to shard your database.
With sharding, you’re taking your database and splitting up the data to store on multiple
databases. This lets you spread your data across multiple machines, which makes it
easier to deal with write and read load (more machines to handle the traffic).
● Increased latency for certain reads - if a read needs data from multiple shards,
it will have to perform reads from multiple databases and then join the data
together.
You can break down your data using horizontal partitioning where you maintain the
table schema and columns, but have each shard contain a number of the rows of your
table.
Vertical Partitioning is where you split up your table based on the columns of your
table.
You can read about how Notion sharded Postgres by implementing their own
partitioning scheme in their application logic.
In terms of potential third party solutions, Vitess is an open source solution for sharding
MySQL that was developed at YouTube. Citus is a similar open source tool for sharding
Postgres.
You can read about how GitHub used Vitess to shard MySQL here.
With horizontal partitioning, you’re taking one (or multiple) of the fields in your table
and making that your shard key.
Some ways of using the shard key to partition are hash based sharding, range based
sharding and lookup table based sharding.
With hash based sharding, you run a hash function on the value of the shard key and then categorize it based on
the hashed value. A good hash function will satisfy the uniformity property, so the
outputs will be mapped evenly over the output range. This should help mitigate hot/cold
shards.
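A minimal sketch of hash based sharding, assuming a hypothetical user_id shard key and a fixed number of shards (note that changing the shard count with plain modulo hashing remaps most keys, which is where the consistent hashing idea from the load balancing section comes in handy):

import hashlib

NUM_SHARDS = 4  # hypothetical number of shards

def shard_for(shard_key: str) -> int:
    # A uniform hash spreads keys evenly across shards, mitigating hot/cold shards.
    digest = int(hashlib.sha1(shard_key.encode()).hexdigest(), 16)
    return digest % NUM_SHARDS

# e.g. route a row by its user_id shard key to the right database connection.
print(shard_for("user_18213"))  # -> a shard number between 0 and 3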
With range based sharding, you’ll split up your partition key into ranges and then divide the rows
into shards based on that.
Examples of ranges could be location (each country is a shard), date (each month is a
shard), price (every $100 increment is a shard), etc.
A third way of implementing sharding is by using a lookup table (or hash table).
Like we did in key-based sharding, you'll set one column of your data as the shard key.
Then, you can randomly assign rows to different shards and keep track of which shard
contains which row with a lookup table.
If the data being read doesn’t change often, then adding a caching layer can significantly
reduce the latency users experience while also reducing the load on your database. The
reduction in read requests frees up your database for writes (and reads that were missed
by the cache).
The cache tier is a data store that temporarily stores frequently accessed data. It’s not
meant for permanent storage and typically you’ll only store a small subset of your data
in this layer.
When a client requests data, the backend will first check the caching tier to see if the
data is cached. If it is, then the backend can retrieve the data from cache and serve it to
the client. This is a cache hit. If the data isn’t in the cache, then the backend will have to
query the database. This is a cache miss. Depending on how the cache is set up, the
backend might write that data to the cache to avoid future cache misses.
A couple examples of popular data stores for caching tiers are Redis, Memcached,
Couchbase and Hazelcast. Redis and Memcached are the most popular and they’re
offered as options with cloud cache services like AWS’s ElastiCache and Google Cloud’s
Memorystore.
Redis and Memcached are in-memory, key-value data stores, so they can serve reads
with a lower latency than disk-based data stores like Postgres. They’re in-memory data
stores, so RAM is used as the primary method of storing and serving data while the disk
is used for backups and logging. This translates to speed improvements as memory is
much faster than disk.
If you have a high cache miss rate, then that means the caching tier is adding more
latency than it’s reducing and you’d be faster off by removing it. We’ll talk about
strategies to minimize the cache miss rate.
Another downside of adding a cache tier is dealing with stale data. If the data you’re
caching is static, then this isn’t an issue but you’ll frequently want to cache data that is
being changed. You’ll have to have a strategy for cache invalidation to minimize the
amount of stale data you're sending to clients. We’ll talk about this below.
Cache Aside
1. A client requests some data from the backend server.
2. The server checks the caching tier. If there’s a cache hit, then the data is
immediately served.
3. If there’s a cache miss, then the server checks the database and returns the
data.
Here, your cache is being loaded lazily, as data is only being cached after it’s been
requested. You usually can’t store your entire dataset in cache, so lazy loading the cache
is a good way to make sure the most frequently read data is cached.
However, this also means that the first time data is requested will always result in a
cache miss. Developers solve this by cache warming, where you load data into the cache
manually.
In order to prevent stale data, you’ll also give a Time-To-Live (TTL) whenever you cache
an item. When the TTL expires, then that data is removed from the cache. Setting a very
low TTL will reduce the amount of stale data but also result in a higher number of cache
misses. You can read more about this tradeoff in this AWS Whitepaper.
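Here’s a rough Cache Aside sketch in Python, assuming the redis-py client and a hypothetical db.query_user function; the TTL value is the knob that trades staleness against cache misses:

import json
import redis  # assumes the redis-py client; any key-value cache works similarly

cache = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 60  # a lower TTL means less stale data but more cache misses

def get_user(user_id, db):
    cached = cache.get(f"user:{user_id}")
    if cached is not None:                         # cache hit
        return json.loads(cached)
    user = db.query_user(user_id)                  # cache miss: hypothetical database call
    cache.setex(f"user:{user_id}", TTL_SECONDS, json.dumps(user))  # lazily populate the cache with a TTL
    return user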
An alternative caching method that minimizes stale data is the Write-Through cache.
A Write Through cache can be viewed as an eager loading approach. Whenever there’s a
change to the data, that change is reflected in the cache.
This helps solve the data consistency issues (avoid stale data) and it also prevents cache
misses for when data is requested the first time.
1. A client sends a request to create or update some data.
2. The backend writes the change to both the database and the cache. You can
also do this step asynchronously, where the change is written to the cache and
then the database is updated after a delay (a few seconds, minutes, etc.). This
is known as a Write Behind cache.
3. Clients can request data and the backend will try to serve it from the cache.
A Write Through strategy is often combined with a Read Through so that changes are
propagated in the cache (Write Through) but missed cache reads are also written to the
cache (Read Through).
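And here’s a rough Write Through sketch along the same lines, again with a hypothetical db.update_user call and a redis-like cache client:

import json

def update_user(user_id, fields, db, cache, ttl_seconds=300):
    # Write Through: the change is written to the database and reflected in the
    # cache in the same operation, so later reads don't serve stale data.
    user = db.update_user(user_id, fields)                      # hypothetical database call
    cache.setex(f"user:{user_id}", ttl_seconds, json.dumps(user))
    # A Write Behind variant would instead update the cache first and defer the
    # database write to run asynchronously after a short delay.
    return user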
You can read about more caching patterns in the Oracle Coherence Documentation
(Coherence is a Java-based distributed cache).
Your cache tier only holds a small subset of your data in memory, so you’ll eventually
reach a situation where your cache is full and you can’t add any new data to it.
To solve this, you’ll need a cache replacement policy. The ideal cache replacement policy
will remove cold data (data that is not being read frequently) and replace it with hot data
(data that is being read frequently).
● Queue-Based - Use a FIFO queue and evict data based on the order in which it
was added regardless of how frequently/recently it was accessed.
The type of cache eviction policy you use depends on your use case. Picking the optimal
eviction policy can massively improve your cache hit rate.
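As an example of one common eviction policy (least recently used, which isn’t named above but is a frequent default), here’s a minimal Python sketch built on an OrderedDict:

from collections import OrderedDict

class LRUCache:
    # Minimal LRU eviction sketch: when the cache is full, evict the entry
    # that was accessed least recently (the "coldest" data).
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None                      # cache miss
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used entry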
Exposing any of your internal machines to the public internet is not a good idea.
Anything that is exposed will need rate limiting and protections around DDoS. It’s also a
security risk; if hackers can see what software your servers are running, then they can
look up any known vulnerabilities/exploits and attack you with them.
Instead, you’ll want to have all your internal machines inside of a VPC and expose them
through a single entrypoint: a reverse proxy. The reverse proxy is the only thing exposed
to the public internet and it will have security protections against things like DDoS
attacks.
Previously, the most common type of reverse proxy was a load balancer. When someone
sends requests to your backend, they go to the load balancer as the entrypoint.
We talked in depth about load balancers in a previous Quastor Pro article here.
With the rise of microservice architectures, API gateways have become significantly
more popular as the entrypoint to your backend. They handle tasks like load balancing,
but they can also do much more.
● Load Balancing - Balance the requests between the different servers in your
server pool. API gateways function in a similar way to an L7 load balancer.
● Caching - The gateway can cache the response to specific endpoints to reduce the
number of calls made to your backend services. You can see how caching works
with Amazon API gateway here.
● Authentication and Authorization - Many API gateways will also authenticate the
requests that come in and make sure the client is authorized to access the
resources they’re asking for.
● Protocol Translation - The client might be sending you requests with HTTP but
your backend uses gRPC to communicate. The API gateway will handle this
translation where it takes the HTTP request and converts it to gRPC. Then, it
takes the backend service’s response and creates the HTTP response to send back
to the client.
API gateways are primarily used with microservices architectures. You can use an API
gateway with a monolith, but it’s usually not necessary. The features around
authentication/authorization and protocol translation are typically already handled by
the monolith. You also don’t need service discovery as there’s just a single service.
Therefore, with monoliths, teams will usually just use load balancers as the entrypoint.
However, you’ll also see engineering teams (with services-oriented architectures) use
both. They’ll have load balancers serve as the entry point to the application. Then, an
API gateway will sit between the load balancers and all microservices in the backend.
1. Protocol Manager - This layer contains a deserializer and serializer for all of the
protocols supported by the gateway. The gateway has to deal with different types
of protocol payloads like JSON, Thrift, Protobuf, etc. so this layer handles
translation.
2. Middleware - This layer contains different middleware functions for things like
authentication, authorization, rate limiting, logging, monitoring, etc. Engineers
can write the code for new middleware functions and add them to the API
gateway through a UI.
3. Endpoint Handler - This layer validates the request and transforms the request
object into an object that backend services can understand.
4. Requests to Backend Services - This part of the backend is responsible for making
the actual calls to backend microservices. It handles service discovery to figure
out which microservices to call and also takes care of things like circuit breaking
(so a microservice doesn’t get overloaded) and error handling, timeouts, retries
and more.
This describes how the API gateway takes in a request. To send a response back to the
client, the request goes through the same components in reverse order.
1. Requests to backend services - This service has already made requests to various
backend microservices and now it’s received responses. This part of the gateway
aggregates the responses and sends it to the endpoint handler.
3. Middleware - Uber has middleware functions that work on the response objects
to handle things like logging and monitoring. These functions might also add
things like HTTP headers to the response object.
4. Payload Manager - This transforms the backend response to the relevant payload
protocol like JSON, Thrift, Protobuf, etc. Then it sends it back to the load
balancer, who sends it to the client.
F5 was founded in 1996, the HAProxy project started in 2001 and NGINX in 2002. These
companies and projects all produced network/application load balancers to direct traffic to the servers in the pool.
Large tech companies like Amazon, Netflix, eBay and more began to adopt service
oriented architectures. These companies (Netflix especially) began to publicly champion
microservices and the term became commonly used by the early 2010s.
In mid-2013, Netflix released their JVM-based API gateway: Zuul. It has a ton of cool
features like allowing Groovy scripts to be injected at runtime to add middleware and
dynamically modify behavior. You could write middleware to handle things like rate
limiting, load shedding, authentication, monitoring, release engineering (canary
releases or A/B testing) and more.
In the mid to late 2010s, you had the adoption of containers and container management
tools like Kubernetes. This led to the concept of a service mesh architecture with
projects like Envoy by Lyft. These tools facilitate service discovery so developers can
focus on writing the code for their microservice and let the service mesh tool handle
inter-service communication. Companies like Snapchat use Envoy for their service mesh
and also have it serve as the API gateway.
● Amazon API Gateway (plus managed solutions from other cloud providers)
● Kong Gateway
● Tyk Gateway
● Ambassador
● Nginx Plus
● Envoy
And more. Kong, Tyk, Ambassador and Envoy are open source.
Netflix uses Zuul, which is built with Java. It handles tasks like load balancing, routing,
monitoring, authentication, payload transformation, etc. They also built in features to
increase reliability like load shedding and stress testing (to stress test backend
microservices). You can read about how Zuul does Service Discovery for Netflix’s
thousands of microservices here. The gateway also handles errors in the backend by
categorizing the error message and running retries.
The wiki gives a great overview of how Netflix uses Zuul and the other tooling they built
as complements to the gateway.
Tinder has more than 500 microservices that communicate with each other using a
service mesh. They built TAG on top of Spring Cloud Gateway, an API gateway that’s
part of the Java Spring ecosystem. The API gateway handles tasks like load balancing,
transforming requests/responses, HTTP to gRPC conversion and more.
This post goes through an introduction to compiler design, the motivations behind
LLVM, the design of LLVM and some extremely useful features that LLVM provides.
It was very uncommon to see a language implementation that supported both, and
when one did, there was very little sharing of code.
Front End
The front end parses the source code (checking it for errors) and builds a
language-specific Abstract Syntax Tree.
The AST is optionally converted to a new representation for optimization (this may be a
common code representation, where the code is the same regardless of the input source
code’s language).
Optimizer
The optimizer runs a series of optimizing transformations to the code to improve the
code’s running time, memory footprint, storage size, etc.
This is more or less independent of the input source code language and the target
language.
Back End
The back end maps the optimized code onto the target instruction set.
It’s responsible for generating fast code that takes advantage of the specific features of
the supported architecture.
Common parts of the back end include instruction selection, register allocation, and
instruction scheduling.
The most important part of this design is that a compiler can be easily adapted to
support multiple source languages or target architectures.
The same applies for adding a new target architecture for the compiler. You just have to
implement the back end and you can reuse the front end and optimizer.
Additionally, you can use specific parts of the compiler for other purposes. For example,
pieces of the compiler front end could be used for documentation generation and static
analysis tools.
The main issue was that this model was rarely realized in practice. If you looked at the
open source language implementations prior to LLVM, you’d see that implementations
of Perl, Python, Ruby, Java, etc. shared no code.
While projects like GHC and FreeBASIC were designed to compile to multiple different
CPUs, their implementations were specific to the one source language they supported
(Haskell for GHC).
Compilers like GCC suffered from layering problems and leaky abstractions. The back
end in GCC uses front end ASTs to generate debug info and the front end generates back
end data structures.
In an LLVM-based compiler, the front end is responsible for parsing, validating, and
diagnosing errors in the input code.
The front end then translates the parsed code into LLVM IR (LLVM Intermediate
Representation).
The LLVM IR is a complete code representation. It is well specified and is the only
interface to the optimizer.
This means that if you want to write a front end for LLVM, all you need to know is what
LLVM IR is, how it works, and what invariants it expects.
LLVM IR is a low-level RISC-like virtual instruction set and it is how the code is
represented in the optimizer. It generally looks like a weird form of assembly language.
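As a small illustration, here’s a sketch that builds the LLVM IR for a trivial add function, assuming the llvmlite Python bindings (which aren’t mentioned in the original post):

from llvmlite import ir  # assumes the llvmlite Python bindings are installed

module = ir.Module(name="demo")
i32 = ir.IntType(32)

# Build a function `i32 add(i32 %a, i32 %b)` in LLVM IR.
fn = ir.Function(module, ir.FunctionType(i32, [i32, i32]), name="add")
block = fn.append_basic_block(name="entry")
builder = ir.IRBuilder(block)
a, b = fn.args
builder.ret(builder.add(a, b, name="sum"))

print(module)  # prints the textual IR, which reads like a typed assembly language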
You can easily customize the optimizer to add your own optimizing transformations.
After the LLVM IR code is optimized, it goes to the back end of the compiler (also
known as the code generator).
The LLVM code generator transforms the LLVM IR into target specific machine code.
The code generator’s job is to produce the best possible machine code for any given
target.
LLVM’s code generator splits the code generation problem into individual passes:
instruction selection, register allocation, scheduling, code layout optimization, assembly
emission, and more.
You can customize the back end and choose among the default passes (or override them)
and add your own target-specific passes.
This allows target authors to choose what makes sense for their architecture and also
permits a large amount of code reuse across different target back ends (code from a pass
for one target back end can be reused by another target back end).
The code generator will output target specific machine code that you can now run on
your computer.
Here’s a dive into the engineering culture and tech stack that made this possible.
Engineering Culture
WhatsApp consciously keeps the engineering staff small to only about 50 engineers.
Individual engineering teams are also small, consisting of 1 - 3 engineers and teams are
each given a great deal of autonomy.
In terms of servers, WhatsApp prefers to use a smaller number of servers and vertically
scale each server to the highest extent possible.
Their goal was previously to have 1 million users for every server (but that’s become
more difficult as they’ve added more features to the app and as users are generating
more activity on a per-user basis).
Having fewer servers means fewer things breaking down, which makes it
easier for the team to handle.
The same goes for the software side where they limit the total number of systems and
components in production.
That means fewer systems that have to be developed, deployed and supported.
There aren’t many systems/components that are developed and then put into
maintenance mode (to eventually become orphans until something goes wrong).
Instead, they focus on building just enough for scalability, security and reliability.
One of the key factors when they make technical choices is “what is the simplest
approach?”
They avoid extra bells and whistles and don’t implement features that aren’t exclusively
focused on core communications.
The tech stack revolves around 3 core components: Erlang, FreeBSD and SoftLayer.
Erlang
Erlang was designed for concurrency from the start, and fault tolerance is a first class
feature of the language.
The language is very concise and it’s easy to do things with very few lines of code.
The OTP (Open Telecom Platform) is a collection of open source middleware, libraries
and tools for Erlang.
WhatsApp tries to avoid dependencies as much as possible, but they do make use of
Mnesia, a distributed database that’s part of OTP.
Erlang also brings the ability to hotswap code. You can take new application code and
load it into a running application without restarting the application.
This makes the iteration cycle very quick and allows WhatsApp to release quick fixes
and have extremely long uptimes for their services.
To see exactly how WhatsApp’s backend is built with Erlang, you can watch this talk
from 2018.
The decision to use FreeBSD was made by the founders of WhatsApp, based on their
previous experience at Yahoo!
The founders (and a lot of the early team) all used to be part of Yahoo!, where FreeBSD
was used extensively.
To see exactly how WhatsApp uses FreeBSD, you can watch this talk.
Just note, the talk is from 2014, so some things may be out of date now.
SoftLayer
WhatsApp originally hosted its backend on SoftLayer’s infrastructure. However, SoftLayer is
owned by IBM (it’s part of the IBM public cloud), and WhatsApp has since moved off SoftLayer to use Facebook’s infrastructure.
Get more specific details from this High Scalability post on WhatsApp Engineering.
Summary
When Uber’s ride sharing app makes a request to the backend, the first point of contact
is Uber’s API gateway.
The API gateway provides a single point of entry for all of Uber’s apps and gives a clean
interface to access data, logic or functionality from back-end microservices.
The API gateway is the place to implement things like rate limiting, security auditing,
user access blocking, protocol conversion, and more.
A backend engineer at Uber will be working on their own microservice (you can read
about how Uber handles microservices here).
Their microservice will have an API with its own configuration parameters: path, type
of request data, type of response, maximum calls allowed, apps allowed, observability,
etc.
The engineer can then configure these parameters in a UI for Uber’s API gateway. The
UI walks the user through a step-by-step process for creating their API endpoint.
The gateway infrastructure will then convert these configurations into valid and
functional APIs that can serve traffic from Uber’s apps.
The four components are the Protocol Manager, Middleware, Endpoint Handler and
finally the Client.
1. Protocol Manager - This is the first layer of the stack. It contains a deserializer
and serializer for all of the protocols supported by the gateway. It can ingest
any type of relevant protocol payload, including JSON, Thrift, or Protobuf.
2. Middleware - This layer handles things like rate limiting, authentication and
authorization, etc. Each endpoint can choose to configure one or more
middleware. If a middleware fails execution, the call short circuits the
remainder of the stack and the response from the middleware will be returned
to the caller.
3. Endpoint Handler - This layer validates the request and transforms the request
object into an object that the backend services can understand.
4. Client - This layer performs the request to the specific backend microservice.
Clients are protocol-aware and generated based on the protocol selected
during configuration.
If you’d like to read about how Uber thinks about scaling this API gateway, here’s
another interesting blog post on that.
The MapReduce paper was first published in 2004 by Jeff Dean and Sanjay Ghemawat.
It was originally designed, built and used by Google.
At the time, Google had an issue. Google’s search engine required constant crawling of
the internet, content indexing of every website and analyzing the link structure of the
web (for the PageRank algorithm).
For a while, engineers had to laboriously hand-write software to take whatever problem
they were working on and farm it out to all the computers.
Engineers would have to manage things like parallelization, fault tolerance, data
distribution and various other concepts in distributed systems.
If you’re not skilled in distributed systems, then it can be really difficult to do this. If
you are skilled in distributed systems, then you can do it but it’s a waste of your time to
do it again and again.
Therefore, Google wanted a framework that allowed engineers to focus on the code for
their problem (building a web index/link analyzer/whatever) and provided an interface
so engineers could use the vast array of machines without having to worry about the
distributed systems stuff.
MapReduce is based on two functions from the Lisp programming language (and many
other functional languages): Map and Reduce (also known as Fold).
Map takes in a list of elements (or any iterable object) and a function, applies the
function to each element in the list and then returns the list of results.
Reduce takes in a list of elements (or any iterable object), a function and a starting
element.
The function (that gets passed into the Reduce operation) takes in a starting element
and an element from the list and combines the two in some way. It then returns the
combination.
The reduce operation sequentially runs the function on all the elements in the list
combining each element with the result from the previous function call.
Example
You have a billion documents and you want to create a dictionary of all the words that
appear in those documents and the count of each word (number of times each word
appears across the documents).
The map function emits each word plus a count of occurrences (here it’s just 1).
The reduce function will then sum together all counts emitted for a particular word.
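Here’s a rough Python sketch of the two user-supplied functions for the word count example, plus a tiny in-process stand-in for the framework just to show the data flow (the real framework handles all the distribution, partitioning and fault tolerance for you):

from collections import defaultdict

def map_fn(document_name, contents):
    # Emit (word, 1) for every word in the document.
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all the counts emitted for this word across every document.
    return (word, sum(counts))

# In-process stand-in for the shuffle step: group intermediate counts by word.
intermediate = defaultdict(list)
for doc, text in [("doc1", "the quick fox"), ("doc2", "the lazy dog")]:
    for word, count in map_fn(doc, text):
        intermediate[word].append(count)

print([reduce_fn(word, counts) for word, counts in intermediate.items()])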
In order to use the MapReduce framework (and take advantage of the distributed
computers), you’ll have to provide the map function (listed above), the reduce function
(also listed above), names of the input files (for the billion documents), output files
(where you want MapReduce to put the finished dictionary) and some tuning
parameters (discussed below).
If you’d like to see the full program for this using the MapReduce framework, please
look at page 13 of the MapReduce paper.
Now, here’s a breakdown of exactly how the MapReduce framework works internally.
The MapReduce library will look at the input files and split them into M pieces (M is a
parameter specified by the user).
It will then start up many copies of the MapReduce program on a cluster of machines.
One of the copies will be the master. The rest of the copies are workers that are assigned
tasks by the master.
A worker who is assigned a map task will read the contents of their input split. They will
parse key/value pairs out of the input data and pass each key/value pair to the
user-provided Map function.
The intermediate key/value pairs are partitioned into R regions by the partitioning
function. The location of these pairs is passed back to the master.
The default partitioning function just uses hashing (e.g. hash(key) mod R) but the user
can provide a special partitioning function if desired. For example, the output keys could
be URLs and the user wants URLs from the same website to go to the same output file.
Then, the user can specify his own partitioning function like hash(Hostname(urlkey))
mod R.
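A minimal sketch of what those partitioning functions could look like in Python (the function names and the use of MD5 are assumptions, not from the paper):

import hashlib
from urllib.parse import urlparse

R = 8  # number of reduce partitions

def default_partition(key):
    # Default behaviour: hash(key) mod R
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % R

def hostname_partition(url_key):
    # Custom behaviour: keep all URLs from the same website in the same
    # partition, i.e. hash(Hostname(urlkey)) mod R
    hostname = urlparse(url_key).hostname or url_key
    return int(hashlib.md5(hostname.encode()).hexdigest(), 16) % R

print(hostname_partition("https://example.com/page1") == hostname_partition("https://example.com/page2"))  # True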
After the location of the intermediate key/value pairs is passed to master, the master
assigns reduce tasks to workers and notifies them of the storage locations for their
assigned key/value pairs (one of the R partitions).
The reduce worker will read the intermediate key/value pairs from their region and then
sort them by the intermediate key so that all occurrences of the same key are grouped
together.
The reduce worker then iterates over the sorted intermediate data.
For each unique intermediate key encountered, the reduce worker passes the key and
the corresponding set of intermediate values to the user’s Reduce function.
The output of the Reduce function is appended to a final output file for this reduce
partition.
After all the map tasks and reduce tasks have been completed, the master wakes up the
user program.
The storage engine is responsible for storing, retrieving and managing data in memory
and on disk.
A DBMS will often let you pick which storage engine you want to use. For example,
MySQL has several choices for the storage engine, including RocksDB and InnoDB.
Having an understanding of how storage engines work is crucial for helping you pick the
right database.
Log Structured storage engines treat the database as an append-only log file where data
is sequentially added to the end of the log.
Page Oriented storage engines break the database down into fixed-size pages
(traditionally 4 KB in size) and they read/write one page at a time.
Each page is identified using an address, which allows one page to refer to another page.
These page references are used to construct a tree of pages.
In this update, we’ll be focusing on Log Structured storage engines. We’ll look at how to
build a basic log structured engine with a hash index for faster reads.
● db_set(key, value) - give the database a (key, value) pair and the database will
store it. If the key already exists then the database will just update the value.
● db_get(key) - give the database a key and the database will return the
associated value. If the key doesn’t exist, then the database will return null.
The storage engine works by maintaining a log of all the (key, value) pairs.
Anytime you call db_set, you append the (key, value) pair to the bottom of the log.
This is done regardless of whether the key already existed in the log.
The append-only strategy works well because appending is a sequential write operation,
which is generally much faster than random writes on magnetic spinning-disk hard
drives (and on solid state drives to some extent).
Now, anytime you call db_get, you look through all the (key, value) pairs in the log and
return the last value for any given key.
We return the last value since there may be duplicate values for a specific key (if the key,
value pair was inserted multiple times) and the last value that was inserted is the most
up-to-date value for the key.
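Here’s a minimal Python sketch of this append-only log (the file name and comma-separated format are assumptions, just for illustration):

import os

LOG_PATH = "database.log"  # hypothetical log file

def db_set(key, value):
    # Append-only: always add a new entry, even if the key already exists.
    with open(LOG_PATH, "a") as log:
        log.write(f"{key},{value}\n")

def db_get(key):
    if not os.path.exists(LOG_PATH):
        return None
    value = None
    with open(LOG_PATH) as log:
        for line in log:                        # O(n): scan the whole log
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                value = v                       # keep overwriting: last write wins
    return value

db_set("user_42", "alice")
db_set("user_42", "alice_updated")
print(db_get("user_42"))  # -> "alice_updated"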
The db_set function runs in O(1) time since appending to a file is very efficient.
On the other hand, db_get runs in O(n) time, since it has to look through the entire log.
This becomes an issue as we scale the log to handle more (key, value) pairs.
This metadata can act as a “signpost” as the storage engine searches for data and helps it
find the data faster.
The database index is an additional structure that is derived from the primary data and
adding a database index to our storage engine will not affect the log in any way.
The tradeoff with a database index is that your database reads can get significantly
faster. You can use the database index during a read to achieve sublinear read speeds.
However, database writes are slower now because for every write you have to update
your log and update the database index.
An example of a database index we can add to our storage engine is a hash index.
Hash Indexes
We keep a hash table in-memory where each key in our log is stored as a key in our hash
table and the location in the log where that key can be found (the byte offset) is stored as
the key’s value in our hash table.
1. Append the (key, value) pair to the end of the log. Note the byte offset for
where it’s added.
2. Check whether the key is already in the hash table.
3. If it is, then update its value in the hash table to the new byte offset.
One requirement with this strategy is that all your (key, byte offset) pairs have to fit in
RAM since the hash map is kept in-memory.
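Extending the earlier sketch, here’s roughly what the hash index looks like in Python: the hash table maps each key to the byte offset of its latest log entry, so reads can seek straight to the right spot instead of scanning the whole log.

LOG_PATH = "database.log"  # same hypothetical log file as the earlier sketch
index = {}                 # in-memory hash table: key -> byte offset of the latest entry

def db_set(key, value):
    with open(LOG_PATH, "ab") as log:
        offset = log.tell()                     # byte offset where this entry starts
        log.write(f"{key},{value}\n".encode())
    index[key] = offset                         # point the key at its most recent entry

def db_get(key):
    if key not in index:
        return None
    with open(LOG_PATH, "rb") as log:
        log.seek(index[key])                    # jump straight to the entry, no full scan
        entry = log.readline().decode()
    return entry.rstrip("\n").partition(",")[2]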
An issue that will eventually come up is that we’ll run out of disk space.
Since we’re only appending to our log, we’ll have repeat entries for the same key even
though we’re only using the most recent value.
Whenever our log reaches a certain size, we’ll close it and then make subsequent writes
to a new segment file.
Then, we can perform compaction on our old segment, where we throw away duplicate
keys in that segment and only keep the most recent update for each key.
After removing duplicate keys, we can merge past segments together into larger
segments of a fixed size.
We’ll also create new hash tables with (key, byte offset) pairs for all the past merged
segments we have and keep those hash tables in-memory.
The merging and compaction of past segments is done in a background thread, so the
database can still respond to read/writes while this is going on.
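Compaction of a closed segment could look roughly like this (a sketch; a real engine would also rebuild the segment’s in-memory hash table and run this in a background thread):

def compact(segment_path, compacted_path):
    latest = {}
    with open(segment_path, "rb") as segment:
        for entry in segment:
            key, _, value = entry.rstrip(b"\n").partition(b",")
            latest[key] = value                 # later entries overwrite earlier ones
    with open(compacted_path, "wb") as output:
        for key, value in latest.items():       # only the most recent value per key survives
            output.write(key + b"," + value + b"\n")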
1. Check the current segment’s in-memory hash table for the key.
2. If it’s there, then use the byte offset to find the value in the segment on disk
and return the value.
3. Otherwise, look through the hash tables for our previous segments to find the
key and its byte offset.
The methodology we’ve described is very similar to a real world storage engine - Bitcask.
Bitcask is a storage engine used for the Riak database, and it’s written in Erlang. You can
read Bitcask’s whitepaper here.
Bitcask (the storage engine that uses this methodology) also stores snapshots of the
in-memory hash tables to disk.
This helps out in case of crash recovery, so the database can quickly spin back up if the
in-memory hash tables are lost.
● The hash table of (key, byte offsets) must fit in memory. Maintaining a
hashmap on disk is too slow.
● Range queries are not efficient. If you want to scan over all the keys that start
with the letter a, then you’ll have to look up each key individually in the hash
maps.
Robinhood paid $60 million to AWS in 2020 for cloud hosting fees. This is a 5x
increase from their 2019 cloud bill of $12 million.
If you’d like to learn more about Robinhood’s engineering, check out their interview on
the Software Engineering Daily podcast. We’ll summarize the interesting bits here.
When Robinhood was first getting started, they were a Python/Django shop, however
they’ve been shifting towards Go.
They’ve also been microservices oriented, and most of their APIs are written in Python
and Go. There is some Java and Rust in their codebase however.
Robinhood is built on AWS, so they use Amazon RDS (Relational Database Service) for
their data store. They use Postgres as the database engine for RDS.
Since they’re on AWS, an interesting choice that comes up is whether they should utilize
AWS’s PaaS (Platform as a Service) offerings or if they should go the “Do It Yourself”
route.
Jaren Glover (tech lead at Robinhood) found that PaaS products work great when you
play inside the guardrails in which they’re presented.
However, if you move from a generic compute to something domain specific, then you
may start to bump against those guardrails and not have the quality of service that you
want.
Another interesting challenge that comes up when building a Stock brokerage is how
you manage time.
It’s very important to process orders in the same order that the customer placed them
and also to route orders correctly relative to which customer placed which order first.
In order to do this, Robinhood relies on NTP - Network Time Protocol. NTP allows you
to synchronize clocks between computers over a variable-latency network.
NTP works in a client-server type model, but can also be adapted for a peer-to-peer
system.
NTP can usually maintain time to within tens of milliseconds over the public Internet,
and can achieve better than one millisecond accuracy over LANs.
Robinhood (and other brokerages) are required to ensure their trading computers are
synced with atomic clocks maintained by the National Institute of Standards and
Technology (NIST), a non-regulating agency of the US Department of Commerce.
Robinhood, despite being a brokerage, has experienced viral growth. The app is
frequently ranked #1 in the iOS and Android app stores, a spot that is usually reserved
for some type of social media app (Instagram, Snapchat, YouTube, etc.). It’s unheard of
for a financial app to get that kind of growth.
Sharding is where you split up a large dataset into smaller partitions (shards).
Robinhood created multiple shards where each shard held a subset of their users and
every shard had its own application servers, database and deployment pipeline.
In order to make the divided system appear as one system (rather than independent
shards), Robinhood built out several new layers.
● Routing Layer - the routing layer handles routing external API requests to the
correct shards. The layer first inspects and maps the request to a specific user.
Then, it makes a synchronous API call to Robinhood’s shard mapping service
to look up the shard ID for the user. After, it sends the request to the correct
application server with that shard.
You can read more about Robinhood’s horizontal scaling efforts here.
Designing a bad API means wasting developer time and (for a public API) poor
adoption.
It’s also extremely difficult to make breaking changes to an API after shipping so bad
API design can be hard to fix.
In order to avoid this at Slack, they’ve published some principles around API design.
1. Do one thing and do it well - It can be tempting to try and solve too many
problems at once. Instead, pick a specific use case and design your API
around solving that. Simple APIs are easier to understand and easier to
scale. It’s easy to add features to an API, but hard to remove them.
2. Make it fast and easy to get started - Developers should be able to complete a
basic task using your API quickly. At Slack, they want entry-level developers to
be able to learn about the platform, create an app, and send their first API call
within 15 minutes.
1. Pagination
2. Rate Limiting
6. Avoid breaking changes - A breaking change is any change that can stop an
existing client app from functioning as it was before the change. Avoid these
and have an apologetic communication plan if you need to make a breaking
change.
You can store notes, tasks, wikis, kanban boards and other things in a Notion workspace
and you can easily share it with other users.
If you’ve been a Notion user for a while, you probably noticed that the app got extremely
slow in late 2019 and 2020.
Earlier this year, Notion sharded their Postgres monolith into a fleet of horizontally
scalable databases. The resulting performance boost was pretty big.
Sharding a database means partitioning your data across multiple database instances.
This allows you to run your database on multiple computers and scale horizontally
instead of vertically.
When to Shard?
Sharding your database prematurely can be a big mistake. It can result in an increased
maintenance burden, new constraints in application code and little to no performance
improvement (so a waste of engineering time).
However, Notion was growing extremely quickly, so they knew they’d have to implement
sharding at some point.
The breaking point came when the Postgres VACUUM process began to stall
consistently.
The VACUUM process clears storage occupied by dead tuples in your database.
When you update data in Postgres, the existing data is not modified. Instead, a new
(updated) version of that data is added to the database.
This is because it’s not safe to directly modify existing data, as other transactions could
be reading it.
At a later point, you can run the VACUUM process to delete the old, outdated data and
reclaim disk space.
If you don’t regularly vacuum your database (or have Postgres run autovacuum, where it
does this for you), you’ll eventually reach a transaction ID wraparound failure.
Having the VACUUM process consistently stall is not an issue that can be ignored.
● Third-Party Sharding - You rely on a third party to handle the sharding for
you. An example is Citus, an open source extension for Postgres.
They didn’t want to go with a third-party solution because they felt its sharding logic
would be opaque and hard to debug.
Shard Key
In order to shard a database, you have to pick a shard key. This determines how your
data will be split up amongst the shards.
You want to pick a shard key that will equally distribute loads amongst all the shards.
If one shard is getting a lot more reads/writes than the others, that can make scaling
very difficult.
Notion chose the workspace ID as the shard key, since a user’s requests typically touch a
single workspace. So, if you’re a student using Notion, you might have separate Workspaces for all your
classes.
Each workspace is assigned a UUID upon creation, so that UUID space is partitioned
into uniform buckets.
Notion ended up going with 480 logical shards distributed across 32 physical databases
(with 15 logical shards per database).
This allows them to handle their existing data and scale for the next two years (based off
their projected growth).
Database Migration
After establishing how the sharded database works, you still have to migrate from the
old database to the new distributed database.
1. Double-write: Incoming writes are applied to both the old and new databases.
As you might imagine, the storage needs required for this kind of processing were
massive and rapidly growing.
To solve this, Google built Google File System (GFS), a scalable distributed file system
written in C++. Even in 2003, the largest GFS cluster provided hundreds of terabytes of
storage across thousands of machines and it was serving hundreds of clients
concurrently.
GFS is a proprietary distributed file system, so you’ll only encounter it if you work at
Google. However, Doug Cutting and Mike Cafarella implemented Hadoop Distributed
File System (HDFS) based on Google File System and HDFS is used widely across the
industry.
LinkedIn recently published a blog post on how they store 1 exabyte of data across their
HDFS clusters. An exabyte is 1 billion gigabytes.
In this post, we’ll be talking about the goals of GFS and its design. If you’d like more
detail, you can read the full GFS paper here.
Goals of GFS
The main goal for GFS was that it be big and fast. Google wanted to store extremely
large amounts of data and also wanted clients to be able to quickly access that data.
Google also wanted to build GFS out of cheap, commodity machines. Using commodity
machines is great because then you can quickly add more machines to your distributed
system (as your storage needs grow). If Google relied on specialized hardware, then
there may be limits on how quickly they can acquire new machines.
The tradeoff is that commodity hardware fails far more often, so failures are the norm
rather than the exception. Therefore, GFS needed to have systems in place for automatic
failure recovery. An engineer shouldn’t have to get involved every time there’s a failure.
The system should be able to handle common failures on its own.
The individual files that Google wanted to store in GFS are quite big. Individual files are
typically multiple gigabytes and so this affected the block sizes and I/O operation
assumptions that Google made for GFS.
GFS is designed for big, sequential reads and writes of data. Most files are mutated by
appending new data rather than overwriting existing data and random writes within a
file are rare. Because of that access pattern, appending new data was the focus of
performance optimization.
Design of GFS
A GFS cluster consists of a single master node and multiple chunkserver nodes.
The master node maintains the file system’s metadata and coordinates the system. The
chunkserver nodes are where all the data is stored and accessed by clients.
The master node keeps track of the file namespace, the mappings from files to chunks
and the locations of all the chunks. It also handles garbage collection of orphaned
chunks and chunk migration between the chunkservers. The master periodically
communicates with all the chunkservers through HeartBeat messages to collect their
state and give them instructions.
An interesting design choice is the decision to use a single master node. Having a single
master greatly simplified the design since the master could make chunk placement and
replication decisions without coordinating with other master nodes.
However, Google engineers had to make sure that the single master node doesn’t
become a bottleneck in the system.
Therefore, clients never read or write file data through the master node. Instead, the
client asks the master which chunkservers it should contact. Then, the client caches this
information for a limited time so it doesn’t have to keep contacting the master node.
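As a rough illustration of that interaction (a toy sketch, not Google's client library; the master.lookup and read_chunk calls are hypothetical stand-ins), a GFS client might cache chunk locations with a TTL and only go back to the master on a cache miss:

import time

class GFSClientSketch:
    """Toy model of the GFS read path: ask the master for chunk locations,
    cache them briefly, then talk to chunkservers directly."""

    CHUNK_SIZE = 64 * 1024 * 1024   # GFS uses large 64 MB chunks

    def __init__(self, master, cache_ttl_secs=60):
        self.master = master        # stand-in for the master's RPC interface
        self.cache_ttl = cache_ttl_secs
        self._cache = {}            # (filename, chunk_index) -> (locations, expiry)

    def _chunk_locations(self, filename, chunk_index):
        key = (filename, chunk_index)
        entry = self._cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]         # cache hit: no round-trip to the master
        locations = self.master.lookup(filename, chunk_index)   # hypothetical RPC
        self._cache[key] = (locations, time.time() + self.cache_ttl)
        return locations

    def read(self, filename, offset, length):
        chunk_index = offset // self.CHUNK_SIZE
        replicas = self._chunk_locations(filename, chunk_index)
        # Read the byte range directly from a chunkserver replica.
        return replicas[0].read_chunk(filename, chunk_index,
                                      offset % self.CHUNK_SIZE, length)   # hypothetical RPC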
A mutation is an operation that changes the contents or the metadata of a chunk (so a
write or an append operation).
In order to guarantee consistency amongst the replicas after a mutation, GFS performs
mutations in a certain order.
As stated before, each chunk will have multiple replicas. The master will designate one
of these replicas as the primary replica.
1. The client asks the master which chunkserver is the primary chunk and for
the locations of the other chunkservers that have that chunk.
2. The master replies with the identity of the primary chunkserver and the other
replicas. The client caches this information.
3. The client pushes data directly to all the chunkserver replicas. Each
chunkserver will store the data in an internal LRU buffer cache.
4. Once all the replicas acknowledge receiving the data, the client sends a write
request to the primary chunkserver.
5. The primary assigns the mutation a serial number, applies it to its own local
state, and forwards the write request to all the secondary replicas. The
secondaries apply the mutation in the same serial-number order.
6. The secondary replicas reply to the primary indicating that they have
completed the operation.
7. The primary chunkserver replies to the client informing the client that the
write was successful (or if there were errors).
GFS Interface
GFS organizes files hierarchically in directories and identifies them by pathnames, like a
standard file system. The master node keeps track of the mappings between files and
chunks.
GFS provides the usual operations to create, delete, open, close, read and write files.
Snapshot lets you create a copy of a file or directory tree at low cost.
Record append allows multiple clients to append data to a file concurrently and it
guarantees the atomicity of each individual client’s append.
To learn more about Google File System, read the full paper here.
If you’d like to read about the differences between GFS and HDFS, you can check that
out here.
Here’s a summary.
Rate Limiting is a technique used to limit the amount of requests a client can send to
your server.
It’s incredibly important to prevent DoS attacks from clients that are (accidentally or
maliciously) flooding your server with requests.
A rule of thumb for when to use a rate limiter: if your users can reduce the frequency of
their API requests without affecting the outcome of those requests, then a rate limiter is
appropriate.
For example, if you’re running Facebook’s API and you have a user sending 60 requests
a minute to query for their list of Facebook friends, you can rate limit them without
affecting their outcome. It’s unlikely that they’re adding new Facebook friends every
single second.
Rate Limiting is great for day-to-day operations, but you’ll occasionally have incidents
where some component of your system is down and you can’t process requests at your
normal level.
In these scenarios, Load Shedding is a technique where you drop low-priority requests
to make sure that critical requests get through.
Stripe is a payment processing company (you can use their API to collect payments from
your users) so a critical request for them is a request to create a charge.
An example of a non-critical method would be a request to read charge data from the
past.
Stripe uses 4 different types of limiters in production (2 rate limiters and 2 load
shedders).
The first rate limiter restricts each user to n requests per second. However, Stripe also
built in the ability for a user to briefly burst above the cap to handle legitimate spikes in
usage.
The second rate limiter caps the number of requests a user can have in progress at the
same time, which helps Stripe manage the load of their CPU-intensive API endpoints.
Stripe divides their traffic into two types: critical API methods and non-critical methods.
Stripe always reserves a fraction of their infrastructure for critical requests. If the
reservation number is 10%, then any non-critical request over the 90% allocation would
be rejected with a 503 status code.
The second load shedder ranks traffic into categories and sheds the lowest-priority
categories first. From highest to lowest priority, the categories are:
● Critical Methods
● POSTs
● GETs
● Test mode traffic (traffic from developers testing the API and making sure
payments are properly processed)
There are quite a few algorithms you can use to build a rate limiter. Common ones include:
Token Bucket - Every user gets a bucket with a certain amount of “tokens”. On each
request, tokens are removed from the bucket. If the bucket is empty, then the request is
rejected.
New tokens are added to the bucket at a fixed rate (every n seconds). The bucket can
only hold a certain number of tokens, so if the bucket is full then no new tokens will be
added.
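Here's a minimal single-process sketch of the token bucket algorithm. It refills continuously at a fixed rate per second rather than on an every-n-seconds timer, which is a common variation; the capacity and rates are made-up numbers.

import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity         # max tokens the bucket can hold
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow_request(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Top up based on elapsed time, capped at the bucket's capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost          # spend a token for this request
            return True
        return False                     # bucket empty: reject (e.g. HTTP 429)

# One bucket per user: ~10 requests/second steady state, bursts of up to 20.
buckets = {}

def is_allowed(user_id: str) -> bool:
    bucket = buckets.setdefault(user_id, TokenBucket(capacity=20, refill_rate=10))
    return bucket.allow_request()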
Fixed Window - The rate limiter uses a window size of n seconds for a user. Each
incoming request from the user will increment the counter for the window. If the
counter exceeds a certain threshold, then requests will be discarded.
Sliding Log - The rate limiter tracks every user’s requests in a time-stamped log. When a
new request comes in, the system sums the log entries to determine the user’s request
rate. If the request rate exceeds a certain threshold, then the request is denied.
After a certain period of time, previous requests are discarded from the log.
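And a sketch of the sliding log approach, which trades extra memory (one timestamp per request) for a limit that doesn't reset abruptly at window boundaries; the window and threshold are illustrative.

import time
from collections import defaultdict, deque

WINDOW_SECS = 60
MAX_REQUESTS = 100                  # at most 100 requests per rolling minute

request_logs = defaultdict(deque)   # user_id -> timestamps of recent requests

def is_allowed(user_id: str) -> bool:
    now = time.monotonic()
    log = request_logs[user_id]
    # Drop timestamps that have fallen out of the rolling window.
    while log and log[0] <= now - WINDOW_SECS:
        log.popleft()
    if len(log) >= MAX_REQUESTS:
        return False                # request rate over the threshold: deny
    log.append(now)
    return True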
She gave a talk at Qcon 2020 about GitHub’s transition from a Monolith architecture to
Microservices-oriented architecture.
Here’s a summary
History
GitHub was founded in 2008 by Chris Wanstrath, P.J. Hyett, Tom Preston-Werner and
Scott Chacon.
The founders of the company were open source contributors and influencers in the Ruby
community. Because of this, GitHub’s architecture is deeply rooted in Ruby on Rails.
With the Ruby on Rails monolith, GitHub scaled to 50 million developers on the
platform, over 100 million repositories and over 1 billion API calls per day.
Over the past 18 months, GitHub has grown rapidly as a company. They’ve doubled the
number of engineers at the company, and now have over 2000 employees.
The company is also highly distributed with over 70% of employees working outside of
the headquarters, working in all timezones.
Because of this diversity of engineers, GitHub is having trouble scaling the monolith.
Having everyone learn Ruby before they can be productive and having everyone doing
development in the same monolithic code base is no longer the most efficient way to
scale GitHub.
● Code Simplicity - You don’t have to add extra logic to deal with timeouts or
worry about failing gracefully due to network latency and outages.
● System ownership - There are functional boundaries for teams through clearly
defined API contracts. This gives teams much more ownership over their
features and also gives them freedom to choose the tech stack that makes the
most sense for them. They just have to make sure the API contract is followed.
However, the change isn’t expected to be immediate or rapid. For the foreseeable future,
GitHub plans to have a hybrid monolith-microservices environment.
The first step towards breaking up a monolith is to think about the separation of code
and data based on feature functionalities.
This can be done within the monolith before physically separating them into a
microservices environment. It’s generally a good architectural practice to make the
codebase more manageable.
Start with the data and pay close attention to how it’s being accessed.
Each service should own and control access to its own data. Data access should only
happen through clearly defined API contracts.
If you don’t enforce this, you can fall into a common microservice anti-pattern: the
distributed monolith.
This is where you have the inflexibility of a monolith and the complexity of
microservices.
Separating Data
Before making the transition to microservices, GitHub made sure they got data
separation right. Getting it wrong can lead to the distributed monolith anti-pattern.
They first looked at their monolith and identified the functional boundaries within the
database schemas.
Then, they grouped the actual database tables along these functional boundaries.
Then, GitHub implemented a query watcher in the monolith to detect and alert them
anytime a query crosses multiple schema domains.
If a query touched more than one schema domain, then they would break the query up
and rewrite it into multiple queries that respect the functional boundaries. They would
then perform the necessary joins at the application layer.
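For example, here's a simplified sketch (not GitHub's code, and db.query is a hypothetical helper) of splitting a query that joined the repositories and users schema domains into two single-domain queries joined in the application layer:

# Before: one SQL query spanning two schema domains.
#   SELECT r.name, u.login FROM repositories r JOIN users u ON u.id = r.owner_id
# After: one query per domain, joined in application code.

def repos_with_owners(db, repo_ids):
    repos = db.query(   # hypothetical query helper
        "SELECT id, name, owner_id FROM repositories WHERE id IN %s", (tuple(repo_ids),))
    owner_ids = tuple({r["owner_id"] for r in repos})
    users = db.query(
        "SELECT id, login FROM users WHERE id IN %s", (owner_ids,))
    users_by_id = {u["id"]: u for u in users}
    # Join in memory: each schema domain keeps its own queries, so the tables
    # can later live on different database clusters.
    return [{"name": r["name"], "owner": users_by_id[r["owner_id"]]["login"]}
            for r in repos]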
Separating Services
When separating services out of the monolith to a microservice, you should start with
the core services and then work your way out to the feature level.
Dependency direction should always go from inside of the monolith to outside of the
monolith, NOT the other way around. If you have dependency directions from
microservices to inside the monolith then that can lead to the distributed monolith
anti-pattern.
At GitHub, the core service that they extracted first was Authentication and
Authorization. The Rails monolith communicated with the microservice using Twirp, a
gRPC-like service-to-service communications framework, with an inside-to-outside
dependency direction (inside of the monolith to outside of the monolith).
When separating services out of the monolith, you should be on the lookout for things
that keep developers working in the monolith.
A common example is shared tooling that is built over time and makes development
inside the monolith more convenient. Make those shared resources available to
developers outside of the monolith.
An example at GitHub was feature flags that provide monolith developers an easy way to
control who sees a new feature.
For more details, you can view the full talk here.
This summary is on GitHub’s blog post on how they built tooling to make partitioning
their database easier.
Until recently, GitHub was built around one MySQL database cluster that housed a large
portion of the data used for repositories, issues, pull requests, user profiles, etc.
This created challenges around scaling and issues around reliability since all of GitHub’s
core features would stop working if the database cluster went down.
In 2019, GitHub set up a plan to improve their ability to partition their relational
databases by building better tooling.
The tooling GitHub built made partitioning much easier.
Since implementing these changes, GitHub has seen a significant decrease in load on the
main database cluster.
In 2019, mysql1 answered 950,000 queries per second on average, 900,000 queries per
second on replicas, and 50,000 queries per second on the primary.
Virtual Partitions
Before database tables are physically partitioned, they need to be virtually partitioned in
the application layer. You can’t have SQL queries that span partitioned (or soon to be
partitioned) database tables.
Schema Domains
In order to implement Virtual Partitioning, GitHub first created schema domains, where
a schema domain describes a tightly coupled set of database tables that are frequently
used together in queries.
An example of a schema domain is the gists schema domain, which consists of the tables
gists, gist_comments, and starred_gists. These tables would remain together after a
partition.
GitHub stores the list of all the schema domains in a YAML configuration file that maps
each schema domain (such as gists, repositories and users) to its respective tables.
SQL Linters
GitHub has two SQL linters (a Query linter and a Transaction linter) that enforce virtual
boundaries between the schema domains.
They identify any violating queries and transactions that span schema domains and
throw an exception with a helpful message for the developer.
Transactions aren’t allowed to span multiple schema domains because after the
partition, those transactions will no longer be able to guarantee consistency.
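A toy version of that kind of linter (far simpler than GitHub's, with an illustrative set of tables) might extract table names from a query and check that they all belong to a single schema domain:

import re

# Schema domains in the spirit of GitHub's YAML config (illustrative tables only).
SCHEMA_DOMAINS = {
    "gists": {"gists", "gist_comments", "starred_gists"},
    "repositories": {"repositories", "repository_topics"},
    "users": {"users", "user_emails"},
}
DOMAIN_BY_TABLE = {t: d for d, tables in SCHEMA_DOMAINS.items() for t in tables}

class CrossDomainQueryError(Exception):
    pass

def lint_query(sql: str) -> None:
    # Naive table extraction; a real linter would parse the SQL properly.
    tables = re.findall(r"\b(?:FROM|JOIN)\s+([a-z_]+)", sql, flags=re.IGNORECASE)
    domains = {DOMAIN_BY_TABLE[t] for t in tables if t in DOMAIN_BY_TABLE}
    if len(domains) > 1:
        raise CrossDomainQueryError(
            f"query touches multiple schema domains {sorted(domains)}; "
            "split it up and join at the application layer")

lint_query("SELECT * FROM gists JOIN gist_comments ON ...")   # passes
# lint_query("SELECT * FROM gists JOIN users ON ...")         # raises CrossDomainQueryError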
Now that GitHub has virtually isolated schema domains, they can physically move their
schema domains to separate database clusters.
In order to do this on the fly, GitHub uses Vitess. Vitess is an open source database
clustering system for MySQL that was originally developed at YouTube.
Vitess was serving all YouTube database traffic from 2011 to 2019, so it’s battle-tested.
GitHub uses Vitess’ vertical sharding feature to move sets of tables together in
production without downtime.
To do that, GitHub uses Vitess’ VTGate proxies as the endpoint for applications to
connect to instead of direct connections to MySQL.
For more details, you can read the full blog post here.
In order to run analytics workloads on all the data generated by these users, LinkedIn
relies on Hadoop.
More specifically, they store all this data on the Hadoop Distributed File System
(HDFS).
HDFS was based on Google File System (GFS), and you can read our article on GFS
here.
Over the last 5 years, LinkedIn’s analytics infrastructure has grown exponentially,
doubling every year in data size and compute workloads.
The largest Hadoop cluster stores 500 petabytes of data and needs over 10,000 nodes in
the cluster. This makes it one of the largest (if not the largest) Hadoop cluster in the
industry.
Despite the massive scale, the average latency for RPCs (remote procedure calls) to the
cluster is under 10 milliseconds.
It took just 5 years for that cluster to grow to over 200 petabytes of data.
In the article, LinkedIn engineers talk about some of the steps they took to ensure that
HDFS could scale to 500 petabytes.
Replicating NameNodes
With HDFS, the file system metadata is decoupled from the data.
An HDFS cluster consists of two types of servers: a NameNode server and a bunch of
DataNode servers.
The DataNodes are responsible for storing the actual data in HDFS. When clients are
reading/writing data to the distributed file system, they are communicating directly
with the DataNodes.
A client will first ask the NameNode for the location of a certain file. The NameNode will
respond with the DataNodes that hold that file’s blocks, and the client can then
read/write data directly to those DataNode servers.
If your NameNode server goes down, then that’s no bueno. Your entire Hadoop cluster
will be down as the NameNode is a single point of failure.
Also, when you’re operating at the scale of hundreds of petabytes in your cluster,
restarting NameNode can take more than an hour. During this time, all jobs on the
cluster must be suspended.
When you have clusters that large, it becomes extremely expensive to have downtime
since a massive amount of processes at the organization rely on that cluster (that cluster
accounts for half of all the data at LinkedIn).
Additionally, upgrading the cluster also becomes an issue since you have to restart the
NameNode. This results in hours of additional downtime.
Fortunately, Hadoop 2 introduced a High Availability feature to solve this issue. With
this feature, you can have multiple replicated NameNode servers.
The way it works is that you have a single Active NameNode that receives all the client’s
requests.
The Active NameNode will publish its transactions into a Journal Service (LinkedIn uses
Quorum Journal Manager for this) and the Standby NameNode servers will consume
those transactions and update their namespace state accordingly.
This keeps them up-to-date so they can take over in case the Active NameNode server
fails.
LinkedIn uses IP failover to make failovers seamless. Clients communicate to the Active
NameNode server using the same Virtual IP address irrespective of which physical
NameNode server is assigned as the Active NameNode. A transition between NameNode
servers will happen transparently to the clients.
This setup also enables rolling upgrades. First, one of the Standby NameNodes is upgraded with the new software and restarted.
Then, the Active NameNode fails over to the upgraded standby and is subsequently
upgraded and restarted.
After, the DataNodes can also be restarted with the new software. The DataNode
restarts are done in batches, so that at least one replica of a piece of data remains online
at all times.
Tuning the NameNode’s Java Heap
Previously, we talked about how the NameNode server keeps track of all the file
system metadata.
The NameNode server keeps all of this file system metadata in RAM for low latency
access.
As the namespace grows, the Java heap size on the NameNode server has to be increased
periodically (Hadoop is written in Java).
LinkedIn’s largest NameNode server is set to use a 380 gigabyte heap to maintain the
namespace for 1.1 billion objects.
Maintaining such a large heap requires elaborate tuning in order to provide high
performance.
The Java heap is generally divided into two spaces: Young generation and Tenured (Old)
generation.
An object will first start in the young generation, and as it survives garbage collection
events, it will get promoted to eventually end up in the old generation.
As the workload on the NameNode increases, it generates more temporary objects in the
young generation space.
LinkedIn engineers try to keep the storage ratio between the young and old generations
at around 1:4.
By keeping the Young and Old spaces appropriately sized, LinkedIn can completely
avoid full garbage collection events (where both the Young and Old generations are
collected), which would otherwise result in a many-minutes-long outage of the NameNode.
Non-Fair Locking
The NameNode protects its namespace with a global read-write lock. The write lock is
exclusive (only one thread can hold it and write) while the read lock is shared, allowing
multiple reader threads to run while holding it.
With fair locking, the NameNode server frequently ends up in situations where writer
threads block reader threads (where the readers could be running in parallel).
Non-fair mode, on the other hand, allows reader threads to go ahead of the writers.
Other Optimizations
Satellite Clusters
HDFS is optimized for maintaining large files and providing high throughput for
sequential reads and writes.
As stated before, HDFS splits up files into blocks and then stores the blocks on the
various DataNode servers.
Each block is set to a default size of 128 megabytes (LinkedIn has configured their
cluster to 512 megabytes).
If lots of small files (the file size is less than the block size) are stored on the HDFS
cluster, this can create issues by disproportionately inflating the metadata size
compared to the aggregate size of the data.
In order to ease these limits, LinkedIn created Satellite HDFS Clusters that handled
storing these smaller files.
You can read the details on how they split off the data from the main cluster to the
satellite clusters in the article.
The main limiting factor for HDFS scalability eventually becomes the performance of
the NameNode server.
However, LinkedIn is using the High Availability feature, so they have multiple
NameNode servers (one in Active mode and the others in Standby state).
This creates an opportunity for reading metadata from Standby NameNodes instead of
the Active NameNode.
Then, the Active NameNode can just be responsible for serving write requests for
namespace updates.
In order to implement this, LinkedIn details the consistency model they used to ensure
highly-consistent reads from the Standby NameNodes.
Here’s a summary
Static Code Analysis tools inspect your source code (without executing it) and identify
potential errors and security vulnerabilities.
Giving engineers static analysis tools can immensely help developer productivity and
make the codebase much more secure.
Hack was developed at Facebook and is a typed dialect of PHP (Hack allows both
dynamic and static typing, so its type system is classified as gradually typed).
There were no static analysis tools available for Hack, so Nicholas and David set out to
build one.
Building a static analysis tool from scratch would be too complex, so they decided to
extend an existing open source static analysis tool, Semgrep.
Semgrep was already in use at Slack to scan code in 6 different languages, and there’s
already infrastructure in place to integrate Semgrep into the CI/CD pipeline.
In order to add Hack functionality to Semgrep, they needed to answer two questions:
how do you parse Hack source code into a syntax tree, and how do you map that tree
onto the representation Semgrep understands?
A grammar is a set of rules that describes the syntax of a language; a very simple
grammar, for example, can recognize arithmetic expressions. A programming language’s
grammar lets you transform a program from a series of ASCII characters (words, spaces,
etc.) into a concrete syntax tree (also known as a parse tree).
The concrete syntax tree (CST) is an exact visual representation of the parsed source
code based on the grammar.
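To make the idea concrete, here's a tiny hand-written parser (not Tree-sitter, and nothing to do with the Hack grammar) for arithmetic expressions using the rules expr -> term (('+'|'-') term)*, term -> factor (('*'|'/') factor)*, factor -> NUMBER | '(' expr ')'. It turns a string of characters into a nested tree; note that it drops parentheses and whitespace, so its output is closer to an abstract syntax tree than a full CST, which would keep every token.

import re

TOKEN = re.compile(r"\s*(\d+|[+\-*/()])")

def tokenize(src):
    pos, tokens = 0, []
    while pos < len(src):
        match = TOKEN.match(src, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def eat(self):
        tok = self.tokens[self.i]
        self.i += 1
        return tok

    def expr(self):     # expr -> term (('+'|'-') term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            node = (self.eat(), node, self.term())
        return node

    def term(self):     # term -> factor (('*'|'/') factor)*
        node = self.factor()
        while self.peek() in ("*", "/"):
            node = (self.eat(), node, self.factor())
        return node

    def factor(self):   # factor -> NUMBER | '(' expr ')'
        if self.peek() == "(":
            self.eat()
            node = self.expr()
            self.eat()  # consume ')'
            return node
        return ("num", self.eat())

print(Parser(tokenize("1 + 2 * (3 - 4)")).expr())
# ('+', ('num', '1'), ('*', ('num', '2'), ('-', ('num', '3'), ('num', '4'))))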
Tree-sitter, an open source parser generator, can take your grammar rules and then
generate a language parser from them.
Nicholas and David used Tree-sitter to create a Hack parser using their grammar.
Then, they tested that parser by using it to convert some of Slack’s source code into a
CST.
From this conversion, they can measure the parse rate, which is the proportion of the
source code that could be properly parsed to construct a CST.
The higher the parse rate, the more code your parser was able to understand to
construct the CST.
Nicholas and David were eventually able to develop a grammar that achieved a parse rate
of greater than 99.999%.
Of the 5 million lines of Hack code they tested, there were less than 15 lines of
unparsable code.
The Hack grammar they developed is open source. You can view it here.
Semgrep (the static analysis tool) uses an abstract syntax tree to understand the source
code and find bugs/vulnerabilities.
While the CST is an exact representation of your code, the abstract syntax tree (AST)
focuses on the essential information.
You can read about the differences between an AST and a CST here.
Semgrep converts your source code (in Go, Java, JavaScript, Python, Ruby, etc.) into a
language-agnostic AST.
Then, it looks through a list of rules that check the Semgrep AST for bugs and
vulnerabilities.
This makes Semgrep highly extensible, as it’s loosely coupled with the programming
language.
In order to map the tree-sitter CST to the Semgrep AST, Nicholas and David wrote a
custom parser file in OCaml (Semgrep-core is written in the OCaml programming
language).
Then, they plugged this file into Semgrep and used the static analysis capabilities with
Hack code.
In order to deliver these videos with high quality and little buffering, Facebook uses a
variety of video codecs to compress and decompress videos. They also use Adaptive
Bitrate Streaming (ABR).
We’ll first give a bit of background information on what ABR and video codecs are.
Then, we’ll talk about the process at Facebook.
Progressive Streaming is where a single video file is being streamed over the internet to
the client.
The video will automatically expand or contract to fit the screen you are playing it on,
but regardless of the device, the video file size will always be the same.
● Quality Issue - Your users will have different screen sizes, so the video will be
stretched/pixelated if their screen resolution is different from the video’s
resolution.
Adaptive Bitrate Streaming is where the video provider creates different versions of the
video for each of the screen sizes they want to target.
They can encode the video into multiple resolutions (480p, 720p, 1080p) so that users
with slow internet connections can stream a smaller video file than users with fast
internet connections.
The player client can detect the user’s bandwidth and CPU capacity in real time and
switch between streaming the different encodings depending on available resources.
Video Codec
Transmitting uncompressed video data over a network is impractical due to the size
(tens to hundreds of gigabytes).
Video codecs solve this problem by compressing video data and encoding it in a format
that can later be decoded and played back.
The various codecs have different trade-offs between compression efficiency, visual
quality, and how much computing power is needed.
So, you upload a video of your dog to Facebook. What happens next?
Once the video is uploaded, the first step is to encode the video into multiple resolutions
(360p, 480p, 720p, 1080p, etc.)
Next, Facebook’s video encoding system will try to further improve the viewing
experience by using advanced codecs such as H264 and VP9.
The encoding job requests are each assigned a priority value, and then put into a priority
queue.
Now, the Facebook web app (or mobile app) and Facebook backend can coordinate to
stream the highest-quality video file with the least buffering to people who watch your
video.
A key question Facebook has to deal with here is how to assign priority values to these
jobs.
Let’s say Cristiano Ronaldo uploaded a video of his dog at the same time that you
uploaded your video.
There’s probably going to be a lot more viewers for Ronaldo’s video compared to yours
so Facebook will want to prioritize encoding for Ronaldo’s video (and give those users a
better experience).
The encoding job’s priority is then calculated by taking Benefit and dividing it by Cost.
Benefit
The benefit metric attempts to quantify how much benefit Facebook users will get from
advanced encodings.
The effective predicted watch time is an estimate of the total time a video will be
watched in the near future across all of its audience.
Facebook uses a sophisticated ML model to predict the watch time. They talk about how
they created the model (and the parameters involved) in the article.
The relative compression efficiency is a measure of how much a user benefits from the
codec’s efficiency.
It’s based on a metric called Minutes of Video at High Quality per GB (MVHQ), which
measures how many minutes of high-quality video you can stream per gigabyte of data.
Cost
This is a measure of the amount of logical computing cycles needed to make the
encoding family (consisting of all the different resolutions) deliverable.
Some jobs may require more resolutions than others before they’re considered
deliverable.
As stated before, Facebook divides Benefit / Cost to get the priority for a video encoding
job.
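As a rough sketch of that scheduling idea (the real benefit and cost models are far more involved, and treating benefit as watch time multiplied by compression efficiency is an assumption here), the priority can be used to order a queue of encoding jobs:

import heapq

def encoding_priority(predicted_watch_minutes: float,
                      relative_compression_efficiency: float,
                      compute_cost: float) -> float:
    # Benefit is (roughly) predicted watch time weighted by how much the codec
    # improves delivery; cost is the compute needed to produce the encodings.
    benefit = predicted_watch_minutes * relative_compression_efficiency
    return benefit / compute_cost

# heapq is a min-heap, so push negated priorities to pop the highest-priority job first.
queue = []
heapq.heappush(queue, (-encoding_priority(5_000_000, 1.3, 800), "ronaldo_dog.mp4"))
heapq.heappush(queue, (-encoding_priority(40, 1.3, 800), "my_dog.mp4"))

_, next_job = heapq.heappop(queue)
print(next_job)   # ronaldo_dog.mp4 gets the advanced encodings first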
After encoding, Facebook’s backend will store all the various video files and
communicate with the frontend to stream the optimal video file for each user.
LedgerStore is Uber’s ledger-style datastore for financial transactions. Its storage
backend was AWS DynamoDB, but Uber decided to migrate away because it was becoming
expensive.
● Latency improvements
Uber was able to do this without a single production incident and not a single data
inconsistency in the 250 billion unique records that were migrated from DynamoDB to
Docstore.
Piyush Patel, Jaydeepkumar Chovatia and Kaushik Devarajaiah wrote a great article on
the migration, and we’ll be giving a summary below.
Here’s a summary
Uber moves millions of people around the world and delivers tens of millions of food
orders daily. This generates a massive amount of financial transactions that need to be
stored.
Tables can have one or many indexes and an index belongs to exactly one table.
Indexes in LedgerStore are strongly consistent, so when a write to the main table
succeeds, all indexes are updated at the same time using a 2-phase commit.
The data is mostly read within a few weeks or months after being written. Because it’s
expensive to store data in hot databases like DynamoDB, Uber offloads the data to
colder storage after a time period.
In order to provide these guarantees, LedgerStore created the concept of Sealing.
Sealing
LedgerStore seals data in time-range batches called sealing windows. After a sealing
window is closed, signed and sealed, no further updates to it will be permitted.
If you need to correct data in an already-sealed time range, LedgerStore uses Revisions,
which we’ll discuss below.
Because there are no updates, any query that only reads data from a sealed time range is
guaranteed to be reproducible.
Revisions
If you need to correct data in already-sealed time ranges, LedgerStore uses the notion of
Revisions.
A revision is a table-level entity consisting of all sealed record corrections and the
associated business justifications. All records, both corrected and the original, are
maintained to allow reproducible queries.
Choosing Docstore
LedgerStore was designed to abstract away the underlying storage technology so that
switching technologies could be done if the business need arose.
As the database scaled, using AWS DynamoDB as a storage backend became extremely
expensive.
Additionally, having different backend databases in the tech stack created fragmentation
and made it difficult to operate.
● Secondary Indexes
Docstore, Uber’s homegrown database, was a perfect match for those requirements.
The only issue was that Docstore didn’t have Change Data Capture (streaming)
functionality.
Uber wanted streaming functionality because reading data from a stream of updates is
more efficient than reading from the table directly.
You don’t have to perform table scans or range reads spanning a large number of rows.
Also, the stream data can be stored in cheaper, commodity hardware.
The stream data can be stored in a system like Apache Kafka, which is optimized for
stream reading.
Uber solved this issue by building a streaming framework for Docstore called Flux. Read
the article for more details on Flux.
When migrating from DynamoDB to Docstore, Uber had several objectives they wanted to achieve.
One part of the migration was moving all the historical data from DynamoDB to
Docstore in real time.
The data consisted of more than 250 billion unique records and was 300 terabytes of
data in total.
Engineers did this by breaking the historical data down into subsets and then processing
them individually via checkpointing.
Because LedgerStore already seals data by time range, the data is naturally broken down
into subsets, where each subset is an individual sealing window.
The goal was to keep the 2 databases consistent at any phase of the migration so that
rollback and forward were possible.
4. Final Cutover - Once reads were fully served out of Docstore, it was time to
stop the shadow writes to DynamoDB. Afterwards, engineers backed up the
DynamoDB database and finally decommissioned it.
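As a generic illustration of the shadow-write pattern used in a migration like this (a sketch under assumptions, not Uber's implementation; old_db and new_db are stand-in store wrappers with put/get methods), writes go to both stores while reads are compared to surface inconsistencies before cutover:

import logging

class DualWriteStore:
    """During migration: write to both databases, serve reads from the old
    one, and log any mismatch coming from the new one."""

    def __init__(self, old_db, new_db):
        self.old_db = old_db    # e.g. the DynamoDB-backed store
        self.new_db = new_db    # e.g. the Docstore-backed store

    def write(self, key, record):
        self.old_db.put(key, record)       # old store stays the source of truth
        try:
            self.new_db.put(key, record)   # shadow write; must never break clients
        except Exception:
            logging.exception("shadow write to new store failed for %s", key)

    def read(self, key):
        old_value = self.old_db.get(key)
        new_value = self.new_db.get(key)
        if new_value != old_value:
            # Surfacing mismatches is what makes rolling back or forward safe.
            logging.warning("inconsistency for %s: old=%r new=%r", key, old_value, new_value)
        return old_value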
Here’s a summary.
When you’re dealing with massive amounts of data (hundreds of terabytes or petabytes),
then the traditional ways of dealing with data start to break down.
You’ll need to use a distributed system, and you’ll have to orchestrate the different
components to suit your workload.
Typically, big data architectures involve one or more of the following components.
● Data Sources - It’s pretty dumb to build a distributed system for managing
data if you don’t have any data. So, all architectures will have at least one
source that’s creating data. This can be a web application, an IoT device, etc.
● Batch Processing - Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to filter, aggregate, and
prepare the data for analysis. Map Reduce is a very popular way of running
these batch jobs on distributed data.
● Analytical Data Store - Many big data solutions prepare data for analysis and
then serve the processed data in a structured format that can be queried using
analytical tools. This can be done with a relational data warehouse
(commonly used for BI solutions) or through NoSQL technology like HBase
or Hive.
When you’re working with a large data set, analytical queries will often require batch
processing. You’ll have to use something like MapReduce.
This means that getting an answer to your query can take hours, as you have to wait for
the batch job to finish.
The issue is that this means you won’t get real time results to your queries. You’ll always
get an answer that is a few hours old.
The ideal scenario is where you can get some results in real time (perhaps with some
loss of accuracy) and combine these results with the results from the batch job.
Lambda Architecture
The Lambda Architecture solves this by creating two paths for data flow: the cold path
and the hot path.
The cold path, also known as the batch layer, stores all of the incoming data in its raw
form and processes it with long-running batch jobs. The raw data in the batch layer is
immutable. Incoming data is always appended to the existing data, and the previous data
is never overwritten.
The cold path has a high latency when answering analytical queries. This is because the
batch layer aims at perfect accuracy by processing all available data when generating
views.
The hot path is also known as the speed layer, and it analyzes the incoming data in real
time. The speed layer’s views may not be as accurate or complete as the batch layer, but
they’re available almost immediately after the data is received.
The speed layer is responsible for filling the gap caused by the batch layer’s lag and
provides views for the most recent data.
The hot and cold paths converge at the serving layer. The serving layer indexes the batch
view for efficient querying and incorporates incremental updates from the speed layer
based on the most recent data.
With this solution, you can run analytical queries on your datasets and get up-to-date
answers.
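A minimal sketch of the serving layer's job, assuming the batch layer periodically publishes precomputed counts and the speed layer keeps incremental counts for events that arrived after the last batch run:

from collections import Counter

# Batch view: recomputed from ALL of the raw, immutable data every few hours.
batch_view = {"page_views:/home": 1_204_332, "page_views:/pricing": 88_109}

# Speed layer: incremental, possibly approximate counts since the last batch run.
speed_view = Counter()

def record_event(page: str) -> None:
    speed_view[f"page_views:{page}"] += 1   # hot path: updated in real time

def query(page: str) -> int:
    key = f"page_views:{page}"
    # Serving layer: accurate-but-stale batch result plus fresh-but-partial speed result.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

record_event("/home")
print(query("/home"))   # 1_204_333, up to date despite the batch lag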
A downside of the Lambda Architecture is that you have to build and maintain two
separate processing paths. The Kappa Architecture is meant to be a solution to this: all
your data flows through a single path, using a stream processing engine.
The company’s main product is the Stripe Payments API, which developers can use to
easily embed payment functionality into their applications.
Due to Stripe’s scale, they’re a big target for payments fraud and cybercrime.
Andrew Tausz is part of the Risk Intelligence team at Stripe, and he wrote a great blog
post on how Stripe uses similarity clustering to catch fraud rings.
One of the most common types of fraud that Stripe faces is merchant fraud, where a
scammer will create a website that advertises fraudulent products or services (and uses
Stripe to process payments).
Customers who pay for these products (and never receive them) will end up issuing a
chargeback through their credit card, which eventually has to be paid back by Stripe.
Stripe will then attempt to debit the account of the scammer, but if they’re unable to (the
scammer transferred out all his money) then Stripe will have to eat the losses.
After a fraudster gets caught by Stripe, his account will be disabled. But, it’s quite likely
that he’ll try to continue the scam by creating a new Stripe account.
One way Stripe can reduce fraud is by catching these repeat fraudsters through
similarity clustering.
When a scammer creates a new Stripe account (after getting caught on his previous
account), he’ll probably reuse some information and attributes from his previous
account.
Certain information is easy to fabricate, like your name or date of birth. But, other
attributes are more difficult. For example, it takes significant effort to obtain a new bank
account.
Therefore, Stripe has found that linking accounts together via shared attributes is quite
effective at catching obvious fraud attempts.
They take two accounts and then assign them a similarity score based on the number of
shared attributes the accounts have.
Some shared attributes are weighted more heavily than others. Two Stripe accounts that
share a date of birth should have a lower similarity score than two accounts that share a
bank account.
Previously, Stripe relied on a heuristic-based system where the weightings were
hand-constructed (based on guess and check).
Stripe decided to switch to a machine learning model trained to handle this task.
Now, they can automatically retrain the model over time as they obtain more data and
improve in accuracy, adapt to new fraud trends, and learn the signatures of particular
adversarial groups.
The approach Stripe took to build the model is Similarity Learning, where the objective
is to learn a similarity function that can measure how similar two objects are.
They already had a massive dataset of fraud rings and clusters of fraudulent accounts
based on prior work from their risk underwriting team.
Stripe cleaned that into a dataset consisting of pairs of accounts along with a label for
each pair indicating whether or not the two accounts belong to the same cluster.
Now that they had the dataset, Stripe had to generate features that the model could use
to compare the pair of accounts.
Creating a Stripe account requires quite a bit of data, so Stripe had a large feature set
they could utilize.
Examples of features chosen include the account’s email domain, overlap in credit card
numbers used for both accounts, measure of text similarity, and more.
Due to the huge range of features, Stripe decided to go with gradient-boosted decision
trees (GBDTs) to represent their similarity model.
Stripe found that GBDTs strike the right balance between being easy to train, having
strong predictive power, and being robust despite variations in the data.
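A stripped-down sketch of that setup, using scikit-learn's gradient boosting as a stand-in for whatever Stripe actually trains; each row describes a pair of accounts with made-up comparison features, and the label says whether the pair belongs to the same fraud cluster:

from sklearn.ensemble import GradientBoostingClassifier

# Each row describes a PAIR of accounts: [same_email_domain, shared_card_count,
# name_text_similarity, same_bank_account] -- illustrative features only.
X = [
    [1, 3, 0.92, 1],
    [0, 0, 0.10, 0],
    [1, 0, 0.45, 0],
    [0, 2, 0.80, 1],
]
y = [1, 0, 0, 1]   # 1 = the two accounts were labeled as the same fraud cluster

model = GradientBoostingClassifier()
model.fit(X, y)

# The predicted probability acts as the learned similarity score for a new pair.
candidate_pair = [[1, 1, 0.88, 0]]
similarity = model.predict_proba(candidate_pair)[0][1]
print(f"similarity score: {similarity:.2f}")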
Stripe has an internal API called Railyard that handles training ML models in a scalable
and maintainable way.
You can read more about Railyard and its architecture here.
Prediction Use
Since the model operates on pairs of Stripe accounts, it’s not possible to feed it every
pair of accounts and compute similarity scores across all of them (there are far too many
combinations).
Instead, Stripe uses some heuristics to identify suspicious accounts and prune the set of
candidates to a reasonable number.
Then, they use their ML models to generate similarity scores between the accounts.
After, they compute the connected components on the resulting graph to get a final
output of high-fidelity account clusters that can be analyzed, processed or manually
inspected.
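The clustering step itself can be as simple as a union-find over the pairs whose similarity score clears a threshold (a sketch, not Stripe's pipeline; the accounts and scores are made up):

def connected_components(scored_pairs, threshold=0.8):
    """scored_pairs: iterable of (account_a, account_b, similarity_score)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:             # walk up to the root, compressing the path
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b, score in scored_pairs:
        if score >= threshold:            # only strong edges join accounts into a cluster
            union(a, b)

    clusters = {}
    for account in parent:
        clusters.setdefault(find(account), set()).add(account)
    return list(clusters.values())

pairs = [("acct_1", "acct_2", 0.95), ("acct_2", "acct_3", 0.91), ("acct_4", "acct_5", 0.40)]
print(connected_components(pairs))   # [{'acct_1', 'acct_2', 'acct_3'}]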
If a cluster contains a large amount of known fraudulent accounts, then a risk analyst
may want to further investigate the remaining accounts in that cluster.
Here’s a summary
During LinkedIn’s early stages (early 2010s), they were growing extremely quickly. To
keep up with this growth, they leveraged several third party proprietary platforms (3PP)
in their analytics stack.
Using these proprietary platforms was far quicker than piecing together off-the-shelf
products.
LinkedIn relied on Informatica and Appworx for ETL to a Data Warehouse built with
Teradata.
ETL stands for Extract, Transform, Load. It’s the process of copying data from various
sources (the different data producers) into a single destination system (usually a data
warehouse) where it can more easily be consumed.
● Lack of freedom to evolve - Because of the closed nature of this system, they
were limited in options for innovation. Also, integration with internal and
open source systems was a challenge.
These disadvantages motivated LinkedIn engineers to develop a new data lake (data
lakes let you contain raw data without having to structure it) on Hadoop in parallel.
You can read about how LinkedIn scaled Hadoop Distributed File System to 1 exabyte of
data here.
However, they did not have a clear transition process, and that led to them maintaining
both the new system and the legacy system simultaneously.
Data Migration
To solve this issue, engineers decided to migrate all datasets to the new analytics stack
with Hadoop.
In order to do this, the first step was to derive LinkedIn’s data lineage.
Data lineage is the process of tracking data as it flows from data sources to
consumption, including all the transformations the data underwent along the way.
Knowing this would enable engineers to plan the order of dataset migration, identify
zero usage datasets (and delete them for workload reduction) and track the usage of the
new vs. old system.
After data lineage, engineers used this information to plan major data model revisions.
They planned to consolidate 1424 datasets down to 450, effectively cutting ~70% of the
datasets from their migration workload.
They also transformed data sets that were generated from OLTP workloads into a
different model that was more suited for business analytics workloads.
The migration was done using various data pipelines and illustrated bottlenecks in
LinkedIn’s systems.
One bottleneck was poor read performance of the Avro file format. Engineers migrated
to ORC and consequently saw a read speed increase of ~10-1000x, along with a 25-50%
improvement in compression ratio.
After the data transfer, deprecating the 1400+ datasets on the legacy system would be
tedious and error prone if done manually, so engineers also built an automated system
to handle this process.
They built a service to coordinate the deprecation. The service would identify dataset
candidates for deletion (datasets with no dependencies and low usage) and then email
the users of those datasets about the upcoming deprecation.
The service would also notify SREs to lock, archive and delete the dataset from the
legacy system after a grace period.
The design of the new ecosystem was heavily influenced by the old ecosystem, and
addressed the major pain points from the legacy tech stack.
● Dataset Readers - Datasets are stored on Hadoop Distributed File System and
can be read in a variety of ways.
For more details on LinkedIn’s learnings and their process for the data (and user)
migration, read the full article.
Here’s a summary
Etsy’s codebase is a monorepo with over 17,000 JavaScript files, spanning many
iterations of the site.
In order to improve the codebase, Etsy made the decision to adopt TypeScript, a
superset of JavaScript with the optional addition of types. This means that any valid
JavaScript code is valid TypeScript code, but TypeScript provides additional features on
top of JS (the type system).
Based on research at Microsoft, static type systems can heavily reduce the amount of
bugs in a codebase. Microsoft researchers found that using TypeScript or Flow could
have prevented 15% of the public bugs for JavaScript projects on Github.
Other companies have approached TypeScript adoption in different ways. For example,
Airbnb automated as much of their migration as possible, while other companies enabled
less-strict TypeScript across their projects and added types to their code over time.
1. How strict do they want their flavor of TypeScript to be? - TypeScript can be
more or less “strict” about checking the types in your codebase. A stricter
configuration results in stronger guarantees of program correctness.
TypeScript is a superset of JavaScript, so if you wanted you could just rename
all your .js files to .ts and still have valid TypeScript, but you would not get
strong guarantees of program correctness.
3. How specific do they want the types they write to be? - How accurately should
a type fit the thing it’s describing? For example, let’s say you have a function
that takes in the name of an HTML tag. Should the parameter’s type be a
string? Or, should you create a map of all the HTML tags and require the parameter
to be a key in that map (far more specific)? A short sketch after this list illustrates the tradeoff.
2. Add really good types and really good supporting documentation to all of the
utilities, components, and tools that product developers use regularly.
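Etsy's codebase is TypeScript, but the type-specificity question from the list above can be sketched with Python's typing module as an analogy: a plain str accepts anything, while a Literal of the valid tag names lets a type checker reject typos.

from typing import Literal

# Least specific: any string is accepted, even "dvi" or "not-a-tag".
def create_element_loose(tag: str) -> None:
    ...

# More specific: only known HTML tag names type-check (a tiny illustrative subset).
HtmlTag = Literal["div", "span", "a", "img", "ul", "li"]

def create_element_strict(tag: HtmlTag) -> None:
    ...

create_element_strict("div")   # OK
create_element_strict("dvi")   # flagged by a type checker such as mypy or pyright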
Etsy wanted to set the compiler parameters for TypeScript to be as strict as possible.
The downside with this is that they would need a lot of type annotations.
They decided to approach the migration incrementally, and first focus on typing
actively-developed areas of the site.
Files that had reliable types were given the .ts file extension while files that didn’t kept
the .js file extension.
Before engineers started writing TypeScript, Etsy made sure that all of their tooling
supported the language and that all of their core libraries had usable, well-defined types.
In terms of tooling, Etsy uses Babel and the plugin babel-preset-typescript that turns
TypeScript into JavaScript. This allowed Etsy to continue to use their existing build
infrastructure. To check types, they run the TypeScript compiler as part of their test
suite.
Etsy makes heavy use of custom ESLint linting rules to maintain code quality.
They used the TypeScript ESLint project to get a handful of TypeScript specific linting
rules.
The biggest hurdle to adopting TypeScript was getting everyone to learn TypeScript.
TypeScript works better the more types there are. If engineers aren’t comfortable
writing TypeScript code, fully adopting the language becomes an uphill battle.
Etsy has several hundred engineers, and very few of them had TypeScript experience
before the migration.
The strategy Etsy used was to onboard teams to TypeScript gradually on a team by team
basis.
● Etsy could refine their tooling and educational materials over time. Etsy
found a course from ExecuteProgram that was great for teaching the basics of
TypeScript in an interactive and effective way. All members of a team would
have to complete that course before they onboarded.
● Engineers had plenty of time to learn TypeScript and factor it into their
roadmaps. Teams that were about to start new projects with flexible deadlines
were the first to onboard TypeScript.
Kevin Dangoor and Marta Kosarchyn are senior engineers at Khan Academy and they
wrote a series of blog posts about the technical choices, execution and results of the
rewrite. We’ll be summarizing the series below.
Summary
In late 2019, Khan Academy was looking to upgrade their backend. The site was built on
a Python 2 monolith and it worked well for over 10 years.
However, Python 2 was about to reach its official end of life on January 1st, 2020, so
Khan Academy engineers decided they had to update.
Of the options they considered, Khan Academy decided to do a rewrite of their Python 2
monolith in Go.
They ran performance tests and found that Go and Kotlin (on the JVM) perform
similarly, with Kotlin being a few percent ahead. However, Go used a lot less memory.
Brief Overview of Go
Go was created at Google by Robert Griesemer, Rob Pike and Ken Thompson. Thompson
and Pike were key employees at Bell Labs and were instrumental in building the original
Unix operating system (and a bunch of other things, like the UTF-8 encoding).
Go includes things like garbage collection, structural typing, and extremely fast compile
times.
Monolith to Services
Previously, all of Khan Academy’s servers ran the same code and could respond to a
request for any part of the website. Separate services were used for storing data and
managing caches, but the logic for any request was the same regardless of which server
responded.
Despite the additional complexity that comes with a services architecture, Khan
Academy decided to go with it because of several big benefits.
● Limited Impact for Problems - KA engineers can now be more confident that
a problem with a deployment will have a limited impact on other parts of the
site.
Despite the change in architecture, Khan Academy plans to continue using Google App
Engine for hosting, Google Cloud Datastore for their database, and other Google Cloud
products.
Big rewrites are extremely risky (Joel Spolsky, co-founder and former CEO of Stack
Overflow, has a great blog post on this) so KA picked a strategy of incremental rewrites.
The hub of Khan Academy’s new backend is based on GraphQL Federation. Khan
Academy is switching from using REST to GraphQL, and GraphQL Federation allows
you to combine multiple backend services into one unified graph interface.
This way, you have a single, typed schema for all the data the various backend systems
provide that is accessed through the GraphQL gateway. Each backend service provides
part of the overall GraphQL schema and the gateway merges all of these separate
schemas into one.
Khan Academy incrementally switched over from the old backend to the new backend.
1. The monolith is in control - At first, the services backend will not have the
functionality to answer a specific request, so the GraphQL gateway will route
that request to the Python monolith. This request will be noted and KA
engineers can write the Go code to handle the request in the new backend.
2. Side-by-side - Once the new backend can handle the request, KA engineers
will switch to side-by-side. In this state, the GraphQL gateway will call both
the Python code and the new Go code. It will compare the results, log the
instances where there’s a difference, and return the Python result to the user (a sketch of this comparison step follows this list).
4. Python code removed - Finally, after extensive testing, the Python code is
removed.
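The side-by-side state (step 2 above) is essentially a comparison shim in the gateway. Here's a rough sketch of the idea, not Khan Academy's gateway code; the backend clients are hypothetical stand-ins:

import logging

def resolve_side_by_side(request, python_backend, go_backend):
    """Call both backends, log any divergence, but always return the old result."""
    python_result = python_backend.handle(request)   # stand-in backend clients
    try:
        go_result = go_backend.handle(request)
        if go_result != python_result:
            logging.warning("side-by-side mismatch for %s: python=%r go=%r",
                            request, python_result, go_result)
    except Exception:
        # A failure in the new service must never affect users during this phase.
        logging.exception("new Go backend failed for %s", request)
    return python_result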
Khan Academy used this process to rewrite their backend with high availability. They
were still able to handle this task despite having a massive increase in website traffic.
The bulk of the rewrite was done in 2020, when schools switched to remote due to
COVID-19 and students, parents and teachers made significantly more use of Khan
Academy.
Within a period of 2 weeks, KA saw an increase of 2.5x in usage. You can read about how
they handled it in this blog post by Marta Kosarchyn.
As of August 30th, 2021, the new services backend was handling 95% of all traffic to the
site.
They were able to meet their initial estimate of completion that they made 20 months
prior (despite the massive bump with COVID) and achieve performance goals for the
new backend.
● Side by side testing - We’ve already discussed this, but the side-by-side testing
approach was critical to the success of the rewrite. It was an efficient way to
make sure that the functionality being replaced was equivalent.
Etsy’s ML Platform team wrote a great article summarizing the design of the ML
infrastructure that Etsy uses and the design choices that went into building the system.
Here’s a summary
Etsy is an e-commerce platform that allows users to sell handmade or vintage items.
Popular products sold on the site include things like jewelry, clothing, bags, etc.
The website makes extensive use of machine learning models for things like search,
recommendations, the ad platform, trust & safety, and more.
The ML Platform team at Etsy develops and maintains the technical infrastructure that
Etsy data scientists use to prototype, train and deploy ML models at scale.
Etsy’s first ML platform was built in 2017, when the data science team was much smaller
and largely relied on much simpler models. As the platform had to start supporting
more complex machine learning projects and new ML frameworks, the maintenance
costs started to become too high.
Etsy’s training and prototyping platform largely relies on Google Cloud services like
Vertex AI and Dataflow, where the data science team can experiment freely with the ML
framework of their choice.
Massive extract transform load (ETL) jobs can be run through Dataflow while complex
training jobs can be submitted to Vertex AI for optimization.
Model Serving
Etsy relies on Google Kubernetes Engine (GKE) for the core of their Model Serving
system (making inferences in production).
To deploy models, data scientists will create stateless ML microservices that are
deployed in Etsy’s Kubernetes cluster.
These microservices will then serve requests from Etsy’s website or mobile app.
The deployments are managed through the Model Management Service, an in-house
developed control plane that gives the data science team a simple UI to manage their
model deployments.
The Model Management Service is an exception to Etsy’s preference for avoiding in-house
tools, but that’s because it was already built out and the ML platform team found it was
still the best tool available.
Workflow Orchestration
In order to keep ML models up-to-date, the ML platform also needs robust pipelines for
retraining and deployment.
Etsy relies on Kubeflow and TFX pipelines (TensorFlow Extended) for this. With Google
Cloud Platform’s Vertex AI Pipelines, the data science team can develop and test
pipelines using either the Kubeflow or TFX SDK, based on their own preference.
Outcomes
The ML Platform team estimates that with the second version of the platform (V2), ML practitioners at Etsy can now go from
idea to live ML experiment in half the time it previously took. Launching new model
architectures takes days instead of weeks, and data scientists can launch dozens of
hyperparameter tuning experiments with a single command.
The biggest challenge around V2 has been encouraging adoption of the new ML
platform. Migrating to a new platform requires upfront effort that may not align with
the current priorities of the staff at Etsy.
He wrote a great blog post on all the engineering behind how videos play on your
computer (adaptive bitrate streaming, HLS, etc.), how videos are delivered to your
computer (CDNs, Multi-CDNs, etc.) and how film is processed into digital video
(Codecs, Containers, FFMPEG, etc.).
Playback
When you come across a website that has a video player embedded in it, there’s quite a
bit going on behind the scenes.
You have the player UI, with the pause/play button, subtitle controls, video speed and
other options.
Players will support different options around DRM, ad injection, thumbnail previews,
etc.
Behind the scenes, modern video platforms will use adaptive bitrate streaming to stream
the video from the server.
Adaptive bitrate streaming means that the server has several different versions of the
video (known as renditions) and each version differs in display size (resolution) and file
size (bitrate).
The video player will dynamically choose the best rendition based on the user’s screen
size and bandwidth. It will choose the rendition that minimizes buffering and gives the
best user experience.
HTTP Live Streaming (HLS) is a protocol designed by Apple for HTTP-based adaptive
bitrate streaming. It’s the most popular streaming format on the internet.
The basic concept is that you take your video file and break it up into small segments,
where each segment is 2-12 seconds long.
If you have a 2 hour long video, you could break it up into segments that are 10 seconds
long and end up with 720 segments.
Each of the segments is a file that ends with a .ts extension. The files are numbered
sequentially, so you get a directory that looks like this
segments/
00001.ts
00002.ts
00003.ts
00004.ts
00005.ts
The player will then download and play each segment as the user is streaming. It will
also keep a buffer of segments in case the user loses network connection.
Again, HLS is an adaptive bitrate streaming protocol, so the web server will have
several different renditions (versions) of the video that is being played.
All of the renditions will be broken into segments of the same length. So, going back to
our example with the 2 hour long video, it could have 720 segment files at 1080p, 720
segment files at 720p, 720 segment files at 480p.
All the segment files are ordered and are each 10 seconds in length.
If your network connection slows down while you’re watching a video, the player can
downgrade you to a lower quality rendition for the next segment files.
When your connection gets faster, the player can upgrade your rendition.
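A toy version of that rendition-selection logic (real players use much smarter bandwidth estimators and buffer models; the renditions and numbers are illustrative):

# Available renditions for one video, highest quality first: (name, bitrate in kbps).
RENDITIONS = [("1080p", 5000), ("720p", 2800), ("480p", 1400), ("360p", 800)]

def choose_rendition(measured_bandwidth_kbps: float, safety_factor: float = 0.8) -> str:
    """Pick the best rendition whose bitrate fits comfortably within the measured
    bandwidth; fall back to the lowest quality otherwise."""
    budget = measured_bandwidth_kbps * safety_factor
    for name, bitrate in RENDITIONS:
        if bitrate <= budget:
            return name
    return RENDITIONS[-1][0]

# Re-evaluated before fetching each ~10 second segment.
print(choose_rendition(7000))   # 1080p
print(choose_rendition(2000))   # 480p (connection slowed down, so downgrade)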
With progressive streaming formats like MP4 and WebM, the video is typically stored in
the temporary directory of the web browser and the user can start watching while the
file is being downloaded in the background.
The user can also jump to specific points in the video and the player will use byte-range
requests to estimate which part of the file corresponds to the place in the video that the
user is attempting to seek.
What makes MP4 and WebM playback inefficient is that they do not support adaptive
bitrates.
Every user who wants to watch the file buffer-free must have an internet connection that
is fast enough to download the file faster than the playback.
Therefore, when you are using these formats you have to make a tradeoff between
serving a higher quality video file vs. decreasing the internet connection speed
requirements.
The origin server is the source of truth. It’s where the developer uploads the original
video files.
The CDN will then pull files from the origin server and cache that file on a bunch of
interconnected servers around the world (in locations that are close to your users).
That way, when users want to request the file, they can do so from a server in the CDN.
This is way faster (and much more scalable) than your origin server sending the entire
file to all your users.
Many enterprises will choose a Multi-CDN environment, where the load is distributed
among multiple CDNs. This improves the user experience by giving them more servers
to choose from and improving the availability of your website.
One of the marketing features in the Grab app is to offer real-time rewards whenever a
user takes a certain action (or series of actions).
For example, if a user uses the Grab app to get a ride to work in the morning, the app
might immediately reward her with a 50% off ride reward that she can use in the
evening for the ride back home.
Jie Zhang and Abdullah Al Mamum are two senior software engineers at Grab and they
wrote a great blog post on how they process thousands of events every second to send
out hundreds of millions of rewards monthly.
Here’s a summary
Grab runs growth campaigns where they’ll reward a user with discounts and perks if the
user completes a certain set of actions. Over a typical month, they’ll send out ~500
million rewards and over 2.5 billion messages to their end-users.
Trident is the engine Grab engineers built to handle this workload. It’s an If This, Then
That engine which allows Grab’s growth managers to create new promotional
campaigns. If a user does this, then award that user with that.
Whenever a customer uses one of Grab’s products the backend service associated with
that product will publish an event to a specific Kafka stream.
Trident subscribes to all the events from these multiple Kafka streams and processes
them. By utilizing Kafka streams, Trident is decoupled from the upstream backend
services.
After filtering out duplicates, Trident will process each event and check if it results in
any messages/rewards that have to be sent to the user. Trident does this by taking the
event and running a rule evaluation process where it checks if the event satisfies any of
the pre-defined rules set by the growth campaigns.
All processed events are stored in Redis (for 24 hours) and events that trigger an action
are persisted in MySQL as well.
If an action is triggered, Trident will then call the backend service associated with that action. These calls are rate-limited (with tighter limits during peak hours) so that Trident doesn't accidentally DoS any of Grab's downstream backend services.
Scalability
The number of events that Trident has to process can vary widely based on the time of
day, day of week and time of year. During the peak of 2020, Trident was processing
more than 2,000 events per second.
Grab uses quite a few strategies to make sure Trident can scale properly. The strategies
are illustrated in this diagram.
The source of events for Trident are Kafka streams. Upstream backend services that are
handling delivery/taxi orders will publish events to these streams after they handle a
user’s request.
Trident can handle increased load (more events coming down the Kafka streams) by
● Reducing Load - The majority of the processing that the Trident servers are doing is checking to see if the event matches the criteria for any of the campaigns and whether any actions are triggered. Grab engineers sped this process up by prefiltering events. They load active campaigns every few minutes and organize them into an in-memory hashmap with the event type as the key and the list of corresponding campaigns as the value. When processing an event, they can quickly figure out all the possible matching campaigns by first checking in the hash map (see the sketch below).
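Here's a rough Python sketch of that prefiltering idea. The event shape, campaign fields and rules are hypothetical, not Grab's actual code.

from collections import defaultdict

# Hypothetical campaign definitions, reloaded every few minutes
campaigns = [
    {"id": "c1", "event_type": "ride_completed", "rule": lambda e: e["city"] == "Singapore"},
    {"id": "c2", "event_type": "food_ordered", "rule": lambda e: e["total"] > 20},
]

# Build the in-memory index: event type -> campaigns that care about that event type
campaigns_by_event = defaultdict(list)
for campaign in campaigns:
    campaigns_by_event[campaign["event_type"]].append(campaign)

def matching_campaigns(event: dict) -> list:
    # Only evaluate rules for campaigns registered for this event type
    candidates = campaigns_by_event.get(event["type"], [])
    return [c["id"] for c in candidates if c["rule"](event)]

print(matching_campaigns({"type": "ride_completed", "city": "Singapore"}))  # -> ['c1']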
If any actions are triggered, Trident will call downstream backend services to handle
them. For example, the GrabRewards service could be called to give a user a free ride.
There are strict rate-limits built in to stop Trident from overwhelming these
downstream services during a time of high load.
Trident uses two types of storage: cache storage (Redis) and persistent storage (MySQL
and S3).
In terms of persistent storage, Trident has two types of data in terms of access pattern:
online data and offline data.
The online data is frequently accessed (so it has to be relatively quick) and medium size
(a couple of terabytes). Grab uses MySQL for this data.
The offline data is infrequently accessed but very large in size (hundreds of terabytes
generated per day). Grab uses AWS S3 for this.
For the MySQL database, Grab added read replicas that can handle some of the load from the read queries. This relieved more than 30% of the load from the master instance and keeps MySQL queries performant.
In the future, they plan to vertically partition (split the database up by tables) the single
MySQL database into multiple databases based on table usage.
Summary
When a user is making an online order, they will submit their credit card information to
a payment gateway such as Stripe or PayPal. The gateway encrypts this information and
facilitates the transaction with payment processors.
The payment processor will talk to the issuing bank (for the user’s credit card) and
request approval.
The approval will then bubble up to the backend, which lets the client know if the payment was accepted or declined.
● User experience - The UI needs to work with all payment methods and also
for new and existing users. This creates quite a few scenarios that have to be
implemented and tested.
● Location - you need to account for the user’s location before processing their
payment to comply with each country’s laws and regulations
In an earlier version of the DoorDash app, the developers didn’t properly account for the
addition of new payment methods to the app. Instead, it was designed around credit
cards and Google Pay. This led to challenges when adding new methods like PayPal.
In the new design, engineers introduced the notion of payment methods into the
codebase. These are categorized into local payment methods that are part of the device
(like Google Pay) and external payment methods that require interactions with a
payment gateway (like Stripe).
Payment vendors may have specific ways they want to be portrayed in an app.
Additionally, they have strict UX guidelines that explain how to display their logos and
buttons.
Payments usually can’t be implemented in a generic way that scales worldwide. Each
country has its own technical, legal and accounting implications.
Some payment methods may also need extra verification or information in other
countries.
It’s absolutely critical to keep an app performant while it processes payments. Caching
the payment methods and cards makes it faster because there’s less waiting for payment
information on the cart and checkout screens.
However, this has to be done with care and must account for error cases where the
backend and device are out of sync.
Payment flows can be tricky to debug if they don’t work properly. Therefore, instead of
just sending generic information like “payment failed”, the system should send as much
information as possible; including things like error codes from providers, device
information or any diagnostics that can help to identify the state of the app when it
failed.
However, be careful not to include any personal identifiable information or any payment
information that could be compromised by an attacker.
For more details on each of these points, read the full blog post here
Matt Anger is a Senior Staff Engineer at DoorDash where he works on the Core Platform
and Performance teams.
He published a great blog post (May 2021) on DoorDash’s migration from Python 2 to
Kotlin. Here’s a summary.
Summary
DoorDash was quickly approaching the limits of what their Django-based monolithic
codebase could support.
With their legacy system, the number of nodes that needed to be updated added
significant time to releases. Debugging bad deploys with bisection got harder and longer
due to the number of commits each deploy had. The monolith was built with Python 2
which was also rapidly entering end-of-life.
One of their goals was to only use one language for the backend.
● Promote Best Practices - Having one language makes it easier for teams to
share development best practices across the entire company.
● Build Common Libraries - All engineers can share common libraries and
tooling.
● Change Teams - Engineers can change teams with minimal friction, which
encourages more collaboration.
First, DoorDash engineers looked at the parts of their tech stack that would not change.
They had a lot of experience with Postgres and Apache Cassandra, so they would
continue to use those technologies as data stores.
They would use gRPC for synchronous service-to-service communication, with Apache
Kafka as a message queue.
In terms of the programming language, the choices in contention were Kotlin, Java, Go,
Rust and Python 3.
Kotlin mitigated some of the pain points around Java like Null Safety and Coroutines.
● Getting around Java interoperability pain points - There were some pain
points with Java interop. Many libraries claiming to implement modern Java
Non-blocking I/O standards did so in an unscalable manner. This caused
issues when using coroutines. Check the article for full details.
However, atomic clocks are way too expensive and bulky to put in every computer.
Instead, individual computers contain quartz clocks which are far less accurate.
The amount of clock drift depends on the hardware, but it's typically an error of ~10 seconds per month.
When you’re dealing with a distributed system with multiple machines, it’s very
important that you have some degree of clock synchronization. Having machines that
are dozens of seconds apart on time makes it impossible to coordinate.
Clock skew is a measure of the difference between two clocks on different machines at a certain point in time.
You can never reduce clock skew to 0, but you want to reduce clock skew as much as
possible through the synchronization process.
The way clock synchronization is done is with a protocol called NTP, Network Time
Protocol.
NTP works by having servers that maintain accurate measures of the time. Clients can
query those servers and ask for the current time.
The client will take those answers, discard any outliers, and average the rest. It’ll use a
variety of statistical techniques to get the most accurate time possible.
This is a great blog post that delves into the clock synchronization algorithm.
Here’s a list of NTP servers that you can query for the current time. It’s likely that your personal computer uses NTP to contact a time server and adjust its own clock.
Therefore, there are some NTP servers in between your computer and the reference
clock.
A computer may query multiple NTP servers, discard any outliers (in case of faults with
the servers) and then average the rest.
Computers may also query the same NTP server multiple times over the course of a few
minutes and then use statistics to reduce random error due to variations in network
latency.
For connections through the public internet, NTP can usually maintain time to within
tens of milliseconds (a millisecond is one thousandth of a second).
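If you want to try this yourself, here's a small sketch using the third-party ntplib package in Python (pool.ntp.org is a public pool of NTP servers; this is just an illustration, not part of the lecture).

# pip install ntplib
import ntplib
from datetime import datetime, timezone

client = ntplib.NTPClient()
response = client.request("pool.ntp.org", version=3)

# Estimated offset between your local clock and the server's clock, in seconds
print("clock offset:", response.offset)
print("server time:", datetime.fromtimestamp(response.tx_time, tz=timezone.utc))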
To learn more about NTP, watch the full lecture by Martin Kleppmann.
A critical part of the DoorDash app is the search function. You can search for Scallion
Pancakes and the DoorDash app will give you restaurants near you that are open and
currently serving that dish.
Solving this problem at scale is quite challenging, as restaurants are constantly changing
their menus, store hours, locations, etc.
You need to quickly index all of the store data to provide a great restaurant discovery
feature.
Satish, Danial, and Siddharth are software engineers on DoorDash’s Search Platform
team, and they wrote a great blog post about how they built a faster indexing system
with Apache Kafka, Apache Flink and Elasticsearch.
Here’s a summary
DoorDash’s legacy indexing system was very slow, unreliable and not extensible. It took
a long time for changes in store and item descriptions to be reflected in the search index.
It was also very difficult to assess the indexing quality.
There were frequent complaints about mismatches in store details between the search
index and the source of truth. These had to be fixed manually.
Engineers solved these problems by building a new search indexing platform with the
goals of providing fast and reliable indexing while also improving search performance.
The new platform is built on a data pipeline that uses Apache Kafka as a message queue,
Apache Flink for data transformation and Elasticsearch as the search engine.
● Data sources - These are the sources of truth for the data. When CRUD
operations take place on the data (changing store menu, updating store hours,
etc.) then they are reflected here. DoorDash uses Postgres as the database and
Snowflake as the data warehouse.
● Flink application - There are two custom Apache Flink applications in this
pipeline: Assembler and ES Sink. Assembler is responsible for assembling all
the data required in an Elasticsearch document. ES Sink is responsible for
shaping the documents as per the schema and writing the data to the targeted
Elasticsearch cluster.
● Message queue - Kafka 1 and Kafka 2 are the message queue components.
The changes in data sources are propagated to Flink applications using Kafka. The Flink
apps implement business logic to curate the search documents and then write them to
Elasticsearch.
Incremental Indexing
The first type of data change is when human operators make ad hoc changes to stores or
restaurant items. An example of a possible data change is a restaurant owner adding a
new dish to her menu.
The second type of data change is ETL data changes that are generated from machine
learning models. Things like restaurant ratings/scores or auto-generated tags are
generated by machine learning models and then stored in a data warehouse.
Both of these changes need to be reflected in the search index for the best customer
experience.
Restaurant owners will frequently update their menus and store information. These
changes need to be reflected onto the search experience as quickly as possible.
To keep track of these updates, DoorDash search engineers rely on Change Data Capture
(CDC) events.
They tested other solutions like the Debezium connector, a Red Hat-developed open source project for capturing row-level changes in Postgres, but they found that this strategy had too much overhead and was not performant.
Many properties that are used in the search index are generated by ML models. Things
like restaurant scores, auto-generated tags, etc.
These properties are updated in bulk, once a day. The data gets populated into tables in DoorDash’s data warehouse after a nightly run of the respective ETL jobs.
The CDC patterns described for Human Operator Changes don’t work here because you don’t constantly have changes/updates throughout the day. Instead, you have one bulk update that happens once a day.
Using the CDC pattern described above would overwhelm the system when making the
bulk update due to the size of the update.
Therefore, DoorDash engineers built a custom Flink source function which spreads out
the ETL ingestion over a 24 hour interval so that the systems don’t get overwhelmed.
The Flink source function will periodically stream rows from an ETL table to Kafka in
batches, where the batch size is chosen to ensure that the downstream systems do not
get overwhelmed.
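Here's a simplified Python sketch of that pacing idea. The real implementation is a custom Flink source function; the function names and parameters below are hypothetical.

import time

def stream_etl_table(fetch_batch, publish, total_rows, window_hours=24, batch_size=1000):
    """Stream rows in fixed-size batches, spread evenly over a time window."""
    num_batches = max(1, (total_rows + batch_size - 1) // batch_size)
    delay_seconds = (window_hours * 3600) / num_batches
    for i in range(num_batches):
        rows = fetch_batch(offset=i * batch_size, limit=batch_size)
        publish(rows)              # e.g. produce the batch to a Kafka topic
        time.sleep(delay_seconds)  # pace ingestion so downstream systems aren't overwhelmed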
Once the Assembler application publishes data to Kafka, the consumer (ES Sink) will
read those messages, transform them according to the specific index schema, and then
send them to their appropriate index in Elasticsearch.
It has rate limiting and throttling capabilities out of the box, which are essential for
protecting Elasticsearch clusters when the system is under heavy write load.
Results
With the new search indexing platform, updates happen much faster. The time needed
to reindex existing stores and items on the platform fell from 1 week to 2 hours.
Relying on open source tools for the index also means there’s plenty of accessible documentation online, and it will be easier to find engineers with this expertise who can join the DoorDash team in the future.
For information on how DoorDash backfilled the search index (and more!), read the full
blog post here.
We’ll be summarizing a small snippet from his book on the architecture behind
Database Management Systems (DBMSs).
Summary
They provide an extremely useful abstraction that you can use in your applications for
storing and (later) retrieving data.
All DBMSs provide an API that you can use through some type of Query Language to
store and retrieve data.
They also provide a set of guarantees around how this data is stored/retrieved. A couple
examples of such guarantees are
● Durability - guarantees that you won’t lose any data if the DBMS crashes
● Consistency - After you write data, will all subsequent reads always give the
most recent value of the data? (this is important for distributed databases)
The architecture of a Database Management System varies widely based on its guarantees and design goals. A database designed for OLTP use will be designed differently than a database meant for OLAP.
An in-memory DBMS (designed to store data primarily in memory and use disk for
recovery and logging) will also be designed differently than a disk-based DBMS
(designed to primarily store data on disk and use memory for caching).
Database Management Systems use a client/server model. Your application is the client
and the DBMS is the server (either hosted on the same machine or on a different
machine).
Client requests come in the form of a database query and are usually expressed in some
type of a query language (ex. SQL).
The Query Processor will first parse the query (building an Abstract Syntax Tree, for example) and make sure it is valid.
Checking for validity means making sure that the query makes sense (all the commands
are recognized, the data accessed is valid, etc.) and also that the client is correctly
permissioned to access/modify the data that they’re requesting.
If the query isn’t valid, then the database will return an error to the client.
The optimizer will first eliminate redundant parts of the query and then use internal
database statistics (index cardinality, approximate intersection size, etc.) to find the
most efficient way to execute the query.
For distributed databases, the optimizer will also consider data placement like which
node in the cluster holds the data and the costs associated with the transfer.
The output from the Optimizer is an Execution Plan that describes the optimal method
of executing the query. This plan is also called the Query Plan or Query Execution Plan.
This execution plan gets passed on to the Execution Engine which carries out the plan.
When you’re using a distributed database, the execution plan can involve remote
execution (making network requests for data that is stored on a different machine).
Otherwise, it’s just local execution (carrying out queries for data that is stored locally).
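You can peek at an execution plan in most databases. Here's a small example using Python's built-in sqlite3 module; SQLite's plans are much simpler than a distributed database's, but the idea is the same.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# Ask the query processor how it plans to execute this query
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?",
    ("a@example.com",),
).fetchall()
for row in plan:
    print(row)  # shows that idx_users_email will be used to find the row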
Local execution involves talking to the Storage Engine to get the data.
Storage Engines typically provide a simple data manipulation API (allowing for CRUD
features) and contain all the logic for the actual details of how to manipulate the data.
Databases will often allow you to pick the Storage Engine that’s being used.
MySQL, for example, has several choices for the storage engine, including RocksDB and
InnoDB.
Transaction Manager - responsible for creating transaction objects and managing their
atomicity (either the entire transaction succeeds or it is rolled back).
Access Methods - These manage access, compression and organizing data on disk.
Access methods include heap files and storage structures such as B-trees.
Buffer Manager - This manager caches data pages in RAM to reduce the number of
accesses to disk.
Recovery Manager - Maintains the operation log and restores the system state in case of a failure.
Different Storage Engines make different tradeoffs between these components resulting
in differing performance for things like compression, scaling, partitioning, speed, etc.
To fix this, Twitter adopted Splunk for their logging system. After the switch, they’ve
been able to ingest 4-5 times more logging data with faster queries.
Kristopher Kirland is a senior Site Reliability Engineer at Twitter and he wrote a great
blog post on this migration and some of the challenges involved (published August
2021).
Here’s a summary
Before Loglens (the legacy centralized logging system), Twitter engineers had great
difficulty with browsing through the different log files from their various backend
services. This would be super frustrating when engineers were investigating an ongoing
incident.
To solve this, they designed Loglens as a centralized logging platform that would ingest
logs from all the various services that made up Twitter.
● Ease of onboarding
● Low cost
Log files would be written to local Scribe daemons, forwarded onto Kafka and then
ingested into the Loglens indexing system and written to HDFS.
However, only about 10% of the logs submitted were actually ingested. The other 90% were discarded by the rate limiter to avoid overwhelming Loglens.
Transitioning to Splunk
Twitter engineers decided to switch to Splunk for their centralized logging platform.
They chose Splunk because it could scale to their needs; ingesting logs from hundreds of
thousands of servers in the Twitter fleet. Splunk also offers flexible tooling that satisfies
the majority of Twitter’s log analysis needs.
Due to the loosely coupled design of Loglens, migrating to Splunk was pretty
straightforward.
Twitter engineers created a new service to subscribe to the Kafka topic that was already
in use for Loglens, and then they forwarded those logs to Splunk. The new service is
called the Application Log Forwarder (ALF).
With Splunk, Twitter now has a much greater ingestion capacity compared to Loglens.
As of August 2021, they collect nearly 42 terabytes of data per datacenter each day.
They also gained some other features like greater configurability, the ability to save and
schedule searches, complex alerting, a more robust and flexible query language, and
more.
Twitter engineers also faced some challenges with the migration; for more information, check out the full blog post here.
He wrote an amazing book called Crafting Interpreters, where he walks you through
how programming language implementations work. In the book, you’ll build two
interpreters, one in Java and another in C.
He published the entire book for free here, but I’d highly suggest you support the author
if you have the means to do so.
The programming language refers to the syntax, keywords, etc. The programming
language itself is usually designed by a committee and there are some standard
documents that describe the language. These documents are usually called the
Programming Language Specification.
(Not all languages have a specification. Python, for example, has the Python Language Reference, which is the closest thing to its specification.)
The language implementation is the actual software that allows you to run code from
that programming language. Typically, an implementation consists of a
compiler/interpreter.
A programming language can have many different language implementations, and these implementations can all be quite different from each other. The key factor is that they all implement the same language.
The most popular implementation for Python is CPython but there’s also PyPy (JIT
compiler), Jython (Python running on the JVM), IronPython (Python running on .NET)
and many more implementations.
Language implementations are obviously built differently, but there are some general
patterns.
● Front end - takes in your source code and turns it into an intermediate
representation.
● Middle end - performs optimizations on the intermediate representation.
● Back end - takes the optimized intermediate representation and turns it into
machine code or bytecode.
For example, you can write multiple back ends that turn the intermediate representation
into machine code for x86, ARM, and other platforms and then reuse the same front end
that turns your C code into intermediate representation.
Front end
As we said before, the front end is responsible for taking in your source code and
turning it into an intermediate representation.
The first part is scanning (also known as lexical analysis). This is where the front end
reads your source code and converts it into a series of tokens. A token is a single element
of a programming language. It can be a single character, like a {, or it can be a word, like
System.out.println.
After scanning and converting your source code into tokens, the next step is parsing.
A parser will take in the flat sequence of tokens and build a tree structure based on the
programming language’s grammar.
This tree is usually called an abstract syntax tree (AST). While the parser is creating the
abstract syntax tree, it will let you know if there are any syntax errors in your code.
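To make scanning and parsing concrete, here's a small example that uses Python's built-in tokenize and ast modules to look at CPython's own front end (this isn't from the book, which has you build these pieces by hand).

import ast
import io
import tokenize

source = "total = price * quantity"

# Scanning (lexical analysis): the source becomes a flat sequence of tokens
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# Parsing: the tokens become an abstract syntax tree
tree = ast.parse(source)
print(ast.dump(tree, indent=2))  # indent= needs Python 3.9+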
If the language is statically typed, this is where type checking happens and type errors
are reported.
The results of this analysis have to be stored somewhere. Some implementations will store them on the abstract syntax tree as attributes. Others will store them in a lookup table called a symbol table.
There are a couple of well established styles of intermediate representation out there.
Examples include
● Three-address Code
● Continuation-passing Style
Having a shared intermediate representation helps make your compiler design much
more modular.
Middle End
The middle end is responsible for performing various optimizations on the intermediate
representation.
These optimizations are independent of the platform that’s being targeted, so they’ll speed up your code regardless of what the back end does.
Other examples are removal of unreachable code (reachability analysis) and code that
does not affect the program results (dead code elimination).
Back End
The back end is responsible for taking the optimized intermediate representation from
the middle end and generating the machine code for the specific CPU architecture of the
computer (or generating bytecode).
The back end may perform more analysis, transformations and optimizations that are
specific for that CPU architecture.
If the back end produces bytecode, then you’ll also need another compiler for each target architecture that turns that bytecode into machine code. Alternatively, many runtimes rely on a virtual machine: a program that emulates a hypothetical chip. The bytecode is run on that virtual machine.
An example is the Java Virtual Machine, which runs Java bytecode. You can reuse that
backend and write frontends to handle different languages. Python, Kotlin, Clojure and
Scala are a few examples of languages that have front ends that can convert that
language into Java bytecode.
Here’s a summary
Airbnb has been through three major stages in their architecture since the company’s
founding.
We’ll go through each architecture and talk about the pros/cons and why Airbnb
migrated.
Most engineers were full stack and could work on every part of the codebase, executing
on end-to-end features by themselves. Features could be completed within a single
team, which helped the company build new products very quickly.
However, as Airbnb entered hypergrowth, the number of engineers, teams, and lines of
code scaled up very quickly.
It became impossible for a single engineer/team to have context on the entire codebase,
so ownership and team boundaries were needed.
Airbnb struggled with drawing these team boundaries since the monolith was very
tightly coupled. Code changes in one team were having unintended consequences for
another team and who owned what was confusing for different parts of the codebase.
These issues were leading to a slower developer velocity and Airbnb decided to shift to a
microservices oriented approach to reduce these pain points.
2. A business logic service able to apply functions and combine different pieces
of data together
To avoid ownership issues seen with the monolith, each microservice would only have
one owning team (and each team could own multiple services).
With these changes, Airbnb also changed the way engineering teams were structured.
Previously, engineering teams were full stack and able to handle anything. But now, with microservices, Airbnb shifted to teams that were focused on certain parts of the stack. Some were focused on certain data services while others were focused on specific pieces of business logic.
Airbnb also had a specific team that was tasked with running the migration from the monolith to microservices. This team was responsible for building tooling to help with the migration.
A few years into the microservices migration, new challenges started to arise.
Managing all these services and their dependencies was quite difficult. Teams needed to
be more aware of the service ecosystem to understand where any dependencies may lie.
Building an end-to-end feature meant using various services across the stack, so
different engineering teams would all need to be involved. All of these teams needed to
have similar priorities around that feature, which was difficult to manage as each team
owned multiple services.
They’re creating a system where their internal backend service gets its data from the
data aggregation service. The data aggregator then communicates with the various
service blocks where each service block encapsulates a collection of microservices.
For the service block layer, engineers need to make sure that they’re defining the
schema boundaries in a clean way. There are many pieces of data/logic that can span
multiple entities, so it needs to be clearly defined.
For more details, you can watch Jessica’s full talk here.
It’s now become one of the most popular open source projects in the big data space and
is used by companies like Amazon, Tencent, Shopify, eBay and more.
Spark was introduced as a way to solve those pain points, and it’s quickly evolved into
much more.
We’ll talk about why Spark was created, what makes Spark so fast and how it works
under the hood.
History of MapReduce
In a previous tech dive, we talked about Google MapReduce and how Google was using
it to run massive computations to help power Google Search.
MapReduce introduced a new parallel programming paradigm that made it much easier
to run computations on massive amounts of distributed data.
Hadoop gained widespread popularity as a set of open source tools for companies
dealing with massive amounts of data.
Let’s say you have 100 terabytes of data split across 100 different machines. You want to
run some computations on this data.
With MapReduce, you take your computation and split it into a Map function and a
Reduce function.
You take the code from your map function and run it on each of the 100 machines in a
parallel manner.
On each machine, the map function will take in that machine’s chunk of the data and
output the results of the map function.
The output will get written to local disk on that machine (or a nearby machine if there
isn’t enough space on local).
Then, the reduce function will take in the output of all the map functions and combine
that to give the answer to your computation.
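Here's a toy, single-machine word count in Python to show the shape of a map function and a reduce function. A real MapReduce framework would run the map calls on different machines and shuffle the intermediate results between them.

from collections import Counter
from functools import reduce

# Pretend each string is the chunk of data stored on one machine
chunks = ["the quick brown fox", "the lazy dog", "the fox"]

def map_fn(chunk: str) -> Counter:
    # Runs independently against each machine's chunk of the data
    return Counter(chunk.split())

def reduce_fn(a: Counter, b: Counter) -> Counter:
    # Combines the intermediate outputs into the final answer
    return a + b

word_counts = reduce(reduce_fn, map(map_fn, chunks), Counter())
print(word_counts["the"])  # -> 3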
The MapReduce framework on Hadoop had some shortcomings that were becoming big
issues for engineers.
● Interactive Analysis - When you store data on Hadoop (using HDFS), you’ll want to run ad-hoc exploratory queries to better understand your data. Doing this with MapReduce can be a pain because of how unintuitive it can be to create Map and Reduce functions to do your data exploration. Instead, you’ll want a more interactive way to query your data.
The main goal was to create a fast and versatile tool to handle distributed processing of
large amounts of data. The tool should be able to handle a variety of different workloads,
with a specific emphasis on workloads that reuse a working set of data across multiple
operations.
Many common machine learning algorithms will repeatedly apply a function to the
same dataset to optimize a parameter (ex. Gradient descent).
Running a bunch of random SQL queries on a dataset to get a feel for it is another
example of reusing a working set of data across multiple operations (SQL queries in this
scenario).
Spark is a program for distributed data processing, so it runs on top of your data storage
layer. You can use Spark on top of Hadoop Distributed File System, MongoDB, HBase,
Cassandra, Amazon S3, RDBMSs and a bunch of other storage layers.
In a Spark program, you can transform your data in different ways (filter, map,
intersection, union, etc.) and Spark can distribute these operations across multiple
computers for parallel processing.
Spark offers nearly 100 high-level, commonly needed data processing operators and you
can use Spark with Scala, Java, Python and R.
● Spark SQL will let you use SQL queries to do data processing.
● Spark Structured Streaming lets you process real-time streaming data from
something like Kafka or Kinesis.
● GraphX will let you manipulate graphs and offers algorithms for traversal,
connections, etc. You can use algorithms like pagerank, triangle counting and
connected components.
1. Lazy Evaluation - When you’re manipulating your data, Spark will not execute your manipulations (called transformations in Spark lingo) immediately. Instead, Spark will take your transformations (like sort, join, map, filter, etc.) and keep track of them in a Directed Acyclic Graph (DAG). A DAG is just a graph (a set of nodes and edges) where the nodes have directed edges (the first transformation will point to the second transformation and so on) and the graph has no cycles. Then, when you want to get your results, you can trigger an Action in Spark. Actions trigger the evaluation of all the recorded transformations in the DAG. Because Spark knows what all your chained transformations are, Spark can then use its optimizer to construct the most efficient way to execute all the transformations in a parallel way. This helps make Spark much faster (see the sketch after this list).
2. In Memory - We’ve said several times above that one of the issues with MapReduce is all the disk I/O. Spark solves this by retaining all the intermediate results in memory. After you trigger an Action, Spark will be calculating all the transformations in RAM using the memory from all the machines in your Spark cluster and then run the computations. If you don’t have enough RAM, then Spark can also use disk and swap data between the two.
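Here's a minimal PySpark sketch of that transformation/action split (it assumes you have pyspark installed; the numbers are arbitrary).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-sketch").getOrCreate()
rdd = spark.sparkContext.parallelize(range(1_000_000))

# Transformations: recorded in the DAG, nothing executes yet
evens = rdd.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Action: triggers the optimizer and actually runs the DAG
print(squares.take(5))  # -> [0, 4, 16, 36, 64]

spark.stop()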
As we said before, Spark is a distributed data processing engine that can process huge
volumes of data distributed across thousands of machines.
The collection of machines is called a Spark cluster and the largest Spark cluster is around 8,000 machines. (Note: you can also run Spark on a single machine. If you want, you can download it from the Apache website.)
Leader-Worker Architecture
Spark is based on a leader-worker architecture. In Spark lingo, the leader is called the
Spark driver while the worker is called the Spark executor.
A Spark application has a single driver, where the driver functions as the central
coordinator. You’ll be interacting with the driver with your Scala/Python/R/Java code
and you can run the driver on your own machine or on one of the machines in the Spark
cluster.
The executors are the worker processes that execute the instructions given to them by
the driver. Each Spark executor is a JVM process that is run on each of the nodes in the
Spark cluster (you’ll mostly have one executor per node).
The Spark executor will get assigned tasks that require working on a partition of the
data that is closest to them in the cluster. This helps reduce network congestion.
When you’re working with a distributed system, you’ll typically use a cluster manager
(like Apache Mesos, Kubernetes, Docker Swarm, etc.) to help manage all the nodes in
your cluster.
Spark is no different. The Spark driver will work with a cluster manager to orchestrate
the Spark Executors. You can configure Spark to use Apache Mesos, Kubernetes,
Hadoop YARN or Spark’s built-in cluster manager.
When Spark runs your computations on the given datasets, it uses a data structure
called a Resilient Distributed Dataset (RDD).
RDDs are the fundamental abstraction for representing data in Spark and they were first
introduced in the original Spark paper.
Spark will look at your dataset across all the partitions and create an RDD that
represents it. This RDD will then be stored in memory where it will be manipulated
through transformations and actions.
● Resilience - RDDs are fault-tolerant and able to survive failures of the nodes in the Spark cluster. As you call transformation operations on your RDD, Spark will be building up a DAG of all the transformations. This DAG can be used to track the data lineage of all the RDDs so you can reconstruct any of the past RDDs if one of the machines fails. Just note, this is fault tolerance for the RDD, not for the underlying data. Spark is assuming that the storage layer underneath is responsible for keeping the data itself durable.
As you’re running your transformations, Spark will not be executing any computations.
Instead, the Spark driver will be adding these transformations to a Directed Acyclic
Graph. You can think of this as just a flowchart of all the transformations you’re
applying on the data.
Once you call an action, then the Spark driver will start computing all the
transformations. Within the driver are the DAG Scheduler and the Task Scheduler.
These two will manage executing the DAG.
When you call an action, the DAG will go to the DAG scheduler.
The DAG scheduler will divide the DAG into different stages where each stage contains
various tasks related to your transformations.
The DAG scheduler will run various optimizations to make sure that the stages are executed in the most efficient way and to eliminate any redundant computations. It will then pass the set of stages to the Task Scheduler.
The Task Scheduler will then coordinate with the Cluster Manager (Apache Mesos,
Kubernetes, Hadoop YARN, etc.) to execute all the stages using the machines in your
Spark cluster and get the results from the computations.
These 220 million active users are accessing their Netflix account from multiple devices,
so Netflix engineers have to make sure that all the different clients that a user logs in
from are synced.
You might start watching Breaking Bad on your iPhone and then switch over to your
laptop. After you switch to your laptop, you expect Netflix to continue playback of the
show exactly where you left off on your iPhone.
Syncing between all these devices for all of their users requires an immense amount of
communication between Netflix’s backend and all the various clients (iOS, Android,
smart TV, web browser, Roku, etc.). At peak, it can be about 150,000 events per second.
To handle this, Netflix built RENO, their Rapid Event Notification System.
Ankush Gulati and David Gevorkyan are two senior software engineers at Netflix, and
they wrote a great blog post on the design decision behind RENO.
Here’s a Summary
Netflix engineers have to make sure that things like viewing activity, membership plan,
movie recommendations, profile changes, etc. are synced between all these devices.
The company uses a microservices architecture for their backend, and built the RENO
service to handle this task.
2. Event Prioritization - If a user changes their child’s profile maturity level, that event change should have a very high priority compared to other events. Therefore, each event type that RENO handles has a priority assigned to it and RENO then shards by that event priority. This way, Netflix can tune system configuration and scaling policies differently for events based on their priority.
5. Managing High RPS - At peak times, RENO serves 150,000 events per second. This high load can put strain on the downstream services. Netflix handles this high load by adding various gate checks before sending an event.
These are from the various backend services that handle things like movie
recommendations, profile changes, watch activity, etc.
Whenever there are any changes, an event is created. These events go to the Event
Management Engine.
The Event Management Engine serves as a layer of indirection so that RENO has a
single source of events.
From there, the events get passed down to Amazon SQS queues. These queues are
sharded based on event priority.
AWS Instance Clusters will subscribe to the various queues and then process the events
off those queues. They will generate actionable notifications for all the devices.
These notifications then get sent to Netflix’s outbound messaging system. This system
handles delivery to all the various devices.
The notifications will also get sent to a Cassandra database. When devices need to pull
for notifications, they can do so using the Cassandra database (remember it’s a Hybrid
Communications Model of push and pull).
The RENO system has served Netflix well as they’ve scaled. It is horizontally scalable
due to the decision of sharding by event priority and adding more machines to the
processing cluster layer.
For more details, you can read the full blog post here.
When a user creates a new username on Twitter, the backend has to check whether that username is already taken. A bloom filter is a great data structure for speeding that check up.
The core data structure in a bloom filter is a bit vector (an array of bits). This bit vector
will be used to keep track of items that exist in the set.
When you want to insert an item, you’ll first use a hash function (or multiple hash
functions) to hash that item into an integer.
Then, you can mod that integer by the number of slots in your bit vector:
integer % num_slots_bit_vector
This will give you a slot in your bit vector for that item. You set that slot’s bit to 1. Then,
you can add the item to your database (or whatever storage layer you’re using).
If you want to check if some item exists in your bloom filter, you can repeat the process
of hashing and modulus to find the corresponding slot in the bit vector. If the slot in the
bit vector is not set to 1, then you immediately know that the item does not exist in the
set. You don’t have to query your database (which is a lot more expensive than using the
bloom filter).
However, if the slot in the bit vector for that item is set to 1, then that doesn’t tell you for
sure that the item exists in the set. You’ll have to do a further check within your
database to know for certain.
The bloom filter will only tell you if an item doesn’t exist in the set.
You’ll eventually run into a scenario where two different items get mapped to the same slot in your bloom filter’s bit vector. This is why false positive matches are possible, but false negatives are not.
Bloom filters are very similar to hash tables, but a hash table eliminates the possibility
of false positive matches through collision resolution.
The hash table solves the collision problem with solutions like open addressing.
The downside of this is that it makes the hash table take up far more space than the
bloom filter.
If you want to keep track of every single Twitter username in your data structure, a hash table may become too large to store in memory. In that situation, you’ll want to use a bloom filter.
Going back to the Twitter example, let’s say you’re an engineer at Twitter and you’re working on the sign-up form for new users.
When a new user tries to create their username, they need to quickly find out if their
username has already been taken.
Therefore, you can use a bloom filter to quickly check if a username is unique.
If the username doesn’t exist in the bloom filter, then you can avoid querying the
database.
If the username does exist in the bloom filter, then you can query the database to check
if the username is already taken or if the bloom filter match was a false positive.
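Here's a toy bloom filter in Python to make that flow concrete. Real implementations use independent hash functions and carefully sized bit vectors; the usernames below are made up.

import hashlib

class BloomFilter:
    def __init__(self, num_slots: int = 1024, num_hashes: int = 3):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.bits = [0] * num_slots

    def _slots(self, item: str):
        # Derive several hash values by salting one hash function with an index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_slots

    def add(self, item: str) -> None:
        for slot in self._slots(item):
            self.bits[slot] = 1

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means "maybe", so check the database
        return all(self.bits[slot] for slot in self._slots(item))

taken_usernames = BloomFilter()
taken_usernames.add("quastor")

if not taken_usernames.might_contain("new_user_123"):
    print("definitely available, skip the database query")
else:
    print("possibly taken, fall back to a database lookup")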
Databases use bloom filters extensively to reduce disk lookups for non-existent rows/columns. Apache Cassandra, Postgres and Google Bigtable are just a few examples of databases that do this.
The Akamai Content Delivery Network (one of the largest CDNs in the world) uses
bloom filters to avoid “one-hit wonders” from being stored in their caches.
One-hit-wonders are objects that are requested by users just once. Akamai uses a bloom
filter to detect the second request for a web object and then cache it only after the
second request.
Here’s a summary.
Rate Limiting is a technique used to limit the amount of requests a client can send to
your server.
It’s incredibly important to prevent DoS attacks from clients that are (accidentally or
maliciously) flooding your server with requests.
A rule of thumb for when to use a rate limiter: if your users can reduce the frequency of their API requests without affecting the outcome of those requests, then a rate limiter is appropriate.
For example, if you’re running Facebook’s API and you have a user sending 60 requests
a minute to query for their list of Facebook friends, you can rate limit them without
affecting their outcome. It’s unlikely that they’re adding new Facebook friends every
single second.
Rate Limiting is great for day-to-day operations, but you’ll occasionally have incidents
where some component of your system is down and you can’t process requests at your
normal level.
In these scenarios, Load Shedding is a technique where you drop low-priority requests
to make sure that critical requests get through.
Stripe is a payment processing company (you can use their API to collect payments from
your users) so a critical request for them is a request to charge a user money.
An example of a non-critical method would be a request to read charge data from the
past.
Stripe uses 4 different types of limiters in production (2 rate limiters and 2 load
shedders).
Restricts each user to n requests per second. However, they also built in the ability for a
user to briefly burst above the cap to handle legitimate spikes in usage.
This helps Stripe manage the load of their CPU-intensive API endpoints.
Stripe divides their traffic into two types: critical API methods and non-critical methods.
Stripe always reserves a fraction of their infrastructure for critical requests. If the
reservation number is 10%, then any non-critical request over the 90% allocation would
be rejected with a 503 status code.
● Critical Methods
● POSTs
● GETs
● Test mode traffic (traffic from developers testing the API and making sure
payments are properly processed)
There are quite a few algorithms you can use to build a rate limiter. Algorithms include
Token Bucket - Every user gets a bucket with a certain amount of “tokens”. On each
request, tokens are removed from the bucket. If the bucket is empty, then the request is
rejected.
New tokens are added to the bucket at a certain threshold (every n seconds). The bucket
can hold a certain number of tokens, so if the bucket is full of tokens then no new tokens
will be added.
Fixed Window - The rate limiter uses a window size of n seconds for a user. Each
incoming request from the user will increment the counter for the window. If the
counter exceeds a certain threshold, then requests will be discarded.
Sliding Log - The rate limiter tracks every user’s requests in a time-stamped log. When a new request comes in, the system counts the logged requests to determine the request rate. If the request rate exceeds a certain threshold, then the request is denied.
After a certain period of time, previous requests are discarded from the log.
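Here's a small token bucket sketch in Python. It keeps per-user state in a single process; a production rate limiter would usually keep this state in something like Redis so it's shared across servers.

import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # max tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, but never exceed the bucket's capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. each user gets a burst capacity of 10 requests, refilling 5 tokens per second
buckets = {}
def is_allowed(user_id: str) -> bool:
    bucket = buckets.setdefault(user_id, TokenBucket(capacity=10, refill_rate=5))
    return bucket.allow_request()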
In order for the ML models to run, Lyft engineers have to make sure the model’s
features are always available.
The features are the input that an ML model uses in order to get its prediction. If you’re
building a machine learning algorithm that predicts a house’s sale price, some features
might be the number of bedrooms, square footage, zip code, etc.
A core part of Lyft’s Machine Learning Platform is their Feature Serving service, which
makes sure that ML models can get low latency access to feature data.
Vinay Kakade worked on Lyft’s Machine Learning Platform and he wrote a great blog
post on the architecture of Lyft’s Feature Serving service.
Here’s a summary
● Some are computed via batch jobs. Deciding which users should get a 10% off
discount can be computed via a batch job that can run nightly.
● Others are computed in real time. When a user inputs her destination into the
app, the ML model has to immediately output the optimal price for the ride.
Lyft also needs to train their ML models (determine the optimal model parameters to
produce the best predictions) which is done via batch jobs.
The Feature Serving service at Lyft is responsible for making sure all features are
available for both training ML models and for making predictions in production.
● Feature Definitions
● Feature Ingestion
Feature Definitions
The features are defined in SQL. The complexity of the definitions can range from a
single query to thousands of lines of SQL comprising complex joins and
transformations.
The definitions also have metadata in JSON that describes the feature version, owner,
validation information, and more.
For real time feature data, Lyft uses Apache Flink. They execute SQL against a stream
window and then write to the Feature Service.
The Feature Serving service is written in Golang and has gRPC and REST endpoints for
writing and reading feature data.
When feature data is added to the service, it is written to both DynamoDB and Redis (Redis is used as a write-through cache to reduce read load on DynamoDB).
Lyft uses DynamoDB Streams to replicate the feature data to Apache Hive (their data warehouse tool) and Elasticsearch.
The Feature Serving service will then utilize the Redis cache, DynamoDB, Hive and
Elasticsearch to serve requests for feature data.
For real-time ML models that need feature data back quickly, the Feature Serving
service will try to retrieve the feature data from the Redis cache. If there is a cache miss,
then it will retrieve the data from DynamoDB.
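A rough sketch of that read path using the redis and boto3 client libraries is below. The table name, key schema and TTL are placeholders, not Lyft's actual setup.

# pip install redis boto3
from typing import Optional

import boto3
import redis

cache = redis.Redis(host="localhost", port=6379)
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

def get_feature(feature_key: str) -> Optional[str]:
    # 1. Try the Redis cache first for low-latency reads
    cached = cache.get(feature_key)
    if cached is not None:
        return cached.decode()

    # 2. On a cache miss, fall back to DynamoDB
    resp = dynamodb.get_item(
        TableName="feature_store",  # hypothetical table name
        Key={"feature_key": {"S": feature_key}},
    )
    item = resp.get("Item")
    if item is None:
        return None

    value = item["feature_value"]["S"]
    cache.set(feature_key, value, ex=300)  # repopulate the cache with a short TTL
    return value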
For batch-job ML models, they can retrieve the feature data from Hive. If they have an
advanced query then they can also use Elasticsearch. You can read more about how Lyft
uses Elasticsearch (and performance optimizations they’ve made) here.
For more details on their Feature Serving service, you can read the full article here.
Here’s a summary
Etsy’s codebase is a monorepo with over 17,000 JavaScript files, spanning many
iterations of the site.
In order to improve the codebase, Etsy made the decision to adopt TypeScript, a
superset of JavaScript with the optional addition of types. This means that any valid
JavaScript code is valid TypeScript code, but TypeScript provides additional features on
top of JS (the type system).
Based on research at Microsoft, static type systems can heavily reduce the amount of
bugs in a codebase. Microsoft researchers found that using TypeScript or Flow could
have prevented 15% of the public bugs for JavaScript projects on Github.
For example, Airbnb automated as much of their migration as possible, while other companies enabled less-strict TypeScript across their projects and added types to their code over time.
1. How strict do they want their flavor of TypeScript to be? - TypeScript can be
more or less “strict” about checking the types in your codebase. A stricter
configuration results in stronger guarantees of program correctness.
TypeScript is a superset of JavaScript, so if you wanted you could just rename
all your .js files to .ts and still have valid TypeScript, but you would not get
strong guarantees of program correctness.
3. How specific do they want the types they write to be? - How accurately should
a type fit the thing it’s describing? For example, let’s say you have a function
that takes in the name of an HTML tag. Should the parameter’s type be a
string? Or, should you create a map of all the HTML tags and the parameter
should be a key in that map (far more specific)?
1.
2. Add really good types and really good supporting documentation to all of the
utilities, components, and tools that product developers use regularly.
Etsy wanted to set the compiler parameters for TypeScript to be as strict as possible.
The downside with this is that they would need a lot of type annotations.
They decided to approach the migration incrementally, and first focus on typing
actively-developed areas of the site.
Files that had reliable types were given the .ts file extension while files that didn’t kept
the .js file extension.
Before engineers started writing TypeScript, Etsy made sure that all of their tooling
supported the language and that all of their core libraries had usable, well-defined types.
In terms of tooling, Etsy uses Babel and the plugin babel-preset-typescript that turns
TypeScript into JavaScript. This allowed Etsy to continue to use their existing build
infrastructure. To check types, they run the TypeScript compiler as part of their test
suite.
Etsy makes heavy use of custom ESLint linting rules to maintain code quality.
They used the TypeScript ESLint project to get a handful of TypeScript specific linting
rules.
The biggest hurdle to adopting TypeScript was getting everyone to learn TypeScript.
TypeScript works better the more types there are. If engineers aren’t comfortable
writing TypeScript code, fully adopting the language becomes an uphill battle.
Etsy has several hundred engineers, and very few of them had TypeScript experience
before the migration.
The strategy Etsy used was to onboard teams to TypeScript gradually on a team by team
basis.
● Etsy could refine their tooling and educational materials over time. Etsy
found a course from ExecuteProgram that was great for teaching the basics of
TypeScript in an interactive and effective way. All members of a team would
have to complete that course before they onboarded.
● Engineers had plenty of time to learn TypeScript and factor it into their
roadmaps. Teams that were about to start new projects with flexible deadlines
were the first to onboard TypeScript.
Jeremy Cobb is a software engineer at Shopify, where he works on the Contact Center
team. They’re responsible for building the tooling that helps Shopify’s customer service
team deal with all the support inquiries from businesses that use the platform.
He wrote a great blog post on how his team uses Terraform for configuration
management. Terraform is an open source tool that lets you configure your
infrastructure using code.
Here’s a summary
The Contact Center team builds the tooling that Shopify customer service agents use to
handle support requests.
One tool the engineers rely on is Twilio’s TaskRouter service. Twilio is a company that
builds programmable communication tools, so you can use Twilio’s API for sending
emails, text messages, etc.
Shopify uses Twilio TaskRouter to handle routing communication tasks (voice, chat,
etc.) to the most appropriate customer service agent based on a set of routing rules. For
example, users in the US might get sent to a different customer service agent than users
in Canada.
Previously, Shopify would configure these routing rules using Twilio’s website. However,
the complexity of the rules grew and it became too much for a single person to manage.
Having multiple people manage the rules quickly became troublesome because the website doesn’t provide a clear history of changes or a way to roll changes back.
In order to solve this, the Contact Center team decided to use Terraform to manage the
configuration of Twilio Taskrouter.
3. A Client Library - You’ll also want a separate library that the Terraform
Provider can interface with to make API requests to the external
infrastructure API. You could create a Terraform Provider Plugin that makes
the API calls itself, but this is highly discouraged. It’s better to modularize the
API calls in a separate client library.
There was no TaskRouter Terraform Provider available at the time (Twilio has since
developed their own) so the Shopify team built one themselves.
The Provider defines how Terraform should manage Twilio TaskRouter. It contains
resource files for every type of resource in TaskRouter that Terraform has to manage;
each resource file has CRUD instructions that tell Terraform how to manage it.
The Provider also has import instructions that let Terraform import existing
infrastructure. This is useful if you already have infrastructure running and want to start
using Terraform to manage it.
The Shopify team also built a client library that the Terraform Provider would use to
make HTTP calls to Twilio’s API.
Using Terraform
With Terraform set up, Shopify could stop relying on Twilio’s website for configuring
TaskRouter rules and instead write them using HCL (Terraform’s domain specific
language).
This made seeing changes to the infrastructure much easier and allowed Shopify to
integrate software engineering practices like pull requests, code reviews, etc for their
TaskRouter rules.
For more details on how Shopify created the Provider and on how they use Terraform,
you can read the full article here.
Quora relies on MySQL to store critical data like questions, answers, upvotes,
comments, etc. The size of the data is on the order of tens of terabytes (without counting
replicas) and the database gets hundreds of thousands of queries per second.
Vamsi Ponnekanti is a software engineer at Quora, and he wrote a great blog post about
why Quora decided to shard their MySQL database.
MySQL at Quora
Over the years, Quora’s MySQL usage has grown in the number of tables, size of each
table, read queries per second, write queries per second, etc.
In order to handle the increase in read QPS (queries per second), Quora implemented
caching using Memcache and Redis.
However, the growth of write QPS and growth of the size of the data made it necessary
to shard their MySQL database.
At first, Quora engineers split the database up by tables and moved tables to different
machines in their database cluster.
Afterwards, individual tables grew too large and they had to split up each logical table
into multiple physical tables and put the physical tables on different machines.
As the read/write query load grew, engineers had to scale the database horizontally (add
more machines).
They did this by splitting up the database tables into different partitions. If a certain table was getting very large or had lots of traffic, they'd create a new partition for that table. Each partition consists of a master node and replica nodes.
The mapping from a partition to the list of tables in that partition is stored in
ZooKeeper.
3. Replay binary logs from the position noted to the present. This will transfer
over any writes that happened after the initial dump during the restore
process (step 2).
4. When the replay is almost caught up, the database will cutover to the new
partition and direct queries to it. Also, the location of the table will be set to
the new partition in ZooKeeper.
● Replication lag - For large tables, there can be some lag where the replica
nodes aren’t fully updated.
● No joins - If two tables need to be joined then they need to live in the same
partition. Therefore, joins were strongly discouraged in the Quora codebase
so that engineers could have more freedom in choosing which tables to move
to a new partition.
Splitting large/high-traffic tables onto new partitions worked well, but there were still
issues around tables that became very large (even if they were on their own partition).
Schema changes became very difficult with large tables as they needed a huge amount of
space and took several hours (they would also have to frequently be aborted due to load
spikes).
There were unknown risks involved as few companies have individual tables as large as
what Quora was operating with.
MySQL would sometimes choose the wrong index when reading or writing. Choosing
the wrong index on a 1 terabyte table is much more expensive than choosing the wrong
index on a 100 gigabyte table.
Therefore, engineers at Quora looked into sharding strategies, where large tables could
be split up into smaller tables and then put on new partitions.
When implementing sharding, engineers at Quora had to make quite a few decisions.
We’ll go through a couple of the interesting ones here. Read the full article for more.
Quora decided to build an in-house solution rather than use a third-party MySQL
sharding solution (Vitess for example).
They only had to shard 10 tables, so they felt implementing their own solution would be
faster than having to develop expertise in the third party solution.
Also, they could reuse a lot of their infrastructure from splitting by table.
There are different partitioning criteria you can use for splitting up the rows in your
database table.
You can do range-based sharding, where you split up the table rows based on whether the partition key is in a certain range. For example, if your partition key is a 5-digit zip code, then all the rows with a partition key between 70000 and 79999 can go into one shard, and so on.
You can also do hash-based sharding, where you apply a hash function to an attribute of
the row. Then, you use the hash function’s output to determine which shard the row
goes to.
Quora makes frequent use of range queries so they decided to use range-based sharding.
Hash-based sharding performs poorly for range queries.
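As a toy illustration of the tradeoff (the shard boundaries and shard count below are made up), here's a sketch of both strategies: a range query like "all zip codes from 70000 to 79999" touches one shard under range-based sharding but every shard under hash-based sharding.

```python
# Toy illustration of range-based vs. hash-based sharding.
import hashlib

RANGE_SHARDS = [
    (0, 19999, "shard_0"),
    (20000, 39999, "shard_1"),
    (40000, 59999, "shard_2"),
    (60000, 79999, "shard_3"),
    (80000, 99999, "shard_4"),
]

def range_shard(zip_code: int) -> str:
    """Range-based: nearby keys land on the same shard, so a range query
    only has to touch a few shards."""
    for low, high, shard in RANGE_SHARDS:
        if low <= zip_code <= high:
            return shard
    raise ValueError("zip code out of range")

def hash_shard(zip_code: int, num_shards: int = 5) -> str:
    """Hash-based: keys spread evenly, but a range query (e.g. 70000-79999)
    now has to hit every shard."""
    digest = hashlib.md5(str(zip_code).encode()).hexdigest()
    return f"shard_{int(digest, 16) % num_shards}"

print(range_shard(70123), hash_shard(70123))
```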
So, when Quora has a table that is extremely large, they’ll split it up into smaller tables
and create new partitions that hold each of the smaller tables.
1. Data copy phase - Read from the original table and copy to all the shards.
Quora engineers set up N threads for the N shards and each thread copies
data to one shard. Also, they take note of the current binary log position.
2. Binary log replay phase - Once the initial data copy is done, they replay the
binary log from the position noted in step 1. This copies over all the writes
that happened during the data copy phase that were missed.
3. Dark read testing phase - They send shadow read traffic to the sharded table
in order to compare the results with the original table.
4. Dark write testing phase - They start doing dark writes on the sharded table
for testing. Database writes will go to both the unsharded table and the
sharded table and engineers will compare.
If Quora engineers are satisfied with the results from the dark traffic testing, they’ll
restart the process from step 1 with a fresh copy of the data. They do this because the
data may have diverged between the sharded and unsharded tables during the dark
write testing.
They will repeat all the steps from the process until step 3, the dark read testing phase.
They’ll do a short dark read testing as a sanity check.
Then, they’ll proceed to the cutover phase where they update ZooKeeper to indicate that
the sharded table is the source of truth. The sharded table will now serve read/write
traffic.
However, Quora engineers will still propagate all changes back to the original,
unsharded table. This is done just in case they need to switch back to the old table.
For more details, you can read the full article here.
Everyone on the app is watching the same video clips and it’s done live - at 6 p.m. local
time.
Shreyas Hirday is a senior software engineer at Tinder and he wrote a great blog post on
the technology Tinder used to stream the video clips to millions of users simultaneously.
Here’s a summary
There are many ways to deliver video content. The best approach depends on the
tradeoffs you’re making.
Tinder had a few goals for how they'd deliver the video content, including:
● Dynamic - Tinder should have the ability to change the video content at any time.
Based on these goals, Tinder decided to use HTTP Live Streaming (HLS), an adaptive
bitrate protocol developed by Apple.
The video player will dynamically choose the best video version based on the user’s
screen size and bandwidth. It will choose the version that minimizes buffering and gives
the best user experience.
An HLS stream will provide a manifest file to the video player, which includes the URL
to each copy of the video as well as the level of bandwidth the user should have in order
to view that level of quality without issue.
Transcoding
Tinder engineers used FFmpeg to transcode MP4 files into HLS streams. They developed
a workflow that had the MP4 file and configurations (resolution, bitrate, frame rate,
etc.) as input and a directory containing the HLS stream as output.
They had multiple configurations for all the different video versions they wanted and
they stored all these video versions in an AWS S3 bucket.
The main parameters they configured for each version were:
● Frame rate
● Video Resolution
● Segment Length
● Optimizations
You can read the article for a discussion on how they configured each of these
parameters.
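As a rough sketch of what such a transcoding workflow can look like (not Tinder's actual pipeline), here's a Python wrapper around FFmpeg's HLS muxer; the bitrate, resolution, frame rate and segment-length values are illustrative.

```python
# Rough sketch: transcode an MP4 into one HLS rendition with FFmpeg.
import subprocess
from pathlib import Path

def transcode_to_hls(src_mp4: str, out_dir: str, height: int, bitrate: str,
                     fps: int, segment_secs: int) -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = [
        "ffmpeg", "-i", src_mp4,
        "-c:v", "libx264", "-b:v", bitrate,
        "-vf", f"scale=-2:{height}",      # video resolution
        "-r", str(fps),                   # frame rate
        "-c:a", "aac",
        "-hls_time", str(segment_secs),   # segment length
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", f"{out_dir}/seg_%03d.ts",
        f"{out_dir}/index.m3u8",
    ]
    subprocess.run(cmd, check=True)

# One call per rendition; the resulting directories (plus a master manifest)
# would then be uploaded to S3.
transcode_to_hls("clip.mp4", "hls/720p", height=720, bitrate="2000k",
                 fps=30, segment_secs=6)
```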
Validation
The output directory will have a Master Manifest file with information about all the
different video versions in the HLS stream.
The video player will then decide which version to play and whether it should switch to a
lower file-size version based on the information in the manifest file.
Therefore, having an accurate manifest file is very important for the user experience.
Apple provides a Media Stream Validator tool that tests the manifest by simulating a
streaming experience. Tinder uses the results from that test to update the manifest and
ensure accuracy.
Tinder then places the finalized manifest and videos in their production AWS S3 bucket.
Tinder uses AWS Cloudfront, a content delivery network (CDN), to ensure low-latency
streaming for all their users.
Tinder uses 5 key performance indicators (KPIs) to measure how well the streaming
works
The Tinder app measures these KPIs along with metadata about the device and its
network connection.
Tinder then works to find the right balance between these 5 KPIs for their use case. For
a traditional streaming app like Netflix, spending 5 seconds in a buffering state might
not be that bad. But a 5 second buffer on a mobile-centric app like Tinder can feel like
an eternity.
For more details, you can read the full blog post here.
BlackRock's biggest product is Aladdin, the financial industry's most popular software
platform for investment management. Asset managers (banks, pension funds, hedge
funds, etc.) use Aladdin to track profit/loss, manage portfolio risk, make trades, analyze
historical data, etc.
In 2013, the Aladdin platform was used to manage more than 7% of the world's 225
trillion dollars of financial assets (and it's grown since then), so any issues with the
platform can have major consequences on the global financial system.
BlackRock's Site Reliability Engineering team has built a robust telemetry platform to
oversee the health, performance and reliability of Aladdin.
Sudipan Mishra is an engineer on BlackRock's SRE team and he wrote a great blog post
on the architecture of their Telemetry platform.
Here's a summary
All the various components of Aladdin generate large amounts of logs, data, etc.
The Telemetry platform is responsible for aggregating all these reports, displaying them,
and sending alerts to the various Aladdin developers at BlackRock if one of their services
is not performant.
From the collector, these metrics go to a Catalog server, an internally developed service
that manages which metrics should be cataloged.
Some metrics might be too noisy/unnecessary so engineers can remove them from the
Telemetry Catalog UI.
InfluxDB is an open source time series database that BlackRock uses for long-term
storage of all the telemetry metrics.
Grafana is an open source visualization tool that's commonly used alongside Prometheus; it lets engineers produce dashboards and charts to visualize the telemetry metrics.
Alerting Strategy
However, if the SRE team isn't careful about how they implement alerts, they can cause problems like alarm fatigue. To avoid this, BlackRock designed their alerts around three principles:
1. Actionable - Every alert should clearly define what is broken or about to break.
Alerts should also propose the corrective actions to take.
2. Effective - False positives (issuing an alert when there is no incident) and False
negatives (not triggering an alert despite there being an issue) must both be
minimized/eliminated. Otherwise, they can cause mistrust in the alerting system.
3. Impactful - Developers should not be getting alerts for trivial/unimportant
things. Otherwise, developers can get alarm fatigue and accidentally ignore
important alerts.
You can read about some of the options in Google's SRE book. The book is free and
definitely a must-read if you're interested in SRE.
BlackRock went with an error-budget approach: you determine how much error is allowable and set limits around that. As you notice errors in the telemetry, you "burn" against the allowable error budget. Once you surpass that allowable limit (and errors are still coming in), you send an alert.
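Here's a minimal sketch of the idea, assuming an SLO of 99.9% over a 30-day window; this is an illustration of burn-based alerting, not BlackRock's implementation.

```python
# Minimal sketch of error-budget "burn" alerting, assuming a 99.9% SLO
# over a 30-day window.
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60
ERROR_BUDGET = (1 - SLO_TARGET) * WINDOW_MINUTES  # "bad minutes" allowed per window

budget_remaining = ERROR_BUDGET

def send_alert(message: str) -> None:
    print("ALERT:", message)  # in practice: page the on-call, post to chat, etc.

def record_minute(error_rate: float, threshold: float = 0.001) -> None:
    """Burn one unit of budget for every minute the error rate exceeds the SLO."""
    global budget_remaining
    if error_rate > threshold:
        budget_remaining -= 1
        # only alert once the allowable limit is surpassed and errors are still coming
        if budget_remaining <= 0:
            send_alert(f"Error budget exhausted, current error rate {error_rate:.2%}")
```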
BlackRock found this strategy to have a low false-positive rate, a low false-negative rate,
a fast detection time and a very low reset time.
In order to test their alerting strategy, they wrote a script that would let developers
extract metrics from their Prometheus instance for a given time and date range. They
can take those metrics and then backtest their alert strategy to see how many alerts they
would've gotten and whether there would've been any false negatives or false positives.
For more details, you can read the full article here.
Clubhouse experienced viral growth in 2020, peaking in February 2021 when Elon Musk
interviewed Vlad Tenev (the CEO of Robinhood) about the GameStop saga on the app. Since
then, however, usage of the Clubhouse app has plummeted and downloads of the app
have been stagnating.
One of the reasons why is because Clubhouse’s room recommendations (the audio
rooms recommended when you open the app) were pretty poor.
Speaking from my personal usage, the room recommendations were not relevant to my
interests so I’d frequently open the app, find nothing interesting and immediately leave.
Here’s a summary
When you first open the Clubhouse app, you are in the “hallway”. In the hallway are a
bunch of different audio rooms that you can join.
Early on, Clubhouse used simple heuristics to rank the rooms. Heuristics like how many
of your friends were in the room or how closely the room matched with the topics that
you’re following.
Clubhouse has since moved to a machine learning model based on Gradient Boosted Decision Trees (GBDTs). If you're unfamiliar with GBDTs, this is the best article I found that explains them. It starts from decision trees, goes into ensemble learning (a model that makes predictions by combining multiple simpler models) and boosting, and then goes into GBDTs.
The GBDT model at Clubhouse is based on hundreds of different data points like
whether you spend more time in smaller rooms vs. larger rooms, whether you prefer to
speak or just listen in, how many participants are in the room, etc.
The model is trained as a classifier that produces a score between 0 and 1 for each room, where 0 means the room is not relevant to you at all and 1 means it's extremely relevant.
This classification score is then used to rank the rooms in your hallway.
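As a toy sketch of the approach (with made-up features and data, not Clubhouse's model), here's how a gradient boosted tree classifier can be trained with scikit-learn and its 0-1 scores used to rank rooms.

```python
# Toy sketch: train a gradient boosted tree classifier on room features and
# rank rooms by its 0-1 relevance score. Features and data are invented.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Features per (user, room) pair, e.g.
# [num_friends_in_room, room_size, topic_overlap, user_prefers_small_rooms]
X_train = np.array([[3, 20, 0.9, 1], [0, 4000, 0.1, 1], [1, 50, 0.5, 0]])
y_train = np.array([1, 0, 1])   # 1 = user joined/stayed, 0 = user skipped

model = GradientBoostingClassifier().fit(X_train, y_train)

candidate_rooms = {"room_a": [2, 30, 0.8, 1], "room_b": [0, 900, 0.2, 1]}
scores = {room: model.predict_proba([feats])[0, 1]
          for room, feats in candidate_rooms.items()}

# Rank the hallway by predicted relevance, highest score first.
hallway = sorted(scores, key=scores.get, reverse=True)
print(hallway, scores)
```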
Complexities
There are many complexities that arise with the machine learning model. Here are a
couple.
Many of the features (data points) used by the ranking model are slow-moving batch
features that can be computed once every few hours.
However, the model is recommending live rooms that change on a second by second
basis. A celebrity can randomly join a room and that completely changes how the room
should be ranked.
Therefore, engineers also have to incorporate real time data into the recommendation
model.
To do this, they have individual events that fire every time there is a change to a room.
These events are then sent as streaming data to Clubhouse’s recommendation model so
it can incorporate real time information into its recommendations.
They also spin up lightweight stateless microservices that are solely responsible for
model inference. The server will fetch the feature data and then send it to the
microservice responsible for machine-learning inferences. With this set up,
model-inference is isolated from the core server and it can be scaled up/down
independently.
Airbnb's migration from a Ruby on Rails monolith to a microservices-oriented architecture helped them scale their application, but it also introduced new challenges.
One of these challenges was around Airbnb's Continuous Delivery process and how they
had to adapt it to the new services architecture.
Here’s a summary
Previously, Airbnb used an internal service called Deployboard to handle their deploys.
Deployboard worked great when Airbnb was using a Ruby on Rails monolith but over
the past few years the company has shifted to a Microservices-oriented Architecture.
Airbnb needed something more templated, so that each team could quickly get a
standard, best-practices pipeline, rather than building their own service from scratch.
Spinnaker is an open source continuous delivery platform that was developed internally
at Netflix and further extended by Google.
Among other benefits, Spinnaker:
● allows you to easily plug in custom logic so you can add/change functionality without forking the core codebase
Migrating to Spinnaker
Airbnb has a globally distributed team of thousands of software engineers. Getting all of
them to shift to Spinnaker would be a challenge.
They were particularly worried about the Long-tail Migration Problem, where they
could get 80% of teams to switch over to the new deployment system but then struggle
to get the remaining 20% to switch over.
Being forced to maintain two deployment systems can become very costly and is a
reliability/security risk because the legacy system gets less and less
maintenance/attention over time.
Airbnb used a three-pillar strategy to drive adoption:
1. Focus on Benefits
2. Automated Onboarding
3. Provide Data
Focus on Benefits
They did this by first onboarding a small group of early adopters. They identified a set of
services that were prone to causing incidents and switched those teams over to
Spinnaker.
Spinnaker's automated canary analysis quickly demonstrated its value to those teams, as did the other features that Spinnaker provided.
These early adopters ended up becoming evangelists for Spinnaker and spread the word
to other teams at Airbnb organically. This helped increase voluntary adoption.
As more teams started adopting Spinnaker, the Continuous Delivery team at Airbnb
could no longer keep up with demand. Therefore, they started building tooling to
automate the onboarding process to Spinnaker.
They created an abstraction layer on top of Spinnaker that let engineers make changes
to the CD platform with code (IaC). This allowed all continuous delivery configuration to
be source controlled and managed by Airbnb's tools and processes.
Data
The Continuous Delivery team also put a great amount of effort into clearly
communicating the value-add of adopting Spinnaker.
They created dashboards for every service that adopted Spinnaker to show metrics like
number of regressions prevented, increase in deploy frequency, etc.
Final Hurdle
With this 3 pillar strategy, the vast majority of teams at Airbnb had organically switched
over to Spinnaker.
However, adoption began to tail off as the company reached ~85% of deployments on
Spinnaker.
At this point, the team decided to switch strategy to avoid the long-tail migration
problem described above.
1. Stop the bleeding - Stop any new services/teams from being deployed using the
old continuous delivery platform.
2. Announce deprecation date - Announce a deprecation date for the old continuous
delivery platform and add a warning banner at the top.
3. Send out automated PRs - Airbnb has an in-house refactor tool called
Refactorator that helped with making the switch to Spinnaker easier.
Conclusion
With this strategy, Airbnb was able to get to the 100% finish line in the migration.
This migration serves as the blueprint for how other infrastructure-related migrations
will be done at Airbnb.
With Google's Page Experience update, Google would start using their Core Web Vitals metrics as a factor in their page rankings. Sites with poor Core Web Vitals would rank lower in Google search results.
Core Web Vitals are a set of standardized metrics that can measure how good of a user
experience a website is giving. They currently focus on 3 metrics
1. Largest Contentful Paint (LCP) - how many seconds does a website take to
show the user the largest content (text or image block) on the screen? A good
LCP score would be under 2.5 seconds.
2. First Input Delay (FID) - how much time does it take from when a user first
interacts with the website (click a link, tap a button, etc.) to the time when the
browser is able to respond to that interaction? A good FID score is under 100
milliseconds.
3. Cumulative Layout Shift (CLS) - How much does a website unexpectedly shift
during its lifespan? A large paywall popping up 10 seconds after the content
loads is an example of an unexpected layout shift that will cause a negative
user experience. You can read about how the CLS score is calculated here.
Google has come up with two metrics: impact fraction and distance fraction,
and they multiply those two to calculate the CLS score.
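Roughly, each individual layout shift is scored as impact fraction multiplied by distance fraction, and CLS aggregates those per-shift scores over the page's lifetime. A simplified sketch of the per-shift calculation:

```python
# Simplified sketch: the layout shift score for a single shift is
# impact_fraction * distance_fraction.
def layout_shift_score(impact_fraction: float, distance_fraction: float) -> float:
    """impact_fraction: share of the viewport affected by the unstable elements
    (union of their before/after positions).
    distance_fraction: greatest distance any unstable element moved, divided
    by the viewport's largest dimension."""
    return impact_fraction * distance_fraction

# e.g. an element covering half the viewport shifts down by 25% of the
# viewport height, so the union of its positions covers 75% of the viewport.
print(layout_shift_score(0.75, 0.25))  # 0.1875 -> a poor score for one shift
```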
BuzzFeed is a digital media company that covers content around pop culture, movies, tv
shows, etc. and they get a significant part of their traffic from Google Search (more than
100 million visits per month). Having their articles rank high on the google search
results page is extremely critical to their business.
Edgar Sanchez is a software engineer at BuzzFeed, and he wrote a great 3 part series on
how BuzzFeed fixed their Core Web Vitals to meet Google's standards - more specifically, how they brought their Cumulative Layout Shift (CLS) scores up to Google's threshold.
Here's a summary
When Google announced that they’d be factoring Core Web Vitals into PageRank,
engineers at BuzzFeed took note.
They checked their Largest Contentful Paint (LCP), First Input Delay (FID) and
Cumulative Layout Shift (CLS) scores.
Their LCP and FID scores were fine. However, their CLS score was very poor.
Only 20% of visits to BuzzFeed were achieving a “good” experience (a CLS score of less
than 0.1). In order to pass Google’s Core Vitals test, 75% of visits should get a CLS score
of less than 0.1.
The first step in addressing this issue was to improve observability over CLS, so engineers could figure out the cause of the issue. They did this in two ways:
1. Synthetic Monitoring - Run automated lab tests (BuzzFeed used Calibre) that load pages in a controlled environment and measure CLS.
2. Real User Monitoring - Add analytics metrics to the frontend that measure how much CLS real users are experiencing.
They broke their web pages down into independently testable layers to help make tests
more consistent.
● Content Layer - just the page content. So, the article, any quizzes, interactive
embeds etc.
● Feature Layer - Include everything above (page content) but also include complementary units like a comment section, polls, trending feed, etc.
They loaded a couple hundred pages into Calibre and ran tests to figure out what was
causing the CLS issues.
With Synthetic Monitoring, they were quickly able to narrow in on some of the causes
for the issues.
However, even after solving the biggest issues that were apparent from their tests in
Calibre, BuzzFeed was still unable to get their CLS score above Google’s threshold.
With RUM, BuzzFeed would lean on their massive audience (more than 100 million
visits per month), their analytics pipeline and the Layout Instability API.
The Layout Instability API provides 2 interfaces for measuring and reporting layout
shifts so you can send that data to your backend server.
BuzzFeed has an in-house analytics pipeline that they use for keeping track of various
types of real user monitoring data, so they hooked the pipeline up with the Layout
Instability API.
The data travels from the frontend through various filters before being stored in
BigQuery (data warehouse from Google). From there, engineers can run analyses or
export the data to analysis tools like Looker and DataStudio.
Optimizations
Here are some of the common issues BuzzFeed solved that resulted in improvements to
their CLS scores
● Static Placeholder for Ads - BuzzFeed has ads that will change dimensions
depending on which ad is being served. They looked at the most common ad
sizes and created static placeholders for them so the page wouldn’t change
suddenly once an ad was loaded.
The most difficult issue to solve was generating static placeholders for embedded content, since many embeds have no fixed dimensions and are difficult to size accurately.
Embedding a tweet, for example, can vary dramatically in height depending on the
content of the tweet and whether it contains an image/video.
BuzzFeed engineers solved this by gathering embed dimensions from all their pages and
collecting them in their analytics pipeline and eventually in BigQuery.
Now, when a page is requested, the rendering layer will check BigQuery for the
dimensions of the embedded content and add correctly-sized placeholders for the
content.
As new pages get published, the dimensions of any third party embeds on those pages
will be loaded into BigQuery.
For more details, you can read the full series here.
Benchling is a cloud platform for life sciences R&D. Over 200,000 researchers use Benchling as a core part of their workflow when running experiments.
Matt Fox is a software engineer at Benchling and he wrote a great blog post on the
architecture of their Search System.
Here’s a summary
Benchling’s search feature is a core part of the platform. Researchers can use the search
feature to find whatever data they have stored; whether it’s specific experiments, DNA
sequences, documents, etc.
The search system also has full-text search capabilities so you can search for certain
keywords across all the contents that you’ve stored on Benchling.
If the user wanted to perform a CRUD action like creating an experiment or deleting a
project, then that would be carried out using the core Postgres database.
If the user wanted to search for something, then that would be done by sending a search
request to an Elasticsearch cluster that was kept synced with the Postgres database.
Benchling managed the syncing with a data pipeline that copied any CRUD updates to
Postgres over to Elasticsearch.
1. Postgres triggers would fire whenever a searchable item changed (items that were not searchable did not need to be stored in Elasticsearch). The triggers would send the changes to a Task Queue (Celery).
This architecture worked well, but had several pain points with the main one being
keeping Elasticsearch synced with Postgres. There was too much replication lag.
With the Elasticsearch cluster, fast reads were prioritized (so users could get search
results quickly) and data was denormalized when transferred from Postgres.
Denormalization is where you write the same data multiple times in the different
documents instead of using a relation between those documents. This way, you can
avoid costly joins during reads. Data denormalization improves read performance at the
cost of write performance.
These costlier writes meant more lag between updating the Postgres database and seeing that update reflected in Elasticsearch. This could be confusing to users, as someone might create a new project but then not see it appear if they search for the project name immediately after.
The Benchling team was also dealing with a lot of overhead and complexity with
maintaining this search system. They had to maintain both Postgres and Elasticsearch.
Therefore, they decided to move away from Elasticsearch altogether and solely rely on
Postgres.
In 2019, the Benchling team migrated to a new architecture for the Search System that was solely based on Postgres.
Searches were now done as SQL queries against core tables in Postgres, so they were directly accessing the source of truth (hence no replication lag).
For full-text search queries, they used GIN Indexes, which stands for Generalized
Inverted Indexes. An Inverted Index is the most common data structure you’ll use for
full text search (Elasticsearch uses an inverted index as well). The basic concept is quite
similar to an index section you might find at the back of a textbook where the words in
the text are mapped to their location in the textbook.
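As a minimal sketch of the approach (with made-up table and column names, not Benchling's schema), here's what a GIN-backed full-text index and query look like in Postgres, driven from Python with psycopg2:

```python
# Minimal sketch of Postgres full-text search with a GIN index.
# Table/column names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=search_demo")  # assumed local database
cur = conn.cursor()

# Build a GIN index over the tsvector of the searchable text.
cur.execute("""
    CREATE INDEX IF NOT EXISTS notes_fts_idx
    ON notes USING GIN (to_tsvector('english', body));
""")

# Full-text query: use the same expression so the planner can use the GIN index.
cur.execute("""
    SELECT id, title
    FROM notes
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'plasmid & sequence');
""")
print(cur.fetchall())
conn.commit()
```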
This setup worked great for developer productivity (no need to maintain Elasticsearch)
and solved most of the replication lag issues that the team was facing.
However, Benchling experienced tremendous growth during this time period. They
onboarded many new customers and users also started to use Benchling as their main
data platform.
The sheer volume of data pouring into the system was orders of magnitude greater than
they’d seen before.
This caused scaling issues and searches began taking tens of seconds or longer to
execute. Searches were done as SQL queries against Postgres, which made heavy use of
data normalization. This meant costly joins across many tables when doing a search;
hence the slow reads.
Also, the Benchling team faced a lot of difficulty when trying to adapt the system to new
use cases.
The third (and current) iteration of Benchling's search architecture reintroduces Elasticsearch alongside Postgres (you can see the architecture diagram in the full article).
They also made some changes to solve some of the issues from the first search system.
The data inconsistency between Postgres and Elasticsearch was the main problem, and
that was due to replication lag. Benchling addressed this in the third iteration by adding
the option to synchronously index the data into Elasticsearch.
With this, the CRUD actions are first copied into Elasticsearch from Postgres and then
the user is given confirmation that the action was successful.
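Here's a rough sketch of what synchronous indexing can look like with the Elasticsearch Python client; the index and field names are made up, and the call assumes a recent client version:

```python
# Rough sketch of synchronous indexing: the write is only acknowledged to the
# user after Elasticsearch has made the document searchable.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def create_project(project_id: str, name: str) -> None:
    # 1. Write to Postgres (the source of truth) -- omitted here.
    # 2. Index into Elasticsearch and wait for the document to become searchable.
    es.index(
        index="projects",
        id=project_id,
        document={"name": name},
        refresh="wait_for",   # block until the next refresh makes it visible to search
    )
    # 3. Only now confirm success to the user.
```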
They addressed the write amplification issue (due to the denormalization) by tracking
changes at the column level for their Postgres triggers. This greatly reduced the number
of false positives that were being re-indexed.
They’ve also done performance testing and made some changes to their Elasticsearch
cluster topology so they’re comfortable that the system can handle the load of hundreds
of millions of items.
For more details, you can read the full article here.
The BBC's website operates at a massive scale, with over half of the UK's population using the site every week (along with tens of millions of additional users from across the world).
They have content in 44 different languages and have hundreds of different page types
(news articles, food recipes, videos, etc.).
Until a few years ago, the website was written in PHP and hosted on two datacenters
near London. However, the engineering team has rebuilt the website on AWS and used
newer technologies like ReactJS.
The website relies heavily on Functions as a Service (FaaS) for scaling, specifically AWS
Lambda functions.
Jonathan Ishmael is the Lead Technical Architect at the BBC, and he wrote a great series
of blog posts on why the BBC chose serverless and how their backend works.
Here’s a summary
Before getting into the choice of serverless, it’s important to get some context about the
type of workloads that the BBC website has to serve.
Traffic to the website can fluctuate greatly depending on current events, social media
traffic, etc. These events can be predictable (a traffic spike during a national election)
but they can also be random.
During the 2019 London Bridge attack, requests for the BBC’s coverage of the event
resulted in a 3x increase in traffic in a single minute (4,000 req/s to 12,000 req/s).
Within the next few minutes, traffic doubled again (from 12,000 req/s to 20,000 req/s).
If there’s an unexpected, consequential event then the BBC’s article about it can quickly
start trending on social media. This brings a massive amount of traffic.
All traffic to the BBC website goes to the Global Traffic Manager, which is a web server
based on Nginx. This layer handles thousands of requests per second and is run on AWS
EC2 instances.
The layer handles caching, sanitizing requests and forwarding traffic to the relevant
backend services.
The EC2 instances run with 50% reserve capacity available for handling bursts of traffic.
They don’t have a CPU intensive workload, so AWS autoscaling works well for high
traffic events.
The BBC uses ReactJS for their website. They make use of React’s server-side rendering
feature to reduce the initial page load time when someone first visits the website. The
Web Rendering layer is where the server side rendering happens.
Therefore, the BBC relies on AWS Lambda functions for the rendering as they can scale
up much faster. Approximately 2,000 lambdas run every second to create the BBC
website and AWS will automatically provision more compute when there’s a burst of
traffic (there will be a small cold start time discussed below).
Business Layer
The Rendering Layer focuses solely on presentation, and it fetches data through a REST
API provided by the Business Layer.
The BBC has a wide variety of content types (TV shows, movies, weather forecasts, etc.)
and each one has different data / business logic.
The Business Layer is responsible for taking data from all the various BBC backend
systems and transforming it into a common data model for the Web Rendering layer.
The REST API is run on EC2 instances while Lambda functions are used for the
compute-intensive task of transforming data from all the different systems into a
common data model.
The EC2 instances also handle intermediate caching to reduce load on the Lambda
functions.
The last two layers provide a wide range of services and tools that allow content to be
created, controlled, stored and processed.
The BBC team wanted to make sure they were optimizing their serverless functions to
reduce cost and improve user experience. We’ll go through a couple of the things they
did.
Caching
As discussed above, the BBC has two layers that rely on serverless functions: the web
rendering layer and the Business Layer.
However, they made sure to put in an intermediate caching layer between the two
serverless functions to avoid the rendering functions calling any business logic functions
directly.
If they didn’t, then the rendering function would be sitting idle while the business logic
function was working. Serverless functions are billed by GB-seconds (number of seconds
your function runs for multiplied by the amount of RAM consumed), so any time spent
idle is money being wasted.
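A quick back-of-the-envelope calculation shows why that idle time matters; the per-GB-second price below is an illustrative figure, not a quoted AWS rate.

```python
# Quick illustration of why idle time matters with GB-second billing.
PRICE_PER_GB_SECOND = 0.0000166667  # illustrative figure, check current pricing

def invocation_cost(ram_gb: float, duration_s: float) -> float:
    return ram_gb * duration_s * PRICE_PER_GB_SECOND

# A 1 GB rendering function that spends 50 ms working vs. one that also sits
# idle for 450 ms waiting on a slow business-logic call:
busy = invocation_cost(1.0, 0.05)
busy_plus_idle = invocation_cost(1.0, 0.5)
print(busy, busy_plus_idle, busy_plus_idle / busy)  # ~10x the cost per request
```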
The caching layers ensure that most business logic serverless functions can complete in
under 50 milliseconds, reducing idle time for the rendering function.
Memory Profile
When you’re working with Lambda functions, the main configurable parameter is the
amount of RAM each Lambda instance has (from 128 MB to 10 GB).
The amount of memory you select will impact the available vCPUs, which impacts your
response time.
Although the BBC only needed ~200 megabytes for their React app, they found that 1
gigabyte of RAM gave them the optimal price/performance point.
Cold Starts
When a serverless function gets its first request, the cloud platform has to spin up a new instance to run it, which adds a delay known as the cold start time. After the request is done, the cloud platform will keep the instance alive for 15-20 minutes (this differs based on provider) so any subsequent requests will not have to deal with a cold start time.
However, if you have a sudden burst in traffic, your cloud provider will have to spin up
new instances to run your functions on. This means additional cold start times
(although it’s still faster than using EC2 autoscaling).
Factors that impact the cold start time are RAM allocation per Lambda function
(discussed above), size of the code bundle, time taken to invoke the runtime associated
with your code (you can write your function in Java, Go, Python, JavaScript and more),
etc.
You are not charged for any of the compute that happens during the cold start process,
so engineers at the BBC took advantage of this. They used that time to establish network
connections to all the APIs that they needed and also loaded any JavaScript
requirements into memory.
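The BBC's functions are written in JavaScript, but the same pattern applies in any Lambda runtime: do expensive setup in the initialization code that runs during the cold start, outside the handler. A hedged Python sketch (the endpoint and environment variable are assumptions):

```python
# Sketch of doing expensive setup during the init (cold start) phase so warm
# invocations can reuse it. Endpoint and env var names are assumptions.
import os
import urllib.request

# Init phase: runs once per cold start.
API_BASE = os.environ.get("BUSINESS_API_BASE", "https://example.internal/api")
try:
    CACHED_CONFIG = urllib.request.urlopen(f"{API_BASE}/config", timeout=2).read()
except Exception:
    CACHED_CONFIG = b""  # fall back gracefully if the assumed endpoint isn't reachable

def handler(event, context):
    # Warm invocations skip the setup above and only pay for this work.
    return {"statusCode": 200, "body": f"config bytes: {len(CACHED_CONFIG)}"}
```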
Additionally, they optimized the RAM allocated per Lambda to minimize cold start time.
They found that a 512 MB memory profile increased cold start time by 3x compared to a 1 gigabyte memory profile, which is part of the reason why they went with 1 GB of RAM allocated.
They ended up with an average cold start time of ~250 milliseconds, with a peak of 1-2
seconds.
Performance
The BBC is running over 100 million serverless function invocations per day with 90%
of the invocations taking less than 220 milliseconds (for the rendering functions).
For more details, you can read the full article here.
With Chaos Engineering, you simulate different failures across your system using tools like Chaos Monkey, Gremlin, AWS Fault Injection Simulator, etc. and then measure the impact.
These tools allow you to set up simulated faults (like blocking outgoing DNS traffic,
shutting down virtual machines, packet loss, etc.) and then schedule them to run
randomly during a specific time window.
Chaos Engineering is meant to be done as a scientific process, where you follow 4 steps.
1. Define how your system should behave under normal circumstances using
quantitative measurements like latency percentiles, error rates, throughput,
etc.
2. Hypothesize that the system will keep behaving that way even when failures are introduced.
3. Simulate failures that reflect real world events like server crashes, severed network connections, etc.
4. Compare your measurements during the simulated failures against the normal baseline, and use any differences to find and fix weaknesses.
Typically, Chaos Engineering is used for measuring the resiliency of the backend
(usually service-oriented architectures).
However, engineers at Twitch decided to use Chaos Engineering techniques to test their
front-end. The question they wanted to answer was “If some part of their overall system
fails, how does the front-end behave and what do end users see?”
Joaquim Verges is a senior developer at Twitch and he wrote a great blog post on
Twitch’s process for chaos testing.
Twitch is a live streaming website where content creators can stream live video to an
audience. They have millions of broadcasters and tens of millions of daily active users.
At any given time, there’s more than a million users on the site.
Twitch uses a services-oriented architecture for their backend and they have hundreds
of microservices.
The front end clients use a single GraphQL API to communicate with the backend.
GraphQL allows frontend devs to use a query language to request the exact data they’re
looking for rather than calling a bunch of different REST endpoints.
The GraphQL server has a resolver layer that is responsible for calling the specific
backend services to get all the data requested.
The most common fault that happens for their system is one of their microservices
failing. In that scenario, GraphQL will forward partial data to the client and it’s the
client’s job to handle the partial data gracefully and provide the best degraded
experience possible.
Engineers at Twitch decided to use Chaos Engineering to test these microservice failure
scenarios.
To simulate a failure, the client adds a header to its GraphQL request naming a backend service. The GraphQL resolver layer will read this header and stop any call to that service.
The main issue with this approach was that Twitch would need to send the name of the
backend service within the GraphQL header. Therefore, they would have to maintain a
list of all the various backend services to test.
Features and services are constantly changing, so manually mapping specific services to
test was not scalable. They needed a way for the test suite to “discover” the services that
they should be simulating failures for.
To solve this, Twitch added a debug header in their GraphQL calls which enabled
tracing at the GraphQL resolver layer. The resolvers record any method call done to
internal service dependencies, and then send the information back to the client in the
same GraphQL call.
From there, the client can extract the service names that were involved and use that as
input for the Chaos Testing suite.
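Here's a hedged sketch of what that client-side flow could look like; the header names, endpoint, query and response shape are hypothetical, not Twitch's actual API.

```python
# Hedged sketch of the discover-then-fail flow. All names here are hypothetical.
import requests

GRAPHQL_URL = "https://example.com/gql"   # hypothetical endpoint
QUERY = "{ currentUser { id displayName } }"

# 1. Trace a normal call to discover which backend services the query touches.
traced = requests.post(
    GRAPHQL_URL,
    json={"query": QUERY},
    headers={"X-Debug-Trace": "true"},    # hypothetical debug/tracing header
).json()
services = traced.get("extensions", {}).get("tracedServices", [])  # hypothetical shape

# 2. Re-run the same user flow once per discovered service, failing that service.
for service in services:
    resp = requests.post(
        GRAPHQL_URL,
        json={"query": QUERY},
        headers={"X-Chaos-Fail-Service": service},  # hypothetical chaos header
    )
    # The test then asserts the client still renders a sensible degraded experience.
    print(service, resp.status_code)
```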
Twitch has many end-to-end tests for all their clients that test the various user flows
(navigating to a screen, logging in, sending a chat message, etc.)
They try each of these tests with all of the Chaos Mode microservice failures and see
whether the test was successful. Then, they aggregate all the Chaos Mode test results for
each user flow and use that to calculate a resilience score for that particular user action
test.
Resilience scores are displayed on a Dashboard where it’s easy to see any anomalies in
performance. They run Chaos Mode tests every night for their Android, iOS and Web
clients.
Twitch has been able to use this testing tool to boost resilience across all their clients.
Next they want to add the ability to test secondary microservices (services that are called
from another service rather than just testing services that are called directly from the
GraphQL resolver layer).
They also want to add the ability to simulate failures for multiple services at once.
For more details, you can read the full blog post here.
Detection and prevention of fraud is one of the biggest problems fintech companies have
to deal with and PayPal is no exception.
In order to do this, PayPal relies on a graph database. Quinn Zuo is the head of AI/ML
Product Management at PayPal and he wrote a great blog post on how PayPal does this.
Here’s a summary
A graph database management system (graph database) gives you a durable way to store
your graph along with an interface to easily perform create/read/update/delete (CRUD)
operations on the graph.
You can view a comparison of querying for data between SQL and Cypher (Neo4j's graph query language) here.
For example, if you have data about actors and the movies they were involved in, an SQL query for all the directors of Keanu Reeves' movies would require joins across several tables, while the equivalent Cypher query (shown in the comparison linked above) is far more concise.
When looking at graph database technologies, there are two properties you should be
examining: the underlying storage (native vs. non-native storage) and the processing
engine (native vs. non-native processing).
Underlying Storage
This is the underlying structure of the database that contains the graph data.
It can be either native or non-native. Native graph storage means that it’s been built
specifically for storing graph-like data, which means more efficiency when running
graph queries. You can see this with graph databases like Neo4j. For non-native storage,
the graph database will serialize the graph data into relational, key-value,
document-oriented or some other general-purpose data store.
The benefit of non-native graph storage is that you can build your graph database on a
battle-tested backend like Postgres or Cassandra where the scaling characteristics are
well understood. You can take advantage of a Graph API without having to rebuild all
the sharding, replication, redundancy, etc.
PayPal uses Aerospike as the underlying storage for their graph database. Aerospike is
an open source, distributed, key value database.
Processing Engine
The processing engine runs database operations on your graph, and can be split into native or non-native processing.
The key difference between native and non-native processing is index-free adjacency.
Index-free adjacency means that your graph doesn’t have to work with a database index
to hop from any node to its neighboring nodes. Each node has direct addresses to all of
its neighboring nodes. Native graph processing means using index-free adjacency.
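A tiny illustration of index-free adjacency: each vertex holds direct references to its neighbors, so hopping to a neighbor costs O(degree) regardless of graph size, with no index lookup.

```python
# Toy illustration of index-free adjacency.
class Vertex:
    def __init__(self, vid: str):
        self.vid = vid
        self.neighbors = []   # direct pointers to neighboring vertices, no index

    def connect(self, other: "Vertex") -> None:
        self.neighbors.append(other)
        other.neighbors.append(self)

buyer = Vertex("buyer:123")
merchant = Vertex("merchant:456")
buyer.connect(merchant)               # e.g. "sent a payment to"

# Hop from a vertex to its neighbors without consulting any global index.
print([v.vid for v in buyer.neighbors])

# Non-native processing would instead do an index/join lookup, e.g.
# SELECT dst FROM edges WHERE src = 'buyer:123'  -- cost depends on the index
```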
Graph Database
PayPal has a two-sided network with buyers and merchants who are sending each other
transactions.
They encode this network as a graph with buyers/sellers modeled as vertices in the
graph.
Edges are connections between the vertices. Examples of potential connections are
sending a payment, sharing the same IP, having the same home address, etc.
The real-time graph platform will return graph query results very quickly (sub-second).
The returned query results can be used in machine learning models for immediate fraud
prevention. If a malicious actor tries to create a new PayPal account after getting
banned, the real-time graph platform can help identify that user and block his account
right after he creates it.
The Interactive graph platform serves use cases where the query latency can be within a
few seconds or minutes. This is useful for graph visualization and is suitable for
investigations done by PayPal’s fraud-prevention teams.
The analytics graph platform is used to uncover unknown patterns using graph
algorithms and training graph ML models. It’s built on HPCs so that training and
algorithms can be run quickly whereas the interactive and real-time platforms are both
built on commodity servers.
The real time graph platform is used to query the graph database and identify potential
fraudulent activities immediately.
Some of the core requirements for PayPal’s Real Time Graph Platform are
Here’s the architecture for the Real-Time Graph Stack. You can view a larger image
here.
The storage backend for the database uses Aerospike and there’s a Write Path and a
Read Path built around it to perform CRUD operations on the database.
Write Path
The offline channel is set up for loading snapshots of the data and supports daily or
weekly updates.
Event-based, near real-time data comes from a variety of production data services at
PayPal. These data sources have been abstracted as events/messages in Kafka. The
Graph Data Process Service consumes those messages to create new vertices and edges
in the graph database.
Read Path
The Graph Query Service is responsible for handling reads from the underlying
Aerospike data store. It provides template APIs that the upstream services (for running
the ML models) can use.
Those APIs wrap Gremlin queries that run on a Gremlin Layer. Gremlin is an open
source graph query language that can be used for OLTP and OLAP traversals. It’s part of
Apache TinkerPop, which is a popular graph computing framework.
The Gremlin layer converts the queries into optimized Aerospike queries, where they
can be run against the underlying storage.
For more details on PayPal’s Graph viewer and Graph embeddings, you can read the full
article here.
In order to help the rider and driver connect, the Lyft app shows both users a map with
the real time location of the other person.
Lyft relies on GPS data from both the rider's and driver's mobile devices for the real-time position. However, GPS signals are notoriously noisy and unreliable, so relying on them alone would mean inaccurate real-time positioning of the rider and driver, resulting in a poor user experience.
The Mapping team at Lyft solves this issue by using map data to more accurately localize
the driver/rider. This reduces the space of locations to just the roads and makes it much
easier to run map matching algorithms (match the user to the correct position on the
map). You can read about the map matching algorithms that Lyft uses here.
Karina Goot is an Engineering Manager at Lyft and she wrote a great blog post on this
transition.
Here’s a summary
There are quite a few benefits from shifting localization to the client rather than running
it server-side.
● Driver Safety Features - Running localization client side means that map data
has to be on the driver’s phone. Lyft can later use that map data to add in
additional features to the UI like symbols for traffic lights, stop signs, speed
limits, etc.
Lyft cannot put too much map data on the client or that will cause the Lyft app to take
up too much user storage. They also can’t send all the map data through the network as
downloading data while on cellular is expensive and slow.
The Lyft Engineering team had to navigate these technical limitations and designed the
client localization system around them.
Lyft has to send map data to the client devices without taking up too much user storage,
so they can only include data about the user’s local area. This means changing up the
map data format, how it’s generated and how it’s served.
To do this, they used the S2 Geometry library. The S2 library was developed at Google
and made open source in 2017. It represents all data on a three-dimensional sphere,
which models map data better than traditional geographic information systems that
represent data in two dimensions.
The team divided the entire LyftMap into small chunks of S2 Cells. When the client tries
to download map data from the server, it specifies the cell id of the S2 cell and the map
version. The server will then return the map data serialized as S2 Cell Elements.
The client will download the necessary cells based on the user location and dynamically
build the road network graph in memory.
Lyft created a backend service called MapAttributes that reads map element data from DynamoDB based on a geospatial index. The S2 library uses the Quadtree data
structure for geospatial indexing. Here’s a great blog post on how S2 does indexing if
you’d like to learn more. These map elements are serialized and converted to S2 Cell
Elements.
Once the client fetches the desired map cells from the server, it passes this information
through to a C++ localization library on the client.
Lyft drivers generally operate in a single service area so their locality is highly
concentrated. This causes duplicate downloads of the same data, leading to
unnecessarily high network data usage.
To solve this, engineers added an in-memory SQLite caching layer directly in C++. They
used SQLite because of its simplicity and native support on client platforms.
With this cache, they can store the highest locality map data for each driver directly
on-device. By persisting the map data on disk, they can store data across sessions and
only have to refresh the cache when the underlying map data changes.
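Lyft's cache lives in their C++ localization library, but the idea is easy to sketch with Python's built-in sqlite3 module; the schema below is illustrative.

```python
# Simplified sketch of an on-device cache for map cells, keyed by cell id and
# map version. Schema is illustrative, not Lyft's actual layout.
import sqlite3

db = sqlite3.connect("map_cache.db")  # persisted on disk across app sessions
db.execute("""
    CREATE TABLE IF NOT EXISTS map_cells (
        cell_id     TEXT NOT NULL,
        map_version INTEGER NOT NULL,
        elements    BLOB NOT NULL,
        PRIMARY KEY (cell_id, map_version)
    )
""")

def get_cell(cell_id: str, map_version: int, fetch_from_server) -> bytes:
    row = db.execute(
        "SELECT elements FROM map_cells WHERE cell_id = ? AND map_version = ?",
        (cell_id, map_version),
    ).fetchone()
    if row:                                               # cache hit: no network needed
        return row[0]
    elements = fetch_from_server(cell_id, map_version)    # cache miss: download the cell
    db.execute(
        "INSERT OR REPLACE INTO map_cells VALUES (?, ?, ?)",
        (cell_id, map_version, elements),
    )
    db.commit()
    return elements
```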
Based on data analysis of driving patterns, the Lyft team found that they can achieve a
high cache hit rate for the vast majority of Lyft drivers with only 15 megabytes of
on-device data.
In order to track the success of the project, Lyft looked at how often mobile clients have
map data and what the latency of the map matching system was.
They found > 99% on-device map data availability among drivers and sub 10 ms latency
for 99% of map matching computations.
With this project, drivers and riders now have significantly better map localization in
the Lyft app.
However, Airbnb's migration to a service-oriented architecture also brought multiple challenges that the team had to deal with.
Data was now scattered across many different services so aggregating information in the
presentation layer was complicated, especially for complex domains like payments.
Getting all the information on fees, currency fluctuations, taxes, discounts and more
meant that there were far too many different services to call.
Airbnb addressed this by adding a service mesh to provide a unified endpoint to the
client services in the presentation layer. A service mesh is a layer of proxy servers added
to facilitate communication between microservices. You can also add observability,
security and reliability features into the service mesh rather than at the application layer
(in the microservices).
Ali Can Göksel is a senior software engineer at Airbnb and he wrote a great blog post on
how Airbnb re-architected their Payments layer to incorporate Viaduct, a service mesh
built on GraphQL.
Here’s a summary
● Better data separation into the different domains. Data was kept in a
normalized shape (where you reduce data redundancy). This resulted in
better correctness and consistency.
2. The system was difficult to change - When the payments team had to update
their APIs, they had to make sure that all dependent presentation services
adopted these changes.
With this, clients can query the layer for the data entity instead of having to identify
dozens of services and their APIs.
However, just using a single entry point doesn’t resolve all the complexity. Their
payments system has 100+ data models, and exposing all of them from a single entry
point would still be overly complex for client engineers.
To simplify this, they created higher-level domain entities to further hide internal payment details. They made fewer than 10 high level entities, so it became much easier for client teams to find the data they wanted. Also, Airbnb could now make changes to the underlying payment services without forcing every client to adopt those changes.
Improving Performance
As stated earlier, one of the challenges in the previous system was poor performance
and scalability. The complex read flows of fetching the data from all the different
services caused too much latency, especially for large hosts.
The core problem was reading and joining many different tables and services while
executing client queries. To solve this, Airbnb added secondary denormalized
Elasticsearch indices to serve as read replicas.
This moves the expensive operations from query time to ingestion time. Instead of doing
lots of joins during a query, the data has to be written to the replicas during ingestion. It
also sacrifices data consistency due to replication lag.
They created a system where real-time data could be written to the secondary store via
database change data capture mechanisms and historical data could be written through
daily database dumps. They were able to reliably achieve less than 10 seconds of
replication lag.
After combining all of the above improvements, their new payments read flow looked
like the following
For more details, you can read the full post here.
Like many other companies, Dropbox relies on an asynchronous task manager (called
Asynchronous Task Framework or ATF) to manage and run async tasks. When you
remove a file from your Dropbox account, the UI may show that the file was removed immediately; but behind the scenes, an async task is created to delete the file on all the database replicas.
ATF (Asynchronous Task Framework) serves more than 9000 async tasks scheduled per
second, and more than 30 teams at Dropbox make use of the framework.
Arun Sai Krishnan is a Software Engineer at Dropbox, and he wrote a great blog post on
design goals of ATF and the architecture behind it.
Here’s a summary
ATF is a callback-based framework: the callback functions are called lambdas, and developers can write lambdas to execute async tasks like sending out an email to a user.
When an engineer wants to execute a lambda, they can submit it to the ATF. This
creates a task, which is just a unit of execution of a lambda (similar to how a process is a
unit of execution of a program).
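As a purely hypothetical sketch (the names below are invented, not Dropbox's actual API), scheduling a lambda as a task might conceptually look like this:

```python
# Hypothetical sketch of scheduling a lambda as a task; names are invented.
import time

def send_welcome_email(user_id: str) -> None:
    """The lambda: the callback logic a developer writes. It should be
    idempotent, since tasks are executed at least once."""
    print(f"sending welcome email to {user_id}")

# Scheduling creates a task: one unit of execution of the lambda.
task = {
    "lambda": "send_welcome_email",   # which registered lambda to run
    "args": {"user_id": "u_42"},
    "run_at": time.time() + 60,       # execute about a minute from now
    "status": "pending",              # later: claimed -> done / failed
}
# atf_client.schedule(task)           # hypothetical RPC to the ATF frontend
```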
● Task Status Querying - clients can query the status of a scheduled task.
● Frontend - Clients can schedule tasks using remote procedure calls (RPC).
Dropbox uses gRPC with an in-house built RPC framework called Courier.
● Task Store - The frontend accepts tasks and stores them in the task store. This
can be any generic data store that has indexed querying capability. Dropbox
uses their in-house metadata store called Edgestore. It's built on top of
MySQL.
● Store Consumer - The store consumer is a service that will periodically poll
the task store to find tasks that are ready for execution. It pushes these tasks
onto the right queue.
● Queue - Dropbox uses AWS Simple Queue Service (SQS) to queue the tasks.
Worker machines will pull tasks off the SQS queues.
● Heartbeat and Status Controller (HSC) - The HSC serves RPCs for status
updates during task execution and setting task status in the task store after
execution.
● At-least Once Task Execution - tasks will be executed at least once. The ATF
will try and retry tasks until they complete execution or reach a fatal failure
state. This means that a task may get executed multiple times, so developers
have to ensure that their lambda logic is idempotent (can be run multiple
times without changing the result).
● No Concurrent Task Execution - The ATF system guarantees that at most one
instance of a task will be actively executing at any given time, so developers
can write their callback logic without designing for concurrent execution of
the same task from different workers. Before a task starts execution, it will be
marked with a state of “Claimed” so it doesn’t get assigned to another worker
machine.
● Delivery Latency - 95% of tasks begin execution within 5 seconds from their
scheduled execution time. The store consumer polls for ready tasks once every
two seconds. This polling frequency can be configured to change the task
delivery latency.
For more details on ATF’s ownership model, task lifecycle and data model, you can read
the full article here.
Because it serves a variety of different use cases, ZippyDB (Facebook's distributed key-value store) offers users a lot of flexibility, with the option to tune durability, consistency, availability and latency to fit the application's needs.
Sarang Masti is a software engineer at Facebook and he wrote a great blog post about
the design choices and trade-offs made in building the ZippyDB service.
Here’s a summary
Before ZippyDB, various teams at Facebook used RocksDB to manage their data.
RocksDB is a fork of Google’s LevelDB with the goal of improving performance for
server workloads.
However, the teams using RocksDB were each facing similar challenges around
consistency, fault tolerance, failure recovery, replication, etc. They were building their
own custom solutions, which meant an unnecessary duplication of effort.
ZippyDB was created to address the issues for all these teams. It provides a highly
durable and consistent key-value data store with RocksDB as the underlying storage
engine.
They also have TTL (Time to live) support for ephemeral data where clients can specify
the expiry time for a key-value pair.
Architecture
The basic unit of data management for ZippyDB is a shard, where each shard consists of
multiple replicas that are spread across geographic regions for fault tolerance. The
replication is done with either Paxos or async replication (depending on the
configuration).
Within a shard, a subset of the replicas are configured to be part of the Paxos quorum
group, where data is synchronously replicated between those nodes. The write involves
persisting the data on a majority of the Paxos replicas’ logs (so Paxos’ consensus
algorithm will return the new write) and also writing the data to RocksDB on the
primary. Once that’s done, the write gets confirmed to the client, providing highly
durable writes.
The remaining replicas in the shard are configured as followers. These receive data
through asynchronous replication. These replicas handle low-latency reads with the
tradeoff being that they have worse consistency.
The quorum size vs. the number of follower replicas is configurable, and it lets a user
strike their preferred balance between durability, write performance, read performance
and consistency. We’ll talk about ZippyDB’s consistency in the next section.
Each shard has a size of 50 - 100 gigabytes and is split into several thousand
microshards which are then stored on different physical servers. This additional layer of
abstraction allows ZippyDB to reshard the data without any changes for the client.
ZippyDB maps from microshards to shards with two types of mapping: Compact mapping and Akkio mapping.
Compact mapping is used when the assignment is fairly static; the mapping is only changed when there is a need to split shards that have become too large or hot. Akkio mapping, on the other hand, is handled by Akkio, Facebook's data placement service, which places and migrates microshards based on where they're being accessed in order to reduce latency.
By default, a write involves persisting the data on a majority of the Paxos replicas’ logs
and also writing the data to RocksDB on the primary before confirming the write to the
client. Persisting the write on a majority of the Paxos replicas means that the Paxos
Quorum will return the new value.
However, some applications need lower latency writes so ZippyDB also supports a
fast-acknowledge mode where writes are confirmed as soon as they are enqueued on the
primary for replication. This means lower durability.
For reads, the three most popular consistency levels for ZippyDB are
● Eventual
● Read-your-writes
● Strong
Eventual - This is a much stronger consistency level than what’s typically described as
eventual consistency. ZippyDB ensures that reads that are served by follower replicas
aren’t lagging behind the primary/quorum beyond a certain configurable threshold.
Therefore, it’s similar to something like Bounded Staleness that you might see in Azure’s
CosmosDB.
Read-Your-Writes - The client will always read from a replica that is current enough to contain any previous writes made by that client. To implement this, ZippyDB assigns a monotonically increasing sequence number to each write and returns this number in the response to the client’s write request. The client then passes its latest sequence number along with its reads, and the read is only served by a replica that has caught up to at least that sequence number.
Strong - The client will see the effects of the most recent writes. This is done by routing
the read requests to the primary.
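The post doesn't show ZippyDB's client interface, so the sketch below is entirely hypothetical, but it illustrates how a sequence-number-based read-your-writes flow can work:

    class FakeZippyDB:
        """Toy single-node stand-in; real ZippyDB shards and replicas are not modeled here."""
        def __init__(self):
            self.data = {}
            self.seq = 0

        def write(self, key, value):
            self.seq += 1                      # monotonically increasing sequence number
            self.data[key] = (value, self.seq)
            return self.seq

        def read(self, key, min_seq=0):
            value, seq = self.data[key]
            # A real client would re-route to a fresher replica if this check failed.
            assert seq >= min_seq, "replica is lagging behind the client's last write"
            return value

    class ReadYourWritesClient:
        def __init__(self, db):
            self.db = db
            self.last_seen_seq = 0   # highest sequence number returned for our writes

        def put(self, key, value):
            self.last_seen_seq = self.db.write(key, value)

        def get(self, key):
            # Only accept replicas that have applied at least last_seen_seq, so the
            # read reflects every previous write made by this client.
            return self.db.read(key, min_seq=self.last_seen_seq)

    client = ReadYourWritesClient(FakeZippyDB())
    client.put("user:42", "hello")
    print(client.get("user:42"))   # guaranteed to see the write above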
For more details on how ZippyDB implements transactions and conditional writes, you
can read the full article here.
Here’s a summary
On the “easier” side (but still far from trivial to implement) are Offline Distributed
Systems where you take a batch job and split it up across many machines that are
located in close proximity. These systems are frequently used for big data analysis or
high performance computing. You can get almost all the benefits of distributed
computing (scalability and fault tolerance) and avoid much of the downside (complex
failure modes and non-determinism).
In the middle are Soft Real-Time Distributed Systems. These are systems that
must continually produce or update results, but have a relatively generous time window
in which to do so (hence soft real-time). Things like web crawlers, search indexers, ML
training infrastructure, etc. The system can go down for several hours without undue
customer impact.
The most difficult are Hard Real-Time Distributed Systems. These are request/reply services where clients will randomly send requests and expect an immediate reply. Web servers, credit card processors, every AWS API, etc. are all examples of hard real-time distributed systems.
Complexity
Request/reply networking is the main reason why hard, real-time distributed systems
are so challenging. Regardless of what protocols you’re using, using the network means
you’re sending messages from one fault domain to another.
This introduces many steps where something can go wrong. As your systems grow
larger, what had previously been theoretical edge cases will turn into regular
occurrences due to the law of large numbers.
A single request/reply interaction involves eight steps:
1. Post Request - The client puts the request message onto the network.
2. Deliver Request - The network delivers the message to the server.
3. Validate Request - The server validates the message.
4. Update Server State - The server may update its state based on the message.
5. Post Reply - The server puts the reply onto the network.
6. Deliver Reply - The network delivers the reply to the client.
7. Validate Reply - The client validates the reply.
8. Update Client State - The client may update its state based on the reply.
Creating a distributed system means introducing all of these steps into your program. It
turns one step (calling a method or writing to disk) into eight steps that will each fail
with some non-zero probability.
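To make those steps concrete, here's a hedged client-side sketch (the catalog service URL and response format are made up; this is not from the article). Each commented step is a separate place where the call can fail:

    import json
    import urllib.request

    def get_product(product_id: str) -> dict:
        # Hypothetical catalog service; the URL and schema are invented.
        url = f"https://catalog.example.com/products/{product_id}"
        try:
            # Steps 1-6 all hide inside this one call: the request is posted and
            # delivered, the server validates it, updates its state, and posts a
            # reply that the network must deliver back. Any of these can fail.
            with urllib.request.urlopen(url, timeout=2) as resp:
                body = resp.read()
        except OSError as err:
            # We can't tell whether the request or the reply was lost, so a
            # retry may execute the server-side work a second time.
            raise RuntimeError(f"request failed or reply lost: {err}") from err

        # Step 7, validate reply: the server might return garbage, for example
        # the empty (and suspiciously fast) responses in the story quoted below.
        if not body:
            raise RuntimeError("empty reply from catalog service")

        # Step 8, update client state: only now do we mutate local state.
        return json.loads(body)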
With a single machine, you don’t have to test for conditions where the CPU dies. If the
CPU dies on your laptop, then those test conditions obviously won’t be processed
anyway.
However, in hard real-time distributed systems, the client, network and server do not
share fate. One of the machines can die on the backend but the other machines, the
client and the network will still function as normal.
This means testing for all possible failure scenarios and controlling for code behavior
during these faults. The increased number of failure modes multiplies the number of test conditions.
Each of those eight steps in request/reply networking introduces possible failure modes,
and building distributed systems at scale means you have to test for all of them and
handle all the permutations.
Distributed bugs (those that result from failing to handle all the permutations of the
eight failure modes) are usually severe and can be caused by bugs that were deployed to
production months earlier.
It takes a while to trigger the exact combination of scenarios that lead to these bugs
happening, hence the delay.
This is especially true since distributed systems will have multiple layers of abstraction.
Your system usually won’t just be a single client, a network and a single server machine.
Instead, the backend will consist of multiple machines grouped together across different
geographic regions.
“The failure was caused by a single server failing within the remote
catalog service when its disk filled up. Due to mishandling of that error
condition, the remote catalog server started returning empty responses
to every request it received. It also started returning them very quickly,
because it’s a lot faster to return nothing than something (at least it was
in this case). Meanwhile, the load balancer between the website and the
remote catalog service didn’t notice that all the responses were
zero-length. But, it did notice that they were blazingly faster than all the
other remote catalog servers. So, it sent a huge amount of the traffic from
www.amazon.com to the one remote catalog server whose disk was full.
Effectively, the entire website went down because one remote server
couldn’t display any product information.”
For more details, you can read the full blog post here.
PayPal acquired Braintree in 2013, so the company comes under the PayPal umbrella.
One of the APIs Braintree provides is the Disputes API, which merchants can use to
manage credit card chargebacks (when a customer tries to reverse a credit card
transaction due to fraud, poor experience, etc).
The traffic to this API is highly irregular and difficult to predict, so Braintree uses
autoscaling and asynchronous processing where feasible.
One of the issues Braintree engineers dealt with was the thundering herd problem where
a huge number of Disputes jobs were getting queued in parallel and bringing down the
downstream service.
Anthony Ross is a senior engineering manager at Braintree, and he wrote a great blog
post on the cause of the issue and how his team solved it with exponential backoff and
by introducing randomness/jitter.
Here’s a summary
Braintree uses Ruby on Rails for their backend and they make heavy use of a component
of Rails called ActiveJob. ActiveJob is a framework to create jobs and run them on a
variety of queueing backends (you can use popular Ruby job frameworks like Sidekiq,
Shoryuken and more as your backend).
This makes picking between queueing backends more of an operational concern, and
allows you to switch between backends without having to rewrite your jobs.
Merchants interact via SDKs with the Disputes API. Once submitted, Braintree
enqueues a job to AWS Simple Queue Service to be processed.
ActiveJob then manages the jobs in SQS and handles their execution by talking to
various Processor services in Braintree’s backend.
The Problem
Braintree set up the Disputes API, ActiveJob and the Processor services to autoscale
whenever there was an increase in traffic.
Despite this, engineers were seeing a spike in failures in ActiveJob whenever traffic went
up. They have robust retry logic set up so that jobs that fail will be retried a certain
number of times before they’re pushed into the dead letter queue (to store messages that
failed so engineers can debug them later).
The retry logic had ActiveJob attempt the retries after a set time interval, but the retries kept failing. These failed jobs would then retry on a static interval, landing at the same time as new jobs coming in from the increased traffic, and together they would trample the downstream service again.
This created a cycle that kept repeating until the retries were exhausted and eventually
DLQ’d (placed in the dead letter queue).
To solve this, Braintree used a combination of two tactics: Exponential Backoff and
Jitter.
Exponential Backoff is an algorithm where you reduce the rate of requests exponentially
by increasing the amount of time delay between the requests.
The equation you use to calculate the time delay looks something like this: delay = base delay * 2^(number of retries so far), usually capped at some maximum delay.
With this, the amount of time between requests increases exponentially as the number
of requests increases.
By just using exponential backoff, the retries + new jobs still weren't spread out enough
and there were clusters of jobs that all got the same sleep time interval. Once that time
interval passed, these failed jobs all flooded back in and trampled over the service again.
Jitter is where you add randomness to the time interval between the requests that you’re applying exponential backoff to.
To prevent the requests from flooding back in at the same time, you’ll spread them out
based on the randomness factor in addition to the exponential function. By adding jitter,
you can space out the spike of jobs to an approximately constant rate between now and
the exponential backoff time.
Here’s an example of calls that are spaced out by just using exponential backoff.
The time interval between calls is increasing exponentially, but there are still clusters of
calls between 0 ms and 250 ms, near 500 ms, and then again near 900 ms.
In order to smooth these clusters out, you can introduce randomness/jitter to the time
interval.
Now, the calls are much more evenly spaced out and there’s an approximately constant
rate of calls.
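Here's a minimal sketch of exponential backoff with full jitter, written in Python rather than Braintree's Ruby and using assumed base/cap values rather than their real configuration:

    import random

    BASE_DELAY = 1.0     # seconds (assumed value)
    MAX_DELAY = 300.0    # cap on the backoff (assumed value)

    def backoff_with_jitter(attempt: int) -> float:
        """How long to sleep before retry number `attempt` (0-indexed)."""
        # Exponential backoff: the delay grows as base * 2^attempt, up to MAX_DELAY.
        exp_delay = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
        # Full jitter: pick a random point between 0 and the exponential delay, which
        # spreads retries out instead of clustering them at the same instants.
        return random.uniform(0, exp_delay)

    for attempt in range(5):
        print(f"retry {attempt}: sleep {backoff_with_jitter(attempt):.2f}s")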
For more details, you can read the full article by Braintree here.
Here’s a good article on Exponential Backoff and Jitter from the AWS Builders Library,
if you’d like to learn more about that.
Twitch’s Video Ingest team is responsible for developing the distributed systems and services that, among other things, provide a high-throughput control plane to make live video available for worldwide distribution with low latency.
Eric Kwong, Kevin Pan, Christopher Lafata and Rohit Puri are software engineers on the
Video Ingest team and they wrote a great blog post on their infrastructure/architecture,
problems they encountered and solutions they employed.
Here’s a summary
Twitch ingests live video through Points of Presence (PoPs) located around the world. These PoPs are connected through Twitch’s private Backbone Network, which is dedicated to transmitting their content. Relying on the public Internet would make them susceptible to bottlenecks and instability, so instead 98% of all Twitch traffic remains on their private network.
Between the PoPs are origin data centers, which are also geographically distributed.
These origin data centers handle tasks around video processing (like transcoding a
livestream into different bitrates/formats for all the various devices that viewers may be
using).
Previously, all the PoPs ran HAProxy (a reverse proxy that is commonly used for load
balancing) for forwarding the video streams to the origin data centers. However, Twitch
faced several issues with this approach as they scaled.
Inefficient Usage of Origin Data Center Resources - Each PoP was configured to send its video streams to a specific origin data center (located in the same geographic area as the PoP). This meant that the origin data centers for a region ran at full load during that area's busy hours but sat mostly idle outside of them. A region going through its busy hours couldn't borrow the idle capacity of another region's origin data centers.
Difficult to Handle Unexpected Changes - The relatively static nature of the HAProxy
configuration also made it difficult to handle unexpected surges of live video traffic.
Reacting to system fluctuations like the loss of capacity of an origin data center was also
very difficult.
Twitch decided to revamp the software in their PoPs and completely retire HAProxy.
To replace it, they developed Intelligest, a proprietary ingest routing system that could
intelligently distribute live video ingest traffic from the PoPs to the origins.
The Ingest architecture consists of two components: the Intelligest Media Proxy and the
Intelligest Routing Service (IRS). The Intelligest Media Proxy is a data plane component
so it runs in all the PoPs and sends the video streams to various origin data centers. The
Intelligest Routing Service is a control plane and tells the Intelligest Media Proxy which
origin data center to send the video to.
When a broadcaster starts streaming, their computer will transmit video to the nearest
Twitch Point of Presence (PoP) server. The Intelligest Media Proxy is running on that
PoP and it will extract all the relevant metadata from the stream.
It will then query the Intelligest Routing Service (IRS) and ask which origin data center
it should route the video stream to. The IRS service has a real-time view of all of
Twitch’s infrastructure and it will make a routing decision based on minimizing latency
for the viewers and maximizing utilization of compute resources in all the origins.
The IRS service will send its decision back to the Intelligest Media Proxy, which can
then route the video stream to the selected origin data center.
To make these decisions, the IRS relies on two supporting services: Capacitor and The Well. Capacitor monitors the compute resources in every origin and keeps track of any capacity fluctuations (due to maintenance or failures). The Well monitors the backbone network and provides information about the status of network links so that latency issues are minimized.
The IRS service uses a randomized greedy algorithm to compute routing decisions based
on compute resources available, backbone network bandwidth and other factors.
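The post doesn't spell the algorithm out, so the sketch below is only a rough illustration of what a randomized greedy routing choice could look like; the data, field names and scoring function are all invented:

    import random

    # Hypothetical snapshot of origin data centers (fields are assumptions).
    ORIGINS = [
        {"name": "origin-a", "free_compute": 0.30, "network_headroom": 0.50, "latency_ms": 20},
        {"name": "origin-b", "free_compute": 0.70, "network_headroom": 0.40, "latency_ms": 35},
        {"name": "origin-c", "free_compute": 0.55, "network_headroom": 0.80, "latency_ms": 60},
    ]

    def score(origin):
        # Favor origins with spare compute and backbone bandwidth, penalize latency.
        return origin["free_compute"] + origin["network_headroom"] - origin["latency_ms"] / 100.0

    def pick_origin(origins, sample_size=2):
        # Randomized: consider a random subset of origins.
        # Greedy: take the best-scoring candidate from that subset.
        candidates = random.sample(origins, min(sample_size, len(origins)))
        return max(candidates, key=score)

    print(pick_origin(ORIGINS)["name"])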
For more details, you can read the full blog post here.
Photos are among the most common types of files uploaded to Dropbox. The Dropbox
app allows users to set up camera sync so that any photo they take on their smart phone
will automatically get synced and stored in their Dropbox account.
To make it easier for people to find their photos, Dropbox built an image search feature
where you can search for objects/scenery/action and Dropbox will find images that
contain what you searched for.
For example, if you search for “picnic”, then Dropbox will find your images that contain
a picnic.
Here's a summary
Dropbox’s Approach
To build this, Dropbox relies on two areas of machine learning
● Image Classification
● Word Vectors
Image Classification
An image classifier takes in the pixel values of an image and outputs a list of things that
the image contains (where each of these things is a category that the classifier is trained to recognize).
There has been tremendous progress over the last 10 years in image classification with
the innovations in deep learning, specifically convolutional neural networks. Model
architecture improvements, better training methods, large datasets, faster GPUs and
more have resulted in image classifiers that can recognize thousands of different
categories with extremely high accuracy.
For Dropbox’s image search, their image classifier is an EfficientNet network trained on
the OpenImages dataset. This produces classification scores for ~8500 categories
ranging from grapes to telephone to picnic and much more.
However, an issue that comes up with image classification is synonyms. What if a user
searches for seashore but the image classifier is trained on the term beach?
To handle synonyms, Dropbox uses word vectors (word embeddings). The idea is that you have a vector space with hundreds of dimensions (your standard X-Y cartesian coordinate system could be viewed as an example of a 2 dimensional vector space).
Then, you use a neural network to map every word to a vector in the vector space. For
each word, the neural network will assign a number for each of the hundreds of different
dimensions.
You train the neural network so that words with similar meanings will be assigned
vectors that are close to each other in vector space. Here's a great article that dives
deeper into Word2vec.
Dropbox used the ConceptNet Numberbatch pre-computed word embeddings. This gave
them good results in their testing and the word embeddings also support multiple
languages, so they return close vectors for words in different languages with similar
meanings. This makes supporting image search in multiple languages much easier.
If there’s a multi-word image search, Dropbox parses the query as an AND of the
individual words. They also maintain a list of multi-word terms like beach ball that they
can match for.
Production Architecture
When a user submits a search query, it’s obviously not possible to immediately run
image classification on all of their images. Users can have tens of thousands of images in
their dropbox account, so that solution would be way too slow.
Instead, Dropbox uses an Inverted Index data structure (also used by many Full-Text search engines like Elasticsearch). You can think of an Inverted Index as very similar to the index section at the back of a textbook, which contains a list of all the words in the book along with the pages where each word appears.
Dropbox will scan through all the images in a user’s account and run their image
classification algorithm to find all the categories (things) that appear in that image. They
convert whatever categories are found into the corresponding word embedding vectors.
Then, they create an Inverted Index where for each category, they have a list of images
that contain that thing.
When a user searches for a word, Dropbox will first find the word vector for that term.
Then, they'll find the closest word vectors that are categories for the image classifier.
They'll query the inverted index to find the matching images for these categories and
then rank the matching images based on how strongly the classifier scored each category
in the image.
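Here's a toy, hedged version of that query path; the embeddings, categories, scores and similarity threshold are all invented, and Dropbox's real index and ranking are far more involved:

    import math

    # Toy word vectors (2-D here; real embeddings have hundreds of dimensions).
    WORD_VECTORS = {
        "beach":     (0.90, 0.10),
        "seashore":  (0.85, 0.15),
        "telephone": (0.10, 0.90),
    }

    # Inverted index: classifier category -> images containing it, with classifier scores.
    INVERTED_INDEX = {
        "beach":     [("img_1.jpg", 0.92), ("img_7.jpg", 0.65)],
        "telephone": [("img_3.jpg", 0.88)],
    }

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    def search(query, min_similarity=0.95):
        qvec = WORD_VECTORS[query]
        results = []
        # Find classifier categories whose word vectors are close to the query word.
        for category, vec in WORD_VECTORS.items():
            if category in INVERTED_INDEX and cosine(qvec, vec) >= min_similarity:
                # Rank matches by how strongly the classifier scored the category.
                results.extend(INVERTED_INDEX[category])
        return sorted(results, key=lambda pair: pair[1], reverse=True)

    print(search("seashore"))  # finds beach images even though "seashore" isn't a category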
For more details and some additional optimizations Dropbox made, you can view the
full article here.
Instagram launched a Suggested Posts feature where they recommend posts a user may
enjoy from accounts the user isn’t following and they place those posts in the user’s feed.
The goal is to make it easier for users to find new accounts to follow.
Amogh Mahapatra is a machine learning engineer at Meta and he wrote a great blog
post on how Instagram implemented this feature.
Here’s a summary
Instagram’s suggested posts feature will find photos/video posts that you may like from
accounts that you don’t follow. This results in you finding more content you like,
following more accounts and spending more time on Instagram.
This feature is an example of the Information Retrieval problem, where you have a large
set of documents (Instagram posts) and you want to find certain documents based on a
set of criteria.
Information retrieval systems are typically broken into two stages:
1. Candidate Generation - based on the user’s interests, fetch all the candidates
that a user could be interested in. In this case, Instagram is looking for all the
possible posts from accounts the user doesn’t follow that he/she may be
interested in.
2. Candidate Selection/Scoring - rank the candidates and select the best subset
to show to the user. In this scenario that means looking at the potential posts
from the candidate generation stage and selecting the best few posts that will
be shown to the user as Suggested Posts in their feed.
Instagram also uses a technique called Co-occurrence Similarity, which comes from
frequent pattern mining. They look at user data to see what media users are engaging
with and look for any co-occurring accounts (accounts that also get engagement from
those users). Then, they calculate co-occurrence frequencies of media pairs and use
them for Candidate Generation. For example, there may be a lot of users who like posts
from the Golden State Warriors and also from the Los Angeles Lakers (two NBA teams).
Users who follow one team and not the other might benefit from getting the other
team’s posts as Suggested Posts.
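As a rough illustration of co-occurrence counting (not Instagram's implementation; the engagement data is invented), you can count how often pairs of accounts are liked by the same users:

    from collections import Counter
    from itertools import combinations

    # Invented engagement data: user -> accounts whose posts they liked.
    likes = {
        "user_1": {"warriors", "lakers", "nba"},
        "user_2": {"warriors", "lakers"},
        "user_3": {"warriors", "nba"},
    }

    pair_counts = Counter()
    for accounts in likes.values():
        # Count every pair of accounts engaged with by the same user.
        for a, b in combinations(sorted(accounts), 2):
            pair_counts[(a, b)] += 1

    # Accounts that frequently co-occur with "warriors" become candidate sources of
    # Suggested Posts for users who follow the Warriors but not the other account.
    print(pair_counts.most_common(3))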
● Popular Media - For extremely new users who don’t follow anyone / haven’t
engaged with any content, Instagram will recommend posts that are popular
with the general instagram user base. The recommendation algorithm can
then adjust based on the user’s response to those initial posts.
● Fallback Graph Exploration - If a user hasn’t engaged with any content but
follows other accounts, Instagram will generate candidates for them by
evaluating their one-hop and two-hop connections. They’ll look at accounts
followed by the user and see what posts those accounts liked and use that to
generate candidates.
To do this, Instagram uses a ton of different data points and various machine learning models (many of the data points are themselves generated using ML models). Examples include:
● User embeddings
● Log-linear models
In order to select the best models, hyperparameters, etc., Facebook relies on online A/B
testing and offline simulations. The offline simulations work by replaying a user’s
actions (their likes, comments, shares, etc.) to different models and training them to
predict the user’s actions. Then, these engagement prediction models can be used to
evaluate candidate ranking models.
Offline simulation can’t replace A/B testing since there are many behavioral dynamics
that are too complicated to model, but it provides a higher throughput alternative to
quickly evaluate model performance. You can read more about offline simulation at
Meta here.
For more details on Instagram’s Suggested Post feature, read the full article here.
Saral Jain is the Senior Director of Engineering at Snap Inc where he leads the Cloud
Infrastructure, Data and IT organizations. He gave a great interview on the AWS series
This is my Architecture.
He discussed the process of what happens on the backend when a user sends/receives a
snap on the app (sends or receives an image/video). This is for a video series by AWS, so
unfortunately he only talks about the AWS architecture.
Here’s a summary
For their AWS stack, Snap runs their backend on Elastic Kubernetes Service (EKS) and
they use more than 900 EKS clusters where many of the clusters have 1000+ instances.
Two of the core components he discusses are:
● Friend Graph
● Snap DB
When a user sends a snap from their mobile device, their phone will talk to Snapchat’s
API Gateway.
The Gateway will communicate with the Media Delivery Service to send the
picture/video to AWS CloudFront (AWS’ Content Delivery Network) and also persist it
in S3. The media is assigned a media ID that can later be used to reference it.
The Gateway also works with an Orchestration service, which first checks that the sender has permission to send the snap to the recipient. If the permissions check passes, the Orchestration service will persist the conversation metadata (including the media ID) into Snap DB. Snap DB is Snapchat’s custom database that is built on top of DynamoDB (a proprietary NoSQL database by AWS). They store nearly 400 terabytes of data in DynamoDB.
The team created their own database as a frontend to DynamoDB to add higher level
features to meet Snap’s specific use cases. Snap has to deal with a lot of ephemeral data
so they added optimizations for that and also TTL and custom transactions to reduce
costs.
For receiving a snap, the orchestration service will look up a connection ID from ElastiCache to get access to the persistent connection that Snap servers have with clients who have the app open.
The service looks at the conversation metadata to get the media ID of the picture/video.
The content is retrieved from CloudFront and then sent to the recipient’s device.
If the recipient doesn’t have the app open, then Snapchat relies on Apple Push
Notification Service or Firebase Cloud Messaging.
For more details, you can watch the full video here.
There are a ton of different tools you can use for an API Gateway like Nginx, Zuul (by
Netflix), Envoy (by Lyft), offerings from all the major cloud providers, etc.
One feature that many API gateways provide is load shedding, where you can configure
the gateway to automatically drop certain requests and ignore them. This is crucial for
times when you face a spike in traffic or if something’s wrong with your backend (and
you can’t handle the usual traffic).
Netflix built and maintains a popular API Gateway called Zuul and they gave a great talk
at AWS Re:Invent 2021 about how they designed and tested Zuul’s prioritized load
shedding feature for their internal use.
Here’s a summary
Despite all the effort Netflix engineers put into developing resiliency, there are still
many different incidents that degrade user experience.
Whether it’s something like under-scaled services, network blips, cloud provider
outages, bugs in code or something else, engineers need to ensure that the end user
experience is minimally affected. Netflix is a movie/tv-show streaming website, so this
means that users should still be able to stream their movies and TV shows on their
phone/laptop/TV/gaming console.
To ensure high availability, Netflix uses the load-shedding technique, where low priority requests are dropped when the system is under severe strain. For example, as a user browses titles, the Netflix app sends requests to prefetch trailers/preview videos so they can play while browsing; these requests are far less important than the requests needed to actually start streaming a title. If the system is under severe strain, Netflix will drop these low priority requests and the client will fall back on just displaying the show’s image and not playing any trailer. This doesn’t result in a severe degradation in user experience and it allows the system to prioritize requests that directly relate to the streaming experience for users.
In order to implement prioritized load shedding, Netflix engineers went through three steps.
Netflix created a scoring system from 0 to 100 that assigns a priority to each request, with 0 being the highest priority and 100 being the lowest priority. The score is computed from several dimensions of the request, such as:
● Request State - Was the request initiated by the user? Or was the request initiated by the Netflix app?
Using these dimensions, the API gateway assigns a priority score to every request that comes in.
The first decision was where to implement the load shedding algorithm. Netflix decided
to put the logic in their API Gateway, Zuul.
When a request comes in to Zuul, the first thing that the gateway does is execute a set of
inbound filters. These filters are responsible for decorating the incoming request with
extra information. This is where the priority score is computed and added to the
request.
With the priority score information, Zuul can now do global throttling. This is where
Zuul will throttle requests below a certain priority threshold. This is meant to protect
the API gateway itself. The metrics used to trigger global throttling are concurrent
requests, connection count and CPU utilization.
Netflix also implemented service throttling, where they can load shed requests for specific microservices that Zuul is talking to. Zuul will monitor the error rate and concurrent requests for each of the microservices. If a threshold is crossed for those metrics, Zuul will start shedding low priority requests destined for that specific service.
In order to calculate the priority threshold for shedding, Netflix uses a cubic function of the overload percentage. When the overload percentage is at 35%, Netflix will shed any requests with a priority score above 95. When the overload percentage reaches 80%, the API Gateway will shed any request with a priority score greater than ~50.
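The talk only quotes a couple of data points, so treat the exact formula below as an assumption, but this cubic curve is consistent with the numbers above (shedding priorities above ~95 at 35% overload and above ~50 at 80% overload):

    def shed_threshold(overload_pct: float) -> float:
        """Requests with a priority score above this threshold get shed
        (0 = highest priority, 100 = lowest priority)."""
        # Cubic curve: shed almost nothing at low overload and shed
        # aggressively as overload approaches 100%.
        return 100.0 - (overload_pct ** 3) / (100.0 ** 2)

    for pct in (0, 35, 80, 100):
        print(pct, round(shed_threshold(pct), 1))
    # prints: 0 -> 100.0 (shed nothing), 35 -> 95.7, 80 -> 48.8, 100 -> 0.0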
To validate the load shedding behavior, Netflix routinely runs chaos experiments in their production environment and has built tools like Chaos Monkey and ChAP (Chaos Automation Platform) to make this testing easier.
Engineers staged an A/B test that allocates a small number of production users to either a control or a treatment group for 45 minutes. During that time period, they throttle a range of priorities for the treatment group and measure the impact on playback experience.
This allows Netflix to quickly determine how the load shedding system is performing
across a variety of client devices, client versions, locations, etc.
For more details, you can watch the full talk here.
Facebook needs a way to pass work between services using a distributed priority queue, and Facebook Ordered Queueing Service (FOQS) is the internal tool that fills that role. FOQS is a horizontally scalable, persistent, distributed priority queue that’s built on top of sharded MySQL.
Akshay Nanavati and Girish Joshi are two software engineers at Facebook, and they
wrote a great blog post on how FOQS works and the architecture behind it.
Here’s a Summary
FOQS is a general purpose priority queue so hundreds of different services across the
Facebook stack rely on it to pass messages. Facebook’s video encoding service, language
translation technologies and notification services are a few examples.
Producer services will enqueue items on to FOQS to be processed. These items can have
a priority and also a delay (if the item processing needs to be deferred). The item will
have a topic, where each topic is a separate priority queue.
Consumer services can dequeue items from a certain topic and process them. If the
processing succeeds, they send an “ack” message back to FOQS. If the processing fails,
then they send a “nack” message back and the items will be redelivered from the priority
queue at a later time.
FOQS is organized into namespaces where each namespace has many topics and each
topic has many items.
Namespaces provide a way to separate all the different services/use-cases that are using
FOQS. Each namespace will have many (thousands) of topics, where each topic
represents a single priority queue.
Clients will enqueue and dequeue items to a topic where an item represents a message
with some user specified data.
Each item will have fields for the namespace, topic, priority (a 32 bit integer), payload
(an immutable 10 kilobyte blob), metadata, delivery delay (how long until the item can
be dequeued) and a few other fields.
The main APIs FOQS exposes are:
● Enqueue - Adds an item to a topic; FOQS responds with a unique ID for the item once it has been durably stored.
● Dequeue - Accepts a topic and a number, where the number signifies how many items to return from the topic. Items are returned based on priority and delivery delay.
● Ack - Sends a message that the dequeued item was successfully processed, so it doesn’t need to be delivered again.
● Nack - Signals that processing failed and that the item should be redelivered at a later time.
When a client enqueues an item to FOQS, the request gets put on an Enqueue Buffer
and FOQS returns a promise back to the client.
FOQS is built on top of sharded MySQL and each shard has a corresponding worker
node. The workers are reading items from the Enqueue Buffer and inserting them into
their MySQL shard where one database row corresponds to one item.
Once the row insertion is complete, the promise is fulfilled and an enqueue response is
sent back to the client. The response contains a unique string that contains the MySQL
shard’s ID and a 64-bit primary key (that identifies the item in its shard).
FOQS uses a circuit breaker design pattern to avoid sending items to unhealthy MySQL
shards. Health is defined by slow queries or error rate; if either of those crosses a threshold, then the corresponding worker will stop accepting more work until it’s
healthy.
The dequeue API accepts a collection of (topic, count) pairs where count represents the
number of items to return from the topic. The items returned are ordered by priority.
Since each topic is sharded, each topic host will need to run a reduce operation across all
the MySQL shards for that topic to find the highest priority items and select those.
To optimize this, FOQS has a data structure called the Prefetch Buffer that works in the
background and fetches the highest priority items across all the shards.
Each shard has an in-memory index of the primary keys of items that are ready to be
delivered on the shard, sorted by priority. The Prefetch Buffer will build its own priority
queue from these indexes using a K-way merge.
The dequeue API just has to read items out of the Prefetch Buffer and return them to the
client.
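A minimal sketch of the K-way merge idea using Python's heapq (FOQS's Prefetch Buffer is obviously not implemented this way; the shard data here is invented):

    import heapq

    # Each shard keeps an in-memory index of ready items sorted by priority
    # (lower number = higher priority). These are invented example shards.
    shard_indexes = [
        [(1, "item-a"), (5, "item-d"), (9, "item-g")],
        [(2, "item-b"), (6, "item-e")],
        [(3, "item-c"), (4, "item-f")],
    ]

    # K-way merge the per-shard sorted indexes into one priority-ordered stream,
    # which is roughly what the Prefetch Buffer hands to the dequeue API.
    prefetch_buffer = heapq.merge(*shard_indexes)

    # Dequeue the 4 highest-priority items across all shards.
    for priority, item in list(prefetch_buffer)[:4]:
        print(priority, item)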
FOQS supports at least once delivery and that’s implemented using Ack/Nack (short for
Acknowledged or Not Acknowledged). An ack signifies that the dequeued item was
successfully processed by the consumer, so the message doesn’t need to be delivered
again. A nack signifies that the item should be redelivered because the consumer client
failed to process it.
When an item is enqueued, FOQS allows the client to specify a lease duration. When
that item gets dequeued, the lease begins. If the item is not acked or nacked within the
lease duration, it is assumed to have failed (nacked) and it’s made available for
redelivery, so that the at least once guarantee is met.
When an item succeeds/fails, the client sends the ack/nack request to FOQS. The shard
ID is contained in the item ID, so the FOQS client uses that ID to locate the specific
FOQS shard that manages that item.
The ack/nack gets sent to a shard-specific in-memory buffer; there are separate buffers for acks and nacks. A worker will pull items from the ack buffer and delete those rows
from the MySQL shard. Similarly, a worker will pull items from the nack buffer and
update that row with a new deliver_after time so the item gets redelivered.
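To tie the lease and ack/nack pieces together, here's a hedged consumer-side sketch; the client method names (dequeue, ack, nack) mirror the concepts above, but the actual FOQS client interface isn't shown in the post:

    def consume(foqs_client, topic):
        """Hypothetical consumer loop against a FOQS-like client."""
        # Dequeued items are leased: if we crash before ack/nack, the lease
        # expires and FOQS redelivers the item (preserving at-least-once delivery).
        for item in foqs_client.dequeue(topic, count=10):
            try:
                handle(item.payload)
                foqs_client.ack(item.id)    # processed: the item's row gets deleted
            except Exception:
                foqs_client.nack(item.id)   # failed: a new deliver_after time is set

    def handle(payload):
        print("processing", payload)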
In order to best serve their users, Mixpanel needs to support real-time event ingestion
while also supporting fast analytical queries over a user’s entire history.
If a Mixpanel customer wants to see the number of sign up conversion events over the
past 6 months, they should be able to query that data quickly (fast analytical queries).
Mixpanel accomplishes this with their in-house database, Arb. They leverage both
row-oriented and column-oriented data formats where row-oriented works better for
real-time event ingestion and column-oriented works well for analytical queries. This is
based on the classic Lambda Architecture where you have a speed layer for real-time
views and a batch layer for historical data.
If you're interested in learning more about Mixpanel's system architecture, you can read
about it here.
In order to convert data from row format to a columnar format, Mixpanel has a service
called Compacter.
Vijay Jayaram was the Principal Tech Lead Manager of the Performance team at
Mixpanel, and he wrote a great blog post on technical challenges the company faced
when scaling the Compacter service and how they overcame them.
Here’s a Summary
Ingested event data gets pushed onto Mixpanel’s storage system, where storage nodes will write
the events to disk using a row-oriented format.
Then, the Compacter service will convert the data from row format to columnar format,
making it faster to query.
Given the nature of the work, the Compacter service is very computationally expensive.
It runs in an autoscaling nodepool on Google Kubernetes Engine.
When a storage node has a row file of a certain size/age, it will send a request to a
randomly selected compacter node to convert it. The compacter node will then return a
handle to the resulting columnar file.
If a compacter node has too many requests, then it’ll load shed and return an error. The
storage node will retry after a backoff period.
Mixpanel engineers were having a great deal of trouble scaling the compacter in time to absorb spikes in load. The compacter service failed to autoscale quickly enough, resulting in a spike in errors (as storage node requests were getting shed and the retries were also getting shed).
Engineers would have to manually set the autoscaler’s minimum number of nodes to a
higher number to deal with the load. This resulted in a waste of engineer time and also
inefficient provisioning.
When Mixpanel looked at the average utilization of nodes in the compacter service, they
expected it to be at 80-90%. This would mean that the compute provisioned in the
service was being used efficiently.
However, they found that average CPU utilization was ~40%. They checked the median
utilization and the 90th percentile utilization to find that while median utilization was
low, the 90th percentile utilization was near 80%.
This meant that half the compacter nodes provisioned were doing little work, while the
top 10% of nodes were maxed out.
This was why the autoscaling was failing: the autoscaling algorithm was making its scaling decisions based on average utilization, which looked healthy even while the busiest nodes were maxed out.
Engineers were confused about why there was a skew since the storage nodes were
randomly selecting compacter nodes based on a uniform random distribution
(Randomized Static load balancing). Each compacter node was equally likely to be
selected for a row-to-column conversion job.
However, the individual jobs had a very uneven distribution of computational cost, roughly following a power law: the largest jobs were dramatically larger than the smallest ones. Some compacter nodes happened to receive the most time-consuming jobs while others received mostly small ones, and that's what caused the work skew between the nodes.
Unequal job sizes would also cause problems for many other load balancing algorithms, like Round Robin.
Mixpanel considered several solutions to solve this including inserting a queue between
the storage nodes and compacters or inserting a more complex load balancing service.
You can check out the full post to read about these options.
They went with a far simpler solution. They used a popular strategy called The Power of
2-Choices, which uses randomized load balancing.
Instead of the storage nodes randomly picking 1 compacter, they randomly pick 2
compacter nodes. Then, they ask each node for its current load and send the request to
the less loaded of the two.
There have been quite a few papers on this strategy, and it’s been found to drastically reduce the maximum load compared to having just one choice. It’s used quite frequently with load
balancers like Nginx. Mixpanel wrote some quick Python simulations to confirm their
intuition about how The Power of 2-Choices worked.
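The post doesn't include the simulation code, but a quick sketch of what such a Python simulation could look like is below; the heavy-tailed job-size distribution and the node/job counts are assumptions:

    import random

    def simulate(num_nodes=100, num_jobs=100_000, choices=2):
        """Return max node load divided by average node load after placing all jobs."""
        load = [0.0] * num_nodes
        for _ in range(num_jobs):
            job_cost = random.paretovariate(1.5)   # heavy-tailed job sizes (assumption)
            # Pick `choices` random nodes and send the job to the least loaded one.
            candidates = random.sample(range(num_nodes), choices)
            target = min(candidates, key=lambda n: load[n])
            load[target] += job_cost
        return max(load) / (sum(load) / num_nodes)

    print("1 choice :", round(simulate(choices=1), 2))   # plain randomized balancing
    print("2 choices:", round(simulate(choices=2), 2))   # power of 2 choices: much lower skew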
Implementing this into their system was extremely easy and it ended up massively
closing the gap between the median utilization and the 90th percentile utilization.
Average utilization increased to 90% and the error rate dropped to nearly 0 in steady
state since the compacters rarely had to shed load.
For more details, you can read the full summary here.
In order to store and serve all of their reporting metrics, Pinterest relies on Apache
Druid - an open source, column-oriented, distributed data store that’s written in Java.
Druid is commonly used for OLAP (analytics workloads) and it’s designed to ingest
massive amounts of event data and then provide low latency, analytics queries on the
ingested data. Druid is also used at Netflix, Twitter, Walmart, Airbnb and many other
companies.
To get an idea of how Druid works, you can split its architecture into three components:
the query nodes, data nodes and deep storage.
Deep storage is where the company stores all their data permanently, like AWS S3 or
HDFS.
Druid connects with the company’s deep storage and indexes the company’s data into
Druid data nodes for fast analytical queries. New data can also be ingested through the
data nodes and Druid will then write it to deep storage.
In order to make analytical queries performant, the data stored in Druid's data nodes is
stored in a columnar format. You can read more about this here.
Clients can send their queries (in SQL or JSON) for Druid through the query nodes.
Engineers on the Real-Time Analytics team at Pinterest wrote a great blog post on the
process they go through for load testing Druid.
Here’s a summary
When load testing Druid, engineers are looking to verify several areas:
1. Queries per Second - The query nodes should be able to handle the expected increase in query traffic.
2. Data Ingestion - The ingestion pipeline should be able to keep up with the increased volume of incoming events.
3. Data Size - The storage system should have sufficient capacity to handle the increased data volume.
We’ll go through each of these and talk about how Pinterest tests them.
When load testing their Druid system, Pinterest can either do so with generated queries
or with real production queries.
With generated queries, queries are created based on the current data set in Druid. This
is fairly simple to run and does not require any preparation. However, it may not
accurately show how the system will behave in production scenarios since the generated
queries might not be representative of a real world workload (in terms of which data is
accessed, query types, edge cases).
Another option is to capture real production queries and re-run these queries during testing. This is more involved, as queries need to be captured and then adjusted so they run correctly against the test environment.
Pinterest moved ahead with using real production queries and implemented query
capture using Druid’s logging feature that automatically logs any query that is being sent
to a Druid broker host (you send your query to a Query Server which contains a broker
host).
Engineers don’t conduct testing on the production environment, as that could adversely
affect users. Instead, they create a test environment that’s as close to production as
possible.
They replicate the Druid setup of brokers, coordinators, and more and also make sure to
use the same host machine types, configurations, pool size, etc.
Druid relies on an external database for metadata storage (data on configuration, audit,
usage information, etc.) and it supports Derby, MySQL and Postgres. Pinterest uses
MySQL.
Therefore, they use a MySQL dump to create a copy of all the metadata stored in the
production environment and add that to a MySQL instance in the test environment.
They spin up data nodes in the test environment that read from deep storage and index
data from the past few weeks/months.
Pinterest runs the real production queries on the test environment and watches several query performance metrics to see whether the setup can handle the expected load. If it can't, they adjust capacity and configuration and re-test. Some changes can be done quickly while others can take hours: increasing the number of machines in the query services can be done quickly, whereas increasing the number of data replicas takes time since data needs to be indexed and loaded from deep storage.
Testing Data Ingestion is quite similar to testing queries per second. Pinterest sets up a
test environment with the same capacity, configuration, etc. as the production
environment.
The main difference is that the Real-Time Analytics team now needs some help from
client teams who generate the ingested data to also send additional events that mimic
production traffic.
During the test, they monitor metrics like:
● Ingestion lag
And more.
They also make sure to validate the ingested data and make sure it’s being written
correctly.
Evaluating if the system can handle the increase in data volume is the simplest and
quickest check.
Results
From the testing, Pinterest found that they were able to handle the additional traffic
expected during the holiday period. They saw that the broker pool may need additional
hosts if traffic meets a certain threshold, so they made a note that the pool size may need
to be increased.
For more details, you can read the full blog post here.
One big challenge when operating at Facebook’s scale (millions of machines spread across geographically distributed data centers) is distributing data across the system.
Objects like executable files, search indexes, AI models and containers are a few
examples of files that Facebook needs to send to many different machines globally.
Each of these files ranges from a couple of megabytes to a few terabytes and they’re split
into small chunks. These chunks need to be transferred between Facebook machines
with low latency and a very high throughput (millions of machines may need to quickly
read a certain object).
These files are stored on Facebook’s distributed data store and client machines can read
the files from there. However, having all the client machines read from this data store
quickly leads to scalability issues.
There are far too many machines requesting files and the transfer speeds would be too
slow. Instead, there needs to be a caching system built on top of the distributed data
store that can facilitate easy transfer of this data.
To build this system, Facebook tried multiple approaches with varying degrees of
centralization. They first tried a highly centralized system with a hierarchical caching
layer but that led to scalability issues. They also tried a decentralized approach with the
BitTorrent protocol but that was too complex to manage.
Eventually, they settled on a balance between these two approaches with Owl, a system
for high-fanout distribution of data objects across Meta’s private cloud. Owl distributes
over 700 petabytes of data per day to over 10 million unique client machines across
Facebook’s data centers.
Engineers at Facebook published a great paper where they talked about their prior data
distribution systems (hierarchical caching, bittorrent and more), lessons learned and
the architecture/implementation details behind Owl.
Here’s a Summary
Facebook engineers needed a way to distribute large objects across their private cloud.
The task can be described by 3 dimensions
● Scale - The same object could be read by anywhere from a handful of client
machines to millions of clients around the world.
● Size - Objects range in size from 1 megabyte to a few terabytes. Objects are
split up into chunks and stored in a distributed storage system.
● Hotness - All the client machines may request the object within a few seconds
of each other, or their reads could be spread out over a few hours.
Distributing these files must also be done efficiently and reliably. To be considered
reliable, the caching system must successfully complete a large percentage of download
requests within a certain latency. It should also not be too burdensome for engineers to
maintain the system.
Facebook tried several approaches with varying amounts of centralization in the control
plane and the data plane. The data plane consists of the machines where the cached data is stored, while the control plane machines determine which files the data plane nodes should cache/delete and how requests should be routed to nodes in the data plane.
The first attempt was to add a hierarchical cache system in front of their distributed data
stores. This is a pretty standard solution and also relatively simple to implement.
Facebook set aside a dedicated pool of machines to use as the caching layer.
When a client machine needs a certain file, their first request goes to a first-level cache.
If there’s a cache-miss, then the first-level will request the data from the next level
caches in the hierarchy (second-level, third-level, and so on). The final layer is the
distributed data store itself.
The data would be stored/evicted in these hierarchies so that the first-level cache held
the most requested (hottest) data.
The issue with this approach is that it was too difficult for Facebook to handle load
spikes for particular pieces of content. Machines in the caching system would get
overloaded and start to throttle requests from the clients and Facebook had trouble
provisioning capacity appropriately.
They would either provision for the steady state and miss load spikes or they would
provision for load spikes and waste compute/servers.
The centralization of the data plane on the dedicated pool of machines was making the
system too slow to scale.
Bittorrent
To address these scaling issues, Facebook built a second solution based on the bittorrent
protocol, a very popular protocol for peer-to-peer file sharing.
With this system, any client that wants to download data becomes a peer in the system
(so there were millions of peers). Clients would dedicate whatever resources they had
available to sharing their downloaded files to other peers in the network. Trackers
maintained a list of all the peers and which file chunks were stored on which peers.
This scaled much better than hierarchical caching due to the peer-to-peer nature of the
system. When there was a load spike for a particular piece of content, the number of
peers sharing that content would automatically increase at a similar rate as the demand.
However, each peer in this system was making its own individual decision on which data
to request and share. A machine would only become a peer for a certain file if the
machine needed to download that file for its own purposes.
The decentralization also made it very hard to operate and debug. Engineers could not
get a clear picture of health and status without aggregating data from a large number of
peers.
Owl
Facebook then designed a system that combined the best of both of these approaches
with Owl.
Owl has a decentralized data plane and a centralized control plane. Data is stored in a
decentralized manner (similar to bittorrent) on all the client machines that are
downloading from the system. However, decisions around which client stores what
chunk and how data is cached on the various peers are managed more centrally.
● Peer - A peer is a library that’s linked with a client machine that wants to download data from Owl. As the machine is downloading the file, it can share data with other client machines (also peers) that request the file. When a peer finishes downloading, the data stays in its cache so the peer can keep serving it to others until it gets evicted.
● Superpeer - Superpeers are dedicated machines that can cache and serve data
but aren’t linked to a client process. Instead, the entire machine is dedicated
to caching and sharing data to other peers/superpeers. Owl Trackers will
manage what data gets stored on the superpeers.
● Tracker - The trackers are the brain of the system. They tell the peers and
superpeers what data to cache/evict based on the entire state of the system.
Trackers have a global view of what data is being requested so they can
intelligently manage the system.
When a client machine wants to download a file from Owl, it will use the Owl library to
send a remote procedure call for the data to a tracker.
The tracker has a global view of state and it will return information on the optimal
peer/superpeer that has the data and can share it with the client. The client can then
download the data from that node and become a peer itself.
The selection policy for how the tracker selects the peer/superpeer to share the data
depends on a variety of factors like geographic distance, load, amount of the file that the
node has saved.
If none of the peers have the file, then the tracker can have a superpeer fetch the data
from the underlying data store. The superpeer can then share the data with the client
machine, making the client machine a peer that can share the file.
The default cache eviction policy is LRU where the least recently used files get evicted
when a peer’s storage is full. However many nodes also use a least rare policy where files
are evicted from a peer based on how many other peers have that file cached. A file that
is cached on many other peers will be evicted over a file cached on only a few peers.
The system can also be configured to use a hybrid policy of least-rare eviction for hot
data and LRU eviction for cold data.
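A toy sketch of the least-rare idea (this is not Owl's implementation; in Owl the replica counts would come from the trackers' global view of the system):

    def evict_least_rare(local_cache, replica_counts):
        """Evict the chunk that the most other peers already have cached.

        local_cache:    set of chunk ids cached on this peer
        replica_counts: chunk id -> number of peers currently caching it
        """
        # A chunk cached on many peers is cheap to lose: someone else can serve it.
        victim = max(local_cache, key=lambda chunk: replica_counts.get(chunk, 0))
        local_cache.remove(victim)
        return victim

    cache = {"chunk-a", "chunk-b", "chunk-c"}
    counts = {"chunk-a": 120, "chunk-b": 3, "chunk-c": 15}
    print(evict_least_rare(cache, counts))  # evicts chunk-a, the most widely replicated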
This is just a brief overview of how Owl works. You can get way more details on sharding, security, fault tolerance and much more by reading the full paper.
Results
Facebook started Owl 2 years ago and since then they’ve seen 200x growth in the
amount of traffic Owl is getting. This growth came from replacing the prior systems and
taking on their load as well as organic adoption.
Despite this massive increase in traffic, the number of machines needed to run Owl (for
the superpeers and trackers) only increased by 4x. The decentralized nature of the data
plane (with peer-to-peer distribution) makes the system much easier to scale.
Owl is now handling over 700 petabytes of data per day and has over 10 million client
processes using the system. This amounts to a throughput of ~7 - 15 terabytes per
second of data that client processes are reading. With Owl, the amount of storage reads
that have to be served by the underlying distributed data store is less than 0.7 terabytes
per second.
For more details, you can read the full paper here.
It’s extremely important that Dropbox meets various service level objectives (SLOs)
around availability, durability, security and more. Failing to meet these objectives
means unhappy users (increased churn, bad PR, fewer sign ups) and lost revenue from
enterprises who have service level agreements (SLAs) in their contracts with Dropbox.
The availability SLA in contracts is 99.9% uptime, but Dropbox sets a higher internal bar of 99.95% uptime. This translates to a budget of roughly 21 minutes of downtime per month.
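The arithmetic behind that budget, as a quick sanity check:

    minutes_per_month = 30 * 24 * 60              # ~43,200 minutes in a 30-day month
    allowed_downtime = minutes_per_month * (1 - 0.9995)
    print(round(allowed_downtime, 1))             # ~21.6 minutes of downtime per month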
In order to meet their objectives, Dropbox has a rigorous process they execute whenever
an incident comes up. They’ve also developed a large amount of tooling around this
process to ensure that incidents are resolved as soon as possible.
Joey Beyda is a Senior Engineering Manager at Dropbox and Ross Delinger is a Site
Reliability Engineer. They wrote a great blog post on incident management at Dropbox
and how the company ensures great service for their users.
Here’s a Summary
Dropbox breaks an incident down into three phases, each of which they try to shorten:
1. Detection - the time it takes to detect an issue and notify the right responders.
2. Diagnosis - the time it takes for responders to root-cause an issue and identify a resolution approach.
3. Recovery - the time it takes to mitigate the issue for users once a resolution approach is found.
Detection
Whenever there’s an incident around availability, durability, security, etc. it’s important
that Dropbox engineers are notified as soon as possible.
To accomplish this, Dropbox built Vortex, their server-side metrics and alerting system.
You can read a detailed blog post about the architecture of Vortex here. It provides an
ingestion latency on the order of seconds and has a 10 second sampling rate. This allows
engineers to be notified of any potential incident within tens of seconds of its beginning.
However, in order to be useful, Vortex needs well-defined metrics to alert on. These
metrics are often use-case specific, so individual teams at Dropbox will need to
configure them themselves.
To reduce the burden on service owners, Vortex provides a rich set of service, runtime
and host metrics that come baked in for teams.
Noisy alerts can be a big challenge, as they can cause alarm fatigue which will increase
the response time.
To address this, Dropbox built an alert dependency system into Vortex where service
owners can tie their alerts to other alerts and also silence a page if the problem is in
some common dependency. This helps on-call engineers avoid getting paged for issues
that are not actionable by them.
Diagnosis
In the diagnosis stage, engineers are trying to root-cause the issue and identify possible
resolution approaches.
To make this easier, Dropbox has built a ton of tooling to speed up common workflows
and processes.
They’ve also built dashboards with Grafana that list data points that are valuable to
incident responders like
● RPC latency
● Exception trends
And more. Service owners can then build more nuanced dashboards that list
team-specific metrics that are important for diagnosis.
One of the highest signal tools Dropbox has for diagnosing issues is their exception
tracking infrastructure. It allows any service at Dropbox to emit stack traces to a central
store and tag them with useful metadata.
Developers can then view the exceptions within their services through a dashboard.
Recovery
Once a resolution approach is found, the recovery process consists of executing that
approach and resolving the incident.
To make this as fast as possible, Dropbox asked their engineers the following question -
“Which incident scenarios for your system would take more than 20 minutes to
recover from?”.
They picked the 20 minute mark since their availability targets were no more than 21
minutes of downtime per month.
● Experiments and Feature Gates could be hard to roll back - If there was an
experimentation-related issue, this could take longer than 20 minutes to roll
back and resolve. To address this, engineers ensured all experiments and
feature gates had a clear owner and that they provided rollback capabilities
and a playbook to on-call engineers.
For more details, you can read the full blog post here.