Reactive Programming
versus
Reactive Systems
Landing on a set of simple Reactive design principles in a
sea of constant confusion and overloaded expectations
Executive Summary
Reactive Programming
Event-Driven VS Message-Driven
How Does Reactive Programming & Systems Relate To Fast Data Streaming?
How Does Reactive Programming & Systems Relate To Traditional Web Applications?
The goal of this white paper is to define and clarify the different aspects of “Reactive” by looking at the
differences between writing code in a Reactive Programming style, and the design of Reactive Systems as
a cohesive whole.
¹ A rather undefined term, generally referring to applications which do not depend on underlying OS features such as a file system, but
instead use typically configurable endpoints, allowing them to run in a virtualized environment like the Cloud.
From the perspective of this white paper, “Reactive” is a set of design principles for creating cohesive
systems. It’s a way of thinking about systems architecture and design in a distributed environment where
implementation techniques, tooling, and design patterns are components of a larger whole.
Consider the following analogy: an athletic team (e.g. football, basketball, etc.) is often composed of
exceptional individuals. Yet it is common for such a team to lose to an “inferior” opponent when the players fail to click as a unit, lacking the synergy to operate effectively as a team.
This analogy illustrates the difference between a set of individual Reactive services cobbled together
without thought—even though individually they’re great—and a Reactive System.
In a Reactive System, it’s the interaction between the individual parts that makes all the difference:
the ability to operate individually yet act in concert to achieve their intended result.
A Reactive System is based on an architectural style that allows these multiple individual services to
coalesce into a single unit and react to their surroundings while remaining aware of each other—this can
manifest as the ability to scale up and down, load balance, and even take some of these steps proactively.
Thus, we see that it’s possible to write a single application in a Reactive style (i.e. using Reactive
Programming); however, that’s merely one piece of the puzzle. Though each of the above aspects
may seem to qualify as “Reactive,” in and of themselves they do not make a system Reactive.
When people talk about Reactive in the context of software development and design, they generally
mean one of three things:
• Reactive Systems (architecture and design)
• Reactive Programming (declarative, event-based)
• Functional Reactive Programming (FRP)
We’ll examine what each of these practices and techniques mean, with emphasis on the first two.
More specifically, we’ll discuss when to use them, how they relate to each other, and what you can
expect the benefits from each to be—particularly in the context of building systems for multicore,
Cloud, and Mobile architectures.
The main driver behind modern systems is the notion of Responsiveness: the acknowledgement that
if the client/customer does not get value in a timely fashion then they will go somewhere else.
Fundamentally there is no difference between not getting value at all and not getting value when it is needed.
In order to facilitate Responsiveness, two challenges need to be faced: being Responsive under failure,
defined as Resilience, and being Responsive under load, defined as Elasticity. The Reactive Manifesto
prescribes that in order to achieve this, the system needs to be Message-driven.
[Figure: the Reactive Manifesto traits: Responsive as the goal, supported by Elastic and Resilient, on a Message-Driven foundation.]
In 2016, several major vendors in the JVM space announced core initiatives to embrace
Reactive Programming—a tremendous validation of the problems faced by companies today.
Undertaking this change of direction from traditional programming techniques is a big and challenging
task: it means maintaining compatibility with pre-existing technologies, shepherding the user
base toward a different mindset, and building out internal developer and operational experience.
The investment by these companies is non-trivial, and it goes without saying that this is a large
engineering challenge.
While there seems to be much activity in the Reactive Programming space, at the systems architecture
level it will take time to build up architectural and operational experience—something which is not
automatically solved by adopting a different programming paradigm. It will be interesting to see what
comes out of these initiatives in the years ahead.
Let’s start by talking about Functional Reactive Programming, and why we chose to exclude it from
further discussions in this article.
Reactive Programming
Reactive Programming, not to be confused with Functional Reactive Programming, is a subset of
Asynchronous Programming and a paradigm where the availability of new information drives the
logic forward rather than having control flow driven by a thread-of-execution.
It supports decomposing the problem into multiple discrete steps where each can be executed in an
asynchronous and nonblocking fashion, and then be composed to produce a workflow—possibly
unbounded in its inputs or outputs.
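This decomposition into discrete, asynchronously executed steps composed into a workflow can be sketched with Java’s `CompletableFuture`; the step names here (`fetchUser`, `score`) are hypothetical placeholders, not part of any real API:

```java
import java.util.concurrent.CompletableFuture;

public class AsyncWorkflow {
    // Step 1: look up a user record; runs asynchronously on a pooled thread.
    static CompletableFuture<String> fetchUser(int id) {
        return CompletableFuture.supplyAsync(() -> "user-" + id);
    }

    // Step 2: compute a score for the user; also asynchronous and nonblocking.
    static CompletableFuture<Integer> score(String user) {
        return CompletableFuture.supplyAsync(() -> user.length() * 10);
    }

    public static void main(String[] args) {
        // Compose the discrete steps into a workflow: thenCompose chains the
        // stages; no thread blocks while waiting for the previous stage.
        CompletableFuture<Integer> workflow =
            fetchUser(42).thenCompose(AsyncWorkflow::score);

        // join() is used only to print the demo's result at the very end.
        System.out.println(workflow.join()); // 70 ("user-42" has 7 characters)
    }
}
```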
Asynchronous is defined by the Oxford Dictionary as “not existing or occurring at the same time”, which
in this context means that the processing of a message or event is happening at some arbitrary time,
possibly in the future.
This is a very important technique in Reactive Programming since it allows for non-blocking
execution—where threads of execution competing for a shared resource don’t need to wait by blocking
(preventing the thread of execution from performing other work until current work is done), and can as
such perform other useful work while the resource is occupied. Amdahl’s Law³ tells us that contention is
the biggest enemy of scalability, and therefore a Reactive program should rarely, if ever, have to block.
[Figure: blocking versus non-blocking thread usage: with blocking, a thread sits idle for the lifetime of the contended resource; with non-blocking execution, threads remain free to perform other work while the resource is occupied.]
The Application Program Interface (API) for Reactive Programming libraries is generally either:
• Callback-based—where anonymous, side-effecting callbacks are attached to event sources, and are invoked when events pass through the dataflow chain.
• Declarative—through functional composition, usually using well-established combinators like map, filter, fold, and so on.
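Reactive Programming APIs commonly come in callback-based and declarative flavors. Both can be illustrated with plain `CompletableFuture` as a minimal stand-in for a Reactive library (the values are arbitrary): `whenComplete` attaches a side-effecting callback, while `thenApply` composes declaratively with a map-style combinator.

```java
import java.util.concurrent.CompletableFuture;

public class ApiStyles {
    public static void main(String[] args) {
        // Callback-based style: an anonymous, side-effecting callback is
        // attached to the asynchronous source and invoked when the value arrives.
        CompletableFuture.supplyAsync(() -> 21)
            .whenComplete((value, error) -> {
                if (error == null) System.out.println("callback saw " + value);
            })
            .join(); // wait so the demo completes before exiting

        // Declarative style: the pipeline is described with combinators, and
        // values flow through without side effects until the terminal step.
        int result = CompletableFuture.supplyAsync(() -> 21)
            .thenApply(x -> x * 2) // a map-style combinator
            .join();
        System.out.println("declarative result: " + result); // 42
    }
}
```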
It would be reasonable to claim that Reactive Programming is related to Dataflow Programming, since
the emphasis is on the flow of data rather than the flow of control.
Popular libraries supporting the Reactive Programming techniques on the JVM include, but are not
limited to, Akka Streams, Ratpack, Reactor, RxJava and Vert.x. These libraries implement the Reactive
Streams specification, which is a standard for interoperability between Reactive Programming libraries
on the JVM, and according to its own description is “...an initiative to provide a standard for asynchronous
stream processing with non-blocking back pressure.”
A secondary benefit is one of developer productivity as traditional programming paradigms have all
struggled to provide a straightforward and maintainable approach to dealing with asynchronous and
nonblocking computation and IO. Reactive Programming solves most of the challenges here since it
typically removes the need for explicit coordination between active components.
⁴ Neil Gunther’s Universal Scalability Law is an essential tool in understanding the effects of contention and coordination in
concurrent and distributed systems, and shows that the cost of coherency in a system can lead to negative results as new
resources are added to the system.
To ensure steady state in terms of data flow, pull-based back-pressure sends demand flowing upstream and
receives messages flowing downstream, which avoids the producer overwhelming the consumer(s). Images by Kevin Webber (@kvnwbbr).
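Pull-based back pressure can be illustrated with the `java.util.concurrent.Flow` API, the JDK’s embodiment of the Reactive Streams interfaces (available since Java 9). In this minimal sketch the subscriber signals demand for one element at a time, so the publisher can never overwhelm it:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

public class BackPressureDemo {
    // Collects elements as the subscriber pulls them, one at a time.
    static List<Integer> receive(List<Integer> input) throws InterruptedException {
        List<Integer> received = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch done = new CountDownLatch(1);
        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(new Flow.Subscriber<Integer>() {
                private Flow.Subscription subscription;

                @Override public void onSubscribe(Flow.Subscription s) {
                    subscription = s;
                    s.request(1);            // demand flows upstream: ask for ONE element
                }
                @Override public void onNext(Integer item) {
                    received.add(item);      // messages flow downstream
                    subscription.request(1); // pull the next element only when ready
                }
                @Override public void onError(Throwable t) { done.countDown(); }
                @Override public void onComplete() { done.countDown(); }
            });
            input.forEach(publisher::submit); // submit blocks when the buffer is full
        }                                     // closing the publisher signals onComplete
        done.await();
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(receive(List.of(1, 2, 3))); // [1, 2, 3]
    }
}
```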
But even though Reactive Programming is a very useful piece when constructing modern software, in
order to reason about a system at a higher level one has to use another tool: Reactive Architecture—the
process of designing Reactive Systems. Furthermore, it is important to remember that there are many
programming paradigms and Reactive Programming is but one of them, so just as with any tool, it is not
intended for any and all use-cases.
Event-Driven VS Message-Driven
As mentioned previously, Reactive Programming—focusing on computation through ephemeral dataflow
chains—tends to be Event-driven, while Reactive Systems—focusing on resilience and elasticity through the
communication and coordination of distributed components—are Message-driven⁵ (also referred to as
Messaging).
The main difference between a Message-driven system with long-lived addressable components, and an
Event-driven dataflow model, is that Messages are inherently directed while Events are not. Messages
have a clear, single destination, while Events are facts for others to observe. Furthermore, messaging
is preferably asynchronous, with the sending and the reception decoupled from the sender and
receiver respectively.
⁵ Messaging can be either synchronous (requiring the sender and receiver to be available at the same time) or asynchronous
(allowing them to be decoupled in time). Discussing the semantic differences is out of scope for this white paper.
The glossary in the Reactive Manifesto defines the conceptual difference as: “A message is an item of data that is sent to a specific destination. An event is a signal emitted by a component upon reaching a given state. In a message-driven system addressable recipients await the arrival of messages and react to them, otherwise lying dormant. In an event-driven system notification listeners are attached to the sources of events such that they are invoked when the event is emitted.”
Messages are needed to communicate across the network and form the basis for communication in
distributed systems, while Events, on the other hand, are emitted locally. It is common to use Messaging
under the hood to bridge an Event-driven system across the network by sending Events inside Messages.
This allows maintaining the relative simplicity of the Event-driven programming model in a distributed
context and can work very well for specialized and well scoped use-cases (e.g., AWS Lambda, Distributed
Stream Processing products like Spark Streaming, Flink, Kafka and Akka Streams over Gearpump, and
Distributed Publish Subscribe products like Kafka and Kinesis).
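The contrast between directed Messages and broadcast Events can be sketched in a few lines of plain Java; `OrderPlaced`, `ChargeCard` and the `billing` mailbox are hypothetical names used only for illustration. The Event is a fact broadcast to whoever observes it, while the Message is placed in exactly one addressable recipient’s mailbox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

public class EventsVsMessages {
    record OrderPlaced(String orderId) {}  // an Event: a fact, with no destination
    record ChargeCard(String orderId) {}   // a Message: directed at one recipient

    // Event-driven: broadcast a fact to whoever happens to be observing.
    static int publish(OrderPlaced event, List<Consumer<OrderPlaced>> listeners) {
        listeners.forEach(l -> l.accept(event));
        return listeners.size(); // every observer saw the fact
    }

    // Message-driven: deliver to exactly one addressable mailbox.
    static void send(Map<String, Queue<ChargeCard>> mailboxes, String to, ChargeCard msg) {
        mailboxes.get(to).add(msg);
    }

    public static void main(String[] args) {
        List<Consumer<OrderPlaced>> listeners = new ArrayList<>();
        listeners.add(e -> System.out.println("analytics observed " + e.orderId()));
        listeners.add(e -> System.out.println("audit observed " + e.orderId()));
        publish(new OrderPlaced("o-1"), listeners); // both observers are notified

        Map<String, Queue<ChargeCard>> mailboxes =
            Map.of("billing", new ConcurrentLinkedQueue<>());
        send(mailboxes, "billing", new ChargeCard("o-1")); // single destination
        System.out.println("billing queue size: " + mailboxes.get("billing").size());
    }
}
```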
Messaging forces us to embrace the reality and constraints of distributed systems—things like partial
failures, failure detection, dropped/duplicated/reordered messages, eventual consistency, managing
multiple concurrent realities, etc.—and tackle them head on instead of hiding them behind a leaky
abstraction—pretending that the network is not there—as has been done too many times in the past
(e.g. EJB, RPC, CORBA, and XA).
These differences in semantics and applicability have profound implications in the application design,
including things like resilience, elasticity, mobility, location transparency and management of the
complexity of distributed systems, which will be explained further in this white paper.
In a Reactive System, especially one which uses Reactive Programming, both events and messages will
be present—as one is a great tool for communication (messages), and another is a great way of
representing facts (events).
The principles of Reactive Systems are most definitely not new, and can be traced back to the 70s and
80s and the seminal work by Jim Gray and Pat Helland on the Tandem System and Joe Armstrong and
Robert Virding on Erlang. However, these people were ahead of their time and it’s been only in the last
5-10 years that the technology industry has been forced to rethink current “best practices” for enterprise
system development. This means learning to apply the hard-won knowledge of the Reactive principles to
today’s world of multicore, Cloud Computing and the Internet of Things.
The foundation for a Reactive System is Message-Passing, which creates a temporal boundary between
components which allows them to be decoupled in time—this allows for concurrency—and space—which
allows for distribution and mobility. This decoupling is a requirement for full isolation between
components, and forms the basis for both Resilience and Elasticity.
The world is becoming increasingly interconnected. Systems are complex by definition—each consisting
of a multitude of components, which in and of themselves can also be systems—which means software is
increasingly dependent on other software to function properly.
The systems we create today are to be operated on computers small and large, few and many, near each
other or half a world away. And at the same time, users’ expectations have become harder and harder to
meet as everyday human life is increasingly dependent on the availability of systems to function smoothly.
In order to deliver systems that users—and businesses—can depend on, they have to be Responsive,
since it doesn’t matter if something provides the correct response if the response is not available when it
is needed. In order to achieve this, we need to make sure that Responsiveness can be maintained under
failure (Resilience) and under dynamically-changing load (Elasticity). To make that happen, we make these
systems Message-Driven, and we call them Reactive Systems.
This requires component isolation and containment of failures in order to avoid failures spreading to
neighbouring components—resulting in, often catastrophic, cascading failure scenarios.
So the key to building Resilient, self-healing systems is to allow failures to be: contained, reified as mes-
sages, sent to other components (that act as supervisors), and managed from a safe context outside the
failed component. Here, being Message-driven is the enabler: moving away from strongly coupled, brittle,
deeply nested synchronous call chains that everyone learned to suffer through…or ignore. The idea is
to decouple the management of failures from the call chain, freeing the client from the responsibility of
handling the failures of the server.
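A minimal sketch of this idea in plain Java (a real supervisor, as in Akka or Erlang, would also implement restart strategies; the names here are hypothetical): the worker catches its own failure, reifies it as a `Failed` message, and sends it to a supervisor mailbox, so the failure is managed from a safe context outside the failed component rather than propagating up a call chain to the client.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SupervisionDemo {
    sealed interface Signal permits Failed, Done {}
    record Failed(String worker, String reason) implements Signal {}
    record Done(String worker) implements Signal {}

    // A simulated unit of work that fails.
    static void riskyWork() { throw new IllegalStateException("disk full"); }

    // The worker never lets the failure escape up a call chain; it reifies
    // the failure as a message and sends it to its supervisor's mailbox.
    static void runWorker(String name, BlockingQueue<Signal> supervisor) {
        try {
            riskyWork();
            supervisor.add(new Done(name));
        } catch (Exception e) {
            supervisor.add(new Failed(name, e.getMessage()));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Signal> supervisorMailbox = new LinkedBlockingQueue<>();
        new Thread(() -> runWorker("worker-1", supervisorMailbox)).start();

        // The supervisor handles the failure from a safe context outside the
        // failed component, deciding e.g. to restart the worker.
        Signal signal = supervisorMailbox.take();
        if (signal instanceof Failed f) {
            System.out.println("supervisor restarting " + f.worker() + ": " + f.reason());
        }
    }
}
```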
Systems need to be adaptive—allowing for intervention-less auto-scaling, replication of state and behav-
ior, load-balancing of communication, failover and upgrades, all without rewriting or even reconfiguring
the system. The enabler for this is Location Transparency: the ability to scale the system in the same way,
using the same programming abstractions, with the same semantics, across all dimensions of scale—from
CPU cores to data centers.
One key insight that simplifies this problem immensely is to realize that we are all
doing distributed computing. This is true whether we are running our systems on a
single node (with multiple independent CPUs communicating over the QPI link) or
on a cluster of nodes (with independent machines communicating over the net-
work). Embracing this fact means that there is no conceptual difference between
scaling vertically on multicore or horizontally on the cluster.
So no matter where the recipient resides, we communicate with it in the same way. The only way this can be
done in a semantically equivalent manner is via Messaging.
This is important, since during the lifecycle of a system—if not properly designed—it will become harder
and harder to maintain, and require an ever increasing amount of time and effort to understand in order
to localize and to rectify problems.
• Isolation of failures offer bulkheads between components, preventing failures from cascading and
limiting the scope and severity of failures.
• Supervisor hierarchies offer multiple levels of defences paired with self-healing capabilities, which
removes a lot of transient failures from ever incurring any operational cost to investigate.
• Message-passing and location transparency allow for components to be taken offline and replaced
or rerouted without affecting the end-user experience. This reduces the cost of disruptions, their
relative urgency, and also the resources required to diagnose and rectify.
• Replication reduces the risk of data loss, and lessens the impact of failure on the availability of
retrieval and storage of information.
Elasticity allows for conservation of resources as usage fluctuates, allowing for minimizing
operational costs when load is low, and minimizing the risk of outages or urgent investment into
scalability as load increases.
Though done poorly on the Titanic, bulkheading has long been used in ship construction
to keep a failure in one compartment from cascading into the rest of the vessel.
Thus, Reactive Systems allow for the creation of systems that cope well with failure, varying load and
change over time—all while offering a low cost of ownership.
One common problem with only leveraging Reactive Programming is that its tight coupling between
computation stages in an Event-driven callback-based or declarative program makes Resilience harder to
achieve because its transformation chains are often ephemeral and its stages—the callbacks or
combinators—are anonymous, i.e. not addressable.
This means that they usually handle success or failure directly without signalling it to the outside world.
This lack of addressability makes the recovery of individual stages harder to achieve, as it is typically
unclear where exceptions should, or even could, be propagated. As a result, failures are tied to ephemeral
client requests instead of to the overall health of the component—if one of the stages in the dataflow
chain fails, then the whole chain needs to be restarted and the client notified. This is in contrast to a
Message-driven Reactive System, which has the ability to self-heal without needing to notify the client.
Another contrast to the Reactive Systems approach is that pure Reactive Programming allows decoupling
in time, but not space (unless leveraging Message-passing to distribute the dataflow graph under the
hood, across the network, as discussed previously).
Decoupling in time allows for concurrency, but it is decoupling in space that allows
for distribution, and mobility—allowing for not only static but also dynamic
topologies—which is essential for Elasticity.
A lack of location transparency makes it hard to scale out a program purely based on Reactive
Programming techniques adaptively in an elastic fashion and therefore requires layering additional tools
on top, such as a Message Bus, Data Grid or bespoke network protocols. This is where the
Message-driven approach of Reactive Systems shines, since it is a communication abstraction that
maintains its programming model and semantics across all dimensions of scale, and therefore reduces
system complexity and cognitive overhead.
A commonly cited problem with callback-based programming is that while writing such programs may be
comparatively easy, the resulting tangle of nested, anonymous callbacks has real maintenance
consequences in the long run.
Libraries and platforms designed for Reactive Systems (such as the Akka project and the Erlang platform)
learned this lesson long ago and rely on long-lived addressable components that are easier
to reason about over time. When failures occur, the component is uniquely identifiable along with the
message that caused the failure. With the concept of addressability at the core of the component model,
monitoring solutions have a meaningful way to present the data they gather—leveraging the identities
that are propagated.
The choice of a good programming paradigm, one that enforces things like addressability and failure
management, has proven to be invaluable in production, as it is designed with the harshness of reality in
mind, to expect and embrace failure rather than the lost cause of trying to prevent it.
All in all, Reactive Programming is a very useful implementation technique, which can be used in a
Reactive Architecture. Remember that it will only help manage one part of the story: dataflow
management through asynchronous and nonblocking execution—usually only within a single node
or service. Once there are multiple nodes, there is a need to start thinking hard about things like data
consistency, cross-node communication, coordination, versioning, orchestration, failure management,
separation of concerns and responsibilities etc.—i.e. system architecture.
While Reactive Programming focuses on asynchronous, nonblocking dataflow management within a single node or service,
complex Reactive System architectures need far more to successfully deploy multiple services across nodes and clusters.
Image by Kevin Webber (@kvnwbbr).
Underneath the end-user API, a distributed stream processing product⁶ typically uses Message-passing
and the principles of Reactive Systems in between nodes, supporting a distributed system of stream
processing stages, durable event logs and replication protocols—although these parts are typically not
exposed to the developer. This is a good example of using Reactive Programming at the user level and
Reactive Systems at the system level.
As we have seen, both Reactive Programming and Reactive Systems design are important—in different
contexts and for different reasons:
• Reactive Programming is used within a single Microservice to implement the service-internal logic
and dataflow management.
• Reactive Systems design is used in between the Microservices, allowing the creation of systems of
Microservices that play by the rules of distributed systems—Responsiveness through Resilience
and Elasticity made possible by being Message-Driven.
⁶ For example using Spark Streaming, Flink, Kafka Streams, Beam or Gearpump.
⁷ The word autonomous comes from the Greek words auto, meaning self, and nomos, meaning law: i.e. an agent that lives by its
own laws, with self-governance and independence.
When building services to be used by potentially millions of connected devices, there’s a need for a model
which copes with information flow at scale. There’s a need for strategies for handling device failures, for
when information is lost, and for when services fail—because they will. The back-end systems managing all
of this need to be able to scale on demand and be fully resilient; in other words, there’s a need for Reactive
Systems.
Having lots of sensors generating data while being unable to cope with the rate at which that data
arrives—a common problem for IoT back-ends—indicates the need to implement
back-pressure for devices and sensors. Looking at the end-to-end data flow of an IoT system—with tons
of devices and the need to store data, cleanse it, process it and run analytics on it without any service
interruption—the necessity of asynchronous, non-blocking, fully back-pressured streams becomes
critical. This is where Reactive Programming really shines.
Web applications also benefit from Reactive System design for things like distributed caching, data
consistency, and cross-node notifications. Traditional web applications normally use stateless nodes.
But as soon as you start using Server-Sent-Events (SSE) and WebSockets, your nodes become stateful,
since at a minimum, they are holding the state of a client connection, and push notifications need to be
routed to them accordingly. Doing this effectively requires a Reactive System design, since it is an area
where directly addressing the recipients through messaging is important.
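A toy sketch of such routing (the registry, client and node names are hypothetical, and in practice the registry itself must be distributed and resilient): each client connection is registered against the node that holds it, and a push notification is sent as a directed message to that node’s mailbox rather than broadcast to the whole cluster.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class PushRouting {
    // clientId -> node that currently holds the client's connection
    static final Map<String, String> connectionRegistry = new ConcurrentHashMap<>();
    // nodeId -> that node's mailbox (stands in for a network channel)
    static final Map<String, Queue<String>> nodeMailboxes = new ConcurrentHashMap<>();

    static void connect(String clientId, String nodeId) {
        connectionRegistry.put(clientId, nodeId);
        nodeMailboxes.computeIfAbsent(nodeId, n -> new ConcurrentLinkedQueue<>());
    }

    // Directly address the node holding the connection: a message, not a broadcast.
    static void push(String clientId, String notification) {
        String node = connectionRegistry.get(clientId);
        if (node != null) nodeMailboxes.get(node).add(clientId + ": " + notification);
    }

    public static void main(String[] args) {
        connect("alice", "node-2");
        push("alice", "order shipped");
        System.out.println(nodeMailboxes.get("node-2").poll()); // alice: order shipped
    }
}
```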
Summary
Enterprises and middleware vendors alike are beginning to embrace Reactive, with 2016 witnessing a
huge growth in corporate interest in adopting Reactive. In this white paper, we have described Reactive
Systems as being the end goal—assuming the context of multicore, Cloud and Mobile architectures—for
enterprises, with Reactive Programming serving as one of the important tools.
reactivesummit.org
Build modern systems
for the modern world.
lightbend.com
Lightbend (Twitter: @Lightbend) provides the leading Reactive application development platform
for building distributed applications and modernizing aging infrastructures. Using microservices and
fast data on a message-driven runtime, enterprise applications scale effortlessly on multi-core and
cloud computing architectures. Many of the most admired brands around the globe are transforming
their businesses with our platform, engaging billions of users every day through software that is
changing the world.
Lightbend, Inc. 625 Market Street, 10th Floor, San Francisco, CA 94105 | www.lightbend.com