Architecting For Fast Data Applications Mesosphere
Introduction
Fast Data: The New Big Data
Fast Data Applications in Action
A Reference Architecture for Fast Data Applications
1. High availability with no single point of failure
2. Elastic scaling
3. Storage management
4. Infrastructure and application-level monitoring & metrics
5. Security and access control
6. Ability to build and run applications on any infrastructure
Fast Data Applications Require New Platform Services
1. Delivering real-time data
2. Storing distributed data
3. Processing fast data
4. Acting on data
Key Challenges Implementing Fast Data Services
1. Deploying each data service is time consuming
2. Operating data services is manual and error-prone
3. Infrastructure silos with low utilization
Public Cloud - The Solution?
Mesosphere DC/OS: Simplifying the Development and Operations of Fast Data Applications
1. On-demand provisioning
2. Simplified operations
3. Elastic data infrastructure
Case Studies: Fast Data Done Well
Verizon Adopts New Strategic Technologies to Serve Millions of Subscribers in Real-Time
Esri Builds Real-Time Mapping Service With Kafka, Spark, and More
Wellframe Expands its Healthcare Management Platform
1 Mesosphere, Inc.
Architecting for Fast Data Applications
INTRODUCTION
In today's always-connected economy, businesses need to provide real-time
services to customers that utilize vast amounts of data. Examples abound:
from real-time decision-making in finance and insurance, to enabling
the connected home, to powering autonomous cars. While innovators
such as Twitter, Uber and Netflix were at the forefront of creating
personalized, real-time services for their customers, companies of all
shapes and sizes in industries including telecom, financial services,
healthcare, retail, and many more now need to respond or face the risk of
disruption.
To serve customers at scale and process and store the huge amount of
data they produce and consume, successful businesses are changing how
they build applications. Modern enterprise applications are shifting from
monolithic architectures to cloud native architectures: distributed systems
of microservices, containers, and data services. Modern applications built
on cloud native platform services are always-on, scalable, and efficient,
while taking advantage of huge volumes of real-time data.
This eBook details the vital shift from big data to fast data, describes the
changing requirements for applications utilizing real-time data, and
presents a reference architecture for fast data infrastructure.
One of the key drivers of the sheer increase in the volume of data is the
growth of unstructured data, which now makes up approximately 80% of
enterprise data. Structured data is information, usually text files, displayed
in titled columns and rows which can easily be analyzed. Historically,
structured data was the norm because of limited processing capability,
inadequate memory and high costs of storage. In contrast, unstructured
data has no identifiable internal structure; examples include emails, video,
audio and social media. Unstructured data has skyrocketed due to the
increased availability of storage and the number of complex data sources.
[Figure: chart spanning 2008 to 2015, vertical axis from 0 to 160]
The term big data was popularized in the early- to mid-2000s, when many
companies started to focus on obtaining business insights from the vast
amounts of data being generated. Hadoop was created in 2006 to handle
the explosion of data from the web.
While most large enterprises have put forth efforts to build data
warehouses, the challenge is in seeing real business impact:
organizations leave the vast amount of unstructured data unused. Despite
substantial hype and reported successes for early adopters, over half of
the respondents to a Gartner survey reported no plans to invest in Hadoop
as of 2015.[3] The key big data adoption inhibitors include:
3 Survey Analysis: Hadoop Adoption Drivers and Challenges, Gartner, May 2015
Industry Transitions
1980s: Electronic customer records
1990s: Online customer engagement
2000s: Customer analytics
2013+: Real-time & predictive customer engagement
Over the past two to three years, companies have started transitioning
from big data, where analytics are processed after-the-fact in batch mode,
to fast data, where data analysis is done in real-time to provide immediate
insights. For example, in the past, retail stores such as Macy's analyzed
historical purchases by store to determine which products to add to stores
in the next year. In comparison, Amazon drives personalized
recommendations based on hundreds of individual characteristics about
you, including what products you viewed in the last five minutes.
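
The contrast between the two modes can be made concrete with a small sketch. Instead of waiting for a batch job, the class below updates its view as each event arrives and keeps only the products a shopper viewed in the last five minutes. The names `RecentViews`, `record`, and `recent` are hypothetical and do not come from any product mentioned here; timestamps are plain seconds supplied by the caller, and events are assumed to arrive in time order.

```python
from collections import deque


class RecentViews:
    """Track the products one shopper viewed within a sliding time window."""

    def __init__(self, window_seconds=300):
        # 300 seconds mirrors the "last five minutes" in the text.
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, product_id), oldest first

    def record(self, timestamp, product_id):
        """Ingest one view event as it happens (assumes non-decreasing timestamps)."""
        self._events.append((timestamp, product_id))
        self._expire(timestamp)

    def recent(self, now):
        """Return distinct products viewed in the window ending at `now`, newest first."""
        self._expire(now)
        seen = []
        for _, product_id in reversed(self._events):
            if product_id not in seen:
                seen.append(product_id)
        return seen

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()
```

A batch system would answer the same question by scanning a full purchase history after the fact; here the answer is available immediately after each event, which is the property the text attributes to fast data.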
[Figure: How Will Usage of Batch and Streaming Shift in Your Company in the Next One Year?
Bar chart of five responses: reduce batch, increase stream; increase investment in both;
eliminate batch, shift to stream; reduce stream, increase batch; eliminate stream, shift to batch.
Bar values shown: 68%, 14%, 10%, 5%, 1%.]
Source: 2016 State of Fast Data and Streaming Applications, OpsClarity
At Capital One, analytics are not just used for pricing and fraud detection,
but also for predictive sales, driving customer retention, and reducing the
cost of customer acquisition. Machine learning algorithms play a critical
role at Capital One. "Every time a Capital One card gets swiped, we capture
that data and are running modeling on it," Capital One data scientist
Brendan Herger says. The results of the fast data analytics have made
their way into new offerings, such as the Mobile Deals app that sends
coupon offers to customers based on their spending habits. It has also
enabled predictive capabilities in the call center, which CapGemini says
can determine the topic of a customer's call within 100 milliseconds with
70 percent accuracy.[6]
6 How Credit Card Companies Are Evolving with Big Data, Datanami, May 2016
2. Elastic scaling
Fast data workloads can vary considerably over a month, week, day, or
even hour. In addition, the volume of data continues to multiply. Based on
these two factors, fast data infrastructure must be able to dynamically and
automatically scale horizontally (i.e., changing the number of service
instances) and vertically (i.e., allocating more or fewer resources to
services), up or down. To ensure that no data is lost, scaling must occur
with no downtime.
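
As a rough illustration of the horizontal case, the sketch below computes how many service instances a given load calls for. The function name, its parameters, and the idea of a fixed per-instance capacity are assumptions made for this example; a real platform would also consider resource limits, cooldown periods, and vertical adjustments.

```python
import math


def desired_instances(events_per_second, capacity_per_instance,
                      min_instances=1, max_instances=100):
    """Return how many service instances are needed for the current load.

    All parameters are illustrative: `capacity_per_instance` is the number of
    events per second that one instance is assumed to handle.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(events_per_second / capacity_per_instance)
    # Clamp so the service never drops to zero instances or grows unbounded.
    return max(min_instances, min(max_instances, needed))
```

Calling this on every metrics interval and comparing the result with the running instance count yields a scale-up or scale-down decision; keeping `min_instances` above zero is one simple way to avoid downtime while the load is low.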
3. Storage management
Fast data applications must be able to read and write data from storage in
real time. There are many kinds of storage, such as local file systems,
7 http://opentracing.io
Fast Data Applications Require New Platform Services
Today, most people think of Hadoop or NoSQL databases when they think
of big data. Recently, several open source technologies have emerged to
address the challenges of processing high-volume, real-time data, most
prominently including Apache Kafka™ for data ingestion, Apache Spark™
for data analysis, Apache Cassandra for distributed storage, and Akka for
building fast data applications.
While Kafka is the most popular message broker, other popular tools
include Apache Flume™ and RabbitMQ. Apache Flume is a distributed,
reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of data. RabbitMQ, backed by Pivotal, is a popular
open source message broker that gives applications a common platform
to send and receive messages. RabbitMQ is preferred for use cases
requiring support for the Advanced Message Queuing Protocol (AMQP).
[Figure: bar chart of message broker popularity. Apache Kafka 86%, Apache Flume 22%, RabbitMQ 21%, Amazon SQS 11%, AWS Kinesis 11%]
10 http://cassandra.apache.org/
11 http://spark.apache.org/
12 Apache Spark Market Survey, Taneja Group, November 2016
[Figure: bar chart comparing Apache Spark, MapReduce, and Apache Storm. Bar values shown: 70%, 50%, 27%]
4. Acting on data
Once real-time data is analyzed, insights need to be presented to a human
or trigger actions in connected devices or applications. Akka is a popular
toolkit and runtime that simplifies development of data-centric applications.
Akka was designed to enable developers to easily build reactive
applications using a high level of abstraction, and the technology makes
building highly concurrent, distributed, and resilient message-driven
applications on the JVM a much simpler process.
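
To give a feel for the message-driven style described above, here is a minimal actor-like sketch in Python. It is not Akka and does not use Akka's API (Akka runs on the JVM); the class `CounterActor` and its methods `tell` and `stop` are hypothetical names chosen for this illustration. The sketch shows only the core idea: private state, a mailbox, and one message handled at a time.

```python
import queue
import threading


class CounterActor:
    """A tiny actor: private state, a mailbox, and one thread that handles messages in order."""

    _STOP = object()  # sentinel message that ends the actor's loop

    def __init__(self):
        self._mailbox = queue.Queue()
        self.counts = {}  # only modified by the actor's own thread while it runs
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message):
        """Send a message without waiting for it to be processed."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the actor to finish its queued messages, then wait for it to exit."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            # One message at a time, so `counts` needs no locks.
            self.counts[message] = self.counts.get(message, 0) + 1
```

Senders never touch the actor's state directly; they only enqueue messages. That separation is what makes this style a natural fit for concurrent applications, and toolkits such as Akka add distribution and supervision on top of it.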
13 Stitchdata blog, Why you shouldn't build your own data pipeline
14 The Sorry State of Server Utilization and the Impending Post Hypervisor Era, Gigaom,
November 2013
15 NRDC Data Center Efficiency Assessment, August 2014
While public cloud provides clear advantages for fast data workloads, the
major downside is the risk of lock-in. Applications that are developed
using public cloud platforms are tied to a specific cloud provider's APIs,
and moving workloads after the fact is nearly impossible without rewriting
them.
One recent story that highlights the risk of cloud lock-in is that of Snap (of
Snapchat). In the S-1 Registration Statement issued by Snap[17] in February
2017, it came to light that Snap had handcuffed itself to Google Cloud. Of
the annual loss of over $500 million, 80% was attributed to contractually
obligated spend with Google. In the same filing, Snap states that they
wrote their application to use some Google services which do not have an
alternative in the market. Google now has them in handcuffs, and there is
The core of DC/OS is the Apache Mesos™ distributed systems kernel. Its
power comes from the two-level scheduling that enables distributed
systems to be pooled and share datacenter resources. Mesos provides
the core primitives for distributed systems, such as resource allocation,
isolation, and quota management. DC/OS provides a highly available
infrastructure for fast data workloads: workloads are automatically
restarted when a server fails. Pooling resources across a datacenter or
cloud also enables elastic scaling, where workloads can scale up or down
based on demand.
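
Two-level scheduling can be sketched in a few lines. In the simplified model below, the first level offers each machine's free resources to the distributed systems (called frameworks here), and the second level is each framework deciding how many of its own tasks to launch on an offer. Every name in the sketch (`Offer`, `Framework`, `offer_round`) is hypothetical and none of it is the Mesos API; the fixed offer order is also a simplification, since a real allocator applies a fairness policy when choosing whom to offer resources to.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Framework:
    """Second level: a framework decides which offered resources to use."""
    name: str
    task_cpus: float
    task_mem_mb: int
    pending_tasks: int
    launched: list = field(default_factory=list)

    def consider(self, offer):
        """Launch as many pending tasks as fit in the offer; return resources used."""
        used_cpus, used_mem = 0.0, 0
        while (self.pending_tasks > 0
               and offer.cpus - used_cpus >= self.task_cpus
               and offer.mem_mb - used_mem >= self.task_mem_mb):
            used_cpus += self.task_cpus
            used_mem += self.task_mem_mb
            self.pending_tasks -= 1
            self.launched.append(offer.agent)
        return used_cpus, used_mem


def offer_round(free, frameworks):
    """First level: offer each agent's free resources to the frameworks in turn.

    `free` maps agent name -> [cpus, mem_mb] and is updated in place.
    """
    for agent, (cpus, mem_mb) in list(free.items()):
        for framework in frameworks:
            used_cpus, used_mem = framework.consider(Offer(agent, cpus, mem_mb))
            cpus -= used_cpus
            mem_mb -= used_mem
        free[agent] = [cpus, mem_mb]
```

Because whatever one framework declines is offered to the next, several data services can share one pool of machines instead of each owning a dedicated silo.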
[Figure: diagram with labels: Resource Offer, Task Launch, Task Status, Executor, Distributed Systems]
1. On-demand provisioning
Mesosphere DC/OS enables single-command install of data services such
as Spark, Cassandra, Kafka and Elasticsearch, among many others. Where
deployment of these services used to be incredibly time-consuming and
error-prone, data services can now be up and running across an entire cluster in
a matter of minutes with Mesosphere DC/OS.
2. Simplified operations
Mesosphere DC/OS dramatically reduces the time and effort involved with
operating data services through simple runtime software upgrades and
Acting fast on new trends: From streaming video services like Go90 to
drone video analysis, Mesosphere DC/OS lets Verizon act quickly on
the types of applications its consumer and enterprise customers
demand.
"Mesosphere DC/OS gives Verizon far-reaching benefits to quickly launch new products
and services while reducing the IT requirements in our data centers."
New class of customer: Esri is now getting requests from a new class
of customer with more sophisticated and large-scale applications.
Before DC/OS, Esri would shy away from those opportunities because
it was outside the scale of what its technology could handle.
"With the Mesosphere DC/OS platform, we can serve a new set of customers with entirely
new capabilities in terms of the performance and intelligence of their map and analytic
applications. And with our cloud-based platform, we can get these solutions up and
running in minutes. This gives Esri and our clients a level of innovation and business
agility we've never had before."
- Adam Mollenkopf, Real-Time & Big Data GIS Capability Lead, Esri
"DC/OS has allowed us to take on, manage and open up wide areas of business that we
couldn't address before. We are able to expand our business to new geographies and
deliver services that address the complexity of serving patients managing chronic health
conditions."
ABOUT MESOSPHERE
Mesosphere is leading the enterprise transformation toward distributed computing and
hybrid cloud. We combine the rich capability you get from public cloud providers with
the freedom and control of choosing your own infrastructure. Mesosphere DC/OS is the
premier platform for building, deploying, and elastically scaling modern applications and
big data services. DC/OS makes running containers, data services, and microservices
easy across your own hardware and cloud instances. Mesosphere was founded in 2013
by the architects of hyperscale infrastructures at Airbnb and Twitter and the co-creator
of Apache Mesos. Mesosphere is headquartered in San Francisco with additional
offices in New York and Hamburg, Germany. Mesosphere's investors include
Andreessen Horowitz, Hewlett Packard Enterprise, Khosla Ventures, Kleiner Perkins
Caufield & Byers, and Microsoft.