Learning Spark
LIGHTNING-FAST DATA ANALYTICS
This Preview Edition of Learning Spark, Chapter 1, is a work in progress. The final
book is currently scheduled for release in February 2015 and will be available at
oreilly.com and other retailers once it is published.
Learning Spark
by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
Copyright 2010 Databricks. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].
First Edition
ISBN: 978-1-449-35862-4
Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
1. Introduction to Data Analysis with Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
    What is Apache Spark?                       1
    A Unified Stack                             2
        Spark Core                              3
        Spark SQL                               3
        Spark Streaming                         4
        MLlib                                   4
        GraphX                                  4
        Cluster Managers                        4
    Who Uses Spark, and For What?               5
        Data Science Tasks                      5
        Data Processing Applications            6
    A Brief History of Spark                    6
    Spark Versions and Releases                 7
    Persistence layers for Spark                7
Preface
As parallel data analysis has become increasingly common, practitioners in many fields
have sought easier tools for this task. Apache Spark has quickly emerged as one of the
most popular tools for this purpose, extending and generalizing MapReduce. Spark
offers three main benefits. First, it is easy to use: you can develop applications on your laptop, using a high-level API that lets you focus on the content of your computation.
Second, Spark is fast, enabling interactive use and complex algorithms. And third, Spark
is a general engine, allowing you to combine multiple types of computations (e.g., SQL
queries, text processing and machine learning) that might previously have required
learning different engines. These features make Spark an excellent starting point to learn
about big data in general.
This introductory book is meant to get you up and running with Spark quickly. You'll learn how to download and run Spark on your laptop and use it interactively to learn the API. Once there, we'll cover the details of available operations and distributed execution. Finally, you'll get a tour of the higher-level libraries built into Spark, including libraries for machine learning, stream processing, graph analytics, and SQL. We hope that this book gives you the tools to quickly tackle data analysis problems, whether you do so on one machine or hundreds.
Audience
This book targets data scientists and engineers. We chose these two groups because they have the most to gain from using Spark to expand the scope of problems they can solve. Spark's rich collection of data-focused libraries (like MLlib) makes it easy for data scientists to go beyond problems that fit on a single machine while making use of their statistical background. Engineers, meanwhile, will learn how to write general-purpose distributed programs in Spark and operate production applications. Engineers and data scientists will both learn different details from this book, but will both be able to apply Spark to solve large distributed problems in their respective fields.
Data scientists focus on answering questions or building models from data. They often
have a statistical or math background and some familiarity with tools like Python, R
and SQL. We have made sure to include Python and, where relevant, SQL examples for all our material, as well as an overview of the machine learning and advanced analytics libraries in Spark. If you are a data scientist, we hope that after reading this book you
will be able to use the same mathematical approaches to solving problems, except much
faster and on a much larger scale.
The second group this book targets is software engineers who have some experience
with Java, Python or another programming language. If you are an engineer, we hope
that this book will show you how to set up a Spark cluster, use the Spark shell, and write
Spark applications to solve parallel processing problems. If you are familiar with Hadoop, you have a bit of a head start on figuring out how to interact with HDFS and how to manage a cluster, but either way, we will cover basic distributed execution concepts.
Regardless of whether you are a data scientist or engineer, to get the most out of this book you should have some familiarity with one of Python, Java, Scala, or a similar language. We assume that you already have a solution for storing your data, and we cover how to load and save data from many common ones, but not how to set them up. If you don't have experience with one of those languages, don't worry: there are excellent resources available to learn them. We call out some of the books available in "Supporting Books".
Supporting Books
If you are a data scientist and don't have much experience with Python, the Learning Python and Head First Python books are both excellent introductions. If you have some Python experience and want some more, Dive into Python is a great book to get a deeper understanding of Python.
If you are an engineer and after reading this book you would like to expand your data analysis skills, Machine Learning for Hackers and Doing Data Science are excellent books from O'Reilly.
This book is intended to be accessible to beginners. We do intend to release a deep-dive follow-up for those looking to gain a more thorough understanding of Spark's internals.
Code Examples
All of the code examples found in this book are on GitHub. You can examine them and
check them out from https://github.com/databricks/learning-spark. Code examples are
provided in Java, Scala, and Python.
Our Java examples are written to work with Java version 6 and higher. Java 8 introduces a new syntax called lambdas that makes writing inline functions much easier, which can simplify Spark code. We have chosen not to take advantage of this syntax in most of our examples, as most organizations are not yet using Java 8. If you would like to try Java 8 syntax, you can see the Databricks blog post on this topic. Some of the examples will also be ported to Java 8 and posted to the book's GitHub.
chapter for correctness with an extremely short timeline. Joseph Bradley provided an
introductory example for MLlib in all of its APIs. Reza Zadeh provided text and code
examples for dimensionality reduction. Xiangrui Meng, Joseph Bradley and Reza Zadeh
also provided editing and technical feedback to improve the MLlib chapter.
CHAPTER 1
Introduction to Data Analysis with Spark
This chapter provides a high-level overview of what Apache Spark is. If you are already familiar with Apache Spark and its components, feel free to jump ahead to (to come).
A Unified Stack
The Spark project contains multiple closely integrated components. At its core, Spark is a computational engine that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning. These components are designed to interoperate closely, letting you combine them like libraries in a software project.
A philosophy of tight integration has several benefits. First, all libraries and higher-level components in the stack benefit from improvements at the lower layers. For example, when Spark's core engine adds an optimization, SQL and machine learning libraries automatically speed up as well. Second, the costs associated with running the stack are minimized, because instead of running 5-10 independent software systems, an organization only needs to run one. These costs include deployment, maintenance, testing, support, and more. This also means that each time a new component is added to the Spark stack, every organization that uses Spark will immediately be able to try this new component. This changes the cost of trying out a new type of data analysis from downloading, deploying, and learning a new software project to upgrading Spark.
Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly combine different processing models. For example, in Spark you can write one application that uses machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously, analysts can query the resulting data, also in real time, via SQL (e.g., to join the data with unstructured log files). In addition, more sophisticated data engineers and data scientists can access the same data via the Python shell for ad hoc analysis. Others might access the data in standalone batch applications. All the while, the IT team only has to maintain one software stack.
Spark Core
Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems, and
more. Spark Core is also home to the API that defines Resilient Distributed Datasets (RDDs), which are Spark's main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel.
Spark Core provides many APIs for building and manipulating these collections.
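To make this concrete, here is a small, self-contained Scala sketch (the application and object names are our own and not taken from the book's example repository). It turns a local collection into an RDD, builds new RDDs with transformations such as filter and map, and triggers computation with actions such as count and reduce:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddExample {
      def main(args: Array[String]): Unit = {
        // Run locally with all available cores; on a cluster you would pass a different master URL.
        val conf = new SparkConf().setAppName("RddExample").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Distribute a local collection across the cluster as an RDD.
        val numbers = sc.parallelize(1 to 1000)

        // Transformations (filter, map) describe new RDDs; actions (count, reduce) return results.
        val evens = numbers.filter(_ % 2 == 0)
        val sumOfSquares = evens.map(n => n.toLong * n).reduce(_ + _)

        println(s"Even numbers: ${evens.count()}, sum of their squares: $sumOfSquares")
        sc.stop()
      }
    }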
Spark SQL
Spark SQL is Spark's package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL, called the Hive Query Language (HQL), and it supports many sources of data, including Hive tables, Parquet, and JSON. Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java, and Scala, all within a single application, thus combining SQL with complex analytics. This tight integration with the rich computing environment provided by Spark makes Spark SQL unlike any other open source data warehouse tool. Spark SQL was added to Spark in version 1.0.
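As a brief illustration (the file name, schema, and query are invented for this example), the following Scala sketch uses the SQLContext API from the Spark 1.x line covered by this book to load JSON data, register it as a temporary table, and combine a SQL query with ordinary programmatic access to the results:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object SqlExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SqlExample").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)

        // Load a JSON file (hypothetical) whose records look like {"name": ..., "age": ...}.
        val people = sqlContext.jsonFile("people.json")
        people.registerTempTable("people")

        // Query with SQL, then keep working with the result programmatically.
        val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
        adults.collect().foreach(println)

        sc.stop()
      }
    }

Later Spark releases expose the same idea through newer entry points, but the pattern of registering data and mixing SQL queries with program logic is unchanged.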
Shark was an older SQL-on-Spark project out of UC Berkeley that modified Apache Hive to run on Spark. It has now been replaced by Spark SQL to provide better integration with the Spark engine and language APIs.
Spark Streaming
Spark Streaming is a Spark component that enables processing live streams of data.
Examples of data streams include log files generated by production web servers, or
queues of messages containing status updates posted by users of a web service. Spark
Streaming provides an API for manipulating data streams that closely matches Spark Core's RDD API, making it easy for programmers to learn the project and move between applications that manipulate data stored in memory, on disk, or arriving in real time.
Underneath its API, Spark Streaming was designed to provide the same degree of fault
tolerance, throughput, and scalability that the Spark Core provides.
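The following Scala sketch shows that resemblance (the host, port, and batch interval are placeholder choices): a per-batch count of words over lines arriving on a network socket, written in the same flatMap/map/reduceByKey style used for batch RDDs:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingExample {
      def main(args: Array[String]): Unit = {
        // Use two local threads: one to receive data and one to process it.
        val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Hypothetical source: lines of text arriving on a local socket (e.g., started with `nc -lk 9999`).
        val lines = ssc.socketTextStream("localhost", 9999)

        // The DStream API mirrors the RDD API.
        val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        wordCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }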
MLlib
Spark comes with a library containing common machine learning (ML) functionality
called MLlib. MLlib provides multiple types of machine learning algorithms, including
classification, regression, clustering and collaborative filtering, as well as supporting
functionality such as model evaluation and data import. It also provides some lower-level ML primitives, including a generic gradient descent optimization algorithm. All of
these methods are designed to scale out across a cluster.
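For instance, the following Scala sketch (the data points are made up) clusters a handful of two-dimensional vectors with MLlib's K-means implementation; training runs as ordinary Spark jobs over an RDD, so it scales with the cluster:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object MLlibExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MLlibExample").setMaster("local[*]"))

        // A tiny, made-up dataset of two-dimensional points, parallelized as an RDD of vectors.
        val points = sc.parallelize(Seq(
          Vectors.dense(0.0, 0.0), Vectors.dense(0.5, 0.5),
          Vectors.dense(9.0, 9.0), Vectors.dense(9.5, 8.5)
        ))

        // Cluster the points into k = 2 groups, running at most 20 iterations.
        val model = KMeans.train(points, 2, 20)
        model.clusterCenters.foreach(center => println(s"Cluster center: $center"))

        sc.stop()
      }
    }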
GraphX
GraphX is a library that provides an API for manipulating graphs (e.g., a social network's friend graph) and performing graph-parallel computations. Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides a set of operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle counting).
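A minimal Scala sketch (the vertices, edges, and property values are invented) builds a tiny directed graph and runs the built-in PageRank algorithm over it:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    object GraphXExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("GraphXExample").setMaster("local[*]"))

        // Vertices carry an arbitrary property (here, a user name); edges carry a relationship label.
        val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
        val relationships = sc.parallelize(Seq(
          Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
        ))
        val graph = Graph(users, relationships)

        // Run PageRank until convergence, then join the scores back to the user names.
        val ranks = graph.pageRank(0.001).vertices
        users.join(ranks).collect().foreach { case (_, (name, rank)) =>
          println(s"$name: $rank")
        }

        sc.stop()
      }
    }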
Cluster Managers
Cluster managers are a bit different from the previous components: rather than being built on top of Spark, they are what Spark runs on. Under the hood, Spark is designed to efficiently scale up from one to many thousands of compute nodes. To achieve this while maximizing flexibility, Spark can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler. If you are just installing Spark on an empty set of machines, the Standalone Scheduler provides an easy way to get started, while if you already have a Hadoop YARN or Mesos cluster, Spark's support for these allows your applications to also run on them. The (to come) explores the different options and how to choose the correct cluster manager.
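As a rough sketch of how this choice surfaces in code (the host names and ports below are placeholders, not real clusters), the master URL given to Spark determines which cluster manager is used; in practice it is usually supplied at launch time, for example via spark-submit's --master flag, rather than hard-coded:

    import org.apache.spark.SparkConf

    object MasterExamples {
      // Local mode: run everything in one process, using four worker threads.
      val localConf = new SparkConf().setAppName("app").setMaster("local[4]")

      // Standalone Scheduler: point at the standalone master included with Spark.
      val standaloneConf = new SparkConf().setAppName("app").setMaster("spark://master-host:7077")

      // Apache Mesos: point at a Mesos master instead.
      val mesosConf = new SparkConf().setAppName("app").setMaster("mesos://mesos-host:5050")

      // Hadoop YARN deployments are typically selected the same way at launch time
      // (through spark-submit) rather than by hard-coding a master URL here.
    }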
organizations list themselves on the Spark Powered By page,[1] and dozens speak about their use cases at Spark community events such as Spark Meetups[2] and the Spark Summit.[3] Apart from UC Berkeley, major contributors to the project currently include Databricks, Yahoo!, and Intel.
In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark (Hive on Spark)[4] and Spark Streaming. These and other components are sometimes referred to as the Berkeley Data Analytics Stack (BDAS).[5]
Spark was first open sourced in March 2010, and was transferred to the Apache Software
Foundation in June 2013, where it is now a top-level project.
1. https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
2. http://www.meetup.com/spark-users/
3. http://spark-summit.org
4. Shark has been replaced by Spark SQL.
5. https://amplab.cs.berkeley.edu/software