Scaling Analytics with Apache Spark

Location:
QuantUniversity Meetup
August 8th 2016
Boston MA
2016 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com

Slides and Code will be available at:
http://www.analyticscertificate.com/SparkWorkshop/

- Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits

• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers (Shell, Firstfuel Software etc.)
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Teaches Analytics in the Babson College MBA
program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO

Quantitative Analytics and Big Data Analytics Onboarding
• Trained more than 500 students in
Quantitative methods, Data Science
and Big Data Technologies using
MATLAB, Python and R
• Launching the Analytics Certificate
Program in September

(MATLAB version also available)

Quantitative Analytics and Big Data Analytics Onboarding
• Apply at:
www.analyticscertificate.com
• Program starting September 18th
• Module 1:
▫ Sep 18th, 25th, Oct 2nd, 9th
• Module 2:
▫ Oct 16th, 23rd, 30th, Nov 6th
• Module 3:
▫ Nov 13th, 20th, Dec 4th, Dec 11th
• Capstone + Certification Ceremony
▫ Dec 18th

Events of Interest
• August
▫ 14-20th : ARPM in New York www.arpm.co
 QuantUniversity presenting on Model Risk on August 14th
▫ 18-21st : Big-data Bootcamp
http://globalbigdataconference.com/68/boston/big-data-bootcamp/event.html
• September
▫ 1st : QuantUniversity Meetup (AnalyticsCertificate program open house)
▫ 11th, 12th : Spark Workshop, Boston
▫ 19th, 20th : Anomaly Detection Workshop, New York

Agenda
1. A quick introduction to Apache Spark
2. A sample Spark Program
3. Clustering using Apache Spark
4. Regression using Apache Spark
5. Simulation using Apache Spark

Apache Spark : Soaring in Popularity
Ref: Wall Street Journal http://www.wsj.com/articles/newer-software-aims-to-crunch-hadoops-numbers-1434326008

What is Spark ?
• Apache Spark™ is a fast and general engine for large-scale data
processing.
• Came out of U.C. Berkeley’s AMP Lab
Lightning-fast cluster computing

Why Spark ?
Speed
Run programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk.
Spark has an advanced DAG execution engine that
supports acyclic data flow and in-memory computing.

Why Spark ?
• Word count in Spark's Python API:
    # `spark` is assumed to be an existing SparkContext
    text_file = spark.textFile("hdfs://...")
    counts = (text_file.flatMap(lambda line: line.split())
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))
Ease of Use
• Write applications quickly in Java, Scala, Python or R.
• Spark offers over 80 high-level operators that
make it easy to build parallel apps. And you can
use it interactively from the Scala and Python
shells.
• R support recently added

Why Spark ?
• Generality
• Combine SQL, streaming, and
complex analytics.
• Spark powers a stack of high-level
tools including:
1. Spark Streaming: processing real-time
data streams
2. Spark SQL and DataFrames: support
for structured data and relational
queries
3. MLlib: built-in machine learning library
4. GraphX: Spark’s new API for graph
processing

Why Spark?
• Runs Everywhere
• Spark runs on Hadoop, Mesos,
standalone, or in the cloud. It can
access diverse data sources
including HDFS, Cassandra,
HBase, and S3.
• You can run Spark using
its standalone cluster mode,
on EC2, on Hadoop YARN, or
on Apache Mesos.
• Access data
in HDFS, Cassandra, HBase,
Hive, Tachyon, and any Hadoop
data source.

Key Features of Spark
• Handles batch, interactive, and real-time within a single
framework
• Native integration with Java, Python, Scala, R
• Programming at a higher level of abstraction
• More general: map/reduce is just one set of supported
constructs

Secret Sauce : RDD, Transformation, Action

How does it work?
• Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a
fault-tolerant collection of elements that can be operated on in parallel.
• Transformations create a new dataset from an existing one. All transformations
in Spark are lazy: they do not compute their results right away – instead they
remember the transformations applied to some base dataset.
• Actions return a value to the driver program after running a computation on
the dataset.
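
The difference between lazy transformations and actions can be illustrated without a cluster. The sketch below is plain Python, not Spark: `LazyDataset` and its method names are hypothetical and only mimic the RDD idea on one machine (no partitioning, no fault tolerance). Transformations merely record a step; only the action `collect` replays the recorded steps over the base data.

```python
class LazyDataset:
    """Toy single-machine stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source        # base dataset, never modified
        self._steps = tuple(steps)   # recorded transformations, in order

    def _with_step(self, kind, fn):
        # A transformation returns a new dataset that remembers one more step.
        return LazyDataset(self._source, self._steps + ((kind, fn),))

    # Transformations: nothing is computed here.
    def map(self, fn):
        return self._with_step("map", fn)

    def flat_map(self, fn):
        return self._with_step("flat_map", fn)

    def filter(self, fn):
        return self._with_step("filter", fn)

    def reduce_by_key(self, fn):
        return self._with_step("reduce_by_key", fn)

    # Action: replays every recorded step and returns a value to the caller.
    def collect(self):
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            elif kind == "flat_map":
                data = [y for x in data for y in fn(x)]
            elif kind == "filter":
                data = [x for x in data if fn(x)]
            elif kind == "reduce_by_key":
                merged = {}
                for key, value in data:
                    merged[key] = fn(merged[key], value) if key in merged else value
                data = list(merged.items())
        return data


calls = []  # records when split_line actually runs


def split_line(line):
    calls.append(line)
    return line.split()


lines = LazyDataset(["to be or", "not to be"])
counts = (lines.flat_map(split_line)
               .map(lambda word: (word, 1))
               .reduce_by_key(lambda a, b: a + b))
calls_before_action = len(calls)   # no work has been done yet
result = dict(counts.collect())    # the action triggers the computation
```

Building `counts` does not call `split_line` at all; the function runs only when `collect` is invoked, which is the behavior the bullets above describe for Spark transformations and actions.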

How is Spark different?
• Map – Reduce : Hadoop

Problems with this MR model
• Difficult to code

Getting started
• http://spark.apache.org/docs/latest/index.html
• http://datascience.ibm.com/
• https://community.cloud.databricks.com

Quick Demo
• Test_Notebook.ipynb

Machine learning with Spark

Use case 1 : Segmenting stocks
• If we have a basket of stocks and their price history, how do we
segment them into different clusters?
• What metrics could we use to measure similarity?
• Can we evaluate the effect of changing the number of clusters ?
• Do the results seem actionable?

K-means
Given a set of observations (x1, x2, …, xn), where each observation is
a d-dimensional real vector, k-means clustering aims to partition
the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize
the within-cluster sum of squares (WCSS). In other words, its objective
is to find:
$$ \underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 $$
where μi is the mean of points in Si.
http://shabal.in/visuals/kmeans/2.html
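
The WCSS objective can be made concrete with a small sketch of the standard iterative (Lloyd's) procedure: assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until nothing changes. This is plain single-machine Python, not the Spark MLlib implementation used in the demo; the function names, the hand-picked initial centroids and the six sample points are illustrative assumptions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def wcss(clusters, centroids):
    """Within-cluster sum of squares for a given partition."""
    return sum(squared_distance(p, c)
               for members, c in zip(clusters, centroids)
               for p in members)


def kmeans(points, initial_centroids, iterations=100):
    centroids = list(initial_centroids)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [mean_point(members) if members else centroids[i]
                   for i, members in enumerate(clusters)]
        if updated == centroids:   # converged: assignments no longer change
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Made-up 2-dimensional observations forming two visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

centroids_1, clusters_1 = kmeans(points, [points[0]])             # k = 1
centroids_2, clusters_2 = kmeans(points, [points[0], points[3]])  # k = 2
wcss_1 = wcss(clusters_1, centroids_1)
wcss_2 = wcss(clusters_2, centroids_2)
```

Comparing `wcss_1` with `wcss_2` is one simple way to look at the effect of changing the number of clusters: WCSS drops sharply once k matches the number of natural groups in the data.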

Demo
• Kmeans spark case.ipynb

Use-case 2 – Regression
• Given historical weekly data on AAA bond yields, 10-year Treasury
rates, 30-year Treasury rates and the federal funds rate, build a
regression model of the form:
• Changes to AAA = function of (changes to 10-year rates, changes to 30-year rates, changes to FF rates)

Linear regression
• Linear regression investigates the linear relationships between variables and
predicts one variable based on one or more other variables. It can be
formulated as:
$$ Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i $$
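where Y and X_i are random variables, β_i are the regression coefficients and β_0 is a
constant (the intercept).
• In this model, the ordinary least squares estimator is usually used: it chooses the
coefficients that minimize the sum of squared differences between the observed values
of the dependent variable and the values predicted from the independent variables.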
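
As a rough illustration of ordinary least squares, the sketch below solves the normal equations (XᵀX)β = Xᵀy in plain Python. It is not the Spark MLlib code from the demo; the function names are hypothetical, and the rate changes and "true" coefficients are made-up numbers chosen only so the fit can be checked, not real bond data.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):   # back substitution
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def ols_fit(rows, y):
    """Return [beta_0, beta_1, ..., beta_p] minimizing the sum of squared residuals."""
    design = [[1.0] + list(row) for row in rows]   # leading 1.0 gives the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * target for d, target in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical weekly changes: (10-year rate, 30-year rate, fed funds rate).
rows = [(0.10, 0.05, 0.00), (-0.05, 0.02, 0.25), (0.20, 0.15, 0.00),
        (0.00, -0.10, -0.25), (-0.15, -0.05, 0.00), (0.05, 0.10, 0.25)]

# Assumed coefficients used to generate noise-free changes in the AAA yield.
true_beta = [0.01, 0.6, 0.3, 0.05]
y = [true_beta[0] + sum(b * x for b, x in zip(true_beta[1:], row)) for row in rows]

beta = ols_fit(rows, y)
```

Because the targets here are generated without noise, `beta` recovers the assumed coefficients up to rounding; with real data the fit would instead give the coefficients with the smallest sum of squared residuals.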

Ordinary Least Squares Regression

Scaling Monte-Carlo simulations

Example:
• Portfolio Growth
• Given:
▫ INVESTMENT_INIT = 100000 # starting amount
▫ INVESTMENT_ANN = 10000 # yearly new investment
▫ TERM = 30 # number of years
▫ MKT_AVG_RETURN = 0.11 # percentage
▫ MKT_STD_DEV = 0.18 # standard deviation
▫ Run 10000 monte-carlo simulation paths and compute the expected
value of the portfolio at the end of 30 years
Ref: https://cloud.google.com/solutions/monte-carlo-methods-with-hadoop-spark
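
A single-machine sketch of this simulation is shown below, using the parameters from the slide. Two points are assumptions rather than something the slide states: each year's return is drawn from a normal distribution, and the yearly contribution is added after that year's growth. The function names are illustrative.

```python
import random
from statistics import mean

INVESTMENT_INIT = 100000   # starting amount
INVESTMENT_ANN = 10000     # yearly new investment
TERM = 30                  # number of years
MKT_AVG_RETURN = 0.11      # mean yearly return
MKT_STD_DEV = 0.18         # standard deviation of the yearly return


def simulate_path(seed, avg=MKT_AVG_RETURN, std=MKT_STD_DEV):
    """Simulate one 30-year path and return the final portfolio value."""
    rng = random.Random(seed)   # one independent generator per path
    value = INVESTMENT_INIT
    for _ in range(TERM):
        annual_return = rng.normalvariate(avg, std)
        # Assumption: grow the balance first, then add the new investment.
        value = value * (1 + annual_return) + INVESTMENT_ANN
    return value


def expected_end_value(num_paths=10000):
    """Average the final value over independently seeded paths."""
    return mean(simulate_path(seed) for seed in range(num_paths))
```

Each path depends only on its seed, so the paths are independent and easy to distribute. Assuming a SparkContext named `sc`, the same function could be mapped over an RDD of seeds, for example `sc.parallelize(range(10000)).map(simulate_path).mean()`.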

HyperLogLog
• The count-distinct problem is the problem of finding the number of
distinct elements in a data stream with repeated elements.
• HyperLogLog is an algorithm for the count-distinct problem,
approximating the number of distinct elements in a multiset.
• Calculating the exact cardinality of a multiset requires an amount of
memory proportional to the cardinality, which is impractical for very
large data sets. Probabilistic cardinality estimators, such as the
HyperLogLog algorithm, use significantly less memory than this, at
the cost of obtaining only an approximation of the cardinality.
Ref: https://en.wikipedia.org/wiki/HyperLogLog

HyperLogLog
The basis of the HyperLogLog algorithm is the observation
that the cardinality of a multiset of uniformly distributed
random numbers can be estimated by calculating the
maximum number of leading zeros in the binary
representation of each number in the set. If the maximum
number of leading zeros observed is n, an estimate for the
number of distinct elements in the set is 2^n.
Ref: https://en.wikipedia.org/wiki/HyperLogLog
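
The leading-zeros idea can be sketched in a few lines of Python. A single maximum gives a very noisy estimate, so the sketch follows the HyperLogLog approach of splitting the hashed values into many registers and combining the per-register maxima with a harmonic mean. This is an illustrative single-machine version, not Spark's implementation; the function names, the use of SHA-1 as the hash and the default of 2^10 registers are assumptions.

```python
import hashlib
import math


def _hash64(item):
    """Map an item to a 64-bit integer that behaves like a uniform random number."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hll_estimate(items, p=10):
    """Estimate the number of distinct items using 2**p registers."""
    m = 1 << p
    registers = [0] * m
    rest_bits = 64 - p
    for item in items:
        h = _hash64(item)
        index = h >> rest_bits                  # first p bits choose the register
        rest = h & ((1 << rest_bits) - 1)       # remaining bits
        # Rank = number of leading zeros in the remaining bits, plus one.
        rank = rest_bits - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)            # bias correction, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting while many registers are empty.
    empty = registers.count(0)
    if raw <= 2.5 * m and empty:
        return m * math.log(m / empty)
    return raw


# A stream of 50,000 elements containing only 10,000 distinct values.
stream = [i % 10000 for i in range(50000)]
estimate = hll_estimate(stream)
```

Memory use is fixed at 2^p small integers regardless of the stream length, which is the trade-off described above: far less memory than an exact count, in exchange for an approximate answer.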

Demo from Databricks’s blog
• Approximate algorithms
▫ approxCountDistinct: returns an estimate of the number of distinct
elements
▫ approxQuantile: returns approximate percentiles of numerical data
Refer:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/4196864626084292/3601578643761083/latest.html

Spark’s implementation
• As per Databricks’s blog:
“Spark strives at implementing approximate algorithms that are
deterministic (they do not depend on random numbers to work) and that
have proven theoretical error bounds: for each algorithm, the user can
specify a target error bound, and the result is guaranteed to be within
this bound, either exactly (deterministic error bounds) or with very high
confidence (probabilistic error bounds)”

www.analyticscertificate.com/SparkWorkshop

Q&A
Slides, code and details about the Apache Spark Workshop
at: http://www.analyticscertificate.com/SparkWorkshop/

Thank you!
Members & Sponsors!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.