Location:
QuantUniversity Meetup
August 8th 2016
Boston MA
Scaling Analytics with Apache Spark
2016 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com
Slides and Code will be available at:
http://www.analyticscertificate.com/SparkWorkshop/
- Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits
• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers (Shell, Firstfuel Software etc.)
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Teaches Analytics in the Babson College MBA
program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO
Quantitative Analytics and Big Data Analytics Onboarding
• Trained more than 500 students in
Quantitative methods, Data Science
and Big Data Technologies using
MATLAB, Python and R
• Launching the Analytics Certificate
Program in September
(MATLAB version also available)
Quantitative Analytics and Big Data Analytics Onboarding
• Apply at:
www.analyticscertificate.com
• Program starting September 18th
• Module 1:
▫ Sep 18th , 25th , Oct 2nd, 9th
• Module 2:
▫ Oct 16th, 23rd, 30th, Nov 6th
• Module 3:
▫ Nov 13th, 20th, Dec 4th, Dec 11th
• Capstone + Certification Ceremony
▫ Dec 18th
• August
▫ 14-20th : ARPM in New York www.arpm.co
  ▪ QuantUniversity presenting on Model Risk on August 14th
▫ 18-21st : Big-data Bootcamp
http://globalbigdataconference.com/68/boston/big-data-bootcamp/event.html
• September
▫ 1st : QuantUniversity Meetup (AnalyticsCertificate program open house)
▫ 11th, 12th : Spark Workshop, Boston
▫ 19th, 20th : Anomaly Detection Workshop, New York
Events of Interest
Agenda
1. A quick introduction to Apache Spark
2. A sample Spark Program
3. Clustering using Apache Spark
4. Regression using Apache Spark
5. Simulation using Apache Spark
Apache Spark : Soaring in Popularity
Ref: Wall Street Journal http://www.wsj.com/articles/newer-software-aims-to-crunch-hadoops-numbers-1434326008
What is Spark ?
• Apache Spark™ is a fast and general engine for large-scale data
processing.
• Came out of U.C. Berkeley’s AMP Lab
Lightning-fast cluster computing
Why Spark ?
Speed
Run programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk.
Spark has an advanced DAG (directed acyclic graph) execution engine that
supports acyclic data flow and in-memory computing.
Why Spark ?
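• Word count in Spark's Python API
# `spark` is assumed to be a SparkContext here (commonly named `sc`)
text_file = spark.textFile("hdfs://...")
counts = (text_file.flatMap(lambda line: line.split())  # split each line into words
          .map(lambda word: (word, 1))                  # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b))             # sum the counts per word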
Ease of Use
• Write applications quickly in Java, Scala, Python or R.
• Spark offers over 80 high-level operators that
make it easy to build parallel apps. And you can
use it interactively from the Scala and Python
shells.
• R support recently added
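To make the three steps of the word count above concrete, here is a minimal single-machine sketch in plain Python (no Spark required). The function name `word_count` and the sample lines are illustrative assumptions; Spark would apply the same logic in parallel across the partitions of a distributed dataset.

```python
from collections import defaultdict

def word_count(lines):
    """Local stand-in for the flatMap -> map -> reduceByKey pipeline."""
    # flatMap: split every line and flatten the result into one stream of words
    words = (word for line in lines for word in line.split())
    # map: pair each word with a count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: sum the counts that share the same key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

sample = ["spark is fast", "spark is general"]  # hypothetical input lines
print(word_count(sample))
```

The generators mirror the pipeline shape: nothing is split or counted until the final loop consumes the stream.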
Why Spark ?
• Generality
• Combine SQL, streaming, and
complex analytics.
• Spark powers a stack of high-level
tools including:
1. Spark Streaming: processing real-time
data streams
2. Spark SQL and DataFrames: support
for structured data and relational
queries
3. MLlib: built-in machine learning library
4. GraphX: Spark’s new API for graph
processing
Why Spark?
• Runs Everywhere
• Spark runs on Hadoop, Mesos,
standalone, or in the cloud. It can
access diverse data sources
including HDFS, Cassandra,
HBase, and S3.
• You can run Spark using
its standalone cluster mode,
on EC2, on Hadoop YARN, or
on Apache Mesos.
• Access data
in HDFS, Cassandra, HBase,
Hive, Tachyon, and any Hadoop
data source.
Key Features of Spark
• Handles batch, interactive, and real-time within a single
framework
• Native integration with Java, Python, Scala, R
• Programming at a higher level of abstraction
• More general: map/reduce is just one set of supported
constructs
Secret Sauce : RDD, Transformation, Action
How does it work?
• Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a
fault-tolerant collection of elements that can be operated on in parallel.
• Transformations create a new dataset from an existing one. All transformations
in Spark are lazy: they do not compute their results right away – instead they
remember the transformations applied to some base dataset.
• Actions return a value to the driver program after running a computation on
the dataset.
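The difference between lazy transformations and actions can be illustrated with a toy class in plain Python. This is only a sketch of the idea, not Spark's implementation: `LazyDataset`, `square` and `calls` are hypothetical names introduced here, and the class merely records each transformation until an action asks for a result.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source   # base data (an iterable that can be re-read)
        self._steps = steps     # recorded transformations, in order

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Actions: replay the recorded steps and return a value to the caller.
    def collect(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())

calls = []  # records when the mapped function actually runs

def square(x):
    calls.append(x)
    return x * x

squares = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
print(len(calls))         # 0: no transformation has run yet
print(squares.collect())  # [0, 4, 16]: the action triggers the computation
```

Because each dataset keeps the full list of steps applied to its base data, a lost result can be recomputed from that record, which is the intuition behind the fault tolerance mentioned above.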
How is Spark different?
• Map – Reduce : Hadoop
Problems with this MR model
• Difficult to code
Getting started
• http://spark.apache.org/docs/latest/index.html
• http://datascience.ibm.com/
• https://community.cloud.databricks.com
Quick Demo
• Test_Notebook.ipynb
Machine learning with Spark
Use case 1 : Segmenting stocks
• If we have a basket of stocks and their price history, how do we
segment them into different clusters?
• What metrics could we use to measure similarity?
• Can we evaluate the effect of changing the number of clusters ?
• Do the results seem actionable?
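One possible answer to the similarity question, offered only as a sketch, is to compare how the stocks' returns move together. The example below computes a correlation-based distance in plain Python; the helper names and the two price series are hypothetical, and other metrics (for example Euclidean distance on normalized prices) would be equally valid starting points.

```python
import math

def returns(prices):
    """Simple period-over-period returns from a price series."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(prices_a, prices_b):
    """Near 0 when returns move together, up to 2 when they move in opposition."""
    return 1.0 - correlation(returns(prices_a), returns(prices_b))

# Hypothetical price histories, for illustration only.
stock_a = [100.0, 102.0, 101.0, 105.0, 107.0]
stock_b = [50.0, 51.0, 50.5, 52.5, 53.5]
print(correlation_distance(stock_a, stock_b))
```

Working on returns rather than raw prices keeps a high-priced stock from dominating the comparison purely because of its price level.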
K-means
Given a set of observations (x1, x2, …, xn), where each observation is
a d-dimensional real vector, k-means clustering aims to partition
the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize
the within-cluster sum of squares (WCSS). In other words, its objective
is to find:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where μi is the mean of points in Si.
http://shabal.in/visuals/kmeans/2.html
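The WCSS objective can be sketched with Lloyd's algorithm, the standard iterative heuristic for k-means, in plain Python. This is a small single-machine illustration under assumed data, not the code from the demo notebook; the function names and the six sample points are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def wcss(clusters, centers):
    """Within-cluster sum of squares: the quantity k-means tries to minimize."""
    return sum(squared_distance(p, c)
               for pts, c in zip(clusters, centers) for p in pts)

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # initial centers picked from the data
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [mean_point(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters, centers

# Hypothetical 2-D observations forming two obvious groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
clusters, centers = kmeans(data, k=2)
print(sorted(len(c) for c in clusters), wcss(clusters, centers))
```

Re-running `kmeans` for several values of `k` and comparing the resulting `wcss` is one simple way to explore the effect of changing the number of clusters.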
Demo
• Kmeans spark case.ipynb
Use-case 2 – Regression
• Given historical weekly interest rate data for AAA bond yields, 10-year
Treasuries, 30-year Treasuries and the Federal funds rate, build a
regression model that fits:
• Changes to AAA = function of (Changes to 10-year rates, Changes to 30-year rates, Changes to FF rates)
Linear regression
• Linear regression models the linear relationship between variables and
predicts one variable from one or more other variables. It can be
formulated as:
$$Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i$$
where Y and X_i are random variables, β_i are the regression coefficients and β_0 is a
constant (the intercept).
• In this model, the ordinary least squares (OLS) estimator is usually used: it chooses the
coefficients that minimize the sum of squared differences between the observed values of the
dependent variable and the values predicted from the independent variables.
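As a rough illustration of ordinary least squares with several predictors, the sketch below solves the normal equations X'Xβ = X'y in plain Python. It is not the notebook's code: `solve`, `ols_fit` and the sample data are assumptions made for this example, and it presumes the predictors are not collinear (otherwise the elimination would divide by zero).

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                     # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def ols_fit(rows, y):
    """Return [beta_0, beta_1, ..., beta_p] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]         # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)                           # normal equations: X'X beta = X'y

# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 with no noise.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in rows]
print([round(b, 6) for b in ols_fit(rows, y)])
```

For the interest-rate use case, each row would hold the weekly changes in the three predictor rates and `y` the corresponding change in the AAA yield.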
Ordinary Least Squares Regression
Demo
• Regression.ipynb
Scaling Monte-Carlo simulations
Example:
• Portfolio Growth
• Given:
▫ INVESTMENT_INIT = 100000 # starting amount
▫ INVESTMENT_ANN = 10000 # yearly new investment
▫ TERM = 30 # number of years
▫ MKT_AVG_RETURN = 0.11 # percentage
▫ MKT_STD_DEV = 0.18 # standard deviation
▫ Run 10000 monte-carlo simulation paths and compute the expected
value of the portfolio at the end of 30 years
Ref: https://cloud.google.com/solutions/monte-carlo-methods-with-hadoop-spark
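The portfolio-growth setup above can be sketched on a single machine in plain Python. The constants are the ones given on the slide; the assumptions made here are that annual returns are drawn independently from a normal distribution, that the yearly investment is added at the end of each year, and that each path is seeded by its index. The names `grow` and `expected_value` are illustrative.

```python
import random
import statistics

INVESTMENT_INIT = 100000   # starting amount
INVESTMENT_ANN = 10000     # yearly new investment
TERM = 30                  # number of years
MKT_AVG_RETURN = 0.11      # mean annual return
MKT_STD_DEV = 0.18         # standard deviation of annual return

def grow(seed):
    """Simulate one 30-year path and return the final portfolio value."""
    rng = random.Random(seed)   # one generator per path, seeded by the path index
    value = INVESTMENT_INIT
    for _ in range(TERM):
        annual_return = rng.normalvariate(MKT_AVG_RETURN, MKT_STD_DEV)
        value += value * annual_return + INVESTMENT_ANN
    return value

def expected_value(num_paths=10000):
    """Average the final values over many simulated paths."""
    # Each path depends only on its seed, so the paths could be mapped
    # over a distributed collection of seeds and averaged at the end.
    return statistics.mean(grow(seed) for seed in range(num_paths))

print(round(expected_value()))
```

Under these assumptions the exact expectation is INVESTMENT_INIT × 1.11^30 + INVESTMENT_ANN × (1.11^30 − 1) / 0.11, roughly 4.28 million, which gives a sanity check for the simulated average.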
• The count-distinct problem is the problem of finding the number of
distinct elements in a data stream with repeated elements.
• HyperLogLog is an algorithm for the count-distinct problem,
approximating the number of distinct elements in a multiset
• Calculating the exact cardinality of a multiset requires an amount of
memory proportional to the cardinality, which is impractical for very
large data sets. Probabilistic cardinality estimators, such as the
HyperLogLog algorithm, use significantly less memory than this, at
the cost of obtaining only an approximation of the cardinality.
HyperLogLog
Ref: https://en.wikipedia.org/wiki/HyperLogLog
HyperLogLog
The basis of the HyperLogLog algorithm is the observation
that the cardinality of a multiset of uniformly distributed
random numbers can be estimated by calculating the
maximum number of leading zeros in the binary
representation of each number in the set. If the maximum
number of leading zeros observed is n, an estimate for the
number of distinct elements in the set is 2^n
Ref: https://en.wikipedia.org/wiki/HyperLogLog
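The leading-zeros observation can be tried out with a few lines of plain Python. This sketch implements only the single-register estimate described above (2^n for the largest run of leading zeros); the 32-bit hash width, the use of SHA-256 as the hash, and the function names are assumptions made for illustration. The full HyperLogLog algorithm reduces the variance of this crude estimate by splitting the stream across many registers and combining them.

```python
import hashlib

HASH_BITS = 32  # width of the hash values used below (an illustrative choice)

def hash32(item):
    """Map an item to a roughly uniform 32-bit integer."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def leading_zeros(value, bits=HASH_BITS):
    """Number of leading zero bits in a fixed-width binary representation."""
    return bits - value.bit_length()

def estimate_distinct(items):
    """Single-register estimate: 2**n, where n is the largest leading-zero run seen."""
    max_zeros = 0
    for item in items:
        max_zeros = max(max_zeros, leading_zeros(hash32(item)))
    return 2 ** max_zeros

stream = [i % 1000 for i in range(50000)]   # 1000 distinct values, heavily repeated
print(estimate_distinct(stream))
```

The estimator keeps a single integer no matter how long the stream is, and repeated elements cannot change the result because equal items hash to the same value.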
• Approximate algorithms
▫ approxCountDistinct: returns an estimate of the number of distinct
elements
▫ approxQuantile: returns approximate percentiles of numerical data
Refer:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/4196864626084292/3601578643761083/latest.html
Demo from Databricks’s blog
• As per Databricks’s blog:
“Spark strives at implementing approximate algorithms that are
deterministic (they do not depend on random numbers to work) and that
have proven theoretical error bounds: for each algorithm, the user can
specify a target error bound, and the result is guaranteed to be within
this bound, either exactly (deterministic error bounds) or with very high
confidence (probabilistic error bounds)”
Spark’s implementation
Q&A
Slides, code and details about the Apache Spark Workshop
at: http://www.analyticscertificate.com/SparkWorkshop/
Thank you!
Members & Sponsors!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.