SlideShare a Scribd company logo
®
© 2016 MapR Technologies 1®
© 2016 MapR Technologies 1MapR Confidential
© 2016 MapR Technologies
Spark Application for
Time Series Predictive
Analysis
Dong Meng, Data Scientist
®
© 2016 MapR Technologies 2®
© 2016 MapR Technologies 2MapR Confidential
•  Originally developed in 2009 in
UC Berkeley’s AMP Lab
•  Fully open-sourced in 2010 –
now a Top Level Project at the
Apache Software Foundation
•  The most Active Open Source
Project in Big Data
spark.apache.org
github.com/apache/spark
user@spark.apache.org
What is Apache Spark?
®
© 2016 MapR Technologies 3®
© 2016 MapR Technologies 3MapR Confidential
Resilient Distributed Datasets (RDD)
RDD =
read-only,
partitioned
collection of
records
Create new
RDD’s via
transformations
Operate &
analyze RDD’s
via actions
Users have option
to persist data in-
memory & control
partitioning
{map, filter, union,
flatMap, distinct,
groupByKey, cogroup,
reduceByKey, … }
{reduce, collect, count,
first, take, foreach,
countByKey, takeSample,
saveAsTextfile, … }
Fault-tolerance is
derived from the
ability to recreate
the lineage of the
RDD in case of a
partition loss
®
© 2016 MapR Technologies 4®
© 2016 MapR Technologies 4MapR Confidential
Resilient Distributed Datasets (RDD)
•  How to generate RDD
•  How to compute RDD
•  What is relationship between RDDs
–  Narrow Dependency (each partition of the parent RDD is used by at most
one partition of the child RDD. [map, filter, union])
–  Wide/Shuffle Dependency (multiple partitions of child RDD may depend
on it [groupByKey])
•  In this scene, mapreduce ~ map+reduceByKey
®
© 2016 MapR Technologies 5®
© 2016 MapR Technologies 5MapR Confidential
Resilient Distributed Datasets (RDD)
®
© 2016 MapR Technologies 6®
© 2016 MapR Technologies 6MapR Confidential
•  A distributed collection of rows organized into named columns
•  Select, filter, aggregate, join and plotting the data with structure
(python pandas, R dataframe datatable)
•  Based on schemaRDD (spark<1.3)
•  Much less code to write, plus the schema
RDD: data = sc.textFile(…).split(“,”)
data.map(lambda x : (x[0], [x[1], 1] ))
.reduceByKey(lambda x, y: [x[0]+y[0], x[1]+y[1]])
.map(lambda x: x[0], x[1][0]/x[1][1])
.collect()
Spark DataFrame
DataFrame:df.groupBy(“name”)
.agg(“name”, avg(“age”))
.collect()
®
© 2016 MapR Technologies 7®
© 2016 MapR Technologies 7MapR Confidential
•  Add Schema to RDD
•  Write SQL query against RDD
•  Interact with Hive
•  Flexibility in UDFs
•  Support Parquet, Json, CSV, etc
SparkSQL
®
© 2016 MapR Technologies 8®
© 2016 MapR Technologies 8MapR Confidential
•  Active community
•  Pipeline style model build and tunning (scikit-learn)
•  Core spark
•  Distributed computing
Spark Mllib
®
© 2016 MapR Technologies 9®
© 2016 MapR Technologies 9MapR Confidential
Spark Execution
•  Job -> Stage -> Task
•  Flow: Directed Acyclic Graph(DAG)
•  Pipeline(lazy load)
•  Standalone(not necessarily means single machine)/Yarn(yarn-
client, yarn-cluster)
®
© 2016 MapR Technologies 10®
© 2016 MapR Technologies 10MapR Confidential
Spark Execution
®
© 2016 MapR Technologies 11®
© 2016 MapR Technologies 11MapR Confidential © 2016 MapR Technologies© 2016 MapR Technologies
Now let’s build the spark application
®
© 2016 MapR Technologies 12®
© 2016 MapR Technologies 12MapR Confidential
Find my presentation and other related resources here:
http://events.mapr.com/DataScienceMD
Today’s Presentation
Whiteboard & demo
videos
Free On-Demand Training
Free Cheat Sheet
Free Articles And more…
Ad

More Related Content

What's hot (20)

Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
Rohit Agrawal
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop Ecosystem
Rajkumar Singh
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
Harikrishnan K
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
InSemble
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
Manish Borkar
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
Sandip Darwade
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
Shivaji Dutta
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About Time
MapR Technologies
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Debraj GuhaThakurta
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Stanley Wang
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
Geoff Hendrey
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Ran Silberman
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
Rahul Agarwal
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPT
Anand Pandey
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
datastack
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil Ambagade
Sigmoid
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
sudhakara st
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Vadim Y. Bichutskiy
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
Cloudera, Inc.
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
Rohit Agrawal
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop Ecosystem
Rajkumar Singh
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
Harikrishnan K
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
InSemble
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
Manish Borkar
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
Shivaji Dutta
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About Time
MapR Technologies
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Debraj GuhaThakurta
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
Geoff Hendrey
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPT
Anand Pandey
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
datastack
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil Ambagade
Sigmoid
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
sudhakara st
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
Cloudera, Inc.
 

Viewers also liked (20)

Big Data Paris
Big Data ParisBig Data Paris
Big Data Paris
MapR Technologies
 
Analyzing Log Data With Apache Spark
Analyzing Log Data With Apache SparkAnalyzing Log Data With Apache Spark
Analyzing Log Data With Apache Spark
Spark Summit
 
Oracle rac cachefusion - High Availability Day 2015
Oracle rac cachefusion - High Availability Day 2015Oracle rac cachefusion - High Availability Day 2015
Oracle rac cachefusion - High Availability Day 2015
aioughydchapter
 
MongoDB
MongoDBMongoDB
MongoDB
Steven Francia
 
Cloud conference - mongodb
Cloud conference - mongodbCloud conference - mongodb
Cloud conference - mongodb
Mitch Pirtle
 
Aman sharma hyd_12crac High Availability Day 2015
Aman sharma hyd_12crac High Availability Day 2015Aman sharma hyd_12crac High Availability Day 2015
Aman sharma hyd_12crac High Availability Day 2015
aioughydchapter
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation Techn
Ted Dunning
 
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASMRACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
Maaz Anjum
 
Flex Your Database on 12c's Flex ASM and Flex Cluster
Flex Your Database on 12c's Flex ASM and Flex ClusterFlex Your Database on 12c's Flex ASM and Flex Cluster
Flex Your Database on 12c's Flex ASM and Flex Cluster
Maaz Anjum
 
Running Analytics at the Speed of Your Business
Running Analytics at the Speed of Your BusinessRunning Analytics at the Speed of Your Business
Running Analytics at the Speed of Your Business
Redis Labs
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Oracle 12.2 sharding learning more
Oracle 12.2 sharding learning moreOracle 12.2 sharding learning more
Oracle 12.2 sharding learning more
Leyi (Kamus) Zhang
 
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernándezSpark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
J On The Beach
 
Policy based cluster management in oracle 12c
Policy based cluster management in oracle 12c Policy based cluster management in oracle 12c
Policy based cluster management in oracle 12c
Anju Garg
 
La transformacion digital en nuestra vida cotidiana. Un vistazo a las APIs
La transformacion digital en nuestra vida cotidiana. Un vistazo a las APIsLa transformacion digital en nuestra vida cotidiana. Un vistazo a las APIs
La transformacion digital en nuestra vida cotidiana. Un vistazo a las APIs
Guatemala User Group
 
Practical Machine Learning: Innovations in Recommendation Workshop
Practical Machine Learning:  Innovations in Recommendation WorkshopPractical Machine Learning:  Innovations in Recommendation Workshop
Practical Machine Learning: Innovations in Recommendation Workshop
MapR Technologies
 
Leverage integration cloud_service_for_ebs_
Leverage integration cloud_service_for_ebs_Leverage integration cloud_service_for_ebs_
Leverage integration cloud_service_for_ebs_
aioughydchapter
 
Serverless Architecture
Serverless ArchitectureServerless Architecture
Serverless Architecture
Michał Kurzeja
 
Oracle sharding : Installation & Configuration
Oracle sharding : Installation & ConfigurationOracle sharding : Installation & Configuration
Oracle sharding : Installation & Configuration
suresh gandhi
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Ted Dunning
 
Analyzing Log Data With Apache Spark
Analyzing Log Data With Apache SparkAnalyzing Log Data With Apache Spark
Analyzing Log Data With Apache Spark
Spark Summit
 
Oracle rac cachefusion - High Availability Day 2015
Oracle rac cachefusion - High Availability Day 2015Oracle rac cachefusion - High Availability Day 2015
Oracle rac cachefusion - High Availability Day 2015
aioughydchapter
 
Cloud conference - mongodb
Cloud conference - mongodbCloud conference - mongodb
Cloud conference - mongodb
Mitch Pirtle
 
Aman sharma hyd_12crac High Availability Day 2015
Aman sharma hyd_12crac High Availability Day 2015Aman sharma hyd_12crac High Availability Day 2015
Aman sharma hyd_12crac High Availability Day 2015
aioughydchapter
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation Techn
Ted Dunning
 
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASMRACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
Maaz Anjum
 
Flex Your Database on 12c's Flex ASM and Flex Cluster
Flex Your Database on 12c's Flex ASM and Flex ClusterFlex Your Database on 12c's Flex ASM and Flex Cluster
Flex Your Database on 12c's Flex ASM and Flex Cluster
Maaz Anjum
 
Running Analytics at the Speed of Your Business
Running Analytics at the Speed of Your BusinessRunning Analytics at the Speed of Your Business
Running Analytics at the Speed of Your Business
Redis Labs
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Oracle 12.2 sharding learning more
Oracle 12.2 sharding learning moreOracle 12.2 sharding learning more
Oracle 12.2 sharding learning more
Leyi (Kamus) Zhang
 
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernándezSpark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
J On The Beach
 
Policy based cluster management in oracle 12c
Policy based cluster management in oracle 12c Policy based cluster management in oracle 12c
Policy based cluster management in oracle 12c
Anju Garg
 
La transformacion digital en nuestra vida cotidiana. Un vistazo a las APIs
La transformacion digital en nuestra vida cotidiana. Un vistazo a las APIsLa transformacion digital en nuestra vida cotidiana. Un vistazo a las APIs
La transformacion digital en nuestra vida cotidiana. Un vistazo a las APIs
Guatemala User Group
 
Practical Machine Learning: Innovations in Recommendation Workshop
Practical Machine Learning:  Innovations in Recommendation WorkshopPractical Machine Learning:  Innovations in Recommendation Workshop
Practical Machine Learning: Innovations in Recommendation Workshop
MapR Technologies
 
Leverage integration cloud_service_for_ebs_
Leverage integration cloud_service_for_ebs_Leverage integration cloud_service_for_ebs_
Leverage integration cloud_service_for_ebs_
aioughydchapter
 
Oracle sharding : Installation & Configuration
Oracle sharding : Installation & ConfigurationOracle sharding : Installation & Configuration
Oracle sharding : Installation & Configuration
suresh gandhi
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Ted Dunning
 
Ad

Similar to Spark Application for Time Series Analysis (20)

Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
MapR Technologies
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
MapR Technologies
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
Vince Gonzalez
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
Vincent Poncet
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Tugdual Grall
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
Elvis Saravia
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
Hektor Jacynycz García
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
Anyscale
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
Databricks
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
cdmaxime
 
Cassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in CassandraCassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in Cassandra
Anant Corporation
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
cdmaxime
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
Richard Kuo
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
IT Event
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
cdmaxime
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
MapR Technologies
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
Vince Gonzalez
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
Vincent Poncet
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Tugdual Grall
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
Elvis Saravia
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
Hektor Jacynycz García
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
Anyscale
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
Databricks
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
cdmaxime
 
Cassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in CassandraCassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in Cassandra
Anant Corporation
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
cdmaxime
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
Richard Kuo
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
IT Event
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
cdmaxime
 
Ad

More from MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
MapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
MapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
MapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
MapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
MapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
MapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
MapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
MapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
MapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
MapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
MapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
MapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
MapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
MapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
MapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
MapR Technologies
 
Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
MapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
MapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
MapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
MapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
MapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
MapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
MapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
MapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
MapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
MapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
MapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
MapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
MapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
MapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
MapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
MapR Technologies
 

Recently uploaded (20)

AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 

Spark Application for Time Series Analysis

  • 1. ® © 2016 MapR Technologies 1® © 2016 MapR Technologies 1MapR Confidential © 2016 MapR Technologies Spark Application for Time Series Predictive Analysis Dong Meng, Data Scientist
  • 2. ® © 2016 MapR Technologies 2® © 2016 MapR Technologies 2MapR Confidential •  Originally developed in 2009 in UC Berkeley’s AMP Lab •  Fully open-sourced in 2010 – now a Top Level Project at the Apache Software Foundation •  The most Active Open Source Project in Big Data spark.apache.org github.com/apache/spark [email protected] What is Apache Spark?
  • 3. ® © 2016 MapR Technologies 3® © 2016 MapR Technologies 3MapR Confidential Resilient Distributed Datasets (RDD) RDD = read-only, partitioned collection of records Create new RDD’s via transformations Operate & analyze RDD’s via actions Users have option to persist data in- memory & control partitioning {map, filter, union, flatMap, distinct, groupByKey, cogroup, reduceByKey, … } {reduce, collect, count, first, take, foreach, countByKey, takeSample, saveAsTextfile, … } Fault-tolerance is derived from the ability to recreate the lineage of the RDD in case of a partition loss
  • 4. ® © 2016 MapR Technologies 4® © 2016 MapR Technologies 4MapR Confidential Resilient Distributed Datasets (RDD) •  How to generate RDD •  How to compute RDD •  What is relationship between RDDs –  Narrow Dependency (each partition of the parent RDD is used by at most one partition of the child RDD. [map, filter, union]) –  Wide/Shuffle Dependency (multiple partitions of child RDD may depend on it [groupByKey]) •  In this scene, mapreduce ~ map+reduceByKey
  • 5. ® © 2016 MapR Technologies 5® © 2016 MapR Technologies 5MapR Confidential Resilient Distributed Datasets (RDD)
  • 6. ® © 2016 MapR Technologies 6® © 2016 MapR Technologies 6MapR Confidential •  A distributed collection of rows organized into named columns •  Select, filter, aggregate, join and plotting the data with structure (python pandas, R dataframe datatable) •  Based on schemaRDD (spark<1.3) •  Much less code to write, plus the schema RDD: data = sc.textFile(…).split(“,”) data.map(lambda x : (x[0], [x[1], 1] )) .reduceByKey(lambda x, y: [x[0]+y[0], x[1]+y[1]]) .map(lambda x: x[0], x[1][0]/x[1][1]) .collect() Spark DataFrame DataFrame:df.groupBy(“name”) .agg(“name”, avg(“age”)) .collect()
  • 7. ® © 2016 MapR Technologies 7® © 2016 MapR Technologies 7MapR Confidential •  Add Schema to RDD •  Write SQL query against RDD •  Interact with Hive •  Flexibility in UDFs •  Support Parquet, Json, CSV, etc SparkSQL
  • 8. ® © 2016 MapR Technologies 8® © 2016 MapR Technologies 8MapR Confidential •  Active community •  Pipeline style model build and tunning (scikit-learn) •  Core spark •  Distributed computing Spark Mllib
  • 9. ® © 2016 MapR Technologies 9® © 2016 MapR Technologies 9MapR Confidential Spark Execution •  Job -> Stage -> Task •  Flow: Directed Acyclic Graph(DAG) •  Pipeline(lazy load) •  Standalone(not necessarily means single machine)/Yarn(yarn- client, yarn-cluster)
  • 10. ® © 2016 MapR Technologies 10® © 2016 MapR Technologies 10MapR Confidential Spark Execution
  • 11. ® © 2016 MapR Technologies 11® © 2016 MapR Technologies 11MapR Confidential © 2016 MapR Technologies© 2016 MapR Technologies Now let’s build the spark application
  • 12. ® © 2016 MapR Technologies 12® © 2016 MapR Technologies 12MapR Confidential Find my presentation and other related resources here: http://events.mapr.com/DataScienceMD Today’s Presentation Whiteboard & demo videos Free On-Demand Training Free Cheat Sheet Free Articles And more…