© Cloudera, Inc. All rights reserved.
Mirko Kämpf | 2015
Apache Spark:
Next Generation Data
Processing for Hadoop
Agenda
• The Data Science Process (DSP)
- Why or when to use Spark
• The role of Apache Hadoop and Apache Spark
- History & Hadoop Ecosystem
• Apache Spark: Overview and Concepts
• Practical Tips
The Data Science Process
Application of Big-Data-Technology
Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
Huge Data Sets in Science
Application of Big-Data-Technology
Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
“Spark offers tools for Data Science
and components for Data
Products.”
—How can Apache Spark fit into my world?
Should I use Apache Spark?
• If all my data fits into Excel spreadsheets?
• If I have a special-purpose application to work with?
• If my current system is just a bit too slow?
Why not?
• Just export as CSV / JSON and use a DataFrame to join with other data sets.
• Think about additional analysis methods! Maybe one is already built into Apache Spark.
• OK, Spark will probably not help to speed up your system, but maybe you can offload data to Hadoop, which releases some resources.
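To make the "export as CSV and join" idea concrete, here is a minimal sketch of an inner join between two exported tables. It is plain Python from the standard library, not the Spark DataFrame API, and the table contents, column names, and helper functions (read_rows, inner_join) are assumptions made up for illustration. A DataFrame join does the same thing conceptually, but distributed over many machines.

```python
import csv
import io

# Hypothetical CSV exports; in practice these would be files written by a spreadsheet tool.
MEASUREMENTS_CSV = """sensor_id,value
s1,20.5
s2,19.0
s3,22.1
"""

LOCATIONS_CSV = """sensor_id,site
s1,Berlin
s2,Halle
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def inner_join(left, right, key):
    """Inner join of two lists of dicts on a shared key column."""
    # Build a lookup from key value to all matching rows on the right side.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Emit one merged row per matching (left, right) pair.
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


joined_rows = inner_join(
    read_rows(MEASUREMENTS_CSV), read_rows(LOCATIONS_CSV), "sensor_id"
)
for row in joined_rows:
    print(row)
```

Rows without a partner in the other table (sensor s3 here) are dropped, which is the usual inner-join behavior.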
“Spark offers fast in-memory processing on
huge distributed and even on heterogeneous
datasets.”
—What type of data fits into Spark?
History of Spark
Spark is really young, but has a very
active community!
Timeline: Spark Adoption
Apache Spark:
Overview & Concepts
Hadoop Ecosystem incl. Apache Spark
Spark can be an entry point to your Big Data world …
“Apache Spark is distributed on top of Hadoop
and brings parallel processing
to powerful workstations.”
—Do I need a Hadoop cluster to work with Apache Spark?
Spark vs. MapReduce
How to interact with Spark?
Spark Components
MLlib:
• Basic statistics: summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
• Classification and regression: linear models (SVMs, logistic / linear regression), naive Bayes, decision trees, ensembles of trees (Random Forests / Gradient-Boosted Trees), isotonic regression
• Collaborative filtering: alternating least squares (ALS)
• Clustering: k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), streaming k-means
• Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
• …
GraphX:
• PageRank
• Connected Components
• Triangle Counting
• Pregel API
How to use your code in Spark?
A. Interactively, by loading it into the spark-shell.
B. Contribute to existing Spark projects.
C. Create your module and use it in a spark-shell session.
D. Build a data-product which uses Apache Spark.
For simple and reliable usage of Java classes
and complete third-party libraries, we define
a Spark Module as a self-contained artifact
created by Maven. This module can easily
be shared by multiple users via repositories.
http://blog.cloudera.com/blog/2015/03/how-to-build-re-usable-spark-programs-using-spark-shell-and-maven/
Apache Spark:
Overview & Concepts
Spark Context
RDDs and DataFrames
Creation of RDDs
Datatypes in RDDs
Spark in a Cluster
DStream: The heart of Spark Streaming
“Efficient hardware utilization, caching,
simple APIs, and access to a variety of data
in Hadoop is key to success.”
—What makes Spark so different, compared to core MapReduce?
Practical Tips
Development Techniques
• Build your tools and analysis procedures in small cycles.
• Test all phases of your work and document carefully.
• Document what you expect! => Requirements management …
• Collect what you get! => Operational logs …
• Reuse well-tested components and modularize your analysis scripts.
• Learn state-of-the-art tools and share your work!
Data Management
• Think about typical access patterns:
- random access to each record or field?
- access to entire groups of records?
- variable-size or fixed-size sets?
- full table scan?
• OPTIMIZE FOR YOUR DOMINANT ACCESS PATTERN!
• Select efficient storage formats: Avro, Parquet
• Index your data in Solr for random access and data exploration
- Indexing can be done with just a few clicks in Hue …
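The link between access pattern and storage format can be sketched in a few lines. The example below is plain Python with made-up records, not the Avro or Parquet libraries. It only mimics the two layouts: Avro stores whole records together (row-oriented), while Parquet stores each field together (column-oriented). The names row_store, column_store, full_record, and mean_value are assumptions for illustration.

```python
# Three hypothetical sensor records used only for illustration.
records = [
    {"sensor_id": "s1", "timestamp": 0, "value": 20.5},
    {"sensor_id": "s2", "timestamp": 0, "value": 19.0},
    {"sensor_id": "s1", "timestamp": 1, "value": 20.7},
]

# Row-oriented layout (the idea behind Avro): each record is stored whole,
# so reading one complete record touches a single entry.
row_store = [tuple(r.values()) for r in records]

# Column-oriented layout (the idea behind Parquet): each field is stored
# together, so a scan over one field never touches the other fields.
column_store = {
    field: [r[field] for r in records]
    for field in ("sensor_id", "timestamp", "value")
}


def full_record(index):
    """Access pattern 1: fetch every field of a single record."""
    return row_store[index]


def mean_value():
    """Access pattern 2: scan a single field across all records."""
    values = column_store["value"]
    return sum(values) / len(values)


print(full_record(2))
print(mean_value())
```

If most queries look like mean_value, a columnar format is likely the better fit. If most queries look like full_record, a row-oriented format or an index may serve better.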
Collecting Sensor Data with Spark Streaming …
• Spark Streaming works on fixed time slices only (in the current version, 1.5)
• Use the original time stamp?
- Requires additional storage and bandwidth
- The original system clock defines the resolution
• Use "Spark time" or a local time reference?
- You may lose information!
- Your resolution is limited by the batch interval.
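The trade-off above can be illustrated with a small sketch. This is plain Python, not the Spark Streaming API. The batch interval of 2 seconds, the sample events, and the helper names (batch_start, group_into_batches) are assumptions chosen for illustration. The sketch shows what happens when events are assigned to fixed time slices and the original time stamp is dropped.

```python
from collections import defaultdict

BATCH_INTERVAL = 2.0  # seconds; an assumed value chosen for illustration

# Hypothetical sensor events: (original time stamp in seconds, reading).
events = [(0.2, 20.5), (0.9, 20.6), (1.7, 20.4), (2.1, 21.0), (3.8, 21.3)]


def batch_start(timestamp, interval=BATCH_INTERVAL):
    """Return the start time of the fixed slice that contains timestamp."""
    return (timestamp // interval) * interval


def group_into_batches(events, interval=BATCH_INTERVAL):
    """Group readings by slice, discarding the original time stamps."""
    batches = defaultdict(list)
    for timestamp, reading in events:
        batches[batch_start(timestamp, interval)].append(reading)
    return dict(batches)


batches = group_into_batches(events)
for start in sorted(batches):
    print(start, batches[start])
```

After grouping, all readings in one slice share the same time reference, so their order and spacing within the slice can no longer be recovered. Keeping the original time stamp in each record avoids this, at the cost of the extra storage and bandwidth mentioned above.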
Thank you!
Enjoy Apache Spark and all your data …
Application of Microbiology- Industrial, agricultural, medicalApplication of Microbiology- Industrial, agricultural, medical
Application of Microbiology- Industrial, agricultural, medical
Anoja Kurian
 
2025 Insilicogen Company English Brochure
2025 Insilicogen Company English Brochure2025 Insilicogen Company English Brochure
2025 Insilicogen Company English Brochure
Insilico Gen
 
amino compounds.pptx class 12_Govinda Pathak
amino compounds.pptx class 12_Govinda Pathakamino compounds.pptx class 12_Govinda Pathak
amino compounds.pptx class 12_Govinda Pathak
GovindaPathak6
 
On the Lunar Origin of Near-Earth Asteroid 2024 PT5
On the Lunar Origin of Near-Earth Asteroid 2024 PT5On the Lunar Origin of Near-Earth Asteroid 2024 PT5
On the Lunar Origin of Near-Earth Asteroid 2024 PT5
Sérgio Sacani
 
Preparation of Permanent mounts of Parasitic Protozoans.pptx
Preparation of Permanent mounts of Parasitic Protozoans.pptxPreparation of Permanent mounts of Parasitic Protozoans.pptx
Preparation of Permanent mounts of Parasitic Protozoans.pptx
Dr Showkat Ahmad Wani
 
Class-11-notes- Inorganic Chemistry Hydrogen, Oxygen,Ozone,Carbon,Phosphoros
Class-11-notes- Inorganic Chemistry Hydrogen, Oxygen,Ozone,Carbon,PhosphorosClass-11-notes- Inorganic Chemistry Hydrogen, Oxygen,Ozone,Carbon,Phosphoros
Class-11-notes- Inorganic Chemistry Hydrogen, Oxygen,Ozone,Carbon,Phosphoros
govindapathak8
 
Skin function_protective_absorptive_Presentatation.pptx
Skin function_protective_absorptive_Presentatation.pptxSkin function_protective_absorptive_Presentatation.pptx
Skin function_protective_absorptive_Presentatation.pptx
muralinath2
 
Causes of mortalities of eggs and spawn and remedies.pptx
Causes of mortalities of eggs and spawn and remedies.pptxCauses of mortalities of eggs and spawn and remedies.pptx
Causes of mortalities of eggs and spawn and remedies.pptx
anshumanmohanty9090
 
SuperconductingMagneticEnergyStorage.pptx
SuperconductingMagneticEnergyStorage.pptxSuperconductingMagneticEnergyStorage.pptx
SuperconductingMagneticEnergyStorage.pptx
BurkanAlpKale
 
when is CT scan need in breast cancer patient.pptx
when is CT scan need in breast cancer patient.pptxwhen is CT scan need in breast cancer patient.pptx
when is CT scan need in breast cancer patient.pptx
Rukhnuddin Al-daudar
 
Zoonosis, Types, Causes. A comprehensive pptx
Zoonosis, Types, Causes. A comprehensive pptxZoonosis, Types, Causes. A comprehensive pptx
Zoonosis, Types, Causes. A comprehensive pptx
Dr Showkat Ahmad Wani
 
whole ANATOMY OF EYE with eye ball .pptx
whole ANATOMY OF EYE with eye ball .pptxwhole ANATOMY OF EYE with eye ball .pptx
whole ANATOMY OF EYE with eye ball .pptx
simranjangra13
 
Chapter 4_Part 2_Infection and Immunity.ppt
Chapter 4_Part 2_Infection and Immunity.pptChapter 4_Part 2_Infection and Immunity.ppt
Chapter 4_Part 2_Infection and Immunity.ppt
JessaBalanggoyPagula
 
APES 6.5 Presentation Fossil Fuels .pdf
APES 6.5 Presentation Fossil Fuels   .pdfAPES 6.5 Presentation Fossil Fuels   .pdf
APES 6.5 Presentation Fossil Fuels .pdf
patelereftu
 
06-Molecular basis of transformation.pptx
06-Molecular basis of transformation.pptx06-Molecular basis of transformation.pptx
06-Molecular basis of transformation.pptx
LanaQadumii
 
Gender Bias and Empathy in Robots: Insights into Robotic Service Failures
Gender Bias and Empathy in Robots:  Insights into Robotic Service FailuresGender Bias and Empathy in Robots:  Insights into Robotic Service Failures
Gender Bias and Empathy in Robots: Insights into Robotic Service Failures
Selcen Ozturkcan
 
Metallurgical process class 11_Govinda Pathak
Metallurgical process class 11_Govinda PathakMetallurgical process class 11_Govinda Pathak
Metallurgical process class 11_Govinda Pathak
GovindaPathak6
 
Botany-Finals-Patterns-of-Inheritance-DNA-Synthesis.pdf
Botany-Finals-Patterns-of-Inheritance-DNA-Synthesis.pdfBotany-Finals-Patterns-of-Inheritance-DNA-Synthesis.pdf
Botany-Finals-Patterns-of-Inheritance-DNA-Synthesis.pdf
JseleBurgos
 
VERMICOMPOSTING A STEP TOWARDS SUSTAINABILITY.pptx
VERMICOMPOSTING A STEP TOWARDS SUSTAINABILITY.pptxVERMICOMPOSTING A STEP TOWARDS SUSTAINABILITY.pptx
VERMICOMPOSTING A STEP TOWARDS SUSTAINABILITY.pptx
hipachi8
 
Application of Microbiology- Industrial, agricultural, medical
Application of Microbiology- Industrial, agricultural, medicalApplication of Microbiology- Industrial, agricultural, medical
Application of Microbiology- Industrial, agricultural, medical
Anoja Kurian
 

Apache Spark in Scientific Applications

  • 1. © Cloudera, Inc. All rights reserved. Mirko Kämpf | 2015. Apache Spark: Next Generation Data Processing for Hadoop
  • 2. Agenda
      • The Data Science Process (DSP)
          - Why or when to use Spark
      • The role of Apache Hadoop and Apache Spark
          - History & Hadoop Ecosystem
      • Apache Spark: Overview and Concepts
      • Practical Tips
  • 3. The Data Science Process. Application of Big-Data-Technology. Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
  • 4. Huge Data Sets in Science. Application of Big-Data-Technology. Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
  • 5. “Spark offers tools for Data Science and components for Data Products.” - How can Apache Spark fit into my world?
  • 9. Should I use Apache Spark?
      • If all my data fits into Excel spreadsheets?
      • If I have a special-purpose application to work with?
      • If my current system is just a bit too slow?
    Why not?
      • Just export as CSV / JSON and use a DataFrame to join with other DS.
      • Think about additional analysis methods! Maybe they are already built into Spark.
      • OK, Spark will probably not help to speed up your system, but maybe you can offload data to Hadoop, which releases some resources.
  • 10. “Spark offers fast in-memory processing on huge, distributed, and even heterogeneous datasets.” - What type of data fits into Spark?
  • 11. History of Spark: Spark is really young, but has a very active community!
  • 12. Timeline: Spark Adoption
  • 13. Apache Spark: Overview & Concepts
  • 14. Hadoop Ecosystem incl. Apache Spark: Spark can be an entry point to your Big Data world …
  • 15. “Apache Spark is distributed on top of Hadoop and brings parallel processing to powerful workstations.” - Do I need a Hadoop cluster to work with Apache Spark?
  • 16. Spark vs. MapReduce
  • 17. How to interact with Spark?
  • 18. Spark Components
  • 20. MLlib:
      • Basic statistics: summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
      • Classification and regression: linear models (SVMs, logistic / linear regression), naive Bayes, decision trees, ensembles of trees (Random Forests / Gradient-Boosted Trees), isotonic regression
      • Collaborative filtering: alternating least squares (ALS)
      • Clustering: k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), streaming k-means
      • Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
      • …
    GraphX:
      • PageRank
      • Connected Components
      • Triangle Counting
      • Pregel API
  • 21. How to use your code in Spark?
      A. Interactively, by loading it into the spark-shell.
      B. Contribute to existing Spark projects.
      C. Create your module and use it in a spark-shell session.
      D. Build a data product which uses Apache Spark.
    For simple and reliable usage of Java classes and complete third-party libraries, we define a Spark Module as a self-contained artifact created by Maven. This module can easily be shared by multiple users via repositories.
    http://blog.cloudera.com/blog/2015/03/how-to-build-re-usable-spark-programs-using-spark-shell-and-maven/
  • 22. Apache Spark: Overview & Concepts
  • 23. Spark Context
  • 24. RDDs and DataFrames
  • 25. Creation of RDDs
  • 26. Datatypes in RDDs
  • 29. Spark in a Cluster
  • 30. Spark in a Cluster
  • 33. DStream: The heart of Spark Streaming
  • 34. “Efficient hardware utilization, caching, simple APIs, and access to a variety of data in Hadoop are key to success.” - What makes Spark so different compared to core MapReduce?
  • 35. Practical Tips
  • 36. Development Techniques
      • Build your tools and analysis procedures in small cycles.
      • Test all phases of your work and document carefully.
      • Document what you expect! => Requirements management …
      • Collect what you get! => Operational logs …
      • Reuse well-tested components and modularize your analysis scripts.
      • Learn “state of the art” tools and share your work!
  • 37. Data Management
      • Think about typical access patterns:
          - random access to each record or field?
          - access to entire groups of records?
          - variable-size or fixed-size sets?
          - “full table scan”
      • OPTIMIZE FOR YOUR DOMINANT ACCESS PATTERN!
      • Select efficient storage formats: Avro, Parquet
      • Index your data in Solr for random access and data exploration
      • Indexing can be done with just a few clicks in Hue …
  • 38. Collecting Sensor Data with Spark Streaming …
      • Spark Streaming works on fixed time slices only (in the current version, 1.5).
      • Use the original time stamp?
          - Requires additional storage and bandwidth
          - Original system clock defines resolution
      • Use “Spark-Time” or a local time reference:
          - You may lose information!
          - You have a limited resolution, defined by the batch size.
  • 39. Thank you! Enjoy Apache Spark and all your data …
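
The advice on slide 37 to optimize for the dominant access pattern can be made concrete with a small sketch. It is plain Python and does not use Avro or Parquet; it only mimics the two layout families with hypothetical records. A row-oriented layout (the style of Avro) keeps each record together, which suits access to whole records, while a column-oriented layout (the style of Parquet) keeps each field together, which suits scans over a single field. All names and values below are assumptions made for illustration.

```python
# Hypothetical measurement records used only for illustration.
records = [
    {"sensor": "a", "time": 1, "value": 20.0},
    {"sensor": "b", "time": 1, "value": 19.0},
    {"sensor": "a", "time": 2, "value": 21.0},
]
FIELDS = ("sensor", "time", "value")

# Row-oriented layout: one entry per record, all fields stored together.
row_store = [tuple(r[f] for f in FIELDS) for r in records]

# Column-oriented layout: one list per field, all records stored together.
column_store = {f: [r[f] for r in records] for f in FIELDS}


def fetch_record(index):
    """Read one complete record: a single lookup in the row store."""
    return row_store[index]


def fetch_record_from_columns(index):
    """Rebuild the same record from the column store: touches every column."""
    return tuple(column_store[f][index] for f in FIELDS)


def mean_value():
    """Scan one field of all records: reads a single list in the column store."""
    values = column_store["value"]
    return sum(values) / len(values)


print(fetch_record(1))
print(mean_value())
```

If most queries look like `mean_value()`, a columnar format is the more natural choice; if most look like `fetch_record()`, a row-oriented format or an index fits better.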
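
The trade-off on slide 38 between the original time stamp and “Spark-Time” can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark Streaming code: it assumes hypothetical sensor readings as `(timestamp_seconds, value)` pairs and a hypothetical `batch_interval` of 2 seconds, and groups the readings into fixed time slices in the way a micro-batch system would. If only the batch time is kept, all readings in a slice collapse onto one time value, so the resolution is bounded by the batch interval.

```python
from collections import defaultdict


def slice_into_batches(readings, batch_interval):
    """Group (timestamp, value) readings into fixed time slices.

    The key of each batch is the start time of its slice; in this sketch it
    stands in for the "Spark-Time" mentioned on the slide (an assumption).
    """
    batches = defaultdict(list)
    for timestamp, value in readings:
        # Floor division maps every timestamp to the start of its slice.
        batch_start = int(timestamp // batch_interval) * batch_interval
        batches[batch_start].append((timestamp, value))
    return dict(batches)


def with_batch_time_only(batches):
    """Drop the original time stamps and keep only the batch time."""
    return [
        (batch_start, value)
        for batch_start, items in sorted(batches.items())
        for _, value in items
    ]


# Hypothetical sensor readings: (seconds since start, measured value).
readings = [(0.2, 20.1), (0.9, 20.4), (1.7, 20.3), (2.1, 21.0), (3.8, 21.2)]

batches = slice_into_batches(readings, batch_interval=2)
coarse = with_batch_time_only(batches)

# Count how many distinct points in time survive in each representation.
distinct_original = len({t for t, _ in readings})
distinct_batch = len({t for t, _ in coarse})
print(distinct_original, distinct_batch)
```

Keeping the original time stamp inside each record, as `slice_into_batches` does, preserves the sensor's own resolution at the cost of one extra field per reading, which is the storage and bandwidth overhead the slide mentions.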