Apache Spark II (SparkSQL)
Contents
1. Introduction to Spark
2. Spark modules
3. SparkSQL
4. Workshop
1. Introduction
What is Apache Spark?
● Extends MapReduce
● Cluster computing platform
● Runs in memory
Why Spark
● Fast
❏ Up to 10x faster than MapReduce on disk
❏ Up to 100x faster in memory
● Ease of development
❏ Easy code
❏ Interactive shell
● Unified stack
❏ Batch
❏ Streaming
● Multi-language support
❏ Scala, Python, Java, R
● Deployment flexibility
❏ Deployment: Mesos, YARN, standalone, local
❏ Storage: HDFS, S3, local FS
Rise of the data center
● Huge amounts of data spread out across many commodity servers
● Lots of data → scale out
Data Processing Requirements
● Network bottleneck → Distributed Computing
● Hardware failure → Fault Tolerance
● Abstraction to organize parallelizable tasks → MapReduce
MapReduce
Word count example: Input → Split → Map → [Combine] → Shuffle & Sort → Reduce → Output
● Input
AA BB AA
AA CC DD
AA EE DD
BB FF AA
● Split (one split per line)
AA BB AA
AA CC DD
AA EE DD
BB FF AA
● Map
(AA, 1) (BB, 1) (AA, 1)
(AA, 1) (CC, 1) (DD, 1)
(AA, 1) (EE, 1) (DD, 1)
(BB, 1) (FF, 1) (AA, 1)
● [Combine]
(AA, 2) (BB, 1)
(AA, 1) (CC, 1) (DD, 1)
(AA, 1) (EE, 1) (DD, 1)
(BB, 1) (FF, 1) (AA, 1)
● Shuffle & Sort
(AA, 2) (AA, 1) (AA, 1) (AA, 1)
(BB, 1) (BB, 1)
(CC, 1)
(DD, 1) (DD, 1)
(EE, 1)
(FF, 1)
● Reduce
(AA, 5) (BB, 2) (CC, 1) (DD, 2) (EE, 1) (FF, 1)
● Output
AA, 5
BB, 2
CC, 1
DD, 2
EE, 1
FF, 1
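The word count stages can be sketched in plain Python to make the data flow concrete. This is only an illustration of the map / combine / shuffle / reduce idea on a single machine, not Hadoop or Spark code; the function names (`map_phase`, `combine_phase`, `shuffle_and_sort`, `reduce_phase`) are made up for this sketch.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, int]


def map_phase(split: str) -> List[Pair]:
    """Emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]


def combine_phase(pairs: List[Pair]) -> List[Pair]:
    """Optional local pre-aggregation of a single mapper's output."""
    local: Counter = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def shuffle_and_sort(mapper_outputs: List[List[Pair]]) -> Dict[str, List[int]]:
    """Group all counts by key and order the keys."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapper_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the counts collected for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Same four input lines as in the example above.
splits = ["AA BB AA", "AA CC DD", "AA EE DD", "BB FF AA"]
mapped = [map_phase(s) for s in splits]
combined = [combine_phase(p) for p in mapped]
result = reduce_phase(shuffle_and_sort(combined))
print(result)  # {'AA': 5, 'BB': 2, 'CC': 1, 'DD': 2, 'EE': 1, 'FF': 1}
```

In a real cluster each split would be mapped on a different node, and the shuffle would move data over the network, which is why the optional combine step helps: it reduces how many pairs have to travel.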
Spark Components
[Diagram: the Driver Program (holding the SparkContext) talks to the Cluster Manager and to the Worker Nodes; each Worker Node runs an Executor, and each Executor runs Tasks.]
Spark Components
SparkContext
● Main entry point for Spark functionality
● Represents the connection to a Spark cluster
● Tells Spark how & where to access a cluster
● Can be used to create RDDs, accumulators and
broadcast variables on that cluster
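As a rough illustration of these points, the following PySpark sketch creates a SparkContext and uses it to build an RDD, a broadcast variable and an accumulator. It assumes PySpark is installed and runs Spark locally; the master URL `local[2]`, the application name and the function `run_sketch` are choices made for this example, not something taken from the slides.

```python
# Assumed settings for a local test run; not taken from the slides.
MASTER = "local[2]"  # run Spark in-process with two worker threads
APP_NAME = "sparkcontext-sketch"


def run_sketch():
    """Create a SparkContext, then an RDD, a broadcast variable and an accumulator."""
    # Imported lazily so this file can be loaded even without PySpark installed.
    from pyspark import SparkConf, SparkContext

    # The configuration tells Spark how and where to access a cluster.
    conf = SparkConf().setMaster(MASTER).setAppName(APP_NAME)
    sc = SparkContext(conf=conf)
    try:
        numbers = sc.parallelize(range(1, 11), numSlices=3)  # RDD in 3 partitions
        bonus = sc.broadcast(100)  # read-only value shipped to the executors
        seen = sc.accumulator(0)   # counter that tasks can only add to

        def add_bonus(x):
            seen.add(1)
            return x + bonus.value

        # sum() is an action: it triggers the job and returns a value to the driver.
        total = numbers.map(add_bonus).sum()
        return total, seen.value
    finally:
        sc.stop()
```

Calling `run_sketch()` starts and stops a local Spark instance. On a real cluster only the master URL would change (for example to a YARN or standalone master); the rest of the code would stay the same.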
Driver program
● The process that runs the application's main() function and creates the SparkContext
● Lets you configure any Spark process with specific parameters
● Spark actions are triggered from the driver, which receives their results
● Example: spark-shell
● Application → driver program + executors
Spark Components
Cluster Manager
● External service for acquiring resources on the cluster
● Variety of cluster managers
○ Local
○ Standalone
○ YARN
○ Mesos
● Deploy mode:
○ Cluster → the framework launches the driver inside the cluster
○ Client → the submitter launches the driver outside the cluster
Spark Components
Worker Node
● Any node that can run application code in the cluster
● Key Terms
○ Executor: A process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
○ Task: A unit of work that is sent to one executor
○ Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect)
○ Stage: A smaller set of tasks within a job; each job is divided into stages that depend on each other
RDD
Resilient Distributed Datasets
● Collection of objects that is distributed across nodes in a cluster
● Data operations are performed on RDDs
● Once created, RDDs are immutable
● RDDs can be persisted in memory or on disk
● Fault tolerant
Example: numbers = RDD[1,2,3,4,5,6,7,8,9,10], partitioned across three worker nodes
Worker Node → Executor → [1,5,6,9]
Worker Node → Executor → [2,7,8]
Worker Node → Executor → [3,4,10]
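The properties listed for RDDs (partitioned, immutable, fault tolerant) can be mimicked with a small toy class. This is a plain-Python sketch and not the Spark API: `ToyRDD` and `recompute_partition` are hypothetical names, and unlike Spark, which evaluates transformations lazily, this toy `map` runs eagerly.

```python
from typing import Callable, List, Optional, Sequence, Tuple


class ToyRDD:
    """Toy stand-in for an RDD: immutable partitions plus the lineage to rebuild them."""

    def __init__(
        self,
        partitions: Sequence[Sequence[int]],
        parent: Optional["ToyRDD"] = None,
        fn: Optional[Callable[[int], int]] = None,
    ) -> None:
        # Tuples make the stored data immutable once the object is created.
        self._partitions: Tuple[Tuple[int, ...], ...] = tuple(tuple(p) for p in partitions)
        self._parent = parent  # lineage: where this data came from
        self._fn = fn          # lineage: how it was derived

    @property
    def partitions(self) -> Tuple[Tuple[int, ...], ...]:
        return self._partitions

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        """Return a NEW ToyRDD; the original is left untouched."""
        new_parts = [[fn(x) for x in part] for part in self._partitions]
        return ToyRDD(new_parts, parent=self, fn=fn)

    def recompute_partition(self, index: int) -> Tuple[int, ...]:
        """Rebuild one lost partition from the parent, as lineage-based recovery would."""
        if self._parent is None or self._fn is None:
            raise ValueError("source data has no lineage to recompute from")
        return tuple(self._fn(x) for x in self._parent.partitions[index])

    def collect(self) -> List[int]:
        """Gather every partition into a single list."""
        return [x for part in self._partitions for x in part]


# Same partitioning as in the example above.
numbers = ToyRDD([[1, 5, 6, 9], [2, 7, 8], [3, 4, 10]])
squares = numbers.map(lambda x: x * x)
print(sorted(squares.collect()))       # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
print(squares.recompute_partition(1))  # (4, 49, 64)
```

The point of keeping the parent and the function is that a lost partition does not need a replicated copy: it can be recomputed from the data it was derived from.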
2. Spark modules
Spark modules
Spark Streaming
MLlib
ML algorithms include:
● Classification: logistic regression, naive Bayes, ...
● Regression: generalized linear regression, survival regression, ...
● Decision trees, random forests, and gradient-boosted trees
● Recommendation: alternating least squares (ALS)
● Clustering: K-means, Gaussian mixtures (GMMs), ...
● Topic modeling: latent Dirichlet allocation (LDA)
● Frequent itemsets, association rules, and sequential pattern mining
GraphX
3. SparkSQL
Spark SQL
Features
● Integrated: query data stored in RDDs. Languages: Python, Scala, Java, R.
● Unified data access: Parquet, JSON, CSV, Hive tables
● Apache Hive compatibility
● Standard connectivity: JDBC, ODBC
● Scalability
DataFrame
[Diagram: a table whose rows are organized into named columns, Column 1 ... Column N]
Features
● Can process data ranging from kilobytes to petabytes, on anything from a single-node cluster to a large cluster.
● Supports different data formats (JSON, CSV, Elasticsearch, ...) and storage systems (HDFS, Hive tables, Oracle, ...)
● Integrates easily with other Big Data tools (Spark Core).
● API for Python, Java, Scala, and R.
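A short PySpark sketch may help tie the DataFrame and Spark SQL features together: the same question is answered once with a SQL query and once with the DataFrame API. It assumes PySpark is installed and runs locally; the sample rows, the view name `counts`, the query and the function `run_sketch` are all made up for this illustration.

```python
# Hypothetical sample data (word, total), used only for illustration.
ROWS = [("AA", 5), ("BB", 2), ("CC", 1), ("DD", 2), ("EE", 1), ("FF", 1)]
QUERY = "SELECT word, total FROM counts WHERE total >= 2 ORDER BY total DESC, word"


def run_sketch():
    """Run the same filter-and-sort through Spark SQL and through the DataFrame API."""
    # Imported lazily so this file can be loaded even without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .master("local[*]")  # assumption: local run using all available cores
        .appName("sparksql-sketch")
        .getOrCreate()
    )
    try:
        # Build a DataFrame with two named columns.
        df = spark.createDataFrame(ROWS, schema=["word", "total"])

        # Register it as a temporary view so that it can be queried with SQL.
        df.createOrReplaceTempView("counts")
        via_sql = spark.sql(QUERY).collect()

        # The equivalent query expressed with DataFrame methods.
        via_api = (
            df.filter(F.col("total") >= 2)
            .orderBy(F.col("total").desc(), F.col("word"))
            .collect()
        )
        return [tuple(r) for r in via_sql], [tuple(r) for r in via_api]
    finally:
        spark.stop()
```

Calling `run_sketch()` returns two lists that should contain the same rows. Either style can be used, and the two can be mixed within one application.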
Spark Architecture
4. Workshop
WORKSHOP
To practice the main concepts, please complete the exercises
proposed in our GitHub repository by following this link:
○ Homework
THANKS!
Any questions?
@datiobd
datio-big-data
Special thanks to Stratio for its theoretical contribution
academy@datiobd.com
Apache Spark II (SparkSQL)

  • 2. Contents Introduction to Spark1 2 3 Spark modules SparkSQL 4 Workshop
  • 4. What is Apache Spark?
      ● Extends MapReduce
      ● Cluster computing platform
      ● Runs in memory
  • 5. Why Spark
      ● Fast
          ❏ Up to 10x faster than MapReduce on disk
          ❏ Up to 100x faster in memory
      ● Ease of development
          ❏ Easy code
          ❏ Interactive shell
      ● Unified stack
          ❏ Batch
          ❏ Streaming
      ● Multi-language support
          ❏ Scala, Python, Java, R
      ● Deployment flexibility
          ❏ Deployment: Mesos, YARN, standalone, local
          ❏ Storage: HDFS, S3, local FS
  • 6. Rise of the data center
      ● Huge amounts of data spread out across many commodity servers
      ● Lots of data → scale out
      Data Processing Requirements
      ● Network bottleneck → Distributed Computing
      ● Hardware failure → Fault Tolerance
      ● Abstraction to organize parallelizable tasks → MapReduce
  • 7. MapReduce (word count example)
      Input:           AA BB AA / AA CC DD / AA EE DD / BB FF AA
      Split:           [AA BB AA] [AA CC DD] [AA EE DD] [BB FF AA]
      Map:             (AA, 1) (BB, 1) (AA, 1) | (AA, 1) (CC, 1) (DD, 1) | (AA, 1) (EE, 1) (DD, 1) | (BB, 1) (FF, 1) (AA, 1)
      [Combine]:       (AA, 2) (BB, 1) | (AA, 1) (CC, 1) (DD, 1) | (AA, 1) (EE, 1) (DD, 1) | (BB, 1) (FF, 1) (AA, 1)
      Shuffle & Sort:  (AA, 2) (AA, 1) (AA, 1) (AA, 1) | (BB, 1) (BB, 1) | (CC, 1) | (DD, 1) (DD, 1) | (EE, 1) | (FF, 1)
      Reduce:          (AA, 5) (BB, 2) (CC, 1) (DD, 2) (EE, 1) (FF, 1)
      Output:          AA, 5 / BB, 2 / CC, 1 / DD, 2 / EE, 1 / FF, 1
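The stages of the word-count slide can be traced with a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop or Spark code: the function names (`map_phase`, `combine_phase`, `shuffle_and_sort`, `reduce_phase`) are made up for illustration, and each input line is assumed to stand in for one split.

```python
from collections import Counter, defaultdict

# Input lines taken from the slide; each line plays the role of one input split.
splits = ["AA BB AA", "AA CC DD", "AA EE DD", "BB FF AA"]

def map_phase(split):
    """Emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for word in split.split()]

def combine_phase(pairs):
    """Optional local pre-aggregation of a single mapper's output."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def shuffle_and_sort(all_pairs):
    """Group the partial counts by key and order the keys."""
    grouped = defaultdict(list)
    for word, count in all_pairs:
        grouped[word].append(count)
    return sorted(grouped.items())

def reduce_phase(grouped):
    """Sum the partial counts of each key."""
    return [(word, sum(counts)) for word, counts in grouped]

# Map and combine run per split; shuffle and reduce see all splits together.
combined = [combine_phase(map_phase(s)) for s in splits]
flat = [pair for part in combined for pair in part]
word_counts = reduce_phase(shuffle_and_sort(flat))

for word, count in word_counts:
    print(f"{word}, {count}")
```

In a real cluster the map and combine calls would run on different machines and the shuffle would move data over the network; here everything happens in one list comprehension so the data flow stays visible.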
  • 8. Spark Components
      [Diagram: Driver Program (SparkContext) ↔ Cluster Manager ↔ two Worker Nodes, each running an Executor with two Tasks]
  • 9. Spark Components
      SparkContext
      ● Main entry point for Spark functionality
      ● Represents the connection to a Spark cluster
      ● Tells Spark how and where to access a cluster
      ● Can be used to create RDDs, accumulators and broadcast variables on that cluster
      Driver program
      ● “Main” process coordinated by the SparkContext object
      ● Allows any Spark process to be configured with specific parameters
      ● Spark actions are executed in the Driver
      ● Spark-shell
      ● Application → driver program + executors
  • 10. Spark Components
      Cluster Manager
      ● External service for acquiring resources on the cluster
      ● Variety of cluster managers
          ○ Local
          ○ Standalone
          ○ YARN
          ○ Mesos
      ● Deploy mode:
          ○ Cluster → the framework launches the driver inside the cluster
          ○ Client → the submitter launches the driver outside the cluster
  • 11. Spark Components
      Worker Node
      ● Any node that can run application code in the cluster
      ● Key Terms
          ○ Executor: A process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
          ○ Task: A unit of work that is sent to one executor
          ○ Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect)
          ○ Stage: A smaller set of tasks within a job
  • 12. RDD (Resilient Distributed Datasets)
      ● Collection of objects that is distributed across nodes in a cluster
      ● Data operations are performed on RDDs
      ● Once created, RDDs are immutable
      ● RDDs can be persisted in memory or on disk
      ● Fault tolerant
      Example: numbers = RDD[1,2,3,4,5,6,7,8,9,10]
          Worker Node, Executor: [1,5,6,9]
          Worker Node, Executor: [2,7,8]
          Worker Node, Executor: [3,4,10]
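The partitioning and immutability points from the RDD slide can be mimicked in plain Python. The snippet below is a toy model under stated assumptions, not Spark's API: `map_partitions` and `reduce_partitions` are hypothetical helper names, tuples stand in for immutable partitions, and the layout of the ten numbers is copied from the slide.

```python
from functools import reduce

# Partition layout copied from the slide: ten numbers spread over three executors.
partitions = (
    (1, 5, 6, 9),   # executor on the first worker node
    (2, 7, 8),      # executor on the second worker node
    (3, 4, 10),     # executor on the third worker node
)

def map_partitions(parts, fn):
    """Apply fn to every element, partition by partition, and return NEW partitions."""
    return tuple(tuple(fn(x) for x in part) for part in parts)

def reduce_partitions(parts, fn):
    """Reduce each partition locally, then merge the partial results."""
    partials = [reduce(fn, part) for part in parts if part]
    return reduce(fn, partials)

# A transformation yields a new collection; the original partitions are untouched.
squares = map_partitions(partitions, lambda x: x * x)

# An action combines per-partition results into a single value.
total = reduce_partitions(squares, lambda a, b: a + b)

print(squares)
print(total)
```

Because every transformation returns a new collection and leaves its input intact, a lost partition could in principle be rebuilt by re-running the same function on its source partition, which is the intuition behind the "fault tolerant" bullet.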
  • 16. MLlib
      ML algorithms include:
      ● Classification: logistic regression, naive Bayes, ...
      ● Regression: generalized linear regression, survival regression, ...
      ● Decision trees, random forests, and gradient-boosted trees
      ● Recommendation: alternating least squares (ALS)
      ● Clustering: K-means, Gaussian mixtures (GMMs), ...
      ● Topic modeling: latent Dirichlet allocation (LDA)
      ● Frequent itemsets, association rules, and sequential pattern mining
  • 20. Spark SQL
      Features
      ● Integrated: Query data stored in RDDs. Languages: Python, Scala, Java, R.
      ● Unified data access: Parquet, JSON, CSV, Hive tables
      ● Apache Hive compatibility
      ● Standard connectivity: JDBC, ODBC
      ● Scalability
  • 21. DataFrame
      [Diagram: a table of rows organized into named columns: Column 1, Column 2, Column 3, ..., Column N]
  • 22. DataFrame
      Features
      ● Can process data ranging from kilobytes to petabytes, on anything from a single-node cluster to a large cluster
      ● Supports different data formats (JSON, CSV, Elasticsearch, ...) and storage systems (HDFS, Hive tables, Oracle, ...)
      ● Integrates easily with other Big Data tools (Spark Core)
      ● API for Python, Java, Scala, and R
  • 25. Workshop
      To practice the main concepts, please complete the exercises in our GitHub repository, available at the following link:
      ○ Homework