Programming in Spark using PySpark
Mostafa Elzoghbi
Sr. Technical Evangelist – Microsoft
@MostafaElzoghbi
http://mostafa.rocks
Session Objectives & Takeaways
• Programming Spark
• Spark Program Structure
• Working with RDDs
• Transformations versus Actions
• Lambda functions and shared variables (broadcast variables vs. accumulators)
• Visualizing big data in Spark
• Spark in the cloud (Azure)
• Working with cluster types, notebooks, scaling.
Python Spark (pySpark)
• We are using the Python programming interface to Spark (pySpark)
• pySpark provides an easy-to-use programming abstraction and parallel
runtime:
“Here’s an operation, run it on all of the data”
• RDDs are the key concept
Apache Spark Driver and Workers
• A Spark program consists of two programs:
• a driver program and worker programs
• Worker programs run on cluster nodes or in local threads
• RDDs (Resilient Distributed Datasets) are distributed across the workers
Spark Essentials: Master
• The master parameter for a SparkContext determines which type and size of
cluster to use
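As a hedged sketch, the common master values look like this (HOST/PORT are placeholders; see the Spark docs for the full list):
from pyspark import SparkContext
# Typical master values (a sketch):
#   "local"             -- run Spark with a single worker thread
#   "local[K]"          -- run locally with K worker threads
#   "spark://HOST:PORT" -- connect to a standalone cluster master
#   "yarn"              -- connect to a YARN cluster manager
sc = SparkContext(master="local[2]", appName="MasterExample")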
Spark Context
• A Spark program first creates a SparkContext object
» Tells Spark how and where to access a cluster
» pySpark shell and Databricks cloud automatically create the sc variable
» IPython and standalone programs must use a constructor to create a new SparkContext
• Use SparkContext to create RDDs
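A minimal sketch of a standalone driver program (the master URL, thread count, and app name are illustrative assumptions; in the pySpark shell, sc already exists):
from pyspark import SparkConf, SparkContext
# Tell Spark how and where to access a cluster, then use sc to create RDDs.
conf = SparkConf().setMaster("local[4]").setAppName("PySparkDemo")
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3])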
Resilient Distributed Datasets
• The primary abstraction in Spark
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collections of elements in parallel
• You construct RDDs
» by parallelizing existing Python collections (lists)
» by transforming an existing RDD
» from files in HDFS or any other storage system
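A short sketch of all three construction routes (the list, the transformation, and the file path are illustrative):
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)                      # from an existing Python list
squares = rdd.map(lambda x: x * x)              # by transforming an existing RDD
lines = sc.textFile("hdfs:///data/sample.txt")  # from a file in HDFS (path is hypothetical)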
RDDs
• Spark revolves around the concept of a resilient distributed dataset (RDD),
which is a fault-tolerant collection of elements that can be operated on in
parallel.
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• Transformed RDD is executed when action runs on it
• Persist (cache) RDDs in memory or disk
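A sketch of lazy evaluation and caching (the data is illustrative):
rdd = sc.parallelize(range(1000))
evens = rdd.filter(lambda x: x % 2 == 0)  # transformation: nothing runs yet
evens.cache()                             # mark the RDD for in-memory persistence
print(evens.count())                      # action: triggers computation, fills the cache
print(evens.first())                      # second action reuses the cached data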
Programming in Spark using PySpark
Creating an RDD
• Create RDDs from Python collections (lists)
• From HDFS, text files, Hypertable, Amazon S3, Apache HBase, SequenceFiles,
any other Hadoop InputFormat, and directories or glob wildcards: /data/201404*
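For example (the paths and bucket name are illustrative assumptions):
april = sc.textFile("/data/201404*")               # glob wildcard: every matching file
hdfs_rdd = sc.textFile("hdfs:///logs/events.txt")  # a single HDFS file
s3_rdd = sc.textFile("s3n://my-bucket/input/")     # a whole S3 directory (bucket is hypothetical)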
Working with RDDs
• Create an RDD from a data source: <list>
• Apply transformations to an RDD: map, filter
• Apply actions to an RDD: collect, count
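Put together, a sketch of the three steps (values are illustrative):
rdd = sc.parallelize([1, 2, 3, 4, 5])    # create from a data source (a list)
doubled = rdd.map(lambda x: x * 2)       # transformation
small = doubled.filter(lambda x: x < 8)  # transformation
print(small.collect())                   # action -> [2, 4, 6]
print(small.count())                     # action -> 3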
Spark Transformations
• Create new datasets from an existing one
• Use lazy evaluation: results are not computed right away;
instead, Spark remembers the set of transformations applied to the base dataset
» Spark optimizes the required calculations
» Spark recovers from failures and slow workers
• Think of this as a recipe for creating the result
Python lambda Functions
• Small anonymous functions (not bound to a name)
lambda a, b: a+b
» returns the sum of its two arguments
• Can use lambda functions wherever function objects are required
• Restricted to a single expression
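For example (the word list is illustrative):
add = lambda a, b: a + b                      # a single-expression anonymous function
print(add(2, 3))                              # 5
words = sc.parallelize(["spark", "pyspark", "rdd"])
print(words.map(lambda w: len(w)).collect())  # [5, 7, 3]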
Spark Actions
• Cause Spark to execute recipe to transform source
• Mechanism for getting results out of Spark
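A sketch of common actions (the data is illustrative):
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.collect())                   # [1, 2, 3, 4] -- bring all elements to the driver
print(rdd.count())                     # 4
print(rdd.take(2))                     # [1, 2]
print(rdd.reduce(lambda a, b: a + b))  # 10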
Spark Program Lifecycle
1. Create RDDs from external data, or parallelize a collection in your driver program
2. Lazily transform them into new RDDs
3. cache() some RDDs for reuse -- IMPORTANT
4. Perform actions to execute parallel computation and produce results
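The whole lifecycle in one sketch (the file path and filter condition are illustrative assumptions):
lines = sc.textFile("/data/app.log")           # 1. create from external data
errors = lines.filter(lambda l: "ERROR" in l)  # 2. lazily transform
errors.cache()                                 # 3. cache for reuse
print(errors.count())                          # 4. action: runs the parallel job
print(errors.take(5))                          # a second action reuses the cache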
pySpark Shared Variables
• Broadcast Variables
» Efficiently send a large, read-only value to all workers
» Saved at workers for use in one or more Spark operations
» Like sending a large, read-only lookup table to all the nodes
At the driver: broadcastVar = sc.broadcast([1, 2, 3])
At a worker: broadcastVar.value
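A sketch of the lookup-table pattern (the table contents and keys are illustrative):
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})      # shipped once to every worker
rdd = sc.parallelize(["a", "b", "c", "a"])
print(rdd.map(lambda k: lookup.value[k]).collect())  # workers read .value locally -> [1, 2, 3, 1]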
• Accumulators
» Aggregate values from workers back to driver
» Only driver can access value of accumulator
» For tasks, accumulators are write-only
» Use to count errors seen in RDD across workers
>>> accum = sc.accumulator(0)
>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> def f(x):
...     global accum
...     accum += x
...
>>> rdd.foreach(f)
>>> accum.value
10
Visualizing Big Data in the browser
• Challenges:
• Manipulating large data can take a long time
> Memory: caching -> scale clusters
> CPU: parallelism -> scale clusters
• We have more data points than available pixels
> Summarize: aggregation, pivoting (more data than pixels)
> Model: clustering, classification, dimensionality reduction, etc.
> Sample: approximate (faster) and exact sampling
• Tools: Matplotlib, ggplot, D3, SVG, and more.
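A sketch of sampling before plotting (the fraction, sample size, and seed are illustrative):
big = sc.parallelize(range(1000000))
approx = big.sample(withReplacement=False, fraction=0.01, seed=42)  # approximate sample
exact = big.takeSample(False, 1000, seed=42)  # exact-size sample, returned as a list
# Either result is small enough to hand to Matplotlib on the driver.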
Spark Kernels and MAGIC keywords
• The PySpark kernel supports a set of %%MAGIC keywords
• It supports built-in IPython magics, including %%sh
• Auto visualization
• Magic keywords:
• %%sql » Run Spark SQL queries
• %lsmagic » List all supported magic keywords (important)
• %env » Set environment variables
• %run » Execute Python code
• %who » List all variables in the global scope
• Run code from a different kernel in a notebook.
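For example, a notebook cell using the %%sql magic (hivesampletable is the sample Hive table that ships with HDInsight clusters; the column names are assumptions based on that sample):
%%sql
SELECT clientid, querytime, devicemodel FROM hivesampletable LIMIT 10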
Spark in Azure
Hadoop clusters in Azure are packaged under the “HDInsight” service
Spark in Azure
• Create clusters in a few clicks
• Apache Spark clusters are available only on Linux
• Multiple HDP versions
• Comes preloaded with: SSH, Hive, Oozie, DLS, VNets
• Multiple storage options:
• Azure Storage
• ADL Store
• External metadata store in a SQL Server database for Hive and Oozie
• All notebooks are stored in the storage account associated with the Spark cluster
• The Zeppelin notebook is available on certain Spark versions, but not all
Programming Spark Apps in HDInsight
• Supports four kernels in Jupyter in HDInsight Spark clusters in Azure: PySpark, PySpark3, Spark, and SparkR
DEMO
Spark Apps using Jupyter
References
• Spark Programming Guide
http://spark.apache.org/docs/latest/programming-guide.html
• edx.org: Free Apache Spark courses
• Visualizations for Databricks
https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20Databricks%20Overview/15%20Visualizations/0%20Visualizations%20Overview.html
• SPARKHub by Databricks
https://sparkhub.databricks.com/resources/
Thank you
• Check out my blog for big data articles: http://mostafa.rocks
• Follow me on Twitter: @MostafaElzoghbi
• Want some help building cloud solutions? Contact me to learn more.


Editor's Notes

• #2: Ref.: https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/ Apache Spark leverages a common execution model for doing multiple tasks like ETL, batch queries, interactive queries, real-time streaming, machine learning, and graph processing on data stored in Azure Data Lake Store. This allows you to use Spark for Azure HDInsight to solve big data challenges in near real-time like fraud detection, click stream analysis, financial alerts, telemetry from connected sensors and devices (Internet of Things, IoT), social analytics, always-on ETL pipelines, and network monitoring.
• #3: A) Main concepts to cover for data science: regression, classification (FOCUS), clustering, recommendation. B) Building programmable components in Azure ML experiments. C) Working with Azure ML Studio.
• #5: Source: https://courses.edx.org/c4x/BerkeleyX/CS100.1x/asset/Week2Lec4.pdf Spark standalone running on two nodes with two workers: A client process submits an app to the master. The master instructs one of its workers to launch a driver. The worker spawns a driver JVM. The master instructs both workers to launch executors for the app. The workers spawn executor JVMs. The driver and executors communicate independently of the cluster’s processes.
• #6: Running Spark: Standalone cluster: Spark standalone comes out of the box and comes with its own web UI (to monitor and run apps/jobs). It consists of a master and workers (also called slaves). Mesos and YARN are also supported in Spark. YARN is the only cluster manager on which Spark can access HDFS secured with Kerberos. YARN is the new generation of Hadoop’s MapReduce execution engine and can run MapReduce, Spark, and other types of programs.
• #16: For that reason, cache is said to “break the lineage”, as it creates a checkpoint that can be reused for further processing. Rule of thumb: use cache when the lineage of your RDD branches out or when an RDD is used multiple times, as in a loop.
• #17: Keep a read-only variable cached on the workers » Ship it to each worker only once instead of with each task • Example: efficiently give every worker a large dataset • Usually distributed using efficient broadcast algorithms
• #19: Extensively used in statistics. Spark offers native support for: • approximate and exact sampling • approximate and exact stratified sampling. Approximate sampling is faster and is good enough in most cases.
• #20: 1) Jupyter notebook kernels with Apache Spark clusters in HDInsight: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-notebook-kernels 2) IPython built-in magics: https://ipython.org/ipython-doc/3/interactive/magics.html#cell-magics Source for tips and magic keywords: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
• #21: URL: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql
• #25: URL: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql
• #26: Spark 2.0 announcements: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html