Lightning-fast cluster computing
Apache Spark
What is Apache Spark?
Cluster computing platform designed to be fast and general-purpose.
Fast
Universal
Highly Accessible
Unified stack
Comparison with MR
// `spark` is assumed to be a SparkContext here, like `sc` in the next example
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Examples: Word Count
val sc = new SparkContext(...)
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
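Examples: Log Mining
val sc = new SparkContext(...)
val inputRDD = sc.textFile("log.txt")
// Transformations: lazily define new RDDs from the input
val errorsRDD = inputRDD.filter(line => line.contains("error"))
val warningsRDD = inputRDD.filter(line => line.contains("warning"))
val badLinesRDD = errorsRDD.union(warningsRDD)
// Mark the RDD for caching so the two actions below do not recompute it
badLinesRDD.persist()
// Actions: trigger the computation
badLinesRDD.count()
badLinesRDD.collect()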
How does it work?
RDD: Resilient Distributed Dataset
Advanced: RDD as interface
Example: Hadoop RDD
partitions = one per HDFS block
dependencies = none
compute = read corresponding block
preferredLocations = HDFS block locations
partitioner = none
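The five properties above can be read as a small interface that every RDD fills in. The sketch below is a simplified, hypothetical model in plain Scala, not Spark's own classes: `SimpleRDD`, `BlockRDD`, `FilteredRDD`, and the string-valued `partitioner` are names invented here for illustration, and the host names in the usage note are made up.

```scala
// Hypothetical, simplified model of the five RDD properties.
// None of these types are Spark's own classes.
object RddInterfaceSketch {
  // A partition is identified only by its index in this sketch.
  final case class Partition(index: Int)

  trait SimpleRDD[T] {
    def partitions: Seq[Partition]                    // how the data is split
    def dependencies: Seq[SimpleRDD[_]]               // parent RDDs (the lineage)
    def compute(p: Partition): Iterator[T]            // produce one partition's data
    def preferredLocations(p: Partition): Seq[String] // hosts that hold the data
    def partitioner: Option[String]                   // key partitioning scheme, if any
  }

  // Source RDD: one partition per (host, lines) block and no parents.
  // It stands in for the Hadoop RDD on the slide.
  final class BlockRDD(blocks: Vector[(String, Seq[String])]) extends SimpleRDD[String] {
    def partitions: Seq[Partition] = blocks.indices.map(i => Partition(i))
    def dependencies: Seq[SimpleRDD[_]] = Nil
    def compute(p: Partition): Iterator[String] = blocks(p.index)._2.iterator
    def preferredLocations(p: Partition): Seq[String] = Seq(blocks(p.index)._1)
    def partitioner: Option[String] = None
  }

  // Derived RDD: same partitions as its parent, computed by filtering the parent's data.
  final class FilteredRDD[T](parent: SimpleRDD[T], pred: T => Boolean) extends SimpleRDD[T] {
    def partitions: Seq[Partition] = parent.partitions
    def dependencies: Seq[SimpleRDD[_]] = Seq(parent)
    def compute(p: Partition): Iterator[T] = parent.compute(p).filter(pred)
    def preferredLocations(p: Partition): Seq[String] = parent.preferredLocations(p)
    def partitioner: Option[String] = None
  }
}
```

For example, `new FilteredRDD[String](new BlockRDD(Vector("host-a" -> Seq("error: disk", "ok"))), _.contains("error"))` describes a filtered dataset without reading anything; data is only produced when `compute` is called for a partition. Because each derived RDD records its parent, a lost partition can be rebuilt by replaying `compute` along the dependencies.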
Directed Acyclic Graph (DAG)
hadoopRDD --filter--> errorsRDD
hadoopRDD --filter--> warningsRDD
errorsRDD + warningsRDD --union--> badLinesRDD
RDD Transformations
Function Name | Purpose | Example
--- | --- | ---
map() | Apply a function to each element in the RDD and return an RDD of the results. | rdd.map(x => x + 1)
flatMap() | Apply a function to each element in the RDD and return an RDD of the contents of the iterators returned. Often used to extract words. | rdd.flatMap(x => x.to(3))
filter() | Return an RDD consisting of only the elements that pass the condition passed to filter(). | rdd.filter(x => x != 1)
distinct() | Remove duplicates. | rdd.distinct()
union() | Produce an RDD containing the elements from both RDDs. | rdd.union(other)
intersection() | Produce an RDD containing only the elements found in both RDDs. | rdd.intersection(other)
join() | Perform an inner join between two pair RDDs on their keys. | rdd.join(other)
groupByKey() | Group the values that share the same key. | rdd.groupByKey()
RDD Actions
Function Name | Purpose | Example
--- | --- | ---
count() | Number of elements in the RDD. | rdd.count()
collect() | Return all elements from the RDD to the driver. | rdd.collect()
saveAsTextFile() | Save the RDD elements to an external storage system. | rdd.saveAsTextFile("hdfs://...")
take(num) | Return num elements from the RDD. | rdd.take(10)
reduce(func) | Combine the elements of the RDD together in parallel (e.g., sum). | rdd.reduce((x, y) => x + y)
takeOrdered(num)(ordering) | Return the first num elements according to the provided ordering. | rdd.takeOrdered(2)(myOrdering)
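A minimal sketch of how transformations and actions fit together, assuming Spark is on the classpath and run in local mode. The object name `RddOpsSketch`, the `Summary` case class, the app name, and the sample data are all illustrative choices, not part of the original slides. Transformations only describe new RDDs; the work happens when an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddOpsSketch {
  // Hypothetical container for the results returned by the actions below.
  final case class Summary(
    plusOne: List[Int],
    expandedCount: Long,
    distinctCount: Long,
    unionCount: Long,
    common: List[Int],
    sum: Int,
    largestTwo: List[Int])

  def run(sc: SparkContext): Summary = {
    val rdd = sc.parallelize(Seq(1, 2, 3, 3))
    val other = sc.parallelize(Seq(3, 4, 5))

    // Transformations: lazily describe new RDDs, nothing is computed yet.
    val plusOne = rdd.map(x => x + 1)
    val expanded = rdd.flatMap(x => x.to(3))
    val unique = rdd.distinct()
    val both = rdd.union(other)
    val common = rdd.intersection(other)

    // Actions: trigger the computation and bring results back to the driver.
    Summary(
      plusOne = plusOne.collect().toList,
      expandedCount = expanded.count(),
      distinctCount = unique.count(),
      unionCount = both.count(),
      common = common.collect().toList.sorted,
      sum = rdd.reduce((x, y) => x + y),
      // An explicit ordering, here the reverse of the natural one.
      largestTwo = rdd.takeOrdered(2)(Ordering[Int].reverse).toList)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode for illustration; a real job would target a cluster.
    val conf = new SparkConf().setAppName("rdd-ops-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc))
    finally sc.stop()
  }
}
```

Note that `union` keeps duplicates while `intersection` and `distinct` remove them, which is why the counts differ.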
RDD Caching
Level | Space Used | CPU Time | In Memory | On Disk | Comments
--- | --- | --- | --- | --- | ---
MEMORY_ONLY | High | Low | Y | N |
MEMORY_ONLY_SER | Low | High | Y | N |
MEMORY_AND_DISK | High | Medium | Some | Some | Spills to disk if there is too much data to fit in memory.
MEMORY_AND_DISK_SER | Low | High | Some | Some | Spills to disk if there is too much data to fit in memory. Stores serialized representation in memory.
DISK_ONLY | Low | High | N | Y |
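A storage level from the table is chosen by passing it to `persist`. The sketch below assumes Spark is on the classpath and runs in local mode; the object name `CachingSketch`, the method `countTwice`, the app name, and the sample log lines are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  // Counts the "bad" lines twice and reports the storage level that was requested.
  def countTwice(sc: SparkContext, lines: Seq[String]): (Long, Long, StorageLevel) = {
    val badLines = sc.parallelize(lines)
      .filter(line => line.contains("error") || line.contains("warning"))

    // Request serialized in-memory storage that spills to disk when memory runs out.
    badLines.persist(StorageLevel.MEMORY_AND_DISK_SER)

    val first = badLines.count()  // first action: computes the RDD and stores its partitions
    val second = badLines.count() // second action: can reuse the stored partitions
    val level = badLines.getStorageLevel

    badLines.unpersist() // release the stored partitions once they are no longer needed
    (first, second, level)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode for illustration; a real job would target a cluster.
    val conf = new SparkConf().setAppName("caching-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(countTwice(sc, Seq("error: disk full", "info: ok", "warning: slow")))
    finally sc.stop()
  }
}
```

Calling `persist()` with no argument, as in the log mining example, uses the `MEMORY_ONLY` level for RDDs. The serialized levels trade extra CPU time for a smaller memory footprint, as the table indicates.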
How does it work?
Driver: the main program, which controls the flow
Executors: processes on the worker nodes that execute the actions
How does it work?
DAG Scheduler: coordination between RDDs, the driver, and the nodes
What is a Spark Application?
Advanced Topics: Stages
Advanced Topics: Shuffling
Spark Stack
SQL
Streaming
Machine Learning
GraphX
if not… DEMO everyone?
?

Apache Spark - Aram Mkrtchyan
