SlideShare a Scribd company logo
Spark Conf Taiwan 2016
Apache Spark
RICH LEE
2016/9/21
Agenda
Spark overview
Spark core
RDD
Spark Application Develop
Spark Shell
Zeppline
Application
Apache Spark Introduction
Apache Spark Introduction
Spark Overview
Apache Spark is a fast and general-purpose cluster computing system
Key Features:
Fast
Ease of Use
General-purpose
Scalable
Fault tolerant
Logistic regression in Hadoop and Spark
Apache Spark Introduction
Spark Overview
Cluster Mode
Local
Standalone
Hadoop YARN
Apache Mesos
Spark Overview
Spark Overview
Spark High Level Architecture
Driver Program
Cluster Management
Worker Node
Executor
Task
Apache Spark Introduction
Spark Overview
Install and startup
Download
http://spark.apache.org/downloads.html
Start Master and Worker
./sbin/start-all.sh
http://localhost:8080
Start History server
./sbin/start-history-server.sh hdfs://localhost:9000/spark/directory
RDD
Resilient Distributed Datase
RDD represents a collection of partitioned data elements that can be operated on in
parallel. It is the primary data abstraction mechanism in Spark.
Partitioned
Fault Tolerant
Interface
In Memory
RDD
Create RDD
parallelize
val xs = (1 to 10000).toList
val rdd = sc.parallelize(xs)
textFile
val lines = sc.textFile("/input/README.md")
val lines = sc.textFile("file:///RICH_HD/BigData_Tools/spark-1.6.2/README.md")
HDFS - "hdfs://"
Amazon S3 - "s3n://"
Cassandra, HBase
RDD
RDD
Transformation
Creates a new RDD by performing a computation on the source RDD
map
val txtFile = sc.textFile("/input/README.md")
val lengths = lines map { l => l.length}
flatMap
val words = lines flatMap { l => l.split(" ")}
filter
val longLines = lines filter { l => l.length > 80}
RDD
Action
Return a value to a driver program
first
val numbersRdd = sc.parallelize(List(10, 5, 3, 1))
val firstElement = numbersRdd.first
max
numbersRdd.max
reduce
val sum = numbersRdd.reduce((x, y) => x + y)
val product = numbersRdd.reduce((x, y) => x * y)
RDD
Filter log example:
val logs = sc.textFile("path/to/log-files")
val errorLogs = logs filter { l => l.contains("ERROR")}
val warningLogs = logs filter { l => l.contains("WARN")}
val errorCount = errorLogs.count
val warningCount = warningLogs.count
log RDD
error
RDD
warn
RDD
count count
RDD
Caching
Stores an RDD in the memory or storage
When an application caches an RDD in memory, Spark stores it in the executor
memory on each worker node. Each executor stores in memory the RDD
partitions that it computes.
cache
persist
MEMORY_ONLY
DISK_ONLY
MEMORY_AND_DISK
RDD
Cache example:
val logs = sc.textFile("path/to/log-files")
val errorsAndWarnings = logs filter { l => l.contains("ERROR") || l.contains("WARN")}
errorsAndWarnings.cache()
val errorLogs = errorsAndWarnings filter { l => l.contains("ERROR")}
val warningLogs = errorsAndWarnings filter { l => l.contains("WARN")}
val errorCount = errorLogs.count
val warningCount = warningLogs.count
Apache Spark Introduction
Spark Application Develop
Spark-Shell
Zeppline
Application (Java/Scala)
spark-submit
Spark Application Develop
WordCount
val textFile = sc.textFile("/input/README.md")
val wcData = textFile.flatMap(line => line.split(" "))
.map((_, 1))
.reduceByKey(_ + _)
wcData.collect().foreach(println)
工商時間
Taiwan Hadoop User Group
https://www.facebook.com/groups/hadoop.tw/
Taiwan Spark User Group
https://www.facebook.com/groups/spark.tw/

More Related Content

What's hot (20)

Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
Thu Hiền
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
CloudxLab
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
Cloudera, Inc.
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Ml2
Ml2Ml2
Ml2
poovarasu maniandan
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internals
Sigmoid
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
Paco Nathan
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
Joey Echeverria
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
Abhinav Singh
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
Gerger
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Databricks
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
Richard Kuo
 
Spark core
Spark coreSpark core
Spark core
Freeman Zhang
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
Thu Hiền
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
CloudxLab
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
Cloudera, Inc.
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internals
Sigmoid
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
Paco Nathan
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
Joey Echeverria
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
Abhinav Singh
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
Gerger
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Databricks
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
Richard Kuo
 

Viewers also liked (20)

Workshop
WorkshopWorkshop
Workshop
Beth Kanter
 
Datomic
DatomicDatomic
Datomic
Jordan Leigh
 
Datomic
DatomicDatomic
Datomic
Christophe Marchal
 
Datomic
DatomicDatomic
Datomic
jperkelens
 
Backbone.js
Backbone.jsBackbone.js
Backbone.js
daisuke shimizu
 
Selena Gomez
Selena GomezSelena Gomez
Selena Gomez
guest5fa9931
 
Management Consulting
Management ConsultingManagement Consulting
Management Consulting
Alexandros Chatzopoulos
 
Sap fiori
Sap fioriSap fiori
Sap fiori
Anudeep Bhatia
 
French Property Market 2014
French Property Market 2014French Property Market 2014
French Property Market 2014
David Bourla
 
ReactJs
ReactJsReactJs
ReactJs
LearningTech
 
Medical devices
Medical devicesMedical devices
Medical devices
Somnath Zambare
 
French Property market 2015 - Cushman & Wakefield
French Property market 2015 - Cushman & WakefieldFrench Property market 2015 - Cushman & Wakefield
French Property market 2015 - Cushman & Wakefield
David Bourla
 
ReactJS | 서버와 클라이어트에서 동시에 사용하는
ReactJS | 서버와 클라이어트에서 동시에 사용하는ReactJS | 서버와 클라이어트에서 동시에 사용하는
ReactJS | 서버와 클라이어트에서 동시에 사용하는
Taegon Kim
 
Elon Musk
Elon MuskElon Musk
Elon Musk
Miloš Ivković
 
Simo Ahava - Tag Management Solutions – Best. Data. Ever. MKTFEST 2014
Simo Ahava - Tag Management Solutions – Best. Data. Ever. MKTFEST 2014Simo Ahava - Tag Management Solutions – Best. Data. Ever. MKTFEST 2014
Simo Ahava - Tag Management Solutions – Best. Data. Ever. MKTFEST 2014
Marketing Festival
 
Reverse Engineering
Reverse EngineeringReverse Engineering
Reverse Engineering
dswanson
 
Chess
ChessChess
Chess
Chuck Vohs
 
Personal Development
Personal DevelopmentPersonal Development
Personal Development
Seta Wicaksana
 
GridGain 6.0: Open Source In-Memory Computing Platform - Nikita Ivanov
GridGain 6.0: Open Source In-Memory Computing Platform - Nikita IvanovGridGain 6.0: Open Source In-Memory Computing Platform - Nikita Ivanov
GridGain 6.0: Open Source In-Memory Computing Platform - Nikita Ivanov
JAXLondon2014
 
Digital Data Tips Tuesday #1 - Tag Management: Simo Ahava - NetBooster
Digital Data Tips Tuesday #1 - Tag Management: Simo Ahava - NetBoosterDigital Data Tips Tuesday #1 - Tag Management: Simo Ahava - NetBooster
Digital Data Tips Tuesday #1 - Tag Management: Simo Ahava - NetBooster
Webanalisten .nl
 
French Property Market 2014
French Property Market 2014French Property Market 2014
French Property Market 2014
David Bourla
 
French Property market 2015 - Cushman & Wakefield
French Property market 2015 - Cushman & WakefieldFrench Property market 2015 - Cushman & Wakefield
French Property market 2015 - Cushman & Wakefield
David Bourla
 
ReactJS | 서버와 클라이어트에서 동시에 사용하는
ReactJS | 서버와 클라이어트에서 동시에 사용하는ReactJS | 서버와 클라이어트에서 동시에 사용하는
ReactJS | 서버와 클라이어트에서 동시에 사용하는
Taegon Kim
 
Simo Ahava - Tag Management Solutions – Best. Data. Ever. MKTFEST 2014
Simo Ahava - Tag Management Solutions – Best. Data. Ever. MKTFEST 2014Simo Ahava - Tag Management Solutions – Best. Data. Ever. MKTFEST 2014
Simo Ahava - Tag Management Solutions – Best. Data. Ever. MKTFEST 2014
Marketing Festival
 
Reverse Engineering
Reverse EngineeringReverse Engineering
Reverse Engineering
dswanson
 
GridGain 6.0: Open Source In-Memory Computing Platform - Nikita Ivanov
GridGain 6.0: Open Source In-Memory Computing Platform - Nikita IvanovGridGain 6.0: Open Source In-Memory Computing Platform - Nikita Ivanov
GridGain 6.0: Open Source In-Memory Computing Platform - Nikita Ivanov
JAXLondon2014
 
Digital Data Tips Tuesday #1 - Tag Management: Simo Ahava - NetBooster
Digital Data Tips Tuesday #1 - Tag Management: Simo Ahava - NetBoosterDigital Data Tips Tuesday #1 - Tag Management: Simo Ahava - NetBooster
Digital Data Tips Tuesday #1 - Tag Management: Simo Ahava - NetBooster
Webanalisten .nl
 

Similar to Apache Spark Introduction (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
Michael Spector
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
YahooTechConference
 
Apache spark basics
Apache spark basicsApache spark basics
Apache spark basics
sparrowAnalytics.com
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
Demi Ben-Ari
 
Spark 101
Spark 101Spark 101
Spark 101
Mohit Garg
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 
Apache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkApache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup Talk
Eren Avşaroğulları
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
Spark basic.pdf
Spark basic.pdfSpark basic.pdf
Spark basic.pdf
ssuser8b6c85
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
DataWorks Summit
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
Hubert Fan Chiang
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Apache Spark Fundamentals Training
Apache Spark Fundamentals TrainingApache Spark Fundamentals Training
Apache Spark Fundamentals Training
Eren Avşaroğulları
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
Craig Warman
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
Paco Nathan
 
Introduction to apache spark and the architecture
Introduction to apache spark and the architectureIntroduction to apache spark and the architecture
Introduction to apache spark and the architecture
sundharakumarkb2
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
caidezhi655
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
YahooTechConference
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
Demi Ben-Ari
 
Apache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkApache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup Talk
Eren Avşaroğulları
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
DataWorks Summit
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
Hubert Fan Chiang
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
Craig Warman
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
Paco Nathan
 
Introduction to apache spark and the architecture
Introduction to apache spark and the architectureIntroduction to apache spark and the architecture
Introduction to apache spark and the architecture
sundharakumarkb2
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
caidezhi655
 

More from Rich Lee (11)

COSCUP 2023 Building Portable and Reliable Applications on Google Cloud
COSCUP 2023 Building Portable and Reliable Applications on Google CloudCOSCUP 2023 Building Portable and Reliable Applications on Google Cloud
COSCUP 2023 Building Portable and Reliable Applications on Google Cloud
Rich Lee
 
2021 JCConf 使用Dapr簡化Java微服務應用開發
2021 JCConf 使用Dapr簡化Java微服務應用開發2021 JCConf 使用Dapr簡化Java微服務應用開發
2021 JCConf 使用Dapr簡化Java微服務應用開發
Rich Lee
 
GDG Taipei 2020 - Cloud and On-premises Applications Integration Using Event-...
GDG Taipei 2020 - Cloud and On-premises Applications Integration Using Event-...GDG Taipei 2020 - Cloud and On-premises Applications Integration Using Event-...
GDG Taipei 2020 - Cloud and On-premises Applications Integration Using Event-...
Rich Lee
 
JCConf.tw 2020 - Building cloud-native applications with Quarkus
JCConf.tw 2020 - Building cloud-native applications with QuarkusJCConf.tw 2020 - Building cloud-native applications with Quarkus
JCConf.tw 2020 - Building cloud-native applications with Quarkus
Rich Lee
 
Redis Cache design
Redis Cache designRedis Cache design
Redis Cache design
Rich Lee
 
Couchbase & FTS
Couchbase & FTSCouchbase & FTS
Couchbase & FTS
Rich Lee
 
Centralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stackCentralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stack
Rich Lee
 
Microservice Architecture
Microservice ArchitectureMicroservice Architecture
Microservice Architecture
Rich Lee
 
AWS IoT in action
AWS IoT in actionAWS IoT in action
AWS IoT in action
Rich Lee
 
Realtime web development
Realtime web developmentRealtime web development
Realtime web development
Rich Lee
 
Event sourcing
Event sourcingEvent sourcing
Event sourcing
Rich Lee
 
COSCUP 2023 Building Portable and Reliable Applications on Google Cloud
COSCUP 2023 Building Portable and Reliable Applications on Google CloudCOSCUP 2023 Building Portable and Reliable Applications on Google Cloud
COSCUP 2023 Building Portable and Reliable Applications on Google Cloud
Rich Lee
 
2021 JCConf 使用Dapr簡化Java微服務應用開發
2021 JCConf 使用Dapr簡化Java微服務應用開發2021 JCConf 使用Dapr簡化Java微服務應用開發
2021 JCConf 使用Dapr簡化Java微服務應用開發
Rich Lee
 
GDG Taipei 2020 - Cloud and On-premises Applications Integration Using Event-...
GDG Taipei 2020 - Cloud and On-premises Applications Integration Using Event-...GDG Taipei 2020 - Cloud and On-premises Applications Integration Using Event-...
GDG Taipei 2020 - Cloud and On-premises Applications Integration Using Event-...
Rich Lee
 
JCConf.tw 2020 - Building cloud-native applications with Quarkus
JCConf.tw 2020 - Building cloud-native applications with QuarkusJCConf.tw 2020 - Building cloud-native applications with Quarkus
JCConf.tw 2020 - Building cloud-native applications with Quarkus
Rich Lee
 
Redis Cache design
Redis Cache designRedis Cache design
Redis Cache design
Rich Lee
 
Couchbase & FTS
Couchbase & FTSCouchbase & FTS
Couchbase & FTS
Rich Lee
 
Centralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stackCentralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stack
Rich Lee
 
Microservice Architecture
Microservice ArchitectureMicroservice Architecture
Microservice Architecture
Rich Lee
 
AWS IoT in action
AWS IoT in actionAWS IoT in action
AWS IoT in action
Rich Lee
 
Realtime web development
Realtime web developmentRealtime web development
Realtime web development
Rich Lee
 
Event sourcing
Event sourcingEvent sourcing
Event sourcing
Rich Lee
 

Recently uploaded (20)

Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)
Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)
Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)
Eugene Fidelin
 
Measuring Microsoft 365 Copilot and Gen AI Success
Measuring Microsoft 365 Copilot and Gen AI SuccessMeasuring Microsoft 365 Copilot and Gen AI Success
Measuring Microsoft 365 Copilot and Gen AI Success
Nikki Chapple
 
Introducing FME Realize: A New Era of Spatial Computing and AR
Introducing FME Realize: A New Era of Spatial Computing and ARIntroducing FME Realize: A New Era of Spatial Computing and AR
Introducing FME Realize: A New Era of Spatial Computing and AR
Safe Software
 
Introducing the OSA 3200 SP and OSA 3250 ePRC
Introducing the OSA 3200 SP and OSA 3250 ePRCIntroducing the OSA 3200 SP and OSA 3250 ePRC
Introducing the OSA 3200 SP and OSA 3250 ePRC
Adtran
 
From Legacy to Cloud-Native: A Guide to AWS Modernization.pptx
From Legacy to Cloud-Native: A Guide to AWS Modernization.pptxFrom Legacy to Cloud-Native: A Guide to AWS Modernization.pptx
From Legacy to Cloud-Native: A Guide to AWS Modernization.pptx
Mohammad Jomaa
 
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCPMCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
Sambhav Kothari
 
Break Down of Service Mesh Concepts and Implementation in AWS EKS.pptx
Break Down of Service Mesh Concepts and Implementation in AWS EKS.pptxBreak Down of Service Mesh Concepts and Implementation in AWS EKS.pptx
Break Down of Service Mesh Concepts and Implementation in AWS EKS.pptx
Mohammad Jomaa
 
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...
SOFTTECHHUB
 
UiPath Community Berlin: Studio Tips & Tricks and UiPath Insights
UiPath Community Berlin: Studio Tips & Tricks and UiPath InsightsUiPath Community Berlin: Studio Tips & Tricks and UiPath Insights
UiPath Community Berlin: Studio Tips & Tricks and UiPath Insights
UiPathCommunity
 
TrustArc Webinar: Mastering Privacy Contracting
TrustArc Webinar: Mastering Privacy ContractingTrustArc Webinar: Mastering Privacy Contracting
TrustArc Webinar: Mastering Privacy Contracting
TrustArc
 
Talk: On an adventure into the depths of Maven - Kaya Weers
Talk: On an adventure into the depths of Maven - Kaya WeersTalk: On an adventure into the depths of Maven - Kaya Weers
Talk: On an adventure into the depths of Maven - Kaya Weers
Kaya Weers
 
Security Operations and the Defense Analyst - Splunk Certificate
Security Operations and the Defense Analyst - Splunk CertificateSecurity Operations and the Defense Analyst - Splunk Certificate
Security Operations and the Defense Analyst - Splunk Certificate
VICTOR MAESTRE RAMIREZ
 
Content and eLearning Standards: Finding the Best Fit for Your-Training
Content and eLearning Standards: Finding the Best Fit for Your-TrainingContent and eLearning Standards: Finding the Best Fit for Your-Training
Content and eLearning Standards: Finding the Best Fit for Your-Training
Rustici Software
 
Fully Open-Source Private Clouds: Freedom, Security, and Control
Fully Open-Source Private Clouds: Freedom, Security, and ControlFully Open-Source Private Clouds: Freedom, Security, and Control
Fully Open-Source Private Clouds: Freedom, Security, and Control
ShapeBlue
 
Kubernetes Cloud Native Indonesia Meetup - May 2025
Kubernetes Cloud Native Indonesia Meetup - May 2025Kubernetes Cloud Native Indonesia Meetup - May 2025
Kubernetes Cloud Native Indonesia Meetup - May 2025
Prasta Maha
 
Droidal: AI Agents Revolutionizing Healthcare
Droidal: AI Agents Revolutionizing HealthcareDroidal: AI Agents Revolutionizing Healthcare
Droidal: AI Agents Revolutionizing Healthcare
Droidal LLC
 
STKI Israel Market Study 2025 final v1 version
STKI Israel Market Study 2025 final v1 versionSTKI Israel Market Study 2025 final v1 version
STKI Israel Market Study 2025 final v1 version
Dr. Jimmy Schwarzkopf
 
ECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptx
ECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptxECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptx
ECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptx
Jasper Oosterveld
 
Contributing to WordPress With & Without Code.pptx
Contributing to WordPress With & Without Code.pptxContributing to WordPress With & Without Code.pptx
Contributing to WordPress With & Without Code.pptx
Patrick Lumumba
 
Building Agents with LangGraph & Gemini
Building Agents with LangGraph &  GeminiBuilding Agents with LangGraph &  Gemini
Building Agents with LangGraph & Gemini
HusseinMalikMammadli
 
Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)
Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)
Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)
Eugene Fidelin
 
Measuring Microsoft 365 Copilot and Gen AI Success
Measuring Microsoft 365 Copilot and Gen AI SuccessMeasuring Microsoft 365 Copilot and Gen AI Success
Measuring Microsoft 365 Copilot and Gen AI Success
Nikki Chapple
 
Introducing FME Realize: A New Era of Spatial Computing and AR
Introducing FME Realize: A New Era of Spatial Computing and ARIntroducing FME Realize: A New Era of Spatial Computing and AR
Introducing FME Realize: A New Era of Spatial Computing and AR
Safe Software
 
Introducing the OSA 3200 SP and OSA 3250 ePRC
Introducing the OSA 3200 SP and OSA 3250 ePRCIntroducing the OSA 3200 SP and OSA 3250 ePRC
Introducing the OSA 3200 SP and OSA 3250 ePRC
Adtran
 
From Legacy to Cloud-Native: A Guide to AWS Modernization.pptx
From Legacy to Cloud-Native: A Guide to AWS Modernization.pptxFrom Legacy to Cloud-Native: A Guide to AWS Modernization.pptx
From Legacy to Cloud-Native: A Guide to AWS Modernization.pptx
Mohammad Jomaa
 
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCPMCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
Sambhav Kothari
 
Break Down of Service Mesh Concepts and Implementation in AWS EKS.pptx
Break Down of Service Mesh Concepts and Implementation in AWS EKS.pptxBreak Down of Service Mesh Concepts and Implementation in AWS EKS.pptx
Break Down of Service Mesh Concepts and Implementation in AWS EKS.pptx
Mohammad Jomaa
 
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...
SOFTTECHHUB
 
UiPath Community Berlin: Studio Tips & Tricks and UiPath Insights
UiPath Community Berlin: Studio Tips & Tricks and UiPath InsightsUiPath Community Berlin: Studio Tips & Tricks and UiPath Insights
UiPath Community Berlin: Studio Tips & Tricks and UiPath Insights
UiPathCommunity
 
TrustArc Webinar: Mastering Privacy Contracting
TrustArc Webinar: Mastering Privacy ContractingTrustArc Webinar: Mastering Privacy Contracting
TrustArc Webinar: Mastering Privacy Contracting
TrustArc
 
Talk: On an adventure into the depths of Maven - Kaya Weers
Talk: On an adventure into the depths of Maven - Kaya WeersTalk: On an adventure into the depths of Maven - Kaya Weers
Talk: On an adventure into the depths of Maven - Kaya Weers
Kaya Weers
 
Security Operations and the Defense Analyst - Splunk Certificate
Security Operations and the Defense Analyst - Splunk CertificateSecurity Operations and the Defense Analyst - Splunk Certificate
Security Operations and the Defense Analyst - Splunk Certificate
VICTOR MAESTRE RAMIREZ
 
Content and eLearning Standards: Finding the Best Fit for Your-Training
Content and eLearning Standards: Finding the Best Fit for Your-TrainingContent and eLearning Standards: Finding the Best Fit for Your-Training
Content and eLearning Standards: Finding the Best Fit for Your-Training
Rustici Software
 
Fully Open-Source Private Clouds: Freedom, Security, and Control
Fully Open-Source Private Clouds: Freedom, Security, and ControlFully Open-Source Private Clouds: Freedom, Security, and Control
Fully Open-Source Private Clouds: Freedom, Security, and Control
ShapeBlue
 
Kubernetes Cloud Native Indonesia Meetup - May 2025
Kubernetes Cloud Native Indonesia Meetup - May 2025Kubernetes Cloud Native Indonesia Meetup - May 2025
Kubernetes Cloud Native Indonesia Meetup - May 2025
Prasta Maha
 
Droidal: AI Agents Revolutionizing Healthcare
Droidal: AI Agents Revolutionizing HealthcareDroidal: AI Agents Revolutionizing Healthcare
Droidal: AI Agents Revolutionizing Healthcare
Droidal LLC
 
STKI Israel Market Study 2025 final v1 version
STKI Israel Market Study 2025 final v1 versionSTKI Israel Market Study 2025 final v1 version
STKI Israel Market Study 2025 final v1 version
Dr. Jimmy Schwarzkopf
 
ECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptx
ECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptxECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptx
ECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptx
Jasper Oosterveld
 
Contributing to WordPress With & Without Code.pptx
Contributing to WordPress With & Without Code.pptxContributing to WordPress With & Without Code.pptx
Contributing to WordPress With & Without Code.pptx
Patrick Lumumba
 
Building Agents with LangGraph & Gemini
Building Agents with LangGraph &  GeminiBuilding Agents with LangGraph &  Gemini
Building Agents with LangGraph & Gemini
HusseinMalikMammadli
 

Apache Spark Introduction

  • 3. Agenda Spark overview Spark core RDD Spark Application Develop Spark Shell Zeppline Application
  • 6. Spark Overview Apache Spark is a fast and general-purpose cluster computing system Key Features: Fast Ease of Use General-purpose Scalable Fault tolerant Logistic regression in Hadoop and Spark
  • 10. Spark Overview Spark High Level Architecture Driver Program Cluster Management Worker Node Executor Task
  • 12. Spark Overview Install and startup Download http://spark.apache.org/downloads.html Start Master and Worker ./sbin/start-all.sh http://localhost:8080 Start History server ./sbin/start-history-server.sh hdfs://localhost:9000/spark/directory
  • 13. RDD Resilient Distributed Datase RDD represents a collection of partitioned data elements that can be operated on in parallel. It is the primary data abstraction mechanism in Spark. Partitioned Fault Tolerant Interface In Memory
  • 14. RDD Create RDD parallelize val xs = (1 to 10000).toList val rdd = sc.parallelize(xs) textFile val lines = sc.textFile("/input/README.md") val lines = sc.textFile("file:///RICH_HD/BigData_Tools/spark-1.6.2/README.md") HDFS - "hdfs://" Amazon S3 - "s3n://" Cassandra, HBase
  • 15. RDD
  • 16. RDD Transformation Creates a new RDD by performing a computation on the source RDD map val txtFile = sc.textFile("/input/README.md") val lengths = lines map { l => l.length} flatMap val words = lines flatMap { l => l.split(" ")} filter val longLines = lines filter { l => l.length > 80}
  • 17. RDD Action Return a value to a driver program first val numbersRdd = sc.parallelize(List(10, 5, 3, 1)) val firstElement = numbersRdd.first max numbersRdd.max reduce val sum = numbersRdd.reduce((x, y) => x + y) val product = numbersRdd.reduce((x, y) => x * y)
  • 18. RDD Filter log example: val logs = sc.textFile("path/to/log-files") val errorLogs = logs filter { l => l.contains("ERROR")} val warningLogs = logs filter { l => l.contains("WARN")} val errorCount = errorLogs.count val warningCount = warningLogs.count log RDD error RDD warn RDD count count
  • 19. RDD Caching Stores an RDD in the memory or storage When an application caches an RDD in memory, Spark stores it in the executor memory on each worker node. Each executor stores in memory the RDD partitions that it computes. cache persist MEMORY_ONLY DISK_ONLY MEMORY_AND_DISK
  • 20. RDD Cache example: val logs = sc.textFile("path/to/log-files") val errorsAndWarnings = logs filter { l => l.contains("ERROR") || l.contains("WARN")} errorsAndWarnings.cache() val errorLogs = errorsAndWarnings filter { l => l.contains("ERROR")} val warningLogs = errorsAndWarnings filter { l => l.contains("WARN")} val errorCount = errorLogs.count val warningCount = warningLogs.count
  • 23. Spark Application Develop WordCount val textFile = sc.textFile("/input/README.md") val wcData = textFile.flatMap(line => line.split(" ")) .map((_, 1)) .reduceByKey(_ + _) wcData.collect().foreach(println)
  • 24. 工商時間 Taiwan Hadoop User Group https://www.facebook.com/groups/hadoop.tw/ Taiwan Spark User Group https://www.facebook.com/groups/spark.tw/