Building a Unified Data Pipeline in Spark 
Aaron Davidson 
Slides adapted from Matei Zaharia 
spark.apache.org 
A unified data pipeline built with Spark
What is Apache Spark? 
Fast and general cluster computing system 
interoperable with Hadoop 
Improves efficiency through: 
»In-memory computing primitives 
»General computation graphs 
Improves usability through: 
»Rich APIs in Java, Scala, Python 
»Interactive shell 
Up to 100× faster (2-10× on disk) 
2-5× less code 
A Hadoop-compatible cluster computing system that improves performance and usability
Project History 
Started at UC Berkeley in 2009, open 
sourced in 2010 
50+ companies now contributing 
»Databricks, Yahoo!, Intel, Cloudera, IBM, … 
Most active project in Hadoop ecosystem 
Born at UC Berkeley; 50+ companies now contribute to the open-source project
A General Stack 
Spark core 
»Spark Streaming (real-time) 
»Spark SQL (structured data) 
»GraphX (graph processing) 
»MLlib (machine learning) 
»… 
Structured queries, real-time analytics, graph processing, and machine learning
This Talk 
Spark introduction & use cases 
Modules built on Spark 
The power of unification 
Demo 
Spark introduction and use cases
Why a New Programming Model? 
MapReduce greatly simplified big data 
analysis 
But once started, users wanted more: 
»More complex, multi-pass analytics (e.g. ML, 
graph) 
»More interactive ad-hoc queries 
»More real-time stream processing 
All 3 need faster data sharing in parallel apps 
What users want after MapReduce: 
more complex analytics, interactive queries, and real-time processing
Data Sharing in MapReduce 
[Diagram: each iteration and each ad-hoc query reads its input from HDFS and writes its result back to HDFS] 
Slow due to replication, serialization, and disk IO 
Data sharing in MapReduce is slow because of disk I/O
What We’d Like 
[Diagram: iterations and queries share data through distributed memory, with one-time processing of the input] 
10-100× faster than network and disk 
We want data sharing to be roughly 10-100× faster than going through the network or disk
Spark Model 
Write programs in terms of transformations 
on distributed datasets 
Resilient Distributed Datasets (RDDs) 
»Collections of objects that can be stored in 
memory or disk across a cluster 
»Built via parallel transformations (map, filter, …) 
»Automatically rebuilt on failure 
Self-healing distributed datasets (RDDs) 
RDDs are transformed in parallel with methods such as map and filter
Example: Log Mining 
Load error messages from a log into memory, 
then interactively search for various patterns 
lines = spark.textFile("hdfs://...")                     # base RDD 
errors = lines.filter(lambda s: s.startswith("ERROR"))   # transformed RDD 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 

messages.filter(lambda s: "foo" in s).count()            # action 
messages.filter(lambda s: "bar" in s).count() 
. . . 

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) 
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data) 
[Diagram: the driver ships tasks to workers, which cache their blocks of the data and send back results] 
Interactively search with various patterns; processing 1 TB drops from 170 s to 5-7 s
Fault Tolerance 
RDDs track lineage info to rebuild lost data 
file.map(lambda rec: (rec.type, 1)) 
.reduceByKey(lambda x, y: x + y) 
.filter(lambda (type, count): count > 10) 
[Diagram: input file → map → reduce → filter lineage] 
Track lineage information to rebuild lost data
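As a small illustration (not from the original slides), toDebugString prints the lineage an RDD carries, i.e. the chain of transformations Spark would replay to rebuild a lost partition; the input path below is hypothetical: 

// Sketch of inspecting RDD lineage in the Scala API; the path is hypothetical. 
val file = sc.textFile("hdfs://.../events.log") 
val counts = file 
  .map(line => (line.split(" ")(0), 1))        // (type, 1) pairs 
  .reduceByKey(_ + _)                          // count per type 
  .filter { case (_, count) => count > 10 } 

// Prints the lineage graph: the parent RDDs Spark would recompute 
// if a cached or shuffled partition were lost. 
println(counts.toDebugString)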
Example: Logistic Regression 
[Chart: running time (s) vs. number of iterations; Hadoop takes ~110 s per iteration, while Spark takes ~80 s for the first iteration and ~1 s for each further iteration] 
Logistic regression
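The workload behind this chart is the classic iterative one: logistic regression trained by gradient descent over a dataset that is loaded once and cached. A simplified Scala sketch of that pattern follows (the input path, feature count, label encoding, and unit step size are assumptions, not the exact benchmark code): 

// Sketch of iterative logistic regression on a cached RDD (illustrative only). 
// Assumes `sc` exists and each line is "label feature1 ... featureD" with label in {-1, +1}. 
case class Point(x: Array[Double], y: Double) 

val D = 10 
val iterations = 20 

val points = sc.textFile("hdfs://.../points.txt").map { line => 
  val parts = line.split(" ").map(_.toDouble) 
  Point(parts.tail, parts.head) 
}.cache()   // the key difference from MapReduce: the data stays in memory 

var w = Array.fill(D)(0.0) 
for (_ <- 1 to iterations) { 
  // Each pass reads the cached RDD instead of re-reading HDFS. 
  val gradient = points.map { p => 
    val dot   = (w, p.x).zipped.map(_ * _).sum 
    val scale = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y 
    p.x.map(_ * scale) 
  }.reduce((a, b) => (a, b).zipped.map(_ + _)) 
  w = (w, gradient).zipped.map(_ - _)           // unit step size for brevity 
}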
Behavior with Less RAM 
[Chart: iteration time (s) vs. % of working set in memory: 68.8 s with the cache disabled, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, and 11.5 s fully cached] 
Behavior as the cache shrinks
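When the working set does not fully fit in RAM, the chosen storage level decides what happens to the overflow. A minimal sketch (hypothetical path, not from the slides): the default cache() / MEMORY_ONLY drops partitions that don't fit and recomputes them from lineage, while MEMORY_AND_DISK spills them to local disk instead: 

import org.apache.spark.storage.StorageLevel 

// Keep as much of the working set in memory as fits; spill the rest to local disk. 
val messages = sc.textFile("hdfs://.../logs")     // hypothetical input 
  .filter(_.startsWith("ERROR")) 
messages.persist(StorageLevel.MEMORY_AND_DISK) 
messages.count()   // the first action materializes the cache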
Spark in Scala and Java 
// Scala: 
val lines = sc.textFile(...) 
lines.filter(s => s.contains("ERROR")).count() 

// Java: 
JavaRDD<String> lines = sc.textFile(...); 
lines.filter(new Function<String, Boolean>() { 
  Boolean call(String s) { 
    return s.contains("ERROR"); 
  } 
}).count();
Spark in Scala and Java 
// Scala: 
val lines = sc.textFile(...) 
lines.filter(s => s.contains("ERROR")).count() 

// Java 8: 
JavaRDD<String> lines = sc.textFile(...); 
lines.filter(s -> s.contains("ERROR")).count();
Supported Operators 
map 
filter 
groupBy 
sort 
union 
join 
leftOuterJoin 
rightOuterJoin 
reduce 
count 
fold 
reduceByKey 
groupByKey 
cogroup 
cross 
zip 
sample 
take 
first 
partitionBy 
mapWith 
pipe 
save 
...
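A small sketch (hypothetical data) of how a few of these operators compose in the Scala API: 

// reduceByKey aggregates values per key; leftOuterJoin keeps keys with no match. 
val views  = sc.parallelize(Seq(("page1", 3), ("page2", 5), ("page1", 2))) 
val titles = sc.parallelize(Seq(("page1", "Home"), ("page3", "About"))) 

val viewCounts = views.reduceByKey(_ + _)           // ("page1", 5), ("page2", 5) 
val joined     = viewCounts.leftOuterJoin(titles)   // pages without a title get None 
joined.collect().foreach(println) 
// e.g. (page1,(5,Some(Home))), (page2,(5,None))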
Spark Community 
250+ developers, 50+ companies contributing 
Most active open source project in big data 
[Chart: commits over the past 6 months; Spark ahead of MapReduce, YARN, HDFS, and Storm] 
The most active OSS project in the big data space
Continuing Growth 
[Chart: contributors per month to Spark (source: ohloh.net)] 
The number of contributors keeps growing
Get Started 
Visit spark.apache.org for docs & tutorials 
Easy to run on just your laptop 
Free training materials: spark-summit.org 
You can get started with just a laptop
This Talk 
Spark introduction & use cases 
Modules built on Spark 
The power of unification 
Demo 
Modules built on Spark
The Spark Stack 
Spark core 
»Spark Streaming (real-time) 
»Spark SQL (structured data) 
»GraphX (graph processing) 
»MLlib (machine learning) 
»… 
The Spark stack
Spark SQL 
Evolution of the Shark project 
Allows querying structured data in Spark 
From Hive: 
c = HiveContext(sc) 
rows = c.sql("select text, year from hivetable") 
rows.filter(lambda r: r.year > 2013).collect() 
From JSON (tweets.json): 
{"text": "hi", 
 "user": { 
   "name": "matei", 
   "id": 123 
 }} 
c.jsonFile("tweets.json").registerAsTable("tweets") 
c.sql("select text, user.name from tweets") 
The successor to Shark; queries structured data in Spark
Spark SQL 
Integrates closely with Spark’s language APIs 
c.registerFunction("hasSpark", lambda text: "Spark" in text) 
c.sql("select * from tweets where hasSpark(text)") 

Uniform interface for data access 
[Diagram: the Python, Scala, Java, and SQL interfaces on top; Hive, Parquet, JSON, Cassandra, and other sources underneath] 
Integrates with Spark's language APIs and provides a uniform interface over many data sources
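A sketch of that uniform-access idea in the Spark 1.0-era Scala API (paths and schemas are hypothetical, so treat this as illustrative rather than exact): one SQLContext loads Parquet and JSON data and exposes both to the same SQL interface. 

import org.apache.spark.sql.SQLContext 

val sqlCtx = new SQLContext(sc) 

// Register a Parquet file and a JSON file as tables side by side. 
sqlCtx.parquetFile("hdfs://.../events.parquet").registerAsTable("events") 
sqlCtx.jsonFile("hdfs://.../tweets.json").registerAsTable("tweets") 

sqlCtx.sql("SELECT COUNT(*) FROM events").collect().foreach(println) 
sqlCtx.sql("SELECT text, user.name FROM tweets").collect().foreach(println)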
Spark Streaming 
Stateful, fault-tolerant stream processing 
with the same API as batch jobs 
sc.twitterStream(...) 
  .map(tweet => (tweet.language, 1)) 
  .reduceByWindow("5s", _ + _) 
[Chart: throughput comparison between Storm and Spark Streaming] 
Stateful, fault-tolerant stream processing with the same API as batch jobs
MLlib 
Built-in library of machine learning 
algorithms 
»K-means clustering 
»Alternating least squares 
»Generalized linear models (with L1 / L2 reg.) 
»SVD and PCA 
»Naïve Bayes 
points = sc.textFile(...).map(parsePoint) 
model = KMeans.train(points, 10) 
A built-in machine learning library
This Talk 
Spark introduction & use cases 
Modules built on Spark 
The power of unification 
Demo 
The power of a unified stack
Big Data Systems Today 
General batch processing: MapReduce 
Specialized systems (iterative, interactive, and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, … 
Today: a proliferation of specialized big data systems
Spark’s Approach 
Instead of specializing, generalize MapReduce to support new apps in the same engine 
Two changes (general task DAG & data 
sharing) are enough to express previous 
models! 
Unification has big benefits 
»For the engine 
»For users 
[Diagram: Spark Streaming, GraphX, Shark, MLbase, and more all built on one Spark engine] 
Spark's approach: don't specialize; support new apps on a single general-purpose engine
What it Means for Users 
Separate frameworks: 
[Diagram: ETL, train, and query each run in a different framework, with an HDFS read and an HDFS write around every step] 
Spark: 
[Diagram: one HDFS read, then ETL, train, and query chained in memory, plus interactive analysis] 
Everything runs on Spark, and you get interactive analysis on top
Combining Processing Types 
// Load data using SQL 
val points = ctx.sql( 
  "select latitude, longitude from historic_tweets") 

// Train a machine learning model 
val model = KMeans.train(points, 10) 

// Apply it to a stream 
sc.twitterStream(...) 
  .map(t => (model.closestCenter(t.location), 1)) 
  .reduceByWindow("5s", _ + _) 
Combine different processing types: SQL, machine learning, and applying models to streams
This Talk 
Spark introduction & use cases 
Modules built on Spark 
The power of unification 
Demo 
Demo
The Plan 
Raw JSON Tweets → SQL → Machine Learning → Streaming 
Read raw JSON from HDFS, extract tweet text with Spark SQL, featurize it and train a k-means model, then cluster the live tweet stream with the trained model
Demo!
Summary: What We Did 
Raw JSON → SQL → Machine Learning → Streaming 
- Read raw JSON from HDFS 
- Extract tweet text with Spark SQL 
- Extract feature vectors and train a model with k-means 
- Cluster the tweet stream with the trained model
import org.apache.spark.sql._ 
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} 
import org.apache.spark.mllib.linalg.Vector 

val ctx = new org.apache.spark.sql.SQLContext(sc) 
val tweets = sc.textFile("hdfs:/twitter") 
val tweetTable = JsonTable.fromRDD(ctx, tweets, Some(0.1)) 
tweetTable.registerAsTable("tweetTable") 

ctx.sql("SELECT text FROM tweetTable LIMIT 5").collect.foreach(println) 
ctx.sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable " + 
  "GROUP BY lang ORDER BY cnt DESC LIMIT 10").collect.foreach(println) 

val texts = ctx.sql("SELECT text FROM tweetTable").map(_.head.toString) 
def featurize(str: String): Vector = { ... } 
val vectors = texts.map(featurize).cache() 
val model = KMeans.train(vectors, 10, 10) 
sc.makeRDD(model.clusterCenters, 10).saveAsObjectFile("hdfs:/model") 

// Streaming (a separate program that loads the saved model) 
import org.apache.spark.SparkConf 
import org.apache.spark.streaming.{Seconds, StreamingContext} 
import org.apache.spark.streaming.twitter.TwitterUtils 

val ssc = new StreamingContext(new SparkConf(), Seconds(1)) 
val modelFile = "hdfs:/model" 
val model = new KMeansModel( 
  ssc.sparkContext.objectFile[Vector](modelFile).collect()) 

val tweets = TwitterUtils.createStream(ssc, /* auth */) 
val statuses = tweets.map(_.getText) 
val filteredTweets = statuses.filter { 
  t => model.predict(featurize(t)) == clusterNumber  // clusterNumber: the cluster chosen during the demo 
} 
filteredTweets.print() 
ssc.start()
Conclusion 
Big data analytics is evolving to include: 
»More complex analytics (e.g. machine learning) 
»More interactive ad-hoc queries 
»More real-time stream processing 
Spark is a fast platform that unifies these 
apps 
Learn more: spark.apache.org 
Big data analytics is evolving to be more complex, more interactive, and more real-time 
Spark is a fast platform that unifies these applications

Editor's Notes

  • #4: TODO: Apache incubator logo
  • #8: Each iteration is, for example, a MapReduce job
  • #11: Add “variables” to the “functions” in functional programming
  • #14: 100 GB of data on 50 m1.xlarge EC2 machines
  • #19: Alibaba, Tencent. At Berkeley, we have been working on a solution since 2009: a software stack for data analytics called the Berkeley Data Analytics Stack, whose centerpiece is Spark. Spark has seen significant adoption, with hundreds of companies using it, around sixteen of which have contributed code back. In addition, Spark has been deployed on clusters that exceed 1,000 nodes.
  • #20: Despite Hadoop having been around for 7 years, the Spark community is still growing; to us this shows that there’s still a huge gap in making big data easy to use and contributors are excited about Spark’s approach here