Apache Spark + Arrow

Takeshi Yamamuro
@ NTT Software Innovation Center

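What’s Spark?
• A general-purpose distributed and parallel processing framework that came out of the AMPLab at UC Berkeley and was released as open source in 2012
– The latest release is v2.4.4, and v3.0 is currently being prepared for release
• Its signature features are easy-to-use APIs, integration with external data sources, and advanced internal optimization
Figure source: Apache Spark 2015 Year In Review, https://bit.ly/2JaG0He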

Quick Start Guide for Spark
• Install the latest Spark (PySpark)
$ conda install -c conda-forge pyspark
• Run a quick test
$ cat test.csv
1,a,0.3
2,b,1.7
$ pyspark
>>> df = spark.read.csv('test.csv')
>>> df.show()

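Use Case: Distribute & Run Your Pandas Code
• Use the @pandas_udf annotation (v2.3+) to run code written with Pandas as distributed processing on Spark
– Reuse existing code as the data to process grows
– Factor complex logic out into Pandas code to make it easier to test
>>> from pyspark.sql.functions import pandas_udf
>>> @pandas_udf('double')
... def pandas_mean_diff(pdf):
...     return pdf - pdf.mean()
>>> sdf.select(pandas_mean_diff(sdf['v'])).show()
pdf: Pandas data (a pandas.Series in this scalar UDF)
sdf: Spark DataFrame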
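
As a minimal sketch of the testing point above, the Pandas logic can live in a plain function that is checked locally with pandas alone and only then wrapped with pandas_udf. The names mean_diff and build_mean_diff_udf are hypothetical, and the sketch assumes pandas is installed locally and that PySpark v2.3+ with PyArrow is available wherever the UDF actually runs. Note that a scalar Pandas UDF receives one pandas.Series per Arrow batch, so the mean is taken per batch rather than over the whole column.

```python
import pandas as pd


def mean_diff(v: pd.Series) -> pd.Series:
    """Subtract the mean of the values in this batch from each value."""
    return v - v.mean()


def build_mean_diff_udf():
    """Wrap mean_diff as a scalar Pandas UDF (hypothetical helper).

    PySpark is imported lazily so the Pandas logic above can be tested
    on a machine without Spark.
    """
    from pyspark.sql.functions import pandas_udf

    return pandas_udf(mean_diff, "double")


# Local check with plain pandas; no Spark cluster is involved.
local_result = mean_diff(pd.Series([1.0, 2.0, 3.0]))
print(local_result.tolist())
```

On a cluster, something like sdf.select(build_mean_diff_udf()(sdf['v'])).show() would then apply the same function to the Spark DataFrame.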

Use Case: Distribute & Run Your Pandas Code
• A more complex example: distribute inference with a machine learning model built in scikit-learn across Spark
>>> import pandas as pd
>>> from pyspark.sql.functions import pandas_udf
>>> from sklearn.tree import DecisionTreeClassifier
# Train a model with scikit-learn
>>> clf = DecisionTreeClassifier()
>>> clf.fit(X_train, y_train)
# Share the trained model across the Spark cluster
>>> broadcasted_clf = spark.sparkContext.broadcast(clf)
# Define a UDF that runs inference with the shared model
>>> @pandas_udf(returnType='int')
... def predict(*cols):
...     X = pd.concat(cols, axis=1)
...     predicted = broadcasted_clf.value.predict(X)
...     return pd.Series(predicted)
# Distribute inference on Spark with the UDF defined above
>>> df.select('y', predict(*X.columns).alias('predicted'))
Example Code: https://bit.ly/357C0kp

PySpark UDF Performance
• @pandas_udf is accelerated internally by Arrow
– @udf is the UDF annotation that was used up to v2.2
Benchmark Code: https://bit.ly/2LITA7c
UDF: (x, y) => x + y
@udf: 70.93
@pandas_udf: 19.92 (accelerated by Arrow)
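
To make the difference between the two annotations concrete, here is a small sketch of the same x + y logic written both ways. It is not the linked benchmark code; add_row, add_batch, and build_udfs are hypothetical names, and the sketch assumes pandas locally plus PySpark with PyArrow where the UDFs are actually registered. The point is only the call granularity: @udf invokes the function once per row, while @pandas_udf invokes it once per Arrow batch with whole columns.

```python
import pandas as pd


def add_row(x, y):
    """Row-at-a-time logic: called once per row with Python scalars."""
    return x + y


def add_batch(x: pd.Series, y: pd.Series) -> pd.Series:
    """Vectorized logic: called once per Arrow batch with whole columns."""
    return x + y


def build_udfs():
    """Wrap both functions as Spark UDFs (hypothetical helper, needs PySpark)."""
    from pyspark.sql.functions import pandas_udf, udf

    return udf(add_row, "double"), pandas_udf(add_batch, "double")


# Both styles compute the same values; only the call granularity differs.
xs, ys = pd.Series([1.0, 2.0, 3.0]), pd.Series([0.5, 0.5, 0.5])
row_results = [add_row(x, y) for x, y in zip(xs, ys)]
batch_results = add_batch(xs, ys).tolist()
print(row_results == batch_results)
```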

Spark + Arrow Internal
• Spark uses Arrow to make data exchange with external processes (Python, R, Shell) more efficient
[Figure: architecture diagram labeled User Interaction, Python/R, JVM (Driver), two JVM (Worker) nodes each paired with Python/R/Shell, and Data Sources]

Configurations for Arrow-Accelerated Processing
In v2.4.4 there are only three Arrow-related settings:
• spark.sql.execution.arrow.maxRecordsPerBatch (default: 10000)
– The number of rows included in one transfer when moving data through Arrow (SerDe); larger values increase memory usage
• spark.sql.execution.arrow.enabled (default: false)
– Enables Arrow optimization in toPandas/createDataFrame, which convert between Spark and Pandas DataFrames
• spark.sql.execution.arrow.fallback.enabled (default: true)
– Enables falling back to non-Arrow data transfer when Arrow cannot be used (e.g., for an unsupported type)
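
The settings above could be applied when building a session roughly as follows. This is a sketch under assumptions: the chosen values, the app name, and the helpers num_batches and build_session are illustrative only, and num_batches is a simplified model of a single partition rather than Spark's actual batching code.

```python
import math

# The three Arrow-related settings from the slide, with values chosen
# only for illustration.
ARROW_CONF = {
    "spark.sql.execution.arrow.enabled": "true",
    "spark.sql.execution.arrow.fallback.enabled": "true",
    "spark.sql.execution.arrow.maxRecordsPerBatch": "5000",
}


def num_batches(n_rows: int, max_records_per_batch: int) -> int:
    """Simplified count of Arrow batches needed for one partition."""
    return math.ceil(n_rows / max_records_per_batch)


def build_session():
    """Create a SparkSession with the settings above (needs PySpark)."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("arrow-conf-sketch")
    for key, value in ARROW_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Halving the batch size doubles the number of batches for the same rows,
# while each batch holds fewer rows in memory at once.
print(num_batches(20000, 10000), num_batches(20000, 5000))
```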

Current Development Status for v3.0
• SPARK-29376: Upgrade Arrow to v0.15.1
• SPARK-29493: Support Arrow MapType
– Types unsupported in v2.4.4 are MapType, ArrayType of TimestampType, and nested StructType; BinaryType is supported only with PyArrow v0.10.0+

Notice: PySpark + PyArrow v0.15.0+ Issue
• With this combination, @pandas_udf processing stops with an error
– Cause: the binary format used for IPC changed in v0.15.0
• Apache Arrow 0.15.0 Release: https://arrow.apache.org/blog/2019/10/06/0.15.0-release/
– Workaround: setting ARROW_PRE_0_15_IPC_FORMAT=1 switches back to the IPC binary format used through v0.14.0
# Run the pandas_mean_diff example shown earlier with PyArrow v0.15.1
>>> sdf.select(pandas_mean_diff(sdf['v'])).show()
java.lang.IllegalArgumentException
  at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
  at o.a.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
  at o.a.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
  at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
  ...
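
The following sketch illustrates, in simplified form, why an older reader can fail on the newer format. It assumes the v0.15.0+ framing puts a 0xFFFFFFFF continuation marker in front of the little-endian int32 metadata length, and it ignores padding and the message body. frame_new and legacy_read_length are hypothetical helpers, not Arrow APIs.

```python
import os
import struct

CONTINUATION = 0xFFFFFFFF  # marker that v0.15.0+ writes before the length


def frame_new(metadata: bytes) -> bytes:
    """Frame metadata the v0.15.0+ way: marker, int32 length, payload."""
    return struct.pack("<Ii", CONTINUATION, len(metadata)) + metadata


def legacy_read_length(stream: bytes) -> int:
    """Read the leading int32 as the length, as a pre-0.15 reader does."""
    (length,) = struct.unpack_from("<i", stream, 0)
    return length


framed = frame_new(b"schema-bytes")
# The old reader interprets the marker as a length of -1, the kind of value
# that ByteBuffer.allocate rejects with IllegalArgumentException.
print(legacy_read_length(framed))

# Workaround from the slide: ask PyArrow to write the old format. The
# variable must be visible to the Python worker processes, so in practice
# it is often set in conf/spark-env.sh rather than in application code.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```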

SparkR + Arrow
• From v3.0, Arrow-based optimization also applies to SparkR
– It is used the same way as in PySpark and makes data transfer more efficient
• See the following talk from Spark+AI Summit Europe (October 2019)
– Vectorized R Execution in Apache Spark, Hyukjin Kwon@Databricks, https://bit.ly/2LIbmaC

Script Transformation + Arrow
• Script Transformation is an extension that runs arbitrary script processing from SQL syntax through an external process
• At Spark+AI Summit Europe (October 2019), Facebook presented its own improvements to data transfer
– Powering Custom Apps at Facebook using Spark Script Transformation, Abdulrahman Alfozan@Facebook, https://bit.ly/2q5OH0w
Figure quoted from p8
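
For readers unfamiliar with Script Transformation, the external script typically reads rows from standard input and writes rows to standard output, with tab-separated fields in the default text format. The sketch below assumes that default format and a two-column input (id, name); transform and main are hypothetical names, and this is not the Facebook implementation.

```python
import io
import sys


def transform(lines, out):
    """Read tab-separated rows and write (id, upper-cased name) rows."""
    for line in lines:
        row_id, name = line.rstrip("\n").split("\t")
        out.write(f"{row_id}\t{name.upper()}\n")


def main():
    """Entry point when the file is run as the external script."""
    transform(sys.stdin, sys.stdout)


# Local check that feeds two rows through in-memory streams.
buffer = io.StringIO()
transform(["1\ta\n", "2\tb\n"], buffer)
print(buffer.getvalue(), end="")
```

If this were saved as upper.py with main() called at the bottom, a query along the lines of SELECT TRANSFORM(id, name) USING 'python upper.py' AS (id, name) FROM t could invoke it, where upper.py and the table t are placeholders.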

[Figure annotation: the script is executed as an external process]

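• Making I/O between Spark and external processes more efficient
– Users can specify their own custom I/O handling from SQL syntax
• Text format in development for easier debugging, binary format in production for speed
– Currently Spark's internal representation (UnsafeRow) is used as-is; more efficient binary formats such as Arrow are under consideration
– For simple types such as Int/Double, the binary format is up to about 4x faster than the text format
• The performance gap widens for complex types such as Map/Array
Quoted from slide p8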
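
As a rough illustration of the text-versus-binary trade-off, the sketch below encodes one (Int, Double) row both ways. The fixed layout in ROW_FORMAT is an assumption made for this example; it is neither UnsafeRow nor the format from the Facebook talk, and the sketch makes no claim about the measured speedup.

```python
import struct

ROW_FORMAT = "<id"  # little-endian int32 followed by float64 (assumed layout)


def encode_text(row):
    """Text format: readable while debugging, but every field must be parsed."""
    return ("%d\t%r\n" % row).encode("utf-8")


def decode_text(data):
    """Split the line and convert each field back from its string form."""
    ident, value = data.decode("utf-8").rstrip("\n").split("\t")
    return int(ident), float(value)


def encode_binary(row):
    """Binary format: fixed width, decoded without any string parsing."""
    return struct.pack(ROW_FORMAT, *row)


def decode_binary(data):
    """Unpack both fields directly from their fixed byte positions."""
    return struct.unpack(ROW_FORMAT, data)


row = (2, 1.7)
print(encode_text(row), len(encode_binary(row)))
```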

Wrap-up
• Spark + Arrow: more efficient data transfer with external processes
– This talk focused on Pandas UDFs in PySpark
– In v3.0 the same optimization also applies to SparkR
• Announcement: Apache Spark 3.0 Preview Released: https://bit.ly/2KavYHV
