Scaling Machine Learning Feature Engineering in Apache Spark at Facebook

Scaling ML Feature Engineering with Apache Spark at Facebook
Cheng Su & Sameer Agarwal
Facebook Inc.

About Us
▪ Sameer Agarwal
▪ Software Engineer at Facebook (Data Platform Team)
▪ Apache Spark Committer (Spark Core/SQL)
▪ Previously at Databricks and UC Berkeley
▪ Cheng Su
▪ Software Engineer at Facebook (Data Platform Team)
▪ Apache Spark Contributor (Spark SQL)
▪ Previously worked on Hive & Hadoop at Facebook

Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work

Machine Learning at Facebook¹
[Figure: ML pipeline with stages Data, Features, Training, and Inference, with Model and Predictions as outputs]
¹ Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018

This Talk
1. Data Layouts (Tables and Physical Encodings)
2. Feature Reaping
3. Feature Injection

Data Layouts (Tables and Physical Encodings)
Training Data Table
- Table to store data for ML training
- Huge volume (multiple PBs/day)
- Schema: userId: BIGINT, adId: BIGINT, features: MAP<INT, DOUBLE>, …
Feature Tables
- Tables to store all possible features (many of them aren’t promoted to the training data table)
- Smaller volume (low 100s of TBs/day)
- Schema: userId: BIGINT, …
- Example features: gender, likes, age, state, country, …

1. Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
[Figure: Feature Injection, from the Feature Tables (gender, likes, age, state, country, …) into the Training Data Table]

2. Feature Reaping: Removing unnecessary features (id and value) from training data. Think “deleting existing keys from a map”
[Figure: Feature Injection and Feature Reaping applied to the training data feature map]
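
The "keys in a map" intuition can be sketched in a few lines of Python. This is only an illustration on a single in-memory row; the function names and feature ids are hypothetical and are not taken from the talk.

```python
def inject_features(features, new_features):
    """Feature injection: return a copy of the map with new keys added."""
    merged = dict(features)
    merged.update(new_features)
    return merged


def reap_features(features, reaped_ids):
    """Feature reaping: return a copy of the map without the reaped keys."""
    reaped = set(reaped_ids)
    return {k: v for k, v in features.items() if k not in reaped}


# One training row: feature id -> value, mirroring MAP<INT, DOUBLE>.
row_features = {1: 0.5, 2: 3.0}
injected = inject_features(row_features, {7: 1.0})  # adds hypothetical feature 7
reaped = reap_features(injected, [2])               # drops hypothetical feature 2
```

Both helpers return new maps and leave the input untouched, which keeps the sketch easy to reason about; at PB scale the cost is dominated by how the map is physically stored, not by the map operation itself.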

Background: Apache ORC
▪ Stripe (Row Group)
▪ Rows are divided into multiple groups
▪ Stream
▪ Columns are stored separately
▪ PRESENT, DATA, LENGTH streams for each column
▪ Different encoding and compression strategy for each column

How is a Feature Map Stored in ORC?
▪ Key and value are stored as separate streams/columns
- Raw Data
- Row 1: (k1, v1)
- Row 2: (k1, v2), (k2, v3)
- Row 3: (k1, v5), (k2, v4)
- Streams
- Key stream: k1, k1, k2, k1, k2
- Value stream: v1, v2, v3, v5, v4
▪ Each stream is individually encoded and compressed
▪ Reading or deleting specific keys (i.e., feature reaping) becomes a problem
- Need to read (decompress and decode) and re-write ALL keys and values
[Figure: ORC column tree: STRUCT (col -1, node: 0) > MAP (col 0, node: 1) > INT keys (col 0, node: 2): k1, k1, k2, k1, k2 and DOUBLE values (col 0, node: 3): v1, v2, v3, v5, v4]
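
A simplified Python model of this layout shows why reaping is expensive. It is a sketch only: real ORC streams are encoded and compressed, and the function names here are hypothetical. The per-row entry counts stand in for the map column's LENGTH stream.

```python
def encode_map_column(rows):
    """Split a map column into per-row lengths, a key stream, and a value stream."""
    lengths, keys, values = [], [], []
    for row in rows:
        lengths.append(len(row))
        for key, value in row:
            keys.append(key)
            values.append(value)
    return lengths, keys, values


def reap_key(lengths, keys, values, reaped_key):
    """Drop one key. Every entry of every stream is read and rewritten."""
    new_lengths, new_keys, new_values = [], [], []
    pos = 0
    for n in lengths:
        kept = 0
        for i in range(pos, pos + n):
            if keys[i] != reaped_key:
                new_keys.append(keys[i])
                new_values.append(values[i])
                kept += 1
        new_lengths.append(kept)
        pos += n
    return new_lengths, new_keys, new_values


rows = [
    [("k1", "v1")],
    [("k1", "v2"), ("k2", "v3")],
    [("k1", "v5"), ("k2", "v4")],
]
lengths, keys, values = encode_map_column(rows)
reaped_streams = reap_key(lengths, keys, values, "k2")
```

Because the entries for k2 are interleaved with those for k1, removing k2 means walking and rewriting both streams in full.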

Introducing: ORC Flattened Map
▪ Values that correspond to each key are stored as separate streams
- Raw Data
- Row 1: (k1, v1)
- Row 2: (k1, v2), (k2, v3)
- Row 3: (k1, v5), (k2, v4)
- Streams
- k1 stream: v1, v2, v5
- k2 stream: NULL, v3, v4
- Stores map like a struct
▪ Each key’s value stream is individually encoded and compressed
▪ Reading or deleting specific keys becomes very efficient!
[Figure: flattened-map column tree: STRUCT (col -1, node: 0) > MAP (col 0, node: 1) > Value (k1) (col 0, node: 3, seq: 1): v1, v2, v5 and Value (k2): NULL, v3, v4]
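
The flattened layout can be modeled the same way. The sketch below is a simplified in-memory illustration with hypothetical function names, not the FB-ORC implementation; `None` stands in for NULL.

```python
def encode_flattened_map(rows):
    """Store one value stream per key; rows missing a key get None (NULL)."""
    keys = []
    for row in rows:
        for key in row:
            if key not in keys:
                keys.append(key)
    return {key: [row.get(key) for row in rows] for key in keys}


def reap_keys(streams, reaped_keys):
    """Drop whole streams; the remaining streams are reused untouched."""
    reaped = set(reaped_keys)
    return {key: stream for key, stream in streams.items() if key not in reaped}


rows = [
    {"k1": "v1"},
    {"k1": "v2", "k2": "v3"},
    {"k1": "v5", "k2": "v4"},
]
streams = encode_flattened_map(rows)
after_reaping = reap_keys(streams, ["k2"])
```

In this model reaping k2 never touches the k1 stream, which is the property that makes reading or deleting specific keys cheap.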

Feature Reaping
▪ Feature Reaping frameworks generate Spark SQL queries based on table name, partitions, and reaped feature ids
▪ For each reaping SQL query, Spark applies special customizations in the query planner, execution engine, and commit protocol
▪ Each Spark task launches a SQL transform process and uses a native C++ binary to perform efficient flat map operations
[Figure: each Spark Java executor runs a C++ reaper transform; files training_data_v1_1.orc and training_data_v1_2.orc map to training_data_v2_1.orc and training_data_v2_2.orc]

Performance
CPU cost (in CPU days) for flat map vs naïve solution*
▪ Case 1
▪ Input data size: 20PB
▪ # of reaped features: 200
▪ # of total features: ~1k
▪ Result: flat map is 14x better on 20PB data
▪ Case 2
▪ Input data size: 300PB
▪ # of reaped features: 200
▪ # of total features: ~10k
▪ Result: flat map is 89x better on 300PB data
*Naïve solution: a Spark SQL query that rewrites all data, removing the required features from the map column with a UDF/lambda.
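
For reference, a naïve query of this kind might look like the output of the sketch below. It assumes Spark 3.0 or later, where the `map_filter` higher-order function is available; the generator function, table name, column names, and partition filter are hypothetical and do not come from the talk.

```python
def naive_reaping_query(table, partition_filter, map_column, other_columns, reaped_ids):
    """Build a SELECT that keeps every map entry whose key is not reaped."""
    ids = ", ".join(str(i) for i in sorted(reaped_ids))
    filtered = f"map_filter({map_column}, (k, v) -> k NOT IN ({ids})) AS {map_column}"
    select_list = ", ".join(list(other_columns) + [filtered])
    return f"SELECT {select_list} FROM {table} WHERE {partition_filter}"


# Hypothetical table, partition, and feature ids, for illustration only.
query = naive_reaping_query(
    table="training_data",
    partition_filter="ds = '2020-10-28'",
    map_column="features",
    other_columns=["userId", "adId"],
    reaped_ids=[42, 7],
)
```

A query like this decodes and re-encodes every key and value of every row, which is the cost the flat map approach avoids.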

Feature Injection: Extending base features with new/experimental features to improve model performance
Requirements:
1. Allow fast ML training experimentation
2. Save storage space
[Figure: Feature Injection of features such as gender, likes, age, state, country, …]

Introducing: Aligned Table
▪ Intuition: Store the output of the join between the training table and the feature table in 2 separate row-by-row aligned tables
▪ An aligned table is a table that has the same layout as the original table
- Same number of files
- Same file names
- Same number of rows (and their order) in each file.
Example:
training table
- file_1.orc (id, features): (1, ...), (2, ...), (5, ...)
- file_2.orc (id, features): (3, ...), (4, ...), (6, ...)
feature table
- file_1.orc (id, feature): (1, f1), (2, f2), (4, f4), (6, f6)
aligned table
- file_1.orc (id, feature): (1, f1), (2, f2), (5, NULL)
- file_2.orc (id, feature): (3, NULL), (4, f4), (6, f6)

Query Plan for Aligned Table
Plan operators, from the scans to the final write:
▪ Scan (training table)
▪ Scan (feature table)
▪ Project (…, file_name, row_order)
▪ Join (LEFT OUTER)
▪ Shuffle (file_name)
▪ Sort (file_name, row_order)
▪ InsertIntoHadoopFsRelationCommand (Aligned Table)
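
The plan can be mimicked on in-memory data to see why the output stays aligned. The sketch below is an illustration only, not the Spark implementation: the function name is hypothetical, and it assumes the feature table has at most one row per id, since duplicate matches would add rows and break the row-by-row alignment.

```python
from collections import defaultdict


def build_aligned_table(training_files, feature_rows):
    """training_files: {file_name: [id, ...]}; feature_rows: {id: feature}."""
    # Scan + Project: tag each training row with its file name and row order.
    projected = [
        (file_name, row_order, row_id)
        for file_name, ids in training_files.items()
        for row_order, row_id in enumerate(ids)
    ]
    # LEFT OUTER join: unmatched training rows get None (NULL).
    joined = [(f, o, i, feature_rows.get(i)) for f, o, i in projected]
    # Shuffle: group rows by the training file they came from.
    partitions = defaultdict(list)
    for f, o, i, feature in joined:
        partitions[f].append((o, i, feature))
    # Sort by row order, then "write" one aligned file per training file.
    return {
        f: [(i, feature) for _, i, feature in sorted(rows, key=lambda r: r[0])]
        for f, rows in partitions.items()
    }


training_files = {"file_1.orc": [1, 2, 5], "file_2.orc": [3, 4, 6]}
feature_rows = {1: "f1", 2: "f2", 4: "f4", 6: "f6"}
aligned = build_aligned_table(training_files, feature_rows)
```

Carrying `file_name` and `row_order` through the join is what lets the final write reproduce the training table's file names, row counts, and row order.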

Reading Aligned Tables
▪ FB-ORC aligned table row-by-row merge reader
▪ Read each aligned table file with the corresponding original table file in one task
▪ Read row-by-row according to row order
▪ Merge aligned table columns per row with corresponding original table columns per row
[Figure: reader task 1 merges the first training table file (ids 1, 2, 5) with the first aligned table file (f1, f2, NULL); reader task 2 merges the second training table file (ids 3, 4, 6) with the second aligned table file (NULL, f4, f6)]
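
A row-by-row merge read can be sketched as a positional zip over one training file and its aligned file. This is a simplified in-memory model with a hypothetical function name and made-up feature values, not the FB-ORC reader.

```python
def read_merged(training_file, aligned_file):
    """Yield (id, features, feature) by pairing rows at the same position."""
    if len(training_file) != len(aligned_file):
        raise ValueError("aligned file must have the same number of rows")
    for (row_id, features), (aligned_id, feature) in zip(training_file, aligned_file):
        if row_id != aligned_id:
            raise ValueError("rows are not aligned")
        yield row_id, features, feature


# Reader task 1: first training file and its aligned file (values are made up).
training_file_1 = [(1, {10: 0.1}), (2, {10: 0.2}), (5, {10: 0.5})]
aligned_file_1 = [(1, "f1"), (2, "f2"), (5, None)]
merged = list(read_merged(training_file_1, aligned_file_1))
```

No join or lookup happens at read time; the reader relies entirely on both files having the same rows in the same order.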

End to End Performance
1. Baseline 1: Left Outer Join
▪ LEFT OUTER join that materializes new columns/sub-fields into training table
▪ Cons: Reads and overwrites ALL columns of training table every time

Aligned Tables vs Left Outer Join
Compute Savings: 15x
Storage Savings: 30x

2. Baseline 2: Lookup Hash Join
▪ Load feature table(s) into a distributed hash table (Laser¹)
▪ Lookup hash join while reading training table
▪ Cons:
▪ Adds an external dependency on a distributed hash table; impacts latency, reliability & efficiency
▪ Needs a lookup hash join each time the training table is read
¹ Laser: a distributed hash table service built on top of RocksDB, see https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details

Aligned Tables vs Lookup Hash Join
Compute Savings: 1.5x
Storage Savings: 2.1x

Future Work
▪ Better Spark SQL interface for ML primitives (e.g., UPSERTs)
▪ Onboarding more ML use cases to Spark
▪ Batch Inference
▪ Training
MERGE training_table
  PARTITION(ds='2020-10-28', pipeline='...', ts)
USING (
  SELECT ...) AS f
ON features[0][0] = f.key
WHEN MATCHED THEN UPDATE
  SET float_features = MAP_CONCAT(float_features, f.densefeatures)
Thank you!
