Productionalizing Spark ML
https://github.com/shashankgowdal/productionalising_spark_ml
Stories from the Spark battlefield
● Shashank L
● Senior Software engineer at Tellius
● Big data consultant and trainer at
datamantra.io
● www.shashankgowda.com
Stages of ML
● Gathering Data
● Data preparation
● Choosing a Model
● Training
● Evaluation
● Operationalise
Motivation
● Though Spark ML is an end-to-end solution for distributed
ML, not everything is done by the framework
● Custom data preparation techniques may be needed
depending on the quality of the data
● Efficient resource utilization when running at scale
● Operationalising the trained models for use
● Best practices
Introduction to Spark ML
Introduction to Spark ML
● Provides higher-level API for construction and tuning of
ML workflows
● Built on top of Dataset
● Abstractions
○ Transformer
○ Estimator
○ Evaluator
○ Pipeline
Transformer
● A Transformer is an abstraction which transforms a
DataFrame into another DataFrame.
transform(dataset: DataFrame): DataFrame
● Prepares the DataFrame for an ML algorithm to work with
● Typically contains logic which works with a single row of
data
DF → Transformer → DF
Vector assembler
● A feature transformer that merges multiple columns into
a vector as a new column.
● Algorithm stages like LogisticRegression require a
vector as input, which is a collection of feature values
with which the algorithm is trained
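A minimal sketch of assembling features; the column names and the DataFrame df are illustrative, not from the deck:

import org.apache.spark.ml.feature.VectorAssembler

// Merge hypothetical numeric columns into a single "features" vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income", "score"))
  .setOutputCol("features")
val assembled = assembler.transform(df)   // adds the "features" column to df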
Estimator
● An Estimator is an abstraction of a learning algorithm
that fits a model on a dataset.
fit(dataset: DataFrame): M
● An Estimator is run only in the training step
● The model returned is a Transformer
DF → Estimator → Model
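A minimal sketch of the Estimator-to-Model flow, assuming a LogisticRegression stage and hypothetical trainDf/testDf DataFrames:

import org.apache.spark.ml.classification.LogisticRegression

// Estimator: fit() runs the training iterations and returns a model
val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
val lrModel = lr.fit(trainDf)               // run only during training
val predictions = lrModel.transform(testDf) // the returned model is a Transformer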
String Indexer
● Encodes a set of String values to their indices
● Label indices are stored in the StringIndexerModel
● Transforming a dataset through this model adds an
output column containing those indices
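A sketch of indexing a hypothetical categorical column (df and the column name are illustrative):

import org.apache.spark.ml.feature.StringIndexer

// fit() scans the column and stores the label-to-index mapping in the model
val indexer = new StringIndexer()
  .setInputCol("country")
  .setOutputCol("country_index")
val indexerModel = indexer.fit(df)
val indexed = indexerModel.transform(df)   // adds the "country_index" column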
Pipeline
● Chain of Transformers and Estimators
● Pipeline itself is an Estimator
● It is fitted on a DataFrame turning it into a model called
PipelineModel
● PipelineModel can contain only Transformers
● The Pipeline is fitted on the train dataset; the test
dataset is transformed by the resulting PipelineModel
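Chaining the illustrative stages sketched on the previous slides (indexer, assembler, lr; trainDf/testDf are hypothetical):

import org.apache.spark.ml.Pipeline

// The Pipeline is an Estimator; fitting it returns a PipelineModel of Transformers
val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
val pipelineModel = pipeline.fit(trainDf)    // fit on the train dataset
val scored = pipelineModel.transform(testDf) // transform the test dataset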
What is missing?
Data Cleanup
Null values
● Data is rarely clean and can have missing values
● Important to identify and handle them
● Spark ML doesn’t handle NULLs gracefully; it’s
mandatory to handle them before training or using any
Spark ML pipeline stages
● Domain expertise is necessary to decide on how to
handle missing values
Custom Spark ML stage
● Handling Nulls should be a part of Spark ML pipeline
● Spark ML has APIs to create a custom Transformer
● Implementation
○ transform
○ transformSchema
com.shashank.sparkml.datapreparation.NullHandlerTransformer
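The repo class above holds the real implementation; as a rough sketch of the required API surface only, a constant-fill null handler could look like this (class name and fill values are illustrative):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class SimpleNullHandler(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("simpleNullHandler"))

  // Replace nulls with constants; no state is learned from the data
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.na.fill(0.0).na.fill("missing")

  // No columns are added or removed, so the schema is unchanged
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): SimpleNullHandler = defaultCopy(extra)
}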
Null Handler Transformer - Cons
● Null handling may involve aggregating over the train data
and storing state
○ Calculating the mean
○ Smart handling based on % of null values
● A Transformer runs those aggregations on the test set
as well
● Prediction will be slower
● Prediction accuracy also depends on the data in the test
set
Null Handler Estimator
● The Null Handler Estimator fits the train data to produce a
NullHandlerModel, which is a Transformer
● Similar abstraction to that of other algorithm training
● Implementation
○ fit
○ transformSchema
● NullHandler Model
○ transform
○ transformSchema
com.shashank.sparkml.datapreparation.NullHandlerEstimator
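A rough sketch of the Estimator/Model split, assuming a mean-fill strategy on Double columns only (class names are illustrative, not the repo implementation):

import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.{DoubleType, StructType}

class MeanNullHandler(override val uid: String) extends Estimator[MeanNullHandlerModel] {
  def this() = this(Identifiable.randomUID("meanNullHandler"))

  // Aggregate once over the train data; the result is carried by the model
  override def fit(dataset: Dataset[_]): MeanNullHandlerModel = {
    val doubleCols = dataset.schema.fields.filter(_.dataType == DoubleType).map(_.name)
    val means = doubleCols.map { c =>
      val row = dataset.selectExpr(s"avg(`$c`)").first()
      c -> (if (row.isNullAt(0)) 0.0 else row.getDouble(0))
    }.toMap
    new MeanNullHandlerModel(uid, means)
  }

  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): MeanNullHandler = defaultCopy(extra)
}

class MeanNullHandlerModel(override val uid: String, means: Map[String, Double])
    extends Model[MeanNullHandlerModel] {
  // Prediction time only applies the precomputed means, no aggregation
  override def transform(dataset: Dataset[_]): DataFrame = dataset.na.fill(means)
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): MeanNullHandlerModel =
    new MeanNullHandlerModel(uid, means)
}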
NA Values
● All missing values may not be nulls
● Missing values can also be encoded as
○ "null" as a String
○ NA
○ Empty String
○ Custom values
● Convert these values to null and use NullHandler to
handle them
● Can be implemented as a Transformer
com.shashank.sparkml.datapreparation.NaValuesHandler
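One possible shape for such a stage's transform logic (the helper name and marker list are illustrative, not the repo code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType

// Rewrite NA-style markers in every String column to real nulls,
// so the NullHandler stage can treat all missing values uniformly
def naToNull(df: DataFrame, naValues: Seq[String] = Seq("NA", "null", "")): DataFrame =
  df.schema.fields.filter(_.dataType == StringType).foldLeft(df) { (current, field) =>
    current.withColumn(field.name,
      when(col(field.name).isin(naValues: _*), null).otherwise(col(field.name)))
  }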
Cast Transformer
● ML is all about mathematics and numerical values
● The Double data type is widely used for representing
features and labels
● Spark ML expects DoubleType in a few APIs and
NumericType in most APIs
● Casting columns as a part of the Pipeline solves
DataType mismatch problems
● Cast can be a Transformer
com.shashank.sparkml.datapreparation.CastTransformer
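The core of such a stage is a plain cast, applied here directly on a hypothetical column for illustration:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast a string/integer column to DoubleType before it reaches ML stages
val casted = df.withColumn("age", col("age").cast(DoubleType))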
Building Pipeline
● Use custom stages with built-in stages to build a Pipeline
● Categorical Columns
○ NaValuesHandler
○ NullHandler
○ StringIndexer
○ OneHotEncoder
● Continuous Columns
○ NullHandler
● VectorAssembler
● AlgorithmStage
com.shashank.sparkml.datapreparation.BuildingPipeline
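A sketch of the wiring with built-in stages only, using the Spark 2.x-era OneHotEncoder API this deck targets; the custom NaValuesHandler / NullHandler stages from the repo would be prepended the same way, since they are also pipeline stages. Column names and trainDf are illustrative:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Categorical column: index then one-hot encode; continuous columns go straight in
val countryIndexer = new StringIndexer().setInputCol("country").setOutputCol("country_idx")
val countryEncoder = new OneHotEncoder().setInputCol("country_idx").setOutputCol("country_vec")
val featureAssembler = new VectorAssembler()
  .setInputCols(Array("country_vec", "age", "income"))
  .setOutputCol("features")
val dt = new DecisionTreeClassifier().setFeaturesCol("features").setLabelCol("label")

val trainingPipeline = new Pipeline()
  .setStages(Array(countryIndexer, countryEncoder, featureAssembler, dt))
val trainedModel = trainingPipeline.fit(trainDf)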
Efficient?
Iterative programming in Spark
● Spark is one of the first big data frameworks to have
great support for iterative programming natively
● Iterative programs go over the data again and again to
compute some results
● Spark ML is one of the iterative frameworks in Spark
Growing Logical plan
● Every iteration creates a new dataset, which keeps the
logical plan growing
● An ML Transformer can have one or more iterations in it
● As more stages are added, the logical plan grows, adding
overhead to analysing the plan
● This overhead is compute bound and incurred on the
driver
com.shashank.sparkml.datapreparation.GrowingLineageIssue
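A quick, illustrative way to observe the issue (df is a hypothetical DataFrame): apply one narrow transformation per column and inspect the plan.

import org.apache.spark.sql.functions.col

// One withColumn per column keeps extending the logical plan;
// plan analysis happens on the driver and grows with the number of stages
val grown = df.columns.foldLeft(df) { (current, c) =>
  current.withColumn(c, col(c).cast("double"))
}
grown.explain(true)   // the analysed/optimised plans grow with the column count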
Multi Column handling
● Reducing the number of stages in a Pipeline can reduce
iterations on the dataset
● Pipeline stages should have the ability to handle multiple
columns instead of one stage per column
○ Handle nulls in all columns in a single stage
○ Replace NA values in all columns in a single stage
● Improves the plan processing performance drastically,
even for datasets with many columns
com.shashank.sparkml.datapreparation.MultiColumnNullHandler
com.shashank.sparkml.datapreparation.GrowingLineageIssueFixed
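The multi-column idea in a single select, assuming all columns are numeric and a constant fill (illustrative, not the repo implementation):

import org.apache.spark.sql.functions.{coalesce, col, lit}

// One pass, one projection: every column handled in a single stage
val filled = df.select(df.columns.map { c =>
  coalesce(col(c), lit(0.0)).as(c)
}: _*)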
Training
Data sampling
● ML makes data-driven predictions by building a
mathematical model from input data
● To avoid overfitting the model to the input data, the data is
normally split into train and test samples
● Train data is used for learning and test data to verify
model accuracy
● Normally the data is divided into 2 samples using random
sampling without overlapping rows
data.randomSplit(Array(0.6, 0.4))
Caching source data
● ML modelling is an iterative process
● ML Training or preprocessing goes over the data
multiple times
● Spark transformations being lazily evaluated, every pass
over the data reads it again from the source
● Caching the source dataset speeds up the ML
modelling process
Caching source data
● Sampling and Caching the data is necessary in terms of
accuracy and performance
● Normally the data is cached first and then sampled, which
takes a hit on performance
● randomSplit on the data requires sorting the complete
data to avoid overlapping rows
● Cached data is sorted on every pass on the data
com.shashank.sparkml.caching.PipelineWithSampling
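A sketch of caching the training sample rather than the full source, so the iterative fit neither re-reads the source nor re-sorts the cached data on every pass (weights, seed, and the pipeline variable are illustrative):

// Split first, cache the training sample, then run the iterative fit
val Array(train, test) = data.randomSplit(Array(0.6, 0.4), seed = 42L)
train.cache()
val fittedModel = pipeline.fit(train)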
Caching source vs sample data
Caching only required columns
● Caching the source data speeds up the processing
● Normally a model may not be trained on all the columns in
the dataset
● Consider a scenario where only 10 out of 100 columns in
the data are used for training
● Being smart about caching gives efficient memory
utilization
● Cache only the columns which are used for training
com.shashank.sparkml.caching.CachingRequiredColumns
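Sketch (column names and the data DataFrame are illustrative):

import org.apache.spark.sql.functions.col

// Cache only the columns that actually feed the pipeline
val trainingCols = Seq("age", "income", "country", "label")
val slim = data.select(trainingCols.map(col): _*).cache()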
Spark caching behaviour
● Spark uses memory for 2 purposes - caching and
processing
● Earlier versions had definite limits for both
● There is a possibility that caching data close to the size of
the available memory slows down processing
● Sometimes processing may have to flush cached data to
disk to free up space
● This can happen in a repeated loop if caching and
processing are done by the same Spark job
Tree Based classifier memory issue
● Tree based classifiers cache intermediate tree data
using storage level MEMORY_AND_DISK
● The data size cached is normally 3 times the source
data size (source data being a csv)
● Training a DecisionTree classifier on 20GB data has a
requirement of 60 to 80GB RAM which is impractical
● No config to disable cache or control the storage level
Adding config to Tree based classifier
● We added a new configuration parameter for Tree
based classifiers to control the storage level
decisionTreeClassifier.setIntermediateStorageLevel("DISK_ONLY")
● https://github.com/apache/spark/pull/17972
● Changes may land in Spark 2.3.0
"org.apache.spark" %% "spark-mllib" % "2.2.0_mod" from "url/to/jar/spark-mllib_2.11-2.2.0.jar",
Operationalise
Model persistence
● Built-in stages of Spark ML support model persistence
out of the box
● Every stage should extend class DefaultParamsWritable
● Provides a general implementation for persisting the
Params to a Parquet file
● Only params will be persisted, so all inputs and state
should be params
● Persisting a pipeline internally calls persist on all its
stages
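Persisting a fitted pipeline is a one-liner (the path is illustrative); for custom stages this only works if they mix in DefaultParamsWritable and keep all their state in Params:

// Writes the pipeline metadata plus every stage's params/data to the given path
pipelineModel.write.overwrite().save("hdfs:///models/churn_pipeline")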
Reading Persisted model
● A custom ML stage should have a companion object
which extends class DefaultParamsReadable
● Provides a general implementation for reading the saved
parameters into Stage params
● PipelineModel.load internally calls the read method on all
its stages to create a PipelineModel
com.shashank.sparkml.operationalize.stages.CastTransformer
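Loading mirrors the save (same illustrative path); the companion-object pattern is shown for the SimpleNullHandler sketch from earlier, assuming that stage was also saved via DefaultParamsWritable. newData is a hypothetical prediction DataFrame:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.util.DefaultParamsReadable

// Companion object so that PipelineModel.load can resolve the custom stage
object SimpleNullHandler extends DefaultParamsReadable[SimpleNullHandler]

val reloaded = PipelineModel.load("hdfs:///models/churn_pipeline")
val scored = reloaded.transform(newData)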
Persistent Params
● Params of type Double, Float, Long, Int, Boolean, Array or
Vector are persistent params
● Spark internally has logic to persist them
● Custom types like Map[K,V] or Option[Double], which we
have used, cannot be persisted by Spark
● A Param implementation has to be provided by the user,
with the methods below implemented
def jsonEncode(value: Option[T]): String
def jsonDecode(json: String): Option[T]
com.shashank.sparkml.operationalize.stages.PersistentParams
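A minimal sketch of such a Param for Option[Double]; the encoding chosen here is an assumption, the only requirement being that the encoded string is valid JSON, since Spark's default writer re-parses it:

import org.apache.spark.ml.param.Param

// Persistable Param[Option[Double]]: None encodes to JSON null, Some(x) to the number
// (NaN/Infinity are not handled in this sketch)
class OptionDoubleParam(parent: String, name: String, doc: String)
    extends Param[Option[Double]](parent, name, doc) {

  override def jsonEncode(value: Option[Double]): String =
    value.map(_.toString).getOrElse("null")

  override def jsonDecode(json: String): Option[Double] =
    if (json.trim == "null") None else Some(json.trim.toDouble)
}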
Predict Schema check
● Stages in a trained model are simple transformations
which transform the dataset from one form to another
● These transformations expect the feature columns to be
present in the prediction dataset
● Spark ML has no built-in way to validate whether a dataset
is suitable for the model
● Information about the schema should be stored while
training to verify the schema and throw meaningful errors
com.shashank.sparkml.operationalize.PredictSchemaIssue
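An illustrative check (not a Spark ML API), assuming the training-time schema was stored as a column-name-to-type map:

import org.apache.spark.sql.DataFrame

// Fail fast with a meaningful error instead of an obscure one deep inside a stage
def validatePredictSchema(predictDf: DataFrame, trainedSchema: Map[String, String]): Unit = {
  val actual = predictDf.schema.map(f => f.name -> f.dataType.simpleString).toMap
  val missing = trainedSchema.keySet -- actual.keySet
  val mismatched = trainedSchema.collect {
    case (name, tpe) if actual.get(name).exists(_ != tpe) =>
      s"$name: expected $tpe, got ${actual(name)}"
  }
  require(missing.isEmpty && mismatched.isEmpty,
    s"Dataset is not suitable for this model. Missing columns: ${missing.mkString(", ")}; " +
      s"type mismatches: ${mismatched.mkString(", ")}")
}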
FeatureNames extraction
● A PipelineModel doesn’t have an API to get the list of
feature names which were used to train the model
● The feature vector is just a collection of double values
● No information about what each of these values represents
● We can use the metadata of multiple stages to derive the
feature names associated with each feature value
● These features would also contain OneHotEncoded values
com.shashank.sparkml.operationalize.FeatureExtraction
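A sketch of reading names back from the ML attribute metadata on the assembled features column; "assembled" and the "features" column name refer to the earlier illustrative VectorAssembler output:

import org.apache.spark.ml.attribute.AttributeGroup

// VectorAssembler (and OneHotEncoder before it) writes per-slot attributes
// into the column metadata; each attribute carries the originating feature name
val group = AttributeGroup.fromStructField(assembled.schema("features"))
val featureNames: Seq[String] = group.attributes match {
  case Some(attrs) => attrs.toSeq.map(a => a.name.getOrElse(s"feature_${a.index.getOrElse(-1)}"))
  case None        => Seq.empty
}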
References
● https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-mllib/spark-mllib-pipelines.html
● https://spark.apache.org/docs/latest/ml-guide.html
● https://issues.apache.org/jira/browse/SPARK-20723
intermediateRDDStorageLevel for Tree based classifiers
● https://issues.apache.org/jira/browse/SPARK-8418
single- and multi-value support to ML Transformers
● https://issues.apache.org/jira/browse/SPARK-13434
Reduce Spark RandomForest memory footprint
Thank you