Leah McGuire
Principal Member of Technical Staff, Salesforce Einstein
lmcguire@salesforce.com
@leahmcguire
Types, Types, Types!
Embracing a hierarchy of types to simplify machine learning
Let’s make sure you are in the right talk
​ What I am going to talk about:
• What does machine learning mean at Salesforce
• Problems in machine learning for business-to-business (B2B) companies
• Automating machine learning and how our AutoML library (Optimus Prime) works
• The utility of having strongly typed features in AutoML
• What we have learned and what we are planning
Salesforce and Machine Learning
Salesforce
The Problem
​ For the majority of businesses, data science is out of reach
Sales Cloud Einstein
Predictive Lead Scoring
Opportunity Insights
Automated Activity Capture
Building World’s Smartest CRM
Commerce Cloud Einstein
Product Recommendations
Predictive Sort
Commerce Insights
App Cloud Einstein
Heroku + PredictionIO
Predictive Vision Services
Predictive Sentiment Services
Predictive Modeling Services
Service Cloud Einstein
Recommended Case Classification
Recommended Responses
Predictive Close Time
Marketing Cloud Einstein
Predictive Scoring
Predictive Audiences
Automated Send-time Optimization
Community Cloud Einstein
Recommended Experts, Articles & Topics
Automated Service Escalation
Newsfeed Insights
Analytics Cloud Einstein
Predictive Wave Apps
Smart Data Discovery
Automated Analytics & Storytelling
IoT Cloud Einstein
Predictive Device Scoring
Recommend Best Next Action
Automated IoT Rules Optimization
Machine learning workflows
And how much more complicated they get for B2B
[Diagram: workflow with the stages Feature Engineering, Model Training (Model A, Model B, Model C), and Model Evaluation]
​ What Kaggle would lead us to believe
Building a machine learning model
Real-life ML
​ Building a ML model workflow
[Diagram: workflow with the stages ETL, Feature Engineering, Model Training (Model A, Model B, Model C), Model Evaluation, Deployment, and Scoring]
​ Over and over again
Building a machine learning model
[Diagram: the full workflow (ETL, Feature Engineering, Model Training with Models A, B, and C, Model Evaluation, Deployment, Scoring) shown three times]
We can’t build one global model
• Privacy concerns
•  Customers don’t want data cross-pollinated
• Business Use Cases
•  Industries are very different
•  Processes are different
• Platform customization
•  Ability to create custom fields and objects
• Scale, Automation,
•  Ability to create
​ Over and over again
Building a machine learning model
[Diagram: the same workflow (ETL, Feature Engineering, Model Training, Model Evaluation, Deployment, Scoring) tiled many times across the slide]
Automating machine learning
Enter Einstein (and Optimus Prime)
•  ML is not magic, just statistics – generalizing examples
•  But there is a ‘black art’ to producing good models
•  Input data needs to be combined, filtered, cleaned, etc.
•  Producing the best features for your model takes time
•  You can’t just throw an ML algorithm at your raw data and expect good results
Turning a black art into a paint by number kit.
Keep it DRY (don’t repeat yourself) and DRO (don’t repeat others)
• The Spark ML pipeline (estimator, transformer) model is nice
• The lack of types in Spark is not
• Want to use more than Spark ML
• Declarative and intuitive syntax – for both workflow generation and developers
• Typed reusable operations
• Multitenant application support
• All built in Scala
Optimus Prime - A library to develop reusable, modular and typed ML workflows
Simple interchangeable parts
In a declarative, type-safe syntax
val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).vectorize()
val (pred, raw, prob) = featureVector.check(survived).classify(survived)
val workflow = new OpWorkflow().setResultFeatures(pred).setDataReader(titanicReader)
[Diagram: Features are transformed with Stages (Transformers and Estimators); Stages produce Features and are fitted into Workflows; Features are materialized by Workflows; Readers supply the input data]
Automating typed feature
engineering and modeling
(with Optimus Prime)
Features are given a type on creation
​ val gender = FeatureBuilder.Categorical[Titanic]
.extract(d => Option(d.getGender).toSet[String]).asPredictor
• Features are strongly typed
• Each stage takes specific input type(s) and returns specific output type(s)
​ Death to runtime errors!
[Diagram: Stages produce Features]
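A minimal sketch of what "typed features, typed stages" can buy you, written in plain Scala rather than against Optimus Prime itself. Every name here (Feature, Stage, FillMissing, Passenger) is a hypothetical stand-in chosen for illustration, not the library's API. The point is only that a stage declared for Real inputs cannot be applied to a Categorical feature: the mistake is rejected by the compiler instead of surfacing at runtime.

```scala
object TypedFeatureSketch {
  // Hypothetical feature value types; the real library's classes may look different.
  sealed trait FeatureType
  final case class Real(value: Option[Double]) extends FeatureType
  final case class Categorical(value: Set[String]) extends FeatureType

  // A feature is a named, typed extraction from a raw record of type R.
  final case class Feature[R, T <: FeatureType](name: String, extract: R => T)

  // A stage declares the input type it accepts and the output type it returns.
  trait Stage[I <: FeatureType, O <: FeatureType] {
    def apply[R](in: Feature[R, I]): Feature[R, O]
  }

  // Example stage: fills missing real values with a default.
  final class FillMissing(default: Double) extends Stage[Real, Real] {
    def apply[R](in: Feature[R, Real]): Feature[R, Real] =
      Feature(in.name + "_filled", r => Real(Some(in.extract(r).value.getOrElse(default))))
  }

  // Hypothetical raw record type standing in for the Titanic data.
  final case class Passenger(gender: String, age: Option[Double])

  val gender: Feature[Passenger, Categorical] =
    Feature("gender", p => Categorical(Option(p.gender).toSet))
  val age: Feature[Passenger, Real] =
    Feature("age", p => Real(p.age))

  val fill = new FillMissing(30.0)
  val filledAge: Feature[Passenger, Real] = fill(age)
  // fill(gender)  // does not compile: a Categorical feature is not a Real feature

  def main(args: Array[String]): Unit =
    println(filledAge.extract(Passenger("female", None)))
}
```

Uncommenting the `fill(gender)` line turns the mismatch into a compile error, which is the behavior the slide is celebrating.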
Creating a workflow DAG with features
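• Features point to a column of data
• The type of the feature determines which stages can act on it
[Diagram: workflow DAG with the raw features gender, age, name, and survived; the derived features genderPivot, title, and featureVector; and the model outputs pred, prob, and rawPred]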
Creating a workflow DAG with features
• When a stage acts on a feature it produces a new feature (or features)
• Keep on manipulating features until you reach your goal
[Diagram: the same workflow DAG (gender, age, name, genderPivot, title, featureVector, survived, pred, prob, rawPred) with its stages labeled: Pivot, Title Regex, Combine, and Model]
Done manipulating your features? Make them.
[Diagram: Features, Stages (Transformers and Estimators), Workflows, and Readers, with the relations "transformed with", "produce", "fitted into", "materialized by", and "input data"]
• Once you make your final feature you have the full DAG
• Features are materialized by the workflow
• Initial data into the workflow is provided by the reader
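One way to picture "the final feature gives you the full DAG" is the sketch below: each feature remembers the stage that produced it and its parent features, so walking back from the result feature recovers every stage in a valid execution order. This is plain Scala with hypothetical names (FeatureNode, executionOrder), and the edges are one plausible reading of the Titanic diagram, not the library's actual internals.

```scala
object WorkflowDagSketch {
  // A feature is either raw (stage = None) or derived from parent features by a named stage.
  final case class FeatureNode(name: String, stage: Option[String], parents: Seq[FeatureNode])

  def raw(name: String): FeatureNode = FeatureNode(name, None, Nil)
  def derived(name: String, stage: String, parents: FeatureNode*): FeatureNode =
    FeatureNode(name, Some(stage), parents)

  // Walk back from the result feature and list the stages so that every stage
  // appears after the stages that produce its inputs (depth-first post-order).
  def executionOrder(result: FeatureNode): Seq[String] = {
    def visit(node: FeatureNode, done: Vector[String]): Vector[String] = {
      val afterParents = node.parents.foldLeft(done)((acc, p) => visit(p, acc))
      node.stage match {
        case Some(s) if !afterParents.contains(s) => afterParents :+ s
        case _ => afterParents
      }
    }
    visit(result, Vector.empty)
  }

  // Assumed wiring, following the feature and stage names on the slide.
  val gender = raw("gender")
  val age = raw("age")
  val name = raw("name")
  val survived = raw("survived")
  val genderPivot = derived("genderPivot", "Pivot", gender)
  val title = derived("title", "Title Regex", name)
  val featureVector = derived("featureVector", "Combine", genderPivot, age, title)
  val pred = derived("pred", "Model", featureVector, survived)

  def main(args: Array[String]): Unit =
    println(executionOrder(pred).mkString(" -> "))
}
```

A workflow built this way only needs the result feature and a reader for the raw features; everything in between is implied by the graph.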
The power of types!
Using types to automate feature engineering
​ val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).vectorize()
• Each feature is mapped to an appropriate .vectorize() stage based on its type
•  gender (a Categorical) and age (a Real) are automatically assigned to different stages
• You also have an option to do the exact type-safe manipulations you want
•  age can undergo special transformations if desired
•  val ageBuckets = age.bucketize(buckets(0, 10, 20, 40, 100))
•  val featureVector = Seq(pClass, name, gender, ageBuckets, sibSp, parch, ticket, cabin, embarked).vectorize()
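How might a single vectorize() call pick a different treatment per feature type? One common Scala technique is a type class, sketched below in plain Scala. This is an assumption about the general approach, not a description of how Optimus Prime implements it; the names Vectorizer and bucketize, and the specific encodings (mean imputation for Real, one-hot for Categorical), are illustrative choices.

```scala
object VectorizeSketch {
  sealed trait FeatureType
  final case class Real(value: Option[Double]) extends FeatureType
  final case class Categorical(value: Set[String]) extends FeatureType

  // Type class: how to turn a column of values of type T into numeric rows.
  trait Vectorizer[T <: FeatureType] {
    def vectorize(column: Seq[T]): Seq[Array[Double]]
  }

  object Vectorizer {
    // Real values: impute missing entries with the mean of the present ones.
    implicit val realVectorizer: Vectorizer[Real] = new Vectorizer[Real] {
      def vectorize(column: Seq[Real]): Seq[Array[Double]] = {
        val present = column.flatMap(_.value)
        val mean = if (present.isEmpty) 0.0 else present.sum / present.size
        column.map(r => Array(r.value.getOrElse(mean)))
      }
    }

    // Categorical values: one-hot encode over the observed categories.
    implicit val categoricalVectorizer: Vectorizer[Categorical] = new Vectorizer[Categorical] {
      def vectorize(column: Seq[Categorical]): Seq[Array[Double]] = {
        val categories = column.flatMap(_.value).distinct.sorted
        column.map(c => categories.map(k => if (c.value.contains(k)) 1.0 else 0.0).toArray)
      }
    }
  }

  // The compiler selects the Vectorizer instance from the element type of the column.
  def vectorize[T <: FeatureType](column: Seq[T])(implicit v: Vectorizer[T]): Seq[Array[Double]] =
    v.vectorize(column)

  // Hand-picked manipulation: map a Real to a Categorical bucket label.
  // The label is the index of the last split point <= the value (-1 if below all splits).
  def bucketize(column: Seq[Real], splits: Seq[Double]): Seq[Categorical] =
    column.map { r =>
      Categorical(r.value.map(x => s"bucket_${splits.lastIndexWhere(_ <= x)}").toSet)
    }

  def main(args: Array[String]): Unit = {
    val ages = Seq(Real(Some(10.0)), Real(None), Real(Some(30.0)))
    val genders = Seq(Categorical(Set("male")), Categorical(Set("female")))
    println(vectorize(ages).map(_.toList))
    println(vectorize(genders).map(_.toList))
    println(vectorize(bucketize(ages, Seq(0.0, 10.0, 20.0, 40.0, 100.0))).map(_.toList))
  }
}
```

Because bucketize returns Categorical values, the bucketed ages flow into the one-hot encoder automatically, which mirrors the ageBuckets example above.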
Show me the types!
Optimus Prime Type Hierarchy
[Diagram: type hierarchy rooted at FeatureType. Types shown: OPNumeric, OPCollection, OPSet, OPSortedSet, OPList, OPVector, OPMap, NonNullable, Categorical, Ordinal, Location; Text, Email, Base64, Phone, ID, URL, ComboBox, PickList, TextArea, City, Street, Country, Postal Code, State; Integral, Real, Binary, Percent, Currency, Date, DateTime; MultiPickList, TextList, DateList, DateTimeList, Geolocation; BinaryMap, IntegralMap, RealMap, CategoricalMap, OrdinalMap, TextMap, StateMap, ...]
Legend: - inheritance, bold - abstract class, italic - trait, normal - concrete class
Note: all the types are assumed to be nullable, unless the NonNullable trait is mixed in. See https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/field_types.htm
Take the types away!!
• Sometimes a type is all you have
• Hierarchy allows both very specific and very general stages
• Type safety for production saves a lot of headaches
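A small sketch of why a hierarchy helps, in plain Scala. It assumes Email and Phone sit under Text, as the flattened diagram appears to suggest; the two functions are hypothetical stages invented for illustration. A general stage is written once against the parent type, while a specific stage accepts only the subtype it makes sense for.

```scala
object TypeHierarchySketch {
  sealed trait FeatureType
  class Text(val value: Option[String]) extends FeatureType
  final class Email(value: Option[String]) extends Text(value)
  final class Phone(value: Option[String]) extends Text(value)

  // General stage: accepts any Text, including its subtypes.
  def textLength(t: Text): Int = t.value.map(_.length).getOrElse(0)

  // Specific stage: only meaningful for Email, so only Email is accepted.
  def emailDomain(e: Email): Option[String] =
    e.value.flatMap { s =>
      val at = s.lastIndexOf('@')
      if (at >= 0 && at < s.length - 1) Some(s.substring(at + 1)) else None
    }

  def main(args: Array[String]): Unit = {
    val email = new Email(Some("lmcguire@salesforce.com"))
    val phone = new Phone(Some("555-0100"))
    println(textLength(email))   // general stage works on Email
    println(textLength(phone))   // and on Phone
    println(emailDomain(email))  // specific stage works on Email only
    // emailDomain(phone)        // does not compile: Phone is not an Email
  }
}
```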
​ Why would we make this monstrosity??
[Diagram: the Optimus Prime type hierarchy, shown again]
Sanity Checking – the stage that checks your features
• Check data quality before modeling
• Label leakage
• Features have acceptable ranges
• The feature types allow much better checks
val checkedVector = featureVector.check(survived)
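To make "label leakage" concrete, here is a rough sketch of one check such a stage could run: drop any numeric feature whose correlation with the label is suspiciously close to perfect. It is plain Scala, and the function names, the 0.95 threshold, and the choice of Pearson correlation are all assumptions for illustration; the slides do not say which statistics the real sanity checker uses.

```scala
object SanityCheckSketch {
  // Pearson correlation between two equally long numeric columns.
  // Returns 0.0 when either column has zero variance.
  def correlation(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.size == ys.size, "columns must be non-empty and of equal length")
    val n = xs.size
    val mx = xs.sum / n
    val my = ys.sum / n
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    if (sx == 0.0 || sy == 0.0) 0.0 else cov / (sx * sy)
  }

  // Keep only features whose absolute correlation with the label is below maxCorr.
  def check(
      features: Map[String, Seq[Double]],
      label: Seq[Double],
      maxCorr: Double = 0.95): Map[String, Seq[Double]] =
    features.filter { case (_, column) => math.abs(correlation(column, label)) < maxCorr }

  def main(args: Array[String]): Unit = {
    val survived = Seq(0.0, 1.0, 1.0, 0.0)
    val features = Map(
      "age" -> Seq(22.0, 38.0, 26.0, 35.0),
      "lifeboatAssigned" -> Seq(0.0, 1.0, 1.0, 0.0) // made-up column that duplicates the label
    )
    println(check(features, survived).keySet)
  }
}
```

With typed features, a checker can go further than this, for example choosing a different association measure for categorical columns than for real-valued ones.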
Model Selection Stage - Resampling, Hyperparameter Tuning, Comparing Models
• Many possible models for each class of problem
• Many hyperparameters for each type of model
• Finding the right model for THIS dataset makes a huge difference
val (pred, raw, prob) = checkedVector.classify(survived)
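The idea behind a model selection stage can be sketched in a few lines of plain Scala: treat every model family and hyperparameter setting as a candidate, score each with k-fold cross validation, and keep the winner. The toy learners below (majorityClass, meanSplit) and all names are hypothetical and exist only to make the loop runnable; a real stage would plug in real estimators and metrics.

```scala
object ModelSelectionSketch {
  type Row = (Double, Int)   // (feature value, label in {0, 1})
  type Model = Double => Int // a fitted model maps a feature value to a label

  // A candidate is a name plus a training function; hyperparameters are baked in.
  final case class Candidate(name: String, fit: Seq[Row] => Model)

  // Toy learner 1: always predict the most common training label.
  val majorityClass: Candidate = Candidate("majorityClass", train => {
    val ones = train.count(_._2 == 1)
    val prediction = if (ones * 2 >= train.size) 1 else 0
    _ => prediction
  })

  // Toy learner 2: split at the midpoint of the two class means, moved by `shift`.
  def meanSplit(shift: Double): Candidate = Candidate(s"meanSplit(shift=$shift)", train => {
    def mean(label: Int): Double = {
      val xs = train.filter(_._2 == label).map(_._1)
      if (xs.isEmpty) 0.0 else xs.sum / xs.size
    }
    val threshold = (mean(0) + mean(1)) / 2 + shift
    x => if (x > threshold) 1 else 0
  })

  def accuracy(model: Model, data: Seq[Row]): Double =
    data.count { case (x, y) => model(x) == y }.toDouble / data.size

  // k-fold cross validation: average held-out accuracy over the folds.
  // Assumes folds <= data.size so that every test fold is non-empty.
  def crossValidate(c: Candidate, data: Seq[Row], folds: Int): Double = {
    val indexed = data.zipWithIndex
    val scores = (0 until folds).map { k =>
      val (test, train) = indexed.partition { case (_, i) => i % folds == k }
      accuracy(c.fit(train.map(_._1)), test.map(_._1))
    }
    scores.sum / folds
  }

  def select(candidates: Seq[Candidate], data: Seq[Row], folds: Int = 3): Candidate =
    candidates.maxBy(crossValidate(_, data, folds))

  def main(args: Array[String]): Unit = {
    val data = Seq((1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1))
    val grid = majorityClass +: Seq(0.0, 5.0).map(meanSplit)
    println(select(grid, data).name)
  }
}
```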
Types can save us
​ And if you don’t believe me take a look at the code
val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).vectorize()
val (pred, raw, prob) = featureVector.check(survived).classify(survived)
val workflow = new OpWorkflow().setResultFeatures(pred).setDataReader(titanicReader)
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, udf}

def addFeatures(df: DataFrame): DataFrame = {
  // Create a new family size field := siblings + spouses + parents + children + self
  val familySizeUDF = udf { (sibsp: Double, parch: Double) => sibsp + parch + 1 }

  df.withColumn("fsize", familySizeUDF(col("sibsp"), col("parch"))) // <-- full freedom to overwrite
}

def fillMissing(df: DataFrame): DataFrame = {
  // Fill missing age values with the average age (extracted as a Double so na.fill accepts it)
  val avgAge = df.agg(avg("age")).first().getDouble(0)

  // Fill missing embarked values with default "S" (i.e. Southampton)
  val embarkedUDF = udf { (e: String) => if (e == null || e.isEmpty) "S" else e }

  df.na.fill(Map("age" -> avgAge)).withColumn("embarked", embarkedUDF(col("embarked")))
}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Modify the dataframe
val allData = fillMissing(addFeatures(rawData)).cache() // <-- need to remember about caching
// Split the data and cache it
val Array(trainSet, testSet) = allData.randomSplit(Array(0.75, 0.25)).map(_.cache())

// Prepare categorical columns
val categoricalFeatures = Array("pclass", "sex", "embarked")
val stringIndexers = categoricalFeatures.map(colName =>
  new StringIndexer().setInputCol(colName).setOutputCol(colName + "_index").fit(allData)
)

// Concatenate all the features into a numeric feature vector
val allFeatures = Array("age", "sibsp", "parch", "fsize") ++ stringIndexers.map(_.getOutputCol)

val vectorAssembler = new VectorAssembler().setInputCols(allFeatures).setOutputCol("feature_vector")

// Prepare Logistic Regression estimator
val logReg = new LogisticRegression().setFeaturesCol("feature_vector").setLabelCol("survived")

// Finally build the pipeline with the stages above
val pipeline = new Pipeline().setStages(stringIndexers ++ Array[PipelineStage](vectorAssembler, logReg))
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.DataFrame

// Cross validate our pipeline with various parameters
val paramGrid =
  new ParamGridBuilder()
    .addGrid(logReg.regParam, Array(1, 0.1, 0.01))
    .addGrid(logReg.maxIter, Array(10, 50, 100))
    .build()

val crossValidator =
  new CrossValidator()
    .setEstimator(pipeline) // <-- set our pipeline here
    .setEstimatorParamMaps(paramGrid)
    .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("survived"))
    .setNumFolds(3)

// Train the model & compute scores
val model: CrossValidatorModel = crossValidator.fit(trainSet)
val scores: DataFrame = model.transform(testSet)

// Save the model for later use
model.save("/models/titanic-model.ml")
Where are we going and what have we learned
Key takeaways
• ML for B2B is a whole other beast
• Spark ML is great, but it needs type safety
• Simple and intuitive syntax saves you trouble down the road
• Types in ML are incredibly useful
• Scala has all the relevant facilities to provide the above
• Modularity and reusability are key
Going forward with Optimus Prime
• Going beyond Spark ML for algorithms and small scale
• Making everything smarter (feature eng, sanity checking, model selection)
• Template generation
• Improvements to developer interface
If You’re Curious …
einstein-recruiting@salesforce.com
Thank You
Ad

More Related Content

What's hot (20)

Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
Anass Bensrhir - Senior Data Scientist
 
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Databricks
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Databricks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Anyscale
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
Spark Summit
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Spark Summit
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...
Databricks
 
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Chris Fregly
 
Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ...
 Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ... Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ...
Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ...
Databricks
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
Databricks
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
Databricks
 
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Spark Summit
 
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Databricks
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Databricks
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
Stepan Pushkarev
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenches
Vinay Shukla
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
Databricks
 
MLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using SparkMLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using Spark
Jen Aman
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Anyscale
 
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Databricks
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Databricks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Anyscale
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Spark Summit
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...
Databricks
 
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Chris Fregly
 
Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ...
 Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ... Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ...
Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ...
Databricks
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
Databricks
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
Databricks
 
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Spark Summit
 
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Databricks
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Databricks
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
Stepan Pushkarev
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenches
Vinay Shukla
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
Databricks
 
MLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using SparkMLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using Spark
Jen Aman
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Anyscale
 

Similar to Embracing a Taxonomy of Types to Simplify Machine Learning with Leah McGuire (20)

Fantastic ML apps and how to build them
Fantastic ML apps and how to build themFantastic ML apps and how to build them
Fantastic ML apps and how to build them
Matthew Tovbin
 
Write Generic Code with the Tooling API
Write Generic Code with the Tooling APIWrite Generic Code with the Tooling API
Write Generic Code with the Tooling API
Adam Olshansky
 
Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14
VMware Tanzu
 
OSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningOSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine Learning
Paco Nathan
 
Scaling Machine Learning from zero to millions of users (May 2019)
Scaling Machine Learning from zero to millions of users (May 2019)Scaling Machine Learning from zero to millions of users (May 2019)
Scaling Machine Learning from zero to millions of users (May 2019)
Julien SIMON
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
Paco Nathan
 
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
Paco Nathan
 
How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018
How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018
How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018
Mike Harris
 
Mark Tortoricci - Talent42 2015
Mark Tortoricci - Talent42 2015Mark Tortoricci - Talent42 2015
Mark Tortoricci - Talent42 2015
Talent42
 
Deep Dive Time Series Anomaly Detection in Azure with dotnet
Deep Dive Time Series Anomaly Detection in Azure with dotnetDeep Dive Time Series Anomaly Detection in Azure with dotnet
Deep Dive Time Series Anomaly Detection in Azure with dotnet
Marco Parenzan
 
Machine Learning with TensorFlow 2
Machine Learning with TensorFlow 2Machine Learning with TensorFlow 2
Machine Learning with TensorFlow 2
Sarah Stemmler
 
Punta Dreamin 17 Generic Apex and Tooling Api
Punta Dreamin 17 Generic Apex and Tooling ApiPunta Dreamin 17 Generic Apex and Tooling Api
Punta Dreamin 17 Generic Apex and Tooling Api
Adam Olshansky
 
Big Data Pipelines and Machine Learning at Uber
Big Data Pipelines and Machine Learning at UberBig Data Pipelines and Machine Learning at Uber
Big Data Pipelines and Machine Learning at Uber
Sudhir Tonse
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev
 
Machine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackbox
Ivo Andreev
 
Domain oriented development
Domain oriented developmentDomain oriented development
Domain oriented development
rajmundr
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
Chetan Khatri
 
Machine learning
Machine learningMachine learning
Machine learning
Saravanan Subburayal
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with Azure
Ivo Andreev
 
Wix Machine Learning - Ran Romano
Wix Machine Learning - Ran RomanoWix Machine Learning - Ran Romano
Wix Machine Learning - Ran Romano
Wix Engineering
 
Fantastic ML apps and how to build them
Fantastic ML apps and how to build themFantastic ML apps and how to build them
Fantastic ML apps and how to build them
Matthew Tovbin
 
Write Generic Code with the Tooling API
Write Generic Code with the Tooling APIWrite Generic Code with the Tooling API
Write Generic Code with the Tooling API
Adam Olshansky
 
Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14
VMware Tanzu
 
OSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningOSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine Learning
Paco Nathan
 
Scaling Machine Learning from zero to millions of users (May 2019)
Scaling Machine Learning from zero to millions of users (May 2019)Scaling Machine Learning from zero to millions of users (May 2019)
Scaling Machine Learning from zero to millions of users (May 2019)
Julien SIMON
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
Paco Nathan
 
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
Paco Nathan
 
How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018
Embracing a Taxonomy of Types to Simplify Machine Learning with Leah McGuire

  • 1. Leah McGuire, Principal Member of Technical Staff, Salesforce Einstein, lmcguire@salesforce.com, @leahmcguire. Types, Types, Types! Embracing a hierarchy of types to simplify machine learning
  • 2. Let’s make sure you are in the right talk ​ What I am going to talk about: • What does machine learning mean at Salesforce • Problems in machine learning for business to business (B2B) companies • Automating machine learning and how our AutoML library (Optimus Prime) works • The utility of having strongly typed features in AutoML • What we have learned and what we are planning
  • 5. The Problem ​ For the majority of businesses, data science is out of reach
  • 6. Building World’s Smartest CRM • Sales Cloud Einstein: Predictive Lead Scoring, Opportunity Insights, Automated Activity Capture • Commerce Cloud Einstein: Product Recommendations, Predictive Sort, Commerce Insights • App Cloud Einstein: Heroku + PredictionIO, Predictive Vision Services, Predictive Sentiment Services, Predictive Modeling Services • Service Cloud Einstein: Recommended Case Classification, Recommended Responses, Predictive Close Time • Marketing Cloud Einstein: Predictive Scoring, Predictive Audiences, Automated Send-time Optimization • Community Cloud Einstein: Recommended Experts, Articles & Topics, Automated Service Escalation, Newsfeed Insights • Analytics Cloud Einstein: Predictive Wave Apps, Smart Data Discovery, Automated Analytics & Storytelling • IoT Cloud Einstein: Predictive Device Scoring, Recommend Best Next Action, Automated IoT Rules Optimization
  • 7. Machine learning workflows And how much more complicated they get for B2B
  • 8. Building a machine learning model: what Kaggle would lead us to believe. [Diagram: Feature Engineering, Model Training (Model A, Model B, Model C), Model Evaluation]
  • 9. Real-life ML: building an ML model workflow. [Diagram: ETL, Feature Engineering, Model Training (Model A, Model B, Model C), Model Evaluation, Deployment, Scoring]
  • 10. Building a machine learning model: over and over again. [Diagram: the workflow of ETL, Feature Engineering, Model Training (Model A, Model B, Model C), Model Evaluation, Deployment, Scoring, repeated three times]
  • 11. We can’t build one global model • Privacy concerns •  Customers don’t want data cross-pollinated • Business Use Cases •  Industries are very different •  Processes are different • Platform customization •  Ability to create custom fields and objects • Scale, Automation, •  Ability to create
  • 12. Building a machine learning model: over and over again. [Diagram: the same workflow (ETL, Feature Engineering, Model Training with Model A, Model B, Model C, Model Evaluation, Deployment, Scoring) tiled many times across the slide]
  • 13. Automating machine learning Enter Einstein (and Optimus Prime)
  • 14. Turning a black art into a paint-by-number kit. •  ML is not magic, just statistics – generalizing examples •  But there is a ‘black art’ to producing good models •  Input data needs to be combined, filtered, cleaned etc. •  Producing the best features for your model takes time •  You can’t just throw an ML algorithm at your raw data and expect good results
  • 15. Keep it DRY (don’t repeat yourself) and DRO (don’t repeat others) • The Spark ML pipeline (estimator, transformer) model is nice • The lack of types in Spark is not • Want to use more than Spark ML • Declarative and intuitive syntax – for both workflow generation and developers • Typed reusable operations • Multitenant application support • All built in Scala. Optimus Prime - A library to develop reusable, modular and typed ML workflows
  • 16. Simple interchangeable parts, in a declarative type safe syntax
      val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).vectorize()
      val (pred, raw, prob) = featureVector.check(survived).classify(survived)
      val workflow = new OpWorkflow().setResultFeatures(pred).setDataReader(titanicReader)
    [Diagram: Features are transformed with Stages, and Stages produce Features; Stages are Transformers or Estimators, and Estimators are fitted into Transformers; Readers input data to Workflows; Features are materialized by Workflows]
  • 17. Automating typed feature engineering and modeling (with Optimus Prime)
  • 18. Features are given a type on creation. val gender = FeatureBuilder.Categorical[Titanic].extract(d => Option(d.getGender).toSet[String]).asPredictor • Features are strongly typed • Each stage takes specific input type(s) and returns specific output type(s). Death to runtime errors! [Diagram: Stages produce Features]
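To make the "death to runtime errors" point concrete, here is a minimal plain-Scala sketch of a typed feature and a typed stage. It is not the Optimus Prime API: `FeatureType`, `Feature`, `Stage`, and `FillMissing` are hypothetical stand-ins, chosen only to show how a mismatched feature can be rejected by the compiler instead of failing at run time.

```scala
object TypedFeatureSketch {
  // Hypothetical, simplified feature types (assumption, not the library's classes).
  sealed trait FeatureType
  final case class Real(value: Option[Double]) extends FeatureType
  final case class Categorical(value: Set[String]) extends FeatureType

  // A feature names a column and carries the type of its values as a type parameter.
  final case class Feature[T <: FeatureType](name: String)

  // A stage declares the feature type it accepts and the type it returns.
  trait Stage[I <: FeatureType, O <: FeatureType] {
    def outputName(input: Feature[I]): String
    def transform(value: I): O
    def apply(input: Feature[I]): Feature[O] = Feature[O](outputName(input))
  }

  // Example stage: replaces a missing real value with a constant.
  final class FillMissing(default: Double) extends Stage[Real, Real] {
    def outputName(input: Feature[Real]): String = s"${input.name}_filled"
    def transform(value: Real): Real = Real(Some(value.value.getOrElse(default)))
  }

  val age: Feature[Real] = Feature[Real]("age")
  val gender: Feature[Categorical] = Feature[Categorical]("gender")

  val fill = new FillMissing(default = 30.0)
  val ageFilled: Feature[Real] = fill(age)

  // fill(gender) does not compile: a Feature[Categorical] is not a Feature[Real].
}
```

Because `Feature` is invariant in its type parameter, the mistake of passing `gender` to a stage for reals is caught when the workflow is compiled, before any data is read.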
  • 19. Creating a workflow DAG with features • Features point to a column of data • The type of the feature determines which stages can act on it [Diagram: features gender, age, name, genderPivot, title, featureVector, survived, pred, prob, rawPred]
  • 20. Creating a workflow DAG with features • When a stage acts on a feature it produces a new feature (or features) • Keep on manipulating features until you reach your goal [Diagram: features gender, age, name, genderPivot, title, featureVector, survived, pred, prob, rawPred, connected by the stages Pivot, Title Regex, Combine, Model]
  • 21. Done manipulating your features? Make them. • Once you make your final feature you have the full DAG • Features are materialized by the workflow • Initial data into the workflow provided by the reader [Diagram: Features are transformed with Stages, and Stages produce Features; Stages are Transformers or Estimators, and Estimators are fitted into Transformers; Readers input data to Workflows; Features are materialized by Workflows]
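The DAG idea can be sketched in a few lines of plain Scala. This is an illustration only: `FeatureNode`, `raw`, and `derived` are hypothetical names, and the assignment of stages to edges (Pivot on gender, Title Regex on name, Combine into featureVector, Model into pred) is an assumption read off the slide's labels, not something the slide states explicitly.

```scala
object DagSketch {
  // Hypothetical node: a feature remembers the stage and the parent features that produce it.
  final case class FeatureNode(name: String, stage: Option[String], parents: Seq[FeatureNode]) {
    // Every feature needed to compute this one (itself included), parents before children.
    def lineage: Seq[FeatureNode] =
      (parents.flatMap(_.lineage) :+ this).distinct
  }

  def raw(name: String): FeatureNode = FeatureNode(name, None, Nil)

  def derived(name: String, stage: String, parents: FeatureNode*): FeatureNode =
    FeatureNode(name, Some(stage), parents)

  // Raw features point at columns of the input data.
  val gender = raw("gender")
  val age = raw("age")
  val name = raw("name")
  val survived = raw("survived")

  // Each stage applied to features yields a new feature.
  val genderPivot = derived("genderPivot", "Pivot", gender)
  val title = derived("title", "Title Regex", name)
  val featureVector = derived("featureVector", "Combine", genderPivot, age, title)
  val pred = derived("pred", "Model", survived, featureVector)

  // Walking back from the final feature recovers the whole DAG in an executable order.
  val order: Seq[String] = pred.lineage.map(_.name)
}
```

Holding only the final feature (`pred`) is enough to recover every upstream feature and stage, which is the sense in which making the final feature gives you the full DAG.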
  • 22. The power of types!
  • 23. Using types to automate feature engineering. val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).vectorize() • Each feature is mapped to an appropriate .vectorize() stage based on its type •  gender (a Categorical) and age (a Real) are automatically assigned to different stages • You also have an option to do the exact type safe manipulations you want •  age can undergo special transformations if desired •  val ageBuckets = age.bucketize(buckets(0, 10, 20, 40, 100)) •  val featureVector = Seq(pClass, name, gender, ageBuckets, sibSp, parch, ticket, cabin, embarked).vectorize()
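One way type-driven vectorization could work is with a type class that supplies a different encoder per feature type. The sketch below is an assumption about the mechanism, not the library's implementation: `Vectorizer`, the null-indicator encoding for reals, and the fixed `genderLevels` vocabulary are all hypothetical.

```scala
object VectorizeSketch {
  sealed trait FeatureValue
  final case class Real(value: Option[Double]) extends FeatureValue
  final case class Categorical(value: Set[String]) extends FeatureValue

  // Type class: how to turn one kind of feature value into numbers.
  trait Vectorizer[T <: FeatureValue] {
    def vectorize(value: T): Vector[Double]
  }

  // Reals: the value (0.0 when missing) followed by a missing-value indicator.
  implicit val realVectorizer: Vectorizer[Real] = new Vectorizer[Real] {
    def vectorize(v: Real): Vector[Double] =
      Vector(v.value.getOrElse(0.0), if (v.value.isEmpty) 1.0 else 0.0)
  }

  // Categoricals: one-hot encoding over a fixed, hypothetical vocabulary.
  val genderLevels: Vector[String] = Vector("female", "male")

  implicit val categoricalVectorizer: Vectorizer[Categorical] = new Vectorizer[Categorical] {
    def vectorize(v: Categorical): Vector[Double] =
      genderLevels.map(level => if (v.value.contains(level)) 1.0 else 0.0)
  }

  // The compiler picks the encoder from the static type of the argument.
  def vectorize[T <: FeatureValue](value: T)(implicit ev: Vectorizer[T]): Vector[Double] =
    ev.vectorize(value)

  val age = Real(Some(22.0))
  val gender = Categorical(Set("male"))

  // Same call, different stage per type; the results are concatenated into one row.
  val row: Vector[Double] = vectorize(age) ++ vectorize(gender)
}
```

The caller writes the same `vectorize` call for every feature, and a feature type with no encoder in scope fails to compile rather than producing a bad column.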
  • 24. Show me the types! Optimus Prime Type Hierarchy. [Diagram: an inheritance tree containing FeatureType, OPNumeric, OPCollection, OPSet, OPSortedSet, OPList, NonNullable, Text, Email, Base64, Phone, ID, URL, ComboBox, PickList, TextArea, OPVector, OPMap, BinaryMap, IntegralMap, RealMap, CategoricalMap, OrdinalMap, DateList, DateTimeList, Integral, Real, Binary, Percent, Currency, Date, DateTime, Categorical, MultiPickList, Ordinal, TextMap, TextList, City, Street, Country, Postal Code, Location, State, Geolocation, StateMap, ...] Legend: - inheritance, bold - abstract class, italic - trait, normal - concrete class. Note: all the types are assumed to be nullable, unless the NonNullable trait is mixed in - https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/field_types.htm
  • 25. Take the types away!! Why would we make this monstrosity?? • Sometimes a type is all you have • Hierarchy allows both very specific and very general stages • Type safety for production saves a lot of headaches [Diagram: the Optimus Prime type hierarchy from slide 24]
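The claim that a hierarchy allows both very specific and very general stages can be illustrated with two functions over a tiny slice of such a hierarchy. This is a sketch under assumptions: the flattened diagram does not preserve parentage, so treating `Email` as a subtype of `Text` is an assumption, and `textLength` and `emailDomain` are hypothetical stages.

```scala
object HierarchySketch {
  sealed trait FeatureType

  // Assumed slice of the hierarchy: Email is a more specific kind of Text.
  class Text(val value: Option[String]) extends FeatureType
  final class Email(v: Option[String]) extends Text(v)

  // General stage: accepts any Text, so it also accepts an Email.
  def textLength(t: Text): Int = t.value.map(_.length).getOrElse(0)

  // Specific stage: only meaningful for Email, and only accepts Email.
  def emailDomain(e: Email): Option[String] =
    e.value.flatMap { s =>
      val at = s.lastIndexOf('@')
      if (at >= 0 && at < s.length - 1) Some(s.substring(at + 1)) else None
    }

  val contact = new Email(Some("a@b.com"))
  val length: Int = textLength(contact)            // general stage applied to a specific type
  val domain: Option[String] = emailDomain(contact)

  // emailDomain(new Text(Some("hello"))) does not compile: a Text is not an Email.
}
```

A stage written against the general type is reusable across all its subtypes, while a stage that needs email semantics can demand them in its signature.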
  • 26. Sanity Checking – the stage that checks your features • Check data quality before modeling • Label leakage • Features have acceptable ranges • The feature types allow much better checks val checkedVector = featureVector.check(survived)
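As a rough illustration of one such check, the sketch below flags label leakage by dropping any numeric feature whose correlation with the label is suspiciously high. It is not what `.check()` does internally: the `check` function here, the Pearson-correlation criterion, the `0.95` threshold, and the toy data are all assumptions, and the range checks mentioned on the slide are not covered.

```scala
object SanityCheckSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Pearson correlation; returns 0.0 when either column is constant.
  def correlation(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty, "columns must be non-empty and equally long")
    val mx = mean(xs)
    val my = mean(ys)
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val vx = xs.map(x => (x - mx) * (x - mx)).sum
    val vy = ys.map(y => (y - my) * (y - my)).sum
    if (vx == 0.0 || vy == 0.0) 0.0 else cov / math.sqrt(vx * vy)
  }

  // Keep only features whose absolute correlation with the label is below maxCorr.
  def check(
      features: Map[String, Seq[Double]],
      label: Seq[Double],
      maxCorr: Double = 0.95): Map[String, Seq[Double]] =
    features.filter { case (_, column) => math.abs(correlation(column, label)) < maxCorr }

  // Toy data (made up for illustration): "leaky" is a copy of the label.
  val survived: Seq[Double] = Seq(1.0, 0.0, 1.0, 0.0)
  val features: Map[String, Seq[Double]] = Map(
    "age" -> Seq(22.0, 38.0, 26.0, 20.0),
    "leaky" -> Seq(1.0, 0.0, 1.0, 0.0)
  )

  val kept: Set[String] = check(features, survived).keySet
}
```

A feature that tracks the label almost perfectly usually encodes the outcome itself, so removing it before training avoids a model that looks excellent offline and fails in production.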
  • 27. Model Selection Stage - Resampling, Hyper-parameter Tuning, Comparing Models • Many possible models for each class of problem • Many hyper parameters for each type of model • Finding the right model for THIS dataset makes a huge difference val (pred, raw, prob) = checkedVector.classify(survived)
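The comparing-models part of this stage can be sketched as k-fold cross-validation over a list of candidates. Everything here is hypothetical and deliberately tiny: the `Candidate` type, the two toy learners, and the six-row dataset are made up, and resampling and hyper-parameter grids are left out.

```scala
object ModelSelectionSketch {
  type Row = (Double, Boolean)   // (single numeric feature, label)
  type Model = Double => Boolean

  // A candidate is a name plus a training function that returns a fitted model.
  final case class Candidate(name: String, train: Seq[Row] => Model)

  // Baseline: always predict the majority label seen in training.
  val majority: Candidate = Candidate("majority", rows => {
    val positive = rows.count(_._2) * 2 >= rows.size
    _ => positive
  })

  // Threshold at the midpoint between class means; falls back to the baseline
  // when the training data contains only one class.
  val threshold: Candidate = Candidate("threshold", rows => {
    val (pos, neg) = rows.partition(_._2)
    if (pos.isEmpty || neg.isEmpty) majority.train(rows)
    else {
      val cut = (pos.map(_._1).sum / pos.size + neg.map(_._1).sum / neg.size) / 2
      x => x > cut
    }
  })

  def accuracy(model: Model, rows: Seq[Row]): Double =
    rows.count { case (x, y) => model(x) == y }.toDouble / rows.size

  // k-fold cross-validation: row i is held out in fold i % k.
  def crossValidate(candidate: Candidate, rows: Seq[Row], k: Int): Double = {
    val indexed = rows.zipWithIndex
    val scores = (0 until k).map { fold =>
      val (test, train) = indexed.partition(_._2 % k == fold)
      accuracy(candidate.train(train.map(_._1)), test.map(_._1))
    }
    scores.sum / k
  }

  // Pick the candidate with the best average held-out accuracy.
  def select(candidates: Seq[Candidate], rows: Seq[Row], k: Int): Candidate =
    candidates.maxBy(crossValidate(_, rows, k))

  val data: Seq[Row] = Seq(
    1.0 -> false, 2.0 -> false, 3.0 -> false,
    7.0 -> true, 8.0 -> true, 9.0 -> true
  )

  val best: Candidate = select(Seq(majority, threshold), data, k = 3)
}
```

Scoring every candidate on held-out folds of the same dataset is what lets the choice be made per dataset rather than once globally, which matters when each customer's data differs.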
  • 28. Types can save us. And if you don’t believe me take a look at the code
      val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).vectorize()
      val (pred, raw, prob) = featureVector.check(survived).classify(survived)
      val workflow = new OpWorkflow().setResultFeatures(pred).setDataReader(titanicReader)
  • 29. Types can save us. And if you don’t believe me take a look at the code
      val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).vectorize()
      val (pred, raw, prob) = featureVector.check(survived).classify(survived)
      val workflow = new OpWorkflow().setResultFeatures(pred).setDataReader(titanicReader)

    Plain Spark ML, for comparison:
      import org.apache.spark.sql.DataFrame
      import org.apache.spark.sql.functions.{avg, col, udf}

      def addFeatures(df: DataFrame): DataFrame = {
        // Create a new family size field := siblings + spouses + parents + children + self
        val familySizeUDF = udf { (sibsp: Double, parch: Double) => sibsp + parch + 1 }

        df.withColumn("fsize", familySizeUDF(col("sibsp"), col("parch"))) // <-- full freedom to overwrite
      }

      def fillMissing(df: DataFrame): DataFrame = {
        // Fill missing age values with average age
        // (first() returns a Row, so extract the Double before passing it to na.fill)
        val avgAge = df.select("age").agg(avg("age")).first().getDouble(0)

        // Fill missing embarked values with default "S" (i.e. Southampton)
        val embarkedUDF = udf { (e: String) => e match { case x if x == null || x.isEmpty => "S"; case x => x } }

        df.na.fill(Map("age" -> avgAge)).withColumn("embarked", embarkedUDF(col("embarked")))
      }
  • 30. Types can save us. And if you don’t believe me take a look at the code
      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

      // Modify the dataframe
      val allData = fillMissing(addFeatures(rawData)).cache() // <-- need to remember about caching
      // Split the data and cache it
      val Array(trainSet, testSet) = allData.randomSplit(Array(0.75, 0.25)).map(_.cache())

      // Prepare categorical columns
      val categoricalFeatures = Array("pclass", "sex", "embarked")
      val stringIndexers = categoricalFeatures.map(colName =>
        new StringIndexer().setInputCol(colName).setOutputCol(colName + "_index").fit(allData)
      )

      // Concat all the features into a numeric feature vector
      val allFeatures = Array("age", "sibsp", "parch", "fsize") ++ stringIndexers.map(_.getOutputCol)

      val vectorAssembler = new VectorAssembler().setInputCols(allFeatures).setOutputCol("feature_vector")

      // Prepare Logistic Regression estimator
      val logReg = new LogisticRegression().setFeaturesCol("feature_vector").setLabelCol("survived")

      // Finally build the pipeline with the stages above
      val pipeline = new Pipeline().setStages(stringIndexers ++ Array(vectorAssembler, logReg))
  • 31. Types can save us. And if you don’t believe me take a look at the code
      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
      import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
      import org.apache.spark.sql.DataFrame

      // Cross validate our pipeline with various parameters
      val paramGrid =
        new ParamGridBuilder()
          .addGrid(logReg.regParam, Array(1, 0.1, 0.01))
          .addGrid(logReg.maxIter, Array(10, 50, 100))
          .build()

      val crossValidator =
        new CrossValidator()
          .setEstimator(pipeline) // <-- set our pipeline here
          .setEstimatorParamMaps(paramGrid)
          .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("survived"))
          .setNumFolds(3)

      // Train the model & compute scores
      val model: CrossValidatorModel = crossValidator.fit(trainSet)
      val scores: DataFrame = model.transform(testSet)

      // Save the model for later use
      model.save("/models/titanic-model.ml")
  • 32. Where are we going and what have we learned
  • 33. Key takeaways • ML for B2B is a whole other beast • Spark ML is great, but it needs type safety • Simple and intuitive syntax saves you trouble down the road • Types in ML are incredibly useful • Scala has all the relevant facilities to provide the above • Modularity and reusability are key
  • 34. Going forward with Optimus Prime • Going beyond Spark ML for algorithms and small scale • Making everything smarter (feature eng, sanity checking, model selection) • Template generation • Improvements to developer interface