Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, MLflow, and Jupyter

Josh Johnston
Director of AI Science
josh.johnston@kount.com

©Kount Inc All Rights Reserved
Overview
Model lifecycle
Our fraud-detecting model
Initial method with database and scikit-learn
Improved method with HDFS and Spark
Robust model governance

Manage the model lifecycle
Microsoft. (2017, October 19). What is the Team Data Science Process? Retrieved March 26, 2019, from
https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
Modeling
• Configuration management
• Performance (speed)
• Accuracy
• Validation
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
Science is repeatable

Kount protects digital innovations from…
• Fraudulent Account Creation
• Transaction/Payment Fraud
• Account Takeover Fraud
• Authentication Friction

Evaluate transactions for fraud
• Substantial throughput
  • 30-100 transactions per second
• Low latency
  • 250 ms end-to-end system latency
  • ~15 ms for machine learning features and model
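
The figures above imply how small the model's slice of the budget is. As a rough worked example, assuming a steady arrival rate (which the slide does not state), the share of the latency budget and the average number of evaluations in flight follow from simple arithmetic and Little's law:

```python
# Figures from the slide above.
END_TO_END_MS = 250      # end-to-end system latency budget
ML_MS = 15               # approximate time for features and model
TPS_RANGE = (30, 100)    # transactions per second, low and high

# Fraction of the end-to-end budget spent on machine learning.
ml_share = ML_MS / END_TO_END_MS

# Little's law: average items in flight = arrival rate * time in system.
# Assumes a steady arrival rate, which is a simplification.
in_flight = [tps * ML_MS / 1000 for tps in TPS_RANGE]

print(f"ML share of latency budget: {ml_share:.0%}")
print(f"Average ML evaluations in flight: {in_flight[0]:.2f} to {in_flight[1]:.2f}")
```

Under these assumptions the model stage uses a small share of the budget and averages fewer than two concurrent evaluations; bursts would push that higher.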

Boost Technology™ Customer View
• Approve an extra ~3K transactions and $1.2M USD per month
• Reduced manual reviews by 200 hours/month
• Reduced chargeback rate by 17%
• Reduced manual reviews by 20%
Fraud Manager Feedback:
• Sleep better at night
• Don’t hear complaints from fraud team about review queue anymore

Boost Technology™ Technical View
Feature Engineering
• 200 GB of precomputed data
Model
• Random forest
• 250 trees
• ~100k nodes per tree
• ~1GB serialized representation
Model Training
• ~150 features
• ~60M observations
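
The sizes on this slide can be cross-checked with a little arithmetic. The sketch below takes "~1GB" as 10**9 bytes and assumes 64-bit floats for the training matrix; both are assumptions, not figures from the talk.

```python
# Figures from the slide above; the byte size of "1 GB" is an assumption.
TREES = 250
NODES_PER_TREE = 100_000          # "~100k nodes per tree"
SERIALIZED_BYTES = 1_000_000_000  # "~1GB", taken here as 10**9 bytes

total_nodes = TREES * NODES_PER_TREE
bytes_per_node = SERIALIZED_BYTES / total_nodes

# Raw training matrix size if every value were a 64-bit float (an assumption).
FEATURES = 150
OBSERVATIONS = 60_000_000
dense_matrix_gb = FEATURES * OBSERVATIONS * 8 / 1e9

print(f"{total_nodes:,} nodes, about {bytes_per_node:.0f} bytes per node")
print(f"Dense float64 training matrix: about {dense_matrix_gb:.0f} GB")
```

The training matrix alone lands in the tens of gigabytes before any intermediate copies, which is consistent with training being memory-bound on a single machine.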

Initial training with database and scikit-learn

First approach gets to production
[Diagram: Analytics Database, Model Training Service, Network Storage. Steps: fetch observations, fetch lookups, lookup compute, train model (scikit-learn). Stored outputs: observation, lookup, flat-file logging, pickled model. Step times shown on the timeline: 16 hrs, 24 hrs, 8 hrs, 1 hr, 12 hrs.]
• Total time: 2.5 days
• 400 GB RAM
• 1 TB into swap

What works
• Trains a high value model

What doesn’t work
• Time-intensive
• Errors force restarts since everything is held in memory (and swap)
• Burdens production analytics database
• Pickled model ties execution environment to training environment
• Traceability provided by log files and manual documentation
• Ad hoc experiments with little configuration control
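
The coupling between a pickled model and its training environment is a property of pickle itself: the payload records the import path of each class and expects that code to be importable at load time. The sketch below shows this with a standard-library class standing in for a model, then contrasts it with a data-only tree description. The JSON layout and the `evaluate` helper are hypothetical illustrations, not the format used by any particular library.

```python
import json
import pickle
from fractions import Fraction

# Pickle stores a reference to the class by import path, not the class itself.
# Fraction stands in for a trained model object here.
payload = pickle.dumps(Fraction(1, 3))
print(b"fractions" in payload)   # the module path is embedded in the bytes

# A data-only description of a tiny decision tree (hypothetical format).
tree = {
    "feature": "amount", "threshold": 100.0,
    "left": {"score": 0.02},
    "right": {"score": 0.35},
}

def evaluate(node, row):
    """Walk the tree until a leaf with a score is reached."""
    while "score" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["score"]

# Round-trip through JSON: no Python classes are needed to load it.
restored = json.loads(json.dumps(tree))
print(evaluate(restored, {"amount": 250.0}))
```

A data-only representation can be evaluated by any runtime that understands the layout, whereas the pickle can only be loaded where compatible versions of the referenced modules are installed.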

Improved training with HDFS and Spark

Cluster for distributed computing
• Dell hardware
• 6 nodes
• 484 vCores
• 1.35 TB RAM
• Cloudera Manager
• Spark 2.4
• Mostly Python
HDFS
• Attached to 3 nodes
• 171 TB usable space

Improved approach through cluster
[Diagram: Analytics Database, sqoop data into HDFS, Spark Cluster. Steps: compute lookups, perform lookups, train model (Spark ML). Tools labeled: Luigi, MLflow. Stored outputs: observation, lookup, logging, zipped MLeap model. Step times shown on the timeline: 45 min, 2 hrs, 8 hrs.]
• Total time: <1/2 day
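
As a quick check on the two timelines, the labeled step times can be summed and compared. This assumes the labeled steps run one after another and cover the whole run, which the slides suggest but do not state.

```python
# Step times as labeled on the two timeline slides, in hours.
# Assumption: the labeled steps run one after another.
initial_steps = [16, 24, 8, 1, 12]
improved_steps = [0.75, 2, 8]      # 45 min, 2 hrs, 8 hrs

initial_total = sum(initial_steps)
improved_total = sum(improved_steps)

print(f"Initial: {initial_total} hrs = {initial_total / 24:.1f} days")
print(f"Improved: {improved_total} hrs")
print(f"Speed-up: about {initial_total / improved_total:.1f}x")
```

The sums agree with the stated totals of 2.5 days and under half a day.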

Remote development with Jupyter
• Most criticisms of notebooks are things you COULD do, not what you MUST do
• Good development practices are independent of tools
[Diagram: stages along a maturity axis from Research to Production: Jupyter Notebook, PySpark Application, Python Packages. Supporting practices: Version Control (git), Automation.]

What works
• Faster
• Failures restart in the middle
• Reduces burden on production analytics database
• Redesign experiments without penalty
• MLeap decouples evaluation environment from training environment
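
Restarting in the middle depends on each step persisting its output so that a rerun can skip completed work. Luigi treats a task as complete when its output target exists; the sketch below shows that idea in plain Python and does not use Luigi's API. The step names, file names, and `run_step` helper are hypothetical.

```python
import tempfile
from pathlib import Path

def run_step(name, out_path, func, log):
    """Run func only if its output file is missing, then write the output."""
    if out_path.exists():
        log.append(f"skip {name}")
        return
    result = func()
    # Write to a temporary name first so a crash never leaves a partial output.
    tmp = out_path.with_suffix(".tmp")
    tmp.write_text(result)
    tmp.rename(out_path)
    log.append(f"run {name}")

def pipeline(workdir, log, fail_training=False):
    """Hypothetical three-step pipeline: extract, lookups, train."""
    def train():
        if fail_training:
            raise RuntimeError("simulated training failure")
        return "model"
    run_step("extract", workdir / "observations.txt", lambda: "rows", log)
    run_step("lookups", workdir / "lookups.txt", lambda: "features", log)
    run_step("train", workdir / "model.txt", train, log)

log = []
with tempfile.TemporaryDirectory() as d:
    workdir = Path(d)
    try:
        pipeline(workdir, log, fail_training=True)   # first attempt fails late
    except RuntimeError:
        log.append("failed train")
    pipeline(workdir, log)                           # rerun resumes at train
print(log)
```

On the second call the first two steps are skipped because their outputs already exist, so only the failed step is repeated.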

What still doesn’t work
• Non-deterministic Spark ML behavior and errors
• Spark pipelines rely on configurations that change based on input data

Tools and Processes for Model Governance

Tools and processes for governance
Solution components
• Data traceability
• Experiment, configuration, and accuracy traceability

• Data pipelines with error handling
• Repeatable and documented data transformations
• Document parameters
• Trace to code and data used
• Record accuracy of selected and not selected models
• Store final model and configurations as artifact
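
The traceability items above amount to keeping one record per experiment that ties parameters, code, data, and accuracy together, for rejected candidates as well as the selected one. MLflow's tracking API covers this with calls such as `mlflow.log_param`, `mlflow.log_metric`, and `mlflow.log_artifact`; the sketch below is a standard-library stand-in that only illustrates what such a record might contain. Every parameter value, metric, path, and commit id in it is a made-up placeholder, not a result from the talk.

```python
import hashlib
import json

def fingerprint(obj):
    """Stable SHA-256 of any JSON-serializable object (keys sorted)."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()

def make_run_record(params, code_version, data_paths, metrics, selected):
    """Bundle what is needed to answer: which model, how trained, how good."""
    return {
        "params": params,
        "params_hash": fingerprint(params),
        "code_version": code_version,     # e.g. a git commit id
        "data_paths": sorted(data_paths),
        "metrics": metrics,
        "selected": selected,
    }

# Hypothetical candidates: all values below are illustrative placeholders.
candidates = [
    ({"num_trees": 100, "max_depth": 10}, {"auc": 0.90}),
    ({"num_trees": 250, "max_depth": 20}, {"auc": 0.93}),
]
best = max(range(len(candidates)), key=lambda i: candidates[i][1]["auc"])

records = [
    make_run_record(p, "abc1234", ["/data/obs", "/data/lookups"], m, i == best)
    for i, (p, m) in enumerate(candidates)
]
print(json.dumps(records, indent=2))
```

Hashing the sorted parameters gives a stable identifier for a configuration, so two runs can be compared or deduplicated without relying on manual notes.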

Kount’s benefits from Spark/HDFS, Luigi, and MLflow
• Faster
• Failures can restart in the middle
• Reduce burden on production analytics database
• Redesign experiments without penalty
• MLeap decouples evaluation environment from training environment
