QuTrack: A Blockchain-based approach to Model
Lifecycle Management
2019 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy
s.Krishnamurthy@neu.edu
• Northeastern University
• QuantUniversity : A NEU IDEAS startup
2
• Model Life-cycle Management challenges
• Approaches
• QuTrack: A Blockchain-based approach to Model Lifecycle
Management
• Demo
Agenda
3
Machine Learning Workflow
• Stages
▫ Data Scraping/Ingestion
▫ Data Exploration
▫ Data Cleansing and Processing
▫ Feature Engineering
▫ Modeling (Supervised, Unsupervised)
▫ Model Evaluation & Tuning
▫ Model Selection
▫ Model Deployment/Inference
• Roles
▫ Data Engineer, DevOps Engineer
▫ Data Scientist/Quants
▫ Software/Web Engineer
▫ Analysts & Decision Makers
▫ Risk Management/Compliance (all stages)
• AutoML
• Model Validation
• Interpretability
• Robotic Process Automation (RPA) (Microservices, Pipelines)
• SW: Web/REST API
• HW: GPU, Cloud
• Monitoring
• Supervised: Regression, KNN, Decision Trees, Naive Bayes, Neural Networks, Ensembles
• Unsupervised: Clustering, PCA, Autoencoder
• Evaluation & tuning: RMS, MAPS, MAE, Confusion Matrix, Precision/Recall, ROC, Hyper-parameter tuning, Parameter Grids
5
• Developing ML applications involves:
▫ Heuristics
▫ Best practices (templates)
▫ Lots of experimentation
▫ Many moving pieces
▫ No “right” answer
▫ Error creep: Even small untracked errors can throw off results
AI/ML application development => Design of Experiments
6
Source: Sculley et al., 2015 "Hidden Technical Debt in Machine Learning Systems"
Challenges
7
The reproducibility challenge
https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
8
• Repeatability (Same team, same experimental setup)
▫ The measurement can be obtained with stated precision by the same team using the
same measurement procedure, the same measuring system, under the same operating
conditions, in the same location on multiple trials. For computational experiments, this
means that a researcher can reliably repeat her own computation.
• Replicability (Different team, same experimental setup)
▫ The measurement can be obtained with stated precision by a different team using the
same measurement procedure, the same measuring system, under the same operating
conditions, in the same or a different location on multiple trials. For computational
experiments, this means that an independent group can obtain the same result using the
author’s own artifacts.
• Reproducibility (Different team, different experimental setup)
▫ The measurement can be obtained with stated precision by a different team, a different
measuring system, in a different location on multiple trials. For computational
experiments, this means that an independent group can obtain the same result using
artifacts which they develop completely independently.
Repeatable or Reproducible or Replicable
https://www.acm.org/publications/policies/artifact-review-badging
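A minimal sketch of what repeatability can look like for a computational experiment: the same seeded computation is run twice and the results are compared by digest. The `run_experiment` function, its parameters, and the choice of SHA-256 are hypothetical stand-ins for illustration, not something taken from the talk.

```python
import hashlib
import json
import random


def run_experiment(seed: int, n: int = 1000) -> dict:
    """Hypothetical stand-in for a computation: summarize a seeded random sample."""
    rng = random.Random(seed)  # local generator, so global random state cannot interfere
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return {"n": n, "mean": sum(sample) / n, "max": max(sample)}


def result_digest(result: dict) -> str:
    """Hash a canonical JSON encoding so two runs can be compared exactly."""
    encoded = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


first = result_digest(run_experiment(seed=42))
second = result_digest(run_experiment(seed=42))
other = result_digest(run_experiment(seed=7))

print("same seed, same digest:", first == second)
print("different seed, same digest:", first == other)
```

Replicability and reproducibility need more than a seed: the artifacts, environment and process also have to be available to a different team.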
9
Many choices
Languages
Frameworks
Platforms
10
Elements of Model Risk Management
11
AI Governance is gaining focus
https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449
12
AI Governance is gaining focus
https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449
13
NLP pipeline
• Stage 1: Data Ingestion from EDGAR
• Stage 2: Pre-Processing
• Stage 3: Invoking APIs to label data
• Stage 4: Compare APIs
• Stage 5: Build a new model for sentiment analysis
• Amazon Comprehend API
• Google API
• Watson API
• Azure API
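The "Compare APIs" stage could be sketched as a pairwise agreement check over the labels each service returns for the same sentences. The label values below are made up for illustration and are not real API output; the dictionary keys and the `agreement` helper are hypothetical names.

```python
from itertools import combinations
from typing import Sequence

# Hypothetical labels for the same five sentences, one list per API.
# These values are invented for illustration only.
labels = {
    "comprehend": ["pos", "neg", "neu", "pos", "neg"],
    "google": ["pos", "neg", "pos", "pos", "neg"],
    "watson": ["pos", "neu", "neu", "pos", "neg"],
    "azure": ["neg", "neg", "neu", "pos", "neg"],
}


def agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of sentences on which two label lists agree."""
    if len(a) != len(b):
        raise ValueError("label lists must cover the same sentences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Score every unordered pair of APIs once.
pairwise = {
    (m, n): agreement(labels[m], labels[n])
    for m, n in combinations(sorted(labels), 2)
}

for pair, score in pairwise.items():
    print(pair, f"{score:.2f}")
```

Tracking the inputs and outputs of this stage matters because the labels chosen here become the training data for Stage 5.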
14
• Programming environment
• Execution environment
• Hardware specs
• Cloud
• GPU
• Dependencies
• Lineage/Provenance of individual components
• Model params
• Hyper parameters
• Pipeline specifications
• Model specific
• Tests
• Data versions
Data, Model, Environment, Process
Components that need to be tracked
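One way to picture the four quadrants is a single record per run with one field per quadrant. This is only a sketch: the `TrackedRun` class, its field names, the sample bytes and the dependency pin are all hypothetical and are not QuTrack's schema.

```python
import hashlib
import json
import platform
from dataclasses import asdict, dataclass


@dataclass
class TrackedRun:
    """Hypothetical record with one field per quadrant on the slide."""
    data_version: str  # Data: digest of the input data
    model: dict        # Model: params and hyper parameters
    environment: dict  # Environment: interpreter, OS, dependencies
    process: list      # Process: ordered pipeline stage names


def capture_environment(dependencies: dict) -> dict:
    """Collect interpreter and OS details; dependency pins are passed in."""
    return {
        "python": platform.python_version(),
        "os": platform.system(),
        "machine": platform.machine(),
        "dependencies": dict(dependencies),
    }


raw_data = b"ticker,sentiment\nABC,pos\n"  # illustrative bytes, not real data
run = TrackedRun(
    data_version=hashlib.sha256(raw_data).hexdigest(),
    model={"params": {"n_features": 10}, "hyper_params": {"alpha": 0.1}},
    environment=capture_environment({"scikit-learn": "0.21.3"}),
    process=["ingest", "preprocess", "train", "evaluate"],
)

print(json.dumps(asdict(run), indent=2, sort_keys=True))
```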
15
Source: T. van der Weide, O. Smirnov, M. Zielinski, D. Papadopoulos, and T. van Kasteren. Versioned machine learning pipelines for batch experimentation. In ML Systems Workshop, NIPS 2016, 2016.
Provenance and Lineage of pipelines
16
Schemas proposed
Sebastian Schelter, Joos-Hendrik Boese, Johannes Kirschnick, Thoralf Klein, and Stephan Seufert. Automatically
Tracking Metadata and Provenance of Machine Learning Experiments. NIPS Workshop on Machine Learning
Systems, 2017.
17
Schemas proposed
G. C. Publio, D. Esteves, and H. Zafar, “ML-Schema : Exposing the Semantics of Machine Learning with Schemas
and Ontologies,” in Reproducibility in ML Workshop, ICML’18, 2018.
18
MLFlow
19
DVC
20
GoCD
21
22
I. Altintas, O. Barney, and E. Jaeger-Frank. Provenance collection support in the Kepler scientific workflow
system. In Provenance and annotation of data, pages 118–132.
Current approaches
23
Miao, Hui & Chavan, Amit & Deshpande, Amol. (2016). ProvDB: A System for Lifecycle Management of
Collaborative Analysis Workflows.
Current approaches
24
Related work
Xueping Liang, Sachin Shetty, Deepak Tosh, Charles Kamhoua, Kevin Kwiat, and Laurent Njilla. 2017. ProvChain: A Blockchain-based Data Provenance Architecture in Cloud
Environment with Enhanced Privacy and Availability. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '17). IEEE Press,
Piscataway, NJ, USA, 468-477. DOI: https://doi.org/10.1109/CCGRID.2017.8
Focus on Cloud data
provenance using
Blockchain
25
Related work
Ramachandran, Aravind & Kantarcioglu, Dr. (2017). Using Blockchain and smart contracts for secure data provenance management.
DataProv: Built on top of Ethereum, the platform utilizes smart contracts and open provenance model (OPM) to record immutable data trails.
26
Related work
Sarpatwar, Kanthi & Vaculín, Roman & Min, Hong & Su, Gong & Heath, Terry & Ganapavarapu, Giridhar & Dillenberger, Donna. (2019). Towards Enabling Trusted Artificial
Intelligence via Blockchain. 10.1007/978-3-030-17277-0_8.
Trusted AI and
provenance of AI models
27
28
QuSandbox research suite
Model Analytics Studio
QuSandbox
QuTrack
QuResearchHub
Prototype, Iterate and tune
Standardize workflows
Productionize and share
Track experiments
29
The four components that need to be encapsulated for reproducible pipelines
Code, Data, Environment, Process
30
QuSandbox
31
32
Model Management Studio
33
• JDF: Job Definition File; A DSL for representing Model Pipelines
• Stage
• Entity
▫ Model
▫ Data
▫ Environment
• Version format
▫ M:m:p -> Major Version: Minor Version: Patch
Terms
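The `M:m:p` version format can be sketched as a small value type. The `Version` class and its methods are hypothetical names. The colon separator follows the slide, while resetting the lower components on a bump is an assumption borrowed from common semantic-versioning practice, not something the talk states.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """Hypothetical M:m:p version (major, minor, patch)."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        """Parse a colon-separated string such as '1:2:3'."""
        parts = text.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected M:m:p, got {text!r}")
        return cls(*(int(p) for p in parts))

    def bump(self, level: str) -> "Version":
        """Return a new version; lower components reset to 0 (assumed convention)."""
        if level == "major":
            return Version(self.major + 1, 0, 0)
        if level == "minor":
            return Version(self.major, self.minor + 1, 0)
        if level == "patch":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown level: {level!r}")

    def __str__(self) -> str:
        return f"{self.major}:{self.minor}:{self.patch}"


current = Version.parse("1:2:3")
print(current.bump("minor"))
```

Because the type is a tuple, versions compare component by component, so `1:10:0` sorts after `1:2:3`, which a plain string comparison would get wrong.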
34
JDF- DSL
35
QuResearchHub
36
• Ability to track the evolution of experiments
• Snapshot and store the lineage of pipelines
• Ensure auditability and secure access to archived pipelines
• Track experiments and their associated parameters
• Track all aspects of experiments to ensure reproducibility
• For high-impact (regulated, critical) applications, ensure experiment
traces are immutable and verifiable
QuTrack Design requirements
37
Metadata
• Data about the information to be tracked
• Includes version number, timestamps, user information, MD5 of the artifacts and high-level notes
Data
• Pipelines, custom DSL, standard formats for representing models
• Events (updates, rollbacks)
• JSON, Amazon Ion, YAML
Artifacts
• Model pickle files, ONNX, Core ML, model params
• Data, blobs, etc.
Architecture: What’s tracked?
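The metadata fields listed above could be assembled as in the following sketch. The function names, dictionary keys and sample values are hypothetical; only the choice of fields (version number, timestamp, user, MD5 of the artifact, notes) comes from the slide.

```python
import hashlib
import json
from datetime import datetime, timezone


def md5_of_bytes(payload: bytes) -> str:
    """MD5 hex digest of an artifact's bytes, as named on the slide."""
    return hashlib.md5(payload).hexdigest()


def build_metadata(artifact: bytes, version: str, user: str, notes: str) -> dict:
    """Assemble a hypothetical metadata record for one tracked artifact."""
    return {
        "version": version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "artifact_md5": md5_of_bytes(artifact),
        "notes": notes,
    }


# Illustrative inputs; a real artifact would be a model file read from disk.
record = build_metadata(b"example model bytes", "1:0:0", "analyst", "baseline run")
print(json.dumps(record, indent=2))
```

An MD5 digest is enough to notice accidental changes to an artifact; where deliberate tampering is a concern, a collision-resistant hash such as SHA-256 is the usual substitute.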
38
Blockchain-based:
• QLDB
• Ethereum
Non-Blockchain-based:
• MongoDB
Architectures supported
39
Amazon Quantum Ledger Database (QLDB)
• Fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log.
• SQL-like API
• Cost-effective
Amazon QLDB
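To make "immutable and cryptographically verifiable" concrete, the sketch below shows a generic hash-chained log in which each entry's hash covers the previous entry's hash. It illustrates the general idea only and does not reproduce how QLDB or QuTrack is implemented; the `HashChainLog` class and the event payloads are hypothetical.

```python
import hashlib
import json


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this entry's canonical JSON."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class HashChainLog:
    """Append-only log where changing any earlier entry breaks verification."""

    GENESIS = "0" * 64  # placeholder hash used before the first entry

    def __init__(self) -> None:
        self.entries = []  # list of (payload, hash) pairs

    def append(self, payload: dict) -> str:
        prev = self.entries[-1][1] if self.entries else self.GENESIS
        digest = entry_hash(prev, payload)
        self.entries.append((payload, digest))
        return digest

    def verify(self) -> bool:
        """Recompute every hash from the start and compare with the stored one."""
        prev = self.GENESIS
        for payload, digest in self.entries:
            if entry_hash(prev, payload) != digest:
                return False
            prev = digest
        return True


log = HashChainLog()
log.append({"event": "update", "version": "1:0:0"})
log.append({"event": "update", "version": "1:1:0"})
ok_before = log.verify()

log.entries[0][0]["version"] = "9:9:9"  # simulate tampering with an old entry
ok_after = log.verify()

print("before tampering:", ok_before)
print("after tampering:", ok_after)
```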
40
Amazon QLDB
41
Amazon QLDB
https://aws.amazon.com/qldb/faqs/
42
Demo: Reference App
43
44
• Support for ONNX, Core ML
• Integration with:
▫ MLFlow, DVC, GoCD
• Integration with SCM systems
▫ GitHub, SVN
• Tracking backtests
• Push Architecture -> Event-Driven Architecture
• Enriched Analytics
• Roles and Authorization
Future work
Thank You!
If you are interested in trying
out QuSandbox,
Please sign up for updates at:
www.qusandbox.com
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
45


2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain approach - Srikanth Krishnamurthy, October 7, 2019
