QuTrack: A Blockchain-based approach to Model
Lifecycle Management
2019 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy
sri@quantuniversity.com
QuantUniversity
2
• Model Life-cycle Management challenges
• Approaches
• QuTrack: A Blockchain-based approach to Model Lifecycle Management
• Demo
Agenda
3
Machine Learning Workflow
Data Scraping/Ingestion
Data Exploration
Data Cleansing and Processing
Feature Engineering
Model Evaluation & Tuning
Model Selection
Model Deployment/Inference
Modeling: Supervised, Unsupervised
Data Engineer, Dev Ops Engineer
Data Scientist/Quants
Software/Web Engineer
• AutoML
• Model Validation
• Interpretability
Robotic Process Automation (RPA) (Microservices, Pipelines)
• SW: Web/ Rest API
• HW: GPU, Cloud
• Monitoring
• Regression
• KNN
• Decision Trees
• Naive Bayes
• Neural Networks
• Ensembles
• Clustering
• PCA
• Autoencoder
• RMSE
• MAPE
• MAE
• Confusion Matrix
• Precision/Recall
• ROC
• Hyper-parameter tuning
• Parameter Grids
Risk Management/Compliance (All stages)
Analysts & Decision Makers
5
• Developing ML applications involves:
▫ Heuristics
▫ Best practices (templates)
▫ Lots of experimentation
▫ Many moving pieces
▫ No “right” answer
▫ Error creep: Even small untracked errors can throw off results
AI/ML application development => Design of Experiments
6
Source: Sculley et al., 2015 "Hidden Technical Debt in Machine Learning Systems"
Challenges
7
The reproducibility challenge
https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
8
• Repeatability (Same team, same experimental setup)
▫ The measurement can be obtained with stated precision by the same team using the
same measurement procedure, the same measuring system, under the same operating
conditions, in the same location on multiple trials. For computational experiments, this
means that a researcher can reliably repeat her own computation.
• Replicability (Different team, same experimental setup)
▫ The measurement can be obtained with stated precision by a different team using the
same measurement procedure, the same measuring system, under the same operating
conditions, in the same or a different location on multiple trials. For computational
experiments, this means that an independent group can obtain the same result using the
author’s own artifacts.
• Reproducibility (Different team, different experimental setup)
▫ The measurement can be obtained with stated precision by a different team, a different
measuring system, in a different location on multiple trials. For computational
experiments, this means that an independent group can obtain the same result using
artifacts which they develop completely independently.
Repeatable or Reproducible or Replicable
https://www.acm.org/publications/policies/artifact-review-badging
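For a computational experiment, repeatability in the sense above usually starts with pinning every source of randomness. The following is a minimal sketch only; `run_experiment`, its seed argument, and the toy computation are hypothetical stand-ins and do not come from the presentation.

```python
import random
import statistics


def run_experiment(seed, n_samples=1000):
    """Toy 'experiment': the mean of seeded Gaussian samples.

    Hypothetical stand-in for a training run; the only point is that
    all randomness flows from a single recorded seed.
    """
    rng = random.Random(seed)  # private generator, not the global one
    samples = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return statistics.fmean(samples)


# Same team, same setup, same seed: the result repeats exactly.
first = run_experiment(seed=42)
second = run_experiment(seed=42)
print(first == second)
```

Replicability and reproducibility need more than a seed: the artifacts, environment and process also have to be recorded so that another team can rerun or rebuild the experiment.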
9
Many choices
Languages
Frameworks
Platforms
10
Elements of Model Risk Management
11
AI Governance is gaining focus
https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449
12
AI Governance is gaining focus
https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449
13
NLP pipeline
Stage 1: Data Ingestion from EDGAR
Stage 2: Pre-Processing
Stage 3: Invoking APIs to label data
Stage 4: Compare APIs
Stage 5: Build a new model for sentiment analysis
• Amazon Comprehend API
• Google API
• Watson API
• Azure API
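One simple way to approach the "Compare APIs" stage is to measure how often each pair of labeling services agrees on the same documents. The sketch below assumes the labels have already been collected; the provider names and labels are made-up toy values, and no vendor API is called.

```python
from itertools import combinations


def agreement_rate(labels_a, labels_b):
    """Fraction of documents on which two providers gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must cover the same documents")
    if not labels_a:
        raise ValueError("at least one labeled document is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def compare_providers(labels_by_provider):
    """Pairwise agreement for every pair of providers."""
    return {
        (a, b): agreement_rate(labels_by_provider[a], labels_by_provider[b])
        for a, b in combinations(sorted(labels_by_provider), 2)
    }


# Hypothetical toy labels for four documents (not real API output).
toy_labels = {
    "provider_a": ["pos", "neg", "neu", "pos"],
    "provider_b": ["pos", "neg", "pos", "pos"],
    "provider_c": ["neg", "neg", "neu", "pos"],
}
for pair, rate in compare_providers(toy_labels).items():
    print(pair, rate)
```

Low pairwise agreement is one signal that a purpose-built sentiment model (Stage 5) may be worth the effort.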
14
• Programming environment
• Execution environment
• Hardware specs
• Cloud
• GPU
• Dependencies
• Lineage/Provenance of individual components
• Model params
• Hyper parameters
• Pipeline specifications
• Model specific
• Tests
• Data versions
Data Model
Environment Process
Components that need to be tracked
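As an illustration of the "Environment" quadrant, the interpreter version, hardware description and installed dependencies can be captured with the Python standard library. This is a sketch under assumptions: the function name and the dictionary keys are hypothetical, and cloud or GPU details would need provider-specific tooling that is not shown.

```python
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Return a dictionary describing the current execution environment."""
    dependencies = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            dependencies.add(f"{name}=={dist.version}")
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "machine": platform.machine(),
        "executable": sys.executable,
        "dependencies": sorted(dependencies),
    }


snapshot = snapshot_environment()
print(snapshot["python_version"], len(snapshot["dependencies"]))
```

Storing such a snapshot next to the data version and model parameters gives a later run something concrete to compare against.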
15
Source: T. van der Weide, O. Smirnov, M. Zielinski, D. Papadopoulos, and T. van Kasteren. Versioned machine learning pipelines for batch experimentation. In ML Systems, Workshop NIPS 2016, 2016.
Provenance and Lineage of pipelines
16
Schemas proposed
Sebastian Schelter, Joos-Hendrik Boese, Johannes Kirschnick, Thoralf Klein, and Stephan Seufert. Automatically
Tracking Metadata and Provenance of Machine Learning Experiments. NIPS Workshop on Machine Learning
Systems, 2017.
17
Schemas proposed
G. C. Publio, D. Esteves, and H. Zafar, “ML-Schema : Exposing the Semantics of Machine Learning with Schemas
and Ontologies,” in Reproducibility in ML Workshop, ICML’18, 2018.
18
MLflow
19
DVC
20
GoCD
21
22
I. Altintas, O. Barney, and E. Jaeger-Frank. Provenance collection support in the Kepler scientific workflow
system. In Provenance and annotation of data, pages 118–132.
Current approaches
23
Miao, Hui & Chavan, Amit & Deshpande, Amol. (2016). ProvDB: A System for Lifecycle Management of
Collaborative Analysis Workflows.
Current approaches
24
Related work
Xueping Liang, Sachin Shetty, Deepak Tosh, Charles Kamhoua, Kevin Kwiat, and Laurent Njilla. 2017. ProvChain: A Blockchain-based Data Provenance Architecture in Cloud
Environment with Enhanced Privacy and Availability. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '17). IEEE Press,
Piscataway, NJ, USA, 468-477. DOI: https://doi.org/10.1109/CCGRID.2017.8
Focus on Cloud data provenance using Blockchain
25
Related work
Ramachandran, Aravind & Kantarcioglu, Murat. (2017). Using Blockchain and smart contracts for secure data provenance management.
DataProv: Built on top of Ethereum, the platform utilizes smart contracts and the Open Provenance Model (OPM) to record immutable data trails.
26
Related work
Sarpatwar, Kanthi & Vaculín, Roman & Min, Hong & Su, Gong & Heath, Terry & Ganapavarapu, Giridhar & Dillenberger, Donna. (2019). Towards Enabling Trusted Artificial
Intelligence via Blockchain. 10.1007/978-3-030-17277-0_8.
Trusted AI and provenance of AI models
27
28
QuSandbox research suite
Model Analytics Studio
QuSandbox
QuTrack
QuResearchHub
Prototype, Iterate and tune
Standardize workflows
Productionize and share
Track experiments
29
The four components that need to be encapsulated for
reproducible pipelines
Code Data
Environment Process
30
QuSandbox
31
32
Model Management Studio
33
• JDF: Job Definition File; A DSL for representing Model Pipelines
• Stage
• Entity
▫ Model
▫ Data
▫ Environment
• Version format
▫ M:m:p -> Major Version: Minor Version: Patch
Terms
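The M:m:p version format can be handled with a small value type. The sketch below is an illustration, not part of the JDF DSL: the colon separator comes from the slide, while the `Version` class, its `bump` method, and the rule that a bump resets the lower fields are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Version:
    """Major:Minor:Patch version, compared field by field."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text):
        parts = text.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected M:m:p, got {text!r}")
        major, minor, patch = (int(part) for part in parts)
        return cls(major, minor, patch)

    def bump(self, level):
        # Assumed rule: bumping a field resets the fields below it.
        if level == "major":
            return Version(self.major + 1, 0, 0)
        if level == "minor":
            return Version(self.major, self.minor + 1, 0)
        if level == "patch":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown level {level!r}")

    def __str__(self):
        return f"{self.major}:{self.minor}:{self.patch}"


current = Version.parse("1:4:2")
print(current.bump("minor"))  # 1:5:0
```

Comparing the integer fields rather than the raw strings keeps 1:10:0 ordered after 1:9:9.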
34
JDF- DSL
35
QuResearchHub
36
• Ability to track the evolution of experiments
• Snapshot and store the lineage of pipelines
• Ensure auditability and secure access to archived pipelines
• Track experiments and their associated parameters
• Track all aspects of experiments to ensure reproducibility
• For high-impact (regulated, critical) applications, ensure experiment
traces are immutable and verifiable
QuTrack Design requirements
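The "immutable and verifiable" requirement is commonly met with a hash chain, the same idea that underlies ledger databases and blockchains: each entry stores the hash of the previous one, so any later edit breaks verification. The sketch below illustrates that general technique only; it is not QuTrack's implementation, and the `ExperimentLedger` class and payload fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash, payload):
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class ExperimentLedger:
    """Append-only, hash-chained log of experiment records."""

    def __init__(self):
        self._entries = []

    def append(self, payload):
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        entry = {
            "prev": prev_hash,
            "payload": payload,
            "hash": _digest(prev_hash, payload),
        }
        self._entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute every hash; False if any entry was altered."""
        prev_hash = GENESIS
        for entry in self._entries:
            expected = _digest(prev_hash, entry["payload"])
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


ledger = ExperimentLedger()
ledger.append({"experiment": "exp-1", "lr": 0.01})
ledger.append({"experiment": "exp-1", "lr": 0.001})
print(ledger.verify())
```

An in-memory list offers no protection against whoever holds the process; a managed ledger or a blockchain keeps the chain where the experimenter cannot quietly rewrite it.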
37
Metadata
• Data about the information to be tracked
• Includes version number, timestamps, user information, MD5 of the artifacts
and high-level notes
Data
• Pipelines, custom DSL, standard formats for representing models
• Events (Updates, rollbacks)
• JSON, Amazon ION, YAML
Artifacts
• Model Pickle files, ONNX, Core ML, Model params
• Data, blobs etc.
Architecture : What’s tracked ?
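A metadata record with the fields listed above might be assembled as follows. This is a minimal sketch: the field names, the `build_metadata` helper, and the example values are hypothetical, and only the categories (version number, timestamp, user information, MD5 of the artifact, notes) come from the slide.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_metadata(artifact_bytes, version, user, notes=""):
    """Describe one tracked artifact without storing the artifact itself."""
    return {
        "version": version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "artifact_md5": hashlib.md5(artifact_bytes).hexdigest(),
        "notes": notes,
    }


# Stand-in bytes; in practice these would be read from a model file.
record = build_metadata(b"hello", version="1:0:0", user="alice",
                        notes="baseline model")
print(json.dumps(record, indent=2))
```

JSON is used here because it is one of the formats named on the slide; `hashlib.sha256` could be substituted for `hashlib.md5` with no other change if a stronger digest is wanted.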
38
Blockchain-based:
• QLDB
• Ethereum
Non-Blockchain-based:
• MongoDB
Architectures supported
39
Amazon Quantum Ledger Database (QLDB)
• Fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log.
• SQL-like API
• Cost-effective
Amazon QLDB
40
Amazon QLDB
41
Amazon QLDB
https://aws.amazon.com/qldb/faqs/
42
Demo: Reference App
43
• Support for ONNX, Core ML
• Integration with:
▫ MLflow, DVC, GoCD
• Integration with SCM systems
▫ GitHub, SVN
• Tracking Back tests
• Push Architecture -> Event-Driven Architecture
• Enriched Analytics
• Roles and Authorization
Future work
Thank You!
If you are interested in trying
out QuSandbox,
Please sign up for updates at:
www.qusandbox.com
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
44
