Data Science Product Development
(CSE 679)
Lecture 1 (22nd Jan)
Today’s agenda
● What this course is about.
● Course overview.
● Course logistics.
● Review of basic data science models.
About me
● Did my Ph.D. in theoretical computer science.
● Wrote mathematical proofs / algorithms for:
○ Similarity search.
○ Dimensionality reduction.
○ Uncertain dataset analysis.
○ Information distances.
About me
● My experience informs the course.
● Spent a lot of time on text analytics and retail.
● Have worked with both mega-orgs and startups.
● Fond of math fundamentals, but focused on delivering products.
● An experimental approach is key.
About me
● Interned at Google, 2014 and 2015.
● Interned at Microsoft Research, 2013.
● Text analytics engineer at Qualtrics (2016-19).
● Data scientist at Wayfair (2019-2020).
● Data scientist at a startup, Aktify (2020-present).
Conception of Course
● Different types of data scientists.
● Overlapping skill sets, different focuses.
Discussion of skill sets.
DS Analyst:
● Efficiently query / consume massive data through SQL,
scraping, etc.
● Design and run massive ML models offline.
● Present insights to the org to drive further action.
● May integrate with a ready-made pipeline.
● Valuable as part of large teams and orgs.
Discussion of skill sets.
DS Engineer:
● Build massive ML models, robustly.
● Ensure data ingestion runs reliably in production.
● Understand deployment and ongoing experimentation frameworks.
● Understand optimization and interpretability.
● Results oriented.
● Backbone of a small business or team.
So what about us?
● Will focus on Data Science Engineering.
● The full product cycle, from model to productization.
● The steps needed to release a fully realized model.
● Focus on steps not often covered in core courses.
● Will also involve software engineering, data engineering, and data collection processes.
Why should you care?
● A DS model that is only 75% integrated has 0% business value.
● A data science engineer can complete the “last mile”.
● A data scientist can productize more scientifically.
● Understand the needs of your team/business better.
● Understand the constraints to build your models within.
Course prereqs
● Extensive programming experience, esp. Python.
● Probability, linear algebra, statistics.
● Familiarity with core data science models, like clustering,
regression, decision trees.
● Familiarity with basic machine learning.
Main course modules
1. Data collection and handling - cloud databases, data annotation packages.
2. Deep learning - use popular Python packages and tools, rather than reinvent the wheel.
3. Model optimization - quantization, distillation, experimentation.
4. Model interpretability - saliency maps, attention, sensitivity.
5. Deployments - scheduling, microservices, APIs, working in the cloud.
Logistics, grading and details.
● 150 minutes is an upper bound on class time; we will often finish earlier.
● Let’s go through the course syllabus for more details now.
● It’s available here: Syllabus
Case study: Business Ask
● An airline bigwig contacts you and your friends.
● They want you to build a reservation tool.
● It works over email and phone.
● Your mission is to build the conversational agent.
Case study: The POC
● Your engineer friend sets up the software.
● She wants you to build an API that can classify 1000+ messages a minute.
● It should be an accurate, reliable, fast POC.
● You have exams next month and a friend’s wedding.
● How do you do this fast?
Case study: The Research
● You want to build a T-800 ML model.
● You Google the best NLP models.
● DistilBERT seems popular: DistilBERT.
● This GitHub repo is highly popular: HuggingFace.
● But it will take too long to productize for the POC.
Case study: The Research
● You start looking for commercially available tools.
● LUIS, Dialogflow, Rasa, and IBM Watson are in vogue.
● But which one will perform best?
● You come across a promising benchmark paper: comparison paper.
● Your team is already fully on Google.
● Based on this, you think Dialogflow sounds good.
Case study: The Setup
● You work through a tutorial or two on Dialogflow.
● You look through data to get a sense of common “intents”.
● Maybe there’s “reschedule”, “cancel”, “delay”, “add person”, “change seats”.
Case study: Integrate.
● Your engineer friend needs an API to hit.
● You look up the Dialogflow API docs.
● You look up the Google API key for your agent.
● You tell your friend to set up billing.
● You write Python code to call “detect_intent” (sketched below).
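A minimal sketch of what that integration code might look like, using the google-cloud-dialogflow Python client. The project ID and session ID here are placeholders, and authentication is assumed to be configured through a service-account key.

```python
# Sketch: classify one message with Dialogflow's detect_intent.
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service-account key.
from google.cloud import dialogflow

def detect_intent(project_id: str, session_id: str, text: str) -> str:
    """Send one user message to the agent and return the matched intent."""
    session_client = dialogflow.SessionsClient()
    session = session_client.session_path(project_id, session_id)
    text_input = dialogflow.TextInput(text=text, language_code="en")
    query_input = dialogflow.QueryInput(text=text_input)
    response = session_client.detect_intent(
        request={"session": session, "query_input": query_input}
    )
    result = response.query_result
    print(f"{result.intent.display_name} "
          f"(confidence {result.intent_detection_confidence:.2f})")
    return result.intent.display_name

# e.g. detect_intent("my-airline-project", "user-123", "cancel my flight")
```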
Case study: It’s a POC, not a hack.
● You want some metrics.
● Do an 80-20 split of the training data to measure accuracy (see the sketch below).
● Check accuracy on airline logs, academic data, and POC customers.
● Set up a dashboard for your team, perhaps with Tableau.
● Realize Dialogflow has a ceiling; you need a better model while your POC runs.
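A sketch of that 80-20 check. Here `classify` is a stand-in for the real call to the model or the deployed agent (the `detect_intent` sketch above would do), and the toy data is purely illustrative.

```python
# Sketch of the 80-20 accuracy check on labeled (text, intent) pairs.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ["cancel my flight", "add my spouse to the booking",
         "move me to a window seat", "my flight is delayed",
         "reschedule me to Friday"]
labels = ["cancel", "add_person", "change_seats", "delay", "reschedule"]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

classify = lambda text: "cancel"  # placeholder for the real prediction call
predictions = [classify(t) for t in test_texts]
print("held-out accuracy:", accuracy_score(test_labels, predictions))
```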
Case study: Back to Research.
● Do a deep dive on the papers.
● Amazon has interesting ideas; so does PolyAI.
● But… they’re stingy and don’t open source their code.
● Fine. You can do this with a BERT model from HuggingFace; their repositories are hugely popular.
● Time to dig into their tutorials and notebooks (see the sketch below).
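A minimal sketch of setting DistilBERT up for intent classification with HuggingFace's `transformers` package. The label set mirrors the intents above; real use would fine-tune on labeled data first (e.g. with a `Trainer` or a standard PyTorch loop).

```python
# Sketch: DistilBERT as a five-way intent classifier, before fine-tuning.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

intents = ["reschedule", "cancel", "delay", "add_person", "change_seats"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(intents)
)

inputs = tokenizer("I need to cancel my flight", return_tensors="pt")
logits = model(**inputs).logits
print(intents[logits.argmax(dim=-1).item()])  # untrained: output is random
```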
Case study: Back to Research.
● You want to verify it works, but deep learning needs a GPU to train effectively.
● You can’t afford a GPU machine until you’re sure it performs.
● So what to do?
● Google Colab provides a GPU for free: example Colab.
● For assignments that need any deep learning, you will use Colab to run and submit code.
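A quick sanity check you can run in a Colab cell, after enabling a GPU runtime, to confirm the free GPU is actually visible:

```python
# Run inside Colab with Runtime -> Change runtime type -> GPU enabled.
import torch

print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```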
Case study: Back to Research.
● Great, we trained a model and it performs accurately.
● But our engineer friend still wants an API.
● OK, let’s use FastAPI to wrap our model in an API (sketched below).
● Let’s use a microservice to host the API.
● Let’s push to Google Cloud or AWS to host this.
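A minimal FastAPI sketch of that wrapper. `classify` is again a stand-in for the trained model; the service would be run with uvicorn.

```python
# Minimal FastAPI wrapper around the classifier. Run locally with:
#   uvicorn main:app --host 0.0.0.0 --port 8000
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Message(BaseModel):
    text: str

def classify(text: str) -> str:
    """Placeholder for the trained model's prediction."""
    return "cancel"

@app.post("/intent")
def detect_intent(message: Message):
    # Engineering hits POST /intent with {"text": "..."} and gets the intent.
    return {"intent": classify(message.text)}
```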
Case study: And we’re at V0
● After our new and improved model is out, we’re earning 2 million
dollars a year.
● But our competitor claims better performance.
● We’re not handling delays due to Covid well.
● We clearly need more labelled data to improve our model, and
discover missing intents.
● We also collect more training data from academic datasets and web scraping.
Case study: Collect more data.
● You can’t easily manage 10 contractors labeling 100K data points in a spreadsheet.
● You need more modern tools.
● Doccano, Snorkel, SuperAnnotate, and Mechanical Turk are just a few of the options.
● You decide to pick Doccano for your particular use case (a loading sketch follows).
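As one hedged illustration, loading an annotation export back into training data might look like this. The JSONL schema below ("text" and "label" fields) is an assumption roughly matching Doccano's text-classification exports; verify it against your own export, since the format varies by project type.

```python
# Sketch: read an assumed JSONL annotation export (one JSON object per line,
# with "text" and "label" keys) into (text, label) training pairs.
import json

def load_annotations(path: str):
    examples = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            examples.append((record["text"], record["label"]))
    return examples

# e.g. pairs = load_annotations("doccano_export.jsonl")
```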
Case study: Business is scaling!
● Your friend complains that she can’t handle all the customers.
● Your model makes 5 predictions a second per microservice.
● You can’t afford to run enough instances to handle the load.
● You decide to optimize, compress, and distill your model for higher throughput (one option sketched below).
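One cheap optimization to sketch here is post-training dynamic quantization in PyTorch, which converts Linear layers to int8. The model below is a placeholder for the fine-tuned classifier, and real throughput gains should always be benchmarked.

```python
# Sketch: dynamic quantization of the classifier's Linear layers to int8.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=5
)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# `quantized` drops in for `model` at inference time, with smaller weights
# and (typically) faster CPU inference.
```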
Case study: Changing policies.
● You’ve been manually uploading a new reservations model when
flight policies change.
● Each change adds or removes the “delay_no_charges” intent.
● But with Covid changes and business growth, a human can’t manage it.
● You need an automated workflow to retrain new models and pull new data from the airlines’ databases.
● You decide to use Prefect to manage your workflows (sketched below).
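A sketch of that workflow using Prefect's 2.x API (the older 1.x `Flow` API differs). `pull_new_data` and `retrain_model` are hypothetical stand-ins for the real steps.

```python
# Sketch: a Prefect 2.x flow wiring data pulls into retraining.
from prefect import flow, task

@task
def pull_new_data():
    # Stand-in for a query against the airlines' databases.
    return [("refund my ticket, no charges?", "delay_no_charges")]

@task
def retrain_model(examples):
    # Stand-in for the real fine-tuning job.
    print(f"retraining on {len(examples)} examples")

@flow
def retrain_pipeline():
    examples = pull_new_data()
    retrain_model(examples)

if __name__ == "__main__":
    retrain_pipeline()  # could also be deployed on a schedule
```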
Case study: Training your model.
● At this point, you have a lot of money and compute resources.
● But you don’t have the time to find new hyperparameters for each retrain.
● Brute-forcing all combinations is still too slow.
● So you decide to use HyperOpt or Optuna (sketched below).
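A sketch of what the search could look like with Optuna. The objective below is a placeholder; a real one would train with the sampled values and return validation accuracy.

```python
# Sketch: hyperparameter search with Optuna instead of brute force.
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    # Placeholder score; replace with train-then-evaluate on validation data.
    return -(lr - 1e-3) ** 2 + batch_size * 1e-6

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```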
Case study: Experimenting
● The airlines want to try different cancellation policies and see how they impact revenue.
● You’re also curious whether people want more to-the-point agents, or ones that give more options.
● But you don’t want to code each experiment by hand.
● You decide Facebook’s Ax or PlanOut experimentation platforms should be integrated into your project (an Ax sketch follows).
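A sketch using Ax's Service API. The parameter names and the revenue computation here are hypothetical; in production the "evaluation" would be revenue observed from live traffic.

```python
# Sketch: letting Ax propose experiment arms instead of hand-coding them.
from ax.service.ax_client import AxClient

ax_client = AxClient()
ax_client.create_experiment(
    name="cancellation_policy",
    parameters=[
        {"name": "fee_fraction", "type": "range", "bounds": [0.0, 0.5]},
        {"name": "verbose_agent", "type": "choice", "values": [True, False]},
    ],
    objective_name="revenue",
)

for _ in range(10):
    params, trial_index = ax_client.get_next_trial()
    revenue = 100 * (0.5 - params["fee_fraction"])  # placeholder metric
    ax_client.complete_trial(trial_index=trial_index, raw_data=revenue)

print(ax_client.get_best_parameters())
```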
Case study: Understand weak points.
● Congrats! You’re now a director of data science.
● You have staff managing training data curation.
● But they’re struggling to understand why the data for “delay flight” is sometimes misclassified.
● So you decide they need tools to see why the model makes certain decisions.
● You decide to look at Captum or the What-If Tool (a Captum sketch follows).
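A sketch of attribution with Captum's IntegratedGradients on a toy feed-forward model. For a transformer intent classifier you would instead attribute over embeddings (e.g. with LayerIntegratedGradients), but the idea is the same.

```python
# Sketch: which input features drove the "delay flight" prediction?
import torch
from captum.attr import IntegratedGradients

model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 5))
model.eval()

ig = IntegratedGradients(model)
inputs = torch.randn(1, 8)
# Attribute the class at index 2 (hypothetically "delay flight") back to
# the input features; large magnitudes mark influential features.
attributions = ig.attribute(inputs, target=2)
print(attributions)
```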
Case study: Recap.
● Data science product delivery moves from POC to advanced models.
● Research into the state of the art is important.
● Data collection and curation is never-ending.
● Models need to be optimized, deployed, and accessible to engineering.
● Product choices need sound experimental tools.
● Parameter choices cannot always be hand-tuned.
● Need to continually deepen understanding / interpretability of
models.
Class implications.
● We’ll go over each of the cycle steps over the course.
● We will discuss the tools and concepts across the lectures.
● The class project should incorporate each step.
● Your research report tests your ability to survey the literature.
● The project presentation tests your ability to convey results.
I care more about methodology - feel free to experiment with the vast number of specific tools available.
Break time!
Review material.
● Will lean on external material for review.
● Some nice slides for learning a decision tree.
● Some slides for random forests and XGBoost.
● Some slides for k-means++ (a quick code refresher follows).
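For a quick refresher in code, here is a toy run of a decision tree and k-means; note that scikit-learn's KMeans uses k-means++ initialization by default.

```python
# Review sketch: a decision tree and k-means (k-means++ init) on toy blobs.
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("tree accuracy:", tree.score(X, y))

kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print("cluster centers:", kmeans.cluster_centers_)
```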
What comes next.
● Will upload more review resources over the next week.
● Form your groups of 3-4 people.
● If you don’t have a group by next week, email me and I will randomly assign you to one.
● We’ll go over example course project ideas next week.
● We’ll go through some deep learning packages online next week.
● The first homework will be released.