Data Science Product Development
(CSE 679)
Lecture 1 (22nd Jan)
Today’s agenda
● What this course is about.
● Course overview.
● Course logistics.
● Review of basic data science models.
About me
● Did my Ph.D. in theoretical computer science.
● Wrote mathematical proofs / algorithms for:
○ Similarity search.
○ Dimensionality reduction.
○ Uncertain dataset analysis.
○ Information distances.
About me
● My experience informs the course.
● Spent a lot of time on text analytics and retail.
● Have worked with both mega-orgs and startups.
● Fond of math fundamentals, but focused on delivering products.
● An experimental approach is key.
About me
● Interned at Google, 2014 and 2015.
● Interned at Microsoft Research, 2013.
● Text analytics engineer at Qualtrics (2016-19).
● Data scientist at Wayfair (2019-2020).
● Data scientist at a startup, Aktify (2020-present).
Conception of Course
● Different types of data scientists.
● Overlapping skill sets, different focuses.
Discussion of skill sets.
DS Analyst:
● Efficiently query / consume massive data through SQL,
scraping, etc.
● Design and run massive ML models offline.
● Present insights to the org to drive further action.
● May integrate with a ready-made pipeline.
● Valuable as part of large teams and orgs.
Discussion of skill sets.
DS Engineer:
● Build massive ML models, robustly.
● Ensure data ingestion runs reliably in production.
● Understand deployment and ongoing experimentation frameworks.
● Understand optimization and interpretability.
● Results oriented.
● Backbone of a small business or team.
So what about us?
● Will focus on Data Science Engineering.
● The full product cycle, from model to productization.
● The steps needed to release a fully realized model.
● Focus on steps not often covered in core courses.
● Will also involve software engineering, data engineering, and data collection processes.
Why should you care?
● A DS model that is only 75% integrated has 0% business value.
● A data science engineer can complete the “last mile”.
● A data scientist can productize more scientifically.
● Understand the needs of your team/business better.
● Understand the constraints to build your models within.
Course prereqs
● Extensive programming experience, esp. Python.
● Probability, linear algebra, statistics.
● Familiarity with core data science models, like clustering,
regression, decision trees.
● Familiarity with basic machine learning.
Main course modules
1. Data collection and handling - cloud databases, data annotation packages.
2. Deep learning - use popular Python packages and tools, rather than reinvent the wheel.
3. Model optimization - quantization, distillation, experimentation.
4. Model interpretability - saliency maps, attention, sensitivity.
5. Deployments - scheduling, microservices, APIs, working in the cloud.
Logistics, grading and details.
● 150 minutes is an upper bound on class time; we will often finish earlier.
● Let’s go through the course syllabus for more details now.
● It’s available here: Syllabus
Case study: Business Ask
● An airline bigwig contacts you and your friends.
● They want you to build a reservation tool.
● It works over email and phone.
● Your mission is to build the conversational agent.
Case study: The POC
● Your engineer friend sets up the software.
● She wants you to build an API that can classify 1000+ messages a minute.
● It should be an accurate, reliable, fast POC.
● You have exams next month and a friend’s wedding.
● How do you do this fast?
Case study: The Research
● You want to build a T-800 ML model.
● You Google the best NLP models.
● DistilBERT seems popular: DistilBERT.
● This GitHub repo is highly popular: HuggingFace.
● But it will take too long to productize for the POC.
Case study: The Research
● You start looking for commercially available tools.
● LUIS, Dialogflow, Rasa, and IBM Watson are in vogue.
● But which one will perform best?
● You come across a promising benchmark paper: comparison paper.
● Your team is already fully on Google.
● Based on this, you think Dialogflow sounds good.
Case study: The Setup
● You work through a tutorial or two on Dialogflow.
● You look through data to get a sense of common “intents”.
● Maybe there’s “reschedule”, “cancel”, “delay”, “add person”, “change seats”.
Case study: Integrate.
● Your engineer friend needs an API to hit.
● You look up the Dialogflow API docs.
● You look up the Google API key for your agent.
● You tell your friend to set up billing.
● You write Python code to call “detect_intent” (sketched below).
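A minimal sketch of what that integration code might look like, using the google-cloud-dialogflow Python client. The project ID and session ID here are placeholders, and authentication is assumed to be configured through a service-account key.

```python
# Sketch: classify one message with Dialogflow's detect_intent.
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service-account key.
from google.cloud import dialogflow

def detect_intent(project_id: str, session_id: str, text: str) -> str:
    """Send one user message to the agent and return the matched intent."""
    session_client = dialogflow.SessionsClient()
    session = session_client.session_path(project_id, session_id)
    text_input = dialogflow.TextInput(text=text, language_code="en")
    query_input = dialogflow.QueryInput(text=text_input)
    response = session_client.detect_intent(
        request={"session": session, "query_input": query_input}
    )
    result = response.query_result
    print(f"{result.intent.display_name} "
          f"(confidence {result.intent_detection_confidence:.2f})")
    return result.intent.display_name

# e.g. detect_intent("my-airline-project", "user-123", "cancel my flight")
```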
Case study: It’s a POC, not a hack.
● You want some metrics.
● Do an 80-20 split of the training data to measure accuracy (see the sketch below).
● Check accuracy on airline logs, academic data, and POC customers.
● Set up a dashboard for your team, perhaps with Tableau.
● Realize Dialogflow has a ceiling; you need a better model while your POC runs.
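A sketch of that 80-20 check. Here `classify` is a stand-in for the real call to the model or the deployed agent (the `detect_intent` sketch above would do), and the toy data is purely illustrative.

```python
# Sketch of the 80-20 accuracy check on labeled (text, intent) pairs.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ["cancel my flight", "add my spouse to the booking",
         "move me to a window seat", "my flight is delayed",
         "reschedule me to Friday"]
labels = ["cancel", "add_person", "change_seats", "delay", "reschedule"]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

classify = lambda text: "cancel"  # placeholder for the real prediction call
predictions = [classify(t) for t in test_texts]
print("held-out accuracy:", accuracy_score(test_labels, predictions))
```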
Case study: Back to Research.
● Do a deep dive on the papers.
● Amazon has interesting ideas; so does PolyAI.
● But… they’re stingy and don’t open source their code.
● Fine. You can do this with a BERT model from HuggingFace; their repositories are hugely popular.
● Time to dig into their tutorials and notebooks (see the sketch below).
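A minimal sketch of setting DistilBERT up for intent classification with HuggingFace's `transformers` package. The label set mirrors the intents above; real use would fine-tune on labeled data first (e.g. with a `Trainer` or a standard PyTorch loop).

```python
# Sketch: DistilBERT as a five-way intent classifier, before fine-tuning.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

intents = ["reschedule", "cancel", "delay", "add_person", "change_seats"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(intents)
)

inputs = tokenizer("I need to cancel my flight", return_tensors="pt")
logits = model(**inputs).logits
print(intents[logits.argmax(dim=-1).item()])  # untrained: output is random
```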
Case study: Back to Research.
● You want to verify it works, but deep learning needs a GPU to train effectively.
● You can’t afford a GPU machine until you’re sure it performs.
● So what to do?
● Google Colab provides a GPU for free: example Colab.
● For assignments that need any deep learning, you will use Colab to run and submit code.
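A quick sanity check you can run in a Colab cell, after enabling a GPU runtime, to confirm the free GPU is actually visible:

```python
# Run inside Colab with Runtime -> Change runtime type -> GPU enabled.
import torch

print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```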
Case study: Back to Research.
● Great, we trained a model and it performs accurately.
● But our engineer friend still wants an API.
● OK, let’s use FastAPI to wrap our model in an API (sketched below).
● Let’s use a microservice to host the API.
● Let’s push to Google Cloud or AWS to host this.
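A minimal FastAPI sketch of that wrapper. `classify` is again a stand-in for the trained model; the service would be run with uvicorn.

```python
# Minimal FastAPI wrapper around the classifier. Run locally with:
#   uvicorn main:app --host 0.0.0.0 --port 8000
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Message(BaseModel):
    text: str

def classify(text: str) -> str:
    """Placeholder for the trained model's prediction."""
    return "cancel"

@app.post("/intent")
def detect_intent(message: Message):
    # Engineering hits POST /intent with {"text": "..."} and gets the intent.
    return {"intent": classify(message.text)}
```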
Case study: And we’re at V0
● After our new and improved model is out, we’re earning 2 million
dollars a year.
● But our competitor claims better performance.
● We’re not handling delays due to Covid well.
● We clearly need more labelled data to improve our model, and
discover missing intents.
● We also collect more training data from academic datasets and web scraping.
Case study: Collect more data.
● You can’t easily manage 10 contractors labeling 100K data points in a spreadsheet.
● You need more modern tools.
● Doccano, Snorkel, SuperAnnotate, and Mechanical Turk are just a few of the options.
● You decide to pick Doccano for your particular use case (a loading sketch follows).
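As one hedged illustration, loading an annotation export back into training data might look like this. The JSONL schema below ("text" and "label" fields) is an assumption roughly matching Doccano's text-classification exports; verify it against your own export, since the format varies by project type.

```python
# Sketch: read an assumed JSONL annotation export (one JSON object per line,
# with "text" and "label" keys) into (text, label) training pairs.
import json

def load_annotations(path: str):
    examples = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            examples.append((record["text"], record["label"]))
    return examples

# e.g. pairs = load_annotations("doccano_export.jsonl")
```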
Case study: Business is scaling!
● Your friend complains that she can’t handle all the customers.
● Your model makes 5 predictions a second per microservice.
● You can’t afford to run enough instances to handle the load.
● You decide to optimize, compress, and distill your model for higher throughput (one option sketched below).
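One cheap optimization to sketch here is post-training dynamic quantization in PyTorch, which converts Linear layers to int8. The model below is a placeholder for the fine-tuned classifier, and real throughput gains should always be benchmarked.

```python
# Sketch: dynamic quantization of the classifier's Linear layers to int8.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=5
)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# `quantized` drops in for `model` at inference time, with smaller weights
# and (typically) faster CPU inference.
```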
Case study: Changing policies.
● You’ve been manually uploading a new reservations model when
flight policies change.
● Each change adds or removes the “delay_no_charges” intent.
● But with Covid changes and business growth, a human can’t manage it.
● You need an automated workflow to retrain new models and pull new data from the airlines’ databases.
● You decide to use Prefect to manage your workflows (sketched below).
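A sketch of that workflow using Prefect's 2.x API (the older 1.x `Flow` API differs). `pull_new_data` and `retrain_model` are hypothetical stand-ins for the real steps.

```python
# Sketch: a Prefect 2.x flow wiring data pulls into retraining.
from prefect import flow, task

@task
def pull_new_data():
    # Stand-in for a query against the airlines' databases.
    return [("refund my ticket, no charges?", "delay_no_charges")]

@task
def retrain_model(examples):
    # Stand-in for the real fine-tuning job.
    print(f"retraining on {len(examples)} examples")

@flow
def retrain_pipeline():
    examples = pull_new_data()
    retrain_model(examples)

if __name__ == "__main__":
    retrain_pipeline()  # could also be deployed on a schedule
```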
Case study: Training your model.
● At this point, you have a lot of money and compute resources.
● But you don’t have the time to find new hyperparameters for each retrain.
● Brute-forcing all combinations is still too slow.
● So you decide to use HyperOpt or Optuna (sketched below).
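A sketch of what the search could look like with Optuna. The objective below is a placeholder; a real one would train with the sampled values and return validation accuracy.

```python
# Sketch: hyperparameter search with Optuna instead of brute force.
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    # Placeholder score; replace with train-then-evaluate on validation data.
    return -(lr - 1e-3) ** 2 + batch_size * 1e-6

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```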
Case study: Experimenting
● The airlines want to try different cancellation policies and see how they impact revenue.
● You’re also curious whether people want more to-the-point agents, or ones that give more options.
● But you don’t want to code each experiment by hand.
● You decide Facebook’s Ax or PlanOut experimentation platforms should be integrated into your project (an Ax sketch follows).
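A sketch using Ax's Service API. The parameter names and the revenue computation here are hypothetical; in production the "evaluation" would be revenue observed from live traffic.

```python
# Sketch: letting Ax propose experiment arms instead of hand-coding them.
from ax.service.ax_client import AxClient

ax_client = AxClient()
ax_client.create_experiment(
    name="cancellation_policy",
    parameters=[
        {"name": "fee_fraction", "type": "range", "bounds": [0.0, 0.5]},
        {"name": "verbose_agent", "type": "choice", "values": [True, False]},
    ],
    objective_name="revenue",
)

for _ in range(10):
    params, trial_index = ax_client.get_next_trial()
    revenue = 100 * (0.5 - params["fee_fraction"])  # placeholder metric
    ax_client.complete_trial(trial_index=trial_index, raw_data=revenue)

print(ax_client.get_best_parameters())
```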
Case study: Understand weak points.
● Congrats! You’re now a director of data science.
● You have staff managing training data curation.
● But they’re struggling to understand why the data for “delay flight” is sometimes misclassified.
● So you decide they need tools to see why the model makes certain decisions.
● You decide to look at Captum or the What-If Tool (a Captum sketch follows).
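A sketch of attribution with Captum's IntegratedGradients on a toy feed-forward model. For a transformer intent classifier you would instead attribute over embeddings (e.g. with LayerIntegratedGradients), but the idea is the same.

```python
# Sketch: which input features drove the "delay flight" prediction?
import torch
from captum.attr import IntegratedGradients

model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 5))
model.eval()

ig = IntegratedGradients(model)
inputs = torch.randn(1, 8)
# Attribute the class at index 2 (hypothetically "delay flight") back to
# the input features; large magnitudes mark influential features.
attributions = ig.attribute(inputs, target=2)
print(attributions)
```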
Case study: Recap.
● Data science product delivery moves from POC to advanced models.
● Research into the state of the art is important.
● Data collection and curation is never-ending.
● Models need to be optimized, deployed, and accessible to engineering.
● Product choices need sound experimental tools.
● Parameter choices cannot always be hand-tuned.
● Need to continually deepen understanding / interpretability of
models.
Class implications.
● We’ll go over each of the cycle steps over the course.
● We will discuss the tools and concepts across the lectures.
● The class project should incorporate each step.
● Your research report tests your ability to survey the literature.
● The project presentation tests your ability to convey results.
I care more about methodology - feel free to experiment with the vast number of specific tools available.
Break time!
Review material.
● Will lean on external material for review.
● Some nice slides for learning a decision tree.
● Some slides for random forests and XGBoost.
● Some slides for k-means++ (a quick code refresher follows).
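For a quick refresher in code, here is a toy run of a decision tree and k-means; note that scikit-learn's KMeans uses k-means++ initialization by default.

```python
# Review sketch: a decision tree and k-means (k-means++ init) on toy blobs.
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("tree accuracy:", tree.score(X, y))

kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print("cluster centers:", kmeans.cluster_centers_)
```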
What comes next.
● Will upload more review resources over the next week.
● Form your groups of 3-4 people.
● If you don’t have a group by next week, email me and I will randomly assign you to one.
● We’ll go over example course project ideas next week.
● We’ll go through some deep learning packages online next week.
● The first homework will be released.