Data Science Product Development Lecture 1

The document outlines a lecture for a course on data science product development, describing the instructor's background and experience, different types of data scientists and their skill sets, the main modules to be covered in the course including data collection, deep learning, model optimization and deployment, and logistics for the course.

Uploaded by

Daniyal Raza

Data Science Product Development

(CSE 679)
Lecture 1 (22nd Jan)
Today’s agenda
● What this course is about.
● Course overview.
● Course logistics.
● Review of basic data science models.

2
About me
● Did my Ph.D. in theoretical computer science.
● Wrote mathematical proofs / algorithms for:
○ Similarity search.
○ Dimensionality reduction.
○ Uncertain dataset analysis.
○ Information distances.

3
About me
● My experience informs the course.
● Lots of time spent on text analytics and retail.
● Have worked with mega-orgs and startups.
● Fond of math fundamentals, but focused on delivering
products.
● An experimental approach is key.

4
About me
● Interned at Google, 2014 and 2015.
● Interned at Microsoft Research, 2013.
● Text analytics engineer at Qualtrics (2016-19).
● Data scientist at Wayfair (2019-2020).
● Data scientist at a startup, Aktify (2020-present).

5
Conception of Course
● Different types of data scientists.
● Overlapping skill sets, different focuses.

DS Academic · DS Analyst · DS Engineer

6


Discussion of skill sets.
Academic:
● Novelty of research > results.
● Data often more at POC level, sometimes unrealistic.
● Ideas matter more than benchmarks.
● Productization / scale unnecessary.
● Important for developing broader field.

7
Discussion of skill sets.
DS Analyst:
● Efficiently query / consume massive data through SQL,
scraping, etc.
● Design and run massive ML models offline.
● Present insights to the org for further action.
● May integrate with a ready-made pipeline.
● Valuable as part of large teams and orgs.

8
Discussion of skill sets.
DS Engineer:
● Builds massive ML models, robustly.
● Ensures data ingestion and model runs work in production.
● Understands deployment and ongoing experimentation
frameworks.
● Understands optimization and interpretability.
● Results-oriented.
● Backbone of a small business or team.

9
So what about us?
● Will focus on Data Science Engineering.
● Full product cycle from model to productization.
● Steps needed to release fully realized model.
● Focus on steps not often covered in core courses.
● Will also involve software, data engineering and data
collection processes.

10
Why should you care?
● A DS model that’s only 75% integrated has 0% business value.
● A data science engineer can complete the “last mile”.
● A data scientist can productize more scientifically.
● Understand the needs of your team/business better.
● Understand the constraints to build your models within.

11
Course prereqs
● Extensive programming experience, esp. Python.
● Probability, linear algebra, statistics.
● Familiarity with core data science models, like clustering,
regression, decision trees.
● Familiarity with basic machine learning.

If you don’t have these, I strongly recommend you do a more
foundational course first.
12
Nice-to-haves, but not needed.
● Prior experience with deep learning.
● Experience with text analytics.
● Familiarity with working in the cloud.
● Will use deep learning more as a black box, for the most part.

13
Main course modules
1. Data collection and handling - cloud databases, data
annotation packages.
2. Deep learning - use popular Python packages and tools,
rather than reinvent the wheel.
3. Model optimization - quantization, distillation,
experimentation.
4. Model interpretability - saliency maps, attention, sensitivity.
5. Deployments - scheduling, microservices, APIs, working in
the cloud.
14
Logistics, grading and details.
● 150 minutes is an upper bound on class time; we will often finish
earlier.
● Let’s go through the course syllabus for more details now.
● It’s available here: Syllabus

16
Case study: Business Ask
● Airline bigwig contacts you and your friends.
● Wants you to build a reservation tool.
● Works over email and phone.
● Your mission is building the conversational agent.

17
Case study: The POC
● Your engineer friend sets up the software.
● Wants you to build an API that can classify 1000+ messages a
minute.
● Should be an accurate, reliable, fast POC.
● You have exams next month and a friend’s wedding.
● How do you do this fast?

18
Case study: The Research
● You want to build a T-800 ML model.
● You Google the best NLP models.
● DistilBERT seems popular: DistilBERT.
● This GitHub repo is highly popular: Hugging Face
● But will take too long to productize for POC.

19
Case study: The Research
● You start looking for commercially available tools.
● LUIS, Dialogflow, RASA, IBM Watson are in vogue.
● But which one will perform better?
● You come across a promising benchmark paper:
comparison paper.
● Your team is already fully on Google.
● Based on this, you think Dialogflow sounds good.

20
Case study: The Setup
● You work through a tutorial or
two on Dialogflow.
● You look through data to get a
sense of common “intents”.
● Maybe there’s “reschedule”,
“cancel”, “delay”, “add person”,
“change seats”.
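A small sketch of what that intent design can look like in code. The intent names follow the slide; the example utterances are made up for illustration:

```python
# Hypothetical intent -> example utterances map for the airline agent.
# The intent names come from the slide; the phrasings are invented.
INTENTS = {
    "reschedule": ["move my flight to Friday", "can I fly a day later?"],
    "cancel": ["cancel my booking", "I no longer need this ticket"],
    "delay": ["is flight AA12 delayed?", "how late is my flight?"],
    "add_person": ["add my daughter to the reservation"],
    "change_seats": ["can I get a window seat?", "swap me to an aisle seat"],
}

# Each intent needs several phrasings so the trained agent generalizes.
total_examples = sum(len(v) for v in INTENTS.values())
```

Each intent becomes a label class; the utterances become its training phrases in Dialogflow.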

21
Case study: Integrate.
● Your engineer friend needs an API to hit.
● You look up the Dialogflow API docs.
● You look up the Google “key” for your agent.
● You tell your friend to set up billing.
● You write Python code to “detect_intent”.
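A sketch of the request that code builds, following the shape of Dialogflow ES's public v2 REST endpoint. The project and session IDs are placeholders, and authentication (a bearer token from your service-account key) is omitted; the official google-cloud-dialogflow client wraps the same call:

```python
def build_detect_intent_request(project_id, session_id, text, language="en"):
    """Build the URL and JSON body for a Dialogflow ES v2 detectIntent call.

    Endpoint shape per the public v2 REST docs; auth headers omitted.
    """
    url = (
        f"https://dialogflow.googleapis.com/v2/projects/{project_id}"
        f"/agent/sessions/{session_id}:detectIntent"
    )
    payload = {"queryInput": {"text": {"text": text, "languageCode": language}}}
    return url, payload

# Placeholder IDs, just to show the shape of the request.
url, payload = build_detect_intent_request("airline-poc", "user-123", "cancel my flight")
# POST `payload` to `url` with an OAuth2 bearer token; the response's
# queryResult.intent.displayName field is the predicted intent.
```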

22
Case study: It’s a POC, not a hack.
● Want some metrics.
● Do an 80-20 split of training data for accuracy.
● Check accuracy on airline logs, academic data, POC customers.
● Set up a dashboard for your team, perhaps with Tableau.
● Realize Dialogflow has a ceiling, need a better model while your
POC runs.
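The 80-20 split and accuracy check above can be sketched in a few lines; the toy data and keyword "model" here are stand-ins just to exercise the metric:

```python
import random

def split_80_20(examples, seed=0):
    """Shuffle and split labelled examples into train/test (80/20)."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed for reproducibility
    cut = int(0.8 * len(examples))
    return examples[:cut], examples[cut:]

def accuracy(predict, test_set):
    """Fraction of (text, label) pairs the classifier gets right."""
    correct = sum(1 for text, label in test_set if predict(text) == label)
    return correct / len(test_set)

# Toy data and a trivial keyword rule standing in for the real agent.
data = [("cancel my flight", "cancel")] * 5 + [("move my flight", "reschedule")] * 5
train, test = split_80_20(data)
score = accuracy(lambda t: "cancel" if "cancel" in t else "reschedule", test)
```

In practice you would run the same `accuracy` call separately per data source (airline logs, academic data, POC customers) to see where the model is weak.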

23
Case study: Back to Research.
● Do a really deep dive on papers.
● Amazon has interesting ideas, so does Poly-AI.
● But… they’re stingy and don’t open-source their code.
● Fine. I can do this with a BERT model from Hugging Face; their
repositories are hugely popular.
● Time to dig into their tutorials and notebooks.

24
Case study: Back to Research.
● Want to verify it works, but deep learning needs a GPU to train
effectively.
● Can’t afford a GPU machine till I’m sure it performs.
● So what to do?
● Google Colab provides a GPU for free: example Colab.
● For assignments that need any deep learning, you will use Colab to
run and submit code.

25
Case study: Back to Research.
● Great, we trained a model and it performs accurately.
● But our engineer friend still wants an API.
● OK, let’s use FastAPI to wrap our model in an API.
● Let’s use a microservice to host the API.
● Let’s push to Google Cloud or AWS to host this.

26
Case study: And we’re at V0
● After our new and improved model is out, we’re earning 2 million
dollars a year.
● But our competitor claims better performance.
● We’re not handling delays due to Covid well.
● We clearly need more labelled data to improve our model, and
discover missing intents.
● Collect more training data from academic datasets and web
scraping as well.

27
Case study: Collect more data.
● You can’t manage 10 contractors to label 100K data points easily
in a spreadsheet.
● Need more modern tools.
● Frameworks like Doccano, Snorkel, SuperAnnotate and
Mechanical Turk are just a few of the options.
● You decide to pick Doccano for your particular use case.
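Labeling tools exchange data in simple machine-readable formats rather than spreadsheets; a sketch of a JSON Lines export in the shape Doccano's text-classification projects use (one object per line with "text" and "label" fields; worth verifying against your Doccano version):

```python
import json

# Two hypothetical labelled examples in Doccano-style JSONL.
examples = [
    {"text": "cancel my flight to Boston", "label": ["cancel"]},
    {"text": "can I move my flight to Friday?", "label": ["reschedule"]},
]

# One JSON object per line; this is what contractors' exports look like.
jsonl = "\n".join(json.dumps(ex) for ex in examples)

# Round-trip check: annotations come back in the same shape for training.
parsed = [json.loads(line) for line in jsonl.splitlines()]
```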

28
Case study: Business is scaling!
● Your friend complains that she can’t handle all the customers.
● Your model makes 5 predictions a second per microservice.
● Can’t afford to buy enough microservices to handle it.
● You decide to optimize, condense and distill your model to have
higher throughput.
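The core idea behind quantization shows up in a toy sketch: store weights as 8-bit integers plus a scale factor, trading a little precision for a 4x size cut versus 32-bit floats. Real toolkits apply this per layer with calibration; this is just the arithmetic:

```python
def quantize_int8(weights):
    """Map float weights onto int8 range [-127, 127] with one scale factor."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(qweights, scale):
    """Recover approximate float weights from the int8 codes."""
    return [q * scale for q in qweights]

weights = [0.51, -1.27, 0.003, 0.89]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Reconstruction error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Distillation is complementary: instead of shrinking the numbers, you train a smaller student model to match the big model's outputs.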

29
Case study: Changing policies.
● You’ve been manually uploading a new reservations model when
flight policies change.
● Adds or removes “delay_no_charges” intent.
● But with Covid changes, and business growth, a human can’t
manage it.
● You need an automated workflow to retrain models and pull
new data from airline databases.
● You decide to use Prefect to manage your workflows.

30
Case study: Training your model.
● At this point, you have a lot of money and compute resources.
● But you don’t have time to find new hyperparameters for your
model on each retrain.
● Brute forcing all combinations is still too slow.
● So you decide to use HyperOpt or Optuna.
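Stripped to its essence, what these tools automate is searching a parameter space within a trial budget, smarter than brute-force grid search. A plain random-search sketch of the same loop (the quadratic "loss" is a stand-in for a real validation metric):

```python
import random

def random_search(objective, space, n_trials=50, seed=0):
    """Sample hyperparameters at random and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

# Stand-in objective: pretend validation loss is minimized at lr=0.1, dropout=0.3.
def fake_val_loss(p):
    return (p["lr"] - 0.1) ** 2 + (p["dropout"] - 0.3) ** 2

space = {"lr": (0.001, 1.0), "dropout": (0.0, 0.5)}
best, loss = random_search(fake_val_loss, space, n_trials=200)
```

HyperOpt and Optuna replace the uniform sampling with adaptive strategies (e.g. tree-structured Parzen estimators) that concentrate trials near promising regions.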

31
Case study: Experimenting
● The airlines want to try different cancellation policies, and see how
it impacts revenue.
● You’re also curious whether people prefer more to-the-point agents,
or ones that give more options.
● But you don’t want to code each experiment by hand.
● You decide Facebook’s Ax or PlanOut experimentation platforms
should be integrated into your project.
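A key primitive such platforms provide is deterministic assignment: hash the user ID together with the experiment name so each customer always sees the same variant, and different experiments randomize independently. A hand-rolled sketch of that bucketing (the experiment and variant names are invented):

```python
import hashlib

def assign_variant(experiment, user_id, variants):
    """Deterministically bucket a user into one experiment arm."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

variants = ["terse_agent", "verbose_agent"]
arm = assign_variant("agent_style_v1", "customer-42", variants)
same = assign_variant("agent_style_v1", "customer-42", variants)
# The same customer always lands in the same arm, with no state to store.
```

Full platforms add what this sketch omits: weighted splits, logging, and analysis of the revenue impact per arm.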

32
Case study: Understand weak points.
● Congrats! You’re a director of data science now.
● You have staff managing training data curation.
● But they’re struggling to understand why the data for “delay
flight” is sometimes misclassified.
● So you decide they need tools to see why the ML makes certain
decisions.
● You decide to look at Captum or the What-If Tool.
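One simple interpretability technique in this family is occlusion, or sensitivity, analysis: drop each word and see how much the model's score for the predicted intent moves. A toy sketch with a dummy scorer standing in for a real model:

```python
def occlusion_importance(score_fn, words):
    """Score drop when each word is removed: bigger drop = more important."""
    base = score_fn(words)
    drops = {}
    for i, w in enumerate(words):
        without = words[:i] + words[i + 1:]
        drops[w] = base - score_fn(without)
    return drops

# Dummy scorer standing in for P("delay_flight" | text) from a trained model.
def toy_score(words):
    return 0.9 if "delayed" in words else 0.2

drops = occlusion_importance(toy_score, ["my", "flight", "is", "delayed"])
# "delayed" gets the largest drop; the filler words score 0.0.
```

Tools like Captum compute analogous attributions with gradients instead of brute-force occlusion, which scales to long inputs.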

33
Case study: Recap.
● Data science product delivery moves from POC to advanced
models.
● Research into the state of the art is important.
● Data collection and curation is never-ending.
● Models need to be optimized, deployed and accessible to
engineering.
● Product choices need sound experimental tools.
● Parameter choices cannot always be hand-tuned.
● Need to continually deepen understanding / interpretability of
models.
34
Class implications.
● We’ll go over each step of the cycle during the course.
● Will discuss the tools and concepts across lectures.
● Your class project should incorporate each step.
● Your research report tests your ability to survey literature.
● Your project presentation tests your ability to convey results.

I care more about methodology - feel free to experiment with the vast
number of specific tools available.
36
Break time!

37
Review material.
● Will lean on external material for review.
● Some nice slides for learning a decision tree.
● Some slides for random forests and XG-Boost.
● Some slides for k-means++.

38
What comes next.
● Will upload more review resources over next week.
● Form your groups of 3-4 people.
● If you don’t have a group by next week, email me and I will
assign you to one randomly.
● We’ll go over example course project ideas next week.
● We’ll go through some deep learning packages online next week.
● First homework will be released.

39
