1.Statistics and Probability (1)

This document outlines a 1.5-hour course focused on statistics and probability, specifically tailored for Data Science interviews. It covers key topics such as experimental design, theoretical distributions, and statistical tests, along with practical examples and questions. The course aims to provide an overview of common themes and practice problems rather than an exhaustive list of statistics concepts.

Uploaded by

lakshmisai1190
Copyright
© All Rights Reserved

Statistics and Probability

In this 1.5-hour course, we'll cover topics ranging from experimental design to the
distribution questions you'll need to know, all focused on questions that are
extremely common in DS interviews.
What this won't be:

• An exhaustive list of every stats concept out there


• A passive stats lecture
• A document containing everything you could ever see in a stats-related interview
What this will be:

• An overview of the most common topics and themes across full-time Data Science
interviews

• A set of practice problems curated to test those common themes and let you know what
you know well, and what you don't
How to best sit for this session:
1. Take notes on topics you don't know for self study later
2. Ask plenty of questions; please interrupt whenever you'd like

Experimental Design:
• A/B testing design
◦ Sample Size
◦ Test Length
◦ Applications for recommendation algorithms
◦ Ability to implement an A/B test; Python/R skills to analyze the result of an A/B test if handed one
◦ Communication around A/B testing is vital
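For the implementation point above, here is a minimal Python sketch of analyzing a finished A/B test with a two-proportion z-test. The function name and the conversion counts are made up for illustration, and a real analysis would check the experiment design before trusting the p-value.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 200/2000 conversions in control, 250/2000 in treatment.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The pooled variance is used because, under the null hypothesis, both groups share one conversion rate.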

Experiment interpretation:
• p-values (how to interpret them, and how not to)
• Confidence intervals

◦ CI Generation, interpretation
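One common way to generate a confidence interval is the percentile bootstrap. The sketch below is only an illustration: the helper name `bootstrap_ci`, the sample values, and the choice of 2000 resamples are all assumptions, not something prescribed by the course.

```python
import random
from statistics import mean

def bootstrap_ci(data, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical sample: session lengths in minutes.
sample = [3.1, 4.7, 2.2, 5.9, 4.4, 3.8, 6.1, 2.9, 4.0, 5.2]
lower, upper = bootstrap_ci(sample)
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```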

Example A: Let's say your company has run an experiment and has found a p-value
of .04 for the difference between the two groups. How do you figure out if the test is valid?

Example B: Find all the errors in this A/B testing design


Airbnb wants to figure out how changing their logo affects how often people return to their
website. In order to do this, they run a test where 1% of users see logo A (new) and 99% see
logo B and measure the return rates for their users. Every week, they run a t-test on the
results for the experiment. This week, they got a p-value of .02 and stopped the experiment,
and are going to switch their logo.
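One of the issues in this design, checking the results every week and stopping at the first significant p-value, can be illustrated by simulation. The sketch below runs A/A tests (no true difference) and compares the false positive rate with and without weekly peeking. It uses a two-proportion z-test rather than a t-test for simplicity, and the traffic numbers and return rate are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, conv_b, n):
    """Two-sided two-proportion z-test p-value with n users per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return 1.0
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_a - conv_b) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(n_sims=1000, weeks=8, users_per_week=50, rate=0.3, seed=1):
    """Return (false positive rate with weekly peeking, rate with one final test)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        peeked_significant = False
        for _ in range(weeks):
            # Both arms share the same true rate, so any "win" is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(users_per_week))
            conv_b += sum(rng.random() < rate for _ in range(users_per_week))
            n += users_per_week
            p = p_value(conv_a, conv_b, n)
            if p < 0.05:
                peeked_significant = True
        peeking_hits += peeked_significant
        final_hits += p < 0.05
    return peeking_hits / n_sims, final_hits / n_sims

peek_rate, final_rate = simulate()
print(f"false positives with peeking: {peek_rate:.3f}, single final test: {final_rate:.3f}")
```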

Example C: How would you explain to a non-technical coworker what Confidence Intervals
are?
Example D: When does Bootstrapping fail?

Theoretical Distributions:
• Should know:
◦ Mean/variance formulae
◦ How to detect when described
◦ What the necessary conditions are (e.g. Poisson requires events to be independent, and
the number of events must be unbounded)

• Normal/Gaussian
◦ symmetric, unimodal, and asymptotic
◦ Useful in Law of Large Numbers/Central Limit Theorem
• Binomial:
◦ multiple trials where each trial can succeed or fail
• Bernoulli:
◦ binomial when n = 1
◦ only possibilities are success and failure (coin flip)
• Geometric:
◦ number of trials before the first Bernoulli success
• Poisson:
◦ observed independent events over fixed time period
◦ The key parameter is the mean number of events over the fixed time period, often denoted as
lambda

◦ e.g. catching fish

Example A:
Say you have two options for how to deliver an ad in a newsletter. For the first option, you
will put an ad at the top and bottom of every newsletter. For the second, at every paragraph,
you have a ten percent chance of placing an ad there. What type of distribution does each of
these follow?

Example B: Continuing from above, what are the variance and expected value for each
option?
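One possible reading of the two options, sketched in Python: the first option always shows exactly two ads, and the second is a binomial count. The paragraph count of 10 is an assumption borrowed from Example C, as is the independence of each placement.

```python
def binomial_mean_var(n, p):
    """Mean and variance of a Binomial(n, p) count."""
    return n * p, n * p * (1 - p)

# Option 1: always exactly two ads, so the count is a constant.
option1_mean, option1_var = 2, 0

# Option 2: assumes 10 paragraphs, each getting an ad independently with p = 0.1.
option2_mean, option2_var = binomial_mean_var(10, 0.1)

print(option1_mean, option1_var)
print(round(option2_mean, 2), round(option2_var, 2))
```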

Example C: Continuing from above, how often would you expect ads to be shown right
next to each other? If we have 10 paragraphs, what is the expected number of adjacent
ads?
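A hedged sketch for the adjacency question, assuming "adjacent ads" means a pair of neighboring paragraphs that both carry an ad under the second option. Linearity of expectation gives a closed form, and a quick simulation (function names and simulation size are illustrative) serves as a sanity check.

```python
import random

def expected_adjacent_pairs(n_paragraphs, p):
    """Each of the n - 1 neighboring pairs is 'both ads' with probability p * p."""
    return (n_paragraphs - 1) * p * p

def simulated_adjacent_pairs(n_paragraphs, p, n_sims=50_000, seed=0):
    """Average number of neighboring ad pairs across simulated newsletters."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_sims):
        ads = [rng.random() < p for _ in range(n_paragraphs)]
        total += sum(a and b for a, b in zip(ads, ads[1:]))
    return total / n_sims

exact = expected_adjacent_pairs(10, 0.1)
simulated = simulated_adjacent_pairs(10, 0.1)
print(f"exact: {exact:.3f}, simulated: {simulated:.3f}")
```

Linearity of expectation applies even though overlapping pairs are not independent of each other.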

Example D: Continuing from above, how would your answers to A-C change if, in
scenario two, you were not allowed to run two ads in a row?

Example E:

Say you were a data scientist for McDonald's, and you had data about cars coming through
the drive-through line at a specific location over a time period. What theoretical distribution
could you fit this data to? How would you use this data to figure out how many people to
keep staffed at the drive-through?
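If you go with a Poisson model for arrivals, a minimal sketch of the fit could look like the following. The arrival counts, the 15-minute window, and the 95% planning threshold are all hypothetical, and a real answer would also need to account for rate changes across the day.

```python
from math import exp, factorial
from statistics import mean

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

def poisson_quantile(q, lam):
    """Smallest k with P(X <= k) >= q."""
    k, cdf = 0, poisson_pmf(0, lam)
    while cdf < q:
        k += 1
        cdf += poisson_pmf(k, lam)
    return k

# Hypothetical counts of cars per 15-minute window.
cars_per_window = [4, 6, 5, 7, 3, 5, 6, 4, 5, 5]
lam_hat = mean(cars_per_window)  # the Poisson MLE for lambda is the sample mean
busy_window = poisson_quantile(0.95, lam_hat)
print(f"lambda = {lam_hat}, plan for up to {busy_window} cars per window")
```

Staffing could then be sized for `busy_window` cars rather than the average, so that most windows are covered.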
Law of Large Numbers:
• In the long run, the average of repeated experiment results (or of values drawn from a
distribution) converges to the expected value
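A small simulation makes the Law of Large Numbers concrete. This sketch averages Uniform(0, 1) draws, whose expected value is 0.5; the seed and checkpoint sizes are arbitrary choices.

```python
import random

rng = random.Random(42)
running_total = 0.0
checkpoints = {}
for i in range(1, 100_001):
    running_total += rng.random()  # Uniform(0, 1), expected value 0.5
    if i in (10, 1_000, 100_000):
        checkpoints[i] = running_total / i

for n, avg in checkpoints.items():
    print(f"average after {n} draws: {avg:.4f}")
```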

Central Limit Theorem:


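• The means of independent, equal-sized samples drawn from almost any distribution (with
finite variance) are approximately normally distributed, and the approximation improves as
the sample size N grows

• standard deviation of the sample mean (standard error) = st_dev / N^.5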
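The st_dev / N^.5 relationship above can be checked by simulation. This sketch draws many samples from a skewed Exponential(1) distribution (mean 1, standard deviation 1) and compares the spread of the sample means with the prediction. The sample size and number of samples are arbitrary.

```python
import random
from statistics import mean, stdev

rng = random.Random(7)
N = 50             # size of each sample
n_samples = 5000   # number of samples drawn

# One mean per sample; each sample has N independent Exponential(1) draws.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(N)) / N for _ in range(n_samples)
]

observed_se = stdev(sample_means)
predicted_se = 1.0 / N**0.5
print(f"observed: {observed_se:.4f}, predicted: {predicted_se:.4f}")
```

A histogram of `sample_means` would look roughly bell-shaped even though the underlying distribution is far from normal.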


If we viewed the number of ads shown in Example B from above over time, what distribution
would the resulting values be most similar to? What would the mean be? How would you find
the expected 5th percentile of the samples?
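One way to approach the percentile question is the normal approximation from the Central Limit Theorem. The sketch below assumes the second ad option with 10 paragraphs and p = 0.1, and that each "sample" is the average ad count over 100 newsletters. Both numbers are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

n_paragraphs, p = 10, 0.1      # assumed, as in Example C
n_newsletters = 100            # hypothetical number of newsletters averaged

mu = n_paragraphs * p                       # mean ads per newsletter
sigma = sqrt(n_paragraphs * p * (1 - p))    # st_dev of ads per newsletter
standard_error = sigma / sqrt(n_newsletters)

# 5th percentile of the approximate normal distribution of the sample mean.
fifth_percentile = NormalDist(mu, standard_error).inv_cdf(0.05)
print(f"5th percentile of the average: {fifth_percentile:.3f}")
```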

Power Analysis:
• Sample size calculations for setting up experiments
• How to interpret statistical power
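As a rough sketch of a sample size calculation, the standard normal-approximation formula for comparing two proportions can be coded directly. The baseline rate of 10%, the target of 12%, and the defaults of alpha = 0.05 and power = 0.80 are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Users needed per group to detect p1 vs p2 with a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p1 - p2) ** 2)

n_needed = sample_size_per_group(0.10, 0.12)
print(f"about {n_needed} users per group")
```

Halving the detectable difference roughly quadruples the required sample size, which is why the minimum effect of interest matters so much in test planning.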

Statistical Tests:
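• T-test: testing means when the population variance is unknown
• Z-test: testing means under a normality assumption with known variance (or large samples)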
• Chi-square test: categorical data
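To make the chi-square test concrete, here is a minimal sketch of a Pearson test of independence on a 2x2 table, written without external libraries. The counts are hypothetical, no continuity correction is applied, and the p-value shortcut only holds for 1 degree of freedom.

```python
from math import erfc, sqrt

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the tail probability is erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))

# Hypothetical counts: rows are variants, columns are (returned, did not return).
stat, p_value = chi_square_2x2([[30, 70], [45, 55]])
print(f"chi-square = {stat:.2f}, p-value = {p_value:.4f}")
```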

Bayes Rule:
• How to apply to common situations
• P(A|B) = (P(A) * P(B|A)) / P(B)

Questions if we have time:


Let's say we pay people to watch soccer games and rate how good each game is.
For 80% of raters, they have a 60% chance of rating a game as good and a 40% chance of rating it as bad
For 20% of raters, they have a 100% chance of rating a game as good

What is the probability that a random rater rates a game as good?

What if we have 100 raters for the PSG vs Bayern game the other day, and they all rate it
independently. What is the expected number of good ratings?

What if we have 100 games that are all rated by the same person independently. What is the
expected number of good ratings?

If we have 3 games that were all rated by the same person, and they are all rated as good,
what is the probability the rater was in the 80% group vs. the 20% group?
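A possible way to work through the rater questions with exact fractions, applying the law of total probability and then Bayes' rule. The group labels `mixed` and `always_good` are made up for this sketch, and it treats a "high" rating and a "good" rating as the same outcome.

```python
from fractions import Fraction

prior = {"mixed": Fraction(4, 5), "always_good": Fraction(1, 5)}
p_good = {"mixed": Fraction(3, 5), "always_good": Fraction(1)}

# Law of total probability: P(good) for a randomly chosen rater.
p_good_overall = sum(prior[g] * p_good[g] for g in prior)

# By linearity of expectation, 100 independent raters average 100 * P(good).
expected_good = 100 * p_good_overall

# Bayes' rule after observing three good ratings from the same rater.
likelihood = {g: p_good[g] ** 3 for g in prior}
evidence = sum(prior[g] * likelihood[g] for g in prior)
posterior = {g: prior[g] * likelihood[g] / evidence for g in prior}

print(p_good_overall, expected_good)
print({g: float(v) for g, v in posterior.items()})
```

When one person rates 100 games, the expected count is the same, but the ratings are no longer independent across games because they share the rater's hidden group.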

Best Places to Learn about Stats:


Khan Academy
Interview Query
Naked Statistics

For help with statistical intuition


Chris Albon's Machine Learning Flashcards
TidyTuesday Data
For data cleaning/analysis practice, great for take-home practice
ISLR (An Introduction to Statistical Learning)
