1.Statistics and Probability (1)
In this 1.5-hour course, we'll cover everything from experimental design to the
distribution questions you'll need to know, all focused on questions that are
extremely common in DS interviews.
What this will be:
• An overview of the most common topics and themes across full time Data Science
interviews
• A set of practice problems curated to test those common themes and let you know what
you know well, and what you don't
How to best sit for this session:
1. Take notes on topics you don't know for self study later
2. Ask copious questions, please interrupt if you'd like
Experimental Design:
• A/B testing design
◦ Sample Size
◦ Test Length
◦ Applications for recommendation algorithms
◦ Ability to implement an A/B test, Python/R skills if handed the result of an A/B test
◦ Communication around A/B testing is vital
Experiment interpretation:
• p-values (how to interpret them, and how not to)
• Confidence intervals
◦ CI Generation, interpretation
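As a rough illustration of p-value and CI generation for an A/B test on conversion rates, here is a minimal sketch using a two-proportion z-test and a Wald interval. The function name and the counts are hypothetical, and the normal approximation is an assumption that only holds for reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_summary(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Two-sided z-test p-value and Wald CI for the conversion-rate difference (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # The pooled standard error applies under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The unpooled standard error is used for the interval around the observed difference.
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    return diff, p_value, ci

# Hypothetical counts, for illustration only.
diff, p_value, ci = two_proportion_summary(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"difference={diff:.4f}, p-value={p_value:.4f}, CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

A useful interview talking point: the interval describes the plausible range for the true difference, while the p-value only says how surprising the observed difference would be if there were no difference at all.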
Example A: Let's say your company has run an experiment and has found a p-value of
0.04 for the difference between the two groups. How do you figure out if the test is valid?
Example C: How would you explain to a non-technical coworker what Confidence Intervals
are?
Example D: When does Bootstrapping fail?
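One way to explore Example D is to compare a statistic where the percentile bootstrap behaves well (the mean) with one where it is known to struggle (the maximum). This is a minimal sketch on synthetic data; `bootstrap_ci` is a hypothetical helper, not a library function.

```python
import random
import statistics

def bootstrap_ci(data, stat, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat([rng.choice(data) for _ in range(n)]) for _ in range(n_resamples)
    )
    alpha = 1 - confidence
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Synthetic data drawn from Uniform(0, 1), for illustration only.
data_rng = random.Random(42)
sample = [data_rng.uniform(0, 1) for _ in range(50)]

mean_ci = bootstrap_ci(sample, statistics.mean)
max_ci = bootstrap_ci(sample, max)

print("mean CI:", mean_ci)
# A resampled maximum can never exceed the observed maximum, so this
# interval cannot reach the true upper bound of 1.0.
print("max CI:", max_ci, "observed max:", max(sample))
```

Other situations worth mentioning in an answer include very small samples, heavy-tailed data, and dependent observations such as time series, where resampling individual points breaks the dependence structure.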
Theoretical Distributions:
• Should know:
◦ Mean/variance formulae
◦ How to detect when described
◦ What the necessary conditions are (e.g. Poisson requires events to be independent and
the number of events must be unbounded)
• Normal/Gaussian
◦ symmetric, unimodal, and asymptotic
◦ Appears in the Central Limit Theorem as the limiting distribution of sample means
(compare with the Law of Large Numbers)
• Binomial:
◦ multiple trials where each trial can succeed or fail
• Bernoulli:
◦ binomial when n = 1
◦ only possibilities are success and failure (coin flip)
• Geometric:
◦ number of failed trials before the first Bernoulli success
• Poisson:
◦ observed independent events over fixed time period
◦ The key parameter is the mean number of events over the fixed time period, often
denoted lambda
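The mean/variance formulae for the discrete distributions listed above can be collected into a small reference sketch. The function names are hypothetical. The geometric version here counts failures before the first success; if the success trial is counted too, the mean becomes 1/p and the variance is unchanged.

```python
def bernoulli_moments(p):
    """Mean and variance of a single success/failure trial."""
    return p, p * (1 - p)

def binomial_moments(n, p):
    """Mean and variance of the number of successes in n independent trials."""
    return n * p, n * p * (1 - p)

def geometric_moments(p):
    """Mean and variance of the number of failures before the first success."""
    return (1 - p) / p, (1 - p) / p ** 2

def poisson_moments(lam):
    """Mean and variance of a Poisson count; both equal lambda."""
    return lam, lam

for name, moments in [
    ("Bernoulli(p=0.1)", bernoulli_moments(0.1)),
    ("Binomial(n=10, p=0.1)", binomial_moments(10, 0.1)),
    ("Geometric(p=0.5)", geometric_moments(0.5)),
    ("Poisson(lambda=5)", poisson_moments(5)),
]:
    print(name, "mean=%.3f variance=%.3f" % moments)
```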
Example A:
Say you have two options for how to deliver an ad in a newsletter. For the first option, you
will put an ad at the top and bottom of every newsletter. For the second, at every paragraph,
you have a ten percent chance of placing an ad there. What type of distribution does each of
these follow?
Example B: continuing from above, what is the variance and expected value for each
option?
Example C: continuing from above, how often would you expect ads to be shown right
next to each other? If we have 10 paragraphs, what is the expected number of adjacent
ads?
Example D: continuing from above, what about your answers for A-C would change if in
scenario two you were not allowed to run two ads in a row?
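For Example C, one possible approach uses linearity of expectation, checked by simulation. This sketch assumes each paragraph gets an ad independently with probability 0.1, and counts every neighbouring pair of paragraphs that both carry an ad (so three ads in a row count as two pairs). Both assumptions are interpretations of the question, not the only valid reading.

```python
import random

def expected_adjacent_pairs(n_paragraphs, p):
    """Expected number of adjacent ad pairs, by linearity of expectation."""
    # There are n - 1 neighbouring pairs; each holds two ads with probability p * p.
    return (n_paragraphs - 1) * p * p

def simulate_adjacent_pairs(n_paragraphs, p, n_trials=100_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        ads = [rng.random() < p for _ in range(n_paragraphs)]
        total += sum(a and b for a, b in zip(ads, ads[1:]))
    return total / n_trials

exact = expected_adjacent_pairs(10, 0.1)
approx = simulate_adjacent_pairs(10, 0.1)
print(f"exact={exact:.3f}, simulated={approx:.3f}")
```

The linearity argument works even though neighbouring pairs overlap and are therefore not independent of each other, which is a point interviewers often probe.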
Example E:
Say you were a data scientist for McDonalds, and you had data about cars coming through
the drive-through line at a specific location over a time period. What theoretical distribution
could you fit this data to? How would you use this data to figure out how many people to
keep running the drive-through?
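For Example E, a common starting point is to model arrivals per fixed interval as Poisson. The sketch below estimates lambda from counts, checks whether the variance is close to the mean, and finds a high quantile that could feed a staffing decision. The counts are made up, and the helper names are hypothetical.

```python
from math import exp
from statistics import mean, variance

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    prob = exp(-lam)
    for i in range(1, k + 1):
        prob *= lam / i
    return prob

def poisson_quantile(lam, q):
    """Smallest k such that P(X <= k) >= q."""
    k = 0
    cumulative = poisson_pmf(0, lam)
    while cumulative < q:
        k += 1
        cumulative += poisson_pmf(k, lam)
    return k

# Hypothetical cars per 15-minute interval, for illustration only.
counts = [2, 8, 5, 3, 7, 5, 4, 6, 9, 5, 1, 5]

lam_hat = mean(counts)                   # maximum-likelihood estimate of lambda
dispersion = variance(counts) / lam_hat  # a ratio near 1 is consistent with Poisson
busy_level = poisson_quantile(lam_hat, 0.95)

print(f"lambda={lam_hat}, dispersion={dispersion:.2f}, 95th percentile={busy_level} cars")
```

Turning the quantile into a headcount would additionally require an assumption about how many cars one worker can serve per interval, and in practice lambda would likely be estimated separately by time of day.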
Law Of Large Numbers:
• Over many repetitions, the average outcome of an experiment (or of values drawn from a
distribution) converges to the expected value
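A quick simulation makes this concrete. The sketch below rolls a fair six-sided die, whose expected value is 3.5, and tracks the running average as the number of rolls grows; the die is just an illustrative choice.

```python
import random

def running_means(draws):
    """Return the average of the first i draws, for each i."""
    total = 0.0
    means = []
    for i, x in enumerate(draws, start=1):
        total += x
        means.append(total / i)
    return means

rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]
means = running_means(rolls)

# The running average should drift toward 3.5 as n grows.
for n in (10, 100, 1_000, 100_000):
    print(n, round(means[n - 1], 3))
```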
Power Analysis:
• Sample size calculations for setting up experiments
• How to interpret statistical power
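A sample size calculation for comparing two conversion rates can be sketched with the standard normal-approximation formula. The function name, the baseline and variant rates, and the default alpha and power are all illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_baseline, p_variant, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    variance_sum = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    effect = p_variant - p_baseline
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / effect ** 2)

# Hypothetical scenario: detect a lift from 10% to 12% conversion.
n = sample_size_per_group(0.10, 0.12)
print("users needed per group:", n)
```

Power here is the probability of detecting the stated lift if it truly exists. Halving the detectable effect roughly quadruples the required sample, which is why the minimum effect worth detecting should be agreed before the test starts.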
Statistical Tests:
• T-test: testing means
• Z-test: testing means when the population variance is known or the sample is large
(normality assumption)
• Chi-square test: categorical data
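As an example of a test on categorical data, here is a minimal sketch of a Pearson chi-square test of independence for a 2x2 table, written without external libraries. The counts are hypothetical, no continuity correction is applied, and the p-value shortcut only works because a 2x2 table has one degree of freedom.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square tail equals a two-sided normal tail.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical table: rows are variants A and B, columns are converted / not converted.
statistic, p_value = chi_square_2x2(200, 1800, 250, 1750)
print(f"chi-square={statistic:.3f}, p-value={p_value:.4f}")
```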
Bayes Rule:
• How to apply to common situations
• P(A|B) = (P(A) * P(B|A)) / P(B)
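The formula can be applied directly once P(B) is expanded with the law of total probability. In this sketch the function name and every number are hypothetical placeholders, not values from the session; the loop shows how repeated independent pieces of evidence update the belief one step at a time.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """P(A|B) via Bayes rule, with P(B) expanded over A and not-A."""
    evidence = (
        prior_a * likelihood_b_given_a
        + (1 - prior_a) * likelihood_b_given_not_a
    )
    return prior_a * likelihood_b_given_a / evidence

# Hypothetical values: 50% prior, evidence is 3x more likely under A than under not-A.
belief = 0.5
for observation in range(1, 4):
    belief = posterior(belief, 0.9, 0.3)
    print(f"after observation {observation}: P(A | evidence) = {belief:.4f}")
```

Each independent observation multiplies the odds of A by the same likelihood ratio, so three observations with ratio 3 move even odds to 27 to 1.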
Suppose we have 100 raters for the PSG vs Bayern game the other day, and they all rate it
independently. What is the expected number of good ratings?
Suppose we have 100 games that are all rated by the same person independently. What is the
expected number of good ratings?
If we have 3 games that were all rated by the same person, and all three are rated as good,
what is the probability the rater was in the 80% vs the 20%?