Stats 101 - Class 01

Introduction - Probability Concepts and Decisions

Section 1: Introduction

Probability Concepts and Decisions

Bryon Aragam
Chicago Booth
[email protected]
https://canvas.uchicago.edu/courses/43775/

Suggested Reading:
Naked Statistics, Chapters 1, 2, 3, 5, 5.5 and 6
OpenIntro Statistics, Chapters 2, 3, and 4

Last Updated: September 25, 2022

Reminders

Some of you have asked about prerequisites / expectations.

Quick reminder: This is not a math class, but some things are expected:
- Manipulating equations, square roots, solving linear equations (y = mx + b), sums Σ_i
- e.g. if y = a²u + b²v + 2z, solve for u, v, or z
- e.g. if z = xy/(ab + cd), solve for a, b, c, or d
- Averages, proportions, percentages, etc.

NO binomial coefficients, calculus, limits, etc.
Why Probability?

A central theme in this course will be random sampling and the error associated with random sampling.

First, we need to understand probability and randomness.
Kidney stones

In a medical study comparing the success rates of two treatments for kidney stones, the following success probabilities were observed:

               Treatment A   Treatment B
Small stones       93%           87%
Large stones       73%           69%
Kidney stones

               Treatment A     Treatment B
Small stones   93% (81/87)     87% (234/270)
Large stones   73% (192/263)   69% (55/80)
Overall        78% (273/350)   83% (289/350)

The explanation for this surprising result is complicated: It involves studying the confounding variable of “severity” (as determined by a doctor); Treatment B was systematically preferred by doctors for less severe cases.

This is an example of Simpson’s paradox.
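For the curious, here is a minimal Python sketch that recomputes the rates from the counts above:

```python
# Treatment A wins within each stone size, yet Treatment B wins overall:
# Simpson's paradox. Counts are taken from the table above.
groups = {
    "A small": (81, 87),     "B small": (234, 270),
    "A large": (192, 263),   "B large": (55, 80),
    "A overall": (273, 350), "B overall": (289, 350),
}
for name, (successes, total) in groups.items():
    print(f"{name:9s}: {successes / total:.0%}")
```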

Ad targeting

Suppose you are deciding whether or not to target a customer with a promotion (or an ad)...

It will cost you $0.80 (eighty cents) to run the promotion, and a customer spends $40 if they respond to the promotion.

Should you do it?
Probability Basics

Introduction

Probability and statistics let us talk efficiently about things we are unsure about.
- How likely is an incumbent president to get re-elected?
- How much will Amazon sell next quarter?
- What will the return of my retirement portfolio be next year?
- How often will users click on a particular Facebook ad?

All of these involve inferring or predicting unknown quantities!
Randomness and outcomes

- When we collect data, there are outcomes that we are interested in
- Example: Suppose we are about to toss two coins. The possible outcomes are: TT, HH, TH, HT
- We can write the collection of all possible outcomes like this: {TT, HH, TH, HT}
- Generally, data collection involves randomness
  - e.g. sampling, human error, things that are unpredictable (weather, stock prices, etc.)
- If we are being very formal, sometimes we call the data collection procedure an experiment
Events

- Events are just collections of outcomes.
- Example: Suppose we are about to toss two coins.
  - A = {at least one heads}
  - B = {no tails}
  - C = {no tails and no heads}

We call A, B, and C events, and we can assign probabilities to each of these events.
- An event need not be possible
  - e.g. the event with no outcomes (called the “null event”) is a valid event
  - This is mainly for practical convenience
Random Variables

- Random variables are numbers that we are NOT sure about, but we might have some idea of how to describe their possible outcomes.
- Example: Suppose we are about to toss two coins. Let X denote the number of heads.

We say that X is the random variable that stands for the number we are not sure about.
- We can use random variables to describe events, e.g. A = {X ≥ 1}, B = {X = 2}, C = {X = 0 and 2 − X = 0}
Example: Baseball

Suppose we track Pete Rose’s hit statistics over the course of a baseball game. To be concrete, let’s say he comes up to bat 4 times per game.
- Randomness: The outcome of each at-bat is not known in advance
- Possible outcomes: 1000, 0101, 0100, etc., where 1 = hit, 0 = no hit
- Events: {at least one hit}, {no hits}, {exactly 3 hits}, etc.
- Random variables: X = total number of hits, Y = 1 if at least one hit, etc.
The language of probability

Summary:
- Events are collections of outcomes
- Random variables make it easy to describe complicated events
- Data collection entails randomness and possible outcomes

Of course...
- This is all very formal, and you probably won’t find yourself using this language very often in board meetings...
- ...but we need to settle on a common language so we’re all on the same page
Probability

Probability is a language designed to help us talk and think about aggregate properties of events and random variables.

The key idea is that to each event we will assign a number between 0 (0%) and 1 (100%) which reflects how likely that event is to occur, written P(A).

P(A) = the probability that the event A occurs

There is lots of common notation for probabilities: pr(A), p(A), Pr(A), P(A), µ(A)...
A note about time and space

Unless it is explicitly specified in the event, this interpretation of events and probabilities ignores time and space altogether.
- i.e. P(A) = the probability that A happens anywhere at any time for any reason (including the past and in undiscovered parallel universes)
- i.e. P(A) = the probability that A has occurred, will occur, is occurring now... etc.

Of course, we can take time and space into account by simply specifying it:

P(something interesting occurs in Chicago at 9:17 on January 6, 2028).
The Four Rules of Probability

For such an immensely useful language, probability has only a few basic rules.

1. If an event A is certain to occur, it has probability 1, denoted P(A) = 1.
2. P(A does not occur) = 1 − P(A).
3. If two events A and B are mutually exclusive (both cannot occur), then P(A or B) = P(A) + P(B).
4. P(A and B) = P(A)P(B|A) = P(B)P(A|B)

P(A|B) is the conditional probability of A given B.
Conditional probability and independence

Conditional probability
Intuitively, P(A|B) means: In an alternative world where we know that B has occurred (or will occur), how does P(A) change?

Re-writing rule #4 (P(A and B) = P(B)P(A|B)):

P(A|B) = P(A, B) / P(B).

What if P(A|B) = P(A)? Then B tells us nothing about A. In other words, even if we know that B happened, it does nothing to change what we believe about A.

(More on this soon.)
Understanding conditional probability

In 2019, United Airlines set a 2020 earnings goal of $11 to $13 per share. Let’s say they felt there was a 95% probability of hitting this target:

P($11 to $13 per share in 2020) = 0.95.

This probability statement is implicitly accounting for all of the unknown factors that could affect this event in 2020, even the really unlikely ones...

Like, say, a global pandemic.
Understanding conditional probability

Let A = earnings of $11 to $13 per share in 2020. There are lots of events that can affect the outcome of this event.
- B1 = United buys Southwest. P(B1) = ?
- B2 = It rains at 6:24PM sometime in May. P(B2) = ?
- B3 = A major network outage grounds all United planes for 48 hours. P(B3) = ?
- B4 = A global pandemic quarantines the world’s population for months (make that years). P(B4) = ?

P(A) = probability of A occurring without any certainty about B1, B2, or any other event for that matter.
Understanding conditional probability

P(A|Bk) = probability of A occurring with absolute certainty that Bk has occurred.

Once there is new information, our probabilities change:

Before pandemic: 0 < P(B4) < 1
After pandemic:  P(B4) = 1

Before the pandemic, we said P(A) = 0.95. (What was P(B4) in Oct 2019?)

Now, what do you think about P(A|B4)? What about P(A|not B4)?
Another example...

Reality vs. imagination

Conditional probability allows us to update our knowledge about the world in the presence of new information.

WARNING: Conditional probabilities are most often used to discuss hypotheticals.

P(A|B) = What is the probability of A if B were to happen (or had happened, etc.)?

It is not necessary to wait until B has really happened to model, discuss, and imagine P(A|B): This is the real utility of probability.

As with regular probability, conditional probability ignores time and space unless it is specified.
Independence

Two events A and B are called independent if P(A|B) = P(A).
- In words: If we know with absolute certainty that B has occurred, it does not change what we know/believe about A
- i.e. A and B have nothing to do with each other
- Same as requiring P(B|A) = P(B)

Example: Coin flipping. The outcome of a coin flip does not depend on previous tosses.

More formally: Let A = {first toss is heads} and B = {second toss is tails}. Then A and B are independent.
Independence and rule #4

Rule #4 implies a very special formula when A and B are independent:

A and B independent =⇒ P(A and B) = P(A)P(B).

Be careful: Only true for independent events, and only true with “and” (not “or”)!

Exercise: Can you see why rule #4 and independence imply this formula?
Independence and statistics

The most common way we invoke independence in statistics and data analysis is through sampling: When collecting data, whether or not we survey one individual has no effect on who we survey next.

This can be enforced via random sampling, but it can be difficult.

We will often use this as a simplifying assumption, but more rigorous justification of this is a topic for future courses.
Pete Rose Hitting Streak

Pete Rose of the Cincinnati Reds set a National League record by hitting safely in 44 consecutive games...
- Rose was a .300 hitter (this means P(hit safely in a single at-bat) = 0.300).
- Assume he comes to bat 4 times each game
- “Hitting safely in a game” means that he hits safely in at least one of these four at-bats
- Each at-bat is assumed to be independent, i.e., the current at-bat doesn’t affect the outcome of the next.

What is the probability of a hitting streak of that length?
Pete Rose Hitting Streak
Let A_i denote the event that “Rose hits safely in the i-th game”.

Then P(Rose hits safely in 44 consecutive games) = P(A_1 and A_2 ... and A_44) = P(A_1)P(A_2)...P(A_44)

We now need to find P(A_i): What are all the ways to get at least one hit in 4 tries? This is hard. Instead, how many ways are there to get NO hits in 4 tries? Exactly one: All misses (i.e. 0000).

Now use the rules of probability: Think of the complement of A_i and apply Rule #2:

P(A_i) = 1 − P(not A_i)
       = 1 − P(Rose makes 4 outs)
       = 1 − (0.7 × 0.7 × 0.7 × 0.7)
       = 1 − (0.7)^4 ≈ 0.76

So, for the winning streak we have (0.76)^44 ≈ 0.0000057!!!
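A short Python sketch of the same calculation (using the assumptions above: a .300 average and 4 independent at-bats per game):

```python
# Probability Rose hits safely in one game (complement rule), then in 44 straight.
p_hit = 0.300
p_game = 1 - (1 - p_hit) ** 4     # 1 - P(4 outs) = 0.7599
p_streak = p_game ** 44           # independence across games
print(round(p_game, 4), f"{p_streak:.1e}")   # 0.7599 5.7e-06
```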


New England Patriots and Coin Tosses

For the past 25 games, the Patriots won 19 coin tosses!

What is the probability of that happening?

Let T be a random variable taking the value 1 when the Patriots win the toss and 0 otherwise.

It’s reasonable to assume Pr(T = 1) = 0.5, right??

Now what? It turns out that there are 177,100 different sequences of 25 games where the Patriots win 19... and each potential sequence has probability 0.5^25 (why?)

Therefore the probability for the Patriots to win 19 out of 25 tosses is

177,100 × 0.5^25 ≈ 0.005
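The same arithmetic in Python (the count of sequences is a binomial coefficient, which is beyond what you need to know for this course):

```python
from math import comb

n_sequences = comb(25, 19)          # 177,100 ways to place 19 wins in 25 games
prob = n_sequences * 0.5 ** 25      # each sequence has probability (1/2)^25
print(n_sequences, round(prob, 4))  # 177100 0.0053
```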

Probability distributions

Probability Distributions

- We describe the behavior of random variables with a probability distribution
- Example: If X is the random variable denoting the number of heads in two independent coin tosses, we can describe its behavior through the following probability distribution:

  X = 0 with prob. 0.25
      1 with prob. 0.50
      2 with prob. 0.25

- X is called a discrete random variable as we are able to list all the possible outcomes
- Question: What is Pr(X = 0)? How about Pr(X ≥ 1)?
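Here is a small Python sketch that builds this distribution by enumerating the four equally likely outcomes:

```python
from itertools import product

dist = {0: 0.0, 1: 0.0, 2: 0.0}
for outcome in product("HT", repeat=2):     # HH, HT, TH, TT
    dist[outcome.count("H")] += 0.25        # each outcome has probability 1/4
print(dist)                # {0: 0.25, 1: 0.5, 2: 0.25}
print(dist[1] + dist[2])   # Pr(X >= 1) = 0.75
```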
Binary (aka Bernoulli) Random Variables

The simplest possible random variable takes on one of two values. We will always call these values 0 and 1. So

X = 1 with prob. p
    0 with prob. 1 − p

equivalently, P(X = 1) = p.

We let p denote the probability of seeing 1, i.e. p = P(X = 1).

This is written X ∼ Ber(p), where “Ber” is short for “Bernoulli”.

Examples:
- Voting: 0 = Candidate A, 1 = Candidate B
- Product testing: 0 = safe, 1 = defective / unsafe
- Yes/no: 0 = no, 1 = yes
Conditional, Joint and Marginal Distributions

In general we want to use probability to address problems involving more than one variable at a time.

Think back to our first question on the returns of my portfolio... if we know that the economy will be growing next year, does that change the assessment about the behavior of my returns?

We need to be able to describe what we think will happen to one variable relative to another...
Conditional, Joint and Marginal Distributions

There are two different ways to discuss the probabilities of two events:
- Joint probability P(A and B): I don’t know if either event has occurred.
- Conditional probability P(A|B): I know for sure that B has occurred (or will occur).

In general the shorthand notation is...
- P(A, B) is the joint probability that both A AND B occur (not necessarily at the same time).
- P(A|B) is the conditional probability of A happening GIVEN that B has already occurred (or will occur).

Compare these to the individual probabilities:
- P(A) and P(B) are the marginal probabilities of A and B.
Conditional, Joint and Marginal Distributions

Here’s an example: We want to answer questions like: How are my sales impacted by the overall economy?

Let E denote the performance of the economy next quarter... for simplicity, say E = 1 if the economy is expanding and E = 0 if the economy is contracting. (What kind of random variable is this?)

Let’s assume P(E = 1) = 0.7.

(What are the two events described above?)
Conditional probability tables

Let S denote my sales next quarter... and let’s suppose the following probability statements:

S   P(S|E = 1)      S   P(S|E = 0)
1   0.05            1   0.20
2   0.20            2   0.30
3   0.50            3   0.30
4   0.25            4   0.20

These are called conditional distributions, and the table is called a conditional probability table.
Conditional probability tables

S   P(S|E = 1)      S   P(S|E = 0)
1   0.05            1   0.20
2   0.20            2   0.30
3   0.50            3   0.30
4   0.25            4   0.20

- On the left is the conditional distribution of S given E = 1
- On the right is the conditional distribution of S given E = 0
- We read: the probability of sales of 4 (S = 4) given (or conditional on) the economy growing (E = 1) is 0.25
Computing joint probabilities

The conditional distributions tell us about what can happen to S for a given value of E... but what about S and E jointly?

P(S = 4 and E = 1) = P(E = 1) × P(S = 4|E = 1)
                   = 0.70 × 0.25 = 0.175

In plain language: “70% of the time the economy grows, and of those times, 25% of the time sales equals 4.”

...25% of 70% is 17.5%
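A minimal Python sketch that builds the whole joint table P(S = s, E = e) this way, multiplying the marginal of E by the conditional of S given E (the variable names are just illustrative):

```python
p_e = {1: 0.7, 0: 0.3}                       # marginal of the economy
p_s_given_e = {                              # conditional tables from above
    1: {1: 0.05, 2: 0.20, 3: 0.50, 4: 0.25},
    0: {1: 0.20, 2: 0.30, 3: 0.30, 4: 0.20},
}
joint = {(s, e): p_e[e] * p_s_given_e[e][s]
         for e in p_e for s in p_s_given_e[e]}
print(joint[(4, 1)])   # 0.175, matching the calculation above
```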

Probability tree diagrams

[Figure: a probability tree diagram (not reproduced in this text version).]
Joint probability tables

The joint probability table describes all possible joint probabilities between two random variables. Summing each row or column gives the corresponding marginal probability. For the sales/economy example, the joint table (computed from the marginal of E and the conditionals above) is:

           S = 1   S = 2   S = 3   S = 4   P(E = e)
E = 0      0.060   0.090   0.090   0.060   0.30
E = 1      0.035   0.140   0.350   0.175   0.70
P(S = s)   0.095   0.230   0.440   0.235   1

e.g. P(E = 0, S = 1) = 0.06.
Marginal probabilities
Marginal probabilities can be computed from joint probabilities:

P(X = x) = Σ_y P(X = x, Y = y).

Why we call marginals marginals: the table represents the joint distribution, and at the margins, we get the marginals.
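Continuing the sketch above (it reuses the `joint` dictionary built earlier), the marginal of S is just a sum over the values of E:

```python
# Marginal of S: sum the joint over the other variable, up to floating-point rounding.
p_s = {s: sum(joint[(s, e)] for e in (0, 1)) for s in (1, 2, 3, 4)}
print(p_s)   # approximately {1: 0.095, 2: 0.23, 3: 0.44, 4: 0.235}
```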
Conditional, Joint and Marginal Distributions

Example... Given E = 1, what is the probability of S = 4?

P(S = 4|E = 1) = P(S = 4, E = 1) / P(E = 1) = 0.175 / 0.7 = 0.25
Conditional, Joint and Marginal Distributions

Example... Given S = 4, what is the probability of E = 1?

P(E = 1|S = 4) = P(S = 4, E = 1) / P(S = 4) = 0.175 / 0.235 = 0.745
Independence of random variables

Two random variables X and Y are independent if

P(Y = y|X = x) = P(Y = y)

for all possible x and y.

In other words, knowing X tells you nothing about Y!

This is just the previous definition of independence for the events A = {Y = y} and B = {X = x}.

e.g., tossing a coin 2 times... what is the probability of getting H in the second toss given we saw a T in the first one?
Dependence

If two random variables are not independent, they are called dependent:

P(Y = y|X = x) ≠ P(Y = y)

i.e. P(Y = y|X = x) depends on x.

In other words, there is a relationship between X and Y.

More on this soon.
Disease Testing Example

Let D = 1 indicate you have a disease.

Let T = 1 indicate that you test positive for it.

[The original slide shows a probability tree with P(D = 1) = 0.02, P(T = 1|D = 1) = 0.95, and P(T = 1|D = 0) = 0.01; these numbers are used below.]

If you take the test and the result is positive, you are really interested in the question: Given that you tested positive, what is the chance you have the disease?
Disease Testing Example

P(D = 1|T = 1) = P(D = 1, T = 1) / P(T = 1) = 0.019 / (0.019 + 0.0098) = 0.66
Bayes Theorem

- Try to think about this intuitively... imagine you are about to test 100,000 people.
- We assume that about 2,000 of those have the disease.
- We also expect 1% of the disease-free people to test positive, i.e., 980, and 95% of the sick people to test positive, i.e., 1,900. So, we expect a total of 2,880 positive tests.
- Choose one of the 2,880 people at random... what is the probability that he/she has the disease?

P(D = 1|T = 1) = 1,900/2,880 = 0.66

- Isn’t that the same?!
Bayes Theorem
The computation of P(X|Y) from P(X) and P(Y|X) is called Bayes theorem:

P(X = x|Y = y) = P(X = x, Y = y) / P(Y = y)
               = P(X = x, Y = y) / Σ_x P(X = x, Y = y)
               = P(X = x)P(Y = y|X = x) / Σ_x P(X = x)P(Y = y|X = x).

In the disease testing example:

P(D = 1|T = 1) = P(T = 1|D = 1)P(D = 1) / [P(T = 1|D = 1)P(D = 1) + P(T = 1|D = 0)P(D = 0)]
               = 0.019 / (0.019 + 0.0098) = 0.66
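In Python, using the rates from the previous slides (2% prevalence, 95% true-positive rate, 1% false-positive rate):

```python
p_d = 0.02            # P(D = 1): prevalence
p_t_d1 = 0.95         # P(T = 1 | D = 1): true-positive rate
p_t_d0 = 0.01         # P(T = 1 | D = 0): false-positive rate

p_t = p_t_d1 * p_d + p_t_d0 * (1 - p_d)   # P(T = 1) by total probability
print(round(p_t_d1 * p_d / p_t, 2))       # Bayes theorem: 0.66
```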
Causality and experimentation

Interlude: Conditional probability vs causation

OK, so P(A|B) = P(A, B)/P(B)...

...but what does this really mean?
- Given that I happen to observe B, what is the probability of A?
- No assumptions on why I observed B
- No assumptions on the relationship between B and A, or other possible variables
- No control over B
- Purely a “fun fact” about A and B
Interlude: Conditional probability vs causation

- When ice cream sales are high, the probability of getting attacked by a shark goes up
- When ice cream sales are low, the probability of getting attacked by a shark goes down

So, there is a clear relationship between shark attacks and ice cream sales.
What’s in a relationship?

Ordinary relationships:
When we say that two things are related, our language is deliberately vague: You are not allowed to conclude ANYTHING about how or why they are related. (See the interlude above.)

Causal relationships:
Causality is a much stronger type of relationship: Two things share a causal relationship if manipulating one of them changes the other.
- Non-causal: Ice cream sales and shark attacks
- Causal: Ice cream melting and temperature

Should you always care if a relationship is causal or not?
Interlude: Conditional probability vs causation

When we see dependencies in data, it is tempting to want to draw causal conclusions...

...but this is rarely justified by the data.

To make a causal conclusion, you must run a carefully planned experiment (e.g. a randomized controlled trial).

This allows you to ensure that there is no other possible explanation for the relationships in the data.
Randomized controlled trials

A randomized controlled trial (RCT) is the gold standard of causal inference. An RCT is built on several key principles:
- Controls: Everything is carefully controlled and monitored;
- Randomization: Assignment of subjects is purely random and does not depend on any outside factors;
- Replication: Many subjects are monitored and observed.

More concretely, an RCT always consists of the following:
- A treatment and a control;
- Perfect random assignment of subjects into treatment and control groups;
- Compliance with study protocols.

If your study violates any of these conditions, you might be doing something, but you’re not doing an RCT!
Example: Clinical trials

You want to test the efficacy of a new drug.

To ensure the validity of causal conclusions, you randomly divide patients into two groups (control = placebo and treatment = new drug).

Why random?

Example: Divide based on severity of condition =⇒ how do you know severity isn’t the explanation?

(More generally, if the decision to divide patients is based on variables A, B, C, ..., then there is no way to know whether the drug or A, B, C, ... is the reason for the relationship.)
Ice cream does not cause shark attacks

Back to the sharks:
- If we forcibly change ice cream sales (e.g. an external intervention), will the number of shark attacks change?
- Is there an alternative explanation for the relationship?
  - Yes: Time of year (summer)
Conditional probability ≠ causation

Other explanations for a perceived relationship between A and B?
- A causes B
- B causes A
- C causes both A and B
- Chance (aka “spurious” relationship)

When you observe a relationship in data, you do not know what the reason for this relationship is.

Common pitfall: “It’s causal when I want it to be, and it’s spurious otherwise.” Don’t do this!
(See also: Lucas critique in economics)
Beyond the scope of this course

The difference between conditional probability (“I just happened to observe this”) and randomized experiments (“I forcibly manipulated the environment”) is so crucial that statisticians sometimes reserve special notation for it:
- Conditional probability: P(Y|X = x)
- Randomized experiment: P(Y|do(X = x))

The “do” notation means: I didn’t just happen to observe this, I forcibly set X to be x, and there is no other explanation for why X is equal to x.

That’s all we’ll say about this topic: more on this in advanced courses.
Mean, variance, covariance

Probability and Decisions
Suppose you are presented with an investment opportunity in the development of a drug... probabilities are a vehicle to help us build scenarios and make decisions.
Probability and Decisions

We basically have a new random variable, i.e., our revenue, with the following probabilities...

Revenue       P(Revenue)
$250,000      0.7
$0            0.138
$25,000,000   0.162

So, should we invest or not? Everyone has their own opinions, but let’s try to analyze this rigorously.
Mean and Variance of a Random Variable

The mean or expected value is defined as (for a discrete X):

E(X) = Σ_{i=1}^n P(X = x_i) × x_i

We weight each possible value by how likely it is... this provides us with a measure of centrality of the distribution... a “good” prediction for X!
Mean and Variance of a Random Variable

Suppose X ∼ Ber(p). Then

E(X) = Σ_{i=1}^n P(X = x_i) × x_i
     = (1 − p) × 0 + p × 1
     = p

Beyond binary: What is E(Revenue) in the drug investment example above?
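Answering this in Python, with the revenue table from the drug example above:

```python
# E(Revenue): weight each possible revenue by its probability.
revenue_dist = {250_000: 0.7, 0: 0.138, 25_000_000: 0.162}
mean = sum(p * x for x, p in revenue_dist.items())
print(f"${mean:,.0f}")   # $4,225,000
```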

Managing expectations (pun intended)

In the previous example, the expected value was p, even though the only possible outcomes are 0 and 1. Your expected value need not be one of the actual outcomes!

Huh?

Example:
- How do you price an auto insurance policy?
- At the end of the day, either the policy holder gets in a wreck or not
- But you don’t charge them the entire cost of a wrecked car!

The expected value is a mathematically rigorous way to set this price so you don’t lose money in the long run.
Honesty is (not?) the best policy

What does it mean when your weather app says there is a 40% chance of rain?

This is psychology, not math:
- If there is a 10% chance it rains, and it rains, you will get mad for not bringing an umbrella
- If there is a 50% chance it rains, then the app sounds wishy-washy
- If there is a 90% chance it rains, and it doesn’t rain, then you just changed all your plans for nothing

There’s no winning... unless you always report 40% or 60%!

(Let’s come back to this after Section 2. Also, nowadays with modern tools it’s a lot more sophisticated.)
Mean and Variance of a Random Variable
The variance is defined as (for a discrete X):

var(X) = Σ_{i=1}^n P(X = x_i) × [x_i − E(X)]²

This is a weighted average of squared prediction errors... a measure of the spread of a distribution. More risky distributions have larger variance.

Think of [x_i − E(X)]² as the (squared) distance from the observation x_i to the expected value E(X). Then the variance measures the “average” or “expected” distance between a random sample and its expectation.
Mean and Variance of a Random Variable

Suppose X ∼ Ber(p). Then

var(X) = Σ_{i=1}^n P(X = x_i) × [x_i − E(X)]²
       = (1 − p) × (0 − p)² + p × (1 − p)²
       = p(1 − p) × [p + (1 − p)]
       = p(1 − p)

For which value of p is the variance the largest?

Beyond binary: What is var(Revenue) in the drug investment example above?
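Continuing the Python sketch from the mean slide (it reuses `revenue_dist` and `mean` from there):

```python
# var(Revenue) and sd(Revenue) for the drug example.
var = sum(p * (x - mean) ** 2 for x, p in revenue_dist.items())
print(f"{var:.3e}  ${var ** 0.5:,.0f}")   # roughly 8.344e+13, sd of about $9.13M
```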

The Standard Deviation

- What are the units of E(X)? What are the units of var(X)?
- A more intuitive way to understand the spread of a distribution is to look at the standard deviation:

  sd(X) = √var(X)

- What are the units of sd(X)?
Probability and Decisions

Let’s get back to the drug investment example...

Original investment option:        New investment option:

Revenue       P(Revenue)           Revenue       P(Revenue)
$250,000      0.7                  $3,721,428    0.7
$0            0.138                $0            0.138
$25,000,000   0.162                $10,000,000   0.162

The expected revenue for both is $4,225,000...

What is the difference? The risk (i.e. variance)!
Back to Target Marketing

Should we send the promotion?

Well, it depends on how likely it is that the customer will respond!

If they respond, you get $40 − $0.80 = $39.20.

If they do not respond, you lose $0.80.

Let’s assume your “predictive analytics” team has studied the conditional probability of customer responses given customer characteristics... (say, previous purchase behavior, demographics, etc.)
Back to Target Marketing

Suppose that for a particular customer, the probability of a response is 0.05.

Revenue   P(Revenue)
−$0.80    0.95
$39.20    0.05

Should you do the promotion?

Homework question: How low can the probability of a response be so that it is still a good idea to send out the promotion?
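A one-line expected-value check in Python for the in-class question (the homework break-even is left to you):

```python
# Expected revenue from sending the promotion when P(response) = 0.05.
p = 0.05
expected = p * 39.20 + (1 - p) * (-0.80)
print(round(expected, 2))   # 1.20 > 0: on average the promotion pays
```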

Covariance
The covariance between X and Y is:

cov(X, Y) = Σ_{i=1}^n Σ_{j=1}^m P(X = x_i, Y = y_j) × [x_i − E(X)] × [y_j − E(Y)]

- A measure of dependence between two random variables...
- If X and Y typically increase and decrease together, then cov(X, Y) > 0
- If X typically goes up when Y goes down (and vice versa), then cov(X, Y) < 0
- It tells us how two unknown quantities tend to move together
- What are the units of cov(X, Y)?
Covariance
Covariance measures the direction and strength of the linear relationship between Y and X.

[Scatter plot: points are split into quadrants around (X̄, Ȳ); points where (Y_i − Ȳ)(X_i − X̄) > 0 pull the covariance up, and points where (Y_i − Ȳ)(X_i − X̄) < 0 pull it down.]
Ford vs. Tesla

Assume a very simple joint distribution of monthly returns for Ford (F) and Tesla (T):

          t=−7%   t=0%   t=7%   Pr(F=f)
f=−4%     0.06    0.07   0.02   0.15
f=0%      0.03    0.62   0.02   0.67
f=4%      0.00    0.11   0.07   0.18
Pr(T=t)   0.09    0.80   0.11   1

Let’s summarize this table with some numbers...
Ford vs. Tesla

          t=−7%   t=0%   t=7%   Pr(F=f)
f=−4%     0.06    0.07   0.02   0.15
f=0%      0.03    0.62   0.02   0.67
f=4%      0.00    0.11   0.07   0.18
Pr(T=t)   0.09    0.80   0.11   1

- E(F) = 0.12, E(T) = 0.14
- var(F) = 5.25, sd(F) = 2.29, var(T) = 9.76, sd(T) = 3.12
- What stock do you prefer? Why? Is one better than the other?
Ford vs. Tesla

          t=−7%   t=0%   t=7%   Pr(F=f)
f=−4%     0.06    0.07   0.02   0.15
f=0%      0.03    0.62   0.02   0.67
f=4%      0.00    0.11   0.07   0.18
Pr(T=t)   0.09    0.80   0.11   1

cov(F, T) = 0.06(−7 − 0.14)(−4 − 0.12) + 0.03(−7 − 0.14)(0 − 0.12)
          + 0.00(−7 − 0.14)(4 − 0.12) + 0.07(0 − 0.14)(−4 − 0.12)
          + 0.62(0 − 0.14)(0 − 0.12) + 0.11(0 − 0.14)(4 − 0.12)
          + 0.02(7 − 0.14)(−4 − 0.12) + 0.02(7 − 0.14)(0 − 0.12)
          + 0.07(7 − 0.14)(4 − 0.12) = 3.063

But... is this a strong relationship?
Correlation
Okay, so the covariance was positive... makes sense, but is ≈ 3 a strong or weak relationship? Depends on the units!

Can we get a more intuitive number? Yes! The correlation between X and Y is:

cor(X, Y) = cov(X, Y) / (sd(X) × sd(Y))

In our Ford vs. Tesla example:

cor(F, T) = 3.063 / (2.29 × 3.12) = 0.428 (not too strong!)
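A Python sketch that recomputes all of these summaries directly from the joint table, so you can verify the slide’s numbers:

```python
ft_joint = {(-4, -7): 0.06, (-4, 0): 0.07, (-4, 7): 0.02,   # (f, t): prob
            (0, -7): 0.03,  (0, 0): 0.62,  (0, 7): 0.02,
            (4, -7): 0.00,  (4, 0): 0.11,  (4, 7): 0.07}

ef = sum(p * f for (f, t), p in ft_joint.items())           # E(F) = 0.12
et = sum(p * t for (f, t), p in ft_joint.items())           # E(T) = 0.14
var_f = sum(p * (f - ef) ** 2 for (f, t), p in ft_joint.items())
var_t = sum(p * (t - et) ** 2 for (f, t), p in ft_joint.items())
cov = sum(p * (f - ef) * (t - et) for (f, t), p in ft_joint.items())
cor = cov / (var_f ** 0.5 * var_t ** 0.5)
print(round(cov, 3), round(cor, 3))   # 3.063 0.427 (the slide rounds to 0.428)
```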

Correlation and dependence

- What are the units of cor(X, Y)? It doesn’t depend on the units of X or Y!
- −1 ≤ cor(X, Y) ≤ 1
- 1 = strong positive relationship, 0 = weak relationship, −1 = strong negative relationship

The catch: cor(X, Y) measures some but not all possible relationships: cor(X, Y) = 0 does not imply X and Y are completely independent!

For example, if there is a complex, nonlinear relationship between X and Y, correlation will not capture this. More on this in Section 3...
Covariance, correlation, dependence

- If X and Y typically increase and decrease together, then cov(X, Y) > 0
- If X typically goes up when Y goes down (and vice versa), then cov(X, Y) < 0
- If X and Y are independent: cov(X, Y) = 0 and cor(X, Y) = 0.
- The reverse is FALSE: cov(X, Y) = 0 does NOT imply X and Y are independent. The same is true if cor(X, Y) = 0.
Spurious correlations

Much like conditional probability, correlation is an (imperfect) way to measure the dependence between two variables.

And it comes with all the same warnings!

https://tylervigen.com/spurious-correlations
Linear combinations

Linear Combination of Random Variables

Do we have to choose between either Ford OR Tesla? How about half and half?

To answer this question we need to understand the behavior of the weighted sum (linear combination) of two random variables...

Let X and Y be two random variables:
- E(aX + bY) = aE(X) + bE(Y)
- var(aX + bY) = a²var(X) + b²var(Y) + 2ab × cov(X, Y)
Linear Combination of Random Variables

Applying this to the Ford vs. Tesla example...
- E(0.5F + 0.5T) = 0.5E(F) + 0.5E(T) = 0.5 × 0.12 + 0.5 × 0.14 = 0.13
- var(0.5F + 0.5T) = (0.5)²var(F) + (0.5)²var(T) + 2(0.5)(0.5) × cov(F, T) = (0.5)²(5.25) + (0.5)²(9.76) + 2(0.5)(0.5) × 3.063 = 5.28
- sd(0.5F + 0.5T) = 2.297

So, which do you prefer? Holding Ford, Tesla, or the combination? Why?
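The same half-and-half computation in Python, using the summary numbers above:

```python
# Mean, variance, and sd of the 50/50 Ford/Tesla portfolio.
w = 0.5
mean_p = w * 0.12 + (1 - w) * 0.14
var_p = w**2 * 5.25 + (1 - w)**2 * 9.76 + 2 * w * (1 - w) * 3.063
print(round(mean_p, 2), round(var_p, 2), round(var_p ** 0.5, 3))  # 0.13 5.28 2.299
```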

Linear Combination of Random Variables

More generally...
- E(w_1 X_1 + w_2 X_2 + ... + w_p X_p) = w_1 E(X_1) + w_2 E(X_2) + ... + w_p E(X_p) = Σ_{i=1}^p w_i E(X_i)
- var(w_1 X_1 + w_2 X_2 + ... + w_p X_p) = Σ_{i=1}^p w_i² var(X_i) + Σ_{i=1}^p Σ_{j≠i} w_i w_j cov(X_i, X_j)
- Important special case: If all the X_i are independent, then var(X_1 + X_2 + ··· + X_p) = var(X_1) + var(X_2) + ··· + var(X_p)

In general, the variance of a sum is NOT the sum of the variances!
Example
On average, LeBron James scores 27.1 points per game, with a standard deviation of 5.3 points.
- Over a randomly selected series of 3 games, how many points do you expect LeBron to score?
- What is the standard deviation over these three games?

Assume that each game is independent.

Let X_i = points scored in game i, so the total points is P = X_1 + X_2 + X_3.
- E(P) = E(X_1 + X_2 + X_3) = E(X_1) + E(X_2) + E(X_3) = 3 × 27.1 = 81.3 expected points
- var(P) = var(X_1 + X_2 + X_3) = var(X_1) + var(X_2) + var(X_3) = 3 × (5.3)² = 84.3 points “squared”
- =⇒ sd(P) = √var(P) = 9.18 points
Normal distributions

Continuous Random Variables

- Suppose we are trying to predict tomorrow’s return on the S&P500...
- Question: What is the random variable of interest?
- Question: How can we describe our uncertainty about tomorrow’s outcome?
- Listing all possible values seems like a crazy task... we’ll work with intervals instead.
- These are called continuous random variables.
- The probability of an interval is defined by the area under the probability density function.
Probability density functions

We use the probability density function (pdf) f(x) to describe continuous probabilities:
- The x-axis corresponds to the values that X may take on
- f(x) is NOT the probability that X = x!
- If X is continuous, what is P(X = x)?
The Normal Distribution

- The normal distribution is a common probability distribution used to describe a continuous random variable
- Also known as the “bell curve”
- Values near the mean are likely, and become less likely as you move away from the mean
- Symmetric: Equal probabilities on either side of the mean

[Figure: the standard normal pdf, a bell-shaped curve centered at 0, plotted from −4 to 4.]
The Standard Normal Distribution

- The standard normal distribution is the special normal distribution that has mean 0 and variance 1.
- Notation: Z ∼ N(0, 1) (“Z is a random variable whose distribution is standard normal”)

Pr(−1 < Z < 1) = 0.6826895...
Pr(−2 < Z < 2) = 0.9544997...
Pr(−3 < Z < 3) = 0.9973002...

[Figures: the standard normal pdf with the central ±1 and ±2 standard deviation regions shaded.]
Example: Normal Probabilities

Note: For simplicity we will just use

P(−1 < Z < 1) ≈ 0.68
P(−2 < Z < 2) ≈ 0.95
P(−3 < Z < 3) ≈ 0.99

Questions:
- What is Pr(Z < 2)? How about Pr(Z ≤ 2)?
- What is Pr(Z < 0)?
Normal Distribution as a Family

- The standard normal is not that useful by itself. When we say “the normal distribution”, we really mean a family of distributions.
- We obtain pdfs in the normal family by shifting the bell curve around and spreading it out (or tightening it up).
- Caution: Of course, in real applications we typically use more complicated distributions with more interesting shapes.
The Family of Normal Distributions

- A normal distribution is fully specified by its mean (location) and variance (spread)
- We write X ∼ N(µ, σ²): “Normal distribution with mean µ and variance σ².”
- The parameter µ determines where the center of the curve is.
- The parameter σ determines how spread out the curve is. The area under the curve in the interval (µ − 2σ, µ + 2σ) is 95%:

  Pr(µ − 2σ < X < µ + 2σ) ≈ 0.95
Why normal?

Most distributions are NOT normally distributed.

So why study normal distributions?

Why normal?

Why study normal distributions?
- It turns out to be a “good enough” model in many applications
- Central limit theorem: The average of many independent RVs is approximately normal, no matter what their initial distribution is!
- The normal distribution is “universal” in some sense
- (More advanced courses) You can test for non-normality, run sensitivity analyses, etc.
- (For us) It gives nice formulas
Mean and Variance of Continuous RVs

Defining the mean, variance, and covariance of continuous random variables is tricky: we need calculus for this. You do NOT need to know the formal definitions (or calculus).

But you DO need to know that the basic idea is the same as for discrete RVs. The interpretation is also the same:
- The mean measures the central tendency (but NOT necessarily the most likely value);
- The variance measures the average spread or variation around the mean;
- The covariance measures how two RVs move together on average.
Mean and Variance of a Random Variable

- For the normal family of distributions we can see that the parameter µ talks about “where” the distribution is located or centered.
- We often use µ as our best guess for a prediction.
- The parameter σ talks about how spread out the distribution is. This gives us an indication of how uncertain or how risky our prediction is.
- If X is any random variable, the mean will be a measure of the location of the distribution and the variance will be a measure of how spread out it is.
Normal pdfs

- Example: Below are the pdfs of X1 ∼ N(0, 1), X2 ∼ N(3, 1), and X3 ∼ N(0, 16).
- Which pdf goes with which X?

[Figure: three normal pdfs plotted from −8 to 8.]
Example: Modeling returns with a normal

- Assume the annual returns on the SP500 are normally distributed with mean 6% and standard deviation 15%: SP500 ∼ N(6, 225). (Notice: 15² = 225.)
- Two questions: (i) What is the chance of losing money in a given year? (ii) What is the value such that there is only a 2% chance of losing that much or more?
- (i) Pr(SP500 < 0) = ? and (ii) Pr(SP500 < ?) = 0.02
Example: Modeling returns with a normal
[Figures: the N(6, 225) pdf with the area below 0 shaded (left, “prob less than 0”) and the area below −25 shaded (right, “prob is 2%”).]

- (i) Pr(SP500 < 0) = 0.35 and (ii) Pr(SP500 < −25) = 0.02
- (This is just a conceptual example. You are not expected to know these precise numbers.)
Normal probabilities and standardization

In general: If X ∼ N(µ, σ²) then

Pr(µ − σ < X < µ + σ) = 0.683...
Pr(µ − 2σ < X < µ + 2σ) = 0.954...
Pr(µ − 3σ < X < µ + 3σ) = 0.997...

We saw something similar already for standard normals: That was just the special case µ = 0 and σ = 1.

An easier way to figure out these probabilities is standardization.
Standardization

If X ∼ N(µ, σ²) then

Z = (X − µ)/σ ∼ N(0, 1).

In other words, by subtracting off the mean and dividing by the standard deviation, you can use this table instead:

Pr(−1 < Z < 1) = 0.683...
Pr(−2 < Z < 2) = 0.954...
Pr(−3 < Z < 3) = 0.997...

Z measures “how many standard deviations away from the mean” your observed X is. Larger |Z| =⇒ less likely / more surprising.
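For example, a short Python sketch that uses standardization to recover the earlier SP500 answer, Pr(SP500 < 0) ≈ 0.35 (the standard normal CDF is written here with math.erf, so no table is needed):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z = (0 - 6) / 15             # standardize X ~ N(6, 15^2) at the value 0
print(round(phi(z), 3))      # 0.345, i.e. about a 35% chance of losing money
```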
Example: Standardization

Prior to the 1987 crash, monthly S&P500 returns (r) followed (approximately) a normal with mean 0.012 and standard deviation 0.043. How extreme was the crash of −0.2176?

Standardization helps put this number into context:

r ∼ N(0.012, 0.043²) =⇒ z = (r − 0.012) / 0.043 ∼ N(0, 1)

For the crash,

z = (−0.2176 − 0.012) / 0.043 ≈ −5.3

How extreme is this z-value? More than 5 standard deviations below the mean!!
Building normals
One more very useful property of normal distributions... a sum/linear combination of normal random variables is a new normal random variable!

So if X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²), then aX + bY is also normal, with some mean and some variance: aX + bY ∼ N(µ_{aX+bY}, σ²_{aX+bY}).

Recall that for two random variables X and Y:
- E(aX + bY) = aE(X) + bE(Y)
- var(aX + bY) = a²var(X) + b²var(Y) + 2ab · cov(X, Y)

This implies that:
- aX + bY ∼ N(aµ_X + bµ_Y, a²σ_X² + b²σ_Y² + 2ab · cov(X, Y))
- The mean of aX + bY is aµ_X + bµ_Y and the variance is a²σ_X² + b²σ_Y² + 2ab · cov(X, Y)
Example: Portfolios, once again...

- As before, let’s assume that the annual returns on the SP500 are normally distributed with mean 6% and standard deviation 15%, i.e., SP500 ∼ N(0.06, 0.15²)
- Let’s also assume that annual returns on bonds are normally distributed with mean 2% and standard deviation 5%, i.e., Bonds ∼ N(0.02, 0.05²)
- What is the best investment?
- What else do I need to know if I want to consider a portfolio of SP500 and bonds?
- Additionally, let’s assume the correlation between the returns on the SP500 and the returns on bonds is −0.2.
Portfolios once again...

What is the behavior of the returns of a portfolio with 70% in the SP500 and 30% in Bonds?
- E(0.7 SP500 + 0.3 Bonds) = 0.7 E(SP500) + 0.3 E(Bonds) = 0.7 × 0.06 + 0.3 × 0.02 = 0.048
- var(0.7 SP500 + 0.3 Bonds) = (0.7)² var(SP500) + (0.3)² var(Bonds) + 2(0.7)(0.3) × cor(SP500, Bonds) × sd(SP500) × sd(Bonds) = (0.7)²(0.15²) + (0.3)²(0.05²) + 2(0.7)(0.3) × (−0.2) × 0.15 × 0.05 = 0.01062
- So, Portfolio = 0.7 SP500 + 0.3 Bonds ∼ N(0.048, 0.103²)
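And in Python, a sketch using the assumed means, standard deviations, and correlation from the previous slide:

```python
# Portfolio distribution: 70% SP500, 30% bonds, correlation -0.2.
mu = 0.7 * 0.06 + 0.3 * 0.02
var = (0.7**2) * 0.15**2 + (0.3**2) * 0.05**2 \
      + 2 * 0.7 * 0.3 * (-0.2) * 0.15 * 0.05
print(round(mu, 3), round(var, 5), round(var ** 0.5, 3))   # 0.048 0.01062 0.103
```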

What about outliers?

- The mean and variance are useful summaries of a distribution...
- ...however, they can be misleading
Median, Skewness

- The median of a random variable X is the point such that there is a 50% chance X is above it, and hence a 50% chance X is below it.
- For symmetric distributions, the expected value (mean) and the median are always the same... look at all of our normal distribution examples.
- But sometimes, distributions are skewed, i.e., not symmetric. In those cases the median becomes another helpful summary!
