Data Science 1 2023 - Lecture 02 - Mathematical Preliminaries and Correlation

This document outlines key concepts in probability and statistics that are important for data science. It discusses probability distributions, descriptive statistics like mean and standard deviation, Bayes' theorem, and how variance in data can be misinterpreted as a signal when it is actually just noise. Understanding these fundamental statistical concepts is necessary for working with data and building predictive models.


1

Data Science 1:
Probability, Statistics & Significance

[Title slide: word cloud of course topics, including Supervised Learning, Unsupervised Learning, Errors & Artifacts, Correlation, Variance, Gradient Descent, Sampling, Data Bias, Probability, Significance, Skew, Precision, Recall, F-Score, Classification, Charts & Plots, Machine Learning, Statistics, Prediction, Logistic Regression, Linear Regression, Clustering, Bias-Variance Tradeoffs]

Summer 2023
Wolfram Wingerath, MaFe Davila Restrepo
Department for Computing Science
Data Science / Information Systems
3

Probability
Probability theory provides a formal framework
for reasoning about the likelihood of events.
The probability p(s) of an outcome s satisfies:
● 0 ≤ p(s) ≤ 1
● Σ_{s ∈ S} p(s) = 1, where S is the set of all possible outcomes

These basic properties are often violated in casual use of “probability” in data science.
4

Probability vs. Statistics


● Probability deals with predicting the
likelihood of future events, while statistics
analyzes the frequency of past events.
● Probability is a theoretical branch of
mathematics concerned with the consequences
of definitions, while statistics is applied
mathematics trying to make sense of real-
world observations.
5

Compound Events and Independence


Suppose half my students are female (event A),
and half my students are above the median (event B).
What is the probability a student is both A & B?
Events A and B are independent iff

P(A ∩ B) = P(A) × P(B)

Independence (zero correlation) is good to simplify calculations, but bad for prediction.
6

Conditional Probability
The conditional probability P(A|B) is defined:

P(A|B) = P(A ∩ B) / P(B)

Conditional probabilities get interesting only when events are not independent; otherwise:

P(A|B) = P(A ∩ B) / P(B) = P(A) P(B) / P(B) = P(A)
7

Bayes Theorem
Bayes' theorem is an important tool which
reverses the direction of the dependencies:

P(A|B) = P(B|A) P(A) / P(B)

For example:

P(A|B) = (1/2 ∙ 1/2) / (3/4) = 1/2 ∙ 1/2 ∙ 4/3 = 1/3
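
A minimal Python sketch of this computation; the probabilities P(A) = 1/2, P(B|A) = 1/2, and P(B) = 3/4 are taken from the worked example above:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a = 1 / 2          # P(A), from the example above
p_b_given_a = 1 / 2  # P(B|A)
p_b = 3 / 4          # P(B)

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # 0.333... = 1/3
```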
8

Proof of Bayes Theorem

By the definition of conditional probability:

P(A|B) P(B) = P(A ∩ B) = P(B|A) P(A)

Dividing both sides by P(B) gives:

P(A|B) = P(B|A) P(A) / P(B)

(q.e.d.) 😎
9

Distributions of Random Variables


Random variables (RVs) are numerical functions whose values come with probabilities.
Probability density functions (pdfs) represent RVs, essentially as histograms.
10

Distributions of Random Variables


Example: the sum of two dice throws.
11

Probability/Cumulative Distributions
The cdf is the running sum of the pdf:

C(X ≤ k) = Σ_{x ≤ k} P(X = x)

The pdf and cdf contain exactly the same information, one being the integral / derivative of the other.
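
A short NumPy sketch (not from the lecture) using the two-dice pdf from the previous slide, showing that the cdf is the running sum of the pdf and that differencing the cdf recovers the pdf:

```python
import numpy as np

# pdf of the sum of two fair dice: outcomes 2..12
outcomes = np.arange(2, 13)
counts = np.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1])
pdf = counts / 36.0

# The cdf is the running sum of the pdf ...
cdf = np.cumsum(pdf)

# ... and the pdf is recovered by differencing the cdf (the discrete "derivative")
assert np.allclose(np.diff(cdf, prepend=0.0), pdf)

print(dict(zip(outcomes.tolist(), cdf.round(3).tolist())))
```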
12

Visualizing Cumulative Distributions


Apple iPhone sales have been exploding, right?
13

How explosive is that growth, really?


Cumulative distributions present a misleading view of growth rate.
The incremental change is the derivative of this function, which is hard to visualize.
14

How explosive is that growth, really?


15

Descriptive Statistics
Descriptive statistics provides ways to capture
the properties of a given data set / sample.
● Central tendency measures describe the center around which the data is distributed.
● Variation or variability measures describe
data spread, i.e. how far the measurements
lie from the center.
16

Centrality Measure: Mean


To calculate the mean, sum the values and divide by the number of observations:

x̄ = (1/n) Σ_i x_i

The mean is meaningful for symmetric distributions without outliers.
17

Other Centrality Measures


The median represents the middle value.
The geometric mean is the nth root of the product of n values:

(a_1 ∙ a_2 ∙ … ∙ a_n)^(1/n)

The geometric mean is always ≤ the arithmetic mean, and it is more sensitive to values near zero.
Geometric means make sense with ratios: 1/2 and 2/1 should average to 1.
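
A quick NumPy illustration of the ratio example, with illustrative variable names:

```python
import numpy as np

x = np.array([1 / 2, 2 / 1])  # the two ratios from the slide

arithmetic = x.mean()                 # 1.25
geometric = x.prod() ** (1 / len(x))  # 1.0 -- the sensible average for ratios
median = np.median(x)                 # 1.25 (midpoint of the two values)

print(arithmetic, geometric, median)
```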
18

Which Measure is Best?


Mean is meaningful for symmetric distributions
without outliers: e.g. height and weight.
Median is better for skewed distributions or
data with outliers: e.g. wealth and income.
Bill Gates adds $250 to the mean per capita
wealth in the US, but nothing to the median.
19

Aggregation as Data Reduction


Representing a group of elements by a new derived element, like mean, min, count, or sum, reduces a large dataset to a small summary statistic.
Such statistics can become features when
taken over natural groups or clusters in the full
data set.
20

Variance Metric: Standard Deviation


The variance is the square of the standard deviation (SD) σ:

σ² = (1/n) Σ_i (x_i - x̄)²

Do we divide by n or n-1?

The population SD divides by n, the sample SD by n-1, but for large n, n ≈ n-1, so it doesn't really matter.
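
A small NumPy sketch of the n vs. n-1 distinction on an assumed synthetic sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=1000)  # assumed synthetic sample

population_sd = np.std(x, ddof=0)  # divide by n
sample_sd = np.std(x, ddof=1)      # divide by n - 1

print(population_sd, sample_sd)    # nearly identical for n = 1000
```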
21

The Printer Cartridge Life Distribution


Distributions with the same mean can look very
different.
But together, the mean and standard deviation
fairly well characterize any distribution.
22

The Printer Cartridge Life Distribution

Super-reliable printer cartridge

Normal printer cartridge with built-in end-of-warranty killswitch
23

Parameterizing Distributions
Regardless of how the data is distributed, at least a (1 - 1/k²) fraction of the points must lie within k∙σ of the mean (Chebyshev's inequality).
Thus at least 75% must lie within two sigma of the mean.
Even tighter bounds apply for normal distributions.
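
A quick empirical check of the bound on assumed, deliberately non-normal (exponential) data:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=1.0, size=100_000)  # skewed, decidedly non-normal

mu, sigma, k = x.mean(), x.std(), 2
within_k_sigma = np.mean(np.abs(x - mu) < k * sigma)

# Chebyshev guarantees at least 1 - 1/k**2 = 0.75; the observed fraction is higher
print(within_k_sigma)
```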
24

Interpreting Variance (Stock Market)


It is hard to measure “signal to noise” ratio,
because much of what you see is just variance.
Consider measuring the relative “skill” of
different stock market investors.
Annual fluctuations in performance among funds are large enough that investor performance looks essentially random, meaning there is little real difference in skill.
25

Interpreting Variance (Batting Avg)


In baseball, 0.300 hitters (30% success rate)
represent consistency over 500 at-bats/season.
But simulations show a real
0.300 hitter has a 10% chance of
hitting 0.275 or below.

They also have a 10% chance of hitting 0.325 or above.

Good or bad season, or just lucky/unlucky?

→ It's really easy to interpret something as signal that is actually just noise
→ This is the kind of problem where wisdom (arguably) helps
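
A minimal simulation sketch of this claim; modeling each season as 500 independent at-bats with success probability 0.300 is an assumption about how such numbers are produced:

```python
import numpy as np

rng = np.random.default_rng(7)
# 100,000 simulated seasons of a true .300 hitter with 500 at-bats each
averages = rng.binomial(n=500, p=0.300, size=100_000) / 500

print(np.mean(averages <= 0.275))  # ~0.1: looks like a "bad" season
print(np.mean(averages >= 0.325))  # ~0.1: looks like a "great" season
```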
26

Interpreting Variance (Many Models)


We will typically develop several models for
each challenge, from very simple to complex.
Some difference in performance will be
explained by simple variance: which
training/evaluation pairs were selected, how
well parameters were optimized, etc.
Small performance wins argue for simpler
models.
27

Data Science 1:
Probability, Statistics & Significance

[Section divider: course topic word cloud repeated from the title slide; next up: Correlation]
28

Correlation Analysis
Two factors are correlated when the value of x has some predictive power over the value of y.
The correlation coefficient of X and Y measures the degree to which Y is a function of X (and vice versa).
Correlation ranges from -1 (anti-correlated) to
1 (fully correlated) through 0 (uncorrelated).
29

The Pearson Correlation Coefficient

r = Cov(X, Y) / (σ_X σ_Y) = Σ_i (x_i - x̄)(y_i - ȳ) / sqrt( Σ_i (x_i - x̄)² ∙ Σ_i (y_i - ȳ)² )

The numerator defines the covariance, which determines the sign but not the scale.
30

The Pearson Correlation Coefficient

r = Cov(X, Y) / (σ_X ∙ σ_Y)

(covariance in the numerator; the standard deviations of X and Y in the denominator)

A point (x, y) makes a positive contribution to r when both coordinates lie above, or both below, their respective means.
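
A small sketch implementing this formula on assumed synthetic data, checked against NumPy's built-in np.corrcoef:

```python
import numpy as np

def pearson_r(x, y):
    # covariance term over the product of the two standard deviation terms
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)  # linear relationship plus noise

print(pearson_r(x, y), np.corrcoef(x, y)[0, 1])  # the two values agree
```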
31

Representative Pearson Correlations

● SAT scores and freshman GPA (r=0.47)


● SAT scores and economic status (r=0.42)
● Income and coronary disease (r=-0.717)
● Smoking and mortality rate (r=0.716)
● Video games and violent behavior (r=0.19)
32

Interpreting Correlations: r²
The square of the sample correlation coefficient r² estimates the fraction of the variance in Y explained by X in a simple linear regression.
Thus the predictive value of a correlation
decreases quadratically with r.
The correlation between height and weight
is approximately 0.8, meaning it explains
about ⅔ of the variance.
33

Variance Reduction and r²


If there is a good linear fit f(x), then the residuals y - f(x) will have lower variance than y.

Generally speaking,
1 - r² = V(y - f(x)) / V(y)

Here r = 0.94, explaining 88.4% of V(y).
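
A sketch verifying this identity on assumed synthetic data, using np.polyfit for the least-squares linear fit:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 3 * x + rng.normal(size=500)  # assumed synthetic data

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares linear fit f(x)
residuals = y - (slope * x + intercept)

r = np.corrcoef(x, y)[0, 1]
print(1 - r**2, residuals.var() / y.var())  # both sides of the identity match
```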
34

Interpreting Correlation: Significance


The statistical significance of a correlation
depends upon the sample size as well as r.
Even small correlations become significant (at the
0.05 level) with large-enough sample sizes.

This motivates “big data” multiple-parameter models: each single correlation may explain/predict only small effects, but large numbers of weak but independent correlations may together have strong predictive power.
35

Interpreting Correlations: r²

Weak correlations only explain a small fraction of the variance.
With more samples, even weak correlations become significant.
36

Spearman Rank Correlation


Counts the number of disordered pairs, not how
well the data fits a line.
Thus better with non-linear relationships and
outliers.
37

Spearman Rank Correlation


Thus better with non-linear relationships & outliers.
38

Computing Spearman Correlation


Let rank(x_i) be the rank position of x_i in sorted order, from 1 to n.
Then:

ρ = 1 - (6 Σ_i d_i²) / (n (n² - 1))

where d_i = rank(x_i) - rank(y_i).

It is the Pearson correlation of the X and Y value ranks, so it ranges from -1 to 1.
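
A sketch computing ρ from the rank formula on assumed synthetic (monotone but non-linear) data, compared against scipy.stats.spearmanr:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = np.exp(x) + rng.normal(scale=0.01, size=100)  # monotone but non-linear

d = rankdata(x) - rankdata(y)                  # d_i = rank(x_i) - rank(y_i)
n = len(x)
rho = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))  # exact when there are no ties

print(rho, spearmanr(x, y)[0])                 # the two values agree
```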
39

Correlation vs. Causation


Correlation does not mean causation.
The number of police active in a precinct correlates strongly with the local crime rate, but the police do not cause the crime.
The amount of medicine people take correlates strongly with their probability of being sick, but the medicine typically does not cause the sickness.
40

Correlation vs. Causation

“Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over there’.”
XKCD: Correlation
41

Autocorrelation and Periodicity


Time-series data often exhibits cycles which
affect its interpretation.
Sales in different businesses may well have
7 day, 30 day, 365 day, and 4*365 day cycles.
A cycle of length k can be identified by
unexpectedly large autocorrelation between
S[t] and S[t+k] for all 0 < t < n-k.
42

The Autocorrelation Function


Computing the lag-k autocorrelation takes O(n),
but the full set can be computed in O(n log n)
via the Fast Fourier Transform (FFT).
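
A sketch of the FFT-based computation via the Wiener-Khinchin relation, applied to an assumed synthetic series with a weekly cycle:

```python
import numpy as np

def autocorrelation(s):
    """All lag-k autocorrelations at once in O(n log n) via the FFT."""
    s = np.asarray(s, dtype=float) - np.mean(s)
    n = len(s)
    f = np.fft.rfft(s, n=2 * n)             # zero-pad against circular wraparound
    acf = np.fft.irfft(f * np.conj(f))[:n]  # power spectrum -> autocorrelation
    return acf / acf[0]                     # normalize so lag 0 equals 1

# assumed synthetic "sales" series with a 7-day cycle plus noise
t = np.arange(365)
rng = np.random.default_rng(0)
sales = np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, size=365)

acf = autocorrelation(sales)
print(acf[7].round(2), acf[2].round(2))  # strong at the 7-day lag, weak at lag 2
```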
43

Logarithms
The logarithm is the inverse exponential function, i.e.

y = log_b(x)  ⟺  b^y = x

We will use them here for reasons different than in algorithms courses:
summing logs of probabilities is more numerically stable than multiplying the probabilities themselves:

log(p_1 ∙ p_2 ∙ … ∙ p_n) = Σ_i log(p_i)
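
A minimal demonstration: multiplying 2,000 probabilities of 0.5 underflows to zero, while summing their logs stays stable:

```python
import numpy as np

probs = np.full(2000, 0.5)

print(np.prod(probs))         # 0.0 -- underflows past the smallest float
print(np.sum(np.log(probs)))  # -1386.29... -- the same quantity, kept stable
```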
44

Logarithms and Ratios


Ratios of two similar quantities (e.g. new_price / old_price) behave differently when reflecting increases vs. decreases.
200/100 is 100% above baseline, but 100/200 is 50% below baseline, despite being inverse changes of the same magnitude!
Taking the log of the ratios yields equal displacements: 1.0 and -1.0 (for base-2 logs).
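
A two-line NumPy illustration of this example:

```python
import numpy as np

ratios = np.array([200 / 100, 100 / 200])  # a doubling and a halving

print(ratios)           # [2.0, 0.5] -- asymmetric around 1
print(np.log2(ratios))  # [1.0, -1.0] -- equal displacement around 0
```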
45

Always Plot Logarithms of Ratios!


46

Logarithms and Power Laws


Taking the logarithm of variables with a power
law distribution brings them more in line with
traditional distributions.
Steven Skiena’s wealth is reportedly about the
same number of logs from typical students as
he is from Bill Gates!
47

Normalizing Skewed Distributions


Taking the logarithm of a value before analysis
is useful for power laws and ratios.
48

3 Use Cases for Logarithms


1. Higher precision for probability multiplication:
sum up logarithms, don’t multiply probabilities!
2. Representation of increase/decrease of ratios:
plot ratio logarithms rather than actual ratios!
3. Visualize distributions with skew or outliers:
put X-axis on a logarithmic scale when you are
looking at a power law variable!
49

Wrapup: Intro to Data Science


● Probability & statistics are fundamental for
making predictions and summarizing data
● Correlation & significance can help understand
the relationship between variables in data sets
● Logarithms can be used to normalize skewed
distributions and to make power law variables
easier to interpret
