Data Science 1 2023 - Lecture 02 - Mathematical Preliminaries and Correlation

This document outlines key concepts in probability and statistics that are important for data science. It discusses probability distributions, descriptive statistics like mean and standard deviation, Bayes' theorem, and how variance in data can be misinterpreted as a signal when it is actually just noise. Understanding these fundamental statistical concepts is necessary for working with data and building predictive models.


1

Data Science 1:
Probability, Statistics & Significance

[Title slide: word cloud of course topics, including Supervised Learning, Unsupervised Learning, Errors & Artifacts, Correlation, Variance, Gradient Descent, Sampling, Data Bias, Probability, Significance, Skew, Precision, Recall, F-Score, Classification, Charts & Plots, Machine Learning, Statistics, Prediction, Logistic Regression, Linear Regression, Clustering, Bias-Variance Tradeoffs]

Summer 2023
Wolfram Wingerath, MaFe Davila Restrepo
Department for Computing Science
Data Science / Information Systems
3

Probability
Probability theory provides a formal framework
for reasoning about the likelihood of events.
The probability p(s) of an outcome s satisfies:
● 0 ≤ p(s) ≤ 1
● Σ_{s ∈ S} p(s) = 1, where S is the set of all possible outcomes

These basic properties are often violated in casual use of “probability” in data science.
4

Probability vs. Statistics


● Probability deals with predicting the
likelihood of future events, while statistics
analyzes the frequency of past events.
● Probability is a theoretical branch of
mathematics concerned with the consequences
of definitions, while statistics is applied
mathematics trying to make sense of real-
world observations.
5

Compound Events and Independence


Suppose half my students are female (event A),
and half my students are above the median (event B).
What is the probability a student is both A & B?
Events A and B are independent iff

P(A ∩ B) = P(A) × P(B)

Independence (zero correlation) is good to simplify calculations, but bad for prediction.
6

Conditional Probability
The conditional probability P(A|B) is defined:

P(A|B) = P(A ∩ B) / P(B)

Conditional probabilities get interesting only when events are not independent; otherwise:

P(A|B) = P(A ∩ B) / P(B) = P(A) P(B) / P(B) = P(A)
7

Bayes Theorem
Bayes' theorem is an important tool which
reverses the direction of the dependencies:

P(A|B) = P(B|A) P(A) / P(B)

For example:

P(A|B) = (1/2 ∙ 1/2) / (3/4) = 1/2 ∙ 1/2 ∙ 4/3 = 1/3
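
A minimal Python sketch of this computation; the probabilities P(A) = 1/2, P(B|A) = 1/2, and P(B) = 3/4 are taken from the worked example above:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a = 1 / 2          # P(A), from the example above
p_b_given_a = 1 / 2  # P(B|A)
p_b = 3 / 4          # P(B)

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # 0.333... = 1/3
```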
8

Proof of Bayes Theorem

By the definition of conditional probability:

P(A|B) P(B) = P(A ∩ B) = P(B|A) P(A)

Dividing both sides by P(B) gives:

P(A|B) = P(B|A) P(A) / P(B)

(q.e.d.) 😎
9

Distributions of Random Variables


Random variables (RVs) are numerical functions whose values come with probabilities.
Probability density functions (pdfs) represent RVs, essentially as histograms.
10

Distributions of Random Variables


Example: the sum of two dice throws.
11

Probability/Cumulative Distributions
The cdf is the running sum of the pdf:

C(X ≤ k) = Σ_{x ≤ k} P(X = x)

The pdf and cdf contain exactly the same information, one being the integral / derivative of the other.
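
A short NumPy sketch (not from the lecture) using the two-dice pdf from the previous slide, showing that the cdf is the running sum of the pdf and that differencing the cdf recovers the pdf:

```python
import numpy as np

# pdf of the sum of two fair dice: outcomes 2..12
outcomes = np.arange(2, 13)
counts = np.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1])
pdf = counts / 36.0

# The cdf is the running sum of the pdf ...
cdf = np.cumsum(pdf)

# ... and the pdf is recovered by differencing the cdf (the discrete "derivative")
assert np.allclose(np.diff(cdf, prepend=0.0), pdf)

print(dict(zip(outcomes.tolist(), cdf.round(3).tolist())))
```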
12

Visualizing Cumulative Distributions


Apple iPhone sales have been exploding, right?
13

How explosive is that growth, really?


Cumulative distributions present a misleading view of growth rate.
The incremental change is the derivative of this function, which is hard to visualize.
14

How explosive is that growth, really?


15

Descriptive Statistics
Descriptive statistics provides ways to capture
the properties of a given data set / sample.
● Central tendency measures describe the center around which the data is distributed.
● Variation or variability measures describe
data spread, i.e. how far the measurements
lie from the center.
16

Centrality Measure: Mean


To calculate the mean, sum the values and divide by the number of observations:

x̄ = (1/n) Σ_i x_i

The mean is meaningful for symmetric distributions without outliers.
17

Other Centrality Measures


The median represents the middle value.
The geometric mean is the nth root of the product of n values:

(a_1 ∙ a_2 ∙ … ∙ a_n)^(1/n)

The geometric mean is always ≤ the arithmetic mean, and it is more sensitive to values near zero.
Geometric means make sense with ratios: 1/2 and 2/1 should average to 1.
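
A quick NumPy illustration of the ratio example, with illustrative variable names:

```python
import numpy as np

x = np.array([1 / 2, 2 / 1])  # the two ratios from the slide

arithmetic = x.mean()                 # 1.25
geometric = x.prod() ** (1 / len(x))  # 1.0 -- the sensible average for ratios
median = np.median(x)                 # 1.25 (midpoint of the two values)

print(arithmetic, geometric, median)
```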
18

Which Measure is Best?


Mean is meaningful for symmetric distributions
without outliers: e.g. height and weight.
Median is better for skewed distributions or
data with outliers: e.g. wealth and income.
Bill Gates adds $250 to the mean per capita
wealth in the US, but nothing to the median.
19

Aggregation as Data Reduction


Representing a group of elements by a new derived element, like mean, min, count, or sum, reduces a large dataset to a small summary statistic.
Such statistics can become features when
taken over natural groups or clusters in the full
data set.
20

Variance Metric: Standard Deviation


The variance is the square of the standard deviation (SD) σ:

σ² = (1/n) Σ_i (x_i - x̄)²

Do we divide by n or n-1?

The population SD divides by n, the sample SD by n-1, but for large n, n ≈ n-1, so it doesn't really matter.
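
A small NumPy sketch of the n vs. n-1 distinction on an assumed synthetic sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=1000)  # assumed synthetic sample

population_sd = np.std(x, ddof=0)  # divide by n
sample_sd = np.std(x, ddof=1)      # divide by n - 1

print(population_sd, sample_sd)    # nearly identical for n = 1000
```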
21

The Printer Cartridge Life Distribution


Distributions with the same mean can look very
different.
But together, the mean and standard deviation
fairly well characterize any distribution.
22

The Printer Cartridge Life Distribution

Super-reliable printer cartridge

Normal printer cartridge with built-in end-of-warranty killswitch
23

Parameterizing Distributions
Regardless of how the data is distributed, at least a (1 - 1/k²) fraction of the points must lie within k∙σ of the mean (Chebyshev's inequality).
Thus at least 75% must lie within two sigma of the mean.
Even tighter bounds apply for normal distributions.
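
A quick empirical check of the bound on assumed, deliberately non-normal (exponential) data:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=1.0, size=100_000)  # skewed, decidedly non-normal

mu, sigma, k = x.mean(), x.std(), 2
within_k_sigma = np.mean(np.abs(x - mu) < k * sigma)

# Chebyshev guarantees at least 1 - 1/k**2 = 0.75; the observed fraction is higher
print(within_k_sigma)
```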
24

Interpreting Variance (Stock Market)


It is hard to measure “signal to noise” ratio,
because much of what you see is just variance.
Consider measuring the relative “skill” of
different stock market investors.
Annual fluctuations in performance among funds are large enough that investor performance looks essentially random, meaning there is little real difference in skill.
25

Interpreting Variance (Batting Avg)


In baseball, 0.300 hitters (30% success rate)
represent consistency over 500 at-bats/season.
But simulations show a real
0.300 hitter has a 10% chance of
hitting 0.275 or below.

They also have a 10% chance of hitting 0.325 or above.

Good or bad season, or just lucky/unlucky?

→ It's really easy to interpret something as signal that is actually just noise
→ This is the kind of problem where wisdom (arguably) helps
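
A minimal simulation sketch of this claim; modeling each season as 500 independent at-bats with success probability 0.300 is an assumption about how such numbers are produced:

```python
import numpy as np

rng = np.random.default_rng(7)
# 100,000 simulated seasons of a true .300 hitter with 500 at-bats each
averages = rng.binomial(n=500, p=0.300, size=100_000) / 500

print(np.mean(averages <= 0.275))  # ~0.1: looks like a "bad" season
print(np.mean(averages >= 0.325))  # ~0.1: looks like a "great" season
```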
26

Interpreting Variance (Many Models)


We will typically develop several models for
each challenge, from very simple to complex.
Some difference in performance will be
explained by simple variance: which
training/evaluation pairs were selected, how
well parameters were optimized, etc.
Small performance wins argue for simpler
models.
27

Data Science 1:
Probability, Statistics & Significance

[Section divider: course topic word cloud repeated from the title slide; next up: Correlation]
28

Correlation Analysis
Two factors are correlated when the value of x has some predictive power over the value of y.
The correlation coefficient of X and Y measures the degree to which Y is a function of X (and vice versa).
Correlation ranges from -1 (anti-correlated) to
1 (fully correlated) through 0 (uncorrelated).
29

The Pearson Correlation Coefficient

r = Cov(X, Y) / (σ_X σ_Y) = Σ_i (x_i - x̄)(y_i - ȳ) / sqrt( Σ_i (x_i - x̄)² ∙ Σ_i (y_i - ȳ)² )

The numerator defines the covariance, which determines the sign but not the scale.
30

The Pearson Correlation Coefficient

r = Cov(X, Y) / (σ_X ∙ σ_Y)

(covariance in the numerator; the standard deviations of X and Y in the denominator)

A point (x, y) makes a positive contribution to r when both coordinates lie above, or both below, their respective means.
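
A small sketch implementing this formula on assumed synthetic data, checked against NumPy's built-in np.corrcoef:

```python
import numpy as np

def pearson_r(x, y):
    # covariance term over the product of the two standard deviation terms
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)  # linear relationship plus noise

print(pearson_r(x, y), np.corrcoef(x, y)[0, 1])  # the two values agree
```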
31

Representative Pearson Correlations

● SAT scores and freshman GPA (r=0.47)


● SAT scores and economic status (r=0.42)
● Income and coronary disease (r=-0.717)
● Smoking and mortality rate (r=0.716)
● Video games and violent behavior (r=0.19)
32

Interpreting Correlations: r²
The square of the sample correlation coefficient r² estimates the fraction of the variance in Y explained by X in a simple linear regression.
Thus the predictive value of a correlation
decreases quadratically with r.
The correlation between height and weight
is approximately 0.8, meaning it explains
about ⅔ of the variance.
33

Variance Reduction and r²


If there is a good linear fit f(x), then the residuals y - f(x) will have lower variance than y.

Generally speaking,
1 - r² = V(y - f(x)) / V(y)

Here r = 0.94, explaining 88.4% of V(y).
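
A sketch verifying this identity on assumed synthetic data, using np.polyfit for the least-squares linear fit:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 3 * x + rng.normal(size=500)  # assumed synthetic data

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares linear fit f(x)
residuals = y - (slope * x + intercept)

r = np.corrcoef(x, y)[0, 1]
print(1 - r**2, residuals.var() / y.var())  # both sides of the identity match
```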
34

Interpreting Correlation: Significance


The statistical significance of a correlation
depends upon the sample size as well as r.
Even small correlations become significant (at the
0.05 level) with large-enough sample sizes.

This motivates “big data” multiple-parameter models: each single correlation may explain/predict only small effects, but large numbers of weak but independent correlations may together have strong predictive power.
35

Interpreting Correlations: r²

Weak correlations only explain a small fraction of the variance.
With more samples, even weak correlations become significant.
36

Spearman Rank Correlation


Counts the number of disordered pairs, not how
well the data fits a line.
Thus better with non-linear relationships and
outliers.
37

Spearman Rank Correlation


Thus better with non-linear relationships & outliers.
38

Computing Spearman Correlation


Let rank(x_i) be the rank position of x_i in sorted order, from 1 to n.
Then:

ρ = 1 - (6 Σ_i d_i²) / (n (n² - 1))

where d_i = rank(x_i) - rank(y_i).

It is the Pearson correlation of the X and Y value ranks, so it ranges from -1 to 1.
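
A sketch computing ρ from the rank formula on assumed synthetic (monotone but non-linear) data, compared against scipy.stats.spearmanr:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = np.exp(x) + rng.normal(scale=0.01, size=100)  # monotone but non-linear

d = rankdata(x) - rankdata(y)                  # d_i = rank(x_i) - rank(y_i)
n = len(x)
rho = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))  # exact when there are no ties

print(rho, spearmanr(x, y)[0])                 # the two values agree
```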
39

Correlation vs. Causation


Correlation does not mean causation.
The number of police active in a precinct correlates strongly with the local crime rate, but the police do not cause the crime.
The amount of medicine people take correlates strongly with their probability of being sick, but the medicine typically does not cause the sickness.
40

Correlation vs. Causation

“Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over there’.”
XKCD: Correlation
41

Autocorrelation and Periodicity


Time-series data often exhibits cycles which
affect its interpretation.
Sales in different businesses may well have
7 day, 30 day, 365 day, and 4*365 day cycles.
A cycle of length k can be identified by
unexpectedly large autocorrelation between
S[t] and S[t+k] for all 0 < t < n-k.
42

The Autocorrelation Function


Computing the lag-k autocorrelation takes O(n),
but the full set can be computed in O(n log n)
via the Fast Fourier Transform (FFT).
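
A sketch of the FFT-based computation via the Wiener-Khinchin relation, applied to an assumed synthetic series with a weekly cycle:

```python
import numpy as np

def autocorrelation(s):
    """All lag-k autocorrelations at once in O(n log n) via the FFT."""
    s = np.asarray(s, dtype=float) - np.mean(s)
    n = len(s)
    f = np.fft.rfft(s, n=2 * n)             # zero-pad against circular wraparound
    acf = np.fft.irfft(f * np.conj(f))[:n]  # power spectrum -> autocorrelation
    return acf / acf[0]                     # normalize so lag 0 equals 1

# assumed synthetic "sales" series with a 7-day cycle plus noise
t = np.arange(365)
rng = np.random.default_rng(0)
sales = np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, size=365)

acf = autocorrelation(sales)
print(acf[7].round(2), acf[2].round(2))  # strong at the 7-day lag, weak at lag 2
```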
43

Logarithms
The logarithm is the inverse exponential function, i.e.

y = log_b(x)  ⟺  b^y = x

We will use them here for reasons different than in algorithms courses:
summing logs of probabilities is more numerically stable than multiplying the probabilities themselves:

log(p_1 ∙ p_2 ∙ … ∙ p_n) = Σ_i log(p_i)
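
A minimal demonstration: multiplying 2,000 probabilities of 0.5 underflows to zero, while summing their logs stays stable:

```python
import numpy as np

probs = np.full(2000, 0.5)

print(np.prod(probs))         # 0.0 -- underflows past the smallest float
print(np.sum(np.log(probs)))  # -1386.29... -- the same quantity, kept stable
```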
44

Logarithms and Ratios


Ratios of two similar quantities (e.g. new_price / old_price) behave differently when reflecting increases vs. decreases.
200/100 is 100% above baseline, but 100/200 is 50% below baseline, despite being inverse changes of the same magnitude!
Taking the log of the ratios yields equal displacements: 1.0 and -1.0 (for base-2 logs).
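
A two-line NumPy illustration of this example:

```python
import numpy as np

ratios = np.array([200 / 100, 100 / 200])  # a doubling and a halving

print(ratios)           # [2.0, 0.5] -- asymmetric around 1
print(np.log2(ratios))  # [1.0, -1.0] -- equal displacement around 0
```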
45

Always Plot Logarithms of Ratios!


46

Logarithms and Power Laws


Taking the logarithm of variables with a power
law distribution brings them more in line with
traditional distributions.
Steven Skiena’s wealth is reportedly about the
same number of logs from typical students as
he is from Bill Gates!
47

Normalizing Skewed Distributions


Taking the logarithm of a value before analysis
is useful for power laws and ratios.
48

3 Use Cases for Logarithms


1. Higher precision for probability multiplication:
sum up logarithms, don’t multiply probabilities!
2. Representation of increase/decrease of ratios:
plot ratio logarithms rather than actual ratios!
3. Visualize distributions with skew or outliers:
put X-axis on a logarithmic scale when you are
looking at a power law variable!
49

Wrapup: Intro to Data Science


● Probability & statistics are fundamental for
making predictions and summarizing data
● Correlation & significance can help understand
the relationship between variables in data sets
● Logarithms can be used to normalize skewed
distributions and to make power law variables
easier to interpret
