4 - Basics in Statistics and Linear Algebra

The document outlines various exercises and concepts related to matrix operations, probability theory, descriptive statistics, and data transformation techniques. It categorizes data types, explains statistical measures such as mean, median, and variance, and introduces methods for exploratory data analysis. Additionally, it discusses normalization, data reduction strategies, and discretization techniques for effective data handling.


Exercises:

4.1 – Matrix Multiplication & Transposing


4.2 – Density Function, Proof for the Mode of the Univariate Normal Distribution
4.3 – Covariance, Independence
5.1 – Hypothesis Testing
5.2 – Min-Max & Z-score Normalization
6.1 – Maximum Likelihood (ML) Estimation of a probability mass function
6.2 – Principal Component Analysis (PCA):
1. center of mass, centered matrix, covariance matrix
2. eigenvalues of a covariance matrix
3. (normalized) eigenvectors
4. feature vector / eigenvector matrix, dataset transformation
5. dimensionality reduction, variance
6. percentage of retained information
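A minimal numpy sketch of the PCA steps listed under exercise 6.2, using a small hypothetical data matrix X (rows = objects, columns = features); the data and variable names are illustrative, not taken from the exercise:

```python
import numpy as np

# Hypothetical 2-dimensional toy data: rows are objects, columns are features.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Center of mass and centered matrix
mean = X.mean(axis=0)
Xc = X - mean

# 1./2. Covariance matrix and its eigenvalues
C = np.cov(Xc, rowvar=False)          # equivalently Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)  # 3. (normalized) eigenvectors are the columns

# Sort by decreasing eigenvalue (variance along each principal component)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Feature vector / eigenvector matrix and dataset transformation
Y = Xc @ eigvecs

# 5./6. Dimensionality reduction: keep k components, report retained variance
k = 1
Y_reduced = Y[:, :k]
retained = eigvals[:k].sum() / eigvals.sum()
print(f"retained information: {retained:.1%}")
```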

Categorization of Data:

 Nominal: no order among values

 Ordinal: some ordering exists

 Discrete: the total number of values is finite

 Continuous: the total number of values is infinite (a.k.a. metric data)


DESCRIPTIVE STATISTICS:

Mean – (weighted) arithmetic mean value.


o Algebraic measure
o Only applicable for numerical data
x̄ = (1/n) · Σᵢ xᵢ

Midrange – average of the largest and the smallest values in a data set.
o Applicable for numerical and metric data
o Not robust against outliers
mr = (max(x) + min(x)) / 2

Median – the middle value if we have an odd number of values or the average of the middle two values
otherwise.
o Holistic measure
o Applicable to numerical and ordinal data
o Not applicable to nominal data
o Very robust against outliers

Mode – the value that occurs most frequently in the data. There is no mode if each data value occurs
only once.
o Well suited for categorical (i.e., non-numeric) data
o Can be problematic for continuous data
o If there are multiple modes: the data set is multimodal (e.g. bimodal)

Variance:
σ² = (1/(n−1)) · Σᵢ (xᵢ − x̄)²

Standard deviation:

σ = √[ (1/(n−1)) · Σᵢ (xᵢ − x̄)² ]
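The measures above can be computed directly; a minimal sketch with numpy and the standard library (the sample values are made up):

```python
import numpy as np
from statistics import mode  # most frequent value; statistics.multimode lists all modes

x = np.array([3.0, 5.0, 1.0, 5.0, 4.0, 2.0])   # hypothetical sample

mean       = x.mean()                 # x̄ = (1/n) Σ xᵢ
midrange   = (x.max() + x.min()) / 2  # not robust against outliers
median     = np.median(x)             # very robust against outliers
sample_var = x.var(ddof=1)            # (1/(n−1)) Σ (xᵢ − x̄)²
sample_std = x.std(ddof=1)            # square root of the variance
most_freq  = mode(x.tolist())         # value occurring most frequently
```
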
EXPLORATORY DATA ANALYSIS:

Boxplot:
o Data is represented using a box
o The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
(interquartile range)
o The median is marked by a line within the box
o Whiskers: two lines outside the box extend to
Minimum and Maximum

Histogram Analysis – graph displays of basic statistical class descriptions.


o A univariate graphical method
o Consists of a set of rectangles that reflect the counts (frequencies) of the classes present in the
given data

Scatter Plot – provides a first look at bivariate data to see clusters of points,
outliers, etc.

 Loess Curve – adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence.
A Loess curve is fitted by setting two parameters: a smoothing parameter,
and the degree of the polynomials that are fitted by the regression.

 Scatterplot Matrix – matrix of scatterplots (x-y diagrams) of d-dimensional data.
o Most useful for multidimensional data (d >> 2)

 Parallel Coordinates – every data object is visualized as a polygonal line which intersects each of the axes at the point that corresponds to the value of the object in the respective dimension.
o The ordering of dimensions is important for evaluating
correlations and/or clusters
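A minimal matplotlib/pandas sketch of the plots mentioned above (boxplot, histogram, scatter plot, scatterplot matrix, parallel coordinates); the data frame, column names, and class column are hypothetical:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix, parallel_coordinates

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df["label"] = np.where(df["a"] > 0, "pos", "neg")   # hypothetical class column

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(df["a"])            # box ends at Q1/Q3, median line inside the box
axes[1].hist(df["a"], bins=10)      # rectangles reflect the frequencies of the classes
axes[2].scatter(df["a"], df["b"])   # first look at bivariate data (clusters, outliers)

scatter_matrix(df[["a", "b", "c"]])  # matrix of x-y diagrams for d-dimensional data
parallel_coordinates(df, "label")    # one polygonal line per data object
plt.show()
```
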
PROBABILITY THEORY:

2 main distributions:

1. Normal Distribution:
N(x | μ, σ²) = (1 / √(2πσ²)) · e^(−(x − μ)² / (2σ²))
P(X ∈ [a, b]) = ∫ₐᵇ N(x | μ, σ²) dx

E[X] = μ
Var(X) = σ²

2. Bernoulli Distribution:
P(X = 1) = p,  P(X = 0) = 1 − p
E[X] = p,  Var(X) = p · (1 − p)
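A small scipy.stats sketch of the two distributions; the parameter values (μ, σ, p) are chosen only for illustration:

```python
from scipy import stats

mu, sigma = 0.0, 1.0                    # assumed parameters
X = stats.norm(loc=mu, scale=sigma)     # N(x | μ, σ²)
print(X.pdf(0.0))                       # density at x = 0
print(X.cdf(1.0) - X.cdf(-1.0))         # P(X ∈ [-1, 1]) ≈ 0.68
print(X.mean(), X.var())                # E[X] = μ, Var(X) = σ²

p = 0.3
B = stats.bernoulli(p)                  # P(X = 1) = p, P(X = 0) = 1 − p
print(B.pmf(1), B.mean(), B.var())      # p, p, p·(1 − p)
```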

Main axioms:
1. Non-negative:
∀𝑎 ∈ Ω: 𝑃(𝑎) ≥ 0
2. Trivial event:
𝑃(Ω) = 1
3. Additivity:
𝑎 ∩ 𝑏 = ∅ ⇒ 𝑃(𝑎 ∪ 𝑏) = 𝑃(𝑎) + 𝑃(𝑏)

Conditional probability:
P(X|Y) = P(X ∩ Y) / P(Y)
⇔ P(X = a | Y = b) = P(X = a, Y = b) / P(Y = b)
(i.e., the probability of X given Y)

(Stochastic) independence:
A and B independent
⇔ 𝑃(𝑋) = 𝑃(𝑋|𝑌)
⇔ 𝑃(𝑋, 𝑌) = 𝑃(𝑋) ∗ 𝑃(𝑌)
⇔ 𝑃(𝑋 = 𝑎, 𝑌 = 𝑏) = 𝑃(𝑋 = 𝑎) ∗ 𝑃(𝑌 = 𝑏)

Bayes’ Theorem:
P(A|B) = P(B|A) · P(A) / P(B)
Follows from:
1. Conditional probability: P(A|B) = P(A ∩ B) / P(B)
2. Product rule: P(A ∩ B) = P(B|A) · P(A)
Substituting 2 into 1 yields the theorem.
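A short numeric illustration of the theorem; the probabilities are invented for the example:

```python
# Hypothetical screening-test example: A = "has condition", B = "test is positive"
p_A             = 0.01           # prior P(A)
p_B_given_A     = 0.95           # P(B|A), sensitivity
p_B_given_not_A = 0.05           # P(B|¬A), false-positive rate

# Total probability: P(B) = P(B|A)·P(A) + P(B|¬A)·P(¬A)
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Bayes' theorem: P(A|B) = P(B|A)·P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))     # ≈ 0.161
```
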
Probability of discrete random variable:
𝑓(𝑥) = 𝑃(𝑋 = 𝑥)
Probability of continuous random variable:

P(X ∈ [a, b]) = ∫ₐᵇ f(x) dx

Expected value:
E(X) = Σ_{a ∈ Ω(X)} a · P(X = a)   (discrete)
E(X) = ∫ x · f(x) dx   (continuous)

Variance:
σ² = Var(X) = E[(X − E(X))²] = E(X²) − (E(X))²

Standard deviation:
σ = √Var(X)

Covariance:
Cov(X, Y) = E[(X − E(X)) · (Y − E(Y))] = E[XY] − E[X] · E[Y]

Cov(X, Y) > 0 ⇔ X increasing tends to go with Y increasing
Cov(X, Y) = 0 ⇔ no linear correlation between X and Y
Cov(X, Y) < 0 ⇔ X increasing tends to go with Y decreasing

Pearson’s Correlation Coefficient:


ρ(X, Y) = Cov(X, Y) / √(Var(X) · Var(Y))

ρ = 0 ⇔ no linear relationship (a nonlinear relationship may still exist)
ρ ≠ 0 ⇔ there is a linear relationship
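A minimal numpy sketch for covariance and Pearson's correlation coefficient; the two samples are made up and roughly linearly related:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # roughly linear in x

cov_xy = np.cov(x, y, ddof=1)[0, 1]          # sample covariance Cov(X, Y)
rho    = cov_xy / np.sqrt(x.var(ddof=1) * y.var(ddof=1))

print(rho)                        # close to 1: strong positive linear relationship
print(np.corrcoef(x, y)[0, 1])    # same value via numpy's built-in coefficient
```
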
DATA TRANSFORMATION:

Normalization – an important pre-processing step.


General Idea: to transform dimensions to be in comparable ranges.
2 ways to normalize data:
1. Min-Max Normalization: map the values into the range [0, 1] via v' = (v − min) / (max − min)
2. Z-Score Normalization: X → Z with Z = (x − μ) / σ, so the transformed values have mean 0 and standard deviation 1
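A minimal sketch of both normalizations with numpy; the attribute values are hypothetical:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values

# Min-Max normalization: map into [0, 1]
v_minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: mean 0, standard deviation 1
v_zscore = (v - v.mean()) / v.std(ddof=1)

print(v_minmax)   # [0.    0.125 0.25  0.5   1.   ]
print(v_zscore)
```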

Data Reduction Strategies:

1. Numerosity Reduction (reduce number of objects)


a. Parametric Methods – assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
b. Non-parametric Methods – do not assume models
i. Sampling
 Simple random sampling may have very poor performance in the presence of skew

2. Dimensionality Reduction / Feature Selection (reduce number of attributes)


a. Step-wise best feature selection:
 The best single feature is picked first, then the next best feature given those already selected, and so on...
 The more variance a feature/component has, the more significant it is
b. Step-wise feature elimination:
 Repeatedly eliminate the worst feature
c. Optimal branch and bound:
 Use feature elimination and backtracking
d. PCA (Principal Component Analysis) Computation:
 Goal: Transform the original coordinate axes according to the data’s variance
 Motivation: usually only a few dimensions are responsible for a large portion of the data's variance
 In some cases PCA is of little benefit, e.g. when the variance is the same in every dimension, so each dimension is equally important
 In the transformed coordinate system we can neglect dimensions with low variance
3. Discretization (reduce number of values per attribute, quantization)

a. Binning Techniques (a short sketch contrasting the two partitioning approaches follows this section):
– Useful in unsupervised machine learning
i. Equi-width Distance Partitioning:
 Divide the range into N intervals of equal size
 Most straightforward
 Outliers may dominate presentation
 Skewed data is not handled well
ii. Equi-height (Frequency) Partitioning:
 Divides the range into N intervals, each containing approximately the same number of samples (quantile-based approach)
 Good data scaling
 The equal-frequency criterion might not be met if, e.g., there are many occurrences of a single value

b. Clustering Analysis:
 Partition the data set into clusters and store only the cluster representations
 Can be very effective if the data is clustered, but not if the data is “smeared”, i.e. roughly uniformly distributed

c. Entropy-Based Discretization:
 Compute the entropy score of the different candidate partitions and choose the one with the lowest score (i.e. the one that minimizes the entropy)
 Entropy-based discretization helps determine the best "break values" (split points) for binning, so that the resulting bins are as homogeneous as possible with respect to the class labels
 Useful in supervised machine learning
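A minimal pandas sketch contrasting equi-width and equi-height (frequency) binning, as referenced under the binning techniques above; the attribute values are a made-up toy sample:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # hypothetical attribute

# Equi-width partitioning: N intervals of equal size (outliers may dominate)
equal_width = pd.cut(values, bins=3)

# Equi-height (frequency) partitioning: ~ same number of samples per bin (quantile-based)
equal_height = pd.qcut(values, q=3, duplicates="drop")

print(equal_width.value_counts().sort_index())
print(equal_height.value_counts().sort_index())
```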
