4 - Basics in Statistics and Linear Algebra

The document outlines various exercises and concepts related to matrix operations, probability theory, descriptive statistics, and data transformation techniques. It categorizes data types, explains statistical measures such as mean, median, and variance, and introduces methods for exploratory data analysis. Additionally, it discusses normalization, data reduction strategies, and discretization techniques for effective data handling.


Exercises:

4.1 – Matrix Multiplication & Transposing


4.2 – Density Function, Proof for the Mode of the Univariate Normal Distribution
4.3 – Covariance, Independence
5.1 – Hypothesis Testing
5.2 – Min-Max & Z-score Normalization
6.1 – Maximum Likelihood (ML) Estimation of a probability mass function
6.2 – Principal Component Analysis (PCA):
1. center of mass, centered matrix, covariance matrix
2. eigenvalues of a covariance matrix
3. (normalized) eigenvectors
4. feature vector / eigenvector matrix, dataset transformation
5. dimensionality reduction, variance
6. percentage of retained information
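A minimal numpy sketch of the PCA steps listed under exercise 6.2, using a small hypothetical data matrix X (rows = objects, columns = features); the data and variable names are illustrative, not taken from the exercise:

```python
import numpy as np

# Hypothetical 2-dimensional toy data: rows are objects, columns are features.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Center of mass and centered matrix
mean = X.mean(axis=0)
Xc = X - mean

# 1./2. Covariance matrix and its eigenvalues
C = np.cov(Xc, rowvar=False)          # equivalently Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)  # 3. (normalized) eigenvectors are the columns

# Sort by decreasing eigenvalue (variance along each principal component)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Feature vector / eigenvector matrix and dataset transformation
Y = Xc @ eigvecs

# 5./6. Dimensionality reduction: keep k components, report retained variance
k = 1
Y_reduced = Y[:, :k]
retained = eigvals[:k].sum() / eigvals.sum()
print(f"retained information: {retained:.1%}")
```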

Categorization of Data:

 Nominal: no order among values

 Ordinal: some ordering exists

 Discrete: the total number of values is finite

 Continuous: the total number of values is infinite (a.k.a. metric data)


DESCRIPTIVE STATISTICS:

Mean – (weighted) arithmetic mean value.


o Algebraic measure
o Only applicable for numerical data
x̄ = (1/n) · Σᵢ xᵢ

Midrange – average of the largest and the smallest values in a data set.
o Applicable for numerical and metric data
o Not robust against outliers
mr = (max(x) + min(x)) / 2

Median – the middle value if we have an odd number of values or the average of the middle two values
otherwise.
o Holistic measure
o Applicable to numerical and ordinal data
o Not applicable to nominal data
o Very robust against outliers

Mode – the value that occurs most frequently in the data. There is no mode if each data value occurs
only once.
o Well suited for categorical (i.e., non-numeric) data
o Can be problematic for continuous data
o If there are multiple modes: the data set is multimodal (e.g. bimodal)

Variance:
σ² = (1/(n−1)) · Σᵢ (xᵢ − x̄)²

Standard deviation:

σ = √[ (1/(n−1)) · Σᵢ (xᵢ − x̄)² ]
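The measures above can be computed directly; a minimal sketch with numpy and the standard library (the sample values are made up):

```python
import numpy as np
from statistics import mode  # most frequent value; statistics.multimode lists all modes

x = np.array([3.0, 5.0, 1.0, 5.0, 4.0, 2.0])   # hypothetical sample

mean       = x.mean()                 # x̄ = (1/n) Σ xᵢ
midrange   = (x.max() + x.min()) / 2  # not robust against outliers
median     = np.median(x)             # very robust against outliers
sample_var = x.var(ddof=1)            # (1/(n−1)) Σ (xᵢ − x̄)²
sample_std = x.std(ddof=1)            # square root of the variance
most_freq  = mode(x.tolist())         # value occurring most frequently
```
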
EXPLORATORY DATA ANALYSIS:

Boxplot:
o Data is represented using a box
o The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
(interquartile range)
o The median is marked by a line within the box
o Whiskers: two lines outside the box extend to
Minimum and Maximum

Histogram Analysis – graph displays of basic statistical class descriptions.


o A univariate graphical method
o Consists of a set of rectangles that reflect the counts (frequencies) of the classes present in the
given data

Scatter Plot – provides a first look at bivariate data to see clusters of points,
outliers, etc.

 Loess Curve – adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence.
A Loess curve is fitted by setting two parameters: a smoothing parameter,
and the degree of the polynomials that are fitted by the regression.

 Scatterplot Matrix – matrix of scatterplots (x-y diagrams) of d-dimensional data.
o Most useful for multidimensional data (d >> 2)

 Parallel Coordinates – every data object is visualized as a polygonal line which intersects each of the axes at the point that corresponds to the value of the object in the respective dimension.
o The ordering of dimensions is important for evaluating
correlations and/or clusters
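A minimal matplotlib/pandas sketch of the plots mentioned above (boxplot, histogram, scatter plot, scatterplot matrix, parallel coordinates); the data frame, column names, and class column are hypothetical:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix, parallel_coordinates

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df["label"] = np.where(df["a"] > 0, "pos", "neg")   # hypothetical class column

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(df["a"])            # box ends at Q1/Q3, median line inside the box
axes[1].hist(df["a"], bins=10)      # rectangles reflect the frequencies of the classes
axes[2].scatter(df["a"], df["b"])   # first look at bivariate data (clusters, outliers)

scatter_matrix(df[["a", "b", "c"]])  # matrix of x-y diagrams for d-dimensional data
parallel_coordinates(df, "label")    # one polygonal line per data object
plt.show()
```
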
PROBABILITY THEORY:

2 main distributions:

1. Normal Distribution:
N(x | μ, σ²) = (1 / √(2πσ²)) · e^(−(x − μ)² / (2σ²))
P(X ∈ [a, b]) = ∫ₐᵇ N(x | μ, σ²) dx

E[X] = μ
Var(X) = σ²

2. Bernoulli Distribution:
P(X = 1) = p,  P(X = 0) = 1 − p
E[X] = p,  Var(X) = p · (1 − p)
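A small scipy.stats sketch of the two distributions; the parameter values (μ, σ, p) are chosen only for illustration:

```python
from scipy import stats

mu, sigma = 0.0, 1.0                    # assumed parameters
X = stats.norm(loc=mu, scale=sigma)     # N(x | μ, σ²)
print(X.pdf(0.0))                       # density at x = 0
print(X.cdf(1.0) - X.cdf(-1.0))         # P(X ∈ [-1, 1]) ≈ 0.68
print(X.mean(), X.var())                # E[X] = μ, Var(X) = σ²

p = 0.3
B = stats.bernoulli(p)                  # P(X = 1) = p, P(X = 0) = 1 − p
print(B.pmf(1), B.mean(), B.var())      # p, p, p·(1 − p)
```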

Main axioms:
1. Non-negative:
∀𝑎 ∈ Ω: 𝑃(𝑎) ≥ 0
2. Trivial event:
𝑃(Ω) = 1
3. Additivity:
𝑎 ∩ 𝑏 = ∅ ⇒ 𝑃(𝑎 ∪ 𝑏) = 𝑃(𝑎) + 𝑃(𝑏)

Conditional probability:
P(X|Y) = P(X ∩ Y) / P(Y)
⇔ P(X = a | Y = b) = P(X = a, Y = b) / P(Y = b)
(i.e., the probability of X given Y)

(Stochastic) independence:
A and B independent
⇔ 𝑃(𝑋) = 𝑃(𝑋|𝑌)
⇔ 𝑃(𝑋, 𝑌) = 𝑃(𝑋) ∗ 𝑃(𝑌)
⇔ 𝑃(𝑋 = 𝑎, 𝑌 = 𝑏) = 𝑃(𝑋 = 𝑎) ∗ 𝑃(𝑌 = 𝑏)

Bayes’ Theorem:
P(A|B) = P(B|A) · P(A) / P(B)
Follows from:
1. Conditional probability: P(A|B) = P(A ∩ B) / P(B)
2. Product rule: P(A ∩ B) = P(B|A) · P(A)
Substituting 2 into 1 yields the theorem.
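A short numeric illustration of the theorem; the probabilities are invented for the example:

```python
# Hypothetical screening-test example: A = "has condition", B = "test is positive"
p_A             = 0.01           # prior P(A)
p_B_given_A     = 0.95           # P(B|A), sensitivity
p_B_given_not_A = 0.05           # P(B|¬A), false-positive rate

# Total probability: P(B) = P(B|A)·P(A) + P(B|¬A)·P(¬A)
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Bayes' theorem: P(A|B) = P(B|A)·P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))     # ≈ 0.161
```
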
Probability of discrete random variable:
𝑓(𝑥) = 𝑃(𝑋 = 𝑥)
Probability of continuous random variable:

P(X ∈ [a, b]) = ∫ₐᵇ f(x) dx

Expected value:
E(X) = Σ_{a ∈ Ω(X)} a · P(X = a)   (discrete)
E(X) = ∫ x · f(x) dx   (continuous)

Variance:
σ² = Var(X) = E[(X − E(X))²] = E(X²) − (E(X))²

Standard deviation:
σ = √Var(X)

Covariance:
Cov(X, Y) = E[(X − E(X)) · (Y − E(Y))] = E[XY] − E[X] · E[Y]

Cov(X, Y) > 0 ⇔ X increasing tends to go with Y increasing
Cov(X, Y) = 0 ⇔ no linear correlation between X and Y
Cov(X, Y) < 0 ⇔ X increasing tends to go with Y decreasing

Pearson’s Correlation Coefficient:


ρ(X, Y) = Cov(X, Y) / √(Var(X) · Var(Y))

ρ = 0 ⇔ no linear relationship (a nonlinear relationship may still exist)
ρ ≠ 0 ⇔ there is a linear relationship
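A minimal numpy sketch for covariance and Pearson's correlation coefficient; the two samples are made up and roughly linearly related:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # roughly linear in x

cov_xy = np.cov(x, y, ddof=1)[0, 1]          # sample covariance Cov(X, Y)
rho    = cov_xy / np.sqrt(x.var(ddof=1) * y.var(ddof=1))

print(rho)                        # close to 1: strong positive linear relationship
print(np.corrcoef(x, y)[0, 1])    # same value via numpy's built-in coefficient
```
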
DATA TRANSFORMATION:

Normalization – an important pre-processing step.


General Idea: to transform dimensions to be in comparable ranges.
2 ways to normalize data:
1. Min-Max Normalization: map the values into the range [0, 1] via v' = (v − min) / (max − min)
2. Z-Score Normalization: X → Z with Z = (x − μ) / σ, so the transformed values have mean 0 and standard deviation 1
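A minimal sketch of both normalizations with numpy; the attribute values are hypothetical:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values

# Min-Max normalization: map into [0, 1]
v_minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: mean 0, standard deviation 1
v_zscore = (v - v.mean()) / v.std(ddof=1)

print(v_minmax)   # [0.    0.125 0.25  0.5   1.   ]
print(v_zscore)
```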

Data Reduction Strategies:

1. Numerosity Reduction (reduce number of objects)


a. Parametric Methods – assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
b. Non-parametric Methods – do not assume models
i. Sampling
 Simple random sampling may have very poor performance in the presence of skew

2. Dimensionality Reduction / Feature Selection (reduce number of attributes)


a. Step-wise best feature selection:
 The best single feature is picked first, then the next best feature given those already selected, and so on...
 The more variance a feature/component has, the more significant it is
b. Step-wise feature elimination:
 Repeatedly eliminate the worst feature
c. Optimal branch and bound:
 Use feature elimination and backtracking
d. PCA (Principal Component Analysis) Computation:
 Goal: Transform the original coordinate axes according to the data’s variance
 Motivation: usually only a few dimensions are responsible for a large portion of the data's variance
 In some cases PCA is of little benefit, e.g. when the variance is the same in every dimension, so each dimension is equally important
 In the transformed coordinate system we can neglect dimensions with low variance
3. Discretization (reduce number of values per attribute, quantization)

a. Binning Techniques (a short sketch contrasting the two partitioning approaches follows this section):
– Useful in unsupervised machine learning
i. Equi-width Distance Partitioning:
 Divide the range into N intervals of equal size
 Most straightforward
 Outliers may dominate presentation
 Skewed data is not handled well
ii. Equi-height (Frequency) Partitioning:
 Divides the range into N intervals, each containing approximately the same number of samples (quantile-based approach)
 Good data scaling
 The equal-frequency criterion might not be met if, e.g., there are many occurrences of a single value

b. Clustering Analysis:
 Partition the data set into clusters and store only the cluster representations
 Can be very effective if the data is clustered, but not if the data is “smeared”, i.e. roughly uniformly distributed

c. Entropy-Based Discretization:
 Compute the entropy score of the different candidate partitions and choose the one with the lowest score (i.e. the one that minimizes the entropy)
 Entropy-based discretization helps determine the best "break values" (split points) for binning, so that the resulting bins are as homogeneous as possible with respect to the class labels
 Useful in supervised machine learning
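A minimal pandas sketch contrasting equi-width and equi-height (frequency) binning, as referenced under the binning techniques above; the attribute values are a made-up toy sample:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # hypothetical attribute

# Equi-width partitioning: N intervals of equal size (outliers may dominate)
equal_width = pd.cut(values, bins=3)

# Equi-height (frequency) partitioning: ~ same number of samples per bin (quantile-based)
equal_height = pd.qcut(values, q=3, duplicates="drop")

print(equal_width.value_counts().sort_index())
print(equal_height.value_counts().sort_index())
```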
