4 - Basics in Statistics and Linear Algebra
Categorization of Data:
Midrange – average of the largest and the smallest values in a data set.
o Applicable to numerical (metric) data
o Not robust against outliers
𝑚𝑟 = (max(𝑥) + min(𝑥)) / 2
Median – the middle value if we have an odd number of values or the average of the middle two values
otherwise.
o Holistic measure
o Applicable to numerical and ordinal data
o Not applicable to nominal data
o Very robust against outliers
Mode – the value that occurs most frequently in the data. There is no mode if each data value occurs
only once.
o Well suited for categorical (i.e., non-numeric) data
o Can be problematic for continuous data
o If there are multiple modes: the data set is multimodal (e.g., bimodal)
Variance:
𝜎² = (1 / (𝑛 − 1)) ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̅)²
Standard deviation:
𝜎 = √[ (1 / (𝑛 − 1)) ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̅)² ]
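A minimal sketch of these measures using Python's standard statistics module (the data values are made up; 41 is a deliberate outlier, chosen to show that the midrange shifts while the median barely moves):

```python
import statistics

data = [3, 5, 7, 7, 9, 12, 41]  # hypothetical sample; 41 is an outlier

midrange = (max(data) + min(data)) / 2  # (41 + 3) / 2 = 22.0, dominated by the outlier
median = statistics.median(data)        # 7, barely affected by the outlier
mode = statistics.mode(data)            # 7, the most frequent value
variance = statistics.variance(data)    # sample variance, divides by n - 1
stdev = statistics.stdev(data)          # square root of the sample variance

print(midrange, median, mode, variance, stdev)
```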
EXPLORATORY DATA ANALYSIS:
Boxplot:
o Data is represented using a box
o The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
(interquartile range)
o The median is marked by a line within the box
o Whiskers: two lines outside the box that extend to the minimum and the maximum
Scatter Plot – provides a first look at bivariate data to see clusters of points,
outliers, etc.
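As a sketch, both plots can be produced with matplotlib on randomly generated (hypothetical) data; whis=(0, 100) makes the whiskers span the full minimum-to-maximum range as described above, since matplotlib's default caps them at 1.5 ∗ IQR:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(loc=5, scale=2, size=200)  # hypothetical univariate sample
y = 0.8 * x + rng.normal(size=200)        # second variable, correlated with x

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.boxplot(x, whis=(0, 100))  # box spans Q1..Q3 (the IQR), median line inside,
ax1.set_title("Boxplot")       # whiskers reach the minimum and maximum
ax2.scatter(x, y, s=10)        # bivariate view: clusters, trends, outliers
ax2.set_title("Scatter plot")
plt.show()
```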
2 main distributions:
1. Normal Distribution:
𝑁(𝑥|𝜇, 𝜎²) = (1 / √(2𝜋𝜎²)) ∗ 𝑒^(−(𝑥 − 𝜇)² / (2𝜎²))
𝑃(𝑋 ∈ [𝑎, 𝑏]) = ∫ₐᵇ 𝑁(𝑥|𝜇, 𝜎²) 𝑑𝑥
𝐸[𝑋] = 𝜇
𝑉𝑎𝑟(𝑋) = 𝜎²
2. Bernoulli Distribution:
𝑃(𝑋 = 1) = 𝑝, 𝑃(𝑋 = 0) = 1 − 𝑝
𝐸[𝑋] = 𝑝
𝑉𝑎𝑟(𝑋) = 𝑝(1 − 𝑝)
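A quick numerical check of both distributions with NumPy (the parameters are made up): we sample and compare the empirical mean and variance against the stated E[X] and Var(X):

```python
import numpy as np

rng = np.random.default_rng(42)

# Normal distribution: E[X] = mu, Var(X) = sigma^2
mu, sigma = 2.0, 1.5
normal = rng.normal(mu, sigma, size=100_000)
print(normal.mean(), normal.var())  # approx. 2.0 and 2.25

# Bernoulli distribution: E[X] = p, Var(X) = p(1 - p)
p = 0.3
bernoulli = rng.binomial(n=1, p=p, size=100_000)  # Bernoulli = Binomial with n = 1
print(bernoulli.mean(), bernoulli.var())  # approx. 0.3 and 0.21
```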
Main axioms:
1. Non-negative:
∀𝑎 ∈ Ω: 𝑃(𝑎) ≥ 0
2. Trivial event:
𝑃(Ω) = 1
3. Additivity:
𝑎 ∩ 𝑏 = ∅ ⇒ 𝑃(𝑎 ∪ 𝑏) = 𝑃(𝑎) + 𝑃(𝑏)
Conditional probability:
𝑃(𝑋|𝑌) = 𝑃(𝑋 ∩ 𝑌) / 𝑃(𝑌)
⇔ 𝑃(𝑋 = 𝑎|𝑌 = 𝑏) = 𝑃(𝑋 = 𝑎, 𝑌 = 𝑏) / 𝑃(𝑌 = 𝑏)
(I.e., the probability of X given Y)
(Stochastic) independence:
X and Y independent
⇔ 𝑃(𝑋) = 𝑃(𝑋|𝑌)
⇔ 𝑃(𝑋, 𝑌) = 𝑃(𝑋) ∗ 𝑃(𝑌)
⇔ 𝑃(𝑋 = 𝑎, 𝑌 = 𝑏) = 𝑃(𝑋 = 𝑎) ∗ 𝑃(𝑌 = 𝑏)
Bayes’ Theorem:
𝑃(𝐴|𝐵) = 𝑃(𝐵|𝐴) ∗ 𝑃(𝐴) / 𝑃(𝐵)
Follows from:
1. Conditional probability: 𝑃(𝐴|𝐵) = 𝑃(𝐴 ∩ 𝐵) / 𝑃(𝐵)
2. Product rule: 𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴|𝐵) ∗ 𝑃(𝐵)
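A worked example with hypothetical numbers (1% prevalence, 99% sensitivity, 5% false-positive rate), combining the product rule with the law of total probability to obtain P(B):

```python
# Hypothetical diagnostic-test numbers, purely for illustration
p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.99      # P(B|A): test positive given the condition
p_b_given_not_a = 0.05  # P(B|not A): false-positive rate

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # approx. 0.167: a positive test is far from conclusive
```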
Probability of discrete random variable:
𝑓(𝑥) = 𝑃(𝑋 = 𝑥)
Probability of continuous random variable – described by a density function 𝑓(𝑥):
𝑃(𝑋 ∈ [𝑎, 𝑏]) = ∫ₐᵇ 𝑓(𝑥) 𝑑𝑥
Expected value:
𝐸(𝑋) = ∑ₐ∈Ω(𝑋) 𝑎 ∗ 𝑃(𝑋 = 𝑎) (Discrete)
𝐸(𝑋) = ∫ℝ 𝑥 ∗ 𝑓(𝑥) 𝑑𝑥 (Continuous)
Variance:
𝜎² = 𝑉𝑎𝑟(𝑋) = 𝐸[(𝑋 − 𝐸(𝑋))²] = 𝐸(𝑋²) − 𝐸(𝑋)²
Standard deviation:
𝜎 = √𝑉𝑎𝑟(𝑋)
Covariance:
𝐶𝑜𝑣(𝑋, 𝑌) = 𝐸[(𝑋 − 𝐸(𝑋))(𝑌 − 𝐸(𝑌))] = 𝐸[𝑋𝑌] − 𝐸[𝑋]𝐸[𝑌]
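A short NumPy sketch on synthetic data confirming that both forms of the covariance agree (up to sampling noise):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(0, 1, size=100_000)
y = 2 * x + rng.normal(0, 1, size=100_000)  # y depends linearly on x

# Cov(X, Y) = E[(X - E(X)) (Y - E(Y))] = E[XY] - E[X]E[Y]
cov_centered = ((x - x.mean()) * (y - y.mean())).mean()
cov_moments = (x * y).mean() - x.mean() * y.mean()
print(cov_centered, cov_moments)  # both approx. 2, since Cov(X, 2X + noise) = 2 Var(X)
```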
a. Binning Techniques:
– Useful in unsupervised machine learning
i. Equi-width Distance Partitioning:
Divide the range into N intervals of equal size
Most straightforward
Outliers may dominate the presentation
Skewed data is not handled well
ii. Equi-height (Frequency) Partitioning:
Divides the range into N intervals, each containing approximately the same number of samples
(quantile-based approach)
Good data scaling
If, e.g., there are many occurrences of a single value, the equal frequency criterion might
not be met
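A sketch of both partitioning schemes with NumPy on a small, deliberately skewed, made-up sample; note how the outlier 40 stretches the equi-width bins while the quantile-based edges adapt to where the data actually lies:

```python
import numpy as np

data = np.array([1, 2, 2, 3, 4, 5, 8, 9, 15, 40])  # skewed hypothetical values
n_bins = 4

# Equi-width: intervals of equal length; the outlier 40 stretches the range
width_edges = np.linspace(data.min(), data.max(), n_bins + 1)

# Equi-height (quantile-based): roughly the same number of samples per bin
height_edges = np.quantile(data, np.linspace(0, 1, n_bins + 1))

print(width_edges)   # [ 1.   10.75 20.5  30.25 40.  ]; most points land in bin 1
print(height_edges)  # cut points follow the data density instead
```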
b. Clustering Analysis:
Partition the data set into clusters; then one can store only the cluster representations
Can be very effective if the data is clustered, but not if it is “smeared”, i.e. uniformly
distributed
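A minimal illustration with scikit-learn's KMeans on synthetic, well-clustered 2D data; the compressed representation stores only the centroids plus one label per point:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated hypothetical clusters in 2D
data = np.vstack([rng.normal(0, 0.5, (100, 2)),
                  rng.normal(5, 0.5, (100, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
# Each point is replaced by the centroid of its cluster
compressed = kmeans.cluster_centers_[kmeans.labels_]
print(np.abs(data - compressed).mean())  # small reconstruction error for tight clusters
```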
c. Entropy-Based Discretization:
We compute the entropy scores for different partitions and choose the one with the
lowest score (as it minimizes the entropy)
Entropy-based discretization helps with determining the best "break values" for binning, so
that the resulting bins are as pure (homogeneous) as possible with respect to the class labels
Useful in supervised machine learning
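A minimal sketch on made-up supervised data: every midpoint between consecutive feature values is a candidate break value, and we keep the one whose weighted class entropy after the split is lowest:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_entropy(values, labels, threshold):
    """Weighted entropy of the two bins created by splitting at threshold."""
    mask = values <= threshold
    n = len(labels)
    return (mask.sum() / n) * entropy(labels[mask]) + \
           ((~mask).sum() / n) * entropy(labels[~mask])

# Hypothetical feature values with class labels (supervised setting)
values = np.array([1, 2, 3, 4, 10, 11, 12, 13])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

candidates = (values[:-1] + values[1:]) / 2  # midpoints between neighbors
best = min(candidates, key=lambda t: split_entropy(values, labels, t))
print(best)  # 7.0: this split makes both bins class-pure (entropy 0)
```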