Cheat Sheet

Binning and scaling are techniques used for data discretization and normalization. Binning partitions data into bins of equal width or frequency, while scaling transforms data values into a specific range like 0-1. Common scaling methods include z-score normalization and decimal scaling. Dissimilarity matrices contain measures of distance or dissimilarity between data points, like Euclidean distance, Manhattan distance, and Minkowski distance. Entropy is a measure of randomness in data, with higher entropy indicating more random data. Information gain is used to find the optimal split point in entropy-based discretization by maximizing the difference in entropy.


Binning:
1- Sort the data in ascending order.
2- Partition the data into bins:
• Equal-width (interval/distance) binning: Range = Max − Min; interval length L = Range / No. of bins; Bins: Bin1 = [min, min+L), …, BinMax = [max−L, max]. Bins may hold different numbers of samples.
• Equal-frequency (equal-depth) binning: each bin contains the same number (L) of samples.
3- Smooth each bin: by bin mean, bin median, or bin boundaries (based on the stated assumption).

Scaling: map u ∈ [u0, um] to v ∈ [v0, vm]
• 0-1 scaling: v = (u − u0) / (um − u0)
• Z-score normalization: v = (u − µ)/σ (equivalently z = (X − µ)/σ)
• Decimal scaling: v = u / 10^k, with k the smallest integer such that max(|v|) ≤ 1, i.e. v ∈ [−1, 1]
(A short Python sketch of binning and scaling follows the norms list below.)

Basic statistics: Median = mid-point of the sorted data; Range = Max(X) − Min(X).

Norms and distances (the entries of the dissimilarity matrix):
• L1 norm (p = 1), Manhattan (city-block) distance: d(x, y) = Σ_i |xi − yi|
• L2 norm (p = 2), Euclidean distance: d(x, y) = sqrt(Σ_i (xi − yi)²)
• Minkowski distance (Lp norm): d(x, y) = (Σ_i |xi − yi|^p)^(1/p)
• Dissimilarity matrix: the n × n symmetric matrix of pairwise distances d(xi, xj), with zeros on the diagonal.
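A minimal NumPy sketch (not from the sheet) of equal-width/equal-frequency binning and the three scaling methods; the data values and bin count are made up for illustration:

```python
# Illustrative sketch: equal-width / equal-frequency binning and the three scaling methods.
import numpy as np

x = np.array([4.0, 8.0, 9.0, 15.0, 21.0, 21.0, 24.0, 25.0, 26.0, 28.0, 29.0, 34.0])

# --- Equal-width binning ---
n_bins = 3
L = (x.max() - x.min()) / n_bins                 # interval length = Range / No. of bins
edges = x.min() + L * np.arange(n_bins + 1)
width_bin = np.digitize(x, edges[1:-1])          # bin index (0..n_bins-1) per value

# --- Equal-frequency (equal-depth) binning ---
depth = len(x) // n_bins                         # samples per bin (assumes len(x) divisible)
freq_bin = np.repeat(np.arange(n_bins), depth)   # data already sorted ascending

# --- Smoothing by bin means (equal-width bins) ---
smoothed = np.array([x[width_bin == b].mean() for b in width_bin])

# --- Scaling ---
v01  = (x - x.min()) / (x.max() - x.min())       # 0-1 scaling
vz   = (x - x.mean()) / x.std(ddof=1)            # z-score normalization (sample sigma)
k    = int(np.ceil(np.log10(np.abs(x).max())))   # smallest k with max(|v|) <= 1
vdec = x / 10**k                                 # decimal scaling
```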
Vector norms:
• Euclidean norm (length): ‖x‖ = sqrt(Σ_i xi²)
• Normalization to unit length (‖x‖ = 1): x~ = x / ‖x‖

Quartiles:
• Q2 = median
• Q1, Q3 = median of the left/right half relative to Q2:
o If the total count is odd, take the middle value as Q2, then split into left and right halves (excluding the Q2 value).
o If the total count is even, take the average of the middle two values as Q2, then split into left and right halves (including those two values).
• Python way: position of Qi = i(n+1)/4, then read off the actual value at that position.
o If the position is a fraction (e.g. i.5), take the average of the values at positions i and i+1.

Sample variance: s² = Σ_i (xi − x̄)² / (n − 1); standard deviation: s = sqrt(s²) (use (n − 1) for samples, (n) for populations).
Empirical cumulative distribution function (ECDF): F̂(x) = (1/n) Σ_i I(xi ≤ x), where I(·) is the binary indicator function.

Entropy: the more random the data, the higher the information, the higher the entropy, and the lower the probability.
H(S) = –p log2 p – q log2 q, where p, q are the probabilities of each class.

Entropy-based discretization: find the best split τ that maximizes the information gain (see the sketch below).
• Select candidate τ values (mid-points between consecutive values), then split: S1: values ≤ τ, S2: values > τ.
• Info of τ: H(S1, S2) = (|S1|/|S|) H(S1) + (|S2|/|S|) H(S2)
• Info gain: G(τ, S) = H(S) − H(S1, S2); choose the τ with the maximum info gain.
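A minimal sketch (not from the sheet) of entropy-based discretization: it scans candidate mid-points τ and returns the one with maximum information gain. The toy values and labels are invented for illustration:

```python
# Illustrative sketch: entropy-based discretization for one numeric attribute, binary classes.
import numpy as np

def entropy(labels):
    """H(S) = -p log2 p - q log2 q for the class proportions in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def best_split(values, labels):
    """Return the mid-point tau with maximum info gain G(tau, S) = H(S) - H(S1, S2)."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    H_S = entropy(labels)
    best_tau, best_gain = None, -np.inf
    for tau in (values[:-1] + values[1:]) / 2:          # candidate mid-points
        S1, S2 = labels[values <= tau], labels[values > tau]
        H_split = (len(S1) * entropy(S1) + len(S2) * entropy(S2)) / len(labels)
        gain = H_S - H_split
        if gain > best_gain:
            best_tau, best_gain = tau, gain
    return best_tau, best_gain

# Toy example: the attribute separates the classes around 10.
vals = np.array([1.0, 2.0, 3.0, 9.0, 11.0, 12.0, 15.0, 20.0])
labs = np.array([0,   0,   0,   0,   1,    1,    1,    1  ])
print(best_split(vals, labs))   # tau = 10.0, gain = 1.0
```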
Vectors:
• If xᵀy = 0 → x and y are orthogonal (normal) vectors.
• Kronecker delta: δij = 1 if i = j, 0 otherwise.
• Projection of y on x: p = (xᵀy / xᵀx) x = (uxᵀy) ux, where ux is the unit vector of x, ux = x/‖x‖.
• Linear independence: x and y are linearly dependent if y = αx for some α ≠ 0, otherwise they are linearly independent. Orthogonal vectors are linearly independent (but not vice versa).

Probability basics:
• Empirical probability: Pr(X < 5) = (number of samples satisfying the condition) / (total number of samples).
• Bivariate joint distribution, empirical joint probability mass function: f̂(x, y) = (number of samples with X = x and Y = y) / n.
• Statistical independence condition: P(X, Y) = P(X) P(Y); hence the joint cdf factorizes, F(x, y) = F(x) F(y), and the joint pdf factorizes, f(x, y) = f(x) f(y).
• Mean of a multivariate vector: µ = E[X] = (E[X1], …, E[Xd])ᵀ; sample mean x̄ = (1/n) Σ_i xi.
• Pearson correlation coefficient: ρ(X1, X2) = σ12 / (σ1 σ2) = Cov(X1, X2) / (σ1 σ2) (see the sketch below).
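A short sketch (not from the sheet) computing the sample mean, sample covariance and Pearson correlation coefficient for a made-up bivariate sample:

```python
# Illustrative sketch: sample mean, covariance and Pearson correlation for two attributes.
import numpy as np

x1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
x2 = np.array([1.0, 3.0, 2.0, 5.0,  4.0])

mean = np.array([x1.mean(), x2.mean()])                  # mean of the multivariate vector
cov12 = ((x1 - x1.mean()) * (x2 - x2.mean())).sum() / (len(x1) - 1)   # sample covariance
rho = cov12 / (x1.std(ddof=1) * x2.std(ddof=1))          # Pearson correlation coefficient

print(mean, cov12, rho)
print(np.corrcoef(x1, x2)[0, 1])                         # same value via NumPy
```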
Probability distribution functions (discrete):
o Probability mass function (pmf): the set of probabilities P(X = xi); since it is a set of probabilities, the sum = 1.
o Cumulative distribution function (cdf): F(x) = P(X ≤ x) = Σ_{xi ≤ x} P(X = xi).

Probability distribution functions (continuous):
o Cumulative distribution function (cdf): F(x) = P(X ≤ x).
o Probability density function (pdf): f(x) = dF(x)/dx. To obtain probabilities, integrate: P(a ≤ X ≤ b) = ∫ from a to b of f(x) dx.
o Ratio: if p(x1) > p(x2), then the probability that X falls close to x1 is higher than the probability that it falls close to x2.

Covariance: σ12 = Cov(X1, X2) = E[(X1 − µ1)(X2 − µ2)] = E[X1 X2] − µ1 µ2.
Covariance matrix: Σ = E[(X − µ)(X − µ)ᵀ], with the variances σi² on the diagonal and the covariances σij off the diagonal.
Total variance: the sum of the variances of X1, X2, … = trace(Σ).

Linear transformation: given X with mean µX and covariance ΣX, the transformed variable Y = A X has µY = A µX and ΣY = A ΣX Aᵀ.

Calculating eigenvalues and eigenvectors (2 × 2 matrix A = [[a, b], [c, d]]):
• Eigenvalues: with trace T = a + d and determinant D = ad − bc, λ1,2 = T/2 ± sqrt(T²/4 − D).
• Eigenvectors: if c ≠ 0 → [λ − d, c]ᵀ; else if b ≠ 0 → [b, λ − a]ᵀ; if b = 0 and c = 0 → [1, 0]ᵀ and [0, 1]ᵀ.
• Then normalize using the Euclidean norm: divide each eigenvector [a, b]ᵀ by sqrt(a² + b²).
• Verify by A v = λ v.
(See the worked sketch below.)

Matrix inverse (2 × 2): A⁻¹ = 1/(ad − bc) · [[d, −b], [−c, a]].
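A worked sketch (not from the sheet) of the 2 × 2 eigenvalue/eigenvector shortcut above, using an arbitrary symmetric matrix A:

```python
# Illustrative sketch: 2x2 eigenvalues via trace T and determinant D, verified by A v = lambda v.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
a, b = A[0]
c, d = A[1]

T = a + d                      # trace
D = a * d - b * c              # determinant
lams = [T / 2 + np.sqrt(T**2 / 4 - D), T / 2 - np.sqrt(T**2 / 4 - D)]

for lam in lams:
    if c != 0:
        v = np.array([lam - d, c])
    elif b != 0:
        v = np.array([b, lam - a])
    else:                      # b = 0 and c = 0: standard basis vectors
        v = np.array([1.0, 0.0])
    v = v / np.linalg.norm(v)  # normalize by the Euclidean norm
    print(lam, v, np.allclose(A @ v, lam * v))   # verify A v = lambda v
```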
Normal distributions (Gaussians):
• X is normally distributed if its pdf is Gaussian: f(x) = (1/(σ sqrt(2π))) exp(−(x − µ)²/(2σ²)), i.e. X ~ N(µ, σ²) means X has a normal distribution.
• Standard normal distribution: Z = (X − µ)/σ ~ N(0, 1); a z-score of 0 is the mean.
• Joint pdf of a multivariate normal RV: f(x) = (1/((2π)^(d/2) |Σ|^(1/2))) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)).
• A linear transformation of a Gaussian is again Gaussian.
• Sums: if X = X1 + X2 + …, then µX = µX1 + µX2 + …
• Mahalanobis distance: the distance of x from the mean, normalized by the covariance: (x − µ)ᵀ Σ⁻¹ (x − µ) (squared form). If the covariance matrix is the identity matrix, it reduces to the squared Euclidean distance. (See the sketch below.)

Univariate categorical attribute (Bernoulli variable):
• X ∈ {0, 1}; pmf: P(X = 1) = p, P(X = 0) = 1 − p, with p: probability of success.
• Mean: E[X] = p; Variance: var(X) = p(1 − p).
• Sample mean: p̂ = n1/n; sample variance: p̂(1 − p̂), where n1: count of xi = 1, n0: count of xi = 0, n: total.
• Number of ways (possible combinations) of K successes in n trials: C(n, K) = n! / (K! (n − K)!), with K: success count.

Multivariate Bernoulli (categorical attribute with several symbols, one-hot encoded):
• Sample mean: p̂i = ni/n for each symbol.
• Covariance between Xi and Xj (i ≠ j): σij = E[Xi Xj] − pi pj = −pi pj, where E[Xi Xj] = 0 (the symbols never overlap).
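A short sketch (not from the sheet) of the squared Mahalanobis distance of a point from a sample mean; the data points are invented:

```python
# Illustrative sketch: squared Mahalanobis distance vs. squared Euclidean distance.
import numpy as np

X = np.array([[2.0, 1.0], [3.0, 2.0], [4.0, 4.0], [5.0, 3.0], [6.0, 5.0]])
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)          # sample covariance matrix ((n-1) denominator)

x = np.array([5.0, 1.0])
d_maha2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # (x - mu)^T Sigma^-1 (x - mu)
d_eucl2 = (x - mu) @ (x - mu)                          # what Sigma = I would give

print(d_maha2, d_eucl2)
```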
PCA (data X with n points and m attributes):
• Mean: µ = (1/n) Σ_i xi. Center the data: zi = xi − µ.
• Covariance matrix: Σ = (1/n) Σ_i (xi − µ)(xi − µ)ᵀ.
• Eigenvectors: Σ ui = λi ui. The eigenvector of the largest eigenvalue → 1st PC, the 2nd largest → 2nd PC; the eigenvector of the smallest eigenvalue → 1st minor component.
• Total variance along u: uᵀ Σ u.
• Projection of x onto u1 (transforming x onto u gives y): y = u1ᵀ x.
• Projection of x onto the subspace spanned by ui (i = 1, …, r), i.e. approximating x → x~: x~ = Σ_{i=1}^{r} (uiᵀ x) ui = P x, where P = Σ_{i=1}^{r} ui uiᵀ is the orthogonal projection matrix (m × m).
• Error vector: ε = x − x~.
• Xr → the projection expressed in the same feature space (the reconstruction x~); Yr → the projection expressed in a different (reduced, r-dimensional) feature space (the PC coordinates).
• Mean square error = smallest eigenvalue / m.
(See the sketch below.)

Similarity between binary vectors (m symbols):
• Squared norm of each point: ‖x‖² = number of 1's in x.
• Number of matching symbols of 1's: s = xᵀy.
• Euclidean distance: d(x, y) = sqrt(‖x‖² + ‖y‖² − 2 xᵀy).
• Hamming distance: m − s.
• Cosine similarity: cos θ = xᵀy / (‖x‖ ‖y‖).
• Jaccard similarity coefficient: J(x, y) = xᵀy / (‖x‖² + ‖y‖² − xᵀy).

Kernel functions:
• A kernel function K(x, y) = φ(x)ᵀφ(y) is a similarity measure in feature space; the kernel matrix is symmetric.
• The sum of any 2 kernel functions is a kernel function: K = K1 + K2.
• Polynomial kernel: K(x, y) = (c + xᵀy)^p.
• Quadratic kernel (p = 2), for m = 2 and c = 1: K(x, y) = (1 + xᵀy)², with feature map φ(x) = (1, √2 x1, √2 x2, x1², x2², √2 x1x2)ᵀ — in this case a six-dimensional vector.
• Gaussian (RBF) kernel: K(x, y) = exp(−‖x − y‖² / (2σ²)), where ‖x − y‖² is the squared Euclidean distance.
• Mean in feature space: µφ = (1/n) Σ_i φ(xi); norm of the mean: ‖µφ‖² = (1/n²) Σ_i Σ_j K(xi, xj).
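A minimal PCA sketch (not from the sheet) following the steps above (center, covariance, eigenvectors, project, reconstruct); the 1/n covariance convention and the toy data are assumptions for illustration:

```python
# Illustrative sketch: PCA via the covariance matrix, keeping the top r principal components.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

mu = X.mean(axis=0)
Z = X - mu                                  # center the data
Sigma = Z.T @ Z / len(X)                    # covariance matrix (1/n convention)

lam, U = np.linalg.eigh(Sigma)              # eigenvalues in ascending order
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]            # 1st column = 1st principal component

r = 1
Ur = U[:, :r]
Y = Z @ Ur                                  # coordinates in the reduced space (Yr)
X_hat = Y @ Ur.T + mu                       # reconstruction in the original space (Xr)

frac = lam[:r].sum() / lam.sum()            # fraction of total variance captured
print(frac, np.mean(np.sum((X - X_hat) ** 2, axis=1)))   # captured variance, MSE
```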


Kernel PCA:
• Centering in feature space: φ̄(xi) = φ(xi) − µφ; the centered kernel matrix is K̄ = (I − (1/n) 1 1ᵀ) K (I − (1/n) 1 1ᵀ), where K is the (uncentered) kernel matrix.
• Norm of a point in feature space: ‖φ(x)‖² = K(x, x).
• Distance in feature space: ‖φ(xi) − φ(xj)‖² = K(xi, xi) + K(xj, xj) − 2 K(xi, xj).
• When calculating the quadratic kernel with c = 1: K = (1 1ᵀ + X Xᵀ) ◦ (1 1ᵀ + X Xᵀ), where ◦ denotes the Hadamard (entry-wise) product.
• The ith PC in feature space is a linear combination of the mapped points, ui = Σ_j cij φ(xj), where ci is the ith eigenvector of K̄; u1 is the principal eigenvector of Σφ, the covariance matrix in feature space.
• Projection onto the ith PC (Kj is the jth column of K̄): aij = ciᵀ Kj.

Kernel PCA algorithm:
1- Compute the kernel matrix: K = [Kij] → Kij = K(xi, xj).
2- Center the matrix: K̄ as above.
3- Eigen-decompose K̄: K̄ ci = ηi ci; scale each ci so that ‖ci‖² = 1/ηi (the variance along the ith kernel PC is λi = ηi/n).
4- Fraction of total variance: choose r such that (Σ_{i=1}^{r} λi) / (Σ_i λi) ≥ the desired threshold.
5- Reduce dimensionality (r < n): the reduced representation of xj is (c1ᵀKj, …, crᵀKj).
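A compact kernel PCA sketch (not from the sheet) using the quadratic kernel with c = 1; the random data and the choice r = 2 are assumptions for illustration:

```python
# Illustrative sketch: kernel PCA with a quadratic kernel (c = 1, p = 2).
import numpy as np

X = np.random.default_rng(0).normal(size=(10, 2))
n = len(X)

K = (1.0 + X @ X.T) ** 2                    # quadratic kernel matrix, c = 1
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J                              # centered kernel matrix

eta, C = np.linalg.eigh(Kc)                 # eigenvalues in ascending order
order = np.argsort(eta)[::-1]
eta, C = eta[order], C[:, order]

r = 2
C_r = C[:, :r] / np.sqrt(eta[:r])           # scale ci so that ||ci||^2 = 1/eta_i
A = Kc @ C_r                                # kernel PC coordinates of each point

lam = eta / n                               # variance along each kernel PC
print(lam[:r].sum() / lam[lam > 1e-12].sum())   # fraction of total variance kept
```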


SVD:
• Factorize the matrix X: X = U Δ Vᵀ, with dimensions m × n = (m × m)(m × n)(n × n).
• U: left singular vectors; V: right singular vectors; Δ: diag(singular values); the number of non-zero singular values = rank of the matrix X.
• Linear expansion of X using SVD: X = Σ_i δi ui viᵀ (a sum of rank-one matrices).
• X can be approximated by keeping the best r singular values (δi) in descending order: X ≈ Σ_{i=1}^{r} δi ui viᵀ.
• Ur can be used to project X onto the subspace: Xr = Urᵀ X.

LDA (Fisher's LDA): finding the w vector
• Find the vector w to project x on (yi = wᵀxi) that maximizes the separation between classes ω1 and ω2.
• Between-class scatter: B = (µ1 − µ2)(µ1 − µ2)ᵀ; m1, m2 = projected means of classes ω1, ω2. The distance between the means m1, m2 defines the class separability.
• Within-class scatter: the variances within each class. For 2 classes only: S = S1 + S2, where s1², s2² are the sums of squared deviations from the projected means (the total squared deviation from the means).
• The optimum LD vector is the dominant eigenvector of S⁻¹B → the eigenvector with the largest |λi|. For 2 classes the LDA vector is w = S⁻¹(µ1 − µ2), then normalize: w ← w/‖w‖.
• Then project the data on the selected eigenvector.
(See the sketch below.)

LDA Algorithm (worked example with two 2 × 4 classes):
1- Join X1 and X2 (2 × 4 becomes 2 × 8).
2- Apply the kernel function → K = 8 × 8.
3- Means: m1 = mean of the X1 portion of each K row, m2 = similar for X2, m = mean(m1, m2).
4- Between-class scatter B: apply the between-class scatter formula → 8 × 8.
5- Within-class scatter S (8 × 8): apply the within-class scatter formula.
6- Compute the dominant eigenvector of S⁻¹B.
7- Project the data on the selected (dominant) eigenvector.

Multi-class LDA:
• Given the global mean x̄: between-class scatter B = Σ_k nk (µk − x̄)(µk − x̄)ᵀ.
• S: within-class scatter matrix (m × m, symmetric): S = Σ_k Σ_{xi ∈ ωk} (xi − µk)(xi − µk)ᵀ.

LDA Classifier: project x on the LDA vector, then find the minimum distance to the projected class means.

Discriminant functions g(x):
• Multiple classes: assign x to the class with the maximum gi(x).
• Binary: g(x) = g1(x) − g2(x); if + → x belongs to ω1, if − → x belongs to ω2.
• Minimum Distance Classifier / Nearest Centroid Classifier: gi(x) = −‖x − µi‖² (squared Euclidean distance); we take the max g(x), i.e. the minimum distance.
• Minimum Mahalanobis Distance Classifier: find the squared Mahalanobis distance between x and µi of class ωi, (x − µi)ᵀ Σ⁻¹ (x − µi), and take the minimum.
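A minimal sketch (not from the sheet) of Fisher's LDA for two classes, w = S⁻¹(µ1 − µ2), with a nearest-projected-mean decision; the class data are invented:

```python
# Illustrative sketch: two-class Fisher LDA and a nearest-projected-mean classifier.
import numpy as np

X1 = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
X2 = np.array([[9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1)              # within-class scatter of class 1
S2 = (X2 - mu2).T @ (X2 - mu2)              # within-class scatter of class 2
S = S1 + S2                                 # total within-class scatter

w = np.linalg.inv(S) @ (mu1 - mu2)          # LDA vector for 2 classes
w = w / np.linalg.norm(w)                   # normalize

def classify(x):
    """Project on w and pick the class whose projected mean is closer."""
    y, m1, m2 = w @ x, w @ mu1, w @ mu2
    return 1 if abs(y - m1) < abs(y - m2) else 2

print(w, classify(np.array([5.0, 5.0])))
```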
K-Nearest Neighbor (KNN) Classifier: x belongs to the majority class of its closest K neighbors.

Distance-Weighted KNN (see the sketch after this section):
• Rank the K neighbors and use the rank as the weight: furthest = 1, nearest = K.
• gi(x) = Sum(weights of neighbors in class ωi) / Sum(all weights); take the class with the maximum gi(x).

Bayes Rule: P(ωi | x) = p(x | ωi) P(ωi) / p(x).
• Prior probabilities: πi = P(ωi).
• Log likelihood: ln p(x | ωi).
• Posteriori rule (multiclass): assign x to the class with the maximum posterior, i.e. maximize gi(x) = ln p(x | ωi) + ln P(ωi).
• For binary: decide ω1 if g1(x) > g2(x), or equivalently if the likelihood ratio p(x | ω1)/p(x | ω2) > P(ω2)/P(ω1).

Quadratic Discriminant Analysis (QDA):
• gi(x) = −½ (x − µi)ᵀ Σi⁻¹ (x − µi) − ½ ln|Σi| + ln P(ωi).
• The LDA classifier is a special case: Σ1 = Σ2 = S/n (the quadratic term Q = 0).
• If Σ1 = Σ2 = σ²I, the minimum Mahalanobis distance classifier reduces to the minimum (nearest) centroid classifier.

Optimal classifiers for normal patterns:
• Maximum likelihood estimates: µ̂i = (1/ni) Σ_{x ∈ ωi} x, Σ̂i = (1/ni) Σ_{x ∈ ωi} (x − µ̂i)(x − µ̂i)ᵀ, P̂(ωi) = ni/n.
• If Σi = Σ for all i (i = 1, …, k) → linear discriminant function: gi(x) = µiᵀ Σ⁻¹ x − ½ µiᵀ Σ⁻¹ µi + ln P(ωi).
• If Σi = σ²I and P(ωi) = π for all i → minimum Euclidean distance classifier.
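A short sketch (not from the sheet) of distance-weighted KNN with the rank weights described above (furthest = 1, nearest = K); the toy data are invented:

```python
# Illustrative sketch: distance-weighted KNN with rank weights.
import numpy as np

def weighted_knn(X, y, x, K=3):
    """Return the class with the largest normalized sum of rank weights among the K nearest points."""
    d = np.linalg.norm(X - x, axis=1)            # Euclidean distances to x
    nearest = np.argsort(d)[:K]                  # indices of the K nearest neighbors
    weights = np.arange(K, 0, -1)                # nearest gets K, ..., furthest gets 1
    scores = {}
    for idx, w in zip(nearest, weights):
        scores[y[idx]] = scores.get(y[idx], 0) + w
    total = sum(scores.values())
    g = {c: s / total for c, s in scores.items()}    # gi(x) = class weights / all weights
    return max(g, key=g.get)

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 7.0], [6.5, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(weighted_knn(X, y, np.array([2.5, 2.0]), K=3))   # -> 0
```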
Binary classification of Gaussian patterns:
• If Σ1 = Σ2 (Gaussian) → the decision boundary is linear (linear classifier): g(x) = wᵀx + w0.
• If, in addition, P(ω1) = P(ω2) → minimum Mahalanobis distance classifier.
• If Σ1 = Σ2 = σ²I and P(ω1) = P(ω2) → minimum distance classifier.

Logistic Regression (maximum entropy / maxent):
• Logistic sigmoid function: σ(a) = 1 / (1 + e^(−a)).
• Given a training set {(xi, yi)}, yi ∈ {0, 1}; sigmoid output: ŷi = σ(wᵀxi).
• Weight vector w: found by minimizing the cross-entropy error E(w) = −Σ_i [yi ln ŷi + (1 − yi) ln(1 − ŷi)].
• Gradient of E(w): ∇E(w) = Σ_i (ŷi − yi) xi.
• More robust to outliers than least squares.
(See the sketch below.)

Softmax (multiclass logistic regression):
• The posterior is modeled with the softmax function: P(ωk | x) = exp(wkᵀx) / Σ_j exp(wjᵀx).
• Cross-entropy error function for target output yi = [yi1, yi2, …, yik]: E(W) = −Σ_i Σ_k yik ln ŷik.
• Gradient with respect to wi: ∇wi E = Σ_n (ŷni − yni) xn.

Naïve Bayes Classifier: assume the attributes are conditionally independent, p(x | ωi) = Π_j p(xj | ωi); for numeric attributes use the normal pdf p(xj | ωi) = N(xj; µij, σij²).

Decision trees:
1) Extract the class-specific data subsets.
2)–3) Find the thresholds that maximize the information (gain) for the multiple classes (entropy-based splits).
4) Classify x by following the thresholds down the tree.
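A minimal sketch (not from the sheet) of binary logistic regression trained by gradient descent on the cross-entropy error; the synthetic data, learning rate and iteration count are arbitrary choices:

```python
# Illustrative sketch: binary logistic regression via gradient descent.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
Xb = np.hstack([X, np.ones((len(X), 1))])        # append a bias term

w = np.zeros(Xb.shape[1])
lr = 0.1
for _ in range(500):
    y_hat = sigmoid(Xb @ w)                      # sigmoid output
    grad = Xb.T @ (y_hat - y)                    # gradient of the cross-entropy error
    w -= lr * grad / len(y)                      # gradient descent step

pred = (sigmoid(Xb @ w) >= 0.5).astype(int)
print(w, (pred == y).mean())                     # learned weights, training accuracy
```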
Distances (n: number of points, d: number of dimensions):
• Minkowski (Lp norm): d(x, y) = (Σ_i |xi − yi|^p)^(1/p)
• City block (Manhattan, sum of absolute differences): d(x, y) = Σ_i |xi − yi|
• Euclidean distance (L2 norm): d(x, y) = sqrt(Σ_i (xi − yi)²)
• Chebychev distance (L∞ norm): d(x, y) = max_i |xi − yi|
• Canberra distance: d(x, y) = Σ_i |xi − yi| / (|xi| + |yi|)
• Quadratic distance: d²(x, y) = (x − y)ᵀ Q (x − y), for a weight matrix Q
• Cosine similarity: cos θ = xᵀy / (‖x‖ ‖y‖); cosine distance = 1 − cos θ
• Pearson correlation coefficient distance: d(x, y) = 1 − ρ(x, y)
(See the sketch below.)

Hierarchical clustering linkage methods (distance between the merged cluster {p, q} and a third cluster k):
• Single linkage: nearest neighbor method — the minimum pairwise distance.
• Complete linkage: furthest neighbor method — the maximum pairwise distance.
• Group average: unweighted (average over all pairs) and weighted.
• Weighted: d(k, {p, q}) = (d(k, p) + d(k, q)) / 2.
• Centroid: the distance between the cluster centroids.
• Median: like centroid, but the distances to the two merged clusters are given equal weights.
• Ward's method: merge the pair that minimizes the increase in the sum of squared errors (the scatter of the error).

Hierarchical clustering strategies:
• Agglomerative: bottom-up approach (the most used).
• Divisive: top-down (computationally intensive).
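A short sketch (not from the sheet) implementing the distance functions listed above for a pair of vectors:

```python
# Illustrative sketch: the distance functions above, for a pair of vectors.
import numpy as np

def minkowski(x, y, p):   return np.sum(np.abs(x - y) ** p) ** (1.0 / p)
def city_block(x, y):     return np.sum(np.abs(x - y))
def euclidean(x, y):      return np.sqrt(np.sum((x - y) ** 2))
def chebychev(x, y):      return np.max(np.abs(x - y))
def canberra(x, y):       return np.sum(np.abs(x - y) / (np.abs(x) + np.abs(y)))
def cosine_dist(x, y):    return 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
def pearson_dist(x, y):   return 1 - np.corrcoef(x, y)[0, 1]

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])
for f in (city_block, euclidean, chebychev, canberra, cosine_dist, pearson_dist):
    print(f.__name__, round(f(x, y), 4))
print("minkowski p=3", round(minkowski(x, y, 3), 4))
```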

Optimization algorithms (partitional clustering):

K-Means:
1- Assign two (in general K) initial means.
2- Form the clusters by assigning each point to its nearest mean.
3- Calculate the means of the new clusters → these become the new means.
4- Repeat from step 2 until the new means equal the old means.
(See the sketch below.)

Buckshot Algorithm:
1- Randomly select a subsample with N1 objects.
2- Apply group-average hierarchical clustering.
3- Use the result as the seeds for K-means.

Scatter matrices (µi is the mean of cluster Ci, µ is the global mean):
• Within-cluster scatter: SW = Σ_i Σ_{x ∈ Ci} (x − µi)(x − µi)ᵀ.
• Between-cluster scatter: SB = Σ_i ni (µi − µ)(µi − µ)ᵀ.
• Mixture scatter matrix: T = SW + SB.

Optimization criteria: minimize the within-cluster scatter and maximize the between-cluster scatter; the 1st–4th criteria (generalized formula) combine SW, SB and T in different ways.

Hierarchical clustering reminder: when merging points/clusters, recalculate all distances to the merged cluster using the specified linkage method.
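A minimal K-means sketch (not from the sheet) following the loop above; it has no empty-cluster handling and the toy data are invented:

```python
# Illustrative sketch: the K-means loop (assign, recompute means, stop when means are unchanged).
import numpy as np

def kmeans(X, K=2, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), K, replace=False)]       # initial means = K random points
    while True:
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)                         # assign each point to its nearest mean
        # Recompute the means (note: no empty-cluster handling in this sketch).
        new_means = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_means, means):                 # stop when the means stop moving
            return labels, new_means
        means = new_means

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 2.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
labels, means = kmeans(X, K=2)
print(labels, means)
```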
Associations: A → B
• Support: P(A, B), i.e. count(A && B) / total number of transactions.
• Confidence: P(B | A), i.e. count(A && B) / count(A).
• Minsup: minimum support; Minconf: minimum confidence.
• One operator returns the items (Xs) that are common to all of the given transactions; the other returns the transactions that contain at least one X.
• Support in this context is the count, e.g. {A:4, B:5}, {A,B}:4.

Apriori:
• F1 = {all 1-itemsets with support ≥ minsup}.
• C2 = all possible 2-itemset combinations from F1.
• F2 = {itemsets in C2 with support ≥ minsup}.
• C3 = join F2 (itemsets with the same 1st element and a different last element).
• F3 = prune C3 (downward closure): every 2-itemset combination of each C3 item must exist in F2; then add the surviving 3-itemsets to F3.
• Then, from F2 and F3 (i.e. all F of size ≥ 2), find all possible associations (2^m − 2 rules per m-itemset), calculate the confidence, and keep those with confidence > minconf.

Class Association Rule (CAR): X → y (X: itemset, y: class label); the same rules apply.
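A compact level-wise Apriori sketch (not from the sheet); the toy transactions, minsup (given as a count, as in the sheet) and minconf are invented for illustration:

```python
# Illustrative sketch: level-wise Apriori with support counts and rule confidence.
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"A", "B", "C", "D"}, {"B", "C"}]
minsup, minconf = 2, 0.6                     # support given as a count

def support(itemset):
    return sum(itemset <= t for t in transactions)

# Level-wise frequent itemset generation (F1, F2, F3, ...).
items = sorted({i for t in transactions for i in t})
frequent = {1: {frozenset([i]) for i in items if support({i}) >= minsup}}
k = 2
while frequent[k - 1]:
    # Candidate generation: join frequent (k-1)-itemsets, then downward-closure pruning.
    cands = {a | b for a in frequent[k - 1] for b in frequent[k - 1] if len(a | b) == k}
    cands = {c for c in cands if all(frozenset(s) in frequent[k - 1] for s in combinations(c, k - 1))}
    frequent[k] = {c for c in cands if support(c) >= minsup}
    k += 1

# Rule generation from itemsets of size >= 2: confidence = support(A u B) / support(A).
for level in range(2, k):
    for itemset in frequent[level]:
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = support(itemset) / support(lhs)
                if conf >= minconf:
                    print(set(lhs), "->", set(itemset - lhs), "conf", round(conf, 2))
```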
