ML cheat sheet (1)
Ch1. [Machine Learning]: automation of problem solving. The study of computer algorithms that improve automatically through experience (becoming better at a task T based on some experience E with respect to some performance measure P). ML involves ① a notion of generalization ② labeled data, an objective, and an optimization algorithm.
•Supervised learning: annotated / labelled dataset / ground truth. ① Classification: discrete variable – spam detection ② Regression: continuous variable – predicting the price of a house
•Unsupervised learning: unlabeled dataset – clustering, association mining – customer segmentation, recommendation
•Semi-supervised learning – text classification
•Reinforcement learning – self-driving car
[Type of ML] •Regression: predict age, price, score •Binary classification: Y/N, detect SPAM, positive/negative •Multi-class classification: classify an article as politics/sports/science •Multilabel classification: assign songs to one or more genres (a finite set of Y/N answers) •Autonomous behavior: input – measurements from sensors, response – instructions for actuators
[Regression loss function]
•MAE (Mean Absolute Error): the average (absolute) difference between true value and predicted value – dis: fails to punish large errors in prediction. MAE = (1/N) Σᵢ |yᵢ − ŷᵢ|
•MSE (Mean Squared Error): the average squared difference between true value and predicted value; gives more weight to outliers than MAE – dis: more sensitive to outliers. MSE = (1/N) Σᵢ (yᵢ − ŷᵢ)²
•SSE (Sum of Squared Errors): SSE = Σᵢ (ŷᵢ − yᵢ)²
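A minimal NumPy sketch of these three losses (the arrays y_true and y_pred are made-up toy values, not from the course):

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean Absolute Error: average |y - y_hat|
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    # Mean Squared Error: average (y - y_hat)^2, punishes large errors more
    return np.mean((y_true - y_pred) ** 2)

def sse(y_true, y_pred):
    # Sum of Squared Errors: like MSE but without dividing by N
    return np.sum((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
print(mae(y_true, y_pred), mse(y_true, y_pred), sse(y_true, y_pred))
```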
[Classification loss function]
•Accuracy: (# of correct) / (# of total) = (TP + TN) / (TP + TN + FP + FN)
•Error rate: (# of incorrect) / (# of total) = (FP + FN) / (TP + TN + FP + FN)
•Precision (Positive Predictive Value): TP / (TP + FP)
•Recall (True Positive Rate): TP / (TP + FN)
•True Negative Rate: TN / (TN + FP)
•F1 score: 2 · (P · R) / (P + R)
•Fβ score: (1 + β²) · (P · R) / (β² · P + R)
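A small sketch computing these metrics directly from the four confusion counts; the example counts are hypothetical:

```python
def binary_metrics(tp, tn, fp, fn, beta=1.0):
    """Binary classification metrics from the confusion-matrix counts."""
    accuracy   = (tp + tn) / (tp + tn + fp + fn)
    error_rate = (fp + fn) / (tp + tn + fp + fn)
    precision  = tp / (tp + fp)          # Positive Predictive Value
    recall     = tp / (tp + fn)          # True Positive Rate
    tnr        = tn / (tn + fp)          # True Negative Rate
    f1         = 2 * precision * recall / (precision + recall)
    fbeta      = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return accuracy, error_rate, precision, recall, tnr, f1, fbeta

# e.g. 40 TP, 45 TN, 10 FP, 5 FN (made-up counts)
print(binary_metrics(40, 45, 10, 5))
```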
[Multiclass classification] Ex) 3 classes (Spam, OK, Phish): predictions vs. true labels form a 3×3 confusion table; the cells give the per-class TP, FP and FN counts.
•Macro-average: weigh each class equally (calculate k metrics, one per class, then average them). Rare classes have the same impact as frequent classes. "It reduces the problem to multiple one-vs-all comparisons." The Macro F1-score is the harmonic mean of Macro P and Macro R. Per-class precision = (# correctly predicted as that class) / (# predicted as that class); per-class recall = (# correctly predicted as that class) / (# actually of that class).
∴ Macro-averaging Precision = (P_spam + P_ok + P_phish) / 3, Macro-averaging Recall = (R_spam + R_ok + R_phish) / 3
•Micro-average: weigh each sample equally (calculate 1 metric). It treats the entire set of data as one aggregate result (the same weight is applied to every sample). Good for imbalanced datasets.
P = (TP_s + TP_o + TP_p) / (TP_s + TP_o + TP_p + FP_s + FP_o + FP_p), R = (TP_s + TP_o + TP_p) / (TP_s + TP_o + TP_p + FN_s + FN_o + FN_p)
Micro-average P, R and F1 are the same values when each example carries exactly one label.
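A sketch contrasting macro and micro averaging on a 3-class confusion matrix; the matrix values are invented for illustration:

```python
import numpy as np

# Rows = true class, columns = predicted class (3 classes, e.g. Spam, OK, Phish).
cm = np.array([[30,  2,  3],
               [ 4, 50,  1],
               [ 2,  1, 10]])

tp = np.diag(cm).astype(float)
fp = cm.sum(axis=0) - tp          # predicted as class k but actually another class
fn = cm.sum(axis=1) - tp          # actually class k but predicted as another class

# Macro: compute the metric per class, then average (every class counts equally).
macro_p = np.mean(tp / (tp + fp))
macro_r = np.mean(tp / (tp + fn))

# Micro: pool all counts first, then compute one metric (every sample counts equally).
micro_p = tp.sum() / (tp.sum() + fp.sum())
micro_r = tp.sum() / (tp.sum() + fn.sum())

print(macro_p, macro_r, micro_p, micro_r)   # micro P == micro R with one label per sample
```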
[Empirical risk minimization]
R(f) = (1/N) Σᵢ (f(xᵢ) − yᵢ)²; for a linear model f(x) = wx + b this is (1/N) Σᵢ ((wxᵢ + b) − yᵢ)²
•Polynomial function: models a more complex relationship, higher degree of freedom/flexibility. High p ⟶ poor generalization (overfit), low p ⟶ poor fit (underfit). Solution: split the data into train/validation/test.
[Cross Validation] •Train: used to minimize the empirical risk, i.e. find w. •Validation: used to find the hyper-parameter p.
[Regularization] Regularized risk: R_reg(f) = (1/N) Σᵢ (f(xᵢ) − yᵢ)² + λ Σᵢ wᵢ². The term λ Σᵢ wᵢ² is a penalty for model complexity. Smaller values of the regularized weights, less overfitting.
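A sketch of the regularized risk for a linear model, assuming squared error plus the L2 penalty above; all data values are made up:

```python
import numpy as np

def regularized_risk(w, b, X, y, lam):
    """Empirical risk of a linear model plus an L2 penalty lam * sum(w_i^2)."""
    preds = X @ w + b
    empirical = np.mean((preds - y) ** 2)      # (1/N) sum (f(x_i) - y_i)^2
    penalty = lam * np.sum(w ** 2)             # larger lam pushes weights smaller -> less overfitting
    return empirical + penalty

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([3.0, 2.5, 4.0])
w = np.array([0.8, 0.3])
print(regularized_risk(w, b=0.5, X=X, y=y, lam=0.1))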
Ch2. [Decision Trees] •(decision) node: checks the value of a feature •root node: the start of the decision tree •Edges: correspond to the values of a test and denote the relationship by connecting to the next node or leaf (the bridges between nodes) •Leaves: terminal nodes showing the predicted outcome (the bottom-most nodes). "We do not use a loss function while training a DT. There, local optimal choices are made in a greedy manner."
[Splitting criteria]
•Information Gain (IG): used to determine the ordering of the splits of the DT.
•Entropy H(X): a measure of disorder; lowest when all labels are of the same class, highest when all labels are equally likely (a more uniform distribution ⇾ more uncertainty). Ex) 6 objects with 4 labels (counts 1:2, 2:2, 3:1, 4:1): Entropy = (2/6)·log₂(6/2) + (2/6)·log₂(6/2) + (1/6)·log₂(6/1) + (1/6)·log₂(6/1) ≈ 1.92
•Gini impurity: how often a randomly chosen element would be labeled incorrectly. G(X) = 1 − Σᵢ pᵢ² (binary case = 2p(1 − p)). Ex) for the same 6 objects: G(X) = 1 − ((2/6)² + (2/6)² + (1/6)² + (1/6)²) ≈ 0.72
•Misclassification impurity: M(X) = 1 − max_k(p_k)
[Recursion]: trees are recursively defined data structures. •Base case: leaf node •Recursive case: branch nodes. When a node can't provide further splits, we choose the label of the majority class.
[ID3 algorithm]: repeatedly compute IG and split on the best feature. A feature is good (high IG) when the distribution of data points in each child node contains a single class as much as possible. Can handle multiclass classification. "DT divides the complete dataset into smaller subsets, while an associated DT is incrementally developed."
•CART (Classification and Regression Trees): uses the Gini index as the selection measure; builds only binary trees.
Ex) Multi-class DT: compute the entropy H(X) of the full set, then IG(X, Green), IG(X, Red), IG(X, Yellow), IG(X, Long); the feature with the highest IG becomes the root node (IG = H(X) minus the weighted entropy of the children produced by the split).
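A sketch of the three impurity measures and information gain; it reuses the 6-object example from the notes, while the candidate split passed to information_gain is made up:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # H(X) = -sum p_i log2 p_i ; lowest for a single class, highest when uniform
    p = np.array(list(Counter(labels).values())) / len(labels)
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G(X) = 1 - sum p_i^2  (binary case reduces to 2p(1-p))
    p = np.array(list(Counter(labels).values())) / len(labels)
    return 1 - np.sum(p ** 2)

def misclassification(labels):
    # M(X) = 1 - max_k p_k
    p = np.array(list(Counter(labels).values())) / len(labels)
    return 1 - p.max()

def information_gain(parent_labels, child_label_lists):
    # IG = H(parent) - weighted average of the children's entropies
    n = len(parent_labels)
    children = sum(len(c) / n * entropy(c) for c in child_label_lists)
    return entropy(parent_labels) - children

# The 6-object / 4-label example from the notes: counts 2, 2, 1, 1
labels = [1, 1, 2, 2, 3, 4]
print(entropy(labels), gini(labels), misclassification(labels))
# A candidate binary split of those 6 objects (the split itself is invented):
print(information_gain(labels, [[1, 1, 2], [2, 3, 4]]))
```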
•Efficiency: prediction speed depends on the depth of the tree •Speed: depends on the number of questions needed to get to a leaf node •Depth: in a balanced binary tree, each question asked halves the number of remaining questions •Repeated halving: with 8 nodes, 2³ = 8, so 3 halvings are needed.
[Generating questions] •Categorical: binarize (1/0, Y/N) •Numerical: discretize, x < threshold
[Advantage] white-box approach, interpretability, fast classification, shows which features are important (IG), handles both continuous & categorical features, less data preparation, no feature scaling, no centering, no standardization
[Disadv] overfits, the tree can become too complex and detailed, less suitable for continuous variables, expensive to train, does not handle non-rectangular regions well
[Pruning]: reducing the complexity to cure over-fitting. Less accuracy on the training set, but more accuracy on the test set and better interpretability.
[Random Forest]: a collection of DTs combined. Sample the examples n times (with replacement), build n trees, predict with each tree, vote for the majority. •adv: increased accuracy, less over-fitting.
* "DT is robust to outliers for regression problems, whereby outliers fall far from the rest of the distribution." ① The outliers are taken care of in the root node (x) ② DT divides items by lines, and how far a point is from those lines has no impact (o) ③ The nodes are determined based on the sample proportions in each split region (o)
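One way to try these ideas out is scikit-learn (not part of the notes); the sketch below uses a synthetic dataset and shows entropy-based splits, max_depth as a simple form of pruning, feature importances, and a bagged forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Single tree: criterion="entropy" gives IG-style splits; max_depth acts as pre-pruning.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X_tr, y_tr)
print("tree accuracy:", tree.score(X_te, y_te))
print("feature importances:", tree.feature_importances_)   # which features matter

# Random forest: n bootstrap samples, n trees, majority vote -> less over-fitting.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("forest accuracy:", forest.score(X_te, y_te))
```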
Ch3. [Perceptron]: linear classifier, a binary classification learning algorithm. f(x) = w · x + b. Ex) x = (2, 0, 0, 5), w = (−0.5, 1, −2, 0), b = 0: f(x) = 2·(−0.5) + 0·1 + 0·(−2) + 5·0 + 0 = −1
•Bias: decides which class to predict by default; when w · x is close to 0, adding b makes the decision easier. [Decision boundary]: can be expressed with a slope and an intercept.
•Update: increasing b or w makes f(x) positive. If y = +1 and ŷ = −1 ⟶ [w ⇽ w + x, b ⇽ b + 1]; w + x is the new weight vector. If y = −1 and ŷ = +1 ⟶ [w ⇽ w − x, b ⇽ b − 1]. "The perceptron can achieve zero training error on any linearly separable dataset." [Termination] If there is a linear boundary separating +1 from −1, the perceptron algorithm will find it.
•Batch: has to remember the whole dataset; evaluation: cross validation; stop training when the error on validation data stops dropping. •Online: only remembers the current example and makes a prediction for it; evaluation: error rate = (mistakes made so far) / (total examples seen so far).
•Weight averaging: average all the weights from each iteration. •Sparsity: empty cells in the dataset; all absent values are implicitly zero.
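A minimal sketch of the mistake-driven update rule above; the toy dataset is made up but linearly separable:

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Mistake-driven perceptron updates; y must contain labels +1 / -1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b > 0 else -1
            if y_hat != y_i:                 # update only on mistakes
                w += y_i * x_i               # y=+1: w <- w + x ; y=-1: w <- w - x
                b += y_i                     # y=+1: b <- b + 1 ; y=-1: b <- b - 1
    return w, b

# Tiny separable toy data: positive when x1 + x2 is large
X = np.array([[0.0, 1.0], [1.0, 0.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([-1, -1, 1, 1])
w, b = train_perceptron(X, y)
print(w, b, np.where(X @ w + b > 0, 1, -1))
```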
Ch4. [Gradient Descent]: optimization algorithm. It incrementally moves in the domain of a function to find the values that minimize it.
•Linear regression: y = w · x + b. We want to find the w, b for which the error on the training data is smallest: Error(w, b) = Σᵢ (w · xᵢ + b − yᵢ)²
•Slope: the 'steepness' of the line, the ratio of change in y in the 2D plane. The slope is positive if f(x) is increasing, negative if f(x) is decreasing, and 0 if f(x) is at a local minimum or maximum. [Derivative] The first derivative can be written f′, and f′(a) is the slope of the function f at point a. On a plot with y = SSE and x = w: f′(w) < 0 → move to the right, f′(w) > 0 → move to the left.
•Update: move w against the derivative of the error, scaled by η (w ⇽ w − η · dError/dw). η is the learning rate, controlling the speed of descent. If the learning rate is too high, it skips over the optimal solution; if it is too small, we will need too many iterations to converge to the best values.
[Batch and Epoch] •Batch: the number of samples processed before the model is updated in each iteration •Epoch: the number of complete passes through the training dataset. Ex) total data = 200, batch size = 20: 1 iteration = 20 data points trained, 1 epoch = 10 iterations.
•Stochastic gradient descent (SGD): calculates the error and updates the model for each example in the training dataset. Adv1. suitable for online learning, Adv2. time-efficient; Dis1. may not reach the global minimum/maximum, Dis2. less accuracy.
•Batch Descent: uses all training samples and only after the complete cycle makes the required update. •Stochastic Descent: uses one random training sample and makes the required update. •Mini-Batch Descent: divides the dataset into batches and makes the required update after the completion of every batch.
•Momentum: a modification to SGD that smooths the gradient with memory; no modification to the learning rate. •Local minima: a potential problem for non-linear models, but not a big problem with high-dimensional data.
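A sketch of batch gradient descent on one-feature linear regression with the squared error above; x, y and the learning rate are illustrative choices:

```python
import numpy as np

# One-feature linear regression y ~ w*x + b fitted by batch gradient descent.
# Error(w, b) = sum_i (w*x_i + b - y_i)^2 ; we follow the negative gradient.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])          # roughly y = 2x + 1 (toy numbers)

w, b = 0.0, 0.0
eta = 0.01                                   # learning rate: too high skips the minimum,
for _ in range(2000):                        # too low needs many iterations
    err = w * x + b - y
    dw = 2 * np.sum(err * x)                 # dError/dw
    db = 2 * np.sum(err)                     # dError/db
    w -= eta * dw
    b -= eta * db

print(w, b)                                  # roughly w ~ 1.9, b ~ 1.2
```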
Ch5. [Logistic Regression]: regress on the probabilities of labels. Let p = probability that the label is positive, a number between 0 and 1.
•Logit function: maps p to [−∞, ∞]. logit(p) = log(p / (1 − p)). Ex) logit(0.01) = −4.6, logit(0.5) = 0, logit(0.99) = 4.6. A logistic regression model predicts logit(p) using a linear model: logit(p_pred) = w · x + b = z
•Inverse logit (= logistic = sigmoid) function: maps the logit back to a probability. logit⁻¹(z) = σ(z) = 1 / (1 + exp(−z)), 0 ≤ σ(z) ≤ 1. Ex) logit⁻¹(0) = 0.5, logit⁻¹(−4.6) = 0.01, logit⁻¹(4.6) = 0.99. So p_pred = logit⁻¹(w · x + b), with the decision boundary at p = 0.5. Ex) score₁ = 3: p₁ = logit⁻¹(3) = 0.95 ⟶ 95% probability of labeling '1'; score₂ = −1: p₂ = logit⁻¹(−1) = 0.27
•Cross entropy: the loss function quantifying mistakes for logistic regression. loss = −y · log(p_pred) − (1 − y) · log(1 − p_pred). If y = 1 and the prediction is 1: −log(1) = 0 ∴ loss = 0. If y = 1 and the prediction is 0: −log(0) = ∞ ∴ loss = ∞. If y = 0 and the prediction is 1: −log(1 − 1) = −log(0) ∴ loss = ∞. If y = 0 and the prediction is 0: −log(1) = 0 ∴ loss = 0. Minimize the log loss: find the model which gives maximum probability to the training targets.
[Perceptron vs. LR] Both use the score of the linear model (z = w · x + b). The perceptron passes it through a threshold function, LR through the inverse logit. Linear: y_pred = z; LR: p_pred = logit⁻¹(z). [SGD]: both linear and logistic regression can be learned via SGD. LR update: w_new = w_old + η · (y − p_pred) · x
•Overfitting: models with small & less variable weights are less flexible (less freedom to fit the data); penalize the weight variance to control overfitting (as λ increases, accuracy ↓).
[Interpretation] •The score of an example = log odds: log(p / (1 − p)) = z. The odds of the example: exp(z) = p / (1 − p), the probability that the event happens relative to the probability that it does not (range: 0 ≤ odds < ∞). Ex) w₁₂₃ = 1.5: if feature x₁₂₃ increases by 1 ⟶ the log odds increase by 1.5 and the odds increase exp(1.5) times.
•Binary LR: There is no dependent variable (x). The dependent variable is divided into two equal subcategories (x). The dependent variable is continuous (x). The dependent variable consists of two categories (o).
•Evaluation metrics: Accuracy, Recall, Precision, F1, AUC-ROC (measures the trade-off between the true positive rate and the false positive rate); not MSE.
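A sketch of logistic regression trained with the SGD update from the notes (w_new = w_old + η(y − p_pred)x); the toy data and learning rate are made up:

```python
import numpy as np

def sigmoid(z):
    # inverse logit: maps a score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg_sgd(X, y, eta=0.1, epochs=200):
    """SGD for logistic regression with labels y in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            p = sigmoid(np.dot(w, x_i) + b)
            w += eta * (y_i - p) * x_i       # w_new = w_old + eta * (y - p_pred) * x
            b += eta * (y_i - p)
    return w, b

def cross_entropy(y, p):
    # -y*log(p) - (1-y)*log(1-p), averaged over the dataset
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: the label tends to be 1 when the feature is large.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = train_logreg_sgd(X, y)
p = sigmoid(X @ w + b)
print(p.round(2), cross_entropy(y, p))
```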
Ch6. [Feature Engineering] Considerations: ① expensiveness of the model ② structure of the data ③ nature of the task
•Input: ① categorical: (sometimes) need to convert them to numerical features ② numerical: generic, application-agnostic representations, a common language for many algorithms
•Feature statistics: maximum, standard deviation, skewness (asymmetry): skew(x) = mean[((x − mean(x)) / std(x))³], kurtosis (peakedness): kurt(x) = mean[((x − mean(x)) / std(x))⁴]
[Feature ablation analysis]: remove one feature at a time and measure accuracy; the (removed) feature that gives the lowest accuracy is the most important feature.
•Bucketing (numeric to categorical): beneficial when ① handling outliers ② reducing model complexity ③ handling non-linear relationships
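A sketch of these feature statistics plus bucketing via np.digitize; the data and bin edges are arbitrary:

```python
import numpy as np

def skewness(x):
    # mean of ((x - mean) / std)^3 : asymmetry of the distribution
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

def kurtosis(x):
    # mean of ((x - mean) / std)^4 : heaviness of the tails
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4)

x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 40.0])      # made-up data with one outlier
print(x.max(), x.std(), skewness(x), kurtosis(x))

# Bucketing: turn the numeric feature into categories (bin edges chosen arbitrarily);
# the outlier simply falls into the top bucket.
buckets = np.digitize(x, bins=[2.0, 3.0, 10.0])
print(buckets)
```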
Ch7. [Artificial Neural Networks]
•KNN: given a set of labeled instances (training set), new instances (test set) are classified according to their nearest labeled neighbors. k is the number of labeled neighbors to consider; test points are assigned the majority label of their k nearest neighbors. Requires the complete training set in memory.
•Algorithm – boundary – learning:
 KNN – non-linear – memorization
 Decision Trees – non-linear – specialized rule learning
 Perceptron – linear – mistake-driven
 Logistic Regression – linear – error function minimization (SGD)
 Neural networks – non-linear – error function minimization (SGD)
 Limitation of the non-linear ones: inefficient for low-dimensional data (compared to linear models).
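A minimal KNN sketch with Euclidean distance and majority vote; the training points are invented:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new with the majority label of its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)     # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Made-up 2-D training set: class 0 near the origin, class 1 around (5, 5).
X_train = np.array([[0.0, 0.5], [1.0, 1.0], [0.5, 0.0],
                    [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.2])))   # -> 0
print(knn_predict(X_train, y_train, np.array([5.2, 5.8])))   # -> 1
```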
[ANN] can learn non-linear decision boundaries and approximate any computable function; learns intermediate representations. •Universal Function Approximator: feedforward neural nets (tensors are passed only forward through the network) can represent any computable function. •XOR classifier: y₍₀,₀₎ = 1, y₍₀,₁₎ = −1, y₍₁,₀₎ = −1, y₍₁,₁₎ = 1; it is impossible to separate these four points with a straight line, so the (linear) perceptron and LR cannot learn it, but a neural network can.
•Finding weights: ① define an error function ② find the weights which minimize it ③ go down the gradient of the error function. •Backpropagation: an algorithm calculating the gradient of the error function with respect to the neural network's weights (compute the derivative of the loss, propagate the error from the output layer back toward the input layer, and find how much to update each weight and bias, i.e. the gradient). ① Standardizing of input features ② Random initialization of weights ③ Choice of activation function ④ Adapting the learning rate ⑤ Number and size of hidden layers.
•Non-convexity: a NN's error function is non-convex, so GD can end up in a local minimum. Linear models have a convex error function -> a single global minimum.
[Convolutional neural networks] for image and video applications. Units are connected to spatially close units in the layer below; layers are not fully connected; weights are shared between units.
[Recurrent neural networks] for time-varying signals, speech and written text.
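A sketch of a tiny feedforward network trained with backpropagation on the XOR points (labels mapped from +1/−1 to 1/0); the architecture (4 tanh hidden units), learning rate and iteration count are arbitrary choices, not from the notes:

```python
import numpy as np

# The XOR points from the notes, with the +1 / -1 labels mapped to 1 / 0.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[1.], [0.], [0.], [1.]])

rng = np.random.default_rng(0)
W1 = rng.normal(scale=1.0, size=(2, 4))   # random initialization of weights
b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1))
b2 = np.zeros(1)
eta = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # forward pass
    H = np.tanh(X @ W1 + b1)              # hidden layer (non-linear activation)
    p = sigmoid(H @ W2 + b2)              # output probability
    # backward pass: gradient of the cross-entropy loss w.r.t. every weight
    dz2 = (p - y) / len(X)                # output-layer error
    dW2, db2 = H.T @ dz2, dz2.sum(axis=0)
    dH = dz2 @ W2.T
    dz1 = dH * (1.0 - H ** 2)             # tanh'(a) = 1 - tanh(a)^2
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    # gradient-descent updates
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

print(p.round(2).ravel())                 # typically close to [1, 0, 0, 1]
```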