BMI 704 - Machine Learning Lab
Lab
030719
Topics
• Introduction to Supervised Learning
• Introduction to Unsupervised Learning
• Features
• i.e. variables (Xs)
• Inputs you are using to predict outcome
Models
• 1) Pick a person
• 2) Substitute his or her features into a model, e.g.
Diabetes = 0.5*age + 0.2*sex + 2.1*BMI + …
Height = 0.2*age + 0.8*sex + 1.3*weight + …
• 3) Now you know the predicted outcome
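Step 2 above can be sketched in code. The lab itself uses R, but here is a minimal Python sketch; the coefficients are the illustrative ones from the slide, and the person's feature values are made up for the example.

```python
# A minimal sketch of "substitute the features into the model".
# Coefficients come from the slide's illustrative diabetes model;
# the input values (age=50, sex=1, bmi=30) are assumptions.
def predict_diabetes(age, sex, bmi):
    return 0.5 * age + 0.2 * sex + 2.1 * bmi

score = predict_diabetes(age=50, sex=1, bmi=30)
print(score)  # 0.5*50 + 0.2*1 + 2.1*30 = 88.2
```

Once the coefficients are known, prediction is just plugging numbers in; the hard part, covered next, is where those coefficients come from.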
Where does the predictive model come from?
• 1) Pick an algorithm
• Linear model
• Y = X1 + X2 + X3
• 2) Split your data set into training and test sets (e.g. 80/20, 70/30)
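The 80/20 split in step 2 can be sketched in a few lines of Python (the toy data set of 100 rows is an assumption; in R you would typically index with `sample()`):

```python
import random

# A minimal sketch of an 80/20 train/test split on toy data.
def train_test_split(data, test_frac=0.2, seed=0):
    rows = list(data)
    random.Random(seed).shuffle(rows)     # shuffle before splitting
    n_test = int(len(rows) * test_frac)
    return rows[n_test:], rows[:n_test]   # train, test

data = list(range(100))                   # assumed toy data set
train, test = train_test_split(data)
print(len(train), len(test))              # 80 20
```

Shuffling first matters: if the data are ordered (e.g. by outcome), an unshuffled split gives unrepresentative training and test sets.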
Simple Regression
• R² – the amount of variance explained
• If Y = 1 or 0:
• High sensitivity:
• Y = 1 ➙ Ŷ = 1
• High specificity:
• Y = 0 ➙ Ŷ = 0
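These two definitions translate directly into code: sensitivity is the fraction of true Y = 1 cases predicted as 1, and specificity the fraction of true Y = 0 cases predicted as 0. A minimal Python sketch (the toy labels and predictions are assumptions):

```python
# Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP).
def sensitivity_specificity(y_true, y_pred):
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # assumed toy labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # assumed toy predictions
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 0.75 0.75 on this toy data
```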
Which model (algorithm) should you use?
Unsupervised Learning
• Not interested in predicting Y; instead, exploratory analysis of the Xs
• discovering patterns
• Find subgroups that you don't know in advance
• Visualize the results
• A few latent variables capture most of the information in the data
• i.e. the variance explained
[PCA biplot: loadings and scores, each axis labeled with its % variance explained]
Unsupervised Learning
• Clustering
• PCA looks to find a low-dimensional representation of the observations that explains a good fraction of the variance;
• Clustering looks to find homogeneous subgroups among the observations.
• K-means clustering
• hierarchical clustering
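The "fraction of the variance explained" by PCA can be computed by hand in two dimensions: the eigenvalues of the 2×2 covariance matrix are the variances along the principal components. A Python sketch with assumed toy data (in practice you would use `prcomp` in R):

```python
import math

# Hand-rolled 2-D PCA: eigenvalues of the sample covariance matrix
# [[a, b], [b, c]] give the variance along PC1 and PC2.
def variance_explained_2d(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    disc = math.sqrt((a - c) ** 2 + 4 * b ** 2)
    lam1, lam2 = (a + c + disc) / 2, (a + c - disc) / 2
    return lam1 / (lam1 + lam2)   # fraction of variance on PC1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]    # assumed, highly correlated toy data
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
print(variance_explained_2d(xs, ys))
```

Because the two toy variables are nearly collinear, a single principal component captures almost all of the variance, which is exactly the "few latent variables" idea from the previous slide.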
K-means clustering
• partitioning a data set into K distinct, non-overlapping clusters.
• Specify how many clusters you want
• The algorithm finds a local optimum
• Run it a few times to compare the different results
Hierarchical clustering
• Tree-based representation of the observations, called a dendrogram
• Bottom-up clustering
Algorithms and Packages
• ML Algorithms (many, many, many!)
• Basics: linear-based
• Shrinkage Methods
• Lasso and Ridge regression
• ElasticNet
• Non-linear methods
• Spline
• Support Vector Machines
• Tree based methods
• Decision trees
• Random Forests
• Packages in R
• Individual packages for each algorithm, e.g. glmnet
• Meta packages – caret
Unsupervised Learning (cont'd)
• Clustering
• Partitional methods
• K-means: partition {x1,…,xn} into K clusters, where K is predefined
• Build a new partition by associating each point with the nearest centroid
• Compute the centroid (mean point) for each cluster; repeat until convergence
• “kmeans” function in R.
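The two alternating steps above — assign points to the nearest centroid, then recompute each centroid as its cluster mean — can be sketched in Python for the 1-D case. This is only a toy version of what R's `kmeans` function does in general dimensions; the data, starting centroids, and K = 2 are assumptions.

```python
# Minimal 1-D k-means: alternate assignment and update steps.
def kmeans_1d(points, centroids, n_iter=10):
    for _ in range(n_iter):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        # update step: recompute each centroid as the cluster mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]   # assumed toy data
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)   # converges near [1.0, 10.0] for this data
```

Different starting centroids can converge to different local optima, which is why the slide recommends running the algorithm several times.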
Unsupervised Learning
• Not interested in predicting, but in discovering patterns
• Find subgroups that you don't know in advance
• Visualize the results
• Principal components
• Clustering
• Hierarchical clustering – build a hierarchy of clusters
• Agglomerative: A “bottom up” approach. You start with each element in a separate
cluster, then merge them according to a given property.
• Divisive: A “top down” approach. All elements start in one all-inclusive cluster, then you
split recursively.
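The agglomerative ("bottom up") approach can be sketched on 1-D toy data: start with one cluster per point, then repeatedly merge the two closest clusters. Single linkage (distance between the closest members) is just one possible merge property; the data and cluster count are assumptions. In the lab you would use R's `hclust` instead.

```python
# Minimal agglomerative clustering with single linkage on 1-D points.
def agglomerative(points, n_clusters):
    clusters = [[p] for p in points]   # start: each point alone
    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge closest pair
        del clusters[j]
    return clusters

clusters = agglomerative([0.5, 1.0, 1.4, 8.0, 8.4, 9.0], n_clusters=2)
print(clusters)
```

Recording the order and height of the merges is what produces the dendrogram; cutting the tree at a chosen height yields the final clusters.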