Chapter 8
Unsupervised learning
3.1. Introduction
Unsupervised learning is a branch of machine learning where models analyze “unlabeled data” to
discover hidden patterns, intrinsic structures, or relationships without predefined outputs. Unlike
supervised learning, there are no "correct answers" provided during training.
Objectives
Clustering: Group similar data points together without labels (e.g., customer segmentation).
Dimensionality Reduction: Simplify data while retaining critical information (e.g., compressing images).
Association Rule Learning: Identify relationships between variables (e.g., "customers who buy X also buy Y").
Density Estimation: Model the underlying probability distribution of data (e.g., anomaly detection).
Example: K-means minimizes the sum of squared distances between points and their cluster centroids.
Algorithmic Principles
Clustering
Centroid-based (K-means): Iteratively refines cluster centers to minimize within-cluster variance.
Hierarchical (Agglomerative/Divisive): Builds nested clusters using linkage criteria (e.g., Ward’s method).
Density-based (DBSCAN): Groups points in dense regions and marks outliers as noise.
Dimensionality Reduction
PCA: Maximizes variance in orthogonal directions (principal components).
Autoencoders: Neural networks trained to reconstruct input data through a compressed bottleneck layer (see the sketch below).
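To make the bottleneck idea concrete, here is a minimal sketch of a fully connected autoencoder in PyTorch; the layer sizes, training loop, and random placeholder data are illustrative assumptions, not examples from this chapter.
```python
# Minimal autoencoder sketch (assumes PyTorch; sizes and data are placeholders).
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features=20, n_bottleneck=2):
        super().__init__()
        # Encoder compresses the input into a low-dimensional bottleneck.
        self.encoder = nn.Sequential(nn.Linear(n_features, 8), nn.ReLU(),
                                     nn.Linear(8, n_bottleneck))
        # Decoder reconstructs the input from the bottleneck code.
        self.decoder = nn.Sequential(nn.Linear(n_bottleneck, 8), nn.ReLU(),
                                     nn.Linear(8, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
X = torch.randn(100, 20)              # placeholder data for illustration
for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)       # reconstruction error
    loss.backward()
    optimizer.step()
codes = model.encoder(X)              # compressed 2-D representation
```
The reconstruction error minimized here is the same quantity listed later under model evaluation for autoencoders.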
Association Rule Learning
Apriori Algorithm: Uses frequent itemsets (e.g., "if {bread, butter}, then {jam}") and support/confidence thresholds.
Generative Modeling
GANs (Generative Adversarial Networks) & VAEs (Variational Autoencoders): Learn the data distribution to generate new samples.
Practical Principles
Data Preprocessing
Normalization/Scaling: Critical for distance-based algorithms (e.g., K-means).
Handling Missing Data: Imputation or removal of incomplete samples.
Model Evaluation
Clustering: Metrics like silhouette score, Davies-Bouldin index, or visual inspection.
Dimensionality Reduction: Explained variance ratio (PCA) or reconstruction error (autoencoders).
For deeper insights, explore how these principles are applied in cutting-edge areas like “self-supervised learning” (using pretext tasks to generate pseudo-labels) or “contrastive learning” (learning embeddings by contrasting similar/dissimilar pairs).
3.3.1. K-Nearest Neighbors (KNN)
Predicts the label (or value) of a new sample from the labels of its “k” closest points in the training data.
Main Procedure
Distance Metric: Uses Euclidean, Manhattan, or cosine similarity to find nearest neighbors.
Lazy Learning: No explicit training phase; stores all training data.
Hyperparameter: “k” (number of neighbors to consider).
Use Cases
Classifying customer churn (yes/no).
Predicting house prices based on similar properties.
Recommendation systems ("users like you also bought...").
Example
```python
from sklearn.neighbors import KNeighborsClassifier
# Example: Classification with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
```
3.3.2. K-means
Divides data into “k” clusters by iteratively minimizing the sum of squared distances between
points and cluster centroids.
Main Procedure
Centroid-Based: Iteratively updates cluster centers (means).
Hard Clustering: Each point belongs to exactly one cluster.
Hyperparameter: “k” (number of clusters).
Use Cases
Customer segmentation.
Image compression (reducing color palette).
Anomaly detection (outliers far from centroids).
Example
```python
from sklearn.cluster import KMeans
# Cluster data into 3 groups
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
labels = kmeans.labels_
```
3.3.3. Hierarchical Clustering
Builds a hierarchy of nested clusters that can be visualized as a dendrogram.
Important Procedure
Agglomerative (Bottom-Up): Merge closest clusters iteratively.
Divisive (Top-Down): Split one cluster into smaller ones.
Linkage Criteria: Single, complete, average, or Ward’s method.
Use Cases
Taxonomy of species in biology.
Document clustering (e.g., grouping news articles).
Social network community detection.
Example
```python
from sklearn.cluster import AgglomerativeClustering
# Agglomerative clustering with 2 clusters
model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(X)
```
Key Differences
| Aspect | KNN | K-Means | Hierarchical Clustering |
|---|---|---|---|
| Learning Type | Supervised (needs labels) | Unsupervised (no labels) | Unsupervised (no labels) |
| Objective | Prediction (classification/regression) | Grouping similar data points | Building cluster hierarchies |
| Output | Class labels/values | Cluster assignments | Dendrogram + cluster hierarchy |
| Hyperparameters | “k” (neighbors) | “k” (clusters) | Linkage method, distance threshold |
| Scalability | Slow for large datasets (O(n)) | Fast (O(n)) | Slow for large datasets (O(n²)) |
| Interpretability | Depends on neighbors | Centroid-based clusters | Dendrogram visualizations |
Conclusion
KNN is your go-to for prediction tasks with labeled data.
K-Means excels at fast, scalable clustering.
Hierarchical Clustering reveals nested structures but is computationally heavy.
Data Preprocessing
A. Handle Missing Values
Impute missing entries (e.g., with the mean) or remove incomplete samples.
```python
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
```
B. Normalize/Standardize Features
Use standardization (“Z-score”) or min-max scaling to ensure features are on the same
scale.
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
```
C. Reduce Dimensionality (Optional)
Apply “PCA” or “t-SNE” if dealing with high-dimensional data.
```python
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
```
Choosing the Right Algorithm
Select a clustering method based on your data’s structure; for example, DBSCAN handles arbitrarily shaped clusters and noise, whereas K-Means assumes compact, roughly spherical clusters.
```python
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
```
Determining the Optimal Number of Clusters
a) Elbow Method (K-Means)
Plot the **inertia** (sum of squared distances) vs. `k` and look for the "elbow."
```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()
```
b) Silhouette Score
Values range from “-1 to 1”; higher values indicate better-defined clusters.
```python
from sklearn.metrics import silhouette_score
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores.append(score)
```
c) Dendrogram (Hierarchical Clustering)
Visualize cluster merging distances to choose a cutoff.
```python
from scipy.cluster.hierarchy import dendrogram, linkage
linked = linkage(X_scaled, method='ward')
dendrogram(linked)
plt.show()
```
Evaluating Clustering Performance
i. Internal Metrics (No Ground Truth)
Silhouette Score: Cohesion vs. separation of clusters.
Davies-Bouldin Index: Lower values = better clustering.
Calinski-Harabasz Index: Higher values = dense, well-separated clusters.
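A minimal sketch of computing these three metrics with scikit-learn, assuming `X_scaled` and `labels` come from one of the clustering fits above (with at least two clusters):
```python
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Each metric compares within-cluster cohesion to between-cluster separation.
print("Silhouette:        ", silhouette_score(X_scaled, labels))
print("Davies-Bouldin:    ", davies_bouldin_score(X_scaled, labels))
print("Calinski-Harabasz: ", calinski_harabasz_score(X_scaled, labels))
```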
ii. External Metrics (With Ground Truth)
Adjusted Rand Index (ARI): Compares predicted vs. true labels (range: -1 to 1).
Normalized Mutual Information (NMI): Measures cluster-label similarity (0 to 1).
```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
ari = adjusted_rand_score(true_labels, predicted_labels)
nmi = normalized_mutual_info_score(true_labels, predicted_labels)
```
iii. Visual Evaluation
Use “PCA” or “t-SNE” to project clusters into 2D/3D for inspection.
```python
import seaborn as sns
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=labels, palette='viridis')
```
Post-Processing and Interpretation
a) Analyze Cluster Characteristics
Compute “mean/median values” of features per cluster.
Use domain knowledge to label clusters (e.g., "High-Income Customers").
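A minimal sketch of this per-cluster profiling with pandas, assuming `X` holds the original feature matrix and `labels` comes from a clustering fit above; `df_features` is a hypothetical name introduced here:
```python
import pandas as pd

# Hypothetical DataFrame of the original features, one row per sample.
df_features = pd.DataFrame(X)
df_features['cluster'] = labels
# Mean and median of every feature within each cluster.
profile = df_features.groupby('cluster').agg(['mean', 'median'])
print(profile)
```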
b) Handle Noise (DBSCAN)
Points labeled `-1` are outliers. Decide whether to exclude or analyze them separately.
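For instance, a small sketch of separating noise points from cluster members, assuming `labels` comes from the DBSCAN fit shown earlier:
```python
import numpy as np

labels = np.asarray(labels)
noise_mask = labels == -1            # DBSCAN marks outliers with label -1
X_noise = X_scaled[noise_mask]       # candidate outliers to inspect separately
X_clustered = X_scaled[~noise_mask]  # points assigned to an actual cluster
```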
c) Refine Features
Remove redundant features or engineer new ones based on cluster patterns.
Key Concepts
Itemset: A collection of items (e.g., {milk, bread}).
Support: The proportion of transactions containing an itemset.
\[
\text{Support}(X) = \frac{\text{Transactions containing } X}{\text{Total transactions}}
\]
Frequent Itemset: An itemset with support ≥ a user-defined threshold (“min_support”).
Association Rule: An implication \( X \rightarrow Y \), where \( X \) (antecedent) and \( Y
\) (consequent) are disjoint itemsets.
Confidence: The likelihood that \( Y \) is bought when \( X \) is bought.
\[
\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}
\]
Lift: Measures how much more likely \( Y \) is bought with \( X \) than by chance.
\[
\text{Lift}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X) \cdot \text{Support}(Y)}
\]
Lift > 1: Positive correlation.
Lift = 1: Independence.
Lift < 1: Negative correlation.
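To make these formulas concrete, the short sketch below computes support, confidence, and lift by hand for the toy basket data used in the next example; the transactions themselves are purely illustrative.
```python
# Hand-computing support, confidence, and lift for a toy set of transactions.
transactions = [{'milk', 'bread'}, {'milk', 'diapers'},
                {'bread', 'eggs'}, {'milk', 'bread', 'eggs'}]
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n   # fraction of transactions containing itemset

s_milk = support({'milk'})            # 3/4 = 0.75
s_bread = support({'bread'})          # 3/4 = 0.75
s_both = support({'milk', 'bread'})   # 2/4 = 0.50
confidence = s_both / s_milk          # 0.50 / 0.75 ≈ 0.67
lift = s_both / (s_milk * s_bread)    # 0.50 / 0.5625 ≈ 0.89
print(confidence, lift)
```
Here lift ≈ 0.89 < 1, so in this tiny dataset milk and bread co-occur slightly less often than independence would predict.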
Implementation in Python
Using the `mlxtend` library
```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
# Sample data
transactions = [
    ['milk', 'bread'],
    ['milk', 'diapers'],
    ['bread', 'eggs'],
    ['milk', 'bread', 'eggs']
]
# Encode transactions
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
# Frequent itemsets (min_support=0.5)
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
# Association rules (min_confidence=0.6)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```
Output (excerpt)
antecedents consequents  support  confidence  lift
{milk}      {bread}      0.5      0.67        0.89
{eggs}      {bread}      0.5      1.00        1.33
The Apriori algorithm is a foundational tool for discovering hidden associations in transactional
data. While it has scalability challenges, its simplicity and interpretability make it valuable for
tasks like market basket analysis and cross-selling strategies. Use metrics like lift and confidence
to filter actionable rules and complement results with domain expertise.
3.6.1. Markov Decision Processes (MDPs)
A Markov decision process models sequential decision-making: an agent interacts with an environment, collecting rewards as it moves between states.
Important Components
States (S): Possible situations the agent can be in (e.g., positions in a maze).
Actions (A): Moves the agent can take (e.g., "up," "down").
Transition Probability \( P(s'|s,a) \): Probability of moving to state \( s' \) from state \( s \)
after taking action \( a \).
Reward Function \( R(s,a,s') \): Immediate reward for transitioning from \( s \) to \( s' \)
via \( a \).
Discount Factor (\( \gamma \)): Reduces future rewards’ weight (0 ≤ γ < 1).
Policy (\( \pi \)): Strategy mapping states to actions (e.g., "always go left").
Objective
Find the optimal policy \( \pi \) that maximizes the “expected discounted return”:
\[
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
\]
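As a quick numeric check of this formula, the sketch below evaluates \( G_t \) for a short, made-up reward sequence with an assumed \( \gamma = 0.9 \):
```python
# Discounted return for an illustrative reward sequence (values are made up).
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]        # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)                            # 1.0 + 0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```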
Value Functions
State-Value Function (\( V^\pi(s) \)): Expected return starting from state \( s \) under
policy \( \pi \).
Action-Value Function (\( Q^\pi(s,a) \)): Expected return after taking action \( a \) in state
\( s \).
3.6.2. Monte Carlo (MC) Prediction
Monte Carlo methods learn value functions directly from “complete episodes” of experience
without requiring a model of the environment.
Algorithm Steps
Generate Episodes: Follow policy \( \pi \) to collect trajectories (e.g., \( s_0, a_0, R_1,
s_1, a_1, ..., s_T \)).
Calculate Returns: Compute \( G_t \) for each state/action in the episode.
Average Returns: Update \( V^\pi(s) \) or \( Q^\pi(s,a) \) as the mean of observed returns.
Important Features
Model-Free: No knowledge of \( P(s'|s,a) \) or \( R(s,a,s') \) required.
High Variance: Estimates depend on full trajectories, which can be noisy.
Episodic Tasks Only: Requires terminal states (e.g., winning a game).
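A minimal sketch of first-visit Monte Carlo prediction for \( V^\pi(s) \), following the steps above; the episode format, state names, and \( \gamma \) are illustrative assumptions, not tied to a particular environment.
```python
from collections import defaultdict

gamma = 0.9
returns = defaultdict(list)   # all first-visit returns observed per state
V = defaultdict(float)        # running value estimates

def mc_prediction(episodes):
    """Each episode is a list of (state, reward) pairs, i.e. (s_t, R_{t+1})."""
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G accumulates the discounted return.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            # First-visit MC: only record G if s_t does not appear earlier.
            if state not in (s for s, _ in episode[:t]):
                returns[state].append(G)
                V[state] = sum(returns[state]) / len(returns[state])
    return V

episodes = [[('A', 0.0), ('B', 1.0), ('C', 5.0)]]   # one toy episode
print(mc_prediction(episodes))   # V ≈ {'C': 5.0, 'B': 5.5, 'A': 4.95}
```
With many episodes, averaging these first-visit returns converges to \( V^\pi(s) \); replacing states with (state, action) pairs gives the analogous estimate of \( Q^\pi(s,a) \).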