
Chapter 3

Unsupervised learning
3.1. Introduction
Unsupervised learning is a branch of machine learning where models analyze “unlabeled data” to
discover hidden patterns, intrinsic structures, or relationships without predefined outputs. Unlike
supervised learning, there are no "correct answers" provided during training.

Objectives

Clustering: Group similar data points (e.g., customer segmentation).

Dimensionality Reduction: Simplify data while retaining critical information (e.g., compressing
images).

Association Rule Learning: Identify relationships between variables (e.g., "customers who buy
X also buy Y").

Density Estimation: Model the underlying probability distribution of data (e.g., anomaly
detection).

Common Algorithms for Clustering


K-means: Partitions data into “k” clusters.
Hierarchical Clustering: Builds nested clusters using a tree structure.
DBSCAN: Groups data based on density.
Gaussian Mixture Models (GMM): Probabilistic clustering using Gaussian distributions.

Unsupervised learning excels at “exploratory data analysis”, preprocessing, and uncovering latent patterns. While challenging due to the lack of labels, it powers critical applications from marketing to anomaly detection. Future advancements aim to bridge it with supervised and reinforcement learning for more adaptive AI systems.

3.2. Understand the Principles of Unsupervised Learning Models


Understanding the principles of unsupervised learning models involves grasping the foundational
concepts that drive how these models learn patterns, structures, or relationships from “unlabeled
data”. Below is a structured breakdown of the core principles:
Foundational Assumptions
Unsupervised learning relies on key assumptions about the nature of data:
 Inherent Structure: Data contains hidden patterns or groupings (e.g., clusters, hierarchies, or manifolds).
 Similarity Principle: Data points that are "similar" (based on metrics like distance or density) belong to the same group or structure.
 Redundancy Reduction: High-dimensional data can often be represented in fewer dimensions without losing critical information.

Core Mathematical Principles


A. Distance and Similarity Metrics
Algorithms like “K-means” and “hierarchical clustering” rely on distance measures (e.g.,
Euclidean, Manhattan, or cosine similarity) to quantify similarity between data points.

Example: K-means minimizes the sum of squared distances between points and their cluster
centroids.
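
For instance, these metrics can be computed directly with NumPy; a minimal sketch using two hypothetical 2-D points:

```python
import numpy as np

# Hypothetical 2-D points used only for illustration
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: straight-line distance between a and b
euclidean = np.linalg.norm(a - b)       # 5.0

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))       # 7.0

# Cosine similarity: angle-based similarity, independent of magnitude
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, cosine_sim)
```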

B. Probability and Density Estimation


Models like “Gaussian Mixture Models (GMMs)” assume data is generated from a mixture
of probability distributions.
Density-based algorithms (e.g., DBSCAN) identify clusters as regions of high data density.
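
A minimal sketch of probabilistic (soft) clustering with scikit-learn's GaussianMixture, assuming a numeric feature matrix X is already loaded:

```python
from sklearn.mixture import GaussianMixture

# Fit a mixture of 3 Gaussians (the number of components is an assumption)
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)

hard_labels = gmm.predict(X)        # most likely component for each point
soft_labels = gmm.predict_proba(X)  # membership probabilities per component
```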
C. Linear Algebra
Principal Component Analysis (PCA) uses eigenvectors and eigenvalues to project data
into a lower-dimensional subspace while preserving variance.
Singular Value Decomposition (SVD) is used in dimensionality reduction and matrix
factorization.
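
A short sketch of the PCA/SVD connection, assuming X is an (n_samples, n_features) NumPy array: centering the data and taking its SVD yields the principal directions and the variance each one explains.

```python
import numpy as np

# Center the (assumed) data matrix X
X_centered = X - X.mean(axis=0)

# SVD of the centered data: rows of Vt are the principal directions (eigenvectors
# of the covariance matrix); singular values encode the variance along each one.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt[:2]                           # top-2 principal components
X_reduced = X_centered @ components.T         # 2-D projection
explained_variance = (S ** 2) / (len(X) - 1)  # eigenvalues of the covariance matrix
```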
D. Information Theory
Techniques like “t-SNE” (t-Distributed Stochastic Neighbor Embedding) minimize the
divergence between probability distributions in high and low dimensions to visualize data.
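
A minimal visualization sketch with scikit-learn's TSNE, assuming a feature matrix X (the perplexity value is just an illustrative setting):

```python
from sklearn.manifold import TSNE

# Embed high-dimensional X into 2-D for visualization; t-SNE preserves local neighborhoods
X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
```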
E. Manifold Hypothesis
Assumes high-dimensional data lies on a lower-dimensional manifold (e.g., a curved surface
in 3D space). Methods like “UMAP” or “Isomap” exploit this for non-linear dimensionality
reduction.

Algorithmic Principles
Clustering
 Centroid-based (K-means): Iteratively refines cluster centers to minimize within-cluster variance.
 Hierarchical (Agglomerative/Divisive): Builds nested clusters using linkage criteria (e.g., Ward’s method).
 Density-based (DBSCAN): Groups points in dense regions and marks outliers as noise.
Dimensionality Reduction
 PCA: Maximizes variance in orthogonal directions (principal components).
 Autoencoders: Neural networks trained to reconstruct input data through a compressed bottleneck layer.
Association Rule Learning
 Apriori Algorithm: Uses frequent itemsets (e.g., "if {bread, butter}, then {jam}") and support/confidence thresholds.
Generative Modeling
 GANs (Generative Adversarial Networks) & VAEs (Variational Autoencoders): Learn the data distribution to generate new samples.

Practical Principles
Data Preprocessing
 Normalization/Scaling: Critical for distance-based algorithms (e.g., K-means).
 Handling Missing Data: Imputation or removal of incomplete samples.
Model Evaluation
 Clustering: Metrics like silhouette score, Davies-Bouldin index, or visual inspection.
 Dimensionality Reduction: Explained variance ratio (PCA) or reconstruction error (autoencoders).

Challenges and Trade-offs


 Curse of Dimensionality: Distance metrics become less informative as the number of features grows.
 Local vs. Global Structure: Methods like t-SNE preserve local neighborhoods but can distort global geometry.
 Scalability: Some algorithms (e.g., hierarchical clustering) become expensive on large datasets.

For deeper insights, explore how these principles are applied in cutting-edge areas like “self-supervised learning” (using pretext tasks to generate pseudo-labels) or “contrastive learning” (learning embeddings by contrasting similar/dissimilar pairs).

3.3. Clustering Approaches


Clustering is a core technique in unsupervised learning that groups data points into “clusters”
based on their inherent similarities, without prior knowledge of class labels. It is widely used for
exploratory data analysis, pattern discovery, and preprocessing. Below is a detailed breakdown
of clustering approaches:

3.3.1. K-Nearest Neighbors (KNN)


 Type: “Supervised” learning algorithm (classification/regression), included here to contrast with the clustering methods that follow.
 Goal: Predict the label/value of a new data point from the majority (classification) or average (regression) of its “k” closest neighbors.

Main Procedure
 Distance Metric: Uses Euclidean, Manhattan, or cosine similarity to find nearest neighbors.
 Lazy Learning: No explicit training phase; stores all training data.
 Hyperparameter: “k” (number of neighbors to consider).

Use Cases
 Classifying customer churn (yes/no).
 Predicting house prices based on similar properties.
 Recommendation systems ("users like you also bought...").

Example
```python
from sklearn.neighbors import KNeighborsClassifier
# Example: Classification with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
```
3.3.2. K-means
Divides data into “k” clusters by iteratively minimizing the sum of squared distances between
points and cluster centroids.

Type: “Unsupervised” learning algorithm (clustering).
Goal: Partition data into “k” clusters by minimizing intra-cluster variance.

Main Procedure
 Centroid-Based: Iteratively updates cluster centers (means).
 Hard Clustering: Each point belongs to exactly one cluster.
 Hyperparameter: “k” (number of clusters).

Use Cases
 Customer segmentation.
 Image compression (reducing color palette).
 Anomaly detection (outliers far from centroids).

Example
```python
from sklearn.cluster import KMeans
# Cluster data into 3 groups
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
labels = kmeans.labels_
```

3.3.3. Hierarchical Clustering


Type: “Unsupervised” learning algorithm (clustering).
Goal: Build a tree-like hierarchy of clusters (dendrogram) without predefining “k”.

Main Procedure
 Agglomerative (Bottom-Up): Merge closest clusters iteratively.
 Divisive (Top-Down): Split one cluster into smaller ones.
 Linkage Criteria: Single, complete, average, or Ward’s method.

Use Cases
 Taxonomy of species in biology.
 Document clustering (e.g., grouping news articles).
 Social network community detection.

Example
```python
from sklearn.cluster import AgglomerativeClustering
# Agglomerative clustering with 2 clusters
model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(X)
```
Key Differences
| Aspect | KNN | K-Means | Hierarchical Clustering |
|---|---|---|---|
| Learning Type | Supervised (needs labels) | Unsupervised (no labels) | Unsupervised (no labels) |
| Objective | Prediction (classification/regression) | Grouping similar data points | Building cluster hierarchies |
| Output | Class labels/values | Cluster assignments | Dendrogram + cluster hierarchy |
| Hyperparameters | “k” (neighbors) | “k” (clusters) | Linkage method, distance threshold |
| Scalability | Slow for large datasets (O(n) per query) | Fast (roughly O(n)) | Slow for large datasets (O(n²)) |
| Interpretability | Depends on neighbors | Centroid-based clusters | Dendrogram visualizations |

Conclusion
KNN is your go-to for prediction tasks with labeled data.
K-Means excels at fast, scalable clustering.
Hierarchical Clustering reveals nested structures but is computationally heavy.

3.4. Correctly Apply and Evaluate Clustering Models


Clustering is an unsupervised learning task that groups data points based on inherent similarities.
Below is a structured approach to “apply and evaluate clustering models effectively”, along with
best practices and common pitfalls to avoid.

Preprocessing the Data


Clustering algorithms are sensitive to scale and noise. Follow these steps:

A. Handle Missing Values


 Remove or impute missing data (e.g., using mean/median imputation).

```python
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
```
B. Normalize/Standardize Features
 Use standardization (“Z-score”) or min-max scaling to ensure features are on the same
scale.

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
```
C. Reduce Dimensionality (Optional)
 Apply “PCA” or “t-SNE” if dealing with high-dimensional data.

```python
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
```
Choosing the Right Algorithm
Select a clustering method based on your data’s structure:

| Algorithm | Best For | Key Hyperparameters |
|---|---|---|
| K-Means | Spherical, similarly sized clusters | “n_clusters” |
| DBSCAN | Arbitrary shapes, noisy data | “eps”, “min_samples” |
| Hierarchical | Hierarchical relationships | “n_clusters”, “linkage” |
| Gaussian Mixture | Probabilistic soft clustering | “n_components” (clusters) |

Example: For noisy data with irregular clusters:

```python
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
```
Determining the Optimal Number of Clusters
a) Elbow Method (K-Means)
 Plot the **inertia** (sum of squared distances) vs. `k` and look for the "elbow."

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()
```
b) Silhouette Score
 Values range from “-1 to 1”; higher values indicate better-defined clusters.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores.append(score)
```
c) Dendrogram (Hierarchical Clustering)
 Visualize cluster merging distances to choose a cutoff.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

linked = linkage(X_scaled, method='ward')
dendrogram(linked)
plt.show()
```
Evaluating Clustering Performance
i. Internal Metrics (No Ground Truth)
 Silhouette Score: Cohesion vs. separation of clusters.
 Davies-Bouldin Index: Lower values = better clustering.
 Calinski-Harabasz Index: Higher values = dense, well-separated clusters.
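
All three internal scores are available in scikit-learn; a minimal sketch, assuming X_scaled and the cluster labels from a fitted model:

```python
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Higher silhouette and Calinski-Harabasz are better; lower Davies-Bouldin is better
print("Silhouette:        ", silhouette_score(X_scaled, labels))
print("Davies-Bouldin:    ", davies_bouldin_score(X_scaled, labels))
print("Calinski-Harabasz: ", calinski_harabasz_score(X_scaled, labels))
```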
ii. External Metrics (With Ground Truth)
 Adjusted Rand Index (ARI): Compares predicted vs. true labels (range: -1 to 1).
 Normalized Mutual Information (NMI): Measures cluster-label similarity (0 to 1).

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
ari = adjusted_rand_score(true_labels, predicted_labels)
nmi = normalized_mutual_info_score(true_labels, predicted_labels)
```
iii. Visual Evaluation
 Use “PCA” or “t-SNE” to project clusters into 2D/3D for inspection.

```python
import seaborn as sns
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=labels, palette='viridis')
```
Post-Processing and Interpretation
a) Analyze Cluster Characteristics
 Compute “mean/median values” of features per cluster.
 Use domain knowledge to label clusters (e.g., "High-Income Customers").
b) Handle Noise (DBSCAN)
 Points labeled `-1` are outliers. Decide whether to exclude or analyze them separately.
c) Refine Features
 Remove redundant features or engineer new ones based on cluster patterns.

Clustering success hinges on “preprocessing”, “algorithm selection”, and “rigorous evaluation”.


Always:
 Validate clusters using multiple metrics.
 Interpret results with domain context.
 Iterate by tuning hyperparameters or trying different algorithms.

3.5. Association Rule Learning


Overview
Association Rule Learning is an unsupervised technique used to uncover relationships between
variables in large datasets. The “Apriori algorithm” is a classic method for mining frequent
itemsets and generating association rules, widely applied in market basket analysis,
recommendation systems, and bioinformatics.

Key Concepts
 Itemset: A collection of items (e.g., {milk, bread}).
 Support: The proportion of transactions containing an itemset.

\[
\text{Support}(X) = \frac{\text{Transactions containing } X}{\text{Total transactions}}
\]
 Frequent Itemset: An itemset with support ≥ a user-defined threshold (“min_support”).
 Association Rule: An implication \( X \rightarrow Y \), where \( X \) (antecedent) and \( Y
\) (consequent) are disjoint itemsets.
 Confidence: The likelihood that \( Y \) is bought when \( X \) is bought.

\[
\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}
\]
 Lift: Measures how much more likely \( Y \) is bought with \( X \) than by chance.

\[
\text{Lift}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X) \cdot \text{Support}(Y)}
\]
 Lift > 1: Positive correlation.
 Lift = 1: Independence.
 Lift < 1: Negative correlation.

3.5.1. Apriori Algorithm Steps


 Principle: “If an itemset is infrequent, all its supersets are infrequent (prunes search
space).”
A. Generate Frequent Itemsets
 Level-wise search: Start with 1-itemsets, then iteratively generate larger itemsets.
 Join Step: Create candidate \( k \)-itemsets by joining frequent \((k-1)\)-itemsets.
 Prune Step: Remove candidates with infrequent subsets.
 Support Check: Retain itemsets meeting “min_support”.
B. Generate Association Rules
 Split frequent itemsets into antecedent (\(X\)) and consequent (\(Y\)).
 Compute confidence and lift for each rule \( X \rightarrow Y \).
 Retain rules meeting “min_confidence” and desired lift.

Example: Market Basket Analysis


Dataset

| Transaction | Items |
|---|---|
| T1 | {milk, bread} |
| T2 | {milk, diapers} |
| T3 | {bread, eggs} |
| T4 | {milk, bread, eggs} |

Step 1: “Find Frequent Itemsets” (min_support = 0.5)


1-itemsets
{milk}: 3/4 = 0.75
{bread}: 3/4 = 0.75
{eggs}: 2/4 = 0.5
{diapers}: 1/4 = 0.25 (pruned)
2-itemsets
{milk, bread}: 2/4 = 0.5
{milk, eggs}: 1/4 = 0.25 (pruned)
{bread, eggs}: 2/4 = 0.5
Step 2: “Generate Rules” (min_confidence = 0.6)

Rule: {milk} → {bread}
Confidence = Support({milk, bread}) / Support({milk}) = 0.5 / 0.75 ≈ 0.67
Lift = Support({milk, bread}) / (Support({milk}) × Support({bread})) = 0.5 / (0.75 × 0.75) ≈ 0.89

Rule: {eggs} → {bread}
Confidence = Support({bread, eggs}) / Support({eggs}) = 0.5 / 0.5 = 1.0
Lift = Support({bread, eggs}) / (Support({eggs}) × Support({bread})) = 0.5 / (0.5 × 0.75) ≈ 1.33

Both rules (along with {bread} → {milk} and {bread} → {eggs}) meet the confidence threshold, but only the rules involving eggs have lift > 1; {milk} → {bread} has lift < 1 because bread is common on its own, which is why lift is checked alongside confidence.
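
These figures can be verified by direct counting; a small sketch over the four transactions above (the support helper is just for illustration):

```python
# Transactions from the worked example above
transactions = [
    {'milk', 'bread'},
    {'milk', 'diapers'},
    {'bread', 'eggs'},
    {'milk', 'bread', 'eggs'},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

confidence = support({'milk', 'bread'}) / support({'milk'})                   # 0.5 / 0.75 ≈ 0.67
lift = support({'milk', 'bread'}) / (support({'milk'}) * support({'bread'}))  # ≈ 0.89
print(confidence, lift)
```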

Implementation in Python
Using the `mlxtend` library
```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Sample data: one list of items per transaction
transactions = [
    ['milk', 'bread'],
    ['milk', 'diapers'],
    ['bread', 'eggs'],
    ['milk', 'bread', 'eggs'],
]

# One-hot encode transactions into a boolean DataFrame
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Frequent itemsets (min_support=0.5)
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

# Association rules (min_confidence=0.6)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```
Output (rounded)

| antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|
| {milk} | {bread} | 0.5 | 0.67 | 0.89 |
| {bread} | {milk} | 0.5 | 0.67 | 0.89 |
| {bread} | {eggs} | 0.5 | 0.67 | 1.33 |
| {eggs} | {bread} | 0.5 | 1.00 | 1.33 |

The Apriori algorithm is a foundational tool for discovering hidden associations in transactional
data. While it has scalability challenges, its simplicity and interpretability make it valuable for
tasks like market basket analysis and cross-selling strategies. Use metrics like lift and confidence
to filter actionable rules and complement results with domain expertise.

3.6. Reinforcement learning


Reinforcement Learning (RL) is a machine learning paradigm where agents learn to make
decisions by interacting with an environment to maximize cumulative rewards. Below, we
explore its foundational concepts, “Markov Decision Processes (MDPs)” and “Monte Carlo (MC)
prediction”, and illustrate them with a case study.

3.6.1. Markov Decision Processes (MDPs)


MDPs provide a mathematical framework for modeling sequential decision-making problems.

Important Components
 States (S): Possible situations the agent can be in (e.g., positions in a maze).
 Actions (A): Moves the agent can take (e.g., "up," "down").
 Transition Probability \( P(s'|s,a) \): Probability of moving to state \( s' \) from state \( s \)
after taking action \( a \).
 Reward Function \( R(s,a,s') \): Immediate reward for transitioning from \( s \) to \( s' \)
via \( a \).
 Discount Factor (\( \gamma \)): Reduces future rewards’ weight (0 ≤ γ < 1).
 Policy (\( \pi \)): Strategy mapping states to actions (e.g., "always go left").

Objective
 Find the optimal policy \( \pi \) that maximizes the “expected discounted return”

\[
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
\]
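
A tiny numeric sketch of this return, using a hypothetical reward sequence and γ = 0.9:

```python
# Hypothetical rewards R_{t+1}, R_{t+2}, ... observed after time t
rewards = [1.0, 0.0, 2.0, 5.0]
gamma = 0.9

# G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
G_t = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G_t)  # 1 + 0 + 0.81*2 + 0.729*5 = 6.265
```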
Value Functions
 State-Value Function (\( V^\pi(s) \)): Expected return starting from state \( s \) under
policy \( \pi \).
 Action-Value Function (\( Q^\pi(s,a) \)): Expected return after taking action \( a \) in state
\( s \).
3.6.2. Monte Carlo (MC) Prediction
Monte Carlo methods learn value functions directly from “complete episodes” of experience
without requiring a model of the environment.

Algorithm Steps
 Generate Episodes: Follow policy \( \pi \) to collect trajectories (e.g., \( s_0, a_0, R_1,
s_1, a_1, ..., s_T \)).
 Calculate Returns: Compute \( G_t \) for each state/action in the episode.
 Average Returns: Update \( V^\pi(s) \) or \( Q^\pi(s,a) \) as the mean of observed returns.

Example: First-Visit MC Prediction


 For each state \( s \), average returns only from the first time \( s \) is visited in an episode.

Important Features
 Model-Free: No knowledge of \( P(s'|s,a) \) or \( R(s,a,s') \) required.
 High Variance: Estimates depend on full trajectories, which can be noisy.
 Episodic Tasks Only: Requires terminal states (e.g., winning a game).
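
A compact sketch of first-visit MC prediction for state values, assuming each episode is a list of (state, reward) pairs produced by some policy (the environment and episode data here are hypothetical):

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=0.9):
    """Estimate V(s) by averaging the first-visit return of each state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:                 # episode = [(s0, R1), (s1, R2), ...]
        G = 0.0
        first_visit_return = {}
        # Walk backwards so G accumulates the discounted return from each step;
        # earlier (first) visits overwrite later ones.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_visit_return[state] = G
        for state, G_first in first_visit_return.items():
            returns_sum[state] += G_first
            returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Hypothetical usage: one short episode over states 'A' and 'B'
episodes = [[('A', 1.0), ('B', 0.0), ('A', 2.0)]]
print(first_visit_mc_prediction(episodes))   # {'A': 2.62, 'B': 1.8}
```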
