A1388404476 - 64039 - 23 - 2023 - Machine Learning II
SEMESTER- VII
COURSE OUTCOMES
Upon successful completion of this course, students will be able to:
CO1 Evaluate the effectiveness of machine learning models and select appropriate techniques for model
selection, regularization, and hyperparameter tuning using cross-validation.
CO2 Analyze and apply different linear regression techniques, including simple linear regression, multiple
linear regression, polynomial regression, and regularization, to model and make predictions on datasets.
CO3 Analyze and apply support vector machine (SVM) techniques, including the concept of a hyperplane,
maximal margin classifier, soft margin classifier, slack variables, and cost of misclassification, to classify
data in both two and three dimensions.
CO4 Evaluate and apply kernel methods, including feature transformation and the kernel trick, to map nonlinear
data to a linear feature space and build nonlinear models using support vector machines (SVM) in Python.
CO5 Evaluate and apply decision tree techniques, including building decision trees, measuring impurity and
feature importance, and choosing hyperparameters, to model and make predictions on both classification
and regression datasets in Python.
CO6 Evaluate and apply ensemble techniques, including random forests, to improve model performance and
feature importance, and apply these techniques to real-world datasets such as telecom churn prediction.
Introduction to Model Selection, Model and Learning Algorithm, Simplicity, Complexity and Overfitting, Bias-
Variance Tradeoff, Regularization and Hyperparameters, Model Evaluation and Cross Validation, Model Evaluation:
Python Demonstration-I, Model Evaluation: Python Demonstration-II, Cross-Validation: Motivation, Cross-Validation:
Python Demonstration, Cross-Validation: Hyperparameter Tuning
Linear Regression - Review, Estimating Coefficients in SLR, Matrix Representation for SLR, Estimating Coefficients
in MLR, Assumptions of Linear Regression, Multiple Linear Regression in Python, Identifying Nonlinearity in Data,
Polynomial Regression, Data Transformation, Nonlinear Regression, Linear Regression Pitfalls, Regularization -
Introduction, Ridge Regression, Ridge Regression - Python Implementation, Lasso Regression, Regularization - Python
Demo, Geometrical Representation of Ridge and Lasso
Introduction to Kernels, Mapping Non-Linear Data to Linear Data, Feature Transformation, The Kernel Trick,
Building Non-Linear Models in Python, Shiny Apps - Types of Kernels, Choosing a Kernel Function, Letter
Recognition using SVM
Introduction to Decision Trees, Interpreting a Decision Tree, Building Decision Trees, Comprehension - Decision Tree
Classification in Python, Tree Models over Linear Models, Splitting and Homogeneity, Impurity Measures,
Comprehension: The GINI Index, Feature Importance in Decision Trees, Disadvantages of Decision Trees, Tree
Truncation, Building Decision Trees in Python, Choosing Tree Hyperparameters in Python, Comprehension -
Hyperparameters, Decision Tree Regression, Decision Tree Regression in Python
Ensembles, Comprehension - Ensembles, Introduction to Random Forests, Comprehension - OOB (Out-of-Bag) Error,
Feature Importance in Random Forests, Random Forests in Python, Random Forest Regression in Python, Telecom
Churn Prediction.
PRACTICAL LIST
For the following three questions, use the Boston Housing dataset provided here.
Unit 1 | Lab 1
How can we determine the optimal complexity of a model to prevent overfitting while maintaining good performance
on the test dataset?
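One way to answer this is to sweep a single complexity knob, here the polynomial degree, and watch where test error starts rising while training error keeps falling. A minimal sketch in Python, assuming the dataset sits in a local boston.csv with the target in a column named MEDV (both names are placeholders):

# Compare train vs. test error as model complexity (degree) grows.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

df = pd.read_csv("boston.csv")                  # placeholder path
X, y = df.drop(columns="MEDV"), df["MEDV"]      # placeholder target name
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

for degree in range(1, 4):
    model = make_pipeline(StandardScaler(), PolynomialFeatures(degree),
                          LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # train MSE keeps falling
          mean_squared_error(y_te, model.predict(X_te)))   # test MSE bottoms out

The degree at which test MSE stops improving marks the complexity beyond which the model starts to overfit.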
Unit 1 | Lab 2
How do different regularization techniques, such as L1 and L2 regularization, affect the bias-variance tradeoff in a
model, and how can we select the optimal regularization hyperparameters for a given dataset?
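A minimal sketch of one approach, reusing X and y from Lab 1: fit ridge (L2) and lasso (L1) pipelines and let cross-validation pick alpha, the regularization strength.

# Select the regularization hyperparameter alpha by 5-fold CV.
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

for name, est in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10000))]:
    pipe = Pipeline([("scale", StandardScaler()), ("reg", est)])
    grid = GridSearchCV(pipe, {"reg__alpha": [0.01, 0.1, 1, 10, 100]},
                        cv=5, scoring="neg_mean_squared_error")
    grid.fit(X, y)
    print(name, grid.best_params_, -grid.best_score_)

Increasing alpha adds bias and removes variance; L1 regularization additionally drives some coefficients exactly to zero, which makes the two tradeoffs easy to compare.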
Unit 1 | Lab 3
How does cross-validation help us evaluate the generalization performance of a model, and how can we use cross-
validation to tune hyperparameters for a given model?
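A minimal sketch, again assuming X and y are loaded: k-fold CV averages the score over k held-out folds, and running it at several hyperparameter values turns it into a tuning procedure.

# Estimate generalization performance and tune alpha with the same folds.
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

folds = KFold(n_splits=5, shuffle=True, random_state=42)
for alpha in [0.1, 1, 10]:
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=folds, scoring="r2")
    print(alpha, scores.mean(), scores.std())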
Unit 2 | Lab 1
Assumptions:
1. Linear relationship between target and independent variables: The response variable y should be linearly
related to the explanatory variables X.
2. No or Little multicollinearity between independent variables: Linear regression assumes that there is little or no
multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated
with each other.
3. Residual errors must be normally distributed: The residual errors should be normally distributed.
4. Residual errors must be homoscedastic: The residual errors should have constant variance; otherwise, they
are said to be heteroscedastic. A sketch of quick diagnostic checks for these assumptions follows.
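A minimal diagnostic sketch in Python, assuming X and y are already loaded as pandas objects:

# Fit OLS and inspect the assumptions: multicollinearity, residual
# normality, linearity, and homoscedasticity.
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

Xc = sm.add_constant(X)
res = sm.OLS(y, Xc).fit()

# Assumption 2: VIF per feature (rule of thumb: VIF > 5 is high)
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, variance_inflation_factor(Xc.values, i))

# Assumption 3: points on a Q-Q plot of residuals should hug the line
sm.qqplot(res.resid, line="45", fit=True)

# Assumptions 1 and 4: residuals vs. fitted values should form a
# patternless, constant-width band around zero
plt.figure()
plt.scatter(res.fittedvalues, res.resid)
plt.axhline(0, color="red")
plt.xlabel("fitted values"); plt.ylabel("residuals")
plt.show()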
Unit 2 | Lab 2
Advanced Regression
A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses
data analytics to purchase houses at a price below their actual value and sell them at a higher price. For this
purpose, the company has collected a data set from the sale of houses in Australia. The data is provided in the CSV
file below.
The company is looking at prospective properties to buy to enter the market. You are required to build a regression
model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest
in them or not.
The company wants to know which variables are significant in predicting the price of a house and how well those
variables describe the price. Also, determine the optimal value of lambda for ridge and lasso regression.
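A minimal sketch for the lambda search, assuming the housing CSV has been cleaned into a feature matrix X and target y (scikit-learn calls lambda `alpha`):

# RidgeCV and LassoCV search a grid of lambdas by cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X, y)
print("optimal ridge lambda:", ridge.alpha_)
print("optimal lasso lambda:", lasso.alpha_)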
Unit 2 | Lab 3
Regression
Stock Price Prediction
Develop a stock price predictor using multiple linear regression, then take it further with Lasso and Ridge
regression models, and test it on the Tesla stock (2010 to 2020) dataset from Kaggle.
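One possible framing, assuming the Kaggle file is TSLA.csv with Date and Close columns (placeholder names): predict the next close from lagged closes, keeping the time order intact when splitting.

# Lag-feature regression on Tesla closing prices.
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

df = pd.read_csv("TSLA.csv", parse_dates=["Date"]).sort_values("Date")
for lag in range(1, 6):                       # five lagged-close features
    df[f"lag_{lag}"] = df["Close"].shift(lag)
df = df.dropna()

X = df[[f"lag_{lag}" for lag in range(1, 6)]]
y = df["Close"]
split = int(len(df) * 0.8)                    # no shuffling: time series
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, mean_squared_error(y_te, model.predict(X_te)))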
For the following two questions, make use of the Iris data set provided here.
Unit 3 | Lab 1
How does the concept of a hyperplane in SVMs help us classify data points in higher-dimensional spaces, and how can
we visualize this process using tools such as 3D plots or decision boundary plots?
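A minimal visualization sketch on two Iris features, where the separating hyperplane reduces to a line and the decision regions can be drawn by evaluating the classifier on a grid:

# Linear SVM decision regions on the first two Iris features.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data[:, :2], iris.target          # sepal length, sepal width

clf = SVC(kernel="linear").fit(X, y)

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)            # decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.xlabel(iris.feature_names[0]); plt.ylabel(iris.feature_names[1])
plt.show()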
Unit 3 | Lab 2
How can we use SVMs to classify data points that are not linearly separable, and what are some common kernel
functions that can be used to transform the data into a higher-dimensional space where linear separation is possible?
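A quick self-contained illustration (on synthetic concentric circles rather than Iris, since the effect is starker there): a linear kernel fails, while RBF and polynomial kernels implicitly map the data to a space where it is separable.

# Kernel comparison on data that is not linearly separable in 2-D.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
for kernel in ("linear", "poly", "rbf"):
    print(kernel, cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean())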
Unit 3 | Lab 3
Dataset
Problem Description
Predict Heart Disease using the concepts of support vector machines based on given attributes.
● 0 - NO HEART DISEASE
● 1 - HEART DISEASE
Attribute Information:
1. age
2. sex (1 = male; 0 = female)
3. chest pain type (4 values): 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0, 1, 2): 0 = normal; 1 = ST-T wave abnormality (T wave
inversions and/or ST elevation or depression of > 0.05 mV); 2 = probable or definite left ventricular
hypertrophy by Estes' criteria
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) coloured by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
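A minimal modelling sketch, assuming the data is in a local heart.csv whose label column is named target (placeholder names); scaling matters because SVMs are sensitive to feature ranges.

# Scaled RBF SVM for heart-disease prediction.
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("heart.csv")                     # placeholder path
X, y = df.drop(columns="target"), df["target"]    # 1 = disease, 0 = none
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))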
For the following two questions, you could use the USPS dataset, which contains images of handwritten
digits (0-9) and can be used for tasks such as classification and recognition.
Unit 4 | Lab 1
How can we use different types of kernel functions (e.g., linear, polynomial, radial basis function) to transform non-
linear data into a linearly separable form for classification using SVMs, and how do we choose the appropriate kernel
function for a given dataset?
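In practice the kernel is usually chosen empirically. A minimal sketch, assuming USPS pixel features X and digit labels y are already loaded: put each kernel and its parameters into one grid and let cross-validation decide.

# Let GridSearchCV choose both the kernel and its parameters.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [1, 10]},
    {"kernel": ["rbf"], "gamma": ["scale", 0.01], "C": [1, 10]},
]
grid = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
print(grid.best_params_, grid.best_score_)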
Unit 4 | Lab 2
How can we apply SVMs with kernel methods to the task of letter recognition, and what are some challenges and
limitations of this approach?
Unit 4 | Lab 3
Description
The Pima Indians Diabetes dataset contains observations on various health-related attributes, such as plasma glucose
concentration and body mass index (BMI). Each row contains a patient’s attributes and whether they had diabetes.
Now, build a linear SVM model with cost C = 1 to predict whether a given patient has diabetes.
After you train the model, use the test data to make predictions. The test data can be accessed here.
/data/test/diabetes_test.csv
Write the predictions to the file given below, carefully noting the names of the columns in the required format.
/code/output/diabetes_predictions.csv
Datasets
● Training dataset
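A minimal end-to-end sketch; the training path and the label/output column names below are assumptions, so adjust them to the dataset actually provided.

# Train a linear SVM (C = 1) and write predictions in the required file.
import pandas as pd
from sklearn.svm import SVC

train = pd.read_csv("/data/train/diabetes_train.csv")   # assumed path
test = pd.read_csv("/data/test/diabetes_test.csv")

X_tr, y_tr = train.drop(columns="diabetes"), train["diabetes"]  # assumed label column
clf = SVC(kernel="linear", C=1).fit(X_tr, y_tr)

out = pd.DataFrame({"diabetes": clf.predict(test)})     # assumed output column
out.to_csv("/code/output/diabetes_predictions.csv", index=False)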
Unit 4 | Lab 4
Description
You have already built a linear SVM model on the Pima Indians Diabetes dataset, which contains observations on
various health-related attributes, such as plasma glucose concentration and Body Mass Index (BMI).
In this question, you will find the optimal value of the hyperparameter ‘C’ using GridSearchCV() and then build a
linear SVM model to predict whether a given patient has diabetes.
To find the optimal value of ‘C’, you can plot training and test accuracy versus ‘C’ using matplotlib (the code is already
written; you will see the plot displayed below the coding console).
After you train the model, use the test data to make predictions. The test data can be accessed here.
/data/test/diabetes_test.csv
Carefully note the names of the columns in the format provided in the dataset given below.
Datasets
● Training dataset
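A minimal sketch of the plot-then-search workflow, reusing X_tr and y_tr from the previous lab:

# Plot train/validation accuracy against C, then confirm with GridSearchCV.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import GridSearchCV, validation_curve
from sklearn.svm import SVC

Cs = np.logspace(-2, 2, 10)
train_sc, val_sc = validation_curve(SVC(kernel="linear"), X_tr, y_tr,
                                    param_name="C", param_range=Cs, cv=5)
plt.semilogx(Cs, train_sc.mean(axis=1), label="train")
plt.semilogx(Cs, val_sc.mean(axis=1), label="validation")
plt.xlabel("C"); plt.ylabel("accuracy"); plt.legend(); plt.show()

grid = GridSearchCV(SVC(kernel="linear"), {"C": Cs}, cv=5).fit(X_tr, y_tr)
print("best C:", grid.best_params_["C"])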
Unit 4 | Lab 5
You are required to develop a support vector machine model that correctly classifies handwritten digits from 0–9
based on the pixel values given as features. Thus, this is a 10-class classification problem.
You can download the dataset from Kaggle here. You can use train.csv to train the model and test.csv to evaluate
the results.
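A minimal sketch, assuming Kaggle's train.csv has a label column and 784 pixel columns; SVC handles the 10 classes internally via one-vs-one.

# Multiclass SVM on handwritten digits.
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

train = pd.read_csv("train.csv")
X = train.drop(columns="label") / 255.0       # scale pixels to [0, 1]
y = train["label"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=42)

# Training on the full set is slow; subsample (e.g., the first 5000 rows)
# while experimenting.
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))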
For the upcoming two questions, you could use the Boston Housing dataset, which contains information
about various housing features such as crime rate, number of rooms, and median value. You could use
decision trees to model the relationship between these features and the target variable (median value), and
explore feature importance and hyperparameter tuning to optimize the model's performance.
Unit 5 | Lab 2
How can we build a decision tree classifier using Python, and how can we interpret and visualize the resulting tree to
gain insights into the underlying data and decision-making process?
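A minimal sketch of fitting and visualizing a shallow tree (shown on Iris as a stand-in for any classification dataset):

# Fit a depth-limited tree and plot it to read off the split logic.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

plt.figure(figsize=(12, 6))
plot_tree(clf, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.show()
print(dict(zip(iris.feature_names, clf.feature_importances_)))

Each node of the plot shows the splitting feature and threshold, the impurity, and the class distribution, so the path from root to leaf reads as a decision rule.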
Unit 5 | Lab 3
How can we use decision tree regression to model a given dataset and make predictions on new data, and what are some
strategies for tuning hyperparameters such as the maximum tree depth and minimum sample split size?
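A minimal tuning sketch for a regression tree on the Boston features X and target y loaded earlier:

# Tune tree depth and minimum split size by cross-validation.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    {"max_depth": [3, 5, 8, None],
                     "min_samples_split": [2, 10, 50]},
                    cv=5, scoring="neg_mean_squared_error").fit(X, y)
print(grid.best_params_, -grid.best_score_)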
Unit 6 | Lab 1
Problem Statement
Predict the median value of owner-occupied homes using a decision tree regression model.
Unit 6 | Lab 2
Problem Statement
● Identify the variables affecting house prices, such as the area and the number of rooms and bathrooms,
● Create a linear model that quantitatively relates house prices to these variables, and
● Identify the variables that significantly contribute towards predicting house prices (see the sketch below).
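A minimal OLS sketch; the file name and the column names below are assumptions. The fitted summary reports a coefficient (the quantitative relation) and a p-value (the significance) for every variable.

# Fit OLS and read significance off the summary table.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("housing.csv")                              # assumed path
X = sm.add_constant(df[["area", "bedrooms", "bathrooms"]])   # assumed columns
model = sm.OLS(df["price"], X).fit()                         # assumed target
print(model.summary())   # low p-values mark the significant variables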
Problem Statement
A telecom company has all its clients' data. The main types of attributes are demographics (e.g., age and gender),
services availed (e.g., internet packs purchased and special offers taken), and expenses (e.g., monthly recharge
amounts).
Based on all this data, you want to construct a model that predicts whether a customer would churn, i.e., switch service
providers. So, the target variable is ‘Churn’, which tells us whether a customer has churned. 1 signifies that the customer
has churned, while 0 means they haven’t.
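A minimal random-forest sketch, assuming the cleaned data sits in churn.csv with the target in a Churn column; the out-of-bag score and the impurity-based importances tie directly into the Unit 6 topics.

# Random forest with OOB error and feature importances.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("churn.csv")                   # assumed path
X, y = df.drop(columns="Churn"), df["Churn"]    # 1 = churned, 0 = retained

rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            random_state=42).fit(X, y)
print("OOB accuracy:", rf.oob_score_)
print(pd.Series(rf.feature_importances_, index=X.columns)
        .sort_values(ascending=False).head(10))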