
A1388404476 - 64039 - 23 - 2023 - Machine Learning II

This syllabus outlines a Machine Learning II course that covers the following topics:
1. Model selection and evaluation techniques like cross-validation and regularization
2. Advanced regression methods including linear, polynomial, and regularized regression
3. Support vector machines for classification, including kernels and nonlinear models
4. Decision trees and random forests for classification and regression

It includes 6 units that provide theory and Python demonstrations for each major topic, as well as practical labs applying the techniques to datasets. The goal is for students to learn how to apply, evaluate, and tune machine learning models for problems like housing price prediction, churn prediction, and disease classification.

Uploaded by

raj241299
Copyright
© All Rights Reserved

SYLLABUS: MACHINE LEARNING-II

SEMESTER- VII

COURSE OUTCOMES
Upon successful completion of this course, students will be able to:

CO1 Evaluate the effectiveness of machine learning models and select appropriate techniques for model
selection, regularization, and hyperparameter tuning using cross-validation.

CO2 Analyze and apply different linear regression techniques, including simple linear regression, multiple
linear regression, polynomial regression, and regularization, to model and make predictions on datasets.

CO3 Analyze and apply support vector machine (SVM) techniques, including the concept of a hyperplane,
maximal margin classifier, soft margin classifier, slack variables, and cost of misclassification, to classify
data in both two and three dimensions.

CO4 Evaluate and apply kernel methods, including feature transformation and the kernel trick, to map nonlinear
data to a linear feature space and build nonlinear models using support vector machines (SVM) in Python.

CO5 Evaluate and apply decision tree techniques, including building decision trees, measuring impurity and
feature importance, and choosing hyperparameters, to model and make predictions on both classification
and regression datasets in Python.

CO6 Evaluate and apply ensemble techniques, including random forests, to improve model performance and
feature importance, and apply these techniques to real-world datasets such as telecom churn prediction.

UNIT-WISE CONTENT SEGREGATION

Unit 1 Module 1: Model Selection and Evaluation

Introduction to Model Selection, Model and Learning Algorithm, Simplicity, Complexity and Overfitting, Bias-
Variance Tradeoff, Regularization and Hyperparameters, Model Evaluation and Cross Validation, Model Evaluation:
Python Demonstration-I, Model Evaluation: Python Demonstration-II, Cross-Validation: Motivation, Cross-Validation:
Python Demonstration, Cross-Validation: Hyperparameter Tuning

Unit 2 Module 2: Advanced Regression

Linear Regression - Review, Estimating Coefficients in SLR, Matrix Representation for SLR, Estimating Coefficients
in MLR, Assumptions of Linear Regression, Multiple Linear Regression in Python, Identifying Nonlinearity in Data,
Polynomial Regression, Data Transformation, Nonlinear Regression, Linear Regression Pitfalls, Regularization -
Introduction, Ridge Regression, Ridge Regression - Python Implementation, Lasso Regression, Regularization - Python
Demo, Geometrical Representation of Ridge and Lasso

Unit 3 Module 3: Support Vector Machines - I


Introduction to SVM, Concept of a Hyperplane in 2D, Concept of a Hyperplane in 3D, Maximal Margin Classifier, The
Soft Margin Classifier, The Slack Variable, Comprehension 1: Notion of Slack Variables, Cost of Misclassification,
SVM Python Labs

Unit 4 Module 3: Support Vector Machines - II

Introduction to Kernels, Mapping Non-Linear Data to Linear Data, Feature Transformation, The Kernel Trick,
Building Non-Linear Models in Python, Shiny Apps - Types of Kernels, Choosing a Kernel Function, Letter
Recognition using SVM

Unit 5 Module 4: Tree Models - I

Introduction to Decision Trees, Interpreting a Decision Tree, Building Decision Trees, Comprehension - Decision Tree
Classification in Python, Tree Models over Linear Models, Splitting and Homogeneity, Impurity Measures,
Comprehension: The GINI Index, Feature Importance in Decision Trees, Disadvantages of Decision Trees, Tree
Truncation, Building Decision Trees in Python, Choosing Tree Hyperparameters in Python, Comprehension -
Hyperparameters, Decision Tree Regression, Decision Tree Regression in Python

Unit 6 Module 4: Tree Models - II

Ensembles, Comprehension - Ensembles, Introduction to Random Forests, Comprehension - OOB (Out-of-Bag) Error,
Feature Importance in Random Forests, Random Forests in Python, Random Forest Regression in Python, Telecom
Churn Prediction.

PRACTICAL LIST

Unit 1: Model Selection and Evaluation

For the following three questions, use the Boston Housing dataset provided here.

Unit 1 | Lab 1

How can we determine the optimal complexity of a model to prevent overfitting while maintaining good performance
on the test dataset?
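
One way to approach this question is to vary a single complexity knob and compare training and test error. The sketch below is only an illustration: it uses synthetic one-dimensional data instead of the Boston Housing data, and polynomial degree as the complexity measure; both choices are assumptions, not part of the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a noisy sine curve stands in for the lab dataset.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit polynomial models of increasing degree and record both errors.
results = {}
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (train_mse, test_mse)

# The degree with the lowest held-out error is a reasonable complexity choice.
best_degree = min(results, key=lambda d: results[d][1])
for degree, (train_mse, test_mse) in results.items():
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
print("degree with lowest test MSE:", best_degree)
```

Training error keeps falling as the degree grows, while test error typically stops improving and then rises; the gap between the two is the usual sign of overfitting.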

Unit 1 | Lab 2

How do different regularization techniques, such as L1 and L2 regularization, affect the bias-variance tradeoff in a
model, and how can we select the optimal regularization hyperparameters for a given dataset?
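
The contrast between L1 and L2 penalties can be seen by fitting both over a range of penalty strengths. The following is a minimal sketch on synthetic data from `make_regression` (an assumption; the lab uses the Boston Housing data). Note that scikit-learn calls the penalty strength `alpha`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 30 features, only 5 of which carry signal.
X, y = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Penalised models are sensitive to feature scale, so standardise first.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

rows = []
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X_train_s, y_train)
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_train_s, y_train)
    rows.append({
        "alpha": alpha,
        "ridge_r2": ridge.score(X_test_s, y_test),
        "lasso_r2": lasso.score(X_test_s, y_test),
        "ridge_nonzero": int(np.count_nonzero(ridge.coef_)),
        "lasso_nonzero": int(np.count_nonzero(lasso.coef_)),
    })

for row in rows:
    print(row)
```

Ridge (L2) shrinks all coefficients but keeps them non-zero, whereas Lasso (L1) drives more coefficients to exactly zero as `alpha` grows. In both cases a larger penalty lowers variance at the cost of higher bias.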

Unit 1 | Lab 3

How does cross-validation help us evaluate the generalization performance of a model, and how can we use cross-
validation to tune hyperparameters for a given model?
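
As a rough illustration of this lab question, the sketch below scores a ridge model with 5-fold cross-validation for several penalty values and keeps the best one. The synthetic data, the candidate `alphas` list and the choice of ridge regression are all assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption) standing in for the lab dataset.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=8, noise=15.0, random_state=0
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

cv_scores = {}
for alpha in alphas:
    # Scaling sits inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    cv_scores[alpha] = scores.mean()

best_alpha = max(cv_scores, key=cv_scores.get)
print("mean R^2 per alpha:", cv_scores)
print("alpha with best cross-validated R^2:", best_alpha)
```

Because every observation is used for validation exactly once, the averaged score is a steadier estimate of generalization performance than a single train/test split.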

Unit 2: Advanced Regression


Unit 2 | Lab 1

Linear Regression Analysis: Validating Assumptions


Validate the assumptions of linear regression using the dataset provided here.

Assumptions:
1. Linear relationship between target and independent variables: The response variable y should be linearly
related to the explanatory variables X.
2. Little or no multicollinearity between independent variables: Multicollinearity occurs when the independent
variables are too highly correlated with each other.
3. Residual errors must be normally distributed: The residuals (observed minus fitted values) should follow an
approximately normal distribution.
4. Residual errors must be homoscedastic: The residual errors should have constant variance. Non-constant
variance is known as heteroscedasticity.
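
The checks listed above can be sketched numerically. The example below uses synthetic data with two deliberately correlated predictors (an assumption; the lab dataset is not reproduced here). The helper `vif` and the summary statistics chosen are illustrative, not a prescribed procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): x2 is built to be highly correlated with x1.
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 - 1.0 * x3 + rng.normal(scale=0.5, size=n)


def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2) when column j
    is regressed on the remaining columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)


# Assumption 2: multicollinearity. Large VIF values flag correlated predictors.
vifs = [vif(X, j) for j in range(X.shape[1])]

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Assumption 3: normality. Skewness and excess kurtosis near 0 are consistent
# with normally distributed residuals.
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3.0

# Assumption 4: homoscedasticity. A strong correlation between |residual| and
# the fitted value suggests the variance changes with the prediction.
spread_corr = np.corrcoef(np.abs(residuals), fitted)[0, 1]

print("VIFs:", np.round(vifs, 2))
print("skewness:", round(skewness, 3), "excess kurtosis:", round(excess_kurtosis, 3))
print("corr(|residual|, fitted):", round(spread_corr, 3))
```

Assumption 1 (linearity) is usually checked visually, for example with a plot of residuals against fitted values, which should show no systematic curve.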

Unit 2 | Lab 2

Advanced Regression

A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses
data analytics to purchase houses at a price below their actual value and flip them at a higher price. For this
purpose, the company has collected a data set from the sale of houses in Australia. The data is provided in the CSV file
below.

The company is looking at prospective properties to buy to enter the market. You are required to build a regression
model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest
in them or not.

The company wants to know the following things about the prospective properties:

● Which variables are significant in predicting the price of a house, and
● How well those variables describe the price of a house.

Also, determine the optimal value of lambda for ridge and lasso regression.
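
A possible way to find the optimal lambda is cross-validation over a grid of candidate values. The sketch below uses scikit-learn's `RidgeCV` and `LassoCV`, where lambda is named `alpha`. The data is synthetic (an assumption; the Surprise Housing CSV is not reproduced here), and the grid `alphas` is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption) in place of the housing CSV.
X, y = make_regression(
    n_samples=300, n_features=25, n_informative=6, noise=20.0, random_state=0
)
# For brevity the scaler is fit on all rows; a pipeline would avoid leaking
# scaling statistics across cross-validation folds.
X_scaled = StandardScaler().fit_transform(X)

alphas = np.logspace(-3, 3, 25)  # candidate lambda values

# RidgeCV with the default cv=None uses an efficient leave-one-out scheme.
ridge = RidgeCV(alphas=alphas).fit(X_scaled, y)

# LassoCV evaluates every alpha with 5-fold cross-validation.
lasso = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X_scaled, y)

# Features whose Lasso coefficient is non-zero are the ones the model retains.
selected = np.flatnonzero(lasso.coef_)

print("optimal ridge alpha:", ridge.alpha_)
print("optimal lasso alpha:", lasso.alpha_)
print("features kept by lasso:", selected)
```

The non-zero Lasso coefficients give one answer to "which variables are significant", and the cross-validated or held-out R² indicates how well those variables describe the price.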

Note: You can download the dataset from the platform.

Unit 2 | Lab 3

Regression
Stock Price Prediction

Use multiple linear regression to develop a stock price predictor, then take it further with Lasso and Ridge
regression models. Test the models on the Tesla stock dataset (2010 to 2020) from Kaggle.

Unit 3: Support Vector Machines - I

For the following two questions, make use of the Iris data set provided here.

Unit 3 | Lab 1

How does the concept of a hyperplane in SVMs help us classify data points in higher-dimensional spaces, and how can
we visualize this process using tools such as 3D plots or decision boundary plots?
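
To make the hyperplane concrete, the sketch below fits a linear SVM on two Iris classes and two features, then reads off the hyperplane parameters w and b. Restricting the data to setosa versus versicolor and to sepal length and petal length is an assumption made so that the boundary is a line that could be drawn on a 2D plot.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()

# Keep two classes (setosa, versicolor) and two features for a 2D picture.
mask = iris.target < 2
X = iris.data[mask][:, [0, 2]]  # sepal length, petal length
y = iris.target[mask]

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The separating hyperplane is the set of points where w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]

# Width of the margin between the two supporting hyperplanes.
margin_width = 2.0 / np.linalg.norm(w)

# The sign of w . x + b decides the predicted class.
manual_decision = X @ w + b

print("w =", w, "b =", b)
print("margin width =", round(margin_width, 3))
print("number of support vectors:", len(clf.support_vectors_))
```

In 2D the boundary can be plotted as the line x2 = -(w[0] * x1 + b) / w[1]; with three features the same equation describes a plane suitable for a 3D plot.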

Unit 3 | Lab 2

How can we use SVMs to classify data points that are not linearly separable, and what are some common kernel
functions that can be used to transform the data into a higher-dimensional space where linear separation is possible?
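
The idea of lifting data into a higher-dimensional space can be shown with an explicit feature map. The sketch below uses concentric circles from `make_circles` (an assumption; the lab itself uses the Iris data) and adds the squared radius as a third feature, then compares this with an RBF kernel, which achieves a similar effect without building the features by hand.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data (assumption): two concentric rings, not linearly separable in 2D.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

# A linear SVM on the raw 2D coordinates.
acc_linear_raw = SVC(kernel="linear").fit(X, y).score(X, y)

# Explicit feature transformation: append x1^2 + x2^2 as a third dimension.
X_lifted = np.column_stack([X, (X ** 2).sum(axis=1)])
acc_linear_lifted = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

# Kernel trick: the RBF kernel works with an implicit feature space.
acc_rbf = SVC(kernel="rbf").fit(X, y).score(X, y)

print("linear SVM, raw features:   ", round(acc_linear_raw, 3))
print("linear SVM, lifted features:", round(acc_linear_lifted, 3))
print("RBF-kernel SVM, raw features:", round(acc_rbf, 3))
```

These are training accuracies, reported only to show separability; a held-out set or cross-validation would be needed to judge generalization.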

Unit 3 | Lab 3

Support Vector Machines

Dataset

Problem Description

Predict Heart Disease using the concepts of support vector machines based on given attributes.

● 0 - NO HEART DISEASE
● 1 - HEART DISEASE

Attribute Information:

1. age
2. sex (1 = male; 0 = female)
3. chest pain type (4 values)
   Value 1: typical angina
   Value 2: atypical angina
   Value 3: non-anginal pain
   Value 4: asymptomatic
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0, 1, 2)
   Value 0: normal
   Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
   Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) coloured by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

Unit 4: Support Vector Machines - II

For the following two questions, you could use the USPS dataset, which contains images of handwritten
digits (0-9) and can be used for tasks such as classification and recognition.

Unit 4 | Lab 1

How can we use different types of kernel functions (e.g., linear, polynomial, radial basis function) to transform non-
linear data into a linearly separable form for classification using SVMs, and how do we choose the appropriate kernel
function for a given dataset?
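
One common way to choose a kernel is to treat it as a hyperparameter and compare candidates by cross-validation. The sketch below does this with `GridSearchCV` on the two-moons toy data (an assumption; the lab suggests the USPS digits). The grid values are arbitrary illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumption): two interleaving half-circles.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

pipeline = make_pipeline(StandardScaler(), SVC())

# One sub-grid per kernel, each with the parameters that kernel actually uses.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["poly"], "svc__degree": [2, 3], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__gamma": ["scale", 0.1, 1], "svc__C": [0.1, 1, 10]},
]

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_kernel = search.best_params_["svc__kernel"]
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

If the linear kernel scores about as well as the others, it is usually preferred because it is cheaper to train and easier to interpret.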

Unit 4 | Lab 2

How can we apply SVMs with kernel methods to the task of letter recognition, and what are some challenges and
limitations of this approach?

Unit 4 | Lab 3

SVM – Linear Model – Diabetes Dataset

Description

The Pima Indians Diabetes dataset contains observations on various health-related attributes, such as plasma glucose
concentration and body mass index (BMI). Each row contains a patient’s attributes and whether they had diabetes.

Now, build a linear SVM model using cost, C = 1, to predict whether a given patient has diabetes.

A sample of the training data is shown below.

The training data is provided here:


/data/training/diabetes_train.csv

After you train the model, use the test data to make predictions. The test data can be accessed here.
/data/test/diabetes_test.csv
Write the predictions to the file given below, in the format provided. Carefully note the names of the columns.
/code/output/diabetes_predictions.csv

Your model’s accuracy will be evaluated on an unseen test dataset.

Datasets

● Training dataset

Unit 4 | Lab 4

SVM Hyperparameter Tuning – Diabetes Data

Description

You have already built a linear SVM model on the Pima Indians Diabetes dataset, which contains observations on
various health-related attributes, such as plasma glucose concentration and Body Mass Index (BMI).

Recall that you used C = 1 while building the model.

In this question, you will find the optimal value of the hyperparameter ‘C’ using GridSearchCV() and then build a
linear SVM model to predict whether a given patient has diabetes.
To find the optimal value of ‘C’, you can plot training and test accuracy versus ‘C’ using matplotlib (the code is already
written; you will see the plot displayed below the coding console).
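
A minimal sketch of the tuning step is shown below. It runs `GridSearchCV()` over candidate values of C for a linear SVM and collects the mean training and validation accuracy for each, which are the quantities one would plot against C. The data comes from `make_classification` (an assumption; the diabetes CSV files are not reproduced here), and the candidate values of C are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary data (assumption) with 8 features.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))
param_grid = {"svc__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring="accuracy", return_train_score=True
)
search.fit(X_train, y_train)

# Per-C averages across folds; "mean_test_score" is the validation-fold score.
c_values = param_grid["svc__C"]
mean_train = search.cv_results_["mean_train_score"]
mean_valid = search.cv_results_["mean_test_score"]
for c, tr, va in zip(c_values, mean_train, mean_valid):
    print(f"C={c:<6} train acc={tr:.3f}  validation acc={va:.3f}")

best_c = search.best_params_["svc__C"]
test_accuracy = search.score(X_test, y_test)  # uses the refit best model
print("best C:", best_c, " held-out accuracy:", round(test_accuracy, 3))
```

Plotting `mean_train` and `mean_valid` against `c_values` on a logarithmic x-axis with matplotlib gives the training-versus-test accuracy curve described above.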

A sample of the training data is shown below:

The training data is provided here:


/data/training/diabetes_train.csv

After you train the model, use the test data to make predictions. The test data can be accessed here.
/data/test/diabetes_test.csv

You have to write the prediction in the file given below.


/code/output/diabetes_predictions.csv

Note the names of the columns carefully in the format provided in the dataset given below.

Your model's accuracy will be evaluated on an unseen test dataset.

Datasets

● Training dataset
Unit 4 | Lab 5

You are required to develop a model using a support vector machine that correctly classifies handwritten digits
from 0–9 based on the pixel values given as features. Thus, this is a 10-class classification problem.

For this problem, we use the MNIST data, which is a large database of handwritten digits. The ‘pixel values’ of each
digit (image) comprise the features, and the actual number between 0–9 is the label.

Each image has 28 x 28 pixels. Each pixel is a feature, so there are 784 features in each image. MNIST digit
recognition is a well-studied problem in the machine learning community, and people have trained numerous models
(such as neural networks, SVMs and boosted trees), achieving error rates as low as 0.23% (i.e., accuracy = 99.77%,
with a convolutional neural network).

However, before the popularity of neural networks, models like SVMs and boosted trees were state-of-the-art in such
problems.

In this assignment, experiment with various hyperparameters in SVMs and observe the highest accuracy you can get.
With a sub-sample of 10%–20% of the training data (see note below), you should expect more than 90% accuracy.

Note: Since the training dataset is quite large (42,000 labelled images), it would take a lot of time to train an SVM
on the full MNIST data. So, you can sub-sample the data for training (10%–20% of the data should be enough to achieve
decent accuracy). It may also take hours to run a GridSearchCV() if you use a large value of k (fold-CV), such as 10,
and a wide range of hyperparameters; k = 5 should be adequate.

You can download the dataset from Kaggle here. You can use train.csv to train the model and test.csv to evaluate the
results.
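
The workflow described in the note (sub-sample, then run a 5-fold grid search) might look like the sketch below. To stay self-contained it uses scikit-learn's small built-in `load_digits` data (8 x 8 images, 64 features) as a stand-in for the Kaggle MNIST files; that substitution, the grid values and the 20% sub-sample are all assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): 8x8 digit images instead of 28x28 MNIST.
digits = load_digits()
X = digits.data / 16.0  # pixel intensities in this dataset range from 0 to 16
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Sub-sample 20% of the training split to keep the grid search fast.
X_sub, _, y_sub, _ = train_test_split(
    X_train, y_train, train_size=0.2, stratify=y_train, random_state=0
)

# Small grid over the two main RBF-SVM hyperparameters, with k = 5 folds.
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_sub, y_sub)

test_accuracy = search.score(X_test, y_test)
print("training images used:", len(X_sub))
print("best parameters:", search.best_params_)
print("held-out accuracy:", round(test_accuracy, 3))
```

On the real MNIST data the same structure applies after reading train.csv, separating the label column from the pixel columns, and dividing pixel values by 255.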

Unit 5: Tree Models - I


Unit 5 | Lab 1

Decision Tree Classifier

Build a Decision Tree Classifier to predict the safety of the car.

You can find the dataset here.

For the upcoming two questions, you could use the Boston Housing dataset, which contains information
about various housing features such as crime rate, number of rooms, and median value. You could use
decision trees to model the relationship between these features and the target variable (median value), and
explore feature importance and hyperparameter tuning to optimize the model's performance.

Unit 5 | Lab 2

How can we build a decision tree classifier using Python, and how can we interpret and visualize the resulting tree to
gain insights into the underlying data and decision-making process?
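
A small sketch of building and reading a decision tree classifier follows. Since a classifier needs a categorical target, it uses the built-in Iris data rather than a housing dataset (an assumption), and limits the depth to 3 so that the printed rules stay readable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree is easier to interpret; max_depth=3 is an illustrative choice.
clf = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=0)
clf.fit(iris.data, iris.target)

# Text rendering of the learned splits, one rule per line.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Impurity-based importance of each feature (values sum to 1).
importances = dict(zip(iris.feature_names, clf.feature_importances_))
for name, value in importances.items():
    print(f"{name}: {value:.3f}")
```

For a graphical view, `sklearn.tree.plot_tree(clf)` draws the same tree with matplotlib, showing the Gini impurity and class counts at each node.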

Unit 5 | Lab 3

How can we use decision tree regression to model a given dataset and make predictions on new data, and what are some
strategies for tuning hyperparameters such as the maximum tree depth and minimum sample split size?
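
The tuning strategy mentioned here can be sketched as a cross-validated grid search over the maximum depth and minimum split size. The example uses the synthetic `make_friedman1` regression problem (an assumption, in place of a housing dataset), and the grid values are illustrative.

```python
from sklearn.datasets import make_friedman1
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear regression data (assumption).
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Both parameters limit tree growth and therefore control overfitting.
param_grid = {
    "max_depth": [2, 4, 6, 8, None],
    "min_samples_split": [2, 10, 20, 50],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# Evaluate the refit best tree on data it has not seen.
test_r2 = search.best_estimator_.score(X_test, y_test)
new_predictions = search.predict(X_test[:5])

print("best parameters:", search.best_params_)
print("held-out R^2:", round(test_r2, 3))
print("predictions for 5 new rows:", new_predictions.round(2))
```

A fully grown tree (`max_depth=None`, `min_samples_split=2`) tends to memorise the training data, so the search often settles on a more restricted setting.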

Unit 6: Tree Models - II

Unit 6 | Lab 1

Decision Tree Regression Model

Problem Statement

Predict the median value of owner-occupied homes with the help of the Decision Tree Regression Model.

You can find the dataset here.

Unit 6 | Lab 2

Decision Tree Regression in Python

Problem Statement

● Identify the variables affecting house prices, such as the area and the number of rooms and bathrooms,
● Create a linear model that quantitatively relates house prices with variables, such as the area and the number of
rooms and bathrooms, and
● Know the variables that significantly contribute towards predicting house prices.

Use the ‘Housing Data Set’ provided in the platform.


Unit 6 | Lab 3

Telecom Churn Prediction using Random Forest

Problem Statement

A telecom company has all its clients’ data. The main types of attributes are as follows:

● Demographics (age, gender, etc.)
● Services used (internet packs purchased, special offers taken, etc.)
● Expenditure (amount of recharge done per month, etc.)

Based on all this data, you want to construct a model that predicts whether a customer would churn, i.e., switch service
providers. So, the target variable is ‘Churn’, which tells us whether a customer has churned. 1 signifies that the customer
has churned, while 0 means they haven’t.
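
A hedged sketch of the modelling step is given below. It trains a random forest on synthetic, imbalanced binary data from `make_classification` (an assumption; the telecom data lives on the platform), with roughly 20% of rows labelled 1 to mimic churners. The `feature_*` names are placeholders, not real column names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in (assumption): about 20% of customers churn (label 1).
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    weights=[0.8, 0.2],
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# oob_score=True scores each tree on the rows left out of its bootstrap sample.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

oob_accuracy = rf.oob_score_

# Rank features by impurity-based importance, highest first.
ranked = sorted(
    zip(feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

print("OOB accuracy:", round(oob_accuracy, 3))
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```

With imbalanced churn data, accuracy alone can look good even when few churners are caught, so recall or ROC AUC on the churn class is usually worth reporting as well.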

You can download the data sets from the platform.
