
COL 774: Assignment 3

Due Date: 11:50 pm, April 5, 2019. Total Points: (35 + 35)

Notes:

• This assignment has two implementation questions.


• You should submit all your code (including any pre-processing scripts written by you) and any graphs that
you might plot.
• Do not submit the datasets. Do not submit any code that we have provided to you for processing.

• Include a single write-up (pdf) file which includes a brief description for each question explaining what
you did. Include any observations and/or plots required by the question in this single write-up file.
• You should use MATLAB/Python for the neural network question (Question 2). For the decision tree implementation (Question 1), you are also free to use C++/Java.

• Your code should have appropriate documentation for readability.


• You will be graded based on what you have submitted as well as your ability to explain your code.
• Refer to the course website for assignment submission instructions.
• This assignment is supposed to be done individually. You should carry out all the implementation by yourself.

• We plan to run Moss on the submissions. We will also include submissions from previous years since some of
the questions may be repeated. Any cheating will result in a zero on the assignment, a penalty of -10 points
and possibly much stricter penalties (including a fail grade and/or a DISCO).

1. (35 points) Decision Trees (and Random Forests): In this problem, you will work with the Default
of Credit Card Clients dataset available on the UCI repository. Read about the dataset in detail from the
link given above. For the purpose of this assignment, the dataset has been processed and split into separate
training, validation and testing sets. The training set consists of 18000 examples whereas the validation
and the test sets consist of 6000 examples each. The dataset consists of binary, categorical and continuous
attributes. The last entry in each row denotes the class label. You can read more about the attributes, in
the README included with the processed dataset. You have to implement the decision tree algorithm for
predicting whether a person defaulted or not based on various personal and economic attributes. You will
also experiment with random forests in the last part of this problem.
(a) (10 points) Construct a decision tree for the above prediction problem. Preprocess (before growing the
tree) each numerical (continuous) attribute into a Boolean attribute by a) computing the median value
of the attribute in the training data (make sure not to ignore the duplicates) b) replacing each numerical
value by a 1/0 value based on whether the value is greater than the median threshold or not. Note
that this process should be repeated for each attribute independently. For non-Boolean (categorical)
attributes, you should use a multi-way split as discussed in class. Use information gain as the criterion
for choosing the attribute to split on. In case of a tie, choose the attribute which appears first in the
ordering as given in the training data. Plot the train, validation and test set accuracies against the
number of nodes in the tree as you grow the tree. The X-axis should show the number of nodes in the
tree and the Y-axis the accuracy. Comment on your observations.
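As a point of reference, a minimal sketch of the two routines part (a) needs, median binarization and information gain, might look like the following; the function names and the NumPy array layout (features and labels held separately) are assumptions of this sketch, not part of the assignment.

import numpy as np

def binarize_continuous(X_train, X_other, cont_cols):
    """Threshold each continuous column at its training-set median (duplicates included)."""
    Xt, Xo = X_train.copy(), X_other.copy()
    for c in cont_cols:
        med = np.median(X_train[:, c])            # median from the training data only
        Xt[:, c] = (X_train[:, c] > med).astype(int)
        Xo[:, c] = (X_other[:, c] > med).astype(int)
    return Xt, Xo

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(X, y, col):
    """H(Y) minus the weighted child entropies of a multi-way split on column col."""
    gain = entropy(y)
    for v in np.unique(X[:, col]):
        mask = X[:, col] == v
        gain -= mask.mean() * entropy(y[mask])
    return gain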
(b) (5 points) One of the ways to reduce overfitting in decision trees is to grow the tree fully and then
use post-pruning based on a validation set. In post-pruning, we greedily prune the nodes of the tree
(and the sub-trees below them) by iteratively picking the node whose removal gives the maximum
increase in accuracy on the validation set. In other words, among all the nodes in the tree, we prune
the node such that pruning it (and the sub-tree below it) results in the maximum increase in accuracy over the
validation set. This is repeated until any further pruning leads to a decrease in accuracy over the validation
set. Post-prune the tree obtained in part (a) above using the validation set. Again plot the training,
validation and test set accuracies against the number of nodes in the tree as you successively prune the
tree. Comment on your findings.
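A minimal sketch of the greedy pruning loop follows, assuming a tree of Node objects with a children list (empty at leaves) and a pruned flag that makes a node predict its majority class; the accuracy function evaluating the whole tree is also assumed.

def internal_nodes(node):
    """Yield every unpruned non-leaf node in the subtree rooted at node."""
    if node.children and not node.pruned:
        yield node
        for child in node.children:
            yield from internal_nodes(child)

def post_prune(root, X_val, y_val, accuracy):
    while True:
        best_node, best_acc = None, accuracy(root, X_val, y_val)
        for node in list(internal_nodes(root)):
            node.pruned = True                  # tentatively turn the node into a leaf
            acc = accuracy(root, X_val, y_val)
            node.pruned = False
            if acc > best_acc:
                best_node, best_acc = node, acc
        if best_node is None:                   # no prune strictly improves validation accuracy
            return
        best_node.pruned = True                 # commit the best prune and repeat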
(c) (5 points) In part (a) we used the median value of a numerical attribute to convert it into a 1/0 valued
attribute as a pre-processing step. In a more sophisticated setting, no such pre-processing is done in
the beginning. At any given internal node of the tree, a numerical attribute is considered for a two
way split by calculating the median attribute value from the data instances coming to that node, and
then computing the information gain if the data was split based on whether the numerical value of the
attribute is greater than the median or not. As earlier, the node is split on the attribute which maximizes
the information gain. Note that in this setting, the original value of a numerical attribute remains intact,
and a numerical attribute can be considered for splitting in the tree multiple times. Implement this new
way of handling numerical attributes. Report the numerical attributes which are split multiple times
in a branch (along with the maximum number of times they are split in any given branch and the
corresponding thresholds). Replot the curves in part (a). Comment on which results (Part (a) or Part
(c)) are better and why. You don't have to implement pruning for this part.
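The per-node variant can reuse the entropy helper from the part (a) sketch; a hedged sketch of scoring a two-way median split at a node is below (the names are again my own).

import numpy as np

def best_median_split(X_node, y_node, col):
    """Information gain of splitting column col at the median of this node's instances."""
    med = np.median(X_node[:, col])
    left = X_node[:, col] <= med
    gain = entropy(y_node)                  # entropy() from the part (a) sketch
    for mask in (left, ~left):
        if mask.any():
            gain -= mask.mean() * entropy(y_node[mask])
    return gain, med                        # med becomes this node's threshold; the raw
                                            # attribute stays available deeper in the tree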
(d) (5 points) A number of libraries are available for decision tree implementation. Use the scikit-learn
library of Python to grow a decision tree. Refer to the scikit-learn documentation for the details of
various parameter options. Try growing different trees by playing around with parameter values. Some
parameters that you should specifically experiment with include min_samples_split, min_samples_leaf and
max_depth (feel free to vary other parameters as well). How does the validation set accuracy change
as you try various parameter settings? Comment. Find the setting of parameters which gives you the best
accuracy on the validation set. Report training, validation and test set accuracies for this parameter
setting. How do your numbers compare with those you obtained in parts (b) and (c) above?
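One illustrative way to run such a sweep with scikit-learn; the grid values below are arbitrary choices, and the loading of X_train, y_train, X_val, y_val, X_test, y_test is assumed.

from itertools import product
from sklearn.tree import DecisionTreeClassifier

best_clf, best_acc = None, 0.0
for mss, msl, md in product([2, 10, 50], [1, 5, 25], [4, 8, None]):
    clf = DecisionTreeClassifier(criterion="entropy", min_samples_split=mss,
                                 min_samples_leaf=msl, max_depth=md)
    clf.fit(X_train, y_train)
    acc = clf.score(X_val, y_val)       # select on the validation set
    if acc > best_acc:
        best_clf, best_acc = clf, acc
print(best_clf.score(X_train, y_train), best_acc, best_clf.score(X_test, y_test))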
(e) (5 points) The decision tree classifier in scikit-learn uses an optimised version of the CART algorithm
for the tree construction. However, the scikit implementation does not support categorical variables.
This makes the scikit decision tree treat the categorical variables as numerical (continuous) variables.
One way to get around this issue is to convert the categorical variables into multiple binary variables
using One-hot encoding. Transform the train, validation and the test sets by converting the categorical
variables into binary variables as described above. Retrain a decision tree classifier on the transformed
dataset with the parameter settings used in the previous part. Report training, validation and test set
accuracies for this parameter setting. How do your numbers compare with those you obtained in part
(b), (c) and (d) above?
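A sketch of the transformation using scikit-learn's OneHotEncoder; cat_cols, the list of categorical column indices taken from the dataset README, is an assumption here.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# cat_cols: assumed list of categorical column indices (see the dataset README)
enc = ColumnTransformer([("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
                        remainder="passthrough")    # binary/continuous columns pass through
X_train_oh = enc.fit_transform(X_train)             # fit on the training set only
X_val_oh = enc.transform(X_val)
X_test_oh = enc.transform(X_test)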
(f) (5 points) Next, use the scikit-learn library to learn a random forest using the same transformation
over the data, as described in the previous part. Refer to the scikit-learn documentation for the details
of various parameter settings. Try growing different forests by playing around with parameter values.
Some parameters that you should specifically experiment with include n_estimators, max_features and
bootstrap (feel free to vary other parameters as well). How does the validation set accuracy change as you
try various parameter settings? Comment. Find the setting of parameters which gives you the best accuracy
on the validation set. Report training, validation and test set accuracies for this parameter setting.
How do your numbers compare with those you obtained in parts (b), (c), (d) and (e) above? Comment
on your observations.
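An illustrative forest sweep over the one-hot data, in the same spirit as the part (d) sketch; the grid values are again arbitrary.

from itertools import product
from sklearn.ensemble import RandomForestClassifier

best_rf, best_acc = None, 0.0
for n, mf, bs in product([50, 150, 350], ["sqrt", "log2", 0.5], [True, False]):
    rf = RandomForestClassifier(n_estimators=n, max_features=mf, bootstrap=bs)
    rf.fit(X_train_oh, y_train)
    acc = rf.score(X_val_oh, y_val)
    if acc > best_acc:
        best_rf, best_acc = rf, acc
print(best_acc, best_rf.get_params())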

2. (35 points) Neural Networks: In this problem, you will work with the Poker Hand dataset available
on the UCI repository. We will use the entire dataset for the purpose of this assignment. The training set
contains 25,010 examples whereas the test set contains 1,000,000 examples. The dataset consists of 10
categorical attributes. The last entry in each row denotes the class label. You can read about the dataset and
the attributes in detail from the link given above.
(a) (3 points) The Poker Hand dataset described above has 10 categorical attributes. In the decision
tree part, you have learned about one-hot encoding as a way to convert categorical features to binary.
Transform and save the given train and test sets using one-hot encoding. We will use these new train
and test sets for the subsequent parts. Note that the new dataset should have 85 features.
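One way to produce the 85 features, assuming each row lists the five cards as alternating (suit, rank) pairs as in the UCI description; suits take 4 values and ranks 13, so 5 * (4 + 13) = 85.

import numpy as np

def one_hot_poker(X):
    """Map an (n, 10) array of raw attributes to an (n, 85) binary array."""
    cols = []
    for i in range(5):                            # five cards per hand
        suit = X[:, 2 * i].astype(int)            # suits coded 1..4
        rank = X[:, 2 * i + 1].astype(int)        # ranks coded 1..13
        cols.append(np.eye(4)[suit - 1])
        cols.append(np.eye(13)[rank - 1])
    return np.hstack(cols)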
(b) (12 points) Write a program to implement a generic neural network architecture. Implement the
backpropagation algorithm to train the network. You should train the network using Stochastic Gradient
Descent (SGD) where the batch size is an input to your program. Your implementation should be generic
enough so that it can work with different architectures. Specifically, your program should be able to
accept the following parameters:
• the size of the batch for SGD
• the number of inputs
• a list of numbers where the size of the list denotes the number of hidden layers in the network and
each number in the list denotes the number of units (perceptrons) in the corresponding hidden layer.
E.g., a list [100, 50] specifies two hidden layers: the first with 100 units and the second with 50 units.
• the number of outputs, i.e., the number of classes
Assume a fully connected architecture i.e., each unit in a hidden layer is connected to every unit in
the next layer. You should implement the algorithm from first principles and not use any existing
MATLAB/Python modules. Use the sigmoid function as the activation unit.
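A compressed sketch of one way to organise such a network is below; the squared-error loss, the initialisation scale, and the class interface are choices of this sketch, not requirements of the assignment.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class NeuralNet:
    def __init__(self, n_in, hidden, n_out):
        sizes = [n_in] + hidden + [n_out]
        self.W = [np.random.randn(a, b) * 0.1 for a, b in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(b) for b in sizes[1:]]

    def forward(self, X):
        """Return the list of layer activations, starting with the input."""
        acts = [X]
        for W, b in zip(self.W, self.b):
            acts.append(sigmoid(acts[-1] @ W + b))
        return acts

    def sgd_step(self, X, Y, lr):
        """One mini-batch update; Y is one-hot. Uses a squared-error loss."""
        acts = self.forward(X)
        delta = (acts[-1] - Y) * acts[-1] * (1 - acts[-1])    # dJ/dz at the output
        for l in range(len(self.W) - 1, -1, -1):
            grad_W = acts[l].T @ delta / len(X)
            grad_b = delta.mean(axis=0)
            if l > 0:                       # propagate delta before W[l] is overwritten
                delta = (delta @ self.W[l].T) * acts[l] * (1 - acts[l])
            self.W[l] -= lr * grad_W
            self.b[l] -= lr * grad_b

An epoch then just shuffles the training set and calls sgd_step on successive batches of the requested size.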
(c) (4 points) In this part, we use the above implementation to experiment with a neural network having
a single hidden layer. Vary the number of hidden layer units from the set {5, 10, 15, 20, 25}. Set the
learning rate to 0.1. Choose a suitable stopping criterion and report it. Report and plot the accuracy
on the training and the test sets, as well as the time taken to train the network. Plot each metric on the Y-axis against
the number of hidden layer units on the X-axis. Additionally, report the confusion matrix for the test
set, for each of the above parameter values. What do you observe? How do the above metrics and the
confusion matrix change with the number of hidden layer units?
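For the confusion matrices, a library may reasonably be used for evaluation even though training is from first principles; a sketch, with net standing in for the part (b) network and X_test, y_test assumed loaded:

from sklearn.metrics import confusion_matrix

y_pred = net.forward(X_test)[-1].argmax(axis=1)   # net: the part (b) sketch
print(confusion_matrix(y_test, y_pred))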
(d) (4 points) In this part, we will experiment with a network having two hidden layers, each having the
same number of neurons. Set the learning rate to 0.1 and vary the number of hidden layer units, as
described in part (c). Report the metrics and the confusion matrix on the test set, as described in the
previous part. How do the metrics and the confusion matrix change with the number of hidden layer
units? What effect does increasing the number of hidden layers, keeping the number of hidden layer
units the same, have on the metrics and the confusion matrix?
(e) (6 points) In both the previous parts, the value of the learning rate is fixed to 0.1, throughout the
training. In this part, we will experiment with adaptive learning rate. We will fix the initial learning
rate to 0.1. The learning rate is not changed as long as training loss keeps decreasing. Each time two
consecutive epochs fail to decrease training loss by a fixed tolerance value, tol, the current learning rate
is divided by 5. Repeat parts (c) and (d) by varying the learning rate as described, keeping tol = 10^-4.
What effect does the adaptive learning rate have on the metrics and the confusion matrix? Support your
observations with reasons.
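A sketch of one reading of this schedule; the interpretation of "two consecutive epochs" as a counter that resets on any sufficient improvement, and the train_epoch helper returning the epoch's training loss, are assumptions here.

lr, tol, bad_epochs = 0.1, 1e-4, 0
prev_loss = float("inf")
for epoch in range(max_epochs):                     # max_epochs: assumed cap
    loss = train_epoch(net, X_train, Y_train, lr)   # one full pass of SGD batches
    if prev_loss - loss < tol:
        bad_epochs += 1
        if bad_epochs == 2:                         # two consecutive stalled epochs
            lr /= 5.0
            bad_epochs = 0
    else:
        bad_epochs = 0
    prev_loss = loss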
(f) (6 points) In this part, we will use ReLU as the activation instead of the sigmoid function, only in the
hidden layer(s). ReLU is defined using the function: g(z) = max(0, z). Change your code to work with
the ReLU activation unit. Make sure to correctly implement gradient descent by making use of sub-
gradient at z = 0. Use the same definition of sub-gradient for the purpose of this assignment, as specified
in the Minor 2 problem statement. Here is a resource to know more about sub-gradients. Repeat part
(e) using ReLU as the activation function in the hidden layers and report the metrics and the confusion
matrix as described previously. What effect does using ReLU have on each of the metrics as well as
the confusion matrix? Support your observations with reasons.
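A sketch of the activation and one common subgradient choice; taking g'(0) = 0 below is an assumption, and the Minor 2 definition should be used instead if it differs.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_subgrad(z):
    # 1 for z > 0, and 0 at z <= 0 (the assumed subgradient at z = 0)
    return (z > 0).astype(float)

In the part (b) sketch this means caching the pre-activations z for each layer and replacing the hidden-layer factor acts[l] * (1 - acts[l]) with relu_subgrad(z), while the output layer keeps the sigmoid.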

(g) Extra fun - No credits! Observe that the poker hand dataset contains classes, which have a rare
occurrence. For instance, there are only 4 instances of the Royal Flush in the train and test set combined.
Data Augmentation is a popular technique which is widely used to make classifiers more robust. A
popular use of data augmentation is in deep learning based image classification models, where the input
images are flipped, rotated and then fed to the network. Upscaling or upsampling is a technique which
is used to handle class imbalance in a dataset. Here, we randomly choose a few examples from the rare
class and feed them through the model multiple times. Now, observe that in the Poker Hand dataset,
the order in which the cards are shown does not determine the hand. Thus, any permutation of an
input will also have the same hand as the input. In this part, you should experiment with upscaling the
rare classes in the dataset by adding a suitable number of permutations of each input of the class, to the
dataset. Check if this technique has any effect on the metrics and the confusion matrix described in the
above parts.
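A sketch of the augmentation on the raw 10-attribute rows (before one-hot encoding); the number of copies per example is an arbitrary knob here.

import numpy as np

def upsample_class(X, y, label, n_copies, seed=0):
    """Append n_copies random card-order permutations of every example of `label`."""
    rng = np.random.default_rng(seed)
    new_rows = []
    for row in X[y == label]:
        pairs = row.reshape(5, 2)                 # five (suit, rank) pairs
        for _ in range(n_copies):
            new_rows.append(pairs[rng.permutation(5)].ravel())
    X_aug = np.vstack([X, np.array(new_rows)])
    y_aug = np.concatenate([y, np.full(len(new_rows), label)])
    return X_aug, y_aug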
