
MCGILL UNIVERSITY

845 Sherbrooke St W, Montreal, QC H3A 0G4

DEPARTMENT OF COMPUTER SCIENCE

COMP 551 - APPLIED MACHINE LEARNING

MINI-PROJECT 1: MACHINE LEARNING 101

Submitted to: PROF. WILLIAM L. HAMILTON

Submission Date: 31-01-2019

Submitted by:

MOHAMMAD HAMED AZIZI - 260812541

SURYA KUMAR DEVARAJAN - 260815492

SAEED SHOARAYE NEJATI - 260890049


ABSTRACT
In this project we apply machine learning algorithms to a dataset provided by reddit.com. The dataset has different features such as children, controversiality, is_root and text. Our task is to build a model that predicts the popularity of a comment. We start by defining two linear regression algorithms, the closed-form solution and gradient descent, and compare their results by implementing them with different features and different parameters for each task. Our project is divided into three tasks: first, we split our dataset into training, validation and test sets and extract the desired features for each task. Then, we implement the closed-form and gradient descent algorithms. Finally, we compare the results of these algorithms on the validation set to check the performance and stability of our models. Running the best model we have on the test set examines how our trained model behaves on an unseen dataset. We found that the gradient descent approach was slower than the closed-form approach for the dataset provided, and we analysed how decay plays a role in gradient descent.

INTRODUCTION
This project involves the use of linear regression models to predict the popularity of comments on Reddit. We are provided with a large set of Reddit comments that has features such as children, controversiality and is_root. We also have the popularity score, which measures the popularity of a comment and is the target variable that we need to predict.
Machine learning, and more specifically the field of predictive modeling, is primarily concerned with minimizing the error of a model, i.e. making the most accurate predictions possible. For example, in a simple regression problem (a single input $x$ and a single output $y$), the form of the model would be:

$$\hat{y} = w_0 + w_1 x$$

More generally, with a design matrix $X$ and target vector $\mathbf{y}$, the least-squares solution in closed form is:

$$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$$
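As a minimal sketch of this closed-form fit in Python with NumPy (the variable and function names here are ours, not from the project code):

```python
import numpy as np

def closed_form(X, y):
    """Least-squares weights w = (X^T X)^{-1} X^T y.

    X is an (n, d) design matrix whose first column is all ones
    (the bias term); y is an (n,) vector of targets.
    """
    # Solving the normal equations is more stable than forming the inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage: fit y = w0 + w1 * x on three points lying on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.0, 3.0, 5.0])
w = closed_form(X, y)  # approximately [1.0, 2.0]
```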

Gradient descent is an optimization algorithm that uses decay to control the step size with which it minimizes the loss. The decay here depends on the values of η and β. In this project, we show the results of using both algorithms and compare them to determine the best algorithm for our dataset.
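A sketch of gradient descent on the mean squared error with a decaying learning rate is shown below. We assume the common schedule α_t = η / (1 + β·t), since the exact decay formula is not reproduced in this report, and all names are illustrative.

```python
import numpy as np

def gradient_descent(X, y, eta=1e-4, beta=1e-5, eps=1e-6, max_steps=200000):
    """Minimize the MSE with a decayed step size alpha_t = eta / (1 + beta * t)."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, max_steps + 1):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # gradient of the mean squared error
        alpha = eta / (1.0 + beta * t)        # decayed learning rate
        step = alpha * grad
        w -= step
        if np.linalg.norm(step) < eps:        # stop once the updates become tiny
            return w, t
    return w, max_steps
```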
For our given dataset, changing the number of features in each experiment shows us which model (closed form or gradient descent) fits the data. Furthermore, changing the parameters of gradient descent has a big impact on the stability, runtime and performance of the model.
In the next part of this project we show how we implemented these two algorithms and applied them using different features in each task. We referred to Jason Brownlee (2017, Oct 20), "How to Develop a Deep Learning Bag-of-Words Model for Predicting Movie Review Sentiment", retrieved from https://machinelearningmastery.com/deep-learning-bag-of-words-model-sentiment-analysis/

Dataset
The first part of the project is to split the dataset into training, validation and test sets. We use the first 10000 data points for training, the next 1000 for validation and the last 1000 as the test set. The features other than text do not need preprocessing, as we can use their numerical and binary values as they are. The children feature counts the number of replies a comment received. The controversiality feature is a metric of how controversial a comment is and takes binary values. The is_root feature is a binary variable that indicates whether a comment is the root comment of a discussion thread. The preprocessing of text involves two operations: converting all text to lower case and splitting it on white space to obtain individual words. After preprocessing, we extract the 160 most frequently occurring words over all the comments in the training set. Every comment is then given a 160-dimensional feature vector: we build a bag of the 160 most frequent words and fill the vector with the frequency of each of these words in the comment.
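A minimal sketch of this split and word-count feature, assuming each comment is a dict with a 'text' field (the helper names are ours):

```python
from collections import Counter

def split_data(data):
    """First 10000 points for training, next 1000 for validation, last 1000 for test."""
    return data[:10000], data[10000:11000], data[11000:12000]

def preprocess(comment):
    """Lowercase the text and split it on white space."""
    return comment['text'].lower().split()

def top_words(train, k=160):
    """The k most frequent words over all training comments."""
    counts = Counter(word for c in train for word in preprocess(c))
    return [word for word, _ in counts.most_common(k)]

def word_count_features(comment, vocab):
    """Per-comment counts of the frequent words, as a len(vocab) vector."""
    counts = Counter(preprocess(comment))
    return [counts[word] for word in vocab]
```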
Moreover, generating a more accurate model depends on the precision of the features. To get more precise features we tried to clean the text feature by removing prepositions, pronouns and articles. For example, a comment with the text "I went to school" is filtered to "went school". The top 160 frequent words before and after cleaning the dataset are attached with the report.
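As a minimal sketch of such a filter (the short stop-word list here is illustrative, not the one used in the project):

```python
STOP_WORDS = {'i', 'a', 'an', 'the', 'to', 'of', 'in', 'and'}  # illustrative subset

def clean(words):
    """Drop stop words (prepositions, pronouns, articles) before counting."""
    return [w for w in words if w not in STOP_WORDS]

clean("i went to school".split())  # -> ['went', 'school']
```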
We also come up with a new feature that is a transformation of the children feature, since the number of replies has a great impact on the popularity. We start by squaring the children feature (MSE_closed_square) and computing the new value of the MSE; then we compute the MSE for the closed form after also adding the exp of the children feature (MSE_closed_f2).
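These two transformations can be sketched as extra columns of the design matrix (a minimal illustration; the function name is ours):

```python
import numpy as np

def add_children_features(X, children):
    """Append children**2 and exp(children) as two extra feature columns."""
    children = np.asarray(children, dtype=float)
    return np.column_stack([X, children ** 2, np.exp(children)])
```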
The main concern that might arise when working with public social media is that we are no longer dealing with a small dataset; in real life we receive millions of comments and replies daily, in a continuous stream, and preprocessing a large streaming dataset would be an issue. In one way or another, this would push us to use another technique to deal with this kind of dataset.

RESULTS
Task 1

After implementing the two algorithms, we start by taking into consideration the first three features (children, is_root, controversiality), and compare the performance and accuracy of these algorithms by calculating the runtime and the value of the errors.
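Throughout, the error we report is the mean squared error over the $n$ examples of a split:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$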
Gradient descent
In the table below, we vary the values of η and β to obtain different decayed learning rates, then we measure the runtime, loss and number of steps for each (η, β) pair.
Table 1

β       η       Runtime (s)     Steps
10e-3   10e-2   0.01847         117
10e-3   10e-3   0.018992        175
10e-3   10e-4   too long        too long
10e-3   10e-5   1.800957918     36333
10e-3   10e-6   too long        too long
10e-3   10e-7   too long        too long
10e-4   10e-2   0.04224681      118
10e-4   10e-3   0.0176391       190
10e-4   10e-4   0.0517199       608
10e-4   10e-5   0.3444151       6604
10e-4   10e-6   9.12999901      186161
10e-5   10e-2   0.0242769       118
10e-5   10e-3   0.0203499       190
10e-5   10e-4   0.0482759       590
10e-5   10e-5   0.2966492       5581
10e-5   10e-6   2.437783        47315

As we can see from Table 1, the pair η = 10e-4 and β = 10e-5 gives the best performance, with a final error of 9.9914e-9 and a runtime of 0.3444151 s.
Mean squared error = 1.34161955
The stability of the closed form does not depend on any hyperparameters; it is a straightforward calculation. The stability of gradient descent, however, mainly depends on the learning rate, which in turn depends on the hyperparameters η and β. When the learning rate is too big, the algorithm does not reach the minimum of the convex loss because it bounces back and forth across it. If the learning rate is very small, gradient descent eventually reaches the minimum, but it takes too much time. Please refer to Table 1 for the number of steps and runtime taken for different values of η and β.
Comparing the runtime and stability (for the values mentioned above, β = 10e-5 and η = 10e-4) of the gradient descent and closed-form algorithms, we can conclude that the closed form has a lower MSE (better performance) and a lower runtime than the gradient descent method.
Table 2
                        Training                    Validation
Closed form             MSE = 1.084674              MSE = 1.011749
                        Runtime = 0.214557 s        Runtime = 0.2057149 s
Gradient descent        MSE = 1.34161955            MSE = 1.27693395
                        Runtime = 0.3444151 s       Runtime = 18.5808629 s
N.B.: we use the best values of η and β obtained in Table 1.
From the above results, we can notice that the runtime of gradient descent is higher, since we calculate the decayed learning rate at each step. Moreover, the closed form has better accuracy, since its loss is lower than that of gradient descent.
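Runtime comparisons like the one above can be reproduced with a simple wall-clock timer (a sketch; closed_form and gradient_descent refer to the illustrative helpers defined earlier):

```python
import time

def timed(fit, X, y):
    """Return the result of one fit together with its wall-clock time in seconds."""
    start = time.perf_counter()
    result = fit(X, y)
    return result, time.perf_counter() - start
```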
Task 2
Below are the results of the closed-form algorithm.
Table 3
                Training                    Validation
No text         MSE = 1.084674              MSE = 1.011749
                Runtime = 0.214557 s        Runtime = 0.2057149 s
60 words        MSE = 1.061161              MSE = 0.904507
                Runtime = 0.275542 s        Runtime = 0.26471018 s
160 words       MSE = 1.0467629             MSE = 0.912364
                Runtime = 0.284834 s        Runtime = 0.365887 s
For all feature sets (no text, 60 frequent words and 160 frequent words), the MSE on the validation set is slightly better than on the training set. This is not something one can expect all the time; here the validation values are lower due to some random noise. In these cases, the model neither overfits nor underfits.
Task 3
Below are the MSE values for the closed form applied to the training set after adding the two new features:
Table 4
Closed form, without new features      1.0467629
Closed form, with square feature       1.0005977
Closed form, with exp feature          1.0366922
Closed form, with both new features    1.0003905

We can see that the performance of the model improved after adding the two new features, with the MSE dropping from 1.0467629 to 1.0003905 (a reduction of about 4.4%).
Task 4
Now we run our model with the two new features on the test set to see how well the model generalises and how accurate it is on an unseen dataset.
Below are the MSE values for the closed form applied to the training, validation and test sets after adding the two new features:
Table 5
Closed form, training      1.0003905
Closed form, validation    0.756792
Closed form, test          1.0185782

Conclusion
After running these two algorithms, we observed that the closed-form approach can have better performance for some datasets, and that for a big dataset gradient descent takes more time. Moreover, adding more features to our dataset improved the performance of our model. These new features have a great impact on training but a lower one on the validation and test sets: the MSE decreases more on the training set than on the validation and test sets because of the data size (the noise of a small dataset).

Contribution
We divided the workload, which involved analysing the tasks, coding, running, testing and writing the report, equally. We met once every 2-3 days, working together towards the completion of the project.
