Example Report
A. Logistic Regression: Logistic Regression is a statistical model used to predict the probability of
an event occurring. It models the log-odds of the outcome as a linear function of the input:

logit(p_i) = ln( p_i / (1 − p_i) ) = β0 + β1·x_i
The response variable is modeled with a Bernoulli distribution, giving the log-likelihood:

ln(L) = Σ_{i=1}^{N} [ ln(1 − p_i) + y_i · ln( p_i / (1 − p_i) ) ]

To maximise this log-likelihood, we need to find the optimal values of p_i (equivalently, of the
coefficients β).
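As a sanity check on the log-likelihood above, the sketch below (with purely illustrative y and p values) verifies numerically that it is algebraically identical to the more familiar Bernoulli form Σ [ y_i·ln(p_i) + (1 − y_i)·ln(1 − p_i) ]:

```python
import numpy as np

# Hypothetical labels and predicted probabilities (illustrative values only).
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Log-likelihood in the form used above: sum of ln(1 - p_i) + y_i * ln(p_i / (1 - p_i))
ll_report_form = np.sum(np.log(1 - p) + y * np.log(p / (1 - p)))

# Standard Bernoulli form: sum of y_i * ln(p_i) + (1 - y_i) * ln(1 - p_i)
ll_standard = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(np.isclose(ll_report_form, ll_standard))  # the two forms agree
```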
B. Lasso Regression: Lasso, the Least Absolute Shrinkage and Selection Operator, is a linear
regression model that incorporates variable selection and regularisation to improve prediction
accuracy. Regularisation adds a penalty term to the cost function; Lasso uses the L1 penalty:

R(θ) = Σ_{j=1}^{n} |θ_j|
In Lasso Regression, unimportant features are driven to weights of approximately 0, so they
contribute nothing when predicting new values. In our case, the hyperparameter that needs tuning is
the C value used to set the alpha parameter:

α = 1 / (2C)

So alpha is inversely proportional to the C value.
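A short sketch of the α = 1/(2C) mapping, using the same candidate C values tested later for the logistic model (the choice of grid here is illustrative):

```python
# Candidate C values; alpha = 1 / (2 * C) as defined above, so a large C
# corresponds to a small alpha, i.e. weaker regularization.
c_values = [0.001, 0.01, 0.1, 1, 10, 100]
for c in c_values:
    alpha = 1 / (2 * c)
    print(f"C = {c:>7} -> alpha = {alpha}")
```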
C. k-Fold Lasso Regression: To improve the model, we used k-fold cross-validation to tune the
hyperparameters of the previous Lasso model.
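A minimal sketch of k-fold tuning with scikit-learn, assuming k=5 and a small alpha grid; the data here is a synthetic stand-in, since the report's actual features are the vectorized review texts:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (hypothetical); real inputs are vectorized reviews.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation over a grid of alpha values, scored by RMSE.
search = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)  # alpha with the lowest cross-validated RMSE
```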
D. Ridge Regression: Ridge regression is used to analyse data with multicollinearity. When the data
contain multicollinearity, the least-squares estimates are unbiased but their variances are large,
which leads to a drastic reduction in prediction accuracy. Ridge regression adds an L2 penalty to
the cost function:

J(θ) = (1/2) Σ_{i=1}^{m} ( h_θ(x_i) − y_i )² + λ Σ_{j=1}^{n} θ_j²
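The ridge cost above can be written out directly. The sketch below uses hypothetical numbers and a linear hypothesis h_θ(x) = x·θ; note it penalizes all coefficients, whereas implementations often exclude the intercept from the penalty:

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """J(theta) = (1/2) * sum((h_theta(x_i) - y_i)^2) + lam * sum(theta_j^2)."""
    residuals = X @ theta - y  # h_theta(x_i) - y_i for a linear hypothesis
    return 0.5 * np.sum(residuals ** 2) + lam * np.sum(theta ** 2)

# Tiny illustrative example (hypothetical values).
X = np.array([[1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0])
theta = np.array([0.0, 1.0])
print(ridge_cost(theta, X, y, lam=0.5))  # residuals are 0, so only the L2 term remains -> 0.5
```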
E. Sequential Model: A Sequential model is a linear stack of layers in which each layer has exactly
one input and one output.
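A conceptual sketch of that idea in plain NumPy (not the Keras API): a "sequential" model is just a list of layers applied one after another, the output of each feeding the next. The layer sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    """A dense layer: random weights, zero bias, ReLU activation."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return lambda x: np.maximum(0.0, x @ W + b)

# Linear stack of single-input, single-output layers.
layers = [dense(8, 16), dense(16, 16), dense(16, 1)]

def forward(x, layers):
    for layer in layers:
        x = layer(x)  # each layer consumes the previous layer's output
    return x

x = rng.normal(size=(4, 8))      # a batch of 4 input vectors
print(forward(x, layers).shape)  # (4, 1)
```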
F. K-Nearest Neighbours Model: This model is used to solve both regression and classification
problems. The KNN model assumes that similar points lie close to one another, groups them
accordingly, and then predicts a new point's value from the group it would fall into.
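A minimal classification sketch with scikit-learn; the data is a synthetic stand-in (the report's actual inputs are vectorized review texts), and k=5 is an arbitrary illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# k-NN predicts a point's class from the majority class among its k nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on the held-out 20%
```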
Experiments/Results/Discussion
We first generated a wordcloud for the entire set of review texts, and then separate wordclouds for
the extreme ends of the rating variable, to observe the most commonly occurring words in each case.
The data was divided into an 80/20 train/test split. The wordclouds for the extreme ratings were not
up to our expectations: we expected the rating=5 wordcloud to contain positive, complimentary words,
and the rating=1 wordcloud negative ones. Since both wordclouds shared several of the same frequently
occurring words, we realised we needed to preprocess the data to remove highly repetitive but neutral
words from the review texts. The wordclouds are as follows:
The first wordcloud in the first row is for the entire dataset. The second wordcloud in the first
row is for reviews with rating=5 (the maximum). The wordcloud in the second row is for reviews with
rating=1 (the minimum).
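The preprocessing step described above amounts to filtering neutral, highly repetitive words before counting frequencies. A small sketch of that idea, using a hypothetical two-review corpus and an illustrative stop-word list:

```python
import re
from collections import Counter

# Hypothetical mini-corpus; the report uses the full set of user reviews.
reviews = [
    "The product is great, really great quality",
    "The delivery was terrible and the product broke",
]

# Neutral, highly repetitive words to exclude (illustrative stop-word list).
stopwords = {"the", "is", "was", "and", "a", "of"}

# Tokenize, lowercase, drop stop words, and count what remains.
words = re.findall(r"[a-z']+", " ".join(reviews).lower())
counts = Counter(w for w in words if w not in stopwords)
print(counts.most_common(3))
```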
Logistic Regression:
L2 regularized Logistic Regression
A Logistic Regression Model is used to predict the ratings from the user reviews. We have tested the
model against the following values for C: [0.001, 0.01, 0.1, 1, 10, 100]
As observed from the plot of true positive rate vs false positive rate (the ROC curve), the model
with the L2 penalty term performs best when C is set to 100. The AUC score on the testing data is
81.08%.
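A sketch of the C sweep described above, using the same grid of C values; the data is a synthetic stand-in for the vectorized reviews, so the AUC numbers it prints are not the report's:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); real inputs are vectorized reviews.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Sweep the tested C values; larger C means weaker L2 regularization.
for c in [0.001, 0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(penalty="l2", C=c, max_iter=1000)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"C={c:>7}: AUC={auc:.3f}")
```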
Lasso Regression:
From the above outputs, we can observe that the root mean square error on the testing data is lowest
when the alpha value is set to 0.001. As we keep increasing the alpha value, the root mean square
error increases.
On viewing the features selected at the alpha value with the lowest root mean square error, we can
confirm that this is the best-fitting model, as it focuses on the negative words in the reviews, on
which the review rating depends the most.
Ridge Regression:
From the above outputs, we can observe that the root mean square error on the testing data is lowest
when the alpha value is set to 0.1. As we keep decreasing the alpha value, the root mean square
error increases.
On viewing the features selected at the alpha value with the lowest root mean square error, we can
confirm that this is the best-fitting model, as it focuses on the positive words in the reviews, on
which the review rating depends the most.
Sequential Model:
From the above plots, it can be seen that as the number of epochs (training iterations) increases,
the loss decreases and the accuracy increases steadily.
The accuracy score above shows that the sequential model's accuracy on the testing data is 74.3%,
which is confirmed by the classification report.
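Cross-checking an accuracy score against a classification report can be sketched as below; the true and predicted ratings here are purely illustrative, not the model's actual outputs:

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true vs predicted ratings (illustrative values only).
y_true = [5, 1, 5, 3, 1, 5, 3, 1]
y_pred = [5, 1, 5, 5, 1, 5, 3, 3]

# Overall accuracy matches the accuracy line of the classification report.
print(accuracy_score(y_true, y_pred))  # 6 of 8 correct -> 0.75
print(classification_report(y_true, y_pred))
```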
Summary

Model              Score
Lasso Regression   74
Ridge Regression   44
Sequential Model   75