BACHELOR OF TECHNOLOGY
IN
MECHANICAL ENGINEERING
Submitted By
ACKNOWLEDGEMENT
It is our great fortune to have had the opportunity to work on this exciting and
thought-provoking project in this institute. The learning and experience we have
gained here are of inestimable value to us. Gratitude is one of the deepest
expressions of one's heart, so it gives us immense pleasure to express our
paramount gratitude to each one of those who made this possible.
We would like to express our deep sense of gratitude to Dr. Anil Kumar Agrawal,
Professor, IIT (BHU) for providing us this unique opportunity of carrying out the
project and for his constant guidance, encouragement and timely support throughout
the course of this project.
TABLE OF CONTENTS
1. ACKNOWLEDGEMENT
2. INTRODUCTION
3. PREVIOUS WORK
4. MACHINE LEARNING MODELS
5. PERFORMANCE METRICS
6. FEATURE EXTRACTION
7. NEURAL NETWORKS
8. COURSE RECOMMENDER SYSTEM
9. CONCLUSION
INTRODUCTION
In today's world, time is a commodity in sales, and one of the most critical
business decisions of a company is customer acquisition. During the
acquisition phase of the customer life cycle, companies try to convert leads into
customers through different methods. One useful way to save time while increasing
sales is to focus on the best available leads rather than wasting effort on
inactive ones. To increase the conversion rate, we used customer-specific
data such as time spent on the website, total visits and occupation, and ranked the
customers in order of their probability of conversion, so that the most promising
leads can be pursued by salespeople. To achieve this goal, real-world data is fed as
input to various machine learning models, which predict the probability of conversion
from the important features of the data.
PREVIOUS WORK
The aim of our project is to help e-learning platforms increase their sales and
customer conversion rate using machine learning. Our previous work covered data
preprocessing, in which we performed data cleaning, feature scaling and one-hot
encoding so that the data could be fed into machine learning models for prediction.
We then carried out data visualization to draw valuable insights and inferences from
the data, which can be interpreted to support better business decision making and
help optimize sales. Previously we used supervised machine learning models, namely
Logistic Regression, Random Forest, Support Vector Machine, K-nearest neighbour and
Naive Bayes, to predict the most promising customers.
Data Preprocessing
One Hot Encoding - One hot encoding is a process by which categorical variables
are converted into a numerical form that can be provided to ML algorithms so that
they can do a better job in prediction.
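As a small illustration, pandas can produce such an encoding directly (the column names and values below are hypothetical stand-ins, not the actual lead data):

```python
import pandas as pd

# Hypothetical lead data with one categorical column
leads = pd.DataFrame({
    "Occupation": ["Student", "Working Professional", "Unemployed", "Student"],
    "TotalVisits": [5, 2, 7, 3],
})

# get_dummies creates one binary column per category of "Occupation"
encoded = pd.get_dummies(leads, columns=["Occupation"])
print(encoded.head())
```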
Data Visualization
Some of the data visualizations of important features from our previous work:
[Figure: Specialization]
[Figure: Occupation]
Logistic Regression
Logistic regression is a predictive analysis regression technique where the dependent
variable is categorical. Logistic regression is used to describe data and to explain
the relationship between one dependent binary variable and one or more independent
variables. It uses a logistic function, the sigmoid function, as its activation
function, which outputs a value between 0 and 1.
The best-performing logistic regression model was obtained after hyperparameter tuning.
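As a hedged sketch of how such tuning can be done with scikit-learn (the data and parameter grid below are illustrative stand-ins, not the report's actual values):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed lead data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Illustrative hyperparameter grid; the report's tuned values are not reproduced here
param_grid = {"C": [0.01, 0.1, 1, 10], "solver": ["lbfgs", "liblinear"]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```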
K-nearest neighbor
The principle behind nearest neighbour methods is to find a predefined number of
training samples closest in distance to the new point, and predict the label from these.
The number of samples can be a user-defined constant (k-nearest neighbour
learning), or vary based on the local density of points (radius-based neighbour
learning).
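As a brief sketch of both variants in scikit-learn, on synthetic stand-in data (the values of k and the radius are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# k fixed by the user (k-nearest neighbour learning)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# neighbourhood defined by a radius instead of a count (radius-based learning)
rnn = RadiusNeighborsClassifier(radius=5.0).fit(X, y)

print(knn.score(X, y))
```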
Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying
Bayes’ theorem with the “naive” assumption of conditional independence between
every pair of features given the value of the class variable.
The Gaussian Naive Bayes model achieved an accuracy of 87.54%.
Random Forest
Random forest is a supervised learning algorithm used here for classification
problems. It is an ensemble tree-based learning algorithm: it builds a set of
decision trees from randomly selected subsets of the training set, obtains a
prediction from each tree, and selects the best solution by means of voting.
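As a minimal sketch of this voting ensemble in scikit-learn, again on synthetic stand-in data (the number of trees is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed lead data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 100 trees, each grown on a bootstrap sample of the training set;
# the final class is decided by majority voting across the trees
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)  # relative importance of each input feature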
The most accurate hyperparameter-tuned Support Vector Machine model gave the following results.
[Figure: Precision-Recall curve, showing precision and recall against the threshold value]
Precision-Recall Curve
A precision-recall curve (or PR curve) is a plot of the precision and the recall for
different probability thresholds. In our project we selected the optimum threshold
value, the one that gives the best precision and recall values. The optimum threshold
value comes out to be 0.2, as shown in the precision-recall curve.
Accuracy: 92.37%
Precision: 90.06%
Recall: 86.32%
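As a hedged sketch of how a threshold can be read off a precision-recall curve (the classifier and data below are synthetic stand-ins, so the resulting threshold will not match the 0.2 reported above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Linear-kernel SVM with probability estimates enabled for thresholding
clf = SVC(kernel="linear", probability=True, random_state=0).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, probs)
# Pick the threshold that maximises the F1 score (one reasonable criterion)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
print("chosen threshold:", round(thresholds[np.argmax(f1)], 2))
```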
FEATURE EXTRACTION
Companies have more data than ever, so it is crucial to distinguish useful data from
unuseful data. Amongst the important aspects of machine learning are "Feature
Selection" and "Feature Extraction". Feature selection is the problem of selecting the
subset of a learning algorithm's input variables upon which it should focus attention,
while ignoring the rest; it can significantly improve a learning algorithm's
performance. Feature extraction involves reducing the number of resources required to
describe a large set of data. When performing analysis of complex data, one of the
major problems stems from the number of variables involved: analysis with a large
number of variables generally requires a large amount of memory and computation
power, and it may also cause a classification algorithm to overfit the training
samples and generalize poorly to new samples.
Out of the 37 features present in the data, many have "No" as the primary answer
(up to 90% of responses), so not much information could be extracted from them even
if the machine learning models were run on them. Not only would these features reduce
the overall quality of the data, they would also degrade the model's learning ability,
so they were dropped while processing the data. Finally, the data set was split
randomly into a training set and a test set.
The selected features were obtained using Recursive Feature Elimination (RFE).
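As a hedged sketch of how RFE can be run with scikit-learn (the estimator choice and the target of 15 features are assumptions, not the report's actual configuration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with 37 features, matching the count described above
X, y = make_classification(n_samples=500, n_features=37, n_informative=8, random_state=0)

# Recursively drop the weakest features until 15 remain (count is illustrative)
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=15)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the retained features
print(selector.ranking_)   # 1 = selected, higher = eliminated earlier
```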
PCA Workflow-
1. Normalize the data - PCA is used to identify the components with the maximum
variance, and the contribution of each variable to a component is based on its
magnitude of variance. It is best practice to normalize the data before conducting a
PCA as unscaled data with different measurement units can distort the relative
comparison of variance across features.
2. Create a covariance matrix - A useful way to get all the possible relationships
between all the different dimensions is to calculate the covariance among them all
and put them in a covariance matrix which represents these relationships in the data.
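As a minimal sketch of this workflow in scikit-learn (the feature count and number of retained components are illustrative; PCA works internally from the covariance structure of the scaled data):

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=500, n_features=37, random_state=0)

# Step 1: normalize so that features with large measurement scales do not dominate
X_scaled = StandardScaler().fit_transform(X)

# Step 2: PCA builds the components from the covariance of the scaled features
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.cumsum())
```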
NEURAL NETWORKS
Hyperparameters
Hyperparameters are important because they directly control the behaviour of the
training algorithm and have a significant impact on the performance of the model
being trained.
The need for hyperparameter tuning:-
1. To find the right balance between bias and variance.
2. To prevent the model from running into the vanishing/exploding gradient
problem.
3. To avoid getting stuck in local optima.
4. To prevent non-convergence of the model.
Hyperparameter tuning done in our models:-
1. Number of Layers:- This was chosen keeping in mind that a very high number
may introduce problems like overfitting and vanishing/exploding gradients,
while a lower number may result in a high-bias, low-capacity model. The number
of hidden layers during hyperparameter tuning was in the range of 2 to 5.
2. Number of hidden units per layer:- This was also selected to find the
right spot between high bias and high variance, and it also depends on the size
of the training data. Hidden units in our models were generally powers of 2
between 128 and 1024 and were used in different combinations across models.
3. Activation Function:- Our choices here were ReLU, Sigmoid and Tanh.
6. Batch Size:- If the batch size is low, each update is computed from fewer,
less repetitive examples, so the weights jump around and convergence becomes
difficult. If the batch size is high, learning becomes slow, since the weights
are updated only after many examples have been processed. We tried batch sizes
in the range of 32 to 512, and in general batch sizes that were powers of 2
gave the best results.
7. Number of Epochs:- The number of epochs is the number of times the entire
training data is shown to the model. It played an important role in how well
the model fit the training data: a high number of epochs overfitted our data,
while a low number limited the potential of the model and led to underfitting.
A wide range of epoch counts was tried while training the models, and many led
to overfitting or underfitting; epochs in the range of 18 to 23 gave the best
results.
8. Dropout:- The keep-probability of the dropout layer can be thought of as a
hyperparameter that acts as a regularizer and helps find the optimum
bias-variance spot. With dropout, the model drops certain connections at every
iteration, so the hidden units cannot depend too heavily on any particular
feature. The keep-probability can take any value between 0 and 1 and is chosen
based on how much the model is overfitting. When we used a 5-layer deep neural
network there was a serious overfitting problem, which we tried to overcome
using high dropout. In general, with 2-layer deep neural networks we
experimented with keep-probability values between 0.2 and 0.4.
9. L1/L2 Regularization:- This serves as another regularizer in which very high
weight values are curbed so that the model does not depend on a single feature.
It generally reduces variance at the cost of increased bias, i.e. lower
accuracy. We experimented with both types of regularization, and after feature
selection we generally used L2 regularization.
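As a hedged sketch, a 2-hidden-layer Keras model in the spirit of the ranges described above (all values are illustrative; note that Keras' Dropout argument is the fraction dropped, not the keep-probability):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# 2 hidden layers, units as powers of 2, ReLU activations, L2 penalty and Dropout
model = tf.keras.Sequential([
    layers.Input(shape=(15,)),                      # 15 selected features (assumed)
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-3)),
    layers.Dropout(0.6),                            # drop rate 0.6, i.e. keep-probability ~0.4
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-3)),
    layers.Dropout(0.6),
    layers.Dense(1, activation="sigmoid"),          # binary conversion probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=128, epochs=20, validation_split=0.2)
```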
COURSE RECOMMENDER SYSTEM
Four types of recommendation systems have been implemented which are described
as follows-
1. Simple Recommender -
The Simple Recommender offers generalized recommendations to every user based
on course popularity. The basic idea behind this recommender is that courses that
are more popular and more critically acclaimed have a higher probability of
being liked by the average user. The main drawback of this recommender system
is that it does not give personalized recommendations to the user.
In the simple recommendation system, weighted ratings were calculated and
recommendations were made on the basis of those ratings.
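The exact weighted-rating formula is not reproduced here; a common choice for such a popularity score, assumed purely for illustration, is the IMDB-style weighted rating

WR = \frac{v}{v + m}\, R + \frac{m}{v + m}\, C

where, under this assumption, v is the number of ratings a course has received, m is the minimum number of ratings required to be considered, R is the course's average rating, and C is the mean rating across all courses.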
User’s taste and history were not considered in this recommender system.
3. Collaborative filtering -
Collaborative filtering is based on the idea that users similar to us can be used to
predict how much we will like a particular product or service that those users have
used or experienced but we have not. Singular value decomposition (SVD) was used
to predict the rating a user would give to a particular course.
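The report does not name a particular implementation; as a minimal sketch, assuming the Surprise library's SVD and a toy ratings table:

```python
import pandas as pd
from surprise import SVD, Dataset, Reader

# Toy user-course ratings (illustrative data only)
ratings = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u2", "u3"],
    "course": ["c1", "c2", "c1", "c3", "c2"],
    "rating": [5, 3, 4, 2, 5],
})

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[["user", "course", "rating"]], reader)
trainset = data.build_full_trainset()

algo = SVD()
algo.fit(trainset)

# Predicted rating that user u3 would give to course c1
print(algo.predict("u3", "c1").est)
```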
4. Hybrid Recommender -
In this system, all three recommender systems were combined to obtain optimised
course recommendations.
First, recommendations from the content-based recommender were retrieved; these
retrieved courses were then sorted on the basis of the weighted ratings obtained from
the simple recommender system, and finally the ratings that a particular user would
give to these retrieved courses were predicted using collaborative filtering. In this
way, the system considers the user's taste, the user's history and the popularity of
courses.
Algorithms
Tf-idf vectorizer -
Term Frequency -
The number of times a word appears in a document divided by the total number of
words in the document. Every document has its own term frequency.
Cosine similarity -
Cosine similarity is a measurement that quantifies the similarity between two
or more vectors. The cosine similarity is the cosine of the angle between the
vectors.
It is described mathematically as the dot product of the vectors divided by the
product of their Euclidean norms (magnitudes).
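As a minimal sketch combining the two ideas in scikit-learn, using a few illustrative course descriptions (not the real catalogue):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative course descriptions
descriptions = [
    "introduction to data science in python",
    "applied machine learning in python",
    "statistics for business analysis",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(descriptions)    # one tf-idf vector per course

# Cosine of the angle between every pair of course vectors
similarity = cosine_similarity(matrix)
print(similarity.round(2))
```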
Results
In the first system, an input field is given and courses are recommended on the basis
of their weighted ratings.
For example, if the input field is "Statistics", our system recommends the
corresponding courses.
In the second system, an input course is given and similar courses are recommended on
the basis of the similarity metric.
For example, if the input course is "Introduction to Data Science in Python by
University of Michigan", our system recommends the courses shown in the table below:
Courses                                  Organization
Advanced Data Science with IBM           IBM
Data Science                             Johns Hopkins University
Mathematics for Data Science             National Research University Higher School of Economics
Applied Machine Learning in Python       University of Michigan
Databases and SQL for Data Science       IBM
The recommendation function takes the predictions from SVD and the number of courses
to recommend as inputs, and returns the top recommendations for each user.
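The original snippet is not reproduced in this text; as a sketch of such a function, assuming Surprise-style prediction tuples (uid, iid, true rating, estimated rating, details):

```python
from collections import defaultdict

def get_top_n(predictions, n=5):
    """Return the n highest-estimated courses for each user.

    `predictions` is assumed to be a list of Surprise-style prediction tuples
    (uid, iid, true_r, est, details), e.g. from algo.test(testset).
    """
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    # Sort each user's courses by estimated rating and keep the top n
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n
```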
CONCLUSION
In our overall BTP work, a detailed and comprehensive analysis of data from an
e-learning platform was performed. Analysis of the data showed how many inferences
can be drawn from the raw data. Raw data cleaning was done, which included handling
missing values, encoding categorical variables and feature scaling of the variables.
For feature extraction, PCA and RFE (Recursive Feature Elimination) were used, which
greatly affected the models' performance. The extracted features were then used as
input variables for 5 classification models covering a huge range of hyperparameters.
Tuning those hyperparameters revealed how greatly a model is affected by them. The
SVM model using the linear kernel resulted in the highest accuracy of 92.37%, but the
other models, Logistic Regression (accuracy of 91.36%), K-Nearest Neighbour (accuracy
of 91.06%) and Random Forest (accuracy of 91.11%), all showed accuracies very close
to the SVM. The results showed that even though there are a great number of different
classification algorithms that are all good at making predictions, one of the major
factors that greatly influenced the accuracy is how the data preprocessing is done:
feature extraction and feature selection play a vital role in a model's performance.
Classification of customers was also done using neural networks, which contain a
lot of hyperparameters. The correct set of hyperparameters, such as the number of
hidden layers, the nodes in each layer, the learning rate, the type of optimizer and
the activation functions, played a vital role in the performance of the model. Deep
neural networks tend to overfit the training data, which we overcame by using
dropout and different regularization techniques.
Using the data of the different courses on the website along with user-specific data,
we also developed a recommendation system which can suggest the courses the user is
most likely to be interested in. This recommended-course data can be helpful to the
sales team and can further increase the conversion rate, as these courses are
tailored towards users who already have a high probability of buying a course. Hence,
with this targeted, user-specific system, the number of courses bought per user can
be increased along with the conversion rate.