Capstone Project Final Report Rupesh Kumar PGP-DSBA APR 21C
List of Figures
List of Tables
CAPSTONE FINAL REPORT PAGE NO : 1
Executive Summary
The HR department of Delta Ltd. wants to predict a salary range (CTC) for applicants with
similar profiles. Apart from the existing salary, a considerable number of factors relating
to an applicant's experience and other abilities are evaluated in interviews. The dataset
consists of various attributes of the applicants, such as job title/role, industry, location,
years of experience, current CTC, education, and skill profile; together these attributes
characterise each applicant. In this problem statement we will explore these attributes
(education, total experience, current CTC, role, industry, etc.) and build a machine
learning model to predict the CTC the company can offer an applicant at the time of joining.
The main objective of this problem is to provide salary estimates at the time of joining,
based on job title, location, years of experience, and skill profile, so as to minimise
human judgment in deciding the salary to be offered.
Information asymmetry between employers and applicants has become a problem that
needs immediate attention. Applicants are often kept in the dark about how the interview
process arrives at an offer and only learn the outcome at the end. Meanwhile, employers
must fairly meet candidates' expectations when shaping new HR strategies. The company
must therefore be careful not to offer too low a salary, which would lead candidates to
decline and leave positions unfilled for longer, nor too high a salary to an applicant
whose current CTC is already in line with the market, which wastes the company's vital
resources. It is therefore imperative to offer each employee an unbiased salary that he or
she truly deserves and that is appropriate to market demands.
The purpose of this exercise is to explore the dataset: perform exploratory data
analysis with univariate, bivariate, and multivariate visualisation to check the
distributions of and relationships between the given attributes, and then apply
supervised machine learning algorithms (Linear Regression, XGBoost Regressor, Decision
Tree Regressor, Random Forest Regressor, and ANN Regressor) to predict the correct
salary/CTC range for applicants on the basis of the given information. This will help
the company offer the correct CTC/salary range at the time of joining and reduce the
manual judgment involved in selecting it. The approach is intended to be robust and to
eliminate discrimination in salary among similar employee profiles.
We first explore the dataset using measures of central tendency and other summary
statistics. The data consists of 25,000 applicants with 29 unique features. We then
analyse the attributes that can help the company build a machine learning model to
predict the CTC to offer an applicant at the time of joining, supporting the right
judgment on the salary to be offered.
Note: We drop the columns IDX and Applicant_ID, as they do not contribute to the analysis or
model-building exercise: both are unique for every applicant, which maximises their cardinality,
and high-cardinality identifier variables are not preferred, making them useless for the model.
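As a sketch, the drop can be done with pandas; the small DataFrame below is an illustrative stand-in for the real expected_ctc.csv data:

```python
import pandas as pd

# Small stand-in frame; in the report the data comes from expected_ctc.csv.
df = pd.DataFrame({
    "IDX": [1, 2, 3],
    "Applicant_ID": [101, 102, 103],
    "Total_Experience": [5, 10, 3],
})

# IDX and Applicant_ID are unique per row (maximum cardinality),
# so they carry no signal for the model and are dropped.
df = df.drop(columns=["IDX", "Applicant_ID"])
print(list(df.columns))  # → ['Total_Experience']
```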
Observation:
Now all remaining columns are useful for the model.
We renamed the Education column to Highest Education, which reads better and gives a
clearer idea of an applicant's educational background.
Second, we renamed the misspelled Curent Location column to Current Location.
Observations
From the summary table above we obtain the count, mean, std, min, 25%, 50%, 75%, and max
values for all numeric variables in the dataset.
From the corresponding table we obtain the count, number of unique values, top (most
frequent) value, and its frequency for all categorical variables in the dataset.
Insights -
The shape attribute tells us the number of observations and variables in the data set,
i.e. its dimensions. The expected_ctc.csv dataset has 25,000 observations (rows) and 27
variables (columns).
The info() function prints a concise summary of a DataFrame, including the index dtype,
column dtypes, non-null counts, and memory usage.
Insights -
From the results above we can see that null values are present in most of the columns
(Department, Role, Industry, Organization, Designation, Graduation_Specialization,
University_Grad, Passing_Year_Of_Graduation, PG_Specialization, University_PG,
Passing_Year_Of_PG, PHD_Specialization, University_PHD, Passing_Year_Of_PHD, etc.).
There are 25,000 rows and 27 columns in this dataset, indexed from 0 to 24999. Of the
27 variables, 3 are float64, 16 are object, and 8 are int64 dtypes. Memory used by the
dataset: 5.1+ MB.
Skewness is a well-established statistical concept for continuous and, to a lesser extent,
discrete quantitative variables. Here we check the skewness of the features present in our
dataset.
Insights -
From the result above we can check which variables are normally distributed and which are not.
Variables with 0.5 < skewness < 1 are moderately positively skewed.
Variables with -1 < skewness < -0.5 are moderately negatively skewed.
Variables with -0.5 < skewness < 0.5 are approximately symmetric, i.e. close to normally distributed.
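The skewness check and the rule of thumb above can be sketched with pandas; the two toy columns are illustrative, not taken from the actual dataset:

```python
import pandas as pd

# Toy columns: one symmetric, one with a heavy right tail.
df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 1, 2, 10],
})

# Skewness per numeric column, labelled with the same thresholds
# used in the discussion above.
skew = df.skew()
for col, s in skew.items():
    if abs(s) > 1:
        label = "highly skewed"
    elif abs(s) >= 0.5:
        label = "moderately skewed"
    else:
        label = "approximately symmetric"
    print(f"{col}: {s:.2f} ({label})")
```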
Insights -
From the output above we found that most of the columns have null values.
Graduation_Specialization, University_Grad, Passing_Year_Of_Graduation,
PG_Specialization, University_PG, Passing_Year_Of_PG, PHD_Specialization,
University_PHD, and Passing_Year_Of_PHD show a pattern in their null values: these
applicants may not hold a UG, PG, or PhD degree, or the data is simply unavailable. We
will impute these missing values with a suitable imputation method later on.
Total_Experience
array([ 0, 23, 21, 15, 10, 16, 1, 19, 8, 13, 7, 12, 20, 4, 14, 17, 22, 3, 5, 24, 2, 25, 9, 6, 11, 18])
Total_Experience_in_field_applied
array([ 0, 14, 12, 8, 5, 3, 1, 11, 7, 15, 10, 9, 4, 6, 2, 20, 16,25, 13, 19, 21, 22, 23, 17, 18, 24])
Department
Role
array([nan, 'Consultant', 'Financial Analyst', 'Project Manager', 'Area Sales
Manager', 'Team Lead', 'Analyst', 'Others', 'CEO', 'Business Analyst', 'Sales
Manager', 'Bio statistician', 'Scientist', 'Research Scientist', 'Head', 'Associate',
'Senior Researcher', 'Sales Execituve', 'Sr. Business Analyst','Principal Analyst',
'Data scientist', 'Researcher', 'Senior Analyst', 'Professor', 'Lab Executuve'],
dtype=object)
Industry
array([nan, 'Analytics', 'Training', 'Aviation', 'Insurance', 'Retail', 'FMCG', 'Others', 'Telecom',
'Automobile', 'IT', 'BFSI'], dtype=object)
Organization
array([nan, 'H', 'J', 'F', 'E', 'G', 'L', 'M', 'O', 'D', 'N', 'A', 'B', 'I', 'K', 'P', 'C'], dtype=object)
Designation
array([nan, 'HR', 'Medical Officer', 'Director', 'Marketing Manager', 'Manager', 'Product
Manager', 'Consultant', 'CA','Research Scientist', 'Sr.Manager', 'Data Analyst', 'Assistant
Manager', 'Others', 'Web Designer', 'Research Analyst', 'Software Developer', 'Network
Engineer', 'Scientist'], dtype=object)
Highest Education
Graduation_Specialization
University_Grad
array(['Lucknow', 'Surat', 'Jaipur', 'Bangalore', 'Mumbai', 'Delhi', 'Mangalore', nan,
'Nagpur', 'Kolkata', 'Ahmedabad', 'Guwahati', 'Pune', 'Bhubaneswar'], dtype=object)
Passing_Year_Of_Graduation
array([2020., 1988., 1990., 1997., 2004., 1998., 2011., 2001., 2003.,2000., nan, 2012., 2002.,
2016., 2013., 1999., 1993., 2009., 1989., 1991., 2008., 2005., 2018., 1992., 1996., 2010., 2019.,
1986., 2007., 2015., 1995., 2006., 2014., 1987., 2017., 1994.])
PG_Specialization
array([nan, 'Others', 'Zoology', 'Chemistry', 'Psychology', 'Mathematics','Engineering', 'Sociology',
'Arts', 'Statistics', 'Economics','Botony'], dtype=object)
University_PG
array([nan, 'Surat', 'Jaipur', 'Bangalore', 'Mumbai', 'Delhi', 'Mangalore', 'Nagpur', 'Kolkata',
'Lucknow', 'Ahmedabad', 'Guwahati', 'Pune', 'Bhubaneswar'], dtype=object)
Passing_Year_Of_PG
array([ nan, 1990., 1992., 1999., 2006., 2000., 2013., 2005., 2002., 2014., 2004., 2009., 2017., 2001.,
1995., 2011., 1991., 1993.,2003., 2007., 2010., 1994., 2020., 2016., 1998., 2012., 2022.,1988., 2019., 2018.,
1997., 2008., 2015., 1989., 2021., 1996.,2023.])
Current_location
array(['Guwahati', 'Bangalore', 'Ahmedabad', 'Kanpur', 'Pune', 'Delhi','Surat', 'Nagpur', 'Jaipur',
'Kolkata', 'Bhubaneswar', 'Mangalore', 'Mumbai', 'Lucknow', 'Chennai'], dtype=object)
Preferred_location
Current_CTC
array([ 0, 2702664, 2236661, ..., 1681796, 3311090, 935897])
Inhand_Offer
array(['N', 'Y'], dtype=object)
Last_Appraisal_Rating
array([nan, 'Key_Performer', 'C', 'B', 'A', 'D'], dtype=object)
International_degree_any
array([0, 1])
Expected_CTC
array([ 384551, 3783729, 3131325, ..., 1934065, 4370638, 1216666])
Insights
There are no anomalies present in the dataset, but many columns contain NaN values.
Insights -
There are 12 Department values present in the data: 'Marketing', 'Analytics/BI',
'Healthcare', 'Others', 'Sales', 'HR', 'Banking', 'Education', 'Engineering',
'Top Management', 'Accounts', and 'IT-Software'.
Data Visualization
A histogram takes a numeric variable as input. The variable is cut into several bins,
and the number of observations per bin is represented by the height of the bar. The
distributions of several variables can be shown on the same axis using this technique.
A box plot gives a concise summary of one or several numeric variables. The line that
divides the box into two parts represents the median of the data. The ends of the box
show the upper and lower quartiles, and the whiskers show the highest and lowest values
excluding outliers.
Insights -
Total_Experience_in_field_applied (past work experience relevant to the job applied for)
ranges from a minimum of 0 to a maximum of 25 years. The average
Total_Experience_in_field_applied is around 6.25.
Pie Chart :
A pie chart is a circle divided into sectors that each represent a proportion of the
whole. It is often used to show proportions, where the sum of the sectors equals 100%.
Bivariate Analysis :
Scatter Plot :
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two
different numeric variables. The position of each dot on the horizontal and vertical axis
indicates values for an individual data point. Scatter plots are used to observe
relationships between variables.
Insights -
Expected_CTC does vary by Department, as expected; this conclusion can be drawn from the
graphical plots.
Applicants for Top Management have a higher median Expected_CTC than the others.
The distribution of Expected_CTC is widest for Marketing department applicants.
Banking, Sales, Engineering, and Others have almost equal medians and very similar
distributions.
Insights -
The median Expected_CTC values of CEOs and Research Scientists are quite high compared to
the other roles, with a wider distribution for Research Scientists.
The median Expected_CTC values of Business Analyst, Sales Manager, and Bio-Statistician
are almost equal to each other.
Insights -
There is little variation in the distribution of Expected_CTC across Industry; the
distributions look almost identical and the median values are nearly equal.
Insights -
There is little variation in the distribution of Expected_CTC across Organization; the
distributions look almost identical and the median values are nearly equal.
Insights -
The median Expected_CTC of Research Scientists is quite high compared to the others.
Marketing Manager, Manager, Product Manager, and HR have almost equal median Expected_CTC
values.
Similarly, Data Analyst, Assistant Manager, Others, Web Designer, and Research Analyst
have equal median Expected_CTC values.
Insights -
The box plot for Doctorate shows a higher median Expected_CTC than the others.
Insights -
There is little variation in the distribution of Expected_CTC across
Graduation_Specialization; the distributions look almost identical and the medians are
nearly equal.
Insights -
There is little variation in the distribution of Expected_CTC across University_Grad; the
distributions look almost identical and the medians are nearly equal.
Insights -
We infer that Expected_CTC is lowest for recently graduated applicants.
Insights -
There is little variation in the distribution of Expected_CTC across PG_Specialization;
the distributions look almost identical and the medians are nearly equal.
Insights -
Expected_CTC does vary with Passing_Year_Of_PG, as expected; this conclusion can be drawn
from the plot above.
We infer that Expected_CTC for recent postgraduates is higher than for applicants who
completed their post-graduation in the early 1990s.
Insights -
Expected_CTC does not vary with PHD_Specialization, as expected; this conclusion can be
drawn from the plot above.
Insights -
There is little variation in the distribution of Expected_CTC across University_PHD; the
distributions look almost identical and the medians are nearly equal.
Insights -
Expected_CTC does vary with Passing_Year_Of_PHD, as expected; this conclusion can be
drawn from the plot above.
We infer that Expected_CTC for applicants with recent PhDs is lower than for applicants
who completed their PhDs in the early 1990s and 2000s.
Insights -
There is little variation in the distribution of Expected_CTC across Current_location;
the distributions look almost identical and the medians are nearly equal.
Insights -
There is little variation in the distribution of Expected_CTC across Preferred_location;
the distributions look almost identical and the medians are nearly equal.
Insights -
The distributions of Expected_CTC for applicants with and without an offer in hand are
almost similar, but the median for applicants with an in-hand offer is slightly higher.
Insights -
Median Expected_CTC values for applicants with Last_Appraisal_Rating B and D are almost equal.
Multivariate Analysis :
Heat-map :
Insights -
Total_Experience has strong correlations with Current_CTC and Expected_CTC (0.85 and 0.82
respectively).
Otherwise there is no serious multicollinearity; only a few features are strongly
correlated with each other, and among those we can select whichever best suits the domain.
Pairplot :
A pairplot shows the relationships between variables as scatter plots and the
distribution of each variable as a histogram.
Insights -
Total_Experience and Expected_CTC show a strong relationship: as Total_Experience
(independent variable) increases, Expected_CTC (target variable) also increases.
Total_Experience_in_field_applied and Expected_CTC show some relationship: as
Total_Experience_in_field_applied increases, Expected_CTC increases slightly.
Passing_Year_Of_Graduation and Expected_CTC show some relationship: as
Passing_Year_Of_Graduation increases, Expected_CTC decreases.
Passing_Year_Of_PHD and Expected_CTC show a negative relationship: as
Passing_Year_Of_PHD increases, Expected_CTC decreases.
Current_CTC and Expected_CTC show a strong relationship: as Current_CTC increases,
Expected_CTC increases.
There is some relationship between No_Of_Companies_worked and Expected_CTC: as
No_Of_Companies_worked increases, Expected_CTC also increases somewhat.
Percentage_Relevant_Exp_in_Field and Expected_CTC show no relationship; the data points
are scattered across the plane.
Insights -
Looking at the box plots, only the Total_Experience_in_field_applied, Certifications, and
International_degree_any variables have a few outliers; the others have none.
Observation -
From the results above, Industry and Organization have the same number of missing values
(908), which indicates that for these applicants the Industry and Organization data is
unknown.
Role and Department also have null values, indicating that this data is likewise unknown
for those applicants.
Practice -
For the numerical features, we will impute Passing_Year_Of_Graduation with the median,
and impute the missing values of Passing_Year_Of_PG and Passing_Year_Of_PHD with 0 using
the fillna() function, as these applicants might not have PG/PhD education.
For the categorical features, we will use fillna() to impute an 'unknown' label in place
of the null values. We cannot impute with the mode here, because the missing values
follow a pattern (the data is genuinely absent), and mode imputation would hurt model
performance; labelling them 'unknown' is the better practice for this problem. In the
next milestone, when we encode these features for model building, it will also be easy to
club these labels together or encode them using target (mean) encoding.
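A minimal sketch of this imputation strategy with pandas fillna(); the toy frame and its values are illustrative stand-ins for the real columns:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the report's columns (values are illustrative).
df = pd.DataFrame({
    "Passing_Year_Of_Graduation": [2005.0, np.nan, 2010.0, 2001.0],
    "Passing_Year_Of_PG": [2008.0, np.nan, np.nan, 2004.0],
    "Department": ["HR", np.nan, "Sales", np.nan],
})

# Numeric: graduation year -> median; PG year -> 0 (applicant may have no PG degree).
df["Passing_Year_Of_Graduation"] = df["Passing_Year_Of_Graduation"].fillna(
    df["Passing_Year_Of_Graduation"].median())
df["Passing_Year_Of_PG"] = df["Passing_Year_Of_PG"].fillna(0)

# Categorical: a dedicated 'unknown' label instead of the mode,
# because the missingness itself is informative.
df["Department"] = df["Department"].fillna("unknown")

print(df.isnull().sum().sum())  # → 0
```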
Conclusion -
We successfully imputed all the null values in the dataset with values suitable to the
context of the business problem. The dataset now has no null values.
Observations -
On the basis of Pearson's correlation feature-selection method, we find that these 4 features
{'Current_CTC',
'Passing_Year_Of_Graduation',
'Passing_Year_Of_PHD',
'Total_Experience_in_field_applied'}
are correlated with other variables, so we can remove them, or compare them with the
variables they are correlated with and drop whichever makes sense given the
business/domain context. This method also reveals multicollinearity, so we can drop
features that are highly correlated. This is our basic approach to feature selection; in
the next milestone, when we build the linear regression model, we can also pick features
using the p-values in the OLS summary, or use scikit-learn's automatic feature-selection
functions. For now this suffices, as we have not yet built our base model: in this
exercise we only need to explore the data, do pre-processing, treat missing values, and
visualise the data for insights.
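A hedged sketch of this Pearson-correlation screening; the synthetic columns below merely imitate a highly correlated pair like Total_Experience and Current_CTC, and the 0.8 threshold is an illustrative choice:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: Current_CTC is built to correlate strongly with
# Total_Experience, mimicking the pattern seen in the heat-map.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "Total_Experience": base,
    "Current_CTC": 0.9 * base + rng.normal(scale=0.1, size=200),
    "Certifications": rng.normal(size=200),
})

# Keep only the upper triangle of |corr| so each pair is counted once,
# then flag any column correlated above 0.8 with an earlier column.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_review = [col for col in upper.columns if (upper[col] > 0.8).any()]
print(to_review)
```

Flagged columns are candidates for dropping, to be weighed against domain knowledge rather than removed automatically.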
From the box-plot visuals above we saw that many of the categorical features (Industry,
Organization, Graduation_Specialization, University_Grad, PG_Specialization,
University_PG, PHD_Specialization, University_PHD, Current_location, and
Preferred_location) show no variation with the target variable (Expected_CTC) and have no
specific relationship with it, so we decided to drop them: they will have no impact on
the model, and it is good practice to remove variables that neither relate to the target
nor help predict it. Doing so also reduces the dimensionality of the dataset. We are left
with 6 categorical features (Department, Role, Designation, Highest Education,
Inhand_Offer, and Last_Appraisal_Rating), which we will use for model building.
Outlier treatment -
Variable transformation -
As the skewness table showed, our target column Expected_CTC has a skewness value of
about 0.33, and the histogram and boxplot confirm it is approximately normally
distributed. For now we therefore assume the data is normal; once we build the first
model and check its performance, we will apply a transformation if needed.
Is the data unbalanced? If so, what can be done? Please explain in the context of
the business -
In the given dataset we are left with 6 categorical features: Department, Role,
Designation, Highest Education, Inhand_Offer, and Last_Appraisal_Rating. We combine the
sub-levels of these categorical features, since they have a large number of labels, and
after combining the labels we apply label encoding to them.
Label encoding refers to converting the labels into numeric form so that they become
machine-readable. Machine learning algorithms can then better decide how those labels
should be used. It is an important pre-processing step for structured datasets in
supervised learning.
"Department"
Tab: 37 Value Counts for Categorical Feature (Department) Fig: 56 Box- Plot of Department Vs Expected_CTC
Note -
As the figure above shows, the median Expected_CTC values of HR, Banking, Sales,
Engineering, Others, Analytics/BI, Healthcare, IT-Software, and Marketing are almost the
same, so we combine them into one level named mid_level_dept. Top Management applicants
have a higher median than the others, so we name that level top_level_dept. Education and
Accounts have nearby medians, so we combine them as low_level_dept, and unknown has the
lowest median of all, so we name it very_low_dept. We combine the levels first and then
apply label encoding.
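The grouping-then-encoding step can be sketched as follows; the tier names come from the text above, while the small frame and the numeric codes are illustrative assumptions (ordered by the observed medians):

```python
import pandas as pd

# Tier names follow the grouping described above; the numeric codes are
# an assumption, ordered by the observed median Expected_CTC.
dept_group = {
    "HR": "mid_level_dept", "Banking": "mid_level_dept",
    "Sales": "mid_level_dept", "Engineering": "mid_level_dept",
    "Others": "mid_level_dept", "Analytics/BI": "mid_level_dept",
    "Healthcare": "mid_level_dept", "IT-Software": "mid_level_dept",
    "Marketing": "mid_level_dept",
    "Top Management": "top_level_dept",
    "Education": "low_level_dept", "Accounts": "low_level_dept",
    "unknown": "very_low_dept",
}
tier_code = {"very_low_dept": 0, "low_level_dept": 1,
             "mid_level_dept": 2, "top_level_dept": 3}

df = pd.DataFrame({"Department": ["HR", "Top Management", "Accounts", "unknown"]})
df["Department"] = df["Department"].map(dept_group).map(tier_code)
print(df["Department"].tolist())  # → [2, 3, 1, 0]
```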
"Role"
Note -
As the figure above shows, the median Expected_CTC values of CEO, Research Scientist,
Head, Area Sales Manager, Senior Business Analyst, Senior Researcher, and Senior Analyst
are quite close, so we combine them and name them top_level_roles. Consultant, Financial
Analyst, Project Manager, Team Lead, Analyst, Others, Business Analyst, Sales Manager,
Bio statistician, Scientist, Sales Executive, Data Scientist, Researcher, and Lab
Executive have nearby medians, so we combine them as mid_level_roles. Principal Analyst
has a lower median than the top and mid levels, so we name it mid_low_level_roles.
Associate and Professor have lower medians than all the others, so we combine them as
low_level_roles. Finally, unknown has the lowest median of all, so we name it
extremely_low_level_roles.
"Designation"
Note -
As the figure above shows, the median Expected_CTC of Research Scientist is high compared
to all the others, so we name it top_designation. HR, Marketing Manager, Director,
Manager, Product Manager, Consultant, CA, Sr. Manager, Data Analyst, Assistant Manager,
Others, Web Designer, Research Analyst, and Software Developer have almost equal medians,
so we combine them as mid_designation. Medical Officer and Network Engineer have nearly
equal medians, so we combine them as mid_low_designation. Unknown gets low_designation,
as its median is lower than the above two groups. Finally, Scientist has the lowest
median, so we name it extremely_low_designation.
"Highest_Education"
Note -
Here we apply ordinal label encoding: we simply encode Under_Grad as 0, Grad as 1, PG as
2, and Doctorate as 3, following their natural order.
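A minimal sketch of this ordinal encoding with a pandas map (the sample values are illustrative):

```python
import pandas as pd

# Education levels encoded in their natural order, as described above.
edu_order = {"Under_Grad": 0, "Grad": 1, "PG": 2, "Doctorate": 3}

df = pd.DataFrame({"Highest Education": ["Grad", "Doctorate", "Under_Grad", "PG"]})
df["Highest Education"] = df["Highest Education"].map(edu_order)
print(df["Highest Education"].tolist())  # → [1, 3, 0, 2]
```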
"Inhand_Offer"
Note-
"Last_Appraisal_Rating"
Note -
Since C and D have the same median Expected_CTC, we combine C and D and name the merged
level C.
The info() function prints a concise summary of a DataFrame, including the index dtype,
column dtypes, non-null counts, and memory usage.
Insights -
From the results above we can see that there are no null values present in the dataset.
There are 25,000 rows and 18 columns, indexed from 0 to 24999. Of the 18 variables, 4 are
float64 and 14 are int64 dtypes. Memory used by the dataset: 3.4 MB.
Note -
All features are now in numerical form, and we can use them to build various machine
learning regression models to predict the Expected_CTC of an applicant.
Model Building
In this model-building exercise we will build different regression models (Linear
Regression, XGBoost Regressor, Decision Tree, Random Forest, and ANN Regressor) to
predict the Expected_CTC of applicants applying for different roles at Delta Ltd. The
main objective is to provide salary estimates at the time of joining based on job title,
location, years of experience, and skill profile, so as to minimise human judgment about
the salary to be offered. It is imperative to offer each employee an unbiased salary that
he or she truly deserves and that is appropriate to market demands.
Note:
Before proceeding to model building we need to split the dataset into train and test
sets. We then fit the supervised regression algorithms on the training set and check
their predictions on the test set. The train-test split is performed below.
The train-test split is a technique for evaluating the performance of a machine learning
algorithm. It can be used for classification or regression problems and for any
supervised learning algorithm. The procedure involves taking a dataset and dividing it
into two subsets.
In the given problem we are advised to split the training and testing data in the ratio
70:30. We split the data into x_train, x_test, train_labels, and test_labels using the
train_test_split() function from the scikit-learn library, taking 70% of the data for
training and 30% for testing.
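The split can be sketched with scikit-learn's train_test_split; the arrays below are synthetic stand-ins for the prepared feature matrix and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the prepared feature matrix and target.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# 70:30 split as advised; random_state fixes the shuffle for reproducibility.
x_train, x_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(x_train.shape, x_test.shape)  # → (35, 2) (15, 2)
```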
In statistics, linear regression is a linear approach for modelling the relationship
between a scalar response and one or more explanatory variables (also known as dependent
and independent variables). The case of one explanatory variable is called simple linear
regression; for more than one, the process is called multiple linear regression. This
term is distinct from multivariate linear regression, where multiple correlated dependent
variables are predicted rather than a single scalar variable.
In linear regression, the relationships are modelled using linear predictor functions
whose unknown model parameters are estimated from the data. Such models are called linear
models. Most commonly, the conditional mean of the response given the values of the
explanatory variables (or predictors) is assumed to be an affine function of those
values; less commonly, the conditional median or some other quantile is used. Like all
forms of regression analysis, linear regression focuses on the conditional probability
distribution of the response given the values of the predictors, rather than on the joint
probability distribution of all of these variables, which is the domain of multivariate
analysis. Linear regression was the first type of regression analysis to be studied
rigorously and to be used extensively in practical applications, because models that
depend linearly on their unknown parameters are easier to fit than models that are
non-linearly related to their parameters, and because the statistical properties of the
resulting estimators are easier to determine.
Linear regression has many practical uses. Most applications fall into one of two broad
categories:
If the goal is prediction, forecasting, or error reduction, linear regression can be used
to fit a predictive model to an observed data set of values of the response and
explanatory variables. After developing such a model, if additional values of the
explanatory variables are collected without an accompanying response value, the fitted
model can be used to make a prediction of the response.
If the goal is to explain variation in the response variable that can be attributed to
variation in the explanatory variables, linear regression analysis can be applied to
quantify the strength of the relationship between the response and the explanatory
variables, and in particular to determine whether some explanatory variables may have no
linear relationship with the response at all, or to identify which subsets of explanatory
variables may contain redundant information about the response.
Linear regression models are often fitted using the least squares approach, but they may
also be fitted in other ways, such as by minimising the "lack of fit" in some other norm
(as with least absolute deviations regression), or by minimising a penalised version of
the least squares cost function as in ridge regression (L2-norm penalty) and lasso
(L1-norm penalty). Conversely, the least squares approach can be used to fit models that
are not linear models. Thus, although the terms "least squares" and "linear model" are
closely linked, they are not synonymous.
Note :
A linear regression model describes the relationship between a dependent variable, y, and one or
more independent variables, X. The dependent variable is also called the response variable.
Continuous predictor variables are also called covariates, and categorical predictor variables are
also called factors. Linear regression analysis is used to predict the value of a variable based on the
value of one or more other variables; the variable you want to predict is called the dependent variable.
Let's start with building a linear model. Instead of simple linear regression, where you have
one predictor and one outcome, we will go with multiple linear regression, where you have
more than one predictor and one outcome:
yi = β0 + β1xi1 + β2xi2 + … + βpxip + ϵi
Where:
yi is the dependent (predicted) variable.
β0 is the y-intercept, i.e., the value of y when all predictors xi1 … xip are 0.
β1 and β2 are the regression coefficients representing the change in y for a one-unit change
in xi1 and xi2, respectively.
βp is the slope coefficient for the p-th independent variable.
ϵ is the model's random error (residual) term.
The coefficients in this linear equation denote the magnitude of additive relation between the
predictor and the response. In simpler words, keeping everything else fixed, a unit change in x1 will
lead to change of β1 in the outcome, and so on.
Invoke the LinearRegression class (from sklearn.linear_model import LinearRegression), fit it on the
train data, and build the linear regression model. In this problem we are advised to build several
linear regression models, check the performance of the predictions on the train and test sets using
R-square, adjusted R-square and RMSE, and finally compare these models and select the best one.
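The fit-and-evaluate workflow described above can be sketched as follows. The applicant dataframe is not reproduced here, so synthetic data stands in for the features and the Expected_CTC target; the coefficients and sizes are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in for the applicant features and target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

lr = LinearRegression().fit(X_train, y_train)   # fit on the train split only
pred = lr.predict(X_test)

r2 = r2_score(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5  # root mean squared error
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # adjusted R-square
```

The same three numbers (R-square, adjusted R-square, RMSE) are what the comparison tables later in the report are built from.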
R-square is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for multiple
regression. A value of 100% indicates that the model explains all the variability of the response data
around its mean; an R-square value of 0.989 would indicate that 98.9% of the variance of the
dependent variable being studied is explained by the independent variables.
RMSE - The root mean square error (RMSE) for a regression model is analogous to the standard
deviation (SD): where the SD estimates the deviation of observations from the sample mean, the
RMSE estimates the deviation of the actual y-values from the regression line. It tells us the average
distance between the model's predicted values and the actual values in the dataset; the lower the
RMSE, the better a given model fits the data.
Explore the coefficients for each of the independent attributes of train data.
The coefficients in this linear equation denote the magnitude of additive relation between the
predictor and the response. In simpler words, keeping everything else fixed, a unit change in x1
will lead to change of β1 in the outcome, and so on.
The regression constant is also known as the intercept; regression models without predictors are
therefore called intercept-only models. The intercept for this model is 7.951914e+06.
OLS (ordinary least squares) is a common technique for fitting linear regression. In brief, it measures
the amount of error produced by comparing the difference between individual points in the data set
and the predicted best-fit line. The ols() function requires two inputs: the formula for producing the
best-fit line, and the dataset.
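To make the least-squares idea concrete, here is a minimal numpy sketch on synthetic data (the true slope 2.5 and intercept 4.0 are made up for illustration):

```python
import numpy as np

# OLS chooses coefficients that minimize the sum of squared residuals
# between observed y and the fitted line.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.5 * x + 4.0 + rng.normal(scale=0.5, size=200)

X = np.column_stack([np.ones_like(x), x])     # add an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # closed-form least squares
intercept, slope = beta

residuals = y - X @ beta  # the per-point "error" that OLS minimizes
```

The recovered intercept and slope land close to the true 4.0 and 2.5, with the residuals carrying only the injected noise.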
Insights :
Fig: 63 Prediction on Train Data Model 1 (Scatter Plot Showing Distribution of Actual y & Predicted y)
Explore the coefficients for each of the independent attributes of Test data.
The regression constant is also known as the intercept; regression models without predictors are
therefore called intercept-only models. The intercept for this model is 8.289336e+06.
Insights :
Fig: 64 Prediction on Test Data Model 1 (Scatter Plot Showing Distribution of Actual y & Predicted y)
Explore the coefficients for each of the independent attributes of train data.
The regression constant is also known as the intercept; regression models without predictors are
therefore called intercept-only models. The intercept for this model is -1.734723e-18, i.e. essentially
zero, which is consistent with the z-score scaling applied to the variables in this model.
Insights :
Fig: 65 Prediction on Train Data Model 2 (Scatter Plot Showing Distribution of Actual y & Predicted y)
Explore the coefficients for each of the independent attributes of test data.
The regression constant is also known as the intercept; regression models without predictors are
therefore called intercept-only models. The intercept for this model is 1.942890e-16, again essentially
zero, which is consistent with the z-score scaling applied to the variables.
Insights :
Fig: 66 Prediction on Test Data Model 2 (Scatter Plot Showing Distribution of Actual y & Predicted y)
Conclusion :
Tab : 57 Comparison Table of Linear Regression Model 1 & 2 on Train and Test Data
As the above results show, Linear Regression Model 1 and Model 2 exhibit no overfitting or
underfitting, and the R-square and adjusted R-square values are good, and nearly the same, for
both models.
The objective function of XGBoost contains a loss function and a regularization term. The loss
function measures the difference between actual and predicted values, i.e. how far the model's
results are from the real values. The most common loss function in XGBoost for regression problems
is reg:squarederror (historically named reg:linear), and for binary classification it is reg:logistic.
Ensemble learning involves training and combining individual models (known as base learners) to
get a single prediction, and XGBoost is one such ensemble method. It is a boosting technique: base
learners (typically shallow trees) are added sequentially, with each new learner fitted to correct the
residual errors of the ensemble built so far, so that the combined prediction is far better than any
individual learner's.
Results :
Conclusion :
As the above results show, the XGBoost Regressor (Model 3) exhibits no overfitting or underfitting,
and the R-square and adjusted R-square values are good on both train and test. We also obtain a
lower RMSE on train and test than with the Linear Regression models above.
Model - 4 Building 3 models using Decision Tree, Random Forest and ANN Regressor
A decision tree is a tree-structured model with three types of nodes. The Root Node is the initial
node, which represents the entire sample and may get split into further nodes. The Interior
Nodes represent the features of the data set, and the branches represent the decision rules. Finally,
the Leaf Nodes represent the outcome. This algorithm is very useful for solving decision-related
problems.
A particular data point is run through the entire tree by answering True/False questions until it
reaches a leaf node. The final prediction is the average of the values of the dependent variable in
that leaf node. Through successive splits, the tree is able to predict an appropriate value for the
data point.
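The leaf-averaging behaviour can be seen in a tiny sketch (the data points are made up; a depth-1 tree has exactly one True/False split and two leaves):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two well-separated clusters: the best split falls between x=3 and x=10
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([5.0, 6.0, 7.0, 50.0, 52.0, 54.0])

tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

# Each leaf predicts the mean target of the training rows that fall in it
pred_low = tree.predict([[2.5]])[0]    # left leaf -> mean(5, 6, 7)
pred_high = tree.predict([[11.0]])[0]  # right leaf -> mean(50, 52, 54)
```

The left leaf returns 6.0 and the right leaf 52.0, i.e. exactly the mean of the dependent variable within each leaf, as described above.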
A random forest is a meta-estimator that fits a number of decision tree regressors on various
sub-samples of the dataset and uses averaging to improve the predictive accuracy and control
over-fitting.
It builds n decision tree regressors (estimators). The number of estimators n defaults to 100 in
scikit-learn (the machine learning Python library), where it is called n_estimators. The trees are built
following the specified hyperparameters (e.g. minimum number of samples at the leaf nodes,
maximum depth a tree can grow, etc.).
Average prediction across estimators. Each decision tree regression predicts a number as an output
for a given input. Random forest regression takes the average of those predictions as its ‘final’ output.
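The averaging step can be verified directly. A minimal sketch on synthetic data, querying each fitted tree and comparing with the forest's output:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; sizes and coefficients are illustrative
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=300)

# n_estimators defaults to 100; the forest averages the trees' outputs
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=2,
                           max_depth=8, random_state=0).fit(X, y)

point = X[:1]
per_tree = np.array([t.predict(point)[0] for t in rf.estimators_])
forest_pred = rf.predict(point)[0]
# forest_pred equals the mean of the individual trees' predictions
```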
CAPSTONE FINAL REPORT PAGE NO : 68
ANN/MLP Regressor
Regression ANNs predict an output variable as a function of the inputs. The input features
(independent variables) can be categorical or numeric, but a regression ANN requires a numeric
dependent variable. Since we have to predict a continuous variable, Expected_CTC, this algorithm
is applicable here.
MLP Regressor trains iteratively since at each time step the partial derivatives of the loss function
with respect to the model parameters are computed to update the parameters. It can also have a
regularization term added to the loss function that shrinks model parameters to prevent overfitting.
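A minimal MLPRegressor sketch on synthetic data; the hidden-layer sizes and the alpha (L2) penalty here are illustrative, not the report's tuned values. Note the input scaling, which matters for gradient-based training:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Smooth synthetic target for the network to learn
rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(400, 2))
y = 2 * X[:, 0] + np.sin(X[:, 1])

Xs = StandardScaler().fit_transform(X)  # z-score the inputs first

# alpha is the L2 regularization term added to the loss function
mlp = MLPRegressor(hidden_layer_sizes=(32, 16), alpha=1e-4,
                   max_iter=2000, random_state=0).fit(Xs, y)
score = mlp.score(Xs, y)  # R-square of the fitted network
```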
Tab : 59 Result of ANN, Decision Tree, Random Forest Regressor Model on Train and Test Data
Conclusion :
As the above results show, the Decision Tree, Random Forest and ANN regressors exhibit no
overfitting or underfitting, and the model scores are good on both train and test. The RMSE on
train and test data is also good.
From the above results we clearly infer that no model has issues of overfitting or underfitting; we
have good accuracy scores and good R-square, adjusted R-square and RMSE on the train and test
sets. We will nevertheless perform hyper-parameter tuning on the Decision Tree, Random Forest
and ANN regressors and check their accuracy and RMSE on the train and test sets to see whether
it has any impact.
Model 5 - Hyper-parameter Tuning for Decision Tree / Random Forest / ANN Regressor
We use grid search, which builds a model for every combination of the specified hyper-parameters
and evaluates each one. A more efficient technique for hyper-parameter tuning is randomized
search, where random combinations of the hyper-parameters are tried to find the best solution.
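A sketch of the grid-search step and of feeding best_params_ into the final regressor; the grid shown here is illustrative, not the report's actual grid:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
rng = np.random.default_rng(11)
X = rng.normal(size=(400, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=400)

# Grid search fits one model per combination (3 x 3 = 9 here), each
# evaluated with 5-fold cross-validation
param_grid = {"max_depth": [3, 5, 8],
              "min_samples_leaf": [2, 5, 10]}

grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid, cv=5, scoring="r2").fit(X, y)

best = grid.best_params_  # the winning combination
tuned = DecisionTreeRegressor(random_state=0, **best).fit(X, y)
```

RandomizedSearchCV has the same interface but samples a fixed number of random combinations instead of trying them all.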
In line with industry practice, we take various hyper-parameters to build our Decision Tree,
Random Forest and ANN regressor models; the hyper-parameters are listed below.
Note : From GridSearchCV we obtain best_params_; we use these parameters to build our
Decision Tree Regressor.
Note : From GridSearchCV we obtain best_params_; we use these parameters to build our
Random Forest Regressor.
Note : From GridSearchCV we obtain best_params_; we use these parameters to build our
ANN Regressor model.
Results Tuned Decision Tree, Tuned Random Forest and Tuned ANN Regressor
Tab : 60 Result of Tuned ANN, Tuned Decision Tree, Tuned Random Forest Regressor Model on Train and Test Data
Conclusion :
As the above results show, the tuned Decision Tree, tuned Random Forest and tuned ANN
regressors exhibit no overfitting or underfitting; the model scores are good on both train and test
and almost the same as the untuned results. The RMSE on train and test data is likewise almost
unchanged compared with the results above.
For model validation we did a train-test split, trained the models on the train data and validated
them on the test set. For model performance we checked the accuracy metrics and the R-square,
adjusted R-square and RMSE values of all the models. We built various regression models: Linear
Regression, Linear Regression with z-score scaling, XGBoost Regressor, Decision Tree Regressor,
Random Forest Regressor and ANN Regressor; we also performed hyper-parameter tuning of the
Decision Tree, Random Forest and ANN, and calculated the R-square and RMSE on train and test
data for every model. Now we compare all the models built during the model-building exercise
and choose our generalised model for deployment, i.e. the model with the highest R-square /
model score and the lowest RMSE on train and test data.
Conclusion :
As the above results show, none of the regression models has issues of overfitting or underfitting,
and the model scores are good, and almost the same, for all models on train and test. On comparing
the model scores of the linear regression models with the other regression models, the tuned
Random Forest Regressor appears to be the optimum model: it gives the lowest RMSE on train and
test and a very good model score of 0.99 on both.
We could also use the XGBoost Regressor, which likewise has a very good model score and the
smallest difference between train and test RMSE, and is a powerful technique in wide use today.
But given the data associated with this problem, we finalise the tuned Random Forest Regressor as
the generalised model for deployment. Random forests can perform both regression and
classification tasks, produce good predictions that can be understood easily, handle large datasets
efficiently, and provide a higher level of accuracy in predicting outcomes than a single decision tree.
The top attributes influencing Expected_CTC are: Total_Experience_in_field_applied,
Highest_Education, Inhand_offer, Last_Appraisal_Rating, Role, Designation, Current_CTC,
International_degree_any and No_Of_Companies_worked.
Most applicants are from the Marketing department (2379), followed by Analytics/BI (2096),
Healthcare (2062) and Others (2041).
The fewest applicants belong to the IT-Software department (1078).
Only 25 applicants worked as Lab Executive (Role).
The majority of applicants worked in the Training industry.
There is not much variation in the Organization column: roughly equal numbers of applicants
worked in each of the 16 different organizations.
CAPSTONE FINAL REPORT PAGE NO : 71
When Total_Experience increases by 1 unit, Expected_CTC decreases by 5708.37 units, keeping all
other predictors constant.
Final Recommendations
Based on our analysis, we found that variables such as Organization, Graduation_Specialization, University_Grad,
PG_Specialization, University_PG, PHD_Specialization, University_PHD, Current_location and
Preferred_location show no variation with the target variable (Expected_CTC) and no specific relation to it.
Instead of these variables, we suggest the company collect different information from applicants, such as
Employment_Gap, Marital_Status, No_dependent_in_family and Interview_Test_Scores. Secondly, we saw that
applicants are asked for Passing_Year_of_UG, Passing_Year_of_PG and Passing_Year_of_PHD; instead of asking
for all three, we can ask only for the passing year of the applicant's highest education.
Our analysis showed that applicants who have an in-hand job offer expect a higher CTC, while applicants without
an offer in hand have a lower expected CTC. The company can focus on the latter: if they are a fit for the
company, it should hire them promptly, which helps with cost cutting, since it gets good applicants at a lower
expected CTC.
Recent graduates ask for a lower CTC; if they best fit the positions, the company should hire them and thereby
get good employees at a lower salary.
The company should check whether applicants completed their degrees in the standard duration. Applicants who
have backlogs or an incomplete degree will be offered a lower CTC.
Our analysis showed that fresher applicants have zero total experience and zero current CTC; since these are
important predictors of the target, we need to build a separate model for such applicants.
Applicants with PhD qualifications are the most expensive in terms of CTC, so hire PhD holders only when the
company actually needs them.
Applicants who have worked at a higher number of companies have a higher Expected_CTC. The business should
look for applicants who have worked at fewer companies but are experienced and perform well.
Most applicants' preferred location is Kanpur, while their current location is Bangalore. Bangalore is a tier-1 city
with a higher cost of living than Kanpur, but our analysis showed that Current_location and Preferred_location
play no vital role (we even dropped these variables from the models), so we suggest the company should not offer
a higher or lower CTC to applicants on the basis of location.
The company should follow market salary trends for different industries, roles and designations so that every
applicant gets the unbiased salary he or she truly deserves.
The company should frame new HR strategies that satisfy applicants' demands; this will help it to hire employees
within its desired budget and also to reduce its attrition rate.
Thank You !