• Data Description: The dataset we have includes information about colleges, cities,
job roles, previous experience, and salaries. We’ll use this data to train and test our
predictive model.
• Regression Task: Our main goal is to build a regression model that can predict
salaries based on the provided data. We’re specifically looking to forecast the salary
of newly hired employees.
• Role of Statistics: Statistics will help us build the model and evaluate its accuracy.
We'll use statistical methods to ensure our model is both reliable and effective.
For this task, five regression models are evaluated: Linear Regression, Ridge, Lasso, Decision Tree, and Random Forest. The models are first tested with default parameters, and some are then tuned to see how parameter changes affect performance.
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
# Load datasets
cities = pd.read_csv("C:\\Users\\ADMIN\\Desktop\\Final Pro\\cities.csv")
college = pd.read_csv("C:\\Users\\ADMIN\\Desktop\\Final Pro\\Colleges.csv")
df = pd.read_csv("C:\\Users\\ADMIN\\Desktop\\Final Pro\\ML case Study.csv")
# Overview of Data
df.head()
cities.head()
college.head()
          Tier 1          Tier 2                                      Tier 3
0     IIT Bombay  IIIT Bangalore  Ramaiah Institute of Technology, Bengaluru
1      IIT Delhi      IIIT Delhi                      TIET/Thapar University
2  IIT Kharagpur          IGDTUW                         Manipal Main Campus
3     IIT Madras     NIT Calicut                                 VIT Vellore
4     IIT Kanpur   IIITM Gwalior                             SRM Main Campus
# Extract the "Tier 1", "Tier 2", and "Tier 3" columns of the 'college'
# DataFrame and store them in separate lists for the next step
Tier1 = college['Tier 1'].tolist()
Tier2 = college['Tier 2'].tolist()
Tier3 = college['Tier 3'].tolist()
Tier1
['IIT Bombay',
'IIT Delhi',
'IIT Kharagpur',
'IIT Madras',
'IIT Kanpur',
'IIT Roorkee',
'IIT Guwahati',
'IIIT Hyderabad',
'BITS Pilani (Pilani Campus)',
'IIT Indore',
'IIT Ropar',
'IIT BHU (Varanasi)',
'IIT ISM Dhanbad',
'DTU',
'NSUT Delhi (NSIT)',
'NIT Tiruchipally (Trichy)',
'NIT Warangal',
'NIT Surathkal (Karnataka)',
'Jadavpur University',
'BITS Pilani (Hyderabad Campus)',
'BITS Pilani (Goa Campus)',
'IIIT Allahabad',
nan,
nan,
nan,
nan,
nan,
nan]
df.sample(8)
Missing Values
# Checking for missing values in the data; this is a very important task
df.isna().sum()
College 0
City 0
Previous CTC 0
Previous job change 0
Graduation Marks 0
EXP (Month) 0
CTC 0
Role_Manager 0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1589 entries, 0 to 1588
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 College 1589 non-null int64
1 City 1589 non-null int64
2 Previous CTC 1589 non-null float64
3 Previous job change 1589 non-null int64
4 Graduation Marks 1589 non-null int64
5 EXP (Month) 1589 non-null int64
6 CTC 1589 non-null float64
7 Role_Manager 1589 non-null bool
dtypes: bool(1), float64(2), int64(5)
memory usage: 88.6 KB
sns.boxplot(df['EXP (Month)'])
<Axes: >
sns.boxplot(df['Graduation Marks'])
<Axes: >
sns.boxplot(df['CTC'])
<Axes: >
# Correlation between variables
corr = df.corr()
corr
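A heatmap makes the correlation matrix easier to scan; a minimal sketch:
# Visualize the correlation matrix as an annotated heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()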
plt.figure(figsize=(12, 8))
sns.boxplot(data=df)
plt.title('Boxplot to Detect Outliers')
plt.show()
# 2nd way to show outliers: the IQR rule on "Previous CTC"
percent25, percent75 = df['Previous CTC'].quantile([0.25, 0.75])
iqr = percent75 - percent25
df[(df['Previous CTC'] < percent25 - 1.5*iqr) | (df['Previous CTC'] > percent75 + 1.5*iqr)]
The rows in the DataFrame above are the outliers present in the "Previous CTC" column. As these outliers are not extreme, keeping them should not affect the model much.
percent25 = df['CTC'].quantile(0.25)
percent75 = df['CTC'].quantile(0.75)
iqr = percent75 - percent25
df[(df['CTC'] < percent25 - 1.5*iqr) | (df['CTC'] > percent75 + 1.5*iqr)]
As seen above, there are some outliers in the "CTC" column, but they are not extreme enough to make a large difference to the predictions. Keeping these outliers in the data is therefore more useful than removing them.
• Previous CTC and CTC Outliers: Although there are a few outliers in the "Previous
CTC" and "CTC" columns, these outliers are considered part of the natural variability
of the data and are not expected to negatively affect the model. In fact, they might
provide valuable information for predictions.
# Import DecisionTreeRegressor
from sklearn.tree import DecisionTreeRegressor
# Split data into train and test with test_size=0.2 (random_state assumed for reproducibility)
X, y = df.drop('CTC', axis=1), df['CTC']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_test
1079 74059.06
405 84692.16
1492 75028.75
239 71001.53
610 62426.39
...
1023 67435.46
700 62927.79
486 75143.25
672 60479.67
1303 105077.70
Name: CTC, Length: 318, dtype: float64
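The metric blocks that follow can be produced with a pattern like this minimal sketch; linear_reg matches the name used in the next cell, while the prediction variable linear_predict is an assumption:
# Fit the baseline linear regression and score it on the test split
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
linear_predict = linear_reg.predict(X_test)
print("r2_score:", r2_score(y_test, linear_predict))
print("MAE:", mean_absolute_error(y_test, linear_predict))
print("MSE:", mean_squared_error(y_test, linear_predict))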
# Print the coefficients of the linear regression model
print("Coef:",linear_reg.coef_)
r2_score: 0.5933517958385095
MAE: 7191.23106750003
MSE: 77362774.9495653
r2_score: 0.5926580862926116
MAE: 7198.215276305912
MSE: 77494749.70054282
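Before the tuned prediction below, the Ridge model needs to be created and tuned; a sketch, assuming alpha is the tuned hyperparameter (the grid values are hypothetical):
# Tune alpha with cross-validated grid search, then keep the refit best model
ridge_grid = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1, 10, 100]}, cv=5)
ridge_grid.fit(X_train, y_train)
ridge = ridge_grid.best_estimator_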
# Make predictions on the test data using the tuned Ridge model
ridge_predict_tuned = ridge.predict(X_test)
r2_score: 0.5926580862926116
MAE: 7198.215276305912
MSE: 77494749.70054282
Lasso Regression
# Create Lasso regression with default parameters
lasso = Lasso()
r2_score: 0.5933030911807144
MAE: 7191.7094203244
MSE: 77372040.76567228
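The second metric block below appears to come from a tuned Lasso; a sketch of one plausible tuning step (the grid values are hypothetical):
# Tune the Lasso regularization strength with cross-validation
lasso_grid = GridSearchCV(Lasso(), {'alpha': [0.001, 0.01, 0.1, 1, 10]}, cv=5)
lasso_grid.fit(X_train, y_train)
lasso_tuned = lasso_grid.best_estimator_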
r2_score: 0.5933372007824145
MAE: 7191.374573405911
MSE: 77365551.58555806
r2_score: 0.3246898332595467
MAE: 8368.60254716981
MSE: 128474361.66212043
r2_score: 0.6454858465976178
MAE: 6265.506445283019
MSE: 67444534.08482905
r2_score: 0.6415701247402137
MAE: 6347.4630952991165
MSE: 68189480.46776979
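The grid_search object fit below is not defined earlier; it can be reconstructed from the GridSearchCV repr shown in the output:
# Grid search over Random Forest hyperparameters (reconstructed from the repr below)
param_grid = {'max_features': [4, 5, 6, 7, 8, 9, 10],
              'min_samples_split': [2, 3, 10]}
grid_search = GridSearchCV(estimator=RandomForestRegressor(max_features=5,
                                                           min_samples_split=3,
                                                           n_jobs=-1),
                           param_grid=param_grid, cv=5, n_jobs=-1)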
grid_search.fit(X_train, y_train)
GridSearchCV(cv=5,
estimator=RandomForestRegressor(max_features=5,
min_samples_split=3,
n_jobs=-1),
n_jobs=-1,
param_grid={'max_features': [4, 5, 6, 7, 8, 9, 10],
'min_samples_split': [2, 3, 10]})
# Best parameters
grid_search.best_params_
{'max_features': 4, 'min_samples_split': 3}
r2_score: 0.635607919174224
MAE: 7273.60749691774
MSE: 76141857.81931214
r2_score: 0.6362725365808042
MAE: 7265.304096193648
MSE: 76002982.12266861
r2_score: 0.6363069568715122
MAE: 7264.807871629203
MSE: 75995789.80148739
Decision Tree Regression Model
# Create Decision tree regression with default parameters
dtr = DecisionTreeRegressor()
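Fitting and scoring presumably follow the same pattern as the earlier models; a minimal sketch:
# Fit the default decision tree and evaluate on the test split
dtr.fit(X_train, y_train)
dtr_predict = dtr.predict(X_test)
print("r2_score:", r2_score(y_test, dtr_predict))
print("MAE:", mean_absolute_error(y_test, dtr_predict))
print("MSE:", mean_squared_error(y_test, dtr_predict))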
r2_score: 0.4548846441682135
MAE: 6964.911383647798
MSE: 113905043.77814023
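The higher scores that follow suggest the tree was also tuned; one plausible sketch (the parameter grid is hypothetical):
# Search over depth and split size to curb overfitting
dtr_grid = GridSearchCV(DecisionTreeRegressor(),
                        {'max_depth': [4, 6, 8, 10], 'min_samples_split': [2, 10, 20]},
                        cv=5)
dtr_grid.fit(X_train, y_train)
print(dtr_grid.best_params_)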
r2_score: 0.6708307096456481
MAE: 6555.654893710693
MSE: 68781849.61606885
r2_score: 0.6832970839533041
MAE: 6442.3186504759415
MSE: 66176927.74755628
Performing Feature Scaling
# Split data into independent and target variable
X = df.loc[:, df.columns != 'CTC']
y = df['CTC']
# Split data into train and test with test_size=0.2 (random_state assumed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the scaler on the training data, then apply the same scaler to the
# test data to ensure consistency
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# After scaling, each feature has mean 0 and standard deviation 1
# Fit a linear regression on the scaled training data (step reconstructed)
lr_scaled = LinearRegression()
lr_scaled.fit(X_train_scaled, y_train)
# Print the coefficients of the linear regression model
print("Coef:", lr_scaled.coef_)
r2_score: 0.5933517958385082
MAE: 7191.231067500048
MSE: 77362774.94956557
r2_score: 0.5932150226490303
MAE: 7192.604677281892
MSE: 77388795.36074269
r2_score: 0.5933169519706637
MAE: 7191.560845403443
MSE: 77369403.82011962
r2_score: 0.3169418447735133
MAE: 8296.539811320754
MSE: 129948377.49059999
r2_score: 0.6359160313882743
MAE: 6368.995153773585
MSE: 69265143.2816063
r2_score: 0.6417049648813692
MAE: 6348.821225637416
MSE: 68163827.80931063
Model Performance Comparison
We compared the performance of different machine learning models using the R-squared
(r2_score) metric as the primary measure, along with Mean Absolute Error (MAE) and
Mean Squared Error (MSE) for reference.
Key Takeaways
• Random Forest consistently outperforms the other models in all scenarios,
particularly after scaling the features.
• Linear, Ridge, and Lasso Regression models show very similar performance, with
minor differences in R-squared values.
• Decision Tree tends to underperform relative to the other models, especially with a
larger test size and after scaling.
• Feature Scaling slightly improves the performance of Random Forest and Decision
Tree models but has little impact on the regression models.
Q4. Summary
When comparing the different models, Random Forest stands out as the most reliable
performer across all scenarios, whether feature scaling is applied or not, and regardless of
the test size. It consistently achieves the highest R-squared scores, indicating a strong fit to
the data.
Linear Regression and Lasso also deliver solid results, but they fall just short of Random
Forest in terms of R-squared scores. They still perform well, making them good options
depending on your specific needs.
On the other hand, the Decision Tree model struggles the most, showing the lowest R-
squared scores across the board. It consistently lags behind the other models, making it the
weakest option in these comparisons.
Feature scaling does help in some cases: the tree-based models achieved slightly higher R-squared scores with scaled features, while the linear models were largely unaffected.
In conclusion, if you're prioritizing R-squared as your main measure of success, Random
Forest is the clear winner, followed by Linear Regression and Lasso. However, it's
important to also weigh other factors like how easy the model is to interpret, how much
computational power it requires, and what exactly you're trying to achieve with your
analysis when deciding on the best model for your project.
3. Feature Selection: Not all features contribute equally to the model's performance. By identifying and removing or down-weighting less informative features, you can make the model more efficient and potentially improve its accuracy, as sketched below.
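A minimal sketch of one way to rank features here, using impurity-based importances from a random forest fit just for this purpose (fs_model and its parameters are hypothetical, not the tuned model from above):
# Fit a forest solely to rank features by impurity-based importance
fs_model = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
fs_model.fit(X_train, y_train)
importances = pd.Series(fs_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))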