SLF Project SolutionNotebook
Sep - 2022
Note: This is a sample solution for the project. Projects will NOT be graded on
the basis of how well the submission matches this sample solution. Projects will
be graded on the basis of the rubric only.
Context
Buying and selling used smartphones used to be something that happened on a handful of
online marketplace sites. But the used and refurbished phone market has grown
considerably over the past decade, and a new IDC (International Data Corporation) forecast
predicts that the used phone market will be worth $52.7bn by 2023 with a compound
annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to
an uptick in demand for used smartphones that offer considerable savings compared with
new models.
Refurbished and used devices continue to provide cost-effective alternatives to both
consumers and businesses that are looking to save money when purchasing a smartphone.
There are plenty of other benefits associated with the used smartphone market. Used and
refurbished devices can be sold with warranties and can also be insured with proof of
purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive
offers to customers for refurbished smartphones. Maximizing the longevity of mobile
phones through second-hand trade also reduces their environmental impact and helps in
recycling and reducing waste. The impact of the COVID-19 outbreak may further boost the
cheaper refurbished smartphone segment, as consumers cut back on discretionary
spending and buy phones only for immediate needs.
Objective
The rising potential of this comparatively under-the-radar market fuels the need for an ML-
based solution to develop a dynamic pricing strategy for used and refurbished
smartphones. ReCell, a startup aiming to tap the potential in this market, has hired you as a
data scientist. They want you to analyze the data provided and build a linear regression
model to predict the price of a used phone and identify factors that significantly influence
it.
Data Description
The data contains the different attributes of used/refurbished phones. The detailed data
dictionary is given below.
Data Dictionary
• brand_name: Name of manufacturing brand
• os: OS on which the phone runs
• screen_size: Size of the screen in cm
• 4g: Whether 4G is available or not
• 5g: Whether 5G is available or not
• main_camera_mp: Resolution of the rear camera in megapixels
• selfie_camera_mp: Resolution of the front camera in megapixels
• int_memory: Amount of internal memory (ROM) in GB
• ram: Amount of RAM in GB
• battery: Energy capacity of the phone battery in mAh
• weight: Weight of the phone in grams
• release_year: Year when the phone model was released
• days_used: Number of days the used/refurbished phone has been used
• new_price: Price of a new phone of the same model in euros
• used_price: Price of the used/refurbished phone in euros
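# libraries to read, manipulate, and visualize data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()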
# to compute VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
# loading data
data = pd.read_csv("used_phone_data.csv")
Observations
• The data cover a variety of brands like Samsung, Sony, LG, etc.
• A high percentage of devices seem to be running on Android.
• There are a few missing values in the data.
# let's create a copy of the data to avoid any changes to original data
df = data.copy()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3571 entries, 0 to 3570
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 brand_name 3571 non-null object
1 os 3571 non-null object
2 screen_size 3571 non-null float64
3 4g 3571 non-null object
4 5g 3571 non-null object
5 main_camera_mp 3391 non-null float64
6 selfie_camera_mp 3569 non-null float64
7 int_memory 3561 non-null float64
8 ram 3561 non-null float64
9 battery 3565 non-null float64
10 weight 3564 non-null float64
11 release_year 3571 non-null int64
12 days_used 3571 non-null int64
13 new_price 3571 non-null float64
14 used_price 3571 non-null float64
dtypes: float64(9), int64(2), object(4)
memory usage: 418.6+ KB
• brand_name, os, 4g, and 5g are object type columns while the rest are numeric in
nature.
# checking for duplicate values
df.duplicated().sum()
# checking for missing values
df.isnull().sum()
brand_name 0
os 0
screen_size 0
4g 0
5g 0
main_camera_mp 180
selfie_camera_mp 2
int_memory 10
ram 10
battery 6
weight 7
release_year 0
days_used 0
new_price 0
used_price 0
dtype: int64
Observations
• There are 33 brands in the data and a category Others too.
• Android is the most common OS for the used phones.
• The phone weight ranges from 23g to ~1kg, which is unusual.
• There are a few unusual values for the internal memory and RAM of used phones
too.
• The average value of the price of a used phone is approx. half the price of a new
model of the same phone.
Univariate Analysis
# function to plot a boxplot and a histogram along the same scale
# (signature reconstructed from the docstring defaults and the calls below)
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a marker indicates the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
used_price
histogram_boxplot(df, "used_price")
Observations
• The distribution of used phone prices is heavily right-skewed, with a mean value of
~100 euros.
• Let's apply the log transform to see if we can make the distribution closer to normal.
df["used_price_log"] = np.log(df["used_price"])
histogram_boxplot(df, "used_price_log")
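The effect of a log transform on a right-skewed variable can be checked numerically as well as visually. The sketch below is an illustration only: it uses a synthetic log-normal sample (the distribution and its parameters are assumptions, not the ReCell data) and compares the sample skewness before and after np.log. Note that np.log requires strictly positive values.

```python
import numpy as np
import pandas as pd

# synthetic right-skewed "prices" (log-normal); NOT the ReCell data
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=4.5, sigma=0.6, size=5000), name="price")

skew_raw = prices.skew()          # clearly positive for right-skewed data
skew_log = np.log(prices).skew()  # close to 0 after the log transform

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

A skewness near zero after the transform is what makes the log scale a more convenient target for a linear model.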
new_price
histogram_boxplot(df, "new_price")
Observations
• The distribution of new phone prices is heavily right-skewed, with a mean value of ~200 euros.
• Let's apply the log transform to see if we can make the distribution closer to normal.
# let's apply the log transform to see if we can make the distribution of
# new_price closer to normal
df["new_price_log"] = np.log(df["new_price"])
histogram_boxplot(df, "new_price_log")
• The log-transformed prices of new phone models are almost normally distributed.
screen_size
histogram_boxplot(df, "screen_size")
main_camera_mp
histogram_boxplot(df, "main_camera_mp")
• Few phones offer rear cameras with more than 20MP resolution.
selfie_camera_mp
histogram_boxplot(df, "selfie_camera_mp")
• Few phones offer front cameras with more than 16MP resolution.
int_memory
histogram_boxplot(df, "int_memory")
weight
histogram_boxplot(df, "weight")
• The distribution of weight is close to normal, with many upper outliers.
battery
histogram_boxplot(df, "battery")
• The distribution of the energy capacity of phone batteries is close to normal,
with a few upper outliers.
days_used
histogram_boxplot(df, "days_used")
• Around 50% of the phones in the data have been used for more than 700 days.
Some smartphones have a screen size smaller than 4 inches or larger than 8 inches
(screen_size is recorded in cm, hence the factor of 2.54 below). Let's check them.
df[(df.screen_size < 4 * 2.54) | (df.screen_size > 8 * 2.54)]
Observations
• There are a lot of phones which have very small or very large screen sizes.
• These are unusual values for a smartphone and need to be fixed.
• We will treat them as missing values and impute them later.
idx = df[(df.screen_size < 4 * 2.54) | (df.screen_size > 8 * 2.54)].index
df.loc[idx, "screen_size"] = np.nan
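The same "flag out-of-range values as missing" step can also be written with Series.between and Series.where, which avoids building an index first. This is a minimal sketch on a hypothetical toy column (the values below are made up for illustration); between is inclusive on both ends, so boundary values are kept, just as with the strict < and > comparisons above.

```python
import numpy as np
import pandas as pd

CM_PER_INCH = 2.54

# hypothetical screen sizes in cm (illustrative values, not from the data)
toy = pd.DataFrame({"screen_size": [7.0, 12.0, 15.0, 21.0, 30.0]})

lower, upper = 4 * CM_PER_INCH, 8 * CM_PER_INCH  # 4 to 8 inches, in cm
in_range = toy["screen_size"].between(lower, upper)  # inclusive bounds

# keep in-range values, replace everything else with NaN
toy["screen_size"] = toy["screen_size"].where(in_range)
print(toy)
```

The NaN values created this way are then handled by the missing value imputation later in the notebook.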
Some smartphones weigh less than 80g or more than 350g. Let's check them.
df[(df.weight < 80) | (df.weight > 350)]
Observations
• There are quite a few phones which have unusual weights for a smartphone.
• The weights and screen sizes of these phones are aligned, i.e., there are bigger
phones which weigh too much and smaller phones which weigh too little.
• We will treat them as missing values and impute them later.
idx = df[(df.weight < 80) | (df.weight > 350)].index
df.loc[idx, "weight"] = np.nan
A few smartphones have very low internal memory and RAM. Let's check them.
df[df.int_memory < 1]
brand_name os screen_size 4g 5g main_camera_mp \
105 Micromax Android 10.16 no no 2.00
106 Micromax Android NaN no no 0.30
107 Micromax Android NaN no no 2.00
115 Nokia Others NaN no no 0.30
116 Nokia Others NaN no no 0.30
118 Nokia Others NaN no no 0.30
119 Nokia Others NaN yes no 0.30
120 Nokia Others 14.76 no no 0.08
121 Nokia Others 14.76 no no 5.00
330 Micromax Android 10.16 no no 2.00
331 Micromax Android NaN no no 0.30
332 Micromax Android NaN no no 2.00
340 Nokia Others NaN no no 0.30
341 Nokia Others NaN no no 0.30
343 Nokia Others NaN no no 0.30
344 Nokia Others NaN yes no 0.30
345 Nokia Others 14.76 no no 0.08
346 Nokia Others 14.76 no no 5.00
419 Others Others NaN no no 0.30
420 Others Others NaN no no 0.30
1498 Karbonn Android 10.16 no no 5.00
2197 Nokia Others NaN no no 2.00
Observations
• There are a few phones which have very low internal memory and/or a low amount
of RAM, which doesn't seem to be correct.
• We will treat them as missing values and impute them later.
idx = df[df.int_memory < 1].index
df.loc[idx, "int_memory"] = np.nan
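# function to create labeled barplots
# (signature and the total, x, y definitions are reconstructed from the docstring and usage)
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with a count or percentage label on top of each bar

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all
    levels)
    """
    total = len(data[feature])  # number of rows, used for the percentages
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
        y = p.get_height()  # top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its label
    plt.show()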
brand_name
labeled_barplot(df, "brand_name", perc=True, n=10)
Observations
• Samsung has the largest number of phones in the data, followed by Huawei and LG.
• Around 14% of the phones in the data are from brands other than the listed ones.
os
labeled_barplot(df, "os", perc=True)
• More than 90% of the used phones in the data run Android.
4g
labeled_barplot(df, "4g", perc=True)
release_year
labeled_barplot(df, "release_year", perc=True)
• Around 50% of the phones in the data were originally released in 2015 or before.
Bivariate Analysis
cols_list = df.select_dtypes(include=np.number).columns.tolist()
# dropping release_year as it is a temporal variable
cols_list.remove("release_year")
plt.figure(figsize=(15, 7))
sns.heatmap(
df[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f",
cmap="Spectral"
)
plt.show()
Observations
• The used phone price is highly correlated with the price of a new phone model.
– This makes sense as the price of a new phone model is likely to affect the
used phone price.
• Weight, screen size, and battery capacity of a phone show a good amount of
correlation.
– This makes sense as larger battery capacity requires bigger space, thereby
increasing phone screen size and phone weight.
• The release year of the phones and the number of days they were used are negatively
correlated.
The amount of RAM is important for the smooth functioning of a phone. Let's see how
the amount of RAM varies across brands.
plt.figure(figsize=(15, 5))
sns.barplot(data=df, x="brand_name", y="ram")
plt.xticks(rotation=90)
plt.show()
Observations
• Most of the companies offer around 4GB of RAM on average.
• OnePlus offers the highest amount of RAM in general, while Celkon offers the least.
People who travel frequently require phones with large batteries to run through the
day. But a large battery often increases a phone's weight, making it feel uncomfortable
in the hands. Let's create a new dataframe of only those phones which offer a large
battery and analyze.
df_large_battery = df[df.battery > 4500]
df_large_battery.shape
(346, 17)
df_large_battery.groupby("brand_name")["weight"].mean().sort_values(
    ascending=True
)
brand_name
Micromax 118.000000
Acer 147.500000
Spice 158.000000
Panasonic 182.000000
Infinix 193.000000
Oppo 195.000000
ZTE 195.400000
Vivo 195.630769
Realme 196.833333
Asus 199.357143
Motorola 200.757143
Gionee 209.430000
Honor 210.166667
Xiaomi 218.327586
Samsung 223.733333
Others 238.094737
Lenovo 258.000000
LG 264.128571
Huawei 302.277778
Apple 312.100000
Nokia 318.000000
Alcatel NaN
Google NaN
HTC NaN
Sony NaN
Name: weight, dtype: float64
plt.figure(figsize=(15, 5))
sns.barplot(data=df_large_battery, x="brand_name", y="weight")
plt.xticks(rotation=60)
plt.show()
Observations
• A lot of brands offer phones which are not very heavy but have a large battery
capacity.
• Some phones offered by brands like Vivo, Realme, Motorola, etc. weigh just about
200g but offer great batteries.
• Some phones offered by brands like Huawei, Apple, Nokia, etc. offer great batteries
but are heavy.
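• Google, HTC, Sony, and Alcatel show NaN: they do appear among the phones with a
battery capacity greater than 4500 mAh, but none of those phones has a usable weight
value (for example, because unusual weights were set to missing above).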
People who buy phones primarily for entertainment purposes prefer a large screen
as they offer a better viewing experience. Let's create a new dataframe of only those
phones which are suitable for such people and analyze.
df_large_screen = df[df.screen_size > 6 * 2.54]
df_large_screen.shape
(726, 17)
df_large_screen.brand_name.value_counts()
Samsung 86
Huawei 82
Others 62
LG 54
Oppo 52
Lenovo 44
Honor 42
Motorola 40
Asus 32
Vivo 32
Realme 30
Xiaomi 29
Meizu 23
Alcatel 23
Nokia 14
Acer 13
ZTE 10
Apple 9
Infinix 8
Micromax 8
Sony 8
HTC 6
XOLO 3
Google 3
Gionee 3
OnePlus 2
Panasonic 2
Karbonn 2
Celkon 2
Coolpad 1
Spice 1
Name: brand_name, dtype: int64
Observations
• Huawei and Samsung offer a lot of phones suitable for customers buying phones for
entertainment purposes.
• Brands like Alcatel, Meizu, and Nokia offer fewer phones for this customer segment.
Data Preprocessing
Feature Engineering
• Let's create a new column phone_category from the new_price column to tag
phones as budget, mid-ranger, or premium.
df["phone_category"] = pd.cut(
x=df.new_price,
bins=[-np.inf, 200, 350, np.inf],
labels=["Budget", "Mid-ranger", "Premium"],
)
df["phone_category"].value_counts()
Budget 1904
Mid-ranger 1060
Premium 607
Name: phone_category, dtype: int64
• More than half the phones in the data are budget phones.
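It helps to know how pd.cut treats the bin edges used above. By default the intervals are closed on the right, so a phone priced at exactly 200 euros is tagged Budget and one priced at exactly 350 euros is tagged Mid-ranger. The sketch below shows this on a few hypothetical prices (the values are made up for illustration).

```python
import numpy as np
import pandas as pd

# hypothetical new-phone prices in euros
prices = pd.Series([99.0, 200.0, 200.01, 350.0, 351.0])

category = pd.cut(
    x=prices,
    bins=[-np.inf, 200, 350, np.inf],  # (-inf, 200], (200, 350], (350, inf]
    labels=["Budget", "Mid-ranger", "Premium"],
)
print(category.tolist())
```

Passing right=False to pd.cut would make the intervals closed on the left instead, which would move the boundary prices into the next category.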
Everyone likes a good phone camera to capture their favorite moments with loved
ones. Some customers specifically look for good front cameras to click cool selfies.
Let's create a new dataframe of only those phones which are suitable for this
customer segment and analyze.
df_selfie_camera = df[df.selfie_camera_mp > 8]
df_selfie_camera.shape
(666, 18)
plt.figure(figsize=(15, 5))
sns.countplot(data=df_selfie_camera, x="brand_name", hue="phone_category")
plt.xticks(rotation=60)
plt.legend(loc=1)
plt.show()
Observations
• Huawei is the go-to brand for this customer segment as they offer many phones
across different price ranges with powerful front cameras.
• Xiaomi and Realme also offer a lot of budget phones capable of shooting crisp selfies.
• Oppo and Vivo offer many mid-rangers with great selfie cameras.
• Oppo, Vivo, and Samsung offer many premium phones for this customer segment.
Let's do a similar analysis for rear cameras.
df_main_camera = df[df.main_camera_mp > 16]
df_main_camera.shape
(94, 18)
plt.figure(figsize=(15, 5))
sns.countplot(data=df_main_camera, x="brand_name", hue="phone_category")
plt.xticks(rotation=60)
plt.legend(loc=2)
plt.show()
Observations
• Sony is the go-to brand for great rear cameras as they offer many phones across
different price ranges.
• No brand other than Sony seems to be offering great rear cameras in budget phones.
• Brands like Motorola and HTC offer mid-rangers with great rear cameras.
• Nokia offers a few premium phones with great rear cameras.
Let's see how the price of used phones varies across the years.
plt.figure(figsize=(10, 5))
sns.barplot(data=df, x="release_year", y="used_price")
plt.show()
plt.subplot(121)
sns.heatmap(
pd.crosstab(df["4g"], df["phone_category"], normalize="columns"),
annot=True,
fmt=".4f",
cmap="Spectral",
)
plt.subplot(122)
sns.heatmap(
pd.crosstab(df["5g"], df["phone_category"], normalize="columns"),
annot=True,
fmt=".4f",
cmap="Spectral",
)
plt.show()
Observations
• There is an almost equal number of 4G and non-4G budget phones, but there are no
budget phones offering 5G network.
• Most of the mid-rangers and premium phones offer 4G network.
• Very few mid-rangers (~3%) and around 20% of the premium phones offer 5G
mobile network.
Data Preprocessing
Missing Value Imputation
• We will impute the missing values in the data by the column medians grouped by
release_year and brand_name.
# let's create a copy of the data
df1 = df.copy()
# checking for missing values
df1.isnull().sum()
brand_name 0
os 0
screen_size 783
4g 0
5g 0
main_camera_mp 180
selfie_camera_mp 2
int_memory 32
ram 13
battery 6
weight 283
release_year 0
days_used 0
new_price 0
used_price 0
used_price_log 0
new_price_log 0
phone_category 0
dtype: int64
cols_impute = [
"screen_size",
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
]
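The imputation loop itself is not shown here. One common way to impute by group medians is groupby(...).transform("median") followed by fillna, applied first with the finer grouping and then with coarser fallbacks. The sketch below is an assumption about how such a step could look, run on a small made-up frame; the helper name impute_by_group_median is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# made-up data: one column to impute, two grouping keys
toy = pd.DataFrame(
    {
        "brand_name": ["A", "A", "A", "A", "B", "B"],
        "release_year": [2019, 2019, 2019, 2020, 2020, 2020],
        "weight": [150.0, 170.0, np.nan, np.nan, np.nan, np.nan],
    }
)


def impute_by_group_median(frame, cols, keys):
    """Fill NaNs in cols with the median of the group defined by keys."""
    out = frame.copy()
    for col in cols:
        group_median = out.groupby(keys)[col].transform("median")
        out[col] = out[col].fillna(group_median)
    return out


# pass 1: median within (release_year, brand_name)
step1 = impute_by_group_median(toy, ["weight"], ["release_year", "brand_name"])
# pass 2: groups with no observed values fall back to the brand median
step2 = impute_by_group_median(step1, ["weight"], ["brand_name"])
# pass 3: anything still missing gets the overall column median
step3 = step2.fillna({"weight": step2["weight"].median()})

print(step1["weight"].isna().sum(), step2["weight"].isna().sum(), step3["weight"].isna().sum())
```

A group whose values are all missing has no median, which is why some missing values survive each pass and a coarser fallback is needed.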
brand_name 0
os 0
screen_size 49
4g 0
5g 0
main_camera_mp 180
selfie_camera_mp 2
int_memory 10
ram 10
battery 6
weight 12
release_year 0
days_used 0
new_price 0
used_price 0
used_price_log 0
new_price_log 0
phone_category 0
dtype: int64
• We will impute the remaining missing values in the data by the column medians
grouped by brand_name.
cols_impute = [
"screen_size",
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
]
brand_name 0
os 0
screen_size 0
4g 0
5g 0
main_camera_mp 10
selfie_camera_mp 0
int_memory 0
ram 0
battery 0
weight 0
release_year 0
days_used 0
new_price 0
used_price 0
used_price_log 0
new_price_log 0
phone_category 0
dtype: int64
• We will fill the remaining missing values in the main_camera_mp column by the
column median.
df1["main_camera_mp"] = df1["main_camera_mp"].fillna(df1["main_camera_mp"].median())
# checking for missing values
df1.isnull().sum()
brand_name 0
os 0
screen_size 0
4g 0
5g 0
main_camera_mp 0
selfie_camera_mp 0
int_memory 0
ram 0
battery 0
weight 0
release_year 0
days_used 0
new_price 0
used_price 0
used_price_log 0
new_price_log 0
phone_category 0
dtype: int64
Outlier Check
• Let's check for outliers in the data.
# outlier detection using boxplot
numeric_columns = df1.select_dtypes(include=np.number).columns.tolist()
# dropping release_year as it is a temporal variable
numeric_columns.remove("release_year")
plt.figure(figsize=(15, 12))
plt.show()
Observations
• There are quite a few outliers in the data.
• However, we will not treat them as they are proper values.
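The points a boxplot draws as outliers are those lying more than 1.5 times the interquartile range beyond the quartiles, so they can also be counted directly. The sketch below is a minimal illustration on made-up numbers; the helper name count_iqr_outliers is hypothetical and not part of the notebook.

```python
import pandas as pd


def count_iqr_outliers(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (the boxplot whisker rule)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())


# made-up values with one obvious upper outlier
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])
n_outliers = count_iqr_outliers(toy)
print(n_outliers)
```

Applied per numeric column, a count like this gives a quick sense of how many points would be affected before deciding whether to treat them.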
# let's check the statistical summary of the data once
df1.describe(include="all").T
print(X.head())
print()
print(y.head())
days_used new_price_log
0 127 4.715100
1 325 5.519018
2 162 5.884631
3 345 5.630961
4 293 4.947837
0 4.465448
1 5.084443
2 5.593037
3 5.194234
4 4.642466
Name: used_price_log, dtype: float64
X.head()
[5 rows x 48 columns]
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.52e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Observations
• Both the R-squared and Adjusted R-squared of our model are 0.99, indicating that it
can explain 99% of the variance in the log-transformed price of used phones (the
target is used_price_log).
• This indicates that the model is not underfitting the training data.
• To be able to make statistical inferences from our model, we will have to test that
the linear regression assumptions are followed.
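R-squared and adjusted R-squared are related by adj = 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors excluding the intercept, so the two agree closely when n is much larger than p. The sketch below computes both with plain NumPy on synthetic data. It is an illustration under stated assumptions (made-up coefficients and noise, least squares via np.linalg.lstsq rather than the statsmodels fit used in the notebook).

```python
import numpy as np


def r2_and_adjusted(y_true, y_pred, n_predictors):
    """Return (R-squared, adjusted R-squared) for a fitted linear model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    n = y_true.size
    adj = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj


# synthetic data: 3 predictors with made-up coefficients and small noise
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

X_const = np.column_stack([np.ones(n), X])  # add the intercept column
coef, *_ = np.linalg.lstsq(X_const, y, rcond=None)
r2, adj_r2 = r2_and_adjusted(y, X_const @ coef, n_predictors=p)
print(f"R-squared: {r2:.4f}, adjusted R-squared: {adj_r2:.4f}")
```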
model: regressor
predictors: independent variables
target: dependent variable
"""
return df_perf
Training Performance
Test Performance
Observations
• RMSE and MAE of train and test data are very close, which indicates that our model
is not overfitting the train data.
• MAE indicates that our current model is able to predict used phone prices within a
mean error of ~7.3 euros on test data.
• The RMSE values are higher than the MAE values because squaring the residuals
penalizes the model more for larger errors in prediction.
• Despite being able to capture 99% of the variation in the data, the MAE is around 7.3
euros as the model makes larger prediction errors for the extreme values (very high or
very low prices).
• MAPE of ~7.1 on the test data indicates that the model can predict within ~7.1% of
the used phone price.
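The three error metrics discussed above can be computed in a few lines. Because the model is fitted on the log of the price while the errors are quoted in euros, the sketch below assumes that predictions are back-transformed with np.exp before the metrics are computed; that step is an assumption, as the performance function itself is not shown. The helper name regression_errors and the prices are hypothetical.

```python
import numpy as np


def regression_errors(actual, predicted):
    """RMSE, MAE and MAPE (in %) for values on the original price scale."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err) / actual) * 100),
    }


# made-up actual and predicted prices, held on the log scale like the target
log_actual = np.log([100.0, 200.0, 50.0, 400.0])
log_pred = np.log([110.0, 190.0, 50.0, 360.0])

# back-transform to euros before computing the errors (assumed step)
scores = regression_errors(np.exp(log_actual), np.exp(log_pred))
print(scores)
```

The single 40-euro miss pulls the RMSE well above the MAE, which is the squaring effect described in the observations.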
1. No Multicollinearity
2. Linearity of variables
3. Independence of error terms
4. Normality of error terms
5. No Heteroscedasticity
TEST FOR MULTICOLLINEARITY
• We will test for multicollinearity using VIF.
– If VIF is 1 then there is no correlation between the 𝑘th predictor and the
remaining predictor variables.
– If VIF exceeds 5 or is close to exceeding 5, we say there is moderate
multicollinearity.
– If VIF is 10 or exceeding 10, it shows signs of high multicollinearity.
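The VIF of the kth predictor is 1 / (1 - R2_k), where R2_k is the R-squared obtained by regressing that predictor on all the other predictors. The sketch below spells this out with plain NumPy on synthetic data so the definition is visible. It is not the notebook's checking_vif helper (whose definition is not shown) and it does not call the statsmodels function; the helper name vif_table and the data are assumptions. The intercept column is skipped because its VIF is not meaningful.

```python
import numpy as np
import pandas as pd


def vif_table(predictors):
    """VIF per column of a numeric DataFrame that includes a 'const' column."""
    X = predictors.to_numpy(dtype=float)
    rows = []
    for k, name in enumerate(predictors.columns):
        if name == "const":
            continue  # VIF of the intercept is not interpreted
        target = X[:, k]
        others = np.delete(X, k, axis=1)  # all remaining columns, incl. const
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1 - ss_res / ss_tot
        rows.append({"feature": name, "VIF": 1 / (1 - r2)})
    return pd.DataFrame(rows)


# synthetic predictors: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
toy = pd.DataFrame(
    {
        "const": 1.0,
        "x1": x1,
        "x2": x1 + rng.normal(scale=0.1, size=n),
        "x3": rng.normal(size=n),
    }
)
vif = vif_table(toy).set_index("feature")["VIF"]
print(vif)
```

In this toy example x1 and x2 get very large VIFs while x3 stays close to 1, which matches the thresholds listed above.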
from statsmodels.stats.outliers_influence import variance_inflation_factor
checking_vif(x_train1)
feature VIF
0 const 3.682943e+06
1 screen_size 2.708324e+00
2 main_camera_mp 2.223471e+00
3 selfie_camera_mp 2.878177e+00
4 int_memory 1.275161e+00
5 ram 1.796204e+00
6 battery 1.909542e+00
7 weight 1.888805e+00
8 release_year 4.778527e+00
9 days_used 2.605099e+00
10 new_price_log 2.994062e+00
11 brand_name_Alcatel 3.111636e+00
12 brand_name_Apple 2.346056e+01
13 brand_name_Asus 3.487170e+00
14 brand_name_BlackBerry 1.547068e+00
15 brand_name_Celkon 1.845983e+00
16 brand_name_Coolpad 1.396023e+00
17 brand_name_Gionee 2.133539e+00
18 brand_name_Google 1.278299e+00
19 brand_name_HTC 3.277110e+00
20 brand_name_Honor 3.308287e+00
21 brand_name_Huawei 6.075371e+00
22 brand_name_Infinix 1.267413e+00
23 brand_name_Karbonn 1.677137e+00
24 brand_name_LG 5.087642e+00
25 brand_name_Lava 1.715547e+00
26 brand_name_Lenovo 4.265889e+00
27 brand_name_Meizu 2.258390e+00
28 brand_name_Micromax 3.249870e+00
29 brand_name_Microsoft 1.857210e+00
30 brand_name_Motorola 3.113107e+00
31 brand_name_Nokia 3.610847e+00
32 brand_name_OnePlus 1.612205e+00
33 brand_name_Oppo 3.885092e+00
34 brand_name_Others 9.593894e+00
35 brand_name_Panasonic 1.957506e+00
36 brand_name_Realme 1.938276e+00
37 brand_name_Samsung 7.865618e+00
38 brand_name_Sony 2.859107e+00
39 brand_name_Spice 1.767183e+00
40 brand_name_Vivo 3.458048e+00
41 brand_name_XOLO 2.016256e+00
42 brand_name_Xiaomi 4.019903e+00
43 brand_name_ZTE 3.671177e+00
44 os_Others 1.561759e+00
45 os_Windows 1.599747e+00
46 os_iOS 2.189324e+01
47 4g_yes 2.375968e+00
48 5g_yes 1.673915e+00
Observations
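• None of the numerical variables shows high multicollinearity; the largest VIF among them is ~4.78 for release_year, which is close to the threshold of 5 for moderate multicollinearity.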
• We will ignore the VIF for the dummy variables.
selected_features = cols
print(selected_features)
['const', 'release_year', 'days_used', 'new_price_log', 'brand_name_Gionee',
'brand_name_Panasonic', '5g_yes']
x_train2 = x_train1[selected_features]
x_test2 = x_test1[selected_features]
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.92e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Training Performance
Test Performance
Observations
• Dropping the high p-value predictor variables has not adversely affected the model
performance.
• This shows that these variables do not significantly impact the target variable.
Now we'll check the rest of the assumptions on olsmod2.
1. Linearity of variables
2. Independence of error terms
3. Normality of error terms
4. No Heteroscedasticity
df_pred.head()
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple",
lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
Observations
• We see no pattern in the plot above.
• Hence, the assumptions of linearity and independence are satisfied.
Observations
• The histogram of residuals does have a slight bell shape.
• Let's check the Q-Q plot.
import pylab
import scipy.stats as stats
Observations
• The residuals more or less follow a straight line except for the tails.
• Let's check the results of the Shapiro-Wilk test.
stats.shapiro(df_pred["Residuals"])
ShapiroResult(statistic=0.9549560546875, pvalue=4.112263720997103e-27)
Observations
• Since p-value < 0.05, the Shapiro-Wilk test rejects the hypothesis that the residuals are
normally distributed.
• Strictly speaking, the residuals are not normal; the Q-Q plot shows that the departure is
mainly in the tails. As an approximation, we can accept this distribution as close to
being normal.
• So, we treat the assumption as approximately satisfied.
Observations
• Since p-value > 0.05, we fail to reject the null hypothesis that the residuals are
homoscedastic.
• So, the assumption is satisfied.
All the assumptions of linear regression are satisfied. Let's rebuild our final model,
check its performance, and draw inferences from it.
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.92e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Training Performance
Actionable Insights
• The model explains 99% of the variation in the data and predicts used phone prices
with a mean absolute error of ~7 euros.
• The most significant predictors of the used phone price are the price of a new phone
of the same model, the release year of the phone, the number of days it was used,
and the availability of 5G network.
• A one percent increase in the new phone price is associated with a one percent increase
in the used phone price, all else being equal. [100 * {(1.01)**(1.0000) - 1} = 1]
• Each additional day of use decreases the used phone price by about 0.11%.
[100 * {exp(-0.0011) - 1} ≈ -0.11]
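The bracketed arithmetic follows from the target being the log of the used price. For a log-transformed predictor with coefficient b, a 1% increase changes the price by 100 * (1.01**b - 1) percent; for an untransformed predictor with coefficient b, a one-unit increase changes it by 100 * (exp(b) - 1) percent. The sketch below reproduces both numbers. The coefficient values are taken from the brackets above, and the negative sign for days_used is an assumption inferred from the word "decreases"; the function names are hypothetical.

```python
import math


def pct_change_log_log(coef, pct_increase=1.0):
    """% change in the target for a pct_increase % rise in a logged predictor."""
    return 100 * ((1 + pct_increase / 100) ** coef - 1)


def pct_change_log_level(coef, unit_increase=1.0):
    """% change in the target for a unit_increase rise in an unlogged predictor."""
    return 100 * (math.exp(coef * unit_increase) - 1)


new_price_effect = pct_change_log_log(1.0)        # new_price_log coefficient
days_used_effect = pct_change_log_level(-0.0011)  # days_used coefficient (sign assumed)

print(round(new_price_effect, 2))
print(round(days_used_effect, 2))
```

The same function gives the effect over longer periods, for example pct_change_log_level(-0.0011, unit_increase=365) for a full year of use.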
Recommendations
• The model can be used for predictive purposes as it can predict the used phone
price within ~7%.
• ReCell should look to attract people who want to sell used phones which have been
released in recent years and have not been used for many days.
• They should also try to gather and put up phones having a high price for new
models to try and increase revenue.
– They can focus on volume for the budget phones and offer discounts during
festive sales on premium phones.
• Additional data regarding customer demographics (age, gender, income, etc.) can be
collected and analyzed to gain better insights into the preferences of customers
across different segments.