Regression
Problem Statement
The Google Play Store team is about to launch a feature that boosts the visibility of
promising apps. The boost will show up in several ways, including higher priority in the
recommendation sections (“Similar apps”, “You might also like”, “New and updated games”)
and in search results. The feature is meant to bring more attention to newer apps that
have potential.
Analysis to be done:
The task is to identify the apps that are worth promoting. App ratings, which come
directly from users, are a strong indicator of app quality, so the problem reduces to
predicting which apps will have high ratings.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
df = pd.read_csv('googleplaystore.csv')
df.shape
(10841, 13)
df.head(3)
                                                 App        Category  Rating  \
0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1
1                                Coloring book moana  ART_AND_DESIGN     3.9
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7
# Data Information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10841 non-null object
1 Category 10841 non-null object
2 Rating 9367 non-null float64
3 Reviews 10841 non-null object
4 Size 10841 non-null object
5 Installs 10841 non-null object
6 Type 10840 non-null object
7 Price 10841 non-null object
8 Content_Rating 10840 non-null object
9 Genres 10841 non-null object
10 Last Updated 10841 non-null object
11 Current Ver 10833 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
df.isnull().sum()
App 0
Category 0
Rating 1474
Reviews 0
Size 0
Installs 0
Type 1
Price 0
Content_Rating 1
Genres 0
Last Updated 0
Current Ver 8
Android Ver 3
dtype: int64
# Missing cells relative to the number of rows, expressed as a percentage
print("Missing Value %age :", (df.isnull().sum().sum() / df.shape[0]) * 100)
df.shape
(9360, 13)
df.isnull().sum()
App 0
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 0
Price 0
Content_Rating 0
Genres 0
Last Updated 0
Current Ver 0
Android Ver 0
dtype: int64
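The cell that takes the frame from 10841 rows to 9360 rows with no nulls is not shown in this export. A minimal sketch of one way to get that result, assuming every row with at least one missing value was dropped (the toy frame and its values are illustrative, not the Play Store data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Play Store data (illustrative values only).
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, 3.9, 4.7],
    "Type": ["Free", "Free", None, "Paid"],
})

rows_before = len(toy)
clean = toy.dropna()  # drop every row that has at least one missing value
rows_dropped = rows_before - len(clean)
remaining_nulls = int(clean.isnull().sum().sum())
print(rows_dropped, remaining_nulls)
```

Dropping rows is the simplest option; imputing the missing ratings would keep more rows but would also put estimated values into the target column.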
Data Wrangling
# Check for Duplicate Data
df[df.duplicated()]
                                                  App      Category  Rating  \
229                      Quick PDF Scanner + OCR FREE      BUSINESS     4.2
236                                               Box      BUSINESS     4.2
239                                Google My Business      BUSINESS     4.4
256                               ZOOM Cloud Meetings      BUSINESS     4.4
261                         join.me - Simple Meetings      BUSINESS     4.0
...                                               ...           ...     ...
8643                   Wunderlist: To-Do List & Tasks  PRODUCTIVITY     4.6
8654  TickTick: To Do List with Reminder, Day Planner  PRODUCTIVITY     4.6
8658                          ColorNote Notepad Notes  PRODUCTIVITY     4.6
10049       Airway Ex - Intubate. Anesthetize. Train.       MEDICAL     4.3
10768                                            AAFP       MEDICAL     3.8
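The later outputs report 8886 rows, which suggests the duplicated rows listed above were removed, although that cell is not shown. A minimal sketch, assuming exact row duplicates were dropped with `drop_duplicates` (toy data, illustrative values):

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative values only).
toy = pd.DataFrame({
    "App": ["Box", "Box", "ZOOM Cloud Meetings", "Box"],
    "Category": ["BUSINESS", "BUSINESS", "BUSINESS", "BUSINESS"],
    "Rating": [4.2, 4.2, 4.4, 4.1],
})

n_dupes = int(toy.duplicated().sum())  # rows identical to an earlier row
deduped = toy.drop_duplicates()        # keeps the first occurrence of each row
print(n_dupes, len(deduped))
```

The last row shares an app name with the first but has a different rating, so it is not counted as a duplicate. Deduplicating on `App` alone would be a stricter, different choice.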
# Price
df.Price.value_counts()
0 8275
$2.99 110
$0.99 104
$4.99 68
$1.99 59
...
$1.29 1
$299.99 1
$379.99 1
$39.99 1
$1.20 1
Name: Price, Length: 73, dtype: int64
float('$299.99'[1:])
299.99
float('$299.99'.replace('$', ''))
299.99
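The two expressions above strip the dollar sign from a single string. The cell that applies the conversion to the whole column is not shown; a minimal sketch, assuming a vectorised string replace followed by a float cast (the sample values are taken from the `value_counts()` output above):

```python
import pandas as pd

# Price strings as they appear in the raw column: "0" for free apps, "$x.xx" otherwise.
prices = pd.Series(["0", "$2.99", "$0.99", "$299.99"])

# regex=False treats "$" as a literal character rather than a regex anchor.
price_num = prices.str.replace("$", "", regex=False).astype(float)
print(price_num.tolist())
```

Using `replace` rather than slicing with `[1:]` matters here because free apps are stored as `0` without a leading dollar sign.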
df.Price.value_counts()
0.00 8275
2.99 110
0.99 104
4.99 68
1.99 59
...
1.29 1
299.99 1
379.99 1
39.99 1
1.20 1
Name: Price, Length: 73, dtype: int64
df.Price.info()
<class 'pandas.core.series.Series'>
Int64Index: 8886 entries, 0 to 10840
Series name: Price
Non-Null Count Dtype
-------------- -----
8886 non-null float64
dtypes: float64(1)
memory usage: 138.8 KB
df.head()
                                                 App        Category  Rating  \
0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1
1                                Coloring book moana  ART_AND_DESIGN     3.9
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7
3                              Sketch - Draw & Paint  ART_AND_DESIGN     4.5
4              Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3
Android Ver
0 4.0.3 and up
1 4.0.3 and up
2 4.0.3 and up
3 4.2 and up
4 4.4 and up
df.Reviews.value_counts()
2 82
3 76
4 74
5 74
1 67
..
70189 1
10859051 1
111066 1
129272 1
398307 1
Name: Reviews, Length: 5990, dtype: int64
df.Reviews = df.Reviews.astype('int')
df.Reviews.info()
<class 'pandas.core.series.Series'>
Int64Index: 8886 entries, 0 to 10840
Series name: Reviews
Non-Null Count Dtype
-------------- -----
8886 non-null int64
dtypes: int64(1)
memory usage: 138.8 KB
df.Size.value_counts()
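def change_size(size):
    """Convert a Size string such as '19M' or '201k' to kilobytes."""
    if size.endswith('M'):
        # megabytes -> kilobytes
        return float(size[:-1]) * 1000
    elif size.endswith('k'):
        # already in kilobytes
        return float(size[:-1])
    else:
        # anything else (e.g. 'Varies with device') becomes missing
        return None
df['Size'] = df.Size.map(change_size)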
df.Size.info()
<class 'pandas.core.series.Series'>
Int64Index: 8886 entries, 0 to 10840
Series name: Size
Non-Null Count Dtype
-------------- -----
7418 non-null float64
dtypes: float64(1)
memory usage: 138.8 KB
df.Size.isnull().sum()
1468
df.tail(10)
                                                 App             Category  \
10828                        Manga-FR - Anime Vostfr               COMICS
10829                 Bulgarian French Dictionary Fr  BOOKS_AND_REFERENCE
10830                              News Minecraft.fr   NEWS_AND_MAGAZINES
10832                                       FR Tides              WEATHER
10833                                    Chemin (fr)  BOOKS_AND_REFERENCE
10834                                  FR Calculator               FAMILY
10836                               Sya9a Maroc - FR               FAMILY
10837               Fr. Mike Schmitz Audio Teachings               FAMILY
10839                  The SCP Foundation DB fr nn5n  BOOKS_AND_REFERENCE
10840  iHoroscope - 2018 Daily Horoscope & Astrology            LIFESTYLE

       Rating  Reviews     Size     Installs  Type  Price Content_Rating  \
10828     3.4      291  13000.0      10,000+  Free    0.0       Everyone
10829     4.6      603   7400.0      10,000+  Free    0.0       Everyone
10830     3.8      881   2300.0     100,000+  Free    0.0       Everyone
10832     3.8     1195    582.0     100,000+  Free    0.0       Everyone
10833     4.8       44    619.0       1,000+  Free    0.0       Everyone
10834     4.0        7   2600.0         500+  Free    0.0       Everyone
10836     4.5       38  53000.0       5,000+  Free    0.0       Everyone
10837     5.0        4   3600.0         100+  Free    0.0       Everyone
10839     4.5      114      NaN       1,000+  Free    0.0     Mature_17+
10840     4.5   398307  19000.0  10,000,000+  Free    0.0       Everyone

Android Ver
10828 4.0 and up
10829 4.1 and up
10830 1.6 and up
10832 2.1 and up
10833 2.2 and up
10834 4.1 and up
10836 4.1 and up
10837 4.1 and up
10839 Varies with device
10840 Varies with device
df.Size.describe()
count 7418.000000
mean 22760.828862
std 23439.210125
min 8.500000
25% 5100.000000
50% 14000.000000
75% 33000.000000
max 100000.000000
Name: Size, dtype: float64
df.Size.isnull().sum()
df.Size.describe()
count 8886.000000
mean 22982.979912
std 23343.630842
min 8.500000
25% 5300.000000
50% 14000.000000
75% 33000.000000
max 100000.000000
Name: Size, dtype: float64
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8886 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 8886 non-null object
1 Category 8886 non-null object
2 Rating 8886 non-null float64
3 Reviews 8886 non-null int64
4 Size 8886 non-null float64
5 Installs 8886 non-null object
6 Type 8886 non-null object
7 Price 8886 non-null float64
8 Content_Rating 8886 non-null object
9 Genres 8886 non-null object
10 Last Updated 8886 non-null object
11 Current Ver 8886 non-null object
12 Android Ver 8886 non-null object
dtypes: float64(3), int64(1), object(9)
memory usage: 971.9+ KB
df.Installs
0 10,000+
1 500,000+
2 5,000,000+
3 50,000,000+
4 100,000+
...
10834 500+
10836 5,000+
10837 100+
10839 1,000+
10840 10,000,000+
Name: Installs, Length: 8886, dtype: object
df['Installs'] = df.Installs.map(lambda x:
int(x.replace(',','').replace('+','')))
df.Installs
0 10000
1 500000
2 5000000
3 50000000
4 100000
...
10834 500
10836 5000
10837 100
10839 1000
10840 10000000
Name: Installs, Length: 8886, dtype: int64
df.head()
                                                 App        Category  Rating  \
0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1
1                                Coloring book moana  ART_AND_DESIGN     3.9
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7
3                              Sketch - Draw & Paint  ART_AND_DESIGN     4.5
4              Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3

Reviews Size Installs Type Price Content_Rating \
0 159 19000.0 10000 Free 0.0 Everyone
1 967 14000.0 500000 Free 0.0 Everyone
2 87510 8700.0 5000000 Free 0.0 Everyone
3 215644 25000.0 50000000 Free 0.0 Teen
4 967 2800.0 100000 Free 0.0 Everyone
Android Ver
0 4.0.3 and up
1 4.0.3 and up
2 4.0.3 and up
3 4.2 and up
4 4.4 and up
# Sanity checks
# 1. Ratings must lie between 1 and 5.
# 2. The number of reviews cannot exceed the number of installs.
# 3. A free app must have a price of 0.
# 4. A paid app must have a price greater than 0.
df.Rating.describe()
count 8886.000000
mean 4.187959
std 0.522428
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 5.000000
Name: Rating, dtype: float64
len(df[df.Reviews>df.Installs])
df.shape
(8879, 13)
Empty DataFrame
Columns: [App, Category, Rating, Reviews, Size, Installs, Type, Price,
Content_Rating, Genres, Last Updated, Current Ver, Android Ver]
Index: []
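Only part of the sanity-check code survives in this export. The four rules listed above can be sketched as boolean masks; this is an illustration on toy data under the assumption that violating rows are removed, not the notebook's original cells:

```python
import pandas as pd

# Toy frame (illustrative values only); the third row has more reviews than installs.
toy = pd.DataFrame({
    "Rating":   [4.1, 3.9, 4.7, 4.5],
    "Reviews":  [159, 967, 20, 5],
    "Installs": [10000, 500000, 10, 100],
    "Type":     ["Free", "Free", "Paid", "Free"],
    "Price":    [0.0, 0.0, 1.99, 0.0],
})

rating_ok = toy["Rating"].between(1, 5)                       # rule 1
reviews_ok = toy["Reviews"] <= toy["Installs"]                # rule 2
free_ok = (toy["Type"] != "Free") | (toy["Price"] == 0)       # rule 3
paid_ok = (toy["Type"] != "Paid") | (toy["Price"] > 0)        # rule 4

valid = rating_ok & reviews_ok & free_ok & paid_ok
checked = toy[valid]
print(valid.tolist())
```

Keeping each rule as its own mask makes it easy to count violations per rule before deciding whether to drop or repair the rows.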
# Outlier Detection
sns.boxplot(x = 'Price', data = df)
plt.show()
len(df[df.Price>200])
15
df[df.Price>200]
df.shape
(8864, 13)
len(df[df.Reviews>25000000])
20
df.Installs.quantile([0.1,0.25,0.4,0.5,0.7,0.75,0.9,0.95,0.98,0.99])
0.10 1000.0
0.25 10000.0
0.40 100000.0
0.50 500000.0
0.70 1000000.0
0.75 5000000.0
0.90 10000000.0
0.95 100000000.0
0.98 100000000.0
0.99 500000000.0
Name: Installs, dtype: float64
len(df[df.Installs>500000000])
33
df.shape
(8811, 13)
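The row counts above fall from 8879 to 8864 and then to 8811, which matches removing the 15, 20 and 33 rows flagged by the three `len(...)` checks. The filtering cells themselves are not shown; a minimal sketch, assuming rows beyond each threshold were dropped (thresholds copied from the checks above, toy data otherwise):

```python
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "Price":    [0.0, 2.99, 399.99, 0.0, 0.0],
    "Reviews":  [159, 967, 40, 78_000_000, 5_000],
    "Installs": [10_000, 500_000, 1_000, 1_000_000_000, 1_000_000_000],
})

# Cut-offs used in the checks above.
PRICE_MAX = 200
REVIEWS_MAX = 25_000_000
INSTALLS_MAX = 500_000_000

keep = (
    (toy["Price"] <= PRICE_MAX)
    & (toy["Reviews"] <= REVIEWS_MAX)
    & (toy["Installs"] <= INSTALLS_MAX)
)
trimmed = toy[keep]
print(len(trimmed))
```

Note that the comparison is strict in the checks (`>`), so apps with exactly 500,000,000 installs stay in the data.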
# Bivariate Analysis
App Category
Rating \
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
4.1
1 Coloring book moana ART_AND_DESIGN
3.9
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN
4.7
3 Sketch - Draw & Paint ART_AND_DESIGN
4.5
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN
4.3
Android Ver
0 4.0.3 and up
1 4.0.3 and up
2 4.0.3 and up
3 4.2 and up
4 4.4 and up
df.nunique()
App 8146
Category 33
Rating 39
Reviews 5933
Size 410
Installs 17
Type 2
Price 68
Content_Rating 6
Genres 115
Last Updated 1299
Current Ver 2590
Android Ver 31
dtype: int64
df.shape
(8811, 9)
# Data - Preprocessing
df.skew(numeric_only=True)  # restrict to numeric columns; required on recent pandas versions
Rating -1.825597
Reviews 8.388563
Size 1.424330
Installs 8.657071
Price 16.445954
dtype: float64
df.kurtosis(numeric_only=True)
Rating 5.586197
Reviews 90.948655
Size 1.419297
Installs 86.881887
Price 469.767060
dtype: float64
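df.columns
# Work on a copy so that df keeps the raw counts (df.Installs is unchanged below)
newdf = df.copy()
# log1p compresses the long right tails of Installs and Reviews
newdf.Installs = np.log1p(newdf.Installs)
newdf.Reviews = np.log1p(newdf.Reviews)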
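To see why `log1p` is applied to the heavily right-skewed count columns, the sketch below compares skewness before and after the transform on a small made-up series (the values only mimic the spread of install counts):

```python
import numpy as np
import pandas as pd

# Made-up install counts spanning many orders of magnitude.
installs = pd.Series([5, 10, 100, 1_000, 10_000, 1_000_000, 500_000_000], dtype=float)

logged = np.log1p(installs)        # log(1 + x), defined at x = 0
skew_before = installs.skew()
skew_after = logged.skew()
restored = np.expm1(logged)        # inverse transform recovers the original counts
print(skew_before, skew_after)
```

`log1p` is used instead of a plain logarithm so that a count of zero maps to zero rather than to negative infinity, and `expm1` undoes the transform when values need to be reported on the original scale.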
df.Installs.value_counts()
1000000 1485
10000000 1132
100000 1109
10000 982
1000 693
5000000 683
500000 515
50000 460
5000 422
100000000 366
100 302
50000000 272
500 199
10 67
500000000 60
50 56
5 8
Name: Installs, dtype: int64
newdf.skew(numeric_only=True)
Rating -1.825597
Reviews -0.039850
Size 1.424330
Installs -0.312952
Price 16.445954
dtype: float64
newdf.kurtosis(numeric_only=True)
Rating 5.586197
Reviews -0.922165
Size 1.419297
Installs -0.622031
Price 469.767060
dtype: float64
newdf
newdf2 = pd.get_dummies(newdf)
newdf2.shape
(8811, 161)
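`pd.get_dummies` keeps one indicator column per level, so the indicators of each encoded column sum to 1 in every row. That makes the design matrix rank-deficient, which is one plausible source of the singular-matrix warning in the OLS notes further down. A minimal sketch of the usual remedy, assuming one level per column is dropped and an explicit constant is added (toy data; `const` is a hypothetical column name):

```python
import numpy as np
import pandas as pd

# Toy categorical frame (illustrative values only).
toy = pd.DataFrame({
    "Type": ["Free", "Paid", "Free", "Free", "Paid", "Free"],
    "Content_Rating": ["Everyone", "Teen", "Teen", "Everyone", "Everyone", "Teen"],
})

full = pd.get_dummies(toy).astype(float)                       # all levels kept
reduced = pd.get_dummies(toy, drop_first=True).astype(float)   # one level dropped per column
reduced.insert(0, "const", 1.0)                                # explicit intercept

rank_full = np.linalg.matrix_rank(full.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())
print(full.shape[1], rank_full, reduced.shape[1], rank_reduced)
```

With `drop_first=True` each coefficient is read relative to the dropped reference level, which also makes the individual estimates easier to interpret.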
newdf2.columns
X = newdf2.drop(columns='Rating')  # every column except the target
Y = newdf2['Rating']
X.shape
(8811, 160)
Y.shape
(8811,)
x_train.shape
(6167, 160)
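The split that creates `x_train` is not shown. The 6167 training rows out of 8811 are consistent with a 70/30 split, so a minimal sketch with scikit-learn's `train_test_split` follows; `test_size=0.3` is inferred from the shapes and `random_state=42` is purely an assumption:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random toy features and target (illustrative only).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_toy = pd.Series(rng.normal(size=100), name="Rating")

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)
print(x_train.shape, x_test.shape)
```

Fixing the random seed keeps the reported metrics reproducible between runs.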
# Evaluation metrics
import statsmodels.api as sm
model = sm.OLS(y_train, x_train.astype(float))
model = model.fit()
model.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Rating   R-squared:                       0.175
Model:                            OLS   Adj. R-squared:                  0.159
Method:                 Least Squares   F-statistic:                     10.54
Date:                Sat, 16 Dec 2023   Prob (F-statistic):          5.91e-172
Time:                        07:25:18   Log-Likelihood:                -4093.6
No. Observations:                6167   AIC:                             8433.
Df Residuals:                    6044   BIC:                             9261.
Df Model:                         122
===========================================================================================================
                                                     coef   std err         t     P>|t|    [0.025    0.975]
-----------------------------------------------------------------------------------------------------------
Reviews                                            0.1724     0.006    28.418     0.000     0.160     0.184
Size                                           -3.571e-07  3.27e-07    -1.092     0.275 -9.98e-07  2.84e-07
Installs                                          -0.1488     0.006   -24.360     0.000    -0.161    -0.137
Price                                             -0.0046     0.004    -1.152     0.250    -0.012     0.003
Category_ART_AND_DESIGN                            0.1894     0.164     1.157     0.247    -0.131     0.510
Category_AUTO_AND_VEHICLES                         0.1690     0.037     4.602     0.000     0.097     0.241
Category_BEAUTY                                    0.2483     0.043     5.840     0.000     0.165     0.332
Category_BOOKS_AND_REFERENCE                       0.2305     0.024     9.699     0.000     0.184     0.277
Category_BUSINESS                                  0.1468     0.020     7.278     0.000     0.107     0.186
Category_COMICS                                    0.4136     0.156     2.654     0.008     0.108     0.719
Category_COMMUNICATION                             0.0941     0.020     4.771     0.000     0.055     0.133
Category_DATING                                    0.0330     0.030     1.093     0.274    -0.026     0.092
Category_EDUCATION                                 0.2499     0.064     3.928     0.000     0.125     0.375
Category_ENTERTAINMENT                             0.2300     0.065     3.535     0.000     0.102     0.358
Category_EVENTS                                    0.2927     0.039     7.539     0.000     0.217     0.369
Category_FAMILY                                    0.3064     0.036     8.617     0.000     0.237     0.376
Category_FINANCE                                   0.1025     0.019     5.292     0.000     0.065     0.141
Category_FOOD_AND_DRINK                            0.0828     0.030     2.738     0.006     0.024     0.142
Category_GAME                                      0.3896     0.055     7.059     0.000     0.281     0.498
Category_HEALTH_AND_FITNESS                        0.1801     0.021     8.656     0.000     0.139     0.221
Category_HOUSE_AND_HOME                            0.1583     0.035     4.480     0.000     0.089     0.227
Category_LIBRARIES_AND_DEMO                        0.1928     0.038     5.132     0.000     0.119     0.266
Category_LIFESTYLE                                 0.0949     0.155     0.614     0.539    -0.208     0.398
Category_MAPS_AND_NAVIGATION                       0.0646     0.026     2.490     0.013     0.014     0.115
Category_MEDICAL                                   0.1716     0.020     8.462     0.000     0.132     0.211
Category_NEWS_AND_MAGAZINES                        0.0957     0.023     4.238     0.000     0.051     0.140
Category_PARENTING                                 0.2524     0.118     2.147     0.032     0.022     0.483
Category_PERSONALIZATION                           0.2026     0.019    10.560     0.000     0.165     0.240
Category_PHOTOGRAPHY                               0.1148     0.019     5.891     0.000     0.077     0.153
Category_PRODUCTIVITY                              0.1440     0.019     7.630     0.000     0.107     0.181
Category_SHOPPING                                  0.1491     0.022     6.737     0.000     0.106     0.192
Category_SOCIAL                                    0.1324     0.021     6.174     0.000     0.090     0.174
Category_SPORTS                                    0.4164     0.161     2.591     0.010     0.101     0.731
Category_TOOLS                                     0.1847     0.154     1.198     0.231    -0.117     0.487
Category_TRAVEL_AND_LOCAL                          0.1792     0.155     1.160     0.246    -0.124     0.482
Category_VIDEO_PLAYERS                             0.0862     0.025     3.414     0.001     0.037     0.136
Category_WEATHER                                   0.1344     0.035     3.841     0.000     0.066     0.203
Type_Free                                          3.0865     0.055    56.271     0.000     2.979     3.194
Type_Paid                                          3.0461     0.056    53.928     0.000     2.935     3.157
Content_Rating_Adults_only_18+                     1.2008     0.294     4.083     0.000     0.624     1.777
Content_Rating_Everyone                            1.2399     0.053    23.439     0.000     1.136     1.344
Content_Rating_Everyone_10+                        1.2376     0.060    20.772     0.000     1.121     1.354
Content_Rating_Mature_17+                          1.2226     0.060    20.397     0.000     1.105     1.340
Content_Rating_Teen                                1.2317     0.054    22.674     0.000     1.125     1.338
Content_Rating_Unrated                          4.115e-15  9.57e-15     0.430     0.667 -1.47e-14  2.29e-14
Genres_Action                                     -0.1515     0.064    -2.370     0.018    -0.277    -0.026
Genres_Action_Action_&_Adventure                   0.0301     0.152     0.198     0.843    -0.268     0.328
Genres_Adventure                                  -0.1986     0.086    -2.299     0.022    -0.368    -0.029
Genres_Adventure_Action_&_Adventure               -0.0408     0.148    -0.275     0.783    -0.331     0.249
Genres_Adventure_Brain_Games                       0.2074     0.470     0.441     0.659    -0.715     1.129
Genres_Adventure_Education                        -0.2690     0.334    -0.807     0.420    -0.923     0.385
Genres_Arcade                                     -0.1114     0.069    -1.624     0.104    -0.246     0.023
Genres_Arcade_Action_&_Adventure                  -0.1383     0.161    -0.861     0.389    -0.453     0.177
Genres_Arcade_Pretend_Play                      -4.66e-16  1.27e-15    -0.366     0.715 -2.96e-15  2.03e-15
Genres_Art_&_Design                                0.3627     0.178     2.037     0.042     0.014     0.712
Genres_Art_&_Design_Creativity                    -0.3297     0.375    -0.879     0.379    -1.065     0.405
Genres_Art_&_Design_Pretend_Play                   0.1564     0.375     0.417     0.677    -0.579     0.892
Genres_Auto_&_Vehicles                             0.1690     0.037     4.602     0.000     0.097     0.241
Genres_Beauty                                      0.2483     0.043     5.840     0.000     0.165     0.332
Genres_Board                                      -0.0749     0.105    -0.711     0.477    -0.281     0.132
Genres_Board_Action_&_Adventure                   -0.2419     0.333    -0.726     0.468    -0.895     0.411
Genres_Board_Brain_Games                           0.0839     0.146     0.576     0.565    -0.202     0.369
Genres_Board_Pretend_Play                          0.6364     0.470     1.353     0.176    -0.286     1.559
Genres_Books_&_Reference                           0.2305     0.024     9.699     0.000     0.184     0.277
Genres_Books_&_Reference_Education                -0.0438     0.333    -0.132     0.895    -0.697     0.609
Genres_Business                                    0.1468     0.020     7.278     0.000     0.107     0.186
Genres_Card                                       -0.2819     0.098    -2.885     0.004    -0.473    -0.090
Genres_Card_Action_&_Adventure                  1.706e-16  2.46e-16     0.694     0.488 -3.11e-16  6.52e-16
Genres_Card_Brain_Games                            0.3636     0.470     0.774     0.439    -0.557     1.285
Genres_Casino                                     -0.0752     0.107    -0.704     0.481    -0.284     0.134
Genres_Casual                                     -0.1864     0.053    -3.524     0.000    -0.290    -0.083
Genres_Casual_Action_&_Adventure                  -0.1409     0.139    -1.010     0.312    -0.414     0.133
Genres_Casual_Brain_Games                          0.3627     0.169     2.148     0.032     0.032     0.694
Genres_Casual_Creativity                           0.0838     0.194     0.432     0.666    -0.296     0.464
Genres_Casual_Education                            0.0501     0.333     0.150     0.880    -0.603     0.703
Genres_Casual_Music_&_Video                        0.0543     0.470     0.115     0.908    -0.867     0.975
Genres_Casual_Pretend_Play                        -0.0974     0.100    -0.977     0.328    -0.293     0.098
Genres_Comics                                     -0.1835     0.169    -1.087     0.277    -0.514     0.147
Genres_Comics_Creativity                           0.5971     0.315     1.894     0.058    -0.021     1.215
Genres_Communication                               0.0941     0.020     4.771     0.000     0.055     0.133
Genres_Communication_Creativity                 1.622e-16  1.75e-16     0.929     0.353  -1.8e-16  5.04e-16
Genres_Dating                                      0.0330     0.030     1.093     0.274    -0.026     0.092
Genres_Education                                   0.1673     0.045     3.690     0.000     0.078     0.256
Genres_Education_Action_&_Adventure                0.2533     0.237     1.070     0.285    -0.211     0.718
Genres_Education_Brain_Games                       0.0522     0.240     0.218     0.828    -0.419     0.523
Genres_Education_Creativity                        0.4886     0.237     2.058     0.040     0.023     0.954
Genres_Education_Education                         0.0966     0.096     1.011     0.312    -0.091     0.284
Genres_Education_Music_&_Video                    -0.3034     0.470    -0.646     0.518    -1.224     0.617
Genres_Education_Pretend_Play                      0.2577     0.132     1.957     0.050    -0.000     0.516
Genres_Educational                                -0.1788     0.108    -1.656     0.098    -0.390     0.033
Genres_Educational_Action_&_Adventure              0.0094     0.333     0.028     0.978    -0.644     0.662
Genres_Educational_Brain_Games                     0.0853     0.212     0.401     0.688    -0.331     0.502
Genres_Educational_Creativity                     -0.1173     0.333    -0.352     0.725    -0.770     0.536
Genres_Educational_Education                       0.0945     0.110     0.857     0.391    -0.122     0.311
Genres_Educational_Pretend_Play                    0.0416     0.135     0.309     0.757    -0.222     0.305
Genres_Entertainment                              -0.0735     0.044    -1.660     0.097    -0.160     0.013
Genres_Entertainment_Action_&_Adventure            0.1184     0.333     0.356     0.722    -0.535     0.771
Genres_Entertainment_Brain_Games                   0.0791     0.181     0.437     0.662    -0.276     0.434
Genres_Entertainment_Creativity                    0.3138     0.273     1.148     0.251    -0.222     0.849
Genres_Entertainment_Education                 -2.344e-16  1.78e-16    -1.316     0.188 -5.84e-16  1.15e-16
Genres_Entertainment_Music_&_Video                -0.0525     0.123    -0.427     0.669    -0.293     0.188
Genres_Entertainment_Pretend_Play                 -0.3276     0.333    -0.983     0.325    -0.981     0.325
Genres_Events                                      0.2927     0.039     7.539     0.000     0.217     0.369
Genres_Finance                                     0.1025     0.019     5.292     0.000     0.065     0.141
Genres_Food_&_Drink                                0.0828     0.030     2.738     0.006     0.024     0.142
Genres_Health_&_Fitness                            0.1801     0.021     8.656     0.000     0.139     0.221
Genres_Health_&_Fitness_Action_&_Adventure        -0.4306     0.470    -0.916     0.360    -1.352     0.491
Genres_Health_&_Fitness_Education                  0.2147     0.470     0.457     0.648    -0.706     1.136
Genres_House_&_Home                                0.1583     0.035     4.480     0.000     0.089     0.227
Genres_Libraries_&_Demo                            0.1928     0.038     5.132     0.000     0.119     0.266
Genres_Lifestyle                                   0.1262     0.163     0.775     0.438    -0.193     0.445
Genres_Lifestyle_Education                     -7.666e-17  6.36e-17    -1.205     0.228 -2.01e-16   4.8e-17
Genres_Lifestyle_Pretend_Play                     -0.0312     0.315    -0.099     0.921    -0.648     0.586
Genres_Maps_&_Navigation                           0.0646     0.026     2.490     0.013     0.014     0.115
Genres_Medical                                     0.1716     0.020     8.462     0.000     0.132     0.211
Genres_Music                                      -0.2048     0.141    -1.449     0.147    -0.482     0.072
Genres_Music_&_Audio_Music_&_Video                 0.3781     0.470     0.805     0.421    -0.543     1.299
Genres_Music_Music_&_Video                         0.2404     0.333     0.722     0.470    -0.412     0.893
Genres_News_&_Magazines                            0.0957     0.023     4.238     0.000     0.051     0.140
Genres_Parenting                                   0.2384     0.138     1.726     0.084    -0.032     0.509
Genres_Parenting_Brain_Games                      -0.1344     0.386    -0.348     0.728    -0.892     0.623
Genres_Parenting_Education                        -0.0719     0.244    -0.294     0.769    -0.551     0.407
Genres_Parenting_Music_&_Video                     0.2203     0.220     1.001     0.317    -0.211     0.652
Genres_Personalization                             0.2026     0.019    10.560     0.000     0.165     0.240
Genres_Photography                                 0.1148     0.019     5.891     0.000     0.077     0.153
Genres_Productivity                                0.1440     0.019     7.630     0.000     0.107     0.181
Genres_Puzzle                                      0.0741     0.062     1.186     0.236    -0.048     0.197
Genres_Puzzle_Action_&_Adventure                   0.0544     0.273     0.200     0.842    -0.480     0.589
Genres_Puzzle_Brain_Games                          0.0895     0.139     0.642     0.521    -0.184     0.363
Genres_Puzzle_Creativity                           0.0742     0.333     0.223     0.824    -0.579     0.727
Genres_Puzzle_Education                            0.5283     0.470     1.125     0.261    -0.393     1.449
Genres_Racing                                     -0.1413     0.081    -1.741     0.082    -0.300     0.018
Genres_Racing_Action_&_Adventure                   0.1638     0.126     1.301     0.193    -0.083     0.411
Genres_Racing_Pretend_Play                         0.6221     0.470     1.323     0.186    -0.299     1.544
Genres_Role_Playing                               -0.0791     0.068    -1.161     0.246    -0.213     0.054
Genres_Role_Playing_Action_&_Adventure             0.2145     0.471     0.455     0.649    -0.709     1.138
Genres_Role_Playing_Brain_Games                    0.0286     0.470     0.061     0.951    -0.892     0.950
Genres_Role_Playing_Pretend_Play                  -0.1079     0.273    -0.395     0.693    -0.643     0.427
Genres_Shopping                                    0.1491     0.022     6.737     0.000     0.106     0.192
Genres_Simulation                                 -0.1484     0.054    -2.764     0.006    -0.254    -0.043
Genres_Simulation_Action_&_Adventure               0.1695     0.169     1.004     0.315    -0.161     0.501
Genres_Simulation_Education                       -0.0395     0.273    -0.145     0.885    -0.575     0.495
Genres_Simulation_Pretend_Play                    -0.1765     0.333    -0.530     0.596    -0.830     0.477
Genres_Social                                      0.1324     0.021     6.174     0.000     0.090     0.174
Genres_Sports                                     -0.1774     0.164    -1.078     0.281    -0.500     0.145
Genres_Sports_Action_&_Adventure                  -0.0561     0.273    -0.205     0.837    -0.591     0.479
Genres_Strategy                                   -0.1698     0.065    -2.596     0.009    -0.298    -0.042
Genres_Strategy_Action_&_Adventure                 0.1935     0.333     0.580     0.562    -0.460     0.847
Genres_Strategy_Creativity                        -0.1839     0.470    -0.391     0.696    -1.105     0.737
Genres_Strategy_Education                          0.5190     0.470     1.105     0.269    -0.402     1.440
Genres_Tools                                      -0.0069     0.162    -0.043     0.966    -0.324     0.310
Genres_Tools_Education                             0.1916     0.314     0.610     0.542    -0.424     0.808
Genres_Travel_&_Local                              0.0400     0.163     0.245     0.806    -0.280     0.360
Genres_Travel_&_Local_Action_&_Adventure           0.1392     0.314     0.443     0.658    -0.477     0.756
Genres_Trivia                                     -0.3725     0.113    -3.290     0.001    -0.594    -0.151
Genres_Video_Players_&_Editors                     0.0862     0.025     3.414     0.001     0.037     0.136
Genres_Video_Players_&_Editors_Creativity         -0.2947     0.470    -0.627     0.530    -1.216     0.626
Genres_Video_Players_&_Editors_Music_&_Video      -0.1985     0.333    -0.595     0.552    -0.852     0.455
Genres_Weather                                     0.1344     0.035     3.841     0.000     0.066     0.203
Genres_Word                                       -0.0164     0.121    -0.135     0.892    -0.254     0.221
==============================================================================
Omnibus:                     2383.378   Durbin-Watson:                   2.014
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            15000.327
Skew:                          -1.721   Prob(JB):                         0.00
Kurtosis:                       9.821   Cond. No.                     1.87e+21
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.81e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""
F-statistic: the F-test checks whether the linear regression model fits the data better
than a model that contains no independent variables.
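As a rough cross-check, the F-statistic can be recomputed from the rounded figures in the summary above using the standard relation F = (R² / df_model) / ((1 − R²) / df_resid). The inputs below are copied from the summary; because R² is rounded to three decimals, the result only approximates the reported 10.54:

```python
# Figures copied from the OLS summary (R-squared is rounded to three decimals).
r_squared = 0.175
df_model = 122
df_resid = 6044

# Explained variance per model degree of freedom, divided by
# unexplained variance per residual degree of freedom.
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)
print(round(f_stat, 2))
```

The value comes out at roughly 10.5. With 6044 residual degrees of freedom even a modest R² yields a very small p-value, so a significant F-test here says the predictors jointly help, not that the fit is strong.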
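from sklearn.metrics import r2_score
# Cast to float so the test matrix matches the dtype used when fitting
y_test_predicted = model.predict(x_test.astype(float))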
RMSE is 0.5019999148264612
r2_score(y_test, y_test_predicted)
0.12557936638825973
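The cell that prints the RMSE is not shown. A minimal sketch of how both metrics are defined, computed with NumPy on made-up ratings (the arrays are illustrative; the R² formula is the same one `r2_score` uses):

```python
import numpy as np

# Made-up true and predicted ratings (illustrative only).
y_true = np.array([4.1, 3.9, 4.7, 4.5, 4.3])
y_pred = np.array([4.0, 4.2, 4.4, 4.5, 4.1])

# Root mean squared error: typical size of a prediction error, in rating points.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print("RMSE is", rmse)
print("R2 is", r2)
```

Read together with the outputs above, an RMSE near 0.5 on a 1 to 5 scale and a test R² near 0.13 indicate that the model explains only a small share of the variation in ratings.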
# conf_int() returns the lower bound in column 0 and the upper bound in column 1
ci = model.conf_int()
data = {'coef': model.params, 'std err': model.bse, 't': model.tvalues,
        'P>|t|': model.pvalues,
        '[0.025': ci[0], '0.975]': ci[1]}
olssum = pd.DataFrame(data).round(3)
olssum
                                              [0.025  0.975]
Reviews                                        0.160   0.184
Size                                          -0.000   0.000
Installs                                      -0.161  -0.137
Price                                         -0.012   0.003
Category_ART_AND_DESIGN                       -0.131   0.510
...                                              ...     ...
Genres_Video_Players_&_Editors                 0.037   0.136
Genres_Video_Players_&_Editors_Creativity     -1.216   0.626
Genres_Video_Players_&_Editors_Music_&_Video  -0.852   0.455
Genres_Weather                                 0.066   0.203
Genres_Word                                   -0.254   0.221
olssum[olssum['P>|t|']<0.05].index
Index(['Reviews', 'Installs', 'Category_AUTO_AND_VEHICLES', 'Category_BEAUTY',
       'Category_BOOKS_AND_REFERENCE', 'Category_BUSINESS', 'Category_COMICS',
       'Category_COMMUNICATION', 'Category_EDUCATION',
       'Category_ENTERTAINMENT', 'Category_EVENTS', 'Category_FAMILY',
       'Category_FINANCE', 'Category_FOOD_AND_DRINK', 'Category_GAME',
       'Category_HEALTH_AND_FITNESS', 'Category_HOUSE_AND_HOME',
       'Category_LIBRARIES_AND_DEMO', 'Category_MAPS_AND_NAVIGATION',
       'Category_MEDICAL', 'Category_NEWS_AND_MAGAZINES', 'Category_PARENTING',
       'Category_PERSONALIZATION', 'Category_PHOTOGRAPHY',
       'Category_PRODUCTIVITY', 'Category_SHOPPING', 'Category_SOCIAL',
       'Category_SPORTS', 'Category_VIDEO_PLAYERS', 'Category_WEATHER',
       'Type_Free', 'Type_Paid', 'Content_Rating_Adults_only_18+',
       'Content_Rating_Everyone', 'Content_Rating_Everyone_10+',
       'Content_Rating_Mature_17+', 'Content_Rating_Teen', 'Genres_Action',
       'Genres_Adventure', 'Genres_Art_&_Design', 'Genres_Auto_&_Vehicles',
       'Genres_Beauty', 'Genres_Books_&_Reference', 'Genres_Business',
       'Genres_Card', 'Genres_Casual', 'Genres_Casual_Brain_Games',
       'Genres_Communication', 'Genres_Education',
       'Genres_Education_Creativity', 'Genres_Events', 'Genres_Finance',
       'Genres_Food_&_Drink', 'Genres_Health_&_Fitness', 'Genres_House_&_Home',
       'Genres_Libraries_&_Demo', 'Genres_Maps_&_Navigation', 'Genres_Medical',
       'Genres_News_&_Magazines', 'Genres_Personalization',
       'Genres_Photography', 'Genres_Productivity', 'Genres_Shopping',
       'Genres_Simulation', 'Genres_Social', 'Genres_Strategy',
       'Genres_Trivia', 'Genres_Video_Players_&_Editors', 'Genres_Weather'],
      dtype='object')
df3.shape
(8811, 70)
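How `df3` was built is not shown. Its 70 columns match the 69 features with p < 0.05 listed above plus the `Rating` target, so a minimal sketch under that assumption follows (toy frames; the p-values for the four numeric features are copied from the summary, everything else is illustrative):

```python
import pandas as pd

# p-values of the four numeric features, as reported in the OLS summary.
toy_summary = pd.DataFrame(
    {"P>|t|": [0.000, 0.275, 0.000, 0.250]},
    index=["Reviews", "Size", "Installs", "Price"],
)

# Toy modelling frame (illustrative values only).
toy_data = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.7],
    "Reviews": [5.1, 6.9, 11.4],
    "Size": [19000.0, 14000.0, 8700.0],
    "Installs": [9.2, 13.1, 15.4],
    "Price": [0.0, 0.0, 0.0],
})

# Keep the features whose coefficient is significant at the 5% level, plus the target.
significant = toy_summary[toy_summary["P>|t|"] < 0.05].index
subset = toy_data[list(significant) + ["Rating"]]
print(list(subset.columns))
```

Selecting features by p-value from a model with a near-singular design matrix is fragile, because collinear dummies can swap significance between them, so the retained set is best treated as a starting point.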