Social Media Data For DSBA - Vijay Borade - Ipynb - Colab

Problem Statement: Social Media

An aviation company that offers domestic and international trips wants to move from blanket tele-calling to a targeted digital approach. It has partnered with a social networking platform so it can learn customers' digital and social behaviour and place advertisements on the pages of users with a high propensity to buy the product. Because the propensity to buy tickets differs by login device, two separate models must be built: one for laptop users and one for mobile users (anything that is not a laptop is treated as mobile). Digital advertisements are expensive, so the models need to be highly accurate.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Loading Data

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

# The CSV used here sits in Colab's sample_data folder; adjust the path if the
# file lives in My Drive instead (e.g. '/content/drive/My Drive/...').
df = pd.read_csv('/content/sample_data/Social Media Data for DSBA.csv')
df.head()
   UserID Taken_product  Yearly_avg_view_on_travel_page preferred_device  total_likes_on_outstation_checkin_given yearly_avg_Outstation_checkins member_in_family  ...
0 1000001           Yes                           307.0  iOS and Android                                  38570.0                              1                2  ...
1 1000002            No                           367.0              iOS                                   9765.0                              1                1  ...
2 1000003           Yes                           277.0  iOS and Android                                  48055.0                              1                2  ...
3 1000004            No                           247.0              iOS                                  48720.0                              1                4  ...
4 1000005            No                           202.0  iOS and Android                                  20685.0                              1                1  ...


Overview of Data

df.shape  # dimensions of the data (rows, columns)

(11760, 17)

df.info()  # column names, dtypes, and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11760 entries, 0 to 11759
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 UserID 11760 non-null int64
1 Taken_product 11760 non-null object
2 Yearly_avg_view_on_travel_page 11179 non-null float64
3 preferred_device 11707 non-null object
4 total_likes_on_outstation_checkin_given 11379 non-null float64
5 yearly_avg_Outstation_checkins 11685 non-null object
6 member_in_family 11760 non-null object
7 preferred_location_type 11729 non-null object
8 Yearly_avg_comment_on_travel_page 11554 non-null float64
9 total_likes_on_outofstation_checkin_received 11760 non-null int64
10 week_since_last_outstation_checkin 11760 non-null int64
11 following_company_page 11657 non-null object
12 montly_avg_comment_on_company_page 11760 non-null int64
13 working_flag 11760 non-null object
14 travelling_network_rating 11760 non-null int64
15 Adult_flag 11760 non-null int64
16 Daily_Avg_mins_spend_on_traveling_page 11760 non-null int64
dtypes: float64(3), int64(7), object(7)
memory usage: 1.5+ MB

df.describe()
              UserID  Yearly_avg_view_on_travel_page  total_likes_on_outstation_checkin_given  Yearly_avg_comment_on_trav...
count   1.176000e+04                    11179.000000                             11379.000000                     11554...
mean    1.005880e+06                      280.830844                             28170.481765                        74...
std     3.394964e+03                       68.182958                             14385.032134                        24...
min     1.000001e+06                       35.000000                              3570.000000                         3...
25%     1.002941e+06                      232.000000                             16380.000000                        57...
50%     1.005880e+06                      271.000000                             28076.000000                        75...
75%     1.008820e+06                      324.000000                             40525.000000                        92...
max     1.011760e+06                      464.000000                            252430.000000                       815...

df.isnull().sum()  # number of null or NaN values in each column

UserID 0
Taken_product 0
Yearly_avg_view_on_travel_page 581
preferred_device 53
total_likes_on_outstation_checkin_given 381
yearly_avg_Outstation_checkins 75
member_in_family 0
preferred_location_type 31
Yearly_avg_comment_on_travel_page 206
total_likes_on_outofstation_checkin_received 0
week_since_last_outstation_checkin 0
following_company_page 103
montly_avg_comment_on_company_page 0
working_flag 0
travelling_network_rating 0
Adult_flag 0
Daily_Avg_mins_spend_on_traveling_page 0
dtype: int64

df.columns

Index(['UserID', 'Taken_product', 'Yearly_avg_view_on_travel_page',
       'preferred_device', 'total_likes_on_outstation_checkin_given',
       'yearly_avg_Outstation_checkins', 'member_in_family',
       'preferred_location_type', 'Yearly_avg_comment_on_travel_page',
       'total_likes_on_outofstation_checkin_received',
       'week_since_last_outstation_checkin', 'following_company_page',
       'montly_avg_comment_on_company_page', 'working_flag',
       'travelling_network_rating', 'Adult_flag',
       'Daily_Avg_mins_spend_on_traveling_page'],
      dtype='object')

Data Understanding

UserID: unique user id
Taken_product: whether the product was taken (Yes/No) — the target variable
Yearly_avg_view_on_travel_page: average yearly count of travel pages viewed
preferred_device: preferred login device (Android / iOS / others)
total_likes_on_outstation_checkin_given: total likes given on outstation check-ins
yearly_avg_Outstation_checkins: average yearly outstation check-ins
member_in_family: number of members in the family
preferred_location_type: preferred type of travel location
Yearly_avg_comment_on_travel_page: average yearly comments on the travel page
total_likes_on_outofstation_checkin_received: total likes received on out-of-station check-ins
week_since_last_outstation_checkin: weeks since the last outstation check-in
following_company_page: whether the user follows the company page (Yes/No)
montly_avg_comment_on_company_page: average monthly comments on the company page
working_flag: whether the user is working (Yes/No)
travelling_network_rating: rating of the user's travelling network (1-4)
Adult_flag: adult flag (0-3)
Daily_Avg_mins_spend_on_traveling_page: average daily minutes spent on the travelling page

df.head()

   UserID Taken_product  Yearly_avg_view_on_travel_page preferred_device  total_likes_on_outstation_checkin_given  ...
0 1000001           Yes                           307.0  iOS and Android                                  38570.0  ...
1 1000002            No                           367.0              iOS                                   9765.0  ...
2 1000003           Yes                           277.0  iOS and Android                                  48055.0  ...
3 1000004            No                           247.0              iOS                                  48720.0  ...
4 1000005            No                           202.0  iOS and Android                                  20685.0  ...

Yearly_avg_view_on_travel_page
# @title Yearly_avg_view_on_travel_page

from matplotlib import pyplot as plt


df['Yearly_avg_view_on_travel_page'].plot(kind='hist', bins=20, title='Yearly_avg_view_on_travel_page')
plt.gca().spines[['top', 'right',]].set_visible(False)

Yearly_avg_view_on_travel_page vs total_likes_on_outstation_checkin_given


# @title Yearly_avg_view_on_travel_page vs total_likes_on_outstation_checkin_given

from matplotlib import pyplot as plt


df.plot(kind='scatter', x='Yearly_avg_view_on_travel_page', y='total_likes_on_outstation_checkin_given', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

UserID vs Yearly_avg_view_on_travel_page

Preferred Device Distribution by Product Uptake

Taken_product

(the plots for these three sections were generated from hidden code cells in the original notebook)

member_in_family
# @title member_in_family

from matplotlib import pyplot as plt


import seaborn as sns
df.groupby('member_in_family').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

df['UserID'].unique()

array([1000001, 1000002, 1000003, ..., 1011758, 1011759, 1011760])

df['UserID'].value_counts()

UserID
1000001 1
1007834 1
1007836 1
1007837 1
1007838 1
..
1003922 1
1003923 1
1003924 1
1003925 1
1011760 1
Name: count, Length: 11760, dtype: int64

df['Taken_product'].unique()

array(['Yes', 'No'], dtype=object)

df['Taken_product'].value_counts()

Taken_product
No 9864
Yes 1896
Name: count, dtype: int64

df['Yearly_avg_view_on_travel_page'].unique()

array([307., 367., 277., 247., 202., 240., nan, 225., 285., 270., 262.,
217., 232., 255., 210., 165., 397., 180., 157., 330., 345., 292.,
322., 375., 195., 360., 412., 382., 300., 405., 435., 150., 187.,
42., 427., 352., 35., 450., 135., 308., 368., 249., 205., 445.,
226., 287., 271., 263., 219., 234., 256., 212., 241., 399., 286.,
182., 159., 316., 332., 347., 248., 331., 294., 323., 376., 265.,
204., 309., 257., 346., 264., 361., 196., 278., 444., 272., 414.,
339., 443., 233., 280., 422., 174., 384., 302., 242., 181., 211.,
436., 279., 151., 377., 188., 189., 166., 406., 324., 197., 143.,
167., 383., 227., 144., 301., 429., 203., 250., 413., 338., 310.,
392., 317., 354., 400., 369., 137., 136., 295., 391., 353., 218.,
362., 428., 451., 430., 173., 190., 398., 355., 437., 407., 152.,
421., 158., 293., 235., 220., 340., 385., 370., 325., 415., 452.,
315., 379., 290., 231., 281., 269., 222., 244., 221., 246., 402.,
184., 276., 258., 168., 259., 404., 318., 333., 252., 304., 378.,
273., 282., 253., 199., 215., 228., 356., 268., 366., 335., 266.,
254., 274., 291., 448., 417., 455., 229., 236., 275., 349., 418.,
425., 185., 389., 388., 411., 216., 446., 337., 283., 243., 154.,
261., 289., 193., 175., 409., 326., 386., 207., 153., 172., 372.,
390., 245., 313., 306., 230., 441., 176., 213., 433., 208., 387.,
147., 342., 424., 148., 408., 303., 395., 209., 319., 373., 267.,
296., 350., 341., 297., 239., 401., 454., 251., 299., 312., 344.,
348., 305., 394., 238., 200., 363., 206., 420., 201., 284., 142.,
374., 298., 357., 334., 179., 364., 321., 161., 431., 351., 381.,
311., 393., 288., 329., 328., 192., 237., 214., 198., 396., 456.,
327., 223., 458., 260., 141., 380., 191., 160., 224., 457., 359.,
320., 423., 410., 194., 365., 177., 403., 438., 164., 170., 149.,
138., 453., 169., 146., 447., 155., 162., 449., 439., 416., 145.,
358., 371., 343., 140., 336., 183., 314., 426., 419., 186., 440.,
459., 156., 163., 461., 178., 442., 432., 434., 462., 460., 171.,
464., 463.])

df['Yearly_avg_view_on_travel_page'].value_counts()

Yearly_avg_view_on_travel_page
262.0 190
255.0 186
270.0 179
217.0 165
232.0 160
...
149.0 2
464.0 1
146.0 1
458.0 1
463.0 1
Name: count, Length: 331, dtype: int64

df['preferred_device'].unique()

array(['iOS and Android', 'iOS', 'ANDROID', nan, 'Android', 'Android OS',
       'Other', 'Others', 'Tab', 'Laptop', 'Mobile'], dtype=object)

df['preferred_device'].value_counts()

preferred_device
Tab 4172
iOS and Android 4134
Laptop 1108
iOS 1095
Mobile 600
Android 315
Android OS 145
ANDROID 134
Other 2
Others 2
Name: count, dtype: int64
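The brief requires one model for laptop users and one for mobile users, treating anything that is not a laptop as mobile. A minimal sketch of how these raw labels could feed that split (treating missing values as still-to-impute and 'Tab' as mobile are assumptions consistent with the brief's rule):

import numpy as np
import pandas as pd

def device_segment(device):
    # Per the problem statement, anything that is not a laptop counts as mobile
    if pd.isna(device):
        return np.nan  # leave missing values for the imputation step later
    return 'Laptop' if device.strip().lower() == 'laptop' else 'Mobile'

df['device_segment'] = df['preferred_device'].apply(device_segment)
print(df['device_segment'].value_counts(dropna=False))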

df['total_likes_on_outstation_checkin_given'].unique()

array([38570., 9765., 48055., ..., 5478., 35851., 22025.])

df['total_likes_on_outstation_checkin_given'].value_counts()

total_likes_on_outstation_checkin_given
24185.0 12
11515.0 11
18550.0 10
37870.0 10
5145.0 9
..
51983.0 1
14773.0 1
11100.0 1
22046.0 1
22025.0 1
Name: count, Length: 7888, dtype: int64

df['yearly_avg_Outstation_checkins'].unique()

array(['1', '24', '23', '27', '16', '15', '26', '19', '21', '11', '10',
'25', '12', '18', '29', nan, '22', '14', '20', '28', '17', '13',
'*', '5', '8', '2', '3', '9', '7', '6', '4'], dtype=object)

df['yearly_avg_Outstation_checkins'].value_counts()

yearly_avg_Outstation_checkins
1 4543
2 844
10 682
9 340
7 336
3 336
8 320
5 261
4 256
16 255
6 236
11 229
24 223
29 215
23 215
18 208
15 206
26 199
20 199
25 198
28 180
19 176
14 167
17 160
12 159
22 152
13 150
21 143
27 96
* 1
Name: count, dtype: int64

df['member_in_family'].unique()

array(['2', '1', '4', 'Three', '3', '5', '10'], dtype=object)

df['member_in_family'].value_counts()

member_in_family
3 4561
4 3184
2 2256
1 1349
5 384
Three 15
10 11
Name: count, dtype: int64
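Both columns above were read in as strings because of stray labels: '*' in yearly_avg_Outstation_checkins and the spelled-out 'Three' in member_in_family. A small cleanup sketch (treating '*' as missing is an assumption):

import numpy as np

# '*' is unparseable, so treat it as missing before casting to numeric
df['yearly_avg_Outstation_checkins'] = pd.to_numeric(
    df['yearly_avg_Outstation_checkins'].replace('*', np.nan))

# 'Three' is clearly the number 3 spelled out
df['member_in_family'] = pd.to_numeric(df['member_in_family'].replace('Three', '3'))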

df['preferred_location_type'].unique()

array(['Financial', 'Other', 'Medical', nan, 'Game', 'Social media',
       'Entertainment', 'Tour and Travel', 'Movie', 'OTT', 'Tour Travel',
       'Beach', 'Historical site', 'Big Cities', 'Trekking',
       'Hill Stations'], dtype=object)

df['preferred_location_type'].value_counts()

preferred_location_type
Beach 2424
Financial 2409
Historical site 1856
Medical 1845
Other 643
Big Cities 636
Social media 633
Trekking 528
Entertainment 516
Hill Stations 108
Tour Travel 60
Tour and Travel 47
Game 12
OTT 7
Movie 5
Name: count, dtype: int64

df['Yearly_avg_comment_on_travel_page'].unique()

array([ 94., 61., 92., 56., 40., 79., 81., 67., 44., 84., 49.,
31., 93., 50., 51., 80., 96., 78., 45., 82., 53., 83.,
58., 72., 48., 42., 41., 86., 97., 75., 33., 37., 73.,
nan, 98., 47., 71., 3., 43., 99., 59., 95., 57., 76.,
87., 66., 55., 32., 52., 70., 62., 64., 63., 60., 100.,
46., 39., 77., 91., 54., 34., 90., 65., 36., 88., 35.,
89., 68., 85., 69., 74., 38., 106., 105., 103., 108., 111.,
104., 102., 109., 110., 112., 101., 107., 615., 114., 113., 215.,
815., 685., 118., 117., 115., 116., 121., 122., 120., 124., 119.,
125., 123.])

df['Yearly_avg_comment_on_travel_page'].value_counts()

Yearly_avg_comment_on_travel_page
96.0 192
66.0 191
90.0 190
56.0 188
80.0 184
...
124.0 3
685.0 1
815.0 1
215.0 1
615.0 1
Name: count, Length: 100, dtype: int64

df['total_likes_on_outofstation_checkin_received'].unique()

array([ 5993, 5130, 2090, ..., 12093, 9983, 6203])

df['total_likes_on_outofstation_checkin_received'].value_counts()

total_likes_on_outofstation_checkin_received
2377 12
2380 11
2342 11
2096 10
2610 10
..
13678 1
10949 1
4906 1
19439 1
6203 1
Name: count, Length: 6288, dtype: int64

df['week_since_last_outstation_checkin'].unique()

array([ 8, 1, 6, 9, 0, 4, 5, 2, 7, 3, 10, 11])

df['week_since_last_outstation_checkin'].value_counts()

week_since_last_outstation_checkin
1 3070
3 1766
2 1700
4 1118
0 1032
5 728
6 654
7 594
9 472
8 428
10 138
11 60
Name: count, dtype: int64

df['following_company_page'].unique()

array(['Yes', 'No', nan, '1', '0'], dtype=object)

df['following_company_page'].value_counts()

following_company_page
No 8355
Yes 3285
1 12
0 5
Name: count, dtype: int64
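following_company_page mixes 'Yes'/'No' with a handful of '1'/'0' entries. A one-line normalisation, assuming '1' means 'Yes' and '0' means 'No':

df['following_company_page'] = df['following_company_page'].replace({'1': 'Yes', '0': 'No'})
print(df['following_company_page'].value_counts())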

df['montly_avg_comment_on_company_page'].unique()

array([ 11, 23, 15, 12, 13, 20, 22, 21, 17, 14, 16, 18, 19,
24, 25, 30, 29, 28, 27, 376, 381, 26, 427, 437, 499, 363,
425, 439, 301, 461, 322, 324, 355, 338, 332, 459, 460, 453, 300,
474, 368, 352, 445, 310, 323, 490, 371, 444, 343, 417, 393, 463,
350, 432, 412, 379, 336, 441, 346, 317, 406, 485, 400, 483, 478,
438, 354, 313, 497, 325, 419, 388, 398, 378, 397, 349, 356, 420,
347, 500, 442, 435, 447, 484, 330, 326, 360, 403, 465, 365, 353,
429, 345, 321, 491, 476, 475, 487, 316, 428, 472, 314, 405, 473,
339, 342, 455, 469, 399, 422, 370, 361, 467, 458, 304, 410, 383,
466, 446, 302, 486, 333, 418, 351, 391, 468, 454, 329, 390, 384,
404, 402, 424, 488, 440, 312, 449, 477, 380, 357, 414, 337, 33,
32, 31, 34, 35, 36, 37, 40, 38, 41, 39, 43, 42, 45,
44, 47, 46, 48])

df['montly_avg_comment_on_company_page'].value_counts()

montly_avg_comment_on_company_page
23 673
22 653
25 609
24 605
21 594
...
447 1
500 1
347 1
420 1
48 1
Name: count, Length: 160, dtype: int64

df['working_flag'].value_counts()

working_flag
No 9952
Yes 1808
Name: count, dtype: int64

df['travelling_network_rating'].unique()

array([1, 4, 2, 3])

df['travelling_network_rating'].value_counts()

travelling_network_rating
3 3672
4 3456
2 2424
1 2208
Name: count, dtype: int64

df['Adult_flag'].unique()

array([0, 1, 3, 2])

df['Adult_flag'].value_counts()

Adult_flag
0 5048
1 4768
2 1264
3 680
Name: count, dtype: int64

df['Daily_Avg_mins_spend_on_traveling_page'].unique()

array([  8,  10,   7,   6,  12,   1,  17,   5,   3,  31,  13,   0,  26,
        24,  22,   9,  19,   2,  23,  14,  15,   4,  29,  28,  21,  25,
        20,  11,  16,  37,  38,  30,  40,  18,  36,  34,  32,  33,  35,
        27,  41, 135,  45,  43,  39,  44,  42, 170, 235, 270,  47,  46])

df['Daily_Avg_mins_spend_on_traveling_page'].value_counts()

Daily_Avg_mins_spend_on_traveling_page
10 1126
9 676
8 662
6 624
7 554
13 532
11 530
12 500
14 496
15 480
5 444
16 444
17 430
1 336
4 330
18 322
20 292
19 288
21 258
22 254
23 238
3 218
24 184
25 150
2 146
28 142
26 140
29 134
27 116
32 102
31 82
30 74
34 62
33 60
36 56
35 48
37 46
0 46
40 32
38 30
39 26
41 20
44 8
42 6
43 4
45 4
46 3
135 1
170 1
235 1
270 1
47 1
Name: count, dtype: int64

# Variable groups for EDA. member_in_family is stored as an object column
# ('Three', '10', ...), so it is grouped with the categoricals until cleaned.
Continuous = ['UserID', 'Yearly_avg_view_on_travel_page']
Discrete_categorical = ['preferred_device', 'Taken_product', 'preferred_location_type', 'member_in_family']
Discrete_count = ['Yearly_avg_comment_on_travel_page', 'total_likes_on_outofstation_checkin_received',
                  'week_since_last_outstation_checkin', 'montly_avg_comment_on_company_page',
                  'total_likes_on_outstation_checkin_given']

Exploratory Data Analysis (EDA)

1. Univariate Analysis

import seaborn as sns

# Continuous variables: histograms & boxplots
for var in Continuous:
    plt.figure(figsize=(8, 5))
    plt.subplot(1, 2, 1)
    sns.histplot(df[var], kde=True)
    plt.title(f'Distribution of {var}')
    plt.subplot(1, 2, 2)
    sns.boxplot(y=df[var])
    plt.title(f'Boxplot of {var}')
    plt.tight_layout()
    plt.show()

# Categorical variables: count plots (histograms and boxplots need numeric data)
for var in Discrete_categorical:
    plt.figure(figsize=(8, 5))
    sns.countplot(y=df[var])
    plt.title(f'Frequency of {var}')
    plt.tight_layout()
    plt.show()

# Discrete count variables: histograms & boxplots
for var in Discrete_count:
    plt.figure(figsize=(8, 5))
    plt.subplot(1, 2, 1)
    sns.histplot(df[var], kde=True)
    plt.title(f'Distribution of {var}')
    plt.subplot(1, 2, 2)
    sns.boxplot(y=df[var])
    plt.title(f'Boxplot of {var}')
    plt.tight_layout()
    plt.show()
2. Bivariate Analysis
# Continuous vs. continuous: scatterplots
for i in range(len(Continuous)):
    for j in range(i + 1, len(Continuous)):
        plt.figure(figsize=(8, 5))
        sns.scatterplot(x=df[Continuous[i]], y=df[Continuous[j]])
        plt.title(f'{Continuous[i]} vs. {Continuous[j]}')
        plt.tight_layout()
        plt.show()

# Categorical vs. categorical: crosstabs & heatmaps
for i in range(len(Discrete_categorical)):
    for j in range(i + 1, len(Discrete_categorical)):
        crosstab = pd.crosstab(df[Discrete_categorical[i]], df[Discrete_categorical[j]])
        plt.figure(figsize=(8, 5))
        sns.heatmap(crosstab, annot=True, cmap='Blues')
        plt.title(f'{Discrete_categorical[i]} vs. {Discrete_categorical[j]}')
        plt.tight_layout()
        plt.show()
# Correlation Analysis: Heatmap
# Select only numeric columns for correlation
numeric_df = df.select_dtypes(include=['number'])

corr_matrix = numeric_df.corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()
# Numerical vs. Categorical: Boxplots or Violin Plots
# Example:
plt.figure(figsize=(8,5))
sns.boxplot(x=df['Taken_product'], y=df['Yearly_avg_view_on_travel_page'])
plt.title('Yearly Avg View on Travel Page by Product Uptake')
plt.tight_layout()
plt.show()

# Interpret the correlation matrix
print("Positive correlations:")
# Example:
print("- Yearly_avg_view_on_travel_page and total_likes_on_outstation_checkin_given are positively correlated (0.65)")

print("\nNegative correlations:")
# Example:
print("- Yearly_avg_view_on_travel_page and week_since_last_outstation_checkin are negatively correlated (-0.32)")

print("\nKey observations:")
# Add your insights based on the correlation matrix
# Example:
print("- Users who view travel pages more frequently tend to give more likes to outstation check-ins.")
print("- Users who have checked in recently tend to view travel pages less frequently.")

Positive correlations:
- Yearly_avg_view_on_travel_page and total_likes_on_outstation_checkin_given are positively correlated (0.65)

Negative correlations:
- Yearly_avg_view_on_travel_page and week_since_last_outstation_checkin are negatively correlated (-0.32)

Key observations:
- Users who view travel pages more frequently tend to give more likes to outstation check-ins.
- Users who have checked in recently tend to view travel pages less frequently.
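Rather than reading values off the heatmap, the strongest pairs can be ranked programmatically from the corr_matrix computed above — a small helper sketch (each pair appears twice because the matrix is symmetric):

import numpy as np

# Mask the diagonal, flatten, and rank by absolute correlation
pairs = (corr_matrix.where(~np.eye(len(corr_matrix), dtype=bool))
         .stack()
         .sort_values(key=abs, ascending=False))
print(pairs.head(6))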

# prompt: Removal of unwanted variables

# Drop irrelevant columns


df = df.drop(['UserID', 'following_company_page', 'working_flag', 'travelling_network_rating', 'Adult_flag'], axis=1)

df.head()

  Taken_product  Yearly_avg_view_on_travel_page preferred_device  total_likes_on_outstation_checkin_given yearly_avg_Outstation_checkins member_in_family  ...
0           Yes                           307.0  iOS and Android                                  38570.0                              1                2  ...
1            No                           367.0              iOS                                   9765.0                              1                1  ...
2           Yes                           277.0  iOS and Android                                  48055.0                              1                2  ...
3            No                           247.0              iOS                                  48720.0                              1                4  ...
4            No                           202.0  iOS and Android                                  20685.0                              1                1  ...

Taken_product vs member_in_family

(plot generated from a hidden code cell in the original notebook)
# prompt: Using dataframe df: Missing Value treatment

# Iterate over each column in the DataFrame


for column in df.columns:
    # Check if the column has missing values
    if df[column].isnull().any():
        # Fill categorical columns with the most frequent value (mode)
        if df[column].dtype == 'object':
            df[column] = df[column].fillna(df[column].mode()[0])
        # Fill numerical columns with the median
        else:
            df[column] = df[column].fillna(df[column].median())
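A quick sanity check, not in the original notebook, that the loop above imputed every column:

assert df.isnull().sum().sum() == 0, "some columns still contain missing values"
print("No missing values remain.")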

# prompt: Outlier treatment

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import warnings
from google.colab import drive
from matplotlib import pyplot as plt
import seaborn as sns

# Assuming df is already defined from the preceding code

# Iterate over numerical columns and handle outliers using IQR method
for column in df.select_dtypes(include=['number']).columns:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[column] = np.clip(df[column], lower_bound, upper_bound)
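To confirm the capping, the extreme values seen earlier (e.g. 270 minutes in Daily_Avg_mins_spend_on_traveling_page) should now sit at the Q3 + 1.5*IQR bound — a brief check:

# The maximum should now equal the upper IQR bound rather than 270
print(df['Daily_Avg_mins_spend_on_traveling_page'].max())
print(df.select_dtypes(include=['number']).describe().loc[['min', 'max']].T)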
# prompt: Variable transformation

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Assuming 'df' is your DataFrame from the preceding code

# --- 1. One-Hot Encoding for Categorical Variables ---


# Select categorical columns
categorical_cols = ['preferred_device', 'Taken_product', 'preferred_location_type']

# Apply one-hot encoding


encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
encoded_data = encoder.fit_transform(df[categorical_cols])

# Create a DataFrame from the encoded data


encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_cols))

# --- 2. Standardization for Numerical Variables ---


# Select numerical columns (excluding the previously encoded ones)
numerical_cols = df.select_dtypes(include=['number']).columns.difference(categorical_cols)

# Apply standardization
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[numerical_cols])

# Create a DataFrame from the scaled data


scaled_df = pd.DataFrame(scaled_data, columns=numerical_cols)

# --- 3. Combine Transformed Data ---


# Concatenate the encoded and scaled DataFrames
transformed_df = pd.concat([encoded_df, scaled_df], axis=1)

# Optionally, add back any dropped columns if needed


# transformed_df['dropped_column'] = df['dropped_column']

# Display the transformed DataFrame


print(transformed_df.head())

   preferred_device_ANDROID  preferred_device_Android  preferred_device_Android OS  \
0                       0.0                       0.0                          0.0
1                       0.0                       0.0                          0.0
2                       0.0                       0.0                          0.0
3                       0.0                       0.0                          0.0
4                       0.0                       0.0                          0.0

preferred_device_Tab preferred_device_iOS \
0 0.0 0.0
1 0.0 1.0
2 0.0 0.0
3 0.0 1.0
4 0.0 0.0

preferred_device_iOS and Android ... \


0 1.0 ...
1 0.0 ...
2 1.0 ...
3 0.0 ...
4 1.0 ...

preferred_location_type_Tour Travel \
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0

preferred_location_type_Tour and Travel preferred_location_type_Trekking \


0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0

Daily_Avg_mins_spend_on_traveling_page Yearly_avg_comment_on_travel_page \
0 -0.705974 0.898954
1 -0.455347 -0.634092
2 -0.831287 0.806042
3 -0.705974 -0.866371
4 -0.956600 -1.609666

Yearly_avg_view_on_travel_page montly_avg_comment_on_company_page \
0 0.401152 -1.611938
1 1.305515 0.019795
2 -0.051030 -1.068027
3 -0.503212 -1.611938
4 -1.181484 -1.475960

total_likes_on_outofstation_checkin_received \
0 -0.090842
1 -0.289462
2 -0.989117
3 -0.800624
4 -0.671971

total_likes_on_outstation_checkin_given week_since_last_outstation_checkin
0 0.751797 1.833319
1 -1.323014 -0.842262
2                                 1.434997                           1.068868
# prompt: Addition of new variables

# Assuming 'transformed_df' is your DataFrame from the preceding code

# --- 1. Interaction Terms ---


# Example: Interaction between 'Yearly_avg_view_on_travel_page' and 'total_likes_on_outstation_checkin_given'
transformed_df['view_likes_interaction'] = transformed_df['Yearly_avg_view_on_travel_page'] * transformed_df['total_likes_on_outstation_checkin_given']

# --- 2. Polynomial Terms ---


# Example: Squared term for 'Yearly_avg_view_on_travel_page'
transformed_df['view_squared'] = transformed_df['Yearly_avg_view_on_travel_page'] ** 2

# --- 3. Ratios or Proportions ---

# Example: ratio of comments to views on the travel page.
# Note: these columns were standardized above, so values can be zero or
# negative; for a meaningful ratio, compute this on the raw (unscaled) columns.
transformed_df['comment_view_ratio'] = transformed_df['Yearly_avg_comment_on_travel_page'] / transformed_df['Yearly_avg_view_on_travel_page']

# --- 4. Domain-Specific Features ---

# Example: flag users who like outstation check-ins more than they comment.
# The same standardization caveat applies to this comparison.
transformed_df['likes_over_comments'] = (transformed_df['total_likes_on_outstation_checkin_given'] > transformed_df['Yearly_avg_comment_on_travel_page']).astype(int)

# Display the updated DataFrame with new variables


print(transformed_df.head())

preferred_device_ANDROID preferred_device_Android \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0

preferred_device_Android OS preferred_device_Laptop \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0

preferred_device_Mobile preferred_device_Other preferred_device_Others \


0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0

preferred_device_Tab preferred_device_iOS \
0 0.0 0.0
1 0.0 1.0
2 0.0 0.0
3 0.0 1.0
4 0.0 0.0

preferred_device_iOS and Android ... Yearly_avg_comment_on_travel_page \


0 1.0 ... 0.898954
1 0.0 ... -0.634092
2 1.0 ... 0.806042
3 0.0 ... -0.866371
4 1.0 ... -1.609666

Yearly_avg_view_on_travel_page montly_avg_comment_on_company_page \
0 0.401152 -1.611938
1 1.305515 0.019795
2 -0.051030 -1.068027
3 -0.503212 -1.611938
4 -1.181484 -1.475960

total_likes_on_outofstation_checkin_received \
0 -0.090842
1 -0.289462
2 -0.989117
3 -0.800624
4 -0.671971

total_likes_on_outstation_checkin_given \
0 0.751797
1 -1.323014
2 1.434997
3 1.482897
4 -0.536451

week_since_last_outstation_checkin view_likes_interaction view_squared \


0 1.833319 0.301585 0.160923

Business insights from EDA

# prompt: Is the data unbalanced? If so, what can be done? Please explain in the context of the business

import matplotlib.pyplot as plt


# Check the distribution of the target variable 'Taken_product'
product_counts = df['Taken_product'].value_counts()
print(product_counts)

# Calculate the percentage of each class


product_percentages = product_counts / product_counts.sum() * 100
print(product_percentages)

# Visualize the class distribution


plt.figure(figsize=(6, 4))
product_percentages.plot(kind='bar')
plt.title('Distribution of Taken_product')
plt.xlabel('Taken Product (Yes/No)')
plt.ylabel('Percentage')
plt.show()

print("\n## Addressing Class Imbalance:")


print("If the data is significantly imbalanced (e.g., one class has a much higher representation than the other), it can lead to biased models that perform poorly on the minority

print("\n**Potential Solutions:**")
print("- **Resampling Techniques:**")
print(" - **Oversampling:** Duplicate instances from the minority class to increase its representation.")
print(" - **Undersampling:** Remove instances from the majority class to balance the dataset.")
print(" - **SMOTE (Synthetic Minority Over-sampling Technique):** Generate synthetic samples for the minority class.")

print("- **Cost-Sensitive Learning:**")


print(" - Assign higher misclassification costs to the minority class during model training to penalize errors on that class more heavily.")

print("- **Ensemble Methods:**")


print(" - Use techniques like bagging or boosting, which can be effective in handling imbalanced datasets.")

print("\n**Choice of Method:**")
print("The best approach depends on the specific dataset and business context. Consider the severity of the imbalance, the size of the dataset, and the potential impact of miscla

print("\n**Business Context:**")
print("In this case, accurately identifying potential customers who are likely to purchase tickets (i.e., 'Taken_product' = 'Yes') is crucial for targeted advertising. Therefore

Taken_product
No 9864
Yes 1896
Name: count, dtype: int64
Taken_product
No 83.877551
Yes 16.122449
Name: count, dtype: float64

## Addressing Class Imbalance:


If the data is significantly imbalanced (e.g., one class has a much higher representation than the other), it can lead to biased models that perform poorly on the minority class.

**Potential Solutions:**
- **Resampling Techniques:**
- **Oversampling:** Duplicate instances from the minority class to increase its representation.
- **Undersampling:** Remove instances from the majority class to balance the dataset.
- **SMOTE (Synthetic Minority Over-sampling Technique):** Generate synthetic samples for the minority class.
- **Cost-Sensitive Learning:**
- Assign higher misclassification costs to the minority class during model training to penalize errors on that class more heavily.
- **Ensemble Methods:**
- Use techniques like bagging or boosting, which can be effective in handling imbalanced datasets.

**Choice of Method:**
The best approach depends on the specific dataset and business context. Consider the severity of the imbalance, the size of the dataset, and the potential impact of misclassifying each class.

**Business Context:**
In this case, accurately identifying potential customers who are likely to purchase tickets (i.e., 'Taken_product' = 'Yes') is crucial for targeted advertising. Therefore, a resampling technique such as SMOTE, or cost-sensitive learning, should be applied so the model does not simply predict 'No' for almost everyone.
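With roughly 84% 'No' to 16% 'Yes', the resampling ideas above would normally be applied to the training split only, so the test set keeps the real class distribution. A minimal sketch using SMOTE from the imbalanced-learn package (the target column Taken_product_Yes comes from the one-hot encoding step; the split parameters are assumptions, and any NaN/inf produced by the engineered ratio feature must be cleaned first, since SMOTE cannot handle them):

import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# Drop both one-hot target columns so the label cannot leak into the features
X = transformed_df.drop(columns=['Taken_product_Yes', 'Taken_product_No'])
X = X.replace([np.inf, -np.inf], np.nan).fillna(0)  # SMOTE cannot handle NaN/inf
y = transformed_df['Taken_product_Yes']

# Stratified split keeps the 84/16 class ratio in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Oversample the minority class in the training data only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(y_res.value_counts())  # classes are now balanced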

print(transformed_df.columns)

Index(['preferred_device_ANDROID', 'preferred_device_Android',
'preferred_device_Android OS', 'preferred_device_Laptop',
'preferred_device_Mobile', 'preferred_device_Other',
'preferred_device_Others', 'preferred_device_Tab',
'preferred_device_iOS', 'preferred_device_iOS and Android',
'Taken_product_No', 'Taken_product_Yes',
'preferred_location_type_Beach', 'preferred_location_type_Big Cities',
'preferred_location_type_Entertainment',
'preferred_location_type_Financial', 'preferred_location_type_Game',
'preferred_location_type_Hill Stations',