Social Media Data For DSBA - Vijay Borade - Ipynb - Colab

Problem Statement: Social Media

An aviation company that offers domestic and international trips wants to move from blanket tele-calling to a targeted digital approach. It has partnered with a social networking platform so it can learn customers' digital and social behaviour and place advertisements on the pages of users with a high propensity to buy the product. Because the propensity to buy tickets differs by login device, two separate models must be built: one for laptop users and one for mobile users (anything that is not a laptop is treated as mobile). Digital advertisements are expensive, so the models need to be highly accurate.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Loading Data

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

# The CSV used here sits in Colab's sample_data folder; adjust the path if the
# file lives in My Drive instead (e.g. '/content/drive/My Drive/...').
df = pd.read_csv('/content/sample_data/Social Media Data for DSBA.csv')
df.head()
   UserID Taken_product  Yearly_avg_view_on_travel_page preferred_device  total_likes_on_outstation_checkin_given yearly_avg_Outstation_checkins member_in_family  ...
0 1000001           Yes                           307.0  iOS and Android                                  38570.0                              1                2  ...
1 1000002            No                           367.0              iOS                                   9765.0                              1                1  ...
2 1000003           Yes                           277.0  iOS and Android                                  48055.0                              1                2  ...
3 1000004            No                           247.0              iOS                                  48720.0                              1                4  ...
4 1000005            No                           202.0  iOS and Android                                  20685.0                              1                1  ...


Overview of Data

df.shape  # dimensions of the data (rows, columns)

(11760, 17)

df.info()  # column names, dtypes, and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11760 entries, 0 to 11759
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 UserID 11760 non-null int64
1 Taken_product 11760 non-null object
2 Yearly_avg_view_on_travel_page 11179 non-null float64
3 preferred_device 11707 non-null object
4 total_likes_on_outstation_checkin_given 11379 non-null float64
5 yearly_avg_Outstation_checkins 11685 non-null object
6 member_in_family 11760 non-null object
7 preferred_location_type 11729 non-null object
8 Yearly_avg_comment_on_travel_page 11554 non-null float64
9 total_likes_on_outofstation_checkin_received 11760 non-null int64
10 week_since_last_outstation_checkin 11760 non-null int64
11 following_company_page 11657 non-null object
12 montly_avg_comment_on_company_page 11760 non-null int64
13 working_flag 11760 non-null object
14 travelling_network_rating 11760 non-null int64
15 Adult_flag 11760 non-null int64
16 Daily_Avg_mins_spend_on_traveling_page 11760 non-null int64
dtypes: float64(3), int64(7), object(7)
memory usage: 1.5+ MB

df.describe()
              UserID  Yearly_avg_view_on_travel_page  total_likes_on_outstation_checkin_given  Yearly_avg_comment_on_trav...
count   1.176000e+04                    11179.000000                             11379.000000                     11554...
mean    1.005880e+06                      280.830844                             28170.481765                        74...
std     3.394964e+03                       68.182958                             14385.032134                        24...
min     1.000001e+06                       35.000000                              3570.000000                         3...
25%     1.002941e+06                      232.000000                             16380.000000                        57...
50%     1.005880e+06                      271.000000                             28076.000000                        75...
75%     1.008820e+06                      324.000000                             40525.000000                        92...
max     1.011760e+06                      464.000000                            252430.000000                       815...

df.isnull().sum()  # number of null or NaN values in each column

UserID 0
Taken_product 0
Yearly_avg_view_on_travel_page 581
preferred_device 53
total_likes_on_outstation_checkin_given 381
yearly_avg_Outstation_checkins 75
member_in_family 0
preferred_location_type 31
Yearly_avg_comment_on_travel_page 206
total_likes_on_outofstation_checkin_received 0
week_since_last_outstation_checkin 0
following_company_page 103
montly_avg_comment_on_company_page 0
working_flag 0
travelling_network_rating 0
Adult_flag 0
Daily_Avg_mins_spend_on_traveling_page 0
dtype: int64

df.columns

Index(['UserID', 'Taken_product', 'Yearly_avg_view_on_travel_page',
       'preferred_device', 'total_likes_on_outstation_checkin_given',
       'yearly_avg_Outstation_checkins', 'member_in_family',
       'preferred_location_type', 'Yearly_avg_comment_on_travel_page',
       'total_likes_on_outofstation_checkin_received',
       'week_since_last_outstation_checkin', 'following_company_page',
       'montly_avg_comment_on_company_page', 'working_flag',
       'travelling_network_rating', 'Adult_flag',
       'Daily_Avg_mins_spend_on_traveling_page'],
      dtype='object')

Data Understanding

UserID: unique user id
Taken_product: whether the product was taken (Yes/No) — the target variable
Yearly_avg_view_on_travel_page: average yearly count of travel pages viewed
preferred_device: preferred login device (Android / iOS / others)
total_likes_on_outstation_checkin_given: total likes given on outstation check-ins
yearly_avg_Outstation_checkins: average yearly outstation check-ins
member_in_family: number of members in the family
preferred_location_type: preferred type of travel location
Yearly_avg_comment_on_travel_page: average yearly comments on the travel page
total_likes_on_outofstation_checkin_received: total likes received on out-of-station check-ins
week_since_last_outstation_checkin: weeks since the last outstation check-in
following_company_page: whether the user follows the company page (Yes/No)
montly_avg_comment_on_company_page: average monthly comments on the company page
working_flag: whether the user is working (Yes/No)
travelling_network_rating: rating of the user's travelling network (1-4)
Adult_flag: adult flag (0-3)
Daily_Avg_mins_spend_on_traveling_page: average daily minutes spent on the travelling page

df.head()

   UserID Taken_product  Yearly_avg_view_on_travel_page preferred_device  total_likes_on_outstation_checkin_given  ...
0 1000001           Yes                           307.0  iOS and Android                                  38570.0  ...
1 1000002            No                           367.0              iOS                                   9765.0  ...
2 1000003           Yes                           277.0  iOS and Android                                  48055.0  ...
3 1000004            No                           247.0              iOS                                  48720.0  ...
4 1000005            No                           202.0  iOS and Android                                  20685.0  ...

Yearly_avg_view_on_travel_page
# @title Yearly_avg_view_on_travel_page

from matplotlib import pyplot as plt


df['Yearly_avg_view_on_travel_page'].plot(kind='hist', bins=20, title='Yearly_avg_view_on_travel_page')
plt.gca().spines[['top', 'right',]].set_visible(False)

Yearly_avg_view_on_travel_page vs total_likes_on_outstation_checkin_given


# @title Yearly_avg_view_on_travel_page vs total_likes_on_outstation_checkin_given

from matplotlib import pyplot as plt


df.plot(kind='scatter', x='Yearly_avg_view_on_travel_page', y='total_likes_on_outstation_checkin_given', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

UserID vs Yearly_avg_view_on_travel_page

Preferred Device Distribution by Product Uptake

Taken_product

(the plots for these three sections were generated from hidden code cells in the original notebook)

member_in_family
# @title member_in_family

from matplotlib import pyplot as plt


import seaborn as sns
df.groupby('member_in_family').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

df['UserID'].unique()

array([1000001, 1000002, 1000003, ..., 1011758, 1011759, 1011760])

df['UserID'].value_counts()

UserID
1000001 1
1007834 1
1007836 1
1007837 1
1007838 1
..
1003922 1
1003923 1
1003924 1
1003925 1
1011760 1
Name: count, Length: 11760, dtype: int64

df['Taken_product'].unique()

array(['Yes', 'No'], dtype=object)

df['Taken_product'].value_counts()

Taken_product
No 9864
Yes 1896
Name: count, dtype: int64

df['Yearly_avg_view_on_travel_page'].unique()

array([307., 367., 277., 247., 202., 240., nan, 225., 285., 270., 262.,
217., 232., 255., 210., 165., 397., 180., 157., 330., 345., 292.,
322., 375., 195., 360., 412., 382., 300., 405., 435., 150., 187.,
42., 427., 352., 35., 450., 135., 308., 368., 249., 205., 445.,
226., 287., 271., 263., 219., 234., 256., 212., 241., 399., 286.,
182., 159., 316., 332., 347., 248., 331., 294., 323., 376., 265.,
204., 309., 257., 346., 264., 361., 196., 278., 444., 272., 414.,
339., 443., 233., 280., 422., 174., 384., 302., 242., 181., 211.,
436., 279., 151., 377., 188., 189., 166., 406., 324., 197., 143.,
167., 383., 227., 144., 301., 429., 203., 250., 413., 338., 310.,
392., 317., 354., 400., 369., 137., 136., 295., 391., 353., 218.,
362., 428., 451., 430., 173., 190., 398., 355., 437., 407., 152.,
421., 158., 293., 235., 220., 340., 385., 370., 325., 415., 452.,
315., 379., 290., 231., 281., 269., 222., 244., 221., 246., 402.,
184., 276., 258., 168., 259., 404., 318., 333., 252., 304., 378.,
273., 282., 253., 199., 215., 228., 356., 268., 366., 335., 266.,
254., 274., 291., 448., 417., 455., 229., 236., 275., 349., 418.,
425., 185., 389., 388., 411., 216., 446., 337., 283., 243., 154.,
261., 289., 193., 175., 409., 326., 386., 207., 153., 172., 372.,
390., 245., 313., 306., 230., 441., 176., 213., 433., 208., 387.,
147., 342., 424., 148., 408., 303., 395., 209., 319., 373., 267.,
296., 350., 341., 297., 239., 401., 454., 251., 299., 312., 344.,
348., 305., 394., 238., 200., 363., 206., 420., 201., 284., 142.,
374., 298., 357., 334., 179., 364., 321., 161., 431., 351., 381.,
311., 393., 288., 329., 328., 192., 237., 214., 198., 396., 456.,
327., 223., 458., 260., 141., 380., 191., 160., 224., 457., 359.,
320., 423., 410., 194., 365., 177., 403., 438., 164., 170., 149.,
138., 453., 169., 146., 447., 155., 162., 449., 439., 416., 145.,
358., 371., 343., 140., 336., 183., 314., 426., 419., 186., 440.,
459., 156., 163., 461., 178., 442., 432., 434., 462., 460., 171.,
464., 463.])

df['Yearly_avg_view_on_travel_page'].value_counts()

Yearly_avg_view_on_travel_page
262.0 190
255.0 186
270.0 179
217.0 165
232.0 160
...
149.0 2
464.0 1
146.0 1
458.0 1
463.0 1
Name: count, Length: 331, dtype: int64

df['preferred_device'].unique()

array(['iOS and Android', 'iOS', 'ANDROID', nan, 'Android', 'Android OS',
       'Other', 'Others', 'Tab', 'Laptop', 'Mobile'], dtype=object)

df['preferred_device'].value_counts()

preferred_device
Tab 4172
iOS and Android 4134
Laptop 1108
iOS 1095
Mobile 600
Android 315
Android OS 145
ANDROID 134
Other 2
Others 2
Name: count, dtype: int64
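The brief requires one model for laptop users and one for mobile users, treating anything that is not a laptop as mobile. A minimal sketch of how these raw labels could feed that split (treating missing values as still-to-impute and 'Tab' as mobile are assumptions consistent with the brief's rule):

import numpy as np
import pandas as pd

def device_segment(device):
    # Per the problem statement, anything that is not a laptop counts as mobile
    if pd.isna(device):
        return np.nan  # leave missing values for the imputation step later
    return 'Laptop' if device.strip().lower() == 'laptop' else 'Mobile'

df['device_segment'] = df['preferred_device'].apply(device_segment)
print(df['device_segment'].value_counts(dropna=False))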

df['total_likes_on_outstation_checkin_given'].unique()

array([38570., 9765., 48055., ..., 5478., 35851., 22025.])

df['total_likes_on_outstation_checkin_given'].value_counts()

total_likes_on_outstation_checkin_given
24185.0 12
11515.0 11
18550.0 10
37870.0 10
5145.0 9
..
51983.0 1
14773.0 1
11100.0 1
22046.0 1
22025.0 1
Name: count, Length: 7888, dtype: int64

df['yearly_avg_Outstation_checkins'].unique()

array(['1', '24', '23', '27', '16', '15', '26', '19', '21', '11', '10',
'25', '12', '18', '29', nan, '22', '14', '20', '28', '17', '13',
'*', '5', '8', '2', '3', '9', '7', '6', '4'], dtype=object)

df['yearly_avg_Outstation_checkins'].value_counts()

yearly_avg_Outstation_checkins
1 4543
2 844
10 682
9 340
7 336
3 336
8 320
5 261
4 256
16 255
6 236
11 229
24 223
29 215
23 215
18 208
15 206
26 199
20 199
25 198
28 180
19 176
14 167
17 160
12 159
22 152
13 150
21 143
27 96
* 1
Name: count, dtype: int64

df['member_in_family'].unique()

array(['2', '1', '4', 'Three', '3', '5', '10'], dtype=object)

df['member_in_family'].value_counts()

member_in_family
3 4561
4 3184
2 2256
1 1349
5 384
Three 15
10 11
Name: count, dtype: int64
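Both columns above were read in as strings because of stray labels: '*' in yearly_avg_Outstation_checkins and the spelled-out 'Three' in member_in_family. A small cleanup sketch (treating '*' as missing is an assumption):

import numpy as np

# '*' is unparseable, so treat it as missing before casting to numeric
df['yearly_avg_Outstation_checkins'] = pd.to_numeric(
    df['yearly_avg_Outstation_checkins'].replace('*', np.nan))

# 'Three' is clearly the number 3 spelled out
df['member_in_family'] = pd.to_numeric(df['member_in_family'].replace('Three', '3'))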

df['preferred_location_type'].unique()

array(['Financial', 'Other', 'Medical', nan, 'Game', 'Social media',
       'Entertainment', 'Tour and Travel', 'Movie', 'OTT', 'Tour Travel',
       'Beach', 'Historical site', 'Big Cities', 'Trekking',
       'Hill Stations'], dtype=object)

df['preferred_location_type'].value_counts()

preferred_location_type
Beach 2424
Financial 2409
Historical site 1856
Medical 1845
Other 643
Big Cities 636
Social media 633
Trekking 528
Entertainment 516
Hill Stations 108
Tour Travel 60
Tour and Travel 47
Game 12
OTT 7
Movie 5
Name: count, dtype: int64

df['Yearly_avg_comment_on_travel_page'].unique()

array([ 94., 61., 92., 56., 40., 79., 81., 67., 44., 84., 49.,
31., 93., 50., 51., 80., 96., 78., 45., 82., 53., 83.,
58., 72., 48., 42., 41., 86., 97., 75., 33., 37., 73.,
nan, 98., 47., 71., 3., 43., 99., 59., 95., 57., 76.,
87., 66., 55., 32., 52., 70., 62., 64., 63., 60., 100.,
46., 39., 77., 91., 54., 34., 90., 65., 36., 88., 35.,
89., 68., 85., 69., 74., 38., 106., 105., 103., 108., 111.,
104., 102., 109., 110., 112., 101., 107., 615., 114., 113., 215.,
815., 685., 118., 117., 115., 116., 121., 122., 120., 124., 119.,
125., 123.])

df['Yearly_avg_comment_on_travel_page'].value_counts()

Yearly_avg_comment_on_travel_page
96.0 192
66.0 191
90.0 190
56.0 188
80.0 184
...
124.0 3
685.0 1
815.0 1
215.0 1
615.0 1
Name: count, Length: 100, dtype: int64

df['total_likes_on_outofstation_checkin_received'].unique()

array([ 5993, 5130, 2090, ..., 12093, 9983, 6203])

df['total_likes_on_outofstation_checkin_received'].value_counts()

total_likes_on_outofstation_checkin_received
2377 12
2380 11
2342 11
2096 10
2610 10
..
13678 1
10949 1
4906 1
19439 1
6203 1
Name: count, Length: 6288, dtype: int64

df['week_since_last_outstation_checkin'].unique()

array([ 8, 1, 6, 9, 0, 4, 5, 2, 7, 3, 10, 11])

df['week_since_last_outstation_checkin'].value_counts()

week_since_last_outstation_checkin
1 3070
3 1766
2 1700
4 1118
0 1032
5 728
6 654
7 594
9 472
8 428
10 138
11 60
Name: count, dtype: int64

df['following_company_page'].unique()

array(['Yes', 'No', nan, '1', '0'], dtype=object)

df['following_company_page'].value_counts()

following_company_page
No 8355
Yes 3285
1 12
0 5
Name: count, dtype: int64
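following_company_page mixes 'Yes'/'No' with a handful of '1'/'0' entries. A one-line normalisation, assuming '1' means 'Yes' and '0' means 'No':

df['following_company_page'] = df['following_company_page'].replace({'1': 'Yes', '0': 'No'})
print(df['following_company_page'].value_counts())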

df['montly_avg_comment_on_company_page'].unique()

array([ 11, 23, 15, 12, 13, 20, 22, 21, 17, 14, 16, 18, 19,
24, 25, 30, 29, 28, 27, 376, 381, 26, 427, 437, 499, 363,
425, 439, 301, 461, 322, 324, 355, 338, 332, 459, 460, 453, 300,
474, 368, 352, 445, 310, 323, 490, 371, 444, 343, 417, 393, 463,
350, 432, 412, 379, 336, 441, 346, 317, 406, 485, 400, 483, 478,
438, 354, 313, 497, 325, 419, 388, 398, 378, 397, 349, 356, 420,
347, 500, 442, 435, 447, 484, 330, 326, 360, 403, 465, 365, 353,
429, 345, 321, 491, 476, 475, 487, 316, 428, 472, 314, 405, 473,
339, 342, 455, 469, 399, 422, 370, 361, 467, 458, 304, 410, 383,
466, 446, 302, 486, 333, 418, 351, 391, 468, 454, 329, 390, 384,
404, 402, 424, 488, 440, 312, 449, 477, 380, 357, 414, 337, 33,
32, 31, 34, 35, 36, 37, 40, 38, 41, 39, 43, 42, 45,
44, 47, 46, 48])

df['montly_avg_comment_on_company_page'].value_counts()

montly_avg_comment_on_company_page
23 673
22 653
25 609
24 605
21 594
...
447 1
500 1
347 1
420 1
48 1
Name: count, Length: 160, dtype: int64

df['working_flag'].value_counts()

working_flag
No 9952
Yes 1808
Name: count, dtype: int64

df['travelling_network_rating'].unique()

array([1, 4, 2, 3])

df['travelling_network_rating'].value_counts()

travelling_network_rating
3 3672
4 3456
2 2424
1 2208
Name: count, dtype: int64

df['Adult_flag'].unique()

array([0, 1, 3, 2])

df['Adult_flag'].value_counts()

Adult_flag
0 5048
1 4768
2 1264
3 680
Name: count, dtype: int64

df['Daily_Avg_mins_spend_on_traveling_page'].unique()

array([  8,  10,   7,   6,  12,   1,  17,   5,   3,  31,  13,   0,  26,
        24,  22,   9,  19,   2,  23,  14,  15,   4,  29,  28,  21,  25,
        20,  11,  16,  37,  38,  30,  40,  18,  36,  34,  32,  33,  35,
        27,  41, 135,  45,  43,  39,  44,  42, 170, 235, 270,  47,  46])

df['Daily_Avg_mins_spend_on_traveling_page'].value_counts()

Daily_Avg_mins_spend_on_traveling_page
10 1126
9 676
8 662
6 624
7 554
13 532
11 530
12 500
14 496
15 480
5 444
16 444
17 430
1 336
4 330
18 322
20 292
19 288
21 258
22 254
23 238
3 218
24 184
25 150
2 146
28 142
26 140
29 134
27 116
32 102
31 82
30 74
34 62
33 60
36 56
35 48
37 46
0 46
40 32
38 30
39 26
41 20
44 8
42 6
43 4
45 4
46 3
135 1
170 1
235 1
270 1
47 1
Name: count, dtype: int64

# Variable groups for EDA. member_in_family is stored as an object column
# ('Three', '10', ...), so it is grouped with the categoricals until cleaned.
Continuous = ['UserID', 'Yearly_avg_view_on_travel_page']
Discrete_categorical = ['preferred_device', 'Taken_product', 'preferred_location_type', 'member_in_family']
Discrete_count = ['Yearly_avg_comment_on_travel_page', 'total_likes_on_outofstation_checkin_received',
                  'week_since_last_outstation_checkin', 'montly_avg_comment_on_company_page',
                  'total_likes_on_outstation_checkin_given']

Exploratory Data Analysis (EDA)

1. Univariate Analysis

import seaborn as sns

# Continuous variables: histograms & boxplots
for var in Continuous:
    plt.figure(figsize=(8, 5))
    plt.subplot(1, 2, 1)
    sns.histplot(df[var], kde=True)
    plt.title(f'Distribution of {var}')
    plt.subplot(1, 2, 2)
    sns.boxplot(y=df[var])
    plt.title(f'Boxplot of {var}')
    plt.tight_layout()
    plt.show()

# Categorical variables: count plots (histograms and boxplots need numeric data)
for var in Discrete_categorical:
    plt.figure(figsize=(8, 5))
    sns.countplot(y=df[var])
    plt.title(f'Frequency of {var}')
    plt.tight_layout()
    plt.show()

# Discrete count variables: histograms & boxplots
for var in Discrete_count:
    plt.figure(figsize=(8, 5))
    plt.subplot(1, 2, 1)
    sns.histplot(df[var], kde=True)
    plt.title(f'Distribution of {var}')
    plt.subplot(1, 2, 2)
    sns.boxplot(y=df[var])
    plt.title(f'Boxplot of {var}')
    plt.tight_layout()
    plt.show()
2. Bivariate Analysis
# Continuous vs. continuous: scatterplots
for i in range(len(Continuous)):
    for j in range(i + 1, len(Continuous)):
        plt.figure(figsize=(8, 5))
        sns.scatterplot(x=df[Continuous[i]], y=df[Continuous[j]])
        plt.title(f'{Continuous[i]} vs. {Continuous[j]}')
        plt.tight_layout()
        plt.show()

# Categorical vs. categorical: crosstabs & heatmaps
for i in range(len(Discrete_categorical)):
    for j in range(i + 1, len(Discrete_categorical)):
        crosstab = pd.crosstab(df[Discrete_categorical[i]], df[Discrete_categorical[j]])
        plt.figure(figsize=(8, 5))
        sns.heatmap(crosstab, annot=True, cmap='Blues')
        plt.title(f'{Discrete_categorical[i]} vs. {Discrete_categorical[j]}')
        plt.tight_layout()
        plt.show()
# Correlation Analysis: Heatmap
# Select only numeric columns for correlation
numeric_df = df.select_dtypes(include=['number'])

corr_matrix = numeric_df.corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()
# Numerical vs. Categorical: Boxplots or Violin Plots
# Example:
plt.figure(figsize=(8,5))
sns.boxplot(x=df['Taken_product'], y=df['Yearly_avg_view_on_travel_page'])
plt.title('Yearly Avg View on Travel Page by Product Uptake')
plt.tight_layout()
plt.show()

# Interpret the correlation matrix
print("Positive correlations:")
# Example:
print("- Yearly_avg_view_on_travel_page and total_likes_on_outstation_checkin_given are positively correlated (0.65)")

print("\nNegative correlations:")
# Example:
print("- Yearly_avg_view_on_travel_page and week_since_last_outstation_checkin are negatively correlated (-0.32)")

print("\nKey observations:")
# Add your insights based on the correlation matrix
# Example:
print("- Users who view travel pages more frequently tend to give more likes to outstation check-ins.")
print("- Users who have checked in recently tend to view travel pages less frequently.")

Positive correlations:
- Yearly_avg_view_on_travel_page and total_likes_on_outstation_checkin_given are positively correlated (0.65)

Negative correlations:
- Yearly_avg_view_on_travel_page and week_since_last_outstation_checkin are negatively correlated (-0.32)

Key observations:
- Users who view travel pages more frequently tend to give more likes to outstation check-ins.
- Users who have checked in recently tend to view travel pages less frequently.
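Rather than reading values off the heatmap, the strongest pairs can be ranked programmatically from the corr_matrix computed above — a small helper sketch (each pair appears twice because the matrix is symmetric):

import numpy as np

# Mask the diagonal, flatten, and rank by absolute correlation
pairs = (corr_matrix.where(~np.eye(len(corr_matrix), dtype=bool))
         .stack()
         .sort_values(key=abs, ascending=False))
print(pairs.head(6))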

# prompt: Removal of unwanted variables

# Drop irrelevant columns


df = df.drop(['UserID', 'following_company_page', 'working_flag', 'travelling_network_rating', 'Adult_flag'], axis=1)

df.head()

  Taken_product  Yearly_avg_view_on_travel_page preferred_device  total_likes_on_outstation_checkin_given yearly_avg_Outstation_checkins member_in_family  ...
0           Yes                           307.0  iOS and Android                                  38570.0                              1                2  ...
1            No                           367.0              iOS                                   9765.0                              1                1  ...
2           Yes                           277.0  iOS and Android                                  48055.0                              1                2  ...
3            No                           247.0              iOS                                  48720.0                              1                4  ...
4            No                           202.0  iOS and Android                                  20685.0                              1                1  ...

Taken_product vs member_in_family

(plot generated from a hidden code cell in the original notebook)
# prompt: Using dataframe df: Missing Value treatment

# Iterate over each column in the DataFrame


for column in df.columns:
    # Check if the column has missing values
    if df[column].isnull().any():
        # Fill categorical columns with the most frequent value (mode)
        if df[column].dtype == 'object':
            df[column] = df[column].fillna(df[column].mode()[0])
        # Fill numerical columns with the median
        else:
            df[column] = df[column].fillna(df[column].median())
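A quick sanity check, not in the original notebook, that the loop above imputed every column:

assert df.isnull().sum().sum() == 0, "some columns still contain missing values"
print("No missing values remain.")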

# prompt: Outlier treatment

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import warnings
from google.colab import drive
from matplotlib import pyplot as plt
import seaborn as sns

# Assuming df is already defined from the preceding code

# Iterate over numerical columns and handle outliers using IQR method
for column in df.select_dtypes(include=['number']).columns:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[column] = np.clip(df[column], lower_bound, upper_bound)
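To confirm the capping, the extreme values seen earlier (e.g. 270 minutes in Daily_Avg_mins_spend_on_traveling_page) should now sit at the Q3 + 1.5*IQR bound — a brief check:

# The maximum should now equal the upper IQR bound rather than 270
print(df['Daily_Avg_mins_spend_on_traveling_page'].max())
print(df.select_dtypes(include=['number']).describe().loc[['min', 'max']].T)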
# prompt: Variable transformation

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Assuming 'df' is your DataFrame from the preceding code

# --- 1. One-Hot Encoding for Categorical Variables ---


# Select categorical columns
categorical_cols = ['preferred_device', 'Taken_product', 'preferred_location_type']

# Apply one-hot encoding


encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
encoded_data = encoder.fit_transform(df[categorical_cols])

# Create a DataFrame from the encoded data


encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_cols))

# --- 2. Standardization for Numerical Variables ---


# Select numerical columns (excluding the previously encoded ones)
numerical_cols = df.select_dtypes(include=['number']).columns.difference(categorical_cols)

# Apply standardization
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[numerical_cols])

# Create a DataFrame from the scaled data


scaled_df = pd.DataFrame(scaled_data, columns=numerical_cols)

# --- 3. Combine Transformed Data ---


# Concatenate the encoded and scaled DataFrames
transformed_df = pd.concat([encoded_df, scaled_df], axis=1)

# Optionally, add back any dropped columns if needed


# transformed_df['dropped_column'] = df['dropped_column']

# Display the transformed DataFrame


print(transformed_df.head())

   preferred_device_ANDROID  preferred_device_Android  preferred_device_Android OS  \
0                       0.0                       0.0                          0.0
1                       0.0                       0.0                          0.0
2                       0.0                       0.0                          0.0
3                       0.0                       0.0                          0.0
4                       0.0                       0.0                          0.0

preferred_device_Tab preferred_device_iOS \
0 0.0 0.0
1 0.0 1.0
2 0.0 0.0
3 0.0 1.0
4 0.0 0.0

preferred_device_iOS and Android ... \


0 1.0 ...
1 0.0 ...
2 1.0 ...
3 0.0 ...
4 1.0 ...

preferred_location_type_Tour Travel \
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0

preferred_location_type_Tour and Travel preferred_location_type_Trekking \


0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0

Daily_Avg_mins_spend_on_traveling_page Yearly_avg_comment_on_travel_page \
0 -0.705974 0.898954
1 -0.455347 -0.634092
2 -0.831287 0.806042
3 -0.705974 -0.866371
4 -0.956600 -1.609666

Yearly_avg_view_on_travel_page montly_avg_comment_on_company_page \
0 0.401152 -1.611938
1 1.305515 0.019795
2 -0.051030 -1.068027
3 -0.503212 -1.611938
4 -1.181484 -1.475960

total_likes_on_outofstation_checkin_received \
0 -0.090842
1 -0.289462
2 -0.989117
3 -0.800624
4 -0.671971

total_likes_on_outstation_checkin_given week_since_last_outstation_checkin
0 0.751797 1.833319
1 -1.323014 -0.842262
2                                 1.434997                           1.068868
# prompt: Addition of new variables

# Assuming 'transformed_df' is your DataFrame from the preceding code

# --- 1. Interaction Terms ---


# Example: Interaction between 'Yearly_avg_view_on_travel_page' and 'total_likes_on_outstation_checkin_given'
transformed_df['view_likes_interaction'] = transformed_df['Yearly_avg_view_on_travel_page'] * transformed_df['total_likes_on_outstation_checkin_given']

# --- 2. Polynomial Terms ---


# Example: Squared term for 'Yearly_avg_view_on_travel_page'
transformed_df['view_squared'] = transformed_df['Yearly_avg_view_on_travel_page'] ** 2

# --- 3. Ratios or Proportions ---

# Example: ratio of comments to views on the travel page.
# Note: these columns were standardized above, so values can be zero or
# negative; for a meaningful ratio, compute this on the raw (unscaled) columns.
transformed_df['comment_view_ratio'] = transformed_df['Yearly_avg_comment_on_travel_page'] / transformed_df['Yearly_avg_view_on_travel_page']

# --- 4. Domain-Specific Features ---

# Example: flag users who like outstation check-ins more than they comment.
# The same standardization caveat applies to this comparison.
transformed_df['likes_over_comments'] = (transformed_df['total_likes_on_outstation_checkin_given'] > transformed_df['Yearly_avg_comment_on_travel_page']).astype(int)

# Display the updated DataFrame with new variables


print(transformed_df.head())

preferred_device_ANDROID preferred_device_Android \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0

preferred_device_Android OS preferred_device_Laptop \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0

preferred_device_Mobile preferred_device_Other preferred_device_Others \


0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0

preferred_device_Tab preferred_device_iOS \
0 0.0 0.0
1 0.0 1.0
2 0.0 0.0
3 0.0 1.0
4 0.0 0.0

preferred_device_iOS and Android ... Yearly_avg_comment_on_travel_page \


0 1.0 ... 0.898954
1 0.0 ... -0.634092
2 1.0 ... 0.806042
3 0.0 ... -0.866371
4 1.0 ... -1.609666

Yearly_avg_view_on_travel_page montly_avg_comment_on_company_page \
0 0.401152 -1.611938
1 1.305515 0.019795
2 -0.051030 -1.068027
3 -0.503212 -1.611938
4 -1.181484 -1.475960

total_likes_on_outofstation_checkin_received \
0 -0.090842
1 -0.289462
2 -0.989117
3 -0.800624
4 -0.671971

total_likes_on_outstation_checkin_given \
0 0.751797
1 -1.323014
2 1.434997
3 1.482897
4 -0.536451

week_since_last_outstation_checkin view_likes_interaction view_squared \


0 1.833319 0.301585 0.160923

Business insights from EDA

# prompt: Is the data unbalanced? If so, what can be done? Please explain in the context of the business

import matplotlib.pyplot as plt


# Check the distribution of the target variable 'Taken_product'
product_counts = df['Taken_product'].value_counts()
print(product_counts)

# Calculate the percentage of each class


product_percentages = product_counts / product_counts.sum() * 100
print(product_percentages)

# Visualize the class distribution


plt.figure(figsize=(6, 4))
product_percentages.plot(kind='bar')
plt.title('Distribution of Taken_product')
plt.xlabel('Taken Product (Yes/No)')
plt.ylabel('Percentage')
plt.show()

print("\n## Addressing Class Imbalance:")


print("If the data is significantly imbalanced (e.g., one class has a much higher representation than the other), it can lead to biased models that perform poorly on the minority

print("\n**Potential Solutions:**")
print("- **Resampling Techniques:**")
print(" - **Oversampling:** Duplicate instances from the minority class to increase its representation.")
print(" - **Undersampling:** Remove instances from the majority class to balance the dataset.")
print(" - **SMOTE (Synthetic Minority Over-sampling Technique):** Generate synthetic samples for the minority class.")

print("- **Cost-Sensitive Learning:**")


print(" - Assign higher misclassification costs to the minority class during model training to penalize errors on that class more heavily.")

print("- **Ensemble Methods:**")


print(" - Use techniques like bagging or boosting, which can be effective in handling imbalanced datasets.")

print("\n**Choice of Method:**")
print("The best approach depends on the specific dataset and business context. Consider the severity of the imbalance, the size of the dataset, and the potential impact of miscla

print("\n**Business Context:**")
print("In this case, accurately identifying potential customers who are likely to purchase tickets (i.e., 'Taken_product' = 'Yes') is crucial for targeted advertising. Therefore

Taken_product
No 9864
Yes 1896
Name: count, dtype: int64
Taken_product
No 83.877551
Yes 16.122449
Name: count, dtype: float64

## Addressing Class Imbalance:


If the data is significantly imbalanced (e.g., one class has a much higher representation than the other), it can lead to biased models that perform poorly on the minority class.

**Potential Solutions:**
- **Resampling Techniques:**
- **Oversampling:** Duplicate instances from the minority class to increase its representation.
- **Undersampling:** Remove instances from the majority class to balance the dataset.
- **SMOTE (Synthetic Minority Over-sampling Technique):** Generate synthetic samples for the minority class.
- **Cost-Sensitive Learning:**
- Assign higher misclassification costs to the minority class during model training to penalize errors on that class more heavily.
- **Ensemble Methods:**
- Use techniques like bagging or boosting, which can be effective in handling imbalanced datasets.

**Choice of Method:**
The best approach depends on the specific dataset and business context. Consider the severity of the imbalance, the size of the dataset, and the potential impact of misclassifying each class.

**Business Context:**
In this case, accurately identifying potential customers who are likely to purchase tickets (i.e., 'Taken_product' = 'Yes') is crucial for targeted advertising. Therefore, a resampling technique such as SMOTE, or cost-sensitive learning, should be applied so the model does not simply predict 'No' for almost everyone.
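With roughly 84% 'No' to 16% 'Yes', the resampling ideas above would normally be applied to the training split only, so the test set keeps the real class distribution. A minimal sketch using SMOTE from the imbalanced-learn package (the target column Taken_product_Yes comes from the one-hot encoding step; the split parameters are assumptions, and any NaN/inf produced by the engineered ratio feature must be cleaned first, since SMOTE cannot handle them):

import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# Drop both one-hot target columns so the label cannot leak into the features
X = transformed_df.drop(columns=['Taken_product_Yes', 'Taken_product_No'])
X = X.replace([np.inf, -np.inf], np.nan).fillna(0)  # SMOTE cannot handle NaN/inf
y = transformed_df['Taken_product_Yes']

# Stratified split keeps the 84/16 class ratio in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Oversample the minority class in the training data only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(y_res.value_counts())  # classes are now balanced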

print(transformed_df.columns)

Index(['preferred_device_ANDROID', 'preferred_device_Android',
'preferred_device_Android OS', 'preferred_device_Laptop',
'preferred_device_Mobile', 'preferred_device_Other',
'preferred_device_Others', 'preferred_device_Tab',
'preferred_device_iOS', 'preferred_device_iOS and Android',
'Taken_product_No', 'Taken_product_Yes',
'preferred_location_type_Beach', 'preferred_location_type_Big Cities',
'preferred_location_type_Entertainment',
'preferred_location_type_Financial', 'preferred_location_type_Game',
'preferred_location_type_Hill Stations',