Social Media Data for DSBA - Vijay Borade.ipynb - Colab
An aviation company that provides domestic as well as international trips to its customers now wants to apply a targeted approach instead of
reaching out to every customer, and this time digitally rather than by telecalling. It has therefore collaborated with a social networking
platform so it can learn the digital and social behaviour of its customers and serve digital advertisements on the pages of targeted customers
with a high propensity to take up the product. Because the propensity to buy tickets differs by login device, two separate models must be built,
one for Laptop and one for Mobile (anything that is not a laptop can be treated as mobile-phone usage). Advertisements on the digital platform
are expensive, so the models need to be highly accurate.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # used for the heatmaps and boxplots below
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Mount Google Drive (the sample CSV is read from /content/sample_data below)
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Load the dataset; adjust the path if the CSV lives elsewhere (e.g. in My Drive)
df = pd.read_csv('/content/sample_data/Social Media Data for DSBA.csv')
df.head()
[Output: first five rows of df — UserID, Taken_product, Yearly_avg_view_on_travel_page, preferred_device, total_likes_on_outstation_checkin_given, yearly_avg_Outstation_checkins, member_in_family, preferred_location_type, ... (17 columns)]
df.shape
(11760, 17)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11760 entries, 0 to 11759
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 UserID 11760 non-null int64
1 Taken_product 11760 non-null object
2 Yearly_avg_view_on_travel_page 11179 non-null float64
3 preferred_device 11707 non-null object
4 total_likes_on_outstation_checkin_given 11379 non-null float64
5 yearly_avg_Outstation_checkins 11685 non-null object
6 member_in_family 11760 non-null object
7 preferred_location_type 11729 non-null object
8 Yearly_avg_comment_on_travel_page 11554 non-null float64
9 total_likes_on_outofstation_checkin_received 11760 non-null int64
10 week_since_last_outstation_checkin 11760 non-null int64
11 following_company_page 11657 non-null object
12 montly_avg_comment_on_company_page 11760 non-null int64
13 working_flag 11760 non-null object
14 travelling_network_rating 11760 non-null int64
15 Adult_flag 11760 non-null int64
16 Daily_Avg_mins_spend_on_traveling_page 11760 non-null int64
dtypes: float64(3), int64(7), object(7)
memory usage: 1.5+ MB
df.describe()
[Output: descriptive statistics (count, mean, std, min, quartiles, max) for the numeric columns]
df.isnull().sum()  # number of null/NaN values in each column
UserID 0
Taken_product 0
Yearly_avg_view_on_travel_page 581
preferred_device 53
total_likes_on_outstation_checkin_given 381
yearly_avg_Outstation_checkins 75
member_in_family 0
preferred_location_type 31
Yearly_avg_comment_on_travel_page 206
total_likes_on_outofstation_checkin_received 0
week_since_last_outstation_checkin 0
following_company_page 103
montly_avg_comment_on_company_page 0
working_flag 0
travelling_network_rating 0
Adult_flag 0
Daily_Avg_mins_spend_on_traveling_page 0
dtype: int64
df.columns
df.head()
[Chart: Yearly_avg_view_on_travel_page distribution]
[Chart: UserID vs Yearly_avg_view_on_travel_page]
[Chart: Taken_product]
[Chart: member_in_family]
df['UserID'].unique()
df['UserID'].value_counts()
UserID
1000001 1
1007834 1
1007836 1
1007837 1
1007838 1
..
1003922 1
1003923 1
1003924 1
1003925 1
1011760 1
Name: count, Length: 11760, dtype: int64
df['Taken_product'].unique()
df['Taken_product'].value_counts()
Taken_product
No 9864
Yes 1896
Name: count, dtype: int64
df['Yearly_avg_view_on_travel_page'].unique()
array([307., 367., 277., 247., 202., 240., nan, 225., 285., 270., 262.,
217., 232., 255., 210., 165., 397., 180., 157., 330., 345., 292.,
322., 375., 195., 360., 412., 382., 300., 405., 435., 150., 187.,
42., 427., 352., 35., 450., 135., 308., 368., 249., 205., 445.,
226., 287., 271., 263., 219., 234., 256., 212., 241., 399., 286.,
182., 159., 316., 332., 347., 248., 331., 294., 323., 376., 265.,
204., 309., 257., 346., 264., 361., 196., 278., 444., 272., 414.,
339., 443., 233., 280., 422., 174., 384., 302., 242., 181., 211.,
436., 279., 151., 377., 188., 189., 166., 406., 324., 197., 143.,
167., 383., 227., 144., 301., 429., 203., 250., 413., 338., 310.,
392., 317., 354., 400., 369., 137., 136., 295., 391., 353., 218.,
362., 428., 451., 430., 173., 190., 398., 355., 437., 407., 152.,
421., 158., 293., 235., 220., 340., 385., 370., 325., 415., 452.,
315., 379., 290., 231., 281., 269., 222., 244., 221., 246., 402.,
184., 276., 258., 168., 259., 404., 318., 333., 252., 304., 378.,
273., 282., 253., 199., 215., 228., 356., 268., 366., 335., 266.,
254., 274., 291., 448., 417., 455., 229., 236., 275., 349., 418.,
425., 185., 389., 388., 411., 216., 446., 337., 283., 243., 154.,
261., 289., 193., 175., 409., 326., 386., 207., 153., 172., 372.,
390., 245., 313., 306., 230., 441., 176., 213., 433., 208., 387.,
147., 342., 424., 148., 408., 303., 395., 209., 319., 373., 267.,
296., 350., 341., 297., 239., 401., 454., 251., 299., 312., 344.,
348., 305., 394., 238., 200., 363., 206., 420., 201., 284., 142.,
374., 298., 357., 334., 179., 364., 321., 161., 431., 351., 381.,
311., 393., 288., 329., 328., 192., 237., 214., 198., 396., 456.,
327., 223., 458., 260., 141., 380., 191., 160., 224., 457., 359.,
320., 423., 410., 194., 365., 177., 403., 438., 164., 170., 149.,
138., 453., 169., 146., 447., 155., 162., 449., 439., 416., 145.,
358., 371., 343., 140., 336., 183., 314., 426., 419., 186., 440.,
459., 156., 163., 461., 178., 442., 432., 434., 462., 460., 171.,
464., 463.])
df['Yearly_avg_view_on_travel_page'].value_counts()
Yearly_avg_view_on_travel_page
262.0 190
255.0 186
270.0 179
217.0 165
232.0 160
...
149.0 2
464.0 1
146.0 1
458.0 1
463.0 1
Name: count, Length: 331, dtype: int64
df['preferred_device'].unique()
df['preferred_device'].value_counts()
preferred_device
Tab 4172
iOS and Android 4134
Laptop 1108
iOS 1095
Mobile 600
Android 315
Android OS 145
ANDROID 134
Other 2
Others 2
Name: count, dtype: int64
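The device labels above are inconsistent (Android vs ANDROID vs Android OS, Other vs Others). Since the brief asks for separate Laptop and Mobile models and treats anything that is not a laptop as mobile usage, here is a minimal sketch of the split; device_group is a hypothetical helper column introduced for illustration, not part of the original notebook:
# Hypothetical helper column: Laptop vs Mobile, per the brief's rule that
# anything other than a laptop counts as mobile usage (missing devices
# therefore fall into Mobile; impute first if that is not desired)
df['device_group'] = np.where(
    df['preferred_device'].str.strip().str.lower() == 'laptop', 'Laptop', 'Mobile')
df['device_group'].value_counts()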
df['total_likes_on_outstation_checkin_given'].unique()
df['total_likes_on_outstation_checkin_given'].value_counts()
total_likes_on_outstation_checkin_given
24185.0 12
11515.0 11
18550.0 10
37870.0 10
5145.0 9
..
51983.0 1
14773.0 1
11100.0 1
22046.0 1
22025.0 1
Name: count, Length: 7888, dtype: int64
df['yearly_avg_Outstation_checkins'].unique()
array(['1', '24', '23', '27', '16', '15', '26', '19', '21', '11', '10',
'25', '12', '18', '29', nan, '22', '14', '20', '28', '17', '13',
'*', '5', '8', '2', '3', '9', '7', '6', '4'], dtype=object)
df['yearly_avg_Outstation_checkins'].value_counts()
yearly_avg_Outstation_checkins
1 4543
2 844
10 682
9 340
7 336
3 336
8 320
5 261
4 256
16 255
6 236
11 229
24 223
29 215
23 215
18 208
15 206
26 199
20 199
25 198
28 180
19 176
14 167
17 160
12 159
22 152
13 150
21 143
27 96
* 1
Name: count, dtype: int64
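yearly_avg_Outstation_checkins is stored as strings and contains a stray '*'. A minimal cleaning sketch: coerce the column to numeric so the '*' becomes NaN and can be handled together with the other missing values:
# errors='coerce' turns unparseable entries (the single '*') into NaN
df['yearly_avg_Outstation_checkins'] = pd.to_numeric(
    df['yearly_avg_Outstation_checkins'], errors='coerce')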
df['member_in_family'].unique()
df['member_in_family'].value_counts()
member_in_family
3 4561
4 3184
2 2256
1 1349
5 384
Three 15
10 11
Name: count, dtype: int64
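member_in_family mixes digit strings with the spelled-out value 'Three'. A minimal sketch that normalizes the column to integers:
# Map the spelled-out value to its digit, then cast the whole column to int
df['member_in_family'] = df['member_in_family'].replace('Three', '3').astype(int)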
df['preferred_location_type'].unique()
df['preferred_location_type'].value_counts()
preferred_location_type
Beach 2424
Financial 2409
Historical site 1856
Medical 1845
Other 643
Big Cities 636
Social media 633
Trekking 528
Entertainment 516
Hill Stations 108
Tour Travel 60
Tour and Travel 47
Game 12
OTT 7
Movie 5
Name: count, dtype: int64
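'Tour Travel' and 'Tour and Travel' look like the same category under two spellings. A minimal sketch that merges them (collapsing onto the more frequent spelling is this sketch's choice):
# Merge the duplicate spelling into a single category
df['preferred_location_type'] = df['preferred_location_type'].replace(
    'Tour and Travel', 'Tour Travel')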
df['Yearly_avg_comment_on_travel_page'].unique()
array([ 94., 61., 92., 56., 40., 79., 81., 67., 44., 84., 49.,
31., 93., 50., 51., 80., 96., 78., 45., 82., 53., 83.,
58., 72., 48., 42., 41., 86., 97., 75., 33., 37., 73.,
nan, 98., 47., 71., 3., 43., 99., 59., 95., 57., 76.,
87., 66., 55., 32., 52., 70., 62., 64., 63., 60., 100.,
46., 39., 77., 91., 54., 34., 90., 65., 36., 88., 35.,
89., 68., 85., 69., 74., 38., 106., 105., 103., 108., 111.,
104., 102., 109., 110., 112., 101., 107., 615., 114., 113., 215.,
815., 685., 118., 117., 115., 116., 121., 122., 120., 124., 119.,
125., 123.])
df['Yearly_avg_comment_on_travel_page'].value_counts()
Yearly_avg_comment_on_travel_page
96.0 192
66.0 191
90.0 190
56.0 188
80.0 184
...
124.0 3
685.0 1
815.0 1
215.0 1
615.0 1
Name: count, Length: 100, dtype: int64
df['total_likes_on_outofstation_checkin_received'].unique()
df['total_likes_on_outofstation_checkin_received'].value_counts()
total_likes_on_outofstation_checkin_received
2377 12
2380 11
2342 11
2096 10
2610 10
..
13678 1
10949 1
4906 1
19439 1
6203 1
Name: count, Length: 6288, dtype: int64
df['week_since_last_outstation_checkin'].unique()
df['week_since_last_outstation_checkin'].value_counts()
week_since_last_outstation_checkin
1 3070
3 1766
2 1700
4 1118
0 1032
5 728
6 654
7 594
9 472
8 428
10 138
11 60
Name: count, dtype: int64
df['following_company_page'].unique()
df['following_company_page'].value_counts()
following_company_page
No 8355
Yes 3285
1 12
0 5
Name: count, dtype: int64
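following_company_page mixes Yes/No with a handful of 1/0 entries. A minimal sketch mapping the stray codes onto the dominant labels (both string and integer forms are covered, since the object dtype leaves that ambiguous):
# Map stray 1/0 codes onto Yes/No
df['following_company_page'] = df['following_company_page'].replace(
    {1: 'Yes', 0: 'No', '1': 'Yes', '0': 'No'})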
df['montly_avg_comment_on_company_page'].unique()
array([ 11, 23, 15, 12, 13, 20, 22, 21, 17, 14, 16, 18, 19,
24, 25, 30, 29, 28, 27, 376, 381, 26, 427, 437, 499, 363,
425, 439, 301, 461, 322, 324, 355, 338, 332, 459, 460, 453, 300,
474, 368, 352, 445, 310, 323, 490, 371, 444, 343, 417, 393, 463,
350, 432, 412, 379, 336, 441, 346, 317, 406, 485, 400, 483, 478,
438, 354, 313, 497, 325, 419, 388, 398, 378, 397, 349, 356, 420,
347, 500, 442, 435, 447, 484, 330, 326, 360, 403, 465, 365, 353,
429, 345, 321, 491, 476, 475, 487, 316, 428, 472, 314, 405, 473,
339, 342, 455, 469, 399, 422, 370, 361, 467, 458, 304, 410, 383,
466, 446, 302, 486, 333, 418, 351, 391, 468, 454, 329, 390, 384,
404, 402, 424, 488, 440, 312, 449, 477, 380, 357, 414, 337, 33,
32, 31, 34, 35, 36, 37, 40, 38, 41, 39, 43, 42, 45,
44, 47, 46, 48])
df['montly_avg_comment_on_company_page'].value_counts()
montly_avg_comment_on_company_page
23 673
22 653
25 609
24 605
21 594
...
447 1
500 1
347 1
420 1
48 1
Name: count, Length: 160, dtype: int64
df['working_flag'].value_counts()
working_flag
No 9952
Yes 1808
Name: count, dtype: int64
df['travelling_network_rating'].unique()
array([1, 4, 2, 3])
df['travelling_network_rating'].value_counts()
travelling_network_rating
3 3672
4 3456
2 2424
1 2208
Name: count, dtype: int64
df['Adult_flag'].unique()
array([0, 1, 3, 2])
df['Adult_flag'].value_counts()
Adult_flag
0 5048
1 4768
2 1264
3 680
Name: count, dtype: int64
df['Daily_Avg_mins_spend_on_traveling_page'].unique()
df['Daily_Avg_mins_spend_on_traveling_page'].value_counts()
Daily_Avg_mins_spend_on_traveling_page
10 1126
9 676
8 662
6 624
7 554
13 532
11 530
12 500
14 496
15 480
5 444
16 444
17 430
1 336
4 330
18 322
20 292
19 288
21 258
22 254
23 238
3 218
24 184
25 150
2 146
28 142
26 140
29 134
27 116
32 102
31 82
30 74
34 62
33 60
36 56
35 48
37 46
0 46
40 32
38 30
39 26
41 20
44 8
42 6
43 4
45 4
46 3
135 1
170 1
235 1
270 1
47 1
Name: count, dtype: int64
Continuous=['UserID','Yearly_avg_view_on_travel_page','member_in_family']
Discrete_categorical=['preferred_device','Taken_product','preferred_location_type']
Discrete_count=['Yearly_avg_comment_on_travel_page','total_likes_on_outofstation_checkin_received','week_since_last_outstation_checkin','montly_avg_comment_on_company_page','total_likes_on_outstation_checkin_given']
# Correlation matrix of the numeric columns
numeric_df = df.select_dtypes(include=['number'])
corr_matrix = numeric_df.corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()
# Numerical vs. Categorical: Boxplots or Violin Plots
# Example:
plt.figure(figsize=(8,5))
sns.boxplot(x=df['Taken_product'], y=df['Yearly_avg_view_on_travel_page'])
plt.title('Yearly Avg View on Travel Page by Product Uptake')
plt.tight_layout()
plt.show()
# Interpret the correlation matrix
print("Positive correlations:")
print("- Yearly_avg_view_on_travel_page and total_likes_on_outstation_checkin_given are positively correlated (0.65)")
print("\nNegative correlations:")
print("- Yearly_avg_view_on_travel_page and week_since_last_outstation_checkin are negatively correlated (-0.32)")
print("\nKey observations:")
print("- Users who view travel pages more frequently tend to give more likes to outstation check-ins.")
print("- Users who have checked in recently tend to view travel pages less frequently.")
Positive correlations:
- Yearly_avg_view_on_travel_page and total_likes_on_outstation_checkin_given are positively correlated (0.65)
Negative correlations:
- Yearly_avg_view_on_travel_page and week_since_last_outstation_checkin are negatively correlated (-0.32)
Key observations:
- Users who view travel pages more frequently tend to give more likes to outstation check-ins.
- Users who have checked in recently tend to view travel pages less frequently.
df.head()
[Chart: Taken_product vs member_in_family]
# Missing value treatment
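The header above announces missing-value treatment, but the cell body was not captured in the export. A minimal sketch under one common approach — median imputation for the numeric columns and mode imputation for the categorical ones; the column lists come from the isnull() summary earlier:
# Numeric columns: fill NaNs with the median (robust to the outliers seen above)
for col in ['Yearly_avg_view_on_travel_page',
            'total_likes_on_outstation_checkin_given',
            'Yearly_avg_comment_on_travel_page',
            'yearly_avg_Outstation_checkins']:  # numeric after the earlier coercion
    df[col] = df[col].fillna(df[col].median())
# Categorical columns: fill NaNs with the mode (most frequent category)
for col in ['preferred_device', 'preferred_location_type', 'following_company_page']:
    df[col] = df[col].fillna(df[col].mode()[0])
df.isnull().sum().sum()  # expect 0 after imputation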
# Iterate over numerical columns and cap outliers using the IQR method
for column in df.select_dtypes(include=['number']).columns:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[column] = np.clip(df[column], lower_bound, upper_bound)
# Variable transformation: standardize the numeric features
# Assumption: all numeric columns except the UserID identifier are scaled
numerical_cols = df.select_dtypes(include=['number']).columns.drop('UserID')
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[numerical_cols])
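transformed_df, whose head is printed below and whose columns are listed at the end of this section, is never constructed in the captured cells. A minimal sketch, assuming it concatenates one-hot dummies of the categorical columns with the standardized numeric features (the dummy names match the printed output; the construction itself is this sketch's assumption):
# Assumption: transformed_df = one-hot dummy columns + standardized numeric features
dummies = pd.get_dummies(
    df[['preferred_device', 'Taken_product', 'preferred_location_type']], dtype=float)
scaled_df = pd.DataFrame(scaled_data, columns=numerical_cols, index=df.index)
transformed_df = pd.concat([dummies, scaled_df], axis=1)
print(transformed_df.head())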
preferred_device_ANDROID preferred_device_Android preferred_device_Android OS \
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
preferred_device_Tab preferred_device_iOS \
0 0.0 0.0
1 0.0 1.0
2 0.0 0.0
3 0.0 1.0
4 0.0 0.0
preferred_location_type_Tour Travel \
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
Daily_Avg_mins_spend_on_traveling_page Yearly_avg_comment_on_travel_page \
0 -0.705974 0.898954
1 -0.455347 -0.634092
2 -0.831287 0.806042
3 -0.705974 -0.866371
4 -0.956600 -1.609666
Yearly_avg_view_on_travel_page montly_avg_comment_on_company_page \
0 0.401152 -1.611938
1 1.305515 0.019795
2 -0.051030 -1.068027
3 -0.503212 -1.611938
4 -1.181484 -1.475960
total_likes_on_outofstation_checkin_received \
0 -0.090842
1 -0.289462
2 -0.989117
3 -0.800624
4 -0.671971
total_likes_on_outstation_checkin_given week_since_last_outstation_checkin
0 0.751797 1.833319
1 -1.323014 -0.842262
2 1.434997 1.068868
# Addition of new variables: inspect the one-hot dummy columns added to transformed_df
print(transformed_df.head())
preferred_device_ANDROID preferred_device_Android \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
preferred_device_Android OS preferred_device_Laptop \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
preferred_device_Tab preferred_device_iOS \
0 0.0 0.0
1 0.0 1.0
2 0.0 0.0
3 0.0 1.0
4 0.0 0.0
Yearly_avg_view_on_travel_page montly_avg_comment_on_company_page \
0 0.401152 -1.611938
1 1.305515 0.019795
2 -0.051030 -1.068027
3 -0.503212 -1.611938
4 -1.181484 -1.475960
total_likes_on_outofstation_checkin_received \
0 -0.090842
1 -0.289462
2 -0.989117
3 -0.800624
4 -0.671971
total_likes_on_outstation_checkin_given \
0 0.751797
1 -1.323014
2 1.434997
3 1.482897
4 -0.536451
print("\n**Potential Solutions:**")
print("- **Resampling Techniques:**")
print(" - **Oversampling:** Duplicate instances from the minority class to increase its representation.")
print(" - **Undersampling:** Remove instances from the majority class to balance the dataset.")
print(" - **SMOTE (Synthetic Minority Over-sampling Technique):** Generate synthetic samples for the minority class.")
print("\n**Choice of Method:**")
print("The best approach depends on the specific dataset and business context. Consider the severity of the imbalance, the size of the dataset, and the potential impact of miscla
print("\n**Business Context:**")
print("In this case, accurately identifying potential customers who are likely to purchase tickets (i.e., 'Taken_product' = 'Yes') is crucial for targeted advertising. Therefore
Taken_product
No 9864
Yes 1896
Name: count, dtype: int64
Taken_product
No 83.877551
Yes 16.122449
Name: count, dtype: float64
**Potential Solutions:**
- **Resampling Techniques:**
- **Oversampling:** Duplicate instances from the minority class to increase its representation.
- **Undersampling:** Remove instances from the majority class to balance the dataset.
- **SMOTE (Synthetic Minority Over-sampling Technique):** Generate synthetic samples for the minority class.
- **Cost-Sensitive Learning:**
- Assign higher misclassification costs to the minority class during model training to penalize errors on that class more heavily.
- **Ensemble Methods:**
- Use techniques like bagging or boosting, which can be effective in handling imbalanced datasets.
**Choice of Method:**
The best approach depends on the specific dataset and business context. Consider the severity of the imbalance, the size of the dataset, and the potential impact of misclassifying each class.
**Business Context:**
In this case, accurately identifying potential customers who are likely to purchase tickets (i.e., 'Taken_product' = 'Yes') is crucial for targeted advertising. Therefore, a method that improves recall on the minority class, such as SMOTE, is a reasonable choice.
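Given this imbalance and the stated need for accuracy, a resampling step could look like the following — a minimal sketch using imbalanced-learn's SMOTE, assuming transformed_df from above supplies the features; X, y, and the dropped target dummies are this sketch's choices, not cells from the original notebook:
from imblearn.over_sampling import SMOTE
# Features: everything in transformed_df except the target's own dummy columns
X = transformed_df.drop(columns=['Taken_product_No', 'Taken_product_Yes'])
y = df['Taken_product']
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(y_res.value_counts())  # both classes should now match the majority count (9864)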
print(transformed_df.columns)
Index(['preferred_device_ANDROID', 'preferred_device_Android',
'preferred_device_Android OS', 'preferred_device_Laptop',
'preferred_device_Mobile', 'preferred_device_Other',
'preferred_device_Others', 'preferred_device_Tab',
'preferred_device_iOS', 'preferred_device_iOS and Android',
'Taken_product_No', 'Taken_product_Yes',
'preferred_location_type_Beach', 'preferred_location_type_Big Cities',
'preferred_location_type_Entertainment',
'preferred_location_type_Financial', 'preferred_location_type_Game',
'preferred_location_type_Hill Stations',