Data Analytics and Visualization of TwitterSpam Dataset
Introduction:
In this post we analyze the TwitterSpam dataset using data analytics and visualization. The dataset
includes 'no_tweets', 'account_age', 'no_follower', and a binary 'spam' indicator, among other attributes.
We have three goals: compare spam and non-spam accounts with density plots, show the relationship
between 'account_age' and 'no_follower' with scatterplots, and add regression lines to make the
associations easier to see.
R Script:
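library(ggplot2)
# Inspect the available columns
colnames(twitter_spam)
# Density of tweet counts per class. The ggplot() call was missing from the
# original listing; the mapping below is an assumed reconstruction.
ggplot(twitter_spam, aes(x = no_tweets, fill = factor(spam))) +
  geom_density(alpha = 0.6) +
  labs(x = "Number of Tweets",
       y = "Density",
       fill = "Spam") +
  theme_minimal()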
The density plot displays the distribution of 'no_tweets' for spam and non-spam accounts. The density
for non-spam accounts is higher at the low end of the 'no_tweets' axis, showing that non-spam accounts
often have fewer tweets. Toward the high end of the axis the density for spam accounts is higher,
indicating that spam accounts typically post more tweets than normal accounts. This difference in
distribution shows how the two groups differ in the number of tweets they post.
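The visual comparison above can be backed by per-class summary statistics. The following is a minimal
sketch, assuming a data frame with a numeric 'no_tweets' column and a 0/1 'spam' column. The helper
name summarize_tweets_by_class and the toy data frame are made up for illustration; they are not part
of the original analysis and the toy values are not taken from the TwitterSpam dataset.

```r
# Hypothetical helper: median and mean of no_tweets for each spam class.
summarize_tweets_by_class <- function(df) {
  data.frame(
    spam   = sort(unique(df$spam)),
    # tapply() groups no_tweets by spam and returns results in sorted class order
    median = as.vector(tapply(df$no_tweets, df$spam, median)),
    mean   = as.vector(tapply(df$no_tweets, df$spam, mean))
  )
}

# Toy data for illustration only (NOT the TwitterSpam dataset).
toy <- data.frame(
  no_tweets = c(10, 20, 30, 40, 100, 200, 300, 400),
  spam      = c(0, 0, 0, 0, 1, 1, 1, 1)
)

tweet_summary <- summarize_tweets_by_class(toy)
print(tweet_summary)
```

With the real data, the call would be summarize_tweets_by_class(twitter_spam). A higher median for the
spam class would agree with what the density plot suggests.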
R Script:
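# ggplot() and labs() were missing from the original listing; assumed reconstruction
ggplot(twitter_spam, aes(x = account_age, y = no_follower, color = factor(spam))) +
  geom_point() +
  labs(x = "Account Age", y = "Number of Followers", color = "Spam") +
  theme_minimal()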
The scatterplot shows the relationship between the 'account_age' and 'no_follower' columns. The
points for spam accounts tend to cluster together, whereas the points for non-spam accounts are more
dispersed. This suggests that spam accounts share similar account ages and follower counts, while
non-spam accounts cover a wide range of both. Plotting the two classes in different colors makes it
easy to tell spam from non-spam accounts in the scatterplot.
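R Script:
# ggplot() and geom_smooth() were missing from the original listing; a linear
# fit per class is an assumed reconstruction based on the discussion that follows
ggplot(twitter_spam, aes(x = account_age, y = no_follower, color = factor(spam))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatterplot with Regression Lines of Account Age vs. Number of Followers",
       x = "Account Age",
       y = "Number of Followers") +
  theme_minimal()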
The regression lines make the overall trends in the scatterplot easier to see. The regression line for
non-spam accounts has a positive slope, indicating that the number of followers rises with the
account's age. This is consistent with the hypothesis that older accounts have had more time to
acquire followers naturally. The regression line for spam accounts also has a positive slope, so
follower counts rise with account age in that group as well. The points for spam accounts are more
widely scattered around their regression line, however, which indicates more variation in follower
numbers, particularly among newer spam accounts: some of them appear to gain a sizable following
despite their young age.
Regression lines add to the scatterplot in three ways:
1. Visual Representation of Trends: Regression lines provide a clear visual representation of the overall
trend between two variables. They help readers identify the general direction of the relationship:
positive, negative, or no correlation.
2. Identifying Outliers: The regression line makes potential outliers easier to identify. Points that
deviate significantly from the regression line might indicate unusual data points that need further
investigation.
3. Quantifying Relationships: The slope of the regression line provides a quantitative measure of the
relationship's direction and rate of change. It estimates how much the dependent variable changes
for each unit change in the independent variable. The strength of the relationship is measured
separately, for example by the correlation coefficient or R-squared.
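Points 2 and 3 above can also be checked numerically rather than read off the plot. The following is
a minimal sketch, assuming a data frame with 'account_age', 'no_follower', and a 0/1 'spam' column.
The helper name fit_by_class, the residual cutoff of 2, and the toy data are assumptions made for
illustration; they are not part of the original analysis and the toy values are not taken from the
TwitterSpam dataset.

```r
# Hypothetical helper: fit no_follower ~ account_age separately for each spam
# class and report the slope, R-squared, and the number of points whose
# standardized residual exceeds resid_cutoff in absolute value.
fit_by_class <- function(df, resid_cutoff = 2) {
  per_class <- lapply(split(df, df$spam), function(g) {
    fit <- lm(no_follower ~ account_age, data = g)
    data.frame(
      spam       = g$spam[1],
      slope      = unname(coef(fit)["account_age"]),
      r_squared  = summary(fit)$r.squared,
      n_outliers = sum(abs(rstandard(fit)) > resid_cutoff)
    )
  })
  do.call(rbind, per_class)
}

# Toy data for illustration only (NOT the TwitterSpam dataset).
toy <- data.frame(
  account_age = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  no_follower = c(12, 19, 31, 42, 48, 5, 60, 20, 90, 40),
  spam        = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

fits <- fit_by_class(toy)
print(fits)
```

With the real data, the call would be fit_by_class(twitter_spam). A lower R-squared for the spam class
would correspond to the wider scatter around its regression line described earlier.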
Conclusion:
We compared spam and non-spam accounts using density plots, showed the relationship between
'account_age' and 'no_follower' using scatterplots, and added regression lines to the visualizations to
make the trends easier to read. The density plot highlighted differences in the number of tweets
between spam and non-spam accounts, while the scatterplots and regression lines shed light on the
relationship between 'account_age' and 'no_follower' for both groups.