
Data Analytics and Visualization of TwitterSpam Dataset

The document analyzes a Twitter spam dataset using data visualization. Density plots show that non-spam accounts tend to have fewer tweets while spam accounts have more. Scatterplots visualize the relationship between account age and followers, showing that spam accounts cluster together while non-spam accounts are more dispersed. Regression lines are then added, revealing that older non-spam accounts gain followers over time, while spam accounts show more variation in follower counts at younger account ages.

Uploaded by

Gloria Auma
Copyright
© All Rights Reserved

Introduction:

In this post, we analyze the TwitterSpam dataset using data analytics and visualization. The dataset
includes 'no_tweets', 'account_age', 'no_follower', and a binary spam indicator (referenced as 'label' in
the scripts below), among other attributes. Our main goals are to use density plots to compare the tweet
counts of spam and non-spam accounts, scatterplots to show the relationship between 'account_age' and
'no_follower', and regression lines to make the associations easier to see.

1. Density Plot for 'no_tweets':


To start our analysis, we load the TwitterSpam dataset into RStudio and use ggplot() from the ggplot2
package to create a density plot of the 'no_tweets' column, comparing spam and non-spam accounts.

R Script:

# Load required libraries
library(ggplot2)

# Load the dataset and inspect its column names
twitter_spam <- read.csv("TwitterSpam.txt")
colnames(twitter_spam)

# Create density plot of tweet counts, one semi-transparent curve per class
ggplot(twitter_spam, aes(x = no_tweets, fill = label)) +
  geom_density(alpha = 0.6) +
  labs(title = "Density Plot of no_tweets by Spam Status",
       x = "Number of Tweets",
       y = "Density",
       fill = "Spam Status") +
  theme_minimal()
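
A note on the 'label' column: ggplot2 draws a separate density per class only when the variable mapped to fill is discrete (a factor or character vector). The post does not show how 'label' is encoded in TwitterSpam.txt, so the sketch below assumes, purely for illustration, that it is stored as 0/1 integers and converts it to a factor. The data frame `toy` and the level names are hypothetical.

```r
# Hypothetical stand-in for the real data: 'label' stored as 0/1 integers.
# The actual encoding in TwitterSpam.txt is not shown in this post.
toy <- data.frame(
  no_tweets = c(120, 45, 3000, 5200, 80, 4100),
  label     = c(0, 0, 1, 1, 0, 1)
)

# Convert the numeric indicator to a factor with readable level names
toy$label <- factor(toy$label, levels = c(0, 1),
                    labels = c("non-spam", "spam"))

str(toy)           # 'label' is now a factor with 2 levels
table(toy$label)   # number of rows per class
```

If 'label' already arrives as text or as a factor, this step is unnecessary and the plotting code can be used unchanged.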

Here is the output:

[Figure: density plot of no_tweets by spam status]


Observation from Density Plot:

The density plot displays the distribution of 'no_tweets' for spam and non-spam accounts. The density
for non-spam accounts is higher in the lower range of the 'no_tweets' axis, showing that non-spam
accounts often have fewer tweets. In the upper range of the axis, the density for spam accounts is
higher, indicating that spam accounts typically post more tweets than normal accounts. This difference
in distribution shows how spam and non-spam accounts differ in the number of tweets they post.
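
The visual comparison above can be backed by simple per-class summaries. The following is a minimal sketch using base R on synthetic data; the values are generated only to mimic the pattern described and are not taken from the TwitterSpam dataset. The names `toy`, `medians`, and `iqrs` are illustrative.

```r
# Synthetic example data (NOT the real TwitterSpam values)
set.seed(1)
toy <- data.frame(
  no_tweets = c(rpois(50, lambda = 200), rpois(50, lambda = 2000)),
  label     = rep(c("non-spam", "spam"), each = 50)
)

# Median and interquartile range of tweet counts for each class
medians <- tapply(toy$no_tweets, toy$label, median)
iqrs    <- tapply(toy$no_tweets, toy$label, IQR)

print(medians)
print(iqrs)
```

On the real data, the same two calls with `twitter_spam$no_tweets` and `twitter_spam$label` would show whether the typical spam account does in fact post more than the typical non-spam account.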

2. Scatterplots for 'account_age' and 'no_follower':


Next, we will create scatterplots to visualize the relationship between 'account_age' and 'no_follower,'
with spammer and non-spammer accounts represented using different colors.

R Script:

# Create scatterplot of account age vs. number of followers, colored by class
ggplot(twitter_spam, aes(x = account_age, y = no_follower, color = label)) +
  geom_point() +
  labs(title = "Scatterplot of Account Age vs. Number of Followers",
       x = "Account Age",
       y = "Number of Followers",
       color = "Spam Status") +
  theme_minimal()

Observation from Scatterplots:

The scatterplot shows the relationship between 'account_age' and 'no_follower'. The points for spam
accounts tend to cluster together, whereas the points for non-spam accounts are more dispersed. This
suggests that spam accounts have similar account ages and follower counts, while non-spam accounts
span a wide range of both. Using different colors makes it easy to tell spam and non-spam accounts
apart in the plot.
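
How tightly each class clusters can also be expressed numerically, for example as the standard deviation of each variable within each class. The sketch below uses synthetic data built to resemble the pattern described (a tight spam cluster, widely spread non-spam accounts). The numbers and the scale of 'account_age' are assumptions, since the post does not state them.

```r
# Synthetic example data (NOT the real TwitterSpam values)
set.seed(2)
toy <- data.frame(
  account_age = c(runif(50, 0, 2000), rnorm(50, mean = 300, sd = 40)),
  no_follower = c(runif(50, 0, 5000), rnorm(50, mean = 800, sd = 100)),
  label       = rep(c("non-spam", "spam"), each = 50)
)

# Standard deviation of both variables within each class;
# smaller values mean the points sit closer together
spread <- aggregate(cbind(account_age, no_follower) ~ label,
                    data = toy, FUN = sd)
print(spread)
```

A much smaller standard deviation for one class on both axes corresponds to a compact cluster in the scatterplot.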

3. Adding Regression Lines to the Scatterplot:


To gain further insights into the relationships depicted in the scatterplot, we will add regression lines
separately for spammer and non-spammer accounts.

R Script:

# Add regression lines to scatterplot (one linear fit per class)
ggplot(twitter_spam, aes(x = account_age, y = no_follower, color = label)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatterplot with Regression Lines of Account Age vs. Number of Followers",
       x = "Account Age",
       y = "Number of Followers",
       color = "Spam Status") +
  theme_minimal()

Observation from Regression Lines:

The regression lines summarize the overall trends in the scatterplot. The regression line for non-spam
accounts has a positive slope, indicating that the number of followers rises with the account's age. This
is consistent with the hypothesis that older accounts have had more time to acquire followers naturally.
The regression line for spam accounts also has a positive slope. However, the data points for spam
accounts are more widely dispersed around their regression line, indicating more variation in follower
numbers, particularly among newer spam accounts. This suggests that some spam accounts are able to
gain a sizable following despite being young.
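
Because the color aesthetic splits the data by class, geom_smooth(method = "lm") fits one linear model per class. The same fits can be reproduced with lm() to read off the slopes and the scatter around each line. This is a minimal sketch on synthetic data whose slopes and noise levels were chosen to imitate the pattern described above; none of the numbers come from the TwitterSpam dataset.

```r
# Synthetic example data (NOT the real TwitterSpam values)
set.seed(3)
n <- 60
toy <- data.frame(
  account_age = runif(2 * n, 0, 1000),
  label       = rep(c("non-spam", "spam"), each = n)
)
toy$no_follower <- ifelse(
  toy$label == "spam",
  100 + 0.5 * toy$account_age + rnorm(2 * n, sd = 200),  # noisier class
  50  + 1.0 * toy$account_age + rnorm(2 * n, sd = 50)
)

# Fit a separate linear model for each class
fits <- lapply(split(toy, toy$label),
               function(d) lm(no_follower ~ account_age, data = d))

# Slope of each line and residual standard deviation around it
slopes   <- sapply(fits, function(f) coef(f)[["account_age"]])
resid_sd <- sapply(fits, sigma)

print(slopes)
print(resid_sd)
```

A positive entry in `slopes` matches an upward-sloping line in the plot, and a larger entry in `resid_sd` matches a wider scatter of points around that line.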

Advantages of Adding Regression Lines:

1. Visual Representation of Trends: Regression lines provide a clear visual representation of the overall
trend between two variables. They help readers identify the general direction of the relationship, such
as positive, negative, or no correlation.

2. Identifying Outliers: The regression line makes potential outliers easier to identify. Points that
deviate significantly from the regression line might be unusual data points that need further
investigation.

3. Quantifying Relationships: The slope of the regression line provides a quantitative measure of the
relationship's direction and rate of change. It allows us to estimate how much the dependent variable
changes per unit change in the independent variable. The strength of the relationship is better judged
from the correlation coefficient or R-squared than from the slope alone.
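
The points in the list above can be made concrete with a fitted model. The sketch below, on synthetic data with one deliberately planted outlier, extracts the slope, the R-squared value, and the points whose standardized residual exceeds 3 in absolute value. The cutoff of 3 is a common rule of thumb assumed here for illustration, not something prescribed by the post.

```r
# Synthetic example data (NOT the real TwitterSpam values)
set.seed(4)
toy <- data.frame(account_age = 1:40)
toy$no_follower <- 10 + 2 * toy$account_age + rnorm(40, sd = 3)
toy$no_follower[20] <- toy$no_follower[20] + 40   # plant one obvious outlier

fit <- lm(no_follower ~ account_age, data = toy)

slope    <- coef(fit)[["account_age"]]   # change in followers per unit of age
r2       <- summary(fit)$r.squared       # share of variance explained by the line
std_res  <- rstandard(fit)               # standardized residuals
outliers <- which(abs(std_res) > 3)      # rows far from the regression line

print(slope)
print(r2)
print(outliers)
```

The slope answers "how much", R-squared answers "how well the line fits", and the flagged rows are candidates for closer inspection rather than automatic removal.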

Conclusion:

Using data analytics and visualization, we compared spam and non-spam accounts with density plots,
showed the relationship between 'account_age' and 'no_follower' with scatterplots, and added
regression lines to make the trends easier to see. The density plot highlighted differences in the number
of tweets between spam and non-spam accounts, while the scatterplot and regression lines shed light on
the relationship between 'account_age' and 'no_follower' for both groups.
