Assignment 2 Twitter Analysis
Complete the following tasks using RStudio and record the results and your analysis in an R Markdown file.
Apple or Pi
We have found that many Twitter users do not use the term #applepi, but prefer to mention #apple and
#raspberrypi separately. A report from Twitter shows that, of all the tweets from the past month, the
terms #apple and #raspberrypi appear in the following numbers of tweets.
                          #apple
                          Present        Absent
#raspberrypi   Present    1,230,443      2,224,178
               Absent     17,233,452     1,245,776,300
1. What is the probability of a tweet from that month containing the term #apple?
2. What is the probability of a tweet from that month containing both #apple and #raspberrypi?
3. What is the probability of a tweet from that month containing the term #apple given that it contains the
term #raspberrypi?
4. Examine if the terms #apple and #raspberrypi are independent for that month.
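The four questions above all reduce to arithmetic on a 2x2 table of counts. The following is a minimal R sketch of that arithmetic, using small made-up counts rather than the figures from the report; the names counts, p.a, p.b, p.ab and p.a.given.b are illustrative choices, not part of the assignment.

```r
# Hypothetical counts (NOT the assignment's data): rows are term B
# present/absent, columns are term A present/absent.
counts = matrix(c(30, 70,
                  20, 880),
                nrow = 2, byrow = TRUE,
                dimnames = list(B = c("Present", "Absent"),
                                A = c("Present", "Absent")))

total = sum(counts)

# marginal probabilities of each term
p.a = sum(counts[, "Present"]) / total
p.b = sum(counts["Present", ]) / total

# joint probability that both terms appear
p.ab = counts["Present", "Present"] / total

# conditional probability of A given B
p.a.given.b = p.ab / p.b

# under independence, P(A and B) would equal P(A) * P(B)
c(joint = p.ab, product = p.a * p.b)
```

If the joint probability is far from the product of the marginals, the two terms are unlikely to be independent; chisq.test(counts) is one way to formalise that comparison.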
We found that many of the tweets containing information about Apple and Raspberry Pi are from employees at
the magazine Crazy Mac. For 62 of the employees at Crazy Mac, we counted the number of days during
January 2014 on which each employee wrote a tweet containing information about Apple and Raspberry Pi. The
following table contains the count of days for each of the 62 employees.
3 4 3 2 4 4 4 3 2 3
9 4 6 5 2 3 4 3 4 4
2 2 1 4 1 2 3 4 7 1
2 3 1 1 3 3 3 7 2 3
3 3 4 3 4 1 2 0 5 3
4 3 2 4 6 2 2 3 1 0
4 3
1. If this sample comes from a Binomial distribution, what are the parameters of the distribution (n and p),
estimated from this sample?
2. Compute the difference between the sample standard deviation and the Binomial standard deviation
(using the computed parameters n and p for the Binomial distribution). Does this provide evidence for
or against the distribution being Binomial?
3. Given that the distribution is Binomial, what is the probability of an employee of Crazy Mac mentioning
Apple and Raspberry Pi in a tweet on more than 5 days of February 2014 (assuming that
the parameter p is the same for January and February)?
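As a rough sketch of the calculations these questions call for, the following R code works through a small made-up sample rather than the 62 counts above. It assumes that n is the number of days in the month (31 for January, 28 for February 2014) and estimates p by the method of moments; the variable names are illustrative.

```r
# Hypothetical counts of tweeting days for six employees (NOT the assignment's data)
days = c(3, 4, 2, 5, 1, 3)

# assumption: each day of January is one Bernoulli trial
n.jan = 31

# method-of-moments estimate: the Binomial mean is n * p
p.hat = mean(days) / n.jan

# standard deviation implied by the fitted Binomial, and the sample value
binom.sd = sqrt(n.jan * p.hat * (1 - p.hat))
sample.sd = sd(days)
sample.sd - binom.sd

# P(X > 5) for a month with 28 days, keeping the same p
n.feb = 28
p.more.than.5 = pbinom(5, size = n.feb, prob = p.hat, lower.tail = FALSE)
p.more.than.5
```

A sample standard deviation close to the Binomial value is consistent with the Binomial model; a large gap in either direction counts against it.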
A report from Australian Twits has stated that there are an average of 10.2 tweets per day containing the
term #applepi. Using the following R code, and the given average, we can simulate counting the tweets per
day for fifty days, and compute the mean:
# set up storage array
sample.means = rep(0, 1000)
# loop 1000 times
for (a in 1:1000) {
  # obtain a random sample of 50 daily counts from a Poisson distribution
  poisson.sample = rpois(50, lambda = 10.2)
  # compute the mean of the sample, and store the mean
  sample.means[a] = mean(poisson.sample)
}
1. Find the mean and standard deviation of the sample means from the simulation and compare them to the
theoretical mean and standard deviation of the sample mean.
2. Provide an appropriate plot to examine if the distribution of the sample mean follows a Normal distribution.
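For a sample of size n from a Poisson distribution with rate lambda, the sample mean has mean lambda and standard deviation sqrt(lambda / n). The sketch below repeats the simulation with a fixed seed (an addition, so that results are repeatable) and sets the simulated values beside those theoretical ones. A Normal Q-Q plot is shown as one of several plots that could serve for the second task.

```r
set.seed(1)  # assumption: any fixed seed, only for repeatability

lambda = 10.2
n = 50

# 1000 simulated sample means, each from 50 Poisson daily counts
sample.means = replicate(1000, mean(rpois(n, lambda = lambda)))

# simulated against theoretical mean and standard deviation
c(simulated = mean(sample.means), theoretical = lambda)
c(simulated = sd(sample.means), theoretical = sqrt(lambda / n))

# Normal Q-Q plot: points near the line suggest approximate Normality
qqnorm(sample.means)
qqline(sample.means)
```

A histogram from hist(sample.means) would be a reasonable alternative to the Q-Q plot.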
To examine the validity of the statement that there are an average of 10.2 tweets per day containing the term
#applepi, we observed the tweet count per day of tweets containing the term #applepi for 50 days and obtained
the following sample:
5 2 3 6 6 0 3 3 2 0
1 6 1 0 2 1 2 4 2 5
3 4 3 4 3 4 0 4 5 3
2 6 3 2 9 5 1 3 2 8
2 2 3 3 1 3 0 3 4 2
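One way to set a sample like this against the claimed average, offered here only as a sketch and not as the method the assignment prescribes, is an exact Poisson test on the total count. The code uses a short made-up sample rather than the 50 observations above, and assumes the daily counts are independent Poisson counts with a common rate.

```r
# Hypothetical daily counts (NOT the assignment's sample)
tweets = c(4, 2, 6, 3, 5, 1, 3, 4)

claimed.rate = 10.2

# observed mean count per day
mean(tweets)

# exact test of H0: rate = claimed.rate, using the total count over all days
result = poisson.test(sum(tweets), T = length(tweets), r = claimed.rate)
result$p.value
result$conf.int  # confidence interval for the daily rate
```

A small p-value, or a confidence interval that excludes the claimed rate, would count as evidence against the claim.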
Assignment Submission
One assignment is to be submitted per student by the due date, containing the description and results from
performing the tasks in sections 1 to 4. The assignment is to be written in R Markdown, Knitted to HTML,
then converted to PDF.
To submit the assignment, log in to the 300700 control panel:
http://staff.scem.uws.edu.au/~lapark/300700/login.php
and go to the Assignment Submission box and submit the PDF. If submitted successfully,
the assignment will appear in your list of submitted assignments. Please compare the MD5 sum of the recorded
file with your own to ensure that the correct file has been received (footnote 1). Note that any resubmission will overwrite
the previous submission.
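An MD5 sum can be computed without leaving R. The sketch below uses md5sum from the tools package that ships with R. A temporary file stands in for the knitted PDF; in practice pdf.path (a name chosen here for illustration) would point at the file you submitted.

```r
# a temporary file stands in for the submitted PDF
pdf.path = tempfile(fileext = ".pdf")
writeLines("placeholder contents", pdf.path)

# 32-character hexadecimal MD5 sum of the file
local.md5 = unname(tools::md5sum(pdf.path))
local.md5

# compare with the sum shown on the submission page (placeholder string here)
recorded.md5 = "paste the recorded sum here"
identical(local.md5, recorded.md5)
```

The comparison returns TRUE only when the two sums match exactly, character for character.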
The first page of your assignment should contain the declaration shown in Figure 1. Note: An examiner
or lecturer/tutor has the right not to mark this project report if the above declaration has not
been added to the cover of the report. Each group member's name and student number should be written
after the declaration, along with the percentage contribution of the member to the assignment. Note that a
contribution may involve a solution to a problem, writing up the solution, helping a group member, or any
other task that has led to increasing the group members' understanding of the assignment content.
The assignment should begin on the second page. No identifying information (Name or student
number) should be placed on any page except the first. This is so that the report can be anonymised
by removing the cover page.
1 For
Marking Criteria
The assignment will contribute a maximum of 5 marks towards each student's final mark. Four of the marks
will be awarded based on the report. One mark will be awarded based on each student's peer review.
Report Assessment
Each of the four sections is worth one mark. A mark (or fraction of it) will be awarded proportional to the
understanding of the problem and the solution presented. Remember that the assessor will only be able to read
what you have written; therefore, clearly explain all decisions made. Each section will be marked according to:
Solution Type
Marks
1
0.5
0
The same mark out of four will be awarded to each member of the group, given that the contribution of
each member was equal.
2 http://en.wikipedia.org/wiki/Kendall_tau
Student Number
% Contribution to Assignment