RStudio for R Statistical Computing Cookbook
Over 50 practical and useful recipes to help you perform data analysis with R by unleashing every native RStudio feature

Andrea Cirillo

BIRMINGHAM - MUMBAI
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers
and distributors will be held liable for any damages caused or alleged to be caused directly
or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78439-103-4
www.packtpub.com
Credits
Reviewer
Mark van der Loo

Copy Editor
Karuna Narayan

Proofreader
Safis Editing
About the Author
Andrea Cirillo is currently working as an internal auditor at Intesa Sanpaolo banking group.
He gained a lot of financial and external audit experience at Deloitte Touche Tohmatsu and
internal audit experience at FNM, a listed Italian company.
His current main responsibilities involve evaluation of credit risk management models and
their enhancement mainly within the field of the Basel III capital agreement.
Andrea has written and contributed to a few useful R packages and regularly shares insightful
advice and tutorials about R programming.
His research and work mainly focuses on the use of R in the fields of risk management
and fraud detection, mainly through modeling custom algorithms and developing
interactive applications.
This book is the result of a lot of patience from my wife and sons, who left me the time to write it, time that I should have spent with them.
And from my colleagues, who endured my talking about the book every three hours and my asking for their opinions about almost every recipe.
About the Reviewer

Mark van der Loo is a statistical researcher who specializes in data cleaning methodology
and likes to program in R and C. He is the author and coauthor of several R packages published
on CRAN, including stringdist, validate, deductive, lintools, and several others. In 2012, he
authored Learning RStudio for R Statistical Computing, Packt Publishing, with Edwin de Jonge.
www.PacktPub.com
At www.PacktPub.com, you can also read a collection of free technical articles, sign up
for a range of free newsletters and receive exclusive discounts and offers on Packt books
and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book
library. Here, you can search, access, and read Packt's entire library of books.
Why Subscribe?
- Fully searchable across every book published by Packt
- Copy and paste, print, and bookmark content
- On demand and accessible via a web browser
Table of Contents
Preface
Chapter 1: Acquiring Data for Your Project
Introduction
Acquiring data from the Web – web scraping tasks
Accessing an API with R
Getting data from Twitter with the twitteR package
Getting data from Facebook with the Rfacebook package
Getting data from Google Analytics
Loading your data into R with the rio package
Converting file formats using the rio package
Chapter 2: Preparing for Analysis – Data Cleansing and Manipulation
Introduction
Getting a sense of your data structure with R
Preparing your data for analysis with the tidyr package
Detecting and removing missing values
Substituting missing values using the mice package
Detecting and removing outliers
Performing data filtering activities
Chapter 3: Basic Visualization Techniques
Introduction
Looking at your data using the plot() function
Using pairs.panel() to look at (visualize) correlations between variables
Adding text to a ggplot2 plot at a custom location
Changing axes appearance of a ggplot2 plot (continuous axes)
Producing a matrix of graphs with ggplot2
Drawing a route on a map with ggmap
Making use of the igraph package to draw a network
Showing communities in a network with the linkcomm package
Preface
Why should you read RStudio for R Statistical Computing Cookbook?
Well, even if there are plenty of books and blog posts about R and RStudio out there, this
cookbook can be an unbeatable friend through your journey from being an average R and
RStudio user to becoming an advanced and effective R programmer.
I have collected more than 50 recipes here, covering the full spectrum of data analysis
activities, from data acquisition and treatment to results reporting.
All of them come from my direct experience as an auditor and data analyst and from
knowledge sharing with the really dynamic and always growing R community.
I took great care selecting and highlighting those packages and practices that have proven
to be the best for a given particular task, sometimes choosing between different packages
designed for the same purpose.
You can therefore be sure that what you will learn here is the cutting edge of the R language
and will place you on the right track of your learning path to R's mastery.
Chapter 2, Preparing for Analysis – Data Cleansing and Manipulation, teaches you how to
get your data ready for analysis, leveraging the latest data-handling packages and advanced
statistical techniques for missing values and outlier treatments.
Chapter 3, Basic Visualization Techniques, lets you get the first sense of your data, highlighting
its structure and discovering patterns within it.
Chapter 4, Advanced and Interactive Visualization, shows you how to produce advanced
visualizations ranging from 3D graphs to animated plots.
Chapter 5, Power Programming with R, discusses how to write efficient R code, making use of R's object-oriented systems and advanced tools for code performance evaluation.
Chapter 6, Domain-specific Applications, shows you how to apply the R language to a wide
range of problems related to different domains, from financial portfolio optimization to
e-commerce fraud detection.
Chapter 7, Developing Static Reports, helps you discover the reporting tools available within
the RStudio IDE and how to make the most of them to produce static reports for sharing
results of your work.
Chapter 8, Dynamic Reporting and Web Application Development, presents a collection of recipes designed to make use of the latest features introduced in RStudio, from Shiny web applications with dynamic UIs to RStudio add-ins.
More software will be needed for a few specific recipes, which will be highlighted in the
Getting Ready section of the respective recipe.
Just a closing note: all the software employed in this book is available for free for personal use, and its greatest advantage is that it is open source and powered by the R community.
If you think you are quite good at R and RStudio but you are still missing something in order
to be great, this book is exactly what you need to read.
Sections
In this book, you will find several headings that appear frequently (Getting ready, How to do it,
How it works, There's more, and See also).
To give clear instructions on how to complete a recipe, we use these sections as follows:
Getting ready
This section tells you what to expect in the recipe, and describes how to set up any software or
any preliminary settings required for the recipe.
How to do it…
This section contains the steps required to follow the recipe.
How it works…
This section usually consists of a detailed explanation of what happened in the previous
section.
There's more…
This section consists of additional information about the recipe in order to make the reader
more knowledgeable about the recipe.
See also
This section provides helpful links to other useful information for the recipe.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of
information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The plot() function is one of the most powerful functions in base R."
New terms and important words are shown in bold. Words that you see on the screen, for
example, in menus or dialog boxes, appear in the text like this: "In order to embed your Sankey
diagram, you can leverage the RStudio Save as Web Page control from the Export menu."
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book—what you liked or disliked. Reader feedback is important for us as it helps us develop
titles that you will really get the most out of.
If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to
get the most from your purchase.
You can download the example code files for this book by following these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
You can also download the code files by clicking on the Code Files button on the book's
webpage at the Packt Publishing website. This page can be accessed by entering the book's
name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of your preferred archive extraction tool.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books—maybe a mistake in the text or the
code—we would be grateful if you could report this to us. By doing so, you can save other
readers from frustration and help us improve subsequent versions of this book. If you find
any errata, please report them by visiting http://www.packtpub.com/submit-errata,
selecting your book, clicking on the Errata Submission Form link, and entering the details of
your errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section of
that title.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media.
At Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at
[email protected], and we will do our best to address the problem.
Chapter 1: Acquiring Data for Your Project
In this chapter, we will cover the following recipes:
- Acquiring data from the Web – web scraping tasks
- Accessing an API with R
- Getting data from Twitter with the twitteR package
- Getting data from Facebook with the Rfacebook package
- Getting data from Google Analytics
- Loading your data into R with the rio package
- Converting file formats using the rio package
Introduction
The American statistician W. Edwards Deming once said:
I think this great quote is enough to highlight the importance of the data acquisition phase
of every data analysis project. This phase is exactly where we are going to start from. This
chapter will give you tools for scraping the Web, accessing data via web APIs, and importing
nearly every kind of file you will probably have to work with quickly, thanks to the magic
package rio.
All the recipes in this book are based on the great and popular packages developed and
maintained by the members of the R community.
After reading this section, you will be able to get all your data into R to start your data analysis
project, no matter where it comes from.
Before starting the data acquisition process, you should gain a clear understanding of your
data needs. In other words, what data do you need in order to get solutions to your problems?
A rule of thumb to solve this problem is to look at the process that you are investigating—from
input to output—and outline all the data that will go in and out during its development.
In this data, you will surely have that chunk of data that is needed to solve your problem.
In particular, for each type of data you are going to acquire, you should define the following:
After covering these points for each set of data, you will have a clear vision of future data
acquisition activities. This will let you plan ahead the activities needed to clearly define
resources, steps, and expected results.
Acquiring data from the Web – web scraping tasks

Much of the data you will need is available on the Web, and it is, therefore, crucial to know how to take that data from the Web and load it into your analytical environment.
You can find data on the Web either in the form of data statically stored on websites (that is,
tables on Wikipedia or similar websites) or in the form of data stored on the cloud, which is
accessible via APIs.
Leaving APIs to the following recipes, here we will go through all the steps you need to get data statically exposed on websites in the form of tabular and nontabular data.
This specific example will show you how to get data from a specific Wikipedia page, the one about the R programming language: https://en.wikipedia.org/wiki/R_(programming_language).
Getting ready
Data statically exposed on web pages is actually part of the web pages' code. Getting it from the Web into our R environment requires us to read that code and find exactly where the data is.
Dealing with complex web pages can become a really challenging task, but luckily,
SelectorGadget was developed to help you with this job. SelectorGadget is a bookmarklet,
developed by Andrew Cantino and Kyle Maxwell, that lets you easily figure out the CSS selector
of your data on the web page you are looking at. Basically, the CSS selector can be seen as
the address of your data on the web page, and you will need it within the R code that you are
going to write to scrape your data from the Web (refer to the next paragraph).
The CSS selector is the token that is used within the CSS code to identify
elements of the HTML code based on their name.
CSS selectors are used within the CSS code to identify which elements are to
be styled using a given piece of CSS code. For instance, the following script
will align all elements (CSS selector *) with 0 margin and 0 padding:
* {
margin: 0;
padding: 0;
}
SelectorGadget is currently employable only via the Chrome browser, so you will need to install
the browser before carrying on with this recipe. You can download and install the latest version of Chrome from https://www.google.com/chrome/.
The SelectorGadget bookmarklet is a long javascript: URL, similar to the following:
javascript:(function(){
s=document.createElement('script');
s.setAttribute('type','text/javascript');
s.setAttribute('src','https://dv0akt2986vzh.cloudfront.net/unstable/lib/selectorgadget.js');
document.body.appendChild(s);
})();
This long URL shows that SelectorGadget is provided as JavaScript; you can make this out from the javascript: token at the very beginning.
We can further analyze the URL by decomposing it into three main parts, which are as follows:
- The creation on the page of a new element of the div class, with the document.createElement('div') statement
- The setting of aesthetic attributes, composed of all the s.style… tokens
- The retrieval of the content of the .js file at https://dv0akt2986vzh.cloudfront.net/unstable/lib/selectorgadget.js
The .js file is where SelectorGadget's core functionalities are actually defined and the place from which they are retrieved to make them available to users.
That being said, I'm not suggesting that you try to use this link to employ SelectorGadget
for your web scraping purposes, but I would rather suggest that you look for the Chrome extension or at the official SelectorGadget page, http://selectorgadget.com. Once you find the link on the official page, save it as a bookmark so that it is easily available
when you need it.
The other tool we are going to use in this recipe is the rvest package, which offers great web
scraping functionalities within the R environment.
To make it available, you first have to install and load it into the global environment by running the following:
install.packages("rvest")
library(rvest)
How to do it...
1. Run SelectorGadget. To do so, after navigating to the web page you are interested
in, activate SelectorGadget by running the Chrome extension or clicking on the
bookmark that we previously saved.
In both cases, after activating the gadget, a Loading… message will appear, and
then, you will find a bar on the bottom-right corner of your web browser, as shown in
the following screenshot:
You are now ready to select the data you are interested in.
2. Select the data you are interested in. After clicking on the data you are going to
scrape, you will note that beside the data you've selected, there are some other
parts on the page that will turn yellow:
When you are done with this fine-tuning process, SelectorGadget will have correctly
identified a proper selector, and you can move on to the next step.
3. Find your data location on the page. To do this, all you have to do is copy the CSS
selector that you will find in the bar at the bottom-right corner:
This piece of text will be all you need in order to scrape the web page from R.
4. The next step is to read data from the Web with the rvest package. The rvest
package by Hadley Wickham is one of the most comprehensive packages for
web scraping activities in R. Take a look at the There's more... section for further
information on package objectives and functionalities.
For now, it is enough to know that the rvest package lets you download HTML code
and read the data stored within the code easily.
Now, we need to import the HTML code from the web page. First of all, we need to define an object storing all the HTML code of the web page you are looking at:
page_source <- read_html('https://en.wikipedia.org/wiki/R_(programming_language)')
This code leverages the read_html() function, which retrieves the source code that resides at the given URL directly from the Web.
5. Next, we will select the defined blocks. Once you have got your HTML code, it is time to extract the part of the code you are interested in. This is done using the html_nodes() function, to which the CSS selector retrieved with SelectorGadget is passed as an argument. This will result in a line of code similar to the following:
version_block <- html_nodes(page_source, ".wikitable th, .wikitable td")
As you can imagine, this code extracts all the content of the selected nodes, including
HTML tags.
Printing out the version_block object, you will obtain a result similar to
the following:
print(version_block)
{xml_nodeset (45)}
[1] <th>Release</th>
[2] <th>Date</th>
[3] <th>Description</th>
[4] <th>0.16</th>
[5] <td/>
[6] <td>This is the last <a href="/wiki/Alpha_test" title="Alpha
test" class="mw-redirect">alp ...
[7] <th>0.49</th>
[8] <td style="white-space:nowrap;">1997-04-23</td>
[9] <td>This is the oldest available <a href="/wiki/Source_code"
title="Source code">source</a ...
[10] <th>0.60</th>
[11] <td>1997-12-05</td>
[12] <td>R becomes an official part of the <a href="/wiki/GNU_
Project" title="GNU Project">GNU ...
[13] <th>1.0</th>
[14] <td>2000-02-29</td>
[15] <td>Considered by its developers stable enough for production
use.<sup id="cite_ref-35" cl ...
[16] <th>1.4</th>
[17] <td>2001-12-19</td>
[18] <td>S4 methods are introduced and the first version for <a
href="/wiki/Mac_OS_X" title="Ma ...
[19] <th>2.0</th>
[20] <td>2004-10-04</td>
This result is not exactly what you are looking for if you are going to work with this
data. However, you don't have to worry about that since we are going to give your text
a better shape in the very next step.
6. In order to obtain a readable and actionable format, we need one more step:
extracting text from HTML tags.
This can be done using the html_text() function, which will result in a list
containing all the text present within the HTML tags:
content <- html_text(version_block)
The final result will be a perfectly workable chunk of text containing the data needed
for our analysis:
[1] "Release"
[2] "Date"
[3] "Description"
[4] "0.16"
[5] ""
[8] "1997-04-23"
[10] "0.60"
[11] "1997-12-05"
[13] "1.0"
[14] "2000-02-29"
[16] "1.4"
[17] "2001-12-19"
[19] "2.0"
[20] "2004-10-04"
[22] "2.1"
[23] "2005-04-18"
[25] "2.11"
[26] "2010-04-22"
[28] "2.13"
[29] "2011-04-14"
[31] "2.14"
[32] "2011-10-31"
[34] "2.15"
[35] "2012-03-30"
[37] "3.0"
[38] "2013-04-03"
[40] "3.1"
[41] "2014-04-10"
[42] ""
[43] "3.2"
[44] "2015-04-16"
[45] ""
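Since the selected cells come from a three-column table (Release, Date, and Description), you can optionally reshape this character vector into a data frame. The following is a minimal sketch that assumes the selector returned complete rows, that is, that the length of content is a multiple of three; real pages may need some extra cleaning before this step:
# Reshape the flat character vector into a three-column data frame.
# The first three entries are assumed to be the table headers.
version_matrix <- matrix(content[-(1:3)], ncol = 3, byrow = TRUE)
version_df <- as.data.frame(version_matrix, stringsAsFactors = FALSE)
names(version_df) <- content[1:3]
head(version_df)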
There's more...
The following are a few useful resources that will help you get the most out of this recipe:
- A useful list of HTML tags, to show you how HTML files are structured and how to identify the code that you need to get from these files, is provided at http://www.w3schools.com/tags/tag_code.asp
- The blog post from the RStudio guys introducing the rvest package and highlighting some package functionalities can be found at http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
Accessing an API with R
A typical use case for API data contains data regarding web and mobile applications, for
instance, Google Analytics data or data regarding social networking activities.
The successful web application If This Then That (IFTTT), for instance, lets you link different applications together, making them share data with each other and building powerful and customizable workflows.
This useful job is done by leveraging the applications' APIs (if you don't know IFTTT, just navigate to https://ifttt.com, and I will see you there).
Using R, it is possible to authenticate and get data from every API that adheres to the OAuth
1 and OAuth 2 standards, which are nowadays the most popular standards (even though opinions about these protocols are changing; refer to the popular post OAuth 2.0 and the Road to Hell by Eran Hammer at http://hueniverse.com/2012/07/26/oauth-2-0-and-the-road-to-hell/). Moreover, specific packages have been developed for a lot of APIs.
This recipe shows how to access custom APIs and leverage packages developed for
specific APIs.
In the There's more... section, suggestions are given on how to develop custom functions
for frequently used APIs.
Getting ready
The httr package, once again a product of our benefactor Hadley Wickham, provides a complete set of functionalities for sending and receiving data through the HTTP protocol on the Web. Take a look at the quick-start guide hosted on GitHub to get a feeling for its functionalities (https://github.com/hadley/httr).
Among those functionalities, functions for dealing with APIs are provided as well.
Both OAuth 1.0 and OAuth 2.0 interfaces are implemented, making this package really useful when working with APIs.
Let's look at how to get data from the GitHub API. By changing small sections, I will point out
how you can apply it to whatever API you are interested in.
How to do it…
1. The first step to connect with the API is to define the API endpoint. Specifications for
the endpoint are usually given within the API documentation. For instance, GitHub
gives this kind of information at http://developer.github.com/v3/oauth/.
In order to set the endpoint information, we are going to use the oauth_endpoint()
function, which requires us to set the following arguments:
request: This is the URL that is required for the initial unauthenticated
token. This is deprecated for OAuth 2.0, so you can leave it NULL in this
case, since the GitHub API is based on this protocol.
authorize: This is the URL where it is possible to gain authorization for the
given client.
access: This is the URL where the exchange for an authenticated token
is made.
base_url: This is the API URL on which other URLs (that is, the URLs
containing requests for data) will be built upon.
In the GitHub example, this will translate to the following line of code:
github_api <- oauth_endpoint(request = NULL,
                             authorize = "https://github.com/login/oauth/authorize",
                             access = "https://github.com/login/oauth/access_token",
                             base_url = "https://github.com/login/oauth")
2. Create an application to get a key and secret token. Moving on with our GitHub
example, in order to create an application, you will have to navigate to https://github.com/settings/applications/new (assuming that you are already authenticated on GitHub).
Be aware that no particular URL is needed as the homepage URL, but a specific URL
is required as the authorization callback URL.
This is the URL that the API will redirect to after the method invocation is done.
As you would expect, since we want to establish a connection from GitHub to our
local PC, you will have to redirect the API to your machine, setting the Authorization callback URL to http://localhost:1410.
After creating your application, you can get back to your R session to establish a
connection with it and get your data.
3. After getting back to your R session, you now have to set your OAuth credentials through the oauth_app() and oauth2.0_token() functions and establish a connection with the API, as shown in the following code snippet:
app <- oauth_app("your_app_name",
                 key = "your_app_key",
                 secret = "your_app_secret")
API_token <- oauth2.0_token(github_api, app)
4. This is where you actually use the API to get data from your web-based software.
Continuing on with our GitHub-based example, let's request some information about
API rate limits:
request <- GET("https://api.github.com/rate_limit", config(token = API_token))
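If the request succeeds, you can parse the response body to inspect the returned information. The following is a minimal sketch using httr's content() function; the resources$core fields follow GitHub's documented rate-limit response and are shown here only as an illustration:
# Parse the JSON body of the response into an R list.
rate_info <- content(request, as = "parsed")
# Inspect the core rate limit (field names assume GitHub's response format).
rate_info$resources$core$limit
rate_info$resources$core$remaining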
How it works...
Be aware that this step will be required both for OAuth 1.0 and OAuth 2.0 APIs, as the
difference between them is only the absence of a request URL, as we noted earlier.
There's more...
You can also write custom functions to handle APIs. When frequently dealing with a particular
API, it can be useful to define a set of custom functions in order to make it easier to interact
with.
Basically, the interaction with an API can be summarized with the following three categories:
- Authentication
- Getting content from the API
- Posting content to the API
You can get content from the API through the GET function of the httr package:
api_get <- function(path = "api_path", password){
  auth <- api_auth(path, password)
  request <- GET("https://api.com", path = path, auth)
  request
}
Posting content will be done in a similar way through the POST function:
api_post <- function(post_body, path = "api_path", password){
  auth <- api_auth(path, password)
  stopifnot(is.list(post_body))
  body_json <- jsonlite::toJSON(post_body)
  request <- POST("https://api.application.com", path = path,
                  body = body_json, auth)
  request
}
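Note that the api_auth() function referenced in these snippets is itself a custom helper whose content depends entirely on the API you are targeting. As a purely illustrative sketch, for an API protected by HTTP basic authentication it could look like the following; an OAuth-protected API would instead return a token created with oauth_app() and oauth2.0_token():
# Hypothetical helper: build the authentication piece of the request.
# The 'path' argument only mirrors the signature used above; basic
# authentication needs nothing more than a username and a password.
api_auth <- function(path = "api_path", password){
  authenticate(user = "your_user_name", password = password, type = "basic")
}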
Getting data from Twitter with the twitteR package

If my words are not enough to convince you, and I think they shouldn't be, you can always perform a quick search on Google, for instance, for text analytics with Twitter, and read the over 30 million results to be sure.
This should not surprise you, given Twitter's huge and widespread base of users, together with the relative structure and richness of metadata of the content on the platform, which makes this social network a place to go when talking about data analysis projects, especially those involving sentiment analysis and customer segmentation.
R comes with a really well-developed package named twitteR, developed by Jeff Gentry,
which offers a function for nearly every functionality made available by Twitter through the API.
The following recipe covers the typical use of the package: getting tweets related to a topic.
Getting ready
First of all, we have to install our great twitteR package by running the following code:
install.packages("twitteR")
library(twitter)
How to do it…
1. As seen with the general procedure, in order to access the Twitter API, you will need
to create a new application. This link (assuming you are already logged in to Twitter)
will do the job: https://apps.twitter.com/app/new.
Feel free to give whatever name, description, and website to your app that you want.
The callback URL can be also left blank.
After creating the app, you will have access to an API key and an API secret, namely
Consumer Key and Consumer Secret, in the Keys and Access Tokens tab in your
app settings.
Below the section containing these tokens, you will find a section called Your Access
Token. These tokens are required in order to let the app perform actions on your
account's behalf. For instance, you may be willing to send direct messages to all new
followers and could therefore write an app to do that automatically.
Keep a note of these tokens as well, since you will need them to set up your
connection within R.
2. Then, we will get access to the API from R. In order to authenticate your app and use it to retrieve data from Twitter, you will just need to run a line of code, specifically, the setup_twitter_oauth() function, by passing the following arguments (an example call is shown after this list):
consumer_key
consumer_secret
access_token
access_secret
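Putting these together, the call will look similar to the following; the four strings are placeholders that you have to replace with the tokens noted from your app settings:
setup_twitter_oauth(consumer_key = "your_consumer_key",
                    consumer_secret = "your_consumer_secret",
                    access_token = "your_access_token",
                    access_secret = "your_access_secret")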
3. Now, we will query Twitter and store the resulting data. We are finally ready for the
core part: getting data from Twitter. Since we are looking for tweets pertaining to a
specific topic, we are going to use the searchTwitter() function. This function
allows you to specify a good number of parameters besides the search string. You
can define the following:
n: This is the number of tweets to be downloaded.
lang: This is the language, specified with its ISO 639-1 code. You can find a partial list of these codes at https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes.
since – until: These are time parameters that define a range of time, where dates are expressed as YYYY-MM-DD, for instance, 2012-05-12.
geocode: This restricts results to tweets by users located within a given radius of a given latitude/longitude, expressed as latitude, longitude, and radius, either in miles or kilometers, for example, 38.481157,-130.500342,1mi.
sinceID – maxID: These define the range of tweet IDs to be returned.
resultType: This is used to filter results based on popularity. Possible values are 'mixed', 'recent', and 'popular'.
retryOnRateLimit: This is the number that defines how many times the query will be retried if the API rate limit is reached.
Supposing that we are interested in tweets regarding data science with R, we run the following function:
tweet_list <- searchTwitter('data science with R', n = 450)
tweet_list will be a list of the first 450 tweets resulting from the given query.
Be aware that since n is the maximum number of tweets retrievable, you may retrieve a smaller number of tweets if, for the given query, the number of results is smaller than n.
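If you need finer control over the query, the parameters listed above can be combined in the same call. The following sketch, with purely illustrative dates, restricts the search to English tweets posted within a given week:
# Restrict the search to English tweets in an illustrative date range.
tweet_list_en <- searchTwitter('data science with R',
                               n = 450,
                               lang = 'en',
                               since = '2015-10-01',
                               until = '2015-10-07')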
In order to let you work on this data more easily, a specific function is provided to transform this list into a more convenient data.frame, namely, the twListToDF() function.
After this, we can run the following line of code:
tweet_df <- twListToDF(tweet_list)
This will result in a tweet_df object that has the following structure:
> str(tweet_df)
'data.frame': 20 obs. of 16 variables:
$ text : chr "95% off Applied Data Science with R -
$ favorited : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ favoriteCount: num 0 2 0 2 0 0 0 0 0 1 ...
$ replyToSN : logi NA NA NA NA NA NA ...
$ created : POSIXct, format: "2015-10-16 09:03:32" "2015-10-
15 17:40:33" "2015-10-15 11:33:37" "2015-10-15 05:17:59" ...
$ truncated : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ replyToSID : logi NA NA NA NA NA NA ...
$ id : chr "654945762384740352" "654713487097135104"
"654621142179819520" "654526612688375808" ...
$ replyToUID : logi NA NA NA NA NA NA ...
$ statusSource : chr "<a href=\"http://learnviral.com/\" rel=\"nofollow\">Learn Viral</a>" "<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>" "<a href=\"http://not.yet/\" rel=\"nofollow\">final one kk</a>" "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>" ...
$ screenName : chr "Learn_Viral" "WinVectorLLC" "retweetjava"
"verystrongjoe" ...
$ retweetCount : num 0 0 1 1 0 0 0 2 2 2 ...
$ isRetweet : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ retweeted : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ longitude : logi NA NA NA NA NA NA ...
$ latitude : logi NA NA NA NA NA NA ...
Referring you to the data visualization chapters for advanced techniques, we will now quickly visualize the retweet distribution of our tweets, leveraging the base R hist() function:
hist(tweet_df$retweetCount)
This code will result in a histogram with the number of retweets on the x axis and the frequency of those numbers on the y axis.
There's more...
As stated in the official Twitter documentation, particularly at https://dev.twitter.com/rest/public/rate-limits, there is a limit to the number of tweets you can retrieve within a certain period of time, and this limit is set to 450 every 15 minutes.
However, what if you are engaged in a really serious job and you want to base your work on a significant number of tweets? Should you set the n argument of searchTwitter() to 450 and wait for 15—everlasting—minutes? Not quite: the twitteR package provides a convenient way to overcome this limit through the register_db_backend(), register_sqlite_backend(), and register_mysql_backend() functions. These functions allow you to create a connection with the named type of database, passing the database name, path, username, and password as arguments, as you can see in the following example:
register_mysql_backend("db_name", "host", "user", "password")
You can now leverage the search_twitter_and_store function, which stores the
search results in the connected database. The main feature of this function is the
retryOnRateLimit argument, which lets you specify the number of tries to be performed by
the code once the API limit is reached. Setting this limit to a convenient level will likely let you
get past the 15-minute interval:
tweets_db <- search_twitter_and_store("data science R", retryOnRateLimit = 20)
Retrieving stored data will now just require you to run the following code:
from_db = load_tweets_db()
Getting data from Facebook with the Rfacebook package

As we did for the twitteR package, we are going to establish a connection with the API and retrieve posts pertaining to a given keyword.
Getting ready
This recipe will mainly be based on functions from the Rfacebook package. Therefore, we need to install and load this package in our environment:
install.packages("Rfacebook")
library(Rfacebook)
How to do it...
1. In order to leverage an API's functionalities, we first have to create an application
in our Facebook profile. Navigating to the following URL will let you create an app
(assuming you are already logged in to Facebook): https://developers.facebook.com.
After skipping the quick start (the button on the upper-right corner), you can see the
settings of your app and take note of app_id and app_secret, which you will
need in order to establish a connection with the app.
2. After installing and loading the Rfacebook package, you will easily be able to establish a connection by running the fbOAuth() function as follows:
fb_connection <- fbOAuth(app_id = "your_app_id",
                         app_secret = "your_app_secret")
fb_connection
Running the last line of code will result in a console prompt, as shown in the following
lines of code:
Copy and paste into Site URL on Facebook App Settings: http://localhost:1410/
When done press any key to continue
Following this prompt, you will have to copy the URL and go to your Facebook
app settings.
Once there, you will have to select the Settings tab and create a new platform
through the + Add Platform control. In the form, which will prompt you after clicking
this control, you should find a field named Site Url. In this field, you will have to paste
the copied URL.
Close the process by clicking on the Save Changes button.
At this point, a browser window will open up and ask you to allow access permission
from the app to your profile. After allowing this permission, the R console will print out
the following code snippet:
Authentication complete
Authentication successful.
3. To test our API connection, we are going to search Facebook for pages related to data science with R and save the results within a data.frame for further analysis.
Among other useful functions, Rfacebook provides the searchPages() function,
which as you would expect, allows you to search the social network for pages
mentioning a given string.
Unlike the searchTwitter() function, this function does not let you specify a lot of arguments:
string: This is the query string
token: This is the valid OAuth token created with the fbOAuth() function
n: This is the maximum number of posts to be retrieved
To search for data science with R, you will have to run the following line of code:
pages <- searchPages('data science with R', fb_connection)
This will result in a data.frame storing all the pages retrieved, along with the data concerning them.
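Before plotting, you may want to take a quick look at the structure of the result and sort the pages by popularity. The following is a minimal sketch; the name and likes columns follow the Rfacebook documentation for searchPages() and may vary across package versions:
# Inspect the retrieved pages and sort them by number of likes.
str(pages)
pages_sorted <- pages[order(pages$likes, decreasing = TRUE), ]
head(pages_sorted[, c("name", "likes")])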
As seen for the twitteR package, we can take a quick look at the like distribution,
leveraging the base R hist() function:
hist(pages$likes)
Refer to the data visualization section for further recipes on data visualization.