Computing With R
A.A. Ayenigba
October 31, 2019
Contents
History and Overview of R
  Advantages of R
  R for Data Science
  What you will learn
  R and RStudio
  R as a calculator
  Comment in R
  Variable assignment
  Variable assignment and data types in R
  Naming Rules for Variables
  Rules for naming variables
  Basic classes of objects
Matrices
  Short group work
  Progressing from vector to matrix
  Naming a matrix
  Other examples
  Matrices selection
  Arithmetic Operation
  Inverse
  Short group work
  Short group work
  System of linear equation
  Eigenvalues and Eigenvectors
  Eigenvalues and Eigenvectors
  Short group work
Dataframe
  Quick, have a look at your dataset
  Using built-in datasets in R
Statistical modelling in R
  Simple linear regression
  Example
  Multiple linear regression
  Example
Figure 1: R programming
R is a dialect of S, a language developed by John Chambers and others at the old Bell
Telephone Laboratories, originally part of AT&T Corp. S was initiated in 1976 as an internal statistical
analysis environment, originally implemented as Fortran libraries.
The R language came into use well after S had been developed. One key limitation of the S language
was that it was only available in a commercial package, S-PLUS. In 1991, R was created by Ross Ihaka
and Robert Gentleman in the Department of Statistics at the University of Auckland. In 1993 the first
announcement of R was made to the public.
R is a programming language and free software environment for statistical computations, data cleaning, data
analysis and graphical representation of data. The R language is widely used among statisticians and data
miners for developing statistical software and data analysis.
Advantages of R
1. Availability:
The R programming language is open source, which makes it highly cost-effective for a project of any size.
Because it is open source, development in R happens at a rapid pace and the community of developers is huge.
All of this, along with a tremendous amount of learning resources, makes R a good first language for
data science. Because many new developers are exploring R, it is also easier and more cost-effective
to recruit or outsource to R developers.
2. Academia:
R is a very popular language in academia. Many researchers and scholars use R for data analysis, and many
popular books and learning resources on statistics use R as well. Since many people study R during their
academic years, there is a large pool of skilled statisticians with a good working knowledge of the
language who can use it when they move to industry. This in turn increases the traction of the language.
3. Data wrangling:
Data wrangling is the process of cleaning messy and complex data sets to enable convenient consumption and
further analysis. It is a very important and time-consuming process in data science. R has an extensive
library of tools for data manipulation and wrangling. Some of the popular packages for data manipulation
in R include:
• dplyr: Created and maintained by Hadley Wickham, dplyr is best known for its data exploration and
transformation capabilities and its highly adaptive chaining syntax.
• data.table: It allows faster manipulation of data sets with minimal coding. It simplifies data
aggregation and drastically reduces compute time.
• readr: It helps in reading various forms of data into R. By not converting characters into factors, it
can read data roughly 10x faster than the base R equivalents.
4. Data visualization:
Data visualization is the representation of data in graphical form. It allows data to be analyzed from
angles that are not clear in unorganized or tabulated data. R has many tools that help with data
visualization, analysis, and representation. The R package ggplot2 has become a standard plotting
package, and ggedit complements it: while ggplot2 is focused on visualizing data, ggedit helps users
bridge the gap between making a plot and getting all of those pesky plot aesthetics precisely correct.
5. Specificity:
R is a language designed especially for statistical analysis and data reconfiguration. The R libraries
focus on one thing: making data analysis easier, more approachable and more detailed. New statistical
methods are often first made available through R libraries. This makes R a strong choice for data analysis
and projection. Members of the R community are very active and supportive, and they have great knowledge of
statistics as well as programming. All of this gives R a special edge for data science projects.
6. Machine learning:
At some point in data science, a programmer may need to train an algorithm and bring in automation and
learning capabilities to make predictions possible. R provides ample tools to train and evaluate an
algorithm and predict future events. Thus, R makes machine learning (a branch of data science) a lot
easier and more approachable. The list of R packages for machine learning is extensive. It includes
mice (to take care of missing values), rpart and party (for recursive partitioning, that is, decision
trees), caret (for classification and regression training), randomForest (for random forests built from
many decision trees) and many more.
R for Data Science
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and
knowledge. The goal of R for Data Science is to help you learn the most important tools in R that will allow
you to do data science. Data science is a huge field, and there’s no way you can master it by reading a single
book.
R and RStudio
R is a statistical programming language for data analysis and visualization, while RStudio is an integrated
development environment (IDE) for R programming. RStudio makes programming in R easier.
Figure 3: RStudio
In this section, you will take your first steps with R. You will learn how to use the console as a calculator and
how to assign variables. You will also get to know the basic data types in R. Let’s get started!
R as a calculator
In its most basic form, R can be used as a simple calculator. Consider the following arithmetic operations:
• Addition: +
• Subtraction: -
• Multiplication: *
• Division: /
• Exponentiation: ^
• Modulo: %%
Calculate 6 + 12
6 + 12
## [1] 18
Calculate 800 − 900
800 - 900
## [1] -100
Calculate 4 × 5
4 * 5
## [1] 20
Calculate $\frac{2018}{2}$
2018 / 2
## [1] 1009
Calculate $2^3$
2^3
## [1] 8
Calculate 20%%3
20 %% 3
## [1] 2
Calculate the square root of 4, $\sqrt{4}$
sqrt(4)
## [1] 2
Calculate $(\sqrt{4})^2$
(sqrt(4))^2
## [1] 4
Comment in R
R uses the # sign to add comments, so that you and others can understand what the R code is about.
Just like Twitter! Comments are not run as R code, so they will not influence your result. For example, a
line such as # 3 + 4 typed at the console is a comment: R ignores everything after the # on that line, so
the code does not run.
# 3+4
Variable assignment
A basic concept in statistical programming is called a variable. A variable allows you to store a value (e.g. 5)
or an object (e.g. a function description) in R. You can then later use this variable’s name to easily access
the value or the object that is stored with this variable.
Example
Store the value 4 in a variable named after your first name
ezekiel <- 4
To see what is stored under your first name, type the name in the console and press the Return key
ezekiel
## [1] 4
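The next expressions use three variables, x, y and z, whose assignments do not appear in the text. The values below are an inference, not necessarily the original lines: they are the ones consistent with the printed results (x + y = 7, z - x - y = 3, x * y = 12 and z^x = 1000).

```r
# Assumed values, inferred from the printed results that follow
x <- 3
y <- 4
z <- 10
```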
x + y
## [1] 7
z - x - y
## [1] 3
x * y
## [1] 12
z^x
## [1] 1000
Example
Create a vector
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In R, you
create a vector with the combine function c(). You place the vector elements, separated by commas, between
the parentheses.
For example
character.vector <- c('Ayenigba', 'Emmanuel', 'Ezekiel', 'Ajayi', 'Ebun')
Notice
Adding a space after each comma in the c() function improves the readability of your code.
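As a minimal sketch of the three kinds of vector mentioned above, the following creates one of each and checks its type with class(). The vector names and values are made up for illustration.

```r
# Hypothetical example vectors, one per basic type
numeric.vector <- c(1.5, 2, 10)
logical.vector <- c(TRUE, FALSE, TRUE)
character.vector <- c("Ayenigba", "Emmanuel")

class(numeric.vector)    # "numeric"
class(logical.vector)    # "logical"
class(character.vector)  # "character"

# A vector holds a single type: mixing types coerces every element
mixed <- c(1, "a", TRUE)
class(mixed)             # "character"
```

Because a vector can hold only one type, R silently converts mixed input to the most general type, here character.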
Naming a vector
As a data analyst, it is important to have a clear view of the data that you are using. Understanding what
each element refers to is essential. You can give names to the elements of a vector with the names()
function.
Create a vector
Example
sales_tax <- c(140000, 200000, 600000, 180000, 170000)
names(sales_tax) <- c(
  "Monday", "Tuesday", "Wednesday",
  "Thursday", "Friday"
)
sales_tax
##    Monday   Tuesday Wednesday  Thursday    Friday
##    140000    200000    600000    180000    170000
Vector selection
To select elements of a vector (and later of matrices and data frames), use square brackets [ ]. Between
the square brackets, you indicate which elements to select.
To select the first element of vector a, you type a[1].
To select the second element of the vector, you type a[2], etc.
Example
a
a[1]
a[2]
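The text does not show how the vector a was created, so the sketch below assumes a small numeric vector purely for illustration. It also shows a few selection forms beyond a single index; the two-element sales_tax vector at the end is a shortened, hypothetical version of the one above.

```r
# Hypothetical vector; the original definition of a is not shown
a <- c(7, 9, 11, 13, 15)

a[1]          # first element: 7
a[2]          # second element: 9
a[c(1, 3)]    # first and third elements: 7 11
a[2:4]        # elements two to four: 9 11 13
a[-1]         # everything except the first element

# Named elements can also be selected by name
sales_tax <- c(Monday = 140000, Tuesday = 200000)
sales_tax["Tuesday"]
```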
To create a sequence from 1 to 16 with an increment of 2, we can use the seq() function, e.g.
seq(1, 16, 2)
seq(1, 20, 0.1)
seq(20, 1, -0.1)
If you know only the start of a sequence, its increment and its length, but not its last element, you can
specify the length instead, e.g.
seq(5, by = 2, length.out = 50)
length(seq(5, by = 2, length.out = 50))
rep(5, 10) # Repeat 5 ten times
rep(1:4, 5) # Repeat the sequence 1 to 4 five times
rep(1:4, each = 3) # Repeat each element of 1 to 4 three times
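To make the behaviour of seq() and rep() concrete, here is a short sketch that stores a sequence and inspects it. The name s is arbitrary; length.out is the full name of the argument used when only the start, step and length are known.

```r
s <- seq(5, by = 2, length.out = 50)
length(s)   # 50
s[1]        # 5
s[50]       # 103, i.e. start + (length - 1) * step = 5 + 49 * 2

rep(1:4, times = 2)  # 1 2 3 4 1 2 3 4
rep(1:4, each = 2)   # 1 1 2 2 3 3 4 4
```

The difference between times and each is the order: times repeats the whole vector, while each repeats every element before moving on to the next.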
Group work
Matrices
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into
a fixed number of rows and columns.
Since we are only working with rows and columns, a matrix is called a two-dimensional array.
You can construct a matrix in R with the matrix() function.
Example
A <- matrix(1:9, nrow = 3, byrow = TRUE)
A
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
• The first argument is the collection of elements that R will arrange into the rows and columns of
the matrix. Here, we use 1:9, which is a shortcut for c(1, 2, ..., 9).
• The argument byrow indicates that the matrix is filled by rows. If we want the matrix to be filled
by columns, we set byrow = FALSE.
• The argument nrow indicates that the matrix should have 3 rows.
##      [,1] [,2]
## [1,]  140  134
## [2,]  160  158
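The code that produced the 2 x 2 matrix printed above is not shown. One way to build a matrix with the same values, going from vectors to a matrix, is sketched here. The two vector names are hypothetical; performance_analysis is the name used in the next section.

```r
# Hypothetical vectors holding the two values printed in each row
year_2016_17 <- c(140, 134)
year_2017_18 <- c(160, 158)

# Combine the vectors and fill the matrix row by row
performance_analysis <- matrix(c(year_2016_17, year_2017_18),
  nrow = 2,
  byrow = TRUE
)
performance_analysis
```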
Naming a matrix
To help you understand what is stored in the performance_analysis matrix, it is good to add names to
the rows and columns. Not only does this help you read the data, but it is also useful for selecting
certain elements from the matrix.
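rownames(performance_analysis) <- c(
  "Fiscal year July-June 2016/17",
  "Fiscal year July-June 2017/18"
)
# Column names, as they appear in the printed output below
colnames(performance_analysis) <- c("Actual", "Target")
performance_analysis
##                               Actual Target
## Fiscal year July-June 2016/17    140    134
## Fiscal year July-June 2017/18    160    158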
Other examples
A <- matrix(c(1, 3, 5, 7, 9, 11, 13, 15, 17),
  ncol = 3,
  byrow = FALSE
)
A
##      [,1] [,2] [,3]
## [1,]    1    7   13
## [2,]    3    9   15
## [3,]    5   11   17
B <- matrix(c(2, 4, 6, 8, 10, 12, 14, 16, 18),
  ncol = 3,
  byrow = FALSE
)
B
Matrices selection
To select elements of a matrix, use square brackets [ , ]. Between the square brackets, you indicate the
row position and then the column position of the element to select.
To select the element in the first row and second column of matrix A, you type A[1, 2].
To select the element in the third row and second column of matrix A, you type A[3, 2], etc.
Example
A
A[1, 2]
A[3, 2]
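The same bracket notation extends to whole rows, whole columns and sub-matrices: leaving the row or column position empty selects everything along that dimension. A short sketch, redefining A as in "Other examples" so that it runs on its own:

```r
A <- matrix(c(1, 3, 5, 7, 9, 11, 13, 15, 17), ncol = 3, byrow = FALSE)

A[1, 2]      # row 1, column 2: 7
A[3, 2]      # row 3, column 2: 11
A[1, ]       # entire first row: 1 7 13
A[, 2]       # entire second column: 7 9 11
A[1:2, 2:3]  # sub-matrix made of rows 1-2 and columns 2-3
```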
Arithmetic Operation
We can perform all the arithmetic operations on matrices
• Addition
C <- A + B
C
## [3,] 156 354 552
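The output row above has lost the code that produced it; its values match the third row of the matrix product A %*% B. The sketch below, which assumes A and B exactly as defined in "Other examples", runs through the usual operations and shows the difference between * and %*%. The name P is arbitrary.

```r
A <- matrix(c(1, 3, 5, 7, 9, 11, 13, 15, 17), ncol = 3, byrow = FALSE)
B <- matrix(c(2, 4, 6, 8, 10, 12, 14, 16, 18), ncol = 3, byrow = FALSE)

A + B         # element-wise addition
B - A         # element-wise subtraction (every entry is 1 here)
A * B         # element-wise product, NOT the matrix product
P <- A %*% B  # matrix product
P[3, ]        # 156 354 552
```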
• Transpose
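$$G = t(A) = \begin{pmatrix} 1 & 7 & 13 \\ 3 & 9 & 15 \\ 5 & 11 & 17 \end{pmatrix}^{T} = \begin{pmatrix} 1 & 3 & 5 \\ 7 & 9 & 11 \\ 13 & 15 & 17 \end{pmatrix}$$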
G <- t(A)
G
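• Determinant
$$G = \det(A) = \begin{vmatrix} 1 & 7 & 13 \\ 3 & 9 & 15 \\ 5 & 11 & 17 \end{vmatrix}$$
The rows of A are linearly dependent, so the exact determinant is 0; R reports a tiny nonzero number
because of floating-point rounding.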
G <- det(A)
G
## [1] 4.263256e-14
Inverse
To compute the inverse, we use solve(), a base function in R.
H <- solve(B)
H
Did you encounter a problem?
Be of good cheer; for I have overcome the world!- Jesus Christ in John 16:33
Inverse function to tackle the problem
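inverse <- function(A) {
  # Treat a determinant close to zero as singular. abs() is needed so that
  # invertible matrices with a negative determinant are not rejected.
  if (abs(det(A)) < 0.01) {
    cat("The given matrix is singular.\nSorry, I can't find the inverse.\n")
  } else {
    solve(A)
  }
}
inverse(A)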
Short group work
Use the function that you wrote to find the inverse of matrix J, where J is:
$$J = \begin{pmatrix} 5 & 1 & 0 \\ 3 & -1 & 2 \\ 4 & 0 & -1 \end{pmatrix}$$
Note
Assign the matrix to J and call inverse(J) in R
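System of linear equation
$$x - y = 3$$
$$2x + 3y = -4$$
Matrices preparation
$$A = \begin{pmatrix} 1 & -1 \\ 2 & 3 \end{pmatrix}, \quad B = \begin{pmatrix} x \\ y \end{pmatrix}, \quad C = \begin{pmatrix} 3 \\ -4 \end{pmatrix}$$
$$B = A^{-1} \times C$$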
Code in R
A <- matrix(c(1, -1, 2, 3), nrow = 2, byrow = TRUE)
A
##      [,1] [,2]
## [1,]    1   -1
## [2,]    2    3
C <- matrix(c(3, -4), nrow = 2, byrow = TRUE)
C
##      [,1]
## [1,]    3
## [2,]   -4
Code in R
B <- solve(A) %*% C
B
##      [,1]
## [1,]    1
## [2,]   -2
x <- B[1, 1]
x
## [1] 1
y <- B[2, 1]
y
## [1] -2
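As an alternative sketch, solve() also accepts the right-hand side as a second argument and then solves the system directly, without forming the inverse first. The matrices are redefined here so the block runs on its own.

```r
A <- matrix(c(1, -1, 2, 3), nrow = 2, byrow = TRUE)
C <- matrix(c(3, -4), nrow = 2)

# solve(A, C) returns the B that satisfies A %*% B = C
B <- solve(A, C)
B

# Check: substituting the solution back should reproduce C
A %*% B
```

For small systems both approaches give the same answer; solving directly is generally preferred numerically because it skips the explicit inverse.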
Eigenvalues and Eigenvectors
$$B = \begin{pmatrix} 1 & -6 \\ 3 & -8 \end{pmatrix}$$
##      [,1] [,2]
## [1,]    1   -6
## [2,]    3   -8
## eigen() decomposition
## $values
## [1] -5 -2
##
## $vectors
##           [,1]      [,2]
## [1,] 0.7071068 0.8944272
## [2,] 0.7071068 0.4472136
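The code behind the output above is not shown. A sketch that should produce the same decomposition, assuming the matrix B given at the start of this example, uses the base function eigen(). The names e and v1 are arbitrary, and the sign of an eigenvector may differ between systems.

```r
B <- matrix(c(1, -6, 3, -8), nrow = 2, byrow = TRUE)
e <- eigen(B)
e$values    # eigenvalues: -5 -2
e$vectors   # one eigenvector per column

# Check the defining property B v = lambda v for the first pair
v1 <- e$vectors[, 1]
B %*% v1           # left-hand side
e$values[1] * v1   # right-hand side, same numbers
```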
Short group work
Consider the following matrix
$$B = \begin{pmatrix} 4 & 5 & -5 \\ 0 & 4 & 1 \\ 0 & 1 & 2 \end{pmatrix}$$
Dataframe
Dataframes are another way to put data in tables! Unlike matrices, dataframes can hold different types of
data.
A dataframe has the variables of a data set as columns and the observations as rows. This will be a familiar
concept for those coming from other software packages such as Excel, SPSS, or Stata.
The function for creating a dataframe is data.frame().
Example
# Make a dataframe with columns named a and b
data.frame(a = 2:4, b = 5:7)
  a b
1 2 5
2 3 6
3 4 7
The numbers 1 2 3 at the left on your console are row labels and are not a column of the dataframe
Each column in a dataframe is a vector!
Example
a <- c(6, 5, 1)
b <- c(1, 1, 3)
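To see that each column really is a vector, the two vectors can be combined into a dataframe and a column pulled back out with $. This is a minimal sketch; the name df is hypothetical.

```r
a <- c(6, 5, 1)
b <- c(1, 1, 3)

# Each vector becomes one column, named after the vector
df <- data.frame(a, b)
df

df$a             # the column a, returned as a vector: 6 5 1
is.vector(df$b)  # TRUE
nrow(df)         # 3 observations
```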
Group work
Create a dataframe and call it data for the following vectors:
# Set the same seed to get the same sample
set.seed(123)
height <- rnorm(n = 100, mean = 135, sd = 12)
weight <- rnorm(n = 100, mean = 55, sd = 9)
Quick, have a look at your dataset
It is often useful to show only part of the entire dataset rather than printing all of it.
1. head(): enables you to show the first observations of a dataframe.
2. tail(): enables you to print out the last observations in your dataset.
Both head() and tail() print a top line called the header, which contains the names of the different
variables in your data set.
Another method that is often used to get a rapid overview of your dataset is the function str().
3. str(): Shows you the structure of your dataset
The structure of a dataframe tells you:
1. The total number of observations
2. The total number of variables
3. A full list of the variable names
4. The first observations
Note
Applying the str() function will often be the first thing that you do when receiving a new dataset or
dataframe. It is a great way to get more insight into your dataset before diving into the real analysis.
Example
Consider the vectors:
height <- rnorm(n = 120, mean = 135, sd = 12)
weight <- rnorm(n = 120, mean = 55, sd = 9)
data <- data.frame(height, weight)
str(data)
head(data, 5)
height weight
161.3857 57.13687
150.7490 65.96298
131.8183 42.95103
141.5183 60.94738
130.0279 50.29379
tail(data, 3)
height weight
118 119.8962 64.76298
119 155.2132 34.97511
120 145.9367 66.12124
Using built-in datasets in R
There are several ways to find the included datasets in R. Using data() will give you a list of the datasets of
all loaded packages.
data()
Example
library(datasets)
str(data)
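As a small sketch of working with the built-in datasets programmatically, the listing returned by data() can be stored and searched. The name available is arbitrary; women is one of the datasets shipped in the datasets package.

```r
# data() returns an object whose results component is a character matrix
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# A built-in dataset can be used by name without any extra loading step
str(women)
```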
Statistical modelling in R
In this section, we will use R for statistical modelling.
Simple linear regression
Example
We shall use the women dataset in R. A description of the women dataset can be seen by using ?women, i.e.
?women
data <- women
head(data)
height weight
58 115
59 117
60 120
61 123
62 126
63 129
str(women)
model <- lm(height ~ weight, data = data)
The function lm() fits the linear model. In the formula, ~ separates the dependent variable (on the left)
from the independent variable (on the right), and the data argument specifies the dataset to use.
To see the results:
model
##
## Call:
## lm(formula = height ~ weight, data = data)
##
## Coefficients:
## (Intercept) weight
## 25.7235 0.2872
From the results, we see that:
$$\text{height} = 25.7235 + 0.2872 \times \text{weight}$$
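The fitted equation can be used for prediction. The sketch below refits the same model on the women dataset and predicts the height for a made-up weight of 150, once with predict() and once by hand from the coefficients. The names new_data, predicted and by_hand are hypothetical.

```r
model <- lm(height ~ weight, data = women)

# Hypothetical new observation: a weight of 150
new_data <- data.frame(weight = 150)
predicted <- predict(model, newdata = new_data)
predicted

# The same value computed by hand: intercept + slope * weight
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["weight"]] * 150
by_hand   # about 68.8
```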
To see the full results, we use the summary() function, i.e.
summary(model)
##
## Call:
## lm(formula = height ~ weight, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83233 -0.26249 0.08314 0.34353 0.49790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.723456 1.043746 24.64 2.68e-12 ***
## weight 0.287249 0.007588 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.44 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
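Individual numbers from this summary can be extracted in code rather than read off the printout. A minimal sketch, refitting the model so that the block is self-contained; the name s is arbitrary.

```r
model <- lm(height ~ weight, data = women)
s <- summary(model)

s$r.squared   # proportion of variance explained, about 0.991
s$sigma       # residual standard error, about 0.44
coef(s)       # the coefficient table as a numeric matrix
coef(s)["weight", "Pr(>|t|)"]   # p-value for the slope
```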
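Multiple linear regression
Example
We shall use the attitude dataset in R. A description of the dataset can be seen by using ?attitude,
i.e.
?attitude
dataset <- attitude
head(dataset)
str(dataset)
# Fit the model reported in the Call line of the output below
model <- lm(rating ~ ., data = dataset)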
The function lm() fits the linear model. In the formula, ~ separates the dependent variable from the
independent variables, and the . after it includes all the remaining variables in the dataset as
independent variables. We pass our dataset through the data argument.
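To illustrate what the . shorthand does, the sketch below fits the model twice on the attitude dataset, once with . and once with every predictor written out, and checks that the coefficients agree. The names full and explicit are hypothetical.

```r
# Shorthand: every column except rating is used as a predictor
full <- lm(rating ~ ., data = attitude)

# The same model with the predictors listed explicitly
explicit <- lm(
  rating ~ complaints + privileges + learning + raises + critical + advance,
  data = attitude
)

all.equal(coef(full), coef(explicit))   # TRUE
```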
To see the results:
model
##
## Call:
## lm(formula = rating ~ ., data = dataset)
##
## Coefficients:
## (Intercept) complaints privileges learning raises
## 10.78708 0.61319 -0.07305 0.32033 0.08173
## critical advance
## 0.03838 -0.21706
From the results, we see that:
$$\text{rating} = 10.78707639 + 0.61318761\,\text{complaints} - 0.07305014\,\text{privileges} + 0.32033212\,\text{learning} + 0.08173213\,\text{raises} + 0.03838145\,\text{critical} - 0.21705668\,\text{advance}$$
To see the full results, we use the summary() function, i.e.
summary(model)
##
## Call:
## lm(formula = rating ~ ., data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9418 -4.3555 0.3158 5.5425 11.5990
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.78708 11.58926 0.931 0.361634
## complaints 0.61319 0.16098 3.809 0.000903 ***
## privileges -0.07305 0.13572 -0.538 0.595594
## learning 0.32033 0.16852 1.901 0.069925 .
## raises 0.08173 0.22148 0.369 0.715480
## critical 0.03838 0.14700 0.261 0.796334
## advance -0.21706 0.17821 -1.218 0.235577
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.068 on 23 degrees of freedom
## Multiple R-squared: 0.7326, Adjusted R-squared: 0.6628
## F-statistic: 10.5 on 6 and 23 DF, p-value: 1.24e-05