R Data Science Essentials - Sample Chapter
Raja B. Koushik
Sharan Kumar Ravindran
Preface
According to an article in Harvard Business Review, a data scientist's job is the best
job of the 21st century. With the massive explosion in the amount of data generated,
and with organizations becoming increasingly data-driven, the requirement for data
science professionals is ever increasing.
R Data Science Essentials will provide a detailed step-by-step guide to cover various
important concepts in data science. It covers concepts such as loading data from
different sources, carrying out fundamental data manipulation techniques, extracting
the hidden patterns in data through exploratory data analysis, and building
complex, predictive, and forecasting models. Finally, you will learn to visualize and
communicate the data analysis to an audience. This book is aimed at beginners and
intermediate users of R, taking them through the most important techniques in data
science that will help them start their data scientist journey.
Chapter 3, Pattern Discovery, focuses on techniques to extract patterns from the raw
data as well as to derive sequential patterns hidden in the data. This chapter will
touch on the evaluation metrics and the tweaking of parameters to adjust the rank
of the association rules. This chapter also discusses the business cases where these
techniques can be used.
Chapter 4, Segmentation Using Clustering, demonstrates how and when to perform
a clustering analysis, how to identify the ideal number of clusters for a dataset,
and how the clustering can be implemented using R. It also focuses on hierarchical
clustering and how it is different from normal clustering. You will also learn about
the visualization of clusters.
Chapter 5, Developing Regression Models, demonstrates why regression models are
used and how logistic regression is different from linear regression. It shows you
how to implement regression models using R and also explores the various methods
used to check the fit accuracy. It touches on the different methodologies that can be
used to improve the accuracy of the model.
Chapter 6, Time Series Forecasting, explains forecasting from the fundamentals, such as converting a normal data frame to time series data, and shows you methods that help uncover hidden patterns in time series data. It will also teach you the implementation of different forecasting algorithms.
Chapter 7, Recommendation Engine, shows you the basic idea behind a recommendation
engine and some of the real-life use cases in the first part of the chapter. In the latter
part of the chapter, the popular collaborative filtering algorithm based on items as
well as users is explained in detail along with its implementation.
Chapter 8, Communicating Data Analysis, covers some of the best ways to communicate the results to the user, such as how to make data visualization better using R packages such as ggplot2 and googleVis, and demonstrates how to stitch the visualizations together into an interactive dashboard using R Shiny.
The other function similar to read.csv is read.csv2. This function is also used to read CSV files, but read.csv2 is mostly used in European countries, where a comma is used as the decimal point and a semicolon as the separator. The data can also be read into R using a few more functions, such as read.table and read.delim. By default, read.delim is used to read tab-delimited files, and the read.table function can be used to read any file by supplying suitable parameters as the input.
All the preceding functions can take multiple parameters that describe the format of the data source. Some of these parameters are as follows:
sep: This defines the separator in the file. By default, the separator is a comma for read.csv, a tab for read.delim, and white space for the read.table function.
Chapter 1
nrows: This specifies the maximum number of rows to read from the file; by default, the entire file is read.
row.names: This specifies the column to be used as row names; the parameter will take the column's position (one represents the first column) as input.
fill: This parameter, when set as TRUE, can read data with unequal row lengths, with blank fields implicitly added.
These are some of the common parameters used along with the functions to read the
data from a file.
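As a small self-contained sketch of the separator and decimal settings (using a temporary file with hypothetical contents rather than a real dataset):

```r
# Write a small semicolon-separated file with comma decimals (European style)
tmp <- tempfile(fileext = ".csv")
writeLines(c("name;score", "sam;3,5", "paul;4,2"), tmp)

# read.csv2 assumes ';' as the separator and ',' as the decimal point
data <- read.csv2(tmp)
print(data$score)  # parsed as numeric: 3.5 4.2

# read.table can read the same file by supplying the parameters explicitly
data2 <- read.table(tmp, sep = ";", dec = ",", header = TRUE)
print(identical(data$score, data2$score))  # TRUE
```

The point of the sketch is that read.csv2 bakes in the European defaults, while read.table reproduces the same behavior only when sep and dec are supplied explicitly.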
We have so far explored reading data from a delimited file. In addition to this,
we can read data in Excel formats as well. This can be achieved using the xlsx or
XLConnect packages. We will see how to use one of these packages in order to read
a worksheet from a workbook:
install.packages("xlsx")
library(xlsx)
mydata <- read.xlsx("DTH AnalysisV1.xlsx", 1)
head(mydata)
In the preceding code, we first installed the xlsx package, which is required to read Excel files. We loaded the package using the library function, then used the read.xlsx function to read the Excel file, and passed an additional parameter, 1, which specifies the sheet to read from the workbook.
We will have a detailed look at accessing the data from the database using the JDBC
method. In order to perform this operation, we need to install the RJDBC and sqldf
packages. The RJDBC package is used to establish a connection with the database and
the sqldf package is used to write the SQL queries:
install.packages("RJDBC")
library(RJDBC)
install.packages("sqldf")
library(sqldf)
We will now learn to establish a connection with the DB. We need to set up a few
things in order to connect with the DB. To use the JDBC connection, we need to
download the driver. The downloaded file will depend on the database to which
we are going to connect, such as SQL Server, Oracle, or PostgreSQL.
In the following case, we will connect to a SQL Server database. The JDBC driver can be downloaded from http://www.microsoft.com/en-in/download/details.aspx?id=11774 in order to provide connectivity. In the following code, we will pass the driver name as well as the location of the JAR file that comes with the download to the JDBC function. The JDBC function creates a new DBI driver that can be used to start the JDBC connection:
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "C:/Users/Downloads/Microsoft SQL Server JDBC Driver 3.0/sqljdbc_3.0/enu/sqljdbc4.jar")
By using the dbConnect function, we establish the actual connection with the
database. We need to pass the location of the database, username, and password to
this function. On the successful execution of the following code, the connection will
be established and we can check the connectivity using the dbGetQuery function:
conn <- dbConnect(drv, "jdbc:sqlserver://localhost;database=SAMPLE_DB",
  "admin", "test")
bin <- dbGetQuery(conn, "select count(*) from sampletable")
In addition to the relational databases, we can also connect and access the
non-relational databases, such as Cassandra, Hadoop, MongoDB, and so on.
We will now see how to access data from a Cassandra database. In order to access
the data from a Cassandra database, we need to install the RCassandra package.
The connection to the database can be made using the RC.connect function. We
need to pass the host IP as well as the port number to establish the connection.
Then finally, we need to specify the username and password as follows to establish
the connection successfully:
library(RCassandra)
conn <- RC.connect(host = "localhost", port = 9160)
RC.login(conn, username = "user", password = "password123")
In Cassandra, the container for the application data is the keyspace, which is similar to a schema in a relational database. We can use the RC.describe.keyspaces function to get an understanding of the data, and then, using the RC.use function, we select the keyspace to be used for all the subsequent operations:
RC.describe.keyspaces(conn)
RC.use(conn, keyspace = "sampleDB", cache.def = TRUE)
We can read the data using the following code and once all the readings are done, we
can close the connection using RC.close:
a <- RC.read.table(conn, c.family = "Users", convert = TRUE,
  na.strings = "NA", as.is = FALSE, dec = ".")
RC.close(conn)
The following resources cover importing data into R from different sources in more detail:
http://www.statmethods.net/input/importingdata.html
http://www.r-bloggers.com/importing-data-into-r-from-different-sources/
While establishing connectivity with a remote system, you could face a few issues related to security, as well as others specific to the R version and package versions. Most likely, the common issues will already have been discussed in forums such as Stack Overflow.
Data types in R
We explored the various ways of reading data into R in the previous section. Now, let's have a look at the various data types supported by R. Before going into the details, we will first explore the variable data types in R.
a <- 10
class(a)
[1] "numeric"
In the preceding code, we actually passed an integer to the a variable, but it is still being saved in the numeric format.
We can now convert this variable defined as a numeric in R into an integer using the
as.integer function:
a <- as.integer(a)
class(a)
[1] "integer"
Having explored the variable data types, now we will move up the hierarchy and
explore these data types: vector, matrix, list, and dataframe.
A vector is a sequence of elements of a basic data type; it could be a sequence of numeric, character, or logical values. A vector can't have a sequence of elements with different data types. The following are examples of numeric, character, and logical vectors:
v1 <- c(12, 34, -21, 34.5, 100) # numeric vector
class(v1)
[1] "numeric"
v2 <- c("sam", "paul", "steve", "mark") # character vector
class(v2)
[1] "character"
v3 <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE) #logical vector
class(v3)
[1] "logical"
Now, let's consider the v1 numeric vector and the v2 character vector, combine these
two, and see the resulting vector:
newV <- c(v1,v2)
class(newV)
[1] "character"
We can see that the resultant vector is a character vector; we will see what happened
to the numeric elements of the first vector. From the following output, we can see
that the numeric elements are now converted into character vectors represented in
double quotes, whereas the numeric vector will be represented without any quotes:
newV
[1] "12"    "34"    "-21"   "34.5"  "100"   "sam"   "paul"  "steve" "mark"
A matrix is a two-dimensional arrangement of elements of the same basic data type. It can be created using the matrix function, optionally with row and column names supplied through dimnames:
rnames <- c("R1", "R2", "R3", "R4", "R5")
cnames <- c("C1", "C2", "C3", "C4", "C5")
matdata <- matrix(1:25, nrow=5, ncol=5, dimnames=list(rnames, cnames))
matdata
   C1 C2 C3 C4 C5
R1  1  6 11 16 21
R2  2  7 12 17 22
R3  3  8 13 18 23
R4  4  9 14 19 24
R5  5 10 15 20 25
A list is a sequence of data elements similar to a vector, but it can hold elements of different data types. We will combine the variables that we created in the vector section. As in the following code, these variables hold numeric, character, and logical vectors. Using the list function, we combine them, but each element still retains its own data type:
l1 <- list(v1, v2, v3)
typeof(l1)
[1] "list"
l1
[[1]]
[1]  12.0  34.0 -21.0  34.5 100.0

[[2]]
[1] "sam"   "paul"  "steve" "mark"

[[3]]
[1]  TRUE FALSE  TRUE FALSE  TRUE FALSE
Factors are categorical variables in R, which means that they take values from a
limited known set. In case of factor variables, R internally stores an equivalent
integer value and maps it to the character string.
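A small sketch of this integer mapping (note that the levels are sorted alphabetically by default):

```r
# A factor stores each value as an integer index into its set of levels
sizes <- factor(c("small", "large", "small", "medium"))
print(levels(sizes))      # "large"  "medium" "small"
print(as.integer(sizes))  # 3 1 3 2  (each value's position in the levels)
```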
A data frame is similar to a matrix, but in a data frame, the columns can hold data elements of different types. The data frame will be the most commonly used data type for most of the analysis, since any dataset can have multiple data points, each of a different type. R comes with a good number of built-in datasets, such as mtcars. When we use sample datasets to cover the various examples in the coming chapters, you will get a better understanding of the data types discussed so far.
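A minimal sketch of a data frame whose columns hold different types (hypothetical values):

```r
# Three columns of different types: character, numeric, and logical
df <- data.frame(name = c("sam", "paul", "steve"),
                 age = c(30, 25, 41),
                 employed = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)
print(class(df))          # "data.frame"
print(sapply(df, class))  # name: character, age: numeric, employed: logical
```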
After reading the dataset, we use the is.na function to identify the presence of NA values in the dataset, and then, using sum, we get the total number of NAs present. In our case, we can see that a large number of rows have NA in them. We can replace the NAs with the mean value, or we can remove these rows.
The following function can be used to replace the NA with the column mean for
all the numeric columns. The numeric columns are identified by the sapply(data,
is.numeric) function. We will check for the cells that have the NA value, then we
identify the mean of these columns using the mean function with the na.rm=TRUE
parameter, where the NA values are excluded while computing the mean function:
for (i in which(sapply(data, is.numeric))) {
  data[is.na(data[, i]), i] <- mean(data[, i], na.rm = TRUE)
}
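The same loop run on a toy data frame (hypothetical values) shows the replacement at work:

```r
# x has one missing value; y is non-numeric and is left untouched
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"), stringsAsFactors = FALSE)
for (i in which(sapply(df, is.numeric))) {
  df[is.na(df[, i]), i] <- mean(df[, i], na.rm = TRUE)
}
print(df$x)  # 1 2 3 -- the NA is replaced by mean(c(1, 3)) = 2
```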
Alternatively, we can also remove all the NA rows from the dataset using the
following code:
newdata <- na.omit(data)
The next major preprocessing activity is to identify outliers and deal with them. We can identify the presence of outliers in R by making use of the outlier function from the outliers package. This function can be used only on numeric columns, so let's consider the preceding dataset, where the NAs were replaced by the mean values, and identify the presence of an outlier using the outlier function. Then, we get the location of all the outliers using the which function and finally, we remove the rows that have outlier values:
install.packages("outliers")
library(outliers)
We identify the outliers in the X2012 column, which can be subsetted using the
data$X2012 command:
outlier_tf = outlier(data$X2012,logical=TRUE)
sum(outlier_tf)
[1] 1
#What were the outliers
find_outlier = which(outlier_tf==TRUE,arr.ind=TRUE)
#Removing the outliers
newdata = data[-find_outlier,]
nrow(newdata)
The column from the preceding dataset that was considered in the outlier example
had only one outlier and hence we can remove this row from the dataset.
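If the outliers package is unavailable, the same flag-locate-remove pattern can be sketched in base R with a simple median-based rule (toy data; the threshold of 3 is an illustrative choice, not part of the original example):

```r
# Toy numeric column with one value far from the rest
df <- data.frame(val = c(10, 12, 11, 13, 250))
# Flag values more than 3 median absolute deviations from the median
outlier_tf <- abs(df$val - median(df$val)) / mad(df$val) > 3
find_outlier <- which(outlier_tf)
newdata <- df[-find_outlier, , drop = FALSE]
print(nrow(newdata))  # 4 -- the row holding 250 is removed
```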
The following kinds of operations can be performed on the data; we will look at each of them in turn:
Arithmetic operations
String operations
Aggregation operations
Arithmetic operations can be performed between numeric vectors. Let's define two vectors and perform the basic operations:
a1 <- c(1:5)
b1 <- c(6:10)
c1 <- b1+a1
[1]  7  9 11 13 15
c1 <- b1-a1
[1] 5 5 5 5 5
c1 <- b1*a1
[1]  6 14 24 36 50
c1 <- b1/a1
[1] 6.000000 3.500000 2.666667 2.250000 2.000000
Apart from the preceding operations, the other arithmetic operations are exponentiation and modulus, which can be performed as follows, respectively:
c1 <- b1^a1
c1 <- b1 %% a1
Note that these aforementioned arithmetic operations can be performed between two
or more numeric vectors of the same length.
We can also perform logical operations. In the following code, we will assign the values 1 to 10 to the vector x and then use a condition to filter the data. The condition returns a logical value for each element; it checks all the values and returns TRUE when the condition is satisfied, or else FALSE is returned.
x <- c(1:10)
x[(x>=8) | (x<=5)]
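Printed out, the filter keeps the values at or below 5 and at or above 8 (a quick check):

```r
x <- c(1:10)
filtered <- x[(x >= 8) | (x <= 5)]
print(filtered)  # 1 2 3 4 5 8 9 10 -- 6 and 7 are excluded
```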
Having seen the various operations on variables, we will also check arithmetic operations on matrix data. In the following code, we define two matrices that are exactly the same and then multiply them element-wise. The resultant matrix is stored in newmat:
rnames <- c("R1", "R2", "R3", "R4", "R5")
cnames <- c("C1", "C2", "C3", "C4", "C5")
matdata1 <- matrix(1:25, nrow=5, ncol=5, dimnames=list(rnames, cnames))
matdata2 <- matrix(1:25, nrow=5, ncol=5, dimnames=list(rnames, cnames))
newmat <- matdata1 * matdata2
newmat
   C1  C2  C3  C4  C5
R1  1  36 121 256 441
R2  4  49 144 289 484
R3  9  64 169 324 529
R4 16  81 196 361 576
R5 25 100 225 400 625
The following code is used to search for a pattern in character variables using the grep function, which searches for matches. In this function, we first pass the string that has to be found, the second parameter holds a vector (in this case, a character vector), and the third parameter says whether the pattern is a plain string or a regular expression. When fixed=TRUE, the pattern is a string, whereas it is a regular expression when set as FALSE:
grep("Shawshank", c("The","Shawshank","Redemption"), fixed=TRUE)
[1] 2
Now, we will see how to replace a character with another. In order to substitute a character with a new one, we use the sub function. In the following code, we replace the space with a comma. We pass three parameters to the following function. The first parameter specifies the string/character that has to be replaced, the second parameter gives the new character/string, and finally, we pass the actual string:
sub("\\s",",","Hello There")
[1] "Hello,There"
We can also split the string into characters. In order to perform this operation, we need
to use the strsplit function. The following code will split the string into characters:
strsplit("Redemption", "")
[[1]]
 [1] "R" "e" "d" "e" "m" "p" "t" "i" "o" "n"
We have a paste function in R that will paste multiple strings or character variables together. It is very useful when constructing a string dynamically. This can be achieved using the following code:
paste("Today is", date())
[1] "Today is Fri Jun 26 01:39:26 2015"
In the preceding function, there is a space between the two strings. We can avoid this using the similar paste0 function, which performs the same operation but joins the strings without any space. This function is very similar to a concatenation operation.
We can convert a string to uppercase or lowercase using the toupper and
tolower functions.
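A short sketch of these helpers side by side:

```r
print(paste("Hello", "There"))   # "Hello There" -- a space is inserted
print(paste0("Hello", "There"))  # "HelloThere"  -- joined without a separator
print(toupper("hello there"))    # "HELLO THERE"
print(tolower("HELLO THERE"))    # "hello there"
```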
Mean
For this exercise, let's consider the mtcars dataset in R. Read the dataset to a variable
and then use the following code to calculate mean for a numeric column:
data <- mtcars
mean(data$mpg)
[1] 20.09062
Median
The median can be obtained using the following code:
med <- median(data$mpg)
paste("Median MPG:", med)
[1] "Median MPG: 19.2"
Sum
The mtcars dataset has details about various cars. Let's find the total horsepower of all the cars in this dataset. We can calculate the sum using the following code:
hp <- sum(data$hp)
paste("Total HP:", hp)
[1] "Total HP: 4694"
Standard deviation
We can calculate the standard deviation using the sd function. Look at the following
code to get the standard deviation:
sd <- sd(data$mpg)
paste("Std Deviation of MPG:", sd)
[1] "Std Deviation of MPG: 6.0269480520891"
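These summary functions can also be applied per group. As a sketch, aggregate computes the mean mpg for each cylinder count in mtcars:

```r
data <- mtcars
# Mean mpg grouped by the cyl column, using the formula interface
res <- aggregate(mpg ~ cyl, data = data, FUN = mean)
print(res)  # 4-cylinder cars average ~26.7 mpg, 8-cylinder cars 15.1 mpg
```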
Control structures in R
We have covered the different operations that are available in R. Now, let's look
at the control structures used in R. Control structures are the key elements of any
programming language.
The control structures commonly used in R are as follows:
if, else: This is used to test a condition and execute code based on the condition
for: This is used to execute a loop a fixed number of times
while: This is used to execute a loop while a condition is true
repeat: This is used to execute a loop indefinitely until it is explicitly terminated
break: This is used to terminate the execution of a loop
next: This is used to skip an iteration of a loop
return: This is used to exit a function
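A minimal sketch combining a few of these constructs (summing the odd numbers from 1 to 5):

```r
total <- 0
for (i in 1:5) {
  if (i %% 2 == 0) {
    next  # skip even numbers
  }
  total <- total + i
}
print(total)  # 1 + 3 + 5 = 9
```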
The break statement can be used to terminate any loop. It is the only way to
terminate a repeat loop.
sum <- 1
repeat {
  sum <- sum + 2
  print(sum)
  if (sum > 11)
    break
}
[1] 3
[1] 5
[1] 7
[1] 9
[1] 11
[1] 13
We can select columns from the dataset by their position. In the following code, the first line selects the first column along with the 5th to 10th columns of mtcars, whereas the second line arrives at the same dataset by dropping the other columns by position. Both the preceding commands will yield the same result:
newdata <- data[c(1, 5:10)]
newdata <- data[c(-2:-4, -11)]
head(newdata)
                   mpg drat    wt  qsec vs am gear
Mazda RX4         21.0 3.90 2.620 16.46  0  1    4
Mazda RX4 Wag     21.0 3.90 2.875 17.02  0  1    4
Datsun 710        22.8 3.85 2.320 18.61  1  1    4
Hornet 4 Drive    21.4 3.08 3.215 19.44  1  0    3
Hornet Sportabout 18.7 3.15 3.440 17.02  0  0    3
Valiant           18.1 2.76 3.460 20.22  1  0    3
We can also arrive at a situation where we need to filter the data based on a condition. While building a model, we may not create a single model for the whole population; instead, we create multiple models based on the behavior present in the population. This can be achieved by subsetting the dataset. In the following code, we will get only the data of cars that have an mpg greater than 25:
newdata <- data[ which(data$mpg > 25), ]
                mpg cyl  disp  hp drat    wt
Fiat 128       32.4   4  78.7  66 4.08 2.200
Honda Civic    30.4   4  75.7  52 4.93 1.615
Toyota Corolla 33.9   4  71.1  65 4.22 1.835
Fiat X1-9      27.3   4  79.0  66 4.08 1.935
Porsche 914-2  26.0   4 120.3  91 4.43 2.140
Lotus Europa   30.4   4  95.1 113 3.77 1.513
We might also need to consider a sample of the dataset. For example, while building a regression or logistic model, we need to have two datasets: one for the training and the other for the testing. In these cases, we need to choose a random sample. This can be done using the following code:
sample <- data[sample(1:nrow(data), 10, replace=FALSE),]
sample
(The output shows 10 randomly chosen rows of mtcars; the sampled rows will differ on each run.)
We considered a random sample of 10 rows from the dataset. Along with these, we
might have to merge two different datasets. Let's see how this can be achieved. We
can combine the data both row-wise as well as column-wise as follows:
sample1 <- data[sample(1:nrow(data), 10, replace=FALSE),]
sample2 <- data[sample(1:nrow(data), 5, replace=FALSE),]
newdata <- rbind(sample1, sample2)
The preceding code is used to combine two datasets that share the same column format; we combine them using the rbind function. Alternatively, if the two datasets have the same number of rows but different columns, then we can combine them using the cbind or merge functions:
newdata1 <- data[c(1,5:7)]
newdata2 <- data[c(8:11)]
newdata <- cbind(newdata1, newdata2)
When we have two different datasets with a common column, then we can use the
merge function to combine them. On using merge, the dataset will be merged based
on the common columns.
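A small sketch of the three combination functions on toy data frames (hypothetical columns):

```r
df1 <- data.frame(id = 1:3, x = c("a", "b", "c"), stringsAsFactors = FALSE)
df2 <- data.frame(id = 4:5, x = c("d", "e"), stringsAsFactors = FALSE)
stacked <- rbind(df1, df2)   # same columns: the rows are stacked (5 rows)

df3 <- data.frame(y = c(10, 20, 30))
side <- cbind(df1, df3)      # same number of rows: columns id, x, y side by side

df4 <- data.frame(id = c(2, 3), z = c(TRUE, FALSE))
joined <- merge(df1, df4)    # joined on the common column id (2 matching rows)
print(joined)
```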
These are the essential concepts necessary to prepare the dataset for the analysis,
which will be discussed in the next few chapters.
Summary
In this chapter, you learned to import and read data from different sources such as
CSV, TXT, XLSX, and relational data sources and the different data types available
in R such as numeric, integer, character, and logical data types. We covered the
basic data preprocessing techniques used to handle outliers, missing data, and
inconsistencies in order to facilitate analysis.
You learned to perform different arithmetic operations that can be performed on the
data using R, such as addition, subtraction, multiplication, division, exponentiation,
and modulus, and also learned the string operations that can be performed on the
data using R, such as subsetting a string, replacing a string, changing the case, and
splitting the string into characters, which helps in data manipulation. Finally, you learned about the different control structures in R, such as if, else, for, while, repeat, break, next, and return, which facilitate conditional and iterative execution.
We also covered bringing data to a usable format for analysis and building a model.
In the next chapter, we will see how to perform exploratory data analysis using
R. It will include a few statistical techniques and also variable analyses, such as
univariate, bivariate, and multivariate analyses.