Data Analytics With R
Data Analytics with R will enable readers to gain sufficient knowledge and experience to perform
analysis using different analytical tools available in R. Each chapter begins with a number of
important and interesting examples taken from a variety of sectors. The objective is to explain
the concepts and to simultaneously develop in readers an understanding of their application with
real-life examples. This easy-to-understand approach would enable readers to develop the
required skills and apply techniques to solve all types of problems related to R.
Salient Features
• 500+ real-life examples.
• 30+ Case Studies related to different sectors.
• 200+ Objective Type Questions with answers.
• 40+ Practical Exercises with solutions.
• 50+ datasets for different problems.
• Thorough refresher on the Basics of R.
• Examination of Basic and Advanced Visualization Techniques.
• Description of Statistical Techniques in R.
• Detailed explanation and coverage of Machine Learning.
9 788126 576463
Dr. Bharti Motwani
Associate Professor
Balaji Institute of Modern Management
Pune
Data Analytics with R
Copyright © 2019 by Wiley India Pvt. Ltd., 4436/7, Ansari Road, Daryaganj, New Delhi-110002.
Cover Image: © Toria/Shutterstock
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by
any means, electronic, mechanical, photocopying, recording or scanning without the written permission of the publisher.
Limits of Liability: While the publisher and the author have used their best efforts in preparing this book, Wiley and the
author make no representation or warranties with respect to the accuracy or completeness of the contents of this book,
and specifically disclaim any implied warranties of merchantability or fitness for any particular purpose. There are no
warranties which extend beyond the descriptions contained in this paragraph. No warranty may be created or extended by
sales representatives or written sales materials. The accuracy and completeness of the information provided herein and
the opinions stated herein are not guaranteed or warranted to produce any particular results, and the advice and strategies
contained herein may not be suitable for every individual. Neither Wiley India nor the author shall be liable for any loss of
profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Disclaimer: The contents of this book have been checked for accuracy. Since deviations cannot be precluded entirely,
Wiley or its author cannot guarantee full agreement. As the book is intended for educational purpose, Wiley or its author
shall not be responsible for any errors, omissions or damages arising out of the use of the information contained in the
book. This publication is designed to provide accurate and authoritative information with regard to the subject matter
covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services.
Trademarks: All brand names and product names used in this book are trademarks, registered trademarks, or trade
names of their respective holders. Wiley is not associated with any product or vendor mentioned in this book.
Educating effective future leaders is a great responsibility. There is a need to rise above the antiquated approaches
of earlier days and infuse the spirit of participation, the spirit of adaptation and the spirit of
adventure. This will happen best in learning environments which are both serious and focused on the one
hand, but which are also joyous and inspiring, operating on the cutting edge of pedagogy and knowledge.
Having spent more than 21 years in the field of information technology and having published more than
75 research papers, I feel obliged to share my knowledge and experience related to analysis of real-life
situations. This book has evolved from my teaching experience in several technical institutions, providing
consultancies, conducting research methodology workshops and my experience of working in the IT
industry. In order to provide a more meaningful and easier learning experience, this book has been written
with more interesting and relevant examples. Each chapter contains numerous problems of different types
to help readers evaluate themselves.
The international success of research depends on its reputation for high-quality tools used in analysis.
This quality and its international perceptions must continue to thrive under the new arrangements. This
means a renewed commitment to high-quality higher education that is more responsive to choice and which
provides the best possible experience. In this competitive world, there is a need to continue supporting core
strengths in higher education: build on a reputation for excellence and diversity in learning and teaching,
world-leading research and an enviable record of knowledge exchange. The goal of this book is to open
the doors of opportunity related to different analytical techniques from a broader array of datasets. It is an
attempt to provide a reservoir of updated knowledge on varied tools for academicians, consultants, research
scholars, practitioners and students. Readers are encouraged to run the programs to better understand the utility
and effectiveness of each concept.
Readers’ views, observations, constructive criticism and suggestions are welcome at bhartimotwani@hotmail.com.
• Chapter 8 on basic statistics primarily discusses different functions to compute different terms of
descriptive statistics, correlation and covariance, simulation and distributions in R.
• Chapter 9 of this section deals with both parametric and non-parametric techniques for comparing
means. All the different tests in both the techniques are applied on two types of data: user’s data
along with the existing dataset available in R environment. This will help the user to have a better
understanding of the concept and familiarity with the available datasets in R environment.
• Chapter 10 on time series models primarily discusses smoothing and seasonal decomposition for
time series data.
4. Part 4 – Machine Learning: This section depicts the real strength of R in a true manner. It includes
six chapters and starts from basic machine learning algorithms to deep learning algorithms for different
types of data. All the algorithms covered in this section are discussed and applied to existing
datasets available in the R environment or from other reputed sources. The source of each dataset is specified
with complete information. Machine learning comprises both unsupervised and supervised
machine learning algorithms.
• Chapter 11 discusses unsupervised machine learning algorithms: factor analysis and cluster analysis.
• Chapter 12 throws light on the basic supervised machine learning problems: regression and
classification.
• Chapter 13 discusses different machine learning algorithms used for regression and classification
problems like Naïve Bayes, KNN, Support Vector Machines and Decision Tree.
• Chapter 14 discusses different ensemble techniques of machine learning algorithms like Bagging,
Random Forest and Gradient Boosting. These techniques often give better results because they
combine the outputs of several models.
• Chapter 15 focuses only on text data and discusses text mining and sentiment analysis. With the
advent of e-Commerce, a lot of available data are in the form of text and analysis is required for this
new type of data.
• Chapter 16 is related to neural networks (an advanced machine learning technique: deep learning).
This chapter discusses the development of deep learning models for different tensor structures. A
Multilayer Perceptron model for a 2-D tensor (normal data), a Recurrent Neural Network model
for a 3-D tensor (time series data) and a Convolutional Neural Network model for a 4-D tensor (image
data) are developed and the results are analyzed in this chapter.
Instructor Resources
The following resources are available for instructors on request. To register, please log on to
https://www.wileyindia.com/Instructor_Manuals/Register/login.php
1. Chapter-wise PowerPoint Presentations (PPTs)
2. Chapter-wise Solution Manuals
Acknowledgements
Expression of feelings by words loses its significance when it comes to say a few words of gratitude, yet to
express it in some form, however imperfect, is a duty towards those who helped. I offer my special gratitude
to almighty God for His blessings that have made completion of this book possible.
I find myself at a loss for words to express my deep sense of gratitude to my father, Mr. Shrichand
Jagwani, and mother, Mrs. Anita Jagwani, for their affection, continuous support, constant encouragement
and understanding.
My real strength has been the selfless cooperation, solicitous concern and emotional support of my
husband, Mr. Bharat Motwani. No words can convey my gratitude to my children, Pearl and Jahan, who
had to tolerate my preoccupation with this book. Their patience, forbearance, love and support through this
whole process have made this mind-absorbing and time-consuming task possible.
I am grateful to the President of Sri Balaji Society Dr. (Col.) A. Balasubramanian for his guidance and
all the faculty and staff members for providing a conducive environment.
I am also thankful to all those people whose constructive suggestions and work have helped to enhance
the standard of the work directly and/or indirectly and brought the task to fruition.
I am indebted to Wiley Publishers for their sincere efforts, unfailing courtesy and cooperation in
bringing out the book in this elegant form. It has been a real pleasure working with such professional staff.
Contents
Preface
About the Author
PART 1 Basics of R
Chapter 1 Introduction to R
1.1 Features of R
1.2 Installation of R
1.3 Getting Started
1.3.1 Window Sections of RStudio
1.3.2 First Interaction
1.3.3 Command Line versus Scripts
1.3.4 Comments
1.3.5 Help in R
1.3.6 Directory
1.4 Variables in R
1.4.1 Naming Variables
1.4.2 Assigning Values to Variables
1.4.3 Finding Variables
1.4.4 Removing Variables
1.5 Input of Data
1.5.1 Input of Data from Terminal
1.5.2 Input of Data through R-Objects
1.6 Output in R
1.6.1 print() Function
1.6.2 cat() Function
1.7 In-Built Functions in R
1.7.1 Mathematical Functions
1.7.2 Trigonometric Functions
1.7.3 Logarithmic Functions
1.7.4 Date and Time Functions
1.7.5 Sequence Function
Chapter 1 Introduction to R
Chapter 2 Data Types of R
Chapter 3 Programming in R
Chapter 4 Data Exploration and Manipulation
Chapter 5 Import and Export of Data
R is a programming language primarily used for basic and advanced statistical analysis, visualization
of data, machine learning and deep learning on numbers, text and other data. It was initially written
in the early 1990s by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in
Auckland, New Zealand. R can be regarded as an implementation of the S language, which was developed at
Bell Laboratories by Rick Becker, John Chambers and Allan Wilks. R is freely available under the
GNU General Public License, and pre-compiled binary versions are provided for various operating systems
such as Linux, Windows and Mac.
1.1 Features of R
R is an interpreted programming language and software environment for statistical analysis, data visualization
and reporting. The following are important features of R:
1. It allows branching and looping as well as modular programming using functions.
2. It allows integration with different programming languages such as C, C++, .NET and Python.
3. It has an extensive community of contributors; hence it has a rich library of functions and datasets.
4. It has an effective data handling and storage facility for numeric and textual data.
5. It provides a collection of operators for calculations on arrays, lists, factors, vectors, data frames and
matrices.
6. It provides large and integrated collection of tools for data analysis and statistical functions.
7. It provides graphical facilities for data analysis and can show results both in soft and hard copies.
8. It is an integrated suite of software facilities for data manipulation, calculation and graphical facilities for
data analysis and display.
9. It has neither graphical user interfaces nor a spreadsheet view of data, nor is it a database; but it connects
to DBMS and spreadsheets.
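Features 1 and 5 can be illustrated with a minimal sketch. The function name classify_marks and the sample marks are hypothetical and chosen only for illustration; they do not come from the book.

```r
# Hypothetical helper: branching inside a user-defined function
classify_marks <- function(marks) {
  if (marks >= 50) {
    "pass"
  } else {
    "fail"
  }
}

# Looping over a vector of made-up marks
marks <- c(72, 45, 50)
results <- character(length(marks))
for (i in seq_along(marks)) {
  results[i] <- classify_marks(marks[i])
}
print(results)

# Arithmetic operators work element-wise on whole vectors
bonus <- marks + 5
print(bonus)
```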
1.2 Installation of R
R for Windows (32-bit/64-bit), for example version R-3.2.2, can be downloaded and saved in a local directory. On Windows,
the installer (.exe) has a name of the form “R-version-win.exe”. We need to double-click
the installer to run it and accept the default settings. After installation, we have to locate the icon to run the
program in the directory structure “R\R-3.2.2\bin\i386\Rgui.exe” under the Program Files directory.
Clicking this icon opens R-GUI, which is the starting point for R programming (Fig. 1.1). This software allows
the user to enter single-line commands.
Installation of RStudio: RStudio has a better graphical user interface; hence most users prefer RStudio for
programming in R. Execution of commands in R is not menu-driven; results are not obtained by clicking on buttons.
Besides, the user sometimes needs to type multiline commands. When writing multiline
programs, it is more convenient to use a text editor than to type everything directly at the R command line.
RStudio is a separate integrated development environment (IDE) for R, installed in addition to R itself,
that provides such an editor. In other words, it is an interface between R and the user. It is more useful for beginners and makes
coding easier. RStudio is also available in 32- and 64-bit versions. The user can download the version that matches the
Windows operating system. The user needs to click on RStudio from “All Programs” to
see the screen (Fig. 1.2).
window and click on Run, else write print("Hello") in top left window at the command prompt and press
enter to view the result.
1.3.4 Comments
Comments are explanatory text in an R program; they are ignored by the interpreter while executing
the program. They are generally used for user reference. A comment starts with # and continues to the end of the
line. For example, # My first program in R Programming. Unlike some other programming languages, R does not support
multiline comments; every line of a longer comment must start with #.
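A short sketch of how comments are typically written; the variable names are illustrative only.

```r
# My first program in R Programming
x <- 5  # a comment can also follow code on the same line

# R has no block-comment syntax, so a longer note
# is written as several lines, each starting with '#'.
y <- x * 2
print(y)
```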
1.3.5 Help in R
R has an extensive user-friendly facility to provide help with regard to different commands. Some examples
are presented below.
Explanation
The above commands show the use of basic commands of R for help and exiting from R.
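The commands referred to in the explanation are not reproduced here. As a minimal sketch, the following standard help functions are commonly used; the topic "mean" is an arbitrary choice, and the exact commands the author listed may differ.

```r
# Open the documentation for a function (two equivalent forms)
help("mean")
?mean

# List objects whose names match a pattern (a regular expression)
matches <- apropos("^mean$")
print(matches)

# Show the arguments that a function accepts
args(round)

# Exit from R (commented out so that the script keeps running)
# q()
```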
1.3.6 Directory
It is important to determine the directory where a user is creating R program. The getwd() and setwd()
functions are used for getting and setting the working directory, respectively. The user can determine the
existing working directory where he/she is currently working by using the getwd() function and he/she
can change the settings to a new working directory by using the setwd() function.
Explanation
The getwd() function displays the working directory. In this example, it shows that the Documents folder is the current
working directory. The next command sets the working directory to a new directory
“R prog” on the D drive. If we then display the current working directory,
D:\R prog is displayed. However, this result depends on the operating system and the directory where
we are working.
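The sequence described above can be sketched as follows. As an assumption made for portability, a temporary folder from tempdir() stands in for the "R prog" folder on the D drive, so the paths printed will differ from those in the explanation.

```r
old_dir <- getwd()     # remember the current working directory
print(old_dir)

new_dir <- tempdir()   # assumption: a temporary folder replaces "D:/R prog"
setwd(new_dir)         # change the working directory
current <- getwd()
print(current)

setwd(old_dir)         # restore the original working directory
```

On Windows, a path passed to setwd() is written with forward slashes or doubled backslashes, for example setwd("D:/R prog").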
1.4 Variables in R
In programming, a variable is a named piece of computer memory, containing some information inside.
We can think of a variable as a box with a name, where we can store something. Variables can be static
and dynamic. A variable is a value that can change, depending on conditions or on information passed. A
variable provides us with named storage that our programs can manipulate.
Explanation
The above examples demonstrate different ways of assigning values to a variable. All the four ways can be
used to assign a value of 20 to the variable “x”.
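The four commands themselves are not reproduced here. A minimal sketch using four standard assignment forms in R is given below; whether these are exactly the four forms the author used is an assumption.

```r
x <- 20          # leftward assignment, the most common form
x = 20           # assignment with the equals sign
20 -> x          # rightward assignment
assign("x", 20)  # assign() takes the variable name as a string
print(x)
```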
> #List all objects except the hidden and special variables
> ls()
[1] "a" "aa" "Affairs" "air"
......................................................
Explanation
The first function, ls(), lists all objects in the working environment, including the variables created by the user
but excluding hidden and special objects. The second command displays only those variables which have
the string “air” in their names. The third command displays all the variables. Hence, a long list of all the variables
is displayed.
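A sketch of the three kinds of listing, assuming the second and third commands use the pattern and all.names arguments of ls(). The variables created here are hypothetical, so the output will differ from the list shown above.

```r
# Hypothetical variables, created only for this illustration
air <- 1
airline <- 2
fare <- 3
.hidden_value <- 4   # names starting with a dot are hidden

visible_names <- ls()                 # hidden objects are excluded
air_names <- ls(pattern = "air")      # only names containing "air"
all_names <- ls(all.names = TRUE)     # hidden objects are included

print(visible_names)
print(air_names)
print(all_names)
```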
Explanation
The first command removes the variable “new3”; hence, if the second command is executed to display the
value of “new3”, an error is generated stating that this object is not found. Since the ls() function returns the names of all
the variables, all of them are removed if we use the rm() and ls() functions together, as in rm(list = ls()).
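A minimal sketch of both uses of rm(). To avoid wiping the reader's workspace, the second part works on a hypothetical scratch environment; at the console, rm(list = ls()) does the same for the whole workspace.

```r
new3 <- 10
rm(new3)                                  # remove a single variable
print(exists("new3", inherits = FALSE))   # the variable is gone

# Hypothetical scratch environment holding two variables
scratch <- new.env()
assign("a", 1, envir = scratch)
assign("b", 2, envir = scratch)

# ls() supplies the names, rm() removes every one of them
rm(list = ls(envir = scratch), envir = scratch)
remaining <- ls(envir = scratch)
print(length(remaining))
```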
Explanation
The first command scans five numeric values and stores them in the variable “x”; input ends when the user presses return twice. When
the numeric vector “x” is printed, all five values scanned at runtime are printed. The scan(what = "", sep = "\n")
function reads character data, because what = "", and treats each line as one item, because the new
line character (\n) is used as the separator.
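The behaviour described above can be sketched without keyboard input by passing the values through the text argument of scan(). The sample values are made up for illustration.

```r
# At the console, scan() with no arguments waits for typed input:
# x <- scan()
# Here the text argument supplies the input so the sketch runs unattended.
x <- scan(text = "12 15 18 21 24")
print(x)

# Character input, one item per line
text_lines <- scan(text = "Data Analytics\nwith R", what = "", sep = "\n")
print(text_lines)
```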
1.6 Output in R
A programming language takes in raw information (or data) at one end, stores it until it is ready to work
on, works on it and then shows the results at the other end. All these processes have a name. Taking in information
is called input, storing information is better known as memory (or storage), work is also known
as processing, and showing the results is called output. Output of data is important after processing any
data. Display of output in R can be done using functions such as print, cat and paste. The functions
are chosen depending upon user requirements and according to their utility.
Explanation
In the first example, print() prints the statement. The print() function has a significant limitation:
it prints only one object at a time. If we pass it multiple items, it gives an error message.
The only way to print multiple items is to print one at a time. Hence, an error is generated when the
user tries to print the two strings “hello” and “welcome” with a single call. However, the next
commands execute properly: the print statement is used three times, since we want to print three strings.
The last section shows the mechanism to print a variable.
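A sketch of the behaviour described above. The strings and the value of x are illustrative, and the failing call is wrapped in tryCatch() (an addition for this sketch) so that the script continues after the error.

```r
print("hello")

# Passing two strings to print() fails; the handler shows the message
tryCatch(
  print("hello", "welcome"),
  error = function(e) cat("Error:", conditionMessage(e), "\n")
)

# Printing several items means one call per item
print("hello")
print("welcome")
print("to R")

# Printing a variable
x <- 20
print(x)
```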
Printing with the print() function is very cumbersome when dealing with a lot of data.
Besides, sometimes we need an appropriate display of strings to present the results of specific operations in the required
format. paste() and cat() are some important functions related to concatenation of strings and
customized display of strings.
Explanation
R provides an additional facility using cat() function to join two strings. In the first example, we have
concatenated two strings. In the next statement, we have used two strings: “Good Morning” and string
stored in a variable “hellostring”. The cat() function concatenates the two strings and displays the result.
In the next example, cat() is used to join one string “Today is” and one variable “tdate” containing a
string. The next example joins two strings and two variables. In the output, each variable is replaced by its
value, while each string is displayed as it is. The last example uses the square root sqrt() function directly
inside the cat() function.
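The kinds of calls described above can be sketched as follows. The variable names hellostring and tdate come from the explanation, while the values assigned to them and the literal strings are assumptions.

```r
# Joining two strings
cat("Hello", "World", "\n")

# Joining a string and a variable holding a string
hellostring <- "Hello, reader"
cat("Good Morning", hellostring, "\n")

# Joining a string and a variable holding a formatted date
tdate <- format(Sys.Date(), "%d %B %Y")
cat("Today is", tdate, "\n")

# A function call can be used directly inside cat()
cat("Square root of 16 is", sqrt(16), "\n")

# paste() returns the joined string instead of displaying it
joined <- paste("hello", "welcome")
print(joined)
```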
Explanation
The above examples show the usage of different mathematical functions such as round, square root, floor
and factorial. The basic mathematical operations follow a general BODMAS rule for calculation. The power
function (^) evaluates the value of a number raised to the power. The sqrt() function calculates the square
root of a number. However, square root of a negative number results in not a number (NaN). The function
abs() returns the absolute value of a number and factorial() function returns the factorial value of a
number. The round() function rounds a number to the nearest integer, or to a given number of decimal places if the digits argument is supplied. For example, 25.6 will be
rounded off to 26 while 25.4 will be rounded off to 25. A value that lies exactly halfway is rounded to the even digit, so round(2.5) gives 2. The floor() function is used to round off a number
to the previous integer, while the ceiling() function is used to round it off to the next integer. For example,
35.7 is a number which lies between 35 and 36. Hence, floor of 35.7 will be 35 and ceiling will be 36.
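The functions discussed above can be tried with a short sketch. The input values are chosen for illustration, apart from 25.6, 25.4 and 35.7, which are taken from the explanation.

```r
precedence <- 2 + 3 * 4        # multiplication before addition: 14
power <- 2^5                   # 32
root <- sqrt(49)               # 7
absolute <- abs(-7.5)          # 7.5
fact <- factorial(5)           # 120

rounded_up <- round(25.6)      # 26
rounded_down <- round(25.4)    # 25
halfway <- round(2.5)          # 2, halfway values go to the even digit

lower <- floor(35.7)           # 35
upper <- ceiling(35.7)         # 36

# Square root of a negative number gives NaN (with a warning)
negative_root <- suppressWarnings(sqrt(-4))

print(c(precedence, power, root, absolute, fact))
print(c(rounded_up, rounded_down, halfway, lower, upper))
print(negative_root)
```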
Variables in Functions: We can also assign a value to a variable and then apply a mathematical operator or
function to the variable directly.