CH 4 Data Analytics With R and Weak Machine Learning

Notes of chapter number 4 in big data

Uploaded by

Vaibhav Bhor
Copyright
© All Rights Reserved
What is R?

R is a programming language and software environment
for statistical analysis, graphics representation and
reporting.
R was created by Ross Ihaka and Robert Gentleman at
the University of Auckland, New Zealand, and is
currently developed by the R Core Team.
R is freely available under the GNU General Public
License, and pre-compiled binary versions are provided
for various operating systems such as Linux, Windows and
macOS.
The language was named R after the first letter of the
first names of its two authors.
Why R Programming Language?

Features of R
● R is a well-developed, simple and effective
programming language which includes conditionals,
loops, user defined recursive functions and input and
output facilities.
● R has an effective data handling and storage facility.
● R provides a suite of operators for calculations on
arrays, lists, vectors and matrices.
● R provides a large, coherent and integrated
collection of tools for data analysis.
● R provides graphical facilities for data analysis and
display, either directly on screen or printed on
paper.
Applications of R Programming

How to Install RStudio on Windows?

To install RStudio on Windows, follow these steps.
Step 1: First, set up an R environment on your local
machine. You can download R from r-project.org.
Step 2: After downloading R for the Windows platform,
install it by double-clicking the installer.

Step 3: Download RStudio from its official page.
Note: It is free of cost (under AGPL licensing).
Step 4: After downloading, you will get a file named
“RStudio-1.x.xxxx.exe” in your Downloads folder.
Step 5: Double-click the installer and install the
software.
Step 6: Test the RStudio installation.
● Search for RStudio in the Windows search bar on the
Taskbar.
● Start the application.
● Enter the following code in the console.
Input : print('Hello world!')
Output : [1] "Hello world!"
Step 7: If this output appears, your installation is
successful.

How to Print Output in R?

Unlike many other programming languages, R lets you
output a value without using a print function. Typing an
expression at the console prints its result:
Example
"Hello World!"
However, R does have a print() function available if you
want to use it.
Example
print("Hello World!")
Comments:
● Comments are explanatory text in your R program.
They are ignored by the interpreter when the program
runs.
● A single-line comment is written using # at the
beginning of the statement, as follows:
Ex- # My first program in R Programming
● R does not support multi-line comments.
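Because there is no block-comment syntax, a longer note is usually written by starting every line with #. The short sketch below shows this; the variable names radius and area are only illustrative.

```r
# Compute the area of a circle.
# Every line of this note needs its own # marker,
# because R has no block-comment syntax.
radius <- 2
area <- pi * radius^2  # a comment can also follow code on the same line
area
```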

Creating Variables in R
● Variables are containers for storing data values.
● R does not have a command for declaring a variable.
● A variable is created the moment you first assign a
value to it.
● To assign a value to a variable, use the <- sign.
● To output (or print) the variable value, just type the
variable name:
Example
name <- "John"
age <- 40
name # output "John"
age # output 40
print(name) # print the value of the name variable

Concatenate Elements
You can also concatenate, or join, two or more elements,
by using the paste() function.
To combine both text and a variable, R uses comma (,):
Example
text1 <- "R is"
text2 <- "awesome"
paste(text1, text2)
Multiple Variables
R allows you to assign the same value to multiple
variables in one line:
Example
# Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Orange"
# Print variable values
var1
var2
var3
OUTPUT:
[1] "Orange"
[1] "Orange"
[1] "Orange"
Variable Names :
A variable can have a short name (like x and y) or a more
descriptive name (age, carname, total_volume).
Rules for R variables are:
● A variable name must start with a letter or a period (.)
and can be a combination of letters, digits, periods (.)
and underscores (_).
● If it starts with a period (.), it cannot be followed by a
digit.
● A variable name cannot start with a number or
underscore (_).
● Variable names are case-sensitive (age, Age and
AGE are three different variables).
● Reserved words cannot be used as variable names (TRUE,
FALSE, NULL, if...).
# Legal variable names:
myvar <- "John"
my_var <- "John"
myVar <- "John"
MYVAR <- "John"
myvar2 <- "John"
.myvar <- "John"

# Illegal variable names:


2myvar <- "John"
my-var <- "John"
my var <- "John"
_my_var <- "John"
my_v@ar <- "John"
TRUE <- "John"

R Data Types
Basic Data Types :
Basic data types in R can be divided into the following
types:
1)numeric - (10.5, 55, 787)
2)integer - (1L, 55L, 100L, where the letter "L"
declares this as an integer)
3)complex - (9 + 3i, where "i" is the imaginary part)
4)character (string) - ("k", "R is exciting", "FALSE",
"11.5")
5)logical (boolean) - (TRUE or FALSE)
We can use the class() function to check the data type of a
variable:
Example
# numeric
x <- 10.5
class(x)

# integer
x <- 1000L
class(x)

# complex
x <- 9i + 3
class(x)

# character/string
x <- "R is exciting"
class(x)

# logical/boolean
x <- TRUE
class(x)
OUTPUT:
[1] "numeric"
[1] "integer"
[1] "complex"
[1] "character"
[1] "logical"

1)Numbers
There are three number types in R:
a)numeric: A numeric data type is the most common type
in R, and contains any number with or without a decimal,
like: 10.5, 55, 787:
Example
x <- 10.5
y <- 55

# Print values of x and y


x
y

# Print the class name of x and y


class(x)
class(y)
OUTPUT:
[1] 10.5
[1] 55
[1] "numeric"
[1] "numeric"

b)integer: Integers are numeric data without decimals.


This is used when you are certain that you will never
create a variable that should contain decimals.
To create an integer variable, you must use the
letter L after the integer value:
Example
x <- 1000L
y <- 55L

# Print values of x and y


x
y

# Print the class name of x and y


class(x)
class(y)
OUTPUT:
[1] 1000
[1] 55
[1] "integer"
[1] "integer"
c)complex : A complex number is written with an "i" as
the imaginary part:
Example
x <- 3+5i
y <- 5i

# Print values of x and y


x
y

# Print the class name of x and y


class(x)
class(y)
OUTPUT:
[1] 3+5i
[1] 0+5i
[1] "complex"
[1] "complex"
Type Conversion
You can convert from one type to another with the
following functions:
● as.numeric()
● as.integer()
● as.complex()
Example
x <- 1L # integer
y <- 2 # numeric

# convert from integer to numeric:


a <- as.numeric(x)

# convert from numeric to integer:


b <- as.integer(y)

# print values of x and y


x
y

# print the class name of a and b


class(a)
class(b)
OUTPUT:
[1] 1
[1] 2
[1] "numeric"
[1] "integer"
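The example above does not exercise as.complex(), and it helps to see that as.integer() drops the fractional part instead of rounding. The following is a small sketch of both; the variable names p, q and r are illustrative only.

```r
# as.integer() truncates toward zero; it does not round
p <- as.integer(7.9)
p          # 7
class(p)   # "integer"

# as.complex() adds a zero imaginary part to a real number
q <- as.complex(2)
q          # 2+0i
class(q)   # "complex"

# Character strings that look like numbers can be converted too
r <- as.numeric("11.5")
r + 1      # 12.5
```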

R Operators :
Operators are used to perform operations on variables and
values.
Types of Operators:
● Arithmetic operators
● Assignment operators
● Comparison operators
● Logical operators
● Miscellaneous operators
1)R Arithmetic Operators
Arithmetic operators are used with numeric values to
perform common mathematical operations:

Operator   Name                                Example
+          Addition                            x + y
-          Subtraction                         x - y
*          Multiplication                      x * y
/          Division                            x / y     Ex: 10 / 5    O/P [1] 2
^          Exponent                            x ^ y     Ex: 2 ^ 5     O/P [1] 32
%%         Modulus (remainder from division)   x %% y    Ex: 5 %% 2    O/P [1] 1
%/%        Integer division                    x %/% y   Ex: 15 %/% 2  O/P [1] 7
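As a quick check of the table, the sketch below applies every arithmetic operator to the same pair of values. The values 15 and 2 are arbitrary choices for illustration.

```r
x <- 15
y <- 2

x + y     # 17
x - y     # 13
x * y     # 30
x / y     # 7.5
x ^ y     # 225
x %% y    # 1  (remainder)
x %/% y   # 7  (integer division)

# Integer division and modulus fit together:
(x %/% y) * y + (x %% y) == x   # TRUE
```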

R Assignment Operators
Assignment operators are used to assign values to
variables:
Example
my_var <- 3

my_var <<- 3

3 -> my_var

3 ->> my_var

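my_var # print my_var
OUTPUT:
[1] 3
<<- is a global assigner: used inside a function, it assigns
to a variable outside the function (normally in the global
environment) instead of creating a local one.
It is also possible to reverse the direction of the
assignment operator.
x <- 3 is equal to 3 -> x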
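The difference between <- and <<- only shows up inside a function. The sketch below contrasts the two; the names counter, bump_local and bump_global are hypothetical and chosen just for this illustration.

```r
counter <- 0

# <- inside a function creates a local variable; the outer one is untouched
bump_local <- function() {
  counter <- counter + 1
}

# <<- assigns to the existing variable outside the function
bump_global <- function() {
  counter <<- counter + 1
}

bump_local()
counter    # 0
bump_global()
counter    # 1
```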

R Comparison Operators
Comparison operators are used to compare two values:

Operator   Name                       Example
==         Equal                      x == y
!=         Not equal                  x != y
>          Greater than               x > y
<          Less than                  x < y
>=         Greater than or equal to   x >= y
<=         Less than or equal to      x <= y
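Each comparison returns a logical value. The sketch below runs the operators on two arbitrary numbers and then on a vector, where the comparison is applied to each element in turn.

```r
x <- 5
y <- 8

x == y   # FALSE
x != y   # TRUE
x > y    # FALSE
x < y    # TRUE
x >= 5   # TRUE
x <= 4   # FALSE

# Comparisons are vectorized: each element is compared in turn
c(1, 5, 9) > 4   # FALSE TRUE TRUE
```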

R Logical Operators
Logical operators are used to combine conditional
statements:

Operator   Description
&          Element-wise logical AND operator. It returns TRUE
           for each position where both elements are TRUE
&&         Logical AND operator for single values. It returns
           TRUE if both statements are TRUE
|          Element-wise logical OR operator. It returns TRUE
           for each position where at least one element is TRUE
||         Logical OR operator for single values. It returns
           TRUE if at least one of the statements is TRUE
!          Logical NOT. It returns FALSE if the statement is
           TRUE, and TRUE if the statement is FALSE
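The single and double forms differ in what they operate on: & and | work position by position on vectors, while && and || expect single TRUE/FALSE values. A minimal sketch with made-up vectors a and b:

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

a & b    # TRUE FALSE FALSE  (compared position by position)
a | b    # TRUE TRUE FALSE
!a       # FALSE FALSE TRUE

# && and || work on single TRUE/FALSE values, as in an if condition
x <- 7
x > 5 && x < 10   # TRUE
x < 5 || x > 6    # TRUE
```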

R Miscellaneous Operators
Miscellaneous operators are used to manipulate data:

Operator   Description                                   Example
:          Creates a series of numbers in a sequence     x <- 1:10
%in%       Finds out if an element belongs to a vector   x %in% y
%*%        Matrix multiplication                         x <- Matrix1 %*% Matrix2
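The three miscellaneous operators can be tried together. In this sketch the names s, m1 and m2 are illustrative, and diag(2) is used only as a convenient 2 x 2 identity matrix.

```r
# : builds an integer sequence
s <- 1:5
s              # 1 2 3 4 5

# %in% tests membership and returns TRUE or FALSE
3 %in% s       # TRUE
9 %in% s       # FALSE

# %*% is matrix multiplication (not element-wise *)
m1 <- matrix(c(1, 2, 3, 4), nrow = 2)   # columns are (1,2) and (3,4)
m2 <- diag(2)                           # 2 x 2 identity matrix
m1 %*% m2      # same values as m1
```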

Decision Making in R:
1) If Statement:
An "if statement" is written with the if keyword, and it is
used to specify a block of code to be executed if a
condition is TRUE:
Example
a <- 33
b <- 200

if (b > a) {
print("b is greater than a")
}
OUTPUT:
[1] "b is greater than a"
R uses curly brackets { } to define the scope in the code.
2) Else If
The else if keyword is R's way of saying "if the previous
conditions were not true, then try this condition":
Example
a <- 33
b <- 33

if (b > a) {
print("b is greater than a")
} else if (a == b) {
print ("a and b are equal")
}
OUTPUT:
[1] "a and b are equal"
In this example a is equal to b, so the first condition is not
true, but the else if condition is true, so we print to screen
that "a and b are equal".
You can use as many else if statements as you want in R.
3) If Else
The else keyword catches anything which isn't caught by
the preceding conditions:
Example
a <- 200
b <- 33

if (b > a) {
print("b is greater than a")
} else if (a == b) {
print("a and b are equal")
} else {
print("a is greater than b")
}
OUTPUT:
[1] "a is greater than b"
In this example, a is greater than b, so the first condition
is not true, also the else if condition is not true, so we go
to the else condition and print to screen that "a is greater
than b".
4) Nested If Statements
You can also have if statements inside if statements, this
is called nested if statements.
Example
x <- 41

if (x > 10) {
print("Above ten")
if (x > 20) {
print("and also above 20!")
} else {
print("but not above 20.")
}
} else {
print("below 10.")
}
OUTPUT:
[1] "Above ten"
[1] "and also above 20!"
AND
The & symbol (and) is a logical operator, and is used to
combine conditional statements:
Example
Test if a is greater than b, AND if c is greater than a:
a <- 200
b <- 33
c <- 500

if (a > b & c > a){


print("Both conditions are true")
}
OUTPUT:
[1] "Both conditions are true"
OR
The | symbol (or) is a logical operator, and is used to
combine conditional statements:
Example
Test if a is greater than b, or if c is greater than a:
a <- 200
b <- 33
c <- 500

if (a > b | a > c){


print("At least one of the conditions is true")
}
OUTPUT:
[1] "At least one of the conditions is true"
Loops
Loops execute a block of code repeatedly, as long as a
specified condition holds.
Loops are handy because they save time, reduce errors,
and they make code more readable.
R has three loop commands:
● while loops
● for loops
● repeat loops
1) R While Loops
With the while loop we can execute a set of statements as
long as a condition is TRUE:
Example
Print i as long as i is less than 6:
i <- 1
while (i < 6) {
print(i)
i <- i + 1
}
OUTPUT:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
In the example above, the loop will continue to produce
numbers ranging from 1 to 5. The loop will stop at 6
because 6 < 6 is FALSE.
The while loop requires relevant variables to be ready, in
this example we need to define an indexing variable, i,
which we set to 1.
Note: remember to increment i, or else the loop will
continue forever.

Break
With the break statement, we can stop the loop even if the
while condition is TRUE:
Example
Exit the loop if i is equal to 4.
i <- 1
while (i < 6) {
print(i)
i <- i + 1
if (i == 4) {
break
}
}
OUTPUT:
[1] 1
[1] 2
[1] 3
The loop will stop at 3 because we have chosen to finish
the loop by using the break statement when i is equal to 4
(i == 4).
Next
With the next statement, we can skip an iteration without
terminating the loop:
Example
Skip the value of 3:
i <- 0
while (i < 6) {
i <- i + 1
if (i == 3) {
next
}
print(i)
}
OUTPUT:
[1] 1
[1] 2
[1] 4
[1] 5
[1] 6
When the loop passes the value 3, it will skip it and
continue to loop.
If .. Else Combined with a While Loop
To demonstrate a practical example, let us say we play a
game of Yahtzee!
Example
Print "Yahtzee!" If the dice number is 6:
dice <- 1
while (dice <= 6) {
if (dice < 6) {
print("No Yahtzee")
} else {
print("Yahtzee!")
}
dice <- dice + 1
}
OUTPUT:
[1] "No Yahtzee"
[1] "No Yahtzee"
[1] "No Yahtzee"
[1] "No Yahtzee"
[1] "No Yahtzee"
[1] "Yahtzee!"
If the loop passes the values ranging from 1 to 5, it prints
"No Yahtzee". Whenever it passes the value 6, it prints
"Yahtzee!".
2) For Loops
● A for loop is used for iterating over a sequence.
● R’s for loops are particularly flexible in that they are
not limited to integers, or even numbers, in the input.
We can pass character vectors, logical vectors, lists
or expressions.
Syntax
The basic syntax for creating a for loop statement in R is:

for (value in vector) {
statements
}
Example
for (x in 1:10) {
print(x)
}
OUTPUT:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
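The example above loops over integers only. As noted earlier, a for loop can also walk over character vectors or lists; the sketch below shows both, using made-up data named colours and items.

```r
# Iterate over a character vector
colours <- c("red", "green", "blue")   # illustrative data
for (col in colours) {
  print(col)
}

# Iterate over a list whose items have different types
items <- list(1L, "two", TRUE)
for (item in items) {
  print(class(item))
}
```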

If .. Else Combined with a For Loop


To demonstrate a practical example, let us say we play a
game of Yahtzee!
Example
Print "Yahtzee!" If the dice number is 6:
dice <- 1:6

for(x in dice) {
if (x == 6) {
print(paste("The dice number is", x, "Yahtzee!"))
} else {
print(paste("The dice number is", x, "Not Yahtzee"))
}
}
OUTPUT:
[1] "The dice number is 1 Not Yahtzee"
[1] "The dice number is 2 Not Yahtzee"
[1] "The dice number is 3 Not Yahtzee"
[1] "The dice number is 4 Not Yahtzee"
[1] "The dice number is 5 Not Yahtzee"
[1] "The dice number is 6 Yahtzee!"

Nested Loops
You can also have a loop inside of a loop:
Example
Print the adjective of each fruit in a list:
adj <- list("red", "big", "tasty")

fruits <- list("apple", "banana", "cherry")


for (x in adj) {
for (y in fruits) {
print(paste(x, y))
}
}
OUTPUT:
[1] "red apple"
[1] "red banana"
[1] "red cherry"
[1] "big apple"
[1] "big banana"
[1] "big cherry"
[1] "tasty apple"
[1] "tasty banana"
[1] "tasty cherry"
Repeat Loop:
A repeat loop is one of the control statements in R
programming that executes a set of statements in a loop
until the exit condition specified in the loop, evaluates to
TRUE.
Syntax
repeat{
Statements
if(exit_condition){
break }
}
Example:
x <- 1
repeat {
print(x)
x <- x + 1
if (x == 6){
break }
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Switch Statement:
● A switch statement is a selection control mechanism
that allows the value of an expression to change the
control flow of program execution via map and
search.
● The switch statement is used in place of long if
statements which compare a variable with several
integer or character values.
● It is a multi-way branch statement which provides an
easy way to dispatch execution to different parts of
code, based on the value of the expression.
The basic syntax of the switch statement is as follows:
switch(expression, case1, case2, case3....)
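A minimal sketch of the syntax above, assuming two common uses: a numeric expression selects a case by position, and a character expression selects a case by name. The names day_type and describe are hypothetical and used only for illustration.

```r
# Numeric expression: picks the case at that position
day_type <- switch(2, "weekday", "weekend", "holiday")
day_type   # "weekend"

# Character expression: picks the case whose name matches
describe <- function(op, a, b) {
  switch(op,
         add      = a + b,
         subtract = a - b,
         multiply = a * b,
         "unknown operation")   # unnamed last case acts as the default
}

describe("add", 4, 2)        # 6
describe("multiply", 4, 2)   # 8
describe("divide", 4, 2)     # "unknown operation"
```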
next Statement:
● The next statement is used to skip any remaining
statements in the current iteration of a loop and
continue with the next iteration.
● In simple words, a next statement is a statement
which skips the current iteration of a loop without
terminating it.
● When the next statement is encountered, R
skips further evaluation and starts the next iteration
of the loop.
Syntax
next
Example:
x <- 1:5
for (val in x)
{
if (val == 3)
{
next
}
print(val)
}
Output:
[1] 1
[1] 2
[1] 4
[1] 5
Functions:
● A function is a block of code which only runs when it
is called.
● You can pass data, known as parameters, into a
function.
● A function can return data as a result.
Creating a Function
To create a function, use the function keyword:
Example
# create a function with the name my_function
my_function <- function() {
print("Hello World!")
}
Call a Function
● To call a function, use the function name followed by
parentheses, like my_function()
Example
my_function <- function() {
print("Hello World!")
}

my_function() # call the function named my_function


OUTPUT:
[1] "Hello World!"
Arguments
● Information can be passed into functions as
arguments.
● Arguments are specified after the function name,
inside the parentheses. You can add as many
arguments as you want, just separate them with a
comma.
The following example has a function with one argument
(fname). When the function is called, we pass along a first
name, which is used inside the function to print the full
name:
Example
my_function <- function(fname) {
paste(fname, "Griffin")
}

my_function("Peter")
my_function("Lois")
my_function("Stewie")
OUTPUT:
[1] "Peter Griffin"
[1] "Lois Griffin"
[1] "Stewie Griffin"
Default Parameter Value
● The following example shows how to use a default
parameter value.
● If we call the function without an argument, it uses
the default value:
Example
my_function <- function(country = "Norway") {
paste("I am from", country)
}

my_function("Sweden")
my_function("India")
my_function() # will get the default value, which is Norway
my_function("USA")
OUTPUT:
[1] "I am from Sweden"
[1] "I am from India"
[1] "I am from Norway"
[1] "I am from USA"
Return Values
To let a function return a result, use the return() function:
Example
my_function <- function(x) {
return (5 * x)
}

print(my_function(3))
print(my_function(5))
print(my_function(9))
OUTPUT:
[1] 15
[1] 25
[1] 45
Data Structures in R:
1) Vectors
● A vector is simply a list of items that are of the same
type.
● To combine the list of items into a vector, use
the c() function and separate the items by a comma.
● In the example below, we create a vector variable
called fruits, that combines strings:
Example
# Vector of strings
fruits <- c("banana", "apple", "orange")

# Print fruits
fruits
Output:
[1] "banana" "apple" "orange"
In this example, we create a vector that combines
numerical values:
Example
# Vector of numerical values
numbers <- c(1, 2, 3)

# Print numbers
numbers
Output:
[1] 1 2 3
To create a vector with numerical values in a sequence,
use the : operator:
Example
# Vector with numerical values in a sequence
numbers <- 1:10

numbers
Output:
[1] 1 2 3 4 5 6 7 8 9 10
You can also create numerical values with decimals in a
sequence, but note that if the last element does not belong
to the sequence, it is not used:
Example
# Vector with numerical decimals in a sequence
numbers1 <- 1.5:6.5
numbers1

# Vector with numerical decimals in a sequence where the
# last element is not used
numbers2 <- 1.5:6.3
numbers2
Result:
[1] 1.5 2.5 3.5 4.5 5.5 6.5
[1] 1.5 2.5 3.5 4.5 5.5
In the example below, we create a vector of logical
values:
Example
# Vector of logical values
log_values <- c(TRUE, FALSE, TRUE, FALSE)

log_values
Output:
[1] TRUE FALSE TRUE FALSE
2) Lists
● A list in R can contain many different data types
inside it.
● A list is a collection of data which is ordered and
changeable.
● To create a list, use the list() function:
Example
# List of strings
thislist <- list("apple", "banana", "cherry")

# Print the list


thislist
OUTPUT:
[[1]]
[1] "apple"

[[2]]
[1] "banana"

[[3]]
[1] "cherry"
Access Lists
● You can access the list items by referring to their index
number, inside brackets.
● The first item has index 1, the second item has index
2, and so on:
Example
thislist <- list("apple", "banana", "cherry")

thislist[1]
OUTPUT:
[[1]]
[1] "apple"
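Note that the output above is still a list: single brackets return a one-item list. The sketch below, which goes slightly beyond the example, shows how double brackets return the item itself and how an item can be replaced, since lists are changeable. The names sub and item are illustrative.

```r
thislist <- list("apple", "banana", "cherry")

# Single brackets return a list containing the selected item
sub <- thislist[1]
class(sub)      # "list"

# Double brackets return the item itself
item <- thislist[[1]]
item            # "apple"
class(item)     # "character"

# Lists are changeable: assign to a position to replace its item
thislist[[2]] <- "blackcurrant"
thislist[[2]]   # "blackcurrant"
```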
3) Matrices:
● A matrix is a two-dimensional data set with columns
and rows.
● A column is a vertical representation of data, while a
row is a horizontal representation of data.
● A matrix can be created with the matrix() function.
Specify the nrow and ncol parameters to set the
number of rows and columns:
Example
# Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)

# Print the matrix


thismatrix
OUTPUT:
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

Note: Remember the c() function is used to concatenate
items together.
You can also create a matrix with strings:
Example
thismatrix <-
matrix(c("apple", "banana", "cherry", "orange"), nrow
= 2, ncol = 2)
thismatrix
OUTPUT:
[,1] [,2]
[1,] "apple" "cherry"
[2,] "banana" "orange"
Access Matrix Items
You can access the items by using [ ] brackets. The first
number "1" in the bracket specifies the row-position,
while the second number "2" specifies the column-
position:
Example
thismatrix <-
matrix(c("apple", "banana", "cherry", "orange"), nrow
= 2, ncol = 2)

thismatrix[1, 2]
OUTPUT:
[1] "cherry"

4)Arrays
Compared to matrices, arrays can have more than two
dimensions.
We can use the array() function to create an array, and
the dim parameter to specify the dimensions:
Example
# An array with one dimension with values ranging from 1 to 24
thisarray <- c(1:24)
thisarray

# An array with more than one dimension


multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray
OUTPUT:
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24

Example Explained
In the example above we create an array with the values 1
to 24.
How does dim=c(4,3,2) work?
The first and second numbers in the bracket specify the
number of rows and columns.
The last number in the bracket specifies how many
matrices of that size we want.
Note: Arrays can only have one data type.
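The meaning of the three numbers in dim can be checked directly with the dim() function. This sketch rebuilds the same 4 x 3 x 2 array; the name layer2 is illustrative.

```r
multiarray <- array(1:24, dim = c(4, 3, 2))

dim(multiarray)        # 4 3 2  -> 4 rows, 3 columns, 2 matrices
length(multiarray)     # 24     -> 4 * 3 * 2 values in total

# Each matrix in the array is an ordinary 4 x 3 matrix
layer2 <- multiarray[, , 2]
dim(layer2)            # 4 3
layer2[1, 1]           # 13, the first value of the second matrix
```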
Access Array Items
You can access the array elements by referring to the
index position. You can use the [] brackets to access the
desired elements from an array:
Example
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))

multiarray[2, 3, 2]
OUTPUT:
[1] 22
The syntax is as follows: array[row position, column
position, matrix level]
You can also access a whole row or column from a
matrix in an array, by using the c() function:
Example
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))

# Access all the items from the first row from matrix one
multiarray[c(1),,1]

# Access all the items from the first column from matrix one
multiarray[,c(1),1]
OUTPUT:
[1] 1 5 9
[1] 1 2 3 4
A comma (,) before c() means that we want to access the
column.
A comma (,) after c() means that we want to access the
row.
5)Factors
Factors are used to categorize data. Examples of factors
are:
● Demography: Male/Female
● Music: Rock, Pop, Classic, Jazz
● Training: Strength, Stamina
To create a factor, use the factor() function and add a
vector as argument:
Example
# Create a factor
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic",
"Pop", "Jazz", "Rock", "Jazz"))

# Print the factor
music_genre
Result:
[1] Jazz Rock Classic Classic Pop Jazz Rock Jazz
Levels: Classic Jazz Pop Rock
Access Factors
To access the items in a factor, refer to the index number,
using [] brackets:
Example
Access the third item:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic",
"Pop", "Jazz", "Rock", "Jazz"))

music_genre[3]
Result:
[1] Classic
Levels: Classic Jazz Pop Rock
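The "Levels" line in the output lists the distinct categories of the factor. As a small sketch, the base functions levels(), nlevels() and table() can be used to inspect them and to count the observations per category; the name counts is illustrative.

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic",
                        "Pop", "Jazz", "Rock", "Jazz"))

levels(music_genre)    # "Classic" "Jazz" "Pop" "Rock"
nlevels(music_genre)   # 4

# Count how many observations fall in each category
counts <- table(music_genre)
counts[["Jazz"]]       # 3
```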

6) Data Frames
Data frames are data displayed in a table format.
A data frame can hold different types of data.
While the first column can be character, the second and
third can be numeric or logical.
However, all values within a single column should have
the same type.
Use the data.frame() function to create a data frame:
Example
# Create a data frame
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

# Print the data frame


Data_Frame
OUTPUT:
Training Pulse Duration
1 Strength 100 60
2 Stamina 150 30
3 Other 120 45

Data Manipulation in R with Dplyr Package


To manipulate data, R offers an add-on package
called dplyr, which provides many functions for
data manipulation.
To use these functions, first load the dplyr package
with the library(dplyr) line of code (the package must
already be installed). Below is a list of a few data
manipulation functions present in the dplyr package.
Function Name   Description
filter()        Produces a subset of a Data Frame.
distinct()      Removes duplicate rows in a Data Frame.
arrange()       Reorders the rows of a Data Frame.
select()        Produces data in required columns of a Data Frame.
rename()        Renames the variable names.
mutate()        Creates new variables without dropping old ones.
transmute()     Creates new variables by dropping the old.
summarize()     Gives summarized data like Average, Sum, etc.
filter() method
The filter() function is used to produce the subset of the
data that satisfies the condition specified in the filter()
method.
In the condition, we can use conditional operators,
logical operators, NA values, range operators etc. to filter
out data. Syntax of filter() function is given below-
filter(dataframeName, condition)
Example:
In the code below, we use the filter() function to fetch the
data of players who scored more than 100 runs from the
“stats” data frame.
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                    runs=c(100, 200, 408, 19),
                    wickets=c(17, 20, NA, 5))

# fetch players who scored more than 100 runs
filter(stats, runs>100)

Output
player runs wickets
1 B 200 20
2 C 408 NA
distinct() method
The distinct() method removes duplicate rows from a data
frame, either across all columns or based on the specified
columns. The syntax of the distinct() method is given below-
distinct(dataframeName, col1, col2,.., .keep_all=TRUE)
Example:
In this example, we use the distinct() method to
remove the duplicate rows from the data frame and also
to remove duplicates based on a specified column.
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player=c('A', 'B', 'C', 'D', 'A', 'A'),
                    runs=c(100, 200, 408, 19, 56, 100),
                    wickets=c(17, 20, NA, 5, 2, 17))

# remove duplicate rows
distinct(stats)

# remove duplicates based on a column
distinct(stats, player, .keep_all = TRUE)

Output
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
5 A 56 2
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
arrange() method
In R, the arrange() method is used to order the rows
based on a specified column. The syntax of arrange()
method is specified below-
arrange(dataframeName, columnName)
Example:
In the code below, we order the data by runs
from low to high using the arrange() function.
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                    runs=c(100, 200, 408, 19),
                    wickets=c(17, 20, NA, 5))

# order the data by runs
arrange(stats, runs)

Output
player runs wickets
1 D 19 5
2 A 100 17
3 B 200 20
4 C 408 NA
select() method
The select() method is used to extract the required
columns as a table by specifying the required column
names in select() method.
The syntax of select() method is mentioned below-
select(dataframeName, col1,col2,…)
Example:
In the code below, we fetch only the player and wickets
columns using the select() method.
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                    runs=c(100, 200, 408, 19),
                    wickets=c(17, 20, NA, 5))

# fetch required column data
select(stats, player, wickets)

Output
player wickets
1 A 17
2 B 20
3 C NA
4 D 5
rename() method
The rename() function is used to change the column
names. This can be done by the below syntax-
rename(dataframeName, newName=oldName)
Example:
In this example, we change the column name “runs” to
“runs_scored” in the stats data frame.
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                    runs=c(100, 200, 408, 19),
                    wickets=c(17, 20, NA, 5))

# rename the column
rename(stats, runs_scored=runs)

Output
player runs_scored wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
mutate() & transmute() methods
These methods are used to create new variables.
The mutate() function creates new variables without
dropping the old ones but transmute() function drops the
old variables and creates new variables.
The syntax of both methods is mentioned below-
mutate(dataframeName, newVariable=formula)
transmute(dataframeName, newVariable=formula)
Example:
In this example, we create a new column avg using the
mutate() and transmute() methods.
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                    runs=c(100, 200, 408, 19),
                    wickets=c(17, 20, 7, 5))

# add new column avg
mutate(stats, avg=runs/4)

# drop all and create a new column
transmute(stats, avg=runs/4)

Output
player runs wickets avg
1 A 100 17 25.00
2 B 200 20 50.00
3 C 408 7 102.00
4 D 19 5 4.75
avg
1 25.00
2 50.00
3 102.00
4 4.75
Here the mutate() function adds a new column to the
existing data frame without dropping the old ones, whereas
the transmute() function creates the new variable but
drops all the old columns.
summarize() method
Using the summarize method we can summarize the data
in the data frame by using aggregate functions like sum(),
mean(), etc.
The syntax of summarize() method is specified below-
summarize(dataframeName,
aggregate_function(columnName))
Example:
In the below code we presented the summarized data
present in the runs column using summarize() method.
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                    runs=c(100, 200, 408, 19),
                    wickets=c(17, 20, 7, 5))

# summarize method
summarize(stats, sum(runs), mean(runs))

Output
sum(runs) mean(runs)
1 727 181.75
Packages in R Programming:
● Packages in the R programming language are collections
of R functions, compiled code, and sample data.
● These are stored under a directory called “library”
within the R environment.
● By default, R installs a group of packages during
installation. When we start the R console, only these
default packages are loaded.
● Other packages that are already installed need to be
loaded explicitly before an R program can use them.
Install an R Package
There are multiple ways to install an R package. One of
them is:
● Installing Packages From CRAN: To install a
package from CRAN we need the name of the package
and use the following command:
install.packages("package name")
Installing a package from CRAN is the most common
and easiest way, as we only have to use one
command. To install more than one package at a
time, we just have to write them as a character vector in
the first argument of the install.packages() function:
Example:
install.packages(c("vioplot", "MASS"))
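Installing a package does not load it; as noted above, it still has to be loaded explicitly. The sketch below checks whether MASS (one of the packages from the example) is installed and loads it if so. requireNamespace() and search() are base R functions; the variable mass_loaded is hypothetical, and whether MASS is present depends on the machine.

```r
# Check whether the package is installed before trying to load it
if (requireNamespace("MASS", quietly = TRUE)) {
  library(MASS)                 # attach the package for this session
  mass_loaded <- TRUE
} else {
  # install.packages("MASS")   # uncomment to install from CRAN first
  mass_loaded <- FALSE
}

mass_loaded                     # TRUE if MASS is available on this machine

# List the packages currently attached to the session
search()
```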
Data Visualization in R
● Data visualization is the technique used to deliver
insights in data using visual cues such as graphs,
charts, maps, and many others.
● This is useful as it helps in intuitive and easy
understanding of large quantities of data and
thereby in making better decisions regarding it.

Data Visualization in R Programming Language
● The popular data visualization tools that are available
are Tableau, Plotly, R, Google Charts, Infogram, and
Kibana.
● R is a language that is designed for statistical
computing, graphical data analysis, and scientific
research.
● It is usually preferred for data visualization as it
offers flexibility and minimum required coding
through its packages.
R Scatter Plot
● The plot() function is used to plot numbers against
each other.
● A scatter plot displays the relationship between two
numerical variables, drawing one dot for each
observation.
● It needs two vectors of the same length, one for the
x-axis (horizontal) and one for the y-axis (vertical).
Syntax
The basic syntax for creating a scatter plot in R is:
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Following is the description of the parameters used:
● x is the data set whose values are the horizontal
coordinates.
● y is the data set whose values are the vertical
coordinates.
● main is the title of the graph.
● xlab is the label for the horizontal axis.
● ylab is the label for the vertical axis.
● xlim gives the limits of the x values used for plotting.
● ylim gives the limits of the y values used for plotting.
● axes indicates whether both axes should be drawn on
the plot.
Example
x <- c(5, 7, 8, 7, 2, 2, 9, 4, 11, 12, 9, 6)
y <- c(99, 86, 87, 88, 111, 103, 87, 94, 78, 77, 85, 86)

plot(x, y, main = "Observation of Cars", xlab = "Car age", ylab = "Car speed")

Result: a scatter plot of car speed against car age, with one dot per car.
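The xlim, ylim and axes parameters from the syntax above can be sketched with the same car data. The axis ranges chosen here are assumptions picked so that every point stays visible.

```r
x <- c(5, 7, 8, 7, 2, 2, 9, 4, 11, 12, 9, 6)
y <- c(99, 86, 87, 88, 111, 103, 87, 94, 78, 77, 85, 86)

# plot() needs one y value for every x value
stopifnot(length(x) == length(y))

# xlim and ylim fix the visible range on each axis;
# axes = TRUE (the default) draws both axes
plot(x, y,
     main = "Observation of Cars",
     xlab = "Car age", ylab = "Car speed",
     xlim = c(0, 15), ylim = c(70, 120),
     axes = TRUE)
```

Setting axes = FALSE instead would draw the points without either axis.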
Bar Plot

● A bar chart uses rectangular bars to visualize data.
● Bar charts can be displayed horizontally or vertically.
● The height or length of each bar is proportional to
the value it represents.
● Use the barplot() function to draw a vertical bar
chart.
Syntax
The basic syntax to create a bar chart in R is:
barplot(H, xlab, ylab, main, names.arg, col)
Following is the description of the parameters used:
● H is a vector or matrix containing the numeric values
used in the bar chart.
● xlab is the label for the x-axis.
● ylab is the label for the y-axis.
● main is the title of the bar chart.
● names.arg is a vector of names appearing under each
bar.
● col is used to give colors to the bars in the graph.
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)

barplot(y, names.arg = x, col = "red")
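The remaining parameters (main, xlab, ylab) can be added to the same example. The sketch below also uses the value returned by barplot(), the x position of each bar's midpoint, to write each value above its bar. The title, axis labels and ylim are assumptions for illustration.

```r
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)

# barplot() draws the chart and returns the midpoint of each bar
mids <- barplot(y, names.arg = x, col = "red",
                main = "Values by category",
                xlab = "Category", ylab = "Value",
                ylim = c(0, 10))

# Write each value just above its bar
text(mids, y, labels = y, pos = 3)
```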

Horizontal Bars
If you want the bars to be displayed horizontally instead
of vertically, use horiz=TRUE:
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, horiz = TRUE)
Result: the same bar chart with the bars drawn horizontally.

Histogram

● A histogram represents the frequencies of the values
of a variable bucketed into ranges.
● A histogram is similar to a bar chart, but it groups
the values into continuous ranges.
● The height of each bar represents the number of
values that fall in that range.
● R creates a histogram using the hist() function. This
function takes a vector as input and accepts further
parameters that control the plot.
Syntax
The basic syntax for creating a histogram using R is:
hist(v, main, xlab, xlim, ylim, breaks, col, border)
Following is the description of the parameters used:
● v is a vector containing the numeric values used in
the histogram.
● main indicates the title of the chart.
● col is used to set the color of the bars.
● border is used to set the border color of each bar.
● xlab is used to give a description of the x-axis.
● xlim is used to specify the range of values on the
x-axis.
● ylim is used to specify the range of values on the
y-axis.
● breaks controls how the values are grouped: a single
number suggests how many bars to use, and a vector
gives the exact boundaries between bars.
Example:
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60, 59, 58)
# xlim and ylim are wide enough to show every bar in full
hist(v, xlab = "weight", ylab = "Frequency", col = "red", border = "green",
     xlim = c(0, 60), ylim = c(0, 5), breaks = 5)
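When breaks is given as a vector, it sets the exact boundaries of each bar. As a small sketch with the same weights, the boundaries below (0, 20, 40, 60) and the chart title are assumptions chosen for illustration. hist() also returns the boundaries and counts it used, which can be printed to check the grouping.

```r
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60, 59, 58)

# Three bars of width 20 covering the range 0 to 60
h <- hist(v, breaks = seq(0, 60, by = 20),
          main = "Histogram of weight", xlab = "weight",
          col = "red", border = "green")

# Boundaries between bars and the number of values in each bar
print(h$breaks)
print(h$counts)
```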
R - Line Graphs
● A line chart is a graph that connects a series of points
by drawing line segments between them.
● The points are ordered by one of their coordinates,
usually the x-coordinate.
● Line charts are usually used to identify trends in
data.
● The plot() function in R is used to create the line
graph.

Syntax
The basic syntax to create a line chart in R is:
plot(v, type, col, xlab, ylab, main)
Following is the description of the parameters used:
● v is a vector containing the numeric values.
● type takes the value "p" to draw only the points, "l" to
draw only the lines and "o" to draw both points and
lines.
● xlab is the label for the x-axis.
● ylab is the label for the y-axis.
● main is the title of the chart.
● col is used to give colors to both the points and lines.
Example
v <- c(18, 22, 28, 7, 31, 52)
plot(v, type = "o", col = "blue", xlab = "Month", ylab = "Temperature")
Result: a line graph of temperature by month, with both points and lines drawn.
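To compare the three values of type described above, the sketch below draws the same vector three times side by side. Using par(mfrow = ...) to split the plotting area is an assumption of this example, not something the syntax above requires.

```r
v <- c(18, 22, 28, 7, 31, 52)

# Split the plotting area into 1 row and 3 columns, remembering the old layout
old_par <- par(mfrow = c(1, 3))

# "p" draws points only, "l" lines only, "o" both
for (line_type in c("p", "l", "o")) {
  plot(v, type = line_type, col = "blue",
       xlab = "Month", ylab = "Temperature",
       main = paste("type =", line_type))
}

# Restore the previous layout
par(old_par)
```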

Box Plot

● A boxplot shows how the values in a data set are
distributed.
● It divides the data set into four parts using three
quartiles.
● The graph represents the minimum, maximum,
median, first quartile and third quartile of the data set.
● It is also useful for comparing the distribution of data
across data sets by drawing a boxplot for each of
them.
● Boxplots are created in R by using
the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is:
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used:
● x is a vector or a formula.
● data is the data frame.
● notch is a logical value. Set it to TRUE to draw a
notch.
● varwidth is a logical value. Set it to TRUE to draw the
width of each box in proportion to the sample size.
● names are the group labels which will be printed
under each boxplot.
● main is used to give a title to the graph.

Example:
month_name <- c("Jun", "Jul", "Aug", "Sep", "Oct")
# Three readings per month, one month per column
rainfall_data <- matrix(c(30, 35, 25, 20, 14,
                          40, 45, 20, 15, 13,
                          25, 28, 23, 19, 11),
                        nrow = 3, ncol = 5, byrow = TRUE)
boxplot(rainfall_data, main = "Monthly Rainfall Variation",
        names = month_name, xlab = "Month", ylab = "Rainfall", col = "green")
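The example above passes a matrix. The formula and data parameters can be sketched with mtcars, a data set that ships with R and is used here only as an assumption for illustration. The formula mpg ~ cyl draws one box of mpg values per cylinder count, and varwidth = TRUE makes larger groups wider.

```r
# One box per value of cyl, using the built-in mtcars data frame
b <- boxplot(mpg ~ cyl, data = mtcars,
             varwidth = TRUE,
             names = c("4 cyl", "6 cyl", "8 cyl"),
             main = "Mileage by cylinder count",
             xlab = "Cylinders", ylab = "Miles per gallon")

# One column per group: lower whisker, first quartile,
# median, third quartile, upper whisker
print(b$stats)

# Number of observations in each group
print(b$n)
```

boxplot() returns these summary values whether or not they are printed, so they can be reused in later calculations.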

Pie Charts
● A pie chart is a circular graphical view of data.
● Use the pie() function to draw pie charts.
Syntax
The basic syntax for creating a pie chart using R is:
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used:
● x is a vector containing the numeric values used in
the pie chart.
● labels is used to give a description to the slices.
● radius indicates the radius of the circle of the pie
chart (a value between −1 and +1).
● main indicates the title of the chart.
● col indicates the color palette.
● clockwise is a logical value indicating whether the
slices are drawn clockwise or anticlockwise.
To add a legend explaining each slice, use
the legend() function.
Example
# Create a vector of slice values (example values)
x <- c(10, 20, 30, 40)

# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")

# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")

# Display the pie chart with colors
pie(x, labels = mylabel, main = "Pie Chart", col = colors)

# Display the explanation box
legend("bottomright", mylabel, fill = colors)
Result: a pie chart with four colored slices and a legend in the bottom-right corner.
The legend can be positioned as either:
bottomright, bottom, bottomleft, left, topleft, top, topright,
right, center
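As a sketch of the radius and clockwise parameters together with a different legend position, the example below also adds each slice's percentage share to its label. The values in x, the variable names and the radius of 0.8 are assumptions made for illustration.

```r
# Example values, assumed for illustration
x <- c(10, 20, 30, 40)
fruit <- c("Apples", "Bananas", "Cherries", "Dates")
slice_colors <- c("blue", "yellow", "green", "black")

# Share of each slice as a whole-number percentage
pct <- round(100 * x / sum(x))
slice_labels <- paste0(fruit, " (", pct, "%)")

# Draw the slices clockwise with a slightly smaller circle
pie(x, labels = slice_labels, main = "Pie Chart",
    col = slice_colors, clockwise = TRUE, radius = 0.8)

# Place the legend in the top-right corner
legend("topright", fruit, fill = slice_colors)
```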
Advantages of Data Visualization in R:
R has the following advantages over other tools for data
visualization:
● R offers a broad collection of visualization libraries
along with extensive online guidance on their usage.
● R also offers data visualization in the form of 3D
models and multipanel charts.
● Through R, we can easily customize our data
visualization by changing axes, fonts, legends,
annotations, and labels.
Disadvantages of Data Visualization in R:
R also has the following disadvantages:
● R is preferred for data visualization only when it is
done on an individual standalone server.
● Data visualization in R is slow for large amounts of
data compared with other tools.
● Real-time maps and geo-positioning systems use
visualization for traffic monitoring and estimating
travel time.
