CH 4 Data Analytics With R and Weak Machine Learning
Features of R
● R is a well-developed, simple and effective
programming language which includes conditionals,
loops, user-defined recursive functions and input and
output facilities.
● R has an effective data handling and storage facility.
● R provides a suite of operators for calculations on
arrays, lists, vectors and matrices.
● R provides a large, coherent and integrated
collection of tools for data analysis.
● R provides graphical facilities for data analysis and
display, either directly on the screen or printed on
paper.
Applications of R Programming
Taskbar.
● Start the application.
● Insert the following code in the console.
Creating Variables in R
Variables are containers for storing data values.
R does not have a command for declaring a variable.
A variable is created the moment you first assign a
value to it.
To assign a value to a variable, use the <- sign.
To output (or print) the variable value, just type the
variable name:
Example
name <- "John"
age <- 40
name # output "John"
age # output 40
print(name) # print the value of the name variable
Concatenate Elements
You can also concatenate, or join, two or more elements,
by using the paste() function.
To combine text and a variable, pass them to paste() separated by a comma (,):
Example
text1 <- "R is"
text2 <- "awesome"
paste(text1, text2)
Multiple Variables
R allows you to assign the same value to multiple
variables in one line:
Example
# Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Orange"
# Print variable values
var1
var2
var3
OUTPUT:
[1] "Orange"
[1] "Orange"
[1] "Orange"
Variable Names :
A variable can have a short name (like x and y) or a more
descriptive name (age, carname, total_volume).
Rules for R variables are:
● A variable name must start with a letter or a period (.)
and can be a combination of letters, digits, period (.)
and underscore (_).
● If it starts with a period (.), it cannot be followed by a
digit.
● A variable name cannot start with a number or
underscore (_).
● Variable names are case-sensitive (age, Age and
AGE are three different variables)
● Reserved words cannot be used as variables (TRUE,
FALSE, NULL, if...)
# Legal variable names:
myvar <- "John"
my_var <- "John"
myVar <- "John"
MYVAR <- "John"
myvar2 <- "John"
.myvar <- "John"
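The naming rules above can be checked from the console. This is a minimal sketch that relies on base R's make.names(), which returns its input unchanged only when the input is already a syntactically valid name. The helper is_valid_name is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: TRUE when a string is already a valid R variable name
is_valid_name <- function(name) {
  make.names(name) == name
}

is_valid_name("myvar2")   # TRUE
is_valid_name(".myvar")   # TRUE
is_valid_name("2myvar")   # FALSE: starts with a digit
is_valid_name("_my_var")  # FALSE: starts with an underscore
is_valid_name("my var")   # FALSE: contains a space
is_valid_name("TRUE")     # FALSE: reserved word
```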
R Data Types
Basic Data Types :
Basic data types in R can be divided into the following
types:
1)numeric - (10.5, 55, 787)
2)integer - (1L, 55L, 100L, where the letter "L"
declares this as an integer)
3)complex - (9 + 3i, where "i" is the imaginary part)
4)character (string) - ("k", "R is exciting", "FALSE",
"11.5")
5)logical (boolean) - (TRUE or FALSE)
We can use the class() function to check the data type of a
variable:
Example
# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
OUTPUT:
[1] "numeric"
[1] "integer"
[1] "complex"
[1] "character"
[1] "logical"
1)Numbers
There are three number types in R:
a)numeric: A numeric data type is the most common type
in R, and contains any number with or without a decimal,
like: 10.5, 55, 787:
Example
x <- 10.5
y <- 55
R Operators :
Operators are used to perform operations on variables and
values.
Types of Operators:
● Arithmetic operators
● Assignment operators
● Comparison operators
● Logical operators
● Miscellaneous operators
1)R Arithmetic Operators
Arithmetic operators are used with numeric values to
perform common mathematical operations:
- Subtraction x - y
* Multiplication x * y
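A minimal sketch of base R's arithmetic operators, including those not listed above, applied to two arbitrary sample values x and y:

```r
x <- 10
y <- 3

x + y    # addition: 13
x - y    # subtraction: 7
x * y    # multiplication: 30
x / y    # division: 3.333333
x ^ y    # exponent: 1000
x %% y   # modulus (remainder): 1
x %/% y  # integer division: 3
```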
R Assignment Operators
Assignment operators are used to assign values to
variables:
Example
my_var <- 3
my_var <<- 3
3 -> my_var
3 ->> my_var
R Comparison Operators
Comparison operators are used to compare two values:
== Equal x == y
!= Not equal x != y
R Logical Operators
Logical operators are used to combine conditional
statements.
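A minimal sketch of the logical operators in base R: &, | and !, plus the single-value forms && and ||. The variable names x, y and z are arbitrary.

```r
x <- 200
y <- 33
z <- 500

x > y & z > x    # TRUE: both comparisons are TRUE
x > y | x > z    # TRUE: at least one comparison is TRUE
!(x > y)         # FALSE: negates TRUE

# & and | work element by element on vectors
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)   # TRUE FALSE FALSE

# && and || compare single values and are the usual choice inside if ()
if (x > y && z > x) {
  print("Both conditions are true")
}
```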
R Miscellaneous Operators
Miscellaneous operators are used to manipulate data:
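As a minimal sketch, three operators that are commonly grouped under this heading in base R are : (sequence), %in% (membership) and %*% (matrix multiplication). The sample values are arbitrary.

```r
# : creates a sequence of whole numbers
x <- 1:5
x                 # 1 2 3 4 5

# %in% tests whether each value on the left occurs in the vector on the right
3 %in% x          # TRUE
c(2, 9) %in% x    # TRUE FALSE

# %*% is matrix multiplication
m <- matrix(1:4, nrow = 2)   # filled column by column: rows are (1, 3) and (2, 4)
m %*% m
```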
Decision Making in R:
1) If Statement:
An "if statement" is written with the if keyword, and it is
used to specify a block of code to be executed if a
condition is TRUE:
Example
a <- 33
b <- 200
if (b > a) {
print("b is greater than a")
}
OUTPUT:
[1] "b is greater than a"
R uses curly brackets { } to define the scope in the code.
2) Else If
The else if keyword is R's way of saying "if the previous
conditions were not true, then try this condition":
Example
a <- 33
b <- 33
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print ("a and b are equal")
}
OUTPUT:
[1] "a and b are equal"
In this example a is equal to b, so the first condition is not
true, but the else if condition is true, so we print to screen
that "a and b are equal".
You can use as many else if statements as you want in R.
3) If Else
The else keyword catches anything which isn't caught by
the preceding conditions:
Example
a <- 200
b <- 33
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print("a and b are equal")
} else {
print("a is greater than b")
}
OUTPUT:
[1] "a is greater than b"
In this example, a is greater than b, so the first condition
is not true, also the else if condition is not true, so we go
to the else condition and print to screen that "a is greater
than b".
4) Nested If Statements
You can also have if statements inside if statements; these
are called nested if statements.
Example
x <- 41
if (x > 10) {
print("Above ten")
if (x > 20) {
print("and also above 20!")
} else {
print("but not above 20.")
}
} else {
print("below 10.")
}
OUTPUT:
[1] "Above ten"
[1] "and also above 20!"
AND
The & symbol (and) is a logical operator, and is used to
combine conditional statements:
Example
Test if a is greater than b, AND if c is greater than a:
a <- 200
b <- 33
c <- 500
if (a > b & c > a) {
  print("Both conditions are true!")
}
Break
With the break statement, we can stop the loop even if the
while condition is TRUE:
Example
Exit the loop if i is equal to 4.
i <- 1
while (i < 6) {
print(i)
i <- i + 1
if (i == 4) {
break
}
}
OUTPUT:
[1] 1
[1] 2
[1] 3
The loop will stop at 3 because we have chosen to finish
the loop by using the break statement when i is equal to 4
(i == 4).
Next
With the next statement, we can skip an iteration without
terminating the loop:
Example
Skip the value of 3:
i <- 0
while (i < 6) {
i <- i + 1
if (i == 3) {
next
}
print(i)
}
OUTPUT:
[1] 1
[1] 2
[1] 4
[1] 5
[1] 6
When the loop passes the value 3, it will skip it and
continue to loop.
If .. Else Combined with a While Loop
To demonstrate a practical example, let us say we play a
game of Yahtzee!
Example
Print "Yahtzee!" If the dice number is 6:
dice <- 1
while (dice <= 6) {
if (dice < 6) {
print("No Yahtzee")
} else {
print("Yahtzee!")
}
dice <- dice + 1
}
OUTPUT:
[1] "No Yahtzee"
[1] "No Yahtzee"
[1] "No Yahtzee"
[1] "No Yahtzee"
[1] "No Yahtzee"
[1] "Yahtzee!"
If the loop passes the values ranging from 1 to 5, it prints
"No Yahtzee". Whenever it passes the value 6, it prints
"Yahtzee!".
2) For Loops
A for loop is used for iterating over a sequence:
R’s for loops are particularly flexible in that they are
not limited to integers, or even numbers in the input.
We can pass character vectors, logical vectors, lists
or expressions.
Syntax
The basic syntax for creating a for loop statement in R is:
for (value in vector) {
statements
}
Example
for (x in 1:10) {
print(x)
}
OUTPUT:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
dice <- c(1, 2, 3, 4, 5, 6)
for (x in dice) {
if (x == 6) {
print(paste("The dice number is", x, "Yahtzee!"))
} else {
print(paste("The dice number is", x, "Not Yahtzee"))
}
}
OUTPUT:
[1] "The dice number is 1 Not Yahtzee"
[1] "The dice number is 2 Not Yahtzee"
[1] "The dice number is 3 Not Yahtzee"
[1] "The dice number is 4 Not Yahtzee"
[1] "The dice number is 5 Not Yahtzee"
[1] "The dice number is 6 Yahtzee!"
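Because a for loop accepts any vector or list, the same syntax also works for non-numeric input. A minimal sketch with a character vector and a mixed-type list (the names fruits and items are chosen here for illustration):

```r
# Loop over a character vector
fruits <- c("apple", "banana", "cherry")
for (fruit in fruits) {
  print(fruit)
}

# Loop over a list whose items have different types
items <- list(1L, "two", TRUE)
for (item in items) {
  print(class(item))
}
```

The second loop prints the class of each item in turn: integer, character and logical.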
Nested Loops
You can also have a loop inside of a loop:
Example
Print the adjective of each fruit in a list:
adj <- list("red", "big", "tasty")
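A minimal sketch of how such a nested loop could look, assuming a second list named fruits (an assumption made here; only adj is given above). The inner loop runs through every fruit for each adjective.

```r
adj <- list("red", "big", "tasty")
# Assumed second list, introduced for illustration
fruits <- list("apple", "banana", "cherry")

for (x in adj) {        # outer loop: one pass per adjective
  for (y in fruits) {   # inner loop: every fruit for the current adjective
    print(paste(x, y))
  }
}
```

With three adjectives and three fruits, the body of the inner loop runs nine times.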
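# A function that appends the family name to the given first name
my_function <- function(fname) paste(fname, "Griffin")
my_function("Peter")
my_function("Lois")
my_function("Stewie")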
OUTPUT:
[1] "Peter Griffin"
[1] "Lois Griffin"
[1] "Stewie Griffin"
Default Parameter Value
The following example shows how to use a default
parameter value.
If we call the function without an argument, it uses
the default value:
Example
my_function <- function(country = "Norway") {
paste("I am from", country)
}
my_function("Sweden")
my_function("India")
my_function() # will get the default value, which is Norway
my_function("USA")
OUTPUT:
[1] "I am from Sweden"
[1] "I am from India"
[1] "I am from Norway"
[1] "I am from USA"
Return Values
To let a function return a result, use the return() function:
Example
my_function <- function(x) {
return (5 * x)
}
print(my_function(3))
print(my_function(5))
print(my_function(9))
OUTPUT:
[1] 15
[1] 25
[1] 45
Data Structures in R:
1) Vectors
A vector is simply a list of items that are of the same
type.
To combine the list of items to a vector, use
the c() function and separate the items by a comma.
In the example below, we create a vector variable
called fruits that combines strings:
Example
# Vector of strings
fruits <- c("banana", "apple", "orange")
# Print fruits
fruits
Output:
[1] "banana" "apple" "orange"
In this example, we create a vector that combines
numerical values:
Example
# Vector of numerical values
numbers <- c(1, 2, 3)
# Print numbers
numbers
Output:
[1] 1 2 3
To create a vector with numerical values in a sequence,
use the : operator:
Example
# Vector with numerical values in a sequence
numbers <- 1:10
numbers
Output:
[1] 1 2 3 4 5 6 7 8 9 10
You can also create numerical values with decimals in a
sequence, but note that if the last element does not belong
to the sequence, it is not used:
Example
# Vector with numerical decimals in a sequence
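numbers1 <- 1.5:6.5
numbers1
Output:
[1] 1.5 2.5 3.5 4.5 5.5 6.5
# Vector of logical values
log_values <- c(TRUE, FALSE, TRUE, FALSE)
log_values
Output:
[1] TRUE FALSE TRUE FALSE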
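Returning to decimal sequences: the : operator counts up in steps of 1 from the start value and stops before it would pass the end value. A minimal sketch (numbers2 is a name introduced here for illustration):

```r
# 6.3 cannot be reached from 1.5 in steps of 1, so the sequence stops at 5.5
numbers2 <- 1.5:6.3
numbers2          # 1.5 2.5 3.5 4.5 5.5

# seq() gives control over the step size when a different spacing is needed
seq(1.5, 6.5, by = 0.5)
```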
2) Lists
A list in R can contain many different data types
inside it.
A list is a collection of data which is ordered and
changeable.
To create a list, use the list() function:
Example
# List of strings
thislist <- list("apple", "banana", "cherry")
# Print the list
thislist
OUTPUT:
[[1]]
[1] "apple"
[[2]]
[1] "banana"
[[3]]
[1] "cherry"
Access Lists
You can access the list items by referring to its index
number, inside brackets.
The first item has index 1, the second item has index
2, and so on:
Example
thislist <- list("apple", "banana", "cherry")
thislist[1]
OUTPUT:
[[1]]
[1] "apple"
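Single brackets return a list, which is why the output above still shows [[1]]. A minimal sketch contrasting single and double brackets, and showing that list items can be changed:

```r
thislist <- list("apple", "banana", "cherry")

thislist[1]      # single brackets: a list of length 1 containing "apple"
thislist[[1]]    # double brackets: the item itself, the string "apple"

class(thislist[1])     # "list"
class(thislist[[1]])   # "character"

# Lists are changeable: assign to an index to replace an item
thislist[[2]] <- "blackcurrant"
thislist[[2]]
```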
3) Matrices:
A matrix is a two dimensional data set with columns
and rows.
A column is a vertical representation of data, while a
row is a horizontal representation of data.
A matrix can be created with the matrix() function.
Specify the nrow and ncol parameters to set the
number of rows and columns:
Example
# Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
thismatrix[1, 2]
OUTPUT:
[1] 4
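A minimal sketch of how matrix() arranges the values and how whole rows or columns can be read, using the same numeric matrix:

```r
thismatrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
thismatrix        # values fill the matrix column by column

dim(thismatrix)   # 3 2: three rows, two columns
thismatrix[2, ]   # whole second row: 2 5
thismatrix[, 2]   # whole second column: 4 5 6
```

Leaving the row or column position empty selects everything along that dimension.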
4)Arrays
Compared to matrices, arrays can have more than two
dimensions.
We can use the array() function to create an array, and
the dim parameter to specify the dimensions:
Example
# An array with one dimension with values ranging from 1 to 24
thisarray <- c(1:24)
thisarray
# An array with more than one dimension
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray
Example Explained
In the example above we create an array with the values 1
to 24.
How does dim=c(4,3,2) work?
The first and second numbers in the bracket specify the
number of rows and columns.
The last number in the bracket specifies how many
matrices (layers) of that size we want.
Note: Arrays can only have one data type.
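A minimal sketch of what dim = c(4, 3, 2) produces: two matrices of 4 rows and 3 columns each, filled in order with the 24 values.

```r
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))

dim(multiarray)      # 4 3 2
multiarray[, , 1]    # first 4 x 3 matrix, holding the values 1 to 12
multiarray[, , 2]    # second 4 x 3 matrix, holding the values 13 to 24
```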
Access Array Items
You can access the array elements by referring to the
index position. You can use the [] brackets to access the
desired elements from an array:
Example
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray[2, 3, 2]
OUTPUT:
[1] 22
The syntax is as follows: array[row position, column
position, matrix level]
You can also access the whole row or column from a
matrix in an array, by using the c() function:
Example
thisarray <- c(1:24)
# Access all the items from the first row from matrix one
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray[c(1),,1]
# Access all the items from the first column from matrix one
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray[,c(1),1]
OUTPUT:
[1] 1 5 9
[1] 1 2 3 4
A comma (,) before c() means that we want to access the
column.
A comma (,) after c() means that we want to access the
row.
5)Factors
Factors are used to categorize data. Examples of factors
are:
● Demography: Male/Female
● Music: Rock, Pop, Classic, Jazz
● Training: Strength, Stamina
To create a factor, use the factor() function and add a
vector as argument:
Example
# Create a factor
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
music_genre[3]
Result:
[1] Classic
Levels: Classic Jazz Pop Rock
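The Levels line in the result lists the distinct categories of the factor. A minimal sketch of how to inspect them with base R:

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic",
                        "Pop", "Jazz", "Rock", "Jazz"))

levels(music_genre)    # the distinct categories, sorted alphabetically
nlevels(music_genre)   # 4
table(music_genre)     # how many times each category occurs
```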
6) Data Frames
A data frame displays data in a table format.
A data frame can hold different types of data.
While the first column can be character, the second and
third can be numeric or logical.
However, all values within a single column must have the
same type.
Use the data.frame() function to create a data frame:
Example
# Create a data frame
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
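A minimal sketch that prints the data frame and checks the type of each column. The column classes shown in the comment assume R 4.0 or later, where strings are no longer converted to factors by default.

```r
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

Data_Frame                  # print the table
sapply(Data_Frame, class)   # one type per column: character, numeric, numeric
Data_Frame$Pulse            # a single column, as a vector
```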
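library(dplyr)
# Sample data frame, with the values shown in the outputs of this section
stats <- data.frame(player = c("A", "B", "C", "D"), runs = c(100, 200, 408, 19), wickets = c(17, 20, NA, 5))
filter(stats, runs > 100)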
Output
player runs wickets
1 B 200 20
2 C 408 NA
distinct() method
The distinct() method removes duplicate rows from a data
frame, either across all columns or based on the specified
columns. The syntax of the distinct() method is given below-
distinct(dataframeName, col1, col2,.., .keep_all=TRUE)
Example:
Here in this example, we used distinct() method to
remove the duplicate rows from the data frame and also
remove duplicates based on a specified column.
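# Remove exact duplicate rows (stats here also holds repeated rows)
distinct(stats)
# Remove duplicates based on the player column only
distinct(stats, player, .keep_all = TRUE)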
Output
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
5 A 56 2
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
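A self-contained sketch that produces the two tables above. The six-row data frame is an assumption reconstructed from the outputs: row 6 repeats row 1 exactly, while row 5 repeats only the player. It requires the dplyr package to be installed.

```r
library(dplyr)

# Assumed sample data: row 6 repeats row 1 exactly, row 5 repeats only the player
stats <- data.frame(
  player  = c("A", "B", "C", "D", "A", "A"),
  runs    = c(100, 200, 408, 19, 56, 100),
  wickets = c(17, 20, NA, 5, 2, 17)
)

distinct(stats)                             # drops the exact duplicate: 5 rows remain
distinct(stats, player, .keep_all = TRUE)   # first row per player: 4 rows remain
```

Without .keep_all = TRUE, the second call would return only the player column.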
arrange() method
In R, the arrange() method is used to order the rows
based on a specified column. The syntax of arrange()
method is specified below-
arrange(dataframeName, columnName)
Example:
In the below code we ordered the data based on the runs
from low to high using arrange() function.
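library(dplyr)
# create a data frame
stats <- data.frame(player = c("A", "B", "C", "D"), runs = c(100, 200, 408, 19), wickets = c(17, 20, NA, 5))
# order the rows by runs, from low to high
arrange(stats, runs)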
Output
player runs wickets
1 D 19 5
2 A 100 17
3 B 200 20
4 C 408 NA
select() method
The select() method is used to extract the required
columns as a table by specifying the required column
names in select() method.
The syntax of select() method is mentioned below-
select(dataframeName, col1,col2,…)
Example:
Here in the below code we fetched the player, wickets
column data only using select() method.
library(dplyr)
select(stats, player,wickets)
Output
player wickets
1 A 17
2 B 20
3 C NA
4 D 5
rename() method
The rename() function is used to change the column
names. This can be done by the below syntax-
rename(dataframeName, newName=oldName)
Example:
In this example, we change the column name “runs” to
“runs_scored” in stats data frame.
library(dplyr)
rename(stats, runs_scored=runs)
Output
player runs_scored wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
mutate() & transmute() methods
These methods are used to create new variables.
The mutate() function creates new variables without
dropping the old ones but transmute() function drops the
old variables and creates new variables.
The syntax of both methods is mentioned below-
mutate(dataframeName, newVariable=formula)
transmute(dataframeName, newVariable=formula)
Example:
In this example, we created a new column avg using
mutate() and transmute() methods.
library(dplyr)
mutate(stats, avg=runs/4)
transmute(stats, avg=runs/4)
Output
player runs wickets avg
1 A 100 17 25.00
2 B 200 20 50.00
3 C 408 NA 102.00
4 D 19 5 4.75
avg
1 25.00
2 50.00
3 102.00
4 4.75
Here the mutate() function adds a new column to the
existing data frame without dropping the old ones, whereas
the transmute() function creates the new variable but
drops all the old columns.
summarize() method
Using the summarize method we can summarize the data
in the data frame by using aggregate functions like sum(),
mean(), etc.
The syntax of summarize() method is specified below-
summarize(dataframeName,
aggregate_function(columnName))
Example:
In the code below we summarize the runs column with
sum() and mean() using the summarize() method.
# import dplyr package
library(dplyr)
# summarize method
summarize(stats, sum(runs), mean(runs))
Output
sum(runs) mean(runs)
1 727 181.75
Packages in R Programming:
Packages in R Programming language are a set of R
functions, compiled code, and sample data.
These are stored under a directory called “library”
within the R environment.
By default, R installs a group of packages during
installation. When we start the R console, only these
default packages are available.
Other packages that are already installed must be
loaded explicitly before an R program can use them.
Installing R Packages
There are multiple ways to install an R package. One of
them is:
● Installing packages from CRAN: to install a package
from CRAN we need the name of the package
and use the following command:
install.packages("package name")
● Installing a package from CRAN is the most common
and easiest way, as it takes only one
command. To install more than one package at a
time, pass their names as a character vector in
the first argument of the install.packages() function:
Example:
install.packages(c("vioplot", "MASS"))
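Installing a package does not make it available in the session; it still has to be loaded. A minimal sketch using only base R calls, with the tools package (shipped with R but not attached at startup) as the example and MASS as the package to check for:

```r
.libPaths()        # directories that hold the installed packages
search()           # packages (and environments) currently attached

# tools ships with R but is not attached at startup, so it must be loaded
library(tools)
"package:tools" %in% search()   # TRUE after library()

# Check for an installed package without attaching it
requireNamespace("MASS", quietly = TRUE)   # TRUE if MASS is installed
```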
Data Visualization in R
Data visualization is the technique used to deliver
Bar Plot
Horizontal Bars
If you want the bars to be displayed horizontally instead
of vertically, use horiz=TRUE:
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, horiz = TRUE)
Histogram
xlim is used to specify the range of values on the x-axis.
ylim is used to specify the range of values on the y-axis.
breaks is used to suggest the number of bins, which
determines the width of each bar.
Example:
v=c(12,24,16,38,21,13,55,17,39,10,60,59,58)
# xlim and ylim are set wide enough to show every bar in full
hist(v, xlab = "weight", ylab = "Frequency", col = "red", border = "green",
     xlim = c(0, 60), ylim = c(0, 5), breaks = 5)
R - Line Graphs
A line chart is a graph that connects a series of points
by drawing line segments between them.
These points are ordered by one of their coordinate
values (usually the x-coordinate).
Line charts are usually used in identifying the trends
in data.
The plot() function in R is used to create the line
graph.
Syntax
The basic syntax to create a line chart in R is −
plot(v, type, col, xlab, ylab, main)
Following is the description of the parameters used −
v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to
draw only the lines and "o" to draw both points and
lines.
xlab is the label for x axis.
ylab is the label for y axis.
main is the Title of the chart.
col is used to give colors to both the points and lines.
Example
v=c(18,22,28,7,31,52)
plot(v, type = "o", col = "blue", xlab = "Month", ylab = "Temperature")
Box Plot
Example:
month_name=c("jun","jul","Aug","sep","oct")
rainfall_data=matrix(c(30,35,25,20,14,40,45,20,15,13,25,28,23,19,11),
                     nrow=3,ncol=5,byrow=TRUE)
boxplot(rainfall_data, main="Monthly Rainfall variation", names=month_name,
        xlab="Month", ylab="rainfall", col="green")
Pie Charts
A pie chart is a circular graphical view of data.
Use the pie() function to draw pie charts
Syntax
The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used −
x is a vector containing the numeric values used in
the pie chart.
labels is used to give description to the slices.
radius indicates the radius of the circle of the pie
chart.(value between −1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating if the slices are
drawn clockwise or anti clockwise.
To add a legend that explains each slice, use
the legend() function:
Example
# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
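A minimal sketch of how the labels could be used with pie() and legend(). The values in x and the colors are assumptions chosen for illustration; only mylabel comes from the example above.

```r
# Assumed values, one per label; the numbers are for illustration only
x <- c(10, 20, 30, 40)
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
colors <- c("blue", "yellow", "green", "black")

# Draw the pie chart with a title and one color per slice
pie(x, labels = mylabel, main = "Fruits", col = colors)

# Add a legend that maps each color to its label
legend("bottomright", legend = mylabel, fill = colors)
```

The legend must be added after pie() has drawn the chart, because legend() draws on the current plot.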