Data Analytics (20APE0514)
Unit 1: An overview of R, Vectors, factors, univariate time series, Data frames, matrices, Functions, operators, loops,
Graphics, Revealing views of the data, Data summary, Statistical analysis questions, aims, and strategies; Statistical
models, Distributions: models for the random component, Simulation of random numbers and random samples,
Model assumptions
Text Books:
1. Data Analysis and Graphics Using R – an Example-Based Approach, John Maindonald, W. John Braun, Third
Edition, 2010
An overview of R
R is a programming language.
R is often used for statistical computing and graphical presentation to analyze and visualize data.
Why Use R?
It is a great resource for data analysis, data visualization, data science and machine learning
It provides many statistical techniques (such as statistical tests, classification, clustering and data
reduction)
It is easy to draw graphs in R, like pie charts, histograms, box plot, scatter plot, etc
It works on different platforms (Windows, Mac, Linux)
It is open-source and free
It has a large community support
It has many packages (libraries of functions) that can be used to solve different problems
How to Install R
To install R, go to https://cloud.r-project.org/ and download the latest version of R for Windows, Mac or Linux.
When you have downloaded and installed R, you can run R on your computer.
When you run R on a Windows PC, an R console window opens with a command-line prompt.
If you type 5 + 5, and press enter, you will see that R outputs 10.
The command line prompt (>) is an invitation to type commands or expressions. Once the command or expression
is complete, and the Enter key is pressed, R evaluates and prints the result in the console window. This allows the
use of R as a calculator.
For example, type 2+2 and press the Enter key. Here is what appears on the screen:
> 2+2
[1] 4
>
The first element is labeled [1] even when, as here, there is just one element! The final > prompt indicates that R is
ready for another command.
Anything that follows a # on the command line is taken as comment and ignored by R.
A continuation prompt, by default +, appears following a carriage return when the command is not yet complete.
For example, an interruption of the calculation of 3*4^2 by a carriage return could appear as
> 3*4^
+ 2
[1] 48
Multiple commands may appear on one line, with a semicolon (;) as the separator. For example,
> 3*4^2; (3*4)^2
[1] 48
[1] 144
Example
5+5
Print
Unlike many other programming languages, you can output code in R without using a print function:
Example
"Hello World!"
However, R does have a print() function available if you want to use it. This might be useful if you are familiar with
other programming languages, such as Python, which often uses the print() function to output code.
Example
print("Hello World!")
And there are times you must use the print() function to output code, for example when working with for loops
(which you will learn more about in a later chapter):
Example
for (x in 1:10)
{
print(x)
}
Conclusion: It is up to you whether you want to use the print() function to output code. However, when your code
is inside an R expression (for example inside curly braces {} like in the example above), use the print() function to
output the result.
Creating Variables in R
Variables are containers for storing data values.
R does not have a command for declaring a variable. A variable is created the moment you first assign a value to it.
To assign a value to a variable, use the <- sign. To output (or print) the variable value, just type the variable name:
Example
name <- "John"
age <- 40
name # output "John"
age # output 40
From the example above, name and age are variables, while "John" and 40 are values.
In other programming languages, it is common to use = as an assignment operator. In R, we can use both = and <- as
assignment operators.
However, <- is preferred in most cases because the = operator is not allowed in some contexts in R.
Data Types
In programming, data type is an important concept.
Variables can store data of different types, and different types can do different things.
In R, variables do not need to be declared with any particular type, and can even change type after they have been
set:
Example
my_var <- 30 # my_var is type of numeric
my_var <- "Sally" # my_var is now of type character (aka string)
We can use the class() function to check the data type of a variable:
Example
# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
Numbers
There are three number types in R:
numeric
integer
complex
Variables of number types are created when you assign a value to them:
Example
x <- 10.5 # numeric
y <- 10L # integer
z <- 1i # complex
Numeric
A numeric data type is the most common type in R, and contains any number with or without a decimal, like: 10.5,
55, 787:
Example
x <- 10.5
y <- 55
Integer
Integers are numeric data without decimals. This is used when you are certain that you will never create a variable
that should contain decimals. To create an integer variable, you must use the letter L after the integer value:
Example
x <- 1000L
y <- 55L
Complex
A complex number is written with an "i" as the imaginary part:
Example
x <- 3+5i # complex
class(x)
Type Conversion
You can convert from one type to another with the following functions:
as.numeric()
as.integer()
as.complex()
Example
x <- 1L # integer
y <- 2 # numeric
# convert from integer to numeric:
a <- as.numeric(x)
# convert from numeric to integer:
b <- as.integer(y)
# print the class of a and b:
class(a)
class(b)
Simple Math
In R, you can use operators to perform common mathematical operations on numbers.
The + operator is used to add together two values:
Example
10 + 5
And the - operator is used for subtraction:
Example
10 - 5
sqrt()
The sqrt() function returns the square root of a number:
Example
sqrt(16)
abs()
The abs() function returns the absolute (positive) value of a number:
Example
abs(-4.7)
String Literals
Characters, or strings, are used for storing text. A string is surrounded by either single quotation marks or double
quotation marks:
"hello" is the same as 'hello':
Example
"hello"
'hello'
Multiline Strings
You can assign a multiline string to a variable like this:
Example
str <- "Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua."
If you want the line breaks to be inserted at the same position as in the code, use the cat() function:
Example
str <- "Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua."
cat(str)
String Length
There are many useful string functions in R.
For example, to find the number of characters in a string, use the nchar() function:
Example
str <- "Hello World!"
nchar(str)
Check a String
Use the grepl() function to check if a character or a sequence of characters are present in a string:
Example
str <- "Hello World!"
grepl("H", str)
grepl("Hello", str)
grepl("X", str)
Combine Two Strings
The paste() function can be used to merge (concatenate) two strings:
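A short illustrative sketch (the variable names str1 and str2 are only examples):
Example
str1 <- "Hello"
str2 <- "World"
paste(str1, str2) # returns "Hello World"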
Escape Characters
To insert characters that are illegal in a string, you must use an escape character.
An escape character is a backslash \ followed by the character you want to insert.
An example of an illegal character is a double quote inside a string that is surrounded by double quotes:
Example
str <- "We are the so-called "Vikings", from the north."
str
Result:
Error: unexpected symbol in "str <- "We are the so-called "Vikings"
To fix this problem, use the escape character \:
Example
str <- "We are the so-called \"Vikings\", from the north."
str
cat(str)
Note that auto-printing the str variable will print the backslash in the output. You can use the cat() function to
print it without backslash.
Other escape characters in R:
Code Result
\\ Backslash
\n New Line
\r Carriage Return
\t Tab
\b Backspace
Booleans (Logical Values)
Comparing two values returns a logical value, TRUE or FALSE:
Example
a <- 10
b <- 9
a > b # returns TRUE
You can also run a condition in an if statement, which you will learn much more about in the if..else chapter.
Example
a <- 200
b <- 33
if (b > a) {
print ("b is greater than a")
} else {
print("b is not greater than a")
}
Operators
Operators are used to perform operations on variables and values.
In the example below, we use the + operator to add together two values:
Example
10 + 5
R Arithmetic Operators
Arithmetic operators are used with numeric values to perform common mathematical operations:
Operator Name Example
+ Addition x+y
- Subtraction x-y
* Multiplication x*y
/ Division x/y
^ Exponent x^y
R Assignment Operators
Assignment operators are used to assign values to variables:
Example
my_var <- 3
my_var <<- 3
3 -> my_var
3 ->> my_var
my_var # print my_var
Note: <<- is the global assignment operator; it assigns to a variable in the global environment.
R Comparison Operators
Comparison operators are used to compare two values:
Operator Name Example
== Equal x == y
!= Not equal x != y
> Greater than x > y
< Less than x < y
>= Greater than or equal to x >= y
<= Less than or equal to x <= y
R Logical Operators
Logical operators are used to combine conditional statements:
Operator Description
& Element-wise Logical AND operator. It returns TRUE if both elements are TRUE
&& Logical AND operator - Returns TRUE if both statements are TRUE
| Element-wise Logical OR operator. It returns TRUE if one of the elements is TRUE
|| Logical OR operator - Returns TRUE if one of the statements is TRUE
! Logical NOT - Returns FALSE if the statement is TRUE
R Miscellaneous Operators
Miscellaneous operators are used to manipulate data:
Operator Description Example
: Creates a series of numbers in a sequence x <- 1:10
%in% Finds out if an element belongs to a vector x %in% y
%*% Matrix multiplication of two matrices x %*% y
The if Statement
An if statement is written with the if keyword and runs a block of code when a condition is TRUE:
Example
a <- 33
b <- 200
if (b > a) {
print("b is greater than a")
}
In this example we use two variables, a and b, which are used as a part of the if statement to test whether b is
greater than a. As a is 33, and b is 200, we know that 200 is greater than 33, and so we print to screen that "b is
greater than a".
R uses curly brackets { } to define the scope in the code.
Else If
The else if keyword is R's way of saying "if the previous conditions were not true, then try this condition":
Example
a <- 33
b <- 33
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print ("a and b are equal")
}
In this example a is equal to b, so the first condition is not true, but the else if condition is true, so we print to
screen that "a and b are equal".
You can use as many else if statements as you want in R.
If Else
The else keyword catches anything which isn't caught by the preceding conditions:
Example
a <- 200
b <- 33
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print("a and b are equal")
} else {
print("a is greater than b")
}
In this example, a is greater than b, so the first condition is not true, also the else if condition is not true, so we go to
the else condition and print to screen that "a is greater than b".
You can also use else without else if:
Example
a <- 200
b <- 33
if (b > a) {
print("b is greater than a")
} else {
print("b is not greater than a")
}
Nested If Statements
You can also have if statements inside if statements; this is called a nested if statement.
Example
x <- 41
if (x > 10) {
print("Above ten")
if (x > 20) {
print("and also above 20!")
} else {
print("but not above 20.")
}
} else {
print("below 10.")
}
AND
The & symbol (and) is a logical operator, and is used to combine conditional statements:
Example
Test if a is greater than b, AND if c is greater than a:
a <- 200
b <- 33
c <- 500
if (a > b & c > a) {
print("Both conditions are true")
}
OR
The | symbol (or) is a logical operator, and is used to combine conditional statements:
Example
Test if a is greater than b, OR if c is greater than a:
a <- 200
b <- 33
c <- 500
if (a > b | a > c) {
print("At least one of the conditions is true")
}
R While Loops
With the while loop we can execute a set of statements as long as a condition is TRUE:
Example
Print i as long as i is less than 6:
i <- 1
while (i < 6) {
print(i)
i <- i + 1
}
In the example above, the loop will continue to produce numbers ranging from 1 to 5. The loop will stop at 6
because 6 < 6 is FALSE.
The while loop requires relevant variables to be ready, in this example we need to define an indexing variable, i,
which we set to 1.
Note: remember to increment i, or else the loop will continue forever.
Break
With the break statement, we can stop the loop even if the while condition is TRUE:
Example
Exit the loop if i is equal to 4.
i <- 1
while (i < 6) {
print(i)
i <- i + 1
if (i == 4) {
break
}
}
The loop will stop at 3 because we have chosen to finish the loop by using the break statement when i is equal to 4
(i == 4).
With the next statement, we can skip an iteration without terminating the loop:
Example
Skip the value of 3:
i <- 0
while (i < 6) {
i <- i + 1
if (i == 3) {
next
}
print(i)
}
When the loop passes the value 3, it will skip it and continue to loop.
Yahtzee!
If .. Else Combined with a While Loop
To demonstrate a practical example, let us say we play a game of Yahtzee!
Example
Print "Yahtzee!" If the dice number is 6:
dice <- 1
while (dice <= 6) {
if (dice < 6) {
print("No Yahtzee")
} else {
print("Yahtzee!")
}
dice <- dice + 1
}
If the loop passes the values ranging from 1 to 5, it prints "No Yahtzee". Whenever it passes the value 6, it prints
"Yahtzee!".
For Loops
A for loop is used for iterating over a sequence:
Example
for (x in 1:10) {
print(x)
}
This is less like the for keyword in other programming languages, and works more like an iterator method as found
in other object-oriented programming languages.
With the for loop we can execute a set of statements, once for each item in a vector, array, list, etc..
You will learn about lists and vectors, etc in a later chapter.
Example
Print every item in a list:
fruits <- list("apple", "banana", "cherry")
for (x in fruits) {
print(x)
}
Example
Print every number of a dice:
dice <- c(1, 2, 3, 4, 5, 6)
for (x in dice) {
print(x)
}
The for loop does not require an indexing variable to set beforehand, like with while loops.
Break
With the break statement, we can stop the loop before it has looped through all the items:
Example
Stop the loop at "cherry":
fruits <- list("apple", "banana", "cherry")
for (x in fruits) {
if (x == "cherry") {
break
}
print(x)
}
The loop will stop at "cherry" because we have chosen to finish the loop by using the break statement when x is
equal to "cherry" (x == "cherry").
With the next statement, we can skip an iteration without terminating the loop:
Example
Skip "banana":
fruits <- list("apple", "banana", "cherry")
for (x in fruits) {
if (x == "banana") {
next
}
print(x)
}
When the loop passes "banana", it will skip it and continue to loop.
Yahtzee!
If .. Else Combined with a For Loop
To demonstrate a practical example, let us say we play a game of Yahtzee!
Example
Print "Yahtzee!" if the dice number is 6:
dice <- 1:6
for (x in dice) {
if (x == 6) {
print(paste("The dice number is", x, "Yahtzee!"))
} else {
print(paste("The dice number is", x, "Not Yahtzee"))
}
}
If the loop reaches the values 1 to 5, it prints "The dice number is ... Not Yahtzee". When it reaches the value 6,
it prints "The dice number is 6 Yahtzee!".
Nested Loops
You can also have a loop inside of a loop:
Example
Print the adjective of each fruit in a list:
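The code for this example is not shown above; a minimal sketch in the same style (the adj and fruits lists are illustrative) is:
adj <- list("red", "big", "tasty")
fruits <- list("apple", "banana", "cherry")
for (x in adj) {
  for (y in fruits) {
    print(paste(x, y)) # prints every adjective-fruit combination
  }
}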
Creating a Function
To create a function, use the function() keyword:
Example
my_function <- function() { # create a function with the name my_function
print("Hello World!")
}
Call a Function
To call a function, use the function name followed by parenthesis, like my_function():
Example
my_function <- function() {
print("Hello World!")
}
my_function() # call the function
Arguments
Information can be passed into functions as arguments.
Arguments are specified after the function name, inside the parentheses. You can add as many arguments as you
want, just separate them with a comma.
Example
my_function <- function(fname) {
paste(fname, "Griffin")
}
my_function("Peter")
my_function("Lois")
my_function("Stewie")
Parameters or Arguments?
The terms "parameter" and "argument" can be used for the same thing: information that is passed into a function.
A parameter is the variable listed inside the parentheses in the function definition.
Number of Arguments
By default, a function must be called with the correct number of arguments. Meaning that if your function expects 2
arguments, you have to call the function with 2 arguments, not more, and not less:
Example
This function expects 2 arguments, and gets 2 arguments:
my_function <- function(fname, lname) {
paste(fname, lname)
}
my_function("Peter", "Griffin")
If you try to call the function with 1 or 3 arguments, you will get an error:
Example
This function expects 2 arguments, and gets 1 argument:
my_function("Peter")
Default Parameter Value
The following example shows how to use a default parameter value.
Example
my_function <- function(country = "Norway") {
paste("I am from", country)
}
my_function("Sweden")
my_function("India")
my_function() # will get the default value, which is Norway
my_function("USA")
Return Values
To let a function return a result, use the return() function:
Example
my_function <- function(x) {
return (5 * x)
}
print(my_function(3))
print(my_function(5))
print(my_function(9))
The output of the code above will be:
[1] 15
[1] 25
[1] 45
Nested Functions
There are two ways to create a nested function: call a function within another function, or write a function within a function.
Example
Call a function within another function:
Nested_function <- function(x, y) {
a <- x + y
return(a)
}
Nested_function(Nested_function(2,2), Nested_function(3,3))
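The "Example Explained" text below refers to the second way, writing a function inside another function; a sketch of such an example (the names Outer_func and Inner_func are illustrative) is:
Outer_func <- function(x) {
  Inner_func <- function(y) {
    a <- x + y   # the inner function adds x to y
    return(a)
  }
  return(Inner_func)
}
output <- Outer_func(3) # call Outer_func with x = 3
output(5)               # call the returned inner function with y = 5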
Example Explained
The inner function adds x to y.
We then create a new variable called output by calling Outer_func() with the value 3 for x.
Finally we call output() with the desired value of y, which in this case is 5; the result, 8, is printed.
Recursion
R also accepts function recursion, which means a defined function can call itself.
Recursion is a common mathematical and programming concept. It means that a function calls itself. This has the
benefit of meaning that you can loop through data to reach a result.
The developer should be very careful with recursion as it can be quite easy to slip into writing a function which
never terminates, or one that uses excess amounts of memory or processor power. However, when written
correctly, recursion can be a very efficient and mathematically-elegant approach to programming.
In this example, tri_recursion() is a function that we have defined to call itself ("recurse"). We use the k variable as
the data, which decrements (-1) each time we recurse. The recursion ends when k is not greater than
0 (i.e. when it reaches 0).
To a new developer it can take some time to work out how exactly this works; the best way to find out is by testing and
modifying it.
Example
tri_recursion <- function(k) {
if (k > 0) {
result <- k + tri_recursion(k - 1)
print(result)
} else {
result = 0
return(result)
}
}
tri_recursion(6)
Global Variables
Variables that are created outside of a function are known as global variables.
Global variables can be used by everyone, both inside of functions and outside.
Example
Create a variable outside of a function and use it inside the function:
txt <- "awesome"
my_function <- function() {
paste("R is", txt)
}
my_function()
If you create a variable with the same name inside a function, this variable will be local, and can only be used inside
the function. The global variable with the same name will remain as it was, global and with the original value.
Example
Create a variable inside of a function with the same name as the global variable:
txt <- "global variable"
my_function <- function() {
txt = "fantastic"
paste("R is", txt)
}
my_function()
If you try to print txt, it will return "global variable" because we are printing txt outside the function.
my_function()
print(txt)
Also, use the global assignment operator if you want to change a global variable inside a function:
Example
To change the value of a global variable inside a function, refer to the variable by using the global assignment
operator <<-:
txt <- "awesome"
my_function <- function() {
txt <<- "fantastic"
paste("R is", txt)
}
my_function()
Vectors
A vector is simply a list of items that are of the same type.
To combine the list of items to a vector, use the c() function and separate the items by a comma.
In the example below, we create a vector variable called fruits, that combine strings:
Example
# Vector of strings
fruits <- c("banana", "apple", "orange")
# Vector of numerical values
numbers <- c(1, 2, 3)
# Print fruits
fruits
# Print numbers
numbers
To create a vector with numerical values in a sequence, use the : operator:
Example
# Vector with numerical values in a sequence
numbers <- 1:10
numbers
You can also create numerical values with decimals in a sequence, but note that if the last element does not belong
to the sequence, it is not used:
Example
# Vector with numerical decimals in a sequence
numbers1 <- 1.5:6.5
numbers1
# Vector with numerical decimals in a sequence where the last element is not used
numbers2 <- 1.5:6.3
numbers2
Result:
[1] 1.5 2.5 3.5 4.5 5.5 6.5
[1] 1.5 2.5 3.5 4.5 5.5
A vector can also contain logical values:
# Vector of logical values
log_values <- c(TRUE, FALSE, TRUE, FALSE)
log_values
Vector Length
To find out how many items a vector has, use the length() function:
Example
fruits <- c("banana", "apple", "orange")
length(fruits)
Sort a Vector
To sort items in a vector alphabetically or numerically, use the sort() function:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)
sort(fruits) # Sort a string vector
sort(numbers) # Sort a numeric vector
Access Vectors
You can access the vector items by referring to its index number inside brackets []. The first item has index 1, the
second item has index 2, and so on:
Example
fruits <- c("banana", "apple", "orange")
# Access the first item (banana)
fruits[1]
You can also access multiple elements by referring to different index positions with the c() function:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
# Access the first and third item (banana and orange)
fruits[c(1, 3)]
You can also use negative index numbers to access all items except the ones specified:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
# Access all items except the first
fruits[c(-1)]
Change an Item
To change the value of a specific item, refer to the index number:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
# Change "banana" to "pear"
fruits[1] <- "pear"
# Print fruits
fruits
Repeat Vectors
To repeat vectors, use the rep() function:
Example
Repeat each value:
repeat_each <- rep(c(1,2,3), each = 3)
repeat_each
Example
Repeat the sequence of the vector:
repeat_times <- rep(c(1,2,3), times = 3)
repeat_times
Example
Repeat each value independently:
repeat_independent <- rep(c(1,2,3), times = c(5,2,1))
repeat_independent
To make bigger or smaller steps in a sequence, use the seq() function:
Example
numbers <- seq(from = 0, to = 100, by = 20)
numbers
Note: The seq() function has three parameters: from is where the sequence starts, to is where the sequence stops,
and by is the interval of the sequence.
Vectors are the most basic R data objects and there are six types of atomic vectors. They are logical, integer, double,
complex, character and raw.
# Creating a sequence from 5 to 13.
v <- 5:13
print(v)
# Creating a sequence from 6.6 to 12.6.
v <- 6.6:12.6
print(v)
# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
When we execute the above code, it produces the following result −
[1] 5 6 7 8 9 10 11 12 13
[1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6
[1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8
Vector Manipulation
Vector arithmetic
Two vectors of same length can be added, subtracted, multiplied or divided giving the result as a vector output.
# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)
# Vector addition.
add.result <- v1+v2
print(add.result)
# Vector subtraction.
sub.result <- v1-v2
print(sub.result)
# Vector multiplication.
multi.result <- v1*v2
print(multi.result)
# Vector division.
divi.result <- v1/v2
print(divi.result)
When we execute the above code, it produces the following result −
[1] 7 19 4 13 1 13
[1] -1 -3 4 -3 -1 9
[1] 12 88 0 40 0 22
[1] 0.7500000 0.7272727 Inf 0.6250000 0.0000000 5.5000000
Vector Element Recycling
If we apply arithmetic operations to two vectors of unequal length, then the elements of the shorter vector are
recycled to complete the operations.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# V2 becomes c(4,11,4,11,4,11)
add.result <- v1+v2
print(add.result)
sub.result <- v1-v2
print(sub.result)
When we execute the above code, it produces the following result −
[1] 7 19 8 16 4 22
[1] -1 -3 0 -6 -4 0
Lists are the R objects which contain elements of different types like − numbers, strings, vectors and another list
inside it. A list can also contain a matrix or a function as its elements. List is created using list() function.
Creating a List
Following is an example to create a list containing strings, numbers, vectors and logical values.
# Create a list containing strings, numbers, vectors and logical values.
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)
When we execute the above code, it produces the following result −
[[1]]
[1] "Red"
[[2]]
[1] "Green"
[[3]]
[1] 21 32 11
[[4]]
[1] TRUE
[[5]]
[1] 51.23
[[6]]
[1] 119.1
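The named output below ($A_Matrix, $A_Inner_list) comes from a second, named list whose creation code is not shown in this excerpt; a sketch of code that would produce such output is:
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"),
                  matrix(c(3,9,5,1,-2,8), nrow = 2),
                  list("green", 12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A_Inner_list")
print(list_data)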
$A_Matrix
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
$A_Inner_list
$A_Inner_list[[1]]
[1] "green"
$A_Inner_list[[2]]
[1] 12.3
# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])
$A_Inner_list
$A_Inner_list[[1]]
[1] "green"
$A_Inner_list[[2]]
[1] 12.3
$<NA>
NULL
Merging Lists
You can merge many lists into one list by placing all the lists inside one list() function.
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")
# Merge the two lists.
merged.list <- c(list1, list2)
# Print the merged list.
print(merged.list)
When we execute the above code, it produces the following result −
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] "Sun"
[[5]]
[1] "Mon"
[[6]]
[1] "Tue"
Converting List to Vector
A list can be converted to a vector using the unlist() function, so that its elements can be used in arithmetic.
# Create lists.
list1 <- list(1:5)
list2 <- list(10:14)
print(list2)
# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
print(v2)
# Now add the vectors.
result <- v1 + v2
print(result)
When we execute the above code, it produces the following result −
[[1]]
[1] 10 11 12 13 14
[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19
Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular layout. They
contain elements of the same atomic types. Though we can create a matrix containing only characters or only
logical values, they are not of much use. We use matrices containing numeric elements to be used in mathematical
calculations.
A Matrix is created using the matrix() function.
Syntax
The basic syntax for creating a matrix in R is −
matrix(data, nrow, ncol, byrow, dimnames)
Following is the description of the parameters used −
data is the input vector which becomes the data elements of the matrix.
nrow is the number of rows to be created.
ncol is the number of columns to be created.
byrow is a logical value. If TRUE then the input vector elements are arranged by row.
dimnames is the names assigned to the rows and columns.
Example
Create a matrix taking a vector of numbers as input.
# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)
Matrix Computations
Various mathematical operations are performed on the matrices using the R operators. The result of the operation
is also a matrix.
The dimensions (number of rows and columns) should be same for the matrices involved in the operation.
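No code is shown for this; a minimal sketch of element-wise matrix arithmetic (the two matrices below are purely illustrative) is:
# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
# Element-wise addition, subtraction, multiplication and division.
print(matrix1 + matrix2)
print(matrix1 - matrix2)
print(matrix1 * matrix2)
print(matrix1 / matrix2)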
Arrays are the R data objects which can store data in more than two dimensions. For example − if we create an
array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2 rows and 3 columns. Arrays can
store only one data type.
An array is created using the array() function. It takes vectors as input and uses the values in the dim parameter
to create an array.
Example
The following example creates an array of two 3x3 matrices each with 3 rows and 3 columns.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
result <- array(c(vector1, vector2), dim = c(3, 3, 2))
print(result)
# Print the element in the 1st row and 3rd column of the 1st matrix.
print(result[1,3,1])
# Create the array used for the apply() calculation.
new.array <- array(c(vector1, vector2), dim = c(3, 3, 2))
# Use apply to calculate the sum of the rows across all the matrices.
result <- apply(new.array, c(1), sum)
print(result)
When we execute the above code, it produces the following result −
[1] 56 68 60
Factors are the data objects which are used to categorize the data and store it as levels. They can store both strings
and integers. They are useful in columns which have a limited number of unique values, like "Male"/"Female" or
TRUE/FALSE. They are useful in data analysis for statistical modeling.
Factors are created using the factor() function by taking a vector as input.
Example
# Create a vector as input.
data <- c("East","West","East","North","North","East","West","West","West","East","North")
print(data)
print(is.factor(data))
# Apply the factor function.
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
When we execute the above code, it produces the following result −
[1] "East" "West" "East" "North" "North" "East" "West" "West" "West" "East" "North"
[1] FALSE
[1] East West East North North East West West West East North
Levels: East North West
[1] TRUE
Factors in Data Frame
On creating a data frame with a column of text data, R treats the text column as categorical data and creates
factors on it (this is the behaviour when stringsAsFactors = TRUE, which was the default before R 4.0).
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one
variable and each row contains one set of values from each column.
Following are the characteristics of a data frame.
The column names should be non-empty.
The row names should be unique.
The data stored in a data frame can be of numeric, factor or character type.
Each column should contain same number of data items.
# Create the data frame.
emp.data <- data.frame(
emp_id = c(1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
When we execute the above code, it produces the following result −
emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25
Extract the first two rows and then all columns
# Extract first two rows.
result <- emp.data[1:2,]
print(result)
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
When we execute the above code, it produces the following result −
emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
Add Column
Just add the column vector using a new column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c(1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Add the "dept" column.
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
print(emp.data)
R Packages
R packages are a collection of R functions, compiled code and sample data. They are stored under a directory
called "library" in the R environment. By default, R installs a set of packages during installation. More packages are
added later, when they are needed for some specific purpose. When we start the R console, only the default
packages are available by default. Other packages which are already installed have to be loaded explicitly to be
used by the R program that is going to use them.
All the packages available in R language are listed at R Packages.
Below is a list of commands to be used to check, verify and use the R packages.
library() # get the list of all the packages installed
When we execute the above code, it produces the list of installed packages. It may vary depending on the local settings of
your PC.
Install a New Package
There are two ways to add new R packages. One is installing directly from the CRAN directory and another is
downloading the package to your local system and installing it manually.
Install directly from CRAN
The following command gets the packages directly from CRAN webpage and installs the package in the R
environment. You may be prompted to choose a nearest mirror. Choose the one appropriate to your location.
install.packages("Package Name")
Data Reshaping in R is about changing the way data is organized into rows and columns. Most of the time data
processing in R is done by taking the input data as a data frame. It is easy to extract data from the rows and
columns of a data frame, but there are situations when we need the data frame in a format that is different from the
format in which we received it. R has many functions to split, merge and change the rows to columns and vice
versa in a data frame.
Joining Columns and Rows in a Data Frame
We can join multiple vectors to create a data frame using the cbind()function. Also we can merge two data frames
using rbind() function.
# Print a header.
cat("# # # # The First data frame\n")
# Print a header.
cat("# # # The Second data frame\n")
# Print a header.
cat("# # # The combined data frame\n")
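The code between these headers is not shown; a sketch of the kind of cbind()/rbind() example they refer to (the city/state/zipcode values are illustrative) is:
# Create vector objects.
city <- c("Tampa", "Seattle", "Hartford", "Denver")
state <- c("FL", "WA", "CT", "CO")
zipcode <- c("33602", "98104", "06161", "80294")
# Combine the vectors into one data frame (column bind).
addresses <- cbind(city, state, zipcode)
cat("# # # # The First data frame\n")
print(addresses)
# Create another data frame with the same columns.
new.address <- data.frame(city = c("Lowry", "Charlotte"),
                          state = c("CO", "FL"),
                          zipcode = c("80230", "28213"),
                          stringsAsFactors = FALSE)
cat("# # # The Second data frame\n")
print(new.address)
# Combine the rows of both data frames (row bind).
all.addresses <- rbind(addresses, new.address)
cat("# # # The combined data frame\n")
print(all.addresses)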
library(MASS)
merged.Pima <- merge(x = Pima.te, y = Pima.tr,
by.x = c("bp", "bmi"),
by.y = c("bp", "bmi")
)
print(merged.Pima)
nrow(merged.Pima)
When we execute the above code, it produces the following result −
bp bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y ped.y
1 60 33.8 1 117 23 0.466 27 No 2 125 20 0.088
2 64 29.7 2 75 24 0.370 33 No 2 100 23 0.368
3 64 31.2 5 189 33 0.583 29 Yes 3 158 13 0.295
4 64 33.2 4 117 27 0.230 24 No 1 96 27 0.289
5 66 38.1 3 115 39 0.150 28 No 1 114 36 0.289
6 68 38.5 2 100 25 0.324 26 No 7 129 49 0.439
7 70 27.4 1 116 28 0.204 21 No 0 124 20 0.254
8 70 33.1 4 91 32 0.446 22 No 9 123 44 0.374
9 70 35.4 9 124 33 0.282 34 No 6 134 23 0.542
10 72 25.6 1 157 21 0.123 24 No 4 99 17 0.294
11 72 37.7 5 95 33 0.370 27 No 6 103 32 0.324
12 74 25.9 9 134 33 0.460 81 No 8 126 38 0.162
13 74 25.9 1 95 21 0.673 36 No 8 126 38 0.162
14 78 27.6 5 88 30 0.258 37 No 6 125 31 0.565
15 78 27.6 10 122 31 0.512 45 No 6 125 31 0.565
16 78 39.4 2 112 50 0.175 24 No 4 112 40 0.236
17 88 34.5 1 117 24 0.403 40 Yes 4 127 11 0.598
age.y type.y
1 31 No
2 21 No
3 24 No
4 21 No
5 21 No
6 43 Yes
7 36 Yes
8 40 No
9 29 Yes
10 28 No
11 55 No
12 39 No
13 39 No
14 49 Yes
15 49 Yes
16 38 No
17 28 No
[1] 17
Melting and Casting
One of the most interesting aspects of R programming is about changing the shape of the data in multiple steps to
get a desired shape. The functions used to do this are called melt() and cast().
We consider the dataset called ships present in the library called "MASS".
library(MASS)
library(reshape) # the melt() and cast() functions used below come from the reshape package
print(ships)
When we execute the above code, it produces the following result −
type year period service incidents
1 A 60 60 127 0
2 A 60 75 63 0
3 A 65 60 1095 3
4 A 65 75 1095 4
5 A 70 60 1512 6
.............
.............
8 A 75 75 2244 11
9 B 60 60 44882 39
10 B 60 75 17176 29
11 B 65 60 28609 58
............
............
17 C 60 60 1179 1
18 C 60 75 552 1
19 C 65 60 781 0
............
............
Melt the Data
Now we melt the data to organize it, converting all columns other than type and year into multiple rows.
molten.ships <- melt(ships, id = c("type","year"))
print(molten.ships)
When we execute the above code, it produces the following result −
type year variable value
1 A 60 period 60
2 A 60 period 75
3 A 65 period 60
4 A 65 period 75
............
............
9 B 60 period 60
10 B 60 period 75
11 B 65 period 60
12 B 65 period 75
13 B 70 period 60
...........
...........
41 A 60 service 127
42 A 60 service 63
43 A 65 service 1095
...........
...........
70 D 70 service 1208
71 D 75 service 0
72 D 75 service 2051
73 E 60 service 45
74 E 60 service 0
75 E 65 service 789
...........
...........
101 C 70 incidents 6
102 C 70 incidents 2
103 C 75 incidents 0
104 C 75 incidents 1
105 D 60 incidents 0
106 D 60 incidents 0
...........
...........
Cast the Molten Data
We can cast the molten data into a new form where the aggregate of each type of ship for each year is created. It is
done using the cast() function.
recasted.ship <- cast(molten.ships, type+year~variable,sum)
print(recasted.ship)
When we execute the above code, it produces the following result −
type year period service incidents
1 A 60 135 190 0
2 A 65 135 2190 7
3 A 70 135 4865 24
4 A 75 135 2244 11
5 B 60 135 62058 68
6 B 65 135 48979 111
7 B 70 135 20163 56
8 B 75 135 7117 18
9 C 60 135 1731 2
10 C 65 135 1457 1
11 C 70 135 2731 8
12 C 75 135 274 1
13 D 60 135 356 0
14 D 65 135 480 0
15 D 70 135 1557 13
16 D 75 135 2051 4
17 E 60 135 45 0
18 E 65 135 1226 14
19 E 70 135 3318 17
20 E 75 135 542 1
In R, we can read data from files stored outside the R environment. We can also write data into files which will be
stored and accessed by the operating system. R can read and write into various file formats like csv, excel, xml etc.
In this chapter we will learn to read data from a csv file and then write data into a csv file. The file should be
present in current working directory so that R can read it. Of course we can also set our own directory and read
files from there.
Example
# Get and print current working directory.
print(getwd())
# Set current working directory.
setwd("/web/com")
# Get and print current working directory.
print(getwd())
When we execute the above code, it produces the following result −
[1] "/web/com/1441086124_2016"
[1] "/web/com"
This result depends on your OS and your current directory where you are working.
You can create this file using Windows Notepad by copying and pasting this data. Save the file as input.csv using the
Save As option in Notepad, choosing "All files (*.*)" as the file type.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
Reading a CSV File
Following is a simple example of the read.csv() function reading a CSV file available in your current working directory −
data <- read.csv("input.csv")
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
When we execute the above code, it produces the following result −
[1] TRUE
[1] 5
[1] 8
Once we read data into a data frame, we can apply all the functions applicable to data frames, as explained in the
subsequent sections.
Get the maximum salary
# Get the max salary from the data frame.
sal <- max(data$salary)
print(sal)
[1] 843.25
Get the details of the person with max salary
We can fetch rows meeting specific filter criteria, similar to a SQL WHERE clause:
# Get the detail of the person having the max salary.
retval <- subset(data, salary == max(salary))
print(retval)
Sampling distributions
The standard deviation (SD) measures the amount of variability, or dispersion, from the individual data values to the
mean.
Standard error of the mean (SEM) measures how far the sample mean (average) of the data is likely to be from the true
population mean. The SEM is always smaller than the SD. The SEM is a measure of the accuracy of the sample mean, as
an estimate of the population mean. The SEM takes the SD and divides it by the square root of the sample size.
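As a quick illustration of the SD/SEM relationship (the sample below is simulated, purely for illustration):
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2) # an illustrative sample of size 50
sd(x)                             # standard deviation of the sample
sd(x) / sqrt(length(x))           # standard error of the mean (SEM)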
A sampling distribution is a probability distribution of a statistic that is obtained through repeated sampling of a specific
population. It describes a range of possible outcomes for a statistic, such as the mean or mode of some variable, of a
population. The majority of data analyzed by researchers are actually samples, not populations.
The selection of some members of a population (for example, some employees out of all employees) is known as a sample.
Let's say we have a company in which 30,000 employees are working and we want to find out the daily commute time of all
the employees; we would estimate it from a sample rather than measuring everyone.
Eg. Consider data from an experiment in which 21 elastic bands were randomly divided into two groups, one of 10 and
one of 11. Bands in the first group were immediately tested for the amount that they stretched under a weight of 1.35 kg.
The other group were dunked in hot water at 65◦C for four minutes, then left at air temperature for ten minutes, and then
tested for the amount that they stretched under the same 1.35 kg weight as before. The results were:
Ambient: 254 252 239 240 250 256 267 249 259 269 (Mean = 253.5)
Heated: 233 252 237 246 255 244 248 242 217 257 254 (Mean = 244.1)
The pooled standard deviation estimate is s = 10.91, with 19 (= 10 + 11 − 2) degrees of freedom. Since the separate
standard deviations (s1 = 9.92; s2 = 11.73) are similar, the pooled standard deviation estimate is an acceptable summary
of the variation in the data
The pooled standard deviation estimate is 10.91. Hence, the SED is 10.91 × √(1/10 + 1/11) = 4.77.
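A sketch of how these numbers can be reproduced in R (the two vectors hold the stretch measurements quoted above):
ambient <- c(254, 252, 239, 240, 250, 256, 267, 249, 259, 269)
heated  <- c(233, 252, 237, 246, 255, 244, 248, 242, 217, 257, 254)
n1 <- length(ambient); n2 <- length(heated)
# Pooled standard deviation with n1 + n2 - 2 = 19 degrees of freedom
s.pooled <- sqrt(((n1 - 1) * var(ambient) + (n2 - 1) * var(heated)) / (n1 + n2 - 2))
# Standard error of the difference between the two means (SED)
sed <- s.pooled * sqrt(1/n1 + 1/n2)
c(pooled.sd = s.pooled, SED = sed) # approximately 10.91 and 4.77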
Functions used:
To find the value of probability density function (pdf) of the Student’s t-distribution given a random variable x, use
the dt() function in R.
Syntax: dt(x, df)
Department of CSE, AITS-Tirupati 55
Data Analytics(20APE0514)
Parameters:
x is the quantiles vector
df is the degrees of freedom
pt() function is used to get the cumulative distribution function (CDF) of a t-distribution
Syntax: pt(q, df, lower.tail = TRUE)
Parameter:
q is the quantiles vector
df is the degrees of freedom
lower.tail – if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].
The qt() function is used to get the quantile function or inverse cumulative density function of a t-distribution.
Syntax: qt(p, df, lower.tail = TRUE)
Parameter:
p is the vector of probabilities
df is the degrees of freedom
lower.tail – if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].
Approach
Set degrees of freedom
To plot the density function for student’s t-distribution follow the given steps:
First create a vector of quantiles in R.
Next, use the dt function to find the values of a t-distribution given a random variable x and certain
degrees of freedom.
Using these values plot the density function for student’s t-distribution.
Now, instead of the dt function, use the pt function to get the cumulative distribution function (CDF) of a t-
distribution and the qt function to get the quantile function or inverse cumulative density function of a t-
distribution. Put simply, pt returns the area to the left of a given value q in the t-distribution, and qt
finds the t-score of the pth quantile of the t-distribution.
Example: To find the value of the t-distribution at x = 1, with Df = 25 degrees of freedom:
dt(x = 1, df = 25)
Output:
0.237211
Example:
Code below shows a comparison of probability density functions having different degrees of freedom. It is observed as
mentioned before, larger the sample size (degrees of freedom increasing), the closer the plot is to a normal distribution
(dotted line in figure).
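The plotting code is not included in this excerpt; a sketch that produces this kind of comparison (degrees of freedom chosen for illustration) is:
x <- seq(-4, 4, length.out = 200)       # quantiles at which to evaluate the densities
plot(x, dnorm(x), type = "l", lty = 2,  # normal density as the dotted reference curve
     ylab = "Density", main = "t densities for increasing degrees of freedom")
for (df in c(1, 3, 8, 30)) {
  lines(x, dt(x, df = df))              # the t density approaches the normal as df grows
}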
Example: To find the one-sided p-value P(t > 2.1) (with 14 degrees of freedom, the value implied by the output below and
by the qt example that follows):
pt(q = 2.1, df = 14, lower.tail = FALSE)
Output:
0.02716657
Essentially we found the one-sided p-value, P(t > 2.1), to be about 2.7%. Now suppose we want to construct a two-sided 95%
confidence interval. To do so, find the t-score or t-value for 95% confidence using the qt function or the quantile
distribution.
Example:
qt(p = 0.025, df = 14)
Output:
-2.144787
So, a t-value of 2.14 (in absolute value) will be used as the critical value for a confidence interval of 95%.
Using the density, it is possible to determine the probabilities of events. For example, you may wonder: what is the
likelihood that a person has an IQ of exactly 140? In this case, you would need to retrieve the density of the IQ
distribution at the value 140. The IQ distribution can be modeled with a mean of 100 and a standard deviation of 15.
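The code below refers to a data frame iq.df that is not constructed in this excerpt; a sketch of how it could be built (a grid of IQ values with their normal densities) is:
IQ <- seq(1, 200, by = 1)                       # grid of IQ values
iq.df <- data.frame(IQ = IQ,
                    Density = dnorm(IQ, mean = 100, sd = 15))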
pp <- function(x) {
print(paste0(round(x * 100, 3), "%"))
}
# likelihood of IQ == 140?
pp(iq.df$Density[iq.df$IQ == 140])
## [1] "0.076%"
The depicted CDF shows the probability of having an IQ less or equal to a given value. This is because pnorm computes
the lower tail by default, i.e. P[X<=x].
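The CDF computation itself is not shown; a minimal sketch, reusing the IQ grid from the sketch above, is:
# Probability of an IQ less than or equal to 140 (lower tail by default)
pnorm(140, mean = 100, sd = 15)   # about 0.996
# Plot the CDF over the IQ grid
plot(IQ, pnorm(IQ, mean = 100, sd = 15), type = "l", ylab = "P(X <= x)")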
A hypothesis is made by the researchers about the data collected for any experiment or data set. A hypothesis is an
assumption made by the researchers that is not necessarily true. In simple words, a hypothesis is a decision taken by the
researchers based on the data of the population collected. Hypothesis Testing in R Programming is a process of testing the
hypothesis made by the researcher or to validate the hypothesis. To perform hypothesis testing, a random sample of data
from the population is taken and testing is performed. Based on the results of testing, the hypothesis is either selected or
rejected. This concept is known as Statistical Inference.
Example:
# Defining sample vector
x <- rnorm(100)
# One-sample t-test: test whether the true mean equals 5
t.test(x, mu = 5)
Output:
One Sample t-test
data: x
t = -49.504, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
-0.1910645 0.2090349
sample estimates:
mean of x
0.008985172
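The code producing the Welch two-sample output below is not shown; a sketch consistent with it is:
# Defining two sample vectors
x <- rnorm(100)
y <- rnorm(100)
# Two-sample (Welch) t-test comparing the means of x and y
t.test(x, y)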
Output:
Welch Two Sample t-test
data: x and y
t = -1.0601, df = 197.86, p-value = 0.2904
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4362140 0.1311918
sample estimates:
mean of x mean of y
-0.05075633 0.10175478
Directional Hypothesis
Using the directional hypothesis, the direction of the hypothesis can be specified like, if the user wants to know the
sample mean is lower or greater than another mean sample of the data.
Syntax: t.test(x, mu, alternative)
Parameters:
x: represents numeric vector data
mu: represents mean against which sample data has to be tested
alternative: sets the alternative hypothesis
Example:
# Defining sample vector
x <- rnorm(100)
# Directional hypothesis: test whether the true mean is greater than 2
t.test(x, mu = 2, alternative = "greater")
Output:
One Sample t-test
data: x
t = -20.708, df = 99, p-value = 1
alternative hypothesis: true mean is greater than 2
95 percent confidence interval:
-0.2307534 Inf
sample estimates:
mean of x
-0.0651628
One-Sample Wilcoxon Signed Rank Test
This non-parametric test is used when the sample cannot be assumed to be normally distributed; it is performed with the
wilcox.test() function.
Example:
# Define vector
x <- rnorm(100)
# One-sample Wilcoxon signed rank test
wilcox.test(x, mu = 0)
Output:
Wilcoxon signed rank test with continuity correction
data: x
V = 2555, p-value = 0.9192
alternative hypothesis: true location is not equal to 0
Two-Sample Wilcoxon Rank Sum Test (Mann-Whitney Test)
Example:
# Define vectors
x <- rnorm(100)
y <- rnorm(100)
# Two-sample Wilcoxon rank sum test
wilcox.test(x, y)
Output:
Wilcoxon rank sum test with continuity correction
data: x and y
W = 5300, p-value = 0.4643
alternative hypothesis: true location shift is not equal to 0
Correlation Test
This test is used to compare the correlation of the two vectors provided in the function call or to test for the association
between the paired samples.
Syntax: cor.test(x, y)
Parameters:
x and y: represents numeric data vectors
To know about more optional parameters in cor.test() function, use below command:
help("cor.test")
Example:
# Using mtcars dataset in R
cor.test(mtcars$mpg, mtcars$hp)
Output:
Pearson's product-moment correlation
Contingency tables:
Contingency tables are very useful for condensing a large number of observations into a smaller table that is easier to
maintain. A contingency table shows the distribution of one variable in its rows and another in its columns.
Contingency tables are not only useful for condensing data, but they also show the relations between variables. They
are a way of summarizing categorical variables. A contingency table that collapses several variables into a single table is
called a complex or a flat contingency table.
Making Contingency tables
A contingency table is a way to redraw data and assemble it into a table. And, it shows the layout of the original data in
a manner that allows the reader to gain an overall summary of the original data. The table() function is used in R to
create a contingency table. The table() function is one of the most versatile functions in R. It can take any data
structure as an argument and turn it into a table. The more complex the original data, the more complex is the resulting
contingency table.
Creating contingency tables from Vectors
In R a vector is an ordered collection of basic data types of a given length. The only key thing here is all the elements
of a vector must be of the identical data type e.g homogeneous data structures. Vectors are one-dimensional data
structures. It is the simplest data object from which you can create a contingency table.
Example:
# R program to illustrate
# Contingency Table
# Creating a vector
vec = c(2, 4, 3, 1, 6, 3, 2, 1, 4, 5)
# Creating contingency table from the vector using the table() function
print(table(vec))
Output:
vec
1 2 3 4 5 6
2 2 2 2 1 1
In the given program, when we execute the table() command on the vector, it sorts the vector values and prints the
frequency of every element given in the vector.
Creating contingency tables from Data
Now we will see a simple example using a data frame that contains character values in one column and a factor in
another column. In order to create our contingency table from data, we will make use of the table() function. In the
following example, the table() function returns a contingency table: basically, a tabular result of the categorical variables.
Example:
# R program to illustrate
# Contingency Table
# Creating a data frame
df = data.frame(
"Name" = c("Amiya", "Rosy", "Asish"),
"Gender" = c("Male", "Female", "Male")
)
# Creating contingency table from the data frame
conTable = table(df)
print(conTable)
Output:
Gender
Name Female Male
Amiya 0 1
Asish 0 1
Rosy 1 0
Creating custom contingency tables
The contingency table in R can be created using only a part of the data which is in contrast with collecting data from all
the rows and columns. We can create a custom contingency table in R using the following ways:
Using Columns of a Data Frame in a Contingency Table
Using Rows of a Data Frame in a Contingency Table
By Rotating Data Frames in R
Creating Contingency Tables from Matrix Objects in R
Using Columns of a Data Frame in a Contingency Table: With the help of table() command, we are able to
specify the columns with which the contingency tables can be created. In order to do so, you only need to pass the
name of vector objects in the parameter of table() command.
Example:
# R program to illustrate
# Contingency Table
# Contingency table built from the Name column only
conTable = table(df$Name)
print(conTable)
Output:
Amiya Asish Rosy
1 1 1
From the output, you can notice that the table() command sorts the name in alphabetical order along with their frequencies
of occurrence.
Using Rows of a Data Frame in a Contingency Table: We can't create a contingency table using rows of a data
frame directly, as we did in the "using columns" part. With the help of a matrix, we can create a contingency table by
looking at the rows of a data frame.
Example:
# R program to illustrate
# Contingency Table
Output:
Asish Female Male Rosy
1 1 1 1
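The code for this example is not shown; one way (hypothetical indexing, chosen to match the output above) to build a table from rows of df is:
# Table the values found in rows 2 and 3 of the data frame,
# going through a character matrix first
conTable = table(as.matrix(df)[2:3, ])
print(conTable)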
By Rotating Data Frames in R: We can also create a contingency table by rotating a data frame in R. We can
perform a rotation of the data, that is, transpose of the data using the t() command.
Example:
# R program to illustrate
# Contingency Table
Output:
newDf
Amiya Asish Female Male Rosy
1 1 1 2 1
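Again the code is omitted; a sketch consistent with the output above is:
# Rotate (transpose) the data frame, then table all of its values
newDf = t(df)
print(table(newDf))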
Creating Contingency Tables from Matrix Objects in R A matrix is a rectangular arrangement of numbers in
rows and columns. In a matrix, as we know rows are the ones that run horizontally and columns are the ones that
run vertically. Matrices are two-dimensional, homogeneous data structures. We can create a contingency table by
using this matrix object.
Example:
# R program to illustrate
# Contingency Table
# Creating a matrix
A = matrix(
c(1, 2, 4, 1, 5, 6, 2, 4, 7),
nrow = 3,
ncol = 3
)
# Creating contingency table from the values in the matrix
print(table(A))
Output:
A
1 2 4 5 6 7
2 2 2 1 1 1
Converting Objects into tables
As mentioned above, a table is a special type of data object which is similar to the matrix but also possesses several
differences.
Converting Matrix Objects into tables: We can directly convert a matrix object into table by using
the as.table() command. Just pass the matrix object as a parameter to the as.table() command.
Example:
# R program to illustrate
# Contingency Table
# Creating a matrix
A = matrix(
c(1, 2, 4, 1, 5, 6, 2, 4, 7),
nrow = 3,
ncol = 3
)
# Converting the matrix into a table
print(as.table(A))
Output:
  A B C
A 1 1 2
B 2 5 4
C 4 6 7
Converting Data frame Objects into tables: We can’t directly convert a data frame object into table by using
the as.table() command. In the case of a data frame, the object can be converted into the matrix, and then it can be
converted into the table using as.table() command.
Example:
# R program to illustrate
# Contingency Table
# Converting the data frame to a matrix first, then to a table
tab = as.table(as.matrix(df))
print(tab)
Output:
Name Gender
A Amiya Male
B Rosy Female
C Asish Male
One-way unstructured comparisons:
Consider data from a one-way unstructured comparison between three treatments. The weights of the plants were measured
after two months on the respective treatments: water, concentrated nutrient, and concentrated nutrient plus the selective
herbicide 2,4-D.
Data are: tomato <- data.frame(weight= c(1.5, 1.9, 1.3, 1.5, 2.4, 1.5, # water
1.5, 1.2, 1.2, 2.1, 2.9, 1.6, # Nutrient
1.9, 1.6, 0.8, 1.15, 0.9, 1.6), # Nutrient+24D
trt = rep(c("water", "Nutrient", "Nutrient+24D"), c(6, 6, 6)))
Figure 4.5 Weights of tomato plants, after two months of the three treatments.
## Make water the first level of trt. It will then appear as
## the initial level in the graphs. In aov or lm calculations,
## it will be the baseline or reference level.
tomato$trt <- relevel(tomato$trt, ref = "water")
The strip plots display "within-group" variability, as well as giving an indication of differences among the group means.
Variances seem similar for the three treatments.
The function onewayPlot(), from the DAAG package, provides a convenient visual summary of results, shown in Figure 4.6.
Notice that the graph gives two different assessments of the least difference that should be treated as “significant”.
These use different criteria:
. The 5% least significant difference (LSD) is designed so that, under the null model (no differences), significant
differences will be found in 5% of comparisons.
. The 5% honest significant difference (HSD) is designed so that, under the null model,
the maximum difference will be significant in 5% of experiments.
The LSD is commonly regarded as overly lax, while the HSD may be overly conservative. There are a variety of
alternatives to the HSD that are less conservative.
Department of CSE, AITS-Tirupati 68
Data Analytics(20APE0514)
## Fit the one-way analysis of variance model
tomato.aov <- aov(weight ~ trt, data = tomato)
> anova(tomato.aov)
Analysis of Variance Table
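Tukey's honest significant differences can be obtained directly from the fitted aov object; a minimal sketch is:
# Tukey HSD 95% family-wise confidence intervals for the treatment differences
tomato.hsd <- TukeyHSD(tomato.aov, conf.level = 0.95)
print(tomato.hsd)
plot(tomato.hsd) # plot the intervals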
Response curves:
Consider data that are strongly structured. A model car was released three times at each of four different distances
(starting.point) up a 20° ramp. The experimenter recorded distances traveled from the bottom of the ramp across a
concrete floor.
Response curve analyses should be used whenever appropriate in preference to comparison of individual pairs of means.
For these data, the physics can be used to suggest the likely form of response. Where no such help is available, careful
examination of the graph, followed by systematic examination of plausible forms of response, may suggest a suitable form
of response curve.
Table 4.8 Each tester made two firmness tests on each of five fruit.
1. Does a straight line explain the data better than assuming a random scatter about a horizontal line?
2. Is there evidence of curvature, so that a quadratic term would improve on the straight line?
3. Is a still more complicated (e.g., cubic) form of response required?
A representation of the response curve in terms of coefficients of orthogonal polynomials provides information that makes
it relatively easy to address questions 1–3. Consider, for example, a model that has terms in x and x2. Orthogonal
polynomials re-express this combination of terms in such a way that the coefficient of the “linear” term is independent of
the coefficient of the “quadratic” term. Higher-order (cubic, . . . ) orthogonal polynomial terms can of course be fitted, and
it remains the case that the coefficients are mutually independent.
Ten apples are taken from a box. A randomization procedure assigns five to one tester, and the other five to another tester.
Each tester makes two firmness tests on each of their five fruit. Firmness is measured by the pressure needed to push the
flat end of a piece of rod through the surface of the fruit. Table 4.8 gives the results, in N/m².
What happens if we ignore the data structure, and compare ten values for one tester with ten values for the other tester?
This pretends that we have ten experimental units for each tester. The analysis will suggest that the treatment means are
more accurate than is really the case. We obtain a pretend standard error that is not the correct standard error of the mean.
We are likely to under-estimate the standard error of the treatment difference.
For comparison of two means when the sample sizes n1 and n2 are small, it is important to have as many degrees of
freedom as possible for the denominator of the t-test. It is worth tolerating possible bias in some of the calculated SEDs in
order to gain extra degrees of freedom.
The same considerations arise in the one-way analysis of variance, and we pursue the issue in that context. It is
illuminating to plot out, side by side, say 10 SEDs based on randomly generated normal variates, first for a comparison
based on 2 d.f., then 10 SEDs for a comparison based on 4 d.f., etc. (d.f. = degrees of freedom).
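A small simulation sketch of this idea (sample sizes chosen so the pooled SD has 2 d.f. and 4 d.f. respectively; purely illustrative):
set.seed(1)
sed.one <- function(n1, n2) {
  y1 <- rnorm(n1); y2 <- rnorm(n2)
  # pooled SD on n1 + n2 - 2 degrees of freedom, then the SED
  s.pooled <- sqrt(((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2))
  s.pooled * sqrt(1/n1 + 1/n2)
}
seds2 <- replicate(10, sed.one(2, 2)) # ten SEDs, each based on 2 d.f.
seds4 <- replicate(10, sed.one(3, 3)) # ten SEDs, each based on 4 d.f.
stripchart(list("2 d.f." = seds2, "4 d.f." = seds4), vertical = TRUE)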
A formal statistical test is thus unlikely, unless the sample is large, to detect differences in variance that may have a large
effect on the result of the test. It is therefore necessary to rely on judgment. Both past experience with similar data and
subject area knowledge may be important.
In comparing two treatments that are qualitatively similar, differences in the population variance may be unlikely, unless
the difference in means is at least of the same order of magnitude as the individual means.
If the means are not much different then it is reasonable, though this is by no means inevitable, to expect that the
variances will not be much different.
If there do seem to be differences in variance, it may be possible to model the variance as a function of the mean. It may
be possible to apply a variance-stabilizing transformation.
Otherwise, if there are just one or two degrees of freedom per mean, use a pooled estimate of variance unless the
assumption of equal variance seems clearly unacceptable.
The extension is less straightforward when one or both of these conditions are not met. For unbalanced data from designs
with a simple error structure, it is necessary to use the lm() (linear model) function.
The lme() function in the nlme package, or alternatively lmer() in the lme4 package, is able to handle problems where
there is structure in the error term, including data from unbalanced designs.
Paired data example: nine pairs of elastic bands, with one band of each pair heated and the other kept at ambient
temperature; the amounts of stretch (mm) were:
Pair #        1   2   3   4   5   6   7   8   9
Heated (mm) 244 255 253 254 251 269 248 252 292
Ambient     225 247 249 253 245 259 242 255 286
Difference   19   8   4   1   6  10   6  -3   6
Resample data
Let’s assume that the data are a sample of measurements for a single variable stored in a vector x. The data may be
numeric or categorical.
1. A single bootstrap replicate is obtained by resampling the data with replacement; the replace option of R's sample()
function indicates that sampling is carried out with replacement (see the sketch after this list).
2. Calculate the statistic of interest (for example, the mean) on the resampled data in xboot and store the result in a
vector created for this purpose.
3. Repeat steps (1) and (2) many times. The result will be a large collection of bootstrap replicate estimates for
subsequent analysis.
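A minimal sketch of steps (1) to (3) in base R, assuming the data are already in a numeric vector x and that 10 000
replicates are enough (both assumptions are illustrative):
B <- 10000                             # number of bootstrap replicates (illustrative choice)
z <- numeric(B)                        # vector to hold the replicate estimates
for (i in 1:B) {
  xboot <- sample(x, replace = TRUE)   # step 1: resample the data with replacement
  z[i] <- mean(xboot)                  # step 2: statistic of interest on the resample
}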
In other cases, two or more variables are measured on individuals (e.g., stem height, leaf area, petal diameter, etc).
Assume that each row of a data frame mydata is a different individual, and each column a different variable.
1. To resample individuals (i.e., rows), sample the row indices with replacement and use them to index the data frame
(see the sketch after this list). The data frame bootdata will contain a single bootstrap replicate including all the variables.
2. Calculate the statistic of interest on the resampled data and store the result in a vector created for this purpose. For
example, calculate the correlation between two variables x and y in bootdata.
3. Repeat steps (1) and (2) many times. The result will be a large collection of bootstrap replicate estimates for
subsequent analysis.
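A corresponding sketch for this case, assuming mydata is a data frame with numeric columns x and y (names are
illustrative):
B <- 10000
z <- numeric(B)
for (i in 1:B) {
  iboot <- sample(1:nrow(mydata), replace = TRUE)   # step 1: resample row indices with replacement
  bootdata <- mydata[iboot, ]                       # one bootstrap replicate of the whole data frame
  z[i] <- cor(bootdata$x, bootdata$y)               # step 2: statistic of interest on the resampled rows
}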
Once the replicate estimates are stored in a vector z, two useful summaries are (here estimate denotes the statistic
calculated on the original sample):
sd(z)               # the standard deviation of the replicates is the bootstrap standard error
mean(z) - estimate  # the mean of the replicates minus the original estimate approximates the bias
A large number of bootstrap replicate estimates is required for an accurate confidence interval.
library(boot)
Single variable
To use the boot package you will need to write a function to calculate the statistic of interest. The format is illustrated
below for the sample mean, but any univariate function would be handled similarly. We’ll call our function “boot.mean”.
When you have finished writing the script for a function you will need to cut and paste it into the command window so
that R can access it (you’ll need to do this just once in an R session). Here, x refers to the vector of data. i serves as a
counter, as in your own for loop, but it must be included as an argument in your function as shown.
The command boot() will automatically carry out all the resampling and computations required. For this example, x is the
vector of original data and boot.mean is the name of the function we created above to calculate the statistic of interest.
The argument R specifies the number of bootstrap replicate estimates desired.
The resulting object (here named z) is a boot object containing all the results. Use the following additional
commands to pull out the results.
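A sketch of how this might look for the sample mean (the function name boot.mean is ours, and R = 2000 is an arbitrary
choice):
boot.mean <- function(x, i) {       # x: the data vector; i: indices supplied by boot() on each replicate
  mean(x[i])
}
z <- boot(x, boot.mean, R = 2000)   # carry out the resampling and compute the statistic R times
print(z)                            # original estimate, estimated bias, and bootstrap standard error
hist(z$t)                           # histogram of the replicate estimates, stored in z$t
boot.ci(z, type = "bca")            # a bias-corrected and accelerated (BCa) confidence interval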
Then create a function to calculate the statistic of interest on the variables in mydata. For example, you can create a
function that calculates the correlation coefficient between the two variables x and y. Here, i refers to a vector of indices,
which must be included as an argument in the function and employed as shown. Finally, pass your data frame and
function to the boot command; both steps are sketched below.
See the previous section for a list of commands to pull results from the boot object (here named z).
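A sketch of both steps, using illustrative names (boot.cor, mydata, x, y):
boot.cor <- function(mydata, i) {        # i: row indices supplied by boot() on each replicate
  cor(mydata$x[i], mydata$y[i])
}
z <- boot(mydata, boot.cor, R = 2000)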
Permutation test
A permutation test uses resampling and the computer to generate a null distribution for a test statistic. The test statistic is a
measure of association between two variables or difference between groups, such as a slope, a correlation, or an odds
ratio. Each permutation step involves randomly resampling without replacement the values of one of the two variables in
the data and recalculating the test statistic in each permutation. The two variables may be categorical (character or factor),
numeric, or one of each.
Categorical data
R has a built-in permutation procedure for a contingency test of association when both of two variables are categorical
(call them A1 and A2). To apply it, execute the usual command for the χ2 contingency test, but set
the simulate.p.value option to TRUE. The number of replicates in the permutation is set by the option B (default is 2000).
Each permutation rearranges the values in the contingency table while keeping all the row and column totals fixed to their
observed values.
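A sketch, assuming A1 and A2 are factor (or character) vectors of equal length; B = 5000 is an arbitrary choice:
chisq.test(table(A1, A2), simulate.p.value = TRUE, B = 5000)   # Monte Carlo (permutation) P-value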
Numeric data
If one or both of the variables is numeric, then you will need to create a short loop to carry out the resampling necessary
for the permutation test. Choose one of the two variables to resample (call it x). It doesn’t matter which of the two
variables you choose. Keep the other variable (call it y) unchanged (there is no benefit to resampling both variables).
1. Resample x without replacement to create a new vector (call it x1).
2. Calculate the test statistic to measure association between y and the randomized variable x1, and store the result in a
vector created for this purpose; for example, calculate the correlation between the two variables (see the sketch after
this list).
3. Repeat steps (1) and (2) many times. The result will be a large collection of replicates representing the null
distribution of your test statistic.
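A minimal sketch using the correlation as the test statistic, assuming numeric vectors x and y and 10 000 permutations
(both choices are illustrative):
nperm <- 10000
permresult <- numeric(nperm)
for (i in 1:nperm) {
  x1 <- sample(x)                   # step 1: permute x (sampling without replacement)
  permresult[i] <- cor(y, x1)       # step 2: recompute the test statistic under the null
}
obs <- cor(y, x)                    # observed test statistic
mean(abs(permresult) >= abs(obs))   # two-sided permutation P-value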
Theories of inference:
frequentist and
Bayesian.
Frequentist: The frequentist approach is usually based on the concept of likelihood; given the model, what is the
probability of obtaining a sample similar to that observed? Parameter values are treated as fixed but unknown constants,
and estimates are chosen to maximize the likelihood.
Another type of methodology, broadly known as “Bayesian” uses Bayes’ theorem. The essential idea is that we might
have prior information (knowledge or belief) about the distribution of a parameter value before taking a sample of
observations. This prior information can be updated using the sample and the rules of probability.
Bayesian estimation:
The Bayesian methodology provides a way to update our prior information about the model parameters using sample
information.
The prior information is summarized in the form of a probability law called the prior distribution of the model
parameters. Interest usually centers on the posterior distribution of the parameters, which is proportional to the
product of the likelihood and the prior distribution.
A simple application of Bayes’ theorem is as follows. The incidence of HIV in adult Australian males (15–49 years)
who do not have any known risk factor may be of the order of 1 in 100 000, i.e., the prior probability of infection is
0.00001. A person in this group has an initial test (for example, it may be required in order to obtain a US green card)
that has a false positive rate of 0.01%, i.e., for every 10 000 people tested, there will on average be one false positive.
How should such an individual interpret the result? If 100 000 individuals take the test, one will on average have
AIDS and will almost certainly return a positive test. On the other hand there will, on average, be close to 10 false
positives (0.01% of 99 999).
Not infected                               Infected
99 999 × 0.0001 ≈ 10 (false) positives     1 true positive
The posterior odds that the person has AIDS are thus close to 1:10, certainly a narrowing from the prior odds of
1:99 999.
Note that, as often happens when Bayesian calculations are used, the prior information is not very precise. What we can
say is that the prior probability is, in the case mentioned, very low.
In the simplest Bayesian estimation problem, suppose the prior for a population mean is normal with mean µ0 and
variance σ0², and ȳ is the mean of a sample of n independent observations each with variance σ². The posterior
distribution of the mean is then normal, with
mean = (n·ȳ + µ0·σ²/σ0²) / (n + σ²/σ0²)
and
variance = σ² / (n + σ²/σ0²).
This assumes that σ2 is actually known; an estimate can be obtained using the sample variance. Alternatively, we could
put a prior distribution on this parameter as well.
If there is strong prior information, use it:
Any methodology that ignores strong prior information is inappropriate, and may be highly misleading. Diagnostic testing
(the AIDS test example mentioned above) and criminal investigations provide cogent examples.
Using the hypothesis testing framework, we take the null hypothesis H0, in the AIDS test example, to be the hypothesis
that the individual does not have HIV.
Given this null hypothesis, the probability of a positive result is 0.0001. Therefore the null hypothesis is rejected.
Scrutiny of 10 000 potential perpetrators will on average net 10 suspects. Suppose one of these is later charged. The
probability of such incriminating evidence, assuming that the defendant is innocent, is indeed 0.001.
The police screening will net around 10 innocent people along with, perhaps, the one perpetrator. The following
summarizes the expected result of the police search for a suspect.
It is optimistic in its assumption that the perpetrator will be among those netted.
In a scatter plot of the data with the fitted regression line, it can be seen that not all the data points fall exactly on the
line. Some of the points are above the fitted line and some are below it; overall, the residual errors (e) have approximately
mean zero.
The sum of the squares of the residual errors is called the Residual Sum of Squares, or RSS.
The average variation of points around the fitted regression line is called the Residual Standard Error (RSE). This is
one of the metrics used to evaluate the overall quality of the fitted regression model. The lower the RSE, the better.
Since the mean error term is zero, the outcome variable y can be approximately estimated as follows:
y ~ b0 + b1*x
Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as small as possible. This method
of determining the beta coefficients is technically called least squares regression or ordinary least squares (OLS)
regression.
Once the beta coefficients are calculated, a t-test is performed to check whether or not these coefficients are significantly
different from zero. A non-zero beta coefficient means that there is a significant relationship between the predictor (x)
and the outcome variable (y).
We’ll use the marketing data set for predicting sales units on the basis of the amount of money spent in the three
advertising media (youtube, facebook and newspaper).
We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating
the model). Make sure to set the seed for reproducibility.
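One possible sketch of the split, assuming the marketing data frame is available (for example from the datarium package);
caret::createDataPartition is a common alternative to the base-R sampling used here:
library(tidyverse)                          # provides the %>% pipe used in the code below
# data("marketing", package = "datarium")   # one possible source of the marketing data set
set.seed(123)                               # for reproducibility
train.index <- sample(1:nrow(marketing), size = 0.8 * nrow(marketing))
train.data  <- marketing[train.index, ]
test.data   <- marketing[-train.index, ]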
The simple linear regression is used to predict a continuous outcome variable (y) based on one single predictor variable
(x).
In the following example, we’ll build a simple linear model to predict sales units based on the advertising budget spent on
youtube. The regression equation can be written as sales = b0 + b1*youtube.
The R function lm() can be used to determine the beta coefficients of the linear model, as follows:
model <- lm(sales ~ youtube, data = train.data)
summary(model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.3839 0.62442 13.4 5.22e-28
## youtube 0.0468 0.00301 15.6 7.84e-34
The output above shows the estimate of the regression beta coefficients (column Estimate) and their significance levels
(column Pr(>|t|)). The intercept (b0) is 8.38 and the coefficient of the youtube variable is 0.046.
The estimated regression equation can be written as follows: sales = 8.38 + 0.046*youtube. Using this formula, for each
new youtube advertising budget, you can predict the number of sales units.
For example:
For a youtube advertising budget equal to zero, we can expect sales of 8.38 units.
For a youtube advertising budget equal to 1000, we can expect sales of 8.38 + 0.046*1000 = 55 units.
Predictions can be easily made using the R function predict(). In the following example, we predict sales units for two
youtube advertising budgets: 0 and 1000.
newdata <- data.frame(youtube = c(0, 1000))
model %>% predict(newdata)
## 1 2
## 8.38 55.19
Multiple linear regression:
Multiple linear regression is an extension of simple linear regression for predicting an outcome variable (y) on the basis
of multiple distinct predictor variables (x).
For example, with three predictor variables (x), the prediction of y is expressed by the following equation: y = b0 + b1*x1
+ b2*x2 + b3*x3
The regression beta coefficients measure the association between each predictor variable and the outcome. “b_j” can be
interpreted as the average effect on y of a one unit increase in “x_j”, holding all other predictors fixed.
In this section, we’ll build a multiple regression model to predict sales based on the budget invested in three advertising
media: youtube, facebook and newspaper. The formula is as follows: sales = b0 + b1*youtube + b2*facebook +
b3*newspaper
You can compute the multiple regression model coefficients in R as follows:
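A sketch of the call, following the formula above and using the train.data created earlier:
model <- lm(sales ~ youtube + facebook + newspaper, data = train.data)
summary(model)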
Interpretation
Before using a model for predictions, you need to assess the statistical significance of the model. This can be easily
checked by displaying the statistical summary of the model.
Model summary
Coefficients significance
To see which predictor variables are significant, you can examine the coefficients table, which shows the estimate of
regression beta coefficients and the associated t-statistic p-values.
Model accuracy
Once you have identified that at least one predictor variable is significantly associated with the outcome, you should
continue the diagnostic by checking how well the model fits the data. This process is also referred to as assessing the
goodness of fit.
The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model
summary:
1. Residual Standard Error (RSE),
2. R-squared (R2) and adjusted R2,
3. F-statistic, which has been already described in the previous section
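One way to collect these three quantities from the fitted lm object (a sketch; not necessarily how the table below was
produced):
s <- summary(model)
data.frame(rse = s$sigma,                      # residual standard error
           r.squared = s$r.squared,
           f.statistic = s$fstatistic["value"],
           p.value = pf(s$fstatistic["value"], s$fstatistic["numdf"], s$fstatistic["dendf"],
                        lower.tail = FALSE))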
## rse r.squared f.statistic p.value
## 1 2.11 0.89 644 5.64e-77
1. Residual standard error (RSE).
The RSE (or model sigma), corresponding to the prediction error, represents roughly the average difference between the
observed outcome values and the values predicted by the model. The lower the RSE, the better the model fits our data.
Dividing the RSE by the average value of the outcome variable gives the prediction error rate, which should be as
small as possible.
In our example, using only the youtube and facebook predictor variables, the RSE = 2.11, meaning that the observed sales
values deviate from the predicted values by approximately 2.11 units on average.
This corresponds to an error rate of 2.11/mean(train.data$sales) = 2.11/16.77 = 13%, which is low.
2. R-squared and Adjusted R-squared:
The R-squared (R2) ranges from 0 to 1 and represents the proportion of variation in the outcome variable that can be
explained by the model predictor variables.
For a simple linear regression, R2 is the square of the Pearson correlation coefficient between the outcome and the
predictor variable. In multiple linear regression, the R2 is the square of the correlation between the observed outcome
values and the values predicted by the model.
The R2 measures how well the model fits the data. The higher the R2, the better the model. However, a problem with the
R2 is that it will always increase when more variables are added to the model, even if those variables are only weakly
associated with the outcome (James et al. 2014). A solution is to adjust the R2 by taking into account the number of
predictor variables.
The adjustment in the “Adjusted R Square” value in the summary output is a correction for the number of x variables
included in the predictive model.
So, you should mainly consider the adjusted R-squared, which is a penalized R2 for a higher number of predictors.
An (adjusted) R2 that is close to 1 indicates that a large proportion of the variability in the outcome has been
explained by the regression model.
A number near 0 indicates that the regression model did not explain much of the variability in the outcome.
In our example, the adjusted R2 is 0.88, which is good.
3. F-Statistic:
Recall that the F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has
a non-zero coefficient.
In a simple linear regression, this test is not really interesting since it just duplicates the information given by the t-test,
available in the coefficient table.
The F-statistic becomes more important once we start using multiple predictors, as in multiple linear regression.
A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals
644, producing a p-value of 5.64e-77 (see the model summary above), which is highly significant.
Making predictions
We’ll make predictions using the test data in order to evaluate the performance of our regression model.
The procedure is as follows:
1. Predict the sales values based on new advertising budgets in the test data
2. Assess the model performance by computing:
o The prediction error RMSE (Root Mean Squared Error), representing the average difference between the
observed known outcome values in the test data and the predicted outcome values by the model. The lower the
RMSE, the better the model.
Department of CSE, AITS-Tirupati 82
Data Analytics(20APE0514)
o The R-square (R2), representing the correlation between the observed outcome values and the predicted
outcome values. The higher the R2, the better the model.
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
# (a) Compute the prediction error, RMSE (the RMSE() and R2() helpers below come from the caret package)
RMSE(predictions, test.data$sales)
## [1] 1.58
# (b) Compute R-square
R2(predictions, test.data$sales)
## [1] 0.938
From the output above, the R2 is 0.93, meaning that the observed and the predicted outcome values are highly correlated,
which is very good.
The prediction error RMSE is 1.58, representing an error rate of 1.58/mean(test.data$sales) = 1.58/17 = 9.2%, which is
good.
Discussion
This chapter describes the basics of linear regression and provides practical examples in R for computing simple and
multiple linear regression models. We also described how to assess the performance of the model for predictions.
Note that, linear regression assumes a linear relationship between the outcome and the predictor variables. This can be
easily checked by creating a scatter plot of the outcome variable vs the predictor variable.
For example, the following R code displays sales units versus youtube advertising budget. We’ll also add a smoothed line:
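A sketch using ggplot2 (assuming the marketing data frame used above):
library(ggplot2)
ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +       # scatter plot of sales against youtube budget
  stat_smooth()        # add a smoothed (loess) trend line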
The graph above shows a linearly increasing relationship between the sales and the youtube variables, which is a good
thing.
The Gauss Markov theorem tells us that if a certain set of assumptions are met, the
ordinary least squares estimate for regression coefficients gives you the Best Linear
Unbiased Estimate (BLUE) possible.
Linearity:
o The parameters we are estimating using the OLS method must be themselves
linear.
Random:
o Our data must have been randomly sampled from the population.
Non-Collinearity:
o The regressors being calculated aren’t perfectly correlated with each other.
Exogeneity:
o The regressors aren’t correlated with the error term.
Homoscedasticity:
o No matter what the values of our regressors might be, the variance of the error term is constant.
Checking how well our data matches these assumptions is an important part of estimating
regression coefficients.
When you know where these conditions are violated, you may be able to plan ways to
change your experiment setup to help your situation fit the ideal Gauss Markov situation
more closely.
In practice, the Gauss Markov assumptions are rarely all met perfectly, but they are still
useful as a benchmark, and because they show us what ‘ideal’ conditions would be.
They also allow us to pinpoint problem areas that might cause our estimated regression
coefficients to be inaccurate or even unusable.
Stated more formally, the coefficient vector generated by the ordinary least squares estimate is the best linear
unbiased estimate (BLUE) possible if:
o E(ε) = 0, i.e., the error term has expected value zero;
o the regressors are not perfectly collinear (the design matrix has full column rank);
o the regressors are uncorrelated with the error term (exogeneity);
o the variance of the error term is constant, Var(ε) = σ²I (homoscedasticity).
The first of these assumptions can be read as “The expected value of the error term is zero.” The second assumption is
non-collinearity, the third is exogeneity, and the fourth is homoscedasticity.
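These assumptions are usually checked informally with diagnostic plots and related tools; a sketch for a fitted lm object
named model (vif() comes from the car package):
par(mfrow = c(2, 2))
plot(model)       # residuals vs fitted, normal Q-Q, scale-location and leverage plots
par(mfrow = c(1, 1))
library(car)
vif(model)        # variance inflation factors; large values point to collinear regressors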
Regression Concepts
Regression
Each xi corresponds to the set of attributes of the ith observation (known as explanatory
variables) and yi corresponds to the target (or response) variable.
The explanatory attributes of a regression task can be either discrete or continuous.
Regression (Definition)
Regression is the task of learning a target function f that maps each attribute set x into a
continuous-valued output y.
The goal is to find a target function that can fit the input data with minimum error.
The error function for a regression task can be expressed in terms of the sum of absolute or squared error:
Absolute error = Σi |yi − f(xi)|,   Squared error = Σi (yi − f(xi))².
Suppose we wish to fit the following linear model to the observed data: f(x) = w1·x + w0,
where w0 and w1 are parameters of the model and are called the regression coefficients.
A standard approach for doing this is to apply the method of least squares, which attempts to find the parameters
(w0, w1) that minimize the sum of the squared error SSE = Σi (yi − w1·xi − w0)².
Setting the partial derivatives of the SSE with respect to w0 and w1 to zero gives the pair of equations
w0·N + w1·Σi xi = Σi yi
w0·Σi xi + w1·Σi xi² = Σi xi·yi.
These equations can be summarized by a matrix equation, also known as the normal equation, (X'X)w = X'y
(where ' denotes transpose).
The normal equations can be solved to obtain the following estimates for the parameters:
w1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²,   w0 = ȳ − w1·x̄.
Thus, the linear model that best fits the data in terms of minimizing the SSE is f(x) = w0 + w1·x with these estimated
values of w0 and w1.
We can show that the general solution to the normal equations can be expressed as w = (X'X)⁻¹X'y, where X is the
design matrix (a column of ones followed by the predictor values) and y is the vector of observed responses.
Thus, the linear model that results in the minimum squared error is given by ŷ = Xw = X(X'X)⁻¹X'y.
In summary, the least squares method is a systematic approach to fit a linear model to the response variable y by
minimizing the squared error between the true and estimated values of y.
Although the model is relatively simple, it seems to provide a reasonably accurate
approximation because a linear model is the first-order Taylor series approximation for
any function with continuous derivatives.
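A small illustration (with made-up data) of solving the normal equations directly and checking the result against lm():
set.seed(1)
x <- runif(30)
y <- 2 + 3 * x + rnorm(30, sd = 0.5)   # made-up data following a linear model
X <- cbind(1, x)                       # design matrix: intercept column plus predictor
w <- solve(t(X) %*% X, t(X) %*% y)     # solve the normal equations (X'X) w = X'y
w
coef(lm(y ~ x))                        # the same estimates from lm()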
Logistic Regression
Consider a procedure in which individuals are selected on the basis of their scores in a
battery of tests.
After five years the candidates are classified as "good" or "poor.”
We are interested in examining the ability of the tests to predict the job performance of
the candidates.
Here the response variable, performance, is dichotomous.
We can code "good" as 1 and "poor" as 0, for example.
The predictor variables are the scores in the tests.
In a study to determine the risk factors for cancer, health records of several people were
studied.
Data were collected on several variables, such as age, gender, smoking, diet, and the
family's medical history.
The response variable was whether the person had cancer (Y = 1) or did not have cancer (Y = 0).
The relationship between the probability π and X can often be represented by a logistic
response function.
It resembles an S-shaped curve.
The probability π initially increases slowly with increase in X, and then the increase
accelerates, finally stabilizes, but does not increase beyond 1.
Intuitively this makes sense.
Consider the probability of a questionnaire being returned as a function of cash reward,
or the probability of passing a test as a function of the time put in studying for it.
The shape of the S-curve can be reproduced if we model the probabilities as follows: π = exp(β0 + β1·x) / (1 + exp(β0 + β1·x)).
A sigmoid function is a bounded differentiable real function that is defined for all real
input values and has a positive derivative at each point.
Modeling the response probabilities by the logistic distribution and estimating the parameters of the model given above
constitutes fitting a logistic regression.
In logistic regression the fitting is carried out by working with the logits.
The logit transformation produces a model that is linear in the parameters: log(π / (1 − π)) = β0 + β1·x.
The method of estimation used is the maximum likelihood method.
The maximum likelihood estimates are obtained numerically, using an iterative
procedure.
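In R, such a model might be fitted with glm(); the data frame and variable names below are illustrative, not from the text:
fit <- glm(performance ~ score1 + score2,   # performance: 1 = "good", 0 = "poor"
           family = binomial, data = candidates)
summary(fit)                                # coefficients on the logit scale, with Wald tests
head(predict(fit, type = "response"))       # fitted probabilities of a "good" outcome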
OLS:
The ordinary least squares, or OLS, can also be called the linear least squares.
This is a method for approximately determining the unknown parameters located in a
linear regression model.
According to books of statistics and other online sources, the ordinary least squares is
obtained by minimizing the total of squared vertical distances between the observed
responses within the dataset and the responses predicted by the linear approximation.
Through a simple formula, you can express the resulting estimator, especially the single
regressor, located on the right-hand side of the linear regression model.
For example, suppose you have a system of several equations with unknown parameters.
You may use the ordinary least squares method because this is the most standard approach for finding an approximate
solution to such overdetermined systems.
In other words, it is your overall solution for minimizing the sum of the squares of the errors in your equations.
Data fitting can be your most suited application. Online sources have stated that the data
that best fits the ordinary least squares minimizes the sum of squared residuals.
“Residual” is “the difference between an observed value and the fitted value provided by
a model.”
Maximum likelihood estimation (MLE) is a method used for estimating the parameters of a statistical model, and for
fitting a statistical model to data.
If you want to find the height measurement of every basketball player in a specific
location, you can use the maximum likelihood estimation.
Normally, you would encounter problems such as cost and time constraints.
If you could not afford to measure all of the basketball players’ heights, the maximum
likelihood estimation would be very handy.
Using the maximum likelihood estimation, you can estimate the mean and variance of the
height of your subjects.
The MLE would set the mean and variance as parameters in determining the specific
parametric values in a given model.
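A sketch of this idea, maximizing the normal log-likelihood numerically with optim(); the heights vector and the starting
values are illustrative (for the normal model the closed-form MLEs are simply the sample mean and the variance with
divisor n):
heights <- rnorm(100, mean = 180, sd = 8)              # illustrative sample of heights (cm)
negloglik <- function(par, x) {
  mu <- par[1]; sigma <- exp(par[2])                   # log-parameterization keeps sigma positive
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))    # negative log-likelihood to minimize
}
fit <- optim(par = c(170, log(5)), fn = negloglik, x = heights)
c(mu = fit$par[1], sigma = exp(fit$par[2]))            # numerical MLEs
c(mean(heights), sqrt(mean((heights - mean(heights))^2)))  # closed-form MLEs for comparison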
When the response has k categories, a separate logit can be modeled for each category against a baseline category,
log(πj/πk) = β0j + β1j·x1 + ···, for j = 1, 2, ···, (k − 1). The model parameters are estimated by the method of maximum
likelihood. Statistical software is available to do this fitting.
We use linear or logistic regression techniques for developing accurate models for
predicting an outcome of interest.
Often, we create separate models for separate segments.
Segmentation methods such as CHAID or CRT are used to judge their effectiveness.
Creating separate model for separate segments may be time consuming and not worth the
effort.
But, creating separate model for separate segments may provide higher predictive power.
Market Segmentation
Dividing the target market or customers on the basis of some significant features which
could help a company sell more products in less marketing expenses.
Companies have limited marketing budgets. Yet, the marketing team is expected to
make a large number of sales to ensure rising revenue and profits.
A product is created in two ways:
Create a product after analyzing (research) the needs and wants of target market –
For example: Computer. Companies like Dell, IBM, Microsoft entered this
market after analyzing the enormous market which this product upholds.
Create a product which evokes the needs & wants in target market – For example:
iPhone.
Once the product is created, the ball shifts to the marketing team’s court.
As mentioned above, they make use of market segmentation techniques.
This ensures the product is positioned to the right segment of customers with high
propensity to buy.
Logistic regression uses 1 or 0 indicator in the historical campaign data, which indicates
whether the customer has responded to the offer or not.
Usually, one uses the target (or ‘Y’ known as dependent variable) that has been identified
for model development to undertake an objective segmentation.
Remember, a separate model will be built for each segment.
A segmentation scheme which provides the maximum difference between the segments
with regards to the objective is usually selected.
Below is a simple example of this approach.
Objective Segmentation
Segmentation to identify the type of customers who would respond to a particular offer.
Segmentation to identify high spenders among customers who will use the e-commerce
channel for festive shopping.
Segmentation to identify customers who will default on their credit obligation for a loan
or credit card.
Non-Objective Segmentation
Segmentation of the customer base to understand the specific profiles which exist within
the customer base so that multiple marketing actions can be personalized for each
segment
Segmentation of geographies on the basis of affluence and lifestyle of people living in
each geography so that sales and distribution strategies can be formulated accordingly.
Segmentation of web site visitors on the basis of browsing behavior to understand the
level of engagement and affinity towards the brand.
Hence, it is critical that the segments created on the basis of an objective segmentation
methodology must be different with respect to the stated objective (e.g. response to an
offer).
However, in case of a non-objective methodology, the segments are different with respect
to the “generic profile” of observations belonging to each segment, but not with regards
to any specific outcome of interest.
The most common techniques for building non-objective segmentation are cluster
analysis, K nearest neighbor techniques etc.
Each of these techniques uses a distance measure (e.g., Euclidean distance, Manhattan
distance, Mahalanobis distance, etc.).
This is done to maximize the distance between the segments.
This implies maximum difference between the segments with regards to a combination of
all the variables (or factors).
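A minimal sketch of such a non-objective segmentation using k-means on standardized customer variables; custdata,
the variable names, and k = 4 are all illustrative assumptions:
vars <- scale(custdata[, c("income", "spend", "visits")])  # standardize so no variable dominates the distance
set.seed(42)
km <- kmeans(vars, centers = 4, nstart = 25)               # k-means clustering into 4 segments
table(km$cluster)                                          # segment sizes
aggregate(custdata[, c("income", "spend", "visits")],
          by = list(segment = km$cluster), FUN = mean)     # profile each segment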
Tree Building
goal
o to create a model that predicts the value of a target variable based on several input
variables.
Classification tree analysis is when the predicted outcome is the class to which the
data belongs.
Regression tree analysis is when the predicted outcome can be considered a real
number. (e.g. the price of a house, or a patient’s length of stay in a hospital).
A decision tree
o is a flow-chart-like structure
o each internal (non-leaf) node denotes a test on an attribute
o each branch represents the outcome of a test,
o each leaf (or terminal) node holds a class label.
o The topmost node in a tree is the root node.
Decision-tree algorithms:
o ID3 (Iterative Dichotomiser 3)
o C4.5 (successor of ID3)
o CART (Classification and Regression Tree)
o CHAID (CHI-squared Automatic Interaction Detector). Performs multi-level
splits when computing classification trees.
o MARS: extends decision trees to handle numerical data better.
o Conditional Inference Trees: a statistics-based approach that uses non-parametric tests as splitting criteria,
corrected for multiple testing to avoid overfitting.
This approach results in unbiased predictor selection and does not require pruning.
ID3 and CART follow a similar approach for learning decision trees from training tuples.
Splits are found that maximize the homogeneity of child nodes with respect to the value
of the dependent variable.
Impurity Measure:
GINI Index Used by the CART (classification and regression tree) algorithm, Gini
impurity is a measure of how often a randomly chosen element from the set would be
incorrectly labeled if it were randomly labeled according to the distribution of labels in
the subset.
Gini impurity can be computed by summing the probability fi of each item being chosen
times the probability 1-fi of a mistake in categorizing that item.
It reaches its minimum (zero) when all cases in the node fall into a single target category.
To compute Gini impurity for a set of items, suppose i ∈ {1, 2, ..., m}, and let fi be the fraction of items labeled with
value i in the set; then I_G(f) = Σi fi·(1 − fi) = 1 − Σi fi².
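As an illustration, a CART-style classification tree (rpart uses the Gini index by default for classification splits) can be
grown and pruned as sketched below; the built-in iris data stand in for a real problem:
library(rpart)
tree <- rpart(Species ~ ., data = iris, method = "class")  # classification tree with Gini-based splits
printcp(tree)                      # complexity-parameter table, useful for deciding how far to prune
pruned <- prune(tree, cp = 0.05)   # prune back the tree at a chosen complexity parameter
pruned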
Pruning
After building the decision tree, a tree-pruning step can be performed to reduce the size
of the decision tree.
Pruning helps by trimming the branches of the initial tree in a way that improves the
generalization capability of the decision tree.
The errors committed by a classification model are generally divided into two types:
o training errors
o generalization errors.
Training error
o also known as resubstitution error or apparent error.
o it is the number of misclassification errors committed on training records.
generalization error
o is the expected error of the model on previously unseen records.
o A good classification model must not only fit the training data well, it must also
accurately classify records it has never seen before.
A good model must have low training error as well as low generalization error.
Model overfitting
o Decision trees that are too large are susceptible to a phenomenon known as
overfitting.
o A model that fits the training data too well can have a poorer generalization error
than a model with a higher training error.
o Such a situation is known as model overfitting.
Model underfitting
o The training and test error rates of the model are large when the size of the tree is
very small.
o This situation is known as model underfitting.
o Underfitting occurs because the model has yet to learn the true structure of the
data.
o Model complexity
o To understand the overfitting phenomenon, note that the training error of a model can
always be reduced by increasing the model complexity.
o Overfitting and underfitting are two pathologies that are related to the model
complexity.
Applications
o ARIMA models are important for generating forecasts and providing
understanding in all kinds of time series problems from economics to health care
applications.
o In quality and reliability, they are important in process monitoring if observations
are correlated.
o designing schemes for process adjustment
o monitoring a reliability system over time
o forecasting time series
o estimating missing values
o finding outliers and atypical events
o understanding the effects of changes in a system
Forecast Accuracy can be defined as the deviation of Forecast or Prediction from the
actual results.
Error = Actual demand – Forecast
OR
ei = At – Ft
We measure forecast accuracy by two methods:
Mean Forecast Error (MFE)
o For n time periods where we have actual demand and forecast values: MFE = (1/n) Σt (At − Ft)
o Ideal value = 0;
o MFE > 0: the model tends to under-forecast
o MFE < 0: the model tends to over-forecast
While MFE is a measure of forecast model bias, MAD (Mean Absolute Deviation) indicates the absolute size of the
errors: MAD = (1/n) Σt |At − Ft|.
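A small sketch computing both measures from vectors of actual and forecast values (the numbers are illustrative):
actual   <- c(100, 120, 110, 130, 125)
forecast <- c(105, 115, 112, 128, 120)
e <- actual - forecast     # forecast errors e_t = A_t - F_t
mean(e)                    # MFE: positive means the model under-forecasts on average
mean(abs(e))               # MAD: average absolute size of the errors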
ETL Approach
Extract, Transform and Load (ETL) refers to a process in database usage and especially
in data warehousing that:
o Extracts data from homogeneous or heterogeneous data sources
o Transforms the data for storing it in proper format or structure for querying and
analysis purpose
o Loads it into the final target (database, more specifically, operational data store,
data mart, or data warehouse)
Usually all three phases execute in parallel. Since data extraction takes time, the transformation process works on the
data already received and prepares it for loading; as soon as some data is ready to be loaded into the target, the loading
kicks off without waiting for the completion of the previous phases.
ETL systems commonly integrate data from multiple applications (systems), typically
developed and supported by different vendors or hosted on separate computer hardware.
The disparate systems containing the original data are frequently managed and operated
by different employees.
For example, a cost accounting system may combine data from payroll, sales, and
purchasing.
o Extract
The Extract step covers the data extraction from the source system and
makes it accessible for further processing.
The main objective of the extract step is to retrieve all the required data
from the source system with as little resources as possible.
The extract step should be designed in a way that it does not negatively
affect the source system in terms of performance, response time or any
kind of locking.
o Transform
The transform step applies a set of rules to transform the data from the
source to the target.
This includes converting any measured data to the same dimension (i.e.
conformed dimension) using the same units so that they can later be
joined.
The transformation step also requires joining data from several sources,
generating aggregates, generating surrogate keys, sorting, deriving new
calculated values, and applying advanced validation rules.
o Load
During the load step, it is necessary to ensure that the load is performed
correctly and with as little resources as possible.
The target of the Load process is often a database.
In order to make the load process efficient, it is helpful to disable any
constraints and indexes before the load and enable them back only after
the load completes.
The referential integrity needs to be maintained by ETL tool to ensure
consistency.
Staging
o It should be possible to restart, at least, some of the phases independently from the
others.
o For example, if the transformation step fails, it should not be necessary to restart
the Extract step.
o We can ensure this by implementing proper staging. Staging means that the data
is simply dumped to the location (called the Staging Area) so that it can then be
read by the next processing phase.
o The staging area is also used during ETL process to store intermediate results of
processing.
o This is acceptable for the ETL process, which uses the staging area for this purpose.
o However, the staging area should be accessed by the load ETL process only.
o It should never be available to anyone else; particularly not to end users as it is
not intended for data presentation to the end-user.
o It may contain incomplete data or data that is in the middle of processing.
UNIT-V
1. RETAIL ANALYTICS:
To understand the role analytics plays in retail, it is useful to break down the business
decisions taken in retail into the following categories:
consumer,
product,
workforce, and
advertising.
2. Product: Retail product decisions can be broken down into single product and group of
product decisions. Single or individual product decisions are mostly inventory decisions: how
much stock of the product to order, and when to place the order. At the group level, the
decisions are typically related to pricing and assortment planning. That is, what price to set
for each product in the group and how to place the products on the store-shelves, keeping in
mind the variety of products, the number of each type of product, and location. To make
these decisions, predictive modelling is called for to forecast the product demand and the
price-response function; essentially, the decision-maker needs to understand how
customers react to price changes. A fine understanding of consumer choice is also needed to
understand how a customer chooses to buy a certain product from a group of products.
3. Human resources: The key decisions here are related to the number of employees needed
in the store at various times of the day and how to schedule them. To make these decisions,
the overall work to be completed by the employees needs to be estimated. Part of this is a
function of other decisions, such as the effort involved in stocking shelves, taking deliveries,
changing prices, etc. There is additional work that comes in as a function of the customer
volume in the store. This includes answering customer questions and manning checkout
counters.
4. Advertising: In the advertising sphere, companies deal with the typical decisions of
finding the best medium to advertise on (online media such as Google AdWords,
Facebook, Twitter, and/or traditional media such as print and newspaper inserts) and the
best products to advertise. This may entail cultivating some “loss-leaders” that are priced low
to entice customers into the store, so they may also purchase other items which have a greater
margin.
• Analytics has revealed that a great number of customer visits to online stores fail to convert
at the last minute, when the customer has the item in their shopping basket but does not go on
to confirm the purchase. Theorizing that this was because customers often cannot find their
credit or debit cards to confirm the details, Swedish e-commerce platform Klarna moved its
clients (such as Vistaprint, Spotify, and 45,000 online stores) onto an invoicing model, where
customers can pay after the product is delivered. Sophisticated fraud prevention analytics are
used to make sure that the system cannot be manipulated by those with devious intent.
•Trend forecasting algorithms comb social media posts and Web browsing habits to elicit
what products may be causing a buzz, and ad-buying data is analyzed to see what marketing
departments will be pushing. Brands and marketers engage in “sentiment analysis,” using
sophisticated machine learning-based algorithms to determine the context when a product is
discussed. This data can be used to accurately predict what the top selling products in a
category are likely to be.
•Amazon has proposed using predictive shipping analytics to ship products to customers
before they even click “add to cart.” According to a recent trend report by DHL, over the next
5 years, this so-called psychic supply chain will have far reaching effects in nearly all
industries, from automotive to consumer goods. It uses big data and advanced predictive
algorithms to enhance planning and decision-making.
There are various complications that arise in retail scenarios that need to be overcome for the
successful use of retail analytics; these complications can be classified into several broad categories.
Some of the most common issues that affect predictive modelling are demand censoring and
inventory inaccuracies (DeHoratius and Raman 2008). Typically, retail firms only have
access to sales information, not demand information, and therefore need to account for the
fact that when inventory runs out, actual demand is not observed.
Methodologies:
Typical forecasting methods consider the univariate time series of sales data and use time-
series-based methods such as exponential smoothing and ARIMA models; These methods
typically focus on forecasting sales and may require uncensoring to be used for decision-
making. Recently, there have been advances that utilize statistical and machine learning
approaches to deal with greater amounts of data.
This directly motivates modeling customer preferences over all the products carried by the
retailer. One of the workhorse models for such consumer choice modeling is the multinomial
logit (MNL).
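As a sketch of the time-series methods mentioned above, exponential smoothing and ARIMA models can be fitted with
the forecast package; sales here is an assumed numeric vector of monthly sales:
library(forecast)
sales.ts <- ts(sales, frequency = 12)   # monthly sales series (assumed data)
fit.ets   <- ets(sales.ts)              # exponential smoothing state-space model
fit.arima <- auto.arima(sales.ts)       # automatically selected ARIMA model
forecast(fit.arima, h = 6)              # point forecasts and intervals for the next 6 periods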
Omni-Channel Retail:
The tremendous success of e-commerce has led many retailers to augment their brick-and-
mortar stores with an online presence, leading to the advent of multichannel retail. In this
approach the retailer has access to multiple channels to engage with and sell to customers.
A good example is the “buy online, pick up in store” (BOPS) model, which has become quite commonplace.
This seamless approach inarguably improves the customer experience and overall sales.
Retail Startups:
In terms of data collection, there are many startups that cater to the range of retailers both
small and large. One illustrative example is Euclid Analytics, which uses in-store
Wi-Fi to collect information on customers via their smartphones.
2. Marketing Analytics:
Marketing analytics can help firms realize the true potential of data and explore meaningful
insights. Marketing analytics can be defined as a “high technology enabled and marketing
science model-supported approach to harness the true values of the customer, market, and
firm level data to enhance the effect of marketing strategies.” Basically, marketing analytics
is the creation and use of data to measure and optimize marketing decisions. Marketing
analytics comprises a range of tools and processes.
The processes and tools discussed in this chapter will help in various aspects of marketing
such as target marketing and segmentation, price and promotion, customer valuation,
resource allocation, response analysis, demand assessment, and new product development.
These can be applied at the following levels:
• Firm: At this level, tools are applied to the firm as a whole. Instead of focusing on a
particular product or brand, these can be used to decide and evaluate firm strategies. For
example, data envelopment analysis (DEA) can be used for all the units (i.e., finance,
marketing, HR, operation, etc.) within a firm to find the most efficient units and allocate
resources accordingly.
• Brand/product: At the brand/product level, tools are applied to decide and evaluate
strategies for a particular brand/product. For example, conjoint analysis can be conducted to
find the product features preferred by customers or response analysis can be conducted to
find how a particular brand advertisement will be received by the market.
• Customer: Tools applied at customer level provide insights that help in segmenting and
targeting customers. For example, customer lifetime value is a forward-looking customer
metric that helps assess the value provided by customers to the firm
let us look at what constitutes marketing analytics. Though it is an ever-expanding field, for
our purpose, we can segment marketing analytics into the following processes and tools:
1. Multivariate statistical analysis: It deals with the analysis of more than one outcome
variable. Cluster analysis, factor analysis, perceptual maps, conjoint analysis, discriminant
analysis, and MANOVA are a part of multivariate statistical analysis. These can help in target
marketing and segmentation, optimizing product features, etc., among other applications.
2. Choice analytics: Choice modeling provides insights on how customers make decisions.
Understanding customer decision-making process is critical as it can help to design and
optimize various marketing mix strategies such as pricing and advertising. Logistic,
Probit, and Tobit models are largely used for this purpose.
4. Time-series analytics: Models stated till now mainly deal with cross-sectional data
(however, choice and regression models can be used for panel data as well). This section
consists of auto-regressive models and vector auto-regressive models for time-series analysis.
These can be used for forecasting sales, market share, etc.
5. Nonparametric tools: Non parametric tools are used when the data belongs to no particular
distribution. Data envelopment analysis (DEA) and stochastic frontier analysis (SFA) are
discussed in this section and can be used for benchmarking, resource allocation, and
assessing efficiency.
6. Survival analysis: Survival analysis is used to determine the duration of time until an event
such as purchase, attrition, and conversion happens. Baseline hazard model, proportional
hazard model, and analysis with time varying covariates are covered in this section.
7. Sales force /sales analytics: This section covers analytics for sales, which includes
forecasting potential sales, forecasting market share, and causal analysis. It comprises various
methods such as chain ratio method, Delphi method, and product life cycle analysis.
8. Innovation analytics: Innovation analytics deals specifically with new products. New
product analysis differs from existing product analysis as you may have little or no historical
data either for product design or sales forecasting. The Bass model, the ASSESSOR model, and conjoint
analysis can be used for innovation analytics.
9. Conjoint analysis: This section covers one of the most widely used quantitative methods in
marketing research. Conjoint (trade-off) analysis is a statistical technique to measure
customer preferences for various attributes of a product or service. This is used in various
stages of a new product design, segmenting customers and pricing.
10. Customer analytics: In this section, we probe customer metrics such as customer lifetime
value, customer referral value, and RFM (recency, frequency, and monetary value) analysis.
These can be used for segmenting customers and determining value provided by customers.
3.Financial Analytics:
One can paraphrase Rudyard Kipling’s poem The Ballad of East and West and say, “Oh, the
Q-world is the Q-world, the P-world is the P-world, and never the twain shall meet.”
Q-Quants:
In the Q-world, the objective is primarily to determine a fair price for a financial instrument,
especially a derivative security, in terms of its underlying securities. The price of these
underlying securities is determined by the market forces of demand and supply. The demand
and supply forces come from a variety of sources in the financial markets, but they primarily
originate from buy-side and sell-side financial institutions.
The Q-quants typically have deep knowledge about a specific product. So a Q-quant who, for
instance, trades credit derivatives for a living would have abundant knowledge about credit
derivative products, but her know-how may not be very useful in, say, a domain like foreign
exchange.
P-Quants:
We now discuss the P-world and their origins, tools, and techniques and contrast them with
the Q-world. The P-world started with the mean–variance framework by Markowitz in 1952
(Markowitz 1952). Harry Markowitz showed that the conventional investment evaluation
criteria of net present value (NPV) needs to be explicitly segregated in terms of risk and
return. He defined risk as standard deviation of return distribution.
In reality, the probability distribution needs to be estimated from the available financial
information. So a very large component of this so-called information set, that is, the prices and other
financial variables, is observed at discrete time intervals, forming a time series.
Fig. 20.1 Methodology of the three-stage framework for data analysis in finance (the stages comprise steps such as
identification, inference, projection, pricing, assessment, attribution, and execution).
The three-stage methodology starts with variable identification for different asset classes
such as equities, foreign exchange, fixed income, credit, and commodities.
Stage I: Asset Price Estimation
The objective of the first stage is to estimate the price behavior of an asset. It starts with identification of the financial
variable to model.
Step 1: Identification :
The first step of modeling in the P-world is to identify the appropriate variable which is
different for distinct asset classes. The basic idea is to find a process for the financial variable
where the residuals are essentially i.i.d. The most common process used for modeling a
financial variable x is the random walk: xt = xt−1 + εt
Step 2: I.I.D.
Once the financial variable that is of interest is identified, the next step in data preparation is
to obtain a time series of the variables that are of interest. These variables should display a
homogeneous behavior across time. For instance, in equities, currencies, or commodities, it is
the log of the stock/currency/commodity price. For fixed income instruments, the variable of
interest may not be the price or the log of price. The variable of interest would be the yield to
maturity of the fixed income security.
Step 3: Inference
The third step in estimation after the financial variable is identified and after we have gotten
to the point of i.i.d. shocks is to infer the joint behavior of i.i.d. shocks. In the estimation
process, we typically determine those parameters in the model which gets us to an i.i.d.
distribution.
We work with daily returns rather than the absolute index levels of the S&P 500, for the reasons mentioned in Step 1.
From the daily index levels, the 1-day returns are calculated as follows: rt = log(pt) − log(pt−1). This return rt
itself is not distributed in an i.i.d. sense: the daily returns are neither identical in distribution nor independent, as
we can infer from a quick look at a graph of the returns data.
The way we model the variance σt² is: σt² = ω + α·rt−1² + β·σt−1². We have to estimate the parameters ω, α, β.
The estimation technique we use is maximum likelihood estimation (MLE).
Using the Gaussian distribution, the likelihood of rt given σt is: (1/√(2πσt²)) · e^(−½ (rt/σt)²).
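A sketch of this estimation in R, assuming a vector of daily index levels named price; tseries::garch fits the GARCH(1,1)
parameters by quasi-maximum likelihood:
library(tseries)
r <- diff(log(price))              # 1-day log returns, rt = log(pt) - log(pt-1)
fit <- garch(r, order = c(1, 1))   # GARCH(1,1): estimates omega, alpha and beta
summary(fit)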
Step 4: Projection
The fourth step is projection. We explain this step using a simple example from foreign
exchange markets. Let us say that the financial variable is estimated using a technique such as
MLE, GMM, or Bayesian estimation (Hansen 1982). The next step is to project the variable
using the model. Say the horizon is 1 year, and we want to calculate the expected profit or
loss of a certain portfolio.
We first pick a random number from the standard normal distribution say x. We then scale
(multiply) x by standard deviation and add average return to get a random variable mapped to
the exact normal distribution of returns.
Note that the average return and standard deviation are adjusted for daily horizon by dividing
with 365 and square root of 365, respectively
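A one-line sketch of a single simulated 1-day return under these assumptions (mu and sigma stand for the estimated
annualized mean return and volatility; the values are illustrative):
mu <- 0.05; sigma <- 0.08                     # illustrative annualized mean return and volatility
x <- rnorm(1)                                 # random draw from the standard normal
r.daily <- mu / 365 + x * sigma / sqrt(365)   # scale to a daily horizon, as described above
r.daily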
Step 5: Pricing
The fifth step is pricing which logically follows from projection. The example that we used in
Step 4 was projection of USD/INR for a horizon of 1 year. What pricing allows us to do is
arrive at the ex-ante expected profit or loss of a specific instrument based on the projections made.
Stage II: Risk Management The second stage of data analytics in finance concerns risk
management. It involves analysis for risk aggregation, risk assessment, and risk attribution.
The framework can be used for risk analysis of a portfolio or even for an entire financial
institution.
Step 1: Aggregation The first of the three steps in risk management is risk aggregation. The
aggregation step is crucial because all financial institutions need to know the value of the
portfolio of their assets and also the aggregated risk exposures in their balance sheet.
We model this using a one-factor model. The single factor is assumed to be the state of the
economy M, which is assumed to have a Gaussian distribution. To generate a one-factor
model, we define random variables xi (1 ≤ i ≤ N): xi = ρi·M + √(1 − ρi²)·Zi
Step 2: Assessment
We now move on to the second step of risk management which is assessment of the portfolio.
Assessment of the portfolio is done by summarizing it according to a suitable statistical
feature. More precisely, assessment is done by calculating the ex-ante risk of the portfolio
using metrics such as threshold persistence (TP) or value at risk (VaR) and sometimes
sensitizing it using methods like stress-testing.
Step 3: Attribution
The third step in risk management analysis is attribution. Once we have assessed the risk of
the portfolio in the previous step, we need to now attribute the risk to different risk factors.
4. Social Media and Web Analytics:
Social media has created new opportunities for both consumers and companies. It has become
one of the major drivers of consumer revolution. Companies can analyze data available from
the web and social media to get valuable insights into what consumers want. Social media
and web analytics can help companies measure the impact of their advertising and the effect
of mode of message delivery on the consumers. Companies can also turn to social media
analytics to learn more about their consumers.
Social media analytics involves gathering information from social networking sites such as
Facebook, LinkedIn and Twitter in order to provide businesses with better understanding of
customers. It helps in understanding customer sentiment, creating
customer profiles and evolving appropriate strategies for reaching the right customer at the
right time.
It involves four basic activities, namely, listening (aggregating what is being said on social
media),
analyzing (such as identifying trends, shifts in customer preferences and customer
sentiments), understanding (gathering insights into customers, their interests and preferences
and sentiments) and strategizing (creating appropriate strategies and campaigns to connect
with customers with a view to encourage sharing and commenting as well as improving
referrals).
One of the major advantages of social media analytics is that it enables businesses to identify
and encourage different activities that drive revenues and profits and make real-time
adjustments in the strategies and campaigns. Social media analytics can help businesses in
targeting advertisements more effectively and thereby reduce the advertising cost while
improving ROI
Many companies have started leveraging the power of social media. A particular airline keeps
the customers informed through social media about the delays and the causes for such delays.
In the process, the airline is able to proactively communicate the causes for the delays and
thereby minimize the negative sentiment arising out of the delays. In addition, the airline
company is also able to save much of the time of its call centre employees, because many
customers already knew about the delays as well as the reasons associated with the delays
and hence do not make calls to the call centre.
Search engine optimization (SEO) is another technique to acquire customers when they are
looking for a specific product or service or even an organization. For example, when a
customer initiates a search for a product, say a smartphone, there is a possibility of getting
overloaded and overwhelmed with the search results. These results contain both “paid
listings” and “organic listings”.
The Internet provides new scope for creative approaches to advertising. Advertising on the Internet is also called online advertising, and it encompasses display advertisements found on various websites and on the results pages of search queries, as well as advertisements placed in emails and on social networks.
There are different types of display advertisements. The most popular one is the banner advertisement. This is usually a graphic image, with or without animation, displayed on a web page. These advertisements are usually GIF or JPEG images if they are static, but use Flash, JavaScript or video if animation is involved.
There are many options for getting advertisements displayed online. Some of these are discussed below. One of the most popular options is placing advertisements on social media such as Facebook, Twitter and LinkedIn. In general, Facebook offers standard advertisement space on the right-hand sidebar. These advertisements can be targeted on the basis of demographic information as well as hobbies and interests, which makes it easier to reach the right audience.
Programmatic advertising is “the automation of the buying and selling of desktop display, video, FBX, and mobile ads using real-time bidding. Programmatic describes how online campaigns are booked, flighted, analyzed and optimized via demand-side software (DSP) interfaces and algorithms”.
(a) Supply-Side Platform (SSP) The SSP helps the publishers to better manage and optimize
their online advertising space and advertising inventory. SSP constantly interacts with the ad
exchange and the demand-side platform (DSP). Admeld (www.admeld.com) and Rubicon
(https://rubiconproject.com/) are two examples of SSPs.
(b) Demand-Side Platform (DSP) The DSP enables the advertisers to set and apply various
parameters and automate the buying of the displays. It also enables them to monitor the
performance of their campaigns. Turn (now Amobee, https://www.amobee.com/), AppNexus
(https://www.appnexus.com/) and Rocket Fuel (https://rocketfuel.com/) are some of the
DSPs.
(d) Publisher Publishers are those who provide the display ad inventory.
(e) Advertiser The advertiser bids for the inventory in real time depending on the relevance of
the inventory.
The figure in the text depicts the following flow for serving a programmatic display ad:
Consumer “C” clicks on a URL and the content begins to load.
The publisher checks whether an ad is available; if not, it contacts the ad exchange.
The ad exchange broadcasts the availability to multiple DSPs.
The ad exchange sends the winning ad and bid to the publisher.
The publisher sends the ad to the browser, and the browser displays the winning ad.
The browser informs the DSP that the ad was displayed and viewed.
The entire process described above takes less than half a second. In other words, the
entire process is completed and the display ad is shown while the browser is loading the
requested page on the consumer’s screen.
This process of matching the right display to the right consumer is completely data driven. Data about the context, who is to see the display, the profile of the consumer, who is a good target, and so on, are part of the process. In order to complete the process within such a short time span, it is necessary to build all the rules into the system in advance.
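To make the bidding step concrete, the toy sketch below simulates an exchange collecting CPM bids from three hypothetical DSPs and selecting a winner; the second-price rule used here is only one of several pricing schemes real exchanges use.
# Toy sketch of the real-time bidding step: the ad exchange collects bids from
# DSPs and picks a winner. Bids and DSP names are made up; real exchanges apply
# more complex rules (price floors, private deals, first- or second-price auctions).
set.seed(3)
dsp_bids <- c(DSP_A = runif(1, 0.5, 3),
              DSP_B = runif(1, 0.5, 3),
              DSP_C = runif(1, 0.5, 3))               # CPM bids in dollars
winner <- names(which.max(dsp_bids))
clearing_price <- sort(dsp_bids, decreasing = TRUE)[2]   # second-highest bid
cat("Winning DSP:", winner, "- pays a CPM of", round(clearing_price, 2), "\n")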
5. Healthcare Analytics
Health care analytics is a subset of data analytics that uses both historic and current data to
produce actionable insights, improve decision making, and optimize outcomes within the health
care industry. Health care analytics is not only used to benefit health care organizations but also
to improve the patient experience and health outcomes.
The health care industry is awash with valuable data in the form of detailed records. Industry
regulations stipulate that health care providers must retain many of these records for a set
period of time.
This means that health care has become a site of interest for those working with “big data,” or large pools of unstructured data. As a still-developing field, big data analytics in health care offers the potential to reduce operating costs, improve efficiency, and improve how patients are treated.
Predictive analytics is the use of historical data to identify past trends and project associated
future outcomes. In the health care industry, predictive analytics has many impactful uses, such
as identifying a patient’s risk for developing a health condition, streamlining treatment courses,
and reducing a hospital’s number of 30-day readmissions (which can result in costly fines for
the hospital).
A 2021 study conducted by a University of Michigan research team illustrates the positive
impact that predictive analytics can have on patient treatment. During the study, researchers
devised a sensitive blood test that predicted how well patients with HPV-positive throat cancer
would respond to specific treatment courses. Overall, the researchers found that their method
could predict treatment effectiveness many months earlier than traditional scans [1].
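As an illustration of the kind of model behind such predictions (not the study's actual method), the sketch below fits a logistic regression to simulated patient data to estimate 30-day readmission risk; all data and variable names are assumptions.
# Minimal sketch: predicting 30-day readmission risk with logistic regression.
# All patient data are simulated; variable names are illustrative only.
set.seed(4)
n <- 500
age            <- round(runif(n, 30, 90))
prior_visits   <- rpois(n, 2)
length_of_stay <- round(runif(n, 1, 14))
readmitted <- rbinom(n, 1, plogis(-6 + 0.04 * age + 0.3 * prior_visits + 0.1 * length_of_stay))
model <- glm(readmitted ~ age + prior_visits + length_of_stay, family = binomial)
summary(model)$coefficients
# Predicted readmission probability for a hypothetical new patient
new_patient <- data.frame(age = 75, prior_visits = 4, length_of_stay = 10)
predict(model, newdata = new_patient, type = "response")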
Prescriptive analytics is the use of historical data to identify an appropriate course of action. In the health care industry, prescriptive analytics is used both to direct business decisions and to prescribe treatment plans for patients. As a result, some of the most common uses of prescriptive analytics in health care include identifying a patient’s likelihood of developing diabetes, allocating ventilators for a hospital unit, and enhancing diagnostic imaging tools.
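As an illustration only, the sketch below applies a simple greedy rule to allocate a limited ventilator stock across hospital units according to predicted demand; the data, unit names and the rule itself are assumptions.
# Minimal sketch: a prescriptive-style allocation of a limited ventilator stock.
# Predicted demand per unit is made up; a simple greedy rule stands in for the
# formal optimization a real system might use.
units <- data.frame(unit = c("ICU", "ER", "Ward_A", "Ward_B"),
                    predicted_demand = c(12, 7, 3, 5))
available <- 20
units <- units[order(-units$predicted_demand), ]   # most urgent units first
units$allocated <- 0
remaining <- available
for (i in seq_len(nrow(units))) {
  give <- min(units$predicted_demand[i], remaining)
  units$allocated[i] <- give
  remaining <- remaining - give
}
units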
Health care analytics offers benefits to health businesses, hospital administrators, and patients. Although it can be tempting to imagine health care analysts working in a virtual data cloud, the reality is that their work has a tangible impact on how hospitals operate, how treatment is provided, and how medical research is conducted.
At a glance, some of the most common benefits of health care analytics include:
Improved patient care, such as offering more effective courses of treatment
Predictions for a patient’s vulnerability to a particular medical condition
More accurate health insurance rates
Improved scheduling for both patients and staff
Optimized resource allocation
More efficient decision-making at the business and patient care level