Introduction To Data Science

This document introduces data science.

Uploaded by

rakesh rocking
Copyright
© All Rights Reserved

PREDICTIVE ANALYTICS

UNIT – VI

Topics of Unit-6:
PREDICTIVE ANALYTICS

Data Interfaces: Introduction, CSV Files: Syntax, Importing a CSV File


Statistical Applications: Introduction, Basic Statistical Operations,
Linear Regression Analysis, Chi-Squared Goodness of Fit Test, Chi-
Squared Test of Independence, Multiple Regression.

CSV file in R
• R can read and write various file formats, such as CSV, Excel, JSON, and XML.
• A CSV file is a text file in which the values in the columns are separated by commas.
• The read.csv() function is used to read a CSV file from your working directory. Similarly, the write.csv() function is used to write a CSV file.
How to import .CSV file in R
• Using the clipboard is impractical for larger data sets; instead, the read.csv() function can be used to read such files:
• dataset <- read.csv(file = "c:/samplefile.csv", header = TRUE, sep = ",")
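
The round trip between write.csv() and read.csv() can be sketched as follows. This is a minimal illustration, not part of the original slides: the data frame `scores` and its values are hypothetical, and a temporary file is used so that the example runs without assuming any particular path on disk.

```r
# Hypothetical data, used only for illustration
scores <- data.frame(name = c("A", "B", "C"), marks = c(72, 85, 90))

# Write to a temporary CSV file (row.names = FALSE avoids an extra index column)
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, file = csv_path, row.names = FALSE)

# Read the file back; header = TRUE treats the first line as column names
dataset <- read.csv(file = csv_path, header = TRUE, sep = ",")
str(dataset)
```

The result of read.csv() is a data frame, so columns can be accessed as dataset$marks.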
Mean
Mean is calculated as the summation of all the values in the data series divided
by the number of values.

Following is the function and syntax definition for calculating the mean:
mean(V, trim=0.0, na.rm= FALSE, …)

• V is the input vector;
• trim is the fraction (from 0 to 0.5) of values to drop from each end of the sorted vector before the mean is computed;
• na.rm, when set to TRUE, removes missing values (NA) from the input vector before the mean is computed.
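
A short sketch of how the three arguments behave, using a hypothetical vector `v` chosen only for illustration:

```r
# Hypothetical input vector
v <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

plain_mean <- mean(v)                # 82.2 / 10 = 8.22

# trim = 0.1 drops 10% of the values (here one value) from each end
# of the sorted vector, so -21 and 54 are excluded
trimmed_mean <- mean(v, trim = 0.1)  # 49.2 / 8 = 6.15

# With a missing value the result is NA unless na.rm = TRUE
with_na <- c(v, NA)
na_result <- mean(with_na)                  # NA
clean_mean <- mean(with_na, na.rm = TRUE)   # 8.22
```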

Median
Median is the middle value of any data series. The R function for calculating the
median is:

median(V, na.rm=FALSE)

The function takes V as the input vector; setting na.rm = TRUE removes missing values (NA) from V before the median is computed.
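
The following sketch, with hypothetical vectors, shows the median for an odd and an even number of values, and the effect of na.rm:

```r
# Odd number of values: the middle value of the sorted vector (1, 5, 7)
odd_median <- median(c(7, 1, 5))          # 5

# Even number of values: the average of the two middle values (3 and 5)
even_median <- median(c(7, 1, 5, 3))      # 4

# A missing value makes the result NA unless na.rm = TRUE
na_median <- median(c(7, 1, 5, NA))                    # NA
clean_median <- median(c(7, 1, 5, NA), na.rm = TRUE)   # 5
```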

Mode
• Mode is defined as a value that has the maximum number of occurrences in the
data series.

• A data series may have more than one mode when several values share the maximum number of occurrences.

• Mode can be calculated for both numerical data and character data unlike mean
and median, which are applicable only for numerical data.

Mode : Example
# Create a function for the mode
getmode <- function(v) {
  uniqv <- unique(v)
  # Count the occurrences of each unique value and return the most frequent one
  # (on a tie, which.max() picks the value that appears first in v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Create a vector with some numbers
v <- c(2,1,2,3,1,2,4,5,3,2,1,2,3,4,5)
result <- getmode(v)
# result is 2

v <- c(3,2,1,2,3,1,2,4,5,3,2,1,2,3,4,5,3,3)
result <- getmode(v)
# result is 3
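
As noted earlier, the mode also applies to character data, and a series may have several modes. The sketch below repeats getmode() so that it runs on its own. The helper `getmodes()` is a hypothetical addition (not from the slides) that returns every tied value, and the vectors are invented for illustration.

```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Character data: the most frequent string is returned
colours <- c("red", "blue", "red", "green", "blue", "red")
colour_mode <- getmode(colours)          # "red"

# Hypothetical helper: return all values that share the maximum count.
# table() stores the values as names, so the result is a character vector.
getmodes <- function(v) {
  counts <- table(v)
  names(counts)[counts == max(counts)]
}
tied_modes <- getmodes(c(1, 1, 2, 2, 3)) # "1" "2"
```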
Standard Deviation
• Apart from the mean, median, and mode, another important measure in statistical analysis is the standard deviation.

• The R function for calculating it, with its syntax definition, is:
sd(V, na.rm=FALSE)

The function takes a numeric vector V as input; setting na.rm = TRUE removes missing values (NA) before the calculation.
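
A brief usage sketch with a hypothetical vector, showing the effect of na.rm and the relationship between sd() and var():

```r
# Hypothetical readings, one of them missing
readings <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)

sd_with_na <- sd(readings)                # NA, because of the missing value
sd_clean <- sd(readings, na.rm = TRUE)    # about 2.14

# sd() is the square root of var() on the same data
same_as_var <- sqrt(var(readings, na.rm = TRUE))
```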

SD & Variance
• Standard Deviation
• The Standard Deviation is a measure of how spread out numbers are.
• Its symbol is σ (the Greek letter sigma)
• The formula is easy: it is the square root of the Variance.
• Variance
• The Variance is defined as:
• “The average of the squared differences from the Mean”
• To calculate the variance follow these steps:
o Work out the Mean (the simple average of the numbers)
o Then for each number: subtract the Mean and square the result (the squared difference).
o Then work out the average of those squared differences.
Example:
• The heights of five dogs (at the shoulders) are 600 mm, 470 mm, 170 mm, 430 mm, and 300 mm.
• Find the Mean, the Variance, and the Standard Deviation.
• The first step is to find the Mean: (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394 mm
• Variance: $\sigma^2 = \frac{206^2 + 76^2 + (-224)^2 + 36^2 + (-94)^2}{5}$
• $= \frac{42436 + 5776 + 50176 + 1296 + 8836}{5}$
• $= \frac{108520}{5} = 21704$
• So the Variance is 21,704 mm².
• Standard Deviation: $\sigma = \sqrt{21704} \approx 147.32 \approx 147$ mm (to the nearest mm)
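
The same calculation can be reproduced in R. One point to keep in mind: the worked example divides by the number of values (the population variance), whereas R's built-in var() and sd() divide by one less than the number of values (the sample variance), so they return larger numbers for the same data. The variable names below are chosen only for this sketch.

```r
heights <- c(600, 470, 170, 430, 300)

# Step 1: the mean
m <- mean(heights)                    # 394

# Steps 2 and 3: average of the squared differences from the mean
pop_var <- mean((heights - m)^2)      # 21704
pop_sd <- sqrt(pop_var)               # about 147.32

# Built-in functions divide by (n - 1) instead of n
sample_var <- var(heights)            # 108520 / 4 = 27130
sample_sd <- sd(heights)              # about 164.71
```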

10.7 Illustration of chi-squared goodness of fit test
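
The original illustration is not reproduced in this text. As a stand-in, here is a minimal sketch of a goodness of fit test with R's chisq.test(), under the assumption that we want to check whether a die is fair. The counts are hypothetical and do not come from the source.

```r
# Hypothetical counts of each face in 120 rolls of a die
observed <- c(22, 17, 20, 26, 22, 13)

# Null hypothesis: every face is equally likely (probability 1/6 each)
fit <- chisq.test(x = observed, p = rep(1/6, 6))

fit$statistic   # X-squared = 5.1 (expected count is 20 per face)
fit$parameter   # degrees of freedom = 6 - 1 = 5
fit$p.value     # larger than 0.05, so fairness is not rejected
```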

© Oxford University Press 2017. All rights reserved. 22


10.8 Chi-squared Test of Independence
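
The original illustration for this test is also missing from the text. The sketch below assumes a hypothetical 2 x 2 contingency table (the group and outcome labels and all counts are invented) and uses chisq.test() to test whether the two classifications are independent. The argument correct = FALSE switches off the continuity correction that R applies to 2 x 2 tables by default, so the statistic matches the hand calculation.

```r
# Hypothetical contingency table: rows are groups, columns are outcomes
tbl <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(group = c("A", "B"),
                              outcome = c("yes", "no")))

res <- chisq.test(tbl, correct = FALSE)

res$expected    # expected counts under independence: 20 and 30 in each row
res$statistic   # X-squared = 50/3, about 16.67
res$parameter   # degrees of freedom = (2 - 1) * (2 - 1) = 1
res$p.value     # far below 0.05, so independence is rejected
```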



Linear Regression Analysis
• Linear regression analysis is one of the most widely used methods of statistical analysis. Its result describes the relationship between the two variables of the model.

• The first variable is the independent variable, which comprises the values obtained from experimental results.
• The second variable is the dependent variable, whose values depend on the independent variable.

• Linear regression analysis in R builds a model relating the two variables through the following equation:
$Y_i = a + b X_i + e_i, \quad i = 1, \dots, h$
• Here a is the intercept, b is the slope, e_i is the error term for observation i, and h is the number of observations.

• Linear regression is a basic and commonly used type of predictive analysis.
• The overall idea of regression is to examine two things:
• (1) Does a set of predictor variables do a good job of predicting an outcome (dependent) variable?
• (2) Which variables in particular are significant predictors of the outcome variable, and in what way do they impact it, as indicated by the magnitude and sign of the beta estimates?
• Three major uses for regression analysis are (1) determining the strength of predictors, (2) forecasting an effect, and (3) trend forecasting.
• First, regression might be used to identify the strength of the effect that the independent variable(s) have on a dependent variable. Typical questions concern the strength of the relationship between dose and effect, sales and marketing spending, or age and income.

• Second, it can be used to forecast effects or impact of changes. That is, the
regression analysis helps us to understand how much the dependent variable
changes with a change in one or more independent variables. A typical question
is, “how much additional sales income do I get for each additional $1000 spent
on marketing?”
• Third, regression analysis predicts trends and future values. The regression
analysis can be used to get point estimates. A typical question is, “what will the
price of gold be in 6 months?”
Linear Regression
• Linear regression is an algorithm that models a linear relationship between an independent variable and a dependent variable in order to predict the outcome of future events.
• It is a statistical method used in data science and machine learning for predictive analysis.
• The independent variable is also called the predictor or explanatory variable; it is not affected by changes in the other variables.
• The dependent variable, however, changes with fluctuations in the independent variable.
• The regression model predicts the value of the dependent variable, which is the response or outcome variable being analyzed or studied.
Linear Regression Equation

• The strength of the relationship between two variables is measured by the correlation coefficient, which ranges from -1 to +1. It shows how strongly the observed values of the two variables are associated.

• The linear regression equation is:

• $Y = a + bX$
• where X is the independent variable, plotted along the x-axis;
• Y is the dependent variable, plotted along the y-axis;
• b is the slope of the line, and a is the intercept (the value of Y when X = 0).
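
The correlation coefficient, slope, and intercept can be computed directly in R. The sketch below uses the ordinary least-squares estimates, b = cov(X, Y) / var(X) and a = mean(Y) - b * mean(X), which the slides do not spell out. The data are hypothetical.

```r
# Hypothetical observations
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Correlation coefficient: always between -1 and +1
r <- cor(x, y)

# Least-squares slope and intercept of the line Y = a + bX
b <- cov(x, y) / var(x)        # 1.99
a <- mean(y) - b * mean(x)     # 6.02 - 1.99 * 3 = 0.05
```

A value of r close to +1, as here, indicates that the points lie close to a rising straight line.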
Linear Model Function
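R has the lm() function, which stands for 'linear model'. It is called as lm(formula, data), where formula describes the model (for example, y ~ x) and data is the data frame containing the variables.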
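
A minimal sketch of fitting and using a model with lm(). The data frame `advertising` and its columns `spend` and `sales` are hypothetical names and values chosen for illustration.

```r
# Hypothetical data: marketing spend and resulting sales
advertising <- data.frame(spend = c(1, 2, 3, 4, 5),
                          sales = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Fit sales = a + b * spend
fit <- lm(sales ~ spend, data = advertising)

# Estimated intercept (a) and slope (b)
coefs <- coef(fit)             # (Intercept) 0.05, spend 1.99

# Predict sales for a new spend value: 0.05 + 1.99 * 6 = 11.99
forecast <- predict(fit, newdata = data.frame(spend = 6))
```

Calling summary(fit) prints the coefficient estimates together with their standard errors and significance tests.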



Multiple Linear Regression
• Multiple linear regression (MLR), also known simply as multiple
regression, is a statistical technique that uses several explanatory
variables to predict the outcome of a response variable.
• The goal of multiple linear regression is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable.
• In essence, multiple regression extends ordinary least-squares (OLS) regression to more than one explanatory variable.
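
In R, multiple regression uses the same lm() function, with the explanatory variables joined by + in the formula. The sketch below is an illustration under stated assumptions: the data are hypothetical and were constructed to follow y = 1 + 2 * x1 + 3 * x2 exactly, so the fitted coefficients can be checked by eye. Real data would include noise.

```r
# Hypothetical data built from y = 1 + 2 * x1 + 3 * x2 (no noise)
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- c(2, 1, 4, 3, 6, 5)
y <- 1 + 2 * x1 + 3 * x2
dat <- data.frame(y, x1, x2)

# Fit the model with two explanatory variables
mfit <- lm(y ~ x1 + x2, data = dat)
mcoefs <- coef(mfit)           # (Intercept) 1, x1 2, x2 3

# Predict the response for new values: 1 + 2 * 7 + 3 * 8 = 39
mpred <- predict(mfit, newdata = data.frame(x1 = 7, x2 = 8))
```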
