Statistics

The document provides an overview of data types, including qualitative (nominal and ordinal) and quantitative (discrete and continuous) data, as well as statistics, sampling techniques, and measures of central tendency and variability. It explains descriptive and inferential statistics, including concepts like entropy and information gain in machine learning, and details on hypothesis testing. Additionally, it includes practical examples and methods for calculating mean, median, mode, variance, and standard deviation.

Uploaded by

mohith
Copyright
© All Rights Reserved

Data

Types of data

Qualitative Data
Qualitative data deals with the characteristics and descriptors that can’t be easily measured, but can be
observed subjectively.

Nominal data

Data with no inherent ordering or ranking.


Eg. Gender, race

Ordinal Data

Data with an ordered series.


Eg. Customer feedback ratings (such as poor, average, good) recorded against each customer ID.
Quantitative Data
Quantitative data deals with numbers and things that you can measure objectively.

Discrete Data

It can take only a finite or countable number of distinct values.


Eg. Number of students in a class.

Continuous data

Data that can take any of an infinite number of values within a range.


Eg. Weight of a person
Statistics
Area of applied mathematics concerned with the collection, analysis, interpretation and
presentation of data.

Terminology

Population: collection of individuals, objects or events whose properties are to be analysed.


Sample: a subset of the population. A well-chosen sample will contain most of the information about a
particular population parameter.

Sampling techniques

Probability Sampling
Sample collected by using the theory of probability.

Random sampling

Each member of the population has an equal chance of being selected in the sample.
Systematic sampling

Every nth record is chosen to be part of the sample.

Stratified sampling

The population is divided into strata, and random sampling is then used to select a sufficient number of
subjects from each stratum.
A stratum is a subset of the population that shares at least one common characteristic.
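
The three probability sampling techniques described above can be sketched in a few lines of Python. This is only an illustration: the population of 20 records, the "group" field used as the stratum, and the function names are all assumptions made for the example, not part of the original notes.

```python
import random
from collections import defaultdict


def simple_random_sample(population, k, rng):
    """Random sampling: every member has an equal chance of selection."""
    return rng.sample(population, k)


def systematic_sample(population, step, start=0):
    """Systematic sampling: every step-th record, beginning at index start."""
    return population[start::step]


def stratified_sample(population, key, per_stratum, rng):
    """Stratified sampling: split into strata, then sample randomly in each."""
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Hypothetical population: 20 records, each belonging to group "A" or "B".
rng = random.Random(42)
population = [{"id": i, "group": "A" if i % 2 == 0 else "B"} for i in range(20)]

simple = simple_random_sample(population, 5, rng)
systematic = systematic_sample(population, 4)
stratified = stratified_sample(population, lambda r: r["group"], 3, rng)

print([r["id"] for r in systematic])
```

A seeded random generator is used here only so that repeated runs give the same sample.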
Types of statistics
Descriptive statistics

This uses the data to provide descriptions of the population either through numerical calculations,
graphs or tables.

Inferential statistics

This makes inferences and predictions about a population based on the sample data.
Inferential statistics generalizes from a sample to a larger dataset and applies probability to draw conclusions. It
allows us to infer population parameters from sample data using a statistical model.

Descriptive statistics
Descriptive statistics is a method used to describe and understand the features of a specific data set by
giving short summaries about the sample and measures of the data.

Categories of descriptive statistics:

• Measures of central tendency
• Measures of variability (spread)
Measures of central tendency (also known as measures of center)

These are statistical measures that summarize a data set with a single representative value. The three measures
of center are Mean, Median, and Mode.

Mean

The average of all the values in the sample is called the mean.

Mean = (Sum of all the values) / (Total number of values)

Mean of hp = (110+110+93+96+90+110+110+110)/8 = 103.625

Median

The central value of the sample set is called the median. Arrange the records in ascending
order. If the number of entries is even, take the average of the 2 middle values.

To find the center value of mpg, arrange the values in ascending order,
i.e. 21,21,21.3,22.8,23,23,23,23
Since there is an even number of values, the Median = (22.8+23)/2 = 22.9

Mode

The value which is most recurrent in the sample set is known as mode.

Mode = most recurrent value

Mode of cyl = 6
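
As a quick check, the three measures of center can be computed with Python's standard `statistics` module. The sketch below reuses the hp and mpg values quoted above; because the cyl values are not listed in these notes, the mode is illustrated on the hp values instead, which is an assumption made for the example.

```python
import statistics

# Values quoted in the worked examples above.
hp = [110, 110, 93, 96, 90, 110, 110, 110]
mpg = [21, 21, 21.3, 22.8, 23, 23, 23, 23]

mean_hp = statistics.mean(hp)        # sum of values / number of values
median_mpg = statistics.median(mpg)  # average of the two middle values
mode_hp = statistics.mode(hp)        # most frequently occurring value

print(mean_hp, median_mpg, mode_hp)
```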
Measures of variability (also known as measures of spread)

These are sometimes also called measures of dispersion and are used to measure the variability in the
sample set. The four measures of spread are Range, Inter Quartile Range, Variance, and Standard
deviation.

Range

Range is a measure of how spread apart the values in a dataset are: the difference between the largest and smallest values.

Range = Max(xi) - Min(xi)

Inter Quartile Range

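This tells us about the spread of the data by dividing the data set into four quarters. Q1, Q2 and Q3 are the quartile boundaries, and the IQR is the width of the middle half of the data.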

IQR = Q3-Q1

Consider an example of marks of 100 students in a class.

Q1 = (45+45)/2 = 45
Q2 = (58+59)/2 = 58.5
Q3 = (71+71)/2 = 71
Now the Inter quartile range is,

IQR = 71-45 = 26
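
A minimal sketch of the range and inter quartile range calculation follows. The marks of the 100 students are not reproduced in these notes, so the ten marks below are hypothetical values chosen only so that the quartiles match the ones quoted above. The quartiles are taken as the medians of the lower and upper halves of the sorted data, which is one common convention; other conventions can give slightly different quartiles.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd number of values the overall median is left out of both halves.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


# Hypothetical marks, not the original data set.
marks = [35, 45, 45, 52, 58, 59, 66, 71, 71, 90]

data_range = max(marks) - min(marks)  # Range = Max(xi) - Min(xi)
q1, q2, q3 = quartiles(marks)
iqr = q3 - q1                         # IQR = Q3 - Q1

print(data_range, q1, q2, q3, iqr)
```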

Variance

Variance describes how much a random variable differs from its expected value. It is computed from the
squares of the deviations from the mean.

$$\text{Variance} = \frac{\sum (x - \bar{x})^2}{n}$$

x: Individual data points
n: Total number of data points
x̄: Mean of data points

Deviation

The difference between each element and the mean.

Deviation = (xi - µ)

Population variance
Population variance is the average of squared deviations.

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

N: Number of data points in the population
µ: Mean

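Sample variance
Sample variance is the sum of squared differences from the sample mean divided by n - 1 (rather than n), which corrects for the mean being estimated from the same sample.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

n: Number of data points in the sample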

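Standard deviation
Standard deviation is the measure of the dispersion of a set of data from its mean. It is the square root of the variance.

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$$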

Standard deviation use case:

Daenerys has 20 dragons. They have the numbers 9,2,5,4,12,7,8,11,9,3,7,4,12,5,4,10,9,6,9,4.

Work out the standard deviation.

Step 1:
Find the mean.

Mean = (9+2+5+4+12+7+8+11+9+3+7+4+12+5+4+10+9+6+9+4)/20 = 140/20
µ = 7

Step 2:
From each element subtract the mean and square the result, i.e. (xi - µ)²

(9-7)² = 4
(2-7)² = (-5)² = 25
and so on…

so we get,
4,25,4,9,25,0,1,16,4,16,0,9,25,4,9,9,4,1,4,9

Step 3:
Find the mean of these squared differences (deviations).

σ² = (4+25+4+9+25+0+1+16+4+16+0+9+25+4+9+9+4+1+4+9)/20 = 178/20
σ² = 8.9

Step 4:
Take the square root of σ², which is the standard deviation.

σ = 2.983
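
The four steps can be replayed in Python as a check. This sketch assumes the 19th value in the list is 9, which is the value consistent with the mean of 7 and with the list of squared differences in Step 2. It treats the 20 numbers as a whole population, so it divides by N rather than N - 1.

```python
import math

dragons = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4]

# Step 1: mean of the values.
mu = sum(dragons) / len(dragons)

# Step 2: squared deviation of each value from the mean.
squared_deviations = [(x - mu) ** 2 for x in dragons]

# Step 3: population variance = mean of the squared deviations.
variance = sum(squared_deviations) / len(dragons)

# Step 4: standard deviation = square root of the variance.
sigma = math.sqrt(variance)

print(mu, variance, round(sigma, 3))
```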

Information gain and Entropy


IG and entropy are used in ML algorithms like decision trees and random forest.

Entropy

Entropy measures the impurity or uncertainty present in the data.

$$H(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$$

where,
S : set of all instances in the data set
N : number of distinct class values
pi : probability of class i

Information gain

IG indicates how much “information” a particular feature/variable gives us about the final outcome.

$$IG(A, S) = H(S) - \sum_{j \in v} \frac{|S_j|}{|S|} H(S_j) = H(S) - H(A, S)$$

where,
H(S) : entropy of the whole data set S
|Sj| : number of instances with value j of an attribute A
|S| : total number of instances in dataset S
v : set of distinct values of an attribute A
H(Sj) : entropy of the subset of instances with value j of attribute A
H(A,S) : entropy of an attribute A (the weighted sum of the subset entropies)

Problem statement: Predict whether a match can be played or not by studying weather conditions.

Target variable: The value we are trying to predict.

Here the target variable is “Play”.

Calculate entropy

From the total number of 14 instances we have:

9 instances of “Yes”
5 instances of “No”

The entropy is:

H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
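
The entropy calculation can be reproduced with a small helper. The function name `entropy` and its list-of-counts argument are choices made for this sketch.

```python
import math


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    # Classes with a zero count contribute nothing and are skipped.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


h_play = entropy([9, 5])  # 9 instances of "Yes", 5 instances of "No"
print(f"{h_play:.3f}")
```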

Create a decision tree

Selecting the root node for decision tree

The root node is a decision variable. The root node should be the most significant variable. IG and entropy are
used to select this root node. The variable with the highest IG should be chosen as the root node.

All possible combinations for root node:


Calculate the IG for each attribute (Outlook, Windy, Humidity and temperature)

IG for Outlook: (maximum IG)

IG for Windy:

IG for Humidity:
IG for Temperature:

To summarize,

Hence the root node should be “Outlook”.
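
The data table behind this example is not reproduced in these notes. The sketch below assumes the widely used 14-row play-tennis data set, which matches the 9 "Yes" / 5 "No" split and the pure "Overcast" branch described here; the rows, column order and function names are assumptions for illustration only.

```python
import math
from collections import Counter

# Assumed rows: (Outlook, Temperature, Humidity, Windy, Play).
ROWS = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rain", "Mild", "High", False, "Yes"),
    ("Rain", "Cool", "Normal", False, "Yes"),
    ("Rain", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rain", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rain", "Mild", "High", True, "No"),
]
COLUMNS = ["Outlook", "Temperature", "Humidity", "Windy"]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, col):
    """IG(A, S) = H(S) - sum over values j of |Sj|/|S| * H(Sj)."""
    labels = [row[-1] for row in rows]
    weighted = 0.0
    for value in {row[col] for row in rows}:
        subset = [row[-1] for row in rows if row[col] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - weighted


gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS)}
root = max(gains, key=gains.get)  # attribute with the highest IG

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("Root node:", root)
```

Under this assumed data set, Outlook has the highest information gain, which agrees with the choice of root node stated above.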


Resulting decision tree

Result:

When Outlook is,


Sunny and Rain – give impure outputs (mix of “Yes” and “No”).
Overcast – 100% pure, definite and certain output.

Confusion matrix
Confusion matrix is a table used to describe performance of a classification model (or “classifier”) on a
set of test data for which the true values are known.

A confusion matrix is a tabular representation of actual vs predicted values.

You can calculate the accuracy of your model with:

Accuracy = (True positives + True negatives) / (True positives + True negatives + False positives + False negatives)
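
The accuracy formula translates directly into code. The four counts used below are hypothetical numbers chosen for illustration; they are not taken from the example in these notes.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion-matrix counts.
acc = accuracy(tp=45, tn=40, fp=5, fn=10)
print(acc)
```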


Consider an example:
The TP, TN, FP, and FN are
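Sample code in R to find the mean, median, variance, and standard deviation (base R has no built-in function for the statistical mode).

R is a statistical language.

data <- runif(20, 1, 10)   # generates 20 random numbers between 1 and 10 and stores them in "data"
mean_value <- mean(data)   # calculates the mean of those values and stores it in "mean_value"
print(mean_value)          # prints the value of the mean
print(median(data))        # prints the median
print(var(data))           # prints the sample variance (divisor n - 1)
print(sd(data))            # prints the sample standard deviation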
Plotting a histogram
This will show how frequently a data point is occurring.

Inferential statistics
This makes inferences and predictions about a population based on the sample data.
Inferential statistics generalizes from a sample to a larger dataset and applies probability to draw conclusions.

Point Estimation

Point estimation is concerned with the use of sample data to measure a single value which serves as an
appropriate value or the best estimate of an unknown population parameter.
Methods of finding the estimates

• Method of moments
Estimates are found by equating the first k sample moments to the corresponding k population
moments.
• Maximum likelihood
Uses a model and chooses the parameter values that maximize a likelihood function. This gives the most
likely parameter values for the observed data.
• Bayes estimators
Minimize the average risk (an expectation of random variables).
• Best unbiased estimators
Several unbiased estimators can be used to approximate a parameter (which one is best depends on
what parameter you’re trying to find).
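
As a small illustration of the method of moments, the sketch below estimates the mean and variance of a population by equating the first two sample moments to the population moments E[X] = µ and E[X²] = σ² + µ². The sample values and variable names are hypothetical and chosen for this example only.

```python
# Hypothetical sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

# First and second sample moments.
m1 = sum(sample) / n
m2 = sum(x ** 2 for x in sample) / n

# Solve mu = m1 and sigma^2 + mu^2 = m2 for the two parameters.
mu_hat = m1
sigma2_hat = m2 - m1 ** 2

print(mu_hat, sigma2_hat)
```

Each result is a point estimate: a single number standing in for the unknown population parameter.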

Interval estimate

An interval, or range of values, used to estimate a population parameter is called an interval estimate.
Instead of a single value, the parameter is estimated to lie within a range.

Confidence interval
This is the measure of your confidence that the interval estimate contains the population mean, µ.
Statisticians use a confidence interval to describe the uncertainty associated with a sample estimate of a
population parameter.
Technically, it is a range of values so constructed that there is a specified probability of including the true
value of a parameter within it.
The confidence level tells how confident you are about the interval.

Margin of error
The difference between the point estimate and the actual population parameter value is called the sampling
error.
When µ is estimated, the sampling error is the difference µ - x̄, where x̄ is the sample mean.

Margin of error E, for a given level of confidence, is the greatest possible distance between the point
estimate and the value of the parameter it is estimating (that is, the largest deviation of the point estimate from the actual parameter value).

$$E = Z_c \frac{\sigma}{\sqrt{n}}$$

Zc : critical value for the chosen level of confidence
n : sample size
σ : standard deviation

Estimating level of confidence


The level of confidence c, is the probability that the interval estimate contains the population
parameter.

Zc is read from the standard normal table (widely available online).


Eg.

Steps involved in constructing a confidence interval:

• Identify a sample statistic (for example, the mean of the sample)
• Select a confidence level
• Find the margin of error

Margin of error - use case
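
The original use case is not reproduced in these notes. As a stand-in, here is a minimal sketch with hypothetical numbers (a sample mean of 50, σ = 2, n = 100, 95% confidence). It assumes a two-sided interval and a known population standard deviation, and it takes the critical value from Python's `statistics.NormalDist` instead of a printed table.

```python
import math
from statistics import NormalDist


def critical_value(confidence):
    """Two-sided critical value Zc of the standard normal distribution."""
    return NormalDist().inv_cdf((1 + confidence) / 2)


def margin_of_error(confidence, sigma, n):
    """E = Zc * sigma / sqrt(n)."""
    return critical_value(confidence) * sigma / math.sqrt(n)


# Hypothetical inputs.
sample_mean = 50.0
sigma = 2.0
n = 100

E = margin_of_error(0.95, sigma, n)
interval = (sample_mean - E, sample_mean + E)

print(round(critical_value(0.95), 2), round(E, 3), interval)
```

The confidence interval is then the point estimate plus or minus the margin of error.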

Hypothesis testing

Hypothesis testing is an inferential statistics technique used to formally check whether a hypothesis should be
rejected or not. The decision is based on the probability value obtained from the test. It determines
whether there is enough evidence in the data sample to infer that a certain condition holds true for an
entire population.
Hypothesis testing is conducted in the following manner:
• State the hypotheses - This stage involves stating the null and alternative hypotheses.
• Formulate an analysis plan - This stage involves construction of an analysis plan.
• Analyze sample data - This stage involves the calculation and interpretation of the test statistic as
described in the analysis plan.
• Interpret results - This stage involves the application of the decision rule described in the analysis plan.

Null Hypothesis (H0) - Result is no different from the assumption

Alternate Hypothesis (Ha) - Result disproves the assumption
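
The four stages can be illustrated with a one-sample z-test. The notes do not name a specific test, so the choice of a z-test, the numbers, and the 0.05 significance level are all assumptions made for this sketch. Here H0 states that the population mean equals 50, and Ha states that it does not.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, with known sigma."""
    # Test statistic: distance of the sample mean from mu0 in standard errors.
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Probability of a result at least this extreme if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Decision rule: reject H0 when the p-value falls below alpha.
    return z, p_value, p_value < alpha


# Hypothetical sample: mean 52 from n = 25 observations, known sigma = 5.
z, p_value, reject_h0 = one_sample_z_test(52.0, 50.0, 5.0, 25)

print(round(z, 2), round(p_value, 4), reject_h0)
```

With these made-up numbers the p-value is below 0.05, so the null hypothesis would be rejected in favour of the alternate hypothesis.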
