Statistics
Types of data
Qualitative Data
Qualitative data deals with the characteristics and descriptors that can’t be easily measured, but can be
observed subjectively.
Nominal data
Ordinal Data
Continuous data
Terminology
Sampling techniques
Probability Sampling
In probability sampling, the sample is selected using the theory of probability, so every member of the population has a known chance of being chosen.
Random sampling
Each member of the population has an equal chance of being selected for the sample.
Systematic sampling
Stratified sampling
The population is first divided into strata. Random sampling is then used to select a sufficient number of subjects from each stratum.
A stratum is a subset of the population whose members share at least one common characteristic.
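As a minimal sketch of the difference between random and stratified sampling, the R code below draws both kinds of sample from a made-up population. The data frame `population`, its `stratum` labels, and the one-in-ten sampling fraction are assumptions chosen only for illustration.

```r
# Hypothetical population of 100 members in three strata (labels are assumptions).
set.seed(42)
population <- data.frame(
  id = 1:100,
  stratum = rep(c("A", "B", "C"), times = c(50, 30, 20))
)

# Random sampling: every member has the same chance of selection.
random_sample <- population[sample(nrow(population), size = 10), ]

# Stratified sampling: split by stratum, then sample one tenth of each stratum at random.
by_stratum <- split(population, population$stratum)
stratified_sample <- do.call(rbind, lapply(by_stratum, function(group) {
  group[sample(nrow(group), size = nrow(group) %/% 10), ]
}))

print(table(random_sample$stratum))      # stratum counts vary from draw to draw
print(table(stratified_sample$stratum))  # stratum counts are fixed by design
```

The stratified sample always keeps the population's 5 : 3 : 2 proportions, whereas the simple random sample only matches them on average.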
Types of statistics
Descriptive statistics
Descriptive statistics uses the data to describe the population through numerical calculations, graphs, or tables.
Inferential statistics
Inferential statistics makes inferences and predictions about a population based on sample data.
It generalizes from a sample to a larger data set and applies probability to draw conclusions. It allows us to infer population parameters from a statistical model fitted to sample data.
Descriptive statistics
Descriptive statistics is a method used to describe and understand the features of a specific data set by giving short summaries of the sample and its measures.
Measures of center
A measure of center is a statistical measure that summarizes a data set with a single representative value. The three measures of center are the mean, median, and mode.
Mean
Median
The median is the central value of the sample set. To find it, arrange the records in ascending order and take the middle value. If the number of entries is even, take the average of the two middle values.
To find the center value of mpg, arrange the values in ascending order,
i.e. 21, 21, 21.3, 22.8, 23, 23, 23, 23
Since there is an even number of values, the Median = (22.8 + 23)/2 = 22.9
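The same calculation can be checked in R. This is a small sketch that assumes the eight mpg values listed above are the whole sample; the variable names are illustrative.

```r
# The eight mpg values from the example above.
mpg <- c(21, 21, 21.3, 22.8, 23, 23, 23, 23)

# Manual median: sort, then average the two middle values (even number of entries).
sorted_mpg <- sort(mpg)
n <- length(sorted_mpg)
manual_median <- (sorted_mpg[n / 2] + sorted_mpg[n / 2 + 1]) / 2

# Built-in median for comparison.
builtin_median <- median(mpg)

print(c(manual = manual_median, builtin = builtin_median))
```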
Mode
The mode is the value that occurs most often in the sample set.
Mode of cyl = 6
Measures of variability (also known as measures of spread)
Measures of variability, sometimes also called measures of dispersion, describe how much the values in the sample set vary. The four measures of spread are the range, interquartile range, variance, and standard deviation.
Range
The range measures how spread apart the values in a data set are. It is the difference between the largest and the smallest value.
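Interquartile range
The interquartile range (IQR) describes the spread of the data by dividing the data set into quarters. It is the difference between the third quartile (Q3) and the first quartile (Q1).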
IQR = Q3-Q1
Q1 = (45+45)/2 = 45
Q2 = (58+59)/2 = 58.5
Q3 = (71+71)/2 = 71
The interquartile range is therefore
IQR = 71 - 45 = 26
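A short R sketch of the quartile calculation follows. The `scores` vector is hypothetical (it is not the data set behind the example above), and the quartiles are computed as medians of the lower and upper halves of the sorted data.

```r
# Hypothetical data with an even number of values (an assumption for illustration).
scores <- sort(c(40, 45, 45, 50, 60, 71, 71, 80))
n <- length(scores)

# Split the sorted data into halves and take the median of each half.
lower_half <- scores[1:(n / 2)]
upper_half <- scores[(n / 2 + 1):n]

q1 <- median(lower_half)  # first quartile
q2 <- median(scores)      # second quartile (the median)
q3 <- median(upper_half)  # third quartile
iqr <- q3 - q1

print(c(Q1 = q1, Q2 = q2, Q3 = q3, IQR = iqr))
```

Base R's `quantile()` and `IQR()` interpolate between data points by default, so they can return slightly different quartiles from this median-of-halves method on other data.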
Variance
Variance describes how much a random variable differs from its expected value. It entails computing
squares of deviations.
Deviation
The deviation of a data point is its difference from the mean.
Deviation = (x_i - µ)
Population variance
Population variance is the average of the squared deviations.
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
N : number of data points in the population
µ : mean of the population
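Sample variance
Sample variance is the sum of the squared differences from the sample mean, divided by n - 1, where n is the number of data points in the sample.
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$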
Standard deviation
Standard deviation measures the dispersion of a set of data from its mean. It is the square root of the variance.
Step 1:
Find the mean.
Mean = (9+2+5+4+12+7+8+11+9+3+7+4+12+5+4+10+9+6+9+4) / 20 = 140 / 20
µ = 7
Step 2:
From each element subtract the mean and square the result, i.e. (x_i - µ)²
(9 - 7)² = 4
(2 - 7)² = (-5)² = 25
and so on…
so we get,
4,25,4,9,25,0,1,16,4,16,0,9,25,4,9,9,4,1,4,9
Step 3:
Find the mean of these squared differences (deviations).
σ² = (4+25+4+9+25+0+1+16+4+16+0+9+25+4+9+9+4+1+4+9) / 20 = 178 / 20
σ² = 8.9
Step 4:
Take the square root of σ² to get the standard deviation.
σ = √8.9 ≈ 2.983
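The four steps can be sketched in R as follows. The data vector uses the 20 values that are consistent with the squared deviations listed in Step 2; the variable names are illustrative.

```r
# The 20 data values consistent with the squared deviations listed in Step 2.
x <- c(9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4)

mu <- mean(x)                 # Step 1: the mean
squared_dev <- (x - mu)^2     # Step 2: squared deviations from the mean
pop_var <- mean(squared_dev)  # Step 3: population variance (divide by N)
pop_sd <- sqrt(pop_var)       # Step 4: population standard deviation

print(c(mean = mu, variance = pop_var, sd = pop_sd))

# R's built-in var() and sd() divide by n - 1, so they give the sample versions.
sample_var <- var(x)
sample_sd <- sd(x)
```

Because `var()` and `sd()` use the n - 1 denominator, they will not reproduce the population values above unless rescaled by (n - 1) / n.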
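Entropy
Entropy measures the impurity or uncertainty in a data set.
$$H(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$$
where,
S : set of all instances in the data set
N : number of distinct class values
p_i : event probability (the proportion of instances in S that belong to class i)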
Information gain
IG indicates how much “information” a particular feature/variable gives us about the final outcome.
$$IG(S, A) = H(S) - \sum_{j=1}^{v} \frac{|S_j|}{|S|} H(S_j) = H(S) - H(A, S)$$
where,
H(S) : entropy of the whole data set S
|S_j| : number of instances with the j-th value of an attribute A
|S| : total number of instances in data set S
v : number of distinct values of an attribute A
H(S_j) : entropy of the subset of instances with the j-th value of attribute A
H(A, S) : entropy of an attribute A (the weighted average of the subset entropies)
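The two quantities can be sketched in R as below. The helper names `entropy` and `info_gain`, and the tiny `play`, `windy`, and `humidity` vectors, are hypothetical and serve only to illustrate the definitions; they are not the data set used in these notes.

```r
# Entropy of a vector of class labels: -sum(p_i * log2(p_i)).
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]  # skip empty classes so that 0 * log2(0) never appears
  -sum(p * log2(p))
}

# Information gain of an attribute: H(S) minus the weighted entropy of its subsets.
info_gain <- function(labels, attribute) {
  subsets <- split(labels, attribute)
  weighted <- sum(sapply(subsets, function(s) {
    length(s) / length(labels) * entropy(s)
  }))
  entropy(labels) - weighted
}

# Hypothetical toy data (assumed for illustration only).
play     <- c("yes", "yes", "no", "no", "yes", "no")
windy    <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
humidity <- c("high", "normal", "high", "normal", "high", "normal")

h_play <- entropy(play)
ig_windy <- info_gain(play, windy)
ig_humidity <- info_gain(play, humidity)
print(c(H = h_play, IG_windy = ig_windy, IG_humidity = ig_humidity))
```

In this toy data `windy` separates the classes perfectly, so it has the larger information gain and would be preferred as the root node.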
Calculate entropy
The root node is a decision variable, and it should be the most significant variable. IG and entropy are used to select the root node: the variable with the highest IG should be chosen.
IG for Windy:
IG for Humidity:
IG for Temperature:
To summarize,
Result:
Confusion matrix
A confusion matrix is a table used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.
Consider an example:
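The four cells of the matrix are the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).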
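As an illustration, a confusion matrix can be built in R with `table()`. The `actual` and `predicted` vectors below are hypothetical labels (1 = positive, 0 = negative), not results from any real classifier.

```r
# Hypothetical true labels and classifier predictions (1 = positive, 0 = negative).
actual    <- c(1, 1, 1, 0, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 1, 0, 1, 0)

# Rows are the actual classes, columns are the predicted classes.
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

tp <- cm["1", "1"]  # actual positive, predicted positive
tn <- cm["0", "0"]  # actual negative, predicted negative
fp <- cm["0", "1"]  # actual negative, predicted positive
fn <- cm["1", "0"]  # actual positive, predicted negative

accuracy <- (tp + tn) / sum(cm)
print(accuracy)
```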
Sample code in R to find the mean, median, mode, variance, and standard deviation.
R is a programming language for statistical computing.
data <- runif(20, 1, 10)   # generates 20 random numbers between 1 and 10 and stores them in "data"
mean_value <- mean(data)   # calculates the mean of those values and stores it in "mean_value"
print(mean_value)          # prints the value of the mean
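The remaining measures can be computed in a similar way. The sketch below uses a small hypothetical vector with repeated values, because the mode is not meaningful for continuous random numbers; base R has no built-in function for the statistical mode, so it is derived from a frequency table here.

```r
# Hypothetical data with repeated values (an assumption for illustration).
values <- c(4, 6, 6, 8, 6, 4, 8, 6)

median_value <- median(values)  # middle value of the sorted data

# Mode: the most frequent value, taken from a frequency table.
counts <- table(values)
mode_value <- as.numeric(names(counts)[which.max(counts)])

variance_value <- var(values)  # sample variance (divides by n - 1)
sd_value <- sd(values)         # sample standard deviation

print(c(median = median_value, mode = mode_value,
        variance = variance_value, sd = sd_value))
```

If several values tie for the highest count, `which.max()` returns only the first of them.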
Plotting a histogram
A histogram shows how frequently the values in each range (bin) of the data occur.
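A minimal sketch using base R's `hist()` is shown below. The seed, the number of breaks, the title, and the colour are arbitrary choices made for illustration.

```r
# Generate 20 random numbers between 1 and 10 (the seed is an arbitrary assumption).
set.seed(123)
data <- runif(20, 1, 10)

# Draw the histogram; hist() also returns the bin boundaries and counts.
h <- hist(data, breaks = 5, main = "Histogram of data",
          xlab = "Value", col = "lightblue")

print(h$breaks)  # bin boundaries chosen by hist()
print(h$counts)  # number of data points falling in each bin
```

The `breaks` argument is only a suggestion, so R may choose a slightly different number of bins to get round boundaries.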
Inferential statistics
Inferential statistics makes inferences and predictions about a population based on sample data, applying probability to draw conclusions.
Point Estimation
Point estimation uses sample data to calculate a single value that serves as the best estimate of an unknown population parameter.
Methods of finding the estimates
Method of moments
Estimates are found by equating the first k sample moments to the corresponding k population moments.
Maximum likelihood
Maximum likelihood estimation assumes a model and chooses the parameter values that maximize the likelihood function, that is, the values under which the observed data are most probable.
Bayes estimators
Bayes estimators minimize the average risk (an expectation of random variables).
Best unbiased estimators
Several unbiased estimators can be used to approximate a parameter; which one is best depends on the parameter you are trying to estimate.
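To make the first two methods concrete, the sketch below estimates the rate of an exponential distribution in both ways. The choice of distribution, the simulated sample, and the true rate of 2 are all assumptions made for illustration.

```r
# Hypothetical sample: 200 draws from an exponential distribution (rate = 2 is assumed).
set.seed(1)
x <- rexp(200, rate = 2)

# Method of moments: the population mean is 1 / rate, so equate it to the sample mean.
rate_mom <- 1 / mean(x)

# Maximum likelihood: maximize the log-likelihood numerically over the rate.
log_likelihood <- function(rate) sum(dexp(x, rate = rate, log = TRUE))
rate_mle <- optimize(log_likelihood, interval = c(0.01, 20),
                     maximum = TRUE, tol = 1e-8)$maximum

print(c(method_of_moments = rate_mom, maximum_likelihood = rate_mle))
```

For the exponential distribution the two methods give the same estimate, which is why the printed values agree up to numerical tolerance; for other models they can differ.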
Interval estimate
An interval, or range of values, used to estimate a population parameter is called an interval estimate.
Rather than a single value, the estimate states that the parameter is likely to lie within a range.
Confidence interval
A confidence interval expresses how confident you are that the interval estimate contains the population mean, µ.
Statisticians use a confidence interval to describe the uncertainty associated with a sample estimate of a population parameter.
Technically, it is a range of values constructed so that there is a specified probability of including the true value of the parameter within it.
The confidence level states how confident you are that the interval contains the parameter.
Margin of error
The difference between the point estimate and the actual population parameter value is called the sampling error.
When µ is estimated, the sampling error is the difference µ - x̄, where x̄ is the sample mean.
The margin of error E, for a given level of confidence, is the greatest possible distance between the point estimate and the value of the parameter it is estimating.
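As a sketch of how the point estimate, margin of error, and confidence interval fit together, the R code below builds a 95% interval for a mean. The sample values are hypothetical, and the interval is based on the t distribution, which assumes the population standard deviation is unknown.

```r
# Hypothetical sample (values are assumptions for illustration).
x <- c(9, 2, 5, 4, 12, 7, 8, 11, 9, 3)
n <- length(x)
conf_level <- 0.95

point_estimate <- mean(x)          # point estimate of the population mean
standard_error <- sd(x) / sqrt(n)  # estimated standard error of the mean

# Critical value of the t distribution for a two-sided interval.
t_critical <- qt(1 - (1 - conf_level) / 2, df = n - 1)

margin_of_error <- t_critical * standard_error
ci <- c(lower = point_estimate - margin_of_error,
        upper = point_estimate + margin_of_error)

print(point_estimate)
print(margin_of_error)
print(ci)
```

The same interval is returned by `t.test(x, conf.level = 0.95)$conf.int`, which is a convenient way to check the manual calculation.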
Hypothesis testing
Hypothesis testing is an inferential statistics technique used to formally decide whether a hypothesis should be rejected. The decision is based on a probability computed from the sample data: the test determines whether there is enough evidence in the sample to infer that a certain condition holds true for the entire population.
Hypothesis testing is conducted in the following manner:
State the Hypotheses - This stage involves stating null and alternative hypotheses.
Formulate an analysis plan - This stage involves construction of an analysis plan.
Analyze sample data - This stage involves the calculation and interpretation of the test statistic as described in the analysis plan.
Interpret results - This stage involves the application of decision rule described in the analysis plan.
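The four stages can be sketched with a one-sample t-test in R. The sample values, the hypothesized mean `mu0`, and the significance level `alpha` are assumptions chosen for illustration.

```r
# Hypothetical sample (values are assumptions for illustration).
x <- c(9, 2, 5, 4, 12, 7, 8, 11, 9, 3)

# Stage 1: state the hypotheses. H0: mu = 5, H1: mu != 5.
mu0 <- 5

# Stage 2: analysis plan. Two-sided one-sample t-test at the 5% significance level.
alpha <- 0.05

# Stage 3: analyze the sample data by computing the test statistic and p-value.
n <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Stage 4: interpret the results with the decision rule from the plan.
reject_null <- p_value < alpha

print(c(t = t_stat, p = p_value))
print(reject_null)
```

The built-in `t.test(x, mu = mu0)` performs the same calculation and reports the statistic, p-value, and confidence interval together.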