Statistics

The document provides an overview of data types, including qualitative (nominal and ordinal) and quantitative (discrete and continuous) data, as well as statistics, sampling techniques, and measures of central tendency and variability. It explains descriptive and inferential statistics, including concepts like entropy and information gain in machine learning, and details on hypothesis testing. Additionally, it includes practical examples and methods for calculating mean, median, mode, variance, and standard deviation.

Uploaded by

mohith

Data

Types of data

Qualitative Data
Qualitative data deals with the characteristics and descriptors that can’t be easily measured, but can be
observed subjectively.

Nominal data

Data with no inherent ordering or ranking.

Eg. Gender, race

Ordinal Data

Data with an ordered series.

Eg. Customer feedback ratings (such as poor, average, good) recorded against customer IDs.

Quantitative Data
Quantitative data deals with numbers and things that you can measure objectively.

Discrete Data

Data that can take only a finite or countable number of distinct values.

Eg. Number of students in a class.

Continuous data

Data that can take any value within a range, so the number of possible values is infinite.

Eg. Weight of a person
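
As a rough illustration, the four data types above can be represented in R as follows. The values and variable names are hypothetical and chosen only for this sketch: an unordered factor for nominal data, an ordered factor for ordinal data, integers for discrete data, and doubles for continuous data.

```r
# Hypothetical values, chosen only to illustrate the four data types.
gender   <- factor(c("female", "male", "female"))        # nominal: categories, no order
feedback <- factor(c("good", "poor", "average"),
                   levels = c("poor", "average", "good"),
                   ordered = TRUE)                       # ordinal: ordered categories
students <- c(32L, 28L, 35L)                             # discrete: whole-number counts
weight   <- c(61.5, 72.25, 58.0)                         # continuous: any value in a range

is.ordered(gender)          # FALSE
is.ordered(feedback)        # TRUE
feedback[1] > feedback[2]   # TRUE: "good" ranks above "poor"
is.integer(students)        # TRUE
is.double(weight)           # TRUE
```

Comparisons such as `>` are only meaningful for the ordered factor, which mirrors the difference between nominal and ordinal data.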
Statistics
An area of applied mathematics concerned with data collection, analysis, interpretation and
presentation.

Terminology

Population: collection of individuals, objects or events whose properties are to be analysed.

Sample: a subset of population. A well chosen sample will have most of the information about a
particular population parameter.

Sampling techniques

Probability Sampling
The sample is drawn using the theory of probability, so every member of the population has a known
chance of being selected.

Random sampling

Each member of the population has an equal chance of being selected in the sample.

Systematic sampling

Every nth record is chosen to be part of the sample.

Stratified sampling

The population is first divided into strata. Random sampling is then used to select a sufficient number
of subjects from each stratum.
A stratum is a subset of the population that shares at least one common characteristic.
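
The three sampling techniques above can be sketched in R as follows. This is a minimal illustration, not a prescribed procedure: the population of 100 records, the two strata "A" and "B", the sample sizes and the seed are all assumptions made for the example.

```r
set.seed(42)                      # arbitrary seed so the draws are repeatable

# Hypothetical population of 100 records in two strata (60 in "A", 40 in "B").
population <- data.frame(
  id    = 1:100,
  group = rep(c("A", "B"), times = c(60, 40))
)

# Random sampling: every record has the same chance of selection.
random_ids <- sample(population$id, size = 10)

# Systematic sampling: every nth record, starting from a random offset.
n_step <- 10
start <- sample(n_step, 1)
systematic_ids <- seq(from = start, to = nrow(population), by = n_step)

# Stratified sampling: draw 5 records at random from each stratum.
stratified_ids <- unlist(lapply(
  split(population$id, population$group),
  function(ids) sample(ids, size = 5)
))
```

Drawing the same number from each stratum is only one possible choice; drawing in proportion to the stratum sizes is another.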
Types of statistics
Descriptive statistics

This uses the data to provide descriptions of the population either through numerical calculations,
graphs or tables.

Inferential statistics

This makes inferences and predictions about a population based on the sample data.
Inferential statistics generalizes from a sample to a larger population and applies probability to draw
conclusions. It allows us to infer population parameters from sample data using a statistical model.

Descriptive statistics
Descriptive statistics is a method used to describe and understand the features of a specific data set by
giving short summaries of the sample and measures of the data.

Categories of descriptive statistics:

• Measures of central tendency

• Measures of variability (spread)

Measures of central tendency (also known as measures of center)

A measure of central tendency is a statistical measure that summarizes a data set with a single
representative value. The three measures of center are Mean, Median, and Mode.

Mean

The mean is the average of all the values in the sample.

$$\text{Mean} = \frac{\text{Sum of all the values}}{\text{Total number of values}}$$

$$\text{Mean of hp} = \frac{110+110+93+96+90+110+110+110}{8} = 103.625$$

Median

The median is the central value of the sample set. Arrange the records in ascending order. If the number
of entries is even, take the average of the two middle values.

To find the center value of mpg, arrange the values in ascending order,
ie. 21, 21, 21.3, 22.8, 23, 23, 23, 23
Since there is an even number of values, the Median = (22.8+23)/2 = 22.9

Mode

The value which is most recurrent in the sample set is known as mode.

Mode = most recurrent value

Mode of cyl = 6
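
The three measures of center can be checked in R with the hp and mpg values listed above. The cyl values are not listed in the text, so the mode is computed on hp and mpg instead. `stat_mode` is a hypothetical helper name introduced for this sketch, since base R's `mode()` reports a storage type rather than the statistical mode.

```r
hp  <- c(110, 110, 93, 96, 90, 110, 110, 110)
mpg <- c(21, 21, 21.3, 22.8, 23, 23, 23, 23)

mean(hp)     # 103.625
median(mpg)  # 22.9

# Count how often each value occurs and return the most frequent one(s).
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])  # all values tied for most frequent
}

stat_mode(hp)   # 110
stat_mode(mpg)  # 23
```

The helper returns every value tied for the highest count, so a data set with two equally frequent values yields both.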
Measures of variability (also known as measures of spread)

These are also called measures of dispersion and are used to measure the variability in the
sample set. The four measures of spread are Range, Inter Quartile Range, Variance, and Standard
deviation.

Range

Range is a measure of how spread apart the values in a data set are.

Range = Max(xi) - Min(xi)

Inter Quartile Range

This describes the spread of the data by dividing the sorted data set into four quarters.

IQR = Q3 - Q1

Consider an example of marks of 100 students in a class. With the marks sorted in ascending order, each
quartile is the average of the two values on either side of its cut point:

Q1 = (45+45)/2 = 45
Q2 = (58+59)/2 = 58.5
Q3 = (71+71)/2 = 71

Now the Inter quartile range is,

IQR = 71-45 = 26
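
The range and the inter quartile range can be sketched in R. The 100 marks are not listed in the text, so this example reuses the eight hp values from the mean example. Choosing `type = 2` in `quantile()` is an assumption made here because it averages the two neighbouring values at a cut point, like the marks example; R's default rule interpolates and can give a different answer.

```r
hp <- c(110, 110, 93, 96, 90, 110, 110, 110)

max(hp) - min(hp)                          # 20
diff(range(hp))                            # 20 (range() returns c(min, max))

# type = 2 averages the two neighbouring values at a cut point
q <- quantile(hp, probs = c(0.25, 0.5, 0.75), type = 2)
q                                          # 25%: 94.5, 50%: 110, 75%: 110
iqr_type2 <- unname(q["75%"] - q["25%"])   # 15.5

# R's default (type = 7) interpolates instead, so the result differs
iqr_default <- IQR(hp)                     # 14.75
```

On small samples the choice of quantile rule matters; on large samples the rules usually agree closely.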

Variance

Variance describes how much a random variable differs from its expected value. It entails computing
squares of deviations.

$$\text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

xi : Individual data points

n : Total number of data points
x̄ : Mean of data points

Deviation

The difference between each element and the mean.

Deviation = (xi - µ)

Population variance
Population variance is the average of squared deviations.

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

N : Number of data points in the population
µ : Mean

Sample variance
Sample variance is the sum of squared differences from the sample mean, divided by n - 1.

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

n : Number of data points in the sample

Standard deviation
Standard deviation is the measure of the dispersion of a set of data from its mean. It is the square root
of the variance.

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

Standard deviation use case:

Daenerys has 20 dragons. They have the numbers 9,2,5,4,12,7,8,11,9,3,7,4,12,5,4,10,9,6,9,4.

Work out the standard deviation.

Step 1:
Find the mean.

$$\mu = \frac{9+2+5+4+12+7+8+11+9+3+7+4+12+5+4+10+9+6+9+4}{20} = \frac{140}{20} = 7$$

Step 2:
From each element subtract the mean and square the result, ie. (xi - µ)²

(9-7)² = 4
(2-7)² = (-5)² = 25
and so on…

so we get,
4,25,4,9,25,0,1,16,4,16,0,9,25,4,9,9,4,1,4,9

Step 3:
Find the mean of these squared differences (deviations).

$$\sigma^2 = \frac{4+25+4+9+25+0+1+16+4+16+0+9+25+4+9+9+4+1+4+9}{20} = \frac{178}{20} = 8.9$$

Step 4:
Take the square root of σ², which is the standard deviation.

σ = 2.983
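
The four steps can be reproduced in R as a check. This is a minimal sketch: the vector is entered with 9 as its nineteenth value, which is the value consistent with the squared deviations listed in Step 2 and with a mean of exactly 7. The variable names are chosen for this example.

```r
# Values of the use case (nineteenth value 9, consistent with the squared deviations in Step 2).
dragons <- c(9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4)

mu <- mean(dragons)              # Step 1: 7
sq_dev <- (dragons - mu)^2       # Step 2: squared deviations
pop_var <- mean(sq_dev)          # Step 3: 8.9
pop_sd <- sqrt(pop_var)          # Step 4: about 2.983

# var() and sd() divide by n - 1 (sample formulas), so they give larger values here.
sample_var <- var(dragons)       # 178 / 19, about 9.368
sample_sd <- sd(dragons)         # about 3.061
```

The use case treats the 20 dragons as the whole population, which is why it divides by 20 rather than 19.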

Information gain and Entropy

IG and entropy are used in ML algorithms like decision trees and random forests.

Entropy

Entropy measures the impurity or uncertainty present in the data.

$$H(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$$

where,
S : set of all instances in the data set
N : number of distinct class values
pi : event probability (the proportion of instances in class i)

Information gain

IG indicates how much “information” a particular feature/variable gives us about the final outcome.

$$Gain(A, S) = H(S) - \sum_{j \in v} \frac{|S_j|}{|S|} H(S_j) = H(S) - H(A, S)$$

where,
H(S) : entropy of the whole data set S
|Sj| : number of instances with j value of an attribute A
|S| : total number of instances in dataset S
v : set of distinct values of an attribute A
H(Sj) : entropy of subset of instances for attribute A
H(A,S) : entropy of an attribute A

Problem statement: Predict whether a match can be played or not by studying weather conditions.

Target variable: The value we are trying to predict.

Here the target variable is “Play”.

Calculate entropy

From the total number of 14 instances we have:

9 instances of “Yes”
5 instances of “No”

The Entropy is:

$$H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$
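
The entropy of the 9 “Yes” and 5 “No” instances can be verified with a few lines of R. `entropy` is a helper written for this sketch, not a base R function. It takes class counts and returns the entropy in bits.

```r
# Entropy in bits of a vector of class counts; zero counts contribute nothing.
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

h_play <- entropy(c(yes = 9, no = 5))
round(h_play, 3)   # 0.94
```

A set containing only one class has entropy 0, and an evenly split two-class set has entropy 1.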

Create a decision tree

Selecting the root node for decision tree

The root node is a decision variable. The root node should be the most significant variable. IG and entropy
are used to select this root node. The variable with the highest IG should be chosen as the root node.

All possible choices for the root node: Outlook, Windy, Humidity and Temperature.

Calculate the IG for each attribute (Outlook, Windy, Humidity and Temperature).

[Figures: IG calculations for Outlook, Windy, Humidity and Temperature, followed by a summary. Outlook has the maximum IG.]

Hence the root node should be “Outlook”.
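
The information gain of a single attribute can be sketched in R as below. The per-value counts for Outlook are NOT given in the text. The counts used here are an assumption, chosen to agree with the 9 “Yes” and 5 “No” totals and with Overcast being pure. `entropy` and `info_gain` are helper names introduced for this example.

```r
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# ASSUMED per-value counts (not given in the text): rows are Outlook values,
# columns are the Play outcome; totals match the 9 "Yes" and 5 "No" instances.
outlook <- rbind(
  Sunny    = c(Yes = 2, No = 3),
  Overcast = c(Yes = 4, No = 0),
  Rain     = c(Yes = 3, No = 2)
)

info_gain <- function(tab) {
  h_total <- entropy(colSums(tab))                  # H(S)
  weights <- rowSums(tab) / sum(tab)                # |Sj| / |S|
  h_split <- sum(weights * apply(tab, 1, entropy))  # H(A, S)
  h_total - h_split
}

round(info_gain(outlook), 3)   # 0.247 under the assumed counts
```

Running the same function on a count table for each attribute and picking the largest value gives the root node.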

Resulting decision tree

Result:

When Outlook is,

Sunny and Rain – give impure outputs (mix of “Yes” and “No”).
Overcast – 100% pure, definite and certain output.

Confusion matrix
A confusion matrix is a table used to describe the performance of a classification model (or “classifier”) on a
set of test data for which the true values are known.

A confusion matrix is a tabular representation of actual vs predicted values.

You can calculate the accuracy of your model with:

$$\text{Accuracy} = \frac{\text{True positives} + \text{True negatives}}{\text{True positives} + \text{True negatives} + \text{False positives} + \text{False negatives}}$$
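
A confusion matrix and its accuracy can be computed in R with `table()`. The ten actual and predicted labels below are hypothetical and serve only to illustrate the calculation, with "yes" treated as the positive class.

```r
# Hypothetical actual and predicted labels for ten test cases.
actual    <- factor(c("yes", "yes", "no", "no", "yes", "no", "yes", "no", "no", "yes"),
                    levels = c("yes", "no"))
predicted <- factor(c("yes", "no", "no", "no", "yes", "yes", "yes", "no", "no", "yes"),
                    levels = c("yes", "no"))

cm <- table(Actual = actual, Predicted = predicted)
cm

tp <- cm["yes", "yes"]   # actual yes, predicted yes
tn <- cm["no", "no"]     # actual no, predicted no
fp <- cm["no", "yes"]    # actual no, predicted yes
fn <- cm["yes", "no"]    # actual yes, predicted no

accuracy <- (tp + tn) / (tp + tn + fp + fn)   # 0.8
```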

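Consider an example:
[Figure: example confusion matrix.] TP, TN, FP, and FN are the counts of true positives, true negatives,
false positives, and false negatives.

Sample code in R to find the mean, median, variance, and standard deviation.

R is a statistical language. It has no built-in function for the statistical mode.

data <- runif(20, 1, 10)   # generate 20 random numbers between 1 and 10 and store them in "data"
data_mean <- mean(data)    # calculate the mean of those values and store it in "data_mean"
print(data_mean)           # print the value of the mean
print(median(data))        # print the median
print(var(data))           # print the sample variance (divides by n - 1)
print(sd(data))            # print the sample standard deviation

Plotting a histogram
A histogram shows how frequently the data points occur within each range of values.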
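
A histogram of the same kind of random data can be drawn with base R's `hist()`. This is a minimal sketch: the seed, the bin edges and the axis labels are choices made for the example.

```r
set.seed(1)                              # arbitrary seed for repeatability
data <- runif(20, 1, 10)                 # 20 random numbers between 1 and 10

h <- hist(data, breaks = seq(1, 10, by = 1),
          main = "Histogram of data", xlab = "Value")
h$counts                                 # number of data points in each bin
```

`hist()` draws the plot and also returns the bin counts, so the frequencies can be inspected without reading them off the chart.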

Inferential statistics
This makes inferences and predictions about a population based on the sample data.
Inferential statistics generalizes from a sample to a larger population and applies probability to draw conclusions.

Point Estimation

Point estimation is concerned with the use of sample data to measure a single value which serves as an
approximate value or the best estimate of an unknown population parameter.

Methods of finding the estimates

• Method of moments
Estimates are found by equating the first k sample moments to the corresponding k population
moments.
• Maximum likelihood
Uses a model and the observed data to maximize a likelihood function. This results in the most likely
parameter value for the data observed.
• Bayes estimators
Minimize the average risk (an expectation of random variables).
• Best unbiased estimators
Several unbiased estimators can be used to approximate a parameter (which one is best depends on
what parameter you’re trying to find).
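
The first two methods can be compared on simulated data. This sketch assumes an exponential model with a true rate of 2. The model, the sample size and the seed are choices made for the illustration, not part of the notes. For this model both methods lead to the same estimate, 1 / sample mean, which the numerical maximization should reproduce closely.

```r
set.seed(1)                                  # arbitrary seed
x <- rexp(500, rate = 2)                     # simulated sample; true rate = 2 (assumed)

# Method of moments: the population mean of an exponential is 1 / rate,
# so equate it to the sample mean and solve for the rate.
rate_mom <- 1 / mean(x)

# Maximum likelihood: maximise the log-likelihood numerically.
log_lik <- function(rate) sum(dexp(x, rate = rate, log = TRUE))
rate_mle <- optimize(log_lik, interval = c(0.01, 20), maximum = TRUE)$maximum

c(moments = rate_mom, likelihood = rate_mle)
```

The search interval passed to `optimize()` only needs to contain the maximum; its exact bounds are arbitrary.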

Interval estimate

An interval, or range of values, used to estimate a population parameter is called an interval estimate.
Instead of a single value, the estimate is stated as a range within which the parameter is expected to lie.

Confidence interval
This is the measure of your confidence that the interval estimate contains the population mean, µ.
Statisticians use a confidence interval to describe the uncertainty associated with a sample estimate of a
population parameter.
Technically, it is a range of values constructed so that there is a specified probability of including the true
value of a parameter within it.
The confidence level tells how confident you are about the interval.

Margin of error
The difference between the point estimate and the actual population parameter value is called the sampling
error.
When µ is estimated, the sampling error is the difference µ - x̄, where x̄ is the sample mean.

The margin of error E, for a given level of confidence, is the greatest possible distance between the point
estimate and the value of the parameter it is estimating (the deviation from the actual point estimate).

$$E = Z_c \frac{\sigma}{\sqrt{n}}$$

Zc : critical value for the chosen level of confidence

n : sample size
σ : standard deviation

Estimating level of confidence

The level of confidence c, is the probability that the interval estimate contains the population
parameter.

Zc is read from the standard normal table.

Eg. for a level of confidence c = 0.95, Zc = 1.96.

Steps involved in constructing a confidence interval:

• Identify a sample statistic (for example, the mean of the sample)
• Select a confidence level
• Find the margin of error

The confidence interval is then the sample statistic ± the margin of error.

Margin of error - use case
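
As an illustrative calculation, the margin of error and the resulting confidence interval can be computed in R. All input numbers below (sample mean, standard deviation, sample size, level of confidence) are hypothetical values chosen for this sketch. `qnorm()` gives the critical value that would otherwise be read from the standard normal table.

```r
# Hypothetical inputs: sample mean, known population sd, and sample size.
x_bar <- 50
sigma <- 8
n     <- 64
conf  <- 0.95

z_c <- qnorm(1 - (1 - conf) / 2)     # critical value, about 1.96
E   <- z_c * sigma / sqrt(n)         # margin of error, about 1.96
ci  <- c(lower = x_bar - E, upper = x_bar + E)
ci                                   # about 48.04 to 51.96
```

Raising the level of confidence or lowering the sample size widens the interval, because both increase E.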

Hypothesis testing

Hypothesis testing is an inferential statistics technique used to formally check whether a hypothesis is
accepted or rejected. The decision is based on the probability value obtained from the test. The aim is to
determine whether there is enough evidence in the data sample to infer that a certain condition holds true
for an entire population.
Hypothesis testing is conducted in the following manner:
• State the hypotheses - This stage involves stating the null and alternative hypotheses.
• Formulate an analysis plan - This stage involves the construction of an analysis plan.
• Analyze sample data - This stage involves the calculation and interpretation of the test statistic as
described in the analysis plan.
• Interpret results - This stage involves the application of the decision rule described in the analysis plan.

Null Hypothesis (H0) - The result is no different from the assumption

Alternative Hypothesis (Ha) - The result disproves the assumption
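
The four stages can be sketched in R with an exact binomial test. The scenario is hypothetical: 60 heads in 100 coin tosses, tested at a 5% significance level. The choice of `binom.test()` and of the significance level are assumptions made for this example.

```r
# Hypothetical data: 60 heads in 100 tosses of a coin.
heads  <- 60
tosses <- 100
alpha  <- 0.05                       # significance level chosen in the analysis plan

# H0: P(heads) = 0.5 (the coin is fair); Ha: P(heads) != 0.5
result <- binom.test(heads, tosses, p = 0.5, alternative = "two.sided")
result$p.value                       # probability of a result at least this extreme under H0

# Decision rule: reject H0 only when the p-value is below alpha.
reject_h0 <- result$p.value < alpha
reject_h0                            # FALSE here
```

Failing to reject H0 does not prove the coin is fair; it only means this sample does not provide enough evidence against fairness at the chosen level.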
