Introduction to Statistics
by
Farhan Sufyan
Contents
1.12 Arithmetic Mean
1.12.1 How to Calculate Arithmetic Mean?
1.12.2 Mean of Raw Data or Individual Series
1.12.2.1 Direct Method
1.12.2.2 Assumed Mean Method
1.12.2.3 Step-Deviation Method
1.12.2.4 Assumed Mean Method vs Step Deviation Method
1.12.3 Mean of Ungrouped Frequency Distribution or Discrete Series
1.12.3.1 Direct Method
1.12.4 Practice Questions
1.13 Sampling
1.13.1 Probability Sampling
1.13.1.1 Random Sampling
1.13.1.2 Systematic Sampling
1.13.1.3 Stratified Sampling
1.13.1.4 Cluster Sampling
1.13.1.5 Multi-Stage Sampling
1.13.2 Non-Probability Sampling
1.13.2.1 Convenience Sampling
1.13.2.2 Judgmental Sampling
1.13.2.3 Quota Sampling
1.13.3 Inferential Statistics
Chapter 1
5. Scientific Research
• Hypothesis Testing: Statistics are essential in testing hypotheses and validating research findings
across various scientific disciplines.
• Data Collection and Analysis: Researchers use statistical methods to design experiments, collect data,
and analyze results, ensuring the validity and reliability of their studies.
6. Understanding Variability
• Managing Uncertainty: Statistics help in understanding and managing variability in data, which is
inherent in any real-world process or phenomenon.
• Quantifying Differences: Through statistical tests, it’s possible to determine if observed differences in
data are significant or due to random variation.
7. Policy Formulation and Evaluation
• Public Policy: Governments and organizations use statistical data to formulate policies, assess their
impact, and make necessary adjustments.
• Socio-Economic Analysis: Statistics help in understanding social and economic issues, guiding policy
decisions on health, education, employment, and more.
8. Business and Market Research
• Consumer Insights: Businesses use statistics to understand consumer behavior, preferences, and
market trends.
• Product Development: Statistical analysis helps in identifying market needs, leading to the
development of new products and services.
9. Education and Psychology
• Educational Assessment: Statistics are used to analyze educational data, assess student performance,
and improve teaching methods.
• Psychological Research: In psychology, statistics help in studying human behavior, testing theories,
and validating psychological assessments.
10. Healthcare and Medicine
• Clinical Trials: Statistics are crucial in designing and analyzing clinical trials to ensure the efficacy
and safety of new treatments.
• Epidemiology: Statistical methods help in studying the distribution and determinants of health-related
events in populations, guiding public health interventions.
1.1.2 Advantages of Statistics
• Informed Decision Making
– Data-Driven Decisions: Statistics enable decisions based on data rather than intuition, increasing the
reliability and effectiveness of outcomes.
– Risk Management: Statistical analysis helps in identifying and managing risks, allowing for better
planning and mitigation strategies.
• Predictive Analysis
• Quality Control
• Scientific Research
• Understanding Variability
• Policy Formulation and Evaluation
• Business Applications
• Healthcare Applications
• Ethical Issues
– Data Manipulation: There is a risk of manipulating data or using selective statistics to mislead or
support a specific agenda.
– Privacy Concerns: Collecting and analyzing personal data raises privacy and ethical concerns,
especially in sensitive areas like healthcare.
• Resource Intensive
– Time-Consuming: Collecting, analyzing, and interpreting statistical data can be time-consuming and
resource-intensive.
– Cost: Conducting large-scale surveys or experiments can be costly, requiring significant financial and
human resources.
• Statistical Limitations
– Assumptions: Many statistical methods rely on certain assumptions (e.g., normality, independence),
and violating these assumptions can affect the results.
– Causation vs. Correlation: Statistics can identify correlations but not necessarily causation, leading
to potential misinterpretation of cause-and-effect relationships.
• Dynamic Nature of Data
1.1.4 Applications of Statistics
1. Business and Economics
• Market Research: Analyzing consumer behavior, preferences, and market trends to guide marketing
strategies and product development.
• Quality Control: Using statistical methods to monitor and improve product and service quality.
• Financial Analysis: Evaluating investment opportunities, assessing risks, and forecasting financial
trends.
• Operational Efficiency: Optimizing supply chain management, inventory control, and resource
allocation.
2. Healthcare and Medicine
• Clinical Trials: Designing and analyzing clinical trials to determine the efficacy and safety of new drugs
and treatments.
• Epidemiology: Studying the distribution and determinants of health-related events to guide public
health interventions and policy.
• Medical Research: Analyzing data from medical studies to understand disease patterns, treatment
outcomes, and health risks.
• Health Services Management: Improving hospital management, patient care, and resource allocation
through statistical analysis.
3. Social Sciences
• Sociological Research: Analyzing social behaviors, trends, and patterns to understand societal
dynamics and inform policy.
• Psychology: Using statistical methods to validate psychological theories, assess interventions, and
analyze behavioral data.
• Education: Evaluating educational programs, assessing student performance, and improving teaching
methods through data analysis.
4. Engineering and Manufacturing
• Quality Assurance: Applying statistical process control (SPC) to monitor and improve manufacturing
processes.
• Reliability Engineering: Analyzing the reliability and life-cycle of products to enhance durability and
performance.
• Design of Experiments: Optimizing product design and development through systematic
experimentation and analysis.
5. Environmental Science
• Climate Studies: Analyzing climate data to understand trends, model climate change, and predict future
conditions.
• Environmental Monitoring: Assessing pollution levels, natural resource management, and ecological
impacts through statistical analysis.
• Conservation Biology: Studying species populations, habitat use, and conservation strategies using
statistical methods.
6. Government and Public Policy
• Census and Surveys: Collecting and analyzing population data to inform policy decisions and resource
allocation.
• Economic Planning: Using statistical models to forecast economic growth, unemployment, inflation,
and other macroeconomic indicators.
• Policy Evaluation: Assessing the impact and effectiveness of public policies and programs through
data analysis.
7. Sports and Entertainment
• Performance Analysis: Analyzing athlete performance, game statistics, and team strategies to enhance
competitive edge.
• Audience Analytics: Studying viewer preferences, ratings, and engagement to optimize content and
marketing strategies in media and entertainment.
8. Information Technology and Data Science
• Machine Learning: Using statistical methods to develop algorithms for predictive modeling,
classification, and clustering.
• Data Mining: Extracting meaningful patterns and insights from large datasets to inform business
decisions and strategies.
• Cybersecurity: Analyzing security threats, intrusion patterns, and system vulnerabilities through
statistical techniques.
9. Agriculture and Food Science
• Crop Yield Analysis: Studying factors affecting crop yields, pest control, and soil health to improve
agricultural practices.
• Food Safety: Monitoring and analyzing food production processes to ensure safety and compliance
with health regulations.
10. Education
• Assessment and Evaluation: Analyzing student performance data, evaluating educational programs,
and improving instructional methods.
• Educational Research: Using statistical methods to study learning outcomes, teaching effectiveness,
and educational trends.
1.2 Types of Data or Variables in Statistics
1.3 Qualitative vs Quantitative Data
2 https://www.youtube.com/watch?v=E1C5hB0yAM4
1.3.1.1 Nominal Data
• Nominal data is a type of data that consists of categories or names that cannot be ordered or ranked.
• Nominal data is often used to categorize observations into groups, and the groups are not comparable.
• In other words, nominal data has no inherent order or ranking. Therefore, if you changed the order of
its values, the meaning would not change.
• Examples of nominal data include:
• Nominal data can be represented using frequency tables and bar charts, which display the number or
proportion of observations in each category.
• For example, a frequency table for gender might show the number of males and females in a sample of
people.
• Nominal data is analyzed using non-parametric tests, which do not make any assumptions about the underlying
distribution of the data.
• Common non-parametric tests for nominal data include Chi-Squared Tests and Fisher’s Exact Tests. These
tests are used to compare the frequency or proportion of observations in different categories.
1.3.1.2 Ordinal Data
• Ordinal data is a type of data that consists of categories that can be ordered or ranked. However, the distance
between categories is not necessarily equal.
• Ordinal data is nearly the same as nominal data, except that its ordering matters.
• Ordinal data is often used to measure subjective attributes or opinions, where there is a natural order to the
responses.
• Examples of ordinal data include education level (Elementary, Middle, High School, College), job position
(Manager, Supervisor, Employee), etc.
• Note that the difference between Elementary and High School is not the same as the difference between
High School and College. This is the main limitation of ordinal data: the differences between the values
are not really known. Because of that, ordinal scales are usually used to measure non-numeric features like
happiness, customer satisfaction, and so on.
• Ordinal data can be represented using bar charts and line charts. These displays show the order or ranking of the
categories, but they do not imply that the distances between categories are equal.
• Ordinal data is analyzed using non-parametric tests, which make no assumptions about the underlying
distribution of the data.
• Common non-parametric tests for ordinal data include the Wilcoxon Signed-Rank test and Mann-Whitney
U test.
1.3.2 Quantitative Data or Numerical Data
• Quantitative Data 3 talks about quantity: something that we can measure in numbers.
• Quantitative Data is a fundamental component of Statistics, providing a numerical foundation for analysis
and decision-making.
• Quantitative data are data represented numerically, including anything that can be counted, measured, or
given a numerical value.
• It is also called numerical data (i.e., how much, how often, how many).
• The quantitative data type is used to represent quantities, measurements, and observations such as height,
weight, and length.
3 https://www.youtube.com/watch?v=kNARs2oeuk0
4 https://www.youtube.com/watch?v=Cg0W6mod9Hw
1.3.2.1 Discrete Data
• Discrete data is a type of data in statistics that takes only distinct, countable values.
• Discrete data has a finite or countable number of possible values, and those values cannot be subdivided
meaningfully.
• In a Discrete Dataset, apparent gaps or intervals exist between the values. These gaps indicate that there are
no values between the specified data points.
• Examples of discrete data are:
– Marks of the students in a class test
– Number of customers
– Dice rolls: When rolling a six-sided die, the possible outcomes are discrete and countable, ranging
from 1 to 6.
• Discrete data is often analyzed using statistical techniques tailored to discrete variables, such as frequency
distributions, bar charts, and probability calculations. These methods help to summarize and interpret data
that can be counted or categorized into distinct values.
– Changes over time: Continuous data changes over time and can have different values at different time
intervals.
– May or may not have decimals: Continuous data values may or may not be whole numbers.
– Visualized with line graphs or skews: Continuous data is typically visualized with line graphs, and the
skew of its distribution is often examined.
1.4 Other Types of Data/Variables
1.4.1 Primary Data
Primary data is data that is collected for the first time. It is raw data on which no analysis has been
performed.
– Examples: Temperature in Celsius: Differences are meaningful, but 0°C does not mean the absence of
temperature.
– Calendar Years: 2000, 2010, 2020 (the intervals are equal, but there is no true zero year).
• Ratio data uses absolute zero as a reference point for measurement. In other words, Ratio data has a defined
zero point, whereas interval data lacks the absolute zero point.
• Ratio variables never fall below zero. Height and weight are measured from 0 upward and never fall below it.
• Ratio data can include variables like income, height, weight, annual sales, market share, product defect rates,
time to repurchase, unemployment rate, and crime rate. As an analyst, you can say a crime rate of 10% is
twice that of 5%, or annual sales of 2 million are about 33% greater than 1.5 million.
• Interval variables are also commonly known as Scaled variables.
– Examples: Income: 0, 50,000, 100,000 (income of 0 means no income, and you can say that 100,000
is twice as much as 50,000).
– Distance: 0 km, 5 km, 10 km (0 means no distance, and 10 km is twice as far as 5 km).
– Age: 0 years, 25 years, 50 years (0 means no age, and 50 years is twice as old as 25 years).
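The contrast between ratio and interval scales can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the original notes: the helper names `percent_greater` and `celsius_to_kelvin` and the temperatures 10 °C and 20 °C are assumptions chosen for illustration. It shows that ratios are meaningful for ratio data (income, sales) but misleading for interval data such as Celsius temperatures, where zero is not a true zero.

```python
def percent_greater(a, b):
    """Return how much greater a is than b, as a percentage of b."""
    return (a - b) / b * 100


def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return c + 273.15


# Ratio data: zero means "none", so these comparisons are meaningful.
income_ratio = 100_000 / 50_000                      # twice as much
sales_gain = percent_greater(2_000_000, 1_500_000)   # percent increase

# Interval data: 20 °C looks like "twice" 10 °C, but on a scale with a
# true zero (Kelvin) the ratio is only slightly above 1.
naive_ratio = 20 / 10
kelvin_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)

print(income_ratio)
print(round(sales_gain, 1))
print(naive_ratio, round(kelvin_ratio, 3))
```

The Kelvin conversion is used only to make the point that a ratio computed on an interval scale depends on where the arbitrary zero was placed.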
5 https://www.youtube.com/watch?v=kNARs2oeuk0
1.5 Statistical Series
[Figure: classification of statistical series. By characteristics: Time Series, Spatial Series, Condition Series. By construction: Individual Series (or Raw Data), Frequency Array, and Frequency Distribution (Inclusive Series, Exclusive Series, Open End Series, Cumulative Frequency Series, Mid-Value Frequency Series, Equal and Unequal Class Interval Series).]
• Data is important for researchers but in its raw form, it is hardly usable.
• Therefore, data is often organized in series 6 to facilitate analysis and interpretation.
• Each series has its own characteristics and obeys some general principles.
• Such types of series are very important for researchers and economists to gain insights so that they can use
them for actionable purposes.
• A statistical series refers to a set of observations arranged in a particular order based on one or more criteria.
• In other words, arranging data in some logical order, such as according to the time of occurrence, size, or
some other measurable or non-measurable characteristic, is known as a Statistical Series 7.
• Understanding the different types of statistical series is crucial for effectively analyzing and presenting data.
• Statistical Series can be classified:
– On the Basis of Characteristics:
* Time Series
* Spatial Series
* Condition Series
– On the Basis of Construction:
* Individual Series
* Discrete Series
* Continuous Series
6 https://www.youtube.com/watch?v=NWNW1jln8cc
7 https://www.youtube.com/watch?v=VunpIAw5pPg
1.6 On the Basis of Characteristics
When the data is arranged on the basis of qualitative characteristics, statistical series are of three kinds:
• Time Series
• Spatial Series
• Condition Series
• Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a
sequence of discrete-time data.
• Simply, a time series is a statistical series in which the given data is presented with regard to a time unit,
i.e., day, week, month, or year.
• Time series analysis is used for non-stationary data—things that are constantly fluctuating over time or are
affected by time. Industries like finance, retail, and economics frequently use time series analysis because
currency and sales are always changing. Stock market analysis is an excellent example of time series analysis
in action, especially with automated trading algorithms. Likewise, time series analysis is ideal for forecasting
weather changes, helping meteorologists predict everything from tomorrow’s weather report to future years
of climate change.
1.6.2 Spatial Series
• Spatial data is any type of data that directly or indirectly references a specific geographical area or location.
• Example: The following is the sex ratio of 6 different states of India as per the Census of 2011.
1.6.3 Condition Series
• When data is classified according to the changes occurring in a variable under a certain condition,
the series is called a Condition Series.
• For example, students of a certain class may be arranged according to their age, height, weight, marks, etc.
• Example: The following is the table showing the arrangement of 40 students in a class according to their
age. It is a condition series because the data is arranged on the basis of the age of the students.
1.7 On the Basis of Construction
When the data is arranged on the basis of quantitative characteristics, statistical series are of three kinds 8 9 :
• Individual Series
– Unorganized Individual Series
– Organized Individual Series
• Discrete Series
• Continuous Series
– Exclusive Series
– Inclusive Series
– Open-end Distribution
– Cumulative Frequency Series
– Equal and Unequal Class Interval Series
– Mid-value Series
8 https://www.youtube.com/watch?v=NWNW1jln8cc
9 https://www.tutorialspoint.com/statistical-series
1.7.1 Individual Series (or Raw Data)
• Individual series is that series in which the terms are listed singly.
• In simple terms, a separate value of the measurement is given to each item.
• Example: If the marks of 10 students of a class are given individually as 80, 82, 75, 95, 77, 81, 60, 35, 54, and
99, then the resultant series will be an individual series.
• In such series, there is no class of the items and also there is no frequency of the items.
• The two types of individual series are:
– Unorganized Individual Series
– Organized Individual Series
1.7.2 Frequency
• In statistics, the frequency or absolute frequency of an event $i$ is the number $n_i$ of times the observation has
occurred or been recorded in an experiment or study.
• Frequency is basically the number of times a data item occurs in the series. In other words, it deals with how
frequent a data item is in the series.
• Example: Twenty students were asked how many hours they worked per day. Their responses, in hours, are
as follows:
5; 6; 3; 3; 2; 4; 7; 5; 2; 3; 5; 6; 5; 4; 4; 3; 5; 2; 5; 3
• A frequency distribution gives a visual display of the frequency of items, that is, the number of times each item occurred.
• Types of Frequency Distribution:
– Ungrouped Frequency Distribution or Discrete Series
– Grouped Frequency Distribution or Continuous Series
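As a minimal sketch of how the frequencies for the hours-worked example could be tallied, the Python snippet below counts each distinct response and also reports its relative frequency. The function name `frequency_table` and the relative-frequency column are illustrative additions, not part of the original notes.

```python
from collections import Counter

# Responses (hours worked per day) from the twenty students in the example.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]


def frequency_table(values):
    """Return (value, frequency, relative frequency) rows sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [(v, counts[v], counts[v] / total) for v in sorted(counts)]


print(f"{'Hours':>5} {'Frequency':>9} {'Relative':>9}")
for value, freq, rel in frequency_table(hours):
    print(f"{value:>5} {freq:>9} {rel:>9.2f}")
```

Because each distinct value is listed with its own count, the result is an ungrouped frequency distribution (a discrete series).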
10 https://edu.gcfglobal.org/en/statistics-basic-concepts/frequency-tables/1/
1.7.3 Discrete Series or Ungrouped Frequency Distribution or Frequency Array
• Discrete Series is nothing but ungrouped frequency distribution series where different values of the variables
are shown with their respective frequencies.
• The classification of data for a discrete variable is known as Frequency Array.
• In discrete series, data obtained in raw form are presented along with their frequencies. In such a series, data
are not presented in ascending or descending manner.
• Instead, the data and its frequencies are presented in a tabular or grouped manner.
• For example, if the monthly wages of seven employees of a company are 10,000, 12,000, 10,000, 12,000,
13,000, 14,000, and 15,000, then the discrete series will be made as follows:
Monthly wage    Number of employees (frequency)
10,000          2
12,000          2
13,000          1
14,000          1
15,000          1
1.7.4 Continuous Series or Grouped Frequency Distribution
• A discrete series cannot take any value in an interval; therefore, in cases where it is essential to represent
continuous variables with a range of values of different items of a given data, Continuous Series is used.
• In continuous series (grouped frequency distribution), the values of a variable are grouped into several class
intervals (such as 0-5, 5-10, 10-15) along with the corresponding frequencies.
• Other names of Continuous Series are Frequency Distribution, Grouped Frequency Distribution, Series with
Class Intervals, and Series of Grouped Data.
• Different types of Continuous Series 11 12 :
– Inclusive Series
– Exclusive Series
– Open-end Distribution
– Cumulative Frequency Series
– Equal and Unequal Class Interval Series
– Mid-value Series
11 https://www.geeksforgeeks.org/types-of-frequency-distribution/
12 https://www.toppr.com/guides/economics/organisation-of-data/frequency-distribution/
1.7.4.1 Important Terms under Continuous Series
• Class: Class in Continuous Series refers to a group of numbers in which the items are placed. For example,
0-5, 5-10, 10-15, 15-20, 20-25, etc.
• Number of Classes: The decision regarding the number of classes of a given data usually depends upon the
judgement of the individual investigator. Even though there is no strict rule regarding the number of classes,
the number should not be very small or very large.
• Class Limits: In continuous series, the class limit is formed by the two numbers between which every class
is located. The lowest value of the class is known as Lower Limit and the highest value of the class is known
as Upper Limit. For example, if a class is 5 - 10, then 5 is the lower limit and 10 is the upper limit.
• Class Interval: It is the difference between the lower limit and upper limit of a class.
• Range: It is the difference between the lower limit of the first class interval and the upper limit of the last
class interval. For example, if the classes of a distribution are 0-5, 5-10, 10-15, ..., up to 45-50, then
the range will be 50 - 0 = 50.
• Width of Class Intervals: When constructing the frequency distribution, it is suggested that all class
intervals be equal in width. The formula for determining the size or width of each class
interval is as follows:
$$\text{width} = \frac{\text{range}}{\sqrt{\text{sample size}}}$$
• How to make a grouped frequency table?
Example: A sociologist conducted a survey of 20 adults. She wants to report the frequency distribution of
the ages of the survey respondents. The respondents were the following ages in years:
52, 34, 32, 29, 63, 40, 46, 54, 36, 36, 24, 19, 45, 20, 28, 29, 38, 33, 49, 37
Age (years)     Frequency
19 ≤ a < 29     4
29 ≤ a < 39     9
39 ≤ a < 49     3
49 ≤ a < 59     3
59 ≤ a < 69     1
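The construction of the grouped table for the age example can be sketched in Python as follows. This is an illustrative sketch under two assumptions that are not stated in the notes: the width from the formula (range divided by the square root of the sample size) is rounded up to a whole number, and each class includes its lower limit but excludes its upper limit so that no age is counted twice. The function name `grouped_frequency` is hypothetical.

```python
import math

# Ages of the 20 survey respondents from the example.
ages = [52, 34, 32, 29, 63, 40, 46, 54, 36, 36,
        24, 19, 45, 20, 28, 29, 38, 33, 49, 37]


def grouped_frequency(values):
    """Group values into equal-width classes [lower, upper) and count them."""
    low, high = min(values), max(values)
    # width = range / sqrt(sample size); rounding up is an assumption.
    width = math.ceil((high - low) / math.sqrt(len(values)))
    table = []
    lower = low
    while lower <= high:
        upper = lower + width
        count = sum(1 for v in values if lower <= v < upper)
        table.append((lower, upper, count))
        lower = upper
    return table


for lower, upper, count in grouped_frequency(ages):
    print(f"{lower} <= a < {upper}: {count}")
```

Here the range is 63 - 19 = 44 and the square root of 20 is about 4.47, so the formula gives a width of roughly 9.8, which is rounded up to 10.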
1.7.4.2 Inclusive Series
• The series with class intervals, in which all the items having the range from the lower limit up to the upper
limit are included, is known as Inclusive Series.
• However, there is a gap (typically between 0.1 and 1) between the upper limit of one class interval and the lower
limit of the next class interval.
• For example, class intervals of an inclusive series can be, 0-9, 10-19, 20-29, 30-39, and so on. In this case,
the gap between the upper limit of one class interval and the lower limit of the next class interval is 1.
• From the above table of inclusive series, it can be seen that the upper limit of one class interval (say, 9 of
interval 0-9) is not the same as the lower limit of the next class interval (10 of interval 10-19). Also, all the
values that come under 0-9, including 0 and 9 are included in the frequency against 0-9.
• For statistical calculation, sometimes it becomes necessary to convert the inclusive series into an exclusive
series. Suppose, in the above example, some students have obtained marks such as 10.5, 40.5, etc. In this
case, the series must be converted into an exclusive series.
1.7.4.3 Exclusive Series
• The series with class intervals, in which all the items having the range from the lower limit to the value just
below its upper limit are included, is known as the Exclusive Series.
• For example, if a class interval is 0-10, and the values of the given series are 4, 10, 2, 15, 8, and 9, then only
4, 2, 8, and 9 will be included in the 0-10 class interval. 10 and 15 will be included in the next class interval,
i.e., 10-20.
• In Exclusive Series, the upper limit of a class interval is the lower limit of the next class interval.
• From the above table of exclusive series, it can be seen that the upper limit of the first class interval is the
lower limit of the second class interval, and so on.
• If the data includes a value 10, it will be included in the class interval 10-20, not in 0-10.
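The exclusive rule can be illustrated with a short Python sketch that places each value of the example (4, 10, 2, 15, 8, 9) into its class. The function name `assign_exclusive` and its parameters are hypothetical; the only rule it encodes is that a class contains its lower limit but not its upper limit.

```python
def assign_exclusive(values, lower, width, n_classes):
    """Place each value in the class [lower_i, upper_i); upper limit excluded."""
    classes = {
        (lower + i * width, lower + (i + 1) * width): []
        for i in range(n_classes)
    }
    for v in values:
        for (lo, hi), members in classes.items():
            if lo <= v < hi:
                members.append(v)
                break
    return classes


result = assign_exclusive([4, 10, 2, 15, 8, 9], lower=0, width=10, n_classes=2)
for (lo, hi), members in result.items():
    print(f"{lo}-{hi}: {members}")
```

The value 10 lands in the class 10-20 rather than 0-10, which is exactly the behaviour described above.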
1.7.4.4 Conversion of Inclusive Series into Exclusive Series
• For statistical calculation, sometimes it becomes necessary to convert the inclusive series into exclusive
series.
• Suppose, in the above example, some students have obtained marks such as 10.5, 40.5, etc. In this case, the
series must be converted into an exclusive series.
• The steps for converting an inclusive series into exclusive series are:
– In the first step, calculate the difference between the upper class limit of one class interval and the
lower limit of the next class interval.
– The next step is to divide the difference by two, then add the resulting value to the upper limit of
every class interval and subtract it from the lower limit of every class interval.
• The inclusive series of the above example is converted into exclusive series as under:
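The two conversion steps can be sketched in Python. This is a minimal illustration, assuming the classes are given as (lower, upper) pairs, that there are at least two classes, and that the gap between consecutive classes is the same throughout; the function name `inclusive_to_exclusive` is hypothetical, and the class limits are the ones used in the inclusive series example (0-9, 10-19, 20-29, 30-39).

```python
def inclusive_to_exclusive(classes):
    """Convert inclusive class limits to exclusive ones by closing the gap."""
    # Step 1: gap between one upper limit and the next lower limit.
    gap = classes[1][0] - classes[0][1]
    # Step 2: halve the gap, subtract it from every lower limit and
    # add it to every upper limit.
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in classes]


inclusive = [(0, 9), (10, 19), (20, 29), (30, 39)]
exclusive = inclusive_to_exclusive(inclusive)
for lower, upper in exclusive:
    print(f"{lower} - {upper}")
```

After conversion, the upper limit of each class equals the lower limit of the next, so a mark such as 9.5 or 10.5 has an unambiguous class.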
1.7.4.5 Difference between Inclusive and Exclusive Series
• In Inclusive Series, the upper limit of one class interval is not the same as the lower limit of the next class
interval. There is a gap ranging from 0.1 to 1.0 between the upper class limit of one class interval and the
lower class limit of the next class interval. However, in the Exclusive Series, the upper limit of one class
interval is the same as the lower limit of the next class interval.
• In the case of Inclusive Series, the value of the upper and the lower limit are included in that class interval
only. However, in the case of Exclusive Series, the value of upper limit of a class interval is not included in
that interval, instead, it is included in the next class interval.
• Inclusive Series is suitable for an investigator only if the values are whole numbers and not in decimal
form. However, an Exclusive Series is suitable for an investigator whether the values are whole numbers or
in decimal form.
• Counting in Inclusive Series is possible only after converting it into an Exclusive Series. However, counting
in Exclusive Series is possible in all cases.
1.7.5 Open End Series
• Sometimes the lower limit of the first class interval and the upper limit of the last class interval are not
available; instead, Less than or Below is mentioned in the former case (in place of the lower limit of the first
class interval), and More than or Above is mentioned in the latter case (in place of the upper limit of the last
class interval). These types of series are known as Open End Series.
• For statistical calculations, if one needs to assign limits to the open-end first and last class intervals, it
can be done by the general practice of giving these intervals the same magnitude or class size as the
other class intervals.
• In the above example, the magnitude of other class intervals is 5. Therefore, the open-end class intervals can
be written as 5-10 and 30-35, respectively.
1.8 Types of Statistics
Statistics can be broadly classified into two main types:
1. Descriptive Statistics
2. Inferential Statistics
• Data Representations:
– Histograms: Bar graphs representing the frequency distribution of numerical data.
– Bar Charts: Graphs representing categorical data with rectangular bars.
– Pie Charts: Circular charts divided into sectors representing proportions.
– Box Plots: Visual representations of the distribution of data based on five-number summaries (minimum,
first quartile, median, third quartile, and maximum).
– Pictograph
– Frequency Distribution
1.9.2 Measures of Central Tendency
• Central tendencies in statistics are the numerical values that are used to represent the mid-value or central value
of a large collection of numerical data. These numerical values are called central values in statistics.
• Measures of central tendency are statistical metrics that describe the center of a distribution or dataset
with a single representative value.
• Such a value is of great significance because it depicts the nature or characteristics of the entire data, which
is otherwise very difficult to observe.
• The three most common measures of central tendency are:
– Mean : provides the average value of the dataset
– Median: provides the central value of the dataset
– Mode: provides the most frequent value in the dataset
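To make the three measures concrete, the sketch below applies Python's standard `statistics` module to the hours-worked responses from the frequency example (20 students). Reusing that dataset here is a choice made for illustration, not something the notes prescribe.

```python
import statistics

# Hours worked per day, reported by twenty students.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

mean_hours = statistics.mean(hours)      # sum of values / number of values
median_hours = statistics.median(hours)  # middle value of the sorted data
mode_hours = statistics.mode(hours)      # most frequent value

print("mean:", mean_hours)
print("median:", median_hours)
print("mode:", mode_hours)
```

With an even number of observations, `statistics.median` averages the two middle values of the sorted data.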
1.10 Population vs Sample
• Population: A collection or set of individuals or objects or events whose properties are to be analyzed.
• Sample: A subset of the population is called ‘Sample’. A well-chosen sample will contain most of the
information about a particular population parameter.
• Outliers: An outlier is a data point that differs significantly from the majority of the data taken from a sample
or population. There are many possible causes of outliers, but here are a few to start you off:
– Natural variation in data
– Change in the behavior of the observed system
– Errors in data collection
1.11 Mean
• Mean is the most commonly used measure of central tendency in statistics.
• Mean refers to the average value of a given set of data.
• The method of finding the mean differs depending on the type of data (Grouped or Ungrouped
Data).
• Mean is also referred to as the average.
• Weighted Mean
When not specified, the mean is generally referred to as the arithmetic mean.
1.12 Arithmetic Mean
1.12.1 How to Calculate Arithmetic Mean?
There are three ways to determine the arithmetic mean of grouped or ungrouped data (Individual, Discrete
and Continuous Series) 13 14:
• Direct Method
• Assumed Mean Method or Short-Cut Method
• Step Deviation Method
13 https://www.youtube.com/playlist?list=PLYwJOKtPsLuiFjFGKDFoPZOM0g4JBKUrj
14 https://www.youtube.com/playlist?list=PLEHGYFbPuuMEhz_AU8iCrBTYb5eNtFpeg
1.12.2 Mean of Raw Data or Individual Series
• Raw data is a dataset that simply lists all the observations in no particular order.
• The series in which the items are listed singly is known as Individual Series.
• The mean of raw data is calculated by adding up all the observations and dividing the sum by the total number of
observations in the set.
• Mean = Sum of all Observations ÷ Total number of Observations
• The population mean is represented by the Greek letter µ (mu).
• The sample mean is represented by x̄ (x-bar).
• The sample mean is usually the best, unbiased estimate of the population mean. However, the mean is
influenced by extreme values (outliers) and may not be the best measure of center with strongly skewed data.
1.12.2.1 Direct Method
• The following equations compute the population mean and sample mean:
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
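The direct method translates almost word for word into code. The following is a minimal sketch; the function name `direct_mean` is a hypothetical helper, and the data set is the one used in the assumed mean example of these notes.

```python
from statistics import fmean

def direct_mean(values):
    """Arithmetic mean by the direct method: sum of observations / number of observations."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)

data = [73, 75, 76, 78, 79]

# Sum is 381 and there are 5 observations, so the mean is about 76.2.
print(direct_mean(data))

# The standard library gives the same result and can serve as a cross-check.
print(fmean(data))
```

The same function serves for a population mean or a sample mean; only the interpretation of the result differs.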
1.12.2.2 Assumed Mean Method
• Assumed mean method 15 finds the actual mean of the data by first assuming a mean value.
• When the calculation of the mean for raw data using the direct method becomes very tedious, then the mean
can be calculated using the assumed mean method.
• The direct method requires adding the observations themselves, which produces large sums when the values are
large. The assumed mean method, also known as a shift of origin, reduces the likelihood of calculation errors
because it gives you smaller numbers to work with (including negative deviations that offset the positive ones
and keep the sum small).
• The Assumed Mean method simplifies the calculation of the arithmetic mean by reducing the size of the
numbers involved in the calculation, making it easier to compute, thus suitable if your data set has large
values.
• The following equations compute the population mean and sample mean:
$$\mu = A + \frac{\sum d_i}{N}$$
where A is the assumed mean and di = xi − A is the deviation of each observation from the assumed mean
$$\bar{x} = A + \frac{\sum d_i}{n}$$
• Advantages:
– Simplifies arithmetic by using smaller numbers.
– Reduces computational complexity.
• Disadvantages:
– Assumed mean is still a central value, so deviations might still be relatively large.
• How to Calculate Mean using Assumed Mean Method?: We can calculate mean using the assumed mean
method by following the below steps:
1. Choose an Assumed Mean (A): Select a value from the data, often a central value, to act as an assumed
mean.
2. Calculate the Deviations (d): Subtract the assumed mean from each data point to find the deviation
di = xi − A, where xi is each data point.
3. Find the Sum of Deviations (∑ di ): Add up all the deviations.
4. Calculate the Mean using the above formulas
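The four steps above can be sketched in a few lines of Python. The function name `assumed_mean` and its parameter name `A` are hypothetical and chosen to mirror the notation of the formula. The data set is the one from the example 40, 50, 55, 78, 58 in these notes.

```python
def assumed_mean(values, A):
    """Arithmetic mean via the assumed mean (shift of origin) method."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    # Step 2: deviation of every observation from the assumed mean A.
    deviations = [x - A for x in values]
    # Steps 3 and 4: add the deviations, divide by n, and shift back by A.
    return A + sum(deviations) / len(values)

data = [40, 50, 55, 78, 58]

# With A = 40 the deviations are 0, 10, 15, 38, 18 (sum 81), so the mean is about 56.2.
print(assumed_mean(data, A=40))

# Any other choice of A gives the same mean; only the size of the deviations changes.
print(assumed_mean(data, A=55))
```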
15 https://testbook.com/maths/assumed-mean-method
• Example:
– Assume your data set is 73, 75, 76, 78 and 79.
– Sort your data set from smallest to largest.
– Assume a mean. This should be a number that you feel is a close representation of your data set.
– In a simple example, take the number in the center of your data set; in this case 76.
– Subtract your assumed mean from each data entry.
– In our example, 73 − 76 = −3, 75 − 76 = −1, 76 − 76 = 0, 78 − 76 = 2 and 79 − 76 = 3
– Add together these differences from the mean.
– (−3) + (−1) + 0 + 2 + 3 = 1
– Divide the sum of the differences from assumed mean by the number of data points.
– 1/5 = 0.2
– Add the result of the division to your assumed mean.
– Mean = 76 + 0.2 = 76.2
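• Example: Find the mean of the following data using the assumed mean method: 40, 50, 55, 78, 58
Taking A = 40, the deviations are 0, 10, 15, 38 and 18, so ∑ di = 81 and n = 5.
$$\bar{x} = A + \frac{\sum_{i=1}^{n} d_i}{n} = 40 + \frac{81}{5} = 56.2$$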
• Find the average for the following data using Assumed mean method
$$\bar{x} = A + \frac{\sum d}{N} = 8 + \frac{17}{10} = 9.7$$
1.12.2.3 Step-Deviation Method
• The Step Deviation method is an extension of the Assumed Mean method.
• This method further simplifies calculations by choosing a common factor (step size) to reduce the size of the
deviations from an assumed mean.
• Advantage:
– The step deviations simplify the calculations, especially when the original deviations are large or
awkward to work with.
– Makes it easier to work with data when the values are spread out over a large range.
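• Disadvantage:
• How to Calculate Mean using Step Deviation Method?: We can calculate the mean using the step deviation
method by following the below steps: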
– Choose an Assumed Mean (A): Select a value close to the center of your data as the assumed mean.
This value can be one of the data points.
– Calculate the Deviations (d): Subtract the assumed mean from each data point to find the deviation
di = xi − A, where xi is each data point.
– Select a Common Factor (h): Choose a common factor h (also known as the step size), which could be
a convenient value, such as 2, 5, 10, etc., depending on the data range.
– Calculate Step Deviations: Divide each deviation by the chosen factor h to obtain the step deviations ui .
$$u_i = \frac{d_i}{h} = \frac{x_i - A}{h}$$
– Find the Sum of Step Deviations (∑ ui ): Add up all the step deviations
– Calculate the Mean: The following equations compute the population mean and sample mean:
$$\mu = A + h \times \frac{\sum u_i}{N}$$
$$\bar{x} = A + h \times \frac{\sum u_i}{n}$$
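The procedure above can be sketched as follows. The function name `step_deviation_mean` is a hypothetical helper, and the parameters `A` and `h` mirror the symbols in the formula. The data set is the ungrouped example 47, 53, 59, 65, 71 used in these notes.

```python
def step_deviation_mean(values, A, h):
    """Arithmetic mean via the step deviation method (shift of origin and change of scale)."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    if h == 0:
        raise ValueError("the step size h must be non-zero")
    # Step deviations u_i = (x_i - A) / h.
    step_deviations = [(x - A) / h for x in values]
    # Scale the average step deviation back by h and shift back by A.
    return A + h * sum(step_deviations) / len(values)

data = [47, 53, 59, 65, 71]

# With A = 59 and h = 6 the step deviations are -2, -1, 0, 1, 2 (sum 0), so the mean is 59.
print(step_deviation_mean(data, A=59, h=6))
```

Choosing h as a common factor of the deviations keeps every step deviation a whole number, which is the point of the method when working by hand.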
• Example: Let’s consider the following ungrouped data: 47, 53, 59, 65, 71
1. Choose an Assumed Mean (A): Select A = 59 (a central value from the data).
2. Calculate Deviations (d):
(a) d1 = 47 − 59 = −12
(b) d2 = 53 − 59 = −6
(c) d3 = 59 − 59 = 0
(d) d4 = 65 − 59 = 6
(e) d5 = 71 − 59 = 12
3. Select a Common Factor (h): Choose h = 6 (a convenient value given the range of deviations).
4. Calculate Step Deviations (ui ):
(a) u1 = −12/6 = −2
(b) u2 = −6/6 = −1
(c) u3 = 0/6 = 0
(d) u4 = 6/6 = 1
(e) u5 = 12/6 = 2
5. Find the Sum of Step Deviations (∑ ui ):
∑ ui = (−2) + (−1) + 0 + 1 + 2 = 0
6. Calculate the Arithmetic Mean using the above formula:
$$\bar{x} = A + h \times \frac{\sum u_i}{n} = 59 + 6 \times \frac{0}{5} = 59$$
• Example: Find the mean of the following data using direct method, assumed mean method and step deviation
method. 40, 50, 55, 78, 58
• Find the average for the following data using step-deviation method.
$$\bar{x} = A + h \times \frac{\sum u_i}{n} = 60 + 5 \times \frac{0}{5} = 60$$
1.12.2.4 Assumed Mean Method vs Step Deviation Method
• Both methods return exactly the same arithmetic mean as the direct method; they differ only in how far they
simplify the arithmetic. Neither requires the true mean to be known in advance, because A is only a convenient guess.
• The assumed mean method shifts the origin: it works with the deviations di = xi − A. It is appropriate when the
deviations are already small or share no convenient common factor.
• The step deviation method shifts the origin and also changes the scale: it works with ui = (xi − A)/h. It is
appropriate when the deviations share a common factor h, for example when the values are equally spaced.
• The standard deviation formulas below do not depend on which of the two methods was used to find x̄. They
differ only in the divisor.
• When the data are the entire population, divide by n:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$
• When the data are a sample used to estimate the population standard deviation, divide by n − 1:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
• To summarize:
– Use the assumed mean method when subtracting A alone gives numbers that are easy to add.
– Use the step deviation method when the deviations also share a common factor h that can be divided out.
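A quick way to compare the two methods is to check that both return the same mean as the direct method. The sketch below does this with exact fractions so that no rounding is involved. It assumes integer observations, and the function names `direct`, `assumed` and `step` are hypothetical helpers. The data set 40, 50, 55, 78, 58 is the one used in the examples of these notes.

```python
from fractions import Fraction

def direct(values):
    """Direct method: sum of observations over their count."""
    return Fraction(sum(values), len(values))

def assumed(values, A):
    """Assumed mean method: A plus the average deviation from A."""
    return A + Fraction(sum(x - A for x in values), len(values))

def step(values, A, h):
    """Step deviation method: A plus h times the average step deviation."""
    total_u = sum(Fraction(x - A, h) for x in values)
    return A + h * total_u / len(values)

data = [40, 50, 55, 78, 58]

m_direct = direct(data)
m_assumed = assumed(data, A=55)
m_step = step(data, A=55, h=5)

# All three are the exact fraction 281/5, that is 56.2.
print(m_direct, m_assumed, m_step)
print(m_direct == m_assumed == m_step)
```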
1.12.3 Mean of Ungrouped Frequency Distribution or Discrete Series
• In a discrete series (ungrouped frequency distribution), each distinct value of the variable is listed once,
together with the number of times it is repeated.
• It means that the frequencies are given corresponding to the different values of the variable.
• The total number of observations in a discrete series, N, equals the sum of the frequencies, which is ∑ fi .
• Example of Discrete Series: If 6 students of a class score 50 marks, 4 students score 60 marks, 7 students
score 70 marks, 3 students score 80 marks, and 5 students score 90 marks, then this information will be
shown as:
Marks (xi ):               50   60   70   80   90
Number of students ( fi ):  6    4    7    3    5
1.12.3.1 Direct Method
1. List the Data: Prepare a frequency table with values (xi ) and their corresponding frequencies ( fi )
2. Calculate the Product of (xi ) and ( fi ): Multiply each value by its frequency to get xi · fi
3. Find the Sum of the Products ∑(xi · fi ): Add all the products together.
4. Calculate the Mean: Divide the sum of the products by the total frequency ∑ fi .
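The steps above can be sketched in Python as follows. The function name `frequency_mean` is a hypothetical helper. The data are the marks example from the discrete series section of these notes (6 students with 50 marks, 4 with 60, 7 with 70, 3 with 80 and 5 with 90).

```python
def frequency_mean(values, frequencies):
    """Mean of a discrete series: sum(x_i * f_i) / sum(f_i)."""
    if len(values) != len(frequencies):
        raise ValueError("every value needs exactly one frequency")
    total_frequency = sum(frequencies)
    if total_frequency == 0:
        raise ValueError("the total frequency must be positive")
    # Multiply each value by its frequency and add the products.
    sum_of_products = sum(x * f for x, f in zip(values, frequencies))
    return sum_of_products / total_frequency

marks = [50, 60, 70, 80, 90]
students = [6, 4, 7, 3, 5]

# Products: 300, 240, 490, 240, 450 (sum 1720); total frequency 25; mean about 68.8.
print(frequency_mean(marks, students))
```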
• Example:
• Example:
$$\bar{x} \text{ or } \mu = \frac{\sum x_i f_i}{\sum f_i} = \frac{264}{28} \approx 9.43$$
• Calculate the mean of the following distribution, which represents the scores obtained by students in a quiz.
1.12.4 Practice Questions
• Calculate the mean for the following set of data 2, 6, 7, 9, 15, 11, 13, 12
• If there are 5 observations, which are 27, 11, 17, 19, and 21 then find the mean
• Find the mean for the following sample data set: 6.4, 5.2, 7.9, 3.4
1.13 Sampling
• Why do we need sampling?: Consider a scenario wherein you're asked to perform a survey about the eating
habits of teenagers in the US. There are over 42 million teens in the US at present and this number is
growing as you read this. Is it possible to survey each of these 42 million individuals about their eating
habits? Obviously not! That's why sampling is used.
• How can one choose a sample that best represents the entire population? Sampling is a statistical
method that deals with the selection of individual observations within a population that best represent the
entire population.
1.13.1 Probability Sampling
• Probability sampling techniques ensure that every member of the population has a known and non-zero
chance of being selected.
• The following types of probability sampling are covered here:
– Simple Random Sampling or Random Sampling
– Systematic Sampling
– Stratified Sampling
– Cluster Sampling
– Multi-Stage Sampling
1.13.1.1 Random Sampling
• In this method, each member of the population has an equal chance of being selected in the sample.
• Example: A company wants to survey its employees’ job satisfaction. They use a random number generator
to select 50 employees out of 500, ensuring each employee has an equal chance of being chosen.
• Advantages:
– Easy to implement.
– Reduces selection bias.
• Disadvantages:
– Requires a complete list of the population.
– May not be practical for large populations.
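The employee survey example can be sketched with Python's standard random module. This is a minimal illustration: the function name `simple_random_sample`, the `seed` parameter and the employee IDs 1 to 500 are assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every unit is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical sampling frame: employee IDs 1..500 (a complete list is required).
employees = list(range(1, 501))

chosen = simple_random_sample(employees, 50, seed=42)
print(len(chosen), "employees selected")
print(sorted(chosen)[:10])
```

Note that the code needs the full list `employees`, which reflects the first disadvantage listed above.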
1.13.1.2 Systematic Sampling
• In Systematic sampling, every nth record is chosen from the population to be a part of the sample after a
random starting point.
• Example: In a factory with 1000 products, an inspector selects every 10th product for quality testing, starting
from a randomly chosen product among the first ten (for example, the 5th).
• Advantages:
– Simple and quick to implement.
– Ensures a spread across the population.
• Disadvantages:
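The factory example can be sketched as follows. The function name `systematic_sample`, the parameter `k` for the sampling interval and the product numbering 1 to 1000 are assumptions made for illustration.

```python
import random

def systematic_sample(population, k, seed=None):
    """Pick a random start among the first k units, then take every k-th unit after it."""
    if k <= 0:
        raise ValueError("the sampling interval k must be positive")
    rng = random.Random(seed)
    start = rng.randrange(k)      # random index in 0 .. k-1
    return population[start::k]   # every k-th unit from the start onward

# Hypothetical production run: products numbered 1..1000.
products = list(range(1, 1001))

inspected = systematic_sample(products, k=10, seed=1)
print(len(inspected), "products inspected")
print(inspected[:5])
```

Only the starting point is random; once it is fixed, the rest of the sample is fully determined.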
1.13.1.3 Stratified Sampling
• Stratified sampling divides the population into strata (subgroups; the singular is stratum).
• A stratum is a subset of the population that shares at least one common characteristic.
• After this, the random sampling method is used to select a sufficient number of subjects from each stratum.
• Example: A researcher wants to study the income levels of different age groups. They divide the population
into age strata (e.g., 18-29, 30-49, 50-69) and randomly select individuals from each stratum.
• Advantages:
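The age-group example can be sketched in Python as below. The function names `stratified_sample` and `age_band`, the fixed sample size per stratum and the synthetic population are all assumptions made for illustration; a real study might instead sample each stratum in proportion to its size.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, per_stratum, seed=None):
    """Group units into strata, then draw a simple random sample inside each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    return {
        name: rng.sample(members, min(per_stratum, len(members)))
        for name, members in strata.items()
    }

def age_band(person):
    """Map a (person_id, age) pair to the age strata used in the example."""
    age = person[1]
    if age <= 29:
        return "18-29"
    if age <= 49:
        return "30-49"
    return "50-69"

# Synthetic population of (person_id, age) pairs with ages between 18 and 69.
people = [(i, 18 + i % 52) for i in range(520)]

by_stratum = stratified_sample(people, age_band, per_stratum=5, seed=3)
for band, members in sorted(by_stratum.items()):
    print(band, members)
```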
1.13.1.4 Cluster Sampling
• Cluster sampling divides the population into clusters, randomly selects some clusters, and then samples all or
some members within those clusters.
• Example: A school district wants to evaluate student performance. They randomly select 5 out of 20 schools
(clusters) and then test all students in those selected schools.
• Advantages:
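The school district example can be sketched as follows, for the variant in which every member of a selected cluster is included. The function name `cluster_sample`, the school names and the 30 students per school are assumptions made for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every member of the chosen ones."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names, not members
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical district: 20 schools with 30 students each.
schools = {
    f"school_{i:02d}": [f"s{i:02d}_{j:02d}" for j in range(30)]
    for i in range(1, 21)
}

tested = cluster_sample(schools, n_clusters=5, seed=7)
print(sorted(tested))
print(sum(len(students) for students in tested.values()), "students tested")
```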
1.13.1.5 Multi-Stage Sampling
• Multistage sampling is an extension of cluster sampling in that, first, clusters are randomly selected and,
second, sample units within the selected clusters are randomly selected.
• It involves multiple stages of sampling, where each stage becomes progressively smaller and more focused.
• Advantages:
– Flexible and cost-effective.
– Suitable for large-scale surveys.
• Disadvantages:
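A two-stage version of this idea can be sketched as follows: clusters are drawn at random first, and units are then drawn at random inside each chosen cluster. The function name `two_stage_sample` and the synthetic schools are assumptions made for illustration; more stages would repeat the same pattern.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: pick units at random within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }

# Hypothetical population: 20 schools (clusters) with 30 students each.
schools = {
    f"school_{i:02d}": [f"s{i:02d}_{j:02d}" for j in range(30)]
    for i in range(1, 21)
}

selected = two_stage_sample(schools, n_clusters=5, n_per_cluster=10, seed=11)
for school, students in sorted(selected.items()):
    print(school, len(students))
```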
1.13.2 Non-Probability Sampling
• Non-probability sampling techniques do not provide every individual with a known or equal chance of being
selected.
• These techniques are often used when probability sampling is not feasible.
• In snowball sampling, participants recruit other participants from among their acquaintances. Thus the sample
group is said to grow like a rolling snowball.
• Example: A researcher studying a rare disease starts with one patient and asks them to refer other patients
they know.
• Advantages:
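The referral process can be sketched as a breadth-first walk over a referral network. The function name `snowball_sample`, the `referrals` dictionary and the patient labels are assumptions made for illustration; in practice the referrals are only discovered as the study proceeds.

```python
from collections import deque

def snowball_sample(referrals, seeds, max_size):
    """Start from the seed participants and follow referrals until max_size is reached."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        # Each recruited participant refers acquaintances not yet contacted.
        for contact in referrals.get(person, []):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return sample

# Hypothetical referral network among patients.
referrals = {
    "P1": ["P2", "P3"],
    "P2": ["P4"],
    "P3": ["P4", "P5"],
    "P4": [],
    "P5": ["P1"],
}

recruited = snowball_sample(referrals, seeds=["P1"], max_size=4)
print(recruited)
```

The sample depends entirely on who the seeds know, which is why the result cannot be treated as a random sample.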
1.13.2.3 Quota Sampling
• Quota sampling is a non-probability sampling method that can be seen as a non-probabilistic version of
stratified sampling.
• It relies on the non-random selection of a predetermined number or proportion of units, called a quota, which
ensures that specific characteristics are represented in the sample.
• You first divide the population into mutually exclusive subgroups (called strata) and then recruit sample units
until you reach your quota. The units in each subgroup share specific characteristics, which you determine
before forming the strata.
• Example: A researcher ensures that their sample includes a certain number of men and women, age groups,
and ethnic backgrounds, reflecting the population’s proportions.
• Advantages:
– Ensures representation of specific groups.
– More practical than stratified sampling.
• Disadvantages:
– Can introduce bias.
– Not random, limiting generalizability.
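The recruiting rule can be sketched as follows: respondents are taken in the order they become available, and each one is accepted only while the quota of their subgroup is still open. The function name `quota_sample`, the quotas and the synthetic list of arrivals are assumptions made for illustration.

```python
def quota_sample(arrivals, group_of, quotas):
    """Accept units in arrival order until the quota of every subgroup is filled."""
    remaining = dict(quotas)
    sample = []
    for unit in arrivals:
        group = group_of(unit)
        if remaining.get(group, 0) > 0:
            sample.append(unit)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # all quotas are filled
    return sample

# Hypothetical stream of (respondent_id, gender) pairs in arrival order.
arrivals = [(i, "M" if i % 3 else "F") for i in range(30)]

picked = quota_sample(arrivals, group_of=lambda unit: unit[1], quotas={"M": 3, "F": 3})
print(picked)
```

No random draw is involved: whoever arrives first fills the quota, which is the source of the bias listed above.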
1.13.3 Inferential Statistics
• Inferential statistics offers methods for studying experiments done on small samples of a population and for
drawing inferences about the entire population.
• Inferential statistics uses the principles of probability to examine whether patterns seen in a research sample
may be extrapolated to the larger population from which the sample was taken.
• Inferential statistics is used to make generalizations about the population, in addition to testing hypotheses and
examining connections between variables from samples.