Introduction to Statistics

The document is an introduction to data science, focusing on statistics fundamentals, including definitions, types of data, and various statistical methods. It covers topics such as descriptive statistics, sampling methods, and measures of central tendency. The notes serve as a comprehensive guide for understanding the foundational concepts of statistics in data science.

Uploaded by

meowyoongi159
Introduction to Data Science Notes

by

Farhan Sufyan
Contents

1 Introduction to Statistics Fundamentals
1.1 What is Statistics?
1.1.1 Need of Statistics
1.1.2 Advantages of Statistics
1.1.3 Disadvantages of Statistics
1.1.4 Applications of Statistics
1.2 Types of Data or Variables in Statistics
1.3 Qualitative vs Quantitative Data
1.3.1 Qualitative Data or Categorical Data
1.3.1.1 Nominal Data
1.3.1.2 Ordinal Data
1.3.2 Quantitative Data or Numerical Data
1.3.2.1 Discrete Data
1.3.2.2 Continuous Data
1.4 Other Types of Data/Variables
1.4.1 Primary Data?
1.4.2 Binary (Dichotomous) Variable
1.4.3 Interval vs Ratio Data
1.4.3.1 Interval Data
1.4.3.2 Ratio Data
1.5 Statistical Series
1.6 On the Basis of Characteristics
1.6.1 Time series
1.6.2 Spatial Series
1.6.3 Condition Series
1.7 On the Basis of Construction
1.7.1 Individual Series (or Raw Data)
1.7.2 Frequency
1.7.2.1 Frequency Table
1.7.2.2 Types of Frequency
1.7.2.3 Frequency Distribution
1.7.3 Discrete Series or UnGrouped Frequency Distribution or Frequency Array
1.7.4 Continuous Series or Grouped Frequency Distribution
1.7.4.1 Important Terms under Continuous Series
1.7.4.2 Inclusive Series
1.7.4.3 Exclusive Series
1.7.4.4 Conversion of Inclusive Series into Exclusive Series?
1.7.4.5 Difference between Inclusive and Exclusive Series
1.7.5 Open End Series
1.8 Types of Statistics
1.9 Descriptive Statistics
1.9.1 Key Components of Descriptive Statistics:
1.9.2 Measures of Central Tendency
1.10 Population vs Sample
1.11 Mean
1.11.1 Types of Mean
1.12 Arithmetic Mean
1.12.1 How to Calculate Arithmetic Mean?
1.12.2 Mean of Raw Data or Individual Series
1.12.2.1 Direct Method
1.12.2.2 Assumed Mean Method
1.12.2.3 Step-Deviation Method
1.12.2.4 Assumed Mean Method vs Step Deviation Method
1.12.3 Mean of Ungrouped Frequency Distribution or Discrete Series
1.12.3.1 Direct Method
1.12.4 Practice Questions
1.13 Sampling
1.13.1 Probability Sampling
1.13.1.1 Random Sampling
1.13.1.2 Systematic Sampling
1.13.1.3 Stratified Sampling
1.13.1.4 Cluster Sampling
1.13.1.5 Multi-Stage Sampling
1.13.2 Non-Probability Sampling
1.13.2.1 Convenience Sampling
1.13.2.2 Judgmental Sampling
1.13.2.3 Quota Sampling
1.13.3 Inferential Statistics
Chapter 1

Introduction to Statistics Fundamentals

1.1 What is Statistics?


• The term statistics, derived from the word state, was used to refer to a collection of facts of interest to the
state. Statistics is the art of learning from data.
• Statistics is a branch of applied mathematics concerned with collecting, describing, analyzing, and drawing
conclusions from quantitative data [1].
• Statistics involves using mathematical techniques to summarize and describe data, as well as to draw
conclusions and make decisions based on data.

1.1.1 Need of Statistics


Statistics provide the necessary tools to collect, analyze, interpret, and present data effectively. They help in making
decisions that are based on data rather than assumptions, thus reducing uncertainty and increasing the reliability of
conclusions drawn from data. Here are several key reasons why statistics are essential:

1. Data Analysis and Interpretation


• Extracting Meaningful Information: Statistics provide tools and methodologies to analyze large
volumes of data, extracting patterns and trends that might not be immediately apparent.
• Summarizing Data: Measures like mean, median, and standard deviation help summarize complex
data sets into understandable metrics.
2. Decision Making
• Evidence-Based Decisions: In fields like business, medicine, and public policy, decisions must be
based on data rather than intuition. Statistical analysis helps make informed and objective decisions.
• Risk Assessment: Statistics help in evaluating risks and uncertainties, enabling better decision-making
in uncertain conditions.
3. Predictive Analysis
• Forecasting: Statistical models can predict future trends based on historical data, which is crucial in
finance, economics, weather forecasting, and more.
• Trend Analysis: By analyzing past and present data, statistics help identify trends that can influence
future strategies.
4. Quality Control and Improvement
• Monitoring Processes: In manufacturing and service industries, statistical methods are used to monitor
and control processes, ensuring products and services meet quality standards.
• Continuous Improvement: Statistical tools like Six Sigma and control charts help in identifying areas
for improvement and in maintaining high-quality standards.
[1] https://www.youtube.com/watch?v=p205pOUEYMk
5. Scientific Research
• Hypothesis Testing: Statistics are essential in testing hypotheses and validating research findings
across various scientific disciplines.
• Data Collection and Analysis: Researchers use statistical methods to design experiments, collect data,
and analyze results, ensuring the validity and reliability of their studies.
6. Understanding Variability
• Managing Uncertainty: Statistics help in understanding and managing variability in data, which is
inherent in any real-world process or phenomenon.
• Quantifying Differences: Through statistical tests, it’s possible to determine if observed differences in
data are significant or due to random variation.
7. Policy Formulation and Evaluation
• Public Policy: Governments and organizations use statistical data to formulate policies, assess their
impact, and make necessary adjustments.
• Socio-Economic Analysis: Statistics help in understanding social and economic issues, guiding policy
decisions on health, education, employment, and more.
8. Business and Market Research

• Consumer Insights: Businesses use statistics to understand consumer behavior, preferences, and
market trends.
• Product Development: Statistical analysis helps in identifying market needs, leading to the develop-
ment of new products and services.
9. Education and Psychology

• Educational Assessment: Statistics are used to analyze educational data, assess student performance,
and improve teaching methods.
• Psychological Research: In psychology, statistics help in studying human behavior, testing theories,
and validating psychological assessments.
10. Healthcare and Medicine

• Clinical Trials: Statistics are crucial in designing and analyzing clinical trials to ensure the efficacy
and safety of new treatments.
• Epidemiology: Statistical methods help in studying the distribution and determinants of health-related
events in populations, guiding public health interventions.
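
As a small illustration of the "Summarizing Data" point in the list above, the following Python sketch condenses a set of values into a mean, a median, and a standard deviation. The data values and variable names are made up for illustration and are not taken from the notes.

```python
import statistics

# Hypothetical daily sales figures (illustrative values, not from the notes)
daily_sales = [12, 15, 11, 18, 15, 20, 14]

mean_sales = statistics.mean(daily_sales)      # arithmetic average
median_sales = statistics.median(daily_sales)  # middle value after sorting
stdev_sales = statistics.stdev(daily_sales)    # sample standard deviation

print(f"mean={mean_sales:.2f}, median={median_sales}, stdev={stdev_sales:.2f}")
```

Three numbers are enough here to describe the typical value and the spread of seven observations; the same calls work unchanged on much larger data sets.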

1.1.2 Advantages of Statistics
• Informed Decision Making
– Data-Driven Decisions: Statistics enable decisions based on data rather than intuition, increasing the
reliability and effectiveness of outcomes.
– Risk Management: Statistical analysis helps in identifying and managing risks, allowing for better
planning and mitigation strategies.
• Predictive Analysis
• Quality Control
• Scientific Research

• Understanding Variability
• Policy Formulation and Evaluation
• Business Applications

• Healthcare Applications

1.1.3 Disadvantages of Statistics


• Misinterpretation of Data
– Complexity: Statistical methods can be complex, leading to misinterpretation or misuse if not properly
understood.
– Overgeneralization: Incorrect conclusions may be drawn if statistical results are generalized
beyond the scope of the data.

• Data Quality Issues


– Garbage In, Garbage Out: Poor quality or biased data can lead to misleading results and incorrect
conclusions.
– Sampling Errors: Improper sampling techniques can result in non-representative samples, affecting
the validity of the results.

• Ethical Issues
– Data Manipulation: There is a risk of manipulating data or using selective statistics to mislead or
support a specific agenda.
– Privacy Concerns: Collecting and analyzing personal data raises privacy and ethical concerns, espe-
cially in sensitive areas like healthcare.
• Resource Intensive
– Time-Consuming: Collecting, analyzing, and interpreting statistical data can be time-consuming and
resource-intensive.
– Cost: Conducting large-scale surveys or experiments can be costly, requiring significant financial and
human resources.
• Statistical Limitations
– Assumptions: Many statistical methods rely on certain assumptions (e.g., normality, independence),
and violating these assumptions can affect the results.
– Causation vs. Correlation: Statistics can identify correlations but not necessarily causation, leading
to potential misinterpretation of cause-and-effect relationships.
• Dynamic Nature of Data

1.1.4 Applications of Statistics
1. Business and Economics
• Market Research: Analyzing consumer behavior, preferences, and market trends to guide marketing
strategies and product development.
• Quality Control: Using statistical methods to monitor and improve product and service quality.
• Financial Analysis: Evaluating investment opportunities, assessing risks, and forecasting financial
trends.
• Operational Efficiency: Optimizing supply chain management, inventory control, and resource alloca-
tion.
2. Healthcare and Medicine
• Clinical Trials: Designing and analyzing clinical trials to determine the efficacy and safety of new drugs
and treatments.
• Epidemiology: Studying the distribution and determinants of health-related events to guide public
health interventions and policy.
• Medical Research: Analyzing data from medical studies to understand disease patterns, treatment
outcomes, and health risks.
• Health Services Management: Improving hospital management, patient care, and resource allocation
through statistical analysis.
3. Social Sciences
• Sociological Research: Analyzing social behaviors, trends, and patterns to understand societal dynam-
ics and inform policy.
• Psychology: Using statistical methods to validate psychological theories, assess interventions, and
analyze behavioral data.
• Education: Evaluating educational programs, assessing student performance, and improving teaching
methods through data analysis.
4. Engineering and Manufacturing
• Quality Assurance: Applying statistical process control (SPC) to monitor and improve manufacturing
processes.
• Reliability Engineering: Analyzing the reliability and life-cycle of products to enhance durability and
performance.
• Design of Experiments: Optimizing product design and development through systematic experimenta-
tion and analysis.
5. Environmental Science
• Climate Studies: Analyzing climate data to understand trends, model climate change, and predict future
conditions.
• Environmental Monitoring: Assessing pollution levels, natural resource management, and ecological
impacts through statistical analysis.
• Conservation Biology: Studying species populations, habitat use, and conservation strategies using
statistical methods.
6. Government and Public Policy
• Census and Surveys: Collecting and analyzing population data to inform policy decisions and resource
allocation.
• Economic Planning: Using statistical models to forecast economic growth, unemployment, inflation,
and other macroeconomic indicators.
• Policy Evaluation: Assessing the impact and effectiveness of public policies and programs through
data analysis.

7. Sports and Entertainment
• Performance Analysis: Analyzing athlete performance, game statistics, and team strategies to enhance
competitive edge.
• Audience Analytics: Studying viewer preferences, ratings, and engagement to optimize content and
marketing strategies in media and entertainment.
8. Information Technology and Data Science
• Machine Learning: Using statistical methods to develop algorithms for predictive modeling, classifi-
cation, and clustering.
• Data Mining: Extracting meaningful patterns and insights from large datasets to inform business
decisions and strategies.
• Cybersecurity: Analyzing security threats, intrusion patterns, and system vulnerabilities through
statistical techniques.
9. Agriculture and Food Science

• Crop Yield Analysis: Studying factors affecting crop yields, pest control, and soil health to improve
agricultural practices.
• Food Safety: Monitoring and analyzing food production processes to ensure safety and compliance
with health regulations.

10. Education
• Assessment and Evaluation: Analyzing student performance data, evaluating educational programs,
and improving instructional methods.
• Educational Research: Using statistical methods to study learning outcomes, teaching effectiveness,
and educational trends.

11. Astronomy and Space Science


• Astrophysical Research: Analyzing astronomical data to study celestial bodies, cosmic phenomena,
and the structure of the universe.
• Space Mission Planning: Using statistical models to plan and optimize space missions, satellite
deployments, and exploration strategies.
12. Law and Forensics
• Criminology: Analyzing crime data to understand trends, patterns, and the effectiveness of law
enforcement strategies.
• Forensic Analysis: Using statistical methods in forensic science to analyze evidence, identify patterns,
and solve crimes.

1.2 Types of Data or Variables in Statistics

1.3 Qualitative vs Quantitative Data

1.3.1 Qualitative Data or Categorical Data


• Qualitative data [2], also known as categorical data, is not numerical.
• It represents characteristics or descriptions that fit into categories.

• In other words, qualitative data is non-numerical information sorted into groups or classes.

• Categorical variables are defined in natural-language terms rather than in numbers, such as a person’s
gender or home town.
• Sometimes categorical data holds numerical values, but those values have no mathematical meaning,
as with a birth date or a PIN code (postcode). Such values look quantitative, but arithmetic on them
is not meaningful.
• Qualitative data is further divided into two categories:
– Nominal Data
– Ordinal Data

[2] https://www.youtube.com/watch?v=E1C5hB0yAM4
1.3.1.1 Nominal Data
• Nominal data is a type of data that consists of categories or names that cannot be ordered or ranked.
• Nominal data is often used to categorize observations into groups that cannot be ranked against one another.

• In other words, nominal data has no inherent order or ranking. Therefore, if you would change the order of
its values, the meaning would not change.
• Examples of nominal data include:

– Gender (Male, Female)
– Race (White, Black, Asian)
– Religion (Hinduism, Christianity, Islam, Judaism)
– Blood type (A, B, AB, O)

• Nominal data can be represented using frequency tables and bar charts, which display the number or
proportion of observations in each category.
• For example, a frequency table for gender might show the number of males and females in a sample of
people.
• Nominal data is typically analyzed using non-parametric tests, which do not assume a particular underlying
distribution for the data.
• Common non-parametric tests for nominal data include Chi-Squared Tests and Fisher’s Exact Tests. These
tests are used to compare the frequency or proportion of observations in different categories.
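
The frequency table and the comparison of category frequencies described above can be sketched in a few lines of Python. The sample below is hypothetical, and the chi-squared statistic is computed by hand against equal expected counts purely as an illustration; obtaining a p-value would additionally need the chi-squared distribution with (number of categories − 1) degrees of freedom, which this sketch does not attempt.

```python
from collections import Counter

# Hypothetical sample of blood types (nominal data; values are illustrative)
blood_types = ["A", "O", "B", "O", "A", "AB", "O", "A", "O", "B", "O", "A"]

# Frequency table: number of observations in each category
freq = Counter(blood_types)
n = len(blood_types)

for category in sorted(freq):
    share = freq[category] / n
    print(f"{category:>2}: count={freq[category]}, proportion={share:.2f}")

# Chi-squared goodness-of-fit statistic, assuming every category is equally likely
expected = n / len(freq)
chi_sq = sum((observed - expected) ** 2 / expected for observed in freq.values())
print(f"chi-squared statistic = {chi_sq:.2f}")
```

Note that the categories are printed alphabetically only for readability; reordering them changes nothing about the data, which is exactly what makes it nominal.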

1.3.1.2 Ordinal Data
• Ordinal data is a type of data that consists of categories that can be ordered or ranked. However, the distance
between categories is not necessarily equal.
• Ordinal data is nearly the same as nominal data, except that its ordering matters.

• Ordinal data is often used to measure subjective attributes or opinions, where there is a natural order to the
responses.
• Examples of ordinal data include education level (Elementary, Middle, High School, College), job position
(Manager, Supervisor, Employee), etc.

• Note that the difference between Elementary and High School is not the same as the difference between
High School and College. This is the main limitation of ordinal data: the sizes of the differences between
values are not known. Because of that, ordinal scales are usually used to measure non-numeric features like
happiness, customer satisfaction and so on.
• Ordinal data can be represented using bar charts or line charts. These displays show the order or ranking of the
categories, but they do not imply that the distances between categories are equal.
• Ordinal data is analyzed using non-parametric tests, which make no assumptions about the underlying
distribution of the data.
• Common non-parametric tests for ordinal data include the Wilcoxon Signed-Rank test and Mann-Whitney
U test.
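
To make the idea of "ordered, but without equal distances" concrete, here is a minimal Python sketch using the education levels from the example above. The survey responses are hypothetical, and the integer ranks are an assumption introduced only to encode order; they should not be read as measurements.

```python
# Education levels from the notes, listed in their natural order
levels = ["Elementary", "Middle", "High School", "College"]
rank = {level: i for i, level in enumerate(levels)}  # position encodes order only

# Hypothetical survey responses (illustrative)
responses = ["College", "Middle", "High School", "Elementary", "High School"]

# Sorting by rank is meaningful because the categories are ordered
ordered = sorted(responses, key=rank.__getitem__)

# The median is a valid summary for ordinal data (odd sample size here)
median_level = ordered[len(ordered) // 2]

print(ordered)
print("median level:", median_level)
```

A mean of the ranks is deliberately not computed: averaging would treat the gap between each pair of neighbouring levels as equal, which ordinal data does not guarantee.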

1.3.2 Quantitative Data or Numerical Data
• Quantitative data [3] is about quantity: something that we can measure in numbers.
• Quantitative Data is a fundamental component of Statistics, providing a numerical foundation for analysis
and decision-making.

• Quantitative data are data represented numerically, including anything that can be counted, measured, or
given a numerical value.
• They are also called the Numerical Data (i.e., how much, how often, how many).
• Quantitative data is used to represent quantities, measurements, and observations such as height, weight,
and length.

• Quantitative data is further classified into two categories [4]:


– Discrete Data
– Continuous Data

[3] https://www.youtube.com/watch?v=kNARs2oeuk0
[4] https://www.youtube.com/watch?v=Cg0W6mod9Hw
1.3.2.1 Discrete Data
• Discrete data is a type of data in statistics that takes only distinct, countable values.
• The possible values are finite or countably infinite in number, and they cannot be subdivided
meaningfully.

• In a Discrete Dataset, apparent gaps or intervals exist between the values. These gaps indicate that there are
no values between the specified data points.
• Examples of discrete data include:
– Marks of the students in a class test
– Number of customers
– Dice rolls: When rolling a six-sided die, the possible outcomes are discrete and countable, ranging
from 1 to 6.
• Discrete Data is often analyzed using Statistical techniques tailored to discrete variables, such as frequency
distributions, bar charts, and probability calculations. These methods help to summarize and interpret Data
that can be counted or categorized into distinct values.

• Key characteristics of discrete data:

– Countable and non-divisible: Discrete variables take separate, countable values, typically
non-negative integers (5, 10, 15, and so on).
– Easy to visualize: Discrete data can be easily visualized using simple
charts such as bar charts, line charts, or pie charts.
– Can be categorical: Discrete data can also be categorical - containing a finite number of data values,
such as the gender of a person.
– Easy to distribute: Discrete data is distributed discretely in terms of time and space. Discrete
distributions make analyzing discrete values more practical.
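
The dice example above lends itself to a short frequency-distribution sketch in Python. The list of rolls is hypothetical (fixed rather than randomly generated so the result is reproducible), and the text-based bar chart is just one simple way to display the counts.

```python
from collections import Counter

# Hypothetical outcomes of 12 rolls of a six-sided die (illustrative)
rolls = [3, 6, 1, 3, 5, 2, 6, 3, 4, 1, 6, 2]

freq = Counter(rolls)

# Every face from 1 to 6 is a possible value; nothing lies between the faces
for face in range(1, 7):
    print(f"{face}: {'#' * freq[face]} ({freq[face]})")

# Relative frequency of each face
relative = {face: freq[face] / len(rolls) for face in range(1, 7)}
print(relative)
```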

1.3.2.2 Continuous Data


• Continuous data, in contrast to discrete data, can take any value within a continuous range.
• A variable in the data set can take infinitely many possible values within a given
range.
• The values are typically expressed as decimals or fractions and can be divided into smaller and smaller
intervals.
• Examples of continuous data:
– Height of individuals: Heights can vary continuously, from fractions of an inch to several feet.
Measuring the height of people yields Continuous Data.
– Temperature readings: Temperature measurements can include decimal values and vary continuously,
ranging from below freezing to triple digits above zero.
– Weight of products: In manufacturing and commerce, the weight of products is often measured with
precision, resulting in Continuous Data.
• Key characteristics of continuous data are:

– Changes over time: Continuous data changes over time and can have different values at different time
intervals.
– May or may not have decimals: Continuous data comprises random variables that may or may not be
whole numbers.
– Visualized with line graphs or skews: Continuous data is commonly examined with line graphs and
with measures of distribution shape such as skewness.
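
Because continuous values can fall anywhere in a range, they are usually counted by interval rather than by exact value. The Python sketch below groups some hypothetical heights into classes; the class width of 10 cm and the starting point of 150 cm are arbitrary choices made for illustration.

```python
# Hypothetical heights in centimetres (continuous measurements, illustrative)
heights = [152.4, 160.1, 171.8, 165.0, 158.7, 180.3, 169.5, 174.2]

width = 10.0   # class width, chosen for illustration
start = 150.0  # lower limit of the first class

counts = {}
for h in heights:
    k = int((h - start) // width)      # index of the interval containing h
    lower = start + k * width
    interval = (lower, lower + width)  # covers lower <= h < lower + width
    counts[interval] = counts.get(interval, 0) + 1

for (lower, upper), count in sorted(counts.items()):
    print(f"{lower:.0f} to under {upper:.0f}: {count}")
```

Halving `width` would split each class into two narrower ones, which reflects the point above that continuous data can be divided into smaller and smaller intervals.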

1.4 Other Types of Data/Variables
1.4.1 Primary Data?
Primary data is data collected first-hand, for the first time. It is raw data on which no analysis has yet been
performed.

1.4.2 Binary (Dichotomous) Variable


• Variables with only two possible values.
– Light Switch: On, Off.
– Pass/Fail: Pass, Fail.
– Yes/No: Yes, No.
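
A common way to work with a binary variable is to encode its two values as 1 and 0. The sketch below does this for the Pass/Fail example; the list of results is hypothetical.

```python
# Hypothetical pass/fail results (illustrative)
results = ["Pass", "Fail", "Pass", "Pass", "Fail"]

# Encode the two possible values as 1 and 0
encoded = [1 if r == "Pass" else 0 for r in results]

# The mean of a 0/1 variable equals the proportion of 1s
pass_rate = sum(encoded) / len(encoded)

print(encoded, pass_rate)
```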

1.4.3 Interval vs Ratio Data


Interval vs Ratio Data: Video Lecture [5]

1.4.3.1 Interval Data


• Numerical data with meaningful intervals between values, but no true zero point.
• Interval data is measured on a scale in which adjacent values are an equal distance apart and follow a clear order.
• Because interval data lacks an absolute zero point, direct comparisons of magnitude (e.g. "A is
twice as large as B") are not meaningful.
• Interval scales hold no true zero and can represent values below zero. For example, you can measure
temperatures below 0 degrees Celsius, such as -10 degrees.
• Interval variables are also commonly known as Scaled variables.

– Examples: Temperature in Celsius: Differences are meaningful, but 0°C does not mean the absence of
temperature.
– Calendar Years: 2000, 2010, 2020 (the intervals are equal, but there is no true zero year).
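
The claim that ratios are not meaningful on an interval scale can be checked numerically. The sketch below compares 10°C and 20°C; the conversion to Kelvin (a temperature scale with a true zero) is not part of the notes and is brought in here only to show what the ratio of the underlying temperatures actually is.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return c + 273.15

a_c, b_c = 10.0, 20.0

# Differences are meaningful on an interval scale
print("difference:", b_c - a_c, "degrees")

# The naive Celsius ratio suggests "twice as hot", which is not meaningful
print("naive Celsius ratio:", b_c / a_c)

# Measured from a true zero, the ratio is nowhere near 2
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)
print(f"Kelvin ratio: {ratio_k:.3f}")
```

The 10-degree difference is the same on both scales, but the ratio changes as soon as the zero point moves, which is why only differences should be compared for interval data.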

1.4.3.2 Ratio Data


• Numerical variables with meaningful intervals and a true zero point, allowing for the calculation of ratios.

• Ratio data uses absolute zero as a reference point for measurement. In other words, Ratio data has a defined
zero point, whereas interval data lacks the absolute zero point.
• Ratio variables never fall below zero. Height and weight are measured from 0 upward and never fall below it.
• Ratio data can include variables like income, height, weight, annual sales, market share, product defect rates,
time to repurchase, unemployment rate, and crime rate. As an analyst, you can say a crime rate of 10% is
twice that of 5%, or annual sales of 2 million are about 33% greater than 1.5 million.
• Like interval variables, ratio variables are commonly known as Scaled variables.
– Examples: Income: 0, 50,000, 100,000 (income of 0 means no income, and you can say that 100,000
is twice as much as 50,000).
– Distance: 0 km, 5 km, 10 km (0 means no distance, and 10 km is twice as far as 5 km).
– Age: 0 years, 25 years, 50 years (0 means no age, and 50 years is twice as old as 25 years).
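
Since ratio data has a true zero, statements such as "twice as much" or "x% greater" can be computed directly. The sketch below works through the figures used in this subsection; `percent_greater` is a helper name introduced here for illustration.

```python
def percent_greater(a, b):
    """Return how much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100

# A crime rate of 10% is twice a crime rate of 5%
print(10 / 5)

# An income of 100,000 is twice an income of 50,000
print(100_000 / 50_000)

# Annual sales of 2 million relative to 1.5 million
print(f"{percent_greater(2_000_000, 1_500_000):.1f}% greater")

# The same gap seen from the other side: 1.5 million relative to 2 million
print(f"{percent_greater(1_500_000, 2_000_000):.1f}%")
```

The last two lines show that the direction of the comparison matters: 2 million is about 33.3% greater than 1.5 million, while 1.5 million is 25% less than 2 million.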
[5] https://www.youtube.com/watch?v=kNARs2oeuk0
1.5 Statistical Series
• Statistical Series
– Characteristics
* Time Series
* Spatial Series
* Condition Series
– Construction
* Individual Series (or Raw Data)
* Frequency: Frequency Distribution
* UnGrouped Frequency Distribution: Discrete Series (Frequency Array)
* Grouped Frequency Distribution: Continuous Series (Inclusive Series, Exclusive Series, Open End Series,
Cumulative Frequency Series, Mid-Value Frequency Series, Equal and Unequal Class Interval Series)

• Data is important for researchers, but in its raw form it is hardly usable.
• Therefore, data is often organized in series [6] to facilitate analysis and interpretation.
• Each series has its own characteristics and obeys some general principles.

• Such series are very important for researchers and economists, who use them to gain actionable insights.
• A statistical series refers to a set of observations arranged in a particular order based on one or more criteria.
• In other words, arranging data in some logical order, such as according to the time of occurrence, size, or
some other measurable or non-measurable characteristic, is known as a Statistical Series [7].
• Understanding the different types of statistical series is crucial for effectively analyzing and presenting data.
• Statistical Series can be classified:
– On the Basis of Characteristics:
* Time Series
* Spatial Series
* Condition Series
– On the Basis of Construction:
* Individual Series
* Discrete Series
* Continuous Series

[6] https://www.youtube.com/watch?v=NWNW1jln8cc
[7] https://www.youtube.com/watch?v=VunpIAw5pPg
1.6 On the Basis of Characteristics
When the data is arranged on the basis of qualitative characteristics, statistical series are of three kinds:

• Time Series
• Spatial Series
• Condition Series

1.6.1 Time series


• If the different values taken by a variable in a period of time are arranged in chronological order, the series
obtained is called a Time Series. Thus, a Time series is a series of data points indexed (or listed or graphed)
in time order.

• Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a
sequence of discrete-time data.
• Simply, a time series is a statistical series in which the data is presented with reference to a unit of time, e.g., day,
week, month, or year.

• Time series analysis is especially useful for non-stationary data, that is, quantities that constantly fluctuate over
time or are affected by time. Industries like finance, retail, and economics frequently use time series analysis because
currency and sales are always changing. Stock market analysis is an excellent example of time series analysis
in action, especially with automated trading algorithms. Likewise, time series analysis is ideal for forecasting
weather changes, helping meteorologists predict everything from tomorrow’s weather report to future years
of climate change.

• Examples of time series analysis in action include:


– Weather data
– Rainfall measurements
– Temperature readings
– Heart rate monitoring (EKG)
– Brain monitoring (EEG)
– Quarterly sales
– Stock prices
– Automated stock trading
– Industry forecasts
– Interest rates
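
The defining step of a time series, arranging values in chronological order, can be sketched as follows. The quarterly sales figures and dates are hypothetical, and the period-over-period change is included only as one simple calculation that depends on that ordering.

```python
from datetime import date

# Hypothetical quarterly sales, recorded out of order (illustrative values)
observations = [
    (date(2023, 7, 1), 130),
    (date(2023, 1, 1), 100),
    (date(2023, 10, 1), 150),
    (date(2023, 4, 1), 120),
]

# A time series lists the values in chronological order
series = sorted(observations)  # tuples sort by their first element, the date

for when, value in series:
    print(when.isoformat(), value)

# Change from each quarter to the next, which only makes sense once ordered
changes = [later[1] - earlier[1] for earlier, later in zip(series, series[1:])]
print(changes)
```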

1.6.2 Spatial Series
• Spatial data is any type of data that directly or indirectly references a specific geographical area or location.
• Example: The following is the sex ratio of 6 different states of India as per the Census of 2011.

1.6.3 Condition Series
• When data is classified according to the changes occurring in a variable under a certain condition, the series
is called a Condition Series.
• Typical examples are the students of a class arranged according to their age, height, weight, or marks.

• Example: The following table shows the arrangement of 40 students in a class according to their
age. It is a condition series because the data is arranged on the basis of the age of the students.

1.7 On the Basis of Construction
When the data is arranged on the basis of quantitative characteristics, statistical series are of three kinds 8 9 :

• Individual Series
– Unorganized Individual Series
– Organized Individual Series
• Discrete Series
• Continuous Series
– Exclusive Series
– Inclusive Series
– Open-end Distribution
– Cumulative Frequency Series
– Equal and Unequal Class Interval Series
– Mid-value Series

8 https://www.youtube.com/watch?v=NWNW1jln8cc
9 https://www.tutorialspoint.com/statistical-series

1.7.1 Individual Series (or Raw Data)
• Individual series is that series in which the terms are listed singly.
• In simple terms, a separate value of the measurement is given to each item.
• Example: If the marks of 10 students of a class are given individually as 80, 82, 75, 95, 77, 81, 60, 35, 54, and
99, then the resultant series will be an individual series.
• In such series, there is no class of the items and also there is no frequency of the items.
• The two types of individual series are:

1. Unorganized Individual Series


– A series with raw data or an unarranged mass of data is known as Unorganised Series. Raw Data
is the data in its original form.
– Simply put, when the investigator collects the data and has not arranged it in a systematic manner,
then the collected data will be known as unorganised data.
– Data presented as an unorganised series is difficult for the investigator to interpret, because no pattern
is visible until the values are arranged.
2. Organized Individual Series
– A series with orderly arranged raw data is known as Organised Individual Series.
– An individual series can be arranged in ascending or descending order.

1.7.2 Frequency
• In statistics, the frequency or absolute frequency of an event $i$ is the number $n_i$ of times the observation has
occurred or been recorded in an experiment or study.
• Frequency is basically the number of times a data item occurs in the series. In other words, it deals with how
frequent a data item is in the series.

1.7.2.1 Frequency Table


• After data collection, we have to show data in a meaningful manner for better understanding. Organize the
data in such a way that all its features are summarized in a table.
• A frequency table is a way to present data. The data are counted and ordered to summarize larger sets of data.
With a frequency table you can analyze the way the data is distributed across different values.

• Example: Twenty students were asked how many hours they worked per day. Their responses, in hours, are
as follows:
5; 6; 3; 3; 2; 4; 7; 5; 2; 3; 5; 6; 5; 4; 4; 3; 5; 2; 5; 3

Work Hours    Frequency
2             3
3             5
4             3
5             6
6             2
7             1

Table 1.1: Frequency Table of Student Work Hours
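
The tally in Table 1.1 can be reproduced programmatically. The following is a minimal Python sketch rather than part of the original notes; the names `hours` and `frequency` are illustrative, and the standard-library `Counter` class is just one convenient way to count repeated values.

```python
from collections import Counter

# Responses from the example above: hours worked per day by 20 students.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(hours)

print("Work Hours  Frequency")
for value in sorted(frequency):
    print(f"{value:<11} {frequency[value]}")
```

The frequencies always add up to the number of observations (here 20), which is a quick check that no response was missed.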

1.7.2.2 Types of Frequency


• There are several types of frequency in statistics: 10

– Frequency or Absolute Frequency
– Absolute Cumulative Frequency
– Relative Frequency
– Cumulative Relative Frequency
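
The notes list these four types without defining them, so the sketch below assumes the usual definitions: cumulative frequency is the running total of the absolute frequencies, relative frequency is the absolute frequency divided by the number of observations, and cumulative relative frequency is the running total of the relative frequencies. It reuses the work-hours data from the frequency table example; all variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]
counts = Counter(hours)
values = sorted(counts)
n = len(hours)

absolute = [counts[v] for v in values]             # how often each value occurs
cumulative = list(accumulate(absolute))            # running total of the counts
relative = [f / n for f in absolute]               # share of all observations
cumulative_relative = [c / n for c in cumulative]  # running total of the shares

print("value  abs  cum  rel   cum_rel")
for row in zip(values, absolute, cumulative, relative, cumulative_relative):
    print("{:>5} {:>4} {:>4} {:>5.2f} {:>8.2f}".format(*row))
```

The last cumulative frequency equals the number of observations, and the last cumulative relative frequency equals 1.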

1.7.2.3 Frequency Distribution


• A frequency distribution shows the frequency of repeated items in a graphical form or tabular form.

• It gives a visual display of the frequency of items or shows the number of times they occurred.
• Types of Frequency Distribution:
– Ungrouped Frequency Distribution or Discrete Series
– Grouped Frequency Distribution or Continuous Series

10 https://edu.gcfglobal.org/en/statistics-basic-concepts/frequency-tables/1/

1.7.3 Discrete Series or Ungrouped Frequency Distribution or Frequency Array
• Discrete Series is nothing but ungrouped frequency distribution series where different values of the variables
are shown with their respective frequencies.
• The classification of data for a discrete variable is known as Frequency Array.

• In a discrete series, the values obtained in raw form are presented along with their frequencies. The items
are not simply listed one by one in ascending or descending order.
• Instead, each distinct value and its frequency are presented in tabular form.
• For example, if the monthly wages of seven employees of a company are 10,000, 12,000, 10,000, 12,000,
13,000, 14,000, and 15,000, then the discrete series will be made as follows:

Monthly Wages    Frequency
10,000           2
12,000           2
13,000           1
14,000           1
15,000           1

1.7.4 Continuous Series or Grouped Frequency Distribution
• A discrete series cannot take any value in an interval; therefore, in cases where it is essential to represent
continuous variables with a range of values of different items of a given data, Continuous Series is used.
• In continuous series (grouped frequency distribution), the values of a variable are grouped into several class
intervals (such as 0-5, 5-10, 10-15) along with the corresponding frequencies.

• Other names of Continuous Series are Frequency Distribution, Grouped Frequency Distribution, Series with
Class Intervals, and Series of Grouped Data.
• Different types of Continuous Series 11 12 :
– Inclusive Series
– Exclusive Series
– Open-end Distribution
– Cumulative Frequency Series
– Equal and Unequal Class Interval Series
– Mid-value Series

11 https://www.geeksforgeeks.org/types-of-frequency-distribution/
12 https://www.toppr.com/guides/economics/organisation-of-data/frequency-distribution/

1.7.4.1 Important Terms under Continuous Series
• Class: Class in Continuous Series refers to a group of numbers in which the items are placed. For example,
0-5, 5-10, 10-15, 15-20, 20-25, etc.
• Number of Classes: The decision regarding the number of classes of a given data usually depends upon the
judgement of the individual investigator. Even though there is no strict rule regarding the number of classes,
the number should not be very small or very large.
• Class Limits: In continuous series, the class limit is formed by the two numbers between which every class
is located. The lowest value of the class is known as Lower Limit and the highest value of the class is known
as Upper Limit. For example, if a class is 5 - 10, then 5 is the lower limit and 10 is the upper limit.
• Class Interval: It is the difference between the lower limit and upper limit of a class.

• Range: It is the difference between the lower limit of the first class interval and the upper limit of the last
class interval. For example, if the classes of a distribution are 0-5, 5-10, 10-15, ..., up to 45-50, then
the range will be 50 − 0 = 50.
• Width of Class Intervals: At the time of constructing the frequency distribution, it is suggested that the
width of each class interval is equal in size. The formula for determining the size or width of each class
interval is as follows:
\[
\text{width} = \frac{\text{range}}{\sqrt{\text{sample size}}}
\]

• How to make a grouped frequency table?
Example: A sociologist conducted a survey of 20 adults. She wants to report the frequency distribution of
the ages of the survey respondents. The respondents were the following ages in years:
52, 34, 32, 29, 63, 40, 46, 54, 36, 36, 24, 19, 45, 20, 28, 29, 38, 33, 49, 37

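\[
\begin{aligned}
\text{Range} &= \text{Highest} - \text{Lowest} = 63 - 19 = 44 \\
\text{Width} &= \frac{\text{Range}}{\sqrt{\text{sample size}}} = \frac{44}{\sqrt{20}} = 9.84 \approx 10
\end{aligned}
\]

The class intervals are given below. Each class includes its lower limit but not its upper limit, so that an age
such as 29 is counted in exactly one class:

19 ≤ a < 29
29 ≤ a < 39
39 ≤ a < 49
49 ≤ a < 59
59 ≤ a < 69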
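
The grouping procedure in this example can be sketched in Python as follows. This is an illustrative sketch, not the author's own code. It assumes that each class includes its lower limit and excludes its upper limit, so that every age falls in exactly one class, and that the width is rounded to the nearest whole number; the variable names are hypothetical.

```python
import math

ages = [52, 34, 32, 29, 63, 40, 46, 54, 36, 36,
        24, 19, 45, 20, 28, 29, 38, 33, 49, 37]

data_range = max(ages) - min(ages)                # 63 - 19 = 44
width = round(data_range / math.sqrt(len(ages)))  # 9.84 rounds to 10

# Build classes [lower, upper) starting at the smallest observation.
classes = []
lower = min(ages)
while lower <= max(ages):
    classes.append((lower, lower + width))
    lower += width

# Count the ages in each class (lower limit included, upper limit excluded).
table = {
    (lo, hi): sum(1 for a in ages if lo <= a < hi)
    for lo, hi in classes
}

for (lo, hi), freq in table.items():
    print(f"{lo} <= a < {hi}: {freq}")
```

With these assumptions the five classes receive 4, 9, 3, 3 and 1 respondents, which together account for all 20 adults.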

1.7.4.2 Inclusive Series
• The series with class intervals, in which all the items having the range from the lower limit up to the upper
limit are included, is known as Inclusive Series.
• However, there is a gap (typically between 0.1 and 1) between the upper limit of one class interval and the lower
limit of the next class interval.

• For example, class intervals of an inclusive series can be, 0-9, 10-19, 20-29, 30-39, and so on. In this case,
the gap between the upper limit of one class interval and the lower limit of the next class interval is 1.

Work Hours    Frequency
0-9           0
10-19         2
20-29         8
30-39         3
40-49         5
50-59         6
60-69         6

Table 1.2: Frequency Distribution in Inclusive Series Example

• From the above table of inclusive series, it can be seen that the upper limit of one class interval (say, 9 of
interval 0-9) is not the same as the lower limit of the next class interval (10 of interval 10-19). Also, all the
values that come under 0-9, including 0 and 9 are included in the frequency against 0-9.

• For statistical calculation, it is sometimes necessary to convert the inclusive series into an exclusive
series. Suppose that some observations in the above example take fractional values such as 10.5 or 40.5. Such
values fall in the gap between two classes, so the series must be converted into an exclusive series.

1.7.4.3 Exclusive Series
• The series with class intervals, in which all the items having the range from the lower limit to the value just
below its upper limit are included, is known as the Exclusive Series.
• For example, if a class interval is 0-10, and the values of the given series are 4, 10, 2, 15, 8, and 9, then only
4, 2, 8, and 9 will be included in the 0-10 class interval. 10 and 15 will be included in the next class interval,
i.e., 10-20.
• In Exclusive Series, the upper limit of a class interval is the lower limit of the next class interval.

Work Hours    Frequency
0-10          0
10-20         2
20-30         8
30-40         3
40-50         5
50-60         6
60-70         6

Table 1.3: Frequency Distribution in Exclusive Series Example

• From the above table of exclusive series, it can be seen that the upper limit of the first class interval is the
lower limit of the second class interval, and so on.

• If the data includes a value 10, it will be included in the class interval 10-20, not in 0-10.
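
The exclusive rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: classes start at 0, all have the same width (10 here), and the values are non-negative. The function name `exclusive_class` is hypothetical.

```python
def exclusive_class(value, width=10):
    """Return the (lower, upper) class containing value under the exclusive rule.

    The lower limit is included and the upper limit is excluded, so a value
    equal to an upper limit belongs to the next class.
    """
    lower = (value // width) * width
    return (lower, lower + width)


values = [4, 10, 2, 15, 8, 9]
for v in values:
    lower, upper = exclusive_class(v)
    print(f"{v} -> {lower}-{upper}")
```

As in the example above, 4, 2, 8 and 9 land in 0-10, while 10 and 15 land in 10-20.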

1.7.4.4 Conversion of Inclusive Series into Exclusive Series
• For statistical calculation, it is sometimes necessary to convert the inclusive series into an exclusive
series.
• Suppose that some observations in the inclusive series example above take fractional values such as 10.5 or
40.5. These values fall in the gap between two classes, so the series must be converted into an exclusive series.

• The steps for converting an inclusive series into exclusive series are:

– In the first step, calculate the difference between the upper limit of one class interval and the
lower limit of the next class interval. In the example, this gap is 10 − 9 = 1.
– In the second step, divide this difference by two, then add the resulting value (here 0.5) to the upper limit of
every class interval and subtract it from the lower limit of every class interval.

• The inclusive series of the above example is converted into exclusive series as under:

Work Hours    Frequency
0-9.5         0
9.5-19.5      2
19.5-29.5     8
29.5-39.5     3
39.5-49.5     5
49.5-59.5     6
59.5-69.5     6

Table 1.4: Frequency Distribution in Exclusive Series Example
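
The two conversion steps can be sketched as a small Python function. This is an illustrative sketch that assumes every pair of neighbouring classes has the same gap; `to_exclusive` is a hypothetical name. Note that applying the rule literally gives −0.5 as the lower boundary of the first class, whereas Table 1.4 prints it as 0.

```python
def to_exclusive(inclusive_classes):
    """Convert inclusive class limits into exclusive class boundaries."""
    # Step 1: gap between the upper limit of one class and the lower limit of the next.
    gap = inclusive_classes[1][0] - inclusive_classes[0][1]
    # Step 2: halve the gap, subtract it from lower limits, add it to upper limits.
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in inclusive_classes]


inclusive = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
exclusive = to_exclusive(inclusive)

for (lo, hi), (new_lo, new_hi) in zip(inclusive, exclusive):
    print(f"{lo}-{hi} -> {new_lo}-{new_hi}")
```

After the conversion, the upper boundary of each class equals the lower boundary of the next, so a value such as 10.5 or 40.5 has a class to fall into.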

1.7.4.5 Difference between Inclusive and Exclusive Series
• In Inclusive Series, the upper limit of one class interval is not the same as the lower limit of the next class
interval. There is a gap ranging from 0.1 to 1.0 between the upper class limit of one class interval and the
lower class limit of the next class interval. However, in the Exclusive Series, the upper limit of one class
interval is the same as the lower limit of the next class interval.

• In the case of Inclusive Series, the value of the upper and the lower limit are included in that class interval
only. However, in the case of Exclusive Series, the value of upper limit of a class interval is not included in
that interval, instead, it is included in the next class interval.
• Inclusive Series is suitable only if the values are whole numbers and not in decimal form. However, an
Exclusive Series is suitable whether the values are whole numbers or decimals.
• Counting in Inclusive Series is possible only after converting it into an Exclusive Series. However, counting
in Exclusive Series is possible in all cases.

1.7.5 Open End Series
• Sometimes the lower limit of the first class interval and the upper limit of the last class interval are not
available; instead, Less than or Below is mentioned in the former case (in place of the lower limit of the first class
interval), and More than or Above is mentioned in the latter case (in place of the upper limit of the last class
interval). These types of series are known as Open End Series.

• For statistical calculations, if one needs to change the first and last class open-end class interval into limits, it
can be done by the general practice of giving the same magnitude or class size to these intervals as the class
size of other class intervals.
• In the above example, the magnitude of other class intervals is 5. Therefore, the open-end class intervals can
be written as 5-10 and 30-35, respectively.

1.8 Types of Statistics
Statistics can be broadly classified into two main types:

1. Descriptive Statistics
2. Inferential Statistics

1.9 Descriptive Statistics


• Descriptive Statistics offers methods to describe and summarize data set by transforming raw observations
into meaningful information that is easy to interpret and share.

• A data set is a collection of responses or observations from a sample or entire population.


• The data set, drawn from a sample or a population, is summarized using measures such as the mean and standard deviation.
• Descriptive Statistics is a way of organizing, presenting, and explaining the data set using tables and graphs
(histograms, pie charts, bars, and scatter plots).

1.9.1 Key Components of Descriptive Statistics:


• Measures of Central Tendency:

– Mean: The average of a set of numbers.


– Median: The middle value when the data is ordered.
– Mode: The most frequently occurring value in the dataset.
• Measures of Dispersion/Variability/Spread:

– Range: The difference between the highest and lowest values.


– Variance: The average of the squared differences from the mean.
– Standard Deviation: The square root of the variance, representing the spread of the data.
– Interquartile Range (IQR): The range between the first quartile (25th percentile) and the third quartile
(75th percentile).

• Data Representations:
– Histograms: Bar graphs representing the frequency distribution of numerical data.
– Bar Charts: Graphs representing categorical data with rectangular bars.
– Pie Charts: Circular charts divided into sectors representing proportions.
– Box Plots: Visual representations of the distribution of data based on five-number summaries (minimum,
first quartile, median, third quartile, and maximum).
– Pictograph
– Frequency Distribution
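
Most of the numerical measures listed above are available in Python's standard `statistics` module. The sketch below applies them to the work-hours data from the frequency table example. It is an illustration only: the variance and standard deviation use the population versions (`pvariance`, `pstdev`), matching the definition given above as the average of the squared differences from the mean, and the quartiles follow the module's default convention, which may differ slightly from the convention used in a given textbook.

```python
import statistics

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

# Measures of central tendency
mean = statistics.mean(hours)
median = statistics.median(hours)
mode = statistics.mode(hours)

# Measures of dispersion
data_range = max(hours) - min(hours)
variance = statistics.pvariance(hours)         # average squared difference from the mean
std_dev = statistics.pstdev(hours)             # square root of the variance
q1, q2, q3 = statistics.quantiles(hours, n=4)  # quartile cut points (default convention)
iqr = q3 - q1

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, variance={variance:.2f}, std_dev={std_dev:.2f}, IQR={iqr}")
```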

1.9.2 Measures of Central Tendency
• Central Tendencies in Statistics are the numerical values that are used to represent the mid-value or central value
of a large collection of numerical data. These numerical values are called central values in Statistics.
• Measures of central tendency are statistical metrics that describe the center of a distribution or dataset with a
single value that is representative of the whole.

• Such a value is of great significance because it depicts the nature or characteristics of the entire data, which
is otherwise very difficult to observe.
• The three most common measures of central tendency are:
– Mean : provides the average value of the dataset
– Median: provides the central value of the dataset
– Mode: provides the most frequent value in the dataset

1.10 Population vs Sample
• Population: A collection or set of individuals or objects or events whose properties are to be analyzed.

• Sample: A subset of the population is called ‘Sample’. A well-chosen sample will contain most of the
information about a particular population parameter.

• Outliers: An outlier is a data point that differs significantly from the majority of the data taken from a sample
or population. There are many possible causes of outliers, but here are a few to start you off:
– Natural variation in data
– Change in the behavior of the observed system
– Errors in data collection

1.11 Mean
• Mean is the most commonly used measure of central tendency in Statistics.

• Mean is the central tendency of the distributed data, which refers to the average value of the given set of data.
• The method of finding the mean is also different depending on the type of data (Grouped or Ungrouped
Data).
• Mean is also referred to as the average.

• Mean is sensitive to skewed data and extreme values.

1.11.1 Types of Mean


There are four main types of mean that you will study in statistics:
• Arithmetic Mean
• Geometric Mean
• Harmonic Mean
• Weighted Mean
When not specified, "mean" generally refers to the arithmetic mean.
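
The notes name the four types without defining them here, so the following sketch assumes the standard definitions: the geometric mean is the n-th root of the product of n values, the harmonic mean is n divided by the sum of the reciprocals, and the weighted mean is the sum of weight times value divided by the sum of the weights. The data and weights are made up for illustration, and `statistics.geometric_mean` requires Python 3.8 or later.

```python
import statistics

values = [2, 8]

arithmetic = statistics.mean(values)           # (2 + 8) / 2
geometric = statistics.geometric_mean(values)  # square root of 2 * 8
harmonic = statistics.harmonic_mean(values)    # 2 / (1/2 + 1/8)

# Weighted mean: the first value counts three times as much as the second.
weights = [3, 1]
weighted = sum(w * x for w, x in zip(weights, values)) / sum(weights)

print(arithmetic, round(geometric, 4), round(harmonic, 4), weighted)
```

For positive values that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean.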

1.12 Arithmetic Mean
1.12.1 How to Calculate Arithmetic Mean?
There are three ways to determine the arithmetic mean, whether the data is grouped or ungrouped (Individual,
Discrete, or Continuous Series) 13 14 :

• Direct Method
• Assumed Mean Method or Short-Cut Method
• Step Deviation Method

13 https://www.youtube.com/playlist?list=PLYwJOKtPsLuiFjFGKDFoPZOM0g4JBKUrj
14 https://www.youtube.com/playlist?list=PLEHGYFbPuuMEhz_AU8iCrBTYb5eNtFpeg

1.12.2 Mean of Raw Data or Individual Series
• Raw data is a dataset that simply contains all the observations in no particular order.
• The series in which the items are listed singly is known as Individual Series.
• The mean of raw data is calculated by adding up all the observations and dividing the sum by the total number of
observations in the set.
• Mean = Sum of all Observations ÷ Total number of Observations
• The population mean is represented by the Greek letter µ (mu).
• The sample mean is represented by $\bar{x}$ (x-bar).

• The sample mean is usually the best, unbiased estimate of the population mean. However, the mean is
influenced by extreme values (outliers) and may not be the best measure of center with strongly skewed data.

1.12.2.1 Direct Method
• The following equations compute the population mean and sample mean:

\[
\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}
\]

where $N$ is the total number of observations in the population, and

\[
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
\]

where $n$ is the total number of observations in the sample.

1.12.2.2 Assumed Mean Method
• Assumed mean method 15 finds the actual mean of the data by first assuming a mean value.
• When the calculation of the mean for raw data using the direct method becomes very tedious, then the mean
can be calculated using the assumed mean method.

• When calculating the mean of large values using the direct method, you work with large intermediate sums. The
assumed mean approach, also known as a shift of origin, reduces the likelihood of calculation errors
because it gives you smaller numbers to work with (including negative deviations that partly cancel the
positive ones and so lower the sum).
• The Assumed Mean method simplifies the calculation of the arithmetic mean by reducing the size of the
numbers involved in the calculation, making it easier to compute, thus suitable if your data set has large
values.
• The following equations compute the population mean and sample mean:

\[
\mu = A + \frac{\sum d_i}{N}
\]

\[
\bar{x} = A + \frac{\sum d_i}{n}
\]

where $A$ is the assumed mean and $d_i = x_i - A$ is the deviation of each observation from the assumed mean.

• Advantages:
– Simplifies arithmetic by using smaller numbers.
– Reduces computational complexity.
• Disadvantages:

– Assumed mean is still a central value, so deviations might still be relatively large.
• How to Calculate Mean using Assumed Mean Method?: We can calculate mean using the assumed mean
method by following the below steps:
1. Choose an Assumed Mean (A): Select a value from the data, often a central value, to act as an assumed
mean.
2. Calculate the Deviations (d): Subtract the assumed mean from each data point to find the deviation
di = xi − A, where xi is each data point.
3. Find the Sum of Deviations (∑ di ): Add up all the deviations.
4. Calculate the Mean using the above formulas
15 https://testbook.com/maths/assumed-mean-method

• Example:
– Assume your data set is 73, 75, 76, 78 and 79.
– Sort your data set from smallest to largest.
– Assume a mean. This should be a number that you feel is a close representation of your data set.
– In a simple example, take the number in the center of your data set; in this case 76.
– Subtract your assumed mean from each data entry.
– In our example, 73 − 76 = −3, 75 − 76 = −1, 76 − 76 = 0, 78 − 76 = 2, and 79 − 76 = 3.
– Add together these differences from the mean.
– (−3) + (−1) + 0 + 2 + 3 = 1
– Divide the sum of the differences from assumed mean by the number of data points.
– 1/5 = 0.2
– Add the result of the division to your assumed mean.
– Mean = 76 + 0.2 = 76.2
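• Example: Find the mean of the following data using the Assumed Mean method: 40, 50, 55, 78, 58.
Taking A = 40, the deviations are 0, 10, 15, 38, and 18, which sum to 81.

\[
\bar{x} = A + \frac{\sum_{i=1}^{n} d_i}{n} = 40 + \frac{81}{5} = 56.2
\]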
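
The assumed mean steps can be checked against the direct method with a short Python sketch. This is an illustration, not part of the original notes; the function name `assumed_mean` is hypothetical, and the data and A = 40 are taken from the example above.

```python
def assumed_mean(data, A):
    """Arithmetic mean via the assumed mean (shift of origin) method."""
    deviations = [x - A for x in data]      # d_i = x_i - A
    return A + sum(deviations) / len(data)  # A + (sum of d_i) / n


data = [40, 50, 55, 78, 58]

direct = sum(data) / len(data)       # direct method: 281 / 5
shortcut = assumed_mean(data, A=40)  # 40 + 81 / 5

print(direct, shortcut)
```

Any choice of A gives the same mean; a value near the centre of the data only keeps the deviations small.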

• Find the average for the following data using Assumed mean method

\[
\bar{x} = A + \frac{\sum d}{N} = 8 + \frac{17}{10} = 9.7
\]

1.12.2.3 Step-Deviation Method
• The Step Deviation method is an extension of the Assumed Mean method.
• This method further simplifies calculations by choosing a common factor (step size) to reduce the size of the
deviations from an assumed mean.

• Advantage:
– The step deviations simplify the calculations, especially when the original deviations are large or
awkward to work with.
– Makes it easier to work with data when the values are spread out over a large range.
• Disadvantage:

– Requires an additional step of selecting an appropriate step size $h$.
– May not always lead to simpler calculations if $h$ is not chosen wisely.
• How to Calculate Mean using Step Deviation Method?

– Choose an Assumed Mean (A): Select a value close to the center of your data as the assumed mean.
This value can be one of the data points.
– Calculate the Deviations (d): Subtract the assumed mean from each data point to find the deviation
di = xi − A, where xi is each data point.
– Select a Common Factor (h): Choose a common factor $h$ (also known as the step size), which could be
a convenient value, such as 2, 5, 10, etc., depending on the data range.
– Calculate Step Deviations: Divide each deviation by the chosen factor $h$ to obtain the step deviations $u_i$:

\[
u_i = \frac{d_i}{h} = \frac{x_i - A}{h}
\]

– Find the Sum of Step Deviations ($\sum u_i$): Add up all the step deviations.
– Calculate the Mean: The following equations compute the population mean and sample mean:

\[
\mu = A + h \times \frac{\sum u_i}{N}
\]

\[
\bar{x} = A + h \times \frac{\sum u_i}{n}
\]

• Example: Let's consider the following ungrouped data: 47, 53, 59, 65, 71

1. Choose an Assumed Mean (A): Select A = 59 (a central value from the data).
2. Calculate Deviations (d):
(a) $d_1 = 47 - 59 = -12$
(b) $d_2 = 53 - 59 = -6$
(c) $d_3 = 59 - 59 = 0$
(d) $d_4 = 65 - 59 = 6$
(e) $d_5 = 71 - 59 = 12$
3. Select a Common Factor (h): Choose h = 6 (a convenient value given the range of deviations).
4. Calculate Step Deviations ($u_i$):
(a) $u_1 = \frac{-12}{6} = -2$
(b) $u_2 = \frac{-6}{6} = -1$
(c) $u_3 = \frac{0}{6} = 0$
(d) $u_4 = \frac{6}{6} = 1$
(e) $u_5 = \frac{12}{6} = 2$
5. Find the Sum of Step Deviations ($\sum u_i$):

\[
\sum u_i = (-2) + (-1) + 0 + 1 + 2 = 0
\]
6. Calculate the Arithmetic Mean using the above formula:

\[
\bar{x} = A + h \times \frac{\sum u_i}{n} = 59 + 6 \times \frac{0}{5} = 59
\]
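
The step deviation procedure can be sketched in Python as below. This is an illustrative sketch with a hypothetical function name; A = 59 and h = 6 are the values chosen in the example above.

```python
def step_deviation_mean(data, A, h):
    """Arithmetic mean via the step deviation method."""
    steps = [(x - A) / h for x in data]    # u_i = (x_i - A) / h
    return A + h * sum(steps) / len(data)  # A + h * (sum of u_i) / n


data = [47, 53, 59, 65, 71]

result = step_deviation_mean(data, A=59, h=6)
direct = sum(data) / len(data)

print(result, direct)
```

Both values agree, which illustrates that the choice of A and h affects only the size of the intermediate numbers, not the final mean.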

• Example: Find the mean of the following data using direct method, assumed mean method and step deviation
method. 40, 50, 55, 78, 58

• Find the average for the following data using step-deviation method.

\[
\bar{x} = A + h \times \frac{\sum u_i}{n} = 60 + 5 \times \frac{0}{5} = 60
\]

1.12.2.4 Assumed Mean Method vs Step Deviation Method
• The assumed mean method and the step deviation method are both shortcuts for computing the arithmetic mean.
They give exactly the same result as the direct method, and neither requires the true mean to be known in
advance: the assumed mean A is only a convenient reference value.

• The assumed mean method works with the deviations $d_i = x_i - A$. It is appropriate when the values are
large but the deviations share no convenient common factor.

• The step deviation method goes one step further and divides every deviation by a common factor h, working
with $u_i = (x_i - A)/h$. It is appropriate when the deviations have a common factor, for example when the
values are equally spaced.

• Which shortcut is used has no bearing on the standard deviation formula. That choice depends on whether the
data are a whole population or a sample. When the data are treated as a whole population, the sum of squared
deviations is divided by n:

\[
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}
\]

• When the data are a sample and the mean $\bar{x}$ is estimated from that sample, the sum is divided by
n − 1:

\[
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
\]

• To summarize:
– Use the assumed mean method when the values are large and shifting the origin is enough to simplify the
arithmetic.
– Use the step deviation method when the deviations also share a common factor h that can be divided
out.

1.12.3 Mean of Ungrouped Frequency Distribution or Discrete Series
• In discrete series (ungrouped frequency distribution), each value of the variable is listed once together with
the number of times it is repeated.
• It means that the frequencies are given corresponding to the different values of variables.
• The total number of observations in a discrete series, $N$, equals the sum of the frequencies, $\sum f_i$.

• Example of Discrete Series: If 6 students of a class score 50 marks, 4 students score 60 marks, 7 students
score 70 marks, 3 students score 80 marks, and 5 students score 90 marks, then this information will be
shown as:

Figure 1.1: Frequency Table

1.12.3.1 Direct Method
1. List the Data: Prepare a frequency table with values ($x_i$) and their corresponding frequencies ($f_i$).
2. Calculate the Product of $x_i$ and $f_i$: Multiply each value by its frequency to get $x_i f_i$.
3. Find the Sum of the Products ($\sum x_i f_i$): Add all the products together.
4. Find the Total Frequency ($\sum f_i$): Add all the frequencies together.
5. Calculate the Arithmetic Mean: Use the formulas below, where $k$ is the number of distinct values:
• Arithmetic Mean for Sample:

\[
\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum x_i f_i}{\sum f_i}
\]

• Arithmetic Mean for Population:

\[
\mu = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum x_i f_i}{\sum f_i}
\]
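
As an illustration of these steps, the sketch below computes the mean for the discrete series described earlier (6 students scoring 50 marks, 4 scoring 60, 7 scoring 70, 3 scoring 80 and 5 scoring 90). Storing the frequency table as a dictionary is an assumption made for the sketch, and the variable names are hypothetical.

```python
# Frequency table: marks (x_i) -> number of students (f_i)
marks_frequency = {50: 6, 60: 4, 70: 7, 80: 3, 90: 5}

total_fx = sum(x * f for x, f in marks_frequency.items())  # sum of x_i * f_i
total_f = sum(marks_frequency.values())                    # sum of f_i
mean = total_fx / total_f

print(f"sum(x*f) = {total_fx}, sum(f) = {total_f}, mean = {mean}")
```

The result is the same as expanding the table into 25 individual marks and applying the direct method for raw data.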

• Example:

• Example:

\[
\bar{x} \text{ or } \mu = \frac{\sum x_i f_i}{\sum f_i} = \frac{264}{28} \approx 9.43
\]

• Calculate the mean of the following distribution, which represents the scores obtained by students in a quiz.

1.12.4 Practice Questions
• Calculate the mean for the following set of data 2, 6, 7, 9, 15, 11, 13, 12
• If there are 5 observations, which are 27, 11, 17, 19, and 21 then find the mean
• Find the mean for the following sample data set: 6.4, 5.2, 7.9, 3.4

• Find the mean of 9, 6, -3, 2, -7, 1


• Find the mean of 5,10,15,20,25.
• Find the mean of the given data set: 10,20,30,40,50,60,70,80,90.
• Calculate the mean of the first 10 natural numbers.

• Find the mean of the first 10 even numbers.


• Find the mean of the first 10 odd numbers.
• The Mean of a series with 5 items is 40, and the values of four items are 35, 10, 65, 50. Find out the missing
5th item.

1.13 Sampling
• Why do we need sampling?: Consider a scenario wherein you're asked to perform a survey about the eating
habits of teenagers in the US. There are over 42 million teens in the US at present, and this number is
growing. Is it possible to survey each of these 42 million individuals about their eating habits?
Obviously not! That's why sampling is used.
• How can one choose a sample that best represents the entire population?: Sampling is a statistical
method that deals with the selection of individual observations within a population so that they best represent
the entire population.

• There are two main types of Sampling techniques:


– Probability Sampling
– Non-Probability Sampling

1.13.1 Probability Sampling


• This is a sampling technique in which samples from a large population are chosen using the theory of
probability.

• Probability sampling techniques ensure that every member of the population has a known and non-zero
chance of being selected.
• The main types of probability sampling are:
– Simple Random Sampling or Random Sampling
– Systematic Sampling
– Stratified Sampling
– Cluster Sampling
– Multi-Stage Sampling

1.13.1.1 Random Sampling
• In this method, each member of the population has an equal chance of being selected in the sample.
• Example: A company wants to survey its employees’ job satisfaction. They use a random number generator
to select 50 employees out of 500, ensuring each employee has an equal chance of being chosen.

• Advantages:
– Easy to implement.
– Reduces selection bias.
• Disadvantages:
– Requires a complete list of the population.
– May not be practical for large populations.

1.13.1.2 Systematic Sampling
• In Systematic sampling, every nth record after a random starting point is chosen from the population to be a
part of the sample.
• Example: In a factory with 1000 products, an inspector selects every 10th product for quality testing, starting
from a randomly chosen product (here, the 5th).

• Advantages:
– Simple and quick to implement.
– Ensures a spread across the population.
• Disadvantages:

– Can introduce bias if there is a hidden pattern in the population.
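
Systematic sampling maps naturally onto list slicing in Python. The sketch below is illustrative only: `systematic_sample` is a hypothetical helper, and the products are simply numbered 1 to 1000.

```python
import random


def systematic_sample(population, step, rng=random):
    """Pick every step-th item after a random start within the first step items."""
    start = rng.randrange(step)  # random starting index: 0, 1, ..., step - 1
    return population[start::step]


products = list(range(1, 1001))  # products numbered 1..1000

# The example from the text: start at the 5th product, then take every 10th.
example = products[4::10]
print(example[:3], len(example))

# With a random start, the sample size is the same but the items differ.
random_start_sample = systematic_sample(products, step=10)
```

If the production line had a defect that recurred every 10 items, such a sample would either catch all of the defective items or none of them, which is the hidden-pattern bias mentioned above.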

1.13.1.3 Stratified Sampling
• Stratified sampling divides the population into stratum/strata (subgroups).
• A stratum is a subset of the population that shares at least one common characteristic.
• After this, the random sampling method is used to select a sufficient number of subjects from each stratum.

• Example: A researcher wants to study the income levels of different age groups. They divide the population
into age strata (e.g., 18-29, 30-49, 50-69) and randomly select individuals from each stratum.

• Advantages:

– Ensures representation of all subgroups.


– Increases precision.
• Disadvantages:
– Requires detailed population information.
– More complex to administer.
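
A minimal sketch of stratified sampling for the age-group example follows. The stratum sizes, the member IDs, the fixed seed, and the choice of drawing the same number from every stratum are all assumptions made for illustration; in practice the number drawn is often proportional to the size of each stratum.

```python
import random


def stratified_sample(strata, per_stratum, rng):
    """Draw a simple random sample of per_stratum members from every stratum."""
    return {
        name: rng.sample(members, per_stratum)
        for name, members in strata.items()
    }


rng = random.Random(0)  # fixed seed (an assumption) for repeatability

# Hypothetical population, already divided into age strata.
strata = {
    "18-29": [f"A{i}" for i in range(40)],
    "30-49": [f"B{i}" for i in range(60)],
    "50-69": [f"C{i}" for i in range(30)],
}

sample = stratified_sample(strata, per_stratum=5, rng=rng)
for name, chosen in sample.items():
    print(name, chosen)
```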

1.13.1.4 Cluster Sampling
• Divides the population into clusters, randomly selects some clusters, and then samples all or some members
within those clusters.
• Example: A school district wants to evaluate student performance. They randomly select 5 out of 20 schools
(clusters) and then test all students in those selected schools.

• Advantages:

– Cost-effective for large populations.


– Reduces travel and administrative costs.
• Disadvantages:
– Less precise if the clusters differ substantially from one another, since each cluster should ideally resemble the whole population.
– Can increase sampling error.
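
The school district example can be sketched as follows. The school names, the assumed 30 students per school, and the fixed seed are hypothetical; the point is only that whole clusters are selected at random and every member of a selected cluster is included.

```python
import random

rng = random.Random(1)  # fixed seed (an assumption) for repeatability

# Hypothetical district: 20 schools (clusters) with 30 students each.
schools = {
    f"School {i}": [f"S{i}-{j}" for j in range(30)]
    for i in range(1, 21)
}

# Randomly select 5 of the 20 clusters ...
chosen_schools = rng.sample(sorted(schools), k=5)

# ... and test every student in the selected schools.
tested_students = [s for school in chosen_schools for s in schools[school]]

print(chosen_schools, len(tested_students))
```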

1.13.1.5 Multi-Stage Sampling
• Multistage sampling is an extension of cluster sampling in that, first, clusters are randomly selected and,
second, sample units within the selected clusters are randomly selected.
• It involves multiple stages of sampling, where each stage becomes progressively smaller and more focused.

• Here’s a step-by-step explanation:


– Stage 1: Primary Sampling Units (PSUs) - Divide the population into large groups or clusters, such as
cities, states, or regions, and select a random sample of these PSUs.
– Stage 2: Secondary Sampling Units (SSUs) - Divide each selected PSU into smaller sub-groups, such as
neighborhoods or blocks, and select a random sample of these SSUs.
– Stage 3: Final Sample - Select a random sample of individuals or units from the selected SSUs.
• Example: A national health survey first randomly selects regions (stage 1), then randomly selects towns
within those regions (stage 2), and finally selects households within those towns (stage 3).

• Advantages:
– Flexible and cost-effective.
– Suitable for large-scale surveys.
• Disadvantages:

– Complex to design and analyze.


– Errors can accumulate at each stage, affecting the overall accuracy of the sample.
– If not properly implemented, multistage sampling can introduce bias at each stage.
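
The national health survey example can be sketched as three nested random draws. The sampling frame below (6 regions, 5 towns per region, 20 households per town), the number of units drawn at each stage, and the fixed seed are all hypothetical and chosen only to keep the sketch small.

```python
import random

rng = random.Random(7)  # fixed seed (an assumption) for repeatability

# Hypothetical frame: region -> town -> list of households.
frame = {
    f"Region {r}": {
        f"Town {r}.{t}": [f"H{r}.{t}.{h}" for h in range(20)]
        for t in range(5)
    }
    for r in range(6)
}

households = []
regions = rng.sample(sorted(frame), k=2)                        # stage 1: regions
for region in regions:
    towns = rng.sample(sorted(frame[region]), k=2)              # stage 2: towns
    for town in towns:
        households += rng.sample(frame[region][town], k=4)      # stage 3: households

print(regions, len(households))
```

Each stage narrows the frame, so only the households in the selected towns ever need to be listed, which is where the cost saving comes from.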

1.13.2 Non-Probability Sampling
• Non-probability sampling techniques do not provide every individual with a known or equal chance of being
selected.
• These techniques are often used when probability sampling is not feasible.

1.13.2.1 Convenience Sampling


• Convenience sampling is also known as opportunity sampling.
• Convenience sampling method involves collecting samples from easily accessible locations or sources.
• Example: A researcher surveys customers at a shopping mall because they are easily accessible.
• Advantages:

– Quick and inexpensive.


– Easy to implement.
• Disadvantages:

– High risk of bias.


– Not representative of the population.
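The bias risk listed above can be seen in a small Python sketch. The data are hypothetical: `distances` stands for the home-to-mall distance (in km) of 100 customers, ordered so that the easiest customers to reach come first. The function names are assumptions made for the example.

```python
import random
import statistics

def convenience_sample(units, n):
    """Take the n units that are easiest to reach (here: the first n in the list)."""
    return units[:n]

def simple_random_sample(units, n, seed=0):
    """Probability-based alternative, shown for comparison."""
    return random.Random(seed).sample(units, n)

# Hypothetical data: distance from home to the mall, nearest customers first.
distances = list(range(1, 101))

convenience = convenience_sample(distances, 10)
randomized = simple_random_sample(distances, 10)

print(statistics.mean(distances))    # population mean: 50.5
print(statistics.mean(convenience))  # convenience sample mean: 5.5
print(statistics.mean(randomized))   # varies with the seed
```

Because the convenience sample contains only the nearest customers, its mean is far from the population mean, whatever the sample size.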

1.13.2.2 Snowball Sampling


• Snowball sampling is a recruitment technique in which research participants are asked to assist researchers in
identifying other potential subjects.

• Participants recruit other participants from their acquaintances. Thus the sample group is said to grow like a
rolling snowball.
• Example: A researcher studying a rare disease starts with one patient and asks them to refer other patients
they know.
• Advantages:

– Useful for hard-to-reach populations.


– Builds networks of related subjects.
• Disadvantages:

– Potential for bias.


– Limited control over the sample composition.
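The referral process described above can be sketched as a breadth-first walk over a referral network. This is a minimal sketch under assumed data: the `referrals` dictionary, the participant labels, and the function name `snowball_sample` are hypothetical.

```python
from collections import deque

def snowball_sample(referrals, seeds, max_size):
    """Grow a sample by following referrals, starting from the seed participants."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        # Each recruited participant refers the acquaintances they know.
        for contact in referrals.get(person, []):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return sample

# Hypothetical referral network: participant -> people they can refer.
referrals = {
    "P1": ["P2", "P3"],
    "P2": ["P4"],
    "P3": ["P4", "P5"],
    "P5": ["P6"],
    "P7": ["P8"],  # not connected to P1, so never reached from that seed
}

sample = snowball_sample(referrals, seeds=["P1"], max_size=5)
print(sample)  # ['P1', 'P2', 'P3', 'P4', 'P5']
```

The sample can only contain people connected to the seeds, which is one source of the bias and limited control noted in the disadvantages.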

1.13.2.3 Quota Sampling
• Quota sampling is a non-probability version of stratified sampling: it relies on the non-random selection of a
predetermined number or proportion of units, called a quota.
• It ensures that specific characteristics are represented in the sample.
• You first divide the population into mutually exclusive subgroups (called strata), defined by characteristics
you choose in advance, and then recruit sample units non-randomly until the quota for each subgroup is reached.

• Example: A researcher ensures that their sample includes a certain number of men and women, age groups,
and ethnic backgrounds, reflecting the population’s proportions.
• Advantages:
– Ensures representation of specific groups.
– Often more practical than stratified random sampling, since no complete list of the population is needed.

• Disadvantages:
– Can introduce bias.
– Not random, limiting generalizability.
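The recruit-until-the-quota-is-full procedure can be sketched as follows. This is an illustrative sketch only: the `respondents` list, the quota values, and the function name `quota_sample` are assumptions, and respondents are taken in arrival order to reflect the non-random selection.

```python
def quota_sample(respondents, quotas, key):
    """Recruit respondents in arrival order until every subgroup quota is filled."""
    counts = {group: 0 for group in quotas}
    sample = []
    for person in respondents:
        group = key(person)
        # Accept the person only if their subgroup still has room.
        if group in counts and counts[group] < quotas[group]:
            sample.append(person)
            counts[group] += 1
        if counts == quotas:  # all quotas reached, stop recruiting
            break
    return sample

# Hypothetical respondents, listed in the order they were approached.
respondents = [
    {"id": 1, "gender": "F"}, {"id": 2, "gender": "F"}, {"id": 3, "gender": "M"},
    {"id": 4, "gender": "F"}, {"id": 5, "gender": "M"}, {"id": 6, "gender": "M"},
    {"id": 7, "gender": "F"},
]
quotas = {"F": 2, "M": 2}

sample = quota_sample(respondents, quotas, key=lambda p: p["gender"])
print([p["id"] for p in sample])  # [1, 2, 3, 5]
```

Respondent 4 is skipped because the quota for that subgroup is already full. Who ends up in the sample depends on who was approached first, not on chance.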

1.13.3 Inferential Statistics
• Inferential statistics offers methods for studying a small sample of a population and drawing inferences about
the entire population.
• Inferential statistics uses probability theory to examine whether patterns seen in a research sample can be
extrapolated to the larger population from which the sample was taken.

• Inferential statistics is used to make generalizations about a population, in addition to testing hypotheses and
examining relationships between variables, based on samples.
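As one concrete instance of generalizing from a sample, the sketch below estimates a population mean with an approximate 95% confidence interval. It is a minimal sketch under stated assumptions: the population is simulated (hypothetical heights in cm), the function name `mean_confidence_interval` is made up for the example, and the interval uses a normal approximation, which is reasonable only for fairly large random samples.

```python
import random
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(sample) / len(sample) ** 0.5
    # Two-sided critical value from the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, (mean - z * sem, mean + z * sem)

# Simulated (hypothetical) population and a simple random sample from it.
rng = random.Random(1)
population = [rng.gauss(170, 10) for _ in range(100_000)]
sample = rng.sample(population, 200)

mean, (low, high) = mean_confidence_interval(sample)
print(f"sample mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The interval describes the unseen population, not just the 200 sampled values. That step from sample to population is what distinguishes inferential from descriptive statistics.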
