Statistics Notes
Social statistics refers to the branch of statistics that deals specifically with data and information
related to social phenomena, behaviors, trends, and structures within human societies.
It involves the collection, analysis, interpretation, and presentation of numerical data regarding
various aspects of society, such as demographics, economics, health, education, crime, and
public opinion.
Functions of Statistics
Statistics serves several important functions across various disciplines and applications.
1. Descriptive Statistics: Describes and summarizes data through measures such as mean,
median, mode, variance, and standard deviation. It helps in organizing and presenting
data in a meaningful way (a short computational sketch follows this list).
2. Inferential Statistics: Draws conclusions or makes predictions about a population based
on sample data. It involves techniques like hypothesis testing, confidence intervals, and
regression analysis.
3. Exploratory Data Analysis (EDA): Techniques like histograms, scatter plots, and box
plots are used to visually explore data patterns, identify relationships, and detect
anomalies.
4. Data Collection and Sampling: Provides methods for collecting, organizing, and
sampling data to ensure it is representative and suitable for analysis.
5. Probability: Provides the theoretical foundation for statistical methods, helping to
quantify uncertainty and randomness in data.
6. Decision Making: Helps in making informed decisions based on data analysis and
statistical inference, minimizing risks and uncertainties.
7. Quality Control and Process Improvement: Statistical process control techniques are
used to monitor and improve processes, ensuring consistency and quality.
8. Predictive Modeling: Uses statistical models to forecast future trends or outcomes based
on historical data and relationships observed in the data.
9. Comparative Analysis: Compares different groups or datasets to identify similarities,
differences, and relationships.
10. Research Design: Helps in designing experiments and studies to ensure valid
conclusions can be drawn from the data collected.
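As a minimal sketch of the descriptive-statistics function above, the snippet below uses Python's standard statistics module on a small, purely hypothetical data set.

```python
import statistics

# Hypothetical exam scores (illustrative data only)
scores = [62, 70, 70, 75, 81, 88, 94]

print("mean   :", statistics.mean(scores))                 # arithmetic average
print("median :", statistics.median(scores))               # middle value
print("mode   :", statistics.mode(scores))                 # most frequent value
print("variance (sample):", statistics.variance(scores))   # spread around the mean
print("std dev  (sample):", statistics.stdev(scores))      # square root of the variance
```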
Limitations of Statistics
1. Sampling Bias: If the sample used to gather data is not representative of the entire
population, the conclusions drawn from the statistics may not be accurate for the whole
population.
2. Causation vs. Correlation: Statistics can show relationships between variables, but they
often cannot prove causation. Correlation does not necessarily imply causation, and other
factors may be influencing the observed relationship.
3. Assumptions of Normality: Many statistical tests assume that the data follows a normal
distribution. If this assumption is not met, the results of the analysis may be misleading.
4. Measurement Errors: Errors in data collection or measurement can introduce
inaccuracies into statistical analyses, affecting the validity of conclusions drawn from the
data.
5. Interpretation Issues: Statistical results can sometimes be misinterpreted or
misunderstood, leading to incorrect conclusions or decisions.
6. Ethical Issues: Statistics can be misused or misinterpreted to support biased or unethical
practices, especially if not handled transparently or rigorously.
7. Complexity of Relationships: Some real-world relationships are complex and may not
be fully captured by statistical models, leading to oversimplification or incomplete
understanding.
8. Changes Over Time: Statistics provide a snapshot of data at a particular point in time.
Changes in the underlying conditions or variables over time may not be adequately
captured by static statistical analyses.
9. Contextual Limitations: Statistics may not fully account for cultural, social, or historical
contexts that can influence the phenomena being studied.
TOPIC 2: DATA COLLECTION AND PRESENTATION
The basis for data collection refers to the principles and methods used to gather information for
analysis or research purposes. It involves establishing criteria, procedures, and techniques to
ensure that data is collected accurately, ethically, and effectively.
1. Purpose: Clearly defining the objectives and goals of the data collection process.
2. Scope: Determining the extent and boundaries of the data to be collected, including what
data is relevant and necessary.
3. Methodology: Selecting appropriate methods and tools for data collection, such as
surveys, interviews, observations, or experiments.
4. Ethics: Adhering to ethical guidelines and principles, ensuring that data collection
respects privacy, confidentiality, and informed consent.
5. Validity and Reliability: Ensuring that the data collected is accurate, relevant, and
reliable for the intended analysis or research.
6. Documentation: Keeping detailed records of the data collection process, including any
deviations or challenges encountered.
7. Analysis: Planning for how the collected data will be processed, analyzed, and
interpreted to derive meaningful insights.
Data Classification
Data classification is the process of organizing data into categories for its most effective and
efficient use.
This classification helps in managing and protecting data based on its level of sensitivity and
importance. Organizations often use data classification to implement security measures, ensure
compliance with regulations, and facilitate easier access and retrieval of information.
Data Tabulation
Data tabulation typically refers to the process of organizing data into a table or a structured
format.
It involves summarizing, categorizing, and presenting data in a clear and understandable way.
Tabulation is often used in data analysis and reporting to facilitate easy interpretation and
comparison of information.
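As a small sketch of tabulation, the snippet below builds a simple frequency table with Python's collections.Counter; the survey categories are hypothetical.

```python
from collections import Counter

# Hypothetical survey responses (illustrative only)
responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]

freq = Counter(responses)            # counts per category
total = sum(freq.values())

print(f"{'Response':<10}{'Frequency':>10}{'Percent':>10}")
for category, count in freq.most_common():
    print(f"{category:<10}{count:>10}{count / total:>10.1%}")
```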
Diagrammatic and Graphical Presentation of Data
Diagrammatic and graphical presentations of data are visual methods used to represent data in a
clear and concise manner.
Probability
Probability is a fundamental concept in mathematics and statistics that quantifies the likelihood
of an event occurring.
1. Experiment: A process that leads to one or more outcomes. For example, flipping a coin,
rolling a die, or conducting a survey.
2. Outcome: A possible result of an experiment. For instance, "heads" or "tails" in a coin
flip, or "1", "2", "3", "4", "5", or "6" on a die.
3. Sample Space: The set of all possible outcomes of an experiment, usually denoted by
S. For a coin flip, the sample space S would be {heads, tails}.
4. Event: A subset of the sample space, which consists of one or more outcomes. Events
can be simple (like getting heads on a coin flip) or compound (like getting an even
number on a die roll).
5. Probability of an Event: The likelihood of an event occurring, denoted by P(event). It is
a number between 0 and 1, where 0 means the event will not occur, and 1 means the
event is certain to occur.
6. Probability Distribution: A function or a rule that assigns probabilities to the possible
outcomes in a sample space. It describes how the probabilities are distributed among all
the possible outcomes.
7. Rules of Probability:
o Sum Rule: P(A∪B)=P(A)+P(B) for mutually exclusive events (events that cannot
occur simultaneously).
o Product Rule: P(A∩B)=P(A)⋅P(B∣A) for the probability of both events A and B
occurring.
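A small worked sketch of the sum and product rules using a fair six-sided die; the probabilities are exact, and the choice of events is purely illustrative.

```python
from fractions import Fraction

# Sample space for one roll of a fair die
sample_space = {1, 2, 3, 4, 5, 6}

def p(event):
    """Probability of an event (a subset of the sample space)."""
    return Fraction(len(event), len(sample_space))

A = {2, 4, 6}          # "even number"
B = {1, 3}             # "a one or a three" (mutually exclusive with A)

# Sum rule for mutually exclusive events: P(A ∪ B) = P(A) + P(B)
assert p(A | B) == p(A) + p(B)                # 3/6 + 2/6 = 5/6

# Product rule: P(A ∩ C) = P(C) · P(A | C), with C = "greater than 3"
C = {4, 5, 6}
p_A_given_C = Fraction(len(A & C), len(C))    # 2/3
assert p(A & C) == p(C) * p_A_given_C         # 1/3 = 1/2 · 2/3

print("P(A ∪ B) =", p(A | B), " P(A ∩ C) =", p(A & C))
```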
TOPIC 7: SAMPLING
Meaning of Sampling
Sampling generally refers to the process of selecting a subset of individuals or items from a
larger population or group.
Types of Sampling
Sampling methods are techniques used to select a subset of individuals from a larger population,
allowing researchers to make inferences and generalizations about the population.
1. Simple Random Sampling: Every member of the population has an equal chance of
being selected. This is typically done using random number generators or drawing lots
(the first three methods here are illustrated in a sketch after this list).
2. Stratified Sampling: The population is divided into subgroups (or strata) based on
certain characteristics (like age, gender, income), and then random samples are taken
from each subgroup in proportion to their size in the population.
3. Systematic Sampling: Researchers choose every nth individual from a list of the
population. For example, if you wanted a sample of 100 from a population of 1000, you
might select every 10th person on a list.
4. Cluster Sampling: The population is divided into clusters (like geographic areas or
organizational units), and then a random sample of clusters is selected. All individuals
within the chosen clusters are sampled.
5. Convenience Sampling: Also known as accidental or haphazard sampling, this method
involves sampling individuals who are easiest to access. It's convenient but may not be
representative of the entire population.
6. Snowball Sampling: Used when the population is hard to access, this method relies on
referrals from initial subjects to generate additional subjects.
7. Purposive Sampling: Also called judgmental or selective sampling, researchers choose
subjects based on specific criteria relevant to the study's objectives.
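The sketch below illustrates simple random, stratified, and systematic sampling with Python's random module; the "population" and the urban/rural strata are hypothetical.

```python
import random

random.seed(42)  # for reproducibility of this illustration

population = list(range(1, 1001))   # hypothetical population of 1,000 individuals

# 1. Simple random sampling: every member has an equal chance of selection
simple_sample = random.sample(population, 100)

# 2. Stratified sampling: sample from each stratum in proportion to its size
strata = {"urban": population[:600], "rural": population[600:]}   # hypothetical strata
stratified_sample = []
for name, members in strata.items():
    k = round(100 * len(members) / len(population))   # proportional allocation
    stratified_sample.extend(random.sample(members, k))

# 3. Systematic sampling: every 10th member after a random starting point
start = random.randrange(10)
systematic_sample = population[start::10]

print(len(simple_sample), len(stratified_sample), len(systematic_sample))
```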
Sampling and census are two methods used in statistics and research to gather information from
a population.
Census:
Definition: A census involves collecting data from every member of the population.
Purpose: It aims to provide a complete and accurate count or measurement of every
individual or item in the population.
Example: A national census conducted by a government to count every citizen.
Sampling:
Definition: Sampling involves collecting data from only a subset (sample) of the population.
Purpose: It aims to draw conclusions about the whole population while saving the time and
cost of a full census.
Example: A survey of a few thousand households used to estimate national employment figures.
Limitations of Sampling
1. Sampling Bias: There's a risk that the sample may not accurately represent the entire
population, leading to biased results. This can happen due to factors like non-response
bias (certain groups being less likely to respond), selection bias (specific groups being
over or underrepresented), or volunteer bias (people who volunteer for studies may differ
from those who do not).
2. Sampling Error: Even with random sampling, there's always a margin of error due to
chance. This means that the sample statistics (like mean or proportion) may differ from
the population parameters they estimate.
3. Cost and Time Constraints: Conducting a comprehensive sample can be expensive and
time-consuming, especially if the population is large or geographically dispersed.
4. Inability to Infer Causation: Sampling can show correlation but not necessarily
causation. Establishing causal relationships often requires more controlled experimental
designs.
5. Population Definition: Defining the population accurately is crucial. If the population is
poorly defined or changes over time, the sample may not be representative.
6. Ethical Considerations: In some cases, obtaining a representative sample may involve
ethical challenges, especially if certain groups are marginalized or difficult to access.
7. Difficulty in Sampling Rare Events: If the event of interest is rare, it may be
challenging to obtain a sufficient number of occurrences in the sample to draw
meaningful conclusions.
TOPIC 8: ESTIMATION AND TEST OF HYPOTHESIS
Estimation in Statistics
Estimation in statistics refers to the process of using sample data to estimate the characteristics of
a population. It involves making inferences or educated guesses about population parameters
(such as mean, variance, proportion) based on sample statistics (such as sample mean, sample
variance, sample proportion).
1. Point Estimation: This involves using a single value (such as the sample mean or sample
proportion) to estimate a population parameter. For example, using the sample mean to
estimate the population mean.
2. Interval Estimation: This involves estimating a range of values (an interval) that likely
contains the population parameter. Confidence intervals are a common form of interval
estimation, providing a range of values within which the population parameter is
expected to lie with a certain level of confidence.
The sampling distribution of a statistic refers to the distribution of values taken by the statistic in
all possible samples of the same size from the same population.
1. Statistic: A quantity calculated from a sample, such as the sample mean, sample
variance, or sample proportion.
2. Population: The entire set of individuals, items, or data from which the samples are
taken.
3. Sampling Distribution: The distribution of values of a statistic across all possible
samples of the same size from the population.
Key Points:
Central Limit Theorem: For large sample sizes, the sampling distribution of the sample
mean (and of many other statistics) tends to be approximately normal, regardless of the
shape of the population distribution.
Standard Error: This measures the variability of the sampling distribution around the
true population parameter. It is related to the sample size and the variability of the
population.
Uses: Understanding the sampling distribution helps in making inferences about the
population based on sample statistics. It also plays a crucial role in hypothesis testing and
constructing confidence intervals.
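As an illustrative simulation of these key points, the sketch below draws repeated samples from a skewed population and shows that the sample means cluster near the population mean, with spread close to the theoretical standard error σ/√n; the distribution and sample size are assumptions made for the example.

```python
import random
import statistics

random.seed(0)

n = 30               # sample size
num_samples = 2000   # number of repeated samples

# Skewed "population": exponential distribution with mean 1 (so sigma is also 1)
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

print("mean of sample means   :", round(statistics.mean(sample_means), 3))   # close to 1.0
print("std dev of sample means:", round(statistics.stdev(sample_means), 3))
print("theoretical std error  :", round(1.0 / n ** 0.5, 3))                  # sigma / sqrt(n)
```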
Confidence Interval For Parameter and Interpretation
A confidence interval for a parameter in statistics is a range of values constructed from sample
data that is likely to contain the true value of the parameter.
1. Parameter: This could be any unknown value in a population that we are interested in
estimating. For example, the mean (μ) or proportion (p) of a population.
2. Sample Data: We collect a sample from the population and use it to estimate the
parameter.
3. Confidence Interval: This is an interval estimate around our sample statistic (like the
sample mean or sample proportion) that likely contains the true population parameter. It's
expressed with a level of confidence, usually 95% or 99%, indicating how confident we
are that the true parameter falls within the interval.
Interpretation:
If we construct a 95% confidence interval for the population mean height of adults based
on a sample, say [65, 75], it means we are 95% confident that the true mean height of all
adults falls between 65 inches and 75 inches.
This does not mean there’s a 95% chance that the true parameter lies in the interval;
rather, it means that if we were to repeat this process many times, about 95% of the
intervals constructed would contain the true parameter.
Increasing the confidence level (say from 95% to 99%) widens the interval: we become
more confident that it contains the true parameter, but the wider range of values means
less precision.
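A minimal sketch of a 95% confidence interval for a population mean, using the normal critical value 1.96 and a hypothetical sample of heights; with a sample this small a t critical value would normally replace 1.96.

```python
import math
import statistics

# Hypothetical sample of adult heights in inches (illustrative only)
heights = [66, 71, 69, 74, 68, 72, 70, 67, 73, 69, 71, 68]

n = len(heights)
mean = statistics.mean(heights)
std_err = statistics.stdev(heights) / math.sqrt(n)   # estimated standard error of the mean

z = 1.96                                             # approximate 95% critical value
lower, upper = mean - z * std_err, mean + z * std_err
print(f"95% CI for the mean: [{lower:.1f}, {upper:.1f}] inches")
```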
Hypothesis in Statistics
The two main types of hypothesis are the null hypothesis (H₀) and the alternative hypothesis
(H₁). Testing them involves the following steps:
Formulate Hypotheses: Clearly state the null hypothesis (H₀) and the alternative
hypothesis (H₁).
Select a Significance Level: This is denoted as α (alpha) and represents the probability
of rejecting the null hypothesis when it is actually true. Common values for α are 0.05 or
0.01.
Collect Data and Compute Test Statistic: Using sample data, compute a test statistic
that will help us decide whether to reject the null hypothesis.
Make a Decision: Compare the test statistic to a critical value (from the statistical
distribution) or use a p-value to determine whether to reject the null hypothesis.
Draw Conclusions: Based on the decision from the hypothesis test, draw conclusions
about the population parameter(s) being studied.
Errors in Statistics
In statistics, errors can occur in various forms, affecting the accuracy and reliability of data
analysis and interpretations.
1. Sampling Error: This occurs when the sample used to make inferences about a
population is not perfectly representative of the entire population. It leads to
discrepancies between sample statistics and population parameters.
2. Measurement Error: This error arises from inaccuracies or inconsistencies in the
measurement process. It can result from faulty instruments, human error in recording
data, or natural variability in measurements.
3. Non-Sampling Error: Unlike sampling error, non-sampling errors are not related to the
sample selection process but can still affect the validity of statistical analyses. Examples
include data entry errors, non-response bias, and errors in data processing.
4. Bias: Bias refers to systematic errors that consistently skew results in a particular
direction, away from the true value. It can be introduced by sampling methods,
measurement techniques, or even the interpretation of results.
5. Type I Error: In hypothesis testing, a Type I error occurs when a true null hypothesis is
rejected. It represents the probability of incorrectly concluding that there is a significant
effect or relationship when none exists (false positive).
6. Type II Error: Conversely, a Type II error occurs when a false null hypothesis is not
rejected. It signifies the probability of failing to detect a true effect or relationship (false
negative).
7. Errors in Causation: These errors occur when relationships between variables are
incorrectly interpreted as causal when they are not. Correlation does not imply causation,
and such errors can lead to erroneous conclusions.
8. Confounding Variables: These are variables that are related to both the independent and
dependent variables in a study, making it difficult to determine the true relationship
between them. Ignoring confounding variables can lead to biased results.
Hypothesis testing is a fundamental concept in statistics used to make decisions about the
population based on sample data.
1. Hypotheses:
o Null Hypothesis (H₀): This hypothesis typically states that there is no significant
difference or relationship between variables. It represents the status quo or no
effect scenario.
o Alternative Hypothesis (H₁ or Hₐ): This hypothesis contradicts the null
hypothesis, suggesting that there is indeed an effect, difference, or relationship.
2. Steps in Hypothesis Testing:
o Step 1: Formulate the Hypotheses: Define the null and alternative hypotheses
based on the research question.
o Step 2: Choose the Significance Level: Typically denoted as α (alpha), this is the
threshold used to assess the strength of evidence against the null hypothesis.
o Step 3: Collect Data and Compute Test Statistic: Gather sample data and
calculate a test statistic, which depends on the type of test (e.g., t-test, z-test, chi-
square test).
o Step 4: Make a Decision: Compare the test statistic to a critical value from the
appropriate statistical distribution (e.g., t-distribution, normal distribution) or use
a p-value approach to determine whether to reject the null hypothesis.
o Step 5: Interpret Results: Based on the comparison, either reject the null
hypothesis in favor of the alternative hypothesis or fail to reject the null
hypothesis (meaning there is insufficient evidence to reject it).
3. Types of Errors:
o Type I Error: Rejecting the null hypothesis when it is actually true (false
positive). The probability of committing this error is α.
o Type II Error: Failing to reject the null hypothesis when it is actually false (false
negative). The probability of this error is denoted by β.
4. Common Statistical Tests:
o Parametric Tests: Require assumptions about the population parameters (e.g., t-
test, z-test).
o Non-Parametric Tests: Do not make specific assumptions about population
parameters (e.g., chi-square test, Mann-Whitney U test).
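As a hedged sketch of the steps above, the snippet below runs a one-sample t-test (a parametric test) with scipy.stats; the data and the hypothesized mean of 50 are hypothetical.

```python
from scipy import stats

# Hypothetical sample measurements (illustrative only)
sample = [51.2, 49.8, 52.4, 50.9, 48.7, 53.1, 50.2, 51.8, 49.5, 52.0]

alpha = 0.05     # Step 2: significance level
mu_0 = 50        # Step 1: H0: population mean = 50, H1: population mean != 50

# Step 3: compute the test statistic and p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_0)

# Steps 4-5: decision and interpretation
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0: the sample mean differs significantly from 50.")
else:
    print("Fail to reject H0: insufficient evidence of a difference from 50.")
```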
TOPIC 9: TIME SERIES ANALYSIS
A time series is usually described in terms of the following components:
1. Trend: The long-term movement or direction of the series. It represents the overall
tendency of the data to increase, decrease, or remain stable over time.
2. Seasonality: Patterns that repeat at regular intervals, often influenced by seasonal factors
such as the time of year, month, day, etc. Seasonality occurs when a time series is
affected by factors operating in a fixed and known period, such as weather, holidays, or
other predictable events.
3. Cyclicality: Patterns that occur at irregular intervals, usually over multiple years, and are
not necessarily of fixed period. Unlike seasonality, cyclicality does not have a fixed and
known period.
4. Irregularity or Residual: Random fluctuations or noise in the data that cannot be
attributed to the above components. These are the unpredictable components of a time
series.
Time series models are statistical models used to understand and make predictions about data
points that are sequentially ordered over time.
1. Moving Averages (illustrated together with exponential smoothing in a sketch after this list):
o Simple Moving Average (SMA): Calculated as the average of a specified
number of past observations. It smooths out short-term fluctuations and highlights
longer-term trends.
o Weighted Moving Average (WMA): Similar to SMA, but assigns weights to
observations, giving more importance to recent data points.
2. Exponential Smoothing:
o Assigns exponentially decreasing weights to older observations. It's useful for
capturing trends and seasonal patterns in data.
3. Seasonal-Trend Decomposition using LOESS (STL):
o Separates a time series into trend, seasonal, and residual components. It helps in
understanding the underlying patterns.
4. Regression Analysis:
o Fits a regression model to the time series data, where time is the independent
variable. This can help quantify trends and seasonal effects explicitly.
5. Seasonal Adjustment Techniques:
o Methods like X-12-ARIMA or Census Bureau's X-13ARIMA-SEATS can be
used to adjust time series data for seasonal variations, making trends easier to
identify.
6. Fourier Transforms:
o Decomposes a time series into its constituent frequencies, allowing the
identification and extraction of seasonal components.
7. AutoRegressive Integrated Moving Average (ARIMA):
o Models the autocorrelation of the time series, allowing for the identification of
trend and seasonal components.
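A small sketch of a simple moving average and simple exponential smoothing in plain Python; the monthly series and the smoothing constant alpha are hypothetical.

```python
# Hypothetical monthly sales series (illustrative only)
series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]

def simple_moving_average(data, window):
    """Average of the most recent `window` observations at each point."""
    return [
        sum(data[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(data))
    ]

def exponential_smoothing(data, alpha):
    """Each smoothed value weights recent observations more heavily."""
    smoothed = [data[0]]
    for x in data[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

print(simple_moving_average(series, window=3)[:4])
print([round(v, 1) for v in exponential_smoothing(series, alpha=0.3)[:4]])
```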
De-seasonalization
De-seasonalization is the removal of seasonal effects from a time series. This is typically done
to better understand the underlying trend or to make accurate comparisons across different time
periods, especially in economics, finance, and other fields where seasonal fluctuations can
obscure long-term trends.
Techniques for de-seasonalization often involve statistical methods such as moving averages,
seasonal indices, or de-seasonalizing formulas tailored to the specific characteristics of the data.
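As a hedged sketch of the seasonal-index approach, the snippet below averages each quarter across years, converts the averages into seasonal indices, and divides each observation by its index; the quarterly figures are hypothetical, and a production method would normally use centred moving averages instead of simple quarterly averages.

```python
# Hypothetical quarterly sales for three years (illustrative only)
sales = [120, 150, 180, 110,
         130, 160, 195, 115,
         140, 170, 210, 125]
quarters = [i % 4 for i in range(len(sales))]   # 0..3 repeating

# Average level of each quarter across the three years
quarter_avg = [
    sum(v for v, q in zip(sales, quarters) if q == k) / 3
    for k in range(4)
]
overall_avg = sum(sales) / len(sales)

# Seasonal index: how far each quarter sits above or below the overall average
seasonal_index = [qa / overall_avg for qa in quarter_avg]

# De-seasonalized series: each observation divided by its quarter's index
deseasonalized = [v / seasonal_index[q] for v, q in zip(sales, quarters)]
print([round(x, 1) for x in deseasonalized[:4]])
```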
Time series analysis is widely used across various fields for analyzing data points collected over
time.
1. Degree Distribution: This refers to the distribution of node degrees in a network, where
node degree is the number of connections (edges) a node has.
2. Centrality Measures: These measures (like degree centrality, betweenness centrality,
closeness centrality) indicate the relative importance of a node within a network. The
distribution of these centrality measures across nodes can provide insights into the
network structure.
3. Clustering Coefficient: This measures the degree to which nodes tend to cluster
together. The distribution of clustering coefficients across nodes can indicate how
clustered or decentralized a network is.
4. Attribute Distribution: Networks can also have attributes associated with nodes or
edges (e.g., weights, labels). The distribution of these attributes across the network can be
analyzed to understand patterns or anomalies.
5. Random Network Models: Various models like Erdős-Rényi, Barabási-Albert, and
Watts-Strogatz generate networks with specific distributions of properties such as degree
distribution or clustering.
Network analysis, within the realm of statistics, plays a crucial role in understanding and
analyzing complex relationships and interactions among entities.
In statistics, "network construction" typically refers to the process of building a network or graph
model from data, where nodes represent entities (such as individuals, variables, or events) and
edges represent relationships or connections between them.
This concept is widely used in various fields like social network analysis, biology (gene
regulatory networks), and computer science (communication networks).
Here are some key points and steps involved in network construction in statistics:
1. Data Collection: Gather data that describes the entities and their relationships. This
could be observational data, survey responses, or any other relevant information.
2. Define Nodes and Edges: Identify what each node in your network will represent (e.g.,
individuals, variables, genes) and how edges will be defined (e.g., co-occurrence,
interaction, similarity).
3. Data Representation: Represent your data in a suitable format for network analysis.
This typically involves creating adjacency matrices (for binary relationships) or weighted
matrices (for strength of relationships).
4. Network Visualization: Use software tools like Gephi, NetworkX (Python), or igraph
(R) to visualize your network and explore its structure. Visualization can help in
understanding patterns and centralities within the network.
5. Network Analysis: Apply statistical methods and metrics to analyze the network. This
may include measuring centrality (e.g., degree centrality, betweenness centrality),
clustering coefficients, and detecting communities or modules within the network.
6. Model Fitting: In some cases, you may want to fit a specific network model (e.g., Erdős-
Rényi model, Barabási-Albert model) to understand how well your data fits theoretical
network structures.
7. Interpretation: Interpret the results of your analysis in the context of your research
question or problem. Network analysis can provide insights into connectivity patterns,
influential nodes, and overall network dynamics.
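A minimal sketch of steps 2-5 using the NetworkX library mentioned above; the people and friendships are hypothetical.

```python
import networkx as nx

# Step 2: nodes are people, edges are (hypothetical) friendships
edges = [("Ann", "Ben"), ("Ann", "Cara"), ("Ben", "Cara"),
         ("Cara", "Dan"), ("Dan", "Eve")]

# Step 3: build the graph (NetworkX stores it as an adjacency structure)
G = nx.Graph()
G.add_edges_from(edges)

# Step 5: basic network metrics
print("degree centrality      :", nx.degree_centrality(G))
print("betweenness centrality :", nx.betweenness_centrality(G))
print("average clustering     :", nx.average_clustering(G))

# Step 4 (visualization) would typically use nx.draw(G) with matplotlib.
```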
Critical Path Determinations in Network Analysis
In network analysis, especially in project management, the critical path is crucial for determining
the shortest possible duration for completing a project.
1. Definition: The critical path is the longest sequence of activities in a project plan which
must be completed on time for the project to finish by its due date. It represents the
minimum time needed to complete the project.
2. Identifying the Critical Path:
o Forward Pass: Calculate the earliest start and finish times for each activity.
o Backward Pass: Calculate the latest start and finish times that still allow the
project to finish on time.
o Activities where the early and late times match are on the critical path.
3. Key Characteristics:
o Activities on the critical path have zero slack or float, meaning any delay in these
activities delays the project.
o Non-critical activities have some slack, meaning they can be delayed without
affecting the project's overall duration.
4. Importance:
o Helps in project scheduling and resource allocation.
o Guides project managers in focusing resources on critical tasks to ensure timely
project completion.
o Allows for better risk management as delays in critical path activities can impact
project deadlines.
5. Tools: Critical path method (CPM) and Program Evaluation and Review Technique
(PERT) are commonly used tools for determining and managing the critical path.
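A hedged sketch of the forward and backward passes on a small, made-up activity network; the activity names, durations, and dependencies are hypothetical.

```python
# Hypothetical activities: name -> (duration, list of predecessors)
activities = {
    "A": (3, []),
    "B": (4, ["A"]),
    "C": (2, ["A"]),
    "D": (5, ["B", "C"]),
    "E": (1, ["D"]),
}

# Forward pass: earliest start (ES) and earliest finish (EF)
ES, EF = {}, {}
for name in activities:                     # dict order already respects precedence here
    dur, preds = activities[name]
    ES[name] = max((EF[p] for p in preds), default=0)
    EF[name] = ES[name] + dur

project_duration = max(EF.values())

# Backward pass: latest finish (LF) and latest start (LS)
LF, LS = {}, {}
for name in reversed(list(activities)):
    dur, _ = activities[name]
    successors = [s for s, (_, preds) in activities.items() if name in preds]
    LF[name] = min((LS[s] for s in successors), default=project_duration)
    LS[name] = LF[name] - dur

# Critical path: activities with zero slack (ES == LS)
critical_path = [name for name in activities if ES[name] == LS[name]]
print("project duration:", project_duration)      # 13
print("critical path   :", critical_path)         # ['A', 'B', 'D', 'E']
```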
Inventory control refers to the process of managing and overseeing the ordering, storage, and use
of goods or materials within an organization. It involves ensuring that the right amount of
inventory is available at the right time, minimizing excess or shortage.
Inventory control systems are crucial for businesses to manage and track their inventory
effectively. These systems help optimize stock levels, reduce costs, and ensure products are
available when needed.
The Economic Order Quantity (EOQ) model is a formula used to determine the optimal quantity
of inventory to order that minimizes total inventory costs. It balances the costs of holding
inventory (holding costs) and the costs of ordering inventory (ordering costs). The EOQ formula
is:
EOQ = √(2DS / H)
where:
D = annual demand (units per year),
S = ordering cost per order, and
H = holding cost per unit per year.
The EOQ model aims to find the order quantity that minimizes the sum of these two costs. It
assumes constant demand, fixed ordering and holding costs, and no constraints on capital or
space.
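A tiny sketch of the EOQ formula above; the demand and cost figures are hypothetical.

```python
import math

def eoq(annual_demand, ordering_cost, holding_cost):
    """Economic Order Quantity: sqrt(2DS / H)."""
    return math.sqrt(2 * annual_demand * ordering_cost / holding_cost)

# Hypothetical figures: D = 12,000 units/year, S = $50 per order, H = $3 per unit per year
q = eoq(12_000, 50, 3)
print(f"EOQ is about {q:.0f} units per order")   # about 632 units
```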
Safety Stock and Re-order Level Determination
Safety stock and reorder level are key inventory management concepts used to ensure that
businesses can meet demand without running out of stock.
1. Reorder Level (Reorder Point):
o The inventory level at which a new order should be placed so that replenishment
arrives before stock runs out. A common rule is Reorder Level = Average Daily
Usage × Lead Time + Safety Stock.
2. Safety Stock:
o Safety stock is extra inventory held to mitigate the risk of stockouts caused by
variability in demand and/or lead time. Factors influencing safety stock include:
Demand Variability: Fluctuations in customer demand.
Lead Time Variability: Variations in the time taken for suppliers to
deliver.
Service Level Objective: Desired level of stock availability.
Formula: Safety Stock = (Maximum Daily Usage - Average Daily Usage) × Lead Time
Determining these levels involves balancing the cost of holding excess inventory (including
storage and obsolescence) against the cost of stockouts (lost sales, customer dissatisfaction).
Advanced forecasting methods and inventory management software can help optimize these
levels based on historical data and future projections.
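A final sketch applying the safety-stock formula above together with the common reorder-level rule noted earlier; all usage and lead-time figures are hypothetical.

```python
# Hypothetical usage and lead-time figures (illustrative only)
average_daily_usage = 40      # units per day
maximum_daily_usage = 55      # units per day
lead_time_days = 6            # average supplier lead time

# Safety stock = (maximum daily usage - average daily usage) x lead time
safety_stock = (maximum_daily_usage - average_daily_usage) * lead_time_days

# Reorder level = average usage over the lead time + safety stock
reorder_level = average_daily_usage * lead_time_days + safety_stock

print("safety stock :", safety_stock, "units")    # 90 units
print("reorder level:", reorder_level, "units")   # 330 units
```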