
CS22021 - EXPLORATORY DATA ANALYSIS

FAT 3 QB

PART A

Q1. How do percentage tables assist in making fair comparisons between


different populations or groups?
Ans:
1.​ Controls for Group Size Differences: When raw counts differ significantly
between groups, converting them to percentages allows meaningful comparison
by neutralizing the size effect.​

2.​ Standardized Interpretation: Percentages provide a uniform basis for


interpretation, making trends easily understandable regardless of group size.​

3.​ Highlights Proportional Trends: They help detect trends or proportions, such
as higher incidence rates in a subgroup despite smaller counts.​

4.​ Supports Policy Decisions: Widely used in fields like public health and
education to formulate decisions based on relative distributions.​

Q2. How can row and column percentages in a table reveal different insights in
categorical data analysis?
Ans:
1.​ Row Percentages: Reveal how a specific row category is distributed across
column categories, helping compare behaviors or responses across groups.​

2.​ Column Percentages: Show how each column category is composed of


different row groups, revealing contributions to a certain outcome.​

3.​ Comparison Clarity: Enables identification of dominant subgroups or patterns in


multi-category comparisons.​

4.​ Application Example: In survey results, row percentages compare opinions


across age groups, while column percentages analyze age distribution within
each opinion category.
Q3. How does analyzing a contingency table help in understanding the
relationship between two variables?
Ans:
1.​ Identifies Associations: Helps examine if there's a dependency or association
between two categorical variables.​

2.​ Visual Summary: Presents a structured, easy-to-read matrix that reveals


interaction patterns.​

3.​ Enables Statistical Testing: Forms the basis for performing the Chi-square test
to statistically assess independence.​

4.​ Supports Hypothesis Formation: Patterns seen in the table guide the
development of meaningful hypotheses for further investigation.​

Q4. What role do marginal totals play in interpreting a contingency table?


Ans:
1.​ Summarize Totals: Provide the total count of occurrences for each row and
column category.​

2.​ Facilitate Probability Calculation: Used to compute marginal and conditional


probabilities.​

3.​ Check Consistency: Allow validation of table structure by ensuring total


observed values match expected dataset size.​

4.​ Aid in Independence Testing: Needed to calculate expected values when


conducting a Chi-square test for independence.​

Q5. What are the key features of a scatterplot that help detect relationships
between two variables?
Ans:
1.​ Direction: Shows whether the relationship is positive, negative, or nonexistent
based on the slope.​

2.​ Form: Indicates whether the relationship is linear, nonlinear, or no clear pattern.​

3.​ Strength: Reflected by how tightly the points cluster around a line or curve.​

4.​ Outliers: Helps visually identify unusual or influential data points that could affect
analysis.​
Q6. In what situations is using a resistant line in a scatterplot more appropriate
than a least-squares regression line?
Ans:
1.​ Presence of Outliers: Resistant lines are less affected by extreme values,
providing more robust trend estimation.​

2.​ Skewed Data: Better suited for data distributions that are not symmetrical, unlike
least-squares lines.​

3.​ Exploratory Phase: Useful during initial analysis for identifying consistent trends
without distortion by anomalies.​

4.​ Non-Normal Errors: More reliable when data residuals are not normally
distributed.​

Q7. What are some common data transformation techniques and when are they
used?
Ans:
1.​ Log Transformation: Applied to right-skewed data to stabilize variance and
normalize distribution.​

2.​ Square Root Transformation: Suitable for data where variance increases with
the mean, such as count data.​

3.​ Reciprocal Transformation: Reduces impact of large values and can linearize
hyperbolic trends.​

4.​ Box-Cox Transformation: A flexible family of power transformations to improve


normality and equal variance.​
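
As an illustration, a minimal Python sketch of these four transformations, assuming NumPy and SciPy are available and using a small made-up right-skewed sample:

import numpy as np
from scipy import stats

# Hypothetical right-skewed sample (all values must be positive for log/Box-Cox)
x = np.array([1, 2, 2, 3, 4, 5, 7, 11, 18, 40], dtype=float)

log_x = np.log(x)                # log transform: compresses large values
sqrt_x = np.sqrt(x)              # square root: common for count data
recip_x = 1.0 / x                # reciprocal: strongly shrinks large values
boxcox_x, lam = stats.boxcox(x)  # Box-Cox: estimates the best power (lambda)

print("Box-Cox lambda:", round(lam, 3))
print("Skewness before:", round(stats.skew(x), 2),
      "| after log:", round(stats.skew(log_x), 2))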

Q8. Why is it important to use expected counts when performing a Chi-square


test?
Ans:
1.​ Basis of Comparison: Expected counts serve as a benchmark under the
assumption of independence.​

2.​ Chi-square Formula: Test statistic is computed based on the difference between
observed and expected values.​
3.​ Test Validity: Ensures the test assesses deviation from randomness or
independence correctly.​

4.​ Detects Patterns: Helps identify where actual distributions differ significantly
from expected ones.​
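
A small sketch of how expected counts are derived from the marginal totals, using a made-up 2×2 table (NumPy assumed):

import numpy as np

# Hypothetical observed contingency table (rows x columns)
observed = np.array([[30, 20],
                     [15, 35]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count under independence: (row total * column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi_sq = ((observed - expected) ** 2 / expected).sum()

print("Expected counts:\n", expected)
print("Chi-square statistic:", round(chi_sq, 3))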

Q9. What are the consequences of violating ANOVA assumptions and how can
they be addressed?
Ans:
1.​ Violation of Normality: Makes the F-test unreliable; can be corrected using data
transformation or non-parametric alternatives.​

2.​ Inequality of Variances: Leads to incorrect conclusions; Levene’s test can


detect this, and Welch’s ANOVA adjusts for it.​

3.​ Non-Independence: Inflates Type I error rates; resolved using repeated


measures ANOVA or careful experimental design.​

4.​ Multiple Comparisons Problem: Increased risk of false positives; corrected


using Bonferroni or Tukey adjustments.​

Q10. In what scenarios would a paired t-test be more appropriate than an


independent t-test?
Ans:
1.​ Repeated Measures: Suitable when the same individuals are tested under two
conditions (e.g., before and after treatment).​

2.​ Matched Samples: Applied when subjects are paired based on shared
characteristics to minimize variability.​

3.​ Controls Variability: Within-subject comparison reduces the effect of


inter-subject variability, improving test power.​

4.​ Dependent Observations: Recognizes the natural pairing in data, unlike


independent t-tests which assume unrelated samples.​
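
A minimal SciPy sketch contrasting the two tests on hypothetical before/after scores for the same five subjects:

from scipy import stats

# Hypothetical before/after scores for the same five subjects
before = [72, 80, 65, 90, 75]
after  = [78, 85, 70, 92, 80]

# Paired t-test: works on the within-subject differences
t_paired, p_paired = stats.ttest_rel(before, after)

# Independent t-test (shown only for contrast; it ignores the pairing)
t_indep, p_indep = stats.ttest_ind(before, after)

print(f"Paired:      t = {t_paired:.2f}, p = {p_paired:.4f}")
print(f"Independent: t = {t_indep:.2f}, p = {p_indep:.4f}")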

Q11. How does ANOVA partition the total variation in the dataset, and why is this
important?
Ans:
1.​ Between-Group Variation: Measures differences among group means relative
to the overall mean.​

2.​ Within-Group Variation: Measures variation within individual groups due to


random or unexplained causes.​

3.​ Total Variation = Explained + Unexplained: Helps assess the quality of


group-based models in explaining variability.​

4.​ F-ratio Calculation: Compares explained to unexplained variation to determine


statistical significance.​
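
A short sketch of this partition on three small made-up groups, with scipy.stats.f_oneway shown as a cross-check:

import numpy as np
from scipy import stats

# Three hypothetical groups
groups = [np.array([5.0, 7.0, 6.0]),
          np.array([9.0, 10.0, 11.0]),
          np.array([14.0, 13.0, 15.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group (explained) and within-group (unexplained) sums of squares
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_ratio = (ssb / df_between) / (ssw / df_within)

print("SSB =", round(ssb, 2), "| SSW =", round(ssw, 2), "| F =", round(f_ratio, 2))
print("Cross-check:", stats.f_oneway(*groups))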

Q12. What is the purpose of a chi-square test in bivariate analysis, and how is it
applied to categorical data?
Answer:
1.​ Purpose: The chi-square test is used to assess whether two categorical
variables are independent or associated with each other.​

2.​ Expected vs. Observed: It compares the observed frequencies with the
expected frequencies if the variables were independent.​

3.​ Application: Applied to categorical data, the chi-square test determines whether
there is a statistically significant relationship between the two variables.​

4.​ Example: A chi-square test could be applied to examine whether there is a


relationship between smoking status (smoker, non-smoker) and the presence of
lung disease (yes, no) in a population.​

Q13. What is ANOVA, and how does it help in comparing the means of three or
more groups?
Answer:
1.​ Definition: ANOVA (Analysis of Variance) is a statistical method used to
compare the means of three or more groups.​

2.​ Between vs. Within Group Variability: It assesses whether the variance
between group means is greater than the variance within the groups, indicating a
significant difference.​
3.​ Hypothesis Testing: The null hypothesis assumes no difference between group
means, and the alternative hypothesis assumes at least one group mean differs.​

4.​ Example: ANOVA could be used to compare the average test scores of students
from different teaching methods (traditional, online, hybrid) to see if teaching
method affects performance.​

Q14. How is hypothesis testing used in bivariate analysis?


Answer:
1.​ Null Hypothesis: Hypothesis testing starts with a null hypothesis, which typically
assumes no relationship between the two variables.​

2.​ P-Value: The test computes a p-value to assess the strength of evidence against
the null hypothesis. A small p-value indicates that the relationship is statistically
significant.​

3.​ Test Selection: Common tests like the chi-square test for independence or the
t-test for comparing means are used depending on the data type.​

4.​ Decision Making: Hypothesis testing helps to decide whether the observed
relationship is due to chance or if there is a genuine association between the
variables.​

Q15. What factors influence the result of a t-test?
Ans:
The result of a t-test is influenced by:
1.​ Sample Size: Larger samples reduce standard error, increasing the t-value and
making it easier to find significant differences.​

2.​ Mean Difference: A larger difference between group means increases the
t-value.​

3.​ Variability: Higher data variability increases standard error and lowers the
t-value.​

4.​ Significance Level (α): A lower α (e.g., 0.01) makes it harder to reject the null
hypothesis.​

These factors affect the t-value and p-value, influencing whether the null hypothesis is
accepted or rejected.
Q16. What are the steps in the Chi-Square test?
Ans:
1.​ State the null hypothesis (the variables are independent) and the alternative hypothesis (the variables are associated).
2.​ Organize the observed frequencies in a contingency table and compute each cell's expected frequency as (row total × column total) / grand total.
3.​ Compute the Chi-square statistic as the sum over all cells of (observed − expected)² / expected.
4.​ Determine the degrees of freedom, (rows − 1) × (columns − 1), compare the statistic with the critical value (or the p-value with α), and interpret the result.
Q17. 4 ANOVA Methods (with Uses)
1.​ One-Way ANOVA:​

○​ Compares means of three or more groups based on one independent


variable.​

2.​ Two-Way ANOVA:​

○​ Analyzes the effect of two independent variables and their interaction on


the dependent variable.​

3.​ Repeated Measures ANOVA:​


○​ Used when the same subjects are measured multiple times (within-subject
design).​

4.​ MANOVA (Multivariate ANOVA):​

○​ Compares multiple dependent variables simultaneously across groups.​

These methods help identify whether observed differences in group means are
statistically significant.
UNIT 5
Q1. How does introducing a third variable help in analyzing relationships
between two variables?
1.​ Controls Confounding Effects: A third variable can control for confounding
factors that might distort the relationship between the primary variables.​

2.​ Reveals Interactions: It can reveal hidden interactions where the relationship
between two variables changes depending on the third variable.​

3.​ Improves Understanding: By introducing a third variable, analysts can refine


their understanding of the data by distinguishing between correlation and
causation.​

4.​ Example: Introducing age as a third variable when studying the relationship
between physical activity and health outcomes can uncover whether the impact
of physical activity on health differs by age group.​

Q2. What are the possible roles a third variable can play in data analysis?
Ans:
1.​ Confounder: Influences both the independent and dependent variables, possibly
creating a spurious association.​

2.​ Moderator: Alters the strength or direction of the relationship between two
variables (interaction effect).​

3.​ Mediator: Explains the mechanism through which the independent variable
affects the dependent variable.​

4.​ Control Variable: Included in analysis to isolate the relationship between the
variables of primary interest.​
Q3. What conditions must be satisfied to establish a causal relationship between
two variables?
Ans:
1.​ Temporal Precedence: The cause must occur before the effect in time.​

2.​ Association: There must be a statistically significant relationship between the


variables.​

3.​ Non-Spuriousness: The relationship must not be due to a third (confounding)


variable.​

4.​ Theoretical Rationale: There should be a logical mechanism or theory


supporting the cause-effect relationship.​

Q4. How does correlation differ from causation in data analysis?


Ans:
1.​ Correlation Measures Association: It shows how two variables move together
but not why.​

2.​ Causation Requires Mechanism: Causal claims need proof that one variable
directly affects the other.​

3.​ Spurious Correlation Risk: Correlation may arise due to a lurking third variable
or coincidence.​

4.​ Statistical Tools Differ: Regression, experiments, or path analysis are needed
for causal inference, unlike simple correlation.​

Q5. What are the advantages of using three-variable contingency tables in


categorical data analysis?
Ans:
1.​ Reveals Complex Relationships: Helps assess interactions between two
variables across different levels of a third.​

2.​ Controls for Confounding: Shows whether an association holds after adjusting
for the third variable.​

3.​ Tests for Conditional Independence: Evaluates whether two variables are
independent within each subgroup.​
4.​ Enhances Data Insight: Provides a deeper understanding of subgroup trends
and relationships.​

Q6. What are the primary goals of time series analysis?


Ans:
1.​ Understand Patterns: Identify trends, seasonality, and irregularities in
time-based data.​

2.​ Forecast Future Values: Use past data to predict future observations accurately.​

3.​ Detect Structural Changes: Spot shifts in data behavior due to external or
internal events.​

4.​ Improve Decision Making: Enable planning, resource allocation, and policy
formation based on data-driven trends.​

Q7. How does time series data differ from cross-sectional data?
Ans:
1.​ Ordered Observations: Time series has a natural chronological order, unlike
cross-sectional data.​

2.​ Temporal Dependence: Time series data is collected over time, with each
observation corresponding to a specific time point, while cross-sectional data is
collected at a single point in time.​

3.​ Dynamic Behavior: Captures change over time, while cross-sectional data
represents a single time snapshot.​

4.​ Requires Specialized Methods: Techniques like autocorrelation, decomposition,


and ARIMA are specific to time series.
5.​ Dependence: In time series, past values influence future values, whereas in
cross-sectional data, observations are independent of each other.​

6.​ Purpose: Time series is used for forecasting and understanding trends over
time, while cross-sectional data is useful for comparing different subjects or
entities at one point.​

7.​ Example: Stock prices over a month represent time series data, while a survey
of household income levels at one time point represents cross-sectional data.​
Q8. What are the key components of time series data?
Ans:
1.​ Trend: Long-term upward or downward direction in the data.​

2.​ Seasonality: Regular, periodic fluctuations occurring within a year (e.g., monthly
sales).​

3.​ Cyclic Variation: Irregular fluctuations over longer periods due to business or
economic cycles.​

4.​ Random/Irregular Component: Unpredictable noise or variation not explained


by the other components.​

Q9. Why is stationarity important in time series analysis?


Ans:
1.​ Predictive Stability: Stationary series have consistent mean and variance,
essential for accurate forecasting.​

2.​ Simplifies Modeling: Many models (like ARIMA) assume stationarity for validity.​

3.​ Avoids Misleading Trends: Non-stationary data may show spurious


relationships or false trends.​

4.​ Transforms to Stationarity: Differencing or detrending techniques are used to


convert non-stationary series to stationary.​
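
A hedged sketch of checking and enforcing stationarity, assuming pandas and statsmodels are available and using a made-up random-walk series:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Hypothetical non-stationary series: a random walk (cumulative sum of noise)
idx = pd.date_range("2023-01-01", periods=200, freq="D")
y = pd.Series(np.random.default_rng(0).normal(0, 1, 200).cumsum(), index=idx)

p_raw = adfuller(y)[1]                    # ADF test on the raw series
p_diff = adfuller(y.diff().dropna())[1]   # ADF test after first differencing

# Interpretation: p > 0.05 -> fail to reject the unit root (non-stationary);
# p < 0.05 -> the series can be treated as stationary.
print(f"ADF p-value (raw):         {p_raw:.3f}")
print(f"ADF p-value (differenced): {p_diff:.3f}")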

Q10. What is the importance of time-based indexing in time series datasets?


Ans:
1.​ Efficient Access: Enables fast slicing, filtering, and querying by time intervals.​

2.​ Aligns Time-Based Operations: Supports resampling, rolling windows, and


shifting aligned with time steps.​

3.​ Supports Missing Value Detection: Helps identify time gaps or missing records
in the series.​

4.​ Pandas & Datetime Integration: Time-based indexing is key to leveraging


date-aware operations in Python (e.g., DatetimeIndex).​

Q11. What is the purpose of resampling in time series analysis?


Ans:
1.​ Change Frequency: Converts data to different time intervals (e.g., from daily to
monthly).​

2.​ Smooth Irregularities: Reduces short-term volatility and highlights long-term


trends.​

3.​ Fill Gaps: Upsampling can fill missing values using interpolation or
forward/backward fill.​

4.​ Supports Rolling Analysis: Enables moving averages or rolling statistics over
resampled data.​
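
A minimal pandas sketch of these operations on a hypothetical daily series:

import numpy as np
import pandas as pd

# Hypothetical daily series for one quarter
idx = pd.date_range("2024-01-01", "2024-03-31", freq="D")
daily = pd.Series(np.random.default_rng(0).normal(100, 10, len(idx)), index=idx)

monthly_mean = daily.resample("M").mean()      # downsample: daily -> monthly average
weekly_total = daily.resample("W").sum()       # downsample: daily -> weekly total
upsampled    = daily.resample("6H").ffill()    # upsample: forward-fill finer intervals
rolling_7d   = daily.rolling(window=7).mean()  # 7-day moving average

print(monthly_mean)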

Q12. Given the variables "age group," "exercise frequency," and "sleep quality,"
construct a three-variable contingency table and explain how it can be used to analyze
the relationship between these variables.
Answer:
1.​ Define Categories: First, categorize the variables:​

○​ Age group: Young (18-30), Middle-aged (31-50), Older (51+)​

○​ Exercise frequency: Low (0-1 times/week), Moderate (2-3 times/week),


High (4+ times/week)​

○​ Sleep quality: Poor, Average, Good​

2.​ Construct the Table: Create a 3D table, with age group and exercise frequency
as row and column variables, and sleep quality as another dimension (cells will
represent the frequency of individuals in each combination). For example:​

Age Group / Exercise Frequency    Low (0-1 times/week)            Moderate (2-3 times/week)       High (4+ times/week)
Young (18-30)                     30 Poor, 20 Average, 10 Good    15 Poor, 30 Average, 25 Good    5 Poor, 10 Average, 35 Good
Middle-aged (31-50)               25 Poor, 15 Average, 5 Good     10 Poor, 20 Average, 15 Good    5 Poor, 10 Average, 20 Good
Older (51+)                       20 Poor, 10 Average, 5 Good     5 Poor, 15 Average, 10 Good     5 Poor, 5 Average, 15 Good
3.​ Interpretation: The table can be used to analyze the relationship between the
three variables. For example:​

○​ Exercise Frequency and Sleep Quality: Among the young group, those
who exercise more frequently (High) have better sleep quality (35 Good
sleep). This indicates a positive relationship between exercise frequency
and sleep quality.​

○​ Age and Sleep Quality: Older individuals tend to have poorer sleep
quality, with 20 individuals in the "Poor" sleep category. This suggests that
sleep quality may decline with age.​

4.​ Analyze Interactions: The three-variable contingency table helps identify


interactions, such as:​

○​ The combination of low exercise frequency and poor sleep quality seems
most prevalent in the younger age group.​

○​ Middle-aged individuals with moderate exercise frequency generally report


better sleep quality, suggesting that a balanced exercise routine may
positively affect sleep quality.​

By constructing and analyzing this table, we gain insights into how exercise, age, and
sleep quality interact within the population, guiding decisions related to health and
lifestyle.

Q13. Outline the causal explanation in the context of exploratory data analysis
(EDA).
Answer:
1.​ Definition: Causal explanation in EDA aims to identify the underlying
cause-and-effect relationships between variables.​
2.​ Visual Exploration: Causal relationships can be explored by plotting data and
looking for patterns that suggest a directional influence (e.g., A causes B).​

3.​ Statistical Models: Regression models or path analysis can be applied to


support causal claims, providing evidence of the relationship's strength and
direction.​

4.​ Example: In EDA, analyzing the impact of hours of study on test scores can help
establish a causal link between increased study time and higher test
performance.​

Q14. How does bivariate analysis differ from multivariate analysis?


Answer:
1.​ Number of Variables: Bivariate analysis deals with the relationship between two
variables, while multivariate analysis involves more than two variables.​

2.​ Complexity: Multivariate analysis allows for the exploration of more complex
relationships, including interactions between multiple variables.​

3.​ Modeling: Multivariate techniques, such as multiple regression, can provide


more comprehensive insights compared to bivariate methods.​

4.​ Example: Bivariate analysis might explore the relationship between age and
income, while multivariate analysis could examine the combined effect of age,
education, and occupation on income.​

Q15. How do you interpret autocorrelation in time series analysis?


Answer:
1.​ Definition: Autocorrelation measures the correlation of a time series with its own
lagged values.​

2.​ Pattern Recognition: High autocorrelation suggests that past values


significantly influence future values, indicating a predictable pattern.​

3.​ Modeling: It helps in building time series models, such as ARIMA, which use
past values to predict future values.​

4.​ Example: A high autocorrelation at a lag of 12 months in monthly sales data


might suggest seasonal effects influencing sales trends.​
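
A small pandas sketch, using a made-up monthly series with a 12-month seasonal cycle:

import numpy as np
import pandas as pd

# Hypothetical monthly sales with an annual (12-month) seasonal cycle plus noise
idx = pd.date_range("2020-01-31", periods=48, freq="M")
rng = np.random.default_rng(1)
sales = pd.Series(100 + 10 * np.sin(2 * np.pi * np.arange(48) / 12)
                  + rng.normal(0, 1, 48), index=idx)

print("Lag-1  autocorrelation:", round(sales.autocorr(lag=1), 2))
print("Lag-12 autocorrelation:", round(sales.autocorr(lag=12), 2))
# A high value at lag 12 is consistent with an annual seasonal effect.
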
Q16. What are outliers in time series data, and how can they be identified?
Answer:
1.​ Definition: Outliers are data points that significantly differ from other
observations, indicating unusual events or errors.​

2.​ Identification: Outliers can be detected using visual methods (like boxplots or
scatter plots) or statistical tests (like Z-scores or IQR).​

3.​ Impact: They can distort trends, seasonality, and forecasts if not handled
properly.​

4.​ Handling: Outliers can be removed, replaced, or transformed to minimize their


impact on analysis.​
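
A brief sketch of both detection rules on a made-up series containing one spike (the thresholds are conventions, not fixed rules):

import pandas as pd

# Hypothetical daily values with one obvious spike (45)
s = pd.Series([10, 11, 9, 10, 12, 11, 10, 45, 11, 10])

# Z-score rule: flag points far from the mean (2.5-3 standard deviations is typical)
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 2.5]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print("Z-score outliers:\n", z_outliers)
print("IQR outliers:\n", iqr_outliers)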

Q17. What role does data aggregation play in time series analysis?
Answer:
1.​ Simplification: Aggregating data over larger time periods (e.g., from daily to
weekly) helps reduce noise and emphasizes long-term trends.​

2.​ Comparison: It allows for comparing trends across different time periods or
categories.​

3.​ Improved Forecasting: Aggregated data provides a clearer signal, improving


model performance for forecasting.​

4.​ Example: Aggregating daily stock prices into weekly averages smoothens out
short-term volatility, aiding in trend analysis.

Q18. What are the Key assumptions for a valid ARIMA model?
Key assumptions for a valid ARIMA model are:
1.​ Stationarity:​

○​ The time series should have a constant mean, variance, and


autocovariance over time.​

○​ Differencing may be applied to achieve stationarity.​

2.​ Linearity and Normality of Errors:​


○​ The relationship is linear, and the residuals (errors) should be normally
distributed, uncorrelated, and have constant variance
(homoscedasticity).​

These ensure the model's predictions and inferences are reliable.

PART B

UNIT 4

1.​ Percentage tables

1. Explain how percentage tables are constructed and interpreted in bivariate


analysis. Illustrate with a relevant example.
1.​ Definition of Percentage Tables:​
A percentage table expresses the relationship between two categorical variables
by converting frequency counts into percentages. This helps in identifying
patterns and making comparisons easier across groups.​

2.​ Types of Percentage Tables:​

○​ Row Percentage Table: Shows the percentage distribution of a variable


across the rows.​

○​ Column Percentage Table: Shows percentage distribution across the


columns.​

○​ Total Percentage Table: Uses the overall total as the denominator for all
cells.​

3.​ Construction Steps:​

○​ Step 1: Create a frequency contingency table.​

○​ Step 2: Choose a basis for percentage calculation (row-wise,


column-wise, or total).​

○​ Step 3: Divide each cell count by the row total, column total, or overall
total.​
○​ Step 4: Multiply by 100 to convert to a percentage.​

4.​ Example:​
Suppose we have a dataset of 100 students categorized by gender and whether they
prefer online or offline learning:​

Gender    Online    Offline    Total
Male      30        20         50
Female    40        10         50
Total     70        30         100

5.​ Row Percentage Table:

Gender    Online (%)    Offline (%)
Male      60%           40%
Female    80%           20%

6.​ Interpretation:

○​ Among males, 60% prefer online learning.​

○​ Among females, 80% prefer online learning.​


This suggests that female students have a higher preference for online
learning.​
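
The same tables can be produced programmatically; a minimal pandas sketch, assuming the 100 responses above are available as raw records:

import pandas as pd

# Hypothetical raw records matching the 100-student example above
df = pd.DataFrame({
    "Gender": ["Male"] * 50 + ["Female"] * 50,
    "Mode":   ["Online"] * 30 + ["Offline"] * 20 + ["Online"] * 40 + ["Offline"] * 10,
})

counts  = pd.crosstab(df["Gender"], df["Mode"], margins=True)               # frequencies
row_pct = pd.crosstab(df["Gender"], df["Mode"], normalize="index") * 100    # row %
col_pct = pd.crosstab(df["Gender"], df["Mode"], normalize="columns") * 100  # column %

print(counts, row_pct.round(1), col_pct.round(1), sep="\n\n")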

Example:

Question:

A company conducts a survey to understand the relationship between income levels


and preferred online shopping platforms. The data collected is as follows:
Respondent Income Preferred
ID Group Platform

201 Low Amazon

202 Medium Flipkart

203 Low Amazon

204 High Myntra

205 Medium Amazon

206 Low Flipkart

207 High Flipkart

208 Medium Myntra

209 Low Amazon

210 High Amazon


How can row and column percentage tables help analyze the relationship
between income groups and platform preferences? Provide calculations and
interpretation.

Solution:
Step 1: Create the Frequency Table
Income Group    Amazon    Flipkart    Myntra    Total
Low             3         1           0         4
Medium          1         1           1         3
High            1         1           1         3
Total           5         3           2         10

Step 2: Row Percentage Table


(Each row total = 100%)
Income Group    Amazon (%)    Flipkart (%)    Myntra (%)
Low             75%           25%             0%
Medium          33.3%         33.3%           33.3%
High            33.3%         33.3%           33.3%


Interpretation (Row %):
●​ 75% of Low-income customers prefer Amazon.​

●​ Medium and High-income groups are evenly split across platforms (Amazon,
Flipkart, Myntra).​

●​ This helps us understand platform preference within each income group.​

Step 3: Column Percentage Table


(Each column total = 100%)
Income Group    Amazon (%)    Flipkart (%)    Myntra (%)
Low             60%           33.3%           0%
Medium          20%           33.3%           50%
High            20%           33.3%           50%


Interpretation (Column %):
●​ 60% of Amazon users are from the Low-income group.​

●​ Flipkart users are evenly split across all income groups.​

●​ Myntra is preferred only by Medium and High-income customers.​

Conclusion:
●​ Row percentages show how each income group spreads its preference across
platforms.​

●​ Column percentages show the composition of each platform's user base by


income group.​
●​ Together, these tables provide a comprehensive view of how income level and
shopping platform preference are related.​

2. Discuss the role of percentage tables in understanding marginal and


conditional distributions. Use a structured example.
Answer:
1.​ Marginal Distribution:​

○​ Refers to the distribution of a single variable in a two-variable table by


summing across the rows or columns.​

○​ Helps understand the overall distribution of one variable regardless of the


other.​

2.​ Conditional Distribution:​

○​ The distribution of one variable based on the condition of another variable.​

○​ Calculated using row or column percentages to analyze dependence.​

3.​ Example Table:​

Age Group    Yes (Vaccinated)    No    Total
<30          20                  30    50
30+          60                  40    100
Total        80                  70    150
4.​ ​
Marginal Percentages:​

○​ Vaccinated: (80 / 150) * 100 = 53.3%​

○​ Not Vaccinated: (70 / 150) * 100 = 46.7%​

5.​ Conditional Percentages (Row-wise):​


Age Group    Yes (%)    No (%)
<30          40%        60%
30+          60%        40%


6.​ ​
Insights:​

○​ Conditional distribution shows that older individuals (30+) have a higher


vaccination rate (60%).​

○​ Marginal distribution gives an overall sense of vaccine adoption (53.3%


vaccinated).​

3. Compare and contrast row-wise and column-wise percentage tables. Provide


advantages and a real-world scenario where each is preferred.
Answer:
1.​ Row-wise Percentage Table:​

○​ Each row total is treated as 100%.​

○​ Shows the proportion of column variables within each row category.​

2.​ Column-wise Percentage Table:​

○​ Each column total is treated as 100%.​

○​ Shows the distribution of row variables within each column.​

3.​ Example: Students categorized by department (CS, IT) and whether they attend
workshops (Yes/No):​

Dept Yes No Total

CS 60 40 100

IT 40 60 100
4.​ ​
Row-wise:​

○​ CS: 60% Yes, 40% No​

○​ IT: 40% Yes, 60% No​

5.​ Column-wise (Yes total = 100; No total = 100):​

○​ Yes: 60% CS, 40% IT​

○​ No: 40% CS, 60% IT​

6.​ Application Scenarios:​

○​ Use row-wise when you want to analyze how column categories vary
across each row group (e.g., preferences of departments).​

○​ Use column-wise when analyzing how row categories are distributed


within each column (e.g., which departments contribute more to a
specific behavior).​

2. Analyzing Contingency Tables -Handling Several Batches

A company produces smartphones in three different factories (Factory A, Factory B,


Factory C). Each factory produces phones in four production batches. The company
wants to analyze the relationship between Factory and Defect Status (Defective or Non-
defective) across all batches. The summarized data collected is shown below:
Factory    Batch    Defective Units    Non-defective Units

A 1 5 95

A 2 7 93

A 3 6 94

A 4 4 96
B 1 12 88

B 2 15 85

B 3 14 86

B 4 10 90

C 1 3 97

C 2 2 98

C 3 5 95

C 4 4 96

i) Combine the batch data for each factory to create a contingency table showing total
defective and non-defective units by factory.
ii) Construct a percentage table from the contingency table, showing the percentage of
defective and non-defective units for each factory out of the total units produced by that
factory.
iii) Analyze the tables to identify:
Which factory has the highest defect rate?
Whether the defect status is associated with the factory.
How handling several batches improves the reliability of the analysis?


i) Combine Batch Data for Each Factory


Calculate total defective and non-defective units for each factory by summing across
batches:
Factory    Total Defective Units    Total Non-defective Units    Total Units Produced
A          5+7+6+4 = 22             95+93+94+96 = 378            400
B          12+15+14+10 = 51         88+85+86+90 = 349            400
C          3+2+5+4 = 14             97+98+95+96 = 386            400

ii) Percentage Table (Within Each Factory)


Calculate percentages of defective and non-defective units out of total units produced
by that factory:
Factory    Defective (%)              Non-defective (%)
A          (22/400) × 100 = 5.5%      (378/400) × 100 = 94.5%
B          (51/400) × 100 = 12.75%    (349/400) × 100 = 87.25%
C          (14/400) × 100 = 3.5%      (386/400) × 100 = 96.5%

iii) Analysis
1. Which factory has the highest defect rate?
●​ Factory B has the highest defect rate at 12.75%.​
●​ Factory A has 5.5%, and Factory C has 3.5%.​

●​ This shows Factory B produces the most defective units proportionally.​

2. Is defect status associated with the factory?


●​ Since defect rates vary considerably between factories (Factory B > A > C), this
suggests a relationship between factory and defect status.​

●​ To confirm this statistically, a Chi-square test of independence could be applied


to the combined contingency table to test if defect rates depend on the factory.​

3. How does handling several batches improve the reliability of the analysis?
●​ Combining data from several batches increases the sample size, reducing
random variation.​

●​ It provides a more stable and representative estimate of defect rates per


factory.​

●​ This reduces the impact of any unusual batch, leading to more accurate and
reliable conclusions about factory performance.​

Question:​
A researcher collects data on the number of hours students study per week and their
corresponding test scores. The scatterplot shows a positive linear trend, but one student
studied an unusually large number of hours yet scored lower than expected.
a) Explain how this outlier might affect the least squares regression line.​
b) Describe how a resistant line (like the median-median line) would handle this outlier
differently.

3. Scatterplots and Resistant Lines:

a) Effect of the outlier on the least squares regression line:


The least squares regression line minimizes the sum of squared vertical distances from
all points. Because it uses squared differences, it is sensitive to extreme values or
outliers. The outlier with unusually many study hours but a low score will pull the
regression line towards itself, potentially distorting the slope and intercept. This can
lead to a misleading model that does not accurately represent the overall trend of the
majority of the data.
b) How a resistant line (e.g., median-median line) handles the outlier differently:
Resistant lines, like the median-median line, rely on medians rather than means and
use less sensitive measures to fit the data. Since medians are not heavily influenced by
extreme values, the resistant line is less affected by the outlier. This means the line
better represents the central trend of the bulk of the data, providing a more robust fit
when outliers are present.
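
A small numerical illustration of the difference, using made-up data with one outlier; the median-median line itself is not available in SciPy, so the Theil-Sen estimator (another median-based, resistant fit) is used as a stand-in:

import numpy as np
from scipy import stats

# Hypothetical study-hours vs score data; the last point is the outlier (30 h, score 55)
hours  = np.array([2, 4, 5, 6, 8, 9, 10, 12, 30])
scores = np.array([50, 58, 62, 65, 70, 74, 78, 84, 55])

ls_slope, ls_intercept = np.polyfit(hours, scores, deg=1)           # least squares fit
res_slope, res_intercept, _, _ = stats.theilslopes(scores, hours)   # resistant fit

# The least-squares slope is pulled down by the outlier; the resistant slope is not.
print(f"Least squares slope: {ls_slope:.2f}")
print(f"Resistant slope:     {res_slope:.2f}")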

4. T-test

Reference: https://chatgpt.com/share/6826ccf0-7cbc-8001-9a3c-cb6768658c1c

Problem:

1.​ A company tests two different advertising campaigns, Campaign A and


Campaign B, to see which one results in higher sales. The sales numbers (in
thousands) from random samples are:
●​ Campaign A: 150, 160, 155​

●​ Campaign B: 140, 145, 138​

How would you use an independent samples t-test to check if the difference in average
sales between the two campaigns is statistically significant?

1.​ Formulate hypotheses:​

○​ H0: No difference in mean sales between Campaign A and B.​

○​ H1: Mean sales differ between Campaign A and B.​

2.​ Calculate the means and variances for both samples.​

3.​ Compute the t-statistic using the independent samples t-test formula.​

4.​ Determine degrees of freedom and find the critical t-value or p-value.​

5.​ If the calculated t exceeds the critical t or the p-value < significance level (e.g., 0.05),
reject H0, concluding a significant difference.
Detailed solution:
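
The step-by-step calculation is in the linked reference; a minimal SciPy sketch for the same data (pooled-variance test assumed) would be:

from scipy import stats

campaign_a = [150, 160, 155]
campaign_b = [140, 145, 138]

# Independent samples t-test, equal variances assumed (pooled formula)
t_stat, p_value = stats.ttest_ind(campaign_a, campaign_b)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05, reject H0 and conclude mean sales differ between the campaigns.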


2. A school wants to compare test scores of students taught using two different teaching
methods: Method 1 and Method 2. The test scores are:
●​ Method 1: 85, 88, 90​

●​ Method 2: 80, 78, 82​


Explain how to apply an independent samples t-test to determine if the teaching
methods result in significantly different scores.
1.​ State hypotheses:​

○​ H0: No difference in mean test scores between Method 1 and Method 2.​

○​ H1: There is a difference in means.​

2.​ Calculate sample means and standard deviations.​

3.​ Use the independent samples t-test formula to calculate t-statistic.​

4.​ Find degrees of freedom and corresponding critical t-value or p-value.​

5.​ Compare t-statistic with critical value or p-value with α (e.g., 0.05) to accept or
reject H0.​
5. Chi-square test of independence

Reference: https://chatgpt.com/share/6826ccf0-7cbc-8001-9a3c-cb6768658c1c

Problem:

A supermarket wants to find out if customer age group is related to their preferred
payment method (Cash or Card). They survey 100 customers and record the following
data:
Age Group    Cash    Card    Total
18-30        20      30      50
31-50        15      25      40
51+          5       5       10
Total        40      60      100
Is there a significant association between Age Group and Preferred Payment Method?

Solution:
Summary

Step    Action
1       State null and alternative hypotheses
2       Organize observed frequencies in a table
3       Calculate expected frequencies
4       Compute chi-square statistic
5       Determine degrees of freedom
6       Find critical value from chi-square table
7       Compare calculated value with critical value
8       Make decision and interpret result
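
These steps can be checked with SciPy; a minimal sketch for the observed table above:

import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = age groups (18-30, 31-50, 51+), columns = (Cash, Card)
observed = np.array([[20, 30],
                     [15, 25],
                     [ 5,  5]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print("Expected counts:\n", expected.round(2))
print(f"Chi-square = {chi2:.3f}, df = {dof}, p = {p_value:.3f}")
# p > 0.05 would mean we fail to reject independence of age group and payment method.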

Practice questions

Question 1:​
A hospital wants to check if the recovery rate of patients is independent of the
treatment type (Treatment A, Treatment B). They record the following data:
Treatment Type    Recovered    Not Recovered    Total
Treatment A       45           15               60
Treatment B       30           30               60
Is there a significant association between treatment type and recovery status?

Question 2:​
A college surveys students to find out if gender is related to their choice of major. The
data collected is:
Gender    Science Majors    Arts Majors    Total
Male      40                20             60
Female    30                30             60
Use a Chi-square test to determine if there is an association between gender and major
choice.

6. ANOVA

What is ANOVA?

ANOVA stands for Analysis of Variance. It is a statistical method used to compare


means of three or more groups to determine if there are any statistically
significant differences among them.

Why Use ANOVA?

●​ When comparing two groups, a t-test is appropriate.​

●​ When comparing three or more groups, using multiple t-tests increases the
risk of Type I error (false positives).​

●​ ANOVA allows simultaneous comparison of all group means while


controlling the overall error rate.​

Key Concept: Variance

ANOVA works by analyzing variances:

●​ Between-group variance: Variability among the group means, i.e., how far each
group mean lies from the overall mean.​

●​ Within-group variance: Variability within each group (differences among


individuals in the same group).​

If between-group variance is significantly larger than within-group variance, it


suggests that the group means are different.
Types of ANOVA

1.​ One-Way ANOVA​

○​ Compares means across one independent variable (factor) with three


or more groups.​

○​ Example: Comparing average test scores across three teaching


methods.​

2.​ Two-Way ANOVA​

○​ Compares means across two independent variables (factors) and


can analyze interaction effects.​

○​ Example: Comparing test scores based on teaching method and


gender.​

3.​ Repeated Measures ANOVA​

○​ Used when the same subjects are measured multiple times under
different conditions or over time.​

○​ Example: Measuring blood pressure at different time points after


medication.​

4.​ MANOVA (Multivariate ANOVA)​

○​ Extends ANOVA when there are multiple dependent variables.​

○​ Example: Comparing groups on both test scores and attendance


rates simultaneously.​

How Does ANOVA Work?

1.​ Formulate Hypotheses​

●​ Null hypothesis (H₀): All group means are equal.​


●​ Alternative hypothesis (H₁): At least one group mean is different.​

2.​ Calculate Variances​

●​ Compute Sum of Squares Between Groups (SSB): Measures variation


between group means and overall mean.​

●​ Compute Sum of Squares Within Groups (SSW): Measures variation within


each group.​

●​ Compute Total Sum of Squares (SST) = SSB + SSW.​

Example Scenario

Suppose three diet plans are tested for their effect on weight loss. ANOVA will tell
if the average weight loss differs significantly between these plans.

Assumptions of ANOVA

1.​ Independence of observations.​

2.​ Normality: The dependent variable is normally distributed in each group.​


3.​ Homogeneity of variances: Equal variances across groups.​

Post Hoc Tests

If ANOVA shows significant differences, post hoc tests (e.g., Tukey’s HSD,
Bonferroni) are performed to identify which groups differ.

Summary
Aspect                   Description
Purpose                  Compare means of 3+ groups
Method                   Compares between-group variance to within-group variance
Output                   F-statistic and p-value
Result Interpretation    Significant F → at least one mean differs
Extensions               Two-way ANOVA, Repeated Measures, MANOVA

Problem:

A teacher wants to check if three different teaching methods affect students’


average test scores differently. She randomly assigns students into three groups
and records the test scores after one month:
●​ Method A: 78, 82, 80, 85, 79​

●​ Method B: 74, 76, 73, 77, 75​

●​ Method C: 88, 90, 85, 89, 87​

At a 0.05 significance level, test if there is a significant difference in average test


scores among the three teaching methods using One-Way ANOVA.
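
A quick SciPy sketch for this problem (the manual SSB/SSW working would follow the variance partition described in Part A):

from scipy import stats

method_a = [78, 82, 80, 85, 79]
method_b = [74, 76, 73, 77, 75]
method_c = [88, 90, 85, 89, 87]

f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)

print(f"F = {f_stat:.2f}, p = {p_value:.5f}")
# If p < 0.05, at least one teaching method's mean score differs;
# a post hoc test (e.g., Tukey's HSD) would identify which pairs differ.
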
Practice:

1.​ A fitness trainer wants to compare the effectiveness of three different


exercise programs on improving participants' endurance levels. After 6
weeks, the endurance scores (measured by the number of push-ups
completed) for 4 participants in each group are recorded:
●​ Program A: 25, 30, 28, 27​

●​ Program B: 20, 22, 19, 21​

●​ Program C: 32, 35, 33, 30​

At the 0.05 significance level, test whether there is a significant difference in


average endurance scores among the three exercise programs using One-Way
ANOVA.
UNIT 5

1.​ Three-Variable Contingency Tables

What is a Three-Way Contingency Table?


A three-way contingency table is a multi-dimensional table that displays the
frequency distribution of three categorical variables simultaneously. It extends
the idea of two-way contingency tables (which show frequencies for two
variables) by adding a third dimension.

It helps analyze the relationship and interaction between three categorical


variables.

Structure
●​ You have three categorical variables, say A, B, and C.​

●​ Each variable has different categories or levels.​

●​ The table is typically presented as a series of two-way tables for each level
of the third variable, or as a cube-shaped table.​

For example, if:

●​ Variable A has r categories,​

●​ Variable B has c categories,​

●​ Variable C has k categories,​

then the table will have r × c × k cells.

Why Use a Three-Way Table?


●​ To investigate conditional relationships — how two variables relate when
you control or stratify by a third variable.​
●​ To detect interactions between three categorical variables.​

●​ To perform more complex chi-square tests for independence or


homogeneity involving three variables.​

1. In a training center, data is collected on Trainee Gender, Weekly Practice Hours,


and Certification Status (Certified/Not Certified). The data for 10 trainees is as
follows:
Trainee Gender Weekly Practice Certification
Hours Status

T1 Male 6 Certified

T2 Female 3 Not Certified

T3 Male 5 Certified

T4 Female 4 Certified

T5 Male 2 Not Certified

T6 Female 6 Certified

T7 Male 3 Not Certified

T8 Female 5 Certified

T9 Male 4 Certified

T10 Female 2 Not Certified


Tasks:
1.​ Construct a three-variable contingency table showing counts of Gender,
grouped Practice Hours (Low: 1-3 hours, Medium: 4-6 hours), and
Certification Status.​

2.​ Discuss observable trends based on the table.​

Step-by-step Solution:
Step 1: Categorize Practice Hours
●​ Low Practice Hours (1-3 hours): Trainees with practice hours 1, 2, or 3​

●​ Medium Practice Hours (4-6 hours): Trainees with practice hours 4, 5, or 6​

Categorize each trainee:


Trainee Gender Weekly Practice Category Certification
Hours Status

T1 Male 6 Medium Certified

T2 Female 3 Low Not Certified

T3 Male 5 Medium Certified

T4 Female 4 Medium Certified

T5 Male 2 Low Not Certified

T6 Female 6 Medium Certified

T7 Male 3 Low Not Certified

T8 Female 5 Medium Certified

T9 Male 4 Medium Certified

T10 Female 2 Low Not Certified

Step 2: Construct the three-variable contingency table


Gender    Practice Hours    Certified         Not Certified     Total
Male      Low (1-3)         0                 2 (T5, T7)        2
Male      Medium (4-6)      3 (T1, T3, T9)    0                 3
Female    Low (1-3)         0                 2 (T2, T10)       2
Female    Medium (4-6)      3 (T4, T6, T8)    0                 3

Step 3: Analyze the trends


●​ All trainees who practice for Medium hours (4-6) are certified regardless of
gender.​

●​ None of the trainees who practice for Low hours (1-3) are certified,
regardless of gender.​

●​ Certification status is strongly associated with practice hours rather than


gender.​

●​ Gender does not seem to influence certification status directly since both
males and females have similar certification rates within practice hour
groups.​

Summary:
The contingency table shows a clear pattern where trainees practicing more
hours (4-6 per week) achieve certification, while those practicing fewer hours
(1-3) do not. This suggests practice hours are a key factor in certification
success, more than gender.

2. A university collects data on students’ Gender, Weekly Study Hours (Low: 1-3,
High: 4-6), and Exam Result (Pass/Fail). The data for 10 students is:
Student Gender Study Exam
Hours/Week Result
S1 Male 2 Fail

S2 Female 5 Pass

S3 Male 4 Pass

S4 Female 3 Fail

S5 Male 6 Pass

S6 Female 2 Fail

S7 Male 5 Pass

S8 Female 4 Pass

S9 Male 3 Fail

S10 Female 6 Pass


Tasks:​
a) Construct a three-variable contingency table grouping Gender, Study Hours,
and Exam Result.​
b) Identify any observable trends.

3. A call center tracks Agent Gender, Number of Calls Handled per Day (Low:
10-15, High: 16-20), and Customer Satisfaction (Satisfied/Not Satisfied) for 10
agents:
Agent Gender Calls/Day Customer
Satisfaction

A1 Male 12 Not Satisfied

A2 Female 18 Satisfied

A3 Male 16 Satisfied

A4 Female 14 Not Satisfied

A5 Male 20 Satisfied

A6 Female 10 Not Satisfied

A7 Male 15 Not Satisfied

A8 Female 19 Satisfied
A9 Male 17 Satisfied

A10 Female 13 Not Satisfied


Tasks:​
a) Create a three-variable contingency table for Gender, Calls/Day category, and
Customer Satisfaction.​
b) Analyze the relationship between calls handled and customer satisfaction.

Step 1: Categorize Calls/Day:


●​ Low (10-15): A1, A4, A6, A7, A10​

●​ High (16-20): A2, A3, A5, A8, A9​

Step 2: Construct table:


Gender    Calls/Day    Satisfied         Not Satisfied      Total
Male      Low          0                 2 (A1, A7)         2
Male      High         3 (A3, A5, A9)    0                  3
Female    Low          0                 3 (A4, A6, A10)    3
Female    High         2 (A2, A8)        0                  2


Step 3: Trends:
●​ Agents handling high calls/day are all rated Satisfied.​

●​ Low call volumes correspond with Not Satisfied ratings.​

●​ Customer satisfaction appears related to the number of calls handled,


independent of gender.​

2. Causal Explanations

A causal relationship implies that changes in one variable directly cause changes
in another. Unlike correlation, which only shows an association between
variables, causation means one variable is responsible for the effect on the other.
Key aspects to establish causality:
1.​ Temporal precedence: The cause must occur before the effect.​

2.​ Covariation: There is an observed association between variables.​

3.​ No plausible alternative explanations: Other confounding factors are ruled


out.​

Exploring causality often requires additional data, controlling for confounders,


and sometimes experimental or quasi-experimental designs.
Data is collected on daily caffeine consumption (in cups) and anxiety levels
(measured by a standardized scale) for a group of adults over two months. The
data reveals a positive correlation between caffeine intake and anxiety levels.
How would you explore whether increased caffeine consumption causes higher
anxiety? What additional variables or analyses would you consider to support or
refute a causal relationship?
Data shows a negative correlation between daily screen time (hours) and sleep
quality (hours of uninterrupted sleep). The goal is to explore whether increased
screen time causes poor sleep quality.
Step 1: Understand Correlation vs Causation​
The negative correlation means individuals with higher screen time tend to have
poorer sleep quality, but this does not prove screen time causes poor sleep.
Step 2: Consider Additional Variables (Confounders)
●​ Stress levels: High stress could reduce sleep quality and increase screen
time (e.g., watching screens to relax).​

●​ Physical activity: Less activity might worsen sleep and encourage more
screen time.​

●​ Caffeine consumption: Could affect sleep quality independently.​

●​ Sleep environment: Noise, light, or room temperature could influence sleep


quality.​

●​ Bedtime routines: Using screens just before bed might have a different
effect than during the day.​

Step 3: Explore Temporal Relationship​


Analyze data to see if increases in screen time precede decreases in sleep
quality over time. Time series or longitudinal analyses can identify if screen time
changes are followed by changes in sleep.
Step 4: Statistical Analyses
●​ Use multivariate regression including confounders to check if screen time
still predicts sleep quality after adjusting for other factors.​

●​ Perform mediation analysis if an intermediate variable (like stress) might


explain the relationship.​

●​ Conduct causal modeling (e.g., path analysis) to examine direct and


indirect effects.​

Step 5: Experimental or Quasi-experimental Design


●​ Ideally, conduct a controlled study where participants reduce screen time
to observe changes in sleep quality.​

●​ If not possible, use natural experiments or instrumental variables to mimic


causal inference.​

Summary:
To explore whether increased screen time causes poor sleep quality, one must
analyze temporal patterns, control for confounding factors, and use statistical
methods to test if the relationship holds when accounting for these variables.
Experimental designs provide the strongest causal evidence.

2. A research team collects data on the amount of time students spend on social
media per day (in hours) and their academic performance (measured by GPA).
The analysis reveals a negative correlation between social media usage and GPA.​
How would you determine whether increased social media usage causes lower
academic performance? What other variables or methods would you consider in
your exploratory data analysis to support or challenge this causal link?
Step 1: Understand the Correlation
The negative correlation indicates that students with higher social media usage
tend to have lower GPAs. However, correlation alone does not imply causation.
It’s possible that other factors influence both variables.

Step 2: Investigate Temporal Order


To claim that social media usage causes lower academic performance:
●​ Social media usage must occur before the decline in GPA.​
●​ Collect or analyze longitudinal data over a semester or year to confirm the
direction of the relationship.​

Step 3: Identify and Control Confounding Variables


Several variables might influence both social media usage and GPA, such as:
●​ Time management skills​

●​ Study hours per week​

●​ Mental health status​

●​ Sleep duration​

●​ Parental supervision​

●​ School environment or peer influence​

By including these variables in your analysis, you can determine whether the
relationship holds even after adjusting for confounders.
Step 4: Use Statistical Techniques
●​ Multiple Linear Regression: Include social media usage and confounders to
see if social media remains a significant predictor of GPA.​

●​ Path Analysis: Check for indirect effects (e.g., social media affects sleep,
which affects GPA).​

●​ Propensity Score Matching: Match students with similar backgrounds and


habits but different social media usage levels to compare their GPAs.​

Step 5: Consider an Experimental Approach


Though challenging in practice, you could design a:
●​ Controlled experiment: Limit social media use for one group of students
and compare GPA outcomes.​

●​ Natural experiment: Use situations where social media access is reduced


(e.g., during school policy changes) to observe effects.​

Conclusion:
To explore if social media usage causes lower academic performance:
●​ Establish time order (usage precedes performance drop),​

●​ Control for confounders,​

●​ Use statistical models to isolate effects,​

●​ And consider experimental evidence.​

This multifaceted approach increases the reliability of causal inferences in


exploratory data analysis.

3. Fundamentals of Time Series Analysis (TSA) and Characteristics of Time Series


Data
Time Series Analysis (TSA) involves analyzing data points collected or recorded
at regular time intervals. It helps identify patterns such as trends, seasonality, and
cyclical movements, and is commonly used for forecasting and understanding
temporal behavior in various fields like economics, weather, finance, and health.
Key Characteristics of Time Series Data
Understanding the core features of time series data is crucial before applying any
model or method. Here are the main characteristics:

1. Time Order (Chronological Dependency)


●​ Each observation is recorded at a specific time.​

●​ Order matters — unlike cross-sectional data, the sequence cannot be


rearranged without affecting interpretation.​

●​ Time intervals can be daily, weekly, monthly, quarterly, etc.​

📌 Example: Daily temperature readings, monthly sales, or yearly GDP.


2. Trend
●​ A long-term upward or downward movement in data over time.​

●​ It reflects the general direction of the data, ignoring short-term fluctuations.​

📌 Example: An increasing trend in population or company revenue over years.


3. Seasonality
●​ Periodic fluctuations that occur at regular intervals due to seasonal factors.​
●​ These are short-term patterns repeated annually, monthly, weekly, etc.​

📌 Example: Increased ice cream sales in summer or retail sales during festivals.
4. Cyclicality
●​ Long-term, wave-like fluctuations not of fixed length.​

●​ Different from seasonality because cycles don’t occur at regular intervals.​

●​ Often tied to economic or business cycles.​

📌 Example: Economic recessions and recoveries over several years.


5. Irregular or Random Component
●​ Residual variation in a time series after trend, seasonality, and cyclic
effects are removed.​

●​ Caused by unpredictable or random events (e.g., natural disasters, strikes).​

📌 Example: A sudden drop in production due to a factory fire.


6. Stationarity
●​ A time series is stationary if its properties like mean, variance, and
autocorrelation remain constant over time.​

●​ Stationarity is crucial for modeling (e.g., ARIMA requires stationarity).​

●​ Non-stationary data must be transformed (e.g., differencing) to become


stationary.​

📌 Example: Daily temperature may be non-stationary due to seasonality.


7. Autocorrelation (Serial Correlation)
●​ Measures the relationship of a variable with its past values.​

●​ Important in identifying lags or time dependencies in data.​

📌 Example: Today’s stock price is likely to be similar to yesterday’s.


Summary Table:
Characteristic         Description                                     Example
Time Order             Chronological sequence of data points           Monthly rainfall
Trend                  Long-term rise/fall in data                     Growth in internet users
Seasonality            Regular and predictable pattern                 Holiday shopping spike
Cyclicality            Irregular long-term fluctuations                Business cycles
Irregular Component    Unpredictable short-term shocks                 Earthquake impact on sales
Stationarity           Constant statistical properties over time       Stable monthly profit
Autocorrelation        Correlation of data with its previous values    Weekly fuel price pattern

Why Are These Characteristics Important?


1.​ Model Selection: Helps choose between models like ARIMA, Exponential
Smoothing, etc.​

2.​ Forecasting Accuracy: Understanding components improves predictive


power.​

3.​ Data Preprocessing: Seasonal decomposition, differencing, and smoothing


require identifying these features.​

Time based indexing ,Visualizing, Grouping, Resampling

4. Time-Based Indexing
Time-based indexing is a technique used in time series analysis where the index
(row labels) of a dataset is made up of timestamps (dates or date-times). This
enables highly efficient, intuitive, and flexible handling of temporal data.

🔹 Why Time-Based Indexing?


Time-based indexing allows:
●​ Easy filtering and slicing using dates​
●​ Efficient resampling and rolling window operations​

●​ Time-aware plotting and calculations​

●​ Grouping and aggregating by time periods​

🔧 Example Scenario
Let’s say you have a dataset that records daily temperature.
Date          Temperature (°C)
2023-01-01    30
2023-01-02    31
2023-01-03    32
2023-01-04    30
🐍 Example in Python (Using Pandas)
import pandas as pd

# Step 1: Create sample data


data = {
'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
'Temperature': [30, 31, 32, 30]
}
df = pd.DataFrame(data)

# Step 2: Convert 'Date' column to datetime


df['Date'] = pd.to_datetime(df['Date'])

# Step 3: Set the 'Date' column as the index


df.set_index('Date', inplace=True)

print(df)

📈 Output:
Temperature
Date
2023-01-01 30
2023-01-02 31
2023-01-03 32
2023-01-04 30

🎯 Now What Can You Do?


🔹 1. Slicing by Date
df['2023-01-02':'2023-01-03']

Output:
Temperature
Date
2023-01-02 31
2023-01-03 32

🔹 2. Filter by Month or Year


df[df.index.month == 1] # All January data
df[df.index.year == 2023] # All 2023 data

🔹 3. Resample Daily Data to Monthly Averages


df.resample('M').mean()
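
🔹 4. Rolling Time-Window Operations
A small sketch using the same df; time-based windows such as '2D' only work once the index is a datetime index:

# 2-day rolling mean computed over a time-based window
df['Temperature'].rolling('2D').mean()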

🔢 Time-Based Index vs Regular Index


| Feature | Regular Index | Time-Based Index |
|---------|---------------|------------------|
| Efficient time slicing | ❌ | ✅ |
| Rolling time operations | ❌ | ✅ |
| Built-in resampling | ❌ | ✅ |
| Time-based grouping | ❌ | ✅ |
🔍 Common Frequencies in Pandas resample()
| Code | Frequency |
|------|-----------|
| 'D' | Daily |
| 'W' | Weekly |
| 'M' | Month-End |
| 'Q' | Quarter |
| 'A' | Year-End |
| 'H' | Hourly |

📦 Real-Life Use Cases


| Domain | Use Case |
|-------------|----------|
| Finance | Stock prices indexed by date |
| Weather | Hourly temperature recordings |
| Healthcare | Patient vitals recorded per visit date |
| IoT/Sensors | Sensor data collected every minute/hour |

✅ Summary
| Key Feature | Description |
|-------------|-------------|
| Time-aware slicing | Select rows by date range |
| Grouping and resampling | Easily convert between time granularities |
| Efficient analysis | Use .resample(), .rolling(), .shift() |
| Real-world relevance | Essential for any time series or longitudinal data |

Question: Temperature Data Analysis


You are given a dataset containing daily temperature readings for the month of January 2024 in Chennai. Your task is to:
📋 Dataset (Sample):
| Date | Temperature (°C) |
|------------|------------------|
| 2024-01-01 | 28.5 |
| 2024-01-02 | 29.0 |
| 2024-01-03 | 30.2 |
| ... | ... |
| 2024-01-31 | 27.8 |

🔍 Tasks:
1.​ Convert the 'Date' column to a datetime format and set it as the index.​

2.​ Display temperature data between January 10 and January 15 using time-based slicing.​

3.​ Calculate the average temperature for the entire month.​

4.​ Resample the data to get the weekly average temperatures.​

5.​ Plot the daily temperature trend using a line plot (if using Python).​

🧠 Hints (if you're using Python):


import pandas as pd

# Assuming you have read the data into a DataFrame called df


df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Slice from Jan 10 to Jan 15


df['2024-01-10':'2024-01-15']

# Monthly average
df['Temperature'].mean()
# Weekly resampling
df.resample('W').mean()
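
A minimal sketch for task 5 (assuming matplotlib is installed and df is the indexed DataFrame built in the hints above):

import matplotlib.pyplot as plt

# Line plot of the daily temperature trend
df['Temperature'].plot(figsize=(10, 5), marker='o',
                       title='Daily Temperature – January 2024')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.grid(True)
plt.show()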

5. Visualizing Time Series Analysis (TSA)

✅ What is Time Series Visualization?


Time Series Visualization involves plotting time-indexed data to identify trends,
patterns, seasonality, and anomalies over time. It is one of the first and most

🔍
important steps in Exploratory Data Analysis (EDA) of time series data.
Why is Visualization Important in TSA?
●​ Helps detect trends (upward/downward movement).​

●​ Reveals seasonality (regular pattern in intervals).​

●​ Identifies outliers and sudden changes.​

●​ Highlights cyclical behavior or repetitive variations.​

●​ Makes forecasting decisions easier.​

📌 Types of Visualizations Used in TSA


| Visualization Type | Description | Use Case |
|--------------------|-------------|----------|
| Line Plot | Plots values over time on a continuous line | Observe trends & seasonality |
| Lag Plot | Plots data at time t vs time t+lag | Test for randomness in time series |
| Autocorrelation Plot | Shows correlation between lagged values | Identify autocorrelation structure |
| Seasonal Decomposition | Decomposes time series into trend, seasonal, residual | Understand underlying components |
| Heatmaps / Boxplots | Show seasonal and cyclic behavior over time | Compare across months/weeks/years |
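
The line plot is worked out in the example that follows; for the lag and autocorrelation plots in the table, pandas provides built-in helpers — a minimal sketch, assuming sales is a date-indexed Series:

from pandas.plotting import lag_plot, autocorrelation_plot
import matplotlib.pyplot as plt

# 'sales' is assumed to be a pandas Series indexed by date
lag_plot(sales, lag=1)          # sales(t) vs sales(t+1); a visible pattern suggests non-randomness
plt.show()

autocorrelation_plot(sales)     # correlation of the series with its own lags
plt.show()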

📘 Example: Monthly Sales Data


Imagine you have a dataset showing monthly sales data for a store from January
2021 to December 2023.
| Month | Sales |
|------------|-------|
| 2021-01 | 150 |
| 2021-02 | 160 |
| 2021-03 | 180 |
| ... | ... |
| 2023-12 | 210 |

📈 How to Visualize (Example using Python)


import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("sales_data.csv")
df['Month'] = pd.to_datetime(df['Month'])
df.set_index('Month', inplace=True)

# Line plot to see the trend


plt.figure(figsize=(10,5))
plt.plot(df.index, df['Sales'], marker='o', linestyle='-')
plt.title("Monthly Sales Trend (2021-2023)")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.grid(True)
plt.show()

🔍 What Can You Observe from This Plot?


●​ A clear upward trend in sales over time.​

●​ Seasonal spikes in certain months (e.g., December).​

●​ Anomalies if sudden drops/spikes appear.​

🧠 Tips for Effective Visualization:


1.​ Use datetime index – makes slicing and plotting easier.​

2.​ Label axes and title clearly.​

3.​ Try rolling averages to smooth short-term noise:​

df['Sales'].rolling(window=3).mean().plot()

4.​ Use different colors or line styles to compare series.​
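
The seasonal decomposition listed in the table can be generated with statsmodels — a minimal sketch, assuming df holds the monthly Sales series from the example above:

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Split the monthly series into trend, seasonal and residual components
result = seasonal_decompose(df['Sales'], model='additive', period=12)
result.plot()
plt.show()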

6. Grouping in Time Series Analysis (TSA)

✅ What is Grouping in TSA?


Grouping in Time Series Analysis refers to organizing time-based data into
time-related categories (such as days, months, quarters, or years) to perform
aggregate operations (like mean, sum, count, etc.) for each group. This helps
uncover seasonal patterns, trends, and behavioral shifts over different time intervals.
🧠 Why Use Grouping?
●​ To aggregate values by specific time intervals (e.g., monthly sales, yearly
rainfall).​

●​ To compare performance across years/months/days.​

●​ To detect seasonal trends or periodic patterns.​

●​ Useful when the dataset is granular (e.g., daily/hourly) but insights are
needed at a higher level (e.g., weekly/monthly).​

🛠️ How is Grouping Done?


Grouping is usually performed using groupby() based on time attributes of the datetime index or column.
📘 Example Dataset
Let’s consider a dataset of daily sales:
| Date | Sales |
|------------|-------|
| 2022-01-01 | 200 |
| 2022-01-02 | 220 |
| ... | ... |
| 2022-12-31 | 240 |

🧪 Example 1: Group by Month


import pandas as pd
# Load and parse date
df = pd.read_csv("daily_sales.csv")
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Group by Month and calculate total sales


monthly_sales = df.groupby(df.index.month).sum()
print(monthly_sales)

📝 This returns total sales for each month (1=Jan, 2=Feb...).


🧪 Example 2: Group by Year and Month
# Group by Year and Month
df['Year'] = df.index.year
df['Month'] = df.index.month

monthly_avg = df.groupby(['Year', 'Month'])['Sales'].mean()


print(monthly_avg)

📝 This returns average monthly sales for each year, helping to compare monthly
patterns across years.

🧪 Example 3: Group by Weekday to Analyze Weekly Patterns


# Group by Day of Week (0 = Monday, ..., 6 = Sunday)
df['Weekday'] = df.index.weekday

weekday_sales = df.groupby('Weekday')['Sales'].mean()
print(weekday_sales)

📝 Useful to see if weekend vs weekday impacts sales.


📊 Visualizing Grouped Data
weekday_sales.plot(kind='bar', title='Average Sales by Weekday')

🔍 Key Observations Enabled by Grouping


| Grouping Basis | Insight You Get |
|----------------|-----------------|
| Monthly | Seasonality trends (e.g., festive months) |
| Year | Year-over-year growth or decline |
| Weekday | Behavioral pattern based on day of week |
| Hour | Traffic/sales trend during the day |

🧠 Tips for Effective Grouping:


●​ Ensure your datetime column is parsed correctly using pd.to_datetime().​

●​ Use .resample() for time-interval grouping (e.g., weekly, monthly) – helpful for evenly spaced time periods.​

●​ Combine groupby() with agg() to compute multiple summary statistics, as sketched below.​
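
A minimal sketch of that last tip, reusing the same daily sales DataFrame from the examples above:

# Several summary statistics per month in one pass
monthly_summary = df.groupby(df.index.month)['Sales'].agg(['mean', 'sum', 'max', 'min'])
print(monthly_summary)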

7. Resampling in Time Series Analysis (TSA)

✅ What is Resampling?
Resampling in Time Series Analysis is the process of changing the frequency of
time series data. It allows you to convert time series data to a different time scale
– for example:
●​ From daily to monthly (downsampling)​

●​ From monthly to daily (upsampling)​

It is used to summarize, smooth, or reorganize time series for better analysis and visualization.
🔍 Types of Resampling
| Type | Description | Example |
|------|-------------|---------|
| Downsampling | Reduce data frequency | Daily → Monthly (summarize daily sales into monthly total) |
| Upsampling | Increase data frequency (may require interpolation) | Monthly → Daily (estimate daily values from monthly data) |

🧠 Why Use Resampling?


●​ To reduce noise in high-frequency data.​

●​ To aggregate data for long-term trends.​


●​ To fill missing dates in low-frequency data.​

●​ To align data frequency for comparative analysis.​

🧪 Example Dataset
| Date | Sales |
|------------|-------|
| 2022-01-01 | 200 |
| 2022-01-02 | 210 |
| ... | ... |
| 2022-12-31 | 240 |

📦 Step-by-Step: Downsampling
import pandas as pd

# Load data
df = pd.read_csv("sales.csv")
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Resample to Monthly Frequency and Calculate Total Sales


monthly_sales = df.resample('M').sum()
print(monthly_sales)

🧾 'M' stands for Month-End Frequency. Other common codes:


●​ 'W' = weekly​

●​ 'Q' = quarterly​

●​ 'A' = annual​

📈 Step-by-Step: Upsampling
# Upsample from daily to hourly (fill with NaNs)
hourly_data = df.resample('H').asfreq()
print(hourly_data.head())

📍 Use .interpolate() if you want to estimate missing values:


hourly_data_interpolated = hourly_data.interpolate()

🔄 Aggregate Multiple Statistics with .agg()


monthly_stats = df.resample('M').agg({'Sales': ['mean', 'sum', 'max', 'min']})
print(monthly_stats)

🧠 When to Use Resampling?


| Situation | Use |
|-----------|-----|
| High-frequency data too noisy | Downsample |
| Need to fill gaps in data | Upsample with interpolation |
| Trend analysis over longer periods | Downsample (monthly, quarterly) |
| Aligning multiple time series datasets | Resample to common frequency |

📊 Visualization Example
import matplotlib.pyplot as plt

df['Sales'].plot(label='Daily Sales', alpha=0.5)


monthly_sales['Sales'].plot(label='Monthly Sales', linewidth=2)
plt.legend()
plt.title("Daily vs Monthly Sales")
plt.show()

🔁 Difference Between Upsampling and Downsampling


| Feature | Upsampling | Downsampling |
|---------|------------|--------------|
| Definition | Increasing the frequency of time series data | Decreasing the frequency of time series data |
| Goal | Fill in more granular time intervals (e.g., add hours to daily data) | Aggregate data to a broader time interval (e.g., from daily to monthly) |
| Data Points | More data points than original | Fewer data points than original |
| Requires Interpolation? | Yes, often (to estimate values in new time points) | No, values are typically aggregated using sum, mean, etc. |
| Example | Converting monthly data to daily | Converting daily data to monthly |
| Use Case | Data modeling, forecasting at finer granularity | Trend analysis, reducing noise |

📌 Examples
✅ Upsampling (Monthly → Daily)
df.resample('D').asfreq()

You may need to interpolate missing values:


df.resample('D').interpolate()

✅ Downsampling (Daily → Monthly)


df.resample('M').sum()

🧠 Summary
●​ Use Upsampling when you want finer granularity (but be careful with filling
missing values).​

●​ Use Downsampling to summarize or smooth data for clearer trend analysis.​

🔍 Real-World Applications
●​ Finance: Resample stock price data from minute-level to daily averages.​

●​ Retail: Aggregate daily sales into monthly totals for seasonality detection.​
●​ Healthcare: Convert weekly health reports into monthly summaries.​

●​ IoT: Aggregate sensor data recorded every second into hourly averages.​

8. Problems

Question:
You are given monthly sales data (in units) for a retail store over the last year. Plot
the sales data using a suitable graph and describe the overall trend and any
seasonal patterns you observe. Then:
1.​ Apply a 4-month moving average to smooth the data and highlight the
underlying trend.​

2.​ Use simple exponential smoothing with a smoothing factor (α) of 0.4 to
analyze the trend further.​

3.​ Forecast sales for the next three months.​

4.​ Interpret the smoothed results and forecast.​

| Month | Sales (units) |
|-------|---------------|
| Jan | 1200 |
| Feb | 1300 |
| Mar | 1250 |
| Apr | 1400 |
| May | 1500 |
| Jun | 1600 |
| Jul | 1550 |
| Aug | 1700 |
| Sep | 1800 |
| Oct | 1900 |
| Nov | 2000 |
| Dec | 2100 |

Step 1: Plotting and Observing the Data


●​ Plot the monthly sales on a line graph with months on the x-axis and sales
on the y-axis.​

●​ Trend: Sales show an increasing trend over the year.​

●​ Seasonal pattern: Slight increase in summer months (May-August) indicating possible seasonality.​

Step 2: Calculate 4-Month Moving Average


●​ Calculate the average sales over every 4 consecutive months to smooth short-term fluctuations.​

| Month Ending | 4-Month Moving Average |
|--------------|------------------------|
| Apr | (1200+1300+1250+1400)/4 = 1287.5 |
| May | (1300+1250+1400+1500)/4 = 1362.5 |
| Jun | (1250+1400+1500+1600)/4 = 1437.5 |
| Jul | (1400+1500+1600+1550)/4 = 1512.5 |
| Aug | (1500+1600+1550+1700)/4 = 1587.5 |
| Sep | (1600+1550+1700+1800)/4 = 1662.5 |
| Oct | (1550+1700+1800+1900)/4 = 1737.5 |
| Nov | (1700+1800+1900+2000)/4 = 1850 |
| Dec | (1800+1900+2000+2100)/4 = 1950 |


●​ The moving average smooths the sales data, showing a clearer upward trend.​
Step 3: Apply Simple Exponential Smoothing (α = 0.4)

Using St = 0.4 × Yt + 0.6 × St−1, with S1 = Y1 (values rounded to two decimals at each step):

| Month | Actual Sales Yt | Calculation | Smoothed St |
|-------|-----------------|-------------|-------------|
| Jan | 1200 | S1 = 1200 | 1200.00 |
| Feb | 1300 | 0.4×1300 + 0.6×1200.00 | 1240.00 |
| Mar | 1250 | 0.4×1250 + 0.6×1240.00 | 1244.00 |
| Apr | 1400 | 0.4×1400 + 0.6×1244.00 | 1306.40 |
| May | 1500 | 0.4×1500 + 0.6×1306.40 | 1383.84 |
| Jun | 1600 | 0.4×1600 + 0.6×1383.84 | 1470.30 |
| Jul | 1550 | 0.4×1550 + 0.6×1470.30 | 1502.18 |
| Aug | 1700 | 0.4×1700 + 0.6×1502.18 | 1581.31 |
| Sep | 1800 | 0.4×1800 + 0.6×1581.31 | 1668.79 |
| Oct | 1900 | 0.4×1900 + 0.6×1668.79 | 1761.27 |
| Nov | 2000 | 0.4×2000 + 0.6×1761.27 | 1856.76 |
| Dec | 2100 | 0.4×2100 + 0.6×1856.76 | 1954.06 |

Step 4: Forecast Next Three Months


●​ Forecasts are equal to the last smoothed value (since SES assumes the level is constant).​

●​ The December smoothed value is 1954.06, so:​

| Month | Forecast Sales (units) |
|-------|------------------------|
| Jan | ≈ 1954 |
| Feb | ≈ 1954 |
| Mar | ≈ 1954 |
Step 5: Interpretation
●​ Moving average smooths short-term variations and shows a clear upward
sales trend.​
●​ Simple exponential smoothing reacts more quickly to recent changes and
helps forecast future sales.​

●​ The forecast indicates stable sales around the latest smoothed value,
assuming no sudden changes.​

●​ If seasonality exists, other methods like Holt-Winters may be more appropriate.​
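
The hand calculations above can be cross-checked in pandas — a minimal sketch (the 2023 dates are only illustrative); ewm(alpha=0.4, adjust=False) applies exactly the recursive SES formula used in Step 3:

import pandas as pd

sales = pd.Series([1200, 1300, 1250, 1400, 1500, 1600,
                   1550, 1700, 1800, 1900, 2000, 2100],
                  index=pd.date_range('2023-01-01', periods=12, freq='M'))

print(sales.rolling(window=4).mean())              # Step 2: 4-month moving average
ses = sales.ewm(alpha=0.4, adjust=False).mean()    # Step 3: simple exponential smoothing
print(ses)
print(ses.iloc[-1])                                # Step 4: flat forecast = last smoothed value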

ARIMA

Question
You have monthly sales data for a clothing store in 2023:
Month Sales

Jan 500

Feb 520

Mar 550

Apr 600

May 620

Jun 630

Jul 700

Aug 720

Sep 690

Oct 680

Nov 710

Dec 750
Task: Apply an ARIMA model to smooth the data, capture patterns, and forecast
sales for Jan-Mar 2024. Interpret the results.
Step 1: Plot and Explore the Data
●​ Plot the sales over months to observe trend and seasonality.​
●​ Here, sales generally increase over the year, indicating an upward trend.​

●​ There may be slight seasonal behavior (e.g., higher sales in mid and end of
year).​

Step 2: Check for Stationarity


●​ ARIMA models require stationary data (mean and variance constant over
time).​

●​ Use Augmented Dickey-Fuller (ADF) test to check stationarity.​

Possible outcome:
●​ The test shows non-stationarity (p-value > 0.05), so differencing is needed.​

Step 3: Differencing
●​ Apply first-order differencing: subtract each month’s sales from the
previous month’s sales to remove trend.​

Differenced data (ΔSales):​


Feb - Jan = 520 - 500 = 20​
Mar - Feb = 550 - 520 = 30​
Apr - Mar = 600 - 550 = 50​
... and so forth.
●​ Check stationarity of differenced data with ADF again. Usually, first
difference is enough.​
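
Steps 2 and 3 (and the ACF/PACF plots needed in the next step) can be run with statsmodels — a minimal sketch, assuming the 2023 sales are in a date-indexed Series named sales; with only 12 observations the test is illustrative, so a small maxlag is used:

from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# Step 2: ADF test on the raw series (p-value > 0.05 suggests non-stationarity)
adf_stat, p_value = adfuller(sales, maxlag=1)[:2]
print('ADF statistic:', adf_stat, 'p-value:', p_value)

# Step 3: first-order differencing, then re-test
diff_sales = sales.diff().dropna()
print('p-value after differencing:', adfuller(diff_sales, maxlag=1)[1])

# Step 4: ACF / PACF of the differenced series to choose q and p
plot_acf(diff_sales, lags=4)
plot_pacf(diff_sales, lags=4)
plt.show()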

Step 4: Identify ARIMA (p, d, q) Parameters


●​ d = 1 (from differencing step)​

●​ Use ACF (Auto-correlation function) and PACF (Partial auto-correlation function) plots on differenced data:​

○​ If PACF cuts off sharply after lag k → suggests AR order p = k​

○​ If ACF cuts off sharply after lag k → suggests MA order q = k​

Example:
●​ PACF cuts off after lag 1 → p = 1​
●​ ACF tails off slowly → q = 0​

So, ARIMA(1,1,0) is a candidate.

Step 5: Fit the ARIMA(1,1,0) Model

●​ Estimate the model parameters (the AR(1) coefficient and the constant) on the sales data, typically by maximum likelihood; a code sketch is given after Step 7.​

Step 6: Check Model Diagnostics


●​ Plot residuals to confirm they behave like white noise (no autocorrelation).​

●​ Use Ljung-Box test to confirm residuals are random.​

●​ If diagnostics are good, model is acceptable.​

Step 7: Forecast Next 3 Months (Jan-Mar 2024)


●​ Use the fitted ARIMA model to forecast sales for next 3 months.​

●​ The model predicts differenced sales, so you need to add forecasts cumulatively to the last observed actual sales (Dec 2023).​

For example, if last observed sales in Dec = 750:​


Forecast for Jan 2024 = 750 + forecasted difference for Jan​
Forecast for Feb 2024 = forecast for Jan + forecasted difference for Feb​
Forecast for Mar 2024 = forecast for Feb + forecasted difference for Mar
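
A minimal statsmodels sketch covering Steps 5–7 (same assumed sales Series as above); because d = 1 is part of the model order, forecast() returns values on the original sales scale, so the cumulative addition described above is handled internally:

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

# Step 5: fit ARIMA(1,1,0)
fitted = ARIMA(sales, order=(1, 1, 0)).fit()
print(fitted.summary())

# Step 6: residual diagnostics (Ljung-Box test for leftover autocorrelation)
print(acorr_ljungbox(fitted.resid, lags=[6]))

# Step 7: forecast Jan-Mar 2024 on the original sales scale
print(fitted.forecast(steps=3))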

Step 8: Interpret Results


●​ Trend: The positive AR coefficient and differencing reflect the increasing
sales trend over the year.​

●​ Seasonality: If the model includes seasonal terms (ARIMA with a seasonal part, i.e., SARIMA), seasonal patterns will be captured; here we assume none or minor seasonality.​

●​ Fluctuations: The noise term captures random fluctuations.​

●​ Forecast: The predicted sales for first three months of 2024 provide
actionable insights to plan inventory and marketing.​

Summary:
| Step | Description | Key Points |
|------|-------------|------------|
| 1 | Plot data, observe trend/seasonality | Upward trend visible |
| 2 | Check stationarity (ADF test) | Data non-stationary |
| 3 | Differencing applied | First order differencing (d=1) |
| 4 | Identify p and q using ACF and PACF | Example: p=1, q=0 |
| 5 | Fit ARIMA(1,1,0) model | Estimate parameters |
| 6 | Check residual diagnostics | Residuals white noise? |
| 7 | Forecast Jan-Mar 2024 | Use model to predict future |
| 8 | Interpret trend, fluctuations, forecast | Understand patterns and prepare |

Practice:

1.​ You are given monthly sales data for a retail store over the past year. Plot
the data on a suitable graph and describe the overall trend and any
seasonal variations. Apply a 4-month moving average to smooth the data
and highlight the underlying trend. Use simple exponential smoothing with
a smoothing factor of 0.4 to analyze the trend further and forecast sales for
the next three months. Interpret the smoothed results and the forecasted
sales.
2.​ You are given monthly data of a certain product’s demand for the past two
years. The data shows a general upward trend with some seasonal
fluctuations.
●​ Explain how you would check if the data is stationary.​

●​ Describe the steps to transform the data if it is not stationary.​

●​ How would you use ACF and PACF plots to decide the ARIMA model
parameters (p, d, q)?​

●​ Once the ARIMA model is fitted, explain how to forecast the demand for the
next three months.​

●​ Discuss how you would interpret the fitted model in terms of trend,
seasonality, and residual fluctuations.
