CS22021 - EXPLORATORY DATA ANALYSIS - FAT 3 - Notes
FAT 3 QB
PART A
3. Highlights Proportional Trends: They help detect trends or proportions, such
as higher incidence rates in a subgroup despite smaller counts.
4. Supports Policy Decisions: Widely used in fields like public health and
education to formulate decisions based on relative distributions.
Q2. How can row and column percentages in a table reveal different insights in
categorical data analysis?
Ans:
1. Row Percentages: Reveal how a specific row category is distributed across
column categories, helping compare behaviors or responses across groups.
3. Enables Statistical Testing: Forms the basis for performing the Chi-square test
to statistically assess independence.
4. Supports Hypothesis Formation: Patterns seen in the table guide the
development of meaningful hypotheses for further investigation.
Q5. What are the key features of a scatterplot that help detect relationships
between two variables?
Ans:
1. Direction: Shows whether the relationship is positive, negative, or nonexistent
based on the slope.
2. Form: Indicates whether the relationship is linear, nonlinear, or no clear pattern.
3. Strength: Reflected by how tightly the points cluster around a line or curve.
4. Outliers: Helps visually identify unusual or influential data points that could affect
analysis.
Q6. In what situations is using a resistant line in a scatterplot more appropriate
than a least-squares regression line?
Ans:
1. Presence of Outliers: Resistant lines are less affected by extreme values,
providing more robust trend estimation.
2. Skewed Data: Better suited for data distributions that are not symmetrical, unlike
least-squares lines.
3. Exploratory Phase: Useful during initial analysis for identifying consistent trends
without distortion by anomalies.
4. Non-Normal Errors: More reliable when data residuals are not normally
distributed.
Q7. What are some common data transformation techniques and when are they
used?
Ans:
1. Log Transformation: Applied to right-skewed data to stabilize variance and
normalize distribution.
2. Square Root Transformation: Suitable for data where variance increases with
the mean, such as count data.
3. Reciprocal Transformation: Reduces impact of large values and can linearize
hyperbolic trends.
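The three transformations above can be compared on a small made-up right-skewed sample; the numbers below are purely illustrative.

```python
import math

# Hypothetical right-skewed sample (made-up values for illustration)
data = [1, 2, 2, 3, 5, 8, 20, 50]

log_t   = [math.log(x) for x in data]   # log: compresses large values, tames right skew
sqrt_t  = [math.sqrt(x) for x in data]  # square root: milder compression, suits count data
recip_t = [1 / x for x in data]         # reciprocal: strongly shrinks large values

# The largest value dominates far less after each transform,
# which is the variance-stabilizing effect in miniature.
print([round(v, 2) for v in log_t])
```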
2. Chi-square Formula: Test statistic is computed based on the difference between
observed and expected values.
3. Test Validity: Ensures the test assesses deviation from randomness or
independence correctly.
4. Detects Patterns: Helps identify where actual distributions differ significantly
from expected ones.
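The formula in point 2, chi-square = sum of (O - E)^2 / E over all cells, can be checked directly on made-up observed and expected counts:

```python
# Hypothetical observed and expected frequencies for four cells of a table
observed = [30, 20, 10, 40]
expected = [20, 30, 20, 30]

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)
```

Large contributions from individual cells show exactly where the actual distribution departs from the expected one (point 4).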
Q9. What are the consequences of violating ANOVA assumptions and how can
they be addressed?
Ans:
1. Violation of Normality: Makes the F-test unreliable; can be corrected using data
transformation or non-parametric alternatives.
2. Matched Samples: Applied when subjects are paired based on shared
characteristics to minimize variability.
Q11. How does ANOVA partition the total variation in the dataset, and why is this
important?
Ans:
1. Between-Group Variation: Measures differences among group means relative
to the overall mean.
Q12. What is the purpose of a chi-square test in bivariate analysis, and how is it
applied to categorical data?
Answer:
1. Purpose: The chi-square test is used to assess whether two categorical
variables are independent or associated with each other.
2. Expected vs. Observed: It compares the observed frequencies with the
expected frequencies if the variables were independent.
3. Application: Applied to categorical data, the chi-square test determines whether
there is a statistically significant relationship between the two variables.
Q13. What is ANOVA, and how does it help in comparing the means of three or
more groups?
Answer:
1. Definition: ANOVA (Analysis of Variance) is a statistical method used to
compare the means of three or more groups.
2. Between vs. Within Group Variability: It assesses whether the variance
between group means is greater than the variance within the groups, indicating a
significant difference.
3. Hypothesis Testing: The null hypothesis assumes no difference between group
means, and the alternative hypothesis assumes at least one group mean differs.
4. Example: ANOVA could be used to compare the average test scores of students
from different teaching methods (traditional, online, hybrid) to see if teaching
method affects performance.
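The between-group versus within-group comparison in point 2 can be sketched in plain Python. The three groups of test scores below are hypothetical stand-ins for the teaching-method example; a real analysis would use a library routine such as scipy's f_oneway.

```python
# One-way ANOVA "by hand" on hypothetical test scores for three teaching methods
groups = {
    'traditional': [70, 75, 80],
    'online':      [80, 85, 90],
    'hybrid':      [90, 95, 100],
}

all_scores = [x for g in groups.values() for x in g]
grand_mean = sum(all_scores) / len(all_scores)

# Between-group sum of squares: group size times squared deviation of group mean
ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
# Within-group sum of squares: deviations of observations from their own group mean
ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups.values())

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_stat = (ssb / df_between) / (ssw / df_within)
print(f_stat)  # 12.0
```

A large F means the variance between group means dwarfs the variance within groups, which is the evidence against the null hypothesis of equal means.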
2. P-Value: The test computes a p-value to assess the strength of evidence against
the null hypothesis. A small p-value indicates that the relationship is statistically
significant.
3. Test Selection: Common tests like the chi-square test for independence or the
t-test for comparing means are used depending on the data type.
4. Decision Making: Hypothesis testing helps to decide whether the observed
relationship is due to chance or if there is a genuine association between the
variables.
2. Mean Difference: A larger difference between group means increases the
t-value.
3. Variability: Higher data variability increases standard error and lowers the
t-value.
4. Significance Level (α): A lower α (e.g., 0.01) makes it harder to reject the null
hypothesis.
These factors affect the t-value and p-value, influencing whether the null hypothesis is
accepted or rejected.
Q16. What are the steps in Chi-Square test?
Q17. 4 ANOVA Methods (with Uses)
1. One-Way ANOVA:
These methods help identify whether observed differences in group means are
statistically significant.
UNIT 5
Q1. How does introducing a third variable help in analyzing relationships
between two variables?
1. Controls Confounding Effects: A third variable can control for confounding
factors that might distort the relationship between the primary variables.
2. Reveals Interactions: It can reveal hidden interactions where the relationship
between two variables changes depending on the third variable.
4. Example: Introducing age as a third variable when studying the relationship
between physical activity and health outcomes can uncover whether the impact
of physical activity on health differs by age group.
Q2. What are the possible roles a third variable can play in data analysis?
Ans:
1. Confounder: Influences both the independent and dependent variables, possibly
creating a spurious association.
2. Moderator: Alters the strength or direction of the relationship between two
variables (interaction effect).
3. Mediator: Explains the mechanism through which the independent variable
affects the dependent variable.
4. Control Variable: Included in analysis to isolate the relationship between the
variables of primary interest.
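The confounder role in point 1 can be demonstrated with deterministic made-up data: a third variable z drives both x and y, so they correlate perfectly even though neither causes the other.

```python
# A confounder z drives both x and y, producing a spurious x-y correlation.
# All numbers are fabricated for illustration (e.g. z = age).
z = list(range(1, 11))
x = [2 * v for v in z]       # "exercise", driven entirely by z
y = [3 * v + 1 for v in z]   # "health score", also driven entirely by z

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    sa = sum((p - ma) ** 2 for p in a) ** 0.5
    sb = sum((q - mb) ** 2 for q in b) ** 0.5
    return cov / (sa * sb)

r = pearson(x, y)  # ~1.0, yet x never influences y directly
print(r)
```

Conditioning on z (e.g. comparing x and y within fixed levels of z) would make the association vanish, which is how a control variable isolates the relationship of interest.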
Q3. What conditions must be satisfied to establish a causal relationship between
two variables?
Ans:
1. Temporal Precedence: The cause must occur before the effect in time.
2. Causation Requires Mechanism: Causal claims need proof that one variable
directly affects the other.
3. Spurious Correlation Risk: Correlation may arise due to a lurking third variable
or coincidence.
4. Statistical Tools Differ: Regression, experiments, or path analysis are needed
for causal inference, unlike simple correlation.
2. Controls for Confounding: Shows whether an association holds after adjusting
for the third variable.
3. Tests for Conditional Independence: Evaluates whether two variables are
independent within each subgroup.
4. Enhances Data Insight: Provides a deeper understanding of subgroup trends
and relationships.
2. Forecast Future Values: Use past data to predict future observations accurately.
3. Detect Structural Changes: Spot shifts in data behavior due to external or
internal events.
4. Improve Decision Making: Enable planning, resource allocation, and policy
formation based on data-driven trends.
Q7. How does time series data differ from cross-sectional data?
Ans:
1. Ordered Observations: Time series has a natural chronological order, unlike
cross-sectional data.
2. Temporal Dependence: Time series data is collected over time, with each
observation corresponding to a specific time point, while cross-sectional data is
collected at a single point in time.
3. Dynamic Behavior: Captures change over time, while cross-sectional data
represents a single time snapshot.
6. Purpose: Time series is used for forecasting and understanding trends over
time, while cross-sectional data is useful for comparing different subjects or
entities at one point.
7. Example: Stock prices over a month represent time series data, while a survey
of household income levels at one time point represents cross-sectional data.
Q8. What are the key components of time series data?
Ans:
1. Trend: Long-term upward or downward direction in the data.
2. Seasonality: Regular, periodic fluctuations occurring within a year (e.g., monthly
sales).
3. Cyclic Variation: Irregular fluctuations over longer periods due to business or
economic cycles.
2. Simplifies Modeling: Many models (like ARIMA) assume stationarity for validity.
3. Supports Missing Value Detection: Helps identify time gaps or missing records
in the series.
3. Fill Gaps: Upsampling can fill missing values using interpolation or
forward/backward fill.
4. Supports Rolling Analysis: Enables moving averages or rolling statistics over
resampled data.
Q12. Given the variables "age group," "exercise frequency," and "sleep quality,"
construct a three-variable contingency table and explain how it can be used to analyze
the relationship between these variables.
Answer:
1. Define Categories: First, categorize the variables:
2. Construct the Table: Create a 3D table, with age group and exercise frequency
as row and column variables, and sleep quality as another dimension (cells will
represent the frequency of individuals in each combination). For example:
○ Exercise Frequency and Sleep Quality: Among the young group, those
who exercise more frequently (High) have better sleep quality (35 Good
sleep). This indicates a positive relationship between exercise frequency
and sleep quality.
○ Age and Sleep Quality: Older individuals tend to have poorer sleep
quality, with 20 individuals in the "Poor" sleep category. This suggests that
sleep quality may decline with age.
○ The combination of low exercise frequency and poor sleep quality seems
most prevalent in the younger age group.
By constructing and analyzing this table, we gain insights into how exercise, age, and
sleep quality interact within the population, guiding decisions related to health and
lifestyle.
Q13. Outline the causal explanation in the context of exploratory data analysis
(EDA).
Answer:
1. Definition: Causal explanation in EDA aims to identify the underlying
cause-and-effect relationships between variables.
2. Visual Exploration: Causal relationships can be explored by plotting data and
looking for patterns that suggest a directional influence (e.g., A causes B).
4. Example: In EDA, analyzing the impact of hours of study on test scores can help
establish a causal link between increased study time and higher test
performance.
2. Complexity: Multivariate analysis allows for the exploration of more complex
relationships, including interactions between multiple variables.
4. Example: Bivariate analysis might explore the relationship between age and
income, while multivariate analysis could examine the combined effect of age,
education, and occupation on income.
3. Modeling: It helps in building time series models, such as ARIMA, which use
past values to predict future values.
2. Identification: Outliers can be detected using visual methods (like boxplots or
scatter plots) or statistical tests (like Z-scores or IQR).
3. Impact: They can distort trends, seasonality, and forecasts if not handled
properly.
Q17. What role does data aggregation play in time series analysis?
Answer:
1. Simplification: Aggregating data over larger time periods (e.g., from daily to
weekly) helps reduce noise and emphasizes long-term trends.
2. Comparison: It allows for comparing trends across different time periods or
categories.
4. Example: Aggregating daily stock prices into weekly averages smoothens out
short-term volatility, aiding in trend analysis.
Q18. What are the Key assumptions for a valid ARIMA model?
Key assumptions for a valid ARIMA model are:
1. Stationarity:
PART B
UNIT 4
○ Total Percentage Table: Uses the overall total as the denominator for all
cells.
○ Step 3: Divide each cell count by the row total, column total, or overall
total.
○ Step 4: Multiply by 100 to convert to a percentage.
4. Example:
Suppose we have a dataset of 100 students categorized by gender and whether they
prefer online or offline learning:
Gender  Online  Offline  Total
Male    30      20       50
Female  40      10       50
Total   70      30       100
5. Row Percentage Table:
Example:
Question:
Solution:
Step 1: Create the Frequency Table
Income Group  Amazon  Flipkart  Myntra  Total
Low           3       1         0       4
Medium        1       1         1       3
High          1       1         1       3
Total         5       3         2       10
● Medium and High-income groups are evenly split across platforms (Amazon,
Flipkart, Myntra).
Conclusion:
● Row percentages show how each income group spreads its preference across
platforms.
<30 20 30 50
30+ 60 40 100
Total 80 70 150
4. Marginal Percentages:
3. Example: Students categorized by department (CS, IT) and whether they attend
workshops (Yes/No):
Department  Yes  No   Total
CS          60   40   100
IT          40   60   100
4. Row-wise:
○ Use row-wise when you want to analyze how column categories vary
across each row group (e.g., preferences of departments).
Factory  Batch  Defective Units  Non-defective Units
A        1      5                95
A        2      7                93
A        3      6                94
A        4      4                96
B        1      12               88
B        2      15               85
B        3      14               86
B        4      10               90
C        1      3                97
C        2      2                98
C        3      5                95
C        4      4                96
i) Combine the batch data for each factory to create a contingency table showing total
defective and non-defective units by factory.
ii) Construct a percentage table from the contingency table, showing the percentage of
defective and non-defective units for each factory out of the total units produced by that
factory.
iii) Analyze the tables to identify:
Which factory has the highest defect rate?
Whether the defect status is associated with the factory.
How does handling several batches improve the reliability of the analysis?
iii) Analysis
1. Which factory has the highest defect rate?
● Factory B has the highest defect rate at 12.75%.
● Factory A has 5.5%, and Factory C has 3.5%.
3. How does handling several batches improve the reliability of the analysis?
● Combining data from several batches increases the sample size, reducing
random variation.
● This reduces the impact of any unusual batch, leading to more accurate and
reliable conclusions about factory performance.
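The combined contingency table and the per-factory percentages (parts i and ii) can be computed directly from the batch data; this sketch reproduces the defect rates quoted in the analysis.

```python
# Batch data per factory as (defective, non-defective) pairs, from the table above
batches = {
    'A': [(5, 95), (7, 93), (6, 94), (4, 96)],
    'B': [(12, 88), (15, 85), (14, 86), (10, 90)],
    'C': [(3, 97), (2, 98), (5, 95), (4, 96)],
}

rates = {}
for factory, rows in batches.items():
    defective = sum(d for d, n in rows)        # combine the four batches
    total = sum(d + n for d, n in rows)        # total units produced by the factory
    rates[factory] = 100 * defective / total   # percentage table entry
    print(factory, defective, total, f"{rates[factory]:.2f}%")
```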
Question:
A researcher collects data on the number of hours students study per week and their
corresponding test scores. The scatterplot shows a positive linear trend, but one student
studied an unusually large number of hours yet earned a lower-than-expected score.
a) Explain how this outlier might affect the least squares regression line.
b) Describe how a resistant line (like the median-median line) would handle this outlier
differently.
4. T-test
Reference : https://chatgpt.com/share/6826ccf0-7cbc-8001-9a3c-cb6768658c1c
Problem:
How would you use an independent samples t-test to check if the difference in average
sales between the two campaigns is statistically significant?
3. Compute the t-statistic using the independent samples t-test formula.
4. Determine degrees of freedom and find the critical t-value or p-value.
5. If the calculated t exceeds critical t or p-value < significance level (e.g., 0.05),
reject H₀, concluding a significant difference.
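The steps above can be sketched in plain Python with a pooled-variance independent-samples t-test; the two sets of daily sales figures are hypothetical stand-ins for the two campaigns.

```python
import math

# Hypothetical daily sales for two ad campaigns (illustrative numbers only)
a = [20, 22, 19, 24, 25]
b = [28, 27, 30, 26, 29]

def t_independent(x, y):
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)   # sample variance of x
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)   # sample variance of y
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)  # pooled variance
    t = (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2                            # t-statistic, degrees of freedom

t, df = t_independent(a, b)
print(t, df)
```

Compare |t| with the critical value for df degrees of freedom (or the p-value with alpha) to decide whether to reject H₀.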
Detailed solution:
2. A school wants to compare test scores of students taught using two different teaching
methods: Method 1 and Method 2. The test scores are:
● Method 1: 85, 88, 90
5. Compare t-statistic with critical value or p-value with α (e.g., 0.05) to accept or
reject H₀.
5. Chi-square test of independence
Reference : https://chatgpt.com/share/6826ccf0-7cbc-8001-9a3c-cb6768658c1c
Problem:
A supermarket wants to find out if customer age group is related to their preferred
payment method (Cash or Card). They survey 100 customers and record the following
data:
Age Group  Cash  Card  Total
18-30      20    30    50
31-50      15    25    40
51+        5     5     10
Total      40    60    100
Is there a significant association between Age Group and Preferred Payment Method?
Solution:
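A worked sketch of the test on the survey table above: expected counts come from row total times column total over the grand total, and the statistic sums (O - E)^2 / E over all six cells.

```python
# Observed counts from the survey: rows = age groups (18-30, 31-50, 51+),
# columns = (Cash, Card)
observed = [[20, 30], [15, 25], [5, 5]]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected count under independence
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(chi2, df)  # ~0.52 with 2 degrees of freedom
```

Since the statistic (~0.52) is far below the 5% critical value for 2 degrees of freedom (5.99), the data give no evidence of an association between age group and payment method.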
Summary
Step Action
Practice questions
Question 1:
A hospital wants to check if the recovery rate of patients is independent of the
treatment type (Treatment A, Treatment B). They record the following data:
Treatment Type  Recovered  Not Recovered  Total
Treatment A     45         15             60
Treatment B     30         30             60
Is there a significant association between treatment type and recovery status?
Question 2:
A college surveys students to find out if gender is related to their choice of major. The
data collected is:
Gender  Science Majors  Arts Majors  Total
Male    40              20           60
Female  30              30           60
Use a Chi-square test to determine if there is an association between gender and major
choice.
6. ANOVA
What is ANOVA?
● When comparing three or more groups, using multiple t-tests increases the
risk of Type I error (false positives).
○ Used when the same subjects are measured multiple times under
different conditions or over time.
Example Scenario
Suppose three diet plans are tested for their effect on weight loss. ANOVA will tell
if the average weight loss differs significantly between these plans.
Assumptions of ANOVA
If ANOVA shows significant differences, post hoc tests (e.g., Tukey’s HSD,
Bonferroni) are performed to identify which groups differ.
Summary
Aspect Description
Problem:
Structure
● You have three categorical variables, say A, B, and C.
● The table is typically presented as a series of two-way tables for each level
of the third variable, or as a cube-shaped table.
then the table will have r × c × k cells.
Trainee  Gender  Practice Hours/Week  Status
T1       Male    6                    Certified
T3       Male    5                    Certified
T4       Female  4                    Certified
T6       Female  6                    Certified
T8       Female  5                    Certified
T9       Male    4                    Certified
Step-by-step Solution:
Step 1: Categorize Practice Hours
● Low Practice Hours (1-3 hours): Trainees with practice hours 1, 2, or 3
● None of the trainees who practice for Low hours (1-3) are certified,
regardless of gender.
● Gender does not seem to influence certification status directly since both
males and females have similar certification rates within practice hour
groups.
Summary:
The contingency table shows a clear pattern where trainees practicing more
hours (4-6 per week) achieve certification, while those practicing fewer hours
(1-3) do not. This suggests practice hours are a key factor in certification
success, more than gender.
2. A university collects data on students’ Gender, Weekly Study Hours (Low: 1-3,
High: 4-6), and Exam Result (Pass/Fail). The data for 10 students is:
Student  Gender  Study Hours/Week  Exam Result
S1 Male 2 Fail
S2 Female 5 Pass
S3 Male 4 Pass
S4 Female 3 Fail
S5 Male 6 Pass
S6 Female 2 Fail
S7 Male 5 Pass
S8 Female 4 Pass
S9 Male 3 Fail
3. A call center tracks Agent Gender, Number of Calls Handled per Day (Low:
10-15, High: 16-20), and Customer Satisfaction (Satisfied/Not Satisfied) for 10
agents:
Agent  Gender  Calls/Day  Customer Satisfaction
A2 Female 18 Satisfied
A3 Male 16 Satisfied
A5 Male 20 Satisfied
A8 Female 19 Satisfied
A9 Male 17 Satisfied
2. Causal Explanations
A causal relationship implies that changes in one variable directly cause changes
in another. Unlike correlation, which only shows an association between
variables, causation means one variable is responsible for the effect on the other.
Key aspects to establish causality:
1. Temporal precedence: The cause must occur before the effect.
● Physical activity: Less activity might worsen sleep and encourage more
screen time.
● Bedtime routines: Using screens just before bed might have a different
effect than during the day.
Summary:
To explore whether increased screen time causes poor sleep quality, one must
analyze temporal patterns, control for confounding factors, and use statistical
methods to test if the relationship holds when accounting for these variables.
Experimental designs provide the strongest causal evidence.
2. A research team collects data on the amount of time students spend on social
media per day (in hours) and their academic performance (measured by GPA).
The analysis reveals a negative correlation between social media usage and GPA.
How would you determine whether increased social media usage causes lower
academic performance? What other variables or methods would you consider in
your exploratory data analysis to support or challenge this causal link?
Step 1: Understand the Correlation
The negative correlation indicates that students with higher social media usage
tend to have lower GPAs. However, correlation alone does not imply causation.
It’s possible that other factors influence both variables.
● Sleep duration
● Parental supervision
By including these variables in your analysis, you can determine whether the
relationship holds even after adjusting for confounders.
Step 4: Use Statistical Techniques
● Multiple Linear Regression: Include social media usage and confounders to
see if social media remains a significant predictor of GPA.
● Path Analysis: Check for indirect effects (e.g., social media affects sleep,
which affects GPA).
Conclusion:
To explore if social media usage causes lower academic performance:
● Establish time order (usage precedes performance drop),
📌 Example: Increased ice cream sales in summer or retail sales during festivals.
4. Cyclicality
● Long-term, wave-like fluctuations not of fixed length.
4. Time-Based Indexing
Time-based indexing is a technique used in time series analysis where the index
(row labels) of a dataset is made up of timestamps (dates or date-times). This
enables highly efficient, intuitive, and flexible handling of temporal data.
🔧 Example Scenario
Let’s say you have a dataset that records daily temperature.
Date        Temperature (°C)
2023-01-01  30
2023-01-02  31
2023-01-03  32
2023-01-04  30
🐍 Example in Python (Using Pandas)
import pandas as pd

df = pd.DataFrame({'Temperature': [30, 31, 32, 30]},
                  index=pd.to_datetime(['2023-01-01', '2023-01-02',
                                        '2023-01-03', '2023-01-04']))
df.index.name = 'Date'
print(df)
📈 Output:
Temperature
Date
2023-01-01 30
2023-01-02 31
2023-01-03 32
2023-01-04 30
print(df.loc['2023-01-02':'2023-01-03'])
Output:
Temperature
Date
2023-01-02 31
2023-01-03 32
Feature                  Regular Index  Datetime Index
Efficient time slicing   ❌             ✅
Rolling time operations  ❌             ✅
Built-in resampling      ❌             ✅
Time-based grouping      ❌             ✅
🔍 Common Frequencies in Pandas resample()
Code  Frequency
'D'   Daily
'W'   Weekly
'M'   Month-end
'Q'   Quarter-end
'A'   Year-end
'H'   Hourly
✅ Summary
Key Feature Description
📋 You are given daily temperature data for January 2024 in Chennai. Your task is to:
Dataset (Sample):
Date        Temperature (°C)
2024-01-01  28.5
2024-01-02  29.0
2024-01-03  30.2
...         ...
2024-01-31  27.8
🔍 Tasks:
1. Convert the 'Date' column to a datetime format and set it as the index.
5. Plot the daily temperature trend using a line plot (if using Python).
# Monthly average
df['Temperature'].mean()
# Weekly resampling
df.resample('W').mean()
🔍 Visualization is one of the most important steps in Exploratory Data Analysis (EDA) of time series data.
Why is Visualization Important in TSA?
● Helps detect trends (upward/downward movement).
# Load data
df = pd.read_csv("sales_data.csv")
df['Month'] = pd.to_datetime(df['Month'])
df.set_index('Month', inplace=True)
🧠 Grouping time series data means splitting it into meaningful time intervals.
Why Use Grouping?
● To aggregate values by specific time intervals (e.g., monthly sales, yearly
rainfall).
● Useful when the dataset is granular (e.g., daily/hourly) but insights are
needed at a higher level (e.g., weekly/monthly).
📘 Grouping is typically performed on a datetime index or column.
Example Dataset
Let’s consider a dataset of daily sales:
| Date | Sales |
|------------|-------|
| 2022-01-01 | 200 |
| 2022-01-02 | 220 |
| ... | ... |
| 2022-12-31 | 240 |
📝 This returns average monthly sales for each year, helping to compare monthly
patterns across years.
df['Weekday'] = df.index.day_name()  # derive the weekday name from the datetime index
weekday_sales = df.groupby('Weekday')['Sales'].mean()
print(weekday_sales)
✅ What is Resampling?
Resampling in Time Series Analysis is the process of changing the frequency of
time series data. It allows you to convert time series data to a different time scale
– for example:
● From daily to monthly (downsampling)
It is used to summarize, smooth, or reorganize time series for better analysis and visualization.
🔍 Types of Resampling
Types of Resampling
Type          Description                               Example
Downsampling  Reduce the frequency (aggregate values)   Daily → Monthly
Upsampling    Increase the frequency (fill new slots)   Daily → Hourly
🧪 Example Dataset
| Date | Sales |
|------------|-------|
| 2022-01-01 | 200 |
| 2022-01-02 | 210 |
| ... | ... |
| 2022-12-31 | 240 |
📦 Step-by-Step: Downsampling
import pandas as pd
# Load data
df = pd.read_csv("sales.csv")
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
# Downsample daily sales to monthly totals ('M' = month-end)
monthly_sales = df.resample('M').sum()
print(monthly_sales.head())

Other frequency strings work the same way:
● 'Q' = quarterly
● 'A' = annual
📈 Step-by-Step: Upsampling
# Upsample from daily to hourly (fill with NaNs)
hourly_data = df.resample('H').asfreq()
print(hourly_data.head())
📊 Visualization Example
import matplotlib.pyplot as plt

df['Sales'].plot(title='Sales over time')
plt.show()
Aspect       Upsampling                      Downsampling
Data Points  More data points than original  Fewer data points than original
📌 Examples
✅ Upsampling (Monthly → Daily)
df.resample('D').asfreq()
🧠 Summary
● Use Upsampling when you want finer granularity (but be careful with filling
missing values).
🔍 Real-World Applications
● Finance: Resample stock price data from minute-level to daily averages.
● Retail: Aggregate daily sales into monthly totals for seasonality detection.
● Healthcare: Convert weekly health reports into monthly summaries.
● IoT: Aggregate sensor data recorded every second into hourly averages.
8.Problems
Question:
You are given monthly sales data (in units) for a retail store over the last year. Plot
the sales data using a suitable graph and describe the overall trend and any
seasonal patterns you observe. Then:
1. Apply a 4-month moving average to smooth the data and highlight the
underlying trend.
2. Use simple exponential smoothing with a smoothing factor (α) of 0.4 to
analyze the trend further.
Month  Sales (units)
Jan 1200
Feb 1300
Mar 1250
Apr 1400
May 1500
Jun 1600
Jul 1550
Aug 1700
Sep 1800
Oct 1900
Nov 2000
Dec 2100
Jan 2000
Feb 2000
Mar 2000
Step 5: Interpretation
● Moving average smooths short-term variations and shows a clear upward
sales trend.
● Simple exponential smoothing reacts more quickly to recent changes and
helps forecast future sales.
● The forecast indicates stable sales around the latest smoothed value,
assuming no sudden changes.
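Both smoothing steps can be sketched with pandas on the sales figures above: rolling(window=4) gives the 4-month moving average, and ewm(alpha=0.4, adjust=False) implements simple exponential smoothing, s_t = 0.4·y_t + 0.6·s_{t-1} with s_1 = y_1.

```python
import pandas as pd

# Monthly sales from the table above (Jan-Dec)
sales = pd.Series(
    [1200, 1300, 1250, 1400, 1500, 1600, 1550, 1700, 1800, 1900, 2000, 2100],
    index=pd.date_range('2023-01-01', periods=12, freq='MS'))

# 4-month moving average: mean of each window of 4 consecutive months
ma4 = sales.rolling(window=4).mean()

# Simple exponential smoothing with alpha = 0.4
ses = sales.ewm(alpha=0.4, adjust=False).mean()

print(ma4.dropna().head())
print(ses.tail())
```

The first moving-average value, (1200 + 1300 + 1250 + 1400) / 4 = 1287.5, appears in April; the last smoothed value serves as the flat forecast for the next months.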
ARIMA
Question
You have monthly sales data for a clothing store in 2023:
Month Sales
Jan 500
Feb 520
Mar 550
Apr 600
May 620
Jun 630
Jul 700
Aug 720
Sep 690
Oct 680
Nov 710
Dec 750
Task: Apply an ARIMA model to smooth the data, capture patterns, and forecast
sales for Jan-Mar 2024. Interpret the results.
Step 1: Plot and Explore the Data
● Plot the sales over months to observe trend and seasonality.
● Here, sales generally increase over the year, indicating an upward trend.
● There may be slight seasonal behavior (e.g., higher sales in mid and end of
year).
Possible outcome:
● The test shows non-stationarity (p-value > 0.05), so differencing is needed.
Step 3: Differencing
● Apply first-order differencing: subtract each month’s sales from the
previous month’s sales to remove trend.
Example:
● PACF cuts off after lag 1 → p = 1
● ACF tails off slowly → q = 0
● Forecast: The predicted sales for first three months of 2024 provide
actionable insights to plan inventory and marketing.
Summary:
Step  Description  Key Points
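In practice the fitting is done with a library such as statsmodels' ARIMA class; as a transparent sketch, the pure-Python code below hand-rolls an ARIMA(1,1,0)-style fit (the p = 1, d = 1, q = 0 identified above) on the 2023 sales: difference once, fit an AR(1) coefficient on the differences by least squares (no intercept, for brevity), and forecast three steps ahead.

```python
# 2023 monthly sales from the question (Jan-Dec)
sales = [500, 520, 550, 600, 620, 630, 700, 720, 690, 680, 710, 750]

# d = 1: first differencing removes the upward trend
diff = [b - a for a, b in zip(sales, sales[1:])]

# p = 1: least-squares AR(1) coefficient of diff[t] on diff[t-1]
x, y = diff[:-1], diff[1:]
phi = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Forecast differences recursively, then undo the differencing
last_level, last_diff = sales[-1], diff[-1]
forecasts = []
for _ in range(3):
    last_diff = phi * last_diff            # next predicted change
    last_level = last_level + last_diff    # accumulate back to the level
    forecasts.append(round(last_level, 1))

print(phi, forecasts)  # Jan-Mar 2024 sketch forecasts
```

The forecasts rise above December's 750 and flatten out as the predicted changes decay, consistent with the upward trend noted in Step 1.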
Practice:
1. You are given monthly sales data for a retail store over the past year. Plot
the data on a suitable graph and describe the overall trend and any
seasonal variations. Apply a 4-month moving average to smooth the data
and highlight the underlying trend. Use simple exponential smoothing with
a smoothing factor of 0.4 to analyze the trend further and forecast sales for
the next three months. Interpret the smoothed results and the forecasted
sales.
2. You are given monthly data of a certain product’s demand for the past two
years. The data shows a general upward trend with some seasonal
fluctuations.
● Explain how you would check if the data is stationary.
● How would you use ACF and PACF plots to decide the ARIMA model
parameters (p, d, q)?
● Once the ARIMA model is fitted, explain how to forecast the demand for the
next three months.
● Discuss how you would interpret the fitted model in terms of trend,
seasonality, and residual fluctuations.