
CAPSTONE PROJECT REPORT

CAC - 3
Machine Learning

“GoldForecast”

A Regression and Time Series Analysis

Submitted By,

Suhani Lariya

22112338

4 B.Sc Economics & Analytics

Under the guidance of

Prof. Vandana Bhagat

Department of Data Science

3rd May 2024

INDEX

Sl.No Table of Contents

1 Introduction
2 Problem Statement
3 Data Description
4 Exploratory Data Analysis (EDA)
5 Model Selection and Model Building

I. INTRODUCTION

This project undertakes a comprehensive analysis of predicting gold prices primarily focused on the USA
market. It aims to leverage regression models and time series forecasting techniques to anticipate
fluctuations in gold prices based on key independent variables such as inflation, unemployment rate,
interest rate, and oil prices.

The project begins with a clear delineation of its objectives and their relevance to contemporary
challenges in the finance domain. Through the systematic exploration of data science methodologies, it
seeks to elucidate the intricate relationships between economic indicators and gold prices, thus
empowering stakeholders to make informed decisions.

Key phases of the project include:

● Problem Identification:

The project identifies the core challenge of predicting gold prices in the USA market and formulates
specific objectives to address this problem.

● Data Collection and Description:

A suitable dataset encompassing historical data on gold prices, inflation rates, unemployment rates,
interest rates, and oil prices is acquired and described in detail.

● Exploratory Data Analysis (EDA) and Preprocessing:

Through EDA, the project delves into the characteristics of the dataset, uncovering patterns, correlations,
and outliers. Preprocessing steps are undertaken to clean and transform the data, ensuring its suitability
for modeling.

● Model Selection and Building:

The project justifies the selection of regression models and time series forecasting techniques based on
the problem statement and dataset characteristics. Models are built and fine-tuned to predict gold prices
accurately.

● Result Interpretation:

The outcomes of the models are analyzed and interpreted, shedding light on the factors influencing gold
prices and the predictive performance of the models.

● Recommendations and Conclusion:

Based on the findings, the project provides actionable recommendations for stakeholders in the finance
industry and concludes by summarizing the project's contributions and potential avenues for future
research.

II. PROBLEM STATEMENT

The problem statements for the study can be categorized into three parts, according to the
technique used to solve them:

1. EDA (Exploratory Data Analysis) - These problems concern the complications and
doubts resolved at the data-preparation stage. They include:
● Missing Values: Check for missing values in any of the columns and decide on
strategies for handling them, such as imputation or removal.
● Outliers: Identify outliers in the Gold Prices column as well as in other
independent variables and determine if they are genuine data points or errors.
● Data Visualization: Examine the type of data and the distribution of Gold Prices
and the other variables using pair plots, histograms, etc.
● Correlation Analysis: Explore the relationships between Gold Prices and other
independent variables (Inflation, Unemployment rate, Interest rates, Oil Prices)
using correlation analysis or visualizations (e.g., scatter plots, correlation
matrices).
● Variable Importance: Determine the relative importance of each independent
variable in predicting Gold Prices using techniques like feature importance.
● Data Scaling: Evaluate whether data scaling is necessary for any of the variables.
● Multicollinearity: Determine whether the data exhibits multicollinearity between
two or more independent variables, and how to deal with it.

Exploring these aspects through EDA can provide valuable insights into the data and help
in building more accurate predictive models for Gold Prices; a minimal sketch of these
checks follows.
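As an illustration, a minimal pandas sketch of these EDA checks might look like the
following, assuming a hypothetical file gold_data.csv with a "Date" column and the
variable names used in this report:

import pandas as pd

# Hypothetical file and column names, matching the variables described in this report.
df = pd.read_csv("gold_data.csv", parse_dates=["Date"])

# Missing values: count nulls per column before deciding on imputation or removal.
print(df.isnull().sum())

# Distributions and summary statistics for each variable.
print(df.describe())

# Correlation of each independent variable with Gold Prices.
print(df.corr(numeric_only=True)["Gold Prices (USD)"])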

2. Regression model - Some problems can be addressed effectively through regression
analysis.
● Prediction: The primary objective of regression analysis in this scenario is to
predict future Gold Prices based on historical data and the values of the
independent variables. By fitting a regression model to the historical data, we can
estimate the relationship between Gold Prices and the independent variables and
then use this relationship to make predictions for future time periods.
● Model Selection: Selecting an appropriate regression model that best fits the data,
by exploring different types of regression models (e.g., linear regression, ridge
regression, Lasso regression), assessing their performance using techniques such
as cross-validation, and fine-tuning the models through hyperparameter tuning to
decide which model to continue with.
● Assumptions Checking: Checking whether the assumptions of regression analysis
are met. These include normality of residuals and the absence of
multicollinearity.
● Model Evaluation: Once you have fitted several regression models, you need to
evaluate their performance to determine which model is better suited for
predicting Gold Prices. Common evaluation metrics for regression models include
Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared
(coefficient of determination). The models are then compared to identify the one
that provides the best balance between accuracy and simplicity.
● Interpretation of Results: Regression analysis also allows you to interpret the
coefficients of the independent variables in the model. This can provide valuable
insights into the factors influencing Gold Prices in the real world.

3. Time-Series model/forecasting - Some questions are best answered with time series
analysis, especially when forecasting.
● Time Series Forecasting: Predict future values of a time series variable (e.g., stock
prices, sales, temperature) based on historical data. EDA can help identify trends,
seasonality, and other patterns in the data, while regression modeling can be used
to build predictive models.
● Seasonality and Trends: Identify seasonality or long-term trends in Gold Prices
and the other variables over time, for example by visualizing time series plots.
● Data Stationarity: Check if the data exhibits stationarity, which is crucial for time
series analysis and forecasting; a sketch of a standard stationarity check follows.
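As a sketch, the Augmented Dickey-Fuller (ADF) test is one standard stationarity check;
the file and column names below are assumptions carried over from the dataset described
later in this report:

import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Hypothetical weekly gold-price series.
df = pd.read_csv("gold_data.csv", parse_dates=["Date"], index_col="Date")
gold = df["Gold Prices (USD)"].dropna()

# ADF null hypothesis: the series has a unit root (i.e., is non-stationary).
stat, pvalue, *_ = adfuller(gold)
print(f"ADF statistic = {stat:.3f}, p-value = {pvalue:.3f}")
if pvalue > 0.05:
    print("Fail to reject the null: the series appears non-stationary.")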

The challenges and insights in a time series project extend beyond analysis and exploratory data
analysis (EDA). They encompass the entire process of completing the project, including data
collection, preprocessing, modeling, and interpretation; these are skills gained through
experience.

III. DATA DESCRIPTION

● The dataset has 2244 rows and 6 columns, including a timestamp. The data collected was
originally a pure time series, but once additional variables were added it effectively became panel data.
● Data Source:

The data was collected from various reputable sources, including the Federal Reserve website,
the U.S. Energy Information Administration (EIA) website, Statista, the World Bank, and
Macrotrends.
These sources provide reliable and authoritative data on economic and financial indicators,
making them suitable for analysis and prediction.
● Data Collection Process:
-The objective was to predict gold prices using time series data. Weekly data for gold prices was
obtained from the Federal Reserve website.
-Additional variables impacting gold prices were collected individually from various sources,
including economic indicators such as inflation, unemployment rate, interest rates, and oil prices.
-Due to variations in data availability and consistency, the time period for analysis was narrowed
down from 1978-2023 to 1980-2020, spanning 40 years.
-This timeframe was chosen to ensure consistency in data coverage across all variables and to
manage the size of the dataset, as handling too many rows of data became challenging.
-Also, while many variables theoretically impact gold prices, in practice consistent data for
them is difficult to find.
● Data Variables:
-Time Series Variable: Gold Prices (USD) - Weekly data from 1980 to 2020.
-Economic Indicators: Inflation, Unemployment Rate, Interest Rates, Oil Prices (USD), and
potentially other relevant variables impacting gold prices.
-Each variable is collected at weekly intervals, providing a detailed view of changes over time.
-Variable Relation with Gold prices:
1. Gold Prices (USD): Represents gold's market value in U.S. dollars per ounce. Gold is
often sought during economic uncertainty or high inflation, leading to increased prices.
2. Inflation: Measures the rate of price increase for goods and services. Gold is considered a
hedge against inflation, so its prices tend to rise during periods of high inflation.
3. Unemployment Rate: Indicates the percentage of the workforce without jobs. High
unemployment rates often coincide with economic instability, thus increasing gold prices.
4. Interest Rates: Reflect the cost of borrowing or returns on investments. Low interest rates
make gold more attractive as an investment, while higher rates may reduce demand and
lower prices.
5. Oil Prices (USD): Represent crude oil's market value per barrel. Changes in oil prices can
impact inflation, and oil generally shares a positive relationship with gold prices.

● Descriptive Statistics:
-The data originally contained some missing values; roughly 1-2 years of data were unavailable.
-Count: The number of non-null observations for each variable. In this case, there are 2136
observations for each variable.
-Mean: The average value of each variable. For example, the mean gold price is approximately
$684.83 USD.
-Min: The minimum value observed for each variable. For instance, the lowest gold price
observed is $253.80 USD.
-25%: The 25th percentile, also known as the first quartile, indicates the value below which
25% of the observations fall. For example, 25% of the gold prices are below $353.99 USD.
-50%: The median, also known as the 50th percentile or second quartile, represents the
middle value of the dataset when sorted in ascending order. For example, the median gold
price is $417.75 USD.
-75%: The 75th percentile, also known as the third quartile, indicates the value below which
75% of the observations fall. For example, 75% of the gold prices are below $1140.94 USD.
-Max: The maximum value observed for each variable. For instance, the highest gold price
observed is $2031.15 USD.
-Std: A higher standard deviation indicates greater variability in the data. For example, the
standard deviation for gold prices is approximately $457.73 USD.

● Visualisation:

A histogram of each variable shows the distribution of the data: Gold Prices (USD) is
right-skewed, and "Inflation" and "Interest rates" are also right-skewed. This skewness can
be addressed with standard transformation techniques.

A scatter plot with "year" on the X axis and "Gold Prices (USD)" on the Y axis shows an
upward trend that strengthens over the years.

● Insight:
The dataset was exceptionally large, making it challenging to efficiently monitor and analyze
each individual value, particularly within Excel. Large datasets can overwhelm traditional
spreadsheet software, leading to performance issues, an increased likelihood of errors, and
difficulty in maintaining data integrity.

IV. EDA (EXPLORATORY DATA ANALYSIS)

1. Missing Values:

The data has a total of 2244 rows, of which two variables have only 2192 and 2136
non-null rows respectively; these variables contain some missing values.

To address missing values in the dataset, the most suitable approach for this time series
data is removal. The availability of data for variables such as inflation and unemployment
rate varied, with inflation data available until 2021 and unemployment rate data only
available until 2020. Since the dataset follows a time series format and missing values
cannot be accurately imputed using methods like mean, median, or forward/backward fill,
it was decided to remove rows with missing values.

By removing rows with missing values, we ensured that the integrity of the time series
data remained intact, preserving the temporal relationships between variables. This
approach avoids introducing bias or inaccuracies that may arise from imputing missing
values.
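A minimal sketch of this removal step, using the hypothetical file and column names from
earlier:

import pandas as pd

df = pd.read_csv("gold_data.csv", parse_dates=["Date"])  # hypothetical file name

# Drop rows with any missing value, preserving the temporal order of the remaining rows.
before = len(df)
df = df.dropna().reset_index(drop=True)
print(f"Removed {before - len(df)} rows with missing values; {len(df)} rows remain.")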

2. Outlier Detection:

Looking into the data, many outliers were present. It was observed that numerous values
around 200 to 300 appeared to be outliers. However, upon further investigation, it was
found that these outliers were not random occurrences but were often associated with
significant economic activities or events. For example, during periods of economic
recession or instability, unemployment rates may spike, leading to outlier values in the
dataset. Similarly, high inflation rates during recessionary periods can also result in outliers.

Given the historical context and the potential significance of these outlier values, it was decided
not to remove them from the dataset. Removing these outliers could potentially obscure the true
underlying patterns and trends in the data. By retaining these outlier values, the dataset provides
a more comprehensive and accurate representation of the economic dynamics and fluctuations
over time.
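For reference, a common way to flag such points is the interquartile-range (IQR) rule; this
is a sketch under the same hypothetical column names, and the flagged points are kept
rather than removed, per the discussion above:

import pandas as pd

df = pd.read_csv("gold_data.csv", parse_dates=["Date"]).dropna()  # hypothetical file

# Flag values outside 1.5 * IQR of the quartiles as candidate outliers.
col = df["Gold Prices (USD)"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
mask = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
print(f"{mask.sum()} candidate outliers flagged (retained in the dataset).")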

3. Visualisation:

From the distribution of each column, it can be seen that almost all the variables have
right-skewed data, meaning that the majority of values lie on the left side of the distribution.
Only "Unemployment rate" has a roughly symmetric distribution, from a broader perspective.

The subplots of each series reveal a trend spanning 1980 to 2020. Most of the variables
appear non-stationary; in particular, the plots for Gold Prices (USD) and Unemployment rate
are clearly non-stationary. This is a common phenomenon for time series and panel data.

-The pairplot below, also known as a scatterplot matrix, displays pairwise relationships
between variables in the dataset. Each variable is plotted against every other variable,
resulting in a grid of scatter plots.

-As seen in the pairplot, Oil Prices and Inflation are not strongly correlated; likewise,
Interest rate and Oil Prices show little correlation.

-Unemployment rate and Gold Prices (USD) appear more correlated according to the pairplot.

4. Correlation Matrix:

-Interest rate and gold prices share a correlation of -0.56, as higher interest rates can lead to
lower demand for gold and, consequently, lower gold prices.

-Theoretically, the relationship between inflation and gold prices is positive, but according to
the correlation matrix it is negative (-0.21). This may be due to external factors such as
government policies, and in any case it is not a very strong relationship.

-The correlation between the unemployment rate and gold prices is -0.01, an extremely weak
negative correlation. Theoretically, the relationship between these two variables may not
always be consistent because other factors influence market sentiment and investor behavior;
in this data, it is slightly negative.

-Changes in oil prices can indirectly affect gold prices. A significant rise in oil prices may raise
concerns about inflation and economic slowdown, prompting investors to hedge with gold, thus
increasing its demand and price. Conversely, a substantial decrease in oil prices may alleviate
inflation worries but signal economic weakness, leading investors to seek safe-haven assets like
gold, likewise driving its prices higher. The correlation coefficient between Oil Prices (USD)
and Gold Prices (USD) is fairly strong at 0.77, which indicates a positive correlation.
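A sketch for computing and visualizing this matrix, again under the hypothetical column names:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("gold_data.csv", parse_dates=["Date"]).dropna()  # hypothetical file

# Pairwise Pearson correlations, annotated on a heatmap.
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix")
plt.show()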

5. Data Scaling (Standardization):

Assessing the need for data scaling, it was decided not to pursue scaling. There is a disparity in
magnitudes between variables: prices are represented by large numbers (Gold Prices (USD),
Oil Prices (USD)), while rates are in decimal form (Unemployment rate, Interest rate). Scaling
these variables could have introduced excessive variation, which led to the decision to retain
the original scale. This choice held even after considering the potential effects of scaling.

Scaling aims to bring all features to a similar scale, typically between 0 and 1 or centered around
a mean of 0 with a standard deviation of 1. However, if the disparity in magnitude between
variables is too large, scaling might not be appropriate as it could distort the relationships
between variables.

In our scenario, where scaling doesn't improve the interpretability or performance of the model
and may even introduce unnecessary variability, it's reasonable not to scale the data.
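For completeness, had standardization been applied, it would look like the following sketch
(not used in this project; file and column names are hypothetical):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("gold_data.csv", parse_dates=["Date"]).dropna()  # hypothetical file
features = ["Inflation", "Unemployment rate", "Interest rate", "Oil Prices (USD)"]

# Rescale each feature to zero mean and unit variance; the report chose NOT to do this.
df[features] = StandardScaler().fit_transform(df[features])
print(df[features].describe().loc[["mean", "std"]])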

6. Multicollinearity:

To check whether the data contains multicollinearity, we used Python. Upon revisiting the
results, some variable pairs, such as Gold Prices (USD) and Oil Prices (USD), and Inflation
and Interest rates, were closely linked, with correlation coefficients ranging from 0.7 to 0.9.
This suggests they share overlapping information, which makes it difficult to separate their
individual effects and implies multicollinearity.
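One standard Python check is the variance inflation factor (VIF); this sketch uses the
hypothetical setup from earlier (the report does not state which diagnostic it ran):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("gold_data.csv", parse_dates=["Date"]).dropna()  # hypothetical file
X = sm.add_constant(df[["Inflation", "Unemployment rate", "Interest rate", "Oil Prices (USD)"]])

# A VIF above roughly 5-10 is a common rule of thumb for problematic multicollinearity.
for i, col in enumerate(X.columns):
    print(f"{col}: VIF = {variance_inflation_factor(X.values, i):.2f}")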

To handle this, principal component analysis (PCA) was considered as a way to reduce
dimensionality, but the results did not seem promising. The scree plot showed a downward
trend with k=1, suggesting PCA might not be a good fit here.

Moreover, dropping any variables would seriously limit the number of factors available for
the analysis. So, for now, we proceed with the data as-is, acknowledging some overlap
between variables while aiming to make the most of them.

V. MODEL SELECTION AND MODEL BUILDING

1. The data requires a regression analysis

● Linear Regression Model : We chose linear regression at first because it's a simple and
interpretable model that provides a good baseline for understanding the relationship
between the input variables and the target variable. Additionally, linear regression
assumes a linear relationship between the features and the target variable, which is a
reasonable starting point for many datasets.

Linear Regression RMSE: 1.2366982197576302e-13

Linear Regression R-squared: 1.0

(When we initially ran linear regression and obtained an accuracy of 1, it might seem
like we've found the perfect model. However, this high accuracy could be indicative of
overfitting, especially if we're testing the model on the same data it was trained on.
Overfitting occurs when a model learns the noise in the training data rather than the
underlying pattern, resulting in poor generalization to new, unseen data.

To address the possibility of overfitting and to ensure that our model generalizes well to
new data, we explored various other models such as decision trees, random forests, and
ensembling techniques like stacking.)
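A sketch of this baseline fit, assuming the hypothetical dataset and a conventional 80/20
split (the report does not state its exact split, and a shuffled split is a simplification for
time series data):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("gold_data.csv", parse_dates=["Date"]).dropna()  # hypothetical file
X = df[["Inflation", "Unemployment rate", "Interest rate", "Oil Prices (USD)"]]
y = df["Gold Prices (USD)"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("R-squared:", r2_score(y_test, pred))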

● Lasso Regression: It offers a regularization technique that helps prevent overfitting by
penalizing the absolute size of the regression coefficients. Lasso performs feature
selection by shrinking some coefficients to exactly zero, effectively removing those
features from the model. This feature selection property is useful when dealing with
datasets with many features, as it can help simplify the model and improve its
interpretability by focusing on the most important features.

Additionally, Lasso regression can handle multicollinearity among the features, which is
common in real-world datasets. Therefore, by using Lasso regression, we aim to build a
more robust and parsimonious model while avoiding overfitting.

Results:

Lasso Regression RMSE: 2.2337202089187744e-06

Lasso Regression R-squared: 1.0

The accuracy is still essentially the same. We will check whether these results are genuinely
valid or a consequence of overfitting.
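A sketch of the Lasso fit under the same hypothetical setup (the regularization strength
alpha is an assumption, not a value stated in the report):

import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("gold_data.csv", parse_dates=["Date"]).dropna()  # hypothetical file
X = df[["Inflation", "Unemployment rate", "Interest rate", "Oil Prices (USD)"]]
y = df["Gold Prices (USD)"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lasso = Lasso(alpha=0.1).fit(X_train, y_train)  # hypothetical alpha
pred = lasso.predict(X_test)
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("R-squared:", r2_score(y_test, pred))
print("Non-zero coefficients:", int((lasso.coef_ != 0).sum()))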

● DECISION TREE:

Decision trees recursively split the data into subsets based on the features that best separate the
target variable. Each split aims to maximize the homogeneity of the target variable within the
resulting subsets.

-The R-squared value of 1.0 indicates that the Decision Tree perfectly predicts the target variable
on the training data.

-The high R-squared value close to 1.0 (specifically 0.9999932907263452) on the testing data
suggests that the model maintains strong predictive performance on unseen data.

-A lower RMSE signifies better model performance; in this case, the RMSE is relatively
low, indicating that the Decision Tree model's predictions are close to the actual values on
average.

However, the Decision Tree also performs exceptionally well on the testing data, with a very
high R-squared value and a low RMSE. This indicates that the model generalizes well and is not
overfitting to the training data, unlike the linear regression model which had perfect accuracy but
might have overfit the data.

Overall, the Decision Tree model demonstrates superior performance compared to the linear
regression model, particularly in terms of generalization to unseen data.
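A sketch of the Decision Tree fit under the same hypothetical setup:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("gold_data.csv", parse_dates=["Date"]).dropna()  # hypothetical file
X = df[["Inflation", "Unemployment rate", "Interest rate", "Oil Prices (USD)"]]
y = df["Gold Prices (USD)"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
print("Train R-squared:", tree.score(X_train, y_train))
print("Test R-squared:", tree.score(X_test, y_test))
print("Test RMSE:", mean_squared_error(y_test, tree.predict(X_test)) ** 0.5)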

1. CROSS VALIDATION:

a) On the Lasso regression model

Cross-validation was performed on the Lasso Regression results by comparing training accuracy
with testing accuracy; the lower the RMSE, the better.

The training and testing mean squared errors (MSEs) are very close, suggesting that the model
generalizes well to unseen data. The cross-validation RMSE scores also indicate consistent
performance across different folds, with a mean RMSE of approximately 0.0002189. The small
difference between training and testing MSEs further suggests that the model is not overfitting.

After performing cross-validation, the overall R-squared of the Lasso Regression indicates
that cross-validation helped improve the model.
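A sketch of k-fold cross-validation for the Lasso model (the fold count and alpha are
assumptions; for strict time series validation, sklearn's TimeSeriesSplit would be a natural
alternative):

import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

df = pd.read_csv("gold_data.csv", parse_dates=["Date"]).dropna()  # hypothetical file
X = df[["Inflation", "Unemployment rate", "Interest rate", "Oil Prices (USD)"]]
y = df["Gold Prices (USD)"]

# 5-fold CV; sklearn reports negative MSE, so negate before taking the square root.
scores = cross_val_score(Lasso(alpha=0.1), X, y, cv=5, scoring="neg_mean_squared_error")
rmse = np.sqrt(-scores)
print("Fold RMSEs:", rmse)
print("Mean RMSE:", rmse.mean())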

b) On the Linear regression model

After performing cross-validation on this model, the results were unchanged, which indicates
that the linear regression model is not a good fit for this data.

Lasso Regression is therefore the model giving the best results.

-Residual Analysis:

Residual analysis allows us to evaluate the goodness of fit of the regression model. If the
residuals are randomly distributed around zero with constant variance, it indicates that the model
captures the underlying relationships in the data well. Deviations from this pattern may suggest
issues such as underfitting or overfitting.

1. Shapiro-Wilk Test: After running this test, we found that the residuals are not normally
distributed.

2. Breusch-Pagan Test: This test checks for heteroscedasticity (non-constant residual variance)
rather than normality; its results likewise indicated that the residual assumptions were violated.
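A sketch of both tests, using residuals from an OLS fit under the hypothetical setup (the
Breusch-Pagan test targets heteroscedasticity, as noted above):

import pandas as pd
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("gold_data.csv", parse_dates=["Date"]).dropna()  # hypothetical file
X = sm.add_constant(df[["Inflation", "Unemployment rate", "Interest rate", "Oil Prices (USD)"]])
resid = sm.OLS(df["Gold Prices (USD)"], X).fit().resid

# Shapiro-Wilk: null hypothesis = residuals are normally distributed.
print("Shapiro-Wilk p-value:", shapiro(resid).pvalue)

# Breusch-Pagan: null hypothesis = residuals have constant variance.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)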

-APPLYING LOG TRANSFORMATION TO GET NORMALLY DISTRIBUTED RESIDUALS

After applying a log transformation, the distribution looked symmetric graphically. However,
re-running the tests gave the same results: the residuals were still not normally distributed.
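A sketch of the transformation step and the re-test (np.log1p is an assumption about how
the log was applied; it guards against zero values):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import shapiro

df = pd.read_csv("gold_data.csv", parse_dates=["Date"]).dropna()  # hypothetical file
X = sm.add_constant(df[["Inflation", "Unemployment rate", "Interest rate", "Oil Prices (USD)"]])

# Log-transform the target to compress its right skew, then re-test residual normality.
y_log = np.log1p(df["Gold Prices (USD)"])
resid = sm.OLS(y_log, X).fit().resid
print("Shapiro-Wilk p-value after log transform:", shapiro(resid).pvalue)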

● ENSEMBLING MODELS

We employed several ensemble learning techniques to improve the predictive performance of
our models. Ensemble methods combine predictions from multiple individual models to
produce a stronger learner.

a. Bagging (Random Forest):

Bagging, or Bootstrap Aggregating, involves training multiple models (often decision trees) on
different bootstrap samples of the training data and averaging their predictions.

Bagging helps reduce overfitting by reducing the variance of the individual models. It achieves
this by training each model on a slightly different subset of the data, leading to more robust
predictions.
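A sketch of a bagging fit via Random Forest under the same hypothetical setup (the number
of trees is an assumption):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("gold_data.csv", parse_dates=["Date"]).dropna()  # hypothetical file
X = df[["Inflation", "Unemployment rate", "Interest rate", "Oil Prices (USD)"]]
y = df["Gold Prices (USD)"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100 trees, each trained on a bootstrap sample; their predictions are averaged.
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
print("Train R-squared:", rf.score(X_train, y_train))
print("Test R-squared:", rf.score(X_test, y_test))
print("Test RMSE:", mean_squared_error(y_test, rf.predict(X_test)) ** 0.5)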

–Training R-squared: The value of 1.0 again indicates a perfect fit on the training data. This is
not uncommon for tree-based models, as they can easily memorize the training data by
creating complex decision boundaries.

–RMSE: The Root Mean Squared Error (RMSE) of 1.1855782531712218 indicates the average
magnitude of the residuals (the differences between the predicted and actual values) in the
model's predictions.

Testing R-squared: The high R-squared value close to 1.0 (specifically 0.999994285557632) on
the testing data indicates that the model generalizes exceptionally well. It maintains strong
predictive performance on unseen data, which is crucial for real-world applications.

Comparing these results with the linear regression model:

The Random Forest model outperforms the linear regression model by a significant margin in
terms of R-squared values. While the linear regression model may have shown perfect accuracy
on the training data, the Random Forest model achieves similarly high accuracy on the testing
data, demonstrating its ability to generalize well.
