FML Micro Project
Submitted By
Devraj Patel (236400316116)
Jenish Patel (236400316131)
Maharshi Patel (236400316135)
DIPLOMA ENGINEERING
in
Abstract
The provided dataset consists of product- and outlet-specific features, such as item
weight, visibility, outlet size, establishment year, and the type of items sold.
Our models achieved strong results: the final XGBoost model explained 99.7% of
the variance in sales (an R² score of 0.9970) with an RMSE of 76.68,
demonstrating high accuracy.
The LightGBM model also showed competitive results, highlighting the strength
of both algorithms in solving regression problems. The project demonstrates how
machine learning techniques can be leveraged to solve real-world business
challenges, offering insights into data-driven decision-making for sales
optimization.
Project Introduction
Objective:
Predict the sales of various products across different BigMart outlets using
historical data and machine learning techniques.
Context:
BigMart is a retail chain with multiple stores in different cities, selling a
wide range of consumer products.
Dataset:
Sales data from 2013: The dataset includes historical sales information
from various outlets.
1,559 products across 10 outlets: Contains data for products and their respective sales.
Includes product attributes (e.g., weight, type, MRP) and outlet features
(e.g., size, location, type).
Problem Type:
Supervised Regression Problem
Target variable: Item_Outlet_Sales (the sales value for products in different
outlets).
Approach:
Data Preprocessing and Cleaning:
Handle missing values and inconsistent data formats.
Feature engineering to create new meaningful features (e.g., visibility ratio,
outlet years).
Encoding categorical variables such as item type and outlet location.
Scaling numerical features using StandardScaler.
Models Used:
XGBoost: For its robustness and ability to handle large datasets efficiently.
LightGBM: A faster alternative to XGBoost for regression tasks.
Random Forest Regressor: Used as an ensemble method to improve prediction accuracy.
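Since the report does not reproduce the model-setup code, the following is only a minimal sketch of how the three regressors could be instantiated and compared. The hyperparameter values shown are illustrative, and X_train, y_train, X_val, y_val are assumed names from the data-splitting step described later.

```python
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor

# Illustrative settings only; the project's final hyperparameters
# were obtained later via GridSearchCV.
models = {
    "XGBoost": XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42),
    "LightGBM": LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=300, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)             # assumed training split (see Data Splitting)
    print(name, model.score(X_val, y_val))  # R² on the assumed validation split
```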
Goal:
Build a predictive model with strong generalization performance to accurately forecast product sales.
Results:
Achieved an R² score of 0.9970 and an RMSE of 76.68, indicating high accuracy.
Code Explanation
Imports:
Essential libraries like pandas, numpy, matplotlib, seaborn for data handling
and visualization.
scikit-learn for preprocessing, model evaluation, and splitting.
xgboost for advanced regression modeling.
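Since the full code listing is not reproduced in this report, a representative import block might look like the following; the exact set of imports in the project may differ slightly.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score

from xgboost import XGBRegressor
```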
Data Cleaning:
Standardizes inconsistent values in Item_Fat_Content (e.g., ‘low fat’, ‘LF’
→ ‘Low Fat’).
Fills missing Item_Weight using average weights per Item_Identifier.
Fills missing Outlet_Size with the most frequent category (mode).
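A minimal sketch of these cleaning steps, assuming the train and test frames have been concatenated into a single DataFrame called data; the 'reg' → 'Regular' replacement is an extra assumption beyond the mappings listed above.

```python
# Harmonise inconsistent fat-content labels.
data["Item_Fat_Content"] = data["Item_Fat_Content"].replace(
    {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular"}
)

# Fill missing weights with the mean weight of the same Item_Identifier.
data["Item_Weight"] = data.groupby("Item_Identifier")["Item_Weight"].transform(
    lambda s: s.fillna(s.mean())
)

# Fill missing outlet sizes with the most frequent category (mode).
data["Outlet_Size"] = data["Outlet_Size"].fillna(data["Outlet_Size"].mode()[0])
```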
Feature Engineering:
Creates Item_Visibility_MeanRatio (placeholder for visibility adjustment).
Calculates Outlet_Years (2025 - year of establishment).
Extracts item category prefixes to create Item_Type_Combined.
Updates fat content for non-consumable items as Non-Edible.
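One plausible implementation of these derived features is sketched below. The ratio-based definition of Item_Visibility_MeanRatio and the FD/DR/NC prefix mapping are assumptions about how the placeholder was intended to work.

```python
# Visibility of each item relative to the mean visibility of that item.
data["Item_Visibility_MeanRatio"] = data["Item_Visibility"] / data.groupby(
    "Item_Identifier"
)["Item_Visibility"].transform("mean")

# Years the outlet has been operating, counted from 2025.
data["Outlet_Years"] = 2025 - data["Outlet_Establishment_Year"]

# The first two characters of the identifier encode the broad category.
data["Item_Type_Combined"] = data["Item_Identifier"].str[:2].map(
    {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}
)

# Non-consumable items have no meaningful fat content.
data.loc[
    data["Item_Type_Combined"] == "Non-Consumable", "Item_Fat_Content"
] = "Non-Edible"
```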
Encoding:
Applies pd.get_dummies() to Item_Type_Combined and Item_Type.
Uses LabelEncoder on categorical columns:
Item_Fat_Content, Outlet_Location_Type, Outlet_Size, Outlet_Type,
Outlet_Identifier.
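A sketch of this encoding step, fitting one LabelEncoder per listed column (the project may have kept the encoders around for inverse transforms):

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Label-encode the listed categorical columns.
label_cols = [
    "Item_Fat_Content", "Outlet_Location_Type", "Outlet_Size",
    "Outlet_Type", "Outlet_Identifier",
]
for col in label_cols:
    data[col] = LabelEncoder().fit_transform(data[col])

# One-hot encode the broader item-category columns.
data = pd.get_dummies(data, columns=["Item_Type_Combined", "Item_Type"])
```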
Feature Pruning:
Drops columns not useful for modeling: Item_Identifier,
Outlet_Establishment_Year.
Data Splitting:
Splits combined data back into train_data and test_data based on original
lengths.
Separates features (X) and target (y) from training data.
Splits training set into 85% training and 15% validation using
train_test_split.
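The pruning and splitting steps above might look roughly like this; train_raw stands for the original (pre-concatenation) training frame and is an assumed name.

```python
from sklearn.model_selection import train_test_split

# Drop columns that are no longer useful for modelling.
data = data.drop(columns=["Item_Identifier", "Outlet_Establishment_Year"])

# Split the combined frame back into train and test by the original lengths.
train_data = data.iloc[:len(train_raw)].copy()
test_data = data.iloc[len(train_raw):].drop(columns=["Item_Outlet_Sales"]).copy()

# Separate features and target, then hold out 15% for validation.
X = train_data.drop(columns=["Item_Outlet_Sales"])
y = train_data["Item_Outlet_Sales"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.15, random_state=42
)
```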
Imputation and Scaling:
Applies SimpleImputer (mean strategy) to fill any remaining missing
numerical data.
Scales features using StandardScaler for better model performance.
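A sketch of this step, fitting the imputer and scaler on the training split only and reusing them on the validation (and later test) features:

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

imputer = SimpleImputer(strategy="mean")   # fill remaining numeric gaps with column means
scaler = StandardScaler()                  # standardise to zero mean, unit variance

X_train_prepared = scaler.fit_transform(imputer.fit_transform(X_train))
X_val_prepared = scaler.transform(imputer.transform(X_val))
```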
Output
The final output of this project is a CSV submission file named
BigMart_Final_Submission.csv, which contains the predicted sales
(Item_Outlet_Sales) for each product in the test dataset.
The file mirrors the structure of the sample submission provided.
Each row represents a unique product-outlet combination from the test set.
The Item_Outlet_Sales column holds the predicted sales value for that specific
combination.
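A minimal sketch of producing this file; final_model and id_cols (the Item_Identifier and Outlet_Identifier columns kept aside before encoding and pruning) are assumed names.

```python
# Apply the same imputer/scaler fitted on the training data, then predict.
test_prepared = scaler.transform(imputer.transform(test_data))
predictions = final_model.predict(test_prepared)

# id_cols is assumed to hold the original identifier columns from the test set.
submission = id_cols.copy()
submission["Item_Outlet_Sales"] = predictions
submission.to_csv("BigMart_Final_Submission.csv", index=False)
```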
Given the high R² score of 0.9970 and low RMSE of 76.68, the predicted outputs
are highly reliable and closely aligned with real sales behavior, making them
valuable for actionable insights in a real-world retail setting.
Evaluation
To assess the effectiveness of the regression models, two primary performance
metrics were used:
Root Mean Squared Error (RMSE):
Measures the average magnitude of prediction errors. A lower RMSE
indicates better predictive accuracy.
R² Score (Coefficient of Determination):
Reflects how well the model explains the variance in the target variable. A
value closer to 1 signifies strong predictive power.
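With scikit-learn, both metrics can be computed on the validation split as follows; final_model, X_val_prepared, and y_val are the assumed names from the earlier sketches.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

val_predictions = final_model.predict(X_val_prepared)

rmse = np.sqrt(mean_squared_error(y_val, val_predictions))  # root mean squared error
r2 = r2_score(y_val, val_predictions)                       # coefficient of determination
print(f"RMSE: {rmse:.2f}, R2: {r2:.4f}")
```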
Evaluation Results
Final Model – Optimized XGBoost (on Scaled Data):
RMSE: 76.68
R²: 0.9970
Insights
The optimized XGBoost model achieved exceptional performance, with an
R² of 0.9970, indicating it can explain 99.7% of the variance in the target
variable.
The very low RMSE of 76.68 reflects minimal prediction error, showing
that the model generalizes well.
This performance is a significant improvement over baseline models (e.g.,
Linear, Ridge, Random Forest), clearly demonstrating the value of:
Proper feature engineering
Data scaling and imputation
Hyperparameter tuning with GridSearchCV (a sketch of this step appears at the end of this section)
This model is suitable for real-world deployment to aid BigMart in accurate
demand forecasting and strategic decision-making.
These predictions can be directly used by BigMart’s analytics team for:
Demand forecasting
Inventory planning
Promotion targeting
Strategic business decisions
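As referenced in the insights above, the hyperparameter tuning step could be sketched as follows; the parameter grid shown here is illustrative, not the project's actual search space.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Illustrative grid only; the report does not list the exact values searched.
param_grid = {
    "n_estimators": [300, 500, 800],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.03, 0.05, 0.1],
}

search = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train_prepared, y_train)
print(search.best_params_, -search.best_score_)
```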
Conclusion
This project successfully demonstrated how machine learning techniques can be
leveraged to predict retail product sales with high accuracy. Using a robust
pipeline involving data preprocessing, feature engineering, and model
optimization, we built and fine-tuned several regression models to predict
Item_Outlet_Sales across various BigMart outlets.
Among all the models tested, the XGBoost Regressor delivered outstanding
performance with an RMSE of 76.68 and an R² score of 0.9970, indicating that
the model could explain nearly all the variance in the sales data.
This level of accuracy reflects the effectiveness of our data handling strategies
and the power of ensemble learning in capturing complex patterns within retail
data.
The results suggest that machine learning, when properly applied, can
significantly enhance business forecasting and decision-making. This model can
help BigMart optimize its inventory management, tailor marketing strategies,
and meet customer demand more effectively.
In summary, this project not only met but exceeded the performance goal of
achieving a positive or near-zero R² score, establishing a solid foundation for
further enhancements and real-world deployment.
Thank You