Internship Doc (A1)
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
by
G.SS. ANJALI
VU21CSEN0101734
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
VISAKHAPATNAM, ANDHRA PRADESH
2024
DECLARATION
I hereby declare that the internship report entitled BigMart Sales Analysis (Machine Learning) is an original work done in the Department of Computer Science and Engineering, GITAM School of Technology, GITAM (Deemed to be University), submitted in partial fulfillment of the requirements for the award of the degree of B.Tech. in Computer Science and Engineering. The work has not been submitted to any other college or university for the award of any degree or diploma.
Date: 29-10-2024
ACKNOWLEDGEMENT:
I would like to thank Coding Raja Technologies for giving me the opportunity to do an internship within the organization.
I would also like to thank all the people who worked along with me at Coding Raja Technologies; with their patience and openness they created an enjoyable working environment. It is indeed with a great sense of pleasure and immense gratitude that I acknowledge the help of these individuals. I am extremely grateful to my department staff members and friends who helped me in the successful completion of this internship.
ABSTRACT:
In this project, we were asked to experiment with a real-world dataset and to explore how machine learning algorithms can be used to find patterns in data. We were expected to gain experience using common data-processing and machine learning libraries, pandas and scikit-learn, and to submit a report about the dataset and the algorithms used. After performing the required tasks on a dataset of my choice, herein lies my final report.
INTRODUCTION:
REGRESSION:
This project is framed as a regression task: we predict a continuous target, Item_Outlet_Sales, from product and outlet attributes.
PROBLEM STATEMENT:
The data scientists at BigMart have collected sales data for 1559 products across 10 stores in different cities for the year 2013. Each product has certain attributes that set it apart from other products.
Breakdown of the Problem Statement:
The aim is to build a predictive model that estimates the sales of each product at a particular outlet (Item_Outlet_Sales), using the product and outlet attributes provided.
LIBRARIES:
import os  # paths to files
import numpy as np  # linear algebra
import pandas as pd  # data processing
import warnings  # warning filter

# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# feature engineering
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# metrics
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import r2_score as R2
from sklearn.model_selection import cross_val_score as CVS

# ML models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# warning handling: suppress warnings in the notebook output
warnings.filterwarnings("ignore")
FILE PATHS:
# list all files under the input directory
for dirname, _, filenames in os.walk('/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# path for the training set
tr_path = "/input/bigmart-sales-data/Train.csv"
# path for the testing set
te_path = "/input/bigmart-sales-data/Test.csv"
TRAINING SET:
TESTING SET:
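The cells below reference tr_df and te_df; the report does not show the loading cell, so here is a minimal sketch consistent with the paths defined above:
# load the training and testing sets into DataFrames
tr_df = pd.read_csv(tr_path)
te_df = pd.read_csv(te_path)

# preview the first rows of each set
print(tr_df.head())
print(te_df.head())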
DATA OVERVIEW:
# column information with non-null counts
# (show_counts replaces the deprecated null_counts in newer pandas)
tr_df.info(verbose=True, show_counts=True)
# summary statistics of the training set
tr_df.describe()
MISSING VALUES:
There are many ways data can end up with missing values; for example, a product's weight may simply not have been recorded at some outlets. Most machine learning libraries (including scikit-learn) raise an error if you try to build a model using data with missing values. As we have some missing data, let's look at how many values are missing in each column:
by numbers
by %
The same counts are computed for both the train and test sets so they can be compared.
# missing values in descending order, as counts and as percentages
print("Train:\n")
print(tr_df.isnull().sum().sort_values(ascending=False), "\n\n",
      tr_df.isnull().sum() / tr_df.shape[0] * 100, "\n\n")
print("Test:\n")
print(te_df.isnull().sum().sort_values(ascending=False), "\n\n",
      te_df.isnull().sum() / te_df.shape[0] * 100, "\n\n")

# value counts of the two columns with missing data
print("Outlet_Size:\n", tr_df.Outlet_Size.value_counts(), "\n\n")
print("Item_Weight:\n", tr_df.Item_Weight.value_counts(), "\n\n")
# fill missing Outlet_Size with the most frequent value (mode)
# train
tr_df['Outlet_Size'] = tr_df['Outlet_Size'].fillna(
    tr_df['Outlet_Size'].dropna().mode().values[0])
# test
te_df['Outlet_Size'] = te_df['Outlet_Size'].fillna(
    te_df['Outlet_Size'].dropna().mode().values[0])
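Item_Weight also has missing entries, but its imputation cell is not shown in the report. The following is a sketch assuming a mean-fill strategy, with the mean taken from the training set for both frames:
# assumed strategy: fill missing Item_Weight with the training-set mean
mean_weight = tr_df['Item_Weight'].mean()
tr_df['Item_Weight'] = tr_df['Item_Weight'].fillna(mean_weight)
te_df['Item_Weight'] = te_df['Item_Weight'].fillna(mean_weight)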
# column names split by dtype (num and cat are used below)
num = tr_df.select_dtypes(include=np.number).columns.tolist()
cat = tr_df.select_dtypes(exclude=np.number).columns.tolist()

# numeric df
BM_num = tr_df[num]
# categoric df
BM_cat = tr_df[cat]
#print(num)
#print(cat)
DATA VISUALIZATION:
For starters, we will create countplots for the categorical columns:
# categorical columns:
['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']
plt.figure(figsize=(6, 4))
sns.countplot(x='Item_Fat_Content', data=tr_df, palette='mako')
plt.xlabel('Item_Fat_Content', fontsize=14)
plt.show()

plt.figure(figsize=(27, 10))
sns.countplot(x='Item_Type', data=tr_df, palette='summer')
plt.xlabel('Item_Type', fontsize=14)
plt.show()

plt.figure(figsize=(15, 4))
sns.countplot(x='Outlet_Identifier', data=tr_df, palette='winter')
plt.xlabel('Outlet_Identifier', fontsize=14)
plt.show()

plt.figure(figsize=(10, 4))
sns.countplot(x='Outlet_Size', data=tr_df, palette='autumn')
plt.xlabel('Outlet_Size', fontsize=14)
plt.show()
# because of the variability of the unique values of the numeric columns,
# a scatter plot against the target value is more useful
for numeric in BM_num[num[:3]]:
    plt.scatter(BM_num[numeric], BM_num['Item_Outlet_Sales'])
    plt.title(numeric)
    plt.ylabel('Item_Outlet_Sales')
    plt.show()
MULTIVARIATE PLOTS:
I want to check how the outlet attributes relate to Item_Outlet_Sales, starting with Outlet_Identifier:
plt.figure(figsize=(27, 10))
sns.barplot(x='Outlet_Identifier', y='Item_Outlet_Sales',
            data=tr_df, palette='gist_rainbow')
plt.xlabel('Outlet_Identifier', fontsize=14)
plt.show()
CORRELATION MATRIX:
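The correlation figure itself is not reproduced here; below is a minimal sketch of how such a heatmap can be drawn from the numeric columns (the colormap and figure size are arbitrary choices, not the report's confirmed settings):
# heatmap of pairwise correlations between the numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(BM_num.corr(), annot=True, cmap='mako')
plt.title('Correlation Matrix')
plt.show()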
FEATURE ENGINEERING:
We have 7 columns we need to delete or encode.
Ordinal variables:
o Item_Fat_Content
o Outlet_Size
o Outlet_Location_Type
Nominal variables:
o Item_Identifier
o Item_Type
o Outlet_Identifier
o Outlet_Type
Numeric values:
o Item_Weight
o Item_Visibility
o Item_MRP
o Outlet_Establishment_Year
o Item_Outlet_Sales (target)
Conclusion:
In my FE process I have decided (a sketch of the implementation follows this list):
1. The columns Outlet_Establishment_Year, Item_Identifier, and Outlet_Identifier don't carry significant predictive value, so we will drop them.
2. All ordinal variables will be label encoded.
3. The columns Outlet_Type and Item_Type will be one-hot encoded.
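A minimal sketch of how these three decisions can be applied with the libraries imported above. Note that pd.get_dummies is used in place of the imported OneHotEncoder for brevity; this is an assumed implementation, not the report's confirmed cell:
# 1. drop the columns judged not to carry significant value
drop_cols = ['Outlet_Establishment_Year', 'Item_Identifier', 'Outlet_Identifier']
tr_df = tr_df.drop(columns=drop_cols)
te_df = te_df.drop(columns=drop_cols)

# 2. label encode the ordinal variables (encoder fit on train, applied to test)
for col in ['Item_Fat_Content', 'Outlet_Size', 'Outlet_Location_Type']:
    le = LabelEncoder()
    tr_df[col] = le.fit_transform(tr_df[col])
    te_df[col] = le.transform(te_df[col])

# 3. one-hot encode the nominal Item_Type and Outlet_Type columns
tr_df = pd.get_dummies(tr_df, columns=['Item_Type', 'Outlet_Type'])
te_df = pd.get_dummies(te_df, columns=['Item_Type', 'Outlet_Type'])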