
“Big Mart Sales Analysis”

An Internship report submitted in partial fulfillment of the requirements

for the Award of the Degree of

BACHELOR OF TECHNOLOGY

In

COMPUTER SCIENCE AND ENGINEERING

by

G.S.S. ANJALI

VU21CSEN0101734

Coding Raja Technologies

Duration: 15/5/24 to 15/7/24

DEPARTMENT OF

COMPUTER SCIENCE AND ENGINEERING

GANDHI INSTITUTE OF TECHNOLOGY AND MANAGEMENT (GITAM)

VISAKHAPATNAM, ANDHRA PRADESH

2024
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

GANDHI INSTITUTE OF TECHNOLOGY AND MANAGEMENT

VISAKHAPATNAM

DECLARATION

I hereby declare that the internship report entitled Big Mart Sales
Analysis (Machine Learning) is an original work done in the Department of
Computer Science and Engineering, GITAM School of Technology, GITAM
(Deemed to be University), submitted in partial fulfillment of the
requirements for the award of the degree of B.Tech. in Computer Science
and Engineering. The work has not been submitted to any other college or
university for the award of any degree or diploma.

Date: 29-10-2024

Registration No.      Name                        Signature

VU21CSEN0101734       Giduturi Sai Sri Anjali


INTERNSHIP OFFER LETTER:

CERTIFICATE ISSUED AFTER INTERNSHIP:

ACKNOWLEDGEMENT:

I would like to thank Coding Raja Technologies for giving me the opportunity
to do an internship within the organization.
I would also like to thank all the people who worked along with me at Coding
Raja Technologies; with their patience and openness they created an enjoyable
working environment. It is indeed with a great sense of pleasure and an
immense sense of gratitude that I acknowledge the help of these individuals.
I am extremely grateful to my department staff members and friends who helped
me in the successful completion of this internship.
ABSTRACT:

In this project, we were asked to experiment with a real-world dataset and to
explore how machine learning algorithms can be used to find patterns in data.
We were expected to gain experience using common data-mining and machine
learning libraries (here, pandas and scikit-learn) and to submit a report
about the dataset and the algorithms used. After performing the required
tasks on a dataset of my choice, herein lies my final report.

INTRODUCTION:

Machine learning (ML) is a branch of artificial intelligence (AI) and
computer science that focuses on using data and algorithms to enable AI
to imitate the way that humans learn, gradually improving its accuracy.
There are several types of machine learning, each with special
characteristics and applications. Some of the main types of machine
learning algorithms are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

SUPERVISED MACHINE LEARNING:

A model is trained on a "labelled dataset"; labelled datasets have both
input and output parameters. In supervised learning, algorithms learn to
map inputs to the correct outputs. Both the training and validation
datasets are labelled.
METHODOLOGY:
1. Start with the least interpretable and most flexible models.
2. Investigate simpler models that are less opaque.
3. Consider using the simplest model that reasonably approximates the
performance of the more complex models.
CLASSIFICATION:

Classification deals with predicting categorical target variables, which
represent discrete classes or labels: for instance, classifying emails as
spam or not spam, or predicting whether a patient has a high risk of heart
disease. Classification algorithms learn to map the input features to one
of the predefined classes.
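
To make this concrete, here is a minimal classification sketch (illustrative
only, not part of the report's code; the toy features and labels are
assumptions):

# A minimal classification sketch: predicting spam (1) vs. not spam (0)
# from two toy numeric features (word count and link count).
from sklearn.linear_model import LogisticRegression

X = [[120, 0], [30, 5], [200, 1], [15, 8]]  # [word_count, link_count]
y = [0, 1, 0, 1]                            # 0 = not spam, 1 = spam

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[25, 6]]))  # likely [1] given this toy data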

REGRESSION:

Regression, on the other hand, deals with predicting continuous target
variables, which represent numerical values: for example, predicting the
price of a house based on its size, location, and amenities, or
forecasting the sales of a product. Regression algorithms learn to map
the input features to a continuous numerical value.
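
A matching minimal regression sketch (again illustrative; the toy sizes and
prices are assumptions):

# A minimal regression sketch: predicting a continuous house price
# from a single toy feature, size in square feet.
from sklearn.linear_model import LinearRegression

X = [[600], [800], [1000], [1200]]    # size (sq. ft.)
y = [150000, 200000, 250000, 300000]  # price

reg = LinearRegression()
reg.fit(X, y)
print(reg.predict([[900]]))  # about 225000 for this perfectly linear toy data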

Advantages of Supervised Machine Learning

 Supervised learning models can achieve high accuracy because they are
trained on labelled data.
 The decision-making process of supervised learning models is often
interpretable.
 Pre-trained supervised models can often be reused, saving time and
resources compared with developing new models from scratch.

Disadvantages of Supervised Machine Learning

 It is limited to the patterns present in the training data and may
struggle with unseen or unexpected patterns.
 It can be time-consuming and costly, as it relies on labelled data only.
 It may generalize poorly to new data.

PROBLEM STATEMENT:

The data scientists at BigMart have collected sales data for 1559 products
across 10 stores in different cities for the year 2013. Each product has
certain attributes that set it apart from other products.

Breakdown of the problem statement:

 Supervised machine learning problem.
 The target variable is Item_Outlet_Sales.

LIBRARIES:

import os            # file paths
import numpy as np   # linear algebra
import pandas as pd  # data processing
import warnings      # warning filter

# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# feature engineering
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# train/test split
from sklearn.model_selection import train_test_split

# metrics
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import r2_score as R2
from sklearn.model_selection import cross_val_score as CVS

# ML models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# default theme and settings
sns.set(context='notebook', style='darkgrid', palette='deep',
        font='sans-serif', font_scale=1, color_codes=False, rc=None)
pd.options.display.max_columns = None  # show all columns when displaying

# warning handling
warnings.filterwarnings("always")
warnings.filterwarnings("ignore")

FILE PATHS:

# list all files under the input directory
for dirname, _, filenames in os.walk('/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# path for the training set
tr_path = "/input/bigmart-sales-data/Train.csv"
# path for the testing set
te_path = "/input/bigmart-sales-data/Test.csv"

PREPROCESSING AND DATA ANALYSIS:

TRAINING SET:

# read in the CSV file as a DataFrame
tr_df = pd.read_csv(tr_path)
# explore the first 5 rows
tr_df.head()

TESTING SET:

# read in the CSV file as a DataFrame
te_df = pd.read_csv(te_path)
# explore the first 5 rows
te_df.head()


# summary statistics for the test set
te_df.describe()

# column information (dtypes and non-null counts)
tr_df.info(verbose=True, show_counts=True)

# summary statistics for the training set
tr_df.describe()
MISSING VALUES:

There are many ways data can end up with missing values. For example:

1. The product wasn't weighed.


2. The data provider didn't include the outlet size of some products.

Most machine learning libraries (including scikit-learn) give an error if
you try to build a model using data with missing values. As you can see, we
have some missing data; let's look at how much we have for each column:

 by count
 by percentage

This analysis covers both the train and test datasets.

# missing values in descending order, as counts and percentages
print("Train:\n")
print(tr_df.isnull().sum().sort_values(ascending=False), "\n\n",
      tr_df.isnull().sum() / tr_df.shape[0] * 100, "\n\n")
print("Test:\n")
print(te_df.isnull().sum().sort_values(ascending=False), "\n\n",
      te_df.isnull().sum() / te_df.shape[0] * 100, "\n\n")

LET'S CHECK THE VALUE COUNTS FOR OUTLET_SIZE AND ITEM_WEIGHT:

print("Outlet_Size:\n",
tr_df.Outlet_Size.value_counts(), "\n\n")
print("Item_Weight:\n",
tr_df.Item_Weight.value_counts(), "\n\n")

# train: fill missing Outlet_Size with the most frequent value (mode)
tr_df['Outlet_Size'] = tr_df['Outlet_Size'].fillna(
    tr_df['Outlet_Size'].dropna().mode().values[0])

# test
te_df['Outlet_Size'] = te_df['Outlet_Size'].fillna(
    te_df['Outlet_Size'].dropna().mode().values[0])

# checking that we filled the missing values
tr_df['Outlet_Size'].isnull().sum(), te_df['Outlet_Size'].isnull().sum()
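
The report fills Outlet_Size but does not show Item_Weight being filled. A
common companion step, sketched here as an assumption rather than the
report's actual code, is mean imputation for the numeric column:

# Assumption: fill missing Item_Weight with the column mean
# (this step is not shown explicitly in the report).
tr_df['Item_Weight'] = tr_df['Item_Weight'].fillna(tr_df['Item_Weight'].mean())
te_df['Item_Weight'] = te_df['Item_Weight'].fillna(te_df['Item_Weight'].mean())

# verify that no missing values remain
tr_df['Item_Weight'].isnull().sum(), te_df['Item_Weight'].isnull().sum()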
DATA EXPLORATION:

# list of all the numeric columns
num = tr_df.select_dtypes('number').columns.to_list()
# list of all the categorical columns
cat = tr_df.select_dtypes('object').columns.to_list()

# numeric df
BM_num = tr_df[num]
# categorical df
BM_cat = tr_df[cat]

# print(num)
# print(cat)

# value counts for every categorical column except Item_Identifier
[tr_df[category].value_counts() for category in cat[1:]]

DATA VISUALIZATION:

For starters we will create countplots for the categorical columns:

# categorical columns:
# ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
#  'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']
plt.figure(figsize=(6, 4))
sns.countplot(x='Item_Fat_Content', data=tr_df, palette='mako')
plt.xlabel('Item_Fat_Content', fontsize=14)
plt.show()

plt.figure(figsize=(27, 10))
sns.countplot(x='Item_Type', data=tr_df, palette='summer')
plt.xlabel('Item_Type', fontsize=14)
plt.show()

plt.figure(figsize=(15, 4))
sns.countplot(x='Outlet_Identifier', data=tr_df, palette='winter')
plt.xlabel('Outlet_Identifier', fontsize=14)
plt.show()

plt.figure(figsize=(10, 4))
sns.countplot(x='Outlet_Size', data=tr_df, palette='autumn')
plt.xlabel('Outlet_Size', fontsize=14)
plt.show()
# Because of the variability of the unique values of the numeric columns,
# a scatter plot against the target value will be of use.
for numeric in num[:3]:
    plt.scatter(BM_num[numeric], BM_num['Item_Outlet_Sales'])
    plt.title(numeric)
    plt.ylabel('Item_Outlet_Sales')
    plt.show()

MULTIVARIATE PLOTS:

I want to check the following relationships with Item_Outlet_Sales:

 Sales per item type
 Sales per outlet
 Sales per outlet type
 Sales per outlet size
 Sales per location type

plt.figure(figsize=(27, 10))
sns.barplot(x='Outlet_Identifier', y='Item_Outlet_Sales',
            data=tr_df, palette='gist_rainbow')
plt.xlabel('Outlet_Identifier', fontsize=14)
plt.show()

CORRELATION MATRIX:
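
The correlation-matrix figure itself is not reproduced here; the following is
a minimal sketch of how such a matrix could be generated from the numeric
columns defined above (the styling choices are assumptions):

# Heatmap of pairwise correlations over the numeric training columns
# (BM_num was defined in the data-exploration step above).
plt.figure(figsize=(8, 6))
sns.heatmap(BM_num.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation matrix of numeric features')
plt.show()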

FEATURE ENGINEERING:
We have 7 columns we need to delete or encode.
 Ordinal variables:
o Item_Fat_Content
o Outlet_Size
o Outlet_Location_Type
 Nominal variables:
o Item_Identifier
o Item_Type
o Outlet_Identifier
o Outlet_Type

Numeric values:

 Of the numeric variables, Outlet_Establishment_Year is no longer needed.
Realizations

 Item_MRP is positively correlated with the target, Item_Outlet_Sales.
 Linear Regression and the Lasso regressor have the best performance in
most categories.
 The Random Forest Regressor's inputs explain only about a third of the
observed variation, so its performance is not optimal even though its
cross-validation score is the highest.
 For better performance these models need tuning, e.g. grid search.
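
The realizations above refer to model scores, but the report does not show
the training code. The following is a hedged sketch consistent with the
imports in the LIBRARIES section; the variables X (encoded feature matrix)
and y (Item_Outlet_Sales) and the split parameters are assumptions:

# Sketch of the evaluation loop implied by the imported models and metrics.
# Assumes X (encoded features) and y (Item_Outlet_Sales) already exist.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Lasso': Lasso(),
    'Random Forest': RandomForestRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    cv = CVS(model, X, y, cv=5).mean()  # mean cross-validation score
    print(f"{name}: MAE={MAE(y_test, preds):.2f}, MSE={MSE(y_test, preds):.2f}, "
          f"R2={R2(y_test, preds):.3f}, CV={cv:.3f}")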

Conclusion:

In my feature-engineering (FE) process I decided:

1. The columns Outlet_Establishment_Year, Item_Identifier and
Outlet_Identifier don't carry significant value, so we will drop them.
2. All ordinal variables will be label encoded.
3. The columns Outlet_Type and Item_Type will be one-hot encoded.
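
A minimal sketch of these three decisions in code (not the report's original
implementation; the report imports OneHotEncoder, but pd.get_dummies achieves
the same one-hot encoding more compactly):

# 1. Drop the columns judged not significant
drop_cols = ['Outlet_Establishment_Year', 'Item_Identifier', 'Outlet_Identifier']
tr_df = tr_df.drop(columns=drop_cols)
te_df = te_df.drop(columns=drop_cols)

# 2. Label-encode the ordinal variables (fit on train, transform test)
for col in ['Item_Fat_Content', 'Outlet_Size', 'Outlet_Location_Type']:
    le = LabelEncoder()
    tr_df[col] = le.fit_transform(tr_df[col])
    te_df[col] = le.transform(te_df[col])

# 3. One-hot encode the nominal variables kept for modelling
tr_df = pd.get_dummies(tr_df, columns=['Outlet_Type', 'Item_Type'])
te_df = pd.get_dummies(te_df, columns=['Outlet_Type', 'Item_Type'])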

-----------Thank you ----------
