Internship Doc (A1)
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
by
G.SS. ANJALI
VU21CSEN0101734
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
VISAKHAPATNAM, ANDHRA PRADESH
2024
DECLARATION
I hereby declare that the internship report entitled BigMart Sales Analysis (Machine Learning) is an original work done in the Department of Computer Science and Engineering, GITAM School of Technology, GITAM (Deemed to be University), submitted in partial fulfillment of the requirements for the award of the degree of B.Tech. in Computer Science and Engineering. The work has not been submitted to any other college or university for the award of any degree or diploma.
Date: 29-10-2024
ACKNOWLEDGEMENT:
I would like to thank Coding Raja Technologies for giving me the opportunity to do an internship within the organization.
I would also like to thank all the people who worked along with me at Coding Raja Technologies; with their patience and openness they created an enjoyable working environment. It is indeed with a great sense of pleasure and immense gratitude that I acknowledge the help of these individuals. I am extremely grateful to my department staff members and friends who helped me in the successful completion of this internship.
ABSTRACT:
In this project, we were asked to experiment with a real-world dataset and to explore how machine learning algorithms can be used to find patterns in data. We were expected to gain experience using common data-processing and machine learning libraries, pandas and scikit-learn, and to submit a report about the dataset and the algorithms used. After performing the required tasks on a dataset of my choice, herein lies my final report.
INTRODUCTION:
REGRESSION:
This project is framed as a regression task: we predict a continuous target, Item_Outlet_Sales, from product and outlet attributes.
PROBLEM STATEMENT:
The data scientists at BigMart have collected sales data for 1559 products across 10 stores in different cities for the year 2013. Each product has certain attributes that set it apart from other products.
Breakdown of the Problem Statement:
The aim is to build a predictive model that estimates the sales of each product at a particular outlet (Item_Outlet_Sales), using the product and outlet attributes provided.
LIBRARIES:
import os  # paths to files
import numpy as np  # linear algebra
import pandas as pd  # data processing
import warnings  # warning filter

# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# feature engineering
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# metrics
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import r2_score as R2
from sklearn.model_selection import cross_val_score as CVS

# ML models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# warning handling: suppress warnings in the notebook output
warnings.filterwarnings("ignore")
FILE PATHS:
# list all files under the input directory
for dirname, _, filenames in os.walk('/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# path for the training set
tr_path = "/input/bigmart-sales-data/Train.csv"
# path for the testing set
te_path = "/input/bigmart-sales-data/Test.csv"
TRAINING SET:
TESTING SET:
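The cells below reference tr_df and te_df; the report does not show the loading cell, so here is a minimal sketch consistent with the paths defined above:
# load the training and testing sets into DataFrames
tr_df = pd.read_csv(tr_path)
te_df = pd.read_csv(te_path)

# preview the first rows of each set
print(tr_df.head())
print(te_df.head())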
DATA OVERVIEW:
# column information with non-null counts
# (show_counts replaces the deprecated null_counts in newer pandas)
tr_df.info(verbose=True, show_counts=True)
# summary statistics of the training set
tr_df.describe()
MISSING VALUES:
There are many ways data can end up with missing values; for example, a product's weight may simply not have been recorded at some outlets. Most machine learning libraries (including scikit-learn) raise an error if you try to build a model using data with missing values. As we have some missing data, let's look at how many values are missing in each column:
by numbers
by %
The same counts are computed for both the train and test sets so they can be compared.
# missing values in descending order, as counts and as percentages
print("Train:\n")
print(tr_df.isnull().sum().sort_values(ascending=False), "\n\n",
      tr_df.isnull().sum() / tr_df.shape[0] * 100, "\n\n")
print("Test:\n")
print(te_df.isnull().sum().sort_values(ascending=False), "\n\n",
      te_df.isnull().sum() / te_df.shape[0] * 100, "\n\n")

# value counts of the two columns with missing data
print("Outlet_Size:\n", tr_df.Outlet_Size.value_counts(), "\n\n")
print("Item_Weight:\n", tr_df.Item_Weight.value_counts(), "\n\n")
# fill missing Outlet_Size with the most frequent value (mode)
# train
tr_df['Outlet_Size'] = tr_df['Outlet_Size'].fillna(
    tr_df['Outlet_Size'].dropna().mode().values[0])
# test
te_df['Outlet_Size'] = te_df['Outlet_Size'].fillna(
    te_df['Outlet_Size'].dropna().mode().values[0])
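Item_Weight also has missing entries, but its imputation cell is not shown in the report. The following is a sketch assuming a mean-fill strategy, with the mean taken from the training set for both frames:
# assumed strategy: fill missing Item_Weight with the training-set mean
mean_weight = tr_df['Item_Weight'].mean()
tr_df['Item_Weight'] = tr_df['Item_Weight'].fillna(mean_weight)
te_df['Item_Weight'] = te_df['Item_Weight'].fillna(mean_weight)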
# column names split by dtype (num and cat are used below)
num = tr_df.select_dtypes(include=np.number).columns.tolist()
cat = tr_df.select_dtypes(exclude=np.number).columns.tolist()

# numeric df
BM_num = tr_df[num]
# categoric df
BM_cat = tr_df[cat]
#print(num)
#print(cat)
DATA VISUALIZATION:
For starters, we will create countplots for the categorical columns:
# categorical columns:
['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']
plt.figure(figsize=(6, 4))
sns.countplot(x='Item_Fat_Content', data=tr_df, palette='mako')
plt.xlabel('Item_Fat_Content', fontsize=14)
plt.show()

plt.figure(figsize=(27, 10))
sns.countplot(x='Item_Type', data=tr_df, palette='summer')
plt.xlabel('Item_Type', fontsize=14)
plt.show()

plt.figure(figsize=(15, 4))
sns.countplot(x='Outlet_Identifier', data=tr_df, palette='winter')
plt.xlabel('Outlet_Identifier', fontsize=14)
plt.show()

plt.figure(figsize=(10, 4))
sns.countplot(x='Outlet_Size', data=tr_df, palette='autumn')
plt.xlabel('Outlet_Size', fontsize=14)
plt.show()
# because of the variability of the unique values of the numeric columns,
# a scatter plot against the target value is more useful
for numeric in BM_num[num[:3]]:
    plt.scatter(BM_num[numeric], BM_num['Item_Outlet_Sales'])
    plt.title(numeric)
    plt.ylabel('Item_Outlet_Sales')
    plt.show()
MULTIVARIATE PLOTS:
I want to check how the outlet attributes relate to Item_Outlet_Sales, starting with Outlet_Identifier:
plt.figure(figsize=(27, 10))
sns.barplot(x='Outlet_Identifier', y='Item_Outlet_Sales',
            data=tr_df, palette='gist_rainbow')
plt.xlabel('Outlet_Identifier', fontsize=14)
plt.show()
CORRELATION MATRIX:
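The correlation figure itself is not reproduced here; below is a minimal sketch of how such a heatmap can be drawn from the numeric columns (the colormap and figure size are arbitrary choices, not the report's confirmed settings):
# heatmap of pairwise correlations between the numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(BM_num.corr(), annot=True, cmap='mako')
plt.title('Correlation Matrix')
plt.show()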
FEATURE ENGINEERING:
We have 7 columns we need to delete or encode.
Ordinal variables:
o Item_Fat_Content
o Outlet_Size
o Outlet_Location_Type
Nominal variables:
o Item_Identifier
o Item_Type
o Outlet_Identifier
o Outlet_Type
Numeric values:
o Item_Weight
o Item_Visibility
o Item_MRP
o Outlet_Establishment_Year
o Item_Outlet_Sales (target)
Conclusion:
In my FE process I have decided (a sketch of the implementation follows this list):
1. The columns Outlet_Establishment_Year, Item_Identifier, and Outlet_Identifier don't carry significant predictive value, so we will drop them.
2. All ordinal variables will be label encoded.
3. The columns Outlet_Type and Item_Type will be one-hot encoded.
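A minimal sketch of how these three decisions can be applied with the libraries imported above. Note that pd.get_dummies is used in place of the imported OneHotEncoder for brevity; this is an assumed implementation, not the report's confirmed cell:
# 1. drop the columns judged not to carry significant value
drop_cols = ['Outlet_Establishment_Year', 'Item_Identifier', 'Outlet_Identifier']
tr_df = tr_df.drop(columns=drop_cols)
te_df = te_df.drop(columns=drop_cols)

# 2. label encode the ordinal variables (encoder fit on train, applied to test)
for col in ['Item_Fat_Content', 'Outlet_Size', 'Outlet_Location_Type']:
    le = LabelEncoder()
    tr_df[col] = le.fit_transform(tr_df[col])
    te_df[col] = le.transform(te_df[col])

# 3. one-hot encode the nominal Item_Type and Outlet_Type columns
tr_df = pd.get_dummies(tr_df, columns=['Item_Type', 'Outlet_Type'])
te_df = pd.get_dummies(te_df, columns=['Item_Type', 'Outlet_Type'])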