ASM - CCA-2 - MK GRP 1

This document summarizes an analysis of a linear multiple regression model performed on a marketing dataset containing variables for TV, radio, social media, and influencer spending, as well as sales amounts. The analysis included exploratory data analysis, data cleaning, and model building. It was found that TV and radio spending were the most significant predictors of sales, with TV having the strongest positive relationship. Models including TV and radio as factors had lower error than models with just one or the other, indicating they were better at predicting sales amounts. In conclusion, the analysis showed TV and radio marketing had the greatest impact on sales.

Uploaded by

ankita wadhe

Indira Institute of Management, Pune

MBA Semester 2 (2021-23)


Advance Statistical Methods CCA-2

Assignment Report Topic

Linear Multiple Regression

Submitted By

MK Grp 1

Ajay Aynile MK B 1
Akshay Rathod MK B 4
Ankita Wadhe MK B 6
Arpan Shah MK B 7
Arpita Pojge MK B 8

Under the Guidance of

Prof. Dr. Punam Bhoyar


Linear Multiple Regression:

Multiple linear regression (MLR), also known simply as multiple regression, is a statistical
technique that uses several explanatory variables to predict the outcome of a response variable.
The goal of multiple linear regression is to model the linear relationship between the explanatory
(independent) variables and the response (dependent) variable.
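
As a sketch, the model described above can be written for this dataset as follows. It assumes that Sales is the response and that the three numeric spending variables are the predictors; the symbol names are illustrative and are not taken from the report.

```latex
% Multiple linear regression for observation i (sketch; symbol names are assumptions)
\[
  \mathrm{Sales}_i = \beta_0 + \beta_1\,\mathrm{TV}_i + \beta_2\,\mathrm{Radio}_i
                   + \beta_3\,\mathrm{SocialMedia}_i + \varepsilon_i,
  \qquad i = 1, \dots, n
\]
```

Here \beta_0 is the intercept, each remaining \beta is the expected change in Sales for a one-unit increase in that predictor with the others held fixed, and \varepsilon_i is the error term.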

Exploratory Data Analysis (EDA) helps in understanding:

• The main characteristics or features of the data.
• The variables and their relationships.
• The important variables that can be used in our problem.

EDA is an iterative approach that includes:

• Generating questions about our data.
• Searching for the answers by using visualization, transformation, and modeling of our data.
• Using the lessons that we learn in order to refine our set of questions or to generate a
  new set of questions.

About the dataset:

The dataset used contains 5 variables: TV, Radio, Social Media, Influencer and Sales. It
represents the company's spending on marketing campaigns through TV, Social Media, Radio and
Influencers, and the amount of Sales generated through these campaigns.

Objective:
To perform Linear Multiple Regression and find out the relationship between the dependent
variable (Sales) and the independent variables.

Analysis tool:
The analysis is done using RStudio.

Exploratory Data Analysis and Data Cleaning:

STEP 1:

Setting the working directory to do the analysis on the selected dataset. The functions used for
this are getwd() and setwd(). The code for the same is as follows:

Code:
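
A minimal sketch of this step in base R. The folder used here is `tempdir()`, a stand-in chosen only so that the example runs anywhere; the report's actual folder is not known, and the variable names `old_dir` and `data_dir` are illustrative.

```r
# Remember and show the current working directory
old_dir <- getwd()
print(old_dir)

# Point R at the folder that holds the dataset.
# tempdir() is a placeholder; replace it with the real project folder.
data_dir <- tempdir()
setwd(data_dir)
print(getwd())

# Restore the original working directory when done
setwd(old_dir)
```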

STEP 2:

The dimensions of the dataset were checked using the function dim() to find the number of
rows (observations) and columns (variables). The dataset was displayed using the View()
function, and str() was used to get its structure. The dataset was then checked for missing
values using is.na(), and the number of missing values was counted using sum(is.na()).

Code:
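
A minimal sketch of these checks, assuming a data frame named `marketing`. The six rows below are made-up stand-in values, not the report's data, and the column name `Social_Media` is an assumption.

```r
# Hypothetical stand-in for the marketing dataset (values are made up)
marketing <- data.frame(
  TV           = c(16L, 13L, 41L, 83L, 15L, NA),
  Radio        = c(6.57, 9.24, 15.89, 30.02, NA, 9.74),
  Social_Media = c(2.91, 2.41, 2.91, 6.92, 1.41, 1.02),
  Influencer   = c("Mega", "Mega", "Macro", "Mega", "Micro", "Nano"),
  Sales        = c(54.73, 46.68, 150.18, 298.25, 56.59, 52.00)
)

print(dim(marketing))   # number of rows and columns
str(marketing)          # type of each column
# View(marketing)       # opens the spreadsheet-style viewer in RStudio

na_flags      <- is.na(marketing)   # logical matrix, TRUE where a value is missing
total_na      <- sum(na_flags)      # total number of missing values
na_per_column <- colSums(na_flags)  # missing values per variable
print(total_na)
print(na_per_column)
```

Counting per column as well as in total shows which variables will need imputation later.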

Result:

The above results show that there are a total of 4572 observations and 5 variables. The
variables Radio, Social Media, and Sales have numeric values, while TV and Influencer have
integer and character values respectively. The dataset also has 26 missing values.

STEP 3:

A scatter plot was drawn for each pair of variables to check whether the relationship is
positive or negative. The variable Influencer was excluded as it has character values. The
scatter plots were drawn using the function plot().

Code:
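
A minimal sketch of the scatter plots, assuming a data frame named `marketing` with made-up values; the column names are assumptions. Calling plot() on a data frame of numeric columns draws a scatter plot for every pair of variables.

```r
# Hypothetical stand-in data (values are made up)
marketing <- data.frame(
  TV           = c(16, 13, 41, 83, 15, 29),
  Radio        = c(6.6, 9.2, 15.9, 30.0, 8.4, 9.7),
  Social_Media = c(2.9, 2.4, 2.9, 6.9, 1.4, 1.0),
  Influencer   = c("Mega", "Mega", "Macro", "Mega", "Micro", "Nano"),
  Sales        = c(54.7, 46.7, 150.2, 298.2, 56.6, 105.9)
)

# Keep only the numeric variables; Influencer holds character values
numeric_cols <- marketing[, c("TV", "Radio", "Social_Media", "Sales")]

# Scatter plot matrix: one panel per pair of variables
plot(numeric_cols, main = "Pairwise scatter plots")

# Single scatter plot of one predictor against Sales
plot(marketing$TV, marketing$Sales,
     xlab = "TV spend", ylab = "Sales", main = "Sales vs TV")
```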

Result:

The above scatter plots show that all the variables have a positive direction, which indicates
a positive relationship between them. Also, from the plots we can say that the relationship
between TV and Sales is strong.

STEP 4:

Boxplots of all the variables were plotted to check whether there are outliers in their
respective observations. The boxplots were drawn using the function boxplot().

Code:
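
A minimal sketch of the outlier check, using made-up values in which Radio deliberately contains one extreme observation. The helper boxplot.stats() from base R lists the points that boxplot() would draw as outliers; using it here is an addition for illustration.

```r
# Hypothetical stand-in data (values are made up); Radio has one extreme value
marketing <- data.frame(
  TV    = c(10, 20, 30, 40, 50, 60, 70),
  Radio = c(5, 6, 7, 8, 9, 10, 40)
)

# One box per variable; points beyond the whiskers are drawn as outliers
boxplot(marketing, main = "Boxplots of spending variables")

# List the outlying values of each variable without reading them off the plot
outliers <- lapply(marketing, function(x) boxplot.stats(x)$out)
print(outliers)
```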

Result:

From the above results it can be observed that the variables Social Media and Radio contain
outliers, while the variables Sales and TV have no outliers.

STEP 5:

The data was cleaned by replacing the missing values of each variable with the mean or median
of that variable. As observed from the boxplots, the variables Social Media and Radio contain
outliers, so their missing values were replaced by the median, which is less affected by
outliers than the mean. The functions used for mean and median are mean() and median()
respectively.

Code:

The syntax used for replacing the values is as follows:

For mean –

dataset_name$variable1_name[is.na(dataset_name$variable1_name)] = mean(dataset_name$variable1_name, na.rm=TRUE)

For median –

dataset_name$variable1_name[is.na(dataset_name$variable1_name)] = median(dataset_name$variable1_name, na.rm=TRUE)
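
A small worked sketch of this imputation on made-up values. The data frame name `marketing` and its contents are illustrative; which variable gets the mean and which gets the median follows the reasoning in this step.

```r
# Hypothetical stand-in data with one missing value per column
marketing <- data.frame(
  TV    = c(10, 20, NA, 40),   # no outliers: impute with the mean
  Radio = c(5, NA, 7, 40)      # contains an outlier (40): impute with the median
)

# is.na() needs the column as its argument to locate the missing entries
marketing$TV[is.na(marketing$TV)] <- mean(marketing$TV, na.rm = TRUE)
marketing$Radio[is.na(marketing$Radio)] <- median(marketing$Radio, na.rm = TRUE)

print(marketing)
print(sum(is.na(marketing)))   # no missing values should remain
```

In this toy example the Radio mean would be pulled up to about 17.3 by the single value 40, while the median stays at 7.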

STEP 6:

Plotting the histogram to check the frequency distribution of each variable. This was done
using the function hist().

Code:
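
A minimal sketch of the histograms on simulated values. The two vectors are generated only to show a flat shape and a right-skewed shape; they are not the report's data.

```r
set.seed(1)

# Simulated stand-ins: a flat (uniform) variable and a right-skewed one
tv     <- runif(200, min = 10, max = 100)
social <- rexp(200, rate = 0.3)

# hist() draws the plot and also returns the bin counts
h_tv     <- hist(tv, main = "Histogram of TV", xlab = "TV spend")
h_social <- hist(social, main = "Histogram of Social Media", xlab = "Social Media spend")

print(h_tv$counts)       # roughly even counts across bins
print(h_social$counts)   # counts fall away towards the right tail
```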

Result:

From the above charts it can be observed that the variables Sales and TV have a roughly
symmetric distribution that is nearly flat. For the variables Social Media and Radio, the
distribution has a longer tail on the right side, which shows that both variables have a
Right Skewed (Positively Skewed) distribution.

STEP 7:

The skewness and kurtosis were found from the Descriptive Statistics summary, using the
function describe().

Code:
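
describe() is not part of base R; it is commonly provided by the psych package, which is an assumption here because the report does not name the package. The sketch below computes the same three quantities (skewness, excess kurtosis, and standard error of the mean) with base R only. The helper names are hypothetical, and the simple moment formulas used here may differ slightly from a package's small-sample corrections.

```r
# Moment-based skewness: 0 for a symmetric distribution, positive for a right tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: 0 for a normal distribution, negative for a flat one
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# Standard error of the mean: sample standard deviation divided by sqrt(n)
std_error <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

x <- c(1, 2, 3, 4, 5)   # small symmetric, flat example
print(c(skew = skewness(x), kurtosis = excess_kurtosis(x), se = std_error(x)))
```

Note that the standard error is expressed in the units of the variable, so it grows with the scale and spread of that variable.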

Result:

STEP 8:

Finding the correlation between the dependent and independent variables in the dataset. The
function used is cor().

Code:
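
A minimal sketch of the correlation step on made-up values; the data frame name and numbers are illustrative only. cor() on a data frame of numeric columns returns the full correlation matrix, and the argument use = "complete.obs" drops rows with missing values.

```r
# Hypothetical stand-in data (values are made up)
marketing <- data.frame(
  TV           = c(10, 20, 30, 40, 50),
  Radio        = c(4.1, 5.0, 11.2, 12.9, 15.3),
  Social_Media = c(1.2, 0.8, 3.1, 2.0, 2.6),
  Sales        = c(36, 71, 105, 142, 178)
)

# Correlation matrix of all numeric variables
cor_matrix <- cor(marketing, use = "complete.obs")
print(round(cor_matrix, 3))

# Correlation of each predictor with Sales, strongest first
sales_cor <- sort(cor_matrix[, "Sales"], decreasing = TRUE)
print(round(sales_cor, 3))
```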

Result:

From the above result we can interpret that the factor most strongly related to Sales is TV. The
relationship is positive, as the correlation coefficient (r) is 0.996, which is close to 1 (a
perfect positive relationship). This means that the more this company spends on TV marketing
campaigns, the higher the sales tend to be. The company should prioritize allocating budget to TV
marketing campaigns. The second priority will be Radio marketing campaigns, which also show a
strong positive relationship (r = 0.867). The company should be critical of spending on Social
Media marketing campaigns, as they will not generate sales to the extent that TV and Radio do.

STEP 9:

Installing the packages to access the library required for performing Multiple Linear
Regression. The functions used are install.packages() and library().

Code:
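
The report does not name the packages it installed, so the sketch below only shows the general install-then-load pattern. The helper `ensure_package` is hypothetical, and "stats" is used as the example because it ships with R and already provides lm(); any other package name would be an assumption.

```r
# Install a package only if it is missing, then attach it
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from CRAN; needs an internet connection
  }
  library(pkg, character.only = TRUE)
}

# "stats" is bundled with R, so nothing is downloaded here
ensure_package("stats")
print("package:stats" %in% search())
```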

STEP 10:

Creating the models for further analysis.

Code:
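
A minimal sketch of fitting the five models with lm(). The data are simulated, and the formulas are inferred from how each model is described in the summaries of Step 11 (model 1 with all three numeric predictors, models 2 to 4 with one predictor each, model 5 with TV and Radio); they are assumptions, not the report's own listing.

```r
set.seed(42)
n  <- 100
tv <- runif(n, 10, 100)

# Simulated stand-in data (not the report's dataset)
marketing <- data.frame(TV = tv, Radio = 0.3 * tv + runif(n, 0, 10))
marketing$Social_Media <- 0.1 * marketing$Radio + rexp(n, rate = 0.5)
marketing$Sales        <- 3.5 * marketing$TV + rnorm(n, sd = 8)

# Assumed model definitions
model1 <- lm(Sales ~ TV + Radio + Social_Media, data = marketing)
model2 <- lm(Sales ~ TV, data = marketing)
model3 <- lm(Sales ~ Radio, data = marketing)
model4 <- lm(Sales ~ Social_Media, data = marketing)
model5 <- lm(Sales ~ TV + Radio, data = marketing)

print(coef(model1))   # intercept and one slope per predictor
```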

Results:

STEP 11:

Creating the model summaries for further analysis.

Code:
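
A minimal sketch of reading a model summary, using simulated data and an assumed formula. The coefficient table returned by summary() holds the estimate, standard error, t value and p-value of each term; statistical significance is judged from the p-value, not from the size of the estimate.

```r
set.seed(42)
n  <- 100
tv <- runif(n, 10, 100)

# Simulated stand-in data (not the report's dataset)
marketing <- data.frame(TV = tv, Radio = 0.3 * tv + runif(n, 0, 10))
marketing$Sales <- 3.5 * marketing$TV + rnorm(n, sd = 8)

fit <- lm(Sales ~ TV + Radio, data = marketing)   # assumed formula
s   <- summary(fit)

print(s$coefficients)   # Estimate, Std. Error, t value, Pr(>|t|)
print(s$r.squared)      # share of the variance in Sales explained by the model

# Terms whose p-value is below the usual 5% level
p_values <- s$coefficients[, "Pr(>|t|)"]
print(names(p_values)[p_values < 0.05])
```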

Result:

According to the summary of model 1, TV and Radio are the most significant factors, and both
affect sales positively.

According to the summary of model 2, TV significantly affects sales positively.

According to the summary of model 3, Radio significantly affects sales positively. The intercept
is also large and positive, and it is statistically significant.

According to the summary of model 4, Social Media significantly affects sales positively. The
intercept is also large and positive, and it is statistically significant.

According to the summary of model 5, TV and Radio significantly affect sales positively.

According to the summary of model 5, when the combined effect of TV and Radio is studied, TV
affects sales positively and more significantly than Social Media.

STEP 12:

Predicting the values of sales from the different models.

Code:
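
A minimal sketch of prediction with predict(), on simulated data with an assumed formula. Without newdata, predict() returns the fitted values for the rows used to build the model; with newdata, it predicts Sales for new spending levels. The `new_spend` values are hypothetical.

```r
set.seed(42)
n  <- 100
tv <- runif(n, 10, 100)

# Simulated stand-in data (not the report's dataset)
marketing <- data.frame(TV = tv, Radio = 0.3 * tv + runif(n, 0, 10))
marketing$Sales <- 3.5 * marketing$TV + rnorm(n, sd = 8)

fit <- lm(Sales ~ TV + Radio, data = marketing)   # assumed formula

# Predicted Sales for the observations the model was fitted on
fitted_sales <- predict(fit)
print(head(cbind(actual = marketing$Sales, predicted = fitted_sales)))

# Predicted Sales for hypothetical new spending levels
new_spend <- data.frame(TV = c(20, 60), Radio = c(8, 20))
pred_new  <- predict(fit, newdata = new_spend)
print(pred_new)
```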

STEP 13:

Calculating the RMSE (root mean square error) of the predictions from the different models.

Code:
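
A minimal sketch of the RMSE comparison in base R. The helper `rmse` is written out here rather than taken from a package, since the report does not say which function it used; the data are simulated and the model formulas are assumptions.

```r
# Root mean square error: typical size of the prediction error, in units of Sales
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

set.seed(42)
n  <- 100
tv <- runif(n, 10, 100)

# Simulated stand-in data (not the report's dataset)
marketing <- data.frame(TV = tv, Radio = 0.3 * tv + runif(n, 0, 10))
marketing$Sales <- 3.5 * marketing$TV + rnorm(n, sd = 8)

models <- list(
  "TV only"    = lm(Sales ~ TV, data = marketing),
  "Radio only" = lm(Sales ~ Radio, data = marketing),
  "TV + Radio" = lm(Sales ~ TV + Radio, data = marketing)
)

# RMSE of each model on the data it was fitted to; lower is better
rmse_values <- sapply(models, function(m) rmse(marketing$Sales, predict(m)))
print(round(rmse_values, 2))
```

Because these values are computed on the same rows used for fitting, adding a predictor can never increase the RMSE; a held-out test set would give a fairer comparison.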

Result:

As the RMSE values of model 3 and model 4 are higher, these models do not predict values close
to the actual values. Model 1, Model 2 and Model 5 are good at predicting the sales, as their
RMSE is lower.

Results:

Conclusion:
From the above results it can be interpreted that the distributions of TV and Sales are
symmetric, as their skewness values are approximately zero, which can be confirmed from their
respective histograms. Similarly, radio and social media have positively skewed distributions,
with higher skewness for social media than for radio, which can also be confirmed from their
respective histograms. The kurtosis values support this: the distribution is flat for TV and
Sales, as their kurtosis values are less than -1, whereas the distribution for radio is slightly
peaked and the distribution for social media has a normal peak, as their kurtosis values are
between -1 and 1.

Based on the above results, the standard error of the mean is highest for the variable sales
and lowest for the variable social media. The standard error is expressed in the units of each
variable, so part of this difference reflects the larger scale and spread of sales. In absolute
terms, the sample mean of sales after data cleaning is estimated with a wider margin around the
true population mean, whereas for social media the standard error is very close to zero, which
shows that its sample mean is approximately equal to its true population mean.
