

Case Study: Excel and R Analysis

Student’s Name

Professor’s Name

Course

Date

Multivariable Linear Regression

The response (dependent) variable in our multivariable linear regression is the account balance of each client.

Most of the predictor (independent) variables are categorical, so dummy variables had to be created before the multivariable linear regression could be performed. The following table shows how each categorical variable was expressed numerically; an illustrative R sketch of this recoding follows the table:

Categorical variable    Numerical expression
Job category            Professional = 1; Nonprofessional = 0
Marital status          Married = 1; Not married = 0
Education level         Primary / Unknown = 1; Secondary = 2; Tertiary = 3
Housing loan            Yes = 1; No = 0
Personal loan           Yes = 1; No = 0
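
The recoding above, and the model fit that produces the output shown below, could be reproduced in R roughly as follows. This is a minimal sketch: the data frame name bank and the raw column names are assumptions, not names taken from the original workbook.

# Assumed raw columns: age, job, marital, education, housing, loan, duration, balance
bank$job_dummy     <- ifelse(bank$job == "professional", 1, 0)
bank$marital_dummy <- ifelse(bank$marital == "married", 1, 0)
bank$education_num <- ifelse(bank$education == "tertiary", 3,
                             ifelse(bank$education == "secondary", 2, 1))  # primary/unknown = 1
bank$housing_dummy <- ifelse(bank$housing == "yes", 1, 0)
bank$loan_dummy    <- ifelse(bank$loan == "yes", 1, 0)

# Least-squares fit of balance on the seven predictors
model <- lm(balance ~ age + job_dummy + marital_dummy + education_num +
              housing_dummy + loan_dummy + duration, data = bank)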

The linear regression output is shown below:



SUMMARY OUTPUT

Regression Statistics
Multiple R 0.173302801722
R Square 0.030033861085
Adjusted R Square 0.023041314254
Standard Error 2473.454712349
Observations 979


ANOVA
             df     SS               MS         F          Significance F
Regression   7      183942358.3105   26277480   4.295125   0.00010842369
Residual     971    5940556845.835   6117978
Total        978    6124499204.145

Multiple R is the multiple correlation coefficient; it measures the correlation between the response variable and the set of predictor variables taken together. Our regression model has Multiple R = 0.1733, a value that indicates a very weak correlation between the response variable and the predictor variables.

R squared is the coefficient of determination; it shows the proportion of the variation in the response that is explained by the regression. Our model has R squared = 0.03, which means that only 3% of the variation of the y-values around their mean is explained by the x-values.

Higher R squared values are generally better, so our value means that the model does not explain much of the variation in the data. The regression model is still significant, however, because the Significance F (p-value) of 0.0001 is less than the significance level of 0.05, indicating that the regression model as a whole is statistically significant.
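
These statistics can be verified directly from the ANOVA table above, since R squared equals the regression sum of squares divided by the total sum of squares:

ss_regression <- 183942358.3105
ss_total      <- 6124499204.145
r_squared     <- ss_regression / ss_total   # about 0.0300, matching the output above
multiple_r    <- sqrt(r_squared)            # about 0.1733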

The table below shows the coefficient for each predictor variable, along with the intercept, which is a constant term.

Term                Coefficients
Intercept           -535.0195887
Age                 12.85198653
Job                 186.5920589
Marital             553.358256
Education           434.2270047
u- housing loan     30.51171675
u- personal loan    -328.7827573
Duration            -0.341798459

The coefficient for Age is approximately 12.85; this means that, holding the other predictors fixed, each additional year of age is associated with an increase of about 12.85 in the predicted account balance. The other variables with positive coefficients, and therefore a positive association with the account balance, are Job, Marital status, Education, and Housing loan. Duration behaves in the opposite direction: as duration increases, the predicted balance decreases, and vice versa. Personal loan also has a negative coefficient, indicating a negative association with the account balance.

Therefore, the equation of our regression model is:

Balance = -535.02 + Age*12.85 + Job*186.59 + Marital*553.36 + Education*434.23 + Housing loan*30.51 - Personal loan*328.78 - Duration*0.34

To predict the bank account balance of a client who is 55 years old, not married, a professional with a tertiary education level, has neither a housing nor a personal loan, and whose oldest bank account has a duration of 300 days, the following calculation is carried out:

Balance = -535.02 + 55*12.85 + 1*186.59 + 0*553.36 + 3*434.23 + 0*30.51 - 0*328.78 - 300*0.34

Estimated bank account balance ≈ $1,559.01
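
As a quick check, the same prediction can be computed in R from the rounded coefficients above:

balance_hat <- -535.02 + 55*12.85 + 1*186.59 + 0*553.36 + 3*434.23 +
  0*30.51 - 0*328.78 - 300*0.34
balance_hat   # about 1559.01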

Residual Analysis

Residuals measure how far the actual data points lie from the points predicted by the linear regression equation. The scatter plot below shows the predicted balances and their respective residuals:

[Figure: scatter plot of the predicted balances and their residuals]

The scatter plot suggests that the assumptions of linear regression are not seriously violated, which is consistent with the model being statistically significant overall.
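
A residual plot of this kind can be produced in R from the fitted model; this sketch assumes model is the lm() object from the earlier sketch:

plot(fitted(model), resid(model),
     xlab = "Predicted Balance", ylab = "Residuals",
     main = "Residuals vs Predicted Balance")
abline(h = 0, lty = 2)   # dashed reference line at zero residual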

Multicollinearity

Multicollinearity occurs when the independent variables in a regression model are correlated with one another. This correlation is undesirable because the independent variables are supposed to be independent. If the degree of correlation between the variables is high enough, it can cause problems when fitting the model and interpreting the findings.

One of the main objectives of regression analysis is to isolate the relationship between each independent variable and the dependent variable: a regression coefficient is interpreted as the average change in the dependent variable for a one-unit change in an independent variable when all the other independent variables are held constant. Our treatment of multicollinearity depends on that last condition, the idea being that the value of one independent variable can be changed without changing the others. When the independent variables are correlated, however, variations in one variable are associated with variations in another, and it becomes more challenging to modify one variable without also changing the other. Because the independent variables tend to change simultaneously, it becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable separately.
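
One common way to check for multicollinearity, not shown in the output above, is to compute variance inflation factors (VIFs) for the fitted model. A minimal sketch using the car package, assuming model is the lm() fit from the earlier sketch:

# install.packages("car")   # one-time installation, if needed
library(car)
vif(model)   # VIF values well above roughly 5-10 would suggest problematic multicollinearity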

Significance of the Predictor Variable

Term                Coefficients     Standard Error   t Stat      P-value
Intercept           -535.0195887     548.4557345      -0.9755     0.329554
Age                 12.85198653      8.741140704      1.470287    0.141808
Job                 186.5920589      185.0874897      1.008129    0.313644
Marital             553.358256       166.6614763      3.320253    0.000933
Education           434.2270047      141.6924942      3.064573    0.00224
u- housing loan     30.51171675      256.4859357      0.118961    0.905331
u- personal loan    -328.7827573     219.1339114      -1.50037    0.133843
Duration            -0.341798459     0.315319285      -1.08398    0.278645

The table above shows the p-value of each predictor variable; a significance level of 0.05 was used. It shows that only two predictor variables are statistically significant in our regression model, marital status and education level, because their p-values are less than the significance level of 0.05. Housing loan, personal loan, age, duration, and job status are not statistically significant, as their p-values are large. When they are eliminated from the linear regression, the output is the following:

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.151353127753621
R Square 0.0229077692808039
Adjusted R Square 0.0209055311031007
Standard Error 2476.15691403559
Observations 979

ANOVA
df SS MS F Significance F
Regression 2 1.4E+08 7E+07 11.44108 1.23E-05


Running Least-Squares Regression in R
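
Continuing the sketch begun after the dummy-coding table, where model was fitted with lm() on the assumed data frame bank, the full regression summary and the reduced model with only the significant predictors could be obtained as follows:

summary(model)   # coefficients, standard errors, t statistics, p-values, R squared

# Reduced model keeping only the statistically significant predictors
model_reduced <- lm(balance ~ marital_dummy + education_num, data = bank)
summary(model_reduced)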
