AAAAAAAAAAAAAAAAAAAAAAAAA

Uploaded by

Hoang The Anh
SomethingHappy

2023-08-14

Activity 01
1. Dataset introduction
Before examining the dataset, we set it up as follows:
• Set the working directory to the specified address.
• Read the data file “hprice1.csv” into the variable named “data”.
• Use the “attach” function so that individual columns can be referenced by name as vectors.
• Supply the column names through “col.names”, as listed in the description file. Note that “header = TRUE” makes R treat the first line of the file as a header row rather than as data.
setwd("C:/Users/Admin/Desktop")
data <- read.csv("hprice1.csv", header = TRUE, col.names = c("price",
"assess", "bdrms", "lotsize", "sqrft", "colonial", "lprice", "lassess",
"llotsize", "lsqrft"))
attach(data)

1.1. Brief description


dim(data)

## [1] 88 10

• The dataset consists of 88 rows and 10 columns, including: price, assess, bdrms,
lotsize, sqrft, colonial, lprice, lassess, llotsize, lsqrft.
• The dataset name is “data 6”.

1.2. Format and structure of dataset


The dataset consists of below columns:
summary(data)

## price assess bdrms lotsize sqrft
## Min. :111.0 Min. :198.7 Min. :2.000 Min. : 1000 Min. :1171
## 1st Qu.:230.0 1st Qu.:253.9 1st Qu.:3.000 1st Qu.: 5733 1st Qu.:1660
## Median :265.5 Median :290.2 Median :3.000 Median : 6430 Median :1845
## Mean :293.5 Mean :315.7 Mean :3.568 Mean : 9020 Mean :2014
## 3rd Qu.:326.2 3rd Qu.:352.1 3rd Qu.:4.000 3rd Qu.: 8583 3rd Qu.:2227
## Max. :725.0 Max. :708.6 Max. :7.000 Max. :92681 Max. :3880
## colonial lprice lassess llotsize
## Min. :0.0000 Min. :4.710 Min. :5.292 Min. : 6.908
## 1st Qu.:0.0000 1st Qu.:5.438 1st Qu.:5.537 1st Qu.: 8.654
## Median :1.0000 Median :5.582 Median :5.671 Median : 8.769
## Mean :0.6932 Mean :5.633 Mean :5.718 Mean : 8.905
## 3rd Qu.:1.0000 3rd Qu.:5.788 3rd Qu.:5.864 3rd Qu.: 9.058
## Max. :1.0000 Max. :6.586 Max. :6.563 Max. :11.437
## lsqrft
## Min. :7.066
## 1st Qu.:7.415
## Median :7.520
## Mean :7.573
## 3rd Qu.:7.708
## Max. :8.264

Number   Column     Meaning
1        price      House price, $1000s
2        assess     Assessed value, $1000s
3        bdrms      Number of bedrooms
4        lotsize    Size of lot in square feet
5        sqrft      Size of house in square feet
6        colonial   =1 if home is colonial style
7        lprice     log(price)
8        lassess    log(assess)
9        llotsize   log(lotsize)
10       lsqrft     log(sqrft)

2. Descriptive statistics
Prepare functions for later use:
mode <- function(x){
freq_table <- table(x)
mode_value <- as.numeric(names(freq_table)[which.max(freq_table)])
return(mode_value)
}

# Number of boxplot outliers (points beyond 1.5 * IQR from the hinges)
outliers <- function(x) {
  out <- boxplot.stats(x)$out
  return(length(out))
}

# Share of observations that are boxplot outliers
percent_outliers <- function(x) {
  out <- boxplot.stats(x)$out
  return(length(out) / length(x))
}
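
The helper functions above rely on `boxplot.stats()`, which flags points lying more than 1.5 times the interquartile range beyond the hinges. A minimal sketch of that rule on a toy vector (the values are hypothetical and not taken from the dataset):

```r
# Toy vector (hypothetical values): five ordinary points and one extreme point.
x <- c(1, 2, 3, 4, 5, 100)

stats <- boxplot.stats(x)
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # points flagged as outliers

# The same two quantities the helper functions report.
n_out <- length(stats$out)
share_out <- n_out / length(x)
cat("Outliers:", n_out, " share:", share_out, "\n")
```

Here the hinges are 2 and 5, so the upper fence is 5 + 1.5 * 3 = 9.5 and only the value 100 is flagged.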

2.1. Quantitative statistics

2.1.1. House price


hist(price, main = "House price", xlab = "Price (in $1000)")

boxplot(price)
cat("Mode of price: ", mode(price),"\n" )

## Mode of price: 225

cat("Summary of price: ", "\n")

## Summary of price:

summary(price)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 111.0 230.0 265.5 293.5 326.2 725.0

cat("Total outliers: ", outliers(price), " ", percent_outliers(price))

## Total outliers: 5 0.05681818

• The histogram shows that most house prices lie between 230 and 300 thousand dollars. The right skew of the histogram is consistent with the mean of house price exceeding the median and the mode (mode < median < mean).
• There are 5 outliers, which account for 5.7% of the dataset. This share is small, so it should not affect our analysis results much.

2.1.2. Assessed value


hist(assess, xlim = c(0,800), main = "Assessed value", xlab = "Assessed value (in $1000)")
boxplot(assess)

cat("Mode of assessed value: ", mode(assess),"\n" )


## Mode of assessed value: 198.7

cat("Summary of assessed value: ", "\n")

## Summary of assessed value:

summary(assess)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 198.7 253.9 290.2 315.7 352.1 708.6

cat("Total outliers: ", outliers(assess), " ", percent_outliers(assess))

## Total outliers: 5 0.05681818

• The histogram shows that most assessed values lie between 250 and 300 thousand dollars. The right skew of the histogram is consistent with the mean of assessed value exceeding the median and the mode (mode < median < mean).
• There are 5 outliers, which account for 5.7% of the dataset. This share is small, so it should not affect our analysis results much.

2.1.3. Number of bedrooms


hist(bdrms)

plot(bdrms)
# Label each slice with the bedroom count it represents (the data start at 2 bedrooms)
bdrms_counts <- table(bdrms)
pie(bdrms_counts, labels = paste(names(bdrms_counts), "bdrms"),
col = c("red", "orange", "yellow", "lightgreen", "lightblue", "violet", "cyan"),
main = "Pie Chart for Bedrooms attribute")
cat("Mode of bedrooms: ", mode(bdrms), "\n")

## Mode of bedrooms: 3

cat("Summary of bedrooms: ", "\n")

## Summary of bedrooms:

summary(bdrms)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 2.000 3.000 3.000 3.568 4.000 7.000

cat("Total outliers: ", outliers(bdrms), " ", percent_outliers(bdrms))

## Total outliers: 2 0.02272727

• The plots show that most houses have 3 or 4 bedrooms.
• There are 2 outliers, which account for 2.3% of the dataset. This share is small, so it should not affect our analysis results much.

2.1.4. Lot size


hist(lotsize/1000, main = "Histogram of lotsize (per 1000ft)", xlab =
"lotsize/1000ft")
boxplot(lotsize)

cat("Mode of lot size: ", mode(lotsize),"\n" )

## Mode of lot size: 6000

cat("Summary of lot size: ", "\n")

## Summary of lot size:

summary(lotsize)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 1000 5733 6430 9020 8583 92681

cat("Total outliers: ", outliers(lotsize), " ", percent_outliers(lotsize))

## Total outliers: 11 0.125

• The histogram shows that most lot sizes lie under 20 thousand square feet. The right skew of the histogram is consistent with the mean of lot size exceeding the median and the mode (mode < median < mean).
• There are 11 outliers, which account for 12.5% of the dataset. This share is large enough to have some effect on our analysis results.

2.1.5. House size


hist(sqrft, main = "Histogram of house size", xlim = c(0, 5000), xlab =
"House size")

boxplot(sqrft)
cat("Mode of house size: ", mode(sqrft),"\n" )

## Mode of house size: 1536

cat("Summary of house size: ", "\n")

## Summary of house size:

summary(sqrft)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 1171 1660 1845 2014 2227 3880

cat("Total outliers: ", outliers(sqrft), " ", percent_outliers(sqrft))

## Total outliers: 6 0.06818182

• The histogram shows that most house sizes lie between 1500 and 2000 square feet. The right skew of the histogram is consistent with the mean of house size exceeding the median and the mode (mode < median < mean).
• There are 6 outliers, which account for 6.8% of the dataset. This share is small, so it should not affect our analysis results much.

2.1.6. The remaining variables


The four remaining variables are computed by applying the natural logarithm `log()` to four values related to price and area: house price, assessed value, lot size, and house size. Log-transforming these variables is intended to make their relationships closer to linear in later analyses.
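
As a small illustration of the transformation, the sketch below applies `log()` to the five lot-size values reported by `summary(lotsize)` above. The vector name `lot` is introduced here only for illustration.

```r
# Five-number summary of lotsize as reported above (min, Q1, median, Q3, max).
lot <- c(1000, 5733, 6430, 8583, 92681)

# Natural logarithm, as used for llotsize.
llot <- log(lot)
round(llot, 3)

# The raw scale spans a factor of about 93 between min and max,
# while the log scale spans only a few units.
max(lot) / min(lot)
diff(range(llot))
```

The transformed minimum and maximum agree with the `llotsize` column of the summary (6.908 and 11.437), and the long right tail of the raw values is strongly compressed.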

2.2. Qualitative statistics

2.2.1. Colonial style


# Pie chart of the binary 'colonial' indicator
pie(table(colonial), main = "Pie Chart for Colonial style")

table(colonial)

## colonial
## 0 1
## 27 61

The pie chart shows that about one-third of the houses (27 of 88) are not colonial style, so colonial-style houses (61 of 88) are the majority.

3. Assessment on the average of two quantitative variables and applying hypothesis testing.
Let us consider two quantitative variables: house price and lotsize (size of lot in square feet).

3.1. House price


summary(price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 111.0 230.0 265.5 293.5 326.2 725.0

The output shows that the sample mean of house prices is 293.5 thousand dollars.
Hypothesis testing: let us test whether the mean of house prices is different from 293.5 thousand dollars.
$H_0 : \mu_0 = 293.5 \\ H_1 : \mu_0 \ne 293.5$
t.test(x = price, mu = 293.5, alternative = "two.sided")

##
## One Sample t-test
##
## data: price
## t = 0.0042043, df = 87, p-value = 0.9967
## alternative hypothesis: true mean is not equal to 293.5
## 95 percent confidence interval:
## 271.7831 315.3089
## sample estimates:
## mean of x
## 293.546

Using a one-sample t-test on the house price attribute of the dataset, we obtain p-value = 0.9967, which is close to 1. Therefore, we fail to reject the null hypothesis: the data are consistent with a mean house price of 293.5 thousand dollars. This is expected, because the hypothesized value is the rounded sample mean.
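
For reference, the statistic reported by `t.test()` is t = (sample mean - hypothesized mean) / (s / sqrt(n)), with n - 1 degrees of freedom. The sketch below reproduces it by hand on simulated data; the sample `x` and the value `mu0` are hypothetical and unrelated to the house price data.

```r
# Hypothetical sample, used only to illustrate the computation.
set.seed(1)
x <- rnorm(30, mean = 50, sd = 10)
mu0 <- 48   # hypothesized mean (assumed for this example)

# One-sample t statistic and two-sided p-value computed by hand.
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
p_manual <- 2 * pt(abs(t_manual), df = length(x) - 1, lower.tail = FALSE)

# The built-in test should give the same two numbers.
res <- t.test(x, mu = mu0, alternative = "two.sided")
c(manual = t_manual, builtin = unname(res$statistic))
c(manual = p_manual, builtin = res$p.value)
```

When the hypothesized value is set almost equal to the sample mean, as in the test above, the numerator is close to zero, so t is close to 0 and the p-value is close to 1.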

3.2. Lot size


summary(lotsize)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 1000 5733 6430 9020 8583 92681

The output shows that the sample mean of lot size is 9020 square ft.
Hypothesis testing: let us test whether the mean of lot size is different from 9020 square ft.
$H_0 : \mu_0 = 9020 \\ H_1 : \mu_0 \ne 9020$
t.test( x = lotsize, mu = 9020, alternative = "two.sided")

##
## One Sample t-test
##
## data: lotsize
## t = -0.00012573, df = 87, p-value = 0.9999
## alternative hypothesis: true mean is not equal to 9020
## 95 percent confidence interval:
## 6864.167 11175.560
## sample estimates:
## mean of x
## 9019.864
Using a one-sample t-test on the lot size attribute of the dataset, we obtain p-value = 0.9999, which is close to 1. Therefore, we fail to reject the null hypothesis: the data are consistent with a mean lot size of 9020 square ft. This is expected, because the hypothesized value is the rounded sample mean.

4. Assessment on the characteristics of two qualitative variables and applying hypothesis testing.
Because there is only one qualitative variable (colonial style) in the given dataset, we create one more. The new variable, named “house”, is 1 if the lot size is at least 5000 square ft and 0 otherwise.
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.1

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':


##
## filter, lag

## The following objects are masked from 'package:base':


##
## intersect, setdiff, setequal, union

data <- data %>%


mutate(house = ifelse(lotsize >= 5000, 1, 0))

prop.test( x = c(nrow(data[data$colonial == 1 & data$house == 1, ]),


nrow(data[data$colonial == 1 & data$house == 0, ])), n = c( nrow(data),
nrow(data) ), alternative = "less")

##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(nrow(data[data$colonial == 1 & data$house == 1, ]),
nrow(data[data$colonial == 1 & data$house == 0, ])) out of c(nrow(data),
nrow(data))
## X-squared = 57.805, df = 1, p-value = 1
## alternative hypothesis: less
## 95 percent confidence interval:
## -1.0000000 0.6638851
## sample estimates:
## prop 1 prop 2
## 0.62500000 0.06818182
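
In the call above, both counts are divided by `nrow(data)`. When the question is whether the share of colonial homes differs between large-lot and small-lot houses, each count is more commonly divided by the size of its own group. A minimal sketch of that setup, using hypothetical counts that are not taken from hprice1.csv:

```r
# Hypothetical counts (assumed for illustration only).
colonial_large <- 40   # colonial homes among large-lot houses
n_large <- 60          # number of large-lot houses
colonial_small <- 10   # colonial homes among small-lot houses
n_small <- 28          # number of small-lot houses

# Two-sample test for equality of proportions, one denominator per group.
res <- prop.test(x = c(colonial_large, colonial_small),
                 n = c(n_large, n_small),
                 alternative = "two.sided")
res$estimate   # the two group proportions
res$p.value
```

With per-group denominators, `prop 1` and `prop 2` are the shares of colonial homes within each group, which is what the null hypothesis of equal proportions compares.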
5. Linear Regression on quantitative variables
5.1. Check correlation between variables.
pairs( price ~ ., data = data )
r <- cor(dplyr::select_if(data, is.numeric))
ggcorrplot::ggcorrplot(r,
hc.order = TRUE,
lab = TRUE)
car::vif(lm(price~., data = data))

## assess bdrms lotsize sqrft colonial lprice lassess llotsize
## 48.806574 1.655664 4.214772 59.995813 1.298321 5.207658 54.454442 7.305277
## lsqrft house
## 60.069926 1.798603

The Variance Inflation Factor (VIF) measures how strongly an independent variable is correlated with the other independent variables in a multiple regression model. A VIF below 10 is commonly taken to indicate that there is no strong correlation with the other variables. In this case, “lsqrft” has the highest VIF, 60.07, so it is strongly correlated with the other variables and should be eliminated.
We have a new VIF table of values after removing “lsqrft”:
car::vif(lm(price~. - lsqrft, data = data))

## assess bdrms lotsize sqrft colonial lprice lassess llotsize
## 38.433497 1.655410 4.130959 5.324560 1.219080 4.806119 36.921664 7.052738
## house
## 1.798484

In the above results, “assess” has the highest VIF, 38.43, so it is strongly correlated with the other variables and should be eliminated.
We have a new VIF table of values after removing “assess”:
car::vif(lm(price~.-lsqrft - assess, data = data))

## bdrms lotsize sqrft colonial lprice lassess llotsize house
## 1.586106 3.981064 5.228633 1.161620 4.687746 9.818245 6.335779 1.606603

In the above results, the VIF of “lassess” is 9.8, which is close to 10. This suggests a strong correlation with the other variables, so we eliminate it as well.
car::vif(lm(price~.-lsqrft - assess - lassess, data = data))

## bdrms lotsize sqrft colonial lprice llotsize house
## 1.581703 3.577465 2.902717 1.161599 3.118880 4.964992 1.596831

In the above result, the VIF values of all variables are below 10, which means that none of the remaining variables is strongly correlated with the others.
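
The VIF of a predictor can also be computed by hand: regress that predictor on all the other predictors and take 1 / (1 - R²) of that auxiliary regression. A minimal sketch on simulated data (the variables `x1`, `x2`, `x3` are hypothetical):

```r
# Simulated predictors: x2 is built to be strongly related to x1.
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)

# Auxiliary regression of x2 on the other predictors.
r2_x2 <- summary(lm(x2 ~ x1 + x3))$r.squared

# VIF = 1 / (1 - R^2); it grows without bound as R^2 approaches 1.
vif_x2 <- 1 / (1 - r2_x2)
vif_x2
```

An R² of 0.9 in the auxiliary regression already gives a VIF of 10, which is where the usual threshold comes from.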

5.2. Building Linear Regression model


We carry out a linear regression analysis to relate the dependent variable House price to the other independent variables.
We start with a first-order model. Because “lprice” is the logarithm of the dependent variable, it is very highly related to House price and would normally be left out of the predictors; the call below still includes it, so the fit should be read with that in mind. The variables removed above (lsqrft, assess, lassess) are not considered in this case.
fit = lm(price ~ bdrms+ lotsize+sqrft+ colonial + lprice + llotsize +house,
data = data)
coef(fit)

## (Intercept) bdrms lotsize sqrft colonial


## -1.623522e+03 8.687747e+00 -6.385875e-04 1.165188e-02 -1.037888e+01
## lprice llotsize house
## 2.913898e+02 2.929917e+01 -2.985674e+01

summary(fit)

##
## Call:
## lm(formula = price ~ bdrms + lotsize + sqrft + colonial + lprice +
## llotsize + house, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.782 -12.887 -2.674 6.438 98.288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.624e+03 9.602e+01 -16.908 < 2e-16 ***
## bdrms 8.688e+00 3.754e+00 2.314 0.02322 *
## lotsize -6.386e-04 4.669e-04 -1.368 0.17524
## sqrft 1.165e-02 7.413e-03 1.572 0.11996
## colonial -1.038e+01 5.836e+00 -1.778 0.07914 .
## lprice 2.914e+02 1.461e+01 19.943 < 2e-16 ***
## llotsize 2.930e+01 1.029e+01 2.848 0.00558 **
## house -2.986e+01 1.041e+01 -2.867 0.00530 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.43 on 80 degrees of freedom
## Multiple R-squared: 0.9522, Adjusted R-squared: 0.948
## F-statistic: 227.5 on 7 and 80 DF, p-value: < 2.2e-16

In the above results, the p-values of “lotsize” (lot size) and “sqrft” (house size) are fairly high (0.175 and 0.120).
Therefore, we test whether these two variables can be dropped from our model.
$H_0:\beta_2 = \beta_3 = 0 \\ H_1: \exists \beta_i \ne 0 \ (i = 2,3)$
model1 <- lm( price ~ bdrms + colonial + lprice + llotsize + house, data =
data )
anova(fit, model1)

## Analysis of Variance Table


##
## Model 1: price ~ bdrms + lotsize + sqrft + colonial + lprice + llotsize +
## house
## Model 2: price ~ bdrms + colonial + lprice + llotsize + house
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 43903
## 2 82 46723 -2 -2820.8 2.57 0.08284 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(model1)

##
## Call:
## lm(formula = price ~ bdrms + colonial + lprice + llotsize + house,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.689 -12.350 -2.066 6.169 98.251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1608.649 56.117 -28.666 < 2e-16 ***
## bdrms 10.528 3.593 2.930 0.00439 **
## colonial -12.245 5.806 -2.109 0.03798 *
## lprice 309.087 11.022 28.043 < 2e-16 ***
## llotsize 17.456 5.919 2.949 0.00415 **
## house -26.150 9.665 -2.705 0.00829 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.87 on 82 degrees of freedom
## Multiple R-squared: 0.9491, Adjusted R-squared: 0.946
## F-statistic: 305.8 on 5 and 82 DF, p-value: < 2.2e-16

The resulting p-value is 0.08, which is greater than 0.05. Therefore, we fail to reject the null hypothesis and drop the two variables.
As can be seen above, the p-values of all remaining variables are below 0.05. We therefore take this as our final model.
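
The F statistic in the ANOVA table can be recovered from the two residual sums of squares: F = ((RSS_reduced - RSS_full) / q) / (RSS_full / df_full), where q is the number of dropped variables. A short sketch using the rounded values printed in the table above (the variable names are introduced here for illustration):

```r
# Values read from the anova(fit, model1) table above (rounded as printed).
rss_full <- 43903;    df_full <- 80
rss_reduced <- 46723; df_reduced <- 82

# Number of restrictions tested (lotsize and sqrft dropped).
q <- df_reduced - df_full

# Partial F statistic and its upper-tail p-value.
f_stat <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
p_value <- pf(f_stat, q, df_full, lower.tail = FALSE)

round(f_stat, 2)
round(p_value, 3)
```

Up to rounding of the inputs, this agrees with the F value of 2.57 and the p-value of about 0.083 reported by `anova()`.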

6. Interpretation of the model results.


summary(model1)

##
## Call:
## lm(formula = price ~ bdrms + colonial + lprice + llotsize + house,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.689 -12.350 -2.066 6.169 98.251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1608.649 56.117 -28.666 < 2e-16 ***
## bdrms 10.528 3.593 2.930 0.00439 **
## colonial -12.245 5.806 -2.109 0.03798 *
## lprice 309.087 11.022 28.043 < 2e-16 ***
## llotsize 17.456 5.919 2.949 0.00415 **
## house -26.150 9.665 -2.705 0.00829 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.87 on 82 degrees of freedom
## Multiple R-squared: 0.9491, Adjusted R-squared: 0.946
## F-statistic: 305.8 on 5 and 82 DF, p-value: < 2.2e-16

plot(model1)
We have the formula of our Linear Regression Model:
$price = -1608.649 + 10.528 \cdot bdrms - 12.245 \cdot colonial + 309.087 \cdot lprice + 17.456 \cdot llotsize - 26.150 \cdot house$
Interpretation of the coefficients (price is in thousands of dollars; each effect holds the other variables fixed):
• β1: 10.528. If the number of bedrooms increases by 1, the predicted house price increases by 10.528.
• β2: -12.245. A colonial-style house (colonial = 1) has a predicted price 12.245 lower than a comparable non-colonial house.
• β3: 309.087. If the log of house price increases by 1, the predicted house price increases by 309.087.
• β4: 17.456. If the log of lot size increases by 1, the predicted house price increases by 17.456.
• β5: -26.15. A house with “house” = 1 (lot size of at least 5000 square ft) has a predicted price 26.15 lower than a comparable house with “house” = 0.
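
Because “llotsize” is a logarithm, a one-unit increase corresponds to multiplying the lot size by e (about 2.72), which is a very large change. A more practical reading is in terms of percentage changes. The sketch below works this out for a hypothetical 10% larger lot, holding every other variable (including “house”) fixed:

```r
# Coefficient of llotsize taken from summary(model1) above.
beta_llotsize <- 17.456

# A one-unit increase in log(lotsize) multiplies lotsize by e.
lot_factor_for_unit_log <- exp(1)

# Hypothetical scenario: the lot is 10% larger.
pct_increase <- 0.10
price_change <- beta_llotsize * log(1 + pct_increase)

lot_factor_for_unit_log
round(price_change, 3)   # change in predicted price, in $1000s
```

Under the fitted model, a 10% larger lot is associated with a predicted price roughly 1.66 thousand dollars higher.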

Activity 02
1. Dataset introduction
This dataset is used to examine how insurance charges relate to the following attributes of the insured: Age, Sex, BMI, Number of Children, Smoker and Region. The attributes are a mix of numeric and categorical variables.
Before examining the dataset, we set it up as follows:
• Set the working directory to the specified address.
• Read the data file “insurance.csv” into the variable named “data1”.
• Use the “attach” function so that individual columns can be referenced by name as vectors.
setwd("C:/Users/Admin/Desktop")
data1 <- read.csv("insurance.csv", header = TRUE)
attach(data1)

1.1. Brief description


dim(data1)

## [1] 1338 7

• The dataset consists of 1338 rows and 7 columns, including: age, sex, bmi, children,
smoker, region, charges.
• The dataset name is “insurance”.

1.2. Format and structure of dataset


The dataset consists of below columns:
summary(data1)

## age sex bmi children


## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770

Number   Column     Meaning
1        age        Age of primary beneficiary
2        sex        Insurance contractor gender, female / male
3        bmi        Body mass index
4        children   Number of children covered by health insurance
5        smoker     Whether the person is a smoker, yes / no
6        region     The beneficiary’s residential area in the US: northeast, southeast, southwest, northwest
7        charges    Individual medical costs billed by health insurance

2. Descriptive statistics
2.1. Quantitative statistics

2.1.1. Age
hist(age, main = "Age of beneficiary", xlab = "age", xlim = c(0, 70), breaks
= 10)
boxplot(age)

cat("Mode of age: ", mode(age),"\n" )


## Mode of age: 18

cat("Summary of age: ", "\n")

## Summary of age:

summary(age)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 18.00 27.00 39.00 39.21 51.00 64.00

cat("Total outliers: ", outliers(age), " ", percent_outliers(age))

## Total outliers: 0 0

• The histogram and summary show that the middle half of the patients are between 27 and 51 years old. Age 18 appears most frequently in this column.
• There are no outliers in this attribute.

2.1.2. BMI
hist(bmi, main= "BMI (Body Mass Index)", xlab = "BMI", xlim = c(0, 60))

boxplot(bmi)
cat("Mode of BMI: ", mode(bmi),"\n" )

## Mode of BMI: 32.3

cat("Summary of BMI: ", "\n")

## Summary of BMI:

summary(bmi)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 15.96 26.30 30.40 30.66 34.69 53.13

cat("Total outliers: ", outliers(bmi), " ", percent_outliers(bmi))

## Total outliers: 9 0.006726457

• The histogram shows that most BMI values lie between 25 and 35. A BMI of 32.3 appears most frequently in this column.
• There are 9 outliers, which account for 0.67% of the dataset. This share is small, so it should not affect our analysis results much.

2.1.3. Children
hist(children, main = "Number of dependents", xlab = "Children")
boxplot(children)

cat("Mode of number of dependents: ", mode(children),"\n" )


## Mode of number of dependents: 0

cat("Summary of number of dependents: ", "\n")

## Summary of number of dependents:

summary(children)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.000 0.000 1.000 1.095 2.000 5.000

cat("Total outliers: ", outliers(children), " ", percent_outliers(children))

## Total outliers: 0 0

• The histogram shows that most beneficiaries have between 0 and 2 children. The right skew of the histogram is consistent with the mean of the number of children exceeding the median and the mode (mode < median < mean).
• There are no outliers in this case.

2.1.4. Charges
hist(charges/1000, main = "Medical costs by health insurance", xlab =
"Charges / $1000", xlim = c(0, 70))

boxplot(charges)
cat("Mode of charges: ", mode(charges),"\n" )

## Mode of charges: 1639.563

cat("Summary of charges: ", "\n")

## Summary of charges:

summary(charges)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 1122 4740 9382 13270 16640 63770

cat("Total outliers: ", outliers(charges), " ", percent_outliers(charges))

## Total outliers: 139 0.1038864

• The histogram shows that most charges lie under 15000. The right skew of the histogram is consistent with the mean of charges exceeding the mode.
• There are 139 outliers in this case, which make up 10.4% of this attribute. This share is large enough to influence our analysis results; whether to keep these outliers is decided in the data cleaning step (Section 3.1).
2.2. Qualitative statistics.

2.2.1. Sex
The “sex” attribute in the dataset is represented by 2 string values, “female” and “male”. Therefore, we convert them into numeric values and replace the new values into the dataframe.
library(dplyr)
data1 <- data1 %>% mutate(sex_numeric = ifelse(sex == "male", 0, 1))

# Remove the old "sex" column


data1 <- data1 %>%
select(-sex)

# Rename the new column to "sex"


colnames(data1)[colnames(data1) == "sex_numeric"] <- "sex"

Now we have a dataframe with “sex” column values “male” as 0 and “female” as 1.
pie(table(sex), main = "Pie chart for sex distribution")

table(sex)

## sex
## female male
## 662 676
cat(paste("Percentage of", "female patients", ":", round((length(sex[sex ==
"female"]) / length(sex))*100, 2), "%\n"))

## Percentage of female patients : 49.48 %

cat(paste("Percentage of", "male patients", ":", round((length(sex[sex ==


"male"]) / length(sex))*100, 2), "%\n"))

## Percentage of male patients : 50.52 %
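
The tables in this section still show the original string labels even though the column in “data1” has been recoded. One likely explanation is that `attach()` places a copy of the columns on the search path, and later changes to the data frame do not update that copy. A minimal sketch of this behaviour on a toy data frame (the name `toy` is hypothetical):

```r
# Toy data frame, used only to illustrate how attach() behaves.
toy <- data.frame(sex = c("male", "female", "male"))
attach(toy)

# Recode the column inside the data frame after attaching it.
toy$sex <- ifelse(toy$sex == "male", 0, 1)

# The bare name still refers to the attached copy with the original strings.
attached_sex <- sex
print(attached_sex)
print(toy$sex)

detach(toy)
```

Referring to columns as `data1$sex`, or detaching and re-attaching after each change, avoids mixing the old and new versions of a column.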

The pie chart shows that the percentages of female patients and male patients are nearly the same.

2.2.2. Smoker
The “smoker” attribute is also represented by 2 string values, “yes” and “no”. Therefore, we convert them into numeric values and replace the new values into the dataframe.
library(dplyr)
data1 <- data1 %>% mutate(smoker_numeric = ifelse(smoker == "no", 0, 1))

# Remove the old "smoker" column


data1 <- data1 %>%
select(-smoker)

# Rename the new column to "smoker"


colnames(data1)[colnames(data1) == "smoker_numeric"] <- "smoker"

Now we have a dataframe with “smoker” column values “no” as 0 and “yes” as 1.
pie(table(smoker), main = "Pie chart for smoker distribution")
table(smoker)

## smoker
## no yes
## 1064 274

cat(paste("Percentage of", "smoker", ":", round((length(smoker[smoker ==


"yes"]) / length(smoker))*100, 2), "%\n"))

## Percentage of smoker : 20.48 %

cat(paste("Percentage of", "non-smoker", ":", round((length(smoker[smoker ==


"no"]) / length(smoker))*100, 2), "%\n"))

## Percentage of non-smoker : 79.52 %

The pie chart shows that about 20% of the patients are smokers and the remaining 80% are non-smokers.

2.2.3. Region
# Get unique values in the "region" column
unique_regions <- unique(data1$region)

# Count the number of unique regions


num_unique_regions <- length(unique_regions)

# Print the unique regions and the count


cat("Number of unique regions:", num_unique_regions, "\n")

## Number of unique regions: 4

cat("Unique regions:\n")

## Unique regions:

cat(unique_regions, sep = "\n")

## southwest
## southeast
## northwest
## northeast

The “region” attribute is represented by 4 string values: “southwest”, “southeast”, “northwest”, and “northeast”. Therefore, we convert them into numeric values and replace the new values into the dataframe.
# Assuming you have the dplyr package installed
library(dplyr)

# Convert region values to numeric


data1 <- data1 %>%
mutate(region_numeric = case_when(
region == "southwest" ~ 1,
region == "southeast" ~ 2,
region == "northwest" ~ 3,
region == "northeast" ~ 4,
TRUE ~ NA_real_  # numeric NA, same type as the codes above
)) %>%
select(-region)

# Rename the new column to "region"


colnames(data1)[colnames(data1) == "region_numeric"] <- "region"

Now we have a dataframe with “region” column values “southwest” as 1, “southeast” as 2, “northwest” as 3, and “northeast” as 4.
pie(table(region), main = "Pie chart for regions distribution")

table(region)

## region
## northeast northwest southeast southwest
## 324 325 364 325

cat(paste("Percentage of", "patients in Southwest", ":",
round((length(region[region == "southwest"]) / length(region))*100, 2), "%\n"))

## Percentage of patients in Southwest : 24.29 %

cat(paste("Percentage of", "patients in Southeast", ":",
round((length(region[region == "southeast"]) / length(region))*100, 2), "%\n"))

## Percentage of patients in Southeast : 27.2 %

cat(paste("Percentage of", "patients in Northwest", ":",
round((length(region[region == "northwest"]) / length(region))*100, 2), "%\n"))

## Percentage of patients in Northwest : 24.29 %

cat(paste("Percentage of", "patients in Northeast", ":",
round((length(region[region == "northeast"]) / length(region))*100, 2), "%\n"))

## Percentage of patients in Northeast : 24.22 %

The pie chart shows that the beneficiaries are distributed almost equally across the four residential areas.

3. Linear regression on quantitative variables.

3.1. Data cleaning.


We calculate the number of outliers and examine whether they should be kept or not. Total
outliers:
num_outliers<- unique(unlist(apply(data1[,c(1,2,3,4,5,6,7)],2,outliers)))
length(num_outliers)

## [1] 4

round((length(num_outliers) / nrow(data1))*100,2)

## [1] 0.3
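
To count observations rather than columns, one option is to flag each value that `boxplot.stats()` treats as an outlier and then count the rows with at least one flag. A minimal sketch on a toy data frame (the names `toy` and `is_outlier` are hypothetical, and this is not the code used in the report):

```r
# Toy data: column a has one extreme value, column b has none.
toy <- data.frame(a = c(1, 2, 3, 4, 5, 100),
                  b = c(10, 11, 12, 13, 14, 15))

# TRUE for every value that boxplot.stats() reports as an outlier.
is_outlier <- function(x) x %in% boxplot.stats(x)$out

# Logical matrix with one column of flags per variable.
flags <- sapply(toy, is_outlier)

# Rows that are an outlier in at least one column, and their share.
rows_with_outlier <- sum(rowSums(flags) > 0)
share <- rows_with_outlier / nrow(toy)
cat("Rows with an outlier:", rows_with_outlier, " share:", share, "\n")
```

Applied to binary columns, the boxplot rule can flag every 1 as an outlier when most values are 0, so in practice it makes sense to restrict this count to the genuinely quantitative columns.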

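The call above returns 4 (0.3%). Because “outliers()” returns a count for each column, this is the number of distinct per-column outlier counts rather than the number of outlying observations. Taken per variable, the outliers are concentrated in “charges” (139 values, about 10.4% of the rows). Removing them would discard a large share of the high-cost cases, so we will not remove the outliers in this case.

3.2. Create dummy variables.
In our dataset, the qualitative variable “region” takes 4 values, so we create 4 dummy variables for this attribute and remove the “region” column.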
library(fastDummies)

## Warning: package 'fastDummies' was built under R version 4.3.1

## Thank you for using fastDummies!

## To acknowledge our work, please cite the package:

## Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.

data1 <- dummy_cols(data1, select_columns = 'region')

# Remove the original "region" column


data1 <- subset(data1, select = -c(region))

# Rename the columns


colnames(data1)[colnames(data1) == "region_1"] <- "region_southwest"
colnames(data1)[colnames(data1) == "region_2"] <- "region_southeast"
colnames(data1)[colnames(data1) == "region_3"] <- "region_northwest"
colnames(data1)[colnames(data1) == "region_4"] <- "region_northeast"

3.3. Check correlation between variables.


Because every observation belongs to exactly one region, the 4 dummy variables of “region” always sum to 1 and are therefore perfectly collinear with the intercept. We check which variable is aliased by the others so that we can remove it.
model_activ2 <- lm(charges~., data = data1)

# Check for aliased coefficients


alias_table <- alias(model_activ2)
print(alias_table)

## Model :
## charges ~ age + bmi + children + sex + smoker + region_southwest +
## region_southeast + region_northwest + region_northeast
##
## Complete :
## (Intercept) age bmi children sex smoker region_southwest
## region_northeast 1 0 0 0 0 0 -1
## region_southeast region_northwest
## region_northeast -1 -1

As can be seen from the above result, “region_northeast” is an exact linear combination of the intercept and the other three dummy variables (the intercept minus the three). Therefore, we remove it.
# Remove the original "region_northeast" column
data1 <- subset(data1, select = -c(region_northeast))
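
The reason one dummy must be dropped can be seen on a small simulated example: with an intercept in the model, the four region dummies add up to 1 in every row, so `lm()` cannot estimate all of them. The data frame `d` below is hypothetical.

```r
# Simulated data with four equally sized regions.
set.seed(7)
region <- rep(c("northeast", "northwest", "southeast", "southwest"), each = 10)
d <- data.frame(
  y  = rnorm(40),
  ne = as.numeric(region == "northeast"),
  nw = as.numeric(region == "northwest"),
  se = as.numeric(region == "southeast"),
  sw = as.numeric(region == "southwest")
)

# Every row has exactly one dummy equal to 1.
dummy_sums <- rowSums(d[, c("ne", "nw", "se", "sw")])
all(dummy_sums == 1)

# With all four dummies plus an intercept, one coefficient is not estimable (NA).
fit_all <- lm(y ~ ne + nw + se + sw, data = d)
coef(fit_all)
```

The category whose dummy is dropped becomes the baseline, and the remaining dummy coefficients are read as differences from that baseline.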

pairs( charges ~ ., data = data1 )


r <- cor(dplyr::select_if(data1, is.numeric))
ggcorrplot::ggcorrplot(r,
hc.order = TRUE,
lab = TRUE)
car::vif(lm(charges~., data = data1))

## age bmi children sex


## 1.016822 1.106630 1.004011 1.008900
## smoker region_southwest region_southeast region_northwest
## 1.012074 1.529411 1.652230 1.518823

The Variance Inflation Factor (VIF) measures how strongly an independent variable is correlated with the other independent variables in a multiple regression model. A VIF below 10 is commonly taken to indicate that there is no strong correlation with the other variables. In this case, no VIF value is above 10, so we do not remove any variables: none of them is strongly correlated with the others.

3.4. Building Linear Regression model
model_activ21 <- lm(charges~., data = data1)
summary(model_activ21)

##
## Call:
## lm(formula = charges ~ ., data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12069.9 999.6 -12.074 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## sex 131.3 332.9 0.394 0.693348
## smoker 23848.5 413.1 57.723 < 2e-16 ***
## region_southwest -960.0 477.9 -2.009 0.044765 *
## region_southeast -1035.0 478.7 -2.162 0.030782 *
## region_northwest -353.0 476.3 -0.741 0.458769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16

In the above results, the p-values of “sex” and “region_northwest” are fairly high (0.693 and 0.459).
Therefore, we test whether these two variables can be dropped from our model.
$H_0:\beta_4 = \beta_8 = 0 \\ H_1: \exists \beta_i \ne 0 \ (i = 4,8)$
model_activ22 <- lm(charges~. -sex - region_northwest, data = data1)
anova(model_activ21, model_activ22)

## Analysis of Variance Table


##
## Model 1: charges ~ age + bmi + children + sex + smoker + region_southwest
+
## region_southeast + region_northwest
## Model 2: charges ~ (age + bmi + children + sex + smoker + region_southwest
+
## region_southeast + region_northwest) - sex - region_northwest
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1329 4.8840e+10
## 2 1331 4.8865e+10 -2 -25810671 0.3512 0.7039

summary(model_activ22)

##
## Call:
## lm(formula = charges ~ . - sex - region_northwest, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11377.1 -2855.0 -973.6 1349.5 29926.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12165.38 949.54 -12.812 < 2e-16 ***
## age 257.01 11.89 21.617 < 2e-16 ***
## bmi 338.64 28.55 11.860 < 2e-16 ***
## children 471.54 137.66 3.426 0.000632 ***
## smoker 23843.87 411.66 57.921 < 2e-16 ***
## region_southwest -782.75 413.76 -1.892 0.058734 .
## region_southeast -858.47 415.21 -2.068 0.038873 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6059 on 1331 degrees of freedom
## Multiple R-squared: 0.7508, Adjusted R-squared: 0.7497
## F-statistic: 668.3 on 6 and 1331 DF, p-value: < 2.2e-16

The resulting p-value is 0.7, which is greater than 0.05. Therefore, we fail to reject the null hypothesis and drop the two variables. As can be seen above, the p-values of the remaining variables are low; only “region_southwest” (0.059) is slightly above 0.05. We therefore take this as our final model.

4. Interpretation of the model results.
summary(model_activ22)

##
## Call:
## lm(formula = charges ~ . - sex - region_northwest, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11377.1 -2855.0 -973.6 1349.5 29926.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12165.38 949.54 -12.812 < 2e-16 ***
## age 257.01 11.89 21.617 < 2e-16 ***
## bmi 338.64 28.55 11.860 < 2e-16 ***
## children 471.54 137.66 3.426 0.000632 ***
## smoker 23843.87 411.66 57.921 < 2e-16 ***
## region_southwest -782.75 413.76 -1.892 0.058734 .
## region_southeast -858.47 415.21 -2.068 0.038873 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6059 on 1331 degrees of freedom
## Multiple R-squared: 0.7508, Adjusted R-squared: 0.7497
## F-statistic: 668.3 on 6 and 1331 DF, p-value: < 2.2e-16

We have the formula of our Linear Regression Model:
$charges = -12165.38 + 257.01 \cdot age + 338.64 \cdot bmi + 471.54 \cdot children + 23843.87 \cdot smoker - 782.75 \cdot region\_southwest - 858.47 \cdot region\_southeast$

Interpretation of the coefficients (each effect holds the other variables fixed):
• β1: 257.01. If the age of the beneficiary increases by 1 year, the predicted charges increase by 257.01.
• β2: 338.64. If the BMI increases by 1, the predicted charges increase by 338.64.
• β3: 471.54. If the number of dependents increases by 1, the predicted charges increase by 471.54.
• β4: 23843.87. A smoker (smoker = 1) has predicted charges 23843.87 higher than a comparable non-smoker.
• β5: -782.75. A beneficiary located in the Southwest of the US has predicted charges 782.75 lower than a comparable beneficiary in the baseline regions (Northwest and Northeast).
• β6: -858.47. A beneficiary located in the Southeast of the US has predicted charges 858.47 lower than a comparable beneficiary in the baseline regions.
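
As a check on how a 0/1 coefficient is read, the sketch below plugs two hypothetical beneficiaries into the fitted equation. They are identical except for smoking status, and the profile (age 40, BMI 30, one child, Southeast) is assumed purely for illustration.

```r
# Coefficients taken from summary(model_activ22) above.
coefs <- c(intercept = -12165.38, age = 257.01, bmi = 338.64,
           children = 471.54, smoker = 23843.87,
           region_southwest = -782.75, region_southeast = -858.47)

# Predicted charges for one beneficiary under the fitted linear model.
predict_charges <- function(age, bmi, children, smoker,
                            region_southwest, region_southeast) {
  x <- c(1, age, bmi, children, smoker, region_southwest, region_southeast)
  sum(coefs * x)
}

# Two hypothetical beneficiaries who differ only in smoking status.
non_smoker <- predict_charges(40, 30, 1, smoker = 0, 0, 1)
smoker     <- predict_charges(40, 30, 1, smoker = 1, 0, 1)

smoker - non_smoker   # equals the smoker coefficient
```

The difference between the two predictions is exactly the smoker coefficient, whatever values are chosen for the other variables, which is what "holding the other variables fixed" means for a linear model without interaction terms.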
