AAAAAAAAAAAAAAAAAAAAAAAAA

Uploaded by

Hoang The Anh


SomethingHappy

2023-08-14

Activity 01
1. Dataset introduction
Before the dataset can be examined, we set it up as follows:
- Set the working directory to the specified address.
- Read the data file “hprice1.csv” into the variable named “data”.
- Name each column explicitly through `col.names`, using the names given in the description file.
- Use the “attach” function to conveniently reference individual columns as vectors.
setwd("C:/Users/Admin/Desktop")
data <- read.csv("hprice1.csv", header = TRUE, col.names = c("price",
"assess", "bdrms", "lotsize", "sqrft", "colonial", "lprice", "lassess",
"llotsize", "lsqrft"))
attach(data)
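
One behaviour of `attach()` is worth keeping in mind while reading the rest of the report: it exposes a copy of the columns as they are at the time of the call, so later changes to the data frame are not visible through the bare column names. The sketch below illustrates this on a small hypothetical data frame (`df` and its values are made up for illustration and are not part of the dataset).

```r
# Hypothetical toy data frame, used only to illustrate attach() semantics
df <- data.frame(sex = c("male", "female", "female"))
attach(df)

# Recode the column inside the data frame after attaching
df$sex <- ifelse(df$sex == "male", 0, 1)

attached_view <- sex      # bare name still resolves to the attached copy
frame_view    <- df$sex   # explicit reference sees the recoded values

print(attached_view)
print(frame_view)

detach(df)
```

Referring to columns as `df$column`, or wrapping calls in `with(df, ...)`, avoids this kind of mismatch.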

1.1. Brief description

dim(data)

## [1] 88 10

• The dataset consists of 88 rows and 10 columns, including: price, assess, bdrms,
lotsize, sqrft, colonial, lprice, lassess, llotsize, lsqrft.
• The dataset name is “data 6”.

1.2. Format and structure of dataset

The dataset consists of the columns below:
summary(data)

## price assess bdrms lotsize sqrft
## Min. :111.0 Min. :198.7 Min. :2.000 Min. : 1000 Min. :1171
## 1st Qu.:230.0 1st Qu.:253.9 1st Qu.:3.000 1st Qu.: 5733 1st Qu.:1660
## Median :265.5 Median :290.2 Median :3.000 Median : 6430 Median :1845
## Mean :293.5 Mean :315.7 Mean :3.568 Mean : 9020 Mean :2014
## 3rd Qu.:326.2 3rd Qu.:352.1 3rd Qu.:4.000 3rd Qu.: 8583 3rd Qu.:2227
## Max. :725.0 Max. :708.6 Max. :7.000 Max. :92681 Max. :3880
## colonial lprice lassess llotsize
## Min. :0.0000 Min. :4.710 Min. :5.292 Min. : 6.908
## 1st Qu.:0.0000 1st Qu.:5.438 1st Qu.:5.537 1st Qu.: 8.654
## Median :1.0000 Median :5.582 Median :5.671 Median : 8.769
## Mean :0.6932 Mean :5.633 Mean :5.718 Mean : 8.905
## 3rd Qu.:1.0000 3rd Qu.:5.788 3rd Qu.:5.864 3rd Qu.: 9.058
## Max. :1.0000 Max. :6.586 Max. :6.563 Max. :11.437
## lsqrft
## Min. :7.066
## 1st Qu.:7.415
## Median :7.520
## Mean :7.573
## 3rd Qu.:7.708
## Max. :8.264

Number  Column    Meaning
1       price     House price, $1000s
2       assess    Assessed value, $1000s
3       bdrms     Number of bedrooms
4       lotsize   Size of lot in square feet
5       sqrft     Size of house in square feet
6       colonial  =1 if home is colonial style
7       lprice    log(price)
8       lassess   log(assess)
9       llotsize  log(lotsize)
10      lsqrft    log(sqrft)

2. Descriptive statistics
Prepare functions for later use:
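# Most frequent value of x (this masks base::mode for the rest of the script).
# On ties, which.max() keeps the first entry of the sorted table, i.e. the smallest value.
mode <- function(x){
  freq_table <- table(x)
  mode_value <- as.numeric(names(freq_table)[which.max(freq_table)])
  return(mode_value)
}

# Number of points beyond the boxplot whiskers (1.5 * IQR rule)
outliers <- function(x) {
  out_values <- boxplot.stats(x)$out
  return(length(out_values))
}

# Share of observations flagged as outliers (a proportion, not a percentage)
percent_outliers <- function(x){
  out_values <- boxplot.stats(x)$out
  return(length(out_values) / length(x))
}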
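
To make the behaviour of these helpers concrete, here is a minimal sketch on a small hypothetical vector. The name `stat_mode` is an illustrative alternative that avoids masking `base::mode`; it is not used elsewhere in the report. Note the tie rule: when every value occurs once, the function returns the smallest value, which may be why a reported mode sometimes coincides with the minimum of a column.

```r
# Same logic as the report's mode(), under a name that does not mask base::mode
stat_mode <- function(x) {
  freq <- table(x)                        # counts per distinct value, sorted by value
  as.numeric(names(freq)[which.max(freq)])
}

x <- c(2, 3, 3, 4, 5, 100)                # hypothetical values

most_frequent <- stat_mode(x)             # 3 occurs twice
flagged       <- boxplot.stats(x)$out     # points beyond 1.5 * IQR from the hinges
tie_result    <- stat_mode(c(5, 1, 3))    # all values unique: first sorted value wins

print(most_frequent)
print(flagged)
print(tie_result)
```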

2.1. Quantitative statistics

2.1.1. House price

hist(price, main = "House price", xlab = "Price (in $1000)")

boxplot(price)
cat("Mode of price: ", mode(price),"\n" )

## Mode of price: 225

cat("Summary of price: ", "\n")

## Summary of price:

summary(price)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 111.0 230.0 265.5 293.5 326.2 725.0

cat("Total outliers: ", outliers(price), " ", percent_outliers(price))

## Total outliers: 5 0.05681818

• The histogram shows that the majority of house prices lie between 230 and 300
thousand dollars. The rightward skew of the histogram is consistent with the mean
of house price being larger than the mode (mode < median < mean).
• There are 5 outliers, which account for 5.7% of the dataset. This small share is
unlikely to affect our analysis results much.

2.1.2. Assessed value

hist(assess, xlim = c(0,800), main = "Assessed value",
xlab = "Assessed value (in $1000)")
boxplot(assess)

cat("Mode of assessed value: ", mode(assess),"\n" )

## Mode of assessed value: 198.7

cat("Summary of assessed value: ", "\n")

## Summary of assessed value:

summary(assess)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 198.7 253.9 290.2 315.7 352.1 708.6

cat("Total outliers: ", outliers(assess), " ", percent_outliers(assess))

## Total outliers: 5 0.05681818

• The histogram shows that the majority of assessed values lie between 250 and 300
thousand dollars. The rightward skew of the histogram is consistent with the mean
of assessed value being larger than the mode (mode < median < mean).
• There are 5 outliers, which account for 5.7% of the dataset. This small share is
unlikely to affect our analysis results much.

2.1.3. Number of bedrooms

hist(bdrms)

plot(bdrms)
# Build the labels from the table itself so each slice gets its own bedroom count
bdrms_tab <- table(bdrms)
pie(bdrms_tab, labels = paste(names(bdrms_tab), "bdrms"),
col = c("red", "orange", "yellow", "lightgreen", "lightblue", "violet", "cyan"),
main = "Pie Chart for Bedrooms attribute")
cat("Mode of bedrooms: ", mode(bdrms), "\n")

## Mode of bedrooms: 3

cat("Summary of bedrooms: ", "\n")

## Summary of bedrooms:

summary(bdrms)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 2.000 3.000 3.000 3.568 4.000 7.000

cat("Total outliers: ", outliers(bdrms), " ", percent_outliers(bdrms))

## Total outliers: 2 0.02272727

• The diagrams show that most houses have 3 or 4 bedrooms.
• There are 2 outliers, which account for 2.3% of the dataset. This small share is
unlikely to affect our analysis results much.

2.1.4. Lot size

hist(lotsize/1000, main = "Histogram of lotsize (per 1000ft)", xlab =
"lotsize/1000ft")
boxplot(lotsize)

cat("Mode of lot size: ", mode(lotsize),"\n" )

## Mode of lot size: 6000

cat("Summary of lot size: ", "\n")

## Summary of lot size:

summary(lotsize)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 1000 5733 6430 9020 8583 92681

cat("Total outliers: ", outliers(lotsize), " ", percent_outliers(lotsize))

## Total outliers: 11 0.125

• The histogram shows that the majority of lot sizes lie under 20 thousand square
feet. The rightward skew of the histogram is consistent with the mean of lot size
being larger than the mode (mode < median < mean).
• There are 11 outliers, which account for 12.5% of the dataset. This share may
have a slight effect on our analysis results.

2.1.5. House size

hist(sqrft, main = "Histogram of house size", xlim = c(0, 5000), xlab =
"House size")

boxplot(sqrft)
cat("Mode of house size: ", mode(sqrft),"\n" )

## Mode of house size: 1536

cat("Summary of house size: ", "\n")

## Summary of house size:

summary(sqrft)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 1171 1660 1845 2014 2227 3880

cat("Total outliers: ", outliers(sqrft), " ", percent_outliers(sqrft))

## Total outliers: 6 0.06818182

• The histogram shows that the majority of house sizes lie between 1500 and 2000
square ft. The rightward skew of the histogram is consistent with the mean of house
size being larger than the mode (mode < median < mean).
• There are 6 outliers, which account for 6.8% of the dataset. This small share is
unlikely to affect our analysis results much.

2.1.6. The remaining variables

The four remaining variables are computed by taking the natural logarithm `log()` of four
values related to price and area: house price, assessed value, lot size, and house size.
The purpose of log-transforming these variables is to obtain more linear relationships in
the analyses.
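
As a rough illustration of what the log transform does to a right-skewed variable, the sketch below compares the mean-to-median ratio before and after taking logs. The values in `lot` are hypothetical and only mimic the shape of a skewed lot-size column; they are not taken from the dataset.

```r
# Hypothetical right-skewed values (one very large observation)
lot <- c(1000, 5000, 6000, 6500, 8500, 92000)

ratio_raw <- mean(lot) / median(lot)             # pulled up by the large value
ratio_log <- mean(log(lot)) / median(log(lot))   # much closer to 1 after the transform

print(c(raw = ratio_raw, log = ratio_log))
```

A ratio near 1 on the log scale suggests the transformed variable is closer to symmetric, which tends to make linear relationships easier to fit.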

2.2. Qualitative statistics

2.2.1. Colonial style

# Pie chart of the binary indicator 'colonial' (1 = colonial style, 0 = otherwise)
pie(table(colonial), main = "Pie Chart for Colonial style")

table(colonial)

## colonial
## 0 1
## 27 61

The pie chart shows that about one-third of the houses (27 of 88) are not colonial style,
so colonial-style houses (61 of 88) are the majority.

3. Assessment on the average of two quantitative variables and applying hypothesis testing.
Let us consider two quantitative variables: house price and lotsize (size of lot in square feet).

3.1. House price

summary(price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 111.0 230.0 265.5 293.5 326.2 725.0

The output shows that the mean of house prices is 293.5 thousand dollars.
Hypothesis testing: let us test whether the mean of house prices is different from 293.5
thousand dollars.
$H_0 : \mu = 293.5 \\ H_1 : \mu \ne 293.5$
t.test(x = price, mu = 293.5, alternative = "two.sided")

##
## One Sample t-test
##
## data: price
## t = 0.0042043, df = 87, p-value = 0.9967
## alternative hypothesis: true mean is not equal to 293.5
## 95 percent confidence interval:
## 271.7831 315.3089
## sample estimates:
## mean of x
## 293.546

Using a one-sample t-test on the house price attribute, we obtain p-value = 0.9967,
which is close to 1. Therefore, we fail to reject the null hypothesis: the data are consistent
with a mean house price of 293.5 thousand dollars. This is expected, because the
hypothesized value is the sample mean (293.546) rounded to one decimal.
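
The statistic reported by `t.test` is $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$ with $n - 1$ degrees of freedom. The sketch below recomputes it by hand on a small hypothetical sample (the values in `x` and the choice `mu0 = 300` are made up for illustration) and compares the result with `t.test`.

```r
# Hypothetical sample of prices (in $1000s) and hypothesized mean
x   <- c(230, 265, 290, 310, 326, 410)
mu0 <- 300
n   <- length(x)

# Manual one-sample t statistic and two-sided p-value
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# Same test with the built-in function
res <- t.test(x, mu = mu0, alternative = "two.sided")

print(c(t_manual = t_manual, t_builtin = unname(res$statistic)))
print(c(p_manual = p_manual, p_builtin = res$p.value))
```

The closer the hypothesized mean is to the sample mean, the closer the statistic is to 0 and the p-value to 1, which matches the output above.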

3.2. Lot size

summary(lotsize)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 1000 5733 6430 9020 8583 92681

The output shows that the mean of lot size is 9020 square ft.
Hypothesis testing: let us test whether the mean of lot size is different from 9020 square ft.
$H_0 : \mu = 9020 \\ H_1 : \mu \ne 9020$
t.test( x = lotsize, mu = 9020, alternative = "two.sided")

##
## One Sample t-test
##
## data: lotsize
## t = -0.00012573, df = 87, p-value = 0.9999
## alternative hypothesis: true mean is not equal to 9020
## 95 percent confidence interval:
## 6864.167 11175.560
## sample estimates:
## mean of x
## 9019.864
Using a one-sample t-test on the lot size attribute, we obtain p-value = 0.9999,
which is close to 1. Therefore, we fail to reject the null hypothesis: the data are consistent
with a mean lot size of 9020 square ft. Again, this is expected, because the hypothesized
value is the rounded sample mean (9019.864).

4. Assessment on the characteristics of two qualitative variables and applying hypothesis testing.
Because there is only one qualitative variable (colonial style) in the given dataset, we
create one more qualitative variable. The new variable, named “house”, is 1 if the lot size
is at least 5000 square ft and 0 otherwise.
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.1

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':

##
## filter, lag

## The following objects are masked from 'package:base':

##
## intersect, setdiff, setequal, union

data <- data %>%

mutate(house = ifelse(lotsize >= 5000, 1, 0))

prop.test( x = c(nrow(data[data$colonial == 1 & data$house == 1, ]),

nrow(data[data$colonial == 1 & data$house == 0, ])), n = c( nrow(data),
nrow(data) ), alternative = "less")

##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(nrow(data[data$colonial == 1 & data$house == 1, ]),
nrow(data[data$colonial == 1 & data$house == 0, ])) out of c(nrow(data),
nrow(data))
## X-squared = 57.805, df = 1, p-value = 1
## alternative hypothesis: less
## 95 percent confidence interval:
## -1.0000000 0.6638851
## sample estimates:
## prop 1 prop 2
## 0.62500000 0.06818182
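
Reading the output above: the two estimates are 55/88 = 0.625 (colonial houses with house = 1) and 6/88 = 0.068 (colonial houses with house = 0), both taken out of all 88 houses, and a p-value of 1 under `alternative = "less"` means there is no evidence that the first proportion is smaller than the second. A common alternative, shown here only as a sketch and not as the method used in this report, is to compare the share of colonial houses within each “house” group, using the group sizes as `n`. The counts below are hypothetical.

```r
# Hypothetical 2x2 counts: rows = house group, columns = colonial indicator
counts <- matrix(c(40, 15,
                   20, 13),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(house = c("1", "0"), colonial = c("1", "0")))

successes <- counts[, "1"]      # colonial houses in each group
trials    <- rowSums(counts)    # size of each group

# Two-sample test for equality of the within-group proportions
res <- prop.test(x = successes, n = trials, alternative = "two.sided")

print(res$estimate)
print(res$p.value)
```

With this layout each proportion has its own denominator, so the test asks whether colonial style is equally common among houses on large and small lots.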
5. Linear Regression on quantitative variables
5.1. Check correlation between variables.
pairs( price ~ ., data = data )
r <- cor(dplyr::select_if(data, is.numeric))
ggcorrplot::ggcorrplot(r,
hc.order = TRUE,
lab = TRUE)
car::vif(lm(price~., data = data))

## assess bdrms lotsize sqrft colonial lprice lassess llotsize
## 48.806574 1.655664 4.214772 59.995813 1.298321 5.207658 54.454442 7.305277
## lsqrft house
## 60.069926 1.798603

The Variance Inflation Factor (VIF) measures how strongly an independent variable is
correlated with the other independent variables in a multiple regression model. A VIF
value below 10 indicates that there is no strong correlation with the other variables. In this
case, “lsqrft” has the largest VIF, 60.07, which indicates a strong correlation with the other
variables, so it should be eliminated.
We have a new VIF table of values after removing “lsqrft”:
car::vif(lm(price~. - lsqrft, data = data))

## assess bdrms lotsize sqrft colonial lprice lassess llotsize
## 38.433497 1.655410 4.130959 5.324560 1.219080 4.806119 36.921664 7.052738
## house
## 1.798484

In the above results, “assess” has the largest VIF, 38.43, which indicates a strong
correlation with other variables, so it should be eliminated.
We have a new VIF table of values after removing “assess”:
car::vif(lm(price~.-lsqrft - assess, data = data))

## bdrms lotsize sqrft colonial lprice lassess llotsize house

## 1.586106 3.981064 5.228633 1.161620 4.687746 9.818245 6.335779 1.606603

In the above results, the VIF of “lassess” is 9.8, which is close to 10. We treat this as a
strong correlation with other variables, so it should also be eliminated.
car::vif(lm(price~.-lsqrft - assess - lassess, data = data))

## bdrms lotsize sqrft colonial lprice llotsize house

## 1.581703 3.577465 2.902717 1.161599 3.118880 4.964992 1.596831

In the above result, the VIF values of all variables are well below 10, which means that
none of the remaining variables is strongly correlated with the others.
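
For a single predictor $j$, the VIF equals $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the other predictors. The sketch below computes this by hand in base R. It uses the built-in `mtcars` data and three of its columns purely as a stand-in, since the point is the formula rather than this report's dataset.

```r
# Stand-in predictors from a built-in dataset (not the housing data)
predictors <- mtcars[, c("disp", "hp", "wt")]

vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  # Regress predictor v on the remaining predictors and take its R-squared
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

print(vif_manual)
```

A VIF of 1 means the predictor is uncorrelated with the rest; the value grows without bound as $R_j^2$ approaches 1.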

5.2. Building Linear Regression model

We carry out linear regression analysis to relate the dependent variable House price to the
independent variables.
We start with a first-order model on the variables that remain after the VIF screening; the
variables removed above (lsqrft, assess, lassess) are not considered. Note that “lprice” is
the logarithm of House price and is therefore highly related to it, yet the formula below
still includes it as a regressor.
fit = lm(price ~ bdrms+ lotsize+sqrft+ colonial + lprice + llotsize +house,
data = data)
coef(fit)

## (Intercept) bdrms lotsize sqrft colonial

## -1.623522e+03 8.687747e+00 -6.385875e-04 1.165188e-02 -1.037888e+01
## lprice llotsize house
## 2.913898e+02 2.929917e+01 -2.985674e+01

summary(fit)

##
## Call:
## lm(formula = price ~ bdrms + lotsize + sqrft + colonial + lprice +
## llotsize + house, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.782 -12.887 -2.674 6.438 98.288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.624e+03 9.602e+01 -16.908 < 2e-16 ***
## bdrms 8.688e+00 3.754e+00 2.314 0.02322 *
## lotsize -6.386e-04 4.669e-04 -1.368 0.17524
## sqrft 1.165e-02 7.413e-03 1.572 0.11996
## colonial -1.038e+01 5.836e+00 -1.778 0.07914 .
## lprice 2.914e+02 1.461e+01 19.943 < 2e-16 ***
## llotsize 2.930e+01 1.029e+01 2.848 0.00558 **
## house -2.986e+01 1.041e+01 -2.867 0.00530 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.43 on 80 degrees of freedom
## Multiple R-squared: 0.9522, Adjusted R-squared: 0.948
## F-statistic: 227.5 on 7 and 80 DF, p-value: < 2.2e-16

In the above results, the p-values of “lotsize” (lot size) and “sqrft” (house size) are fairly
high (0.175 and 0.120).
Therefore, we test whether these two variables can be eliminated from our model.
$H_0:\beta_2 = \beta_3 = 0 \\ H_1: \exists \beta_i \ne 0 \ (i = 2,3)$
model1 <- lm( price ~ bdrms + colonial + lprice + llotsize + house, data =
data )
anova(fit, model1)

## Analysis of Variance Table

##
## Model 1: price ~ bdrms + lotsize + sqrft + colonial + lprice + llotsize +
## house
## Model 2: price ~ bdrms + colonial + lprice + llotsize + house
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 43903
## 2 82 46723 -2 -2820.8 2.57 0.08284 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(model1)

##
## Call:
## lm(formula = price ~ bdrms + colonial + lprice + llotsize + house,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.689 -12.350 -2.066 6.169 98.251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1608.649 56.117 -28.666 < 2e-16 ***
## bdrms 10.528 3.593 2.930 0.00439 **
## colonial -12.245 5.806 -2.109 0.03798 *
## lprice 309.087 11.022 28.043 < 2e-16 ***
## llotsize 17.456 5.919 2.949 0.00415 **
## house -26.150 9.665 -2.705 0.00829 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.87 on 82 degrees of freedom
## Multiple R-squared: 0.9491, Adjusted R-squared: 0.946
## F-statistic: 305.8 on 5 and 82 DF, p-value: < 2.2e-16

The resulting p-value is 0.08, which is greater than 0.05. Therefore, we fail to reject the
null hypothesis at the 5% level and eliminate the two variables.
As can be seen above, the p-values of all remaining variables are below 0.05. We therefore
take model1 as our final model.
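
The F statistic in the ANOVA table can be reproduced from the residual sums of squares it reports: $F = \frac{(RSS_{reduced} - RSS_{full}) / q}{RSS_{full} / df_{full}}$, where $q$ is the number of dropped coefficients. The sketch below plugs in the rounded values printed above (43903 and 46723), so small rounding differences from the table are to be expected.

```r
# Values read from the anova(fit, model1) table above (rounded as printed)
rss_full    <- 43903
rss_reduced <- 46723
df_full     <- 80
df_reduced  <- 82

q <- df_reduced - df_full   # number of coefficients dropped (lotsize, sqrft)

f_stat  <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
p_value <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```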

6. Interpretation of the model results.

summary(model1)

plot(model1)
We have the formula of our Linear Regression Model:
$\widehat{price} = -1608.649 + 10.528 \cdot bdrms - 12.245 \cdot colonial + 309.087 \cdot lprice + 17.456 \cdot llotsize - 26.150 \cdot house$
Interpretation of the coefficients (price is in $1000s; each statement holds the other variables constant):
- β 1 : 10.528. One additional bedroom is associated with an increase of 10.528 in the predicted house price.
- β 2 : -12.245. A colonial-style house (colonial = 1) has a predicted price 12.245 lower than a non-colonial house.
- β 3 : 309.087. An increase of 1 in the log of house price is associated with an increase of 309.087 in the predicted house price.
- β 4 : 17.456. An increase of 1 in the log of lot size is associated with an increase of 17.456 in the predicted house price.
- β 5 : -26.150. A house with “house” = 1 (lot size of at least 5000 square ft) has a predicted price 26.150 lower than a house with “house” = 0.
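
Because “llotsize” is a logarithm, an increase of 1 in it corresponds to multiplying the lot size by about 2.72, which is rarely the change of interest. A more practical reading, worked out below as a sketch that holds all other regressors (including “house”) fixed, is the effect of a 1% increase in lot size:

```latex
\Delta\widehat{price}
  = \hat{\beta}_4 \left[ \ln(1.01 \cdot lotsize) - \ln(lotsize) \right]
  = 17.456 \cdot \ln(1.01)
  \approx 17.456 \times 0.00995
  \approx 0.174
```

Since price is measured in $1000s, a 1% larger lot is associated with a predicted price that is roughly 174 dollars higher.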

Activity 02
1. Dataset introduction
The dataset records the insurance charges billed to each insured person together with the
following attributes of the insured: Age, Sex, BMI, Number of Children, Smoker and Region.
The attributes are a mix of numeric and categorical variables.
Before the dataset can be examined, we set it up as follows:
- Set the working directory to the specified address.
- Read the data file “insurance.csv” into the variable named “data1”.
- Use the “attach” function to conveniently reference individual columns as vectors.
setwd("C:/Users/Admin/Desktop")
data1 <- read.csv("insurance.csv", header = TRUE)
attach(data1)

1.1. Brief description

dim(data1)

## [1] 1338 7

• The dataset consists of 1338 rows and 7 columns, including: age, sex, bmi, children,
smoker, region, charges.
• The dataset name is “insurance”.

1.2. Format and structure of dataset

The dataset consists of the columns below:
summary(data1)

## age sex bmi children

## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770

Number  Column    Meaning
1       age       Age of primary beneficiary
2       sex       Insurance contractor gender, female / male
3       bmi       Body mass index
4       children  Number of children covered by health insurance
5       smoker    Whether the person is a smoker, yes / no
6       region    The beneficiary’s residential area in the US: northeast, southeast, southwest, northwest
7       charges   Individual medical costs billed by health insurance

2. Descriptive statistics
2.1. Quantitative statistics

2.1.1. Age
hist(age, main = "Age of beneficiary", xlab = "age", xlim = c(0, 70), breaks
= 10)
boxplot(age)

cat("Mode of age: ", mode(age),"\n" )

## Mode of age: 18

cat("Summary of age: ", "\n")

## Summary of age:

summary(age)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 18.00 27.00 39.00 39.21 51.00 64.00

cat("Total outliers: ", outliers(age), " ", percent_outliers(age))

## Total outliers: 0 0

• The histogram shows that most patients are between 27 and 51 years old. Age 18
appears most frequently in this column.
• There are no outliers in this attribute.

2.1.2. BMI
hist(bmi, main= "BMI (Body Mass Index)", xlab = "BMI", xlim = c(0, 60))

boxplot(bmi)
cat("Mode of BMI: ", mode(bmi),"\n" )

## Mode of BMI: 32.3

cat("Summary of BMI: ", "\n")

## Summary of BMI:

summary(bmi)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 15.96 26.30 30.40 30.66 34.69 53.13

cat("Total outliers: ", outliers(bmi), " ", percent_outliers(bmi))

## Total outliers: 9 0.006726457

• The histogram shows that the BMI of most people lies between 25 and 35. A BMI
of 32.3 appears most frequently in this column.
• There are 9 outliers, which account for 0.67% of the dataset. This small share is
unlikely to affect our analysis results much.

2.1.3. Children
hist(children, main = "Number of dependents", xlab = "Children")
boxplot(children)

cat("Mode of number of dependents: ", mode(children),"\n" )

## Mode of number of dependents: 0

cat("Summary of number of dependents: ", "\n")

## Summary of number of dependents:

summary(children)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 0.000 0.000 1.000 1.095 2.000 5.000

cat("Total outliers: ", outliers(children), " ", percent_outliers(children))

## Total outliers: 0 0

• The histogram shows that most beneficiaries have between 0 and 2 children.
The rightward skew of the histogram is consistent with the mean number of children
being larger than the mode (mode < median < mean).
• There are no outliers in this case.

2.1.4. Charges
hist(charges/1000, main = "Medical costs by health insurance", xlab =
"Charges / $1000", xlim = c(0, 70))

boxplot(charges)
cat("Mode of charges: ", mode(charges),"\n" )

## Mode of charges: 1639.563

cat("Summary of charges: ", "\n")

## Summary of charges:

summary(charges)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 1122 4740 9382 13270 16640 63770

cat("Total outliers: ", outliers(charges), " ", percent_outliers(charges))

## Total outliers: 139 0.1038864

• The histogram shows that the majority of charges lie under 15000. The rightward
skew of the histogram is consistent with the mean of charges being larger than the
mode.
• There are 139 outliers in this case, which make up 10.4% of this attribute. This share
can affect our overall analysis results, so these outliers are candidates for
removal.
2.2. Qualitative statistics.

2.2.1. Sex
The “sex” attribute in the dataset is represented by two string values, “female” and
“male”. Therefore, we convert them into numeric values and replace the column in
the dataframe.
library(dplyr)
data1 <- data1 %>% mutate(sex_numeric = ifelse(sex == "male", 0, 1))

# Remove the old "sex" column

data1 <- data1 %>%
select(-sex)

# Rename the new column to "sex"

colnames(data1)[colnames(data1) == "sex_numeric"] <- "sex"

Now we have a dataframe with “sex” column values “male” as 0 and “female” as 1.
pie(table(sex), main = "Pie chart for sex distribution")

table(sex)

## sex
## female male
## 662 676
cat(paste("Percentage of", "female patients", ":", round((length(sex[sex ==
"female"]) / length(sex))*100, 2), "%\n"))

## Percentage of female patients : 49.48 %

cat(paste("Percentage of", "male patients", ":", round((length(sex[sex ==

"male"]) / length(sex))*100, 2), "%\n"))

## Percentage of male patients : 50.52 %

The pie chart shows that the percentages of female patients and male patients are nearly
the same.

2.2.2. Smoker
The “smoker” attribute is also represented by two string values, “yes” and “no”. Therefore,
we convert them into numeric values and replace the column in the dataframe.
library(dplyr)
data1 <- data1 %>% mutate(smoker_numeric = ifelse(smoker == "no", 0, 1))

# Remove the old "smoker" column

data1 <- data1 %>%
select(-smoker)

# Rename the new column to "smoker"

colnames(data1)[colnames(data1) == "smoker_numeric"] <- "smoker"

Now we have a dataframe with “smoker” column values “no” as 0 and “yes” as 1.
pie(table(smoker), main = "Pie chart for smoker distribution")
table(smoker)

## smoker
## no yes
## 1064 274

cat(paste("Percentage of", "smoker", ":", round((length(smoker[smoker ==

"yes"]) / length(smoker))*100, 2), "%\n"))

## Percentage of smoker : 20.48 %

cat(paste("Percentage of", "non-smoker", ":", round((length(smoker[smoker ==

"no"]) / length(smoker))*100, 2), "%\n"))

## Percentage of non-smoker : 79.52 %

The pie chart shows that about 20% of the patients are smokers and the remaining 80%
are non-smokers.

2.2.3. Region
# Get unique values in the "region" column
unique_regions <- unique(data1$region)

# Count the number of unique regions

num_unique_regions <- length(unique_regions)

# Print the unique regions and the count

cat("Number of unique regions:", num_unique_regions, "\n")

## Number of unique regions: 4

cat("Unique regions:\n")

## Unique regions:

cat(unique_regions, sep = "\n")

## southwest
## southeast
## northwest
## northeast

The “region” attribute is represented by 4 string values: “southwest”, “southeast”,
“northwest”, and “northeast”. Therefore, we convert them into numeric values and replace
the column in the dataframe.
# Assuming you have the dplyr package installed
library(dplyr)

# Convert region values to numeric

data1 <- data1 %>%
mutate(region_numeric = case_when(
region == "southwest" ~ 1,
region == "southeast" ~ 2,
region == "northwest" ~ 3,
region == "northeast" ~ 4,
TRUE ~ NA_real_
)) %>%
select(-region)

# Rename the new column to "region"

colnames(data1)[colnames(data1) == "region_numeric"] <- "region"

Now we have a dataframe with “region” column values “southwest” as 1, “southeast” as 2,

“northwest” as 3, and “northeast” as 4.
pie(table(region), main = "Pie chart for regions distribution")

table(region)

## region
## northeast northwest southeast southwest
## 324 325 364 325

cat(paste("Percentage of", "patients in Southwest", ":",
round((length(region[region == "southwest"]) / length(region))*100, 2), "%\n"))

## Percentage of patients in Southwest : 24.29 %

cat(paste("Percentage of", "patients in Southeast", ":",
round((length(region[region == "southeast"]) / length(region))*100, 2), "%\n"))

## Percentage of patients in Southeast : 27.2 %

cat(paste("Percentage of", "patients in Northwest", ":",
round((length(region[region == "northwest"]) / length(region))*100, 2), "%\n"))

## Percentage of patients in Northwest : 24.29 %

cat(paste("Percentage of", "patients in Northeast", ":",
round((length(region[region == "northeast"]) / length(region))*100, 2), "%\n"))

## Percentage of patients in Northeast : 24.22 %

The pie chart shows that beneficiaries are distributed quite evenly across the four
residential areas.

3. Linear regression on quantitative variables.

3.1. Data cleaning.

We calculate the number of outliers and examine whether they should be kept or not. Total
outliers:
num_outliers<- unique(unlist(apply(data1[,c(1,2,3,4,5,6,7)],2,outliers)))
length(num_outliers)

## [1] 4

round((length(num_outliers) / nrow(data1))*100,2)

## [1] 0.3
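
A note on reading this output: `outliers()` returns one count per column, so `unique()` over those counts yields the number of distinct per-column counts (here 4), not the number of observations that contain an outlier. The sketch below shows one way to count such rows instead. The data frame `toy` and the helper `is_outlier` are hypothetical names introduced only for this illustration.

```r
# Hypothetical toy data: columns a and c each have one outlier, both in row 6
toy <- data.frame(a = c(1, 2, 3, 4, 5, 100),
                  b = c(10, 11, 12, 13, 14, 15),
                  c = c(1, 2, 3, 4, 5, -50))

# TRUE for every value that boxplot.stats() flags in its column
is_outlier <- function(x) x %in% boxplot.stats(x)$out

flags <- sapply(toy, is_outlier)           # logical matrix, rows x columns
rows_with_outlier <- sum(rowSums(flags) > 0)

# What the unique()-of-counts approach measures on the same data
per_column_counts <- sapply(toy, function(x) length(boxplot.stats(x)$out))
distinct_counts   <- length(unique(per_column_counts))

print(rows_with_outlier)
print(distinct_counts)
```

On real data, 0/1 columns such as a recoded “smoker” indicator can be flagged almost entirely by the boxplot rule, so it may make sense to restrict the check to the continuous columns.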

As calculated above, the number of outliers is 284, which makes up 21% of the whole
dataset. Hence, we will not remove the outliers in this case.

3.2. Create dummy variables.

In our dataset, the qualitative variable “region” has 4 distinct values, so we create 4
dummy variables for this attribute and remove the “region” column.
library(fastDummies)

## Warning: package 'fastDummies' was built under R version 4.3.1

## Thank you for using fastDummies!

## To acknowledge our work, please cite the package:

## Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.

data1 <- dummy_cols(data1, select_columns = 'region')

# Remove the original "region" column

data1 <- subset(data1, select = -c(region))

# Rename the columns

colnames(data1)[colnames(data1) == "region_1"] <- "region_southwest"
colnames(data1)[colnames(data1) == "region_2"] <- "region_southeast"
colnames(data1)[colnames(data1) == "region_3"] <- "region_northwest"
colnames(data1)[colnames(data1) == "region_4"] <- "region_northeast"

3.3. Check correlation between variables.

Since the 4 dummy variables of “region” always sum to 1, they are perfectly collinear with
the intercept. We therefore check which variable is an alias (an exact linear combination)
of the others, in order to remove it.
model_activ2 <- lm(charges~., data = data1)

# Check for aliased coefficients

alias_table <- alias(model_activ2)
print(alias_table)

## Model :
## charges ~ age + bmi + children + sex + smoker + region_southwest +
## region_southeast + region_northwest + region_northeast
##
## Complete :
## (Intercept) age bmi children sex smoker region_southwest
## region_northeast 1 0 0 0 0 0 -1
## region_southeast region_northwest
## region_northeast -1 -1

As can be seen from the above result, “region_northeast” is an exact linear combination of
the intercept and the other dummy variables. Therefore, we remove it.
# Remove the original "region_northeast" column
data1 <- subset(data1, select = -c(region_northeast))
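
The reason one dummy has to go can be checked numerically: with an intercept, a full set of dummies is rank-deficient. The sketch below demonstrates this on a small hypothetical `region` factor, and also shows that `model.matrix()` (which `lm()` uses for factor variables) drops a reference level automatically. It is an illustration in base R, not the fastDummies workflow used above.

```r
# Hypothetical region values covering all four levels
region <- factor(c("southwest", "southeast", "northwest",
                   "northeast", "southeast", "southwest"))

# Full set of 0/1 dummies, one column per level
full_dummies <- sapply(levels(region), function(l) as.integer(region == l))
row_totals   <- rowSums(full_dummies)           # every row sums to 1

# Intercept plus all four dummies: 5 columns, but only rank 4
design_rank <- qr(cbind(1, full_dummies))$rank

# Treatment coding keeps an intercept and drops the first level (northeast)
mm <- model.matrix(~ region)

print(row_totals)
print(design_rank)
print(colnames(mm))
```

Passing `region` to `lm()` as a factor would therefore give the same three region coefficients without creating or removing dummy columns by hand.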

pairs( charges ~ ., data = data1 )

r <- cor(dplyr::select_if(data1, is.numeric))
ggcorrplot::ggcorrplot(r,
hc.order = TRUE,
lab = TRUE)
car::vif(lm(charges~., data = data1))

## age bmi children sex

## 1.016822 1.106630 1.004011 1.008900
## smoker region_southwest region_southeast region_northwest
## 1.012074 1.529411 1.652230 1.518823

The Variance Inflation Factor (VIF) measures how strongly an independent variable is
correlated with the other independent variables in a multiple regression model. A VIF
value below 10 indicates that there is no strong correlation with the other variables. In this
case, no VIF value exceeds 10, so we do not remove any variables; none of them is
strongly correlated with the others.

3.4. Building Linear Regression model
model_activ21 <- lm(charges~., data = data1)
summary(model_activ21)

##
## Call:
## lm(formula = charges ~ ., data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12069.9 999.6 -12.074 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## sex 131.3 332.9 0.394 0.693348
## smoker 23848.5 413.1 57.723 < 2e-16 ***
## region_southwest -960.0 477.9 -2.009 0.044765 *
## region_southeast -1035.0 478.7 -2.162 0.030782 *
## region_northwest -353.0 476.3 -0.741 0.458769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16

In the above results, the p-values of “sex” and “region_northwest” are fairly high (0.693
and 0.459). Therefore, we test whether these two variables can be eliminated from our
model.
$H_0:\beta_4 = \beta_8 = 0 \\ H_1: \exists \beta_i \ne 0 \ (i = 4,8)$
model_activ22 <- lm(charges~. -sex - region_northwest, data = data1)
anova(model_activ21, model_activ22)

## Analysis of Variance Table

##
## Model 1: charges ~ age + bmi + children + sex + smoker + region_southwest +
## region_southeast + region_northwest
## Model 2: charges ~ (age + bmi + children + sex + smoker + region_southwest +
## region_southeast + region_northwest) - sex - region_northwest
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1329 4.8840e+10
## 2 1331 4.8865e+10 -2 -25810671 0.3512 0.7039

summary(model_activ22)

##
## Call:
## lm(formula = charges ~ . - sex - region_northwest, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11377.1 -2855.0 -973.6 1349.5 29926.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12165.38 949.54 -12.812 < 2e-16 ***
## age 257.01 11.89 21.617 < 2e-16 ***
## bmi 338.64 28.55 11.860 < 2e-16 ***
## children 471.54 137.66 3.426 0.000632 ***
## smoker 23843.87 411.66 57.921 < 2e-16 ***
## region_southwest -782.75 413.76 -1.892 0.058734 .
## region_southeast -858.47 415.21 -2.068 0.038873 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6059 on 1331 degrees of freedom
## Multiple R-squared: 0.7508, Adjusted R-squared: 0.7497
## F-statistic: 668.3 on 6 and 1331 DF, p-value: < 2.2e-16
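
The `t value` and `Pr(>|t|)` columns can be recomputed from the estimate and standard error of any row. A minimal sketch for `region_southwest`, using the rounded values printed above (the variable names are illustrative):

```r
# Two-sided t-test for a single coefficient, from the printed summary row.
est      <- -782.75   # estimate for region_southwest
se       <- 413.76    # its standard error
df_resid <- 1331      # residual degrees of freedom of model_activ22

t_value <- est / se
# Two-sided p-value: probability of a |t| at least this large under H0.
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(c(t = t_value, p = p_value), 4)
```

The result should agree with the printed row up to rounding.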

The resulting p-value (0.7039) is greater than 0.05, so we fail to reject the null hypothesis
and drop the two variables. In the reduced model, every remaining variable is significant at
the 5% level except region_southwest (p = 0.0587), which is significant at the 10% level. We
therefore keep model_activ22 as our final model.

## 4. Interpretation of the model results.
summary(model_activ22)

We have the formula of our Linear Regression Model:

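$$\text{charges} = -12165.38 + 257.01 \cdot \text{age} + 338.64 \cdot \text{bmi} + 471.54 \cdot \text{children} + 23843.87 \cdot \text{smoker} - 782.75 \cdot \text{region\_southwest} - 858.47 \cdot \text{region\_southeast}$$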
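
To illustrate how the formula is used, the sketch below evaluates it for one hypothetical beneficiary. The profile (age 40, BMI 30, two children, smoker, living in the Southeast) is invented for illustration, and the coefficients are the rounded estimates from the summary output. With the fitted object available, `predict(model_activ22, newdata = ...)` would give the unrounded equivalent.

```r
# Rounded coefficients of model_activ22, copied from the summary output.
coefs <- c(intercept = -12165.38, age = 257.01, bmi = 338.64,
           children = 471.54, smoker = 23843.87,
           region_southwest = -782.75, region_southeast = -858.47)

# Hypothetical beneficiary (illustrative values, not taken from the dataset).
x <- c(intercept = 1, age = 40, bmi = 30, children = 2, smoker = 1,
       region_southwest = 0, region_southeast = 1)

# The prediction is the sum of coefficient * value over all terms.
predicted_charges <- sum(coefs * x[names(coefs)])
predicted_charges
```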

Interpretation of the coefficients (each holding the other variables fixed):
- β1 = 257.01: if the age of the beneficiary increases by 1 year, the expected charges increase by 257.01.
- β2 = 338.64: if the BMI increases by 1 unit, the expected charges increase by 338.64.
- β3 = 471.54: if the number of dependents increases by 1, the expected charges increase by 471.54.
- β4 = 23843.87: smoker is a 0/1 indicator, so a smoker is expected to be charged 23843.87 more than an otherwise identical non-smoker.
- β5 = -782.75: a beneficiary located in the Southwest of the US is expected to be charged 782.75 less than an otherwise identical beneficiary in the reference regions (those without a region indicator in the model).
- β6 = -858.47: a beneficiary located in the Southeast of the US is expected to be charged 858.47 less than an otherwise identical beneficiary in the reference regions.
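
For the 0/1 indicator variables, the coefficient is best read as a difference between two groups and not as a per-unit slope. As a short derivation for `smoker`, with all other regressors held at the same values (the symbol $\mathbf{x}$ for those other regressors is introduced here for illustration):

$$E[\text{charges} \mid \text{smoker} = 1, \mathbf{x}] - E[\text{charges} \mid \text{smoker} = 0, \mathbf{x}] = \beta_4, \qquad \hat{\beta}_4 = 23843.87$$

The same reasoning applies to `region_southwest` and `region_southeast`, where the comparison group is the set of regions without an indicator in the model.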
