AAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAA
2023-08-14
Activity 01
1. Dataset introduction
Before examining the dataset, we set it up as follows:
• Set the working directory to the specified path.
• Read the data file “hprice1.csv” into the variable named “data”.
• Use the “attach” function so that individual columns can be referenced by name as
vectors.
• The provided CSV file has no header, so we name each column as given in the
description file.
setwd("C:/Users/Admin/Desktop")
data <- read.csv("hprice1.csv", header = TRUE, col.names = c("price",
"assess", "bdrms", "lotsize", "sqrft", "colonial", "lprice", "lassess",
"llotsize", "lsqrft"))
attach(data)
dim(data)
## [1] 88 10
• The dataset consists of 88 rows and 10 columns, including: price, assess, bdrms,
lotsize, sqrft, colonial, lprice, lassess, llotsize, lsqrft.
• The dataset name is “data 6”.
2. Descriptive statistics
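Prepare helper functions for later use:
# Most frequent value of a numeric vector (note: this masks base::mode)
mode <- function(x){
  freq_table <- table(x)
  mode_value <- as.numeric(names(freq_table)[which.max(freq_table)])
  return(mode_value)
}
# Fraction of the values that boxplot.stats() flags as outliers
percent_outliers <- function(x){
  outliers <- boxplot.stats(x)$out
  num_outliers <- length(outliers)
  percent <- num_outliers/length(x)
  return(percent)
}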
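As a quick check of the two helpers, the sketch below applies the same logic to a small made-up vector (the values are an assumption for illustration and are not taken from hprice1.csv). The steps are written inline so the snippet runs on its own.

```r
# Toy vector, made up for illustration (not from hprice1.csv)
x <- c(2, 3, 3, 4, 4, 4, 5, 40)

# Most frequent value: table() counts each value, which.max() picks the largest count
freq_table <- table(x)
mode_value <- as.numeric(names(freq_table)[which.max(freq_table)])

# boxplot.stats() flags points lying more than 1.5 times the hinge spread
# beyond the lower or upper hinge
outliers      <- boxplot.stats(x)$out
outlier_share <- length(outliers) / length(x)

print(mode_value)
print(outliers)
print(outlier_share)
```

Multiplying the share by 100 gives the percentages quoted in the descriptive statistics.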
boxplot(price)
cat("Mode of price: ", mode(price),"\n" )
## Summary of price:
summary(price)
• The histogram shows that most house prices lie between 230 and 300 thousand
dollars. The rightward skew of the histogram indicates that the mean house price
is larger than the mode (mode < median < mean).
• There are 5 outliers, which account for 5.6% of the dataset. This share has little
effect on our analysis results.
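The ordering mode < median < mean is a common rule of thumb for right-skewed data rather than a guarantee. A minimal sketch on a made-up right-skewed vector (an assumption for illustration, not the housing data) shows how the three statistics can be compared:

```r
# Made-up right-skewed sample: a few large values pull the mean upward
x <- c(1, 2, 2, 2, 3, 3, 4, 5, 9, 15)

freq_table   <- table(x)
mode_value   <- as.numeric(names(freq_table)[which.max(freq_table)])
median_value <- median(x)
mean_value   <- mean(x)

print(c(mode = mode_value, median = median_value, mean = mean_value))
```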
summary(assess)
• The histogram shows that most assessed values lie between 250 and 300 thousand
dollars. The rightward skew of the histogram indicates that the mean assessed value
is larger than the mode (mode < median < mean).
• There are 5 outliers, which account for 5.6% of the dataset. This share has little
effect on our analysis results.
plot(bdrms)
bdrm_counts <- table(bdrms)
# Build the labels from the observed values so that each slice is labelled correctly
pie(bdrm_counts, labels = paste(names(bdrm_counts), "bdrms"),
    col = c("red", "orange", "yellow", "lightgreen", "lightblue", "violet", "cyan"),
    main = "Pie Chart for Bedrooms attribute")
cat("Mode of bedrooms: ", mode(bdrms), "\n")
## Mode of bedrooms: 3
## Summary of bedrooms:
summary(bdrms)
• The diagram shows that most houses have 3 or 4 bedrooms.
• There are 2 outliers, which account for 2.2% of the dataset. This share has little
effect on our analysis results.
## Summary of lot size:
summary(lotsize)
• The histogram shows that most lot sizes are under 20 thousand square feet. The
rightward skew of the histogram indicates that the mean lot size is larger than the
mode (mode < median < mean).
• There are 11 outliers, which account for 12.5% of the dataset. This share may
slightly affect our analysis results.
boxplot(sqrft)
cat("Mode of house size: ", mode(sqrft),"\n" )
## Summary of house size:
summary(sqrft)
• The histogram shows that most house sizes lie between 1500 and 2000 square feet.
The rightward skew of the histogram indicates that the mean house size is larger
than the mode (mode < median < mean).
• There are 6 outliers, which account for 6.8% of the dataset. This share has little
effect on our analysis results.
table(colonial)
## colonial
## 0 1
## 27 61
The pie chart shows that about one-third of the houses (27 of 88) are not colonial style,
so the majority of the houses (61 of 88) are colonial style.
The output shows that the sample mean of house prices is about 293.5 thousand dollars.
Hypothesis testing: let us test whether the mean house price differs from 293.5 thousand
dollars.
$H_0 : \mu = 293.5 \\ H_1 : \mu \ne 293.5$
t.test(x = price, mu = 293.5, alternative = "two.sided")
##
## One Sample t-test
##
## data: price
## t = 0.0042043, df = 87, p-value = 0.9967
## alternative hypothesis: true mean is not equal to 293.5
## 95 percent confidence interval:
## 271.7831 315.3089
## sample estimates:
## mean of x
## 293.546
Using a one-sample t-test on the house price attribute, we obtain a p-value of 0.9967,
which is close to 1. Therefore, we fail to reject the null hypothesis: the data are
consistent with a mean house price of 293.5 thousand dollars.
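The statistic reported by t.test can be reproduced by hand as $t = (\bar{x} - \mu) / (s / \sqrt{n})$ with $n - 1$ degrees of freedom. The sketch below does this on a small made-up sample (the values and the name mu0 are assumptions for illustration) and compares the result with the built-in function.

```r
# Made-up sample and hypothesised mean (illustration only)
x   <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.2, 4.7, 5.5)
mu0 <- 5

n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))   # standardised distance from mu0
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)      # two-sided p-value

# Compare with the built-in one-sample t-test
ref <- t.test(x, mu = mu0, alternative = "two.sided")
print(c(manual_t = t_stat, builtin_t = unname(ref$statistic)))
print(c(manual_p = p_val,  builtin_p = ref$p.value))
```

A p-value above the chosen significance level means the test fails to reject the null hypothesis; it does not prove that the hypothesised mean is the true one.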
The output shows that the sample mean of lot size is about 9020 square feet.
Hypothesis testing: let us test whether the mean lot size differs from 9020 square feet.
$H_0 : \mu = 9020 \\ H_1 : \mu \ne 9020$
t.test( x = lotsize, mu = 9020, alternative = "two.sided")
##
## One Sample t-test
##
## data: lotsize
## t = -0.00012573, df = 87, p-value = 0.9999
## alternative hypothesis: true mean is not equal to 9020
## 95 percent confidence interval:
## 6864.167 11175.560
## sample estimates:
## mean of x
## 9019.864
Using a one-sample t-test on the lot size attribute, we obtain a p-value of 0.9999,
which is close to 1. Therefore, we fail to reject the null hypothesis: the data are
consistent with a mean lot size of 9020 square feet.
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(nrow(data[data$colonial == 1 & data$house == 1, ]),
##   nrow(data[data$colonial == 1 & data$house == 0, ])) out of c(nrow(data),
##   nrow(data))
## X-squared = 57.805, df = 1, p-value = 1
## alternative hypothesis: less
## 95 percent confidence interval:
## -1.0000000 0.6638851
## sample estimates:
## prop 1 prop 2
## 0.62500000 0.06818182
5. Linear Regression on quantitative variables
5.1. Check correlation between variables.
pairs( price ~ ., data = data )
r <- cor(dplyr::select_if(data, is.numeric))
ggcorrplot::ggcorrplot(r,
hc.order = TRUE,
lab = TRUE)
car::vif(lm(price~., data = data))
The Variance Inflation Factor (VIF) measures how strongly each independent variable is
correlated with the other independent variables in a multiple regression model. A VIF
value below 10 indicates that there is no strong correlation between that variable and
the others. In this case, the VIF of “lsqrft” is 60.07, which indicates a strong
correlation with other variables, so it should be eliminated.
We have a new VIF table of values after removing “lsqrft”:
car::vif(lm(price~. - lsqrft, data = data))
In the above results, the VIF of “assess” is 38.43, which indicates a strong correlation
with other variables, so it should be eliminated.
We have a new VIF table of values after removing “assess”:
car::vif(lm(price~.-lsqrft - assess, data = data))
In the above results, the VIF of “lassess” is 9.8, which is close to 10. We treat this as
a strong correlation with other variables, so it is eliminated as well.
car::vif(lm(price~.-lsqrft - assess - lassess, data = data))
In the above result, the VIF values of all variables are below 10, which means that none
of the remaining variables is strongly correlated with the others.
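For numeric predictors, the VIF of predictor j equals $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor j on all the other predictors. The sketch below computes it in base R; the built-in mtcars data and the chosen columns stand in for the housing data and are assumptions for illustration.

```r
# Built-in mtcars columns used as stand-in predictors (illustration only)
predictors <- mtcars[, c("disp", "hp", "wt", "drat")]

vif_manual <- sapply(names(predictors), function(v) {
  # Regress predictor v on all the other predictors
  others  <- setdiff(names(predictors), v)
  aux_fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux_fit)$r.squared)
})

print(round(vif_manual, 2))
```

The same values appear on the diagonal of the inverse correlation matrix of the predictors, which gives a quick cross-check: diag(solve(cor(predictors))).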
fit <- lm(price ~ bdrms + lotsize + sqrft + colonial + lprice + llotsize + house,
          data = data)
summary(fit)
##
## Call:
## lm(formula = price ~ bdrms + lotsize + sqrft + colonial + lprice +
## llotsize + house, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.782 -12.887 -2.674 6.438 98.288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.624e+03 9.602e+01 -16.908 < 2e-16 ***
## bdrms 8.688e+00 3.754e+00 2.314 0.02322 *
## lotsize -6.386e-04 4.669e-04 -1.368 0.17524
## sqrft 1.165e-02 7.413e-03 1.572 0.11996
## colonial -1.038e+01 5.836e+00 -1.778 0.07914 .
## lprice 2.914e+02 1.461e+01 19.943 < 2e-16 ***
## llotsize 2.930e+01 1.029e+01 2.848 0.00558 **
## house -2.986e+01 1.041e+01 -2.867 0.00530 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.43 on 80 degrees of freedom
## Multiple R-squared: 0.9522, Adjusted R-squared: 0.948
## F-statistic: 227.5 on 7 and 80 DF, p-value: < 2.2e-16
In the above results, the p-values of “lotsize” (lot size) and “sqrft” (house size) are
fairly high.
Therefore, we test whether these variables can be eliminated from our model.
$H_0:\beta_2 = \beta_3 = 0 \\ H_1: \exists \beta_i \ne 0 \ (i = 2,3)$
model1 <- lm( price ~ bdrms + colonial + lprice + llotsize + house, data =
data )
anova(fit, model1)
summary(model1)
##
## Call:
## lm(formula = price ~ bdrms + colonial + lprice + llotsize + house,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.689 -12.350 -2.066 6.169 98.251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1608.649 56.117 -28.666 < 2e-16 ***
## bdrms 10.528 3.593 2.930 0.00439 **
## colonial -12.245 5.806 -2.109 0.03798 *
## lprice 309.087 11.022 28.043 < 2e-16 ***
## llotsize 17.456 5.919 2.949 0.00415 **
## house -26.150 9.665 -2.705 0.00829 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.87 on 82 degrees of freedom
## Multiple R-squared: 0.9491, Adjusted R-squared: 0.946
## F-statistic: 305.8 on 5 and 82 DF, p-value: < 2.2e-16
The resulting p-value of 0.08 is greater than 0.05. Therefore, we fail to reject the null
hypothesis and eliminate the considered variables.
As can be seen above, the p-values of all remaining variables are below 0.05. We
therefore adopt this reduced model for the interpretation that follows.
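The p-value that anova reports for two nested models comes from a partial F-test: $F = \frac{(RSS_{reduced} - RSS_{full}) / q}{RSS_{full} / df_{full}}$, where q is the number of dropped terms. The sketch below reproduces it by hand; the built-in mtcars data and the two formulas are assumptions for illustration, not the housing models.

```r
# Nested models on built-in data (illustration only)
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

rss_full    <- sum(resid(full)^2)
rss_reduced <- sum(resid(reduced)^2)
q           <- df.residual(reduced) - df.residual(full)   # number of dropped terms

f_stat <- ((rss_reduced - rss_full) / q) / (rss_full / df.residual(full))
p_val  <- pf(f_stat, q, df.residual(full), lower.tail = FALSE)

# Compare with the built-in comparison of nested models
ref <- anova(reduced, full)
print(c(manual_F = f_stat, anova_F = ref$F[2]))
print(c(manual_p = p_val,  anova_p = ref$`Pr(>F)`[2]))
```

A p-value above 0.05 means the dropped terms do not significantly reduce the residual sum of squares, which is the criterion used in this report for removing them.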
plot(model1)
We have the formula of our Linear Regression Model:
$\text{price} = -1608.649 + 10.528 \cdot \text{bdrms} - 12.245 \cdot \text{colonial} + 309.087 \cdot \text{lprice} + 17.456 \cdot \text{llotsize} - 26.150 \cdot \text{house}$
Interpretation of the coefficients (each holding the other variables constant, with
price measured in thousand dollars):
• β1: 10.528. If the number of bedrooms increases by 1, the predicted house price
increases by 10.528.
• β2: -12.245. A colonial-style house (colonial = 1) has a predicted price 12.245 lower
than a non-colonial house.
• β3: 309.087. If the log of house price increases by 1, the predicted house price
increases by 309.087.
• β4: 17.456. If the log of lot size increases by 1, the predicted house price increases
by 17.456.
• β5: -26.15. If the value of the “house” variable increases by 1, the predicted house
price decreases by 26.15.
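The reading "a one-unit increase changes the prediction by the coefficient" can be verified directly with predict. The sketch below uses the built-in mtcars data and a two-predictor model as stand-ins (assumptions for illustration, not the housing model).

```r
# Stand-in model on built-in data (illustration only)
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars that differ by exactly one unit of wt
base    <- data.frame(wt = 3, hp = 110)
shifted <- data.frame(wt = 4, hp = 110)

diff_pred <- unname(predict(fit_demo, shifted) - predict(fit_demo, base))
coef_wt   <- unname(coef(fit_demo)["wt"])

print(c(difference_in_prediction = diff_pred, coefficient = coef_wt))
```

The two numbers agree because the other predictor is held fixed, which is the condition under which each coefficient above is interpreted.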
Activity 02
1. Dataset introduction
The dataset records the insurance charges billed against the following attributes of the
insured: Age, Sex, BMI, Number of Children, Smoker and Region.
The attributes are a mix of numeric and categorical variables.
Before examining the dataset, we set it up as follows:
• Set the working directory to the specified path.
• Read the data file “insurance.csv” into the variable named “data1”.
• Use the “attach” function so that individual columns can be referenced by name as
vectors.
setwd("C:/Users/Admin/Desktop")
data1 <- read.csv("insurance.csv", header = TRUE)
attach(data1)
dim(data1)
## [1] 1338 7
• The dataset consists of 1338 rows and 7 columns, including: age, sex, bmi, children,
smoker, region, charges.
• The dataset name is “insurance”.
2. Descriptive statistics
2.1. Quantitative statistics
2.1.1. Age
hist(age, main = "Age of beneficiary", xlab = "age", xlim = c(0, 70), breaks
= 10)
boxplot(age)
## Summary of age:
summary(age)
• The histogram shows that most patients are between 27 and 51 years old. Age 18 is
the most frequent value in this column.
• There are no outliers in this attribute.
2.1.2. BMI
hist(bmi, main= "BMI (Body Mass Index)", xlab = "BMI", xlim = c(0, 60))
boxplot(bmi)
cat("Mode of BMI: ", mode(bmi),"\n" )
## Summary of BMI:
summary(bmi)
• The histogram shows that most BMI values lie between 25 and 35. A BMI of 32.3 is
the most frequent value in this column.
• There are 9 outliers, which account for 0.67% of the dataset. This share has little
effect on our analysis results.
2.1.3. Children
hist(children, main = "Number of dependents", xlab = "Children")
boxplot(children)
summary(children)
## Total outliers: 0 0
• The histogram shows that most beneficiaries have between 0 and 2 children. The
rightward skew of the histogram indicates that the mean number of children is
larger than the mode (mode < median < mean).
• There are no outliers in this case.
2.1.4. Charges
hist(charges/1000, main = "Medical costs by health insurance", xlab =
"Charges / $1000", xlim = c(0, 70))
boxplot(charges)
cat("Mode of charges: ", mode(charges),"\n" )
summary(charges)
• The histogram shows that most charges are under 15000. The rightward skew of the
histogram indicates that the mean of charges is larger than the mode.
• There are 139 outliers in this case, which make up 10.3% of this attribute. This
share is large enough to affect our analysis results, so these outliers are
candidates for removal.
2.2. Qualitative statistics
2.2.1. Sex
The “sex” attribute in the dataset is stored as 2 string values, “female” and “male”.
Therefore, we convert them into numeric values and write the new values back into the
dataframe.
library(dplyr)
# Overwrite the column in data1; the attached copy of sex keeps the original strings
data1 <- data1 %>% mutate(sex = ifelse(sex == "male", 0, 1))
Now we have a dataframe with “sex” column values “male” as 0 and “female” as 1.
pie(table(sex), main = "Pie chart for sex distribution")
table(sex)
## sex
## female male
## 662 676
cat(paste("Percentage of", "female patients", ":", round((length(sex[sex ==
"female"]) / length(sex))*100, 2), "%\n"))
The pie chart shows that the percentages of female and male patients are nearly the
same.
2.2.2. Smoker
The “smoker” attribute is also stored as 2 string values, “yes” and “no”. Therefore, we
convert them into numeric values and write the new values back into the dataframe.
library(dplyr)
# Overwrite the column in data1; the attached copy of smoker keeps the original strings
data1 <- data1 %>% mutate(smoker = ifelse(smoker == "no", 0, 1))
Now we have a dataframe with “smoker” column values “no” as 0 and “yes” as 1.
pie(table(smoker), main = "Pie chart for smoker distribution")
table(smoker)
## smoker
## no yes
## 1064 274
The pie chart shows that about 20% of the patients are smokers and the remaining 80%
are non-smokers.
2.2.3. Region
# Get unique values in the "region" column
unique_regions <- unique(data1$region)
cat("Unique regions:\n")
cat(unique_regions, sep = "\n")
## Unique regions:
## southwest
## southeast
## northwest
## northeast
table(region)
## region
## northeast northwest southeast southwest
## 324 325 364 325
The pie chart shows that the beneficiaries are distributed almost equally across the
four residential regions.
## [1] 4
round((length(num_outliers) / nrow(data1))*100,2)
## [1] 0.3
As calculated above, the number of outliers is 284, which makes up 21% of the whole
dataset. Hence, we will not remove the outliers in this case.
3.2. Create dummy variables
In our dataset, the qualitative variable “region” takes 4 distinct values, so we create 4
dummy variables for this attribute and remove the “region” column.
library(fastDummies)
## Model :
## charges ~ age + bmi + children + sex + smoker + region_southwest +
## region_southeast + region_northwest + region_northeast
##
## Complete :
## (Intercept) age bmi children sex smoker region_southwest
## region_northeast 1 0 0 0 0 0 -1
## region_southeast region_northwest
## region_northeast -1 -1
As can be seen from the above result, “region_northeast” is an exact linear combination
of the intercept and the other three dummy variables (perfect collinearity). Therefore,
we remove it.
# Remove the original "region_northeast" column
data1 <- subset(data1, select = -c(region_northeast))
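The collinearity that led to dropping “region_northeast” can be reproduced on a tiny example: one 0/1 column per level always sums to 1 in every row, which duplicates the intercept column. The sketch below uses a made-up region vector and base R's model.matrix instead of fastDummies (both are assumptions for illustration), so the column names differ from those in the report.

```r
# Made-up region vector covering all four levels (illustration only)
region <- factor(c("northeast", "northwest", "southeast", "southwest",
                   "northeast", "southwest", "southeast", "northwest"))

# One 0/1 column per level, no intercept
dummies <- model.matrix(~ region - 1)

# Adding an intercept column makes the design matrix rank-deficient:
# only 4 of its 5 columns are linearly independent
X_all <- cbind(intercept = 1, dummies)
print(qr(X_all)$rank)

# Dropping one dummy column restores full column rank
X_ok <- X_all[, colnames(X_all) != "regionnortheast"]
print(qr(X_ok)$rank == ncol(X_ok))
```

The dropped level becomes the reference category, so the remaining dummy coefficients are read as differences from it.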
The Variance Inflation Factor (VIF) measures how strongly each independent variable is
correlated with the other independent variables in a multiple regression model. A VIF
value below 10 indicates that there is no strong correlation between that variable and
the others. In this case, no VIF value is above 10, so none of the variables are strongly
correlated with each other and we do not remove any of them.
3.4. Building Linear Regression model
model_activ21 <- lm(charges~., data = data1)
summary(model_activ21)
##
## Call:
## lm(formula = charges ~ ., data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12069.9 999.6 -12.074 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## sex 131.3 332.9 0.394 0.693348
## smoker 23848.5 413.1 57.723 < 2e-16 ***
## region_southwest -960.0 477.9 -2.009 0.044765 *
## region_southeast -1035.0 478.7 -2.162 0.030782 *
## region_northwest -353.0 476.3 -0.741 0.458769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
In the above results, the p-values of “sex” and “region_northwest” are fairly high.
Therefore, we test whether these variables can be eliminated from our model.
$H_0:\beta_4 = \beta_8 = 0 \\ H_1: \exists \beta_i \ne 0 \ (i = 4,8)$
model_activ22 <- lm(charges~. -sex - region_northwest, data = data1)
anova(model_activ21, model_activ22)
summary(model_activ22)
##
## Call:
## lm(formula = charges ~ . - sex - region_northwest, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11377.1 -2855.0 -973.6 1349.5 29926.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12165.38 949.54 -12.812 < 2e-16 ***
## age 257.01 11.89 21.617 < 2e-16 ***
## bmi 338.64 28.55 11.860 < 2e-16 ***
## children 471.54 137.66 3.426 0.000632 ***
## smoker 23843.87 411.66 57.921 < 2e-16 ***
## region_southwest -782.75 413.76 -1.892 0.058734 .
## region_southeast -858.47 415.21 -2.068 0.038873 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6059 on 1331 degrees of freedom
## Multiple R-squared: 0.7508, Adjusted R-squared: 0.7497
## F-statistic: 668.3 on 6 and 1331 DF, p-value: < 2.2e-16
The resulting p-value of 0.7 is greater than 0.05. Therefore, we fail to reject the null
hypothesis and eliminate the considered variables. As can be seen above, the p-values of
the remaining variables are low (all below 0.05 except “region_southwest” at 0.059). We
therefore adopt this reduced model.
4. Interpretation of the model results
Interpretation of the coefficients (each holding the other variables constant):
• β1: 257.01. If the age of the beneficiary increases by 1 year, the predicted charges
increase by 257.01.
• β2: 338.64. If the BMI increases by 1, the predicted charges increase by 338.64.
• β3: 471.54. If the number of dependents increases by 1, the predicted charges
increase by 471.54.
• β4: 23843.87. A smoker (smoker = 1) has predicted charges 23843.87 higher than a
non-smoker.
• β5: -782.75. A beneficiary located in the Southwest of the US has predicted charges
782.75 lower than one located in the reference regions (Northeast and Northwest).
• β6: -858.47. A beneficiary located in the Southeast of the US has predicted charges
858.47 lower than one located in the reference regions (Northeast and Northwest).
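To make the coefficients concrete, the sketch below plugs the estimates reported for model_activ22 into the fitted equation for a hypothetical beneficiary. The helper predict_charges and the example person (age 40, BMI 30, 2 children, living in the Southeast) are assumptions for illustration and are not part of the original analysis.

```r
# Estimates copied from the summary(model_activ22) output
coefs <- c(intercept = -12165.38, age = 257.01, bmi = 338.64, children = 471.54,
           smoker = 23843.87, region_southwest = -782.75, region_southeast = -858.47)

# Hypothetical helper: evaluates the fitted linear equation for one beneficiary
predict_charges <- function(age, bmi, children, smoker,
                            region_southwest, region_southeast) {
  unname(coefs["intercept"] +
           coefs["age"] * age +
           coefs["bmi"] * bmi +
           coefs["children"] * children +
           coefs["smoker"] * smoker +
           coefs["region_southwest"] * region_southwest +
           coefs["region_southeast"] * region_southeast)
}

# Same hypothetical person, once as a non-smoker and once as a smoker
pred_non_smoker <- predict_charges(40, 30, 2, smoker = 0,
                                   region_southwest = 0, region_southeast = 1)
pred_smoker     <- predict_charges(40, 30, 2, smoker = 1,
                                   region_southwest = 0, region_southeast = 1)

print(c(non_smoker = pred_non_smoker, smoker = pred_smoker))
print(pred_smoker - pred_non_smoker)   # equals the smoker coefficient
```

With the fitted model object available, predict(model_activ22, newdata = ...) would give the same numbers up to the rounding of the printed estimates.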