Data Analytics Lesson 12 Notes
Linear regression continued
Contents
Introduction
Lesson outcomes
References
DATA ANALYTICS
Lesson outcomes
By the end of this lesson, you should be able to:
Introduction
In this lesson we continue to broaden our understanding of linear regression and data frames. We will look at what it
means for a model to fit the data well and gain further insight into handling data in R. We end the lesson by exploring
some basics of working with date values in R.
Recall that, mathematically, the simple linear regression relationship expresses the predicted value (the estimate) of y
in terms of the estimated intercept and slope. These estimates are then used to predict the value of the outcome variable.
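Written out, the fitted simple linear regression line can be sketched as follows. The symbols are an assumption, since the notes do not fix a notation: hats denote estimates, with \hat{\beta}_0 the estimated intercept, \hat{\beta}_1 the estimated slope and x the value of the predictor.

```latex
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
```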
We also learnt that we can use the function lm() to fit a simple linear regression model in R.
If the fitted model is stored in an object called lm.fit, then printing lm.fit provides some basic information about the
model and summary(lm.fit) provides more detailed information.
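As a minimal sketch of this workflow, the example below fits a simple linear regression on R's built-in cars data set. The data set, the variable names and the value used for prediction are illustrative assumptions, not taken from the lesson.

```r
# Fit a simple linear regression on the built-in 'cars' data set:
# stopping distance (dist) as a function of speed.
# The data set is an illustrative choice, not the one used in the lesson.
lm.fit <- lm(dist ~ speed, data = cars)

# Printing the fitted object shows the call and the estimated coefficients
print(lm.fit)

# summary() adds standard errors, t-values, p-values, the RSE and R-squared
print(summary(lm.fit))

# Use the estimated intercept and slope to predict a new observation
new_obs <- data.frame(speed = 15)
predicted_dist <- predict(lm.fit, newdata = new_obs)
print(predicted_dist)
```

The prediction is the estimated intercept plus the estimated slope multiplied by the new speed value, which is what predict() computes here.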
In multiple linear regression, the mathematical model contains several coefficients and several predictor variables.
Multiple linear regression therefore models the relationship between two or more predictor variables and the response
variable by fitting a linear equation to the data.
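A multiple linear regression can be fitted with the same lm() function by adding predictors to the formula. The sketch below assumes the built-in mtcars data set and an arbitrary choice of two predictors, purely for illustration.

```r
# Multiple linear regression on the built-in 'mtcars' data set:
# fuel efficiency (mpg) modelled by weight (wt) and horsepower (hp).
# The data set and the choice of predictors are illustrative assumptions.
mlr.fit <- lm(mpg ~ wt + hp, data = mtcars)

# One intercept plus one coefficient per predictor variable
print(coef(mlr.fit))

# Detailed output, as for the simple linear model
print(summary(mlr.fit))
```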
• In a simple linear model, the residual standard error (RSE) is an estimate of the standard deviation of the
random error term.
• An error term is associated with each observation, because we are never able to predict Y perfectly from a given
sample of observations.
• The RSE therefore tells us the average amount by which the response variable deviates from the true regression
line.
• In other words, the RSE measures the lack of fit of the model to the data.
• A small RSE indicates that the model fits the data well; a large RSE indicates that it fits poorly. The RSE is
measured in the units of Y, so what counts as small depends on the scale of the response.
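The RSE of a simple linear model can be computed by hand as the square root of the residual sum of squares divided by n - 2, and compared with the value that summary() reports. The following is a small sketch, again assuming the built-in cars data set for illustration.

```r
# Fit a simple linear model (illustrative data set, not from the lesson)
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares and number of observations
rss <- sum(residuals(fit)^2)
n <- nrow(cars)

# RSE for simple linear regression: sqrt(RSS / (n - 2)),
# where 2 is the number of estimated coefficients (intercept and slope)
rse_manual <- sqrt(rss / (n - 2))

# The same quantity as reported by summary()
rse_summary <- summary(fit)$sigma

print(c(manual = rse_manual, summary = rse_summary))
```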
R-squared (R²)
Note: the R-squared value never decreases as we add more predictor variables to the model, and it almost always
increases, even if the added predictor variables are not strongly associated with the response variable. We must
therefore be careful not to add variables that provide no real improvement to the model fit, but more on this in later lessons.
Another note: a large R-squared value does not necessarily mean that the estimated regression line fits the data well;
another function, for example a polynomial trend, might describe the data better.
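Both notes can be illustrated with a short sketch. The data below are assumptions made up for the illustration: a pure-noise predictor added to the built-in mtcars data, and a simulated curved relationship.

```r
set.seed(1)

# Note 1: adding a predictor that is pure noise does not lower R-squared
dat <- mtcars
dat$noise <- rnorm(nrow(dat))  # unrelated to mpg by construction

fit_small <- lm(mpg ~ wt, data = dat)
fit_big <- lm(mpg ~ wt + noise, data = dat)

r2_small <- summary(fit_small)$r.squared
r2_big <- summary(fit_big)$r.squared
print(c(without_noise = r2_small, with_noise = r2_big))

# Note 2: a straight line can have a large R-squared even though
# a polynomial trend describes the data better
x <- 1:20
y <- x^2 + rnorm(length(x), sd = 5)  # simulated quadratic relationship

fit_line <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))

r2_line <- summary(fit_line)$r.squared
r2_quad <- summary(fit_quad)$r.squared
print(c(line = r2_line, quadratic = r2_quad))
```

Plotting the residuals of fit_line against x would show a clear curved pattern, which is the usual sign that a straight line is the wrong shape despite its large R-squared.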
Apart from being described as a modern reimagining of data frames in R, tibbles are also called lazy and surly data
frames, because they do less than data frames and complain more. This complaining pushes us to tackle problems in the
data earlier, which in the long run makes our code cleaner and more efficient.
• Using the tibble package in R, we can convert a data frame into a tibble with the function as_tibble().
• We can also create a new tibble with the function tibble().
• Furthermore, we can define a tibble row by row with the function tribble(). Tribble is short for
transposed tibble.
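The three functions above can be sketched as follows. The example assumes the tibble package is installed; the column names and values are made up for illustration.

```r
library(tibble)  # assumes the package is installed

# Convert an existing data frame into a tibble
tb_from_df <- as_tibble(head(mtcars))

# Create a new tibble column by column
tb_new <- tibble(
  x = 1:3,
  y = c("a", "b", "c")
)

# Define the same tibble row by row; column names are given as ~name
tb_rows <- tribble(
  ~x, ~y,
  1L, "a",
  2L, "b",
  3L, "c"
)

print(tb_from_df)
print(tb_new)
print(tb_rows)
```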
Current date
• To get the current time zone, use the command
o Sys.timezone()
• To get the current date, use the command
o Sys.Date()
• To get the current date, time and time zone, use the command
o Sys.time()
• In lubridate, to get the current date, time and time zone, use the command
o now()
• We can convert a string that is in date format to a date object with the as.Date() function.
• The default date format in R is year-month-day, for example 2020-12-31.
Note: For our American friends, whose dates are usually written month-day-year, you will need to pass an additional
format argument to tell R what format your date is in.
• In lubridate, use the command ymd() or mdy(), depending on the format of the date, to convert the string to a date
object.
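The commands in this section can be tried out with the sketch below. The date strings are made-up examples, and the lubridate calls run only if that package is installed, which is an assumption about the reader's setup.

```r
# Current time zone, date, and date-time in base R
tz <- Sys.timezone()
today <- Sys.Date()     # class "Date"
now_base <- Sys.time()  # class "POSIXct"
print(tz)
print(today)
print(now_base)

# A string in the default year-month-day format needs no extra argument
d_iso <- as.Date("2020-12-31")

# A month/day/year string needs an explicit format
d_us <- as.Date("12/31/2020", format = "%m/%d/%Y")
print(d_iso == d_us)

# The lubridate equivalents, run only if the package is available
if (requireNamespace("lubridate", quietly = TRUE)) {
  print(lubridate::now())
  d_ymd <- lubridate::ymd("2020-12-31")
  d_mdy <- lubridate::mdy("12/31/2020")
  print(d_ymd == d_mdy)
}
```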
References
• Boehmke, B., 2020, Dealing with Dates, UC Business Analytics R Programming Guide,
http://uc-r.github.io/dates/
• James, G., Witten, D., Hastie, T. & Tibshirani, R., 2013, An Introduction to Statistical
Learning with Applications in R, Springer,
https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
• Shekar, 2018, Model on Boston (simple linear regression), RPubs,
https://rpubs.com/shekar07/399193
• Yue Xie, A., 2017, Regression with R – Boston Housing Price, Kaggle,
https://www.kaggle.com/andyxie/regression-with-r-Boston-housing-price
• JMP Statistical Discovery from SAS, 2019, Fitting the Multiple Linear Regression Model,
https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-multiple-regression/fitting-multiple-regression-model.html
• Spinu, V., 2020, ymd, RDocumentation,
https://www.rdocumentation.org/packages/lubridate/versions/1.7.9/topics/ymd
• Spinu, V., 2020, Package ‘lubridate’, R-Project,
https://cran.r-project.org/web/packages/lubridate/lubridate.pdf
• Seth, K., 2020, EDA and Multiple Linear Regression on Boston Housing in R, Analytics
Vidhya,
https://medium.com/analytics-vidhya/eda-and-multiple-linear-regression-on-Boston-housing-in-r-270f858dc7b