Data Analytics Lesson 12 Notes

This document provides an overview of linear regression and data frames in R, including concepts such as simple and multiple linear regression, model fit statistics, and the use of tibbles. It also introduces the Lubridate package for handling dates and times in R, detailing commands for obtaining current date and time, as well as converting strings to date objects. The lesson aims to enhance understanding of these topics for data analytics.
© All Rights Reserved

Diploma in Data Analytics

Linear regression continued

Contents

Introduction
Lesson outcomes
Linear regression continued
Data frames continued
Dates and times
References

DATA ANALYTICS

Lesson outcomes
By the end of this lesson, you should have a working understanding of:

• Linear regression continued
• Data frames continued
• Dates and times

Introduction
In this lesson we will continue to broaden our understanding of linear regression and data frames. We will look at
what it means for a model to fit the data well and gain further insight into handling data in R. We will end the lesson
by exploring the basics of working with date values in R.

Linear regression continued


Linear regression recap
In the previous lesson, we learnt that a simple linear regression models the relationship between a single independent
(predictor) variable X and the dependent (response) variable Y, our outcome variable, by fitting a straight-line equation
to the data. The model therefore assumes that the relationship between the predictor variable X and the response
variable Y is linear.

Mathematically, the fitted model predicts the response as $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, where
$\hat{\beta}_0$ is the estimated intercept and $\hat{\beta}_1$ is the estimated slope. These estimates are used to predict
the value of the outcome variable.

We also learnt that we can use the function lm() to fit a simple linear regression model in R.

If the fitted model is stored in an object, here called lm.fit, printing lm.fit provides some basic information about the
model and summary(lm.fit) provides more detailed information.
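
As a minimal sketch of the recap above, the following fits a simple linear regression and inspects it. The mtcars data set (shipped with R) and the variables mpg and wt are illustrative assumptions, not the data used in the lesson.

```r
# Fit a simple linear regression of fuel efficiency (mpg) on car weight (wt).
# mtcars is an illustrative data set that ships with R, not the lesson's data.
lm.fit <- lm(mpg ~ wt, data = mtcars)

# Printing the object shows the call and the estimated coefficients.
print(lm.fit)

# summary() adds standard errors, p-values, the RSE and R-squared.
model_summary <- summary(lm.fit)
print(model_summary)

# The estimated intercept and slope can be extracted directly.
estimates <- coef(lm.fit)
```

The object name lm.fit follows the text; any other valid name would work equally well.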

Multiple linear regression


Simple linear regression is useful when we have only one predictor variable for a response or outcome variable, but in
practice this is often not the case. We use a multiple linear regression model to relate multiple predictors to a single
outcome variable.

The mathematical model now contains one coefficient for each predictor, in addition to the intercept. Multiple linear
regression therefore models the relationship between two or more predictor variables and the response variable by
fitting a linear equation to the data.
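
A short sketch of a multiple linear regression in R, again assuming the illustrative mtcars data set; the choice of wt and hp as predictors is hypothetical and only serves to show the formula syntax.

```r
# Multiple linear regression: predictors are joined with "+" in the formula.
# mtcars, wt and hp are illustrative assumptions, not the lesson's data.
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)

# One coefficient per predictor, plus the intercept.
multi_coefs <- coef(multi_fit)
print(multi_coefs)

# The same summary() call used for simple regression applies here.
print(summary(multi_fit))
```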

Model fit statistics


Residual standard error

• In a simple linear model, the residual standard error (RSE) is an estimate of the standard deviation of the
random error term.
• An error term is associated with each observation, because we can never predict Y perfectly from a sample of
observations.
• The RSE therefore tells us the average amount by which the response variable deviates from the true regression
line.
• It is a measure of the lack of fit of the model to the data, expressed in the units of Y.
• A small RSE, relative to the scale of Y, indicates that the model fits the data well; a large RSE indicates a poor fit.
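
The points above can be checked with a small sketch. It assumes the illustrative mtcars data and uses the standard formula for a simple linear model, RSE = sqrt(RSS / (n - 2)), which is not stated in these notes.

```r
# Illustrative simple linear model (mtcars is an assumption, not the lesson's data).
lm.fit <- lm(mpg ~ wt, data = mtcars)

# Residual sum of squares and number of observations.
rss <- sum(residuals(lm.fit)^2)
n <- nrow(mtcars)

# RSE computed by hand: two parameters (intercept and slope) are estimated,
# so the residual degrees of freedom are n - 2.
rse_manual <- sqrt(rss / (n - 2))

# RSE as reported by summary(); it is stored in the "sigma" element.
rse_reported <- summary(lm.fit)$sigma
```

Because the RSE is in the units of the response (miles per gallon here), it is easiest to judge against the typical size of the response values.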

R-squared (R²)

• The r-squared statistic is also known as the coefficient of determination.
• The r-squared statistic measures the proportion of variability in the response that is explained by the model.
• In simple linear regression, it equals the square of the correlation between the predictor and the response.
• In multiple linear regression, it equals the square of the correlation between the response and the fitted values
of the model.
• An r-squared value near 1 shows that the model explains a large part of the variance in the response variable.
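
To make these definitions concrete, here is a hedged sketch that computes r-squared three ways on the illustrative mtcars data. The formula 1 - RSS/TSS is the standard definition and is an addition to these notes.

```r
# Illustrative multiple regression (mtcars, wt and hp are assumptions).
lm.fit <- lm(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg

# Proportion of variance explained: 1 - RSS / TSS.
rss <- sum(residuals(lm.fit)^2)
tss <- sum((y - mean(y))^2)
r2_manual <- 1 - rss / tss

# Squared correlation between the response and the fitted values.
r2_cor <- cor(y, fitted(lm.fit))^2

# Value reported by summary().
r2_reported <- summary(lm.fit)$r.squared
```

All three values agree up to floating-point rounding.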

Note: the r-squared value never decreases, and almost always increases, as we add more predictor variables to the model,
even if the added predictors are not strongly associated with the response variable. We must therefore be careful not to
add variables that provide no real improvement to the model fit, but more on this in later lessons.
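
The note can be illustrated with a small experiment: adding a column of pure random noise to a model. This is only a sketch; the mtcars data, the seed and the column name noise are all hypothetical choices.

```r
# Add a predictor that is pure random noise, unrelated to the response.
set.seed(1)                               # arbitrary seed, for reproducibility
cars_noise <- mtcars
cars_noise$noise <- rnorm(nrow(cars_noise))

# Fit the model without and with the noise predictor.
base_fit <- lm(mpg ~ wt, data = cars_noise)
noise_fit <- lm(mpg ~ wt + noise, data = cars_noise)

# R-squared of the larger model is never lower than that of the smaller one.
r2_base <- summary(base_fit)$r.squared
r2_noise <- summary(noise_fit)$r.squared
print(c(base = r2_base, with_noise = r2_noise))
```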

Another note: a large r-squared value does not necessarily mean that the estimated regression line fits the data well;
another function, for example a polynomial trend, might describe the data better.

Data frames continued


tibble
By now you should be familiar with the tidyverse, a collection of R packages used to transform and visualize data.
Instead of the traditional data frames of base R, the tidyverse uses what it calls tibbles. Tibbles are data frames, but
with a few behaviours changed where Hadley Wickham, the creator of the tidyverse, felt the original data frame concept
could be improved. These changes are intended to make working with data frames easier.

Apart from being described as a modern reimagining of data frames in R, tibbles are also called lazy and surly data
frames, because they do less than data frames and complain more. This complaining pushes us to tackle problems in the
data earlier, which in the long run makes our code cleaner and more efficient.

A few of the basic functions for working with tibbles:

• Using the tibble package in R, we can convert a data frame into a tibble with the function as_tibble().
• We can create a new tibble with the function tibble().
• We can also define a tibble row by row with the function tribble(). Tribble is short for transposed tibble.
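
The three functions above might be used as follows. This is a sketch that assumes the tibble package is installed; the student marks are made-up illustrative values.

```r
library(tibble)

# Convert an existing data frame (the built-in iris data) into a tibble.
iris_tbl <- as_tibble(iris)

# Create a tibble column by column; later columns may refer to earlier ones.
scores <- tibble(
  student = c("Ann", "Ben", "Cara"),   # hypothetical example data
  mark    = c(72, 45, 88),
  passed  = mark >= 50
)

# Create the same data row by row; column names are given as ~name.
scores_by_row <- tribble(
  ~student, ~mark,
  "Ann",    72,
  "Ben",    45,
  "Cara",   88
)

print(scores)
```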

Tibble vs data frame


• Tibbles have a compact printing method that shows only the first 10 rows and all the columns that fit on the
screen. This is handy when you work with large data sets.
• Subsetting a tibble with [ always returns a tibble. You don’t need to use drop = FALSE, in contrast to conventional
data frames, where selecting a single column returns a vector by default.
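
A small sketch of the subsetting difference, assuming the tibble package is installed; the data frame df and its columns are hypothetical.

```r
library(tibble)

# The same data as a conventional data frame and as a tibble.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)

# Selecting one column from a data frame drops to a plain vector ...
df_col <- df[, "x"]

# ... unless drop = FALSE is given explicitly.
df_col_kept <- df[, "x", drop = FALSE]

# Selecting one column from a tibble with [ still returns a tibble.
tbl_col <- tbl[, "x"]

# To pull a column out of a tibble as a vector, use [[ or $.
tbl_vec <- tbl[["x"]]
```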

Dates and times


Lubridate package
R has a package called lubridate that makes working with dates and times easier. The package was created by Garrett
Grolemund and Hadley Wickham, the creator of the tidyverse, and is currently maintained by Vitalie Spinu.

Current date
• To get the current time zone, use the command
o Sys.timezone()
• To get the current date, use the command
o Sys.Date()
• To get the current date, time and time zone, use the command
o Sys.time()
• In lubridate, to get the current date, time and time zone, use the command
o now()
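
The commands above can be tried together as in the sketch below, which assumes the lubridate package is installed. Note that R is case-sensitive, so Sys.Date() and now() must be typed exactly as shown. The values returned depend on when and where the code is run.

```r
library(lubridate)

# Base R: name of the current time zone (may be NA on some systems).
tz_name <- Sys.timezone()

# Base R: current date, returned as a Date object.
current_date <- Sys.Date()

# Base R: current date and time, returned as a POSIXct object.
current_time <- Sys.time()

# lubridate: current date and time, also a POSIXct object.
current_time_lubridate <- now()

print(current_date)
print(current_time)
```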

Converting strings to dates


Dates are often converted to character strings when imported into R.

• We can convert a string that is in date format to a date object with the as.Date() function.
• The default date format in R is year-month-day (YYYY-MM-DD).

Note: For our American friends, whose dates are usually written month first, you will need to tell R what format your
date is in by supplying the format argument, for example format = "%m/%d/%Y".
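
A minimal sketch of both cases in base R; the date strings are made-up examples.

```r
# A string already in the default year-month-day format converts directly.
iso_date <- as.Date("2020-03-15")

# A month-first string needs a format argument describing its layout:
# %m = month, %d = day, %Y = four-digit year.
us_date <- as.Date("03/15/2020", format = "%m/%d/%Y")

# Both results are Date objects representing the same day.
class(iso_date)
same_day <- iso_date == us_date
```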

If using the lubridate package,

• Use the function ymd() or mdy(), depending on the order of year, month and day in the string, to convert the string
to a date object.
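
The lubridate equivalents might look like this, assuming the package is installed; the date strings are made-up examples.

```r
library(lubridate)

# Year-month-day order: use ymd(). The separator does not matter.
date_ymd <- ymd("2020-03-15")
date_ymd_slash <- ymd("2020/03/15")

# Month-day-year order: use mdy(). No format string is needed.
date_mdy <- mdy("03/15/2020")

# All three are Date objects for the same day.
class(date_ymd)
```

The function name simply spells out the order of the date components in the string.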

References
• Boehmke, B., 2020, Dealing with Dates, UC Business Analytics R Programming Guide,
http://uc-r.github.io/dates/
• James, G., Witten, D., Hastie, T. & Tibshirani, R., 2013, An Introduction to Statistical
Learning with Applications in R, Springer,
https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
• Shekar, 2018, Model on Boston (simple linear regression), RPubs,
https://rpubs.com/shekar07/399193
• Yue Xie, A., 2017, Regression with R – Boston Housing Price, Kaggle,
https://www.kaggle.com/andyxie/regression-with-r-Boston-housing-price
• JMP Statistical Discovery from SAS, 2019, Fitting the Multiple Linear Regression Model,
https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-multiple-regression/fitting-multiple-regression-model.html
• Spinu, V., 2020, ymd, RDocumentation,
https://www.rdocumentation.org/packages/lubridate/versions/1.7.9/topics/ymd
• Spinu, V., 2020, Package ‘lubridate’, R-Project,
https://cran.r-project.org/web/packages/lubridate/lubridate.pdf
• Seth, K., 2020, EDA and Multiple Linear Regression on Boston Housing in R, Analytics
Vidhya,
https://medium.com/analytics-vidhya/eda-and-multiple-linear-regression-on-Boston-housing-in-r-270f858dc7b
