Linear Regression Using R - An Introduction To Data Modeling
David Lilja
University of Minnesota
This text is disseminated via the Open Education Resource (OER) LibreTexts Project (https://LibreTexts.org) and, like the
hundreds of other texts available within this powerful platform, it is freely available for reading, printing, and
"consuming." Most, but not all, pages in the library have licenses that may allow individuals to make changes, save, and
print this book. Carefully consult the applicable license(s) before pursuing such efforts.
Instructors can adopt existing LibreTexts texts or Remix them to quickly build course-specific resources to meet the needs
of their students. Unlike traditional textbooks, LibreTexts' web-based origins allow powerful integration of advanced
features and new technologies to support learning.
The LibreTexts mission is to unite students, faculty, and scholars in a cooperative effort to develop an easy-to-use online
platform for the construction, customization, and dissemination of OER content to reduce the burden of unreasonable
textbook costs on our students and society. The LibreTexts project is a multi-institutional collaborative venture to develop
the next generation of open-access texts to improve postsecondary education at all levels of higher learning by developing
an Open Access Resource environment. The project currently consists of 14 independently operating and interconnected
libraries that are constantly being optimized by students, faculty, and outside experts to supplant conventional paper-based
books. These free textbook alternatives are organized within a central environment that is both vertically (from advanced to
basic level) and horizontally (across different fields) integrated.
The LibreTexts libraries are Powered by MindTouch® and are supported by the Department of Education Open Textbook
Pilot Project, the UC Davis Office of the Provost, the UC Davis Library, the California State University Affordable
Learning Solutions Program, and Merlot. This material is based upon work supported by the National Science Foundation
under Grant Nos. 1246120, 1525057, and 1413739. Unless otherwise noted, LibreTexts content is licensed by CC BY-NC-SA 3.0.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do
not necessarily reflect the views of the National Science Foundation or the US Department of Education.
Have questions or comments? For information about adoptions or adaptations contact [email protected]. More
information on our activities can be found via Facebook (https://facebook.com/Libretexts), Twitter
(https://twitter.com/libretexts), or our blog (http://Blog.Libretexts.org).
1: INTRODUCTION
1.1: PRELUDE TO LINEAR REGRESSION
1.2: WHAT IS A LINEAR REGRESSION MODEL?
1.3: WHAT IS R?
1.4: WHAT'S NEXT?
3: ONE-FACTOR REGRESSION
The simplest linear regression model finds the relationship between one input variable, which is called the predictor variable, and the
output, which is called the system’s response. This type of model is known as a one-factor linear regression. To demonstrate the
regression-modeling process, we will begin developing a one-factor model for the SPEC Integer 2000 (Int2000) benchmark results
reported in the CPU DB data set.
4: MULTI-FACTOR REGRESSION
4.1: VISUALIZING THE RELATIONSHIPS IN THE DATA
4.2: IDENTIFYING POTENTIAL PREDICTORS
4.3: THE BACKWARD ELIMINATION PROCESS
4.4: AN EXAMPLE OF THE BACKWARD ELIMINATION PROCESS
4.5: RESIDUAL ANALYSIS
4.6: WHEN THINGS GO WRONG
5: PREDICTING RESPONSES
5.1: DATA SPLITTING FOR TRAINING AND TESTING
5.2: TRAINING AND TESTING
5.3: PREDICTING ACROSS DATA SETS
5.4: SECTION 5-
5.5: SECTION 6-
7: SUMMARY
8: A FEW THINGS TO TRY NEXT
BACK MATTER
1 12/17/2021
INDEX
GLOSSARY
CHAPTER OVERVIEW
1: INTRODUCTION
One of the most fundamental of the broad range of data mining techniques that have been developed is regression modeling. Regression
modeling is simply generating a mathematical model from measured data. This model is said to explain an output value given a new set
of input values. Linear regression modeling is a specific form of regression modeling that assumes that the output can be explained
using a linear combination of the input values.
1.1: Prelude to Linear Regression
Data mining is a phrase that has been popularly used to suggest the process of finding useful information from within a
large collection of data. I like to think of data mining as encompassing a broad range of statistical techniques and tools that
can be used to extract different types of information from your data. Which particular technique or tool to use depends on
your specific goals.
One of the most fundamental of the broad range of data mining techniques that have been developed is regression
modeling. Regression modeling is simply generating a mathematical model from measured data. This model is said to
explain an output value given a new set of input values. Linear regression modeling is a specific form of regression
modeling that assumes that the output can be explained using a linear combination of the input values.
A common goal for developing a regression model is to predict what the output value of a system should be for a new set
of input values, given that you have a collection of data about similar systems. For example, as you gain experience
driving a car, you begin to develop an intuitive sense of how long it might take you to drive somewhere if you know the
type of car, the weather, an estimate of the traffic, the distance, the condition of the roads, and so on. What you really have
done to make this estimate of driving time is construct a multi-factor regression model in your mind. The inputs to your
model are the type of car, the weather, etc. The output is how long it will take you to drive from one point to another.
When you change any of the inputs, such as a sudden increase in traffic, you automatically re-estimate how long it will
take you to reach the destination.
This type of model building and estimating is precisely what we are going to learn to do more formally in this tutorial. As
a concrete example, we will use real performance data obtained from thousands of measurements of computer systems to
develop a regression model using the R statistical software package. You will learn how to develop the model and how to
evaluate how well it fits the data. You also will learn how to use it to predict the performance of other computer systems.
As you go through this tutorial, remember that what you are developing is just a model. It will hopefully be useful in
understanding the system and in predicting future results. However, do not confuse a model with the real system. The real
system will always produce the correct results, regardless of what the model may say the results should be.
Table 1.1: Measured performance data (first row shown).
System  Clock (MHz)  Cache (kB)  Transistors (M)  Performance
1       1500         64          2                98
The first column in this table is the index number (or name) from 1 to n that we have arbitrarily assigned to each of the
different systems measured. Columns 2-4 are the input parameters. These are called the independent variables for the
system we will be modeling. The specific values of the
input parameters were set by the experimenter when the system was measured, or they were determined by the system
configuration. In either case, we know what the values are and we want to measure the performance obtained for these
input values. For example, in the first system, the processor’s clock was 1500 MHz, the cache size was 64 kbytes, and the
processor contained 2 million transistors. The last column is the performance that was measured for this system when it
executed a standard benchmark program. We refer to this value as the output of the system. More technically, this is known
as the system’s dependent variable or the system’s response.
The goal of regression modeling is to use these n independent measurements to determine a mathematical function, f(),
that describes the relationship between the input parameters and the output, such as:
performance = f(Clock,Cache,Transistors)
This function, which is just an ordinary mathematical equation, is the regression model. A regression model can take on
any form. However, we will restrict ourselves to a function that is a linear combination of the input parameters. We will
explain later that, while the function is a linear combination of the input parameters, the parameters themselves do not need
to be linear. This linear combination is commonly used in regression modeling and is powerful enough to model most
systems we are likely to encounter.
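Concretely, restricting f() to a linear combination of the inputs means the model we will develop has the general form:

performance = a0 + a1*Clock + a2*Cache + a3*Transistors

where the coefficients a0, a1, a2, and a3 are constants that the regression procedure computes from the measured data.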
In the process of developing this model, we will discover how important each of these inputs is in determining the output
value. For example, we might find that the performance is heavily dependent on the clock frequency, while the cache size
and the number of transistors may be much less important. We may even find that some of the inputs have essentially no
impact on the output, making it completely unnecessary to include them in the model. We also will be able to use the model
we develop to predict the performance we would expect to see on a system that has input values that did not exist in any of
the systems that we actually measured. For instance, Table 1.2 shows three new systems that were not part of the set of
systems that we previously measured. We can use our regression model to predict the performance of each of these three
systems to replace the question marks in the table.
Table 1.2: Three new systems with known input values whose performance (shown as a question mark) is to be predicted.
As a final point, note that, since the regression model is a linear combination of the input values, the values of the model
parameters will automatically be scaled as we develop the model. As a result, the units used for the inputs and the output
are arbitrary. In fact, we can rescale the values of the inputs and the output before we begin the modeling process and still
produce a valid model.
2.1: Missing Values
Any large collection of data is probably incomplete. That is, it is likely that there will be cells without values in your data
table. These missing values may be the result of an error, such as the experimenter simply forgetting to fill in a particular
entry. They also could be missing because that particular system configuration did not have that parameter available. For
example, not every processor tested in our example data had an L2 cache. Fortunately, R is designed to gracefully handle
missing values. R uses the notation NA to indicate that the corresponding value is not available.
Most of the functions in R have been written to appropriately ignore NA values and still compute the desired result.
Sometimes, however, you must explicitly tell the function to ignore the NA values. For example, calling the mean()
function with an input vector that contains NA values causes it to return NA as the result. To compute the mean of the
input vector while ignoring the NA values, you must explicitly tell the function to remove the NA values using mean(x,
na.rm=TRUE).
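For example, the following short session (a sketch using made-up values, not data from the book) shows the difference:

```r
# A small vector containing one missing value.
x <- c(4, 8, NA, 6)

mean(x)              # returns NA: the missing value propagates
mean(x, na.rm=TRUE)  # returns 6: the NA is removed before averaging
```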
For example, we can access the value stored in a specific row and column of a data frame directly, using square brackets:
> int92.dat[15,12]
[1] 180
This expression returns the value stored in row 15, column 12 of the int92.dat data frame.
We can also access cells by name by putting quotes around the name:
> int92.dat["71","perf"]
[1] 105.1
This expression returns the data in the row labeled 71 and the column labeled perf. Note that this is not row 71, but
rather the row that contains the data for the processor whose name is 71.
We can access an entire column by leaving the first parameter in the square brackets empty. For instance, the following
prints the value in every row for the column labeled clock:
> int92.dat[,"clock"]
[1] 100 125 166 175 190 ...
Similarly, this expression prints the values in all of the columns for row 36:
> int92.dat[36,]
nperf perf clock threads cores ...
36 13.07378 79.86399 80 1 1 ...
The functions nrow() and ncol() return the number of rows and columns, respectively, in the data frame:
> nrow(int92.dat)
[1] 78
> ncol(int92.dat)
[1] 16
Because R functions can typically operate on a vector of any length, we can use built-in functions to quickly compute
some useful results. For example, the following expressions compute the minimum, maximum, mean, and standard
deviation of the perf column in the int92.dat data frame:
> min(int92.dat[,"perf"])
[1] 36.7
> max(int92.dat[,"perf"])
[1] 366.857
> mean(int92.dat[,"perf"])
[1] 124.2859
> sd(int92.dat[,"perf"])
[1] 78.0974
This square-bracket notation can become cumbersome when you do a substantial amount of interactive computation within
the R environment. R provides an alternative notation using the $ symbol to more easily access a column. Repeating the
previous example using this notation:
> min(int92.dat$perf)
[1] 36.7
> max(int92.dat$perf)
[1] 366.857
> mean(int92.dat$perf)
[1] 124.2859
> sd(int92.dat$perf)
[1] 78.0974
This notation says to use the data in the column named perf from the data frame named int92.dat. We can
make yet a further simplification using the attach function. This function makes the corresponding data frame local
to the current workspace, thereby eliminating the need to use the potentially awkward $ or square-bracket indexing
notation. The following example shows how this works:
> attach(int92.dat)
> min(perf)
[1] 36.7
> max(perf)
[1] 366.857
> mean(perf)
[1] 124.2859
> sd(perf)
[1] 78.0974
To change to a different data frame within your local workspace, you must first detach the current data frame:
> detach(int92.dat)
> attach(fp00.dat)
> min(perf)
[1] 87.54153
> max(perf)
[1] 3369
> mean(perf)
[1] 1217.282
> sd(perf)
[1] 787.4139
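As a side note, R's built-in with() function offers a middle ground between the $ notation and attach()/detach(): it evaluates an expression using a data frame's columns without altering the search path. A small self-contained sketch (the data frame and its values here are hypothetical, standing in for fp00.dat):

```r
# A tiny data frame standing in for fp00.dat (hypothetical values).
dat <- data.frame(perf = c(87.5, 3369, 1200, 412))

# with() looks up perf inside dat for the duration of the expression,
# with none of attach()'s side effects on the workspace.
with(dat, c(min(perf), max(perf), mean(perf)))
```

This avoids the common attach() pitfall of forgetting which data frame is currently attached.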
Now that we have the necessary data available in the R environment, and some understanding of how to access and
manipulate this data, we are ready to generate our first regression model.
3.1: Visualize the Data
The first step in this one-factor modeling process is to determine whether or not it looks as though a linear relationship
exists between the predictor and the output value. From our understanding of computer system design that is, from
our domain-specific knowledge we know that the clock frequency strongly influences a computer system’s performance.
Consequently, we must look for a roughly linear relationship between the processor’s performance and its clock frequency.
Fortunately, R provides powerful and flexible plotting functions that let us visualize this type relationship quite easily.
This R function call:
generates the plot shown in Figure 3.1. The first parameter in this function call is the value we will plot on the x-axis. In
this case, we will plot the clock values from the int00.dat data frame as the independent variable on the x-axis. The
dependent variable is the perf column from int00.dat, which we plot on the y-axis. The function argument
main="Int2000" provides a title for the plot, while xlab="Clock" and ylab="Performance" provide labels for the
x- and y-axes, respectively.
Figure 3.1: A scatter plot of the performance of the processors that were tested using the Int2000 benchmark versus the
clock frequency.
This figure shows that the performance tends to increase as the clock frequency increases, as we expected. If we
superimpose a straight line on this scatter plot, we see that the relationship between the predictor (the clock frequency) and
the output (the performance) is roughly linear. It is not perfectly linear, however. As the clock frequency increases, we see
a larger spread in performance values. Our next step is to develop a regression model that will help us quantify the degree
of linearity in the relationship between the output and the predictor.
> attach(int00.dat)
> int00.lm <- lm(perf ~ clock)
The first line in this example attaches the int00.dat data frame to the current workspace. The next line calls the
lm() function and assigns the resulting linear model object to the variable int00.lm. We use the suffix .lm to
emphasize that this variable contains a linear model. The argument in the lm() function, (perf ~ clock), says
that we want to find a model where the predictor clock explains the output perf.
Typing the variable's name, int00.lm, by itself causes R to print the argument with which the function lm() was
called, along with the computed coefficients for the regression model.
> int00.lm
Call:
lm(formula = perf ~ clock)
Coefficients:
(Intercept) clock
51.7871 0.5863
In this case, the y-intercept is a0 = 51.7871 and the slope is a1 = 0.5863. Thus, the final regression model is:

perf = 51.7871 + 0.5863 * clock

The following function calls plot the data again and superimpose this regression line on the plot, producing Figure 3.2:
> plot(clock,perf)
> abline(int00.lm)
> summary(int00.lm)
Call:
lm(formula = perf ~ clock)
Residuals:
Min 1Q Median 3Q Max
-634.61 -276.17 -30.83 75.38 1299.52
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.78709 53.31513 0.971 0.332
clock 0.58635 0.02697 21.741 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(int00.lm)
Call:
lm(formula = perf ~ clock)
These first few lines simply repeat how the lm() function was called. It is useful to look at this information to verify that
you actually called the function as you intended.
Residuals:
Min 1Q Median 3Q Max
-634.61 -276.17 -30.83 75.38 1299.52
The residuals are the differences between the actual measured values and the corresponding values on the fitted regression
line. In Figure 3.2, each data point’s residual is the distance that the individual data point is above (positive residual) or
below (negative residual) the regression line. Min is the minimum residual value, which is the distance from the
regression line to the point furthest below the line. Similarly, Max is the distance from the regression line of the point
furthest above the line. Median is the median value of all of the residuals. The 1Q and 3Q values are the points
that mark the first and third quartiles of all the sorted residual values.
How should we interpret these values? If the line is a good fit with the data, we would expect residual values that are
normally distributed around a mean of zero. (Recall that a normal distribution is also called a Gaussian distribution.) This
distribution implies that there is a decreasing probability of finding residual values as we move further away from the
mean. That is, a good model's residuals should be roughly balanced around and not too far away from the mean of zero.
Consequently, when we look at the residual values reported by summary(), a good model would tend to have a
median value near zero, minimum and maximum values of roughly the same magnitude, and first and third quartile
values of roughly the same magnitude.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.78709 53.31513 0.971 0.332
clock 0.58635 0.02697 21.741 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This portion of the output shows the estimated coefficient values. These values are simply the fitted regression model
values from Equation 3.2. The Std. Error column shows the statistical standard error for each of the coefficients.
For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the
corresponding coefficient. For example, the standard error for clock is 21.7 times smaller than the coefficient value
(0.58635/0.02697 = 21.7). This large ratio means that there is relatively little variability in the slope estimate, a1. The
standard error for the intercept, a0, is 53.31513, which is roughly the same as the estimated value of 51.78709 for this
coefficient. These similar values suggest that the estimate of this coefficient for this model can vary significantly.
The last column, labeled Pr(>|t|), shows the probability that the corresponding coefficient is not relevant in the
model. This value is also known as the significance or p-value of the coefficient. In this example, the probability that
clock is not relevant in this model is 2 × 10^-16, a tiny value. The probability that the intercept is not relevant is 0.332,
or about a one-in-three chance that this specific intercept value is not relevant to the model. There is an intercept, of course,
but we are again seeing indications that the model is not predicting this value very well.
The symbols printed to the right in this summary (that is, the asterisks, periods, or spaces) are intended to give a quick
visual check of the coefficients' significance. The line labeled Signif. codes: gives these symbols' meanings. Three
asterisks (***) means 0 < p ≤ 0.001, two asterisks (**) means 0.001 < p ≤ 0.01, and so on.
R uses the column labeled t value to compute the p-values and the corresponding significance symbols. You
probably will not use these values directly when you evaluate your model’s quality, so we will ignore this column for now.
These final few lines in the output provide some statistical information about the quality of the regression model's fit to the
data. The Residual standard error is a measure of the total variation in the residual values. If the residuals are
distributed normally, the first and third quartiles of the previous residuals should be about 1.5 times this
standard error.
The number of degrees of freedom is the total number of measurements or observations used to generate the
model, minus the number of coefficients in the model. This example had 256 unique rows in the data frame, corresponding
to 256 independent measurements. We used this data to produce a regression model with two coefficients: the slope and
the intercept. Thus, we are left with (256 - 2 = 254) degrees of freedom.
The Multiple R-squared value is a number between 0 and 1. It is a statistical measure of how well the model
describes the measured data. We compute it by dividing the total variation that the model explains by the data's total
variation. Multiplying this value by 100 gives a value that we can interpret as a percentage between 0 and 100. The
reported R2 of 0.6505 for this model means that the model explains 65.05 percent of the data's variation. Random chance
and measurement errors creep in, so the model will never explain all data variation. Consequently, you should not ever
expect an R2 value of exactly one. In general, values of R2 that are closer to one indicate a better-fitting model. However,
a large R2 value does not, by itself, guarantee a good model.
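To make the definition of R2 concrete, it can be recomputed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses synthetic data (so that it stands alone; these are not the CPU DB measurements) and checks the hand computation against summary():

```r
# Synthetic clock/performance data (hypothetical, for illustration only).
set.seed(1)
clock <- seq(100, 1000, length.out = 50)
perf  <- 50 + 0.6 * clock + rnorm(50, sd = 80)

fit <- lm(perf ~ clock)

# R-squared by hand: 1 - (residual sum of squares)/(total sum of squares).
rss <- sum(resid(fit)^2)
tss <- sum((perf - mean(perf))^2)
r2_by_hand <- 1 - rss / tss

r2_by_hand - summary(fit)$r.squared  # essentially zero
```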
> plot(fitted(int00.lm),resid(int00.lm))
Figure 3.3: The residual values versus the input values for the one-factor model developed using the Int2000 data.
In this plot, we see that the residuals tend to increase as we move to the right. Additionally, the residuals are not uniformly
scattered above and below zero. Overall, this plot tells us that using the clock as the sole predictor in the regression model
does not sufficiently or fully explain the data. In general, if you observe any sort of clear trend or pattern in the residuals,
you probably need to generate a better model. This does not mean that our simple one-factor model is useless, though. It
only means that we may be able to construct a model that produces tighter residual values and better predictions.
Another test of the residuals uses the quantile-versus-quantile, or Q-Q, plot. Previously we said that, if the model fits the
data well, we would expect the residuals to be normally (Gaussian) distributed around a mean of zero. The Q-Q plot
provides a nice visual indication of whether the residuals from the model are normally distributed. The following
function calls generate the Q-Q plot shown in Figure 3.4:
> qqnorm(resid(int00.lm))
> qqline(resid(int00.lm))
y = a0 + a1x1 + a2x2 + ... + amxm

where the xi values are the inputs to the system, the ai coefficients are the model parameters computed from the measured data, and y is
the output value predicted by the model. Everything we learned in Chapter 3 for one-factor models also applies to the multi-factor
models. To develop this type of multi-factor regression model, we must also learn how to select specific predictors to include in the
model.
4.1: Visualizing the Relationships in the Data
Before beginning model development, it is useful to get a visual sense of the relationships within the data. We can do this
easily with the following function call:
> pairs(int00.dat, gap=0.5)
The pairs() function produces the plot shown in Figure 4.1. This plot provides a pairwise comparison of all the data
in the int00.dat data frame. The gap parameter in the function call controls the spacing between the individual
plots. Set it to zero to eliminate any space between plots.
As an example of how to read this plot, locate the box near the upper left corner labeled perf. This is the value of the
performance measured for the int00.dat data set. The box immediately to the right of this one is a scatter plot, with
perf data on the vertical axis and clock data on the horizontal axis. This is the same information we previously
plotted in Figure 3.1. By scanning through these plots, we can see any obviously significant relationships between the
variables. For example, we quickly observe that there is a somewhat proportional relationship between perf and
clock. Scanning down the perf column, we also see that there might be a weakly inverse relationship between
perf and featureSize.
Figure 4.1: All of the pairwise comparisons for the Int2000 data frame.
Notice that there is a perfect linear correlation between perf and nperf. This relationship occurs because
nperf is a simple rescaling of perf. The reported benchmark performance values in the database (that is, the
perf values) use different scales for different benchmarks. To directly compare the values that our models will predict,
it is useful to rescale perf to the range [0,100]. We can do this quite easily, using this R code:
max_perf = max(perf)
min_perf = min(perf)
range = max_perf - min_perf
nperf = 100 * (perf - min_perf) / range
Note that this rescaling has no effect on the models we will develop, because it is a linear transformation of perf . For
convenience and consistency, we use nperf in the remainder of this tutorial.
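The claim that the rescaling is harmless is easy to verify: because nperf is a linear transformation of perf, a model fit to either response yields exactly the same R2. A self-contained sketch with synthetic data (hypothetical values, not the CPU DB measurements):

```r
# Synthetic data (hypothetical, for illustration only).
set.seed(42)
clock <- runif(60, 100, 1000)
perf  <- 50 + 0.6 * clock + rnorm(60, sd = 40)

# Rescale perf to the range [0, 100], exactly as in the text.
nperf <- 100 * (perf - min(perf)) / (max(perf) - min(perf))

r2_perf  <- summary(lm(perf  ~ clock))$r.squared
r2_nperf <- summary(lm(nperf ~ clock))$r.squared
all.equal(r2_perf, r2_nperf)  # TRUE: fit quality is unchanged
```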
The adjusted R2 value is computed as:

adjusted R2 = 1 - (1 - R2)(n - 1)/(n - m - 1)

where n is the number of observations and m is the number of predictors in the model. If adding a new predictor to the
model increases the previous model's R2 value by more than we would expect from random fluctuations, then the
adjusted R2 will increase. Conversely, it will decrease if removing a predictor decreases the R2 by more than we would
expect due to random variations. Recall that the goal is to use as few predictors as possible, while still producing a model
that explains the data well.
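This adjustment is easy to compute by hand. The helper below is our own sketch (the name adj_r2 is not part of R) implementing the standard adjusted-R2 formula; plugging in the values reported later for the full multi-factor model (R2 = 0.9652, with n = 77 observations and m = 15 predictors) reproduces the adjusted value that summary() reports:

```r
# Standard adjusted R-squared, given n observations and m predictors.
# (adj_r2 is our own helper name, not a built-in R function.)
adj_r2 <- function(r2, n, m) 1 - (1 - r2) * (n - 1) / (n - m - 1)

adj_r2(0.9652, 77, 15)  # approximately 0.9566
```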
Because we do not know a priori which input parameters will be useful predictors, it seems reasonable to start with all of
the columns available in the measured data as the set of potential predictors. We listed all of the column names in
Table 2.1. Before we throw all these columns into the modeling process, though, we need to step back and consider what
we know about the underlying system, to help us find any parameters that we should obviously exclude from the start.
There are two output columns: perf and nperf . The regression model can have only one output, however, so we
must choose only one column to use in our model development process. As discussed in Section 4.1, nperf is a linear
transformation of perf that shifts the output range to be between 0 and 100. This range is useful for quickly obtaining
a sense of future predictions’ quality, so we decide to use nperf as our model’s output and ignore the perf column.
Almost all the remaining possible predictors appear potentially useful in our model, so we keep them available as potential
predictors for now. The only exception is TDP . The name of this factor, thermal design power, does not clearly indicate
whether this could be a useful predictor in our model, so we must do a little additional research to understand it better. We
discover [10] that thermal design power is “the average amount of power in watts that a cooling system must dissipate.
Also called the ‘thermal guideline’ or ‘thermal design point,’ the TDP is provided by the chip manufacturer to the system
vendor, who is expected to build a case that accommodates the chip’s thermal requirements.” From this definition, we
conclude that TDP is not really a parameter that will directly affect performance. Rather, it is a specification provided by
the processor’s manufacturer to ensure that the system designer includes adequate cooling capability in the final product.
Thus, we decide not to include TDP as a potential predictor in the regression model.
In addition to excluding some apparently unhelpful factors (such as TDP) at the beginning of the model development
process, we also should consider whether we should include any additional parameters. For example, the terms in a
regression model add linearly to produce the predicted output. However, the individual terms themselves can be nonlinear,
such as ai*xi^m, where m does not have to be equal to one. This flexibility lets us include additional powers of the input
parameters, such as the square-root terms that appear below, as potential predictors.
We include all of these potential predictors in our first multi-factor model:
> int00.lm <- lm(nperf ~ clock + threads + cores + transistors + dieSize + voltage + featureSize + channel + FO4delay + L1icache + sqrt(L1icache) + L1dcache + sqrt(L1dcache) + L2cache + sqrt(L2cache), data=int00.dat)
This function call assigns the resulting linear model object to the variable int00.lm. As before, we use the suffix
.lm to remind us that this variable is a linear model developed from the data in the corresponding data frame,
int00.dat. The arguments in the function call tell lm() to compute a linear model that explains the output
nperf as a function of the predictors separated by the "+" signs. The argument data=int00.dat explicitly
passes to the lm() function the name of the data frame that should be used when developing this model. This
data= argument is not necessary if we attach() the data frame int00.dat to the current workspace.
However, it is useful to explicitly specify the data frame that lm() should use, to avoid confusion when you
manipulate multiple models simultaneously.
The summary() function gives us a great deal of information about the linear model we just created:
> summary(int00.lm)
Coefficients:
Estimate Std. Error t value
(Intercept) -2.108e+01 7.852e+01 -0.268
clock 2.605e-02 1.671e-03 15.594
threads -2.346e+00 2.089e+00 -1.123
cores 2.246e+00 1.782e+00 1.260
transistors -5.580e-03 1.388e-02 -0.402
dieSize 1.021e-02 1.746e-02 0.585
voltage -2.623e+01 7.698e+00 -3.408
featureSize 3.101e+01 1.122e+02 0.276
channel 9.496e+01 5.945e+02 0.160
FO4delay -1.765e-02 1.600e+00 -0.011
L1icache 1.102e+02 4.206e+01 2.619
sqrt(L1icache) -7.390e+02 2.980e+02 -2.480
L1dcache -1.114e+02 4.019e+01 -2.771
sqrt(L1dcache) 7.492e+02 2.739e+02 2.735
L2cache -9.684e-03 1.745e-03 -5.550
sqrt(L2cache) 1.221e+00 2.425e-01 5.034
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.632 on 61 degrees of freedom (179 observations deleted due to missingness)
Multiple R-squared: 0.9652, Adjusted R-squared: 0.9566
F-statistic: 112.8 on 15 and 61 DF, p-value: < 2.2e-16
Notice a few things in this summary: First, a quick glance at the residuals shows that they are roughly balanced around a
median of zero, which is what we like to see in our models. Also, notice the line (179 observations deleted
due to missingness). This tells us that in 179 of the rows in the data frame (that is, in 179 of the processors for
which performance results were reported for the Int2000 benchmark), some of the values in the columns that we would
like to use as potential predictors were missing. These NA values caused R to automatically remove these data rows when
computing the linear model.
The total number of observations used in the model equals the number of degrees of freedom remaining (61 in this case)
plus the total number of coefficients in the model (16 here, counting the intercept). Finally, notice that the R2 and adjusted
R2 values are relatively close to one, indicating that the model explains the nperf values well. Recall, however, that
these large R2 values may simply show us that the model is good at modeling the noise in the measurements. We must still
determine whether we should retain all these potential predictors in the model.
To continue developing the model, we apply the backward elimination procedure by identifying the predictor with the
largest p-value that exceeds our predetermined threshold of p = 0.05. This predictor is FO4delay , which has a p-value
of 0.99123. We can use the update() function to eliminate a given predictor and recompute the model in one step.
The notation “. ~ .” means that update() should keep the left- and right-hand sides of the model the same.
By including “- FO4delay”, we also tell it to remove that predictor from the model, as shown in the following:
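The update() call itself does not appear in this transcription; a minimal, self-contained sketch of the pattern (the data frame d and the predictors x and z here are invented for illustration) looks like this:

```r
# synthetic stand-in data; the book applies update() to the model built from int00.dat
set.seed(1)
d <- data.frame(x = rnorm(50), z = rnorm(50))
d$y <- 2 * d$x + rnorm(50)

full.lm <- lm(y ~ x + z, data = d)    # model with both predictors
# ". ~ ." keeps both sides of the formula; "- z" drops the predictor z
reduced.lm <- update(full.lm, . ~ . - z)

names(coef(reduced.lm))               # "(Intercept)" "x"
```

The same pattern with "- FO4delay" in place of "- z" reproduces the step described in the text.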
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.088e+01 7.584e+01 -0.275 0.783983
clock 2.604e-02 1.563e-03 16.662 < 2e-16
threads -2.345e+00 2.070e+00 -1.133 0.261641
cores 2.248e+00 1.759e+00 1.278 0.206080
transistors -5.556e-03 1.359e-02 -0.409 0.684020
dieSize 1.013e-02 1.571e-02 0.645 0.521488
voltage -2.626e+01 7.302e+00 -3.596 0.000642
featureSize 3.104e+01 1.113e+02 0.279 0.781232
channel 8.855e+01 1.218e+02 0.727 0.469815
L1icache 1.103e+02 4.041e+01 2.729 0.008257
sqrt(L1icache) -7.398e+02 2.866e+02 -2.581 0.012230
L1dcache -1.115e+02 3.859e+01 -2.889 0.005311
sqrt(L1dcache) 7.500e+02 2.632e+02 2.849 0.005937
L2cache -9.693e-03 1.494e-03 -6.488 1.64e-08
sqrt(L2cache) 1.222e+00 1.975e-01 6.189 5.33e-08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.594 on 62 degrees of freedom
  (179 observations deleted due to missingness)
Multiple R-squared: 0.9652, Adjusted R-squared: 0.9573
F-statistic: 122.8 on 14 and 62 DF, p-value: < 2.2e-16
We repeat this process by removing the next potential predictor with the largest p-value that exceeds our predetermined
threshold, featureSize . As we repeat this process, we obtain the following sequence of possible models.
Remove featureSize :
Call:
lm(formula = nperf ~ clock + threads + cores + transistors + dieSize +
voltage + channel + L1icache + sqrt(L1icache) + L1dcache + sqrt(L1dcache) +
L2cache + sqrt(L2cache), data = int00.dat)
Residuals:
Min 1Q Median 3Q Max
-10.5548 -2.6442 0.0937 2.2010 10.0264
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Remove transistors:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.789e+01 4.318e+01 -1.804 0.075745 .
clock 2.566e-02 1.422e-03 18.040 < 2e-16 ***
threads -1.801e+00 1.995e+00 -0.903 0.369794
cores 1.805e+00 1.132e+00 1.595 0.115496
dieSize 1.111e-02 8.807e-03 1.262 0.211407
voltage -2.379e+01 5.734e+00 -4.148 9.64e-05 ***
channel 1.512e+02 3.918e+01 3.861 0.000257 ***
L1icache 8.159e+01 2.006e+01 4.067 0.000128 ***
sqrt(L1icache) -5.386e+02 1.418e+02 -3.798 0.000317 ***
L1dcache -8.422e+01 1.914e+01 -4.401 3.96e-05 ***
sqrt(L1dcache) 5.671e+02 1.299e+02 4.365 4.51e-05 ***
L2cache -8.700e-03 1.262e-03 -6.893 2.35e-09 ***
sqrt(L2cache) 1.069e+00 1.654e-01 6.465 1.36e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.578 on 67 degrees of freedom
(176 observations deleted due to missingness)
Multiple R-squared: 0.9657, Adjusted R-squared: 0.9596
F-statistic: 157.3 on 12 and 67 DF, p-value: < 2.2e-16
Remove threads:
Call:
lm(formula = nperf ~ clock + cores + dieSize + voltage + channel + L1icache +
    sqrt(L1icache) + L1dcache + sqrt(L1dcache) + L2cache + sqrt(L2cache), data = int00.dat)
Residuals:
Min 1Q Median 3Q Max
-9.7388 -3.2326 0.1496 2.6633 10.6255
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.022e+01 4.304e+01 -1.864 0.066675
clock 2.552e-02 1.412e-03 18.074 <2e-16
cores 2.271e+00 1.006e+00 2.257 0.027226
dieSize 1.281e-02 8.592e-03 1.491 0.140520
voltage -2.299e+01 5.657e+00 -4.063 0.000128
channel 1.491e+02 3.905e+01 3.818 0.000293
L1icache 8.131e+01 2.003e+01 4.059 0.000130
sqrt(L1icache) -5.356e+02 1.416e+02 -3.783 0.000329
L1dcache -8.388e+01 1.911e+01 -4.390 4.05e-05
sqrt(L1dcache) 5.637e+02 1.297e+02 4.346 4.74e-05
L2cache -8.567e-03 1.252e-03 -6.844 2.71e-09
sqrt(L2cache) 1.040e+00 1.619e-01 6.422 1.54e-08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Remove dieSize:
Residuals:
Min 1Q Median 3Q Max
-10.0240 -3.5195 0.3577 2.5486 12.0545
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.822e+01 3.840e+01 -1.516 0.133913
clock 2.482e-02 1.246e-03 19.922 < 2e-16 ***
cores 2.397e+00 1.004e+00 2.389 0.019561 *
voltage -2.358e+01 5.495e+00 -4.291 5.52e-05 ***
channel 1.399e+02 3.960e+01 3.533 0.000726 ***
L1icache 8.703e+01 1.972e+01 4.412 3.57e-05 ***
sqrt(L1icache) -5.768e+02 1.391e+02 -4.146 9.24e-05 ***
L1dcache -8.903e+01 1.888e+01 -4.716 1.17e-05 ***
sqrt(L1dcache) 5.980e+02 1.282e+02 4.665 1.41e-05 ***
L2cache -8.621e-03 1.273e-03 -6.772 3.07e-09 ***
sqrt(L2cache) 1.085e+00 1.645e-01 6.598 6.36e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
At this point, the p-values for all of the predictors are less than 0.02, which is less than our predetermined threshold of
0.05. This tells us to stop the backward elimination process. Intuition and experience tell us that ten predictors are a rather
large number to use in this type of model. Nevertheless, all of these predictors have p-values below our significance
threshold, so we have no reason to exclude any specific predictor. We decide to include all ten predictors in the final
model:
nperf = -58.22 + 0.02482 clock + 2.397 cores - 23.58 voltage + 139.9 channel
        + 87.03 L1icache - 576.8 sqrt(L1icache) - 89.03 L1dcache
        + 598.0 sqrt(L1dcache) - 0.008621 L2cache + 1.085 sqrt(L2cache)
Looking back over the sequence of models we developed, notice that the number of degrees of freedom in each subsequent
model increases as predictors are excluded, as expected. In some cases, the number of degrees of freedom increases by
more than one when only a single predictor is eliminated from the model. To understand how an increase of more than one
is possible, look at the sequence of values in the lines that report the number of
observations deleted due to missingness. These values show how many rows
the update() function dropped because the value for one of the predictors in those rows was missing and had
the NA value. When the backward elimination process removed that predictor from the model, at least some of those rows
became ones we can use in computing the next version of the model, thereby increasing the number of degrees of freedom.
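This behavior is easy to demonstrate with a small synthetic example: lm() silently drops any row with an NA in one of its predictors, so removing an NA-laden predictor brings rows, and degrees of freedom, back (the data frame d here is invented for illustration):

```r
# synthetic data: predictor b has missing values in two rows
d <- data.frame(y = c(1.1, 2.3, 2.9, 4.2, 5.1, 5.8),
                a = c(1, 2, 3, 4, 5, 6),
                b = c(1, NA, 3, NA, 5, 6))

m1 <- lm(y ~ a + b, data = d)  # rows 2 and 4 dropped: 4 observations, 1 residual df
m2 <- lm(y ~ a, data = d)      # all 6 rows usable: 6 observations, 4 residual df

df.residual(m1)  # 1
df.residual(m2)  # 4
```

Dropping b recovers the two incomplete rows, so the residual degrees of freedom jump by more than the one we would expect from eliminating a single coefficient.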
Also notice that, as predictors drop from the model, the R2 values stay very close to 0.965. However, the adjusted R2 value
tends to increase very slightly with each dropped predictor. This increase indicates that the model with fewer predictors
explains the data nearly as well as the larger model. Executing the function call
> plot(fitted(int00.lm),resid(int00.lm))
produces the plot shown in Figure 4.2. We see that the residuals appear to be somewhat uniformly scattered about zero. At
least, we do not see any obvious patterns that lead us to think that the residuals are not well behaved. Consequently, this
plot gives us no reason to believe that we have produced a poor model.
The Q-Q plot in Figure 4.3 is generated using these commands:
> qqnorm(resid(int00.lm))
> qqline(resid(int00.lm))
We see that the residuals roughly follow the indicated line. In this plot, we can see a bit more of a pattern and some
obvious nonlinearities, leading us to be slightly more cautious about concluding that the residuals are
normally distributed. We should not necessarily reject the model based on this one test, but the results should serve as a
reminder that all models are imperfect.
Figure 4.2: The fitted versus residual values for the multi-factor model developed from the Int2000 data.
Call:
lm(formula = nperf ~ clock + threads + cores + transistors +
dieSize + voltage + featureSize + channel + FO4delay +
L1icache + sqrt(L1icache) + L1dcache + sqrt(L1dcache) +
L2cache + sqrt(L2cache))
Residuals:
14 15 16 17 18 19
0.4096 1.3957 -2.3612 0.1498 -1.5513 1.9575
Coefficients: (14 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25.93278 6.56141 -3.952 0.01
clock 0.35422 0.02184 16.215 8.46e-05 ***
threads NA NA NA
cores NA NA NA
transistors NA NA NA
dieSize NA NA NA
voltage NA NA NA
featureSize NA NA NA
channel NA NA NA
FO4delay NA NA NA
L1icache NA NA NA
sqrt(L1icache) NA NA NA
L1dcache NA NA NA
sqrt(L1dcache) NA NA NA
L2cache NA NA NA
sqrt(L2cache) NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.868 on 4 degrees of freedom
  (72 observations deleted due to missingness)
Multiple R-squared: 0.985, Adjusted R-squared: 0.9813
F-statistic: 262.9 on 1 and 4 DF, p-value: 8.46e-05
> table(clock)
clock
48 50 60 64 66 70 75 77 80 85 90 96 99 100 101 110
118 120 125 133 150 166 175 180 190 200 225 231 233 250 266
275 291 300 333 350
1 3 4 1 5 1 4 1 2 1 2 1 2 10 1 1
1 3 4 4 3 2 2 1 1 4 1 1 2 2 2 1 1 1 1
The top line shows the unique values that appear in the column. The list of numbers directly below that line is the count of
how many times that particular value appeared in the column. For example, 48 appeared once, while 50 appeared three
times and 60 appeared four times. We see a reasonable range of values with minimum ( 48 ) and maximum ( 350 )
values that are not unexpected. Some of the values occur only once; the most frequent value occurs ten times, which again
does not seem unreasonable. In short, we do not see anything obviously amiss with these results. We conclude that the
problem likely is with a different data column.
Executing the table() function on the next column in the data frame threads produces this output:
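The output itself was lost in this transcription; based on the description that follows, it would have the shape below (sketched here with a synthetic column of 78 identical values standing in for the Int1992 threads data):

```r
# synthetic stand-in for the Int1992 threads column: 78 rows, all equal to 1
threads <- rep(1, 78)
table(threads)
# threads
#  1
# 78
```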
Aha! Now we are getting somewhere. This result shows that all of the 78 entries in this column contain the same value:
1 . An input factor in which all of the elements are the same value has no predictive power in a regression model. If
every row has the same value, we have no way to distinguish one row from another. Thus, we conclude that threads
is not a useful predictor for our model and we eliminate it as a potential predictor as we continue to develop our Int1992
regression model.
We continue by executing table() on the column labeled cores . This operation shows that this column also
consists of only a single value, 1. Using the update() function to eliminate these two predictors from the model
gives the following:
Residuals:
14 15 16 17 18 19
0.4096 1.3957 -2.3612 0.1498 -1.5513 1.9575
Unfortunately, eliminating these two predictors from consideration has not solved the problem. Notice that we still have
only four degrees of freedom, because 72 observations were again eliminated. This small number of degrees of freedom
indicates that there must be at least one more column with insufficient data.
> table(L2cache)
L2cache
96 256 512
6 2 2
Although these specific data values do not look out of place, having only three unique values can make it impossible for
lm() to compute the model coefficients. Dropping L2cache and sqrt(L2cache) as potential predictors
finally produces the type of result we expect:
Call:
lm(formula = nperf ~ clock + transistors + dieSize + voltage +
featureSize + channel + FO4delay + L1icache + sqrt(L1icache) +
L1dcache + sqrt(L1dcache))
Residuals:
Min 1Q Median 3Q Max
-7.3233 -1.1756 0.2151 1.0157 8.0634
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -58.51730 17.70879 -3.304 0.0027
clock 0.23444 0.01792 13.084 6.03e-1
transistors -0.32032 1.13593 -0.282 0.7801
dieSize 0.25550 0.04800 5.323 1.44e-0
voltage 1.66368 1.61147 1.032 0.3113
featureSize 377.84287 69.85249 5.409 1.15e-0
channel -493.84797 88.12198 -5.604 6.88e-0
FO4delay 0.14082 0.08581 1.641 0.1128
L1icache 4.21569 1.74565 2.415 0.0230
sqrt(L1icache) -12.33773 7.76656 -1.589 0.1242
L1dcache -5.53450 2.10354 -2.631 0.0141
sqrt(L1dcache) 23.89764 7.98986 2.991 0.0060
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.68 on 26 degrees of freedom
  (40 observations deleted due to missingness)
Multiple R-squared: 0.985, Adjusted R-squared: 0.9786
F-statistic: 155 on 11 and 26 DF, p-value: < 2.2e-16
We now can proceed with the normal backward elimination process. We begin by eliminating the predictor that has the
largest p-value above our preselected threshold, which is transistors in this case. Eliminating this predictor gives
the following:
Call:
lm(formula = nperf ~ clock + dieSize + voltage + featureSize +
channel + FO4delay + L1icache + sqrt(L1icache) + L1dcache +
sqrt(L1dcache))
Residuals:
Min 1Q Median 3Q Max
-13.2935 -3.6068 -0.3808 2.4535 19.9617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.73899 24.50101 -0.683
clock 0.19330 0.02091 9.243 2.77
dieSize 0.11457 0.02728 4.201 0.00
voltage 0.40317 2.85990 0.141 0.
featureSize 11.08190 104.66780 0.106 0
channel -37.23928 104.22834 -0.357 0
FO4delay -0.13803 0.14809 -0.932 0.35
L1icache 7.84707 3.33619 2.352 0.025
sqrt(L1icache) -16.28582 15.38525 -1.059 0.29
L1dcache -14.31871 2.94480 -4.862 3.44e
sqrt(L1dcache) 48.26276 9.41996 5.123 1.6
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.528 on 30 degrees of freedom
  (37 observations deleted due to missingness)
Multiple R-squared: 0.9288, Adjusted R-squared: 0.9051
F-statistic: 39.13 on 10 and 30 DF
After eliminating this predictor, however, we see something unexpected. The p-values for voltage and
featureSize increased dramatically. Furthermore, the adjusted R-squared value dropped substantially, from 0.9786
to 0.9051. These unexpectedly large changes make us suspect that transistors may actually be a useful predictor
in the model even though at this stage of the backward elimination process it has a high p-value. So, we decide to put
transistors back into the model and instead drop voltage , which has the next highest p-value. These changes
produce the following result:
Call:
lm(formula = nperf ~ clock + dieSize + featureSize + channel +
FO4delay + L1icache + sqrt(L1icache) + L1dcache +
sqrt(L1dcache) +
transistors)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -50.28514 15.27839 -3.291 0.00270
clock 0.21854 0.01718 12.722 3.71e-1
dieSize 0.20348 0.04401 4.623 7.77e-0
featureSize 409.68604 67.00007 6.115 1.34e-0
channel -490.99083 86.23288 -5.694 4.18e-0
FO4delay 0.12986 0.09159 1.418 0.16726
L1icache 1.48070 1.21941 1.214 0.23478
sqrt(L1icache) -5.15568 7.06192 -0.730 0.47141
L1dcache -0.45668 0.10589 -4.313 0.00018
sqrt(L1dcache) 4.77962 2.45951 1.943 0.06209
transistors 1.54264 0.88345 1.746 0.09175
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The adjusted R-squared value now is 0.9746, which is much closer to the adjusted R-squared value we had before
dropping transistors . Continuing with the backward elimination process, we first drop sqrt(L1icache)
with a p-value of 0.471413, then FO4delay with a p-value of 0.180836, and finally sqrt(L1dcache) with a p-
value of 0.071730.
After completing this backward elimination process, we find that the following predictors belong in the final model for
Int1992:
clock transistors dieSize featureSize
channel L1icache L1dcache
As shown below, all of these predictors have p-values below our threshold of 0.05. Additionally, the adjusted R-squared
looks quite good at 0.9722.
Call:
lm(formula = nperf ~ clock + dieSize + featureSize + channel +
L1icache + L1dcache + transistors, data = int92.dat)
Residuals:
Min 1Q Median 3Q Max
-10.1742 -1.5180 0.1324 1.9967 10.1737
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.17260 5.47413 -6.243 6.16e-07 ***
clock 0.18973 0.01265 15.004 9.21e-16 ***
dieSize 0.11751 0.02034 5.778 2.31e-06 ***
featureSize 305.79593 52.76134 5.796 2.20e-06 ***
channel -328.13544 53.04160 -6.186 7.23e-07 ***
L1icache 0.78911 0.16045 4.918 2.72e-05 ***
L1dcache -0.23335 0.03222 -7.242 3.80e-08 ***
transistors 3.13795 0.51450 6.099 9.26e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This example illustrates that you cannot always look at only the p-values to determine which potential predictors to
eliminate in each step of the backward elimination process. You also must be careful to look at the broader picture, such as
changes in the adjusted R-squared value and large changes in the p-values of other predictors, after each change to the
model.
Because the model was developed using measured data, the coefficient values necessarily are only estimates. Consequently, any
predictions we make with the model are also only estimates. The summary() function produces useful statistics about the regression
model’s quality, such as the R2 and adjusted R2 values. These statistics offer insights into how well the model explains variation in the
data. The best indicator of any regression model’s quality, however, is how well it predicts output values. The R environment provides
some powerful functions that help us predict new values from a given model and evaluate the quality of these predictions.
5.1: Data Splitting for Training and Testing
In Chapter 4 we used all of the data available in the int00.dat data frame to select the appropriate predictors to
include in the final regression model. Because we computed the model to fit this particular data set, we cannot now use this
same data set to test the model’s predictive capabilities. That would be like copying exam answers from the answer key
and then using that same answer key to grade your exam. Of course you would get a perfect result. Instead, we must use
one set of data to train the model and another set of data to test it.
The difficulty with this train-test process is that we need separate but similar data sets. A standard way to find these two
different data sets is to split the available data into two parts. We take a random portion of all the available data and call it
our training set. We then use this portion of the data in the lm() function to compute the specific values of the model’s
coefficients. We use the remaining portion of the data as our testing set to see how well the model predicts the results,
compared to this test data.
The following sequence of operations splits the int00.dat data set into the training and testing sets:
rows <- nrow(int00.dat)
f <- 0.5
upper_bound <- floor(f * rows)
permuted_int00.dat <- int00.dat[sample(rows), ]
train.dat <- permuted_int00.dat[1:upper_bound, ]
test.dat <- permuted_int00.dat[(upper_bound+1):rows, ]
The first line assigns the total number of rows in the int00.dat data frame to the variable rows . The next line
assigns to the variable f the fraction of the entire data set we wish to use for the training set. In this case, we somewhat
arbitrarily decide to use half of the data as the training set and the other half as the testing set. The floor() function
rounds its argument value down to the nearest integer. So the line upper_bound <- floor(f * rows) assigns
the middle row’s index number to the variable upper_bound .
The interesting action happens in the next line. The sample() function returns a permutation of the integers between
1 and n when we give it the integer value n as its input argument. In this code, the expression sample(rows) returns
a vector that is a permutation of the integers between 1 and rows , where rows is the total number of rows in the
int00.dat data frame. Using this vector as the row index for this data frame gives a random permutation of all of the
rows in the data frame, which we assign to the new data frame, permuted_int00.dat. The next two lines assign
the first portion (rows 1 through upper_bound ) of this new data frame to the training data set and the remaining rows to the testing data set, respectively. This
randomization process ensures that we obtain a new random selection of the rows in the train-and-test data sets every time
we execute this sequence of operations.
Figure 5.1: The training and testing process for evaluating the predictions produced by a regression model.
The following statement calls the lm() function to generate a regression model using the predictors we identified in
Chapter 4 and the train.dat data frame we extracted in the previous section. It then assigns this model to the
variable int00_new.lm. We refer to this process of computing the model’s coefficients as training the regression
model.
The predict() function takes this new model as one of its arguments. It uses this model to compute the predicted
outputs when we use the test.dat data frame as the input, as follows:
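Neither statement survives in this transcription. The sketch below shows the likely shape of both steps; the data frames here are synthetic stand-ins for the real train.dat and test.dat, and the single-predictor formula abbreviates the book's ten-predictor Chapter 4 model:

```r
# synthetic stand-ins for the train/test split of int00.dat;
# the clock coefficient below is invented purely for illustration
set.seed(2)
train.dat <- data.frame(clock = runif(40, 500, 3000))
train.dat$nperf <- 0.025 * train.dat$clock + rnorm(40)
test.dat <- data.frame(clock = runif(40, 500, 3000))
test.dat$nperf <- 0.025 * test.dat$clock + rnorm(40)

# train the regression model on the training set
int00_new.lm <- lm(nperf ~ clock, data = train.dat)

# predict the output values for the rows of the testing set
predicted.dat <- predict(int00_new.lm, newdata = test.dat)
length(predicted.dat)  # 40, one prediction per test row
```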
We define the difference between the predicted and measured performance for each processor i to
be ∆i = Predictedi − Measuredi, where Predictedi is the value predicted by the model, which is stored in the data frame
predicted.dat , and Measuredi is the actual measured performance response, which we previously assigned to the
test.dat data frame. The following statement computes the entire vector of these ∆i values and assigns the vector to
the variable delta .
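The statement itself is missing here; its likely form is delta <- predicted.dat - test.dat$nperf. A tiny self-contained sketch with invented values:

```r
# stand-in values; in the book, predicted.dat comes from predict() and
# test.dat$nperf holds the measured responses
predicted.dat <- c(50.1, 47.9, 61.3)
test.dat <- data.frame(nperf = c(48.7, 50.2, 60.0))

delta <- predicted.dat - test.dat$nperf  # one difference per processor
delta  # 1.4 -2.3  1.3
```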
Note that we use the $ notation to select the column with the output value, nperf , from the test.dat data
frame.
The mean of these ∆ differences for n different processors is:

∆̄ = (1/n) ∑_{i=1}^{n} ∆_i
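The output that follows comes from a one-sample t test on delta. The call does not survive in this transcription, but it presumably has the form t.test(delta, conf.level = 0.95); a self-contained sketch with a synthetic delta of the same length (42 values, hence 41 degrees of freedom):

```r
# synthetic delta vector standing in for the real prediction errors
set.seed(3)
delta <- rnorm(42, mean = -0.5, sd = 5)

# test whether the mean prediction error differs from zero
result <- t.test(delta, conf.level = 0.95)
result$parameter  # df = 41
```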
data: delta
t = -0.65496, df = 41, p-value = 0.5161
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -2.232621  1.139121
sample estimates:
mean of x
-0.5467502
If the prediction were perfect, then ∆i = 0. If ∆i > 0, then the model predicted that the performance would be greater than it
actually was. A ∆i < 0, on the other hand, means that the model predicted that the performance was lower than it actually
was. Consequently, if the predictions were reasonably good, we would expect to see a tight confidence interval around
zero. In this case, we obtain a 95 percent confidence interval of [-2.23, 1.14]. Given that nperf is scaled to between 0 and
100, this is a reasonably tight confidence interval that includes zero. Thus, we conclude that the model is reasonably good
at predicting values in the test.dat data set when trained on the train.dat data set.
Another way to get a sense of the predictions’ quality is to generate a scatter plot of the ∆i values using the plot()
function:
plot(delta)
This function call produces the plot shown in Figure 5.2. Good predictions would produce a tight band of values uniformly
scattered around zero. In this figure, we do see such a distribution, although there are a few outliers that are more than ten
points above or below zero.
It is important to realize that the sample() function will return a different random permutation each time we execute it.
These differing permutations will partition different processors (i.e., rows in the data frame) into the train and test sets.
Thus, if we run this experiment again with exactly the same inputs, we will likely get a different confidence interval
and ∆i scatter plot. For example, when we repeat the same test five times with identical inputs, we obtain the following
confidence intervals: [-1.94, 1.46], [-1.95, 2.68], [-2.66, 3.81], [-6.13, 0.75], [-4.21, 5.29]. Similarly, varying the fraction of
the data we assign to the train and test sets by changing f away from 0.5 also changes the results.
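If reproducibility is desired, fixing the random seed before calling sample() makes the permutation, and hence the split, repeatable; a minimal sketch:

```r
# fixing the seed makes sample() return the same permutation every time
set.seed(1234)
p1 <- sample(10)
set.seed(1234)
p2 <- sample(10)
identical(p1, p2)  # TRUE
```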
It is good practice to run this type of experiment several times and observe how the results change. If you see the results
vary wildly when you re-run these tests, you have good reason for concern. On the other hand, a series of similar results
does not necessarily mean your results are good, only that they are consistently reproducible. It is often easier to spot a bad
model than to determine that a model is good.
Based on the repeated confidence interval results and the corresponding scatter plot, similar to Figure 5.2, we conclude that
this model is reasonably good at predicting the performance of a set of processors when the model is trained on a different
set of processors executing the same benchmark program. It is not perfect, but it is also not too bad. Whether the
differences are large enough to warrant concern is up to you.
data: delta
t = 1.5231, df = 80, p-value = 0.1317
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.4532477  3.4099288
sample estimates:
mean of x
 1.478341
Figure 5.3: Predicting the Fp2000 results using the model developed with the Int2000 data.
The resulting confidence interval for the delta values contains zero and is relatively small. This result suggests that
the model developed using the Int2000 data is reasonably good at predicting the Fp2000 benchmark program’s results. The
scatter plot in Figure 5.4 shows the resulting delta values for each of the processors we used in the prediction. The
results tend to be randomly distributed around zero, as we would expect from a good regression model. Note, however,
that some of the values differ significantly from zero. The maximum positive deviation is almost 20, and the magnitude of
Figure 5.4: A scatter plot of the differences between the predicted and actual performance results for the Fp2000
benchmark when predicted using the Int2000 regression model.
As a final example, we use the Int2000 regression model to predict the results of the benchmark program’s future Int2006
version. The R code to compute this prediction is:
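The code itself was lost in this transcription; it presumably repeats the earlier predict-and-test pattern with the Int2006 data. A self-contained sketch with synthetic stand-ins for int00.dat and int06.dat (the coefficients are invented to mimic the systematic mismatch the text describes):

```r
# synthetic stand-ins; the real int00.lm was trained on the Int2000 data
set.seed(4)
int00.dat <- data.frame(clock = runif(60, 500, 3000))
int00.dat$nperf <- 0.025 * int00.dat$clock + rnorm(60)
int06.dat <- data.frame(clock = runif(30, 1500, 4000))
int06.dat$nperf <- 0.010 * int06.dat$clock + rnorm(30)  # deliberately different relationship

int00.lm <- lm(nperf ~ clock, data = int00.dat)
predicted.dat <- predict(int00.lm, newdata = int06.dat)
delta <- predicted.dat - int06.dat$nperf
t.test(delta, conf.level = 0.95)  # interval far from zero: systematic overprediction
```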
data: delta
t = 49.339, df = 168, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: 48.87259 52.94662
sample estimates:
mean of x
50.9096
In this case, the confidence interval for the delta values does not include zero. In fact, the mean value of the
differences is 50.9096, which indicates that the average of the model-predicted values is substantially larger than the actual
average value. The scatter plot shown in Figure 5.5 further confirms that the predicted values are all much larger than the
actual values.
This example is a good reminder that models have their limits. Apparently, there are more factors that affect the
performance of the next generation of the benchmark programs, Int2006, than the model we developed using the Int2000
results captures. To develop a model that better predicts future performance, we would have to uncover those factors.
Doing so requires a deeper understanding of the factors that affect computer performance, which is beyond the scope of
this tutorial.
Learning Objects
The name between the quotes is the name of the csv-formatted file to be read. Each file line corresponds to one data
record. Commas separate the individual data fields in each record. This function assigns each data record to a new row in
the data frame, and assigns each data field to the corresponding column. When this function completes, the variable
processors contains all the data from the file all-data.csv nicely organized into rows and columns in a
data frame.
If you type processors to see what is stored in the data frame, you will get a long, confusing list of data. Typing
> head(processors)
will show a list of column headings and the values of the first few rows of data. From this list, we can determine which
columns to extract for our model development. Although this is conceptually a simple problem, the execution can be rather
messy, depending on how the data was collected and organized in the file.
As with any programming language, R lets you define your own functions. This feature is useful when you must perform
a sequence of operations multiple times on different data pieces, for instance. The format for defining a function is:
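The template itself was dropped from this transcription; the general R pattern that the next sentence describes is (all names here are placeholders):

```r
# general form of an R function definition
function_name <- function(a1, a2) {
  # expressions evaluated when the function is called
  result <- a1 + a2
  return(result)
}

function_name(2, 3)  # 5
```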
where function-name is the function name you choose and a1, a2, ... is the list of arguments in your
function. The R system evaluates the expressions in the body of the definition when the function is called. A function can
return any type of data object using the return() statement.
We will define a new function called extract_data to extract all the rows that have a result for the given
benchmark program from the processors data frame. For instance, calling the function as follows:
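The calls themselves are missing from this transcription; given the data-frame names mentioned next, they presumably look like the following (the trivial extract_data body here is a stand-in so the sketch runs on its own; the real definition is described below):

```r
# trivial stand-in for the book's extract_data(); the real version selects
# the rows of processors that have results for the named benchmark
extract_data <- function(benchmark) {
  data.frame(benchmark = benchmark)  # placeholder body for illustration
}

int92.dat <- extract_data("Int1992")
fp92.dat  <- extract_data("Fp1992")
# ... and so on for the other benchmark programs
```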
extracts every row that has a result for the given benchmark program and assigns it to the corresponding data frame,
int92.dat, fp92.dat , and so on.
We define the extract_data function as follows:
The first line with the paste functions looks rather complicated. However, it simply forms the name of the column
with the given benchmark results. For example, when extract_data is called with Int2000 as the argument,
the nested paste functions simply concatenate the strings "Spec", "Int2000", and "..average.base.".
The final string corresponds to the name of the column in the processors data frame that contains the performance
results for the Int2000 benchmark, "SpecInt2000..average.base.".
The next line calls the function get_column , which selects all the rows with the desired column name. In this case,
that column contains the actual performance result reported for the given benchmark program, perf . The next four
lines compute the normalized performance value, nperf , from the perf value we obtained from the data frame.
The following sequence of calls to get_column extracts the data for each of the predictors we intend to use in
developing the regression model. Note that the second parameter in each case, such as " Processor.Clock..MHz.
", is the name of a column in the processors data frame. Finally, the data.frame() function is a predefined
R function that assembles all its arguments into a single data frame. The new function we have just defined,
extract_data() , returns this new data frame.
Next, we define the get_column() function to return all the data in a given column for which the given benchmark
program has been defined:
The argument x is a string with the name of the benchmark program, and y is a string with the name of the desired
column. The nested paste() functions produce the same column name as in the extract_data() function. The
is.na() function performs the interesting work. This function returns a vector with TRUE (“1”) values corresponding to
the row numbers in the processors data frame that have NA values in the column selected by the benchmark
index. If there is a value in that location, is.na() returns a corresponding FALSE (“0”) value. Thus, is.na()
indicates which rows are missing performance results for the benchmark of interest. Inserting the exclamation point in
front of this function complements its output. As a result, the variable ix will contain a vector that identifies every
row that contains performance results for the indicated benchmark program. The function then extracts the selected rows
from the processors data frame and returns them.
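The row-selection mechanism can be sketched in a few self-contained lines (the tiny processors frame here is invented for illustration):

```r
# synthetic stand-in frame: NA marks processors with no result for this benchmark
processors <- data.frame(name = c("A", "B", "C", "D"),
                         perf = c(35.2, NA, 41.8, NA))

ix <- !is.na(processors$perf)  # TRUE for rows that have a performance result
selected <- processors[ix, ]   # keep only those rows

nrow(selected)  # 2
```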
These types of data extraction functions can be somewhat tricky to write, because they depend so much on the specific
format of your input file. The functions presented in this chapter are a guide to writing your own data extraction functions.
10. What can you say about these models’ predictive abilities, based on the results from the previous problem? For
example, how well does a model developed for the integer benchmarks predict the same-year performance of the
floating-point benchmarks? What about predictions across benchmark generations?
11. In the discussion of data splitting, we defined the value f as the fraction of the complete data set used in the training set.
For the Fp2000 data set, plot a 95 percent confidence interval for the mean of delta for f = [0.1, 0.2, ..., 0.9]. What
value of f gives the best result (i.e., the smallest confidence interval)? Repeat this test n = 5 times to see how the best
value of f changes.
12. Repeat the previous problem, varying f for all the other data sets.