SimpleRegression_Transcript
see a relationship exists between the two variables, and whether or
not the relationship has a linear form. For the scatter plot, we place
the explanatory variable on the x-axis. The dependent variable is also
known as the response variable. It is the variable that we
wish to understand or predict. In a business example,
this might be sales or demand. For the scatter plot, we place
the response variable on the y-axis. A regression model is considered a
simple
linear regression when there is only one dependent variable and
one independent variable, and the relationship between
these two has a linear form. You have to be very careful
not to confuse the two. When you step on a scale,
do you ever conclude that, according to the weight
chart, you have a height problem, and that you should be seven feet tall for the number
you see on your scale?
Of course not. We understand that height is
used to estimate our weight. So height is the predictor and
weight is the response variable. While in this example the roles
the variables play may be easy to see, there are many times where
you can confuse the two. It is not always easy to detect which
variable is the explanatory variable and which variable is the response
variable. Imagine a study that wants to establish
a connection between K-12 spending and economic growth. Suppose you get a report
that says that states that
spend more money on K-12 education have higher rates of economic growth
than states that spend less. Then the study is treating K-12 spending
as the explanatory variable and the state's rate of economic growth
as the response variable, meaning that investing in K-12
causes economic growth. On the other hand, I could argue that
states that have a good rate of growth have more money and
thus spend more on K-12. In this reading, the economic
growth rate drives how much money is spent on K-12. What you see here is
an example
where the direction of cause and effect is not that clear. The two variables
are hopelessly tangled, since each can affect the
other at different times. In the first case, a struggling state
decides to increase investment in K-12 so that its residents are ready to be
employed. At some time in the future, when this investment pays off,
we see the second case occurring: since the state is
doing really well and has money, it starts spending more on K-12,
and thus the variables switch roles. Avoid these types of studies. When you pick
your variables you
should have reason to believe that the explanatory variables affect
the dependent variable and not the other way around. Just because a strong
relationship
between two variables exists doesn't necessarily mean that
they are functionally related. Any two sequences, y and
x, that move together, so that as x increases y steadily increases or
decreases, will show a strong statistical relation even though a functional
relationship may not exist. Look at this graph, which shows that cheese
consumption is highly correlated with the number of people who died by becoming
tangled in their bed sheets. No kidding. The data is from National Geographic. So
make sure that you pick variables
that are functionally related. So now let's practice. Imagine conducting these
studies. In one study we are collecting data on
the number of surviving fish in a tank and the water temperature in the tank. In
the second study we
are recording the year and the bushels of corn harvested in Illinois,
and in the third study we are collecting data on medication dosage and
time elapsed for total relief. For each study, identify
the response variable. The response variable in each
study is shown in red font. In the first study
the number of surviving fish in the tank is responding to
the water temperature in the tank. In the second study, the bushels
of corn harvested in Illinois are being tracked year after year, so
the harvest is the response variable. And in the third study, time elapsed for
total relief is the response variable,
and depends on medication dosage. We learned about scatter plots early on.
The scatter plot is a great visual tool for
detecting whether or not the variables we have identified show
any special relationship with one another. For instance, here is a scatter
plot for
income versus educational attainment for all 50 states. Looking at the
scatter plot, where each
diamond represents a state, we can see that there is a linear relationship
between the predictor variable, education level on the x-axis, and income,
the response variable on the y-axis. Since income goes up as the education level goes up,
we conclude that these two
variables are positively correlated. We can even look at
the spread of the points and see that they are fairly tightly clustered,
which suggests a strong relationship. So it is well worth doing the full
analysis to learn in more detail by how much education level
affects income level. By plotting the two variables, we can detect
whether we have identified two variables that don't have any correlation at
all, in which case you don't need to
pursue building a predictive model. Or we can detect a positive correlation
or
a negative correlation. Either of these can be suitable for
simple linear regression analysis. So again, let's practice. This scatter
plot shows the correlation
between per capita income in a state and the percent of the 2012 presidential
election votes that went to Romney. What type of correlation do you see here?
State median income is negatively
correlated with the proportion of the state's votes in the 2012
election that went to Romney. When you have a scatter plot,
the first thing to look for is
the direction of the association. A pattern that runs from the upper left to
the lower right is said to be negative. A pattern that runs from the lower
left
to the upper right is called positive. The second thing to look for
in a scatter plot is its form. If there is a straight-line relationship,
it will appear as a cloud or swarm of points stretched out in
a generally consistent straight form. This is called linear form. Sometimes
the relationship curves gently
while still increasing or decreasing steadily;
sometimes it curves sharply up and down. If you see such a curved pattern, then a
linear
regression model will not be suitable. The third feature to look for in a scatter
plot is the strength of the relationship. Do the points appear tightly
clustered in
a single stream or do the points seem to be so variable and spread out that
we
can barely discern any trend or pattern? As the points get more and more
spread out, the relationship
becomes weaker and weaker. A good model will have
most of its points closely clustered, indicating a strong relationship. Finally,
always look for the unexpected. An outlier is an unusual observation, standing
away from the overall
pattern of the scatter plot. Here's the graph which shows income
versus percent of votes for Romney. In this graph,
this point can be defined as an outlier. Sometimes, we remove the outlier
from
the data before doing the regression, but other times we may want to
know why the outlier is there before deciding what to do next. Scatter plots
are useful in visualizing
the relationship between the two variables you're studying. Once we have a
sense that
the relationship exists, then we can move to the next step which
involves defining a mathematical equation, which describes the relationship
between these two variables. This will be called
the regression equation.
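
As a rough illustration of the direction and strength ideas above, and of what a regression equation can look like, here is a minimal Python sketch. The data values and the function names pearson_r and fit_line are made up for illustration and do not come from the lecture. The sketch assumes the relationship is roughly linear, and it uses the standard correlation coefficient and least-squares line as one common way to put numbers on what the scatter plot shows.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient: sign gives direction, size gives strength."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, invented for this sketch (not the lecture's data).
# Explanatory variable goes on the x-axis, response variable on the y-axis.
education_level = [1, 2, 3, 4, 5]
income = [32, 38, 41, 47, 52]

r = pearson_r(education_level, income)
slope, intercept = fit_line(education_level, income)

direction = "positive" if r > 0 else "negative"
print(f"direction: {direction}, r = {r:.3f}")
print(f"regression equation: income = {intercept:.2f} + {slope:.2f} * education_level")
```

A value of r near +1 or -1 corresponds to points tightly clustered around a line, and a value near 0 corresponds to a cloud with no discernible linear trend. A single outlier can shift both r and the fitted line noticeably, which is one reason to look at the scatter plot before trusting the numbers.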