SimpleRegression_Transcript

The document discusses the use of regression analysis in statistics to explore relationships between variables and make predictions. It emphasizes the importance of identifying dependent and independent variables, as well as the need for a clear understanding of cause and effect. Additionally, it highlights the significance of scatter plots in visualizing relationships and determining the suitability of regression models.

Uploaded by

Joseph oyadina
Copyright
© © All Rights Reserved

[SOUND] Does eating nuts help you live longer? Does Zika virus cause
a serious birth defect? Does regular exercise improve memory? Does burning
fossil fuels accelerate global warming? We can use regression to
answer questions like these. In the last few lessons we focused
on building confidence intervals and comparing different populations. That is
what we call
inferential statistics. Using sample information to get
some insight about the population. Now, we can go one step further. Think
about it,
look at the first question on this slide. From what we have learned so far, I
know I can do a study comparing
people who eat nuts to those who don't, and then find whether the mean
life expectancies of the two groups are different or not. While this is an
insight worth having, you
may face the next set of questions that come up, like how much of
the increase is due to eating nuts? Maybe one group was healthier than
the other group to begin with. Or how many nuts does one need
to eat to get this benefit? In other words, do you have a way
of explaining how consuming nuts impacts longevity among all the other factors
that could be impacting longevity? We can use regression to find the answer,
and once we know how nut consumption and longevity are related,
now we can go ahead and start predicting. For example, we can say eating 100
grams
per day will reduce the risk of dying. By the way, this is
a result of a real study. Regression builds on everything we
have learned so far in the class and will allow us to move to prediction
based on sample information. The main objectives of regression analysis
are, first, prediction: making inferences about possible
cause-and-effect relationships and extrapolating them into the future. Then,
gaining the ability to address
questions of why so much, or why so many, cause this difference. And taking the
new understanding and
improving our business. One caveat here is the fact that we
expect future behavior of the process to be similar to the past. At some time
in the future, the relationship that we have discovered
may weaken or cease to exist. So watch how well the predictions
are turning out and adjust the model when needed. Regression analysis will
generate
an equation to describe the statistical relationship between one or more
predictors and the response variable and then we can use the regression
equation to
predict the outcome of new observations. In the field of statistics,
regression is a big topic on its own and there are many models available for
regression analysis. The simplest model is the simple
linear regression and that is what I will introduce
you to in this module. There are two different variables that
make up a simple linear regression model: the dependent variable and
the independent variable. The independent variable
is also known as the explanatory or predictor variable. This is the
variable we
are going to use to try and understand how it impacts
the dependent variable. In a business example, this might be
advertising level or time period. We can control the independent variable,
for example one can decide how
much to spend on advertising. Simple linear regression most
often starts with a scatter plot, which will allow us to visually
see whether a relationship exists between the two variables, and whether or
not the relationship has a linear form. For the scatter plot, we place
the explanatory variable on the x-axis. The dependent variable is also
known as the response variable. It is the variable that we
wish to understand or predict. In a business example,
this might be sales or demand. For the scatter plot, we place
the response variable on the y-axis. A regression model is considered a
simple
linear regression when there is only one dependent variable and
one independent variable, and the relationship between
these two has a linear form. You have to be very careful
not to confuse the two. When you step on a weight
scale, do you ever think that, according to the weight
chart, you have a height problem? That you should be seven feet tall for the number
you see on your scale?
Of course not. We understand that height is
used to estimate our weight. So height is the predictor and
weight is the response variable. While in this example the roles the
variables play may be easy to see, there are many times where
you can confuse the two. It is not always easy to detect which
variable is the explanatory variable and which variable is the response
variable. Imagine a study that wants to establish
a connection between K-12 spending and economic growth. If you get a report
that says states that
spend more money on K-12 education have higher rates of economic growth
than states that spend less, then the study is treating K-12 spending
as the explanatory variable and the state's rate of economic growth
as the response variable. Meaning, investing in K-12
causes economic growth. On the other hand I could argue that
states that have a good rate of growth have more money and
thus spend more on K-12. In this understanding, the economic
growth rate drives how much money is spent on K-12. What you see here is
an example
where the direction of cause and effect is not that clear. The two variables
are hopelessly tangled since they both can affect one
another at different times. In the first case, the struggling state
decides to increase investment in K-12 so that its residents are ready to be
employed, and at some time in the future when this investment pays off,
we see the second case occurring. Which is, since the state is
doing really well and has money, it will start spending more on K-12,
thus the variables switch roles. Avoid these types of studies. When you pick
your variables you
should have reason to believe that the explanatory variables affect
the dependent variable and not the other way around. Just because a strong
relationship
between two variables exists doesn't necessarily mean that
they are functionally related. Any two sequences, y and
x, that move together (if x increases, then y either increases or
decreases) will always show a strong statistical relation, while a functional
relationship may not exist. Look at this graph, which shows cheese
consumption is highly correlated with the number of people who died by becoming
tangled in their bed sheets, no kidding. Data is from National Geographic. So
make sure that you pick variables
that are functionally related. So now let's practice. Imagine conducting these
studies. In one study we are collecting data on
number of surviving fish in the tank and water temperature in the tank. In
the second study we
are recording the year and the bushels of corn harvested in Illinois,
and in the third study we are collecting data on medication dosage and
time elapsed for total relief. Identify for
each study the response variable. The response variable in each
study is shown in red color font. In the first study
the number of surviving fish in the tank is responding to
the water temperature in the tank. In the second study bushels
of corn harvested in Illinois is being tracked year after year so
the harvest is the response variable. And in the third study time elapsed for
total relief is the response variable,
and depends on medication dosage. We learned about scatter plots early on.
This is a great visual tool for
detecting whether or not the variables we have identified show
any special relationship with one another. For instance, here is a scatter
plot for
income versus educational attainment for all 50 states. Looking at the
scatter plot where each
diamond represents a state, we can see that there is a linear relationship
between the predictor variable, education level on the x-axis, and income,
the response variable on the y-axis. Since income goes up as the education level
goes up, we conclude that these two
variables are positively correlated. We can even look at
the spread of the points and see that they are fairly clustered
which suggests a strong relationship. So it is well worth doing the full
analysis to learn in more detail by how much education level
impacts income level. By plotting the two variables, we can detect
whether we have identified two variables which don't have any correlation at
all, in which case you don't need to
pursue building a predictive model. Or we can detect a positive correlation
or
a negative correlation. These two will be suitable for
simple linear regression analysis. So again, let's practice. This scatter
plot shows the correlation
between per capita income in a state and the percent of the 2012 presidential
election votes that went to Romney. What type of correlation do you see here?
State median income is negatively
correlated with the proportion of the state's votes in the 2012
election that went to Romney. When you have a scatter plot,
the first thing to look for is
the direction of the association. A pattern that runs from the upper left to
the lower right is said to be negative. A pattern that runs from the lower
left
to the upper right is called positive. The second thing to look for
in a scatter plot is its form. If there is a straight line relationship,
it will appear as a cloud or swarm of points stretched out in
a generally consistent straight form. This is called linear form. Sometimes
the relationship curves gently,
while still increasing or decreasing steadily;
sometimes it curves sharply up and down. If you see this pattern, then a
linear
regression model will not be suitable. The third feature to look for in a scatter
plot is the strength of the relationship. Do the points appear tightly
clustered in
a single stream or do the points seem to be so variable and spread out that
we
can barely discern any trend or pattern? As the points get more and more
spread out then the relationship
becomes weaker and weaker. A good model will have
most of its points closely clustered, providing a strong relationship. Finally,
always look for the unexpected. An outlier is an unusual observation, standing
away from the overall
pattern of the scatter plot. Here's the graph which shows income
versus percent of votes for Romney. In this graph,
this point can be defined as an outlier. Sometimes, we remove the outlier
from
the data before doing the regression, but other times we may want to
know why the outlier is there before deciding what to do next. Scatter plots
are useful in visualizing
the relationship between the two variables you're studying. Once we have a
sense that
the relationship exists, then we can move to the next step which
involves defining a mathematical equation, which describes the relationship
between these two variables. This will be called
the regression equation.
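
As a rough illustration of the steps described in this lesson, the sketch below checks the direction and strength of a relationship with the correlation coefficient, then fits a straight line and uses it to predict a new observation. It is only a minimal sketch: the advertising and sales numbers are made up for illustration and do not come from the lecture, and the helper names (pearson_r, fit_line, predict) are hypothetical. The line is fitted by ordinary least squares, which is the usual approach for simple linear regression but is not derived in this transcript.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Correlation coefficient: the sign gives direction, the size gives strength."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for the line y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1


def predict(b0, b1, new_x):
    """Plug a new value of the explanatory variable into the fitted equation."""
    return b0 + b1 * new_x


# Made-up illustration data (an assumption, not from the lecture):
# explanatory variable on the x-axis, response variable on the y-axis.
ad_spend = [10, 15, 20, 25, 30, 35]
sales = [120, 150, 170, 210, 230, 260]

r = pearson_r(ad_spend, sales)
b0, b1 = fit_line(ad_spend, sales)

if r > 0:
    direction = "positive"
elif r < 0:
    direction = "negative"
else:
    direction = "none"

print(f"direction: {direction}, r = {r:.3f}")
print(f"regression equation: sales = {b0:.2f} + {b1:.2f} * ad_spend")
print(f"predicted sales at ad_spend = 40: {predict(b0, b1, 40):.2f}")
```

Note that the last prediction uses a value of ad_spend beyond the observed data, so it relies on the caveat mentioned earlier: it is only reasonable if the relationship keeps behaving as it did in the sample.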
