
UNIT-3

This document discusses the concepts of correlation and regression, detailing how to interpret relationships between two quantitative variables through scatter plots and correlation coefficients. It explains the significance of positive, negative, and no relationships, as well as the calculation and interpretation of the correlation coefficient (r) and the least squares regression line. Additionally, it emphasizes the importance of understanding that correlation does not imply causation and outlines the assumptions necessary for regression analysis.

Uploaded by

ilayaraja.it
Copyright
© All Rights Reserved

UNIT III DESCRIBING RELATIONSHIPS

Correlation – Scatter plots – correlation coefficient for quantitative data – computational
formula for correlation coefficient – Regression – regression line – least squares
regression line – Standard error of estimate – interpretation of r²

I. Correlation:

• An investigator suspects that a relationship exists between the number of greeting
cards sent and the number of greeting cards received by individuals.
• The investigator obtains the estimates for the most recent holiday season from five
friends, as shown in Table 6.1.
• The data in Table 6.1 represent a very simple observational study with two
dependent variables.

AN INTUITIVE APPROACH

• If the suspected relationship exists, the data might reveal a tendency for pairs of
scores to occupy similar relative positions in their respective distributions.

Positive Relationship
• Trends among pairs of scores can be detected most easily by constructing a list of
paired scores in which the scores along one variable are arranged from largest to
smallest.
• In panel A of Table 6.2, the five pairs of scores are arranged from the largest (13) to
the smallest (1) number of cards sent.
• This table reveals a pronounced tendency for pairs of scores to occupy similar relative
positions in their respective distributions.
• For example, John sent relatively few cards (1) and received relatively few cards (6),
whereas Doris sent relatively many cards (13) and received relatively many cards (14).
• We can conclude, therefore, that the two variables are related.
• Insofar as relatively low values are paired with relatively low values, and relatively
high values are paired with relatively high values, the relationship is positive.
Negative Relationship
• In panel B of Table 6.2, although John sent relatively few cards (1), he received relatively many (18).
• From this pattern, we can conclude that the two variables are related.
• This relationship implies that “You get the opposite of what you give.”
• Insofar as relatively low values are paired with relatively high values, and relatively
high values are paired with relatively low values, the relationship is negative.

Little or No Relationship
• No regularity is apparent among the pairs of scores in panel C.
• For instance, although both Andrea and John sent relatively few cards (5 and 1,
respectively), Andrea received relatively few cards (6) and John received relatively
many cards (14).
• We can conclude that little, if any, relationship exists between the two variables.

• Two variables are positively related if pairs of scores tend to occupy similar relative
positions (high with high and low with low) in their respective distributions.
• They are negatively related if pairs of scores tend to occupy dissimilar relative
positions (high with low and vice versa) in their respective distributions.
II. SCATTERPLOTS
• A scatterplot is a graph containing a cluster of dots that represents all pairs of scores.

Construction
• To construct a scatterplot, as in Figure 6.1, scale each of the two variables along the
horizontal (X) and vertical (Y) axes, and use each pair of scores to locate a dot within
the scatterplot.
• For example, the pair of numbers for Mike, 7 and 12, define points along the X and Y
axes, respectively.
• Using these points to anchor lines perpendicular to each axis, locate Mike’s dot where
the two lines intersect.
Positive, Negative, or Little or No Relationship?
• A dot cluster that has a slope from the lower left to the upper right, as in panel A of
Figure 6.2, reflects a positive relationship.
• Small values of one variable are paired with small values of the other variable, and
large values are paired with large values.
• In panel A, short people tend to be light, and tall people tend to be heavy.
• On the other hand, a dot cluster that has a slope from the upper left to the lower
right, as in panel B of Figure 6.2, reflects a negative relationship.
• Small values of one variable tend to be paired with large values of the other
variable, and vice versa.
• A dot cluster that lacks any apparent slope, as in panel C of Figure 6.2, reflects little or
no relationship.
• Small values of one variable are just as likely to be paired with small, medium, or
large values of the other variable.

Strong or Weak Relationship?


• The more closely the dot cluster approximates a straight line, the stronger (the more
regular) the relationship will be.
• Figure 6.3 shows a series of scatterplots, each representing a different positive
relationship between IQ scores for pairs of people whose backgrounds reflect different
degrees of genetic overlap, ranging from minimum overlap between foster parents and
foster children to maximum overlap between identical twins.
Perfect Relationship
• A dot cluster that equals a straight line reflects a perfect relationship between two
variables.

Curvilinear Relationship
• The discussion so far assumes that a dot cluster approximates a straight line and,
therefore, reflects a linear relationship.
• Sometimes a dot cluster approximates a bent or curved line, as in Figure 6.4, and
therefore reflects a curvilinear relationship.
• For example, physical strength, as measured by the force of a person’s handgrip, is
less for children, more for adults, and then less again for older people.
III. A CORRELATION COEFFICIENT FOR QUANTITATIVE DATA: r
• A correlation coefficient is a number between –1 and 1 that describes the
relationship between pairs of variables.
• This unit concentrates on the type of correlation coefficient, designated as r, that
describes the linear relationship between pairs of variables for quantitative data.

Key Properties of r
• Named in honor of the British scientist Karl Pearson, the Pearson correlation
coefficient, r, can equal any value between –1.00 and +1.00.
• Furthermore, the following two properties apply:
• The sign of r indicates the type of linear relationship, whether positive or negative.
• The numerical value of r, without regard to sign, indicates the strength of the
linear relationship.

Sign of r
• A number with a plus sign (or no sign) indicates a positive relationship, and a
number with a minus sign indicates a negative relationship.

Numerical Value of r
• The more closely a value of r approaches either –1.00 or +1.00, the stronger the
relationship.
• The more closely the value of r approaches 0, the weaker the relationship.
• r = –.90 indicates a stronger relationship than does an r of –.70, and
• r = –.70 indicates a stronger relationship than does an r of .50.

Interpretation of r
• Located along a scale from –1.00 to +1.00, the value of r supplies information
about the direction of a linear relationship (positive or negative).
• It generally also supplies information about the relative strength of a linear
relationship: relatively weak when r is in the vicinity of 0, or relatively strong when
r deviates from 0 in the direction of either +1.00 or –1.00.

r Is Independent of Units of Measurement


• The value of r is independent of the original units of measurement.
• The same value of r describes the correlation between height and weight for a group of
adults, whichever units are used to measure height and weight.
• r depends only on the pattern among pairs of scores, which in turn show no traces of
the units of measurement for the original X and Y scores.
• A positive value of r reflects a tendency for pairs of scores to occupy similar relative
locations in their respective distributions, while a negative value of r reflects a
tendency for pairs of scores to occupy dissimilar relative locations in their respective
distributions.
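
The independence from units can be checked with a small sketch. The heights and weights below are hypothetical values invented for illustration, not data from these notes, and `pearson_r` is a helper name chosen here; it computes r from deviation scores as SPxy / sqrt(SSx · SSy).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from deviation scores: SPxy / sqrt(SSx * SSy)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)

# Hypothetical data, invented for illustration only.
height_in = [62, 65, 68, 70, 74]        # inches
weight_lb = [120, 150, 145, 170, 190]   # pounds

# The same five people, measured in different units.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_imperial = pearson_r(height_in, weight_lb)
r_metric = pearson_r(height_cm, weight_kg)
print(round(r_imperial, 4), round(r_metric, 4))
```

Both printed values agree, because multiplying every score by a positive constant rescales the deviations in the numerator and the denominator by the same factor.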

Range Restrictions
• Except in special circumstances, the value of the correlation coefficient declines
whenever the range of possible X or Y scores is restricted.
• For example, Figure 6.5 shows a dot cluster with an obvious slope, represented by an r
of .70 for the positive relationship between height and weight for all college students.
• If the range of heights is restricted to students who stand over 6 feet 2 inches
(or 74 inches) tall, the abbreviated dot cluster loses its obvious slope because the
weights among tall students are more homogeneous.
• Therefore, as depicted in Figure 6.5, the value of r drops to .10.
• Sometimes it’s impossible to avoid a range restriction.
• For example, some colleges only admit students with SAT test scores above some
minimum value.
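
A rough numerical illustration of a range restriction follows. The ten pairs of scores are hypothetical (Y rises with X, with an alternating +2/−2 disturbance) and are not the height and weight data of Figure 6.5; `pearson_r` is a helper name chosen here.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from deviation scores: SPxy / sqrt(SSx * SSy)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)

# Hypothetical scores, invented for illustration only.
xs = list(range(1, 11))
ys = [x + (2 if x % 2 == 1 else -2) for x in xs]

r_full = pearson_r(xs, ys)

# Restrict the range to the three highest X scores.
kept = [(x, y) for x, y in zip(xs, ys) if x >= 8]
r_restricted = pearson_r([x for x, _ in kept], [y for _, y in kept])

print(round(r_full, 2), round(r_restricted, 2))
```

For these made-up scores r falls from roughly .79 over the full range to roughly .40 over the restricted range: the disturbance stays the same size, but the restricted X scores no longer spread far enough to reveal the slope.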

Caution
• We have to be careful when interpreting the actual numerical value of r.
• An r of .70 for height and weight doesn’t signify that the strength of this relationship
equals either .70 or 70 percent of the strength of a perfect relationship.
• The value of r can’t be interpreted as a proportion or percentage of some perfect
relationship.
Verbal Descriptions

• When interpreting a new r, you’ll find it helpful to translate the numerical value of r
into a verbal description of the relationship.
• An r of .70 for the height and weight of college students could be translated into
“Taller students tend to weigh more”;
• An r of –.42 for time spent taking an exam and the subsequent exam score could be
translated into “Students who take less time tend to make higher scores”; and
• An r in the neighborhood of 0 for shoe size and IQ could be translated into “Little, if
any, relationship exists between shoe size and IQ.”

Correlation Not Necessarily Cause-Effect

• A correlation coefficient, regardless of size, never provides information about whether
an observed relationship reflects a simple cause-effect relationship or some more
complex state of affairs.
• Example: the correlation between cigarette smoking and lung cancer.


• American Cancer Society representatives interpreted the correlation as a causal
relationship: Smoking produces lung cancer.
• Tobacco industry representatives interpreted the correlation differently: both the
desire to smoke cigarettes and lung cancer are caused by some more basic but
unidentified factors, such as the body metabolism or personality of some people.
• According to this reasoning, people with a high body metabolism might be more prone
to smoke and, quite independent of their smoking, more vulnerable to lung cancer.
• Therefore, smoking correlates with lung cancer because both are effects of some
common cause or causes.
Role of Experimentation
• In the present case, laboratory animals were trained to inhale different amounts of
tobacco tars and were then euthanized.
• Autopsies revealed that the observed incidence of lung cancer varied directly with the
amount of inhaled tobacco tars, even though possible “contaminating” factors, such as
different body metabolisms or personalities, had been neutralized either through
experimental control or by random assignment of the subjects to different test
conditions.

IV. DETAILS: COMPUTATION FORMULA FOR r
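
The formula for this section did not survive in this copy of the notes. A common computation form, assumed here rather than quoted from the notes, is r = SPxy / sqrt(SSx · SSy), where SSx = ΣX² − (ΣX)²/n, SSy = ΣY² − (ΣY)²/n, and SPxy = ΣXY − (ΣX)(ΣY)/n. The sketch below applies it to the greeting card data. Andrea's received count is not quoted in the text; the value 10 is inferred from the stated mean of 12 cards received.

```python
from math import sqrt

# (cards sent X, cards received Y); Andrea's 10 is inferred, see above.
cards = {
    "Andrea": (5, 10),
    "Mike": (7, 12),
    "Doris": (13, 14),
    "Steve": (9, 18),
    "John": (1, 6),
}
xs = [x for x, _ in cards.values()]
ys = [y for _, y in cards.values()]
n = len(xs)

# Sums of squares and sum of products in computation form.
ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

r = sp_xy / sqrt(ss_x * ss_y)
print(ss_x, ss_y, sp_xy, round(r, 2))
```

With these five pairs, SSx = 80, SSy = 80, SPxy = 64, and r = .80, which agrees with the strong positive relationship and the r² of .64 reported later in the unit.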


V. Regression

TWO ROUGH PREDICTIONS

• A correlation analysis of the exchange of greeting cards by five friends for the
most recent holiday season suggests a strong positive relationship between
cards sent and cards received.
• When informed of these results, another friend, Emma, who enjoys receiving
greeting cards, asks you to predict how many cards she will receive during the
next holiday season, assuming that she plans to send 11 cards.

Predict “Relatively Large Number”
• We could offer Emma a very rough prediction by recalling that cards sent and
received tend to occupy similar relative locations in their respective
distributions.
• Therefore, Emma can expect to receive a relatively large number of cards, since she
plans to send a relatively large number of cards.

Predict “between 14 and 18 Cards”


• To obtain a slightly more precise prediction for Emma, refer to the scatter plot for
the original five friends shown in Figure 7.1.
• Notice that Emma’s plan to send 11 cards locates her along the X axis between
the 9 cards sent by Steve and the 13 sent by Doris.
• Using the dots for Steve and Doris as guides, construct two strings of arrows, one
beginning at 9 and ending at 18 for Steve and the other beginning at 13 and ending
at 14 for Doris.
• We can predict that Emma’s return should be between 14 and 18 cards, the
numbers received by Doris and Steve.
VI. A REGRESSION LINE

• All five dots contribute to the more precise prediction, illustrated in Figure 7.2,
that Emma will receive 15.20 cards.
• Look for the solid line designated as the regression line in Figure 7.2, which guides
the string of arrows, beginning at 11, toward the predicted value of 15.20.
• If all five dots had defined a single straight line, placement of the regression line
would have been simple; merely let it pass through all dots.

Predictive Errors
• Figure 7.3 illustrates the predictive errors that would have occurred if the regression
line had been used to predict the number of cards received by the five friends.
• Solid dots reflect the actual number of cards received, and open dots, always located
along the regression line, reflect the predicted number of cards received.
• The largest predictive error, shown as a broken vertical line, occurs for Steve, who sent
9 cards.
• Although he actually received 18 cards, he should have received slightly fewer than 14
cards, according to the regression line.
• The smallest predictive error (none at all) occurs for Mike, who sent 7 cards.
• He actually received the 12 cards that he should have received, according to the
regression line.

Total Predictive Error

The smaller the total for all predictive errors in Figure 7.3, the more favorable will be
the prognosis for our predictions.
We want the regression line to be placed in a position that minimizes the total predictive error,
that is, that minimizes the total of the vertical discrepancies between the solid and open
dots shown in Figure 7.3.
VII. LEAST SQUARES REGRESSION LINE

• To avoid the arithmetic standoff of zero always produced by adding positive and
negative predictive errors, the placement of the regression line minimizes the total
squared predictive error.
• When located like this, the regression line is often referred to as the least
squares regression line.
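
Both the standoff and its remedy can be checked numerically. This sketch uses the line Y′ = .80X + 6.40 implied by the values of b and a quoted in this unit, together with the greeting card data (Andrea's received count of 10 is inferred from the stated mean of 12). The helper names and the three comparison lines are arbitrary choices made for illustration.

```python
# (cards sent X, cards received Y); Andrea's 10 is inferred from the mean of 12.
cards = {
    "Andrea": (5, 10),
    "Mike": (7, 12),
    "Doris": (13, 14),
    "Steve": (9, 18),
    "John": (1, 6),
}

def errors(b, a):
    """Predictive errors Y - Y' for the line Y' = b*X + a."""
    return [y - (b * x + a) for x, y in cards.values()]

def total_squared_error(b, a):
    """Sum of the squared predictive errors for the line Y' = b*X + a."""
    return sum(e ** 2 for e in errors(b, a))

best_errors = errors(0.80, 6.40)
print("sum of errors:", round(abs(sum(best_errors)), 2))
print("sum of squared errors:", round(total_squared_error(0.80, 6.40), 2))

# Arbitrary nearby lines, chosen only for comparison.
for b, a in [(0.70, 6.40), (0.90, 6.40), (0.80, 7.40)]:
    print(b, a, round(total_squared_error(b, a), 2))
```

The plain errors cancel to zero, while the squared errors total 28.8. Each of the three nudged lines produces a larger total, as expected if .80 and 6.40 locate the least squares line.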
Key Property
• Once numbers have been assigned to b and a, as just described, the least squares
regression equation emerges as a working equation with a most desirable property:
• It automatically minimizes the total of all squared predictive errors for known Y
scores in the original correlation analysis.
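
The expressions that assign numbers to b and a are not reproduced in this copy. The sketch below assumes the standard least squares solutions for the equation Y′ = bX + a, namely b = r · sqrt(SSy / SSx) and a = Ȳ − bX̄, and applies them to the greeting card data (Andrea's received count of 10 is inferred from the stated mean of 12).

```python
from math import sqrt

# (cards sent X, cards received Y); Andrea's 10 is inferred from the mean of 12.
cards = {
    "Andrea": (5, 10),
    "Mike": (7, 12),
    "Doris": (13, 14),
    "Steve": (9, 18),
    "John": (1, 6),
}
xs = [x for x, _ in cards.values()]
ys = [y for _, y in cards.values()]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

ss_x = sum((x - mean_x) ** 2 for x in xs)
ss_y = sum((y - mean_y) ** 2 for y in ys)
sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
r = sp_xy / sqrt(ss_x * ss_y)

b = r * sqrt(ss_y / ss_x)   # slope of the least squares line
a = mean_y - b * mean_x     # Y intercept of the least squares line
print(round(b, 2), round(a, 2))
```

Under these assumptions b comes out as .80 and a as 6.40, the two values used in the rest of this section.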
Solving for Y′
• In its present form, the regression equation can be used to predict the number of
cards that Emma will receive, assuming that she plans to send 11 cards.
• Simply substitute 11 for X and solve for the value of Y′ as follows:

Y′ = .80(11) + 6.40 = 8.80 + 6.40 = 15.20

• Even when no cards are sent (X = 0), we predict a return of 6.40 cards because of the
value of a.
• Also, notice that sending each additional card translates into an increment of
only .80 in the predicted return because of the value of b.
• Whenever b has a value less than 1.00, increments in the predicted return will
lag—by an amount equal to the value of b, that is, .80 in the present case—
behind increments in cards sent.
• If the value of b had been greater than 1.00, then increments in the predicted
return would have exceeded increments in cards sent.
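
The roles of a and b can be seen by tabulating a few predictions. In this sketch `predict` is a hypothetical helper name, and the values of X other than 0 and 11 are arbitrary.

```python
def predict(cards_sent, b=0.80, a=6.40):
    """Predicted number of cards received, Y' = b*X + a."""
    return b * cards_sent + a

for x in (0, 10, 11, 12):
    print(x, round(predict(x), 2))

# Each additional card sent raises the predicted return by b.
increment = predict(12) - predict(11)
print(round(increment, 2))
```

Sending no cards gives the intercept, 6.40; sending 11 gives 15.20; and moving from 11 to 12 cards adds only .80 to the predicted return.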

A Limitation
• Emma might survey these predicted card returns before committing herself to a
particular card investment. This strategy could backfire, however, because there is
no evidence of a simple cause-effect relationship between cards sent and cards received.

VIII. STANDARD ERROR OF ESTIMATE, sy|x

• Although we predict that Emma’s investment of 11 cards will yield a return of 15.20
cards, we would be surprised if she actually received exactly 15 cards.
• It is more likely that, because of the imperfect relationship between cards sent and
cards received, Emma’s return will be some number other than 15.
• Although designed to minimize predictive error, the least squares equation does not
eliminate it.
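
The standard error of estimate puts a number on this leftover predictive error. Its formula is not reproduced in this copy, so the sketch below assumes the usual definition, sy|x = sqrt(SSy|x / (n − 2)) with SSy|x = SSy(1 − r²). The inputs SSy = 80, r = .80, and n = 5 are the values for the greeting card example.

```python
from math import sqrt

ss_y = 80.0   # total sum of squares for cards received
r = 0.80      # correlation between cards sent and cards received
n = 5         # number of friends (pairs of scores)

ss_y_given_x = ss_y * (1 - r ** 2)           # residual sum of squares
s_y_given_x = sqrt(ss_y_given_x / (n - 2))   # standard error of estimate
print(round(ss_y_given_x, 1), round(s_y_given_x, 2))
```

Under these assumptions the residual sum of squares is 28.8 and the standard error of estimate is about 3.10 cards, a rough indication of how far predictions such as Emma's 15.20 may miss.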
Importance of r

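Formula 7.5 expresses the standard error of estimate in terms of r: sy|x = √[SSy(1 − r²) / (n − 2)].

Substituting a value of 1 for r, we obtain

sy|x = √[SSy(1 − 1) / (n − 2)] = 0

so a perfect relationship leaves no predictive error.

Substituting a value of 0 for r in the numerator of Formula 7.5, we obtain

sy|x = √[SSy / (n − 2)]

so the standard error of estimate takes its largest possible value when no relationship exists.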

ASSUMPTIONS
Linearity

• Use of the regression equation requires that the underlying relationship be linear.

Homoscedasticity
• Use of the standard error of estimate, sy|x, assumes that except for chance, the dots
in the original scatterplot will be dispersed equally about all segments of the
regression line.
• This assumption, known as homoscedasticity, is a concern only when the scatterplot
reveals a dramatically different type of dot cluster, such as that shown in Figure 7.4.
• The standard error of estimate for the data in Figure 7.4 should be used cautiously, since
its value overestimates the variability of dots about the lower half of the regression
line and underestimates the variability of dots about the upper half of the regression
line.
INTERPRETATION OF r²

• The squared correlation coefficient, r², provides us with a key interpretation of the
correlation coefficient and also a measure of predictive accuracy that
supplements the standard error of estimate, sy|x.

Repetitive Prediction of the Mean

• Pretend that we know the Y scores (cards received), but not the corresponding X
scores (cards sent), for each of the five friends.
• Lacking information about the relationship between X and Y scores, we could not
construct a regression equation and use it to generate a customized prediction, Y′, for
each friend.
• We mount a primitive predictive effort by always predicting the mean, Ȳ, for each of
the five friends’ Y scores.
• The repetitive prediction of Ȳ for each of the Y scores of all five friends will supply us
with a frame of reference against which to evaluate our customary predictive effort
based on the correlation between cards sent (X) and cards received (Y).

Predictive Errors

Panel A of Figure 7.5 shows the predictive errors for all five friends when the mean for
all five friends, Ȳ, of 12 (shown as the mean line) is always used to predict each of their
five Y scores.
Panel B shows the corresponding predictive errors for all five friends when a series of
different Y′ values, obtained from the least squares equation (shown as the least
squares line), is used to predict each of their five Y scores.
For example, Panel A shows the error for John when the mean for all five friends, Ȳ, of 12
is used to predict his Y score of 6.
Shown as a broken vertical line, the error of −6 for John (from Y − Ȳ = 6 − 12 = −6)
indicates that Ȳ overestimates John’s Y score by 6 cards.
Panel B shows a smaller error of −1.20 for John when a Y′ value of 7.20 is used to
predict the same Y score of 6.
This Y′ value of 7.20 is obtained from the least squares equation, where the number of
cards sent by John, 1, has been substituted for X.

Error Variability (Sum of Squares)


The sum of squares of any set of deviations, now called errors, can be calculated by first
squaring each error (to eliminate negative signs), then summing all squared errors.
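
As a sketch of this calculation, the code below lists each friend's error under both kinds of prediction and then sums the squared errors. It assumes the least squares equation Y′ = .80X + 6.40 quoted earlier and the greeting card data, with Andrea's received count of 10 inferred from the stated mean of 12.

```python
# (cards sent X, cards received Y); Andrea's 10 is inferred from the mean of 12.
cards = {
    "Andrea": (5, 10),
    "Mike": (7, 12),
    "Doris": (13, 14),
    "Steve": (9, 18),
    "John": (1, 6),
}
mean_y = sum(y for _, y in cards.values()) / len(cards)

# Errors when the mean is always predicted, and when Y' = .80X + 6.40 is used.
errors_mean = {name: y - mean_y for name, (x, y) in cards.items()}
errors_ls = {name: y - (0.80 * x + 6.40) for name, (x, y) in cards.items()}

for name in cards:
    print(name, round(errors_mean[name], 2), round(errors_ls[name], 2))

ss_y = sum(e ** 2 for e in errors_mean.values())
ss_y_given_x = sum(e ** 2 for e in errors_ls.values())
print(round(ss_y, 1), round(ss_y_given_x, 1))
```

John's two errors, −6 and −1.20, match those described for Figure 7.5, and the two sums of squares come out as 80 and 28.8.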
Proportion of Predicted Variability

SSy measures the total variability of Y scores that occurs after only primitive
predictions based on Ȳ are made, while SSy|x measures the residual variability of Y
scores that remains after customized least squares predictions are made.

The error variability of 28.8 for the least squares predictions is much smaller than the
error variability of 80 for the repetitive prediction of Ȳ, confirming the greater accuracy
of the least squares predictions apparent in Figure 7.5.
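To obtain an SS measure of the actual gain in accuracy due to the least squares
predictions, subtract the residual variability from the total variability, that is, subtract
SSy|x from SSy, to obtain

SSy − SSy|x = 80 − 28.8 = 51.2

To express this difference as a gain in accuracy relative to the original error variability,
divide it by SSy:

(SSy − SSy|x) / SSy = 51.2 / 80 = .64
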
This result, .64 or 64 percent, represents the proportion or percent gain in predictive
accuracy when the repetitive prediction of Ȳ is replaced by a series of customized Y′
predictions based on the least squares equation.
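
The link between this proportion and the correlation coefficient can be verified directly. The sketch below uses the two sums of squares quoted above together with r = .80, the correlation that yields these values for the greeting card data.

```python
ss_y = 80.0           # total variability (repetitive prediction of the mean)
ss_y_given_x = 28.8   # residual variability (least squares predictions)
r = 0.80              # correlation between cards sent and cards received

gain = ss_y - ss_y_given_x    # variability removed by the least squares predictions
proportion = gain / ss_y      # gain relative to the total variability
print(round(gain, 1), round(proportion, 2), round(r ** 2, 2))
```

The proportion of predicted variability, .64, equals the squared correlation coefficient, which is why it is labeled r².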
r² Does Not Apply to Individual Scores
• The total variability of all Y scores, as measured by SSy, can be reduced by 64
percent when each Y score is replaced by its corresponding predicted Y′ score and then
expressed as a squared deviation from the mean of all observed scores.
• Thus, the 64 percent represents a reduction in the total variability for the five Y scores
when they are replaced by a succession of predicted scores, given the least squares
equation and various values of X.

Small Values of r²
• When transposed from r to r², Cohen’s guidelines state that a value of r² in the
vicinity of .01, .09, or .25 reflects a weak, moderate, or strong relationship,
respectively.

r² Doesn’t Ensure Cause-Effect


• If the correlation between mental health scores of sixth graders and their weaning ages
as infants equals .20, we cannot claim, therefore, that (.20)(.20) = .04 or 4 percent of
the total variability in mental health scores is caused by the differences in weaning
ages.
• Although we have referred to r² as indicating the proportion or percent of predictable
variability, you also might encounter references to r² as indicating the proportion or
percent of explained variability.
• In this context, “explained” signifies only predictability, not causality.
