UNIT-3
I. Correlation:
AN INTUITIVE APPROACH
Positive Relationship
• Trends among pairs of scores can be detected most easily by constructing a list of
paired scores in which the scores along one variable are arranged from largest to
smallest.
• In panel A of Table 6.2, the five pairs of scores are arranged from the largest (13) to
the smallest (1) number of cards sent.
• This table reveals a pronounced tendency for pairs of scores to occupy similar relative
positions in their respective distributions.
• For example, John sent relatively few cards (1) and received relatively few cards (6),
whereas Doris sent relatively many cards (13) and received relatively many cards (14).
• We can therefore conclude that the two variables are related.
• Insofar as relatively low values are paired with relatively low values, and relatively
high values are paired with relatively high values, the relationship is positive.
Negative Relationship
• Although John sent relatively few cards (1), he received relatively many (18).
• From this pattern, we can conclude that the two variables are related.
• This relationship implies that “You get the opposite of what you give.”
• Insofar as relatively low values are paired with relatively high values, and relatively
high values are paired with relatively low values, the relationship is negative.
Little or No Relationship
• No regularity is apparent among the pairs of scores in panel C.
• For instance, although both Andrea and John sent relatively few cards (5 and 1,
respectively), Andrea received relatively few cards (6) and John received relatively
many cards (14).
• We can conclude that little, if any, relationship exists between the two variables.
• Two variables are positively related if pairs of scores tend to occupy similar relative
positions (high with high and low with low) in their respective distributions.
• They are negatively related if pairs of scores tend to occupy dissimilar relative
positions (high with low and vice versa) in their respective distributions.
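The intuitive check described above can be sketched in Python. This is a minimal illustration, not part of the original notes: the variable names are hypothetical, and Andrea's panel A pair (5 sent, 10 received) is an assumption, because the notes state only the other four friends' values.

```python
# Pairs of (cards sent, cards received) for the five friends.
# Andrea's pair is an assumption; the other four appear in the notes.
cards = {
    "Doris": (13, 14),
    "Steve": (9, 18),
    "Mike": (7, 12),
    "Andrea": (5, 10),  # assumed values
    "John": (1, 6),
}

# Arrange the pairs from the largest to the smallest number of cards sent.
ordered = sorted(cards.items(), key=lambda item: item[1][0], reverse=True)
for name, (sent, received) in ordered:
    print(f"{name:<7} sent {sent:>2}  received {received:>2}")

# Locate each score relative to the mean of its own distribution.
mean_sent = sum(sent for sent, _ in cards.values()) / len(cards)
mean_received = sum(received for _, received in cards.values()) / len(cards)

# Count pairs on the same side of both means (high with high, low with low)
# and pairs on opposite sides (high with low). A score that sits exactly on
# a mean counts toward neither group.
same_side = sum(
    1 for sent, received in cards.values()
    if (sent - mean_sent) * (received - mean_received) > 0
)
opposite_side = sum(
    1 for sent, received in cards.values()
    if (sent - mean_sent) * (received - mean_received) < 0
)
print("same side:", same_side, "opposite side:", opposite_side)
```

When most pairs fall on the same side of both means, the pattern suggests a positive relationship; when most fall on opposite sides, it suggests a negative one.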
II. SCATTERPLOTS
• A scatterplot is a graph containing a cluster of dots that represents all pairs of scores.
Construction
• To construct a scatterplot, as in Figure 6.1, scale each of the two variables along the
horizontal (X) and vertical (Y) axes, and use each pair of scores to locate a dot within
the scatterplot.
• For example, the pair of numbers for Mike, 7 and 12, defines points along the X and Y
axes, respectively.
• Using these points to anchor lines perpendicular to each axis, locate Mike’s dot where
the two lines intersect.
Positive, Negative, or Little or No Relationship?
• A dot cluster that has a slope from the lower left to the upper right, as in panel A of
Figure 6.2, reflects a positive relationship.
• Small values of one variable are paired with small values of the other variable, and
large values are paired with large values.
• In panel A, short people tend to be light, and tall people tend to be heavy.
• On the other hand, a dot cluster that has a slope from the upper left to the lower
right, as in panel B of Figure 6.2, reflects a negative relationship.
• Small values of one variable tend to be paired with large values of the other
variable, and vice versa.
• A dot cluster that lacks any apparent slope, as in panel C of Figure 6.2, reflects little or
no relationship.
• Small values of one variable are just as likely to be paired with small, medium, or
large values of the other variable.
Curvilinear Relationship
• When a dot cluster approximates a straight line, it reflects a linear
relationship.
• Sometimes a dot cluster approximates a bent or curved line, as in Figure 6.4, and
therefore reflects a curvilinear relationship.
• For example, physical strength, as measured by the force of a person’s handgrip, is less
for children, more for adults, and then less again for older people.
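A short sketch can show why curvilinear patterns need separate attention: the correlation coefficient r described in the next section measures only linear trends. The ages and handgrip forces below are made-up, perfectly symmetric values, and `pearson_r` is a hypothetical helper built on the standard sum-of-products formula, which these notes do not state.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from sums of squares and the sum of products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sp_xy / math.sqrt(ss_x * ss_y)

# Hypothetical data: grip force rises with age, peaks, then falls again.
ages = [10, 20, 30, 40, 50, 60, 70]
grip = [20, 35, 45, 50, 45, 35, 20]

r_curved = pearson_r(ages, grip)
print(f"r for the curved pattern: {abs(r_curved):.2f}")
```

Although grip force clearly depends on age in this made-up example, the rising and falling halves cancel, so r comes out at essentially zero.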
III. A CORRELATION COEFFICIENT FOR QUANTITATIVE DATA: r
• A correlation coefficient is a number between –1 and 1 that describes the
relationship between pairs of variables.
• The type of correlation coefficient designated as r describes the linear
relationship between pairs of variables for quantitative data.
Key Properties of r
• Named in honor of the British scientist Karl Pearson, the Pearson correlation
coefficient, r, can equal any value between –1.00 and +1.00.
• Furthermore, the following two properties apply:
• The sign of r indicates the type of linear relationship, whether positive or negative.
• The numerical value of r, without regard to sign, indicates the strength of the
linear relationship.
Sign of r
• A number with a plus sign (or no sign) indicates a positive relationship, and a
number with a minus sign indicates a negative relationship.
Numerical Value of r
• The more closely a value of r approaches either –1.00 or +1.00, the stronger the
relationship.
• The more closely the value of r approaches 0, the weaker the relationship.
• r = –.90 indicates a stronger relationship than does an r of –.70, and
• r = –.70 indicates a stronger relationship than does an r of .50.
Interpretation of r
• Located along a scale from –1.00 to +1.00, the value of r supplies information
about the direction of a linear relationship (positive or negative) and, generally,
information about its relative strength: relatively weak because r is in the vicinity
of 0, or relatively strong because r deviates from 0 in the direction of either +1.00
or –1.00.
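As a rough illustration of how the sign and size of r are read, the sketch below computes r for the greeting card data. It uses the standard sum-of-products formula, r = SPxy / sqrt(SSx × SSy), which these notes do not show. The function name is hypothetical, and Andrea's pair (5 sent, 10 received) is an assumption because the notes do not state it.

```python
import math

# X = cards sent, Y = cards received, in the order
# Doris, Steve, Mike, Andrea, John. Andrea's values are assumed.
sent = [13, 9, 7, 5, 1]
received = [14, 18, 12, 10, 6]

def pearson_r(xs, ys):
    """Pearson r from sums of squares and the sum of products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sp_xy / math.sqrt(ss_x * ss_y)

r = pearson_r(sent, received)

# The sign gives the direction; the distance from 0 gives the strength.
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.2f} ({direction})")
```

Under these assumptions r works out to .80, a positive value well away from 0, which agrees with the strong positive relationship the notes report for cards sent and cards received.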
Range Restrictions
• The value of the correlation coefficient declines whenever the range of possible X or Y
scores is restricted.
• For example, Figure 6.5 shows a dot cluster with an obvious slope, represented by an r
of .70 for the positive relationship between height and weight for all college students.
• If the range of heights along Y is restricted to students who stand over 6 feet 2 inches
(or 74 inches) tall, the abbreviated dot cluster loses its slope because of the more
homogeneous weights among tall students.
• Therefore, as depicted in Figure 6.5, the value of r drops to .10.
• Sometimes it’s impossible to avoid a range restriction.
• For example, some colleges only admit students with SAT test scores above some
minimum value.
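The effect of a range restriction can be sketched with simulated data. Everything below is made up for illustration: the heights, the weights, the noise level, and the fixed random seed are assumptions chosen so that r for the full group lands near .70, and `pearson_r` is a hypothetical helper.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson r from sums of squares and the sum of products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sp_xy / math.sqrt(ss_x * ss_y)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated heights (inches) and weights (pounds); all numbers are invented.
heights = [rng.gauss(68, 4) for _ in range(5000)]
weights = [150 + 5 * (h - 68) + rng.gauss(0, 20) for h in heights]

r_full = pearson_r(heights, weights)

# Restrict the range: keep only students taller than 74 inches.
tall = [(h, w) for h, w in zip(heights, weights) if h > 74]
r_tall = pearson_r([h for h, _ in tall], [w for _, w in tall])

print(f"all students: r = {r_full:.2f}")
print(f"over 74 in.:  r = {r_tall:.2f} (n = {len(tall)})")
```

The relationship between height and weight is generated the same way for every simulated student, yet r is noticeably smaller in the restricted group because the heights there vary much less.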
Caution
• We have to be careful when interpreting the actual numerical value of r.
• An r of .70 for height and weight doesn’t signify that the strength of this relationship
equals either .70 or 70 percent of the strength of a perfect relationship.
• The value of r can’t be interpreted as a proportion or percentage of some perfect
relationship.
Verbal Descriptions
• When interpreting a new r, you’ll find it helpful to translate the numerical value of r
into a verbal description of the relationship.
• An r of .70 for the height and weight of college students could be translated into
“Taller students tend to weigh more”;
• An r of –.42 for time spent taking an exam and the subsequent exam score could be
translated into “Students who take less time tend to make higher scores”; and
• An r in the neighborhood of 0 for shoe size and IQ could be translated into “Little, if
any, relationship exists between shoe size and IQ.”
• A correlation analysis of the exchange of greeting cards by five friends for the
most recent holiday season suggests a strong positive relationship between
cards sent and cards received.
• When informed of these results, another friend, Emma, who enjoys receiving
greeting cards, asks you to predict how many cards she will receive during the
next holiday season, assuming that she plans to send 11 cards.
• All five dots contribute to the more precise prediction, illustrated in Figure 7.2,
that Emma will receive 15.20 cards.
• The solid line in Figure 7.2, designated as the regression line, guides the
string of arrows, beginning at 11, toward the predicted value of 15.20.
• If all five dots had defined a single straight line, placement of the regression line
would have been simple; merely let it pass through all dots.
Predictive Errors
• Figure 7.3 illustrates the predictive errors that would have occurred if the regression
line had been used to predict the number of cards received by the five friends.
• Solid dots reflect the actual number of cards received, and open dots, always located
along the regression line, reflect the predicted number of cards received.
• The largest predictive error, shown as a broken vertical line, occurs for Steve, who sent
9 cards.
• Although he actually received 18 cards, he should have received slightly fewer than 14
cards, according to the regression line.
• The smallest predictive error, none at all, occurs for Mike, who sent 7 cards.
• He actually received the 12 cards that he should have received, according to the
regression line.
• The smaller the total for all predictive errors in Figure 7.3, the more favorable will be
the prognosis for our predictions.
• The regression line should be placed in a position that minimizes the total predictive
error, that is, that minimizes the total of the vertical discrepancies between the solid
and open dots shown in Figure 7.3.
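The predictive errors can be tabulated with a short sketch. It assumes the regression line Y′ = .80X + 6.40, using the values of b and a quoted later in these notes, and it assumes Andrea's pair (5 sent, 10 received), which the notes do not state. The names `friends` and `predict` are hypothetical.

```python
# (cards sent, cards received); Andrea's pair is an assumed value.
friends = {
    "Doris": (13, 14),
    "Steve": (9, 18),
    "Mike": (7, 12),
    "Andrea": (5, 10),
    "John": (1, 6),
}

def predict(x, b=0.80, a=6.40):
    """Predicted cards received, Y' = bX + a (open dots on the line)."""
    return b * x + a

# Predictive error = actual score (solid dot) minus predicted score (open dot).
errors = {name: y - predict(x) for name, (x, y) in friends.items()}
for name, err in errors.items():
    print(f"{name:<7} error = {err:+.2f}")

largest = max(errors, key=lambda name: abs(errors[name]))
smallest = min(errors, key=lambda name: abs(errors[name]))
total_error = sum(errors.values())
print("largest:", largest, "| smallest:", smallest)
print(f"total of signed errors: {abs(total_error):.2f}")
```

With these values the positive and negative errors cancel, so their plain total is essentially zero and cannot by itself distinguish a good line from a poor one.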
VII. LEAST SQUARES REGRESSION LINE
• To avoid the arithmetic standoff of zero always produced by adding positive and
negative predictive errors, the placement of the regression line minimizes the
total squared predictive error.
• When located like this, the regression line is often referred to as the least
squares regression line.
Key Property
• Once numbers have been assigned to b (the slope) and a (the Y intercept) in the
regression equation Y′ = bX + a, the least squares regression equation emerges as a
working equation with a most desirable property:
• It automatically minimizes the total of all squared predictive errors for known Y
scores in the original correlation analysis.
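One informal way to see this property is to compare the total squared error of the least squares line with that of slightly different lines. The sketch below assumes b = .80 and a = 6.40, the values used in these notes, and Andrea's pair (5, 10), which is not stated in the notes. The nudge sizes are arbitrary choices, and the check is a spot test rather than a proof.

```python
def total_squared_error(b, a, pairs):
    """Sum of squared vertical distances between each Y and b*X + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in pairs)

# (cards sent, cards received); Andrea's (5, 10) is an assumed pair.
pairs = [(13, 14), (9, 18), (7, 12), (5, 10), (1, 6)]

best = total_squared_error(0.80, 6.40, pairs)

# Nudge the slope and the intercept a little in every combination.
nudged = [
    total_squared_error(0.80 + db, 6.40 + da, pairs)
    for db in (-0.05, 0.0, 0.05)
    for da in (-0.5, 0.0, 0.5)
    if (db, da) != (0.0, 0.0)
]

print(f"least squares line: {best:.2f}")
print(f"best nudged line:   {min(nudged):.2f}")
```

Every nudged line produces a larger total squared error than the least squares line, as the key property leads us to expect.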
Solving for Y′
• In its present form, the regression equation can be used to predict the number of
cards that Emma will receive, assuming that she plans to send 11 cards.
• Simply substitute 11 for X and solve for the value of Y′ as follows:
Y′ = .80(11) + 6.40 = 8.80 + 6.40 = 15.20
• Even when no cards are sent (X = 0), we predict a return of 6.40 cards because of the
value of a.
• Also, notice that sending each additional card translates into an increment of
only .80 in the predicted return because of the value of b.
• Whenever b has a value less than 1.00, increments in the predicted return will
lag behind increments in cards sent: each additional card sent adds only b, that
is, .80 in the present case, to the predicted return.
• If the value of b had been greater than 1.00, then increments in the predicted
return would have exceeded increments in cards sent.
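The values of b and a, and the prediction for Emma, can be reproduced with a minimal sketch. It uses the standard least squares formulas b = SPxy / SSx and a = Ȳ − bX̄, which these notes do not show, and it assumes Andrea's pair (5 sent, 10 received), which the notes do not state. The function name `predict` is hypothetical.

```python
# X = cards sent, Y = cards received, in the order
# Doris, Steve, Mike, Andrea, John. Andrea's values are assumed.
sent = [13, 9, 7, 5, 1]
received = [14, 18, 12, 10, 6]

n = len(sent)
mean_x = sum(sent) / n
mean_y = sum(received) / n
ss_x = sum((x - mean_x) ** 2 for x in sent)
sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sent, received))

b = sp_xy / ss_x          # slope: change in Y' per additional card sent
a = mean_y - b * mean_x   # intercept: Y' when no cards are sent (X = 0)

def predict(x):
    """Predicted cards received for x cards sent."""
    return b * x + a

emma = predict(11)
print(f"Y' = {b:.2f}X + {a:.2f}; prediction for X = 11: {emma:.2f}")
```

Under these assumptions the sketch recovers b = .80 and a = 6.40, and substituting 11 for X gives the predicted 15.20 cards.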
A Limitation
• Emma might survey these predicted card returns before committing herself to a
particular card investment. However, there is no evidence of a simple cause-effect
relationship between cards sent and cards received.
ASSUMPTIONS
Linearity
• Use of the regression equation requires that the underlying relationship be linear.
Homoscedasticity
• Use of the standard error of estimate, sy|x, assumes that except for chance, the dots
in the original scatterplot will be dispersed equally about all segments of the
regression line.
• This assumption becomes a concern when the scatterplot reveals a dramatically
different type of dot cluster, such as that shown in Figure 7.4.
• The standard error of estimate for the data in Figure 7.4 should be used cautiously,
since its value overestimates the variability of dots about the lower half of the
regression line and underestimates the variability of dots about the upper half of the
regression line.
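The caution about unequal scatter can be illustrated with simulated data. The sketch below is only an illustration: the data are made up so that the scatter widens as X grows, the seed is arbitrary, and the standard error of estimate is computed with the usual definition sqrt(SSy|x / (n − 2)), which these notes do not state.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Made-up data whose scatter about the line widens as X grows.
xs = [rng.uniform(0, 10) for _ in range(2000)]
ys = [2 * x + rng.gauss(0, 0.5 + 0.5 * x) for x in xs]

# Fit the least squares line.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
ss_x = sum((x - mean_x) ** 2 for x in xs)
sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
b = sp_xy / ss_x
a = mean_y - b * mean_x

# Predictive error for every dot, kept together with its X value.
residuals = [(x, y - (b * x + a)) for x, y in zip(xs, ys)]

# One standard error of estimate for the whole scatterplot.
s_yx = math.sqrt(sum(e ** 2 for _, e in residuals) / (n - 2))

def rms(values):
    """Root mean square, used here as a simple measure of spread."""
    return math.sqrt(sum(v ** 2 for v in values) / len(values))

lower_spread = rms([e for x, e in residuals if x < 5])
upper_spread = rms([e for x, e in residuals if x >= 5])

print(f"standard error of estimate: {s_yx:.2f}")
print(f"spread about lower half:    {lower_spread:.2f}")
print(f"spread about upper half:    {upper_spread:.2f}")
```

The single standard error of estimate falls between the two spreads, so it overstates the variability about the lower half of the line and understates it about the upper half.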
INTERPRETATION OF r²
• Pretend that we know the Y scores (cards received), but not the corresponding X
scores (cards sent), for each of the five friends.
• Lacking information about the relationship between X and Y scores, we could not
construct a regression equation and use it to generate a customized prediction, Y′, for
each friend.
• We mount a primitive predictive effort by always predicting the mean, Ȳ, for each of
the five friends’ Y scores.
• The repetitive prediction of Ȳ for each of the Y scores of all five friends will supply us
with a frame of reference against which to evaluate our customary predictive effort
based on the correlation between cards sent (X) and cards received (Y).
Predictive Errors
Panel A of Figure 7.5 shows the predictive errors for all five friends when the mean for
all five friends, Ȳ, of 12 (shown as the mean line) is always used to predict each of their
five Y scores.
Panel B shows the corresponding predictive errors for all five friends when a series of
different Y′ values, obtained from the least squares equation (shown as the least
squares line), is used to predict each of their five Y scores.
Panel A of Figure 7.5 shows the error for John when the mean for all five friends, Ȳ, of 12
is used to predict his Y score of 6.
Shown as a broken vertical line, the error of −6 for John (from Y − Ȳ = 6 − 12 = −6)
indicates that Ȳ overestimates John’s Y score by 6 cards. Panel B shows a smaller error
of −1.20 for John when a Y′ value of 7.20 is used to predict the same Y score of 6.
This Y′ value of 7.20 is obtained from the least squares equation, where the number of
cards sent by John, 1, has been substituted for X.
SSy measures the total variability of Y scores that occurs after only primitive
predictions based on Ȳ are made, while SSy|x measures the residual variability of Y
scores that remains after customized least squares predictions are made.
The error variability of 28.8 for the least squares predictions is much smaller than the
error variability of 80 for the repetitive prediction of Ȳ, confirming the greater accuracy
of the least squares predictions apparent in Figure 7.5.
To obtain an SS measure of the actual gain in accuracy due to the least squares
predictions, subtract the residual variability from the total variability, that is, subtract
SSy|x from SSy, to obtain
SSy − SSy|x = 80 − 28.8 = 51.2
Dividing this gain by the total variability, SSy, expresses it as a proportion:
r² = (SSy − SSy|x) / SSy = 51.2 / 80 = .64
This result, .64 or 64 percent, represents the proportion or percent gain in predictive
accuracy when the repetitive prediction of Ȳ is replaced by a series of customized Y′
predictions based on the least squares equation.
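The two error variabilities and the resulting proportion can be checked with a brief sketch. It assumes the least squares equation Y′ = .80X + 6.40 from these notes and Andrea's pair (5 sent, 10 received), which the notes do not state. The variable names are hypothetical.

```python
# X = cards sent, Y = cards received, in the order
# Doris, Steve, Mike, Andrea, John. Andrea's values are assumed.
sent = [13, 9, 7, 5, 1]
received = [14, 18, 12, 10, 6]

mean_y = sum(received) / len(received)

# Customized predictions from the least squares equation in the notes.
predicted = [0.80 * x + 6.40 for x in sent]

# Total variability: squared errors when the mean is always predicted.
ss_y = sum((y - mean_y) ** 2 for y in received)

# Residual variability: squared errors of the least squares predictions.
ss_y_given_x = sum((y - yp) ** 2 for y, yp in zip(received, predicted))

# Proportion of the total variability removed by the least squares predictions.
r_squared = (ss_y - ss_y_given_x) / ss_y

print(f"SSy = {ss_y:.1f}, SSy|x = {ss_y_given_x:.1f}, r squared = {r_squared:.2f}")
```

Under these assumptions the sketch reproduces the 80 and 28.8 quoted above, and the resulting proportion of .64 matches the square of r = .80 for the same data.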
r² Does Not Apply to Individual Scores
• The total variability of all Y scores, as measured by SSy, can be reduced by 64
percent when each Y score is replaced by its corresponding predicted Y′ score and then
expressed as a squared deviation from the mean of all observed scores.
• Thus, the 64 percent represents a reduction in the total variability for the five Y scores
when they are replaced by a succession of predicted scores, given the least squares
equation and various values of X.
Small Values of r²
• When transposed from r to r², Cohen’s guidelines state that a value of r² in the
vicinity of .01, .09, or .25 reflects a weak, moderate, or strong relationship,
respectively.