6 - Reliability-1
Why is it important?
◦ We usually get one shot to measure the variables we want to, so it’s
important to make that shot count
Error
There are a lot of sources of unsystematic error
◦ People: employees may perform differently in some situations than others;
performance fluctuates
◦ Tests: the wording of items and the item content may influence
measurements
◦ Conditions: room temperature, noises outside, etc. may affect attention or
focus
What are “systematic” errors?
◦ These are not true errors per se; they are meaningful changes in a variable
that arise from experience, training, or other events
Reliability vs accuracy
[Figure: panels 1–4 illustrating different combinations of reliability and accuracy]
The coefficient of determination
r and r²
◦ You will often see correlations reported as r_xy; this is a correlation between two different variables (x and y)
◦ The correlation between a test and itself (i.e., its reliability) is often expressed as r_xx
◦ Theoretically, this is the average correlation between infinite administrations of the same test
◦ The square of this correlation, r², can be thought of as the percentage of variance (fluctuations in scores) explained by differences in the trait being measured vs error
◦ Example: if r² = .6, 60% of the differences in people's scores are due to their differences on the trait we're measuring (systematic); 40% of the differences we observe are due to error from various sources (unsystematic)
Classical Test Theory
The whole idea of reliability relies on the assumption that X = T + e
◦ X is the observed score of a person (the actual measurement value)
◦ T is their “true” score (an individual’s real standing on a trait)
◦ e is error (some crap that we don’t want)
In all types of reliability, rxx > .80 is good, rxx > .70 is acceptable
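A small simulation can make the X = T + e idea concrete. This is only a sketch with made-up numbers (not from the slides): true scores and unsystematic error are generated separately, and the reliability works out to the share of observed-score variance that comes from true scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True scores T: each person's real standing on the trait
T = rng.normal(loc=50, scale=8, size=n)
# Unsystematic error e: fluctuations we don't want
e = rng.normal(loc=0, scale=6, size=n)
# Observed scores X = T + e
X = T + e

# Reliability = proportion of observed-score variance due to true scores
reliability = T.var() / X.var()
print(round(reliability, 2))  # roughly 8^2 / (8^2 + 6^2) = 0.64
```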
Interrater reliability
In cases of subjective scoring (e.g., performance appraisal) the raters
or examiners may be an additional source of error
◦ Interrater agreement: the exact agreement between raters
◦ Interclass correlation: relative agreement; used when raters are rating
multiple objects or individuals
◦ Intraclass correlation: estimates how much of the differences between raters
are due to rater errors vs true differences in the ratees
These aren’t exactly reliability
◦ Correlations between raters are often pretty low (.1 to .3)
◦ This doesn’t mean error accounts for 90 to 99% of ratings, only that error
accounts for no more than that percentage of the ratings (still not reassuring)
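As a rough illustration (hypothetical ratings, not data from the slides), the simplest interrater estimates are just the exact agreement rate and the correlation between two raters' scores for the same set of ratees:

```python
import numpy as np

# Hypothetical performance ratings of the same 8 employees by two raters
rater_a = np.array([3, 5, 4, 2, 5, 3, 4, 1])
rater_b = np.array([4, 5, 3, 2, 4, 2, 5, 2])

# Interrater agreement: proportion of identical ratings
agreement = np.mean(rater_a == rater_b)
# Interrater correlation: relative agreement between the two raters
r = np.corrcoef(rater_a, rater_b)[0, 1]
print(round(agreement, 2), round(r, 2))
```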
Interpreting reliability
There are many reasons why a test might have a low reliability estimate
◦ Range restriction: correlations are underestimated when the range of scores is narrow; the wider the range of scores you have, the better your estimates of correlations will be
◦ Assessing multiple dimensions in the same test often leads to unreliability if dimensions are
unrelated
◦ Reliability estimates are influenced by the # of items
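To sketch how the number of items matters, the Spearman-Brown prophecy formula (a standard result; the numbers below are made up) predicts reliability when a test is lengthened by a factor k:

```python
def spearman_brown(r_xx: float, k: float) -> float:
    """Predicted reliability when a test is made k times longer."""
    return (k * r_xx) / (1 + (k - 1) * r_xx)

# A 10-item test with r_xx = .60, doubled to 20 items
print(round(spearman_brown(0.60, 2), 2))  # about 0.75
```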
Interpreting reliability
Standard error of measurement
◦ From a reliability coefficient, we can determine the amount of error expected in
each individual’s score
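A sketch of that calculation (example numbers are made up): the standard error of measurement is SEM = SD × sqrt(1 − r_xx), and roughly 95% of a person's observed scores should fall within about ±2 SEM of their true score.

```python
import math

def sem(sd: float, r_xx: float) -> float:
    """Standard error of measurement from the test SD and its reliability."""
    return sd * math.sqrt(1 - r_xx)

sd, r_xx, observed = 10.0, 0.84, 72.0
s = sem(sd, r_xx)                       # 10 * sqrt(0.16) = 4.0
band = (observed - 2 * s, observed + 2 * s)
print(round(s, 1), band)                # 4.0 (64.0, 80.0)
```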
Criterion-related validity
Predictive
◦ Predictors are collected first; criteria are collected at a later time
Concurrent
◦ Criteria data are collected at the same time as predictors
◦ This is usually done by administering a new set of predictors to job
incumbents (i.e., current employees)
Factors affecting criterion-related validity (CRV)
Range enhancement
◦ Including some people a selection battery was NOT designed for in your
analysis can artificially inflate validity
Range restriction
◦ Direct: criterion data is unavailable for applicants who have been screened
out by predictors
◦ Indirect: applicants are screened out by the previous predictors, which are
related to the new predictors
Various statistical methods exist to correct for range restriction
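One common correction is Thorndike's Case 2 formula for direct range restriction; this is a sketch with made-up numbers, adjusting the restricted correlation by the ratio of the unrestricted to restricted predictor standard deviations:

```python
import math

def correct_direct_restriction(r: float, sd_unrestricted: float, sd_restricted: float) -> float:
    """Thorndike Case 2 correction for direct range restriction on the predictor."""
    u = sd_unrestricted / sd_restricted
    return (r * u) / math.sqrt(1 - r**2 + (r**2) * (u**2))

# Validity of .25 observed among hires, whose predictor SD is half that of the applicant pool
print(round(correct_direct_restriction(0.25, sd_unrestricted=10, sd_restricted=5), 2))
```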
Types of validity evidence
Construct-related validity
◦ What is the meaning of the construct; how should measures of the construct
relate to measures of other constructs?
◦ What would be the characteristics of individuals who are high or low on that
construct?
Examining the relationship of the construct to these other characteristics
is how we provide evidence of construct validity
◦ Convergent validity: tests should be related to other tests of the same or
similar constructs
◦ Discriminant validity: tests should be unrelated to tests of dissimilar
constructs
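As a hedged sketch (hypothetical scales and simulated data, not from the slides), convergent and discriminant evidence boils down to comparing correlations: a new extraversion scale should correlate strongly with an established extraversion measure and weakly with a conceptually unrelated measure such as spatial ability.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical scores: a new extraversion scale, an established extraversion
# scale (similar construct), and a spatial-ability test (dissimilar construct)
new_extraversion = rng.normal(size=n)
old_extraversion = 0.8 * new_extraversion + 0.6 * rng.normal(size=n)
spatial_ability = rng.normal(size=n)

convergent = np.corrcoef(new_extraversion, old_extraversion)[0, 1]   # should be high
discriminant = np.corrcoef(new_extraversion, spatial_ability)[0, 1]  # should be near zero
print(round(convergent, 2), round(discriminant, 2))
```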
Validity
Review from last class
Types of validity evidence
◦ Content
◦ Example: did Exam 1 adequately cover all the material from the first three
weeks of class?
◦ Assessment: panel of subject matter experts (SMEs)
◦ Construct
◦ Example: exactly which construct(s) did my exam measure?
◦ Assessment: examine relationships between constructs
Review from last class
Types of validity evidence
◦ Criterion-related
◦ Example: does being extraverted relate to increased sales performance?
◦ Assessment
◦ Predictive: collect predictor data from applicants, collect criterion data
later on from employees
◦ Concurrent: collect predictor and criterion data from job incumbents
Cross-validation
Cross-validation: the extent to which validity in one instance applies
equally well in other instances
◦ This is also called generalization (how much a specific property applies to
general situations)
Construct and content validity generalize somewhat readily
◦ A possible exception might be transfers across cultures in which conceptions
of a construct may change
On the other hand, criterion-related validity can change greatly from
the sample with which it was established to other samples
Predictors and Regression
We know that our predictors have varying degrees of relationships with our
criterion
◦ Multiple regression allows us to determine the relative importance of each of the predictors
in predicting a criterion; this “relative importance” is quantified in regression weights
◦ Example: consider how your various classes factor into your GPA
◦ Classes with more credits factor into your GPA more than classes with fewer credits
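A minimal sketch of how multiple regression assigns relative importance to predictors (fabricated data, not the slides' GPA example): the fitted weights b1 and b2 show how much each predictor contributes to the predicted criterion.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Two hypothetical predictors (e.g., a cognitive test and a structured interview)
cognitive = rng.normal(size=n)
interview = rng.normal(size=n)
# Simulated criterion: cognitive ability matters more than the interview here
performance = 0.6 * cognitive + 0.3 * interview + rng.normal(scale=0.7, size=n)

# Ordinary least squares: solve for the regression weights
X = np.column_stack([np.ones(n), cognitive, interview])
b0, b1, b2 = np.linalg.lstsq(X, performance, rcond=None)[0]
print(round(b1, 2), round(b2, 2))  # b1 should come out larger than b2
```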
[Figure: scatterplots illustrating positive validity and zero validity]
Differential validity
A valid predictor that results in
adverse impact (AI)
◦ Predicts equally for both groups
◦ This can be legal, but you will need
evidence that:
◦ The criterion is relevant and important and
not biased
◦ There aren’t any equivalent criteria that
aren’t biased
◦ A third factor didn’t cause the group
differences
Differential validity
A predictor that is valid for the
entire group, but not each group
separately (AI)
◦ In this case, the predictor is
basically a crude grouping variable
◦ This would be an obvious attempt
to use selection measures to
discriminate
Differential validity
Equal validity for both groups and
valid overall, but unequal predictor
means (AI)
◦ Both groups are equally likely to
succeed on the job, but the minority
group members are much less likely to
be hired due to lower predictor scores
◦ A solution to this problem is to use
separate cut scores for the groups
◦ In most cases, this “solution” is illegal
unless these corrections have been
approved as part of affirmative action
Differential validity
Equal validity for both groups and
valid overall, but unequal criterion
means (no AI)
◦ Both groups are equally likely to be
hired, but nonminority group
members are much more likely to
succeed on the job
◦ This sort of difference in validity could
reinforce negative stereotypes
Differential validity
Equal predictor means, but only
valid for the nonminority group (no
AI)
◦ Members of groups are selected at
equal rates, but nonminorities are
more likely to succeed
◦ Again, this sort of difference in validity
could reinforce negative stereotypes
Differential validity
Differential validity
◦ The validity (i.e., correlation between predictor and criterion) for one or both
subgroups is significant
◦ There is a significant difference in validity between the subgroups
Single-group validity
◦ The validity is significant for one group but not the other
◦ There is no significant difference in validity between the subgroups
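A sketch of how you might test whether two subgroup validities differ significantly, which is the key question separating differential validity from single-group validity. This uses the standard Fisher r-to-z test; the subgroup correlations and sample sizes are made up.

```python
import math

def fisher_z_diff(r1: float, n1: int, r2: float, n2: int) -> float:
    """z statistic for the difference between two independent correlations."""
    z1 = math.atanh(r1)  # Fisher r-to-z transform
    z2 = math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

# Hypothetical: validity of .40 for one subgroup (n = 200), .15 for the other (n = 120)
z = fisher_z_diff(0.40, 200, 0.15, 120)
print(round(z, 2))  # |z| > 1.96 would suggest differential validity
```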
[Scatterplot: Criterion Ratings plotted against Predictor Score]
Differential prediction
◦ If we add a grouping variable (e.g., ethnicity or sex), we can determine if there are differences
between the subgroups in their average scores on the criterion
◦ Example: ŷ = b1(IQ test) + b2(group)
[Plot: Criterion Ratings vs IQ Test Score, with separate regression lines for Women, Men, and Overall showing different intercepts]
Differential prediction
◦ If we add the interaction between a group and a predictor, we can determine if the
relationships between the predictors differ as a function of the subgroups (moderation)
◦ Example: ŷ = b1(IQ test) + b2(group) + b3(IQ test × group)
[Plot: Criterion Ratings vs IQ Test Score, with Women, Men, and Overall lines showing different slopes]
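A hedged sketch of how both differential-prediction checks could be run (hypothetical column names and simulated data; assumes pandas and statsmodels are available): first add the group term to test for intercept differences, then add the interaction to test for slope differences (moderation).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300

# Hypothetical data: IQ test scores, a 0/1 group code, and criterion ratings
df = pd.DataFrame({
    "iq": rng.normal(100, 15, size=n),
    "group": rng.integers(0, 2, size=n),
})
df["criterion"] = 0.05 * df["iq"] + 1.0 * df["group"] + rng.normal(size=n)

# Intercept differences: criterion = b1(iq) + b2(group)
intercept_model = smf.ols("criterion ~ iq + group", data=df).fit()
# Slope differences (moderation): add the iq x group interaction
slope_model = smf.ols("criterion ~ iq + group + iq:group", data=df).fit()

print(intercept_model.params["group"])   # intercept difference between groups
print(slope_model.params["iq:group"])    # difference in slopes between groups
```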
Differential prediction
Similar to differential validity, group differences in slopes or
intercepts (averages) are rare
When these differences do occur, they almost always favor minority
groups
◦ Minorities tend to have overprediction of criteria, whereas nonminorities
tend to have underprediction
Cognitive ability tests are most likely to display group differences,
especially for different ethnic groups
Differential prediction
◦ When using the “Overall” regression line (as required by law) in the presence of differential validity:
◦ Predicted scores for the group for whom the predictions are less valid tend to be inflated
◦ Predicted scores for the group for whom the predictions are more valid tend to be deflated
[Plot repeated: Criterion Ratings vs IQ Test Score with Women, Men, and Overall lines]
Bias in IQ tests
We always hope that our tests never show bias, but they do in rare cases
Could be real differences
◦ Differences in educational opportunities, poverty, neighborhoods, home life, etc.
Culturally-influenced questions
◦ “What would you do if a child much smaller than you tried to pick a fight with you?”
Reducing adverse impact
1. Improve recruitment of minorities
2. Use a combination of cognitive and noncognitive predictors
3. Use measures of specific (rather than general) cognitive ability
4. Use multiple regression to combine predictors into a composite
5. Use differential weighting for criterion facets
6. Consider alternative modes for presenting test materials
7. Enhance face validity
8. Implement banding to select among applicants
Banding
Typically, we select applicants in a top-down fashion: those with the
highest scores are selected first, then the next highest, and so on
down the list
◦ This method gives the greatest utility of our measures, but may lead to
adverse impact due to subgroup differences in test scores
However, all of our predictors have a certain amount of error when
they measure each person
◦ The differences between two scores that are close together may be a result of
unsystematic (random) error
Banding
Banding involves using this error (the standard error of
measurement) in order to create ranges or “bands” of scores that
are not statistically different from one another
◦ Example: the SEM might show that those in the 94th percentile and up are not
statistically different than the top scorer (100th percentile)
◦ Within this band, all applicants are treated equally in terms of that predictor
or criterion; you can then select individuals within this band using other
predictors or criteria (e.g., diversity needs)
0 100
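A sketch of SEM-based banding (a common approach; the 1.96 multiplier and the numbers below are illustrative assumptions): scores within roughly 1.96 standard errors of the difference (SED = SEM × sqrt(2)) of the top score are treated as not statistically different from it.

```python
import math

def band_floor(top_score: float, sd: float, r_xx: float, z: float = 1.96) -> float:
    """Lowest score that is not statistically different from the top score."""
    sem = sd * math.sqrt(1 - r_xx)   # standard error of measurement
    sed = sem * math.sqrt(2)         # standard error of the difference between two scores
    return top_score - z * sed

scores = [98, 96, 95, 93, 90, 88, 85]
floor = band_floor(top_score=max(scores), sd=10, r_xx=0.90)
in_band = [s for s in scores if s >= floor]
print(round(floor, 1), in_band)  # everyone in the band is treated equally on this predictor
```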
Fairness
Applicants may examine the fairness of a selection system on three
dimensions:
◦ Distributive: do the outcomes of the system (i.e., who gets hired) seem fair?
◦ Procedural: do the tests and processes used to make decisions seem fair?
◦ Interpersonal: was I treated well during my interactions with members of the
organization?
As you'd expect, individual perceptions of distributive fairness are largely based on whether or not the outcome was favorable to him or her
Fairness
Although tests may be technically fair and lack bias, the process of
testing and making decisions can be such that applicants perceive
unfairness
◦ These perceptions are bad for both the organization and the test taker (e.g.,
lower self-efficacy)
Fair and equitable treatment of test takers involves providing
information about:
◦ The nature of the tests
◦ The intended use of test scores
◦ The confidentiality of the results
Law and HR
Discrimination
Unequal/disparate treatment: Intentional
discrimination
◦ Direct evidence
◦ Open expressions of hatred or inequality; exclusionary policies
◦ Circumstantial evidence
◦ Often established through statistics
◦ Mixed-motive
◦ Both direct evidence of intentional discrimination and evidence that the
stated basis for an employment decision is merely a pretext
Discrimination
Adverse/disparate impact: Unintentional discrimination
◦ Identical standards and procedures are applied to everyone, but
1. The outcome is substantially different for members of a particular group
2. The standards or procedures are unrelated to success on the job
◦ Example: height requirement of 5’8” for police cadets
◦ May have adverse impact on Asians, Hispanics, and women
Employment Laws and Civil Rights
Thirteenth Amendment
◦ No slavery or involuntary servitude
Fourteenth Amendment
◦ Equal protection under the law
Seniority systems
◦ Legal so long as the differences are not intentionally discriminatory
Preemployment Inquiries
◦ Legal so long as they are not used as a basis for discrimination
Provisions to Title VII
Testing
◦ Any professionally developed ability test may be used as long as it does
not discriminate based on protected traits
Preferential treatment
◦ It is illegal to give preferential treatment to a group due to existing
imbalances
Veterans
◦ Veterans may be given preference in spite of likely adverse impact
National Security
◦ Discrimination is permissible when necessary to protect national security
Employment Laws and Civil Rights
Age Discrimination in Employment Act of 1967
◦ Requires EEO based on age; protects those 40 and over
◦ However, older employees can waive their rights to sue
What classes are protected under Title VII of the Civil Rights Act of 1964?
[Diagram: the court system, from the Court of Appeals up to the Supreme Court]
However, case law is not statutory law, and decisions may be reversed in light of new social circumstances or scientific evidence
Case Law: Testing
All professionally developed tests are fine as long as they don’t discriminate
◦ 4/5 rule: if the selection rate for one group is less than 4/5 (80%) of the rate for the group with the highest selection rate, adverse impact is indicated (see the sketch after this list)
Employees with a complaint must identify the specific practice that violates
their rights
◦ Burden of proof shifts to the employer to show that the practice is job relevant and
unbiased
◦ Plaintiff can then show that an alternative test exists that is not biased
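A sketch of the 4/5 (80%) rule computation referenced above (hypothetical hiring numbers): each group's selection rate is compared with the rate of the group with the highest rate, and a ratio below 0.80 flags potential adverse impact.

```python
def adverse_impact_ratio(hired_a: int, applied_a: int, hired_b: int, applied_b: int) -> float:
    """Ratio of the lower selection rate to the higher one (4/5 rule check)."""
    rate_a = hired_a / applied_a
    rate_b = hired_b / applied_b
    return min(rate_a, rate_b) / max(rate_a, rate_b)

# Hypothetical: 40 of 100 nonminority applicants hired vs 12 of 50 minority applicants
ratio = adverse_impact_ratio(40, 100, 12, 50)
print(round(ratio, 2), "adverse impact indicated" if ratio < 0.80 else "no adverse impact indicated")
```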