SUPPLEMENTARY READINGS FOR RELIABILITY, VALIDITY, UTILITY
The document discusses the concepts of reliability and validity in psychological assessment, detailing various types of reliability such as test-retest, parallel forms, and internal consistency. It also covers the importance of validity, including content, criterion-related, and construct validity, emphasizing the need for accurate measurement and interpretation of test scores. Additionally, it highlights the impact of measurement errors and the significance of understanding the reliability coefficients in the context of psychological testing.
Psychological Assessment
Reliability, Validity, Utility
Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013)

Reliability
o Dependability or consistency
o Consistency of the instrument
o A test may be reliable in one context and unreliable in another
o Reliability Coefficient – an index of reliability; a proportion that indicates the ratio between the true score variance on a test and the total variance
o Classical Test Theory – a score on an ability test is presumed to reflect not only the testtaker's true score on the ability being measured but also error
▪ Errors of measurement are random
o Error – refers to the component of the observed test score that does not have to do with the testtaker's ability
▪ Type I error – "false positive"; an investigator rejects a null hypothesis that is true
▪ Type II error – "false negative"; an investigator fails to reject a null hypothesis that is false in the population
▪ The likelihood of Type I and Type II errors can be reduced by increasing the sample size
o Variance – useful in describing sources of test score variability
▪ True Variance – variance from true differences
▪ Error Variance – variance from irrelevant, random sources
o Reliability refers to the proportion of total variance attributed to true variance
o The greater the proportion of the total variance attributed to true variance, the more reliable the test
o Because error variance may increase or decrease a test score by varying amounts, the consistency of the test score, and thus the reliability, can be affected
o Measurement Error – all of the factors associated with the process of measuring some variable, other than the variable being measured
o Random Error – a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process
▪ "Noise"
▪ E.g., physical events that happen while the test is being administered
o Systematic Error – a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
o Sources of Error Variance:
a. Item Sampling/Content Sampling – refers to variation among items within a test as well as to variation among items between tests
▪ The extent to which a testtaker's score is affected by the content sampled on a test, and by the way the content is sampled, is a source of error variance
b. Test Administration
▪ Testtaker's motivation or attention, the environment, etc.
▪ Testtaker variables and examiner-related variables
c. Test Scoring and Interpretation
▪ Tests may employ objective-type items amenable to computer scoring of well-documented reliability
▪ If subjectivity is involved in scoring, the scorer or rater can be a source of error variance

Reliability Estimates
Test-Retest Reliability
o Time sampling
o An estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the test
o Appropriate when evaluating the reliability of a test that purports to measure something relatively stable, such as a personality trait
o The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower
o Coefficient of Stability – used when the interval between testings is greater than six months
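In practice, a test-retest estimate is simply the Pearson correlation between paired scores from the two administrations. A minimal Python sketch, using invented scores purely for illustration:

```python
import numpy as np

# Hypothetical scores for the same six testtakers on two administrations
# of the same test, several weeks apart (illustrative data only).
time1 = np.array([82, 75, 90, 68, 77, 85])
time2 = np.array([80, 78, 88, 70, 75, 86])

# The test-retest reliability estimate is the Pearson r between the pairs.
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability estimate: r = {r_tt:.2f}")
```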
Parallel Forms and Alternate Forms Reliability
o Item sampling
o Coefficient of Equivalence – the degree of relationship between various forms of a test; it can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability
o Parallel Forms – for each form of the test, the means and the variances are equal
▪ Same items, different positionings/numberings
▪ Parallel Forms Reliability – an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal
o Alternate Forms – simply different versions of a test that have been constructed so as to be parallel
▪ Alternate Forms Reliability – an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error
o Two administrations with the same group are required
o Test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy
o Some testtakers might do better on a specific form of a test not as a function of their true ability but simply because of the particular items that were selected for inclusion in that form
o Minimizes the effect of memory for the content of a previously administered form of the test
o Certain traits are presumed to be relatively stable in people
o The means and the variances of the observed scores are equal for the two forms

Internal Consistency
Split-Half Reliability
o Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once
o Useful when it is impractical or undesirable to assess reliability with two tests or to administer a test twice
o Simply dividing the test in the middle is not recommended, because this procedure would likely spuriously raise or lower the reliability coefficient
o Randomly assign items to one or the other half of the test, or assign odd-numbered items to one half and even-numbered items to the other half (odd-even reliability)
o Alternatively, divide the test by content so that each half contains items equivalent with respect to content and difficulty
o Spearman-Brown Formula – allows a test developer or user to estimate internal consistency reliability from the correlation between two halves of a test (see the code sketch below)
o The reliability of a test is affected by its length; usually, reliability increases as length increases
o Spearman-Brown may be used to estimate the effect of shortening a test on its reliability
o It may also be used to determine the number of items needed to attain a desired level of reliability
o If the reliability of the original test is relatively low, it may be impractical to increase the number of items, and a suitable alternative instrument should be developed instead
o Reliability can also be increased by creating new items, clarifying the test instructions, or simplifying the scoring rules

Inter-item Consistency
o Refers to the degree of correlation among all the items on a scale
o Calculated from a single administration of a single form of a test
o Useful in assessing homogeneity
▪ Homogeneity – the degree to which a test contains items that measure a single trait (unifactorial)
▪ Heterogeneity – the degree to which a test measures different factors (more than one trait); a source of error variance
o More homogeneous = higher inter-item consistency
o KR-20 – used for the inter-item consistency of dichotomous items
o KR-21 – used if all the items have the same degree of difficulty (speed tests)
o Coefficient Alpha – appropriate for use on tests containing non-dichotomous items
▪ Helps answer questions about how similar sets of data are
▪ Checks consistency across items of an instrument whose responses earn varying credit
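The split-half, Spearman-Brown, and coefficient alpha estimates above can all be computed from a single testtaker-by-item score matrix. A minimal sketch, assuming invented 0/1 item data (none of these numbers come from the cited texts):

```python
import numpy as np

def split_half_reliability(items):
    """Odd-even split-half reliability, corrected with Spearman-Brown."""
    odd = items[:, 0::2].sum(axis=1)    # total score on odd-numbered items
    even = items[:, 1::2].sum(axis=1)   # total score on even-numbered items
    r_half = np.corrcoef(odd, even)[0, 1]
    # Spearman-Brown for a test twice the length of each half:
    return 2 * r_half / (1 + r_half)

def spearman_brown(r_original, n):
    """Predicted reliability when the test is lengthened n times."""
    return n * r_original / (1 + (n - 1) * r_original)

def coefficient_alpha(items):
    """Cronbach's alpha; with 0/1 items this reduces to KR-20."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5 testtakers x 6 dichotomous items (illustrative only)
x = np.array([[1, 1, 1, 0, 1, 1],
              [1, 0, 1, 1, 0, 1],
              [0, 1, 0, 0, 1, 0],
              [1, 1, 1, 1, 1, 1],
              [0, 0, 1, 0, 0, 0]])

print(f"Split-half (Spearman-Brown corrected): {split_half_reliability(x):.2f}")
print(f"Alpha (= KR-20 for these 0/1 items):   {coefficient_alpha(x):.2f}")
print(f"Reliability if doubled from r = .70:   {spearman_brown(0.70, 2):.2f}")
```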
o Average Proportional Distance (APD) – a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores
▪ Not connected to the number of items on a measure

Inter-scorer Reliability
o The degree of agreement or consistency between two or more scorers with regard to a particular measure
o Used for coding nonverbal behavior
o Coefficient of inter-scorer reliability
o Observer differences
o Kappa statistics are used
▪ Fleiss' Kappa – determines the level of agreement between two or more raters when the method of assessment is measured on a categorical scale; the best option when there are more than two raters
▪ Cohen's Kappa – two raters each classify N items into C mutually exclusive categories, rating the same thing; the statistic is corrected for how often the raters may agree by chance; for two raters only
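Cohen's kappa has a simple closed form: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from the raters' marginal proportions. A minimal sketch with hypothetical codings by two observers:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    categories = sorted(set(rater_a) | set(rater_b))
    n = len(rater_a)
    # Observed proportion of agreement
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal proportions
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codings of 10 observed behaviors (illustrative data only)
a = ["calm", "anxious", "calm", "calm", "anxious",
     "calm", "anxious", "calm", "calm", "calm"]
b = ["calm", "anxious", "calm", "anxious", "anxious",
     "calm", "anxious", "calm", "calm", "anxious"]
print(f"Cohen's kappa: {cohens_kappa(a, b):.2f}")  # 0.60 for these data
```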
Using and Interpreting a Coefficient of Reliability
o Tests designed to measure one factor (homogeneous tests) are expected to have a high degree of internal consistency, and vice versa
o Dynamic – a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experience
o Static – a trait, state, or ability that is barely changing or relatively unchanging
o Restriction of range (restriction of variance) – if the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower
o Power Tests – the time limit is long enough to allow testtakers to attempt all items
o Speed Tests – generally contain items of uniform difficulty completed under a time limit
▪ Reliability should be based on performance from two independent testing periods, using test-retest, alternate-forms, or split-half reliability
o Criterion-Referenced Tests – designed to provide an indication of where a testtaker stands with respect to some variable or criterion
▪ As individual differences decrease, a traditional measure of reliability would also decrease, regardless of the stability of individual performance
o Classical Test Theory – everyone has a "true score" on a test
▪ True Score – genuinely reflects an individual's ability level as measured by a particular test
o Domain Sampling Theory – estimates the extent to which specific sources of variation under defined conditions contribute to the test scores
▪ Considers the problem created by using a limited number of items to represent a larger and more complicated construct
▪ Test reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample
▪ Generalizability Theory – based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation
▪ Universe – the test situation
▪ Facets – e.g., the number of items in the test, the amount of review, and the purpose of the test administration
▪ According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained; this is the universe score
▪ Decision Study – developers examine the usefulness of test scores in helping the test user make decisions
o Item Response Theory – models the probability that a person with X ability will be able to perform at a level of Y on a test
▪ Also referred to as Latent-Trait Theory
▪ A computer can be used to focus on the range of item difficulty that best helps assess an individual's ability level
▪ Difficulty – the attribute of not being easily accomplished, solved, or comprehended
▪ Discrimination – the degree to which an item differentiates among people with higher or lower levels of the trait, ability, etc.
▪ Dichotomous items – can be answered with only one of two alternative responses
▪ Polytomous items – have three or more alternative responses

Reliability and Individual Scores
o Standard Error of Measurement (SEM) – provides a measure of the precision of an observed test score (see the code sketch below)
▪ The standard deviation of the errors serves as the basic measure of error
▪ Provides an estimate of the amount of error inherent in an observed score or measurement
▪ The higher the reliability, the lower the SEM
▪ Used to estimate or infer the extent to which an observed score deviates from a true score
▪ Also called the standard error of a score
▪ Confidence Interval – a range or band of test scores that is likely to contain the true score
o Standard Error of the Difference – can aid a test user in determining how large a difference between two scores should be before it is considered statistically significant
o Standard Error of Estimate – the standard error of the difference between predicted and observed values
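A widely used form of the SEM is SD × sqrt(1 − r_xx), and a confidence interval around an observed score follows directly from it. A short sketch with hypothetical values:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r_xx): higher reliability -> lower SEM."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: SD = 15, reliability = .90, observed score = 106
sem = standard_error_of_measurement(sd=15, reliability=0.90)
observed = 106
lo, hi = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.2f}")
print(f"95% confidence interval for the true score: {lo:.1f} to {hi:.1f}")
```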
Validity
o Validity – a judgment or estimate of how well a test measures what it purports to measure
▪ Rests on evidence about the appropriateness of inferences drawn from test scores
▪ Inferences – logical results or deductions
▪ Validity may diminish as the culture or the times change
o Validation – the process of gathering and evaluating evidence about validity
o Validation Studies – yield insights regarding a particular population of testtakers as compared to the norming sample described in a test manual
o Face Validity – relates more to what a test appears to measure to the person being tested than to what the test actually measures

Content Validity
o Content Validity – a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
o Supported when the proportion of the material covered by the test approximates the proportion of material covered in the course
o Test Blueprint – a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items, and so forth

Criterion-Related Validity
o Criterion-Related Validity – a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest; the measure of interest is the criterion
o Criterion – the standard against which a judgment or decision may be made
▪ Characteristics: relevant, valid, uncontaminated
▪ Criterion Contamination – occurs when the criterion measure includes aspects of performance that are not part of the job, or when the measure is affected by "construct-irrelevant" factors (Messick, 1989) that are not part of the criterion construct

Concurrent Validity
o Applies when test scores are obtained at about the same time as the criterion measures
o The extent to which test scores may be used to estimate an individual's present standing on a criterion
o Economically efficient

Predictive Validity
o Measures the relationship between test scores and a criterion measure obtained at a future time
o Researchers must take into consideration the base rate of occurrence of the variable in question, both as it exists in the general population and as it exists in the sample being studied
o Base Rate – the extent to which a particular trait, behavior, characteristic, or attribute exists in the population
o Hit Rate – the proportion of people the test accurately identifies as possessing a particular trait, behavior, etc. (see the code sketch below)
o Miss Rate – the proportion of people the test fails to identify as having that particular characteristic
o False Positive – a miss; the test predicted that the person possesses a particular trait when they actually do not
o False Negative – a miss; the test predicted that the person does not possess a particular trait when they actually do
o Validity Coefficient – a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure
▪ Usually the Pearson r is used; other correlation coefficients may be used depending on the type of data
▪ Affected by restriction or inflation of range
▪ Needs to be large enough to enable the test user to make accurate decisions within the unique context in which the test is being used
o Incremental Validity – the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use
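Hit, miss, false-positive, and false-negative rates can be tallied directly from paired predictions and outcomes. A minimal sketch with invented screening data (here "hits" counts all correct classifications):

```python
# Hypothetical screening outcomes: did the test flag the trait, and does
# the person actually possess it? (Illustrative data only.)
predicted = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]   # 1 = test says "has the trait"
actual    = [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]   # 1 = really has the trait

hits            = sum(p == a for p, a in zip(predicted, actual))
false_positives = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
false_negatives = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))

n = len(actual)
print(f"Hit rate:  {hits / n:.0%}")   # correct classifications
print(f"Miss rate: {(false_positives + false_negatives) / n:.0%}")
print(f"False positives: {false_positives}, false negatives: {false_negatives}")
print(f"Base rate of the trait: {sum(actual) / n:.0%}")
```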
Construct Validity
o Construct Validity – a judgment about the appropriateness of inferences drawn from test scores regarding individual standing on a variable called a construct
o Construct – an informed, scientific idea developed or hypothesized to describe or explain behavior
▪ Constructs are unobservable, presupposed traits that may be invoked to describe test behavior or criterion performance
o One way a test developer can improve the homogeneity of a test containing dichotomous items is by eliminating items that do not show significant correlation coefficients with total test scores
o If high scorers on an academic test as a whole tended, for some reason, to get a particular item wrong while low scorers got it right, then the item is obviously not a good one
o Some constructs lend themselves more readily than others to predictions of change over time
o Method of Contrasted Groups – demonstrates that scores on the test vary in a predictable way as a function of membership in a group
▪ If a test is a valid measure of a particular construct, then the scores of the group of people who really possess that construct should differ from the scores of people who do not have that construct
o Convergent Evidence – scores on the test undergoing construct validation tend to correlate highly with an established, validated test that measures the same construct (see the code sketch below)
o Discriminant Evidence – a validity coefficient showing little relationship between test scores and/or other variables with which scores on the test being construct-validated should not, in theory, be correlated
o Factor Analysis – designed to identify factors, or specific variables that are typically attributes, characteristics, or dimensions on which people may differ
▪ Employed as a data reduction method
▪ Identifies the factor or factors in common between test scores on subscales within a particular test
▪ Exploratory FA – estimating or extracting factors and deciding how many factors must be retained
▪ Confirmatory FA – researchers test the degree to which a hypothetical model fits the actual data
▪ Factor Loading – conveys information about the extent to which a factor determines the test score or scores
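The convergent/discriminant pattern is easy to see with simulated data: a new scale should correlate highly with an established same-construct measure and near zero with an unrelated one. A sketch, with all data randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores: a new anxiety scale, an established anxiety measure
# (same construct), and an unrelated vocabulary test (different construct).
new_scale   = rng.normal(50, 10, 100)
established = new_scale + rng.normal(0, 5, 100)   # shares the construct
vocabulary  = rng.normal(50, 10, 100)             # should be unrelated

r_convergent   = np.corrcoef(new_scale, established)[0, 1]
r_discriminant = np.corrcoef(new_scale, vocabulary)[0, 1]

# Convergent evidence: high r with the same-construct measure.
# Discriminant evidence: near-zero r with the unrelated measure.
print(f"Convergent r   = {r_convergent:.2f}")
print(f"Discriminant r = {r_discriminant:.2f}")
```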
Validity, Bias, and Fairness
o Bias – a factor inherent in a test that systematically prevents accurate, impartial measurement
▪ Implies prejudice or preferential treatment
▪ Can be prevented during test development through a procedure called Estimated True Score Transformation
o Rating – a numerical or verbal judgment that places a person or an attribute along a continuum identified by a scale of numerical or word descriptors known as a rating scale
▪ Rating Error – intentional or unintentional misuse of the scale
▪ Leniency Error – the rater is lenient in scoring (also called Generosity Error)
▪ Severity Error – the rater is strict in scoring
▪ Central Tendency Error – the rater's ratings tend to cluster in the middle of the rating scale
▪ One way to overcome rating errors is to use rankings
▪ Halo Effect – the tendency to give a ratee a high score due to failure to discriminate among conceptually distinct and potentially independent aspects of the ratee's behavior
o Fairness – the extent to which a test is used in an impartial, just, and equitable way
o Attempting to define the validity of a test is futile if the test is not reliable

Utility
o Utility – the usefulness or practical value of testing to improve efficiency
o Can tell us something about the practical value of the information derived from scores on the test
o Helps us make better decisions
o Higher criterion-related validity = higher utility
o One of the most basic elements in utility analysis is the financial cost of the selection device
o Cost – disadvantages, losses, or expenses in both economic and noneconomic terms
o Benefit – profits, gains, or advantages
o The cost of test administration can be well worth it if the results bring certain noneconomic benefits

Utility Analysis
o Utility Analysis – a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment

How is Utility Analysis Conducted?
Expectancy Data
o Expectancy Table – provides an indication of the likelihood that a testtaker will score within some interval of scores on a criterion measure – passing, acceptable, or failing (see the code sketch below)
o Might indicate future behaviors; if the predictions hold, the test is working as it should
o Taylor-Russell Tables – provide an estimate of the extent to which inclusion of a particular test in the selection system will improve selection
o Selection Ratio – a numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired
o Base Rate – the percentage of people hired under the existing system for a particular position
o One limitation of the Taylor-Russell Tables is that the relationship between the predictor (test) and the criterion must be linear
o Naylor-Shine Tables – entail obtaining the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures
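A basic expectancy table can be built by cross-tabulating test-score bands against later criterion outcomes; each entry is the proportion of testtakers in a band who succeeded. A simplified sketch with invented records:

```python
# Hypothetical paired records: an applicant's test-score band and their
# later criterion outcome (1 = successful on the job). Illustrative only.
records = [
    ("high", 1), ("high", 1), ("high", 0), ("high", 1),
    ("mid", 1), ("mid", 0), ("mid", 1), ("mid", 0),
    ("low", 0), ("low", 0), ("low", 1), ("low", 0),
]

# For each score band, report the proportion who succeeded on the criterion.
for band in ("high", "mid", "low"):
    outcomes = [ok for b, ok in records if b == band]
    print(f"{band:>4}: {sum(outcomes) / len(outcomes):.0%} successful")
```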
Brogden-Cronbach-Gleser Formula
o Used to calculate the dollar amount of the utility gain resulting from the use of a particular selection instrument
o Utility Gain – an estimate of the benefit of using a particular test
o Productivity Gains – an estimated increase in work output
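These notes do not reproduce the formula itself; one common textbook presentation of the Brogden-Cronbach-Gleser utility gain is (number selected)(tenure)(r_xy)(SD of the criterion in dollars)(mean standardized test score of those hired), minus the total cost of testing all applicants. A sketch with invented figures:

```python
def bcg_utility_gain(n_selected, tenure_years, validity, sd_y,
                     mean_z_of_hires, n_applicants, cost_per_applicant):
    """Brogden-Cronbach-Gleser utility gain in dollars (one common form).

    Benefit = (N selected)(tenure)(r_xy)(SDy)(mean z of those hired);
    the total cost of testing all applicants is then subtracted.
    """
    benefit = n_selected * tenure_years * validity * sd_y * mean_z_of_hires
    cost = n_applicants * cost_per_applicant
    return benefit - cost

# Hypothetical selection program (all figures invented for illustration):
gain = bcg_utility_gain(n_selected=10, tenure_years=2, validity=0.40,
                        sd_y=12_000, mean_z_of_hires=1.0,
                        n_applicants=60, cost_per_applicant=50)
print(f"Estimated utility gain: ${gain:,.0f}")
```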
Some Practical Considerations
o High-performing applicants may have received offers from other companies as well
o The more complex the job, the more people differ on how well or poorly they do that job
o Cut Score – a reference point derived as a result of a judgment and used to divide a set of data into two or more classifications
▪ Relative Cut Score – a reference point based on norm-related considerations (norm-referenced); e.g., the NMAT
▪ Fixed Cut Scores – set with reference to a judgment concerning the minimum level of proficiency required; e.g., board exams
▪ Multiple Cut Scores – the use of two or more cut scores with reference to one predictor for the purpose of categorization
▪ Multiple Hurdle – a multi-stage selection process in which a cut score is in place for each predictor
▪ Compensatory Model of Selection – the assumption that high scores on one attribute can compensate for lower scores on another

Methods for Setting Cut Scores
o Angoff Method – a method for setting fixed cut scores
▪ A drawback is low interrater reliability
o Known Groups Method – collection of data on the predictor of interest from groups known to possess, and not to possess, a trait of interest
▪ The determination of where to set the cutoff score is inherently affected by the composition of the contrasting groups
o IRT-Based Methods – cut scores are typically set based on testtakers' performance across all the items on the test
▪ Item-Mapping Method – arrangement of items in a histogram, with each column containing items deemed to be of equivalent value
▪ Bookmark Method – an expert places a "bookmark" between the two pages that are deemed to separate testtakers who have acquired the minimal knowledge, skills, and/or abilities from those who have not
o Method of Predictive Yield – takes into account the number of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores
o Discriminant Analysis – sheds light on the relationship between identified variables and two naturally occurring groups

end