Lesson 8
QUALITIES OF A GOOD RESEARCH INSTRUMENT

Researcher-made instruments such as tests, questionnaires, rating scales, interviews, observation schedules, etc., should meet the qualities of a good research instrument before they are used. These measuring instruments are used for gathering or collecting data and are important devices, because the success or failure of a study rests on the data gathered.
THE QUALITIES OF A GOOD RESEARCH INSTRUMENT ARE:

1. VALIDITY
2. RELIABILITY
3. USABILITY/PRACTICALITY

VALIDITY

The term validity refers to whether or not a test measures what it intends to measure. On a test with high validity, the items will be closely linked to the test's intended focus. For many certification and licensure tests, this means that the items will be highly related to a specific job or occupation. If a test has poor validity, then it does not measure the job-related content and competencies it ought to.
The Types of Validity

1. CONTENT: Related to objectives and their sampling.
2. CONSTRUCT: Referring to the theory underlying the target.
3. CRITERION: Related to concrete criteria in the real world. It can be concurrent or predictive.
4. CONCURRENT: Correlating highly with another measure already validated.
5. PREDICTIVE: Capable of anticipating some later measure.
6. FACE: Related to the test's overall appearance.

1. Content Validity

Content validity refers to the connections between the test items and the subject-related tasks. The test should evaluate only the content related to the field of study, in a manner that is representative, relevant, and comprehensible.

2. Construct Validity

Construct validity implies using the construct correctly (concepts, ideas, notions). It seeks agreement between a theoretical concept and a specific measuring device or procedure. For example, a test of intelligence nowadays must include measures of multiple intelligences, rather than just logical-mathematical and linguistic ability measures.

3. Criterion-Related Validity

Also referred to as instrumental validity, criterion-related validity requires that the criteria be clearly defined by the teacher in advance. It has to take other teachers' criteria into account in order to be standardized, and it also needs to demonstrate the accuracy of a measure or procedure by comparing it with another measure or procedure that has already been demonstrated to be valid.

Example: Imagine that a hands-on driving test has been shown to be an accurate test of driving skills. A written test can then be validated by using a criterion-related strategy in which it is compared against the hands-on driving test.
4. Concurrent Validity

Concurrent validity is a statistical method using correlation, rather than a logical method. Examinees who are known to be either masters or non-masters of the content measured by the test are identified before the test is administered. Once the tests have been scored, the relationship between the examinees' known status as masters or non-masters and their performance on the test (i.e., pass or fail) is estimated. This type of validity provides evidence that the test is classifying examinees correctly. The stronger the correlation, the greater the concurrent validity of the test.
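As a rough illustration, a correlation between a binary status (master/non-master) and a continuous score can be estimated with a point-biserial coefficient. A minimal sketch, assuming Python with scipy; the status labels and scores below are invented:

```python
# A sketch of concurrent validity: correlating examinees' known status
# (master = 1, non-master = 0) with their scores on the new test.
# All data below are invented for illustration.
from scipy.stats import pointbiserialr

status = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]            # identified before testing
scores = [88, 92, 79, 55, 61, 85, 47, 58, 90, 52]  # test scores

r, p = pointbiserialr(status, scores)
print(f"point-biserial r = {r:.2f} (p = {p:.3f})")  # stronger r, greater concurrent validity
```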
5. Predictive Validity

This is another statistical approach to validity that estimates the relationship of test scores to an examinee's future performance as a master or non-master. Predictive validity considers the question, "How well does the test predict examinees' future status as masters or non-masters?" For this type of validity, the correlation that is computed is based on the test results and the examinee's later performance. This type of validity is especially useful for test purposes such as selection or admissions.
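A minimal sketch of such a correlation, assuming Python with scipy; the admission scores and the later-performance figures (e.g., first-year GPA) are invented:

```python
# A sketch of predictive validity: correlating admission-test scores with
# a later criterion measure of performance. Data are invented.
from scipy.stats import pearsonr

test_scores = [72, 85, 90, 65, 78, 88, 60, 95]
later_performance = [2.8, 3.4, 3.6, 2.5, 3.0, 3.5, 2.3, 3.8]

r, _ = pearsonr(test_scores, later_performance)
print(f"predictive validity coefficient r = {r:.2f}")
```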
6. Face Validity

Like content validity, face validity is determined by a review of the items and not through the use of statistical analyses. Unlike content validity, face validity is not investigated through formal procedures. Instead, anyone who looks over the test, including examinees, may develop an informal opinion as to whether or not the test is measuring what it is supposed to measure. While it is clearly of some value to have the test appear to be valid, face validity alone is insufficient for establishing that the test is measuring what it claims to measure.

RELIABILITY

Reliability is the extent to which an experiment, test, or any measuring procedure shows the same result on repeated trials. Without the agreement of independent observers able to replicate research procedures, or the ability to use research tools and procedures that produce consistent measurements, researchers would be unable to satisfactorily draw conclusions, formulate theories, or make claims about the generalizability of their research. For researchers, five key types of reliability are:

Types of Reliability

1. EQUIVALENCY: Related to the co-occurrence of two items.
2. STABILITY: Related to time consistency.
3. INTERNAL: Related to the instruments.
4. INTER-RATER: Related to agreement among different examiners.
5. INTRA-RATER: Related to the consistency of a single examiner.

1. Equivalency Reliability

Equivalency reliability is the extent to which two items measure identical concepts at an identical level of difficulty. Equivalency reliability is determined by relating two sets of test scores to one another to highlight the degree of relationship or association.

Example: A researcher studying university English students noticed that when some students were studying for finals, they got sick. Intrigued by this, the researcher attempted to observe how often, or to what degree, these two behaviors co-occurred throughout the academic year. The researcher used the results of the observations to assess the correlation between "studying throughout the academic year" and "getting sick". The researcher concluded there was poor equivalency reliability between the two actions. In other words, studying was not a reliable predictor of getting sick.
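In test settings, equivalency reliability is often estimated by correlating scores on two parallel forms of the same test. A minimal sketch under that assumption, using Python with scipy and invented scores:

```python
# A sketch of equivalency (parallel-forms) reliability: correlating two
# sets of scores on forms intended to measure the same concept at the
# same level of difficulty. Data are invented.
from scipy.stats import pearsonr

form_a = [14, 18, 11, 20, 16, 9, 17, 13]
form_b = [15, 17, 12, 19, 15, 10, 18, 12]

r, _ = pearsonr(form_a, form_b)
print(f"equivalency reliability r = {r:.2f}")
```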
2. Stability Reliability

Stability reliability (sometimes called test-retest reliability) is the agreement of measuring instruments over time. To determine stability, a measure or test is repeated on the same subjects at a future date. Results are compared and correlated with the initial test to give a measure of stability. This method of evaluating reliability is appropriate only if the phenomenon that the test measures is known to be stable over the interval between assessments. The possibility of practice effects should also be taken into account.
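A minimal test-retest sketch, assuming Python with numpy; the two administrations below are invented:

```python
# A sketch of stability (test-retest) reliability: the same subjects take
# the same test on two dates, and the two score sets are correlated.
import numpy as np

time_1 = np.array([23, 31, 18, 27, 25, 35, 20])
time_2 = np.array([25, 30, 17, 28, 24, 36, 22])  # retest at a later date

r = np.corrcoef(time_1, time_2)[0, 1]
print(f"test-retest reliability r = {r:.2f}")
```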
3. Internal Consistency

Internal consistency is the extent to which tests or procedures assess the same characteristic, skill, or quality. It is a measure of the precision between the measuring instruments used in a study. This type of reliability often helps researchers interpret data and predict the value of scores and the limits of the relationship among variables.

Example: Analyzing the internal reliability of the items on a vocabulary quiz will reveal the extent to which the quiz focuses on the examinee's knowledge of words.
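One common way to quantify internal consistency is Cronbach's alpha. A minimal sketch, assuming Python with numpy and an invented examinee-by-item score matrix for a vocabulary quiz:

```python
# A sketch of internal consistency via Cronbach's alpha:
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
# Rows = examinees, columns = quiz items (1 = correct). Data are invented.
import numpy as np

items = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

k = items.shape[1]                         # number of items
item_vars = items.var(axis=0, ddof=1)      # variance of each item
total_var = items.sum(axis=1).var(ddof=1)  # variance of examinees' total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```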
4. Inter-Rater Reliability

Inter-rater reliability is the extent to which two or more individuals (coders or raters) agree. It assesses the consistency of how a measuring system is implemented.

Example: Two or more teachers use a rating scale to rate students' oral responses in an interview (1 being most negative, 5 being most positive). If one rater gives a "1" to a student response while another rater gives a "5," the inter-rater reliability is clearly poor. Inter-rater reliability depends on the ability of two or more individuals to be consistent. Training, education, and monitoring skills can enhance inter-rater reliability.
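One common agreement statistic for two raters is Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal hand-rolled sketch, assuming Python with numpy and invented ratings on the 1-5 scale:

```python
# A sketch of inter-rater reliability for two raters on a 1-5 scale,
# using Cohen's kappa computed by hand. Ratings are invented.
import numpy as np

rater_1 = np.array([5, 3, 4, 2, 5, 1, 4, 3])
rater_2 = np.array([5, 3, 3, 2, 4, 1, 4, 3])

observed = np.mean(rater_1 == rater_2)  # proportion of exact agreements

# Chance agreement, from each rater's marginal rating frequencies.
levels = np.arange(1, 6)
p1 = np.array([np.mean(rater_1 == lv) for lv in levels])
p2 = np.array([np.mean(rater_2 == lv) for lv in levels])
expected = (p1 * p2).sum()

kappa = (observed - expected) / (1 - expected)
print(f"agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```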
5. Intra-Rater Reliability

Intra-rater reliability is a type of reliability assessment in which the same assessment is completed by the same rater on two or more occasions. These different ratings are then compared, generally by means of correlation. Since the same individual is completing both assessments, the rater's subsequent ratings are contaminated by knowledge of earlier ratings.
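A minimal sketch of that comparison, assuming Python with scipy; Spearman's rank correlation is used here because the ratings are ordinal, and both rating occasions are invented:

```python
# A sketch of intra-rater reliability: one rater scores the same set of
# responses on two occasions, and the two rating sets are correlated.
# Spearman's rho is used since ratings are ordinal. Data are invented.
from scipy.stats import spearmanr

occasion_1 = [4, 2, 5, 3, 4, 1, 5, 2]
occasion_2 = [4, 3, 5, 3, 4, 2, 4, 2]

rho, _ = spearmanr(occasion_1, occasion_2)
print(f"intra-rater reliability rho = {rho:.2f}")
```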
SOURCES OF ERROR

EXAMINEE (IS A HUMAN BEING)
EXAMINER (IS A HUMAN BEING)
EXAMINATION (IS DESIGNED BY AND FOR HUMAN BEINGS)

RELATIONSHIP BETWEEN VALIDITY AND RELIABILITY

Validity and reliability are closely related. A test cannot be considered valid unless the measurements resulting from it are reliable. Likewise, results from a test can be reliable and not necessarily valid.

PRACTICALITY

Practicality refers to the economy of time, effort, and money in testing. In other words, a test should be:

EASY TO DESIGN
EASY TO ADMINISTER
EASY TO MARK
EASY TO INTERPRET (THE RESULTS)

BACKWASH

The backwash effect (also known as washback) is the influence of testing on teaching and learning. It is also the potential impact that the form and content of a test may have on learners' conception of what is being assessed (language proficiency) and what it involves. Therefore, test designers, deliverers, and raters have a particular responsibility, considering that the testing process may have a substantial impact, either positive or negative.

Levels of Backwash

It is believed that backwash is a subset of a test's impact on society, educational systems, and individuals. Thus, test impact operates at two levels (Bachman and Palmer, 1996):

THE MICRO LEVEL: The effect of the test on individual students and teachers.
THE MACRO LEVEL: The impact of the test on society and the educational system.