Resume Group 3 Principles of Language Assessment
Assessment
A. Validity
Validity is divided into several aspects. The first is the content validity of the test. A
test is said to have content validity if its content constitutes a representative sample of the
language skills, structures, etc. with which it is meant to be concerned. The test will have
content validity only if it includes a proper sample of the relevant structures. We would not
expect an achievement test for intermediate learners to contain just the same set of structures
as one for advanced learners. In order to judge whether or not a test has content validity, we
need a specification of the skills or structures, etc. that it is meant to cover. Such a
specification should be made at a very early stage in test construction. It is not to be expected
that everything in the specification will always appear in the test; there may simply be too
many things for all of them to appear in a single test. A comparison of test specification and
test content is the basis for judgments as to content validity. Ideally these judgments should
be made by people who are familiar with language teaching and testing but who are not
directly concerned with the production of the test in question.
Why is content validity important? Firstly, the greater a test’s content validity,
the more likely it is to be an accurate measure of what it is supposed to measure, i.e. to have
construct validity. A test in which major areas identified in the specification are under-
represented – or not represented at all – is unlikely to be accurate. Secondly, such a test is
likely to have a harmful backwash effect. Areas that are not tested are likely to become areas
ignored in teaching and learning.
The second aspect is criterion-related validity: the degree to which results on the test
agree with those provided by some independent and highly dependable assessment of the
candidate’s ability. This independent assessment is thus the criterion measure against which
the test is validated.
There are essentially two kinds of criterion-related validity: concurrent validity and
predictive validity. Concurrent validity is established when the test and the criterion are
administered at about the same time. To exemplify this kind of validation in achievement
testing, let us consider a situation where course objectives call for an oral component as part
of the final achievement test.
From the point of view of content validity, this will depend on how many of the
functions are tested in the component, and how representative they are of the complete set
of functions included in the objectives. Every effort should be made when designing the oral
component to give it content validity.
The second kind of criterion-related validity is predictive validity. This concerns the
degree to which a test can predict candidates’ future performance. An example would be how
well a proficiency test could predict a student’s ability to cope with a graduate course. The
criterion measure here might be an assessment of the student’s English as perceived by his or
her supervisor at the university, or it could be the outcome of the course (pass/fail etc).
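Both kinds of criterion-related validity are typically reported as a correlation (a validity coefficient) between scores on the test and scores on the criterion measure. The following is a minimal sketch of that computation in Python; the scores are invented purely for illustration, not taken from any real study.

```python
# Sketch: estimating criterion-related validity as the Pearson
# correlation between test scores and an independent criterion
# measure (e.g. supervisors' ratings). All data are hypothetical.

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical test scores and criterion ratings for eight candidates
test_scores = [62, 75, 48, 90, 70, 55, 83, 66]
criterion   = [60, 78, 50, 88, 72, 53, 80, 70]

validity_coefficient = pearson(test_scores, criterion)
print(round(validity_coefficient, 3))
```

A coefficient close to 1.0 indicates strong agreement between the test and the criterion; for concurrent validity the two measures are administered at about the same time, while for predictive validity the criterion is collected later.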
The third aspect is construct validity: investigations of a test’s content validity and
criterion-related validity provide evidence for its overall, or construct, validity. One could imagine a test that
was meant to measure reading ability, the specifications for which included reference to a
variety of reading sub-skills, including, for example, the ability to guess the meaning of
unknown words from the context in which they are met. Concurrent validation might reveal
a strong relationship between students’ performance on the test and their supervisors’ assessment
of their reading ability. But one would still not be sure that the items in the test were ‘really’
measuring the sub-skills listed in the specifications.
Two principal methods are used to gather information about what test items actually
measure: think-aloud and retrospection. In the think-aloud method, test takers voice their thoughts as they respond to
the item. In retrospection, they try to recollect what their thinking was as they responded. The
problem with the think aloud method is that the very voicing of thoughts may interfere with
what will be the natural response to the item. The drawback to retrospection is that thoughts
may be misremembered or forgotten. Despite these weaknesses, such research can give
valuable insights into how items work.
B. Reliability
In order to make a test reliable, consider these factors:
1) Take enough samples of behavior.
2) Do not allow candidates too much freedom.
3) Write unambiguous items.
4) Provide clear and explicit instructions.
5) Ensure that tests are well laid out and perfectly legible.
6) Make sure candidates are familiar with the format and testing techniques.
7) Provide uniform and non-distracting conditions of administration.
8) Use items that permit scoring which is as objective as possible.
9) Make comparisons between candidates as direct as possible.
10) Provide a detailed scoring key.
11) Train scorers.
12) Agree acceptable responses and appropriate scores at the outset of scoring.
13) Identify candidates by number, not name.
14) Employ multiple, independent scoring.
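Reliability itself is usually expressed as a coefficient estimating how consistently a test would rank the same candidates. One common estimate is split-half reliability with the Spearman-Brown correction; the sketch below illustrates it with invented item scores (1 = correct, 0 = incorrect), not data from any real test.

```python
# Sketch: split-half reliability with the Spearman-Brown correction.
# The item matrix below is hypothetical, for illustration only.

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_reliability(item_matrix):
    """item_matrix: one row per candidate, one column per item.
    Split the items into odd- and even-numbered halves, correlate
    the two half-scores, then apply the Spearman-Brown correction
    to estimate the reliability of the full-length test."""
    odd = [sum(row[0::2]) for row in item_matrix]
    even = [sum(row[1::2]) for row in item_matrix]
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)  # Spearman-Brown prophecy formula

# Hypothetical responses of six candidates to eight items
responses = [
    [1, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 1],
]
print(round(split_half_reliability(responses), 3))
```

This makes factor 1 above concrete: the more independent samples of behavior (items) a test contains, the higher the corrected coefficient tends to be, which is exactly what the Spearman-Brown formula models.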
In connection with validity and reliability, we could argue that to be valid a test must
provide consistently accurate measurements. It must therefore be reliable. A reliable test,
however, may not be valid at all. There will always be some tension between reliability and
validity. The tester has to balance gains in one against losses in the other.