Principles of Language Testing
By
Didi Sukyadi
English Education Department
Indonesia University of Education
Practicality
• Is not excessively expensive
• Stays within appropriate time constraints
• Is relatively easy to administer
• Has a scoring/evaluation procedure that is
specific and time efficient
• Items can be replicated in terms of resources
needed, e.g. time, materials, people
• Can be administered
• Can be graded
• Results can be interpreted
Reliability
• A reliable test is consistent and dependable.
• Related to accuracy, dependability and
consistency e.g. 20°C here today, 20°C in North
Italy – are they the same?
According to Henning [1987], reliability is
• a measure of accuracy, consistency,
dependability, or fairness of scores resulting from
the administration of a particular examination
e.g. 75% on a test today, 83% tomorrow –
problem with reliability.
Reliability
• Student-related reliability: the deviation of an
observed score from one’s true score because of
temporary illness, fatigue, anxiety, a bad day, etc.
• Rater reliability: two or more raters yield
inconsistent scores on the same test because of lack of
attention to scoring criteria, inexperience, inattention,
or preconceived bias.
• Administration reliability: unreliable results because of
the testing environment, such as noise, poor quality of a
cassette tape, etc.
• Test reliability: measurement errors because the test is
too long.
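Rater reliability is often estimated by correlating two raters’ scores on the same set of performances. The sketch below is a minimal illustration with invented scores (not data from these slides), using the Pearson correlation coefficient:

```python
# Hypothetical illustration: estimating rater reliability as the Pearson
# correlation between two raters' scores on the same six essays.
# The score lists below are invented sample data.

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

rater_a = [70, 82, 65, 90, 75, 88]
rater_b = [72, 80, 60, 92, 78, 85]
print(round(pearson(rater_a, rater_b), 2))  # close to 1.0 = high agreement
```

A coefficient near 1.0 suggests the two raters rank and score candidates consistently; a low coefficient signals a rater-reliability problem.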
To Make Tests More Reliable
• Take a large enough sample of behaviour
• Exclude items which do not discriminate well
between weaker and stronger students
• Do not allow candidates too much freedom
• Provide clear and explicit instructions
• Make sure that the tests are well laid out
and legible
• Make candidates familiar with format and testing
techniques
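Excluding items that do not discriminate between weaker and stronger students is usually done with a discrimination index. A common simple form, sketched below with invented response data, is the proportion of a top-scoring group answering the item correctly minus the proportion of a bottom-scoring group doing so:

```python
# Hypothetical sketch: a simple item discrimination index, computed as
# (proportion correct in top group) - (proportion correct in bottom group).
# The response data below are invented for illustration.

def discrimination_index(top_group, bottom_group):
    """Each argument is a list of 1 (correct) / 0 (incorrect) responses."""
    p_top = sum(top_group) / len(top_group)
    p_bottom = sum(bottom_group) / len(bottom_group)
    return p_top - p_bottom

# One item, answered by the 10 strongest and 10 weakest students:
top = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]     # 9/10 correct
bottom = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # 3/10 correct
print(round(discrimination_index(top, bottom), 2))  # 0.6
```

An index near 0 (or negative) flags an item for revision or exclusion, since it fails to separate stronger from weaker candidates.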
To Make Tests More Reliable
• Provide uniform and non-distracting conditions
of administration
• Use items that permit objective scoring
• Provide a detailed scoring key
• Train scorers
• Identify candidates by number, not by name
• Employ multiple, independent scoring
Measuring Reliability
• Test-retest reliability: administer the same test
twice to the same group of students.
• Equivalent-forms reliability/parallel-forms
reliability: administering two different but equivalent
tests to a single group of students (e.g. Forms A and B)
• Internal consistency reliability: estimate the
consistency of a test using only information internal
to the test, available from one administration of a single
test. This procedure is called the split-half method.
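The split-half method can be sketched as follows: split each student’s responses into odd- and even-numbered halves, correlate the two half-scores, then apply the Spearman-Brown formula to correct the half-test correlation up to full-test length. The item matrix below is invented illustration data:

```python
# Illustrative sketch with invented data: split-half reliability with the
# Spearman-Brown correction.

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(item_matrix):
    odd = [sum(row[0::2]) for row in item_matrix]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_matrix]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)  # Spearman-Brown correction

# Rows = students, columns = items scored 1 (correct) / 0 (incorrect)
scores = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 2))
```

The correction is needed because each half is only half as long as the real test, and shorter tests are less reliable.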
Validity
• Criterion-related validity: the degree to which results
on the test agree with those provided by some
independent and highly dependable assessment of
the candidates’ ability.
• Construct validity: a construct is any theory,
hypothesis, or model that attempts to explain
observed phenomena in our universe and perception;
proficiency and communicative competence are
linguistic constructs; self-esteem and motivation are
psychological constructs.
Reliability Coefficient
• The reliability coefficient allows us to compare the
reliability of different tests.
• Lado: vocabulary, structure, reading (0.90-0.99),
auditory comprehension (0.80-0.89), oral production
(0.70-0.79)
• Standard error: how far an individual test taker’s actual
score is likely to diverge from their true score
• Classical analysis: gives us a single estimate for all
test takers
• Item Response theory: gives estimate for each
individual, basing this estimate on that individual’s
performance
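The standard error described above is conventionally computed from the score spread and the reliability coefficient. A minimal sketch, with an invented standard deviation and reliability value, of the classical formula SEM = SD x sqrt(1 - r):

```python
# Minimal sketch of the standard error of measurement (SEM):
# SEM = SD * sqrt(1 - r), where SD is the standard deviation of the
# observed scores and r is the test's reliability coefficient.
# The SD and reliability values below are invented for illustration.

import math

def standard_error_of_measurement(sd, reliability):
    return sd * math.sqrt(1 - reliability)

# With SD = 10 score points and reliability r = 0.91, a test taker's
# true score lies within about one SEM (3 points) of the observed
# score roughly 68% of the time, assuming normally distributed error.
print(round(standard_error_of_measurement(10, 0.91), 1))  # 3.0
```

Note how higher reliability shrinks the SEM: a perfectly reliable test (r = 1) would have no measurement error at all.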
Validity
• The extent to which the inferences made from
assessment results are appropriate, meaningful and
useful in terms of the purpose of the assessment.
• Content validity: the test requires the test taker to
perform the behaviour that is being measured.
• Content validity: its content constitutes a
representative sample of the language skills,
structures, etc. with which it is meant to be
concerned.
Validity
• Consequential validity: accuracy in measuring
intended criteria, its impacts on the
preparation of test takers, its effects on the
learner, and social consequences of test
interpretation and use.
• Face validity: the degree to which the test looks
right and appears to measure the knowledge and
ability it claims to measure, based on the subjective
judgement of the examinees who take it, the
administrative personnel who decide on its use,
and other psychometrically unsophisticated observers.
Validity
Response validity [internal]
• the extent to which test takers respond in the way
expected by the test developers
Analytic Version of the Scale for Rating
Composition Tasks
• Organization (introduction, body, conclusion)
• Logical development of ideas
• Grammar
• Punctuation, spelling, mechanics
• Style and quality of expression
Holistic Version of the Scale for Rating
Composition Tasks
• Content
• Organization
• Language Use
• Vocabulary
• Mechanics
Personal Response Items
• The response allows the students to
communicate in ways and about things that
are interesting to them personally
• Personal responses include: self-assessment,
conferences, portfolios
Self-Assessment
• Decide on a scoring type
• Decide what aspect of students’ language
performance they will be assessing
• Develop a written rating scale for the learners
• The rating scale should describe concrete language
and behaviours in simple terms
• Plan the logistics of how the students will assess
themselves
• The students should practice the self-scoring
procedures
• Have another student/teacher do the same scoring
Conferences
• Introduce and explain conferences to the students
• Give the students the sense that they are in control
of the conference
• Focus the discussion on the students’ views
concerning the learning process
• Work with the students concerning self-image issues
• Elicit performances on specific skills that need to be
reviewed.
• The conferences should be scheduled regularly
Portfolios
• Explain the portfolios to the students
• Decide who will take responsibility for what
• Select and collect meaningful work.
• The students periodically reflect in writing on
their portfolios
• Have other students, teachers, or outsiders
periodically examine the portfolios.