Monitoring and Evaluation Final Paper
Short Questions
A test item bank is a repository of test items that belong to a testing program, together with
all information pertaining to those items. In most applications of testing and assessment, the
items are in multiple-choice format, but any format can be used.
Diagnostic feedback can be provided very quickly to each student on the items
answered incorrectly.
Differences in the degree to which students are familiar with using computers or
typewriter keyboards may lead to discrepancies in their performance on computer-assisted
or computer-adaptive tests.
2. Types of Essay Items?
An essay item that poses a specific problem for which a student must recall
appropriate information, organize it in a suitable manner, derive a defensible conclusion, and
express it within the limits of the posed problem, or within a page or time limit, is called a
restricted-response essay item. The statement of the problem specifies response limitations that
guide the student in responding and provide evaluation criteria for scoring.
Analyze relationships
An essay item that allows the student to determine the length and
complexity of the response is called an extended-response essay item. This type of essay is most
useful at the synthesis or evaluation levels of the cognitive domain. When we are interested in
determining whether students can organize, integrate, express, and evaluate information,
ideas, or pieces of knowledge, extended-response items are used.
Students are thus more likely to actively help each other learn.
4. Administration of test:
The test is ready. All that remains is to get the students ready and hand out the test. Here are
some suggestions to help your students prepare psychologically for the test.
7. Rotate distributions
13. Count the answer sheets, seal them in a bag, and hand them over to the concerned authority.
Validity:
Validity refers to how well a test measures what it is supposed or claimed to measure: the
degree to which a test measures what it is intended to measure.
Types of validity:
Internal validity
External validity
Reliability:
Reliability is the degree to which an assessment tool produces stable and consistent results: the
degree to which a test measures consistently, whatever it measures. Reliability is a
measure of the stability or consistency of a test score. It is the extent to which a test or research
finding is repeatable.
Long Questions
There are several rules we can follow to improve the quality of this type of written
examination.
Avoid distracters of the form "All of the answers are correct" or "None of
the answers is correct."
Teachers use these statements most frequently when they run out of
ideas for distracters. Students, knowing what lies behind such statements, are rarely misled by them.
Therefore, if you do use such statements, sometimes use them as the key answer. Furthermore,
if a student recognizes that there are two correct answers (out of 5 options), they will be able
to conclude that the key answer is "all of the answers are correct" without knowing
the accuracy of the other distracters.
Reliability of test:-
Definition
Reliability is the degree to which an assessment tool produces stable and consistent results: the
degree to which a test measures consistently, whatever it measures.
Reliability is a measure of the stability or consistency of a test score. It is the extent to which a
test or research finding is repeatable. For example, a medical thermometer is a reliable tool that
measures the correct temperature each time it is used. In the same way, a reliable math
test will accurately measure mathematical knowledge for every student who takes it,
and reliable research findings can be replicated over and over.
So reliability is the extent to which an experiment, test, or any measuring procedure shows the same
result on repeated trials.
Types of reliability
Split-Half Method:
A measure of internal consistency in which a test is split in two and the scores for each half of the
test are compared with one another. For example, a test may be split into odd- and even-numbered
items; if the scores on the two halves are similar, then the test is reliable.
This is done by comparing the results of one half of a test with the results from the other half. A
test can be split in half in several ways, e.g. first half and second half, or odd- and even-numbered
items. If the two halves of the test provide similar results, this suggests that the test
has internal reliability.
The reliability of a test can be improved using this method: for example, any items on
separate halves of the test which have a low correlation (e.g. r = .25) should either be removed or
rewritten.
The split-half method is a quick and easy way to establish reliability. However, it is only
effective with long tests in which all questions measure the same construct. This
means it is not appropriate for tests which measure different constructs.
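To make the computation concrete, here is a minimal sketch in Python; the student-by-item score matrix is hypothetical. Because correlating two half-length tests understates the reliability of the full-length test, the sketch applies the standard Spearman-Brown correction, 2r/(1 + r), to the half-test correlation.

```python
import numpy as np

def split_half_reliability(scores):
    """Estimate reliability by correlating odd- and even-numbered items.

    scores: 2-D array with one row per student and one column per item.
    """
    odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    # Spearman-Brown correction: the correlation between two half-length
    # tests underestimates the reliability of the full-length test.
    return 2 * r_half / (1 + r_half)

# Hypothetical data: 5 students answering 6 items (1 = correct, 0 = wrong).
scores = np.array([
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0],
])
print(f"Split-half reliability: {split_half_reliability(scores):.2f}")
```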
Test-Retest Method:
Test-retest reliability is the degree to which scores are consistent over time. It indicates the score
variation that occurs from testing session to testing session as a result of errors of
measurement. Each time the test is carried out, the results should be the same. Test-retest reliability
is a measure of reliability obtained by administering the same test twice, over a period of time,
to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to
evaluate the test for stability over time. It applies to the same people at different times.
The test-retest method assesses the external consistency of a test. Examples of appropriate
tests include questionnaires and psychometric tests. It measures the stability of a test over
time.
For example, a test designed to assess student learning in psychology could be given to a group
of students twice, with the second administration perhaps coming a week after the first. The
obtained correlation coefficient would indicate the stability of the scores.
A typical assessment would involve giving participants the same test on two separate
occasions. If the same or similar results are obtained, then external reliability is established. A
disadvantage of the test-retest method is that it takes a long time for results to be obtained.
Beck et al. (1996) studied the responses of 26 outpatients over two separate therapy sessions
one week apart and found a correlation of .93, therefore demonstrating high test-retest
reliability of the depression inventory. The timing of the retest is important: if the interval is too
brief, participants may recall information from the first test, which could bias the results.
Alternatively, if the interval is too long, it is feasible that the participants could have changed in
some important way, which could also bias the results.
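As an illustration, here is a minimal sketch in Python of correlating Time 1 and Time 2 scores to estimate stability; the scores are hypothetical and scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for the same eight students on two administrations
# of the same test, one week apart.
time1 = np.array([55, 62, 70, 48, 90, 75, 66, 81])
time2 = np.array([58, 60, 72, 50, 88, 77, 63, 84])

# The correlation between the two sessions estimates stability over time.
r, p = pearsonr(time1, time2)
print(f"Test-retest reliability: r = {r:.2f} (p = {p:.3f})")
```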
Inter-Rater Method:
Independent judges (two or more) score the test or experiment, and the scores are then compared
to determine how consistent the raters' estimates are.
Note: it can also be called inter-observer reliability when referring to observational research.
Here researchers observe the same behavior independently (to avoid bias) and compare
their data. If the data are similar, then the measure is reliable.
Where observer scores do not significantly correlate, reliability has not been established and must be improved.
For example, if two researchers are observing 'aggressive behavior' of children at a nursery, they
would each have their own subjective opinion regarding what aggression comprises. In this
scenario it is unlikely they would record aggressive behavior the same way, and the data
would be unreliable.
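A minimal sketch of the comparison, again in Python with hypothetical data: two observers' independent counts for the same ten children are correlated, and a coefficient near 1 would indicate that the raters agree.

```python
import numpy as np

# Hypothetical counts of aggressive behavior recorded independently by
# two observers watching the same ten children at a nursery.
rater_a = np.array([3, 0, 5, 2, 1, 4, 0, 2, 6, 1])
rater_b = np.array([2, 0, 6, 2, 1, 5, 1, 2, 5, 1])

# Inter-rater reliability as the correlation between the two observers'
# scores; a low value would mean their criteria for aggression differ.
r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"Inter-rater reliability: r = {r:.2f}")
```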
Validity of test
Definition
Validity refers to how well a test measures what it is supposed or claimed to measure: the
degree to which a test measures what it is intended to measure.
Aspects of validity
There are two aspects of validity:
Internal validity
Definition
The degree to which observed differences on the dependent variable are a direct result
of manipulation of the independent variable, not of some other variable; the instruments or
procedures used in the research measured what they were supposed to measure.
Example:
As part of a stress experiment, people are shown photos of war atrocities. After the
study, they are asked how the pictures made them feel, and they respond that the pictures
were very upsetting. In this study, the photos have good internal validity as stress producers.
External validity
The degree to which results are generalizable or applicable to groups and
environments outside the experimental setting.
The results can be generalized beyond the immediate study. In order to have external validity,
the claim that spaced study (studying in several sessions ahead of time) is better than cramming
for exams should apply to more than one subject (e.g., to math as well as history). It should also
apply to people beyond the sample in the study.
Types of Validity
Content validity
Content validity refers to the connections between the test items and the subject-related tasks.
The test should evaluate only the content related to the field of study in a manner sufficiently
representative, relevant, and comprehensible.
When we want to find out whether the entire content of the behavior/construct/area is represented in
the test, we compare the test tasks with the content of the behavior. This is a logical method, not
an empirical one.
Example
If we want to test knowledge of American geography, it is not fair to have most questions
limited to the geography of New England.
Construct validity
It is used to ensure that the measure actually measures what it is intended to measure, and not
other variables. Using a panel of experts familiar with the construct is one way in which this type
of validity can be assessed. The experts can examine the items and decide what each specific
item is intended to measure. Students can be involved in this process to obtain their feedback.
For Example
A women's studies program may design a cumulative assessment of learning throughout the
major. If the questions are written with complicated wording and phrasing, the test
inadvertently becomes a test of reading comprehension rather than a test of women's studies.
It is important that the measure actually assesses the intended construct, rather than an
extraneous factor. It implies using the construct correctly (concepts, ideas, notions).
Construct validity seeks agreement between a theoretical concept and a specific measuring
device or procedure.
For example
A test of intelligence nowadays must include measures of multiple intelligences, rather than
just logical-mathematical and linguistic ability measures. Construct validity is the degree to which a test
measures an intended hypothetical construct: the assessment actually measures what it is
designed to measure. A actually is A.
Criterion-related validity
Also referred to as instrumental validity, it states that the criteria should be clearly defined by
the teacher in advance. It has to take into account other teachers' criteria to be standardized,
and it also needs to demonstrate the accuracy of a measure or procedure compared to
another measure or procedure which has already been demonstrated to be valid. It has
two further types.
Concurrent validity
It is the degree to which the scores on a test are related to the scores on another,
already established test administered at the same time, or to some other valid criterion
available at the same time.
For example
A new, simple test is to be used in place of an old, cumbersome one which is
considered useful; measurements are obtained on both at the same time. Logically, predictive
and concurrent validation are the same; the term concurrent validation is used to indicate that
no time elapsed between measures. This assessment correlates with other assessments that
measure the same construct: A correlates with B.
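A minimal sketch of the idea in Python, with hypothetical scores: the same examinees take the new short test and the established test in the same session, and the correlation between the two sets of scores is the concurrent validity coefficient.

```python
from scipy.stats import pearsonr

# Hypothetical scores for eight examinees who took both the new short
# test and the established (cumbersome) test at the same time.
new_test = [14, 18, 11, 20, 16, 9, 17, 13]
old_test = [52, 66, 40, 74, 60, 35, 63, 49]

# A correlates with B: a high coefficient supports concurrent validity.
r, _ = pearsonr(new_test, old_test)
print(f"Concurrent validity: r = {r:.2f}")
```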
Predictive validity
It estimates the relationship of test scores to an examinee's future performance as
a master or non-master. Predictive validity considers the question: how well does the test
predict examinees' future status as masters or non-masters? For this type of validity, the
correlation that is computed is based on the test results and the examinees' later performance.
This type of validity is especially useful for test purposes such as selection or admissions. This
assessment predicts performance on a future assessment: A predicts B.
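Since the later status is a dichotomy (master vs. non-master), a point-biserial correlation is one natural way to compute the coefficient. A minimal sketch in Python with hypothetical data:

```python
from scipy.stats import pointbiserialr

# Hypothetical admission-test scores and the same examinees' later
# status (1 = master, 0 = non-master).
test_scores = [45, 72, 58, 90, 64, 38, 81, 55]
later_status = [0, 1, 0, 1, 1, 0, 1, 0]

# A predicts B: the correlation between test scores and later status.
r, p = pointbiserialr(later_status, test_scores)
print(f"Predictive validity: r = {r:.2f} (p = {p:.3f})")
```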
Essay questions differ from short-answer questions in that essay
questions are less structured. This openness allows students to demonstrate that they can
integrate the course material in creative ways. As a result, essays are a favoured approach to
testing higher levels of cognition, including analysis, synthesis, and evaluation. However, the
requirement that the students provide most of the structure increases the amount of work
required to respond effectively. Students often take longer to compose a five-paragraph
essay than they would to compose a paragraph-length answer to a short-answer question.
Essay items can vary from very lengthy, open-ended, end-of-semester term
papers or take-home tests that have flexible page limits (e.g. 10-12 pages, no more than 30
pages, etc.) to essays with responses limited or restricted to one page or less. Essay questions
are used both as formative assessments (in classrooms) and summative assessments (on
standardized tests). There are two major categories of essay questions: short response (also
referred to as restricted or brief) and extended response.
Example 1:
List the major similarities and differences in the lives of people living in Islamabad and
Faisalabad.
Example 2:
Compare the advantages and disadvantages of the lecture teaching method and the
demonstration teaching method.
Analyze relationships
Example:
Identify as many different ways to generate electricity in Pakistan as you can. Give the
advantages and disadvantages of each. Your response will be graded on its accuracy,
comprehensiveness, and practicality. Your response should be 8-10 pages in length and will be
evaluated according to the rubric (scoring criteria) already provided.