Baxter, Evaluating Your Students, Chapter 4

CHAPTER 4  Testing: What makes a 'good' test good?

As we said in the Introduction, the easiest and most common form of assessment is to give the students a test. However, while writing a quick little test may appear easy, it is very difficult to write a good test. How much time and effort you invest in your test will depend on how important the result is to the student. If you want to know whether a student knows seven vocabulary items referring to transport, this is a simple test to write. The result isn't very important. For example:

Write five more words in the same category: car, bus, ...

But if you are going to use the test to decide whether someone will repeat a school year, or will be able to go to university or not, the test obviously needs to be much better. These kinds of exam are normally written by international exam boards or by the state, simply because they are so complicated to make and score.

"So what is a good test?"
A good test has the following qualities:
... it is valid
... it is reliable
... it is practical
... it has no negative effects on the teaching programme (negative BACKWASH).

Validity
There are three main types of validity:
CONTENT VALIDITY
CONSTRUCT VALIDITY
FACE VALIDITY

TASK
Before reading the sections below, look at the three types of validity above and have a guess what each type of validity means.

"What is content validity?"
CONTENT VALIDITY means: Does a test test what it is supposed to test?

For example, if we want to test whether a class of beginners can produce examples of the present simple for describing routines, we must make sure that:
... the questions are on the present simple for routines (and not, for example, present simple for future)
... we test the verbs that beginners are likely to know
... we ask the students to produce the answer, and not just recognise the answer by, say, using multiple choice.

In other words, the questions we ask must be a representative sample of a beginner's whole ability to produce the present simple for routines.

It is easier to make the content of a test valid when we are trying to test small items like these. But CONTENT VALIDITY is more difficult to assure when we are testing a student's global abilities, as in a PROFICIENCY TEST. > SEE PAGE 8

Let us look at a typical (lower) intermediate exam of general English. What structures do exams at this level typically test - and therefore assume are representative of a level of knowledge of English in general?
... modal verbs: can, must, don't have to
... present perfect with for
... future: will vs going to
... -ed vs -ing adjectives
... -ing form after verbs of liking and enjoyment
... too + adjective / not + adjective + enough
... simple passives, etc.

Yet it could test a number of other things, e.g.
... topic/comment sentences, e.g. That car - it was awful.
... colloquial English, e.g. He gets on my nerves.
... compound nouns, e.g. table leg vs the back of the book
... speed of delivery, e.g. average number of words per minute
... average sentence length
... turn-taking in conversation skills.

In other words, a test, especially a test of general English, cannot test everything. So we must choose a selection of things to test that we think are representative of a student's ability in knowing/using a (particular part of) language.

Note: Some skills are more difficult to test than others. Testing the passive is easier than, say, testing turn-taking in conversation. Similarly, some question types are easier to write than others. For example,
you can listen to English for days without hearing anyone use reported speech, but it appears in lots of tests - not in order to test reported speech, but simply because it is useful for testing the student's ability to manipulate the tense system and (in questions) word order. They are very easy questions to write.

"What is construct validity?"
CONSTRUCT VALIDITY means: Does the test test what it's supposed to test and nothing else?

Normally we try to test one of the following:
... grammar (i.e. structure, vocabulary and pronunciation)
... skills (i.e. reading, writing, listening and speaking).
But it is sometimes very difficult to test one of these without also testing others.

TASK
Look at the following test question, then compare your answer with the notes on page 20.

Fill the gap with an appropriate verb in the correct form.
1  Mr Smith normally ___ a red Mercedes.

What must the student know in order to answer this question correctly? What exactly are we testing?

The student must
... be able to read and understand the instructions. We are assuming that he/she understands the vocabulary (appropriate, verb, form) and has some reading skills (e.g. he/she may speak English, but be illiterate).
... know the required vocabulary - and guess what the teacher wants. We may be trying to test if the student knows the verb drive, but he/she could use the verb have or have got.
... know the tense system. We want the 3rd person -s.
... also know what a 'Mercedes' is - we are using assumed cultural knowledge which the student may not have. Suppose in their country 'Mercedes' is a make of bike. Would we accept rides?
... also know some teacher-shorthand. We assume that if we write normally, the student will know that we want a present simple. But the student could write any of the following correct answers: drove, used to drive, will drive, has driven, should have driven, etc. (Although the word order should help them to choose.)

So if the student answers:
1  Mr Smith normally puts a red Mercedes.
how correct is this? How many marks do we give him/her?
1 mark for form? (the third person -s form is right)
1 mark for filling the gap? (he/she understood the instruction)
0 marks for appropriacy? (the verb is wrong)

"What is face validity?"
FACE VALIDITY means: Does the test appear to test what it is trying to test?

For example, imagine that we do lots of research and we find that, amazingly, the size of a student's feet is directly related to language learning aptitude. We find that shoe size is a better predictor of level than our own PLACEMENT TEST. If this were true, it would make sense for us to throw away our PLACEMENT TEST and instead simply ask students What's your shoe size? Students, and parents, would immediately complain because there is no apparent link between shoe size and language ability.

In other words, there is a kind of psychological factor involved in testing. The test must appear to have something to do with the skill you are trying to test.

How to make tests valid: Content validity
• Before you write a test, write down what you want to test.
• Do you want the students to recognise or produce the answer?
• Remember that one form (e.g. structure or vocabulary item) may have a number of different meanings.
• Remember that each structure or vocabulary item may have a number of different forms (singular, plural; questions, negatives, etc.; 1st, 2nd persons, etc.).
• If you are testing a SYLLABUS, see what percentage of the SYLLABUS is given to each skill, form and meaning.
• If you are writing a general test, decide for yourself which skills, etc. are most important.

You may find it useful to fill in a chart like the one below and on PHOTOCOPIABLE PAGE 1. Follow this procedure.
1  Make a list of the teaching items on the SYLLABUS. An item might be the present simple, inviting or a vocabulary area.
2  Then look at the amount of time the SYLLABUS or coursebook suggests spending on each item. Divide this by the total number of course hours or coursebook units and multiply by 100: this will tell you what percentage of the course each item represents. (This method is, of course, only relevant if you are devising a test or tests for the whole SYLLABUS.)
3  Then look at what forms of each item you have covered, e.g.
   Item: present simple - 1st/2nd/3rd person? singular/plural? questions/statements/negatives?
   Item: vocabulary for food - singular/plurals? spelling? associated structures like un/countables? etc.
   Item: inviting - Which exponents? Which answers? (e.g. Yes, I'd love to.)
   Item: skim-reading - What length of text? What speed? What text type?
4  Then decide if the SYLLABUS expects the students simply to be able to recognise these items correctly used, or to be able to produce them.

You can fill out a kind of grid, e.g.

Syllabus item                    | What exactly are we teaching?                                    | Percentage of syllabus | Recognise | Produce | Number of items in test
Grammar: present simple          | all persons; statements, questions, negatives and short answers | 5                      | yes       | yes     | 8/50
Vocabulary: food                 | countable/uncountable; singular/plural                          | 5                      | yes       | yes     | 2/50
Communication/function: inviting | invitations; accepting/refusing                                  | 5                      | yes       | yes     | 3/50

How to make tests valid: Construct validity
TIP: When you have completed the chart, you should try and make the number of questions and marks on each item match the percentages. Imagine, for example, that your complete beginner SYLLABUS recommends that you spend 15% of the time covering the present simple, and tells you that the students should be able to recognise the correct forms. You then look at the test and find that 50% of the questions relate to the present simple, and that it includes gap-fills with no suggested answers. There is a clear mismatch between SYLLABUS and test.
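The arithmetic behind the chart is worth seeing laid out. Below is a minimal sketch in Python (not from the book; the course hours, the item list and the 50-question test length are invented figures) of how the percentage calculation in step 2 turns into a suggested number of items per syllabus point:

```python
# A rough sketch (not from the book): turning syllabus time into suggested
# numbers of test items. The course hours, items and 50-question test length
# are invented figures for illustration.

course_hours = 90          # total teaching hours on the syllabus (invented)
test_length = 50           # total number of questions in the test (invented)

hours_per_item = {         # hours the syllabus suggests spending on each item
    "present simple": 4.5,
    "food vocabulary": 4.5,
    "inviting / accepting / refusing": 4.5,
}

for item, hours in hours_per_item.items():
    percentage = hours / course_hours * 100            # share of the course
    suggested = round(test_length * percentage / 100)  # matching share of the test
    print(f"{item}: {percentage:.0f}% of the course -> about {suggested}/{test_length} items")
```

If the counts a calculation like this suggests differ from your draft test as sharply as in the 15%/50% mismatch described above, the test needs rebalancing.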
Instructions
The easiest way in a monolingual classroom of removing any complications with instructions is to write them in the students' own language. On the other hand, test instructions are classroom-authentic items of the target language, and the ability to understand them becomes important if the students are to take international exams. A useful half-way point is to put both L1 and target-language instructions side by side, and to move gradually towards the target-language ones over a number of years.

Remember to tell the students how many marks, or what percentage of their total score, each item/section is worth. This gives the student the responsibility of allocating an appropriate amount of time and effort!

Testing two things at the same time
Look at each question and check what you are trying to test. We can limit the student to make sure we are testing only the part we want to test, e.g.
1  Normally, Mr Smith ___ a red Mercedes.
2  Normally, Mr Smith ___ a red Mercedes. (drive)
In the second example, we must decide if we are testing drive or -s. If we give them drive, we are testing -s; if we don't, perhaps we should give a half-mark for drive, as the student has chosen the right vocabulary item. Remember that drove is also correct.

How to make tests valid: In general
If more than one teacher is marking the exam, you will need an answer key and marking guide with all the possible answers. > SEE SCORER RELIABILITY, PAGE 26

If possible, when you have written your test, check it with other people - native speakers, other teachers, and other students. Ideally, you should trial it with another class at the same level, and see if it gives results which are similar to other test results and the teacher's gut reaction. If you can't trial all of it, give the other class half of it, e.g. every other question.

Finally, check that your test looks like it is testing what you intend it to test. > SEE FACE VALIDITY, PAGE 20

For all these reasons, choosing some form of DIRECT TESTING (SEE PAGE 30) will normally automatically give you a more valid test (if the instructions are easily understood).

Reliability
There are two main forms of reliability:
TEST RELIABILITY
SCORER RELIABILITY

"What is test reliability?"
TEST RELIABILITY means: If it was possible to give the same person the same test at the same time, would the result be the same?

Imagine you want to see how well people can play darts. You ask them to hit the bull's-eye. How many darts would they need to throw to convince you of their level of dart playing? Three? Five? Ten? For this example, we will choose five.

Now suppose you want to test if a student 'knows' the present simple. How many questions would you ask? We chose five for darts, so, if we want to test the third person he, we would need five questions or test items like:
He ___ to the cinema every day. (go)
On Tuesdays, he ___ to go to the cinema. (like)
Note: Because of CONTENT VALIDITY, we have to give them the base verb or we are also testing vocabulary.

But we can't assume that the student knows she as well as he. So we would need five she test items as well. And what about plurals? And names, as well as pronouns? And also things, possibly both concrete and abstract. In fact, we would need five test items for each of the following:
I, you, he, she, it, John, building, we, you, they, John and Mary, ideas.

In addition, there are at least four forms of the present simple: statements, questions, negatives and negative questions. (To make it simple, we will exclude question tags, yes/no answers, etc.)

This means that, to test whether the student knows the present simple, we would theoretically need to ask the following:
12 subjects (I/you/he/she/it/we/you/they/John/building/John and Mary/ideas) x 4 forms (affirmative/negative/question/negative question) x 5 examples of each.
This gives us a test with 12 x 4 x 5 = 240 questions.

Remember that here we are only testing structure. The present simple can have many different meanings, apart from routine actions, including:
... universal truths (The sun rises in the East.)
... commentary or present historic (Jones shoots and he scores! 2-1!)
... futures (Your train leaves at six a.m. tomorrow.)
Imagine we wanted to test routine actions and these three other different meanings at the same time: we would need 240 x 4 questions - 960 in total!

However, we have to realise that this is totally impractical. So we have to compromise and select some of the possible questions we could ask. Out of the 240 possible questions we might ask 10 or 20. The problem is which 10 or 20 do we ask? We must hope that the sample we choose is representative.

Let us imagine two students: A and B. A only knows 20 answers out of the 240. B knows 220 answers out of the 240. It is therefore possible that lucky A might score 20/20, because we only ask the 20 questions he/she knows, but unlucky B might score 0/20, because we ask him/her only the 20 questions he/she doesn't know. So here we are asking: How representative is our selection of questions out of all the questions possible?
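The sampling problem is easy to demonstrate. Here is a minimal sketch (our own illustration, not the author's: the figures 240 and 20 and the knowledge of students A and B come from the example above, while the random sampling is added) of how the choice of 20 questions can move a student's mark:

```python
import random

# Students A and B from the example above: A knows 20 of the 240 possible
# present simple items, B knows 220 of them. How do they fare on different
# randomly sampled 20-question tests?
POOL, TEST_LENGTH = 240, 20
random.seed(0)                                   # fixed seed, reproducible illustration

knows_a = set(random.sample(range(POOL), 20))    # the 20 items A happens to know
knows_b = set(random.sample(range(POOL), 220))   # the 220 items B knows

def score(known, test):
    """One mark for each sampled question the student knows."""
    return sum(q in known for q in test)

for i in range(5):                               # five different 20-question tests
    test = random.sample(range(POOL), TEST_LENGTH)
    print(f"Test {i + 1}: A {score(knows_a, test)}/20, B {score(knows_b, test)}/20")
```

With random sampling the extreme outcome described above (A scoring 20/20 while B scores 0/20) is very unlikely, but each student's mark still varies from one sampled test to the next, and the shorter the sample, the more the result depends on which questions happen to be asked.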
Given our resources, there will always need to be a compromise between making our test long enough to be reliable but also short enough to be practical.

On the other hand, we must also make tests long enough to give enough samples to measure. For example, we can't test paragraphing skills unless the piece of writing is long enough to require paragraphs, and we can't test talking skills unless there is time to talk, interrupt, request information, and so on. Sometimes we also need to make tasks sufficiently complex. You can't test a student's ability to compare two possible choices and make an informed decision in a two-minute conversation about their summer holidays.

Of course, there has been a lot of research on TEST RELIABILITY and how to measure it, but it is all extremely complex and time-consuming. It is unrealistic to expect schools to have the time and resources to make a test totally reliable. Instead, it may be best to accept that almost every test we design will have limited reliability. It will simply be a guide to the teacher when it comes to giving a final assessment of any student's abilities. And, if we are honest, if we think a student has under-performed in our reliable test, we are often still tempted to find a few marks and help them pass anyway!

How to make tests more reliable

Get enough examples
• See the points about the number of questions and task complexity above.
• Give the students fresh starts. If they don't like the essay topic or question type, or if they feel they are making a mess of this question, they may not perform as well as they can. You need to let them start again on a fresh task. Compare Test A and Test B below.

Test A
Write a letter to an aunt who is borrowing your family's house for a holiday. Tell her how your holiday is going and describe what there is to do in the area if she gets bored. As you are the only person in your family who knows how the video works, your parents have asked you to explain to her how to change channels and how to record a programme on a different channel from the one she is watching. (250 words)

Test B
a  Some relatives are staying in your house while you are on holiday. You are the only person in your family who knows how the video works. Your parents have asked you to write a short note telling them how to change channels and how to record a programme on a different channel from the one they are watching. Write a short note explaining how to do it. (75 words)
b  Write a short postcard to an aunt to tell her how your holiday is going. (75 words)
c  During the school holidays, you and your parents have moved to a new house in a different part of your city. You are writing to a friend who is away with his/her parents on their summer holidays. Write one paragraph from the letter describing what the new part of the city is like, and telling him/her what there is to do there. (75 words)

As you can see, the tasks are very similar, but Test B gives the student three fresh attempts at the same task, and also allows you to test a wider range of social styles, audiences and text types.

Testing techniques
Make them:
• varied - don't use only one technique to measure.
For example, don't use just gap-fills, but also other techniques such as multiple choice. However, don't give the answer to question 10 in question 24, e.g.

Write the appropriate question word.
10  ___ does he go on Friday nights?
Fill the gap with the appropriate form of the verb.
24  Where ___ he ___ on Tuesday mornings? (go)

• familiar - the students may not perform well if they have to learn new question types in the middle of the test. Students should have met the question type before. For example, if you normally only do true/false listening comprehension tasks, the following would confuse a student.

Listen to the tape and decide if the information is true (T), false (F) or not given (NG).

Instructions
Make them:
• clear
• at the appropriate level of language. Teachers rarely teach the word gap or suitable to beginners, but they often use them in the test instructions. Remember to use the students' L1 if necessary. If not, you may be testing their instruction-reading skills instead of what you are actually trying to test.

Restrict the task
All the students should have the same chance. Look at the following compositions.

Computers
How can computers help us?
How can computers help people at work?
How can computers help the police, fire and ambulance services in emergencies?

Obviously the last task is the most restricted and will allow you to see differing ability better than the first. In addition, if you give a general topic and the student has no ideas, you are testing creativity as well as English. > SEE CONSTRUCT VALIDITY, PAGE 19. They may need a fresh start. > SEE PAGE 24

Keep conditions comparable
Make sure two different groups take the test under the same conditions. The instructions must be the same. Do you pause the tape between plays? How long for? Is there distracting background noise? Can they cheat? Do you give them a minute to let them finish after Time's up!, or do you say Pens down now!?

TASK
Look at one of the end-of-year tests from your school. How reliable do you think it is? Think of at least one way of making it more reliable in the same amount of time.

"What is scorer reliability?"
SCORER RELIABILITY means: If you gave the same test to two different people to mark, would they give the same score?

TASK
Look at the three answers to exercises below. What kind of test is each from? What kind of problems will you have when marking them for a) your class? b) another teacher's class?

Example 1:  1 a   2 b   3 d   4 a   5 d
Example 2:  He ___ to the cinema.
Example 3:  Jonh get tp and open de eyes. The sun is shinning and de brids are siniing. He tink, is a beautiful day. Today, I no go to work, but I go to the beach! But when he is driving in her car, he see Mr Smith, his boss, who saw him. Why, you not in work?!
(Example 3 is a student's answer reproduced with its errors.)

Example 1: Multiple-choice tests
It is easy to see that Example 1 is much easier to mark than Example 3. In fact, a computer can mark Example 1. But as we explained in CONTENT VALIDITY (PAGE 18), multiple-choice exams are much more difficult to write. And there are some skills - like writing - where testing by multiple choice causes problems with validity. Another major drawback with multiple-choice exams is that the results don't help the student to learn. Neither teacher nor learner can get any useful information about why the learner's answer was right/wrong or successful/unsuccessful.
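To make the point about Example 1 concrete, here is a minimal sketch (the answer key and the student's answers are invented for illustration) of the kind of marking a computer can do with a multiple-choice paper:

```python
# Marking a multiple-choice answer sheet is just comparison against a key.
# Both the key and the student's answers below are invented examples.
answer_key = {1: "a", 2: "b", 3: "d", 4: "a", 5: "d"}
student_answers = {1: "a", 2: "c", 3: "d", 4: "a", 5: "b"}

marks = sum(student_answers.get(q) == right for q, right in answer_key.items())
print(f"Score: {marks}/{len(answer_key)}")   # Score: 3/5
```

Note that the score is all the marking produces: it says nothing about why an answer was wrong, which is exactly the drawback described above.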
Example 2: Limited possibility tests
There are only a limited number of possible correct answers for Example 2. However, as we saw in CONSTRUCT VALIDITY (PAGE 19):
... there are often more possible answers than we anticipated when we wrote the question
... students can give answers which are partially correct and partially wrong.

Use an answer key or a marking guide: give a list of acceptable answers and a marking scheme (i.e. can you give half-marks? If so, what for?). But if more than one teacher is marking the tests, remember that you may need several meetings to add new acceptable answers to the list, or to alter the marking scheme. For example, all the answers below are possible.

Ex. 2  He ___ to the cinema.
goes / went / will go / has been / will have gone / would like to go ...

Example 3: Multiple possibility tests
Example 3 is much more difficult to mark.
• Teacher A might notice that all the punctuation and present progressive forms are right.
• Teacher B might notice that all the present simples are wrong and the spelling is terrible.
• Teacher C might think this is very creative and fluent.
• Teacher D might count 17 mistakes in 57 words (i.e. 30% wrong).
• Teacher E might say that this is a good essay for a beginner but bad for an intermediate student.

"How can we improve scorer reliability in these cases?"
You may need more than one teacher involved in the marking, for two reasons. First, if more than one teacher is administering the exam, it is very important that all the teachers are marking it in the same way. Second, we are now judging the student's work rather than counting it - and even dancers are judged by a panel!

"How can we do this?"
The most important action is to negotiate and agree on the criteria you will all judge the answer by. This could be done by agreement on PROFILING (SEE PAGE 49): breaking down the answers you want into either their component parts, like spelling, punctuation, structure, cohesion, or other criteria, such as organisation, relevance, etc.; and/or BANDING (SEE PAGE 51): marking according to overall impression. We will return to this in a later chapter.

Some teachers will say that there is no time to have meetings or read documents to make sure they are marking the test in the same way as the other teachers. But if one teacher is marking the same test in a different way, everyone's time is wasted! The results are simply not of any use, because they are not comparable. So:
... the students have wasted their class time doing the test
... the teachers have wasted their time marking the test
... the school's administration has wasted its time recording the results
... the school is open to complaints from parents whose children will compare results on the way home: I put the same thing as he did, but he got it right and I got it wrong!

Of all the qualities of a good test, SCORER RELIABILITY is the only one that non-experts understand. Ignoring SCORER RELIABILITY is a false economy!

TASK
Look at a current test used in your school. Are there clear and unambiguous marking instructions? How would you improve them? Now try to answer all the questions as if you were the 'Student from Hell'. Answer all the questions as unco-operatively as possible!
a  Make sure that every answer you write is possible, but
b  either not what the teacher actually wanted you to write (she was trying to test something else), or likely to cause the teacher other marking difficulties!
How would you change the test, but make it take the same amount of time?
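As a concrete illustration of the PROFILING idea above, here is a minimal sketch (the criteria, weights and scores are invented, not taken from the book) of how agreed component criteria turn two markers' impressions of the same Example 3 answer into comparable figures:

```python
# PROFILING sketch: mark against agreed component criteria instead of a single
# overall impression. The criteria, weights and scores are invented examples.
weights = {"spelling": 0.2, "punctuation": 0.1, "structure": 0.3,
           "cohesion": 0.2, "relevance": 0.2}      # weights sum to 1.0

def profile_mark(scores):
    """Weighted mark out of 10, given a 0-10 score for each agreed criterion."""
    return sum(weights[c] * scores[c] for c in weights)

# Two teachers marking the same answer against the same agreed criteria:
teacher_a = {"spelling": 3, "punctuation": 8, "structure": 4, "cohesion": 6, "relevance": 7}
teacher_b = {"spelling": 2, "punctuation": 7, "structure": 4, "cohesion": 6, "relevance": 8}

print(round(profile_mark(teacher_a), 1),
      round(profile_mark(teacher_b), 1))   # 5.2 and 5.1 - close despite different impressions
```

BANDING, by contrast, maps the whole answer onto a single agreed descriptor band; either way, the essential point is that the criteria are negotiated before anyone starts marking.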
Practicality
"What is practicality?"
Perhaps the most important quality of any test is how practical it is to administer. While we may want to have 1,000 questions, or give the students their own video recorder for a listening test, we simply do not have the resources - that is, the time, personnel, space, equipment or money.

TASK
Read the following (idealised) list of resources required for writing good tests. Which does your school have? Which could your school possibly arrange or get? Which are impossible for your school to arrange or get? What effect will this have on your school's testing system?

Time will be needed for:
... teachers designing the test
... trialling it on sample groups
... students doing the test
... teachers marking the papers
... teachers analysing the results (e.g. how successful are the DISTRACTORS? > SEE PAGE 36)

Writing a test which is valid and reliable requires:
... teachers who are experienced in test-writing
... teachers who are expert statisticians
... teachers to attend pre-marking standardisation sessions
... teachers to mark the tests
... co-ordinators to answer questions about alternative answers

Space and equipment:
... students need to be sitting where they can't copy (especially in multiple-choice tests)
... they may need different tables (e.g. one desk per person)
... teachers need good audio/video tape players with counters and pause/replay buttons
... they may need calculators or computers to record and analyse the results

Money for:
... extra staff
... extra space
... extra equipment.
(However, this money is probably not available.)

Backwash
"What is backwash?"
BACKWASH - sometimes called washback - refers to the effect that a final test has on the teaching programme that leads to it. This is a familiar experience, and is also sometimes called teaching to the test.

For example, the school SYLLABUS/objectives tell the teacher to teach fluency, but the school's final test is, say, a multiple-choice grammar and vocabulary test. Most teachers want their students to pass the test - possibly the teachers will have their teaching performance assessed on the basis of the students' success (or lack of it) - therefore most teachers will teach grammar and vocabulary rather than fluency.

Sometimes the effect of this BACKWASH can improve the teaching programme: this is called beneficial BACKWASH. For example: the school management notices that students at the end of the teaching programme know their grammar but cannot speak the target language. They decide on radical action! They drop all grammar items in the test and instead introduce interviews on video by other teachers of the target language. Teachers therefore change their teaching to give more emphasis to the speaking skill.
