
Muhammad Abdullah Bin Mohd Fazli

2020802098

MU3752 MEASUREMENT & EVALUATION IN MUSIC EDUCATION

Assignment 2

‘Measurement ascertains reliability and validity in educational assessment’.

Introduction

Numerous public schools around the country are working to foster a culture of data, placing heavy emphasis on linking the use of data across the classroom, school, and district. The largest difficulty in this integration is deciding what data would provide an appropriate picture of a school's goals, keeping in mind that data, like a weather vane, will inevitably shift. This is especially true when the aims of the school are not stated in a way that lends itself to concrete analysis; a mission defined as "improving school climate," for instance, must still be taken into consideration. If a school is concerned with raising student literacy, one question to consider could be: which student groups routinely perform below the proficiency cutoffs on standardised English exams? A school that wishes to promote a supportive and inclusive environment might pose a question such as: do teachers treat children differently based on their race, religion, gender, ability, or any other difference?


Like the questions asked in domains such as psychology, economics, and education, these targeted queries can be thought of as questions in search of answers in the form of research. While a question on its own may not always tell you which instrument (such as a standardised exam or student survey) is best, it does help to point in the right direction.

In order to measure something accurately, it is essential to assign scores to individuals that reflect some feature of those individuals. Constructs such as intelligence, self-esteem, depression, and working memory capacity cannot be observed directly, so it is not self-evident that test scores actually represent them. To establish what their test scores measure, researchers conduct studies to validate that the scores are in fact measures of these constructs. This touches on an extremely crucial point: psychologists do not simply presume that their measures work. They gather data to demonstrate that the measures are effective, and if they have no proof that a measure is effective, they drop it.

For the sake of example, assume you have been on a diet for a month. Several friends have commented that you seem looser in your clothing. If you weigh yourself and find that you have lost ten pounds, you will keep using the scale. If it indicated instead that you had gained ten pounds, you would infer that the scale was broken and discard it. Reliability and validity are the two major characteristics that psychology, economics, education, and other fields of inquiry take into consideration when evaluating measurement tools.


Reliability and validity are closely related, but they mean different things. A measurement can be reliable without being valid; however, if a measurement is valid, it is usually also reliable. According to Middleton (2019), reliability and validity are concepts used to evaluate the quality of research: they indicate how well a method, technique, or test measures something. Reliability is about the consistency of a measure, and validity is about the accuracy of a measure. Similarly, according to Drost (2011), reliability is the extent to which measurements are repeatable when different persons perform the measurements, on different occasions, under different conditions, with supposedly alternative instruments that measure the same thing. In sum, reliability is consistency of measurement (Bollen, 1989), or stability of measurement over a variety of conditions in which basically the same results should be obtained (Nunnally, 1978).

Conceptually, validity is the extent to which a measure accurately depicts the underlying construct it is expected to measure (Drost, 2011). The term construct here refers to the knowledge, skill, character, or attitude investigated by the researcher. If researchers wished to assess music appreciation in music training, for example, it would be essential to know whether the metric appropriately measures the value placed on music before the research results could be acknowledged as valid.
Reliability

Reliability is important when measuring attributes or conduct via a psychological test (Rosenthal and Rosnow, 1991). To understand how a test operates, for example, it is vital that the test discriminate among people consistently, whether at one time or over a period of time. Reliability, then, is the extent to which measurements may be repeated: when different people carry out the measures, on various occasions, under different situations, with presumably alternative devices that measure the same thing. In short, reliability, following Nunnally (1978), is stability of measurement over a variety of conditions in which basically the same results should be obtained.

Random measurement errors affect the data gained from behavioural study. Measurement errors occur either as systematic errors or as random errors. The bathroom scale is a wonderful example (Rosenthal and Rosnow, 1991). A systematic error would be at work if you weighed yourself on the bathroom scale frequently and it gave you a consistent weight, but one that was always ten pounds more than it should have been. A random error would be at work if the scale was correct but you misread it when weighing, so that on some occasions you read your weight as slightly higher, and on others slightly lower, than it actually was. These random errors cancel out, on average, across repeated measurements of one person. Random error also propagates as noise in researchers' measurements; because it tends to cancel out, it is usually treated as unimportant and ignored. In an interview, for example, a person's knowledge or the circumstances of daily life will influence the quality of the interview a researcher conducts: mood, integrity, knowledge, and many other factors can affect it.


Systematic errors, however, do not cancel out; they contribute to the mean score for all the investigated individuals, making the mean value either too large or too small. Thus, if a person measured himself on the same scale on a recurring basis, he would not always get exactly the same weight, but if the small fluctuations were random, they would cancel, and the person's weight could be calculated by averaging the results. If, however, every reading were ten pounds too high, this systematic error could not be cancelled by averaging; it could only be corrected, by subtracting ten pounds from the average. Systematic error is the major threat to validity, because it is caused by variables that affect all observations of the construct over the whole sample. Systematic errors therefore constitute a measurement bias and should be addressed to provide better findings. There are several types of reliability, for example:

1) Test-retest reliability

It is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time. For instance, a test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the stability of the scores.
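As an illustration, here is a minimal sketch of that calculation in Python; the scores are invented, and the only assumption is that test-retest reliability is summarised by the Pearson correlation between the two administrations.

```python
# Minimal sketch: test-retest reliability as the Pearson correlation
# between two administrations of the same test (scores are invented).
from scipy.stats import pearsonr

time1 = [72, 85, 90, 64, 78, 88, 70, 95]  # first administration
time2 = [70, 88, 87, 66, 75, 90, 72, 93]  # same students, one week later

r, _ = pearsonr(time1, time2)
print(f"Test-retest reliability: r = {r:.2f}")  # values near 1 indicate stable scores
```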

2) Parallel forms reliability

It is a measure of reliability obtained by administering different versions of an assessment tool, both containing items that probe the same construct, skill, knowledge base, etc., to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions. For example, to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions into two sets, which would represent the parallel forms.
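A hedged sketch of this splitting procedure follows, with an invented 0/1 response matrix standing in for real critical-thinking data.

```python
# Sketch of the parallel-forms procedure: randomly split an item pool
# into two forms, score each person on both, and correlate the totals.
# The 0/1 responses here are random placeholders; with real data from a
# coherent item pool, the correlation should come out clearly positive.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 50, 20
responses = rng.integers(0, 2, size=(n_people, n_items))  # 1 = item correct

shuffled = rng.permutation(n_items)             # shuffle the item pool
form_a, form_b = shuffled[:10], shuffled[10:]   # the two parallel forms

score_a = responses[:, form_a].sum(axis=1)
score_b = responses[:, form_b].sum(axis=1)
print(f"Parallel-forms r = {np.corrcoef(score_a, score_b)[0, 1]:.2f}")
```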

3) Inter-rater reliability

It is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed. For example, inter-rater reliability might be employed when different judges are evaluating the degree to which art portfolios meet certain standards. It is especially useful when judgments can be considered relatively subjective; thus, this type of reliability is more likely to be used when evaluating artwork than when evaluating math problems.
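One common agreement index here is Cohen's kappa, which the text does not name but which corrects raw agreement for agreement expected by chance. A small sketch with invented portfolio ratings:

```python
# Hedged sketch: inter-rater agreement for two judges assigning
# categorical ratings. Cohen's kappa corrects the raw proportion of
# agreement for the agreement expected by chance alone.
import numpy as np

rater1 = np.array(["pass", "pass", "fail", "pass", "fail", "pass"])
rater2 = np.array(["pass", "fail", "fail", "pass", "fail", "pass"])

observed = np.mean(rater1 == rater2)  # raw proportion of agreement

# Chance agreement: product of the raters' marginal proportions,
# summed over categories.
categories = np.union1d(rater1, rater2)
chance = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)

kappa = (observed - chance) / (1 - chance)
print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")
```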

4) Internal consistency reliability

It is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results. Internal consistency reliability can be divided into two subtypes: average inter-item correlation and split-half reliability. The average inter-item correlation is obtained by taking all of the items on a test that probe the same construct (for instance, reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients.
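A minimal sketch of that three-step procedure, using a simulated item matrix in place of real reading-comprehension data:

```python
# Average inter-item correlation: correlate every pair of items probing
# the same construct, then average. Data are simulated: five items that
# all partially reflect one latent trait.
import numpy as np

rng = np.random.default_rng(1)
trait = rng.normal(size=100)                          # latent construct
items = trait[:, None] + rng.normal(size=(100, 5))    # 5 noisy items

corr = np.corrcoef(items, rowvar=False)               # 5 x 5 correlation matrix
pairs = corr[np.triu_indices_from(corr, k=1)]         # each item pair once
print(f"Average inter-item correlation: {pairs.mean():.2f}")
```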

Split-half reliability is another subtype of internal consistency reliability. The process of obtaining split-half reliability begins by “splitting in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
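The computation might look like the sketch below. The odd/even split and the final Spearman-Brown adjustment (not mentioned in the text, but conventionally applied because each half is only half as long as the full test) are assumptions.

```python
# Split-half reliability: total each half of the test, correlate the two
# totals, then (conventionally) adjust with the Spearman-Brown formula.
# Data are simulated: ten items reflecting one latent trait.
import numpy as np

rng = np.random.default_rng(2)
trait = rng.normal(size=100)
items = trait[:, None] + rng.normal(size=(100, 10))

half1 = items[:, 0::2].sum(axis=1)   # "set" 1: odd-numbered items
half2 = items[:, 1::2].sum(axis=1)   # "set" 2: even-numbered items

r_half = np.corrcoef(half1, half2)[0, 1]
r_full = 2 * r_half / (1 + r_half)   # Spearman-Brown step-up
print(f"split-half r = {r_half:.2f}, adjusted = {r_full:.2f}")
```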

Validity

Validity concerns whether the components of research measure what they are intended to measure. Researchers worry, when measuring behaviour, whether they are measuring what they aim to assess: does an IQ test measure intelligence? Do tests and examinations predict successful completion of a graduate programme? These are validity issues, and while they can never be resolved with full confidence, researchers can gather strong support for the validity of their measures (Bollen, 1989). Validity means that the measurement values represent the variable they are meant to represent. But how do researchers judge this? One element, discussed above, is reliability: if a measure is soundly reliable and internally consistent, researchers can be more confident that its results are what they should be. But more is needed, since a measure might be very dependable yet not valid. Imagine someone who believes that the length of people's index fingers reflects their self-esteem and hence attempts to assess self-esteem by holding a ruler up to the index finger. While this measure would have excellent test-retest reliability, it would not be valid at all: the fact that one person's finger is a centimetre longer than another's does not show that the person has more self-esteem. Discussions of validity generally split it into several "types," and in addition to reliability, these various forms of evidence should be taken into account when assessing the validity of a measure. Three fundamental types are face validity, content validity, and criterion validity.

1) Face validity

Face validity is the extent to which a measurement method appears, on its face, to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items such as "do you see yourself as someone who is worthwhile?" and "do you think you have good qualities?"; a survey containing questions of this kind would have excellent face validity. A technique that assesses self-esteem by holding a ruler up to people's index fingers, by contrast, appears to have little to do with self-esteem and so has low face validity. Although face validity can be assessed quantitatively, it is generally judged informally.

Face validity is, at best, a very weak kind of evidence that a method measures what it is meant to. One reason is that it depends on people's intuitive understanding of human behaviour, which is frequently wrong. Many psychological assessments function fairly effectively even though they lack face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) has respondents rate 567 distinct items, and its scoring algorithm uses statements with no apparent connection to the concepts they assess; for example, responses to items such as "I like detective and mystery stories" and "I am not scared by the sight of blood" are taken to indicate suppression of aggressiveness. The question is simply whether the pattern of a participant's replies to a series of questions mirrors that of persons who habitually repress their aggressiveness.

2) Content validity

Content validity is the extent to which a measure "covers" the construct of interest. For example, if test anxiety is conceptually defined as involving both anxious feelings and negative thoughts, then a measure of test anxiety with good content validity should capture both of these symptoms. Similarly, attitudes can be seen as involving thoughts, feelings, and behaviours toward something: defined in this conceptual manner, someone has a positive attitude toward exercise if they have positive thoughts, feelings, and/or activities related to exercising. To achieve excellent content validity, a measure of people's attitudes about exercise would have to test all three elements. Like face validity, content validity is only seldom measured in quantitative terms. Instead, it is evaluated by examining the measurement technique and verifying that it matches the conceptual definition of the construct.

3) Criterion validity

Criterion validity is the extent to which test scores are associated with other relevant variables (known as criteria). For instance, scores on a new test-anxiety measure should be negatively correlated with performance on an important school exam; such a finding would be evidence that the scores truly reflect people's test anxiety. If, instead, people performed equally well on the exam regardless of their scores on the measure, this would imply that the measure is flawed.
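A minimal sketch of this check, with invented scores; the assumption is simply that criterion validity is assessed here as the Pearson correlation between the new measure and the criterion.

```python
# Criterion validity sketch: the new test-anxiety scores should be
# negatively correlated with the criterion (exam performance).
from scipy.stats import pearsonr

anxiety = [12, 30, 22, 8, 27, 15, 33, 18]   # new measure (invented)
exam    = [88, 60, 71, 92, 65, 80, 55, 75]  # criterion (invented)

r, p = pearsonr(anxiety, exam)
print(f"r = {r:.2f}, p = {p:.3f}")  # a clearly negative r supports validity
```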

Any variable for which there is good cause to believe it will be connected with the variable one is measuring is a possible criterion. In certain cases, you might anticipate test anxiety to be associated with academic performance and grades, but not with other measures, such as blood pressure. Or think of a researcher coming up with a novel indicator of risk-taking behaviour: participation in "extreme" sports like skiing and rock climbing, as well as the number of speeding tickets, broken bones, and so on, should all be related to people's scores on this metric. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; when the criterion is measured after the construct has been measured, it is referred to as predictive validity. The criteria may also include other measures of the same construct, which is what is generally seen; in the literature, agreement with such measures is documented as convergent validity.

Assessing convergent validity requires collecting data using the measure in question. Researchers John Cacioppo and Richard Petty did this when they produced their Need for Cognition Scale, a measure of how much individuals value and engage in thinking. People's scores on the scale were found to be associated with other variables in the expected directions; for example, they were negatively correlated with a measure of dogmatism (which represents a tendency toward obedience). The Need for Cognition Scale has been utilised in literally hundreds of academic papers since it was created, and this wide array of studies shows that it is correlated with many other variables, including interest in politics, juror decisions, and the effectiveness of an advertisement (Petty, Briñol, Loersch, & McCaslin, 2009).

4) Discriminant validity

Discriminant validity, on the other hand, refers to the lack of correlation between a test result and measurements of factors that are conceptually distinct. Self-esteem, for example, is generally thought to be steady throughout time; it is not the same as one's current mood, which describes how good or bad one feels at the moment. Scores on a new measure of self-esteem should therefore not correlate too strongly with people's moods. If the new measure were significantly associated with mood, it could be argued that it was not actually assessing self-esteem, but mood instead.

When they developed the Need for Cognition Scale, Cacioppo and Petty also showed that people's scores were not strongly linked with certain other factors, which supported the conclusion that the scale had discriminant validity. The researchers discovered only a slight correlation between people's need for cognition and a measurement of their cognitive style: people who tend to think analytically (breaking ideas down into smaller parts) were somewhat more likely to score high in need for cognition than those who tend to think holistically (seeing the big picture). They also found no link between people's need for cognition and either their test anxiety or their willingness to display socially desirable behaviour. Such low correlations indicate that the measure reflects a construct conceptually separate from those other notions.
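The overall pattern can be summarised in a short simulation; the three variables below are invented stand-ins for a new self-esteem scale, an established measure of the same construct, and current mood.

```python
# Convergent vs. discriminant validity in one sketch: the new scale
# should correlate with a measure of the same construct (convergent)
# and show little correlation with a distinct construct (discriminant).
import numpy as np

rng = np.random.default_rng(3)
trait = rng.normal(size=200)                            # latent self-esteem

new_scale   = trait + rng.normal(scale=0.5, size=200)   # new measure
established = trait + rng.normal(scale=0.7, size=200)   # same construct
mood        = rng.normal(size=200)                      # distinct construct

print("convergent r   =", round(np.corrcoef(new_scale, established)[0, 1], 2))
print("discriminant r =", round(np.corrcoef(new_scale, mood)[0, 1], 2))
```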


Methods of Measuring Musical Behaviour, Attitudes, Interests and Values

There are certain methods that can be used to measure musical behaviours, attitudes, interests, and values in relation to the variables of interest. The most commonly used methods for measuring these musical attributes are self-report questionnaires. Open-ended questions, paired comparisons, multiple-choice questionnaires, ranked-choice questionnaires, ordered rating scales, summated ratings, and semantic differentials are among the dependent variables used. Subjects reply on one of the rating scales listed above, based on their understanding of the description. This has the benefit of providing a specific property to which participants react; measurements such as these can, however, offer examples so particular as to be unique, and therefore unable to be generalised. The most common measurements employed in the study of musical attitude are described below (cite).

Open-ended questions, for example, embody the finding that when you want to learn something about someone, you should ask. Respondents have answered such questions alongside structured ones, in studies that also take advantage of extensive countrywide random sampling techniques. In the paired comparison approach, subjects select between two audio stimuli, and every possible pairing of the stimuli is presented. To gauge preference, subjects must be told about the characteristics of the stimuli. In one study, fourth-grade students were asked to select their favourite music activity, with paired comparison utilised as the dependent variable; in another, a team of psychologists at the University of North Carolina asked 5-year-old children to listen to two short excerpts of music, one each of jazz and classical music, and indicate whether they preferred the classical or the jazz piece. This forced-choice approach has been used in investigations for almost 70 years (Radocy, 1976).

In multiple-choice scales, on the other hand, participants select the most favoured song from a collection of songs. Adler (1929) used a similar approach to multiple-choice questionnaires and other ranked-choice methods by having subjects place music pieces, types of music, or activities in order of preference. In several studies, researchers employing multiple-choice and ranking scales describe good musical taste or preference as the degree of similarity between a subject's choices and those of recognised musical experts. Pictographic scales replace verbal alternatives with pictures, incorporating aspects of multiple-choice, ranking, and rating scales. Researchers dealing with young children have chosen these scales for their immediate comprehensibility and the simplicity of interpretation they provide. Instead of having to deal with several choices that may be polarising, youngsters simply mark the happy, neutral, or frowning face which best conveys their sentiments toward music, an activity, or a concept. Some of the other scales depict activities such as singing, playing musical bells, or taking part in a game. Many more formats, however, could be devised for measuring musical attributes.

Musical behaviour, however, goes hand in hand with observational measures, which experimenters have long employed. The methods used to capture these data often take the form of assigning distinct categories to the different activities that occur during a set period of time; attitude can then be inferred from the observable actions once they have been identified. Most reliability calculations for such measures divide the number of observations on which observers agree by the total number of agreement and disagreement observations, as sketched below. While engaged in music listening activities, subjects may sing, hum, smile, frown, whistle, verbalise, make noises, or move their bodies; outside the laboratory, people buy albums, listen to radios, watch television, and record their favourite tunes while they are out and about (Appleton, 1970).
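The agreement formula just mentioned is simple enough to state directly. In this sketch, two observers have coded the same listening session interval by interval (the codes are invented):

```python
# Observer agreement: agreements / (agreements + disagreements),
# computed over interval-by-interval behaviour codes.
obs1 = ["sing", "move", "still", "sing", "hum", "still", "move", "sing"]
obs2 = ["sing", "move", "still", "hum",  "hum", "still", "move", "move"]

agreements = sum(a == b for a, b in zip(obs1, obs2))
reliability = agreements / len(obs1)   # len = agreements + disagreements
print(f"Observer agreement: {reliability:.2f}")   # 6/8 = 0.75
```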

Apart from these, single stimulus listening time is the time a subject spends listening to each of a sequence of musical stimuli (Crozier, 1974). Because of the numerous variables involved, it is difficult to state accurately how much time is spent listening to each stimulus. The approach is ideal for visual art; its musical analogue is finding music on the radio dial, where it is known as "search time," and it has less value in music than in the visual arts. Whereas at an art museum one may look at different works of art for varying periods of time, there is no way to do so at a concert, and people scanning the dial typically zero in rapidly on one station and then stop searching.

Reward value, finally, is the strength of an event in producing an increase in learning under some condition, usually when the event follows another. For example, a certain sound is made by pressing a particular switch, and strength of preference can be interpreted as the number of tokens spent on various reinforcers, such as rock music. The current reward value of music listening in general is in direct proportion to the amount of time spent listening to one selection in comparison to other selections. Several studies of the reinforcing value of music listening have been based on the Operant Music Listening Recorder (Dorow, 1977). A subject using this device is able to pick among up to five continuously accessible sound options by activating a certain channel with a switch. In most of these experiments, the sound contingencies relocate every time the same switch has been pressed continuously for around two minutes, so the subject must push a different switch in order to continue listening to the same music.


Conclusion

In conclusion, as music educators who are also researchers, we need to find the best possible ways, adequate and appropriate, to create better measurement designs for our students in the future. In this vast world of knowledge, the inclinations of education shift from time to time, following the timeline and scale of each generation. Educators need to devise measurement instruments that are calibrated to their time and rest on a firm, established basis, whether a new creation or an innovation built on the previous studies made by scholars. There are many ways this can be carried out in order to make measurement precise, reliable, and valid.
