In Class Task 4
2020802098
Assignment 2
Introduction
Numerous public schools around the country are working to foster a culture of data, putting heavy emphasis on linking the use of data across the classroom, the school, and the district. The largest difficulty associated with this integration is choosing what data would provide an appropriate picture of a school's goals, and one factor to keep in mind is that data is like a weather vane: it will inevitably change. Especially when the aims of the school are not stated in a way that lends itself to concrete analysis, it is essential to take the school's mission into consideration, which might be defined as improving school climate. If, instead, a school is concerned with raising student literacy, one question to consider could be: which student groups routinely perform below the proficiency cutoffs on standardised English exams? A school that wishes to promote a supportive and inclusive environment might pose a question such as: do teachers treat children differently based on their race, religion, or gender? These targeted queries can be thought of as research questions in search of answers. While a question on its own may not always tell you which instrument (such as a standardised exam, student survey, etc.) is best, it does help to point in the right direction.
Test scores are taken to represent constructs, which reflect some feature of the individuals being measured. Although it is not obvious at first how researchers determine that test results represent a construct, it is well established that cognitive traits such as intellect, self-esteem, depression, and working memory capacity exist. To comprehend the concepts measured by test scores, researchers conduct studies and validate that the scores are in fact a reflection of the construct. It must be acknowledged that psychologists do not simply presume that their measures will yield meaningful results; they actually gather data to indicate that the measures are effective, and if they have no proof that a measure works, they stop using it.
For the sake of example, assume you have been on a diet for a month. Several friends have commented that you seem looser in your clothing. When you weigh yourself and find out that you have lost ten pounds, you will keep using the scale. But if the scale indicated that you had gained ten pounds, you would infer that it was broken and have it fixed or scrapped. Reliability and validity are the two major characteristics of measurement tools that fields such as psychology, economics, and education take into account, and measures that are valid are usually also reliable. According to Middleton (2019), reliability and validity are concepts used to evaluate the quality of research. They indicate how well a method, technique, or test measures something: reliability is about the consistency of a measure, and validity is about the accuracy of a measure. Similarly, according to Drost (2011), reliability is the extent to which measurements are repeatable when different persons perform the measurements, on different occasions, under different conditions, with supposedly alternative instruments which measure the same thing. In sum, reliability is consistency of measurement (Bollen, 1989), or stability of measurement over a variety of conditions in which basically the same results should be obtained (Nunnally, 1978).
Validity, in turn, is the extent to which a measure accurately depicts the construct it is intended to capture (Drost, 2011). The term construct in this statement refers to the knowledge, skill, character, or attitude investigated by the researcher. If researchers wished to assess appreciation of music in music training, for example, it would be essential to know whether the metric appropriately measures that appreciation before the research results could be acknowledged as valid.
Reliability
Reliability is important when measuring attributes or conduct via a psychological test (Rosenthal and Rosnow, 1991). For a test to be useful, it is vital that it discriminate among people consistently, whether at a single point in time or over a period of time. This means that reliability is the extent to which measurements may be repeated: if other people carry out the measurements, on various occasions, with presumably alternative devices that measure the same thing, basically the same results should be obtained over this variety of conditions.
Random measurement errors affect the data gained from behavioural research. Measurement errors occur either as systematic errors or as random errors, and the bathroom scale is a wonderful example of both (Rosenthal and Rosnow, 1991). A systematic error would occur if you weighed yourself on the bathroom scale frequently and it gave you a constant weight reading that was always ten pounds more than it should have been. A random error would occur if the scale was accurate but you misread it when weighing yourself, so that on some occasions you would read your weight as slightly higher and on others slightly lower than it really was. Such random errors cancel out, on average, over repeated measurements of one person. Random errors also propagate as noise in researchers' measurements, but because the effect is usually not substantial, it is often ignored. For example, in an interview, a person's knowledge or schema of his daily living will influence the quality of the interview conducted by the researcher, depending for instance on his mood. Systematic errors, in contrast, shift the measurements of all the investigated individuals, making the mean value either too large or too small. Thus, if a person measured himself on the same scale on a recurring basis, the reading would not always be exactly the same weight, but if the tiny fluctuations were random and cancelled out, the person's weight could be calculated by averaging the results. However, should the reading be ten pounds too high, this systematic error cannot be cancelled by averaging, though it can be adjusted for by subtracting ten pounds from the average. The major problem for validity is systematic error, which biases observations across the whole sample; systematic errors constitute a measurement bias and should be addressed to provide better findings. There are different types of reliability, for example:
1) Test-retest reliability
It is a measure of reliability obtained by administering the same test twice over a period
of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in
order to evaluate the test for stability over time. For instance, a test designed to assess student
learning in psychology could be given to a group of students twice, with the second
administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the stability of the scores over time.
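As a minimal sketch, the test-retest correlation can be computed directly; the Python snippet below uses invented Time 1 and Time 2 scores purely for illustration.

```python
# Sketch: test-retest reliability as the correlation between two
# administrations of the same test. All scores are invented.
from statistics import correlation  # available in Python 3.10+

time1 = [78, 85, 62, 90, 71, 88, 67, 80]  # hypothetical first administration
time2 = [75, 88, 60, 92, 70, 85, 65, 83]  # same students, one week later

r = correlation(time1, time2)
print(f"Test-retest reliability: r = {r:.2f}")  # values near 1 indicate stability
```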
2) Parallel forms reliability
It is a measure of reliability obtained by administering different versions of an assessment tool, whereby both versions must contain items that probe the same construct, skill, knowledge base, etc., to the same group of individuals. The scores from the two versions can then be
correlated in order to evaluate the consistency of results across alternate versions. For example, if you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms.
3) Inter-rater reliability
It is a measure of reliability used to assess the degree to which different judges or raters
agree in their assessment decisions. Inter-rater reliability is useful because human observers will
not necessarily interpret answers the same way; raters may disagree as to how well certain
responses or material demonstrate knowledge of the construct or skill being assessed. For example, inter-rater reliability might be employed when different judges are evaluating the degree to which art portfolios meet certain standards. Inter-rater reliability is especially useful when judgments can be considered relatively subjective; thus, the use of this type of reliability would probably be more likely in the evaluation of artwork than of, say, math problems.
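As a rough illustration, agreement between two judges can be quantified as simple percent agreement or, correcting for chance agreement, as Cohen's kappa (a standard statistic not named above). The judges, the "pass"/"fail" categories, and all ratings below are invented.

```python
# Sketch: inter-rater agreement for two hypothetical judges rating
# art portfolios as "pass" or "fail". All ratings are invented.
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # percent agreement

# Chance agreement: probability both raters pick the same category at random,
# estimated from each rater's own category frequencies.
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in counts_a)

kappa = (observed - expected) / (1 - expected)  # Cohen's kappa
print(f"Percent agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```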
4) Internal consistency reliability
It is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results. Internal consistency reliability can be divided into two subtypes: average inter-item correlation and split-half reliability. Average inter-item correlation is obtained by taking all of the items on a test that probe the same construct (for instance, reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients. This final step yields the average inter-item correlation.
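The pairwise procedure just described is easy to sketch in code; the response matrix below is invented, with rows as respondents and columns as items assumed to probe one construct.

```python
# Sketch: average inter-item correlation. Rows are respondents,
# columns are items probing the same construct (invented data).
from itertools import combinations
from statistics import correlation

responses = [
    [4, 5, 4, 3],
    [2, 1, 2, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 5, 5],
]
items = list(zip(*responses))  # transpose: one tuple of scores per item

# Correlate every pair of items, then average the coefficients.
pair_rs = [correlation(a, b) for a, b in combinations(items, 2)]
print(f"Average inter-item correlation: {sum(pair_rs) / len(pair_rs):.2f}")
```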
Obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
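A minimal sketch of this procedure follows, using invented item scores; the Spearman-Brown step at the end is a standard correction for the halved test length and is an addition not mentioned above.

```python
# Sketch: split-half reliability. Each row holds one student's
# item scores (1 = correct, 0 = incorrect); the data are invented.
from statistics import correlation

items = [
    [1, 0, 1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 1, 1],
]

# Split the test into odd- and even-numbered item "sets" and total each half.
half1 = [sum(row[0::2]) for row in items]
half2 = [sum(row[1::2]) for row in items]

r_half = correlation(half1, half2)
# Spearman-Brown correction: estimates full-length reliability from half-test r.
r_full = 2 * r_half / (1 + r_half)
print(f"Split-half r = {r_half:.2f}, Spearman-Brown corrected = {r_full:.2f}")
```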
Validity
Validity is concerned with the meaningfulness of research components. Researchers are concerned, when measuring behaviour, with whether they measure what they aim to assess. Does an IQ test measure intelligence? Do tests and examinations predict that a graduate programme will be completed successfully? These are validity issues, and while they can never be resolved with full confidence, researchers can strongly support their measures' validity (Bollen, 1989). Validity means that the measurement values represent the variable they are meant to represent. But how do scientists judge this? One element that we have previously discussed is reliability: consistent measurement is needed before one can be certain that the results represent what they should. But more has to be done, since a metric might be very reliable yet not valid. Imagine someone who thinks that the length of people's index fingers represents their self-esteem and hence attempts to assess self-esteem by holding a ruler up to people's index fingers. While this metric would have very high test-retest reliability, it would not be valid at all: the fact that one person's finger is a centimetre longer than another's does not show that the first person has more self-esteem. Discussions of validity generally split it into several "types," and in addition to reliability, these various forms of evidence should be taken into account when assessing the validity of a measure. The fundamental types are discussed below.
1) Face validity
Face validity is the extent to which a measurement method appears, on its face, to measure the construct of interest. A self-esteem questionnaire is generally expected to include items such as "do you see yourself as someone who is worthwhile?" and "do you think you have good qualities?"; a survey containing questions like these would have excellent face validity. The ruler technique for measuring self-esteem, by contrast, appears to have little to do with self-esteem and so has low face validity. While face validity can in principle be assessed quantitatively, it is usually judged informally.
Face validity is, at best, a very weak kind of evidence that a method measures what it is meant to, because it rests on people's intuitions about behaviour, which are rarely correct. Many psychological assessments function fairly effectively even though they lack face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) gives respondents 567 distinct items to rate, and in the MMPI scoring algorithm the statements used have no apparent connection to the concepts they are assessing. For example, agreement with the items "I like detective and mystery stories" and "I am not scared by the sight of blood" is scored as indicating suppression of aggressiveness. The appropriate question here is whether the pattern of participants' replies to a series of questions mirrors that of persons who habitually repress their aggressiveness.
2) Content validity
The extent to which a measure "covers" the concept of interest is known as content validity. If test anxiety, for instance, is conceptually defined as involving both anxious sensations and negative thoughts, then a measure of test anxiety should capture both symptoms. To see attitudes as involving thoughts, feelings, and behaviours toward something is
another way of looking at them. When defined in this conceptual manner, someone has a positive
attitude toward exercise if they have positive thoughts, feelings, and/or activities related to
exercising. In order to get excellent content validity, you would have to test all three elements of
people's attitudes about exercise. Like face validity, content validity is seldom assessed in quantitative terms. Instead, it is evaluated by examining the technique of measurement and verifying that it matches the conceptual definition of the construct.
3) Criterion validity
Criterion validity is the extent to which test scores are associated with other relevant variables (known as criteria). For instance, scores on a new test anxiety measure should be negatively associated with performance on an important school exam; this would be proof that people's scores truly reflect their test anxiety. However, if it were shown that individuals performed the same on the exam regardless of their test anxiety scores, this would imply that the scores do not really measure test anxiety.
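As a minimal sketch of this check, the snippet below correlates invented scores on a hypothetical new test anxiety measure with invented exam marks; a clearly negative coefficient is the pattern described above.

```python
# Sketch: criterion validity as a correlation between a hypothetical new
# test anxiety measure and exam performance. All scores are invented.
from statistics import correlation

anxiety = [12, 35, 20, 41, 8, 30, 25, 15]   # new test anxiety scores
exam = [88, 61, 79, 55, 93, 66, 70, 85]     # marks on an important exam

# A clearly negative r supports criterion validity for this measure.
print(f"Criterion validity check: r = {correlation(anxiety, exam):.2f}")
```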
Any variable for which there is good cause to believe it will be connected with the variable one is measuring is a possible criterion. In certain cases, you may anticipate test anxiety to be associated with academic performance and grades, but not with other measures of worry, such as blood pressure. Or think of a researcher coming up with a novel indicator of risk-taking behaviour: participation in "extreme" sports like skiing and rock climbing, as well as the number of speeding tickets, broken bones, and so on, should all be associated with someone's final score on this metric. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; when the criterion is measured after the construct has been measured, it is referred to as predictive validity. Criteria may also include other measures of the same construct, in which case the evidence is generally called convergent validity.
The process of determining convergent validity calls for collecting data with the measure. In an attempt to assess how much individuals value and engage in thinking, researchers John Cacioppo and Richard Petty conducted a study in which they produced their Need for Cognition Scale. People's scores on the scale were shown to be positively associated with related measures of cognitive engagement and negatively associated with dogmatism (a tendency toward obedience). The Need for Cognition Scale has been utilised in literally hundreds of academic papers since it was created, and this wide array of studies shows that it is correlated with many other variables, including interest in politics, juror decisions, and the effectiveness of advertisements.
4) Discriminant validity
Discriminant validity, on the other hand, refers to the lack of correlation between a test result and measurements of factors that are conceptually distinct. Self-esteem, for example, is fairly steady throughout time; it is not the same as one's current mood, which describes how good or awful one feels at the moment. One should therefore not expect much correlation between people's scores on a new measure of self-esteem and their moods. If the new measurement were significantly associated with mood, it might be measuring mood rather than self-esteem.
When they developed the Need for Cognition Scale, Cacioppo and Petty showed that people's scores were not linked with conceptually distinct factors, which supported the conclusion that the test had discriminant validity. The researchers discovered only a slight correlation between people's need for cognition and a measurement of their cognitive style: people who tend to think analytically, breaking ideas down into smaller parts, scored somewhat higher in need for cognition than those who tend to think holistically and see the big picture. As for other cognitive correlates, they discovered no link between people's need for cognition and either their test anxiety or their willingness to display socially desirable behaviour. Low correlations like these provide evidence that the measure reflects a conceptually distinct construct.
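As a toy sketch of both checks, the snippet below correlates a hypothetical new self-esteem measure with an invented established measure of the same construct (convergent) and with invented mood ratings (discriminant); all names and numbers are illustrative assumptions.

```python
# Sketch: convergent vs. discriminant validity for a hypothetical new
# self-esteem measure. All scores below are invented for illustration.
from statistics import correlation

new_self_esteem = [32, 18, 27, 40, 22, 35, 29, 24]
established_self_esteem = [30, 20, 25, 38, 21, 36, 31, 22]  # same construct
mood_today = [4, 6, 3, 5, 6, 4, 5, 3]                       # distinct construct

# Convergent evidence: correlation with another measure of the same construct.
print(f"Convergent r = {correlation(new_self_esteem, established_self_esteem):.2f}")
# Discriminant evidence: low correlation with a conceptually distinct variable.
print(f"Discriminant r = {correlation(new_self_esteem, mood_today):.2f}")
```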
There are certain methods that can be used to measure musical behaviours, attitudes, and values as variables. Many commonly used methods rely on verbal rating: ordered rating scales, summated ratings, and semantic differentials are among the dependent measures. Subjects reply on one of these grading scales based on their reading of the description, which has the benefit of providing a specific property to which participants react. Measurements such as these, however, can offer examples so particular as to be unique, and therefore unable to be generalised. Two other common measurements employed in such studies are open-ended questions and paired comparisons.
Studies that use open-ended questions follow a simple principle: when you want to learn something about someone, you should ask. Respondents answer in their own words, and such studies can also take advantage of extensive countrywide random sampling techniques. In the paired comparison approach, on the other hand, subjects select between two audio stimuli, and every possible pairing is presented. To gauge preference, subjects must be told about the characteristics of the stimuli. In one study, fourth-grade students were asked to select their favourite music activity from paired alternatives; another study in North Carolina asked 5-year-old children to pay attention to two short excerpts of music, one each of jazz and classical music, and indicate whether they preferred the classical or jazz piece. This forced-choice approach has been used in investigations for almost 70 years (Radocy, 1976).
In multiple-choice scales, participants select the most favoured song from a collection of songs; Adler (1929) used a similar approach, combining multiple-choice questionnaires with other ranked-choice methods in which subjects order music pieces, types of music, or activities by preference. In several studies employing multiple-choice and ranking scales, researchers describe good musical taste or preference as the degree of similarity between a subject's choices and those of recognised musical experts. Pictographic scales replace verbal alternatives with pictures, incorporating aspects of multiple-choice, ranking, and rating scales. Researchers dealing with young children have chosen these scales because of their immediate comprehensibility and the simplicity of interpretation they provide: instead of having to deal with several choices that may be polarising, youngsters simply mark the happy, neutral, or frowning face which best conveys their sentiments toward music, an activity, or a concept. Some of the other scales depict activities such as singing, playing musical bells, or taking part in a game; however, there are many more elements that such scales could depict.
Behavioural measures of musical attitude go hand in hand with observational measures, which experimenters have long employed. The methods used to capture these data often take the form of assigning distinct categories to different activities that occur during a set period of time; attitude can then be inferred from the observable actions once they have been identified. Most measurements of observer reliability are obtained by dividing the number of observations on which observers agree by the total number of observations. Among observable activities, subjects may sing, hum, smile, frown, whistle, verbalise, make noises, or move their bodies; people also buy albums, listen to radios, watch television, and record their favourite tunes.
Apart from that, single stimulus listening time is the time a subject spends listening to each of a sequence of musical stimuli (Crozier, 1974). Because of the numerous variables involved, it is difficult to state accurately how much time will be spent listening to each stimulus. This approach is ideal for visual art, and a good musical analogue is found in the radio dial: in music it is known as "search time," though it has less value than in the visual arts. Whereas at an art museum one may look at different works of art for varying periods of time, there is no way to do so at a concert, and people typically zero in rapidly on one station.
In operant approaches, music is presented under some condition and usually after a different event: a certain sound is made by pressing a particular switch. Strength of preference can be interpreted as the number of tokens spent on various reinforcers, such as rock music. The current reward value of music listening (in general) is in direct proportion to the amount of time spent listening to one selection in comparison to other selections. Several studies of the reinforcing value of music listening have been based on the Operant Music Listening Recorder (Dorow, 1977). A subject using this device is able to pick among up to five continuously accessible sound options by activating a certain channel with a switch. In most of these experiments, the sound contingencies relocate every time the same switch has been pressed continuously for around two minutes, so the subject must push a different switch to keep hearing the preferred selection.
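Following the idea that reward value is proportional to relative listening time, here is a tiny sketch that turns per-selection listening times into preference proportions; the style labels and durations are invented assumptions.

```python
# Sketch: relative listening time as an index of preference strength.
# The styles and durations below are invented for illustration.
listening_seconds = {"rock": 340, "pop": 220, "jazz": 120, "classical": 75, "folk": 45}

total = sum(listening_seconds.values())
for style, seconds in sorted(listening_seconds.items(), key=lambda kv: -kv[1]):
    print(f"{style:>9}: {seconds / total:.0%} of total listening time")
```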
In conclusion, as music educators who are also researchers, we need to find ways that are adequate and appropriate to create better measurement designs for our students in the future. In this vast world of knowledge, the inclinations of education shift from time to time, following the timeline and scale of each generation. Educators need to find ways of creating measurement stimuli that are calibrated to their time, and they need firm, established instruments, whether new creations or innovations built on previous studies made by scholars. There are many ways that could be pursued to make such measurement both reliable and valid.