Reliability
1. Reliability is not absolute; a test may be reliable in one context and unreliable in another.
2. A reliability coefficient ranges from 0 (no reliability) to 1 (perfect reliability).
3. Split-Half Reliability: Assesses consistency by dividing a single test into two equivalent
halves.
Measurement Error
1. Preventable Errors: Mistakes caused by human factors such as lack of skill, carelessness,
or inadequate preparation.
2. Inevitable Errors: Natural variations caused by physical factors like molecular motion due to
heat or other environmental influences.
Significance of Reliability
1. Reliability ensures the consistency of test results, which is crucial for interpreting test scores
accurately.
2. Measurement errors, though inevitable, can often be minimized by improving test design and
administration.
3. Tests measuring stable traits (e.g., intelligence) are expected to show high reliability, whereas
tests for fluctuating variables (e.g., mood) may have lower reliability.
1. True Scores:
The score an individual would obtain on a particular test in the absence of measurement error.
2. Construct Scores:
The individual's standing on the underlying construct itself, which a true score only approximates.
Observed scores are approximations of true scores and are influenced by measurement error and other variables.
1. Fluctuating Variables:
Some psychological traits (e.g., mood, alertness, motivation) are in constant flux, making true
scores variable over time.
2. Carryover Effects:
Practice Effects: Improvement in test performance due to repeated exposure to the test.
Fatigue Effects: Decrease in test performance due to mental exhaustion or lack of motivation
from repeated testing.
If repeated measurements could be taken without carryover effects, the long-term average of
those measurements would equal the true score.
In reality, due to the passage of time and carryover effects, true scores can only be estimated.
Standard Error of Measurement (SEM)
Definition: The standard deviation of the distribution of repeated measurements for an individual, representing the typical distance from an observed score to the true score.
Example:
SEM = 5
If a person's true score is 100 and the SEM is 5, observed scores from repeated testing will vary but will generally fall within about 5 points of the true score (roughly 95-105 on most occasions).
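As a rough illustration (not from the notes), here is a minimal Python sketch assuming the standard classical-test-theory formula SEM = SD × √(1 − reliability); the SD of 15 and reliability of .89 are made-up values chosen so the SEM comes out near 5.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM under classical test theory: the typical distance between
    an observed score and the true score."""
    return sd * math.sqrt(1 - reliability)

# Illustrative values (not from the notes): test SD = 15, reliability = .89
sem = standard_error_of_measurement(sd=15, reliability=0.89)
print(f"SEM = {sem:.1f}")  # about 5, matching the example above

# With SEM = 5, observed scores within +/- 1 SEM of the true score are expected
# roughly two-thirds of the time (assuming normally distributed error).
true_score = 100
print(f"Typical range: {true_score - sem:.0f} to {true_score + sem:.0f}")
```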
1. True Score:
Instrument-Specific: A person’s true score on one test might differ from their true score on
another test for the same construct (e.g., depression).
2. Construct Score:
Represents the ideal, error-free measure of the underlying construct, but it is rarely obtainable directly because measurement tools are imperfect.
3. Key Relationship:
True scores are essential for understanding and calculating reliability, which is a prerequisite for
validity.
Without reliability, a test cannot be valid. However, high reliability does not guarantee validity.
Example: A flawed test may produce consistent results (reliable) but fail to measure the
intended construct (invalid).
True Score (T): The actual score reflecting the individual’s standing on the test without
measurement error.
Measurement Error (E): The difference between the observed score and the true score.
Equation: X = T + E, where X is the observed score, T is the true score, and E is measurement error.
1. Reliable Test:
The observed score (X) is primarily influenced by the true score (T).
2. Unreliable Test:
The observed score (X) is heavily influenced by measurement error (E).
1. Variance (σ²):
Variance is the standard deviation squared and measures variability in test scores.
The total variance in observed test scores equals the true variance (σ²_tr) plus the error variance (σ²_e).
Formula: σ² = σ²_tr + σ²_e
2. Reliability:
The proportion of total variance attributable to true variance (r = σ²_tr / σ²).
Assumes true differences are stable, leading to consistent scores across repeated administrations or equivalent forms.
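To make the decomposition concrete, here is a minimal Python simulation (all numbers are illustrative assumptions, not from the notes): observed scores are generated as true score plus random error, and reliability is recovered as the ratio of true variance to total variance.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
true_scores = rng.normal(loc=100, scale=15, size=n)  # T: stable individual differences
error = rng.normal(loc=0, scale=5, size=n)           # E: random measurement error
observed = true_scores + error                       # X = T + E

total_var = observed.var()
true_var = true_scores.var()
error_var = error.var()

print(f"total variance ~= true + error: {total_var:.1f} ~= {true_var + error_var:.1f}")
print(f"reliability = true var / total var = {true_var / total_var:.2f}")
```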
Measurement Error
1. Random Error:
A source of error that fluctuates unpredictably from one measurement to the next.
Examples:
Unanticipated events that help or hurt performance (e.g., happening to learn a relevant term just before the test, or a sudden noise during testing).
Effect:
Adds unpredictable noise to scores, lowering consistency without pushing scores in any one direction.
2. Systematic Error:
Example:
A flawed measurement tool (e.g., a ruler measuring 12.1 inches instead of 12 inches).
Effect:
Does not affect score consistency but introduces bias.
Key Terminology
Bias:
In statistics, bias is a technical term for predictable systematic error rather than prejudice.
1. Test Construction:
Item Sampling:
Differences in the way items are worded or the exact content sampled.
Example: A test-taker may hope for specific questions that align with their strengths.
Content Sampling:
Variation in the specific skills, attributes, or knowledge areas included in the test.
2. Test Administration: Conditions under which the test is given (the testing environment, the test-taker's state, and examiner behavior) can introduce error variance.
1. Test Construction
Content Sampling:
Test scores can vary based on the specific items included in the test and how they are worded.
Example: Test-takers might perform better if the test contains questions they "hoped" to be
asked.
Objective:
Test creators aim to maximize true variance and minimize error variance during test
development.
2. Test Administration
Environmental Factors:
Conditions like room temperature, lighting, ventilation, noise, and seating arrangements can
influence performance.
Test-Taker Variables:
Cognitive performance can also be affected by biological factors (e.g., high fasting glucose
levels linked to cognitive errors).
Examiner Variables:
Examples:
Nonverbal cues like head nodding or eye movements might give away correct answers.
Strong personal beliefs (e.g., religious views) might bias assessments (e.g., overestimating
suicidal risk).
Objective Scoring:
The use of computers and objective grids reduces errors caused by scorer differences.
Tests requiring manual scoring or subjective interpretation (e.g., interviews or essays) remain
prone to error variance.
By addressing these sources, test reliability and validity can be improved, ensuring fairer and
more accurate assessments.
Objective Scoring:
Tests with objective items (e.g., multiple-choice) are often scored by computers, minimizing
error.
Subjective Scoring:
Tests requiring human interpretation (e.g., essays, personality tests, creativity tests) are more
prone to error.
Examples:
Intelligence tests: Scorers may encounter responses that fall into ambiguous "gray areas."
Behavioral measures: Observers may differ in their interpretation of behaviors like "social
relatedness."
Consistency in Scoring:
Clear scoring criteria and scorer training help keep results consistent across different scorers.
Margin of Error:
Surveys often include disclaimers about a margin of error, indicating possible deviations in the
results.
Example: Political polls predicting election outcomes may be off by a few percentage points.
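As a hedged illustration of how a poll's margin of error is commonly computed (the 52%/1,000-respondent poll below is hypothetical), here is a short Python sketch using the standard formula for a sample proportion at roughly 95% confidence.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample proportion p based on n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical poll: 52% support among 1,000 respondents
moe = margin_of_error(p=0.52, n=1000)
print(f"Margin of error: +/- {moe * 100:.1f} percentage points")  # about +/- 3.1
```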
Sampling Error:
Occurs when the sample of participants is not representative of the target population.
Factors such as demographics or political affiliations may not align with the larger population.
Methodological Error:
Errors arising from flaws in how a survey or study is designed or conducted (e.g., poorly worded questions or inconsistent administration).
Systematic Errors:
Consistent, directional errors that bias results in a predictable way and do not cancel out over repeated measurements.
Nonsystematic Errors:
Random and unpredictable; they tend to balance out over time but still affect individual scores.
To reduce these errors:
Use representative samples in surveys and ensure that appropriate methodologies are followed.
Minimize systematic and nonsystematic errors by addressing both environmental factors and scoring biases.
By recognizing and mitigating these sources of error, the validity and reliability of assessments
can be significantly enhanced.
Abuse typically occurs behind closed doors, so only the individuals involved truly know the
extent of the abuse (Moffitt et al., 1997).
Underreporting:
Some individuals minimize or deny abuse, so its full extent goes unrecorded.
Overreporting:
Some individuals exaggerate claims of abuse for secondary gains, such as custody disputes or
financial incentives (Petherick, 2019).
Implications:
The variability between true reports and error-prone ones makes the "true score" of abuse
difficult, if not impossible, to ascertain.
Debates continue over the utility of methods used to estimate true versus error variance
(Stanley, 1971).
Definition:
Test-retest reliability measures the stability of a test over time.
It involves administering the same test to the same individuals at two different points in time and
correlating the scores.
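Below is a minimal Python sketch of this procedure, using hypothetical scores for six people tested twice; the test-retest reliability coefficient is simply the Pearson correlation between the two administrations.

```python
import numpy as np

# Hypothetical scores for the same six people tested twice, two weeks apart
time1 = np.array([12, 18, 25, 30, 22, 15])
time2 = np.array([14, 17, 27, 29, 21, 16])

# Test-retest reliability = Pearson correlation between the two administrations
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: r = {r_tt:.2f}")
```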
Applications:
Ideal for evaluating traits or characteristics assumed to remain stable over time (e.g., personality
traits).
Not suitable for variables expected to fluctuate (e.g., mood, knowledge acquisition).
Example:
A ruler made of steel consistently measures 12 inches as 12 inches over time, showing high
test-retest reliability.
A ruler made of putty that changes shape over time demonstrates low reliability.
As the time interval between test administrations increases, the reliability coefficient tends to
decrease.
External factors (e.g., learning new information or experiencing trauma) can introduce error
variance.
Coefficient of Stability:
When the interval between tests exceeds six months, the test-retest reliability estimate is
referred to as the coefficient of stability.
Example Scenarios:
A math test’s reliability may decrease if test-takers take a tutorial between test administrations.
When tests are administered during periods of significant developmental changes, test-retest
reliability estimates may not fully represent a test's reliability.
Factors such as counseling, practice, fatigue, or motivation between test administrations can
confound results.
Even with brief intervals between tests, experience or fatigue can still influence results.
Scientific Replicability:
Reliability demands that test results be replicable under the same conditions by other
experimenters.
Psychology, like other sciences, faces challenges with replication, highlighting the importance of
reliable measures.
Definitions:
Parallel Forms: Two tests are considered parallel if they have equal means and variances for
observed scores and correlate equally with true scores.
Alternate Forms: Similar to parallel forms but not identical in terms of statistical properties;
designed to be equivalent in content and difficulty.
Coefficient of Equivalence:
Refers to the degree of relationship between parallel or alternate forms of a test, offering an
estimate of reliability.
Practical Implications
Parallel Forms:
Reliability estimates reflect the degree to which item sampling and other errors affect scores.
Alternate Forms:
Estimate of reliability is based on the correlation of scores obtained from different test versions.
Scenario:
You missed the midterm exam, and your instructor says you'll take an alternate form of the test
instead of a parallel form.
Feelings: Likely apprehensive, as alternate forms are not guaranteed to be identical in difficulty
or content balance.
Error Sources:
With alternate or parallel forms, differences in item and content sampling between the two forms add error variance beyond the usual administration-related factors.
Evaluating Reliability
Consider Context:
Reliability coefficients are meaningful only when interpreted within the test's purpose and
context.
Whether parallel or alternate forms are used, understanding the nature of the test and its
sources of error is key to drawing valid conclusions about its reliability.
Broader Implications:
Consistency in testing ensures fairness and validity, particularly in high-stakes situations like
exams or clinical assessments.
Background
A growing concern among academic scientists emerged about the rigor of scientific practices.
Researchers feared that published studies, despite being peer-reviewed, were not replicable by
other independent parties.
A significant study in 2015 aimed to replicate 100 psychology studies that had already been
peer-reviewed and published in leading journals.
Results showed that only 40-60% of replications found the same results as the original studies,
highlighting a replicability crisis in psychology.
Academic journals have historically preferred publishing novel findings over replication studies.
A study found that only 1.07% of psychological scientific literature involved direct replication of
previous work.
Implications:
A focus on novelty over replication limits confidence in findings and contributes to the crisis.
Positive Findings: Research that rejects the null hypothesis and finds an experimental effect.
Negative Findings: Studies that accept the null hypothesis and find no effect.
Editorial Bias:
Journals tend to favor publishing studies with positive results, leading to a higher likelihood of
seeing only effects rather than the absence of effects.
As a result, there is a publication bias where studies confirming a hypothesis are more likely to
be published, and studies that fail to reject the null hypothesis (negative results) are often
neglected.
QRPs are practices that may not constitute outright fraud but still introduce error or bias into
research findings.
Example:
Data Peeking: Researchers check data during the study and decide whether to collect more
data based on whether the current data has reached statistical significance.
This introduces bias because decisions on data collection are influenced by the data already
gathered.
Other QRPs include selective reporting, where researchers only publish studies that support
their hypothesis, excluding studies that do not confirm their expectations.
The crisis in replicability undermines the reliability and credibility of psychological research.
Negative findings and replication studies often go unpublished, which distorts the actual
strength of effects reported in scientific literature.
Long-Term Consequences:
The replication crisis calls into question the generalizability and robustness of scientific findings,
especially in psychology, where findings are often used to shape policy, practices, and legal
decisions.
Impact of QRPs:
QRPs include practices that introduce bias and undermine the integrity of scientific findings. For
example, researchers may selectively report only the studies that support their hypothesis,
hiding the ones that fail to do so.
The issue is that, without access to the researchers' raw data and full research records, it is
difficult for consumers of research to know about important milestones in the study, such as the
sequence of studies conducted or measurements taken.
Preregistration as a Solution
Preregistration:
Researchers publicly commit to their hypotheses, methods, and analysis plans before conducting the study.
Benefits of Preregistration:
Preregistration makes the planned design and analyses transparent to other researchers and consumers of research.
It also helps to address QRPs, as researchers cannot alter their study methodology after seeing initial results.
Several websites now allow researchers to preregister their research plans, and many journals
require preregistration for study publication. Some journals even provide special recognition for
preregistered studies to boost confidence in the findings.
Traditional Assumption:
It was once believed that science would self-correct over time; faulty studies would eventually
be exposed, and the scientific record would be corrected.
Reality:
The problem is that unreliable findings can remain accepted for decades before they are
eventually disconfirmed. The process of self-correction is often slow, and there is no established
mechanism to inform the scientific community or the public when erroneous studies are
disproven.
In the legal field, a more rigorous standard is being applied, where judges evaluate scientific evidence based on
criteria like the error rate, peer review, replication, and whether the study was preregistered.
This helps ensure that only robust, reliable findings are admitted as evidence in court.
The adoption of preregistration and other practices aimed at transparency is growing. Open
science initiatives are receiving increasing financial support to help researchers conduct more
rigorous studies.
Replication Efforts:
Replication efforts, such as the Open Science Collaboration's large-scale replication study, are
becoming more common. These efforts aim to increase confidence in research findings and
ensure their validity.
Legal Field:
In law, scientific research is frequently used in courtrooms, whether in criminal cases or civil
disputes. For example, studies from psychology journals may be used to support or challenge
claims about behavior or causality.
Given the potential consequences of legal decisions, it is crucial that the scientific evidence
relied upon in these cases is replicable and reliable. It is especially critical when timely decisions
are necessary, as legal appeals can be limited, and cases may be expensive to revisit.
Overview
1. Two Test Administrations: Both forms require administering the test twice to the same group
of participants.
2. Potential External Influences: Factors such as motivation, fatigue, and intervening events
(e.g., practice, learning, or therapy) can affect test scores. However, their impact may not be as
significant as in test-retest reliability (where the same test is administered twice).
Additional Source of Error
Item Sampling: The test-taker's performance on one form may vary not just because of their true
ability but due to the specific items selected for inclusion in that form.
Time and Expense: Creating equivalent forms of a test can be time-consuming and costly.
Researchers need to develop equivalent items and then administer them to the same group of
test-takers.
Stable Traits: Traits presumed to remain consistent over time (e.g., intelligence). Tests
measuring these traits are expected to show a reasonable degree of stability in scores across
alternate or parallel forms.
State-like Traits: Traits subject to fluctuation, such as mood or state anxiety (anxiety felt at a
given moment). These variables are less stable and are expected to show greater variability in
test scores over time.
State Variables: Since state-like traits are constantly changing, a retest reliability coefficient
might not fully capture the reliability of the measurement. Alternative methods of estimating
reliability are required for these kinds of traits.
Internal Consistency:
Internal Consistency Estimate: Reliability can be estimated from a single test administration
without the need for an alternate form or a second test administration. This method assesses
how well the items of the test work together.
Inter-item Consistency: This estimate evaluates the correlation between individual test items to
determine the consistency of responses.
Split-Half Reliability Estimates
Split-Half Reliability: This method is used when it is impractical to administer two tests or
perform a test-retest due to time or expense constraints.
1. Divide the Test: Split the test into two equivalent halves (e.g., by random assignment of items or an odd-even split).
2. Calculate Pearson Correlation: Compute the correlation between scores from the two halves of the test.
3. Adjust the Correlation: Apply the Spearman-Brown formula to estimate the reliability of the full-length test, since each half contains only half the items (see the sketch below).
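A minimal Python sketch of the three steps above, using a small made-up item-response matrix and assuming the two-half form of the Spearman-Brown correction, r_full = 2·r_half / (1 + r_half).

```python
import numpy as np

# Hypothetical data: rows = 6 test takers, columns = 8 items scored 0/1
scores = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
])

# 1. Divide the test into two halves using an odd-even split
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)

# 2. Correlate the two half-test scores
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# 3. Spearman-Brown correction to estimate full-length reliability
r_full = (2 * r_half) / (1 + r_half)
print(f"half-test r = {r_half:.2f}, corrected full-test reliability = {r_full:.2f}")
```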
The Spearman-Brown prediction formula shows how combining multiple parallel tests (or adding parallel items) can increase reliability. It implies that a single test with low reliability would need many parallel items to achieve high reliability. The text also notes that reducing test length may be desirable when boredom or fatigue is a concern, and it provides guidance on determining the number of items needed for a desired reliability level. Finally, it prompts readers to consider situations in which reducing test size or administration time might be desirable, and the arguments against doing so.
Spearman-Brown Formula:
- Combining multiple parallel tests increases reliability.
- If a test with low reliability needs a higher reliability coefficient, use the formula to estimate how many times longer the test must be (verified in the sketch after this list).
- Example: A 10-item test with a reliability of 0.60, needing a reliability of 0.80, requires 27
items.
- New items must match the original in content and difficulty.
- If reliability is too low, consider finding or developing an alternative instrument.
- Improve reliability by creating new items, clarifying instructions, or simplifying scoring rules.
3. Important Notes
- Internal consistency estimates are inappropriate for heterogeneous and speed tests.
- The impact of test characteristics on reliability is discussed in detail.
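A quick check of the 10-item example above, assuming the length form of the Spearman-Brown formula, n = r_target(1 − r_old) / [r_old(1 − r_target)].

```python
import math

def spearman_brown_length_factor(r_old: float, r_target: float) -> float:
    """How many times longer a test must be to reach the target reliability."""
    return (r_target * (1 - r_old)) / (r_old * (1 - r_target))

n_factor = spearman_brown_length_factor(r_old=0.60, r_target=0.80)
items_needed = math.ceil(10 * n_factor)  # original test has 10 items
print(f"Lengthen by a factor of {n_factor:.2f} -> about {items_needed} items")  # 27 items
```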
Cronbach’s alpha is a widely used measure of internal consistency, assessing how similar test
items are in measuring the same construct, with values ranging from 0 (no similarity) to 1
(perfect similarity). However, a common misconception is that a higher alpha always indicates
better reliability; in reality, including overly similar items can inflate alpha without improving
measurement accuracy. Alpha is most effective when all test items have equal loadings (λ), but
it underestimates reliability when loadings are unequal. As an alternative, McDonald’s omega
provides a more accurate measure of internal consistency in such cases, making it a preferred
choice among statisticians when test items have varying relationships with the underlying
construct.
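A minimal Python sketch of coefficient alpha computed from a hypothetical (respondents × items) matrix, using the standard formula α = (k/(k−1))(1 − Σσ²_item / σ²_total); McDonald's omega, by contrast, would weight items by their estimated loadings and is not shown here.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a (respondents x items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses: 5 people x 4 Likert-type items
responses = np.array([
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
])
print(f"alpha = {cronbach_alpha(responses):.2f}")
```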
Inter-scorer reliability refers to the consistency of evaluations from multiple scorers and is crucial
for ensuring test results are not influenced by the evaluator's identity. Historical examples, such
as the 1912 study where teachers graded the same paper with varying scores, highlight the
need for high inter-scorer reliability, which indicates consistent scoring by trained evaluators.
Factors like unclear scoring criteria can affect reliability, but training and clear guidelines can
improve it. In research, involving multiple raters helps minimize individual biases. The degree of
consistency among scorers can be measured using a correlation coefficient, known as the
coefficient of inter-scorer reliability. This reliability is often used in coding nonverbal behaviors,
ensuring more accurate ratings. Real-life scenarios, such as job interviews, may also impact
evaluation consistency based on the interviewer's relationship with the candidate.
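A minimal sketch of a coefficient of inter-scorer reliability, computed here as the Pearson correlation between two raters' (hypothetical) scores for the same set of essays.

```python
import numpy as np

# Hypothetical ratings of the same 8 essays by two trained scorers (0-10 scale)
scorer_a = np.array([7, 4, 9, 5, 6, 8, 3, 7])
scorer_b = np.array([6, 5, 9, 4, 6, 7, 3, 8])

# Coefficient of inter-scorer reliability: correlation between the two sets of scores
r_scorers = np.corrcoef(scorer_a, scorer_b)[0, 1]
print(f"Inter-scorer reliability: r = {r_scorers:.2f}")
```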
Alternative Ways of Splitting: There are many ways to split a test for split-half reliability, but some are not appropriate. A common mistake is simply cutting the test in the middle (first half versus second half), which does not ensure that the halves are equivalent in difficulty and content; assigning items randomly or using an odd-even split generally produces more comparable halves.
According to classical test theory, factors like motivation or fatigue are considered measurement
error. However, other models do not always view these factors as errors. For example, Atkinson
(1981) discussed alternative models in personality assessment where fluctuating test scores
due to such factors might not be considered mere error.
The document discusses the importance of reliability in tests and the purpose of reliability
coefficients. It emphasizes that the required level of reliability depends on the significance of the
test. High-stakes tests, such as those with life-or-death implications, must meet high reliability
standards. The text explains that different reliability coefficients reflect various sources of error
variance, such as errors from test construction, administration, scoring, and interpretation. An
example in Figure 5-5 shows different sources of variance in a hypothetical test.
Important Details:
- Reliability Importance: Essential for all tests, but standards vary based on the test's
significance.
- High-stakes tests require higher reliability standards.
- Reliability Coefficients: Different coefficients reflect different sources of error variance.
- Examples of error variance: test construction, administration, scoring, and interpretation.
Closely related to considerations concerning the purpose and use of a reliability coefficient are
those concerning the nature of the test itself. These include whether:
1. The test items are homogeneous or heterogeneous in nature.
2. The characteristic, ability, or trait being measured is presumed to be dynamic or static.
3. The range of test scores is or is not restricted.
4. The test is a speed or a power test.
5. The test is or is not criterion-referenced.
Some tests present special problems regarding the measurement of their reliability. For
example, psychological tests developed for infants help identify children who are developing
slowly or may benefit from early intervention. Measuring internal consistency reliability or
inter-scorer reliability for these tests is similar to other tests. However, measuring test-retest
reliability presents a unique challenge due to the rapid and uneven cognitive development in
young children. The child’s performance on the test could change significantly in a short period,
reflecting genuine changes rather than error. Therefore, gauging the test-retest reliability of such
tests must consider developmental changes between testings.
For this reason, developers of such tests may design test-retest reliability studies with short intervals between testings, sometimes as little as four days.
Homogeneity versus heterogeneity of test items Recall that a test is said to be homogeneous in
items if it is functionally uniform throughout. Tests designed to measure one factor, such as one
ability or one trait, are expected to be homogeneous in items. For such tests, it is reasonable to
expect a high degree of internal consistency. By contrast, if the test is heterogeneous in items,
an estimate of internal consistency might be low relative to a more appropriate estimate of
test-retest reliability. It is important to note that high internal consistency does not guarantee
item homogeneity. As long as the items are positively correlated, adding many items eventually
results in high internal consistency coefficients, homogeneous or not.
Dynamic versus static characteristics Whether what is being measured by the test is dynamic or
static is also a consideration in obtaining an estimate of reliability. A dynamic characteristic is a
trait, state, or ability presumed to be ever-changing as a function of situational and cognitive
experiences. If, for example, one were to take hourly measurements of the dynamic
characteristic of anxiety as manifested by a stockbroker throughout a business day, one might
find the measured level of this characteristic to change from hour to hour. Such changes might
even be related to the magnitude of the Dow Jones average. Because the true amount of
anxiety presumed to exist would vary with each assessment, a test-retest measure would be of
little help in gauging the reliability of the measuring instrument. Therefore, the best estimate of
reliability would be obtained from a measure of internal consistency. Contrast this situation to
one in which a measurement of this same stockbroker is made of a trait, state, or ability
presumed to be relatively unchanging (a static characteristic), such as intelligence. In this
instance, obtained measurement would not be expected to vary significantly as a function of
time, and either the test-retest or the alternate-forms method would be appropriate.
Restriction or inflation of range In using and interpreting a coefficient of reliability, the issue
variously referred to as restriction of range or restriction of variance (or, conversely, inflation of
range or inflation of variance) is important. If the variance of either variable in a correlational
analysis is restricted by the sampling procedure used, then the resulting correlation coefficient
tends to be lower. If the variance of either variable in a correlational analysis is inflated by the
sampling procedure, then the resulting correlation coefficient tends to be higher. Refer back to
Figure 3-18 (Two Scatterplots Illustrating Unrestricted and Restricted Ranges) for a graphic
illustration.
Also of critical importance is whether the range of variances employed is appropriate to the
objective of the correlational analysis. Consider, for example, a published educational test
designed for use with children in grades 1 through 6. Ideally, the manual for this test should
contain not one reliability value covering all the test takers in grades 1 through 6 but instead
reliability values for test takers at each grade level. Here's another example: A corporate
personnel officer employs a certain screening test in the hiring process. For future testing and
hiring purposes, this personnel officer maintains reliability data with respect to scores achieved
by job applicants—as opposed to hired employees—in order to avoid restriction of range effects
in the data. Doing so is important because the people who were hired typically scored higher on
the test than any comparable group of applicants.
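To illustrate the hiring example, here is a minimal simulation with made-up bivariate data: restricting the sample to high scorers on a screening test (as when only hired employees are studied) noticeably lowers the observed correlation.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5_000
x = rng.normal(size=n)                         # screening-test scores
y = 0.7 * x + rng.normal(scale=0.7, size=n)    # a related outcome; true r is about .7

r_full = np.corrcoef(x, y)[0, 1]

# Restrict the range: keep only above-average x (e.g., the applicants who were hired)
mask = x > 0
r_restricted = np.corrcoef(x[mask], y[mask])[0, 1]

print(f"unrestricted r = {r_full:.2f}, restricted r = {r_restricted:.2f}")
```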
Speed tests versus power tests When a time limit is long enough to allow test takers to attempt
all items, and if some items are so difficult that no test taker is able to obtain a perfect score,
then the test is a power test. By contrast, a speed test generally contains items of lower
difficulty, and the completion time is the primary factor.
1. Uniform Difficulty in Speed Tests:
- Speed tests have uniformly low difficulty items with time limits set so that few test takers
complete the test.
- Score differences are based on performance speed since attempted items are usually
correct.
3. Criterion-Referenced Tests:
- Designed to show a test taker's standing on a specific variable or criterion (e.g., an educational or vocational objective).
- Scores are often interpreted in pass-fail terms and used for diagnostic and remedial
purposes.
5. Key Formula:
- Reliability depends on the variability of test scores.
- Total variance (σ²) = True variance (σ²_tr) + Error variance (σ²_e).
RELIABILITY IN MEASUREMENTS
Sampling Theory:
- This approach views a test score as a sample from a larger population of possible scores. Reliability is estimated by comparing the variance of true scores to the total variance.
- Focuses on generalizability theory and domain sampling models.
● CLASSICAL TEST THEORY (CTT)
DIFFERENCES
- Relatively simple to understand and apply, requiring only basic statistical concepts. It is widely used in practice.
DIFFICULTIES
- The simplest of the theories of reliability measurement.
● SAMPLING THEORY
DIFFERENCES
- More complex than CTT, requiring an understanding of sampling distributions and statistical inference. It is often used in large-scale assessments.
DIFFICULTIES
- Moderately complex; more demanding than CTT but simpler than G theory and IRT.
● GENERALIZABILITY THEORY (G THEORY)
DIFFERENCES
- The most complex of the theories, requiring specialized statistical software. It is often used in research settings.
DIFFICULTIES
- Requires specialized software and specialized knowledge to apply.
● ITEM RESPONSE THEORY (IRT)
DIFFERENCES
- IRT requires sophisticated statistical software and a strong understanding of mathematical models. It is often used in adaptive testing and advanced item analysis.
DIFFICULTIES
- Like G theory, it requires specialized software and specialized knowledge to apply.
CONCLUSION:
Each theory offers a different framework for estimating reliability; the appropriate choice depends on the nature of the test, the trait being measured, and the resources and statistical expertise available.
RESOURCES:
CLASSICAL TEST THEORY: Zickar & Broadfoot (2009)
DOMAIN SAMPLING THEORY & GENERALIZABILITY THEORY: Tryon (1957); Shavelson et al. (1989); developed by Lee J. Cronbach (1970)
ITEM RESPONSE THEORY: Lord (1980); Lord & Novick (1968)
15-Item Quiz
Identification
1. The term refers to the process where researchers publicly commit to their research
methods before conducting the study.
2. This reliability coefficient measures the consistency of test scores over time by
comparing scores from the same test administered at two different points in time.
3. This term refers to a test that contains items of lower difficulty, where the primary factor
is how quickly test-takers can complete the items.
5. True or False: The replicability crisis in science has been entirely resolved with no
further issues regarding unreliable findings.
6. True or False: Coefficient Alpha is accurate when all test items have unequal loadings
(strength of the relationship between true and observed scores).
7. True or False: Dynamic traits, such as anxiety, are best measured using test-retest
reliability.
8. Question: Which of the following methods is used to adjust split-half reliability to account
for the lower correlation of test halves?
A) Kuder-Richardson Formula
B) Spearman-Brown Formula
C) Cronbach's Alpha
D) Test-Retest Reliability
9. What is the significance of the changes in scientific practices in response to the
replicability crisis?
10. What is one proposed remedy for questionable research practices (QRPs) in scientific studies?
a) Preregistration of research procedures
b) Publishing results without peer review
c) Reducing the number of experiments conducted
d) Avoiding statistical analysis
11. What theory in reliability measurement is the most fundamental or most widely used in
psychometrics?
a) Item Response Theory
b) Generalizability Theory
c) Classical Test Theory
d) Sampling Theory
ENUMERATE