
Reliability

- Refers to dependability or consistency (e.g., a reliable train or friend).

- Psychometric Meaning: Reliability refers to consistency in measurement, meaning that a test produces similar results consistently, regardless of whether those results are good or bad.

Key Features of Reliability

1. Reliability is not absolute; a test may be reliable in one context and unreliable in another.

2. Reliability is quantified using a reliability coefficient, which ranges from:

0: Not reliable at all.

1: Perfect reliability.

Types of Reliability Coefficients

1. Test-Retest Reliability: Measures the stability of test results over time.

2. Alternate-Forms Reliability: Evaluates the consistency of results between different versions of a test.

3. Split-Half Reliability: Assesses consistency by dividing a single test into two equivalent
halves.

4. Inter-Scorer Reliability: Determines the level of agreement between different evaluators or scorers.

Measurement Error

Definition: Fluctuations or inaccuracies in measurements that occur even when no apparent mistakes are made.

Two Types of Errors:

1. Preventable Errors: Mistakes caused by human factors such as lack of skill, carelessness,
or inadequate preparation.

2. Inevitable Errors: Natural variations caused by physical factors like molecular motion due to
heat or other environmental influences.

Examples of Measurement Error


A ruler used for home repair projects might be reliable for basic measurements but is insufficient for precise machine manufacturing, which demands much tighter measurement tolerances.

Significance of Reliability

1. Reliability ensures the consistency of test results, which is crucial for interpreting test scores
accurately.

2. Measurement errors, though inevitable, can often be minimized by improving test design and
administration.

3. Tests measuring stable traits (e.g., intelligence) are expected to show high reliability, whereas
tests for fluctuating variables (e.g., mood) may have lower reliability.

True Scores vs. Construct Scores

1. True Scores:

Represent the ideal measurement of a quantity without any error.

They are unobservable in practice but can be approximated by averaging multiple measurements.

2. Construct Scores:

Approximations of true scores that are influenced by measurement error and other variables.

Challenges in Measuring True Scores

1. Fluctuating Variables:

Some psychological traits (e.g., mood, alertness, motivation) are in constant flux, making true
scores variable over time.

2. Carryover Effects:

Definition: Changes in measurement results due to the testing process itself.

Types of Carryover Effects:

Practice Effects: Improvement in test performance due to repeated exposure to the test.

Fatigue Effects: Decrease in test performance due to mental exhaustion or lack of motivation
from repeated testing.

Understanding True Scores Using Repeated Measurements

If repeated measurements could be taken without carryover effects, the long-term average of
those measurements would equal the true score.
In reality, due to the passage of time and carryover effects, true scores can only be estimated.

Standard Error of Measurement (SEM)

Definition: The standard deviation of repeated measurements for an individual, representing the
typical distance from an observed score to the true score.

Example:

Population Mean: 100

Population Standard Deviation: 15

Individual’s True Score: 120

SEM: 5

Possible observed scores will vary but will generally fall within the range determined by the
SEM.
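To make the example concrete, here is a minimal Python sketch that simulates observed scores for this individual, assuming the measurement errors are normally distributed around the true score of 120 with an SEM of 5 (the simulation itself is illustrative, not part of the example above).

```python
import numpy as np

# Minimal sketch (illustrative assumption: normally distributed error):
# simulate repeated observed scores for one person under classical test theory.
rng = np.random.default_rng(0)

true_score = 120   # the individual's true score from the example
sem = 5            # standard error of measurement from the example

# Observed score = true score + random error, with error ~ N(0, SEM)
observed = true_score + rng.normal(0, sem, size=10_000)

print(round(observed.mean(), 1))        # ~120: errors average out
print(round(observed.std(ddof=0), 1))   # ~5: matches the SEM
# Roughly 95% of observed scores fall within about +/- 2 SEM of the true score.
print(round(np.mean(np.abs(observed - true_score) <= 2 * sem), 2))
```

Under the conventional CTT relation SEM = SD × √(1 − reliability), an SEM of 5 alongside a population SD of 15 implies a test reliability of roughly 1 − (5/15)² ≈ 0.89.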

True Scores vs. Construct Scores

1. True Score:

Refers to the average of many measurements using a specific measurement instrument, including flaws inherent to the tool.

Instrument-Specific: A person’s true score on one test might differ from their true score on
another test for the same construct (e.g., depression).

2. Construct Score:

Refers to a person’s standing on a theoretical variable (e.g., depression, agreeableness) independent of measurement.

Represents the ideal measure of truth but is rarely achievable due to flawed measurement tools.
3. Key Relationship:

Reliable Tests: Produce scores that approximate true scores.

Valid Tests: Produce scores that approximate construct scores.


Importance of True Scores

True scores are essential for understanding and calculating reliability, which is a prerequisite for
validity.

Without reliability, a test cannot be valid. However, high reliability does not guarantee validity.

Example: A flawed test may produce consistent results (reliable) but fail to measure the
intended construct (invalid).

The Equation of Reliability

Observed Score (X): The measurement obtained during testing.

True Score (T): The actual score reflecting the individual’s standing on the test without
measurement error.

Measurement Error (E): The difference between the observed score and the true score.

Equation: X = T + E (the observed score equals the true score plus measurement error).

1. Reliable Test:

The observed score (X) is primarily influenced by the true score (T).

2. Unreliable Test:

The observed score (X) is largely influenced by measurement error (E).

Variance and Reliability

1. Variance (σ²):

Variance is the standard deviation squared and measures variability in test scores.

Total observed variance (σ²) can be broken into two components:

True Variance (σ²ₜ): Variance due to true differences between individuals.

Error Variance (σ²ₑ): Variance due to irrelevant or random factors.

Variance in Test Scores


1. Total Variance (σ²):

The total variance in observed test scores equals the true variance (σ²ₜ) plus the error variance (σ²ₑ).

Formula: σ² = σ²ₜ + σ²ₑ

2. Reliability:

Refers to the proportion of total variance attributed to true variance.

A higher proportion of true variance indicates a more reliable test.

Assumes true differences are stable, leading to consistent scores across repeated
administrations or equivalent forms.
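As an illustration of this decomposition, the following Python sketch (hypothetical values, assuming normally distributed true scores and errors) simulates X = T + E for a group and computes reliability as the proportion of total variance that is true variance.

```python
import numpy as np

# Minimal sketch (hypothetical values): reliability as the proportion of
# observed-score variance that is true-score variance.
rng = np.random.default_rng(1)

n_people = 100_000
true_scores = rng.normal(100, 15, n_people)   # true scores T
errors = rng.normal(0, 5, n_people)           # random measurement error E
observed = true_scores + errors               # X = T + E

true_var = true_scores.var()
error_var = errors.var()
total_var = observed.var()

# sigma^2 is approximately sigma_t^2 + sigma_e^2 when E is uncorrelated with T
print(round(total_var, 1), round(true_var + error_var, 1))

reliability = true_var / total_var            # proportion of variance that is "true"
print(round(reliability, 3))                  # ~0.90 here: 15^2 / (15^2 + 5^2)
```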

Measurement Error

1. Random Error:

Consists of unpredictable fluctuations or inconsistencies in the measurement process.

Examples:

Environmental disturbances (e.g., a sudden noise or event).

Test-taker's physical condition (e.g., a surge in blood sugar).

Unanticipated benefits: Random events (e.g., learning a relevant term before a test).

Effect:

Increases or decreases scores unpredictably.

Tends to cancel out over time when averaged across repeated measures.

2. Systematic Error:

Consistent influences that inflate or deflate scores in a predictable direction.

Example:

A flawed measurement tool (e.g., a ruler measuring 12.1 inches instead of 12 inches).

Effect:
Does not affect score consistency but introduces bias.

Systematic error is predictable and fixable once identified.
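A small simulation can make the contrast concrete. The sketch below (hypothetical numbers) shows that random error averages toward zero across repeated measurements, while a systematic error of 0.1 inch shifts every measurement, and therefore the average, by the same amount.

```python
import numpy as np

# Minimal sketch (hypothetical numbers): random error averages out across
# repeated measurements, while systematic error biases every measurement equally.
rng = np.random.default_rng(2)

true_length = 12.0                         # inches
random_error = rng.normal(0, 0.05, 1000)   # unpredictable fluctuation per measurement
systematic_bias = 0.1                      # a "ruler" that always reads 0.1 inch long

random_only = true_length + random_error
biased = true_length + systematic_bias + random_error

print(round(random_only.mean(), 3))  # ~12.000: random error cancels out on average
print(round(biased.mean(), 3))       # ~12.100: systematic error does not cancel out
```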

Key Terminology

Bias:

Refers to the degree to which systematic error consistently overestimates or underestimates measurements.

In statistics, bias is a technical term for predictable systematic error rather than prejudice.

Sources of Error Variance

1. Test Construction:

Item Sampling:

Differences in the way items are worded or the exact content sampled.

Example: A test-taker may hope for specific questions that align with their strengths.

Content Sampling:

Variation in the specific skills, attributes, or knowledge areas included in the test.

2. Test Administration:

Environmental factors (e.g., lighting, noise, or temperature).

Examiner’s behavior or instructions.

3. Scoring and Interpretation:

Differences in scoring criteria or subjective interpretation by assessors.

1. Test Construction

Content Sampling:

Test scores can vary based on the specific items included in the test and how they are worded.

Example: Test-takers might perform better if the test contains questions they "hoped" to be
asked.
Objective:

Test creators aim to maximize true variance and minimize error variance during test
development.

2. Test Administration

Environmental Factors:

Conditions like room temperature, lighting, ventilation, noise, and seating arrangements can
influence performance.

Examples:

A fly distracting the test-taker.

Poor-quality pencils or damaged desks.

Global events (e.g., war or peace) affecting emotional states.

Test-Taker Variables:

Factors specific to the test-taker that can introduce error variance:

Emotional distress or physical discomfort.

Lack of sleep, medication, or substance use.

Life experiences, formal education, or illness.

Cognitive performance can also be affected by biological factors (e.g., high fasting glucose
levels linked to cognitive errors).

Examiner Variables:

The examiner's appearance, demeanor, or behavior can influence test outcomes.

Examples:

Nonverbal cues like head nodding or eye movements might give away correct answers.

Strong personal beliefs (e.g., religious views) might bias assessments (e.g., overestimating
suicidal risk).

Professionalism is critical to minimizing error variance caused by examiners.

3. Test Scoring and Interpretation

Objective Scoring:
The use of computers and objective grids reduces errors caused by scorer differences.

Challenges in Non-Objective Scoring:

Tests requiring manual scoring or subjective interpretation (e.g., interviews or essays) remain
prone to error variance.

Errors can arise due to scorer bias or inconsistency in interpretation.

Sources of Error Variance Summary

1. Test Construction: Variations in item content and phrasing.

2. Test Administration: Environmental and test-taker-related factors.

3. Examiner Influence: Examiner behavior, personal beliefs, and professionalism.

4. Scoring: Errors due to subjective interpretation or inconsistent scoring.

By addressing these sources, test reliability and validity can be improved, ensuring fairer and
more accurate assessments.

1. Scorers and Scoring Systems

Objective Scoring:

Tests with objective items (e.g., multiple-choice) are often scored by computers, minimizing
error.

However, technical glitches can still contaminate the results.

Subjective Scoring:

Tests requiring human interpretation (e.g., essays, personality tests, creativity tests) are more
prone to error.

Examples:

Intelligence tests: Scorers may encounter responses that fall into ambiguous "gray areas."

Behavioral measures: Observers may differ in their interpretation of behaviors like "social
relatedness."

Consistency in Scoring:

To reduce variability, rigorous scorer training is essential.


Training aims to maximize inter-rater reliability and minimize discrepancies among scorers.

2. Other Sources of Error Variance

Surveys and Polls

Margin of Error:

Surveys often include disclaimers about a margin of error, indicating possible deviations in the results (a worked sketch follows this list).

Example: Political polls predicting election outcomes may be off by a few percentage points.

Sampling Error:

Occurs when the sample of participants is not representative of the target population.

Factors such as demographics or political affiliations may not align with the larger population.

Methodological Error:

Results can be influenced by flaws in the research process, such as:

Poorly trained interviewers.

Ambiguously worded questions.

Biased survey items that favor one perspective or candidate.
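The margin of error mentioned above can be illustrated with the usual 95% formula for a sample proportion, z × √(p(1 − p)/n). The poll numbers in this sketch are hypothetical, and the figure reflects sampling error only, not methodological error.

```python
import math

# Minimal sketch (hypothetical poll numbers): the usual 95% margin of error
# for a sample proportion, z * sqrt(p * (1 - p) / n).
def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    return z * math.sqrt(p * (1 - p) / n)

# A poll of 1,000 respondents in which 52% favor a candidate.
moe = margin_of_error(0.52, 1000)
print(f"{moe:.1%}")  # about 3.1%, i.e., support plausibly between ~48.9% and ~55.1%
```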

Systematic and Nonsystematic Error

Systematic Errors:

Predictable and consistent; they do not cancel each other out.

Example: Biased test wording systematically favors one response.

Nonsystematic Errors:

Random and unpredictable; they tend to balance out over time but still affect individual scores.

Example: Environmental factors or emotional states during testing.

3. Addressing Error Variance

Ensure scoring systems are robust and free of ambiguities.


Train scorers rigorously to improve reliability.

Use representative samples in surveys and ensure appropriate methodologies are followed.

Minimize systematic and nonsystematic errors by addressing both environmental factors and
scoring biases.

By recognizing and mitigating these sources of error, the validity and reliability of assessments
can be significantly enhanced.

1. Assessing Partner Abuse

Private Nature of Abuse:

Abuse typically occurs behind closed doors, so only the individuals involved truly know the
extent of the abuse (Moffitt et al., 1997).

This makes objective assessments difficult and error-prone.

Sources of Nonsystematic Error:

Forgetting past abusive behaviors.

Failing to notice or recognize abusive acts.

Misunderstanding instructions during reporting.

Sources of Systematic Error:

Underreporting:

Driven by fear, shame, or the desire to conform to social norms.

Overreporting:

Some individuals exaggerate claims of abuse for secondary gains, such as custody disputes or
financial incentives (Petherick, 2019).

Implications:

The variability between true reports and error-prone ones makes the "true score" of abuse
difficult, if not impossible, to ascertain.

Debates continue over the utility of methods used to estimate true versus error variance
(Stanley, 1971).

2. Reliability Estimates: Test-Retest Reliability

Definition:
Test-retest reliability measures the stability of a test over time.

It involves administering the same test to the same individuals at two different points in time and
correlating the scores.

Applications:

Ideal for evaluating traits or characteristics assumed to remain stable over time (e.g., personality
traits).

Not suitable for variables expected to fluctuate (e.g., mood, knowledge acquisition).

Example:

A ruler made of steel consistently measures 12 inches as 12 inches over time, showing high
test-retest reliability.

A ruler made of putty that changes shape over time demonstrates low reliability.
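Concretely, a test-retest reliability estimate is simply the Pearson correlation between the two sets of scores. The sketch below uses hypothetical scores for ten test-takers.

```python
import numpy as np

# Minimal sketch (hypothetical scores): test-retest reliability is the Pearson
# correlation between the same people's scores on two administrations.
time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18, 10, 16])
time2 = np.array([13, 14, 10, 19, 18, 12, 13, 17, 9, 15])

r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 2))  # values near 1.0 indicate stable scores over time
```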

Time and Error Variance:

As the time interval between test administrations increases, the reliability coefficient tends to
decrease.

External factors (e.g., learning new information or experiencing trauma) can introduce error
variance.

Coefficient of Stability:

When the interval between tests exceeds six months, the test-retest reliability estimate is
referred to as the coefficient of stability.

Example Scenarios:

A math test’s reliability may decrease if test-takers take a tutorial between test administrations.

A personality test’s reliability may be lower if a participant experiences emotional trauma between tests.

Test-Retest Reliability Revisited

Context and Intervening Factors:

When tests are administered during periods of significant developmental changes, test-retest
reliability estimates may not fully represent a test's reliability.

Factors such as counseling, practice, fatigue, or motivation between test administrations can
confound results.

Ideal Situations for Test-Retest Reliability:


Best for variables like reaction time, perceptual judgments (e.g., brightness or loudness), and
short-term memory tasks.

Even with brief intervals between tests, experience or fatigue can still influence results.

Scientific Replicability:

Reliability demands that test results be replicable under the same conditions by other
experimenters.

Psychology, like other sciences, faces challenges with replication, highlighting the importance of
reliable measures.

Parallel-Forms and Alternate-Forms Reliability

Definitions:

Parallel Forms: Two tests are considered parallel if they have equal means and variances for
observed scores and correlate equally with true scores.

Alternate Forms: Similar to parallel forms but not identical in terms of statistical properties;
designed to be equivalent in content and difficulty.

Coefficient of Equivalence:

Refers to the degree of relationship between parallel or alternate forms of a test, offering an
estimate of reliability.

Practical Implications

Parallel Forms:

Theoretically ideal but challenging to create.

Reliability estimates reflect the degree to which item sampling and other errors affect scores.

Alternate Forms:

Practical and commonly used when parallel forms aren't feasible.

Estimate of reliability is based on the correlation of scores obtained from different test versions.

Real-World Example: Makeup Exams

Scenario:

You missed the midterm exam, and your instructor says you'll take an alternate form of the test
instead of a parallel form.

Feelings: Likely apprehensive, as alternate forms are not guaranteed to be identical in difficulty
or content balance.
Error Sources:

Item sampling differences.

Content variations and subjective perceptions of difficulty.

Evaluating Reliability

Consider Context:

Reliability coefficients are meaningful only when interpreted within the test's purpose and
context.

Whether parallel or alternate forms are used, understanding the nature of the test and its
sources of error is key to drawing valid conclusions about its reliability.

Broader Implications:

Consistency in testing ensures fairness and validity, particularly in high-stakes situations like
exams or clinical assessments.

Alternate-forms reliability provides a practical approach, though parallel-forms reliability remains the gold standard when achievable.

Psychology's Replicability Crisis

Background

Concern in the Mid-2000s:

A growing concern among academic scientists emerged about the rigor of scientific practices.
Researchers feared that published studies, despite being peer-reviewed, were not replicable by
other independent parties.

The Open Science Collaboration (2015):

A significant study in 2015 aimed to replicate 100 psychology studies that had already been
peer-reviewed and published in leading journals.

Results showed that only 40-60% of replications found the same results as the original studies,
highlighting a replicability crisis in psychology.

Causes of the Replicability Crisis

1. Lack of Published Replication Attempts:

Academic journals have historically preferred publishing novel findings over replication studies.

A study found that only 1.07% of psychological scientific literature involved direct replication of
previous work.

Implications:
A focus on novelty over replication limits confidence in findings and contributes to the crisis.

Replications help minimize bias and statistical anomalies.

2. Editorial Preference for Positive Findings:

Positive Findings: Research that rejects the null hypothesis and finds an experimental effect.

Negative Findings: Studies that fail to reject the null hypothesis and find no effect.

Editorial Bias:

Journals tend to favor publishing studies with positive results, leading to a higher likelihood of
seeing only effects rather than the absence of effects.

As a result, there is a publication bias where studies confirming a hypothesis are more likely to
be published, and studies that fail to reject the null hypothesis (negative results) are often
neglected.

3. Questionable Research Practices (QRPs):

QRPs are practices that may not constitute outright fraud but still introduce error or bias into
research findings.

Example:

Data Peeking: Researchers check data during the study and decide whether to collect more
data based on whether the current data has reached statistical significance.

This introduces bias because decisions on data collection are influenced by the data already
gathered.

Other QRPs include selective reporting, where researchers only publish studies that support
their hypothesis, excluding studies that do not confirm their expectations.

Impact of the Crisis

Replicability and Reliability:

The crisis in replicability undermines the reliability and credibility of psychological research.

Negative findings and replication studies often go unpublished, which distorts the actual
strength of effects reported in scientific literature.

Long-Term Consequences:

The replication crisis calls into question the generalizability and robustness of scientific findings,
especially in psychology, where findings are often used to shape policy, practices, and legal
decisions.

Addressing Questionable Research Practices (QRPs)


The Problem of Questionable Research Practices (QRPs)

Impact of QRPs:

QRPs include practices that introduce bias and undermine the integrity of scientific findings. For
example, researchers may selectively report only the studies that support their hypothesis,
hiding the ones that fail to do so.

The issue is that, without access to the researchers' raw data and full research records, it is
difficult for consumers of research to know about important milestones in the study, such as the
sequence of studies conducted or measurements taken.

Preregistration as a Solution

Preregistration:

Preregistration involves researchers publicly committing to a specific set of procedures before conducting their study. This ensures transparency, as there is a clear record of the research plan.

Benefits of Preregistration:

There is no ambiguity about the number of observations or the anticipated measures.

It also helps to address QRPs, as researchers cannot alter their study methodology after seeing
initial results.

Several websites now allow researchers to preregister their research plans, and many journals
require preregistration for study publication. Some journals even provide special recognition for
preregistered studies to boost confidence in the findings.

Lessons Learned from the Replicability Crisis

The Self-Correction Myth

Traditional Assumption:

It was once believed that science would self-correct over time; faulty studies would eventually
be exposed, and the scientific record would be corrected.

Reality:

The problem is that unreliable findings can remain accepted for decades before they are
eventually disconfirmed. The process of self-correction is often slow, and there is no established
mechanism to inform the scientific community or the public when erroneous studies are
disproven.

Legal Implications of Replicability Issues

Admitting Science in Court:


Historically, scientific evidence was admitted in court if it was generally accepted by the scientific
community. This standard, however, is problematic if the science is based on non-replicable
studies.

New Legal Test:

A more rigorous standard is being applied, where judges evaluate scientific evidence based on
criteria like the error rate, peer review, replication, and whether the study was preregistered.

This helps ensure that only robust, reliable findings are admitted as evidence in court.

Moving Forward: Addressing the Replicability Crisis

Steps Taken to Address the Crisis

Preregistration and Open Science:

The adoption of preregistration and other practices aimed at transparency is growing. Open
science initiatives are receiving increasing financial support to help researchers conduct more
rigorous studies.

Replication Efforts:

Replication efforts, such as the Open Science Collaboration's large-scale replication study, are
becoming more common. These efforts aim to increase confidence in research findings and
ensure their validity.

Importance for Various Professions

Legal Field:

In law, scientific research is frequently used in courtrooms, whether in criminal cases or civil
disputes. For example, studies from psychology journals may be used to support or challenge
claims about behavior or causality.

Given the potential consequences of legal decisions, it is crucial that the scientific evidence
relied upon in these cases is replicable and reliable. It is especially critical when timely decisions
are necessary, as legal appeals can be limited, and cases may be expensive to revisit.

Alternate-Forms and Parallel-Forms Reliability Estimates

Overview

Obtaining estimates for alternate-forms and parallel-forms reliability is similar to estimating test-retest reliability in two significant ways:

1. Two Test Administrations: Both forms require administering the test twice to the same group
of participants.

2. Potential External Influences: Factors such as motivation, fatigue, and intervening events
(e.g., practice, learning, or therapy) can affect test scores. However, their impact may not be as
significant as in test-retest reliability (where the same test is administered twice).
Additional Source of Error

An additional source of error, known as item sampling, is inherent in alternate or parallel-forms reliability:

Item Sampling: The test-taker's performance on one form may vary not just because of their true
ability but due to the specific items selected for inclusion in that form.

Challenges of Developing Alternate Forms

Time and Expense: Creating equivalent forms of a test can be time-consuming and costly.
Researchers need to develop equivalent items and then administer them to the same group of
test-takers.

Advantages: Once developed, alternate or parallel forms are advantageous as they:

Minimize the effect of memory from prior test administrations.

Provide more reliable measurements of stable traits (like intelligence).

Traits and Stability in Test Scores

Stable vs. State-like Traits

Stable Traits: Traits presumed to remain consistent over time (e.g., intelligence). Tests
measuring these traits are expected to show a reasonable degree of stability in scores across
alternate or parallel forms.

State-like Traits: Traits subject to fluctuation, such as mood or state anxiety (anxiety felt at a
given moment). These variables are less stable and are expected to show greater variability in
test scores over time.

Challenges with Measuring State-like Variables

State Variables: Since state-like traits are constantly changing, a retest reliability coefficient
might not fully capture the reliability of the measurement. Alternative methods of estimating
reliability are required for these kinds of traits.

Internal Consistency Estimates of Reliability

Internal Consistency:

Internal Consistency Estimate: Reliability can be estimated from a single test administration
without the need for an alternate form or a second test administration. This method assesses
how well the items of the test work together.

Inter-item Consistency: This estimate evaluates the correlation between individual test items to
determine the consistency of responses.
Split-Half Reliability Estimates

Split-Half Reliability: This method is used when it is impractical to administer two tests or
perform a test-retest due to time or expense constraints.

Steps to Calculate Split-Half Reliability:

1. Divide the Test: Split the test into two equivalent halves.

2. Calculate Pearson Correlation: Compute the correlation between scores from the two halves
of the test.

3. Spearman-Brown Adjustment: Adjust the split-half reliability using the Spearman-Brown formula, which corrects for the fact that a test split in half would likely produce a lower correlation than if the full test were used.

The Spearman-Brown prediction formula shows how combining multiple parallel tests can increase reliability; a single test with low reliability requires many parallel tests (or items) to reach high reliability. Reducing test size may be desirable when boredom or fatigue is a concern, and the formula provides guidance on the number of items needed for a desired reliability level. It is worth considering situations in which reducing test size or administration time would be desirable, as well as the arguments against doing so.

Spearman-Brown Formula:
- Combining multiple parallel tests increases reliability.
- If a test with low reliability needs a higher reliability coefficient, lengthen the test by the factor n = r_desired(1 - r_current) / [r_current(1 - r_desired)].
- Example: A 10-item test with a reliability of 0.60, needing a reliability of 0.80, requires 27
items.
- New items must match the original in content and difficulty.
- If reliability is too low, consider finding or developing an alternative instrument.
- Improve reliability by creating new items, clarifying instructions, or simplifying scoring rules.
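The calculations above can be sketched in a few lines of Python. The functions implement the standard Spearman-Brown prophecy formula and its rearrangement for the required lengthening factor; the split-half correlation of 0.70 is a hypothetical value, and the second calculation reproduces the 10-item example from the notes.

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability when a test is lengthened by a factor of n."""
    return (n * r) / (1 + (n - 1) * r)

def length_factor_needed(r_current: float, r_desired: float) -> float:
    """Factor by which a test must be lengthened to reach r_desired."""
    return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

# Split-half correction: a half-test correlation of 0.70 implies a full-test
# reliability of about 0.82 (n = 2 because the full test is twice as long).
print(round(spearman_brown(0.70, 2), 2))

# Example from the notes: a 10-item test with reliability 0.60, target 0.80.
factor = length_factor_needed(0.60, 0.80)
print(round(10 * factor))  # ~27 items needed
```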

2. Other Methods of Estimating Internal Consistency


- Kuder and Richardson (1937): Developed formulas for internal consistency.
- Cronbach’s Coefficient Alpha (1951)
- Measures inter-item consistency from a single test administration.
- Coefficient alpha is the mean of all possible split-half correlations, corrected by the
Spearman-Brown formula.
- Formula: \(\alpha = \frac{k}{k-1}\left(1-\frac{\sum{\sigma_i^2}}{\sigma^2}\right)\), where k is the number of items, \(\sigma_i^2\) is the variance of item i, and \(\sigma^2\) is the variance of total test scores.
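As a sketch of how the formula is applied, the following Python function computes coefficient alpha directly from a matrix of item scores; the five test-takers and four items are hypothetical.

```python
import numpy as np

# Minimal sketch (hypothetical item scores): coefficient alpha computed from
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
def cronbach_alpha(items: np.ndarray) -> float:
    # items: rows = test takers, columns = items
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

scores = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
])
print(round(cronbach_alpha(scores), 2))  # ~0.91 for these hypothetical responses
```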

3. Important Notes
- Internal consistency estimates are inappropriate for heterogeneous and speed tests.
- The impact of test characteristics on reliability is discussed in detail.

1. Coefficient Alpha as a Reliability Measure:


- Coefficient alpha is widely used for measuring reliability because it only requires one
administration of the test.
- It typically ranges from 0 to 1, indicating the degree of similarity among data sets.
- Negative values of alpha are theoretically impossible and should be reported as zero in rare
cases.
- Higher internal consistency is not always better if achieved by highly similar items that yield
no additional information.

2. Limitations of Coefficient Alpha:


- It has several well-known limitations and accurately measures internal consistency under
specific conditions, which are rarely met in real measures.
- Cronbach's alpha is accurate when the loadings (strength of the relationship between true and observed scores) are equal.
- When loadings are unequal, Cronbach's alpha underestimates reliability.

3. Understanding Loadings and True Scores:


- The text describes a scenario with a test containing four items, where each item is the sum
of the true score and a different error term.
- Loadings (represented by the Greek letter lambda, λ) indicate the strength of the relationship
between the true score and observed scores.
- Coefficient alpha is accurate when these loadings are equal, but underestimates reliability when they are unequal.

Cronbach’s alpha is a widely used measure of internal consistency, assessing how similar test
items are in measuring the same construct, with values ranging from 0 (no similarity) to 1
(perfect similarity). However, a common misconception is that a higher alpha always indicates
better reliability; in reality, including overly similar items can inflate alpha without improving
measurement accuracy. Alpha is most effective when all test items have equal loadings (λ), but
it underestimates reliability when loadings are unequal. As an alternative, McDonald’s omega
provides a more accurate measure of internal consistency in such cases, making it a preferred
choice among statisticians when test items have varying relationships with the underlying
construct.
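For a single-factor (congeneric) model with uncorrelated errors, McDonald's omega can be computed from the item loadings as (Σλ)² / [(Σλ)² + Σ error variances]. The sketch below uses hypothetical standardized loadings.

```python
import numpy as np

# Minimal sketch (hypothetical loadings): McDonald's omega for a single-factor model,
# omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances).
loadings = np.array([0.8, 0.7, 0.6, 0.5])   # lambda: item-factor relationships
error_variances = 1 - loadings**2           # assuming standardized items

omega = loadings.sum()**2 / (loadings.sum()**2 + error_variances.sum())
print(round(omega, 2))  # ~0.75 for these hypothetical loadings
```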

Inter-scorer reliability refers to the consistency of evaluations from multiple scorers and is crucial
for ensuring test results are not influenced by the evaluator's identity. Historical examples, such
as the 1912 study where teachers graded the same paper with varying scores, highlight the
need for high inter-scorer reliability, which indicates consistent scoring by trained evaluators.
Factors like unclear scoring criteria can affect reliability, but training and clear guidelines can
improve it. In research, involving multiple raters helps minimize individual biases. The degree of
consistency among scorers can be measured using a correlation coefficient, known as the
coefficient of inter-scorer reliability. This reliability is often used in coding nonverbal behaviors,
ensuring more accurate ratings. Real-life scenarios, such as job interviews, may also impact
evaluation consistency based on the interviewer's relationship with the candidate.
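As a minimal sketch of the coefficient of inter-scorer reliability, the two scorers' ratings below (hypothetical scores for the same ten essays) are simply correlated.

```python
import numpy as np

# Minimal sketch (hypothetical ratings): the coefficient of inter-scorer reliability
# as the correlation between two scorers' ratings of the same ten essays.
scorer_a = np.array([78, 85, 62, 90, 71, 88, 66, 74, 81, 95])
scorer_b = np.array([75, 88, 65, 92, 70, 85, 70, 72, 84, 93])

r_interscorer = np.corrcoef(scorer_a, scorer_b)[0, 1]
print(round(r_interscorer, 2))  # values near 1.0 indicate the scorers agree closely
```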

Alternative Ways of Splitting: There are multiple ways to split a test for split-half reliability, but some methods are not appropriate. A common mistake is simply dividing the test into a first half and a second half without ensuring that the halves are equivalent in difficulty and content; randomly assigning items to halves or using an odd-even split usually produces more comparable halves.

Classical Test Theory and Measurement Error

According to classical test theory, factors like motivation or fatigue are considered measurement
error. However, other models do not always view these factors as errors. For example, Atkinson
(1981) discussed alternative models in personality assessment where fluctuating test scores
due to such factors might not be considered mere error.

The document discusses the importance of reliability in tests and the purpose of reliability
coefficients. It emphasizes that the required level of reliability depends on the significance of the
test. High-stakes tests, such as those with life-or-death implications, must meet high reliability
standards. The text explains that different reliability coefficients reflect various sources of error
variance, such as errors from test construction, administration, scoring, and interpretation. An
example in Figure 5-5 shows different sources of variance in a hypothetical test.

Important Details:
- Reliability Importance: Essential for all tests, but standards vary based on the test's
significance.
- High-stakes tests require higher reliability standards.
- Reliability Coefficients: Different coefficients reflect different sources of error variance.
- Examples of error variance: test construction, administration, scoring, and interpretation.

- Figure 5-5: Sources of variance in a hypothetical test:
- 67% True variance
- 18% Error due to test construction
- 5% Administration error
- 5% Unidentified error
- 5% Scorer error

Table 5-1: Summary of Reliability Types

The Nature of the Test

Closely related to considerations concerning the purpose and use of a reliability coefficient are
those concerning the nature of the test itself. These include whether:
1. The test items are homogeneous or heterogeneous in nature.
2. The characteristic, ability, or trait being measured is presumed to be dynamic or static.
3. The range of test scores is or is not restricted.
4. The test is a speed or a power test.
5. The test is or is not criterion-referenced.

Some tests present special problems regarding the measurement of their reliability. For
example, psychological tests developed for infants help identify children who are developing
slowly or may benefit from early intervention. Measuring internal consistency reliability or
inter-scorer reliability for these tests is similar to other tests. However, measuring test-retest
reliability presents a unique challenge due to the rapid and uneven cognitive development in
young children. The child’s performance on the test could change significantly in a short period,
reflecting genuine changes rather than error. Therefore, gauging the test-retest reliability of such tests must consider developmental changes between testings. Developers of such tests may design test-retest reliability studies with short intervals between testings, sometimes as little as four days.

Homogeneity versus heterogeneity of test items Recall that a test is said to be homogeneous in
items if it is functionally uniform throughout. Tests designed to measure one factor, such as one
ability or one trait, are expected to be homogeneous in items. For such tests, it is reasonable to
expect a high degree of internal consistency. By contrast, if the test is heterogeneous in items,
an estimate of internal consistency might be low relative to a more appropriate estimate of
test-retest reliability. It is important to note that high internal consistency does not guarantee
item homogeneity. As long as the items are positively correlated, adding many items eventually
results in high internal consistency coefficients, homogeneous or not.

Dynamic versus static characteristics Whether what is being measured by the test is dynamic or
static is also a consideration in obtaining an estimate of reliability. A dynamic characteristic is a
trait, state, or ability presumed to be ever-changing as a function of situational and cognitive
experiences. If, for example, one were to take hourly measurements of the dynamic
characteristic of anxiety as manifested by a stockbroker throughout a business day, one might
find the measured level of this characteristic to change from hour to hour. Such changes might
even be related to the magnitude of the Dow Jones average. Because the true amount of
anxiety presumed to exist would vary with each assessment, a test-retest measure would be of
little help in gauging the reliability of the measuring instrument. Therefore, the best estimate of
reliability would be obtained from a measure of internal consistency. Contrast this situation to
one in which a measurement of this same stockbroker is made of a trait, state, or ability
presumed to be relatively unchanging (a static characteristic), such as intelligence. In this
instance, obtained measurement would not be expected to vary significantly as a function of
time, and either the test-retest or the alternate-forms method would be appropriate.

Restriction or inflation of range In using and interpreting a coefficient of reliability, the issue
variously referred to as restriction of range or restriction of variance (or, conversely, inflation of
range or inflation of variance) is important. If the variance of either variable in a correlational
analysis is restricted by the sampling procedure used, then the resulting correlation coefficient
tends to be lower. If the variance of either variable in a correlational analysis is inflated by the
sampling procedure, then the resulting correlation coefficient tends to be higher. Refer back to
Figure 3-18 (Two Scatterplots Illustrating Unrestricted and Restricted Ranges) for a graphic
illustration.

Also of critical importance is whether the range of variances employed is appropriate to the
objective of the correlational analysis. Consider, for example, a published educational test
designed for use with children in grades 1 through 6. Ideally, the manual for this test should
contain not one reliability value covering all the test takers in grades 1 through 6 but instead
reliability values for test takers at each grade level. Here's another example: A corporate
personnel officer employs a certain screening test in the hiring process. For future testing and
hiring purposes, this personnel officer maintains reliability data with respect to scores achieved
by job applicants—as opposed to hired employees—in order to avoid restriction of range effects
in the data. Doing so is important because the people who were hired typically scored higher on
the test than any comparable group of applicants.

Speed tests versus power tests When a time limit is long enough to allow test takers to attempt
all items, and if some items are so difficult that no test taker is able to obtain a perfect score,
then the test is a power test. By contrast, a speed test generally contains items of lower
difficulty, and the completion time is the primary factor.
1. Uniform Difficulty in Speed Tests:
- Speed tests have uniformly low difficulty items with time limits set so that few test takers
complete the test.
- Score differences are based on performance speed since attempted items are usually
correct.

2. Reliability Estimation for Speed Tests:


- Should be based on two independent testing periods.
- Methods: test-retest reliability, alternate-forms reliability, or split-half reliability (with
Spearman-Brown adjustment).

3. Criterion-Referenced Tests:
- Designed to show test taker's standing on a specific variable or criterion (e.g., educational or
vocational objective).
- Scores are often interpreted in pass-fail terms and used for diagnostic and remedial
purposes.

4. Traditional Reliability Estimation:


- Methods: test-retest reliability, alternate-forms reliability, split-half reliability.
- Not always appropriate for criterion-referenced tests due to focus on score variability, which
is less relevant for mastery testing.

5. Key Formula:
- Reliability depends on the variability of test scores.
- Total variance (σ²) = True variance (σ²ₜ) + Error variance (σ²ₑ).

The True Score of Measurements and Alternatives

RELIABILITY IN MEASUREMENTS

Classical Test Theory (CTT):


-​ CTT is a foundational psychometric theory that is widely used to decompose an
observed score into a true score and an error score. It assumes that measurement error
is random and normally distributed.

Sampling Theory:
-​ This approach views a test score as a sample from a larger population of possible
scores. Reliability is estimated by the variance of the true scores compared to the total
variance.
-​ Focuses on generalizability theory and domain sampling models.

Generalizability Theory (G Theory):


-​ G theory extends CTT by explicitly modeling multiple sources of variance (e.g., items,
raters, occasions, facets). It allows for the estimation of reliability under different
conditions.
- It estimates generalizability coefficients, which are analogous to reliability coefficients.

Item Response Theory (IRT):


-​ IRT models the probability of a person responding correctly to an item as a function of
their ability and the item's difficulty. Reliability is expressed as the precision of ability
estimation.
-​ Uses item information functions and test information functions to assess the precision of
measurement at different ability levels.
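As an illustration, the sketch below implements a two-parameter logistic (2PL) item with hypothetical discrimination and difficulty parameters and evaluates its item information function, which shows where on the ability scale the item measures most precisely.

```python
import numpy as np

# Minimal sketch (hypothetical item parameters): a two-parameter logistic (2PL) item.
# P(theta) is the probability of a correct response; the item information function
# a^2 * P * (1 - P) shows where on the ability scale the item is most precise.
def p_correct(theta, a, b):
    return 1 / (1 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    p = p_correct(theta, a, b)
    return a**2 * p * (1 - p)

a, b = 1.5, 0.0                       # discrimination and difficulty (hypothetical)
thetas = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(np.round(p_correct(thetas, a, b), 2))         # probability rises with ability
print(np.round(item_information(thetas, a, b), 2))  # information peaks near theta = b
```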

DIFFERENCES AND DIFFICULTIES BETWEEN RELIABILITY MEASUREMENTS


DIFFERENCES
●​ CLASSICAL TEST THEORY (CTT)

-​ Relatively simple to understand and apply, requiring basic statistical concepts. It's widely
used in practice.

DIFFICULTIES
- The simplest of the reliability measurement theories; it requires only basic statistical knowledge.

DIFFERENCES
●​ SAMPLING THEORY
-​ More complex than CTT, requiring an understanding of sampling distributions and
statistical inference. It's often used in large-scale assessments.

DIFFICULTIES
- Moderately difficult: more complex than CTT, but simpler than G theory and IRT.

DIFFERENCES
●​ GENERALIZABILITY THEORY (G THEORY)

-​ The most complex of the theories, requiring specialized statistical software. It's often
used in research settings.

DIFFICULTIES
- Requires specialized statistical software and knowledge to apply in practice.

DIFFERENCES
●​ ITEM RESPONSE THEORY (IRT)
-​ IRT requires sophisticated statistical software and a strong understanding of
mathematical models. It's often used in adaptive testing and advanced item analysis.

DIFFICULTIES
- Like G theory, it requires specialized statistical software and knowledge to apply in practice.

CONCLUSION:

- The choice of reliability measurement approach depends on the specific research question, the complexity of the assessment, and the resources available. CTT is the most widely used of the theories, which makes it a good starting point, while G theory and IRT are more likely to be used for more advanced approaches to reliability estimation.

RESOURCES:
CLASSICAL TEST THEORY: Zickar & Broadfoot (2009)
DOMAIN SAMPLING THEORY & GENERALIZABILITY THEORY: Tryon (1957); Shavelson et al. (1989); developed by Lee J. Cronbach (1970)
ITEM RESPONSE THEORY: Lord (1980); Lord & Novick (1968)

15-Item Quiz
Identification

1.​ The term refers to the process where researchers publicly commit to their research
methods before conducting the study.
2.​ This reliability coefficient measures the consistency of test scores over time by
comparing scores from the same test administered at two different points in time.

3.​ This term refers to a test that contains items of lower difficulty, where the primary factor
is how quickly test-takers can complete the items.

4.​ What is one major benefit of preregistration in research studies?

a) It reduces the number of participants needed
b) It increases the likelihood of publication
c) It eliminates doubts about the planned number of observations and measures
d) It ensures the study results will be statistically significant

5.​ True or False: The replicability crisis in science has been entirely resolved with no
further issues regarding unreliable findings.

6.​ True or False: Coefficient Alpha is accurate when all test items have unequal loadings
(strength of the relationship between true and observed scores).

7.​ True or False: Dynamic traits, such as anxiety, are best measured using test-retest
reliability.

8.​ Question: Which of the following methods is used to adjust split-half reliability to account
for the lower correlation of test halves?

A) Kuder-Richardson Formula
B) Spearman-Brown Formula
C) Cronbach's Alpha
D) Test-Retest Reliability

9.​ What is the significance of the changes in scientific practices in response to the
replicability crisis?

a) They will prevent the publication of all future studies
b) They signal a shift towards more reliable and replicable scientific research
c) They make replication efforts irrelevant
d) They focus primarily on reducing the number of experiments conducted

10.​ What is one proposed remedy for questionable research practices (QRPs) in scientific studies?

a) Preregistration of research procedures
b) Publishing results without peer review
c) Reducing the number of experiments conducted
d) Avoiding statistical analysis

11. What theory in reliability measurement is the most fundamental or most widely used in
psychometrics?
a) Item Response Theory
b) Generalized Theory
c) Classical Test Theory
d) Sampling Theory

ENUMERATE

12-15) What are the four theories in Reliability Measurement?
