Module 2 PSYCH 3140

This document provides an overview and outline of principles of psychological testing and assessment. It discusses scales of measurement, statistical interpretation of test scores, measures of central tendency and variability, norms, reliability, validity, and item analysis. The document is an instructional module that aims to describe the statistical foundations and basic concepts of psychological assessment and testing.

P S Y C H O L O G I C A L A S S E S S M E N T | 14

Prepared by:
ELIZABETH S. SUBA, Ph.D., RPsy, RPm, RGC
ANGELO R. DULLAS, MA Clinical Psych

E-mail Address:
[email protected]

Central Luzon State University


Science City of Muñoz 3120
Nueva Ecija, Philippines

Instructional Module for the Course


PSYCH 3140 Psychological Assessment

MODULE 2

Topic 1: Principles of Psychological Testing

Overview

This module introduces the Principles of Psychological Assessment and Psychological Testing, their definitions, and their basic concepts, with emphasis on the statistical foundation of modern psychometrics. By the end of the module, you are expected to be able to define these principles and basic concepts. The outline of this chapter follows.

1. Scales of Measurement
2. Statistical Interpretation of Test Scores (Raw and Derived Scores)
3. Measures of Central Tendency
4. Measures of Variability
5. Norms
5.1 Linear and Non-Linear Transformation
5.2 Types of Norms
6. Test Reliability
6.1 General Model of Reliability
6.2 Test-retest
6.3 Alternate Form
6.4 Split Half Reliability
6.5 Kuder Richardson
6.6 Standard Error of Measurement
7. Test Validity
7.1 Content Validity
7.2 Criterion-related Validity
7.3 Construct Validity
8. Item Analysis
8.1 Item Response Theory

I. Objectives:

Upon completion of this module, you are expected to:


1. Describe the basic principles of Psychological Assessment and Psychological Testing.
2. Describe the statistical foundations of Psychological Assessment and Psychological Testing.

II. Learning Activities

MEASUREMENT AND STATISTICS

Statistical Interpretation of Test Scores

Descriptive Statistics - procedures used to summarize and describe a set of data in quantitative
terms, where complete population data are available.

Inferential Statistics - procedures used in drawing inferences about the properties and
characteristics of populations from sample data. Inferences are logical deductions about things
that cannot be observed directly.
Scales of Measurement

A measurement scale differentiates people from each other on any one variable.

Variable- a factor, property, attribute, characteristic, or behavior dimension along which people
or objects differ.
Physical dimension- length or weight

Psychological dimension- intelligence or self-concept

SCALES OF MEASUREMENT

Nominal Scale
- Numbers are used to classify and identify people or objects according to category labels.
Examples:
- Gender can be categorized as "male" or "female"; we can choose to give all females a "score" of 1 and all males a score of 2.
- We can administer an IQ test to a group of people and reclassify their scores as "below average", "average", or "above average".
Limitations:
- They do not provide very precise information about individual differences; they do not really quantify a test-taker's performance.
- They indicate the presence or absence of a property but not the extent or amount of a property. Compare IQ = 102 and IQ = 108 with IQ = Average.
Note: when we transform scores to a nominal scale, our information becomes more general and less precise.

Ordinal Scale
- We classify people or objects by ranking them on some dimension or in terms of the attributes being measured.
- An ordinal scale provides information about where group members fall relative to each other (e.g., 1st, 2nd, 3rd, . . .).
Limitations:
- It does not indicate the precise extent by which group members differ. For example, ranks simply tell us that one child is taller than another, but not exactly how much taller.
- Ranks do not provide the type of individual-differences information that we want.

Interval Scale
- We classify people or objects by ranking them with an equal-unit scale. We need to establish that a difference of 1 or 3 or 5 units is equivalent at any place along the scale.
Example: height. The difference between 60 and 65 inches, a 5-unit difference, is exactly the same as the difference between 40 and 45 inches.
Note: scores on most psychological tests are designed to represent interval scales of measurement.
Application: assume that 3 people, A, B, and C, receive scores of 65, 55, and 45, respectively, on a standardized test of anxiety. If this is an interval-level test, we can draw 3 conclusions:
1. Person A demonstrates a higher level of anxiety than Person B, who in turn is more anxious than Person C. The scores permit us to determine the relative extent of anxiety in these three people.
2. The difference between a score of 55 and a score of 65 for persons A and B is equivalent to the difference between 45 and 55 for persons B and C. Each pair represents a difference of 10 units.
3. The difference in extent of anxiety between persons A and C (20 units) is twice as great as the difference between persons A and B (10 units).

Ratio Scale
- We rank people or objects with an equal-interval scale that has a true zero point.
- True zero point: indicates the absence of the characteristic being measured.
Example: miles per hour. It measures the extent of speed attained by a moving object. At 0 miles per hour, the object is not moving. Each mile-per-hour increment above 0 indicates the increase in speed on an equal-interval scale.
Application: ratio scales are rare in psychological measurement, since it is virtually impossible to define a true zero point for most psychological characteristics. Could a person ever be classified as possessing no intelligence, no aggression, or no self-concept?

NOTE:

As we move from nominal scale to interval and ratio scales, we increase the precision of
the measurement process.

Interval and ratio scales with their equal units are most appropriate for comparing people,
for the study of individual differences.

Types of Scores

Raw Score- scores obtained directly from test performance.


-usually meaningless unless they are transformed to other scales.

Transformed scores or Derived Scores- scores resulting from the transformation of raw score into
other scales in order to facilitate analysis and interpretation.

• in a linear transformation the original (raw) score and transformed scores will be related
in a linear manner.

Describing Score Distributions

Frequency Distribution- a technique for systematically displaying or representing scores to show


how frequently each value was obtained. This distribution can also be shown graphically by
plotting the frequencies either as a frequency polygon or a histogram.

Properties of Frequency Distribution

1. Central Location or Central Tendency- refers to a value or measure near the center of the
distribution which represents the average score of the group.

Mean- the arithmetic average or the value obtained by adding together a set of measurements
and then dividing by the number of measurements in the set.

Median- the middlemost score, or the score above and below which 50% of the scores fall. It is
sometimes referred to as the 50th percentile, the 5th decile, or the second quartile.

Mode- the score that occurs most frequently in a set of test scores, or the score obtained by the
most number of people. When test scores are grouped into intervals, the mode is the midpoint of
the interval containing the largest number of scores.
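As a quick illustration (a minimal sketch using Python's standard library; the scores are hypothetical, not taken from the module), the three measures of central tendency can be computed as follows:

```python
import statistics

# Hypothetical set of seven test scores (illustration only)
scores = [10, 12, 12, 14, 15, 18, 24]

mean = statistics.mean(scores)      # sum of the scores divided by their number
median = statistics.median(scores)  # middlemost score of the sorted set
mode = statistics.mode(scores)      # most frequently occurring score

print(mean, median, mode)  # 15 14 12
```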

2. Variation- refers to the extent of the clustering about a central value or the dispersion of
scores around a given point. If all scores are close to the central value, their variation will
be less than if they tend to depart more markedly from the central values.

Range- the simplest measure; the difference between the largest and smallest score.

Semi-interquartile Range (Quartile Deviation) – a modified type of range used as an index of
variability when the distribution of scores is highly skewed. The quartile deviation Q is computed
as one-half the difference between the 75th percentile (third quartile) and the 25th percentile
(first quartile).
Standard Deviation- most commonly used measure of variability; appropriate when the arithmetic
mean is the reported average; gives an index of how widely the scores are dispersed about the
mean. The larger the standard deviation, the more widely scattered the scores. The standard
deviation is actually the square root of the variance.
Variance- is a measure of the total amount of variability in a set of test scores.
Standard score- expresses a person's performance in terms of his deviation from the mean in
standard deviation units; the data are transformed into deviation units.
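The measures of variability can be sketched the same way (hypothetical scores again; `pvariance`/`pstdev` treat the score set as a complete population):

```python
import statistics

scores = [10, 12, 12, 14, 15, 18, 24]  # hypothetical test scores

score_range = max(scores) - min(scores)  # range: largest minus smallest score
variance = statistics.pvariance(scores)  # total amount of variability
sd = statistics.pstdev(scores)           # standard deviation: square root of the variance

# Standard scores (z): each score's deviation from the mean in SD units
z_scores = [(s - statistics.mean(scores)) / sd for s in scores]

print(score_range)  # 14
```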

3. Skewness- refers to the symmetry or asymmetry of a frequency distribution.

Positively Skewed- if the larger frequencies tend to be concentrated toward the low end of the
variable and the smaller frequencies toward the high end. Few high scores and many low scores.
Mean is larger than the median.

Example: if a test is difficult, scores could cluster at the low end.

Negatively skewed- the larger frequencies are concentrated toward the high end of the scale and
the smaller frequencies toward the low end. Many high scores and few low scores. The median is
larger than the mean.

Example: If a test is easy, the scores would cluster at the high end of the scale and tail off
toward the low end.

Normal Curve- the distribution is symmetrical and bell-shaped, and the larger frequencies are
clustered around the average. The mean, median, and mode coincide.

4. Kurtosis- refers to the flatness or peakedness of one distribution in relation to another.

Leptokurtic- if one distribution is more peaked than normal.


Platykurtic- if it is less peaked
Mesokurtic- normal distribution

NORM REFERENCED VS. CRITERION REFERENCED TESTS


Criterion-Referenced Test
❑ when an individual’s score is compared to an established standard or criterion.
❑ whether the person has reached a certain standard of performance within a domain.
Example: academic performance where you need to score 90% or better on a test
for a grade of 1.00; 80% or better for 1.50, and so forth.
❑ there is a mastery component; a predetermined cut-off score indicates whether the
person has attained an established level of mastery.
❑ Professional licensing examinations are examples that include a mastery component.

Norm-Referenced Test
❑ when an individual's score is compared to that of other individuals who have taken the test,
often called the standardization sample or normative group.

THE MEANING AND APPLICATION OF NORMS


Standardization and Norming
Standardization - involves administering the constructed test to a large sample of people (the
standardization sample) selected as representative of the target population of persons for whom
the test is intended.

NORMS- refer to the performance of the standardization sample used in the process of
standardizing the test; empirically established and presented in tabular form.
- raw scores are converted to some form of derived scores or norms.
Two essential points should be stressed:

• No single population can be regarded as the normative group.


• A wide variety of norm-based interpretations could be made for a given raw score,
depending on which normative group is chosen.

KINDS OF NORMS
Developmental Norms - indicate how far along the normal developmental path the individual
has progressed (Anastasi & Urbina, 1997).
1. Age norms- Age equivalent is the median score on a test obtained by persons (standardization
sample) of a given chronological age.
Mental Age score of an examinee corresponds to the chronological age of the subgroup in the
standardization group whose median is the same as that of the examinee.
2. Grade norms or equivalents- often used in interpreting educational achievement tests. Grade
norms are found by computing the mean or median raw score obtained by students at a given
grade level.
For example: if the average number of problems solved correctly on an Arithmetic test by
the fourth graders in the standardization sample is 23, then a raw score of 23 corresponds to a
grade equivalent of 4.

Within Group Norms


➢ The individual’s performance is evaluated in terms of the performance of the most nearly
comparable standardization group.

Percentile- scores that are expressed in terms of the percentage of persons in the standardization
sample who fall below a given raw score. Also called percentile rank.
For example, if 28% of the subjects obtained a score of 15 problems correct on a
Mathematical Ability test, then a raw score of 15 corresponds to the 28th percentile.
Limitation: inequality of their units, especially at the extremes of the distribution.

Percentiles are derived scores expressed in terms of percentage of persons.


Percentage scores are raw scores expressed in terms of percentage of correct items.
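The percentile-rank idea above can be sketched in plain Python; the helper name and the sample data are illustrative, built to reproduce the Mathematical Ability example from the text.

```python
def percentile_rank(raw_score, sample_scores):
    """Percentage of persons in the sample who fall below the given raw score."""
    below = sum(1 for s in sample_scores if s < raw_score)
    return 100 * below / len(sample_scores)

# Hypothetical standardization sample: 28 of 100 examinees scored below 15
sample = [14] * 28 + [15] * 72

print(percentile_rank(15, sample))  # 28.0 -> a raw score of 15 is the 28th percentile
```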

Standard Scores

• address the limitations of unequal units of percentiles


• express a person’s performance in terms of his score’s deviation from the mean in standard
deviation units.

2 Kinds of Standard Scores Transformation


Linear transformation- scores retain their exact numerical relations of the original raw scores
because they are computed by subtracting a constant (mean) from each raw score and then
dividing the result by another constant (standard deviation).
Linearly derived standard scores are often designated as standard scores or “z scores”.
z Scores
❖ The z-score is considered the base of standard scores, since it is used for conversion to
another type of standard score.
❖ We convert an individual raw score into a z-score by subtracting the mean of the
instrument from the client's raw score and dividing by the standard deviation of the
instrument. The formula for computing a z-score is:

z = (X − M) / SD

where X is the raw score, M is the mean, and SD is the standard deviation of the instrument.

Nonlinear or Normalized standard scores- expressed in terms of a distribution that has been
transformed to fit a normal curve.
T Scores
A T score is obtained by multiplying the normalized standard score by 10 and adding the
result to 50: T = 50 + 10z.
T scores have a fixed mean of 50 and a standard deviation of 10.
A score of 50 corresponds to the mean, a score of 60 to 1 SD above the mean, and so forth.
Some test developers prefer T scores because they eliminate the decimals and the positive
and negative signs of z scores.

Stanines
▪ Range from 1 to 9, with a mean of 5 and a standard deviation of 1.96 except for the stanines
of 1 and 9.

▪ Raw scores are converted to stanines by having the lowest 4 percent of the individuals
receive a stanine score of 1, the next 7 percent receive a stanine of 2, the next 12 percent
receive a stanine of 3, and then just keep progressing through the group.
▪ The disadvantage is that the stanines represent a range of scores, and sometimes people
do not understand that one number represents various raw scores.
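The conversion above can be sketched with the standard stanine distribution of 4, 7, 12, 17, 20, 17, 12, 7, and 4 percent for stanines 1 through 9 (the text lists the first three percentages and says to keep progressing; the remaining values are the conventional symmetric ones, assumed here). Given a percentile rank, the stanine is found from the cumulative percentages:

```python
import bisect

# Cumulative percentages for stanines 1..9 (4, 7, 12, 17, 20, 17, 12, 7, 4 percent per band)
_CUMULATIVE = [4, 11, 23, 40, 60, 77, 89, 96, 100]

def stanine(percentile_rank):
    """Map a percentile rank (0-100) to a stanine (1-9) via the cumulative bands."""
    return 1 + bisect.bisect_left(_CUMULATIVE, percentile_rank)

print(stanine(2), stanine(50), stanine(99))  # 1 5 9
```

Note the disadvantage mentioned in the text: the single stanine number returned here stands for a whole band of percentile ranks.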

Deviation IQs

• is a standard score with a mean of 100 and an SD that approximates the SD of the Stanford-
Binet IQ distribution. It resembles an IQ scale because of the use of 100.
• the deviations from the mean are converted into standard scores, which typically have a
mean of 100 and a standard deviation of 15.
• an extension of the ratio IQ (intelligence quotient) used in early intelligence tests;
deviation IQs are now preferred over the ratio IQ.

CEEB (College Entrance Examination Board) Score


• the raw scores are converted to standard score with a mean of 500 and a standard
deviation of 100.
• used in Scholastic Assessment Test (SAT) and Graduate Record Examination (GRE)
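The linear standard-score transformations above (T, deviation IQ, CEEB) all follow one pattern: new score = new mean + new SD × z. A minimal sketch in Python, with the function names and the example test (mean 40, SD 8) as illustrative assumptions:

```python
def z_score(raw, mean, sd):
    """Linear transformation: deviation from the mean in SD units."""
    return (raw - mean) / sd

def t_score(z):
    """T score: mean 50, SD 10."""
    return 50 + 10 * z

def deviation_iq(z):
    """Deviation IQ: mean 100, SD 15."""
    return 100 + 15 * z

def ceeb(z):
    """CEEB score: mean 500, SD 100."""
    return 500 + 100 * z

# A raw score one SD above the mean of a hypothetical test (mean 40, SD 8)
z = z_score(48, 40, 8)
print(z, t_score(z), deviation_iq(z), ceeb(z))  # 1.0 60.0 115.0 600.0
```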

Comparison of Different Types of Derived Scores in a Normal Distribution



CORRELATIONAL STATISTICS

Correlation is concerned with determining the extent to which two sets of measures such
as intelligence test scores and school grades are related.

Correlation coefficient – a numerical index that describes the magnitude and direction of the
relationship between two variables. (Aiken, 2000) It may be either Positive or Negative.

±0.00 to 0.19  Very weak, negligible correlation
±0.20 to 0.39  Weak, low correlation
±0.40 to 0.59  Moderate correlation
±0.60 to 0.79  Strong, high correlation
±0.80 to 1.00  Very strong correlation

Coefficient of Determination – the squared value of the correlation coefficient. It is the proportion
of the total variation in scores on Y that can be accounted for by information about X. For example,
a correlation of .60 gives a coefficient of determination of .36, so 36% of the variation is shared.

Pearson Product-Moment Correlation or Pearson r - is the most popular measure ;


- It is used when the variables are of interval or ratio type of measurement.
- It ranges from –1.00 (a perfect inverse relationship) to +1.00 (a perfect direct
relationship).
The formula for computing Pearson r is:

r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² · Σ(Y − Ȳ)²]
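The deviation-score computation of Pearson r can be sketched in Python (the function name and the sample data are illustrative):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two sets of paired scores."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))  # sum of cross-products
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))         # sqrt of sum of squares, X
    sy = math.sqrt(sum((b - my) ** 2 for b in y))         # sqrt of sum of squares, Y
    return cov / (sx * sy)

# A perfect direct relationship yields +1.00
print(pearson_r([1, 2, 3], [10, 20, 30]))  # 1.0
```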

The Meaning of Correlation (Aiken, 2000)

Correlation implies predictability

-The accuracy with which a person’s score on measure Y can be predicted from his or her
score on measure X depends on the magnitude of the correlation between the two
variables.

-The closer the correlation coefficient is to an absolute value of 1.00 (either +1.00 or
–1.00), the smaller the average error made in predicting Y scores from X scores.

For example,
If the correlation between tests X and Y is close to +1.00, it can be predicted with
confidence that a person who makes a high score on variable X will also make a high score
on variable Y and a person who makes a low score on X will also obtain a low score on Y.
On the other hand, if the correlation is close to – 1.00 what could be your prediction?

Correlation does not imply causation

-The fact that two variables are significantly correlated facilitates predicting performance
on one from performance on the other, but it provides no direct information on whether
the two variables are causally connected.

Simple Linear Regression – a procedure for determining the algebraic equation of the best-fitting
line for predicting scores on a dependent variable from an independent variable. The
product-moment correlation coefficient, which is a measure of the linear relationship between
two variables, is actually a by-product of the statistical procedure for finding the equation of the
straight line that best fits the set of points representing the paired X-Y values.
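A least-squares fit of that best-fitting line can be sketched as follows (function name and data are illustrative):

```python
def fit_line(x, y):
    """Least-squares regression line y_hat = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope: sum of cross-products over sum of squared X deviations
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx   # the line passes through the point of means
    return intercept, slope

# Perfectly linear hypothetical data: y = 2x
print(fit_line([1, 2, 3], [2, 4, 6]))  # (0.0, 2.0)
```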

Multiple Regression Analysis – it is an extension of simple linear regression analysis to two or more
variables, with Y as the criterion variable and X1, X2, and X3 as the independent variables.

Factor Analysis - a mathematical procedure for analyzing a matrix of correlations among


measurements to determine what factors (constructs) are sufficient to explain the correlations.
Its major purpose is to reduce the number of variables in a group of measures by taking into
account the overlap (correlations) among them.

Other Statistical Tools

➢ The chi-square test for goodness of fit

Chi-square (χ²) is used to determine the strength of association between two nominal
variables. In the test of goodness of fit, the χ² test is used to determine whether a significant
difference exists between the observed frequency distribution and the expected frequency
distribution.

➢ The chi-square test for independence – it seeks to find out whether two nominal variables A and
B are independent of each other, or whether an association exists between them.

Example: is there an association between a manager's annual salary rate (high,
moderate, low) and his educational attainment?

CHARACTERISTICS OF A GOOD TEST


DESIGN PROPERTIES OF A GOOD TEST (Freidenberg, 1995)
❖ A clearly defined purpose.
o What is the test supposed to measure?
- (knowledge, skills, behavior, and attitudes and other characteristics)
o Who will take the test?
- the format of the test may be varied to suit the test taker (oral, written, pictures,
words, manipulations)
o How will the test scores be used?
- appropriateness of different types of test items and test scores.

❖ A specific and standard content.


Content is specific to the domain to be measured
standard - all test takers are tested on the same attributes or knowledge.

❖ A set of standard administration procedures.


-Standard conditions are necessary to minimize the effects of irrelevant variables.

❖ A standard scoring procedure.

PSYCHOMETRIC PROPERTIES OF A GOOD TEST


➢ Reliability

Refers to the consistency of scores obtained by the same person when retested with the
same test or with an equivalent form of the test on different occasions.
➢ Validity

Refers to the degree to which a test measures what it is supposed to measure.


➢ Good Item Statistics

Item Analysis- process of statistically reexamining the qualities of each item of the test. It
includes Item Difficulty Index and Discrimination Index.

TEST RELIABILITY
▪ Refers to the accuracy or consistency of measurement or the degree to which test scores
are consistent, dependable, repeatable and free from errors or free from bias.

▪ Broadly, test reliability indicates the extent to which individual differences in test scores
are attributable to "true" differences in the characteristics under consideration and the
extent to which they are attributable to chance errors.

Despite optimum testing conditions, however, no test is a perfectly reliable instrument.


Reliability Coefficient – a numerical index (between .00 and 1.00) of the reliability of an
assessment instrument. It is based on the correlation between two independently derived sets of
scores.
General Model of Reliability
Theories of test reliability were developed to estimate the effects of inconsistency on the accuracy
of psychological measurement.
This conceptual breakdown is typically represented by the simple equation:

Observed test score = True score + Errors of measurement
X = T + E

where X = score on the test
      T = true score
      E = error of measurement

Errors of measurement represent discrepancies between scores obtained on tests and the
corresponding true scores. Thus,

E = X − T
The goal of reliability theory is to estimate errors in measurement and to suggest ways of
improving tests so that errors are minimized.

METHODS OF OBTAINING RELIABILITY

Test-Retest
Procedure: the same test is given twice, with a time interval between testings. The error variance corresponds to random fluctuation of performance from one test session to another as a result of uncontrolled testing conditions. The time interval must be recorded. Source of error: time sampling.
Coefficient: coefficient of stability.
Problems: memory effect; practice effect; change over time. The practice effect may produce improvement in retest scores; thus, the correlation between the two tests will be spuriously high.

Alternate Form or Parallel Form
Procedure: equivalent tests are given with a time interval between testings; one form of the test is used on the first testing and another, comparable form on the second. In the development of alternate forms, there is a need to ensure that they are truly parallel. Source of error: item sampling.
Coefficient: coefficient of equivalence and coefficient of stability; consistency of response to different item samples.
Problems: it is hard to develop two equivalent tests; scores may reflect change in behavior over time; the practice effect may tend to reduce the correlation between the two test forms; the nature of the test may change with repetition.

Internal Consistency or Split-Half
Procedure: one test is given at one time only. Two scores are obtained by dividing the test into comparable halves (split-half method), and a corrected correlation between the two halves is used. Temporal stability is not a problem because only one test session is involved.
Coefficient: coefficient of internal consistency (split-half method); coefficient of equivalence.
Problems: uses shortened forms (split-half); only good if traits are unitary or homogeneous; gives a high estimate on a speeded test; the correlation gives the reliability of only one half; hard to compute by hand.

Kuder-Richardson Reliability (Inter-item consistency)
Procedure: utilizes a single administration of a single form. KR20 is used for heterogenous instruments and KR21 for homogenous instruments.
Coefficient: consistency of responses to all items.
Problems - two sources of error: (a) content sampling; (b) heterogeneity of the behavior domain sampled.

Coefficient Alpha or Cronbach's Alpha
Procedure: appropriate for instruments where the scoring is not dichotomous, such as scales; takes into consideration the variance of each item.
Coefficient: consistency of responses to items.

Inter-rater or Inter-scorer Reliability
Procedure: different scorers or observers rate the responses independently; used for free responses.
Coefficient: consistency of ratings.
Source of error: observer differences.
Other things being equal, the longer the test, the more reliable it will be.

Lengthening a test, however, will only increase its consistency in terms of content sampling, not
its stability over time. The effect that lengthening or shortening a test will have on its coefficient
can be estimated by means of the Spearman-Brown formula.
The Spearman-Brown formula is used to correct split-half reliability estimates.
- It provides a good estimate of what the reliability coefficient would be if the two halves were
increased to the original length of the instrument.
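The Spearman-Brown prophecy formula, r' = n·r / (1 + (n − 1)·r), predicts the reliability of a test whose length is multiplied by n; with n = 2 it corrects a split-half estimate. A minimal sketch:

```python
def spearman_brown(r, n=2):
    """Predicted reliability when test length is multiplied by n (n=2 corrects split-half)."""
    return n * r / (1 + (n - 1) * r)

# A split-half correlation of .60 corresponds to a full-length reliability of .75
print(spearman_brown(0.6))  # 0.75
```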

Standard Error of Measurement (Whiston, 2000)


is an estimate of the standard deviation of a normal distribution of scores that would
presumably be obtained if a person took the test an infinite number of times.

It provides a band or a range where a psychologist or counselor can expect a client's
"true score" to fall if he were to take the instrument over and over again.

The mean of this hypothetical score distribution is the person's true score on the test. If a
client took a test 100 times, the mean of those 100 scores would be our best estimate of his
or her true score.

Depending on the confidence level that is needed, standard error of measurement can be
used to predict where a score might fall 68%, 95% or 99.5% of the time.

The formula for calculating the standard error of measurement (SEM) is:

SEM = s√(1 − r)

where s represents the standard deviation and r is the reliability coefficient.

Example: Case of Anne (Whiston, 2000)


Anne took the Graduate Record Examinations Aptitude Test (GRE), an instrument used in selecting
and admitting students in the graduate program.
GRE gives three scores: Verbal (GRE-V), Quantitative (GRE-Q) and Analytical (GRE-A)
Scores range from 200 to 800
Anne's score on the GRE-V is 430.
Assume that the mean is 500 and standard deviation is 100
The reliability coefficient for the GRE-V is .90 (Educational Testing Service, 1997).
Therefore, the standard error of measurement would be:
100√(1 − .90) = 100√.10 = 100(.32) = 32
We would then add and subtract the standard error of measurement to Anne's score to
get the range.
A counselor could then tell Anne that 68% of the time she could expect her GRE-V score
to fall between 398 (430 − 32) and 462 (430 + 32).
If we wanted to expand this interpretation, we could use two standard errors of
measurement (2 × 32 = 64).
In this case, we would say that 95% of the time Anne's score would fall between 366 (430
− 64) and 494 (430 + 64).
If we wanted to further increase the probability of including her true score, we would use
three standard errors of measurement (3 × 32 = 96) and conclude that 99.5% of the time her score
would fall between 334 (430 − 96) and 526 (430 + 96).
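The computation in Anne's example can be sketched in Python (values taken from the example; the function name is illustrative):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: sd * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# Anne's GRE-V: sd = 100, reliability = .90, observed score = 430
e = sem(100, 0.90)              # about 31.6, rounded to 32 in the text
band_68 = (430 - 32, 430 + 32)  # +/- 1 SEM covers roughly 68% of retests

print(round(e), band_68)  # 32 (398, 462)
```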
Question:
Given this information, how would you help Anne, if you are the counsellor?
If Anne is applying to a graduate program that only admits students with GRE-V scores of 600 or
higher, what are her chances of being admitted?
As a Psychologist or Counselor, one might assist Anne in examining her GRE scores and
considering other options or other graduate programs.

TEST VALIDITY
The degree to which a test measures what it purports (what it is supposed) to measure when
compared with accepted criteria. (Anastasi and Urbina, 1997).
TYPES OF VALIDITY

CONTENT VALIDITY
Purpose: to compare whether the test items match the set of goals and objectives; whether the test items are representative of the defined universe or content domain that they are supposed to measure. The concern is on test items (content), objectives, and format.
Procedure: compare the test blueprint with the school, course, or program objectives and goals. Have a panel of experts in the content area (e.g., teachers, professors) do the following: examine whether the items represent the defined universe or content domain; utilize systematic observation of behavior (observe the skills and competencies needed to perform a given task).
Types of tests: survey achievement tests, criterion-referenced tests, essential skills tests, minimum-level skills tests, state assessment tests, professional licensing exams, aptitude tests.

CRITERION-RELATED VALIDITY (Concurrent)
Purpose: to predict performance on another measure or to predict an individual's behavior in specified situations; the criterion measure is obtained at the same time as the test.
Procedure: use a rating, observation, or another test as the criterion; correlate test scores with the criterion measure obtained at the same time. Example: a test correlated with supervisory ratings of the worker's performance conducted at the same time.
Types of tests: aptitude tests, ability tests, personality tests, employment tests.

CRITERION-RELATED VALIDITY (Predictive)
Purpose: the criterion measure is to be obtained in the future; the goal is to have test scores accurately predict the criterion performance identified.
Procedure: correlate test scores with the criterion measure obtained after a period of time. Example: predictive validities of admission tests.
Types of tests: achievement tests, certification tests, scholastic aptitude tests, general aptitude batteries, prognostic tests, readiness tests, intelligence tests.

CONSTRUCT VALIDITY
Purpose: to determine whether a construct exists and to understand the traits or concepts that make up the set of scores or items. A construct is not directly observable but is usually derived from theory, research, or observation; a test measures a theoretical construct or trait such as intelligence, mechanical comprehension, or anxiety.
Procedure: conduct multivariate statistical analysis such as factor analysis, discriminant analysis, or multivariate analysis of variance. Construct validation requires evidence that supports the interpretation of test scores in line with the theoretical implications associated with the construct label. The authors should precisely define each construct and distinguish it from other constructs. It involves a gradual accumulation of evidence.
Types of tests: intelligence tests, aptitude tests, personality tests.

Validity Coefficient – the correlation between the scores on an instrument and the criterion
measure.
ITEM ANALYSIS
A general term for procedures designed to assess the utility or validity of a set of test items.

• Validity concerns the entire instrument, while item analysis examines the qualities of each
item.

• done during test construction and revision; provides information that can be used to revise
or edit problematic items or eliminate faulty items.

Item Difficulty Index


An index of the easiness or difficulty of an item

• it reflects the proportion of people getting the item correct, calculated by dividing the
number of individuals who answered the item correctly by the total number of people.

p = (number who answered correctly) / (total number of examinees)

• The item difficulty index can range from .00 (meaning no one got the item correct) to 1.00
(meaning everyone got the item correct).

• Despite its name, the item difficulty index actually indicates how easy the item is, because it
gives the proportion of individuals who got the item correct.

Example: In a test where 15 of the students in a class of 25 got the first item correct:

p = 15 / 25 = .60
• the desired item difficulty depends on the purpose of the assessment, the group taking the
instrument, and the format of the item.
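
The calculation above can be sketched in Python. This is a minimal illustration (the function name and response data are hypothetical); responses are coded 1 for a correct answer and 0 for an incorrect one:

```python
def item_difficulty(responses):
    """Return p, the proportion of examinees who answered the item correctly."""
    return sum(responses) / len(responses)

# The module's example: 15 of 25 students got the first item correct.
responses = [1] * 15 + [0] * 10
p = item_difficulty(responses)
print(p)  # 0.6
```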

Item Discrimination Index


A measure of how effectively an item discriminates between examinees who score high on
the test as a whole (or on some other criterion variable) and those who score low (Aiken,
2000).

I. Extreme Group Method


▪ examinees are divided into two groups based on high and low scores.

▪ Calculated by subtracting the proportion of examinees in the lower group who got the item
correct (or who endorsed the item in the expected manner) from the corresponding
proportion in the upper group.

▪ item discrimination indices can range from + 1.00 (all of the upper group got it right and
none of the lower group got it right) to – 1.00 (none of the upper group got it right and all
of the lower group got it right)

▪ The determination of the upper and lower groups depends on the distribution of scores.
If the distribution is approximately normal, use the top 27% of examinees for the upper group
and the bottom 27% for the lower group (Kelley, 1939). For small groups, Anastasi and
Urbina (1997) suggest using the upper and lower 25% to 33%.

▪ In general, negative item discrimination indices, and especially small positive indices, are
indicators that the item needs to be eliminated or revised.
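
The extreme group method can be sketched as follows. This is an illustrative implementation under the assumptions stated in the comments; the function name and sample data are hypothetical:

```python
def discrimination_index(records, fraction=0.27):
    """Extreme group method: proportion of the upper group minus the
    proportion of the lower group who got the item correct.

    records: list of (total_score, item_correct) pairs, with item_correct
    coded 1 or 0. fraction: share of examinees placed in each extreme
    group (27% for roughly normal score distributions, per Kelley, 1939).
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    n = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n], ranked[-n:]
    p_upper = sum(item for _, item in upper) / n
    p_lower = sum(item for _, item in lower) / n
    return p_upper - p_lower

# Hypothetical class of 10: the high scorers got the item right,
# the low scorers did not, so the item discriminates perfectly.
data = [(95, 1), (90, 1), (88, 1), (70, 1), (65, 0),
        (60, 1), (55, 0), (40, 0), (35, 0), (30, 0)]
print(discrimination_index(data, fraction=0.3))  # 1.0
```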

ITEM RESPONSE THEORY (IRT) OR LATENT TRAIT THEORY


• A theory of testing in which item scores are expressed in terms of estimated scores on a latent-
ability continuum.

• it rests on the assumption that the performance of an examinee on a test item can be
predicted by a set of factors called traits, latent traits or abilities.

• using IRT, we get an indication of an individual’s performance based not on the total score,
but on the precise items the person answers correctly.

• it suggests that the relationship between examinees’ item performance and the underlying
trait being measured can be described by an item characteristic curve.

Item characteristic curve. A graph, used in item analysis, in which the proportion of examinees
passing a specified item is plotted against total test scores.
• An item response curve is constructed by plotting the proportion of respondents who gave the
keyed response against estimates of their true standing on a unidimensional latent trait
or characteristic. It can be constructed either from the responses of a large group of
examinees to an item or, if certain parameters are estimated, from a theoretical model.

Rasch Model – a one-parameter (item difficulty) model for scaling test items for purposes of
item analysis and test standardization.
- The model is based on the assumption that guessing and item discrimination are
negligible parameters. As with other latent-trait models, the Rasch model relates
examinees' performance on test items (percentage passing) to their estimated standings
on a hypothetical latent-ability trait or continuum.
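
The Rasch model's item characteristic curve can be traced with the standard one-parameter logistic function, P(correct) = 1 / (1 + e^-(θ - b)). The sketch below is illustrative (the function name is hypothetical); θ is the examinee's latent ability and b the item difficulty:

```python
import math

def rasch_probability(theta, b):
    """One-parameter (Rasch) logistic model: the probability that an
    examinee with ability theta answers an item of difficulty b
    correctly, P = 1 / (1 + exp(-(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Tracing an item characteristic curve for an item with difficulty b = 0:
# the probability of a correct response rises with ability and is
# exactly .50 when ability equals item difficulty.
for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d}, P(correct) = {rasch_probability(theta, 0):.3f}")
```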

References

Anastasi, Anne and Urbina, Susana (1997). Psychological Testing. 7th edition. New York: Macmillan
Publishing.

Aiken, Lewis R. (2000) Psychological Testing and Assessment. Boston: Allyn and Bacon Inc.

Cohen, Ronald Jay & Swerdlik, Mark E. (2010). Psychological Testing and Assessment. New York:
McGraw-Hill Companies, Inc.

Cronbach, Lee J. (1984). Essentials of Psychological Testing. 4th edition. New York: Harper and Row
Publishers.

Del Pilar, Gregorio H. (2015) Scale Construction: Principles and Procedures, Workshop powerpoint
presentation. AASP-PAP, 2015, Cebu City

Drummond, Robert J. (2000). Appraisal Procedures for Counselors and Helping Professionals. 4th
edition. New Jersey: Prentice Hall.

Dullas, A. R. (2018). The Development of Academic Self-efficacy Scale for Filipino Junior High School
Students. Frontiers in Education, Educational Psychology section. Front. Educ. 3:19. DOI:
10.3389/feduc.2018.00019

Friedenberg, Lisa (1995). Psychological Testing: Design, Analysis and Use. Boston: Allyn and Bacon.

Groth-Marnat, Gary (2009) Handbook of Psychological Assessment 5th edition. John Wiley and
Sons Inc.

Kaplan, Robert M. and Saccuzzo, Dennis P. (1997). Psychological Testing: Principles, Applications,
and Issues. 4th edition. California: Brooks/Cole Publishing Company.

Kellerman, Henry and Burry, Anthony (1991). Handbook of Psychological Testing. 2nd edition.
Boston: Allyn and Bacon.

Murphy, Kevin R. and Davidshofer, Charles O. (1998). Psychological Testing: Principles and
Applications. New Jersey: Prentice Hall.

Newmark, Charles S. (1985) Major Psychological Assessment Instruments. Boston: Allyn and
Bacon.

Orense, Charity and Jason Parena (2014) Lecture in Psychological Assessment, Review Manual in
RGC Licensure Examination, Assumption College, Makati.

Suba, Elizabeth S. (2014) Lecture (powerpoint) in Psych 140 Psychological Assessment, CLSU,
Nueva Ecija.

Suba, Elizabeth S. (2013) Lecture (powerpoint) in GU 722 Psychological Assessment , CLSU, Nueva
Ecija

Suba, Elizabeth S. (2005). Lecture notes in Assessment Tools in Counseling. DLSU (unpublished).

Walsh, W. Bruce and Betz, Nancy E. (1995). Tests and Assessment. New Jersey: Prentice Hall.

Morrison, J. (2014). DSM-5 Made Easy: The Clinician's Guide to Diagnosis. New York: The Guilford
Press.

Nolen-Hoeksema, S. (2014). Abnormal Psychology. 6th edition. New York: McGraw-Hill.

Sarason, I. G. and Sarason, B. R. (2005). Abnormal Psychology: The Problem of Maladaptive Behavior.
11th edition. New Jersey: Pearson Prentice Hall.

Others:
1. Manual of psychological tests
2. Psychological Resources Center – test brochures and test descriptions.
3. www.AssessmentPsychology.com
