
PSYCHOLOGICAL ASSESSMENT

DIFFERENTIATING TESTING AND ASSESSMENT – Basic Concepts

Roots of Contemporary Testing & Assessment
- 20th-century France: Alfred Binet published a test designed to help place Paris schoolchildren in appropriate classes.
- During World War I, in 1917, the military needed a way to screen large numbers of recruits.
- The military would come to depend even more on psychological tests to screen recruits for service.
- More and more tests purporting to measure an ever-widening array of psychological variables were developed and used.

DEFINITIONS
• Testing is a term used to refer to everything from the administration of a test to the interpretation of a test score (Cohen & Swerdlik, 2018).
• Assessment acknowledges that tests are only one type of tool used by professional assessors and that a test's value is intimately linked to the knowledge, skill, and experience of the assessor (Cohen & Swerdlik, 2018).
• Psychological Assessment - the gathering and integration of psychology-related data for the purpose of making a psychological evaluation that is accomplished through the use of tools, e.g., tests, interviews, case studies, behavioral observation, and specially designed apparatuses and measurement procedures (Cohen & Swerdlik, 2018).
• Psychological Testing - the process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior (Cohen & Swerdlik, 2018).

Psych Assessment vs Psych Testing

Objective
- Psychological Assessment: To answer a referral question, solve a problem, or arrive at a decision through the use of tools of evaluation.
- Psychological Testing: To obtain some gauge, usually numerical in nature, with regard to an ability or attribute.

Process
- Psychological Assessment: Usually individualized. More typically focuses on how an individual processes rather than simply the results of that processing.
- Psychological Testing: May be individual or group in nature. After test administration, the tester typically adds up the number of correct answers or the number of certain types of responses.

Role of Evaluator
- Psychological Assessment: The assessor is key to the process of selecting tests and/or other tools of evaluation as well as drawing conclusions from the entire evaluation.
- Psychological Testing: The tester is not key to the process; one tester may be substituted for another tester without appreciably affecting the evaluation.

Skill of Evaluator
- Psychological Assessment: Typically requires an educated selection of tools of evaluation, skill in evaluation, and thoughtful organization and integration of data.
- Psychological Testing: Typically requires technician-like skills in terms of administering and scoring a test as well as in interpreting a test result.

Outcome
- Psychological Assessment: Typically entails a logical problem-solving approach that brings to bear many sources of data designed to shed light on a referral question.
- Psychological Testing: Typically yields a test score or series of test scores.

Process of Assessment
Referral → Selection of Tools → Formal Assessment → Writing the Report → Feedback

Approaches Used in Assessment
• Collaborative Psychological Assessment - the assessor and assessee may work as "partners" from initial contact through final feedback (Fischer, 1978, 2004, 2006).
• Therapeutic Psychological Assessment - here, therapeutic self-discovery and new understandings are encouraged throughout the assessment process (Finn, 2003; Finn & Martin, 1997; Finn & Tonsager, 2002).
• Dynamic Assessment - an interactive approach to psychological assessment that usually follows a model of (1) evaluation, (2) intervention of some sort, and (3) evaluation. Dynamic assessment is most typically employed in educational settings, although it may be employed in correctional, corporate, neuropsychological, clinical, and most any other setting as well.

TOOLS OF PSYCHOLOGICAL ASSESSMENT

DEFINITIONS
A test is a measurement device or technique used to quantify behavior or aid in the understanding and prediction of behavior (Kaplan & Saccuzzo, 2018).
A psychological test or educational test is a set of items that are designed to measure characteristics of human beings that pertain to behavior (Kaplan & Saccuzzo, 2018).

TYPES OF TESTS
Tests that can be given to only one person at a time are known as individual tests (Kaplan & Saccuzzo, 2018).
A group test, by contrast, can be administered to more than one person at a time by a single examiner (Kaplan & Saccuzzo, 2018).

Tests may have a different format
• The term format pertains to the form, plan, structure, arrangement, and layout of test items as well as to related considerations such as time limits.
• Format is also used to refer to the form in which a test is administered: computerized, pencil-and-paper, or some other form.

Tests may differ in their administration procedures
• Some tests, particularly those designed for administration on a one-to-one basis, may require an active and knowledgeable test administrator.
• Alternatively, some tests, particularly those designed for administration to groups, may not even require the test administrator to be present while the test takers independently do whatever it is the test requires.

Tests differ in their scoring and interpretation procedures
• In testing and assessment, we may formally define score as a code or summary statement, usually but not necessarily numerical in nature, that reflects an evaluation of performance on a test, task, interview, or some other sample of behavior.
• Scoring is the process of assigning such evaluative codes or statements to performance on tests, tasks, interviews, or other behavior.

Tests differ with respect to their technical quality
• This refers to the psychometric soundness of a test.
• Psychometric soundness refers to how consistently and how accurately a psychological test measures what it purports to measure.
• Assessment professionals also speak of the psychometric utility of a particular test or assessment method. In this context, utility refers to the usefulness or practical value that a test or assessment technique has for a particular purpose.
Tools of Psychological Assessment

• INTERVIEW as a tool of psychological assessment typically involves more than talk.
- If the interview is conducted face-to-face, then the interviewer is probably taking note of not only the content of what is said but also the way it is being said. More specifically, the interviewer is taking note of both verbal and nonverbal behavior.
- Nonverbal behavior may include the interviewee's "body language," movements and facial expressions in response to the interviewer, the extent of eye contact, and apparent willingness to cooperate. The interviewer may also take note of the way that the interviewee is dressed.
- The interview is then defined as a method of gathering information through direct communication involving reciprocal exchange.

• PORTFOLIO - work products, whether retained on paper, canvas, film, video, audio, or some other medium.
- The appeal of portfolio assessment as a tool of evaluation extends to many other fields, including education.
- Some have argued, for example, that the best evaluation of a student's writing skills can be accomplished not by the administration of a test but by asking the student to compile a selection of writing samples.

• CASE HISTORY DATA - refers to records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee.
- Case history data is a useful tool in a wide variety of assessment contexts. In a clinical evaluation, for example, case history data can shed light on an individual's past and current adjustment as well as on the events and circumstances that may have contributed to any changes in adjustment.
- Case history data may include files or excerpts from files maintained at institutions and agencies. Other examples of case history data are letters and written correspondence, photos and family albums, newspaper and magazine clippings, and home videos, movies, and audiotapes.

• BEHAVIORAL OBSERVATION is employed by assessment professionals and may be defined as monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions.
- Behavioral observation is often used as a diagnostic aid in various settings such as inpatient facilities, behavioral research laboratories, and classrooms.
- In addition to diagnosis, behavioral observation may be used for selection purposes, as in corporate settings.

• COMPUTERS AS TOOLS
- Computers can serve as test administrators (online or off) and as highly efficient test scorers. Within seconds they can derive not only test scores but patterns of test scores.
- Whether processed locally or centrally, the account of performance spewed out can range from a mere listing of a score or scores (i.e., a simple scoring report) to the more detailed extended scoring report, which includes statistical analyses of the test taker's performance.
- The acronym CAPA refers to the term computer assisted psychological assessment. Here the word assisted typically refers to the assistance computers provide to the test user, not the test taker.
- Another acronym you may come across is CAT, for computer adaptive testing. The adaptive in this term is a reference to the computer's ability to tailor the test to the test taker's ability or test-taking pattern (a minimal sketch of this idea appears below).
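The following is a small, invented illustration of the "adaptive" idea only, not the algorithm of any actual CAT program (operational systems typically rely on item response theory). The item difficulties, step size, and update rule are assumptions made purely for the example.

```python
# Minimal illustrative sketch of computer adaptive testing (CAT).
# Item difficulties, step size, and update rule are invented for
# illustration; real CAT programs use item response theory models.

items = {"i1": -1.5, "i2": -0.5, "i3": 0.0, "i4": 0.7, "i5": 1.4}  # difficulty

def next_item(ability, unanswered):
    # Present the unanswered item whose difficulty is closest to the
    # current ability estimate.
    return min(unanswered, key=lambda name: abs(items[name] - ability))

def run_cat(answer_fn, ability=0.0, step=0.5):
    unanswered = set(items)
    while unanswered:
        item = next_item(ability, unanswered)
        unanswered.remove(item)
        correct = answer_fn(item)
        # Move the estimate toward harder items after a correct answer
        # and toward easier items after an incorrect one.
        ability += step if correct else -step
    return ability

# Example: a test taker who answers correctly only when item difficulty
# is below 0.5 ends with an ability estimate near that point.
print(run_cat(lambda item: items[item] < 0.5))
```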
Who, What, Why, How, and Where of Assessment

Who Are the Parties in the assessment enterprise?
• Test Developer
• Test User
• Test Taker
• Society at Large
✓ Developers and publishers of tests
✓ Users of tests, and people who are evaluated by means of tests
✓ May also consider the society at large

TEST DEVELOPER
- Test developers and publishers create tests or other methods of assessment.
- The American Psychological Association (APA) has estimated that more than 20,000 new psychological tests are developed each year.
- Examples: Nadine Kaufman, Alan Kaufman, Hermann Rorschach, Henry Murray
- Some popular test developers and publishers in the Philippines:
  • Center for Educational Measurement, Inc. - THE TEST OF EXCELLENCE
  • APSA - The Pioneer in Standards-Based Assessment
  • PHILIPPINE PSYCHOLOGICAL CORPORATION - The country's pioneer in psychological testing (1961)

TEST USER
- Psychological tests and assessment methodologies are used by a wide range of professionals, including clinicians, counselors, school psychologists, psychometricians, human resource personnel, consumer psychologists, I/O psychologists, etc.
- The test user is the person or persons responsible for the selection, administration, and scoring of tests; for the analysis, interpretation, and communication of test results; and for any decisions or actions that are based on test scores.
- Generally, individuals who simply administer tests, score tests, and communicate simple or "canned" test results are not test users.
- The Standards, as well as other published guidelines from specialty professional organizations, have had much to say in terms of identifying just who is a qualified test user and who should have access to (and be permitted to purchase) psychological tests and related tools of psychological assessment.

Test User Qualifications
- Knowledge, skills, abilities, training, experience, and, where appropriate, credentials important for optimal use of psychological tests.
Level A = There are no special qualifications to purchase these products.
Level B = A master's degree in psychology, education, speech-language pathology, occupational therapy, social work, counseling, or in a field closely related to the intended use of the assessment, and formal training in the ethical administration, scoring, and interpretation of clinical assessments.
  • Certification by or full active membership in a professional organization that requires training and experience in the relevant area of assessment.
  • A degree or license to practice in the healthcare or allied healthcare field.
Level C = A doctorate degree in psychology, education, or a closely related field with formal training in the ethical administration, scoring, and interpretation of clinical assessments related to the intended use of the assessment.
  • Licensure or certification to practice in your state in a field related to the purchase.
  • Certification by or full active membership in a professional organization that requires training and experience in the relevant area of assessment.
TEST TAKER
- Having all taken tests, we all have had firsthand experience in the role of test taker.
- In the broad sense in which we are using the term test taker, anyone who is the subject of an assessment or an evaluation can be a test taker or an assessee.
- Test takers may vary with respect to numerous variables, such as:
  o The amount of test anxiety they are experiencing and the degree to which that test anxiety might significantly affect their test results
  o The extent to which they understand and agree with the rationale for the assessment
  o Their capacity and willingness to cooperate with the examiner or to comprehend written test instructions
  o The amount of physical pain or emotional distress they are experiencing
  o The amount of physical discomfort brought on by not having had enough to eat, having had too much to eat, or other physical conditions
  o The extent to which they are alert and wide awake as opposed to nodding off
  o The extent to which they are predisposed to agree or disagree when presented with stimulus statements
  o The extent to which they have received prior coaching
  o The importance they may attribute to portraying themselves in a good (or bad) light
  o The extent to which they are "lucky" and can "beat the odds" on a multiple-choice test

SOCIETY AT LARGE
- Society at large exerts its influence as a party to the assessment enterprise in many ways. As society evolves and as the need to measure different psychological variables emerges, test developers respond by devising new tests.
- Through elected representatives to the legislature, laws are enacted that govern aspects of test development, test administration, and test interpretation (e.g., RA 10029, RA 9258).
- Similarly, by means of court decisions (e.g., annulment, adoption, criminal cases, etc.), society at large exerts its influence on various aspects of the testing and assessment enterprise.

In What Types of Settings Are Assessments Conducted?
▪ Educational Settings
▪ Governmental & Organizational Settings
▪ Clinical Settings
▪ Counseling Settings
▪ Geriatric Settings
▪ Business and Military Settings
▪ Other Settings

HOW of ASSESSMENT

Pre-Test Obligations
• Responsible test users have obligations before, during, and after a test or any measurement procedure is administered.
• Ethical guidelines dictate that, before a test is administered, it should be stored in a way that reasonably ensures that its specific contents will not be made known in advance.
• Another obligation of the test user before the test's administration is to ensure that a prepared and suitably trained person administers the test properly.
• The test administrator (or examiner) must be familiar with the test materials and procedures and must have at the test site all the materials needed to properly administer the test. Materials needed might include a stopwatch, a supply of pencils, and a sufficient number of test protocols.
• School psychologists have another pretest obligation: selecting and using tests that are most appropriate for the individual student being tested.
• Test users have the responsibility of ensuring that the room in which the test will be conducted is suitable and conducive to the testing.

During-Test Obligations
• During test administration, and especially in one-on-one or small-group testing, rapport between the examiner and the examinee can be critically important.
• In this context, rapport may be defined as a working relationship between the examiner and the examinee.

After-Test Obligations
• After a test, users have many obligations as well. These obligations range from safeguarding the test protocols to conveying the test results in a clearly understandable fashion.
• In addition, there are other obligations, such as those related to scoring the test.
• Interpreting the test results and seeing to it that the test data are used in accordance with established procedures and ethical guidelines constitute further obligations of test users.

Where to Go for Authoritative Information: Reference Sources
- Many reference sources exist for learning more about published tests and assessment-related issues.
- These sources vary with respect to detail. Some merely provide descriptions of tests, others provide detailed information regarding technical aspects, and still others provide critical reviews complete with discussion of the pros and cons of usage.

Test catalogues
• Perhaps one of the most readily accessible sources of information is a catalogue distributed by the publisher of the test.
• Because most test publishers make available catalogues of their offerings, this source of test information can be tapped by a simple telephone call, email, or note.

Test manuals
• Detailed information concerning the development of a particular test and technical information relating to it should be found in the test manual, which is usually available from the test publisher.
• However, for security purposes, the test publisher will typically require documentation of professional training before filling an order for a test manual.

Reference volumes
• The Buros Institute of Mental Measurements provides "one-stop shopping" for a great deal of test-related information.
• The initial version of what would evolve into the Mental Measurements Yearbook was compiled by Oscar Buros in 1933.
• The Buros Institute also disseminates a series of publications called Tests in Print that contains a listing of all commercially available English-language tests in print.

Journal articles
• Articles in current journals may contain reviews of a test, updated or independent studies of its psychometric soundness, or examples of how the instrument was used in either research or an applied context.
• There are also journals that focus more specifically on matters related to testing and assessment.

Online Databases
• One of the most widely used bibliographic databases for test-related publications is that maintained by the Educational Resources Information Center (ERIC).
• Funded by the U.S. Department of Education and operated out of the University of Maryland, the ERIC website at www.eric.ed.gov contains a wealth of resources and news about tests, testing, and assessment. There are abstracts of articles, original articles, and links to other useful websites as well.
• ERIC strives to provide balanced information concerning educational assessment and to provide resources that encourage responsible test use.
• The American Psychological Association (APA) maintains a number of databases useful in locating psychology-related information in journal articles, book chapters, and doctoral dissertations.
• PsycINFO is a database of abstracts dating back to 1887.
• ClinPSYC is a database derived from PsycINFO that focuses on abstracts of a clinical nature.
• PsycARTICLES is a database of full-length articles dating back to 1894.
• Health and Psychosocial Instruments (HAPI) contains a listing of measures created or modified for specific research studies but not commercially available; it is available at many college libraries through BRS Information Technologies and also on CD-ROM (updated twice a year).
• PsycLAW is a free database, available to everyone, that contains discussions of selected topics involving psychology and law. It can be accessed at www.apa.org/psyclaw.
• The world's largest private measurement institution is Educational Testing Service (ETS).
• This company, based in Princeton, New Jersey, maintains a staff of some 2,500 people, including about 1,000 measurement professionals and education specialists. These are the folks who bring you the Scholastic Aptitude Test (SAT) and the Graduate Record Exam (GRE), among many other tests.
• Descriptions of these and the many other tests developed by this company can be found at their website, www.ets.org.

Why is there a need for psychological assessment?
- In many ways, psychological testing and assessment are similar to medical tests. Psychological evaluations serve the same purpose.
• Psychological tests and assessments allow a psychologist to understand the nature of the problem and to figure out the best way to go about addressing it.
• Psychological evaluation may sound intimidating, but it is designed to help you. Psychologists use tests and other assessment tools to measure and observe a client's behavior to arrive at a diagnosis and guide treatment (APA, 2013).

HISTORICAL PERSPECTIVE OF ASSESSMENT

EARLY ANTECEDENTS
• Evidence suggests that the Chinese had a relatively sophisticated civil service testing program more than 4,000 years ago (DuBois, 1970, 1972).
• Every third year in China, oral examinations were given to help determine work evaluations and promotion decisions.

Han Dynasty (206 B.C.E. to 220 C.E.)
• Test batteries (two or more tests used in conjunction) were used.
• Topics were civil law, military affairs, agriculture, revenue, and geography.

Ming Dynasty (1368-1644 C.E.)
• A national multistage testing program involved local and regional testing centers equipped with special testing booths.
• Only those who passed this third set of tests were eligible for public office.

• Reports by British missionaries and diplomats encouraged the English East India Company in 1832 to copy the Chinese system as a method of selecting employees for overseas duty.
• Because testing programs worked well for the company, the British government adopted a similar system of testing for its civil service in 1855.
• After the British endorsement of a civil service testing system, the French and German governments followed.
• In 1883, the US government established the American Civil Service Commission, which developed and administered competitive examinations for certain government jobs.

CHARLES DARWIN AND INDIVIDUAL DIFFERENCES
1. To develop a measuring device, we must understand what we want to measure.
2. According to Darwin's theory, some individuals possess characteristics that are more adaptive or successful in a given environment than are those of other members. (The Origin of Species, 1859)
3. Through this process, he argued, life has evolved to its currently complex and intelligent levels.

1) Sir Francis Galton, a relative of Darwin's, soon began applying Darwin's theories to the study of human beings.
2) Hereditary Genius (1869) - some people possessed characteristics that made them more fit than others.
3) Galton initiated a search for knowledge concerning human individual differences, which is now one of the most important domains of scientific psychology.

- Mental test (Cattell, 1890): James McKeen Cattell's doctoral dissertation was based on Galton's work on individual differences in reaction time. As such, Cattell perpetuated and stimulated the forces that ultimately led to the development of modern tests.

EXPERIMENTAL PSYCHOLOGY AND PSYCHOPHYSICAL MEASUREMENT
• J. E. Herbart = eventually used mathematical models as the basis for educational theories that strongly influenced 19th-century educational practices.
• E. H. Weber = followed and attempted to demonstrate the existence of a psychological threshold, the minimum stimulus necessary to activate a sensory system.
• G. T. Fechner = devised the law that the strength of a sensation grows as the logarithm of the stimulus intensity.
• Wilhelm Wundt = set up a laboratory at the University of Leipzig in 1879 and is credited with founding the science of psychology.
• E. B. Titchener = succeeded the work of Wundt and founded structuralism.
• G. Whipple = a student of Titchener who recruited L. L. Thurstone.
• Whipple provided the basis for immense changes in the field of testing by conducting a seminar at the Carnegie Institute in 1919 attended by Thurstone, E. Strong, and other early prominent U.S. psychologists.
• This seminar came up with the Carnegie Interest Inventory and later the Strong Vocational Interest Blank.

- Thus, psychological testing developed from at least two lines of inquiry:
  • one based on the work of Darwin, Galton, and Cattell on the measurement of individual differences, and
  • the other (more theoretically relevant and probably stronger) based on the work of the German psychophysicists Herbart, Weber, Fechner, and Wundt.
• Experimental psychology developed from the latter.
• From this work also came the idea that testing, like an experiment, requires rigorous experimental control.
• Such control comes from administering tests under highly standardized conditions.
• There are also tests that arose in response to important needs such as classifying and identifying the mentally and emotionally handicapped:
  • The Seguin Form Board Test (Seguin, 1866/1907) was developed in an effort to educate and evaluate the mentally disabled.
  • Kraepelin (1912) devised a series of examinations for evaluating emotionally impaired people.

THE EVOLUTION OF INTELLIGENCE AND STANDARDIZED ACHIEVEMENT TESTS

• Binet-Simon Scale
The first version of the test was published in 1905. This instrument contained 30 items of increasing difficulty and was designed to identify intellectually subnormal individuals.
Binet's standardization sample consisted of 50 children who had been given the test under standard conditions – that is, with precisely the same instructions and format.
The 1908 Binet-Simon Scale was substantially improved and introduced the significant concept of a child's mental age.
• Stanford-Binet Intelligence Scale (Terman, 1916)
By 1916, L. M. Terman of Stanford University had revised the Binet test for use in the United States.
- It was the only American version of the Binet test that flourished.
- In Terman's revision, the standardization sample was increased to include 1,000 people, original items were revised, and many items were added.

• Army Alpha and Army Beta
Robert Yerkes, who was then the president of the American Psychological Association, was asked by the army for assistance. Yerkes headed a committee of distinguished psychologists who soon developed two structured group tests of human abilities: the Army Alpha and the Army Beta.
The Army Alpha required reading ability, whereas the Army Beta measured the intelligence of illiterate adults.

• Achievement Tests
In contrast to essay tests, standardized achievement tests provide multiple-choice questions that are standardized on a large sample to produce norms against which the results of new examinees can be compared.
Standardized achievement tests caught on quickly because of the relative ease of administration and scoring and the lack of subjectivity or favoritism that can occur in essay or other written tests.
In 1923, the development of standardized achievement tests culminated in the publication of the Stanford Achievement Test by T. L. Kelley, G. M. Ruch, and L. M. Terman.

• Wechsler Intelligence Scales
A mere 2 years after the 1937 revision of the Stanford-Binet test, David Wechsler published the first version of the Wechsler intelligence scales, the Wechsler-Bellevue Intelligence Scale (W-B) (Wechsler, 1939).
The Wechsler-Bellevue scale contained several interesting innovations in intelligence testing. It yielded several scores, permitting an analysis of an individual's pattern or combination of abilities.
Among the various scores produced by the Wechsler test was the performance IQ. Performance tests do not require a verbal response; one can use them to evaluate intelligence in people who have few verbal or language skills.

PERSONALITY TESTS
- Personality tests measure presumably stable characteristics or traits that theoretically underlie behavior.
• One of the basic goals of traditional personality tests is to measure traits.
• The earliest personality tests were structured paper-and-pencil group tests. These tests provided multiple-choice and true-false questions that could be administered to a large group.
• The first structured personality test, the Woodworth Personal Data Sheet, was developed during World War I and was published in final form just after the war.
• As indicated earlier, the motivation underlying the development of the first personality test was the need to screen military recruits.
• Interpretation of the Woodworth test depended on the now-discredited assumption that the content of an item could be accepted at face value. If the person marked "False" for the statement "I wet the bed," then it was assumed that he or she did not "wet the bed."

• The Rorschach test
- It was first published by Hermann Rorschach of Switzerland in 1921.
- David Levy introduced the Rorschach in the United States.
- The first Rorschach doctoral dissertation written in a U.S. university was not completed until 1932, when Sam Beck, Levy's student, decided to investigate the properties of the Rorschach test scientifically.

• Thematic Apperception Test (TAT) by Henry Murray & Christina Morgan in 1935
- The TAT was more structured.
- The TAT required the subject to make up a story about an ambiguous scene. The TAT purported to measure human needs and thus to ascertain individual differences in motivation.

THE EMERGENCE OF NEW APPROACHES TO PERSONALITY TESTING
- The popularity of the two most important projective personality tests, the Rorschach and the TAT, grew rapidly by the late 1930s and early 1940s.

• MMPI
The Minnesota Multiphasic Personality Inventory (MMPI), published in 1943, began a new era for structured personality tests.
The MMPI authors argued that the meaning of a test response could be determined only by empirical research.
The MMPI, along with its updated companion the MMPI-2 (Butcher, 1989, 1990), is currently the most widely used and referenced personality test.

• Factor Analysis
Factor analysis is a method of finding the minimum number of dimensions (characteristics, attributes), called factors, to account for a large number of variables.
In the 1940s, J. P. Guilford made the first serious attempt to use factor analytic techniques in the development of a structured personality test.
By the end of that decade, R. B. Cattell had introduced the Sixteen Personality Factor Questionnaire (16PF); it remains one of the most well-constructed structured personality tests and an important example of a test developed with the aid of factor analysis.

THE CURRENT ENVIRONMENT
Beginning in the 1980s and through the present, several major branches of applied psychology emerged and flourished: neuropsychology, health psychology, forensic psychology, and child psychology. Because each of these important areas of psychology makes extensive use of psychological tests, psychological testing again grew in status and use.
Testing is indeed one of the essential elements of psychology. All areas of psychology depend on knowledge gained in research studies that rely on measurements. The meaning and dependability of these measurements are essential to psychological research. To study any area of human behavior effectively, one must understand the basic principles of measurement.
In today's complex society, the relevance of the principles, applications, and issues of psychological testing extends far beyond the field of psychology.
The more you know about psychological tests, the more confident you can be in your encounters with them. Given the attacks on tests and threats to prohibit or greatly limit their use, you have a responsibility to yourself and to society to know as much as you can about psychological tests.
A thorough knowledge of testing will allow you to base your decisions on facts and to ensure that tests are used for the most beneficial and constructive purposes.

Culture and Assessment

▪ Culture may be defined as "the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people" (Cohen, 1994).
▪ Culture prescribes many behaviors and ways of thinking. Spoken language, attitudes toward elders, and techniques of child rearing are but a few critical manifestations of culture.
▪ Indeed, the influence of culture on an individual's thoughts and behavior may be a great deal stronger than most of us would acknowledge.

SOME ISSUES REGARDING CULTURE AND ASSESSMENT
Language is the means by which information is communicated. It is a key yet sometimes overlooked variable in the assessment process.

• Verbal Communication
The examiner and the examinee must speak the same language. This is necessary not only for the assessment to proceed but also for the assessor's conclusions regarding the assessment to be reasonably accurate.
When an assessment is conducted with the aid of a translator, different types of problems may emerge:
• subtle nuances of meaning may be lost in translation, or unintentional hints to the correct or more desirable response may be conveyed;
• translated items may be either easier or more difficult than the original;
• some vocabulary words may change meaning or have dual meanings when translated.

• Non-Verbal Communication & Behavior
Facial expressions, finger and hand signs, and shifts in one's position in space may all convey messages. Of course, the messages conveyed by such body language may differ from culture to culture.
Humans communicate not only through verbal means but also through nonverbal means.
Hoffman (1962) questioned the value of timed tests of ability, particularly those tests that employed multiple-choice items. He believed such tests relied too heavily on test takers' quickness of response and as such discriminated against the individual who is characteristically a "deep, brooding thinker."

• Standards of Evaluation
An individualist culture is characterized by value being placed on traits such as self-reliance, autonomy, independence, uniqueness, and competitiveness.
In a collectivist culture, value is placed on traits such as conformity, cooperation, interdependence, and striving toward group goals.
Cultures differ from one another in the extent to which they are individualist or collectivist (Markus & Kitayama, 1991).

Legal and Ethical Considerations in Assessment

DEFINITIONS
• LAWS are rules that individuals must obey for the good of the society as a whole - or rules thought to be for the good of society as a whole.
  Some laws are and have been relatively uncontroversial, while others are very controversial.
• ETHICS is a body of principles of right, proper, or good conduct.
• A code of professional ethics is recognized and accepted by members of a profession; it defines the standard of care expected of members of that profession.
• Standard of care is the level at which the average, reasonable, and prudent professional would provide diagnostic or therapeutic services under the same or similar conditions.
• However, members of the public and members of the profession have not always been on "the same side" with respect to issues of ethics and law.

THE CONCERNS OF THE PUBLIC
• The assessment enterprise has never been well understood by the public, and even today you might hear criticisms based on a misunderstanding of testing.
• Possible consequences of public misunderstanding include fear, anger, legislation, litigation, and administrative regulations.
• The testing-related provisions of the No Child Left Behind Act of 2001 and the 2010 Common Core State Standards have generated a great deal of controversy.
• Concern about the use of psychological tests first became widespread in the aftermath of World War I, when various professionals (as well as nonprofessionals) sought to adapt group tests developed by the military for civilian use in schools and industry.
• In 1969, an article in the Harvard Educational Review entitled "How Much Can We Boost IQ and Scholastic Achievement?" fired up public concern about testing once again.
• Its author, Arthur Jensen, argued that "genetic factors are strongly implicated in the average Negro–white intelligence difference" (1969, p. 82).
• What followed was an outpouring of public and professional attention to nature-versus-nurture issues in addition to widespread skepticism about what intelligence tests were really measuring.

LEGISLATION
• Table 2–1 (Cohen, p. 57) presents several pieces of legislation enacted at the federal level that affect the assessment enterprise.
• In the 1970s, numerous states enacted minimum competency testing programs: formal testing programs designed to be used in decisions regarding various aspects of students' education.
• Truth-in-testing legislation was also passed at the state level beginning in the 1980s.
• The primary objective of these laws was to give test takers a way to learn the criteria by which they are being judged.
• Some truth-in-testing laws require providing descriptions of
(1) the test's purpose and its subject matter,
(2) the knowledge and skills the test purports to measure,
(3) procedures for ensuring accuracy in scoring,
(4) procedures for notifying test takers of errors in scoring, and
(5) procedures for ensuring the test taker's confidentiality.
• The EEOC has published sets of guidelines concerning standards to be met in constructing and using employment tests.
• In 1978, the EEOC, the Civil Service Commission, the Department of Labor, and the Justice Department jointly published the Uniform Guidelines on Employee Selection.

LITIGATION
• Law resulting from litigation (the court-mediated resolution of legal matters of a civil, criminal, or administrative nature) can impact our daily lives.
• Litigation can result in bringing an important and timely matter to the attention of legislators, thus serving as a stimulus to the creation of new legislation.
• Litigation has sometimes been referred to as "judge-made law" because it typically comes in the form of a ruling by a court.

CONCERNS OF THE PROFESSION
• In 1895, the American Psychological Association (APA) formed its first committee on mental measurement.
• Another APA committee on measurement was formed in 1906 to further study various testing-related issues and problems.
• In 1916 and again in 1921, symposia dealing with various issues surrounding the expanding uses of tests were sponsored.
• In 1954, APA published its Technical Recommendations for Psychological Tests and Diagnostic Techniques, a document that set forth testing standards and technical recommendations.
• The following year, the National Educational Association (working in collaboration with the National Council on Measurement) published its Technical Recommendations for Achievement Tests.

TEST USER QUALIFICATION
• As early as 1950, an APA Committee on Ethical Standards for Psychology published a report called Ethical Standards for the Distribution of Psychological Tests and Diagnostic Aids.
Level A: Tests or aids that can adequately be administered, scored, and interpreted with the aid of the manual and a general orientation to the kind of institution or organization in which one is working (for instance, achievement or proficiency tests).
Level B: Tests or aids that require some technical knowledge of test construction and use and of supporting psychological and educational fields such as statistics, individual differences, psychology of adjustment, personnel psychology, and guidance (e.g., aptitude tests and adjustment inventories applicable to normal populations).
Level C: Tests and aids that require substantial understanding of testing and supporting psychological fields together with supervised experience.

Code of Fair Testing Practices in Education
• This document presents standards for educational test developers in four areas:
(1) developing/selecting tests,
(2) interpreting scores,
(3) striving for fairness, and
(4) informing test takers.

• A psychologist licensing law designed to serve as a model for state legislatures has been available from the APA since 1987. However, that law contains no definition of psychological testing.

TESTING PEOPLE WITH DISABILITIES
Challenges may include:
(1) transforming the test into a form that can be taken by the test taker,
(2) transforming the responses of the test taker so that they are scorable, and
(3) meaningfully interpreting the test data.

COMPUTERIZED TEST ADMINISTRATION, SCORING, AND INTERPRETATION
For assessment professionals, some major issues with regard to CAPA are as follows:
• Access to test administration, scoring, and interpretation software.
• Comparability of pencil-and-paper and computerized versions of tests.
• The value of computerized test interpretations.
• Unprofessional, unregulated "psychological testing" online.

GUIDELINES WITH RESPECT TO CERTAIN POPULATIONS
In general, the guidelines are designed to assist professionals in providing informed and developmentally appropriate services.
• Although standards must be followed by all psychologists, guidelines are more aspirational in nature.
• Example: Guidelines for Psychological Practice with Transgender and Gender Nonconforming (TGNC) People.
• The document lists and discusses 16 guidelines.

LEGISLATION IN THE PHILIPPINE CONTEXT
REPUBLIC ACT No. 10029, also known as the Philippine Psychology Act of 2009
• An act to regulate the practice of psychology and psychometrics in the Philippines to protect the public from inexperienced or untrained individuals offering psychological services.

• "Psychometrician" means a natural person who holds a valid certificate of registration and a valid professional identification card as psychometrician issued by the Professional Regulatory Board of Psychology and the Professional Regulation Commission pursuant to this Act.
• As such, he/she shall be authorized to do any of the following: Provided, that such shall at all times be conducted under the supervision of a licensed professional psychologist.

Role of Psychometrician
• The psychometrician plays a secondary role to the psychologist.
• When giving a psychological assessment, a psychometrician should be supervised by a psychologist.
• The greater responsibility is delegated to the psychologist, who should review very well the output of the psychometrician.
• Any blunder by the psychometrician will be blamed on the psychologist who serves as signatory to the psychometrician's report.

• "Psychologist" means a natural person who holds a valid certificate of registration and a valid professional identification card as psychologist issued by the Professional Regulatory Board of Psychology and the PRC for the purpose of delivering psychological services defined in this Act.

• REPUBLIC ACT 9258 or the Guidance and Counseling Act of 2004
• Crafted and designed to professionalize the practice of guidance and counseling in the Philippines.

• Guidance Counselor: A natural person who has been registered and issued a valid Certificate of Registration and a valid Professional Identification Card by the PRB of Guidance and Counseling and the PRC in accordance with RA 9258 and who, by virtue of specialized training, performs for a fee, salary, or other forms of compensation the functions of guidance and counseling under Section 3 of RA 9258.

RIGHTS OF TEST TAKERS

• The right of informed consent
- Test takers have a right to know why they are being evaluated, how the test data will be used, and what (if any) information will be released to whom.
- With full knowledge of such information, test takers give their informed consent to be tested.
- The disclosure of the information needed for consent must, of course, be in language the test taker can understand.

• The right to be informed of test findings
- Giving realistic information about test performance to examinees is not only ethically and legally mandated but may be useful from a therapeutic perspective as well.
- Test takers have a right to be informed, in language they can understand, of the nature of the findings with respect to a test they have taken.

• The right to privacy and confidentiality
- The concept of the privacy right "recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share or withhold from others his attitudes, beliefs, behavior, and opinions" (Shah, 1969, p. 57).
- Confidentiality concerns matters of communication outside the courtroom; privilege protects clients from disclosure in judicial proceedings (Jagim et al., 1978, p. 459).

• The right to the least stigmatizing label
- The Standards advise that the least stigmatizing labels should always be assigned when reporting test results.

BASIC STATISTICAL CONCEPTS IN PSYCH ASSESSMENT – Describing Data

Plotting Data
• One of the simplest methods to reorganize data to make them more intelligible is to plot them in some sort of graphical form.
• There are several common ways in which data can be represented graphically. Some of these methods are frequency distributions, histograms, and stem-and-leaf displays.

Frequency Table
• A frequency table is an ordered listing of the number of individuals having each of the different values for a particular variable.
• It is called a frequency table because it shows how frequently (how many times) each score was used. A frequency table makes the pattern of numbers easy to see.
• You can also use a frequency table to show the number of scores for each value (that is, for each category) of a nominal variable.
Grouped Frequency Table
• Sometimes there are so many possible values that an ordinary frequency table is too awkward to give a simple picture of the scores.
• The solution is to make groupings of values that include all values in a certain range.
• interval - range of values in a grouped frequency table that are grouped together. (For example, if the interval size is 10, one of the intervals might be from 10 to 19.)
• grouped frequency table - frequency table in which the number of individuals (frequency) is given for each interval of values.
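To make the idea concrete, here is a small illustrative Python sketch (the scores and the interval size of 10 are invented for the example): it counts how many times each value occurs, then groups the scores into intervals for a grouped frequency table.

```python
from collections import Counter

# Hypothetical ratings from 20 respondents (invented data).
scores = [7, 8, 5, 7, 9, 10, 6, 8, 8, 7, 3, 7, 6, 9, 8, 7, 5, 8, 10, 6]

# Ordinary frequency table: one row per distinct value, highest value first.
freq_table = Counter(scores)
for value in sorted(freq_table, reverse=True):
    print(value, freq_table[value])

# Grouped frequency table with an interval size of 10 (e.g., 0-9, 10-19).
interval_size = 10
grouped = Counter((score // interval_size) * interval_size for score in scores)
for start in sorted(grouped):
    print(f"{start}-{start + interval_size - 1}", grouped[start])
```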

Histograms
• A graph is another good way to make a large group of scores easy to understand. A picture may be worth a thousand words, but it is also sometimes worth a thousand numbers.
• A histogram is a bar-like graph of a frequency distribution in which the values are plotted along the horizontal axis and the height of each bar is the frequency of that value; the bars are usually placed next to each other without spaces, giving the appearance of a city skyline.
• When you have a nominal variable, the histogram is called a bar graph.
• Since the values of a nominal variable are not in any particular order, you leave a space between the bars.

Frequency Distribution
• A frequency distribution shows the pattern of frequencies over the various values.
• A frequency table or histogram describes a frequency distribution because each shows the pattern or shape of how the frequencies are spread out, or "distributed."
• Psychologists also describe this shape in words:
• unimodal distribution - frequency distribution with one value clearly having a larger frequency than any other.
• bimodal distribution - frequency distribution with two approximately equal frequencies, each clearly larger than any of the others.
• multimodal distribution - frequency distribution with two or more high frequencies separated by a lower frequency; a bimodal distribution is the special case of two high frequencies.
• rectangular distribution - frequency distribution in which all values have approximately the same frequency.
• symmetrical distribution - distribution in which the pattern of frequencies on the left and right side are mirror images of each other.
• skewed distribution - distribution in which the scores pile up on one side of the middle and are spread out on the other side; a distribution that is not symmetrical.

[Figure: (a) approximately symmetrical, (b) skewed to the right (positively skewed), and (c) skewed to the left (negatively skewed)]

• floor effect - situation in which many scores pile up at the low end of a distribution (creating skewness to the right) because it is not possible to have any lower score.
• ceiling effect - situation in which many scores pile up at the high end of a distribution (creating skewness to the left) because it is not possible to have a higher score.

A distribution that is skewed to the right is also called positively skewed. A distribution skewed to the left is also called negatively skewed.

Normal and Kurtotic Distributions
NORMAL CURVE - a specific, mathematically defined, bell-shaped frequency distribution that is symmetrical and unimodal; distributions observed in nature and in research commonly approximate it.
Psychologists also describe a distribution in terms of whether the middle of the distribution is particularly peaked or flat. The standard of comparison is a bell-shaped curve. In psychology research and in nature generally, distributions often are similar to this bell-shaped standard, called the normal curve.
Kurtosis is how much the shape of a distribution differs from a normal curve in terms of whether its curve in the middle is more peaked or flat than the normal curve (DeCarlo, 1997). Kurtosis comes from the Greek word kyrtos, "curve."
KURTOSIS - extent to which a frequency distribution deviates from a normal curve in terms of whether its curve in the middle is more peaked or flat than the normal curve.
• If you start with a normal distribution and move scores from both the center and the tails into the shoulders, the curve becomes flatter and is called platykurtic. This is where the central portion of the distribution is much too flat.
• If, on the other hand, you move scores from the shoulders into both the center and the tails, the curve becomes more peaked with thicker tails. Such a curve is called leptokurtic. Notice that in this distribution there are too many scores in the center and too many scores in the tails.
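As a rough numerical illustration of these shape concepts (the score lists are invented, and the ordinary moment-based formulas are just one common way to quantify shape), the sketch below computes skewness and excess kurtosis: a set of scores with a long right tail shows positive skew, and a flat, roughly rectangular set shows negative (platykurtic) excess kurtosis.

```python
# Illustrative moment-based skewness and excess kurtosis (population
# formulas); the score lists are invented for the example.

def shape(scores):
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    sd = m2 ** 0.5
    skewness = m3 / sd ** 3              # > 0: skewed to the right (positive skew)
    excess_kurtosis = m4 / sd ** 4 - 3   # > 0: leptokurtic, < 0: platykurtic
    return skewness, excess_kurtosis

print(shape([1, 2, 2, 3, 3, 3, 4, 4, 12]))  # long right tail: positive skew
print(shape([1, 2, 3, 4, 5, 6, 7, 8, 9]))   # rectangular: negative (platykurtic) kurtosis
```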
BASIC STATISTICAL CONCEPTS IN PSYCH ASSESSMENT – Measure of Central Tendency

CENTRAL TENDENCY
• The central tendency of a distribution refers to the middle of the group of scores.
• Measures of central tendency refers to the set of measures that reflect where on the scale the distribution is centered.
• Three measures of central tendency: mean, mode, and median.
• Each measure of central tendency uses its own method to come up with a single number describing the middle of a group of scores.
• The MEAN is the most commonly used measure of central tendency.

THE MEAN
▪ MEAN - arithmetic average of a group of scores; the sum of the scores divided by the number of scores.
▪ Outlier - a score with an extreme value (very high or very low) in relation to the other scores in the distribution.

MODE
• The MODE is another measure of central tendency. The mode is the most common single value in a distribution.
• mode = value with the greatest frequency in a distribution
• It can also be defined simply as the most common score, that is, the score obtained from the largest number of subjects. Thus, the mode is that value of X that corresponds to the highest point on the distribution.
• In a perfectly symmetrical unimodal distribution, the mode is the same as the mean. However, what happens when the mean and the mode are not the same? In that situation, the mode is usually not a very good way of describing the central tendency of the scores in the distribution.
MEDIAN  There are three measures of the variability of a group of scores: the
Another alternative to the mean is the MEDIAN. If you line up all the scores range, variance and standard deviation.
from lowest to highest, the middle score is the median.
When you have an even number of scores, the median can be ▪ Measures of variability communicate three related aspects of the data:
between two different scores. In that situation, the median is the average 1st - the opposite of variability is consistency.
(the mean) of those two scores. 2nd - measures of variability indicate how spread out the scores and
The median is the score that corresponds to the point at or below the distribution are.
which 50% of the scores fall when the data are arranged in numerical order. By this definition, the median is also called the 50th percentile.

USES OF THE MEAN, MEDIAN, AND MODE
 The mode is a score that actually occurred, whereas the mean and sometimes the median may be values that never appear in the data.
 The mode also has the obvious advantage of representing the largest number of people.
 The mode has the advantage of being applicable to nominal data, which is not true of the median or the mean.
 Disadvantages: the mode depends on how we group our data, and it may not be particularly representative of the entire collection of numbers.
 The major advantage of the median, which it shares with the mode, is that it is unaffected by extreme scores.
 The median is the preferred measure of central tendency when the data are ordinal scores.
 The median is also preferred when interval or ratio scores form a very skewed distribution.
 Computing the mean is appropriate whenever getting the “average” of the scores makes sense. Therefore, do not use the mean when describing nominal data.
 Likewise, do not compute the mean with ordinal scores. The mean describes interval or ratio data.
 Always compute the mean to summarize a normal or approximately normal distribution: the mean is the mathematical center of any distribution, and in a normal distribution most of the scores are located around this central point. Therefore, the mean is an accurate summary and provides an accurate address for the distribution.
 Only when the distribution is symmetric will the mean and the median be equal, and only when the distribution is symmetric and unimodal will all three measures be the same.
 The mean will inaccurately describe a skewed (nonsymmetrical) distribution; the solution is to use the median to summarize a skewed distribution.
 REMEMBER: Use the mean to summarize normal distributions of interval or ratio scores; use the median to summarize skewed distributions.
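A quick way to see these guidelines in action is to compute all three measures on a small set of hypothetical scores, once without and once with an extreme score; a minimal Python sketch (the data are made up for illustration):

from statistics import mean, median, mode

scores = [7, 8, 8, 9, 10, 11, 12]     # roughly symmetric set of scores
skewed = scores + [45]                # add one extreme score to skew the distribution

print(mean(scores), median(scores), mode(scores))   # ~9.29, 9, 8
print(mean(skewed), median(skewed), mode(skewed))   # 13.75, 9.5, 8 -> the mean is pulled toward the extreme score

The median and mode barely move, which is why the median is the preferred summary for a skewed distribution.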
BASIC STATISTICAL CONCEPTS IN PSYCH ASSESSMENT – Measures of Variability

VARIABILITY
 Researchers also want to know how spread out the scores are in a distribution. This is the amount of variability in the distribution.
 Computing a measure of variability is important because without it a measure of central tendency provides an incomplete description of a distribution. The mean, for example, only indicates the central score and where the most frequent scores are.
 A measure of variability tells us how accurately the measure of central tendency describes the distribution.
REMEMBER: Measures of variability communicate the differences among the scores, how consistently close to the mean the scores are, and how spread out the distribution is.

RANGE
 One way to describe variability is to determine how far the lowest score is from the highest score.
 The descriptive statistic that indicates the distance between the two most extreme scores in a distribution is called the range.
 Range = Highest score – Lowest score
 The range does communicate the spread in the data. However, the range is a rather crude measure: it involves only the two most extreme scores, so it is based on the least typical and often least frequent scores.
 Therefore, we usually use the range as our sole measure of variability only with nominal or ordinal data.

VARIANCE
 The variance of a group of scores is one kind of number that tells you how spread out the scores are around the mean. To be precise, the variance is the average of each score’s squared difference from the mean.
 Mathematically, the distance between a score and the mean is the difference between them, which is the amount that a score deviates from the mean. Thus, a score’s deviation indicates how far it is spread out from the mean.
 Of course, some scores will deviate by more than others, so it makes sense to compute something like the average amount the scores deviate from the mean. Let’s call this the “average of the deviations.” The larger the average of the deviations, the greater the variability.
 The more spread out a distribution is, the larger its variance, because being spread out makes the deviation scores bigger. If the deviation scores are bigger, the squared deviation scores and the average of the squared deviation scores (the variance) are also bigger.
 The variance is rarely used as a descriptive statistic on its own. This is because the variance is based on squared deviation scores, which do not give a very easy-to-understand sense of how spread out the actual, non-squared scores are.

STANDARD DEVIATION
 The most widely used number to describe the spread of a group of scores is the standard deviation. The standard deviation is simply the square root of the variance.
 The measure of variability that more directly communicates the “average of the deviations” is the standard deviation.
 There are two steps in figuring the standard deviation:
❶ Figure the variance.
❷ Take the square root.
 The standard deviation is the positive square root of the variance.
REMEMBER: The standard deviation indicates the “average deviation” from the mean, the consistency in the scores, and how far scores are spread out around the mean.
REMEMBER: The variance and standard deviation are two measures of variability that indicate how much the scores are spread out around the mean. We use the variance and the standard deviation to describe how different the scores are from each other.
REMEMBER: Approximately 34% of the scores in a normal distribution are between the mean and the score that is 1 standard deviation from the mean.
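The range, variance, and standard deviation described above can be computed directly; a minimal Python sketch with hypothetical scores (this uses the population formula, dividing by N, which matches the “average of the squared deviations” definition used here):

scores = [3, 5, 6, 6, 8, 10]                       # hypothetical scores
n = len(scores)
mean = sum(scores) / n

score_range = max(scores) - min(scores)            # highest score minus lowest score
deviations = [x - mean for x in scores]            # how far each score is from the mean
variance = sum(d ** 2 for d in deviations) / n     # average squared deviation from the mean
std_dev = variance ** 0.5                          # standard deviation = square root of the variance

print(score_range, variance, round(std_dev, 2))    # 7, ~4.89, ~2.21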

BASIC STATISTICAL CONCEPTS IN PSYCH ASSESSMENT – NORMAL DISTRIBUTION

THE NORMAL DISTRIBUTION
The normal distribution has the following characteristics:
 The score with the highest frequency is the middle score between the highest and lowest scores.
 The normal curve is symmetrical, meaning that the left half below the middle score is a mirror image of the right half above the middle score.
 As we proceed away from the middle score, either toward the higher or the lower scores, the frequencies at first decrease slightly.
 Farther from the middle score, however, the frequencies decrease more drastically, with the highest and lowest scores having relatively low frequency.
 In statistics, the scores that are relatively far above and below the middle score of the distribution are called the “extreme” scores.
 The far left and right portions of a normal curve containing the low-frequency, extreme scores are called the tails of the distribution.
 On a normal distribution, the farther a score is from the central score of the distribution, the less frequently the score occurs.

 Kurtosis refers to the relative concentration of scores in the center, the upper and lower ends (tails), and the shoulders (between the center and the tails) of a distribution.
 A normal distribution is called mesokurtic. Its tails are neither too thin nor too thick, and there are neither too many nor too few scores concentrated in the center.
 If you start with a normal distribution and move scores from both the center and the tails into the shoulders, the curve becomes flatter and is called platykurtic. This is where the central portion of the distribution is much too flat.
 If, on the other hand, you move scores from the shoulders into both the center and the tails, the curve becomes more peaked with thicker tails. Such a curve is called leptokurtic. Notice that in this distribution there are too many scores in the center and too many scores in the tails.

BASIC STATISTICAL CONCEPTS IN PSYCH ASSESSMENT – Scales of Measurement

Basic Concepts
 VARIABLE - a condition or characteristic that can have different values. In short, it can vary.
 VALUE - a possible number or category that a score can have.
 SCORE - a particular person’s value on a variable.

Basic concepts: kinds of variables
 NUMERIC VARIABLE - a variable whose values are numbers (as opposed to a nominal variable). Also called a quantitative variable.
 In psychology research the most important distinction among numeric variables is between two types: equal-interval variables and rank-order variables.
 An EQUAL-INTERVAL VARIABLE (INTERVAL SCALE) is a variable in which the numbers stand for approximately equal amounts of what is being measured.
 Some equal-interval variables are measured on what is called a RATIO SCALE. A variable is measured on a ratio scale if it has an absolute zero point, which means that the value of zero on the variable indicates a complete absence of the variable.
 A RANK-ORDER VARIABLE is a variable in which the numbers stand only for relative ranking. Rank-order variables are also called ordinal variables (ORDINAL SCALE).
 A rank-order variable provides less information than an equal-interval variable. That is, the difference from one rank to the next doesn’t tell you the exact difference in amount of what is being measured. However, psychologists often use rank-order variables because they are the only information available.
 Another major type of variable used in psychology research, which is not a numeric variable at all, is a NOMINAL VARIABLE, in which the values are names or categories.
 The term nominal comes from the idea that its values are names. NOMINAL SCALES are also called categorical (or qualitative) variables because their values are categories.

LEVELS OF MEASUREMENT
NOMINAL = Named Variables
ORDINAL = Named + Ordered Variables
INTERVAL = Named + Ordered + Proportionate Interval between Variables
RATIO = Named + Ordered + Proportionate Interval between Variables + Can accommodate Absolute Zero

BASIC STATISTICAL CONCEPTS IN PSYCH ASSESSMENT – Z-SCORE

Why Is It Important to Know About Z-Scores?
 Researchers usually don’t know how to interpret someone’s raw score: usually, we won’t know whether, in nature, a score should be considered high or low, good, bad, or what. Instead, the best we can do is compare a score to the other scores in the distribution, describing the score’s relative standing.
 Relative standing reflects the systematic evaluation of a score relative to the sample or population in which the score occurs. The way to calculate the relative standing of a score is to transform it into a z-score.
 With z-scores we can easily determine the underlying raw score’s location in a distribution, its relative and simple frequency, and its percentile. All of this helps us to know whether the individual’s raw score was relatively good, bad, or in-between.

THERE ARE TWO PROBLEMS WITH THESE DESCRIPTIONS
 First, they were somewhat subjective and imprecise.
 Second, to get them we had to look at all scores in the distribution.
 However, recall that the point of statistics is to accurately summarize our data so that we don’t need to look at every score.
 The way to obtain the above information, but more precisely and without looking at every score, is to compute each individual’s z-score.

Z-SCORE
 A z-score describes a score in terms of how much it is above or below the average.
 A z-score is the distance a raw score is from the mean when measured in standard deviations.
 A z-score always has two components:
1) either a positive or negative sign, which indicates whether the raw score is above or below the mean, and
2) the absolute value of the z-score, which indicates how far the score lies from the mean when measured in standard deviations.
 Like any raw score, a z-score is a location on the distribution. However, the important part is that a z-score also simultaneously communicates its distance from the mean. By knowing where a score is relative to the mean, we know the score’s relative standing within the distribution.

INTERPRETING Z-SCORES USING THE Z-DISTRIBUTION
 A z-distribution is the distribution produced by transforming all raw scores in the data into z-scores.
 A “+” indicates that the z-score (and raw score) is above and graphed to the right of the mean. Positive z-scores become increasingly larger as we proceed farther to the right. Larger positive z-scores (and their corresponding raw scores) occur less frequently.
 Conversely, a “-” indicates that the z-score (and raw score) is below and graphed to the left of the mean. Negative z-scores become increasingly larger as we proceed farther to the left. Larger negative z-scores (and their corresponding raw scores) occur less frequently.
 However, most of the z-scores fall between -3 and +3.

USING Z-SCORES TO COMPARE DIFFERENT VARIABLES
REMEMBER: To compare raw scores from two different variables, transform the scores into z-scores.

USING Z-SCORES TO DETERMINE THE RELATIVE FREQUENCY OF RAW SCORES
Ꙭ A third important use of z-scores is for computing the relative frequency of raw scores.
Ꙭ Relative frequency is the proportion of time that a score occurs, and that relative frequency can be computed using the proportion of the total area under the curve.
Ꙭ We can use the z-distribution to determine relative frequency because, as we’ve seen, when raw scores produce the same z-score they are at the same location on their distributions.
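These points can be illustrated with a short sketch: converting hypothetical raw scores to z-scores, using them to compare scores from two different tests, and using the normal curve to estimate relative frequency (SciPy is assumed to be available; all numbers are invented for illustration):

from scipy.stats import norm

def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, measured in standard deviations."""
    return (raw - mean) / sd

# Hypothetical example: a score of 85 on Test A (mean 70, SD 10)
# versus a score of 60 on Test B (mean 50, SD 4).
z_a = z_score(85, 70, 10)    # +1.5
z_b = z_score(60, 50, 4)     # +2.5 -> the Test B score has the higher relative standing

# Relative frequency from the normal curve: proportion of scores between the mean and z = +1
print(norm.cdf(1) - norm.cdf(0))    # ~0.34, the "approximately 34%" noted earlier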

BASIC STATISTICAL CONCEPTS IN PSYCH ASSESSMENT – Comparison of Mean Tests

How many GROUPS?
Two = Independent Samples t-test, Paired t-test
Two or more = One-way ANOVA

INDEPENDENT SAMPLES T-TEST
The independent samples t-test, also called the two-sample t-test, is a statistical test that determines whether there is a statistically significant difference between the means of two independent or unrelated groups.

Unrelated Groups
Unrelated groups, also called unpaired groups, are groups in which the cases (e.g., participants) in each group are different. Often we are investigating differences in individuals, which means that when comparing two groups, an individual in one group cannot also be a member of the other group, and vice versa.

Null and Alternative Hypotheses
Null Hypothesis            Alternative Hypothesis
H0: μ1 = μ2                H1: μ1 ≠ μ2 (two-tailed)
H0: μ1 = μ2                H1: μ1 < μ2 (one-tailed)
H0: μ1 = μ2                H1: μ1 > μ2 (one-tailed)

Assumptions
 Assumption of Independence - the independent variable should consist of two independent, categorical groups.
 Assumption of Normality – the dependent variable should be continuous and approximately normally distributed.
 Assumption of Homogeneity of Variances - the variances of the dependent variable across groups should be equal.
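A minimal sketch of an independent samples t-test on two hypothetical, unrelated groups (SciPy assumed):

from scipy.stats import ttest_ind

group_1 = [23, 25, 28, 30, 21, 27]    # hypothetical scores, e.g., one instruction method
group_2 = [31, 29, 35, 33, 30, 34]    # hypothetical scores from a different, unrelated group

t_stat, p_value = ttest_ind(group_1, group_2)   # assumes equal variances by default
if p_value < 0.05:
    print("Reject H0: the two group means differ significantly.")
else:
    print("Fail to reject H0: no significant difference between the means.")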
PAIRED (SAMPLES) T-TEST
 The Paired Samples t-test compares two means that are from the same individual, object, or related units.
 The two means typically represent two different times (e.g., pretest and post-test with an intervention between the two time points) or two different but related conditions or units.
 The purpose of the test is to determine whether there is statistical evidence that the mean difference between paired observations on a particular outcome is significantly different from zero.

Assumptions
 Assumption of Dependence - the independent variable should consist of two dependent (related) categorical groups.
 Assumption of Continuity – the dependent variable should be measured on a continuous scale.
 Assumption of Normality – the differences between the two paired measurements should be approximately normally distributed.
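A corresponding sketch for the paired samples t-test, using hypothetical pretest and post-test scores from the same individuals (SciPy assumed):

from scipy.stats import ttest_rel

pretest  = [12, 15, 11, 14, 13, 16]   # hypothetical scores before an intervention
posttest = [14, 18, 13, 15, 16, 19]   # the same people measured again afterward

t_stat, p_value = ttest_rel(pretest, posttest)   # tests whether the mean paired difference is zero
print(round(t_stat, 2), round(p_value, 4))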
ONE-WAY ANALYSIS OF VARIANCE (ANOVA)
The One-Way ANOVA (analysis of variance) compares the means of two or more independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. This test is also known as One-Factor ANOVA.

The variables used in this test are known as:
 Dependent variable
 Independent variable (grouping variable/factor)
- This variable divides cases into two or more mutually exclusive levels or groups.

Note:
Both the One-Way ANOVA and the Independent Samples t-test can compare the means for two groups. However, only the One-Way ANOVA can compare the means across three or more groups. If the grouping variable has only two groups, then the results of a one-way ANOVA and the independent samples t-test will be equivalent.

Assumptions
 Dependent variable that is continuous (i.e., interval or ratio level)
 Independent variable that is categorical (i.e., two or more groups)
 Independent samples/groups (i.e., independence of observations). There is no relationship between the subjects in each sample; that is, subjects in the first group cannot also be in the second group, and no group can influence the other group.
 Normal distribution (approximately) of the dependent variable for each group
 Homogeneity of variances (i.e., variances approximately equal across groups)
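A minimal sketch of a one-way ANOVA comparing three hypothetical, independent groups (SciPy assumed):

from scipy.stats import f_oneway

group_a = [20, 22, 19, 24, 21]   # hypothetical scores for three independent groups
group_b = [28, 27, 30, 26, 29]
group_c = [23, 25, 24, 26, 22]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(round(f_stat, 2), round(p_value, 4))   # p < .05 suggests at least one group mean differs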
TEST OF SIGNIFICANT RELATIONSHIP/ASSOCIATION

Statistics
 Descriptive Statistics = collecting, describing, and presenting a set of data
 Inferential Statistics = analysis of a subset of data leading to predictions or inferences about the entire set of data; generalizations about the characteristics of a larger set where only a part is examined

Variables
 Quantitative Variables = variables that are measured on a numeric or quantitative scale
- Discrete variables - variables with a finite or countable number of possible values (e.g., age in years, no. of female enrolees in CSU for A.Y. 2016-2017, etc.)
- Continuous variables - variables that assume any value in a given interval (e.g., height in meters, weight in kg, etc.)
 Qualitative Variables = variables whose values are names or labels and thus can be categorized (categorical variables). Categories may be identified by either non-numerical descriptions or by numerical codes. Ex. civil status, religious affiliation, etc.

Levels of Measurement
Nominal = Attributes are only named; weakest
Ordinal = Attributes can be ordered
Interval = Distance is meaningful
Ratio = Absolute zero

Correlation Analysis
Correlation is a statistical technique that can show how strongly pairs of variables are related. Examples:
(1) score and the no. of hours studying
(2) extent of experience and competence at work
 The correlation coefficient, r, describes the extent of correlation between the variables. It gives an idea of the significance, direction, and strength of the relationship.
- Ranges from -1.0 to +1.0
- Strength: values near -1.0 or +1.0 indicate a strong relationship; values close to 0 indicate a weak one.
 The p-value indicates whether the data provide sufficient evidence that the correlation between the variables is statistically significant.
- Rule of thumb: the correlation is significant when the p-value < α (commonly 1%, 5%, or 10%).

What test should be used?
Ꙭ Relationship
 Pearson Correlation (Pearson Product-Moment Correlation)
 Kendall’s Tau-b Correlation
 Spearman’s Rank-Order Correlation
Ꙭ Association
 Chi-square

Pearson Correlation (Pearson Product-Moment Correlation) – PARAMETRIC STAT
ASSUMPTIONS
 The two variables considered should be measured at the interval or ratio level.
 There is a linear relationship between the two variables (e.g., use a scatterplot to check linearity).
 There should be no significant outliers.
 The variables should be approximately normally distributed.

Kendall’s Tau-b Correlation, rk – NON-PARAMETRIC STAT
(Preferably used for small samples of nonnormal quantitative data)
ASSUMPTIONS
 The two variables should be measured on at least an ordinal scale.
 There is a monotonic relationship between the two variables – Y goes in one direction as X changes.

Spearman’s Rank-Order Correlation – NON-PARAMETRIC STAT
ASSUMPTIONS:
1. The two variables considered should be measured on an ordinal, interval, or ratio level.
2. There is a monotonic relationship between the two variables.

Chi-square Test for Association – NON-PARAMETRIC STAT
This test is used to determine whether there is a significant association between two categorical variables.
Significant value (p-value): we compare this value to the default value of α (level of significance), which is set to 0.05 or 5%. The decision rule is: if the p-value is less than α, then there is a significant association between the two variables. Otherwise, the association is not significant.
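The correlation and association tests listed above are all available in SciPy; a minimal sketch with hypothetical data:

from scipy.stats import pearsonr, spearmanr, kendalltau, chi2_contingency

hours_studying = [1, 2, 3, 4, 5, 6, 7, 8]        # hypothetical data
exam_scores    = [55, 60, 58, 65, 70, 72, 78, 85]

print(pearsonr(hours_studying, exam_scores))     # parametric: r and p-value
print(spearmanr(hours_studying, exam_scores))    # nonparametric, rank-based
print(kendalltau(hours_studying, exam_scores))   # nonparametric, tau-b by default

# Chi-square test for association between two categorical variables
# (rows and columns are hypothetical categories; cell values are observed counts)
table = [[30, 10],
         [20, 25]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(round(chi2, 2), round(p_value, 4))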
Assumptions About Psychological Testing and Assessment

Assumption #1: Psychological Traits and States Exist
TRAITS / STATES → Constructs → Overt Behavior
 Trait - defined as “any distinguishable, relatively enduring way in which one individual varies from another” (Guilford, 1959, p. 6).
 States also distinguish one person from another but are relatively less enduring (Chaplin et al., 1988).
 Construct — an informed, scientific concept developed or constructed to describe or explain behavior.
 Overt behavior refers to an observable action or the product of an observable action, including test- or assessment-related responses.
Traits that manifest in observable behavior are presumed to depend not only on the strength of the trait in the individual but also on the nature of the situation.

Assumption #2: Psychological Traits and States Can Be Quantified and Measured
 Test developers and researchers have many different ways of looking at and defining the same phenomenon (consider, for example, the many ways a trait term such as AGGRESSIVE can be defined).
 Test developers must select the most appropriate test items to be included in the assessment based on how the trait term is defined.
 Test developers must also establish appropriate ways to score the test and interpret the results.
 Cumulative Scoring = the assumption that the more the test taker responds in a particular direction, as keyed by the test manual as correct or consistent with a particular trait, the higher that test taker is presumed to be on the targeted ability or trait.

Assumption #3: Test-Related Behavior Predicts Non-Test-Related Behavior
 The tasks in some tests mimic the actual behaviors that the test user is attempting to understand. However, such tests yield only a sample of the behavior that can be expected to be emitted under non-test conditions.
 The obtained sample of behavior is typically used to make predictions about future behavior, such as the work performance of a job applicant.
 Psychological tests may be used not to predict behavior but to postdict it — that is, to aid in the understanding of behavior that has already taken place.

Assumption #4: Tests and Other Measurement Techniques Have Strengths and Weaknesses
Ꙭ Competent test users understand a great deal about the tests they use. They understand how a test was developed, the circumstances under which it is appropriate to administer the test, how the test should be administered and to whom, and how the test results should be interpreted.
Ꙭ Competent test users understand and appreciate the limitations of the tests they use as well as how those limitations might be compensated for by data from other sources.

Assumption #5: Various Sources of Error Are Part of the Assessment Process
ERROR
 In the context of assessment, error need not refer to a deviation, an oversight, or something that otherwise violates expectations.
 Error traditionally refers to something that is more than expected; it is actually a component of the measurement process.
 It refers to a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test.

ERROR VARIANCE
 The component of a test score attributable to sources other than the trait or ability measured.
o Potential sources of error variance:
• Assessees themselves are sources of error variance
• Assessors, too, are sources of error variance
• Measuring instruments themselves are another source of error variance
 Measurement professionals tend to view error as simply an element in the process of measurement.
 Classical Test Theory (CTT, also referred to as true score theory) - the assumption is made that each test taker has a true score on a test that would be obtained but for the action of measurement error.
 A model of measurement based on item response theory (IRT) is an alternative. However, whether CTT, IRT, or some other model of measurement is used, the model must have a way of accounting for measurement error.

Assumption #6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
 One source of fairness-related problems is the test user who attempts to use a particular test with people whose background and experience are different from the background and experience of the people for whom the test was intended.
 Today all major test publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual. However, despite the best efforts of many professionals, fairness-related questions and problems do occasionally arise.

Assumption #7: Testing and Assessment Benefit Society
In a world WITHOUT TESTS or other assessment procedures:
 personnel might be hired on the basis of nepotism (favoritism/bias) rather than documented merit
 teachers and school administrators could subjectively place children in different types of special classes simply because that is where they believed the children belonged
 there would be a great need for instruments to diagnose educational difficulties in reading and math and point the way to remediation
 there would be no instruments to diagnose neuropsychological impairments
 there would be no practical way for the military to screen thousands of recruits with regard to many key variables

Reliability

What is a Good Test?
• the criteria for a good test would include clear instructions for administration, scoring, and interpretation
• a good test offers economy in the time and money it takes to administer, score, and interpret
• the technical criteria that assessment professionals use to evaluate the quality of tests refer to the psychometric soundness of tests
• two key aspects are reliability and validity
RELIABILITY
• the criterion of reliability involves the consistency of the measuring tool
• the precision with which the test measures and the extent to which error is
present in measurements.
• the perfectly reliable measuring tool consistently measures in the same
way.
Reliability Coefficient
• A reliability coefficient is an index of reliability, a proportion that indicates
the ratio between the true score variance on a test and the total variance.

Concept of Reliability
• Error refers to the component of the observed test score that does not have
to do with the test taker’s ability.
• If we use X to represent an observed score, T to represent a true score, and
E to represent error, then the fact that an observed score equals the true
score plus error may be expressed as follows:
X=T+E
• If σ² represents the total observed variance, σ²th the true variance, and σ²e the error variance, then the relationship of the variances can be expressed as
σ² = σ²th + σ²e
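This relationship can be illustrated with simulated data: if hypothetical true scores and random error are generated separately, the reliability of the observed scores is approximately the ratio of true-score variance to total variance (a minimal sketch using NumPy; all values are invented):

import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(100, 15, size=1000)     # hypothetical true scores (T)
error = rng.normal(0, 5, size=1000)              # random measurement error (E)
observed = true_scores + error                   # X = T + E

reliability = true_scores.var() / observed.var() # proportion of total variance that is true variance
print(round(reliability, 2))                     # close to 15**2 / (15**2 + 5**2) = 0.90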
• The term reliability refers to the proportion of the total variance attributed to true variance.
• The greater the proportion of the total variance attributed to true variance, the more reliable the test.

Measurement Error
• refers to, collectively, all of the factors associated with the process of measuring some variable, other than the variable being measured.

RANDOM ERROR
• a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process

SYSTEMATIC ERROR
• refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured

Sources of Error Variance
• test construction = One source of variance during test construction is item sampling or content sampling, terms that refer to variation among items within a test as well as to variation among items between tests. The extent to which a test taker’s score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance.
• administration = The test taker’s reactions to untoward influences during administration are the source of one kind of error variance. Examples are factors related to the test environment. External to the test environment, the events of the day may also serve as a source of error. Other potential sources of error variance during test administration are test taker variables. Examiner-related variables are also potential sources of error variance.
• test scoring and interpretation = Scorers and scoring systems are potential sources of error variance.

Reliability Estimates
Ꙭ Test-Retest Reliability
 One way of estimating the reliability of a measuring instrument is by using the same instrument to measure the same thing at two points in time.
 This approach to reliability evaluation is called the test-retest method, and the result of such an evaluation is an estimate of test-retest reliability.
 It is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
 The test-retest measure is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait.

Ꙭ Parallel Forms & Alternate Forms
 Parallel forms of a test exist when, for each form of the test, the means and the variances of observed test scores are equal. In theory, the means of scores obtained on parallel forms correlate equally with the true score.
 Parallel forms reliability refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
 Alternate forms are simply different versions of a test that have been constructed so as to be parallel.
 Alternate forms reliability refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.
 Obtaining estimates of alternate-forms reliability and parallel-forms reliability is similar in two ways to obtaining an estimate of test-retest reliability:
(1) two test administrations with the same group are required, and
(2) test scores may be affected by factors such as motivation, fatigue, or intervening events.

Ꙭ Split-half Reliability
The computation of a coefficient of split-half reliability generally entails three steps:
Step 1. Divide the test into equivalent halves.
Step 2. Calculate a Pearson r between scores on the two halves of the test.
Step 3. Adjust the half-test reliability using the Spearman–Brown formula.
The Spearman–Brown formula allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test. It is a specific application of a more general formula to estimate the reliability of a test that is lengthened or shortened by any number of items.
The general Spearman–Brown formula is
rSB = (n × rxy) / [1 + (n − 1) × rxy]
where rSB is the reliability adjusted by the Spearman–Brown formula, rxy is the Pearson r in the original-length test, and n is the number of items in the revised version divided by the number of items in the original version.
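A minimal sketch of the three steps, using a small hypothetical item-score matrix (rows = test takers, columns = items scored 1/0), with the Spearman–Brown adjustment applied to the half-test correlation (NumPy and SciPy assumed):

import numpy as np
from scipy.stats import pearsonr

# Hypothetical item scores: 6 test takers x 6 items (1 = correct, 0 = incorrect)
items = np.array([[1, 1, 1, 0, 1, 1],
                  [1, 0, 1, 1, 0, 1],
                  [0, 0, 1, 0, 0, 0],
                  [1, 1, 1, 1, 1, 1],
                  [0, 1, 0, 0, 1, 0],
                  [1, 1, 0, 1, 1, 1]])

# Step 1: divide the test into equivalent halves (here, odd- vs. even-numbered items)
half_1 = items[:, 0::2].sum(axis=1)
half_2 = items[:, 1::2].sum(axis=1)

# Step 2: Pearson r between scores on the two halves
r_half, _ = pearsonr(half_1, half_2)

# Step 3: Spearman-Brown adjustment; n = 2 because the whole test is twice as long as a half
n = 2
r_sb = (n * r_half) / (1 + (n - 1) * r_half)
print(round(r_half, 2), round(r_sb, 2))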

By determining the reliability of one half of a test, a test developer can use the Spearman–Brown formula to estimate the reliability of a whole test. Because a whole test is two times longer than half a test, n becomes 2 in the Spearman–Brown formula for the adjustment of split-half reliability.

Other Methods of Estimating Internal Consistency
• Inter-item consistency refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test.
• An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test. Tests are said to be homogeneous if they contain items that measure a single trait.

The Kuder–Richardson formulas
• The most widely known of the many formulas Kuder and Richardson collaborated on is their Kuder–Richardson formula 20, or KR-20 (so named because it was the 20th formula developed in a series).
• Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar.
• However, KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items).
• If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method.

Coefficient alpha
• Developed by Cronbach (1951) and subsequently elaborated on by others (such as Kaiser & Michael, 1975; Novick & Lewis, 1967), coefficient alpha may be thought of as the mean of all possible split-half correlations, corrected by the Spearman–Brown formula.
• Coefficient alpha is appropriate for use on tests containing nondichotomous items.
• Coefficient alpha is the preferred statistic for obtaining an estimate of internal consistency reliability.
• Coefficient alpha is widely used as a measure of reliability, in part because it requires only one administration of the test.

Average proportional distance (APD)
• Rather than focusing on similarity between scores on items of a test, the APD is a measure that focuses on the degree of difference that exists between item scores.
• The average proportional distance method is defined as a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.

Keep in Mind!
• Measures of reliability are estimates, and estimates are subject to error. The precise amount of error inherent in a reliability estimate will vary with various factors, such as the sample of test takers from which the data were drawn.
• A reliability index published in a test manual might be very impressive. However, keep in mind that the reported reliability was achieved with a particular group of test takers.

Measures of Inter-Scorer Reliability
• Inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.
• Inter-scorer reliability is often used when coding nonverbal behavior.
• The simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a coefficient of inter-scorer reliability.

Using and Interpreting a Coefficient of Reliability
• How high should the coefficient of reliability be? “On a range relative to the purpose and importance of the decisions to be made on the basis of scores on the test.”
• Reliability is a mandatory attribute in all tests we use. However, we need more of it in some tests, and we will admittedly allow for less of it in others.
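Coefficient alpha, described above, is easy to compute from a single administration; a minimal sketch (the item scores are hypothetical, and for dichotomous 1/0 items this same computation corresponds to KR-20):

import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = test takers, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                                # number of items
    item_variances = items.var(axis=0, ddof=1)        # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

ratings = [[3, 4, 3, 5],        # hypothetical Likert-type responses
           [2, 2, 3, 3],
           [4, 5, 4, 5],
           [1, 2, 2, 2],
           [3, 3, 4, 4]]
print(round(cronbach_alpha(ratings), 2))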
VALIDITY

What is a Good Test?
• the technical criteria that assessment professionals use to evaluate the quality of tests refer to the psychometric soundness of tests
• two key aspects are reliability and validity

Validity
• a judgment or estimate of how well a test measures what it purports to measure in a particular context.
• a judgment based on evidence about the appropriateness of inferences drawn from test scores
• No test or measurement technique is “universally valid” for all time, for all uses, with all types of test taker populations.
• Rather, tests may be shown to be valid within what we would characterize as reasonable boundaries of a contemplated usage.

Validation
• the process of gathering and evaluating evidence about validity.
• both the test developer and the test user may play a role in the validation of a test for a specific purpose.
• Local validation studies are absolutely necessary when the test user plans to alter in some way the format, instructions, language, or content of the test.

1. Content Validity
 This is a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test.
 It describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample.
 In the interest of ensuring content validity, test developers strive to include key components of the construct targeted for measurement, and to exclude content irrelevant to the construct targeted for measurement.
 The degree to which the items in the measurement scale represent all aspects of the variable being measured.
 Content validity is not evaluated numerically; it is judged by the researcher.

2. Criterion-Related Validity
 This is a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures.
 It is a judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest.
Characteristics of a Criterion:
1. An adequate criterion is relevant, meaning it is pertinent or applicable to the matter at hand.
2. Criterion-related validity reflects the degree to which a measurement instrument is related to an independent measure of the relevant criterion.
3. Criterion-related validity can be evaluated by computing the correlation (or multiple correlation, R) between test scores and criterion performance.
4. An adequate criterion measure must also be valid for the purpose for which it is being used.
5. An adequate criterion is also uncontaminated, that is, not itself based on the predictor measure.

Concurrent Validity
 is an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently).
 it indicates the extent to which test scores may be used to estimate an individual’s present standing on a criterion.

Predictive Validity
 is an index of the degree to which a test score predicts some criterion measure.
 it reflects how accurately scores on the test predict some criterion measure.

Validity Coefficient
 a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure.

Incremental Validity
 the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use.

3. Construct Validity
 This is a measure of validity that is arrived at by executing a comprehensive analysis of
a. how scores on the test relate to other test scores and measures, and
b. how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure.
 The degree to which the measurement instrument measures the theoretical constructs that it was intended to measure.
 Construct validity can be examined through factor analysis.
 It is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct.
 A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior.

4. Face Validity
 relates more to what a test appears to measure to the person being tested than to what the test actually measures.
 is a judgment concerning how relevant the test items appear to be. Stated another way, if a test definitely appears to measure what it purports to measure “on the face of it,” then it could be said to be high in face validity.

Validity, Bias, and Fairness
• Test Bias
 bias as applied to psychological and educational tests may conjure up many meanings having to do with prejudice and preferential treatment (Brown et al., 1999).
 For psychometricians, bias is a factor inherent in a test that systematically prevents accurate, impartial measurement.

• Rating Error
= a judgment resulting from the intentional or unintentional misuse of a rating scale.
 A leniency error (also known as a generosity error) is an error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading.
 At the other extreme is a severity error. Movie critics who criticize just about everything they review may be guilty of severity errors.
 A central tendency error occurs when the rater, for whatever reason, exhibits a general and systematic reluctance to give ratings at either the positive or the negative extreme.
 Halo Effect - for some raters, some ratees can do no wrong. It may also be defined as a tendency to give a particular ratee a higher rating than he or she objectively deserves because of the rater’s failure to discriminate among conceptually distinct and potentially independent aspects of a ratee’s behavior.

• Test Fairness
 fairness in a psychometric context is the extent to which a test is used in an impartial, just, and equitable way.
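A small sketch of how a validity coefficient and incremental validity might be examined with hypothetical data (the variable names and values are invented for illustration): the validity coefficient is the correlation between test scores and the criterion, and incremental validity can be approximated as the gain in explained criterion variance (R squared) when the new test is added to a predictor already in use.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: an existing predictor, a new test, and a criterion (e.g., rated job performance)
old_predictor = np.array([3, 5, 4, 6, 7, 5, 8, 6])
new_test      = np.array([10, 14, 11, 15, 18, 13, 20, 16])
criterion     = np.array([2.1, 3.0, 2.4, 3.4, 4.0, 2.9, 4.5, 3.5])

validity_coefficient, _ = pearsonr(new_test, criterion)

def r_squared(predictors, y):
    """Proportion of criterion variance explained by a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return 1 - residuals.var() / y.var()

r2_old = r_squared([old_predictor], criterion)
r2_both = r_squared([old_predictor, new_test], criterion)
incremental_validity = r2_both - r2_old    # what the new test adds beyond the old predictor
print(round(validity_coefficient, 2), round(incremental_validity, 3))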

How is utility analysis conducted?


Utility Expectancy Data
Utility defined  An expectancy table can provide an indication of the likelihood that a
• in the context of testing and assessment, it is the usefulness or practical test taker will score within some interval of scores on a criterion
value of testing to improve efficiency measure - an interval that may be categorized as “passing,”
• It is also used to refer to the usefulness or practical value of a training “acceptable,” or “failing.”
program or intervention.  Taylor-Russell tables (Taylor and Russell,1939) provide an estimate of
the extent to which inclusion of a particular test in the selection
Factors that affects test utility system will improve selection
Psychometric soundness One limitation of the Taylor-Russell tables is that the relationship
 An index of reliability can tell us something about how consistently a between the predictor (the test) and the criterion (rating of performance on
test measures what it measures; and the job) must be linear.
 an index of validity can tell us something about whether a test Another limitation of the is the potential difficulty of identifying a
measures what it purports to measure. criterion score that separates “successful” from “unsuccessful” employees.
 An index of utility can tell us something about the practical value of From the base rate of 60% of the hired employees being expected to
the information derived from scores on the test. perform successfully, a full 88% can be expected to do so
 The higher the criterion-related validity of test scores for making a
particular decision, the higher the utility of the test  Naylor-Shine tables
 Would it be accurate to conclude that “a valid test is a useful test”? (Naylor & Shine, 1965) entails obtaining the difference between the
means of the selected and unselected groups to derive an index of what the
Cost test is adding to already established procedures.
 in the context of test utility refers to disadvantages, losses, or When the purpose of a utility analysis is to answer a question related
expenses in both economic and noneconomic terms. to costs and benefits in terms of dollars and cents – use the Brogden-
 Associated costs of testing may come in the form of: Cronbach-Gleser formula. This formula gives an estimate utility by estimating
(1) payment to professional personnel and staff associated with the amount of money an organization would save if it used the test to select
test administration, scoring, and interpretation, employees.
(2) facility rental, mortgage, and/or other charges related to the
usage of the test facility, and Perhaps the most oftcited application of statistical decision theory to the field
(3) insurance, legal, accounting, licensing, and other routine costs of psychological testing is Cronbach and Gleser’s Psychological Tests and
of doing business. Personnel Decisions (1957, 1965)
• Cronbach and Gleser (1965) presented:
Benefits (1) a classification of decision problems;
 refers to profits, gains, or advantages in both economic and (2) various selection strategies ranging from single-stage processes to
noneconomic terms sequential analyses;
(3) a quantitative analysis of the relationship between test utility, the • as a method of evaluation and a way of deriving meaning from test scores
selection ratio, cost of the testing program, and expected value of the by evaluating an individual test taker’s score and comparing it to scores of a
outcome; and group of test takers.
(4) a recommendation that in some instances job requirements be • In this approach, the meaning of an individual test score is understood
tailored to the applicant’s ability instead of the other way around (a concept relative to other scores on the same test.
they refer to as adaptive treatment). • A common goal of norm-referenced tests is to yield information on a test
taker’s standing or ranking relative to some comparison group of test
takers.
Some practical considerations when conducting utility analysis
•The pool of job applicants Normative sample
•The complexity of the job • is that group of people whose performance on a particular test is analyzed
•The cut-off score in use for reference in evaluating the performance of individual test takers
 cut-off score is a (usually numerical) reference point
derived as a result of a judgment and used to divide a set of Sampling to Develop Norms
data into two or more classifications, with some action to • The process of administering a test to a representative sample of
be taken or some inference to be made on the basis of test takers for the purpose of establishing norms is referred to as
these classifications standardization or test standardization
• The process of selecting the portion of the universe deemed to be
Method of Setting Cut Scores representative of the whole population is referred to as sampling.
• The Angoff Method • Such sampling, termed stratified sampling, would help prevent
Devised by William Angoff (1971), the Angoff method for setting fixed sampling bias and ultimately aid in the interpretation of the findings.
cut scores can be applied to personnel selection tasks as well as to questions • If such sampling were random (or, if every member of the
regarding the presence or absence of a particular trait, attribute, or ability. population had the same chance of being included in the sample), then the
procedure would be termed stratified-random sampling.
• The Known Groups Method • If we subjectively select some sample because we believe it to be
Also referred to as the method of contrasting groups, the known representative of the population, then we have selected what is referred to as
groups method entails collection of data on the predictor of interest from a purposive sample.
groups known to possess, and not to possess, a trait, attribute, or ability of • Incidental sample or convenience sample is one that is convenient
interest. or available for use.
Based on an analysis of this data, a cut score is set on the test that
best discriminates the two groups’ test performance Types of Norms
• A percentile is an expression of the percentage of people whose score on a
• IRT-Based Methods test or measure falls below a particular raw score.
The methods described thus far for setting cut scores are based on  A percentile is a converted score that refers to a percentage
classical test score theory. of test takers.
In this theory, cut scores are typically set based on test takers’
performance across all the items on the test; some portion of the total • Percentage correct refers to the distribution of raw scores—more
number of items on the test must be scored “correct” (or in a way that specifically, to the number of items that were answered correctly multiplied
indicates the test taker possesses the target trait or attribute) in order for the by 100 and divided by the total number of items.
test taker to “pass” the test (or be deemed to possess the targeted trait or
attribute) •Also known as agee quivalent scores, age norms indicate the average
A technique that has found application in setting cut scores for performance of different samples of test takers who were at various ages at
licensing examinations is the item-mapping method. the time the test was administered.
An IRT-based method of setting cut scores that is more typically used
in academic applications is the bookmark method • Designed to indicate the average test performance of test takers in a given
school grade, grade norms are developed by administering the test to
• Other Methods representative samples of children over a range of consecutive grade levels
The method of predictive yield (R.L. Thorndike, 1949)was a technique (such as first through sixth grades)
for setting cut scores which took into account the number of positions to be
filled, projections regarding the likelihood of offer acceptance, and the • developmental norms, a term applied broadly to norms developed on the
distribution of applicant scores. basis of any trait, ability, skill, or other characteristic that is
Another approach to setting cut scores employs a family of statistical presumed to develop, deteriorate, or otherwise be affected by chronological
techniques called discriminant analysis (also referred to as discriminant age, school grade, or stage of life
function analysis).
These techniques are typically used to shed light on the relationship • national norms are derived from a normative sample that was nationally
between identified variables (such as scores on a battery of tests) and two or representative of the population at the time the norming study was
more naturally occurring groups (such as persons judged to be successful at a conducted.
job and persons judged unsuccessful at a job).  In the fields of psychology and education, for example,
national norms may be obtained by testing large numbers
NORMS & STANDARD SCORES of people representative of different variables of interest
such as age, gender, racial/ethnic background,
Norm defined socioeconomic strata, geographical location
• Norm (singular) is used in the scholarly literature to refer to behavior that is
usual, average, normal, standard, expected, or typical. Fixed referenced group scoring system
• Norms is the plural form of norm, as in the term gender norms. In a • Another type of aid in providing a context for interpretation is termed a
psychometric context, norms are the test performance data of a particular fixed reference group scoring system.
group of test takers that are designed for use as a reference when evaluating • Here, the distribution of scores obtained on the test from one group of test
or interpreting individual test scores. takers—referred to as the fixed reference group—is used as the basis for the
calculation of test scores for future administrations of the test.
Norm referenced testing and assessment
Criterion referenced testing and assessment
• Criterion-referenced testing and assessment may be defined as a method ❑ In the context of test development, terms such as pilot work, pilot
of evaluation and a way of deriving meaning from test scores by evaluating an study, and pilot research refer to the preliminary research surrounding
individual’s score with reference to a set standard. the creation of a prototype of the test.
• The criterion in criterion-referenced assessments typically derives from the
values or standards of an individual or organization. ❑ Test items may be pilot studied (or piloted) to evaluate whether
they should be included in the final form of the instrument.

❑ In pilot work, the test developer typically attempts to determine


how best to measure a targeted construct. The process may entail
Standard Scores literature reviews and experimentation as well as the creation,
Z-scores revision, and deletion of preliminary test items
• A z-score results from the conversion of a raw score into a number
indicating how many standard deviation units the raw score is below or above Test Construction
the mean of the distribution SCALING
- may be defined as the process of setting rules for assigning numbers
T-scores in measurement.
• T scores can be called a fifty plus or minus ten scale; that is, a scale with a - the process by which a measuring device is designed and calibrated
mean set at 50 and a standard deviation set at 10. and by which numbers (scale values) are assigned to different
• Devised by W. A. McCall (1922, 1939) and named a score in honor of his amounts of the trait, attribute, or characteristic being measured.
professor E. L. Thorndike, this T standard score system is composed of a scale
that ranges from 5 standard deviations below the mean to 5 standard Types of Scales
deviations above the mean. • There is no best type of scale.
• Test developers scale a test in the manner they believe is
Stanine optimally suited to their conception of the measurement of the trait (or
• Researchers during World War II developed a standard score with a mean of whatever) that is being measured.
5 and a standard deviation of approximately 2. Divided into nine units, the
scale was christened a stanine, a term that was a contraction of the words Scaling Method
standard and nine.  Rating Scale can be defined as a grouping of words, statements, or
symbols on which judgments of the strength of a particular trait,
Test Development attitude, or emotion are indicated by the test taker.
1 Test Conceptualization  Rating scales can be used to record judgments of oneself,
2 Test Construction others, experiences, or objects, and they can take several
3 Test Tryout forms
4 Item Analysis  Likert scales are relatively easy to construct. Each item presents the
5 Test Revision test taker with five alternative responses (sometimes seven),
usually on an agree–disagree or approve–disapprove continuum.
Test Conceptualization  Likert scales are usually reliable, which may account for
The beginnings of any published test can probably be traced to their widespread popularity.
thoughts—self-talk, in behavioral terms.  Method of paired comparisons – test takers are presented with pairs
- The test developer says to himself or herself something like: “There of stimuli (two photographs, two objects, two statements), which
ought to be a test designed to measure [fill in the blank] in [such they are asked to compare. They must select one of the stimuli
and such] way.” according to some rule.
- The stimulus for such a thought could be almost anything.  Comparative scaling • entails judgments of a stimulus in comparison
 A review of the available literature on existing tests with every other stimulus on the scale.
designed to measure a particular construct might indicate  Categorical scaling • Stimuli are placed into one of two or more
that such tests leave much to be desired in psychometric alternative categories that differ quantitatively with respect to
soundness. some continuum
 An emerging social phenomenon or pattern of behavior  Guttman scale • Yields ordinal-level measures. Items on it range
might serve as the stimulus for the development of a new sequentially from weaker to stronger expressions of the attitude,
test belief, or feeling being measured.
 The development of a new test may be in response to a - A feature of Guttman scales is that all respondents who
need to assess mastery in an emerging occupation or agree with the stronger statements of the attitude will
profession also agree with milder statements.

Some Preliminary Questions Writing Item


• What is the test designed to measure? • The prospective test developer or item writer immediately faces
• What is the objective of the test? three questions related to the test blueprint:
• Is there a need for this test? ■ What range of content should the items cover?
• Who will use this test? ■ Which of the many different types of item formats should be
• Who will take this test? employed?
• What content will the test cover? ■ How many items should be written in total and for each content
• How will the test be administered? area covered?
• What is the ideal format of the test?  item pool is the reservoir or well from which items will or
• Should more than one form of the test be developed? will not be drawn for the final version of the test.
• What special training will be required of test users for administering
or interpreting the test? Item Format • Variables such as the form, plan, structure,
• What types of responses will be required of test takers? arrangement, and layout of individual test items are collectively referred to as
• Who benefits from an administration of this test? item format
• Is there any potential for harm as the result of an administration of • 2 Types of Item Format
this test?
1. selected-response format require test takers to select a response
from a set of alternative responses. [multiple-choice, matching,
and true– false]
2. constructed-response format require test takers to supply or to
create the correct answer, not merely to select it [completion item,
the short answer, and the essay.]

Multiple Choice Item Analysis •Among the tools test developers might employ to analyze and
• An item written in a multiple-choice format has three elements: select items are
(1) a stem, ■ an index of the item’s difficulty
(2) a correct alternative or option, and ■ an index of the item’s reliability
(3) several incorrect alternatives or options variously ■ an index of the item’s validity
referred to as distractors or foils. ■ an index of item discrimination
Matching Item
• In a matching item, the test taker is presented with two columns: Item-difficulty index
premises on the left and responses on the right. • An index of an item’s difficulty is obtained by calculating the
• The test taker’s task is to determine which response is best proportion of the total number of test takers who answered the item
associated with which premise. correctly.
• For very young test takers, the instructions will direct them to • A lowercase italic “p” (p) is used to denote item difficulty, and a
draw a line from one premise to one response. Test takers other than subscript refers to the item number (so p1 is read “item difficulty index for
young children are typically asked to write a letter or number as a item 1”).
response. • The value of an item-difficulty index can theoretically range from 0
(if no one got the item right) to 1 (if everyone got the item right).
Completion Item Example: If 50 of the 100 examinees answered item 2 correctly,
• requires the examinee to provide a word or phrase that then the item difficulty index for this item would be
completes a sentence, as in the following example: p2
The standard deviation is generally considered the most useful = 50 / 100 = .5
measure of __________. Note that the larger the item-difficulty index, the easier the item. Because p
refers to the percent of people passing an item, the higher the p for an item,
Short-answer Item the easier the item.
• It is desirable for short-answer items to be written clearly enough
that the test taker can respond succinctly - that is, with a short answer
What descriptive statistic is generally considered the most useful
measure of variability?

Essay Item
• a test item that requires the test taker to respond to a question
by writing a composition, typically one that demonstrates recall of
facts, understanding, analysis, and/or interpretation.
• Example of an essay item:
Compare and contrast definitions and techniques of
classical and operant conditioning. Include examples of
how principles of each have been applied in clinical as
well as educational settings.

Test Tryout
• Having created a pool of items from which the final version of the test will be developed, the test developer will try out the test.
• The test should be tried out on people who are similar in critical respects to the people for whom the test was designed.
• Equally important are questions about the number of people on whom the test should be tried out. An informal rule of thumb is that there should be no fewer than 5 subjects, and preferably as many as 10, for each item on the test. In general, the more subjects in the tryout, the better.
• The test tryout should be executed under conditions as identical as possible to the conditions under which the standardized test will be administered.
• All instructions, and everything from the time limits allotted for completing the test to the atmosphere at the test site, should be as similar as possible.

Average and optimal item difficulty
• An index of the difficulty of the average test item for a particular test can be calculated by averaging the item-difficulty indices for all the test's items.
• This is accomplished by summing the item-difficulty indices for all test items and dividing by the total number of items on the test.
• In a true–false item, the probability of guessing correctly on the basis of chance alone is 1/2, or .50. Therefore, the optimal item difficulty is halfway between .50 and 1.00, or .75. In general, the midpoint representing the optimal item difficulty is obtained by summing the chance success proportion and 1.00 and then dividing the sum by 2: (.50 + 1.00)/2 = .75.
• For a five-option multiple-choice item, the probability of guessing correctly on any one item on the basis of chance alone is equal to 1/5, or .20. The optimal item difficulty is therefore (.20 + 1.00)/2 = .60.
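A short Python sketch of the two calculations just described: averaging the item-difficulty indices of a test, and finding the optimal item difficulty as the midpoint between the chance-success proportion and 1.00 (the p values below are invented).

    # Item-difficulty indices for a hypothetical five-item test.
    p_values = [0.90, 0.75, 0.60, 0.55, 0.40]

    # Average item difficulty: sum the indices and divide by the number of items.
    average_difficulty = sum(p_values) / len(p_values)
    print(average_difficulty)   # 0.64

    def optimal_difficulty(chance_success_proportion):
        """Midpoint between the chance-success proportion and 1.00."""
        return (chance_success_proportion + 1.00) / 2

    print(optimal_difficulty(0.50))   # 0.75 for a true-false item
    print(optimal_difficulty(0.20))   # 0.60 for a five-option multiple-choice item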
The higher the value of d, the greater the number of high scorers
answering the item correctly.
A negative d-value on a particular item is a red flag because it
indicates that low-scoring examinees are more likely to answer the item
correctly than high-scoring examinees. This situation calls for some action
such as revising or eliminating the item.
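As a rough sketch of how a d value of this kind might be computed: d is the difference between the proportion of the upper-scoring (U) group and the proportion of the lower-scoring (L) group answering the item correctly (the index is defined formally later in these notes). The 0/1 item scores below are invented, and how the two groups are formed (for example, top and bottom scorers on the total test) is left to the test developer.

    # Scores (1 = correct, 0 = incorrect) on one item for the upper- and
    # lower-scoring groups on the total test.
    upper_group = [1, 1, 1, 0, 1]   # U group
    lower_group = [0, 1, 0, 0, 0]   # L group

    def discrimination_index(upper, lower):
        """d = proportion of U passing the item minus proportion of L passing it."""
        p_upper = sum(upper) / len(upper)
        p_lower = sum(lower) / len(lower)
        return p_upper - p_lower

    d = discrimination_index(upper_group, lower_group)
    print(round(d, 2))   # 0.6: high scorers pass this item far more often than low scorers
    # d near +1.00 indicates strong discrimination; d = 0 indicates none; a negative d is a red flag.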
The highest possible value of d is +1.00. This value indicates that all members of the U group answered the item correctly whereas all members of the L group answered the item incorrectly.
If the same proportion of members of the U and L groups pass the item, then the item is not discriminating between test takers at all, and d will be equal to 0.
The higher the value of d, the more adequately the item discriminates the higher-scoring from the lower-scoring test takers.

Item reliability index
• The item-reliability index provides an indication of the internal consistency of a test (Figure 8–4); the higher this index, the greater the test's internal consistency.
• This index is equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score.
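A short Python sketch of that product (item-score standard deviation multiplied by the item-total correlation); the five examinees' scores are invented, and statistics.correlation assumes Python 3.10 or later.

    from statistics import pstdev, correlation   # correlation() requires Python 3.10+

    # Invented data: 0/1 scores on one item and total test scores for five examinees.
    item_scores  = [1, 0, 1, 1, 0]
    total_scores = [48, 31, 42, 45, 35]

    s = pstdev(item_scores)                      # item-score standard deviation
    r = correlation(item_scores, total_scores)   # correlation between item score and total score
    item_reliability_index = s * r
    print(round(item_reliability_index, 3))      # about 0.457 for these made-up numbers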
Factor analysis and inter-item consistency
• A statistical tool useful in determining whether items on a test appear to be measuring the same thing(s) is factor analysis.
• Through the judicious use of factor analysis, items that do not appear to be measuring what they were designed to measure can be revised or eliminated.
• If too many items appear to be tapping a particular area, the weakest of such items can be eliminated.
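A full factor analysis is more than a few lines, but the spirit of the check (do the items appear to measure the same thing?) can be sketched with a principal-component stand-in; this is an assumption, since the notes name factor analysis rather than PCA. The 0/1 item scores are invented, and numpy is assumed to be available.

    import numpy as np

    # Invented 0/1 item scores: rows = examinees, columns = 4 items.
    scores = np.array([
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 1, 0],
        [1, 0, 1, 0],
    ])

    # Inter-item correlation matrix.
    corr = np.corrcoef(scores, rowvar=False)

    # First principal component of the correlation matrix as a rough one-factor stand-in.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    loadings = eigenvectors[:, -1] * np.sqrt(eigenvalues[-1])   # sign of loadings is arbitrary

    # Items with weak loadings are candidates for revision or elimination.
    print(np.round(loadings, 2))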

Item validity index
• The item-validity index is a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure.
• The higher the item-validity index, the greater the test's criterion-related validity.
• The item-validity index can be calculated once the following two statistics are known:
■ the item-score standard deviation, and
■ the correlation between the item score and the criterion score.
The correlation between the score on item 1 and a score on the criterion measure (denoted by the symbol r1c) is multiplied by item 1's item-score standard deviation (s1), and the product is equal to an index of an item's validity (s1r1c).

A visual representation of the best items on a test (if the objective is to maximize criterion-related validity) can be achieved by plotting each item's item-validity index and item-reliability index (Figure 8–5).

Item-characteristic curve
• An item-characteristic curve (ICC) is a graphic representation of item difficulty and discrimination.
❑ Note that the extent to which an item discriminates high- from low-scoring examinees is apparent from the slope of the curve.
❑ The steeper the slope, the greater the item discrimination.
❑ An item may also vary in terms of its difficulty level.
❑ An easy item will shift the ICC to the left along the ability axis, indicating that many people will likely get the item correct.
❑ A difficult item will shift the ICC to the right along the horizontal axis, indicating that fewer people will answer the item correctly.
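The notes describe the ICC graphically. One common way to write such a curve down (an assumption here, since no particular model is named above) is a two-parameter logistic function, in which a controls the slope (discrimination) and b controls the left-right position (difficulty).

    import math

    def icc(theta, a, b):
        """Probability of a correct response at ability level theta (2PL logistic form)."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    # An easy, highly discriminating item versus a hard, weakly discriminating item.
    easy_steep   = dict(a=2.0, b=-1.0)   # curve shifted left, steep slope
    hard_shallow = dict(a=0.5, b=1.5)    # curve shifted right, shallow slope

    for theta in (-2, -1, 0, 1, 2):
        p_easy = icc(theta, **easy_steep)
        p_hard = icc(theta, **hard_shallow)
        print(f"theta={theta:+d}  easy/steep={p_easy:.2f}  hard/shallow={p_hard:.2f}")

    # Item information (relevant to the IRT information curves discussed later) is
    # largest near b and grows with a: information = a**2 * p * (1 - p).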
Item discrimination index
• Measures of item discrimination indicate how adequately an item separates or discriminates between high scorers and low scorers on an entire test.
• The item-discrimination index is a measure of item discrimination, symbolized by a lowercase italic "d" (d). This estimate of item discrimination compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores.
• The item-discrimination index is a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly.

Other Considerations in Item Analysis
Guessing
• In achievement testing, the problem of how to handle test taker guessing is one that has eluded any universally acceptable solution.
• Methods designed to detect guessing (S.-R. Chang et al., 2011), minimize the effects of guessing (Kubinger et al., 2010), and statistically correct for guessing (Espinosa & Gardeazabal, 2010) have been proposed, but no such method has achieved universal acceptance.
❑ To date, no solution to the problem of guessing has been deemed entirely satisfactory.
❑ The responsible test developer addresses the problem of guessing by including in the test manual
(1) explicit instructions regarding this point for the examiner to convey to the examinees and
(2) specific instructions for scoring and interpreting omitted items.

Item Fairness = the degree to which a test item is biased.
• A biased test item is an item that favors one particular group of examinees in relation to another when differences in group ability are controlled (Camilli & Shepard, 1985).
❑ Item-characteristic curves can be used to identify biased items. Specific items are identified as biased in a statistical sense if they exhibit differential item functioning.
❑ Differential item functioning is exemplified by different shapes of item-characteristic curves for different groups (say, men and women) when the two groups do not differ in total test score (Mellenbergh, 1994).

Speed Test
• Item analyses of tests taken under speed conditions yield misleading or uninterpretable results.
• The closer an item is to the end of the test, the more difficult it may appear to be.
• This is because test takers simply may not get to items near the end of the test before time runs out.
Given these problems, how can items on a speed test be analyzed?
❑ Perhaps the most obvious solution is to restrict the item analysis of items on a speed test only to the items completed by the test taker.

Cross Validation = the revalidation of a test on a sample of test takers other than those on whom test performance was originally found to be a valid predictor of some criterion.
❑ The decrease in item validities that inevitably occurs after cross-validation of findings is referred to as validity shrinkage. Such shrinkage is expected and is viewed as integral to the test development process.
➢ Co-validation may be defined as a test validation process conducted on two or more tests using the same sample of test takers.
➢ When used in conjunction with the creation of norms or the revision of existing norms, this process may also be referred to as co-norming.
▪ A current trend among test publishers who publish more than one test designed for use with the same population is to co-validate and/or co-norm tests.
• Co-validation of new tests and revisions of existing tests can be beneficial in various ways to all parties in the assessment enterprise.
▪ Co-validation is beneficial to test publishers because it is economical.
▪ With sampling error minimized by the co-norming process, a test user can be that much more confident that the scores on the two tests are comparable.

• Three of the many possible applications of IRT in building and revising tests include
(1) evaluating existing tests for the purpose of mapping test revisions,
(2) determining measurement equivalence across test taker populations, and
(3) developing item banks.
IRT information curves can help test developers evaluate how well an individual item is working to measure different levels of the underlying construct. Developers can use these information curves to weed out uninformative questions or to eliminate redundant items that provide duplicate levels of information.
One tool to help ensure that the same construct is being measured, no matter what language the test has been translated into, is IRT.

Test Revision
• Test Revision as a Stage in New Test Development
Having conceptualized the new test, constructed it, tried it out, and item-analyzed it, what remains is to act carefully on all the information and mold the test into its final form.
There are probably as many ways of approaching test revision:
▪ One approach is to characterize each item according to its strengths and weaknesses.
▪ Test developers may find that they must balance various strengths and weaknesses across items.

▪ Having balanced all these concerns, the test developer comes out of
the revision stage with a better test.
▪ The next step is to administer the revised test under standardized
conditions to a second appropriate sample of examinees.
▪ On the basis of an item analysis of data derived from this
administration of the second draft of the test, the test developer may deem
the test to be in its finished form.
▪ Once the test is in finished form, the test’s norms may be developed
from the data, and the test will be said to have been “standardized” on this
(second) sample.

• Test Revision in the Life Cycle of an Existing Test


▪ The American Psychological Association (APA, 1996b, Standard 3.18)
offered the general suggestions that an existing test be kept in its present
form as long as it remains “useful” but that it should be revised “when
significant changes in the domain represented, or new conditions of test use
and interpretation, make the test inappropriate for its intended use.”
Once the successor to an established test is published, there are
inevitably questions about the equivalence of the two editions.
▪ For example, does a measured full-scale IQ of 110 on the first edition of an intelligence test mean exactly the same thing as a full-scale IQ of 110 on the second edition?
A number of researchers have advised caution in comparing results
from an original and a revised edition of a test, despite similarities in
appearance (Reitan & Wolfson, 1990; Strauss et al., 2000).
❑ Formal item-analysis methods must be employed to evaluate the
stability of items between revisions of the same test (Knowles & Condon,
2000).
❑ A key step in the development of all tests—brand-new or revised
editions—is cross-validation.
