
PSYCH ASSESS LEC PRELIMS NOTES

LESSON 1: PSYCHOLOGICAL TEST AND PSYCHOLOGICAL ASSESSMENT

TESTING AND ASSESSMENT DEFINITIONS

PSYCHOLOGICAL TESTING
• The process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior

PSYCHOLOGICAL ASSESSMENT
• The gathering of data for the purpose of making a psychological evaluation
• Accomplished through the use of various tools and tests

TESTING VS ASSESSMENT

OBJECTIVES
• Testing: to obtain some gauge, usually numerical in nature, with regard to an ability or attribute
• Assessment: to answer a referral question, solve a problem, or arrive at a decision through the use of tools of evaluation

PROCESS
• Testing: may be individual or group
• Assessment: typically individualized; typically focuses on how an individual processes rather than simply the results of that processing

ROLE OF THE EVALUATOR
• Testing: technician-like skills in terms of administration and scoring
• Assessment: typically requires an educated selection of tools of evaluation, skill in evaluation, and thoughtful organization and integration of data

OUTCOME
• Testing: typically yields a test score or a series of scores
• Assessment: typically entails a logical problem-solving approach that brings to bear many sources of data designed to shed light on a referral question

RETROSPECTIVE ASSESSMENT
• The use of evaluative tools to draw conclusions about psychological aspects of a person as they existed at some point in time prior to the assessment
• The person is still alive (in contrast to a psychological autopsy)

REMOTE ASSESSMENT
• The assessee is not in physical proximity to the person or people conducting the evaluation
• Done at a distance

ECOLOGICAL MOMENTARY ASSESSMENT (EMA)
• "In the moment" evaluation of specific problems and related cognitive and behavioral variables at the time and place that they occur (e.g., assessing physical activity by asking clients to fill out a form during or after the activity)

VARIETIES OF PSYCHOLOGICAL ASSESSMENT
1. DIAGNOSIS
• What kind of sickness the person has, determined through psychological assessment or a thorough test
• How severely the illness is affecting the person
• Limitations due to the illness, as people with a disease may not carry out certain functions in their daily lives
2. ETIOLOGY
• The source, cause, or origin
• Etiology is complicated by the fact that most disorders have more than one cause. Early etiological theories were the Freudian and post-Freudian psychoanalytic beliefs.
3. PROGNOSIS
• A prediction of the course, duration, severity, and outcome of a condition, disease, or disorder. A prognosis may be given before any treatment is undertaken, so that the patient or client can weigh the benefits of different treatment options.
4. ECOLOGICAL MOMENTARY ASSESSMENT
• EMA refers to the "in the moment" evaluation of specific problems and related cognitive and behavioral variables at the very time and place that they occur (see above)
• EMA has been used to help tackle diverse clinical problems including post-traumatic stress disorder, problematic smoking, and chronic abdominal pain in children
5. PSYCHOLOGICAL AUTOPSY
• A reconstruction of a deceased individual's psychological profile on the basis of archival records and interviews with people who knew the person (the counterpart of retrospective assessment, which is conducted while the person is alive)

THE PROCESS OF ASSESSMENT
1. Referral question
2. Clarification of the referral
3. Formal assessment
4. Psychological report writing
5. Feedback with the assessee and/or interested third parties

REQUIREMENTS OF PSYCHOLOGICAL ASSESSMENT
1. Problem identification
• Clarify the referral questions
2. Selection and implementation of methods
• After clarifying the questions, select the right tools (the right psychological tests)
3. Integration
• Have a justification for each test to be administered
• Results are useless if they remain raw scores
• You need to integrate the findings
• Ex. The interview reveals difficulty focusing on studying; the test results show high anxiety
o Where does the heightened anxiety manifest? In academic work
4. Psychological report

Collaborative psychological assessment
• Clients are the experts on themselves
o The assessor is not the more knowledgeable party, because the client is the one who knows exactly what he/she feels, knows himself/herself better, and is aware of the symptoms; however, clients are not always aware of the insights

Therapeutic psychological assessment
• Encourages self-discovery
• Ex. Interviewing; motivational interviewing by Miller and Rollnick (2002)
o The interview procedure itself is therapeutic because the questions encourage insight into the client's problems
o A form of intervention program
o Two-way: you obtain information about the client, and the assessment itself is therapeutic (encourages self-discovery)
o Unlike the collaborative approach, the claim here is not that everyone is the expert on themselves

Dynamic assessment (application: ABA, applied behavior analysis)
1. EVALUATION
o Establish a baseline (an assessment)
2. INTERVENTION
o Manipulate some variables
3. EVALUATION
o Check whether the intervention is effective and adequate
o The assessor is the expert
- He/she is the one who determines which variables need to be manipulated
• Ex. Targeted behavior: aggression
o Why does the aggression show?
o Evaluation: how the parents handle the child, or the environment
o Intervention: manipulate a variable (the environment)
o Evaluation: is it the environment that elicits the aggression?
THE TOOLS OF PSYCHOLOGICAL ASSESSMENT

1. TEST
• A measuring device or procedure

PSYCHOLOGICAL TEST
• A device or procedure designed to measure variables related to psychology
• Just one tool for psychological assessment

CONTENT
• Subject matter; what the test is trying to measure
• Focus may vary (i.e., two tests measuring the same variable may differ)
o Ex. There are different intelligence tests, each built on a different theoretical concept of intelligence

FORMAT
• Form, plan, structure, arrangement, and layout of test items
• Computerized, pen-and-paper, or some other form (e.g., a performance test, where you ask people to perform a behavior)
• For digital tests, the form of the software and its items

• Tests differ in their administration procedures
o Digital vs pen-and-paper

• Tests differ in their scoring and interpretation procedures
o Score: a code or summary statement, often numerical in nature (e.g., units earned out of a quota of items, such as a score from 1 to 100)
o Scoring: the process of assigning such evaluative codes or statements to performance
- Ex. "Within average"; assigning whether a score is high or low, or mild, moderate, or severe
- Ex. IQ test: you obtain a score (65 on the scale); scoring then tells you that scores in the 55-70 range might be classified as mild, with lower ranges classified as more severe
o Cut score: a reference point, usually numerical, derived by judgment and used to divide a set of data into two or more classifications
- Ex. Grading system: the cutoff score (75%) is the reference for whether you pass or fail
- It differentiates a low or below-par performance from a passing one

• Tests differ with respect to their psychometric soundness (technical quality ~ psychometrics)
o Validity
- Is it measuring what it is supposed to measure?
- What theory of intelligence did you use? Is it really measuring intelligence?
o Reliability
- Consistency of results
- Ex. Is the intelligence score consistent through time?
o Utility
- Usefulness of a test for a particular purpose

2. INTERVIEW
• Communicating with the client
• A method of gathering information through direct communication involving reciprocal exchange
• Interviews differ with regard to many variables, such as their purpose, length, and nature
• Semi-structured: with freedom to ask follow-up questions, dig deeper, or gather more information
• Structured: directly follow the prescribed questions
o More reliable, because you ask exactly the same questions of different populations
o Requires a certain level of expertise to decide what questions need to be asked
• Formats include panel interviews, motivational interviewing (therapeutic interviewing), online/telephonic, and face-to-face

3. PORTFOLIO
• A compilation of works done by the individual being assessed
o Not just your current or best work (except for job hiring)
o It shows your progression or development
o Ex. Photography, art-related works, job hiring in a company

4. CASE HISTORY
• Refers to records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data relevant to an assessee
• Case study
o A report or illustrative account concerning a person or an event that was compiled on the basis of case history data

5. BEHAVIORAL OBSERVATION
• Monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions
• Example: naturalistic observation
o Observing an individual in his/her natural habitat (in their home or school)
o May also involve asking the person to report what he/she observes in himself/herself, since it is impractical to always observe the client

6. ROLE-PLAY TEST
• Improvised or partially improvised
o Ex. Hiring for jobs: you are simulating how a person behaves
o Ex. How would the person respond to a customer? A stress test for an employee checks how the person handles an annoying customer

• Scoring may be done on-site (local processing) or at a central location (central processing)
o Ex. A recent psychometrician board examination used on-site computer testing
o Answers are sent to a central site, where they are computed

• Reports may come in the form of a simple scoring report (numerical), an extended scoring report, an interpretive report, a consultative report, or an integrative report
o Consultative and integrative reports are intended for readers with relevant expertise

OTHER TOOLS
• Tools traditionally associated with medical health
• Thermometers to measure body temperature; gauges to measure blood pressure
• Biofeedback equipment
o How a stimulus influences the person (heart rate or palpitations)
• Penile plethysmograph
o Phallometry is the measurement of blood flow to the penis
o Typically used as a proxy for the measurement of sexual arousal
o Used in the assessment of pedophiles
WHO, WHAT, WHY, HOW, AND WHERE

Who are the parties?
• Test developer
o Creates tests for research studies, publications (as commercially available instruments), or modifications of existing tests
- Psychological tests are products; test developers try to sell them to us
- It is important to have third-party evaluation of a test
• Test user
o Clinicians, counselors, school psychologists, human resources personnel, consumer psychologists, experimental psychologists, social psychologists, etc.
o Must be QUALIFIED test users (the person buying, using, and scoring the test), and also PERMITTED TO PURCHASE
o Different levels of testing:
- LEVEL A: teacher-made tests (e.g., a prelims test)
- LEVEL B: usually psychometricians/psychologists (requires technical knowledge of tests); can be administered in groups
- LEVEL C: highest level; psychologists only (requires supporting psychological theories and individual administration), e.g., intelligence tests
• The test taker: the person taking the test
o Issues to consider:
- Test anxiety
- Understanding of and agreement with the rationale of the assessment
- Capacity and willingness to cooperate with the examiner; whether the person is taking the test voluntarily
- Physical pain and emotional distress; physical discomfort; alertness and wakefulness during assessment (test fatigue: multiple tests taken in one day)
- Test acquiescence and social desirability issues (social desirability: the client lies and tries to appear okay; some paint themselves as okay or good so that the test results will look good)
- Prior coaching received
o Security of the tests and test protocols
• Society at large (creates needs for new variables to measure)
• Other parties
o Court cases (decision making; forensics)

In what types of settings are assessments conducted, and why?

1. Educational settings
• Achievement tests (measuring amount of learning)
o Ex. A prelims exam
• Diagnostic tests (tools of assessment used to help narrow down and identify areas of deficit to be targeted for intervention)
o A diagnostic test places you at the right level
2. Clinical settings
• Hospitals, in-patient and out-patient clinics, private practice consulting rooms, schools, and other institutions
• Intelligence tests, personality tests, neuropsychological tests, and other specialized instruments
• Mostly individual assessment, with group testing usually for screening (i.e., determining individuals who need further diagnostic evaluation)
3. Counseling settings
• Schools, prisons, and governmental or privately owned institutions
• Measures of social and academic skills and measures of personality, interest, attitudes, and values are among the many types of tests that a counselor might administer to a client
• Referral questions range from "How can this child better focus on tasks?" to "For what career is the client best suited?" to "What activities are recommended for retirement?"
4. Geriatric settings
• Care of older patients in a variety of settings
• Assessments gauge the extent to which assessees are enjoying as good a quality of life as possible (perceived stress, loneliness, sources of satisfaction, personal values, quality of living conditions, and quality of friendships and other social support)
• Screening for cognitive decline and dementia
5. Business and military settings
• Decision making about the careers of personnel
• A wide range of achievement, aptitude, interest, motivational, and other tests may be employed in the decision to hire as well as in related decisions regarding promotions, transfer, job satisfaction, and eligibility for further training
• Psychologists working in the area of marketing help "diagnose" what is wrong (and right) about brands, products, and campaigns
6. Governmental and organizational credentialing
• Before they are legally entitled to practice medicine, physicians must pass an examination
• Law school graduates cannot present themselves to the public as attorneys until they pass their state's bar examination
• Psychologists, too, must pass an examination before adopting the official title "psychologist"
7. Academic research settings
• Any academician who ever hopes to publish research should ideally have a sound knowledge of measurement principles and tools of assessment

HOW ARE ASSESSMENTS CONDUCTED?
• Protocol
o Typically refers to the form, sheet, or booklet on which a test taker's responses are entered. The term may also be used to refer to a description of a set of test- or assessment-related procedures (steps to take)
o Ex. Test booklet, answer sheet
• Rapport
o May be defined as a working relationship between the examiner and the examinee
o Without rapport, the examinee might respond to items at random
• Group vs individually administered tests
o Group: more practical and economical, but allows less observation
o Individual: more tedious, but you can attend closely to the examinee
• Score, interpret, integrate, and report (and give feedback)

ASSESSMENT OF PEOPLE WITH DISABILITIES

1. Accommodation
• Adaptation of a test, procedure, or situation
• Substitution of one test for another, if one is available
• Needed to make assessment more suitable for an assessee with exceptional needs
2. Alternate assessment
• An evaluative or diagnostic procedure or process
• Varies from the usual, customary, standardized way of deriving measurement
• Special accommodations or alternative methods designed to measure the same variable
WHERE TO GO FOR AUTHORITATIVE INFORMATION: REFERENCE SOURCES
• Test catalogues
o The most readily accessible sources of information
o This source of test information can be tapped by a simple telephone call, e-mail, or note
o Provide a brief description of the test
o A catalogue's objective is to sell the test
• Test manuals
o Detailed information concerning the development of a particular test, along with technical information
o Usually can be purchased from the test publisher
• Reference volumes
o An authoritative compilation of test reviews (the Mental Measurements Yearbook) is currently updated about every three years
o A companion volume (Tests in Print), also updated periodically, provides detailed information for each test listed, including test publisher, test author, test purpose, intended test population, and test administration time
• Journal articles
o Reviews of the test, updated or independent studies of its psychometric soundness, or examples of how the instrument was used in either research or an applied context
• Online databases
o One of the most widely used bibliographic databases for test-related publications is maintained by the Educational Resources Information Center (ERIC)
o The American Psychological Association (APA) maintains a number of databases useful in locating psychology-related information in journal articles, book chapters, and doctoral dissertations

HISTORICAL, LEGAL, AND ETHICAL CONSIDERATIONS

A HISTORICAL PERSPECTIVE

• Darwin's interest in individual differences led his half cousin, Francis Galton, to devise a number of measures for psychological variables
o Galton also promoted eugenics, the scientifically erroneous and immoral theory of "racial improvement" through "planned breeding" (i.e., discouraging those deemed "unfit" from reproducing)
• In Germany, Wilhelm Max Wundt
o Started the first experimental psychology laboratory
o Measured variables such as reaction time, perception, and attention span
o Focused on how people were similar
o Treated individual differences as a frustrating source of error in experimentation
• Charles Spearman
o Student of Wundt in Leipzig
o Credited with originating the concept of test reliability as well as building the mathematical framework for the statistical technique of factor analysis
• James McKeen Cattell coined the term mental test in 1890 and was responsible for introducing mental testing in America
• The twentieth century brought the first tests of abilities such as intelligence
• In 1905, Binet (the father of intelligence or IQ testing) and Simon developed the first intelligence test, to identify intellectually disabled Paris schoolchildren
• World Wars One and Two brought the need for large-scale testing of the intellectual ability of new recruits
• After World War Two, psychologists increasingly used tests in governmental and civilian applications
o Civilian applications: large corporations and private organizations
• By the late 1930s, about 4,000 different psychological tests were in print

Robert Woodworth
• The Woodworth Psychoneurotic Inventory (Personal Data Sheet, PDS) was the first widely used self-report personality test
• Advantage of the self-report personality test:
o Respondents are arguably the best-qualified people to provide answers about themselves
• Disadvantages of the self-report personality test:
o Respondents may have poor insight into themselves
o People might honestly believe some things about themselves that in reality are not true
o Respondents may be unwilling to reveal anything about themselves that is very personal or that paints them in a negative light

• Projective tests, such as the Rorschach Inkblot Test, are tests in which an individual is assumed to "project" onto some ambiguous stimulus his or her own unique needs, fears, hopes, and motivation (projection theory)
• Psychological assessment has proceeded along two distinct threads: the academic and the applied
• Academic tradition: researchers at universities throughout the world use the tools of assessment to help advance knowledge and understanding of human and animal behavior
o Assessment for research purposes, to learn more
• In the applied tradition, the goal is to help select applicants for various positions on the basis of merit
o Putting those findings into action

CULTURE AND ASSESSMENT

• Culture: the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people (Cohen, 1994)
• Professionals in assessment have shown increasing sensitivity to cultural issues in every aspect of test development and use

HENRY GODDARD
• Early psychological testing of immigrant populations by Henry Goddard was controversial
o He found that the majority of the immigrants tested were "feebleminded"
o Goddard's findings were largely the result of using a translated Binet test that overestimated mental deficiency even in native English-speaking populations, let alone immigrant populations
o Goddard's research sparked a nature-nurture debate: were IQ results indicative of some underlying native ability, or of the extent to which knowledge and skills had been acquired?

• Early developers of IQ tests devised culture-specific tests and clarified that the tests were designed for people from one culture but not from another
• Today, developers of intelligence tests take precautions against cultural bias

1. Verbal communication
• Certain nuances of meaning may be lost in translation
• Some interpreters may not be familiar with mental health issues, and pre-training may be necessary
• In interviews, language deficits may be detected by trained examiners but may go undetected in written tests
• Assessments need to be evaluated in terms of the language proficiency required and the language level of the test taker

2. Nonverbal communication and behavior
• Nonverbal signs or body language may vary from one culture to another
• Psychoanalysis pays particular attention to the symbolic meaning of nonverbal behavior
• Differences in the pace of life across cultures may detract from or enhance test scores on timed tests

3. Standards of evaluation
• Judgments related to certain psychological traits can be culturally relative
• Cultures differ with regard to gender roles and views of psychopathology
• Cultures also vary in terms of collectivist versus individualist values
• Collectivist cultures value traits such as conformity, cooperation, interdependence, and striving toward group goals
• Individualist cultures place value on traits such as self-reliance, autonomy, independence, uniqueness, and competitiveness

LEGAL AND ETHICAL ISSUES (Historical, Cultural, and Ethical Considerations)

• Test user qualifications: in 1950, an APA Committee on Ethical Standards for Psychology published a report called Ethical Standards for the Distribution of Psychological Tests and Diagnostic Aids
• Level A: tests or aids that can adequately be administered, scored, and interpreted with the aid of the manual
• Level B: tests or aids that require some technical knowledge of test construction and of supporting psychological and educational fields (may be administered in groups)
• Level C: tests and aids that require substantial understanding of testing and supporting psychological fields, together with supervised experience in the use of these devices (psychologists only, in the Philippines)
• Some challenges in testing people with disabilities include:
o Transforming the test into a form that can be taken by the test taker (e.g., Braille or oral administration)
o Transforming the responses of the test taker so that they are scorable
o Meaningfully interpreting the test data

THE RIGHTS OF TEST TAKERS
• Test takers have a right to know why they are being evaluated, how the test data will be used, and what (if any) information will be released to whom

1. Informed consent
• Test takers should give their informed consent only with full knowledge of such information
o Information needed for consent must be in language the test taker can understand
o Some groups (for example, people with dementia, bipolar disorder, or schizophrenia) may not have the capacity, or competency, to provide informed consent
o If the person cannot demonstrate competency, consent may be obtained from a parent or a legal representative

COMPONENTS OF COMPETENCY INCLUDE:
• Being able to evidence a choice as to whether one wants to participate
• Demonstrating a factual understanding of the issues
• Being able to reason about the facts of a study, treatment, or whatever it is to which consent is sought
• Appreciating the nature of the situation

2. The right to be informed of test findings
• Test takers have a right to know about test findings and recommendations
• Test users should sensitively inform test takers of the purpose of the test, the meaning of the score relative to those of other test takers, and the possible limitations and margins of error of the test
2.1 Relevancy

3. The right to privacy and confidentiality
• In most states, information provided by clients to psychologists is considered privileged information
• Privilege is not absolute
• Another ethical mandate regarding confidentiality pertains to safeguarding test data
o The duty to warn the authorities or an involved party
- Ex. A client saying "I plan to kill this person"

4. The right to the least stigmatizing label
• The least stigmatizing labels should always be assigned when reporting test results
o Ex. "Feebleminded" or "mentally retarded" → "intellectual disability"
LESSON 3 or 4: OF TESTS AND TESTING

ASSUMPTIONS ABOUT PSYCHOLOGICAL TESTING
• Psychological traits and states exist
• Traits and states can be quantified and measured
• Test-related behavior predicts non-test-related behavior
• All tests have limits and imperfections
• Various sources of error are part of the assessment process
• Unfair and biased assessment procedures can be identified and reformed
• Testing and assessment benefit society

Assumption 1: Psychological traits and states exist.
• A trait has been defined as "any distinguishable, relatively enduring way in which one individual varies from another" (Guilford, 1959, p. 6)
• States also distinguish one person from another but are relatively less enduring (Chaplin et al., 1988)
o May vary from time to time; temporary
• Psychological traits exist as constructs: an informed, scientific concept developed or constructed to describe or explain behavior
o Trait: intelligence; no time frame is specified, because it is considered relatively stable
o State: anxiety (which can also be a trait); often specified for the past couple of days or as currently felt

Assumption 2: Traits and states can be quantified and measured.
• Different test developers may define and measure constructs in different ways
• Once a construct is defined, test developers turn to the types of item content that will gauge the strength of the trait in a test taker
• A scoring system and a way to interpret results need to be devised
o Test developers devise different interpretations based on the scoring obtained (e.g., labels such as pessimistic, optimistic, very optimistic)

Assumption 3: Test-related behavior predicts non-test-related behavior.
• Responses on tests are thought to predict real-world behavior; the obtained sample of behavior is expected to predict future behavior

Assumption 4: All tests have limits and imperfections.
• Competent test users understand and appreciate the limitations of the tests they use, as well as how those limitations might be compensated for by data from other sources
o Ex. How could you show that the items you put in your test are associated with intelligence?

Assumption 5: Various sources of error are part of the assessment process.
• ERROR refers to a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test
o Random error: unpredictable, out of the blue
- Ex. Only one person's booklet was misprinted with 90 items while everyone else's had 100
o Systematic error: applies to everyone, and so does not affect reliability
- Ex. Your exact weight is 55 kg, but the weighing scale is off by 5 kg, so it reads 60 kg
- Ex. If a 100-item test was printed with only 90 items for everyone, would you expect anybody to score 100?
o Environmental error: e.g., the lighting in the room can affect the test taker
o Procedural error: e.g., the instruction should be "shade A if the statement is true and B if false," but the examiner mistakenly gave the reverse instruction
• ERROR VARIANCE: the component of a test score attributable to sources other than the trait or ability measured
o Influences the actual score
o Variance: the things that contribute to your performance
o Ex. Weighing a block of gold with an accurate scale gives 1 kg. Is the block pure gold? No; the impurity is the variance: how much impurity does it take for the block to weigh 1 kg?
o Ex. The score someone obtains is not exactly their true score, because something is influencing the score
o Variance is inherent
• Both the assessee and the assessor are sources of error variance

Assumption 6: Unfair and biased assessment procedures can be identified and reformed.
• All major test publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual
• Problems related to test fairness:
o Arise if the test is used with people for whom it was not intended (some tests do not have norms for particular groups)
o Some are more political than psychometric in nature

Assumption 7: Testing and assessment benefit society.
• There is a great need for tests, especially good tests, considering the many areas of our lives that they benefit

CHARACTERISTICS OF A PSYCHOLOGICAL TEST
• Objectivity
o Two different individuals scoring the same test should arrive at the same score
o Ex. A prelims exam where there is only one correct answer
o Eliminates judgment or bias
o Subjectivity: essays may be affected by the judgment (personal biases) of the person checking
• Standardization
o The test is made uniform; there is a particular standard that you have to follow
• Reliability
o Consistency of results
• Validity
o Measuring what the test is supposed to measure
• Utility
o How useful the test is
o The practical value of a test as it aids in the decision-making process and its efficiency

NORMS

Norm-referenced testing and assessment
• A method of evaluation and a way of deriving meaning from test scores by evaluating an individual test taker's score and comparing it to the scores of a group of test takers
• The meaning of an individual test score is understood relative to other scores on the same test

Book recommendation: The End of Average by Todd Rose

Norms
• Test performance data of a particular group of test takers that are designed for use as a reference when evaluating or interpreting individual test scores

Normative sample
• The reference group to which the performance of test takers is compared

SAMPLING TO DEVELOP NORMS

• Standardization
o The process of administering a test to a representative sample of test takers for the purpose of establishing norms
• Sampling
o Test developers select a population for which the test is intended
o Test takers should have at least one common, observable characteristic
• Stratified sampling
o Sampling that includes different subgroups, or strata, from the population
o Each subgroup's percentage in the sample mirrors its percentage in the population
o Ex. Divide a sample of adults into subgroups by age: 18-29, 30-39, 40-49, 50-59, and 60 and above
• Stratified-random sampling
o Every member of the population has an equal opportunity of being included in the sample (like drawing names from a fishbowl)
• Purposive sampling
o Arbitrarily selecting a sample that is believed to be representative of the population
o Specific: you want your sample to have certain characteristics
• Incidental/convenience sampling
o A sample that is convenient or available for use; it may not be representative of the population
o Generalization of findings from convenience samples must be made with caution
o A non-probability approach: whoever happens to be available
o Ex. A survey conducted at a fast-food restaurant
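To make the sampling distinctions concrete, here is a minimal Python sketch of stratified-random sampling as described above; the population, the age strata, and the 5% sampling fraction are all made-up values for illustration.

```python
import random

# Hypothetical norming population: (person_id, age_group) pairs.
# The age strata mirror the example in the notes (18-29, 30-39, ...).
population = [(i, random.choice(["18-29", "30-39", "40-49", "50-59", "60+"]))
              for i in range(10_000)]

def stratified_random_sample(pop, strata_key, fraction, seed=0):
    """Draw the same fraction from every stratum at random, so each
    subgroup keeps its population proportion in the sample."""
    rng = random.Random(seed)
    strata = {}
    for person in pop:
        strata.setdefault(strata_key(person), []).append(person)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # stratum size * sampling fraction
        sample.extend(rng.sample(members, k))
    return sample

sample = stratified_random_sample(population, lambda p: p[1], fraction=0.05)
print(len(sample))  # roughly 5% of the population, proportional by age group
```

Because every member of a stratum is equally likely to be drawn, this combines the "fishbowl" idea of random sampling with the proportional-representation idea of stratification.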
DEVELOPING NORMS FOR A STANDARDIZED TEST
• Having obtained a sample, test developers:
o Administer the test with a standard set of instructions
o Recommend a setting for administering the test
o Collect and analyze data
o Summarize data using descriptive statistics, including measures of central tendency and variability
o Provide a detailed description of the standardization sample

• Fixed reference group scoring systems
o The distribution of scores obtained on the test from one group of test takers is used as the basis for the calculation of test scores for future administrations of the test
o The SAT employs this method

NORM-REFERENCED VERSUS CRITERION-REFERENCED EVALUATION
• Norm-referenced tests involve comparing individuals to the normative group
• In criterion-referenced tests and assessments, test takers are evaluated as to whether they meet a set standard (example: a driving exam)

CRITERION VS NORM-REFERENCED

Criterion
• A standard on which a judgment or decision may be based

CRITERION- (DOMAIN-/CONTENT-) REFERENCED
• The interpretive frame of reference is a specific content domain
• Also called domain- or content-referenced testing
• Assessment of mastery (mastery testing)
• Provides information on what people can do

NORM-REFERENCED
• An individual's score is interpreted by comparing it with the scores obtained by others on the same test
• A method of evaluation; a way of deriving meaning from test scores by evaluating an examinee's test score and comparing it to the scores of a group of examinees
• The goal is to yield information on an examinee's standing or ranking relative to some comparison group of examinees

1. DEVELOPMENTAL NORMS
• Indicate how far along the normal developmental path an individual has progressed
1.1 Mental age (age norms)
• The child's score corresponds to the highest year/age level that he or she can successfully complete
1.2 Grade equivalent (grade norms)
• Assigns achievement on a test or battery of tests according to grade norms
• Only relevant to the years in which a person is in school
1.3 Ordinal scale
• Identifies the stage reached by the child in the development of specific behavior functions

2. WITHIN-GROUP NORMS
• An individual's performance is evaluated in terms of the performance of the most nearly comparable standardization group
2.1 Percentiles
• An expression of the percentage of people whose score on a test or measure falls below a particular raw score
• Different from percentage correct
2.2 Standard scores
• Derived scores that use as their unit the standard deviation of the population upon which the test was standardized
2.3 Deviation IQ
• A standard score on an intelligence test, typically rescaled to a mean of 100 and a standard deviation of 15
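A small Python sketch of how the within-group norms above relate a raw score to a norming sample; the norm scores and the raw score of 74 are invented, and the deviation-IQ rescaling assumes the conventional mean of 100 and SD of 15.

```python
import statistics

# Hypothetical raw scores from a norming sample.
norm_scores = [52, 55, 58, 60, 61, 63, 65, 68, 70, 74, 77, 80, 83, 85, 90]
raw = 74

# Percentile: share of the norm group scoring below the raw score
# (note this is not the same thing as percentage correct).
percentile = 100 * sum(s < raw for s in norm_scores) / len(norm_scores)

# Standard (z) score: distance from the norm mean in SD units;
# a deviation IQ rescales z to mean 100, SD 15.
mean = statistics.mean(norm_scores)
sd = statistics.stdev(norm_scores)
z = (raw - mean) / sd
deviation_iq = 100 + 15 * z

print(f"percentile={percentile:.0f}, z={z:+.2f}, deviation IQ={deviation_iq:.0f}")
```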
ADDITIONAL NORMS

NATIONAL NORMS
• Norms derived from a normative sample that was nationally representative of the population at the time the norming study was conducted
NATIONAL ANCHOR NORMS
• An equivalency table for scores on two nationally standardized tests designed to measure the same thing; provides some stability to test scores by anchoring (comparing) them to the other test's scores
SUBGROUP NORMS
• Norms for segmented portions of the normative sample
LOCAL NORMS
• The local population's performance on the test

5: RELIABILITY

[Figure: dartboard diagrams illustrating validity (V) versus reliability (R)]
• High V, high R: all shots are near the bullseye (valid) and consistently hit the same spot (reliable)
• Low V, high R: the shots are not near the bullseye (not valid) but cluster together (reliable)
• High V, low R: the shots center on the target on average (valid) but are scattered rather than consistent (not reliable)

RELIABILITY
• Dependability or consistency (the results land close together)
• Consistency in measurement
• Refers to something that produces similar results

RELIABILITY COEFFICIENT
• A statistic that quantifies reliability, ranging from 0 (not reliable at all) to 1 (perfectly reliable)
o The range is 0 to 1 only; there is no negative reliability coefficient
o A test is either reliable or it is not

TRUE SCORE (~ reliability)
• The individual's score on a measure if there were no error
• Tied to the particular measurement used
o Example: the true score on one test of depression may not be the same as the true score on another test of depression
o Likewise, for two different intelligence tests, the true score is different for each
CONSTRUCT SCORE (~ validity)
• A construct score is a person's standing on the theoretical variable, independent of any particular measurement
o Regardless of which measure you use, it would be the same score
o The construct is a theoretical variable (e.g., depression); a construct score cannot actually be obtained
• Ex. True scores for the same person:
o Test 1: 5/10
o Test 2: 100/200
o Test 3: 25/50
o → the true scores differ for each test
o Test 4 — construct score: the person possesses 50% of the attribute (impossible to claim directly, since it is theoretical)
• The construct score is related to validity

Observed score = true score plus error
• Error refers to the difference between the observed score and the true score
• X = T + E
o Your performance on a test might not necessarily be your true score
o It might be your performance at that moment only

VARIANCE AND MEASUREMENT ERROR
• Variance: the standard deviation squared
o A measure of dispersion
o Everything that contributes to why you got the score you got
o Ex. Why did you get 70 points? What made it 70? Ideally, the attribute you truly possess
• Variance equals true variance plus error variance:
o σ = SD; σ² = variance
o σ²(X) = σ²(T) + σ²(E)
• Reliability is the proportion of the total variance attributed to true variance
o A variable is anything that varies

Perfect world
• Error does not exist (E = 0), so the observed score equals the true score and σ²(X) = σ²(T), the true variance
• Ex. A student learned only 70% of the material (the construct score), so the true score is 70/100, and that is exactly what the test shows

Real world
• Observed score: X = T + E
• Variance = true variance + error variance
• Assumption: psychological tests have limitations and errors (the cause of error variance)
• A true standing of 70% may show up as an observed 99/100 (e.g., because you already knew the answers in advance, even though you really know only 70%)
• Or it may show up as an observed 50/100 (e.g., you had a fever while taking the test)
• We can never know the error-free variance exactly, because error variance always exists

MEASUREMENT ERROR
• Refers to the inherent uncertainty associated with any measurement, even after care has been taken to minimize preventable mistakes

RANDOM ERROR
• Consists of unpredictable fluctuations and inconsistencies of other variables in the measurement process (i.e., noise)
• Does not apply to everyone; it becomes noticeable in individual cases

SYSTEMATIC ERROR
• Typically proportionate to what is presumed to be the true value of the variable being measured
• Experienced by, or applicable to, all test takers (everyone's score is off in the same way)

EXAMPLES:
Error variance:
• Ex. T1 = 71; T2 = 75 (the test taker remembered the answers from T1); T3 = 65 (the test taker was anxious)
• With error-free measurement, the scores would have stayed constant

SOURCES OF ERROR VARIANCE

Test construction
• Variation may exist within items in a test or between tests (i.e., item sampling or content sampling)
o This means putting items or questions on the test that are not relevant to what you intend to measure
o Might cause a person's score to increase or decrease
o Ex. An intelligence test containing K-pop content: some test takers are knowledgeable about K-pop while others are not

Test administration
• Sources of error variance may stem from the test environment
o Ex. The administration keeps changing: first administration, no instructions; second administration, the right instructions; third administration, wrong instructions
• Test taker variables include pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication
• Examiner-related variables: physical appearance and demeanor may play a role, as may the way the proctor administers the test

Test scoring and interpretation
• Computer testing reduces error in test scoring, but many tests still require expert interpretation
• Example: tests of personality, tests of creativity, various behavioral measures, etc.
o Different experts might have different opinions
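The X = T + E decomposition can be illustrated with a short simulation; the true-score mean of 70, SD of 10, and error SD of 5 are arbitrary choices, not values from the notes.

```python
import random
import statistics

random.seed(1)

# Simulate the classical model X = T + E for 1,000 test takers:
# each observed score is a stable true score plus random error.
true_scores = [random.gauss(70, 10) for _ in range(1000)]   # sigma_T = 10
observed = [t + random.gauss(0, 5) for t in true_scores]    # sigma_E = 5

var_true = statistics.variance(true_scores)
var_obs = statistics.variance(observed)

# Reliability = true variance / total observed variance;
# with sigma_T = 10 and sigma_E = 5 it should land near 100/125 = 0.80.
print(f"estimated reliability: {var_true / var_obs:.2f}")
```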
RELIABILITY ESTIMATES

Test-retest reliability
• An estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test
o Pre-test and post-test; also called time-sampling reliability
• Most appropriate for variables that should be stable over time, such as personality traits, and not appropriate for variables expected to change over time
o Not the best way to estimate reliability
• As time passes, the correlation between the scores obtained on each testing decreases
o With intervals greater than 6 months, the estimate of test-retest reliability is called the coefficient of stability
- Indicates how stable the scores are over time
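A minimal sketch of computing test-retest reliability from two administrations; the paired scores are invented (the first pair echoes the T1 = 71, T2 = 75 example above), and statistics.correlation requires Python 3.10+.

```python
import statistics

# Hypothetical scores from two administrations of the same test.
time1 = [71, 64, 80, 55, 90, 68, 75, 62, 85, 70]
time2 = [75, 61, 82, 58, 88, 70, 72, 65, 83, 73]

# Test-retest reliability is simply the Pearson correlation
# between the paired scores from the two administrations.
r_tt = statistics.correlation(time1, time2)
print(f"test-retest reliability: {r_tt:.2f}")
```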
COEFFICIENT OF EQUIVALENCE
• Measures the degree of the relationship between various forms of a test by means of alternate forms or parallel forms
• Example: giving two different forms of a test (form A and form B)

PARALLEL FORMS
• For each form of the test, the means and the variances of observed test scores are equal
• Ex. Each item in form A should have statistics equivalent to the corresponding item in form B

ALTERNATE FORMS
• Typically designed to be equivalent with respect to variables such as content and level of difficulty

• Obtaining estimates of alternate-forms reliability and parallel-forms reliability is similar to obtaining an estimate of test-retest reliability

SPLIT-HALF RELIABILITY
• Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once
• Step 1: Divide the test into equivalent halves
• Step 2: Calculate a Pearson r between scores on the two half-tests
• Step 3: Adjust the half-test reliability using the Spearman-Brown formula
o The Spearman-Brown formula allows a test developer or user to estimate internal-consistency reliability from a correlation between two halves of a test
o Ex. The scores on the odd items and the scores on the even items should be equivalent
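The three steps above can be expressed in a few lines of Python; the 6-item, 6-person response matrix is fabricated for illustration (statistics.correlation requires Python 3.10+).

```python
import statistics

# Hypothetical item scores (rows = test takers, columns = 6 items).
items = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1],
]

# Step 1: split into equivalent halves (odd- vs even-numbered items).
odd = [sum(row[0::2]) for row in items]
even = [sum(row[1::2]) for row in items]

# Step 2: Pearson r between the two half-test scores.
r_half = statistics.correlation(odd, even)

# Step 3: Spearman-Brown correction estimates full-length reliability.
r_sb = (2 * r_half) / (1 + r_half)
print(f"half-test r={r_half:.2f}, Spearman-Brown corrected={r_sb:.2f}")
```

The correction is needed because the half-tests are shorter than the real test, and shorter tests are less reliable; Spearman-Brown projects the half-test correlation back up to full length, assuming the two halves are genuinely equivalent.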
OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY

Inter-item consistency
• The degree of correlation among all the items on a scale

Coefficient alpha
• The mean of all possible split-half correlations, corrected by the Spearman-Brown formula
• Values range from 0 to 1
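A compact sketch of coefficient alpha using its standard formula, alpha = k/(k−1) × (1 − Σ item variances / total variance); the Likert responses are invented.

```python
import statistics

# Hypothetical Likert responses (rows = respondents, columns = k items).
data = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
]

k = len(data[0])
item_vars = [statistics.variance(col) for col in zip(*data)]  # per-item variance
total_var = statistics.variance([sum(row) for row in data])   # total-score variance

# Coefficient alpha: k/(k-1) * (1 - sum of item variances / total variance).
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"coefficient alpha: {alpha:.2f}")
```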
MEASURES OF INTER-SCORER RELIABILITY
• Inter-scorer reliability: the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure
• Often used when coding nonverbal behavior
• Guards against biases or idiosyncrasies in scoring
o Coefficient of inter-scorer reliability: a correlation coefficient used to determine the degree of consistency among scorers
(check reliability docs for table)

THE NATURE OF TESTS

HOMOGENEITY VS HETEROGENEITY OF ITEMS

HOMOGENEITY
• Unidimensional
• Items are internally consistent
• Has higher reliability
• Measures one factor, such as one ability or one trait
• Example: SEI (Self-Esteem Inventory)

HETEROGENEITY
• Multidimensional (many facets)
• Lower internal consistency
• There are many dimensions (e.g., of personality) in one test
• Example: 16PF or NEO-PI-3

DYNAMIC VS STATIC CHARACTERISTICS
• Would test-retest be a good measure of reliability for dynamic characteristics? No.

DYNAMIC
• A trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences
• A measure of anxiety (a dynamic characteristic) may change from hour to hour
• A test-retest measure would be of little help in gauging the reliability of the measuring instrument

STATIC
• A trait, state, or ability presumed to be relatively unchanging
• Example: intelligence
• The obtained measurement would not be expected to vary significantly as a function of time
• Either the test-retest or the alternate-forms method would be appropriate

RESTRICTION OF RANGE VS INFLATION OF RANGE

RESTRICTION
• The variance of either variable in a correlational analysis is restricted by the sampling procedure used
• The resulting correlation coefficient is lower
• You are limiting the statistics that you have obtained

INFLATION
• The variance of either variable in a correlational analysis is inflated by the sampling procedure
• The resulting correlation coefficient is higher
• You are comparing your statistics to a larger group

EXAMPLE:
• The USTET (a university entrance exam): you have the statistics of everyone who took it, the items of the test, and the range of scores
o Sample: the College of Science, whose takers' scores range from 700 to 800
• Restriction of range: only obtaining the measures of those who passed
o For passers, the range of scores is 600-900 (restricted), while university-wide all takers span 100-900
• Inflation of range: comparing the College of Science scores (passing scores around 700-800) to the university-wide distribution instead of focusing only on the College of Science range

SPEED TEST VS POWER TEST

SPEED
• Limited time
• Uniform level of difficulty (typically uniformly low)
• Test takers should be able to complete all the test items correctly (because the items are easy)

POWER
• Enough time
• Some items are so difficult that no test taker is able to obtain a perfect score
• Allows test takers to attempt all items

A reliability estimate of a speed test should be based on performance from two independent testing periods, using one of the following:
(1) test-retest reliability
(2) alternate-forms reliability
(3) split-half reliability from two separately timed half-tests

TRUE SCORE MODEL AND ALTERNATIVES

TRUE SCORE THEORY, OR CLASSICAL TEST THEORY
• Estimates the portion of a test score that is attributable to error
• Observed score = true score plus error
• Averaging all the observed scores obtained over a period of time, the result would be closest to the true score
o Administer the test an infinite number of times
o Each score from each administration would help you find the true score
• The greater the number of items, the higher the reliability
DOMAIN SAMPLING THEORY
• Estimates the extent to which specific sources of variation under defined conditions contribute to the test score
• Assumes that the items that have been selected for any one test are just a sample of items from an infinite domain of potential items (e.g., for depression: what constitutes depression?)
o E.g., putting items like "I feel sad" or "I feel lonely" on the test
• A domain of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct:
o one that shares certain characteristics with (and is measured by) the sample of items that make up the test
• The items in the domain are thought to have the same means and variances as those in the test that samples from the domain
• Measures of internal consistency are perhaps the most compatible with domain sampling theory

GENERALIZABILITY THEORY
• Based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation
o It is not about the true score
• Instead of conceiving of variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score
o A universe is described in terms of its facets, including the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration

ITEM RESPONSE THEORY (commonly used)
• Provides a way to model the probability that a person with ability X will be able to perform at a level of Y
• IRT refers to a family of methods and techniques, with distinct specific approaches
• IRT incorporates considerations of an item's level of difficulty and discrimination
o Difficulty relates to an item not being easily accomplished, solved, or comprehended
- Example: checking how many people in a sample got the correct answer versus how many took the test
o Discrimination refers to the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or other variable being measured
- How many of those who actually possess the trait endorse the item?
- Example: in a test of optimism, an optimistic person should endorse an optimism item highly, while a pessimistic person should show low endorsement of the item
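A sketch of how difficulty (b) and discrimination (a) enter an item response model, using the common two-parameter logistic form; the item parameters and ability levels below are arbitrary.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic IRT model: probability that a person with
    ability theta answers/endorses an item with discrimination a and
    difficulty b."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# A hard, highly discriminating item vs. an easy, weakly discriminating one.
hard_item = dict(a=2.0, b=1.0)
easy_item = dict(a=0.5, b=-1.0)

for theta in (-2, 0, 2):
    print(f"theta={theta:+d}: "
          f"hard item P={p_correct(theta, **hard_item):.2f}, "
          f"easy item P={p_correct(theta, **easy_item):.2f}")
```

The steep curve of the high-discrimination item means it sharply separates people just below ability b from people just above it, which is exactly the "differentiates among people with higher or lower levels of the trait" idea above.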
THE STANDARD ERROR OF MEASUREMENT (SEM)
• The higher the reliability of the test, the lower the standard error
• The standard error can be used to estimate the extent to which an observed score deviates from a true score
• Confidence interval: a range or band of test scores that is likely to contain the true score
• Since we cannot accurately pinpoint the true score, we utilize the SEM to estimate where the true score lies, based on the standard error of measurement

EXAMPLE 1:
• Guessing the age of a person: 25 years old ± 2 years (the standard error) = an age range of 23-27
• How confident are you that the true age falls within 23-27? Say, 50% sure
• 25 years old ± 5 years (a larger standard error) = a range of 20-30
o Confidence is higher, because the range is wider

EXAMPLE 2 (suppose the test tells you that the SEM is 5):
• A score of 20/100 on a test
o Using ± 1 SEM (± 5): confidence interval of 68.3%; the true score is likely within 15-25
o Using ± 2 SEM (± 10): confidence interval of 95.4%; the true score is likely within 10-30
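A short sketch tying the SEM to reliability and to the confidence bands in Example 2; the SEM-from-reliability formula SEM = SD × sqrt(1 − r) is standard, and the SD = 15, r = 0.89 pair is an invented illustration.

```python
import math

# SEM = SD * sqrt(1 - reliability): higher reliability => smaller SEM.
def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

# Confidence bands around an observed score, echoing Example 2 above:
# observed 20, SEM 5 => 68.3% band 15-25; +/- 2 SEM => 95.4% band 10-30.
observed, s = 20, 5
for n_sem, coverage in ((1, "68.3%"), (2, "95.4%")):
    lo, hi = observed - n_sem * s, observed + n_sem * s
    print(f"{coverage} confidence: true score likely in {lo}-{hi}")

print(f"SEM for SD=15, r=0.89: {sem(15, 0.89):.1f}")  # about 5.0
```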
6: VALIDITY

VALIDITY
• A judgment or estimate of how well a test measures what it purports to measure in a particular context
• Ex. Measuring optimism: items about optimism, such as having a positive outlook in life, should be placed in an optimism test, and the test should not contain items irrelevant to optimism

VALIDATION
• The process of gathering and evaluating evidence about validity
• To establish validity, one must go through validation

Both test developers and test users may play a role in the validation of a test.
• Example: if you recently developed a test, you want users to send back their data when they use your test, so you can add it to your existing data and evaluate the test's validity

LOCAL VALIDATION STUDIES
• May yield insights regarding a particular population of test takers as compared to the norming sample described in a test manual
• Importance of the normative sample
o To compare a test taker's score to a particular population, the test taker must belong to that population
- A test normed for Westerners must be used with Westerners; a test normed for Asians must be used with Asians
o There are different norms; comparing results against the wrong norms means the interpretation might not reflect actual performance

CONCEPT OF VALIDITY
Validity is often conceptualized according to three categories:
1. Content validity: a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test
o Ex. Measuring the amount of learning for prelims: the content of the items should be limited to chapters 1-10; if you included chapter 11, the test is not valid
2. Criterion-related validity: a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures
o Ex. You got an excellent score on a test because you learned a lot from chapters 1-10; when you become a psychometrician, the score you got from the test should reflect your actual performance. One can say the test has high criterion-related validity if the score reflects the performance
3. Construct validity: a measure of validity arrived at by executing a comprehensive analysis of:
o how scores on the test relate to other test scores and measures
- comparing the score on a test to scores on other existing tests
o how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure
FACE VALIDITY (the lowest form, or easiest)
• A judgment concerning how relevant the test items appear to be
• If a test appears to measure what it purports to measure "on the face of it," it could be said to be high in face validity
• Example: a test about prelims in psych assessment
o Does it look like the test is for psych assessment, or for zoology?
• Ex. An objective or psychometric test where it can obviously be seen that the test is about intelligence (e.g., visualizing objects, identifying sequences)

CONTENT VALIDITY
• Content validity
o A judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
o Universe of behavior
- Example: happiness or optimism, with indicators like smiling and being sociable
- It is hard to list every possible item for optimism, because there are many factors
o A test is only a sample of the behavior
• Test blueprint
o A plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, etc.
o Example: a test measuring personality
- NEO-PI-3: how many items are associated with each facet or domain (neuroticism, conscientiousness, agreeableness, etc.)
• Culture and the relativity of content validity
o The content validity of a test varies across cultures
- A test valid for Westerners might not be valid for immigrants
o Political considerations may also play a role
• Content validity ratio (CVR): measures agreement among raters regarding how essential an individual test item is for inclusion in a test

CVR = (n_e − N/2) / (N/2)

o n_e = number of experts who gave the item a rating of "essential"
o N = total number of experts
o Values can range from −1 to +1
o Closer to +1 = the majority of experts agree there is an association between the item and the domain
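Lawshe's content validity ratio, which matches the n_e and N quantities above, takes a few lines of Python; the panel of 10 experts and their ratings are hypothetical.

```python
# Content validity ratio (Lawshe): CVR = (n_e - N/2) / (N/2),
# where n_e experts rated the item "essential" out of N experts total.
def content_validity_ratio(n_essential, n_experts):
    half = n_experts / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 experts rating one item.
print(content_validity_ratio(9, 10))   #  0.8 -> strong agreement: keep the item
print(content_validity_ratio(5, 10))   #  0.0 -> the panel is split
print(content_validity_ratio(2, 10))   # -0.6 -> most say it is not essential
```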
 Statistical evidences for concurrent and predictive


Validity
o Expectancy data~ expectancy table/chart BASE RATE VS SELECTION RATIO
 Information found in an expectancy table
 Shows the percentage of people within a BASE RATE SELECTION RATION
specified test score intervals who Percentage of people hired Numerical value that reflects
subsequently where place on various under the existing system for a the relationship between the
categories on a criterion particular position extent to number of people to be hired
 Example: 75/100 – average only which a particular trait, and the number of people
 What is the expected behavior, characteristic or available to be hired
attribute exists in the
performance to succeed in
population expressed in
an educational set up
proportion
  o Validity coefficient
    - Relationship between a variable and a criterion
    - A correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure.
    - An index, typically a correlation coefficient, that reflects how well an assessment instrument predicts a well-accepted indicator of a given concept or criterion
    - Ex. If a test of criminal behavior is valid:
      - It should be possible to use it to predict whether a person will be arrested in the future, is currently breaking the law, or has a previous criminal record
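A minimal sketch of computing a validity coefficient as a Pearson r; the test and criterion scores are hypothetical. Whether the r estimates concurrent or predictive validity depends only on when the criterion data were collected.

```python
import statistics as st

# Validity coefficient sketch: Pearson r between test scores and a
# criterion measure. Criterion collected at the same time -> evidence of
# concurrent validity; collected later (e.g., job performance months
# after hiring) -> evidence of predictive validity. Scores hypothetical.

test      = [12, 15, 19, 22, 25, 28, 31]          # predictor scores
criterion = [1.9, 2.4, 2.2, 3.1, 3.3, 3.8, 4.1]   # criterion scores

mx, my = st.mean(test), st.mean(criterion)
num = sum((t - mx) * (c - my) for t, c in zip(test, criterion))
den = (sum((t - mx) ** 2 for t in test)
       * sum((c - my) ** 2 for c in criterion)) ** 0.5
r = num / den
print(f"validity coefficient r = {r:.2f}")  # closer to 1.0 = stronger evidence
```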
  o Incremental validity
    - The degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use.
    - The improvement obtained by adding a particular procedure to an already existing combination of assessments
    - The value of including more than one predictor depends on a couple of factors.
    - Reflects the value of each measure to the process and outcome of the assessment
    - Standards depend on the goal of the assessment, such as whether one wishes to gather information, predict a criterion, or make a diagnosis.
    - Example: a medical doctor
      - More likely to correctly diagnose a kidney infection if a urine test is ordered instead of simply interviewing
      - The lab test adds incremental validity

BASE RATE VS SELECTION RATIO
• Base rate: the percentage of people hired under the existing system for a particular position; the extent to which a particular trait, characteristic, or attribute exists in the population, expressed as a proportion.
  o Ex. Working in a company: you already have an existing screening procedure. How many of the applicants you hired using that procedure (e.g., an interview) were identified as successful after they were hired?
• Selection ratio: a numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired.
  o How many applied vs. how many you will actually take.

HIT RATE VS MISS RATE
• Hit rate: the proportion of people a test accurately identifies as possessing or exhibiting a particular trait, behavior, characteristic, or attribute.
  o How many of your screening decisions for job applicants were correct?
• Miss rate: the proportion of people the test fails to identify as having, or not having, a particular characteristic or attribute.
  o How many people had that attribute but were not identified by the test?

FALSE POSITIVE (TYPE I ERROR) VS FALSE NEGATIVE (TYPE II ERROR)
• False positive: a miss wherein the test predicted that the examinee did possess the particular characteristic or attribute being measured when the examinee did not.
  o You thought the attribute was present, but it was not.
• False negative: a miss wherein the test predicted that the examinee did not possess the particular characteristic or attribute being measured when the examinee did.
  o You thought it was absent, but it was present.
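The sketch below ties the three tables together with hypothetical counts. The labels follow the definitions above; treating everyone the test flags as "selected" is an assumption made here for illustration.

```python
# Hypothetical counts from a screening test given to 100 applicants,
# later checked against who actually had the attribute of interest
# (e.g., succeeded on the job).

true_pos  = 30  # test said "has attribute" and the person did (hit)
false_pos = 10  # test said "has attribute" but the person did not (Type I)
true_neg  = 45  # test said "no attribute" and the person did not (hit)
false_neg = 15  # test said "no attribute" but the person did (Type II)
total = true_pos + false_pos + true_neg + false_neg

hit_rate  = (true_pos + true_neg) / total    # correct identifications
miss_rate = (false_pos + false_neg) / total  # hit_rate + miss_rate = 1.0
base_rate = (true_pos + false_neg) / total   # how common the attribute is
selection_ratio = (true_pos + false_pos) / total  # proportion the test "selects"

print(hit_rate, miss_rate, base_rate, selection_ratio)  # 0.75 0.25 0.45 0.40
```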
FACE VALIDITY (lowest form or easiest)
• A judgment concerning how relevant the test items appear to be.
• If a test appears to measure what it purports to measure "on the face of it," it could be said to be high in face validity.
• Example: a test about prelims in psych assessment
  o Does it look like a test for psych assessment or for zoology?
• Ex. An objective or psychometric test (it can obviously be seen that the test is about intelligence)
  o A test about intelligence
    - Visualize objects, identify the sequence

CONTENT VALIDITY
• Content validity
  o A judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample.
  o Universe of behavior:
    - Example: happiness, optimism
    - Smiling, sociable (universal)
    - It is hard to list down all the items for optimism because many factors are involved
  o A test is only a sample of a behavior
• Test blueprint
  o A plan regarding the types of information to be covered by the items and the number of items

CONSTRUCT VALIDITY
• A judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a construct.
  o Psychological constructs: extroversion, intelligence, aggression
  o A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior.
    - Intelligence is a construct that may be invoked to describe why a student performs well in school.
    - Anxiety is a construct that may be invoked to describe why a psychiatric patient paces the floor.
• If a test is a valid measure of a construct, then high scorers and low scorers should behave as theorized.
• All types of validity evidence, including evidence from the content- and criterion-related varieties of validity, come under the umbrella of construct validity.

EVIDENCE OF CONSTRUCT VALIDITY
• Evidence of homogeneity: how uniform a test is in measuring a single concept.
  o Example: a personality test
    - There would be items that fall under different facets: one facet is anxiety, another is conscientiousness
    - Are the items for anxiety homogeneous? (test blueprint)
• Evidence of changes with age: some constructs are expected to change over time (e.g., reading rate).
• Evidence of pretest-posttest changes: test scores change as a result of some experience between a pretest and a posttest (e.g., therapy).
  o Treating depression (posttest after the intervention to see if it is effective)
• Evidence from distinct groups: scores on a test vary in a predictable way as a function of membership in some group.
  o Example: a test for introversion
    - Introverted people should not score high on extroversion
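A small sketch of gathering homogeneity evidence via corrected item-total correlations, one common way to check whether items hang together; the response matrix is hypothetical.

```python
import statistics as st

# Evidence-of-homogeneity sketch: if a scale's items all tap one
# construct, each item should correlate positively with the total of
# the remaining items. Responses hypothetical (5 test takers x 4
# anxiety items, scored 1-5).

responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]

def pearson(a, b):
    ma, mb = st.mean(a), st.mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

n_items = len(responses[0])
for i in range(n_items):
    item = [row[i] for row in responses]
    rest = [sum(row) - row[i] for row in responses]  # total of the other items
    print(f"item {i + 1} corrected item-total r = {pearson(item, rest):.2f}")
```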
CONVERGENT EVIDENCE VS DISCRIMINANT EVIDENCE
• Convergent evidence: scores on the test undergoing construct validation tend to correlate highly, in the predicted direction, with scores on older, more established tests designed to measure the same (or a similar) construct.
  o 2 tests: (1) the test you are developing, which you would correlate with (2) another established test that is also trying to measure what you want to measure
  o The correlation between the tests should be high
• Discriminant evidence (divergent): a validity coefficient showing little relationship between test scores and/or other variables with which scores on the test should not theoretically be correlated.
  o Ex. Creating a test for happiness
    - You correlate it with intelligence
    - If they correlate, then your test is not actually testing happiness but intelligence

FACTOR ANALYSIS
• Factor analysis: a class of mathematical procedures designed to identify specific variables on which people may differ (grouping test items).
• A data reduction method in which several sets of scores and the correlations between them are analyzed.
• In such studies, the purpose of the factor analysis may be to identify the factor or factors in common between test scores on subscales within a particular test, or the factors in common between scores on a series of tests.
• Exploratory factor analysis
  o Typically entails "estimating, or extracting factors; deciding how many factors to retain; and rotating factors to an interpretable orientation"
• Confirmatory factor analysis
  o Researchers test the degree to which a hypothetical model (which includes factors) fits the actual data.
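A rough, illustrative sketch of the exploratory side: one common heuristic for deciding how many factors to retain is to count eigenvalues of the item correlation matrix greater than 1. This is a simplification of a full EFA (no extraction or rotation), and the data are hypothetical.

```python
import numpy as np

# Factor-analysis flavored sketch (not a full EFA): data hypothetical,
# 6 test takers x 4 items, where items 1-2 and items 3-4 move together.

scores = np.array([
    [5, 4, 1, 2],
    [4, 5, 2, 1],
    [2, 1, 5, 4],
    [1, 2, 4, 5],
    [3, 3, 3, 3],
    [4, 4, 2, 2],
], dtype=float)

corr = np.corrcoef(scores, rowvar=False)      # 4 x 4 item correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # largest first
print(np.round(eigenvalues, 2))
n_factors = int((eigenvalues > 1.0).sum())
print(f"factors retained by the eigenvalue-greater-than-1 rule: {n_factors}")
```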
VALIDITY, BIAS, AND FAIRNESS
Bias
• A factor inherent in a test that systematically prevents accurate, impartial measurement.
  o Standardization or objectivity would minimize bias
Rating error
• A judgment resulting from the intentional or unintentional misuse of a rating scale.
• Raters may be either too lenient, too severe, or reluctant to give ratings at the extremes (central tendency error).
• Leniency error (also known as a generosity error)
  o An error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading
• Severity error
  o Movie critics who pan just about everything they review may be guilty of severity errors.
  o That is only true if they review a wide range of movies that might consensually be viewed as good and bad.
• Example: selection purposes in job hiring
  o Different people conduct the interviews, which reduces reliability
  o The interviewers may be biased (some are compassionate, others too strict)
Halo effect
• A tendency to give a particular person a higher rating than he or she objectively deserves because of a favorable overall impression.
• Example: hiring
  o You raise an applicant's score even though they do not meet the criteria, just because you are both K-pop fans
Fairness
• The extent to which a test is used in an impartial, just, and equitable way.
• Acknowledges biases and tries to correct them

7: UTILITY

UTILITY
• The practical value of testing to improve efficiency.
• The usefulness of the test
Psychometric soundness
• The higher the criterion-related validity of test scores, the higher the utility of the test.
  o Exceptions exist, as many factors may enter into an estimate of a test's utility, and there are variations in the ways in which utility is determined.
• Valid tests are not always useful tests
  o Example: test 1 vs. test 2
    - Test 1: a (valid) test for depression with only around 60 items
    - Test 2: a test for depression with around 300 items
    - If they measure the same thing, which is more useful?
    - Test 1 would be more useful because it is practical, efficient, and shorter; if the statistics of the two tests are not far from each other, then opt for test 1

FACTORS AFFECTING UTILITY

COSTS
• One of the most basic elements of utility analysis is the financial cost associated with a test.
• In judging test utility, factors variously referred to as economic, financial, or budget-related in nature must certainly be taken into account.
• Cost refers to disadvantages, losses, or expenses in both economic and noneconomic terms.
• The term costs can be interpreted in the traditional, economic sense:
  o Relating to expenditures associated with testing or not testing
• If testing is to be conducted, then it may be necessary to allocate funds to purchase:
  o A particular test
  o A supply of blank test protocols
  o Computerized test processing, scoring, and interpretation from the test publisher or some independent service
• Ex. Test protocols: PHP 30,000-50,000 (limited original answer sheets)
  o Original tests are expensive, so choose a test that is cheaper but would not reduce its utility
  o Always go back to the referral questions
    - How can you answer the referral question if you don't have the appropriate test to measure it?
    - Ex. You screen for ADHD, yet you don't have the test for ADHD
• Cost of human life and safety (non-monetary)
  o Noneconomic costs of drastic cost cutting by an airline might come in the form of harm or injury to airline passengers and crew as a result of incompetent pilots flying the plane and incompetent ground crews servicing the planes.
  o Ex. Your psychological test might be causing harm to the people you are administering it to
    - Stimuli or questions that are triggering to them
    - If the questions are very harmful (people would have negative psychological effects)

BENEFITS
• The benefits of testing should be weighed against the costs of administering, scoring, and interpreting the test
• Refers to profits, gains, or advantages
  o Utilizing a test for screening applicants: a test is useful if it can properly identify applicants fitted for the job description
  o Better to invest in a test that is good
• Potential benefits: an increase in the quantity and quality of workers' performance, and a decrease in the time needed to train workers because the test was able to identify skilled workers
• Non-economic benefits: a better working environment and improved morale
UTILITY ANALYSIS
• A family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment.
  o Utility analyses address the question, "Which test gives us the most bang for the buck?"
  o The endpoint of a utility analysis is an educated decision as to which of several alternative courses of action is most optimal (in terms of costs and benefits).
    - Sometimes, availing of the most expensive test might be the most useful and cost-effective option for you
    - It may have high validity and good predictive value

COMPENSATORY MODEL OF SELECTION
• An assumption is made that high scores on one attribute can "balance out" or compensate for low scores on another attribute.

HOW IS UTILITY ANALYSIS CONDUCTED

EXPECTANCY DATA
• The likelihood that a test taker will score within some interval of scores on a criterion measure.
  o If you have a particular score, what is the likelihood that you will succeed?
  o Estimates whether you would be successful or not
  o Ex. School setting
    - Use an expectancy table to determine the likelihood that a student who obtained a particular score on an aptitude test would succeed in regular classes compared to special education
  o Ex. Grade in psych assessment
    - Your grade is 75; what would be the likelihood that when you take Psych Assessment 2, you will be able to perform well?

TAYLOR-RUSSELL TABLES VS NAYLOR-SHINE TABLES
• Taylor-Russell tables
  o Provide an estimate of the percentage of employees hired by the use of a particular test who will be successful at their jobs.
  o Provide an estimate of the extent to which inclusion of a particular test in the selection system will improve selection.
  o Use different combinations of three variables:
    1. The test's validity
    2. The selection ratio used
    3. The base rate
  o The value assigned for the test's validity is the computed validity coefficient.
  o The selection ratio is a numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired.
    - For instance, if there are 50 positions and 100 applicants, then the selection ratio is 50/100, or .50.
    - (How many positions are available vs. how many people are applying for them)
  o The base rate refers to the percentage of people hired under the existing system for a particular position.
    - If, for example, a firm employs 25 computer programmers and 20 are considered successful, the base rate would be .80.
    - (How many did you hire using the existing system vs. how many of those hires were considered successful?)
  o Output: the difference in the percentage of successful employees between the selected group and the original group.
  o Employees are separated into two groups (successful vs. unsuccessful).
• Naylor-Shine tables
  o Help obtain the difference between the means of the selected and unselected groups to derive an index of what the test (or some other tool of assessment) is adding to already established procedures.
  o Give the likely average increase in criterion performance as a result of using a particular test or intervention; also provide the selection ratio needed to achieve a particular increase in criterion performance.
  o Offer an easy computation of the increase in average criterion score that can be achieved with any test, given that the test's validity and the cut-off point can be specified.
  o Output: the difference in average criterion score between the selected group and the original group.
  o Do not require separating employees into successful vs. unsuccessful groups; more general in applicability.
• Both Taylor-Russell and Naylor-Shine tables
  o The validity coefficient comes from concurrent validation procedures.
    - The degree to which scores on the measure are related to scores on other, already established measures
  o Both have certain limitations:
    - The relationship between the predictor and the criterion is assumed to be linear
    - The validity coefficient used is obtained by concurrent validation procedures
• Many other variables may play a role in selection decisions, including applicants' minority status, general physical or mental health, or drug use.
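A Monte-Carlo sketch of the logic the Taylor-Russell tables capture: given a validity coefficient, a selection ratio, and a base rate, selecting by the test raises the success rate among those hired. The settings below are illustrative assumptions, not values from the actual tables.

```python
import numpy as np

# Simulate applicants whose (standardized) test and criterion scores
# correlate at the validity coefficient, then compare the base rate to
# the success rate within the group the test selects.

rng = np.random.default_rng(0)
validity, selection_ratio, base_rate = 0.50, 0.20, 0.60
n = 100_000

cov = [[1.0, validity], [validity, 1.0]]
test, criterion = rng.multivariate_normal([0, 0], cov, size=n).T

success = criterion > np.quantile(criterion, 1 - base_rate)  # top 60% succeed
hired   = test > np.quantile(test, 1 - selection_ratio)      # top 20% hired

print(f"base rate (everyone): {success.mean():.2f}")
print(f"success rate among those hired by the test: {success[hired].mean():.2f}")
# With r = .50 and a strict selection ratio, the hired group's success
# rate climbs well above the .60 base rate - the gain the tables tabulate.
```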
BROGDEN-CRONBACH-GLESER FORMULA (BCG)
• Used to calculate the dollar/peso amount of a utility gain resulting from the use of a particular selection instrument under specified conditions.
• Utility gain refers to an estimate of the benefit (monetary or otherwise) of using a particular test or selection method.
  o How much would a company profit (benefit) if it invested in a particular test?
• A variation of the BCG formula computes the productivity gain.
  o Productivity gain refers to an estimated increase in work output.

DECISION THEORY AND UTILITY
Cronbach and Gleser (1965) presented:
• A classification of decision problems.
• Various selection strategies ranging from single-stage processes to sequential analyses.
• A quantitative analysis of the relationship between test utility, the selection ratio, the cost of the testing program, and the expected value of the outcome.
• A recommendation that in some instances job requirements be tailored to the applicant's ability instead of the other way around (adaptive treatment).
PRACTICAL CONSIDERATIONS
Things to consider when the base rate is too low or too high:
• The pool of job applicants: some utility models are based on the assumption that there will be a ready supply of viable applicants from which to choose and fill positions.
  o However, some jobs require such expertise or sacrifice that the pool of qualified candidates may be very small.
    - If the job is very technical or highly specialized (e.g., doctor)
  o The economic climate also affects the size of the pool.
    - Many people losing their jobs = many job seekers
    - A high employment rate = harder to find workers
  o The top performers on a selection test may not accept a job offer.
• The complexity of the job: the same kinds of utility models are used for a variety of positions, yet the more complex the job, the bigger the difference between people who perform well and people who perform poorly.
Cut-off score
• A reference point derived as a result of a judgment
• Used to divide a set of data into two or more classifications, as a basis for some actions to be taken or some inferences to be made
  1. Relative cut score
  2. Fixed cut score

RELATIVE CUT SCORE VS FIXED CUT SCORE
• Relative cut score: a reference point that is set based on norm-related considerations rather than on the relationship of test scores to a criterion.
  o Also known as a norm-referenced cut score.
  o Ex. The top 5% get a grade of 1.00 regardless of their actual score on the test (even if it is only 50/100).
  o Ex. The top 10% of all scores on each test receive the grade of A. In other words, the cut score in use depends on the performance of the class as a whole.
• Fixed cut score: the minimum level of proficiency required to be included in a particular classification.
  o May also be referred to as an absolute cut score.
  o Ex. The score achieved on the road test for a driver's license. Here the performance of other would-be drivers has no bearing on whether an individual test taker is classified as "licensed" or "not licensed."

MULTIPLE CUT SCORE VS MULTIPLE HURDLE
• Multiple cut scores: the use of two or more cut scores with reference to one predictor for the purpose of categorizing test takers.
  o Ex. Your instructor may have multiple cut scores in place every time an examination is administered, and each class member will be assigned to one category (e.g., A, B, C, D, or F) on the basis of scores on that examination.
  o That is, meeting or exceeding one cut score will result in an A for the examination, meeting or exceeding another cut score will result in a B, and so forth.
  o In college: 1.00, 3.00, 5.00
• Multiple hurdle: a multistage decision-making process wherein a cut score on one test is necessary in order to advance to the next stage of evaluation in a selection process.
  o Ex. Before you take Psych Assessment 2, you first have to get a passing grade in Psych Assessment 1 (you have to pass the prerequisite first).
  o Ex. A written application stage in which individuals who turn in incomplete applications are eliminated from further consideration. This is followed by what might be termed an additional-materials stage, in which individuals with low test scores, low GPAs, or poor letters of recommendation are eliminated. The final stage in the process might be a personal interview stage.

METHODS FOR SETTING CUT SCORES
1. Classical Test Score Theory
  1.1. Angoff Method
  1.2. Known Groups Method
2. IRT-Based Methods
  2.1. Item-mapping method
  2.2. Bookmark method
3. Method of Predictive Yield
4. Discriminant Analysis

CLASSICAL TEST SCORE THEORY

ANGOFF METHOD
• The judgments of the experts are averaged to yield cut scores for the test.
  o Persons who score at or above the cut score are considered high enough in the ability to be hired, or to be sufficiently high in the trait, attribute, or ability of interest.
• Can be used for personnel selection based on traits, attributes, and abilities.
  o As applied for purposes relating to the determination of whether test takers possess a particular trait, attribute, or ability, an expert panel makes judgments concerning the way a person with that trait, attribute, or ability would respond to test items.
• Problems arise if there is disagreement between experts.
  o There is low inter-rater reliability and major disagreement regarding how certain populations of test takers should respond to items.
    - Ex. In making a happiness test, one expert would include the item "smiling," while another disagrees because smiling is not necessarily happiness.
  o In such scenarios, it may be time for "Plan B," a strategy for setting cut scores that is driven more by data and less by subjective judgments.

KNOWN GROUPS METHOD
• Also called the Method of Contrasting Groups.
• Entails collection of data on the predictor of interest from groups known to possess, and not to possess, a trait, attribute, or ability of interest.
• Based on the analysis of data, a cut score is set on the test that best discriminates the groups' test performance.
• There is no standard set of guidelines for choosing contrasting groups.
  o An example of the known groups method for a math placement course:
    - The cut score that would be selected is the score at the point of least difference between the groups (in this case, a score of 6).
    - If you set the cut score at 8, you might identify those who scored 7 as low performing even if they already passed.
    - If you set the cut score at 4, you might interpret those who scored 5, 6, or 7 (even if they failed the course) as relatively good at math.

IRT-BASED METHODS
• In an IRT framework, each item is associated with a particular level of difficulty.
• In order to "pass" the test, the test taker must answer items that are deemed to be above some minimum level of difficulty, which is determined by experts and serves as the cut score.
• Make use of the item-mapping method and the bookmark method.

ITEM-MAPPING METHOD VS BOOKMARK METHOD
• Item-mapping method
  o Entails arrangement of items in a histogram, with each column containing items deemed to be of equivalent value.
  o Trained judges are provided with sample items from each column and are asked whether a minimally competent individual would answer those items correctly.
  o The difficulty level at that point is set as the cut score.
  o Ex. Should a low-performing student be able to answer a question that can only be answered by high-performing students? If so, the item might not be good for the low performer.
• Bookmark method
  o Involves training of experts regarding the minimal knowledge, skills, and/or abilities test takers should possess in order to pass.
  o Experts are given a book of items arranged in ascending order of difficulty.
  o Experts place a bookmark between the two items deemed to separate test takers who have acquired the minimal knowledge, skills, etc., from those who have not.
  o Ex. A test taker who finished only grade school: the bookmark (highest score) for grade schoolers is set at 45, with a maximum grade of 75 for grade school; high schoolers should score higher than grade schoolers (e.g., 80).

OTHER METHODS
• R. L. Thorndike (1949) proposed a norm-referenced method called the method of predictive yield.
  o The method of predictive yield took into account the number of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores.
• Discriminant analysis
  o Also referred to as discriminant function analysis (DFA).
  o A family of statistical techniques used to shed light on the relationship between identified variables (such as scores on a battery of tests) and two (or more) naturally occurring groups (such as persons judged to be successful at a job and persons judged unsuccessful at a job).

ITEM ANALYSIS (not included in the exam; for test development)
1. Index of Item Difficulty
2. Index of Item Discrimination
3. Index of Item Reliability
4. Index of Item Validity
5. Spiral Omnibus Format

Index of Item Difficulty
• Also called an item-endorsement index.
• In cognitive tests, a statistic indicating how many test takers responded correctly to an item.
• In personality tests, a statistic indicating how many test takers responded to an item in a particular direction.
• The closer to 0, the more difficult the item.

Index of Item Discrimination
• A statistic designed to indicate how adequately a test item discriminates between high and low scorers.
• Ex. Should a low-performing student be able to answer a question that can only be answered by high-performing students?
  o If so, the item might not be good for discriminating the low performer.

Index of Item Reliability
• Provides an indication of the internal consistency of a test.
• Is equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score.

Index of Item Validity
• A statistic indicating the degree to which a test measures what it purports to measure.
• The higher the item-validity index, the greater the test's criterion-related validity.
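To connect the difficulty and discrimination indices above to actual computation, here is a small sketch with a hypothetical 1/0 response matrix; the extreme-groups d statistic used here is one common way to index discrimination, assumed for illustration.

```python
# Item-analysis sketch: item difficulty (p = proportion correct) and a
# simple extreme-groups discrimination index (p_upper - p_lower).
# Hypothetical data: 6 examinees x 3 items, rows sorted by total score.

responses = [  # 1 = correct, 0 = incorrect
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

n = len(responses)
upper, lower = responses[: n // 2], responses[n // 2 :]  # top vs bottom half

for item in range(len(responses[0])):
    p = sum(row[item] for row in responses) / n  # difficulty: lower = harder
    d = (sum(r[item] for r in upper) - sum(r[item] for r in lower)) / (n // 2)
    print(f"item {item + 1}: difficulty p = {p:.2f}, discrimination d = {d:.2f}")
```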