PA Notes -1
PRINCIPLES OF ASSESSMENT
The principles of assessment were given by Shertzer and Linden, who state that
assessment should be Holistic, Ongoing, Balanced, Accurate and Confidential.
· Assessment should be Holistic: This principle involves multiple methods in collecting
information. The use of a combination of assessment techniques increases the likelihood of
applying positive intervention and consequently the achievement of the desired goals. The
principle of holistic assessment follows a systematic process to arrive at an understanding of
the individual.
· Assessment should be Accurate: The assessment device used should be accurate and the
counselor should have the skill for interpreting the data. Counselors must keep in mind the
possibility of errors, as all tools may not be 100% accurate; so they must try to minimize the
errors by using standardized procedures.
· Self Understanding: The basic purpose of carrying out an assessment is to help clients gain
insight and understand themselves better, including what they can and cannot do and their
strengths and weaknesses.
· To Diagnose the Student’s Problem: Diagnosing the client’s problem is another purpose that
assessment data fulfils. By using the data properly, we can identify causal factors. It also
helps to identify various aspects such as family background, physical health, academic
performance, etc.
· To Help in Career Planning and Education: Assessment done with the help of various
psychological tools guides the students in making choices for their career and selection of
subjects/courses.
· To Help Predict the Future Performance: Counsellors use assessment data to estimate an
individual’s attitude, ability, personality, etc., which have implications for success and
adjustment and so help to predict the individual’s future performance. Moreover, the
counsellor can also motivate the client in a direction where he/she is more likely to succeed.
· To Evaluate the Outcome of Counselling: Assessment is done prior to counselling as well
as at the end of it. This gives the counsellor valuable insights for further intervention and to
achieve the expected outcome.
Purpose: Both tests and assessments aim to gather information about an individual’s abilities,
characteristics, or performance. They are used to evaluate specific competencies or traits.
Standardization: Many tests and assessments are standardized, meaning they are administered
and scored in a consistent manner to ensure comparability across different individuals and
groups.
Measurement: Both involve the measurement of certain variables, such as cognitive abilities,
personality traits, skills, knowledge, or behaviors. They provide a way to quantify these
variables for analysis.
Feedback: The results from both tests and assessments are often used to provide feedback to
individuals. This feedback can be used for personal development, educational guidance, or
organizational decisions.
Objective and Subjective Components: Both can include objective components (such as
multiple-choice questions) and subjective components (such as essay questions or
observational ratings).
Data Collection: Both involve the collection of data that can be analyzed and interpreted.
This data can be qualitative or quantitative, depending on the nature of the test or assessment.
TYPES OF ASSESSMENTS:
1. INTERVIEW:
The interview is probably the most commonly used assessment tool. Professionals use
the interview method to gather information about clients and to clarify the results of other
assessments. Assessors must be appropriately trained, because their skills and experience
determine the quality of the interview. Below are several aspects which must be kept in mind:
There can be two types of interview: Structured and Unstructured. That is, the interview
can range from being totally unplanned i.e., unstructured to carefully designed i.e.,
completely structured.
Structured Interview:
Description: Uses a predefined set of questions that are asked in the same order for all
candidates.
Advantages: Ensures consistency, easier to compare responses, and can reduce
interviewer bias.
Disadvantages: May be inflexible and not allow for exploration of unexpected topics.
Unstructured Interview:
Description: More open-ended and flexible, with questions that evolve based on the
interviewee’s responses.
Advantages: Allows for a deeper understanding of the interviewee and can adapt to
the flow of the conversation.
Disadvantages: Can be time-consuming and difficult to compare responses between
interviewees.
Semi-Structured Interview:
Situational Interview:
Description: Presents hypothetical scenarios to the interviewee and asks how they
would respond.
Advantages: Can assess problem-solving and decision-making skills.
Disadvantages: Responses may not always reflect actual behavior in real situations.
Group Interview:
2. OBSERVATION METHOD:
Definition
The observation method involves systematically watching, listening to, and recording
behaviours and events as they naturally occur without manipulating or interfering with the
environment or subjects being observed.
Naturalistic Observation:
Controlled Observation:
Observing behaviour in a structured setting where some variables are controlled by
the researcher.
Example: Observing reactions in a laboratory setting where specific stimuli are
presented.
Participant Observation:
The observer becomes actively involved in the environment they are observing.
Example: A researcher joining a community group to understand its dynamics.
Non-Participant Observation:
The observer remains detached and does not interact with the subjects.
Example: Observing classroom behavior from the back of the room without
interacting with the students.
Scales: A scale is a set of numbers (or other symbols) whose properties model empirical
properties of the objects to which the numbers are assigned.
Discrete Scales: A discrete scale has a countable sample space, such as categorical variables like year in high
school, which includes distinct categories like freshman, sophomore, junior, and senior.
Quantitative variables can also be discrete, exemplified by the number of previous
hospitalizations a patient has had, which can only be whole numbers like 0, 1, 2, and so on. In
discrete scales, values between these distinct numbers are not allowed, meaning a patient
cannot have 2.5 hospitalizations.
Continuous Scales: These can take any real number within a range, including fractions and
decimals (e.g., weight, height, temperature). In practice, measurements on continuous scales
are often rounded to match the precision of the measuring instrument, ensuring accuracy
without implying false precision.
Nominal Level:
Definition: Data are categorized into distinct groups that do not have a meaningful
order or ranking.
Characteristics: Labels or names used to identify categories, no mathematical
operations can be performed.
Examples:
o Gender (male, female)
o Marital status (single, married, divorced)
o Blood type (A, B, AB, O)
Ordinal Level:
Definition: Data are categorized into groups that can be ranked or ordered; however,
the differences between ranks are not necessarily equal.
Characteristics: Rank order matters, but precise differences between ranks are not
known.
Examples:
o Education level (high school, bachelor’s, master’s, doctorate)
o Socioeconomic status (low, middle, high)
o Customer satisfaction (very unsatisfied, unsatisfied, neutral, satisfied, very
satisfied)
Interval Level:
Definition: Data are ordered with equal intervals between values, but there is no true
zero point (zero does not indicate the absence of the measured attribute).
Characteristics: Differences between values are meaningful, but ratios are not.
Examples:
o Temperature in Celsius or Fahrenheit (e.g., 20°C, 30°C)
o IQ scores
o Calendar years (e.g., 2000, 2020)
Ratio Level:
Definition: Data have all the properties of interval data, and there is a true zero point,
meaning zero indicates the absence of the measured attribute.
Characteristics: Differences and ratios between values are meaningful.
Examples:
o Height (e.g., 150 cm, 175 cm)
o Weight (e.g., 60 kg, 80 kg)
o Duration (e.g., time taken to complete a task)
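The difference between the interval and ratio levels can be checked with simple arithmetic. The sketch below is illustrative only: the function name `celsius_to_kelvin` and the sample values are assumptions introduced here, not part of the notes. It shows that differences between Celsius readings (interval level) stay the same when the scale is shifted, whereas a ratio such as "40°C is twice 20°C" does not, because the Celsius zero is arbitrary. Height (ratio level) has a true zero, so its ratios are meaningful.

```python
# Illustrative sketch: interval vs. ratio level of measurement.
# All names and sample values are hypothetical.

def celsius_to_kelvin(celsius):
    """Shift a Celsius reading (interval level) onto the kelvin scale (true zero)."""
    return celsius + 273.15

low_c, high_c = 20.0, 40.0

# Differences are meaningful on an interval scale: shifting the zero leaves them unchanged.
diff_c = high_c - low_c
diff_k = celsius_to_kelvin(high_c) - celsius_to_kelvin(low_c)

# Ratios are not meaningful on an interval scale: the apparent "twice as hot"
# in Celsius disappears once the same temperatures are expressed in kelvin.
ratio_c = high_c / low_c
ratio_k = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

# Height is ratio-level data: zero means "no height", so 150 cm really is twice 75 cm.
height_ratio = 150.0 / 75.0

print("difference:", diff_c, "vs", round(diff_k, 2))
print("ratio:", ratio_c, "vs", round(ratio_k, 3))
print("height ratio:", height_ratio)
```

The same reasoning explains why an IQ of 120 cannot be described as "twice as intelligent" as an IQ of 60.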
1. Demographic Data: Patient's personal information like name, age, address, etc.
2. Reason for Referral: The purpose for which the patient was referred for assessment.
3. Tests Administered: List of tests conducted and their dates.
4. Findings: Detailed observations and test results, including behavioral observations
and extraneous variables.
5. Recommendations: Suggestions for addressing the issues identified during
assessment.
6. Summary: A brief overview summarizing the reason for referral, findings, and
Key Principles of Report Writing
Constructive Feedback Skills: Effective feedback involves ensuring that the client
understands the information accurately.
__________________________________________________________________________
UNIT 2: Psychological Testing: Definition of a test, types of test; Characteristics of a Good
Test; Applications of psychological tests in various contexts (educational, counselling and
guidance, clinical, organizational etc.)
Psychological Testing:
Psychological testing refers to all the possible uses, applications, and underlying concepts of
psychological and educational tests. The main use of these tests, though, is to evaluate
individual differences or variations among individuals. Such tests measure individual
differences in ability and personality and assume that the differences shown on the test reflect
actual differences among individuals. For instance, individuals who score high on an IQ test
are assumed to have a higher degree of intelligence than those who obtain low scores. Thus,
the most important purpose of testing is to differentiate among those taking the tests.
Definition of a Test
A test is a standardized procedure for sampling behaviour and describing it with categories or
scores. A psychological test is essentially an objective and standardized measure of a sample
of behaviour. Psychological tests are like tests in any other science, insofar as observations
are made on a small but carefully chosen sample of an individual's behaviour.
A psychological test or educational test is a set of items that are designed to measure
characteristics of human beings that pertain to behaviour. There are many types of behaviour.
Overt behaviour is an individual’s observable activity. Some psychological tests attempt to
measure the extent to which someone might engage in or “emit” a particular overt behaviour.
Other tests measure how much a person has previously engaged in some overt behaviour.
Behaviour can also be covert—that is, it takes place within an individual and cannot be
directly observed.
Types of test:
There are various types of psychological tests. These are discussed as follows:
Individual test: A test that is administered to one individual at a time. For example, Wechsler
Adult Intelligence Scale (WAIS), Stanford-Binet Intelligence Scale (SB), Bhatia battery.
Group test: Such tests can be administered to a group of individuals at the same time. For
example, NEO PI and Minnesota Multiphasic Personality Inventory.
Speed test: A speed test consists of items of the same difficulty level, and a fixed time
limit is given to complete the test.
Power test: A power test consists of items that gradually increase in difficulty, and there
is no time limit for completing the test.
Verbal test: A paper-and-pencil test in which the items are presented in language can be
termed a verbal test. For example: 16 PF and Eysenck’s Personality Inventory.
Non-verbal test: In this type of test certain figures and symbols are used. For example,
Raven’s Progressive Matrices. Here, language may be used only to provide instructions
to the individual taking the test.
Performance test: In a performance test, the individual taking the test has to perform certain
tasks. For example: Alexander’s Pass-along Test and Kohs Block Design Test.
Objective tests: In objective tests, the individual chooses from a fixed set of response
options, and the scoring key is decided in advance. This avoids any subjectivity on the part
of the scorer. The responses could be true/false, multiple choice, or a rating scale such as a
Likert scale or Thurstone’s scale. For example: NEO PI.
Projective Tests: These are subjective in nature. Here, the test taker may be asked to respond
to certain semi-structured or unstructured stimuli. The responses are then to be interpreted by
the administrator, where subjectivity may creep in. Examples of projective tests are
Rorschach Inkblot test, Somatic Inkblot Series, Sentence Completion Test, Thematic
Apperception Test and Children’s Apperception Test.
Personality tests: These are used to measure the personality of individuals. Larsen and Buss
(2018) defined personality as a collection of psychological traits and mechanisms that are
stable and organised and that influence an individual’s interactions and also affect how
he/she modifies his/her physical, social and psychological environment. It
can also be explained as differences amongst individuals with regard to their patterns of
thinking, feeling and behaving (American Psychological Association, 2019).
Personality tests are used widely in varied setups including clinical, educational, counselling,
industrial and organisational setup and so on. Examples of personality test are Eysenck’s
Personality Inventory, Thematic Apperception Test (TAT), Somatic Inkblot Series (SIS).
Aptitude tests: These are tests that measure the potential/abilities possessed by an individual
in a certain area. They find application in schools and also in industrial setups for
selection purposes. They indicate whether a person will be able to perform effectively if he/she
is given training in that area. For instance, a person with aptitude for dance or music will do
well in the area if given training. Examples of aptitude tests are Differential Aptitude Test,
Seashore Musical Aptitude Test.
Attitude tests: These tests measure the attitude of an individual towards events, other
individuals, objects and so on. Often in attitude tests, Thurstone and Likert scales are used.
These could measure attitude towards women, health and so on.
Achievement tests: There are also tests that measure the achievement of individuals. They
mainly test an individual’s learning in a certain academic area. Such tests are often used in
educational setups. Academic achievement test and Mathematics Achievement Test are
examples of achievement tests.
Creativity tests: assess a subject’s ability to produce new ideas, insights, or artistic creations
that are accepted as being of social, aesthetic, or scientific value. Thus, measures of creativity
emphasize novelty and originality in the solution of fuzzy problems or the production of
artistic works.
1. Objectivity: The test should be free from subjective judgement regarding the ability, skill,
knowledge, trait or potentiality to be measured and evaluated.
2. Reliability: This refers to the extent to which the obtained results are consistent. When
the test is administered to the same sample more than once with a reasonable gap of time, a
reliable test will yield the same or very similar scores. It means the test is trustworthy.
There are many methods of testing the reliability of a test.
3. Validity: It refers to the extent to which the test measures what it intends to measure. For
example, when an intelligence test is developed to assess the level of intelligence, it should
assess the intelligence of the person, not other factors. Validity tells us whether the test
fulfils the objective of its development. There are many methods to assess the validity of a test.
4. Norms: Norms refer to the average performance of a representative sample on a given test.
They give a picture of the average standard of a particular sample in a particular aspect. Norms
are the standard scores developed by the test developer. Future users of the test
can compare their scores with the norms to know the level of their sample.
5. Practicability: The test must be practicable in terms of the time required for completion,
the length, the number of items or questions, scoring, etc. The test should not be too lengthy
or too difficult to answer or to score.
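The reliability idea in point 2 (same sample, two administrations, similar scores) is commonly quantified as a test-retest correlation. The sketch below is one possible illustration, assuming the Pearson correlation coefficient is used as the reliability estimate; the function name `pearson_r` and the two score lists are made up for the example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2 scores)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sxy / sqrt(sxx * syy)

# Hypothetical scores of the same five people on two administrations of a test.
first_administration = [12, 15, 9, 20, 17]
second_administration = [13, 14, 10, 19, 18]

r = pearson_r(first_administration, second_administration)
print("test-retest correlation:", round(r, 2))
```

A coefficient close to 1 indicates that people keep roughly the same relative standing across the two administrations; other reliability methods (for example split-half) use different pairs of scores but a similar calculation.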
Selection or placement
Diagnosis
Accountability evaluations
Judging progress and following trends
Self-discovery
Educational settings for children.
Types of Decision
Instructional decisions: Teachers use the test results to determine the pace of their courses
(e.g., should they slow down, speed up, continue their teaching pace, or skip a topic
altogether?).
Grading decisions: Teachers use the test results to assign grades to students (e.g., teachers
administer quizzes, midterm exams, and final exams).
Diagnostic decisions: Teachers use the results of tests to understand students' strengths and
difficulties (e.g., a test may reveal a student can write a complete sentence but only a simple
sentence).
Selection decisions: Specialists such as school psychologists or administrators use the results
of tests to make group-, program-, or institutional-level admissions decisions (e.g., a test may
reveal that a student has very high intelligence and would benefit from a gifted program).
Placement decision: Specialists or administrators use tests to place individuals into the proper
level of a course (e.g., colleges and universities may use test scores to determine the math
course in which individuals should be placed).
Program and curriculum decision: Specialists or administrators use test scores to determine
the success of a program or curriculum (e.g., are students who go through the new curriculum
learning more than students who go through the old curriculum?) and to determine whether a
program or curriculum should be implemented or dropped.
Teachers, at the primary, secondary, and college levels, must make a variety of decisions in
the classroom. Teachers must decide whether students are ready to learn new material, and if
so, they must determine how much of the new material students already know. Teachers must
decide what information students are learning and what information they are having difficulty
learning. Teachers must also decide what grades students have earned. Teachers often use
tests, combined with other assessment methods, to help them make these and other types of
decisions.
Teachers make some of these decisions at the beginning of a course or unit of instruction and
other decisions during or at the end of instruction. Teachers often use test results to answer
questions whose answers allow them to both evaluate and improve
student learning.
At the beginning of a course or before a new unit of instruction, teachers will often use
psychological tests as placement assessments, which are used to determine the extent to
which students possess the knowledge, skills, and abilities necessary to understand new
material and how much of the material to be taught students already know.
Periodically throughout the school year, teachers may administer tests as formative
assessments. Formative assessments help teachers determine what information students are
and are not learning during the instructional process so that the teachers can identify where
students need help and decide whether it is appropriate to move to the next unit of instruction.
Teachers do not use these test scores to assign grades; instead, teachers use formative
assessments to make immediate adjustments to their own curricula and teaching methods.
That is, teachers can use the results of formative assessments to adjust the pace of their
teaching and the material they are covering.
At the end of the year, or at the end of a unit of instruction, teachers and administrators at the
state and district levels typically use tests as summative assessments to help them determine
what students do and do not know (to gauge student learning) and to assign earned grades.
For example, a teacher may administer a final exam to students in an introductory psychology
class to determine whether students learned the material the teacher intended them to learn
and to determine what grades students have earned.
Sometimes students confuse formative and summative assessment. Teachers may use the
same tests as both formative and summative assessments. However, when used as a formative
assessment, the test directs future instruction by telling the teacher what information
students have already mastered and where the teacher should spend
his or her teaching time. When used as a summative assessment, the test is often used as a
final evaluation, for example, to determine what grade to assign to students.
Psychological tests are essential in clinical and counselling settings for various purposes:
Initial Assessment: They help determine the nature and severity of a client's problems at the
start of treatment.
Therapeutic Insights: They provide clients with insights that can promote change and
understanding.
1. Structured Interviews
Clinicians use interviews to diagnose problems and plan treatment. Typically, these are
unstructured, focusing on gathering information. However, semi-structured and structured
interviews are more formal.
Examples include:
Behaviour rating scales help clinicians develop and revise treatment plans, and clarify
diagnoses for children. Informants, such as parents or teachers, rate the child's specific
behaviours, and some scales include self-report versions.
A key example is the Child Behaviour Checklist (CBCL), used widely in psychopathology to
gather evidence of validity. Adults rate over 100 statements about a child's behaviour, such as
aggression and social problems, on a scale from "not true" to "very or often true." The items
are grouped into clusters, and scores are reported as T scores and percentiles.
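T scores and percentiles, as mentioned above, are both ways of locating a raw score relative to a norm group. The sketch below shows the standard linear conversions (z = (raw - mean) / SD, T = 50 + 10z). The function names, the norm mean and the norm SD are hypothetical, and the percentile assumes normally distributed scores; published instruments such as the CBCL supply their own norm tables, which should be used in practice.

```python
from statistics import NormalDist

def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the norm-group mean, in SD units."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (raw - norm_mean) / norm_sd

def t_score(z):
    """Linear T score: mean 50, standard deviation 10."""
    return 50 + 10 * z

def percentile_rank(z):
    """Percentage of the norm group scoring below z, ASSUMING a normal distribution."""
    return 100 * NormalDist().cdf(z)

# Hypothetical norm group: mean 20, SD 8. A raw score of 28 lies 1 SD above the mean.
z = z_score(raw=28, norm_mean=20, norm_sd=8)
print("z =", z)                                   # 1.0
print("T =", t_score(z))                          # 60.0
print("percentile =", round(percentile_rank(z), 1))  # 84.1
```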
Symptom checklists and self-report tests are used by clinicians to assess clients' feelings,
thoughts, and behaviors related to psychiatric disorders. These tools are essential for initial
evaluations, treatment planning, and monitoring progress.
The Symptom Checklist 90-Revised (SCL-90-R) covers a broad range of mental health
conditions with 90 items rated on a five-point scale, providing a comprehensive symptom
overview.
The Beck Depression Inventory II (BDI-II) focuses specifically on depression with 21 items
rated on a four-point scale, yielding a total score indicating the level of depression. Both tests
are well-researched, reliable, and valid for diagnosing and tracking treatment progress.
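Scoring a checklist of this kind is a simple summation. The sketch below illustrates the arithmetic for a 21-item inventory rated on a four-point scale, assuming each item is coded 0 to 3 so that totals range from 0 to 63. The function name `total_score` and the example responses are invented for illustration, and no interpretive cut-offs are included; those come from the test manual.

```python
def total_score(responses, n_items=21, min_rating=0, max_rating=3):
    """Sum item ratings after checking that the response set is complete and in range."""
    if len(responses) != n_items:
        raise ValueError(f"expected {n_items} responses, got {len(responses)}")
    for rating in responses:
        if not isinstance(rating, int) or not (min_rating <= rating <= max_rating):
            raise ValueError(f"rating {rating!r} is outside {min_rating}-{max_rating}")
    return sum(responses)

# Hypothetical responses of one client to the 21 items.
example_responses = [0, 1, 2, 1, 0, 0, 3, 1, 1, 0, 2, 1, 0, 0, 1, 2, 1, 0, 0, 1, 1]

print("total score:", total_score(example_responses))  # 18
```

Repeating the same calculation at intake and at later sessions gives the series of totals used to monitor treatment progress.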
PAI: Published in 1991, it includes over 300 items rated on a four-point scale,
assessing personality variables and clinical concerns with validity scales to detect
inconsistent responses. It's useful for diagnosis and treatment planning but less
comprehensive in assessing personality traits than the MMPI-2.
MCMI-III: Revised in 1994, it focuses on personality disorders and related
symptoms, with 175 true/false questions providing scores on clinical scales for
various personality disorders and clinical syndromes. It has strong research support
for its reliability and validity in differentiating among diagnostic groups.
Performance-based or projective personality tests, such as the Rorschach inkblot test and the
Thematic Apperception Test (TAT), are controversial due to vague scoring guidelines and
questionable reliability and validity. Despite their standardized administration by qualified
professionals, robust research supporting their effectiveness is lacking. The Exner
Comprehensive System, first published in the 1970s, attempted to improve the Rorschach test by
synthesizing scoring systems and collecting normative data. This system codes and tabulates
responses to produce scores reflecting aspects of personality functioning, including
self-perception, coping strategies, thought processes, and emotional responsiveness.
6. Neuropsychological Testing
Neuropsychological testing involves assessing the relationship between brain functioning and
behavior, often conducted by specialized clinicians called neuropsychologists. These
professionals evaluate individuals with conditions like epilepsy, traumatic brain injury,
learning problems, or ADHD, using a range of specialized tests to measure various aspects of
brain function.
Specialized forensic testing, on the other hand, is performed by forensic psychologists, who
work at the intersection of mental health or neuropsychological issues and the law. They
evaluate issues such as competency to stand trial, child custody disputes, treatment
recommendations for juvenile offenders, and assessing damages in legal cases.
Pre-Employment Testing
Traditional Interviews: These interviews are less structured, allowing the interviewer
to explore different areas with each candidate.
Structured Interviews: These involve standardized questions asked of all candidates,
providing more consistent data for comparison.
Structured Behavioral Interviews: These interviews focus on past behaviors rather than
attitudes or opinions. They ask candidates to provide specific examples of past behaviors
related to job tasks, such as planning a project or achieving a goal. These interviews often use
behaviorally anchored rating scales to assess the quality of the candidate's responses.
Performance Tests: These tests require candidates to perform job-related tasks, ranging
from large-scale simulations (assessment centers) to smaller tasks (work samples).
High-Fidelity Tests: These replicate job settings realistically, using the same
equipment and tasks as the actual job, such as flight simulators for pilots. They
provide realistic practice while eliminating the risk of poor performance.
Low-Fidelity Tests: These simulate tasks using written, verbal, or visual descriptions,
with responses typically in the form of open-ended or multiple-choice questions.
Some interview questions, particularly behavioral ones, can serve as low-fidelity
performance tests.
A customer has called you to let you know that a competitor is offering a promotional rebate
for a top-selling product or service that you also sell. You have the option of offering a
comparable program, but it will impact both your profit and compensation. Which of the
following actions would be most effective to take?
a. Offer your rebate program only to the customer who gave you the information regarding
the competitor's program.
b. Offer a matching rebate program plus additional incentives to all customers who currently
buy this product to ensure that you do not lose their business.
d. Call all of your customers that currently buy this product to offer them your rebate
program.
Job Simulations: These are exercises designed to mimic real job tasks, allowing candidates
to demonstrate their skills in a controlled environment. Common simulations include:
In-Basket Exercise: Candidates manage hypothetical incoming mail, phone messages, and
memos, deciding on appropriate actions.
Role-Play: Candidates interact with trained role-players to resolve job-related issues, such as
handling a difficult employee or customer, with some background information provided
beforehand.
Leaderless Discussion Group: Candidates work in small groups to discuss or solve job-
related problems without a designated leader.
Assessment centers remain popular for hiring, developing, and promoting employees,
especially for management roles. While many Fortune 500 and smaller companies utilize
them, some streamlined or discontinued their programs during the austerity movement of the
1980s. Despite this, assessment centers continue to be a valuable tool for assessing and
developing employee capabilities in various organizational contexts.
Personality Inventories
Applications:
16PF: The Sixteen Personality Factor Questionnaire (16PF) has been linked to various
workplace outcomes, including absenteeism, turnover, tenure, safety, and job performance. It
assesses sixteen primary factors of personality and is used in organizational contexts.
Hogan Personality Inventory (HPI): Derived partially from the five-factor model, the HPI is
widely employed for organizational testing and decision-making. It provides insights into an
individual's personality traits relevant to the workplace, aiding in personnel selection and
development.
Integrity Testing
The need for businesses to enhance efficiency and competitiveness has led to a growing
concern regarding employee honesty and integrity, particularly in relation to issues like theft.
Employees account for a significant portion of retail theft, costing retailers billions annually.
The Enemy Within: Employees engaging in dishonest behavior are often referred to as the
"enemy within," highlighting the threat they pose to businesses.
Integrity as the Key: According to Forbes, integrity is the crucial factor that distinguishes
conscientious employees from potential thieves.
Assessment Methods:
Physiological Measures: Traditional methods like polygraph tests have been used for
screening applicants, but they have limitations and are often replaced by alternative
approaches.
Paper-and-Pencil Tests: These tests are gaining popularity as a more practical and ethical
alternative to physiological measures. They fall into two categories:
Overt Tests: These ask candidates about past behaviors or responses to hypothetical situations
related to honesty.
Personality-Oriented Tests: These assess personality traits predictive of honest behavior and
positive organizational citizenship. They focus on characteristics like self-control, risk-taking,
diligence, and emotional stability, often correlating with the Big Five personality factors.
Both types of tests aim to identify individuals with traits indicative of integrity, but they
differ in the constructs they measure and their approach to assessment.
Performance Appraisal
1. Ranking Employees:
o Forced Ranking: This method involves comparing employees' performance
and ranking them against each other. It requires supervisors to determine the
"best," "next best," and so on, based on predetermined criteria.
o Forced Distribution: Here, supervisors assign a certain number of employees
to predefined performance categories, ensuring that appraisals follow a normal
distribution curve. Some employees are rated as "outstanding," while others
are rated as "poor."
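Both ranking methods can be sketched in a few lines. The example below is only an illustration: the function name `forced_distribution`, the category labels, the quota proportions (10/20/40/20/10) and the employee scores are all hypothetical. Sorting the employees by score gives the forced ranking; cutting that ranking at fixed cumulative proportions gives the forced distribution.

```python
def forced_distribution(scores, categories):
    """Assign each employee to a category by rank.

    scores: dict mapping employee name -> overall rating (higher is better).
    categories: list of (label, proportion) pairs from best to worst; proportions sum to 1.
    """
    if abs(sum(p for _, p in categories) - 1.0) > 1e-9:
        raise ValueError("category proportions must sum to 1")
    # Forced ranking: best performer first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    assignment = {}
    start = 0
    cumulative = 0.0
    for label, proportion in categories:
        # Cut the ranking at the cumulative proportion so every employee is placed once.
        cumulative += proportion
        end = round(cumulative * n)
        for name in ranked[start:end]:
            assignment[name] = label
        start = end
    return ranked, assignment

# Hypothetical quota and ratings for ten employees.
quota = [("outstanding", 0.10), ("above average", 0.20), ("average", 0.40),
         ("below average", 0.20), ("poor", 0.10)]
ratings = {"E1": 91, "E2": 78, "E3": 85, "E4": 60, "E5": 72,
           "E6": 66, "E7": 88, "E8": 55, "E9": 70, "E10": 81}

ranking, categories_by_employee = forced_distribution(ratings, quota)
print(ranking)
print(categories_by_employee)
```

Note that the quota places one person in "poor" even if every employee performed acceptably, which is a common criticism of forced distribution.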
Errors in Rating: Raters may make several systematic errors when assessing employee
performance:
By addressing these rating errors and providing comprehensive rater training, organizations
can enhance the accuracy and fairness of their performance appraisal processes.
________________________________________________________________________
UNIT 3: Test and Scale Construction: Test Construction and Standardization: Item analysis,
Reliability, validity, and norms (characteristics of z-scores, T-scores, percentiles, stens and
stanines); Scale Construction: Likert, Thurstone, Guttman & Semantic Differential.
Test construction
1. Defining the Test: Defining the test is the initial and crucial step in test construction. It
involves having a clear idea of what the test is to measure and how it is to differ from existing
instruments. The developer must articulate the specific traits, behaviors, or skills that the test
will assess.
2. Selecting a Scaling Method: Selecting a scaling method involves choosing the rules by
which numbers are assigned to test responses. This step is critical as it ensures that the
measurement accurately reflects the trait being assessed. Various methods, such as ordinal
ranking or complex scaling, can be employed depending on the nature of the trait and the
conceptual framework of the test. The chosen method must align with the test's objectives
and provide valid and reliable measurements.
3. Constructing the Items: Creating test items demands both creativity and careful planning.
The process includes deciding if the content should be consistent or diverse, determining the
difficulty range, and specifying the cognitive processes and domains to be addressed. The
item writer must select the test item types and formats, generating a large pool to ensure
comprehensive content coverage.
4. Testing the Items: Testing the items involves administering them to a sample, conducting
statistical item analysis to identify effective items, and refining the test by retaining those
with high reliability, validity, and discriminatory power.
5. Revising the Test: Revising the test is an iterative process where unproductive items are
revised, eliminated, or replaced. The test is re-administered to a new sample to verify
improvements, enhancing its reliability and predictive accuracy to meet measurement goals.
6. Publishing the Test: Publishing the test involves finalizing materials, creating technical
and user manuals, and managing production and distribution. The technical manual details
development, validation, and administration procedures, while the user manual guides proper
usage, ensuring the test is accessible and implementable for its intended audience.
Item analysis: Item analysis is a critical aspect of test construction and evaluation,
particularly relevant for both standardized assessments and locally created tests such as
classroom quizzes. It involves both qualitative and quantitative approaches to assess the
quality of individual test items.
Qualitative analysis focuses on content and form. Content validity, ensuring that the
items measure what they are intended to measure, is a key consideration.
Quantitative analysis involves statistical techniques to measure item difficulty and
discrimination. Item difficulty refers to how easy or hard an item is for test-takers
to answer correctly, while item discrimination assesses how well an item distinguishes
between high and low performers.
The item-difficulty index measures the level of difficulty for a single test item, indicated by
the proportion of examinees in a large tryout sample who answer it correctly. This index,
denoted as p, ranges from 0.0 to 1.0, where a lower value indicates higher difficulty because
fewer people answer the item correctly. For instance, an item with a difficulty index of .2 is
more challenging than one with a difficulty of .7. Items with difficulty close to 0.0 or 1.0
aren't very useful because almost no one or almost everyone answers them correctly.
During test construction, understanding item difficulty is crucial for selecting suitable items.
Items that everyone passes or nobody passes don't provide useful information about
individual differences. Ideally, an item that roughly 50% of test-takers answer correctly
provides the most differentiation among them. However, selecting all items at a .50 difficulty
level isn't always optimal. Test items tend to be intercorrelated, especially in homogeneous
tests. If all items had a difficulty of .50 and were perfectly intercorrelated, half of the
test-takers would score perfectly, and the other half would score zero. Therefore, it's advisable to choose
items with a moderate spread of difficulty levels to ensure a balanced and informative test.
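The item-difficulty index described above can be computed directly from a matrix of scored responses. The following is a minimal sketch using a small hypothetical tryout sample (5 examinees, 3 items); the function and variable names are illustrative only. It also computes the item variance p(1 - p), which peaks at p = .50 and drops to zero when everyone passes or everyone fails.

```python
def item_difficulty(responses):
    """Return p for each item: the proportion of examinees answering correctly.

    responses: list of rows, one per examinee; each row holds 0/1 item scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_examinees
            for i in range(n_items)]


# Hypothetical tryout data (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
p_values = item_difficulty(responses)

# Variance of a dichotomous item: largest when p = .50, zero when p = 0 or 1.
item_variances = [p * (1 - p) for p in p_values]
```

Here the first item is passed by everyone, so it has zero variance and tells us nothing about individual differences.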
The item-reliability index assesses the internal consistency and usefulness of individual test
items. It's determined by the point-biserial correlation coefficient, which measures the
relationship between dichotomous item scores (right or wrong) and the total test score. Items
with higher point-biserial correlations are more internally consistent.
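A point-biserial correlation is numerically the same as a Pearson correlation in which one variable is scored 0/1, so it can be sketched with a few lines of arithmetic. The data below are hypothetical. The final line multiplies the correlation by the item's standard deviation; that product is a common textbook formulation of the item-reliability index and is stated here as an assumption, since the notes do not give the formula.

```python
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson correlation; equals the point-biserial r when x is 0/1."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data for six examinees.
item_scores = [1, 1, 0, 1, 0, 0]    # right (1) or wrong (0) on one item
total_scores = [9, 8, 4, 7, 5, 3]   # total test scores (item included)

r_pb = pearson_r(item_scores, total_scores)

# Assumed formulation: item SD multiplied by the item-total correlation.
item_reliability_index = pstdev(item_scores) * r_pb
```

In this toy sample the examinees who pass the item are also the high scorers, so the correlation is strongly positive.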
The item-validity index is vital in constructing tests with high concurrent or predictive
validity. It helps identify test items that effectively predict the criterion. By calculating this
index for each item, developers can pinpoint ineffective ones, improving the test's practical
utility.
5. Item-discrimination index
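One widely used way to quantify how well an item separates high from low performers is the upper-lower index, D = p(upper group) - p(lower group). The sketch below is an assumption about how this could be computed, since the notes give no formula at this point; the 27% group size is a common convention rather than a requirement, and the data are hypothetical.

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Upper-lower discrimination index D = p_upper - p_lower.

    item_scores:  0/1 scores on one item, one per examinee.
    total_scores: total test scores for the same examinees.
    fraction:     share of examinees placed in each extreme group (assumed 27%).
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])  # low to high
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical data for six examinees.
item_scores = [1, 1, 0, 1, 0, 0]
total_scores = [9, 8, 4, 7, 5, 3]
d_index = discrimination_index(item_scores, total_scores)
```

D ranges from -1 to +1; values near zero or below suggest the item does not distinguish high from low scorers.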
Advantages:
Disadvantages:
Advantages:
Disadvantages:
Memory Effects: Subjects may recall their initial answers if the test is repeated
immediately, potentially inflating their scores and compromising reliability.
Practice Effects: Familiarity with the material and practice from the first
administration may influence scores on subsequent administrations, affecting
reliability.
3. Parallel-Forms Reliability
Parallel-forms reliability, also known as alternative form reliability or equivalent form
reliability, assesses the consistency of measurement by using two different but equivalent
forms of the test. These forms are designed to be identical in content, objectives, format,
difficulty level, and other relevant aspects.
Advantages:
Reduce Memory and Practice Effects: Unlike the test-retest method, parallel-forms
reliability minimizes memory, practice, carryover effects, and recall factors, as the
same test is not repeated.
Temporal Stability: This method provides a measure of both temporal stability and
consistency of response to different item samples or test forms, combining two types
of reliability.
Limitations:
The split-half method evaluates the internal consistency of a test by splitting it into two
comparable halves and correlating the scores obtained on each half. This approach overcomes
the challenges of using the same test twice or obtaining equivalent forms.
Advantages:
Minimization of Memory and Practice Effects: Since the test is administered only
once, there are no memory or practice effects influencing the scores.
Stability against Environmental Factors: Individual fluctuations due to environmental
or physical conditions are minimized, as the test is taken in a single sitting.
Limitations:
Variability in Splitting: Different ways of dividing the test into halves may yield
different correlation coefficients, affecting the reliability estimate.
Not Suitable for Speed Tests: This method cannot be used to estimate reliability for
speed tests.
Intrinsic Factors:
(i) Length of the Test: Longer tests tend to have higher reliability as they provide a broader
sample of items, but excessive length may lead to fatigue effects.
(iii) Difficulty Value of Items: Test items that are either too easy or too difficult can result in
low reliability due to restricted score spreads.
(iv) Discriminative Value: Items that effectively differentiate between high and low
performers contribute to higher reliability.
(v) Test Instructions: Clear and concise instructions enhance reliability, while complex or
ambiguous instructions can lower it.
(vi) Item Selection: Too many interdependent items can decrease reliability.
Extrinsic Factors:
(v) Poor Instructions: Inadequate or unclear test instructions can lead to inaccurate responses
and lower reliability.
(vi) Test Difficulty: If a test is too difficult for the test-takers' knowledge level, reliability
may decrease as they struggle to provide answers.
(vii) Objective Scoring: Accurate and unbiased scoring methods are essential for maintaining
test score reliability.
Types of Errors
1. Random Error:
Random errors are inherent in every measurement and are a significant source of
uncertainty.
These errors have no specific assignable cause and cannot be completely eliminated.
They stem from numerous uncontrollable variables that are inherent in every human-
made analysis.
Identifying these variables is challenging, and even if identified, they are often too
small to be measured accurately.
2. Systematic Error:
Systematic errors arise from instruments, machines, or measuring tools rather than
individuals.
They persist consistently across measurements and are not attributable to random
fluctuations.
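The difference between the two error types can be illustrated with a small simulation. In this sketch the true score, the size of the random error, and the constant bias are all made-up values: random error scatters observations around the true score and tends to cancel out over many measurements, whereas a systematic error shifts every observation by the same amount and does not average away.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_SCORE = 100.0   # hypothetical true value
BIAS = 3.0           # hypothetical constant error from a miscalibrated tool

# Random error only: observations scatter around the true score.
random_only = [TRUE_SCORE + random.gauss(0, 5) for _ in range(10_000)]

# Random plus systematic error: every observation is shifted by the same bias.
with_bias = [x + BIAS for x in random_only]

mean_random = sum(random_only) / len(random_only)
mean_biased = sum(with_bias) / len(with_bias)
```

Averaging many observations brings `mean_random` close to the true score, but `mean_biased` stays about 3 points too high no matter how many observations are taken.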
Validity
Validity refers to the extent to which a test measures what it is intended to measure, making it
a critical aspect of test development. There are several types of validity, each serving a
different purpose in evaluating the effectiveness of a test.
Internal Validity: This type of validity ensures that a study is methodologically sound and
free from confounding variables. It assesses whether the observed effects in a study are due to
the manipulation of the independent variable rather than extraneous factors. To enhance
internal validity, researchers implement strategies such as blinding, random selection,
experimental manipulation, and adherence to study protocols.
External Validity: In psychology, external validity assesses whether a study's findings can be
generalized to real-world conditions beyond the specific study context. It's about ensuring
results hold up across different settings, populations, and times. Researchers enhance external
validity by ensuring participants act naturally.
1. Face validity is a measure of whether a test appears to measure what it claims to measure,
based on its surface content. It is assessed subjectively by experts or participants familiar
with the construct being measured. Methods of measuring face validity include polling
participants to gauge their perception of the test's relevance and importance, as well as
administering follow-up questionnaires to gather feedback on whether the test accurately
reflects its intended purpose.
Ease of Assessment: Face validity is relatively simple and quick to assess compared
to other forms of validity.
Improves Acceptance: Tests that appear valid are more likely to be accepted and
trusted by participants and stakeholders.
Enhances Participant Engagement: When participants see the test as relevant, they are
more likely to be motivated and engaged, leading to more accurate responses.
Comprehensive Assessment: Content validity ensures that a test covers all relevant
aspects of the construct it aims to measure, providing a thorough assessment.
Expert Input: Involving subject matter experts in the validation process enhances the
credibility and accuracy of the test content.
Criterion validity encompasses two types: concurrent validity and predictive validity.
Practical Relevance: Provides direct evidence of how well a test predicts real-world
outcomes, making it highly relevant for practical applications.
Enhanced Decision-Making: Helps in making informed decisions based on the test
results, such as in employee selection, educational placement, or customer behavior
analysis.
Criterion Availability: Requires an appropriate and reliable criterion, which may not
always be available or easy to define.
Time-Consuming: Collecting criterion data can be time-consuming and resource-
intensive.
Changing Criteria: External criteria can change over time, affecting the stability and
ongoing validity of the test.
4. Construct validity assesses whether the test measures the specific trait or concept it intends to
measure. It involves theoretical considerations about why and how the test works, and how it
relates to other relevant constructs. Establishing construct validity requires a strong
theoretical framework and can be challenging to achieve. It encompasses three subtypes:
convergent validity, discriminant validity, and nomological validity.
Convergent validity measures the extent to which a test correlates positively with
other measures of the same construct.
Discriminant validity, also known as divergent validity, assesses the extent to which
a test does not correlate with unrelated constructs.
Norms
A norm is the average or typical score on a particular test obtained by a set/group of defined
individuals. Norms of the test are based on the distribution of scores obtained by the people
of the standardization group. To develop norms, a test is administered to a large sample of
individuals, and the distribution of scores obtained by those individuals represents the norms
of the test.
- Percentile norms
- Deciles
- Standard score
- Stanine
Types of norms
1. An age norm depicts the level of test performance for each separate age group in the
normative sample. The purpose of age norms is to facilitate same-aged comparisons.
2. A grade norm depicts the level of test performance for each separate grade in the
normative sample. Grade norms are rarely used with ability tests.
3. Local norms are derived from representative local examinees, as opposed to a
national sample.
4. Subgroup norms consist of the scores obtained from an identified subgroup (e.g., African
Americans, Hispanics, females).
A. Percentile Norms
A percentile rank is a type of converted score that expresses a student's score relative to their
group in percentile points.
This indicates the percentage of students tested who made scores equal to or lower than the
specified score.
Easy to calculate.
Easy to understand and interpret.
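A percentile rank as defined above (the percentage of the group scoring equal to or lower than a given score) can be computed in a couple of lines. This is a minimal sketch with hypothetical scores; other textbooks count only scores strictly below, or count half of the tied scores, so the exact convention is an assumption taken from the definition in these notes.

```python
def percentile_rank(scores, score):
    """Percentage of scores in the group that are equal to or lower than `score`."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)


# Hypothetical norm group of ten students.
group_scores = [45, 50, 55, 60, 65, 70, 75, 80, 85, 90]
pr_70 = percentile_rank(group_scores, 70)
```

A score of 70 here has a percentile rank of 60, meaning 60% of the group scored 70 or lower.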
B. Decile: Deciles are points that divide the scale of measurement into 10 equal parts.
The decile points range from Decile 1 to Decile 9. A distinction is made between decile scores
and decile ranks. The first decile score is the point below which 10% of cases lie, so a decile
rank of 1 indicates that you are among the lowest 10% of the group. Deciles rest on the same
principle as percentiles, but divide the group into tenths (1/10) rather than hundredths (1/100);
Decile 1 therefore corresponds to a percentile rank of 10. Decile 1 is the lowest point and
Decile 9 the highest.
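The nine decile points and the resulting decile ranks can be sketched with the standard library. The score distribution below (the integers 1 to 99) is hypothetical and chosen so the cut points are easy to read; `statistics.quantiles` uses its default "exclusive" method, and placing a score that falls exactly on a cut point into the lower group is an assumption of this sketch.

```python
from bisect import bisect_left
from statistics import quantiles

# Hypothetical norm group: scores 1 through 99.
scores = list(range(1, 100))

# Nine cut points (Decile 1 ... Decile 9) dividing the group into ten parts.
decile_points = quantiles(scores, n=10)


def decile_rank(score, cut_points):
    """Return 1-10: rank 1 is the lowest tenth of the group, 10 the highest."""
    return min(10, bisect_left(cut_points, score) + 1)


lowest = decile_rank(5, decile_points)
middle = decile_rank(55, decile_points)
highest = decile_rank(95, decile_points)
```

There are only nine decile points but ten decile ranks, because nine cuts split the distribution into ten groups.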
Standard scores: Standard scores are increasingly used in current tests due to their
effectiveness in providing a satisfactory derived score from various perspectives. These
scores express an individual's deviation from the mean in terms of the standard deviation of
the distribution.
Z-score: Measures how far and in what direction a score deviates from the mean,
expressed in standard deviations. For example, a z-score of 1.5 means the score is 1.5
standard deviations above the mean.
T-score: Standardizes raw scores to a scale with a mean of 50 and a standard deviation of
10, making them easier to interpret and compare. For instance, a T-score of 60 indicates a
score one standard deviation above the mean.
Percentile: Indicates the relative standing of a score within a distribution by showing the
percentage of scores that fall below it. For example, the 90th percentile means the score is
higher than 90% of the scores in the distribution.
Stanine: Simplifies score interpretation using a nine-point scale, where 5 represents the
average performance, and each unit is roughly half a standard deviation. For example, a
stanine score of 7 indicates above-average performance.
Sten: Divides the distribution of scores into ten intervals ("standard ten"), with a mean of 5.5
and a standard deviation of 2, so each unit represents half a standard deviation. For example, a
sten score of 8 indicates a score well above the average.
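The conversions among these derived scores can be sketched as simple linear transformations of z. The sketch assumes the conventional definitions (T: mean 50, SD 10; stanine: mean 5, SD 2, range 1 to 9; sten: mean 5.5, SD 2, range 1 to 10) and assumes that a score falling exactly on a band boundary goes into the higher band; function names are illustrative.

```python
import math


def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd


def t_score(z):
    """T-score: mean 50, standard deviation 10."""
    return 50 + 10 * z


def stanine(z):
    """Stanine: nine bands, each half an SD wide, centred on 5."""
    return min(9, max(1, math.floor(2 * z + 5.5)))


def sten(z):
    """Sten: ten bands, each half an SD wide, with the mean at 5.5."""
    return min(10, max(1, math.floor(2 * z + 6)))


# Hypothetical raw score of 65 on a test with mean 50 and SD 10.
z = z_score(65, 50, 10)
converted = {"z": z, "T": t_score(z), "stanine": stanine(z), "sten": sten(z)}
```

Because all four scales are derived from the same z value, they carry the same information at different levels of granularity; stanines and stens simply group nearby scores into coarse bands.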
___________________________________________________________________________
Projective Personality Tests: During the 1930s, interest also grew in measuring personality
by exploring the unconscious. A projective test is a personality test in which subjects are
shown ambiguous images or given situations and asked to interpret them. The subjects are to
project their own emotions, attitudes, and impulses onto the stimulus given; and then use
these projections to explain an image, tell a story, or finish a sentence. These unconscious
reactions are then interpreted and used to evaluate the subject's personality.
The use of projective tests began in the 1920s and 1930s, and the tests became more widely
used in the United States military during World War II.
There are a number of different projective tests that are used. Some of the most common and
widely used projective tests include the Rorschach Inkblot Test, the Thematic Apperception
Test, the Draw-A-Person Test, and the House-Tree-Person Test.
Description
The Rorschach Inkblot Test is a psychological assessment tool based on the projective
hypothesis. It involves presenting individuals with a series of inkblots and asking them to
describe what they see. The test is designed to reveal underlying thoughts, feelings, and
personality characteristics by analysing the individual's interpretations of the inkblots.
History
The origins of projective tests can be traced back to the early 20th century, with Swiss
psychiatrist Hermann Rorschach pioneering the field through the creation of his inkblot test.
He developed a set of inkblot cards by dropping ink on paper and folding it to produce
unique, symmetrical forms. After experimenting with thousands of blots, Rorschach selected
20, but the final published test included only 10, comprising five black and gray blots, two
with black, gray, and red, and three with pastel colors. The Rorschach test has been both
praised as a powerful psychometric tool and criticized as lacking scientific rigor.
Reliability
The reliability of the Rorschach Inkblot Test, particularly in terms of its scoring and
interpretation, has been a point of controversy. Critics argue that the test's reliability is
questionable due to vague and idiosyncratic scoring guidelines. However, efforts like the
Exner Comprehensive System have aimed to standardize the scoring process and improve the
reliability of the test.
Validity
The validity of the Rorschach test has also been debated. While some argue that it provides
deep insights into personality and psychological functioning, others contend that there is
insufficient evidence to support its validity for clinical and forensic purposes. Despite these
criticisms, the test continues to be used, with proponents citing recent research that supports
its validity in specific contexts.
Norms
Norms for the Rorschach test are based on comparative data from representative samples.
The Exner Comprehensive System played a significant role in developing normative data by
collecting responses from various populations, including children and adults. This system
aimed to provide a more standardized framework for interpreting test results.
Scoring
Location: Identifies where the perception occurs on the inkblot (whole blot, common
detail, unusual detail).
Determinant: Refers to what aspect of the inkblot led to the perception (form,
movement, colour, shading). For instance, responses based on form are scored as F,
while movement responses are categorized as human (M), animal (FM), or inanimate
(m).
Content: Identifies what the subject perceives (human, animal, nature).
Form Quality: Assesses how well the perception matches the inkblot's properties.
Adequate form quality means the examiner can see the same image; poor form quality
(F-) indicates the examiner cannot perceive the same image.
Application
The Rorschach test is applied in clinical settings to assess personality, cognition, and
emotional functioning. It is used to diagnose psychological conditions, understand personality
structure, and guide treatment planning. The test is administered individually, with each of
the 10 inkblot cards presented in a minimally structured manner to elicit the subject's
responses.
In forensic settings, the Rorschach test might be used to evaluate an individual's
psychological state in legal contexts, although its use here is more contentious due to validity
concerns. The test's application requires skilled administration and interpretation by qualified
professionals to ensure the accuracy and usefulness of the results.
Description
The Thematic Apperception Test (TAT) is a projective psychological assessment tool used to
evaluate personality, motives, and emotional functioning. It involves showing individuals
ambiguous pictures and asking them to tell a story about each one, which reveals their
underlying thoughts, feelings, and concerns.
History
Developed in 1935 by Henry A. Murray and Christiana Morgan, the TAT is based on
Murray's theory of needs and is designed to uncover an individual's underlying motives,
especially the need for achievement.
Reliability
The Thematic Apperception Test (TAT) presents complex verbal material, making
quantitative analysis difficult. Multiple scoring systems exist, complicating the assessment of
reliability. Despite these challenges, interscorer reliability across various scoring systems is
generally good, ranging from .37 to .90, with most reports indicating reliability of .85 or
higher.
Validity
Reviews of the Thematic Apperception Test (TAT) regarding its validity exhibit significant
variability. While some reviewers consider a correlation of .25 impressive, others perceive it
as deficient. Additionally, research on the TAT is often influenced by reviewers' own biases,
resulting in subjective interpretations of its validity.
Norms
Norms for the TAT involve establishing typical responses from subjects based on specific
demographics, such as age, gender, race, and education. These norms allow for the
comparison of an individual's responses against standardized groups.
Scoring
Scoring the TAT involves evaluating the following aspects of the stories:
Subtests
The TAT does not have subtests in the traditional sense. Instead, it comprises 31 picture cards
(30 with scenes and 1 blank) depicting various situations. The subjects create stories based on
these cards, which are analyzed for themes and patterns.
Application
The TAT is used in both clinical and research settings. It helps in:
Advantages
Disadvantages
Access to Deep Personality Structures: Projective tests can reveal covert and deeper
aspects of an individual's personality that might not be accessible through other
methods.
Reduced Susceptibility to Faking: The disguised purpose of projective techniques
makes them less susceptible to intentional manipulation by test-takers.
Engaging and Nonthreatening: These tests are often interesting and nonthreatening to
participants, which can encourage more honest and open responses.
Rich Qualitative Data: Projective tests provide rich, qualitative data that can offer
insights into the individual's inner world and unconscious processes.
Disadvantages of Projective Tests
Semi-projective tests Projective techniques that employ words or open-ended phrases and
sentences are referred to as semi-structured techniques because, although they allow for a
variety of responses, they still provide a framework within which the subject must operate.
Perhaps the two best-known examples of verbal projective techniques are word association
tests and sentence completion tests.
Description
Sentence completion tests, such as the Rotter Incomplete Sentences Blank (RISB), are semi-
structured projective techniques used for personality assessment. They require an individual
to finish an incomplete sentence or phrase, revealing latent feelings and cognitions. These
tests can be general or designed for specific settings, such as schools or businesses.
History
The RISB, developed by Rotter & Rafferty in 1950, is one of the most popular sentence
completion tests. It is widely used in both research and applied settings to screen for
maladjustment, assess psychological distress, and monitor changes during treatment.
Reliability
Internal Consistency: The RISB has demonstrated high internal consistency, with
corrected split-half reliability scores of .84 for male and .83 for female college
students.
Inter-scorer Reliability: The inter-scorer reliability is high, with scores of .91 for
male and .96 for female records.
Validity
Group Validation: The RISB was validated using groups not involved in its
development, scored blindly. It effectively differentiated between adjusted and
maladjusted students using a cutting score of 135.
Norms
The RISB includes scoring criteria for male and female college students separately. Norms
are established based on the responses of individuals in specific demographic groups,
allowing for the assessment of an individual's responses relative to these standardized groups.
Scoring
Responses: Each of the 40 sentence stems is completed by the respondent and scored
on a 7-point scale.
o High Scores (4-6): Indicate psychosocial conflict.
o Low Scores (0-2): Indicate positive adjustment and coping.
o Neutral Scores (3): Reflect stereotypical or non-emotional responses.
Overall Adjustment Score: Summed ratings for all 40 responses, with long
responses assigned an additional point.
Categories: Responses are interpreted according to categories such as family
attitudes, social and sexual attitudes, general attitudes, and character traits.
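The arithmetic behind the overall adjustment score can be sketched as below: 40 ratings on the 0 to 6 scale are summed and compared with the cutting score of 135 quoted in the Validity section. The rule that a total at or above the cutoff is flagged is an assumption of this sketch, the extra point for long responses is left out, and the ratings themselves are hypothetical; actual scoring relies on the test manual's examples.

```python
CUTTING_SCORE = 135  # cutting score quoted in these notes
N_STEMS = 40         # number of sentence stems


def overall_adjustment_score(ratings):
    """Sum the 0-6 ratings given to the 40 completed sentences."""
    if len(ratings) != N_STEMS:
        raise ValueError("expected one rating per sentence stem (40)")
    if any(not 0 <= r <= 6 for r in ratings):
        raise ValueError("each rating must lie between 0 and 6")
    return sum(ratings)


def flags_maladjustment(ratings, cutoff=CUTTING_SCORE):
    """Assumed decision rule: totals at or above the cutoff are flagged."""
    return overall_adjustment_score(ratings) >= cutoff


# Hypothetical records: all-neutral ratings versus mostly conflict ratings.
neutral_record = [3] * 40
conflict_record = [4] * 40
```

A record of all-neutral ratings totals 120 and falls below the cutoff, while a record rated 4 throughout totals 160 and is flagged.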
Subtests
The RISB does not have traditional subtests but consists of 40 sentence stems that elicit
responses covering various psychological constructs such as attitudes, motivations, and
interpersonal relationships.
Application
Advantages
Rich Qualitative Data: Provides in-depth insights into an individual's thoughts and
emotions.
Flexibility in Application: Can be used in various settings and contexts.
Versatility in Content: Assesses a wide range of psychological constructs.
Potential for Projection: Reveals unconscious aspects of personality.
Ease of Administration: Simple and convenient for clinicians and researchers.
Disadvantages
The Picture Frustration Test (PFT) comprises 24 cartoon-like pictures, each illustrating
common frustrating situations involving two individuals. These scenarios intentionally lack
detailed facial expressions and emotions. In each picture, the figure on the left is depicted
saying words that either frustrate the figure on the right or describe that figure's
frustration, while the figure on the right remains silent, with a blank
caption box above them. Test-takers are instructed to provide the first response that comes to
mind for the frustrated figure on the right. The administration of the test typically takes about
15 to 20 minutes.
Scoring
It is assumed as a basis for P-F scoring that the examinee unconsciously or consciously
identifies with the frustrated individual in each picture and projects his or her own bias into
the responses given.
To define this bias, scores are assigned to each response under two main dimensions:
direction of aggression and type of aggression.
Direction of Aggression
Types of Aggression
Obstacle-dominance (OD), in which the barrier occasioning the frustration stands out
in the response;
Ego (etho) defense (ED), in which the ego of the subject predominates to defend
itself.
Need-persistence (NP), in which the solution of the frustrating problem is emphasized
by pursuing the goal despite the obstacle.
Scoring
Advantages:
Limitations:
Subjectivity: Interpretation relies heavily on the subjective judgment of the test
administrator, which may introduce bias.
Validity Concerns: Limited empirical evidence supporting the validity and reliability
of the test, raising questions about its effectiveness as a diagnostic tool.
Cultural Factors: Responses may be influenced by cultural background and individual
differences, impacting the test’s cross-cultural validity.
History
The original MMPI was developed in the late 1930s and published in 1943 by psychologist
Starke Hathaway and psychiatrist J.C. McKinley. It was revised to become the MMPI-2 in
1989. The original MMPI consisted of 550 true/false items derived from other personality
tests and the developers' clinical experience to provide psychiatric diagnoses. About 500
items from an initial pool of 1000 were administered to psychiatric patients and visitors at the
University of Minnesota hospitals. The MMPI was designed for individuals aged 16 and
older, although it has also been used with younger adolescents. Completing the MMPI
typically takes one to one-and-a-half hours.
The MMPI-2 includes various clinical scales used to assess different psychological
conditions and personality attributes. These scales are:
The MMPI-2 includes several validity scales to assess the test-taking attitude and the
reliability of the responses:
1. Cannot Say (?): Reflects the number of unanswered questions, indicating response
inconsistency or reluctance to disclose information.
2. Lie (L): Measures the tendency to present oneself in an overly positive manner
("faking good"), with high scores indicating a denial of problems.
3. Infrequency (F): Assesses response consistency and the presence of unusual or
exaggerated symptoms, with high scores suggesting potential invalidity of responses.
4. Correction (K): Evaluates defensiveness or attempts to minimize symptoms, with
high scores indicating a tendency to present oneself in a socially desirable manner.
Scoring
Norms: Scores are normed using standardized T-scores, where each scale has a mean
of 50 and a standard deviation of 10.
Interpretation: Scores above 65 (one and a half standard deviations above the mean)
are considered elevated and within the clinical range, indicating potential
psychological issues.
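The interpretation rule above amounts to a simple threshold check on T-scores. The sketch below uses placeholder scale labels and made-up T-scores, since the actual scale names and any real profile are not given here; it only illustrates the "above 65" rule and the conversion of a T-score into standard deviation units.

```python
T_MEAN = 50
T_SD = 10
ELEVATION_CUTOFF = 65  # scores above this are treated as elevated


def sd_units_above_mean(t):
    """Express a T-score as standard deviations above (or below) the mean."""
    return (t - T_MEAN) / T_SD


def elevated_scales(profile, cutoff=ELEVATION_CUTOFF):
    """Return the names of scales whose T-score exceeds the cutoff."""
    return [name for name, t in profile.items() if t > cutoff]


# Hypothetical profile with placeholder scale labels.
profile = {"Scale A": 72, "Scale B": 65, "Scale C": 48}
flagged = elevated_scales(profile)
```

A T-score of exactly 65 is not flagged under this reading, because the rule refers to scores above 65.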
Application
The MMPI-2 is used in various settings, including clinical, legal, and occupational contexts.
It aids in diagnosing mental disorders, assessing psychological functioning, and guiding
treatment planning. Its reliability and validity make it a valuable tool for psychologists and
other mental health professionals. The MMPI-2's broad application range includes:
Clinical Settings: For diagnosing and developing treatment plans for mental health
issues.
Forensic Settings: To evaluate individuals in legal contexts, such as assessing
competency to stand trial.
Occupational Settings: For employee selection and fitness-for-duty evaluations.
Research: In studies examining personality, psychopathology, and psychological
interventions.
Background of FIRO-B
William Schutz developed the FIRO-B to improve the performance of military teams,
particularly within the Navy's Combat Information Center (CIC). Schutz's research was
rooted in the idea that, beyond basic physiological needs, individuals have interpersonal
needs that motivate their behavior. These needs fall into three categories:
Schutz believed that these interpersonal needs, if unmet, could lead to discomfort or anxiety.
His work drew on psychological theories from Freud, Adorno, Fromm, Adler, and Jung.
Scoring
The FIRO-B instrument contains 54 items. Respondents use two six-point rating scales:
Frequency Scale: Measures how often the client engages in the described behavior
(never, rarely, occasionally, sometimes, often, usually).
Selectivity Scale: Measures how many people the client engages with in the
described behavior (nobody, one or two people, a few people, some people, many
people, most people).
Responses are scored using a 0,1 key, where each response is converted into a numerical
value to derive the scale scores.
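Scoring with a 0/1 key means each response earns either one point or none, depending on whether it falls among the keyed options for that item. The sketch below shows only the mechanism: the item numbers, the keyed options, and the responses are entirely hypothetical and are not the actual FIRO-B key.

```python
# Hypothetical key: item number -> response options (1-6) that earn 1 point.
EXAMPLE_KEY = {
    1: {1, 2, 3},
    2: {1, 2},
    3: {1, 2, 3, 4},
}


def score_scale(responses, key):
    """Count the keyed items whose chosen option is among the scored options.

    responses: dict of item number -> chosen option (1-6).
    key:       dict of item number -> set of options scored 1 (others score 0).
    """
    return sum(1 for item, keyed in key.items() if responses.get(item) in keyed)


# Hypothetical responses from one client.
responses = {1: 2, 2: 5, 3: 4}
scale_score = score_scale(responses, EXAMPLE_KEY)
```

Items 1 and 3 fall within their keyed options and item 2 does not, so this hypothetical scale score is 2.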
Application
Advantages:
Disadvantages: