HFC Psych Assess


UNIT 1: Introduction to Psychological Assessment: Psychological assessment: Principles of assessment, Nature and purpose, Similarity and difference between test and assessment. Types of assessment: Observation, Interview, scales and tests. Integrating inputs from multiple sources of information, report writing and providing feedback to the client/referral source. Psychological assessment in multi-cultural context, Ethical and professional issues and challenges

UNIT 2: Psychological Testing: Definition of a test, types of tests; Characteristics of a Good Test; Applications of psychological tests in various contexts (educational, counselling and guidance, clinical, organizational, etc.)

UNIT 3: Test and Scale Construction: Test Construction and Standardization: Item analysis, Reliability, validity, and norms (characteristics of z-scores, T-scores, percentiles, stens and stanines); Scale Construction: Likert, Thurstone, Guttman & Semantic Differential

UNIT 4: Tests of Cognitive Ability and Personality: Tests of cognitive ability: General mental ability tests (the Wechsler scales of intelligence, Stanford–Binet Intelligence Scales: 5th Edition, Culture Fair Intelligence Test, Raven's Progressive Matrices, etc.), Aptitude tests/batteries (e.g., Differential Aptitude Tests), Information-processing tests (Das–Naglieri Cognitive Assessment System (CAS)); Tests of personality: Inventories such as NEO-FFI, 16 PF, FIRO-B, MMPI, etc., Projective tests like the Rorschach and Thematic Apperception Test (a brief introduction to both), semi-projective tests like Rotter's Incomplete Sentences Blank and Rosenzweig's Picture Frustration Test. Future directions in psychological assessment: Computer-assisted assessment, Virtual reality and psychological assessment

UNIT 1: INTRODUCTION TO
PSYCHOLOGICAL ASSESSMENT
Psychological assessment
Psychological assessment is a flexible, not standardized, process aimed at reaching a defensible determination concerning one or more psychological issues or questions, through the collection, evaluation, and analysis of data appropriate to the purpose at hand (Maloney & Ward, 1976).

Psychological assessment is the gathering and integration of data to evaluate a person’s behavior,
abilities, and other characteristics, particularly for the purposes of making a diagnosis or
treatment recommendation. Psychologists assess diverse psychiatric problems (e.g., anxiety,
substance abuse) and nonpsychiatric concerns (e.g., intelligence, career interests) in a range of
clinical, educational, organizational, forensic, and other settings. Assessment data may be
gathered through interviews, observation, standardized tests, self-report measures, physiological
or psychophysiological measurement devices, or other specialized procedures and apparatuses
(APA, 2023).

Steps in the Assessment Process

The first and most important step in psychological assessment is to identify its goals as clearly
and realistically as possible. Without clearly defined objectives that are agreed upon by the
assessor and the person requesting the assessment, the process is not likely to be satisfactory. In
most instances, the process of assessment ends with a verbal or written report, communicating
the conclusions that have been reached to the persons who requested the assessment, in a
comprehensible and useful manner. In between these two points, the professional conducting the
assessment, usually a psychologist or a counselor, will need to employ her or his expertise at
every step. These steps involve the appropriate selection of instruments to be used in gathering
data, their careful administration, scoring, interpretation, and—most important of all—the
judicious use of the data collected to make inferences about the question at hand. This last step
goes beyond psychometric expertise and requires a knowledge of the field to which the question
refers, such as health care, educational placement, psychopathology, organizational behavior, or
criminology, among others.

Examples of issues amenable to investigation through psychological assessment include


• diagnostic questions, such as differentiating between depression and dementia;
• making predictions, such as estimating the likelihood of suicidal or homicidal behaviors; and
• evaluative judgments, such as those involved in child custody decisions or in assessing the
effectiveness of programs or interventions.

None of these complex issues can be resolved by means of test scores alone because the same
test score can have different meanings depending on the examinee and the context in which it
was obtained. Furthermore, no single test score or set of scores can capture all the aspects that
need to be considered in resolving such issues.

PRINCIPLES OF ASSESSMENT

The principles of assessment were given by Shertzer and Linden, who state that assessment should be Holistic, Ongoing, Balanced, Accurate and Confidential. These are explained below.

1) Assessment should be Holistic: This principle involves multiple methods in collecting information. The use of a combination of assessment techniques increases the likelihood of applying positive intervention and consequently the achievement of the desired goals. The principle of holistic assessment follows a systematic process to arrive at an understanding of the individual. To make the assessment process more systematic, a counselor needs to keep in mind three important factors: what to assess, when to assess, and in what situation assessment is required. Assessment should be within the context of the life pattern of the individual, i.e., supportive information regarding other aspects of the person also needs to be considered to better understand the problem. For example, a student may experience difficulty in school due to limited academic preparation. However, that may not be the only factor; other factors, such as self-esteem, may not be evident but could influence the academic achievement of the student.

2) Assessment should be Ongoing: Ongoing assessment allows comparison between the client's initial presenting problems and the client's current functioning. It apprises the counselor of possible new and urgent needs which may arise after the initial assessment; therefore, psychological assessment must take into account the dynamic nature of human behaviour, which involves the person's needs, goals, abilities, etc. In assessment, the conceptualization of an individual must be continuous. This is important because the counselor keeps refining his conceptualization of the client in the light of the growing body of information collected and the interaction between the client and the counselor. Hence, assessment needs to be considered as ongoing and not episodic.

3) Assessment should be Balanced: Assessment makes use of normative information as well as individualised data. Both types of data combined try to give a better understanding of the client. It is the purpose and the situation that decide which type of assessment data is required.

4) Assessment should be Accurate: The assessment device used should be accurate and the counselor should have the skill to interpret the data. Counselors must keep in mind the possibility of errors, as no tool is 100% accurate; they must try to minimize errors by using standardized procedures. Predictions of future behaviour should always be stated in terms of probabilities, as human behaviour is complex and dynamic. Assessment can therefore only provide useful insights for deriving inferences rather than making predictions in absolute terms.

5) Assessment should be Confidential: Clients need to be assured of the confidentiality of their personal information. This will develop trust in the counselor. Confidentiality is one of the basic ethics of counseling as well. The client will also be able to build a good rapport based on mutual trust and respect.

PURPOSE OF ASSESSMENT

The purposes of assessment in guidance and counseling situation are as follows:

1) Self Understanding: The basic purpose of carrying out an assessment is to gain insight that helps clients understand themselves better, helping them know what they can and cannot do, including their strengths and weaknesses.

2) To Diagnose Student's Problem: To diagnose the client's problem is another purpose that assessment data fulfills. By using the data properly, we can identify causal factors. It also helps to identify various aspects such as family background, physical health, academic performance, etc.

3) To Help in Career Planning and Education: Assessment done with the help of various
psychological tools guides the students in making choices for their career and selection of
subjects/courses.

4) To Help Predict Future Performance: Counselors use assessment data to estimate an individual's attitudes, abilities, personality, etc. that have implications for success and adjustment, which helps predict the future performance of the individual. Moreover, the counselor can also motivate the client in a direction where he/she can achieve more success.

5) To Evaluate the Outcome of Counseling: Assessment is done prior to counselling as well as at the end of it. This gives the counselor valuable insights for further intervention and to achieve the expected outcome.

(Reference for principles and purpose: https://egyankosh.ac.in/bitstream/123456789/21246/1/Unit-1.pdf)

SIMILARITIES AND DIFFERENCES BETWEEN TEST AND ASSESSMENT

[The comparison table from the source did not survive extraction; see "Difference between testing and assessment" in Unit 2 for the key distinction.]

Types of Assessment

1. Observation
The American Psychological Association (2023) defines observation as “the careful,
close examination of an object, process, or other phenomenon for the purpose of collecting data
about it or drawing conclusions about behavior". A major part of observation is behavioral observation, which by itself can be sufficient for making adequate decisions about behaviors before, during, and even after treatment interventions.
While interviews are mainly used to gather verbal information from clients, behavioral
observation is a crucial tool to determine and implement specific strategies and techniques that
measure relevant areas of behavior discussed during the interview. This involves recording
non-verbal behavioral signs like postures and gestures. Behavioral observation may be
particularly significant when assessing developmentally disabled individuals, resistant clients, or
very young children. The observations can be carried out by the treatment provider or someone
closely associated with the client, such as a parent, teacher, spouse, or the client themselves.
However, it's essential to note that in clinical settings, only specialists and professionals should conduct these observations, given susceptibility to observer errors such as bias, leniency, lapses in concentration, or discussing data with other observers. To ensure reliability, different
observers may rate the same behaviors and compare their ratings. Nonetheless, caution is
necessary when using trained observers since interobserver agreement can vary widely.
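As a quick illustration of checking interobserver agreement (not from the source text; the ratings in this Python sketch are hypothetical), two observers code the same ten intervals, and we compute simple percent agreement plus Cohen's kappa, which corrects agreement for chance.

```python
# Minimal sketch: quantifying interobserver agreement.
# Two observers code the same 10 observation intervals as
# "on-task" (1) or "off-task" (0); the data are hypothetical.
from collections import Counter

observer_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
observer_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

n = len(observer_a)

# Percent agreement: proportion of intervals coded identically.
agreement = sum(a == b for a, b in zip(observer_a, observer_b)) / n

# Cohen's kappa corrects agreement for chance:
# kappa = (p_observed - p_chance) / (1 - p_chance)
counts_a = Counter(observer_a)
counts_b = Counter(observer_b)
p_chance = sum((counts_a[c] / n) * (counts_b[c] / n) for c in (0, 1))
kappa = (agreement - p_chance) / (1 - p_chance)

print(f"Percent agreement: {agreement:.2f}")  # 0.80
print(f"Cohen's kappa:     {kappa:.2f}")      # lower, after removing chance
```

Kappa is typically well below raw percent agreement, which is one reason "interobserver agreement can vary widely" depending on how it is computed.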

SETTINGS FOR OBSERVATION

Naturalistic observation is a type of observation where the researcher observes the subject in their natural environment
without any manipulation or interference. This method is commonly used in psychological
assessments to gain insights into the subject's behavior, thoughts, and emotions in their natural
setting. For instance, a researcher may observe how a child interacts with their peers on the
playground or how a person responds to stress in their workplace. These observations provide
valuable information about the subject's behavior and emotions in real-life situations.

Non-naturalistic observation involves observing subjects in a controlled environment, where the researcher manipulates the setting or introduces specific tasks to elicit particular behaviors or
emotions. This method is used to measure specific aspects of behavior or emotions that may not
occur naturally in the subject's environment. For instance, a researcher may ask a subject to
complete a cognitive task while measuring their brain activity through electroencephalography
(EEG). Non-naturalistic observation allows researchers to control the conditions under which
they observe the subject, enabling them to collect more precise and specific data.

Both naturalistic and non-naturalistic observation have their advantages and disadvantages, and
both can be valuable in psychological assessments. Naturalistic observation provides valuable
insights into the subject's behavior in their natural environment, allowing researchers to see how
they behave and interact with others. On the other hand, non-naturalistic observation provides
researchers with more controlled conditions, allowing them to collect more precise and specific
data about the subject's behavior and emotions.

Participant research, also known as participatory research, involves a trained investigator who
joins a pre-existing group to study it from within without disrupting group processes or biasing
the data. The researcher may assume a known or unknown role in the group, avoiding any
conspicuous behavior that could alter the group's dynamics. For example, cultural
anthropologists often become participant observers when they immerse themselves in a
particular culture to study its social structure and processes.

In behavioral observation, there are different methods of data recording to measure and document observable behaviors. Four common methods are narrative recording, interval recording, rating recording, and event recording. Here's an explanation of each method and a critical evaluation of their effectiveness (a small data-recording sketch in code follows the list):

1. Narrative recording: In this method, the observer records a detailed description of the subject's behavior in the form of a narrative or written account. Narrative recording allows for a detailed and descriptive account of the subject's behavior, but it can be time-consuming and challenging to quantify the data collected.

2. Interval recording: In interval recording, the observer divides the observation period into equal intervals of time and records whether the behavior occurred or not during each interval. This method is useful for measuring the frequency and duration of behaviors over time, but it may not capture the behavior's full complexity or nuances. Additionally, this method may be biased if the intervals chosen do not accurately reflect the behavior's patterns or if the observer misses the behavior during an interval.

3. Rating recording: In rating recording, the observer assigns a score to the behavior
based on specific criteria. For example, the observer may use a Likert scale to rate
the intensity of a behavior. Rating recording allows for easy quantification of data,
but it may not provide a complete picture of the behavior or the reasons behind it.

4. Event recording: In event recording, the observer records each occurrence of a specific behavior. This method is useful for measuring behaviors that are infrequent or brief, but it may be challenging to use when multiple behaviors occur simultaneously.
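To make the contrast concrete, here is a small hypothetical Python sketch: the same stream of time-stamped behavior observations is summarized once by event recording (a simple count) and once by interval recording (a yes/no judgment per interval). The event times and interval length are invented for illustration.

```python
# Hypothetical sketch: summarizing one observation session two ways.
# Times (in seconds) at which a target behavior was observed.
event_times = [3.2, 7.9, 8.4, 21.0, 55.5, 58.1, 119.7]
session_length = 120   # 2-minute observation session
interval_length = 30   # four 30-second intervals

# Event recording: count every occurrence of the behavior.
event_count = len(event_times)

# Interval recording: for each interval, record only whether the
# behavior occurred at least once (a 1/0 judgment per interval).
n_intervals = session_length // interval_length
intervals = [
    any(i * interval_length <= t < (i + 1) * interval_length
        for t in event_times)
    for i in range(n_intervals)
]

print("Event recording:", event_count, "occurrences")         # 7
print("Interval recording:", [int(x) for x in intervals])     # [1, 1, 0, 1]
```

Note how interval recording compresses the three occurrences in the first 30 seconds into a single "yes", which is exactly the loss of nuance described above.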

Critical evaluation:

Each method has its strengths and weaknesses. It's essential to choose the most
appropriate method based on the research question, behavior under observation, and the
observation's goal. For instance, event recording may be the best option to capture infrequent
behaviors, while interval recording may be better suited to measure behaviors over time.
Furthermore, it's essential to ensure that the observer is appropriately trained to minimize
observer bias and increase reliability. Finally, researchers should use multiple data collection
methods to triangulate their findings and increase the accuracy and validity of their observations.


Interviews
Interviews are a type of assessment that involves the collection of information from an
individual through a structured or unstructured conversation. Interviews are a flexible and
versatile method that can be used to collect data on various aspects of an individual's life, such as
their thoughts, feelings, behaviors, and experiences. Gregory and Kaplan (2012) described
interviews as a "conversation with a purpose" and emphasized that interviews require skill and
training to conduct effectively. The American Psychological Association (2023) defines
interview as, “a directed conversation in which a researcher, therapist, clinician, employer, or the
like (the interviewer) intends to elicit specific information from an individual (the interviewee)
for purposes of research, diagnosis, treatment, or employment.”
The primary aims of an interview are to obtain information that is difficult to gather
through other means, foster a positive relationship between the interviewer and interviewee,
enhance understanding of problem behavior for both parties, and offer guidance and assistance to
the interviewee in addressing problem behaviors. To accomplish these goals, the interviewer
must not only guide the conversation towards specific objectives, but also possess knowledge
about the topics to be addressed during the interview (Groth-Marnat, 2003).

● One type of interview is the structured interview, which involves a predetermined set of
questions that are asked in a specific order. Structured interviews are often used in
clinical settings to gather information on specific symptoms or to assess a particular
disorder. For example, structured clinical interviews keyed to the Diagnostic and Statistical Manual of Mental Disorders (DSM-5), such as the Structured Clinical Interview for DSM-5 (SCID-5), are used to diagnose mental health conditions such as major depressive disorder (APA, 2013).

● Another type of interview is the unstructured interview, which is more flexible and allows
for spontaneous conversation and follow-up questions. Unstructured interviews are often
used in research settings to explore a particular phenomenon or to gain a deeper
understanding of an individual's experiences. For example, in a study on the experiences
of cancer patients, researchers may conduct unstructured interviews to explore the
emotional, social, and physical challenges that patients face (Molassiotis et al., 2012).
● The semi-structured interview is more flexible than the structured interview, with a core
set of questions supplemented by additional questions that are tailored to the individual
being interviewed. This type of interview allows for more in-depth exploration of specific
issues that arise during the interview. An example of a semi-structured interview is the
diagnostic interview, which is used to gather information about a person's symptoms and
to make a diagnosis.

In addition to these types of interviews, mental status examinations (MSEs) and case histories
are also commonly used in psychological assessment. An MSE is a structured assessment of a
person's current mental state, including their mood, behavior, thought processes, and cognitive
functioning. It is typically conducted in a clinical setting, and can provide valuable information
for diagnosing and treating mental health disorders.

A case history is a comprehensive assessment of a person's past and current psychological functioning, typically obtained through a combination of interviews with the person and their
family members, review of medical records, and psychological testing. This type of assessment
can provide important context for understanding a person's current psychological state, and can
help guide treatment planning.

Interviews can also be used in employment settings to assess an individual's skills, knowledge, and experience. For example, during a job interview, employers may ask structured
or unstructured questions to assess a candidate's qualifications and fit for the position.

Despite their versatility and flexibility, interviews also have limitations. One limitation is
the potential for interviewer bias, which can occur when the interviewer's biases or assumptions
influence the questions asked or the interpretation of the responses. To minimize interviewer
bias, it is important for interviewers to receive training on how to conduct interviews objectively
and without leading the respondent.

Scales and Tests.


A scale is defined as "a system of measurement for a cognitive, social, emotional, or behavioural variable or function, such as personality, intelligence, attitudes, or beliefs" (American Psychological Association [APA], 2023). Scales are a type of measurement tool that
is typically less formal than tests. The items of a scale do not have any right or wrong response
and are often self-reported, meaning that individuals complete the assessment themselves by
rating their own thoughts, feelings, or behaviours on a predetermined scale (American
Psychological Association [APA], 2023). Some common examples of scales include attitude
tests, self-concept questionnaires, and personality tests.
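Because scale items have no right or wrong answers and are usually self-rated, scoring typically amounts to summing (or averaging) the ratings after reverse-keying negatively worded items. The Python sketch below is purely illustrative; the responses and keying are hypothetical and not drawn from any published scale.

```python
# Hypothetical sketch: scoring a 5-item self-report Likert scale
# (responses from 1 = strongly disagree to 5 = strongly agree).
responses = [4, 2, 5, 1, 4]                        # one respondent's ratings
reverse_keyed = [False, True, False, True, False]  # negatively worded items

SCALE_MIN, SCALE_MAX = 1, 5

def score_item(response, reverse):
    # Reverse-keying flips the response around the scale midpoint:
    # on a 1-5 scale, 1 <-> 5, 2 <-> 4, and 3 stays 3.
    return (SCALE_MIN + SCALE_MAX - response) if reverse else response

total = sum(score_item(r, rev) for r, rev in zip(responses, reverse_keyed))
print("Scale score:", total)  # 4 + 4 + 5 + 5 + 4 = 22 (possible range 5-25)
```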

Differences between the two:


- Psychological tests and scales are frequently used in psychology research and practice,
although they differ in their purpose and complexity. Tests are designed to evaluate an
individual's overall functioning, while scales focus on specific constructs or
characteristics. Tests are often administered and scored by trained professionals, and the
results may require extensive interpretation and analysis. Scales, on the other hand, are
usually simpler to administer and score, making them more accessible to individuals
without specialized training.

- Tests are generally more reliable and valid than scales due to their standardized
administration and comprehensive nature. The difficulty level and specific ranges of a
test distinguish it from a scale. Tests have an interval scale, while scales have only
nominal and ordinal levels of measurement. Additionally, parametric statistics can be
applied to tests, whereas only non-parametric statistics can be applied to scales.
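To illustrate the level-of-measurement distinction drawn above, the hypothetical sketch below applies a parametric t-test to interval-level test scores and a non-parametric Mann-Whitney U test to ordinal scale ratings (using SciPy; all data are invented for illustration).

```python
# Illustrative sketch of the parametric/non-parametric distinction.
# All data below are hypothetical.
from scipy import stats

# Interval-level test scores (e.g., an ability test) for two groups:
group_a_scores = [102, 110, 98, 115, 107, 93, 121, 104]
group_b_scores = [95, 99, 101, 88, 106, 92, 97, 100]

# Parametric comparison (assumes interval data, roughly normal):
t_stat, t_p = stats.ttest_ind(group_a_scores, group_b_scores)

# Ordinal ratings from a 5-point scale for the same two groups:
group_a_ratings = [4, 5, 3, 4, 5, 4, 3, 4]
group_b_ratings = [3, 2, 4, 3, 3, 2, 3, 3]

# Non-parametric comparison (only rank order is assumed meaningful):
u_stat, u_p = stats.mannwhitneyu(group_a_ratings, group_b_ratings)

print(f"t-test:       t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Mann-Whitney: U = {u_stat:.1f}, p = {u_p:.3f}")
```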

Types of Scales and Tests:

Scales and tests used in psychological assessment can be classified into various categories based
on their purpose, mode of administration, and the constructs they assess. Some of the commonly
used scales and tests in psychological assessment include:

Intelligence tests: These tests are designed to measure cognitive abilities such as
problem-solving, reasoning, memory, and linguistic skills. Examples of intelligence tests include
Wechsler Adult Intelligence Scale (WAIS) and Raven's Progressive Matrices.

Personality tests: These tests are used to assess personality traits and characteristics such as
extroversion, neuroticism, and conscientiousness. Examples of personality tests include
Minnesota Multiphasic Personality Inventory (MMPI) and NEO Personality Inventory.

Neuropsychological tests: These tests assess cognitive and behavioral functioning related to
brain structures and processes. Examples of neuropsychological tests include Halstead-Reitan
Neuropsychological Battery and Wisconsin Card Sorting Test.

Projective tests: These tests use ambiguous stimuli to elicit responses that reveal unconscious
motives, conflicts, and emotions. Examples of projective tests include Rorschach Inkblot Test
and Thematic Apperception Test (TAT).

Functions of Scales and Tests in Psychological Assessment:

Scales and tests serve several functions in psychological assessment, including:

- Screening and diagnosis: Scales and tests can be used to screen individuals for
psychological problems or disorders and to diagnose them based on their scores.

- Treatment planning and evaluation: Scales and tests can be used to plan appropriate
interventions for individuals based on their psychological profile and to evaluate the
effectiveness of the treatment.

- Research: Scales and tests can be used in research studies to investigate the relationships
between psychological constructs and to test theories.

Limitations of Scales and Tests in Psychological Assessment:

Despite their usefulness, scales and tests have some limitations that should be considered when
using them in psychological assessment, including:

-Cultural bias: Some scales and tests may be culturally biased, meaning they may not accurately
measure psychological constructs in individuals from diverse cultural backgrounds.

-Faking: Individuals may intentionally or unintentionally provide inaccurate responses on scales and tests, leading to false or misleading results.

-Limited scope: Scales and tests may only measure specific constructs or aspects of
psychological functioning, meaning that they may not provide a comprehensive picture of an
individual's psychological profile.

Report writing and communication of results


Except for group testing, the practice of psychological testing invariably culminates in a written
report that constitutes a semipermanent record of test findings and examiner recommendations.
Effective report writing is an important skill because of the potential lasting impact of the
written document. It is beyond the scope of this text to illuminate the qualities of effective report
writing, although we can refer the reader to a few sources (Gregory, 1999; Tallent, 1993).

Responsible reports typically use simple and direct writing that steers clear of jargon and
technical terms. The proper goal of a report is to provide helpful perspectives on the client, not
to impress the referral source that the examiner is a learned person! When Tallent (1993)
surveyed more than one thousand health practitioners who made referrals for testing, one
respondent declared his disdain toward psychologists who “reflect their needs to shine as a
psychoanalytic beacon in revealing the dark, deep secrets they have observed.” On a related
note, effective reports stay within the bounds of expertise of the examiner. For example: It is
never appropriate for a psychologist to recommend that a client undergo a specific medical
procedure (such as a CT scan for an apparent brain tumor) or receive a particular drug (such as
Prozac for depression). Even when the need for a special procedure seems obvious (e.g., the
symptoms strongly attest to the rapid onset of a brain disease), the best way to meet the needs of
the client is to recommend immediate consultation with the appropriate medical profession (e.g.,
neurology or psychiatry). (Gregory, 1999)

The views of Ownby (1991) and Sattler (2001):

Effective report writing is an essential skill for psychologists conducting psychological assessments. It involves the ability to accurately and clearly communicate the findings of the
assessment to both the client and any other parties involved in the assessment process, such as
other healthcare professionals. In order to improve the quality of psychological reports, Ownby
(1991) and Sattler (2001) have provided valuable advice for effective report writing.

Ownby (1991) emphasizes the importance of writing reports that are clear, concise, and free of
jargon. According to Ownby, the use of jargon can obscure the meaning of a report and make it
difficult for clients to understand the findings. Ownby recommends using simple and direct
language to convey information to clients. Additionally, Ownby suggests that reports should
focus on the most important findings and be organized in a logical and easy-to-follow format.

Similarly, Sattler (2001) advocates for clear and concise writing in psychological reports. Sattler
also stresses the importance of using language that is understandable to clients, and cautions
against the use of technical terms that may be unfamiliar to them. Sattler suggests that reports
should be organized in a way that is easy to read and follow, with sections that clearly outline the
purpose of the assessment, the methods used, and the results obtained.

In addition to these suggestions, both Ownby and Sattler emphasize the importance of tailoring
reports to the needs of the client and any other parties involved in the assessment process. For
example, reports may need to be written differently depending on whether they are being
provided to a medical doctor or an attorney. Ownby also recommends that reports be written
with the client's level of education and understanding in mind, and that the tone of the report be
respectful and empathetic.

Finally, both Ownby and Sattler stress the importance of ethical and professional considerations
in report writing. Reports must be accurate, objective, and based on valid and reliable data.
Additionally, confidentiality and privacy must be respected at all times, and reports should be
shared only with those who have a legitimate need to know the information contained in them.

Communication of results

Practitioners often do not include one-to-one feedback as part of the assessment. A major reason
for reluctance is a lack of training in how to provide feedback, especially when the test results
appear to be negative. For example, how does a clinician tell a college student that her IQ is 93
when most students in that milieu score 115 or higher?
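Part of preparing such feedback is translating the score into population terms. IQ scales are conventionally normed with a mean of 100 and a standard deviation of 15, so a score can be converted to a z-score and percentile. A minimal Python sketch using the example score above:

```python
# Converting an IQ score to a z-score and percentile rank,
# assuming the conventional IQ norms (mean 100, SD 15).
from statistics import NormalDist

MEAN, SD = 100, 15
iq = 93

z = (iq - MEAN) / SD                     # z ~= -0.47
percentile = NormalDist().cdf(z) * 100   # ~32nd percentile overall

print(f"IQ {iq}: z = {z:.2f}, ~{percentile:.0f}th percentile")
# Note: in a selective college milieu where typical scores run 115+,
# the same score sits much lower relative to local peers, which is
# why careful, contextualized feedback matters.
```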

Providing effective and constructive feedback to clients about their test results is a challenging
skill to learn. Pope (1992) emphasizes the responsibility of the clinician to determine that the
client has understood adequately and accurately the information that the clinician was attempting
to convey. Furthermore, it is the responsibility of the clinician to check for adverse reactions: Is
the client exceptionally depressed by the findings? Is the client inferring from findings
suggesting a learning disorder that the client—as the client has always suspected—is “stupid”?
Using scrupulous care to conduct this assessment of the client’s understanding of and reactions
to the feedback is no less important than using adequate care in administering standardized
psychological tests; test administration and feedback are equally important, fundamental aspects
of the assessment process.

Destructive feedback can occur when clinicians do not challenge their clients' incorrect
perceptions of test results, particularly in the case of IQ tests where many people consider the
scores as an index of personal worth. Before providing test results, clinicians are advised to
investigate the client's understanding of what IQ scores mean and challenge any unrealistic
perspectives. IQ scores are limited and do not evaluate important attributes such as creativity,
social intelligence, musical ability, or athletic skill. Clinicians should elicit and challenge their
clients' views when necessary before proceeding.

(EXTRA-can be skipped) Finn and Tonsager (1997) propose that psychological assessment is
not only a way to gather information for therapeutic purposes, but also a short-term intervention
that can directly and immediately help individuals with psychological problems. They conducted
a study where clients who received brief psychological assessment, including MMPI-2 results
and feedback, showed a greater decline in symptomatic distress and a greater increase in
self-esteem immediately after the feedback session and two weeks later, compared to a group of
clients who received supportive, nondirective psychotherapy instead of test feedback. The study
highlights the importance of providing thoughtful and constructive test feedback that can have
therapeutic benefits.

Ethics: issues and guidelines


(same answer for both; can be asked either way)

During the approximately 80 years that psychologists have been conducting formal assessment, a
number of ethical guidelines have gradually evolved to ensure that appropriate professional
relationships and procedures are developed and maintained. These guidelines have largely
evolved through careful considerations of what constitutes ideal practice. Many of these
considerations have been highlighted and refined because of difficulties surrounding assessment
procedures. Criticism has been directed at the use of tests in inappropriate contexts,
confidentiality, cultural bias, invasion of privacy, and the continued use of tests that are
inadequately validated. This has resulted in restrictions on the use of certain tests, greater
clarification within the profession regarding ethical standards, and increased skepticism from the
public.

To deal with these potential difficulties as well as conduct useful and accurate assessments,
clinicians need to be aware of the ethical use of assessment tools. The American Educational
Research Association (AERA) and other professional groups have published guidelines for
examiners in their Standards for Educational and Psychological Testing (1999), Ethical Principles
of Psychologists and Code of Conduct (American Psychological Association, 1992), and
Guidelines for Computer-Based Test Interpretations (American Psychological Association,
1986). A special series in the Journal of Personality Assessment (Russ, 2001) also elaborates on
ethical dilemmas found in training, medical, school, and forensic settings. The following section
outlines the most important of these guidelines along with additional related issues.

1. Developing a Professional Relationship


- Assessment should be conducted only in the context of a clearly defined professional
relationship. This means that the nature, purpose, and conditions of the relationship are
discussed and agreed on. Usually, the clinician provides relevant information, followed
by the client’s signed consent.
- Information conveyed to the client usually relates to the type and length of assessment,
alternative procedures, details relating to appointments, the nature and limits of
confidentiality, financial requirements, and additional general information that might be
relevant to the unique context of an assessment.
- The quality of the relationship can have a significant effect on both assessment results and the overall working relationship. It is the examiner's responsibility to recognize the possible
influences he or she may exert on the client and to optimize the level of rapport. For
example, enhanced rapport with older children (but not younger ones) involving verbal
reinforcement and friendly conversation has been shown to increase WISC-R scores by
an average of 13 IQ points compared with an administration involving more neutral
interactions (Feldman & Sullivan, 1971). This is a difference of nearly one full standard
deviation. It has also been found that mildly disapproving comments such as “I thought
you could do better than that” resulted in significantly lowered performance when
compared with either neutral or approving ones.

- The obvious implication for clinicians is that they should continually question themselves
regarding their expectations of clients and check to see whether they may in some way be
communicating these expectations to their clients in a manner that confounds the results.
- An additional factor that may affect the nature of the relationship between the client and
the examiner is the client’s relative emotional state. It is particularly important to assess
the degree of the client’s motivation and his or her overall level of anxiety. There may be
times in which it would be advisable to discontinue testing because situational emotional
states may significantly influence the results of the tests. At the very least, examiners
should consider the possible effects of emotional factors and incorporate these into their
interpretations.
- A final consideration, which can potentially confound both the administration and, more
commonly, the scoring of responses, is the degree to which the examiner likes the client
and perceives him or her as warm and friendly. Thus, examiners should continually check
themselves to assess whether their relationship with the client is interfering with the
objectivity of the test administration and scoring.

Invasion of privacy
- One of the challenges of psychological testing is the possibility of the client revealing
personal information they would prefer to keep private, which may be used in ways not in
their best interest. Privacy is a fundamental right of the individual to decide how much
they will share about their personal life. Personality tests are particularly controversial in
this regard because they may reveal traits that are disguised or not obvious to the person
being tested, potentially exposing personal information they would rather keep private.
Similarly, IQ scores are often considered highly personal.
- Even though there are ethical guidelines regarding invasion of privacy in psychological
testing, dilemmas can still arise during personnel selection. Applicants may feel
pressured to reveal personal information on tests and have no control over the inferences
that examiners make about the test data. However, for positions that require careful
screening and where serious negative consequences may result from poor selection,
testing may be necessary, such as in police work, delicate military positions, or important
public duty overseas.
- Adequate handling of the issue of an individual’s right to privacy involves both a clear
explanation of the relevance of the testing and obtaining informed consent. Examiners
should always have a clear conception of the specific reasons for giving a test.
Furthermore, the general rationale for test selection should be provided in clear,
straightforward language that can be understood by the client. Informed consent involves
communicating not only the rationale for testing, but also the kinds of data obtained and
the possible uses of the data.

- Introducing the test format and intent in a simple, respectful, and forthright manner
significantly reduces the chance that the client will perceive the testing situation as an
invasion of privacy.

Inviolacy
The concept of inviolacy refers to negative feelings that clients may experience when confronted
with questions or topics that they would rather not think about during an assessment. For
instance, psychological tests like MMPI can include questions about taboo topics like sexual
practices and personal beliefs that may make examinees more aware of their deviant thoughts or
repressed unpleasant memories, thus provoking anxiety. This issue requires sensitivity and clear
communication about the testing procedure to help clients feel more comfortable and secure
during the assessment.

Labeling and Restriction of Freedom

- When individuals are given a medical diagnosis for physical ailments, the social stigma is usually relatively mild. In contrast are the potentially damaging consequences of many psychiatric diagnoses. A major danger is the possibility of creating a self-fulfilling prophecy based on the expected roles associated with a specific label. Many of these expectations are communicated nonverbally and are typically beyond a person's immediate awareness.
- Other self-fulfilling prophecies may be less subtle. For example, a person labeled a chronic schizophrenic may be given only minimal treatment on the assumption that chronic schizophrenics rarely respond, and may then fail to improve, perhaps mainly because of having received suboptimal treatment.
- Another negative consequence of labeling is the social stigma attached to different
disorders. Thus, largely because of the public’s misconceptions of terms such as
schizophrenia, labeled individuals may be socially avoided.
- Just as labels imposed by others can have negative consequences, self-acceptance of
labels can likewise be detrimental. Clients may use their labels to excuse or deny
responsibility for their behavior (sick role behaviour).
- A final difficulty associated with labeling is that it may unnecessarily impose limitations
on either an individual or a system by restricting progress and creativity. One alternative
to this predicament is to look at future trends and develop selection criteria based on
these trends. Furthermore, diversity might be incorporated into an organization so that
different but compatible types can be selected to work on similar projects. Thus,
clinicians should be sensitive to the potential negative impact resulting from labeling by
outside sources or by self-labeling, as well as to the possible limiting effects that labeling
might have.

Competent Use of Assessment Instruments


● To correctly administer and interpret psychological tests, an examiner must have proper
training, which generally includes adequate graduate course work, combined with lengthy
supervised experience (Turner et al., 2001).
● Clinicians should have a knowledge of tests and test limitations, and should be willing to
accept responsibility for competent test use. Intensive training is particularly important
for individually administered intelligence tests and for the majority of personality tests.
● Students who are taking or administering tests as part of a class requirement are not
adequately trained to administer and interpret tests professionally. Thus, test results
obtained by students have questionable validity, and they should clearly inform their
subjects that the purpose of their testing is for training purposes only.
● Tests should only be used for the purposes they were designed for, and using them
beyond their intended use can lead to inaccurate conclusions. This is especially important
when using tests for personnel selection as the information obtained can be of a highly
personal nature and may represent an invasion of privacy. The MMPI-2, for example, is
designed to assess psychopathology and should not be used to assess a normal person's
level of functioning or for job-related skills assessment.

Skills needed to be acquired by the examiner in this regard


● These include the ability to evaluate the technical strengths and limitations of a test, the
selection of appropriate tests, knowledge of issues relating to the test’s reliability and
validity, and interpretation with diverse populations.
● Examiners need to be aware of the material in the test manual as well as relevant research
both on the variable the test is measuring and the status of the test since its publication.
This is particularly important with regard to newly developed subgroup norms and
possible changes in the meaning of scales resulting from further research.
● After examiners evaluate the test itself, they must also be able to evaluate whether the
purpose and context for which they would like to use it are appropriate.
● To help develop accurate conclusions, examiners should have a general knowledge of the
diversity of human behavior. Different considerations and interpretive strategies may be
necessary for various ethnic groups, sex, sexual orientation, or persons from different
countries.
● A final consideration is that, if interns or technicians are administering the tests, an
adequately trained psychologist should be available as a consultant or supervisor.

Interpretation and Use of Test Results


- Interpreting test results should never be considered a simple, mechanical procedure.
Accurate interpretation means not simply using norms and cutoff scores, but also taking
into consideration unique characteristics of the person combined with relevant aspects of
the test itself. Whereas tests themselves can be validated, the integration of information
from a test battery is far more difficult to validate. It is not infrequent, for example, to
have contradictions among different sources of data. It is up to the clinician to evaluate
these contradictions to develop the most appropriate, accurate, and useful interpretations.
If there are significant reservations regarding the test interpretation, this should be
communicated, usually in the psychological report itself.
- A further issue is that test norms and stimulus materials eventually become outdated. As
a result, interpretations based on these tests may become inaccurate. This means that
clinicians need to stay current on emerging research and new versions of tests. A rule of
thumb is that if a clinician has not updated his or her test knowledge in the past 10 years,
he or she is probably not practicing competently.

Communicating Test Results


- Psychologists should ordinarily give feedback to the client and referral source regarding
the results of assessment (Lewak & Hogan, 2003; also see Pope, 1992 for specific
guidelines and responsibilities). This should be done using clear, everyday language. If
the psychologist is not the person giving the feedback, this should be agreed on in
advance and the psychologist should ensure that the person providing the feedback
presents the information in a clear, competent manner.
- Unless the results are communicated effectively, the purpose of the assessment is not
likely to be achieved. This involves understanding the needs and vocabulary of the
referral source, client, and other persons, such as parents or teachers, who may be
affected by the test results. Initially, there should be a clear explanation of the rationale
for testing and the nature of the tests being administered. This may include the general
type of conclusions that are drawn, the limitations of the test, and common
misconceptions surrounding the test or test variable. If a child is being tested in an
educational setting, a meeting should be arranged with the school psychologist, parents,
teacher, and other relevant persons. Such an approach is crucial for IQ tests, which are
more likely to be misinterpreted, than for achievement tests. Feedback of test results
should be given in terms that are clear and understandable to the receiver.
- In providing effective feedback, the clinician should also consider the personal
characteristics of the receiver, such as his or her general educational level, relative
knowledge regarding psychological testing, and possible emotional response to the
information. The emotional reaction is especially important when a client is learning
about his or her personal strengths or shortcomings

Maintenance of Test Security and Assessment Information


● Psychologists should ensure that test materials are secure to maintain their validity. Tests
should be kept locked in a secure place, and untrained persons should not review them.
Copyrighted material should not be duplicated, and raw data from tests should not be
released to clients or others who may misinterpret them. Clients have the right to request
reports, and the information can be released to a person they designate with a written
request.
● The security of assessment results should be maintained by limiting access to designated
persons or those who have been authorized by the client. However, this may be difficult
to achieve in certain medical contexts where all members of the treatment team have
access to patient records. There is also a conflict between patient autonomy and the
benefits of having a treatment team with access to records. In managed healthcare
environments, access to patient records by various organizations can compromise the
security of assessment results. Large interconnected databases potentially having access
to patient data also poses a threat to the security of client records.
● In legal situations, the court or opposing counsel may want to see test materials or raw
data, but psychologists should inform them that ethical guidelines and agreements with
test distributors prohibit the release of this information to untrained individuals. Instead, a
trained person may be designated to explain or describe the information.

TEST BIAS AND USE WITH MINORITY GROUPS


Bias in testing refers to the presence of systematic error in the measurement of certain factors
(e.g., academic potential, intelligence, psychopathology) among certain individuals or groups.
The possible presence of bias toward minority groups has resulted in one of the most
controversial issues in psychological testing. More specifically, critics believe that psychological
tests are heavily biased in favor of, and reflect the values of, European American, middle-class
society. They argue that such tests cannot adequately assess intelligence or personality when
applied to minority groups. Whereas the greatest controversy has arisen from the use of
intelligence tests, the presence of cultural bias is also relevant in the use of personality testing.
Over the past decade, discussion over bias has shifted from controversy over the nature and
extent of bias to a more productive working through of how to make the most valid and equitable
assessment based on current knowledge (see Dana, 2000; Handel & Ben-Porath, 2000; Sandoval
et al., 1999).

Some issues associated with Computer-assisted psychological assessment (CAPA)


For assessment professionals, some major issues with regard to CAPA are as follows.
■ Access to test administration, scoring, and interpretation software. Despite purchase
restrictions on software and technological safeguards to guard against unauthorized copying,
software may still be copied. Unlike test kits, which may contain manipulatable objects,
manuals, and other tangible items, a computer-administered test may be easily copied and
duplicated.

■ Comparability of pencil-and-paper and computerized versions of tests. Many tests once available only in a paper-and-pencil format are now available in computerized form as well. In many instances, the comparability of the traditional and the computerized forms of the test has not been researched or has only insufficiently been researched.
■ The value of computerized test interpretations. Many tests available for computerized
administration also come with computerized scoring and interpretation procedures. Thousands of
words are spewed out every day in the form of test interpretation results, but the value of these
words in many cases is questionable.
■ Unprofessional, unregulated “psychological testing” online. A growing number of Internet
sites purport to provide, usually for a fee, online psychological tests. Yet the vast majority of the
tests offered would not meet a psychologist’s standards. Assessment professionals wonder about
the long-term effect of these largely unprofessional and unregulated “psychological testing”
sites. Might they, for example, contribute to more public skepticism about psychological tests?

The Rights of Testtakers (Need not mention if elaborated in the previous points)
- Right of informed consent
- The right to be informed of test findings
- The right to privacy and confidentiality
- The right to the least stigmatizing label

UNIT 2: PSYCHOLOGICAL TESTING


MEANING OF TEST IN PSYCHOLOGY AND EDUCATION

According to the dictionary, 'test' is defined as a series of questions on the basis of which some
information is sought.

A psychological (or an educational) test is a standardized procedure to measure quantitatively or qualitatively one or more than one aspect of a trait by means of a sample of verbal or nonverbal
behaviour. The purpose of a psychological test is twofold. First, it attempts to compare the same
individual on two or more than two aspects of a trait; and second, two or more than two persons
may be compared on the same trait. Such a measurement may be either quantitative or
qualitative.

In the words of Bean (1953, 11), a test is "an organized succession of stimuli designed to measure quantitatively or to evaluate qualitatively some mental process, trait or characteristic."

Likewise Anastasi and Urbina (1997) have defined a test as "essentially an objective and
standardized measure of sample of behaviour".

Similarly, Cullari (1998) has said, "A test is a standardized procedure for sampling behaviour and
describing it with scores or categories".

Kaplan and Saccuzzo (2001) have opined, "A psychological test or educational test is a set of
items designed to measure characteristics of human beings that pertain to behaviour".

These definitions reveal some important characteristics of a psychological and educational test:

First, a test is an organized succession of stimuli, which means that the stimuli (popularly known as items) in the test are organized in a certain sequence and are based upon some principles of test construction. Items are processed through item analysis. Test standardisation becomes important for its uniformity. The lack of standardization may change not only the character of the test but also its difficulty level, which may ultimately reduce the validity of the test.

Second, both quantitative and qualitative measurements are possible through psychological and
educational tests. The reading ability of a child may be quantitatively measured with the help of
a test specially designed for the purpose. His reading-ability score may be evaluated (or qualitatively measured) with respect to the average performance of the reading ability of the
other children of his age or class. Thus, a test provides both quantitative and qualitative
measurement of a trait.

Third, a psychological test is based upon a limited sample of behaviour. This means that a psychological or educational test does not assess the totality of a person's behaviour; rather, it focuses on a limited aspect of that behaviour. For example, when testing the vocabulary of
a person, the test constructor must settle for a sample of 40 to 50 words and predict that person's
word knowledge from this limited sample. In reality, the totality of the person's word knowledge
might be poorer or stronger than the 40- or 50-word vocabulary test suggests. Obviously, then, the implication of the test-as-sample concept is that the test score invariably contains some degree of error. Such measurement errors can be minimized by means of careful test design, but they can never be fully eliminated.
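The test-as-sample idea can be made concrete with the binomial standard error of a proportion, SE = sqrt(p(1 - p)/n): the observed proportion correct on a 50-item sample only estimates the examinee's true proportion of words known. The numbers in the Python sketch below are hypothetical.

```python
# Hypothetical sketch: measurement error in a test-as-sample.
# An examinee answers 38 of 50 vocabulary items correctly.
import math

correct, n_items = 38, 50
p_hat = correct / n_items                    # observed proportion = 0.76

# Binomial standard error of the proportion estimate:
se = math.sqrt(p_hat * (1 - p_hat) / n_items)

# Rough 95% interval for the true proportion of words known:
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"Estimate: {p_hat:.2f}, 95% interval ~ [{low:.2f}, {high:.2f}]")
# Quadrupling the number of items would halve the standard error:
# careful design reduces, but never removes, measurement error.
```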

Fourth, psychological tests usually provide scores or categories which are, subsequently,
interpreted with reference to a standardization sample. The standardization sample should be
representative of the population for whom the test is meant so that it may be possible to evaluate
each person's test score or results in comparison to the reference group.

Some psychological tests are norm-referenced, which means that results from them are interpreted with reference to the average performance of the standardization sample, whereas some psychological tests are criterion-referenced, which are used to determine where an examinee stands with reference to a tightly defined criterion or educational objective. On such tests, the comparison is done with an objective standard rather than with the performance of other examinees.
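The two interpretive frames can be contrasted in a short hypothetical sketch: the same raw score is judged once against a standardization sample (norm-referenced, via a z-score and percentile) and once against a fixed mastery cutoff (criterion-referenced). All values are invented.

```python
# Hypothetical sketch: norm-referenced vs criterion-referenced
# interpretation of the same raw score.
from statistics import NormalDist

raw_score = 42                # examinee's raw score
norm_mean, norm_sd = 36, 6    # standardization-sample statistics
criterion_cutoff = 45         # e.g., 45/60 items defined as "mastery"

# Norm-referenced: where does the score fall relative to the norm group?
z = (raw_score - norm_mean) / norm_sd
percentile = NormalDist().cdf(z) * 100
print(f"Norm-referenced: z = {z:.2f}, ~{percentile:.0f}th percentile")

# Criterion-referenced: has the examinee met the objective standard?
status = "mastery" if raw_score >= criterion_cutoff else "non-mastery"
print(f"Criterion-referenced: {status} (cutoff = {criterion_cutoff})")
```

Here the same score is comfortably above average relative to the norm group (about the 84th percentile) yet still falls short of the criterion, which is exactly the difference between the two interpretive frames.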

Difference between testing and assessment

Finally, to make the meaning of a psychological or educational test clear, it is also essential to distinguish between testing and assessment. In fact, testing is a more limited venture which primarily consists of administering, scoring and interpreting test scores. Assessment, on the other hand, is a more comprehensive and wider term that includes the entire process of compiling and synthesizing information to make a prediction about the person.

CLASSIFICATION OF TESTS

1. On the basis of the criterion of administrative conditions


Tests have been classified into two types on the basis of administrative conditions: individual tests and group tests. Individual tests are those tests that are administered to one person at a time. Kohs Block Design Test is an example of an individual test. Individual tests are often used by school psychologists and counselors to motivate children and to observe how they respond.
Some individually administered tests are given orally, and they require the constant attention of
the examiner. Individual tests, in general, have two limitations, i.e. such tests are time-consuming
and require the services of trained and experienced examiners. As such, these tests are used only
when a crucial decision is necessary.
Group tests are tests which can be used with more than one person, or a group, at a time. The Bell Adjustment Inventory is an example of a group test. Besides assessing adjustment, group tests are adequate for measuring cognitive skills to survey the achievements, strengths and weaknesses of the students in the classroom, etc.

2. On the basis of the criterion of scoring


Scoring is one of the vital parts of a test. Based upon this criterion, tests are classified into two types: objective tests and subjective tests. Objective tests are those whose items are scored in such a way that no scope for subjective judgement or opinion exists and thus, the scoring remains unambiguous. Tests having multiple-choice, true-false and matching items are usually called objective tests. In such items, the problem as well as its answer is given along with the distractors. The problem is known as the stem of the item.
Subjective tests are tests whose items are scored in a way in which there exists some scope for
subjective judgement and opinion. As a consequence, some elements of vagueness and
ambiguity remain in their scoring. These are also called essay tests. Such tests are intended to
assess an examinee's ability to organize a comprehensive answer, recall and select important
information, and present the same logically and effectively. Since in these tests the examinee is
free to write and organize the answer, they are also known as free-answer tests.

3. On the basis of the criterion of time limit in producing the response


Another way of classifying tests is whether they emphasize time limits or not. On the basis of this criterion, tests are classified into power tests and speed tests. A power test is one which has a generous time limit so that most examinees are able to attempt every item. Usually such tests have items which are arranged in increasing order of difficulty. Most intelligence tests and aptitude tests belong to the category of power tests. In fact, power tests demonstrate how much knowledge or information the examinees have.
Speed tests are those that have severe time limits but the items are comparatively easy and the difficulties involved therein are more or less of the same degree. Here, very few examinees are supposed to make errors. Speed tests reveal with what speed the examinee can respond in a given time limit, e.g., clerical aptitude tests.
Today a pure power test or a pure speed test is rare, most tests being a mixture of both.

4. On the basis of the criterion of the nature or contents of items


A test may be classified on the basis of the nature of the items or the contents used therein. Important types of tests based on this criterion are:
(i) Verbal test (ii) Nonverbal test (iii) Performance test (iv) Non-language test

(i) A verbal test is one whose items emphasize reading, writing and oral expression as the
primary mode of communication. Herein, instructions are printed or written. These are read by
the examinees and, accordingly, items are answered. Jalota Group General Intelligence Test and
Mehta Group Test of Intelligence are some common examples. Verbal tests are also called
paper-pencil tests because the examinee has to write on a piece of paper while answering the test
items.
(ii) Nonverbal tests are those that reduce but don't altogether eliminate the role of language by using symbolic materials like pictures, figures, etc. Such tests use language in the instructions but not in the items; test items present the problem with the help of figures and symbols. Nonverbal tests are commonly used with young children in an attempt to assess the nonverbal aspects of intelligence such as spatial perception. Raven's Progressive Matrices is a good example of a nonverbal test.
(iii) Performance tests are those that require the examinees to perform a task rather than answer some questions. Such tests prohibit the use of language in items. Occasionally, oral language is used to give instructions, or the instructions may also be given through gesture and pantomime. Different kinds of performance tests are available. Some tests require examinees to assemble a puzzle, place pictures in a correct sequence, place pegs in the boards as rapidly as possible, point to a missing part of a picture, etc. The common feature of all performance tests is their emphasis on the examinee's ability to perform a task rather than answer some questions.
(iv) Non-language tests are those which don't depend upon any form of written, spoken or reading communication. Such tests remain completely independent of the ability to use language in any way. Instructions are usually given through gestures or pantomime and the examinees respond by pointing at or manipulating objects such as pictures, blocks, puzzles, etc. Such tests are usually administered to those persons or children who can't communicate in any form of ordinary language.

5. On the basis of the criterion of purpose or objective


Tests are also classified in terms of their objectives or purposes. Based upon this criterion, tests are usually classified as intelligence tests, aptitude tests, personality tests, neuropsychological tests and achievement tests. Intelligence tests intend to assess the intelligence of the examinees. Aptitude tests assess the potentials or aptitudes of persons. Personality tests assess traits, adjustment, interests, values, etc. of the person. Achievement tests measure what the person has acquired in a given area during a period of training.

6. On the basis of the criterion of standardization

(THIS INFORMATION ON STANDARDISATION CAN BE USED IN FURTHER TOPICS ALSO, HENCE NOT REDUCING THE POINTS IN IT)

Tests are also classified on the basis of standardization. Based upon this criterion, tests are
classified into standardized tests and teacher-made tests, Standardized tests are those which have
been subjected to the procedure of standardization. However, the meaning of the term
'standardization" is controversial and includes at least the following conditions:
(i) The first condition for standardization is that there must be a standard manner of giving instructions so that uniformity can be maintained in the evaluation of all those who take the test.
(ii) The second condition for standardization is that there must be uniformity of scoring, and an index of the fairness of the correct answers, obtained through the procedure of item analysis, should be available.
(iii) The third condition is that the reliability and validity of the test must be established and the individuals for whom the test is intended should be explicitly mentioned.
(iv) The fourth condition, a controversial one, is that a standardized test should have norms.
However, according to Cronbach (1970, 27), a test even without norms may be called a
standardized test. But the majority of psychologists favour the idea that a standardized test
should have norms as well.
By way of summarizing the meaning of a standardized test, it can be said that standardized tests,
constructed by test specialists, are standardized in the sense that they have been administered and
scored under standard and uniform testing conditions so that the results obtained from different
samples may legitimately be compared. Items of standardized tests are fixed and not modifiable.
Teacher-made tests are those that are constructed by teachers for use largely within their
classrooms. The effectiveness of such tests depends upon the skill of the teacher and his
knowledge of test construction. Items may come from any area of curriculum and they may be
modified according to the will of the teacher. Rules for administration and scoring are
determined by the teacher. Such tests are largely evaluated by the teachers themselves and no
particular norms are provided; however, they may be developed by the teacher for his own class.

Thus, we find that tests have been classified in terms of various criteria. These tests are used for
a variety of purposes.

CHARACTERISTICS OF A GOOD TEST


For a test to be scientifically sound, it must possess the following characteristics.

Objectivity
A test must have the trait of objectivity, i.e., it must be free from the subjective element so that there is complete interpersonal agreement among experts regarding the meaning of the items and scoring of the test. Obviously, objectivity here relates to two aspects of the test--objectivity of the items and objectivity of the scoring system. By objectivity of items is meant that the items should be phrased in such a manner that they are interpreted in exactly the same way by all those who take the test. For ensuring objectivity of items, items must have uniformity of order of presentation (that is, either ascending or descending order). By objectivity of scoring is meant that the scoring method of the test should be a standard one so that complete uniformity can be maintained when the test is scored by different experts at different times.

Reliability
It refers to the self-correlation of the test. The extent to which results obtained are consistent when the test is administered once or more than once on the same sample with a reasonable time gap is called reliability. The index of internal consistency is the consistency in results obtained in a single administration; the index of temporal consistency is the consistency in results obtained upon testing and retesting. For a test to be sound, it must be reliable because reliability indicates the extent to which scores obtained on the test are free from internal defects of standardisation.

Validity
Validity is another prerequisite for a test to be sound. Validity indicates the extent to which the test measures what it intends to measure, when compared with some outside independent criterion. In other words, it is the correlation of the test with some outside criterion. The criterion should be an independent one and should be regarded as the best index of the trait or ability being measured by the test. Generally, the validity of a test is dependent upon its reliability, because a test which yields inconsistent results (poor reliability) is ordinarily not expected to correlate with some outside independent criterion.

Norms
A test must also be guided by certain norms. Norms refer to the average performance of a representative sample on a given test. There are four common types of norms--age norms, grade norms, percentile norms and standard score norms. Depending upon the purpose and use, a test constructor prepares any of these norms for his test. Norms help in interpretation of the scores. In the absence of norms, no meaning can be added to the score obtained on the test.

Practicability / Usability
A test must also be practicable/usable from the point of view of the time taken in its completion, length, scoring, etc. In other words, the test should not be lengthy, and the scoring method must not be difficult nor one which can only be done by highly specialized persons. In addition, the test should be economical from the point of view of money also.

Standardized Procedures in Test Administration


Standardization is when a test is made uniform or set to adhere to a specific standard. It involves
administering and scoring the test in the same way for everyone who takes the test. The
interpretation of a psychological test is most reliable when the measurements are obtained under
the standardized conditions outlined in the publisher’s test manual. Nonstandard testing
procedures can alter the meaning of the test results, rendering them invalid and, therefore,
misleading. The most common examples of standardised tests are the Rorschach, TAT, and
MMPI.

Behaviour Sample
A test measures a limited sample of behaviour that is of interest to the researcher, which allows them to make inferences about the total domain of relevant behaviour, of which the sample is representative.
Eg. WAIS (Vocabulary subtest)
It requires the client to try to define up to 30 words. It assesses: language development, expressive language skills, cultural and educational experiences, ability to use words appropriately and retrieval of information from long-term memory.
Eg. Stress tests, personality tests (MMPI):
Test items do not always resemble the behaviours that the test is attempting to predict. While
most tests do sample directly from the domain of behaviour they hope to predict, this is not a
psychometric requirement.

Applications of Psychological Testing -


It is a valuable tool for individuals, organizations, and researchers alike. It can help individuals
gain insight into their own abilities and limitations, and it can help organizations make informed
decisions about employee selection and development. Researchers use psychological testing to
study and understand human behavior, which can ultimately lead to the development of new and
more effective treatments for mental health conditions, as well as a greater understanding of how
we think, feel, and behave.
These tests are widely used in various fields and settings, including educational, counselling and guidance, clinical, and organisational settings, employment screening, research, etc.
Beyond a doubt, no practice in modern psychology has been more assailed than psychological
testing. For critics, the most common rallying point is test bias. In proclaiming test bias, the
skeptics assert in various ways that tests are culturally and sexually biased so as to discriminate
unfairly against racial and ethnic minorities, women, and the poor (Gregory, 2013).
Generally, psychologists and other mental health professionals with advanced training in
psychological assessment are the individuals who are best equipped to administer psychological
tests. Psychologists undergo extensive education and training in the use and interpretation of
psychological tests, and many hold advanced degrees in clinical psychology or related fields.
They are also required to obtain a license from the state in which they practice, which typically
requires passing a comprehensive examination and completing a period of supervised practice.
Various tests may or may not be applicable to certain populations. For this purpose, different tests exist for special populations. These tests are developed primarily for use with persons who cannot be properly or adequately examined with traditional instruments. Some
examples are: Autism Diagnostic Observation Schedule (ADOS), Vineland Adaptive Behavior
Scales (VABS), Wechsler Intelligence Scale for Children (WISC), Behavior Assessment System
for Children (BASC) and Woodcock-Johnson Tests of Achievement (WJTA).

Educational
Nearly every type of available test is utilized in the schools. Intelligence, special aptitude,
multiple aptitude, and personality tests can all be found in the repertory of the educational
counsellor and the school psychologist. Teachers and educational administrators frequently have
to act on the results obtained with several different kinds of tests. Certain types of tests, however,
have been specifically developed for use in educational contexts, predominantly at the
elementary and high school levels.

I. ACHIEVEMENT TESTS
Achievement refers to what a person has acquired or achieved after the specific training or
instruction has been imparted. In other words, achievement tests are primarily designed to
measure the effects of a specific programme of instruction or training (Anastasi, 1968). Thus, the
performance on the achievement test indicates the performance under known and controlled
conditions (training).
Important uses of achievement tests include: (EXTRA)
- Achievement tests are an effective way to check any weakness in the instruction or even slackness on the part of the examinee.
- Achievement tests aid in the formulation of educational goals and critical examination of instruction. They can illustrate changes in educational goals and assess the adequacy of syllabi. Achievement tests provide answers to questions about course content, common misunderstandings, and motivation for learners. They aid in the critical evaluation of teaching methods.
- Achievement tests can aid in adapting instruction to individual learner needs. Poor performance
in a specific area on the test can indicate the need for special guidance and training. Instruction
can be modified to suit individual needs and address areas where learners require more support.

There are two types of achievement tests:


(i) General achievement test batteries: they attempt to measure the general educational
achievement in several areas of the academic curriculum. They can be used from primary levels
in school to adult levels in colleges.
Eg. California Achievement Test, Iowa Tests of Basic Skills, Tests of Educational Development and SRA Achievement Series
California Achievement Test (CAT)
- Widely used
- Basic academic skills
- Children and adolescents (kindergarten to grade 12)
- First introduced 1950; Recent edition 6
- Determine for academic competency and child’s readiness to be promoted to a higher grade
- Given in group-classroom setting
- 1.5 to 5 hours to complete depending on test form and grade level
- Test is sent back to publisher for scoring, then scoring information is returned to school in the
form of test reports
Iowa Test of Basic Skills
- Developed in 1935; Recent version: Iowa Assessments (2011-12)
- Administration: Group
- Age range: Kindergarten to grade 12
- 10 core sections: reading, writing, mathematics, science, social studies, vocabulary, spelling,
capitalization, punctuation, and computation.
- Items: 270-340 depending on grade level
- Administration time 3-3.5 hours
(ii) Special Achievement Tests: They measure pupils in some selected areas and can be grouped
into: diagnostic tests and the standardised end-of-course examination.
Diagnostic tests - identify the educationally retarded pupils and suggest remedial programmes for them. Such tests are available in special areas like reading skill and mathematical skill. Eg. Stanford Diagnostic Reading Test and Durrell Analysis of Reading Difficulty. The standard
end-of-course examinations are the co-ordinated series of achievement tests for different subjects
taught at either school or at college level. They provide one system for comparable norms for all
tests (of different subjects) and thus a direct comparison of scores obtained in different subjects
by the same testee is possible.
Stanford Diagnostic Reading test
- Administration: Group
- Age range: Elementary school
- Level I – grade 2.5 to 4.5
- Level II – grade 4.5 to 8.5
- Dimensions: reading comprehension, vocabulary, and word recognition skills
Durrell Analysis of Reading Difficulty
- Developed by Durrell & Catterson (1955)
- Age range: children
- Administration: Individual
- Covers a range of reading ability from the non-reader to 6th grade reading
- discover weaknesses and faulty habits in reading which may be corrected in a remedial
program.

II. INTELLIGENCE TESTS


Intelligence test is an individually administered test used to determine a person’s level of
intelligence by measuring his or her ability to solve problems, form concepts, reason, acquire
detail, and perform other intellectual tasks. It comprises mental, verbal, and performance tasks of
graded difficulty that have been standardized by use on a representative sample of the
population.
Stanford-Binet Intelligence Scale.
- Developed by: Binet & Simon (1905)
- Latest edition 5 (2003)
- Calculate IQ, identify children with gifted abilities and children with intellectual deficiency
- Age: 2 to adulthood
- Time: 45 minutes to 3 hours depending on age
- Five factors of cognitive ability: fluid reasoning, knowledge, quantitative reasoning,
visual-spatial processing and working memory.
- Both verbal and non-verbal responses measured

Wechsler Intelligence Scale for Children


- Developed by David Wechsler (1949)
- Latest edition 5 (2014)
- Age: 6-16
- Time: 65-80 minutes
- Individual test
- Cognitive abilities assessed: verbal comprehension, perceptual reasoning, working memory,
processing speed and fluid reasoning
- 10 primary scale: similarities, vocabulary, block design, matrix reasoning, figure weights, digit
span, coding, visual puzzles, picture span, symbol search

Organisational

The fundamental steps in setting up a testing procedure are basically similar to those necessary for any kind of selection procedure for the requirements of an industry or organisation. The
primary step is to understand the nature or characteristics of the job for which psychological
testing is to be used as a selection device. When job and worker analyses have been performed,
the appropriate test or set of tests to assess the behaviours and abilities required for success on
the job must be very carefully chosen or developed.

Tests - Wonderlic Personnel Test - Revised (WPT-R)
The Wonderlic Personnel Test (WPT) is a standardized, self-administered assessment of general
mental ability that is frequently used in industrial and business settings as an aptitude test for
prospective employees. The WPT was initially introduced in 1937 by Eldon F. Wonderlic as a
short-form test of general mental ability. The WPT was used by the United States Military in
World War II to aid in the selection and placement of potential military recruits. The Wonderlic
Personnel Test – Revised is a quick assessment exam used for employment prequalification. It
produces a snapshot of a person’s cognitive aptitude or general intellect. These characteristics are
a commonly accepted way to forecast how well a candidate will do in a job. A person’s cognitive
gifts are considered more trustworthy, legitimate and neutral with regards to employment
forecasting when compared to data gathered from resumes, previous success in school,
references or interviews. The WPT-R is an aptitude assessment that can be of great assistance in choosing the right employee, as it tends to be very beneficial for selecting candidates. The assessment lasts 12 minutes, and may be given on a computer or taken with pencil and paper on an answer document. The assessment is mechanically graded by Wonderlic. This assessment can help place employees in jobs that match how quickly they can be trained and what their skills are.

Bennett Mechanical Comprehension Test (BMCT)


The Bennett Mechanical Comprehension Test (BMCT) is an aptitude test relating to mechanics which has been developed by Pearson. With a focus on spatial perception and tool knowledge rather than manual dexterity, the BMCT is especially well-suited for assessing job candidates for positions that require a grasp of the principles underlying the operation and repair of mechanical devices. The Bennett Mechanical Comprehension Test is made up of a total of 55 questions across 12 categories.

An individual who scores well on the BMCT demonstrates an aptitude for learning mechanical
skills.
The ability to apply mechanical information, spatial visualization, and mechanical reasoning in
answering BMCT questions is a predictor of employee success.

Personality testing in organisations


Personality is meaningful to management, because employees' personalities may dictate how
well they perform their jobs. Personality may indicate how hard a person will work, how
organized they are, how well they will interact with others, and how creative they are.

In recent years, more organizations have been using self-reporting personality tests to identify
personality traits as part of their hiring or management development processes. Employers
recognize that experience, education, and intelligence may not be the only indicators of who the
best hire might be.

MINNESOTA MULTIPHASIC PERSONALITY INVENTORY AND CALIFORNIA PSYCHOLOGICAL INVENTORY
Some of the earlier tests used to assess the personality of job applicants and employees were the
Minnesota Multiphasic Personality Inventory (MMPI) and the California Psychological
Inventory (CPI), which is based on the MMPI.

The MMPI was developed for psychological clinical profiling and includes ten clinical scales.
While some of these scales may be applicable to predicting job performance in a selection tool,
others are not. Additionally, the items used in the MMPI may be off-putting to job applicants. However, before personality tests designed for business settings became commercially available, organizations often used the MMPI to assess the personality characteristics of applicants and employees.

FIVE-FACTOR MODEL
A different conception of personality is captured in the Sixteen Personality Factor Questionnaire,
also called the 16 PF. It yields scores of sixteen different personality traits, including dominance,
vigilance, and emotional stability. These sixteen factors can be combined to express five "global
factors" of personality. These five global factors are often called the Big Five or the Five-Factor
Model.

Most researchers agree that while more than five dimensions of personality are present in human
beings, nearly all of them can be subsumed within five: emotional stability, conscientiousness,
agreeableness, extraversion, and openness to experience.

Minnesota Clerical Test (MCT)


- Developed by Andrew et al (1979)
- Time 15 minutes
- Used for selection of clerical personnel, predictor of job performance and to provide career
guidance information
- 2 subtests
o Numerical Comparison
o Name Comparison
- 100 identical and 100 different pairs for each subtest
- MCQ format

Clinical Setting
Clinicians typically draw upon multiple sources of data in the intensive study of individual cases.
Information derived from interviewing and from the case history is combined with test scores to
build up an integrated picture of the individual. The clinician thus has available certain
safeguards against overgeneralizing from isolated test scores.
Some tools of measurement include: intelligence tests (WAIS), personality tests (MMPI), tests to
assess learning disabilities (Peabody test), checklist, questionnaire, rating scales for diagnosis,
clinical interview, MSE, observation and clinical assessment.
I. DIAGNOSTIC USE
Use of psychological tests in a clinical setting can help diagnose mental illnesses; collect information about mental abilities, strengths, and weaknesses; create a treatment plan; assess personality, intelligence, and neuropsychological functioning; and determine whether a patient is eligible for a specific treatment.
Autism Spectrum Disorder
For example, to assess on what spectrum of autism an individual lies, a test has to evaluate
certain core characteristics like reciprocal and social skills, communication abilities and flexible
behaviour. Some tests are given below:
1. Indian Scale for Assessment of Autism
- Age 3-11 years
- Dependent on reporting by parents
- Items 40
- 6 domains
- Scores range from 1-5 (rarely, sometimes, frequently, mostly, always)
- Based on intensity, frequency and duration of a particular behaviour

2. Modified Checklist for Autism in Toddler (M-CHAT-R)


- Age 16-30 months
- Items 23
- Filled out by parents
- Yes/no responses
- A medium or high risk score indicates the need for follow-up, but does not by itself meet the criteria for a diagnosis
3. Baby and Infant Screen for Children with aUtIsm Traits (BISCUIT)
- Developed by Matson et al (2007)
- Screening instrument
- Measures core symptoms of autism
- Age: 17-37 months
- 71 items
- BISCUIT- Part 2 measures comorbid symptoms
- BISCUIT - Part 3 measures challenging behaviours

Neuropsychological Assessment
As observed in the Global Burden of Disease study (1990-2019), there has been an increase in the prevalence of neuropsychological disorders (Singh et al., 2021). It becomes more and more crucial to be able to accurately diagnose and treat them. According to Reitan and Wolfson (1998), there needs to be evaluation of categories like: sensory input, attention and concentration, learning and memory, executive functioning and motor output.
1. Finger Localisation Test
- For Sensory Input
- Developed by Arthur L. Benton and colleagues (1983)
- Purpose: identification, naming and localisation of the fingers
- Items 60
- 3 parts
- (i) identify which finger is touched by the examiner
- (ii) blindfolded and then asked to identify which finger is touched
- (iii) blindfolded and then asked to identify which two fingers are simultaneously touched
2. Test of Everyday Attention (TEA)
- For Attention
- Developed by Robertson et al. (1994)
- Age: 18-80
- Administration time: 45-60 minutes
- 60-65 administrations
- 3 parallel versions to reduce practice effect
- Individual test
- 8 subtests
- Assesses: selective attention, sustained attention and attentional switching
- Uses everyday materials (e.g., counting elevator tones)


3. Continuous Performance Tests (CPT)
- Developed by Haldor Rosvold et al (1956)
- Assesses visual vigilance, scanning and concentrating on a single array
- Most commonly used CPTs are
o Test of Variables of Attention (TOVA)
o Integrated Visual and Auditory CPT (IVA)
o Conners’ CPT
- Help rule out diagnosis of ADHD
- Item example: press a key after certain letters
4. Rey Auditory Verbal Learning Test (RAVLT)
- Originally devised by Edouard Claparede; developed by Andre Rey (1964)
- Time: 10-15 minutes (excluding 30 minute interval)
- Assesses: short-term auditory-verbal memory, rate of learning, learning strategies, retroactive and proactive interference, presence of confabulation or confusion in memory processes, retention of information, and differences between learning and retrieval
- Five presentations of a 15-word list are given, each followed by attempted recall
- This is followed by a second 15-word interference list (list B), followed by recall of list A
- Rate of learning, delayed recall and recognition are also tested

5. Wide Range Assessment of Memory and Learning (WRAML)


- Developed by Sheslow and Adams (2003)
- Latest edition: WRAML3
- Edition 1: 5-17 years
- Edition 2: 5-70 years
- Age: 5-90 years
- Assesses delayed memory ability and acquisition of new learning
- 6 subtests
o Working memory
o Symbolic working memory
o Verbal working memory

Other tests include


- Test of Spatial and Manipulatory Abilities
- Design Copying Test
- Bender Visual Motor Gestalt Test

II. IDENTIFYING PEOPLE WITH SPECIAL NEEDS


Psychological tests, in clinical settings provide valuable information about an individual's
cognitive, emotional, and behavioural functioning. For instance, intelligence tests can be used to
assess a person's cognitive abilities and identify any cognitive impairments or learning
disabilities. Personality tests can provide information about a person's personality traits and help
identify any underlying psychological disorders or conditions. Behavioural assessments can
provide insight into a person's behavior patterns and help identify any developmental or
behavioural disorders. In clinical settings, the results of psychological tests are used to inform the
development of treatment plans and interventions tailored to the individual's specific needs.
1. Peabody Developmental Motor Scale
- Developed by Folio & Fewell (1983)
- Latest edition: PDMS-3
- Assesses interrelated motor abilities that develop early in life
- Age: 0-5 years
- Time: 60-90 minutes
- Gross and fine motor skills
2. Test of Non-verbal Intelligence (TONI)
- Developed by Brown & Sherbenou (1982)
- Latest edition: TONI-4 (2010)
- Abstract reasoning and figural problem-solving
- Age: 6 years and older
- Time: 10-15 minutes
- Assesses abstract reasoning and problem-solving

III. ASSESSMENT OF EXECUTIVE FUNCTIONING


Executive functioning is a broad term that refers to a set of cognitive processes that are involved
in planning, organizing, initiating, and regulating behavior in order to achieve a goal. Assessing
executive functioning (logical analysis, conceptualisation, reasoning, planning, flexibility of
thinking) typically involves a range of tests that measure different aspects of these processes.
Some common tests used to assess executive functioning include:
1. Porteus Maze Test (PMT)
- Developed by Stanley Porteus (1965) as a supplement to the Stanford-Binet Test
- Age: 3 and older
- No time limit
- Nonverbal test
- Assess planning and foresight
- Required to trace a path through a drawn maze of varying complexity.
- Must avoid blind alleys and dead ends; no back-tracking is allowed
- Culture reduced measure
- People with brain damage are unable to solve (particularly in frontal lobe)

2. The Tinkertoy Test (TTT)


- Developed by Muriel Lezak (1982)
- Assess executive dysfunction (initiating, planning, and structuring of behaviors) in people with
neurodegenerative diseases
- Four categories
o goal formulation
o planning
o carrying out goal-directed plans
o effective performance
- 50 tinkertoys
- 5 minutes to complete
- Make whatever with blocks, sticks, etc.
- Score: 12-112 on the basis of
o Number of pieces used
o Mobility of construction
o Symmetry
o Naming the construction
- People with head injury make impoverished designs, consisting of a small number of pieces, and are unable to name their design

IV. ASSESSMENT OF MOTOR OUTPUT


Assessment of motor output involves evaluating an individual's ability to control and coordinate their movements, including their speed and accuracy. It includes both gross motor and fine motor skills.
1. Purdue Pegboard Test (PPT)
- Developed by Joseph Tiffin (1948)
- Time: 5-10 minutes
- Age: all
- Assesses
o gross movement of the arm, hand and fingers
o fingertip dexterity
- Place pegs in holes
- Right hand, then left hand, then both
- Each is tried for 30 seconds
- Used with patients with impairments of the upper extremity resulting from neurological and
musculoskeletal conditions
2. Line Tracing task
- Assesses aspects relative to motor control (fine motor skills)
- Figures on a sheet are presented
- Brightly coloured pen given
- Draw over lines as quickly as possible

V. CLINICAL JUDGEMENT AND REPORT WRITING (EXTRA from Anastasi)


They provide objective and standardized measures of an individual's psychological functioning,
which can help clinicians make informed decisions about diagnosis, treatment, and other aspects
of patient care. They help clinicians to identify any areas of impairment or difficulty in the
individual's psychological functioning, track changes in the individual's psychological
functioning over time, which can be useful in evaluating the effectiveness of treatment
interventions and adjusting treatment plans as needed.
The test results also help create a detailed and comprehensive report that summarizes the individual's psychological profile and informs clinical decision-making. The report typically begins with background information, a description of the tests administered, their purpose, administration procedures and scoring methods, followed by a detailed analysis of the results, interpretation of the scores, and conclusions about the individual's psychological functioning.

Counselling and Guidance


Psychological testing is often used in counselling and guidance to assess a wide range of issues
that clients may be experiencing. The tests can help counsellors and guidance professionals
understand an individual's unique psychological makeup, their strengths, and areas of potential
growth or challenges. Some of the ways that psychological testing can be used in counselling and
guidance include: assessment of mental health, career and vocational guidance, educational
assessment, personal growth and self-awareness and relationship and interpersonal issues.
Commonly used tests include intelligence tests (WAIS), multiple aptitude batteries (DAT), personality tests (MMPI) and certain educational tests. Counselling psychologists also make major use of testing in vocational and career guidance. Tests may also be used in highly specialised counselling, e.g., family, marital, child, and substance abuse counselling.
Personality traits or interests tend to cluster into a small number of vocationally relevant patterns called 'types'. For each personality type, there is also a corresponding work environment best suited to that type.

1. RIASEC Model
- RIASEC - Realistic, Investigative, Artistic, Social, Enterprising and Conventional
- Developed by John Holland (1959)
- Problem-solving, cognitive approach to career planning
- According to this model, when choosing a career, people prefer jobs where they can be around
others who are like them.
- search for environments that will let them use their skills and abilities, and express their
attitudes and values, while taking on enjoyable problems and roles
- 6 personality types
(i) Realistic work environments require hands-on involvement, physical movement, mechanical
skill, and technical competencies. Pragmatic problem solving is needed. Typical vocations
include auto repair, cook, drywall installer, machinist, taxi driver, and umpire.
(ii) Investigative settings require the use of abstract thinking and creative abilities. The focus is a
rational approach and ideas, not people. Typical positions include architect, arson investigator,
pharmacist, physician, psychologist, and software engineer.
(iii) Artistic environments require the creative application of artistic forms. These settings
demand prolonged work and place a premium on access to intuition and emotional life. Typical
vocations include actor, composer, graphic designer, model, photographer, and reporter.
(iv) Social environments involve an interest in caring for people and the ability to discern and
influence their behavior. These work settings require good social skills and the ability to deal
with emotionally laden interactions. Typical positions include clergy, teacher, emergency medical
technician, marriage therapist, psychiatric aide, and waitperson.
(v) Enterprising work environments involve the influence of others through verbal skills. These
roles require self-confidence and leadership capacities for directing and controlling the activity
of others. Typical vocations include bartender, real estate agent, construction manager, first-line
supervisor, police detective, and travel agent.
(vi) Conventional work environments require the methodical, routine, and concrete processing of
words and mathematical symbols. The key to these settings is repetitive application of
established clerical procedures. Typical settings include bank teller, bookkeeper, court recorder,
insurance underwriter, office clerk, and shipping clerk

2. Person Environment Fit (P-E Fit)


- Developed by John Holland (1959)
- assessing and predicting how characteristics of the employee and the work environment jointly
determine worker well-being and performance
- Characteristics of the person (P) include needs as well as abilities
- Characteristics of the environment (E) include supplies and opportunities for meeting the
employee’s needs as well as demands which are made on the employee’s abilities
- Individuals' behavior, motivation, and physical as well as mental health are affected by their interaction with the environment.
- an optimal fit facilitates an individual's functioning, like improved attitude and performance,
while an unsuitable fit may worsen the individual's functioning

3. Campbell Interest and Skill Survey


- Developed by David Campbell (1974)
- Assesses degree of interest in 200 academic and occupational topics
- Assesses degree of skill in 120 specific occupations
- 3 scales
(i) Orientation Scales: 7 scales describe the test taker’s occupational orientation - influencing,
organizing, helping, creating, analysing, producing, and adventuring
(ii) Basic Scales: The basic scales provide an overview for categories of occupations. Examples
of basic scales include law/politics, counselling, and mathematics
(iii) Occupational Scales: 60 occupational scales describe matches with particular occupations,
including attorney, engineer, guidance counsellor, and math teacher
4. Career Assessment Inventory
- Developed by Charles B. Johansson (1976, 1978)
- occupational interest inventory
- Age: 15 years and older
- Time: 35-40 minutes
- 305 items
- 5-point rating scale
- takes individual’s workplace interests and compares them with other individuals currently in
one of the 111 careers
- 2 versions: Enhanced Version and Vocational Version
5. Kuder Occupational Interest Survey
- Developed by Frederic Kuder (1939)
- Assesses interest in occupational fields of study and suggest career possibilities
- Sixth grade reading level
- 10 broad interest areas
o Outdoor
o Mechanical
o Clerical
o Computational
o Scientific
o Literary
o Social service
o Persuasive
o Artistic
o Musical

6. Self-Directed Search (SDS)


- Developed by Holland (1970)
- Time: 35-40 minutes
- Career and interest test
- Items regarding personal ambition, skills, activities, and interests in various positions
- 6 categories abbreviated (RIASEC)
o Realistic
o Investigative
o Artistic
o Social
o Enterprising
o Conventional

7. Career Beliefs Inventory


- Developed by Krumboltz (1994)
- Age: Grade 8 to adult
- Individual and a group test
- 25 scales under 5 headings
- 96 items
- Explore clients' assumptions, generalizations, and beliefs about themselves and the world of
work
- It helps a counsellor identify assumptions blocking a client's career progress
- A low score on any scale alerts the counsellor to explore specific beliefs in that category with the client and to examine their consequences.

8. Differential Aptitude Test (DAT)


- Developed by Bennett et al. (1947)
- Time: 3 hours
- Age: Grade 7 - 12
- Assesses abilities and skills: verbal, numerical ability, abstract reasoning, mechanical reasoning
and space relations
- 6 sections
o Mechanical comprehension
o Abstract Reasoning
o Numerical Calculations
o Numerical Sequence
o Space Relations
o Verbal Analogies

UNIT 3: Test and Scale Construction


Test Construction and Standardization: Item analysis, Reliability, validity, and norms (characteristics of z-scores, T-scores, percentiles, stens and stanines); Scale Construction: Likert, Thurstone, Guttman & Semantic Differential

TEST CONSTRUCTION
A. K. Singh (2018) defined the test as a series of questions on the basis of which some information is sought.

A psychological (or an educational) test is a standardized procedure to measure quantitatively or


qualitatively one or more than one aspect of a trait by means of a sample of verbal or nonverbal
behavior. The purpose of a psychological test is twofold. First, it attempts to compare the same
individual on two or more than two aspects of a trait; and second, two or more than two persons
may be compared on the same trait. Such a measurement may be either quantitative or
qualitative.

Anastasi and Urbina (1997) have defined a psychological test as "essentially an objective and standardized measure of a sample of behavior".

Important characteristics are-

First, test is an organized succession of stimuli, which means that the stimuli (popularly known
as items) in the test are organized in a certain sequence and are based upon some principles of
test construction

Second, both quantitative and qualitative measurements are possible through psychological and
educational tests.

Third, a psychological test is based upon a limited sample of behavior.

Fourth, psychological tests usually provide scores or categories which are, subsequently,
interpreted with reference to a standardization sample.

CHARACTERISTICS OF A GOOD TEST


For a test to be scientifically sound, it must possess the following characteristics.

Objectivity- A test must have the trait of objectivity, i.e., it must be free from the subjective element so that there is complete interpersonal agreement among experts regarding the meaning of the items and scoring of the test. Obviously, objectivity here relates to two aspects of the test--objectivity of the items and objectivity of the scoring system. By objectivity of items is meant that the items should be phrased in such a manner that they are interpreted in exactly the same way by all those who take the test. For ensuring objectivity of items, items must have uniformity of order of presentation (that is, either ascending or descending order). By objectivity of scoring is meant that the scoring method of the test should be a standard one so that complete uniformity can be maintained when the test is scored by different experts at different times.

Reliability - A test must also be reliable. Reliability here refers to self-correlation of the test. It
shows the extent to which the results obtained are consistent when the test is administered once
or more than once on the same sample with a reasonable time gap. Consistency in results
obtained in a single administration is the index of internal consistency of the test and consistency
in results obtained upon testing and retesting is an index of temporal consistency. Reliability,
thus, includes both internal consistency as well as temporal consistency. For a test to be called
sound, it must be reliable because reliability indicates the extent to which the scores obtained in
the test are free from such internal defects of standardization which are likely to produce errors
of measurement.

Validity- is another prerequisite for a test to be sound. Validity indicates the extent to which the
test measures what it intends to measure, when compared with some outside independent
criterion. In other words, it is the correlation of the test with some outside criterion. The criterion
should be an independent one and should be regarded as the best index of trait or ability being
measured by the test. Generally, validity of the test is dependent upon the reliability because a
test which yields inconsistent results (poor reliability) is ordinarily not expected to correlate with
some outside independent criterion.

Norms- A test must also be guided by certain norms. Norms refer to the average performance of
a representative sample on a given test. There are four common types of norms- age norms,
grade norms, percentile norms and standard score norms. Depending upon the purpose and use, a
test constructor prepares any of these norms for his test. Norms help in interpretation of the
scores. In the absence of norms, no meaning can be added to the score obtained on the test.

Practicability / Usability- A test must also be practicable/usable from the point of view of the
time taken in its completion, length, scoring, etc. In other words, the test should not be lengthy
and the scoring method must not be difficult nor one which can only be done by highly
specialized persons. In addition, the test should be economical from the point of view of money
also.

GENERAL STEPS OF TEST CONSTRUCTION


(ITEM ANALYSIS WAS COVERED FOR THE EXAM; REFER TO THOSE NOTES, GUYS)
Before the real work of test construction begins, certain broad decisions are taken by the
investigator. These preliminary decisions have far-reaching consequences. It is at this
preliminary stage that the test constructor outlines the major objectives of the test in general
terms, and specifies the populations for whom the test is intended. He also indicates the possible
conditions under which the test can be used and its important uses. For example, a test
constructor may decide to construct an intelligence test meant for students of the tenth grade
broadly aiming at diagnosing the manipulative and organizational ability of the pupils. Having
decided the above preliminary things, he must go ahead with the following steps.

1. Planning of the test- The first step in the construction of a test is careful planning. At this stage, the test constructor specifies the broad and specific objectives of the test in clear terms. He decides upon the nature of the content or items to be included, the type of instructions to be included, the method of sampling, the detailed arrangements for the preliminary and final administrations, a probable length and time limit for the completion of the test, the probable statistical methods to be adopted, etc. Planning also includes the total number of reproductions of the test to be made and the preparation of a manual.

2. Writing items of the test- The second step in test construction is the preparation of the items of the test. According to Bean (1953, 15), an item is defined as "a single question or task that is not often broken down into any smaller units." Item writing starts from the planning done earlier. If the test constructor decides to prepare an essay test, the essay items are written down. However, if he decides to construct an objective test, he writes down objective items such as the alternative-response item, matching item, multiple-choice item, completion item, short-answer item, pictorial form of item, etc.

There are some essential prerequisites which must be met if the item writer wants to write good and appropriate items. These requirements are enumerated as follows:

1. The item writer must have thorough knowledge and complete mastery of the subject matter. In other words, he must be fully acquainted with all facts, principles, misconceptions and fallacies in a particular field so that he may be able to write good and appropriate items.

2. The item writer must be fully aware of those persons for whom the test is meant. He must be aware of the intelligence level of those persons so that he may manipulate the difficulty level of the items for proper adjustment with their ability level. He must also be able to avoid irrelevant clues to the correct responses.

3. The item writer must be familiar with different types of items along with their advantages and disadvantages. He must also be aware of the characteristics of good items and the common probable errors in writing items.
4. The item writer must have a large vocabulary. He must know the different meanings of a word
so that confusion in writing the items may be avoided. He must be able to convey the meaning of
the items in the simplest possible language.

5. After writing down the items, they must be submitted to a group of subject experts for their
criticisms and suggestions, following which the items must then be duly modified.

3. Preliminary administration (or the experimental try-out) of the test

When the items have been written down and modified in the light of the suggestions and
criticisms given by the experts, the test is said to be ready for its experimental try-out. The
purpose of the experimental try-out or preliminary administration of the test is manifold. According to Conrad, the main purposes of the experimental try-out of any psychological or educational test are as given below:

(1) Finding out the major weaknesses, omissions, ambiguities and inadequacies of the items. In other words, the try-out helps in identifying ambiguous and indeterminate items, non-functioning distractors in multiple-choice items, very difficult or very easy items, and the like.

(2) Determining the difficulty value of each item which, in turn, helps in selecting items for their even and proper distribution in the final form (see the sketch after this list).

(3) Determining the validity of each individual item. The experimental try-out helps in determining the discriminatory power of each individual item. The discriminatory power here refers to the extent to which any given item discriminates successfully between those who possess the trait in larger amounts and those who possess the same trait in the least amount.

(4) Determining a reasonable time limit of the test.

(5) Determining the appropriate length of the test. In other words, it helps in determining the number of items to be included in the final form.

(6) Determining the intercorrelations of items so that overlapping can be avoided.

(7) Identifying weaknesses and vagueness in the directions or instructions of the test as well as in the fore-exercises or sample questions of the test.
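For points (2) and (3) above, the following is a minimal sketch of how the two item statistics could be computed, assuming a small invented matrix of 0/1-scored responses; the data and function names are purely illustrative, not taken from any test manual.

# Hypothetical responses: rows = examinees, columns = items (1 = correct, 0 = wrong)
responses = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]

def difficulty(item):
    # Difficulty value p = proportion of examinees answering the item correctly
    column = [row[item] for row in responses]
    return sum(column) / len(column)

def discrimination(item):
    # Simple discrimination index D = p(upper half) - p(lower half),
    # where the halves are formed by ranking examinees on their total score
    totals = [sum(row) for row in responses]
    order = sorted(range(len(responses)), key=lambda i: totals[i], reverse=True)
    half = len(order) // 2
    upper = [responses[i][item] for i in order[:half]]
    lower = [responses[i][item] for i in order[-half:]]
    return sum(upper) / len(upper) - sum(lower) / len(lower)

for j in range(5):
    print(f"Item {j + 1}: p = {difficulty(j):.2f}, D = {discrimination(j):.2f}")

Items with p near .50 and a clearly positive D discriminate best; a D near zero or negative flags the item for revision or removal, which is exactly the kind of decision the try-out is meant to support.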

4. Reliability of the final test

When on the basis of the experimental or empirical try-out the test is finally composed of the
selected items, the final test is again administered on a fresh sample in order to compute the
reliability coefficient. The size of the sample for this purpose should not be less than 100.
Reliability is the self-correlation of the test and it indicates the consistency of the scores in the
test. There are three common ways of calculating the reliability coefficient, namely the test-retest method, the split-half method, and the equivalent-form method. Besides these, the Kuder-Richardson formulas and the Rulon formula are also used in computing the reliability coefficient of a test.

5. Validity of the final test

Validity refers to what the test measures and how well it measures it. If a test measures the trait that it intends to measure well, we say that the test is a valid one. After estimating the reliability coefficient of the test, the test constructor validates the test against some outside independent criterion by comparing the test with the criterion. Thus, validity may also be defined as the correlation of the test with some outside independent criterion. Validity should be computed from data obtained from samples other than those used in item analysis. This procedure is known as cross-validation. There are three main types of validity: content validity, construct validity and criterion-related validity. The usual statistical techniques employed in computing validity coefficients are Pearsonian r, biserial r, point biserial r, chi square, phi-coefficient, etc. Abac tables have also been prepared by Flanagan for directly reading the values of biserial r, point biserial r and phi-coefficient when the proportions of those passing an item in the lower group and upper group are known.
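As an illustration of one statistic from this list, here is a hedged sketch of the phi coefficient for a 2x2 table (pass/fail on the test versus pass/fail on the outside criterion); the cell counts are invented for the example.

import math

# Invented 2x2 table of frequencies
a, b = 30, 10   # a: pass test & pass criterion;  b: pass test & fail criterion
c, d = 15, 45   # c: fail test & pass criterion;  d: fail test & fail criterion

phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(f"phi = {phi:.2f}")   # about .49 here; positive phi means test and criterion agree

A phi near zero would indicate that passing the test tells us nothing about standing on the criterion, i.e., poor criterion-related validity.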

6. Norms of the final test

Finally, the test constructor also prepares norms for the test. Norms are defined as the average performance or score of a large sample representative of a specified population. Norms are prepared to meaningfully interpret the scores obtained on the test for, as we know, the obtained scores on the test themselves convey no meaning regarding the ability or trait being measured. But when these are compared with the norms, a meaningful inference can immediately be drawn. The common types of norms are the age norms, the grade norms, the percentile norms and the standard score norms. All these types of norms are not suited to all types of tests. Keeping in view the purpose and type of the test, the test constructor develops a suitable norm for the test. The preliminary considerations in developing norms are that the sample must be representative of the true population; it must be randomly selected; and it should preferably represent a cross-section of the population.
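Since raw scores gain meaning only against norms, the following is a minimal sketch of the standard score conversions listed in this unit's syllabus (z-scores, T-scores and percentile ranks); the norm-group scores are invented for illustration.

from statistics import mean, stdev
from bisect import bisect_right

raw = [42, 55, 61, 48, 70, 52, 58, 66, 45, 63]   # invented norm-group scores
m, s = mean(raw), stdev(raw)

def z_score(x):
    return (x - m) / s             # z-scores: mean 0, SD 1

def t_score(x):
    return 50 + 10 * z_score(x)    # T-scores: mean 50, SD 10

def percentile_rank(x):
    # percentage of the norm group scoring at or below x
    return 100 * bisect_right(sorted(raw), x) / len(raw)

x = 61
print(f"z = {z_score(x):.2f}, T = {t_score(x):.1f}, PR = {percentile_rank(x):.0f}")

A stanine could then be read off the same z scale by cutting it at the conventional half-SD boundaries (stanine 5 spans z of -0.25 to +0.25), and sten scores work analogously on a ten-unit scale.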

7. Preparation of manual and reproduction of the test

The last step in test construction is the preparation of a manual of the test. In the manual the test constructor reports the psychometric properties of the test, its norms and references. This gives a clear indication regarding the procedures of test administration, the scoring methods and the time limit, if any, of the test. It also includes the instructions as well as the details of the arrangement of materials, that is, whether items have been arranged in random order or in any other order. In general, the test manual should yield information about the standardization sample, reliability, validity and scoring, as well as practical considerations. The test constructor, after weighing the importance and requirements of the test, finally orders the printing of the test and the manual.

Reliability
The classical test theory assumes that any observed score (X) is equal to the true score (T) plus
the error score (E). This classical theory assumes that each person has a true score that would be
obtained if there are no errors in measurement. But in reality, the measuring instruments are not
perfect. Such errors are also produced by the characteristics of the individuals or situations which
can affect the test scores but which have nothing to do with the attribute being measured. The
difference between the true score and the obtained score results from this error of measurement, or error score. The same can be said in this way: the difference between the score the person obtains and the score he is really interested in equals the error of measurement:

X - T = E

The variance of obtained scores is simply the sum of the variance of true scores plus the variance of the errors of measurement. In terms of an equation,

σ²(X) = σ²(T) + σ²(E)
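A hedged simulation of this decomposition, with invented parameters, shows that the observed-score variance approximates the sum of the true-score and error variances, and that reliability can then be read as the proportion of true variance in the total variance.

import random

random.seed(1)
N = 10_000
true = [random.gauss(50, 10) for _ in range(N)]    # invented true scores, SD 10
error = [random.gauss(0, 5) for _ in range(N)]     # errors uncorrelated with true scores, SD 5
observed = [t + e for t, e in zip(true, error)]    # X = T + E

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(var(observed))               # close to 125 (= 10**2 + 5**2)
print(var(true) + var(error))      # close to the line above
print(var(true) / var(observed))   # reliability: about .80 here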

TYPES OF RELIABILITY

Test-Retest Reliability

In test-retest reliability, the single form of the test is administered twice on the same sample with
a reasonable time gap. In this way, two administrations of the same test yield two independent
sets of scores. The two sets, when correlated, give the value of the reliability coefficient. The
reliability coefficient thus obtained is also known as the temporal stability coefficient and indicates to what extent the examinees retain their relative position, as measured in terms of the test score, over a given period of time. A high test-retest reliability coefficient indicates that the examinee who obtains a low score on the first administration tends to score low on the second administration and, conversely, the examinee who scores high on the first administration tends to score high on the second administration.
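In practice the coefficient is simply the Pearson correlation between the two sets of scores. A minimal sketch (the score vectors below are invented for illustration):

import numpy as np

# Hypothetical scores of 8 examinees on the same test, administered two weeks apart
first  = np.array([12, 18, 25, 31, 22, 15, 28, 20])
second = np.array([14, 17, 27, 30, 21, 16, 26, 22])

# Test-retest reliability = Pearson r between the two administrations
r_tt = np.corrcoef(first, second)[0, 1]
print(round(r_tt, 3))  # a high value means examinees kept their relative positions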

Test-retest reliability has its disadvantages.


1. The test-retest method is a time-consuming method of estimating the reliability coefficient.

2. This method assumes that the examinee's physical and psychological set-up remains unchanged in both the testing situations. In fact, the examinee's health, emotional condition, motivational condition and mental set-up do not remain perfectly uniform. Not only this, the examiner's physical and mental make-up also changes. Besides, some uncontrolled environmental changes may take place during the administration of the test.

3. All these factors are likely to make the total score of the examinee different from the first
administration and thus, the examinee's relative position is likely to change, thereby lowering the
reliability coefficient. Obviously, such factors contribute to the error variance and reduce the
proportion of the true variance in total variance. In a nutshell, the source of error variance in the
test-retest method is time sampling.

4. Maturational effects also operate in contributing to the error variance. When the examinees are
young children and the time interval between the two administrations is a comparatively long
one, such effects are more obvious. Since maturational growth is not uniform for all young
examinees, they are likely to produce a wider fluctuation in test score on the second
administration, thus lowering the reliability coefficient of the test.

5. Besides, once the examinee is acquainted with the types of items and their mode of answering, he is likely to develop a skill which may help him in the second administration. He is also likely to memorize many answers given in the first administration, especially if the second administration follows a week after the first one. All the acquired skill, knowledge and memory of the first answers are likely to help examinees in answering in a more or less similar way the second time, thus helping them retain the same relative position.

6. Obviously, these factors contribute to the true variance and are also likely to inflate the
reliability coefficient of the test score. Apart from this, tests that measure constantly changing
characteristics are not appropriate for test-retest evaluation.

Despite all these limitations, the test-retest method is the most appropriate method of estimating
reliability of both the speed test and the power test. For a heterogeneous test, too, the test-retest
method is the most appropriate method of computing reliability.

Internal Consistency Reliability

Internal consistency reliability indicates the homogeneity of the test. If all the items of the test
measure the same function or trait, the test is said to be a homogeneous one and its internal
consistency reliability would be pretty high. The most common method of estimating internal
consistency reliability is the split-half method in which the test is divided into two equal or
nearly equal halves. The common way of splitting the test is the odd-even method. Almost any
split can be accepted except the first half and the second half of the items. A division of this sort
is not preferred because the nature of items in the two halves in a power test is different. Usually,
the easier items are placed at the beginning or in the first half of the test and the comparatively
difficult items are placed towards the end of the test or in the second half of the test. However,
the odd-even method can be reasonably applied for the purpose of splitting. In this method, all
odd-numbered items (like 1, 3, 5, 7, 9, etc.), constitute one part of the test and all even-numbered
items (like 2, 4, 6, 8, 10, 12, etc.) constitute another part of the test. Each examinee, thus,
receives two scores: the number of correct answers on all odd-numbered items constitutes one
score and the number of correct answers on all even-numbered items constitutes another score
for the same examinee. In this way, from a single administration of the single form of the test, two sets of scores are obtained. Product moment (PM) correlation is computed to obtain the reliability
of the half test. On the basis of this half-test reliability, the reliability for the whole test is
estimated. The source of error variance in the split-half technique is content sampling or item
sampling in which scores on the test tend to differ due to particular nature of items or due to
differences in selection of items.
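The computation can be sketched as follows (illustrative only; the response matrix is invented, and the whole-test estimate uses the standard Spearman-Brown step-up of the half-test correlation, which the text refers to later as the Spearman-Brown formula):

import numpy as np

# Hypothetical 0/1 responses of 6 examinees to a 10-item test (rows = examinees)
X = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
])

odd  = X[:, 0::2].sum(axis=1)   # odd-numbered items: 1, 3, 5, ...
even = X[:, 1::2].sum(axis=1)   # even-numbered items: 2, 4, 6, ...

r_half = np.corrcoef(odd, even)[0, 1]   # reliability of the half test
r_full = 2 * r_half / (1 + r_half)      # Spearman-Brown estimate for the whole test
print(round(r_half, 3), round(r_full, 3))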

The advantage of the split-half method is that all data necessary for the computation of the
reliability coefficient are obtained in a single administration of the test. Thus the variability
produced by the difference in the two administrations of the same test is automatically
eliminated. Therefore, a quick estimate of the reliability is made. That is why, Guilford and
Fruchter (1973, 409) have described it as on-the-spot reliability.

The disadvantage is that since both sets of scores are obtained on one occasion, fluctuations due to changes in the temporary conditions within the examinee, as well as temporary changes in the external environment, will operate in one direction, that is, either favorably or unfavorably; the obvious result would be either an enhancement or a depression of the real coefficient of reliability. Another demerit of the split-half method is that it should not be used with a speed test. Moreover, a test can be divided into halves through different methods and it has been found that each method yields a different coefficient of reliability. Undoubtedly, this is another weakness of the split-half technique of estimating the reliability coefficient of a test.

Rulon and Flanagan Formulas

An estimate of internal consistency is also made through the Rulon formula and the Flanagan formula. Both these formulas provide the reliability of the whole test (that is, of the total-test score) and not of the half test, and both estimate the reliability coefficient on the basis of the proportion of error variance in the total variance of the test. The lesser the error variance, the higher the true variance and, therefore, the higher will be the reliability.

The use of the Rulon formula requires that the test be divided into two equal halves, either through the odd-even method or by any other method. Thus each examinee would have one subtotal score on odd-numbered items and another subtotal score on even-numbered items. A simple difference between the two subtest scores would indicate the error of measurement or chance error of each examinee and thus, would give an idea of the error variance. The Rulon formula is:

r = 1 - σ²(d) / σ²(t)

where σ²(d) is the variance of the differences between the two half scores and σ²(t) is the variance of total scores.

The Flanagan formula for estimating reliability is very similar to the Rulon formula. In the Flanagan formula, the variance of the scores on odd-numbered items and the variance of the scores on even-numbered items are calculated separately and then an estimate of error variance is made. Thus, unlike the Rulon formula, it is not based upon the difference of the two half scores. The Flanagan formula is:

r = 2 [1 - (σ²(a) + σ²(b)) / σ²(t)]

where σ²(a) and σ²(b) are the variances of the two halves and σ²(t) is the variance of total scores.

Thus, the Rulon formula and the Flanagan formula have yielded the same coefficient of
reliability from the data of Table 5.4, which automatically checks the accuracy of the
computation. These two formulas are also applicable to the computation of reliability of the
alternate forms of the test. The advantage of these two formulas over the Spearman-Brown
formula is that one need not calculate the reliability coefficient of the half-test scores.
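A quick numerical check of the agreement between the two formulas (the half scores below are invented for illustration):

import numpy as np

# Hypothetical half-test scores (odd- and even-item subtotals) of 6 examinees
a = np.array([10, 7, 12, 5, 11, 6])   # odd-numbered items
b = np.array([9, 8, 12, 4, 10, 7])    # even-numbered items
t = a + b                             # total scores
d = a - b                             # differences between the halves

rulon    = 1 - d.var() / t.var()
flanagan = 2 * (1 - (a.var() + b.var()) / t.var())
print(round(rulon, 4), round(flanagan, 4))   # the two estimates coincide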

Kuder and Richardson formula and Coefficient alpha



Kuder and Richardson (1937) did a series of researches to remove some of the difficulties of the split-half method of estimating reliability. They were dissatisfied with the split-half method and therefore devised their own formulas for estimating the internal consistency of the test. Their formulas 20 and 21 have become very popular and well known. K-R 20 is the basic formula for computing the reliability coefficient and K-R 21 is the modified form of K-R 20. The main requirements for the use of the K-R formulas are:

(i) All items of the test should be homogeneous, that is, each item should measure the same
factor or factors in the same proportion. In other words, the test should have inter-item
consistency, which is indicated by high inter-item correlation. Thus, the test should be a unifactor
one.

(ii) Items should be scored either as +1 or 0, that is, all correct answers should be scored as +1
and all incorrect answers should be scored as zero.

(iii) For K-R 20, items should not vary much in their indices of difficulty and for K-R 21, all items should be of the same difficulty value. If the indices of difficulty of items are not equal, the value of reliability yielded by K-R 21 would be substantially lower than that computed from K-R 20. K-R 20 is:

r = [n / (n - 1)] [1 - (Σpq) / σ²(t)]

where n is the number of items, p is the proportion of examinees answering an item correctly, q = 1 - p, and σ²(t) is the variance of total scores.

Coefficient alpha: The Kuder-Richardson formulas are applicable to tests whose items are scored as 0 or 1 (right or wrong) or according to some other all-or-none system. Some tests, however, may have multiple-scored items. We often see that on a personality inventory the testee or respondent receives a different numerical score on an item, depending upon whether he checks 'sometimes', 'usually', 'rarely' or 'never'. For calculating the reliability of such tests, a generalized formula known as coefficient alpha or α (also called Cronbach's alpha) has been formulated (Cronbach 1951; Kaiser & Michael 1975). Coefficient alpha estimates the internal consistency of tests in which items are not scored as 0 or 1 (right or wrong). The formula for coefficient alpha is:

α = [n / (n - 1)] [1 - (Σσ²(i)) / σ²(t)]

where n is the number of items, σ²(i) is the variance of scores on a single item, and σ²(t) is the variance of total scores.

The procedure is to find out the variance of all individual scores on each item and then to add
these variances across all items. Psychometricians today regard the coefficient alpha as the most
general method of finding estimates of reliability through internal consistency. The sources of
error variance in coefficient alpha are content sampling and content heterogeneity.
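Both coefficients can be computed in a few lines; the sketch below (with an invented 0/1 response matrix and an invented Likert-type matrix) shows that K-R 20 is simply coefficient alpha applied to dichotomous items:

import numpy as np

def cronbach_alpha(X):
    """Coefficient alpha; rows = examinees, columns = items."""
    n = X.shape[1]                      # number of items
    item_vars = X.var(axis=0).sum()     # sum of item variances
    total_var = X.sum(axis=1).var()     # variance of total scores
    return (n / (n - 1)) * (1 - item_vars / total_var)

# Dichotomous items: alpha here equals K-R 20, since for a 0/1 item variance = pq
dichotomous = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
])
print(round(cronbach_alpha(dichotomous), 3))

# Multiple-scored (Likert-type) items, e.g. never = 1 ... usually = 4
likert = np.array([
    [3, 4, 3, 4],
    [2, 2, 1, 2],
    [4, 4, 4, 3],
    [1, 2, 2, 1],
])
print(round(cronbach_alpha(likert), 3))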

Alternate-Forms Reliability

Alternate-forms reliability is known by various names such as parallel-forms reliability, equivalent-forms reliability and comparable-forms reliability. Alternate-forms reliability requires that the test be developed in two forms, which should be comparable or equivalent. Two forms of the test are administered to the same sample, either immediately on the same day or with
the time interval of usually a fortnight. When the reliability is calculated on the basis of data
collected immediately on the basis of two administrations of the test, it is called alternate-form
(immediate) reliability and when the reliability is calculated on the basis of the data collected
after a gap of a fortnight, it is called alternate-form (delayed) reliability. In the former, the source
of error variance is content sampling, whereas in the latter, the sources of variance are time
sampling, content sampling and content heterogeneity. Pearson r between two sets of scores
obtained from two equivalent forms becomes the measure of reliability. Such a coefficient is
known as the coefficient of equivalence. Alternate-forms reliability measures the consistency of
the examinee's scores between two administrations of parallel forms of a single test. A very short
time interval between the administrations of the two forms may help the examinees in
maintaining the same position on the second form.

The biggest problem in the parallel-forms method is how to make both the forms equivalent in the true sense (because, if the forms are not equivalent, the reliability coefficient may not be based upon the
true variances). Gulliksen (1950) has defined parallel tests (or equivalent tests) as tests having
equal means, equal variances and equal inter-item correlations. Freeman (1962, 72) has listed the
following criteria for judging whether or not the two forms of the test are parallel.

1. The number of items in both the forms should be the same.

2. Items in both the forms should have uniformity regarding the content, the range of difficulty
and the adequacy of sampling.

3. Distribution of the indexes of difficulty of items in both should be similar.

4. Items in both the forms should be of equal degree of homogeneity, which can be shown either
by inter-item correlation or by correlating each item with subtest scores or with total test scores.

5. Means and standard deviations of both the forms should be equal or nearly so.

6. Mode of administration and scoring of both the forms should be uniform.

A test in order to be called a parallel test must meet the above criteria very closely, if not
perfectly. Alternate-forms reliability is most appropriate when a speed test is being constructed.
But this does not mean that it cannot be applied to a power test.

Scorer reliability is the reliability which can be estimated by having a sample of tests independently scored by two or more examiners. The two sets of scores, one from each examiner, are correlated in the usual way and the resulting correlation coefficient is known as scorer reliability. This type of reliability is needed especially when subjectively scored tests are employed in research. The source of error variance in scorer reliability is interscorer differences.
Test-retest reliability, internal consistency reliability and parallel-forms reliability express
reliability in terms of the correlation coefficient. As we know, the correlation coefficient is
always relative; the reliability obtained by the said methods is, therefore, also known as the
relative reliability. Sometimes analysis of variance (ANOVA) is also applied as a measure of
relative reliability. Analysis of variance technique has been used by Hoyt (1941), Jackson (1939)
and Alexander (1947). In applying the analysis of variance technique to individual items and determining the error variance thereupon, Hoyt has made four assumptions. The first assumption is that the total score of an examinee on a test can be divided into four independent components: (a) a component which is common to all examinees and to all items of the test; (b) a component associated with the items only; (c) a component associated with the examinees only; and (d) an error component independent of the first three. The second assumption is that the variance of the error component of each item is equal. The third assumption is that the error component for each item is symmetrically and normally distributed; and lastly, the error components of any two distinct items are independent. Hoyt's formula for the reliability coefficient is:

r = 1 - MS(error) / MS(examinees)

where MS(examinees) is the mean square for examinees and MS(error) is the residual mean square from the two-way analysis of variance of the examinee-by-item score matrix.
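A sketch of Hoyt's ANOVA approach (the response matrix is invented; the resulting coefficient is algebraically identical to coefficient alpha):

import numpy as np

# Hypothetical scores: rows = examinees, columns = items
X = np.array([
    [3, 4, 3, 4],
    [2, 2, 1, 2],
    [4, 4, 4, 3],
    [1, 2, 2, 1],
], dtype=float)

n, k = X.shape
grand = X.mean()

# Two-way ANOVA without replication: examinees x items
ss_persons  = k * ((X.mean(axis=1) - grand) ** 2).sum()
ss_items    = n * ((X.mean(axis=0) - grand) ** 2).sum()
ss_total    = ((X - grand) ** 2).sum()
ss_residual = ss_total - ss_persons - ss_items

ms_persons  = ss_persons / (n - 1)
ms_residual = ss_residual / ((n - 1) * (k - 1))

hoyt = 1 - ms_residual / ms_persons
print(round(hoyt, 3))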

FACTORS INFLUENCING RELIABILITY OF TEST SCORES

The reliability of test scores is influenced by a large number of factors and all these factors can
be categorized under two heads: extrinsic and intrinsic. Extrinsic factors are those factors which
lie outside the test itself and tend to make the test reliable or unreliable.

Extrinsic Factors

Important extrinsic factors affecting the reliability of a test may be enumerated as follows:

1. Group variability: When the group of examinees being tested is homogeneous in ability, the reliability of the test scores is likely to be lowered. But when the examinees vary widely in their range of ability, that is, when the group of examinees is a heterogeneous one, the reliability of the test scores is likely to be high. The effect of variability on reliability can be examined by seeing what happens when variability is zero: if every examinee obtained the same score, there would be no true variance at all, and a correlation-based reliability coefficient could not even be computed.

2. Guessing by the examinees: Guessing in a test is an important source of unreliability. In two-alternative response options, there is a 50% chance of answering the item correctly on the basis of a guess. In multiple-choice items, the chances of getting the answer correct purely by guessing are reduced. Guessing has two important effects upon the total test scores. First, it tends to raise the total score and thereby makes the reliability coefficient spuriously high. Second, guessing contributes to the measurement error since the examinees differ in exercising their luck in guessing the correct answer.

3. Environmental conditions: As far as possible, the testing environment should be uniform. Arrangements should be such that light, sound and other comforts are equal and uniform for all the examinees; otherwise, they will tend to lower the reliability of the test scores.

4. Momentary fluctuations in the examinee: Momentary fluctuations influence the test score
sometimes by raising the score and sometimes by lowering it. Accordingly, they tend to affect
reliability. A broken pencil, momentary distraction by the sudden sound of an airplane flying
above, anxiety regarding noncompletion of homework, mistake in giving the answer and
knowing no way to change it, are some of the factors which explain momentary fluctuations in
the examinee.

Intrinsic Factors

The main intrinsic factors affecting the reliability of a test are as follows:

1. Length of the test: A longer test tends to yield a higher reliability coefficient than a shorter
test. Lengthening the test or averaging total test scores obtained from several repetitions of the
same test tends to increase the reliability. It has been demonstrated that averaging the test scores
of several applications essentially gives the same result as increasing the length of the test.

2. Range of the total scores: If the obtained total scores on the test are very close to each other,
that is, if there is lesser variability among them, the reliability of the test is lowered. On the other
hand, if the total scores on the test vary widely, the reliability of the test is increased. Putting this
in statistical terms, it can be said that when the standard deviation of the total scores is high, the
reliability is also high and when the standard deviation of the total scores is low, the reliability is
also low.

3. Homogeneity of items: Homogeneity of items is an important factor in reliability. The concept of homogeneity of items includes two things: item reliability (or inter-item correlation) and the homogeneity of the function or trait measured from one item to another. When the items measure different functions and the intercorrelations of items are zero or near zero (that is, when the test is heterogeneous), the reliability is zero or very low. When all items measure the same function or trait and the inter-item correlation is high, the reliability of the test is also high.

4. Difficulty value of items: In general, items having indices of difficulty at 0.5, or close to it, yield higher reliability than items of extreme indices of difficulty. In other words, when items are too easy or too difficult, the test yields very poor reliability (because such items do not contribute to the reliability), poorer than when items are of moderate difficulty values. Sometimes such items may also be wholly indiscriminating, in which case they contribute nothing to reliability (cf. Table 4.1).

5. Discrimination value: When the test is composed of discriminating items, the item-total test
correlation is likely to be high and then, the reliability is also likely to be high. But when items
do not discriminate well between superior and inferior, that is, when items have poor
discrimination values, the item-total correlation is affected, which ultimately attenuates the
reliability of the test.

6. Scorer reliability: Scorer reliability (also known as reader reliability) is also an important
factor which affects the reliability of the test. By scorer reliability is meant how closely two or
more scorers agree in scoring or rating the same set of responses. If they do not agree, the
reliability is likely to be lowered.

HOW TO IMPROVE RELIABILITY OF TEST SCORES

Reliability of test scores can be improved by controlling those factors which adversely affect the
reliability of the test. The following suggestions are useful for improving the reliability:

1. The group of examinees should be heterogeneous, that is, the examinees should vary widely in the ability or trait being measured.

2. Items should be homogeneous.



3. Test should preferably be a longer one.

4. As far as possible, items should be of moderate difficulty values; in other words, the indices of item difficulty should lie roughly in the range of 0.40 to 0.60.

5. Items should be discriminatory ones.

Apart from these general suggestions, there are three common approaches for improving the reliability of a test. One approach emphasizes increasing the length of the test; the second emphasizes throwing out items that pull down the reliability; and the third emphasizes correction for attenuation.

The approach emphasizing an increase in the length of the test assumes that if new items similar to the original set of items are added, the reliability of the test will tend to increase. Following the domain-sampling model, each item in the test is an independent sample of the trait or ability being measured. The larger the sample, the more likely it is that the test will represent the true characteristic. According to this model, the reliability of a test increases as the number of items increases. Formula 5.18 is used to estimate n, the number of times the test should be lengthened in order to reach the projected level of reliability.
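Formula 5.18 itself is not reproduced here; assuming it is the general Spearman-Brown prophecy formula solved for n (the standard result used for this purpose), the computation looks like this:

def lengthening_factor(r_current, r_desired):
    """Spearman-Brown prophecy formula solved for n, the factor by
    which the test must be lengthened to reach r_desired."""
    return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

# Example: a 20-item test with reliability 0.70, target reliability 0.90
n = lengthening_factor(0.70, 0.90)
print(round(n, 2))      # ~3.86
print(round(20 * n))    # about 77 items would be needed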

Another approach to improving reliability is to discard the items that pull down the reliability. Under this approach, two techniques are commonly applied: factor analysis and item analysis. For increasing reliability, it must be ensured that all items measure the same thing. Factor analysis shows that tests are most reliable when they are unidimensional (Loehlin 1998; Tabachnick & Fidell 1996), that is, when one factor accounts for considerably more of the variance than any other factor. Items that fail to load on this factor should be omitted or discarded. In item analysis, generally the correlation between each item and the total score for the test is examined.

This type of item analysis is called discriminability analysis. When the correlation between a
single item and total test score is low, the item is probably measuring something different from
other items on the test. On the other hand, it may also mean that the item is so easy or so hard
that the sample does not differ in its response to it. In either case, the low correlation indicates
that such items are pulling down the estimate of reliability and therefore, they should be
excluded.
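A minimal discriminability-analysis sketch (invented responses; the item-total correlation here is the simple, uncorrected one, correlating each item with the total score):

import numpy as np

# Hypothetical 0/1 responses: rows = examinees, columns = items
X = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
])

total = X.sum(axis=1)
for j in range(X.shape[1]):
    r = np.corrcoef(X[:, j], total)[0, 1]
    flag = "  <- candidate for removal" if r < 0.2 else ""
    print(f"item {j + 1}: item-total r = {r:+.2f}{flag}")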

The third approach to enhance reliability is to go for correction for attenuation.

VALIDITY
Validity refers to the degree to which a test measures what it claims to measure. Validity is not the self-correlation of the test; rather, it is the correlation of the test with some outside independent criterion, which is regarded by experts as the best measure of the trait or ability being measured by the test.

"The validity of a test concerns what the test measures and how well it does so." Lindquist
(1951, 213) has defined validity of a test as "the accuracy with which it measures that which is
intended to measure or as the degree to which it approaches infallibility in measuring what it
purports to measure." Kaplan and Saccuzzo (2001) have defined validity as "the agreement
between a test score or measure and the quantity it is believed to measure."

Validity has five important properties:

1. Validity is a relative term. A test is not generally valid. It is valid only for a particular purpose.
For example, a test of statistical ability will be valid only for measuring statistical ability because
it is put to use only for measuring that ability. It will be worthless for other uses like measuring
the knowledge of geography, history, etc. It is obvious from this interpretation that strictly
speaking, one validates not a measuring instrument, rather some uses to which the test is put.

2. Validity is not a fixed property of the test because validation is not a fixed process but an unending one. With the discovery of new concepts and the formulation of new meanings, the old contents of the test become less meaningful. Therefore, they need to be modified radically in the light of the new meanings. Hence, the validity of a test computed in the beginning becomes less dependable and the test constructor should compute a fresh validity of the test in the light of the new meanings attached.

3. Validity, like reliability, is a matter of degree and not an all-or-none property. A test meant for
measuring a particular trait or ability cannot be said to be either perfectly valid or not valid at all.

4. Validity is a unitary concept. In the 1999 revision of the Standards for Educational and Psychological Testing by the American Educational Research Association (AERA), the American Psychological Association (APA) and the National Council on Measurement in Education (NCME), the view that there are different types of validity was discarded. Instead, validity is viewed as a unitary concept based on various kinds of evidence.

5. Validity involves an overall evaluative judgement. In other words, it requires an evaluation of the degree to which the interpretations and uses of test results are justified by supporting evidence as well as by the consequences of those interpretations and uses.

The Standards for Educational and Psychological Testing have provided five sources of evidence for evaluating the validity of a specific use or interpretation. These sources of evidence are (a) test content, (b) response processes, (c) internal structure, (d) relations to other variables and (e) the consequences of testing. It means that the validity may include a
consideration of the content measured, the ways of responding by the respondents, the
relationship of the individual items to the test scores, relationship of the performance to other
measures as well as the consequences of using the assessment.

ASPECTS OF VALIDITY

Since a test is valid to the extent that it serves the purpose for which it is to be used, and since
there are many purposes of testing, it automatically follows that there are different aspects of
validity representing each purpose of the test. Ordinarily, there are three main purposes of
testing:

1. Representation of a certain specified area of content: The tester may wish to determine how an examinee performs at present in a sample of situations (or contents) that the test claims to represent. For example, through an English spelling test (a kind of achievement test) the tester may determine the present level of English spelling among school pupils.

2. Establishment of a functional relationship with a variable available at present or in the future: The tester may wish to predict an examinee's future standing on a certain variable or he may wish to determine his present standing on a particular variable. For example, on a mechanical aptitude test he may wish to measure mechanical aptitude and predict the examinee's future performance in the job of a mechanic. Likewise, a tester may wish to determine the level of emotional adjustment of the examinee through an appropriate measure so that he may be able to infer his day-to-day adjustment with his peers.

3. Measurement of a hypothetical trait or quality (or construct): A tester may wish to determine the extent to which an examinee possesses some trait as measured by the test performance. For example, a tester may wish to know whether or not an examinee scores high on some abstract measure like extroversion, neuroticism or intelligence, which cannot be observed directly.

Content validity

Content validity is also designated by other terms such as intrinsic validity, relevance, circular validity and representativeness. Content validity is a nonstatistical type of validity that is usually associated with achievement tests. When a test is constructed so that its content measures what the whole test claims to measure, the test is said to have content or curricular validity. Thus content validity is concerned with the relevance of the contents of the items, individually and as a whole. Each individual item or content of the test should correctly and adequately sample or measure the trait or variable in question, and the test as a whole should contain only representative items of the variable to be measured. Anastasi (1968, 100) has said that content validity "involves essentially the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured."

In fact, content validity is the degree to which a test measures an intended content area.
Psychometricians are of the view that content validity requires both item validity and sampling
validity. Item validity is basically concerned with whether the test items represent measurement
in the intended content area, and sampling validity is concerned with the extent to which the test
samples the total content area.

Content validity of a test is examined in two ways: (i) by experts' judgement, and (ii) by statistical analysis.

In the judgement method, the contents (or items) of the test are submitted to a group of subject-matter experts. Suppose the test measures knowledge of Indian history; these experts will judge whether or not the items represent all the important events of Indian history, whether or not some additional items should be added for complete coverage, what the relative weights of items on a particular event should be, and so on. The validity of the contents or items will be dependent upon a consensus judgement of the majority of the subject-matter experts.
Statistical methods may also be applied to ensure that all items measure the same thing, that is, a
statistical test of internal consistency may provide evidence for the content validity. Another
statistical technique for ensuring content validity may be to correlate the scores on the two
independent tests, both of which are said to measure the same thing. Suppose one wants to know
the content validity of a Hindi spelling test. Then the teacher can correlate the scores on the said
test with another similar Hindi spelling test.

Data relating to the item-discriminating power may also provide circumstantial evidence for content validity. Items showing such power, that is, items discriminating between superior and inferior examinees, are said to have content validity. The following points should be fully covered for ensuring full content validation of a test:

1. The area of content (or items) should be specified explicitly so that all major portions are adequately covered by the items in proper proportion. This specification should be followed rigidly to counter the general tendency of item writers to include such items as are readily available and easily written.

2. Before the item writing starts, the content area should be fully defined in clear words and must
include the objectives, the factual knowledge and the application of principles and not just the
subject matter.

3. The relevance of contents or items should be established in the light of the examinee's
responses to those contents and not in the light of apparent relevance of the contents themselves.
This is because the contents may appear to be relevant for a specific skill or a certain course of
study but may not be equally relevant to the examinees and then, they may misunderstand and
give inappropriate responses. Thus, content validity is dependent not upon the apparent relevance
of the test items, rather upon the relevance of the responses given by the examinees towards
those test items.

Content validity is most appropriately applied to the achievement test or the proficiency test.

Face validity is often confused with content validity, but in the strict sense it is quite different.
Face validity refers not to what the test actually claims to measure but to what it appears to
measure superficially. Face validity is needed in all types of tests and helps a lot in improving the
objectively determined validity of the test by improving the wording and structure of the test
contents.

Criterion-related Validity

Criterion-related validity is a very common and popular type of test validity. As its name implies,
criterion-related validity is one which is obtained by comparing (or correlating) the test scores
with scores obtained on a criterion available at present or to be available in the future. The
criterion is defined as an external and independent measure of essentially the same variable that
the test claims to measure. In this sense, by way of defining the validity of a test, Cureton (1965)
has said that the validity of a test is an estimate of the correlation coefficient between the test
scores and the "true" (that is, perfectly reliable) criterion scores. There are two subtypes of
criterion-related validity: (a) predictive validity, and (b) concurrent validity. A detailed
discussion of these two subtypes is given below.

Predictive Validity

Predictive validity is also designated as empirical validity or statistical validity. As its name implies, in predictive validity a test is correlated against a criterion that will be made available sometime in the future. Marshall & Hales (1972, 110) have said, "The predictive validity coefficient is a Pearson product-moment correlation between the scores on the test and an appropriate criterion, where the criterion measure is obtained after the desired lapse of time." For example, a scholastic aptitude test administered at the time of admission may be validated against the grades the same students earn at the end of the course. Predictive validity is needed for tests which involve long-range forecasts of academic achievement, forecasts of vocational success and forecasts of reaction to therapy.

Concurrent Validity

Concurrent validity is another subtype of criterion-related validity. Concurrent validity is very similar to predictive validity, except that there is no time gap between obtaining the test scores and the criterion scores. The test is correlated with a criterion which is available at the present time.
Scores on a newly constructed intelligence test may be correlated with scores obtained on an
already standardized test of intelligence. The resulting coefficient of correlation will be an
indicator of concurrent validity. If the correlation is too high, it will indicate that the new test is a
needless duplication of the previous one. Concurrent validity is most suitable to tests meant for
diagnosis of the present status rather than for prediction of future outcomes.

Concurrent validity can be determined by establishing relationships or discrimination. The relationship method is simple and involves determining the relationship between scores on the test and scores on some other established criterion which is concurrently available. In this method, the steps involved are as follows:

(a) The test is administered to a defined group of individuals.

(b) The criterion or previously established valid test is also administered to the same group of
individuals.

(c) Subsequently, the two sets of scores are correlated.

(d) The resulting coefficient indicates the concurrent validity of the test. If the coefficient is high,
the test has good concurrent validity.

The discrimination method of computing concurrent validity involves determining whether the
test scores can be used to discriminate between persons who possess a certain characteristic and
those who don't possess such a characteristic, or between those who possess more of a certain
characteristic and those who possess less of that characteristic. For example, a mental adjustment
inventory would have concurrent validity if scores obtained on it could be used to discriminate
correctly between institutionalized and noninstitutionalized persons.
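The discrimination method can be sketched by comparing test scores across two known groups; in the example below (all data invented), a point-biserial correlation between scores and group membership serves as the validity evidence:

import numpy as np

# Hypothetical adjustment-inventory scores of two known groups
institutionalized    = np.array([22, 25, 19, 28, 24])
noninstitutionalized = np.array([38, 41, 35, 44, 39])

scores = np.concatenate([institutionalized, noninstitutionalized])
group  = np.array([0] * 5 + [1] * 5)   # 0 = institutionalized, 1 = not

# Point-biserial correlation: Pearson r with a dichotomous variable
r_pb = np.corrcoef(scores, group)[0, 1]
print(round(r_pb, 3))   # a high value -> scores discriminate between the groups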

Construct Validity

Construct validity is the third important type of validity. The term "construct validity" was first
introduced in 1954 in the Technical Recommendations of the American Psychological
Association and since then it has been frequently used by measurement theorists. It is a more
complex and difficult process than content validation and criterion-related validation. The
process of construct validity is required when, "no criterion or universe of content is accepted as
entirely adequate to define the quality to be measured, " (Cronbach & Meehl 1955, 282).

Construct validity has also been given other names such as factorial validity and trait validity. In
construct validity, the meaning of the test is examined in terms of a construct. Anastasi (1968,
114) has defined it as "the extent to which the test may be said to measure a theoretical construct
or trait." According to Nunnally (1970), a construct indicates a hypothesis which tells us that "a
variety of behaviours will correlate with one another in studies of individual differences and/or
will be similarly affected by experimental treatments." A few examples of construct are anxiety,
intelligence, verbal fluency, extroversion, neuroticism, dominance, etc.

The process of validation involves the following steps:

I. Specifying the possible different measures of the construct

II. Determining the extent of correlation between all or some of those measures of the construct

III. Determining whether or not all or some measures act as if they were measuring the construct

A brief discussion of each of these steps for determining construct validity follows:

I. Specifying the possible different measures of the construct: This is the first step in any construct-validational study. Here the investigator explicitly defines the construct in clear words and also states one or many supposed measures of that construct. There is no standard way of stating the different measures of the construct. Specification of such measures is partly dependent upon the previous research conducted in that area and partly upon the intuition of the investigator.

II. Determining the extent of correlation between all or some of the measures of the construct: When adequate measures of the construct have been outlined, the second step consists of determining whether or not those well-specified measures actually lead to the measurement of the concerned construct. This is done through an empirical investigation in which the extent to which the various measures go together, or correlate with each other, is determined. If the various measures correlate highly with each other, or if the various measures are affected by a variety of experimental manipulations (independent variables) in much the same way, we get sufficient evidence for concluding that they are measuring the same thing (that is, the same construct).

III. Determining whether or not all or some measures act as if they were measuring the construct: When it has been determined that all or some measures or referents of the construct correlate highly with each other (providing sufficient evidence that they all measure the same thing), the next step is to determine whether or not such measures behave with reference to other variables of interest in the expected manner. If they behave in the expected manner, they provide evidence for the construct validity.

FACTORS INFLUENCING VALIDITY

Validity of a test is influenced by several factors. Some of the important factors are enumerated
below.

Length of the Test: Homogeneous lengthening of the test increases not only the reliability but also the validity of the test. The longer the test, the more reliable and valid it becomes. Thus, lengthening the test or repeatedly administering the same test increases the reliability, and since validity in a homogeneous test is dependent upon reliability, lengthening also increases the validity of the test. But validity, as compared to reliability, does not change as rapidly with an increase in the length of the test or with several repeated administrations of the same test.

Range of Ability (or Sample Heterogeneity): Like reliability, validity is also influenced by the range of ability of the samples used. If the subjects have a very limited range of ability (that is, a wide range of scores is not possible), the validity coefficient will be low. On the other hand, if the subjects have a wide range of ability, so that a wide range of scores is obtained, the validity coefficient of the test will be enhanced.

Ambiguous Directions: If the directions of the test are ambiguous, they will be interpreted differently by different examinees. Moreover, ambiguous directions tend to encourage guessing on the part of the examinees. As a consequence, the validity of the test will be lowered.

Socio-cultural Differences: Cultural differences among societies are likely to affect the validity of a test. A test developed in one culture may not be valid for another culture because of differences in socio-economic status, sex ratios, social norms, etc. However, when a test has been adapted and validated cross-culturally, this factor does not affect the validity of the test.

Addition of Inappropriate Items: When inappropriate items, particularly vague ones whose difficulty values differ widely from the original items, are added to the test, they are likely to lower both the reliability and the validity of the test.

Norms
STEPS IN DEVELOPING NORMS

Developing norms is certainly a very difficult task. However, this difficulty can be minimized if
we follow the proper steps in developing norms. The following are the three important steps in
developing norms.

1. Defining the target population

2. Selecting the sample from the target population

3. Standardizing the conditions

These steps may be discussed as follows:

1. Defining the target population:

The first step in developing norms is to define the composition of the target group. The test is
intended to be used for a particular type of person or group of persons. The composition of the
target group (also called normative group) is determined by the intended use of the test. Let us
suppose that the test constructor has constructed the Test of English as a Foreign Language (TOEFL). Obviously, this test is intended for students whose native tongue is not English but who plan to study abroad where the medium of instruction is English. Thus, for TOEFL the target population will consist of such students, for whom the test is relevant and appropriate. A population of PhD candidates, on the other hand, would be an example of an inappropriate target group for TOEFL.

2. Selecting the sample from the target population:

When the test constructor has defined the target population, he proceeds to select a representative sample from the target population or group. To make the selected sample representative of the target population, a cross-sectional representation of the target population must be made.
A cross-sectional representation is one in which people from all sections of the target population
are represented. For ensuring representative character of the sample, various techniques of
sampling are employed. Generally, for constructing norms a larger sample is preferred. For
constituting a larger sample, a completely random sampling technique is the best one; but due to
its impracticability, this technique is seldom employed. Generally, cluster sampling or its
variation is preferred.

3. Standardizing conditions for proper implementation of test:

Unless conditions of test administrations are standardized, valid and proper comparisons of
individual test scores to test norms are impossible. Therefore, factors like adequate sound
control, lighting, ventilation, temperature of working space and the like must be properly
controlled. Above all, factors like test timing, test security, adherence to test-manual direction
and assuring that the examinees work on the proper test sections are most important for standardizing
the conditions of working space. Without these standardization procedures, norms cannot serve
as a useful comparative device.

These are the important steps in developing norms of a test. For norms to be a useful
comparative device, these steps must be covered thoroughly.

TYPES OF NORMS AND TEST SCALES

Ordinarily, the derived scores may be divided into four common types and depending upon each
of these four scores, there are four types of norms. The four commonly derived scores are age
scores, grade scores, percentile scores and standard scores (Lyman 1963). Accordingly, there are
four types of norms: age norms, grade norms, percentile norms and standard-score norms. A
detailed critical discussion of each of these is given below.

Age-equivalent Norms

Age-equivalent norms are defined as the average performance of a representative sample of a certain age level on the measure of a certain trait or ability. If, for example, we measure the weight of a representative sample of 10-year-old girls of the state of Bihar, and find out the average of the obtained weights, we can determine the age norms for the weights of 10-year-old girls. Age norms are most suited to those traits or abilities which increase systematically with age. Since most physical traits like weight, height, etc., and cognitive abilities like general intelligence show such systematic change during childhood and adolescence, age norms can be most appropriately used for these traits or abilities at the elementary level.

There are some disadvantages of age norms.

1. Age norms lack a standard and uniform unit throughout the period of growth of physical and
psychological traits.

2. Another problem in age norms arises from the fact that the growth rates of different traits are not comparable. For example, progress in maze learning does not ordinarily take place after adolescence but progress in vocabulary continues even after adolescence. In such a situation, the age norms for these two traits cannot be compared.

3. A trait like acuity of vision cannot be expressed in terms of age norms because this trait does
not exhibit progressive change over the years. Many other personality traits would also fall into
this category.

Grade-equivalent Norms

Like age-equivalent norms, grade-equivalent norms are defined as the average performance of a
representative sample of a certain grade or class. The test, whose norms are being prepared, is
given to the representative sample selected from each of the several grades or classes. After that,
the average performance of each grade on the given test is determined and then grade equivalents
for the in-between scores are determined arithmetically by interpolation. This average
performance is known as the grade-equivalent norms.

Grade-equivalent norms also have some limitations.

1. Grade-equivalent norms of the same student in different subjects are not comparable. For
example, the grade-equivalent in social studies of a student cannot be compared with the
grade-equivalent in arithmetic of the same student because everyday life experiences outside the
school may contribute to the knowledge of social studies whereas knowledge of arithmetical
concepts is primarily dependent upon formal training in the class.

2. Grade-equivalent norms assume that all students of a class or grade have more or less similar
curriculum experiences. This assumption may be true in the elementary classes but it may not be
true for higher classes.

3. The grade-equivalent norm is not suited to those subjects in which there occurs rapid growth
in the elementary class and a very slow growth in the higher classes. For example, in spelling
and arithmetic there occurs rapid growth in the elementary grades but growth is slow in the
higher classes.

Despite these limitations grade-equivalent norms are common, particularly among achievement
tests and educational tests. Such norms are also suited to intelligence tests.

Percentile Norms (or Percentile-rank Norm)

It should be noted that a percentile rank refers to a percentage whereas a percentile refers to a score. Percentile norms are the most popular and common type of norms used in psychological and educational tests. Such norms can be prepared for either adults or children and for any type of test. A percentile norm indicates, for each raw score, the percentage of the standardization sample that falls at or below that raw score. Some of the important advantages of percentile norms are that they are easy to construct, easy to understand, and even an untrained person can freely use them.
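Computing a percentile-rank norm is straightforward; a minimal sketch (the standardization scores are invented):

import numpy as np

# Hypothetical raw scores of a standardization sample
sample = np.array([12, 15, 15, 18, 20, 21, 21, 21, 24, 27, 30, 33])

def percentile_rank(raw_score, standardization_sample):
    """Percentage of the standardization sample falling at or below the raw score."""
    return 100.0 * np.mean(standardization_sample <= raw_score)

print(percentile_rank(21, sample))   # 8 of 12 scores are at or below 21 -> 66.7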

Despite these advantages, such norms have some distinct limitations.

1. Laymen as well as skilled persons sometimes fail to distinguish between the percentile and the percentage score, and the obvious result is confusion. The two should not be confused, because a percentile is a derived or converted score, expressed in terms of the percentage of persons, whereas a percentage score is a raw score expressed in terms of the percentage of items correctly answered or solved.

2. The biggest limitation of the percentile norms is that of inequality of units throughout the
percentile scale.

3. Percentile norms indicate only the person's relative position in the standardization sample.
They convey nothing regarding the amount of the actual difference between the scores.

Standard Score Norms

A norm which is based upon a standard score is known as a standard score norm. The reason
why one needs standard score norms in place of percentile norms is that here units of the scale
are equal so that they convey the same meaning throughout the whole range of the scale. In this
way, standard score norms remove one of the serious problems of inequality of units common
among the percentile norms. In order to understand the standard score norms, it is essential to
first know the meaning of the standard score itself. Standard score, like the percentile score, is a
derived score. It has a specified or fixed mean and a fixed standard deviation. There are several types of standard scores, such as the z score (also known as the sigma score), the T score, the stanine score, the deviation IQ, etc. Of these, the z score is very popular and has been frequently used in preparing norms. In fact, some writers use the term "standard score" interchangeably with the z score but in reality, this should not be so because the z score is just one of several types of standard scores.

Standard scores are needed primarily for two reasons.

Firstly, when the performance of the same person on different tests is to be compared, it is best
done through converting the raw scores into standard scores. Secondly, standard scores have
equal units of measurement and their size does not vary from distribution to distribution. Hence,
they are frequently used in interpreting test scores.

Raw scores can be converted into standard scores through two methods -linear transformation
and normalized transformation. When raw scores are linearly transformed into standard scores,
all characteristics of the original distribution of the raw scores are retained without any change in
the distribution.

The z score is an example of a linearly transformed standard score. A z score indicates how many standard deviations a score lies below or above the mean of the distribution. A z score is a standard score which has a mean of zero and a standard deviation of 1. It is computed by subtracting the mean from the individual's raw score and dividing the difference by the standard deviation (SD) of the distribution. In terms of an equation, a z score may be written as:

z = (X - M) / σ

where z = z score or sigma score; X = raw score; M = mean of the distribution; σ = standard deviation of the distribution.

The z score is often used in making comparisons. In fact, the transformation of raw scores into z scores is very useful when the researcher wants to compare scores from two different distributions. There are two merits of the z score. First, the z score represents the most precise way of indicating position in a distribution. Second, although averaging ranks or percentile ranks is not statistically sound, averaging is perfectly legitimate when z scores are being used. For example, suppose a person has a z score of 1.25 on a mathematics test and a z score of -0.5 on a verbal comprehension test; the average number of standard deviations away from the two means would be [1.25 + (-0.5)]/2 = 0.75/2 = 0.375, that is, 0.375 standard deviation above the mean.

There are some difficulties associated with the use of z scores. First, a z score may be obtained in plus or minus form, as is obvious from the above example, and z scores in minus form are cumbersome to handle. Second, z scores may be expressed in decimal points, which tends to further complicate handling them. In order to circumvent the first difficulty, the score is made larger by adding to each z score a constant, such as 50 or 100, so that the minus signs are eliminated. The difficulty caused by decimal points is solved by using a larger standard deviation, that is, by multiplying every z score by 10 or 20. Such linear transformation of a z score into a new standard score is very common, particularly the system of having a mean of 50 and an SD of 10. The formula for such transformation of the z score is: Standard score = z (new standard deviation) + new mean.
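A short sketch of the transformation (raw scores invented), converting raw scores to z scores and then to the common mean-50, SD-10 scale:

import numpy as np

raw = np.array([42, 55, 61, 48, 70, 53])   # hypothetical raw scores

z = (raw - raw.mean()) / raw.std()         # z scores: mean 0, SD 1
standard_50_10 = 10 * z + 50               # linear transformation: mean 50, SD 10

print(np.round(z, 2))
print(np.round(standard_50_10, 1))         # no minus signs, fewer awkward decimals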

Normalized Standard Score

Marshall & Hales (1972, 120) have defined normalized standard scores as scores which have
been "adjusted to produce a normal frequency distribution and converted to a standard base with
a reassigned mean and standard deviation." Thus normalized standard scores are the standard
scores, which are expressed in terms of the normal distribution. Normalized standard scores can
be expressed in the same form as linearly derived standard scores, that is, with a mean of zero
and a standard deviation of 1. A normalized standard score can, therefore, be interpreted in a similar way. A normalized standard score of +1 indicates that the examinee has surpassed 84% of his group and -1 indicates that he has surpassed only 16% of his group. There are three common normalized standard scores: T scores, stanine scores and the deviation IQ.

The T score was first devised by McCall (1922) and named after E. L. Thorndike, an important leader in educational measurement. The T score is defined as a standard score which is based upon a mean of 50 and a standard deviation of 10, as well as upon the shape of the normal distribution curve. The T-score scale is based upon the normalized z-score scale, which generally extends from -3 to +3 standard deviations with decimal fractions at various points in between. The T-score scale therefore has a range of 20 to 80 in most distributions (Figure 7.2). When raw scores are transformed into T scores, their distribution automatically takes a shape which approximates a normal curve. In terms of a formula, the T score is equal to: T = 10z + 50

Another normalized standard score is the stanine score, which was developed by the United States Air Force during World War II. The term 'stanine' is a contraction of 'standard nine' and its scores are expressed in digits ranging from 1 to 9. The mean of these scores is 5 and the standard deviation is 1.96, or approximately 2 (Figure 7.2). When raw scores are transformed into stanine scores, they automatically take a shape approximating the normal curve. As a matter of fact, stanine scores are condensed scores on the C scale. In the C scale, there are 11 score points ranging from 0 to 10 with the mean lying exactly at 5. For computational convenience with computer punched-card records, the two points at the extremes (that is, 0 on the lower end and 10 on the higher end) are combined, thus leaving only a nine-point scale (called the stanine scale). A variant of the stanine scale is the sten scale proposed by Canfield (1951), where there are 10 units: 5 units above and 5 units below the mean.

Raw scores can be transformed into the stanine scale by arranging them in order of size and then assigning stanine score points according to fixed percentages of the normal distribution curve (Figure 7.2). The first stanine covers 4%, the second stanine 7%, the third 12%, the fourth 17%, the fifth 20%, the sixth 17%, the seventh 12%, the eighth 7%, and the ninth 4% of the total cases. When, for example, there are 300 scores earned by 300 students on a test, the lowest 12 scores (4% of 300) would receive a stanine score of 1, the next 21 scores would receive a stanine score of 2, and so on.
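A sketch of this assignment (scores invented), allocating stanines 1-9 by the cumulative percentages 4, 11, 23, 40, 60, 77, 89, 96, 100:

import numpy as np

scores = np.random.default_rng(7).normal(50, 10, size=300)  # hypothetical raw scores

# Cumulative percentages of cases covered by stanines 1..8 (stanine 9 takes the rest)
cum_pct = [4, 11, 23, 40, 60, 77, 89, 96]
cutoffs = np.percentile(scores, cum_pct)

# Each score falls into the stanine whose cutoff it does not exceed
stanines = np.searchsorted(cutoffs, scores) + 1

counts = np.bincount(stanines, minlength=10)[1:]
print(counts)   # roughly 12, 21, 36, 51, 60, 51, 36, 21, 12 cases out of 300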

CRITERIA OF GOOD TEST NORMS

We have seen that test norms are of different types. Regardless of the type of test norms, it is
essential that their adequacy should be established. Test developers have set some criteria of
good qualities that norms should possess. The following are main criteria for judging whether or
not the test norms possess good qualities:

1. Test norms should be representative: One of the important characteristics of test norms is that they should be truly representative of the group with which comparison is desired. Ideally, test norms should be based on a random sample taken from the population they represent. This is extremely difficult as well as expensive, so a compromise usually has to be made. At the minimum, all the subgroups of the population, such as boys and girls, rural and urban areas, socio-economic strata, caste groups, schools of varying size and so on, must be adequately represented in the sample.

2. Test norms should be relevant: Another good quality of norms is that they should be relevant.
Since norms are based on various types of groups, they must be meaningful in light of the
concerned group. Some test norms are based upon a national sample whereas some are limited to
samples taken from a limited geographical region or state. The variety of groups available for
comparison necessitates the study of the norm sample before using any table of norms. If the
researcher wants to compare a sample or student with a general reference group for diagnosing
the strengths and weaknesses in different areas, national norms may be a good decision. But
when the researcher is trying to decide such things as which students should be placed in a
highly select group, which ones should be encouraged to pursue an engineering course, and which
ones should be encouraged to pursue a medical course, national norms are less fruitful; for such
decisions, norms for each specific group are needed. A student might have above-average
aptitude and achievement when compared with students in general, but he may fall short of
the ability needed to succeed in an accelerated group.

3. Test norms should be up to date: Good test norms should be up to date; they should not be
based upon samples tested long ago. When achievement and education levels are rising, a given
raw score will produce a higher percentile rank when compared to old norms than when compared
to new norms. Therefore, old norms should be discarded and new norms preferred.

4. Test norms should be comparable: Another good quality of test norms is that they should be
comparable. When the researcher wants to compare scores from different tests directly, such as
when comparing aptitude and achievement test scores to identify underachievers, or when
making profile comparisons of test scores to identify the strengths and weaknesses of a student,
comparability of test norms is needed, because only comparable test norms can serve this
function. Comparability of norms is assured when the norms of all the tests have been developed
on the same population. Whenever the researcher wants to compare scores from different tests
directly, he should check the manual to ascertain whether the norms are based on the same group
or whether they have been made comparable by some other means.

5. Test norms should be properly and adequately described: Test norms should provide
information about the norm group, the norming procedures and other relevant factors. Only with
such information can one judge the suitability of the norms for any particular purpose. Test
norms should provide at least the following information:

1. What is the method of sampling?
2. What are the number and distribution of cases in the norm sample?
3. What are the characteristics of the norm group, especially regarding age, sex,
socioeconomic status, geographical location, education level, caste, social group, etc.?
4. Whether standard conditions of administration and motivation were maintained during
testing.

Scale construction
Steps in scale construction
There are three phases in scale construction: item development, scale development, and
scale evaluation; these can be further broken down into the nine steps described below.
Phase 1- Item Development
Step 1. Identification of the Domain(s) and Item Generation
Domain Identification: To create a scale, it is crucial to identify the domain or construct
that will be measured. This must be done before generating any items to understand the
phenomenon being studied and to guide item development and content validation. Identifying a
domain involves several steps, including determining the purpose of the construct, ensuring that
no existing instruments serve the same purpose, describing the domain and its dimensions, and
reviewing the literature to define the domain.
Item Generation: After defining the domain, an item pool can be generated using two
methods: deductive and inductive. The deductive method involves identifying items based on a
description of the relevant domain, through a literature review and assessment of existing scales
and indicators. The inductive method, on the other hand, involves generating items based on the
responses of individuals. It is recommended to use both methods to define the domain and
identify the questions to assess it. During item creation, it is important to consider factors such as
form, wording, and response types, and to use simple and unambiguous language to avoid
offensive or biased language. Balanced scales should also be used to minimize the response set
effect by wording some items positively and others negatively towards the target construct.
Clarity, relevance, and balance are typically taken into consideration when writing items. The
initial pool of items should be at least twice as long as the desired final scale, and undesirable
items can be eliminated during successive evaluations.
Step 2. Content Validity
Content validity, also known as "theoretical analysis," is the degree to which a measure
accurately evaluates the intended domain. It is essential to ensure that the items assess what they
are intended to measure. Content validity ensures that only the phenomenon specified in the
conceptual definition is measured and not any other related aspects. To evaluate content validity,
expert and target population judges are primarily used. Expert judges are people with significant
knowledge of the domain or scale development, while target population judges are potential
users of the scale. Expert judges assess each item to determine if it represents the domain of
interest. It is important to use independent expert judges who are not involved in developing the
item pool and to use multiple judges to avoid bias in the assessment of items. Target population
judges are proficient in evaluating face validity, which is a component of content validity. Face
validity is the extent to which respondents or end-users judge that the items of an assessment
instrument are appropriate for the targeted construct and assessment objectives. These end-users
can determine if the construct is a good measure of the domain through cognitive interviews.

Phase 2. Scale Development


Step 3. Pre-testing Questions
To ensure that the survey questions are relevant and meaningful to the target population, a
pre-test can be conducted prior to administering the survey. The pre-testing process involves two
main components. Firstly, it evaluates how well the questions align with the domain being
studied. Secondly, it assesses whether the answers obtained from the questions provide valid
measurements. By conducting pre-tests, the target population can provide valuable insights into
the survey development process, which can lead to improved questionnaires and more accurate
data collection.
Step 4. Survey Administration and Sample Size
Survey Administration: Gathering accurate data from a sufficient sample size is critical,
and there are two main methods to collect data: traditional Paper and Pen/Pencil Interviewing
(PAPI) and Computer Assisted Personal Interviewing (CAPI) using electronic devices like
laptops, tablets, or smartphones. Both methods have advantages and disadvantages. The use of
technology can minimize errors in data entry, allow for inexpensive collection of data from large
samples, increase response rates, reduce mistakes made by enumerators, provide instant
feedback, and enhance monitoring of data collection and data confidentiality. However, paper
forms can prevent data loss due to software crashes or the loss or theft of devices before backup
and may be more suitable for areas with unreliable electricity or internet. On the other hand, as
sample sizes increase, PAPI becomes more expensive, time-consuming, and labor-intensive, and
the data collected are more susceptible to human errors in various ways.
Establishing the Sample Size: The appropriate sample size for developing a latent
construct has been a topic of discussion, and it is recommended to test potential scale items on a
diverse sample that represents the range of the target population. A widely used guideline is to
have a minimum of 10 participants for each scale item, which results in a respondent-to-item
ratio of 10:1. However, other suggestions for sample sizes are not based on the number of survey
items. There is no fixed respondent-to-item ratio that applies to all survey development
situations. It is better to have a larger sample size or respondent-to-item ratio since it reduces
measurement errors, increases the stability of factor loadings, produces replicable factors, and
yields results that generalize to the true population structure. Conversely, a smaller sample size or
respondent-to-item ratio may lead to unstable loadings and factors, random non-replicable
factors, and non-generalizable results.
Step 5. Item Reduction Analysis
In the process of developing a scale, item reduction analysis is conducted to select
essential, clear, and coherent items. Classical Test Theory (CTT) and Item Response Theory
(IRT) are used as guides in this process. The goal is to identify effective items that are associated
with, differentiate, and contribute substantially to the construct being measured. CTT models
predict construct outcomes and item difficulty, while IRT evaluates item information and
standard error functions. Different techniques, such as analyzing item difficulty, discrimination
indices, inter-item and item-total correlations, and distractor efficiency, are employed to reduce
items.
Item Difficulty Index: The index of item difficulty is utilized in both CTT and IRT
frameworks to determine the relative difficulty and discriminatory ability of test items. In CTT, it
is the proportion of correct answers to a given item, while in IRT it is the probability that a
specific examinee will answer an item correctly. A higher score on the difficulty index suggests
that more participants answered the question correctly, while a lower score implies the
requirement to modify or eliminate the item. This index can assist researchers in identifying
various levels of individual performance and creating questions tailored to specific subgroups.
Item Discrimination Index: The item discrimination index measures the extent to which
an item can differentiate between examinees on a construct and is used in both CTT and IRT
frameworks. It shows the difference in performance between high-scoring and low-scoring
groups and is calculated by subtracting the proportion of lower-performing examinees who got
the item correct from the proportion of higher-performing examinees who got it correct. The
index allows for the identification of positively discriminating, negatively discriminating, and
non-discriminating items.
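
As an illustration, the following Python sketch computes the CTT difficulty index p (proportion correct) and the discrimination index D for each item; the binary response matrix is simulated, and the upper/lower 27% grouping is one common convention:

    import numpy as np

    rng = np.random.default_rng(1)
    # responses: rows = examinees, columns = items; 1 = correct, 0 = incorrect (hypothetical)
    responses = (rng.random((200, 10)) < np.linspace(0.9, 0.3, 10)).astype(int)

    p = responses.mean(axis=0)              # difficulty: higher p = easier item

    total = responses.sum(axis=1)
    cut = int(0.27 * len(responses))        # upper and lower 27% groups
    order = np.argsort(total)
    low, high = responses[order[:cut]], responses[order[-cut:]]
    d = high.mean(axis=0) - low.mean(axis=0)  # discrimination index D

    print(np.round(p, 2))
    print(np.round(d, 2))                   # positive D = positively discriminating item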
Inter-item and item-total correlations: Another technique used in determining which
items to keep in a scale is inter-item and item-total correlations, which is a part of CTT. This
technique involves evaluating the relationships between individual items in a pool using a
matrix. Inter-item correlations measure the extent to which items in a scale are assessing the
same content, and items with very low correlations (less than 0.30) can be removed. Item-total
correlations examine the relationship between each item and the total score of the scale items, and
items with very low adjusted item-total correlations (less than 0.30) may also be eliminated.
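
A minimal sketch of the corrected (adjusted) item-total correlation, applying the 0.30 cut-off mentioned above to hypothetical Likert-type data:

    import numpy as np

    rng = np.random.default_rng(2)
    items = rng.integers(1, 6, size=(150, 8)).astype(float)  # respondents x items, scores 1-5

    total = items.sum(axis=1)
    for j in range(items.shape[1]):
        rest = total - items[:, j]                    # total score excluding the item itself
        r = np.corrcoef(items[:, j], rest)[0, 1]      # corrected item-total correlation
        print(f"item {j + 1}: r = {r:.2f} {'(candidate for removal)' if r < 0.30 else ''}")

With purely random data, as here, most items would fall below 0.30; real scale items measuring a common construct should correlate substantially with the remaining items.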
Step 6. Extraction of Factors
Factor extraction is the step in which the optimal number of domains, or factors, that best
fit a set of items is determined. This phase involves the use of factor analysis to explore the
internal structure of a group of items and assess the consistency of their relationships. This
technique can also be used to reduce the number of items. In factor analysis, items with factor
loadings, which represent the strength of the relationship between an item and a factor, below
0.30 are considered inadequate, as they explain less than 10% of the underlying construct being
measured. Therefore, it is recommended to retain items with factor loadings of 0.40 or higher.
Additionally, items that do not uniquely load on a specific factor or have cross-loadings can be
eliminated.
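
The sketch below illustrates factor extraction and the 0.40 loading rule using scikit-learn's FactorAnalysis on simulated, standardized item responses; a real study would analyze actual survey data and choose the number of factors empirically (e.g., from eigenvalues or fit indices):

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(3)
    latent = rng.normal(size=(300, 2))                     # two underlying factors
    items = latent @ rng.normal(size=(2, 8)) + 0.5 * rng.normal(size=(300, 8))
    items = (items - items.mean(0)) / items.std(0)         # standardize: loadings ~ correlations

    fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
    loadings = fa.components_.T                            # items x factors

    for j, row in enumerate(loadings):
        verdict = "keep" if np.abs(row).max() >= 0.40 else "drop?"
        print(f"item {j + 1}: loadings = {np.round(row, 2)} -> {verdict}")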
Phase 3. Scale Evaluation
Step 7. Tests of Dimensionality
The dimensionality test is a process that assesses whether the factor structure extracted
from a previous model remains consistent when applied to a new sample or a different time point
in a longitudinal study. The purpose of this test is to establish whether the items, factors, and
their roles are comparable across different independent samples or within the same sample over
time.

Step 8. Tests of Reliability


The concept of reliability pertains to how consistent the results of a measurement are
when the same conditions are replicated. There are several statistical techniques used to assess
the reliability of a scale, such as Cronbach's alpha and test-retest reliability, which are commonly
used for this purpose.
Cronbach's Alpha: Cronbach's alpha is a metric used to determine the coherence of a
scale by examining the degree to which a group of items are closely linked. A reliability level of
0.70 is typically viewed as acceptable, but higher-quality scales strive for a range of 0.80 to 0.95.
This approach is generally recognized and frequently employed to assess the dependability of
measurements.
Test-retest Reliability: A different way to assess dependability is through test-retest
reliability. This approach measures how much a participant's scores stay the same over time. To
determine test-retest reliability, researchers utilize several measures such as the intra-class
correlation coefficient or Pearson product-moment correlation. High correlations indicate high
dependability, whereas values near zero imply low dependability. Nevertheless, if there are
changes in the study's circumstances or interventions that might affect the concept being
evaluated, it may lower the test-retest reliability.
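
Both indices can be computed directly from a respondents-by-items score matrix. The following Python sketch (simulated data) implements Cronbach's alpha from its definitional formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), and estimates test-retest reliability as the Pearson correlation between total scores on two occasions:

    import numpy as np
    from scipy import stats

    def cronbach_alpha(items):
        # items: respondents x items matrix of scores
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    rng = np.random.default_rng(4)
    true_score = rng.normal(size=(100, 1))                 # each person's latent level
    time1 = true_score + 0.5 * rng.normal(size=(100, 6))   # 6 items, first occasion
    time2 = true_score + 0.5 * rng.normal(size=(100, 6))   # same construct, retest

    print(f"Cronbach's alpha = {cronbach_alpha(time1):.2f}")
    r, _ = stats.pearsonr(time1.sum(axis=1), time2.sum(axis=1))
    print(f"test-retest r = {r:.2f}")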
Step 9. Tests of Validity
The accuracy of a scale pertains to how well it measures the intended underlying
concept. The process of validating a scale is continuous and begins with identifying and
clearly defining the domain under study, and then extends to its applicability to other related
concepts. To assess the accuracy of a scale, various methods are used such as content validity,
predictive and concurrent criterion validity, as well as convergent, discriminant, differentiation
by known groups, and correlation-based construct validity.
Criterion Validity: Criterion validity involves evaluating the correlation between a test
score and performance on another, highly relevant measure. Predictive and
concurrent are the two types of criterion validity. Predictive validity assesses the degree to which
a test can predict responses to related questions or outcomes. Concurrent validity assesses the
degree to which test scores have a stronger correlation with a gold standard measurement taken
at the time of the test or shortly thereafter.
Construct Validity: Construct validity measures the degree to which an instrument
accurately measures a specific construct and is related to other measurements of constructs in
that area and to measures of real-world outcomes. Convergent validity determines if different
measurements of a construct yield similar outcomes. Discriminant validity determines if a
measurement is distinct and not just a reflection of another construct. All of these methods
together allow for the evaluation of the validity of a newly created or adapted scale.

Attitude scale

The main points about attitude scales are listed below.

1. Self-reports are the most common measures of attitude.
2. The respondents are asked what they think about, how they feel towards, or how they would
behave towards the attitude object.
3. A set of categories or range of scores on a variable is called a scale, and the process of
assigning scale values to objects to yield a measure of the construct is called scaling.
4. Thus scaling is a method of measuring the amount of a property possessed by a class of
objects or events. It is most often associated with the measurement of attitudes.

Scale construction techniques

Scale construction technique        Name of the scale
1. Arbitrary approach               Arbitrary scale
2. Consensus approach               Differential scale (e.g., Thurstone's equal-appearing interval scale)
3. Cumulative scale approach        Cumulative scale (e.g., Guttman's scale)
4. Factor analysis approach         Factor scale (e.g., Osgood's semantic differential scale)
5. Item analysis approach           Likert scale

Attitude can be measured only on the basis of inferences drawn from verbal statements regarding
belief, feeling and tendency to act towards the object or person. Attitude scales (also known as
opinionnaires), which usually consist of a large number of statements towards objects of attitude,
are one such indirect measure. Here we shall discuss the three most common and frequently used
techniques in the construction of attitude scales, namely, the method of equal-appearing intervals,
the method of summated ratings and the method of cumulative scaling.

Method of Equal-Appearing Intervals


Thurstone (1929), Thurstone & Chave (1929), Thurstone (1931) developed a method known as
the method of equal-appearing intervals. The method, as used by Thurstone and Chave, is given
below. A large number of statements, both favourable and unfavourable, towards the object of
attitudes, are collected from various sources. The number of items usually ranges from 100 to
200. Each statement is printed on a separate card and subjects are asked to sort these printed
statements into a number of intervals. Along with the statements, each subject is given 11 cards
on which letters A to K are written. These cards are arranged before the subjects in a manner that
A is kept at the extreme left and K is kept at the extreme right. A indicates the most unfavourable
interval and K represents the most favourable interval. The middle category is designated by the
letter F, which is the neutral category. The cards lettered from G to K represent various degrees
of favourableness and the cards lettered from D to A represent various degrees of
unfavourableness as illustrated below:

Thurstone and Chave defined only the two extremes and the middle category (of the 11
intervals) on the ground that the undefined intervals between successive cards would represent
equal-appearing intervals for all the subjects. The subjects are requested to sort the given
statements in terms of 11 intervals represented by the 11 cards. Ordinarily, there is no time limit
for sorting. But Thurstone and Chave reported that subjects took 45 minutes in sorting 130
statements into 11 intervals. They made the following assumptions in this method:
1. The intervals into which the statements are sorted or rated are equal.
2. The attitude of the subjects does not influence the sorting of the statements into the various
intervals. In other words, subjects having favourable attitudes and those having unfavourable
attitudes would do the sorting in a similar manner. Thus the scale values of the statement are
independent of the attitude of the judges.
There is no fixed number of subjects to be engaged in the sorting work. However, Thurstone and
Chave used 300 subjects in sorting 130 statements. Other investigators like Edwards & Kenney
(1946) and Ferguson (1939) have demonstrated that reliable sorting can be done even with a
smaller number of subjects. When sorting is over, the next step is to determine the scale value
and Q value of each statement.
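
For illustration, the following sketch computes the scale value (median) and Q value (interquartile range) of two statements from hypothetical judge placements on the 11-interval A-K continuum, coded here as 1-11; a small Q indicates an unambiguous statement worth retaining:

    import numpy as np

    # Each list: the interval (1 = A ... 11 = K) assigned to one statement by ten judges.
    sortings = {
        "Statement 1": [9, 10, 9, 11, 10, 9, 10, 8, 10, 9],
        "Statement 2": [3, 2, 4, 3, 5, 2, 3, 4, 3, 6],
    }

    for stmt, judged in sortings.items():
        scale_value = np.percentile(judged, 50)                      # median placement
        q = np.percentile(judged, 75) - np.percentile(judged, 25)    # interquartile range
        print(f"{stmt}: scale value = {scale_value:.1f}, Q = {q:.1f}")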

The method of equal-appearing intervals has its advantages and disadvantages. Its important
advantages are given below.
1. Thurstone scales enable the researcher to differentiate among a large number of people
regarding their attitudinal position. Item scale values (medians of the judges' placements) reveal a
great variety of attitudinal positions. This makes it possible to make finer distinctions among
people according to the attitudes they have.
2. In the Thurstone scale, it can be said with increased confidence that the items being used have a
stronger claim to reliability, because they are based upon the views of judges who show a high
degree of agreement, and bad items reflecting little or no agreement are eliminated.
This method has the following disadvantages:
1. Judges or subjects do not keep the intervals equal. Farnsworth (1943) found evidence in
support of this fact. As a matter of fact, in the method of equal-appearing intervals there is
no way through which this assumption can be tested. Thus one of the assumptions of the method
does not stand the rigours of the experimental test.
2. It is also said that the attitude of the subjects or judges tends to influence the sorting of the
statements into the intervals. In other words, the scale values of the statement are not
independent of the attitudes of the judges who do the sorting.
3. Thurstone and Chave have provided no objective basis for selecting the most discriminating
items from among items having approximately the same scale values. It may be possible that
items having approximately the same scale values differ in their discriminatory power.
4. The subjects may do the sorting work carelessly and with little interest. In such a situation, the
interpretation of the scale values may be a difficult task. Thurstone and Chave have, however,
provided a technique to detect careless judgements so that such judgements can be eliminated.
They have pointed out that if any subject sorts more than 30 statements in any one of the 11
intervals, the judgements of that subject may be rejected on the ground that he has done sorting
either carelessly or has misunderstood the instruction.

Method of Summated Ratings


Likert (1932) developed a different method for the construction of the attitude scale. His method,
named by Bird (1940, 159) as the method of summated ratings, is a simpler method than that of
Thurstone's equal-appearing intervals method. The main steps involved in Likert's method may
be summarized as mentioned below:
1. A large number of multiple-choice-type statements usually with five alternatives such as
strongly agree, agree, undecided, disagree and strongly disagree concerning the object of attitude
are collected by the investigator. Two examples of items intended to measure attitude towards
nationalization are given below:

2. Such statements are administered to a group of subjects who respond to each item by
indicating which of the given five alternatives they agree with.
3. Every item response is scored with a weight ranging from 5 to 1. For favourable statements a
weight of 5 is given to 'Strongly agree', 4 to 'Agree', 3 to 'Undecided', 2 to 'Disagree', and 1 to
'Strongly disagree' (as shown in the first illustrative example); for unfavourable statements the
order of weights is reversed, so that 'Strongly agree' receives 1 and 'Strongly disagree' receives 5.
4. After the weight has been given to items, a total score for each subject is found by adding the
weights earned by him on each item. Thus his total score is obtained after the weights are
summated over all the statements. Since a subject's response to each item may be considered as
his rating of own attitudes on a 5-point scale and his total score is obtained after all these weights
are summated, the method is known as the method of summated ratings.
5. Finally, selection of items is done through the procedure of item analysis. Probably, this step
of item analysis is the major step, which distinguishes it from Thurstone's method of
equal-appearing intervals. As we have seen, Thurstone's method makes no use of item analysis in
final selection of items. There are several methods of item analysis. Edwards (1957) has
suggested the setting of two extreme groups-high and low on the basis of the total score and
finding out the significance of the difference between the means of two groups by the t test. The
value of t will indicate the extent to which a given statement distinguishes between high and low
groups. But other methods such as correlational methods may also be used in place of the t test.
In the method of summated ratings, it is customary to select 20 to 25 statements, which constitute
the final attitude scale. As far as possible, half of the total statements should be favourable, so
that 'strongly agree' receives the weight of 5 on them, and the remaining half should be
unfavourable, so that 'strongly agree' receives the weight of 1 on them.
This arrangement is necessary to control certain response biases of subjects, which
might be produced if only favourable statements or only unfavourable statements are included in
the attitude scale.
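
The scoring and item-selection logic of the method can be sketched as follows; the responses are simulated, the reverse-keying handles unfavourable statements, and the extreme-groups t test follows Edwards' (1957) suggestion:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    raw = rng.integers(1, 6, size=(120, 10))        # 120 respondents x 10 items, 1-5
    unfavourable = [1, 4, 7]                        # indices of negatively worded items

    scored = raw.copy()
    scored[:, unfavourable] = 6 - scored[:, unfavourable]   # reverse-key: 5<->1, 4<->2
    total = scored.sum(axis=1)                              # summated rating per respondent

    # Extreme-groups item analysis: compare the top and bottom quarters on each item.
    cut = len(total) // 4
    order = np.argsort(total)
    low, high = scored[order[:cut]], scored[order[-cut:]]
    for j in range(scored.shape[1]):
        t, _ = stats.ttest_ind(high[:, j], low[:, j])
        print(f"item {j + 1}: t = {t:.2f}")         # larger t = more discriminating item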

The Likert method also has advantages and disadvantages. Its major advantages are mentioned
below.
(i) Likert scales are easy to construct, and construction takes less time. This method is simpler
and easier than the method of equal-appearing intervals. Some empirical evidence is available to
support this contention. Rundquist & Sletto (1936, 5) used the Likert method in the construction
of an attitude scale and expressed the belief that this method "... is less labourious than that
developed by Thurstone." Edwards & Kenney (1946) made a comparative study of the Likert
method and the Thurstone method and concluded that the time required in the construction of an
attitude scale by equal-appearing intervals is almost twice that required by the method of
summated ratings.
(ii) Scoring of the Likert scale is easy as well. Statements on a Likert scale are worded positively
or negatively and numerical weights are assigned accordingly. The weights are then summed to
yield a total score. A high total score indicates a favourable attitude and a low total score an
unfavourable attitude.
(iii) Likert's summated ratings are the most common measurement format. The ease of
application and simplicity of interpretation have increased the popularity of this measurement
format in social science research.
(iv) Likert's method of scaling possesses a sufficient degree of flexibility. Here the investigator is
free to include as many or as few items in his measure as he chooses. In this scaling, because
each item is presumed to count equally in measuring the concerned phenomenon, increasing the
number of statements increases the ability of the scale to reveal differences in the phenomenon
measured.
Despite these advantages, some disadvantages or weaknesses have been reported in Likert's
method of scaling as mentioned below.
(i) In Likert's method of summated ratings, it is assumed that each item or statement has
identical weight in relation to every other item or statement. This is not necessarily a valid
assumption. In fact, different individuals may hold a given attitude to the same degree, yet
they may respond differently to different statements or items of the scale. Therefore, it is
difficult, if not impossible, to ensure that each statement counts the same as every other item.
(ii) The validity of Likert's scaling is questionable. As we know, the process of deducing items
from an abstract universe of traits is a logical one. Therefore, there always exists the possibility
that some items may be wrongly included in the scale at any given time. The problem is to know
that we are measuring exactly what we claim to measure.
(iii) In summated ratings, persons receiving the same score on a measure do not necessarily
possess the trait to the same degree. This obviously means that the method is never as precise as
it claims to be. Its raw scores may be regarded as crude estimates at best.
Despite these disadvantages the method of summated ratings, as devised by Likert, has been
successfully used in assessment of attitudes. Recently, this method has also been fruitfully used
in assessing socio-economic status, intelligence, interest and special skills (Black &
Champion 1976).

Guttman's Scale, or Cumulative Scale


Guttman's method of scale analysis or scalogram analysis differs considerably from the two
methods of attitude scale construction discussed previously. The Guttman scale is based upon the
methods of cumulative scaling and has been defined by Guttman (1950) as follows:
"We shall call a set of items of common content a scale if a person with a higher rank than
another person is just as high or higher on every item than the other person."
Thus, if a set of statements with common content defines the Guttman scale, a person with higher
score or rank than another person on the same set of statements will rank consistently higher than
him on each statement in the set. For example, the following items illustrate the perfect
Guttman's scale:
(a) My height is more than 5'
(b) My height is more than 5'3".
(c) My height is more than 5' 8".
(d) My height is more than 6'.
A person who responds 'Yes' to item (d) will also respond 'Yes' to items (a), (b)
and (c). All four items measure the same dimension, that is, height, and constitute what
Guttman (1944,1945) called 'unidimensional scale.' Similarly, if a set of attitude statements
measures the same attitude, they are said to constitute a unidimensional scale or a Guttman scale.
According to Guttman, one advantage of the unidimensional scale is that from the total score of
the person one can reproduce the pattern of his responses to the set of statements.
Suppose, for example, that in the above example, 'Yes' is given a weight of 1 and 'No' is given a
weight 0, then knowing that a person has secured a total weight of 4, we can say that he has
responded 'Yes' to items a, b, c and d. Likewise, if a person gets a total weight of 3, he has
responded 'Yes' to item a, b and c and 'No' to item d. Such prediction regarding the perfect
reproducibility is true in a perfect Guttman scale only. In case of attitude, statements showing
such a perfect reproducibility are rarely achieved because some degree of irrelevancy is always
present.
The major steps in the Guttman scale may be enumerated as shown below.
1. A large number of statements are collected regarding the object of attitude. All statements
seem to indicate the same attitude. This constitutes what Guttman calls a universe of items.
2. Out of these collected statements, a small number of items are selected. Usually, the number of
selected items does not exceed 20. According to Guttman, the selection of a small number of
items from the large number of possible items is dependent upon the intuition and experience of
the investigator. These selected items must be of homogeneous content. Thus one should look for
items in the Guttman scale which are, to a greater extent, the rephrasings of the same content.
Guttman believed that item analysis was not an essential part of scale analysis for selection of
items as we find in case of the Likert's scale.
3. Each statement may have two alternatives, such as 'Agree-Disagree', or more than two
alternatives, such as 'Agree', 'Neutral' and 'Disagree'. All these items are administered to a group
of 100 persons who respond to each item.
4. All items are scored or weighted and a total score by adding the weights on all items is
determined for each person.
5. On the basis of the total score, each subject is ranked from highest to lowest and is listed in a
column. Each row indicates the responses of a subject to different items (see Table 13.20).
6. Subsequently, it is determined whether or not the responses to each item are in close
agreement. In other words, those marking the response category (such as 'Agree') which strongly
indicates the quality being measured should show consistency with the higher total scores, and
those marking the response category (such as 'Disagree' or 'Neutral') which poorly indicates the
quality being measured should show consistency with the lower total scores. If this is the case,
the scale is said to be homogeneous, and from the person's total score (or rank) alone we can
reproduce his response to each item. In a perfectly homogeneous test, the index of
reproducibility will also be perfect and therefore the coefficient of reproducibility will be one.
A case of perfect reproducibility has been demonstrated in Table 13.20, where the responses of
10 subjects towards five items have been displayed. Each item has two response categories,
'Agree' and 'Disagree'. The response category 'Agree' is scored +1 and the other response
category 'Disagree' is scored 0. Subsequently, an attempt is made to evaluate the scalability of
the items, and for this purpose there are several procedures.
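
One widely used index is the coefficient of reproducibility, CR = 1 - errors / (number of respondents x number of items), where errors are deviations from the ideal cumulative pattern implied by each person's total score; a CR of at least 0.90 is conventionally taken as acceptable. A minimal Python sketch with hypothetical agree-disagree data follows (error-counting conventions vary across authors):

    import numpy as np

    # responses: respondents x items; 1 = Agree, 0 = Disagree (hypothetical data)
    responses = np.array([
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [1, 1, 1, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 1, 0, 1, 0],   # deviates from the ideal cumulative pattern
        [0, 0, 0, 0, 0],
    ])

    # Order items from most to least endorsed; a perfect scale type with total score s
    # endorses exactly the s most popular items.
    order = np.argsort(-responses.mean(axis=0))
    resp = responses[:, order]
    totals = resp.sum(axis=1)
    ideal = (np.arange(resp.shape[1]) < totals[:, None]).astype(int)

    errors = np.sum(resp != ideal)
    print(f"coefficient of reproducibility = {1 - errors / responses.size:.2f}")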

Guttman scale has some advantages and disadvantages. Its major advantages are as follows:
(i) The Guttman scale clearly demonstrates the unidimensionality of items. Such
unidimensionality is assessed neither by the Likert scale nor by the Thurstone scale.
(ii) The unidimensionality and scalability of the scale enable researchers to identify any
inconsistencies in the responses and probable untruthful replies given by the subjects.
(iii) In the Guttman scale, a person's response pattern can be easily reproduced with
knowledge of his total score on the scale. This type of advantage is found neither in the
Thurstone scale nor in the Likert scale.
(iv) Researchers have shown that the Guttman technique of scaling is relatively easy to use when
the number of dichotomous items (such as agree-disagree) is 12 or fewer; when the
number of items exceeds 12, the technique becomes cumbersome.
The Guttman scale has, however, some disadvantages. The major ones are given below.
(i) The Guttman scale is not well suited to items which have three or more response
categories. Although some psychologists have applied this scaling technique in such
situations, it is very cumbersome and tedious to proceed with this technique.
(ii) The Guttman scale can't be used appropriately in situations where the number of items
exceeds 12 and the number of subjects is more than 100. In such scalogram analysis, scoring and
error determination will prove to be a Herculean task.
(iii) The Guttman scaling technique does not provide as extensive a continuum for scaling as we
find in the case of the Likert and Thurstone scaling techniques.
Despite these limitations, the Guttman scale has been successfully used in attitude assessment.
Besides, the technique has also been used in opinion studies of a political, economic and social
nature. It may also be used in combination with Likert's and Thurstone scale to encompass a
wide variety of attitude assessment.

Semantic differential scale


The semantic differential presents words on a scale between two extremes, one positive and the
other negative, for example, 'lazy' and 'industrious'. It rests on the meaning that respondents
attach to these words. Meaning is of two types, and accordingly words can carry two kinds of
meaning: denotative and connotative.
Denotative meaning is what a word denotes: if we say 'horse', it is the name of an animal, and
that is the meaning given to the word. Connotative meaning is not the meaning given to a word
but the meaning attached to it; a meaning attached to the word 'horse' is power, which is a
connotative meaning.
The semantic differential uses this kind of connotative meaning. It asks the respondents what
meaning they wish to attach to the pair of words; connotative meaning is not the actual explicit
meaning which the word has but the meaning the respondent attaches to it. The scale uses polar
words or polar adjectives, e.g. good and bad, beautiful and ugly; these are pairs of extreme
words, known as polar adjectives.
Osgood first suggested this type of scale, which is called the semantic differential scale. He used
three areas: strength, value and activity. He held that any scale prepared can draw on these three,
though not necessarily all three together.
Each of them has a particular place in the scale. In the case of 'strength', a pair of polar
adjectives such as 'strong and weak' can be used; similarly, the pair 'decisive and indecisive' can
be used. 'Value' can be represented by 'good and bad' or 'cheap and expensive'. For activity, one
can use the pair 'active and passive', which shows how actively people are engaged in an event
or process. 'Lazy and industrious' is another pair which gives the sense of activity. Osgood
suggested a number of such adjectives which could be used as polar pairs.
Construction of Semantic Differential Scale
Now let us see how a semantic differential scale is prepared:
1. The first step is to refer to the theoretical framework, refer to the objectives, and identify the
concepts which one wants to investigate. Then select word pairs which would define and explain
these concepts; identify or select the appropriate words which describe the concepts. You
may have two groups, experimental and control; give these words to them and ask
them which words clearly define and explain the selected concepts.
2. Ask them to rate a list of 25-30 adjectives for their appropriateness, separating words which
are appropriate from those which are not. The ratings will show that the highest-ranking words
are the most relevant and appropriate. The respondents must be given clear-cut
instructions about what is expected of them.
3. They need to remember that these words are not denotative but connotative, so the
meaning attached to the word is important rather than the actual/explicit meaning of the
word.
4. Once you get those ratings from two different groups with different instructions, identify the
pairs which clearly stand out as distinguishing and differentiating from the others.
5. After analysing the data from these raters, short-list words for the scale, not more than
12-14 pairs. There need not be 50 items in the scale; there can be 8-12. The semantic
differential scale is a short scale.
6. Once the scale is ready, do the pilot testing, by selecting a sample which resembles or
represents the population but not the sample which will be used in real study.
7. Analyze the data and establish the scale's validity and reliability; a minimal scoring sketch is shown below.
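
The scoring sketch referred to in step 7 is given below; it averages hypothetical 7-point ratings across respondents to obtain the concept's semantic profile on each bipolar pair, which is the usual first step before the reliability and validity analyses:

    import numpy as np

    pairs = ["good-bad", "strong-weak", "active-passive"]   # hypothetical bipolar pairs
    # ratings: respondents x pairs, 7-point scale (1 = negative pole, 7 = positive pole)
    ratings = np.array([
        [6, 5, 7],
        [5, 4, 6],
        [7, 5, 6],
        [6, 6, 5],
    ])

    profile = ratings.mean(axis=0)   # mean profile of the rated concept
    for pair, m in zip(pairs, profile):
        print(f"{pair}: mean = {m:.1f}")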

Advantages of semantic differential

1. The semantic differential has been argued to outdo other scales, like the Likert scale, in
vitality, rationality, and authenticity.
2. It has an advantage in terms of language too: there are two polar adjectives for the factor
to be measured and a scale connecting both these poles.
3. It is also more flexible than a Likert scale, in which the researcher declares a statement and
expects respondents to either agree or disagree with it.
4. Respondents can express their opinions about the matter at hand more accurately and
completely due to the polar options provided in the semantic differential.
5. In other question types like the Likert scale, respondents have to indicate the level of
agreement or disagreement with the mentioned topic. The semantic differential scale
offers extremely opposite adjectives on each end of the range. The respondents can
precisely explain their feedback that researchers use for making accurate judgments from
the survey.
6. Researchers can gain perception of concepts, attitudes, and opinions using the verbally
different terms as a measuring tool using the semantic differential scale.

disadvantages of semantic differential

1. Limited response options: Semantic differential scales typically offer a restricted range of
response options, which may not capture the full gamut of attitudes that people hold
towards an object or concept.
2. Contextual bias: The context in which the semantic differential scale is presented may
impact people’s responses. For instance, if a survey is conducted immediately after
negative news about a product, people may rate it more negatively than they would
otherwise.

UNIT 4:
INTELLIGENCE TESTS

I. Stanford binet intelligence scale

Originally designed as the Binet-Simon scale in France in 1905, the Stanford-Binet Intelligence
Scale is a standardized psychological test measuring cognitive ability (or intelligence). The
Stanford-Binet Intelligence Scales, Fifth Edition (SB-5) is the most recent iteration, released in
2003; the instrument has undergone several changes throughout the previous century
(Wilson, 2011). The SB-5 is one of the most commonly used psychometric tools in clinical,
neuropsychological, and psychoeducational settings. The test can be administered to individuals
from the ages of 2 to 85. The test is individually administered and takes approximately
45-90 minutes to complete, depending on the age and ability of the individual being tested
(Wilson, 2011). It consists of 10 subtests, which are grouped into five cognitive domains: Fluid
Reasoning, Knowledge, Quantitative Reasoning, Visual-Spatial Processing, and Working
Memory (Wilson, 2011).

Psychometric Properties

The SB-5 test exhibits good reliability. The FSIQ composite score has a very high internal
consistency coefficient, which ranges from r = 0.97 to 0.98. It was discovered that both the
verbal and nonverbal IQ domains had strong reliability, with corresponding averages of r = 0.96
and 0.95. The five subtests' internal reliability was similarly outstanding, with mean coefficients
ranging from r = 0.84 to 89 (Grimes et al., 2022). Content validity was established using diverse
methods such as extensive literature review, factor analyses of previous editions of the
Stanford-Binet, item response theory modeling, expert advice, user surveys, and pilot studies
(Roid, 2003; Grimes et al., 2022).

Domains & subtests of SB-5

The first among the five domains is fluid reasoning, and the subtests in this domain measure an
individual's ability to solve problems and think abstractly. These subtests include Pattern
Analysis, Fluid Reasoning, and Analogies. The pattern analysis subtest measures an individual's
ability to recognize and understand patterns, and consists of a series of incomplete patterns, and
the individual must select the correct answer choice to complete the pattern. The fluid reasoning
subtest measures an individual's ability to solve problems and think abstractly; it consists of a
series of visual and verbal tasks, and the individual must select the correct answer choice based
on their reasoning skills. The analogies subtest measures an individual's ability to understand the
relationship between words and concepts; it consists of a series of word pairs, and the individual
must select the correct answer choice based on the relationship between the words.

The Knowledge subtests measure an individual's acquired knowledge and vocabulary. These
subtests include Vocabulary and General Information. The former measures an individual's
acquired knowledge and vocabulary, and the test consists of a series of verbal questions, and the
individual must define the meaning of the presented word; and the latter measures an individual's
general knowledge about the world. The test consists of a series of questions related to history,
science, literature, and other subjects.

The Quantitative Reasoning domain measures an individual's ability to solve problems
involving numerical concepts; its subtests are Quantitative Reasoning and Number Series. The
quantitative reasoning subtest consists of a series of numerical questions, and the
individual must select the correct answer choice based on their reasoning skills. The number
series subtest measures an individual's ability to recognize numerical patterns; it consists of a
series of numerical sequences, and the individual must select the next number in the sequence.

The visual spatial domain measures an individual's ability to analyze and manipulate visual
information. These subtests include Bead Memory, Memory for Pictures, and Spatial
Visualization. The bead memory test consists of a series of bead patterns, and the individual must
reproduce the pattern from memory. The memory for pictures subtest measures an individual's
ability to remember visual information over time. The spatial visualization subtest measures an
individual's ability to manipulate visual images in their mind. The test consists of a series of
visual tasks, and the individual must select the correct answer choice based on their spatial
reasoning skills.

The Working Memory domain measures an individual's ability to maintain and manipulate
information in short-term memory. The memory for sentences subtest measures an individual's
ability to remember verbal information; it consists of a series of sentences, and the individual
must repeat the sentences in the correct order. The digit span subtest measures an individual's
ability to remember numerical information; and it consists of a series of digits, and the individual
must repeat the digits in the correct order.

Scoring & Interpretation

Scoring and interpretation of the Stanford-Binet Intelligence Scale, Fifth Edition (SB-5) involves
analyzing an individual's performance on the 10 subtests and calculating several scores,
including a composite score and five cognitive domain scores. The Composite Score is the
primary score used to assess overall cognitive functioning. It is calculated by adding the scaled
scores from each subtest and converting the sum to a standard score with a mean of 100 and a
standard deviation of 15. The Composite Score provides an overall measure of an individual's
cognitive abilities and can be used to identify intellectual giftedness or intellectual disability. The
five cognitive domain scores on the SB-5 provide a more detailed assessment of an individual's
cognitive abilities and can be used to identify specific strengths and weaknesses. Interpretation of
SB-5 scores should be done by a qualified professional, considering developmental history,
academic achievement, and behavioral and emotional functioning.

Critical Evaluation

The SB-5 is a widely used measure of intelligence that has been found to have high internal
consistency and test-retest reliability, as well as good construct validity. However, it has been
criticized for its cultural bias and its emphasis on timed tests. Additionally, it may not adequately
capture certain aspects of intelligence, such as emotional intelligence and creativity, and may not
fully capture the complexity of intelligence.

Applications of Intelligence (Cognitive) Testing

In educational settings, cognitive tests provide valuable information about children's intellectual
strengths and weaknesses, which can help educators develop individualized educational plans
that address their specific needs. Intelligence testing is used to identify children who may be at
risk of lower academic performance, to diagnose learning disabilities, and to determine eligibility
for special education services (Glutting, 1989). Intelligence tests such as the Wechsler
Intelligence Scale for Children (WISC) and the Stanford-Binet Intelligence Scale are widely used
in educational settings. For example, a child who scores low on verbal comprehension in SB-5
but has high scores on spatial reasoning may benefit from a curriculum that emphasizes hands-on
learning.

In clinical settings, intelligence testing is used to diagnose intellectual disability, identify
cognitive deficits resulting from brain injury or illness, and assess the cognitive functioning of
individuals with psychiatric disorders. Along with SB-5, tests like Kaufman Assessment Battery
for Children (KABC) are commonly used in clinical settings. These tests provide important
diagnostic information that can guide treatment planning and help clinicians monitor the
effectiveness of interventions (Glutting, 1989). For example, an individual with schizophrenia
who scores low on the reasoning subscale of the SB-5 may benefit from cognitive remediation
interventions that target this specific deficit.

In occupational settings, intelligence testing is used to assess an individual's intellectual abilities
as they relate to job performance. Employers may use intelligence tests to screen job candidates
or to identify employees who may benefit from additional training or development opportunities.
Aptitude tests such as the Armed Services Vocational Aptitude Battery (ASVAB) and the
Differential Aptitude Tests (DAT) are commonly used in occupational settings, along with
intelligence tests. These tests provide valuable information about an individual's cognitive
strengths and weaknesses, which can inform job placement decisions and help employers
identify employees who are likely to excel in leadership or management roles.

In research settings, intelligence testing is used to study the nature of intelligence, its
development across the lifespan, and its relationship to other cognitive and behavioral variables.
Researchers may use intelligence tests to study individual differences in cognitive abilities, to
investigate the neural and genetic bases of intelligence, or to explore the impact of environmental
factors on cognitive development (Glutting, 1989). Intelligence tests have also been used in
forensic settings to assess a defendant's cognitive abilities. For example, a psychologist may
administer the SB-5 to a defendant to determine if they have the cognitive abilities necessary to
stand trial (Glutting, 1989).

Cognitive Testing in India & Critical Evaluation

Cognitive testing has been widely used in India for educational and vocational purposes, but
there are both strengths and limitations to the use of these measures (Kamat, 1934). Intelligence
testing can be used to identify intellectual giftedness and intellectual disability, as well as
identify specific cognitive strengths and weaknesses. It can also help to address issues of social
and economic inequality by identifying individuals with intellectual potential who may not have
had access to educational opportunities. However, there are concerns about the impact of
socioeconomic status on intelligence testing in India, as individuals from lower socioeconomic
backgrounds may have lower scores on intelligence tests due to factors such as poor nutrition,
inadequate education, and exposure to environmental toxins. Additionally, intelligence tests may
not adequately capture certain aspects of intelligence that are valued in the Indian cultural
context. Finally, the use of intelligence testing in educational and vocational settings can have
significant consequences for individuals, and there is a risk of misinterpretation and misuse of
test results.

Conclusion

In conclusion, intelligence testing is a valuable tool used in psychology to assess cognitive
abilities and inform decisions in educational, clinical, occupational, and research settings.
Intelligence tests provide standardized measures of intellectual ability that can be used to identify
individuals at risk of academic failure, diagnose intellectual disability or cognitive deficits,
assess job performance, and study the nature of intelligence. While intelligence tests have
limitations and are subject to cultural biases, they remain a crucial tool in psychology and
continue to be widely used in a variety of settings. Intelligence tests can provide valuable
information about a person's cognitive abilities, but they should be used in conjunction with
other measures to provide a comprehensive assessment of a person's cognitive functioning.
Additionally, it is important to recognize the limitations of intelligence tests and to use them in a
culturally sensitive and appropriate manner.

WAIS
An intelligence test is “an individually administered test used to determine a person’s level of
intelligence by measuring their ability to solve problems, form concepts, reason, acquire detail,
and perform other intellectual tasks” (APA, 2023). One measure of cognitive ability is the
Wechsler Adult Intelligence Scale (WAIS). The test was developed by David Wechsler in 1955
and is currently in its fourth edition (APA, 2023; Wechsler, 2008).

The Wechsler Adult Intelligence Scale-IV (WAIS-IV; Wechsler, 2008) can be individually
administered to individuals between the ages of 16 and 90, with specific instructions for learning
disability cases specified in the manual. The
scale comprises 10 main subtests that are used to obtain an overall score, and five
supplementary tests. The tests can be divided into four categories with separate scores:
Perceptual Reasoning (visual puzzles, figure weights, block design, matrix reasoning, and picture
completion), Processing Speed (symbol search, coding, and cancellation), Verbal Comprehension
(similarities, information, vocabulary, and comprehension), and Working Memory (digit span,
arithmetic, and letter number sequencing). All sub-tests require verbal responses, except three,
which are paper pencil tests. These are generally carried out in the order specified by the test
manual in a maximum of two sittings within a week. Each subtest states a discontinuation
criterion based on the participant's responses. The scale has a test-retest reliability of 0.70 to 0.90.

The sub-tests have been elaborated below:


Block Design

It is a nonverbal reasoning test that consists of 14 designs in an increasing order of complexity. It
is scored on the basis of the accuracy of the design created and, for items 9-14, the time taken.
Participants are given nine blocks of alternating colours to create a three-dimensional model of a
two-dimensional design. This task requires the application of logic, analysis of spatial
relations, and visual-motor coordination (Kaplan & Saccuzzo, 2017; Gregory, 2015; Wechsler,
2008).
Similarities

A measure of abstract thinking, it asks participants to identify similarities between pairs of words
(for example, "bud-baby"). This test is not timed. It is scored between 0 and 2, depending on
how precise and clear the participants' answers were.
Digit Span

This is a test of immediate memory. It involves asking a person to repeat sets of data that range
in size from 2 to 9 digits in three different sections—either backwards, forwards, or in ascending
order. There are no time limits, and each item comprises two trials (Kaplan & Saccuzzo, 2017;
Gregory, 2015; Wechsler, 2008).
Matrix Reasoning

It provides participants with figural reasoning problems, arranged in ascending order of
difficulty, as a test of non-verbal inductive reasoning. To choose the right response out of the
available 5 choices, one must recognize a recurring pattern. Every response receives a score of 1
or 0 (Kaplan & Saccuzzo, 2017; Gregory, 2015; Wechsler, 2008).
Vocabulary

Thirty words are given to the participant to gauge their vocabulary skills. The participant is asked
to elaborate in detail on what they understand about each word. It is scored between 0 and 2,
depending on how precise and clear the participant's answers were.
Arithmetic

This sub-test is used to assess concentration and mathematical ability. It presents the test-taker
with 22 word problems, in increasing order of difficulty, to be answered within 30 seconds each.
Symbol Search

This is a nonverbal test of information-processing speed. The test-taker looks at a target group of
symbols, then quickly scans a search group of symbols, and marks either the symbol in the target
group that occurred within the search group or “NO” box (Kaplan, & Saccuzzo, 2017; Gregory,
2015; Wechsler, 2008).
Visual Puzzles

This non-verbal measure of visual perception asks the test-taker to choose three of the six
smaller shapes that could be used to put together a bigger, completed shape, after being shown a
picture of that completed shape. The mental rotation of forms and visual-spatial analysis are
necessary for successful performance. Each of the 36 questions must be completed in 30 seconds
and receive a score of 1 or 0 (Kaplan, & Saccuzzo, 2017; Gregory, 2015; Wechsler, 2008).
Information
This subtest measures the breadth of information possessed by the participant. It consists of 26
items, scored 0 or 1 based on accuracy. However, this test may contain questions that are not
culturally fair, such as details of the Sahara Desert or naming political leaders from the UK. It is
also susceptible to biases of culture, education, and information access (Gregory, 2015; Kaplan &
Saccuzzo, 2017; Wechsler, 2008).
Coding

Coding is a test of visual–motor functioning. Within two minutes, the test-taker must rapidly draw the appropriate symbol underneath each digit in a long list of randomly ordered numbers, associating one symbol with each of the digits 0 through 9. The score is the number of correct answers completed in the allotted period.
Letter-Number Sequencing

There are ten items with three trials each; each trial presents a growing mix of digits and letters (2 to 8 stimuli). The participant must verbally repeat the stimuli with the numbers in ascending order first, followed by the letters in alphabetical order. This subtest gauges one's capacity for focus, attention, and freedom from distraction (Kaplan & Saccuzzo, 2017; Gregory, 2015; Wechsler, 2008).
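To make the response rule concrete, here is a minimal sketch (not taken from the WAIS manual; the example stimulus is invented) that derives the expected answer for a given stimulus string, assuming digits are repeated first in ascending order and letters follow alphabetically:

```python
def expected_response(stimulus: str) -> str:
    """Correct Letter-Number Sequencing response: digits in
    ascending order first, then letters in alphabetical order."""
    digits = sorted(ch for ch in stimulus if ch.isdigit())
    letters = sorted(ch for ch in stimulus if ch.isalpha())
    return "".join(digits + letters)

# Example trial: the examiner reads "7-L-2-B" aloud;
# the expected answer is "2-7-B-L".
assert expected_response("7L2B") == "27BL"
```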
Figure Weights

The participant must select the right response from five choices in order to balance a weighing scale that has weights on one tray. Depending on the difficulty of the 27 items, the time limit varies from 20 to 40 seconds (Kaplan & Saccuzzo, 2017; Gregory, 2015; Wechsler, 2008).

Comprehension

This test evaluates the ability to explain everyday situations and social conventions. The simpler questions emphasize common sense, whereas the more challenging ones call for knowledge of social and cultural norms. Responses are graded from 0 to 2 for clarity and correctness.
Cancellation

In two successive trials, the participant is presented with target shapes interspersed among the same shapes in other colours, and must work row by row, cancelling as many of the specified targets as possible within 45 seconds.
Picture Completion

The participant is shown 24 pictures and has 20 seconds to identify the missing element in each one. Scores are based on the ability to point to and/or name the missing part accurately. The subtest evaluates visual skills and attention to detail (Kaplan & Saccuzzo, 2017; Gregory, 2015; Wechsler, 2008).

Intelligence tests can be used to predict important outcomes, such as academic achievement, occupational level, and economic success. Particularly in school settings, they are useful for making recommendations about teaching, designing class curricula, planning group activities, and identifying students with disabilities. As the WAIS yields multiple scores, it may also be used in guidance and counselling to predict job fit and define career prospects (Benson, 2003). For instance, jobs such as architecture and interior design require high scores on the perceptual reasoning sub-tests of the WAIS, while someone who scores high on verbal comprehension may be better suited to journalism, teaching, or other jobs that require public speaking and writing abilities.

The Wechsler Scales are standardized, clinically valid, and legally accepted assessments of intellectual ability (Bisconer & Ahsan, 2017). The tests may be used in personnel selection to evaluate person-job fit: the WAIS can assist in identifying potential employees who have the abilities to contribute to long-term outcomes for the organisation. In the clinical setting, IQ scores from the WAIS-IV provide valuable information for diagnoses of attention deficit hyperactivity disorder (ADHD), specific learning difficulties, and intellectual disability.

The WAIS is also applied to questions of brain dysfunction in neuropsychological evaluations. Discrepancies among the Working Memory Index, Processing Speed Index, Verbal Comprehension Index, and Perceptual Reasoning Index, together with assessment of functioning against normative data, can help identify cognitive deficits and brain dysfunction (Cullum & Larrabee, 2010). The WAIS is also used as a personality measure within a more comprehensive personality evaluation. The tests are frequently used in research to identify brain dysfunction and to understand individuals' cognitive abilities.
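As an illustration of how such index-level discrepancies might be screened, the hedged sketch below compares pairs of index scores against a critical difference; the scores, the 15-point threshold, and the function are all hypothetical, since actual critical values come from the test manual's normative tables.

```python
from itertools import combinations

# Hypothetical WAIS-IV index scores (standard-score metric: mean 100, SD 15).
scores = {"VCI": 112, "PRI": 94, "WMI": 88, "PSI": 104}
CRITICAL_DIFF = 15  # illustrative threshold, not a manual value

# Flag every pair of indexes whose gap reaches the threshold.
for a, b in combinations(scores, 2):
    diff = scores[a] - scores[b]
    if abs(diff) >= CRITICAL_DIFF:
        print(f"{a} vs {b}: discrepancy of {diff:+d} points may warrant follow-up")
```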

Psychological assessment is a sensitive professional action that should be completed with the
utmost concern for the well-being of the individual, their family, employers, and the wider
network of social institutions that might be affected by the results of that particular clinical
assessment (DPMI, 2023). A number of ethical principles recognize that all psychological services, including assessment, are provided in the context of a professional collaboration. The evaluation should benefit the person being evaluated, in line with the principle of beneficence. Great emphasis is placed on maintaining the confidentiality of any information a practitioner has obtained from clients during consultations, including test results (Principle 5; APA, 1992a). Such material may be disclosed to others only with the clear authorization, usually in writing, of the client or a legal agent. The only exceptions to confidentiality are exceptional situations in which withholding information would clearly place the client or other people in danger, in which case the relevant persons and legal authorities must be informed immediately.

The user must possess the required knowledge to evaluate psychological tests for suitable
standardisation, reliability, validity, interpretive accuracy, and other psychometric properties. The
test user must obtain test participants' or their legal representatives' informed consent before test
administration. Maintaining a standard of care is also important in the provision of professional services, including psychological testing. The accepted standard of care, according to Rinas and Clyne-Jackson (1988), is one that is "usual, customary, or reasonable". Utilizing outdated tests, such as the WAIS (Form I) instead of the WAIS-IV, would go against the recognized standard of care. Relying on test results that are no longer relevant to the presenting problem raises a similar issue.

The Wechsler Adult Intelligence Scale (WAIS) is a widely used measure of intelligence that
assesses a broad range of cognitive abilities in adults. As a result, it has various applications across diverse fields, for example:

● Clinical psychology: The WAIS is commonly used in clinical psychology to help diagnose and evaluate cognitive impairments in individuals with various mental health disorders (Loring & Bauer, 2017). For example, the WAIS can be used to assess the cognitive functioning of individuals with schizophrenia, bipolar disorder, and other conditions (Sayin et al., 2014).
● Neuropsychology: The WAIS is frequently used in neuropsychological evaluation to assess cognitive deficits associated with brain disease or injury (Lezak, Howieson, & Loring, 2004). For instance, the WAIS can be used to evaluate cognitive functioning in people with dementia, stroke, or traumatic brain injury (Chen et al., 2021).
● Education: The WAIS is occasionally used in educational settings to evaluate students' cognitive abilities and identify learning disabilities (Kaufman, 2009). For instance, it can be used to evaluate the cognitive functioning of students with ADHD or specific learning disorders (Kaufman & Lichtenberger, 2006).
● Occupational settings: The WAIS can also be used in occupational settings to evaluate job candidates and assess their cognitive abilities (Reeve, Heggestad, & Lievens, 2009). For example, it can be used to assess the cognitive abilities of individuals applying for jobs in fields that require high levels of cognitive functioning, such as engineering or management (Ryan, Summers, & Glass, 2018).

CATTELL’S CULTURE FAIR INTELLIGENCE TEST


Cattell's culture fair intelligence test is an alternative approach to traditional intelligence testing,
designed to minimize cultural bias in assessments of cognitive ability. Developed by Raymond
Cattell in the mid-20th century, the test measures fluid intelligence, or the ability to reason and solve novel problems, using abstract reasoning tasks that are less dependent on cultural background knowledge (Cattell, 1949).

The culture fair test was developed using factor analysis and aims to reduce the influence of
cultural and environmental factors that may impact performance on traditional intelligence tests.
The test has been used in various settings, including education, employment, and research
(Cattell, 1950).

The importance of culture fair testing lies in the need for unbiased assessments of cognitive
ability that do not disadvantage individuals from particular cultural or ethnic backgrounds.
Traditional intelligence tests have been criticized for their cultural bias, and the development of
alternative testing methods, such as Cattell's culture fair test, aims to address these concerns and
promote more equitable assessments of intelligence (Sternberg, Grigorenko, & Bundy, 2001).

The scoring and interpretation of the culture fair test is based on a normative sample, with scores
compared to those of individuals from similar cultural and demographic backgrounds.

The first version of the culture fair test, Scale 2, Form A, was published in 1949, followed by
several revisions and updates in subsequent years (Cattell, 1950).

The culture fair test has been used in a variety of settings, including education, employment, and
research, and has been translated into several languages for use in different countries (Cattell,
1950). The development of the culture fair test was a significant contribution to the field of intelligence testing, as it highlighted the importance of minimizing cultural bias in assessments of cognitive ability and provided an alternative method for measuring fluid intelligence.

Cattell's culture fair intelligence test is administered in a paper-and-pencil format and consists of
two subtests: Series and Classification (Cattell, 1949). The Series subtest contains 30 items and assesses the ability to identify patterns and relationships in abstract designs, while the Classification subtest contains 25 items and measures the ability to complete a series of abstract reasoning tasks. Both subtests are designed to measure fluid intelligence and to minimize cultural and environmental factors that may affect performance on traditional intelligence tests.

The culture fair test is designed to be administered to individuals from diverse cultural and
linguistic backgrounds, and the instructions are presented in a standardized format that is
intended to minimize the influence of cultural and linguistic factors on performance (Cattell,
1950). The test is administered under standardized conditions and has a time limit of 30 minutes
for the Series subtest and 20 minutes for the Classification subtest.

The culture fair test has been shown to have good reliability and validity in measuring fluid
intelligence, with correlations between the culture fair test and other measures of intelligence
ranging from 0.5 to 0.8 (Cattell, 1950). The test has been used in a variety of settings, including
education and employment, and has been translated into several languages for use in different
countries.
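To see what a validity coefficient in this range means operationally, the brief sketch below computes a Pearson correlation between two sets of paired test scores; the data are invented purely for illustration.

```python
from statistics import correlation  # available in Python 3.10+

# Hypothetical paired scores: culture fair test vs. a conventional IQ test
cf = [21, 25, 30, 18, 27, 33, 24, 22]
iq = [96, 101, 117, 93, 104, 124, 108, 95]

r = correlation(cf, iq)  # Pearson r between the two score lists
print(round(r, 2))       # compare with the 0.5-0.8 range reported above
```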

Overall, the culture fair test is designed to minimize cultural and linguistic bias in assessments of
cognitive ability and provides an alternative method for measuring fluid intelligence that is less
dependent on cultural background knowledge.

Traditional intelligence tests have been criticized for being culturally biased and not taking into
account the diverse backgrounds and experiences of test takers (Helms, 1992). Culture fair
testing, on the other hand, attempts to minimize cultural and linguistic factors that may influence
test performance, and aims to provide a fair and unbiased assessment of cognitive ability (Cattell,
1949).

The importance of culture fair testing lies in the need for assessments that are not influenced by
cultural and linguistic factors, which can affect the validity and reliability of intelligence tests.
For example, traditional intelligence tests may include questions that rely on specific cultural
knowledge or experiences, leading to inaccurate or unfair assessments of cognitive ability
(Sandoval, 2014). Culture fair testing, on the other hand, attempts to reduce or eliminate the
impact of cultural and linguistic factors on test performance, providing a more equitable
assessment of cognitive ability.

Benefits of culture fair testing include increased validity and reliability of assessments of
cognitive ability, and the ability to provide a fair and unbiased assessment of individuals from
diverse cultural and linguistic backgrounds. Culture fair tests also provide an alternative measure
of cognitive ability that is less reliant on cultural background knowledge, allowing for a more
accurate and equitable assessment of cognitive ability across diverse populations.

However, there are some limitations to culture fair testing, including the potential for test takers
to experience test anxiety or difficulty with the abstract nature of the tasks (Helms, 1992).
Additionally, some researchers have argued that culture fair tests may still be influenced by
cultural factors, despite attempts to reduce this influence (Sandoval, 2014).

Despite these limitations, culture fair testing provides an important alternative to traditional
intelligence testing, and highlights the need for culturally unbiased assessments of cognitive
ability.

Despite its attempts to minimize cultural and linguistic bias, Cattell's culture fair intelligence test
has faced some criticisms regarding its validity, reliability, and cultural neutrality.

One criticism of the test is that it may not accurately measure cognitive ability across different
cultural groups. Some researchers argue that the test's reliance on abstract reasoning tasks may
not be universal across cultures, leading to potential validity issues (Sandoval, 2014).
Additionally, some have raised concerns about the test's reliability, noting that it may not
consistently measure cognitive ability over time (Lichtenberger & Kaufman, 2009).

Critiques of the test's cultural neutrality argue that the test may still be influenced by cultural
factors, despite attempts to reduce this influence. Some have noted that the test's emphasis on
abstract reasoning may be more familiar to individuals from certain cultures, leading to potential
bias (Sandoval, 2014).

Alternative viewpoints on culture fair testing suggest that it may not be possible to completely
eliminate cultural bias in intelligence testing, and that attempts to create culture fair tests may
actually perpetuate cultural stereotypes and biases (Helms, 1992). Some have argued that a more
effective approach to assessing cognitive ability across diverse populations may be to utilize
multiple measures of intelligence, including measures of cultural knowledge and experience
(Sandoval, 2014).

Overall, while culture fair testing provides an important alternative to traditional intelligence
testing, it is not without its limitations and critiques. Continued research and development is
necessary to improve the validity, reliability, and cultural neutrality of intelligence assessments
across diverse populations.

In conclusion, Cattell's culture fair intelligence test is an important alternative to traditional intelligence testing that attempts to minimize cultural and linguistic biases. However, the test has faced criticisms regarding its validity, reliability, and cultural neutrality. While culture fair testing provides a valuable approach to assessing cognitive ability across diverse populations, continued research and development is necessary to improve the accuracy and fairness of these assessments.

Key points to consider include the historical context and development of culture fair testing, the
format and administration of Cattell's culture fair intelligence test, the importance of cultural
neutrality in intelligence testing, and the critiques and limitations of culture fair testing.

Implications for intelligence testing and cultural bias suggest that there is a need for continued
efforts to minimize cultural biases in intelligence assessments, while recognizing the limitations
of attempting to create completely culture-free measures of cognitive ability. A more effective
approach may be to utilize multiple measures of intelligence that incorporate cultural knowledge
and experience.

In conclusion, to improve the fairness and accuracy of intelligence testing, it is important to recognize and address the role of cultural biases. By utilizing a variety of measures and approaches, researchers and practitioners can work towards developing more comprehensive and inclusive assessments of cognitive ability.

Raven’s Progressive Matrices


A non-verbal test of inductive thinking based on figural stimuli, Raven's Progressive Matrices (RPM) was first developed in 1938 (Raven, Court, & Raven, 1986, 1992). A non-verbal test is one in which neither the questions being asked nor the solutions being offered are expressed verbally (American Psychological Association, 2023).

This test has received a lot of attention in scientific research, and it is also employed in some institutional contexts for intellectual assessment. The RPM was initially intended to measure general cognitive ability, or Spearman's g factor (Raven, 1938). For this reason, Raven designed the test so that solving its items would call upon g, which Spearman described as the "eduction of correlates": the act of establishing connections based on the perceived essential similarity between stimuli. To respond accurately to items, examinees must spot a repeating pattern or connection between structured figural stimuli organized in a 3 × 3 matrix. The name "progressive matrices" refers to the arrangement of items in order of increasing difficulty (Gregory, 2015).

The RPM can be given to groups or individuals, from age 5 through older adulthood. The instructions are straightforward, and the RPM may be administered without words if necessary. The test is used throughout contemporary society and consists entirely of matrix stimuli, which are among the most prevalent of all test stimuli (Kaplan & Saccuzzo, 2013).

Historical Background

The RPM was first published in 1938 by John C. Raven. It was later paired with the Mill Hill Vocabulary Scale (MH), a test of verbal knowledge. According to the RPM's creator, the ability to make sense of complexity and the capacity to store and reproduce knowledge are the two key facets of general intelligence, and they may be measured most successfully when the matrices are combined with these vocabulary tests. When packaged together with the MH, the matrices are now known as the RPM and Vocabulary Scales. To better accommodate people with different levels of cognitive ability, the 1940s saw the development of the Advanced Progressive Matrices and the Coloured Progressive Matrices (Leavitt, 2011).

Characteristics and features of Raven’s Progressive Matrices

The 36-item Coloured Progressive Matrices was created for children between the ages of 5 and 11. Raven added colour to this version of the test to better hold young children's attention. The Standard Progressive Matrices is normed for test takers from 6 years of age and above, although the majority of its items are challenging enough that the test is most appropriate for adults. This assessment comprises 60 items in total, divided into five sets of 12 progressions. The Advanced Progressive Matrices is similar to the Standard version but has a higher ceiling: Set I contains 12 problems, while Set II contains 36. This form is particularly suited to individuals of outstanding intellectual ability (Gregory, 2015).

The Advanced Progressive Matrices may be broken down into two factors, each of which may have a different level of predictive validity (Dillon, Pohlmann, & Lohman, 1981). Items on the first factor can be solved by adding or subtracting patterns; those who do well on them may have a natural ability to make quick decisions and understand part-whole relationships. Items on the second factor depend on the ability to recognise how a pattern progresses; those who perform well on them may have strong mechanical aptitude as well as good abilities to anticipate prospective movement and mentally rotate forms. At this stage, however, the talent represented by each factor is only speculative and requires independent verification (Gregory, 2015).

Several published studies have a bearing on the reliability and validity of the RPM. Burke (1958) ably summarised the early evidence, whereas the contemporary RPM guides have gathered the later findings (Raven & Summers, 1986; Raven, Court, & Raven, 1983, 1986, 1992). Validity coefficients against achievement tests often fall between the .30s and the .60s, a little lower than would be expected from more conventional (verbally laden) IQ tests. Validity coefficients against other IQ tests range from .50 to .80 (Gregory, 2015).

The Raven Plus's 60 matrices are rated according to their level of difficulty (Raven, Raven, &
Court, 1998). Each has a logical pattern or design that is incomplete. Out of a possible eight
options, the subject must choose the proper design. Either a time restriction or no time limit is
permissible for the exam. 60 pieces total, thought to be of escalating complexity, made up the
original RPM. Nonetheless, it was clear from item response and other studies that the three
middle questions were all about the same in terms of difficulty. As they tended to score all three
if they got one, this led to an overestimation of the IQs of those people who scored at or above
these levels. This issue is fixed in the newer 1998 Raven Plus (Raven et al., 1998). Study backs
up the RPM as a general intelligence metric, or Spearman's g (Colom, Flores-Mendoza, &
Rebello, 2003).

In a large study involving thousands of adolescents, Saccuzzo and Johnson (1995) found that the Standard Progressive Matrices and the WISC-R exhibited nearly identical predictive validity and no indication of differential validity across eight distinct ethnic groups.

Clinical Use

According to the evidence, a number of neurological and neuropsychiatric disorders are associated with decreased RPM performance (Strauss et al., 2006). Individuals with traumatic brain injury, schizophrenia, severe depression, dementia, and other conditions have all demonstrated impairment on the RPM. However, the test's ability to detect brain problems typically appears to depend on the severity of damage, and its use in detecting brain damage is limited. In Alzheimer's disease (AD), for instance, persons with suspected AD tended to score within normal ranges in the early stages of the illness before showing a decline over the two to three years after diagnosis.

With the RPM's more than 1,500 published studies in the literature, many consider it to be the
most thoroughly studied of all nonverbal measures (Strauss et al., 2006). (McCallum, Bracken, &
Wasserman, 2000). It is particularly useful when used with individuals whose performance on
the exam may be hampered by language, hearing, or movement impairments, or who are
non-English speakers or unfamiliar with the predominate North American culture.

Nevertheless, because of its motor-reduced format and the fact that it is untimed, the test may be rather insensitive to certain types of disability (Leavitt, 2011).

In a recent brain-imaging study, this test was used to examine how differences in reasoning and problem-solving abilities translate into variations in the firing of neurons in the brain, given the RPM's capacity to measure general fluid intelligence (Gray, 2003). Brain activity associated with the task was monitored using magnetic resonance imaging (MRI) as individuals completed the matrices. The results showed that changes in test scores were mirrored by activity in the lateral prefrontal cortex; the anterior cingulate cortex and cerebellum of participants with high Raven scores also demonstrated greater activity. The study demonstrated that common intelligence tests like the RPM not only give insight into how various brain regions function but also tap crucial and specific patterns of brain activity (Kaplan & Saccuzzo, 2013).

Is the Raven’s Progressive Matrices culturally fair?

A remarkable set of norms has been released, and the Raven's handbook has been revised (Raven, 1986, 1990; Raven et al., 1998). These new norms allow one to compare the performance of children from major cities throughout the world, so a significant critique of the Raven has now been addressed in a very thorough and beneficial manner. The Raven also appears to lessen the impact of culture and language (Raven, 2000). For instance, whereas Hispanic, Latino, and Black Americans often score about 15 points lower than Caucasians on the Wechsler and Binet scales, the Raven shows a smaller difference of only 7 or 8 points. The Raven therefore often cuts in half the selection bias produced by the Binet or Wechsler, which makes it quite useful for identifying giftedness among underprivileged Black American, Hispanic, and Latino children (Saccuzzo & Johnson, 1995).

Moreover, unlike the Kaufman scales, which also show a smaller disparity between White and other racial groups, the Raven provides a more accurate assessment of general intelligence than the Wechsler scales (Colom et al., 2003). With its new international norms, revised test manual, and successful computer-administered edition, the Raven shows potential as one of the major instruments in the testing field (Kaplan & Saccuzzo, 2013).

The RPM is especially useful for the supplemental testing of children and adults with hearing, language, or physical impairments. Such test subjects are frequently difficult to evaluate using conventional methods that call for auditory attention, verbal expression, or physical manipulation. The RPM, by contrast, can be explained through gestures if necessary, and the only output required from the examinee is a pencil mark or movement designating the preferred option. For these reasons, the RPM is well suited to assessing those with a limited grasp of the English language. The test procedure contains not a single word in any language, making the RPM about as culturally reduced as a test can be (Gregory, 2015).

According to Mills and Tissot (1995), the Advanced Progressive Matrices correctly classified a
greater percentage of minority students as gifted than a more conventional test of academic
ability did (the School and College Ability Test).

Culture Fair Testing

A culture-fair test is one believed to be reasonably unbiased with respect to unique background factors; it is based on common human experience. In contrast to some standardised intelligence evaluations, which may reflect predominantly middle-class experience, a culture-fair test is intended to be applicable across social boundaries and to allow fair comparisons among persons from various backgrounds (American Psychological Association, 2023).

Importance of culturally-fair tests

The idea of "culture-fair testing" only became well-known in the 1960s, when overt notions of White male and racial superiority started to lose influence as a result of the civil rights and feminist movements. Although feminist scholars and researchers from racial minority groups have long condemned the misuse and misinterpretation of psychological and educational tests, efforts to identify and remove bias from tests are still underway (Bathje & Feiss, 2014).

All civilizations have a propensity to prioritise some abilities and pursuits over others. By removing elements connected to cultural effects, nonverbal and performance tests can help assess intellect without undue influence from learning, culture, or other variables (Kaplan & Saccuzzo, 2013). Cultural bias is a concern that transcends test types: cultural prejudice may be seen in all testing and evaluation contexts, including but not limited to IQ tests, neuropsychological tests, educational achievement tests, personality tests, and self-report measures (e.g., symptom checklists).

In the past, removing cultural bias from test content has been the main goal of culture-fair testing. These initiatives have mostly aimed to do away with stereotyped depictions of individuals, culturally specific information, dependence on linguistic ability (except where it is clearly an objective of evaluation), and reliance on educational level (except in educational achievement testing). One such example is the Cattell culture fair intelligence test, which was created to measure IQ without reference to cultural background, verbal skills, or educational attainment (Kazdin, 2000).

Critical analysis

Paniagua (2005) reiterated the need to go beyond a narrow focus on test content, stating that a culture-free test must meet five validity criteria: the test items must be applicable to the culture of the person being tested (content equivalence); the essence and purpose of each item must remain consistent across all cultures tested (semantic equivalence); the test must be comparable across cultures and must itself be valid and reliable (technical equivalence); the test must evaluate the same theoretical notions across cultures (conceptual equivalence); and the findings must be interpreted consistently when compared to norms across cultures (criterion equivalence). No test has been found to satisfy all of these requirements.

Sternberg (2004) proposed that all intelligence tests, including those traditionally regarded as "culture free", such as tests of abstract thinking, evaluate abilities that are acquired through the interaction of ability with the environment. He opined that rather than providing insight into fundamental skills, the identification of the primary components of human intelligence may reveal more about the interactions between aptitudes and cultural norms of socialisation and education. According to this perspective, it is impossible to develop a test that is culture-free, since assessed performance always involves a complex interaction between natural talents and a person's culture and social environment. If no test is culture-free, then no test can be administered equitably across cultures, and all relevant cultural elements would have to be identified and controlled before a test could be applied. This is impractical due to the extensive number of variables, covariances, and individual and group differences involved (such as acculturation status, worldview, language, socioeconomic position, health issues, and stereotype threat) (Valencia & Suzuki, 2001).

Using a wider variety of constructs that capture the strengths of other cultural groups, coupled with contextual information, is another way that tests may be designed to be culturally fair. One example of this method is the Theory of Multiple Intelligences (MI) (Gardner, 1999). According to MI, people have a variety of fairly distinct types of intelligence (such as linguistic, logical-mathematical, kinesthetic, musical, interpersonal, intrapersonal, and naturalistic); a broader definition of intelligence, together with alternative assessment methods that gauge both abilities and contextual factors, would increase recognition of the abilities of people from marginalised cultural groups (Bathje & Feiss, 2014).

When testing is required, it is crucial to address the obstacles to culture-fair testing as comprehensively as possible, since the stakes are so high. Yet the possibility that truly culture-fair tests cannot be developed raises the issue of whether it is ethical to rely so heavily on normed testing and assessment when making important judgements, especially when such decisions are made without the individual's permission (Bathje & Feiss, 2014).

Conclusion

To conclude, while it is important to recognize the value of a culturally fair test like Raven's Progressive Matrices, it is equally important to understand the challenges that come with it. The RPM is one of the most widely accepted non-verbal intelligence tests, used in various arenas across the globe. Though the test has its own set of limitations, it has proved to be quite useful and more comprehensive in comparison to other forms of assessment.

Aptitude Tests
Aptitude tests assess the capability of an individual to acquire a relatively specific task or skill,
essentially measuring segments of ability that are well defined and relatively homogenous
(Gregory, 2015; Kaplan & Saccuzzo, 2013). These tests are generally used to predict future performance, which makes them especially useful for educational and vocational purposes. Aptitude tests are of two types: single aptitude tests, which evaluate one ability, such as the Bennet Mechanical Comprehension Test, and multiple test batteries, which study various aptitudes and provide a profile of scores, such as the DAT or SAT. Multiple test batteries are a relatively late development in the field of testing (Anastasi, 1976). The expanding involvement of psychologists in career counselling and in the selection and classification of personnel gave differential aptitude testing a significant boost. These batteries are generally based on factor analysis, which determines the subtests the battery will measure. Factor analysis has revealed the existence of several relatively independent traits, and the goal of aptitude batteries is to assess a person's performance across a range of these characteristics. Thus, these batteries offer a tool for making intraindividual analyses (Anastasi, 1976).

Differential Aptitude Test


The Differential Aptitude Test (Bennet et al., 1947) is a multiple aptitude test battery developed for the vocational and educational guidance of students from classes 7 to 12. It was first published in 1947 (Wang, 1993). It comprises eight independent tests that yield individual scores: Verbal Reasoning, Numerical Reasoning, Abstract Reasoning, Perceptual Speed and Accuracy, Mechanical Reasoning, Space Relations, Spelling, and Language Usage. These areas were selected using experimental and experiential data from factorial research and practical counselling needs (Anastasi, 1976).

The gradual awareness that so-called general intelligence tests are in fact less universal than initially believed has fueled the development of differential aptitude batteries. The expanding involvement of psychologists in career counselling, as well as in the selection and categorization of military and industrial employees, also gave differential aptitude testing a significant boost. Finally, the design of multiple aptitude batteries has its theoretical foundation in the application of factor analysis to the study of trait organisation (Anastasi, 1976).

History

Even before World War I, psychologists had begun to recognize that particular aptitude tests were necessary in addition to general intelligence testing. Tests of technical, clerical, musical, and artistic aptitudes are some of the most frequently employed. Another interesting feature was discovered during the critical assessment of intelligence tests that followed their widespread and indiscriminate usage in the 1920s: a person's performance on various portions of such a test frequently showed significant variation. If such intra-individual comparisons are to be undertaken, tests are needed that are specifically created to uncover disparities in performance in particular functions (Anastasi, 1976).

Test Description

The Differential Aptitude Tests (DAT) is a multi-aptitude battery that assesses a person's aptitude for learning and success in various domains, in adults as well as junior and senior high school students. The test can be given in groups and is generally used for educational and career guidance, though it can also be used for personnel selection. The substantially revised 1990 edition introduced two levels, entirely new items, and reduced testing time (Wang, 1993).

There are two equivalent forms (C and D) for each level (Level 1 for Grades 7-9, Level 2 for Grades 10-12). The test's directions have a Grade 5 level of readability. Eight subtests each measure one of the following abilities: verbal reasoning (VR), numerical reasoning (NR), abstract reasoning (AR), perceptual speed and accuracy (PSA), mechanical reasoning (MR), space relations (SR), spelling (Sp), and language usage (LU). Nine scores are given: one for each subtest and one more for scholastic aptitude (SA). The SA score gauges a person's capacity to learn in a classroom and is derived from the sum of the verbal and numerical reasoning scores (VR + NR) (Wang, 1993).

Psychometric Properties

The DAT has high reliability in general. With the exception of Mechanical Reasoning, split-half coefficients are mostly in the .90s, and alternate-form reliabilities vary from .73 to .90. The battery shows adequate predictive, concurrent, and construct validity (AbilityLab, 2015). The manual provides ample evidence that the DAT tests, particularly the combined verbal and numerical reasoning score, are effective predictors of other criteria such as school grades and results on other aptitude tests (correlations in the .60s and .70s). In particular, Setiawati (2020) noted that the battery was useful for predicting success in a psychology study program. The norming process for the fifth edition was based on data from 84,000 individuals across 150 school districts, making it nationally applicable for the student population of the U.S.A. (Gregory, 2015).
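To make the split-half statistic concrete, the sketch below (with invented data) shows how such a coefficient is typically computed: scores on the two half-tests (for example, odd versus even items) are correlated, and the Spearman-Brown formula then steps the correlation up to full-test length.

```python
from statistics import correlation  # available in Python 3.10+

# Hypothetical odd-item and even-item half-test scores for ten examinees
odd  = [14, 18, 11, 20, 16, 9, 17, 13, 19, 15]
even = [15, 17, 12, 19, 15, 10, 18, 12, 20, 14]

r_half = correlation(odd, even)      # reliability of a half-length test
r_full = 2 * r_half / (1 + r_half)   # Spearman-Brown step-up to full length
print(round(r_full, 2))              # coefficients "mostly in the .90s"
                                     # resemble this kind of value
```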

Subtests

Verbal Reasoning

This test evaluates a student's capacity to recognise connections between words. Analogies are
used as part of the test. The capacity to infer the relationship between the first pair of words and
apply it to the second pair is measured by this test. The ability to reason verbally may be helpful
in predicting success in academic courses as well as in professions where clear communication is
crucial (Career News, 2013).

Numerical Ability

This test evaluates the capacity for mathematical reasoning. The computational level of the tasks is kept low to ensure that reasoning rather than computational ability is emphasised (Career News, 2013). Numerical reasoning is crucial for courses in math, physics, chemistry, accounting, actuarial science, economics, and engineering; for trades like carpentry and electrical work; and for fields such as banking, insurance, computing, and surveying (Career News, 2013).

Abstract Reasoning

This assessment of reasoning skills is nonverbal. It evaluates how well people can analyse geometric patterns or designs. Each test item consists of a geometric series whose elements change in accordance with a predetermined rule. The student is required to infer the rule(s) in effect and forecast the next step in the series (Career News, 2013). This kind of abstract reasoning is a test of a person's intellectual, analytical, and logical abilities.

Perceptual Speed and Accuracy

The capacity to swiftly and accurately compare and mark written lists is measured by this test. This subtest may forecast performance in certain common administrative tasks, such as filing and coding, and strong results are also preferred for some occupations involving technical and scientific data (Career News, 2013). The student is shown combinations in the test booklet, one of which is underlined; the same combinations are then repeated on the answer sheet, where the student must find and mark the underlined combination. This aptitude test, which is administered under stringent time constraints, can also be used to forecast hand-eye coordination. A high score can be helpful in fields including secretarial work, administration, piloting, computing, accounting, and other finance-related fields (Career News, 2013).

Mechanical Reasoning

The capacity to comprehend fundamental mechanical concepts governing tools, motion, and machinery is measured by this test. Each item consists of an illustrated mechanical problem and a straightforward question. Items call for reasoning rather than specialized knowledge.

Space Relations

This subtest measures the ability to visualise a three-dimensional object from a two-dimensional pattern and to imagine how the object would look if rotated in space. Each item presents a single pattern followed by four three-dimensional figures; the student must choose the one figure that can be created from the pattern.

Spelling

The student's ability to spell common English words is evaluated on this test. Each item presents four words, three spelled correctly and one misspelt, and the student must identify the incorrectly spelled word. The misspellings represent the most plausible and frequent errors identified in a large research investigation (Career News, 2013).

Language Usage

The ability to spot grammar, punctuation, and capitalization mistakes is measured by this test.
Sentences that are divided into four parts make up the test. The learner must decide whether the
statement is correct as written or whether one section contains a punctuation, capitalization, or
grammar error (Career News, 2013).

Educational Aptitude (Verbal Reasoning and Numerical Reasoning)

The verbal and numerical reasoning scores from above are combined here. The resulting score gives the best overall indication of a person's aptitude for learning, that is, their capacity to absorb knowledge from teachers and books and do well in academic areas.
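Since this composite is simply the sum of the two reasoning scores, a minimal sketch with invented raw scores makes the computation explicit:

```python
# Hypothetical DAT raw scores; the scholastic aptitude composite is
# defined in the battery as the sum VR + NR.
vr, nr = 34, 28
sa = vr + nr
print(f"Scholastic Aptitude (VR + NR) = {sa}")
```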

Applications of DAT

About a dozen multiple aptitude batteries have been created for use in educational assessment, counselling, and personnel classification. These instruments vary greatly in methodology, level of technical proficiency, and the quantity of validation data available. The slow early progress of aptitude batteries reflected the lack of a practical use for such sophisticated instruments. The urgent necessity to choose applicants who were highly competent for extremely challenging and specialised duties did not emerge until World War II, when pilots, flight engineers, and navigators had very particular and demanding work duties (Gregory, 2015).

These batteries thus offer an appropriate tool for the kind of intraindividual analysis, or differential diagnosis, that clinicians had for many years been attempting to extract from intelligence tests, with rudimentary and frequently inaccurate results. Although the multiple aptitude batteries cover some features not typically covered by IQ tests, they also incorporate much of the information formerly gathered from special aptitude tests into a comprehensive and systematic testing programme.

The DAT is widely used in Europe for vocational guidance and research applications (Nijenhuis et al., 2000). A Career Planning Program has been developed in conjunction with the DAT to aid in counselling. It uses DAT scores together with responses from a Career Planning Questionnaire, which records likes, interests, educational and vocational goals, and general academic performance (Anastasi, 1976).

Critical Evaluation

The DAT is one of the most popular multiple aptitude test batteries, owing to its quality, credibility, and utility (Gregory, 2015; Wang, 1993). The fifth edition also eliminates sex bias on certain subtests and has been translated into several languages; the Indian version of the DAT was adapted by J.M. Ojha (Niwlikar, 2019). However, the lack of ethnicity-based norms poses a problem for the DAT (Wang, 1993). The mixed patterns of inter-correlations and the lack of discriminant validity between tests also suggest that the battery may measure only a limited range of abilities (Gregory, 2015).

Conclusion

Because the test is fully standardised, its administration procedures are simple to follow. Researchers agree that the test contents are appropriate in light of the DAT's objectives; the writing quality of the test materials is satisfactory, and the drawings are clear and meaningful. Reviews have generally praised the DAT's psychometric accuracy and usefulness, while also pointing out a few issues, including outdated graphics, a lack of information on ethnicity norms, weak differentiation among abilities, and unappealing test materials.

INFORMATION PROCESSING TESTS

Information processing tests are a type of cognitive assessment that measures how quickly and
accurately individuals can process and respond to different types of information. These tests
assess various cognitive abilities, such as attention, memory, processing speed, and executive
function. Information processing tests typically involve presenting stimuli, such as visual or
auditory information, and asking the individual to respond to the stimuli in some way. For
example, a test might present a series of numbers or letters and ask the individual to repeat them
back in the order they were presented, or to identify the number or letter that comes next in the
sequence. Other common types of information processing tests include tests of reaction time,
working memory, and spatial reasoning. These tests can be administered individually or as part
of a larger battery of cognitive tests (Gregory, 2014).
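Because speed of response is central to many of these measures, the minimal sketch below implements a console-based simple reaction time trial; the procedure, delays, and trial count are illustrative only and are not drawn from any published test.

```python
import random
import time

def reaction_time_trial() -> float:
    """One simple reaction-time trial: wait a random foreperiod,
    present a stimulus, and time the keypress response."""
    time.sleep(random.uniform(1.0, 3.0))  # unpredictable delay
    start = time.perf_counter()
    input("GO! Press Enter as fast as you can: ")
    return time.perf_counter() - start

# A short block of trials; the mean latency serves as the summary score.
latencies = [reaction_time_trial() for _ in range(3)]
print(f"Mean RT: {sum(latencies) / len(latencies):.3f} s")
```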

Information processing tests are often used in research settings to study cognitive functioning,
but they can also be used in clinical settings to diagnose and monitor cognitive impairments
associated with conditions such as Alzheimer's disease, traumatic brain injury, or
attention-deficit/hyperactivity disorder (ADHD).

Information processing theory is a theoretical framework that explains how people process, store, and retrieve information from their environment. This theory suggests that people engage in a series of mental processes, such as attention, perception, memory, and problem-solving, when they interact with information (Atkinson & Hilgard, 2013).

In the Atkinson-Shiffrin model, stimuli from the environment are processed first in sensory memory, the storage of brief sensory events such as sights, sounds, and tastes. This storage is very brief, lasting up to a couple of seconds. If the information is deemed important or relevant, it moves into working memory, where it is actively processed and manipulated. Working memory has a limited capacity, so information that is not important or relevant is discarded. If information is successfully processed in working memory, it is then transferred into long-term memory, where it is stored and can be retrieved at a later time. Retrieval from long-term memory can be influenced by factors such as encoding, storage, and retrieval cues (Atkinson & Hilgard, 2013).

Naglieri and Das's (1997) approach aims to assess the cognitive processes that underpin general intellectual functioning. Such measures are said to be less influenced by verbal abilities and acquired knowledge (Das et al., 1994; Kaufman & Kaufman, 1983, 2004), more intrinsically related to cognitive improvement (Das & Naglieri, 1992), and more equitable, with smaller differences between racial groups (Fagan et al., 2005).

The underlying theoretical framework

As a modern extension of Luria's work, Naglieri and Das (1990, 2005) developed the Planning,
Attention, Simultaneous, Successive (PASS) theory of intelligence. Planning entails selecting,
implementing, and monitoring effective problem-solving solutions. The use of feedback and
anticipating consequences are critical.

The first process is Attention, which requires selectively attending to some stimuli while ignoring
others. In some cases, attention also entails vigilance over a period of time. Difficulties with this
process underlie attention-deficit/hyperactivity disorder.

The simultaneous processing of information is characterized by the execution of several different mental operations at once. Forms of thinking and perception that require spatial analysis, such as drawing a cube, require simultaneous information processing.

Successive processing, on the other hand, involves the integration of stimuli into a specific
sequential order where the elements form a chain-like progression (Das & Naglieri, 1992). While
in simultaneous processing the elements are related in various ways, in successive processing
only a linear relationship is found between the elements.

The functions of the third functional unit, Planning, are regulated by the frontal lobes, especially
the prefrontal region of the brain. This unit allows the individual to formulate plans of action,
implement them, evaluate the effectiveness of a solution and modify these plans if necessary
(Luria, 1973). It is also responsible for the regulation of voluntary actions, impulse control and
linguistic functions such as spontaneous speech (Luria, 1980).

The challenge of a simultaneous-successive approach to intelligence assessment is to design tasks that tap relatively pure forms of each type of information processing. This strategy is used in the Kaufman Assessment Battery for Children II (K-ABC-II) and the Das-Naglieri Cognitive Assessment System (Das & Naglieri, 2012). The Das-Naglieri battery includes both sequential tasks requiring rapid articulation (such as "say can, ball, hot as fast as you can 10 times") and simultaneous measures with both verbal and nonverbal tasks. The battery also assesses planning and attention in order to embody the PASS theory (Naglieri & Das, 2005).

The Cognitive Assessment System


Guided by the PASS theory, the CAS was developed as a norm-referenced, individually
administered measure designed to evaluate the cognitive functioning of individuals between 5
and 17 years of age. The stipulated requirement for the use of this test is graduate training in the
administration, scoring, and interpretation of individual intelligence tests (Naglieri & Das, 1997).

The CAS is available in two versions: a Standard Battery with twelve subtests and an
eight-subtest Basic Battery. Each PASS scale in the Standard Battery has three subtests, whereas
the Basic Battery has two subtests. The CAS produces scaled scores for the PASS scales as well
as a composite Full Scale score that indicates overall cognitive functioning.

i) The Planning Scale: The purpose of the pencil-and-paper subtests on this scale is to find or
develop an effective strategy for solving novel timed tasks. The Planning Scale score is based on
the testee's performance on the subtests Matching Numbers, Planned Codes, and Planned
Connections, as well as the time it takes to complete each item. The cognitive skills required to
complete the tasks are the generation and use of efficient strategies, plan execution, anticipation
of consequences, impulse control, action organization, self-control, self-monitoring, strategy use,
and feedback use.

ii) The Attention Scale: The tasks require the testee to attend selectively to a specific stimulus
while inhibiting attention to distracting stimuli. Selective attention is tested on both receptive and
expressive levels. The Attention score is calculated using Expressive Attention, Number
Detection, and Receptive Attention measures, as well as the time it takes the subject to complete
each item.

iii) The Successive Processing Scale: The tasks require the testee to integrate stimuli in a specific linear/serial order, where each element or stimulus is related only to the one before it and there is little opportunity to integrate the parts. The stimuli range in difficulty from very easy (spans of two) to extremely difficult (spans of nine). The subtests are Word Series, Sentence Repetition, Speech Rate (ages 5-7 only), and Sentence Questions (ages 8-17 only).

iv) The Simultaneous Processing Scale: The subtests of this scale require the testee to integrate and comprehend multiple pieces of information in order to arrive at the correct answer. Nonverbal Matrices, Verbal-Spatial Relations, and Figure Memory are the CAS measures of simultaneous processing. In the Verbal-Spatial Relations subtest, the testee is required to comprehend logical and grammatical descriptions of spatial relations.

v) The Full Scale: The CAS Full Scale score, which is based on an equally weighted aggregate of
the PASS subtests, provides an estimate of overall cognitive functioning.
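As a conceptual illustration of an equally weighted aggregate, the sketch below averages four hypothetical PASS standard scores into a composite; the actual Full Scale score is derived from the manual's norm tables, not from a simple mean.

```python
# Hypothetical PASS standard scores (mean 100, SD 15)
pass_scores = {"Planning": 96, "Attention": 104,
               "Simultaneous": 110, "Successive": 88}

# Equal weighting: each scale contributes the same amount to the composite.
composite = sum(pass_scores.values()) / len(pass_scores)
print(f"Approximate Full Scale composite: {composite:.1f}")
```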

Reliability
Internal consistency and test-retest reliability coefficients were calculated for the CAS Full Scale, each PASS scale, and the individual subtests (Naglieri & Das, 1997c). On the Standard Battery, the Full Scale reliability coefficients ranged from .95 to .97. The average reliability coefficients for the PASS scales were .88 (Planning), .88 (Attention), .93 (Simultaneous), and .93 (Successive). Full Scale reliabilities on the Basic Battery ranged from .85 to .90.

Validity
The CAS's criterion-related validity has been supported by strong relationships between PASS
scale scores and educational achievement test scores, correlations with academic achievement in
special populations (mentally challenged children), and research into the profiles of specialized
groups of children (children with Attention Deficit/ Hyperactivity Disorder (ADHD) or reading
disabilities, and gifted children). Naglieri and Das (1997) administered the CAS and the Woodcock Johnson Tests of Achievement - Revised (WJ-R) (Woodcock & Johnson, 1989) to a representative sample of 1600 US children aged 5 to 17 years. The CAS Full Scale score and the WJ-R were found to be strongly correlated (.73 for the Standard Battery and .74 for the Basic Battery). Although the CAS lacks items that are directly dependent on achievement, Naglieri and Das concluded that the PASS theory could be considered a predictor of achievement and that it accounted for approximately 50% of the variance in achievement (a correlation of about .73, squared, gives roughly 53% of the variance).

Criticism
The theoretical model of intellectual functioning that underpins the Das-Naglieri Cognitive
Assessment System (CAS) was investigated, as well as the scale's effectiveness in
operationalizing the theory. The PASS model is based on Luria's model of functional brain
organization, which has received widespread support in the neuroscience literature as a viable
conceptual paradigm for describing neuropsychological functioning. This review suggests that
the PASS model's Planning and Coding (Successive, Simultaneous) dimensions have theoretical,
clinical, and empirical support for reflecting intellectual functions. However, there appears to be
limited support for Attention as a distinct cognitive processing construct. Face validity
inspection, factor analytic studies, and exceptional sample studies were used to examine the CAS subtests. The degree to which the CAS achieves the authors' stated goals of providing diversity in
content and mode of presentation varies across PASS domains. The data reviewed indicate that
the CAS tasks and scales have factor analytic and clinical support. Given the CAS's ongoing
development, its ultimate conceptual and clinical value remains to be determined (Telzrow,
1990).

PERSONALITY TESTS
Personality assessment refers to the estimation of one's personality make-up, that is, the person's characteristic behaviour patterns and salient and stable characteristics. Clinical assessment of personality and psychopathology is a complex task and one of the most critical aspects of working with emotionally disturbed individuals. Personality testing is done for many reasons; its aim is to assess what the person is usually like in thoughts, feelings, and behaviour patterns. Personality tests tap individual differences in reacting to certain situations.

The Minnesota Multiphasic Personality Inventory (MMPI)


It is certainly among the most widely used psychodiagnostic instruments. The test consists of 550 unique items whose content ranges from psychiatric symptoms to political and social attitudes. As such, many researchers have felt that the MMPI can be a particularly useful diagnostic measure in instances where there is a denial of problems. The ten basic clinical scales of the MMPI are Hs: Hypochondriasis, D: Depression, Hy: Hysteria, Pd: Psychopathic Deviate, Mf: Masculinity-Femininity, Pa: Paranoia, Pt: Psychasthenia, Sc: Schizophrenia, Ma: Hypomania, and Si: Social Introversion.

The Sixteen Personality Factor Questionnaire (or 16PF)


It is one of the personality measures most commonly used in India and is available in many regional languages. This personality questionnaire was developed over several decades of research by Raymond B. Cattell and his colleagues, and the 16 personality factors were derived on the basis of factor analysis. The 16PF gives scores both on five second-order global traits, which provide an overview of personality at a broader, conceptual level, and on the more numerous and precise primary traits, which give a picture of the richness and complexity of each unique personality.

Eysenck Personality Questionnaire


It measures three dimensions of personality, namely introversion-extroversion, neuroticism and psychoticism, together with a lie score that serves as a validity check. The questionnaire consists of 86 items and has been commonly used in research studies in India.

PROJECTIVE TECHNIQUES
These techniques are assumed to reveal those central aspects of personality that lie in the unconscious mind of an individual. The unstructured nature of the tasks is presumed to elicit the unconscious motivations, hidden desires, inner fears and complexes that affect the client's conscious behaviour. An unstructured task is one that permits an almost endless range of possible responses. The underlying hypothesis of projective techniques is that the way the test material or "structures" are perceived and interpreted by the individual reflects fundamental aspects of his or her psychological functioning. In other words, the test material serves as a sort of screen on which respondents "project" their characteristic thought processes, anxieties, conflicts and needs.

Clients are shown ambiguous visual stimuli by the psychologist and are asked to tell what they see in them. It is presumed that the client will project unconscious concerns and fears onto the visual stimulus, so that the psychologist can interpret the responses and understand the psychodynamics underlying the client's problem. Tests that utilise this method are called projective tests. Besides exploring one's personality, these tests also serve as a diagnostic tool to uncover hidden personality issues.

The history of projective techniques goes back to the 15th century, when Leonardo da Vinci selected pupils on the basis of their attempts to find shapes and patterns in ambiguous forms (Piotrowski, 1972). In 1879, a word association test was constructed by Galton, and similar tests were later used in clinical settings by Carl Jung. Subsequently, Frank (1939, 1948) introduced the term projective method to describe a range of tests which could be used to study personality with unstructured stimuli.

In this way, the individual has ample opportunity to project his own personality attributes, which he would not reveal in the course of a normal interview or conversation. More specifically, projective instruments also represent disguised testing procedures, in the sense that test takers are not aware of the psychological interpretation to be made of their responses.

Rather than measuring traits separately, attention is focused on the composite picture. Finally, projective techniques are an effective tool for revealing the latent or hidden aspects of personality that remain embedded in the unconscious until uncovered. These techniques are based on the assumption that if the stimulus structure is weak, it allows individuals to project their feelings, desires and needs, which are then interpreted by experts.

The Rorschach Inkblot Test


The best known projective test is the Rorschach test. This technique was first described in 1921 by the Swiss psychologist Hermann Rorschach. Although standardized series of inkblots had previously been utilized by psychologists in studies of imagination and other mental functions, Rorschach was the first to apply inkblots to the diagnostic investigation of the personality as a whole.

In this test, inkblots are used as stimuli. The inkblots are, of course, essentially meaningless; yet, because they resemble and suggest real objects, they are capable of being structured, allowing the subject to project meaning into them. Thus, they are ideally suited to serve as stimuli, even though they have no inherent meaning.

The test consists of 10 standardized inkblot cards. All the designs are bilaterally symmetrical and printed on a white background. Five of the cards are in shades of gray and black, two add red to the gray, and three are completely chromatic. As the card number increases, the inkblots become progressively more complex and difficult.

Administration of the test generally occurs in two stages. In the first stage, the test administrator presents the cards, one at a time, in a set order. Subjects are asked to describe what they see in each card, telling the tester everything they see while looking at it. The tester records the subject's responses and keeps records of several other aspects of performance, such as the time taken by the subject to give his first response, the total time spent on a card, the position of the card when the response is given, spontaneous remarks made by the subject, his emotional expressions and other incidental behaviour during the test session. There is no time limit, nor any fixed number of responses for each card. The subject can view the card from any side and take his own time.

The second stage of the administration is an inquiry. In this stage, the tester questions the subject to clarify his responses: what the subject saw, where in the card he saw it, and what characteristics of the blot determined the particular responses, with the subject commenting upon the features of the blots that caused him to respond as he did.

In the third phase, called "testing the limits", the examiner tries to ascertain whether the subject responds to the colour, shading and other meaningful aspects of the inkblots, and whether the whole or only parts of the blots are used in the responses. All these responses are then subjected to a scoring system.

The scoring of each card consists of attaching the appropriate symbols to each of the subject's answers in such a way that they represent, as faithfully as possible, an abstract of the subject's reactions to the inkblots.

The technique of interpreting Rorschach responses is very complex. Hence, adequate experience and expertise, acquired through training, are required for the interpretation of scores, along with deep insight into the dynamics of human personality. Interpretation is based on the relative number of responses falling into the various categories and on the ratios and interrelationships among different categories. The scoring categories of the test, such as movement and colour, are interpreted as signifying different functions of the personality: intellectual creativity, outgoing emotionality, practical mindedness and the like. From norms based on work with subjects in various well-characterized groups (normal individuals, neurotics, and psychotics), the pattern of the subject's scores may be interpreted as belonging to one or another personality make-up. Highly trained personnel are needed to administer and interpret the Rorschach, and it is a time-consuming test; these are its limitations. In the interpretation of the Rorschach test, major emphasis is placed on the final 'global' description of the personality of an individual.

The TAT
Another widely used projective test is the Thematic Apperception Test (TAT). The TAT presents more highly structured stimuli and response situations than the inkblot technique. It was developed by Morgan and Murray in 1935 at the Harvard Psychological Clinic and requires more complex and meaningfully organized verbal responses.

The TAT consists of a set of 30 black-and-white pictures which portray human beings in a variety of real-life situations, plus one blank card. The pictures are vague and indefinite, but structured to some extent. The maximum number of pictures used with any subject is 20. Generally, four overlapping sets of 20 cards are available: for boys, for girls, for males over 14 and for females over 14. The cards are presented to the subject one by one, and the subject is asked to tell a story about the picture shown on each card. This story should include details such as (1) what is happening, (2) who the people involved are, (3) what they are doing, (4) what they are thinking and feeling, and (5) what the eventual outcome of their actions will be, along with similar matters relating to the individuals in the story and the situation in which they are placed. In the case of the blank card, the subject is asked to imagine a picture himself and narrate a story connected with that picture. The stories that subjects build around the pictures give expression to a wide variety of interpretations about the feelings and actions of the figures shown. Subjects are encouraged to interpret the pictures as freely and imaginatively as they want and to be completely open and honest in their responses. Typically given as an individual oral test, it may also be administered in writing and as a group test. The test is administered in two sessions, 10 cards being employed during each session; the cards used in the second session are more unusual, dramatic and bizarre. While responding, subjects are asked to give free play to their imagination. There are no right or wrong responses and no time limits; in fact, the subject is encouraged to continue for as long as five minutes on a picture. Thus, in the TAT the tester plays a relatively passive role, giving the initial instructions and recording responses. The basic rationale of the test depends primarily on the mechanism of identification.

In scoring and interpreting TAT stories, the psychologist first determines who the 'hero' is: the character with whom the subject has presumably identified himself. In the next step, the content of the story is analyzed principally with reference to the hero's needs (such as the needs for achievement, affiliation, aggression and sex) and press, i.e. the forces in the environment that may facilitate or interfere with the satisfaction of those needs. Such forces may include the hero being criticized by another person, receiving affection, protection, sympathy or comfort, being exposed to physical danger, being shown hostility, and so on. Next, the psychologist analyses the 'theme' or plot of a story. This represents a simple episode containing one need and one press, that is, the interaction of a hero's needs with environmental forces. Lastly, stories are analyzed in terms of 'outcomes', which include the comparative strengths of the forces emanating from the hero and those from the environment, the amount of hardship and frustration experienced, the relative degree of success and failure, and happy and unhappy endings. It is recommended that testing be followed by an interview to learn the origin of the stories, seeking associations to places, names of persons, dates, and specific or unusual information. This enables the examiner to clarify the meaning of the stories and to evaluate their significance more reliably.

Murray suggested that one should not adhere literally to the behavioural side of the stories but look for hidden human motives and intra-psychic conflicts. In general, scoring and interpretation systems rely heavily on an analysis of the thematic content of the stories, with structural properties playing a relatively minor role. In assessing the importance or strength of a particular need or press for the individual, special attention is given to the intensity, duration, and frequency of its occurrence in different stories, as well as the uniqueness of its association with a given picture.
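
Because frequency of occurrence across stories is one of the few readily countable elements of this analysis, a simple tally can support the clinician's qualitative judgment. The sketch below assumes each story has already been coded for Murray-style needs by a trained scorer; the story codings are invented example data, not a scoring system.

# Illustrative tally of how often each coded need recurs across a set of
# TAT stories, reflecting the emphasis on frequency of occurrence above.
from collections import Counter

# Hypothetical codings produced by a trained scorer, one list per story
story_codings = [
    ["achievement", "aggression"],   # story 1
    ["achievement", "affiliation"],  # story 2
    ["achievement"],                 # story 3
]

need_frequencies = Counter(need for story in story_codings for need in story)
print(need_frequencies.most_common())
# [('achievement', 3), ('aggression', 1), ('affiliation', 1)]
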
Evaluation of T.A.T.

The TAT enjoys great popularity as a technique for studying the structure and organization of personality. The Rorschach throws light on the structure of personality, while the TAT reveals the functioning of personality. It is considered of much value for uncovering personality dynamics and is a technique for studying fantasy as a product of conscious and unconscious needs. The TAT is less time-consuming and less technical than the Rorschach, can be easily adapted, and is certainly more useful in this respect. It can be used with both adults and children, with separate card sets for each. It has gained widespread use among clinical psychologists in the study and diagnosis of maladjusted and abnormal persons, but it is also used with normal groups for a fuller understanding of personality differences. In responding to the TAT situation, the subject is free from social tensions. The TAT permits wide qualitative and quantitative individual differences in the expression of wishes, fantasies, frustrations and modes of adjustment.

In the TAT, procedures for administering, scoring, and interpreting differ depending upon the conceptual system of the user and the purpose for which the tester is using the test. Thus, the lack of agreement on a single scoring and interpretation system complicates the task. Similarly, determining the validity of the TAT is difficult: it is designed to study hidden needs which cannot be observed directly, so it is questionable whether the stories reflect the test taker's personality or are only stereotypic reactions to the situations pictured. There are therefore many problems regarding the scoring, reliability, and validity of the TAT. A basic assumption that the TAT shares with other projective techniques is that the present motivational and emotional condition of the subject affects his or her responses to an unstructured test situation (Anastasi). More systematic research is needed in all these areas before it can be fully accepted as a tool for giving useful psychodiagnostic information.

This test was developed by Henry Murray and his colleagues (Morgan and Murray, 1935). As used with a given subject, the Thematic Apperception Test (TAT) consists of 20 pictures, all in black and white, in which the people depicted are deliberately drawn in ambiguous situations. After being shown each picture, the client is asked to tell a story about the person or people in it: what is happening in the picture, what has caused the event, what could have taken place in the past and what will happen in the future. The story narrated by the client is interpreted by the psychologist, who looks for revealing statements and the projection of the client's hidden emotions onto the characters in the pictures. In the original interpretation method for TAT stories, the examiner first determines who the "hero" is, the character of either sex with whom the respondent presumably identifies himself or herself. The content of the stories is then analysed with reference to Murray's list of "needs" and "press". Achievement, affiliation and aggression are examples of needs, whereas "press" refers to environmental forces that may facilitate or interfere with need satisfaction.

The TAT has been used extensively in personality research, but the high variation in administration and scoring procedures associated with it has made it quite difficult to investigate its psychometric properties. Nevertheless, the value of thematic apperception techniques has been confirmed, and the clinical utility of various versions of the TAT, both for traditional and for specific applications, has been established.

Limitations of the Projective Tests



Projective tests are basically subjective in nature, and the interpretation of clients' answers demands deep analytic skill and something of an art. Problems of reliability and validity are ever-present in projective tests, and there are no standard grading scales for them. A person's varying mood may shape his or her answers, which may therefore vary considerably from one day to another.

Some situational variables, such as the examiner's physical characteristics, are likely to influence responses on projective techniques. It has also been observed that changed instructions on the part of the examiner can influence the examinee's scores on projective techniques to a great extent.

Finally, in the words of Eysenck (1959), projective techniques can be summarised as those in which the relationship between projective indicators and personality traits has not been demonstrated by any empirical evidence.

Considerable evidence shows that most studies of projective techniques are marred by methodological flaws and are ill-designed.

Projective techniques are not guided by any consistent, meaningful and testable theories.

There is little evidence of agreement between the global interpretations of projective techniques offered by different experts and psychiatrists.

Generally, projective techniques have poor predictive ability regarding failure or success in various walks of life.

IN SHORT

These tests are based on the projective hypothesis derived from Freud's psychoanalytic theory. The basic idea is that the test taker responds to relatively unstructured stimuli, and much of the meaning of the responses comes from within the person, thereby revealing hidden aspects of the personality. It is a prolonged, intensive and extensive assessment of the individual's personality. It is from these responses that deductions are made about personality dynamics, including underlying conflicts and ego defences.

a) The Rorschach Inkblot Technique was developed by Hermann Rorschach (1921/1941). He produced a set of 10 inkblots on separate cards: 5 are black and white, 2 grey and red, and 3 multi-coloured. Subjects are presented with one card at a time and asked questions such as 'What might this be?' or 'What does this remind you of?' After writing down the first responses, the tester goes back through each response asking for more details. The first phase of the test is called the free-association phase, and the second is called the inquiry. Scoring combines objective and subjective procedures, looking at the area of the stimulus used and other properties of the blot such as form and content. In this way the test helps in systematically drawing out information from both the conscious and the subconscious levels.

b) The Thematic Apperception Test (TAT) was developed by Christiana Morgan and Henry Murray (1938). It is based on Murray's theory of needs, which come up in the stories given by the patient. An Indian adaptation of this test is available. A set of 10 cards is selected. To guide story production, the tester instructs the subject, while giving each picture, to build a story around it incorporating who is in the picture, what had happened before, what is going to happen, and what the people involved are thinking and feeling. Trained psychologists pick up the themes emerging from each story and thereby make personality inferences.

SEMI PROJECTIVE TESTS


While projective tests are unstructured, semi-projective tests are partly structured, involving, for example, the completion of sentences or stories, or word association.

SENTENCE COMPLETION: Rotter's Incomplete Sentences Blank


Responses on this test are often most helpful in establishing the level of confidence with which overt behaviour can be predicted. The SCT is designed to tap the patient's conscious associations to areas such as the self and relationships with father, mother, the opposite sex and superiors. It is composed of a series of sentence stems, such as "I like ..." and "Sometimes I wish ...", which patients are asked to complete in their own words. The SCT usually elicits information that the patient is quite willing to give, and the level of inference required is usually lower than in Rorschach or TAT interpretations.

Future directions of psychological assessment: virtual reality and computer-assisted assessment

Virtual environments (VEs) offer a new human-computer interaction paradigm in which users are no longer simply external observers of images on a computer screen, but are active participants within a computer-generated three-dimensional virtual world. Virtual reality can add, delete, or emphasise details to better help clinicians perform basic functions. These unique features can provide the patient with specialised, safer treatment techniques for problems that previously were expensive or impossible to treat in traditional training and therapy. For these reasons VEs have recently attracted much attention in clinical psychology. One of the main advantages of a virtual environment is that it can be used within a medical facility, thus avoiding the need to venture into public situations. In fact, in many applications VEs are used to simulate the real world and to assure the researcher full control of all the parameters involved. Many stimuli for exposure are difficult to arrange or control, and when exposure is conducted outside of the therapist's office, it becomes more expensive in terms of time and money. The ability to conduct exposures in virtual aeroplanes for flying phobics or on virtual highways for driving phobics, for example, without leaving the therapist's office, would make better treatment available to more sufferers at a lower cost.

According to this new paradigm, the use of computers, and in particular of virtual environments, can offer powerful new tools to psychologists. However, the use of computers in psychological testing is not a novelty: it was initiated well over a quarter of a century ago. Technically, computer testing has its origins in physicalism and psychometrics, and the computer applied to psychological testing may be considered a psychometric machine. The basic thesis is that test scores may be empirically linked to the behaviors of test takers through the use of algorithms (a partly Greek term honoring the ninth-century Arab mathematician al-Khuwarizmi), which are mechanical rules for making decisions. For example, on the basis of empirical correlates, when MMPI Scale 6 is above a T-score of 75, we may expect that the test taker will show disturbed thinking, have delusions of persecution and/or grandeur and ideas of reference, feel mistreated and picked on, be angry and resentful, harbor grudges, and rely heavily on projection as a defense mechanism. On the basis of such established relations between scores and symptom pictures, characteristics that are commonly found with particular score elevations and patterns may be fashioned into statements, stored in a statement library, and called back whenever a test taker registers the scores that have been shown to empirically relate to these statements.

However, in practice, the statements that eventually issue from the computer are not derived entirely by blind adherence to this scheme. The algorithms additionally incorporate the experience of their author, and are not free of theoretical bias, clinical flavoring, intuition, and personally held interpretation. So early beliefs that the computer would eliminate the need for skilled diagnostic clinicians have not materialized. Errors, inconsistencies, and misleading statements are always a possibility. When computer-derived information is to be employed by a person who does not have sufficient psychological background to use the material responsibly, it falls upon a psychologist to interpret the computer interpretation to that person: there must be a clinician between the computer and the client.
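
As a concrete illustration of the statement-library idea, the sketch below stores rules as (scale, T-score cut-off, statement) triples and returns every statement whose cut-off is exceeded. The single rule shown mirrors the Scale 6 example above, but the function name, the rule format and the statement wording are illustrative assumptions, not an actual commercial interpretive system.

# Minimal sketch of a computerized "statement library" interpreter.
# Each rule pairs a scale elevation with an empirically derived statement.
def interpret(t_scores):
    statement_library = [
        # (scale, T-score cut-off, statement drawn from empirical correlates)
        ("Scale 6 (Pa)", 75, "May show disturbed thinking, delusions of "
                             "persecution and/or grandeur, ideas of reference, "
                             "resentment, and reliance on projection."),
        # ... further empirically established rules would be stored here ...
    ]
    report = []
    for scale, cutoff, statement in statement_library:
        if t_scores.get(scale, 0) > cutoff:   # a rule fires only above its cut-off
            report.append(statement)
    return report

# Example: a test taker with Scale 6 at T = 78 triggers the statement
print(interpret({"Scale 6 (Pa)": 78}))

Note that, as the text stresses, such output would still require a clinician's review before reaching the client.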

Advantages of VR-based assessment tools


The main problem of current computer-based assessment is the transformation of the process of psychological assessment into psychological testing. As Tallent [1] points out, "the reaching of conclusions through the use of psychometrics often is mislabeled as assessment, as, for example, in computer assessment... [Computer tests] do not provide automatic answers to real problems... What test results mean in any given case is a human judgment".
However, the rate of growth of computer testing is remarkable [10]. Computer programs
are available for administering, scoring, profiling, interpretation, and report writing for old
tests, and for new instruments designed specifically for computer analysis. Creative
variations have appeared. In adaptive testing, for example, items presented to the test
taker are contingent on his or her earlier responses, similar to Binet testing, where tests at a
given age-level are administered only if at least one subtest has been passed at the
immediately lower year-level.
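
The Binet-style contingency just described can be expressed as a simple up-after-pass, down-after-failure rule. In the sketch below the difficulty ladder, the item count, and the simulated test taker are all invented for illustration; operational adaptive tests select items using item response theory rather than a fixed ladder.

# Toy adaptive testing loop: the next item's difficulty depends on
# whether the previous item was passed, echoing the Binet-style rule.
def adaptive_test(answer_item, n_items=10, start_difficulty=5):
    difficulty, history = start_difficulty, []
    for _ in range(n_items):
        passed = answer_item(difficulty)          # administer one item
        history.append((difficulty, passed))
        # step up after a pass, down after a failure, within bounds 1..10
        difficulty = min(10, difficulty + 1) if passed else max(1, difficulty - 1)
    return history

# Example: simulate a test taker who passes any item of difficulty <= 6;
# the administered difficulties quickly settle around that ability level.
print(adaptive_test(lambda d: d <= 6))
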
Virtual reality can be considered a highly sophisticated form of adaptive testing. In fact, the key characteristic of VR is the high level of control over the interaction with the tool, without the constraints usually found in computer systems. VR is highly flexible and programmable. It enables one to present a wide variety of controlled stimuli and to measure and monitor a wide variety of responses made by the user. Both the synthetic environment itself and the manner in which this environment is modified by the user's responses can be tailored to the needs of each client and/or therapeutic application. Moreover, VR is highly immersive and can cause the participant to feel "present" in the virtual rather than the real environment. It is also possible for the psychologist to accompany the user into the synthesised world.

In more detail, there are three important aspects of virtual reality systems that can offer new possibilities to psychological assessment:
1. How They Are Controlled: Present alternate computer access systems accept only one, or at most two, modes of input at a time. The computer can be controlled through single modes such as pressing keys on a keyboard, pointing to an on-screen keyboard with a head pointer, or hitting a switch when the computer presents the desired choice, but present computers do not recognize facial expressions or idiosyncratic gestures, nor do they monitor actions from several body parts at a time. Most computer interfaces accept only precise, discrete input. Thus many communicative acts are ignored, and the subtlety and richness of human communicative gesture are lost. This results in slow, energy-intensive computer interfaces. Virtual reality systems open up the input channel: the potential is there to monitor movements or actions from any body part, or from many body parts at the same time. All properties of the movement can be captured, not just the contact of a body part with an effector. In the virtual environment these actions or signals can be processed in a number of ways. They can be translated into other actions that have more effect on the world being controlled; for example, virtual objects could be pushed by blowing, pulled by sipping, and grasped by jaw closure (a toy sketch of such input mapping appears after this list).
2. Feedback: Because VR systems display feedback in multiple modes, feedback and prompts can be translated into alternate senses for users with sensory impairments. The environment could be reduced in size to give the larger, overall perspective (without the "looking through a straw" effect usually experienced when using screen readers or tactile displays). Sounds could be translated into vibrations or shifted into a register that is easier to pick up, and environmental noises can be selectively filtered out. For the individual, multimodal feedback ensures that the visual channel is not overloaded. Vision is the primary feedback channel of present-day computers, and frequently the message is further distorted and alienated by representation through text; it is very difficult to represent force, resistance, density, temperature, pitch, etc., through vision alone. Virtual reality presents information in alternate ways and in more than one way.
3. What Is Controlled: The final advantage is what is controlled. Until the last decade, computers were used to control numbers and text by entering numbers and text with a keyboard. Recent direct manipulation interfaces have allowed the manipulation of iconic representations of text files, or of two-dimensional graphic representations of objects, through pointing devices such as mice. The objective of direct manipulation environments was to provide an interface that more directly mimics the manipulation of objects in the real world. The latest step in that trend, virtual reality systems, allows the manipulation of multisensory representations of entire environments by natural actions and gestures.
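
The input-mapping idea from point 1 can be sketched as a small dispatch table that translates whatever gesture the system can sense into an action on a virtual object. The gesture names and actions below are invented examples based on the blowing/sipping/jaw-closure illustration in the text.

# Hypothetical mapping of sensed gestures to actions on virtual objects,
# after the "pushed by blowing, pulled by sipping, grasped by jaw closure"
# example in point 1. Unmapped gestures are logged rather than discarded,
# so the richness of the user's movement is not silently lost.
ACTION_MAP = {
    "blow":      lambda obj: f"push {obj}",
    "sip":       lambda obj: f"pull {obj}",
    "jaw_close": lambda obj: f"grasp {obj}",
}

def handle_gesture(gesture, target):
    action = ACTION_MAP.get(gesture)
    return action(target) if action else f"logged unmapped gesture: {gesture}"

print(handle_gesture("blow", "virtual ball"))   # -> push virtual ball
print(handle_gesture("nod", "virtual ball"))    # -> logged unmapped gesture: nod
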
