Correspondence
Analysis in Practice
Third Edition
CHAPMAN & HALL/CRC
Interdisciplinary Statistics Series
Series editors: N. Keiding, B.J.T. Morgan, C.K. Wikle, P. van der Heijden
Published titles
AGE-PERIOD-COHORT ANALYSIS: NEW MODELS, METHODS, AND
EMPIRICAL APPLICATIONS Y. Yang and K. C. Land
ANALYSIS OF CAPTURE-RECAPTURE DATA R. S. McCrea and B. J.T. Morgan
AN INVARIANT APPROACH TO STATISTICAL ANALYSIS OF SHAPES
S. Lele and J. Richtsmeier
ASTROSTATISTICS G. Babu and E. Feigelson
BAYESIAN ANALYSIS FOR POPULATION ECOLOGY R. King, B. J.T. Morgan,
O. Gimenez, and S. P. Brooks
BAYESIAN DISEASE MAPPING: HIERARCHICAL MODELING IN SPATIAL
EPIDEMIOLOGY, SECOND EDITION A. B. Lawson
BIOEQUIVALENCE AND STATISTICS IN CLINICAL PHARMACOLOGY
S. Patterson and B. Jones
CLINICAL TRIALS IN ONCOLOGY, THIRD EDITION S. Green, J. Benedetti, A. Smith,
and J. Crowley
CLUSTER RANDOMISED TRIALS R.J. Hayes and L.H. Moulton
CORRESPONDENCE ANALYSIS IN PRACTICE, THIRD EDITION M. Greenacre
DESIGN AND ANALYSIS OF QUALITY OF LIFE STUDIES IN CLINICAL TRIALS,
SECOND EDITION D.L. Fairclough
DYNAMICAL SEARCH L. Pronzato, H. Wynn, and A. Zhigljavsky
FLEXIBLE IMPUTATION OF MISSING DATA S. van Buuren
GENERALIZED LATENT VARIABLE MODELING: MULTILEVEL, LONGITUDINAL,
AND STRUCTURAL EQUATION MODELS A. Skrondal and S. Rabe-Hesketh
GRAPHICAL ANALYSIS OF MULTI-RESPONSE DATA K. Basford and J. Tukey
INTRODUCTION TO COMPUTATIONAL BIOLOGY: MAPS, SEQUENCES, AND
GENOMES M. Waterman
MARKOV CHAIN MONTE CARLO IN PRACTICE W. Gilks, S. Richardson, and
D. Spiegelhalter
MEASUREMENT ERROR AND MISCLASSIFICATION IN STATISTICS AND EPIDEMIOLOGY:
IMPACTS AND BAYESIAN ADJUSTMENTS P. Gustafson
MEASUREMENT ERROR: MODELS, METHODS, AND APPLICATIONS
J. P. Buonaccorsi
MENDELIAN RANDOMIZATION: METHODS FOR USING GENETIC VARIANTS
IN CAUSAL ESTIMATION S. Burgess and S.G. Thompson
META-ANALYSIS OF BINARY DATA USING PROFILE LIKELIHOOD D. Böhning,
R. Kuhnert, and S. Rattanasiri
MISSING DATA ANALYSIS IN PRACTICE T. Raghunathan
POWER ANALYSIS OF TRIALS WITH MULTILEVEL DATA M. Moerbeek and
S. Teerenstra
SPATIAL POINT PATTERNS: METHODOLOGY AND APPLICATIONS WITH R
A. Baddeley, E. Rubak, and R. Turner
STATISTICAL ANALYSIS OF GENE EXPRESSION MICROARRAY DATA T. Speed
STATISTICAL ANALYSIS OF QUESTIONNAIRES: A UNIFIED APPROACH
BASED ON R AND STATA F. Bartolucci, S. Bacci, and M. Gnaldi
STATISTICAL AND COMPUTATIONAL PHARMACOGENOMICS R. Wu and M. Lin
STATISTICS IN MUSICOLOGY J. Beran
STATISTICS OF MEDICAL IMAGING T. Lei
STATISTICAL CONCEPTS AND APPLICATIONS IN CLINICAL MEDICINE
J. Aitchison, J.W. Kay, and I.J. Lauder
STATISTICAL AND PROBABILISTIC METHODS IN ACTUARIAL SCIENCE
P.J. Boland
STATISTICAL DETECTION AND SURVEILLANCE OF GEOGRAPHIC CLUSTERS
P. Rogerson and I. Yamada
STATISTICS FOR ENVIRONMENTAL BIOLOGY AND TOXICOLOGY A. Bailer
and W. Piegorsch
STATISTICS FOR FISSION TRACK ANALYSIS R.F. Galbraith
VISUALIZING DATA PATTERNS WITH MICROMAPS D.B. Carr and L.W. Pickle
Chapman & Hall/CRC
Interdisciplinary Statistics Series
Correspondence
Analysis in Practice
Third Edition
Michael Greenacre
Universitat Pompeu Fabra
Barcelona, Spain
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and
information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission
to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic,
mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or
retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact
the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides
licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment
has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without
intent to infringe.
This book is a revised and extended third edition of the second edition of
Correspondence Analysis in Practice (Chapman & Hall/CRC, 2007), first pub-
lished in 1993. In the original first edition I wrote the following in the Preface,
which is still relevant today:
Extract from preface of first edition of Correspondence Analysis in Practice (1993)
“Correspondence analysis is a statistical technique that is useful to all students,
researchers and professionals who collect categorical data, for example data collected
in social surveys. The method is particularly helpful in analysing cross-tabular
data in the form of numerical frequencies, and results in an elegant but
simple graphical display which permits more rapid interpretation and understanding
of the data. Although the theoretical origins of the technique can be
traced back over 50 years, the real impetus to the modern application of corre-
spondence analysis was given by the French linguist and data analyst Jean-Paul
Benzécri and his colleagues and students, working initially at the University of
Rennes in the early 1960s and subsequently at the Jussieu campus of the Univer-
sity of Paris. Parallel developments of correspondence analysis have taken place
in the Netherlands and Japan, centred around such pioneering researchers as
Jan de Leeuw and Chikio Hayashi. My own involvement with correspondence
analysis commenced in 1973 when I started my doctoral studies in Benzécri’s
Data Analysis Laboratory in Paris. The publication of my first book Theory and
Applications of Correspondence Analysis in 1984 coincided with the beginning
of a wider dissemination of correspondence analysis outside of France. At that
time I expressed the hope that my book would serve as a springboard for a much
wider and more routine application of correspondence analysis in the future. The
subsequent evolution and growing popularity of the method could not have been
more gratifying, as hundreds of researchers were introduced to the method and
became familiar with its ability to communicate complex tables of numerical
data to non-specialists through the medium of graphics. Researchers with whom
I have collaborated come from such varying backgrounds as sociology, ecology,
palaeontology, archaeology, geology, education, medicine, biochemistry, microbi-
ology, linguistics, marketing research, advertising, religious studies, philosophy,
art and music. In 1989 I was invited by Jay Magidson of Statistical Innovations
Inc. to collaborate with Leo Goodman and Clifford Clogg in the presentation of a
two-day short course in New York, entitled “Correspondence Analysis and Asso-
ciation Models: Geometric Representation and Beyond”. The participants were
mostly marketing professionals from major American companies. For this course
I prepared a set of notes which reinforced the practical, user-oriented approach to
correspondence analysis. ... The positive reaction of the audience was infectious
and inspired me subsequently to present short courses on correspondence analysis
in South Africa, England and Germany. It is from the notes prepared for these
courses that this book has grown.”
The CARME conferences
In 1991 Prof. Walter Kristof of Hamburg University proposed that we organize
a conference on correspondence analysis, with the assistance of Dr. Jörg
Blasius of the Zentralarchiv für Empirische Sozialforschung (Central Archive
xii Preface
New material in third edition
I have been very gratified to be invited to prepare a new edition of Correspondence
Analysis in Practice, having accumulated considerably more experience
in social and environmental research in the nine years since the publication of
the second edition. Apart from revising the existing chapters, five new chapters
have been added, on “Compositional Data Analysis” (an area highly related
to correspondence analysis), “Analysis of Matched Matrices” (joint analysis
of data tables with the same rows and columns), “Correspondence Analysis of
Networks” (applying correspondence analysis to graphs), “Co-Inertia and Co-
Correspondence Analysis” (analysis of relationships between two tables with
common rows), and “Permutation Tests” (performing statistical inference in
the context of correspondence analysis and related methods). All in all, I can
say that this third edition contains almost all my practical knowledge of the
subject, after more than 40 years working in this area.
“Statisticians count!”
At a conference I attended in the 1980s, I was given this lapel button with its
nicely ambiguous maxim, which could well be the motto of correspondence
analysts all over the world:
Textual analysis of third edition
To illustrate the more technical meaning of this motto, and to give an initial
example of correspondence analysis, I made a count of the most frequent words
in each of the 30 chapters of this new edition. I had to aggregate variations of
the same word, e.g. “coordinate” and “coordinates”, “plot” and “plotting”,
a process called lemmatization in textual data analysis. The top 10 most
frequent words were, in descending order of frequency: “row/s”, “profile/s”,
“inertia” (which is the way correspondence analysis measures variance in a
table), “point/s”, “column/s”, “data”, “CA” (abbreviation for correspondence
analysis), “variable/s”, “value/s” and “average”. I omitted words that occur
in one chapter only, such as “fuzzy” and “degree”, which are specific to a
single chapter, and removed words that described particular applications. This
left an eventual total of 167 words, which can be regarded as reflecting the
methodological content of the book.
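The aggregation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lemma map is built by hand, the label "plot/ting" and the sample sentence are hypothetical, and nothing here reproduces the procedure actually used to count words in the book.

```python
from collections import Counter

# Hypothetical hand-made lemma map; the word variants are the ones quoted
# in the text, the labels on the right are chosen for illustration only.
LEMMAS = {
    "coordinate": "coordinate/s",
    "coordinates": "coordinate/s",
    "plot": "plot/ting",
    "plotting": "plot/ting",
}

def count_lemmas(text):
    """Count words after mapping each known variant to its lemma label."""
    words = text.lower().split()
    return Counter(LEMMAS[w] for w in words if w in LEMMAS)

# Hypothetical sample text, not taken from the book.
sample = "Plot the coordinates then plotting each coordinate"
counts = count_lemmas(sample)
print(counts)
```

Applying such a function chapter by chapter would give one row of counts per chapter, which is the shape of the table in Exhibit 0.1.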
Exhibit 0.1: First few rows and columns of the table of counts of the 167 most
frequent words in the 30 chapters of Correspondence Analysis in Practice, Third
Edition, visualized in Exhibit 0.2 using correspondence analysis.
         analyse/sis  association/s  asymmetric  average  axis/es  ...
Chap 1            10              0           0        0       15  ...
Chap 2             0              0           0       29       22  ...
Chap 3             0              0           0       55        0  ...
Chap 4             0              6           0       22        0  ...
Chap 5             0              0           0       22       13  ...
Chap 6             0              0           0        8        0  ...
Chap 7             0              0           0       29        0  ...
Chap 8            47              0           0       14       20  ...
Chap 9             0              0          14        6       32  ...
Chap 10            0              0          17        0       14  ...
...              ...            ...         ...      ...      ...  ...
[Exhibit 0.2: correspondence analysis map of the chapters (shown by their numbers) and the words, with one axis labelled CA dimension 2. Surviving caption text: “... present book in terms of the most frequent words in each chapter. Numbers in boldface ... proximity signifies relative similarity of word distribution. Directions of the words give the interpretation for the positioning of the chapters. Technically, this is a so-called “contribution biplot” (see Chapter 13).”]
The final table of word counts was composed of the 30 chapters as rows,
and the 167 words as columns (see Exhibit 0.1). This table is very sparse,
i.e. it has many zeros. In fact, 80% of the cells of the table have no counts.
Correspondence analysis copes quite well with such data, which has made it a
popular method in research areas such as linguistics, archaeology and ecology,
where data sets of frequency counts occur that are very sparse.
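As a small illustration of what sparsity means here, the following Python sketch computes the proportion of zero cells in the 10 × 5 excerpt shown in Exhibit 0.1. It uses only the excerpt, so its result differs from the 80% quoted for the full table of 30 chapters by 167 words.

```python
# The 10 x 5 excerpt of word counts from Exhibit 0.1 (Chapters 1 to 10,
# first five words); the full 30 x 167 table is not reproduced here.
excerpt = [
    [10, 0, 0, 0, 15],
    [0, 0, 0, 29, 22],
    [0, 0, 0, 55, 0],
    [0, 6, 0, 22, 0],
    [0, 0, 0, 22, 13],
    [0, 0, 0, 8, 0],
    [0, 0, 0, 29, 0],
    [47, 0, 0, 14, 20],
    [0, 0, 14, 6, 32],
    [0, 0, 17, 0, 14],
]

cells = sum(len(row) for row in excerpt)
zeros = sum(value == 0 for row in excerpt for value in row)
sparsity = zeros / cells  # proportion of cells with no counts

print(f"{zeros} of {cells} cells are zero ({100 * sparsity:.0f}%)")
```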
Exhibit 0.2 shows the “map” of the table, resulting from applying correspon-
dence analysis. The first thing to notice is that the rows (chapters, indicated
by their numbers) and the columns (words, connected to the centre by lines)
are displayed with respect to two “dimensions”. These dimensions are de-
termined by the analysis with the objective of exposing the most important
features of the associations between chapters and words. An alternative way
of thinking about this is that the chapters are mapped according to the simi-
larity in their distributions of words, with closer chapters being more similar
and distant chapters more different. Then the directions of the words explain
the differences between the chapters. Not all the words are shown, because
about two-thirds of the words turn out to be not so important for the interpre-
tation of the result, so only those words are shown that contribute highly to
the positioning of the chapters. Without further explanation of the concepts
underlying correspondence analysis (after all, this is the aim of the book that
follows!) the map clearly shows three sets of words emanating from the centre.
The words out to the top right clearly distinguish Chapters 29 and 30 from all
the others — these are the chapters that concentrate on the sampling, distri-
butional and inferential properties of correspondence analysis, with main key-
words “permutation” and “test”. Chapter 15 on clustering also tends in that
direction because it contains some hypothesis testing. Out in the upper left
direction are all the words describing basic concepts and terminology of corre-
spondence analysis associated with Chapters 1–14 that introduce the method
and develop it, exemplified by the most prominent keyword “profile/s”. To-
wards the bottom of the map are the words associated with a generalization of
correspondence analysis, called multiple correspondence analysis, usually ap-
plied to questionnaire data, described in later chapters. This method involves
various coding schemes in different types of matrices, hence the important
keyword “matrix/ices” down below.
Format of third edition
Like the second edition, the book maintains its didactic format, with exactly
eight pages per chapter to provide a constant amount of material in each
chapter for self-learning or teaching (a feature that has been commented on
favourably in book reviews of the second edition). One of my colleagues re-
marked that it was like writing 14-line sonnets with strict rules for metre
and rhyming, which was certainly true in this case: the format definitely con-
tributed to the creative process. The margins are reserved for section headings
as well as captions of the tables and figures — these captions tend to be more
informative than conventional one-liner ones. Each chapter has a short intro-
duction and its own “Contents” list on the first page, and the chapter always
ends with a summary in the form of a bulleted list.
Appendices
As in the first and second editions, the book’s main thrust is towards the
practice of correspondence analysis, so most technical issues and mathemati-
cal aspects are gathered in a theoretical appendix at the end of the book. It is
followed by a computational appendix, which describes some features of the R
language relevant to the methods in the book, including the ca package for cor-
respondence analysis. R scripts are placed on the website www.carme-n.org,
along with several of the data sets. No references at all are given in the 30
chapters — instead, a brief bibliographical appendix is given to point readers
towards further readings and more complete literature sources. A glossary of
the most important terms in the book is also provided and the book concludes
with some personal thoughts in the form of an epilogue.
Acknowledgements
The first edition of this book was written in South Africa, and the second and
present third editions in Catalonia, Spain. Many people and institutions have
contributed in one way or another to this project. I would like to thank the
BBVA Foundation in Madrid and its director Prof. Rafael Pardo, for sup-
port and encouragement in my work on correspondence analysis. The BBVA
Foundation has published a Spanish translation of the second edition of Cor-
respondence Analysis in Practice, called La Práctica del Análisis de Corre-
spondéncias, available for free download at www.multivariatestatistics.org.
Contents
Data set 1: My travels
Continuous variables
Expressing data in relative amounts
Categorical variables
Ordering of categories
Distances between categories
Distance interpretation of scatterplots
Scatterplots as maps
Calibration of a direction in the map
Information-transforming nature of the display
Nominal and ordinal variables
Plotting more than one set of data
Interpreting absolute or relative frequencies
Describing and interpreting data, vs. modelling and statistical inference
Large data sets
SUMMARY: Scatterplots and Maps
Data set 1: My travels
During the original writing of this book, I was reflecting on the journeys I
had made during the year to Norway, Canada, Greece, France and Germany.
According to my diary I spent periods of 18 days in Norway, 15 days in
Canada and 29 days in Greece. Apart from these longer trips I also made
several short trips to France and Germany, totalling 24 days. This numerical
description of my time spent in foreign countries can be visualized in the
graphs of Exhibit 1.1. This seemingly trivial example conceals several issues
that are relevant to our perception of graphs of this type that represent data
with respect to two coordinate axes, and which will eventually help us to
2 Scatterplots and Maps
[Exhibit 1.1: two graphs of Days (left-hand vertical axis, with a percentage scale on the right) against country (Norway, Canada, Greece, France/Germany). Surviving caption text: “... year, in scatterplot and bar-chart formats respectively. A percentage scale, expressing days relative to the total of 86 days, is given on the right-hand side of each graph.”]
Continuous variables
The left-hand vertical axis labelled Days represents the scale of a numeric piece
of information often referred to as a continuous, or numerical, variable. The
scale on this axis is the number of days spent in some foreign country, and the
ordering from zero days at the bottom end of the scale to 30 days at the top end
is clearly defined. In the bar-chart form of this display, given in the right-hand
graph of Exhibit 1.1, bars are drawn with lengths proportional to the values
of the variable. Of course, the number of days is a rounded approximation of
the time actually spent in each country, but we call this variable continuous
because the underlying time variable is indeed truly continuous.
Expressing data in relative amounts
The right-hand vertical axis of each plot in Exhibit 1.1 can be used to read the
corresponding percentage of days relative to the total of 86 days. For example,
the 18 days in Norway account for 21% of the total time. The total of 86 is
often called the base relative to which the data are expressed. In this case
there is only one set of data and therefore just one base, so in these plots the
original absolute scale on the left and the relative scale on the right can be
depicted on the same graph.
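The conversion from absolute to relative amounts is simple arithmetic, and a minimal Python sketch makes the role of the base explicit. The variable names are chosen for illustration; the counts are those given in the text.

```python
# Days spent in each country, as given in the text.
days = {"Norway": 18, "Canada": 15, "Greece": 29, "France/Germany": 24}

# The base is the total relative to which the data are expressed.
base = sum(days.values())

# Relative amounts: each count as a percentage of the base.
percentages = {country: 100 * d / base for country, d in days.items()}

for country, pct in percentages.items():
    print(f"{country:15s} {days[country]:3d} days  {pct:5.1f}%")
```

Because there is a single base, the percentages are just a rescaling of the counts, which is why both scales can share one graph.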
Categorical variables
In contrast to the vertical y-axis, the horizontal x-axis is clearly not a numerical
variable. The four points along this axis are just positions where we have
placed labels denoting the countries visited. The horizontal scale represents a
discrete, or categorical, variable. There are two features of this horizontal axis
that have no substantive meaning in the graph: the ordering of the categories
and the distances between them.
Ordering of categories
Firstly, there is no strong reason why Norway has been placed first, Canada
second and Greece third, except perhaps that I visited these countries in that
order. Because the France/Germany label refers to a collection of shorter trips
scattered throughout the year, it was placed after the others. By the way, in
this type of representation where order is essentially irrelevant, it is usually
a good idea to re-order the categories in a way that has some substantive
meaning, for example in terms of the values of the variable. In this example
we could order the countries in descending order of days, in which case we
would position the countries in the order Greece, France/Germany, Norway
and Canada, from most visited to least. This simple re-arrangement assists
in the interpretation of data, especially if the data set is much larger: for
example, if I had visited 20 different countries, then the order would contain
relevant information that is not quickly deduced from the data in their original
ordering.
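The re-ordering suggested above amounts to sorting the categories by the value of the variable. A minimal Python sketch, with illustrative names:

```python
# Days spent in each country, in the original (arbitrary) order.
days = {"Norway": 18, "Canada": 15, "Greece": 29, "France/Germany": 24}

# Re-order the categories from most visited to least.
ordered = sorted(days, key=days.get, reverse=True)

for country in ordered:
    print(f"{country:15s} {days[country]:3d}")
```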
Distances between categories
Secondly, there is no reason why the four points are at equal intervals apart
on the axis. There is also no immediate reason to put them at different intervals
apart, so it is purely for convenience and aesthetics that they have
been equally spaced. Using correspondence analysis we will show that there
are substantively interesting ways to define intervals between the categories
of a variable such as this one, when it is related to other variables. In fact,
correspondence analysis will be shown to yield values for the categories where
both the distances between the categories and their ordering have substantive
meaning.
Distance interpretation of scatterplots
Since the ordering of the countries is arbitrary on the horizontal axis of Exhibit 1.1,
as well as the distances between them, there would be no sense
in measuring and interpreting distances between the displayed points in the
left-hand graph. The only distance measurement that has meaning is in the
strictly vertical direction, because of the numerical nature of the vertical axis
that indicates frequency (left-hand scale) or relative frequency (right-hand
scale).
Scatterplots as maps
In some special cases, the two variables that define the axes of the scatterplot
are of the same numerical nature and have comparable scales. For example,
suppose that 20 students have written a mathematics examination consisting
of two parts, algebra and geometry, each part counting 50% towards the final
grade. The 20 students can be plotted according to their pair of grades, shown
in Exhibit 1.2. It is important that the two axes representing the respective
grades have scales with unit intervals of identical lengths. Because of the simi-
lar nature of the two variables and their scales, it is possible to judge distances
in any direction of the display, not only horizontally or vertically. Two points
that are close to each other will have similar results in the examination, just
like two neighbouring towns having a small geographical distance between
them. Thus, one can comment here on the shape of the scatter of points and
the fact that there is a small cluster of four students with high grades and a
single student with very high grades. Exhibit 1.2 can be regarded as a map,
because the position of each student can be regarded as a two-dimensional
position, similar to a geographical location in a region defined by latitude and
longitude co-ordinates.
[Exhibit 1.2: Scatterplot of grades of 20 students in two sections (algebra and geometry) of a mathematics examination; horizontal axis Algebra and vertical axis Geometry, each from 0 to 50. The rest of the caption is truncated in the source: “The points have spatial properties: for ...”]
Calibration of a direction in the map
Maps have interesting geometric properties. For example, in Exhibit 1.2 the
45° dashed line actually defines an axis for the final grades of the students,
combining the algebra and geometry grades. If this line is calibrated from 0
(bottom left) to 100 (top right), then each student’s final grade can be read
from the map by projecting each point perpendicularly onto this line. An
example is shown of a student who received 12 out of 50 and 18 out of 50
for the two sections, respectively, and whose position projects onto the line at
coordinates 15 and 15, corresponding to a total grade of 30.
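The projection in this example can be checked numerically. The sketch below assumes, as in the text, that both axes run from 0 to 50 with identical unit lengths and that the 45° line is calibrated linearly from 0 at (0, 0) to 100 at (50, 50); the function names are illustrative.

```python
import math

def project_onto_diagonal(algebra, geometry):
    """Perpendicular projection of the point (algebra, geometry) onto y = x."""
    mean = (algebra + geometry) / 2
    return (mean, mean)

def final_grade(algebra, geometry, max_part=50, max_total=100):
    """Read the final grade off the diagonal calibrated from 0 to max_total."""
    px, py = project_onto_diagonal(algebra, geometry)
    # Distance along the diagonal from (0, 0), as a fraction of its full length.
    fraction = math.hypot(px, py) / math.hypot(max_part, max_part)
    return fraction * max_total

# The student from the text: 12 out of 50 in algebra, 18 out of 50 in geometry.
point = project_onto_diagonal(12, 18)
grade = final_grade(12, 18)
print(point, grade)
```

The projected point has both coordinates equal to the mean of the two grades, so its calibrated position on the diagonal is their sum.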
Information-transforming nature of the display
The scatterplots in Exhibit 1.1 and Exhibit 1.2 are different ways of expressing
in graphical form the numerical information in the two sets of travel and
examination data respectively. In each case there is no loss of information
between the data and the graph. Given the graph it is easy to recover the data
Nominal and ordinal variables
In my travel example, the categorical variable “country” has four categories,
and, since there is no inherent ordering of the categories, we refer to this variable
more specifically as a nominal variable. If the categories are ordered, the
categorical variable is called an ordinal variable. For example, a day could be
classified into three categories according to how much time is spent working:
(i) less than one hour (which I would call a “holiday”), (ii) more than one but
less than six hours (a “half day”, say) and (iii) more than six hours (a “full
day”). These categories, which are based on the continuous variable “time
spent daily working” divided up into intervals, are ordered and this ordering
is usually taken into account in any graphical display of the categories. In
many social surveys, questions are answered on an ordinal scale of response,
for example, an ordinal scale of importance: “not important”/“somewhat
important”/“very important”. Another typical example is a scale of agree-
ment/disagreement: “strongly agree”/“somewhat agree”/“neither agree nor
disagree”/“somewhat disagree”/“strongly disagree”. Here the ordinal position
of the category “neither agree nor disagree” might not lie between “somewhat
agree” and “somewhat disagree”; for example, it might be a category used
by some respondents instead of a “don’t know” response when they do not
understand the question or when they are confused by it. We shall treat this
topic later in this book (Chapter 21) once we have developed the tools that
allow us to study patterns of responses in multivariate questionnaire data.
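The construction of an ordinal variable by cutting a continuous one into intervals, as in the day-type example above, might be sketched as follows. The text does not say where exactly 1 hour and exactly 6 hours belong, so the boundary handling here is an assumption, and the diary entries are hypothetical.

```python
# Ordered categories, from least to most time spent working.
DAY_TYPES = ["holiday", "half day", "full day"]

def classify_day(hours_worked):
    """Classify a day by the number of hours spent working.

    Assumption: the text leaves exactly 1 and exactly 6 hours unassigned;
    this sketch puts 1 hour in "half day" and 6 hours in "full day".
    """
    if hours_worked < 1:
        return "holiday"
    if hours_worked < 6:
        return "half day"
    return "full day"

hours = [0.0, 0.5, 3.0, 7.5, 8.0]  # hypothetical diary entries
labels = [classify_day(h) for h in hours]

# Ordinal codes preserve the ordering of the categories: 0 < 1 < 2.
codes = [DAY_TYPES.index(label) for label in labels]
print(labels, codes)
```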
Exhibit 1.3: Frequencies of different types of day in four sets of trips.
COUNTRY          Holidays  Half days  Full days  TOTAL
Norway                  6          1         11     18
Canada                  1          3         11     15
Greece                  4         25          0     29
France/Germany          2          2         20     24
TOTAL                  13         31         42     86
Plotting more than one set of data
Let us suppose now that the 86 days of my foreign trips were classified into one
of the three categories holidays, half days and full days. The cross-tabulation
of country by type of day is given in Exhibit 1.3. This table can be considered
in two different ways: as a set of rows or a set of columns. For example, each
column is a set of frequencies characterizing the respective type of day, while
each row characterizes the respective country. Exhibit 1.4(a) shows the latter
way, namely a plot of the frequencies for each country (row), where the hori-
zontal axis now represents the type of day (column). Notice that, because the
categories of the variable “type of day” are ordered, it makes sense to connect
[Exhibit 1.4: two panels plotted against type of day (Holiday, Half Day, Full Day): (a) frequencies, with vertical axis labelled Days, and (b) relative frequencies. Surviving caption text: “... in each row expressed as percentages.”]
Exhibit 1.5: Percentages of types of day in each country, as well as the
percentages overall for all countries combined; rows add up to 100%.
COUNTRY          Holidays  Half days  Full days
Norway                33%         6%        61%
Canada                 7%        20%        73%
Greece                14%        86%         0%
France/Germany         8%         8%        83%
Overall               15%        36%        49%
Interpreting absolute or relative frequencies
There is a lesson to be learnt from these displays that is fundamental to the
analysis of frequency data. Each trip has involved a different number of days
and so corresponds to a different base as far as the frequencies of the types of
days are concerned. The 6 holidays in Norway, compared to the 4 in Greece,
can be judged only in relation to the total number of days spent in these
respective countries. As percentages they turn out to be quite different: 6 out
of 18 is 33%, while 4 out of 29 is 14%. It is the visualization of the relative
frequencies in Exhibit 1.4(b) that gives a more accurate comparison of how I
spent my time in the different countries. The “marginal” frequencies (18, 15,
29, 24 for the countries, and 13, 31, 42 for the day types) are also interpreted
relative to their respective totals — for example, the last row of Exhibit 1.5
shows the percentages of day types for all countries combined, and could have
been plotted similarly in Exhibit 1.4(b).
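The percentages of Exhibit 1.5 follow from the counts of Exhibit 1.3 by dividing each row by its own base. A minimal Python sketch, with illustrative names and percentages rounded to whole numbers as in the exhibit:

```python
day_types = ["Holidays", "Half days", "Full days"]

# Counts from Exhibit 1.3, one row per country.
counts = {
    "Norway": [6, 1, 11],
    "Canada": [1, 3, 11],
    "Greece": [4, 25, 0],
    "France/Germany": [2, 2, 20],
}

def percentages(row):
    """Express a row of counts relative to its own total (its base)."""
    total = sum(row)
    return [round(100 * x / total) for x in row]

# Each country is judged relative to the days spent in that country.
row_pct = {country: percentages(row) for country, row in counts.items()}

# The marginal column totals, relative to the grand total, give the last row.
column_totals = [sum(col) for col in zip(*counts.values())]
overall_pct = percentages(column_totals)

for country, pcts in row_pct.items():
    print(country, dict(zip(day_types, pcts)))
print("Overall", dict(zip(day_types, overall_pct)))
```

Rounded rows need not sum to exactly 100; the France/Germany row, for instance, gives 8, 8 and 83.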
Describing and interpreting data, vs. modelling and statistical inference
Any conclusion drawn from the points’ positions in Exhibit 1.4(b) is purely
an interpretation of the data and not a statement of the statistical significance
of the observed feature. In this book we shall address the statistical
aspects of graphical displays only towards the end of the book (Chapters 29
and 30); for the most part we shall be concerned only with the question of
data visualization and interpretation. The deduction that I had proportion-
ally more holidays in Norway than in the other countries is certainly true in
the data and can be seen strikingly in Exhibit 1.4(b). It is an entirely dif-
ferent question whether this phenomenon is statistically compatible with a
model or hypothesis of my behaviour that postulates that the proportion of
holidays was generally intended to be the same for all countries, in which case
any observed differences are purely random. Most of statistical methodology
concentrates on problems where data are fitted and compared to a theoretical
model or preconceived hypothesis, with little attention being paid to enlight-
ening ways for describing data, interpreting data and generating hypotheses.
A typical example in the social sciences is the use of the ubiquitous chi-square
statistic to test for association in a cross-tabulation. Often statistically sig-
nificant association is found but there are no simple tools for detecting which
parts of the table are responsible for this association. Correspondence analysis
is one tool that can fill this gap, allowing the data analyst to see the pattern of
association in the data and to generate hypotheses that can be tested in a sub-
sequent stage of research. In most situations data description, interpretation
and modelling can work hand-in-hand with one another. But there are situations
where data description and interpretation assume supreme importance,
for example when the data represent the whole population of interest.
Large data sets
As data tables increase in size, it becomes more difficult to make simple graphical
displays such as Exhibit 1.4, owing to the overabundance of points. For
example, suppose I had visited 20 countries during the year and had a break-
down of time spent in each one of them, leading to a table with many more
rows. I could also have recorded other data about each day in order to study
possible relationships with the type of day I had; for example, the weather on
each day — “fair weather”, “partly cloudy” or “rainy”. So the table of data
might have many more columns as well as rows. In this case, to draw graphs
such as Exhibit 1.4, involving many more categories and with 20 sets of points
traversing the plot, would result in such a confusion of points and symbols
that it would be difficult to see any patterns at all. It would then become clear
that the descriptive instrument being used, the scatterplot, is inadequate in
bringing out the essential features of the data. This is a convenient point to in-
troduce the basic concepts of correspondence analysis, which is also a method
for visualizing tabular data, but which can easily accommodate larger data
sets in a natural and intuitive way.
SUMMARY: Scatterplots and Maps
1. Scatterplots involve plotting two variables, with respect to a horizontal axis
and a vertical axis, often called the “x-axis” and “y-axis” respectively.
2. Usually the x variable is a completely different entity to the y variable. We
can often interpret distances along at least one of the axes in the specific
sense of measuring the distance according to the scale that is calibrated on
the axis. It is usually meaningless to measure or interpret oblique distances
in the plot.
3. In a few cases the x and y variables are similar entities with comparable
scales, in which case interpoint distances can be interpreted as a measure
of difference, or dissimilarity, between the plotted points. In this special
case we call the scatterplot a map. For such maps it is important that the
horizontal and vertical scales have physically equal units, i.e. the aspect
ratio of the axes is equal to 1.
4. When plotting positive quantities (usually frequencies in our context), both
the absolute and relative values of these quantities are of interest.
5. The more complex the data are, the less convenient it is to represent these
data in a scatterplot.
6. This book is concerned with visually describing and interpreting complex
information, rather than modelling it.
Profiles and the Profile Space 2
The concept of a set of relative frequencies, or a profile, is fundamental to
correspondence analysis (referred to from now on by its abbreviation CA).
Such sets, or vectors, of relative frequencies have special geometric features
because the elements of each set add up to 1 (or 100%). In analysing a fre-
quency table, relative frequencies can be computed for rows or for columns —
these are called row or column profiles respectively. In this chapter we shall
show how profiles can be depicted as points in a profile space, illustrating the
concept in the special case when each profile consists of only three elements.
Contents
Profiles
Average profile
Row profiles and column profiles
Symmetric treatment of rows and columns
Asymmetric consideration of the data table
Plotting the profiles in the profile space
Vertex points define the extremes of the profile space
Triangular (or ternary) coordinate system
Positioning a point in a triangular coordinate system
Geometry of profiles with more than three elements
Data on a ratio scale
Data on a common scale
SUMMARY: Profiles and the Profile Space
Profiles
Let us look again at the data in Exhibit 1.3, a table of frequencies with four
rows (the countries) and three columns (the type of day). The first and most
basic concept in CA is that of a profile, which is a set of frequencies divided
by their total. Exhibit 2.1 shows the row profiles for these data: for example,
the profile of Norway is [0.33 0.06 0.61], where 0.33 = 6/18, 0.06 = 1/18, 0.61
= 11/18. We say that this is the “profile of Norway across the types of day”.
The profile may also be expressed in percentage form, i.e. [33% 6% 61%] in
this case, as in Exhibit 1.5. In a similar way, the profile of Canada across the
day types is [0.07 0.20 0.73], concentrated mostly in the full day category, as
is Norway. In contrast, Greece has a profile of [0.14 0.86 0.00], concentrated
mostly in the half day category, and so on. The percentages are plotted in
Exhibit 1.4(b) on page 6.
Exhibit 2.1: Row (country) profiles: relative frequencies of types of day in
each set of trips, and average profile showing relative frequencies in all trips.
COUNTRY           Holidays   Half days   Full days
Norway              0.33       0.06        0.61
Canada              0.07       0.20        0.73
Greece              0.14       0.86        0.00
France/Germany      0.08       0.08        0.83
Average             0.15       0.36        0.49
Average profile
In addition to the four country profiles, there is an additional row in Exhibit
2.1 labelled Average. This is the profile of the final row [13 31 42] of Exhibit
1.3, which contains the column sums of the table; in other words this is the
profile of all the trips aggregated together. In Chapter 3 we shall explain more
specifically why this is called the average profile. For the moment, it is only
necessary to realize that, out of the total of 86 days travelled, irrespective
of country visited, 15% were holidays, 36% were half days and 49% were full
days of work. When comparing profiles we can compare one country’s profile
with another, or we can compare a country’s profile with the average profile.
For example, eyeballing the figures in Exhibit 2.1, we can see that of all the
countries, the profiles of Canada and France/Germany are the most similar.
Compared to the average profile, these two profiles have a higher percentage
of full days and are below average on holidays and half days.
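The profile calculations described above can be sketched in a few lines of Python. This is only an illustration: the counts for Norway and Greece and the row and column totals are quoted in the text, whereas the individual counts for Canada and France/Germany are an assumption, reconstructed here from their profiles and totals. The names `counts`, `profile`, `row_profiles` and `average_profile` are hypothetical.

```python
# Frequencies of day types (holidays, half days, full days) per country.
# Norway and Greece counts and all totals are quoted in the text; the
# Canada and France/Germany counts are reconstructed from their profiles.
counts = {
    "Norway": [6, 1, 11],
    "Canada": [1, 3, 11],
    "Greece": [4, 25, 0],
    "France/Germany": [2, 2, 20],
}


def profile(freqs):
    """Divide a set of frequencies by their total to get relative frequencies."""
    total = sum(freqs)
    return [f / total for f in freqs]


# One row profile per country.
row_profiles = {country: profile(freqs) for country, freqs in counts.items()}

# The average profile is the profile of the column sums [13 31 42].
column_sums = [sum(col) for col in zip(*counts.values())]
average_profile = profile(column_sums)

for country, p in row_profiles.items():
    print(country, [round(x, 2) for x in p])
print("Average", [round(x, 2) for x in average_profile])
```

Rounded to two decimals, the printed values should agree with Exhibit 2.1.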
Row profiles and column profiles
In the above we looked at the row profiles in order to compare the different
countries. We could also consider Exhibit 1.3 as a set of columns and compare
how the different types of days are distributed across the countries. Exhibit 2.2
shows the column profiles as well as the average column profile. For example,
of the 13 holidays 46% were in Norway, 8% in Canada, 31% in Greece and 15%
in France/Germany, and so on for the other columns. Since I spent different
numbers of days in each country, these figures should be checked against those
of the average column profile to see whether they are lower or higher than
the average pattern. For example, 46% of all holidays were spent in Norway,
whereas the number of days spent in Norway was just 21% of the total of 86 —
in this sense there is a high number of holidays there compared to the average.
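As a rough sketch of the same idea from the column point of view, the snippet below divides each column by its total and compares the result with the average column profile (row totals divided by the grand total). The Canada and France/Germany counts are again an assumption reconstructed from the quoted profiles and totals, and the variable names are hypothetical.

```python
countries = ["Norway", "Canada", "Greece", "France/Germany"]
day_types = ["Holidays", "Half days", "Full days"]

# Rows follow `countries`, columns follow `day_types`.
# Canada and France/Germany counts are reconstructed, not quoted in the text.
table = [
    [6, 1, 11],
    [1, 3, 11],
    [4, 25, 0],
    [2, 2, 20],
]

grand_total = sum(sum(row) for row in table)
column_totals = [sum(col) for col in zip(*table)]

# Column profile: each column's frequencies divided by that column's total.
column_profiles = {
    day: [row[j] / column_totals[j] for row in table]
    for j, day in enumerate(day_types)
}

# Average column profile: row totals relative to the grand total.
average_column_profile = [sum(row) / grand_total for row in table]

# Share of all holidays spent in Norway versus share of all days spent there.
print(round(column_profiles["Holidays"][0], 2))
print(round(average_column_profile[0], 2))
```

For Norway the two printed values correspond to the 46% and 21% discussed above.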
Exhibit 2.2: Profiles of types of day across the countries, and average column
profile.
COUNTRY           Holidays   Half days   Full days   Average
Norway              0.46       0.03        0.26        0.21
Canada              0.08       0.10        0.26        0.17
Greece              0.31       0.81        0.00        0.34
France/Germany      0.15       0.07        0.48        0.28
Symmetric treatment of rows and columns
Looking again at the proportion 0.46 (= 6/13) of holidays spent in Norway
(Exhibit 2.2) and comparing it to the proportion 0.21 (= 18/86) of all days
spent in that country, we can calculate the ratio 0.46/0.21 = 2.2, and conclude
that holidays in Norway were just over twice the average. Exactly the same
conclusion is reached if a similar calculation is made on the row profiles. In
Exhibit 2.1 the proportion of holidays in Norway was 0.33 (= 6/18) whereas
for all countries the proportion was 0.15 (= 13/86). Thus, there are 0.33/0.15
= 2.2 times as many holidays compared to the average, the same ratio as was
obtained when arguing from the point of view of the column profiles (this
ratio is called the contingency ratio and will re-appear in future chapters).
Whether we argue via the row profiles or column profiles we arrive at the
same conclusion. In Chapter 8 it will be shown that CA treats the rows and
columns of a table in an equivalent fashion or, as we say, in a symmetric way.
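The equality of the two ratios can be checked with exact fractions, using only the counts quoted in the text (6 holidays in Norway, 18 days in Norway, 13 holidays in total, 86 days in total). The variable names below are hypothetical, and the snippet is a small check rather than a general routine.

```python
from fractions import Fraction

holidays_in_norway = 6
days_in_norway = 18
holidays_total = 13
grand_total = 86

# Column-profile view: share of holidays spent in Norway,
# relative to the share of all days spent in Norway.
ratio_via_columns = Fraction(holidays_in_norway, holidays_total) / Fraction(
    days_in_norway, grand_total
)

# Row-profile view: share of Norway days that were holidays,
# relative to the share of all days that were holidays.
ratio_via_rows = Fraction(holidays_in_norway, days_in_norway) / Fraction(
    holidays_total, grand_total
)

print(ratio_via_columns, ratio_via_rows, round(float(ratio_via_rows), 1))
```

Both expressions reduce to the observed count multiplied by the grand total and divided by the product of the row and column totals, which is why the two viewpoints must agree.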
Asymmetric consideration of the data table
Nevertheless, it is true in practice that a table of data is often thought of and
interpreted in a non-symmetric, or asymmetric, fashion, either as a set of rows
or as a set of columns. For example, since each row of Exhibit 1.3 constitutes a
different country (or pair of countries in the case of France/Germany), it might
be more natural to think of the table row-wise, as in Exhibit 2.1. Deciding
which way is more appropriate depends on the nature of the data and the
researcher’s objective, and the decision is often not a conscious one. One
concrete manifestation of the actual choice is whether the researcher refers to
row or column percentages when interpreting the data. Whatever the decision,
the results of CA will be invariant to this choice, but the interpretation will
adapt to the researcher’s viewpoint.
Plotting the profiles in the profile space
Let us consider the four row profiles and average profile in Exhibit 2.1 and
a completely different way to plot them. Rather than the display of Exhibit
1.4(b), where the horizontal axis serves only as labels for the type of day
and the vertical axis represents the percentages, we now propose using three
axes corresponding to the three types of day, which is a scatterplot in three
dimensions. To imagine three perpendicular axes is not difficult: merely look
down into an empty corner of the room you are sitting in and you will see
three axes as shown in Exhibit 2.3. Each of the three edges of the room
serves as an axis for plotting the three elements of the profile. These three
values are now considered to be coordinates of a single point that represents
the whole profile — this is quite different from the graph in Exhibit 1.4(b)
where there is a point for each of the three profile elements. The three axes
are labelled holidays, half days and full days, and are calibrated in fractional
profile units from 0 to 1. To plot the four profiles is now a simple exercise.
Norway’s profile of [0.33 0.06 0.61] (see Exhibit 2.1) is 0.33 of a unit along
axis holidays, 0.06 along axis half days and 0.61 along axis full days. To take
another example, Greece’s profile of [0.14 0.86 0.00] has a zero coordinate
in the full days direction, so its position is on the “wall”, as it were, on the
left-hand side of the display, with coordinates 0.14 and 0.86 on the two axes
holidays and half days that define the “wall”. All other row profile points in this
example, including the average row profile [0.15 0.36 0.49], can be plotted
in this three-dimensional space.
Vertex points define the extremes of the profile space
With a bit of imagination it might not be surprising to discover that the
profile points in Exhibit 2.3 all lie exactly in the plane defined by the triangle
that joins the extreme unit points [1 0 0], [0 1 0] and [0 0 1] on the three
respective axes, as shown in Exhibit 2.4. This triangle is equilateral and its
three corners are called vertex points or vertices. The vertices coincide with
extreme profiles that are totally concentrated into one of the day types. For
example, the vertex point [1 0 0] corresponds to a trip to a country consisting
only of holidays (fictional in my case, unfortunately). Likewise, the vertex point
[0 0 1] corresponds to a trip consisting only of full days of work.
Triangular (or ternary) coordinate system
Having realized that all profile points in three-dimensional space actually lie
exactly on a flat (two-dimensional) triangle, it is possible to lay this triangle
flat, as in Exhibit 2.5. Looking at the profile points in a flat space is clearly
better than trying to imagine their three-dimensional positions in the corner of
a room! This particular type of display is often referred to as the triangular (or
ternary) coordinate system and may be used in any situation where we have
sets of data consisting of three elements that add up to 1, as in the case of the
row profiles in this example. Such data are common in geology and chemistry,
for example where samples are decomposed into three constituents, by weight
Exhibit 2.4: The profile points in Exhibit 2.3 lie exactly on an equilateral
triangle joining the vertex points of the profile space. Thus the
three-dimensional profiles are actually two-dimensional. The profile of Greece
lies on the edge of the triangle because it has zero full days.
[Figure: triangle joining the vertices holidays, half days and full days, with
the points NORWAY, CANADA, GREECE, FRANCE/GERMANY and the average.]
[Figure: flat triangular plot of the same five points; caption not recoverable.]
Exhibit 2.6: Norway’s profile [0.33 0.06 0.61] is positioned using triangular
coordinates as shown, using the sides of the triangle as axes. Each side is
calibrated in profile units from 0 to 1.
[Figure: triangle with sides labelled holidays, half days and full days; the
coordinate values 0.61 and 0.06 are marked, and the points NORWAY, CANADA,
GREECE, FRANCE/GERMANY and the average are plotted.]
Positioning a point in a triangular coordinate system
Given a blank equal-sided triangle and the profile values, how can we find the
position of a profile point in the triangle, without passing via the underlying
three-dimensional space of Exhibits 2.3 and 2.4? In the triangular coordinate
system the sides of the triangle define three axes. Each side is considered
to have a length of 1 and can be calibrated accordingly on a linear scale
from 0 to 1. In order to position a profile in the triangle, its three values
on these axes determine three lines drawn from these values parallel to the
respective sides of the triangle. For example, to position Norway, as illustrated
in Exhibit 2.5, we take a value of 0.33 on the holidays axis (see Exhibit 2.6),
0.06 on the half days axis and 0.61 on the full days axis. Lines from these
coordinate values drawn parallel to the sides of the triangle all meet at the
point representing Norway. In fact, any two of the three profile coordinates
are sufficient to situate a profile in this way, and the remaining coordinate is
always superfluous, which is another way of demonstrating that the profiles
are inherently two-dimensional.
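One way to compute such a position numerically, rather than by drawing parallel lines, is to take the weighted average of the triangle's corner positions, with the profile values as weights (the idea developed in the next chapter). The sketch below assumes a particular placement of the vertices on the page (holidays at the top, half days bottom left, full days bottom right); this layout, and the names `VERTICES` and `ternary_to_xy`, are illustrative assumptions and not taken from the exhibits.

```python
import math

# Assumed page positions of the three vertices of a unit-sided
# equilateral triangle (a layout chosen for illustration only).
VERTICES = {
    "holidays": (0.5, math.sqrt(3) / 2),
    "half days": (0.0, 0.0),
    "full days": (1.0, 0.0),
}


def ternary_to_xy(profile):
    """Map a three-element profile (values summing to 1) to a point in the
    plane, as the weighted average of the vertex positions."""
    if abs(sum(profile.values()) - 1.0) > 1e-9:
        raise ValueError("profile elements must add up to 1")
    x = sum(weight * VERTICES[name][0] for name, weight in profile.items())
    y = sum(weight * VERTICES[name][1] for name, weight in profile.items())
    return (x, y)


# Exact profiles computed from the counts quoted in the text.
norway = {"holidays": 6 / 18, "half days": 1 / 18, "full days": 11 / 18}
greece = {"holidays": 4 / 29, "half days": 25 / 29, "full days": 0.0}

print("Norway:", ternary_to_xy(norway))
print("Greece:", ternary_to_xy(greece))
```

Under this layout, Greece falls on the edge joining the holidays and half days vertices because its full days value is zero.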
Geometry of profiles with more than three elements
The triangular coordinate system may be used only for profiles with three
elements. But the idea can easily be generalized to profiles with any number
of elements, in which case the coordinate system is known as the barycentric
coordinate system (“barycentre” is synonymous with “weighted average”,
to be explained in the next chapter, page 19). The dimensionality of this
coordinate system is always one less than the number of elements in the
profile. For example, we have just seen that three-element profiles are con-
tained exactly in a two-dimensional triangular profile space. For profiles with
four elements the dimensionality is three and the profiles lie in a four-pointed
tetrahedron in three-dimensional space. The two-dimensional triangle and the
three-dimensional tetrahedron are examples of what is known in mathematics
as a regular simplex. R code for visualizing an example in three dimensions is
given in the Computing Appendix, pages 257–258, so you can get a feeling for
three-dimensional profile space. For higher-dimensional profiles some strong
imagination would be needed to be able to “see” the profile points in spaces of
dimension greater than three, but fortunately CA will be of great help to us
in visualizing such multidimensional profiles.
Data on a ratio scale
We have illustrated the concept of a profile using frequency data, which is
the prime example of data suitable for CA. But CA is applicable to a much
wider class of data types; in fact it can be used whenever it makes sense to
express the data in relative amounts, i.e. data on a so-called ratio scale. For
example, suppose we have data on monetary amounts invested by countries
in different areas of research — the relative amounts would be of interest, e.g.
the percentage invested in environmental research, biomedicine, etc. Another
example is of morphometric measurements on a living organism, for example
measurements in centimeters on a fish, its length and width, length of fins, etc.
Again all these measurements can be expressed relative to the total, where
the total is a surrogate measure for the size of the fish, so that we would be
analysing and comparing the shapes of different fish in the form of profiles
rather than the original values.
Data on a common scale
A necessary condition of the data for CA is that all observations are on the
same scale: for example, counts of particular individuals in a frequency table,
a common monetary unit in the table of research investments, centimeters
in the morphometric study. It would make no sense in CA to analyse data
with mixed scales of measurement, unless a pre-transformation is conducted
to homogenize the scales of the whole table. Most of the data sets in this book
are frequency data, but in Chapter 26 we shall look at a wide variety of other
types of data and ways of recoding them to be suitable for CA.
SUMMARY: Profiles and the Profile Space
1. The profile of a set of frequencies (or any other amounts that are positive
or zero) is the set of frequencies divided by their total, i.e. the set of relative
frequencies.
2. In the case of a cross-tabulation, the rows or columns define sets of fre-
quencies which can be expressed relative to their respective totals to give
row profiles or column profiles.
3. The marginal frequencies of the cross-tabulation can also be expressed
relative to their common total (i.e. the grand total of the table) to give the
average row profile and average column profile.
4. Comparing row profiles to their average leads to the same conclusions as
comparing column profiles to their average.
5. Profiles consisting of m elements can be plotted as points in an m-dimensional
space. Because their m elements add up to 1, these profile points
occupy a restricted region of this space, an (m−1)-dimensional subspace
known as a simplex. This simplex is enclosed within the edges joining all
pairs of the m unit vectors on the m perpendicular axes. These unit points
are also called the vertices of the simplex or profile space. The coordinate
system within this simplex is known as the barycentric coordinate system.
6. A special case that is easy to visualize is when the profiles have three
elements, so that the simplex is simply a triangle that joins the three
vertices. This special case of the barycentric coordinate system is known
as the triangular (or ternary) coordinate system.
7. The idea of a profile can be extended to data on a ratio scale where it is
of interest to study relative values. In this case the set of numbers being
profiled should all have the same scale of measurement.
Masses and Centroids 3
There is an equivalent way of thinking about the positions of the profile points
in the profile space, and this will be useful to our eventual understanding and
interpretation of correspondence analysis (CA). This is based on the notion
of a weighted average, or centroid, of a set of points. In the calculation of
an ordinary (unweighted) average, each point receives equal weight, whereas
a weighted average allows different weights to be associated with each point.
When the points are weighted differently, then the centroid does not lie exactly
at the “geographical” centre of the cloud of points, but tends to lie in a position
closer to the points with higher weight.
Contents
Data set 2: Readership and education groups
Points as weighted averages
Profile values are weights assigned to the vertices
Each profile point is a weighted average, or centroid, of the vertices
Average profile is also a weighted average of the profiles themselves
Row and column masses
Interpretation in the profile space
Merging rows or columns
Distributionally equivalent rows or columns
Changing the masses
SUMMARY: Masses and Centroids
Data set 2: Readership and education groups
We now use a typical set of data in social science research, a cross-tabulation
(or “cross-classification”) of two variables from a survey. The table, given in
Exhibit 3.1, concerns 312 readers of a certain newspaper, in particular their
level of thoroughness in reading the newspaper. Based on data collected in the
survey, each respondent was classified into one of three groups: glance readers,
fairly thorough readers and very thorough readers. These reading classes have
been cross-tabulated against education, an ordinal variable with five categories
ranging from some primary education to some tertiary education. Exhibit 3.1
shows the raw frequencies and the education group profiles in parentheses,
i.e. the row profiles. The triangular coordinate plot of the row profiles, in the
style described in Chapter 2, is given in Exhibit 3.2. In this display the corner
points, or vertices, of the triangle represent the three readership groups —
remember that each vertex is at the position of a “pure” row profile totally
concentrated into that category; for example, the very thorough vertex C3
is representing a fictitious row profile of [0 0 1] that contains 100% very
thorough readers.
Exhibit 3.1: Cross-tabulation of education group by readership class, showing
row profiles and average row profile in parentheses, and the row masses
(relative values of row totals).
EDUCATION GROUP              Glance    Fairly thorough   Very thorough   Total   Row masses
                             C1        C2                C3
Some primary (E1)            5         7                 2               14      0.045
                             (0.357)   (0.500)           (0.143)
Primary completed (E2)       18        46                20              84      0.269
                             (0.214)   (0.548)           (0.238)
Some secondary (E3)          19        29                39              87      0.279
                             (0.218)   (0.333)           (0.448)
Secondary completed (E4)     12        40                49              101     0.324
                             (0.119)   (0.396)           (0.485)
Some tertiary (E5)           3         7                 16              26      0.083
                             (0.115)   (0.269)           (0.615)
Total                        57        129               126             312
Average row profile          (0.183)   (0.413)           (0.404)
Exhibit 3.2: Row profiles (education groups) of Exhibit 3.1 depicted in
triangular coordinates, also showing the position of the average row profile
(last row of Exhibit 3.1).
[Figure: triangle with vertices C1, C2 and C3, showing the points E1 to E5 and
the average.]
Points as weighted averages
Another way to think of the positions of the education groups in the triangle
is as weighted averages. Assigning weights to the values of a variable is
a well-known concept in statistics. For example, in a class of 26 students,
suppose that the average grade turns out to be 7.5, calculated by summing
the 26 grades and dividing by 26. In fact, 3 students obtain the grade of 9, 7
students obtain an 8, and 16 students obtain a 7, so that the average grade
can be determined equivalently by assigning weights of 3/26 to the grade of
9, 7/26 to the grade of 8 and 16/26 to the grade of 7 and then calculating the
weighted average. Here the weights are the relative frequencies of each grade,
and because the grade of 7 has more weight than the others, the weighted average
of 7.5 is “closer” to this grade, whereas the ordinary arithmetic average
of the three values 7, 8 and 9 is clearly 8.
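The grade example can be reproduced directly; the snippet below is a small sketch using the numbers quoted above, with hypothetical variable names.

```python
# Number of students obtaining each grade (from the example in the text).
students_per_grade = {9: 3, 8: 7, 7: 16}

class_size = sum(students_per_grade.values())

# Weights are the relative frequencies of each grade.
weights = {grade: n / class_size for grade, n in students_per_grade.items()}

# Weighted average: each grade multiplied by its weight, then summed.
weighted_average = sum(grade * w for grade, w in weights.items())

# Ordinary (unweighted) average of the three distinct grade values.
ordinary_average = sum(students_per_grade) / len(students_per_grade)

print(class_size, weighted_average, ordinary_average)
```

The weighted result sits closer to 7 than the unweighted one because 16 of the 26 students obtained that grade.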
Profile values are weights assigned to the vertices
Looking at the last row of data in Exhibit 3.1, for education group E5 (some
tertiary education), we see the same frequencies of 3, 7 and 16 for the three
respective readership groups, and associated relative frequencies of 0.115, 0.269
and 0.615. The idea now is to imagine 3 cases situated at the glance vertex C1
of the triangle, 7 cases at the fairly thorough vertex C2 and 16 cases at the very
thorough vertex C3 , and then consider what would be the average position for
these 26 cases. In other words, we do not associate the weights with values
of a variable but with positions in the profile space, in this case the positions
of the vertex points. There are more cases at the very thorough corner, so we
would expect the average position of E5 to be closer to this vertex, as is indeed
the case. For the same reason, row profile E1 lies far from the very thorough
corner C3 because it has a very low weight (2 out of 14, or 0.143) on this
category. Hence each row profile point is positioned within the triangle as an
average point, where the profile values, i.e. relative frequencies, serve as the
weights allocated to the vertices. Thus, we can think of the profile values not
only as coordinates in a multidimensional space, but also as weights assigned
to the vertices of a simplex. This idea can be extended to higher-dimensional
profiles: for example, a profile with four elements is also at an average position
with respect to the four corners of a three-dimensional tetrahedron, weighted
by the respective profile elements.
Each profile point is a weighted average, or centroid, of the vertices
Alternative terms for weighted average are centroid or barycentre. Some particular
examples of weighted averages in the profile space are given in Exhibit
3.3. For example, the profile point [ 1/3 1/3 1/3 ], which gives equal weight to
the three corners, is positioned exactly at the centre of the triangle, equidistant
from the corners, in other words at the ordinary average position of the
three vertices. The profile [ 1/2 1/2 0 ] is at a position midway between the
first and second vertices, since it has equal weight on these two vertices and
zero weight on the third vertex. In general, we can write a formula for the
position of a profile as the centroid of the three vertices as follows, for a profile
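[ a b c ] where a + b + c = 1:
centroid position = (a × C1) + (b × C2) + (c × C3)
Similarly, the position of the average profile is also a weighted average of the
vertex points:
Average row profile = (0.183 × C1) + (0.413 × C2) + (0.404 × C3)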
Exhibit 3.3: Examples of some centroids (weighted averages) of the vertices in
triangular coordinate space: the three values are the weights assigned to
vertices (C1, C2, C3).
[Figure: triangle with vertices C1 [1,0,0], C2 [0,1,0] and C3 [0,0,1], showing
the points [0,1/5,4/5], [1/5,1/5,3/5], [1/3,1/3,1/3], [7/15,1/5,1/3] and
[1/2,1/2,0].]
The average is farther from the glance corner since there is less weight on
the glance vertex than on the other two, which have approximately the same
weights (see Exhibit 3.2).
Average profile is also a weighted average of the profiles themselves
The average profile is a rather special point: not only is it a centroid of
the three vertices as we have just shown, just like any profile point, but it is
also a centroid of the five row profiles themselves, where different weights are
assigned to the profiles. Looking again at Exhibit 3.1, we notice that the row
totals are different: education group E1 (some primary education) includes only
14 respondents whereas education group E4 (secondary education completed)
has 101 respondents. In the last column of Exhibit 3.1, headed “row masses”,
we have these marginal row frequencies expressed relative to the total sample
size 312. Just as we thought of row profiles as weighted averages of the vertices,
we can think of each of the five row profile points in Exhibit 3.2 being assigned
weights according to their marginal frequencies, as if there were 14 respondents
(proportion 0.045 of the sample) at the position E1, 84 respondents (0.269 of
the sample) at the position E2, and so on. With these weights assigned to
the five profile points, the weighted average position is exactly at the average
profile point:
Average row profile = (0.045 × E1) + (0.269 × E2) + (0.279 × E3)
+ (0.324 × E4) + (0.083 × E5)
This average row profile is at a central position amongst the row profiles but
more attracted to the profiles observed with higher frequency.
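The identity above can be verified numerically from the frequencies in Exhibit 3.1. The following is a minimal sketch with hypothetical variable names: it computes the row masses and row profiles, forms their mass-weighted average, and compares it with the profile of the column totals.

```python
# Frequencies from Exhibit 3.1 (columns: C1 glance, C2 fairly thorough,
# C3 very thorough).
table = {
    "E1": [5, 7, 2],
    "E2": [18, 46, 20],
    "E3": [19, 29, 39],
    "E4": [12, 40, 49],
    "E5": [3, 7, 16],
}

grand_total = sum(sum(row) for row in table.values())

# Row masses: row totals relative to the grand total.
masses = {group: sum(row) / grand_total for group, row in table.items()}

# Row profiles: each row divided by its own total.
profiles = {group: [f / sum(row) for f in row] for group, row in table.items()}

# Weighted average (centroid) of the row profiles, weighted by the masses.
centroid = [
    sum(masses[group] * profiles[group][j] for group in table) for j in range(3)
]

# Average row profile computed directly from the column totals.
average_profile = [
    sum(row[j] for row in table.values()) / grand_total for j in range(3)
]

print([round(m, 3) for m in masses.values()])
print([round(c, 3) for c in centroid])
print([round(a, 3) for a in average_profile])
```

The agreement is exact up to floating-point error, since each mass times its profile gives the row's frequencies divided by the grand total.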
Row and column masses
The weights assigned to the profiles are so important in CA that they are given
a specific name: masses. The last column of Exhibit 3.1 shows the row masses:
0.045, 0.269, 0.279, 0.324 and 0.083. The word “mass” is the preferred term in
CA although it is entirely equivalent for our purpose to the term “weight”. An
alternative term such as mass is convenient here to differentiate this geometric
concept of weighting from other forms of weighting that occur in practice, such
as weights assigned to population subgroups in a sample survey.
All that has been said about row profiles and row masses can be repeated in a
similar fashion for the columns. Exhibit 3.4 shows the same contingency table
as Exhibit 3.1 from the column point of view. That is, the three columns have
Exhibit 3.4: Cross-tabulation of education group by readership cluster, showing
column profiles and average column profile in parentheses, and the column
masses.
EDUCATION GROUP              Glance    Fairly thorough   Very thorough   Total   Average column profile
                             C1        C2                C3
Some primary (E1)            5         7                 2               14      (0.045)
                             (0.088)   (0.054)           (0.016)
Primary completed (E2)       18        46                20              84      (0.269)
                             (0.316)   (0.357)           (0.159)
Some secondary (E3)          19        29                39              87      (0.279)
                             (0.333)   (0.225)           (0.310)
Secondary completed (E4)     12        40                49              101     (0.324)
                             (0.211)   (0.310)           (0.389)
Some tertiary (E5)           3         7                 16              26      (0.083)
                             (0.053)   (0.054)           (0.127)
Total                        57        129               126             312
Column masses                0.183     0.413             0.404
Interpretation in the profile space
At this point, even though the final key concepts in CA still remain to be
explained, it is possible to make a brief interpretation of Exhibit 3.2. The vertices
of the triangle represent the “pure profiles” of readership categories C1,
C2 and C3, whereas the education groups are “mixtures” of these readership
categories and find their positions within the triangle in terms of their respective
proportions of each of the three categories. Notice the following aspects
of the display:
• The degree of spread of the profile points within the triangle gives an idea
of how much variation there is in the contingency table. The closer the profile
points lie to the centroid, the less variation there is, and the more they
deviate from the centroid, the more variation. The profile space is bounded
and the most extreme profiles will lie near the sides of the triangle, or in
the most extreme case at one of the vertices (for example, an illiterate
group with profile [ 1 0 0 ] would lie on the vertex C1 ). In tables of social
science data such as this one, profiles usually occupy a small region of the
profile space close to the average because the variation in profile values
for a particular category will be relatively small. For example, the range
in the first element (i.e. readership category C1 ) across the profiles is only
from 0.115 to 0.357 (Exhibit 3.1), in a potential range from 0 to 1. In
contrast, for data in ecological research, as we shall see later, the range of
profile values is much higher, usually because of many zero frequencies in
the table — the profiles are then more spread out inside the profile space
(see the second example in Chapter 10).
• The profile points are stretched out in what is called a “direction of spread”
more or less from the bottom to the top of the display. Looking from the
bottom upwards, the five education group profiles lie in their natural order
of increasing educational qualifications, from E1 to E5. At the top,
group E5 lies closest to the vertex C3, which represents the highest category
of very thorough reading; we have already seen that this group
has the highest proportion (0.615) of these readers. At the bottom, the
lowest educational group is not far from the edge of the triangle which we
know displays profiles with zero C3 readers (for example, see the point
[ 1/2 1/2 0 ] in Exhibit 3.3 as an illustration of a point on the edge). The
interpretation of this pattern would be that as we move up from the bottom
of this display to the top, from lower to higher education, the profiles
are generally changing with respect to their relative frequency of type C3
as opposed to that of C1 and C2 combined, while there is no particular
tendency towards either C1 or C2. In addition, the relative frequency of
C1 is decreasing as the education points move away from C1 towards the
edge joining C2 and C3.
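The reading of the display above rests on each profile being a weighted average of the three vertices. The sketch below illustrates this in plain Python. The vertex coordinates are an assumption made for illustration: an equilateral triangle with C3 at the top, as the text describes, and with C1 and C2 placed left and right arbitrarily. The function name `profile_to_point` is hypothetical.

```python
import math

# Assumed vertex positions (C1, C2, C3) of an equilateral triangle of side 1.
# Only "C3 at the top" follows the text; the rest is an illustrative choice.
VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]


def profile_to_point(profile):
    """Return the 2-D position of a profile as the weighted average of the vertices."""
    x = sum(weight * vx for weight, (vx, _) in zip(profile, VERTICES))
    y = sum(weight * vy for weight, (_, vy) in zip(profile, VERTICES))
    return (x, y)


# Row frequencies of the education groups across C1, C2, C3.
frequencies = {
    "E1": [5, 7, 2],
    "E2": [18, 46, 20],
    "E3": [19, 29, 39],
    "E4": [12, 40, 49],
    "E5": [3, 7, 16],
}

points = {}
for group, row in frequencies.items():
    total = sum(row)
    points[group] = profile_to_point([f / total for f in row])

# Order the groups by their height in the display.
bottom_to_top = sorted(points, key=lambda group: points[group][1])
print(bottom_to_top)
```

With this layout the height of a point depends only on its C3 proportion, which is why the groups line up from bottom to top in order of education.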
Merging rows or columns
Suppose we wanted to combine the two categories of primary education, E1
and E2, into a new row of Exhibit 3.1, denoted by E1&2. There are two ways
of thinking about this. First, add the two rows together to obtain the row
of frequencies [ 23 53 22 ], with total 98 and profile [ .235 .541 .224 ]. The
alternative way is to think of the profile of E1&2 as the weighted average of
the profiles of E1 and E2:
\[
[\,.235 \;\; .541 \;\; .224\,] = \frac{.045}{.314} \times [\,.357 \;\; .500 \;\; .143\,] + \frac{.269}{.314} \times [\,.214 \;\; .548 \;\; .238\,]
\]
where the masses of E1 and E2 are .045 and .269, with sum .314 (notice that
the weights in this weighted average are identical to 14/98 and 84/98, where
14 and 84 are the totals of rows E1 and E2). Geometrically, E1&2’s profile
lies on a line between E1 and E2, but closer to E2 as shown in Exhibit 3.5.
The distances from E1 to E1&2 and E2 to E1&2 are in the same proportion
as the totals 84 and 14 respectively; i.e. 6 to 1. E1&2 can be thought of as
the balancing point of the two masses situated at E1 and E2, with the heavier
mass at E2.
Exhibit 3.5: Enlargement of positions of E1 and E2 in Exhibit 3.2, showing
the position of the point E1&2 which merges the two categories; E2 has 6
times the mass of E1, hence E1&2 lies closer to E2 at a point which splits
the line between the points in the ratio 84:14 = 6:1.
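The two ways of obtaining the merged profile of E1&2, and the 6:1 split of the line between E1 and E2, can be checked numerically. This is a small sketch in plain Python with illustrative variable names; `math.dist` requires Python 3.8 or later.

```python
import math

e1 = [5, 7, 2]       # Some primary
e2 = [18, 46, 20]    # Primary completed
grand_total = 312


def profile(frequencies):
    """Divide each frequency by the row total."""
    total = sum(frequencies)
    return [f / total for f in frequencies]


# First way: add the rows, then take the profile of the sum.
merged_frequencies = [a + b for a, b in zip(e1, e2)]
profile_from_sum = profile(merged_frequencies)

# Second way: weighted average of the two profiles, weights from the masses.
mass_e1 = sum(e1) / grand_total
mass_e2 = sum(e2) / grand_total
weight_e1 = mass_e1 / (mass_e1 + mass_e2)
weight_e2 = mass_e2 / (mass_e1 + mass_e2)
profile_from_average = [
    weight_e1 * p1 + weight_e2 * p2
    for p1, p2 in zip(profile(e1), profile(e2))
]

# Ratio of the distances E1 -> E1&2 and E2 -> E1&2 in the profile space.
distance_ratio = (
    math.dist(profile(e1), profile_from_sum)
    / math.dist(profile(e2), profile_from_sum)
)

print([round(p, 3) for p in profile_from_sum])
print(round(distance_ratio, 6))
```

Both routes should give the same profile, and the distance ratio should equal the ratio of the row totals, 84/14.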
Distributionally equivalent rows or columns
Suppose that we had an additional row of data in Exhibit 3.1, a category of
“no formal education” denoted by E0, with frequencies [ 10 14 4 ] across the
reading categories. The profile of E0 is identical to E1’s profile, because the
frequencies in E0 are simply twice those of E1. The two sets of frequencies are
said to be distributionally equivalent. Thus the profiles of E0 and E1 are at
exactly the same point in the profile space, and can be merged into one point
with mass equal to the combined masses of the two profiles, i.e. a single point
with frequencies [ 15 21 6 ].
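Distributional equivalence can be tested exactly by comparing profiles as fractions rather than rounded decimals. The sketch below uses the hypothetical row E0 from the text. The grand total of 340 is an assumption that follows from adding E0's 28 cases to the original 312.

```python
from fractions import Fraction


def profile(frequencies):
    """Exact profile: each frequency as a fraction of the row total."""
    total = sum(frequencies)
    return [Fraction(f, total) for f in frequencies]


e0 = [10, 14, 4]   # hypothetical "no formal education" row from the text
e1 = [5, 7, 2]     # Some primary

# Two rows are distributionally equivalent when their profiles coincide.
same_profile = profile(e0) == profile(e1)

# Merging adds the frequencies; the profile is unchanged and the masses add.
merged = [a + b for a, b in zip(e0, e1)]
grand_total = 312 + sum(e0)  # assumed total once E0 is added to the table
mass_e0 = Fraction(sum(e0), grand_total)
mass_e1 = Fraction(sum(e1), grand_total)
merged_mass = Fraction(sum(merged), grand_total)

print(same_profile, merged)
```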
Changing the masses
The row and column masses are proportional to the marginal sums of the
table. If the masses need to be modified for a substantive reason, this can be
achieved by a simple transformation of the table. For example, suppose that we
require the five education groups of Exhibit 3.1 to have masses proportional to
their population sizes rather than their sample sizes. Then the table is rescaled
by multiplying each education group profile by its respective population size.
The row profiles of this new table are identical to the original row profiles, but
the row masses are now proportional to the population sizes. Alternatively,
suppose that the education groups are required to be weighted equally, rather
than differentially as described up to now. If we regard the table of row profiles
(or, equivalently, of row percentages) as the original table, then this table has
row sums equal to 1 (or 100%), so that each education group is weighted
equally. Hence analysing the table of profiles implies weighting each profile
equally.
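Both rescalings described above can be sketched in a few lines of plain Python. The population sizes used here are invented purely for illustration and do not come from the book; the helper name `profiles_and_masses` is also hypothetical.

```python
table = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]

# Hypothetical population sizes for E1..E5 (illustrative values only).
population = [300, 900, 700, 500, 100]


def profiles_and_masses(rows):
    """Return the row profiles and row masses of a table."""
    row_totals = [sum(row) for row in rows]
    grand_total = sum(row_totals)
    profiles = [[x / total for x in row] for row, total in zip(rows, row_totals)]
    masses = [total / grand_total for total in row_totals]
    return profiles, masses


original_profiles, original_masses = profiles_and_masses(table)

# Rescale: multiply each row profile by its population size.
rescaled = [
    [p * size for p in row_profile]
    for row_profile, size in zip(original_profiles, population)
]
rescaled_profiles, rescaled_masses = profiles_and_masses(rescaled)

# Equal weighting: treat the table of profiles itself as the data table.
equal_profiles, equal_masses = profiles_and_masses(original_profiles)

print([round(m, 3) for m in rescaled_masses])
print([round(m, 3) for m in equal_masses])
```

In both cases the row profiles are unchanged; only the masses differ.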
SUMMARY: Masses and Centroids
1. We assume that we are analysing a table of data and are concerned with
the row problem, i.e. where the row profiles are plotted in the simplex space
defined by the column vertices. Then each vertex point represents a column
category in the sense that a row profile that is entirely concentrated into
that category would lie exactly at that vertex point.
2. Each profile can be interpreted as the centroid (or weighted average) of the
vertex points, where the weights are the individual elements of the profile.
Thus a profile will tend to lie closer to those vertices for which it has higher
values.
3. Each row profile in turn has a unique weight associated with it, called
a mass, which is proportional to the row sum in the original table. The
average row profile is then the centroid of the row profiles, where each
profile is weighted by its mass in the averaging process.
4. Everything described above for row profiles applies equally to the columns
of the table. In fact, the best way to make the jump from rows to columns
is to re-express the table in its transposed form, where columns become
rows, and vice versa — then everything applies exactly as before.
5. Rows (or columns) that are combined by aggregating their frequencies have
a profile equal to the weighted average of the profiles of the component rows
(or columns).
6. Rows (or columns) that have the same profile are said to be distributionally
equivalent and can be combined into a single point with a mass equal to
the sum of the masses of the combined rows (or columns).
7. Row (or column) masses can be modified to be proportional to prescribed
values by a simple rescaling of the rows (or columns).
Chi-Square Distance and Inertia 4
In correspondence analysis (CA) the way distance is measured between profiles
is a bit more complicated than the one that was used implicitly when we
drew and interpreted the profile plots in Chapters 2 and 3. Distance in CA is
measured using the so-called chi-square distance and this distance is the key
to the many favourable properties of CA. There are several ways to justify
the chi-square distance: some are more technical and beyond the scope of
this book, while other explanations are more intuitive (see Appendix B, pages
270–271 for one theoretical justification). In this chapter we choose the latter
approach, starting with a geometric explanation of the well-known chi-square
statistic computed on a contingency table. All the ideas embodied in the chi-
square statistic carry over to the chi-square distance in CA and to the related
concept of inertia, which is the way CA measures variation in a data table.
Contents
Hypothesis of independence or homogeneity for a contingency table
Chi-square (χ²) statistic to test the homogeneity hypothesis
Calculating the χ² statistic
Alternative expression of the χ² statistic in terms of profiles and masses
(Total) inertia is the χ² statistic divided by sample size
Euclidean, or Pythagorean, distance
Chi-square distance: An example of a weighted Euclidean distance
Geometric interpretation of inertia
Minimum and maximum inertia
Inertia of rows is equal to inertia of columns
Some notation
SUMMARY: Chi-Square Distance and Inertia
Hypothesis of independence or homogeneity for a contingency table
Consider the data in Exhibit 3.1 again. Notice that, of the sample of 312
people, 57 (or 18.3%) are in readership category C1 (“glance”), 129 (41.3%)
in C2 (“fairly thorough”) and 126 (40.4%) in C3 (“very thorough”); i.e. the
average row profile is the set of proportions [ 0.183 0.413 0.404 ]. If there were
no difference between the education groups as far as readership is concerned,
we would expect that the profile of each row is more or less the same as the
average profile, and would differ from it only because of random sampling
fluctuations. Assuming no difference, or in other words assuming that the
education groups are homogeneous with respect to their reading habits, what
would we have expected the frequencies in row E5, for example, to be? There
are 26 people in the E5 education group, and we would thus have expected
18.3% of them to be in category C1; i.e. 26 × 0.183 = 4.76 (although it is
Exhibit 4.1: Observed frequencies, as given in Exhibit 3.1, along with
expected frequencies (in parentheses) calculated assuming the homogeneity
assumption to be true.

EDUCATION                  Glance        Fairly thorough   Very thorough             Row
GROUP                      C1            C2                C3              Total     masses
Some primary (E1)          5 (2.56)      7 (5.78)          2 (5.66)        14        0.045
Primary completed (E2)     18 (15.37)    46 (34.69)        20 (33.94)      84        0.269
Some secondary (E3)        19 (15.92)    29 (35.93)        39 (35.15)      87        0.279
Secondary completed (E4)   12 (18.48)    40 (41.71)        49 (40.80)      101       0.324
Some tertiary (E5)         3 (4.76)      7 (10.74)         16 (10.50)      26        0.083
Total                      57            129               126             312
Average row profile        0.183         0.413             0.404
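The expected frequencies in Exhibit 4.1 are each row total multiplied by the average row profile. The sketch below computes them two ways. Using the average profile rounded to three decimals appears to reproduce the figures printed in the exhibit, whereas the unrounded profile gives values that can differ in the last digit (for example 4.75 rather than 4.76 for E5 and C1). That the exhibit was computed from rounded proportions is an inference from the arithmetic, not something the text states.

```python
table = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]

row_totals = [sum(row) for row in table]
column_totals = [sum(column) for column in zip(*table)]
grand_total = sum(row_totals)

# Average row profile: column totals as shares of the grand total.
average_row_profile = [total / grand_total for total in column_totals]
rounded_profile = [round(p, 3) for p in average_row_profile]

# Expected frequencies under homogeneity: row total times average profile.
expected_exact = [
    [row_total * p for p in average_row_profile] for row_total in row_totals
]
expected_rounded = [
    [round(row_total * p, 2) for p in rounded_profile] for row_total in row_totals
]

print(expected_rounded[4])                        # row E5, rounded profile
print([round(e, 2) for e in expected_exact[4]])   # row E5, exact profile
```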
Chi-square (χ²) statistic to test the homogeneity hypothesis
It is clear that the observed frequencies are always going to be different from
the expected frequencies. The question statisticians now ask is whether these
differences are large enough to contradict the assumed hypothesis that the
rows are homogeneous, in other words whether the discrepancies between
observed and expected frequencies are so large that it is unlikely they could have
arisen by chance alone. This question is answered by computing a measure
of discrepancy between all the observed and expected frequencies, as follows.
Each difference between an observed and expected frequency is computed,
then this difference is squared and finally divided by the expected frequency.
This calculation is repeated for all pairs of observed and expected frequencies
and the results are accumulated into a single figure, the chi-square statistic,
denoted by χ²:
\[
\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
\]
Calculating the χ² statistic
Because there are 15 cells in this 5-by-3 (or 5 × 3) table, there will be 15 terms
in this computation. For purposes of illustration we show only the first three
and last three terms corresponding to rows E1 and E5:
\[
\chi^2 = \frac{(5 - 2.56)^2}{2.56} + \frac{(7 - 5.78)^2}{5.78} + \frac{(2 - 5.66)^2}{5.66} + \cdots + \frac{(3 - 4.76)^2}{4.76} + \frac{(7 - 10.74)^2}{10.74} + \frac{(16 - 10.50)^2}{10.50} \tag{4.1}
\]
The grand total of the 15 terms in this calculation turns out to be equal to
26.0. The larger this value, the more discrepant the observed and expected
frequencies are, i.e. the less convinced we are that the assumption of homo-
geneity is correct. In order to judge whether this value of 26.0 is large or small,
we use probabilities of the chi-square distribution corresponding to the “de-
grees of freedom” associated with the statistic. For a 5 × 3 table, the degrees
of freedom are 4 × 2 = 8 (one less than the number of rows multiplied by one
less than the number of columns), and the p-value associated with the value
26.0 of the χ² statistic with 8 degrees of freedom is p = 0.001. This result
tells us that there is an extremely small probability — one in a thousand —
that the observed frequencies in Exhibit 4.1 can be reconciled with the homo-
geneity assumption. In other words, we reject the homogeneity of the table
and conclude that it is highly likely that real differences exist between the
education groups in terms of their readership profiles.
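The statistic and its p-value can be reproduced with the standard library alone. This is a sketch under two stated assumptions: expected frequencies are computed without intermediate rounding, and the upper-tail probability uses the closed-form expression that holds only when the degrees of freedom are even (as they are here, with 8). The function names are illustrative and not taken from the book.

```python
import math

table = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]


def chi_square(rows):
    """Return the chi-square statistic and degrees of freedom of a two-way table."""
    row_totals = [sum(row) for row in rows]
    column_totals = [sum(column) for column in zip(*rows)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(rows) - 1) * (len(rows[0]) - 1)
    return statistic, degrees_of_freedom


def upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable; this closed form needs even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form only valid for positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


chi_square_statistic, df = chi_square(table)
p_value = upper_tail_even_df(chi_square_statistic, df)

print(round(chi_square_statistic, 1), df, round(p_value, 3))
```

For tables whose degrees of freedom are odd, a statistical library would be needed for the tail probability instead of this closed form.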
the average profile values. Each of the 15 terms in this calculation is thus of
the form
\[
\text{row total} \times \frac{(\text{observed row profile} - \text{expected row profile})^2}{\text{expected row profile}}
\]
(Total) inertia is the χ² statistic divided by sample size
We now make one more modification of the χ² calculation above to bring
it into line with the CA concepts introduced so far: we divide both sides of
Equation (4.2) by the total sample size so that each term involves an initial
multiplying factor equal to the row mass rather than the row total:
\[
\frac{\chi^2}{312} = 12 \text{ similar terms} \cdots + 0.083 \times \frac{(0.115 - 0.183)^2}{0.183} + 0.083 \times \frac{(0.269 - 0.413)^2}{0.413} + 0.083 \times \frac{(0.615 - 0.404)^2}{0.404} \tag{4.3}
\]
where 0.083 = 26/312 is the mass of row E5 (see Exhibit 4.1). The quantity
χ²/n on the left-hand side, where n is the grand total of the table, is called
the total inertia in CA, or simply the inertia. It is a measure of how much
variance there is in the table and does not depend on the sample size. In
statistics this quantity has alternative names such as the mean-square contingency
coefficient, and its square root is known as the phi coefficient (φ); hence we
can denote the inertia by φ². If we gather together terms in (4.3) in groups of
three corresponding to a particular row, we obtain the following form for the
inertia:
\[
\frac{\chi^2}{312} = \phi^2 = 4 \text{ similar groups of terms} \cdots + 0.083 \times \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right] \tag{4.4}
\]
Each of the five groups of terms in this formula, one for each row of the table,
is the row mass (e.g. 0.083 for row E5) multiplied by a quantity in square
brackets which looks like a distance measure (or, to be precise, the square of
a distance).
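The grouping in (4.4) can be checked numerically: summing, over the rows, the row mass times the bracketed quantity should give the same value as χ²/n. The following sketch uses unrounded profiles and illustrative variable names.

```python
table = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]

row_totals = [sum(row) for row in table]
column_totals = [sum(column) for column in zip(*table)]
grand_total = sum(row_totals)

row_masses = [total / grand_total for total in row_totals]
row_profiles = [[x / total for x in row] for row, total in zip(table, row_totals)]
average_profile = [total / grand_total for total in column_totals]

# Bracketed quantity of (4.4) for each row: squared differences between the
# row profile and the average profile, each divided by the average element.
bracketed = [
    sum((p - a) ** 2 / a for p, a in zip(profile, average_profile))
    for profile in row_profiles
]

# Inertia: each bracketed quantity weighted by the row mass, then summed.
row_inertias = [mass * value for mass, value in zip(row_masses, bracketed)]
total_inertia = sum(row_inertias)

# Chi-square statistic computed directly from observed and expected counts.
chi_square_statistic = sum(
    (table[i][j] - row_totals[i] * column_totals[j] / grand_total) ** 2
    / (row_totals[i] * column_totals[j] / grand_total)
    for i in range(len(table))
    for j in range(len(table[0]))
)

print(round(total_inertia, 4), round(chi_square_statistic / grand_total, 4))
```

The entries of `row_inertias` show how much each education group contributes to the total.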
Euclidean, or Pythagorean, distance
In (4.4) above, if it were not for the fact that each squared difference between
observed and expected row profile elements is divided by the expected
element, then the quantity in square brackets would be exactly the square of the
“straight-line” regular distance between the row profile E5 and the average
profile in three-dimensional physical space. This distance is also called the
Euclidean distance or the Pythagorean distance. Let us state this in another
way so that it is fully understood. Suppose we plot the two profile points
[ 0.115 0.269 0.615 ] and [ 0.183 0.413 0.404 ] with respect to three
perpendicular axes. Then the distance between them would be the square root of the
sum of squared differences between the coordinates, as follows:
\[
\text{Euclidean distance} = \sqrt{(0.115 - 0.183)^2 + (0.269 - 0.413)^2 + (0.615 - 0.404)^2} \tag{4.5}
\]
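As a quick numerical check of (4.5), the sketch below evaluates the Euclidean distance between the two profiles using the rounded values quoted in the text. For comparison it also evaluates the bracketed quantity of (4.4), in which each squared difference is divided by the corresponding element of the average profile. Variable names are illustrative; `math.dist` requires Python 3.8 or later.

```python
import math

profile_e5 = [0.115, 0.269, 0.615]        # row profile of E5 (rounded)
average_profile = [0.183, 0.413, 0.404]   # average row profile (rounded)

# Equation (4.5): square root of the sum of squared coordinate differences.
euclidean_distance = math.dist(profile_e5, average_profile)

# Bracketed quantity of (4.4): the same squared differences, each divided
# by the corresponding element of the average profile.
weighted_squared_distance = sum(
    (p - a) ** 2 / a for p, a in zip(profile_e5, average_profile)
)

print(round(euclidean_distance, 4))
print(round(weighted_squared_distance, 4))
```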