Correspondence
Analysis in Practice
Third Edition
CHAPMAN & HALL/CRC
Interdisciplinary Statistics Series
Series editors: N. Keiding, B.J.T. Morgan, C.K. Wikle, P. van der Heijden
Published titles
AGE-PERIOD-COHORT ANALYSIS: NEW MODELS, METHODS, AND
EMPIRICAL APPLICATIONS Y. Yang and K. C. Land
ANALYSIS OF CAPTURE-RECAPTURE DATA R. S. McCrea and B. J.T. Morgan
AN INVARIANT APPROACH TO STATISTICAL ANALYSIS OF SHAPES
S. Lele and J. Richtsmeier
ASTROSTATISTICS G. Babu and E. Feigelson
BAYESIAN ANALYSIS FOR POPULATION ECOLOGY R. King, B. J.T. Morgan,
O. Gimenez, and S. P. Brooks
BAYESIAN DISEASE MAPPING: HIERARCHICAL MODELING IN SPATIAL
EPIDEMIOLOGY, SECOND EDITION A. B. Lawson
BIOEQUIVALENCE AND STATISTICS IN CLINICAL PHARMACOLOGY
S. Patterson and B. Jones
CLINICAL TRIALS IN ONCOLOGY, THIRD EDITION S. Green, J. Benedetti, A. Smith,
and J. Crowley
CLUSTER RANDOMISED TRIALS R.J. Hayes and L.H. Moulton
CORRESPONDENCE ANALYSIS IN PRACTICE, THIRD EDITION M. Greenacre
DESIGN AND ANALYSIS OF QUALITY OF LIFE STUDIES IN CLINICAL TRIALS,
SECOND EDITION D.L. Fairclough
DYNAMICAL SEARCH L. Pronzato, H. Wynn, and A. Zhigljavsky
FLEXIBLE IMPUTATION OF MISSING DATA S. van Buuren
GENERALIZED LATENT VARIABLE MODELING: MULTILEVEL, LONGITUDINAL,
AND STRUCTURAL EQUATION MODELS A. Skrondal and S. Rabe-Hesketh
GRAPHICAL ANALYSIS OF MULTI-RESPONSE DATA K. Basford and J. Tukey
INTRODUCTION TO COMPUTATIONAL BIOLOGY: MAPS, SEQUENCES, AND
GENOMES M. Waterman
MARKOV CHAIN MONTE CARLO IN PRACTICE W. Gilks, S. Richardson, and
D. Spiegelhalter
MEASUREMENT ERROR AND MISCLASSIFICATION IN STATISTICS AND
EPIDEMIOLOGY: IMPACTS AND BAYESIAN ADJUSTMENTS P. Gustafson
MEASUREMENT ERROR: MODELS, METHODS, AND APPLICATIONS
J. P. Buonaccorsi
MENDELIAN RANDOMIZATION: METHODS FOR USING GENETIC VARIANTS
IN CAUSAL ESTIMATION S. Burgess and S.G. Thompson
META-ANALYSIS OF BINARY DATA USING PROFILE LIKELIHOOD D. Böhning,
R. Kuhnert, and S. Rattanasiri
MISSING DATA ANALYSIS IN PRACTICE T. Raghunathan
POWER ANALYSIS OF TRIALS WITH MULTILEVEL DATA M. Moerbeek and
S. Teerenstra
SPATIAL POINT PATTERNS: METHODOLOGY AND APPLICATIONS WITH R
A. Baddeley, E. Rubak, and R. Turner
STATISTICAL ANALYSIS OF GENE EXPRESSION MICROARRAY DATA T. Speed
STATISTICAL ANALYSIS OF QUESTIONNAIRES: A UNIFIED APPROACH
BASED ON R AND STATA F. Bartolucci, S. Bacci, and M. Gnaldi
STATISTICAL AND COMPUTATIONAL PHARMACOGENOMICS R. Wu and M. Lin
STATISTICS IN MUSICOLOGY J. Beran
STATISTICS OF MEDICAL IMAGING T. Lei
STATISTICAL CONCEPTS AND APPLICATIONS IN CLINICAL MEDICINE
J. Aitchison, J.W. Kay, and I.J. Lauder
STATISTICAL AND PROBABILISTIC METHODS IN ACTUARIAL SCIENCE
P.J. Boland
STATISTICAL DETECTION AND SURVEILLANCE OF GEOGRAPHIC CLUSTERS
P. Rogerson and I. Yamada
STATISTICS FOR ENVIRONMENTAL BIOLOGY AND TOXICOLOGY A. Bailer
and W. Piegorsch
STATISTICS FOR FISSION TRACK ANALYSIS R.F. Galbraith
VISUALIZING DATA PATTERNS WITH MICROMAPS D.B. Carr and L.W. Pickle
Chapman & Hall/CRC
Interdisciplinary Statistics Series

Correspondence
Analysis in Practice
Third Edition

Michael Greenacre
Universitat Pompeu Fabra
Barcelona, Spain
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20161031

International Standard Book Number-13: 978-1-4987-3177-5 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and
information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission
to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic,
mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or
retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact
the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides
licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment
has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without
intent to infringe.

Library of Congress Cataloging‑in‑Publication Data

Names: Greenacre, Michael J.


Title: Correspondence analysis in practice / Michael Greenacre.
Description: Third edition. | Boca Raton, Florida : CRC Press, [2017] |
Series: Interdisciplinary statistics | Includes bibliographical references
and index.
Identifiers: LCCN 2016036265| ISBN 9781498731775 (hardback : alk. paper) |
ISBN 9781498731782 (e‑book)
Subjects: LCSH: Correspondence analysis (Statistics)
Classification: LCC QA278.5 .G74 2017 | DDC 519.5/37‑‑dc23
LC record available at https://lccn.loc.gov/2016036265

Visit the Taylor & Francis Web site at


http://www.taylorandfrancis.com

and the CRC Press Web site at


http://www.crcpress.com
To Françoise, Karolien and Gloudina
Contents
Preface
1 Scatterplots and Maps
2 Profiles and the Profile Space
3 Masses and Centroids
4 Chi-Square Distance and Inertia
5 Plotting Chi-Square Distances
6 Reduction of Dimensionality
7 Optimal Scaling
8 Symmetry of Row and Column Analyses
9 Two-Dimensional Displays
10 Three More Examples
11 Contributions to Inertia
12 Supplementary Points
13 Correspondence Analysis Biplots
14 Transition and Regression Relationships
15 Clustering Rows and Columns
16 Multiway Tables
17 Stacked Tables
18 Multiple Correspondence Analysis
19 Joint Correspondence Analysis
20 Scaling Properties of MCA
21 Subset Correspondence Analysis
22 Compositional Data Analysis
23 Analysis of Matched Matrices
24 Analysis of Square Tables
25 Correspondence Analysis of Networks
26 Data Recoding
27 Canonical Correspondence Analysis
28 Co-Inertia and Co-Correspondence Analysis
29 Aspects of Stability and Inference
30 Permutation Tests
Appendix A: Theory of Correspondence Analysis
Appendix B: Computation of Correspondence Analysis
Appendix C: Glossary of Terms
Appendix D: Bibliography of Correspondence Analysis
Appendix E: Epilogue
Index
Preface

This book is a revised and extended third edition of the second edition of
Correspondence Analysis in Practice (Chapman & Hall/CRC, 2007), first pub-
lished in 1993. In the original first edition I wrote the following in the Preface,
which is still relevant today:
Extract from preface of first edition of Correspondence Analysis in Practice (1993)
“Correspondence analysis is a statistical technique that is useful to all students,
researchers and professionals who collect categorical data, for example data
collected in social surveys. The method is particularly helpful in analysing
cross-tabular data in the form of numerical frequencies, and results in an elegant but
simple graphical display which permits more rapid interpretation and
understanding of the data. Although the theoretical origins of the technique can be
traced back over 50 years, the real impetus to the modern application of
correspondence analysis was given by the French linguist and data analyst Jean-Paul
Benzécri and his colleagues and students, working initially at the University of
Rennes in the early 1960s and subsequently at the Jussieu campus of the Univer-
sity of Paris. Parallel developments of correspondence analysis have taken place
in the Netherlands and Japan, centred around such pioneering researchers as
Jan de Leeuw and Chikio Hayashi. My own involvement with correspondence
analysis commenced in 1973 when I started my doctoral studies in Benzécri’s
Data Analysis Laboratory in Paris. The publication of my first book Theory and
Applications of Correspondence Analysis in 1984 coincided with the beginning
of a wider dissemination of correspondence analysis outside of France. At that
time I expressed the hope that my book would serve as a springboard for a much
wider and more routine application of correspondence analysis in the future. The
subsequent evolution and growing popularity of the method could not have been
more gratifying, as hundreds of researchers were introduced to the method and
became familiar with its ability to communicate complex tables of numerical
data to non-specialists through the medium of graphics. Researchers with whom
I have collaborated come from such varying backgrounds as sociology, ecology,
palaeontology, archaeology, geology, education, medicine, biochemistry, microbi-
ology, linguistics, marketing research, advertising, religious studies, philosophy,
art and music. In 1989 I was invited by Jay Magidson of Statistical Innovations
Inc. to collaborate with Leo Goodman and Clifford Clogg in the presentation of a
two-day short course in New York, entitled “Correspondence Analysis and Asso-
ciation Models: Geometric Representation and Beyond”. The participants were
mostly marketing professionals from major American companies. For this course
I prepared a set of notes which reinforced the practical, user-oriented approach to
correspondence analysis. ... The positive reaction of the audience was infectious
and inspired me subsequently to present short courses on correspondence analysis
in South Africa, England and Germany. It is from the notes prepared for these
courses that this book has grown.”

The CARME conferences
In 1991 Prof. Walter Kristof of Hamburg University proposed that we organize
a conference on correspondence analysis, with the assistance of Dr. Jörg
Blasius of the Zentralarchiv für Empirische Sozialforschung (Central Archive
for Empirical Social Research) at the University of Cologne. This conference
was the first international one of its kind and drew a large audience
to Cologne from Germany and neighbouring European countries. This ini-
tial meeting developed into a series of quadrennial conferences, repeated in
1995 and 1999 in Cologne, at the Pompeu Fabra University in Barcelona
(2003), the Erasmus University in Rotterdam (2007), Agrocampus Rennes
(2011) and the University of Naples Federico II (2015). The 1991 conference
led to the publication of the book Correspondence Analysis in the Social Sci-
ences, while the 1995 conference gave birth to another book, Visualization
of Categorical Data, both of which received excellent reviews. For the 1999
conference on Large Scale Data Analysis, participants had to present analyses
of data from the multinational International Social Survey Programme (ISSP
— see www.issp.org). This interdisciplinary meeting included presentations
not only on the latest methodological developments in survey data analysis
but also topics as diverse as religion, the environment and social inequality.
In 2003 we returned to the original theme for the Barcelona conference, which
was baptized with the Catalan girl’s name CARME, standing for Correspon-
dence Analysis and Related MEthods; hence the formation of the CARME
network (www.carme-n.org). This led once more to Jörg Blasius and myself
editing a third book, Multiple Correspondence Analysis and Related Methods,
which was published in 2006. As with the two previous volumes, our idea was
to produce a multi-authored book, inviting experts in the field to contribute.
Our editing task was to write the introductory and linking material, unifying
the notation and compiling a common reference list and index. As a result of
the Rennes conference in 2011, which celebrated 50 years of correspondence
analysis, the book Visualization and Verbalization of Data was published, half
of which is devoted to the history of multivariate analysis. These books mark
the pace of development of correspondence analysis, at least in the social sci-
ences, and are highly recommended to anyone interested in deepening their
knowledge of this versatile statistical method as well as methods related to it.

New material in third edition
I have been very gratified to be invited to prepare a new edition of
Correspondence Analysis in Practice, having accumulated considerably more experience
in social and environmental research in the nine years since the publication of
the second edition. Apart from revising the existing chapters, five new chapters
have been added, on “Compositional Data Analysis” (an area highly related
to correspondence analysis), “Analysis of Matched Matrices” (joint analysis
of data tables with the same rows and columns), “Correspondence Analysis of
Networks” (applying correspondence analysis to graphs), “Co-Inertia and Co-
Correspondence Analysis” (analysis of relationships between two tables with
common rows), and “Permutation Tests” (performing statistical inference in
the context of correspondence analysis and related methods). All in all, I can
say that this third edition contains almost all my practical knowledge of the
subject, after more than 40 years working in this area.
“Statisticians count!”
At a conference I attended in the 1980s, I was given this lapel button with its
nicely ambiguous maxim, which could well be the motto of correspondence
analysts all over the world:

Textual analysis of third edition
To illustrate the more technical meaning of this motto, and to give an initial
example of correspondence analysis, I made a count of the most frequent words
in each of the 30 chapters of this new edition. I had to aggregate variations of
the same word, e.g. “coordinate” and “coordinates”, “plot” and “plotting”,
a process called lemmatization in textual data analysis. The top 10 most
frequent words were, in descending order of frequency: “row/s”, “profile/s”,
“inertia” (which is the way correspondence analysis measures variance in a
table), “point/s”, “column/s”, “data”, “CA” (abbreviation for correspondence
analysis), “variable/s”, “value/s” and “average”. I omitted words that occur
in one chapter only, such as “fuzzy” and “degree”, which are specific to a
single chapter, and removed words that described particular applications. This
left an eventual total of 167 words, which can be regarded as reflecting the
methodological content of the book.
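
The aggregation step described in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LEMMAS` mapping, the label style and the sample sentence are hypothetical, and the text does not say which tools were used for the actual count.

```python
from collections import Counter
import re

# Hypothetical lemma map: each variant of a word is counted under one label.
LEMMAS = {
    "coordinate": "coordinate/s",
    "coordinates": "coordinate/s",
    "plot": "plot/ting",
    "plotting": "plot/ting",
    "row": "row/s",
    "rows": "row/s",
}

def count_lemmas(text):
    """Count words after mapping known variants to a shared label."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(LEMMAS.get(w, w) for w in words)

# Hypothetical sample text standing in for one chapter.
sample = "Plot the rows; plotting each row gives coordinates."
counts = count_lemmas(sample)
print(counts.most_common(3))
```

Applying such a function to each chapter and stacking the resulting counts, one row per chapter and one column per label, would give a chapters-by-words table of the kind shown in Exhibit 0.1.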

Exhibit 0.1: First few rows and columns of the table of counts of the 167 most frequent words in the 30 chapters of Correspondence Analysis in Practice, Third Edition, visualized in Exhibit 0.2 using correspondence analysis.

         analyse/sis  association/s  asymmetric  average  axis/es  ...
Chap 1            10              0           0        0       15  ...
Chap 2             0              0           0       29       22  ...
Chap 3             0              0           0       55        0  ...
Chap 4             0              6           0       22        0  ...
Chap 5             0              0           0       22       13  ...
Chap 6             0              0           0        8        0  ...
Chap 7             0              0           0       29        0  ...
Chap 8            47              0           0       14       20  ...
Chap 9             0              0          14        6       32  ...
Chap 10            0              0          17        0       14  ...
...              ...            ...         ...      ...      ...  ...

Total            369             12          39      370      277  ...


Exhibit 0.2: Correspondence analysis display of 30 chapters of the present book in terms of the most frequent words in each chapter. Numbers in boldface indicate the positions of the chapters, and their proximity signifies relative similarity of word distribution. Directions of the words give the interpretation for the positioning of the chapters. Technically, this is a so-called “contribution biplot” (see Chapter 13).
[Figure: map of chapters and words; horizontal axis: CA dimension 1; vertical axis: CA dimension 2.]

The final table of word counts was composed of the 30 chapters as rows,
and the 167 words as columns (see Exhibit 0.1). This table is very sparse,
i.e. it has many zeros. In fact, 80% of the cells of the table have no counts.
Correspondence analysis copes quite well with such data, which has made it a
popular method in research areas such as linguistics, archaeology and ecology,
where data sets of frequency counts occur that are very sparse.
Exhibit 0.2 shows the “map” of the table, resulting from applying correspon-
dence analysis. The first thing to notice is that the rows (chapters, indicated
by their numbers) and the columns (words, connected to the centre by lines)
are displayed with respect to two “dimensions”. These dimensions are de-
termined by the analysis with the objective of exposing the most important
features of the associations between chapters and words. An alternative way
of thinking about this is that the chapters are mapped according to the simi-
larity in their distributions of words, with closer chapters being more similar
and distant chapters more different. Then the directions of the words explain
the differences between the chapters. Not all the words are shown, because
about two-thirds of the words turn out to be not so important for the interpre-
tation of the result, so only those words are shown that contribute highly to
the positioning of the chapters. Without further explanation of the concepts
underlying correspondence analysis (after all, this is the aim of the book that
follows!) the map clearly shows three sets of words emanating from the centre.
The words out to the top right clearly distinguish Chapters 29 and 30 from all
the others — these are the chapters that concentrate on the sampling, distri-
butional and inferential properties of correspondence analysis, with main key-
words “permutation” and “test”. Chapter 15 on clustering also tends in that
direction because it contains some hypothesis testing. Out in the upper left
direction are all the words describing basic concepts and terminology of corre-
spondence analysis associated with Chapters 1–14 that introduce the method
and develop it, exemplified by the most prominent keyword “profile/s”. To-
wards the bottom of the map are the words associated with a generalization of
correspondence analysis, called multiple correspondence analysis, usually ap-
plied to questionnaire data, described in later chapters. This method involves
various coding schemes in different types of matrices, hence the important
keyword “matrix/ices” down below.
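
For readers who want to see how a two-dimensional map of this kind can be computed, the following is a minimal sketch of the standard singular-value-decomposition formulation of correspondence analysis, written with NumPy. It is an assumption-laden illustration, not the procedure used for Exhibit 0.2: it uses only the 10 x 5 fragment of counts printed in Exhibit 0.1, so its coordinates will not match the published map, and the variable names are my own.

```python
import numpy as np

# First 10 chapters x first 5 words from Exhibit 0.1 (a fragment only,
# not the full 30 x 167 table analysed in the book).
N = np.array([
    [10, 0,  0,  0, 15],
    [ 0, 0,  0, 29, 22],
    [ 0, 0,  0, 55,  0],
    [ 0, 6,  0, 22,  0],
    [ 0, 0,  0, 22, 13],
    [ 0, 0,  0,  8,  0],
    [ 0, 0,  0, 29,  0],
    [47, 0,  0, 14, 20],
    [ 0, 0, 14,  6, 32],
    [ 0, 0, 17,  0, 14],
], dtype=float)

P = N / N.sum()        # correspondence matrix (relative frequencies)
r = P.sum(axis=1)      # row masses (chapters)
c = P.sum(axis=0)      # column masses (words)

# Standardized residuals from the independence model r c^T.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of the rows on the first two dimensions.
F = (U[:, :2] * sv[:2]) / np.sqrt(r)[:, None]
# Standard coordinates of the columns on the first two dimensions.
G = Vt.T[:, :2] / np.sqrt(c)[:, None]

inertia = sv ** 2      # principal inertias; their sum is the total inertia
print("share of inertia on dimensions 1 and 2:",
      np.round(inertia[:2] / inertia.sum(), 3))
```

Plotting the rows of `F` as points and the rows of `G` as directions from the origin gives one common form of joint display; the scaling choices behind the contribution biplot mentioned in the caption of Exhibit 0.2 are treated in Chapter 13.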

Format of third edition
Like the second edition, the book maintains its didactic format, with exactly
eight pages per chapter to provide a constant amount of material in each
chapter for self-learning or teaching (a feature that has been commented on
favourably in book reviews of the second edition). One of my colleagues re-
marked that it was like writing 14-line sonnets with strict rules for metre
and rhyming, which was certainly true in this case: the format definitely con-
tributed to the creative process. The margins are reserved for section headings
as well as captions of the tables and figures — these captions tend to be more
informative than conventional one-liner ones. Each chapter has a short intro-
duction and its own “Contents” list on the first page, and the chapter always
ends with a summary in the form of a bulleted list.

Appendices
As in the first and second editions, the book’s main thrust is towards the
practice of correspondence analysis, so most technical issues and mathemati-
cal aspects are gathered in a theoretical appendix at the end of the book. It is
followed by a computational appendix, which describes some features of the R
language relevant to the methods in the book, including the ca package for cor-
respondence analysis. R scripts are placed on the website www.carme-n.org,
along with several of the data sets. No references at all are given in the 30
chapters — instead, a brief bibliographical appendix is given to point readers
towards further readings and more complete literature sources. A glossary of
the most important terms in the book is also provided and the book concludes
with some personal thoughts in the form of an epilogue.

Acknowledgements
The first edition of this book was written in South Africa, and the second and
present third editions in Catalonia, Spain. Many people and institutions have
contributed in one way or another to this project. I would like to thank the
BBVA Foundation in Madrid and its director Prof. Rafael Pardo, for sup-
port and encouragement in my work on correspondence analysis. The BBVA
Foundation has published a Spanish translation of the second edition of Cor-
respondence Analysis in Practice, called La Práctica del Análisis de Corre-
spondéncias, available for free download at www.multivariatestatistics.org.
I owe a similar debt of gratitude to my colleagues and the institution of the
Universitat Pompeu Fabra in Barcelona, where I have been working since
1994, one of the most innovative universities in Europe, recently ranked the
most productive Spanish university after only 25 years of its existence.
I would like to thank all my friends and colleagues in many countries for
moral and intellectual support, especially my wife Zerrin Aşan Greenacre,
Jörg & Beate Blasius, Trevor & Lynda Hastie, Carles Cuadras, John Gower,
Jean-Paul Benzécri, Angelos Markos, Alfonso Iodice d’Enza, Patrick Groe-
nen, Pieter Kroonenberg, Cajo ter Braak, Jan de Leeuw, Ludovic Lebart,
Jean-Marie & Annick Monget, Pierre & Martine Teillard, Michael Meimaris,
Michael Friendly, Antoine de Falguerolles, John Aitchison, Michael Browne,
Cas Crouse, Fred Lombard, June Juritz, Francesca Little, Karl Jöreskog, Les-
ley Andres, Barbara Cottrell, Raul Primicerio, Michaela Aschan, Salve Dahle,
Stig Falk-Petersen, Reinhold Fieler, Sabine Cochrane, Paul Renaud, Haakon
Hop, Tor & Danielle Korneliussen, Yasemin El-Menouar, Ekkehard & Ingvill
Mochmann, Maria Rohlinger, Antonella Curci, Gianna Mastrorilli, Paola Bor-
dandini, Oleg Nenadić, Walter Zucchini, Thierry Fahmy, Kimmo Vehkalahti,
Öztaş Ayhan, Simo Puntanen, George Styan, Juha Alho, François Theron,
Volker Hooyberg, Gurdeep Stephens & Pascal Courty, Antoni Bosch & Helena
Trias, Teresa Garcia-Milà, Tamara Djermanovic, Guillem Lopez, Xavier Cal-
samiglia, Andreu Mas-Colell, Xavier Freixas, Frederic Udina, Albert Satorra,
Jan Graffelman & Nuria Satorra, Rosemarie Nagel, Anna Espinal, Carolina
Chaya, Carlos Pérez, Moya Berry, Alan Griffiths, Dianne Fortescue, Tasos &
Androula Ladikos, Jerry & Mary Ann Reedy, Bodo & Bärbel Bilinski, Andries
& Gaby Claassens, Judy Twycross, Romà Revelles & Carme Clotet of Niu Nou
restaurant, Santi Careta, Marta Andreu, Rita Lugli & Danilo Guaitoli, José
Penalva & Nuria Serrano, Gabor Lugosi and the whole community of Gréixer
in the Pyrenees — you have all played a part in this story!
I also fondly remember dear friends and colleagues who have influenced my ca-
reer, but who have sadly passed away in recent years: Paul Lewi, Cas Troskie,
Dan Bradu, Reg & Kay Griffiths, Leo & Wendy Theron, Tony Brink, Jan
Visser, Victor Thiessen, Al McCutcheon and Ingram Olkin, as well as Ruben
Gabriel, from whom I learnt such a lot about statistics and life.
Particular thanks go to Angelos Markos and Antoine de Falguerolles for valu-
able comments on parts of the manuscript, and to Oleg Nenadić for his con-
tinuing collaboration in the development of our ca package in R.
Like the first and second editions, I have dedicated this book to my three
daughters, Françoise, Karolien and Gloudina, who never cease to amaze me
by their joy, sense of humour and diversity.
Finally, I thank the commissioning editor, Rob Calver, as well as Rebecca
Davies and Karen Simon of Chapman & Hall/CRC Press, for placing their
trust in me and for their constant cooperation in making this third edition of
Correspondence Analysis in Practice become a reality.
Michael Greenacre
Barcelona
1 Scatterplots and Maps
Correspondence analysis is a method of data analysis for representing tabu-
lar data graphically. Correspondence analysis is a generalization of a simple
graphical concept with which we are all familiar, namely the scatterplot. The
scatterplot is the representation of data as a set of points with respect to
two perpendicular coordinate axes: the horizontal axis often referred to as
the x-axis and the vertical one as the y-axis. As a gentle introduction to the
subject of correspondence analysis, it is convenient to reflect for a short time
on our perception of scatterplots and how we interpret them in relation to the
data they represent graphically. Particular emphasis will be placed on how we
interpret distances between points in a scatterplot and when scatterplots can
be seen as a spatial map of the data.

Contents
Data set 1: My travels
Continuous variables
Expressing data in relative amounts
Categorical variables
Ordering of categories
Distances between categories
Distance interpretation of scatterplots
Scatterplots as maps
Calibration of a direction in the map
Information-transforming nature of the display
Nominal and ordinal variables
Plotting more than one set of data
Interpreting absolute or relative frequencies
Describing and interpreting data, vs. modelling and statistical inference
Large data sets
SUMMARY: Scatterplots and Maps

Data set 1: My travels
During the original writing of this book, I was reflecting on the journeys I
had made during the year to Norway, Canada, Greece, France and Germany.
According to my diary I spent periods of 18 days in Norway, 15 days in
Canada and 29 days in Greece. Apart from these longer trips I also made
several short trips to France and Germany, totalling 24 days. This numerical
description of my time spent in foreign countries can be visualized in the
graphs of Exhibit 1.1. This seemingly trivial example conceals several issues
that are relevant to our perception of graphs of this type that represent data
with respect to two coordinate axes, and which will eventually help us to
1
2 Scatterplots and Maps

Exhibit 1.1: 40% 40%


Graphs of number of 30 30
days spent in foreign 30% 30%
countries in one
Days

Days
year, in scatterplot 20 20
20% 20%
and bar-chart
formats respectively.
10 10
A percentage scale, 10% 10%

expressing days
relative to the total
of 86 days, is given Norway Canada Greece France/Germany Norway Canada Greece France/Germany
on the right-hand
side of each graph.

understand correspondence analysis. Let me highlight these issues one at a


time.

Continuous variables

The left-hand vertical axis labelled Days represents the scale of a numeric piece
of information often referred to as a continuous, or numerical, variable. The
scale on this axis is the number of days spent in some foreign country, and the
ordering from zero days at the bottom end of the scale to 30 days at the top end
is clearly defined. In the bar-chart form of this display, given in the right-hand
graph of Exhibit 1.1, bars are drawn with lengths proportional to the values
of the variable. Of course, the number of days is a rounded approximation of
the time actually spent in each country, but we call this variable continuous
because the underlying time variable is indeed truly continuous.

Expressing data in relative amounts

The right-hand vertical axis of each plot in Exhibit 1.1 can be used to read the
corresponding percentage of days relative to the total of 86 days. For example,
the 18 days in Norway account for 21% of the total time. The total of 86 is
often called the base relative to which the data are expressed. In this case
there is only one set of data and therefore just one base, so in these plots the
original absolute scale on the left and the relative scale on the right can be
depicted on the same graph.
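
The arithmetic behind the right-hand scale can be sketched in a few lines of
Python. This is only an illustration: the names days, base and percentages
are assumptions introduced here, and the figures are the day counts quoted
above.

```python
# Days spent in each country, as quoted in the text.
days = {"Norway": 18, "Canada": 15, "Greece": 29, "France/Germany": 24}

# The base is the total relative to which the data are expressed.
base = sum(days.values())

# Each count re-expressed as a whole-number percentage of the base.
percentages = {country: round(100 * n / base) for country, n in days.items()}

print(base)
print(percentages)
```

Because there is a single base, every percentage is the same multiple of its
count, which is why one graph can carry both scales.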

Categorical variables

In contrast to the vertical y-axis, the horizontal x-axis is clearly not a
numerical variable. The four points along this axis are just positions where we
have placed labels denoting the countries visited. The horizontal scale represents
a discrete, or categorical, variable. There are two features of this horizontal
axis that have no substantive meaning in the graph: the ordering of the categories
and the distances between them.

Ordering of categories

Firstly, there is no strong reason why Norway has been placed first, Canada
second and Greece third, except perhaps that I visited these countries in that
order. Because the France/Germany label refers to a collection of shorter trips
scattered throughout the year, it was placed after the others. By the way, in
this type of representation where order is essentially irrelevant, it is usually
a good idea to re-order the categories in a way that has some substantive
meaning, for example in terms of the values of the variable. In this example
we could order the countries in descending order of days, in which case we
would position the countries in the order Greece, France/Germany, Norway
and Canada, from most visited to least. This simple re-arrangement assists
in the interpretation of data, especially if the data set is much larger: for
example, if I had visited 20 different countries, then the order would contain
relevant information that is not quickly deduced from the data in their original
ordering.
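
The re-ordering described above amounts to sorting the categories by the
value of the variable. A minimal Python sketch, with variable names that are
assumptions made for this illustration, might look like this:

```python
# Days spent in each country, in the original (arbitrary) order.
days = {"Norway": 18, "Canada": 15, "Greece": 29, "France/Germany": 24}

# Sort the (country, days) pairs by days, largest first.
ordered = sorted(days.items(), key=lambda item: item[1], reverse=True)

# Keep only the country labels, now in a substantively meaningful order.
ordered_countries = [country for country, _ in ordered]

print(ordered_countries)
```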

Distances between categories

Secondly, there is no reason why the four points are at equal intervals apart
on the axis. There is also no immediate reason to put them at different
intervals apart, so it is purely for convenience and aesthetics that they have
been equally spaced. Using correspondence analysis we will show that there
are substantively interesting ways to define intervals between the categories
of a variable such as this one, when it is related to other variables. In fact,
correspondence analysis will be shown to yield values for the categories where
both the distances between the categories and their ordering have substantive
meaning.

Distance interpretation of scatterplots

Since the ordering of the countries is arbitrary on the horizontal axis of
Exhibit 1.1, as well as the distances between them, there would be no sense
in measuring and interpreting distances between the displayed points in the
left-hand graph. The only distance measurement that has meaning is in the
strictly vertical direction, because of the numerical nature of the vertical axis
that indicates frequency (left-hand scale) or relative frequency (right-hand
scale).

Scatterplots as maps

In some special cases, the two variables that define the axes of the scatterplot
are of the same numerical nature and have comparable scales. For example,
suppose that 20 students have written a mathematics examination consisting
of two parts, algebra and geometry, each part counting 50% towards the final
grade. The 20 students can be plotted according to their pair of grades, shown
in Exhibit 1.2. It is important that the two axes representing the respective
grades have scales with unit intervals of identical lengths. Because of the
similar nature of the two variables and their scales, it is possible to judge
distances in any direction of the display, not only horizontally or vertically.
Two points that are close to each other will have similar results in the
examination, just like two neighbouring towns having a small geographical
distance between them. Thus, one can comment here on the shape of the scatter
of points and the fact that there is a small cluster of four students with high
grades and a single student with very high grades. Exhibit 1.2 can be regarded
as a map, because the position of each student can be regarded as a
two-dimensional position, similar to a geographical location in a region defined
by latitude and longitude co-ordinates.

Exhibit 1.2: Scatterplot of grades of 20 students in two sections (algebra
and geometry) of a mathematics examination. The points have spatial
properties: for example, the total grade is obtained by projecting each point
perpendicularly onto the 45° line, which can be calibrated from 0 (bottom left
corner) to 100 (top right corner).
[Figure: horizontal axis Algebra (0 to 50), vertical axis Geometry (0 to 50).]

Calibration of a direction in the map

Maps have interesting geometric properties. For example, in Exhibit 1.2 the
45° dashed line actually defines an axis for the final grades of the students,
combining the algebra and geometry grades. If this line is calibrated from 0
(bottom left) to 100 (top right), then each student’s final grade can be read
from the map by projecting each point perpendicularly onto this line. An
example is shown of a student who received 12 out of 50 and 18 out of 50
for the two sections, respectively, and whose position projects onto the line at
coordinates 15 and 15, corresponding to a total grade of 30.
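
The projection in this example can be checked numerically. The following
Python sketch is an illustration only: the function names
project_onto_diagonal and total_grade are hypothetical, and it assumes, as in
Exhibit 1.2, that both axes run from 0 to 50 with equal unit lengths.

```python
import math

def project_onto_diagonal(algebra, geometry):
    """Project a point perpendicularly onto the 45-degree line through the origin.

    Returns the foot of the perpendicular and its distance from the origin.
    """
    u = (1 / math.sqrt(2), 1 / math.sqrt(2))       # unit vector along the line
    length = algebra * u[0] + geometry * u[1]      # scalar projection
    return (length * u[0], length * u[1]), length

def total_grade(algebra, geometry, max_part=50):
    """Read the total grade off the line, calibrated from 0 to 100."""
    _, length = project_onto_diagonal(algebra, geometry)
    full_length = max_part * math.sqrt(2)          # line length from (0, 0) to (50, 50)
    return 100 * length / full_length

foot, _ = project_onto_diagonal(12, 18)
print(foot)                 # approximately (15, 15)
print(total_grade(12, 18))  # approximately 30
```

The calibration works because the scalar projection of a point onto the
45° direction is proportional to the sum of its two coordinates.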

Information-transforming nature of the display

The scatterplots in Exhibit 1.1 and Exhibit 1.2 are different ways of expressing
in graphical form the numerical information in the two sets of travel and
examination data respectively. In each case there is no loss of information
between the data and the graph. Given the graph it is easy to recover the data
exactly. We say that the scatterplot or map is an “information-transforming
instrument”: it does not process the data at all; it simply expresses the data
in a visual format that communicates the same information in an alternative
way.

Nominal and ordinal variables

In my travel example, the categorical variable “country” has four categories,
and, since there is no inherent ordering of the categories, we refer to this
variable more specifically as a nominal variable. If the categories are ordered,
the categorical variable is called an ordinal variable. For example, a day could be
classified into three categories according to how much time is spent working:
(i) less than one hour (which I would call a “holiday”), (ii) more than one but
less than six hours (a “half day”, say) and (iii) more than six hours (a “full
day”). These categories, which are based on the continuous variable “time
spent daily working” divided up into intervals, are ordered and this ordering
is usually taken into account in any graphical display of the categories. In
many social surveys, questions are answered on an ordinal scale of response,
for example, an ordinal scale of importance: “not important”/“somewhat
important”/“very important”. Another typical example is a scale of
agreement/disagreement: “strongly agree”/“somewhat agree”/“neither agree nor
disagree”/“somewhat disagree”/“strongly disagree”. Here the ordinal position
of the category “neither agree nor disagree” might not lie between “somewhat
agree” and “somewhat disagree”; for example, it might be a category used
by some respondents instead of a “don’t know” response when they do not
understand the question or when they are confused by it. We shall treat this
topic later in this book (Chapter 21) once we have developed the tools that
allow us to study patterns of responses in multivariate questionnaire data.

Exhibit 1.3: Frequencies of different types of day in four sets of trips.

COUNTRY          Holidays  Half days  Full days  TOTAL
Norway                  6          1         11     18
Canada                  1          3         11     15
Greece                  4         25          0     29
France/Germany          2          2         20     24
TOTAL                  13         31         42     86

Plotting more than one set of data

Let us suppose now that the 86 days of my foreign trips were classified into one
of the three categories holidays, half days and full days. The cross-tabulation
of country by type of day is given in Exhibit 1.3. This table can be considered
in two different ways: as a set of rows or a set of columns. For example, each
column is a set of frequencies characterizing the respective type of day, while
each row characterizes the respective country. Exhibit 1.4(a) shows the latter
way, namely a plot of the frequencies for each country (row), where the
horizontal axis now represents the type of day (column). Notice that, because the
categories of the variable “type of day” are ordered, it makes sense to connect
the categories by lines. Clearly, if we want to make a substantive comparison
between the countries, then we should take into account the fact that different
numbers of days in total were spent in each country. Each country total forms
a different base for the re-expression of the corresponding row in Exhibit 1.3
as a set of percentages (Exhibit 1.5). These percentages are visualized in
Exhibit 1.4(b) in a plot that expresses better the different compositions of days
in the respective trips.

Exhibit 1.4: Plots of (a) frequencies in Exhibit 1.3 and (b) relative
frequencies in each row expressed as percentages.
[Figure: panel (a) vertical axis Days, panel (b) vertical axis Percentage of
days (%); horizontal axis in both panels Holiday, Half Day, Full Day; one line
per country: Norway, Canada, Greece, France/Germany.]

Exhibit 1.5: Percentages of types of day in each country, as well as the
percentages overall for all countries combined; rows add up to 100%.

COUNTRY          Holidays  Half days  Full days
Norway                33%         6%        61%
Canada                 7%        20%        73%
Greece                14%        86%         0%
France/Germany         8%         8%        83%
Overall               15%        36%        49%

Interpreting absolute or relative frequencies

There is a lesson to be learnt from these displays that is fundamental to the
analysis of frequency data. Each trip has involved a different number of days
and so corresponds to a different base as far as the frequencies of the types of
days are concerned. The 6 holidays in Norway, compared to the 4 in Greece,
can be judged only in relation to the total number of days spent in these
respective countries. As percentages they turn out to be quite different: 6 out
of 18 is 33%, while 4 out of 29 is 14%. It is the visualization of the relative
frequencies in Exhibit 1.4(b) that gives a more accurate comparison of how I
spent my time in the different countries. The “marginal” frequencies (18, 15,
29, 24 for the countries, and 13, 31, 42 for the day types) are also interpreted
relative to their respective totals; for example, the last row of Exhibit 1.5
shows the percentages of day types for all countries combined, and could have
been plotted similarly in Exhibit 1.4(b).
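
The comparison of holidays in Norway and Greece can be reproduced with a
short Python sketch. It is an illustration under the assumption that the counts
are those of Exhibit 1.3; the dictionary names are hypothetical.

```python
# Holidays and total days for the two countries being compared (Exhibit 1.3).
holidays = {"Norway": 6, "Greece": 4}
days_in_country = {"Norway": 18, "Greece": 29}

# Each country has its own base, so each count is divided by its own total.
holiday_share = {c: 100 * holidays[c] / days_in_country[c] for c in holidays}

# The marginal day-type frequencies share one base, the grand total.
day_type_totals = {"Holidays": 13, "Half days": 31, "Full days": 42}
grand_total = sum(day_type_totals.values())
overall_share = {d: 100 * n / grand_total for d, n in day_type_totals.items()}

print({c: round(p) for c, p in holiday_share.items()})
print({d: round(p) for d, p in overall_share.items()})
```

The absolute counts 6 and 4 look similar, but relative to their own bases the
holiday share in Norway is more than twice that in Greece.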

Describing and interpreting data, vs. modelling and statistical inference

Any conclusion drawn from the points’ positions in Exhibit 1.4(b) is purely
an interpretation of the data and not a statement of the statistical
significance of the observed feature. In this book we shall address the statistical
aspects of graphical displays only towards the end of the book (Chapters 29
and 30); for the most part we shall be concerned only with the question of
data visualization and interpretation. The deduction that I had proportionally
more holidays in Norway than in the other countries is certainly true in
the data and can be seen strikingly in Exhibit 1.4(b). It is an entirely
different question whether this phenomenon is statistically compatible with a
model or hypothesis of my behaviour that postulates that the proportion of
holidays was generally intended to be the same for all countries, in which case
any observed differences are purely random. Most of statistical methodology
concentrates on problems where data are fitted and compared to a theoretical
model or preconceived hypothesis, with little attention being paid to
enlightening ways for describing data, interpreting data and generating hypotheses.
A typical example in the social sciences is the use of the ubiquitous chi-square
statistic to test for association in a cross-tabulation. Often statistically
significant association is found but there are no simple tools for detecting which
parts of the table are responsible for this association. Correspondence analysis
is one tool that can fill this gap, allowing the data analyst to see the pattern of
association in the data and to generate hypotheses that can be tested in a
subsequent stage of research. In most situations data description, interpretation
and modelling can work hand-in-hand with one another. But there are
situations where data description and interpretation assume supreme importance,
for example when the data represent the whole population of interest.

Large data sets

As data tables increase in size, it becomes more difficult to make simple
graphical displays such as Exhibit 1.4, owing to the overabundance of points. For
example, suppose I had visited 20 countries during the year and had a
breakdown of time spent in each one of them, leading to a table with many more
rows. I could also have recorded other data about each day in order to study
possible relationships with the type of day I had; for example, the weather on
each day: “fair weather”, “partly cloudy” or “rainy”. So the table of data
might have many more columns as well as rows. In this case, to draw graphs
such as Exhibit 1.4, involving many more categories and with 20 sets of points
traversing the plot, would result in such a confusion of points and symbols
that it would be difficult to see any patterns at all. It would then become clear
that the descriptive instrument being used, the scatterplot, is inadequate in
bringing out the essential features of the data. This is a convenient point to
introduce the basic concepts of correspondence analysis, which is also a method
for visualizing tabular data, but which can easily accommodate larger data
sets in a natural and intuitive way.

SUMMARY: Scatterplots and Maps

1. Scatterplots involve plotting two variables, with respect to a horizontal axis
and a vertical axis, often called the “x-axis” and “y-axis” respectively.
2. Usually the x variable is a completely different entity to the y variable. We
can often interpret distances along at least one of the axes in the specific
sense of measuring the distance according to the scale that is calibrated on
the axis. It is usually meaningless to measure or interpret oblique distances
in the plot.
3. In a few cases the x and y variables are similar entities with comparable
scales, in which case interpoint distances can be interpreted as a measure
of difference, or dissimilarity, between the plotted points. In this special
case we call the scatterplot a map. For such maps it is important that the
horizontal and vertical scales have physically equal units, i.e. the aspect
ratio of the axes is equal to 1.
4. When plotting positive quantities (usually frequencies in our context), both
the absolute and relative values of these quantities are of interest.
5. The more complex the data are, the less convenient it is to represent these
data in a scatterplot.
6. This book is concerned with visually describing and interpreting complex
information, rather than modelling it.

Chapter 2: Profiles and the Profile Space

The concept of a set of relative frequencies, or a profile, is fundamental to
correspondence analysis (referred to from now on by its abbreviation CA).
Such sets, or vectors, of relative frequencies have special geometric features
because the elements of each set add up to 1 (or 100%). In analysing a fre-
quency table, relative frequencies can be computed for rows or for columns —
these are called row or column profiles respectively. In this chapter we shall
show how profiles can be depicted as points in a profile space, illustrating the
concept in the special case when each profile consists of only three elements.

Contents
Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Average profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Row profiles and column profiles . . . . . . . . . . . . . . . . . . . . . . 10
Symmetric treatment of rows and columns . . . . . . . . . . . . . . . . . 11
Asymmetric consideration of the data table . . . . . . . . . . . . . . . . 11
Plotting the profiles in the profile space . . . . . . . . . . . . . . . . . . 11
Vertex points define the extremes of the profile space . . . . . . . . . . . 12
Triangular (or ternary) coordinate system . . . . . . . . . . . . . . . . . 12
Positioning a point in a triangular coordinate system . . . . . . . . . . . 14
Geometry of profiles with more than three elements . . . . . . . . . . . 14
Data on a ratio scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Data on a common scale . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
SUMMARY: Profiles and the Profile Space . . . . . . . . . . . . . . . . 16

Profiles

Let us look again at the data in Exhibit 1.3, a table of frequencies with four
rows (the countries) and three columns (the type of day). The first and most
basic concept in CA is that of a profile, which is a set of frequencies divided
by their total. Exhibit 2.1 shows the row profiles for these data: for example,
the profile of Norway is [0.33 0.06 0.61], where 0.33 = 6/18, 0.06 = 1/18, 0.61
= 11/18. We say that this is the “profile of Norway across the types of day”.
The profile may also be expressed in percentage form, i.e. [33% 6% 61%] in
this case, as in Exhibit 1.5. In a similar way, the profile of Canada across the
day types is [0.07 0.20 0.73], concentrated mostly in the full day category, as
is Norway. In contrast, Greece has a profile of [0.14 0.86 0.00], concentrated
mostly in the half day category, and so on. The percentages are plotted in
Exhibit 1.4(b) on page 6.
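
The definition of a profile translates directly into code. The Python sketch
below is an illustration only: the function name profile and the dictionary
table are assumptions, and the frequencies are those of Exhibit 1.3.

```python
# Frequencies of (holidays, half days, full days) per country, from Exhibit 1.3.
table = {
    "Norway": [6, 1, 11],
    "Canada": [1, 3, 11],
    "Greece": [4, 25, 0],
    "France/Germany": [2, 2, 20],
}

def profile(frequencies):
    """Return the frequencies divided by their total (a set of relative frequencies)."""
    total = sum(frequencies)
    return [f / total for f in frequencies]

# One row profile per country; the elements of each profile add up to 1.
row_profiles = {country: profile(row) for country, row in table.items()}

for country, p in row_profiles.items():
    print(country, [round(x, 2) for x in p])
```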

Exhibit 2.1: Row (country) profiles: relative frequencies of types of day in
each set of trips, and average profile showing relative frequencies in all trips.

COUNTRY          Holidays  Half days  Full days
Norway               0.33       0.06       0.61
Canada               0.07       0.20       0.73
Greece               0.14       0.86       0.00
France/Germany       0.08       0.08       0.83
Average              0.15       0.36       0.49

Average profile

In addition to the four country profiles, there is an additional row in Exhibit
2.1 labelled Average. This is the profile of the final row [13 31 42] of Exhibit
1.3, which contains the column sums of the table; in other words this is the
profile of all the trips aggregated together. In Chapter 3 we shall explain more
specifically why this is called the average profile. For the moment, it is only
necessary to realize that, out of the total of 86 days travelled, irrespective
of country visited, 15% were holidays, 36% were half days and 49% were full
days of work. When comparing profiles we can compare one country’s profile
with another, or we can compare a country’s profile with the average profile.
For example, eyeballing the figures in Exhibit 2.1, we can see that of all the
countries, the profiles of Canada and France/Germany are the most similar.
Compared to the average profile, these two profiles have a higher percentage
of full days and are below average on holidays and half days.
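
The comparison with the average profile can be made explicit by subtracting
the average from each row profile. The following Python sketch is illustrative;
the names column_sums, average_profile and deviations are assumptions, and
the data are those of Exhibit 1.3.

```python
# Frequencies of (holidays, half days, full days) per country, from Exhibit 1.3.
table = {
    "Norway": [6, 1, 11],
    "Canada": [1, 3, 11],
    "Greece": [4, 25, 0],
    "France/Germany": [2, 2, 20],
}

def profile(frequencies):
    """Return the frequencies divided by their total."""
    total = sum(frequencies)
    return [f / total for f in frequencies]

# The average profile is the profile of the column sums [13, 31, 42].
column_sums = [sum(column) for column in zip(*table.values())]
average_profile = profile(column_sums)

# Deviation of each row profile from the average: positive means above average.
deviations = {
    country: [p - a for p, a in zip(profile(row), average_profile)]
    for country, row in table.items()
}

print([round(a, 2) for a in average_profile])
for country, d in deviations.items():
    print(country, [round(x, 2) for x in d])
```

For Canada and France/Germany the first two deviations are negative and the
third is positive, which is the pattern described in the paragraph above.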

Row profiles and column profiles

In the above we looked at the row profiles in order to compare the different
countries. We could also consider Exhibit 1.3 as a set of columns and compare
how the different types of days are distributed across the countries. Exhibit 2.2
shows the column profiles as well as the average column profile. For example,
of the 13 holidays 46% were in Norway, 8% in Canada, 31% in Greece and 15%
in France/Germany, and so on for the other columns. Since I spent different
numbers of days in each country, these figures should be checked against those
of the average column profile to see whether they are lower or higher than
the average pattern. For example, 46% of all holidays were spent in Norway,
whereas the number of days spent in Norway was just 21% of the total of 86;
in this sense there is a high number of holidays there compared to the average.

Exhibit 2.2: Profiles of types of day across the countries, and average
column profile.

COUNTRY          Holidays  Half days  Full days  Average
Norway               0.46       0.03       0.26     0.21
Canada               0.08       0.10       0.26     0.17
Greece               0.31       0.81       0.00     0.34
France/Germany       0.15       0.06       0.48     0.28
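
Column profiles are computed in the same way as row profiles, only with the
column totals as bases. The Python sketch below is an illustration; all names
in it are assumptions, and the frequencies are those of Exhibit 1.3.

```python
# Frequencies of (holidays, half days, full days) per country, from Exhibit 1.3.
table = {
    "Norway": [6, 1, 11],
    "Canada": [1, 3, 11],
    "Greece": [4, 25, 0],
    "France/Germany": [2, 2, 20],
}
day_types = ["Holidays", "Half days", "Full days"]

# Column totals are the bases of the column profiles.
column_totals = [sum(row[j] for row in table.values()) for j in range(len(day_types))]
grand_total = sum(column_totals)

# Profile of each day type across the countries.
column_profiles = {
    day: {country: row[j] / column_totals[j] for country, row in table.items()}
    for j, day in enumerate(day_types)
}

# Average column profile: the row totals relative to the grand total.
average_column_profile = {country: sum(row) / grand_total for country, row in table.items()}

print(round(column_profiles["Holidays"]["Norway"], 2))
print(round(average_column_profile["Norway"], 2))
```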

Symmetric treatment of rows and columns

Looking again at the proportion 0.46 (= 6/13) of holidays spent in Norway
(Exhibit 2.2) and comparing it to the proportion 0.21 (= 18/86) of all days
spent in that country, we can calculate the ratio 0.46/0.21 = 2.2, and conclude
that holidays in Norway were just over twice the average. Exactly the same
conclusion is reached if a similar calculation is made on the row profiles. In
Exhibit 2.1 the proportion of holidays in Norway was 0.33 (= 6/18) whereas
for all countries the proportion was 0.15 (= 13/86). Thus, there are 0.33/0.15
= 2.2 times as many holidays compared to the average, the same ratio as was
obtained when arguing from the point of view of the column profiles (this
ratio is called the contingency ratio and will re-appear in future chapters).
Whether we argue via the row profiles or column profiles we arrive at the
same conclusion. In Chapter 8 it will be shown that CA treats the rows and
columns of a table in an equivalent fashion or, as we say, in a symmetric way.
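
That the two routes give exactly the same ratio, not merely the same rounded
value, can be verified with exact fractions. The Python sketch below is an
illustration; the variable names are assumptions, and the numbers are the
Norway and holidays entries of Exhibit 1.3.

```python
from fractions import Fraction

# Cell, row total, column total and grand total for Norway and holidays.
n_ij, row_total, column_total, grand_total = 6, 18, 13, 86

# Via the row profile: proportion of holidays in Norway over the average proportion.
ratio_from_rows = Fraction(n_ij, row_total) / Fraction(column_total, grand_total)

# Via the column profile: Norway's share of holidays over Norway's share of all days.
ratio_from_columns = Fraction(n_ij, column_total) / Fraction(row_total, grand_total)

# Both reduce to the same expression in the four counts.
ratio_direct = Fraction(n_ij * grand_total, row_total * column_total)

print(ratio_from_rows, ratio_from_columns, ratio_direct)
print(round(float(ratio_direct), 1))
```

Both routes reduce to the cell frequency times the grand total, divided by the
product of the row and column totals, which is why the argument is symmetric.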

Asymmetric consideration of the data table

Nevertheless, it is true in practice that a table of data is often thought of and
interpreted in a non-symmetric, or asymmetric, fashion, either as a set of rows
or as a set of columns. For example, since each row of Exhibit 1.3 constitutes a
different country (or pair of countries in the case of France/Germany), it might
be more natural to think of the table row-wise, as in Exhibit 2.1. Deciding
which way is more appropriate depends on the nature of the data and the
researcher’s objective, and the decision is often not a conscious one. One
concrete manifestation of the actual choice is whether the researcher refers to
row or column percentages when interpreting the data. Whatever the decision,
the results of CA will be invariant to this choice, but the interpretation will
adapt to the researcher’s viewpoint.

Plotting the profiles in the profile space

Let us consider the four row profiles and average profile in Exhibit 2.1 and
a completely different way to plot them. Rather than the display of Exhibit
1.4(b), where the horizontal axis serves only as labels for the type of day
and the vertical axis represents the percentages, we now propose using three
axes corresponding to the three types of day, which is a scatterplot in three
dimensions. To imagine three perpendicular axes is not difficult: merely look
down into an empty corner of the room you are sitting in and you will see
three axes as shown in Exhibit 2.3. Each of the three edges of the room
serves as an axis for plotting the three elements of the profile. These three
values are now considered to be coordinates of a single point that represents
the whole profile; this is quite different from the graph in Exhibit 1.4(b)
where there is a point for each of the three profile elements. The three axes
are labelled holidays, half days and full days, and are calibrated in fractional
profile units from 0 to 1. To plot the four profiles is now a simple exercise.
Norway’s profile of [0.33 0.06 0.61] (see Exhibit 2.1) is 0.33 of a unit along
axis holidays, 0.06 along axis half days and 0.61 along axis full days. To take
another example, Greece’s profile of [0.14 0.86 0.00] has a zero coordinate
in the full days direction, so its position is on the “wall”, as it were, on the
left-hand side of the display, with coordinates 0.14 and 0.86 on the two axes
holidays and half days that define the “wall”. All other row profile points in this
example, including the average row profile [0.15 0.36 0.49], can be plotted
in this three-dimensional space.

Exhibit 2.3: Positions of the four row profiles (•) of Exhibit 2.1 as well
as their average profile (∗) in three-dimensional space, depicted as the corner
of a room with “floor tiles”. For example, Norway is 0.06 along the half days
axis, 0.61 along the full days axis and 0.33 in a vertical direction along the
holidays axis. The unit (vertex) points are also shown as empty circles on each
axis.
[Figure: three perpendicular axes holidays, half days and full days; points
NORWAY, CANADA, GREECE, FRANCE/GERMANY and average.]

Vertex points define the extremes of the profile space

With a bit of imagination it might not be surprising to discover that the
profile points in Exhibit 2.3 all lie exactly in the plane defined by the triangle
that joins the extreme unit points [1 0 0], [0 1 0] and [0 0 1] on the three
respective axes, as shown in Exhibit 2.4. This triangle is equilateral and its
three corners are called vertex points or vertices. The vertices coincide with
extreme profiles that are totally concentrated into one of the day types. For
example, the vertex point [1 0 0] corresponds to a trip to a country consisting
only of holidays (fictional in my case, unfortunately). Likewise, the vertex point
[0 0 1] corresponds to a trip consisting only of full days of work.
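
Both claims, that the profiles lie in the plane through the three unit points
and that the triangle is equilateral, can be checked numerically. This Python
sketch is only an illustration: the function in_profile_plane and the other
names are assumptions, and the profiles are computed from Exhibit 1.3.

```python
import math

# Row profiles (holidays, half days, full days) and the average profile.
profiles = {
    "Norway": (6 / 18, 1 / 18, 11 / 18),
    "Canada": (1 / 15, 3 / 15, 11 / 15),
    "Greece": (4 / 29, 25 / 29, 0 / 29),
    "France/Germany": (2 / 24, 2 / 24, 20 / 24),
    "average": (13 / 86, 31 / 86, 42 / 86),
}

# The three vertex points: profiles concentrated in a single day type.
vertices = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

def in_profile_plane(point, tol=1e-12):
    """True if the coordinates are non-negative and add up to 1."""
    return all(v >= 0 for v in point) and abs(sum(point) - 1.0) < tol

all_in_plane = all(in_profile_plane(p) for p in profiles.values())

# The three sides of the triangle joining the vertices have equal length.
side_lengths = [
    math.dist(vertices[i], vertices[j]) for i, j in [(0, 1), (0, 2), (1, 2)]
]

# A zero element places a profile on an edge of the triangle.
greece_on_edge = profiles["Greece"][2] == 0.0

print(all_in_plane, side_lengths, greece_on_edge)
```

In the three-dimensional space each side of the triangle has length equal to
the square root of 2, the distance between two unit points on different axes.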

Triangular (or ternary) coordinate system

Having realized that all profile points in three-dimensional space actually lie
exactly on a flat (two-dimensional) triangle, it is possible to lay this triangle
flat, as in Exhibit 2.5. Looking at the profile points in a flat space is clearly
better than trying to imagine their three-dimensional positions in the corner of
a room! This particular type of display is often referred to as the triangular (or
ternary) coordinate system and may be used in any situation where we have
sets of data consisting of three elements that add up to 1, as in the case of the
row profiles in this example. Such data are common in geology and chemistry,
for example where samples are decomposed into three constituents, by weight
or by volume. A particular sample is characterized by the three proportions
of the constituents and can thus be displayed as a single point with respect
to triangular (or ternary) coordinates.

Exhibit 2.4: The profile points in Exhibit 2.3 lie exactly on an equilateral
triangle joining the vertex points of the profile space. Thus the
three-dimensional profiles are actually two-dimensional. The profile of Greece
lies on the edge of the triangle because it has zero full days.
[Figure: axes holidays, half days and full days with the triangle joining the
three vertex points; points NORWAY, CANADA, GREECE, FRANCE/GERMANY and average.]

Exhibit 2.5: The triangle in Exhibit 2.4 that contains the row (country)
profiles. The three corners, or vertices, of the triangle represent the columns
(day types).
[Figure: triangle with vertices holidays, half days and full days; points
NORWAY, CANADA, GREECE, FRANCE/GERMANY and average.]

Exhibit 2.6: Norway’s profile [0.33 0.06 0.61] is positioned using triangular
coordinates as shown, using the sides of the triangle as axes. Each side is
calibrated in profile units from 0 to 1.
[Figure: triangle with sides labelled holidays, half days and full days; the
coordinate values 0.06 and 0.61 are marked on the sides; points NORWAY, CANADA,
GREECE, FRANCE/GERMANY and average.]

Positioning a point in a triangular coordinate system

Given a blank equal-sided triangle and the profile values, how can we find the
position of a profile point in the triangle, without passing via the underlying
three-dimensional space of Exhibits 2.3 and 2.4? In the triangular coordinate
system the sides of the triangle define three axes. Each side is considered
to have a length of 1 and can be calibrated accordingly on a linear scale
from 0 to 1. In order to position a profile in the triangle, its three values
on these axes determine three lines drawn from these values parallel to the
respective sides of the triangle. For example, to position Norway, as illustrated
in Exhibit 2.5, we take a value of 0.33 on the holidays axis (see Exhibit 2.6),
0.06 on the half days axis and 0.61 on the full days axis. Lines from these
coordinate values drawn parallel to the sides of the triangle all meet at the
point representing Norway. In fact, any two of the three profile coordinates
are sufficient to situate a profile in this way, and the remaining coordinate is
always superfluous, which is another way of demonstrating that the profiles
are inherently two-dimensional.
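
For plotting on a screen or page, a point in the triangle also needs ordinary
horizontal and vertical coordinates. One way to obtain them, sketched here in
Python, is to take the weighted average of the three vertex positions, with the
profile elements as weights; this gives the same point as the construction with
parallel lines. The layout of the vertices (holidays at the top, half days at
bottom left, full days at bottom right) and the function name ternary_to_xy are
assumptions made for this illustration.

```python
import math

# Assumed positions of the vertices of a flat triangle with sides of length 1.
VERTICES = {
    "holidays": (0.5, math.sqrt(3) / 2),
    "half days": (0.0, 0.0),
    "full days": (1.0, 0.0),
}

def ternary_to_xy(holidays, half_days, full_days=None):
    """Convert a three-element profile to (x, y) coordinates in the triangle.

    If full_days is omitted it is recovered as 1 - holidays - half_days,
    since any two elements of a profile determine the third.
    """
    if full_days is None:
        full_days = 1.0 - holidays - half_days
    weights = {"holidays": holidays, "half days": half_days, "full days": full_days}
    x = sum(w * VERTICES[name][0] for name, w in weights.items())
    y = sum(w * VERTICES[name][1] for name, w in weights.items())
    return x, y

norway_xy = ternary_to_xy(6 / 18, 1 / 18, 11 / 18)
norway_xy_from_two = ternary_to_xy(6 / 18, 1 / 18)

print(norway_xy)
print(norway_xy_from_two)
```

The height of the point above the base is the holidays value multiplied by the
height of the triangle, so a profile with no holidays lies on the bottom side.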

Geometry of profiles with more than three elements

The triangular coordinate system may be used only for profiles with three
elements. But the idea can easily be generalized to profiles with any number
of elements, in which case the coordinate system is known as the barycentric
coordinate system (“barycentre” is synonymous with “weighted average”,
to be explained in the next chapter, page 19). The dimensionality of this
coordinate system is always one less than the number of elements in the
profile. For example, we have just seen that three-element profiles are
contained exactly in a two-dimensional triangular profile space. For profiles with
four elements the dimensionality is three and the profiles lie in a four-pointed
tetrahedron in three-dimensional space. The two-dimensional triangle and the
three-dimensional tetrahedron are examples of what is known in mathematics
as a regular simplex. R code for visualizing an example in three dimensions is
given in the Computing Appendix, pages 257–258, so you can get a feeling for
three-dimensional profile space. For higher-dimensional profiles some strong
imagination would be needed to be able to “see” the profile points in spaces of
dimension greater than three, but fortunately CA will be of great help to us
in visualizing such multidimensional profiles.

Data on a ratio scale

We have illustrated the concept of a profile using frequency data, which is
the prime example of data suitable for CA. But CA is applicable to a much
wider class of data types; in fact it can be used whenever it makes sense to
express the data in relative amounts, i.e. data on a so-called ratio scale. For
example, suppose we have data on monetary amounts invested by countries
in different areas of research; the relative amounts would be of interest, e.g.
the percentage invested in environmental research, biomedicine, etc. Another
example is of morphometric measurements on a living organism, for example
measurements in centimeters on a fish, its length and width, length of fins, etc.
Again all these measurements can be expressed relative to the total, where
the total is a surrogate measure for the size of the fish, so that we would be
analysing and comparing the shapes of different fish in the form of profiles
rather than the original values.

Data on a common scale

A necessary condition of the data for CA is that all observations are on the
same scale: for example, counts of particular individuals in a frequency table,
a common monetary unit in the table of research investments, centimeters
in the morphometric study. It would make no sense in CA to analyse data
with mixed scales of measurement, unless a pre-transformation is conducted
to homogenize the scales of the whole table. Most of the data sets in this book
are frequency data, but in Chapter 26 we shall look at a wide variety of other
types of data and ways of recoding them to be suitable for CA.

SUMMARY: Profiles and the Profile Space

1. The profile of a set of frequencies (or any other amounts that are positive
or zero) is the set of frequencies divided by their total, i.e. the set of relative
frequencies.
2. In the case of a cross-tabulation, the rows or columns define sets of fre-
quencies which can be expressed relative to their respective totals to give
row profiles or column profiles.
3. The marginal frequencies of the cross-tabulation can also be expressed
relative to their common total (i.e. the grand total of the table) to give the
average row profile and average column profile.
4. Comparing row profiles to their average leads to the same conclusions as
comparing column profiles to their average.
5. Profiles consisting of m elements can be plotted as points in an m -dimen-
sional space. Because their m elements add up to 1, these profile points
occupy a restricted region of this space, an (m –1)-dimensional subspace
known as a simplex. This simplex is enclosed within the edges joining all
pairs of the m unit vectors on the m perpendicular axes. These unit points
are also called the vertices of the simplex or profile space. The coordinate
system within this simplex is known as the barycentric coordinate system.
6. A special case that is easy to visualize is when the profiles have three
elements, so that the simplex is simply a triangle that joins the three
vertices. This special case of the barycentric coordinate system is known
as the triangular (or ternary) coordinate system.
7. The idea of a profile can be extended to data on a ratio scale where it is
of interest to study relative values. In this case the set of numbers being
profiled should all have the same scale of measurement.

Chapter 3: Masses and Centroids

There is an equivalent way of thinking about the positions of the profile points
in the profile space, and this will be useful to our eventual understanding and
interpretation of correspondence analysis (CA). This is based on the notion
of a weighted average, or centroid, of a set of points. In the calculation of
an ordinary (unweighted) average, each point receives equal weight, whereas
a weighted average allows different weights to be associated with each point.
When the points are weighted differently, then the centroid does not lie exactly
at the “geographical” centre of the cloud of points, but tends to lie in a position
closer to the points with higher weight.

Contents
Data set 2: Readership and education groups . . . . . . . . . . . . . . . 17
Points as weighted averages . . . . . . . . . . . . . . . . . . . . . . . . . 18
Profile values are weights assigned to the vertices . . . . . . . . . . . . . 19
Each profile point is a weighted average, or centroid, of the vertices . . . 19
Average profile is also a weighted average of the profiles themselves . . . 20
Row and column masses . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Interpretation in the profile space . . . . . . . . . . . . . . . . . . . . . . 21
Merging rows or columns . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Distributionally equivalent rows or columns . . . . . . . . . . . . . . . . 23
Changing the masses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
SUMMARY: Masses and Centroids . . . . . . . . . . . . . . . . . . . . . 24

Data set 2: Readership and education groups

We now use a typical set of data in social science research, a cross-tabulation
(or “cross-classification”) of two variables from a survey. The table, given in
Exhibit 3.1, concerns 312 readers of a certain newspaper, in particular their
level of thoroughness in reading the newspaper. Based on data collected in the
survey, each respondent was classified into one of three groups: glance readers,
fairly thorough readers and very thorough readers. These reading classes have
been cross-tabulated against education, an ordinal variable with five categories
ranging from some primary education to some tertiary education. Exhibit 3.1
shows the raw frequencies and the education group profiles in parentheses,
i.e. the row profiles. The triangular coordinate plot of the row profiles, in the
style described in Chapter 2, is given in Exhibit 3.2. In this display the corner
points, or vertices, of the triangle represent the three readership groups —
remember that each vertex is at the position of a “pure” row profile totally
concentrated into that category; for example, the very thorough vertex C3
is representing a fictitious row profile of [0 0 1] that contains 100% very
thorough readers.

Exhibit 3.1: Cross-tabulation of education group by readership class, showing
row profiles and average row profile in parentheses, and the row masses
(relative values of row totals).

EDUCATION GROUP             Glance    Fairly      Very        Total   Row
                                      thorough    thorough            masses
                            C1        C2          C3
Some primary (E1)           5         7           2           14      0.045
                            (0.357)   (0.500)     (0.143)
Primary completed (E2)      18        46          20          84      0.269
                            (0.214)   (0.548)     (0.238)
Some secondary (E3)         19        29          39          87      0.279
                            (0.218)   (0.333)     (0.448)
Secondary completed (E4)    12        40          49          101     0.324
                            (0.119)   (0.396)     (0.485)
Some tertiary (E5)          3         7           16          26      0.083
                            (0.115)   (0.269)     (0.615)
Total                       57        129         126         312
Average row profile         (0.183)   (0.413)     (0.404)
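
The quantities in Exhibit 3.1 can be recomputed from the raw counts. The following is a minimal sketch in Python (the book's own code is in R, in the Computing Appendix); the variable names are illustrative assumptions, not taken from the book.

```python
# Counts from Exhibit 3.1: rows E1..E5, columns C1 (glance),
# C2 (fairly thorough), C3 (very thorough).
counts = {
    "E1": [5, 7, 2],
    "E2": [18, 46, 20],
    "E3": [19, 29, 39],
    "E4": [12, 40, 49],
    "E5": [3, 7, 16],
}

grand_total = sum(sum(row) for row in counts.values())

# Row profile: each frequency divided by its row total.
row_profiles = {k: [x / sum(row) for x in row] for k, row in counts.items()}

# Row mass: row total divided by the grand total.
row_masses = {k: sum(row) / grand_total for k, row in counts.items()}

# Average row profile: column totals divided by the grand total.
col_totals = [sum(col) for col in zip(*counts.values())]
average_row_profile = [t / grand_total for t in col_totals]

for k in counts:
    print(k, [round(p, 3) for p in row_profiles[k]], round(row_masses[k], 3))
print("average", [round(p, 3) for p in average_row_profile])
```

Rounded to three decimals, the printed values should agree with the parenthesized profiles and the row masses in the exhibit.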

Exhibit 3.2: Row profiles (education groups) of Exhibit 3.1 depicted in
triangular coordinates, also showing the position of the average row profile
(last row of Exhibit 3.1).
[Figure: triangle with vertices C1, C2 and C3, containing the profile points
E1 to E5 and the average profile, marked *.]

Points as weighted averages

Another way to think of the positions of the education groups in the triangle
is as weighted averages. Assigning weights to the values of a variable is
a well-known concept in statistics. For example, in a class of 26 students,
suppose that the average grade turns out to be 7.5, calculated by summing
the 26 grades and dividing by 26. In fact, 3 students obtain the grade of 9, 7
students obtain an 8, and 16 students obtain a 7, so that the average grade
can be determined equivalently by assigning weights of 3/26 to the grade of
9, 7/26 to the grade of 8 and 16/26 to the grade of 7 and then calculating the
weighted average. Here the weights are the relative frequencies of each grade,
and because the grade of 7 has more weight than the others, the weighted
average of 7.5 is “closer” to this grade, whereas the ordinary arithmetic average
of the three values 7, 8 and 9 is clearly 8.
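
The grade example can be checked in a few lines. This is only an illustrative sketch; the variable names are assumptions.

```python
grades = [9, 8, 7]
n_students = [3, 7, 16]          # how many students obtained each grade

total = sum(n_students)          # 26 students
weights = [n / total for n in n_students]   # relative frequencies

# Weighted average: each grade counts in proportion to its frequency.
weighted_average = sum(w * g for w, g in zip(weights, grades))

# Ordinary average: each distinct grade counts equally.
ordinary_average = sum(grades) / len(grades)

print(weighted_average, ordinary_average)
```

The weighted average is pulled towards 7 because that grade carries the weight 16/26.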

Profile values are weights assigned to the vertices

Looking at the last row of data in Exhibit 3.1, for education group E5 (some
tertiary education), we see the same frequencies of 3, 7 and 16 for the three
respective readership groups, and associated relative frequencies of 0.115, 0.269
and 0.615. The idea now is to imagine 3 cases situated at the glance vertex C1
of the triangle, 7 cases at the fairly thorough vertex C2 and 16 cases at the very
thorough vertex C3 , and then consider what would be the average position for
these 26 cases. In other words, we do not associate the weights with values
of a variable but with positions in the profile space, in this case the positions
of the vertex points. There are more cases at the very thorough corner, so we
would expect the average position of E5 to be closer to this vertex, as is indeed
the case. For the same reason, row profile E1 lies far from the very thorough
corner C3 because it has a very low weight (2 out of 14, or 0.143) on this
category. Hence each row profile point is positioned within the triangle as an
average point, where the profile values, i.e. relative frequencies, serve as the
weights allocated to the vertices. Thus, we can think of the profile values not
only as coordinates in a multidimensional space, but also as weights assigned
to the vertices of a simplex. This idea can be extended to higher-dimensional
profiles: for example, a profile with four elements is also at an average position
with respect to the four corners of a three-dimensional tetrahedron, weighted
by the respective profile elements.

Each profile point is a weighted average, or centroid, of the vertices

Alternative terms for weighted average are centroid or barycentre. Some
particular examples of weighted averages in the profile space are given in Exhibit
3.3. For example, the profile point [ 1/3 1/3 1/3 ], which gives equal weight to
the three corners, is positioned exactly at the centre of the triangle, equidistant
from the corners, in other words at the ordinary average position of the
three vertices. The profile [ 1/2 1/2 0 ] is at a position midway between the
first and second vertices, since it has equal weight on these two vertices and
zero weight on the third vertex. In general, we can write a formula for the
position of a profile as the centroid of the three vertices as follows, for a profile
[ a b c ] where a + b + c = 1:

centroid position = (a × vertex 1) + (b × vertex 2) + (c × vertex 3)

For example, the position of education group E5 in Exhibit 3.2 is obtained as
follows:

E5 = (0.115 × glance) + (0.269 × fairly thorough) + (0.615 × very thorough)

Similarly, the position of the average profile is also a weighted average of the
vertex points:

average = (0.183 × glance) + (0.413 × fairly thorough) + (0.404 × very thorough)
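
The centroid formula can be turned into plotting coordinates once the vertices are given positions in the plane. The sketch below assumes an equilateral triangle of unit side with C1 at bottom left, C2 at bottom right and C3 at the top; these vertex positions and the function name `centroid` are assumptions made for illustration, not taken from the book.

```python
import math

# Assumed 2D positions of the three vertices (unit-side equilateral triangle).
vertices = {
    "C1": (0.0, 0.0),
    "C2": (1.0, 0.0),
    "C3": (0.5, math.sqrt(3) / 2),
}

def centroid(profile):
    """Weighted average of the three vertices, using the profile as weights."""
    a, b, c = profile
    (x1, y1), (x2, y2), (x3, y3) = (vertices[k] for k in ("C1", "C2", "C3"))
    return (a * x1 + b * x2 + c * x3, a * y1 + b * y2 + c * y3)

e5 = centroid([3 / 26, 7 / 26, 16 / 26])             # education group E5
average = centroid([57 / 312, 129 / 312, 126 / 312])  # average row profile
centre = centroid([1 / 3, 1 / 3, 1 / 3])              # equal weights
midpoint = centroid([1 / 2, 1 / 2, 0])                # on the edge C1-C2

print(e5, average, centre, midpoint)
```

With these assumed positions, the vertical coordinate of a point is proportional to its weight on C3, so E5 lies higher in the triangle than the average profile.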


Exhibit 3.3: Examples of some centroids (weighted averages) of the vertices in
triangular coordinate space: the three values are the weights assigned to
vertices (C1,C2,C3).
[Figure: triangle with vertices C1 [1,0,0], C2 [0,1,0] and C3 [0,0,1], showing
the points [1/2,1/2,0], [1/3,1/3,1/3], [7/15,1/5,1/3], [1/5,1/5,3/5] and
[0,1/5,4/5].]

The average is farther from the glance corner since there is less weight on
the glance vertex than on the other two, which have approximately the same
weights (see Exhibit 3.2).

Average profile is also a weighted average of the profiles themselves

The average profile is a rather special point: not only is it a centroid of
the three vertices as we have just shown, just like any profile point, but it is
also a centroid of the five row profiles themselves, where different weights are
assigned to the profiles. Looking again at Exhibit 3.1, we notice that the row
totals are different: education group E1 (some primary education) includes only
14 respondents whereas education group E4 (secondary education completed)
has 101 respondents. In the last column of Exhibit 3.1, headed “row masses”,
we have these marginal row frequencies expressed relative to the total sample
size 312. Just as we thought of row profiles as weighted averages of the vertices,
we can think of each of the five row profile points in Exhibit 3.2 being assigned
weights according to their marginal frequencies, as if there were 14 respondents
(proportion 0.045 of the sample) at the position E1, 84 respondents (0.269 of
the sample) at the position E2, and so on. With these weights assigned to
the five profile points, the weighted average position is exactly at the average
profile point:
Average row profile = (0.045 × E1) + (0.269 × E2) + (0.279 × E3)
+ (0.324 × E4) + (0.083 × E5)
This average row profile is at a central position amongst the row profiles but
more attracted to the profiles observed with higher frequency.
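
The claim that the average row profile is the mass-weighted centroid of the five row profiles can be verified numerically. A minimal sketch with illustrative variable names:

```python
# Counts from Exhibit 3.1 (rows E1..E5, columns C1..C3).
counts = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]
n = sum(sum(row) for row in counts)

profiles = [[x / sum(row) for x in row] for row in counts]
masses = [sum(row) / n for row in counts]

# Weighted average of the row profiles, weighted by the row masses.
centroid_of_profiles = [
    sum(m * p[j] for m, p in zip(masses, profiles)) for j in range(3)
]

# Average row profile computed directly from the column totals.
average_row_profile = [sum(col) / n for col in zip(*counts)]

print([round(v, 3) for v in centroid_of_profiles])
print([round(v, 3) for v in average_row_profile])
```

The two printed vectors coincide, because multiplying each profile by its mass simply restores the original frequencies divided by the grand total.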

Row and column masses

The weights assigned to the profiles are so important in CA that they are given
a specific name: masses. The last column of Exhibit 3.1 shows the row masses:
0.045, 0.269, 0.279, 0.324 and 0.083. The word “mass” is the preferred term in
CA although it is entirely equivalent for our purpose to the term “weight”. An
alternative term such as mass is convenient here to differentiate this geometric
concept of weighting from other forms of weighting that occur in practice, such
as weights assigned to population subgroups in a sample survey.
All that has been said about row profiles and row masses can be repeated in a
similar fashion for the columns. Exhibit 3.4 shows the same contingency table
as Exhibit 3.1 from the column point of view. That is, the three columns have
been expressed as relative frequencies with respect to their column totals,
giving three profiles with five values each. The column totals relative to the
grand total are now column masses assigned to the column profiles, and the
average column profile is the set of row totals divided by the grand total.
Again, we could write the average column profile as a weighted average of the
three column profiles C1 , C2 and C3 :
Average column profile = (0.183 × C1) + (0.413 × C2) + (0.404 × C3)
Notice how the row and column masses play two different roles, as weights
and as averages: in Exhibit 3.4 the average column profile is the set of row
masses in Exhibit 3.1, and the column masses in Exhibit 3.4 are the elements
of what was previously the average row profile in Exhibit 3.1.

Exhibit 3.4: Cross-tabulation of education group by readership cluster,
showing column profiles and average column profile in parentheses, and the
column masses.

EDUCATION GROUP             Glance    Fairly      Very        Total   Average
                                      thorough    thorough            column
                            C1        C2          C3                  profile
Some primary (E1)           5         7           2           14      (0.045)
                            (0.088)   (0.054)     (0.016)
Primary completed (E2)      18        46          20          84      (0.269)
                            (0.316)   (0.357)     (0.159)
Some secondary (E3)         19        29          39          87      (0.279)
                            (0.333)   (0.225)     (0.310)
Secondary completed (E4)    12        40          49          101     (0.324)
                            (0.211)   (0.310)     (0.389)
Some tertiary (E5)          3         7           16          26      (0.083)
                            (0.053)   (0.054)     (0.127)
Total                       57        129         126         312
Column masses               0.183     0.413       0.404
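
The two roles played by the masses can be checked from the column side. The following sketch, with assumed variable names, computes the column profiles and column masses and confirms that the average column profile equals the set of row masses.

```python
# Counts from Exhibit 3.1 (rows E1..E5, columns C1..C3).
counts = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]
n = sum(sum(row) for row in counts)

columns = list(zip(*counts))                 # three columns of five counts
col_totals = [sum(col) for col in columns]

# Column profile: each frequency divided by its column total.
col_profiles = [[x / t for x in col] for col, t in zip(columns, col_totals)]

# Column mass: column total divided by the grand total.
col_masses = [t / n for t in col_totals]

# Average column profile as the mass-weighted average of the column profiles.
average_col_profile = [
    sum(m * p[i] for m, p in zip(col_masses, col_profiles)) for i in range(5)
]

row_masses = [sum(row) / n for row in counts]

print([round(v, 3) for v in average_col_profile])
print([round(v, 3) for v in row_masses])
```

Both printed vectors are the row masses, and `col_masses` reproduces the average row profile of Exhibit 3.1.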

Interpretation in the profile space

At this point, even though the final key concepts in CA still remain to be
explained, it is possible to make a brief interpretation of Exhibit 3.2. The
vertices of the triangle represent the “pure profiles” of readership categories C1 ,
C2 and C3 , whereas the education groups are “mixtures” of these readership
categories and find their positions within the triangle in terms of their
respective proportions of each of the three categories. Notice the following aspects
of the display:

• The degree of spread of the profile points within the triangle gives an idea
of how much variation there is in the contingency table. The closer the profile
points lie to the centroid, the less variation there is, and the more they
deviate from the centroid, the more variation. The profile space is bounded
and the most extreme profiles will lie near the sides of the triangle, or in
the most extreme case at one of the vertices (for example, an illiterate
group with profile [ 1 0 0 ] would lie on the vertex C1 ). In tables of social
science data such as this one, profiles usually occupy a small region of the
profile space close to the average because the variation in profile values
for a particular category will be relatively small. For example, the range
in the first element (i.e. readership category C1 ) across the profiles is only
from 0.115 to 0.357 (Exhibit 3.1), in a potential range from 0 to 1. In
contrast, for data in ecological research, as we shall see later, the range of
profile values is much higher, usually because of many zero frequencies in
the table — the profiles are then more spread out inside the profile space
(see the second example in Chapter 10).
• The profile points are stretched out in what is called a “direction of spread”
more or less from the bottom to the top of the display. Looking from the
bottom upwards, the five education group profiles lie in their natural or-
der of increasing educational qualifications, from E1 to E5. At the top,
group E5 lies closest to the vertex C3, which represents the highest cat-
egory of very thorough reading — we have already seen that this group
has the highest proportion (0.615) of these readers. At the bottom, the
lowest educational group is not far from the edge of the triangle which we
know displays profiles with zero C3 readers (for example, see the point
[ 1/2 1/2 0 ] in Exhibit 3.3 as an illustration of a point on the edge). The
interpretation of this pattern would be that as we move up from the bot-
tom of this display to the top, from lower to higher education, the profiles
are generally changing with respect to their relative frequency of type C3
as opposed to that of C1 and C2 combined, while there is no particular
tendency towards either C1 or C2 . In addition, the relative frequency of
C1 is decreasing as the education points move away from C1 towards the
edge joining C2 and C3 .

Merging rows or columns

Suppose we wanted to combine the two categories of primary education, E1
and E2, into a new row of Exhibit 3.1, denoted by E1&2. There are two ways
of thinking about this. First, add the two rows together to obtain the row
of frequencies [ 23 53 22 ], with total 98 and profile [ .235 .541 .224 ]. The
alternative way is to think of the profile of E1&2 as the weighted average of
the profiles of E1 and E2:

$$[\,.235\ \ .541\ \ .224\,] = \frac{.045}{.314} \times [\,.357\ \ .500\ \ .143\,] + \frac{.269}{.314} \times [\,.214\ \ .548\ \ .238\,]$$

where the masses of E1 and E2 are .045 and .269, with sum .314 (notice that
the weights in this weighted average are identical to 14/98 and 84/98, where
14 and 84 are the totals of rows E1 and E2). Geometrically, E1&2’s profile
lies on a line between E1 and E2, but closer to E2 as shown in Exhibit 3.5.
The distances from E1 to E1&2 and E2 to E1&2 are in the same proportion
as the totals 84 and 14 respectively; i.e. 6 to 1. E1&2 can be thought of as
the balancing point of the two masses situated at E1 and E2, with the heavier
mass at E2.

Exhibit 3.5: Enlargement of positions of E1 and E2 in Exhibit 3.2, showing
the position of the point E1&2 which merges the two categories; E2 has 6 times
the mass of E1, hence E1&2 lies closer to E2 at a point which splits the line
between the points in the ratio 84:14 = 6:1.
[Figure: the points E1 and E2 with the merged point E1&2 on the line between
them.]
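
The two ways of obtaining the merged profile can be compared directly. A minimal sketch, with assumed variable names:

```python
e1 = [5, 7, 2]       # row E1 of Exhibit 3.1, total 14
e2 = [18, 46, 20]    # row E2 of Exhibit 3.1, total 84

# First way: add the frequencies, then take the profile of the merged row.
merged_counts = [a + b for a, b in zip(e1, e2)]
merged_total = sum(merged_counts)
profile_by_adding = [x / merged_total for x in merged_counts]

# Second way: weighted average of the two profiles, with weights proportional
# to the row totals (equivalently, to the row masses).
p1 = [x / sum(e1) for x in e1]
p2 = [x / sum(e2) for x in e2]
w1 = sum(e1) / merged_total      # 14/98
w2 = sum(e2) / merged_total      # 84/98
profile_by_averaging = [w1 * a + w2 * b for a, b in zip(p1, p2)]

print(merged_counts, [round(v, 3) for v in profile_by_adding])
print([round(v, 3) for v in profile_by_averaging])
```

Because the weights are 1/7 and 6/7, the merged point divides the segment from E1 to E2 in the ratio 6:1, lying closer to E2.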

Distributionally equivalent rows or columns

Suppose that we had an additional row of data in Exhibit 3.1, a category of
“no formal education” denoted by E0, with frequencies [ 10 14 4 ] across the
reading categories. The profile of E0 is identical to E1’s profile, because the
frequencies in E0 are simply twice those of E1. The two sets of frequencies are
said to be distributionally equivalent. Thus the profiles of E0 and E1 are at
exactly the same point in the profile space, and can be merged into one point
with mass equal to the combined masses of the two profiles, i.e. a single point
with frequencies [ 15 21 6 ].
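
Distributional equivalence is easy to check numerically. In this sketch the helper name `profile` is an assumption introduced for illustration.

```python
def profile(freqs):
    """Frequencies divided by their total."""
    total = sum(freqs)
    return [x / total for x in freqs]

e0 = [10, 14, 4]     # the additional "no formal education" row
e1 = [5, 7, 2]       # row E1 of Exhibit 3.1

# Proportional rows have the same profile, so they occupy the same point.
same_profile = all(abs(a - b) < 1e-12 for a, b in zip(profile(e0), profile(e1)))

# Merging adds the frequencies (and hence the masses); the profile is unchanged.
merged = [a + b for a, b in zip(e0, e1)]

print(same_profile, merged, [round(v, 3) for v in profile(merged)])
```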

Changing the masses

The row and column masses are proportional to the marginal sums of the
table. If the masses need to be modified for a substantive reason, this can be
achieved by a simple transformation of the table. For example, suppose that we
require the five education groups of Exhibit 3.1 to have masses proportional to
their population sizes rather than their sample sizes. Then the table is rescaled
by multiplying each education group profile by its respective population size.
The row profiles of this new table are identical to the original row profiles, but
the row masses are now proportional to the population sizes. Alternatively,
suppose that the education groups are required to be weighted equally, rather
than differentially as described up to now. If we regard the table of row profiles
(or, equivalently, of row percentages) as the original table, then this table has
row sums equal to 1 (or 100%), so that each education group is weighted
equally. Hence analysing the table of profiles implies weighting each profile
equally.
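
The rescaling described above can be sketched as follows. The population sizes used here are purely hypothetical numbers invented for illustration; they do not come from the book or from any real survey.

```python
# Counts from Exhibit 3.1 (rows E1..E5, columns C1..C3).
counts = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]
profiles = [[x / sum(row) for x in row] for row in counts]

# HYPOTHETICAL population sizes of the five education groups (assumption).
population = [50_000, 300_000, 250_000, 200_000, 100_000]

# Rescale: multiply each row profile by its population size.
rescaled = [[size * p for p in prof] for prof, size in zip(profiles, population)]

total = sum(sum(row) for row in rescaled)
new_profiles = [[x / sum(row) for x in row] for row in rescaled]
new_masses = [sum(row) / total for row in rescaled]

# Equal weighting: treat the table of profiles itself as the data table.
profile_total = sum(sum(row) for row in profiles)
equal_masses = [sum(row) / profile_total for row in profiles]

print([round(m, 3) for m in new_masses])
print([round(m, 3) for m in equal_masses])
```

The row profiles are unchanged by the rescaling; only the masses move, to proportions of the assumed population sizes in the first case and to 1/5 each in the second.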

SUMMARY: Masses and Centroids

1. We assume that we are analysing a table of data and are concerned with
the row problem, i.e. where the row profiles are plotted in the simplex space
defined by the column vertices. Then each vertex point represents a column
category in the sense that a row profile that is entirely concentrated into
that category would lie exactly at that vertex point.
2. Each profile can be interpreted as the centroid (or weighted average) of the
vertex points, where the weights are the individual elements of the profile.
Thus a profile will tend to lie closer to those vertices for which it has higher
values.
3. Each row profile in turn has a unique weight associated with it, called
a mass, which is proportional to the row sum in the original table. The
average row profile is then the centroid of the row profiles, where each
profile is weighted by its mass in the averaging process.
4. Everything described above for row profiles applies equally to the columns
of the table. In fact, the best way to make the jump from rows to columns
is to re-express the table in its transposed form, where columns become
rows, and vice versa — then everything applies exactly as before.
5. Rows (or columns) that are combined by aggregating their frequencies have
a profile equal to the weighted average of the profiles of the component rows
(or columns).
6. Rows (or columns) that have the same profile are said to be distributionally
equivalent and can be combined into a single point with a mass equal to
the sum of the masses of the combined rows (or columns).
7. Row (or column) masses can be modified to be proportional to prescribed
values by a simple rescaling of the rows (or columns).

Chapter 4: Chi-Square Distance and Inertia

In correspondence analysis (CA) the way distance is measured between profiles
is a bit more complicated than the one that was used implicitly when we
drew and interpreted the profile plots in Chapters 2 and 3. Distance in CA is
measured using the so-called chi-square distance and this distance is the key
to the many favourable properties of CA. There are several ways to justify
the chi-square distance: some are more technical and beyond the scope of
this book, while other explanations are more intuitive (see Appendix B, pages
270–271 for one theoretical justification). In this chapter we choose the latter
approach, starting with a geometric explanation of the well-known chi-square
statistic computed on a contingency table. All the ideas embodied in the chi-
square statistic carry over to the chi-square distance in CA and to the related
concept of inertia, which is the way CA measures variation in a data table.

Contents
Hypothesis of independence or homogeneity for a contingency table . . 25
Chi-square (χ2 ) statistic to test the homogeneity hypothesis . . . . . . . 26
Calculating the χ2 statistic . . . . . . . . . . . . . . . . . . . . . . . . . 27
Alternative expression of the χ2 statistic in terms of profiles and masses 27
(Total) inertia is the χ2 statistic divided by sample size . . . . . . . . . 28
Euclidean, or Pythagorian, distance . . . . . . . . . . . . . . . . . . . . 28
Chi-square distance: An example of a weighted Euclidean distance . . . 29
Geometric interpretation of inertia . . . . . . . . . . . . . . . . . . . . . 29
Minimum and maximum inertia . . . . . . . . . . . . . . . . . . . . . . 29
Inertia of rows is equal to inertia of columns . . . . . . . . . . . . . . . . 30
Some notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
SUMMARY: Chi-Square Distance and Inertia . . . . . . . . . . . . . . . 32

Hypothesis of independence or homogeneity for a contingency table

Consider the data in Exhibit 3.1 again. Notice that, of the sample of 312
people, 57 (or 18.3%) are in readership category C1 (“glance”), 129 (41.3%)
in C2 (“fairly thorough”) and 126 (40.4%) in C3 (“very thorough”); i.e. the
average row profile is the set of proportions [ 0.183 0.413 0.404 ]. If there were
no difference between the education groups as far as readership is concerned,
we would expect that the profile of each row is more or less the same as the
average profile, and would differ from it only because of random sampling
fluctuations. Assuming no difference, or in other words assuming that the
education groups are homogeneous with respect to their reading habits, what
would we have expected the frequencies in row E5, for example, to be? There
are 26 people in the E5 education group, and we would thus have expected
18.3% of them to be in category C1; i.e. 26 × 0.183 = 4.76 (although it is
ridiculous to talk of 0.76 of a person, it is necessary to maintain such fractions
in these calculations). Likewise, we would have expected 26 × 0.413 = 10.74
of the E5 subjects to be in category C2, and 26 × 0.404 = 10.50 in category
C3. There are various names in the literature given to this “assumption of
no difference” between the rows of a contingency table (or, similarly, between
the columns): the “hypothesis of independence” is one of them, or perhaps
more aptly for our purpose here, the “homogeneity assumption”. Under the
homogeneity assumption, we would therefore have expected the frequencies for
E5 to be [ 4.76 10.74 10.50 ], but in reality they are observed to be [ 3 7 16 ].
In a similar fashion we can compute what each row of frequencies would be
if the assumption of homogeneity were exactly true. Exhibit 4.1 shows the
expected values in each row underneath their corresponding observed values.
Notice that exactly the same expected frequencies are calculated if we argue
from the point of view of column profiles, i.e. assuming homogeneity of the
readership groups.

Exhibit 4.1: Observed frequencies, as given in Exhibit 3.1, along with
expected frequencies (in parentheses) calculated assuming the homogeneity
assumption to be true.

EDUCATION GROUP             Glance    Fairly      Very        Total   Row
                                      thorough    thorough            masses
                            C1        C2          C3
Some primary (E1)           5         7           2           14      0.045
                            (2.56)    (5.78)      (5.66)
Primary completed (E2)      18        46          20          84      0.269
                            (15.37)   (34.69)     (33.94)
Some secondary (E3)         19        29          39          87      0.279
                            (15.92)   (35.93)     (35.15)
Secondary completed (E4)    12        40          49          101     0.324
                            (18.48)   (41.71)     (40.80)
Some tertiary (E5)          3         7           16          26      0.083
                            (4.76)    (10.74)     (10.50)
Total                       57        129         126         312
Average row profile         0.183     0.413       0.404
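
The expected frequencies under homogeneity can be computed as row total × column total / grand total. The sketch below uses unrounded proportions, so a few values differ in the second decimal from those quoted in the text, which multiplies by the rounded profile values (for example 26 × 57/312 = 4.75 exactly, against 26 × 0.183 = 4.76). Variable names are assumptions.

```python
# Observed counts from Exhibit 3.1 (rows E1..E5, columns C1..C3).
observed = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]
n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Expected frequency: row total times the average row profile element,
# which is the same as column total times the average column profile element.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, row in zip(["E1", "E2", "E3", "E4", "E5"], expected):
    print(label, [round(v, 2) for v in row])
```

The formula is symmetric in rows and columns, which is why arguing from the column profiles leads to the same expected table.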

Chi-square (χ²) statistic to test the homogeneity hypothesis

It is clear that the observed frequencies are always going to be different from
the expected frequencies. The question statisticians now ask is whether these
differences are large enough to contradict the assumed hypothesis that the
rows are homogeneous, in other words whether the discrepancies between
observed and expected frequencies are so large that it is unlikely they could have
arisen by chance alone. This question is answered by computing a measure
of discrepancy between all the observed and expected frequencies, as follows.
Each difference between an observed and expected frequency is computed,
then this difference is squared and finally divided by the expected frequency.
This calculation is repeated for all pairs of observed and expected frequencies
and the results are accumulated into a single figure, the chi-square statistic,
denoted by χ²:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

Calculating the χ² statistic

Because there are 15 cells in this 5-by-3 (or 5 × 3) table, there will be 15 terms
in this computation. For purposes of illustration we show only the first three
and last three terms corresponding to rows E1 and E5:

$$\begin{aligned}
\chi^2 = {} & \frac{(5-2.56)^2}{2.56} + \frac{(7-5.78)^2}{5.78} + \frac{(2-5.66)^2}{5.66} + \cdots \\
& + \frac{(3-4.76)^2}{4.76} + \frac{(7-10.74)^2}{10.74} + \frac{(16-10.50)^2}{10.50}
\end{aligned} \tag{4.1}$$
The grand total of the 15 terms in this calculation turns out to be equal to
26.0. The larger this value, the more discrepant the observed and expected
frequencies are, i.e. the less convinced we are that the assumption of homo-
geneity is correct. In order to judge whether this value of 26.0 is large or small,
we use probabilities of the chi-square distribution corresponding to the “de-
grees of freedom” associated with the statistic. For a 5 × 3 table, the degrees
of freedom are 4 × 2 = 8 (one less than the number of rows multiplied by one
less than the number of columns), and the p-value associated with the value
26.0 of the χ² statistic with 8 degrees of freedom is p = 0.001. This result
tells us that, if the homogeneity assumption were true, there would be an
extremely small probability (about one in a thousand) of observing
discrepancies at least as large as those in Exhibit 4.1. In other words, we reject
the homogeneity of the table
and conclude that it is highly likely that real differences exist between the
education groups in terms of their readership profiles.
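
The whole calculation can be reproduced with the Python standard library alone. This is a sketch under stated assumptions: the function name `chi2_upper_tail_even_df` is introduced here for illustration, and it relies on the closed form of the chi-square upper tail probability that holds only when the degrees of freedom are even, as they are here (8).

```python
import math

# Observed counts from Exhibit 4.1 (rows E1..E5, columns C1..C3).
observed = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]
n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Accumulate (observed - expected)^2 / expected over all 15 cells.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) x (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable; valid only for even df."""
    half = x / 2
    terms = (half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)

p_value = chi2_upper_tail_even_df(chi2, df)
print(round(chi2, 1), df, round(p_value, 3))
```

For tables whose degrees of freedom are odd, a statistical library would be needed for the tail probability instead of this closed form.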

Alternative expression of the χ² statistic in terms of profiles and masses

The statistical test of homogeneity described above is relevant to statistical
inference, but we are more interested here in the ability of the χ² statistic
to measure discrepancy from homogeneity, in other words to measure
heterogeneity of the profiles. We shall now re-express the χ² statistic in a different
form by dividing the numerator and denominator of each set of three terms
for a particular row by the square of the corresponding row total. For
example, looking just at the last three terms of the χ² calculation given in (4.1)
above, we divide the numerator and denominator of each term by the square
of E5’s total, i.e. 26², in order to obtain observed and expected profiles in the
numerators rather than the original raw frequencies:

$$\begin{aligned}
\chi^2 &= 12 \text{ similar terms} \cdots + \frac{\left(\frac{3}{26} - \frac{4.76}{26}\right)^2}{\frac{4.76}{26^2}} + \frac{\left(\frac{7}{26} - \frac{10.74}{26}\right)^2}{\frac{10.74}{26^2}} + \frac{\left(\frac{16}{26} - \frac{10.50}{26}\right)^2}{\frac{10.50}{26^2}} \\
&= 12 \text{ similar terms} \cdots \\
&\quad + 26 \times \frac{(0.115-0.183)^2}{0.183} + 26 \times \frac{(0.269-0.413)^2}{0.413} + 26 \times \frac{(0.615-0.404)^2}{0.404}
\end{aligned} \tag{4.2}$$

Notice in the last line above that one of the factors of 26 in each denominator
has been taken out, so that the denominators are also profile values, equal to
the average profile values. Each of the 15 terms in this calculation is thus of
the form

$$\text{row total} \times \frac{(\text{observed row profile} - \text{expected row profile})^2}{\text{expected row profile}}$$

(Total) inertia is the χ² statistic divided by sample size

We now make one more modification of the χ² calculation above to bring
it into line with the CA concepts introduced so far: we divide both sides of
Equation (4.2) by the total sample size so that each term involves an initial
multiplying factor equal to the row mass rather than the row total:

$$\begin{aligned}
\frac{\chi^2}{312} = {} & 12 \text{ similar terms} \cdots \\
& + 0.083 \times \frac{(0.115-0.183)^2}{0.183} + 0.083 \times \frac{(0.269-0.413)^2}{0.413} + 0.083 \times \frac{(0.615-0.404)^2}{0.404}
\end{aligned} \tag{4.3}$$

where 0.083 = 26/312 is the mass of row E5 (see Exhibit 4.1). The quantity
χ²/n on the left-hand side, where n is the grand total of the table, is called
the total inertia in CA, or simply the inertia. It is a measure of how much
variance there is in the table and does not depend on the sample size. In
statistics this quantity has alternative names such as the mean-square contingency
coefficient, and its square root is known as the phi coefficient (φ); hence we
can denote the inertia by φ². If we gather together terms in (4.3) in groups of
three corresponding to a particular row, we obtain the following form for the
inertia:

$$\begin{aligned}
\frac{\chi^2}{312} = \phi^2 = {} & 4 \text{ similar groups of terms} \cdots \\
& + 0.083 \times \left[ \frac{(0.115-0.183)^2}{0.183} + \frac{(0.269-0.413)^2}{0.413} + \frac{(0.615-0.404)^2}{0.404} \right]
\end{aligned} \tag{4.4}$$
Each of the five groups of terms in this formula, one for each row of the table,
is the row mass (e.g. 0.083 for row E5) multiplied by a quantity in square
brackets which looks like a distance measure (or, to be precise, the square of
a distance).
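
As a rough numerical check of Equations (4.2) to (4.4), the following Python sketch recomputes the three E5 terms in each of the forms used above. It relies only on the figures quoted in the text for row E5 (observed counts 3, 7, 16; expected counts 4.76, 10.74, 10.50; row total 26; grand total 312). The variable names are illustrative assumptions, and the other four rows of the table are not reproduced here, so the result is E5's share of the inertia rather than the total inertia.

```python
# Figures for row E5 as quoted in the text; the other rows are not available here.
observed = [3, 7, 16]            # observed frequencies in row E5
expected = [4.76, 10.74, 10.50]  # expected frequencies in row E5
row_total = 26                   # total of row E5
n = 312                          # grand total of the table

# Form 1: the usual chi-square terms on raw frequencies.
raw_terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

# Form 2 (Equation 4.2): row total times a comparison of profiles.
obs_profile = [o / row_total for o in observed]
exp_profile = [e / row_total for e in expected]
bracket = sum((p - q) ** 2 / q for p, q in zip(obs_profile, exp_profile))
chi2_e5 = row_total * bracket

# Form 3 (Equations 4.3 and 4.4): row mass times the bracketed quantity.
mass = row_total / n
inertia_e5 = mass * bracket

print("E5 contribution to chi-square:", round(chi2_e5, 3))
print("row mass of E5:", round(mass, 3))
print("E5 contribution to inertia:", round(inertia_e5, 4))
```

The three forms agree up to floating-point rounding, which is the point of the algebra above: dividing by the row total and then by the grand total rescales the terms without changing their relative sizes.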

Euclidean, or Pythagorian, distance

In (4.4) above, if it were not for the fact that each squared difference between
observed and expected row profile elements is divided by the expected
element, then the quantity in square brackets would be exactly the square of the
“straight-line” regular distance between the row profile E5 and the average
profile in three-dimensional physical space. This distance is also called the
Euclidean distance or the Pythagorian distance. Let us state this in another
way so that it is fully understood. Suppose we plot the two profile points
[ 0.115 0.269 0.615 ] and [ 0.183 0.413 0.404 ] with respect to three
perpendicular axes. Then the distance between them would be the square root of the
sum of squared differences between the coordinates, as follows:
\[
\text{Euclidean distance} = \sqrt{(0.115 - 0.183)^2 + (0.269 - 0.413)^2 + (0.615 - 0.404)^2}
\tag{4.5}
\]
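
Equation (4.5) can be evaluated directly. The short Python sketch below does so for the two rounded profiles quoted in the text and, for comparison, also evaluates the bracketed quantity of (4.4), in which each squared difference is divided by the corresponding average profile element. The variable names are assumptions made for illustration only.

```python
import math

# Rounded profiles as quoted in the text.
e5_profile = [0.115, 0.269, 0.615]
average_profile = [0.183, 0.413, 0.404]

# Squared coordinate differences, as in Equation (4.5).
squared_diffs = [(p - q) ** 2 for p, q in zip(e5_profile, average_profile)]

# Ordinary (Euclidean) distance: square root of the sum of squared differences.
euclidean = math.sqrt(sum(squared_diffs))

# The same distance via the standard library (math.dist, Python 3.8+).
euclidean_check = math.dist(e5_profile, average_profile)

# Bracketed quantity of (4.4): each squared difference divided by the
# corresponding element of the average profile.
weighted_square = sum(d / q for d, q in zip(squared_diffs, average_profile))

print("Euclidean distance:", round(euclidean, 4))
print("squared Euclidean distance:", round(sum(squared_diffs), 4))
print("bracketed quantity in (4.4):", round(weighted_square, 4))
```

Because every element of the average profile is less than 1, dividing by it inflates each squared difference, so the bracketed quantity of (4.4) is larger than the squared Euclidean distance.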
afterwards. This infallible cure consists of the following: wading every
day through deep snowdrifts with wet, sometimes frozen, boots on one's
feet, night and day. Now and then, when the occasion arises, one also lies
for six or seven hours in snow that melts to water under a warm body.
Running for kilometres in a hail of bullets is the best remedial exercise
for aching limbs, and marches of many miles do a great deal for one's
general well-being. "Here is the best and cheapest treatment for you too,
Heinonen. As a bonus, Mannerheim may even pin a red-and-yellow ribbon on
your chest. It could happen."

The fun at poor Heinonen's expense would no doubt have gone on for a good
while, but the door of the cabin opened and a spry young lad stepped up to
the lieutenant sitting at the head of the table, saluted smartly and said:

"Company commander! We have five prisoners here, whom my comrade and I
took at the sentry post."

The young lieutenant glanced with a smile at the little soldier, saying to
himself:

"It is with lads like these that the great deeds get done." He rose and
followed the boy out.

The prisoners, two of them out-and-out hooligans and three pale young
factory hands, awaited their fate in the sauna, dimly lit by a homemade
tallow candle smoking on the edge of the stove.

A short, swift interrogation settled it. The two whose faces seemed already
to have lost all humanity hurled, in the fury of their despair, a fierce
torrent of curses at the "butchers", but the boys, trembling with fear,
swore that they had gone to war at the urging of others.

"You will get the chance to show whether you have told the truth," the
company commander said to the boys. "You will now follow us behind the
front and do whatever work you are able to, and not the slightest harm will
come to you. But at the least sign of scheming your road ends at once. You
do understand!"

"Yes, Lieutenant. Let us be shot on the spot if the slightest trickery is
noticed."

Indescribable gratitude shone from the young prisoners' faces. They had not
expected treatment like this, for terrible was the fate of a White prisoner
who fell into the hands of the Reds.

"Poor children," Jouko sighed to himself. Boys are dragged here from poor
homes and driven in droves into the jaws of death, only so that some
"people's delegate" hungry for power and glory may hold a little longer the
reins of this wretched, ravaged and dishonoured land, reins that will one
day pass from him to abler hands all the same.

"Go into the cabin, boys; you will get hot tea and a sandwich there." With
a gesture he ordered them to leave, so that he could settle accounts with
those street heroes from Turku, to whom he said:

"What, in your opinion, ought to be done with you?"

"Don't you have rifles you could try out?" answered one of the prisoners
with a nasty grin.

"I dare say the butchers have a few of those too," jeered the other.

Jouko's face darkened. He signalled to the guards standing in the yard, who
vanished into the darkness with the prisoners. A few moments later a volley
of three rifles rang out, twice. Jouko noticed Heinonen at his side.

"Damn it, why spare a single one of them? They ought to be shot, the whole
lot, one after another. They only mean work and headaches, where to put
them and what to feed them with, when there is barely enough food for the
army as it is."

"One hasn't the heart to shoot those children; they are not accountable for
what they did. We must surely spare as many as we can, for we are
answerable for our task.

"With those others it is a different matter. Hardened villains like that,
the outcasts of mankind, of no use to anyone in life, may well be sent off
to other realms without a qualm."

"Shooting is too easy a death for them, damn it! They ought to be tortured,
tortured, I say!"

He wrapped his fur coat more tightly about him, and of his round,
forty-year-old person nothing could be seen but the soft, somewhat flabby
cheeks and the greasy eyes goggling from under puffy lids.

"Tortured properly, tortured, I say!" he kept repeating those loathsome
words.

Jouko felt himself growing hot.

"By God, would we then be any better than the Reds? I will 'torture' you if
you don't get out of my sight; I'll stuff you back into the cellar with
your sciatica and your aching toes. Have you done your duty today?"

He shook the man by the collar of his fur coat and then pushed him through
the door into the cabin, from which earnest snoring could already be heard.
IV.

In the capital of White Finland a solemn occasion is being observed. The
flags are flying, blue-and-white, red-and-yellow and the proud lion flags,
all at half-mast, and along the main street moves a great procession; the
whole town is out to pay its respects to the first heroes fallen in the War
of Liberation.

And the horns sound and the funeral march rings out on this heavy grey
midwinter day, and the people walk in solemn silence, each face wearing a
look as though he were escorting one of his own to the final rest.

The church is full of people. Only half of those seeking entry can fit
inside; the rest wait outside at the mercy of the frost and the bitter
north wind.

And as the organ plays, the bereaved carry forward their precious sacrifice
and set the coffins down before the altar, some black, some snow-white, and
fresh green sprays of asparagus fern twine on the lids, and among them
gleam flowers, blood-red and pure white. There, too, a father is carrying a
coffin. He is a craftsman, a poor man, who had hoped to see his son's white
student cap in the spring. White was indeed the shroud he got, and before
spring had even come the handsome boy lay pale in his white coffin.

But the father walks with head held high, a radiant, solemn look on his
face; he gave his only one gladly, gave his fatherland the best he owned.

There a brother carries his brother, and behind the black coffin walks an
old mother, bowed with age. But her eyes are bright, and on her wrinkled
face there is a reflection as of some strange happiness as she escorts to
the altar her youngest son, the support of her old age, the joy and comfort
of her lonely life, and above the great sorrow these words sound in her
soul: "My sacrifice to the fatherland, since such was God's will."

In a dim corner of the gallery sits a dark-haired girl. The expressions on
her fine-featured face are delicate, and the indeterminate colour of her
eyes seems to change with her moods.

She sits leaning slightly forward, her eyes gazing into some boundless,
immeasurable distance. Her face now glows with rapt fervour; she seems sunk
in prayer:

"Now I feel the glory of your being, you White Spirit of the white army.
You, its secret Leader. With your help they overcome an enemy many times
their number; through your strength they go with joy to suffering and
death. How glorious it would be to belong to your ranks!"

The organ music sets the air trembling: melodies, words of remembrance, the
scent of flowers, tears. It is as if all were melting into one, as if each
felt the other's sorrow, as if each wept for the other's pain.

And the White Spirit of the white army spreads its blessing through the
church, over the crowd, over people who weep for one another's pain.

So too the girl in the dim corner of the gallery; her tears fell like drops
of summer rain, her eyes wept, her lips smiled. People sacrifice themselves
for one another; people have already learned to love one another…

And great, powerful emotions make her slender frame tremble.

*****

It is evening. At Alina Aura's desk an electric lamp casts a pale light
from behind its green shade. The big office ledgers open and slam shut
again, only to be opened once more. The figures in the columns blur
together, the sums become tangled and the eye goes astray as it follows the
long rows.

With trembling hands she closes the heavy book again and buries her dark
head in her hands. Strange images rise before her closed eyes: there a face
twisted with pain; there burning, feverish lips glowing with thirst for
water; there the hand of one struggling in the grip of delirium gropes in
the emptiness for a friend's hand… And outside the battle rages, rifles
crack, cannon thunder, and on the snow lie the fallen by the dozen,
stretching out their hands and begging for help as their blood slowly
drains away.

The young woman raised her head, wiped her brow and looked at the window,
from which the darkness of the winter evening stared back.

But the voice that had been born within her spoke ever more strongly, until
she was convinced that she truly heard its summons. Then resolve showed on
her face. She rose from her desk and went to prepare for the task to which
the inner voice was calling her.
V.

Along the hilly highroad moves a long line of heavily laden horses,
carrying both men and provisions to the front, which at present lies ten
Finnish miles from the railway station.

The farther they go, the more uneven the gaps between the loads become, and
in the end those in front can no longer be seen from the tail of the line;
only the distant jingle of sleigh bells confirms that they are still within
earshot.

In the last sleigh, under fur rugs, lies Alina Aura, for whom the
longed-for moment has come at last. Beside her sleeps a companion, a nurse
like herself, bound for the same destination. They have travelled like this
for many hours, for the road seems endlessly long, but although it is
midnight Sister Alina has not yet been able to fall asleep even for a
moment.

She lies still, watching the spruces dozing by the roadside, whose branches
bend under their white burden, the lowest all the way down to the
soft-crystalled, glittering field of snow. White, everything is white! The
birches in their jewels of hoarfrost, the pine boughs covered in soft,
downy snow, and above them the blue vault of the sky, from whose depths the
stars twinkle and gleam.

It is lovely to rest here, listening to the creak of the runners and the
clatter of hooves on the frozen road. The sleigh bells jingle in their
different voices, and their gentle sounds fade into the nocturnal silence
of the lonely wilderness.

It is as if they were conversing, asking: do you, traveller, understand the
language of the bells?

The driver urges on his dawdling horse, talking half to himself.

"The day before yesterday the Whites had a hot day. The enemy, at least ten
times as many, tried to encircle them but did not succeed. I wonder what it
will be like tomorrow when we get there, and whatever will be the end of
this war?"

Sister Alina was silent. She did not want to disturb her lulling feelings
with talk, but the old man had nevertheless said a word that caught her ear
against her will.

"Encircle, yes, encircle! Such a thing may happen too. An enemy force ten
times our size surrounds us, and our men fall to the last. And what of us
women? What will become of us then? If a woman falls into the hands of
those Red Russkies run wild, a fate more terrible than death awaits her.
Oh, one cannot bear to think of it!"

She tried to break free of the horrifying image that kept rising in her
mind; that image stunned her thoughts and stopped the blood in her veins.

And it is as if the bells, prattling, were asking: are you ready for
anything, even for such a fate as that? Let the men wage war; war is no
place for a woman. Why did you come, why, girl, did you come here?

Alina gathered her strength and tried to shake off her nightmare: "I came
because He called; He spoke to my soul in the church, and I felt His
nearness then. Be silent, voice of doubt! He sees me and knows what my lot
is to be."

Little by little its power subsides, the horrifying image pales and
vanishes, the choir of bells now rings more tunefully, and the young
traveller sinks into a light doze.

She awoke as they arrived in a village where it had been decided to feed
the horses and spend the rest of the night. The men from the sleighs ahead
had already moved into the dwellings. One horse could be seen in the yard
of the house to which the sisters' driver had driven.

"There now," said the old man as he unharnessed his horse, "the Reds left
here in a hurry. Don't be frightened, ladies, but their bodies are lying
all around the yard, and the house seems to be empty.

"Look at that one there; he looks just as if he were sitting, leaning
against the cowshed wall, with a small bloody scratch on his cheek and
temple. And this one, staring up at the sky with bulging eyes. There are
ten or so of them around this building."

The old man took his horse to the stable and went into the cabin himself.
Alina took the lantern from him and approached a body lying in the
snowdrift. The flame of the candle, trembling in the draught, cast a faint,
quivering light on the dead man's face, which the young sister regarded
with reverent feelings.

"Whoever you are, and whatever the motives were that brought you into
battle, I honour your death all the same. No more enmity; peace be with
your soul," she said to the dead man.

The dead man's face was as rigid as if carved in stone, as if there had
never been any life in those features at all. Why was it so different from
the others she had seen as corpses before? Where was the lofty solemnity of
death? There was nothing of the kind on this face. Or had he perhaps fallen
asleep in the faith his ideology taught, without hope of awakening, and had
that faith bound his soul so firmly to matter? Was that why his sleep
seemed as deep as the thousand-year sleep of rock?

Such were Sister Alina's thoughts beside the fallen Red Guard, and when she
roused herself from them she found that she was in the yard, alone. "There
are ten or so of them around this building," she remembered the old man
saying, and a cold shiver ran through her body.

When she came into the cabin the others had already settled down to rest.
She looked for a place to sleep and found in the side chamber an old
rocking chair, into which she threw herself to spend the night.

Weariness and cold shook her body; in her soul dwelt a sense of loneliness
and forlornness. The bodies lying out there all around kept coming to mind
as well. It weighed on her all the more heavily now, with bodily exhaustion
pressing down.

"Where is my burning enthusiasm now? Did it end so soon, even before
reaching the destination? And what about there, then?" she asked herself,
and at last fell asleep in the middle of a stifled sob.
VI.

One fine morning in late winter Lieutenant Toivonen leapt onto his horse,
bound for the field hospital, which lay a few kilometres behind the front.
Young Seppo was lying ill there now and, by all accounts, steadily getting
worse.

The horse ran at a steady trot. The March sun poured down its warm rays,
melting the snow crust. From the forest there already wafted a fresh scent
of spring.

But the young rider's mood kept turning sorrowful, and the questions came,
those agonising questions, and heavy memories awoke now that the tension of
battle was not straining his nerves. Whenever there was a moment of rest
they came and closed around him in their tight, crushing circle. He tried
to resist them, tried to fend them off by force of will, but they came
again and again. At last he resisted no longer but let them come freely and
say what they had to say, himself like a bystander listening.

"What is this, really? A war between brothers is being fought in the land.
The most terrible war there has ever been. Has any nation on earth yet seen
horror to equal this? What has got into people? What gave birth to this,
who set it going? Where do they come from, those creatures who beat, crush,
flay and crucify the poor wretches who have fallen victim to them? Many
have even been burned alive, and one Red confessed at the hour of his death
that he had boiled a White soldier. He had put him in cold water and heated
it slowly until the pot boiled. Are those who wage war like this human at
all, even though they appear in human shape?

"It is the Russian plague, the Bolshevik poison flowing here from Russia,
that turns people into beasts.

"Oh Russia, Russia!

"Half of Finland's population is already infected with it; will the other
half, still healthy, manage to save this poor land?"

"It will! It must!" he hears himself say half aloud, and straightening up
in the saddle he tried once more to break free of these reflections and
spurred his horse, which shot off like the wind.

He had already reached the village, and there indeed the Red Cross flag was
flying on the roof of one of the buildings; so that was it.

He hurried inside and asked for the one he was looking for.

How thoroughly the space here had been used: bed beside bed, between which
the nurses had to thread their way.

He was shown to a bed in the far corner of the large hall. There, inside
the white sheets, lay what looked like a small bundle, from which emerged
young Seppo, wasted away and shrunk to almost nothing, so that at first
glance it was hard to recognise him as his former self.

He lay there with his eyes closed, but on hearing a familiar voice speak
his name the heavy lids opened and two bright child's eyes shone at Jouko
from among the white pillows. In them was now, as before, the same look of
love, trust and boundless admiration.

He said nothing, but his lips drew into a faint smile, and when Jouko took
that small pale hand in his own it was strangely cold and damp. He had,
after all, nearly bled dry before help reached him.

He sat on the edge of the bed, but the sick boy's eyes wearily closed
again, and Jouko's gaze rested now on that beautiful little face, now
wandered around the room, whose occupants had each paid their tribute to
the fatherland.

"Life here is different from life at the front. Here is a world of pain and
suffering," thought Jouko, and his heart grew heavy. "What if I had to come
here one day, if I were not allowed to fall on the field as I have
imagined? That would be easy, but this! The suffering of pain, the groans
and moans of others, the constant hospital smell, and the very air here
seems saturated with unending anguish. If only one might be spared this; if
only the bullet that strikes were merciful and took one wholly, once and
for all."

There, too, a young man of the Sven Duva type tosses restlessly on his bed.
His face is swollen and a feverish gleam burns in his eyes. His hand gropes
over the blanket, hastily searching for something. "Where did that carbine
get to? They mean to surround us, we must break the line," he says in a
strange, unnatural voice.

From another corner came what sounded like quiet, stifled weeping. There
lies a dark-haired youth of about twenty with a bandage over his eyes. In
battle he took an exploding bullet in the face, which tore away the upper
part of his nose and blinded his eyes, and there he now sobs in the corner,
gazing with his darkened eyes into a lifelong black night.

Beside him another looks at the young lieutenant with a clear, conscious
gaze. Jouko wants to talk to someone and goes over to him, asking his name
and where he is from. The boy only smiles, stammering a few broken
syllables: "ei ... ei ... mi ..." He points to a small wound on his throat,
already scabbed over, where a bullet went in, coming out at the nape of the
neck. With that the man lost his power of speech and his right arm was
paralysed; it lay there at his side like a dead thing, and only an
occasional ache told him that it still belonged to the rest of his body.

But an elderly man lying near the door suddenly groans in agony, and that
face, a moment ago sallow and rigid like a corpse's, shows signs of life
again as he moans aloud.

Then a gentle, melodious voice rings out in the house of suffering:

"Is the master in pain again? Let me change your position; perhaps it will
ease things a little," come the words, and a nurse dressed in white has
appeared beside the bed.

She tenderly wipes away the sweat beading on the sick man's brow, turns the
pillow, plumps it and straightens it.

"Is that better now, is it any better at all?"

"It is, oh, it is, sort of a little easier again," murmurs the sick man,
and to the others too, now that Sister Alina has appeared, it seems that
things are a little easier again.

She moves from bed to bed, speaking to each one separately, saying just a
word or two, and they listen to her voice, and for a moment sorrows and
pains are forgotten.

"And how is my laddie doing?" she says, approaching a young soldier who,
propped up in a sitting position with his head in bandages, stares with a
vacant expression. But as the nurse comes near, his languid eyes light up,
and stretching out his hand he slurs and stammers something that Jouko does
not understand.

"What is he saying?"

"He only said 'Miss's boy'. Yes, this one is my laddie," said the nurse,
stroking the outstretched hand, and hurried on.

And Jouko's gaze follows that supple figure and the soft movements of her
youthful body, sees the eyes that shine so gently as she speaks to her
patients, but listens even more to the voice that sounds so beautifully in
this house of great sorrows.

His little friend is still asleep, and Jouko cannot stay here any longer.
He rises and quietly leaves the bedside. At the door he meets the nurse
once more, holds out his hand to her and says:

"Thank you! Goodbye."

But at the same instant he blushes. Had she understood what she was being
thanked for? He had not meant to say it; it had come out as if of its own
accord.

The young lieutenant rides away, but still he seems to feel in his hand a
soft, warm little hand, and in his ears echoes a voice more beautiful than
any he has ever heard before.
VII.

It is already the end of March. Steadily and surely the White army has
advanced, although resistance grew ever stronger the farther it came. With
small numbers of troops the attacks of a many times superior enemy had been
held off, and little by little they had doggedly advanced southwards,
towards Tampere, the firm stronghold of Red power.

In these weeks the lads of Ostrobothnia had long since forgotten the
comforts of life, and a peaceful night's sleep was now only a memory in
their minds.

But the great patriotic enthusiasm that had brought them here had not
flagged amid the hardships of the march, and the men burned to press
onward.

So it was that one afternoon Capt. V's battalion, to which Jouko Toivonen
now also belonged, marched into the parish of K, arriving there at about
three o'clock. At the crossroads some took the road leading towards the
church, while others turned off to the station.

The Red Russkies had only just left, and great filth prevailed everywhere,
so the women accompanying the troops had to set about cleaning at once,
with the men's help.

"There's no danger; the line is cut farther off," says the battalion
commander, tying his mount to a telephone pole.

So the troops disperse: some are sent, to make doubly sure, to inspect the
line; others take up sentry posts; a few stay behind to leaf through a
large bundle of newspapers left in the station yard.

Jouko Toivonen, for his part, takes a couple of men with him and sets off
to search the dwellings in the neighbourhood on the other side of the
track.

*****

Meanwhile the station telephone rang, and the fearless captain answered it
in blunt terms, but the consequence of this conversation also soon became
apparent.

Soon an armoured train came steaming towards the station. Sheltered by a
rock face beside the track, which also kept the sound from carrying, it got
close to the station unnoticed by anyone, right up to the points, from
where it opened fire with rifles, machine guns and even cannon.

Jouko Toivonen, with a few men, was now cut off from the others in an open,
unprotected spot. He saw how the sentries fled before the train, seeking
cover among the stacks of firewood, how the Whites jumped down from the
windows of the station building and ran for the nearby forest, and how one
young Civil Guard, a lame boy, calmly went on untying the horses tethered
to the fence, heedless of the hail of bullets.

Only a few were left now. Despite the approaching danger he rescued his
commander's horse and came last of all to a certain