LEXICOGRAPHICA
Series Maior
Edited by
Pierre Corbin, Ulrich Heid, Thomas Herbst, Sven-Göran Malmgren,
Oskar Reichmann, Wolfgang Schweickard, Herbert Ernst Wiegand
141
Dennis Spohr
Towards a Multifunctional
Lexical Resource
Design and Implementation
of a Graph-based Lexicon Model
De Gruyter
D 93
ISBN 978-3-11-027115-7
e-ISBN 978-3-11-027123-3
ISSN 0175-9264
1 Preface
2 Introduction
2.1 Computational Lexicography
2.2 Function Theory
2.2.1 Definition of a Lexicographical Function
2.2.2 The Concept of a Leximat
2.3 Multifunctionality
2.4 Objectives and Contributions of this Book
10 Bibliography
Appendix
List of tables
5.1 Hierarchy and attributes of relations between lexemes, forms and senses
5.2 Hierarchy and attributes of collocational relations
5.3 Hierarchy and attributes of lexical-semantic relations
5.4 Hierarchy and attributes of morphological relations
5.5 Hierarchy and attributes of abbreviational, idiomatic and proverbial relations
5.6 Needs of untrained users in a text-receptive situation in the mother tongue
5.7 Needs of trained users in a text-productive situation in a foreign language
5.8 Status and labels for linguistically trained vs. untrained users
5.9 Number of entities in the MLR model
AI Artificial Intelligence
DB Database
DTD Document Type Definition
ED Electronic Dictionary
GUI Graphical User Interface
HTML HyperText Markup Language
HTTP HyperText Transfer Protocol
ISO International Organisation for Standardisation
KB Knowledge Base
KR Knowledge Representation
LMF Lexical Markup Framework
MWE Multi-Word Expression
NLP Natural Language Processing
OWL Web Ontology Language
OWL DL OWL Description Logic
RDBMS Relational DataBase Management System
RDF Resource Description Framework
RDFS RDF Schema
SALSA SAarbrücken Lexical Semantics Acquisition
SeRQL Sesame RDF Query Language
SGML Standard Generalised Markup Language
SPARQL SPARQL Protocol and RDF Query Language
SW Semantic Web
SWRL Semantic Web Rule Language
TFS Typed Feature Structure
UML Unified Modelling Language
VDE Valency Dictionary of English
W3C World Wide Web Consortium
WWW World Wide Web
XML eXtensible Markup Language
XSL eXtensible Stylesheet Language
XSLT XSL Transformations
1 Preface
In 2005, while I was working as a student assistant at the Institute for Natural Language Pro-
cessing at the University of Stuttgart, my then supervisor Ulrich Heid returned from a meeting
with Günther Görz at the University of Erlangen-Nürnberg, impressed by the research done there on
using knowledge representation formalisms for modelling computational lexica. Convinced
that these formalisms would provide interesting solutions to current issues in computational
lexicography, in particular the representation of multi-word expressions in electronic dictio-
naries, Heid offered a diploma thesis project on representing lexicographic descriptions of
collocations in these formalisms. I subsequently took up this subject and wrote my diploma
thesis entitled »A Description Logic Approach to Modelling Collocations«, which described
the use of the Web Ontology Language OWL in order to represent the properties and relations
of collocations. Very early on in this project, it became apparent that OWL opened up new
ways of dealing with more general lexicographic questions, such as the relation between elec-
tronic dictionaries for human use on the one hand and computational lexica designed to sup-
port natural language processing (NLP) applications on the other, as well as (in)consistency
and inference in lexicographic descriptions. Inspired by this, and backed by further experi-
ence gained during a position as a research assistant at Saarland University in Saarbrücken, I
started working on a PhD project elaborating on these ideas.
This book, which represents an updated version of the doctoral dissertation that I
submitted to the Faculty of Humanities of the University of Stuttgart and successfully defended
on July 16, 2010, deals with the development of a model for a lexical resource that can be
used to serve both human users and NLP applications. While this suggests its immediate clas-
sification as a purely lexicographic work, several central aspects of this book go beyond what
would traditionally be referred to as »lexicographic research«. As was mentioned above, it
describes the application of important recent results of a branch of computer science known
as artificial intelligence (AI) to the field of dictionary research, such as knowledge represen-
tation formalisms and description logic reasoning, and discusses the solutions they offer to
specific lexicographic issues. One of these is that they allow for viewing lexica as labelled
graphs, i. e. net-like resources in which entities (nodes) are linked by means of labelled arcs
(edges). This view of lexical resources has proven very suitable for capturing the
highly relational nature of lexical data (see e. g. Polguère (2006); Trippel (2006); Spohr and
Heid (2006)), and is in opposition to the traditional text-based view. These advances are com-
plemented by a strong focus on computational linguistic methods, including the mining of
existing lexical resources for lexicographically relevant information, as well as the utilisation
of this information for applications of NLP. This »multifunctionality« – both with respect to
the usability of lexicographic information for NLP applications and human users, as well as
with respect to the different types within these two categories – is at the heart of this work.
Conceptually, this study is thus to be classified as a bridge between the areas of lexicography
and computational linguistics on the one hand (often referred to as computational lexicogra-
phy), and AI on the other.
Many people have influenced this work in one way or another. I am indebted to Ulrich Heid
for giving me the opportunity to carry out this research and benefit from his expertise, as well
as for his constant support and interest in my work over the years. I highly appreciate that
he has always taken the time to comment on my work, despite his obligations at different
universities and in various committees.
The idea of a »Mutterwörterbuch« formulated by Rufus Gouws from the Stellenbosch Institute
for Advanced Study has had a profound influence on this work, and I would like to thank him
for sharing his ideas with me. Similarly, the works of Henning Bergenholtz and Sven Tarp
have had a deep impact on the theoretical aspects of this work, and I am grateful for our very
fruitful discussions at the e-Lexicography workshop in Valladolid, which was organised by
Pedro A. Fuertes Olivera in 2010.
The audiences at several conferences have provided very useful feedback to different devel-
opment stages of this work. I would like to thank in particular the audience at eLEX 2009 in
Louvain-la-Neuve for their critical comments, which have resulted in a refined argumentation
regarding different aspects of this approach.
My former office mates, colleagues and friends in Stuttgart, Saarbrücken and Bielefeld
have always created a very cordial and inspiring work environment. For this reason, An-
dré Blessing, Regine Brandtner, Aljoscha Burchardt, Philipp Cimiano, Anette Frank, Steffen
Heidinger, Simone Heinold, Christian Hying, Katja Lapshinova-Koltunski, Fabienne Martin,
Benjamin Massot, Lukas Michelbacher, Florian Niebling, Sebastian Padó, Klaus Rothen-
häusler, Achim Stein and Melanie Uth have contributed to this work much more than they
are aware of, and I thank them very much.
Further thanks go to Günther Görz, Ulrich Heid, Stefan Schierholz and Achim Stein for
reviewing previous drafts of this book. I also appreciate that the coordinators of the Interna-
tional Graduate School 609 »Linguistic representations and their interpretation« have given
me the opportunity to share my preliminary results and ideas with experienced researchers as
well as fellow young researchers.
On a personal note, I would like to thank my parents, who encouraged me to study in
the first place, and who have always loved and supported me over the years. Finally, none of
this would have been possible without my wonderful wife María, whose patience, encourage-
ment, love and understanding have helped me to stay on track when necessary, and to leave
this track whenever possible.
2 Introduction
2.1 Computational Lexicography
According to Tarp (2008: p. 4), lexicography is the »science of dictionaries«, which is meant
to be interpreted in direct opposition to the views of researchers who attribute to lexico-
graphy the status of a subdiscipline of linguistics. However, while Tarp provides several
justifications for his claim (ibid., pp. 4-6), it seems too strong to conceive of lexicography as
a science which is independent of linguistics. These doubts come from the observation that –
irrespective of the particular type of dictionary and the principles that are used to structure it
– dictionaries commonly deal with the representation of the items of a language. Moreover,
given the amount of linguistic analysis that is typically involved in the initial stages of the
compilation of a dictionary, it seems equally plausible to follow the view e. g. of Gibbon
(2000), who classifies lexicography as a branch of applied linguistics that deals with the
design and construction of dictionaries and the representation of linguistic information.
Irrespective of the debate as to the scientific status of lexicography, which is of no further
relevance to the core ideas of this study, there seems to be general agreement on the inter-
nal structure of a dictionary. According to Hausmann and Wiegand (1989), for example, it
consists of a microstructure, a macrostructure and a mediostructure. While the microstruc-
ture refers to the structure of dictionary entries in terms of the descriptions they contain, the
macrostructure refers to the sequence of these dictionary entries and the access to them. The
mediostructure describes the links between entities, both in terms of references between en-
tries and – in the context of a graph-based lexicon – links between any kind of lexical or
descriptive entity.
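The graph-based reading of the mediostructure can be illustrated with a minimal sketch. The class name `LexiconGraph`, the node identifiers and the arc labels `hasSense` and `hasSynonym` are hypothetical and serve illustration only; they are not the identifiers of the model developed in this book.

```python
class LexiconGraph:
    """A labelled directed graph whose nodes are lexical or descriptive entities."""

    def __init__(self):
        # Maps a source node to a list of (label, target) pairs.
        self._arcs = {}

    def link(self, source, label, target):
        """Add a labelled arc (edge) from source to target."""
        self._arcs.setdefault(source, []).append((label, target))

    def targets(self, source, label):
        """Return all nodes reachable from source via arcs carrying the given label."""
        return [t for (l, t) in self._arcs.get(source, []) if l == label]


graph = LexiconGraph()
# Microstructure-like information: descriptions attached to a lexeme.
graph.link("lexeme:treffen", "hasSense", "sense:treffen_1")
graph.link("lexeme:begegnen", "hasSense", "sense:begegnen_1")
# Mediostructure-like information: a link between two descriptive entities,
# not merely a cross-reference between two entries.
graph.link("sense:treffen_1", "hasSynonym", "sense:begegnen_1")

print(graph.targets("sense:treffen_1", "hasSynonym"))
```

In such a representation, a cross-reference between entries is simply one kind of labelled arc among many, which is why the mediostructure can extend to any pair of entities.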
In addition to these general lexicographic aspects, however, it seems that the definition of
lexicography presented above needs to be extended. In particular, it is restricted to dictionar-
ies for human usage only, and does not take into account the structuring of lexical resources
for computational use, nor the computational methods which are used to create or process
these structures. In addition to the tools which »assist the various lexicographical tasks«
(Atkins and Zampolli, 1994: p. 4), these aspects are part of the field of computational lexi-
cography, which is classified as a subdiscipline of computational linguistics by Heid (1997).
However, in contrast to the debatable position with respect to the independence of lexico-
graphy and linguistics, it seems even more difficult to view computational lexicography as
separate from computational linguistics. Among other things, this is due to the fact that computational
lexicography is closely interlinked with other subdisciplines of computational linguistics,
such as corpus linguistics for the extraction of lexicographically relevant information
from corpora, as well as any branch of computational linguistics that relies to a considerable
extent on structured representations of linguistic information, e. g. machine translation. As
such, computational lexicography plays a central role in both applied and theoretical respects,
in the sense that it applies lexicographic principles and the results of lexicographic research
with respect to the structuring of dictionaries in a computational scenario. Therefore, it is not
equivalent to the recently emerging e-lexicography, which is centered around user-oriented
electronic dictionaries and is thus considered to be subsumed by computational lexicography.
2.2 Function Theory
As was just mentioned, function theory is a rather recent general lexicographic theory that has
been developed by Henning Bergenholtz and Sven Tarp (see e. g. Bergenholtz and Tarp, 2002)
and published in its latest version by Tarp (2008). In his book, Tarp evaluates the influences
that existing lexicographic theories have had on the evolution of function theory by providing
a critical account of the predominant directions up to that point. In this contrastive survey,
it becomes evident that while function theory has initially built on existing theories such as
Wiegand’s general lexicographic theory (see e. g. Wiegand, 1988, 1998), it has developed
into its own independent theoretical branch.
2.2.1 Definition of a Lexicographical Function
The central aspect of function theory is that it starts from »dictionary users as an object of
research« (Tarp, 2008: p. 33) and derives from this analysis requirements on the representa-
tion of lexicographic data in dictionaries. In particular, Tarp analyses the needs that users may
have in different cognitive and communicative situations, where cognitive situations are those
in which a user wishes to learn more about a given topic, while communicative situations refer
to cases »in connection with current or planned communication« (Tarp, 2008: p. 45). Here,
the emphasis is on the fact that these situations are essentially extra-lexicographical situa-
tions, which means that they arise in potential users before any dictionary consultation takes
place, i. e. before they might turn into actual dictionary users. In this vein, a lexicographical
function is defined as follows.
»The genuine purpose of dictionaries is to satisfy the types of lexicographically relevant need that
may arise in one or more types of potential user in one or more types of extra-lexicographical situa-
tion.« (Tarp, 2008: p. 88)
2.2.2 The Concept of a Leximat
Besides the theoretical contribution of function theory, Tarp’s work introduces the concept of
a leximat. This term, »which has connotations of both lexicography [...] and mechanics«, is
defined as follows.
»A leximat is a lexicographical tool consisting of a search engine with access to a database and/or
the internet, enabling users with a specific type of communicative or cognitive need to gain access
via active or passive searching to lexicographical data, from which they can extract the type of
information required to cover their specific needs.« (Tarp, 2008: p. 123)
On the one hand, this definition includes lexicographical tools which are labelled as passive
leximats, as they perform tasks without the user having asked for them (e. g. correction of
misspelled user input; cf. the »invisible dictionary« of Bergenholtz, 2005). On the other
hand, it includes active leximats, which refers to tools that point users to knowledge sources
which are capable of satisfying their needs. In the case of cognitive needs, this may involve
external resources on the internet.
Tarp deliberately introduces the term leximat in order to differentiate it from an elec-
tronic dictionary, mostly due to the traditional connotation of dictionaries as printed refer-
ence works. However, due to the strong focus on user-oriented lexicography in his work,
the definition of a leximat ignores its use in an NLP scenario. As it is not clear whether this
interpretation has been intended – or is simply the result of omission – the term leximat will
only be used occasionally in order to refer to certain aspects of the work presented in this
book. Instead, the more general term multifunctional lexical resource will be used, which –
as pointed out by Heid (1997) – may include dictionaries, grammars as well as text corpora
and even certain NLP tools (ibid., p. 21; see also Trippel, 2010: p. 166), and is thus taken to
subsume the key notions of a leximat.
2.3 Multifunctionality
Taking into account the definition of a lexicographical function as proposed by Tarp (2008),
one could say that a multifunctional lexical resource is one whose genuine purpose is to serve
more than one of these functions. As was indicated above, however, multifunctionality goes
beyond this. In particular, the term is borrowed from literature of the late 1980s
and early 1990s on reusable lexical resources (see e. g. Heid and McNaught, 1991; Bläser
et al., 1992) and has been discussed in great detail in Heid (1997). Here, a multifunctional
lexical resource is defined as follows.
[»The term ›reusable lexical resource‹ denotes a source of linguistic knowledge whose conception
has been specified and realised such that its use in different situations or systems (both different
NLP applications and different (interactive) usage situations with ›human users‹) is incorporated in
the design criteria. Such sources of linguistic knowledge are also referred to as ›multifunctional‹
resources.«]
According to this definition, and taking into account the analysis presented above, one can
say that multifunctionality extends along several dimensions. On the one hand, it refers to
the utilisation of the contents of a lexical resource for human users as well as NLP tools.
On the other hand, it further needs to cater for the needs of different types of users in dif-
ferent situations (e. g. text-productive vs. text-receptive; see above), as well as different types
of NLP applications (e. g. syntactic parsing or machine translation). Finally, a third dimen-
sion can be identified which is concerned with user-specific characteristics like their mother
tongue or their language proficiency on the one hand (cf. Tarp, 2008), and application-specific
paradigms and terminologies on the other. This means that in order to be truly multifunc-
tional, the model of a lexical resource needs to offer the formal means to define a specific
point in this space with respect to the domain (e. g. a human user vs. an NLP application), the
function (e. g. a text-receptive situation vs. syntactic parsing) and an idiosyncratic property
(e. g. with German mother tongue vs. Lexical-Functional Grammar), and to serve the needs
associated with this point. Note that the function ranges of human users and NLP applications
are not assumed to be overlapping. This is due to the fact that although
some of the needs are shared by human users and NLP applications, the criteria to identify
them are not. For example, although one could say that proof-reading and syntactic parsing
are to some extent related in the sense that both of them need e. g. syntactic valence
information (for human users see Tarp, 2008: p. 77), one would not say that syntactic parsing
is a relevant function of its own in the context of human users. Therefore, we assume that
these two functions are located in different ranges, and that the tools used in order to serve
the needs associated with them are not necessarily the same.
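The three dimensions described above can be sketched as a small data structure. This is an illustration under stated assumptions only: the names `Domain`, `UsagePoint`, `NEEDS` and `needs_for`, as well as the need label `"syntactic valence"`, are hypothetical and do not come from the model itself.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    HUMAN_USER = "human user"
    NLP_APPLICATION = "NLP application"


@dataclass(frozen=True)
class UsagePoint:
    """One point in the space spanned by domain, function and idiosyncratic property."""
    domain: Domain
    function: str        # e.g. "proof-reading" or "syntactic parsing"
    characteristic: str  # e.g. "German mother tongue" or "Lexical-Functional Grammar"


# Hypothetical need assignments: both points require valence information,
# yet they remain distinct points located in different ranges.
PROOF_READING = UsagePoint(Domain.HUMAN_USER, "proof-reading", "German mother tongue")
PARSING = UsagePoint(Domain.NLP_APPLICATION, "syntactic parsing", "Lexical-Functional Grammar")

NEEDS = {
    PROOF_READING: {"syntactic valence"},
    PARSING: {"syntactic valence"},
}


def needs_for(point):
    """Return the needs associated with a usage point (empty set if unknown)."""
    return NEEDS.get(point, set())


shared = needs_for(PROOF_READING) & needs_for(PARSING)
print(shared, PROOF_READING.domain is not PARSING.domain)
```

The sketch keeps shared needs separate from the criteria that identify a point, which mirrors the assumption that the two functions are located in different ranges.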
With reference to function theory, Gouws (2007) states that the »compilation of any dic-
tionary needs to be preceded by a clear identification of the functions that should prevail in
the dictionary« (Gouws, 2007: p. 66). However, as the complete range of potential users and
applications of a multifunctional lexical resource is not necessarily known at the time the
lexicon model is compiled, this does not apply completely in this context. In fact, such a
lexicon is in direct opposition to application-specific lexical resources on the one hand and
human-readable electronic dictionaries on the other. According to de Schryver (2003: p. 145)
these two »seem worlds apart«, which is exemplified by the fact that each has typically been
designed to serve one particular purpose or a small set of predefined ones and requires con-
siderable effort when trying to adapt it to serving unforeseen purposes (see e. g. Asmussen
and Ørsnes, 2005; Spohr, 2004).
Based on the assumption that the internal structure of a lexicon model does
not need to differ along with the purposes that the corresponding lexical resource is to serve,
multifunctionality thus refers to the ability to extract different »dictionaries« from a com-
mon lexical data collection, i. e. to allow for different views on the lexical data. Each of
these views comes with its own lexicographic needs, and thus, the definition of a multifunc-
tional lexical resource presupposes an extensible mechanism that enables its use in various
scenarios. Such considerations suggest an architecture according to the notion of
»Mutterwörterbuch« (»mother dictionary«) as conceived of by Gouws (2006), in which entities are
marked on a macro- and microstructural level for their inclusion in a particular usage situa-
tion. As a result, the traditional view of lexical entries as static entities needs to be replaced
by one which conceives of them as dynamic entities that are generated according to the needs in
this situation.
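The idea of extracting different views from a common data collection can be sketched as follows. The sketch assumes that inclusion marks are plain sets of situation labels; the names `Indication`, `Lexeme`, `generate_entry` and the two situation labels are hypothetical and are not the mechanism implemented in this book.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Indication:
    """One piece of description, marked on the microstructural level."""
    kind: str
    value: str
    included_in: FrozenSet[str]


@dataclass
class Lexeme:
    """A lexical item, marked on the macrostructural level."""
    lemma: str
    included_in: FrozenSet[str]
    indications: List[Indication] = field(default_factory=list)


def generate_entry(lexeme: Lexeme, situation: str) -> Optional[dict]:
    """Generate an entry for one usage situation at consultation time."""
    if situation not in lexeme.included_in:
        return None  # the lexeme is not part of this view at all
    return {
        "lemma": lexeme.lemma,
        # Keep only those indications that are marked for this situation.
        "indications": [
            (i.kind, i.value) for i in lexeme.indications if situation in i.included_in
        ],
    }


RECEPTION = "text reception (human user)"
PARSING = "syntactic parsing (NLP application)"

treffen = Lexeme(
    lemma="treffen",
    included_in=frozenset({RECEPTION, PARSING}),
    indications=[
        Indication("gloss", "encounter, meet", frozenset({RECEPTION})),
        Indication("valence", "subject + accusative object", frozenset({RECEPTION, PARSING})),
    ],
)

print(generate_entry(treffen, RECEPTION))
print(generate_entry(treffen, PARSING))
```

The stored data never changes between the two calls; only the generated entry does, which is the sense in which an entry is dynamic rather than static.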
2.4 Objectives and Contributions of this Book
The central objectives of this study are to design and implement a model for a multifunctional
lexical resource that is capable of satisfying the needs of different types of users in different
types of communicative and cognitive situations, as well as to provide the mechanisms which
are necessary in order to serve different types of NLP applications. In addition to presenting
the details of this model, a further major objective is to show how the representations therein
can be integrated into the architecture of a multifunctional lexical resource and support the
definition of its access functionality, as well as the function-based presentation of its content.
However, it should be emphasised that the model as presented in this study does not make
any claims as to its completeness with respect to the range of linguistic phenomena that can
be represented. Rather, its objective is to be general enough to enable later extension to
further phenomena, and to show the ways in which this can be accomplished.
In the course of achieving these objectives, this study makes contributions in the following
respects, which are – in terms of Wiegand (1989) – situated on the methodological side of
lexicographic research. First, it provides the implementation of a model that is inspired by
the primarily theoretical considerations of Gouws (2006) and Heid and Gouws (2006), and
proposes a general and extensible architecture for dealing with user needs in the context of a
multifunctional lexical resource. In addition to mere inclusion marks in the sense of Gouws
(2006), this mechanism takes into account the user need status as proposed by Tarp (2008) as
well as the language background and proficiency of the user. On a conceptual level, this study
thus contributes to the view of a lexical entry as a dynamic entity that is generated at the time
of the consultation, as opposed to the traditional view of lexical entries as static text-based
entities (see Polguère, 2006: p. 51). Moreover, the thoughts presented in the afore-mentioned
works are applied not only in the context of human users, but extended to applications in the
field of natural language processing.
On a more abstract level, this study introduces concepts from artificial intelligence to the
field of computational lexicography, and highlights the support that standard knowledge
representation formalisms and reasoning provide in the definition of a consistent lexical
resource whose integrity is preserved. While the general approach of using AI-related
formalisms in a lexicographical setting is not new (Evans and Gazdar, 1996; see also Görz,
in prep), their use appears to be undocumented in a context that (i) includes electronic
dictionaries which serve both human users and NLP applications, and (ii) views these
dictionaries as graphs.
By touching upon the results of a recent AI-related project like the semantic web and the
range of possibilities opened up by such technologies, this book thus makes an important
contribution to research on the internal structure of electronic lexical resources. This is ex-
emplified by means of a prototype implementation of a multifunctional lexical resource which
contains unified representations of lexicographically relevant information from the SALSA
corpus (Burchardt et al., 2006) and a recently developed database of automatically extracted
collocations (Weller and Heid, 2010).
This book is structured as follows. Chapter 3 discusses general requirements on the design
of a lexicon model, as well as more specific ones arising in the context of a multifunctional
lexical resource. In addition to this, it analyses the implications that these requirements have
on different components of the multifunctional lexical resource, as well as on the choice of the
formalism that is used for its definition. Finally, the chapter gives an overview of other recent
approaches in this field and assesses their performance regarding the identified requirements.
Chapter 4 is devoted to the discussion of the formalism that is used for the definition of
the lexicon model. After a short introduction to the semantic web, an AI-related project that
deals, among other things, with the development of knowledge representation formalisms, the main
properties of the Resource Description Framework (RDF) and the Web Ontology Language
(OWL) are presented and positioned with respect to the more widely known and used for-
malism XML. In particular, the chapter focusses on the formal characteristics of OWL DL, a
In this chapter, we will first discuss the main requirements for the design of the multifunc-
tional lexical resource model (MLR model) as well as requirements which address different
components of the MLR as a whole (Section 3.1). Here, we will focus on those requirements
which we consider underrepresented both in current models for computational lexicons and
in electronic dictionaries. This is not to say, however, that each of these requirements in
isolation is not treated adequately anywhere, but rather that it is not possible to find a resource that
offers a balanced combination of them. Justifications for this claim will be given in Section
3.2, which provides an overview of the state of the art of approaches to defining electronic
dictionaries and models for computational lexical resources¹.
Most of the requirements have been discussed at length in the literature and can, in the context
of an MLR, be classified as requirements on the description, formal and technical require-
ments, as well as requirements with respect to multifunctionality (see also Spohr, 2008). In
the following sections, we will discuss a number of these requirements in more detail, fo-
cussing on particular aspects of the description in Section 3.1.1 and the formal and technical
aspects in Section 3.1.2. As the main emphasis of this work is on the definition of a model for
an MLR, requirements which are centered around the coverage in terms of lexical items are
largely ignored. Requirements which originate from the multifunctionality of the MLR will
be discussed in Section 3.1.3. Finally, Section 3.1.4 summarises the implications of these
requirements on the choice of the formalism for defining the MLR model as well as the other
components which make up the MLR.
The first complex of requirements to be discussed refers to various kinds of descriptions, such
as linguistic properties of lexical items or multimedia elements for illustrative purposes. In
contrast to traditional print dictionaries, the electronic medium does not impose space restric-
tions on the presentation of dictionary content. Thus, there is no need for text condensation
or »traditional space-saving mechanisms« (Gouws, 2007: p. 66), and the descriptions should
be stored and indicated in their unabbreviated form (see »decompression« in de Schryver,
2003; Schall, 2007). Moreover, the detail of the descriptions has to be such that the MLR
is capable of serving as useful input to both specialised NLP tasks and human expert users,
while retaining the possibility to generate or extract less detailed descriptions from the data
if required. In addition to this, the MLR model should provide a typology of lexical items, in
order to allow for example for »search by lemma type« (Schall, 2007: p. 59), such as search
restricted to compounds or idioms.
¹ Resources which are only commercially available will not be considered.
As the mere reproduction of a comprehensive list of phenomena to be covered is not neces-
sary in the context of this work, the following subsections explain three particular phenomena
which we consider immediately relevant to both human users and several NLP applications,
namely valence, collocations and preferences. The section closes with a few remarks on
multilinguality.
Heid (2006) emphasises the importance of detailed valence descriptions with respect to both
human users and NLP applications. From the viewpoint of text production, for example,
he notes that it is vital to make explicit reference to the valence differences between (quasi-)
synonyms such as »treffen« and »begegnen« (»encounter, meet«), where the direct accusative
object of »treffen« (»that which is encountered«) is mapped onto the indirect dative object
of »begegnen«. In addition to this, Heid lists machine translation as an example of NLP
technology which relies heavily on (and greatly benefits from) detailed valence description:
a machine translation system needs to have the information that e. g. the direct object of
»remember« is mapped onto the prepositional object of its translation equivalent »erinnern«.
A number of researchers have suggested using the three-layered approach to valence description
proposed by FrameNet (Baker et al., 1998; see e. g. Heid and Krüger, 1996; Atkins
et al., 2003; Boas, 2005; Heid, 2006) in order to provide adequate treatment of valence phe-
nomena in the lexicon. In this approach, the subcategorised (as well as optional) arguments of
a predicate are not only assigned a phrasal category (as in many current valence dictionaries;
see Herbst et al. (2005) for an example) and a grammatical function (as in NLP lexica such
as traditional LFG subcategorisation lexica; see e. g. Kaplan and Bresnan, 1982), but also
a semantic role (»frame element« in FrameNet). A valence pattern thus consists of one or
several such category-function-role triples. This combination of both syntactic and semantic
information in the FrameNet approach provides »an analysis of meaning far more granular
than is normally possible in commercial lexicography« (Atkins et al., 2003: p. 340).
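The notion of a valence pattern as a set of category-function-role triples can be illustrated with a minimal sketch. It assumes a simple in-memory representation; the class names, the category labels and the role labels »Encounterer« and »Encountered« are hypothetical and are not taken from FrameNet. The sketch pairs up the arguments of the quasi-synonyms »treffen« and »begegnen« via their shared semantic role.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValenceArgument:
    category: str  # phrasal category (hypothetical labels)
    function: str  # grammatical function
    role: str      # semantic role (hypothetical labels, not FrameNet's)


@dataclass(frozen=True)
class ValencePattern:
    lemma: str
    arguments: tuple


def map_arguments(source, target):
    """Pair arguments of two patterns that carry the same semantic role."""
    by_role = {arg.role: arg for arg in target.arguments}
    return [(arg, by_role[arg.role])
            for arg in source.arguments if arg.role in by_role]


treffen = ValencePattern("treffen", (
    ValenceArgument("NP-nom", "subject", "Encounterer"),
    ValenceArgument("NP-acc", "direct object", "Encountered"),
))
begegnen = ValencePattern("begegnen", (
    ValenceArgument("NP-nom", "subject", "Encounterer"),
    ValenceArgument("NP-dat", "indirect object", "Encountered"),
))

for src, tgt in map_arguments(treffen, begegnen):
    print(f"{src.role}: {src.function} ({src.category}) "
          f"-> {tgt.function} ({tgt.category})")
```

Keying the mapping on the semantic role rather than on the grammatical function is what makes the difference between the accusative and the dative object explicit.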
A crucial point made by Heid (2006) – in the context of valence dictionaries – is that
valence descriptions should not only be provided for »prototypical predicates« like verbs, but
that this treatment should be extended to cover nouns and multi-word expressions as well (see
e. g. Heid and Gouws, 2006). In line with what has been discussed for (lexical-)semantically
related items, the differences in valence patterns as well as the mapping of valence arguments
(i) between morphologically related items (e. g. verbs and their nominalisations), and (ii)
between »collocationally related« items (e. g. nouns and their occurrences in
support-verb-constructions) are of central importance. Therefore, the points made in this
section so far also apply to non-verbal lexical items and multi-word expressions.
3.1.1.2 Collocations
Although collocations are highly relevant to both NLP and language-learning tasks (see e. g.
Heid, 2006; Tarp, 2008), their adequate treatment has been largely neglected in past dictionary
design. Apart from specialised collocation dictionaries which have been specifically designed
to deal with multi-word expressions (see e. g. Crowther et al., 2003; DiCE2 ), many current
electronic dictionaries (e. g. ELDIT: Abel and Weber, 2000) assign to them the status of
usage examples in the microstructure of a lexical entry. As a result, it is difficult – if not
impossible – in such dictionaries to obtain more detailed information about collocations, such as
valence descriptions (see above) or preferences (see below). Such information is, however,
indispensable in order for a dictionary to be a useful tool. In other words, the MLR model
needs to promote collocations to the status of »treatment units« (Gouws and Prinsloo, 2005:
pp. 134f), i. e. they should become part of the macrostructure of the dictionary and receive
their own microstructural description (see also Heid and Gouws (2006) and Spohr (2005)).
Moreover, due to their relevance as well as the fact that their descriptions are generally more
complex than those of single-word entities, the features of collocations will be picked up very
frequently throughout the discussions in this work.
2 http://www.dicesp.com
3 Cf. »er hat berechnet, dass . . . « (»he has calculated that . . . «), where in almost every observable
corpus instance »berechnen« is used in a past tense form (Heid, 2006: p. 76).
3.1.1.4 Multilinguality
Citing Atkins (1996), de Schryver (2003) states that bi- and multilingual dictionaries should
offer monolingual functions as well (e. g. definitions or usage notes), and that the second
language should receive »full treatment« (p. 164f). Moreover, Tarp points out that »tradi-
tional learner’s dictionaries that are either monolingual or bilingual only have a limited use-
ful value« (Tarp, 2008: p. 151). Following the approach of viewing bilingual dictionaries as
combinations of two monolingual dictionaries as in Spohr and Heid (2006), this requirement
can be met by the general design of the MLR structure. In addition to this bilingual view,
the approach presented there can be extended to allow for a general view on multilingual
dictionaries as combinations of several monolingual ones (cf. Sections 4.3.2 and 5.5.2).
Formal and technical requirements refer to formal properties of the MLR model, as well as
the technical aspects of the lexical resource. One of the requirements that is to be discussed in
more detail below refers to access and retrieval, which should be scalable and performed very
efficiently. Moreover, it should allow for complete exploration of all sorts of data contained
in the lexicon, their relations as well as complex combinations of both. In addition to this,
the model should support an explicit organisation of the descriptions mentioned above, in the
sense that it should be possible to identify different types of indications unambiguously (cf.
Heid, 1997: p. 12f4 ). Such typed links can also refer to entities outside the lexicon, such as a
query to another lexical resource, a thesaurus or an online search engine (de Schryver, 2003;
Tarp, 2008; Verlinde, 2010).
In addition to this, formal requirements include issues of consistency and integrity, which
become increasingly important when dealing with large amounts of lexicographical data, and
even more if these have been acquired and inserted both automatically and manually. Rele-
vant questions in this context are e. g. which properties or relations are used to describe which
kinds of items, and how it is possible to ensure that these items actually make use of only the
properties they are supposed to. Further technical requirements include the integration of a
morphological analyser that processes the input of the user – for word-form-based search or
orthographically tolerant search – as well as the logging of a user’s search behaviour in order
to be able to analyse and improve the interface of the lexical resource (Verlinde (2010); see
also Knapp (2004); Bergenholtz and Johnsen (2005); de Schryver et al. (2006)).
In the following subsections, we deal in more detail with the formal and technical require-
ments with respect to access and retrieval as well as consistency and integrity. However, they
will be discussed only insofar as they directly affect the internal structure of the MLR model.
Further aspects, such as how users should have access to the data, will be discussed in Section
3.1.3.
Tarp (2008) lists a number of minimal features that should be retrievable from an ED, such as
idioms, lemmas, irregular forms, word class or gender. In addition to this, Chiari (2006) states
that combinations of such features should also be queryable, e. g. »all nouns and verbs which
are rare or frequent and specific of any field except physics« (Chiari, 2006: p. 144).
4 See also the criterion »Konsistenz und Eindeutigkeit der Informationskodierung« (»consistency and
unambiguity of information encoding«) in (Schall, 2007: p. 40).
Of course, such expressions can be arbitrarily extended (». . . , and which subcategorise prepositional
phrases except ones with ’auf’ . . . «), meaning that for all items in the lexical resource that are
connected in some way, these connections should be explorable (see e. g. Dodd, 1989; Schall,
2007). In addition, this should be possible via different search modes, such as orthographically
tolerant search (fuzzy search), search with Boolean operators and wildcards, or even by voice
(cf. de Schryver, 2003; Schall, 2007; see also Section 3.1.3.1).
Chiari’s ideas are in line with what is more generally labelled as »Ad-hoc-Abfrage« (»ad-
hoc querying«) in (Heid, 1997: pp. 145ff) or »non-standard access« in Spohr and Heid
(2006), i. e. »access via paths involving other properties and relations than just lemmas«
(Spohr and Heid, 2006: p. 71). In a related way, de Schryver mentions »access aspects for
which the outer search path (leading to a lemma sign) does not necessarily precede the inner
search path (leading to data within articles)« (de Schryver, 2003: p. 173), and Atkins even
talked about the »iron grip of the alphabet«, calling for »new methods of access« (Atkins,
1992: p. 521). In this vein, it should in principle be possible to access the data at any ar-
bitrary point in the model. In other words, there should not be a predefined entry point or
access point to the data – as is usually the case with standard lemma-based query access5 .
In contrast to the traditional view of dictionaries as lists of entries – which are, according to
Polguère, simply »texts, in the most general sense« (Polguère, 2006: p. 51) – this requires viewing
dictionaries as graphs in which, among other things, »implicit references, in fact, all words
[. . . ] should be hyperlinked to the relevant lemma« (Prinsloo, 2005: p. 18; cf. Knapp, 2004:
pp. 87-89), and where all nodes and edges in the graph may serve as potential access points
(see e. g. Spohr and Heid, 2006; Trippel, 2006). Moreover, the MLR should enable access to
external and complementary sources of information, such as online search engines and text
corpora, in the sense of Tarp’s leximat which is capable of serving both communicative and
cognitive functions (cf. Tarp, 2008: p. 120ff; see also de Schryver, 2003; Gelpí, 2007).
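The idea that every node and edge may serve as an access point can be sketched as follows. This is a minimal illustration under the assumption that the lexicon is held as a set of subject-predicate-object statements; the class name, the property names and the sample data are hypothetical.

```python
from collections import defaultdict


class LexiconGraph:
    """Statements indexed by node and by edge label, with no fixed entry point."""

    def __init__(self):
        self.triples = set()
        self.by_node = defaultdict(set)  # node -> statements it occurs in
        self.by_edge = defaultdict(set)  # edge label -> statements

    def add(self, subject, predicate, obj):
        triple = (subject, predicate, obj)
        self.triples.add(triple)
        self.by_node[subject].add(triple)
        self.by_node[obj].add(triple)
        self.by_edge[predicate].add(triple)

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching statements, sorted; None acts as a wildcard."""
        if subject is not None:
            candidates = self.by_node.get(subject, set())
        elif obj is not None:
            candidates = self.by_node.get(obj, set())
        elif predicate is not None:
            candidates = self.by_edge.get(predicate, set())
        else:
            candidates = self.triples
        return sorted(
            t for t in candidates
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


graph = LexiconGraph()
graph.add("Rede", "partOfSpeech", "N")
graph.add("halten", "partOfSpeech", "V")
graph.add("Rede halten", "hasBase", "Rede")
graph.add("Rede halten", "hasCollocator", "halten")

# Access via a value, via an edge label, or via a lemma used as a value.
print(graph.match(predicate="partOfSpeech", obj="N"))
print(graph.match(predicate="hasCollocator"))
print(graph.match(obj="Rede"))
```

A lemma-based query is then merely one special case, namely a pattern in which only the subject position is filled.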
The next requirement on the MLR model that is to be discussed here refers to rather formal
aspects, namely the means that are necessary in order to ensure consistency and integrity of
the MLR. In a certain sense, the notion of integrity is subsumed by the notion of consistency.
However, we choose to use these two terms in order to describe two separate things. For
us, consistency refers to the question as to whether the underlying model of the MLR is
satisfiable – i. e. whether it is at all possible for lexical data to satisfy the conditions defined
in the lexicon model without causing any contradictions – as well as whether the data actually
satisfy the conditions.
Integrity, on the other hand, refers to the question as to whether the descriptions of the data are
complete. In order to achieve this, it is necessary to be able to (i) identify and distinguish between
different types of data in an MLR, (ii) define different wellformedness constraints and prop-
erties for these types, (iii) restrict the set of items that can occur as values of these properties,
and (iv) make sure that the data adhere to these restrictions.
5 Whether this is desirable for all kinds of users is not the question here. We believe, however, that it
is better to set the stage for »unrestricted access« to the data and constrain it e. g. by means of the
graphical user interface than to allow only for restricted access in the first place.
A very basic kind of consistency check is e. g. to ensure that the values of a part-of-speech property of
lexical items are actually made up of part-of-speech tags, and not of grammatical gender, case, or misspelt variants
(e. g. Nn instead of NN), and this should probably be possible in any »mildly« structured
formalism. However, more intricate cases are conceivable e. g. for collocations of the type »V + N_Obj«
(i. e. collocations with a verbal collocator and a nominal base that represents the object
of the verb, such as »eine Rede_N halten_V« (»to give_V a speech_N«)), where the part of speech
of the base of the collocation (»Rede«) has to be »N«, and the collocator (»halten«) has to
be a transitive verb that subcategorises an object. Such restrictions have to be formalisable
and verifiable in the MLR model, and traditional approaches that rely on document type def-
initions (DTD) or XML schemata do not have the formal means to express these kinds of
restrictions.
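The two kinds of checks just described can be made concrete in a small sketch. It assumes a dictionary-based toy lexicon; the tag set, the property names »pos« and »subcat« and the sample entries are hypothetical and serve illustration only.

```python
POS_TAGS = {"N", "V", "ADJ"}  # hypothetical closed set of admissible tags

lexicon = {
    "Rede": {"pos": "N"},
    "halten": {"pos": "V", "subcat": {"subject", "object"}},
    "schlafen": {"pos": "V", "subcat": {"subject"}},
}


def check_item(lemma, lexicon):
    """Basic check: the part-of-speech value must come from the tag set."""
    pos = lexicon[lemma].get("pos")
    if pos not in POS_TAGS:
        return [f"{lemma}: invalid part-of-speech value {pos!r}"]
    return []


def check_v_nobj_collocation(base, collocator, lexicon):
    """Check a collocation of type V + N_Obj: the base must be a noun and
    the collocator a verb that subcategorises an object."""
    missing = [f"{lemma}: not in lexicon"
               for lemma in (base, collocator) if lemma not in lexicon]
    if missing:
        return missing
    errors = []
    if lexicon[base].get("pos") != "N":
        errors.append(f"{base}: base must have part of speech 'N'")
    verb = lexicon[collocator]
    if verb.get("pos") != "V" or "object" not in verb.get("subcat", set()):
        errors.append(f"{collocator}: collocator must be a verb taking an object")
    return errors


print(check_v_nobj_collocation("Rede", "halten", lexicon))
print(check_v_nobj_collocation("Rede", "schlafen", lexicon))
print(check_item("Haus", {"Haus": {"pos": "Nn"}}))
```

The second check relates properties of two different items to each other, which is the kind of restriction that goes beyond validating the values of a single element.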
Multifunctional requirements denote those requirements that have been derived from the in-
tended multifunctionality of the lexical resource. In particular, they refer to the lexicographi-
cal functions that the MLR is to serve, and thus both to specific users’ needs and to needs of
NLP applications. As the functions are not necessarily known a priori, a general and exten-
sible mechanism for modelling these needs is required. Moreover, multifunctional require-
ments refer to the ways in which the content of the lexical resource is presented to different
types of users, e. g. by means of complete vs. (gradually) reduced article views, as well as
the possibility to generate a printable PDF version of the lexical entry (cf. Bartels and Spieß,
2002; de Schryver, 2003; Schall, 2007). As the points of departure are rather different for ad-
dressing the needs of human users and the needs of NLP applications, we will discuss below
the multifunctional requirements on the MLR first from a user-oriented perspective and then
from an NLP-oriented perspective.
As was mentioned above, there is an abundance of literature focussing on the needs of users,
and many of the requirements discussed there deal with the interaction between users and the
lexical resource. At the time of writing, probably the most comprehensive overview is pro-
vided by de Schryver (2003), who analysed in detail the relevant literature up to this point and
produced a list of no less than 118 desiderata, with emphasis on the advantages that electronic
dictionaries offer compared to their printed counterparts. A more recent publication in this
field is the dissertation by Schall, who elaborates – among others on the basis of de Schryver’s
more abstract analysis – a condensed catalogue of 95 criteria according to which electronic
dictionaries should be evaluated (cf. Schall, 2007: pp. 81–84). These criteria cover aspects
of content, access and presentation, and after introducing detailed guidelines as to how
this evaluation can be carried out, Schall goes on to evaluate several English and German
monolingual electronic dictionaries according to her guidelines (Schall, 2007: chap. 3).
Although most of the requirements mentioned in these works have their justification and
are relevant to the task at hand, it would not be reasonable to reproduce a discussion of
every single one of them. This is mainly due to the fact that many of them merely represent
»reminders« of small but more or less important details that should be paid attention to. For
example, criterion 16 mentioned by Schall states that for entries containing several graphical
illustrations (e. g. »fruit«) one should adhere to the proportions of the depicted items (cf.
Schall, 2007: p. 44). Therefore, we refer the curious reader to the respective publications
and instead focus on a subset of these requirements that rather relate to the conception of the
model that underlies the MLR with respect to its users.
The most important requirement – and the most obvious one, one might assume – is that
the MLR should have a user-friendly user interface. De Schryver notes that »if the contents
one needs at a particular point in time cannot be accessed in a quick and straightforward way,
the dictionary [. . . ] fails to be a good dictionary« (de Schryver, 2003: p. 173). However,
Chiari (2006) points out that although user-friendliness in terms of ease of access is a core
goal, it should not affect the overall performance and flexibility of the dictionary, such as
user customisation and the need to create user-defined dictionaries – a view she shares with a
number of other researchers. For example, Atkins (1996) defines the notion of a »virtual
dictionary« that is created at the time of dictionary consultation (see de Schryver, 2003:
p. 162), and Gouws speaks in terms of a »Mutterwörterbuch« (»mother dictionary«), i. e.
an abstract dictionary model from which several different (and more specialised) dictionaries
can be generated (Gouws, 2006). What these approaches have in common is the demand to be
able to define individual dictionaries on the basis of the needs of specific types of users (cf.
the criterion »Individualisierung« (»customisation«) in Schall, 2007: p. 83; see also Gelpí
(2007)). On the one hand, this can be partially realised using search forms that allow users
to mark the kinds of information they want to retrieve (see e. g. DiCouèbe6 ). On the other
hand, not every type of user can be expected to define their output indications in that way.
Moreover, when it comes to mapping a query to (both micro- and macrostructural) subsets of
the data based on the type of user and the needs associated with this type, a more general and
principled approach is needed.
As was mentioned in Section 2.2 above, Tarp (2008) provides useful insights in this re-
spect. In contrast to de Schryver (2003) and Schall (2007), which might to a large extent
be considered as comprehensive reviews of the literature at that time, Tarp (2008) presents a
more theoretical analysis based on his own general theory of lexicographical functions (func-
tion theory; see Section 2.2 above; cf. Bergenholtz and Tarp, 2002). Departing from a notion
similar to that of Atkins, who stated that »[e]very good dictionary starts from a clear idea of
who its users are and what they are going to do with it« (Atkins, 1996: p. 525), Tarp puts
particular emphasis on the needs of potential users, i. e. needs arising in cognitive or com-
municative situations which precede the actual consultation of the dictionary (cf. Tarp, 2008:
pp. 47ff). According to this view, dictionaries should be conceived of with the needs of users
in particular usage situations in mind (e. g. text reception vs. production). Moreover, Storrer
(2001) states the following.
»Lexikalische Daten können so modelliert werden, dass in Abhängigkeit von Nutzerinteressen und
Nutzungssituationen die jeweils relevanten lexikographischen Angaben und Verweise herausgegriffen
und in ästhetisch ansprechender Weise am Bildschirm dargestellt werden.« (Storrer, 2001: p. 53f)
[»Lexical data can be modelled such that, depending on users’ interests and usage situations, the
respectively relevant lexicographic indications and links are extracted and presented on screen in an
aesthetically appealing way.«]
6 http://olst.ling.umontreal.ca/dicouebe/main.php
Therefore, what is necessary for an MLR is a formal tool that states (i) which elements
of the lexical description are needed in a particular usage situation, and (ii) how these are
presented to the user. Here, the analyses in Tarp (2008) can be used primarily for (i), by means
of specifying the status of a specific kind of indication e. g. as primary need or secondary
need. For (ii), the tool needs to provide specifications which allow for the presentation of
the same content to different users in different ways. This involves e. g. different interface
languages depending on the mother tongue of the users, as well as varying amounts of expert
terminology, depending on their lexicographic proficiency.
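A minimal sketch of such a specification is given below. It assumes that usage situations and kinds of indications can be identified by simple names. The assignment of primary and secondary status shown here is invented for illustration and does not reproduce the analyses in Tarp (2008); likewise, the interface labels are hypothetical.

```python
# (i) Which kinds of indications are needed in which usage situation.
VIEW_SPEC = {
    "text production": {
        "valence": "primary",
        "collocations": "primary",
        "definition": "secondary",
    },
    "text reception": {
        "definition": "primary",
        "collocations": "secondary",
    },
}

# (ii) How the same kinds of indications are labelled for different users.
LABELS = {
    "en": {"valence": "Valence", "collocations": "Collocations",
           "definition": "Definition"},
    "de": {"valence": "Valenz", "collocations": "Kollokationen",
           "definition": "Definition"},
}


def select_indications(situation, include_secondary=False):
    """Return the kinds of indications needed in a usage situation."""
    wanted = {"primary", "secondary"} if include_secondary else {"primary"}
    return [kind for kind, status in VIEW_SPEC[situation].items()
            if status in wanted]


def render_headings(situation, language, include_secondary=False):
    """Return the headings to display, in the chosen interface language."""
    return [LABELS[language][kind]
            for kind in select_indications(situation, include_secondary)]


print(render_headings("text production", "de"))
print(render_headings("text reception", "en", include_secondary=True))
```

Keeping the selection in (i) apart from the labelling in (ii) allows either to be changed without touching the other.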
In addition to serving the needs of different types of users as outlined in the previous section,
a truly multifunctional lexical resource has to be able to serve NLP applications as well. Al-
though de Schryver notes that the data structures of large-scale NLP lexicons on the one hand
and human-readable electronic dictionaries on the other »seem worlds apart«, he notices a
clear tendency towards combinations of the two (de Schryver, 2003: p. 145). Similar to the
human users of a dictionary, one could say that most NLP applications »speak« their own lan-
guage. However, while in the case of human users this issue can be resolved e. g. by offering
the presentation of the content in several different languages, in the case of NLP it typically
involves completely different structures. For example, if a particular NLP application expects
input in XML, it does not automatically mean that it can reasonably process any arbitrary
XML format that it is provided with.
On the one hand, this can be overcome by providing – ideally standardised – formats
for the exchange of data with NLP applications, such as LMF (ISO/FDIS 24613, 2008; see
Section 3.2.2.1) or PAULA (Potsdam interchange format for linguistic annotation; Dipper
et al., 2006). Here, the dictionary content is converted into the respective exchange format,
which can in turn be converted into the format that is needed for the NLP tool. In the case
of standardised exchange formats, the NLP application either has the routines necessary for
conversion from the standard format into the application-specific format already, or they need
to be implemented by the developer of the application. A slightly different view on this issue
is to give NLP applications direct access to the lexical resource by means of an application
programming interface (API), which enables the developer of an NLP application to imme-
diately retrieve and process information from the MLR in a form that suits the respective
NLP tool. Here, the most important prerequisite for wide usability is that this API is ideally
defined in a platform-independent programming language, such as Java.
The extent to which these requirements can be met by the model of an MLR is rather lim-
ited. In fact, it does not go beyond the need to enable the definition of APIs and exchange
formats or, at best, support them e. g. by making use of existing APIs. The MLR as a whole
should provide access for NLP applications though, and therefore needs to provide mecha-
nisms for exporting the content of the lexical resource in a standardised exchange format as
well as an API for direct access to the lexicon.
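The two access routes for NLP applications can be contrasted in a short sketch. It assumes a toy in-memory lexicon and uses Python for illustration. The XML element and attribute names are hypothetical and do not follow LMF, PAULA or any other standardised format.

```python
import xml.etree.ElementTree as ET

lexicon = {
    "halten": {"partOfSpeech": "V"},
    "Rede": {"partOfSpeech": "N", "gender": "feminine"},
}


def get_features(lexicon, lemma):
    """Route 1, direct API access: return the features of a lemma as a dict."""
    return dict(lexicon[lemma])


def export_xml(lexicon):
    """Route 2, exchange format: serialise the whole lexicon as XML text."""
    root = ET.Element("lexicon")
    for lemma in sorted(lexicon):
        entry = ET.SubElement(root, "entry", lemma=lemma)
        for name, value in sorted(lexicon[lemma].items()):
            ET.SubElement(entry, "feature", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")


print(get_features(lexicon, "Rede"))
print(export_xml(lexicon))
```

With the exchange format, the consuming application still has to convert the exported structure into its own representation, whereas the API hands over the data in a form that can be processed immediately.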
The different types of requirements mentioned above have direct implications on various
aspects of the MLR. The requirements on the description primarily affect the linguistic model
that underlies the lexical resource. As this involves the way in which lexical entities and
their descriptions are represented, linguistic requirements determine, on a more abstract level,
the choice of the formalism that is used for defining the MLR model. While formal and
technical requirements affect the internal structure of the model as well, they further have an
effect on the interface between the MLR and human users, as well as between the MLR and
NLP applications. Therefore, we can say that in addition to affecting the underlying model,
they also influence the general architecture of the MLR and its components. Multifunctional
requirements have implications on all of the aspects just mentioned, namely the choice of the
formalism, the MLR model, as well as the lexicon architecture. In particular, however, they
also have a more theoretical implication with respect to the concept of a lexical entry. The
various implications will be discussed in the following.
One of the implications which can be directly derived from the above analysis is the fact
that the underlying formalism – in addition to being graph-based – cannot be entirely
unconstrained, but rather has to be strongly typed. In this respect, very general approaches such as
those described e. g. by Trippel (2006) and Polguère (2006) do not seem to provide the
appropriate structural means for ensuring consistency and integrity in the sense discussed
above – such as relations with defined domain and range – as they rather focus on a very
general and unconstrained graph structure (see also Section 3.2.2). In contrast to this, typed
formalisms based on the Resource Description Framework (RDF; Manola and Miller, 2004),
such as RDF Schema (Brickley and Guha, 2004) or the Web Ontology Language (OWL;
Bechhofer et al., 2004), offer, among other things, the formal devices needed to address the issues
mentioned above. In addition to this, if one attempts to define a new graph-based framework
or metalanguage, one is not unlikely to arrive at a point where one starts »remodelling« subsets
of RDF – the description of items in a lexicon can very well be considered as a specific case
of describing resources in general –, except for the fact that then large parts of the existing
technical infrastructure are no longer available, such as tools which interpret the vocabulary
that is needed for this description. This means that all but very basic infrastructure has to
be reimplemented in order to be able to interpret the new vocabulary, e. g. the semantics of
specific XML element tags7 . This is not to say that all these issues dissolve once RDF is
used. It rather means that using a common metalanguage that has been defined in a declar-
ative and standard framework entails some advantages, such as the fact that – e. g. in the
case of OWL – the formal characteristics and complexity have been investigated extensively
and are well-known, and that even at the most abstract level more than just very basic
infrastructure is available (see Section 4.3).
7 Cf. the joint project between the Universities of Tübingen (SFB 441), Hamburg (SFB 538) and
Potsdam (SFB 632), which addresses, among others, the issue of sustainability of linguistic data
(see e. g. Dipper et al., 2006; see also Trippel, 2006: p. 37).
In addition to this, consistency control is one of
the strongest claims of Burchardt et al. (2008b) in favour of using a typed framework for
the definition of models for lexical resources. In essence, they have proposed a combination
of general knowledge representation methods (e. g. theorem provers for axiom-based consis-
tency checking) with a graph-based query language (see also Section 6.1.2). In doing so,
they have been able to express highly complex consistency queries involving several distinct
layers, such as the formal definition of FrameNet, the frame semantic annotation scheme, as
well as syntactic corpus annotations.
Finally, the point made by Heid (1997) in the context of explicitly vs. implicitly organ-
ised dictionaries is a further argument in favour of using a typed formalism. Here, explicitly
organised dictionaries are those in which the type of every indication can be identified sepa-
rately, whereas in implicitly organised dictionaries the type of an indication, as well as e. g.
its spatial boundaries in a dictionary entry, needs to be determined on the basis of
metalexicographic analysis. As a result, implicit (i. e. untyped) resources may contain
ambiguous indications (Heid, 1997: pp. 12f).
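What »relations with defined domain and range« amount to can be shown in a small sketch. It uses plain Python rather than an RDF toolkit, and the type names, property names and data are hypothetical. Note that RDF Schema itself uses domain and range declarations as a basis for inference; reading them as constraints to be checked, as is done here, is a simplification for the purpose of illustration.

```python
# Type of every node in the graph (hypothetical data).
TYPES = {
    "Rede": "Noun",
    "halten": "Verb",
    "Rede halten": "Collocation",
}

# Declared relations: property name -> (domain, range).
PROPERTIES = {
    "hasBase": ("Collocation", "Noun"),
    "hasCollocator": ("Collocation", "Verb"),
}


def violations(triples):
    """Return (triple, reason) pairs for statements that break a declaration."""
    problems = []
    for triple in triples:
        subject, predicate, obj = triple
        if predicate not in PROPERTIES:
            problems.append((triple, "undeclared property"))
            continue
        domain, range_ = PROPERTIES[predicate]
        if TYPES.get(subject) != domain:
            problems.append((triple, f"subject is not of type {domain}"))
        if TYPES.get(obj) != range_:
            problems.append((triple, f"object is not of type {range_}"))
    return problems


statements = [
    ("Rede halten", "hasBase", "Rede"),
    ("Rede halten", "hasCollocator", "Rede"),  # wrong: the object is a noun
    ("Rede", "hasSynonym", "Ansprache"),       # wrong: property not declared
]

for triple, reason in violations(statements):
    print(triple, "->", reason)
```

In an entirely unconstrained graph, all three statements would be equally admissible, and such errors could only be detected by inspecting the data.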
In addition to the implications on the choice of the formalism, the different kinds of require-
ments directly affect the internal structure of the MLR model and its immediate infrastructure.
For example, in order to be able to retrieve arbitrary combinations of items from the lexical
resource, the relations between these entities have to be expressible in the model in the first
place. In other words, the model should be very powerful and offer representations for a wide
range of phenomena. On the other hand, however, it is very important to keep an eye on the
complexity of the descriptions in the lexicon. Since the data have to be retrievable with an
adequate amount of effort, the MLR model should not be more complex than necessary and,
ideally, support the development of other components of the lexicon, such as a graphical user
interface.
With regard to the modelling of valence information, it needs to be said that although
FrameNet (Baker et al., 1998) offers a number of attractive solutions for the description of
valence phenomena, a complete adoption of its paradigm would produce a rather specific
and theory-dependent lexicon model. Therefore, while it is reasonable to follow the
FrameNet approach in principle and to adhere to three levels of valence description – syntac-
tic category, syntactic function and semantic role – the MLR model should, for the sake of
theory-independence, deviate as far as the formalisation of frames and roles is concerned. In
other words, it should not rely on frame semantics as a theoretical framework and be extensi-
ble such that it can express valence information also on the basis of a different set of semantic
roles, such as the one defined by Sowa (2000), if this is desired. Nonetheless, in view of
the existence of resources such as the Berkeley FrameNet database for English (Ruppenhofer
et al., 2002) and the SALSA lexicon for German (Burchardt et al., 2006; Spohr et al., 2007),
it would certainly be unwise to ignore FrameNet and not make use of the lexical knowledge
contained in these resources. Therefore, it seems reasonable to use FrameNet as a first
basis for the description of valence information.
As was mentioned in the analysis above, a model is needed which formalises the needs of
different types of users in terms of the entities in the MLR. This component should ideally be
kept separate from the actual data, in the sense that it provides user-specific views on the data,
instead of manipulating them directly. The specifications in this model should be represented
in a format that enables them to be processed by the components of the user interface that deal
with the access by querying as well as the presentation of the content of the lexical resource
to the user. In addition to this, the analysis above suggests that the MLR should allow for
the embedding of several NLP tools – such as a morphological analyser that processes the
input given by the user and a generator for generating inflected forms – as well as a tool
that enables orthographically tolerant search, e. g. by means of calculating the Levenshtein
distance (Levenshtein, 1966).
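Orthographically tolerant search on the basis of the Levenshtein distance can be sketched as follows. The distance function is the standard dynamic-programming formulation; the search function, its threshold and the sample lemma list are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def tolerant_search(query, lemmas, max_distance=1):
    """Return all lemmas within max_distance edits of the query."""
    return sorted(lemma for lemma in lemmas
                  if levenshtein(query.lower(), lemma.lower()) <= max_distance)


print(tolerant_search("begegnem", ["begegnen", "treffen", "erinnern"]))
```

Comparing the query against every lemma in this way is only practical for small lexicons; a larger resource would presumably require an index or a pre-filter on top of the distance measure.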
Finally, one of the most important implications of the above analysis – in particular of
Tarp’s account of lexicographic functions (Tarp, 2008) and the concept of multifunctionality
as suggested e. g. by Heid and Gouws (2006) – is that lexical entries are not uniform across
different types of users, since the same indications may need to be presented to different kinds
of users in different ways. Therefore, a lexical entry is not a static entity that is stored as a text
in some database. Rather, it is to be conceived of as an entity that is determined on the basis
of the needs of a user, and that has to be generated dynamically from the statements which
make up the MLR graph. In Chapter 5 and especially Chapter 6 of this work, we present ideas
as to how the dynamic notion of lexical entries can be supported by the underlying model,
in the sense that a lexical entry is generated at the moment the consultation takes place. An
architecture which caters for this would be conceptually very close to Gouws’ notion of a
»mother dictionary« (Gouws, 2006) as well as the concept of a leximat as conceived of by
Tarp (2008).
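The dynamic notion of a lexical entry can be illustrated with a minimal sketch. It assumes that the lexicon consists of subject-predicate-object statements and that a user profile simply lists the properties to be shown; the profile names, property names and data are hypothetical.

```python
STATEMENTS = [
    ("Rede", "partOfSpeech", "N"),
    ("Rede", "gender", "feminine"),
    ("Rede", "collocation", "eine Rede halten"),
    ("halten", "partOfSpeech", "V"),
]

# Hypothetical profiles: which properties each type of user is shown.
PROFILES = {
    "learner": ["partOfSpeech", "gender", "collocation"],
    "nlp": ["partOfSpeech"],
}


def generate_entry(lemma, profile, statements=STATEMENTS):
    """Assemble an entry for a lemma at consultation time; nothing is
    stored as a ready-made article."""
    entry = {"lemma": lemma}
    for prop in PROFILES[profile]:
        values = [o for s, p, o in statements if s == lemma and p == prop]
        if values:
            entry[prop] = values
    return entry


print(generate_entry("Rede", "learner"))
print(generate_entry("Rede", "nlp"))
```

The same statements thus yield different entries for different profiles, while the underlying data remain untouched.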
This section will assess the performance of related approaches in the field of lexical resources,
covering electronic dictionaries for human users, models for computational lexicons
designed for NLP purposes, and approaches for combining the two. As the focus
of the discussion in this section is on recent approaches, however, we will just give a brief
introduction and review of the main points of criticism that have been identified with respect
to the more »traditional« approaches, before discussing in detail the more recent proposals.8
The predominant approaches of the 1980s and 1990s have been described and analysed at
length in the relevant literature (see e. g. Menon and Modiano (1993); Heid (1997); Hellwig
(1997)).
8 In order to highlight the differences and problems with some of these proposals, a certain amount of
formal detail in terms of XML representations is at times required. A very basic introduction to the
main characteristics of XML can be found in Section 4.2.2.
The 1980s and 1990s saw a large number of projects aiming at the definition of standardised
formats for reusable lexical resources. Among the most prominent ones are probably
the (then EC-funded) projects ACQUILEX9, GENELEX10 and MULTILEX11, as well as the
more recent PAROLE12 and its follow-up SIMPLE13. What all of these approaches share is that
they aimed at the definition of lexical resources and standard representation formats for
computational use only. Hence, reusability had the sole interpretation of »being usable for more
than one NLP application«. The ACQUILEX project, for example, dealt with the acquisition
of lexical knowledge from machine-readable versions of printed dictionaries, as well
as its representation in a format that would serve different NLP applications. However, as
the emphasis was on computational use, no investigation has been carried out as to how these
representations could in turn support the production of user-oriented dictionaries.
In addition to this, due to the focus on computational processing efficiency, as well as
storage and bandwidth restrictions, which were much more of an issue at that time, the developed
formats are largely unhelpful as far as their naming schemes are concerned. For
example, the definition of the SIMPLE format provides rather opaque element names like
combuf and inp, which stand for »combination of usage feature« and »inflection paradigm«
respectively. This impedes the process of getting familiar with the format and the semantics
of its elements enormously, in particular for application developers who consider using the
respective format for interchange. This is also true of frameworks like COMLEX (Grishman
et al., 1994), whose interface language has been inspired by the Lisp programming language
family. Besides, many formats have been defined in formalisms which are no longer in use,
and not all of them have made the transition to current representation languages like XML.
As a result, the tools that support them are no longer actively developed and can therefore not
be considered state-of-the-art.
On a more conceptual level, Heid (1997) criticises a number of these approaches for their
lack of a »content model« (»Inhaltsmodell«: p. 28), i. e. a set of formal specifications that
define the well-formedness of a lexical description (ibid., pp. 37f), which in turn demands a
high level of familiarity with the »intended« structures in order to add representations to the
resource. This criticism applies as well to the EAGLES project, which aimed at the definition
of a »multilingual and multifunctional model for a dictionary viewed as a resource out of
which to extract specific application lexicons« (Menon and Modiano, 1993: p. 4) on the basis
of the results of ACQUILEX, MULTILEX and GENELEX (cf. Hellwig, 1997). Although
such lexical specifications had been developed at a later stage in the project, the formal tools
for processing the constraints were still missing. This was different in the DELIS14 project,
which used the constraint-based formalism TFS (Typed Feature Structure; Emele, 1994) to
define lexical specifications, suffering however from scalability issues (Heid, 1997: pp. 88f).
9 »ACQUIsition of LEXical knowledge for natural language processing systems«, 1989-1995,
Copestake (1992).
10 »GENEric LEXicon«, 1989-1994, Antoni-Lay et al. (1994).
11 »MULTIlingual LEXical representation«, 1991-1993, Paprotté and Schumacher (1993).
12 »Preparatory Action for linguistic Resources Organisation for Language Engineering«, 1996-1998,
Ruimy et al. (1998).
13 »Semantic Information for Multifunctional Plurilingual Lexica«, 1999-2000, Lenci et al. (2000).
14 »Descriptive Lexical Specification and tools for corpus-based lexicon building«, 1993-1995.
Even if the formal tools existed in the form of SGML (and later XML) document type def-
initions (DTDs), the respective projects made use of these formalisms in a way that did not
fully exploit their defining power. The CONCEDE15 project, which is based on the »Text
Encoding Initiative« (The TEI Consortium, 2009) and is, in fact, a more restrictive encoding
of it, has produced an XML DTD16 that defines structures like the one in (3.1), which shows
part of the English/Slovene entry for »although« (taken from Erjavec et al., 2003).
As can be seen in (3.1), information like the part-of-speech is represented in text form (i. e.
character data in XML terminology). The DTD, however, is not capable of constraining the
content of the pos element, which means that it was in principle possible to enter any kind
of character data here.17 Moreover, the value of the type attribute in the alt element is
intended to constrain the elements that can occur as its children. However, there is no formal
means that would check for the equality of the text value of the type attribute and the names
of the children of the alt element. While such modelling decisions have certainly been made
in order to keep the format as flexible as possible, they undermine the structural means that
a formalism like XML offers for representing consistent data. As will be shown in the next
section, this is also characteristic of some of the very recent approaches.
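The two checks that the DTD cannot express can be sketched as an external validation step. The fragment below is a hypothetical, simplified entry and is not the original example; only the names pos, alt and type come from the discussion above, while the remaining element names, the Slovene equivalents and the closed list of part-of-speech values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; child element names and content are assumptions.
ENTRY = """
<entry>
  <pos>conj</pos>
  <alt type="trans">
    <trans>čeprav</trans>
    <trans>četudi</trans>
  </alt>
  <alt type="trans">
    <eg>although it rained</eg>
  </alt>
</entry>
"""

ADMITTED_POS = {"n", "v", "adj", "adv", "conj"}  # assumed closed value set

def check_entry(xml_text):
    """Return violations that a DTD with unconstrained character data misses."""
    root = ET.fromstring(xml_text)
    problems = []
    # Check 1: the text content of pos must come from a closed inventory.
    for pos in root.iter("pos"):
        if (pos.text or "").strip() not in ADMITTED_POS:
            problems.append(f"unknown part of speech: {pos.text!r}")
    # Check 2: the children of alt must be named like the value of its type.
    for alt in root.iter("alt"):
        expected = alt.get("type")
        for child in alt:
            if child.tag != expected:
                problems.append(f"alt type={expected!r} contains <{child.tag}>")
    return problems
```

For the fragment above, the function reports the second alt element, whose child does not match the declared type. Both checks live outside the document grammar, which is precisely the situation criticised here.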
The Lexical Markup Framework (LMF; ISO/FDIS 24613, 2008) is a very recent ISO initiative
which is devoted to the definition of a framework for modelling lexical resources, and which
has reached standard status by the end of 2008. Its main goals have been defined as follows.
»Lexical Markup Framework (LMF) is an abstract metamodel that provides a common, standardized
framework for the construction of computational lexicons. LMF ensures the encoding of linguistic
information in a way that enables reusability in different applications and for different tasks. LMF
provides a common, shared representation of lexical objects, including morphological, syntactic and
semantic aspects.« (ibid., p. vi)
15 »CONsortium for Central European Dictionary Encoding«, 1998-2000, Erjavec et al. (2003).
16 See http://www.itri.brighton.ac.uk/projects/concede/DR2.1/XML/xcesDic.dtd.
17 This could only be achieved at a later stage with a move to XML Schema.
LMF provides the so-called LMF core package, which contains very basic classes like
LexicalEntry, Form and Sense (ibid., p. 8). This core package serves as a basis for various
extensions, such as packages for morphological, syntactic and semantic descriptions, all of
which depend directly or indirectly on the core package (cf. ibid., p. 10). These have been
defined by means of diagrams in the Unified Modelling Language (UML), although informa-
tive XML specifications have been added at a later stage. Since the example fragments that
are available on the LMF website18 are all provided in this XML format, it is assumed that
the XML specification is intended to be used as the main interchange format. At the time of
writing, the first large-scale effort to encode lexical data in LMF appears to be under way
in the KYOTO project (WordNet-LMF; see Soria et al., 2009).
As the above definition suggests, one of the primary goals of LMF is to achieve interoper-
ability between lexical resources and applications exchanging content with lexical resources.
However, although LMF intends to facilitate »true content interoperability across all aspects
of electronic lexical resources« (ISO/FDIS 24613, 2008: p. vi), it remains rather vague on
its application in the various contexts, and in particular on its application in human usage
situations19. Despite these restrictions, however, LMF represents a very powerful and promising
framework that proposes modelling solutions for a wide range of linguistic phenomena.
Moreover, since it is the result of several years of work by a group of experts and
has been published as an ISO standard, it can be considered a reference for the definition of
a multifunctional lexical resource. For this reason, comparisons between the representations
chosen in the lexicon model proposed here and the representations provided by LMF will
recur very frequently throughout this work, in particular in the context of the modelling of
collocations (Section 5.1.2) and valence information (Section 5.2.3). In the following, we
will comment on some of the general design patterns in the XML specification that have been
identified as problematic with respect to the requirements discussed above.
Similar to what has been mentioned in the context of the traditional approaches, it seems
that LMF sacrifices the typing power of XML for flexibility. This is reflected, among other
things, in the definition of the feat element, which is used for representing attribute-value pairs by
means of the two attributes att and val. The following is a fragment of a representation
available from the LMF website.
(3.2) <Lexicon>
<LexicalEntry>
<feat att="partOfSpeech" val="verb"/>
<Lemma>
<feat att="writtenForm" val="boil"/>
</Lemma>
<SubcategorizationFrame id="regularSVO">
<SyntacticArgument id="synArgX">
<feat att="syntacticFunction" val="subject"/>
18 See http://www.lexicalmarkupframework.org/.
19 Francopoulo et al. (2006) even state that LMF is a »framework for the construction of NLP lexicons«
(ibid., p. 233), whereas the standard document does not seem to be as restrictive.
As can be seen in (3.2), the feat element is used to encode all sorts of properties, such as the
part-of-speech property or the syntactic functions and categories in a subcategorisation frame.
According to ISO/FDIS 24613 (2008), the values of att and val are meant to be taken from
the ISO/FDIS 12620 (2009) Data Category Registry (DCR; see below). However, there is no
inherent mechanism for ensuring that this is actually the case. In particular, the non-normative
LMF DTD specifies the following for the attributes of feat.
In essence, this means that the values of the two attributes are simply made up of char-
acter data. While this has advantages in terms of flexibility and extensibility, it does not
suffice to make sure that the values of these attributes have actually been taken from the
DCR. Moreover, it does not suffice to differentiate between attributes which have a fixed
number of admissible values, like partOfSpeech, and those which can take any value,
such as writtenForm. Finally, the number of occurrences of a particular value of att is
not restricted this way (e. g. att="partOfSpeech" could occur more than once), nor is
the »type« of the data category with respect to the elements in which it can occur (e. g.
att="writtenForm" inside of SyntacticArgument), and therefore, all of these constraints
would need to be checked by an external application. This has been improved in the
aforementioned implementation in the KYOTO project, where the data categories have been
modelled as XML attributes themselves. Instead of the representation in (3.2), a WordNet-LMF
representation would look e. g. as follows20.
In addition to these issues, LMF makes heavy use of implicit references. Consider the UML
diagram shown in Figure 3.1, which is a reproduction of Figure N.1 in ISO/FDIS 24613
(2008: p. 66).
As can be seen in the figure, the white boxes in the top left-hand corner represent the
components of an MWE, while the boxes at the bottom of the figure represent the lexical
realisations of the constituents of the MWE pattern by the components. In the example, the
components »to«, »the« and »lions« are meant to represent the PP constituent of the MWE
pattern. However, the respective instances are not referenced directly by means of IDREF,
as one would expect, »but, on the contrary, they are referenced by their respective order-
ing« (ISO/FDIS 24613, 2008: p. 64). In Figure 3.1, this is shown by means of the attribute
componentRank of the MWE Lex instance. For the component »lions«, for example, this
indirect reference would be represented in the XML specification as follows.
Figure 3.1: LMF representation of MWE patterns (taken from ISO/FDIS 24613, 2008: p. 66)
The problems with this representation are the following. Considering the possibility of defin-
ing IDREF attributes which would represent actual direct links between elements, it seems
odd to make use of references that need to be established first. Moreover, these links are
dependent on a particular ordering of the Component elements, and are in fact incorrect if
this ordering is – maybe unintentionally – changed at one point. Finally, there is no inher-
ent mechanism which could ensure that the number which is the value of the val attribute
of the MWELex element is at most as high as the number of Component elements in the re-
spective ListOfComponents. If an IDREF attribute had been used in order to link these
elements, a validator would indicate links involving entities that do not exist. This is not the
case with this representation, where the reference e. g. to a non-existent fifth component via
att="componentRank" val="5" would not result in an ill-formed XML file. According to
ISO/FDIS 24613 (2008), this has been done in order to »provide a generic representation of
MWE combinations within a given language« (ibid., p. 64). Considering the disadvantages
just presented, however, it seems questionable whether this has been the optimal choice.
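The last of these problems can be made concrete with a small sketch of the check that an external application would have to perform. The fragment is a hypothetical simplification: only the names ListOfComponents, Component, MWELex, feat and componentRank are taken from the discussion above, while the nesting of the elements and the assumption that ranks are counted from 1 are ours.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment with four components and two rank references.
FRAGMENT = """
<Lexicon>
  <LexicalEntry>
    <ListOfComponents>
      <Component/><Component/><Component/><Component/>
    </ListOfComponents>
  </LexicalEntry>
  <MWELex><feat att="componentRank" val="4"/></MWELex>
  <MWELex><feat att="componentRank" val="5"/></MWELex>
</Lexicon>
"""

def dangling_ranks(xml_text):
    """Return componentRank values that point at no existing Component."""
    root = ET.fromstring(xml_text)
    n_components = len(root.findall(".//ListOfComponents/Component"))
    dangling = []
    for feat in root.findall(".//MWELex/feat[@att='componentRank']"):
        rank = int(feat.get("val"))
        # Assumption: ranks are 1-based positions in the component list.
        if not 1 <= rank <= n_components:
            dangling.append(rank)
    return dangling
```

The fragment parses without error although its second reference is dangling; with an IDREF attribute, a validating parser would reject such a document without any additional code.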
In summary, it should be emphasised again that the objective of LMF is not to provide an
XML format. Essentially, LMF is a meta-standard whose objective is to provide a common
framework for specifying models for lexical resources by means of UML diagrams which
conform to the specification provided in the LMF standard document. However, for develop-
ers of new or existing lexical resources who want to offer LMF as a format for exchanging
data between their application and other applications which are able to process LMF data,
the XML specification is the primary point of departure, and should therefore ideally be de-
fined in a way that ensures consistent data exchange. With the informative DTD provided by
ISO/FDIS 24613 (2008), it seems that this is not the case.
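To illustrate what consistent data exchange would additionally require, the following sketch implements the kinds of checks on feat elements that were identified above as falling to an external application. The value sets stand in for content of the Data Category Registry and are assumptions, as is the restriction of writtenForm to Lemma elements.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed stand-ins for registry content; a real check would consult the DCR.
CLOSED_VALUES = {
    "partOfSpeech": {"verb", "commonNoun", "adjective"},
    "syntacticFunction": {"subject", "object"},
}
OPEN_CATEGORIES = {"writtenForm"}       # any character data is admitted
SINGLE_VALUED = {"partOfSpeech"}        # at most once per parent element
ALLOWED_PARENTS = {"writtenForm": {"Lemma"}}

def check_feats(xml_text):
    """Report feat usages that are valid character data but inconsistent."""
    problems = []
    for parent in ET.fromstring(xml_text).iter():
        feats = parent.findall("feat")
        for feat in feats:
            att, val = feat.get("att"), feat.get("val")
            if att in CLOSED_VALUES and val not in CLOSED_VALUES[att]:
                problems.append(f"{att}: value {val!r} not admitted")
            elif att not in CLOSED_VALUES and att not in OPEN_CATEGORIES:
                problems.append(f"{att}: unknown data category")
            # Categories without an entry in ALLOWED_PARENTS may occur anywhere.
            if parent.tag not in ALLOWED_PARENTS.get(att, {parent.tag}):
                problems.append(f"{att}: not allowed inside <{parent.tag}>")
        counts = Counter(feat.get("att") for feat in feats)
        for att in sorted(SINGLE_VALUED):
            if counts[att] > 1:
                problems.append(f"{att}: occurs {counts[att]} times in <{parent.tag}>")
    return problems

SAMPLE = """
<LexicalEntry>
  <feat att="partOfSpeech" val="verb"/>
  <feat att="partOfSpeech" val="vreb"/>
  <SyntacticArgument>
    <feat att="writtenForm" val="boil"/>
  </SyntacticArgument>
</LexicalEntry>
"""
```

Applied to `SAMPLE`, the function reports a misspelt value, a duplicated part-of-speech and a misplaced writtenForm, none of which conflicts with attributes declared as plain character data.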
As was mentioned in Section 3.1.2.1, one of the requirements on the MLR model is to con-
ceive of the lexicon as a graph. This view is a rather new development, and recent years have
seen a number of approaches in this direction. The most prominent of these are the Lexical
Systems approach (LS; Polguère, 2006) and the Lexicon Graph model (LG; Trippel, 2006),
both of which view lexicons as directed graphs. In the case of LSs, these are implemented
as flat Prolog databases consisting only of two different types of clauses, namely entity()
and link(). In contrast to this, an LG is represented by means of a custom XML format
consisting essentially of the three elements lexitems, relations and knowledge, each of
which contains further elements taken from a fixed inventory of nine different types (e. g.
relation, which further contains source and target elements; for more details see Trip-
pel, 2006: p. 115f). Despite these implementational differences, the two approaches bear –
as Polguère points out – »striking similarities« (Polguère, 2009: p. 43). This refers primarily
to the fact that both approaches aim at the definition of flexible lexicon models which impose
very few constraints on the data that can be represented.
Polguère (2009) criticises the fact that in lexical resources which are structured by means
of predefined principles, developers have to »stretch« their models when adding further phe-
nomena. In particular, this criticism is meant to apply to hierarchically structured resources,
and Polguère emphasises that LSs are non-hierarchical structures which allow for the »in-
jection« of a hierarchical organisation on demand (ibid., p. 43). How this hierarchical inter-
pretation comes about, however, remains rather unclear. In particular, if there is no inherent
hierarchical organisation within an LS, it cannot be assumed that there is a built-in relation
for representing inheritance, because if there were, it would be hard to maintain a difference
between hierarchical and non-hierarchical structures. Thus, there is no way to express
that relations like »is_a« (as e. g. in DiCo; Polguère, 2000), »inherits-from« (as in FrameNet;
Baker et al., 1998) and »subClassOf« (as in any RDF-based resource) essentially refer to the
same relation, since they are not mapped to a common relation in an LS. Therefore, if the
mentioned resources were compiled into LSs, each of these relations would need its own
interpretation function in order for the hierarchical organisation to be »injected«. Polguère
mentions that for the compilation of the DiCo database, a hierarchy of semantic labels has
been created by means of the Protégé ontology editor (Knublauch et al., 2004), then exported
to XML and finally inserted into the LS Prolog format. As this seems like a rather ad-hoc
blend of different formalisms, and since this editor in particular has been developed primarily
for ontology languages like OWL, it is not obvious why ontology formalisms have not been
used in the first place. In contrast to this, the LG model tries to map different hierarchical
relations to a common inheritance relation, rather than representing each of them individu-
ally (cf. Trippel, 2006: p. 108). This suggests a more principled and accurate treatment of
hierarchical information in the LG model.
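The difference between the two treatments can be sketched over a flat list of link facts. The facts and the mapping below are illustrative assumptions and reproduce neither the LS Prolog format nor the LG XML format; the point is merely that the equivalence of the three relation names has to be stated once, in one place, before a hierarchy can be computed across resources.

```python
# Hypothetical link facts in the style link(source, relation, target).
LINKS = [
    ("poodle", "is_a", "dog"),
    ("dog", "is_a", "animal"),
    ("Commerce_buy", "inherits-from", "Getting"),
    ("ex:Verb", "subClassOf", "ex:PartOfSpeech"),
    ("dog", "synonym", "hound"),
]

# Resource-specific relation names mapped to one common inheritance relation.
INHERITANCE_RELATIONS = {"is_a", "inherits-from", "subClassOf"}

def ancestors(node):
    """Follow all inheritance-like links transitively, whatever their name."""
    found, todo = set(), [node]
    while todo:
        current = todo.pop()
        for source, relation, target in LINKS:
            if (source == current and relation in INHERITANCE_RELATIONS
                    and target not in found):
                found.add(target)
                todo.append(target)
    return found
```

Without the mapping in `INHERITANCE_RELATIONS`, a separate traversal function would be needed for each relation name, which corresponds to the separate interpretation functions mentioned above.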
According to Polguère (2009), directed graphs are powerful representations »particularly
suited for lexical knowledge«, which is emphasised by the author’s claim that LSs are capable
of representing »all information present in any form of dictionary and database« (ibid., p. 49).
While this may seem like a very advantageous property of LSs on the one hand, it somewhat
summarises their major disadvantage on the other. In other words, the consequence of Pol-
guère’s claim is that anything can be modelled, including incorrect information. Although
LSs offer a means for expressing a value of trust of a particular statement (see Polguère,
2009: p. 45) – with a value of ’0’ indicating that a certain piece of information ought to
be incorrect – there is no inherent mechanism for determining under which circumstances a
statement is incorrect. This would require a formal model that states e. g. that a lexeme which
specifies two different parts-of-speech is incorrect. In addition to this, it seems difficult to
detect missing information, since anything that is expressed as a Prolog clause using one of
the two admitted predicates (see above) is a well-formed statement in an LS. Polguère seems
to be aware of this, as he describes LSs as being »not too choosy«. In this respect, the LG
model seems to be more restrictive, as it makes use of an XML document grammar. However,
as Trippel (2006: p. 101f) indicates, this is limited to rather simple consistency issues.
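What such a formal model would have to contribute can be sketched as follows. The facts are well-formed statements in the sense discussed above, yet one lexeme carries two parts of speech and lacks a written form. The property names and the two constraint sets are assumptions for illustration; in the approach pursued in this work, constraints of this kind are stated in the model itself rather than in application code.

```python
from collections import defaultdict

# Every fact is well-formed; a flat store accepts all of them.
FACTS = [
    ("boil", "partOfSpeech", "verb"),
    ("boil", "partOfSpeech", "noun"),
    ("lion", "partOfSpeech", "noun"),
    ("lion", "writtenForm", "lion"),
]

FUNCTIONAL = {"partOfSpeech"}               # assumed: at most one value
REQUIRED = {"partOfSpeech", "writtenForm"}  # assumed: at least one value

def violations(facts):
    """Detect conflicting and missing information for each lexeme."""
    values = defaultdict(set)
    for subject, prop, value in facts:
        values[(subject, prop)].add(value)
    report = []
    for subject in sorted({s for s, _, _ in facts}):
        for prop in sorted(FUNCTIONAL):
            if len(values[(subject, prop)]) > 1:
                report.append((subject, prop, "conflicting values"))
        for prop in sorted(REQUIRED):
            if not values[(subject, prop)]:
                report.append((subject, prop, "missing"))
    return report
```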
In sum, it can be said that LSs are to be understood as highly flexible data structures rather
than formal models for the representation of lexical information. The LG model provides a
formal description to some extent, such as the constraint that the relation element consists
of at least one source element and exactly one target. However, similar to the XML-based
approaches discussed above, crucial aspects seem to be hidden mainly in CDATA values of
elements and attributes, e. g. in the form of an unrestricted type attribute (Trippel, 2006:
p. 116). Therefore, the provided formalisation refers to the data structure only, not to the
actual linguistic content. For this, it would be necessary to add further definitions, e. g. for
relations of a certain type, which are absent, however. While this is completely reasonable
given its objective to define a generic and highly flexible lexicon model, it meets the re-
quirements mentioned in the previous section only partly. In particular, given the need for a
strongly typed formalism, and considering the benefits of widely used standard formalisms
like RDF and OWL (see Section 4.3), both LSs and the LG model are regarded as being too
unrestricted to serve as a formal basis for the definition of the multifunctional lexicon model,
and thus, their overlap with the approach presented here is mainly restricted to the view of
the lexicon as a graph.
The SALSA lexicon model (Spohr et al., 2007; Burchardt et al., 2008a) is an approach to
representing multi-layer corpus annotations in a form that enables flexible querying of the
different layers. In particular, the approach has resulted in the definition of an OWL-based
lexicon model, which has been done with the participation of the author. In addition to
enabling flexible querying, the SALSA lexicon model aimed at the definition of mechanisms
for checking the consistency and integrity of the manual annotations in the SALSA corpus
(Burchardt et al., 2006). In order to achieve this, the lexicon model provides a formalisation
of FrameNet (Baker et al., 1998), which has served as the framework for the lexical-semantic
annotation layer, as well as the SALSA annotation scheme, which specifies e. g. the types of
frames and semantic roles which can be annotated in a sentence. The lexicon model is thus
not a generic one that generalises to other frameworks or annotation schemes, but rather the
successful application of a general methodology for inducing a formal lexicon model from
corpus annotations.
In addition to highlighting a number of benefits of modelling linguistic information in a
logic-based formalism, the SALSA lexicon model has produced a useful workflow of the lex-
icon compilation process, covering the steps from the conversion of XML corpus annotations
to OWL, the consistency and integrity checking of these annotations on the basis of descrip-
tion logic axioms and graph queries, as well as scalable storage in a relational database. Due
to the close links with respect to the underlying formalism and the resulting overlap,
several central aspects of the SALSA lexicon model will be discussed in the course of this
book, primarily in Chapter 6.
While the previous subsections have presented different approaches to defining computational
models which aim at the representation of lexical data, the current subsection deals with re-
sources which describe linguistic information in general, without commitment to a particular
model. In particular, we will briefly introduce the ISO 12620 Data Category Registry (DCR;
ISO/FDIS 12620, 2009) and the General Ontology for Linguistic Description (GOLD; Far-
rar and Langendoen, 2003, 2010). These two resources are very relevant in the context of a
lexicon model, as they provide inventories for linguistic description.
The DCR is an ISO standard aimed at the definition of »widely accepted linguistic con-
cepts« (Windhouwer, 2009), whose latest version can be accessed via the ISOcat web inter-
face21 . Technically, the DCR is a flat list of categories that can be used for the description of
linguistic entities. For each descriptive device, the DCR provides natural language definitions
and examples, as well as names in different languages. For example, the data category entry
of partOfSpeech22 specifies that it is a »category assigned to a word based on its grammatical
and semantic properties«, and that possible names are e. g. »pos« in English and »Wortklasse«
in German. Finally, the valid values are specified (such as »adjective« or »commonNoun«),
which can be used for modelling attribute-value pairs as mentioned above in the context of
LMF (see page 24). However, as was further discussed there, no mechanism has been imple-
mented yet which would ensure that the attributes make use of the admitted values only.
In contrast to this, the descriptions in GOLD are more formalised than the flat specifications
in the DCR. According to Farrar and Langendoen (2010), GOLD is an ontological theory that
specifies entities in the domain of linguistics, e. g. InflectedUnit, TenseFeature or hasSubject.
Instead of a simple listing of the different types of linguistic entities, the OWL version23 of
21 See http://www.isocat.org/interface/index.html.
22 See http://www.isocat.org/rest/dc/396.
23 See http://www.linguistics-ontology.org/gold-2008.owl.
GOLD goes beyond the natural language definitions in ISO 12620 in that it further attempts
to formalise general linguistic knowledge in the form of description logic axioms. Straight-
forward axioms specify e. g. that »verb« is a part of speech or »subject« is a syntactic role,
while more complex ones like (3.6) (taken from Farrar and Langendoen, 2010) state that an
inflected unit is a grammar unit that has an inflectional unit as constituent.
(3.6) $\mathit{InflectedUnit} \equiv \mathit{GrammarUnit} \sqcap \exists\,\mathit{hasConstituent}.\mathit{InflectionalUnit}$
These formalisations mean that GOLD can make use of the existing computational infrastruc-
ture of OWL, such as application programming interfaces and description logic reasoners for
ensuring the use of valid attribute-value pairs. However, due to its emphasis on the ontologi-
cal nature of linguistic entities, which focusses on giving definitions e. g. of what constitutes a
phrase or a linguistic sign in general, GOLD provides a lot of information that is not relevant
in the context of a lexicon model and is thus only indirectly usable. Compared to this, the
objectives of ISO/FDIS 12620 (2009) are more closely related to the definition of a lexical
resource. However, the implementation of ISO 12620 is still under development, which is
shown e. g. by the fact that the current version of March 2010 contains two identical data
categories for part-of-speech. Moreover, the DCR selection process as outlined in ISO/FDIS
12620 (2009) and ISO/FDIS 24613 (2008) has not been implemented yet. As a result, nei-
ther the DCR nor GOLD will be used directly in the MLR model presented here, in the sense
that their specifications are not directly imported into the model. Nonetheless, as will be dis-
cussed in Chapter 5, both GOLD and the DCR have been very influential in the modelling of
the descriptive devices in the MLR, and a number of data categories have in fact been taken
from these resources.
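What an axiom of the kind shown in (3.6) adds over a flat list of categories can be illustrated by evaluating its right-hand side over a small set of individuals. The individuals below are invented, and the sketch uses a closed-world set computation for simplicity; a description logic reasoner operates under the open-world assumption and would draw its inferences differently.

```python
# A tiny hypothetical interpretation; all individual names are invented.
GRAMMAR_UNIT = {"walked", "walk", "the"}
INFLECTIONAL_UNIT = {"-ed"}
HAS_CONSTITUENT = {("walked", "walk"), ("walked", "-ed")}

def inflected_units():
    """Evaluate GrammarUnit ⊓ ∃hasConstituent.InflectionalUnit as sets."""
    # Individuals with at least one constituent that is an inflectional unit.
    has_inflectional_part = {
        whole for whole, part in HAS_CONSTITUENT if part in INFLECTIONAL_UNIT
    }
    return GRAMMAR_UNIT & has_inflectional_part
```

In this interpretation only »walked« qualifies as an inflected unit, a classification that follows from the definition and does not have to be asserted for each individual.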
The final type of lexicon model concerns models for so-called ontology lexica, which pro-
vide linguistic enrichment for the entities defined in an ontology. For example, a property
like capital in a geographical ontology can be described by an entry in an ontology lexicon
which specifies that it can be realised by means of »is capital of«, where the subject of the
property capital is mapped to the subject of »is capital of«, and the object of the property
to the complement of the preposition »of«. This information is then used for NLP tasks like
natural language generation or question answering (see e. g. Unger et al., 2010). The most
prominent representatives of this category are LingInfo, LexOnto and LexInfo, with the latter
being based on the former two models and LMF (see Cimiano et al. (2011) for a detailed
discussion).
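The mapping described for the property capital can be sketched as follows. The class and the string template are deliberate simplifications and do not reproduce the data model of LingInfo, LexOnto or LexInfo, which describe syntactic frames rather than templates; the triple is an invented example.

```python
from dataclasses import dataclass

@dataclass
class PropertyLexicalisation:
    """Hypothetical ontology-lexicon entry for one ontology property."""
    property_name: str
    template: str  # {subject} and {object} mark the two argument slots

LEXICON = {
    # Subject of the property -> syntactic subject;
    # object of the property -> complement of the preposition "of".
    "capital": PropertyLexicalisation("capital", "{subject} is capital of {object}"),
}

def verbalise(triple):
    """Generate a sentence for an ontology triple via its lexicon entry."""
    subject, prop, obj = triple
    entry = LEXICON[prop]
    return entry.template.format(subject=subject, object=obj) + "."
```

A call such as `verbalise(("Berlin", "capital", "Germany"))` realises the triple as a sentence; question answering would use the same entry in the opposite direction, from the linguistic expression to the property.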
While being very closely related to the model presented here from a technological perspec-
tive – all of these models are based on Semantic Web formalisms –, they differ in terms of
which purpose the model is to serve, as well as which kinds of linguistic information can be
represented and how. On the one hand, they have been developed to represent lexicalisations
of ontological concepts, not electronic dictionaries for computational and human use, and
thus do not provide a rich classification of linguistic concepts as do e. g. GOLD or the model
developed in this work. On the other hand, as will become obvious in Sections 5.1 and 5.2 of this
book, the representation of valence information and preference phenomena, in particular with
respect to multi-word expressions, is rather different, as is the way in which these phenomena
are interrelated in the model.
After having introduced a number of relevant models for computational lexical resources, we
will now focus on a selection of electronic dictionaries for human use. This selection has
been made on the basis of the requirements analysis presented above, as all of the dictionaries
discussed in the following are considered to illustrate important aspects. Therefore, we will
not reproduce the comprehensive discussions presented for example in Schall (2007), but
rather highlight specific features which indicate the approach to multifunctionality taken in
each dictionary. In general, it can be said that none of the mentioned dictionaries caters for
NLP applications, which means that multifunctionality is restricted to »serving several types
of human users«. Moreover, most of them are still in their development phase.
3.2.3.1 ADNW
The Aktives Deutsch-Niedersorbisches Wörterbuch24 (ADNW; Bartels and Spieß, 2002) rep-
resents a very basic kind of multifunctional dictionary. It is being developed by the Sorbisches
Institut in Saxony and is aimed at advanced learners as well as teachers and students of Lower
Sorbian. Despite being explicitly under development, the ADNW displays a very clear tendency
towards serving multiple functions. In particular, it offers for a selection of lexemes a
»Schulversion« (»school version«), which differs from the full version in the indications that
are displayed in a dictionary entry as well as in its general layout25.
While the primary goal of the ADNW is to be published as a printed dictionary, the lat-
est development steps are being released in electronic form on the dictionary’s website26 .
Although this means that general benefits of the electronic medium, in particular advanced
search functionalities, are not fully exploited and can therefore not be evaluated critically, the
general attempt to approach multifunctionality by providing variable presentation modes for
dictionary entries is without any doubt relevant in the context of this work.
3.2.3.2 Ordbogen over faste vendinger
The Ordbogen over faste vendinger27 is a monolingual Danish idiom dictionary that provides
a direct implementation of the key notions of function theory (see Section 2.2 above). It
caters for different situation types by offering users different search options, such as »I would
like to have support in understanding an expression« 28 for users in a text-receptive situation
and ». . . in writing a text« 29 for text-productive situations. Moreover, it allows users to find
24 »Active German-Lower Sorbian Dictionary«
25 Compare the full entry of »Ecke« at http://www.dolnoserbski.de/dnw/dnw/ecke.html with its school
version, found at http://www.dolnoserbski.de/dnw/dnsw/ecke.html.
26 See http://www.dolnoserbski.de/dnw/index.htm.
27 »Dictionary of fixed expressions«; see http://www.ordbogen.com/opslag.php?dict=fvdd. As of
September 2011, access to the dictionary is no longer free of charge. The following discussion
is therefore based on Almind et al. (2006) and the examples in the instruction manual, which is – as
the entire dictionary website – available in Danish only.
28 »Jeg vil have hjælp til at forstå en vending«.
29 ». . . skrive en tekst«.
expressions starting from a specific meaning30 (onomasiological access) and further offers the
option to learn more about an expression31 , which underlines its close connection to Tarp’s
concept of a leximat (cf. page 5 above).
Choosing one of these options has a direct impact on the way in which the dictionary entries
are presented (see Almind et al., 2006: p. 179). The instruction manual lists entries for »have
aben« (»to be in an undesirable situation«; literally »to have the monkey«) as an example,
whose text-receptive version contains only meaning indications, whereas the text-productive
entry starts with the fixed expressions involving »aben«, followed by meaning indications,
grammatical information, collocations, examples and synonyms. In other words, it contains
those indications which correspond to Tarp’s primary and secondary needs in text-productive
situations.
Despite a very clear explanation of the fact that dictionary entries differ according to the
selected situation, Almind et al. (2006) remain unclear as to how the actual process of select-
ing the relevant indications is carried out. More precisely, it is not explained explicitly if (i)
there is a separate model that specifies which indications are relevant in a certain situation,
or (ii) the relevance is represented as part of the dictionary itself, or even (iii) the dictionary
entries themselves have been hard-coded. The chosen strategy is a very important factor in
assessing the dictionary, as it has considerable effects on the adaptability of the dictionary to
further communicative situations, as well as changes to the covered situations, e. g. if further
indications are added at a later stage, or if existing indications are identified to be of less
relevance than had been assumed. In addition to this, the Ordbogen over faste vendinger
offers only very simple query access to the dictionary data. In particular, access by means of
(partial) lemmata is the only search route offered, which means that it is not possible to query
for more complex configurations, such as combinations of »aben« with a verb and/or an ad-
jective. Finally, the fact that its interface exists in Danish only is a considerable obstacle for
less advanced users, and therefore, the Ordbogen over faste vendinger is remarkable mainly
for the multifunctional presentation of its dictionary entries.
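Strategy (i) from the discussion above, a separate model that states which indications are relevant in which situation, can be illustrated with a minimal Python sketch. It does not reflect how the Ordbogen over faste vendinger is actually implemented, which the source leaves open. The names SITUATION_MODEL, ENTRY and select_indications are hypothetical, and all entry values other than the lemma and its meaning paraphrase are placeholders.

```python
# Hypothetical situation model, kept apart from the dictionary data.
# The order of each list is the order of presentation in the entry.
SITUATION_MODEL = {
    "text-receptive": ["meaning"],
    "text-productive": [
        "expressions", "meaning", "grammar",
        "collocations", "examples", "synonyms",
    ],
}

# One entry, stored once and independent of any situation.
# Everything marked "(placeholder)" is invented filler, not dictionary data.
ENTRY = {
    "lemma": "have aben",
    "meaning": "to be in an undesirable situation",
    "expressions": ["have aben"],
    "grammar": "(placeholder)",
    "collocations": ["(placeholder)"],
    "examples": ["(placeholder)"],
    "synonyms": ["(placeholder)"],
}


def select_indications(entry, situation, model=SITUATION_MODEL):
    """Return the (name, value) pairs that the model marks as relevant."""
    if situation not in model:
        raise KeyError(f"unknown situation: {situation!r}")
    return [(name, entry[name]) for name in model[situation] if name in entry]


if __name__ == "__main__":
    for situation in SITUATION_MODEL:
        names = [name for name, _ in select_indications(ENTRY, situation)]
        print(situation, "->", ", ".join(names))
```

Under this assumption, adding a further situation or demoting an indication only requires a change to SITUATION_MODEL; the entries themselves stay untouched, which is the adaptability argument made above.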
3.2.3.3 BLF/DAFLES
Similar to the Ordbogen over faste vendinger, the interface to the Base Lexicale du Français32
(BLF) is clearly oriented towards the notion of a leximat. According to Verlinde (2010), it
is a lexicographic tool for learning French vocabulary that is entirely based on users’ needs.
The content of the BLF is based on the Dictionnaire d’Apprentissage du Français Langue
Étrangère ou Seconde33 (DAFLES; Selva et al., 2002).
The general strategy of the interface to the BLF is to let users determine the kinds of infor-
mation that they are given in response to a query. Here, the BLF makes use of Tarp’s analysis
of user needs in different communicative and cognitive situations (see above), and offers dif-
ferent entry points to the data. For example, a user can either get specific information on a
particular word (such as the gender or orthography), verify the use of a word or a sequence
of words, or learn how to express a certain idea. In general, a user’s involvement starts with
30 ». . . søge efter en vending ud fra en betydning«.
31 ». . . vide mere om en vending«.
32 »Lexical Resource of French«, see http://ilt.kuleuven.be/blf/.
33 »Learners’ Dictionary of French as Foreign or Second Language«
the specification of a lexical item, which is followed by selecting the desired output informa-
tion. Once a query has been stated that way, users are given the desired answer, from where
they can explore the respective item further, e. g. by clicking on it. All of a user’s
queries and clicks are recorded and stored in a database, where they are
available for further research into the consultation behaviour of dictionary users.
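The recording of queries and clicks described above can be sketched as a small consultation log. This is only an illustration under stated assumptions: the actual database design of the BLF is not described in the source, and the table name consultation_log, its columns, and the functions open_log, record and actions_per_item are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def open_log(path=":memory:"):
    """Open (or create) a hypothetical consultation log."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS consultation_log (
               id        INTEGER PRIMARY KEY AUTOINCREMENT,
               session   TEXT NOT NULL,
               action    TEXT NOT NULL CHECK (action IN ('query', 'click')),
               item      TEXT NOT NULL,   -- the lexical item consulted
               requested TEXT,            -- desired output, e.g. 'gender'
               at        TEXT NOT NULL    -- UTC timestamp, ISO 8601
           )"""
    )
    return conn


def record(conn, session, action, item, requested=None):
    """Store one query or click."""
    conn.execute(
        "INSERT INTO consultation_log (session, action, item, requested, at) "
        "VALUES (?, ?, ?, ?, ?)",
        (session, action, item, requested,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def actions_per_item(conn):
    """Count logged actions per lexical item, as a simple research query."""
    return conn.execute(
        "SELECT item, COUNT(*) FROM consultation_log "
        "GROUP BY item ORDER BY item"
    ).fetchall()
```

A log of this kind would let researchers ask, for instance, which kinds of information are requested most often for a given item, without changing the dictionary data itself.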
As the previous paragraph suggests, the considerations in the context of the BLF are very
closely related to those presented in this work. The primary differences lie in the way in
which the modelling of user needs is approached. As was mentioned in Section 3.1.3.1, the
approach followed here is to devise a formal tool that specifies the pieces of information that
are relevant in a certain situation – in contrast to the users specifying the output themselves.
Primary reasons for doing so are to avoid overloading of the user with options (i. e. »infor-
mation stress« or even »information death« in the sense of function-theoretic scholars like
Henning Bergenholtz and Sven Tarp), and to make the consultation situation less dependent
on the particular need. More precisely, the way in which users have access to the information
they are looking for should not be fundamentally different depending on the type of the in-
formation. This dependence on particular needs and the resulting information stress has been
attested in a recent usability study (Heid, 2011) and led to a redesign of the BLF interface
(cf. Verlinde, 2011). Further differences will become apparent in Sections 5.4 and 6.2.2.
3.2.3.4 ELDIT
now not been tested in computational scenarios, although a systematic approach seems to be
conceivable in general (cf. Knapp, 2004: p. 97).
3.2.3.5 elexiko/OWID
elexiko is part of the XML-based monolingual lexical information system OWID35 developed
at the Institut für Deutsche Sprache, whose main target groups consist of German native
speakers and learners of German. According to Haß (2005: p. 3), it is a »plurifunctional«
dictionary which lets the users decide the function that it is to serve in a particular usage
situation, by allowing them to display and hide certain indications once a dictionary entry is
displayed (e. g. »for etymological information click here«). However, as Haß (2005) points
out, the dictionary authors differentiate in the creation of the dictionary between information
for laymen and information for expert linguists and model the information accordingly. As is
shown in Müller-Spitzer (2005), this is achieved by means of XSLT stylesheets which display
the indications that are relevant for a particular type of user. As of early 2010, it seems that
this aspect is still under development, since it has not been made publicly available yet.
As with the other dictionaries discussed so far, the search functionality offered by elexiko36
is rather limited. Its extended search allows users to specify values for a small selection of
indications, e. g. whether the part-of-speech is a verb or a noun. For other indications, however,
it only offers the possibility to state whether a certain indication should be present or not. Grammar
indications, for example, can be accessed only by selecting »any« or »with valence«.
In contrast to the recent developments in the field of computational lexicography, the model
of elexiko is not at all graph-based. In fact, it can be taken as a prototypical example of a
text-based dictionary in the sense of Polguère (2006). However, this is certainly true of the
other dictionaries as well; it is just that elexiko allows us to gain detailed insights into its
internal structure. As key multifunctional aspects are still under development, elexiko is only
on its way to becoming a multifunctional lexical information system. Whether it is going to
be multifunctional in the sense explained in Section 2.3 above by serving NLP applications
as well, cannot be determined at this stage.
3.2.3.6 DWDS
The Digitales Wörterbuch der Deutschen Sprache des 20. Jh. (»digital dictionary of 20th cen-
tury German«; DWDS37 ) is an ongoing project at the Berlin-Brandenburgische Akademie der
Wissenschaften, aiming at the development of a freely available lexical database of German.
While the DWDS is of limited interest from a user-oriented perspective on multifunctional-
ity, in the sense of serving different types of users e. g. in text-receptive vs. text-productive
situations, it is very relevant from the point of view of how lexical data can be presented
to a user. Especially in its current development version 2.0beta, the DWDS has moved
away from primarily displaying the entries from its printed predecessor, the Wörterbuch
der deutschen Gegenwartssprache (»dictionary of contemporary German«; Klappenbach and
Malige-Klappenbach, 1980) towards offering an interactive presentation.
35 »Online-Wortschatz-Informationssystem Deutsch«
36 See http://www.owid.de/suche/elex/erweitert.
37 See http://www.dwds.de/woerterbuch.
As can be seen in Figure 3.2, which shows a screenshot of the entry for »bringen«, the view
contains four different panels which display information from the DWDS dictionary (top
left), OpenThesaurus (top right) and the DWDS corpus (bottom left), as well as a word profile
(bottom right). Further panels can be added by clicking on the blue button on the left-hand
side of the screen, and each of these can be moved around freely on the screen. Within the
DWDS dictionary panel, users can decide to be satisfied with the information displayed by
default (namely grammatical information, sense definitions and style markings), or else click
on specific items in the entry in order to get further information, mainly example sentences.
Whether the selection of what is displayed is regulated by a formal model of user needs is not
clear. However, there is no doubt that the contents of the dictionary panel could be adapted
such that it serves different user types. In fact, the GUI has a button for changing between
different views (see the brown button on the left-hand side of Figure 3.2), although this seems
to affect only the kinds of panels that are displayed, as well as where they are positioned on
the screen. This mechanism could certainly be extended for the mentioned task. In sum, the
DWDS is a very good example of how interoperability of different lexical resources can be
achieved, by combining dictionary content, corpus access and access to external resources.
Moreover, it aims at supporting the formulation of queries that go beyond the very simple
ones seen on the previous pages. Although the query language used so far seems to be too
complex for the average untrained user38 , the general idea of offering advanced options for
stating complex queries is very positive. We will return to this aspect in the context of the
following dictionary.
3.2.3.7 DiCouèbe
The last electronic lexical resource to be discussed here is the web interface to the Diction-
naire de Combinatoire/Lexique actif du français (DiCo/LAF39 ; Polguère, 2000; see also page
26 above), a monolingual French dictionary describing the combinatorial properties of lexi-
cal units based on Meaning-Text Theory (Mel’čuk and Žolkovskij, 1970). As with some of
the dictionaries discussed so far, it is not explicitly targeted at a particular group of users, al-
though its interface suggests that it is aimed at expert users rather than inexperienced learners
(cf. Figure 3.3).
The web interface DiCouèbe is most remarkable for its query interface, which allows for
the formulation of quite complex queries. As can be seen in Figure 3.3, the query interface
provides a number of text fields for specifying values of particular properties, such as »nom
vocable« (»word«), »fonction lexicale« (»lexical function«) or »marque d’usage« (»usage
marking«). For most of these indications, more than one value can be specified. In addition
to this, it is possible to specify the indications that should appear in the result, by ticking the
boxes on the lefthand side of the query form. This way, users are shown only those indications
that they have explicitly asked for. Although these features make the DiCouèbe interface
– in terms of the complexity of the dictionary queries that can be formulated – the most
advanced of the ones discussed in this section, it needs to be said that it is still very difficult
for untrained users to pose queries. The main problem with the DiCouèbe interface is that
for most indications the user needs to know the possible values. For example, it is necessary
to know which usage markings exist and how they are spelt, before a felicitous query can
be formulated. With possible values like Caus1PredAble1Real1 for lexical functions, this is
not a trivial task. Such problems can be overcome easily by offering drop-down lists instead
of unconstrained text fields, which is the strategy that has been followed in this work (see
Section 6.2.2.1).
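The remedy named above, offering a closed list of values instead of an unconstrained text field, can be sketched in a few lines of Python. This is not the implementation presented in Section 6.2.2.1, nor the DiCouèbe code. ALLOWED_VALUES, dropdown_options and check_value are hypothetical names, and the listed values are illustrative only: Caus1PredAble1Real1 is quoted from the text above, while the other lexical functions and both usage markings are assumed for the example.

```python
import difflib

# Hypothetical closed vocabularies. In a real system these would be read
# from the lexicon itself rather than written by hand.
ALLOWED_VALUES = {
    "fonction lexicale": ["Caus1PredAble1Real1", "Magn", "Oper1"],
    "marque d'usage": ["familier", "soutenu"],
}


def dropdown_options(indication):
    """Values to offer in a drop-down list for one indication."""
    return sorted(ALLOWED_VALUES[indication])


def check_value(indication, value):
    """Accept a known value; otherwise raise with near-miss suggestions."""
    options = ALLOWED_VALUES[indication]
    if value in options:
        return value
    # Suggest similarly spelt values, if any, to help with misspellings.
    close = difflib.get_close_matches(value, options, n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(close)}?" if close else ""
    raise ValueError(f"{value!r} is not a value of {indication!r}.{hint}")
```

With a drop-down fed by dropdown_options, the user never has to know or spell a value; check_value shows the fallback that would be needed if free text input were kept.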
3.2.4 Summary
Summing up what has been discussed on the previous pages, it can be safely said that none
of the lexical resources – neither on the side of computational lexicons nor dictionaries for
human users – has been conceptualised as a truly multifunctional lexical resource accord-
ing to the definition in Section 2.3. Although some of them (e. g. the Ordbogen over faste
38 One example provided in the online help is the query »sein with $p=VVFIN #20 $p=VVPP #0
worden«, which extracts phrases of the form »sein Participle worden«. Clearly, such queries are
meant to be used for corpus look-up by expert users, rather than for mere dictionary consultation.
39 See http://olst.ling.umontreal.ca/dicouebe/index.php.
vendinger) have been designed such that they can serve in different usage situations, they
have not been prepared for access by or exchange with NLP applications. In addition to this,
non-standard access as highlighted in Section 3.1.2.1 is possible only in elexiko, DWDS and
DiCouèbe. While in the case of elexiko the offered means allow for the formulation of very
simple queries only, the ones offered in the DWDS seem too complex to be used by the
average dictionary user40 . The DiCouèbe interface seems to offer a very good balance between
these two. As was mentioned above, however, it requires the user to have detailed knowledge
of the names of the data categories in the resource in order to extract information. Moreover,
several of the monolingual dictionaries just discussed cater for one
interface language only, namely the one that is identical to the object language (e. g. Danish
for the Ordbogen, German for the DWDS or French for the DiCouèbe). As a result, their
target groups are already restricted to users with advanced knowledge in these languages.
Such issues certainly need to be overcome in order to make a dictionary interface useable for
a wider audience, including learners at the beginner’s level (cf. ELDIT).
Due to the fact that the focus of this study is on providing a model for a multifunctional
lexical resource, the general steps towards the definition of an electronic dictionary for human
users taken in this work – including ideas for user interfaces and query access – cannot be
assumed to be able to »compete« with sophisticated user interfaces such as the one offered by
the DWDS. Moreover, since most of the dictionaries discussed in the previous section allow
only for very limited insight as to their underlying models, the major reference of this work
is in the field of the models for computational lexicons (cf. Section 3.2.2). As was mentioned
40 Again, linguistically untrained users are certainly not the target group of this feature of the DWDS,
and the complexity of the query language is in line with that of the languages used for other resources
for computational lexicographic research.
The Project Gutenberg eBook of The Boy from
Green Ginger Land
Language: English
BY
E. VAUGHAN-SMITH
AUTHOR OF ‘CRAGS OF DUTY,’ ETC.
Illustrated
LONDON
WELLS GARDNER, DARTON & CO., LTD.
PATERNOSTER BUILDINGS, E.C.
To-morrow morning came all too soon. A pleasant letter from Aunt
Grace arrived at breakfast-time, containing a warm invitation for
Punch to take up his abode at Woodsleigh, which was a great relief
and pleasure to the rest of the party, but otherwise the day was a
trying one. Mary went about with a duster swathed round her head,
as she always did during the spring-cleaning, and there was a
general feeling of bustle and discomfort. The children wandered
restlessly from room to room, trying to help, but usually only
succeeded in being in the way, and secretly they rather longed for
the cab which was to take them to the station in time for the 11.35
train.
The cab came at last, and less than a quarter of an hour later they
found themselves installed with Punch and endless baggage in a
second-class railway carriage, while Mary stood on the platform
smiling bravely. Another few minutes, and the train was starting with
a shriek and a pant. All three children leaned out of the window,
waving frantically, till the line curved round a corner and Mary and
her fluttering handkerchief were lost to sight. After that it was Punch
who saved the situation. All his journeys to the seaside had failed to
accustom him to railway travelling, and he now took refuge under
the seat, looking so cowed and miserable that nobody could think of
anything but how to comfort and reassure him. They were so much
occupied with this as to be quite taken by surprise at reaching their
destination in what seemed an astonishingly short time.
The only people waiting on Woodsleigh platform were a lad who
served both as porter and ticket-collector and Aunt Grace herself—an
Aunt Grace who looked wonderfully young and pretty to be aunt and
guardian to such a big girl as Emmeline. She was, in fact, very much
what her niece Kitty might become a few years hence when
transformed from a tomboy into a fashionable, grown-up young lady.
She hurried forward to open the carriage door for the children, and
greeted the whole party, including Punch, with such frank delight at
seeing them that not even Emmeline could help being charmed, and
a limpet-like twin was soon clinging to either side of her in a
devoted, if rather inconvenient, fashion.
‘We shall have to leave the boxes to be brought up by the milk-
cart in the course of the afternoon,’ explained Aunt Grace when the
luggage had all been taken out of the train. ‘We’re very primitive at
Woodsleigh, and the milk-cart’s the only thing we can boast of in the
way of a public conveyance. It won’t come till later on in the
afternoon, but I can lend brushes and sponges, so I hope you’ll be
able to manage all right till then.’
‘We did wash our hands just before coming, and Mary brushed all
our hairs,’ Micky was careful to assure her, ‘so you needn’t trouble to
lend us things. But thank you all the same,’ he added hastily, for fear
of hurting her feelings.
‘Micky, you know Mary always makes us wash our hands and faces
after railway journeys!’ said Emmeline—a remark which Micky, who
was just then stooping down to undo Punch’s lead, found it
convenient not to hear.
‘I hope before long to get a donkey and donkey-cart of our own,’
observed Aunt Grace as they left the station and came out into a
village street; ‘then we shan’t have to depend on the milk-cart, and
it will be much more convenient altogether.’
‘Oh, Aunt Grace, how lovely!’ exclaimed Kitty, giving a joyous little
skip. ‘Donkeys are such dears!’
‘I shall ride ours bare-back,’ announced Micky, ‘and teach him all
sorts of tricks.’
‘I’m always so glad to think of a donkey having a good home,’ said
Emmeline; ‘people are so cruel to them sometimes. When we stayed
at the seaside, it often made us quite sad to see how they were ill-
treated.’
‘Yes, I know,’ said Aunt Grace; ‘it is very sad. Two or three years
ago I was staying at the seaside with some children, who made a
special point of hiring the ones with unkind masters for extra long
rides, and never letting them be whipped, so as to give them a rest
from being ill-treated.’
‘I wish I knew those nice children,’ said Kitty.
‘And I expect they found the donkeys really went quite as well,
didn’t they, Aunt Grace?’ asked Emmeline, who had not yet learned
that virtue often has to be its own reward.
‘Well, I’m afraid I can’t say they did,’ owned Aunt Grace with a
little twinkle in her eye; ‘at the best of times they went at a slow and
stately pace somewhat resembling a funeral procession, and at the
worst of times they sat down comfortably in the middle of the road
and refused to budge. Still, I don’t doubt that if my friends had had
the bringing up of those donkeys from the first, they would have
gone all right without being beaten. It was simply that the poor
creatures had got so used to it that they didn’t understand anything
else.’
‘Aren’t we nearly at your house?’ asked Kitty presently; ‘we seem
getting quite outside the village now.’
‘No, we have still about ten minutes’ walk before we get to Fir-tree
Cottage,’ replied Aunt Grace; ‘it stands right away from other
houses, just outside a large wood. It’s very nice in most ways being
quite out of the village, for it makes one so much freer to do just as
one likes, but it’s rather inconvenient sometimes being so far from
the station. It’s really not so very much farther to Chudstone Station
—the one you passed next before Woodsleigh; indeed, when I have
plenty of time, I sometimes start from there instead of from
Woodsleigh, for it makes a delightful walk through the wood.’
‘How jolly to live in a cottage and so near a wood!’ cried Kitty,
giving another little skip.
‘As to living in a cottage, I’m afraid you won’t find it quite your
idea of one,’ said Aunt Grace, ‘though it really was one before
grandfather built on the front part of the house. The wood’s real
enough though, and begins only just outside our back-garden gate,
which is very charming of it.’
‘I thought grandfather was a Professor,’ remarked Micky, looking
puzzled.
‘Why, so he was,’ said Aunt Grace.
‘But if he built the front part of the house he must have been a
stone-mason, like Mary’s brother,’ objected Micky.
‘Aunt Grace didn’t mean that he built it with his own hands, you
silly child!’ said Emmeline, laughing.
‘But I don’t wonder Micky thought I did,’ said Aunt Grace kindly; ‘it
was very natural.’
Aunt Grace was right in saying that Fir-tree Cottage was not the
kind of cottage to which the children were used. It was what they
considered quite a large house, standing well back from the road
among lawns and shrubberies, and when they walked in at the front
door they found themselves, not in the poky little passage that Kitty
had been picturing to herself from her remembrances of seaside
lodgings, but in a hall as large as the one at their old home, and far
more charming, for it was bright with ferns and flowering plants and
cosy with cushioned seats and lion-skin rugs. In this hall they were
met by a rather austere-looking person whom Aunt Grace called
Jane.
‘Jane was my nurse when I was a little girl,’ she said, ‘so we are
very old friends, and now she is going to help look after you;’ at
which Jane smiled grimly, and Emmeline thought how horrid it would
be to have her to look after them instead of kind, gentle Mary.
‘Now, we must certainly take Punch to be introduced to Cook,’ said
Aunt Grace; ‘she’s a splendid person for animals.’
This introduction was so successful that Emmeline forgot all
disagreeable impressions. Cook was found in her bright airy kitchen
with its red-tiled floor and rows of shining dish-covers, and she and
Punch seemed quite delighted with one another. ‘That’s a rare nice
little dog,’ she kept saying as he smelt round her skirts with marked
approval. ‘Have you shown them the kennel, miss?’ she added. ‘I
give that a good scrubbing yesterday as soon as ever I heard he was
coming, so that will be all nice and fresh for him now. There’s clean
straw in too.’
‘We must go and admire it,’ said Aunt Grace, and they went
through the scullery and out into the back-yard, in one corner of
which was an enormous dog-kennel.
‘The last dog who lived here was a St. Bernard,’ explained Aunt
Grace, ‘so Punch will find his quarters very roomy ones.’
‘Aunt Grace, you aren’t going to keep him chained up except when
he goes for walks, are you?’ asked Kitty.
‘Why, of course not,’ said Aunt Grace; ‘this is his private bedroom,
that’s all, and I no more expect him to stay here all day than I shall
expect you to stay in your bedrooms.’
This so greatly relieved the children that they were in a mood to
be delighted with everything when Aunt Grace led them upstairs to
show them their own bedrooms. She took them first to the room
which the two girls were to share, and they both exclaimed at the
sight of its dainty white-painted furniture and fresh muslin hangings.
In each half of the room was a little white bed, a white wash-hand
stand, and a white chest of drawers with a looking-glass standing on
the top of it.
‘It’s quite like grand grown-up ladies, both of us having a wash-
stand and a dressing-table to ourselves,’ said Kitty, with much
satisfaction; ‘there was only one of each in the night-nursery at
home.’
‘They are such pretty ones, too,’ said Emmeline. ‘I do love white
enamel.’
‘I’m very glad you like them,’ said Aunt Grace, looking pleased; ‘I
always think one has so much more heart in keeping one’s room tidy
if the furniture’s nice.’
‘Yes, you won’t have to leave your things about here, Kitty,’
remarked Emmeline, in her elder-sisterly way.
Kitty was not listening; she had rushed to the window. ‘I do
believe—yes, you really can just see the sea!’ she exclaimed. ‘Oh,
Aunt Grace, may we go there every day?’
‘I’m afraid it’s rather too far off to go there every day,’ said Aunt
Grace; ‘it’s a good five miles. Still, I hope we shall be able to go
there quite often—at all events when we’ve got our donkey-cart.’
There was a door between the girls’ room and the next, which
Aunt Grace pointed out to them. ‘My room is the next,’ she said, ‘so
you’ll be able to run in for any help you want. Jane will come in and
do your hair in the mornings, but of course she won’t always be
there for the odds and ends of things that need doing.’
‘I’ve done my own hair for quite a long time,’ Emmeline was
careful to inform her.
Aunt Grace did not seem much impressed. ‘That’s a good thing,’
she observed cheerfully.
They went to Micky’s room after that. They had to cross the
passage and go down some steps in order to reach it, for it was in
the part of the house which had been the original Fir-tree Cottage,
where the rooms were all much lower—like cottage rooms in fact.
There were but two of them on the upstairs floor, and the other one
was to be the schoolroom. Underneath these two rooms were two
others, now used as the scullery and larder. Micky’s room was not
quite so daintily furnished as his sisters’, but it had a delightful view
out on to the lawn and wood beyond, which made it a very pleasant
one. What especially gratified Micky, however, was its being alone.
‘You need a man to sleep in this part of the house,’ he remarked;
‘burglars would be sure to choose it to attack, because they’d think
there would be fewer people to shoot them, so it’s a jolly good thing
it’s me you’ve put here, and not the girls.’
‘Micky always sleeps with his gun at the foot of his bed, just in
case,’ said Kitty.
Just at that moment the dinner-bell rang.
‘Well, I must run and get ready,’ said Aunt Grace. ‘Can I lend
anybody anything?’
‘Thank you; we should be very grateful for a sponge,’ said
Emmeline, ‘and, Aunt Grace, Micky must wash, mustn’t he? Just look
at his hands!’
Micky made a face at her, and Aunt Grace said calmly: ‘I expect he
will wash: gentlemen usually do. But I feel it’s a question we must
leave to himself—at all events till his luggage comes.’
Emmeline flushed crimson. Then a choky feeling came into her
throat; her eyes began to sting, and she had to hurry out of the
room lest she should burst out crying. It was not only that she was
hurt for herself, but her sense of loyalty was grieved. Mary had
always made Micky wash his hands before dinner. It would always
be like this, she said to herself. The others would leave off all the
good ways they had been taught, and whenever Emmeline, the only
one who would never forget, tried to remind them, Aunt Grace
would snub her.
The chokiness and the stinging gradually passed off, and
Emmeline could trust her voice again.
‘Kitty, you really needn’t have gone and told Aunt Grace about our
only having one wash-stand and dressing-table at home,’ she
snapped, as they were washing their hands.
‘Why ever not?’ asked Kitty, opening her eyes.
‘It makes us seem such babies,’ said Emmeline, crossly; ‘and,
though of course you and Micky are babies, it’s rather hard on me.’
Fortunately Kitty was both sweet-tempered and tactful, so she
made no answer, and the subject dropped. Emmeline, however, went
down to the dining-room in anything but a good temper. Even the
sight of Micky with spotlessly clean face and hands failed to soothe
her; it was exactly like Micky to go and wash his hands just in order
to make her seem in the wrong.
‘I think this clock is a little bit slow,’ said Aunt Grace, after a few
minutes of eager chatter on the twins’ part and silence on
Emmeline’s, which an onlooker might have described as sulky, but
which she herself considered dignified. ‘Would you mind telling me
the right time by that lovely little watch of yours, Emmeline?’
Wily Aunt Grace! That little gold watch which had been given her
by her mother was the pride and joy of Emmeline’s heart. Nothing so
delighted her as to be asked the time. She gave the required
information with the utmost graciousness; the dining-room clock was
exactly three minutes slow, it seemed, by the right time. Aunt Grace
actually left her seat then and there and went to the mantelpiece to
move on the minute-hand three spaces, and Emmeline began to
wonder whether a person who cared so much about the right time,
and showed such a proper amount of faith in her gold watch, could
be so very worldly after all!
The children and Aunt Grace were just setting out for an exploring
expedition in the wood after dinner when Emmeline suddenly felt
Micky, who was walking by her side a little behind the others, press
a hot, sticky coin into her hand.
‘Why, what is it?’ she asked, with a wonder which did not grow
less when she discovered that it was a penny.
‘It’s to make up for making that face,’ said Micky, who had grown
very red. ‘It was beastly rude of me, but for the moment I had quite
forgotten about you being a girl.’
[Illustration: Kitty gave such a bound of delight that she nearly upset her tea.]
The early days of the children’s new life were so full of interest
and discoveries that even Emmeline did not manage to be nearly as
homesick as she fancied she was.
To begin with, they had explored the whole house, a good deal of
the wood, and every inch of the garden. They had discovered,
moreover, that the said garden was the most delightful of play-
places, chiefly because it was splendid for story games. It owed its
excellence from this point of view to the fact that it contained a
summer-house and a wood-pile, either or both of which could serve
if need were as houses for the story people to live in, which, as Kitty
remarked, ‘made things seem ever so much realer.’ To be sure, there
were times when they had to pretend a good deal about the wood-
pile; it just depended how Mr. Brown, the gardener, had arranged it,
but it usually did for desert islands, where the dwellings might be
supposed to be rather rough and ready, and if the worst came to the
worst there was always the summer-house.
For the whole of one glorious red-letter afternoon, indeed, the
story people had revelled in the run of yet a third house. Just
outside the back-yard was a little shed, always respectfully referred
to by Micky and Kitty as ‘Mr. Brown’s study,’ that being the place
where he was accustomed to black the boots and clean the knives.
On the afternoon in question Mr. Brown had stayed at home for
some reason, so that his study was left undefended from the twins,
who entered in and took possession. It made an even more
desirable abode than the summer-house, for not only was it
pervaded by a delicious smell of knife-powder and boot-blacking and
mustiness, but also it was much better furnished; there were stools,
and shelves, and knives, and boots, and packets of seeds and queer
little pots, with nice messy stuff inside them, whereas in the
summer-house there was nothing at all except a wooden bench,
which was fixed to the wall and ran round three sides of it. So the
story people lived there for the whole of that afternoon with great
satisfaction to themselves, but, unhappily, not with any satisfaction
at all to Jane when she came to fetch them in to tea and found Mr.
Brown’s usually neat ‘study’ turned almost inside out, and Micky and
Kitty all over boot-blacking. Aunt Grace and Emmeline returned from
a garden-party to find not only the twins, but Alice, the little day-girl
who had been inveigled into joining the game, in the deepest
disgrace, and Jane muttering terrible things about ‘warnings.’
Fortunately the affair passed off without such dire consequences,
but from that time forward Mr. Brown’s study was forbidden ground.
It was a great disappointment; but consolation was not long in
coming, for it was only a very few days later that they discovered
the Feudal Castle.
Aunt Grace had gone to a garden-party, and the three children
were spending a blissful afternoon in the wood. Emmeline had
curled herself up comfortably with a story-book, but the twins
happened to be Red Indians that day, and had gone off on a
desperate expedition against the Pale Faces. Before long they came
rushing back to Emmeline, and insisted on dragging her off to see
‘something wonderful.’
‘Something wonderful’ proved to be merely an empty cottage,
hardly more than a hut, indeed, which, from its broken windows,
torn thatch, worm-eaten door, and altogether forlorn appearance,
looked as if it had been deserted for several years. Emmeline
grasped its capabilities at first sight, and when the twins led her
inside and triumphantly displayed a three-legged chair with a broken
seat, and part of what had once been a table—when she saw the
grate, rusty and cobwebbed with disuse, but a real grate
nevertheless, she was quite ready to agree with them that the story
people had found their ideal house at last.
‘Isn’t this perfectly lovely?’ said Kitty, dancing about. ‘And,
Emmeline, it has two rooms. Come and see the other one.’
The other room contained nothing at all except somebody’s very
old boot, and a straw hat with the crown almost out, both of which
Kitty pointed out as great finds. Emmeline, however, was left cold by
these treasures.
‘They look as if they had belonged to rather dirty people,’ she said.
‘I think we’d better clear them out. Besides,’ she added, as Kitty
looked disappointed, ‘this is a Feudal Castle, and they are not the
sort of things people in Feudal Castles would wear.’
From that time forward the empty cottage was always known as
the Feudal Castle. It was felt to be a most brilliant suggestion of
Emmeline’s.
It would have quite spoilt the romance of the Feudal Castle if it
had become a place of common resort, so from the very first the
Bolton children bound themselves by a solemn pledge of secrecy not
to reveal its existence to anyone. It was in an unfrequented part of
the wood, where they themselves never happened to have gone
before, and it did not strike them that perhaps other people might
have done so.
Unfortunately they could not spend as much time in the Feudal
Castle as they would have liked, for lessons began again the very
day after it was discovered. In themselves lessons were pleasanter
than they had ever been before, for Miss Miller, their new governess,
who bicycled over each morning from one of the neighbouring
villages, was brighter and more interesting than old-fashioned Miss
Rogers. To be sure, Emmeline was at first inclined to resent it as a
slight to Miss Rogers when she found herself expected to do by
short division sums she had ‘always been taught’ to do by long; but
she was a sensible girl on the whole, and when once she had
thoroughly mastered the new method, and found out how much
quicker and neater it was than the old one, she began to take quite
a pride in working her sums by it, and altogether became so docile
and well-behaved a pupil that Miss Miller soon shared the general
opinion that she was a model child.
To Emmeline’s relief, and possibly also a little to her
disappointment, she was not required to depart from the ways in
which she had been brought up in any more important respects than
that question of short division versus long. So far from amusing
herself all Sunday, as Emmeline had a vague impression that
fashionable people did, Aunt Grace attended more services than
Mary herself had done, and was certainly just as particular with
regard to the children’s Sunday observances; indeed, in some ways
she was even more so, for instead of being content with a bare
repetition of the Catechism, she insisted on seeing that they clearly
understood its meaning. And whereas Emmeline had formerly
learned merely a verse or two of a hymn, Aunt Grace now expected
her to learn the Collect and Gospel for the week, which was a far
more serious task. Emmeline could not well grumble at it openly, but
at the bottom of her heart she was possibly a little irritated with
Aunt Grace for behaving so very differently from what she had
pictured.
‘There is going to be a Meeting in the village schoolroom to-night,’
said Aunt Grace as she was pouring out tea one fine Saturday
evening in September, about a month after the children’s arrival at
Woodsleigh. ‘Mr. Faulkner—that’s Mrs. Robinson’s clergyman brother
—is going to speak about the work of a Home for poor friendless
boys and girls, of which he is the Chaplain. I wonder if you three
would like to come.’
‘I should like it very much,’ said Emmeline.
‘Will it be all talking, or will there be a magic lantern?’ asked
Micky, cautious before committing himself.
‘Will it keep us up lovely and late?’ cried Kitty.
‘I believe there’s to be a magic lantern, and we shan’t be back till
about ten, I suppose,’ said Aunt Grace; whereupon Kitty gave such a
bound of delight that she nearly upset her tea, and Micky graciously
expressed his opinion that the Meeting wouldn’t be half bad.
‘Work among children is always particularly interesting,’ said
Emmeline; ‘their characters are still so plastic that they can be
moulded into whatever shape you want.’ She had once heard a
visitor make the remark, and had treasured it up for future use.
‘I didn’t know you had had such a wide experience in bringing up
young people, Emmeline,’ said Aunt Grace, with a twinkle in her eye;
and Emmeline grew rather red.
‘The only condition I make to the twins’ going is that they shall lie
down after tea till it is time to start,’ went on Aunt Grace after a
moment, ‘else they will be so very tired to-morrow morning.’
The twins looked rather blank at this. ‘Will there be supper when
we come home?’ asked Micky.
‘Yes,’ said Aunt Grace, with a smile.
‘Oh well, then, we’ll lie down if you really want us to,’ said Micky,
and as it never occurred to Kitty to dispute what he had decided, the
matter was regarded as settled.
On their way to the Meeting Aunt Grace told the children a little
about the lecturer, whom she had already met in London. For several
years he had worked so devotedly in one of the very worst parts of
the great city that at last his health had given way, and the doctors
had said that for the present, at all events, it would be madness to
take any but light country duty. At the time the verdict had almost
broken his heart, for he was quite wrapped up in his people, above
all in the poor little children of the parish, many of whom were being
brought up as pickpockets. It had been a great consolation to him,
however, when he was offered the Chaplaincy of this Home, where
he knew that his work would still lie among children like those he
had left.
‘Some of them, indeed, are the very same,’ added Aunt Grace. ‘For
instance, I know of one boy there—that is, I think he is there still,
though he must be about the age for leaving by now—whose life Mr.
Faulkner once saved. He wasn’t a clergyman then, but a doctor, and
this boy was lying at death’s door with diphtheria. He had been
horribly neglected by some cruel people with whom he lived, and by
the time Mr. Faulkner discovered him the illness had been allowed to
get such a hold that the child would probably have been choked by
some horrible stuff that was growing in his throat if Mr. Faulkner
hadn’t sucked up the poisonous stuff through a tube which he put
into the throat. Of course, it was a terribly dangerous thing to do—
indeed, he caught the illness through doing it—but it saved the boy’s
life. Before that time he had been one of the most abandoned little
child-thieves in the parish, but ever since he has been absolutely
devoted to Mr. Faulkner, and he is now growing up into a very fine
character. I believe he hopes to go out as a Missionary one day,
which would be a wonderful end for anyone who began as a little
pickpocket.’
‘Mr. Faulkner must be a saint,’ said Emmeline.
‘So he is,’ agreed Aunt Grace heartily; ‘but I don’t know,’ she
added, with a whimsical little smile, ‘whether he’ll any more fit your
idea of a saint than Fir-tree Cottage did that of a cottage.’
Aunt Grace was right. Emmeline could not help feeling a little
shock of surprise when, soon after they had taken their seats in the
schoolroom, a curly-haired little man, with a round, merry face,
came and stood before the great white lantern-sheet, and she
realised that this must be the Lecturer.
‘Why, that man’s a little boy!’ remarked Kitty, in a stage whisper.
And, indeed, there was something very boyish in his appearance.
Not that they had much time to study it, for in another moment the
lights were lowered, a hymn appeared on the lantern-sheet, and
after it had been sung through lustily the lecture began.
The first picture shown represented a room in London—such a
filthy, miserable room as the children could never even have
imagined. On a ragged mattress in one corner lay a little boy, so thin
that he was more like a skeleton than a child. He had been almost
dying, it appeared, when he had been discovered by the Society to
which the Home belonged, and rescued from death, or worse, for
the room had been kept by a wicked man who was bringing up this
child and a number of others to a life of crime.
The next picture was far less harrowing to the feelings of the
audience, for it showed the same boy fat, and clean and comfortable
after a few years spent in the beautiful Home among the Surrey
hills, where Mr. Faulkner was now Chaplain. He had since joined the
Royal Navy, said the clergyman, and was now learning to serve his
King and country as a brave man should, instead of making a
livelihood by robbery.
‘Perhaps he’ll be one of my men some day,’ whispered Micky, who
had every intention of ending his life as an Admiral.
Picture followed picture, showing tragic scenes of child life in
darkest London, varied from time to time by groups of prosperous
children whom the Society had adopted. On the whole it was much
like other lectures of its kind, but the Bolton children, who had been
at nothing of the sort before, listened and gazed entranced, and felt
very regretful when it was over, and a final hymn and a collection
brought the proceedings to a close.
Mrs. Robinson, the Vicar’s wife, hurried forward to speak to Aunt
Grace as soon as the lights were turned up and people were
beginning to disperse.
‘You’ll come to supper with us to-morrow, won’t you?’ she said; ‘I
know my brother is much looking forward to meeting you again.’
A pretty rosy colour came into Aunt Grace’s cheeks. ‘Thank you; I
shall be delighted to come,’ she said, and she looked as though she
meant it.
The Lecturer himself came up to them the next moment, and
greeted Aunt Grace as a friend.
‘You’ll let me see you home?’ he asked, eagerly; ‘that lane is so
long and dark—I know it of old.’
‘Thank you; but, you see, I have a very sufficient bodyguard in
two nieces and a nephew,’ said Aunt Grace, laughing, ‘and I hear
Mrs. Robinson just inviting the churchwarden and his wife to go
home with her for the express purpose of meeting you, so I’m afraid
it wouldn’t do to take you away from them.’
‘Well, I shall come to-morrow, then,’ said Mr. Faulkner. ‘I want to
be introduced to your bodyguard’; and he gave the children a
mischievous look that made him appear more like a schoolboy than
ever.
‘I do love people who have twinkly smiles,’ remarked Kitty to
Micky, on the way home after the meeting in the village schoolroom.
Micky’s great blue eyes had a rapt, far-away expression.
‘I wonder if it’s worth while,’ he said thoughtfully.
‘If what’s worth while?’ asked Kitty.
‘To be so horrid and clean as those children were in the Homes,
even if you do get plenty to eat.’
‘But, Micky, we are clean—sometimes,’ said Kitty. It was just as
well she qualified the statement.
‘Yes, but we are used to it,’ said Micky; ‘things aren’t half as bad
when you are used to them.’
‘What part of the lecture did you like best?’ asked Kitty of
Emmeline, who was walking along in dreamy silence.
‘Oh, I don’t know,’ said Emmeline. She spoke without thinking, for
she did know perfectly well. Mr. Faulkner had spoken of a little
twelve-year-old girl named Kathleen, whose pocket-money had been
the very first subscription towards the building of the particular
Home where he was Chaplain. The heart of this child had become so
full of noble pity for her poor little brothers and sisters of the slums
that she spent most of her playtime working among them and for
them, and came to have such a wonderful influence on them, that
they looked upon her more as an angel than an ordinary human girl.
The story had fired Emmeline’s imagination, and she was dreaming
over it still.
‘Didn’t you enjoy the meeting, Aunt Grace?’ asked Kitty, taking her
aunt’s hand.
‘Yes, dear. Why do you ask?’
‘Because you seem so grave, somehow—like when we’ve been
naughty.’
‘I was thinking, I suppose,’ said Aunt Grace, laughing, and for the
rest of the walk she chatted merrily about all kinds of things.
‘It’s easy to see she doesn’t care much about the poor children,’
thought Emmeline, feeling well satisfied with herself; ‘if she did, she
wouldn’t make so many jokes.’
All the way home, and while they were having supper afterwards,
Emmeline went on thinking of the little girl who had spent her
pocket-money and her playtime on the poor.
‘Do you know,’ she said abruptly, in the middle of her basin of
soup, ‘I think it would be very nice if we had a collecting-box for that
Home. I’ve got a shilling in my money-box upstairs which I’ll put in
for a start. I did mean to have saved up to buy “Queechy,” but I’ll
gladly give that up for the sake of the poor little children. Kitty and
Micky, if you were unselfish you’d give up your money too.’
The twins looked blank, and instead of being touched at
Emmeline’s self-sacrifice Aunt Grace said rather sharply, ‘Really,
Emmeline, it is not your business to settle what the twins ought to
give. Start a box if you like, but I can’t have you forcing the others
to contribute to it.’
Emmeline tried to reflect that this was only what she might have
expected; people’s worldly relations always did persecute them when
they wanted to do anything specially beautiful or unselfish; but she
could not help looking hurt, and Kitty, who never could bear anyone
to be snubbed, broke in:
‘Oh, but she didn’t mean to force us, Aunt Grace. It was only a
suggestion. You shall have my sixpence, Emmeline—at least,
threepence of it will be from me and the other threepence from
Micky. Then it won’t matter his saving his own money for a new gun.
You see, it’s really necessary he should have one that’s not broken
when he sleeps in such a lonely part of the house.’
‘Of course,’ agreed Aunt Grace, smiling, as she twisted one of
Kitty’s long curls between her fingers. ‘Should you like to ask Mr.
Faulkner for a collecting-box when he calls to-morrow, Emmeline?’
she added, in an unusually kind voice for a persecuting relation.
‘No; my extra money-box will do quite well,’ said Emmeline shortly.
The extra money-box had been given her by Micky on her last
birthday. Having dropped a carefully treasured sixpence down that
same mouse-hole which had been fatal to so many of his marbles,
Micky had been at his wits’ end what to give Emmeline till the happy
thought had struck him of presenting her with his own money-box,
then standing empty and useless. Emmeline had thanked him for it
graciously at the time, but Micky had always had an uneasy feeling
that it was rather a mean makeshift of a present, so he was
delighted to find it turning out at last to be really of some use.
‘I think that’s a splendid plan,’ he said; ‘you’ll be able to open it
whenever you want to count how much money you’ve got, which
you can’t do with the ordinary stupid sort of missionary-box.’
‘There’s a good deal in that,’ said Aunt Grace. ‘See, here’s a bright
new shilling as a contribution to the extra money-box’s first meal.
And now I think it’s time all you young people went to bed.’
For some time after she had got into bed that evening Emmeline
lay awake dreaming day-dreams of that twelve-year-old girl who had
been so wonderfully good to the poor. Strangely enough, however,
the child of her visions was no longer a stranger, but Emmeline