
Improved Document Geocoding for Geo-complex Text

Ian Patullo

This report is submitted as partial fulfilment of the requirements for the Honours Programme of the School of Computer Science and Software Engineering, The University of Western Australia, 2008

Abstract
Document geocoding is the process of assigning a spatial reference to a piece of text based on an analysis of its content. This project is concerned with improving geocoding performance on text with high geo-complexity, that is, text which traditional systems struggle to tag correctly. During the course of this dissertation we produce an ontological framework which describes the way current techniques identify locations in text, and develop new heuristics that work specifically where the old ones fail. Using a performance metric that combines the accuracy of assignment with the number of false positives and the overall coverage of place names in a document, we show that our system is capable of improving performance on some texts with high geo-complexity, while attaining similar performance on more standard texts.

Keywords: Document Geocoding, Geotagging, Disambiguation, Text Mining, Information Retrieval, Gazetteer
CR Categories: I.2.7 [Natural Language Processing]: Text Analysis

Acknowledgements
To my supervisor, Dr Wei Liu, for her direction and advice.
To Dave Pratt for the idea.
To everyone at work, for their interest, flexibility and feedback.
To my family and friends, who no doubt missed me.
To my brain, which got me through.

Thanks to all of you.


Contents
Abstract
Acknowledgements

1 Introduction
  1.1 An Introduction to Geocoding
  1.2 Motivation
  1.3 Geo-Complexity and this Dissertation

2 Background
  2.1 Unstructured Geocoding
    2.1.1 Locating place names within Text
    2.1.2 Turning Names into Locations
  2.2 Disambiguation
    2.2.1 Initial Heuristics
    2.2.2 Secondary Heuristics
  2.3 The Problem of Geo-complexity

3 Investigating Deeper into Geo-Complexity
  3.1 The Call for Investigation
  3.2 Targets for Investigation
    3.2.1 Heuristics
    3.2.2 Non-Heuristic Measures
  3.3 Discussion and Updates

4 Implementation and Experimental Setup
  4.1 Design
    4.1.1 Initialisation
    4.1.2 Tagging
  4.2 Experimental Setup
    4.2.1 The Gazetteer
    4.2.2 Input Texts
    4.2.3 Exclusions
    4.2.4 Configuration and Metrics
  4.3 Experiments
    4.3.1 Testing the Ontological Framework
    4.3.2 Testing the Geocoder

5 Results and Discussion
  5.1 The Ontological Framework
    5.1.1 Are Improvements Possible?
    5.1.2 What areas are there for improvement?
    5.1.3 What are the Identifiable Properties of Geo-Complex Text?
  5.2 Geocoder Tests
    5.2.1 Experiment 1: Overall Performance
    5.2.2 Experiment 2: Performance of the Updated System
    5.2.3 Experiment 3: The Effect of Gazetteer Size across all Metrics

6 Conclusions and Further Work
  6.1 Conclusion
  6.2 Further Work

A Initial Project Proposal
B The Hard Text Framework

List of Tables
2.1 Similarity Weighting Table
4.1 The breakdown of tagged documents
5.1 Percentage of place names which aren't recognised by the Standard and Amplified Gazetteers
5.2 The average number of senses per place name in the Standard and Amplified Gazetteers
5.3 Instances of Properties with no Discernible Bias toward easy text
5.4 Percentage of entries for which the default sense assignment would be incorrect
5.5 Georeference cluster rating by type (0 = no clustering, 4 = Local clustering)
B.1 The list of properties required for geocoding heuristics

List of Figures
2.1 All possible geographic interpretations of the names of the Australian capitals
4.1 System Overview
5.1 Locations with the Default Sense vs Total Taggable Locations
5.2 Overall Performance by Configuration
5.3 Best Performance over the Dataset
5.4 Accuracy of Metrics Using the Amplified Gazetteer
5.5 Thoroughness of Metrics Using the Amplified Gazetteer
5.6 Coverage of Metrics Using the Amplified Gazetteer
5.7 Comparative Accuracy Between Gazetteers
5.8 Comparative Thoroughness Between Gazetteers
5.9 Comparative Coverage Between Gazetteers
5.10 Comparative Performance Between Gazetteers

CHAPTER 1

Introduction

1.1 An Introduction to Geocoding


Geocoding, also known as Geotagging, Localising or Grounding, is the process of assigning a spatial reference to a piece of text based on an analysis of its content. Documents are indexed this way as an aid to data discovery; a spatial search can be added to a text search to refine results, or spatial references can be used to improve the utility of research corpora or corporate data repositories. This has already found practical use on the web in tools like Google's Local Search [4] and Metacarta's Geosearch News [5], which attempt to improve the relevance of local business and news searches respectively. These two services exemplify the two major directions that large scale geocoding operations can take. Local search uses structured data available in the form of addresses and phone numbers to index content, aggregates pre-tagged web pages and photos, and links directly to first-hand sources of spatial data like GPS devices [3]. Conversely, Geosearch News tackles the more immature area of unstructured geocoding: geocoding based on unstructured references to place names. News articles do not display locations in a predictable or organised way, and georeferences in common language, while ubiquitous, are also ambiguous, exactly the problem that modern addressing systems are designed to avoid. As a consequence, this second type of geocoding is both more widely applicable and less precise than the first.

1.2 Motivation
Most recent research has been directed toward unstructured geocoding. In particular, many implementations have focused on using the properties of text on the Internet to assign spatial references more accurately. These geocoders tag locations by combining evidence from the document itself with externally available context. The performance of these tools has been attributed in part to the fact that the web pages which they target are written to be understood by people from all over the world [22]. Due to the ease with which these sorts of texts are geocoded, we say they have a low geo-complexity. However, performance degrades when these methods are applied to text without a global focus, like government web pages, local newspapers or other sites that cater to specific communities. Yet many texts, online or offline, are written to be understood only by certain demographics. Furthermore, offline texts do not have the meta-information available to online texts. Thus there is a need to develop better ways of geocoding geo-complex text without resorting to context beyond what is available in the document itself. To do that, we need to produce both a description of the properties of geo-complex text, and a geocoding solution that takes advantage of those properties to improve its performance.

1.3 Geo-Complexity and this Dissertation


Geo-complex texts can be understood by looking at how and why they are created. Essentially, documents are produced to match the context of the people who read them. Readers can differ substantially in interest, education, expertise, location, age, values and economic status. These differences lead to a different knowledge framework, which will influence how they interpret a document. Any reader from outside the target context will naturally struggle to comprehend parts of a given text. However, targeted writing is necessary for efficient communication. When author and reader share a context, constructs which form part of their shared knowledge framework do not need to be unambiguously described: we say "The Uni" instead of "35 Stirling Highway, Crawley, WA, 6009".

This has a particular influence on how places are referenced in geo-complex text. For instance, we can assume that all World War II enthusiasts and all French people know that Dieppe is a port in France. Documents targeted at either of these groups do not need to qualify that reference. Such a reference, however, would be problematic for a geocoding algorithm, because anyone from New Brunswick in Canada similarly knows that Dieppe is a city in Westmorland County. There is not enough information for a culturally agnostic geocoder to choose one way or the other.

In general, the geo-complexity of a text increases as more specific knowledge is required to contextualise location references. However, most existing geocoders were not built to handle these documents well. This leads to the hypothesis that underpins this dissertation:

Hypothesis: Geo-complex text has distinctive properties that can be identified and leveraged to improve geocoding performance.

Over the course of this document we will develop this idea. Our approach is centered on developing updated heuristics for choosing between the candidate senses of place names in text. This has two component steps. Firstly, the current range of heuristics is analysed to find weaknesses and opportunities for improvement. Secondly, new algorithms and heuristics are developed to take advantage of these opportunities. Geo-complex text is found to contain tightly clustered references, and this property is used as a means for identifying locations with higher accuracy. The same property is used for discarding locations that fall outside the cluster point. The new heuristics improve geocoding performance on geo-complex text by up to 15%.

This document is broken up into the following chapters: Chapter 2 provides a review of current geocoding heuristics. It also discusses how geo-complexity has been recognised and dealt with in the literature. In Chapter 3 we systematically develop an ontological framework for assessing the effectiveness of geocoding heuristics. We use the framework to derive a set of parameters that characterise geo-complexity and to develop three modified heuristics for use with geo-complex text. Chapter 4 describes the design of our geocoding system and details two sets of experiments: the first relating to the framework in Chapter 3, the second relating to the geocoder and our updated heuristics. Chapter 5 covers the results of these experiments and the related discussion.

CHAPTER 2

Background

Many sources of text contain structured information, such as phone numbers, area codes or addresses. This data can be resolved into a fairly accurate and unambiguous location. The locations referred to by a document can then be used to determine its geographic scope. This is often termed Address Geocoding. However, the characteristics that such methods rely upon are not true for all text. In particular:

Address resolution is limited to texts referencing urban areas and places with an organised addressing system [16].

Locations are often referenced by means other than a formal address, and many documents contain only unstructured references to locations.

Addresses may indicate the source rather than the target location of a document [11].

Limitations of address geocoding have led to the current interest in geocoding based on the unstructured georeferences in a document. This is a challenging domain, and much work has been invested in finding ways to improve the performance of this type of tagging.

2.1 Unstructured Geocoding


Unstructured geocoding techniques have been developed in response to two main tasks. These are the identification and extraction of place names from the text, and the normalising of extracted place names into unambiguous locations.

2.1.1 Locating place names within Text


Named entity recognition (NER) is a general term describing the process of identifying and classifying references to self-contained constructs within text. Implementations are found in a diverse range of disciplines, from marketing to bioinformatics. NER is typically concerned with the identification of persons, locations, organisations or quantities referenced in text. The techniques used can be broadly classified into gazetteer-based, rule-based, or machine learning-based approaches. A gazetteer-based approach attempts NER by providing as exhaustive a list of entity names as possible. Identification is limited to words that have been encountered before, and performance is hampered significantly by the fact that almost all common words are also place names [25]. Rule-based methods make use of a set of pre-defined extraction patterns to identify locations in text. The rules reflect how geographic references appear in natural language, and have proved to be highly effective [27]. Machine learning techniques attempt to replicate the good qualities of rule-based methods without being limited to an expert rule set. NER is a well established domain, and results with F-measures of around 0.96 are achievable [33]. Replicating this performance is beyond the scope of this research; we thus use a corpus of manually tagged text as our input data. Nevertheless, high quality NER is essential to any geocoding application.
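To make the gazetteer-based approach concrete, the following sketch shows the naive lookup it implies, together with the common-word problem noted above. The gazetteer contents and the stop-word filter are toy assumptions for illustration only, not part of any system discussed here.

GAZETTEER = {"perth", "buffalo", "darwin", "reading"}   # toy entries
COMMON_WORDS = {"reading", "nice", "mobile"}            # ordinary words that are also places

def gazetteer_ner(text):
    """Return tokens that match a gazetteer entry.

    Without the common-word filter, ordinary words like 'reading'
    would be falsely extracted -- the weakness noted above."""
    hits = []
    for token in text.lower().split():
        word = token.strip(".,;:!?")
        if word in GAZETTEER and word not in COMMON_WORDS:
            hits.append(word)
    return hits

print(gazetteer_ner("Reading about Buffalo while flying to Perth."))
# -> ['buffalo', 'perth']; 'reading' is suppressed by the filter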

2.1.2 Turning Names into Locations


Beyond name extraction, the greatest challenge to document geocoders is the problem of ambiguous text references. Many words are polysemous, that is, they have multiple meanings. Some words with a potentially geographic interpretation could also be used in a non-geographic fashion (Washington D.C. vs Denzel Washington). A similar problem arises from different locations sharing the same place name. Buffalo, New York, for instance, shares its name with 22 other cities or towns in the U.S. [21]. This type of ambiguity, called geo/geo [11] or referent ambiguity [27], must also be accounted for before a geocoding application can produce meaningful results. Humans are naturally very good at this disambiguation process, but for computers a full semantic model is out of reach. The next section of this document explains the current approaches to handling polysemy and their effectiveness.

Figure 2.1: All possible geographic interpretations of the names of the Australian capitals

2.2 Disambiguation
In order for a recognised geographic entity to become a useful piece of metadata it must be grounded, meaning it must be resolved to a single unambiguous location [20]. A georeference can have many candidate locations. Amitay et al found that 37% of geographic references in web pages had more than one potential target [11]. Smith and Crane found the phenomenon to be even more pronounced in classical texts, where up to 92% of place names were ambiguous [29], though these results are dependent on both the size of the gazetteer and the nature of the texts that were tested. In Figure 2.1 we see the geographical area covered by just eight references: Adelaide, Brisbane, Canberra, Darwin, Hobart, Melbourne, Perth and Sydney. Numerous methods exist for choosing between the candidate senses of a geographic reference. For clarity we separate them into initial heuristics, which can be applied to a document as a first pass, and secondary heuristics, which work best after some locations have already been decided.

2.2.1 Initial Heuristics


Default Sense

One of the least complex ways of choosing between candidate senses for a location is by selecting one dominant sense to be the assumed target (eg: choosing London, England as the default London). This is particularly useful when there is little or no context that would normally favour one sense over another. Several systems assign a default sense when other heuristics fail to identify a sense with probability beyond a pre-defined threshold value [20, 22, 32]. Others eliminate the disambiguation phase entirely by using a gazetteer populated by default values only [12, 27]. Default senses cannot be correct or incorrect; they are simply an attempt to rank senses of a place name based on some method of defining importance. The techniques for assigning a default sense can be either statistically or conceptually based.

Statistical Approaches: Amitay et al used population size to select the default sense [11]. They reasoned that a greater population would lead to more references. This technique requires little human intervention and is suitable for selecting default senses even between relatively unknown candidates. Its major drawback is that locations with a small population, such as Aspen, Colorado, can have characteristics (in this case a ski resort) that make them more likely to be mentioned than their more populous namesakes. Further to this, there is limited value in selecting default senses by population unless there exists a significant difference in population values. This is particularly true when considering the varied quality of gazetteer sources. Li et al [21] suggested querying a search engine with the ambiguous place name as the search string and using the sense most commonly referenced as the default sense. This technique's greatest limitation is that the score of a particular location is related only to its online presence. Such a presence is not guaranteed to be appropriate to all text and would have a significant North American bias.

Conceptual Approaches: In a later work, Li et al [22] theorised that countries would be more frequently mentioned than cities of the same name, and added this approach to their existing one.
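The caveat about insignificant population differences can be made concrete with a small sketch: a default sense is returned only when one candidate clearly dominates. The candidate dictionaries and the margin parameter are our own illustrative assumptions, not part of any published system.

def default_sense(candidates, margin=2.0):
    # Rank candidates by population, largest first.
    ranked = sorted(candidates, key=lambda c: c["population"], reverse=True)
    # With no clear population gap, defer to other heuristics.
    if len(ranked) > 1 and ranked[0]["population"] < margin * ranked[1]["population"]:
        return None
    return ranked[0]

londons = [
    {"name": "London", "region": "England", "population": 7500000},
    {"name": "London", "region": "Ontario", "population": 350000},
]
print(default_sense(londons)["region"])  # -> England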

Srihari, Niu and Li [30] used the Oxford English Dictionary to select default senses. They argued that large cities or countries that made it into the dictionary were likely to be the most frequently mentioned. Default sense is seldom the only heuristic used, because increasing the number of candidate senses generally improves geocoding performance [11].

Textual Context

Reliable initial information can be obtained by analysing the context of each place name. Documents intended for a diverse or global audience, most notably web resources, typically contain disambiguating cues that reduce the need for the reader to have any specific local knowledge [11]. Because these cues are often in a recognisable format, they can be identified by the geocoder. Local context is queried using similar techniques to the rule-based entity extraction discussed in section 2.1.1 [27]. Likely formats that can be grounded this way are phrases like:

City of Perth
State of Illinois
Sydney, New South Wales
Chicago, the Windy City

These cues allow for place names to be grounded with a high probability immediately after the extraction phase [22].
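As a rough illustration of how such cues can be harvested, the sketch below encodes two of the formats above as extraction patterns. The patterns are deliberately simplified assumptions and not the rule sets of Overell et al [27] or Li et al [22].

import re

PATTERNS = [
    # "City of Perth", "State of Illinois"
    re.compile(r"\b(?:City|State|Shire) of (?P<place>[A-Z][a-z]+)"),
    # "Sydney, New South Wales" -- a place qualified by its container
    re.compile(r"\b(?P<place>[A-Z][a-z]+), (?P<container>[A-Z][A-Za-z ]+)"),
]

def context_cues(text):
    """Yield (place, container) pairs found by the cue patterns."""
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            yield match.group("place"), match.groupdict().get("container")

print(list(context_cues("The City of Perth lies far from Sydney, New South Wales.")))
# -> [('Perth', None), ('Sydney', 'New South Wales')]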

Other Sources of Context

For web based resources, information besides the plain text content is available. Harvesting this information allows geocoders to further improve their grounding accuracy. Three main contextual sources are outlined below.

The Link Cloud [17, 31]: The pattern of incoming links to a web resource can give some idea of the location of a page's content. The sites that contain links to a resource can be quickly assigned a spatial reference by their IP address, by looking at whois.com or even by geocoding the linking pages. The rationale behind this method is that sites referencing a particular location will be of more interest to other sites from that location. Such information can also be held separate from the rest of the geocoding process and used to distinguish between the content of the page and the audience of the page.

Metadata in Content: Some web resources, like Wikipedia, contain structured meta-information in charts and tables. Overell et al [27] were able to reliably ground Wikipedia articles by examining the meta-information they contained. They found that many articles about places had charts with coordinate information that could be checked against known location candidates.

IP Address Information [28, 32]: The IP address of a web page can give some indication as to where the page is likely to be talking about, as can the URL. Such information needs to be considered carefully, as it may only provide details of the source location, rather than the location of the content.

2.2.2 Secondary Heuristics


Single Sense Per Discourse

While working in the field of word sense discrimination, Gale et al [19] found that 96% of references to polysemous words were used in the same sense for the entire discourse. They concluded that once a word had been successfully interpreted, all other instances of that word in the same article could be interpreted in the same way. They also suggested that a competent author would avoid using more than one sense of the same word without providing further disambiguating cues. This approach has now become accepted practice for document geocoding, as it has been shown to hold true in this field also [11]. Once a place name has been successfully grounded to a location, considerably more evidence is required before it can be interpreted in another sense within the same document.

Spatial Extent Minimisation (SEM)

Several geocoding implementations try to minimise the spatial footprint of any given document. The reasoning is that documents are likely to refer to a narrow geographic region. If that is the case, then geographic entities that have already been successfully grounded can be used to suggest likely locations for other references. In the simplest form, the hierarchical relationships defined in a gazetteer

are used to infer references to child nodes from references to the parent node. If, for example, a reference to Western Australia was successfully extracted from a text and grounded, subsequent references to Perth and Scarborough would be associated with places in Western Australia, rather than those in the U.K. More sophisticated models [11] also treat references to child nodes as evidence for the parent. While this technique works best as a secondary heuristic, Li et al [22] suggest a means for it to be used as an initial heuristic. They construct a weighted graph where each node is a candidate sense of a place name. A connection is made between each pair of nodes that do not share the same name. The connection weight is set according to the function:

Weight_xy = sqrt((Lat_x - Lat_y)^2 + (Long_x - Long_y)^2) / Sim_xy

where Sim_xy is the similarity weighting between x and y as described in Table 2.1.

  X      Y        Relation      Similarity
  City   City     Same State    3
  City   State    Same State    4
  City   Country  Same Country  4
  City   City     Same Country  2
  Otherwise                     1

Table 2.1: Similarity Weighting Table

Each sense is given a total score, which is the sum of the weights of each connection it is a part of. If the ratio between the lowest and second lowest scores is less than a defined threshold, then the sense with the lowest weight is returned with a high confidence rating. Otherwise the sense with the lowest weight is returned with a lower confidence, in order to rate it below the default sense heuristics. This approach can produce a spatial extent that is close to minimal without requiring numerous pre-grounded locations, and we use it as part of our core geocoding system.
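A condensed sketch of this technique follows. The candidate dictionaries are assumed structures, the similarity lookup flattens Table 2.1 by ignoring node types, and the confidence-ratio test described above is omitted for brevity.

import math

def relation(a, b):
    if a.get("state") and a.get("state") == b.get("state"):
        return "same_state"
    if a.get("country") and a.get("country") == b.get("country"):
        return "same_country"
    return "other"

# A flattened stand-in for Table 2.1.
SIMILARITY = {"same_state": 3, "same_country": 2, "other": 1}

def weight(a, b):
    # Geographic distance discounted by the similarity weighting.
    dist = math.hypot(a["lat"] - b["lat"], a["lon"] - b["lon"])
    return dist / SIMILARITY[relation(a, b)]

def choose_senses(senses_by_name):
    """For each place name, keep the sense with the lowest total
    connection weight to the senses of all other names."""
    chosen = {}
    for name, senses in senses_by_name.items():
        # Connections only run between senses of different names.
        others = [s for n, ss in senses_by_name.items() if n != name for s in ss]
        chosen[name] = min(senses, key=lambda s: sum(weight(s, o) for o in others))
    return chosen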


2.3 The Problem of Geo-complexity


The above methods have been applied with great success in past experiments [21, 31]. However, other results indicate that they are not enough to guarantee accuracy across all types of text [11, 14]. These inadequacies have led to research into how geocoding can better be performed on problem domains. Crane and Jones [14], for instance, dealt with historical texts focusing on a single subject, the American Civil War. Historical texts present a challenge because place names and the way they are referenced change over time. Their special purpose tagger was tuned to a single newspaper, the Richmond Times Dispatch. Focusing on this small subset of text from the period allowed them to manually compensate for the most common errors, and to take advantage of both the common subject and common context of the articles. Similar work has been done by Nissam et al [26].

Conversely, Amitay et al [11] found that their general purpose system had a performance hit of nearly 20% when applied to web pages with a predominantly local readership. They noticed that local text did less to address potential referent ambiguities. They also remarked that such texts reference smaller towns and cities more often, lowering the chance that these references would be the default assignment and even the chance that they would occur in the gazetteer at all. Li et al [22] focused on text with low geo-complexity, like travel guides that provided a state level overview and CNN news. Nevertheless they found that their results could have been improved with better coverage of local place names and features. They also cited ambiguous local references, like Holland Dam in New York, as contributing to the error rate. Leidner et al come to similar conclusions [20].

Martins et al [25] suggested that two different gazetteers could be used to handle texts with different local scope. They produced one ontology for global texts, and another, more detailed ontology for texts pertaining to Portugal only. Their system, which attempts to find the geographic focus of a document, was able to produce comparable results between the local, Portuguese documents and the globally focused documents. However, once again, improved results came at the cost of a highly specialised system.

Work has also been done in relation to texts in multiple languages [15, 23, 25]. The differing ways in which languages reference place names hamstring any attempts to use local context as a normalisation vector, and place names are not consistent between cultures (eg: Germany vs Deutschland). Thus either a truly general purpose algorithm is required, or locale-specific rule systems must be combined with language detection algorithms. The first form tends to reduce the tagging precision, while the second tends to compromise the entity extraction performance significantly [15, 24, 23].


CHAPTER 3

Investigating Deeper into Geo-Complexity

In the previous chapter we discussed the range of current heuristics and their limitations as exposed in the literature. We now investigate geo-complexity with an experimental analysis of the texts themselves.

3.1 The Call for Investigation


There are many factors that encourage us to examine text in this way. Broadly, they can be described by three aims. The results we gather regarding these aims will form a basis for the development of new geocoding techniques.

Aim 1: To determine if it is possible to improve geocoding performance on geo-complex texts.

The first motivation is to decide if improvement is possible at all. By way of example, one of the properties we intend to target is what we call Unqualified Spatial Deixis. In linguistics, a deixis is a feature that gains meaning through its origo, or point of reference. A spatial deixis might be "the outer suburbs of Perth". In this case the term "outer suburbs" is given meaning through association with its origo, Perth. Changing the origo will change the meaning. An unqualified spatial deixis is one where the origo is not clear from the text. Consider:

1. Perth's Northern train line ends at Currambine.
2. Perth's Northern train line ends in the outer suburbs.
3. The Northern train line ends in the outer suburbs.
4. The Northern train line ends near Grandma's house.

Items 1 and 2 are a fully qualified reference and a qualified spatial deixis respectively. Items 3 and 4 are examples of an unqualified spatial deixis. In practice we can use more than the context of a single sentence to assign an origo, but if unqualified deixes were the only factor in geo-complexity, we would conclude that very little could be done to improve tagging performance without resorting to a special purpose algorithm.

Aim 2: To identify properties of geo-complex text that can be used to improve geocoding performance.

The second motivation is that an experimental definition will help identify distinct areas for improvement. Trends in the way locations are mentioned, the types of location present (eg: country and capital, or suburb and street) and any other significant statistics can be exploited to improve results. The previous chapter already indicates that there are multiple contributing factors to geo-complexity. Our results will enable us to target the factors that can be compensated for most effectively.

Aim 3: To determine textual properties that can be used to distinguish between texts of differing geo-complexity.

The third and final motivation for an investigation into geo-complex text is so we might be able to determine automatically whether text is geo-complex or not. Ideally this would mean being able to perform identification based on content alone. However, finding a relationship between geo-complexity and text category would also be beneficial. We expect that local newspapers will be harder to tag than global ones, for instance, but producing more detailed guidelines along these lines will help resolve how our results can best be applied.

3.2 Targets for Investigation


In Section 2.2 we discussed heuristics for choosing between candidate senses of an ambiguous place name. Each one relies on certain assumptions about the properties of text that are believed to be relatively reliable. The drop in performance experienced when applying these techniques to geo-complex texts leads us to question the applicability of these assumptions. In this section we examine the premises implicit in disambiguation heuristics in order to create an ontology for text analysis. The ontology will form the basis for our later experiments.

3.2.1 Heuristics
Default Sense

Section 2.2.1 indicates that default senses are chosen based on statistical or perceptual importance. Intuitively we would expect texts with high geo-complexity, which are mostly local and of narrow interest, to contain relatively few references to globally important entities. We would also expect a higher incidence of references to locations that would be interpreted incorrectly if left to a default tagger (such as the Liverpool in New South Wales). The following properties have been identified as relevant to the default sense heuristic:

The number of entities that have a default sense.
The number for which the default sense is the correct sense. The number for which it is not.
The most commonly used default sense heuristic (population, importance).

Minimising Spatial Extent

SEM has not received as much focus in the literature as other methods of location normalisation. This may be because texts with a global focus tend to reference places that are further apart. If geo-complex texts are more likely to reference narrow areas, we may find that locations cluster together. It needs to be determined:

At what level of granularity the references in a text can be said to be clustered (none, global, country, state, local).
Whether there are multiple clusterings.

Textual Context

Techniques that focus on the textual context of a georeference examine the text for clues that help to resolve ambiguity. Geo-complex texts may be more relaxed in how rigorously they describe a location. This would lead to a smaller number of contextual patterns. The properties identified as being relevant to the performance of contextual techniques are:

The incidence of references to containing areas (countries, regions).
The incidence of places defined in reference to their container ("Perth, Australia", for instance).
The incidence of places defined in reference to nearby places ("near Darwin", "50 km from Broome").
Whether there are clues about the nature of an entity ("the Kimberley Region").

3.2.2 Non-Heuristic Measures


The properties required by disambiguation heuristics are not the only features that can differ between texts of differing geo-complexity. As we discussed in section 2.3, there are differences in both the representation and nature of georeferences between texts of varying geo-complexity.

Representation

Representation refers to the literal expressions used as location names. It is relevant to know:

The incidence of unqualified spatial deixis.
The incidence of alternative or abbreviated names.
Whether the Single Sense per Discourse heuristic holds across all text.

Nature

The nature of a reference refers to attributes such as the location's importance, its scope (in the case of references to countries and regions), and its type (house, street, city, etc.). Each of these has an effect on whether a particular location will be included in a gazetteer. An amplified gazetteer is also likely to affect the level of ambiguity encountered for each textual reference. The properties influenced by the nature of references are:

The number of entities that do not appear in a gazetteer similar to those found in the literature.


The number of entities that do not appear in an amplified gazetteer.
The number of senses each word has on average.

After removing duplicate properties and those that can be inferred from other results, we are left with 18 properties which make up a final framework for testing. Appendix B has the final format of the framework.

3.3 Discussion and Updates


The full results of our experiments with the ontological framework and the detailed reasoning behind our improvements can be found in section 5.1. Here, however, we present a summary of our findings. The main aim of this section is to provide a basis for the updates we will make to the geocoding system.

Our tests confirm the initial predictions about which texts would prove to be most geo-complex. We discovered, however, that geo-complexity exists on a continuum. The default sense heuristic, for instance, shows a relatively smooth gradation from local tourism publications up to global news. We have also found that some useful spatial properties are also properties of all well written text. If the focus of the document is events in time and space, authors will use context building techniques to orient the reader. Conversely, regardless of the scope of audience, text that is not spatially focused does not have the same potential for geotagging accuracy. This indicates that a modicum of performance can be obtained on all text without modification to the established heuristics.

On the other hand, we find that spatial extent minimisation techniques, in particular, are inadequate when applied with a larger gazetteer. Firstly, there is no mechanism for discarding sense assignments that are incorrect. Current methods use a best fit approach for assigning gazetteer entries to place names. This requires that the majority of locations in a text must be in the gazetteer, or failing that, that they have no false cognates (entries in the gazetteer with the same name, but a different location). The former is not a property of text with high geo-complexity; the latter is not a property of large gazetteers. Secondly, existing SEM methods cannot take advantage of the highly clustered nature of references in geo-complex text. Techniques that arrange senses hierarchically (according to their country and state) [11] cannot distinguish between possibilities with high granularity. Similarly, the network minimisation technique developed by Li et al [22] has trouble differentiating between senses that are close together, as the connection weights are averaged out by the large number of senses found in an amplified gazetteer.

In response, our updated system will implement two new algorithms during the SEM component of geocoding. The first is a more aggressive minimisation technique which is sensitive to locations that are extremely close together. The second is a clustering algorithm that discards senses that fall outside the perceived clustering area of a document. A final update to the system adds a new default sense heuristic, where unique entries are considered to be default values. Full details of the algorithms and their implementation are provided in Chapter 4.


CHAPTER 4

Implementation and Experimental Setup

This chapter is in three sections. The first describes the geocoding system, which is based on the models presented in the literature and the updates described in section 3.3. The second covers the aspects of our experimental setup that are not related to system design. The third gives a description of the experiments used to test our system.

4.1 Design
Figure 4.1 shows an overview of the geocoding system. Each block represents a modular component of the overall system. The modules work in turn on a model of the target document which is passed between them. The geocoder was designed to work with any or all of the modules activated and new modules can be added at any point. The steps that go into geocoding the document can broadly be described as: Initialisation, tagging and reporting. We will explain each of those sections in some detail here.

4.1.1 Initialisation
The initialisation portion of the geocoding process is where the location references in a document are converted into a format that can be tagged. In this model each reference has already been identified during the NER phase, so here it is extracted and assigned several properties for use later on. We record the reference name, its archetype (which allows us to connect two references to the same place to one another) and its position in the document. This step is also where the geocoding modules which we plan to use are selected, and where we assign the gazetteer to use for extraction.
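A plausible shape for the document model built in this step is sketched below; the field names are our own illustration of the properties just listed, not the actual implementation.

from dataclasses import dataclass

@dataclass
class Reference:
    name: str        # the literal place name as it appears in the text
    archetype: str   # shared key linking mentions of the same place
    position: int    # word offset within the document
    sense: object = None      # gazetteer entry assigned during tagging
    confidence: float = 0.0   # rating attached by the assigning module

doc_model = [
    Reference(name="Perth", archetype="perth", position=12),
    Reference(name="Perth's", archetype="perth", position=87),
]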


Figure 4.1: System Overview


4.1.2 Tagging
This component is made up of the default sense module, the context engine and the minimisation module.

Default Senses

Default senses are assigned hierarchically. Each place name retrieves a list of candidate senses. The default sense is searched for in the following order:

1. Continent, confidence 0.75
2. Country, confidence 0.75
3. Region/State, confidence 0.75
4. Max Population (population >= 20,000), confidence 0.7
5. Max Population (0 < population < 20,000), confidence 0.5

From the results of the ontological framework testing, there is reason to presume that as a gazetteer becomes more comprehensive, any entries which are unique in the gazetteer become more likely to be unique in reality. In response we have updated the default sense hierarchy to reflect this new mechanism:

6. Unique Sense, confidence 0.5

If a hit is made at a higher level, succeeding levels aren't evaluated. The confidence ratings are chosen simply to assign a rank to the different methods of assignment. While it is possible to produce a graduated scale between items 4 and 5, for instance, none of the other techniques in use are able to provide a rating with sufficient granularity to reason between them. The thresholds provided are from the literature [11, 22], but could be replaced with any ordinal scale.
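The hierarchy can be read as a cascade of predicate tests, sketched below. The candidate fields are assumed; the confidence values are those given above.

def assign_default(candidates):
    """Return (sense, confidence) from the first level that scores a hit."""
    levels = [
        (lambda c: c.get("type") == "continent", 0.75),
        (lambda c: c.get("type") == "country", 0.75),
        (lambda c: c.get("type") in ("region", "state"), 0.75),
        (lambda c: c.get("population", 0) >= 20000, 0.7),
        (lambda c: 0 < c.get("population", 0) < 20000, 0.5),
        (lambda c: len(candidates) == 1, 0.5),   # new unique-sense heuristic
    ]
    for test, confidence in levels:
        hits = [c for c in candidates if test(c)]
        if hits:
            # Within a population level, the most populous hit wins.
            return max(hits, key=lambda c: c.get("population", 0)), confidence
    return None, 0.0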


The Context Engine

The context engine is quite primitive when compared to the full featured implementations available elsewhere [22, 27]. Rather than using a set of language rules to find pattern hits, we instead use the distance between georeferences as the sole requirement for further investigation. If two references are within two words of each other, we test for the hierarchical member property. If a hit is scored, then the two references are given their assignments with confidence 0.9. After the contextual step is completed, the system propagates the sense with the highest confidence from the first two phases to the other entries with the same archetype.
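A sketch of this proximity test follows, reusing the reference model from section 4.1.1. The hierarchical_member callable stands in for the gazetteer's containment lookup and is assumed rather than implemented.

def context_pass(refs, candidates, hierarchical_member, window=2):
    # refs are the Reference objects from section 4.1.1, in document order;
    # candidates maps an archetype to its list of gazetteer senses.
    for a, b in zip(refs, refs[1:]):
        if b.position - a.position <= window:
            for ca in candidates[a.archetype]:
                for cb in candidates[b.archetype]:
                    # e.g. ca = a sense of "Perth", cb = "Western Australia"
                    # (a fuller implementation would also test the reverse order)
                    if hierarchical_member(child=ca, parent=cb):
                        a.sense, b.sense = ca, cb
                        a.confidence = b.confidence = 0.9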

Minimising Spatial Extent

There are two selectable SEM modules. The core system uses the network minimisation technique for SEM explained in section 2.2.2. The other module is a new spatial minimisation technique. The initialisation for this routine is the same as for the core algorithm, with connections given weight according to:

Weight_xy = sqrt((Lat_x - Lat_y)^2 + (Long_x - Long_y)^2) / Sim_xy

The new algorithm runs in a manner similar to Prim's algorithm [13]:

Initially: Order the connections ascending by weight and set the first node in the first connection as complete.
1. Order the connections by weight, ascending.
2. Choose the connection with one completed node that has minimal weight.
3. Mark the connection's other node as complete.
4. Remove every other node with the same archetype as the node that has been completed.
5. Remove every connection belonging to the nodes that were removed.
6. Repeat from 1 until all nodes are complete.

The main difference is that we are removing nodes as well as connections. This approach gives great importance to the first few connections that are chosen, which correspond to the nodes that are closest to one another. Compare this to the algorithm from the literature, which doesn't remove nodes and which considers the effect of all location senses on a given node.
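The following sketch is our reading of the six steps above. The node objects (carrying an archetype attribute naming the place they interpret) and the bookkeeping with id() are illustrative assumptions; weight is the distance/similarity function given earlier.

def minimise(nodes, weight):
    # Build connections between senses of different place names,
    # lightest first (the "Initially" step and step 1).
    conns = sorted(
        ((weight(a, b), a, b)
         for i, a in enumerate(nodes)
         for b in nodes[i + 1:]
         if a.archetype != b.archetype),
        key=lambda c: c[0])
    if not conns:
        return nodes
    complete = {id(conns[0][1])}   # first node of the lightest connection
    removed = set()
    while True:
        # Step 2: lightest remaining connection touching a completed node.
        frontier = [c for c in conns if id(c[1]) in complete or id(c[2]) in complete]
        if not frontier:
            break
        _, a, b = frontier[0]      # conns stays sorted, so this is minimal
        new = b if id(a) in complete else a
        complete.add(id(new))      # step 3
        # Step 4: discard rival senses of the place just grounded.
        removed |= {id(n) for n in nodes
                    if n.archetype == new.archetype and id(n) != id(new)}
        # Step 5: drop connections of removed nodes (and fully decided pairs).
        conns = [c for c in conns
                 if id(c[1]) not in removed and id(c[2]) not in removed
                 and not (id(c[1]) in complete and id(c[2]) in complete)]
    return [n for n in nodes if id(n) in complete]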

The next step in the algorithm is to determine whether the nodes are clustered. The nodes are ordered at a state/province level. If the most common state has more than three entries, it is determined that those entries form a cluster. The standard deviation from the centroid of the cluster is calculated and stored. The confidence assigned to the nodes returned from the minimisation operation is decided by each node's distance from the centroid. If the standard deviation is greater than 8 then all confidences are set to 0.6. If the s.d. is less than 8 and the node is less than two standard deviations from the centroid, it is assigned a confidence of 0.8. Otherwise it is assigned a confidence of 0. This is to allow locations with no default sense to remain unassigned. The presumption is that if they do not fall within the cluster, and they do not have a default sense assignment (which would indicate that they are potentially well known), then they are a false sense.
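Read as code, the clustering step might look like the sketch below. The threshold of 8 and the confidence values follow the text; the node attributes and the exact deviation calculation are our own assumptions.

import math

def cluster_confidences(nodes):
    # Find the most common state among the candidate nodes.
    states = [n.state for n in nodes]
    common = max(set(states), key=states.count)
    cluster = [n for n in nodes if n.state == common]
    if len(cluster) <= 3:
        return  # no cluster detected; leave confidences untouched
    # Centroid of the cluster and the deviation of its members from it.
    cx = sum(n.lat for n in cluster) / len(cluster)
    cy = sum(n.lon for n in cluster) / len(cluster)
    member_d = [math.hypot(n.lat - cx, n.lon - cy) for n in cluster]
    sd = math.sqrt(sum(d * d for d in member_d) / len(member_d))
    for node in nodes:
        d = math.hypot(node.lat - cx, node.lon - cy)
        if sd > 8:
            node.confidence = 0.6
        elif d < 2 * sd:
            node.confidence = 0.8
        else:
            node.confidence = 0.0  # outside the cluster: likely a false sense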

4.2 Experimental Setup


4.2.1 The Gazetteer
The gazetteer that we use is a combination of several freely available sources. It is designed to be viewed at multiple levels of granularity; primarily there are two such levels. The first is designed to be comparable to the gazetteers used in other work, and is built primarily from the World Gazetteer [9] and data available from the US Board of Geographic Names [10], which were the sources used by Amitay et al [11]. For each place name in this view level there is one location marked as the default sense. In total there are 139,000 entries in this gazetteer. We call this the standard gazetteer. The second viewing level is designed to present a more exhaustive list of names, including a greater number of natural features and statistical divisions. The extra entries have been obtained from the gazetteers maintained by the National Geospatial-Intelligence Agency [6] and geonames.org [2]. In total there are 6,246,000 entries in this gazetteer. We call this the amplified gazetteer.

4.2.2 Input Texts


32 documents with a total of 30,115 words were taken from the Internet and manually tagged. In total there were 611 references to places, of which 226 were unique. The breakdown of sources is displayed in Table 4.1. The categories were chosen in order to provide a diverse range of likely geo-complexity. The Global and Global News categories represent text with a global audience, which we expect to be easy to tag. The Government, Open Directory Project (ODP) and Local News categories are expected to have mixed properties. Texts in these categories are expected to have a relatively high incidence of disambiguating cues, and a moderate level of unknown entities (locations not known by the gazetteer). Local Tourism texts and Geological reports are expected to have the greatest incidence of unknown entries, and moderate to low incidence of cues for disambiguation.

  Text Type      Number  Notes
  GLOBAL         9       Documents taken from the front page of the content aggregation site Digg [1]. The site has a global, mostly western userbase.
  GLOBAL NEWS    4       Articles chosen from the front pages of the BBC [7] and New York Times [8] websites.
  GOVERNMENT     2       Pages selected randomly from the Australian .gov domain.
  ODP            5       Pages taken randomly from the Open Directory Project under the categories Top: Regional: Oceania: Australia: Western Australia: Regions and Top: Regional: Oceania: Australia: Victoria: Regions.
  LOCAL NEWS     5       Articles taken from the front pages of regional newspapers' websites.
  LOCAL TOURISM  3       Pages taken randomly from regional tourism pages focusing on Western Australia.
  GEOLOGICAL     4       Geological reports taken from companies listed in the resources sector by the Australian Stock Exchange.

Table 4.1: The breakdown of tagged documents.

4.2.3 Exclusions
We have chosen not to include texts that might be geo-complex because of language or a historical nature. These domains present their own challenges and are in some ways incompatible with the improvements we hope to generate. Foreign language texts may not have the same divisions between geo-complex and regular text as there are in English. Furthermore, place names are not constant between languages. Historical texts require specialised gazetteers if they are to be effectively tagged. Place names are constantly changing due to political division, evolving language and cultural renaming policies [18]. These additional entries are likely to confound performance on contemporary text. Thus, these factors will be left to other studies.

4.2.4 Configuration and Metrics


In order to test the performance of the core system and its modifications, the geocoding engine will tag the 32 texts in our corpus under six different configurations:

Standard Gazetteer or Standard/Core: The full geocoder working with the standard gazetteer (not including the updated heuristics).
Standard/Updated: The full geocoder using the updated heuristics on the standard gazetteer.
Defaults: The default sense heuristic is the only one that will be used. The unique entry heuristic will not be included. Run against the amplified gazetteer.
All Defaults: As before, but including the unique entry heuristic.
Core System: The full system running on the amplified gazetteer (not including the updated heuristics).
Updated System: The full system using the updated heuristics, running on the amplified gazetteer.

We have identified four relevant metrics for evaluating the effectiveness of our geocoding solution: Accuracy, Thoroughness, Coverage and Performance. These metrics were intentionally chosen over the Precision/Recall metrics used in the literature, as the interpretation of the latter is inconsistent between publications. Furthermore, Precision and Recall do not account for gazetteer size, and so do not provide an objective basis for comparison between systems. A brief explanation of each metric follows:


Accuracy

Accuracy = N_correct / N_gazetteer

Accuracy is defined as the ratio of the number of correctly assigned georeferences to the total number of references within the scope of the gazetteer. It has been chosen in order to provide a metric that can be compared over different gazetteers and which gives an indication of a configuration's ability to correctly assign locations within its own scope. A high accuracy is most useful when good results are desired for a specific list of names (eg: hospitals or airports).

Thoroughness

Thoroughness = (N_correct + N_discarded) / N_document

The thoroughness metric is designed to capture the overall reliability of the geocoder. It is defined as the percentage of entries that are either correctly identified or correctly assigned as outside the scope of the gazetteer. A high thoroughness score is good in situations when false positives are undesirable, such as when assigning an overall page focus.

Coverage

Coverage = N_correct / N_document

The coverage metric gives an indication of the overall tagging capability of a configuration. It is dependent on gazetteer size, as it shows the correctly tagged locations as a percentage of the total number of locations in the document. A high coverage is useful when the volume of locations that are successfully tagged is important (eg: data mining).

Performance

The performance metric gives an idea of overall system quality. As we use it, it is the simple average of the three other metrics. However, depending on the application, unequal weightings could be used.
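In executable form the four metrics reduce to a few lines. The count parameters mirror the formulas above; how the counts are gathered is left to the surrounding harness.

def metrics(n_correct, n_discarded, n_gazetteer, n_document):
    accuracy = n_correct / n_gazetteer
    thoroughness = (n_correct + n_discarded) / n_document
    coverage = n_correct / n_document
    performance = (accuracy + thoroughness + coverage) / 3  # equal weights
    return accuracy, thoroughness, coverage, performance

# e.g. 40 correct tags out of 50 in-gazetteer references, 10 correctly
# discarded, 60 references in the document overall:
print(metrics(40, 10, 50, 60))  # -> (0.8, 0.833..., 0.666..., 0.766...)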


4.3 Experiments
4.3.1 Testing the Ontological Framework
Section 3.1 described the three aims for our experimentation on the ontological framework:

Aim 1: To determine if it is possible to improve geocoding performance on geo-complex texts.
Aim 2: To identify properties of geo-complex text that can be used to improve geocoding performance.
Aim 3: To determine textual properties that can be used to distinguish between texts of differing geo-complexity.

Each parameter in the framework will be tested for its occurrence over the entire corpus. The results will be organised by text category, and presented according to the aim for which they are most relevant. Unless otherwise stated, results are determined against the amplified gazetteer.

4.3.2 Testing the Geocoder


Three experiments will be run on the geocoder to evaluate the different set-ups against the metrics defined in the previous section.

Experiment 1
Aim: To evaluate the performance of all configurations over the test data.
Method: The performance metric of each set-up will be calculated over the entire corpus.

Experiment 2
Aim: To evaluate the accuracy, thoroughness and coverage of the updated system.
Method: The accuracy, thoroughness and coverage of each set-up which uses the amplified gazetteer will be calculated over the entire corpus.

Experiment 3
Aim: To compare the effect of gazetteer size across all metrics.
Method: All metrics will be calculated for the core system and the updated system in both standard and amplified gazetteer configurations.


CHAPTER 5

Results and Discussion

5.1 The Ontological Framework


The results are grouped here according to their relevance to each of the aims identified in section 3.1.

5.1.1 Are Improvements Possible?


Barriers to Performance

One of the clearest limitations of current systems has been gazetteer size. Table 5.1 shows that nearly 50% of location references were not recognised by the standard gazetteer. Importantly, even the amplified gazetteer, with 45 times the number of entries, only brings this down to 24%. This puts a clear limit on the potential improvement available to an expanded solution, particularly if speed performance is a priority.

  Type         Not in Standard  Not in Amplified  Difference
  Global News  13%              10%               3%
  Global       18%              11%               7%
  Geology      59%              46%               12%
  ODP          31%              12%               19%
  Government   33%              0%                33%
  Local News   65%              30%               34%
  Tourism      75%              29%               45%
  TOTALS       49%              24%               25%

Table 5.1: Percentage of place names which aren't recognised by the Standard and Amplified Gazetteers


It is interesting to note that even in categories like global news we find locations such as The White House or Wall Street, which are not geographically significant enough to be added to a gazetteer, yet are culturally significant enough to be recognised on a global level. The differing interpretation of what constitutes a place name explains in part why these findings seem to contradict results in the literature which claim accuracy in the high 90s. Additionally, many extraction engines are gazetteer based, meaning that locations which were not in the gazetteer were not included in any calculations.

Figure 5.1 shows the relative incidence of default entries between the different text types. We can see that the types that we suspected of being most geo-complex tend to reference places which are not politically important (not capitals, regions or countries) and are not population centers. The trend is most obvious in tourist and local news articles. While in each case roughly 70% of entries are available in the gazetteer, respectively only 24% and 33% are the default entries.

Figure 5.1: Locations with the Default Sense vs Total Taggable Locations

There was only one completely unqualified spatial deixis. However, in three of the Local News texts, three of the geological texts and one of the ODP texts, spatial deixis was used in place of a literal reference without that reference being mentioned elsewhere. This decreases the number of georeferences that can be used to provide a disambiguating context for the document.

A final barrier to performance is the increased number of potential senses to choose between when using an amplified gazetteer. Table 5.2 shows the average number of senses per place name (SPP) for each gazetteer. On average, each place name has nine times as many potential locations when queried against the amplified gazetteer.

  Name           Standard SPP  Amplified SPP  Increase %
  Local Tourism  1.9           7.3            380%
  Geology        1.4           11.1           760%
  Government     2.5           21.5           860%
  ODP            1.5           13.7           920%
  Local News     2.8           27.9           970%
  Global News    1.7           18.7           1080%
  Global         2.5           34.0           1360%
  ALL DOCUMENTS  1.9           17.3           900%

Table 5.2: The average number of senses per place name in the Standard and Amplified Gazetteers

These results are surprising in that they indicate a level of ambiguity that is far greater than could be predicted from the properties of the gazetteers. 75% of names in the standard gazetteer are unique, as are 59% of those in the amplified. This leads to an expected ambiguity of about 1.18 and 1.45 senses per name respectively. We hypothesise that this disparity is a result of the English colonial origin of many Australian, African and American place names. Figure 2.1 provides some support for this conjecture.

Similarities

Many of the properties in our experimental ontology had no discernible bias toward either complex or regular text. They included the hierarchical contextual pattern (a place described by its membership of a region), the incidence of abbreviation and the incidence of references at either a country or regional level. Table 5.3 shows the percentage of documents in each category that contain each such property. The evidence here suggests that contextual patterns and container level references are simply an effective method of providing a spatial context to the reader. Abbreviations are used either out of informality or as shorthand for a location that has already been mentioned. However, there is no appreciable pattern to this usage.

Name            Hierarchical   Abbrev   Country   Region
Local Tourism       100%         66%      33%      100%
Local News           20%         20%       0%       40%
Geology             100%         75%      25%       75%
Government            0%          0%     100%        0%
ODP                  20%          0%      60%       40%
Global               22%         22%      22%       33%
Global News          50%         75%      75%       75%

Table 5.3: Incidence of properties with no discernible bias toward easy text

An exception to this is found in non-spatial files. Seven of the documents (six in the Global category, one in the ODP category) were either articles dealing with science, technology and the arts, or information on a particular product. These documents had fewer georeferences (an average of two each) and generally did not provide contextual clues.

Table 5.4 shows a property related to the default sense heuristic. It displays the percentage of place name references that have a default sense, but which are not the default sense. The results have been normalised against the total number of possible default assignments. The results show that roughly 10% of default assignments are incorrect, highlighting the usefulness of complementary heuristics.

Name            Non Default
Global News         0.0%
Local News          7.4%
ODP                11.1%
Geology            11.7%
Global             12.6%
Government         16.6%
Local Tourism      18.5%
ALL                11.4%

Table 5.4: Percentage of entries for which the default sense assignment would be incorrect

Finally, we found that the Single Sense per Discourse rule holds true equally for all texts: 98.3% of references were stable in this regard. The only exceptions were texts which contained references to both a place and a river known by the same name, such as Ord. In these cases the authors were careful to make the distinction apparent.
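The senses-per-place-name statistics reported in Table 5.2 reduce to a simple gazetteer lookup. The following Python sketch shows one way they might be computed; it is an illustration rather than our implementation, and it assumes the gazetteer is held in memory as a dict from lower-cased names to lists of candidate entries.

    def senses_per_place_name(gazetteer, place_names):
        """Mean number of candidate senses per recognised place name.

        gazetteer   -- dict mapping a lower-cased name to a list of entries
        place_names -- distinct place names extracted from one document
        """
        counts = [len(gazetteer.get(name.lower(), [])) for name in place_names]
        recognised = [c for c in counts if c > 0]  # skip names the gazetteer lacks
        return sum(recognised) / len(recognised) if recognised else 0.0

    def expected_ambiguity(gazetteer):
        """Gazetteer-wide mean senses per name (the ~1.18 figure quoted
        above for the standard gazetteer is a value of this kind)."""
        sizes = [len(entries) for entries in gazetteer.values()]
        return sum(sizes) / len(sizes)

For example, senses_per_place_name({"perth": [1, 2]}, ["Perth", "Nowhere"]) returns 2.0, since the unrecognised name is excluded from the average.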

5.1.2 What areas are there for improvement?


There are three main areas we have identified as having the potential for improvement. The first, the improved coverage provided by an amplified gazetteer, has been presented as a double-edged sword thus far. However, we have identified two further properties that may provide a performance boost, particularly when used with an amplified gazetteer.

The first of these is a new default sense heuristic, Unique Entry. Earlier we remarked on the relatively high levels of ambiguity in our sample documents and attributed this to the prevalence of British names in former colonies. The corollary of this is that names in geo-complex text, which frequently has a local focus, are more likely to be inherited from local dialects or be otherwise unique. This is supported by the results in Table 5.2, which show that the texts we expected to be geo-complex have references with fewer possible senses. Furthermore, as the total coverage provided by the gazetteer becomes more representative, there is a higher chance that if there is only one entry for a name in the gazetteer, then there is only one instance of that name in reality. We thus theorise that the larger the gazetteer, the more likely it is that a unique entry which is not already a default entry is the correct assignment.

The other property which is notably and exploitably different as a function of geo-complexity is reference clustering. Existing spatial extent minimisation techniques are intended to capitalise on this property. However, as Table 5.5 shows, location clusters (the geographic scope required to contain 80% or more of the spatial references) are much tighter in geo-complex text.

Name            Clustering
Global News        0.75
Global             1.1
Government         1.5
ODP                2.0
Local News         2.6
Geology            4.0
Local Tourism      4.0
Total              2.1

Table 5.5: Georeference cluster rating by type (0 = no clustering, 4 = local clustering)

The clustering property causes us to modify our approach to SEM in two ways. Firstly, the best results currently in the literature use a modified version of Prim's algorithm for producing a minimum spanning tree (Section 2.2.2). The weight of each sense is calculated as the total weight of all the connections it is a part of. The sheer number of connections possible when using an amplified gazetteer will have an averaging effect: a cluster with a diameter of 400 kilometres will not appear significantly different from a cluster with a diameter of 40 kilometres, as long as there is a significant number of links to senses on the other side of the world. To counter the averaging effect, we require a more aggressive minimisation routine which is sensitive to locations that are very near one another. Secondly, the clustering effect provides us with a mechanism to automatically check the results we get from a minimisation routine: if a result set is determined to be highly clustered, then it may be possible to improve results by discarding elements that are beyond the cluster region, provided they lack any other evidence for being kept. These clustering heuristics work better with a large gazetteer, as a greater number of tagged place names would lead to a more precise cluster.
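A minimal sketch of the second idea, discarding senses that fall outside a tight cluster, follows. This is an illustration under stated assumptions rather than our exact implementation: each assignment carries a latitude/longitude pair, distance is haversine, the cluster radius is a placeholder parameter, and a naive coordinate average stands in for the centroid (which misbehaves near the antimeridian).

    import math

    def haversine_km(a, b):
        """Great-circle distance in kilometres between two (lat, lon) pairs."""
        lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371.0 * math.asin(math.sqrt(h))

    def prune_outliers(assignments, radius_km=200.0):
        """Drop assignments far from the cluster centroid unless supported.

        assignments -- list of (name, (lat, lon), supported) triples, where
                       `supported` marks senses with independent evidence,
                       e.g. a qualifying container such as "Perth, Scotland".
        """
        if not assignments:
            return []
        # Crude centroid: the mean of all assigned coordinates.
        lat = sum(p[1][0] for p in assignments) / len(assignments)
        lon = sum(p[1][1] for p in assignments) / len(assignments)
        centroid = (lat, lon)
        return [p for p in assignments
                if p[2] or haversine_km(p[1], centroid) <= radius_km]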

5.1.3 What are the Identifiable Properties of Geo-Complex Text?


Our particular results are gazetteer dependent, but the trends they display are applicable to all systems. A more geo-complex text can be identified by:

- A higher incidence of locations not stored in the gazetteer (Table 5.1)
- A lower incidence of locations that are the default sense (Figure 5.1)
- A lower number of senses per place name (Table 5.2)
- A smaller total geographic extent (Table 5.5)
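For completeness, these four indicators can be folded into a simple screening test. The sketch below is hypothetical: the field names and thresholds are placeholders that would need calibrating against a labelled corpus such as the one used in this chapter.

    def looks_geo_complex(stats,
                          min_unrecognised=0.2,   # placeholder thresholds
                          max_default_ratio=0.5,
                          max_senses=10.0,
                          max_extent_km=500.0):
        """Vote over the four hard-text indicators; flag when most agree.

        `stats` is assumed to be a dict with keys: unrecognised_ratio,
        default_ratio, senses_per_name and extent_km, matching the
        properties listed above.
        """
        votes = [
            stats["unrecognised_ratio"] > min_unrecognised,
            stats["default_ratio"] < max_default_ratio,
            stats["senses_per_name"] < max_senses,
            stats["extent_km"] < max_extent_km,
        ]
        return sum(votes) >= 3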

5.2 Geocoder Tests


5.2.1 Experiment 1: Overall Performance
Figures 5.2 and 5.3 show the performance by configuration and the best performance, respectively. The all defaults configuration is shown to be competitive with the others for all but the most geo-complex categories, but it never performs best. This agrees with the literature, in that it shows that increasing the number of candidate senses does improve results.

However, this is not true in all cases. The core system and the standard gazetteer configurations are differentiated only by gazetteer, as they both use the same heuristics. The fact that the core system performs slightly worse indicates that scalable heuristics are more important to performance than gazetteer size alone. Figure 5.3 shows that the best all-round configuration is the updated system running on the amplified gazetteer. This is because it has similar characteristics to a default tagger when operating on text with low geo-complexity, but activates the clustering and minimisation characteristics when it is advantageous to do so.


Figure 5.2: Overall Performance by Configuration

Figure 5.3: Best Performance over the Dataset


5.2.2 Experiment 2: Performance of the Updated System


In Figures 5.4, 5.5 and 5.6 (accuracy, thoroughness and coverage) we see that the updated heuristics perform better overall than the heuristics from the literature. The all defaults configuration has better ratings in all metrics than the defaults configuration, and better thoroughness than the core system. The latter can be attributed to the fact that a default tagger does not assign any confidence to places that have no default entry, meaning that it correctly discards more incorrect values than the core system. On the other hand, because of the more aggressive way that it assigns references, the core system beats the all defaults configuration on accuracy and coverage, particularly as geo-complexity increases.

The configuration using updated heuristics is the best overall in all metrics. The difference is greatest on highly geo-complex texts with tight clustering. In the local tourism category it has 15% better accuracy, 15% better thoroughness and 9% better coverage than the next best configuration. Similarly, when used on text with low geo-complexity it performs equal to or better than all other configurations. However, performance degrades on text that is moderately geo-complex (the Local News and ODP categories). In the ODP category it has the second worst accuracy and coverage, and the worst thoroughness. These texts do not have a level of clustering sufficient for the clustering algorithm to work well, nor do they have a high incidence of default values. Because the updated configuration falls back on default values when clustering fails, it performs worse than the core system.
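The behaviour just described, clustering first with the default sense as a fallback, can be summarised in a few lines. This sketch reuses the haversine_km helper from the example in Section 5.1.2 and is, again, an illustration under assumptions rather than the system's exact logic.

    def assign_sense(candidates, centroid, radius_km=200.0, default=None):
        """Resolve one place name: prefer a unique in-cluster sense,
        otherwise fall back to the default sense (which may be None,
        leaving the name untagged).

        candidates -- list of (entry, (lat, lon)) pairs from the gazetteer
        centroid   -- document cluster centre, or None if clustering failed
        """
        if centroid is not None:
            in_cluster = [c for c in candidates
                          if haversine_km(c[1], centroid) <= radius_km]
            if len(in_cluster) == 1:   # unambiguous within the cluster
                return in_cluster[0]
        return default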


Figure 5.4: Accuracy of Metrics Using the Amplified Gazetteer

Figure 5.5: Thoroughness of Metrics Using the Amplified Gazetteer

Figure 5.6: Coverage of Metrics Using the Amplified Gazetteer

5.2.3 Experiment 3: The Effect of Gazetteer Size across all Metrics


Figures 5.7, 5.8, 5.9 and 5.10 show the comparative accuracy, thoroughness, coverage and overall performance of the core and updated set-ups between gazetteers. It is immediately obvious that the configurations using the standard gazetteer have much better accuracy and thoroughness than the amplified systems. Furthermore, the Amplified/Core system performs worse in both metrics over all categories. This is to be expected, as none of the place names with high ambiguity but no default sense are included in the standard gazetteer. As such, the standard configurations are essentially working on an easier version of the text. This also explains why the Standard/Updated system achieved accuracy of 100% in four of the categories.

In contrast with Amplified/Core, the Amplified/Updated configuration achieves comparable (<5% difference) or greater accuracy in four out of seven categories, and comparable thoroughness in five out of seven. This is once again due to the fact that it requires greater certainty before it makes an assignment than does the core system.

The coverage of both amplified configurations is better than that of both standard ones. This is to be expected, as coverage values the total number of correctly assigned place names more highly than the other metrics do. The difference in coverage somewhat compensates for the other two metrics when it comes to the overall performance metric. Overall, the Amplified/Core, Standard/Core and Standard/Updated configurations perform similarly. The Amplified/Updated configuration again has the best performance, being comparable to the other set-ups in four categories and better in two.


Figure 5.7: Comparative Accuracy Between Gazetteers

Figure 5.8: Comparative Thoroughness Between Gazetteers

Figure 5.9: Comparative Coverage Between Gazetteers

Figure 5.10: Comparative Performance Between Gazetteers


CHAPTER 6

Conclusions and Further Work

6.1 Conclusion
In this dissertation we attempted to improve geocoding performance for geo-complex texts without resorting to external sources of context. We focused specifically on text that is hard to tag because of high ambiguity, a low incidence of default senses and a higher incidence of unrecognised place names. We hypothesised that such text could have other, unique properties that could be leveraged to improve the accuracy, thoroughness, coverage and overall performance of a tagger.

The ontological framework we constructed provided us with a means to assess the blind spots inherent in current geocoding techniques. From this assessment we developed three new heuristics: a new default sense heuristic, a new SEM heuristic that is more sensitive to neighbouring locations, and a clustering technique for discarding unlikely references. Our updated system confirmed our original hypothesis in that it showed that geo-complex texts could indeed be tagged with high accuracy when reference clustering was present. Furthermore, on an overall performance measure, the new heuristics performed similarly to or better than all other configurations in six of the seven text categories.

6.2 Further Work


There are several results from this study that warrant further investigation. Firstly, the clustering algorithm we use is only capable of working off a single centroid. However, texts may be better modelled with more than one cluster point. Furthermore, a single cluster point devalues important associations, such as state-to-state within a country, and thus assigns more importance to the default sense heuristic for determining the remaining entities. Future work may investigate the use of more cluster points, or may use the derived centroid as a first-pass mechanism for improving the weighting on a second minimisation operation.

On a related note, the minimisation and clustering algorithms, which work over the entire text, could be developed to handle longer documents. In long text, locations that are in two different sections should not be allowed to influence each other unduly.

Later studies may attempt to confirm our identified hard-text properties against a larger and more varied sample corpus. This study had a bias toward Australian and North American locations, but English text focusing on other regions may, in fact, have slightly different properties.

Finally, much work needs to be done on improving the speed of tagging with an amplified gazetteer. In this research we identified a set of key properties that differentiated hard text from soft. This information could be used to improve speed by using a standard gazetteer initially, switching automatically to the amplified gazetteer if key hard-text indicators are encountered. Such a system could conceivably have the speed and thoroughness of a standard approach, coupled with the recall of an extended implementation.
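As a sketch of that proposed hybrid, a two-pass tagger might take the following shape. Every identifier here is a hypothetical stand-in: `tagger` geocodes text against a gazetteer and returns its results together with the indicator statistics, and `looks_geo_complex` is the screening test sketched in Section 5.1.3.

    def two_pass_geocode(text, standard_gazetteer, amplified_gazetteer,
                         tagger, looks_geo_complex):
        """Tag with the fast standard gazetteer first; redo the work with
        the amplified gazetteer only when hard-text indicators fire."""
        result, stats = tagger(text, standard_gazetteer)
        if looks_geo_complex(stats):
            result, stats = tagger(text, amplified_gazetteer)
        return result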



APPENDIX A

Initial Project Proposal

Background
Everyone is now used to the concept of searching for documents using the Google interface, where we normally expect to find the required document in the first twenty items or not at all, even if the number of possible matches is in the tens of thousands. Unfortunately, a text-based search frequently leads to a dead end. Annually, thousands of hours are wasted by people who are either searching for information that isn't there, failing to find information that is, or recreating information that already exists. [6] In 2001, International Data Corporation attempted to quantify these losses. They estimated that a company with 1000 knowledge workers wastes $2.5 - $3.5 million a year on irrelevant searches, the unproductive time equating to approximately $15 million in lost revenue.

One way of improving the relevance of search results is to filter the results by location. It is estimated, for instance, that 80% of the information dealt with by local governments has some discernible association with a location [1]. Furthermore, Delboni et al. found that even among the general populace, 20% of queries submitted to search engines included a geographic reference. [8]

The process of assigning a spatial reference to a document or other piece of data, based on an analysis of its content, is called geocoding, or sometimes geotagging, localising or grounding. The effort spent on geocoding techniques has traditionally focused on determining an exact spatial reference from an address, or a near-exact reference based on building names, postal codes or telephone area codes [1]. However, there are three main reasons why address referencing is not sufficient in itself for document categorisation:

- Address resolution is limited to urban areas and places with an organised addressing system.


- Locations are often referenced by means other than a formal address.
- Addresses do not always indicate the target location of a document. [5]

A different approach to the problem is to apply Named Entity Recognition (NER) techniques to textual data in order to derive a spatial reference for the document as a whole, based on the occurrence of locations within the text which are recognised by a well-populated gazetteer. Examples of this method are MetaCarta's Geographic Text Search [4] and the Web-a-Where geotagging system, developed by Amitay et al. [5]. However, distilling a spatial focus from a document is far more challenging and less precise than address resolution. [2]

The two main barriers to accurate NER geocoding are entity ambiguity and source-target ambiguity. Entity ambiguity occurs when a word that potentially refers to a location also has some other meaning. Ambiguous pairs may both be place names, like Perth, Scotland and Perth, Western Australia, or may have a non-spatial possibility, like Buffalo, N.Y. and Buffalo Bill. Source-target ambiguities occur when there are references to a document's origin within the text body, such as the author's address. The context required to understand and correctly interpret these ambiguities is frequently not found within the text itself, but rather derived from external experience. [4]

Multiple heuristics have been proposed as means of determining the correct disambiguation to choose, including:

- Choosing locations with a high population over those with a low one [5, 4]
- Selecting locations over the whole text in order to minimise total coverage, letting the textual proximity of named entities within the text imply a geographical relationship [2, 5, 7]
- Using the immediate context of the entity, e.g. Buffalo, N.Y. as opposed to Mr. James Buffalo

Combining several of these heuristics leads to fairly accurate results; however, there is room for improvement. Amitay et al. found that when applied to web pages from the .gov domain, their focus algorithm was only able to identify documents at a country level 73% of the time [5]. They found that documents not meant for public consumption typically carried fewer disambiguating cues for locations assumed to be general knowledge. The challenge thus remains to produce a focus-finding algorithm which is able to more effectively deduce a focus for use as an aid to document discovery.
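To make the first of these heuristics concrete, a population-based default sense can be sketched in a few lines of Python. The `population` attribute and the choice to return None when no candidate has population data are assumptions for illustration, not details taken from the cited systems.

    def population_default(candidates):
        """Return the most populous candidate sense, or None when the
        population data cannot discriminate between the candidates.

        candidates -- gazetteer entries for one name, each assumed to
                      carry a numeric `population` field (0 when unknown)
        """
        best = max(candidates, key=lambda c: c.population, default=None)
        return best if best is not None and best.population > 0 else None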

Aims
Project Aims
The aim of this project is to implement a working focus-finding algorithm, to investigate various disambiguation heuristics [5, 4, 7] and to evaluate their overall effect on geocoding accuracy. The implementation will act on text data to produce a probable spatial reference for the document. As each named entity cannot be guaranteed to be correctly resolved, we will take a fuzzy approach to the derivation of the focus from the named entity list. The references will be indexed so that a later query will produce a list of documents with an associated relevance index when given an initial spatial window. The geographic relevance index will be combined with a standard text relevance index in order to produce a final search list. By combining a spatial relevance index with a Google-style relevance index, we expect to increase the probability of discovery of relevant documents without having to go beyond the first 20 documents. [4, 6]
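A minimal sketch of the intended score combination follows, assuming both indexes yield scores normalised to [0, 1]; the linear blend and the `alpha` weight are illustrative choices rather than a committed design.

    def combined_relevance(text_score, spatial_score, alpha=0.5):
        """Blend textual and spatial relevance into one ranking score.

        alpha -- hypothetical mixing weight, to be tuned experimentally
        """
        return alpha * text_score + (1.0 - alpha) * spatial_score

Documents matching a query would then be sorted by this combined score before being returned to the user.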

Method
The method can be summed up in six steps:

1. The document to be geocoded is parsed and indexed. It is also passed to a text indexer for normal processing.
2. The document index is checked against a pre-compiled gazetteer of place names in order to identify named entities which may refer to geographic phenomena.
3. Disambiguation heuristics are applied to decide on a single context for each entity, giving each a confidence rating.
4. Fuzzy logic is applied to determine one or more focuses for the document.
5. The resulting spatial relevance index is stored.
6. Searches on the document base provide both textual and spatial criteria. The resulting document list is a combination of both the spatial and textual relevance.


Design Criteria:
We aim to produce an implementation that is highly modular, allowing for future developments such as a map-display interface for geographic filtering, and which is not constrained by any particular gazetteer or geocoding algorithm.

Project Timeline


Research: April
- Gazetteer format
- Document indexing format
- Gather test data
- Test design
- Full-text indexing systems and APIs
- Domain knowledge and focus heuristics

System Design: May
- Produce system design document

Implementation: May - July
- Gazetteer creation
- Document indexing module
- Context heuristics

Testing: June - August
- Document indexing
- Heuristics tests
- Combined index relevance tests

Reporting: April - October
- Documentation: April onwards
- Compile documentation into draft dissertation: September
- Final dissertation: October


Software and Hardware Requirements


No Special Requirements.


Bibliography
[1] Clodoveu A. Davis Jr and Frederico T. Fonseca. Assessing the certainty of locations produced by an address geocoding system. Geoinformatica, 11:103-109, 2007.

[2] F. Bilhaut, T. Charnois, P. Enjalbert, and Y. Mathet. Geographic reference analysis for geographic document querying. In Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References - Volume 1, pages 55-62, Morristown, NJ, 2003. Association for Computational Linguistics.

[3] D. Wu, G. Ngai, M. Carpuat, J. Larsen, and Y. Yang. Boosting for named entity recognition. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, pages 1-4, Morristown, NJ, 2002. Association for Computational Linguistics.

[4] E. Rauch, M. Bukatin, and K. Baker. A confidence-based framework for disambiguating geographic terms. In Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References - Volume 1, Morristown, NJ, 2003. Association for Computational Linguistics.

[5] E. Amitay, N. Har'El, R. Sivan, and A. Soffer. Web-a-where: geotagging web content. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '04, pages 273-280, New York, NY, 2004. ACM.

[6] S. Feldman and C. Sherman. The High Cost of Not Finding Information. IDC, Framingham, MA, 2001.

[7] J. L. Leidner, G. Sinclair, and B. Webber. Grounding spatial named entities for information extraction and question answering. In Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References - Volume 1, Morristown, NJ, 2003. Association for Computational Linguistics.

[8] T. Delboni, K. Borges, A. Laender, and C. Davis. Semantic expansion of geographic web queries based on natural language positioning expressions. Transactions in GIS, 11:377-397, 2007.


APPENDIX B

The Hard Text Framework


Property                 Description
Default Population       The number of references that are correctly defaulted based on the population heuristic
Default Unique           The number of references that are correctly defaulted because the entry is unique
Default Capital          The number of references that are correctly defaulted based on the Country, Region, Capital heuristic
Non Default              The number of entries for which a default exists, but would be incorrect
Non Standard             The number of entries that aren't found in a gazetteer where every distinct name has a default entry
Non Expanded Gazetteer   The number of entries that aren't found in an amplified gazetteer
Country                  Does the text contain a reference to the country that it is about?
Region                   Does the text contain a reference to the state or province that it is about?
Hierarchical             Does the text define locations by their hierarchical containers?
Nearby                   Does the text define locations by other, nearby locations?
Abbrev                   Does the text use abbreviations or slang words as location names?
Unqual Deixis            Are there instances of unqualified spatial deixis?
Feature                  The number of references to features like rivers or cliffs
Total                    The total number of references
Unique                   The number of distinct place names
Options                  The number of gazetteer entries that match one of the place names
No. words                The number of words in the document
Granularity              The level of granularity sufficient to cover more than 80% of references (None = 0, Global = 1, Country = 2, State = 3, Local = 4)

Table B.1: The list of properties required for geocoding heuristics.

