Improved Document Geocoding For Geo-Complex Text: Ian Patullo
Ian Patullo
This report is submitted as partial fulfilment of the requirements for the Honours Programme of the School of Computer Science and Software Engineering, The University of Western Australia, 2008
Abstract
Document geocoding is the process of assigning a spatial reference to a piece of text based on an analysis of its content. This project is concerned with improving geocoding performance on text with high geo-complexity, that is, text which traditional systems struggle to tag correctly. During the course of this dissertation we produce an ontological framework which describes the way current techniques identify locations in text, and develop new heuristics that work specifically where the old ones fail. Using a performance metric that combines the accuracy of assignment with the number of false positives and the overall coverage of place names in a document, we show that our system is capable of improving performance in some texts with high geo-complexity, while attaining similar performance on more standard texts.
Keywords: Document Geocoding, Geotagging, Disambiguation, Text Mining, Information Retrieval, Gazetteer
CR Categories: I.2.7 [Natural Language Processing]: Text Analysis
Acknowledgements
To my supervisor, Dr Wei Liu, for her direction and advice. To Dave Pratt for the idea. To everyone at work, for their interest, flexibility and feedback. To my family and friends, who no doubt missed me. To my brain, which got me through. Thanks to all of you.
Contents
Abstract
Acknowledgements
1 Introduction
   1.1 An Introduction to Geocoding
   1.2 Motivation
   1.3 Geo-Complexity and this Dissertation
2 Background
   2.1 Unstructured Geocoding
      2.1.1 Locating Place Names within Text
      2.1.2 Turning Names into Locations
   2.2 Disambiguation
      2.2.1 Initial Heuristics
      2.2.2 Secondary Heuristics
3 Investigating Deeper into Geo-Complexity
   3.1 The Call for Investigation
   3.2 Targets for Investigation
      3.2.1 Heuristics
      3.2.2 Non-Heuristic Measures
4 Design and Experiments
   4.1 Design
      4.1.1 Initialisation
      4.1.2 Tagging
   4.2 Experimental Setup
      4.2.1 The Gazetteer
      4.2.2 Input Texts
      4.2.3 Exclusions
      4.2.4 Configuration and Metrics
   4.3 Experiments
      4.3.1 Testing the Ontological Framework
      4.3.2 Testing the Geocoder
5 Results and Discussion
   5.1 The Ontological Framework
      5.1.1 Are Improvements Possible?
      5.1.2 What Areas Are There for Improvement?
      5.1.3 What Are the Identifiable Properties of Geo-Complex Text?
   5.2 Geocoder Tests
      5.2.1 Experiment 1: Overall Performance
      5.2.2 Experiment 2: Performance of the Updated System
      5.2.3 Experiment 3: The Effect of Gazetteer Size across all Metrics
List of Tables
2.1 Similarity Weighting Table
4.1 The breakdown of tagged documents
5.1 Percentage of place names which aren't recognised by the Standard and Extended Gazetteers
5.2 The average number of senses per place name in the Standard and Amplified Gazetteers
5.3 Instance of Properties with no Discernible Bias toward easy text
5.4 Percentage of entries for which the default sense assignment would be incorrect
5.5 Georeference cluster rating by type (0 = no clustering, 4 = Local clustering)
List of Figures
2.1 All possible geographic interpretations of the names of the Australian capitals
4.1 System Overview
5.1 Locations with the Default Sense vs Total Taggable Locations
5.2 Overall Performance by Configuration
5.3 Best Performance over the Dataset
5.4 Accuracy of Metrics Using the Amplified Gazetteer
5.5 Thoroughness of Metrics Using the Amplified Gazetteer
5.6 Coverage of Metrics Using the Amplified Gazetteer
5.7 Comparative Accuracy Between Gazetteers
5.8 Comparative Thoroughness Between Gazetteers
5.9 Comparative Coverage Between Gazetteers
CHAPTER 1
Introduction
1.2 Motivation
Most recent research has been directed toward unstructured geocoding. In particular, many implementations have focused on using the properties of text on the Internet to assign spatial references more accurately. These geocoders tag locations by combining evidence from the document itself with externally available
context. The performance of these tools has been attributed in part to the fact that the web pages which they target are written to be understood by people from all over the world [22]. Due to the ease with which these sorts of texts are geocoded, we say they have a low geo-complexity. However, performance degrades when these methods are applied to text without a global focus, such as government web pages, local newspapers or other sites that cater to specific communities. Yet many texts, online or offline, are written to be understood only by certain demographics. Furthermore, offline texts do not have the meta-information that online texts do. Thus there is a need to develop better ways of geocoding geo-complex text without resorting to context beyond what is available in the document itself. To do that, we need to produce both a description of the properties of geo-complex text, and a geocoding solution that takes advantage of those properties to improve its performance.
were not built to handle these documents well. This leads to the hypothesis that underpins this dissertation: Hypothesis: Geo-complex text has distinctive properties that can be identified and leveraged to improve geocoding performance. Over the course of this document we will develop this idea. Our approach is centred on developing updated heuristics for choosing between the candidate senses of place names in text. This has two component steps. Firstly, the current range of heuristics is analysed to find weaknesses and opportunities for improvement. Secondly, new algorithms and heuristics are developed to take advantage of these opportunities. Geo-complex text is found to contain tightly clustered references, and this property is used as a means for identifying locations with higher accuracy. The same property is used for discarding locations that fall outside the cluster point. The new heuristics improve geocoding performance on geo-complex text by up to 15%. This document is broken up into the following chapters: Chapter 2 provides a review of current geocoding heuristics. It also discusses how geo-complexity has been recognised and dealt with in the literature. In Chapter 3 we systematically develop an ontological framework for assessing the effectiveness of geocoding heuristics. We use the framework to derive a set of parameters that characterise geo-complexity and to develop three modified heuristics for use with geo-complex text. Chapter 4 describes the design of our geocoding system and details two sets of experiments: the first relating to the framework in Chapter 3, the second relating to the geocoder and our updated heuristics. Chapter 5 covers the results of these experiments and the related discussion.
CHAPTER 2
Background
Many sources of text contain structured information, such as phone numbers, area codes or addresses. This data can be resolved into a fairly accurate and unambiguous location. The locations referred to by a document can then be used to determine its geographic scope. This is often termed Address Geocoding. However, the characteristics that such methods rely upon are not true for all text. In particular:

- Address resolution is limited to texts referencing urban areas and places with an organised addressing system [16].
- Locations are often referenced by means other than a formal address, and many documents contain only unstructured references to locations.
- Addresses may indicate the source rather than the target location of a document [11].

Limitations of address geocoding have led to the current interest in geocoding based on the unstructured georeferences in a document. This is a challenging domain, and much work has been invested in finding ways to improve the performance of this type of tagging.
Figure 2.1: All possible geographic interpretations of the names of the Australian capitals
2.2 Disambiguation
In order for a recognised geographic entity to become a useful piece of metadata it must be grounded, meaning it must be resolved to a single unambiguous location [20]. A georeference can have many candidate locations. Amitay et al found that 37% of geographic references in Web pages had more than one potential target [11]. Smith and Crane found the phenomenon to be even more pronounced in classical texts, where up to 92% of place names were ambiguous [29], though these results are dependent on both the size of the gazetteer and the nature of the texts that were tested. In Figure 2.1 we see the geographical area covered by just eight references: Adelaide, Brisbane, Canberra, Darwin, Hobart, Melbourne, Perth and Sydney. Numerous methods exist for choosing between the candidate senses of a geographic reference. For clarity we separate them into Initial Heuristics, which can be applied to a document as a first pass, and Secondary Heuristics, which work best after some locations have already been decided.
Srihari, Niu and Li [30] used the Oxford English Dictionary to select default senses. They argued that large cities or countries that made it into the dictionary were likely to be the most frequently mentioned. Default sense is seldom the only heuristic used, because increasing the number of candidate senses generally improves geocoding performance [11].

Textual Context

Reliable initial information can be obtained by analysing the context of each place name. Documents intended for a diverse or global audience, most notably web resources, typically contain disambiguating cues that reduce the need for the reader to have any specific local knowledge [11]. Because these cues are often in a recognisable format, they can be identified by the geocoder. Local context is queried using similar techniques to the rule-based entity extraction discussed in section 2.1.1 [27]. Likely formats that can be grounded this way are phrases like:

- City of Perth
- State of Illinois
- Sydney, New South Wales
- Chicago, the Windy City

These cues allow for place names to be grounded with a high probability immediately after the extraction phase [22].

Other Sources of Context

For web-based resources, information besides the plain text content is available. Harvesting this information allows geocoders to further improve their grounding accuracy. Three main contextual sources are outlined below.

The Link Cloud [17, 31]: The pattern of incoming links to a web resource can give some idea of the location of a page's content. The sites that contain links to a resource can be quickly assigned a spatial reference by their IP address, by looking at whois.com or even by geocoding the linking pages. The rationale behind this method is that sites referencing a particular location will be of more interest to other sites from that location. Such information can also be held separate from the rest of the
geocoding process and used to distinguish between the content of the page and the audience of the page.

Metadata in Content: Some web resources, like Wikipedia, contain structured meta-information in charts and tables. Overell et al [27] were able to reliably ground Wikipedia articles by examining the meta-information they contained. They found that many articles about places had charts with coordinate information that could be checked against known location candidates.

IP Address Information [28, 32]: The IP address of a web page can give some indication as to where the page is likely to be talking about, as can the URL. Such information needs to be considered carefully, as it may only provide details of the source location, rather than the location of the content.
are used to infer references to child nodes from references to the parent node. If, for example, a reference to Western Australia was successfully extracted from a text and grounded, subsequent references to Perth and Scarborough would be associated with places in Western Australia, rather than those in the U.K. More sophisticated models [11] also treat references to child nodes as evidence for the parent. While this technique works best as a secondary heuristic, Li et al [22] suggest a means for it to be used as an initial heuristic: they construct a weighted graph where each node is a candidate place name. A connection is made between each pair of nodes that do not share the same name. The connection weight is set according to the function:

Weight_xy = sqrt((Lat_x - Lat_y)^2 + (Long_x - Long_y)^2) / Sim_xy
where Sim_xy is the similarity weighting between x and y as described in Table 2.1.

  X      Y        Relation      Similarity
  City   City     Same State    3
  City   State    Same State    4
  City   Country  Same Country  4
  City   City     Same Country  2
  Otherwise                     1

Table 2.1: Similarity Weighting Table

Each sense is given a total score, which is the sum of the weights of each connection it is a part of. If the ratio between the lowest and second lowest scores is less than a defined threshold, then the sense with the lowest weight is returned with a high confidence rating. Otherwise the sense with the lowest weight is returned with a lower confidence, in order to rate it below the default sense heuristics. This approach can produce a spatial extent that is close to minimal without requiring numerous pre-grounded locations, and we use it as part of our core geocoding system.
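The scoring procedure above can be sketched in a few lines of code. This is a minimal illustration rather than the implementation of Li et al: the gazetteer entries and coordinates are invented, the weight formula (planar distance divided by similarity) is our reconstruction, and the similarity function covers only a few rows of Table 2.1.

```python
import math

# Toy candidate senses per surface name; coordinates are approximate.
CANDIDATES = {
    "Perth": [
        {"lat": -31.95, "lon": 115.86, "state": "WA", "country": "AU"},
        {"lat": 56.40, "lon": -3.43, "state": "PER", "country": "UK"},
    ],
    "Fremantle": [
        {"lat": -32.05, "lon": 115.75, "state": "WA", "country": "AU"},
    ],
}

def similarity(a, b):
    """Cut-down stand-in for the weightings of Table 2.1 (cities only)."""
    if a["state"] == b["state"]:
        return 3
    if a["country"] == b["country"]:
        return 2
    return 1

def ground(candidates, ratio=0.5):
    """Score every sense against the senses of all other names and keep the
    lowest-weight sense per name, flagging it as high-confidence when its
    score is clearly separated from the runner-up."""
    result = {}
    for name, senses in candidates.items():
        scored = []
        for s in senses:
            total = 0.0
            for other, osenses in candidates.items():
                if other == name:
                    continue  # no connections between senses of the same name
                for o in osenses:
                    dist = math.hypot(s["lat"] - o["lat"], s["lon"] - o["lon"])
                    total += dist / similarity(s, o)
            scored.append((total, s))
        scored.sort(key=lambda t: t[0])
        if len(scored) > 1 and scored[1][0] > 0:
            confident = scored[0][0] / scored[1][0] < ratio
        else:
            confident = True  # a unique sense is trivially the minimum
        result[name] = (scored[0][1], confident)
    return result
```

Running `ground(CANDIDATES)` selects the Western Australian Perth with high confidence: its discounted distance to Fremantle is orders of magnitude smaller than that of its Scottish namesake, so the score ratio falls well under the threshold.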
the tagging precision while the second tends to compromise the entity extraction performance significantly [15, 24, 23].
CHAPTER 3
Investigating Deeper into Geo-Complexity
In the previous chapter we discussed the range of current heuristics and their limitations as exposed in the literature. We now investigate geo-complexity with an experimental analysis of the texts themselves.
4. The Northern train line ends near Grandma's house.

Items 1 and 2 are a fully qualified reference and a qualified spatial deixis respectively. Items 3 and 4 are examples of an unqualified spatial deixis. In practice we can use more than the context of a single sentence to assign an origo, but if unqualified deixes were the only factor in geo-complexity, we would conclude that very little could be done to improve tagging performance without resorting to a special purpose algorithm.

Aim 2: To identify properties of geo-complex text that can be used to improve geocoding performance. The second motivation is that an experimental definition will help identify distinct areas for improvement. Trends in the way locations are mentioned, the types of location present (e.g. country and capital, or suburb and street) and any other significant statistics can be exploited to improve results. The previous chapter already indicates that there are multiple contributing factors to geo-complexity. Our results will enable us to target the factors that can be compensated for most effectively.

Aim 3: To determine textual properties that can be used to distinguish between texts of differing geo-complexity. The third and final motivation for an investigation into geo-complex text is so we might be able to determine automatically whether text is geo-complex or not. Ideally this would mean being able to perform identification based on content alone. However, finding a relationship between geo-complexity and text category would also be beneficial. We expect that local newspapers will be harder to tag than global ones, for instance, but producing more detailed guidelines along these lines will help resolve how our results can best be applied.
3.2.1 Heuristics
Default Sense

Section 2.2.1 indicates that default senses are chosen based on statistical or perceptual importance. Intuitively we would expect texts with high geo-complexity, which are mostly local and of narrow interest, to contain relatively few references to globally important entities. We would also expect a higher incidence of references to locations that would be interpreted incorrectly if left to a default tagger (such as the Liverpool in New South Wales). The following properties have been identified as relevant to the default sense heuristic:

- The number of entities that have a default sense.
- The number for which the default sense is the correct sense.
- The number for which it is not.
- The most commonly used default sense heuristic (population, importance).

Minimising Spatial Extent

SEM has not received as much focus in the literature as other methods of location normalisation. This may be because texts with a global focus tend to reference places that are further apart. If geo-complex texts are more likely to reference narrow areas, we may find that locations cluster together. It needs to be determined:

- At what level of granularity the references in a text can be said to be clustered (none, global, country, state, local).
- Whether there are multiple clusterings.

Textual Context

Techniques that focus on the textual context of a georeference examine the text for clues that help to resolve ambiguity. Geo-complex texts may be more relaxed in how rigorously they describe a location. This would lead to a smaller number of contextual patterns. The properties identified as being relevant to the performance of contextual techniques are:
- The incidence of references to containing areas (countries, regions).
- The incidence of places defined in reference to their container (Perth, Australia, for instance).
- The incidence of places defined in reference to nearby places (near Darwin, 50 km from Broome).
- Whether there are clues about the nature of an entity (the Kimberley Region).
- The number of entities that do not appear in an amplified gazetteer.
- The number of senses each word has on average.

After removing duplicate properties and those that can be inferred from other results, we are left with 18 properties which make up a final framework for testing. Appendix B has the final format of the framework.
technique developed by Li et al [22] has trouble differentiating between senses that are close together, as the connection weights are averaged out by the large number of senses found in an amplified gazetteer. In response, our updated system will implement two new algorithms during the SEM component of geocoding. The first is a more aggressive minimisation technique which is sensitive to locations that are extremely close together. The second is a clustering algorithm that discards senses that fall outside the perceived clustering area of a document. A final update to the system adds a new default sense heuristic, where unique entries are considered to be default values. Full details of the algorithms and their implementation are provided in Chapter 4.
CHAPTER 4
This chapter is in three sections. The first describes the geocoding system, which is based on the models presented in the literature and the updates described in section 3.3. The second covers the aspects of our experimental setup that are not related to system design. The third describes the experiments used to test our system.
4.1 Design
Figure 4.1 shows an overview of the geocoding system. Each block represents a modular component of the overall system. The modules work in turn on a model of the target document, which is passed between them. The geocoder was designed to work with any or all of the modules activated, and new modules can be added at any point. The steps that go into geocoding a document can broadly be described as initialisation, tagging and reporting. We will explain each of these stages in some detail here.
4.1.1 Initialisation
The initialisation portion of the geocoding process is where the location references in a document are converted into a format that can be tagged. In this model each reference has already been identified during the NER phase, so here it is extracted and assigned several properties for use later on. We record the reference name, its archetype (which allows us to connect two references to the same place to one another) and its position in the document. This step is also where the geocoding modules we plan to use are selected and where we assign the gazetteer to use for extraction.
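As a sketch, the per-reference record produced by initialisation might look like the following. The field names are our own illustrative choices, not identifiers taken from the system's source.

```python
from dataclasses import dataclass, field

@dataclass
class GeoReference:
    """One location mention, as extracted during the NER phase."""
    name: str            # surface form, e.g. "Perth"
    archetype: str       # canonical key linking repeat mentions of one place
    position: int        # offset of the mention within the document
    senses: list = field(default_factory=list)  # candidate gazetteer entries
    confidence: float = 0.0                     # set later by the tagging modules

ref = GeoReference(name="Perth", archetype="perth", position=42)
```

The archetype field is what lets the later tagging phase propagate a high-confidence sense to every other mention of the same place.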
4.1.2 Tagging
This component is made up of the default sense module, the context engine and the minimisation module.

Default Senses

Default senses are assigned hierarchically. Each place name retrieves a list of candidate senses. The default sense is searched for in the following order:

1. Continent, confidence 0.75
2. Country, confidence 0.75
3. Region/State, confidence 0.75
4. Max population (population >= 20,000), confidence 0.7
5. Max population (0 < population < 20,000), confidence 0.5

From the results of the ontological framework testing, there is reason to presume that as a gazetteer becomes more comprehensive, any entries which are unique in the gazetteer become more likely to be unique in reality. In response we have updated the default sense hierarchy with a new final level:

6. Unique sense, confidence 0.5

If a hit is made at a higher level, subsequent levels are not evaluated. The confidence ratings are chosen simply to assign a rank to the different methods of assignment. While it is possible to produce a graduated scale between items 4 and 5, for instance, none of the other techniques in use are able to provide a rating with sufficient granularity to reason between them. The thresholds provided are from the literature [11, 22], but could be replaced with any ordinal scale.

The Context Engine

The context engine is quite primitive when compared to the full-featured implementations available elsewhere [22, 27]. Rather than using a set of language rules to find pattern hits, we instead use the distance between georeferences as the sole trigger for further investigation.
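The default-sense hierarchy reduces to a tiered search. The sketch below is an illustrative reading of the six levels, assuming a dict representation of gazetteer senses; the `feature` labels and the tie-break by population within a tier are our own assumptions, not specified by the system.

```python
def default_sense(senses):
    """Walk the default-sense hierarchy, returning (sense, confidence).

    `senses` is a list of dicts with an assumed 'feature' label and an
    optional 'population' count.
    """
    tiers = [
        (lambda s: s["feature"] == "continent", 0.75),
        (lambda s: s["feature"] == "country", 0.75),
        (lambda s: s["feature"] in ("region", "state"), 0.75),
        (lambda s: s.get("population", 0) >= 20000, 0.7),
        (lambda s: 0 < s.get("population", 0) < 20000, 0.5),
    ]
    for test, conf in tiers:
        hits = [s for s in senses if test(s)]
        if hits:
            # A hit at this level stops the search; lower tiers are skipped.
            return max(hits, key=lambda s: s.get("population", 0)), conf
    if len(senses) == 1:
        # Tier 6: a sense that is unique in the gazetteer is its own default.
        return senses[0], 0.5
    return None, 0.0
```

For example, given one small-city sense and one country sense, the country wins at tier 2 with confidence 0.75; a name whose only sense has no population still grounds via the new unique-sense tier.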
If two references are within two words of each other, we test for the hierarchical member property. If a hit is scored, then the two references are given their assignments with confidence 0.9. After the contextual step is completed, the system propagates the sense with the highest confidence from the first two phases to the other entries with the same archetype.

Minimising Spatial Extent

There are two selectable SEM modules. The core system uses the network minimisation technique for SEM explained in section 2.2.2. The other module is a new spatial minimisation technique. The initialisation for this routine is the same as for the core algorithm, with connections given weight according to:

Weight_xy = sqrt((Lat_x - Lat_y)^2 + (Long_x - Long_y)^2) / Sim_xy
The new algorithm runs in a manner similar to Prim's algorithm [13]:

Initially: order the connections ascending by weight and set the first node in the first connection as complete.
1: Order the connections by weight, ascending.
2: Choose the minimal-weight connection with exactly one completed node.
3: Mark the connection's other node as complete.
4: Remove every other node with the same archetype as the node that was just completed.
5: Remove every connection belonging to the nodes that were removed.
6: Repeat from 1 until all nodes are complete.

The main difference is that we are removing nodes as well as connections. This approach gives great importance to the first few connections that are chosen, which correspond to the nodes that are closest to one another. Compare this to the algorithm from the literature, which doesn't remove nodes and which considers the effect of all location senses on a given node. The next step in the algorithm is to determine if the nodes are clustered. The nodes are ordered at a state/province level. If the most common state has
more than three entries, it is determined that those entries form a cluster. The standard deviation from the centroid of the cluster is calculated and stored. The confidence assigned to the nodes returned from the minimisation operation is decided by each node's distance from the centroid. If the standard deviation is greater than 8, all confidences are set to 0.6. If the s.d. is less than 8 and the node is less than two standard deviations from the centroid, it is assigned a confidence of 0.8. Otherwise it is assigned a confidence of 0. This is to allow locations with no default sense to remain unassigned. The presumption is that if they do not fall within the cluster, and they do not have a default sense assignment (a default sense would suggest a well-known place), then they are a false sense.
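The Prim-like minimisation and the clustering step can be sketched as follows. This is an illustrative reading of the algorithm above, assuming plain Euclidean lat/long distance for the connection weights and a dict node representation of our own devising; the behaviour when no cluster is found is not specified in the text, so all confidences fall back to 0.6 in that case.

```python
import math
from collections import Counter

def minimise(nodes):
    """Greedy, Prim-like spatial minimisation: settle the closest pair first,
    then repeatedly complete the nearest remaining node, discarding rival
    senses that share an archetype with a completed node."""
    def weight(a, b):
        return math.hypot(a["lat"] - b["lat"], a["lon"] - b["lon"])

    conns = sorted(
        (weight(a, b), i, j)
        for i, a in enumerate(nodes) for j, b in enumerate(nodes)
        if i < j and a["archetype"] != b["archetype"])
    if not conns:
        return list(nodes)
    removed, done = set(), set()

    def complete(i):
        done.add(i)
        for j, m in enumerate(nodes):  # drop rival senses of the same place
            if j != i and m["archetype"] == nodes[i]["archetype"]:
                removed.add(j)

    complete(conns[0][1])  # first node of the lowest-weight connection
    while True:
        nxt = next((j if i in done else i
                    for _, i, j in conns
                    if i not in removed and j not in removed
                    and (i in done) != (j in done)), None)
        if nxt is None:
            break
        complete(nxt)
    return [nodes[i] for i in sorted(done)]

def cluster_confidences(completed, sd_cutoff=8.0):
    """Rate each completed node by its distance from the cluster centroid."""
    state, count = Counter(n["state"] for n in completed).most_common(1)[0]
    if count <= 3:
        return [0.6] * len(completed)  # assumed fallback: no cluster found
    members = [n for n in completed if n["state"] == state]
    cx = sum(n["lat"] for n in members) / len(members)
    cy = sum(n["lon"] for n in members) / len(members)
    mdists = [math.hypot(n["lat"] - cx, n["lon"] - cy) for n in members]
    sd = math.sqrt(sum(d * d for d in mdists) / len(mdists))
    if sd > sd_cutoff:
        return [0.6] * len(completed)
    dists = [math.hypot(n["lat"] - cx, n["lon"] - cy) for n in completed]
    return [0.8 if d < 2 * sd else 0.0 for d in dists]
```

With two candidate Perths and one Fremantle, the Perth-Fremantle pair in Western Australia is the closest connection, so the Scottish Perth is removed before it can ever be completed.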
and Global News categories represent text with a global audience, which we expect to be easy to tag. The Government, Open Directory Project (ODP) and Local News categories are expected to have mixed properties. Texts in these categories are expected to have a relatively high incidence of disambiguating cues, and a moderate level of unknown entities (locations not known to the gazetteer). Local Tourism texts and Geological reports are expected to have the greatest incidence of unknown entries, and moderate to low incidence of cues for disambiguation.

  Text Type      Number  Notes
  Global         9       Documents taken from the front page of the content aggregation site Digg [1]. The site has a global, mostly western userbase.
  Global News    4       Articles chosen from the front pages of the BBC [7] and New York Times [8] websites.
  Government     2       Pages selected randomly from the Australian .gov domain.
  ODP            5       Pages taken randomly from the Open Directory Project under the categories Top: Regional: Oceania: Australia: Western Australia: Regions and Top: Regional: Oceania: Australia: Victoria: Regions.
  Local News     5       Articles taken from the front pages of regional newspapers' websites.
  Local Tourism  3       Pages taken randomly from regional tourism pages focusing on Western Australia.
  Geology        4       Geological reports taken from companies listed in the resources sector by the Australian Stock Exchange.
4.2.3 Exclusions
We have chosen not to include texts that might be geo-complex because of language or a historical nature. These domains present their own challenges and are in some ways incompatible with the improvements we hope to generate. Foreign language texts may not have the same divisions between geo-complex and regular
text as there are in English. Furthermore, place names are not constant between languages. Historical texts require specialised gazetteers if they are to be effectively tagged. Place names are constantly changing due to political division, evolving language and cultural renaming policies [18]. These additional entries are likely to confound performance on contemporary text. Thus, these factors will be left to other studies.
Accuracy is defined as the ratio of the number of correctly assigned georeferences to the total number of references within the scope of the gazetteer. It has been chosen in order to provide a metric that can be compared over different gazetteers and which gives an indication of a configuration's ability to correctly assign locations within its own scope. A high accuracy is most useful when good results are desired for a specific list of names (e.g. hospitals or airports).

Thoroughness

Thoroughness = (N_correct + N_discarded) / N_document

The thoroughness metric is designed to capture the overall reliability of the geocoder. It is defined as the percentage of entries that are either correctly identified or correctly assigned as outside the scope of the gazetteer. A high thoroughness score is good in situations where false positives are undesirable, such as when assigning an overall page focus.

Coverage

Coverage = N_correct / N_document

The coverage metric gives an indication of the overall tagging capability of a configuration. It is dependent on gazetteer size, as it shows the correctly tagged locations as a percentage of the total number of locations in the document. A high coverage is useful when the volume of locations that are successfully tagged is important (e.g. data mining).

Performance

The performance metric gives an idea of overall system quality. As we use it, it is the simple average of the three other metrics. However, depending on the application, unequal weightings could be used.
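Under these definitions the four metrics reduce to a few lines of arithmetic. The sketch below uses our own illustrative names for the raw counts; they are not identifiers from the system.

```python
def metrics(n_correct, n_discarded, n_in_scope, n_total):
    """Compute the four evaluation metrics from raw tagging counts.

    n_correct:   references assigned their correct location.
    n_discarded: references correctly rejected as outside gazetteer scope.
    n_in_scope:  references whose true location exists in the gazetteer.
    n_total:     all location references in the document.
    """
    accuracy = n_correct / n_in_scope
    thoroughness = (n_correct + n_discarded) / n_total
    coverage = n_correct / n_total
    # Performance is the unweighted mean; applications may reweight it.
    performance = (accuracy + thoroughness + coverage) / 3
    return {"accuracy": accuracy, "thoroughness": thoroughness,
            "coverage": coverage, "performance": performance}
```

For instance, a document with 12 references, 10 of them in gazetteer scope, 8 tagged correctly and 1 correctly discarded yields an accuracy of 0.8, a thoroughness of 0.75 and a coverage of two thirds.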
4.3 Experiments
4.3.1 Testing the Ontological Framework
Section 3.1 described the three aims for our experimentation on the ontological framework:

Aim 1: To determine if it is possible to improve geocoding performance on geo-complex texts.
Aim 2: To identify properties of geo-complex text that can be used to improve geocoding performance.
Aim 3: To determine textual properties that can be used to distinguish between texts of differing geo-complexity.

Each parameter in the framework will be tested for its occurrence over the entire corpus. The results will be organised by text category, and presented according to the aim for which they are most relevant. Unless otherwise stated, results are determined against the amplified gazetteer.
Experiment 3

Aim: To compare the effect of gazetteer size across all metrics.

Method: All metrics will be calculated for the core system and the updated system in both standard and amplified gazetteer configurations.
CHAPTER 5
Results and Discussion
Table 5.1: Percentage of place names which aren't recognised by the Standard and Extended Gazetteers
It is interesting to note that even in categories like global news we find locations such as The White House or Wall Street, which are not geographically significant enough to be added to a gazetteer, yet are culturally significant enough to be recognised on a global level. The differing interpretation of what constitutes a place name explains in part why these findings seem to contradict results in the literature which claim accuracy in the high 90s. Additionally, many extraction engines are gazetteer-based, meaning that locations which were not in the gazetteer were not included in any calculations. Figure 5.1 shows the relative incidence of default entries between the different text types. We can see that the types that we suspected of being most geo-complex tend to reference places which are not politically important (not capitals, regions or countries) and are not population centres. The trend is most obvious in tourist and local news articles. While in each case roughly 70% of entries are available in the gazetteer, respectively only 24% and 33% are the default entries.
Figure 5.1: Locations with the Default Sense vs Total Taggable Locations

There was only one completely unqualified spatial deixis. However, in three of the Local News texts, three of the geological texts and one of the ODP texts, spatial deixis was used in place of a literal reference without that reference being mentioned elsewhere. This decreases the number of georeferences that can be used to provide a disambiguating context for the document.
A final barrier to performance is the increased number of potential senses to choose between when using an amplified gazetteer. Table 5.2 shows the average number of senses per place name for each gazetteer. On average, each place name has nine times as many potential locations when queried against the amplified gazetteer.

Name            Standard SPP   Amplified SPP   Increase%
Local Tourism   1.9            7.3             380%
Geology         1.4            11.1            760%
Government      2.5            21.5            860%
ODP             1.5            13.7            920%
Local News      2.8            27.9            970%
Global News     1.7            18.7            1080%
Global          2.5            34.0            1360%
ALL DOCUMENTS   1.9            17.3            900%
Table 5.2: The average number of senses per place name in the Standard and Amplified Gazetteers

These results are surprising in that they indicate a level of ambiguity far greater than could be predicted from the properties of the gazetteers. 75% of names in the standard gazetteer are unique, as are 59% of those in the amplified. This leads to an expected ambiguity of about 1.18 and 1.45 senses per name respectively. We hypothesise that this disparity is a result of the English colonial origin of many Australian, African and American place names. Figure 2.1 provides some support for this conjecture.

Similarities

Many of the properties in our experimental ontology had no discernible bias toward either complex or regular text. They included the hierarchical contextual pattern (a place described by its membership of a region), the incidence of abbreviation, and the incidence of references at either a country or regional level. Table 5.3 shows the percentage of documents in each category that contain each such property. The evidence here suggests that contextual patterns and container-level references are simply an effective method of providing a spatial context to the reader. Abbreviations are used either out of informality or as shorthand for a location that has already been mentioned; however, there is no appreciable pattern to this usage.
Name            Hierarchical   Abbrev   Country   Region
Local Tourism   100%           66%      33%       100%
Local News      20%            20%      0%        40%
Geology         100%           75%      25%       75%
Government      0%             0%       100%      0%
ODP             20%            0%       60%       40%
Global          22%            22%      22%       33%
Global News     50%            75%      75%       75%

Table 5.3: Incidence of properties with no discernible bias toward easy text

An exception to this is found in non-spatial files. Seven of the documents (six in the Global category, one in the ODP category) were either articles dealing with science, technology and the arts, or information on a particular product. These documents had fewer georeferences (an average of two each) and generally did not provide contextual clues.

Table 5.4 shows a property related to the default sense heuristic. It displays the percentage of place name references that have a default sense, but which are not the default sense. The results have been normalised against the total number of possible default assignments, and show that roughly 10% of default assignments are incorrect, highlighting the usefulness of complementary heuristics.

Name            Non Default
Global News     0.0%
Local News      7.4%
ODP             11.1%
Geology         11.7%
Global          12.6%
Government      16.6%
Local Tourism   18.5%
ALL             11.4%

Table 5.4: Percentage of entries for which the default sense assignment would be incorrect

Finally, we found that the Single Sense per Discourse rule holds equally true for all texts: 98.3% of references were stable in this regard. The only exceptions were texts which contained references to both a place and a river known by the same name, such as Ord. In these cases the authors were careful to make the distinction apparent.
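The Single Sense per Discourse rule can be enforced mechanically after per-reference disambiguation. The sketch below is our own illustration (the triple layout and field names are assumptions): each distinct name is re-tagged with whichever of its chosen senses received the highest confidence anywhere in the document.

```python
def apply_single_sense_per_discourse(tagged):
    """Force each distinct name to a single sense across the document.

    tagged: list of (name, sense, confidence) triples produced by
    per-reference disambiguation. Every occurrence of a name is
    re-tagged with that name's highest-confidence sense.
    """
    best = {}
    for name, sense, conf in tagged:
        if name not in best or conf > best[name][1]:
            best[name] = (sense, conf)
    return [(name, *best[name]) for name, _, _ in tagged]

tags = [("Perth", "Perth, WA", 0.9),
        ("Perth", "Perth, Scotland", 0.4),
        ("Perth", "Perth, WA", 0.7)]
resolved = apply_single_sense_per_discourse(tags)
# every Perth reference now carries the higher-confidence WA sense
```

A production system would relax the rule for the place/river exceptions noted above, since those are the 1.7% of references where one sense per discourse fails.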
producing a minimum spanning tree (Section 2.2.2). The weight of each sense is calculated as the total weight of all the connections it is a part of. The sheer number of connections possible when using an amplified gazetteer will have an averaging effect: a cluster with a diameter of 400 kilometres will not appear significantly different to a cluster with a diameter of 40 kilometres, as long as there is a significant number of links to senses on the other side of the world. To counter the averaging effect, we require a more aggressive minimisation routine which is sensitive to locations that are very near one another. Secondly, the clustering effect provides us with a mechanism to automatically check the results we get from a minimisation routine. If a result set is determined to be highly clustered, then it may be possible to improve results by discarding elements that are beyond the cluster region, if they lack any other evidence for being kept. These clustering heuristics work better with a large gazetteer, as a greater number of tagged place names leads to a more precise cluster.
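One way to make the minimisation routine sensitive to very near locations is to score senses with a kernel that decays quickly with distance, so that a neighbour 40 km away contributes far more weight than any number of antipodean links. The following is our own sketch of that idea; the exponential kernel and the 100 km scale are assumptions, not the dissertation's exact weighting.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2)
         * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def score_senses(candidates, scale_km=100.0):
    """Score each sense by a decaying kernel over its distance to the
    senses of every *other* name.

    candidates: dict mapping name -> list of (sense_id, (lat, lon)).
    A link to a town 40 km away counts for nearly 1.0; links to the
    far side of the world contribute almost nothing, countering the
    averaging effect described above.
    """
    scores = {}
    for name, senses in candidates.items():
        for sid, pos in senses:
            scores[sid] = sum(
                math.exp(-haversine_km(pos, opos) / scale_km)
                for other, osenses in candidates.items() if other != name
                for _, opos in osenses)
    return scores
```

A per-name argmax over these scores then selects the locally clustered sense, and senses whose total weight falls far below the cluster's can be discarded as outliers, as the clustering check suggests.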
However, this is not true in all cases. The core-system and standard-gazetteer configurations are differentiated only by gazetteer, as both use the same heuristics. The fact that the core system performs slightly worse indicates that scalable heuristics are more important to performance than gazetteer size alone. Figure 5.3 shows that the best all-round configuration is the updated system running on the amplified gazetteer. This is because it behaves much like a default tagger when operating on text with low geo-complexity, but activates the clustering and minimisation behaviour when it is advantageous to do so.
CHAPTER 6
6.1 Conclusion
In this dissertation we attempted to improve geocoding performance for geo-complex texts without resorting to external sources of context. We focused specifically on text that is hard to tag because of high ambiguity, a low incidence of default senses and a higher incidence of unrecognised place names. We hypothesised that such text could have other, unique properties that could be leveraged to improve the accuracy, thoroughness, coverage and overall performance of a tagger. The ontological framework we constructed provided us with a means to assess the blind spots inherent in current geocoding techniques. From this assessment we developed three new heuristics: a new default sense heuristic, a new SEM heuristic that is more sensitive to neighbouring locations, and a clustering technique for discarding unlikely references. Our updated system confirmed our original hypothesis, showing that geo-complex texts could indeed be tagged with high accuracy when reference clustering was present. Furthermore, in an overall performance measure, the new heuristics performed similarly to or better than all other configurations in six of the seven text categories.
to the default sense heuristic for determining remaining entities. Future work may investigate the use of more cluster points, or may use the derived centroid as a first-pass mechanism for improving the weighting on a second minimisation operation. On a related note, the minimisation and clustering algorithms, which work over the entire text, could be developed to handle longer documents; in long text, locations in two different sections should not be allowed to influence each other unduly. Later studies may attempt to confirm our identified hard-text properties against a larger and more varied sample corpus. This study had a bias toward Australian and North American locations, and English text focusing on other regions may, in fact, have slightly different properties. Finally, much work needs to be done on improving the speed of tagging with an amplified gazetteer. In this research we identified a set of key properties that differentiated hard text from soft. This information could be used to improve speed by using a standard gazetteer initially, switching automatically to the amplified gazetteer if key hard-text indicators are encountered. Such a system could conceivably have the speed and thoroughness of a standard approach, coupled with the recall of an extended implementation.
Bibliography
[1] Digg, August 2008. http://digg.com/.
[2] Geonames.org, Online Gazetteer, May 2008. http://download.geonames.org/export/dump/allCountries.zip.
[3] Google Corporate Information, October 2008. http://www.google.com/corporate/history.html.
[4] Google Local Search, June 2008. http://www.local.google.com/.
[5] MetaCarta GeoSearch News, June 2008. http://geosearch.metacarta.com/.
[6] National Geospatial Intelligence Agency, June 2008. http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20080617.zip.
[7] BBC, August 2008. http://www.bbc.co.uk/.
[8] The New York Times, August 2008. http://www.nytimes.com/.
[9] The World Gazetteer, July 2008. http://world-gazetteer.com/dataen.zip.
[10] United States Geological Survey, August 2008. http://geonames.usgs.gov/docs/stategaz/Populated_places_20080815.zip.
[11] Einat Amitay, Nadav Har'El, Ron Sivan, and Aya Soffer. Web-a-where: geotagging web content. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 273-280, Sheffield, United Kingdom, 2004. ACM.
[12] Paul Clough. Extracting metadata for spatially-aware information retrieval on the internet. In Proceedings of the 2005 Workshop on Geographic Information Retrieval, pages 25-30, Bremen, Germany, 2005. ACM.
[13] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, Cambridge, Massachusetts, second edition, 2001.
[14] Gregory Crane and Alison Jones. The challenge of Virginia Banks: an evaluation of named entity analysis in a 19th-century newspaper collection. In JCDL '06: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 31-40, New York, NY, USA, 2006. ACM.
[15] Silviu Cucerzan and David Yarowsky. Language independent NER using a unified model of internal and contextual evidence. In COLING-02: Proceedings of the 6th Conference on Natural Language Learning, pages 1-4, Morristown, NJ, USA, 2002. Association for Computational Linguistics.
[16] Clodoveu A. Davis Jr. and Frederico T. Fonseca. Assessing the certainty of locations produced by an address geocoding system. Geoinformatica, 11:103-129, 2007.
[17] Junyan Ding, Luis Gravano, and Narayanan Shivakumar. Computing geographical scopes of web resources. In Proceedings of the 26th International Conference on Very Large Data Bases, pages 545-556. Morgan Kaufmann Publishers Inc., 2000.
[18] Committee for Geographical Names in Australasia. Policy guidelines for the recording and use of Aboriginal and Torres Strait Islander place names. Technical report, IGSM, 1992.
[19] William A. Gale, Kenneth W. Church, and David Yarowsky. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, pages 233-237, Harriman, New York, 1992. Association for Computational Linguistics.
[20] Jochen L. Leidner, Gail Sinclair, and Bonnie Webber. Grounding spatial named entities for information extraction and question answering. In Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References - Volume 1, pages 31-38. Association for Computational Linguistics, 2003.
[21] Huifeng Li, Rohini K. Srihari, Cheng Niu, and Wei Li. Location normalization for information extraction. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pages 1-7, Taipei, Taiwan, 2002. Association for Computational Linguistics.
[22] Huifeng Li, Rohini K.
Srihari, Cheng Niu, and Wei Li. InfoXtract location normalization: a hybrid approach to geographic references in information extraction. In Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References - Volume 1, pages 39-44. Association for Computational Linguistics, 2003.
[23] Robert Malouf. Markov models for language-independent named entity recognition. In COLING-02: Proceedings of the 6th Conference on Natural Language Learning, pages 1-4, Morristown, NJ, USA, 2002. Association for Computational Linguistics.
[24] Bruno Martins and Mário J. Silva. Language identification in web pages. In SAC '05: Proceedings of the 2005 ACM Symposium on Applied Computing, pages 764-768, New York, NY, USA, 2005. ACM.
[25] Bruno Martins, Mário J. Silva, and Marcirio Silveira Chaves. Challenges and resources for evaluating geographical IR. In Proceedings of the 2005 Workshop on Geographic Information Retrieval, pages 65-69, Bremen, Germany, 2005. ACM.
[26] Malvina Nissim, Colin Matheson, and James Reid. Recognising geographical entities in Scottish historical documents. In Proceedings of the Workshop on Geographic Information Retrieval at SIGIR 2004. SIGIR, 2004.
[27] Simon E. Overell and Stefan Rüger. Geographic co-occurrence as a tool for GIR. In Proceedings of the 4th ACM Workshop on Geographical Information Retrieval, pages 71-76, Lisbon, Portugal, 2007. ACM.
[28] Alexei Pyalling, Michael Maslov, and Pavel Braslavski. Automatic geotagging of Russian web sites. In Proceedings of the 15th International Conference on World Wide Web, pages 965-966, Edinburgh, Scotland, 2006. ACM.
[29] David A. Smith and Gregory Crane. Disambiguating geographic names in a historical digital library. In Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries, pages 127-136. Springer-Verlag, 2001.
[30] Rohini Srihari, Cheng Niu, and Wei Li. A hybrid approach for named entity and sub-type tagging. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 247-254, Seattle, Washington, 2000. Morgan Kaufmann Publishers Inc.
[31] Chuang Wang, Xing Xie, Lee Wang, Yansheng Lu, and Wei-Ying Ma. Detecting geographic locations from web resources.
In Proceedings of the 2005 Workshop on Geographic Information Retrieval, pages 17-24, Bremen, Germany, 2005. ACM.
[32] Chuang Wang, Xing Xie, Lee Wang, Yansheng Lu, and Wei-Ying Ma. Web resource geographic location classification and detection. In Special Interest
Tracks and Posters of the 14th International Conference on World Wide Web, pages 1138-1139, Chiba, Japan, 2005. ACM.
[33] GuoDong Zhou and Jian Su. Named entity recognition using an HMM-based chunk tagger. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 473-480, Morristown, NJ, USA, 2002. Association for Computational Linguistics.
APPENDIX A
Background
Everyone is now used to the concept of searching for documents using the Google interface, where we normally expect to find the required document in the first twenty items or not at all, even if the number of possible matches is in the tens of thousands. Unfortunately, a text-based search frequently leads to a dead end. Annually, thousands of hours are wasted by people who are either searching for information that isn't there, failing to find information that is, or recreating information that already exists [6]. In 2001, International Data Corporation attempted to quantify these losses. They estimated that a company with 1000 knowledge workers wastes $2.5-$3.5 million a year on irrelevant searches, the unproductive time equating to approximately $15 million in lost revenue. One way of improving the relevance of search results is to filter the results by location. It is estimated, for instance, that 80% of the information dealt with by local governments has some discernible association with a location [1]. Furthermore, Delboni et al. found that even among the general populace, 20% of queries submitted to search engines included a geographic reference [8]. The process of assigning a spatial reference to a document or other piece of data, based on an analysis of its content, is called geocoding, or sometimes geotagging, localising or grounding. The effort spent on geocoding techniques has traditionally focused on determining an exact spatial reference from an address, or a near-exact reference based on building names, postal codes or telephone area codes [1]. However, there are three main reasons why address referencing is not sufficient in itself for document categorisation:

Address resolution is limited to urban areas and places with an organised addressing system.
Locations are often referenced by means other than a formal address.
Addresses do not always indicate the target location of a document [5].

A different approach to the problem is to apply Named Entity Recognition (NER) techniques to textual data in order to derive a spatial reference for the document as a whole, based on the occurrence of locations within the text which are recognised by a well-populated gazetteer. Examples of this method are MetaCarta's Geographic Text Search [4] and the Web-a-Where geotagging system developed by Amitay et al. [5]. However, distilling a spatial focus from a document is far more challenging and less precise than address resolution [2]. The two main barriers to accurate NER geocoding are entity ambiguity and source-target ambiguity. Entity ambiguity occurs when a word that potentially refers to a location also has some other meaning. Ambiguous pairs may both be place names, like Perth, Scotland and Perth, Western Australia, or may have a non-spatial possibility, like Buffalo, N.Y. and Buffalo Bill. Source-target ambiguities occur when there are references to a document's origin within the text body, such as the author's address. The context required to understand and correctly interpret these ambiguities is frequently not found within the text itself, but rather derived from external experience [4]. Multiple heuristics have been proposed as means of determining the correct disambiguation to choose, including:

Choosing locations with a high population over those with a low one [5, 4]
Selecting locations over the whole text in order to minimise total coverage, letting the textual proximity of named entities within the text imply a geographical relationship [2, 5, 7]
Using the immediate context of the entity, e.g. Buffalo, N.Y. as opposed to Mr. James Buffalo

Combining several of these heuristics leads to fairly accurate results; however, there is room for improvement. Amitay et al.
found that when applied to web pages from the .gov domain, their focus algorithm was only able to identify documents at a country level 73% of the time [5]. They found that documents not meant for public consumption typically carried fewer disambiguating cues for locations assumed to be general knowledge. The challenge thus remains to produce a focus-finding algorithm which is able to more effectively deduce a focus for use as an aid to document discovery.
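As a concrete illustration, the population and immediate-context heuristics can be combined into a single scoring function. This sketch is ours, not code from any of the cited systems; the candidate fields (`population`, `admin`) and the weights are assumptions.

```python
import math

def disambiguate(name, candidates, context_tokens):
    """Choose a sense for `name` by combining two heuristics.

    Prefer high-population places (log-scaled), but let an explicit
    qualifier in the surrounding text, as in "Perth, Scotland",
    override population with a large bonus.
    """
    def score(c):
        s = math.log10(c["population"] + 1)   # population heuristic
        if c["admin"] in context_tokens:      # immediate-context heuristic
            s += 10.0                         # strong; dominates population
        return s
    return max(candidates, key=score)

perths = [
    {"id": "Perth, Western Australia", "population": 1_600_000,
     "admin": "Australia"},
    {"id": "Perth, Scotland", "population": 45_000, "admin": "Scotland"},
]
# With no disambiguating context, population wins; with "Scotland"
# nearby in the text, the immediate-context cue overrides it.
by_population = disambiguate("Perth", perths, [])
with_context = disambiguate("Perth", perths, ["Scotland"])
```

The fixed bonus of 10 encodes the intuition that an explicit qualifier is near-certain evidence; a real system would tune such weights empirically.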
Aims
Project Aims
The aim of this project is to implement a working focus-finding algorithm, to investigate various disambiguation heuristics [5, 4, 7] and to evaluate their overall effect on geocoding accuracy. The implementation will act on text data to produce a probable spatial reference for the document. As each named entity cannot be guaranteed to be correctly resolved, we will take a fuzzy approach to the derivation of the focus from the named entity list. The references will be indexed so that a later query, given an initial spatial window, will produce a list of documents with an associated relevance index. The geographic relevance index will be combined with a standard text relevance index in order to produce a final search list. By combining a spatial relevance index with a Google-style relevance index, we expect to increase the probability of discovering relevant documents without having to go beyond the first 20 documents [4, 6].
Method
The method can be summed up in six steps:

1. The document to be geocoded is parsed and indexed. It is also passed to a text indexer for normal processing.
2. The document index is checked against a pre-compiled gazetteer of place names in order to identify named entities which may refer to geographic phenomena.
3. Disambiguation heuristics are applied to decide on a single context for each entity, giving each a confidence rating.
4. Fuzzy logic is applied to determine one or more focuses for the document.
5. The resulting spatial relevance index is stored.
6. Searches on the document base provide both textual and spatial criteria. The resulting document list is a combination of both the spatial and textual relevance.
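Steps 1 to 5 can be sketched as a single function. Everything below is a placeholder skeleton of ours: the disambiguator and focus derivation are passed in as plain functions, and real parsing, indexing and storage are elided.

```python
def geocode_document(text, gazetteer, disambiguate, derive_focus):
    """Skeleton of the pipeline above.

    Returns the record that would be stored in the spatial relevance
    index (step 5); step 6 later combines it with an ordinary text
    relevance index at query time.
    """
    tokens = text.replace(",", " ").split()              # 1. parse the document
    mentions = [t for t in tokens if t in gazetteer]     # 2. gazetteer lookup
    senses = [disambiguate(m, gazetteer[m], tokens)      # 3. disambiguation
              for m in mentions]                         #    heuristics
    return {"focus": derive_focus(senses),               # 4. fuzzy focus
            "senses": senses}                            # 5. stored record

# Toy components: first-sense disambiguation and a centroid focus.
gaz = {"Perth": [(-31.95, 115.86)], "Fremantle": [(-32.05, 115.74)]}
first_sense = lambda name, senses, ctx: senses[0]
centroid = lambda pts: (sum(p[0] for p in pts) / len(pts),
                        sum(p[1] for p in pts) / len(pts))
record = geocode_document("Ferries run from Perth to Fremantle", gaz,
                          first_sense, centroid)
```

Swapping `first_sense` for a heuristic scorer and `centroid` for a fuzzy, multi-focus derivation turns this skeleton into the system the method describes.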
Design Criteria:
We aim to produce an implementation that is highly modular, allowing for future developments such as a map-display interface for geographic filtering, and which is not constrained by any particular gazetteer or geocoding algorithm.
Bibliography
[1] Clodoveu A. Davis Jr. and Frederico T. Fonseca, 2007. Assessing the certainty of locations produced by an address geocoding system. Geoinformatica, Volume 11, pages 103-109.
[2] F. Bilhaut, T. Charnois, P. Enjalbert, and Y. Mathet, 2003. Geographic reference analysis for geographic document querying. In Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References - Volume 1, Association for Computational Linguistics, Morristown, NJ, pages 55-62.
[3] D. Wu, G. Ngai, M. Carpuat, J. Larsen, and Y. Yang, 2002. Boosting for named entity recognition. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, Association for Computational Linguistics, Morristown, NJ, pages 1-4.
[4] E. Rauch, M. Bukatin, and K. Baker, 2003. A confidence-based framework for disambiguating geographic terms. In Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References - Volume 1, Association for Computational Linguistics, Morristown, NJ.
[5] E. Amitay, N. Har'El, R. Sivan, and A. Soffer, 2004. Web-a-where: geotagging web content. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '04, ACM, New York, NY, pages 273-280.
[6] S. Feldman and C. Sherman, 2001. The High Cost of Not Finding Information. IDC, Framingham, MA.
[7] J. L. Leidner, G. Sinclair, and B. Webber, 2003. Grounding spatial named entities for information extraction and question answering. In Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References - Volume 1, Association for Computational Linguistics, Morristown, NJ.
[8] T. Delboni, K. Borges, A. Laender, and C. Davis, 2007. Semantic expansion of geographic web queries based on natural language positioning expressions. Transactions in GIS, Volume 11, Blackwell Publishing Ltd, pages 377-397.
APPENDIX B
Property: Description

Default Population: The number of references that are correctly defaulted based on the population heuristic
Default Unique: The number of references that are correctly defaulted because the entry is unique
Default Capital: The number of references that are correctly defaulted based on the Country, Region, Capital heuristic
Non Default: The number of entries for which a default exists, but would be incorrect
Non Standard: The number of entries that aren't found in a gazetteer where every distinct name has a default entry
Non Expanded Gazetteer: The number of entries that aren't found in an amplified gazetteer
Country: Does the text contain a reference to the country that it is about?
Region: Does the text contain a reference to the state or province that it is about?
Hierarchical: Does the text define locations by their hierarchical containers?
Nearby: Does the text define locations by other, nearby locations?
Abbrev: Does the text use abbreviations or slang words as location names?
Unqual Deixis: Are there instances of unqualified spatial deixis?
Feature: The number of references to features like rivers or cliffs
Total: The total number of references
Unique: The number of distinct place names
Options: The number of gazetteer entries that match one of the place names
No. words: The number of words in the document
Granularity: The level of granularity sufficient to cover more than 80% of references (None = 0, Global = 1, Country = 2, State = 3, Local = 4)