Legal Indexing Aid System
4, September 2012
ISSN-1110-2586
Abstract
Our study focuses on the analysis of the keywords assigned to the titles of texts published in the Lebanese official journal, which contains legislative and regulatory texts. The study also covers the related legal lexicon generated through the manual information processing operation carried out on the texts' titles over more than two decades. The object of the study is to describe a legal indexing aid system developed to meet the need for homogenization of the legal vocabulary, in order to achieve consistency and to enhance access to and retrieval of legal information. Our experiment shows that the assignment of most of the keywords meant to represent the content of legal documents, done through the manual analysis and processing of their titles, may be automated. The keywords may be extracted by a system built out of pre-designed patterns and algorithms, based on the frozen structure of the titles, which are analyzed and grouped according to their objectives. We use the local grammar approach, as finite state automata, to represent each group. The system aims to automatically find and extract the keywords assigned by the indexers and to suggest or generate further potential keywords based on a set of features calculated for each node of a title.

Keywords: Artificial intelligence, legal indexing decision support system, local grammar, linguistic information retrieval, text parsing, Arabic natural language processing, legal informatics.
laymen. Nevertheless, given the huge number and variety of the published texts, searching for specific legislative or regulatory material is a difficult and time-consuming task. Hence, indexing the gazette in order to enhance information search and retrieval was the first step toward establishing the Legal Data Bank at the Center of Studies and Research in Legal Informatics at the Lebanese University [1]. Nevertheless, many technical hindrances emerged from several realities, among which those due to the fact that these texts are written in Arabic, while adequate applications, software, and linguistic tools were not available. Other hindrances were due to the nature of the legal language, which requires a specific knowledge background related not only to knowledge representation, but also to the exact meaning of legal concepts and their relevant context. Accordingly, it was decided to rely on human indexing with a controlled vocabulary methodology.
3. Preparatory activities
The operation relies on human indexation. It is based on respect for the special structure of the document, as well as on its relationship to three categories of clusters: administrative, legal, and topical. These categories were decided upon according to the relations expressed by the text and clearly exposed by its title's core subject, the administration concerned by its implementation, and the legal domain it is connected to. The processing operation mainly consists of describing the title's content, which indicates the topic of the text, by assigning keywords. For texts published without a title, the indexer assigns one. To do this, he goes back either to the body of the document itself or to the titles of the texts it refers to. This practice is due to the dynamic linkage that often exists between legal texts, making reference a determinant element not only in describing the content, but also in building the contextual environment and deciding upon the accuracy of implicit concepts. It reveals the importance of the title and the text network in specifying the semantic environment. These links refer to modified, implemented, detailed, or repealed laws or regulations. Assigned keywords reflect explicit concepts contained in the title and the document, but also implicit ones. Given the fact that a computer cannot understand natural language and the
implicit meaning it bears, these latter considerations carried weight in choosing the human indexing methodology. This indexing effort has been ongoing for almost three decades now and has generated a list of keywords that was used to develop linguistic tools, among which an extensive lexicon specifically designed to ensure the consistency of the indexing vocabulary. Nevertheless, and despite the satisfying results it yields, human indexing is still expensive and is still biased by subjectivity, since it heavily relies on personal understanding and interpretation of the analyzed content, which in its turn depends on the expertise, the scientific acquirements, and the personal background of the indexer [11]. Moreover, a study done at the center revealed not only that keyword consistency is questionable (since it differs from one indexing work session to another and from one indexer to another), but also that the operation proved to be neither cost-worthy nor efficient in attaining the methodology's stated objectives, such as:
- Understanding the individuality of the text.
- Expressing the meaning of this distinctiveness accurately and consistently through specific keywords.
- Controlling the indexing vocabulary.
- Achieving search and retrieval pertinence worthy of the financial cost invested in the work.
3.1 Evaluation process

Given the fact that human indexing was adopted mainly to achieve what automatic indexing cannot, namely understanding and describing the implicit content of the text, it was natural to evaluate the output of the methodology by conducting a study intended to analyze the nature of the assigned keywords (apparent or implicit) and their role in describing the content, in light of the fact that the gazette's vocabulary is a strict language. The number of keywords studied was 5593, used to describe 1098 legislative and regulatory texts. The study's results showed the following:
- 79% of the keywords used to describe the apparent concepts were descriptors literally taken either from the title or from the document's body itself. Moreover, 93% of the apparent keywords were taken as is, without any grammatical or syntactic change, 2.9% were derived forms, and 4.1% were synonyms.
- The keywords meant to describe the implicit concepts were used in only 59.4% of the texts; they represented only 17.3% of the 5593 keywords, while the rest, 2.8% of the keywords, were false descriptors. A closer look at these keywords showed that only 51.6% were precise, while 4.6% were broader terms, 40.9% were contextual, and 0.7% were inaccurate. Besides, 45% of these keywords represent either administrative divisions or proper nouns.
3.2 Assessment process

In this context, we began to believe that automation of the keyword assignment operation is a must, especially after adding to the above results some well-known elements such as:
- The strict nature of the legislative and regulatory language, which makes it always use the same terms to describe the same concepts.
- The particular linguistic nature of the titles, which start with a number of particular specific verbs.
- The dangers of human indexation's inconsistency, as well as its lack of pertinence even when carried out by persons with thorough knowledge of the domain.
- The importance of the text networks, which help determine and specify the implicit concepts.
4. Related work
In general, keyword extraction methods use statistical, linguistic, or hybrid approaches. They explicitly use the information contained in a document, such as word frequency and word position [22], or the TF x IDF weight (term frequency times inverted document frequency) combined with the distance between words or the POS of a word [14][26][6]. Recently, with the expansion of the Semantic Web, extraction approaches have begun to use a semantic level, as in [10][7]. To build a help system for selecting keywords, we use a particular linguistic tool, the local grammar [13], which more adequately represents frozen texts where only limited variation in form is possible. Local grammars have been used to extract information: many systems have been developed to extract proper names in different languages [17][4], or to extract dates, times, and measures [5]. A local grammar is represented by an FSA and creates a syntactic chain between the words of the title. In this paper, we use local grammars to represent the titles of the texts of the official journal in order to select keywords that semantically represent the title.
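For reference, the TF x IDF weight mentioned above is commonly written as follows; this is the standard formulation from the literature [22], recalled here for clarity rather than taken from the system itself:

$$ w_{t,d} = \mathrm{tf}_{t,d} \times \log \frac{N}{\mathrm{df}_t} $$

where tf_{t,d} is the frequency of term t in document d, N is the total number of documents in the collection, and df_t is the number of documents containing t.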
As a matter of fact, the titles of the official journal texts are the entity that describes the main subject of any given text; hence, the specimen we chose to work with comprises about 32,000 titles published during a decade, from the year 2000 to 2010. It represents 52% of all the texts published in the official journal.
The study of each scheme suggests that titles are sufficiently frozen to be described by local grammars [13], which allows the construction of a finite state automaton (FSA) representing the local grammar of each group type.
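To illustrate the idea (the states and transitions below are a hypothetical simplification, not one of the system's actual automata), a local grammar of this kind can be encoded as a small FSA whose transitions are matched against the successive chunks of a title:

```python
# Minimal sketch of a local grammar as a finite state automaton.
# The states, the transition labels, and the assumption that the
# title has already been chunked into phrases are all illustrative.

FSA = {
    ("START", "conclude"): "VERB",
    ("VERB", "cooperation agreement"): "OBJECT",
    ("OBJECT", "in the field of"): "PREP",
    ("PREP", "<any>"): "SUBJECT",     # free-text objectives of the agreement
    ("SUBJECT", "<any>"): "SUBJECT",  # absorb the rest of the sentence
}
ACCEPT = {"SUBJECT"}

def recognize(chunks):
    """Run the chunks of a title through the FSA; return the final
    state if the title is recognized, otherwise None."""
    state = "START"
    for chunk in chunks:
        state = FSA.get((state, chunk)) or FSA.get((state, "<any>"))
        if state is None:
            return None
    return state if state in ACCEPT else None

# e.g. recognize(["conclude", "cooperation agreement",
#                 "in the field of", "youth and sports"]) -> "SUBJECT"
```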
[Figure 1: The architecture of the extraction system. The diagram shows the Analyzer (filter the title, stem the terms of the title, apply the local grammar, apply the rules), drawing on the Resources (local grammar, rules, list of synonyms, list of morphological variations), and producing the extracted keywords.]

The system is organized according to a three-module structure: the analyzer, the resources, and the database system. This latter is represented by a threefold operation: keyword extraction, verification and approval, and insertion into the database. Moreover, the system uses a set of resources, namely the local grammar, the rules, the list of synonyms, and the list of morphological variations. The title is the entry point of the system.
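A minimal sketch of this flow is given below, assuming hypothetical resource contents and helper names (the normalization lists, the grammar and rule interfaces are illustrative, not the system's actual ones):

```python
# Hypothetical sketch of the extraction pipeline described above.

SYNONYMS = {"accord": "agreement"}       # map synonyms to one descriptor
VARIANTS = {"agreements": "agreement"}   # reduce morphological variation

def normalize(tokens):
    """Normalize each term using the morphological variations list
    and the synonyms list, as the analyzer does before parsing."""
    out = []
    for tok in tokens:
        tok = VARIANTS.get(tok, tok)
        tok = SYNONYMS.get(tok, tok)
        out.append(tok)
    return out

def extract_keywords(title, grammars, rules):
    """grammars: first verb -> parsing function returning a slot dict;
    rules: function mapping a slot dict to a list of keywords."""
    tokens = normalize(title.lower().split())
    if not tokens:
        return []
    parse_fn = grammars.get(tokens[0])   # titles start with a specific verb
    if parse_fn is None:
        return []                        # title not covered by any grammar
    parse = parse_fn(tokens)             # fill the Object / Subject slots
    return rules(parse) if parse else []  # keywords then go to verification,
                                          # approval, and database insertion
```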
The analyzer is the main component of the system. Its essential feature is that it is syntax-directed: it starts with a filter that selects the first term of the title, then the terms of the title are normalized using the synonyms list and the morphological variations list. After that, the analyzer applies the FSA corresponding to the first word of the title. A set of rules is used to define the semantic relations between the terms in order to facilitate the choice of keywords. The rules are generally expressed as IF-THEN rules. An example of such a rule is the following: IF the [Object] contains "cooperation agreement" and the [Subject] contains "youth and sports" THEN the keywords are "Sports cooperation" and "Youth cooperation".
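The example rule above can be encoded, for instance, as a condition-action pair over the filled slots; the data structures below are an illustrative assumption, not the system's actual rule format:

```python
# One IF-THEN rule from the paper, written as a condition/action pair.

RULES = [
    {
        "if": {"object": "cooperation agreement",
               "subject": "youth and sports"},
        "then": ["Sports cooperation", "Youth cooperation"],
    },
]

def apply_rules(parse, rules=RULES):
    """parse maps slot names ('object', 'subject', ...) to the text
    recognized by the corresponding sub-graph of the FSA."""
    keywords = []
    for rule in rules:
        if all(value in parse.get(slot, "")
               for slot, value in rule["if"].items()):
            keywords.extend(rule["then"])
    return keywords

# e.g. apply_rules({"object": "cooperation agreement",
#                   "subject": "the field of youth and sports"})
# -> ["Sports cooperation", "Youth cooperation"]
```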
Figure (2): General FSA for the verb "conclude"

The grayed states Object and Subject are the names of sub-graphs. The sub-graph Object, which recognizes the object of the verb "conclude", is composed of two lists of single words in order to facilitate the composition of the keywords, as shown in figure (3). The sub-graph Subject is composed of two boxes: the first recognizes prepositions such as "about", "concerning", or "in the field of"; the second is the sentence that contains the objectives of the agreement, as shown in figure (4). The parties are not processed by our system; handling them requires further study to build a specialized ontology able to represent the variety, the particularity, and the structures of the different state departments and administrations usually involved in the signing of conventions and contracts between the Lebanese government and foreign countries, international and regional organizations, NGOs, and governments. The parties are therefore added manually. Table 2 shows a sample of titles recognized by the system.
Table 2: Sample of Official Lebanese Journal Titles

1. Conclude a cooperation agreement in the field of youth and sports between the Government of the Republic of Lebanon and the Government of the Hashemite Kingdom of Jordan
2. The conclusion of an executive program in the field of tourism cooperation between the Government of the Republic of Lebanon and the Government of the Arab Republic of Egypt
After analyzing the title, the system adds keywords as shown in Table 3:

Table 3: Keywords extracted by the system

Verb: conclude
Object: cooperation agreement
Subject: the field of youth and sports
Parties: the Government of the Republic of Lebanon and the Government of the Hashemite Kingdom of Jordan
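As a rough approximation only, the first title of Table 2 can be decomposed into the slots of Table 3 with a regular expression standing in for the full FSA; the pattern and group names below are illustrative, since the real system uses sub-graphs rather than regexes:

```python
import re

# Rough regex approximation of the "conclude" local grammar: the verb,
# then the Object (e.g. "cooperation agreement"), a preposition, the
# Subject sentence, and the Parties introduced by "between".
CONCLUDE = re.compile(
    r"^(?:the conclusion of|conclude)\s+"
    r"(?:an?\s+)?(?P<object>.+?)\s+"
    r"(?:about|concerning|in the field of)\s+"
    r"(?P<subject>.+?)\s+"
    r"between\s+(?P<parties>.+)$",
    re.IGNORECASE,
)

title = ("Conclude a cooperation agreement in the field of youth and sports "
         "between the Government of the Republic of Lebanon and the "
         "Government of the Hashemite Kingdom of Jordan")

m = CONCLUDE.match(title)
if m:
    print(m.group("object"))   # cooperation agreement
    print(m.group("subject"))  # youth and sports
    print(m.group("parties"))  # the Government of ... Jordan
```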
10. Conclusion
This system was implemented on a sample taken from the already existing database. The indexers who reviewed the results were satisfied with the consistency, the exactness, and the precision of the assigned keywords. They even reported similarity between automatically assigned keywords and manually assigned ones. This fact was confirmed by a study of the similarity between the terms extracted manually and automatically. We used the cosine coefficient [22], which is one of the methods adopted to measure similarity between two groups of words. The terms are represented by weighted vectors; we simply assign the weight 1 to a term present in the list and 0 otherwise. The result is about 0.76. We explain this result by the proper names that are not integrated by the system, and by some titles that do not exactly match the structure of the local grammar. As stated, geographical entities, personal nouns, and institution names have not been integrated into the processing operation. Their variety as well as their huge number make them very difficult to process in the absence of specific ontologies; accordingly, they still need to be represented through human intervention. The next step in the project will focus on automating the categorization task, which involves classifying the official journal's texts according to their content into three categories: legal, administrative, and thematic. On the other hand, we will work on the list of keywords to build a legal domain ontology [2][23].
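For concreteness, with such binary (0/1) weights the cosine coefficient of [22] reduces to the computation sketched below; the two keyword lists in the example are hypothetical, not taken from the evaluation sample:

```python
import math

def cosine_binary(terms_a, terms_b):
    """Cosine coefficient between two keyword lists represented as
    binary vectors (weight 1 if the term is in the list, 0 otherwise).
    For sets this reduces to |A intersect B| / sqrt(|A| * |B|)."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

manual    = ["Sports cooperation", "Youth cooperation", "Jordan"]
automatic = ["Sports cooperation", "Youth cooperation"]
print(round(cosine_binary(manual, automatic), 2))  # 0.82
```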
References
[1] Al Achkar, Mona (2007), Official Journal Indexing at the Legal Informatics Center, Lebanese University, Internal Report, 2007.
[2] Salem, A., Alphonce, Marco (2010), Web-Based Ontologies for Breast and Lung Cancer, 6th International Conference of Euro-Mediterranean Medical Informatics and Telemedicine, 2010.
[3] Barnbrook, G. (2002), Defining Language: A Local Grammar of Definition Sentences, Amsterdam: John Benjamins Publishers.
[4] Choi, Key-sun, Nam, Jee-sun (1997), A Local-Grammar-based Approach to Recognizing of Proper Names in Korean Texts, Proceedings of the Fifth Workshop on Very Large Corpora, University/Hong Kong University of Science and Technology, pp. 273-288.
[5] Constant, M. (2002), Methods for Constructing Lexicon-Grammar Resources: The Example of Measure Expressions, Proceedings of the 3rd Language Resources and Evaluation Conference, Las Palmas, 2002.
[6] Frank, E., Paynter, E., Witten, I.H., Gutwin, C., Nevill-Manning, C.G. (1999), Domain-Specific Keyphrase Extraction, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pp. 667-668, 1999.
[7] Ercan, G., Cicekli, I. (2007), Using Lexical Chains for Keyword Extraction, Information Processing and Management, 43(6), pp. 1705-1714, 2007.
[8] Friburger, N., Maurel, D. (2001), Finite State Transducer Cascades to Extract Proper Nouns in French Texts, 2nd Conference on Implementation and Application of Automata, Lecture Notes in Computer Science, Pretoria (South Africa).
[9] Frantzi, K.T., Ananiadou, S. (1996), A Hybrid Approach to Term Recognition, Proceedings of the International Conference on Natural Language Processing and Industrial Applications, Université de Moncton, Canada.
[10] Grineva, M., Grinev, M., Lizorkin, D. (2009), Extracting Key Terms from Noisy and Multitheme Documents, Proceedings of the 18th International Conference on WWW, pp. 661-670, 2009.
[11] Shields, Ginger (2005), What Are the Main Differences Between Human Indexing and Automatic Indexing?, LI-842 Automatic Indexing Assignment, April 26, 2005.
[12] Ercan, Gonenc, Cicekli, Ilyas, Using Lexical Chains for Keyword Extraction, Information Processing & Management, Vol. 43, Issue 6, November 2007, pp. 1705-1714.
[13] Gross, M. (1993), Local Grammars and Their Representation by Finite Automata, in M. Hoey (ed.), Data, Description, Discourse, pp. 26-38, HarperCollins, London.
[14] Mihalcea, R., Tarau, P. (2004), TextRank: Bringing Order into Texts, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-04), pp. 233-242, 2004.
[15] Harris, Z. (1991), A Theory of Language and Information: A Mathematical Approach, Oxford: Clarendon Press.
[16] Hunston, S., Sinclair, J. (2000), A Local Grammar of Evaluation, in Hunston, S. & Thompson, G. (eds), Evaluation in Text: Authorial Stance and the Construction of Discourse, Oxford: Oxford University Press, pp. 75-100.
[17] Traboulsi, H.N. (2006), Named Entity Recognition: A Local Grammar-based Approach, Ph.D. dissertation, Dept. of Computing, University of Surrey, Guildford.
[18] Legal Informatics Center (1993), Data Base, Lebanese University Publications (in Arabic), Beirut, 1993.
[19] Karp, Peter D. (1993), The Design Space of Frame Knowledge Representation Systems, Artificial Intelligence Center, SRI International, Note #512.
[20] Turney, P. (2000), Learning to Extract Keyphrases from Text, Journal of Information Retrieval, 2(4), pp. 303-336, 2000.
[21] Rammal, M. (2006), Access to Legal Documents on the Web: The Lebanese Experience, 2nd International Workshop on New Trends in Information, NTIT2006, Homs, Syria.
[22] Salton, G., Buckley, C. (1988), Term-Weighting Approaches in Automatic Text Retrieval, Information Processing and Management, 24(5), pp. 513-523, 1988.
[23] Sartor, G., Casanovas, P., Biasiotti, M., Fernández-Barrera (2011), Approaches to Legal Ontologies, Law, Governance and Technology Series, Vol. 1, 1st Edition, 2011.
[24] Silberztein, M. (1997), The Lexical Analysis of Natural Languages, in E. Roche, Y. Schabes (eds.), Finite State Language Processing, The MIT Press, Cambridge, MA.
[25] Woods, W.A. (1970), Transition Network Grammars for Natural Language Analysis, Communications of the ACM, 13(10).
[26] Zhang, K., Xu, H., Tang, J., Li, J.-Z. (2006), Keyword Extraction Using Support Vector Machine, Lecture Notes in Computer Science: Advances in Web-Age Information Management, pp. 85-96.