López De Luise and Juan M. Ale
Facultad de Ingeniería, Universidad de Buenos Aires(UBA)
Ciudad Autónoma de Buenos Aires – Argentina
[email protected]
Abstract
This work studies the application of induction trees to the detection of certain word categories by means of simple morpho-syntactical descriptors proposed here. The classification power of these new descriptors, with and without stemming, is also studied. Finally, the results show that the classification prediction power is good when stemming is combined with a short list of descriptors.
Resumen
This work studies the use of induction trees for the detection of certain kinds of words by means of some proposed morpho-syntactical descriptors. The classification power of these new descriptors with and without stemming is also studied. Finally, the results show that the classification prediction power is good when stemming is combined with some of the proposed descriptors.
¹ languages, where words usually have several different morphological forms that are created by changing a suffix [17].
² From Mitchell [11]: “Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Learned trees can also be re-represented as sets of if-then rules to improve human readability. These learning methods are among the most popular of inductive inference algorithms and have been successfully applied to a broad range of tasks from learning to diagnose medical cases to learning to assess credit risk on loan applicants”.
corpus for automatic collocation³ identification [16], etc. Therefore, it is important to process such text automatically, as mentioned previously, but considering the special features of web writers and readers. For that reason, all the text processed in this paper is extracted only from web pages.
Another point is that the Internet gives sites in any language the same degree of availability. Hence, the web pages covered here are taken from Spanish-language sites in any country.
The rest of this paper is organized as follows: section 2 describes the database and the data collection procedure, section 3 describes field selection and induction tree model construction, and section 4 presents some conclusions and future work.
2. DATA ANALYSIS
This section gives a short description of the processing steps (section 2.1) and of the dataset and sample characteristics (sections 2.2 and 2.3, respectively).
2.1. Methodology
Four sets of web pages in Spanish, covering several topics, were assembled. All of them were downloaded in text format. From the total of 340 pages, 361,217 words were extracted with a Java application. The output was saved as 15 plain-text files. The text files were converted into Excel format in order to use an Excel form to manually fill in the field tipoPalabra (kind of word). The resulting files were processed with another Java program to introduce the stemming column and afterwards converted into CSV format to be able to work with the WEKA⁴ software. After that, some preliminary statistics were computed with InfoStat⁵ to detect the main dataset features, and the CSV files were processed with WEKA Explorer. An induction tree model was built from the data as detailed in the following sections. Figure 1 depicts all the mentioned steps graphically.
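The extraction step can be illustrated with a short Java sketch like the one below. It is only an approximation of the described pipeline; the file names and the tokenization rule are assumptions, not the authors' actual implementation:

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch of the word-extraction step: reads one page saved as
// plain text and writes one lower-cased word per line, ready for the
// manual tagging of the tipoPalabra field. File names are hypothetical.
public class WordExtractor {
    public static void main(String[] args) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get("page01.txt")), "UTF-8");
        // Split on anything that is not a letter; \p{L} keeps accented Spanish letters.
        String[] words = text.split("[^\\p{L}]+");
        try (PrintWriter out = new PrintWriter("words01.txt", "UTF-8")) {
            for (String w : words) {
                if (!w.isEmpty()) {
                    out.println(w.toLowerCase());
                }
            }
        }
    }
}

A similar program, extended to compute the per-word descriptors, would produce the 15 plain-text files mentioned above.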
³ statistically significant word associations representing “a conventional way of saying things” [9].
⁴ WEKA: an open-source workbench for Data Mining and Machine Learning [18].
⁵ InfoStat: statistical software from the InfoStat group at the Universidad Nacional de Córdoba.
2.2. Dataset Description
The text files were processed with a Java application. For each word, a set of 25 descriptive fields was extracted; therefore, each database record represents one word. The fields are detailed below:
- Continuous fields: none.
- Numerable fields: 10 fields are non-negative integers with a large upper bound (see Table 1). All of them were discretized into fixed-size intervals, to be able to categorize and process them together with the nominal fields; they were separated into 3 or 5 categories (see Table 2; a sketch of this binning appears below).
- Discrete fields: none.
- Non-numeric fields: 15 fields have a domain composed of a specific set of literals (syllables, punctuation signs, a set of predefined words, or the classical binomial Yes/No). See Table 3 for details.
- Missing data: missing values were treated as a distinct data value and processed together with the rest of the data.
Table 1. Numerable fields
Table 2. Categorization
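A fixed-size discretization of this kind can be sketched in Java as follows. The upper bound and the bin counts used here are illustrative assumptions, not the values actually used:

// Minimal sketch of fixed-size discretization for a numerable field.
// The upper bound and the number of bins are illustrative assumptions.
public class FixedBins {
    // Maps a non-negative integer value to one of `bins` fixed-size
    // categories over [0, max]; values above max fall into the last bin.
    static int toCategory(int value, int max, int bins) {
        int width = (int) Math.ceil((max + 1) / (double) bins);
        return Math.min(value / width, bins - 1);
    }

    public static void main(String[] args) {
        // e.g. 5 categories over [0, 100]:
        System.out.println(toCategory(7, 100, 5));   // 0
        System.out.println(toCategory(55, 100, 5));  // 2
        System.out.println(toCategory(300, 100, 5)); // 4 (clamped)
    }
}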
In the following, an induction tree built with the J4.8 algorithm is used to create a model that predicts the kind of certain words based on the descriptors introduced in this paper (section 3.1), based on the descriptors plus the stem (section 3.2), and based only on the best descriptors plus the stem (section 3.3).
⁶ Windowing is a strategy for selecting a subset of data for processing.
1) Splitting of the training sample.
Different percentages of instances were taken from the same sample to construct and validate the model by setting several split values. The data records were randomly extracted from the 47820 instances according to the selected percentage. The initial sampling window had 6838 instances. Results are shown in Table 4.
Table 4. Results with Different Splits
It can be seen that classification improves as the split goes from 66% of the instances used for testing (and 34% for training) to 100% used for both training and testing: the classification model becomes more confident.
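Experiments like this can be reproduced with WEKA's Java API. The following is a minimal sketch, assuming a hypothetical palabras.csv file containing the tagged words, with tipoPalabra as the class attribute; it illustrates the procedure and is not the authors' actual code:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SplitExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("palabras.csv").getDataSet(); // hypothetical file
        data.setClassIndex(data.attribute("tipoPalabra").index());    // class = kind of word
        data.randomize(new Random(1));
        // 66% of the instances on one side of the split, the rest on the other.
        int cut = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, cut);
        Instances test  = new Instances(data, cut, data.numInstances() - cut);
        J48 tree = new J48();                 // WEKA's J4.8 induction tree
        tree.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
        System.out.println("Kappa: " + eval.kappa());
    }
}

Varying the 0.66 cut reproduces the different split values of Table 4.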
2) Alternatives for field categorization.
As part of the sensitivity analysis, different categorizations are evaluated for just one of the descriptor variables: cantOcurrencias (the number of times the word is detected within the HTML page). This variable is selected for this study because it is always near the tree-model root (that is, it is important for determining the kind of word). It was evaluated with 3 and 7 bins. Results are shown in Table 5.
Table 5. Results with Different Categorizations
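The 3- and 7-bin alternatives can be generated with WEKA's unsupervised Discretize filter; the following minimal sketch is an illustration, and the attribute lookup by name is an assumption:

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class Rebin {
    // Re-discretizes one attribute into the given number of fixed-width bins.
    static Instances rebin(Instances data, int attrIndex, int bins) throws Exception {
        Discretize disc = new Discretize();
        disc.setBins(bins);                                      // e.g. 3 or 7 categories
        disc.setAttributeIndices(String.valueOf(attrIndex + 1)); // WEKA indices are 1-based
        disc.setInputFormat(data);
        return Filter.useFilter(data, disc);
    }
}

For instance, rebin(data, data.attribute("cantOcurrencias").index(), 7) would produce the 7-category variant evaluated in Table 5.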
The table shows how precision and total error change with the categorization. To study
the strength of this tendency, margin-curve, precision, recall and precision-recall analyses
are performed, but only for nouns:
- Margin curves for 3 and 7 categories reflect a slight tendency to approach the x-axis as the
number of instances grows. It seems that each new instance makes the classifier more
trustworthy. This tendency becomes apparent with a 66% split, and remains with 70% and
100% (see Figure 2).
Figure 2. Margin curve with 3 (left) and 7 (right) categories, for the 66%, 70% and 100% splits
- Precision curves show that precision with 3 categories is better than with 7 categories, but
with 7 categories more instances are retrieved (102 against 95 with a 66% split). See
Figure 3.
Figure 3. Precision curve for 3 (left) and 7 (right) categories, for the 66%, 70% and 100% splits
- The recall curve presents a higher minimum recall value for 7 categories than for 3
categories. Conversely, the slope is gentler for 3 categories (see Figure 4).
Figure 4. Recall curve for 3 (left) and 7 (right) categories, for the 66%, 70% and 100% splits
- Finally, the precision-recall curves (see Figure 5) show that precision is best for 3 categories,
but at the expense of a smaller number of instances. This behavior is observed for all the
split rates tested (66%, 70%, 100%).
Figure 5. Precision-recall curve for the 66%, 70% and 100% splits
As can be seen from the results in Table 6, the correctly-classified rate and the kappa
values are low.
4) Instance windowing
Three windows of instances were selected. The windows were of different sizes and
compositions, as described below:
a) Sample 1: 47829 instances. The word-class distribution is: 6689 nouns, 2762 verbs, 11027
of other classes, 36 of unknown class. Main characteristics of the sample: the words were
extracted from pages dealing mainly with the same subtopic within the set theme; besides,
each page was longer than in the other two samples.
b) Sample 2: 20515 instances. The word-class distribution is: 6392 nouns, 3050 verbs, 11054
of other classes, 19 of unknown class. Main characteristics of the sample: the pages were
related to many different subtopics and were typically very short on average.
c) Sample 3: 20524 instances. The word-class distribution is: 6535 nouns, 2954 verbs, 11014
of other classes, 21 of unknown class. Main characteristics of the sample: the pages were
related to different subtopics but were of intermediate size on average.
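The word-class distributions listed above can be verified directly from each sample file using WEKA's attribute statistics; the following is a minimal sketch, with a hypothetical file name:

import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassCounts {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("sample1.csv").getDataSet(); // hypothetical file
        int cls = data.attribute("tipoPalabra").index();
        data.setClassIndex(cls);
        AttributeStats stats = data.attributeStats(cls);
        // One count per class label: noun, verb, other, unknown, ...
        for (int i = 0; i < stats.nominalCounts.length; i++) {
            System.out.println(data.classAttribute().value(i) + ": " + stats.nominalCounts[i]);
        }
    }
}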
The model training was performed with each sample, taking 12 data fields (4 of them
categorical). Results are shown in Table 7.
Table 7. Results with Different Samples
As can be seen from the table, the classification power varies significantly with the dataset.
These results are due to the characteristics of each sample. As a consequence of these
characteristics, the noun rate is highest in the second sample, making its classification
correctness higher than that of sample 1 and lower than that of sample 3. The kappa statistic
decreases for sample 3, which has fewer nouns than sample 2, even though sample 3
achieves a slightly better classification rate due to its shorter pages.