López De Luise and Juan M. Ale
Facultad de Ingeniería, Universidad de Buenos Aires(UBA)
Ciudad Autónoma de Buenos Aires – Argentina
[email protected]
Abstract
This work studies the application of induction trees to the detection of certain word categories by means of simple morpho-syntactical descriptors proposed here. The classification power of these new descriptors, with and without stemming, is also studied. Finally, the results show that the classification prediction power is good when stemming is combined with a short list of descriptors.
Resumen
This work studies the use of induction trees for the detection of certain kinds of words by means of some proposed morpho-syntactical descriptors. The classification power of these new descriptors with and without stemming is also studied. Finally, the results show that the classification prediction power is good when stemming is combined with some of the proposed descriptors.
¹ languages, where words usually have several different morphological forms that are created by changing a suffix [17].
² From Mitchell [11]: “Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Learned trees can also be re-represented as sets of if-then rules to improve human readability. These learning methods are among the most popular of inductive inference algorithms and have been successfully applied to a broad range of tasks from learning to diagnose medical cases to learning to assess credit risk on loan applicants”.
corpus for automatic collocation³ identification [16], etc. Therefore, it is important to process such text automatically, as mentioned previously, but considering the special features of web writers and readers. For that reason, all the text processed in this paper is extracted only from web pages.
Another point is that the Internet gives sites in any language the same degree of availability. Hence, the web pages covered here are taken from Spanish-language sites in any country.
The rest of this paper is organized as follows: section 2 describes the database and the data collection procedure, section 3 describes field selection and induction tree model construction, and section 4 presents some conclusions and future work.
2. DATA ANALYSIS
This section gives a short description of the processing steps (section 2.1) and of the dataset and sample characteristics (sections 2.2 and 2.3, respectively).
2.1. Methodology
Four sets of web pages in Spanish, covering several topics, were assembled. All of them were downloaded in text format. From the total of 340 pages, 361,217 words were extracted with a Java application. The output was saved as 15 plain-text files. The text files were converted into Excel format in order to use an Excel form to manually fill in the field tipoPalabra (kind of word). The resulting files were processed with another Java program to introduce the stemming column and afterwards converted into CSV format to be able to work with the WEKA⁴ software. After that, some preliminary statistics were computed with InfoStat⁵ to detect the main dataset features, and the CSV files were processed with WEKA Explorer. An induction tree model was built from the data as detailed in the following sections. Figure 1 depicts all the mentioned steps graphically.
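The extraction step can be illustrated with a short Java sketch like the one below. It is only an approximation of the described pipeline; the file names and the tokenization rule are assumptions, not the authors' actual implementation:

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch of the word-extraction step: reads one page saved as
// plain text and writes one lower-cased word per line, ready for the
// manual tagging of the tipoPalabra field. File names are hypothetical.
public class WordExtractor {
    public static void main(String[] args) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get("page01.txt")), "UTF-8");
        // Split on anything that is not a letter; \p{L} keeps accented Spanish letters.
        String[] words = text.split("[^\\p{L}]+");
        try (PrintWriter out = new PrintWriter("words01.txt", "UTF-8")) {
            for (String w : words) {
                if (!w.isEmpty()) {
                    out.println(w.toLowerCase());
                }
            }
        }
    }
}

A similar program, extended to compute the per-word descriptors, would produce the 15 plain-text files mentioned above.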
³ statistically significant word associations representing “a conventional way of saying things” [9].
⁴ WEKA: an open-source workbench for Data Mining and Machine Learning [18].
⁵ InfoStat: statistical software from the InfoStat group at the Universidad Nacional de Córdoba.
2.2. Dataset Description
The text files were processed with a Java application. For each word, a set of 25 descriptive fields was extracted; therefore, each database record represents one word. The fields are detailed below:
- Continuous fields: none.
- Numerable fields: 10 fields are non-negative integers with a large upper bound (see Table 1). All of them were discretized into fixed-size intervals, to be able to categorize and process them together with the nominal fields; they were separated into 3 or 5 categories (see Table 2; a sketch of this binning appears below).
- Discrete fields: none.
- Non-numeric fields: 15 fields have a domain composed of a specific set of literals (syllables, punctuation signs, a set of predefined words, or the classical binomial Yes/No). See Table 3 for details.
- Missing data: missing values were treated as a distinct data value and processed together with the rest of the data.
Table 1. Numerable fields
Table 2. Categorization
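A fixed-size discretization of this kind can be sketched in Java as follows. The upper bound and the bin counts used here are illustrative assumptions, not the values actually used:

// Minimal sketch of fixed-size discretization for a numerable field.
// The upper bound and the number of bins are illustrative assumptions.
public class FixedBins {
    // Maps a non-negative integer value to one of `bins` fixed-size
    // categories over [0, max]; values above max fall into the last bin.
    static int toCategory(int value, int max, int bins) {
        int width = (int) Math.ceil((max + 1) / (double) bins);
        return Math.min(value / width, bins - 1);
    }

    public static void main(String[] args) {
        // e.g. 5 categories over [0, 100]:
        System.out.println(toCategory(7, 100, 5));   // 0
        System.out.println(toCategory(55, 100, 5));  // 2
        System.out.println(toCategory(300, 100, 5)); // 4 (clamped)
    }
}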
In the following, an induction tree built with the J4.8 algorithm is used to create a model that predicts the kind of certain words based on the descriptors introduced in this paper (section 3.1), based on the descriptors plus the stem (section 3.2), and based only on the best descriptors plus the stem (section 3.3).
⁶ Windowing is a strategy for selecting a subset of data for processing.
1) Splitting of the training sample.
Different percentages of instances were taken from the same sample to construct and validate the model by setting several split values. The data records were randomly extracted from the 47820 instances according to the selected percentage. The initial sampling window had 6838 instances. Results are shown in Table 4.
Table 4. Results with Different Splits
It can be seen that classification improves as the split goes from 66% of the instances used for testing (and 34% for training) to 100% used for both training and testing: the classification model becomes more confident.
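Experiments like this can be reproduced with WEKA's Java API. The following is a minimal sketch, assuming a hypothetical palabras.csv file containing the tagged words, with tipoPalabra as the class attribute; it illustrates the procedure and is not the authors' actual code:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SplitExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("palabras.csv").getDataSet(); // hypothetical file
        data.setClassIndex(data.attribute("tipoPalabra").index());    // class = kind of word
        data.randomize(new Random(1));
        // 66% of the instances on one side of the split, the rest on the other.
        int cut = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, cut);
        Instances test  = new Instances(data, cut, data.numInstances() - cut);
        J48 tree = new J48();                 // WEKA's J4.8 induction tree
        tree.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
        System.out.println("Kappa: " + eval.kappa());
    }
}

Varying the 0.66 cut reproduces the different split values of Table 4.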
2) Alternatives for field categorization.
As part of the sensitivity analysis, different categorizations are evaluated for just one of the descriptor variables: cantOcurrencias (the number of times the word is detected within the HTML page). This variable is selected for this study because it is always near the tree-model root (that is, it is important for determining the kind of word). It was evaluated with 3 and 7 bins. Results are shown in Table 5.
Table 5. Results with Different Categorizations
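The 3- and 7-bin alternatives can be generated with WEKA's unsupervised Discretize filter; the following minimal sketch is an illustration, and the attribute lookup by name is an assumption:

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class Rebin {
    // Re-discretizes one attribute into the given number of fixed-width bins.
    static Instances rebin(Instances data, int attrIndex, int bins) throws Exception {
        Discretize disc = new Discretize();
        disc.setBins(bins);                                      // e.g. 3 or 7 categories
        disc.setAttributeIndices(String.valueOf(attrIndex + 1)); // WEKA indices are 1-based
        disc.setInputFormat(data);
        return Filter.useFilter(data, disc);
    }
}

For instance, rebin(data, data.attribute("cantOcurrencias").index(), 7) would produce the 7-category variant evaluated in Table 5.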
The table shows how precision and total error change with the categorization. To study
the strength of this tendency, margin-curve, precision, recall and precision-recall analyses
are performed, but only for nouns:
- Margin curves for 3 and 7 categories reflect a slight tendency to approach the x-axis as the
number of instances grows. It seems that each new instance makes the classifier more
trustworthy. This tendency becomes apparent with a 66% split, and remains with 70% and
100% (see Figure 2).
Figure 2. Margin curve with 3 (left) and 7 (right) categories, for the 66%, 70% and 100% splits
- Precision curves show that precision with 3 categories is better than with 7 categories, but
with 7 categories more instances are retrieved (102 against 95 with a 66% split). See
Figure 3.
Figure 3. Precision curve for 3 (left) and 7 (right) categories, for the 66%, 70% and 100% splits
- The recall curve presents a higher minimum recall value for 7 categories than for 3
categories. Conversely, the slope is gentler for 3 categories (see Figure 4).
Figure 4. Recall curve for 3 (left) and 7 (right) categories, for the 66%, 70% and 100% splits
- Finally, the precision-recall curves (see Figure 5) show that precision is best for 3 categories,
but at the expense of a smaller number of instances. This behavior is observed for all the
split rates tested (66%, 70%, 100%).
Figure 5. Precision-recall curve for the 66%, 70% and 100% splits
As can be seen from the results in Table 6, the correctly-classified rate and the kappa
values are low.
4) Instance windowing
Three windows of instances were selected. The windows were of different sizes and
compositions, as described below:
a) Sample 1: 47829 instances. The word-class distribution is: 6689 nouns, 2762 verbs, 11027
of other classes, 36 of unknown class. Main characteristics of the sample: the words were
extracted from pages dealing mainly with the same subtopic within the set theme; besides,
each page was longer than in the other two samples.
b) Sample 2: 20515 instances. The word-class distribution is: 6392 nouns, 3050 verbs, 11054
of other classes, 19 of unknown class. Main characteristics of the sample: the pages were
related to many different subtopics and were typically very short on average.
c) Sample 3: 20524 instances. The word-class distribution is: 6535 nouns, 2954 verbs, 11014
of other classes, 21 of unknown class. Main characteristics of the sample: the pages were
related to different subtopics but were of intermediate size on average.
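The word-class distributions listed above can be verified directly from each sample file using WEKA's attribute statistics; the following is a minimal sketch, with a hypothetical file name:

import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassCounts {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("sample1.csv").getDataSet(); // hypothetical file
        int cls = data.attribute("tipoPalabra").index();
        data.setClassIndex(cls);
        AttributeStats stats = data.attributeStats(cls);
        // One count per class label: noun, verb, other, unknown, ...
        for (int i = 0; i < stats.nominalCounts.length; i++) {
            System.out.println(data.classAttribute().value(i) + ": " + stats.nominalCounts[i]);
        }
    }
}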
The model training was performed with each sample, taking 12 data fields (4 of them
categorical). Results are shown in Table 7.
Table 7. Results with Different Samples
As can be seen from the table, the classification power varies significantly with the dataset.
These results are due to the characteristics of each sample. As a consequence of these
characteristics, the noun rate is highest in the second sample, making its classification
correctness higher than that of sample 1 and lower than that of sample 3. The kappa statistic
decreases for sample 3, which has fewer nouns than sample 2, even though sample 3
achieves a slightly better classification rate due to its shorter pages.