that were accepted by the previous clause(s). The algorithm repeats this process until all the examples in $E^-$ have been rejected by the derived clauses. In Fig. 2, POS($A_j$) and NEG($A_j$) denote the numbers of positive and negative examples, respectively, that are accepted by the feature $A_j$. Please note that Steps 1 and 2 in Fig. 2 also consider the negations (denoted as $\bar{A}_j$) of the features $A_j$. The ratio POS($A_j$)/NEG($A_j$) in Steps 1 and 2 is used to quickly identify features that can form a clause that would accept all the positive examples while rejecting many of the remaining negative ones (for the CNF case). By selecting a feature that maximizes this ratio, it is likely that the selected feature has a large POS($A_j$) value.

Fig. 2. A fast heuristic for forming CNF clauses for the OCAT approach.
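To make the discussion of Fig. 2 more concrete, the following sketch shows one plausible implementation of the ratio-driven heuristic and of the OCAT loop of Fig. 1, assuming examples are encoded as 0/1 tuples over the features $A_1, \ldots, A_n$. The function names, the +1 in the denominator, and the tie-breaking behaviour are illustrative assumptions of ours, not the authors' code.

```python
def literal_value(example, feature, negated):
    """Truth value of the literal A_feature (or its negation) on a 0/1 example."""
    value = bool(example[feature])
    return (not value) if negated else value

def accepts(clause, example):
    """A CNF clause (a disjunction of literals) accepts an example if some literal is true."""
    return any(literal_value(example, j, neg) for j, neg in clause)

def build_clause(pos, neg, n_features):
    """Greedily form one clause that accepts every positive example (Steps 1-2 of Fig. 2).

    Literals are ranked by a POS/NEG ratio: positives not yet accepted by the
    clause that the literal accepts, over negatives it would newly accept
    (+1 to avoid division by zero -- an illustrative choice).
    """
    clause = []
    uncovered_pos = list(pos)      # positives the clause does not yet accept
    unaccepted_neg = list(neg)     # negatives the clause still rejects
    while uncovered_pos:
        best, best_ratio = None, -1.0
        for j in range(n_features):
            for negated in (False, True):
                p = sum(literal_value(e, j, negated) for e in uncovered_pos)
                n = sum(literal_value(e, j, negated) for e in unaccepted_neg)
                if p and p / (n + 1) > best_ratio:
                    best, best_ratio = (j, negated), p / (n + 1)
        if best is None:
            break                  # no literal helps; degenerate data
        clause.append(best)
        uncovered_pos = [e for e in uncovered_pos if not literal_value(e, *best)]
        unaccepted_neg = [e for e in unaccepted_neg if not literal_value(e, *best)]
    return clause

def ocat_cnf(pos, neg, n_features):
    """One clause at a time (Fig. 1): add clauses until every negative is rejected."""
    clauses, remaining_neg = [], list(neg)
    while remaining_neg:
        clause = build_clause(pos, remaining_neg, n_features)
        clauses.append(clause)
        still_accepted = [e for e in remaining_neg if accepts(clause, e)]
        if len(still_accepted) == len(remaining_neg):
            break                  # no further progress is possible with this sketch
        remaining_neg = still_accepted
    return clauses
```

Each pass of build_clause ensures that the clause accepts all of $E^+$, so the conjunction of the derived clauses can only fail on negative examples, which is exactly what the outer loop monitors.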
The next iteration of the OCAT approach (in Fig. 1) will consider the original four positive examples, but now the set of the negative examples consists of the ones not rejected so far, that is, the second, third, and sixth negative examples in terms of the original set $E^-$. Working as above, it can be seen that the next application of the fast heuristic will return the clause $(\bar{A}_2 \vee \bar{A}_3)$. The loop in Fig. 1 needs to be repeated once more, and a third (and final) clause is derived. That clause is $(A_1 \vee A_3 \vee \bar{A}_4)$. In other words, the logical expression (in CNF) which is derived from the training examples depicted in Fig. 3 is as follows:
$(A_2 \vee A_4) \wedge (\bar{A}_2 \vee \bar{A}_3) \wedge (A_1 \vee A_3 \vee \bar{A}_4)$.   (1)
A fundamental property of expression (1) is that it accepts (i.e., evaluates to 1) all the examples in $E^+$ and rejects (i.e., evaluates to 0) all the examples in $E^-$. When the roles of the positive and negative training examples are reversed (as depicted in Fig. 4), the same procedure derives the following expression:

$(A_3 \vee \bar{A}_2) \wedge (\bar{A}_4 \vee A_2 \vee \bar{A}_1) \wedge (A_1 \vee \bar{A}_3)$.   (2)
As with expression (1), a fundamental property of the corresponding Boolean function $f(x) = (A_3 \vee \bar{A}_2) \wedge (\bar{A}_4 \vee A_2 \vee \bar{A}_1) \wedge (A_1 \vee \bar{A}_3)$ is that it accepts (i.e., evaluates to 1) the former negative examples and rejects (i.e., evaluates to 0) the former positive examples. For convenience, following the setting of the examples in Figs. 3 and 4, expression (1) will be called the positive rule (denoted as $R^+$), while expression (2) will be called the negative rule (denoted as $R^-$).
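Since expressions (1) and (2) are plain CNF formulas over four Boolean features, $R^+$ and $R^-$ can be evaluated on a document surrogate with a few lines of code. The sketch below is our own illustration; the encoding of documents as 0/1 tuples indexed by $A_1, \ldots, A_4$ and the example vector are hypothetical.

```python
# A clause is a list of (feature_index, negated) literals; a rule is a list of clauses.
# Feature indices 0, 1, 2, 3 stand for A_1, A_2, A_3, A_4.

R_POS = [[(1, False), (3, False)],               # (A2 or A4)
         [(1, True),  (2, True)],                # (not A2 or not A3)
         [(0, False), (2, False), (3, True)]]    # (A1 or A3 or not A4)

R_NEG = [[(2, False), (1, True)],                # (A3 or not A2)
         [(3, True),  (1, False), (0, True)],    # (not A4 or A2 or not A1)
         [(0, False), (2, True)]]                # (A1 or not A3)

def evaluate(rule, example):
    """Return 1 if the CNF rule accepts the 0/1 example, and 0 otherwise."""
    for clause in rule:
        if not any((not example[j]) if neg else example[j] for j, neg in clause):
            return 0          # one unsatisfied clause is enough to reject
    return 1

# A hypothetical document surrogate with A1 = 1, A2 = 0, A3 = 1, A4 = 0:
doc = (1, 0, 1, 0)
print(evaluate(R_POS, doc), evaluate(R_NEG, doc))   # prints: 0 1
```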
The disadvantage of using only one rule (logical expression) can be overcome by considering the combined decisions of $R^+$ and $R^-$. Let $e$ denote an example (document) whose class needs to be determined. Under this setting, the classification of $e$ can only be:

1. Correct if and only if:
(a) $R^+(e) = 1$ and $R^-(e) = 0$, when $e$ is actually a positive example; or
(b) $R^+(e) = 0$ and $R^-(e) = 1$, when $e$ is actually a negative example.
2. Incorrect if and only if:
(c) $R^+(e) = 0$ and $R^-(e) = 1$, when $e$ is actually a positive example; or
(d) $R^+(e) = 1$ and $R^-(e) = 0$, when $e$ is actually a negative example.
3. Undecided if and only if:
(e) $R^+(e) = 1$ and $R^-(e) = 1$, when $e$ is actually a positive example;
(f) $R^+(e) = 1$ and $R^-(e) = 1$, when $e$ is actually a negative example;
(g) $R^+(e) = 0$ and $R^-(e) = 0$, when $e$ is actually a positive example;
(h) $R^+(e) = 0$ and $R^-(e) = 0$, when $e$ is actually a negative example.

Fig. 4. The training example sets in reverse roles.
Cases (a) and (b) are called correct classifications because both rules perform according to the desired properties described above. However, as indicated above, it is possible that the rules could incorrectly classify an example (cases (c) and (d)). Or the rules could simultaneously accept (cases (e) and (f)) or reject (cases (g) and (h)) the example. Cases (e)–(h) are called undecided because one of the rules does not possess enough classification knowledge, and thus such a rule must be reconstructed. Therefore, undecided situations open the path to improving the accuracy of a classification system. This paper exploits the presence of undecided situations in order to guide the reconstruction of the rule that triggered an erroneous classification decision.
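Once $R^+(e)$ and $R^-(e)$ have been computed and the oracle's verdict on $e$ is known, the eight cases above collapse into a small decision table. The following sketch is our own restatement of that table (the names and encodings are hypothetical), not code from the paper.

```python
def combined_decision(r_pos, r_neg, true_class):
    """Outcome of the two-rule scheme for one example e (cases (a)-(h) above).

    r_pos, r_neg : the 0/1 values R+(e) and R-(e).
    true_class   : '+' if e is actually a positive example, '-' otherwise.
    """
    if r_pos == r_neg:                       # cases (e)-(h): the rules agree
        return "undecided"
    predicted = "+" if (r_pos, r_neg) == (1, 0) else "-"
    return "correct" if predicted == true_class else "incorrect"

# Example: both rules accept a document that is actually negative -> undecided (case (f)).
print(combined_decision(1, 1, "-"))
```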
5. An overview of the vector space model
The VSM is a mathematical model of an IR system that can also be used for the classification of text documents (Salton, 1989; Salton & Wong, 1975). It is often used as a benchmarking method when dealing with document retrieval and classification-related problems. Fig. 5 illustrates a typical three-step strategy of the VSM approach to clustering.
To address Step 1, Salton (1989) indicates that a suitable measure for pairwise comparing any two surrogates X and Y is the cosine coefficient (CC), as defined in Eq. (3) (other similarity measures are listed in Salton (1989, Chapter 10)):
$CC = \dfrac{|X \cap Y|}{|X|^{1/2}\,|Y|^{1/2}}$.   (3)
In this formula, $X = (x_1, x_2, x_3, \ldots, x_t)$ and $Y = (y_1, y_2, y_3, \ldots, y_t)$, where $x_i$ indicates the presence (1) or absence (0) of the $i$th indexing term in $X$, and similarly for $y_i$ with respect to $Y$. Moreover, $|X| = |Y| = t$ is the number of indexing terms, and $|X \cap Y|$ is the number of indexing terms appearing simultaneously in $X$ and $Y$. To be consistent with the utilization of binary surrogates, formula (3) provides the CC expression for the case of Boolean vectors. This coefficient measures the angle between two surrogates (Boolean vectors) in a Cartesian plane. Salton (1989) indicates that the magnitude of this angle can be used to measure the similarity between any two documents. This statement is based on the observation that two surrogates are identical if and only if the angle between them is equal to 0.

Fig. 5. The VSM approach.
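For two Boolean surrogates, Eq. (3) can be computed directly. The sketch below is a minimal illustration under the usual set reading of Eq. (3), in which $|X|$ and $|Y|$ count the index terms present in each surrogate; under the alternative reading suggested by the text above, where $|X| = |Y| = t$, the denominator would simply be $t$. The surrogates shown are hypothetical.

```python
from math import sqrt

def cosine_coefficient(x, y):
    """Cosine coefficient of Eq. (3) for two Boolean surrogates x and y (0/1 sequences)."""
    if len(x) != len(y):
        raise ValueError("surrogates must be defined over the same t index terms")
    shared = sum(1 for a, b in zip(x, y) if a and b)   # |X intersect Y|
    size_x = sum(x)                                    # number of terms present in X
    size_y = sum(y)                                    # number of terms present in Y
    if size_x == 0 or size_y == 0:
        return 0.0
    return shared / (sqrt(size_x) * sqrt(size_y))

# Identical surrogates give CC = 1 (angle 0); surrogates with no shared terms give CC = 0.
print(cosine_coefficient((1, 0, 1, 1), (1, 0, 1, 1)))   # 1.0
print(cosine_coefficient((1, 0, 1, 1), (0, 1, 0, 0)))   # 0.0
```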
In Step 2 the VSM clusters together documents that share a similar content based on their surrogates. According to Salton (1989), any clustering technique can be used to group documents with similar surrogates. A collection of clustering techniques is given in Anderberg (1973), Aldenderfer (1984), Späth (1985), and Van Rijsbergen (1979). However, it is important to mention here that with any of these techniques, the number of generated classes is always a function of some predefined parameters. This is in contrast with the requirements of our problem here, in which the number of classes is exactly equal to two. When the VSM works under a predefined number of classes, it is said to perform a pseudo-classification. In this study the training examples are already grouped into two (disjoint) classes. Thus, the VSM is applied on the examples (documents) in each class and the corresponding centroids are derived. Hence, we will continue to use this kind of pseudo-classification.
To address Step 3, Salton (1989), Salton and Wong (1975), and Van Rijsbergen (1979) suggest that the computation of a class centroid be done as follows. Let $w_{rj}$ ($j = 1, 2, 3, \ldots, t$) be the $j$th element of the centroid for class $C_r$, which contains $q$ documents. Also, the surrogate for document $D_i$ is defined as $\{D_{ij}\}$. Then, $w_{rj}$ is computed as follows:

$w_{rj} = \dfrac{1}{q} \sum_{i=1}^{q} D_{ij}$, for $j = 1, 2, 3, \ldots, t$.   (4)

That is, the centroid for class $C_r$ is also a surrogate (also known as the average document) defined on $t$ keywords.
Finally, the VSM classifies a new document by comparing its surrogate (i.e., by computing the CC) against the centroids that were created in Step 3. A new document will be placed in the class for which the CC value is maximum.
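Combining Eqs. (3) and (4), the classification step just described can be sketched as follows. Because the centroid of Eq. (4) is generally not a 0/1 vector, the sketch uses the general dot-product form of the cosine coefficient, which reduces to Eq. (3) on Boolean surrogates; the class labels and toy data are hypothetical.

```python
from math import sqrt

def cosine_coefficient(x, y):
    """General cosine coefficient; reduces to Eq. (3) when x and y are 0/1 vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sqrt(sum(a * a for a in x))
    ny = sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def centroid(surrogates):
    """Eq. (4): the 'average document' of a class of q surrogates defined on t terms."""
    q, t = len(surrogates), len(surrogates[0])
    return [sum(doc[j] for doc in surrogates) / q for j in range(t)]

def classify(new_surrogate, class_centroids):
    """Place a new document in the class whose centroid gives the largest CC value."""
    return max(class_centroids,
               key=lambda label: cosine_coefficient(new_surrogate, class_centroids[label]))

# Hypothetical toy data: two classes of Boolean surrogates over t = 4 index terms.
pos_docs = [(1, 1, 0, 0), (1, 0, 1, 0)]
neg_docs = [(0, 0, 1, 1), (0, 1, 0, 1)]
centroids = {"E+": centroid(pos_docs), "E-": centroid(neg_docs)}
print(classify((1, 1, 1, 0), centroids))   # -> "E+" for this toy data
```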
In the tests to be described later in this paper, the VSM is applied on the documents (training examples) available for each class. In this way, the centroid of each one of the two classes is derived. For instance, consider the training examples depicted in Figs. 3 and 4. The VSM is now applied on these data. The centroids in expression (5) have been constructed from the data in Fig. 3 and the centroids in expression (6) from the data in Fig. 4. Obviously, the centroids for the second set are in reverse order of those for the first set of data; that is, expressions (5) and (6) give the centroids for the data in Figs. 3 and 4, respectively. In order to match the names of the positive and negative rules described for the OCAT algorithm, the two centroids for the data in Fig. 3 will be called the positive centroids, while the centroids for the data in Fig. 4 will be called the negative centroids. As with the OCAT algorithm, the utilization of two sets of centroids has been investigated in order to tackle the new classification problem by using the VSM as new examples become available.
6. Guided learning for the classification of text documents

The central idea of the GLA can be illustrated as follows. Suppose that the collection to be classified contains millions of documents. Also, suppose that an oracle (i.e., an expert classifier) is queried in order to classify a small sample of examples (documents) into the classes $E^+$ and $E^-$. Next, suppose that the OCAT algorithm is used to construct the positive and negative rules, such as was the case with expressions (1) and (2). As indicated earlier, these rules may be inaccurate when classifying examples not included in the training set, and therefore they will result in one of the classification outcomes given in cases (a)–(h), as described earlier. One way to improve the classification accuracy of these rules is to add one more document to the training set (either in $E^+$ or in $E^-$) and have the rules reconstructed. Therefore, the question the GLA attempts to answer is: What is the next document to be inspected by the expert classifier so that the classification performance of the derived rules can be improved as fast as possible?
One way to provide the expert with this document is to randomly select one from the remaining unclassified documents. We call this the RANDOM input learning strategy. A drawback of this strategy may occur if the oracle and the incumbent rules frequently classify a document in the same class. If this occurs frequently, then the utilization of the oracle and the addition of the new examples to the training set is of no benefit. An alternative and more efficient way to provide the expert with a document is to select one in an undecided situation. This strategy (in a general form) was first introduced in Triantaphyllou & Soyster (1996b). This approach appears to be a more efficient way of selecting the document because an undecided situation implies that one of the rules has misclassified the document. Therefore, the expert's verdict will not only guide the reconstruction of the rule that triggered the misclassification, but it may also improve the learning rate of the two rules. We call this the GUIDED input learning strategy. An incremental learning version of the OCAT approach is described in Nieto Sanchez, Triantaphyllou, Chen, & Liao (2002).
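The difference between the two input learning strategies amounts to how the next document for the oracle is chosen. The sketch below is a schematic rendering of that choice under assumed rule-evaluation functions; it is not the authors' implementation.

```python
import random

def next_document(unlabeled, r_pos, r_neg, strategy="GUIDED"):
    """Pick the next document to submit to the oracle (expert classifier).

    unlabeled    : documents not yet in the training sets E+ or E-.
    r_pos, r_neg : callables returning the 0/1 value of the current positive
                   and negative rules on a document.
    RANDOM picks any unlabeled document.  GUIDED prefers an 'undecided' one,
    i.e. a document on which the two rules agree (both accept or both reject),
    and falls back to a random pick when no undecided document exists.
    """
    if strategy == "GUIDED":
        undecided = [d for d in unlabeled if r_pos(d) == r_neg(d)]
        if undecided:
            return random.choice(undecided)
    return random.choice(list(unlabeled))
```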
7. Experimental data
In order to determine the classification performance of the OCAT approach in addressing this new problem, the OCAT's classification accuracy was compared with that of the VSM. Both approaches were tested under three experimental settings:
1. a Leave-One-Out Cross-Validation (or CV) (also known as the Round-Robin test);
2. a 30/30 Cross-Validation (or 30CV), where 30 stands for the number of training documents in each class; and
3. an experimental setting in which the OCAT algorithm was studied under a random and a guided learning strategy.
These will be defined below. This multiple testing strategy was selected in order to gain a more comprehensive assessment of the effectiveness of the various methods.
For these tests, a sample of 2897 documents was randomly selected from four document classes
of the TIPSTER collection (Harman, 1995; Voorhees, 1998). The previous numbers of documents
in each class were determined based on memory limitations on the computing platform used (an
IBM Pentium II PC running Windows 95). The TIPSTER collection is a standard data set for
experimentation with IR systems. The four document classes were as follows:
1. Department of Energy (DOE) documents,
2. Wall Street Journal (WSJ) documents,
3. Associated Press (AP) documents, and
4. ZIPFF class documents.
We chose documents from this collection because for security reasons we did not have access to
actual secret DOE documents.
Table 1 shows the number of documents that were used in the experimentation. These docu-
ments were randomly extracted from the four classes of the TIPSTER collection.
We simulated two mutually exclusive classes by forming the following five class-pairs: (DOE vs. AP), (DOE vs. WSJ), (DOE vs. ZIPFF), (AP vs. WSJ), and (WSJ vs. ZIPFF). These five class-pairs were randomly selected from all possible class-pair combinations. To comply with the notation presented in the previous sections, the first class of each class-pair was denoted as $E^+$, while the second class was denoted as $E^-$.
The following two sets of statistical hypotheses were tested when comparing the two approaches:
1. $H_0$: $P_{OCAT} = P_{VSM}$ versus $H_1$: $P_{OCAT} > P_{VSM}$;
2. $H_0$: $p = 0.50$ versus $H_1$: $p \neq 0.50$,
where $P_{OCAT}$ and $P_{VSM}$ are the numbers of documents with correct classification under the two algorithms divided by the total number of documents in the experiment. In addition, $p$ is the probability of finding an equal number of positive and negative differences in a set of outcomes. More on how these tests were performed is provided in the following sub-sections, which present the computational results.
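The second pair of hypotheses corresponds to the classical setting of a sign test on the positive and negative differences between paired outcomes. As an illustration only (not necessarily the exact procedure used in the paper), a two-sided exact sign test of $H_0$: $p = 0.50$ could be carried out as follows; the counts shown are made up.

```python
from math import comb

def sign_test_p_value(n_plus, n_minus):
    """Exact two-sided sign test of H0: p = 0.50.

    n_plus and n_minus count the positive and negative differences between the
    paired outcomes of the two methods (ties are discarded beforehand).
    Returns the probability, under H0, of a split at least this uneven.
    """
    n = n_plus + n_minus
    k = max(n_plus, n_minus)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)   # double the one-sided tail for H1: p != 0.50

# Illustrative (made-up) counts: 18 differences favour one method, 4 the other.
print(sign_test_p_value(18, 4))   # about 0.004, so H0: p = 0.50 would be rejected
```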
8.4. Experimental setting for the Guided Learning Approach
Consider the question: What is the best next document to be given to the oracle in order to improve the performance of the classification rule? Three samples of 510 documents (255 from each class) from the three class-pairs (DOE vs. ZIPFF), (AP vs. DOE), and (WSJ vs. ZIPFF) were used. The number of 510 documents was determined by the available RAM on the Windows PC we used. These three class-pairs were processed by the OCAT algorithm (only) under the RANDOM and the GUIDED learning approaches.
These two learning approaches were implemented as follows. At first, 30 documents from each class in the experiment were randomly selected, and the positive and negative rules (logical expressions) were constructed. Next, the class membership of all 510 documents in the experiment was inferred based on the two classification rules. The criteria expressed as cases (a)–(h) in Section 4 were used to determine the number of correct, incorrect, and undecided classifications.
Next, a document was added to the initial training sample as follows. For the case of the RANDOM approach, this document was selected at random from among the documents not included in the training sets yet (i.e., neither in $E^+$ nor in $E^-$).
In contrast, under the GUIDED approach this document was selected from the set of documents which the positive and negative rules had already termed as undecided. However, if the two rules did not detect an undecided case, then the GUIDED approach was replaced by the RANDOM approach until a new undecided case was identified. This process for the RANDOM and GUIDED approaches was repeated until all 510 documents were included in the two training sets $E^+$ and $E^-$.
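For completeness, the incremental protocol just described can be summarized by a short driver loop. Everything in the sketch below (train_rules, evaluate, the oracle interface, and next_document as in the earlier sketch) is an assumed, hypothetical interface; the code only mirrors the sequence of steps stated above.

```python
import random

def run_incremental_experiment(documents, oracle, train_rules, evaluate,
                               next_document, seed_per_class=30, strategy="GUIDED"):
    """Schematic driver for the RANDOM/GUIDED learning experiment.

    documents   : all 510 documents of one class-pair.
    oracle      : oracle(d) -> '+' or '-', the expert's verdict on document d.
    train_rules : train_rules(E_pos, E_neg) -> (rule_pos, rule_neg).
    evaluate    : evaluate(rule, d) -> 0 or 1.
    next_document(unlabeled, r_pos, r_neg, strategy) picks the next oracle query.
    """
    pool = list(documents)
    random.shuffle(pool)
    E_pos = [d for d in pool if oracle(d) == "+"][:seed_per_class]   # 30 seed documents
    E_neg = [d for d in pool if oracle(d) == "-"][:seed_per_class]   # per class
    unlabeled = [d for d in pool if d not in E_pos and d not in E_neg]
    history = []
    while unlabeled:
        rule_pos, rule_neg = train_rules(E_pos, E_neg)
        # record how the current rules classify the full document set
        history.append([(evaluate(rule_pos, d), evaluate(rule_neg, d)) for d in documents])
        d = next_document(unlabeled,
                          lambda x: evaluate(rule_pos, x),
                          lambda x: evaluate(rule_neg, x),
                          strategy)
        unlabeled.remove(d)
        (E_pos if oracle(d) == "+" else E_neg).append(d)
    return history
```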
On the other hand, it can be seen that under the RANDOM learning approach, the rates $R_i$ and $R_u$ (the rates of incorrect and undecided classifications, respectively) reached 0% only when 99.8% (509) of the documents had been processed.
11. Concluding remarks
This paper has examined a classification problem in which a document must be classified into one of two disjoint classes. As an example of the importance of this type of classification, one can consider the possible release to the public of documents that may affect national security. The method proposed in this paper (being an automatic method) is not infallible, not least because its performance depends on how representative the training examples (documents) are. Nevertheless, the application of such an approach to a problem of critical importance (such as the one highlighted in the introduction) can be seen as an important and useful automatic tool for a preliminary selection of the documents to be classified.
We considered an approach to this problem based on the VSM algorithm and compared it with an algorithm based on mathematical logic, called the OCAT algorithm. We tested these two approaches on almost 3000 documents from the four document classes of the TIPSTER collection: Department of Energy (DOE), Wall Street Journal (WSJ), Associated Press (AP), and the ZIPFF class. Furthermore, these documents were analyzed under two types of experimental settings: (i) a Leave-One-Out Cross-Validation and (ii) a 30/30 Cross-Validation (where 30 indicates the initial number of training documents from each document class). The experimental results suggest that the OCAT algorithm performed significantly better than the VSM in classifying documents into two disjoint classes.
Moreover, the results of a third experiment suggested that the classification efficiency of the OCAT algorithm can be improved substantially if a GLA is implemented. Actually, experiments on samples of 510 documents from the previous four classes of the TIPSTER collection indicated that the OCAT algorithm needed only about 336 training documents (i.e., 66% of them) before it correctly classified all of the documents.
The results presented here, although limited to a relatively small collection of almost 3000 documents, are encouraging because they suggest that the OCAT algorithm can be used in the classification of larger collections of documents.
Acknowledgements
The authors are very appreciative of the very thoughtful and constructive comments made by two anonymous referees of an earlier version of the paper. Those comments played an important role in improving the context and presentation of this paper. The first and second authors wish to thank the US Department of Energy for providing the financial funds that supported this research. The first author is pleased to recognize the partial financial support from the Consejo Nacional de Ciencia y Tecnología of Mexico. The second author is also very appreciative of the support provided by the US Navy, Office of Naval Research, through grants N00014-95-1-0639 and N00014-97-1-0632.
References
Aldenderfer, M. S. (1984). Cluster analysis. Beverly Hills, CA: Sage.
Anderberg, M. R. (1973). Cluster analysis for applications. New York: Academic Press.
Barnes, J. W. (1994). Statistical analysis for engineers and scientists, a computer-based approach. New York: McGraw-Hill.
Buckley, C., & Salton, G. (1995). Optimization of relevance feedback weights. In Proceedings of SIGIR 1995 (pp. 351–357).
Chen, H. (1996). Machine learning approach to document retrieval: An overview and an experiment. Technical Report, University of Arizona, MIS Department, Tucson, AZ, USA.
Cleveland, D., & Cleveland, A. D. (1983). Introduction to indexing and abstracting. Littleton, CO: Libraries Unlimited.
Deshpande, A. S., & Triantaphyllou, E. (1998). A greedy randomized adaptive search procedure (GRASP) for inferring logical clauses from examples in polynomial time and some extensions. Mathematical and Computer Modelling, 27(1), 75–99.
DOE (1995). General Course on Classification/Declassification, Student Syllabus, Handouts, and Practical Exercises. US Department of Energy, Germantown, MD, USA.
DynMeridian (1996). Declassification Productivity Initiative Study Report. DynCorp Company, Report Prepared for the US Department of Energy, Germantown, MD, USA.
Fox, C. (1990). A stop list for general text. ACM Special Interest Group on Information Retrieval, 24(1–2), 19–35.
Harman, D. (1995). Overview of the second text retrieval conference (TREC-2). Information Processing and Management, 31(3), 271–289.
Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 5(3), 155–165.
Luhn, H. P. (1957). A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 4(4), 600–605.
Meadow, C. T. (1992). Text information retrieval systems. San Diego, CA: Academic Press.
Nieto Sanchez, S., Triantaphyllou, E., Chen, J., & Liao, T. W. (2002). An incremental learning algorithm for
constructing Boolean functions from positive and negative examples. Computers and Operations Research (in press).
Salton, G. (1968). Automatic information organization and retrieval. New York: McGraw-Hill.
Salton, G., & Wong, A. (1975). A vector space model for automatic indexing. Information Retrieval and Language Processing, 18(11), 613–620.
Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley.
Scholtes, J. C. (1993). Neural networks in natural language processing and information retrieval. The Netherlands: North-Holland.
Späth, H. (1985). Cluster dissection and analysis: Theory, Fortran programs, and examples. Chichester, UK: Ellis Horwood.
Shaw, W. M. (1995). Term-relevance computations and perfect retrieval performance. Information Processing and Management, 31(4), 312–321.
Triantaphyllou, E. (2001). The OCAT approach for data mining and knowledge discovery. Working Paper, IMSE Department, Louisiana State University, Baton Rouge, LA 70803-6409, USA.
Triantaphyllou, E., & Soyster, A. L. (1996a). On the minimum number of logical clauses inferred from examples. Computers and Operations Research, 23(8), 783–799.
Triantaphyllou, E., & Soyster, A. L. (1996b). An approach to guided learning of Boolean functions. Mathematical and Computer Modelling, 23(3), 69–86.
Triantaphyllou, E., & Soyster, A. L. (1995). A relationship between CNF and DNF systems which are derived from the same positive and negative examples. ORSA Journal on Computing, 7(3), 283–285.
Triantaphyllou, E., Soyster, A. L., & Kumara, S. R. T. (1994). Generating logical expressions from positive and negative examples via a branch-and-bound approach. Computers and Operations Research, 21(2), 185–197.
Triantaphyllou, E. (1994). Inference of a minimum size Boolean function from examples by using a new efficient branch-and-bound approach. Journal of Global Optimization, 5, 64–94.
Van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London: Butterworths.
Voorhees, E. (1998). Overview of the sixth text retrieval conference (TREC-6). In Proceedings of the sixth text retrieval conference (TREC-6), Gaithersburg, MD, USA (pp. 1–27).
Zipf, G. K. (1949). Human behavior and the principle of least effort. Menlo Park, CA: Addison-Wesley.