


The Electronic Primaries: Predicting the U.S. Presidency Using
Feature Selection with Safe Data Reduction
Pablo Moscato Luke Mathieson Alexandre Mendes Regina Berretta

Newcastle Bioinformatics Initiative


School of Electrical Engineering and Computer Science
Faculty of Engineering and the Built Environment
University of Newcastle
Callaghan NSW 2308

Abstract

The data mining inspired problem of finding the critical, most useful features with which to classify a data set, and of constructing rules to predict the class of future examples, is an interesting and important one. It is also one of the most useful problems, with applications in many areas such as microarray analysis, genomics, proteomics, pattern recognition, data compression and knowledge discovery. Expressed as k-Feature Set, it is also a formally hard problem. In this paper we present a method for coping with this hardness using the combinatorial optimisation and parameterized complexity inspired technique of sound reduction rules. We apply our method to an interesting data set which is used to predict the winner of the popular vote in the U.S. presidential elections. We demonstrate the power and flexibility of the reductions, especially when used in the context of the (α, β)k-Feature Set variant of the problem.

1 Introduction

The prediction of the next U.S. president is an important pastime of political pundits in the U.S. and the rest of the world. One might expect that such a complex, large problem, with many variables, would be insoluble to methodical systems. Despite this apparent difficulty, Lichtman and Keilis-Borok determined in 1981 (Lichtman & Keilis-Borok 1981) that a system of only twelve questions was needed to predict the swing of the popular vote in the United States.

This study uses this problem to demonstrate the generalisability of a system developed around the k-Feature Set problem and the determination of lesion pathologies in breast cancer cases (Mathieson et al. 2004). The prediction problem presented here gives a good 'toy'[1] problem to work with, as it allows a transparent demonstration of the principles of the reduction technique, and further clarification of the importance and usefulness of the confidence measures inherent in the system.

[1] Note that 'toy' refers here to the size of the problem, which in terms of a computational problem is tiny.

Copyright © 2005, Australian Computer Society, Inc. This paper appeared at the 28th Australasian Computer Science Conference (ACSC2005), The University of Newcastle, Australia. Conferences in Research and Practice in Information Technology, Vol. 38. Vladimir Estivill-Castro, Ed. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

2 k-Feature Set and its Variants

The k-Feature Set problem as considered here derives from the field of data mining and knowledge discovery. It asks whether, in a set of m examples, each with n features, there is a size-k subset of the features that explains some dichotomy within the examples. Formally:

k-Feature Set
Instance: A boolean m × n matrix M, a boolean m × 1 target vector T, and a positive integer k.
Parameter: k
Question: ∃S ⊆ [1, . . . , n], |S| ≤ k, such that ∀i, j ∈ [1, . . . , m] where T_i ≠ T_j, ∃s ∈ S such that M_i,s ≠ M_j,s?

Clearly this problem can be generalized quite easily to deal with non-boolean, discrete entries in both M and T. Unfortunately, not only is this problem NP-Complete (Davies & Russell 1994), it is also W[2]-Complete (Cotta & Moscato 2003). This intractability, in both the classical and the parameterized sense, suggests that the problem must be attacked through heuristic means. It is interesting to note, however, that although this problem is inherently hard (in the formal sense), there are still reduction rules that deal with it very effectively.

In this paper we are concerned with a generalisation of k-Feature Set called (α, β)k-Feature Set. This variant takes into consideration the possibility of choosing a set of features that maximizes the number of differences between two examples with different target values, and maximizes the similarities between examples with the same target value. Formally:

(α, β)k-Feature Set
Instance: A discrete valued m × n matrix M, a discrete valued m × 1 target vector T, and positive integers α, β and k.
Parameter: k
Question: ∃S ⊆ [1, . . . , n], |S| ≤ k, such that ∀i, j ∈ [1, . . . , m]:
  ◦ if T_i ≠ T_j, ∃S′ ⊆ S where |S′| ≥ α and ∀s ∈ S′, M_i,s ≠ M_j,s;
  ◦ if T_i = T_j, ∃S′ ⊆ S where |S′| ≥ β and ∀s ∈ S′, M_i,s = M_j,s?

Note that if we choose α = 1 and β = 0 then we return immediately to k-Feature Set. Thus this more general version is also intractable.
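To make the definition concrete, the following small Python sketch (ours, not from the paper; the function and variable names are illustrative) brute-force checks whether a candidate feature set S satisfies the (α, β) conditions.

```python
from itertools import combinations

def is_alpha_beta_feature_set(M, T, S, alpha, beta):
    """Check whether feature index set S satisfies the (alpha, beta)
    conditions: every pair of examples with different targets must
    differ on at least alpha features of S, and every pair with the
    same target must agree on at least beta features of S."""
    m = len(M)
    for i, j in combinations(range(m), 2):
        if T[i] != T[j]:
            # count features in S on which the two examples differ
            if sum(M[i][s] != M[j][s] for s in S) < alpha:
                return False
        else:
            # count features in S on which the two examples agree
            if sum(M[i][s] == M[j][s] for s in S) < beta:
                return False
    return True

# Toy usage: 3 examples, 3 features; S = {0, 2} separates the classes.
M = [[0, 1, 0], [1, 1, 1], [0, 0, 1]]
T = [0, 1, 1]
print(is_alpha_beta_feature_set(M, T, {0, 2}, alpha=1, beta=0))  # True
```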
2.1 (α, β)k-Feature Set as a Graph Problem

The (α, β)k-Feature Set problem can easily be represented in the form of a bipartite graph, where we have a vertex for each feature, and a vertex for each pair of examples. Edges are inserted between a pair vertex and a feature vertex when the examples differ in that feature, if they are from different classes[2], or when they are the same in that feature, if they are from the same class. Now it is easy to see that the k-Feature Set problem is equivalent to the Red/Blue-Dominating Set problem (Downey & Fellows 1997), where we must choose at most k of the feature vertices to dominate the pair vertices. For the (α, β)k-Feature Set variant that we deal with, we must choose at most k vertices from the set of feature vertices that dominate pairs representing examples from different classes α times, and dominate pairs representing examples from the same class β times. Precisely:

Given an instance of the (α, β)k-Feature Set problem, construct a bipartite graph G = (A ∪ B, F, E), such that ∀i, j ∈ [1, . . . , m], i ≠ j, if T_i ≠ T_j then ∃v_ij ∈ A, and ∃v_ij ∈ B otherwise. Further, ∀s ∈ [1, . . . , n], ∃f_s ∈ F. Then, ∀i, j ∈ [1, . . . , m], i ≠ j, and ∀s ∈ [1, . . . , n], if v_ij ∈ A and M_i,s ≠ M_j,s, or if v_ij ∈ B and M_i,s = M_j,s, then ∃(v_ij, f_s) ∈ E.

That is, we construct the graph out of three sets of vertices: A, which we will call the 'alpha' vertices, B, the 'beta' vertices, and F, the 'feature' vertices. Each 'alpha' vertex represents a pair of examples from different classes (i.e. that have different entries in the target vector T). Each 'beta' vertex represents a pair of examples from the same class. Each feature in the original matrix has its own vertex in F. Edges exist only from A to F and from B to F[3]. If, in the original data matrix, two examples i, j have different classes, and differ in feature f, then there is an edge from vertex v_ij ∈ A to vertex v_f. If they are in the same class, and are the same in feature f, then we place an edge between v_ij ∈ B and v_f.

[2] The examples i, j are in different classes if T_i ≠ T_j.
[3] The edges are undirected; the use of 'to' is not strict.
2.2 Data Reduction

With many intractable problems, the technique of data reduction, or reduction to a 'problem kernel' in the parameterized setting, is an important method for creating practical tools, as it often allows algorithms that are efficient in the size of the instance, with the non-polynomial component confined to the parameter, which is fixed for a given instance. The fundamental idea of data reduction is to pre-process the instance with a set of rules that reduce the size of the data set without losing optimal solutions. This sort of technique has long been used in Operations Research and related fields, but has received very little attention outside of those areas. With the formalisation of the concept of data reduction and the development of accompanying analysis and complexity tools by Downey and Fellows (Downey & Fellows 1997), and the demonstration that many of these long-standing reduction techniques are not in fact heuristic, this approach for dealing with large, complex data sets is slowly gaining momentum.

The reductions that we apply to the (α, β)k-Feature Set problem come from Weihe (Weihe 1998), as applied to the Red/Blue Dominating Set problem[4], with appropriate modification for α and β domination greater than 1 and 0 respectively (Cotta, Sloper & Moscato 2004). First define d_v to be the number of times it is necessary to further dominate the vertex v; initially d_v will be either α or β, according to whether v is in A or B. Further, let N(u) denote the (open) neighbourhood of u, that is, all vertices attached to u, not including u itself. The rules are as follows (a code sketch of Rule 1 is given at the end of this subsection):

1. If there exists a pair vertex v with deg(v) = d_v, then
   • add N(v) to the dominating set
   • for each f ∈ N(v), decrease d_u for every u ∈ N(f)
   • remove v and N(v) from the graph

2. If there exist two feature vertices f1 and f2 such that N(f1) ⊂ N(f2), and deg(u) − d_u > 0 for every u ∈ N(f1), then
   • remove f1 from the graph

3. If there exist two pair vertices v1 and v2 such that N(v1) ⊂ N(v2) and d_v1 ≥ d_v2, then
   • remove v2 from the graph

In other words:

1. If there is a pair vertex that needs to be dominated x more times to reach the requisite domination number (α or β as appropriate), and it is only connected to x features, all these features must be in the feature set. Thus for each of these x features we can mark its neighbours as being dominated by that feature, and remove that feature from the graph. Further, as the original pair vertex is now sufficiently dominated, we can remove it from the graph too.

2. If we have two features, one whose neighbourhood is a subset of the other's, and for every pair vertex attached to the smaller feature we do not need both to reach the appropriate domination number, then the smaller feature would never be chosen over the larger feature, as the larger feature does all the work of the smaller feature, and possibly more; thus we can safely remove the smaller feature[5].

3. If we have two pair vertices, such that the neighbourhood of one is a subset of the neighbourhood of the other, and the smaller needs to be dominated at least as many times as the larger, then we know that we are going to have to choose sufficiently many of the smaller's neighbours to also dominate the larger; thus we need not consider the larger, as it will be automatically dealt with if the smaller is[6].

[4] These rules do have an earlier genesis, in combinatorial optimisation at least; Weihe however provides an eminently relevant formulation and application.
[5] Note here that the rule is specifically stated for strict subsets. It can be reformulated such that if the two features have precisely the same neighbourhood, they are merged, rather than one being deleted, as they would be equivalent in their use. We have chosen not to use this as it complicates the final procedure of producing decision trees and rules, and we currently have no method for dealing with this complication.
[6] Again, we can potentially merge pair vertices here if the neighbourhoods were equal. However, as the pair vertices are merely representatives, and are not produced in the solution, merging is an unnecessary complication.
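As a concrete illustration of the first rule (our own sketch, not code from the paper; it assumes the dictionary layout of the builder above plus a per-pair demand map d_v), the following applies Rule 1 to exhaustion. Rules 2 and 3 follow the same pattern, using the subset tests described above.

```python
def apply_rule_one(pairs, demand):
    """pairs:  {pair_id: set of adjacent feature ids}
       demand: {pair_id: remaining dominations d_v}
       Rule 1: a pair vertex whose degree equals its remaining demand
       forces all of its neighbouring features into the solution.
       Applies the rule to exhaustion; returns the forced features."""
    forced = set()
    while True:
        tight = next((v for v in pairs if len(pairs[v]) == demand[v]), None)
        if tight is None:
            return forced
        feats = pairs.pop(tight)          # the tight pair is now satisfied
        demand.pop(tight)
        forced |= feats
        for f in feats:                   # each forced feature dominates
            for u in list(pairs):         # every pair it is attached to
                if f in pairs[u]:
                    pairs[u].remove(f)
                    demand[u] -= 1
                    if demand[u] <= 0:    # pair fully dominated; remove it
                        pairs.pop(u)
                        demand.pop(u)
```

On the election data, this is the rule that forces Q3, Q7, Q8, Q11 and Q12 into every α = β = 2 feature set (see Section 4.1).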
What remains after these reductions have been applied completely, and no further reduction can be applied, is called the 'kernel', or 'problem kernel'.

These reduction rules allow pre-processing of the data to indicate features that must be in the feature set (Rule 1), features that can be discarded (Rule 2), and pairs that provide no extra information about the solution (Rule 3), and they reduce the size of the graph in general. Often this reduction is quite significant (q.v. (Mathieson et al. 2004) for further examples), and allows the kernel of the problem to be solved quickly by a heuristic method, or, if the reduction is sufficient, by complete enumeration.

The real importance of these reductions is that they not only significantly reduce the data, but in doing so they preserve the optimality of the solution. This is an important distinction from many other data reduction techniques that cull or condense data, but do not guarantee that the optimal solution is still present in the reduced data set. This distinction is especially important when dealing with instances that derive from areas such as oncology and radiology, as destruction of the optimal solution could lead to incorrect conclusions.
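Because the kernels produced are so small, the complete enumeration mentioned above is often practical. A minimal sketch (ours; it reuses the hypothetical is_alpha_beta_feature_set checker from Section 2):

```python
from itertools import combinations

def all_minimum_feature_sets(M, T, alpha, beta, features=None):
    """Enumerate every feature set of minimum size satisfying the
    (alpha, beta) conditions, by exhaustive search over subset sizes.
    'features' restricts the search, e.g. to the kernel's candidates."""
    n = len(M[0])
    candidates = list(range(n)) if features is None else list(features)
    for k in range(1, len(candidates) + 1):
        hits = [set(S) for S in combinations(candidates, k)
                if is_alpha_beta_feature_set(M, T, set(S), alpha, beta)]
        if hits:  # the first size with any solution is the minimum
            return hits
    return []
```

A search of this kind is what confirms, for instance, that exactly nine minimal feature sets exist for α = β = 2, and 23 for α = 1, β = 0 (Figures 2 and 3).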
3 (α, β)k-Feature Set as a Tool

A solution to the (α, β)k-Feature Set problem for a given set of data gives us a minimal set of features needed to correctly classify all of the examples in the data set. Thus the ability to solve (α, β)k-Feature Set for a given set of data gives us a powerful tool for revealing, in a large data set, what the underlying controlling factors are. This knowledge then allows us both to explore the forces in action in the system represented by the data, and to guide our predictions about new, unclassified examples.

Also, by selecting different α and β values, we can explore results of varying confidence levels. The most notable approach here is to compare the results from choosing positive α, and either zero or positive β. The first option gives a feature set that purely considers the smallest explanation for the existence of the separate classes within the data. The second gives a set of features that also explains something about why examples in the same class are actually in that class, not just why they are not in another class. Increasing the α and β values also gives robustness against error. These increased values force a redundant explanation of why the two examples in a pair are in different classes, or in the same one. Thus we can expect an explanation that offers multiple points on which to make a decision, so that if a small number of these were in error, it would still be possible to categorise the example at hand on a majority basis. This is not infallible, of course; a large number of errors in the data can still lead to incorrect classification, but this is true of all systems.

The reduction rules in themselves also provide an invaluable tool for analysing the data. As they only go so far as to indicate which features are definitely needed, and those that never will be (and those that are equivalent), the remaining kernel is open to solution by a technique of choice. This allows not only the comparison of different solution techniques but, due to the generally small size of the kernel, the application of methods that guarantee optimality, such as complete enumeration, is often feasible. Thus we can often find not only one optimal solution, but all of them.

Another aspect that can be examined is the dominating power of a given feature set. For example, we may have several feature sets of size x, but a given feature set f may dominate more example pairs than the others (that is, the sum of the degrees of the features in the feature set is higher; naturally, two features may thus dominate the same pair vertex). We may expect from this that perhaps f is a better feature set in some fashion, as it has features which are in some sense more applicable.

Thus we exploit the solution to the (α, β)k-Feature Set problem in many ways. Most simply, we undertake the procedure described in 2.1 and 2.2, and then apply some heuristic to the kernel to produce a feature set which we can then proceed to examine. We can also vary the α and β values to create smaller feature sets with lower robustness. The small kernel size also allows us to produce all the feature sets of a given size (minimal or otherwise), and compare their similarities.

We also use the WEKA software package[7]. WEKA offers several algorithms and heuristics for building rule sets and decision trees, such as C4.5 and ID3. We have employed several techniques to allow examination of the stability and reliability of our results under different heuristics. We used the ID3, J48, PART and PRISM heuristics, but found that ID3 and J48 continually produced the same decision trees.

4 The Election Question

The problem that we apply our system to, presented in (Lichtman & Keilis-Borok 1981), is to use the answers to a set of yes/no questions to classify, and subsequently predict, the outcome of the popular vote in U.S. presidential elections. Lichtman and Keilis-Borok present twelve questions[8], using which they are able to 'predict' the outcome of the popular vote in all U.S. presidential elections up to 1980.

Q4?
|- No:  Q12?
|       |- No:  IV (16)
|       '- Yes: Q7?
|               |- No:  CV (3)
|               '- Yes: IV (1)
'- Yes: Q8?
        |- No:  CV (7)
        '- Yes: Q9?
                |- No:  IV (1)
                '- Yes: CV (3)

Figure 1: Decision tree for α = β = 2. 'IV' indicates an incumbent victory, 'CV' a challenger victory. The numbers beside each leaf indicate how many of the examples in the data set they classify.

[7] http://www.cs.waikato.ac.nz/ml/weka/
[8] Reproduced here in Table 1, with answers in Table 2.
#   Question
1   Has the incumbent party been in office more than a single term? (no)
2   Did the incumbent party gain more than 50% of the vote cast in the previous election? (yes)
3   Was there major third party activity during the election year? (no)
4   Was there a serious contest for the nomination of the incumbent party candidate? (no)
5   Was the incumbent party candidate the sitting president? (yes)
6   Was the election year a time of recession or depression? (no)
7   Was the yearly mean per capita rate of growth in real gross national product during the incumbent administration equal to or greater than the mean rate in the previous 8 years, and equal to or greater than 1%? (yes)
8   Did the incumbent president initiate major changes in national policy? (yes)
9   Was there major social unrest in the nation during the incumbent administration? (no)
10  Was the incumbent administration tainted by a major scandal? (no)
11  Is the incumbent party candidate charismatic or a national hero? (yes)
12  Is the challenging party candidate charismatic or a national hero? (no)

Table 1: The 12 questions presented by Lichtman and Keilis-Borok. The answers in parentheses favor the incumbent party.

             Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9   Q10  Q11  Q12   Pairs dominated
             0    1    1    1    1    1    1    1    0    0    1    1     2359
             0    1    1    1    1    0    1    1    1    0    1    1     2371
             0    1    1    1    0    1    1    1    0    1    1    1     2347
             0    1    1    1    0    1    1    1    1    0    1    1     2359
             1    1    1    1    0    1    1    1    0    0    1    1     2329
             1    1    1    1    0    0    1    1    1    0    1    1     2341
             1    0    1    1    0    1    1    1    1    0    1    1     2359
             0    0    1    1    0    1    1    1    1    1    1    1     2377 *
             1    0    1    1    0    0    1    1    1    1    1    1     2359
Appearances  4    6    9    9    2    6    9    9    6    3    9    9
Popularity   0.44 0.67 1    1    0.22 0.67 1    1    0.67 0.33 1    1
Weight       237  237  255  357  267  255  245  245  267  255  231  267

Figure 2: The feature sets for α = β = 2. The feature sets are laid out in rows, with a '1' indicating that the feature represented by that column is present, and a '0' that it is not. 'Appearances' indicates the number of times a feature appears in total across all feature sets, 'Popularity' gives this as a ratio of the total number of feature sets, and 'Weight' indicates how many pairs the feature can dominate if it is in the feature set. The final column gives the number of pairs each feature set dominates, with the largest marked '*'.
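The summary rows of Figures 2 and 3 are simple aggregates of the feature-set matrix; in particular, summing the 'Weight' entries of a set's features reproduces its 'Pairs dominated' entry (255 + 357 + 255 + 245 + 245 + 267 + 255 + 231 + 267 = 2377 for the starred set above). A short sketch (ours) of how these statistics can be computed:

```python
def feature_set_statistics(rows, weights):
    """rows:    list of 0/1 lists, one per feature set (as in Figure 2)
       weights: pairs dominated by each feature when it is selected
       Returns per-feature (appearances, popularity) and the overall
       domination score of each feature set, overlaps included."""
    n = len(rows[0])
    appearances = [sum(row[s] for row in rows) for s in range(n)]
    popularity = [a / len(rows) for a in appearances]
    dominated = [sum(w for bit, w in zip(row, weights) if bit) for row in rows]
    return appearances, popularity, dominated
```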

4.1 Our Results

Several experiments were conducted with the data from (Lichtman & Keilis-Borok 1981). Initial inspection of the graph indicated that the maximum possible α and β were 2 and 2. The graph, with beta vertices, consisted of 12 feature vertices corresponding to the 12 questions, and 465 pair vertices: 234 'alpha' vertices and 231 'beta' vertices. When β = 0 was considered, the graph contained only the 234 'alpha' vertices, as the 'beta' vertices can be immediately discarded.

When α = 2 and β = 2, the reduction rules added five features to the feature set (Q3, Q7, Q8, Q11 and Q12), discarded no features, and reduced the number of pair vertices from 465 to only 13. The minimal feature set size for α = β = 2 was nine features, with nine different feature sets possible[9] (see Figure 2). Notably, if α = β = 2, then Q3, Q7, Q8, Q11 and Q12 must be in the feature set, as they are all attached to pair vertices of degree 2 (and thus these pair vertices require those features in order to be dominated the requisite number of times). Interestingly, Q4 was required if we want to achieve the minimally sized feature set, but no reduction indicated this. Out of these nine feature sets, the one consisting of the six common features plus features Q6, Q9 and Q10 dominated the greatest number of pair vertices (including overlaps), 2377 (see Figure 2). The next nearest feature set in these terms dominated 2371 pairs, and consisted of the six common features plus features Q2, Q5 and Q9. The decision tree for the first feature set (that which dominates 2377 pair vertices) can be seen in Figure 1 (developed with the J48 and ID3 heuristics, both giving the same answer). It is noted, however, that at this point we do not have any algorithms or heuristics able to build decision trees or rule sets that take advantage of the redundancy available with α > 1 and β > 0; thus the tree created for this feature set does not use all the features available to it. The sets of rules created using the PART and PRISM heuristics are presented in Tables 3 and 4.

The feature sets for α = β = 1 were also generated. In this case the reduction rules did not indicate that any features had to be in the kernel (unsurprisingly, as there were no degree 1 pair vertices), and did not indicate that any of the features were irrelevant. They did, however, reduce the number of pair vertices from 465 to 43: 30 'alpha' and 13 'beta'. From this kernel we determined that there are two minimal feature sets for α = β = 1 for this data, (Q4, Q5, Q7, Q9, Q12) and (Q2, Q3, Q4, Q7, Q8). Of the two, the first dominated the most pair vertices, 1403 compared to 1339. The decision trees for these two feature sets are shown in Figures 4 and 5. Note that the decision tree for the feature set that dominates more pair vertices is more compact, suggesting perhaps that this extra domination indicates greater discriminatory power. The classification rules generated by the PRISM heuristic are shown in Table 5, and the rules from the PART heuristic in Table 6.

Feature sets were also generated for α = 1, β = 0. Of these there were 23, all of size 5, which between them used all of the features. The results of this can be seen in Figure 3, which also includes some additional information about the feature sets. The decision tree for the feature set with the greatest dominating power (Q4, Q5, Q8, Q9, Q12), Figure 6, was generated, as were the rules from the PART (Table 8) and PRISM (Table 7) heuristics. In this case the reduction rules removed no features and added none to the feature set, as in the case of α = β = 1, but reduced the number of pair vertices from 234 to only 41.

[9] These feature sets were confirmed as the only nine by complete enumeration.
Year  Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12  Target
1864  0  0  0  0  1  0  0  1  1  0   0   0    1
1868  1  1  0  0  0  0  1  1  1  0   1   0    1
1872  1  1  0  0  1  0  1  0  0  0   1   0    1
1880  1  0  0  1  0  0  1  1  0  0   0   0    1
1888  0  0  0  0  1  0  0  0  0  0   0   0    1
1900  0  1  0  0  1  0  1  0  0  0   0   1    1
1904  1  1  0  0  1  0  1  0  0  0   1   0    1
1908  1  1  0  0  0  1  0  1  0  0   0   0    1
1916  0  0  0  0  1  0  0  1  0  0   0   0    1
1924  0  1  1  0  1  0  1  1  0  1   0   0    1
1928  1  1  0  0  0  0  1  0  0  0   0   0    1
1936  0  1  0  0  1  1  1  1  0  0   1   0    1
1940  1  1  0  0  1  1  1  1  0  0   1   0    1
1944  1  1  0  0  1  0  1  1  0  0   1   0    1
1948  1  1  1  0  1  0  0  1  0  0   0   0    1
1956  0  1  0  0  1  0  1  0  0  0   1   0    1
1964  0  0  0  0  1  0  1  0  0  0   0   0    1
1972  0  0  0  0  1  0  0  1  1  0   0   0    1
1860  1  0  1  1  0  0  1  0  1  0   0   0    0
1876  1  1  0  1  0  1  0  0  0  1   0   0    0
1884  1  0  0  1  0  0  1  0  1  0   1   0    0
1892  0  0  1  0  1  0  0  1  1  0   0   1    0
1896  0  0  0  1  0  1  0  1  1  0   1   0    0
1912  1  1  1  1  1  0  1  0  0  0   0   0    0
1920  1  0  0  1  0  1  0  1  1  0   0   0    0
1932  1  1  0  0  1  1  0  0  1  0   0   1    0
1952  1  0  0  1  0  0  0  0  0  1   0   1    0
1960  1  1  0  0  0  1  0  0  0  0   0   1    0
1968  1  1  1  1  0  0  1  1  1  0   0   0    0
1976  1  1  0  1  1  0  0  0  0  1   0   0    0
1980  0  0  1  1  1  1  1  0  0  1   0   1    0

Table 2: The data set presented by Lichtman & Keilis-Borok (1 = yes, 0 = no). The target column represents the winner of the popular vote (1 = incumbent, 0 = challenger).

Rule                    Outcome
(Q4 = 1) & (Q8 = 0)     Challenger Victory
(Q4 = 1) & (Q9 = 1)     Challenger Victory
(Q12 = 1) & (Q7 = 0)    Challenger Victory
(Q4 = 0) & (Q12 = 0)    Incumbent Victory
(Q7 = 1) & (Q4 = 0)     Incumbent Victory
(Q8 = 1) & (Q9 = 0)     Incumbent Victory

Table 3: Classification rules generated by the PRISM heuristic for α = β = 2. Here, as in the decision tree, the potential robustness given by high α and β values is not exploited; thus not all features from the feature set are used.

Rule                    Outcome
(Q4 = 1) & (Q8 = 0)     Challenger Victory (7.0)
(Q9 = 0) & (Q12 = 0)    Incumbent Victory (14.0)
(Q3 = 0) & (Q6 = 0)     Incumbent Victory (4.0)
Otherwise               Challenger Victory (6.0)

Table 4: Classification rules generated by the PART heuristic for α = β = 2. These rules are applied in a cascade fashion beginning with rule 1. The accompanying numbers indicate how many examples each rule classifies out of those left unclassified by the previous rules.
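The cascades of Tables 4, 6 and 8 have simple semantics: the first rule whose antecedent matches fires. The sketch below (ours; the encoding of rules as (question, value) pairs is an assumption) applies the Table 4 cascade to a row of Table 2.

```python
# Each rule: (list of (question, required value) conditions, predicted class).
# None for the conditions encodes the final 'Otherwise' rule.
TABLE_4_CASCADE = [
    ([("Q4", 1), ("Q8", 0)], "Challenger Victory"),
    ([("Q9", 0), ("Q12", 0)], "Incumbent Victory"),
    ([("Q3", 0), ("Q6", 0)], "Incumbent Victory"),
    (None, "Challenger Victory"),
]

def classify(example, cascade=TABLE_4_CASCADE):
    """example: dict mapping 'Q1'..'Q12' to 0/1 answers.
       Returns the outcome of the first rule that matches."""
    for conditions, outcome in cascade:
        if conditions is None or all(example[q] == v for q, v in conditions):
            return outcome
    raise ValueError("cascade had no default rule")

# 1948 from Table 2: Q4 = 0, Q9 = 0, Q12 = 0, so rule 2 fires.
print(classify({"Q1": 1, "Q2": 1, "Q3": 1, "Q4": 0, "Q5": 1, "Q6": 0,
                "Q7": 0, "Q8": 1, "Q9": 0, "Q10": 0, "Q11": 0, "Q12": 0}))
# Incumbent Victory
```

Applied to all 31 rows of Table 2, this cascade reproduces the per-rule counts of Table 4 (7, 14, 4 and 6) and classifies every example correctly.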
5 Discussion

5.1 (α, β)k-Feature Set in General

Firstly, it is clear that the application of tools for solving (α, β)k-Feature Set can allow the simplification of the answers to problems of this kind. In this particular case we see that we only need five out of the twelve questions to correctly classify the examples present in the data set (α = 1, β = 0). Of course, we also see that there are many different combinations of five features that can achieve this (Figure 3). Even to cover[10] each example twice (α = β = 2), and thus provide greater robustness against error, we need only nine of the twelve questions. The discovery of these core features and feature sets is greatly facilitated by the restructuring of the (α, β)k-Feature Set problem as a Red/Blue Dominating Set problem, and the subsequent ability to apply powerful reduction rules. Keep in mind that in general both of these problems are formally hard, even in a parameterized setting. That we are able to apply such simple rules, and yet consistently produce problem kernels that are feasible to completely enumerate, without loss of the optimality of the solution, demonstrates the power of this methodology. Even on such a small data set, the reductions in the size of the problem afforded by this approach are appreciable, with the problem reduced by ∼80% or more. This allows the quick and easy application of almost any technique desired to uncover the complete answer, up to and including complete enumeration. Further, as this system provides good answers in a 'short'[11] amount of time, it can also easily be used as a preprocessing step for other techniques such as ANNs and the like.

The use of higher α and β values also seems to have an effect on the results regarding the decision trees and rules. Higher α and β seem to provide more room for strong, but not apparently necessary, features to be included in the minimal feature set, thus leading to more compact classification and prediction tools. This, unfortunately, is subject to the problem discussed next.

One of the problems with this technique at the moment is that we currently have no formal way of exploiting high α and β values. Any information regarding the robustness of the feature set, or any information about within-class similarity[12], is essentially discarded when we move to the application of techniques such as ID3 or PRISM, etc., for the generation of classification tools.

5.2 The Election Question and Our Results

Examination of our results for this data set shows some interesting characteristics of the data.

[10] Note that we use the word 'cover' lightly here. There may be no relation to what are commonly referred to as 'covering' problems.
[11] 'Short' of course being a rather relative term when NP-Hard problems are concerned. Here, however, the time is a matter of milliseconds at most, and is most heavily influenced by the amount of screen output desired.
[12] This is also potentially an area that can be exploited in regards to automatic class generation.
             Q1       Q2       Q3       Q4       Q5       Q6       Q7       Q8       Q9       Q10      Q11      Q12        Pairs dominated
             0        0        1        1        0        1        0        1        0        0        0        1          600
             0        0        1        0        0        1        0        1        1        0        0        1          550
             0        0        1        1        0        1        0        1        1        0        0        0          630
             0        1        1        1        0        1        0        1        0        0        0        0          627
             0        1        0        1        0        1        0        1        0        0        0        1          624
             0        1        0        1        0        1        0        0        1        0        0        1          620
             0        0        0        1        0        1        0        1        1        0        0        1          627
             1        1        0        1        0        0        0        0        1        0        0        1          626
             1        0        0        1        0        0        0        1        1        0        0        1          633
             0        0        0        1        1        0        0        1        1        0        0        1          648 *
             0        1        0        1        1        0        0        0        1        0        0        1          641
             0        0        0        1        1        0        0        0        1        1        0        1          598
             0        0        0        1        1        1        0        0        1        0        0        1          632
             0        0        0        1        1        0        1        0        1        0        0        1          647
             0        1        1        0        1        0        1        1        0        0        0        0          601
             0        1        1        1        0        0        1        1        0        0        0        0          642
             1        0        1        1        0        0        1        1        0        0        0        0          639
             0        0        1        1        0        0        1        1        0        0        0        1          615
             0        0        1        1        0        0        1        0        0        0        1        1          587
             0        0        1        1        0        0        1        0        1        0        0        1          611
             0        0        0        1        0        0        1        1        1        0        0        1          642
             0        1        0        1        0        0        1        0        1        0        0        1          635
             0        1        0        1        0        0        1        1        0        0        0        1          639
Appearances  3        9        10       21       6        8        10       14       14       1        1        18
Popularity   0.130435 0.391304 0.434783 0.913043 0.26087  0.347826 0.434783 0.608696 0.608696 0.043478 0.043478 0.782609
Weight       117      120      96       173      132      111      126      127      123      77       99       93

Figure 3: All 23 feature sets for α = 1, β = 0. The feature sets are laid out in rows, with a '1' indicating that the feature represented by that column is present, and a '0' that it is not. 'Appearances' indicates the number of times a feature appears in total throughout the 23 feature sets, 'Popularity' gives this as a ratio of the total number of feature sets, and 'Weight' indicates how many pairs the feature can dominate if it is in the feature set. The final column gives the number of pairs each feature set dominates, with the largest marked '*'.

We see several features repeatedly being used in most feature sets, indicating that they are probably the most important (and at least have high discriminatory power). The most obvious of these is Q4 (Was there a serious contest for the nomination of the incumbent party candidate?), which occurs in all but two of the feature sets, and in roughly half of the rules generated by the PART and PRISM heuristics. Similarly, it appears that Q12 (Is the challenging party candidate charismatic or a national hero?) is a highly important feature, and thus a significant factor in the choice of vote winners. Q7 (the short version being 'Was the economy strong under the incumbent administration?') is also one of the more important features, and appears regularly in the feature sets, decision trees and rules.

Comparing feature sets for fixed values of α and β (Figures 3 & 2), there seem to be no obviously significant trends other than those mentioned above. Some pairs of features seem to be interchangeable once the 'important' features (Q4 and Q12) are present, such as Q8 & Q9 for α = 1, β = 0, where if one is not present, the other almost certainly is, though obviously this is not a rule. Other features seem to have little import at all, especially Q10 (Was the incumbent administration tainted by a major scandal?), which is almost never used[13]. Interestingly, for α = 1, β = 0, Q11 (Is the incumbent party candidate charismatic or a national hero?), the complementary feature of Q12, is almost never used. However, when we ask for α = β = 2, it is vital. It seems reasonable to suggest that α = β = 2 indicates the more subtle interactions present in the data, which only appear when more complex feature sets are considered.

5.3 Our Prediction

Beginning with the decision trees, Figures 1, 4, 5, and 6, we choose the answer to Q4 to be 'no', which we think is a reasonable and obvious answer. If the answer to Q4 were to be 'yes', however, the challenger would be at a significant advantage, with 10 out of 11 possible outcomes favouring him in three out of the four trees. There would also be significantly more rules that would be potentially applicable. Based on this, it seems that instability in the incumbent party is one of the most significant factors.

From there, we consider the answer to Q12 to be 'yes'[14], although we are uncertain of this, being somewhat outside the system. It is interesting to consider that currently (at the time of writing), the Republican election effort seems to be directed towards reversing this opinion (i.e. making the answer to Q12 'no'). If they are successful at this, then three out of four of our decision trees (the fourth doesn't consider Q12) indicate an incumbent victory. Notably, if Q12 were 'no', the path in the tree leading to this decision is much shorter, and classifies 16 out of the 31 examples, suggesting that U.S. elections rely largely on a stable incumbent administration and the discrediting of the challenger. This suggests that the current Republican tactic is a wise and time-honoured one, and that perhaps they are aware of this.

We believe the answer to Q7 to also be 'yes'. From our basic research, U.S. economic growth seems to be strong, though it appears to have weakened at least in the short term; the growth rate is still above that specified in Q7. If the answer to Q7 is in fact 'no', then three of the decision trees indicate that the challenging party will win, making the Republicans' attack on John Kerry's persona even more relevant, as a change in Q12 would then change the outcome of the vote.

From these three answers we have the result 'Incumbent Victory' for three out of four decision trees. For the last we must also answer Q5 and Q9. Addressing Q9 first, we consider that there has been no major social unrest in the U.S., though again we are far from experts, and find the question to be ambiguous.

[13] The implications of this we leave to the reader.
[14] At the time of writing the current challenging candidate was John Kerry, who was decorated 5 times, including 3 Purple Hearts, in the Vietnam conflict, and also seems to be reasonably charismatic.
Although the U.S. is currently involved in an overseas conflict, and some protests have occurred, this does not seem to be an unusual position for the nation, and does not constitute major unrest within it. The answer to Q5 is currently 'yes' (George W. Bush), thus indicating a victory for the incumbent party. If the answer to Q9 were 'yes', a 'Challenger Victory' would be the result, which seems a reasonable likelihood, since significant unrest would indicate unhappiness with the incumbent administration.

Turning then to the rules generated using the PRISM heuristic: based on the above answers to the questions we get 'Incumbent Victory' from Tables 3 (rule 5) and 5 (rule 6). If we consider the answer to Q8 (Did the incumbent president initiate major changes in national policy?) to be 'yes', which seems reasonable if we consider the USA PATRIOT Act[15] and so on, we also get 'Incumbent Victory' from Table 7 (rule 6). The rule cascades generated using the PART heuristic also indicate an 'Incumbent Victory', using the answers to the questions as above, except in the case of α = β = 2, where we also require the answers to Q3 and Q6 (both of which we believe to be 'no'), but we still get the same result.

From these results we consider that a Republican victory (in the popular vote) is most likely in this coming election. Unfortunately, of course, we are no more qualified than any other non-historian or political scientist to judge the validity of these answers. However, Dr. Lichtman and Dr. Keilis-Borok, authors of the original 1981 paper (Lichtman & Keilis-Borok 1981) that this work is based on, and of several subsequent related articles and developments, have also predicted that this coming election will see an incumbent victory[16].

Rule                              Outcome
(Q4 = 1) & (Q8 = 0)               Challenger Victory
(Q4 = 1) & (Q7 = 0)               Challenger Victory
(Q3 = 1) & (Q2 = 0)               Challenger Victory
(Q4 = 1) & (Q2 = 1)               Challenger Victory
(Q7 = 0) & (Q8 = 0) & (Q2 = 1)    Challenger Victory
(Q4 = 0) & (Q7 = 1)               Incumbent Victory
(Q4 = 0) & (Q8 = 1) & (Q3 = 0)    Incumbent Victory
(Q4 = 0) & (Q2 = 0) & (Q3 = 0)    Incumbent Victory
(Q8 = 1) & (Q2 = 1) & (Q4 = 0)    Incumbent Victory
(Q8 = 1) & (Q7 = 1) & (Q2 = 0)    Incumbent Victory

Table 5: Classification rules generated by the PRISM heuristic for α = β = 1 from the feature set (Q2, Q3, Q4, Q7, Q8).

Rule                              Outcome
(Q4 = 1) & (Q8 = 0)               Challenger Victory (7.0)
(Q4 = 0) & (Q7 = 1)               Incumbent Victory (11.0)
(Q4 = 1) & (Q7 = 0)               Challenger Victory (2.0)
(Q2 = 0) & (Q3 = 0)               Incumbent Victory (5.0)
(Q8 = 0)                          Challenger Victory (2.0)
(Q2 = 1) & (Q4 = 0)               Incumbent Victory (2.0)
Otherwise                         Challenger Victory (2.0)

Table 6: Classification rules generated by the PART heuristic for α = β = 1. These rules are applied in a cascade fashion beginning with rule 1. The accompanying numbers indicate how many examples each rule classifies out of those left unclassified by the previous rules.

Q4?
|- No:  Q12?
|       |- No:  IV (16)
|       '- Yes: Q7?
|               |- No:  CV (3)
|               '- Yes: IV (1)
'- Yes: Q7?
        |- No:  CV (5)
        '- Yes: Q9?
                |- No:  Q5?
                |       |- No:  IV (1)
                |       '- Yes: CV (2)
                '- Yes: CV (3)

Figure 4: Decision tree for α = β = 1. 'IV' indicates an incumbent victory, 'CV' a challenger victory. The numbers beside each leaf indicate how many of the examples in the data set they classify.

Rule                              Outcome
(Q4 = 1) & (Q8 = 0)               Challenger Victory
(Q4 = 1) & (Q9 = 1)               Challenger Victory
(Q12 = 1) & (Q9 = 1)              Challenger Victory
(Q12 = 1) & (Q5 = 0)              Challenger Victory
(Q4 = 0) & (Q12 = 0)              Incumbent Victory
(Q9 = 0) & (Q8 = 1)               Incumbent Victory
(Q4 = 0) & (Q9 = 0) & (Q5 = 1)    Incumbent Victory

Table 7: Classification rules generated by the PRISM heuristic for α = 1, β = 0 from the feature set (Q4, Q5, Q8, Q9, Q12).

Rule                              Outcome
(Q4 = 1) & (Q8 = 0)               Challenger Victory (7.0)
(Q9 = 0) & (Q12 = 0)              Incumbent Victory (14.0)
(Q4 = 0) & (Q12 = 0)              Incumbent Victory (3.0)
(Q9 = 1)                          Challenger Victory (5.0)
(Q5 = 0)                          Challenger Victory (1.0)
Otherwise                         Incumbent Victory (1.0)

Table 8: Classification rules generated by the PART heuristic for the α = 1, β = 0 feature set (Q4, Q5, Q8, Q9, Q12). These rules are applied in a cascade fashion beginning with rule 1. The accompanying numbers indicate how many examples each rule classifies out of those left unclassified by the previous rules.

[15] Available from: http://www.fincen.gov/pa_main.html
[16] http://www.counterpunch.org/lichtman07292004.html; https://www.signonsandiego.com/uniontrib/20040509/news_1n9predict.html; http://hnn.us/articles/6599.html
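To make the prediction procedure of Section 5.3 concrete, the following small sketch (ours, not from the paper; the nested-tuple encoding of the tree is an assumption) walks the Figure 1 tree with the answers argued for in the text.

```python
# A tree is either a leaf string or (question, subtree_if_no, subtree_if_yes).
FIGURE_1_TREE = (
    "Q4",
    ("Q12", "IV", ("Q7", "CV", "IV")),   # Q4 = no branch
    ("Q8", "CV", ("Q9", "IV", "CV")),    # Q4 = yes branch
)

def predict(tree, answers):
    """Walk the decision tree; answers maps question -> 0 (no) / 1 (yes)."""
    while isinstance(tree, tuple):
        question, if_no, if_yes = tree
        tree = if_yes if answers[question] else if_no
    return tree

# The answers argued for in Section 5.3: Q4 = no, Q12 = yes, Q7 = yes.
print(predict(FIGURE_1_TREE, {"Q4": 0, "Q12": 1, "Q7": 1}))  # IV
```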
Q4?
|- No:  Q7?
|       |- No:  Q8?
|       |       |- No:  Q2?
|       |       |       |- No:  IV (1)
|       |       |       '- Yes: CV (2)
|       |       '- Yes: Q3?
|       |               |- No:  IV (4)
|       |               '- Yes: Q2?
|       |                       |- No:  CV (1)
|       |                       '- Yes: IV (1)
|       '- Yes: IV (11)
'- Yes: Q8?
        |- No:  CV (7)
        '- Yes: Q7?
                |- No:  CV (2)
                '- Yes: Q2?
                        |- No:  IV (1)
                        '- Yes: CV (1)

Figure 5: Alternate decision tree for α = β = 1, based on a different optimal feature set. 'IV' indicates an incumbent victory, 'CV' a challenger victory. The numbers beside each leaf indicate how many of the examples in the data set they classify.

Q4?
|- No:  Q12?
|       |- No:  IV (16)
|       '- Yes: Q9?
|               |- No:  Q5?
|               |       |- No:  CV (1)
|               |       '- Yes: IV (1)
|               '- Yes: CV (2)
'- Yes: Q8?
        |- No:  CV (7)
        '- Yes: Q9?
                |- No:  IV (1)
                '- Yes: CV (3)

Figure 6: Decision tree for the feature set (Q4, Q5, Q8, Q9, Q12), with α = 1 and β = 0. The numbers beside each leaf indicate how many examples the leaf classifies. 'IV' indicates an incumbent victory, 'CV' a challenger victory.
Dr. Fair at Yale has produced his own method of predicting the share of the votes the incumbent administration will receive (Fair 1978), and concurs with our and Dr. Lichtman & Dr. Keilis-Borok's prediction of a Republican victory. His model, however, is based entirely on economic factors, which seems to be a popular and traditional prediction method. Forbes magazine suggests, however, that the state of the economy is not as strong a factor as traditionally believed (Ackman & Hazlin 2004), with only 64% of elections being predictable by economic factors alone, and even then Forbes uses a significantly more complex economic model than is usually proposed, with seven interdependent variables.

6 Conclusion

We presented in this paper a deterministic and optimality-preserving method of reductions to allow the solutions to a series of problems related to the k-Feature Set problem to be found. We also presented a small test data set and the application of this system and other techniques for use in analysis, classification and prediction with regard to the system the data represents. We believe this method is both flexible and powerful, and has clear applications in all fields where data mining techniques are used.

Further research may include the development or modification of current methods for generating decision trees and classification rule sets to allow the exploitation of the extra power and information offered by the (α, β)k-Feature Set variant of the problem. The use of these ideas on unclassified data is also an interesting area of potential research, using the maximisation of the α and β values to indicate which classes examples should belong to. At least for the extra information provided by increased α values, we can envision a form of decision that allows multiple features at each decision point. For example, if we (manually) construct the decision tree for α = 2, we arrive at a tree with (Q4 ∨ Q12) at the root, with the 'no' branch corresponding to both Q4 and Q12 being 'no', and classifying 16 of the 'Incumbent Victory' examples (essentially a contraction of the left-hand branch of the exhibited trees). From the 'yes' branch we can then insert a decision vertex labelled (¬Q3 ∨ ¬Q7 ∨ Q9), which on the 'no' branch classifies the final two 'Incumbent Victories', leaving all the 'Challenger Victories' down the 'yes' branch of this second decision. This kind of tree is obviously more useful for solutions to (α, β)k-Feature Set with α > 1. It remains to generalise and automate this process. It also remains to incorporate the information given by increased β values.

With regard to the actual data used, we predict that it is most probable that George W. Bush will serve a second term as President of the United States of America, if there are no dramatic changes in the candidates or the knowledge of the public.

6.1 Acknowledgements

P. Moscato would like to thank Dr. Keilis-Borok for a discussion they had in December 1991 in Trieste while both were at the International Centre for Theoretical Physics.

References

D. Ackman & M. Hazlin, It's not the economy, stupid, Forbes, http://moneycentral.msn.com/content/invest/forbes/P92882.asp?GT1=4529, (2004)
C. Cotta & P. Moscato, The k-Feature Set Problem is W[2]-Complete, Journal of Computer and System Sciences, 67(4), (2003), pages 686-690
C. Cotta, C. Sloper & P. Moscato, Evolutionary Search of Thresholds for Robust Feature Set Selection: Application to the Analysis of Microarray Data, Proceedings of EvoBio2004 - 2nd European Workshop on Evolutionary Computation and Bioinformatics, Coimbra, Portugal, April, (2004)

R. Downey & M. Fellows, Parameterized Complexity, Springer, (1997)

S. Davies & S. Russell, NP-Completeness of Searches for Smallest Possible Feature Sets, Proceedings of the AAAI Symposium on Relevance, (1994), pages 41-43

R. Fair, The Effect of Economic Events on Votes for President, The Review of Economics and Statistics, May, (1978), pages 159-173; http://fairmodel.econ.yale.edu/vote2004/index2.htm

A. J. Lichtman and V. I. Keilis-Borok, Pattern Recognition Applied to Presidential Elections in the United States, 1860-1980: Role of Integral Social, Economic, and Political Traits, Proceedings of the National Academy of Sciences of the United States of America, 78(11), (1981), pages 7230-7234

L. Mathieson, A. Mendes, J. Marsden, J. Pond and P. Moscato, Computer Aided Breast Cancer Diagnosis with Optimal Feature Sets: Reduction Rules and Optimization Techniques, Manuscript in Preparation, (2004)

K. Weihe, Covering Trains by Stations or the Power of Data Reduction, OnLine Proceedings of ALEX'98 - 1st Workshop on Algorithms and Experiments, http://rtm.science.unitn.it/alex98/proceedings.html, (1998)
