0% found this document useful (0 votes)
7 views

WWW_Explainable Neural Rule Learning

Uploaded by

Wei Deng
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views

WWW_Explainable Neural Rule Learning

Uploaded by

Wei Deng
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Explainable Neural Rule Learning

Shaoyun Shi1 , Yuexiang Xie2 , Zhen Wang2 , Bolin Ding2 , Yaliang Li2 *, and Min Zhang1 *
1 Department
of Computer Science and Technology, Institute for Artificial Intelligence,
Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
2 Alibaba Group

[email protected],{yuexiang.xyx,jones.wz,bolin.ding,yaliang.li}@alibaba-inc.com,[email protected]

ABSTRACT KEYWORDS
Although neural networks have achieved great successes in vari- explainable neural networks; rule learning; out of distribution
ous machine learning tasks, people can hardly know what neural ACM Reference Format:
networks learn from data due to their black-box nature. The lack Shaoyun Shi, Yuexiang Xie, Zhen Wang, Bolin Ding, Yaliang Li and Min
of such explainability is one of the limitations of neural networks Zhang. 2022. Explainable Neural Rule Learning. In Proceedings of the ACM
when applied in domains, e.g., healthcare and finance, that demand Web Conference 2022 (WWW ’22), April 25–29, 2022, Virtual Event, Lyon,
transparency and accountability. Moreover, explainability is benefi- France. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3485447.
cial for guiding a neural network to learn the causal patterns that 3512023
can extrapolate out-of-distribution (OOD) data, which is critical in
1 INTRODUCTION
real-world applications and has surged as a hot research topic.
In order to improve the explainability of neural networks, we pro- Recent years have witnessed the great successes of neural net-
pose a novel method—Explainable Neural Rule Learning (denoted works [29, 38], which are often ascribed to their extraordinary
as ENRL), with the aim to integrate the expressiveness of neural expressiveness. However, what a neural network learns from train-
networks and the explainability of rule-based systems. Specifically, ing data are a bunch of model parameters from which people cannot
we first design several operator modules and guide them to behave interpret what function this neural network expresses. Thus, in
as certain relational operators via self-supervised learning. With inference, people just feed each input instance to the neural net-
input feature fields and learnable context values serving as argu- work and get the prediction, while they can hardly explain how
ments, these operator modules are used as predicates to constitute the network makes decisions. Due to this black-box nature, it is a
the atomic propositions. Then we employ neural logical operations consensus that existing neural networks lack explainability [27, 34].
to combine atomic propositions into a collection of rules. Finally, Despite their successes, the lack of explainability has limited
we design a voting mechanism for these rules so that they col- the application of neural networks in domains that require trans-
laboratively make up our predictive model. Thus, rule learning is parency and accountability to make trustworthy decisions. For
transformed to neural architecture search, that is, to choose the example, in healthcare/education systems, in addition to predicted
appropriate arrangements of feature fields and operator modules. diagnosis/assessment, the applied model must also give the corre-
After searching for a specific architecture and learning the involved sponding pieces of evidence [39, 46]. Similarly, both the supporting
modules, the resulting neural network explicitly expresses some and the opposing rationales of the decisions are required for finance
rules and thus possesses explainability. Therefore, we can predict and justice intelligence [30]. Moreover, neural networks might ex-
for each input instance according to rules it satisfies, which at the ploit spurious patterns as shortcuts to fit the training data instead
same time explains how the neural network makes that decision. We of learning the causal patterns that consistently work on both in-
conduct a series of experiments on both synthetic and real-world distribution and out-of-distribution (OOD) data [28, 53]. It would
datasets to evaluate ENRL. Compared with conventional neural be beneficial for justifying and even guiding which kind of patterns
networks, ENRL achieves competitive in-distribution performance to learn if a neural network could explain its decisions.
while providing the extra benefits of explainability. Meanwhile, Therefore, improving the explainability of neural networks has
ENRL significantly alleviates performance drop on OOD test data, attracted much attention from both academia and industry in recent
implying the effectiveness of rule learning. Codes are provided at years. As discussed in Section 4, prior works on this topic attempt to
https://github.com/Shuriken13/ENRL. unveil the statistical relationships (e.g., correlation) between input
features [1, 18, 25]/network modules [9, 26, 51] and the outputs.
CCS CONCEPTS Although this is a critical step towards explainable neural networks,
it is difficult for people to know the reasoning process of a neural
• Computing methodologies → Neural networks.
network from these relationships, not to mention that each module
Permission to make digital or hard copies of all or part of this work for personal or is still a black box.
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
In this work, we aim to improve neural networks’ explainability
on the first page. Copyrights for components of this work owned by others than ACM by instructing each module’s explainable behavior and promoting
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, the neural network to imitate a collection of rules. On the one hand,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from [email protected]. * Corresponding Author
WWW ’22, April 25–29, 2022, Virtual Event, Lyon, France Work done when Shaoyun was an intern at Alibaba. This work is supported by Alibaba
© 2022 Association for Computing Machinery. Group through Alibaba Research Intern Program. This work is supported by the Natural
ACM ISBN 978-1-4503-9096-5/22/04. . . $15.00 Science Foundation of China (Grant No. U21B2026) and Tsinghua University Guoqiang
https://doi.org/10.1145/3485447.3512023 Research Institute.

3031
WWW ’22, April 25–29, 2022, Virtual Event, Lyon, France S. Shi, Y. Xie, Z. Wang, B. Ding, Y. Li, M. Zhang

the learned rules are much more flexible with neural network-based the building bricks of rules, are depicted on the left-hand side of the
arguments and predicates than those in traditional rule-based sys- figure. An ECM comprises a feature field as an argument, an opera-
tems. On the other hand, imitating rules makes the reasoning pro- tor module as a predicate, and a learnable context value as another
cess of the learned neural network readily interpretable to human argument, to express an atomic proposition such as “age ≥ 18”. The
beings. Meanwhile, imitating rules poses implicit regularization explainability of ECM is guaranteed by self-supervised operator
upon the neural network, hopefully steering the model toward learning, which encourages the operator modules to possess spe-
learning the causal patterns that extrapolate OOD test data. cific mathematical properties. Thus, the operator modules behave
To this end, we propose a novel method to conduct Explainable as the corresponding relational operators, such as not greater than
Neural Rule Learning (denoted as ENRL). Specifically, we first (≤), not less than (≥), and belong to (∈). We employ neural negation
design several operator modules and utilize self-supervised tasks to and neural conjunction to express a collection of rules from several
grant them specific mathematical properties so that they behave as ECMs. Hence, rule learning can be conducted as NAS, which auto-
corresponding relational operators, including not greater than (≤), matically searches for suitable arrangements of feature fields and
not less than (≥), and belong to (∈). These operator modules serve operator modules for all the ECMs. As shown on the right-hand
as predicates, together with feature fields and learnable context side of Figure 1, the resulting model is a multi-tree neural network
values serving as arguments, to form the atomic propositions. Then where each binary tree consists of ECMs as nodes. For each input
we employ neural logical operations such as negation and conjunc- instance, its features are transformed into embedding represen-
tion to orchestrate atomic propositions into rules. For each input tations and fed into the root of each tree. Then the instance will
instance, these rules collaboratively determine its prediction with be routed from the root to a leaf node along with corresponding
our designed voting mechanism. In this way, each specific neural transformations, where the activated path denotes the rule this
architecture, i.e., an arrangement of feature fields and operator instance satisfies. Finally, we design a voting mechanism for all
modules at the different stages of the neural network, corresponds satisfied rules so that they collaboratively determine the prediction
to a specific collection of rules. Hence, seeking appropriate rules for this input instance.
is transformed into neural architecture search (NAS) [36], which
automatically searches for the suitable architecture to express the 2.1 Explainable Neural Rules
rules that correctly correlate the input instances with their labels. In this section, we introduce how to promote the neural networks to
After attaining a specific neural architecture and learning the pa- imitate rules. The building bricks of rules is Explainable Condition
rameters of involved modules, the resulting neural network model Module (ECM), which can be regarded as the explainable neural
can predict each input instance and, at the same time, explain the modules that expresses an atomic proposition, e.g., “age ≥ 18”.
prediction by expressing how this instance satisfies the rules.
We conduct a series of experiments on both synthetic and real- 2.1.1 Single Rule. We first introduce how to express a single rule
world datasets and provide quantitative and qualitative analysis to that consists of L ECMs. Let ei=1...L ∈ (0, 1) denotes the output
demonstrate the advantages of the proposed ENRL in improving the of an ECM, which can be interpreted as the extent to which the
explainability of neural networks. After searching for an appropri- input instance satisfies the expressed atomic proposition. Then
ate architecture, we provide the case study and analysis to show the the output of a single rule can be the conjunction given as: r =
ÎL
rules learned by ENRL, including those with positive and negative e 1 ∧ e 2 ∧ ... ∧ e L = i=1 ei , where r ∈ (0, 1) denotes the extent
voting weights which represent the support and opposition for the to which the input instance satisfies the expressed rule. Inspired
predictions, respectively. Besides, the experimental results show by previous studies on neural logic [32, 47], we define the neural
that ENRL achieves competitive in-distribution fitting performance conjunction here as the product of input values, i.e., a ∧ b := ab.
compared with conventional neural networks while significantly Other neural logic modules [7, 41] can also be applied here.
alleviating performance drop on OOD data, which confirms the
advantages brought in by learning explainable rules. 2.1.2 Rules Expressed by Multiple Trees. Note that the expressive-
The main contributions are summarized as follows: (1) We pro- ness of a single rule is not enough, therefore we adopt a complete
pose a novel method named ENRL to search for an appropriate binary tree topology to express multiple rules. Adopting such tree
neural network architecture to express rules. (2) A self-supervised structure, a single rule can be represented by a path from the root
operator learning is used for guiding neural operator modules to to a leaf node, and the rules expressed by the tree partition the
behave as certain relational operators. These operator modules are input feature space into several disjoint subspaces. Multiple trees
integrated with feature fields to form atomic propositions, which can be constructed to further enrich the representation ability.
are the building bricks for expressing the rules learned by ENRL. As the node of the tree, the output of ECM should be expanded
(3) Experimental results show that ENRL can explicitly provide to two disjoint part, denoted as e and ¬e. Follow the definition of
the explainability for the predictions, and achieves competitive the neural negation in previous study, we further have ¬e = 1 − e.
in-distribution performance while significantly alleviating perfor- Without loss of generality, when building up the tree, we formulate
mance drop on OOD data compared with baseline models. that an ECM always outputs e to its left subtree (implies that input
features satisfy the atomic proposition expressed by the ECM), and
2 METHODOLOGY outputs ¬e to its right subtree (implies that input features dissatisfy
In this section, we present our method—Explainable Neural Rule the atomic proposition expressed by the ECM). Thus the rules can
Learning (denoted as ENRL). The overall architecture of ENRL is be expressed by the conjunction of the outputs of the path from the
illustrated in Figure 1, where Explainable Condition Modules (ECM), root to the leaf nodes. A tree t of depth L can express 2L distinct

3032
Explainable Neural Rule Learning WWW ’22, April 25–29, 2022, Virtual Event, Lyon, France

Figure 1: Illustration of Explainable Neural Rule Learning (ENRL). Explainable Condition Modules (ECM) are atomic propo-
sitions to form rules. A path from the root to the leaf node represents a learned rule orchestrated by the neural conjunction
and negation. Only one rule in each tree will be activated by each input instance, denoted as the path in red.
rules, denoted as r t ∈ (0, 1). For example, as shown in right- means “male”. For numerical features, we transform values into
i=1, ...,2L
hand side of Figure 1, the depth of each tree is 3, and there exits discrete buckets and ensure that larger values are discretized into
total 23 = 8 rules represented by the paths from the root to a buckets with larger indexes. For example, for user’s ages, we can
leaf node. For the rule r it shown in the figure, it can be expressed use x i = 0 to represent “less than 10 years old”, x i = 1 for “10-19
by the conjunction of the outputs of the ECMs in the path, i.e., years old”, x i = 2 for “20-29 years old” and so on. This strategy
r it = ¬e 1,1
t ∧ et ∧ et . is commonly used in deep neural models [42, 50], because such
2,2 3,3
Note that only one rule with the largest value will be activated one-hot encoding is suitable for the embedding layer. Each posi-
by each input instance. The activated rules of total T tree will be tion in the one-hot encoding corresponds to a feature embedding
aggregated via an voting weights to output the final predictions ŷ, vector, i.e. xi = uix i ∈ {ui1 , ui2 ...uhi }, where {uij }j=1...hi ∈ Rd ×hi
i
which can be formally given as: is the embedding matrix of feature i with vector size d, and hi is
the number of distinct values this feature field takes.
T Õ
2L
v tj I(j = arg max{r it }i=1
2L
Õ
ŷ = ) (1) 2.2.2 Learnable Context Value. In decision tree, the context values
i
t =1 j=1
(i.e., the split points) are greedily selected according to the maxi-
where I(j = arg maxi {r it }i=1
2 ) is the characteristic function to pre-
L mum information gain. To become more flexible and possess better
expressiveness, the context values in ECMs are designed as the
serve the satisfied rule, and v tj is a scalar value denoted the voting
learnable parameters. That is to say, we allows ECMs to search for
weights. A positive or negative voting weight represents the support
optimal context values to form suitable rules by the data driven
and opposition for the prediction, respectively, while its absolute
manner. Another reason for adopting learnable context value in
value denotes the confidence.
ECM is that, vectors are powerful to represent more types of infor-
2.1.3 Discussion. The advantages of adopting such tree-structured mation, such as a set of elements. After training, the learned context
topology to express rules by ECMs are twofold. Firstly, compared to value can be decoded into understandable values or elements ac-
an equivalent number of multiple parallel rules, trees needs fewer cording to how we process the feature field (e.g., the discretization
ECMs, which is more efficient to search the combinations of atomic process) and what operator module we adopted.
propositions to form suitable rules for the given data (O(L × 2L ) vs.
O(2 × 2L )). Secondly, in each tree, there exists one and only one 2.2.3 Operator Module. Operator modules are implemented by
rule that a input instance satisfies, since the input feature space has neural networks, each of which works as a predicate in an ECM.
been partitioned into disjoint subspace by each tree. It makes the With the help of self-supervised operator learning (Section 2.3),
proposed method more robust to the unseen data, while no rules different operator modules f ⊙ (·, ·) are guided to behave as the cor-
might be activated by an input instance organized in the parallel responding relational operators, such as not greater than (≤), not
manner. Note that there might be more fancy architecture such as less than (≥), and belong to (∈). The same type of operator modules
graphs to assemble ECMs to from rules. are shared among different ECMs. For example, if two ECMs both
use the operator not greater than (≤), their operator modules f ≤ (·, ·)
2.2 Explainable Condition Module have shared parameters. In this study, we adopt MLPs to imple-
ECM is designed as the building bricks for expressing rules, which ment the operator modules, which take a feature field and a context
comprises of a feature field as argument, an operator module as value as inputs, and output a scalar value to denote whether the
predicate, and a learnable context value as another argument. corresponding relation between the inputs can be established.

2.2.1 Feature Field. In this work, we consider the ubiquitous tab- 2.2.4 Searching. Each ECM contains a feature field as argument,
ular data where each instance consists of several feature fields. an operator module as predicate, and a learnable context value
Each feature field is represented by a one-hot vector, and x i de- as another argument. With ECM as the explainable module, rule
notes the “1” value’s index of the i-th feature. For example, for a learning can be conducted as NAS, which automatically searches
categorical feature like gender, x i = 0 means “female” and x i = 1 for suitable arrangements of feature fields and operator modules

3033
WWW ’22, April 25–29, 2022, Virtual Event, Lyon, France S. Shi, Y. Xie, Z. Wang, B. Ding, Y. Li, M. Zhang

for ECMs. Given F candidate feature fields and O operator modules, their numerical relationships (e.g., 2 ≤ 3):
the output of k-th ECM can be given as:
hi Õ
hi
−loд f ≤ (uia , ubi ) ,
Õ Õ
F Õ
O F Õ
O ℓle =

w ikj f j (xi , ckij ) w ikj σ ckij ) ,
Õ Õ
ek = = MLPj (xi ⊕

i ∈Fn a=1 b=a
i=1 j=1 i=1 j=1 hi Õ
a
(2)
−loд f ≥ (uia , ubi ) ,
Õ Õ
F Õ
O ℓge =

s.t . w ikj = Gumbel-Max(okij ) ∈ {0, 1}, w ikj = 1,
Õ
i ∈Fn a=1 b=1
i=1 j=1
where hi denotes the number of distinct values of feature i, i.e.,
where ⊕ means vector concatenation, σ (·) is the Sigmoid function. {uij }j=1...hi ∈ Rd ×hi . We sample a batch of a, b, c from Si in each
The operator module f j (·, ·) is parameterized as MLP, which outputs training step to balance the effectiveness and efficiency.
a value between (0, 1), given feature field xi and the corresponding By minimizing these loss functions on f ≤ and f ≥ , we get two
context value ckij as inputs. There exists F × O candidate combi- partial orders that work on every vector set Si for numerical fea-
nations of features fields and operator modules in total, and w ikj tures. The partial orders on vector space allow us to transform
denotes feature-operator selecting weight corresponding to the the numerical features into vector space while preserving their
specific combination of feature i and operator j. By regarding okij s numerical properties. The context vector cki, ⊙ can be decoded via
as the “architecture parameters” in NAS, we adopt the Gumbel-Max comparing cki, ⊙ with feature representations {uij }j as:
trick [20] to allow an end-to-end learning of these feature-operator
selecting weights. This trick has been widely adopted in many NAS c i,k ⊙ = arg min{| f ≤ (uij , cki, ⊙ ) − f ≥ (uij , cki, ⊙ )|} (3)
j
methods, such as SNAS [49] and DATA [6], to search the suitable
architecture in a differentiable manner. where c i,k ⊙ is the one-hot index of feature i. And we obtain its
explain according to the discretization process. For example, for
2.3 Self-Supervised Operator Learning the feature filed age, c i = 0 represents the range (0, 10], and c i = 1
We utilize self-supervised tasks to grant operator modules with represents the range (10, 20].
specific mathematical properties so that they can behave as corre-
sponding relational operators. In this study, we implement three 2.3.2 Operator for Categorical Features. As for the operator belong
relational operators, including not greater than (≤) and not less than to (∈), the corresponding operator module, denoted as f ∈ (xi , cki, ∈ ),
(≥) for numerical features, and belong to (∈) for categorical features. is trained to express that whether feature value x i is in the set rep-
resented by the context vector cki, ∈ . Formally, the objective function
2.3.1 Operators for Numerical Features. Note that the operators not is:
greater than (≤) and not less than (≥) are reflexive partial orders. A
−loд 1 − f ∈ (xi− , cki, ∈ ) ,
Õ ÕÕ
ℓin =

homogeneous relation ⊙ defined on set S is a reflexive partial order
i ∈Fc k xi−
when it satisfies reflexivity, antisymmetry and transitivity [44]:
• Reflexivity: for ∀a ∈ S, a ⊙ a. where Fc is the set of categorical features, and xi− denotes the
• Antisymmetry: for ∀a, b ∈ S, if a ⊙ b and b ⊙ a, then a = b. negative samples, which are drawn from N (0, 0.01) during training
• Transitivity: for ∀a, b, c ∈ S, if a ⊙ b and b ⊙ c, then a ⊙ c. to simulate that a random point in the vector space is not in the
set represented by cki, ∈ . After training, the elements in cki, ∈ can be
Based on these mathematical properties, we design a self-supervised derived by
operator learning strategy to guide the operator modules, denoted
as f ≤ (·, ·) and f ≥ (·, ·), to behave as corresponding relational op- c i,k ∈ = {j | f ∈ (uij , cki, ∈ ) > γ }, (4)
erators. To be specific, for each numerical feature i and operator where γ is the threshold to judge whether uij belong to cki, ∈ or not.
⊙ ∈ {≤, ≥}, we construct a set Si including feature embedding
And each index j corresponds to a category, e.g., j = 0 for male and
matrix {uij }j and all context values in ECMs at different positions k
j = 1 for female.
in ENRL {cki, ⊙ }k, ⊙ , i.e., Si = {uij }j ∪ {cki, ⊙ }k, ⊙ . Then the loss func- It is worth pointing out that more operators can be introduced
tions designed for making ECMs satisfying the three mathematical into ENRL, where their explainability can be assured, as long as their
properties above can be formally given as: mathematical properties can be formulated as operator learning
Õ Õ Õ objective functions. For example, there are interpretable modules
ℓref = −loд f ⊙ (a, a) ,

in CV and VQA [17, 40].
⊙ ∈ { ≤, ≥ } i ∈Fn a∈Si

ℓant =
Õ Õ Õ
f ⊙ (a, b)f ⊙ (b, a)∥a − b∥2 , 2.4 Training
⊙ ∈ { ≤, ≥ } i ∈Fn a,b∈Si 2.4.1 Voting Weights. We tries different feature-operator combi-
Õ Õ Õ nations for all ECMs via NAS during the training. However, it is
ℓtra = −f ⊙ (a, b)f ⊙ (b, c)loд f ⊙ (a, c) ,

challenging to optimize choices of feature-operator pairs and voting
⊙ ∈ { ≤, ≥ } i ∈Fn a,b,c∈Si
weights simultaneously. If NAS makes an excellent trial but the
where Fn is the set of numerical features, and ℓref , ℓant , ℓtra are voting weight vr of the rule r does not update timely, such as a
self-supervised losses for reflexivity, antisymmetry and transitivity, supporting rule with negative weight, then the learning algorithm
respectively. Besides, numerical feature representations should obey may not recognize the good trial and update in a wrong direction.

3034
Explainable Neural Rule Learning WWW ’22, April 25–29, 2022, Virtual Event, Lyon, France

To solve the problem, we introduce a state representation for every construct decision rules. The final structure is usually different from
rule during the training: the optimal solution, especially when some important features do
F Õ
O not work individually but are useful in cooperation. ENRL adopts
w ikj дj (ckij ), NAS to search for the suitable architecture to express the rules that
Õ
vr = h({sk }k ∈Pr ), sk =
i=1 j=1 correctly correlate the input instances with their labels. (ii) ENRL
where дj (·) is an operator-related function that takes the context models features as high-dimension vectors. The neural-based archi-
value as input and outputs a state vector sk of ECM k that represents tecture provides ENRL with a higher capacity to model large-scale
what the atomic proposition looks like in this ECM. Pr is the set data of better expressiveness. Besides, the modular design makes
of ECMs in rule r , and h(·) takes all state vectors of ECMs in rule r ENRL flexible to introduce various types of operators and features.
and outputs the voting weight vr ∈ R. By such design, vr changes For example, we can further design an operator to capture a specific
as the rule changes and keeps up-to-date during training. In our pattern in an image or sentence. (iii) In DTree, an input instance
implementation, дj (·) are operator-specific multilayer perceptrons, goes into a certain child node based on the condition of the current
and h(·) is a global multilayer perceptron. Note that h(·) and дj (·) node, which is sensitive to noises. An inaccurate feature may lead
are used to help learn vr during the training, but we only need to to a wrong leaf node and thus output. ENRL matches the input
record vr for the inference process after training. instance with all rules and uses the best-matched rules to vote for
the output. Soft-inference can tolerate noises, and the mistake of
2.4.2 Loss Function. Learnable parameters of the an ENRL model one node may be recoverable by others.
include embeddings {uij }j for all features i, multilayer perceptrons On the other hand, MLP [14] is a typical structure of neural net-
MLPj for all operators module j, “architecture parameters” okij in works. Previous work argues that multilayer feedforward networks
each ECM k, and voting weight vrt for all rules r in trees t. All the with a nonpolynomial activation function can approximate any
parameters are jointly trained end-to-end via applying SGD, with function [24]. State-of-the-art neural networks for image classifica-
the loss function: tion trained with stochastic gradient methods easily fit a random
labeling of the training data [53], which means a black-box neural
L = ℓ + λ(ℓref + ℓant + ℓtra + ℓle + ℓge + ℓin ) network may memorize training data instead of learning how to
where ℓref , ℓant , ℓtra , ℓle , ℓge are self-supervised losses defined in Sec- generalize to unseen data. It has been verified by many researchers
tion 2.3, and λ denotes the hyperparameter to control the strength of and industrial applications that MLP has a high capacity but little
self-supervised loss. ℓ is the task specific loss, e.g, the cross-entropy explainability [27, 34]. For ENRL, although the operator modules
loss for binary classification task defined as: are implemented with MLPs, their explainability is guaranteed by
Õ self-supervised operator learning, which encourages the operator
ℓ=− yi loд(ŷi ) + (1 − yi )loд(1 − ŷi )
modules to possess specific mathematical properties.
i
In a nutshell, the proposed ENRL integrate the expressiveness of
where yi ∈ {0, 1} is the label of instance i, and ŷi (defined in Eq. (1)) neural networks and the explainability of some traditional machine
denotes the corresponding prediction. models and rule-based systems.
2.4.3 Pruning. Similar to vr , although there is a summation for all
3 EXPERIMENT
feature-operator pairs in Eq. (2), only one w ikj in ECM k is non-zero.
Three research questions are answered through experiments:
During the inference, there is no need to calculate those operator
RQ1 Could ENRL learn explainable rules?
modules with w ikj = 0. Sparse structure weights make NAS-based
RQ2 Does ENRL alleviate performance drop on OOD data?
ENRL has a high inference efficiency. For numerical and categorical
RQ3 How does ENRL compare to state-of-the-art models on in-
features, we also ensure that suitable types of operators are applied:
distribution fitting performance?
w i,k ≤ = w i,k ≥ = 0, ∀i ∈ Fc ; w i,k ∈ = 0, ∀i ∈ Fn ,
3.1 Settings
which means ≤ and ≥ do not act on categorical features, and ∈ does
3.1.1 Dataset. We conduct experiments on three public real-world
not act on numerical features. It reduces the search space of NAS
datasets including Adult, Credit and RSC2017, and one synthetic
and improves the efficiency of training and testing.
dataset denoted as Synthetic.
There are also many other pruning strategies, such as architec-
!#
ture recycling [4, 5] or incomplete training [31, 33] of NAS. Besides,
thanks to the explainable design of ENRL, we can take a look at "#$
learned rules, and remove duplicate or invalid ones, such as rules !"

with conflict conditions, to improve the inference efficiency. "#%

2.5 Discussion !!

We provide a comparison of Decision Tree, MLP, and ENRL in terms Figure 2: Illustration of how labels depend on invariant fea-
of explainability and capacity. tures in the synthetic data.
Decision Tree (DTree) [3] is a typical machine learning method The Synthetic dataset is generated to evaluate the OOD perfor-
with high explainability. Although ENRL in our experiments is also mance of algorithms, in which some features have spurious corre-
constructed in tree-based architecture, there are significant differ- lations with labels in the training set. Specifically, each instance x
ences between ENRL and DTree: (i) DTree uses a greedy strategy to has six features in uniform distribution x i ∼ U(0, 1), i = 1...6 and

3035
WWW ’22, April 25–29, 2022, Virtual Event, Lyon, France S. Shi, Y. Xie, Z. Wang, B. Ding, Y. Li, M. Zhang

Table 1: Examples of learned rules with the largest positive/negative voting weights of ENRL.
Dataset Weight v jt Rules
2.7388 (native_country ∈ {Haiti,Honduras}) ∧ (hours_per_week < 15) ∧ (education_num > 16)
Adult 2.3649 (native_country ∈ {Haiti,Honduras}) ∧ (hours_per_week ≥ 15) ∧ (education_num > 16)
-1.2243 (hours_per_week > 55) ∧ (education_num > 16) ∧ (education ∈ {Preschool,9th,7th-8th})
(NumberOfTime60-89DaysPastDueNotWorse > 1) ∧ (NumberOfDependents > 1)
0.6575
∧ (RevolvingUtilizationOfUnsecuredLines < 0.053)
Credit
0.5600 (DebtRatio ≤ 0.341) ∧ (NumberOfDependents ≤ 1) ∧ (NumberOfTimes90DaysLate > 2)
(DebtRatio > 0.055) ∧ (DebtRatio ≤ 0.493) ∧ (NumberRealEstateLoansOrLines < 4)
-0.3925
∧ (NumberOfTimes90DaysLate < 1)
2.7287 (item_country < {Austria,Other}) ∧ (item_career_level ≤ 1) ∧ (item_discipline_id < {8,2,13})
RSC2017 0.5980 (item_discipline_id < {19,10,21,8,2,13}) ∧ (item_country ∈ {Austria,Other,Switzerland})
-0.3074 (item_discipline_id ∈ {22,15,13}) ∧ (user_career_level < 2)
5.9121 (x 2 < 0.497) ∧ (x 3 < 0.484) ∧ (x 1 < 0.491) ∧ (x 5 < 0.003)
Synthetic 5.2113 (x 2 < 0.497) ∧ (x 3 < 0.484) ∧ (x 1 < 0.491) ∧ (x 5 ≥ 0.003)
-4.9838 (x 3 ≤ 0.501) ∧ (x 2 ≥ 0.499) ∧ (x 1 < 0.491)

a binary label y ∈ {0, 1} which only depends on invariant features. features are discretized into equal-sized 100/1000 buckets, so that
x 1 , x 2 , x 3 ∈ T are invariant features and x 4 , x 5 , x 6 ∈ S are spurious boundaries might not be integers.
features. Each spurious feature has a large correlation ρ x i ∈S,y ≈ 0.5 For each dataset, we show two supporting rules with the positive
with the label in the training set, but has no correlation ρ x i ∈S,y ≈ 0 voting weights, and one opposing rule with the negative voting
in the testing set. Models can easily fit spurious features as short- weights. For Adult dataset, the task is to predict whether a person
cut, but cannot achieve 100% accuracy. In contrast, combing all makes over 50K a year according to input features. The label is 1 for
of the three invariant features will provide 100% accuracy in both “> 50K” and 0 for “≤ 50K”. The two supporting weights together
training and testing. However, each single or pair of invariant fea- describe that immigrants from Haiti and Honduras have high in-
tures have no statistical correlation with the label. Compared to comes if they are highly educated, while the opposing rule shows
spurious features, it is harder and costs more effort for models to that if one’s education is lower than high school and works more
capture such relationships. We randomly generated the dataset and than 55 hours per week, he/she is more likely to have an income of
perturbed 1% of labels as noises. 10% of the training set is used as no more than 50K. For Credit dataset, the goal is to predict whether
the validation set. A more comprehensible illustration is in Figure 2. somebody will experience at least 90 days past due delinquency in
RSC2017 is used to compare in/out-distribution performance by the next two years or not. The label is 1 for past due delinquency
evaluating on interactions from warm/cold users in the test set sep- and 0 for not. It is clear that the most reliable rule represents that
arately. It is hard to define what is in/out-distribution data on Adult checking whether one experienced delinquency before and how the
or Credit. More details of datasets can be found in Appendix A.1. balance is on his credit lines. The opposing rule describes somebody
that has never experienced delinquency and is in a low debt ratio.
3.1.2 Baselines. We compare the proposed ENRL with three widely- Another dataset RSC2017 is a job recommendation dataset from
used machine learning methods including KNN [12], Decision XING, a website mainly in Germany, Austria, and Switzerland. The
Tree [3] (DTree), and Random Forest [2] (RForest). Deep neural label is 1 if the user and the item have a positive interaction, and
decision tree [52] (DNDT) and three state-of-the-art deep neural 0 for a negative interaction. Discipline_id represents disciplines
models including DeepFM [15], AutoInt [43], DeepLight [11] are such as Consulting, HR, etc, which are anonymized by the dataset
also compared. Detailed information about baselines are in Appen- provider. Although we cannot know the exact types of these jobs,
dix A.2. We run each experiment with five different random seeds the top rule implies that some jobs have a higher click-through
and report the average AUC and standard error. The implementa- rate. For example, further statistical computations verify that jobs
tion details can be found in Appendix A.3. satisfying the first rule have a CTR of 0.694, much larger than the
global average of 0.450. As for Synthetic dataset, the learned rules
are more clear. The two supporting rules describe a positive cube
3.2 Experimental Results in Figure 2, and the opposing rule describes a negative cube.
3.2.1 Explainable Rules (RQ1). Each rule, represented by a path These results confirm that the proposed ENRL can provide ex-
from the root to a leaf node, is assigned with a voting weight, and planations by expressing how the input satisfies the learned rules.
we rank the learned rules according to the absolute value of their
voting weights to preserve the reliable rules with large confident
values. The learned rules will be decoded into the understandable 3.2.2 OOD Performance (RQ2). Neural networks might learn spu-
expressions according to Eq. (3) and Eq. (4). The invalid rules that rious correlations from the observed data, which causes the signifi-
never activate will be ignored, such as age < 0 or (age < 10)∧(age > cant performance drop on OOD data [16, 35]. The proposed method
20). We also remove redundant atomic proposition in a rule, e.g., ENRL, by imitating rules pose implicit regularization upon the neu-
(times > 2) ∧ (times > 3) will be simplified as times > 3. The rules ral networks, is excepted to steer the model toward learning the
learned by ENRL are illustrated in Table 1. Note that some numerical causal features that can generalize to OOD data. To demonstrate the

3036
Explainable Neural Rule Learning WWW ’22, April 25–29, 2022, Virtual Event, Lyon, France

Table 2: OOD performance (AUC) on RSC2017 and Synthetic datasets.


Synthetic RSC2017
Model
In Out In-Out Drop In Out In-Out Drop
KNN 0.9768 ± 0.0000 0.8669 ± 0.0000 -11.25% 0.5041 ± 0.0000 0.5052 ± 0.0000 0.22%
Traditional DTree 0.9739 ± 0.0000 0.8275 ± 0.0001 -15.03% 0.6630 ± 0.0026 0.5992 ± 0.0000 -9.62%
RForest 0.9793 ± 0.0003 0.7342 ± 0.0043 -25.03% 0.7027 ± 0.0036 0.6476 ± 0.0016 -7.84%
DNDT 0.9869 ± 0.0000 0.9683 ± 0.0000 -1.88% 0.6436 ± 0.0005 0.6184 ± 0.0018 -3.92%
DeepFM 0.9543 ± 0.0165 0.9008 ± 0.0270 -5.61% 0.7952 ± 0.0089 0.7072 ± 0.0042 -11.07%
Neural
AutoInt 0.9853 ± 0.0022 0.9425 ± 0.0199 -4.34% 0.7715 ± 0.0075 0.7073 ± 0.0069 -8.32%
DeepLight 0.9877 ± 0.0005 0.9781 ± 0.0035 -0.97% 0.7920 ± 0.0039 0.7110 ± 0.0024 -10.23%
Ours ENRL 0.9908 ± 0.0001 0.9896 ± 0.0001 -0.12% 0.7769 ± 0.0049 0.7199 ± 0.0011 -7.34%

Table 3: In-distribution fitting performance (AUC) on all datasets.


Model Adult Credit RSC2017 Synthetic
KNN 0.7929 ± 0.0000 0.6456 ± 0.0000 0.5007 ± 0.0000 0.8669 ± 0.0000
Traditional DTree 0.9032 ± 0.0000 0.8272 ± 0.0000 0.6514 ± 0.0015 0.8275 ± 0.0001
RForest 0.9083 ± 0.0002 0.8383 ± 0.0001 0.6917 ± 0.0029 0.7342 ± 0.0043
DNDT 0.8486 ± 0.0002 0.8234 ± 0.0005 0.6234 ± 0.0014 0.9683 ± 0.0000
DeepFM 0.9081 ± 0.0016 0.8333 ± 0.0006 0.7729 ± 0.0079 0.9008 ± 0.0270
Neural
AutoInt 0.9146 ± 0.0005 0.8377 ± 0.0003 0.7585 ± 0.0083 0.9425 ± 0.0199
DeepLight 0.9056 ± 0.0058 0.8341 ± 0.0006 0.7764 ± 0.0035 0.9781 ± 0.0035
Ours ENRL 0.9141 ± 0.0002 0.8372 ± 0.0004 0.7692 ± 0.0052 0.9896 ± 0.0001

effectiveness of ENRL in learning causal features, we conduct ex- In general, neural networks perform better than traditional ma-
periments of OOD performance on Synthetic and RSC2017 datasets chine learning methods due to their extraordinary expressiveness.
and show the experimental results in Table 2. DNDT is designed for numerical features, which does not per-
From the table we can observe that, compared to baseline models, form well on datasets with many categorical features like Adult
ENRL achieves the best OOD performance and the significantly and RSC2017. DeepLight performs the best on the largest dataset
lower in-out performance drop on both datasets. Specifically, for RSC2017, but does not have good performances on small datasets
Synthetic dataset, although DTree and RForest learn explainable such as Adult and Credit, since neural networks with fancy struc-
trees, their greedy strategy tends to use spurious features, since tures require plenty of data to train. In contrast, ENRL is among
a single spurious feature (i.e., x 4 , x 5 and x 6 ) provides larger infor- the top three models on all four datasets and provides comparable
mation gain than a single causal feature (i.e., x 1 , x 2 and x 3 ). These performance with the best one. The reason is that although we
methods cannot learn the fundamental rules that need a combina- implicitly regularize the neural modules in ENRL for providing
tion of three causal features. On the other hand, although neural explainability by imitating rules, the learned rules are useful to
models capture high-order crossing features, they can be impacted guide models to capture the causal features. ENRL performs the
bring by the spurious features, and even memorize some training best on the Synthetic dataset given the OOD test set, because the
instances, which causes the in-out performance drop. In compari- learned rules can be well extrapolated.
son, ENRL is able to learn the correct rules in the Synthetic dataset
3.2.4 Analysis. According to the experiments, it can be concluded
so that it keeps effective on OOD data. For RSC2017, traditional
that ENRL learns explainable rules which significantly improve the
machine learning methods such as KNN fail to achieve good per-
explainability of neural networks. Thanks to the learned rules in-
formances on in(out)-distribution data due to their limited capacity.
stead of memorization of training samples, ENRL not only achieves
Neural models perform well on in-distribution warm users, but
comparable in-distribution fitting performance with state-of-the-art
they suffer a significant performance drop on OOD cold users. The
methods, but also alleviates the performance drop on OOD data.
reason is that these models prefer to use interaction-based features
to model user/item history, which does not work for cold users.
3.3 Parameter Study
Instead, ENRL provides comparable performance on in-distribution
data and achieves the best on OOD data, since it learns rules not To have a deeper insight on the characteristics of ENRL, we further
only for in-distribution data, but also rules based on user/item investigate the effect of some hyper-parameters in ENRL on Credit
profile/content that can be extrapolated to OOD cold users. and Synthetic (results on all dataset are in Section B) .
3.3.1 Weight of Self-Supervised Loss. As the self-supervised opera-
3.2.3 In-Distribution Fitting Performance (RQ3). To confirm that tor loss is optimized together with the task-specific loss, a proper
the proposed method integrates the expressiveness of neural net- strength of self-supervised loss is necessary. The operator learning
works and the explainability of rule-based systems, we compare the task should be cooperated with the target task, because datasets
in-distribution fitting performance of ENRL and baseline methods have different distributions, and the task-specific loss is hard or easy
on four datasets. The experimental results are shown in Table 3. to optimize. A suitable λ not only improves explainability, but also

3037
WWW ’22, April 25–29, 2022, Virtual Event, Lyon, France S. Shi, Y. Xie, Z. Wang, B. Ding, Y. Li, M. Zhang

helps learn reasonable rules and improves the fitting performance node in a greedy way by selecting the feature field as well as a
of ENRL. However, a too-large λ might block the optimization of context value that maximize the information gain or minimize the
task-specific loss. impurity [3]. An ensemble of trees based on repeatedly sampling
data, such as Random Forest [2], usually performs better. However,
these naturally explainable methods are hard to model high-order
features, and their performance cannot catch up with the growing
requirements of machine learning applications.
Although neural networks have made remarkable achievements
recently for their large capacity to fit the data, the black-box na-
ture limits their applications. Thus, some researchers design back-
Figure 3: In-distribution fitting performance (AUC) of ENRL propagation-based techniques like LRP [1, 25] to recover the rel-
with different weights of self-supervised loss (λ). evance between the output and input features of a trained neural
network. Other work like GAM [18] carefully groups similar at-
3.3.2 Number and Length of Rules. It can be concluded from Fig- tributions to form global human-interpretable explanations across
ure 4 that as T increases, the performance of ENRL keeps improving subpopulations, which show the landscape of neural network pre-
because the more rules/trees in an ENRL architecture, the easier it dictions. Nevertheless, these post-hoc techniques do not improve
is to find reasonable rules, and the larger the model capacity owns. the intransparency nature of neural networks, and their explana-
However, after reaching around 40 rules, the model performance tions cannot completely recover how the model makes the predic-
does not improve significantly with more rules. The reason we tion [10]. As a result, researchers take efforts to design transpar-
think is that the model capacity is large enough for the data, and ent neural architecture. Currently, a large number of researchers
the length of rules limits the description ability of rules. Figure 5 improve the explainability of neural models by introducing the
shows that generally, the longer the rule length, the better the attention network, which can automatically learn which informa-
model performance. Longer rules can describe more detailed con- tion the network should pay more attention to, and the attention
ditions and smaller groups of data. However, too-long rules might weights explain what the prediction is based on [9, 26, 51]. For ex-
memorize and overfit some data, which causes lower performance ample, there is work designing neural reasoning framework based
and reduces model efficiency. on attention [17]. However, attention may not be sufficient as an
explanation [48]. Some researchers find adversarial distributions of
attention weights in models for some classification, which indicates
that researchers should be cautious when looking into attention
weights to interpret the link between inputs and outputs [19]. Other
work design neural modules to simulate logic operators [7, 41]. In
computer vision, neural-backed decision trees are proposed to give
a hierarchical classification of images [23, 45]. However, these meth-
Figure 4: In-distribution fitting performance (AUC) of ENRL ods with pre-defined architecture do not give rules to explain their
with different numbers of trees in ENRL (T ). predictions, and users still do not know why each neural module
gives such output. Moreover, NBDT requires hierarchical labels,
which are not always available. Feature-based deep neural decision
tree (DNDT) [52] is interpretable but is designed for numerical
features. It models features as scalars, which do not perform well
in real applications with complex high-dimension feature space.

Figure 5: In-distribution fitting performance (AUC) of ENRL 5 CONCLUSION


with different lengths of rules in ENRL (L).
In this paper, we propose a novel method ENRL to achieve explain-
able neural networks via learning rules with neural network-based
4 RELATED WORK modules. Our main idea is two-fold: One is to endow the basic
Many traditional non-neural machine learning methods are de- neural modules of ENRL with explainable behaviors; Another is
signed in an interpretable way. In linear regression and logistic to orchestrate a group of these modules to imitate a collection of
regression, the corresponding weight explains the influence (neg- rules. Therefore, we design self-supervised learning tasks to regu-
ative or positive) of a feature on the final prediction [13]. KNN larize the behaviors of these modules and apply NAS to learn the
is an instance-based model that can explain the prediction for an arrangements of these modules that can express the desired rules.
instance by its nearest neighbors [12]. Naïve Bayes Classifier is Experiments on synthetic and real-world datasets show that ENRL
based on the Bayes Theorem, where the probability of classes for exhibits competitive in-distribution performance against baselines
each feature is estimated [37]. Another category of explainable while providing explainability as its strength. Meanwhile, ENRL
method is the tree-based model [8], which is widely used for not significantly alleviates performance drop on OOD data. These re-
only explainability, but also outstanding effectiveness and efficiency. sults confirm that ENRL effectively integrates neural networks’
Tree-based models like Decision Tree split all training data at each expressiveness and rule-based systems’ explainability.

3038
Explainable Neural Rule Learning WWW ’22, April 25–29, 2022, Virtual Event, Lyon, France

REFERENCES 106384.
[1] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, [31] Ramakanth Pasunuru and Mohit Bansal. 2019. Continual and Multi-Task Archi-
Klaus-Robert Müller, and Wojciech Samek. 2015. On pixel-wise explanations for tecture Search. In ACL (1). Association for Computational Linguistics, 1911–1922.
non-linear classifier decisions by layer-wise relevance propagation. PloS one 10, [32] Ali Payani and Faramarz Fekri. 2019. Learning Algorithms via Neural Logic
7 (2015), e0130140. Networks. CoRR abs/1904.01554 (2019).
[2] Leo Breiman. 2001. Random Forests. Mach. Learn. 45, 1 (2001), 5–32. [33] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. 2018.
[3] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification Efficient Neural Architecture Search via Parameter Sharing. In ICML (Proceedings
and Regression Trees. Wadsworth. of Machine Learning Research, Vol. 80). PMLR, 4092–4101.
[4] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Efficient [34] Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wort-
Architecture Search by Network Transformation. In AAAI. AAAI Press, 2787– man Vaughan, and Hanna M. Wallach. 2021. Manipulating and Measuring Model
2794. Interpretability. In CHI. ACM, 237:1–237:52.
[5] Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. 2018. Path-Level [35] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019.
Network Transformation for Efficient Architecture Search. In ICML (Proceedings Do ImageNet Classifiers Generalize to ImageNet?. In ICML (Proceedings of Ma-
of Machine Learning Research, Vol. 80). PMLR, 677–686. chine Learning Research, Vol. 97). PMLR, 5389–5400.
[6] Jianlong Chang, Xinbang Zhang, Yiwen Guo, Gaofeng Meng, Shiming Xiang, [36] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Poyao Huang, Zhihui Li, Xiaojiang
and Chunhong Pan. 2019. DATA: Differentiable ArchiTecture Approximation. In Chen, and Xin Wang. 2021. A Comprehensive Survey of Neural Architecture
NeurIPS. 874–884. Search: Challenges and Solutions. ACM Comput. Surv. 54, 4 (2021), 76:1–76:34.
[7] Hanxiong Chen, Shaoyun Shi, Yunqi Li, and Yongfeng Zhang. 2021. Neural [37] Irina Rish et al. 2001. An empirical study of the naive Bayes classifier. In IJCAI
Collaborative Reasoning. In WWW. ACM / IW3C2, 1516–1527. 2001 workshop on empirical methods in artificial intelligence, Vol. 3. 41–46.
[8] Linda A Clark and Daryl Pregibon. 2017. Tree-based models. In Statistical models [38] Jürgen Schmidhuber. 2015. Deep learning in neural networks: An overview.
in S. Routledge, 377–419. Neural Networks 61 (2015), 85–117.
[9] Dawei Cong, Yanyan Zhao, Bing Qin, Yu Han, Murray Zhang, Alden Liu, and [39] Nida Shahid, Tim Rappon, and Whitney Berta. 2019. Applications of artificial
Nat Chen. 2019. Hierarchical Attention based Neural Network for Explainable neural networks in health care organizational decision-making: A scoping review.
Recommendation. In ICMR. ACM, 373–381. PloS one 14, 2 (2019), e0212356.
[10] Arun Das and Paul Rad. 2020. Opportunities and Challenges in Explainable [40] Wen Shen, Zhihua Wei, Shikun Huang, Binbin Zhang, Jiaqi Fan, Ping Zhao,
Artificial Intelligence (XAI): A Survey. CoRR abs/2006.11371 (2020). and Quanshi Zhang. 2021. Interpretable Compositional Convolutional Neural
[11] Wei Deng, Junwei Pan, Tian Zhou, Deguang Kong, Aaron Flores, and Guang Lin. Networks. In IJCAI. ijcai.org, 2971–2978.
2021. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR [41] Shaoyun Shi, Hanxiong Chen, Weizhi Ma, Jiaxin Mao, Min Zhang, and Yongfeng
Predictions in Ad Serving. In WSDM. ACM, 922–930. Zhang. 2020. Neural Logic Reasoning. In CIKM. ACM, 1365–1374.
[12] Evelyn Fix and Joseph Lawson Hodges. 1989. Discriminatory analysis. Non- [42] Shaoyun Shi, Min Zhang, Xinxing Yu, Yongfeng Zhang, Bin Hao, Yiqun Liu,
parametric discrimination: Consistency properties. International Statistical Re- and Shaoping Ma. 2019. Adaptive Feature Sampling for Recommendation with
view/Revue Internationale de Statistique 57, 3 (1989), 238–247. Missing Content Feature Values. In CIKM. ACM, 1451–1460.
[13] David A Freedman. 2009. Statistical models: theory and practice. cambridge [43] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang,
… university press.
[14] Matt W. Gardner and S. R. Dorling. 1998. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric Environment 32, 14-15 (1998), 2627–2636.
[15] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In IJCAI. ijcai.org, 1725–1731.
[16] Dan Hendrycks and Thomas G. Dietterich. 2019. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In ICLR (Poster). OpenReview.net.
[17] Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. 2018. Explainable Neural Computation via Stack Neural Module Networks. In ECCV (7) (Lecture Notes in Computer Science, Vol. 11211). Springer, 55–71.
[18] Mark Ibrahim, Melissa Louie, Ceena Modarres, and John W. Paisley. 2019. Global Explanations of Neural Networks: Mapping the Landscape of Predictions. In AIES. ACM, 279–287.
[19] Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In NAACL-HLT (1). Association for Computational Linguistics, 3543–3556.
[20] Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In ICLR (Poster). OpenReview.net.
[21] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR (Poster).
[22] Ron Kohavi. 1996. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid. In KDD. AAAI Press, 202–207.
[23] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulò. 2016. Deep Neural Decision Forests. In IJCAI. IJCAI/AAAI Press, 4190–4194.
[24] Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. 1993. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6, 6 (1993), 861–867.
[25] Heyi Li, Yunke Tian, Klaus Mueller, and Xin Chen. 2019. Beyond saliency: Understanding convolutional neural networks from saliency prediction on layer-wise relevance propagation. Image Vis. Comput. 83-84 (2019), 70–86.
[26] Yi-Ju Lu and Cheng-Te Li. 2020. GCAN: Graph-aware Co-Attention Networks for Explainable Fake News Detection on Social Media. In ACL. Association for Computational Linguistics, 505–514.
[27] Gary Marcus. 2018. Deep Learning: A Critical Appraisal. CoRR abs/1801.00631 (2018).
[28] Vaishnavh Nagarajan, Anders Andreassen, and Behnam Neyshabur. 2021. Understanding the failure modes of out-of-distribution generalization. In ICLR. OpenReview.net.
[29] Michael A. Nielsen. 2015. Neural networks and deep learning. Vol. 25. Determination Press, San Francisco, CA.
[30] Ahmet Murat Özbayoglu, Mehmet Ugur Gudelek, and Omer Berat Sezer. 2020. Deep learning for financial applications: A survey. Appl. Soft Comput. 93 (2020), …
[43] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In CIKM. ACM, 1161–1170.
[44] Walter Denis Wallis. 2011. A beginner's guide to discrete mathematics. Springer Science & Business Media.
[45] Alvin Wan, Lisa Dunlap, Daniel Ho, Jihan Yin, Scott Lee, Suzanne Petryk, Sarah Adel Bargal, and Joseph E. Gonzalez. 2021. NBDT: Neural-Backed Decision Tree. In ICLR. OpenReview.net.
[46] Lei Wang, Dongxiang Zhang, Jipeng Zhang, Xing Xu, Lianli Gao, Bing Tian Dai, and Heng Tao Shen. 2019. Template-Based Math Word Problem Solvers with Recursive Neural Networks. In AAAI. AAAI Press, 7144–7151.
[47] Zhuo Wang, Wei Zhang, Ning Liu, and Jianyong Wang. 2020. Transparent Classification with Multilayer Logical Perceptrons and Random Binarization. In AAAI. AAAI Press, 6331–6339.
[48] Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not Explanation. In EMNLP/IJCNLP (1). Association for Computational Linguistics, 11–20.
[49] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. 2019. SNAS: stochastic neural architecture search. In ICLR (Poster). OpenReview.net.
[50] Yuexiang Xie, Zhen Wang, Yaliang Li, Bolin Ding, Nezihe Merve Gürel, Ce Zhang, Minlie Huang, Wei Lin, and Jingren Zhou. 2021. FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. In KDD. ACM, 3795–3805.
[51] Linyi Yang, Zheng Zhang, Su Xiong, Lirui Wei, James Ng, Lina Xu, and Ruihai Dong. 2018. Explainable Text-Driven Neural Network for Stock Prediction. In CCIS. IEEE, 441–445.
[52] Yongxin Yang, Irene Garcia Morillo, and Timothy M. Hospedales. 2018. Deep Neural Decision Trees. CoRR abs/1806.06988 (2018).
[53] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2021. Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64, 3 (2021), 107–115.
A EXPERIMENTAL SETTINGS

A.1 Datasets
We conduct experiments on three public real-world datasets and one synthetic dataset. The statistics are summarized in Table 4.

Table 4: Statistics of the datasets.

Dataset     #Feature     #Train    #Valid     #Test
Adult             14     29,305     3,256    16,281
Credit            10     90,000    10,000    50,000
RSC2017           23  3,124,092   134,317   121,945
Synthetic          6     90,000    10,000    50,000

• Adult¹ is a well-known machine learning dataset collected by Ronny Kohavi and Barry Becker in 1994 [22]. The task is to predict whether a person makes over 50K a year according to the given features, including education, occupation, country, etc. We follow the original train-test split and randomly sample 10% of the training set as the validation set.
• Credit² is a classic dataset from a Kaggle challenge. The goal of the competition is to build a model for the bank to predict whether somebody will experience 90-days-past-due delinquency or worse in the next two years. Features include a person's income and some information about their current and past finances. We randomly split the labeled samples by 9:1:5 for training, validation, and testing.
• RSC2017³ is the dataset from the ACM RecSys Challenge 2017. It focuses on the job recommendation task on XING. The dataset contains 93 days of user-item interactions and many features about the job industry, location, user education, etc. The task is to predict whether a user-item pair will have a positive interaction. We remove users/items with no positive interactions throughout the 93 days and split the dataset in chronological order: the interactions in the last seven days are used for testing, those in the 8th-10th days from the end for validation, and the rest for training.
Cold-start is a well-known problem in recommendation, which means making recommendations for new users/items with no historical interactions. In RSC2017, we regard the users that have no positive interactions in the training set as cold users. Test samples of cold users follow a different distribution from the training data, i.e., they are OOD data, which helps us evaluate the OOD performance of ENRL and the baselines.
• Synthetic. In order to further investigate ENRL and verify that the learned rules are superior in alleviating the performance drop on OOD data, we generate a synthetic dataset (a minimal generation sketch is given below). Each instance x has six features drawn from a uniform distribution, x_i ∼ U(0, 1), i = 1...6, and a binary label y ∈ {0, 1} that depends only on the invariant features. x_1, x_2, x_3 ∈ T are invariant features and x_4, x_5, x_6 ∈ S are spurious features. Each spurious feature has a large correlation ρ_{x_i ∈ S, y} ≈ 0.5 with the label in the training set, but no correlation (ρ_{x_i ∈ S, y} ≈ 0) in the testing set. Models can easily fit the spurious features at low cost during training, but these features cannot provide 100% accuracy and are probably harmful at prediction time if they contribute to the outputs. In contrast, the invariant features keep the same relationship with the labels in the training and testing sets. Combining all of the three invariant features provides 100% accuracy in both training and testing; however, each single invariant feature, or any pair of them, has no statistical correlation with the label. A more comprehensible illustration is in Figure 2. Compared to spurious features, it is harder and costs more effort for machine learning models to capture such relationships, but once learned, they provide invariantly high performance on the entire instance space, in training and testing alike. We randomly generate the dataset and flip 1% of the labels as noise. 10% of the training set is used as the validation set.
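To make the construction above concrete, below is a minimal generation sketch in NumPy. The paper specifies only the distributional targets (uniform features, a label determined by all three invariant features jointly, per-feature train correlation ≈ 0.5, 1% label noise); the parity-style label function and the mixture used to induce the spurious correlation are our illustrative assumptions, not the authors' exact mechanism.

import numpy as np

def generate(n, correlate_spurious, rng, noise=0.01, mix=0.58):
    """Hypothetical generator for the synthetic dataset described above."""
    x = rng.uniform(0.0, 1.0, size=(n, 6))
    # Label = parity (XOR) of the three thresholded invariant features:
    # any single feature or pair is uncorrelated with y, while all three
    # together determine it, matching the description in the text.
    bits = x[:, :3] > 0.5
    y = (bits[:, 0] ^ bits[:, 1] ^ bits[:, 2]).astype(int)
    if correlate_spurious:  # training-time distribution only
        for j in range(3, 6):
            # With probability `mix`, couple the feature to the label;
            # mix ≈ 0.58 yields corr(x_j, y) ≈ 0.5 under this mixture.
            tied = rng.random(n) < mix
            x[tied, j] = (y[tied] + rng.uniform(0.0, 1.0, tied.sum())) / 2.0
    flip = rng.random(n) < noise  # change 1% of labels as noise
    y[flip] = 1 - y[flip]
    return x, y

rng = np.random.default_rng(0)
x_train, y_train = generate(100_000, correlate_spurious=True, rng=rng)  # 10% later held out for validation
x_test, y_test = generate(50_000, correlate_spurious=False, rng=rng)

A quick check with np.corrcoef confirms the intended structure: each spurious feature correlates with y at roughly 0.5 on the training split and roughly 0 on the test split, while each invariant feature alone shows no correlation with y.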
A.2 Baselines
ENRL is compared with the following baselines (a minimal configuration sketch for the scikit-learn baselines is given after the footnotes below):
• KNN (Fix and Hodges, 1989) [12]. A well-known instance-based machine learning algorithm. It makes predictions based on the labels of the training samples most similar to the target sample. We use the scikit-learn implementation to run the experiments⁴.
• DTree (Breiman et al., 1984) [3]. The decision tree is a widely used tree-based machine learning algorithm with good explainability. It chooses features and splits the data in a greedy way during training. We use the scikit-learn implementation in our experiments⁵.
• RForest (Breiman, 2001) [2]. Random forest is an ensemble method built from multiple decision trees. Each base tree classifier is trained on a sampled subset of the training data and features, and the trees are finally combined. We use the scikit-learn implementation⁶.
• DNDT (Yang et al., 2018a) [52]. A tree-based model with neural architectures, which is interpretable and trainable without greedy splitting. We modify the public code provided by the authors to test on our datasets⁷.
• DeepFM (Guo et al., 2017) [15]. A popular factorization-machine-based neural network for CTR prediction, which combines neural networks and factorization machines. The original paper does not provide code, so we use the public GitHub implementation⁸ from [11].
• AutoInt (Song et al., 2019) [43]. One of the state-of-the-art algorithms; it automatically learns high-order interactions of the input features and models them with a multi-head self-attentive neural network with residual connections. We use the public code provided by the authors to run the experiments⁹.
• DeepLight (Deng et al., 2021) [11]. One of the state-of-the-art methods for click-through rate (CTR) prediction. DeepLight improves neural CTR models by pruning redundant parameters and dense embedding vectors, while explicitly searching for informative feature interactions. We use the authors' public code to conduct the experiments¹⁰.

¹ https://archive.ics.uci.edu/ml/datasets/Adult
² https://www.kaggle.com/c/GiveMeSomeCredit/
³ http://www.recsyschallenge.com/2017/
⁴ https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
⁵ https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
⁶ https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
⁷ https://github.com/wOOL/DNDT
⁸ https://github.com/WayneDW/DeepLight_Deep-Lightweight-Feature-Interactions/blob/master/model/DeepFMs.py
⁹ https://github.com/shichence/AutoInt
¹⁰ https://github.com/WayneDW/DeepLight_Deep-Lightweight-Feature-Interactions
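As referenced in A.2 above, the following is a minimal configuration sketch for the three classic scikit-learn baselines. The hyperparameter values and grids follow Appendix A.3; reading "[4, 8...64]" and "[2, 4...8192]" as powers of two is our assumption, and x_train/y_train/x_valid/y_valid are hypothetical arrays from the dataset preparation (e.g., the generator sketched in A.1).

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# KNN: the number of neighbors is 5 (Appendix A.3).
knn = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)

def select_on_validation(make_model):
    # Pick max_depth and min_samples_split by validation AUC.
    best_auc, best_model = -1.0, None
    for depth in [4, 8, 16, 32, 64]:
        for min_split in [2 ** k for k in range(1, 14)]:  # 2, 4, ..., 8192
            model = make_model(depth, min_split).fit(x_train, y_train)
            auc = roc_auc_score(y_valid, model.predict_proba(x_valid)[:, 1])
            if auc > best_auc:
                best_auc, best_model = auc, model
    return best_model

dtree = select_on_validation(
    lambda d, s: DecisionTreeClassifier(max_depth=d, min_samples_split=s))
rforest = select_on_validation(
    lambda d, s: RandomForestClassifier(n_estimators=40,  # 40 trees, as in ENRL
                                        max_depth=d, min_samples_split=s))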
A.3 Parameters and Running Environment
In KNN, the number of neighbors is 5. For DTree and RForest, we search the maximum depth of the tree in [4, 8...64] and the minimum number of samples required to split an internal node in [2, 4...8192]. The number of trees in RForest is 40, the same as in ENRL. We use Adam [21] to train all neural models, including DNDT, DeepFM, AutoInt, DeepLight, and ENRL, in mini-batches of size 128 with a learning rate of 0.001. Models are trained for at most 1000 epochs, and early stopping is conducted according to the performance on the validation set. To prevent the neural models from overfitting, we use both ℓ2-regularization and dropout: the weight of ℓ2-regularization is searched between 1 × 10⁻⁶ and 1 × 10⁻⁴, and the dropout ratio is set to 0.2. The vector size of features is 64. Other arguments follow the default configuration of the public toolkits/implementations. For ENRL, the number of trees T is 40 for all datasets. The length of rules L is 3 for RSC2017 and 5 for the others. The weight of the self-supervised loss λ is 1 × 10⁻⁵ for Adult and RSC2017, and 1 for Credit and Synthetic. All neural models run on a single GPU (NVIDIA GeForce GTX 2080Ti).
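A hedged sketch of this shared training setup is given below. The model object, data loaders, loss helpers (task_loss, ssl_loss, evaluate_auc), and the early-stopping patience of 20 epochs are hypothetical placeholders: the text fixes the optimizer, batch size, learning rate, regularization, and epoch budget, but not these details. Here ℓ2-regularization is approximated via Adam's weight_decay, and dropout (ratio 0.2) is assumed to live inside the model definition.

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             weight_decay=1e-5)  # l2 weight, searched in [1e-6, 1e-4]
lambda_ssl = 1e-5  # weight of the self-supervised loss (dataset-dependent, see above)
best_auc, bad_epochs, patience = 0.0, 0, 20

for epoch in range(1000):  # trained for at most 1000 epochs
    model.train()
    for features, labels in train_loader:  # mini-batches of size 128
        optimizer.zero_grad()
        loss = task_loss(model(features), labels) + lambda_ssl * ssl_loss(model)
        loss.backward()
        optimizer.step()
    auc = evaluate_auc(model, valid_loader)  # early stopping on validation AUC
    if auc > best_auc:
        best_auc, bad_epochs = auc, 0
        torch.save(model.state_dict(), "best_model.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break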
Regarding the complexity of ENRL, take RSC2017 as an example: there are 160 rules in total and 40 rules are activated for each prediction. The number of parameters of ENRL is 741,260 for training and 149,900 for inference, which is on the same level as that of the strongest baseline, DeepLight (413,200). Although ENRL requires more training time than DeepLight (2h vs. 5h), it provides both better performance and model explainability, and inference with ENRL finishes in seconds.
B PARAMETER STUDY
We show the results of the parameter study on all four datasets here, including the weight of the self-supervised loss λ (Figure 6), the number of trees T (Figure 7), and the length of rules L (Figure 8). The results are consistent across the four datasets. A suitable weight λ of the self-supervised loss not only improves explainability, but also helps learn reasonable rules and improves the fitting performance of ENRL; however, a too-large λ might block the optimization of the task-specific loss. More/longer rules can improve the performance of ENRL, but too many or too-long rules reduce model efficiency and cause overfitting.

Figure 6: In-distribution fitting performance (AUC) of ENRL with different weights of self-supervised loss (λ).
Figure 7: In-distribution fitting performance (AUC) of ENRL with different numbers of trees in ENRL (T).
Figure 8: In-distribution fitting performance (AUC) of ENRL with different lengths of rules in ENRL (L).