What NARX Networks Can Compute
1 Introduction
1.1 Background
Much of the work on the computational capabilities of recurrent neural networks
has focused on synthesis: how neuron-like elements are capable of constructing fi-
nite state machines (FSMs) [1, 11, 15, 16, 23]. All of these results assume that the
nonlinearity used in the network is a hard-limiting threshold function. However,
when recurrent networks are used adaptively, continuous-valued, differentiable
nonlinearities are almost always used. Thus, an interesting question is how the
computational complexity changes for these types of functions. For example, [18]
has shown that finite state machines can be stably mapped into second order
recurrent neural networks with sigmoid activation functions. More recently, re-
current networks have been shown to be at least as powerful as Turing machines,
and in some cases can have super-Turing capabilities [12, 21, 22].
1.2 Summary of Results
This work extends the ideas discussed above to an important class of discrete-time
nonlinear systems called the Nonlinear AutoRegressive with exogenous inputs
(NARX) model [14]:

y(t) = f(u(t - n_u), \ldots, u(t - 1), u(t), y(t - n_y), \ldots, y(t - 1)),   (1)
where u(t) and y(t) represent the input and output of the network at time t, n_u and
n_y are the input and output orders, and f is a nonlinear function.
It has been demonstrated that this particular model is well suited for modeling
nonlinear systems [3, 5, 17, 19, 24]. When the function f can be approximated
by a Multilayer Perceptron (MLP), the resulting system is called a NARX network [3,
17]. Other work [10] has shown that for the problems of grammatical inference
and nonlinear system identification, gradient descent learning is more effective
in NARX networks than in recurrent neural network architectures that have
"hidden states." For these studies, the NARX neural net usually converges much
faster and generalizes better than the other types of recurrent networks.
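For concreteness, the following minimal sketch (Python/NumPy) iterates the tapped-delay recurrence of equation (1). The particular choice of f, the orders n_u = n_y = 2, and all names are illustrative assumptions rather than part of the model.

```python
import numpy as np
from collections import deque

n_u, n_y = 2, 2          # input and output orders (illustrative values)

def f(u_taps, y_taps):
    """Placeholder nonlinear map f(u(t-n_u),...,u(t), y(t-n_y),...,y(t-1)).
    Any bounded nonlinear function of the taps could stand here."""
    return float(np.tanh(0.5 * sum(u_taps) - 0.3 * sum(y_taps)))

def run_narx(u_seq):
    """Iterate eq. (1): the only feedback is the delayed output itself."""
    u_taps = deque([0.0] * (n_u + 1), maxlen=n_u + 1)  # u(t-n_u), ..., u(t)
    y_taps = deque([0.0] * n_y, maxlen=n_y)            # y(t-n_y), ..., y(t-1)
    outputs = []
    for u in u_seq:
        u_taps.append(u)                 # shift in the current input u(t)
        y = f(list(u_taps), list(y_taps))
        y_taps.append(y)                 # y(t) becomes y(t-1) at the next step
        outputs.append(y)
    return outputs

print(run_narx([1.0, 0.0, 1.0, 1.0, 0.0]))
```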
This work proves that NARX networks are computationally at least as strong
as fully connected networks within a linear slowdown. This implies that NARX
networks with a finite number of nodes and taps are at least as powerful as Turing
machines, and thus are universal computation devices, a somewhat unexpected
result given the limited nature of feedback in these networks.
These results should be contrasted with the mapping theorems of [6], which
imply that NARX networks are capable of representing arbitrary systems expressible
in the form of equation (1), but which give no bound on the number of nodes
required to achieve a good approximation. Furthermore, how such systems relate
to conventional models of computation is not clear.
Finally we provide some related results concerning NARX networks with
hard-limiting nonlinearities. Even though these networks are only capable of
implementing a subclass of finite state machines called finite memory machines
(FMMs) in real time, if given more time (a sublinear slowdown) they can simulate
arbitrary FSMs.
For our purposes we need consider only fully connected and NARX recurrent
neural networks. These will be single-input, single-output systems, though the
results easily extend to the multi-variable case.
We shall adopt the notation that x corresponds to a state variable, u to an
input variable, y to an output variable, and z to a node activation value. In
each of these networks we shall let N correspond to the dimension of the state
space. When necessary to distinguish between variables of the two networks,
those associated with the NARX network will be marked with a tilde.
The state variables of a recurrent network are defined to be the memory elements,
i.e. the set of time delay operators. In a fully connected network there is a one-to-
one correspondence between node activations and state variables of the network,
since each node value is stored at every time step. Specifically, the values of the
N state variables at the next time step are given by x_i(t + 1) = z_i(t). Each
node weights and sums the external inputs to the network and the states of the
network. Specifically, the activation function for each node is defined by
z_i(t) = \sigma\Big( \sum_{j=1}^{N} a_{i,j} x_j(t) + b_i u(t) + c_i \Big),   (2)

where a_{i,j}, b_i, and c_i are fixed real-valued weights, and σ is a nonlinear function
which will be discussed below. The output is assigned arbitrarily to be the value
of the first node in the network, y(t) = z_1(t).
The network is said to be fully connected because there is a weight between
every pair of nodes. However, when weight a_{i,j} = 0, there is effectively no con-
nection between nodes i and j. Thus, a fully connected network is very general,
and can be used to represent many different kinds of architectures, including
those in which only a subset of the possible connections between nodes are used.
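As an illustration of the update rule (2), the following sketch performs one step of such a fully connected network. The weight values, the choice of tanh as the nonlinearity, and N = 3 are assumptions made only for the example.

```python
import numpy as np

N = 3                               # number of nodes / state variables
rng = np.random.default_rng(0)
A = rng.normal(size=(N, N))         # a_{i,j}: a weight between every pair of nodes
b = rng.normal(size=N)              # b_i: input weights
c = rng.normal(size=N)              # c_i: biases
sigma = np.tanh                     # stand-in for the nonlinearity

def step(x, u):
    """x_i(t+1) = z_i(t) = sigma(sum_j a_{i,j} x_j(t) + b_i u(t) + c_i)."""
    return sigma(A @ x + b * u + c)

x = np.zeros(N)
for u in [1.0, 0.0, 1.0]:
    x = step(x, u)
    y = x[0]                        # the output is the value of the first node
print(y)
```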
The MLP consists of a set of nodes organized into two layers. There are H nodes
in the first layer which perform the function

\tilde z_i(t) = \sigma\Big( \sum_{j=0}^{n_u} \tilde b_{i,j}\, u(t-j) + \sum_{j=1}^{n_y} \tilde a_{i,j}\, \tilde y(t-j) + \tilde c_i \Big),   i = 1, \ldots, H.

The output layer consists of a single linear node, \tilde y(t) = \sum_{j=1}^{H} w_j \tilde z_j(t) + \theta.
The nonlinearity σ is taken to saturate to zero on the left; for example,

\sigma(x) = \begin{cases} 0, & x \le c, \\ \dfrac{1}{1 + e^{-x}}, & x > c, \end{cases}

where c ∈ R.
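A minimal sketch of this two-layer structure follows. The number of hidden nodes H, the orders, the weights, and the threshold c are illustrative assumptions, and σ follows the one-side saturated form given above.

```python
import numpy as np

H, n_u, n_y, c = 4, 2, 2, -3.0               # sizes, orders, threshold (assumed)
rng = np.random.default_rng(1)
W_in = rng.normal(size=(H, n_u + 1 + n_y))   # first-layer weights
bias = rng.normal(size=H)                    # first-layer biases
w_out = rng.normal(size=H)                   # linear output weights
theta = 0.1                                  # output bias

def sigma(x):
    """One-side saturated nonlinearity: zero for x <= c, sigmoid otherwise."""
    return np.where(x <= c, 0.0, 1.0 / (1.0 + np.exp(-x)))

def narx_mlp(u_taps, y_taps):
    """f realised by an MLP: H sigma-nodes followed by a single linear node."""
    v = np.concatenate([u_taps, y_taps])     # u(t-n_u),...,u(t), y(t-n_y),...,y(t-1)
    z = sigma(W_in @ v + bias)               # first layer
    return w_out @ z + theta                 # linear output node

print(narx_mlp(np.array([0.2, 0.7, 1.0]), np.array([0.1, 0.4])))
```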
Here we present a sketch of the proof of the theorem. The interested reader
is referred to [20] for more details.
\tilde z_i(k + 1) = \sigma(\cdots),

where the constant β_i is chosen large enough to make the input to σ less than s when
\tilde z_{2N-i+1}(k) \ne \mu, so that the whole function is zero. A similar argument is used
to ensure that the final node implements the sequencing signal properly. Since
only one of the hidden-layer nodes is non-zero, the output node of the NARX network
is simply a linear combination, so that the output of the network is equal to the value
of the currently active hidden-layer node. The resulting network simulates the
fully connected network with a linear slowdown.
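The flavour of the construction can be conveyed by a small simulation: if the node values of the fully connected network are emitted one per sub-step through a single scalar output with taps of depth 2N - 1, the full trajectory can be recovered with an N-fold (linear) slowdown. The sketch below is not the construction of [20]; in particular, the phase bookkeeping that the proof delegates to the sequencing-signal node and the constants β_i is done explicitly in the loop, and all names and values are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                # nodes in the fully connected network
A = rng.normal(size=(N, N))          # a_{i,j}
b = rng.normal(size=N)               # b_i
c = rng.normal(size=N)               # c_i
sigma = np.tanh                      # stand-in nonlinearity

def fully_connected_step(x, u):
    """One step of eq. (2)."""
    return sigma(A @ x + b * u + c)

def multiplexed_simulation(u_seq, x0):
    """Recover the same trajectory from a single scalar feedback stream.
    Node values are emitted one per sub-step; while computing node i, the
    previous macro-step value of node j sits at tap (lag) i + N - j, so
    taps up to depth 2N - 1 suffice.  Phase bookkeeping is explicit here."""
    past = list(x0)                  # the scalar output stream so far
    trajectory = []
    for u in u_seq:
        for i in range(N):           # sub-step i produces the new node i
            x_old = np.array([past[-(i + N - j)] for j in range(N)])
            past.append(float(sigma(A[i] @ x_old + b[i] * u + c[i])))
        trajectory.append(np.array(past[-N:]))   # one macro step finished
    return np.array(trajectory)

u_seq = rng.normal(size=10)
x0 = rng.normal(size=N)

x, direct = x0.copy(), []
for u in u_seq:
    x = fully_connected_step(x, u)
    direct.append(x)

print(np.allclose(np.array(direct), multiplexed_simulation(u_seq, x0)))  # True
```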
It has been shown that fully connected networks with a fixed, finite number
of nodes with saturated-linear activation functions are universal computation devices [22].
As a result, it is possible to simulate a Turing machine with a NARX network
such that the slowdown is constant regardless of problem size. Thus, we conclude
that NARX networks with a finite number of nodes and taps are universal
computation devices.
Here we look at variants of NARX networks in which the output function
is not a linear combiner but rather a hard-limiting nonlinearity.
If the inputs are binary, then recurrent neural networks are only capable
of implementing Finite State Machines (FSMs), and in real time NARX net-
works are only capable of implementing a subset of FSMs called finite memory
machines (FMMs) [13].
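As a small illustration of what a hard-limiting NARX network can compute in real time, the following sketch realises one particular finite memory machine, the running parity of a binary input stream, in which the next output is a threshold function of only the current input and the previous output (memory of order one). The architecture and weights are chosen by hand for this example and are not taken from the constructions cited above.

```python
def step(x: float) -> int:
    """Hard-limiting threshold nonlinearity."""
    return 1 if x > 0 else 0

def parity_narx(u_seq):
    """Running parity y(t) = y(t-1) XOR u(t), built from threshold units.
    Two hidden hard-limiters (an AND and an OR of the taps) feed one
    hard-limiting output node; the only feedback is the delayed output."""
    y_prev, outputs = 0, []
    for u in u_seq:
        h_and = step(u + y_prev - 1.5)       # fires iff both taps are 1
        h_or = step(u + y_prev - 0.5)        # fires iff at least one tap is 1
        y = step(h_or - h_and - 0.5)         # OR and not AND  ==  XOR
        outputs.append(y)
        y_prev = y                           # output fed back through one tap
    return outputs

print(parity_narx([1, 0, 1, 1, 0, 1]))       # [1, 1, 0, 1, 1, 0]
```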
Intuitively, the reason for this limitation is that only a limited amount of
information can be represented by feeding back the outputs alone.
If more information could be inserted into the feedback loop, then it should be
possible to simulate arbitrary FSMs in structures like NARX networks. In fact,
we next show that this is the case. We will show that NARX networks with
hard-limiting nonlinearities are capable of simulating fully connected networks
with a slowdown proportional to the number of nodes. As a result, the NARX
network will be able to simulate arbitrary FSMs.
In [11] it was shown that any n-state FSM can be implemented by a four-layer
recurrent neural network with O(√n) hard-limiting nodes. It is trivial
to show that a fully connected recurrent neural network can simulate an L-layer
recurrent network with a slowdown of L. Based on the fact that a NARX
network with hard-limiting output nodes is only capable of implementing FMMs
in real time, we conclude that such networks can nevertheless simulate arbitrary
FSMs if a sublinear slowdown is allowed.
5 Conclusion
We proved that NARX neural networks are capable of simulating fully connected
networks within a linear slowdown, and as a result are universal dynamical
systems. This theorem is somewhat surprising since the nature of feedback in
this type of network is so limited, i.e. the feedback comes only from the output
neuron.
The Turing equivalence of NARX neural networks implies that they are ca-
pable of representing solutions to just about any computational problem. Thus,
in theory NARX networks can be used instead of fully recurrent neural nets
without losing any computational power.
But Turing equivalence implies that the space of possible solutions is ex-
tremely large. Searching such a large space with gradient descent learning al-
gorithms could be quite difficult. Our experience indicates that it is difficult to
learn even small finite state machines (FSMs) from example strings in either of
these types of networks unless particular caution is taken in the construction
of the machines [9, 4]. Often, a solution is found that classifies the training set
perfectly, but the network in fact learns some complicated dynamical system
which cannot necessarily be equated with any finite state machine.
NARX networks with hard-limiting nonlinearities can be shown to be capable
of only implementing a subclass of finite state machines called finite memory
machines. But, they can implement arbitrary finite state machines if a sublinear
slowdown is allowed.
These results open several questions. What is the simplest feedback or recur-
rence necessary for any network to be Turing universal? Do these results have
implications about the computational power of other types of architectures such
as recurrent networks with local feedback [2, 7, 8]?
Acknowledgements
We would like to thank Peter Tiňo and Hava Siegelmann for many helpful comments.
References
1. N. Alon, A.K. Dewdney, and T.J. Ott. Efficient simulation of finite automata by
neural nets. JACM, 38(2):495-514, 1991.
2. A.D. Back and A.C. Tsoi. FIR and IIR synapses, a new neural network architecture
for time series modeling. Neural Computation, 3(3):375-385, 1991.
3. S. Chen, S.A. Billings, and P.M. Grant. Non-linear system identification using
neural networks. Int. J. Control, 51(6):1191-1214, 1990.
4. D.S. Clouse, C.L. Giles, B.G. Horne, and G.W. Cottrell. Learning large de Bruijn
automata with feed-forward neural networks. Technical Report CS94-398, CSE
Dept., UCSD, La Jolla, CA, 1994.
5. J. Connor, L.E. Atlas, and D.R. Martin. Recurrent networks and NARMA mod-
eling. In NIPS 4, pages 301-308, 1992.
6. G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. of
Control, Signals, and Sys., 2(4):303-314, 1989.
7. B. de Vries and J.C. Principe. The gamma model -- A new neural model for
temporal processing. Neural Networks, 5:565-576, 1992.
8. P. Frasconi, M. Gori, and G. Soda. Local feedback multilayered networks. Neural
Computation, 4:120-130, 1992.
9. C.L. Giles, B.G. Horne, and T. Lin. Learning a class of large finite state machines
with a recurrent neural network. Neural Networks, 1995. In press.
10. B.G. Horne and C.L. Giles. An experimental comparison of recurrent neural net-
works. In NIPS 7, 1995. To appear.
11. B.G. Horne and D.R. Hush. Bounds on the complexity of recurrent neural network
implementations of finite state machines. In NIPS 6, pages 359-366, 1994.
12. J. Kilian and H.T. Siegelmann. On the power of sigmoid neural networks. In Proc.
6th ACM Work. on Comp. Learning Theory, pages 137-143, 1993.
13. Z. Kohavi. Switching and finite automata theory. McGraw-Hill, New York, NY,
2nd edition, 1978.
14. I.J. Leontaritis and S.A. Billings. Input-output parametric models for non-linear
systems: Part I: deterministic non-linear systems. Int. J. Control, 41(2):303-328,
1985.
15. W.S. McCulloch and W.H. Pitts. A logical calculus of the ideas immanent in
nervous activity. Bull. Math. Biophysics, 5:115-133, 1943.
16. M.L. Minsky. Computation: Finite and infinite machines. Prentice-Hall, Engle-
wood Cliffs, 1967.
17. K.S. Narendra and K. Parthasarathy. Identification and control of dynamical sys-
tems using neural networks. IEEE Trans. on Neural Networks, 1:4-27, March
1990.
18. C.W. Omlin and C.L. Giles. Stable encoding of large finite-state automata in
recurrent neural networks with sigmoid discriminants. Neural Computation, 1996.
accepted for publication.
19. S.-Z. Qin, H.-T. Su, and T.J. McAvoy. Comparison of four neural net learning
methods for dynamic system identification. IEEE Trans. on Neural Networks,
3(1):122-130, 1992.
20. H.T. Siegelmann, B.G. Horne, and C.L. Giles. Computational capabilities of
NARX neural networks. Technical Report UMIACS-TR-95-12 and CS-TR-3408,
Institute for Advanced Computer Studies, University of Maryland, 1995.
21. H.T. Siegelmann and E.D. Sontag. Analog computation via neural networks. The-
oretical Computer Science, 131:331-360, 1994.
22. H.T. Siegelmann and E.D. Sontag. On the computational power of neural net-
works. J. Comp. and Sys. Science, 50(1):132-150, 1995.
23. H.T. Siegelmann, E.D. Sontag, and C.L. Giles. The complexity of language recog-
nition by neural networks. In Algorithms, Software, Architecture (Proc. of IFIP
12th World Computer Congress), pages 329-335. North-Holland, 1992.
24. H.-T. Su, T.J. McAvoy, and P. Werbos. Long-term predictions of chemical pro-
cesses using recurrent neural networks: A parallel training approach. Ind. Eng.
Chem. Res., 31:1338-1352, 1992.