
What NARX Networks Can Compute

Bill G. Horne 1, Hava T. Siegelmann 2 and C. Lee Giles 1,3

1 NEC Research Institute, 4 Independence Way, Princeton, NJ 08540


2 Dept. of Information Systems Eng., Faculty of Ind. Eng. and Management,
Technion (Israel Institute of Tech.), Haifa 32000, Israel
3 UMIACS, University of Maryland, College Park, MD 20742

Abstract. We prove that a class of architectures called NARX neural
networks, popular in control applications and other problems, are at least
as powerful as fully connected recurrent neural networks. Recent results
have shown that fully connected networks are Turing equivalent. Building
on those results, we prove that NARX networks are also universal
computation devices. NARX networks have a limited feedback which comes
only from the output neuron rather than from hidden states. There is
much interest in the amount and type of recurrence to be used in recurrent
neural networks. Our results pose the question of what amount of
feedback or recurrence is necessary for any network to be Turing
equivalent and what restrictions on feedback limit computational power.

1 Introduction

1.1 Background
Much of the work on the computational capabilities of recurrent neural networks
has focused on synthesis: how neuron-like elements are capable of constructing fi-
nite state machines (FSMs) [1, 11, 15, 16, 23]. All of these results assume that the
nonlinearity used in the network is a hard-limiting threshold function. However,
when recurrent networks are used adaptively, continuous-valued, differentiable
nonlinearities are almost always used. Thus, an interesting question is how the
computational complexity changes for these types of functions. For example, [18]
has shown that finite state machines can be stably mapped into second order
recurrent neural networks with sigmoid activation functions. More recently, re-
current networks have been shown to be at least as powerful as Turing machines,
and in some cases can have super-Turing capabilities [12, 21, 22].

1.2 Summary of Results

This work extends the ideas discussed above to an important class of discrete-
time nonlinear systems called the Nonlinear AutoRegressive with eXogenous inputs
(NARX) model [14]:

y(t) = f\big( u(t - n_u), \ldots, u(t-1), u(t),\; y(t - n_y), \ldots, y(t-1) \big),   (1)

where u(t) and y(t) represent input and output of the network at time t, n_u and
n_y are the input and output order, and the function f is a nonlinear function.
It has been demonstrated that this particular model is well suited for modeling
nonlinear systems [3, 5, 17, 19, 24]. When the function f can be approximated
by a Multilayer Perceptron, the resulting system is called a NARX network [3,
17]. Other work [10] has shown that for the problems of grammatical inference
and nonlinear system identification, gradient descent learning is more effective
in NARX networks than in recurrent neural network architectures that have
"hidden states." For these studies, the NARX neural net usually converges much
faster and generalizes better than the other types of recurrent networks.
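
For illustration, the following is a minimal sketch of the NARX recurrence of equation (1) as a driver loop over two tapped delay lines. The function name narx_run, the zero initialization of the input taps, and the use of NumPy are our own illustrative choices; f is left as an arbitrary callable.

```python
import numpy as np

def narx_run(f, u_seq, n_u, n_y, y_init):
    """Iterate the NARX model of equation (1): y(t) = f(u-taps, u(t), y-taps).

    f      : arbitrary nonlinear map taking (u_taps, u_t, y_taps)
    u_seq  : sequence of scalar inputs u(0), u(1), ...
    y_init : the n_y initial output values y(-n_y), ..., y(-1)
    """
    u_taps = [0.0] * n_u            # u(t - n_u), ..., u(t - 1), initialized to zero
    y_taps = list(y_init)           # y(t - n_y), ..., y(t - 1)
    outputs = []
    for u_t in u_seq:
        y_t = f(np.array(u_taps), u_t, np.array(y_taps))
        outputs.append(y_t)
        # Shift the delay lines: drop the oldest value, append the newest.
        u_taps = (u_taps + [u_t])[-n_u:] if n_u > 0 else []
        y_taps = (y_taps + [y_t])[-n_y:]
    return outputs
```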
This work proves that NARX networks are computationally at least as strong
as fully connected networks within a linear slowdown. This implies that NARX
networks with a finite number of nodes and taps are at least as powerful as Turing
machines, and thus are universal computation devices, a somewhat unexpected
result given the limited nature of feedback in these networks.
These results should be contrasted with the mapping theorems of [6], which imply
that NARX networks are capable of representing arbitrary systems expressible in
the form of equation (1), but which give no bound on the number of nodes
required to achieve a good approximation. Furthermore, how such systems relate
to conventional models of computation is not clear.
Finally we provide some related results concerning NARX networks with
hard-limiting nonlinearities. Even though these networks are only capable of
implementing a subclass of finite state machines called finite memory machines
(FMMs) in real time, if given more time (a sublinear slowdown) they can simulate
arbitrary FSMs.

2 Recurrent Neural Network Models

For our purposes we need consider only fully connected and NARX recurrent
neural networks. We treat only single-input, single-output systems, though the
results easily extend to the multi-variable case.
We shall adopt the notation that x corresponds to a state variable, u to an
input variable, y to an output variable, and z to a node activation value. In
each of these networks we shall let N correspond to the dimension of the state
space. When necessary to distinguish between variables of the two networks,
those associated with the NARX network will be marked with a tilde.

2.1 Fully Connected Recurrent Neural Network

The state variables of a recurrent network are defined to be the memory elements,
i.e. the set of time delay operators. In a fully connected network there is a one-to-
one correspondence between node activations and state variables of the network,
since each node value is stored at every time step. Specifically, the value of the
N state variables at the next time step are given by x_i(t + 1) = z_i(t). Each
node weights and sums the external inputs to the network and the states of the
network. Specifically, the activation function for each node is defined by

z_i(t) = \sigma\Big( \sum_{j=1}^{N} a_{i,j} x_j(t) + b_i u(t) + c_i \Big)   (2)

where a_{i,j}, b_i, and c_i are fixed real-valued weights, and σ is a nonlinear function
which will be discussed below. The output is assigned arbitrarily to be the value
of the first node in the network, y(t) = z_1(t).
The network is said to be fully connected because there is a weight between
every pair of nodes. However, when a weight a_{i,j} = 0 there is effectively no
connection between nodes i and j. Thus, a fully connected network is very general,
and can be used to represent many different kinds of architectures, including
those in which only a subset of the possible connections between nodes are used.
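
As a quick illustration, here is a minimal sketch of one update step of equation (2); the names fc_step, A, b, c, and sigma are ours and purely illustrative.

```python
import numpy as np

def fc_step(A, b, c, x, u, sigma):
    """One step of the fully connected network, equation (2).

    A : N x N weight matrix (a_ij); b, c : length-N vectors; x : current state;
    u : scalar input; sigma : the nonlinear activation function.
    Returns the new state x(t+1) = z(t); the network output is y(t) = z[0].
    """
    return sigma(A @ x + b * u + c)
```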

2.2 NARX Recurrent Neural Network

A NARX network consists of a Multilayer Perceptron (MLP) which takes as
input a window of past input and output values and computes the current output.
Specifically, the operation of the network is defined by

\tilde{y}(t) = \Psi\big( \tilde{u}(t - n_u), \ldots, \tilde{u}(t-1), \tilde{u}(t),\; \tilde{y}(t - n_y), \ldots, \tilde{y}(t-1) \big),   (3)

where the function Ψ is the mapping performed by the MLP.


The states of the NARX network correspond to a set of two tapped-delay
lines. One consists of n_u taps on the input values, and the other consists of n_y
taps on the output values. Specifically, the state at time t corresponds to the
values

\tilde{x}(t) = \big[\, \tilde{u}(t - n_u) \;\cdots\; \tilde{u}(t-1) \;\;\; \tilde{y}(t - n_y) \;\cdots\; \tilde{y}(t-1) \,\big].

The MLP consists of a set of nodes organized into two layers. There are H nodes
in the first layer, each of which applies the nonlinearity σ to a weighted sum of
the state values and the current input,

\tilde{z}_i(t) = \sigma\Big( \sum_{j} \tilde{a}_{i,j} \tilde{x}_j(t) + \tilde{b}_i \tilde{u}(t) + \tilde{c}_i \Big), \qquad i = 1, \ldots, H.

The output layer consists of a single linear node, \tilde{y}(t) = \sum_{j=1}^{H} w_j \tilde{z}_j(t) + \theta.
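
The sketch below realizes Ψ as such a two-layer MLP; it can be passed directly to the narx_run driver sketched earlier. The names (make_narx_mlp, W1, b1, w2, theta) are ours, and folding the current input into a single weight matrix is our own illustrative choice.

```python
import numpy as np

def make_narx_mlp(W1, b1, w2, theta, sigma):
    """Build Psi of equation (3): one hidden layer with activation sigma
    and a single linear output node.

    W1 : H x (n_u + 1 + n_y) hidden weights; b1 : length-H biases;
    w2 : length-H output weights; theta : output bias.
    """
    def psi(u_taps, u_t, y_taps):
        x = np.concatenate([u_taps, [u_t], y_taps])  # MLP input: taps plus current input
        hidden = sigma(W1 @ x + b1)                  # first layer, the z~_i values
        return float(w2 @ hidden + theta)            # single linear output node
    return psi
```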

Definition 1. A function σ is said to be a bounded, one-side saturated (BOSS)
function if it satisfies the following conditions: (i.) σ has a bounded range, i.e.,
L ≤ σ(x) ≤ U, L ≠ U, for all x ∈ ℝ. (ii.) σ is left-side saturated, i.e. there exists
a finite value s such that σ(x) = S for all x ≤ s. (iii.) σ is non-constant.

Many sigmoid-like functions are BOSS functions, including hard-limiting threshold
functions, saturated linear functions, and "one-side saturated sigmoids",

\sigma(x) = \begin{cases} 0 & x \le c \\ \dfrac{1}{1 + e^{-x}} & x > c \end{cases}

where c ∈ ℝ.
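
As a concrete illustration, here are sketches of the three BOSS examples above as Python functions. The specific saturation constants (S = L = 0, U = 1, and the cut-off c) are our own illustrative choices.

```python
import numpy as np

def hard_limiter(x):
    # Hard-limiting threshold: saturated on both sides, S = L = 0, U = 1.
    return np.where(x <= 0.0, 0.0, 1.0)

def saturated_linear(x):
    # Saturated linear: 0 for x <= 0 (left saturation S = 0), x on [0, 1], 1 above.
    return np.clip(x, 0.0, 1.0)

def one_side_saturated_sigmoid(x, c=0.0):
    # The "one-side saturated sigmoid" above: 0 for x <= c, logistic otherwise.
    # Clamping the exponent avoids overflow in the branch that is discarded.
    return np.where(x <= c, 0.0, 1.0 / (1.0 + np.exp(-np.maximum(x, c))))
```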

3 Turing Equivalence of NARX Networks


We prove that NARX networks with BOSS functions are capable of simulating
fully connected networks with only a linear slowdown. Because of the universality
of some types of fully connected networks with a finite number of nodes, we
conclude that the associated NARX networks are Turing universal as well.

Theorem 2. NARX networks with one hidden layer of nodes with BOSS activation
functions and a linear output node can simulate fully connected recurrent
networks with BOSS activation functions with a linear slowdown.

Here we present a sketch of the proof of the theorem. The interested reader
is referred to [20] for more details.

Proof. To prove the theorem we show how to construct a NARX network 𝒩
that simulates a fully connected network ℱ with N nodes, each of which uses
a BOSS activation function σ. The NARX network requires N + 1 hidden layer
nodes, a linear output node, an output shift register of order n_y = 2N, and no
taps on the input. Without loss of generality we assume that the left saturation
value of σ is S = 0.
The simulation suffers a linear slowdown; specifically, if ℱ computes in time
T, then the total computation time taken by 𝒩 is (N + 1)T. At each time step,
𝒩 will simulate the value of exactly one of the nodes in ℱ. The additional time
step will be used to encode a sequencing signal indicating which node should be
simulated next.
The output taps of 𝒩 will be used to store the simulated states of ℱ; no
taps on the input are required, i.e. n_u = 0. At any given time the tapped delay
line must contain the complete set of values corresponding to all N nodes of ℱ
at the previous simulated time step. To accomplish this, a tapped delay line of
length n_y = 2N can be used which will always contain all of the values of ℱ at
time t - 1 immediately preceding a sequencing signal, μ, to indicate where these
values are in the tap.
The sequencing signal is chosen in such a way that we can define a simple
function f_μ that is used either to "turn off" neurons or to yield a constant value,
according to the values in the taps. Let μ = U + ε for some positive constant ε.
We define the affine function f_μ(x) = x - μ. Then f_μ(μ) = 0 and f_μ(x) ≤ -ε
for all x ∈ [L, U]. It can be shown that the ith hidden layer node takes on a
non-zero value only when the sequencing symbol occurs in state x̃_{2N-i+1} and
when the values of z_j(t - 1) are stored in states x̃_{N+m-i}, m = 1, ..., N. It can
be shown that the ith node in the hidden layer of 𝒩 is updated as follows:

\tilde{z}_i(k + 1) = \sigma\Big( \Big[ \sum_{m=1}^{N} a_{i,m} \tilde{x}_{N+m-i}(k) + b_i u(k) + c_i \Big] + \beta_i \big[ \tilde{x}_{2N-i+1}(k) - \mu \big] \Big),   (4)

where the constant β_i is large enough to make the input to σ less than s when
x̃_{2N-i+1}(k) ≠ μ, so that the whole function is zero. A similar argument is used
to ensure that the final node implements the sequencing signal properly. Since
only one of the hidden layer nodes is non-zero, the output node of 𝒩 is simply
a linear combination so that the output of the network is equal to the value of
the currently active hidden layer node. The resulting network simulates ℱ
with a linear slowdown.
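
To make the construction tangible, the following is a minimal NumPy sketch of the simulation under our own illustrative choices: σ is the saturated linear BOSS function (S = L = 0, U = 1), ε = 0.5, and the external input is held constant over each block of N + 1 micro-steps. The function names and the particular gate constants β_i are ours, not the paper's; this is a sketch of the idea, not the paper's exact construction.

```python
import numpy as np

def satlin(x):
    # Saturated-linear BOSS function: left saturation S = 0, bounded range [0, 1].
    return np.clip(x, 0.0, 1.0)

def run_fully_connected(A, b, c, u_seq, x0):
    # Direct simulation of the fully connected network F, equation (2).
    x, states = np.array(x0, float), [np.array(x0, float)]
    for u in u_seq:
        x = satlin(A @ x + b * u + c)     # x_i(t+1) = z_i(t)
        states.append(x.copy())
    return np.array(states)

def run_narx_simulation(A, b, c, u_seq, x0, eps=0.5, u_max=1.0):
    # NARX simulation of F per the proof sketch: n_u = 0, n_y = 2N output taps,
    # N+1 hidden satlin nodes, one linear output node, (N+1)-fold slowdown.
    N = len(x0)
    mu = 1.0 + eps                         # sequencing signal, outside [L, U] = [0, 1]
    # Gate constants beta_i: large enough that node i saturates at S = 0
    # whenever the sequencing signal is absent from tap position 2N - i + 1.
    beta = (np.abs(A).sum(axis=1) * mu + np.abs(b) * u_max + np.abs(c) + 1.0) / eps
    beta_seq = 2.0 / eps                   # gate constant for the sequencing node

    # Output tap (positions 1..2N, most recent last): start with z(0) in
    # positions N..2N-1 and the sequencing signal in position 2N.
    tap = np.zeros(2 * N)
    tap[N - 1:2 * N - 1] = x0
    tap[2 * N - 1] = mu

    states = [np.array(x0, float)]
    for u in u_seq:                        # u held constant over each macro step
        micro_outputs = []
        for _ in range(N + 1):             # N+1 micro-steps per step of F
            h = np.zeros(N + 1)
            for i in range(1, N + 1):
                # Hidden node i, equation (4): active only when mu sits in tap
                # position 2N - i + 1; positions N+m-i then hold z(t-1).
                window = tap[np.arange(N) + N - i]        # indices N-i .. 2N-i-1
                gate = beta[i - 1] * (tap[2 * N - i] - mu)
                h[i - 1] = satlin(A[i - 1] @ window + b[i - 1] * u + c[i - 1] + gate)
            # Sequencing node: outputs exactly 1 when mu sits in position N.
            h[N] = satlin(1.0 + beta_seq * (tap[N - 1] - mu))
            # Linear output: weight 1 for the N simulation nodes, mu for the
            # sequencing node, so the output equals the single active node's value.
            y = h[:N].sum() + mu * h[N]
            tap = np.append(tap[1:], y)    # shift the output delay line
            micro_outputs.append(y)
        states.append(np.array(micro_outputs[:N]))   # z_1(t), ..., z_N(t)
    return np.array(states)

# Tiny self-check on a random 3-node network (illustrative only).
rng = np.random.default_rng(0)
N = 3
A, b, c = rng.normal(size=(N, N)), rng.normal(size=N), rng.normal(size=N)
x0, u_seq = satlin(rng.normal(size=N)), rng.uniform(-1.0, 1.0, size=5)
assert np.allclose(run_fully_connected(A, b, c, u_seq, x0),
                   run_narx_simulation(A, b, c, u_seq, x0))
```

The gating term β_i[x̃_{2N-i+1}(k) - μ] is what enforces that only one hidden node is awake per micro-step: it contributes exactly zero when the sequencing signal is in place and otherwise drives the node below its left saturation threshold.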

It has been shown that fully connected networks with a fixed, finite number
of saturated linear activation functions are universal computation devices [22].
As a result it is possible to simulate a Turing machine with the NARX network
such that the slowdown is constant regardless of problem size. Thus, we conclude
that

Corollary 3. NARX networks with one hidden layer of nodes with saturated
linear activation functions and linear output nodes are Turing equivalent.

4 NARX Network with Hard-limited Neurons

Here we look at variants of NARX networks in which the output function is
not a linear combiner but rather a hard-limiting nonlinearity.
If the inputs are binary, then recurrent neural networks are only capable
of implementing Finite State Machines (FSMs), and in real time NARX net-
works are only capable of implementing a subset of FSMs called finite memory
machines (FMMs) [13].
Intuitively, the reason why FMMs are constrained is that there is a limited
amount of information that can be represented by feeding back the outputs alone.
If more information could be inserted into the feedback loop, then it should be
possible to simulate arbitrary FSMs in structures like NARX networks. In fact,
we next show that this is the case. We will show that NARX networks with
hard-limiting nonlinearities are capable of simulating fully connected networks
with a slowdown proportional to the number of nodes. As a result, the NARX
network will be able to simulate arbitrary FSMs.

Theorem 4. NARX networks with hard-limiting activation functions, one hidden
layer of N + 1 nodes, and an output tapped delay line of length 4N + 1 can
simulate fully connected networks with N hard-limiting activation functions with
a slowdown of 2N + 3.

Proof. See [20].

In [11] it was shown that any n-state FSM can be implemented by a four-
layer recurrent neural network with O(√n) hard-limiting nodes. It is trivial
to show that a fully connected recurrent neural network can simulate an L-
layer recurrent network with a slowdown of L. Based on the fact that a NARX
network with hard-limiting output nodes is only capable of implementing FMMs,
we conclude that

Corollary 5. For every n-state FSM ℳ, there exists an FMM which can simulate ℳ
with O(√n) nodes and O(√n) slowdown.

5 Conclusion

We proved that NARX neural networks are capable of simulating fully connected
networks within a linear slowdown, and as a result are universal dynamical
systems. This theorem is somewhat surprising since the nature of feedback in
this type of network is so limited, i.e. the feedback comes only from the output
neuron.
The Turing equivalence of NARX neural networks implies that they are ca-
pable of representing solutions to just about any computational problem. Thus,
in theory NARX networks can be used instead of fully recurrent neural nets
without losing any computational power.
But Turing equivalence implies that the space of possible solutions is ex-
tremely large. Searching such a large space with gradient descent learning al-
gorithms could be quite difficult. Our experience indicates that it is difficult to
learn even small finite state machines (FSMs) from example strings in either of
these types of networks unless particular caution is taken in the construction
of the machines [9, 4]. Often, a solution is found that classifies the training set
perfectly, but the network in fact learns some complicated dynamical system
which cannot necessarily be equated with any finite state machine.
NARX networks with hard-limiting nonlinearities can be shown to be capable
of only implementing a subclass of finite state machines called finite memory
machines. But, they can implement arbitrary finite state machines if a sublinear
slowdown is allowed.
These results open several questions. What is the simplest feedback or recur-
rence necessary for any network to be Turing universal? Do these results have
implications about the computational power of other types of architectures such
as recurrent networks with local feedback [2, 7, 8]?

Acknowledgements
We would like to thank Peter Tiňo and Hanna Siegelmann for many helpful comments.

References

1. N. Alon, A.K. Dewdney, and T.J. Ott. Efficient simulation of finite automata by
neural nets. JACM, 38(2):495-514, 1991.
2. A.D. Back and A.C. Tsoi. FIR and IIR synapses, a new neural network architecture
for time series modeling. Neural Computation, 3(3):375-385, 1991.
3. S. Chen, S.A. Billings, and P.M. Grant. Non-linear system identification using
neural networks. Int. J. Control, 51(6):1191-1214, 1990.
4. D.S. Clouse, C.L. Giles, B.G. Horne, and G.W. Cottrell. Learning large deBruijn
automata with feed-forward neural networks. Technical Report CS94-398, CSE
Dept., UCSD, La Jolla, CA, 1994.
5. J. Connor, L.E. Atlas, and D.R. Martin. Recurrent networks and NARMA mod-
eling. In NIPS4, pages 301-308, 1992.
6. G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. of
Control, Signals, and Sys., 2(4):303-314, 1989.
7. B. de Vries and J.C. Principe. The gamma model - A new neural model for
temporal processing. Neural Networks, 5:565-576, 1992.
8. P. Frasconi, M. Gori, and G. Soda. Local feedback multilayered networks. Neural
Computation, 4:120-130, 1992.
9. C.L. Giles, B.G. Horne, and T. Lin. Learning a class of large finite state machines
with a recurrent neural network. Neural Networks, 1995. In press.
10. B.G. Horne and C.L. Giles. An experimental comparison of recurrent neural net-
works. In NIPS7, 1995. To appear.
11. B.G. Horne and D.R. Hush. Bounds on the complexity of recurrent neural network
implementations of finite state machines. In NIPS6, pages 359-366, 1994.
12. J. Kilian and H.T. Siegelmann. On the power of sigmoid neural networks. In Proc.
6th ACM Work. on Comp. Learning Theory, pages 137-143, 1993.
13. Z. Kohavi. Switching and finite automata theory. McGraw-Hill, New York, NY,
2nd edition, 1978.
14. I.J. Leontaritis and S.A. Billings. Input-output parametric models for non-linear
systems: Part I: deterministic non-linear systems. Int. J. Control, 41(2):303-328,
1985.
15. W.S. McCulloch and W.H. Pitts. A logical calculus of the ideas immanent in
nervous activity. Bull. Math. Biophysics, 5:115-133, 1943.
16. M.L. Minsky. Computation: Finite and infinite machines. Prentice-Hall, Engle-
wood Cliffs, 1967.
17. K.S. Narendra and K. Parthasarathy. Identification and control of dynamical sys-
tems using neural networks. IEEE Trans. on Neural Networks, 1:4-27, March
1990.
18. C.W. Omlin and C.L. Giles. Stable encoding of large finite-state automata in
recurrent neural networks with sigmoid discriminants. Neural Computation, 1996.
accepted for publication.
19. S.-Z. Qin, H.-T. Su, and T.J. McAvoy. Comparison of four neural net learning
methods for dynamic system identification. IEEE Trans. on Neural Networks,
3(1):122-130, 1992.
20. H.T. Siegelmann, B.G. Horne, and C.L. Giles. Computational capabilities of
NARX neural networks. Technical Report UMIACS-TR-95-12 and CS-TR-3408,
Institute for Advanced Computer Studies, University of Maryland, 1995.
21. H.T. Siegelmann and E.D. Sontag. Analog computation via neural networks. The-
oretical Computer Science, 131:331-360, 1994.

22. H.T. Siegelmann and E.D. Sontag. On the computational power of neural net-
works. J. Comp. and Sys. Science, 50(1):132-150, 1995.
23. H.T. Siegelmann, E.D. Sontag, and C.L. Giles. The complexity of language recog-
nition by neural networks. In Algorithms, Software, Architecture (Proc. of IFIP
12th World Computer Congress), pages 329-335. North-Holland, 1992.
24. H.-T. Su, T.J. McAvoy, and P. Werbos. Long-term predictions of chemical pro-
cesses using recurrent neural networks: A parallel training approach. Ind. Eng.
Chem. Res., 31:1338-1352, 1992.
