Approximating recurrent neural networks using feed-forward architectures
21st April, 2017

Recurrent neural network architectures can have useful computational properties, with complex temporal dynamics and attractor regimes. However, evaluation of recurrent dynamic architectures requires solution of systems of differential equations, and the number of evaluations required to determine their response to a given input can vary with the input, or can be indeterminate altogether in the case of oscillations or instability. In feed-forward networks, by contrast, only a single pass through the network is needed to determine the response to a given input.
Modern machine-learning systems are designed to operate efficiently on feed-forward architectures. We hypothesised that two-layer feed-forward architectures with simple, deterministic dynamics could approximate the responses of single-layer recurrent network architectures. By identifying the fixed-point responses of a given recurrent network, we trained two-layer networks to directly approximate the fixed-point response to a given input. These feed-forward networks then embodied useful computations, including competitive interactions, information transformations and noise rejection. Our approach was able to find useful approximations to recurrent networks, which can be evaluated with deterministic, linear time complexity.
Recurrent networks and feed-forward approximations
Fig. 1a shows an example of a simple 2-neuron single-layer recurrent network. The dynamics of each rectified-linear neuron (\(x_j\), composed into a vector of activity \(\mathbf{x}\)) are governed by a nonlinear differential equation \(\tau\mathbf{\dot{x}}+\mathbf{x}=W_R\left[\mathbf{x}\right]^+ + \mathbf{i}\), which evolves in response to the input provided to the network (\(\mathbf{i}\)), as well as the activity of the rest of the network transformed by the recurrent synaptic weight matrix \(W_{R}\). Here \( \left[x\right]^+ \) is the threshold-linear function \(\max\left(0,x\right) \).
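As a concrete illustration, the fixed point corresponding to a given input can be located by forward-Euler integration of the equation above. The following is a minimal numpy sketch; the weight matrix, time constant, step size and tolerance are illustrative assumptions, not values from the paper.

```python
import numpy as np

def simulate_fixed_point(W_R, i, tau=1.0, dt=0.01, t_max=500.0, tol=1e-6):
    """Euler-integrate  tau*dx/dt + x = W_R.[x]^+ + i  until the activity settles.

    Returns the fixed point f (the network output is [f]^+), or None if the
    dynamics do not converge within t_max (oscillatory or unstable regimes).
    """
    x = np.zeros_like(i, dtype=float)
    for _ in range(int(t_max / dt)):
        dx = (W_R @ np.maximum(x, 0.0) + i - x) / tau
        x = x + dt * dx
        if np.linalg.norm(dx) < tol:      # activity has settled to a fixed point
            return x
    return None

# Example: a 2-neuron network with mutual inhibition (weights are illustrative)
W_R = np.array([[0.0, -0.5],
                [-0.5, 0.0]])
f = simulate_fixed_point(W_R, np.array([0.8, 0.3]))
print(np.maximum(f, 0.0))                 # rectified fixed-point response [f]^+
```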
We attempted to approximate the mapping between network inputs \(\mathbf{i}\) and network fixed points \(\left[\mathbf{f}\right]^{+}\) using a family of feed-forward network architectures (Fig. 1b). For a recurrent network with \(N=2\) neurons, the corresponding feed-forward approximation consisted of two layers, each containing \(N=2\) ReLU neurons. All-to-all weight matrices \(W_{FF}^{1}\) and \(W_{FF}^{2}\) defined the connectivity between the network input (\(\mathbf{i}\)), the neurons of layer 1 (\(\mathbf{x}^{1}\)), and the neurons of layer 2 (\(\mathbf{x}^{2}\)). The activity of \(\mathbf{x}^{2}\) was taken as the output of the network. The response of the feed-forward approximation network was given by \[\mathbf{x}^{1}= \left[W_{FF}^{1}\cdot\mathbf{i}-\mathbf{b}_{FF}^{1}\right]^{+}\] \[ \mathbf{x}^{2}= \left[W_{FF}^{2}\cdot\mathbf{x}^{1}-\mathbf{b}_{FF}^{2}\right]^{+}\mathrm{.} \]
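Evaluating the approximation is then a single deterministic pass through the two layers. A minimal numpy sketch of these two equations (the function and argument names are illustrative):

```python
import numpy as np

def feedforward_response(i, W1, b1, W2, b2):
    """Two-layer ReLU approximation: x1 = [W1.i - b1]^+,  x2 = [W2.x1 - b2]^+."""
    x1 = np.maximum(W1 @ i - b1, 0.0)    # layer 1 activity
    x2 = np.maximum(W2 @ x1 - b2, 0.0)   # layer 2 activity, taken as the output
    return x2
```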
Training
We randomly sampled the input space by drawing uniform random variates from the \(N\)-dimensional hypercube \(\left(-1,1\right)^{N}\). For each input, we solved the dynamics of the recurrent network to determine whether a stable fixed-point response existed for that input, discarding inputs for which no stable fixed point existed. This gave a mapping between a set of inputs \(\mathcal{I}\) and the set of corresponding fixed-point responses \(\mathcal{F}\), which was used as training data to find an optimal feed-forward approximation to that mapping. We trained the networks using a stochastic gradient-descent optimisation algorithm with momentum and adaptive learning rates (Adam; Kingma & Ba 2015).
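A hedged sketch of this procedure is given below, reusing the simulate_fixed_point routine sketched above and assuming some recurrent weight matrix W_R has already been defined. The sample count, learning rate, epoch count and use of full-batch updates are illustrative choices, not the settings used in the paper.

```python
import numpy as np
import torch

# Build the training set: sample inputs from (-1, 1)^N and keep only those
# for which the recurrent network settles to a stable fixed point.
N = W_R.shape[0]
inputs, targets = [], []
while len(inputs) < 2_000:                      # illustrative sample count
    i = np.random.uniform(-1.0, 1.0, size=N)
    f = simulate_fixed_point(W_R, i)
    if f is not None:                           # discard non-convergent inputs
        inputs.append(i)
        targets.append(np.maximum(f, 0.0))      # target is the rectified fixed point

X = torch.tensor(np.array(inputs), dtype=torch.float32)
Y = torch.tensor(np.array(targets), dtype=torch.float32)

# Two-layer ReLU approximation. nn.Linear computes W.x + b; the sign of the
# bias relative to b_FF above is simply absorbed during training.
model = torch.nn.Sequential(
    torch.nn.Linear(N, N), torch.nn.ReLU(),     # W_FF^1, b_FF^1
    torch.nn.Linear(N, N), torch.nn.ReLU(),     # W_FF^2, b_FF^2
)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2_000):                      # full-batch updates for brevity
    optimiser.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), Y)
    loss.backward()
    optimiser.step()
```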
Competitive networks with partitioned excitatory structure
We investigated a simple version of subnetwork partitioning, with all-or-nothing recurrent excitatory connectivity (Fig. 2a). Networks with this connectivity pattern exhibit strong recurrent recruitment of excitatory neurons within a given partition, coupled with strong competition between partitions mediated by shared inhibitory feedback. As a consequence, the recurrent network can be viewed as solving a simple classification problem, whereby the network signals which is the greater of the summed input to partition A (\(\iota_{1+2}=\sum_{j=1,2}\iota_{j}\)) or to partition B (\(\iota_{3+4}=\sum_{j=3,4}\iota_{j}\)). In addition, the network signals an analogue value linearly related to the difference between the inputs. If \(\iota_{1+2}>\iota_{3+4}\) then the network should respond by strong activation of \(x_{1,2}\) and complete inactivation of \(x_{3,4}\) (and vice versa for \(\iota_{1+2}<\iota_{3+4}\)).
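A minimal sketch of this partitioned connectivity is below, reusing the simulate_fixed_point routine sketched earlier. The weight values, and the use of a direct effective inhibitory weight in \(W_R\) in place of an explicit inhibitory population, are assumptions for illustration only.

```python
import numpy as np

w_E, w_I = 0.9, -0.6                    # illustrative weights, chosen so the
                                        # within-partition gain stays below 1
W_R = np.full((4, 4), w_I)              # effective inhibition between all neurons
W_R[np.ix_([0, 1], [0, 1])] = w_E       # partition A: x1, x2 excite each other
W_R[np.ix_([2, 3], [2, 3])] = w_E       # partition B: x3, x4 excite each other
np.fill_diagonal(W_R, 0.0)              # no self-connections

# With mixed input, the partition receiving the larger summed input dominates
f = simulate_fixed_point(W_R, np.array([0.6, 0.5, 0.55, 0.52]))
print(np.maximum(f, 0.0))               # with these weights: A active, B silenced
```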
We examined the strength of competition between the excitatory partitions by providing mixed input to both partitions and comparing the recurrent network response with the feed-forward approximation (Fig. 2b). Both networks exhibited strong competition between responses of the two excitatory partitions: only a single partition was active for a given network input, even when the input currents to the two partitions were almost equal. In addition, the feed-forward network learned a good approximation to the analogue response of the recurrent network.
Although the feed-forward approximation was not trained explicitly as a classifier, we examined the extent to which the feed-forward approximation had learned the decision boundary implemented by the recurrent network (Fig. 3). Multi-layer feed-forward neural networks of course have a long history of being used as classifiers (e.g. Rumelhart et al. 1986, LeCun et al. 1989). The purpose of the approach presented here is to examine how well the feed-forward approximation has learned to mimic the boundaries between basins of attraction embedded in the recurrent dynamic network. This question is particularly interesting for larger and more complex recurrent networks, for which the boundaries between basins of attraction are not known a priori.
Discussion
Feed-forward approximations to dynamic recurrent systems can capture the information processing benefits of highly recurrent networks in conceptually and computationally simpler architectures. Information processing tasks such as selective amplification and noise rejection as performed by recurrent dynamical networks can therefore be incorporated into feed-forward network architectures. Evaluation of the feed-forward approximations is deterministic in time, in contrast to seeking a fixed-point response in the dynamic recurrent network, where the time taken to reach a fixed-point response — and indeed the existence of a stable fixed point — can depend on the input to the network. Feed-forward approximations provide a guaranteed solution for each network input, although in the case of oscillatory or unstable dynamics in the recurrent network the approximation will be inaccurate. Finally, the architecture of the feed-forward approximations is compatible with modern systems for optimised and distributed evaluation of deep networks.
Publication
This work was published in Neural Computation: DR Muir (2018). Feed-forward approximations to dynamic recurrent network architectures. Neural Computation 30(2): 546–567. DOI: 10.1162/neco_a_01042.