Approximating recurrent neural networks using feed-forward architectures
21st April, 2017

Recurrent neural network architectures can have useful computational properties, with complex temporal dynamics and attractor regimes. However, evaluation of recurrent dynamic architectures requires solution of systems of differential equations, and the number of evaluations required to determine their response to a given input can vary with the input, or can be indeterminate altogether in the case of oscillations or instability. In feed-forward networks, by contrast, only a single pass through the network is needed to determine the response to a given input.
Modern machine-learning systems are designed to operate efficiently on feed-forward architectures. We hypothesised that two-layer feed-forward architectures with simple, deterministic dynamics could approximate the responses of single-layer recurrent network architectures. By identifying the fixed-point responses of a given recurrent network, we trained two-layer networks to directly approximate the fixed-point response to a given input. These feed-forward networks then embodied useful computations, including competitive interactions, information transformations and noise rejection. Our approach was able to find useful approximations to recurrent networks, which can be evaluated with deterministic, linear time complexity.
Recurrent networks and feed-forward approximations
Fig. 1a shows an example of a simple 2-neuron single-layer recurrent network. The dynamics of each rectified-linear neuron (\(x_j\), composed into a vector of activity \(\mathbf{x}\)) are governed by a nonlinear differential equation \(\tau\mathbf{\dot{x}}+\mathbf{x}=W_R\left[\mathbf{x}\right]^+ + \mathbf{i}\), which evolves in response to the input provided to the network (\(\mathbf{i}\)), as well as the activity of the rest of the network transformed by the recurrent synaptic weight matrix \(W_{R}\). Here \( \left[x\right]^+ \) is the threshold-linear function \(\max\left(0,x\right) \).
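As a concrete illustration, the fixed point corresponding to a given input can be located by forward-Euler integration of the equation above. The following is a minimal numpy sketch; the weight matrix, time constant, step size and tolerance are illustrative assumptions, not values from the paper.

```python
import numpy as np

def simulate_fixed_point(W_R, i, tau=1.0, dt=0.01, t_max=500.0, tol=1e-6):
    """Euler-integrate  tau*dx/dt + x = W_R.[x]^+ + i  until the activity settles.

    Returns the fixed point f (the network output is [f]^+), or None if the
    dynamics do not converge within t_max (oscillatory or unstable regimes).
    """
    x = np.zeros_like(i, dtype=float)
    for _ in range(int(t_max / dt)):
        dx = (W_R @ np.maximum(x, 0.0) + i - x) / tau
        x = x + dt * dx
        if np.linalg.norm(dx) < tol:      # activity has settled to a fixed point
            return x
    return None

# Example: a 2-neuron network with mutual inhibition (weights are illustrative)
W_R = np.array([[0.0, -0.5],
                [-0.5, 0.0]])
f = simulate_fixed_point(W_R, np.array([0.8, 0.3]))
print(np.maximum(f, 0.0))                 # rectified fixed-point response [f]^+
```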
We attempted to approximate the mapping between network inputs \(\mathbf{i}\) and network fixed points \(\left[\mathbf{f}\right]^{+}\) using a family of feed-forward network architectures (Fig. 1b). For a recurrent network with \(N=2\) neurons, the corresponding feed-forward approximation consisted of two layers, each containing \(N=2\) ReLU neurons. All-to-all weight matrices \(W_{FF}^{1}\) and \(W_{FF}^{2}\) defined the connectivity between the network input (\(\mathbf{i}\)), the neurons of layer 1 (\(\mathbf{x}^{1}\)), and the neurons of layer 2 (\(\mathbf{x}^{2}\)). The activity of \(\mathbf{x}^{2}\) was taken as the output of the network. The response of the feed-forward approximation network was given by \[\mathbf{x}^{1}= \left[W_{FF}^{1}\cdot\mathbf{i}-\mathbf{b}_{FF}^{1}\right]^{+}\] \[ \mathbf{x}^{2}= \left[W_{FF}^{2}\cdot\mathbf{x}^{1}-\mathbf{b}_{FF}^{2}\right]^{+}\mathrm{.} \]
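Evaluating the approximation is then a single deterministic pass through the two layers. A minimal numpy sketch of these two equations (the function and argument names are illustrative):

```python
import numpy as np

def feedforward_response(i, W1, b1, W2, b2):
    """Two-layer ReLU approximation: x1 = [W1.i - b1]^+,  x2 = [W2.x1 - b2]^+."""
    x1 = np.maximum(W1 @ i - b1, 0.0)    # layer 1 activity
    x2 = np.maximum(W2 @ x1 - b2, 0.0)   # layer 2 activity, taken as the output
    return x2
```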
Training
We randomly sampled the input space by drawing uniform random variates from the \(N\)-dimensional hypercube \(\left(-1,1\right)^{N}\). For each input, we solved the dynamics of the recurrent network to determine whether a stable fixed-point response existed for that input, discarding inputs for which no stable fixed point existed. This gave a mapping between a set of inputs \(\mathcal{I}\) and the set of corresponding fixed-point responses \(\mathcal{F}\), which was used as training data to find an optimal feed-forward approximation to that mapping. We trained the networks using a stochastic gradient-descent optimisation algorithm with momentum and adaptive learning rates (Adam; Kingma & Ba 2015).
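A hedged sketch of this procedure is given below, reusing the simulate_fixed_point routine sketched above and assuming some recurrent weight matrix W_R has already been defined. The sample count, learning rate, epoch count and use of full-batch updates are illustrative choices, not the settings used in the paper.

```python
import numpy as np
import torch

# Build the training set: sample inputs from (-1, 1)^N and keep only those
# for which the recurrent network settles to a stable fixed point.
N = W_R.shape[0]
inputs, targets = [], []
while len(inputs) < 2_000:                      # illustrative sample count
    i = np.random.uniform(-1.0, 1.0, size=N)
    f = simulate_fixed_point(W_R, i)
    if f is not None:                           # discard non-convergent inputs
        inputs.append(i)
        targets.append(np.maximum(f, 0.0))      # target is the rectified fixed point

X = torch.tensor(np.array(inputs), dtype=torch.float32)
Y = torch.tensor(np.array(targets), dtype=torch.float32)

# Two-layer ReLU approximation. nn.Linear computes W.x + b; the sign of the
# bias relative to b_FF above is simply absorbed during training.
model = torch.nn.Sequential(
    torch.nn.Linear(N, N), torch.nn.ReLU(),     # W_FF^1, b_FF^1
    torch.nn.Linear(N, N), torch.nn.ReLU(),     # W_FF^2, b_FF^2
)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2_000):                      # full-batch updates for brevity
    optimiser.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), Y)
    loss.backward()
    optimiser.step()
```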
Competitive networks with partitioned excitatory structure
We investigated a simple version of subnetwork partitioning, with all-or-nothing recurrent excitatory connectivity (Fig. 2a). Networks with this connectivity pattern exhibit strong recurrent recruitment of excitatory neurons within a given partition, coupled with strong competition between partitions mediated by shared inhibitory feedback. As a consequence, the recurrent network can be viewed as solving a simple classification problem, whereby the network signals which is the greater of the summed input to partition A (\(\iota_{1+2}=\sum_{j=1,2}\iota_{j}\)) or to partition B (\(\iota_{3+4}=\sum_{j=3,4}\iota_{j}\)). In addition, the network signals an analogue value linearly related to the difference between the inputs. If \(\iota_{1+2}>\iota_{3+4}\) then the network should respond by strong activation of \(x_{1,2}\) and complete inactivation of \(x_{3,4}\) (and vice versa for \(\iota_{1+2}<\iota_{3+4}\)).
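A minimal sketch of this partitioned connectivity is below, reusing the simulate_fixed_point routine sketched earlier. The weight values, and the use of a direct effective inhibitory weight in \(W_R\) in place of an explicit inhibitory population, are assumptions for illustration only.

```python
import numpy as np

w_E, w_I = 0.9, -0.6                    # illustrative weights, chosen so the
                                        # within-partition gain stays below 1
W_R = np.full((4, 4), w_I)              # effective inhibition between all neurons
W_R[np.ix_([0, 1], [0, 1])] = w_E       # partition A: x1, x2 excite each other
W_R[np.ix_([2, 3], [2, 3])] = w_E       # partition B: x3, x4 excite each other
np.fill_diagonal(W_R, 0.0)              # no self-connections

# With mixed input, the partition receiving the larger summed input dominates
f = simulate_fixed_point(W_R, np.array([0.6, 0.5, 0.55, 0.52]))
print(np.maximum(f, 0.0))               # with these weights: A active, B silenced
```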
We examined the strength of competition between the excitatory partitions by providing mixed input to both partitions and comparing the recurrent network response with the feed-forward approximation (Fig. 2b). Both networks exhibited strong competition between responses of the two excitatory partitions: only a single partition was active for a given network input, even when the input currents to the two partitions were almost equal. In addition, the feed-forward network learned a good approximation to the analogue response of the recurrent network.
Although the feed-forward approximation was not trained explicitly as a classifier, we examined the extent to which the feed-forward approximation had learned the decision boundary implemented by the recurrent network (Fig. 3). Multi-layer feed-forward neural networks of course have a long history of being used as classifiers (e.g. Rumelhart et al. 1986, LeCun et al. 1989). The purpose of the approach presented here is to examine how well the feed-forward approximation has learned to mimic the boundaries between basins of attraction embedded in the recurrent dynamic network. This question is particularly interesting for larger and more complex recurrent networks, for which the boundaries between basins of attraction are not known a priori.
Discussion
Feed-forward approximations to dynamic recurrent systems can capture the information processing benefits of highly recurrent networks in conceptually and computationally simpler architectures. Information processing tasks such as selective amplification and noise rejection as performed by recurrent dynamical networks can therefore be incorporated into feed-forward network architectures. Evaluation of the feed-forward approximations is deterministic in time, in contrast to seeking a fixed-point response in the dynamic recurrent network, where the time taken to reach a fixed-point response — and indeed the existence of a stable fixed point — can depend on the input to the network. Feed-forward approximations provide a guaranteed solution for each network input, although in the case of oscillatory or unstable dynamics in the recurrent network the approximation will be inaccurate. Finally, the architecture of the feed-forward approximations is compatible with modern systems for optimised and distributed evaluation of deep networks.
Publication
This work was published in Neural Computation: DR Muir (2018). Feed-forward approximations to dynamic recurrent network architectures. Neural Computation 30(2): 546–567. DOI: 10.1162/neco_a_01042.