In the previous post, we discussed a simplified neural random-access memory machine (NRAM). We dubbed this simplified machine a neural register machine, since it lacks random-access memory but has a fixed set of registers to store data. In this post, we'll implement the register machine using Theano, and then train it on a very simple example. We'll then demonstrate how we could extend this to the full NRAM model, showing how to implement the READ and WRITE gates.

This post is meant as a code-heavy demonstration of how to implement a model like the one under discussion, and all the subtleties that have to be dealt with. You will not get much out of this post unless you are willing to read the code carefully and scrutinize it until you are sure you understand it!

This post is a runnable IPython notebook, which you can download here. While reading it, I recommend you refer to the previous post for the theory being implemented. The code for the full NRAM model is available on Github.

We'll start by importing the the necessary libraries, namely Theano and Numpy:

In [1]:

import theano
from theano import tensor
import numpy as np

from collections import namedtuple

Implementing Gates

Next, let's define a few gates. Recall from our previous posts that gates can be constants, single-argument functions, or two-argument functions, and operate on probability distributions over the integers 0 to $M - 1$.

We'll represent gates via a Gate namedtuple with an arity field (0 for constants, 1 for single argument functions, etc) and a module field (which stores the actual function the gate applies):

In [2]:

Gate = namedtuple("Gate", "arity module")

Constant Gates

The simplest gates to define are constants. The probability distribution for a constant is a one-hot encoding of the value; for example, the integer one is converted to a vector in which the 2nd element (index 1) is one, and all others are zero. This vector represents the probability distribution which, when sampled, returns one always.

Let's define a few useful constant gates:

In [3]:

# Theano's to_one_hot converts a numpy array
# to a one-hot encoded version. In order to
# know how big to make the vector, all gates
# take M, the smallest non-representable integer,
# as an argument.
from theano.tensor.extra_ops import to_one_hot

def make_constant_gate(value):
    """Create a gate that returns a constant distribution."""
    # Arguments to to_one_hot must be Numpy arrays.
    arr = np.asarray([value])
    
    def module(max_int):
        """Return the one-hot encoded constant."""
        return to_one_hot(arr, max_int)
    
    arity = 0
    return Gate(arity, module)

# Gates for 0, 1, 2: useful constants!
gate_zero = make_constant_gate(0)
gate_one  = make_constant_gate(1)
gate_two  = make_constant_gate(2)

Broadcasting

You should be aware of one particular feature of the above code: broadcasting. When we end up training our neural network, we don't want to run the for every sample; rather, we want to run the machine on all the samples at once. This has no effect on semantics, but is important for performance.

When we do this, we represent each sample as a separate row (the first dimension of the tensor). Since constants are the same for all samples, they need to be broadcast along the first dimension; Theano needs to know to automatically scale the constant vector along the first dimension, copying it as many times as necessary.

to_one_hot automatically creates a variable of the row type, which is broadcastable in its first dimension. Observe this in the following code, in which y automatically gets broadcast along the first dimension to match the dimension of x:

In [4]:

x = np.ones((2, 3))
y = gate_one.module(3)

print("X =", x)
print("Y =", y.eval())
print("X + Y =", (x + y).eval())

X = [[ 1.  1.  1.]
 [ 1.  1.  1.]]
Y = [[ 0.  1.  0.]]
X + Y = [[ 1.  2.  1.]
 [ 1.  2.  1.]]

For more info on broadcasting, take a look at the Theano documentation on broadcasting.

Unary Gates

Next up, we define two one-argument gates. The eq_zero gate returns whether its input is equal to zero; the negate gate returns its input times negative one (modulo $M$).

eq_zero returns a boolean (a zero or a one); that is, the result only has non-zero values in the first to components. To implement it, we must create the vector component-by-component, for which we need Theano's set_subtensor, which allows direct manipulation of vector components:

In [5]:

# In Theano, writing
#   >>> A = set_subtensor(A[0], x)
# corresponds to standard Python's
#   >>> A = A.copy(); A[0] = x
from theano.tensor import set_subtensor

def eq_zero(A, max_int):
    """Given a Theano vector A, return a vector
    of the same shape where the first component is
    1 - A[0], the second component is A[0], and the
    other components are all zero.
    
    This corresponds to a neural gate that checks for zero."""
    
    # Initialize result to be zeros in the same shape as A.
    # This should be a list of row vectors of length max_int.
    # By operating on all rows at once we allow multiple samples
    # to be processed in one call to this function.
    result = tensor.zeros_like(A)
    result = set_subtensor(result[:, 1], A[:, 0])
    result = set_subtensor(result[:, 0], 1 - A[:, 0])
    return result

gate_eqz = Gate(1, eq_zero)

In order to implement negate, we must recognize that arithmetic in our formulation operates modulo $M$. Negative values simply correspond to other positive integers. Specifically, the mapping from values to their negatives is:

$$\begin{align*} 0 \to&\; 0 \\ 1 \to&\; M - 1 \\ 2 \to&\; M - 2 \\ \vdots \to&\; \vdots \\ M - 1 \to&\; 1\end{align*}$$

You can confirm this mapping by observing that $x + (-x) \equiv 0 \text{ mod } M$ for each one, as in $2 + (M - 2) = M \equiv 0 \text{ mod } M$. As you can see from observing the right column, this is just the values 0 through $M - 1$ in descending order, but rotated one space to have zero at the top. We can implement this with Theano's roll and Python's a[::-1] for reversing an array:

In [6]:

# `roll` circularly rotates elements of a vector.
# For example, 
#   >>> roll([1, 2, 3], -2)
# rotates the elements two spaces to the left, resulting in
#   >>> [3, 1, 2]
from theano.tensor import roll

def negate(A, max_int):
    """Negate a distribution over integers."""
    return roll(A[:, ::-1], 1, axis=1)

gate_negate = Gate(1, negate)

Binary Gates

Finally, let's implement at least one binary gate: addition. We discussed the addition gate in the previous post, yielding the following module for computing the distribution of a sum:

$$m_{\text{ADD}}(\vec A, \vec B)_\ell = \sum_{j=0}^{M-1} A_j B_{\ell - k \text{ mod } M}.$$

We can implement this efficiently by creating a matrix, $B'$, which, when premultiplied by $\vec A$, yields $m_{\text{ADD}}(\vec A, \vec B)$. The columns of $B'$ must be the coefficients $B_{\ell - k \text{ mod } M}$; these columns are just reversed and circularly shifted version of $\vec B$ itself:

In [7]:

# `stack` stacks a list of vectors
# into a matrix. It increases the number 
# of dimensions in the tensor by one.
# The dimension it adds is configurable.
from theano.tensor import stack

# `batched_dot` performs matrix multiplication
# across a set of samples, where each sample is
# a row (uses the dimension).
from theano.tensor import batched_dot

def add(A, B, max_int):
    """Returns the distribution for a sum of integers."""
    rows = [roll(B[:, ::-1], shift + 1, axis=1)
            for shift in range(max_int)]
    B_prime = stack(rows, axis=1).transpose(0, 2, 1)
    return batched_dot(A, B_prime)

gate_add = Gate(2, add)

We use the transpose method to transpose the last two dimension but leave the first one alone (since it contains the parallel samples). This gate can be tricky, so I recommend using it as a self-test in the exercise below!

To gain a bit of intuition, let's test our modules:

In [8]:

# Distributions over the set {0, 1, 2, 3, 4}
max_int = 5

zero = gate_zero.module(max_int)
one = gate_one.module(max_int)
two = gate_two.module(max_int)
three = add(one, two, max_int)

print("0 = 0?   ", eq_zero(zero, max_int).eval())
print("0 = 1?   ", eq_zero(one, max_int).eval())
print("2        ", two.eval())
print("-2 = 3   ", negate(two, max_int).eval())
print("1 + 2 = 3", three.eval())
print("3 + 3 = 1", add(three, three, max_int).eval())

0 = 0?    [[ 0.  1.  0.  0.  0.]]
0 = 1?    [[ 1.  0.  0.  0.  0.]]
2         [[ 0.  0.  1.  0.  0.]]
-2 = 3    [[ 0.  0.  0.  1.  0.]]
1 + 2 = 3 [[ 0.  0.  0.  1.  0.]]
3 + 3 = 1 [[ 0.  1.  0.  0.  0.]]

Exercise: Confirm that these definitions work when the input is a list of samples (each sample as one row), rather than just one sample at a time. For each gate, look at all the intermediate products and explain what shape each of the tensors have. Pay particular attention to the add gate.

Implementing Fuzzy Circuits

Now that we have a set of gates to play with, let's figure out how to connect them into end-to-end fuzzy circuits.

Recall from the previous post that, for a given gate $i$, the controller generates coefficients $\vec a_i$ and $\vec b_i$ as necessary, where the vector indicates coefficients for a linear combination of registers and previous outputs. Given these coefficients, we'd like to write the gates together into one end-to-end system. To do so, we need to be able to take a coefficient-weighted sum of registers and previous outputs:

In [9]:

def avg(distributions, coefficients):
    """
    Return the weighted average of a set of vectors.
    
    Shapes:
       distributions: (S, N, M)
       coefficients:  (S, N)
       return value:  (S, M)

    where
        S:  number of samples to perform this operation on
        N:  number of vectors in the set
        M:  number of elements in each vector
    """
    # Shuffle coefficients to shape (S, N, 1)
    coeffs = coefficients.dimshuffle(0, 1, 'x')
    
    # Transpose distributions to (S, M, N)
    dists = distributions.transpose(0, 2, 1)
    
    # Batched multiply to get shape (S, M, 1),
    # then drop the last dimension.
    return batched_dot(dists, coeffs).flatten(2)

This operation is a straight-forward matrix multiplication, made much less straight-forward by the need to massage the dimensions into a form where applying the operation to many samples at once works. We can confirm that it works with one sample:

In [10]:

vectors = [[
    [1, 2, 3],       # X
    [10, 20, 30],    # Y
    [100, 200, 300], # Z
]]
coefficients = [[0.1, 0.3, 0.6]]

from theano.tensor import as_tensor

# 0.1 X + 0.3 Y + 0.6 Z
avg(as_tensor(vectors), as_tensor(coefficients)).eval()

Out[10]:

array([[  63.1,  126.2,  189.3]])

Exercise: Reimplement this yourself with Theano, making sure that the operation works when applied to many samples at once. Look at the docstring of avg to determine the input shapes.

The next building block is one that runs just one gate, given the registers and previous gate outputs, and the list of controller-generated coefficients:

In [11]:

def run_gate(gate_inputs, gate, controller_coefficients, max_int):
    """Return the output of a gate in the circuit.
    
    gate_inputs:
      The values of the registers and previous gate outputs.
    gate:
      The gate to compute output for. Arity must
      match len(controller_coefficients).
    controller_coeffficients:
      A list of coefficient arrays from the controller,
      one coefficient for every gate input (0 for constants).
    """
    args = [avg(gate_inputs, coefficients)
            for coefficients in controller_coefficients]
    output = gate.module(*args, max_int)
    
    # Special-case constant gates.
    # Since they have no outputs, they always output
    # one sample. Repeat their outputs as many times
    # as necessary, effectively doing manual broadcasting
    # to generate an output of the right size.
    if gate.arity == 0:
        output = output.repeat(gate_inputs.shape[0], axis=0)
        
    return output

Finally, since we can now run one gate, we can combine all our gates into a full circuit. The full circuit runs through all the gates, generating outputs, and then generates new values for the registers, similarly using the controller coefficients.

In [12]:

from theano.tensor import concatenate

def run_circuit(registers, gates, controller_coefficients, max_int):
    # Initially, only the registers may be used as inputs.
    gate_inputs = registers
    
    # Run through all the gates.
    for i in range(len(gates)):
        output = run_gate(gate_inputs, gates[i],
                          controller_coefficients[i], max_int)
        # Append the output of the gate as an input for future gates.
        gate_inputs = concatenate([gate_inputs,
                                   output.dimshuffle(0, 'x', 1)],
                                  axis=1)
        
    # All leftover coefficients are for registers.
    new_registers = []
    for i in range(len(gates), len(controller_coefficients)):
        new_registers.append(avg(gate_inputs, controller_coefficients[i]))
    return tensor.stack(new_registers, axis=1)

Testing this requires a bit more setup than before, as it encompasses almost the entire complexity of our system. We will create a simple system, with three registers, four gates, and $M = 5$. We must define the initial register values, the order of the gates, and the connections between the registers, the gates, and the new register values. The connections are set by choosing appropriate controller coefficients.

In [13]:

# Registers: R1 = 0, R2 = 1, R3 = 3
# These use the previously defined max_int = 5.
v0 = zero.eval()
v1 = one.eval()
v3 = three.eval()
registers = tensor.stack([v0, v1, v3], axis=1)

# Gates: 1, NEG(x), ADD(x, y), EQZ(x)
gates = [gate_one, gate_negate, gate_add, gate_eqz]

# Circuit:
#
#  R1  -----------------------     - R1'
#      1 -- NEG --            \   /
#                 \            \ /
#                  ADD -- EQZ --X 
#                 /              \
#  R2  -----------                - R2'
#  R3  ---------------------------- R3'
#
#  Equivalently:
#    R3' = R3
#    R2' = R1
#    R1' = [R2 - 1 == 0]

from theano.tensor import as_tensor
def c(*cs):
    """Convert a list into a coefficient tensor, shape 1 x N."""
    return as_tensor(cs).reshape((1, len(cs)))
    
controller_coefficients = [
    # 1 coefficients
    [],
    
    # NEG coefficients
    [c(0, 0, 0, 1)],
    
    # ADD coefficients
    [c(0, 0, 0, 0, 1),
     c(0, 1, 0, 0, 0)],
    
    # EQZ coefficients
    [c(0, 0, 0, 0, 0, 1)],
    
    # R1' (not a gate, thus not a list of coefficients)
    c(0, 0, 0, 0, 0, 0, 1),
    
    # R2'
    c(1, 0, 0, 0, 0, 0, 0),
    
    # R3'
    c(0, 0, 1, 0, 0, 0, 0),
]

With the inputs set up, we can run our circuit and observe the outputs of each of the gates and the new values of each register. We have four gates and three new register values:

The first gate is a constant gate, and should return 1.
The second gate is a negation of the 1, so should return -1.
The third gate adds R2 to -1, and, since R2 = 0, should return 0.
The fourth gate checks R2 - 1 for equality with 0, so should return 1.
R3', the new value of R3, should equal R3 and be 3.
R2', the new value of R2, should equal R1 and be 0.
R1' should equal the output of the last gate, so should be 1.

We can confirm these expected values:

In [14]:

new_registers = run_circuit(registers, gates,
                            controller_coefficients, max_int)
print(new_registers.eval().argmax(axis=2))

[[1 0 3]]

Implementing the Controller

We have now implemented the probabilistic circuit gates and combined them together into a single circuit, directed by coefficients generated by a hypothetical neural network controller. In this section, we'll implement the neural network controller itself. For this post, we'll use a simple feed-forward multilayer perceptron, although we could also use an LSTM recurrent neural network.

Recall that the inputs to the neural network are the 0th components of the registers; that way, the neural network can use binary information ("is this register zero?") but cannot learn to solve the problem entirely on its own.

We'll start by generating the weights of the neural network. The weights will be Theano shared variables, which will allow us to update them as we train. Shared variables are mutable values (numpy arrays) that can be used in a Theano computation graph and which Theano knows how to update.

The weights will be initially randomly sampled from a uniform distribution around zero:

In [15]:

# `shared` converts a numpy array to a shared variable.
from theano import shared
from numpy.random import uniform

def init_weight(*dims, low=-0.3, high=0.3):
    """Create a randomly-initialized shared variable weight matrix."""
    weights = uniform(low=low, high=high, size=dims)
    var = shared(weights.astype(np.float32), name="W{0}x{1}".format(*dims))
        
    return var

The neural network will start with an input layer, drawing from the registers; it will then have one or more hidden layers; finally, from the hidden layer the neural network will generate all of its outputs (controller coefficients and willingness to complete the computation).

In [16]:

def mlp_weights(num_registers, layer_sizes, gates):
    """
    Generate weights and biases for all the connections
    in the neural network controller.
    
    layer_sizes: Number of units in each hidden layer.
    """
    
    # The first layer has one input per register.
    prev_layer = num_registers

    # Weights for making the hidden layers.
    for layer_size in layer_sizes:
        # Weights.
        yield init_weight(prev_layer, layer_size)        
        # Biases.
        yield init_weight(1, layer_size)
        
        # Keep track of last layer size.
        prev_layer = layer_size
    
    # Weights for gate coefficients (output layers).
    for prev_gates, gate in enumerate(gates):
        num_outputs = num_registers + prev_gates
        for _ in range(gate.arity):
            # Weights.
            yield init_weight(prev_layer, num_outputs)
            # Biases.
            yield init_weight(1, num_outputs)
        
    # Weights for new register value coefficients (output layers).
    num_outputs = num_registers + len(gates)
    for _ in range(num_registers):
        # Weights.
        yield init_weight(prev_layer, num_outputs)
        # Biases.
        yield init_weight(1, num_outputs)
        
    # Weights for willingness to complete computation output.
    yield init_weight(prev_layer, 1)
    # Biases.
    yield init_weight(1, 1)

Using these weights, we can define the outputs of the network by implementing forward propogation. All intermediate layers use rectified linear units (ReLu)), and the output layers must sum to one, so use a softmax. The willingness to complete computation is a single number between zero and one, so uses a sigmoid unit.

In [17]:

from theano.tensor.nnet import softmax, relu, sigmoid

# Define a helper function for deconstructing
# the parameter list.
def take(values, i):
    """Return the next pair of weights and biases after the
    starting index and the new starting index."""
    return values[i], values[i + 1], i + 2

def mlp_forward_prop(num_registers, num_layers, gates,
                     registers, params):
    """Run forward propogation on the register machine (one step)."""
    # Extract 0th component from all registers.
    last_layer = registers[:, :, 0]
    
    # Propogate forward to hidden layers.
    idx = 0
    for _ in range(num_layers):
        W, b, idx = take(params, idx)
        last_layer = relu(last_layer.dot(W) + b)
    
    # Propogate forward to gate coefficient outputs.
    # In the result list, each result is a list of
    # coefficients, as gates may have 0, 1, or 2 inputs.
    controller_coefficients = []
    for gate in gates:
        coeffs = []
        for _ in range(gate.arity):
            W, b, idx = take(params, idx)
            coeffs.append(softmax(last_layer.dot(W) + b))
        controller_coefficients.append(coeffs)
    
    # Forward propogate to new register value coefficients.
    for _ in range(num_registers):
        W, b, idx = take(params, idx)
        coeffs = softmax(last_layer.dot(W) + b)
        controller_coefficients.append(coeffs)
        
    # Forward propogate to generate willingness to complete.
    W, b, idx = take(params, idx)
    complete = sigmoid(last_layer.dot(W) + b)
        
    return controller_coefficients, complete

We can now run a single timestep of our register machine from end to end. However, without training it, it does not output anything interesting.

Exercise: Verify that this code works, and that it works for many samples at once. Constructing the inputs for this code is educational on its own!

We can now run forward propagation on our neural network to generate the controller coefficients, and we can use the controller coefficients as weights to run our fuzzy circuit.

Putting those together to run a full timestep of our machine now becomes fairly straightforward:

In [18]:

def step_machine(gates, max_int, num_registers, num_layers,
                 registers, params):
    """Run a single timestep of the machine."""
    # Run single-step forward propagation.
    controller_out = mlp_forward_prop(num_registers, num_layers, gates,
                                      registers, params)
    coefficients, complete = controller_out

    # Using the generated coefficients, advance the registers.
    new_registers = run_circuit(registers, gates,
                                coefficients, max_int)

    return new_registers, complete

Implementing the Cost Function

$\renewcommand{\R}[0]{\mathbb{R}} \renewcommand{\Z}[0]{\mathbb{Z}} \newcommand{\bmatr}[1] {\begin{bmatrix} #1 \end{bmatrix}}$In order to have forward propogation do something useful with our model, we must train the weights, and for that we need to compute our objective function and then differentiate it with respect to the weights. Recall from the previous post that our objective is

$$J(\theta) = -\sum_{t = 1}^T p_t \sum_{i=1}^R \log(r^{(t)}_{i,y_i}),$$ where $r^{(t)}_{i,y_i}$ is the probability of register $i$ having value $y_i$ at timestep $t$ (the $y_i$th component at time $t$), $y \in {\Z_M}^R$ is the vector of desired integer outputs at the end of the algorithm, and $p_t$ is the probability that the computation completed after timestep $t$.

To compute this cost, we must be able to compute $r^{(t)}_{i,y_i}$:

In [19]:

def log_prob_correct(registers, desired_output, num_registers, max_int):
    """Compute log-probability of correctness over all registers."""
    cost = 0
    
    # Add epsilon to every log to avoid having inf in costs.
    epsilon = 1e-100
    
    for i in range(num_registers):
        # Create a mask to extract just the values we want.
        # This has the same shape as the registers, and a one
        # on components we want to include in the sum, a zero
        # on components we don't wish to include in the sum.
        mask = to_one_hot(desired_output[:, i], max_int)
        
        # Compute the loss for this register using the mask.
        loss = (mask * tensor.log(registers[:, i, :] + epsilon)).sum(axis=1)
        
        # Accumulate costs over all registers, keeping the dimensions S x 1.
        cost += tensor.shape_padright(loss, 1)
        
    return cost

Although we could run forward propogation, collect all the results, and use them to compute the cost, this would require storing in memory all the results. Instead, we can compute the cost during the forward propogation, thus reducing the amount of data we need to actively store. Observe that in the following code we accumulate the cost over each timestep.

We must also compute $p_t$ in order to compute the cost for a timestep, which is a bit messy:

$$p_t = \begin{cases} 1 - \displaystyle\sum_{i=1}^{t-1} p_i & \text{ if } t = T \\ f_t \displaystyle\prod_{i=1}^{t-1} (1 - f_i)& \text{ otherwise} \end{cases}$$ where $T$ is the number of timesteps and $f_i$ is the willingness to complete at timestep $i$.

At the end, we must know the cumulative probability so far, and at every other timestep, we must know the probability that the computation isn't already concluded. To be able to compute all these, we pass timestep $(t)$, max_timesteps $(T)$, cum_prob_complete $(\sum_{i=1}^{t-1} p_i)$, and prob_incomplete $(\prod_{i=1}^{t-1} (1 - f_i))$ to each step:

In [20]:

def step_cost(gates, max_int, desired_output, max_timesteps,
              num_registers, num_layers, timestep, registers,
              cost, cum_prob_complete, prob_incomplete, params):
    # Run the machine forward one step.
    machine_result = step_machine(gates, max_int, num_registers,
                                  num_layers, registers, params)
    registers, complete = machine_result
    
    # Complete the probability that the algorithm is done
    # after this step. Force the algorithm to complete after 
    # T timesteps.
    if timestep == max_timesteps:
        prob_complete = 1 - cum_prob_complete
    else:
        prob_complete = complete * prob_incomplete
        
    # Update the probability that the computation isn't complete
    prob_incomplete *= 1 - complete
        
    # Accumulate the probability that a result has been produced.
    cum_prob_complete += prob_complete
    
    # Cost for this timestep.
    unscaled_cost = log_prob_correct(registers, desired_output,
                                     num_registers, max_int)
    scaled_cost = prob_complete * unscaled_cost
    cost -= scaled_cost
    
    return registers, cost, cum_prob_complete, prob_incomplete

Putting this all together, we run the machine for $T$ timesteps, passing the outputs at each step as the next inputs:

In [21]:

from theano.tensor import dtensor3, imatrix

# This function is a boring utility function.
# Main code of interest is below in `run`.
def make_broadcastable(weights):
    """Shared variables (the weights of the controller) are
    not broadcastable by default. We need to make them broadcastable
    to use them. This function does so manually."""
    broadcastable = []
    for var in weights:
        # Only make biases broadcastable.
        if var.get_value().shape[0] == 1:
            # Keep the name the same.
            name = var.name
            var = tensor.addbroadcast(var, 0)
            var.name = name
        broadcastable.append(var)
    return broadcastable

def run(gates, num_registers, max_int,
        num_timesteps, num_layers, reg_lambda, params):
    params = make_broadcastable(params)

    # Create symbolic variables for the input to the machine
    # and for the desired output of the machine.
    initial_registers = dtensor3("R")
    desired_registers = imatrix("Y")
    
    # Run the model for all timesteps. The arguments are 
    # registers, cost, cumulative probability complete, 
    # and probability incomplete. The latter are initialized
    # to zero and to one, respectively.
    v0 = as_tensor(0)
    v1 = as_tensor(1)
    output = [initial_registers, v0, v0, v1]
    registers = []
    for timestep in range(num_timesteps):
        output = step_cost(gates, max_int, desired_registers,
                           num_timesteps, num_registers,
                           num_layers, timestep + 1,
                           *output, params)
        registers.append(output[0])

    # Add in regularization, to avoid overfitting simple examples.
    reg_cost = reg_lambda * sum((p * p).sum() for p in params)
    
    # Get the final cost: regularization plus loss.
    final_cost = reg_cost + output[1].sum()
    
    # Return the symbolic variables, the final cost, and the
    # intermediate register values for analysis and prediction.
    return initial_registers, desired_registers, final_cost, registers

Preparing Training Data

Once we have computed the cost, the magic of Theano really shines: we can get backpropogation practically for free by using the grad method. grad automatically computes the derivatives of a value with respect to any variables, in this case the shared weights of the neural network.

With that in mind, let's come up with a simple training example and enough data to train with: we'll teach our machine to do a single-step computation corresponding to the circuit we defined above:

$$\begin{align*} R_1' &= [R_2 - 1 == 0] \\ R_3' &= R_3 \\ R_2' &= R_1 \end{align*}$$

In [22]:

def example_circuit(r1, r2, r3):
    """Python implementation of our example circuit."""
    return int(r2 - 1 == 0), r3, r1

Given the limited number of registers and integer values, we can generate all possible inputs and all corresponding outputs as training data. itertools.product makes this very easy to do by generating the Cartesian product of our integer set:

In [23]:

import itertools

# Generate all sample inputs.
max_int = 50
num_registers = 3
inputs = list(itertools.product(range(max_int),
                                repeat=num_registers))

# Generate all sample outputs.
outputs = [example_circuit(*values) for values in inputs]

Right now the inputs and outputs are stored as lists of tuples, whereas Theano and Numpy work with arrays, and expect the input to be one-hot encoded. Let's address that by encoding the inputs and storing the data as Numpy arrays:

In [24]:

def encode(samples):
    """Convert inputs to one-hot matrix form.
    The result is shape (S, R, M), where, as usual:
    S - num samples, R - num registers, M - max int
    """
    samples = np.asarray(samples)
    
    # Encode each register separately.
    # to_one_hot requires a 1-d vector.
    encoded = []
    for i in range(num_registers):
        encoded.append(to_one_hot(samples[:, i], max_int).eval())
    return np.asarray(encoded).swapaxes(0, 1)

inputs = encode(inputs)
outputs = np.asarray(outputs, dtype=np.int32)

These input and output arrays are suitable to train on, but have a subtle, lurking issue. Namely: what percentage of the outputs have R1 equal to zero, and what percentage have R1 equal to one?

If you consider our example with $M = 50$, output R1 is equal to one only when input R2 is equal to one. Only one out of fifty inputs will have R2 equal to one, so just about 2% of our data will have output R2 equal to one. Although this is not a classification problem, we nonetheless have an issue of imbalanced classes!

When you have heavily imbalanced classes, neural networks will tend to just learn to predict the majority class; although this is a local minimum, it's one that is very easy to find, and so the networks will almost always do so. The easiest ways to fix this are to either assign the examples weights (and weigh the minority class much more heavily), to undersample the majority class (only take a subset of that data), or to oversample the minority class (repeat that data). For us, the simplest approach will be to repeat the minority class enough times to make it equivalent to the majority class:

In [25]:

# Find the underrepresented class indices: R2 = 1
idxs = inputs[:, 1, 1] == 1

# Oversample the class 48X to make it equal to R2 != 1 class.
inputs_plus  = np.repeat(inputs[idxs, :, :], 48, axis=0)
outputs_plus = np.repeat(outputs[idxs, :], 48, axis=0)

# Create new data set with approximately equal class sizes.
inputs = np.concatenate((inputs, inputs_plus))
outputs = np.concatenate((outputs, outputs_plus))

Exercise: Why are we repeating the minority class 48 times? After we oversample, what percentage of the data will be in the minority class (R2 = 1) and what percentage will be in the majority class (R2 != 1)?

In order to diagnose over and underfitting, and get an accurate estimate of the performance, we split our data into a training and test set:

In [26]:

# Randomly shuffle the indices 0..S - 1.
# Use these indices to shuffle inputs and outputs in unison.
num_samples = inputs.shape[0]
shuffle_order = np.random.permutation(num_samples)

# Use 30% of our data for testing, 70% for training.
test_ratio = 0.3
test_indices = shuffle_order[:int(test_ratio * num_samples)]
train_indices = shuffle_order[int(test_ratio * num_samples):]

test_inputs = inputs[test_indices, :, :]
test_outputs = outputs[test_indices, :]
train_inputs = inputs[train_indices, :, :]
train_outputs = outputs[train_indices, :]

Adam Optimization Method

As in the Kurach et al. paper, we'll use the Adam optimization algorithm. This algorithm resembles Adagrad and RMSProp; a good overview of the different optimization algorithms is here. The paper that introduces Adam give some nice theoretical bounds, but more importantly, it seems like it has empirically good performance.

I tested this code using Adam and standard SGD, and found that standard SGD would often converge to high-cost local minima and could not progress further.

In [27]:

def adam_optimize(params, train, train_inputs, train_outputs,
                  alpha=0.001, b1=0.9, b2=0.999,
                  epsilon=1e-8, batch_size=1000):
    """Implementation of Adam optimization method, with hyperparameters
    taken as recommended by the original publication."""
    # Initialize first and second moment estimates to zero.
    # This causes some bias, which is addressed later.
    moment1 =  [0 for _ in params]
    moment2 = [0 for _ in params]
    
    timestep = 0  # Current optimization step
    batch = 0     # Where does this batch start
    
    converged = False
    while not converged:
        timestep += 1
        
        # Train on a small batch.
        inputs  = train_inputs[batch:batch+batch_size, :, :]
        outputs = train_outputs[batch:batch+batch_size, :]        
        cost, *gradients = train(inputs, outputs)
        
        # Advance to next batch.
        batch = (batch + batch_size) % train_inputs.shape[0]

        # Compute first and second moment estimates.
        # These are decaying moving averages; first moment
        # uses the gradient, second uses squared gradient.
        moment1  = [b1 * m + (1 - b1) * gradient
                    for (m, gradient)
                    in zip(moment1, gradients)]
        moment2 = [b2 * v + (1 - b2) * gradient ** 2
                   for (v, gradient)
                   in zip(moment2, gradients)]
        
        # Correct for initialization bias and compute new values.
        correction1 = 1. / (1 - b1 ** timestep)
        correction2 = 1. / (1 - b2 ** timestep)
        corrected1 = [correction1 * m for m in moment1]
        corrected2 = [correction2 * v for v in moment2]
        
        # Compute new parameter values.
        params_new = [p.get_value() - alpha * m1 / (np.sqrt(m2) + epsilon)
                      for (p, m1, m2) in zip(params, corrected1, corrected2)]

        # Check for convergence by looking at magnitude of delta.
        delta = [abs(p.get_value() - p_new)
                 for (p, p_new) in zip(params, params_new)]
        converged = all((d < 0.5 * alpha).all() for d in delta)        
        
        # Update parameters to new values.
        for p, p_new in zip(params, params_new):
            p.set_value(p_new.astype('float32'))
            
        # Provide some output for tracking during runtime.
        if timestep % 100 == 1 or converged:
            print("Cost (t = %4d): \t%.2f" % (timestep - 1, cost))

Training with Backpropagation

After all of that work and preparation, we can finally construct our model and train it:

In [28]:

# Initialize neural network parameters.
layer_sizes = [5, 5]
params =  list(mlp_weights(num_registers, layer_sizes, gates))

# Initialize cost and gradients.
num_timesteps = 2
reg_lambda = 0.1
result = run(gates, num_registers, max_int,
             num_timesteps, len(layer_sizes),
             reg_lambda, params)
init_registers, desired_registers, cost, registers = result
gradients = theano.grad(cost, params)

# Compile training function to compute gradients.
train = theano.function([init_registers, desired_registers],
                        [cost] + gradients)


# Compile prediction function (registers after one timestep)
predict = theano.function([init_registers], registers[0])

Adam takes about a thousand iterations to fully train the model, printing the cost every 100 iterations:

In [29]:

adam_optimize(params, train, train_inputs, train_outputs)

Cost (t =    0): 	6014.46
Cost (t =  100): 	5442.26
Cost (t =  200): 	4038.44
Cost (t =  300): 	2068.13
Cost (t =  400): 	1187.40
Cost (t =  500): 	577.50
Cost (t =  600): 	171.71
Cost (t =  666): 	107.83

Now that we have a trained model, we can evaluate its performance on our test set:

In [30]:

def percent_correct(inputs, outputs):
    """Compute the percent of examples that were computed correctly."""
    # Convert the one-hot encoding to integers.
    result = predict(inputs).argmax(axis=2)
    
    # Check how many of the registers for each sample
    # had the expected result.
    num_eq = (result == outputs).sum(axis=1)
    
    # A sample was correct if *all* of the registers
    # were correct. Count correct samples.
    all_eq = (num_eq == inputs.shape[1]).sum()
    
    # Return ratio of correct samples times 100.
    return 100 * float(all_eq) / inputs.shape[0]

print("Correct: %.1f%%" % percent_correct(test_inputs, test_outputs))

Correct: 100.0%

Our model managed to learn to solve this problem with 100% correctness – unsurprising, given that it was actually just learning to produce a constant circuit, completely independent of the registers!

This concludes the runnable portion of this blog post.

Random Access Memory Gates

Extending our model from a register machine to a full random-access machine turns out to be quite straightforward. Recall from the previous post that our machine accesses memory through fuzzy READ and WRITE gates, so all we need to do is

Implement the READ gate.
Implement the WRITE gate.
Modify the cost function to use memory, rather than the registers.

Both of these gates are surprisingly simple, although the implementation can be somewhat mysterious. The best way to understand the implementation is to use the matrix dimension, figure out what dimensions you expect, and what dimensions every intermediate product should be.

When reading, we want to average the contribution of every memory cell to the result, weighted by the input probabilities. For example, if $M = 3$, and our pointer is $(0.2, 0.8, 0.0)$, we would take 0.2 of the first cell plus 0.8 of the second cell:

In [31]:

def read(ptr, max_int, mem):
    """Gate to read from the memory tape.
    Return a tuple of the result and the new (unmodified) memory bank.
    
    ptr: Distribution over locations to read from.
    mem: Memory bank.
    
    ptr – S x M'
    mem – S x M' x M
    
    where
      S  – number of samples being operated on
      M' – number of memory locations
      M  – number of representable integers
    
    M = M' always, but distinguishing between them helps keep track of semantics.
    """
    # Reading from memory ends up being a matrix multiplication (per sample),
    # but getting it right just involves shuffling the dimensions to be right:
    #
    #    m = mem.transpose(0, 2, 1):  S x M x M'
    #    p = shape_padright(ptr, 1):  S x M' x 1
    #             batched_dot(m, p):  S x M x 1
    #   batched_dot(m, p).flatten(2): S x M
    return batched_dot(mem.transpose(0, 2, 1),
                       shape_padright(ptr, 1)).flatten(2), mem

Exercise: Create some input data for this gate and verify that it works as expected.

The WRITE gate operates on a similar principle. Instead of writing to one location, it writes to all locations, with the strength of the writing modulated by the input pointer probabilities. Each memory cell also retains its old values, similarly weighted by the input pointer probabilities. For example, if $M = 3$, and our pointer is $(0.2, 0.8, 0.0)$, the first cell would be 0.8 times its old value plus 0.2 times the new values, the second cell would be 0.2 times the old value plus 0.8 times the new value, and the third cell would keep its old value:

In [32]:

def write(ptr, val, max_int, mem):
    """Gate to write to the memory tape.
    Return a tuple of the written value and the modified memory bank.
    
    ptr: Distribution over locations to write to.
    val: Distribution over values to write.
    mem: Memory bank.
    
    ptr – S x M'
    ptr – S x M
    mem – S x M' x M
    
    where
      S  – number of samples being operated on
      M' – number of memory locations
      M  – number of representable integers
    
    M = M' always, but distinguishing between them helps keep track of semantics.
    """
    # As with reading, tracking the dimensions makes this operation simple.
    # We want to compute an "old" contribution and a "new" contribution.
    #     p = shape_padright(ptr, 1): S x M' x 1
    #  v = val.dimshuffle(0, 'x', 1): S x 1 x M
    #                        1 - ptr: S x M'
    # J = shape_padright(1 - ptr, 1): S x M' x 1
    #                  old = J * mem: S x M' x M (broadcasting)
    #                    new = p * v: S x M' x M
    p = shape_padright(ptr, 1)
    v = val.dimshuffle(0, 'x', 1)
    j = shape_padright(1 - ptr, 1)
    new_mem = j * mem + batched_dot(p, v)
    return val, new_mem

If we modify our cost function to look at memory, rather than registers, our model will be the complete NRAM model.

Training Tips and Tricks

Training this machine can get a bit tricky. Depending on the complexity of the task, it has a tendency to quickly converge to a local minimum and get stuck there. The following are a few tricks to get past these local minima.

Gradient Noise

Add Gaussian noise to the gradients after backpropagation, and let the magnitude of the Gaussian noise decay with the iteration timestep. This forces the model to explore a bit more during the early iterations, and can help it avoid those local minima. (See this paper.)

Gradient Clipping

During backpropagation, the gradients have a tendency to get quite large for deeper networks. Clipping the gradients to some minimum and maximum values can help remediate this. Theano offers the grad_clip operator specifically for this purpose.

Gradient Normalization

Normalizing the gradient to some fixed constant magnitude after backpropagation can be helpful, although in my experience this is less important with Adam or similar optimization methods as opposed to straightforward SGD.

Curriculum Learning

Rather than immediately training on your target problem, attempt to break the problem down into incremental steps, and train first on the easiest problem, then a mix of the easiest and harder problems, and so on, gradually increasing the difficulty (by mixing in harder and harder examples in the batches). In my experience, this was necessary for almost all tasks.

Entropy Bonus

In order to encourage the model to explore before settling on a local minimum, add a "bonus" (as opposed to penalty) to the cost function for high-entropy distributions. This prevents the model from just outputting a single value early on; decrease the entropy bonus (exponentially) throughout training. Shannon entropy of the probability distributions works nicely:

$$E(\vec A) = \sum_{i=0}^{M-1} A_i \log A_i.$$

$E(\vec A)$ is high when the distribution is spread out, and low when it is concentrated at a single point.

Conclusion

Ultimately, building and training these models takes some practice and time; debugging them when they don't work can be tricky. A combination of the tips and tricks from above, as well as others, can often help coax these models into learning.

If you would like to play with this model yourself, check out the Github repository – it contains all the code necessary to build the full NRAM network, and depends only on Numpy and Theano. You can teach the machine surprisingly complex operations with very few neural network parameters! You can choose one of the tasks from the publication, or invent your own.

Andrew Gibiansky :: Math → [Code]

NRAM: Theano Implementation