## Feed-Forward Networks

**This post illustrates steps required to build and train feed-forward neural networks from scratch in Python.**

### Implementing the Network in Code

We will consider the following \( L \)-layer network structure

\[ X^{(l)}_i= σ^{(l)}(\sum_jw^{(l)}_{i,l}x^{(l-1)}_j+b^{(l)}_i), l∈\{1,…,L\} \]

\( x^{(l)} \) is the output of the \( l \)-th layer, where \( 0 \) is taken to be the index of the **input layer** *by definition*, so that \( x^{(0)}=u \). Likewise, overall network output is the output of the \( L \)-th layer, so that \( y=x^{(L)} \). This layer is usually referred to as the **output layer**. All other layers, \( L-1 \)of them, are referred to as **hidden layers**.

The activation function of the \( l \)-th layer is denoted by \( σ^{(l)} \). Network parameters within \( l \)-th layers are weights \( w^{(l)}_{i,j} \) and biasses \( b^{(l)}_i \).

Let us first create a class representing an activation function. This is, essentially, a pair representing the function itself and its first derivative.

class Activation: """An activation function.""" def __init__(self, func, der): self.func = func self.der = der

There are several activation functions in practical use. Among the most well-known ones are: **linear**, **rectified linear** (**ReLU**), and **sigmoid** activation functions. **ReLU** is the common choice in modern neural networks, due to its simplicity and effectiveness. Note that **U** stands for *unit*, so that **ReLU** = *Rectified Linear Unit*.

# Linear activation function Lin_Activation = Activation(lambda x: x, lambda x: np.ones(shape=x.shape))

# Rectified linear activation function ReLU_Activation = Activation (lambda x: np.maximum(x,np.zeros(shape=x.shape)), lambda x:(x>0).astype(np.float))

Let us visualize these common activation functions, alongside their derivatives.

# Function used to visualize (plot) activation # functions alongside their derivatives def plot_activation(activation, x = np.arange(-1, 1, 0.01), label = ''): plt.plot(x, activation.func(x), linewidth=3, label='activation') plt.plot(x, activation.der(x), '--', label= 'derivative') plt.legend(fontsize=16) plt.title(label, fontsize=16)

plt.subplot(1,2,1) plot_activation(Lin_Activation, label='linear') plt.subplot(1,2,2) plot_activation(ReLU_Activation, label='ReLu')

In the following step, we develop a class representing a single layer of the feed-forward neural network. The main responsibility of each layer is to propagate input signal forward, from its inputs to its outputs.

Each layer is statefull, in the sense that relevant intermediate values are preserved and refreshed during each new forward pass. These values are utilized later, during error back-propagation in the learning phase.

Each layer is a computational unit implementing the folowing static map

\[ x^{(l)}_i=σ^{(l)}(z^{(l)}_i) \]

\[ z^{(l)}_i=\sum_jw^{(l)}_{i,j}x^{(l-1)}_j+b^{(l)}_i \]

where \( l \) is the layer index, \( w^{(l)}_{i,j} \)and \( b^{(l)}_i \) are its **tunable parameters**, \( σ^{(l)} \) is the **activation function**, and \( z^{(l)}_i \) are **intermediate results** (which are refreshed during each forward run, and used later for back-propagation).

class Layer: # A single layer of a feed-forward neural network. def __init__(self, n_input, n_output, activation): # Initialize network layer of the given size and the given # activation. self.W = 0.1 * np.random.randn(n_output, n_input) self.b = 0.1 * np.random.rand(n_output, 1) self.activation = activation # Input to the activation function. # (Layer output before activation.) self.z = np.zeros(shape=(n_output, 1)) # The actual output. self.output = np.zeros(shape=(n_output, 1)) def evaluate(self, u): # Evaluate layer output for the given input. # u - `NumPy` array of shape (n_inputs, 1) # Returns network output, which is always # a `NumPy` array of shape (n_outputs, 1) self.z = np.dot(self.W, u) + self.b self.output = self.activation.func(self.z) return self.output

Finally, we build a network by stacking multiple layers sequentially. The following listing contains essentially all that is required to implement a feed-forward network in `Python`

!

class Network: """Feed-forward neural network.""" def __init__(self, layers): """Initialize network from a list of layers.""" self.layers = layers # The last seen network input. # This is used by the backpropagation algorithm during training. self.u = None def evaluate(self, u): """Evaluate network output for the given input. u - `NumPy` array of shape (n_input, 1) where n_inputs is the number of inputs of the network (equal to the number of nputs of its first layer). Returns the output of the last layer, a `NumPy` array of shape (n_outputs, 1), where n_outputs is the number of outputs of the last layer.""" def cost_gradient(self, y): """ Evaluate gradient of the cost function by back-propagation. y - Target output corresponding to the last seen input vector. Returns a list of pairs (2-tuples), each containing gradient of the cost function with respect to network weights and biases, in that order. The list is in reversed order, the first element is a pair of gradients corresponding to the parameters of the last network layers, etc. The last pair corresponds to the first network layer. Note: `evaluate` function must be called before this one. """ grad = [] # This is the derivative of the cost function (evaluated based # on a single sample) with respect to the output of the # last network layer. The value presented here corresponds to # quadratic cost. Change of the cost function will affect # only this line! grad_a = 2*(self.layers[-1].output - y) # BACKPROPAGATION: for l in reversed(range(len(self.layers))): x_prev = self.layers[l-1].output if l>0 else self.u dsigma = self.layers[l].activation.der(self.layers[l].z) grad_b = dsigma * grad_a grad_W = np.outer(grad_b, x_prev) grad.append((grad_W, grad_b)) # Update intermediate vector to be used in the next iteration # of the backpropagation algorithm. The computation is not # necessary once the first layer is reached. if not l == 0: grad_a = np.dot(self.layers[l].W.T, grad_b) return grad def batch_cost_gradient(self, u_list, y_list): """ Evaluate cumulative gradient of a training batch. u_list - List of training inputs y_list - List of training outputs (targets) Returns a list of pairs (2-tuples), each containing gradient of the cost function with respect to network weights and biases, in that order. Every gradient is a sum of all sample gradients within the batch. """ grad_total = None for u, y in zip(u_list, y_list): self.evaluate(u) grad = self.cost_gradient(y) if grad_total: for gt, g in zip(grad_total, grad): gtW, gtb = gt gW, gb = g gtW += gW gtb += gb else: grad_total = grad return grad_total def train_on(self, u_list, y_list): """ Update network parameters on the basis of a single training batch. u - Training input; a `NumPy` array of shape (n_inputs, 1) y - Training output; a `NumPy` array of shape (n_outputs, 1) The function does not return a value. After invocation, the network weights and biases will be updated. """ grads = self.batch_cost_gradient(u_list, y_list) for layer, grad in zip(self.layers, reversed(grads)): grad_W, grad_b = grad layer.W -= 0.01*grad_W layer.b -= 0.01*grad_b

Network output is evaluated by means of `evaluate`

method. This method takes a single input point and evaluates the corresponding network output.

Method `cost gradient`

of the `Network`

class implements the **back-propagation algorithm** in order to compute gradient of the cost function (the algorithm is derived and explained in detail, in the following Section). For simplicity, the cost \( J_1 \) is defined as the square of the difference between the obtained and desired network output. Since the network output is, in fact, the output of the \( L \)-th layer, we have that the cost function computed on the basis of one training sample

\[ J_1=\sum^{n_L}_{i=1}(x^{(L)}_i-y_i)^2 \]

The sum in the previous expression is over all output neurons! In other words, if the network output is multi-dimensional, then each training sample contains several targets, and the total error is obtained by summing up errors of all individual targets.

In reality, the cost (denoted by \( J \)) is obtained by averaging \( J_1 \) over a **batch** of training samples. The notation \( J_1 \) is used to emphasize that the cost is with respect to *a single* sample. This is irrelevant at this point, since the gradient is defined *per sample*. The gradient of a batch is obtained by summing or averaging gradients of all samples belonging to the batch. This is performed by the `batch_cost_gradient`

method.

The network is trained using `train_on`

method, which takes a batch of input and output samples. The gradient is computed by summing the gradients over the entire batch, and **one step** of the optimization algorithm is performed.

### Back-Propagation Algorithm

Let us derive formulas needed to derive the **backpropagation algorithm**. In order to do so, consider an arbitrary network layer (let us label it by \( l \) as before):

\[ x^{(l)}_i=σ^{(l)}(z^{(l)}_i) \]

\[ z^{(l)}_i=\sum_jw^{(l)}_{i,j}x^{(l-1)}_j+b^{(l)}_i \]

The question we would like to answer is: **What is the sensitivity of the cost function \( J_1 \) with respect to trainable parameters of this layer \( w^{(l)}_{i,j} \) and \( b^l_i \)?**

By **chain rule** of calculus

\[ \frac{\partial J_1}{\partial w^{(\ell)}_{i,j}} = \frac{\partial J_1}{\partial x^{(\ell)}_i} \frac{\partial x^{(\ell)}_i}{\partial z^{(\ell)}_i} \frac{\partial z^{(\ell)}_i}{\partial w^{(\ell)}_{i,j}} \]

\[ \frac{\partial J_1}{\partial b^{(\ell)}_i} = \frac{\partial J_1}{\partial x^{(\ell)}_i} \frac{\partial x^{(\ell)}_i}{\partial z^{(\ell)}_i} \frac{\partial z^{(\ell)}_i}{\partial b^{(\ell)}_i} \]

In both of these expression, the second and the thirs term are easy to compute. In fact,

\[ \frac{\partial x^{(\ell)}_i}{\partial z^{(\ell)}_i} = \sigma^{\prime(\ell)}(z^{(\ell)}_i) \]

is just the derivative of the activation function computed at the current *state* of the neuron. Also, since \( z^{(\ell)}_i \) is a linear function of the parameters, we have

\[ \frac{\partial z^{(\ell)}_i}{\partial w^{(\ell)}_{i,j}} = x^{(\ell-1)}_j \]

\[ \frac{\partial z^{(\ell)}_i}{\partial b^{(\ell)}_i} = 1 \]

By combining appropriate formulas, we obtain the following expressions for the relevant gradients

\[ \frac{\partial J_1}{\partial w^{(\ell)}_{i,j}} = \frac{\partial J_1}{\partial x^{(\ell)}_i} \sigma^{\prime(\ell)}(z^{(\ell)}_i) x^{(\ell-1)}_j \]

\[ \frac{\partial J_1}{\partial b^{(\ell)}_i} = \frac{\partial J_1}{\partial x^{(\ell)}_i} \sigma^{\prime(\ell)}(z^{(\ell)}_i) \]

Note also that, if one would like to represent \( \frac{\partial J_1}{\partial w^{(\ell)}_{i,j}} \) as a matrix, having the same shape as the parameter matrix \( \mathbf{W}^{(\ell)} = \{ w^{(\ell)}_{i,j} \} \), then it is possible to write the previous expressions in matrix/vector form:

\[ \frac{\partial J_1}{\partial \mathbf{b}^{(\ell)}} = \frac{\partial J_1}{\partial \mathbf{x}^{(\ell)}} \circ \sigma^{\prime(\ell)}(\mathbf{z}^{(\ell)}) \]

\[ \frac{\partial J_1}{\partial \mathbf{W}^{(\ell)}} = \frac{\partial J_1}{\partial \mathbf{b}^{(\ell)}} \otimes \mathbf{x}^{(\ell-1)} \]

Where \( \circ \) denotes element-wise (Hadamard) product, and \( \otimes \)

$$denotes the outer product of two vectors.

What remains to be computed is the derivative of the cost function with respect to the output of the given layer. This is the most complicated step (although, not particularly complicated in it self). It is the essence of the back-propagation algorithm.

The main problem is that the layer output directly influences cost value only if the layer is the output one! For inner layers, each layer only directly influences its first successor, and through it and all consecutive layers, it influences output and the cost function indirectly. However, **it turns out that there is a recursive formula for computing the derivative**!

Consider first the output layer, for which obviously

\[ \frac{\partial J_1}{\partial x^{(L)}_i} = 2 (x^{(L)}_i – y_i) \]

or, in the vector notation,

\[ \frac{\partial J_1}{\partial \mathbf{x}^{(L)}} = 2 (\mathbf{x}^{(L)} – \mathbf{y}) \]

Consider now an inner layer with index \( \ell < L \). It influences the cost exclusively through the outputs of the consequtive layer, so that

\[ \frac{\partial J_1}{\partial x^{(\ell)}_i} = \sum_{j=1}^{n_{\ell+1}} \frac{\partial J_1}{\partial x^{(\ell+1)}_j} \frac{\partial x^{(\ell+1)}_j}{\partial x^{(\ell)}_i} \]

By applying the chain rule to the second factor within the sum,

\[ \frac{\partial x^{(\ell+1)}_j}{\partial x^{(\ell)}_i} = \frac{\partial x^{(\ell+1)}_j}{\partial z^{(\ell+1)}_j} \frac{\partial z^{(\ell+1)}_j}{\partial x^{(\ell)}_i} = \sigma^{\prime(\ell+1)}(z^{(\ell+1)}_j) w_{j,i}^{(\ell+1)} \]

and finally,

\[ \frac{\partial J_1}{\partial x^{(\ell)}_i} = \sum_{j=1}^{n_{\ell+1}} \frac{\partial J_1}{\partial x^{(\ell+1)}_j} \frac{\partial x^{(\ell+1)}_j}{\partial z^{(\ell+1)}_j} \frac{\partial z^{(\ell+1)}_j}{\partial x^{(\ell)}_i} = \sum_{j=1}^{n_{\ell+1}} \frac{\partial J_1}{\partial x^{(\ell+1)}_j} \sigma^{\prime(\ell+1)}(z^{(\ell+1)}_j) w_{j,i}^{(\ell+1)} \]

It is convenient to note that the first two factors within the sum actually constitute the gradient of the cost function with respect to the bias term in the \( \ell+1 \)-st layer, so that

\[ \frac{\partial J_1}{\partial x^{(\ell)}_i} = \sum_{j=1}^{n_{\ell+1}} \frac{\partial J_1}{\partial b^{(\ell+1)}_j} w_{j,i}^{(\ell+1)} \]

The expression can be re-written more concisely in the matrix/vector notation

\[ \frac{\partial J_1}{\partial \mathbf{x}^{(\ell)}} = \left(\mathbf{W}^{(\ell+1)}\right)^T \frac{\partial J_1}{\partial \mathbf{b}^{(\ell+1)}} \]

All modern learning (optimization) algorithms of machine learning are based on gradient descent principle. Parameter values are changed so as to gradially decrease the cost function. This is achieve by changing the tunable parameters in the direction which is “governed” (in some sense) by the negative gradient of the cost function.

Most deep learning models are trained using large data sets, meaning that the gradient should be computed based on very large number of training points, which tends to be slow for large training sets.

The **stochastic gradient** principle states that it is possible to approximate the mean of the overall gradient by a small number of points (usually referred to as **batch** or, sometimes, **mini-batch**). In fact, just a single point is sometimes used. Using batches of small size leads to more efficient computations, but it also results in gradient direction which fluctuates significantly between batches.

In next blog post we will show some examples.