This is the second post in a series where I explain my understanding of how Neural Networks work. I am not an expert on the topic, yet :), but I have been exploring Machine Learning over the last few months (check my study list and my exercises from Coursera courses here and here). I think this content is worth sharing, as I believe it can help readers in a similar situation. These are the posts that compose the series:

- Part 1: Foundations
- Part 2: Forward Propagation Vectorization (this post)
- Part 3: Gradient computation and Back-propagation (coming soon)

I don’t want this series to be a ‘yasonn’ (yet another series on neural networks), so I will try to convey my view, my understanding, and how I have come to perceive how they work.

## Foundations of NN

This post dives into the first of the two most important techniques of NN: forward propagation. I assume that you already know the basics of NN and that the following topics look familiar to you:

- How a single unit works
- How the different units are interconnected
- The effects of this interconnection in terms of weights and activation functions
- Why and where in the overall optimization process the cost computation is performed
- The Forward Propagation algorithm as the main tool used in the cost computation

Otherwise, read the foundations post first. Also, even if you already master the NN technique, I still recommend reading the foundations post because you will understand my view of NNs and it will be easier to follow this second one.

You can see this text as an extension of the ‘Forward Propagation and Cost Computation‘ section in the foundations post. There I described an overview of the optimization process focused on (1) applying the forward propagation algorithm to each of the training samples to compute *h*, and (2) computing the cost *J* using the long formula that takes the obtained *h* values (and obviously some others).

If the overall process is clear, let’s now focus on how to get the values of *h*. That is, the forward propagation algorithm.

## Forward Propagation

Training a NN is usually a process that requires non-negligible computational power. Recall that the cost *J* is computed in each one of the optimization steps, and for each cost calculation every one of the training samples needs to be forward propagated to compute *h*. Since training sets are usually large and optimization algorithms might need lots of iterations, it makes sense to find an efficient way to perform a single forward propagation.

And this is where vectorization has something to say. Note that we are not deriving an alternative algorithm to compute *h*. We will just organize the data in a vectorized shape so we can use advanced libraries that optimize the vector computations. As simple as that.

Or not. While the idea of vectorizing is certainly simple when you think of a single matrix, it gets more complex when you are trying to convert a rather complex structure like a NN into something completely vectorized. Again, it is not a conceptual problem, but at the beginning it is easy to get confused by the large number of matrices in play, and especially their different shapes.

### Notation

Vectorization requires formality, so let’s start by describing the notation. Note that there is nothing new here; we are only giving names to elements of the NN already introduced in the foundations post.

- *a_{i}^{(n)}*: value of unit *i* in layer *n*
- *z_{i}^{(n)}*: internal value in each unit that sums all (weighted) inputs
- *g(z)*: activation function (usually the sigmoid function)
- *Θ_{ab}^{(n)}*: weight of the link that connects unit *b* in layer *n* to unit *a* in layer *n+1* (note *a* and *b* are in the opposite order)
- *X_{m}*: value of input sample *m*
- *h_{k}*: value of output prediction *k*
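Since *g(z)* will show up in every step below, here is a minimal sketch of the sigmoid activation (assuming NumPy; the function name is my own choice):

```python
import numpy as np

def sigmoid(z):
    """Activation function g(z) = 1 / (1 + e^-z), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                            # 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # saturates towards 0 and 1
```

Because NumPy applies it element-wise, the same function works unchanged on the scalar *z_{i}^{(n)}* values and on the full vectors *z^{(n)}* used later on.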

But remember that we are going for a vectorized implementation, so we will really be using *a^{(n)}*, *z^{(n)}*, *Θ^{(n)}*, *X* and *h*.

### Formal Forward Propagation

Finally, this is where everything gets interesting. The diagram below shows a NN with:

- An input layer (*n=1*) with *M* units (each one fed by the corresponding *M* input values *X*)
- Two hidden layers (*n=2* and *n=3*) with *L* and *J* units respectively
- An output layer (*n=4*) with *K* units (representing the values of the *K* predictions *h*)
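To make the layer sizes concrete, here is a small sketch of such an architecture (the concrete values of *M*, *L*, *J* and *K* are hypothetical; NumPy is assumed). Each weight matrix *Θ^{(n)}* maps layer *n* plus its bias unit to layer *n+1*, hence the `(next_size, current_size + 1)` shapes:

```python
import numpy as np

M, L, J, K = 4, 5, 3, 2   # hypothetical sizes: input, hidden 1, hidden 2, output

rng = np.random.default_rng(0)
theta1 = rng.normal(size=(L, M + 1))  # Θ^{(1)}: layer 1 -> layer 2
theta2 = rng.normal(size=(J, L + 1))  # Θ^{(2)}: layer 2 -> layer 3
theta3 = rng.normal(size=(K, J + 1))  # Θ^{(3)}: layer 3 -> layer 4

print(theta1.shape, theta2.shape, theta3.shape)  # (5, 5) (3, 6) (2, 4)
```

Note the `+ 1` in every column count: it is the extra column of weights for the bias unit of the origin layer.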

The diagram also shows the weights *Θ* that connect each pair of layers: *Θ^{(1)}* for weights connecting layer 1 to layer 2, *Θ^{(2)}* connecting layer 2 to layer 3, and *Θ^{(3)}* connecting layer 3 to layer 4. Note that the index of *Θ* indicates the origin layer.

Note that I have drawn the elements with different colors in order to differentiate the layers. That is, everything painted blue represents information related to layer 1, everything red to layer 2, and so on. Also see that the *Θ* weights use another set of three colors because they really link two layers (e.g. pink for *Θ^{(1)}*, with weights connecting layers 1 and 2). Furthermore, observe that the colors are also used to indicate what the *Θ* indexes represent. For instance, *Θ_{L1}^{(1)}* represents the weight of the link that connects unit *1* in layer *1* to unit *L* in the upcoming layer (2). This will be very useful later on when understanding the matrices involved in the vectorization.

You might have noticed a set of additional units on top of each layer (the dashed circles). These are the *bias units* and are used to introduce a scalar contribution in the model. They can be seen as a kind of intercept, similar to the role of *b* in a linear equation *y=ax+b*. If we look at the bias unit from the single unit perspective, we can interpret the bias as a shift to the left or right of the activation function (as this stackoverflow answer clearly explains). Note that the value of a bias unit is always 1 (*a_{0}^{(n)} = 1*) and the real modeling comes from the weights that link that particular unit to those in the next layer (*Θ_{i0}^{(n)}*).

Since everything is introduced, we can start using the diagram to quickly talk about Forward Propagation. Let’s start by analyzing the initial propagation step between the first two layers:

- First of all, it is easy to see that the values that compose one sample *X* feed the NN and set the values in layer 1 (*a_{1}^{(1)}* to *a_{M}^{(1)}*).
- These values, together with the bias unit *a_{0}^{(1)}*, (1) get weighted by *Θ^{(1)}*, (2) propagate to layer 2, and (3) their summation is set as the value *z_{i}^{(2)}* at each unit *i* in layer 2.
- The activation function is applied to each *z_{i}^{(2)}* and we then obtain *a_{i}^{(2)}*.

Note that the same process is repeated at each propagation step. That is, each *a_{i}^{(2)}* is weighted by the corresponding value in *Θ^{(2)}* in order to come up with *z_{i}^{(3)}*. Then, the activation function is applied in order to obtain *a_{i}^{(3)}*. And the same to get the final *a_{i}^{(4)}* values that in turn correspond to the *h* values.
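The unit-by-unit process described above can be sketched without any vectorization yet (a toy example with hypothetical weights, using the sigmoid as *g*):

```python
import math

def propagate_layer(a_prev, theta):
    """Compute the next layer's activations unit by unit.
    a_prev includes the bias value 1 at index 0;
    theta[i][b] is the weight from unit b to unit i of the next layer."""
    a_next = [1.0]  # bias unit of the next layer
    for row in theta:
        z_i = sum(w * a for w, a in zip(row, a_prev))  # weighted sum -> z_i
        a_next.append(1.0 / (1.0 + math.exp(-z_i)))    # activation  -> a_i
    return a_next

# hypothetical 2-input layer feeding a single next-layer unit
a1 = [1.0, 0.5, -0.5]        # bias + input values X1, X2
theta = [[0.1, 0.2, 0.3]]    # weights Θ_10, Θ_11, Θ_12
print(propagate_layer(a1, theta))   # [1.0, g(0.05)]
```

Each propagation step is exactly this: one weighted sum per unit, one activation, plus a new bias unit on top.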

As you can see, the only complexity so far is getting used to the new notation with letters, subscripts and superscripts.

### Vectorization details

I previously tried to convince you that vectorization is needed to speed up the calculations. Let’s take another step and convert the entire procedure into a set of matrix-based operations. To assist the explanation, follow the description in the picture below with all the vectorization details.

As we did before, let’s start by analyzing the initial propagation step between the first two layers:

- First of all, note how the vector *a^{(1)}* is constructed: the ‘1’ value from the bias unit *a_{0}^{(1)}* and the values *{X_{1}, …, X_{M}}* from the input sample that feeds the NN. It is important to understand the dimensions of the matrices: *a^{(1)}* has length *M+1* because of the bias unit plus the *M* input values from *X*.
- Computing *z^{(2)}* is just a matter of computing the matrix multiplication *Θ^{(1)} a^{(1)}*. Note that this multiplication actually performs both the corresponding weighting of *a^{(1)}* by *Θ^{(1)}* and the summation of all the terms. Also observe how the first element of *a^{(1)}*, the ‘1’ from the bias unit, multiplies the first column of *Θ^{(1)}*, the bias weights *Θ_{i0}^{(1)}*. Again, note that the vector *a^{(1)}* has length *M+1*, the matrix *Θ^{(1)}* has dimensions *L x (M+1)*, and hence the vector *z^{(2)}* has length *L*.
- Finally, compute *a^{(2)}* by just applying the activation function *g(z)* to *z^{(2)}* and introducing the corresponding bias unit. Note that the vector *a^{(2)}* has length *L+1*, because of the L-sized *z^{(2)}* plus the bias unit *a_{0}^{(2)}*.
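The three bullets above condense into a couple of NumPy lines per propagation step (a sketch with hypothetical sizes and weights; the sigmoid stands in for *g*):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

M, L = 3, 4                                # hypothetical layer sizes
X = np.array([0.5, -1.0, 2.0])             # one input sample of length M
theta1 = np.full((L, M + 1), 0.1)          # Θ^{(1)} with shape L x (M+1)

a1 = np.concatenate(([1.0], X))            # prepend the bias unit: length M+1
z2 = theta1 @ a1                           # Θ^{(1)} a^{(1)}: length L
a2 = np.concatenate(([1.0], sigmoid(z2)))  # activations plus bias: length L+1

print(a1.shape, z2.shape, a2.shape)        # (4,) (4,) (5,)
```

The `@` product performs the per-unit weighting and summation in one shot, and the shape printout confirms the *M+1*, *L* and *L+1* lengths discussed above.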

And that’s it for the first propagation step. Exactly the same matrix operations are applied in the upcoming propagation steps in order to compute the *J+1* values of the units of the other hidden layer, *a^{(3)}*, and the *K* values of the output units, *a^{(4)}*, which correspond to the *K* predicted values *h*. Note that *a_{0}^{(4)}* is not present in *a^{(4)}* because this is the output layer and there is no ‘next’ layer to propagate to.

I don’t think it is a complex procedure because it is based on just weighting and propagating values from layer to layer. As I said before, the only complexity comes from the fact that when you try to vectorize a complex structure it is too easy to get overwhelmed by indices and matrix dimensions. I encourage you to follow all the matrix details in order to get used to them and get ready for implementing it all.
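Putting the steps together, a full forward pass over the 4-layer network in the diagram could look like this (a hedged sketch, not my actual exercise code; the function name, sizes and random weights are all illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, thetas):
    """Propagate one sample x through the network defined by the list of Θ matrices."""
    a = x
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # add the bias unit a_0 = 1
        a = sigmoid(theta @ a)          # z^{(n+1)} = Θ^{(n)} a^{(n)}, then g(z)
    return a                            # output activations = h (no bias appended)

M, L, J, K = 4, 5, 3, 2                 # hypothetical layer sizes
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(L, M + 1)),
          rng.normal(size=(J, L + 1)),
          rng.normal(size=(K, J + 1))]

h = forward_propagate(np.zeros(M), thetas)
print(h.shape)   # (2,) -- the K predictions
```

The loop makes the repetition explicit: every step is the same bias-concatenate, multiply and activate, and the bias unit is simply not appended after the last layer.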

Finally, do you want to see the vectorization in action? Check the *forward_propagate(X, theta1, theta2)* function in the code from my NN exercise.

## Conclusion

First of all, congratulations if you reached this point. That was a pretty detailed description!

In this post I tried to provide a clear explanation of the Forward Propagation procedure and a detailed description of a vectorized version.

In the following post I will review in more detail the other big piece of the NN technique:

- The gradient computation (with Backpropagation)

Hope to see you there!
