This is the first post in a series where I explain my understanding on how Neural Networks work. I am not an expert on the topic, yet :), but I have been exploring Machine Learning during the last months (check my study list and my exercises of Coursera courses here and here). I think it is a content worth sharing as I believe it can help those readers in a similar situation. These are the posts that compose the series:
- Part 1: Foundations (this post)
- Part 2: Cost computation and Forward Propagation (coming soon)
- Part 3: Gradient computation and Back-propagation (coming soon)
I don’t want this series to be a ‘yasonn’ (yet another series on neural networks) so I will try to transmit my view, my understanding, and how I have perceived how they work. Nevertheless, this first post might look like repetitive to you as I will quickly go through the basis of Neural Networks. If this is not the first time, I hope you like it. If you have never read anything about them, I encourage you to read another more detailed introduction and come back later. When I was starting I liked a lot this post by Adam Geitgey or this detailed ebook by Michael Nielsen.
Neural Networks and Machine Learning
I like Neural Networks (NN from now on) because they offer a technique to solve complex problems in a very elegant manner. To me, NNs represent very well the idea behind Machine Learning foundations about teaching a systems to do a particular task without explicitly programming it to do such task.
This is not new at all because our brain does the same. For instance, the auditory cortex is trained to hear because it is pugged to out ears. What is curious is that if we plugged it to our eyes, it would learn to see. This is what a NN is about, a complex system that we can train to perform a particular task like reading handwritten digits, or detecting cat images, or steering the wheel of a self-driving car… Well, some of these examples are closer to the Deep Learning field, but anyway it is still based about NNs.
But before diving into the usual details about NN’s characteristics, let’s start by understanding why we use them. Simply put, if we want to predict an output given a set of input data, we just insert a NN in between. Thus, it is like a black box that sits in the middle and makes some calculations using the input values in order to come up with the output result. For instance, in the popular example of digit classification the input is an image of a handwritten number and the output the predicted digit. So far, so easy.
But what’s inside that black box? Well, a NN. And what is that? A network of neurons. Or let’s start using NN terminology and call it network of units.
Usually, most of the NN descriptions start by describing what a single unit does (like this introduction for dummies). The diagram below describes this idea.
Note that the model of the single unit also follows the structure of a black box inserted between inputs and output. In this case the calculations of the black box are just (1) adding the inputs and (2) applying an activation function (e.g. step function centered at a particular threshold).
Although understanding the behavior of a single unit is straightforward (you can confirm this by following the red and blue numerical examples), one of the key elements of the NN has been already introduced: the activation function. At the unit level it looks like just a function applied to the outcome of a summation, but believe me when I say that it is this simple step that allows the NN to model more complex behaviors.
Network of units
Reviewing the single unit behavior is a good starting point but I believe that understanding how a NN works is about understanding it as whole. This is not a trivial task, and it definitely needs time. But let me help you talking about it in this section that explains what I consider the key aspects of the NN operation. I will use the following diagram representing a network of interconnected units.
Highly Connected Topology
The interconnection structure is somehow the foundations where NN are built on. A NN is composed of different layers of units (in the example: the input, hidden and output layers). The key point is to perceive that not only each layer feeds the next one, but also that ‘all’ units of each layer feed ‘all’ units of the upcoming layer. Note how the blue connections leaving input unit in1 land on all hidden units hid1 to hid4.
Everything is based on the idea of the Forward Propagation. As briefly commented before, a NN is something that sits between the inputs and the outputs. So the forward propagation is the technique used by the NN to propagate the input values towards the output across all the layers. The second post of this series will bring much more detail on how Forward Propagation works; for now it is enough to just think of it as a ‘propagation’.
In the example of digit classification, the input layer is fed with a representation of the handwritten digit (each pixel grey value maps to a different input unit). ‘All’ these values are propagated to ‘all’ the units of the hidden layer (the blue arrows from input unit in1 and the orange arrows from in2). In the hidden layer each unit will make its own calculations and ‘all’ these results are in turn propagated to the output layer (the dashed blue-orange arrows) where each output unit will again perform their calculations and come up with a result. Note that the example of the handwritten numbers has been specified as a classification problem and each output unit represents one of the possible digits. That is, in a correct classification only the unit representing the right digit will be activated (digit 2, or 2nd unit in the diagram).
Note that in each layer there is an additional bias unit (not drawn in the diagram for sake of simplicity) that also connects to all units of the next layer. This bias unit, as the name indicates, is used to introduce a bias term in the model. Just think of it as a constant value that has a similar role as the term b in a simple linear model y=ax+b.
While the full connectivity between neighbor layers adds difficulty to model comprehension, it is indeed one of the aspects that allows NN to model complex behaviors. Right now just remember that every unit of every layer gets somehow propagated forward and ends up affecting all the units of all the upcoming layers. We will see more on that effect in the second post of the series about forward propagation and cost computation.
We know that a NN is composed of layers with units. And we know that these units are connected with their counterparts in the neighboring layers. Let’s complicate this a little bit more and consider that each one of these connections between units has a different weight. The forward propagation concept remains the same because the value of a given unit will spread to the next layer; but now its effect on each neighbor unit will depend on the value of the weight of that particular connection. Take as an example the blue connection between in1 and hid1: the value in in1 is propagated to hid1 but multiplied by the weight wA1.
Again, the concept of applying a weight to a single connection is not difficult to understand as it is just a multiplier. However, thinking about it at the network level is what makes it interesting because it is exactly the composition of all weights that really determines how the effect of each unit is propagated towards the output. Let’s take an example and look at the output unit out2 that gets activated if the predicted digit is a ‘2’. If we look backwards, out2 is affected by the weights from all the units in the hidden layer (wB2, wB12, …). In turn, each unit of the hidden layer is affected by all units of the input layer (e.g. hid2 by wA1, wA5, …).
It is getting complicated, right? In the second post of the series we will talk in detail about how the forward propagation behaves and what is the role of the weights in all the process.
For me this is actually the master key. In the previous description of the single unit behavior I already commented that the activation at unit level is easy to understand (e.g. just applying a step function). But, similarly to the weights idea, it gets conceptually quite complex when thinking of all the activations as a whole (one for each unit).
I say that the unit activation is maybe the most important aspect because it is what makes NN a very good model for complex behaviors. If there were no activation function in the units, the NN would be just a linear combination of different input elements with a complex layered weights structure (a linear model to predict linear behaviors). Introducing the activation function really brings in the capacity to model complicated non-linear behaviors.
One must see that what is really propagated is whether a unit is activated or not (activated if it has value 1, not activated if it has value 0). Think of it as light bulbs in the units, some are turned on and some are turned off. Turning on a unit means that this state will be propagated to the next layer (but weighted!), which will contribute to deciding whether this layer’s unit must be turned on or not. And note that turning on a unit in the output layer really means deciding which digit is predicted.
And last but not least, a final detail about the activation function is that NN commonly use the sigmoid function that produces the s-shaped curve plotted below. This should not complicate a lot the global understanding. Just think of it as a an s-shaped function that gives output values between 0 an 1 (as in the step function). The larger (and positive) the t, the closer to ‘1’ the output; the smaller (or negative) the t, the closer to ‘0’ the output.
The reasons to use the sigmoid-based activation are that:
- Small changes in the input cause small changes in the output. Interesting across for the iteration-based optimization algorithms.
- It has a simple derivative: S‘(t) = S(t)(1-S(t)). Interesting property for minimization of the cost function.
For me, understanding how a NN works is about understanding how these three concepts interact at the network level. One must understand how the network connectivity is related to the output, how the weighting affects the result of the unit activations, how each unit activation affects the upcoming layers, … So good luck with it, but be patient because it is not trivial.
Training the NN: Optimization problem
We have talked about a connected topology, about weights and about an activation function. Let’s assume that we have decided to use a topology with 3 layers and 400, 25 and 10 units respectively (it is actually the NN for the digit classification example), and that we are going to activate units based on the commonly used sigmoid function. How do we decide the weights? There is a lot of them – 10.285 in the digit classification (401×25 + 26×10) – and we cannot make it manually with some trial-and-error methodology.
This is where this becomes an optimization problem that aims at finding the weights that provide the best predictions. Or, more technically, at finding the weights that minimize a cost defined by a difference between the predicted classification and the real one. And since here we are talking about a minimization problem, in order to solve it we are going to need a couple of elements:
- The cost J as a function of the weights
- The gradient of J with respect to the weights
Minimization? cost? gradient? That was maybe a too much for now. Let’s discuss it step by step.
And first of all allow me to go back to the fundamentals of Machine Learning and comment that NNs are a Supervised Learning technique. This means that we use a set of labeled data to train the model, and then we use the trained model to make predictions of unlabeled data. Labeled data refers to the set of input samples of which we know the correct output (for instance, a set of images with handwritten digits with a label indicating the corresponding digit). Think of it as just pairs of input and output data that we use to train the NN. Therefore, unlabeled data is just a set of input values from which we will predict the output because we do not know it.
Training the model is the tough part and in the NN scenario it involves finding those weights that result in better predictions. And what does ‘better prediction’ means? Well this is computed by the cost function (also known as the objective function or simply J). Let’s see how the cost is computed.
Forward Propagation and Cost computation
We will first assume that we have (note from now on I will start using more formal NN notation):
- A weight Θ for each connection
- 10 samples in the training set X
- And the respective 10 outputs y.
If we feed the input layer of the NN with one training sample at a time and we apply 10 times the forward propagation across layers (which involves weighting by the Θ at each connection), we end up having 10 output predicted digits. Note that each prediction really comes from the values of the output units. Hence for every one of the propagations we end up with an vector h with such values. And these values (assuming we are using the sigmoid activation function) represent the probability that the input digit corresponds to the digit that maps to that output unit. For example, in the following diagram, the first run that is training with a ‘2’ digit computes a 0.9 probability at the second output unit.
Once we have the predicted h arrays, computing the cost is just applying a function that evaluates the differences between these predicted values h and the real labels y. That is, if a prediction were correct and h were equal to to y, the cost would be 0. Note that this comparison is done at the level of output unit so we have to actually compare each array h including predicted values between 0 and 1 to an array y of values that are also either 0 or 1. In the first run of the diagram, we compare an array of h with an array of y. The array of h represents the probabilities from the output units, and y represents the one-hot encoding of the digit 2 (all 0s but a 1 in the second position)
The cost function that NNs commonly use is described in the image below. Do not be scared because it is not difficult to understand (and if you are familiar with the Logistic Regression cost function, you are done with it). The first summation over i loops all the input samples X and the second over k loops all the output units (K=10 in the digit classification example). Inside the summations there are two terms: the first one only applies to the samples when y is 1 while the second when y is 0. The three outer summations at the right side are the regularization term to avoid very different Θ values.
The difficult part of the cost computation is to navigate the formula with so many indices and specially when dealing with a vectorized implementation (we will deal with this in the second post of the series).
Optimizer and Gradient Computation
So far we know how to:
- Feed the NN with 10 input samples X
- Apply forward propagation 10 times with the same set of weights Θ
- Predict 10 sets of values of h
- Compare them to 10 values of y
- Obtain a single value of the cost J.
What really interests us in here is that a particular set of weights Θa ends up producing a cost of ja. With the same procedure with a different set of weights Θb we would end up with a different value of cost jb. If jb were smaller than ja, Θb would be a better set of weights. Searching for the best set is the role of the optimizer (or minimizer if we talk about the optimizer that minimizes the cost function).
The optimizer is an iteration-based procedure that starts computing the cost j0 for an initial set of Θ0 in the first iteration (see the optimization procedure summarized in the diagram below). In the second iteration, it will apply small changes to Θ0 to derive a set Θ1 that produces a cost j1 smaller than j0. In the third iteration it will apply small changes and derive the set Θ2 that produces a cost j2 even smaller. And so on until the cost converges at iteration i; then the set Θi will be the best weights that the optimizer found and that produce the best predictions (at least compared to the sets result of the previous iterations).
While the optimization procedure is usually managed by a third party library, it is our role to help it providing a way to compute the two elements that we mentioned at the beginning of the section:
- The cost J as a function of the weights, so the optimizer can interpret if a set of Θi is a good candidate or not.
- The gradient of J (▽J) with respect to the weights, so the optimizer can apply small changes to Θi and continue to the next iteration.
We have already talked a little bit about how to compute the cost J (the long formula in a previous image), and we will now talk very briefly about how to compute the gradient ▽J. This calculation is actually the trickiest part in understanding a NN operation an it deserves a complete post for itself (the third of the series). For now we just need to know that there is an algorithm called Backpropagation that is the most widely used to compute the gradient of J. It actually returns a matrix with the partial derivatives of J with respect to each weight theta (the Jacobian). We just tell the optimizer this matrix and we are good to go.
As a summary of all the training process the following pseudocode explains the high-level implementation steps of the code to train a NN. For a complete and vectorized implementation in Python refer to my code of the NN exercise from the course on Machine Learning at Coursera.
- Init theta to random values - Optimization iterations until convergence of cost J: - Compute the cost - Forward propagation to predict h for each sample - Compute the cost comparing h with y for each sample - Average the cost across all samples - across to compute the gradient with respect to each theta - The optimizer returns the best found set of theta values
Predicting and Evaluating the model
So at this point we have a set of Θ values, the weights, that we are pretty confident of because it is the best set that the optimizer found. What do we do with them? First we will configure the NN with these values at each connection, then we will fed the NN with an input sample, and finally we will apply forward propagation in order to find out which of the output units gets activated. In the digit classification example, this implies feeding the NN with the pixels of a handwritten digit and propagate them forward to see which output unit gets activated. If for example the fifth output unit becomes active, it means that the NN interpreted that the digit in the image is a 5.
By the way, if you were wondering if the the Forward Propagation concept is both used at the training phase to compute the cost and at the prediction phase to compute the actual prediction, you are right.
Note that predicting a single sample is not the trickiest part at all. If we want to know which digit is on an image we just have to feed the NN and observe which output unit is fired. The interesting part of the prediction step is actually how to use it to test the model and evaluate the performance. The simplest evaluation that we can perform is to take a labeled data set (remember, a set where we have both the input and the right output) and predict the respective output values to compare them to the real counterparts. With this comparison we can obtain performance measures like accuracy, precision-recall curves, F1 score, …
Let me elaborate a little bit more on how to organize the sets of data because it is a very important aspect when it comes to evaluating the model. Note that for the entire training step we did use a set of labeled data. In the previous paragraph I just said to test the optimized model also with a set of labeled data. What it is usually done is to take all the labeled data that we have and split it into a training set and a test set (around 70% and 30% respectively). We will obviously use the first one for training the NN and the second one for testing the model.
Note that it is important to keep this separation of data because in the training phase we might end up with a model that fits too much the behavior of the training data. If we evaluate this possibly ‘over-fitted’ model with the same training data, we will definitely obtain very good performance indicators. But when we evaluate it with the test data, then we will observe the real performance. And remember that one should not go back and forth fine tuning the model until the evaluation in the test set becomes good enough, because then we are converting the test set into training set. In order to allow for this back and forth procedure you can split the original data into three pieces and add the cross-validation set.
Enough for now, let’s leave the details of how to evaluate a model for another post because it is an interesting enough topic by itself and it actually applies to any Machine Learning algorithm, not only to NN.
First of all congratulations if you reached to this point. That was a pretty long text
In this post I tried to transmit my thoughts and understanding on how NN work. To do that I needed to also go through the usual descriptions, but I hope it was interesting.
In the following two posts I will review with more detail two of the steps that I described in this text:
- The cost computation (with Forward Propagation)
- The gradient computation (with Backpropagation)
Hope to see you there!