Update: a better formulation of the issue.

I'm trying to understand the backpropagation algorithm with an XOR neural network as an example. For this case there are 2 input neurons + 1 bias, 2 neurons in the hidden layer + 1 bias, and 1 output neuron.

 A    B   A XOR B
 1    1     -1
 1   -1      1
-1    1      1
-1   -1     -1

A sample XOR neural network

I'm using stochastic backpropagation.

After reading a bit more I have found out that the error of the output unit is propagated back to the hidden layers... Initially this was confusing, because once you get to the input layer of the network, each neuron receives an error adjustment from both of the neurons in the hidden layer. In particular, the way the error is distributed is difficult to grasp at first.

Step 1: calculate the output for each instance of input.
Step 2: calculate the error between the output neuron(s) (in our case there is only one) and the target value(s):

    delta_k = o_k * (1 - o_k) * (t_k - o_k)

Step 3: use the error from Step 2 to calculate the error for each hidden unit h (a short sketch of these three steps follows the list):

    delta_h = o_h * (1 - o_h) * sum over output units k of (w_kh * delta_k)
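For concreteness, here is a minimal sketch of Steps 1-3 for the 2-2-1 network above. All names and the random initial weights are illustrative; sigmoid units are assumed, so the -1/+1 targets from the table would be remapped to 0/1 (or tanh used instead).

    import math
    import random

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Minimal 2-2-1 setup (biases folded in as a constant +1 input).
    random.seed(0)
    w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]  # 2 hidden units: 2 input weights + bias each
    w_output = [random.uniform(-0.5, 0.5) for _ in range(3)]                      # 1 output unit: 2 hidden weights + bias

    def forward(a, b):
        """Step 1: compute hidden and output activations for one training instance."""
        x = [a, b, 1.0]                                                # inputs plus bias
        o_h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
        o_k = sigmoid(sum(w * oi for w, oi in zip(w_output, o_h + [1.0])))
        return o_h, o_k

    def errors(o_h, o_k, t_k):
        """Steps 2 and 3: output-unit error, then hidden-unit errors."""
        delta_k = o_k * (1.0 - o_k) * (t_k - o_k)                      # Step 2
        # Step 3: only one output unit here, so the sum over k has a single term.
        delta_h = [o * (1.0 - o) * w_output[h] * delta_k for h, o in enumerate(o_h)]
        return delta_k, delta_h

    o_h, o_k = forward(1.0, 1.0)
    print(errors(o_h, o_k, t_k=0.0))   # target 0 stands in for -1 with sigmoid outputs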

The 'weight kh' is the weight between the hidden unit h and the output unit k, which is confusing because an input unit does not have a direct weight to the output unit. After staring at the formula for a few hours I started to think about what the summation means, and I'm coming to the conclusion that each input neuron's weight to the hidden-layer neurons gets multiplied by the output error and summed up. That is a logical conclusion, but the formula still seems a little confusing, since it clearly says the 'weight kh' (between the output layer k and the hidden layer h).

Am I understanding everything correctly here? Can anybody confirm this?

What's O(h) of the input layer? My understanding is that each input node has two outputs: one that goes into the first node of the hidden layer and one that goes into the second node of the hidden layer. Which of the two outputs should be plugged into the O(h)*(1 - O(h)) part of the formula?

+1  A: 

What I read from Step 3's equation is:

  1. O_h = last output of this hidden unit (O_h on the input layer is the actual input value)
  2. w_kh = weight of connection between this hidden unit and a unit of the next layer (towards output)
  3. delta_k = error of unit of the next layer (towards output, same unit as previous bullet)

Each unit has only one output, but each link between that output and the next layer is weighted. So the output is the same, but on the receiving end each unit will receive a different value if the weights of the links are different. O_h always refers to the value of this neuron from the last iteration. Error does not apply to the input layer, as by definition the input has no 'error' per se.

The error needs to be calculated layer by layer, starting at the output side, since we need the error values of layer N+1 to calculate layer N. You are right, there is no direct connection between input and output in backpropagation.

I believe the equation is correct, if counterintuitive. What is probably confusing is that in forward propagation, for each unit we have to consider all the units and links on the left of the unit (input values), but for error propagation (backpropagation) we have to consider the units on the right (output side) of the unit being processed.
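For illustration, a minimal sketch of that right-to-left sweep, assuming sigmoid units and fully connected layers; the `outputs`/`weights` layout is an assumption made up for this example, not code from any linked tutorial:

    def backward(outputs, weights, target):
        """Propagate errors from the output layer back towards the input.

        outputs[l][i]    : activation of unit i in layer l (layer 0 = input)
        weights[l][j][i] : weight from unit i in layer l to unit j in layer l+1
        target           : scalar target for the single output unit (as in the XOR example)
        Returns deltas[l][i] for every non-input layer; deltas[0] stays None.
        """
        last = len(outputs) - 1
        deltas = [None] * (last + 1)

        # Output layer: error comes from the target value (Step 2).
        deltas[last] = [o * (1.0 - o) * (target - o) for o in outputs[last]]

        # Hidden layers: error comes from the layer to the right (Step 3).
        for l in range(last - 1, 0, -1):
            deltas[l] = [
                o * (1.0 - o) * sum(weights[l][k][h] * deltas[l + 1][k]
                                    for k in range(len(outputs[l + 1])))
                for h, o in enumerate(outputs[l])
            ]
        return deltas

Note that the loop stops before layer 0, matching the point above that the input layer has no error term.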

cjcela
OK, after reading some more I also agree with you: O_h is the actual value of the unit. A little clarification on delta_k: how would one calculate it for the hidden layer? I understand how to calculate it for the output layer since we can directly compare it to the XOR target value. But what's the target value for the hidden layer? I assumed we only calculate it once with respect to the output layer and we use it for all layers.
Lirik
There is no 'target value' for the hidden neurons. You must use the equation you have listed in step 3 to calculate all delta_k's for the hidden layer neurons. Notice that you only need the output value and the errors from the neurons in the layers to the right to do that - that is why the errors MUST be calculated starting from the output back towards the input.
cjcela
Equation in step 2 is only meant for the output neurons' error. Equation in step 3 is meant for the hidden neurons' error. Input layer neurons have no error term.
cjcela
@cjcela OK, so the delta for an input neuron is calculated by multiplying its output term Oh(1-Oh) by the sum of the weight * error products from the hidden units to the right. Say the top hidden unit has an error of 0.02 and the bottom one 0.01, the weights from the input unit to both hidden units are 0.5, and the output of the input unit is 1; then we end up with 1(1-1)(0.5*0.02+0.5*0.01) = 0, so no adjustment is made to the weight of the input unit.
Lirik
Just found something that may help, Lirik. Take a look at the C/C++ source code here: http://www.codeproject.com/KB/recipes/BP.aspx - it is all there.
cjcela
+3  A: 

The tutorial you posted here is actually doing it wrong. I double-checked it against Bishop's two standard books and two of my own working implementations, and I will point out below exactly where.

An important thing to keep in mind is that you are always searching for derivatives of the error function with respect to either a unit or a weight. The former are the deltas; the latter are what you use to update your weights.

If you want to understand backpropagation, you have to understand the chain rule - it's all about the chain rule here. If you don't know exactly how it works, look it up on Wikipedia; it's not that hard. As soon as you understand the derivations, everything falls into place. Promise! :)

∂E/∂W can be decomposed into ∂E/∂o * ∂o/∂W via the chain rule. ∂o/∂W is easily calculated, since it's just the derivative of the activation/output of a unit with respect to the weights. ∂E/∂o is what we call the deltas. (I am assuming that E, o and W are vectors/matrices here.)

We do have them for the output units, since that is where we can calculate the error. (Usually we have an error function whose delta comes down to (t_k - o_k), e.g. the quadratic error function in the case of linear outputs and cross entropy in the case of logistic outputs.)

The question now is: how do we get the derivatives for the internal units? Well, we know that the output of a unit is the weighted sum of all incoming units, with a transfer function applied afterwards. So o_k = f(sum(w_kj * o_j, for all j)).

So what we do is differentiate o_k with respect to o_j, since delta_j = ∂E/∂o_j = ∂E/∂o_k * ∂o_k/∂o_j = delta_k * ∂o_k/∂o_j. So given delta_k, we can calculate delta_j!

Let's do this: o_k = f(sum(w_kj * o_j, for all j)) => ∂o_k/∂o_j = f'(sum(w_kj * o_j, for all j)) * w_kj = f'(z_k) * w_kj, writing z_k for the weighted sum.

For the case of the sigmoidal transfer function, this becomes z_k(1 - z_k) * w_kj. (Here is the error in the tutorial: the author says o_k(1 - o_k) * w_kj!)
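If a formula like this is ever in doubt, a finite-difference check is a quick way to verify it numerically. Below is a minimal sketch (the function name and calling convention are assumptions for illustration) that estimates ∂E/∂w_i for comparison against whatever analytic backprop gradient you compute:

    def numeric_grad(error_fn, weights, i, eps=1e-6):
        """Central-difference estimate of dE/dw_i.

        error_fn(weights) must return the scalar error E for the given flat
        weight list; compare the result against the analytic backprop gradient.
        """
        w_plus, w_minus = list(weights), list(weights)
        w_plus[i] += eps
        w_minus[i] -= eps
        return (error_fn(w_plus) - error_fn(w_minus)) / (2.0 * eps)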

bayer
+1 for interleaving the computational details w/ the intuition behind backprop.
doug
+1  A: 
ldog
@gmatt, thanks for the participation... the question is a little old (feb 2010), but I figured out where I was having the problem.
Lirik