
Commit ff1edc3

Update neural-networks-2.md
1 parent affb8b8 commit ff1edc3


1 file changed: +1 -1 lines changed


neural-networks-2.md

Lines changed: 1 addition & 1 deletion
@@ -117,7 +117,7 @@ $$

where in the first two steps we have used [properties of variance](http://en.wikipedia.org/wiki/Variance). In the third step we assumed zero-mean inputs and weights, so \\(E[x_i] = E[w_i] = 0\\). Note that this is not generally the case: for example, ReLU units will have a positive mean. In the last step we assumed that all \\(w_i, x_i\\) are identically distributed. From this derivation we can see that if we want \\(s\\) to have the same variance as all of its inputs \\(x\\), then during initialization we should make sure that the variance of every weight \\(w\\) is \\(1/n\\). And since \\(\text{Var}(aX) = a^2\text{Var}(X)\\) for a random variable \\(X\\) and a scalar \\(a\\), this implies that we should draw from a unit gaussian and then scale it by \\(a = \sqrt{1/n}\\) to make its variance \\(1/n\\). This gives the initialization `w = np.random.randn(n) / sqrt(n)`.
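
As a quick, illustrative sanity check of the claim above: with weights drawn this way, the pre-activation \\(s = \sum_i w_i x_i\\) empirically keeps the unit variance of its inputs. The sizes `n` and `trials` below are arbitrary choices, not values from the text.

```python
import numpy as np

n, trials = 500, 10000                   # arbitrary illustrative sizes
s = np.empty(trials)
for t in range(trials):
    x = np.random.randn(n)               # zero-mean, unit-variance inputs
    w = np.random.randn(n) / np.sqrt(n)  # the initialization above: Var(w) = 1/n
    s[t] = w.dot(x)                      # pre-activation s = sum_i w_i x_i
print(np.var(s))                         # close to 1.0, i.e. the same variance as x
```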

-A similar analysis is carried out in [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot et al. In this paper, the authors end up recommending an initialization of the form \\( \text{Var}(w) = 2/(n_{in} + n_{out}) \\) where \\(n_{in}, n_{out}\\) are the number of units in the previous layer and the next layer. This is motivated by based on a compromise and an equivalent analysis of the backpropagated gradients. A more recent paper on this topic, [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](http://arxiv-web3.library.cornell.edu/abs/1502.01852) by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be \\(2.0/n\\). This gives the initialization `w = np.random.randn(n) * sqrt(2.0/n)`, and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.
+A similar analysis is carried out in [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot et al. In this paper, the authors end up recommending an initialization of the form \\( \text{Var}(w) = 2/(n_{in} + n_{out}) \\) where \\(n_{in}, n_{out}\\) are the number of units in the previous layer and the next layer. This is based on a compromise and an equivalent analysis of the backpropagated gradients. A more recent paper on this topic, [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](http://arxiv-web3.library.cornell.edu/abs/1502.01852) by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be \\(2.0/n\\). This gives the initialization `w = np.random.randn(n) * sqrt(2.0/n)`, and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.
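
For concreteness, a minimal numpy sketch of how the two recommendations above could be applied to a single fully-connected layer; the sizes `n_in` and `n_out` are hypothetical, not values from the text.

```python
import numpy as np

n_in, n_out = 784, 100  # hypothetical fan-in / fan-out of one layer

# Glorot et al.: Var(w) = 2/(n_in + n_out)
W_glorot = np.random.randn(n_in, n_out) * np.sqrt(2.0 / (n_in + n_out))

# He et al., for ReLU neurons: Var(w) = 2/n, with n the fan-in
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```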

**Sparse initialization**. Another way to address the problem of uncalibrated variances is to set all weight matrices to zero, but to break symmetry, every neuron is randomly connected (with weights sampled from a small gaussian as above) to a fixed number of neurons below it. A typical number of neurons to connect to may be as small as 10.
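
A minimal sketch of that scheme for a single fully-connected layer; the layer sizes and the 0.01 gaussian scale are illustrative assumptions, and the 10 connections per neuron follow the number quoted above.

```python
import numpy as np

n_in, n_out = 784, 100   # hypothetical layer sizes
num_connections = 10     # incoming connections per neuron, as quoted above

W = np.zeros((n_in, n_out))        # all-zero weight matrix to start
for j in range(n_out):             # wire each neuron to a few random inputs below it
    idx = np.random.choice(n_in, num_connections, replace=False)
    W[idx, j] = 0.01 * np.random.randn(num_connections)  # small gaussian weights
```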
