Notice how elegant and simple this expression is. Suppose the probabilities we computed were `p = [0.2, 0.3, 0.5]`, and that the correct class was the middle one (with probability 0.3). According to this derivation the gradient on the scores would be `df = [0.2, -0.7, 0.5]`. Recalling the interpretation of the gradient, we see that this result is highly intuitive: increasing the first or last element of the score vector `f` (the scores of the incorrect classes) leads to an *increased* loss (due to the positive signs +0.2 and +0.5) - and increasing the loss is bad, as expected. However, increasing the score of the correct class has a *negative* influence on the loss. The gradient of -0.7 is telling us that increasing the correct class score would lead to a decrease of the loss \\(L_i\\), which makes sense.
All of this boils down to the following code. Recall that `probs` stores the probabilities of all classes (as rows) for each example. To get the gradient on the scores, which we call `dscores`, we proceed as follows:
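A minimal sketch of this step, assuming `probs` is a `[num_examples x num_classes]` numpy array, and that an array `y` of correct class indices and the count `num_examples` are available (those two names are assumptions here, not defined above):

```python
# probs[i, k] is the probability assigned to class k for example i
dscores = probs.copy()
dscores[range(num_examples), y] -= 1  # p_k - 1(k == y_i): subtract 1 at each correct class
dscores /= num_examples               # average the gradient over all examples
```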
Following the gradient formula we gave above, the code below iterates over all dimensions one by one, makes a small change `h` along that dimension, and calculates the partial derivative of the loss function along that dimension by seeing how much the function changed. The variable `grad` holds the full gradient in the end.
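A minimal sketch of such a routine; the function name `eval_numerical_gradient`, its arguments `f` (a function of a single numpy array argument) and `x` (the point to evaluate the gradient at), and the step size are choices made here for illustration:

```python
import numpy as np

def eval_numerical_gradient(f, x):
  """A naive numerical gradient of f at x (x is modified in place and restored)."""
  fx = f(x)                    # function value at the original point
  grad = np.zeros(x.shape)
  h = 0.00001

  # iterate over all indexes in x
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h      # increment by h
    fxh = f(x)                 # evaluate f(x + h)
    x[ix] = old_value          # restore to the previous value (very important!)

    grad[ix] = (fxh - fx) / h  # the slope along this dimension
    it.iternext()              # step to the next dimension

  return grad
```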
**Practical considerations**. Note that in the mathematical formulation the gradient is defined in the limit as **h** goes towards zero, but in practice it is often sufficient to use a very small value (such as 1e-5 as seen in the example). Ideally, you want to use the smallest step size that does not lead to numerical issues. Additionally, in practice it often works better to compute the numeric gradient using the **centered difference formula**: \\( [f(x+h) - f(x-h)] / 2h \\). See [wiki](http://en.wikipedia.org/wiki/Numerical_differentiation) for details.
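As a sketch, the same kind of routine adapted to the centered difference formula (again with illustrative names) could look like this:

```python
import numpy as np

def eval_numerical_gradient_centered(f, x, h=1e-5):
  """Numerical gradient of f at x using the centered difference [f(x+h) - f(x-h)] / 2h."""
  grad = np.zeros(x.shape)
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h
    fxph = f(x)               # f(x + h)
    x[ix] = old_value - h
    fxmh = f(x)               # f(x - h)
    x[ix] = old_value         # restore
    grad[ix] = (fxph - fxmh) / (2 * h)
    it.iternext()
  return grad
```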
Once you derive the expression for the gradient it is straightforward to implement the expressions and use them to perform the gradient update.
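For instance, with an analytic gradient in hand, a single parameter update could look roughly like this (all names here, `evaluate_gradient`, `loss_fun`, `data`, `weights`, and `step_size`, are assumed placeholders):

```python
# assumed: evaluate_gradient(loss_fun, data, weights) returns the analytic
# gradient of the loss with respect to the weights on the given data
weights_grad = evaluate_gradient(loss_fun, data, weights)
weights += - step_size * weights_grad  # step along the negative gradient direction
```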
<a name='gd'></a>
In this section,
- We developed the intuition of the loss function as a **high-dimensional optimization landscape** in which we are trying to reach the bottom. The working analogy we developed was that of a blindfolded hiker who wishes to reach the bottom. In particular, we saw that the SVM cost function is piece-wise linear and bowl-shaped.
- We motivated the idea of optimizing the loss function with **iterative refinement**, where we start with a random set of weights and refine them step by step until the loss is minimized.
- We saw that the **gradient** of a function gives the steepest ascent direction and we discussed a simple but inefficient way of computing it numerically using the finite difference approximation (the finite difference being the value of *h* used in computing the numerical gradient).
- We saw that the parameter update requires a tricky setting of the **step size** (or the **learning rate**) that must be set just right: if it is too low the progress is steady but slow. If it is too high the progress can be faster, but more risky. We will explore this tradeoff in much more detail in future sections.
- We discussed the tradeoffs between computing the **numerical** and **analytic** gradient. The numerical gradient is simple but it is approximate and expensive to compute. The analytic gradient is exact, fast to compute but more error-prone since it requires the derivation of the gradient with math. Hence, in practice we always use the analytic gradient and then perform a **gradient check**, in which its implementation is compared to the numerical gradient.