
Commit 054c402

Use range instead of xrange
1 parent 80e9738 commit 054c402
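
The change itself is mechanical: `xrange` exists only in Python 2, where it returns a lazy sequence, while `range` works under both Python 2 (returning a list) and Python 3 (returning a lazy `range` object), so loops of this size behave the same either way. A minimal illustration:

```python
# Python 2: xrange(3) is lazy, range(3) builds a list; both iterate 0, 1, 2.
# Python 3: xrange is gone entirely, and range(3) is itself a lazy sequence.
for j in range(3):  # identical behavior under both interpreters
    print(j)        # prints 0, 1, 2
```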

3 files changed: 29 additions, 29 deletions


linear-classify.md

Lines changed: 1 addition & 1 deletion
@@ -202,7 +202,7 @@ def L_i(x, y, W):
   correct_class_score = scores[y]
   D = W.shape[0] # number of classes, e.g. 10
   loss_i = 0.0
-  for j in xrange(D): # iterate over all wrong classes
+  for j in range(D): # iterate over all wrong classes
     if j == y:
       # skip for the true class to only loop over incorrect classes
       continue
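
For context, this hunk sits inside the unvectorized multiclass SVM loss from the notes. A self-contained sketch of the full function, with the lines that fall outside the diff (the margin `delta` and the hinge accumulation) reconstructed from the surrounding notes and therefore approximate:

```python
import numpy as np

def L_i(x, y, W):
  """
  Unvectorized multiclass SVM loss for a single example (x, y).
  - x: column of image pixels (e.g. 3073 x 1, with the bias trick)
  - y: integer index of the correct class
  - W: weight matrix (e.g. 10 x 3073)
  """
  delta = 1.0 # margin hyperparameter (reconstructed, not part of this diff)
  scores = W.dot(x) # class scores, e.g. 10 x 1
  correct_class_score = scores[y]
  D = W.shape[0] # number of classes, e.g. 10
  loss_i = 0.0
  for j in range(D): # iterate over all wrong classes
    if j == y:
      # skip for the true class to only loop over incorrect classes
      continue
    # accumulate hinge loss over margin violations (reconstructed)
    loss_i += max(0, scores[j] - correct_class_score + delta)
  return loss_i
```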

neural-networks-case-study.md

Lines changed: 20 additions & 20 deletions
@@ -30,7 +30,7 @@ D = 2 # dimensionality
 K = 3 # number of classes
 X = np.zeros((N*K,D)) # data matrix (each row = single example)
 y = np.zeros(N*K, dtype='uint8') # class labels
-for j in xrange(K):
+for j in range(K):
   ix = range(N*j,N*(j+1))
   r = np.linspace(0.0,1,N) # radius
   t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
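
The hunk above cuts off before the loop body finishes. For readers who want to run the snippet, here is a self-contained version of the spiral-data generator; the final two lines of the loop (polar-to-Cartesian conversion and labeling) are filled in from the notes and should be treated as a reconstruction:

```python
import numpy as np

N = 100 # points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
  ix = range(N*j,N*(j+1))
  r = np.linspace(0.0,1,N) # radius
  t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
  X[ix] = np.c_[r*np.sin(t), r*np.cos(t)] # polar -> Cartesian (reconstructed)
  y[ix] = j # label of the j-th spiral arm (reconstructed)
```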
@@ -66,7 +66,7 @@ W = 0.01 * np.random.randn(D,K)
 b = np.zeros((1,K))
 ```
 
-Recall that `D = 2` is the dimensionality and `K = 3` is the number of classes.
+Recall that `D = 2` is the dimensionality and `K = 3` is the number of classes.
 
 <a name='scores'></a>
 
@@ -142,7 +142,7 @@ $$
 \frac{\partial L_i }{ \partial f_k } = p_k - \mathbb{1}(y_i = k)
 $$
 
-Notice how elegant and simple this expression is. Suppose the probabilities we computed were `p = [0.2, 0.3, 0.5]`, and that the correct class was the middle one (with probability 0.3). According to this derivation the gradient on the scores would be `df = [0.2, -0.7, 0.5]`. Recalling the interpretation of the gradient, we see that this result is highly intuitive: increasing the first or last element of the score vector `f` (the scores of the incorrect classes) leads to an *increased* loss (due to the positive signs +0.2 and +0.5) - and increasing the loss is bad, as expected. However, increasing the score of the correct class has *negative* influence on the loss. The gradient of -0.7 is telling us that increasing the correct class score would lead to a decrease of the loss \\(L_i\\), which makes sense.
+Notice how elegant and simple this expression is. Suppose the probabilities we computed were `p = [0.2, 0.3, 0.5]`, and that the correct class was the middle one (with probability 0.3). According to this derivation the gradient on the scores would be `df = [0.2, -0.7, 0.5]`. Recalling the interpretation of the gradient, we see that this result is highly intuitive: increasing the first or last element of the score vector `f` (the scores of the incorrect classes) leads to an *increased* loss (due to the positive signs +0.2 and +0.5) - and increasing the loss is bad, as expected. However, increasing the score of the correct class has *negative* influence on the loss. The gradient of -0.7 is telling us that increasing the correct class score would lead to a decrease of the loss \\(L_i\\), which makes sense.
 
 All of this boils down to the following code. Recall that `probs` stores the probabilities of all classes (as rows) for each example. To get the gradient on the scores, which we call `dscores`, we proceed as follows:
 
@@ -193,34 +193,34 @@ reg = 1e-3 # regularization strength
 
 # gradient descent loop
 num_examples = X.shape[0]
-for i in xrange(200):
-  
+for i in range(200):
+
   # evaluate class scores, [N x K]
-  scores = np.dot(X, W) + b 
-  
+  scores = np.dot(X, W) + b
+
   # compute the class probabilities
   exp_scores = np.exp(scores)
   probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]
-  
+
   # compute the loss: average cross-entropy loss and regularization
   correct_logprobs = -np.log(probs[range(num_examples),y])
   data_loss = np.sum(correct_logprobs)/num_examples
   reg_loss = 0.5*reg*np.sum(W*W)
   loss = data_loss + reg_loss
   if i % 10 == 0:
     print "iteration %d: loss %f" % (i, loss)
-  
+
   # compute the gradient on scores
   dscores = probs
   dscores[range(num_examples),y] -= 1
   dscores /= num_examples
-  
+
   # backpropagate the gradient to the parameters (W,b)
   dW = np.dot(X.T, dscores)
   db = np.sum(dscores, axis=0, keepdims=True)
-  
+
   dW += reg*W # regularization gradient
-  
+
   # perform a parameter update
   W += -step_size * dW
   b += -step_size * db
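
One detail in the loop above that is easy to miss: `keepdims=True` makes the row sums come back as an `[N x 1]` column rather than a flat `[N]` vector, so the division broadcasts correctly per example. A quick standalone check (the array values are illustrative):

```python
import numpy as np

exp_scores = np.array([[1.0, 2.0, 3.0],
                       [4.0, 1.0, 1.0]])  # pretend [N x K] with N=2, K=3
row_sums = np.sum(exp_scores, axis=1, keepdims=True)  # shape [2 x 1]
probs = exp_scores / row_sums  # broadcasts: each row divided by its own sum
print(probs.sum(axis=1))  # [1. 1.] -- every row is a valid distribution
```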
@@ -340,29 +340,29 @@ reg = 1e-3 # regularization strength
 
 # gradient descent loop
 num_examples = X.shape[0]
-for i in xrange(10000):
-  
+for i in range(10000):
+
   # evaluate class scores, [N x K]
   hidden_layer = np.maximum(0, np.dot(X, W) + b) # note, ReLU activation
   scores = np.dot(hidden_layer, W2) + b2
-  
+
   # compute the class probabilities
   exp_scores = np.exp(scores)
   probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]
-  
+
   # compute the loss: average cross-entropy loss and regularization
   correct_logprobs = -np.log(probs[range(num_examples),y])
   data_loss = np.sum(correct_logprobs)/num_examples
   reg_loss = 0.5*reg*np.sum(W*W) + 0.5*reg*np.sum(W2*W2)
   loss = data_loss + reg_loss
   if i % 1000 == 0:
     print "iteration %d: loss %f" % (i, loss)
-  
+
   # compute the gradient on scores
   dscores = probs
   dscores[range(num_examples),y] -= 1
   dscores /= num_examples
-  
+
   # backpropagate the gradient to the parameters
   # first backprop into parameters W2 and b2
   dW2 = np.dot(hidden_layer.T, dscores)
@@ -374,11 +374,11 @@ for i in xrange(10000):
   # finally into W,b
   dW = np.dot(X.T, dhidden)
   db = np.sum(dhidden, axis=0, keepdims=True)
-  
+
   # add regularization gradient contribution
   dW2 += reg * W2
   dW += reg * W
-  
+
   # perform a parameter update
   W += -step_size * dW
   b += -step_size * db
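
The gradient expression discussed in the earlier hunk of this file can be checked in isolation. Using the example from the notes, probabilities `p = [0.2, 0.3, 0.5]` with the middle class correct, the `dscores` recipe from the diff gives exactly `[0.2, -0.7, 0.5]`:

```python
import numpy as np

probs = np.array([[0.2, 0.3, 0.5]])  # class probabilities for one example
y = np.array([1])                    # the correct class is the middle one
num_examples = probs.shape[0]

dscores = probs.copy()                # df_k = p_k ...
dscores[range(num_examples), y] -= 1  # ... minus 1 at the correct class
dscores /= num_examples               # average over examples
print(dscores)                        # [[ 0.2 -0.7  0.5]]
```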

optimization-1.md

Lines changed: 8 additions & 8 deletions
@@ -99,7 +99,7 @@ Since it is so simple to check how good a given set of parameters **W** is, the
 # assume the function L evaluates the loss function
 
 bestloss = float("inf") # Python assigns the highest possible float value
-for num in xrange(1000):
+for num in range(1000):
   W = np.random.randn(10, 3073) * 0.0001 # generate random parameters
   loss = L(X_train, Y_train, W) # get the loss over the entire training set
   if loss < bestloss: # keep track of the best solution
@@ -147,7 +147,7 @@ The first strategy you may think of is to try to extend one foot in a random dir
 ```python
 W = np.random.randn(10, 3073) * 0.001 # generate random starting W
 bestloss = float("inf")
-for i in xrange(1000):
+for i in range(1000):
   step_size = 0.0001
   Wtry = W + np.random.randn(10, 3073) * step_size
   loss = L(Xtr_cols, Ytr, Wtry)
@@ -187,11 +187,11 @@ The formula given above allows us to compute the gradient numerically. Here is a
 
 ```python
 def eval_numerical_gradient(f, x):
-  """ 
-  a naive implementation of numerical gradient of f at x 
+  """
+  a naive implementation of numerical gradient of f at x
   - f should be a function that takes a single argument
   - x is the point (numpy array) to evaluate the gradient at
-  """ 
+  """
 
   fx = f(x) # evaluate function value at original point
   grad = np.zeros(x.shape)
@@ -215,7 +215,7 @@ def eval_numerical_gradient(f, x):
   return grad
 ```
 
-Following the gradient formula we gave above, the code iterates over all dimensions one by one, makes a small change `h` along that dimension and calculates the partial derivative of the loss function along that dimension by seeing how much the function changed. The variable `grad` holds the full gradient in the end.
+Following the gradient formula we gave above, the code iterates over all dimensions one by one, makes a small change `h` along that dimension and calculates the partial derivative of the loss function along that dimension by seeing how much the function changed. The variable `grad` holds the full gradient in the end.
 
 **Practical considerations**. Note that in the mathematical formulation the gradient is defined in the limit as **h** goes towards zero, but in practice it is often sufficient to use a very small value (such as 1e-5 as seen in the example). Ideally, you want to use the smallest step size that does not lead to numerical issues. Additionally, in practice it often works better to compute the numeric gradient using the **centered difference formula**: \\( [f(x+h) - f(x-h)] / 2h \\). See [wiki](http://en.wikipedia.org/wiki/Numerical_differentiation) for details.
 
@@ -297,7 +297,7 @@ $$
 \nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i
 $$
 
-Once you derive the expression for the gradient, it is straightforward to implement it and use it to perform the gradient update.
+Once you derive the expression for the gradient, it is straightforward to implement it and use it to perform the gradient update.
 
 <a name='gd'></a>
 
@@ -346,7 +346,7 @@ In this section,
 
 - We developed the intuition of the loss function as a **high-dimensional optimization landscape** in which we are trying to reach the bottom. The working analogy we developed was that of a blindfolded hiker who wishes to reach the bottom. In particular, we saw that the SVM cost function is piece-wise linear and bowl-shaped.
 - We motivated the idea of optimizing the loss function with
-**iterative refinement**, where we start with a random set of weights and refine them step by step until the loss is minimized.
+**iterative refinement**, where we start with a random set of weights and refine them step by step until the loss is minimized.
 - We saw that the **gradient** of a function gives the steepest ascent direction and we discussed a simple but inefficient way of computing it numerically using the finite difference approximation (the finite difference being the value of *h* used in computing the numerical gradient).
 - We saw that the parameter update requires a tricky setting of the **step size** (or the **learning rate**) that must be set just right: if it is too low the progress is steady but slow. If it is too high the progress can be faster, but more risky. We will explore this tradeoff in much more detail in future sections.
 - We discussed the tradeoffs between computing the **numerical** and **analytic** gradient. The numerical gradient is simple but it is approximate and expensive to compute. The analytic gradient is exact, fast to compute but more error-prone since it requires the derivation of the gradient with math. Hence, in practice we always use the analytic gradient and then perform a **gradient check**, in which its implementation is compared to the numerical gradient.
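
The centered difference formula from the **Practical considerations** paragraph above and the **gradient check** from the last bullet can be demonstrated together. A minimal sketch, using a toy function `f(x) = sum(x**2)` with known analytic gradient `2x` (the function, `h`, and names here are illustrative, not from the commit):

```python
import numpy as np

def eval_numerical_gradient_centered(f, x, h=1e-5):
  # centered difference: df/dx_i ~ [f(x+h) - f(x-h)] / (2h), per dimension
  grad = np.zeros(x.shape)
  it = np.nditer(x, flags=['multi_index'])
  while not it.finished:
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h
    fxph = f(x)              # f(x + h)
    x[ix] = old_value - h
    fxmh = f(x)              # f(x - h)
    x[ix] = old_value        # restore the original value
    grad[ix] = (fxph - fxmh) / (2 * h)
    it.iternext()
  return grad

# gradient check: compare the numerical estimate to the analytic gradient
x = np.random.randn(5)
numerical = eval_numerical_gradient_centered(lambda v: np.sum(v**2), x)
analytic = 2 * x
print(np.max(np.abs(numerical - analytic)))  # should be tiny (roundoff only)
```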
