Optimizers (optim)
Optimizers update model parameters using accumulated gradients. Import them
from numpygrad.optim:
import numpygrad.optim as optim
All optimizers share the same interface: construct with a parameter list, call
zero_grad() before each forward pass, and call step() after
backward():
optimizer = optim.SGD(model.parameters(), step_size=1e-2)
optimizer.zero_grad() # reset .grad on all params
loss.backward() # accumulate gradients
optimizer.step() # update params
optim.Optimizer (base class)
Provides zero_grad() which sets param.grad = None for every parameter.
Subclasses must implement step().
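The contract above can be sketched in a few lines of plain NumPy. This is an illustrative sketch, not numpygrad's actual source: the Param class is a hypothetical stand-in for a numpygrad parameter, and SignDescent is a toy subclass invented here only to show where step() plugs in.

```python
import numpy as np


class Param:
    """Hypothetical stand-in for a parameter: a value plus a gradient slot."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = None  # no gradient until a backward pass has run


class Optimizer:
    """Sketch of the base-class contract: zero_grad() provided, step() abstract."""

    def __init__(self, params):
        # Materialise the iterable so it can be traversed on every call.
        self.params = list(params)

    def zero_grad(self):
        # Reset every gradient to None, as described above.
        for p in self.params:
            p.grad = None

    def step(self):
        # Subclasses supply the actual update rule.
        raise NotImplementedError


class SignDescent(Optimizer):
    """Toy subclass (an assumption for illustration, not part of numpygrad)."""

    def __init__(self, params, step_size=1e-2):
        super().__init__(params)
        self.step_size = step_size

    def step(self):
        for p in self.params:
            if p.grad is None:
                continue  # parameter received no gradient this iteration
            p.data -= self.step_size * np.sign(p.grad)


w = Param([1.0, -2.0])
opt = SignDescent([w], step_size=0.1)
w.grad = np.array([0.5, -3.0])  # pretend backward() produced this
opt.step()       # w.data moves 0.1 against the sign of each gradient entry
opt.zero_grad()  # w.grad is None again
```

Skipping parameters whose grad is None is a design choice in this sketch: it lets step() run safely even when some parameters did not take part in the last backward pass.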
optim.SGD
Vanilla stochastic gradient descent:
optimizer = optim.SGD(model.parameters(), step_size=1e-3)
Each step applies param.data -= step_size * param.grad with no momentum or
weight decay. Good for quick experiments and small models.
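To make the update rule concrete, here is a small standalone NumPy sketch that applies it to a quadratic loss. It does not use numpygrad itself: sgd_step is a hypothetical helper, and the gradient is computed by hand rather than by backward().

```python
import numpy as np


def sgd_step(data, grad, step_size):
    """Apply one in-place SGD update: data -= step_size * grad."""
    data -= step_size * grad
    return data


target = np.array([3.0, -1.0])
w = np.zeros(2)
step_size = 0.1

for _ in range(100):
    grad = w - target  # gradient of 0.5 * ||w - target||^2
    sgd_step(w, grad, step_size)

# Each step shrinks the error (w - target) by a factor of (1 - step_size),
# so w ends up very close to target after 100 steps.
```

On this loss the error contracts geometrically, which is why a larger step_size (below 2 here) converges faster and a step_size above 2 would diverge.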
optim.AdamW
Adam with decoupled weight decay:
optimizer = optim.AdamW(
model.parameters(),
lr=1e-3,
betas=(0.9, 0.999),
eps=1e-8,
weight_decay=1e-2,
)
Parameters:
- lr: learning rate (default 1e-3)
- betas: exponential decay rates for the first and second moment estimates (default (0.9, 0.999))
- eps: numerical stability term added to the denominator (default 1e-8)
- weight_decay: L2 regularisation coefficient applied directly to weights, decoupled from the gradient update (default 1e-2)
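For reference, a single AdamW update in its standard textbook form can be sketched in NumPy as follows. The function adamw_step and its signature are hypothetical. Whether numpygrad orders the decay and the Adam update the same way, or places eps identically, is an assumption.

```python
import numpy as np


def adamw_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update (standard formulation); returns new (param, m, v).

    t is the 1-based step count used for bias correction.
    """
    beta1, beta2 = betas
    # Decoupled weight decay: shrink the weights directly, not via the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction compensates for the zero initialisation of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = np.array([0.1, -0.4])

w, m, v = adamw_step(w, grad, m, v, t=1)
# On the first step the bias-corrected ratio m_hat / sqrt(v_hat) is about
# sign(grad), so each weight moves by roughly lr regardless of gradient size.
```

Because the decay multiplies the weights directly, it is not rescaled by the adaptive denominator, which is the practical difference from adding a penalty term to the gradient.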
AdamW is the recommended default for most tasks. Use SGD when you want
a plain, fully transparent update rule or are studying optimisation dynamics.