Optimizers (optim)

Optimizers update model parameters using accumulated gradients. Import them from numpygrad.optim:

import numpygrad.optim as optim

All optimizers share the same interface: construct one with a parameter list, call zero_grad() before each forward pass so that gradients from the previous iteration do not accumulate, and call step() after backward():

optimizer = optim.SGD(model.parameters(), step_size=1e-2)

optimizer.zero_grad()   # reset .grad on all params
# ... forward pass that computes `loss` goes here ...
loss.backward()         # accumulate gradients
optimizer.step()        # update params

optim.Optimizer (base class)

Provides zero_grad(), which sets param.grad = None for every parameter. Subclasses must implement step().
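
As a rough illustration of this contract, the sketch below mimics the base class in plain Python. The Param container and the SketchOptimizer name are hypothetical stand-ins, not numpygrad's actual classes; only the behaviour of zero_grad() and the requirement that subclasses implement step() are taken from the description above.

```python
class Param:
    """Hypothetical stand-in for a trainable tensor: a value plus an optional gradient."""

    def __init__(self, data):
        self.data = data
        self.grad = None


class SketchOptimizer:
    """Assumed shape of the base class; the real implementation may differ."""

    def __init__(self, params):
        # Materialise the iterable so it can be traversed on every call.
        self.params = list(params)

    def zero_grad(self):
        # Reset every gradient to None, as described above.
        for p in self.params:
            p.grad = None

    def step(self):
        # The base class has no update rule of its own.
        raise NotImplementedError("subclasses must implement step()")
```

Storing the parameters as a list matters if model.parameters() returns a generator, since a generator could only be iterated once.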

optim.SGD

Vanilla stochastic gradient descent:

optimizer = optim.SGD(model.parameters(), step_size=1e-3)

Each step applies param.data -= step_size * param.grad with no momentum or weight decay. Good for quick experiments and small models.
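
To make the update rule concrete, here is a minimal NumPy sketch that applies it by hand to the quadratic f(w) = sum((w - target) ** 2). The function name sgd_step, the target values, and the step size are illustrative assumptions, and the gradient is written analytically rather than produced by backward().

```python
import numpy as np


def sgd_step(data, grad, step_size):
    """Apply one vanilla SGD update in place: data -= step_size * grad."""
    data -= step_size * grad
    return data


target = np.array([3.0, -1.0])
w = np.zeros(2)

for _ in range(200):
    grad = 2.0 * (w - target)  # analytic gradient of sum((w - target) ** 2)
    sgd_step(w, grad, step_size=0.1)
```

With step_size=0.1, each iteration shrinks the distance to the target by a factor of 0.8, so w ends up very close to target after 200 steps.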

optim.AdamW

Adam with decoupled weight decay:

optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-2,
)

Parameters:

  • lr — learning rate (default 1e-3)

  • betas — exponential decay rates for the first and second moment estimates (default (0.9, 0.999))

  • eps — numerical stability term added to the denominator (default 1e-8)

  • weight_decay — weight decay coefficient applied directly to the weights, decoupled from the gradient-based update; unlike L2 regularisation, it is not added to the gradient (default 1e-2)
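
The roles of these parameters are easiest to see in the update itself. The sketch below follows the commonly published AdamW formulation (bias-corrected moment estimates, with weight decay applied to the weights separately from the gradient step). It is an assumption that numpygrad's AdamW.step() matches it in every detail, such as the ordering of the decay and the placement of eps. The adamw_step function and its state arguments are hypothetical names.

```python
import numpy as np


def adamw_step(w, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
    """One AdamW update. Returns the new (w, m, v); t is the 1-based step count."""
    beta1, beta2 = betas
    # Decoupled weight decay: shrink the weights directly, independent of grad.
    w = w * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and of its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1.0 - beta1**t)
    v_hat = v / (1.0 - beta2**t)
    # eps keeps the denominator away from zero.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w, m, v = adamw_step(w, grad=np.array([0.5, -0.5]), m=m, v=v, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each weight moves by roughly lr regardless of the gradient's scale, plus a small shrink from weight decay.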

AdamW is the recommended default for most tasks. Use SGD when you want full control over the update rule or are studying optimisation dynamics.