Optimizers (``optim``)
======================

Optimizers update model parameters using accumulated gradients. Import them
from ``numpygrad.optim``::

    import numpygrad.optim as optim

All optimizers share the same interface: construct with a parameter list, call
``zero_grad()`` before each forward pass, and call ``step()`` after
``backward()``::

    optimizer = optim.SGD(model.parameters(), step_size=1e-2)

    optimizer.zero_grad()   # reset .grad on all params
    loss.backward()         # accumulate gradients
    optimizer.step()        # update params
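
The reset comes first because ``backward()`` adds to whatever gradient is
already stored instead of overwriting it. The NumPy-only sketch below imitates
that behaviour for the loss ``sum(w ** 2)``. The ``Param`` class and the
hand-written ``backward`` function are hypothetical stand-ins, not part of
``numpygrad``, whose internals may differ::

    import numpy as np

    class Param:
        """Hypothetical stand-in for a trainable parameter."""
        def __init__(self, data):
            self.data = np.asarray(data, dtype=float)
            self.grad = None

    def backward(p):
        """Accumulate d/dw sum(w ** 2) = 2 * w into p.grad."""
        g = 2.0 * p.data
        p.grad = g if p.grad is None else p.grad + g

    w = Param([1.0, -2.0])

    backward(w)
    backward(w)               # no reset in between: gradients add up
    stale = w.grad.copy()     # twice the true gradient

    w.grad = None             # what zero_grad() does for each parameter
    backward(w)
    fresh = w.grad.copy()     # the true gradient, 2 * w

Here ``stale`` is double ``fresh``, which is the error a missing
``zero_grad()`` call would feed into ``step()``.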

``optim.Optimizer`` (base class)
---------------------------------

Provides ``zero_grad()``, which sets ``param.grad = None`` for every parameter.
Subclasses must implement ``step()``.
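
A minimal sketch of that contract, assuming the base class simply stores the
parameter list. The class bodies, the ``Param`` stand-in, and the ``PlainSGD``
subclass are illustrative assumptions rather than the library's source::

    import numpy as np

    class Param:
        """Hypothetical stand-in for a trainable parameter."""
        def __init__(self, data):
            self.data = np.asarray(data, dtype=float)
            self.grad = None

    class Optimizer:
        """Assumed shape of the base class, not the library's code."""
        def __init__(self, params):
            self.params = list(params)

        def zero_grad(self):
            for p in self.params:
                p.grad = None

        def step(self):
            raise NotImplementedError("subclasses must implement step()")

    class PlainSGD(Optimizer):
        """Illustrative subclass that supplies the missing step()."""
        def __init__(self, params, step_size):
            super().__init__(params)
            self.step_size = step_size

        def step(self):
            for p in self.params:
                if p.grad is None:      # parameter received no gradient
                    continue
                p.data -= self.step_size * p.grad

    p = Param([1.0, 2.0])
    opt = PlainSGD([p], step_size=0.5)
    p.grad = np.array([2.0, 2.0])       # pretend backward() produced this
    opt.step()                          # p.data is now [0.0, 1.0]
    opt.zero_grad()                     # p.grad is None again

Skipping parameters whose ``grad`` is ``None`` is a design choice of this
sketch; it keeps ``step()`` safe to call right after ``zero_grad()``.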

``optim.SGD``
-------------

Vanilla stochastic gradient descent::

    optimizer = optim.SGD(model.parameters(), step_size=1e-3)

Each step applies ``param.data -= step_size * param.grad`` with no momentum or
weight decay. Good for quick experiments and small models.
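
To see the update rule in isolation, the sketch below applies it to plain
NumPy arrays, minimising ``0.5 * ||w - target||^2`` with a hand-derived
gradient. The names ``target`` and ``w`` are illustrative, and no
``numpygrad`` objects are involved::

    import numpy as np

    target = np.array([3.0, -1.0])
    w = np.zeros(2)
    step_size = 0.1

    for _ in range(100):
        grad = w - target          # gradient of 0.5 * ||w - target||^2
        w -= step_size * grad      # the SGD update described above

For this particular loss, each step shrinks the distance to ``target`` by a
factor of ``1 - step_size``, so ``w`` ends up very close to ``target`` after
100 steps.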

``optim.AdamW``
---------------

Adam with decoupled weight decay::

    optimizer = optim.AdamW(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=1e-2,
    )

Parameters:

- ``lr``: learning rate (default ``1e-3``)
- ``betas``: exponential decay rates for the first and second moment
  estimates (default ``(0.9, 0.999)``)
- ``eps``: numerical stability term added to the denominator (default ``1e-8``)
- ``weight_decay``: weight decay coefficient. The decay shrinks the weights
  directly and is **decoupled** from the gradient-based update, so it is not
  the same as adding an L2 penalty to the loss (default ``1e-2``)
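
How these parameters interact can be sketched as a single update step in plain
NumPy. The function below follows the commonly used AdamW formulation (decay
the weights, update both moment estimates, apply bias correction, then take
the step). The name ``adamw_step`` is hypothetical, and ``numpygrad``'s own
implementation may order or correct these terms differently::

    import numpy as np

    def adamw_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
                   eps=1e-8, weight_decay=1e-2):
        """One AdamW update; returns the new (param, m, v)."""
        beta1, beta2 = betas
        # Decoupled weight decay: shrink the weights directly.
        param = param * (1.0 - lr * weight_decay)
        # Update the biased first and second moment estimates.
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad ** 2
        # Bias correction for zero-initialised moments (t starts at 1).
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
        return param, m, v

    w = np.array([1.0, -1.0])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    g = np.array([0.5, -0.25])
    w, m, v = adamw_step(w, g, m, v, t=1)

On the first step the bias-corrected ratio is close to the sign of the
gradient, so each weight moves by roughly ``lr`` regardless of the gradient's
scale, on top of the small shrink from ``weight_decay``.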

AdamW is the recommended default for most tasks. Use ``SGD`` when you want
full control over the update rule or are studying optimisation dynamics.