Custom Modules
==============

All trainable components in numpygrad inherit from ``nn.Module``. Subclassing
it is the standard way to define new layers, loss functions, or complete
architectures.

Minimal example
---------------

Override ``forward`` with the computation your module performs::

    import numpygrad as npg
    import numpygrad.nn as nn

    class Affine(nn.Module):
        def __init__(self, in_features: int, out_features: int) -> None:
            super().__init__()
            self.weight = nn.Parameter(npg.random.randn((in_features, out_features)))
            self.bias   = nn.Parameter(npg.zeros((out_features,)))

        def forward(self, x: npg.array) -> npg.array:
            return x @ self.weight + self.bias

    layer = Affine(4, 8)
    out = layer(npg.random.randn((2, 4)))  # shape (2, 8)

Any attribute assigned as a ``Parameter`` is automatically included in
``module.parameters()`` and therefore in the optimizer's update step.

Composing modules
-----------------

Assign child modules as attributes and they are tracked recursively::

    class TwoLayer(nn.Module):
        def __init__(self, dim: int) -> None:
            super().__init__()
            self.fc1 = Affine(dim, dim)
            self.fc2 = Affine(dim, dim)

        def forward(self, x: npg.array) -> npg.array:
            return self.fc2(npg.relu(self.fc1(x)))

    net = TwoLayer(16)
    print(len(list(net.parameters())))  # 4 — weight + bias for each layer

``parameters()`` walks the full module tree recursively, so you can nest
modules arbitrarily deep.

Using ``Sequential``
--------------------

For a simple chain of modules, ``nn.Sequential`` avoids boilerplate::

    model = nn.Sequential(
        nn.Linear(4, 32),
        nn.ReLU(),
        nn.Linear(32, 2),
    )

    out = model(x)   # applies each module in order

Buffers
-------

If you need a non-trainable array stored on the module (e.g. a running mean),
assign it as a plain ``Array`` — it will not appear in ``parameters()`` but
is still accessible as an attribute::

    class BatchNorm1d(nn.Module):
        def __init__(self, num_features: int) -> None:
            super().__init__()
            self.scale  = nn.Parameter(npg.ones((num_features,)))
            self.shift  = nn.Parameter(npg.zeros((num_features,)))
            self.running_mean = npg.zeros((num_features,))  # not a Parameter

        def forward(self, x: npg.array) -> npg.array:
            mean = x.mean(axis=0)
            self.running_mean = 0.9 * self.running_mean + 0.1 * mean
            x_norm = (x - mean) / (x.var(axis=0) ** 0.5 + 1e-5)
            return self.scale * x_norm + self.shift

Inspecting parameters
---------------------

``state_dict()`` returns a flat ``dict`` mapping parameter names to their
underlying NumPy arrays — useful for checkpointing::

    sd = model.state_dict()
    # {'fc1.weight': array(...), 'fc1.bias': array(...), ...}

    import numpy as np
    np.savez("checkpoint.npz", **sd)