Neural Network (``nn``)
=======================

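The ``numpygrad.nn`` module provides the module system, layers, activation
modules, loss functions, and parameter initialisers. Import it as::

    import numpygrad.nn as nn

The examples on this page also use ``npg``, which is assumed to be the
top-level package imported as ``import numpygrad as npg``.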

Module system
-------------

``nn.Module``
~~~~~~~~~~~~~

Base class for all layers and models. Subclass it and override ``forward``::

    class MyLayer(nn.Module):
        def __init__(self) -> None:
            super().__init__()
            self.weight = nn.Parameter(npg.random.randn((4, 4)))

        def forward(self, x):
            return x @ self.weight

Key methods:

- ``module(x)`` — calls ``forward(x)`` (via ``__call__``)
- ``module.parameters()`` — iterator over all ``Parameter`` objects in the
  module and its children, recursively
- ``module.state_dict()`` — ``dict`` mapping parameter names to NumPy arrays
- ``module.train()`` / ``module.eval()`` — switch training mode on/off
  (affects ``Dropout``)

``nn.Parameter``
~~~~~~~~~~~~~~~~

A subclass of ``Array`` that always has ``requires_grad=True``. Assigning a
``Parameter`` as a module attribute automatically registers it with
``parameters()``::

    self.bias = nn.Parameter(npg.zeros((8,)))

``nn.Sequential``
~~~~~~~~~~~~~~~~~

Chains modules in order::

    model = nn.Sequential(
        nn.Linear(4, 16),
        nn.ReLU(),
        nn.Linear(16, 2),
    )
    out = model(x)

Layers
------

``nn.Linear(num_inputs, num_outputs, bias=True)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fully connected layer: ``y = x @ W + b``.

- ``weight``: ``Parameter`` of shape ``(num_inputs, num_outputs)``
- ``bias``: ``Parameter`` of shape ``(num_outputs,)``, or absent when ``bias=False``

::

    layer = nn.Linear(8, 4)
    out = layer(npg.random.randn((16, 8)))   # (16, 4)
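
The shape arithmetic behind ``y = x @ W + b`` can be checked with plain NumPy.
This is only an illustration of the formula and of the
``(num_inputs, num_outputs)`` weight layout described above, not numpygrad's
implementation; the variable names are illustrative::

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((16, 8))   # batch of 16 inputs with 8 features
    W = rng.standard_normal((8, 4))    # (num_inputs, num_outputs)
    b = np.zeros(4)                    # (num_outputs,), broadcast over the batch

    y = x @ W + b
    print(y.shape)                     # (16, 4)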

``nn.MLP(input_dim, hidden_sizes, output_dim, activation="relu")``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Multi-layer perceptron: stacked ``Linear`` layers separated by the chosen
activation. ``activation`` can be ``"relu"``, ``"tanh"``, or ``"sigmoid"``::

    model = nn.MLP(
        input_dim=784,
        hidden_sizes=[256, 128],
        output_dim=10,
        activation="relu",
    )

``nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, bias=True)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2D convolutional layer. ``kernel_size``, ``stride``, and ``padding`` each
accept an int or a ``(H, W)`` tuple::

    conv = nn.Conv2d(3, 32, kernel_size=3, padding=1)
    out = conv(npg.random.randn((8, 3, 28, 28)))   # (8, 32, 28, 28)
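
The output shape in the example follows from the usual convolution size
arithmetic. The helper below is a hypothetical sketch that assumes the
conventional formula without dilation; it is not part of numpygrad, but it
agrees with the ``28 -> 28`` result shown above::

    def conv2d_output_size(size, kernel_size, stride=1, padding=0):
        """Spatial output length along one axis (no dilation assumed)."""
        return (size + 2 * padding - kernel_size) // stride + 1

    h_out = conv2d_output_size(28, kernel_size=3, stride=1, padding=1)
    w_out = conv2d_output_size(28, kernel_size=3, stride=1, padding=1)
    print(h_out, w_out)   # 28 28

With ``padding=0`` the same call gives 26, which is why ``padding=1`` is the
usual choice for a size-preserving 3x3 kernel.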

``nn.Embedding(num_embeddings, embedding_dim)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Lookup table mapping integer indices to dense vectors. Equivalent to an
indexed row lookup with full gradient support::

    embed = nn.Embedding(vocab_size, 64)
    x = npg.array([3, 1, 4, 1, 5])   # integer indices
    out = embed(x)                    # (5, 64)
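
To make "indexed row lookup with full gradient support" concrete, here is a
NumPy-only sketch of the forward lookup and the matching backward
scatter-add. It is an assumption about how such a layer typically works, not
numpygrad's actual code; ``table`` and ``grad_table`` are illustrative names::

    import numpy as np

    table = np.arange(12, dtype=float).reshape(6, 2)   # 6 embeddings of size 2
    idx = np.array([3, 1, 4, 1, 5])

    out = table[idx]                    # forward: row lookup, shape (5, 2)

    # Backward: scatter-add the upstream gradient into the rows that were read.
    grad_out = np.ones_like(out)
    grad_table = np.zeros_like(table)
    np.add.at(grad_table, idx, grad_out)   # index 1 appears twice, so its row sums to 2

``np.add.at`` is used instead of ``grad_table[idx] += grad_out`` because the
latter does not accumulate over repeated indices.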

``nn.LayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Layer normalisation over the last ``len(normalized_shape)`` dimensions.
When ``elementwise_affine=True`` (the default), the layer also holds learnable
``weight`` (gamma) and ``bias`` (beta) parameters that scale and shift the
normalised output::

    ln = nn.LayerNorm(512)
    out = ln(npg.random.randn((4, 16, 512)))   # (4, 16, 512), normalised over dim=-1
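
The computation for a single normalised dimension can be sketched in NumPy as
follows. The use of the biased variance is an assumption (it is the common
convention), and ``layer_norm`` is a hypothetical helper, not a numpygrad
function::

    import numpy as np

    def layer_norm(x, gamma, beta, eps=1e-5):
        # Statistics are taken over the last axis only (normalized_shape = (D,)).
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)      # biased variance (assumed)
        return (x - mean) / np.sqrt(var + eps) * gamma + beta

    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 16, 512))
    out = layer_norm(x, gamma=np.ones(512), beta=np.zeros(512))

Each of the ``4 * 16`` vectors in ``out`` then has mean close to 0 and
variance close to 1.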

``nn.Dropout(p=0.5)``
~~~~~~~~~~~~~~~~~~~~~

Randomly zeros elements with probability ``p`` during training and rescales
the remaining values by ``1/(1-p)`` (inverted dropout). Dropout is a no-op
during evaluation (after calling ``model.eval()``)::

    drop = nn.Dropout(p=0.1)
    out = drop(x)   # during training: each value is zeroed with probability 0.1
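
The inverted-dropout rule described above can be sketched in NumPy. The
function name and the explicit ``training`` flag are illustrative assumptions
and do not mirror numpygrad's internals::

    import numpy as np

    def inverted_dropout(x, p, training, rng):
        if not training or p == 0.0:
            return x                          # evaluation mode: identity
        keep = rng.random(x.shape) >= p       # each element kept with probability 1 - p
        return x * keep / (1.0 - p)           # rescale so the expected value is unchanged

    rng = np.random.default_rng(0)
    x = np.ones((1000, 100))
    train_out = inverted_dropout(x, p=0.1, training=True, rng=rng)
    eval_out = inverted_dropout(x, p=0.1, training=False, rng=rng)

Because of the ``1/(1-p)`` rescaling, ``train_out.mean()`` stays close to
``x.mean()``, so no correction is needed at evaluation time.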

``nn.MultiHeadAttention(d_model, num_heads, bias=True)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Multi-head scaled dot-product attention. ``d_model`` must be divisible by
``num_heads``::

    attn = nn.MultiHeadAttention(d_model=64, num_heads=8)
    out = attn(q, k, v)                 # q, k, v: (batch, seq, d_model)
    out = attn(q, k, v, attn_mask=mask) # optional additive mask
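
The core of multi-head scaled dot-product attention can be sketched in NumPy.
This sketch deliberately omits the learned input and output projections (and
their biases) that a full layer would have, and it assumes the additive mask
broadcasts over batch and heads; all function names here are hypothetical::

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def split_heads(x, num_heads):
        b, t, d = x.shape
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        return x.reshape(b, t, num_heads, d // num_heads).transpose(0, 2, 1, 3)

    def multi_head_attention(q, k, v, num_heads, attn_mask=None):
        qh, kh, vh = (split_heads(a, num_heads) for a in (q, k, v))
        d_head = qh.shape[-1]
        scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(d_head)   # (b, h, t_q, t_k)
        if attn_mask is not None:
            scores = scores + attn_mask            # additive mask: -inf blocks a position
        weights = softmax(scores, axis=-1)
        out = weights @ vh                         # (b, h, t_q, d_head)
        b, h, t, dh = out.shape
        return out.transpose(0, 2, 1, 3).reshape(b, t, h * dh), weights

    rng = np.random.default_rng(0)
    q = k = v = rng.standard_normal((2, 5, 64))
    causal = np.triu(np.full((5, 5), -np.inf), k=1)   # block attention to later positions
    out, weights = multi_head_attention(q, k, v, num_heads=8, attn_mask=causal)

The reshape in ``split_heads`` is the reason ``d_model`` must be divisible by
``num_heads``: each head receives ``d_model // num_heads`` features.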

Activation modules
------------------

These wrap the functional activations as ``Module`` subclasses so that they
can be placed inside ``Sequential``::

    nn.ReLU()
    nn.GELU()
    nn.Sigmoid()
    nn.Tanh()
    nn.SoftPlus()
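
For reference, the element-wise functions these modules are named after can
be written in NumPy as below (``Tanh`` corresponds to ``np.tanh``). This is a
sketch of the standard definitions; in particular, whether numpygrad's
``GELU`` uses the exact form or a tanh approximation is not stated here, so
the exact form is an assumption::

    import math
    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def gelu(x):
        # Exact form x * Phi(x); some libraries use a tanh approximation instead.
        erf = np.vectorize(math.erf)
        return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softplus(x):
        return np.logaddexp(0.0, x)   # log(1 + exp(x)), computed stably

    x = np.array([-2.0, 0.0, 2.0])
    print(relu(x))                    # [0. 0. 2.]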

Loss functions
--------------

``nn.cross_entropy_loss(logits, targets, reduction="mean")``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cross-entropy loss for classification. Supports both 2D ``(N, C)`` and
higher-dimensional ``(*, C)`` logits; all leading dimensions are flattened
automatically into a single batch dimension::

    # 2D: standard (batch, classes)
    logits = model(x_batch)              # (32, 10)
    loss = nn.cross_entropy_loss(logits, y_batch)
    loss.backward()

    # 3D: sequence models (batch, seq_len, vocab_size)
    logits = lm(tokens)                  # (B, T, V)
    loss = nn.cross_entropy_loss(logits, targets)   # targets: (B, T)

- ``logits``: shape ``(*, C)`` — raw (un-normalised) scores
- ``targets``: shape ``(*,)`` — integer class indices in ``[0, C)``
- ``reduction``: ``"mean"`` (default) or ``"sum"``
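
The flattening and the loss itself can be sketched in NumPy. This is a
minimal reference computation under the shapes listed above, not numpygrad's
implementation; ``cross_entropy`` is a hypothetical helper name::

    import numpy as np

    def cross_entropy(logits, targets, reduction="mean"):
        c = logits.shape[-1]
        flat_logits = logits.reshape(-1, c)       # (*, C) -> (N, C)
        flat_targets = targets.reshape(-1)        # (*,)   -> (N,)
        # log-softmax with the row maximum subtracted for numerical stability
        shifted = flat_logits - flat_logits.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        nll = -log_probs[np.arange(flat_targets.size), flat_targets]
        return nll.mean() if reduction == "mean" else nll.sum()

    rng = np.random.default_rng(0)
    logits = rng.standard_normal((2, 3, 5))       # (B, T, V)
    targets = rng.integers(0, 5, size=(2, 3))     # (B, T)
    loss_mean = cross_entropy(logits, targets)
    loss_sum = cross_entropy(logits, targets, reduction="sum")

With ``B * T = 6`` flattened positions, ``loss_sum`` equals ``6 * loss_mean``.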

``nn.mse(predictions, targets, reduction="mean", weight=None)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mean squared error loss for regression.

- ``predictions``, ``targets``: same shape
- ``weight``: optional per-sample weight array (same leading dimension as
  ``predictions``)
- ``reduction``: ``"mean"`` (default) or ``"sum"``

::

    pred = model(x_batch)           # (32, 1)
    loss = nn.mse(pred, y_batch)
    loss.backward()
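
A NumPy sketch of a weighted squared-error loss is shown below. How
numpygrad normalises the weighted mean is not specified on this page; the
sketch assumes the weighted squared errors are divided by the element count,
and ``mse`` here is a stand-in function, not the library's::

    import numpy as np

    def mse(predictions, targets, reduction="mean", weight=None):
        sq_err = (predictions - targets) ** 2
        if weight is not None:
            # Assumption: weight has shape (N,) and scales each sample's squared error.
            w = weight.reshape((-1,) + (1,) * (sq_err.ndim - 1))
            sq_err = sq_err * w
        return sq_err.mean() if reduction == "mean" else sq_err.sum()

    pred = np.array([[1.0], [2.0], [4.0]])
    target = np.array([[1.0], [0.0], [2.0]])
    plain = mse(pred, target)                                     # (0 + 4 + 4) / 3
    masked = mse(pred, target, weight=np.array([1.0, 0.0, 1.0]))  # (0 + 0 + 4) / 3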

Parameter initialisation (``nn.init``)
---------------------------------------

The ``nn.init`` submodule provides in-place parameter initialisation helpers,
following the same convention as ``torch.nn.init``. Every function modifies
``tensor.data`` in-place and returns the tensor::

    import numpygrad as npg
    import numpygrad.nn as nn

    w = nn.Parameter(npg.zeros((128, 64)))
    nn.init.kaiming_uniform_(w, nonlinearity="relu")

Basic
~~~~~

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Function
     - Description
   * - ``nn.init.uniform_(tensor, low=-1, high=1)``
     - Uniform distribution over ``[low, high]``
   * - ``nn.init.normal_(tensor, mean=0, std=1)``
     - Normal distribution
   * - ``nn.init.zeros_(tensor)``
     - Fill with zeros
   * - ``nn.init.ones_(tensor)``
     - Fill with ones

Kaiming (He)
~~~~~~~~~~~~

Variance-preserving initialisation for networks with ReLU-family activations:

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Function
     - Description
   * - ``nn.init.kaiming_uniform_(tensor, mode="fan_in", nonlinearity="relu")``
     - Kaiming uniform: :math:`\mathcal{U}(-\text{bound}, \text{bound})` where :math:`\text{bound} = \sqrt{3} \cdot \text{gain} / \sqrt{\text{fan}}`
   * - ``nn.init.kaiming_normal_(tensor, mode="fan_in", nonlinearity="relu")``
     - Kaiming normal: :math:`\mathcal{N}(0, \text{gain}^2/\text{fan})`
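
The two Kaiming formulas target the same variance: a uniform distribution on
``(-bound, bound)`` has variance ``bound**2 / 3 = gain**2 / fan``. The NumPy
sketch below checks this numerically for a ``(128, 64)`` tensor. The ReLU
gain of ``sqrt(2)`` is the commonly used value and is assumed here, and
``kaiming_uniform`` is a hypothetical stand-alone helper::

    import math
    import numpy as np

    def kaiming_uniform(shape, fan, gain, rng):
        bound = math.sqrt(3.0) * gain / math.sqrt(fan)
        return rng.uniform(-bound, bound, size=shape), bound

    rng = np.random.default_rng(0)
    gain = math.sqrt(2.0)                 # assumed gain for ReLU
    w, bound = kaiming_uniform((128, 64), fan=128, gain=gain, rng=rng)
    target_std = gain / math.sqrt(128)    # 0.125
    print(round(bound, 4))                # 0.2165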

Xavier (Glorot)
~~~~~~~~~~~~~~~

Variance-preserving initialisation for networks with symmetric activations
(tanh, sigmoid):

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Function
     - Description
   * - ``nn.init.xavier_uniform_(tensor, gain=1.0)``
     - Xavier uniform: :math:`\mathcal{U}(-b, b)` where :math:`b = \text{gain} \cdot \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}`
   * - ``nn.init.xavier_normal_(tensor, gain=1.0)``
     - Xavier normal: :math:`\mathcal{N}(0, \text{gain}^2 \cdot 2 / (\text{fan\_in} + \text{fan\_out}))`
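
As a worked instance of the Xavier formulas, the sketch below evaluates the
uniform bound and the normal standard deviation for ``fan_in=128`` and
``fan_out=64``. The helper names are hypothetical and only restate the
formulas in the table::

    import math

    def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
        return gain * math.sqrt(6.0 / (fan_in + fan_out))

    def xavier_normal_std(fan_in, fan_out, gain=1.0):
        return gain * math.sqrt(2.0 / (fan_in + fan_out))

    b = xavier_uniform_bound(128, 64)
    std = xavier_normal_std(128, 64)
    print(round(b, 4), round(std, 4))   # 0.1768 0.1021

The two agree in variance, since a uniform distribution on ``(-b, b)`` has
standard deviation ``b / sqrt(3)``.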

For the Kaiming functions, ``mode`` selects whether ``fan_in`` or ``fan_out``
is used as ``fan``, and ``nonlinearity`` sets the gain; recognised values are
``"relu"``, ``"gelu"``, ``"tanh"``, ``"sigmoid"``, ``"leaky_relu"``,
``"linear"``, and ``"identity"``. The fans are computed from the tensor shape:
a 2D tensor's shape is read as ``(fan_in, fan_out)``, and a 4D (Conv) tensor
gives ``fan_in = C_in * KH * KW`` and ``fan_out = C_out * KH * KW``.
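
The fan rules above can be written as a small helper. This is a sketch under
one extra assumption that the page does not state: that 4D convolution
weights are laid out as ``(C_out, C_in, KH, KW)``. ``compute_fans`` is a
hypothetical name, not a documented numpygrad function::

    def compute_fans(shape):
        if len(shape) == 2:
            fan_in, fan_out = shape                 # 2D: shape read as (fan_in, fan_out)
        elif len(shape) == 4:
            c_out, c_in, kh, kw = shape             # assumed layout (C_out, C_in, KH, KW)
            fan_in, fan_out = c_in * kh * kw, c_out * kh * kw
        else:
            raise ValueError(f"unsupported shape: {shape}")
        return fan_in, fan_out

    print(compute_fans((128, 64)))        # (128, 64)
    print(compute_fans((32, 3, 3, 3)))    # (27, 288)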