Neural Network (`nn`)

The numpygrad.nn module provides layers, activation modules, and loss functions. Import it as:

import numpygrad as npg      # top-level package, used as npg in the examples below
import numpygrad.nn as nn

Module system

`nn.Module`

Base class for all layers and models. Subclass it and override forward:

class MyLayer(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.weight = nn.Parameter(npg.random.randn((4, 4)))

    def forward(self, x):
        return x @ self.weight

Key methods:

module(x) — calls forward(x) (via __call__)
module.parameters() — iterator over all Parameter objects in the module and its children, recursively
module.state_dict() — dict mapping parameter names to NumPy arrays
module.train() / module.eval() — switch training mode on/off (affects Dropout)

`nn.Parameter`

A subclass of Array that always has requires_grad=True. Assigning a Parameter as a module attribute automatically registers it with parameters():

self.bias = nn.Parameter(npg.zeros((8,)))
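
The registration behaviour described above can be pictured with a toy model. This is a minimal sketch, not numpygrad's implementation: ToyParameter and ToyModule are hypothetical names, and the sketch assumes parameters are discovered by scanning instance attributes, which the documentation does not confirm.

```python
class ToyParameter:
    """Hypothetical stand-in for nn.Parameter: a value that always requires grad."""

    def __init__(self, data):
        self.data = data
        self.requires_grad = True


class ToyModule:
    """Hypothetical stand-in for nn.Module."""

    def parameters(self):
        # Walk instance attributes in assignment order; yield parameters
        # directly and recurse into child modules.
        for value in vars(self).values():
            if isinstance(value, ToyParameter):
                yield value
            elif isinstance(value, ToyModule):
                yield from value.parameters()


class Inner(ToyModule):
    def __init__(self):
        self.weight = ToyParameter([[1.0, 2.0]])


class Outer(ToyModule):
    def __init__(self):
        self.bias = ToyParameter([0.0])
        self.inner = Inner()  # child module, visited recursively


params = list(Outer().parameters())
param_count = len(params)
```

Here Outer exposes both its own bias and the weight of its child, which is the recursive behaviour that module.parameters() is documented to have.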

`nn.Sequential`

Chains modules in order:

model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
)
out = model(x)

Layers

`nn.Linear(num_inputs, num_outputs, bias=True)`

Fully connected layer: y = x @ W + b.

weight: Parameter of shape (num_inputs, num_outputs)
bias: Parameter of shape (num_outputs,), or absent when bias=False

layer = nn.Linear(8, 4)
out = layer(npg.random.randn((16, 8)))   # (16, 4)

`nn.MLP(input_dim, hidden_sizes, output_dim, activation="relu")`

Multi-layer perceptron: stacked Linear layers separated by the chosen activation. activation can be "relu", "tanh", or "sigmoid":

model = nn.MLP(
    input_dim=784,
    hidden_sizes=[256, 128],
    output_dim=10,
    activation="relu",
)
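
To see which Linear layers such a call implies, the sizes can be paired up. The helper below is hypothetical and assumes one Linear layer per hidden size plus a final Linear layer mapping to output_dim; whether an activation follows the last layer is not stated here.

```python
def mlp_layer_dims(input_dim, hidden_sizes, output_dim):
    """Return the (in_features, out_features) pair for each Linear layer."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    # Pair each size with its successor: (784, 256), (256, 128), ...
    return list(zip(sizes[:-1], sizes[1:]))


dims = mlp_layer_dims(784, [256, 128], 10)
```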

`nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, bias=True)`

2D convolutional layer. kernel_size, stride, and padding each accept an int or a (H, W) tuple:

conv = nn.Conv2d(3, 32, kernel_size=3, padding=1)
out = conv(npg.random.randn((8, 3, 28, 28)))   # (8, 32, 28, 28)
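
The output shape in the comment follows from the standard convolution size formula. The sketch below assumes numpygrad uses that convention with no dilation; conv2d_output_shape and _pair are hypothetical helpers, not part of the library.

```python
def _pair(value):
    """Expand an int to an (H, W) tuple; pass tuples through."""
    return (value, value) if isinstance(value, int) else tuple(value)


def conv2d_output_shape(input_shape, out_channels, kernel_size, stride=1, padding=0):
    """Output shape for an (N, C, H, W) input, assuming no dilation."""
    n, _, h, w = input_shape
    kh, kw = _pair(kernel_size)
    sh, sw = _pair(stride)
    ph, pw = _pair(padding)
    # Standard formula: floor((size + 2 * padding - kernel) / stride) + 1
    h_out = (h + 2 * ph - kh) // sh + 1
    w_out = (w + 2 * pw - kw) // sw + 1
    return (n, out_channels, h_out, w_out)


same = conv2d_output_shape((8, 3, 28, 28), 32, kernel_size=3, padding=1)
strided = conv2d_output_shape((8, 3, 28, 28), 32, kernel_size=(3, 5), stride=2)
```

With kernel_size=3 and padding=1 the spatial size is preserved, which matches the (8, 32, 28, 28) shape shown above.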

`nn.Embedding(num_embeddings, embedding_dim)`

Lookup table mapping integer indices to dense vectors. Equivalent to an indexed row lookup with full gradient support:

embed = nn.Embedding(vocab_size, 64)
x = npg.array([3, 1, 4, 1, 5])   # integer indices
out = embed(x)                    # (5, 64)

`nn.LayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True)`

Layer normalisation over the last len(normalized_shape) dimensions. When elementwise_affine=True (the default), the layer has learnable weight (gamma) and bias (beta) parameters that scale and shift the normalised output:

ln = nn.LayerNorm(512)
out = ln(npg.random.randn((4, 16, 512)))   # (4, 16, 512), normalised over dim=-1
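
The normalisation step itself can be sketched for a single vector in plain Python. This assumes the biased (population) variance that layer normalisation conventionally uses and leaves out the affine step; layer_norm_1d is a hypothetical helper, not a numpygrad function.

```python
import math


def layer_norm_1d(values, eps=1e-5):
    """Normalise one vector to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    # Biased variance: divide by n, not n - 1 (an assumption, see above).
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


normed = layer_norm_1d([1.0, 2.0, 3.0, 4.0])
normed_mean = sum(normed) / len(normed)
normed_var = sum((v - normed_mean) ** 2 for v in normed) / len(normed)
```

The output variance is slightly below 1 because eps is added under the square root.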

`nn.Dropout(p=0.5)`

Randomly zeros elements with probability p during training and rescales the remaining values by 1/(1-p) (inverted dropout). Dropout is a no-op during evaluation (after calling model.eval()):

drop = nn.Dropout(p=0.1)
out = drop(x)   # during training: each value is zeroed with probability 0.1
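
Inverted dropout can be sketched in plain Python to show why the rescaling keeps the expected value unchanged. inverted_dropout is a hypothetical helper working on a flat list, not numpygrad's implementation.

```python
import random


def inverted_dropout(values, p, training, rng):
    """Zero each element with probability p and rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: identity
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)  # seeded so the sketch is reproducible
x = [1.0] * 10_000
train_out = inverted_dropout(x, p=0.1, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.1, training=False, rng=rng)
train_mean = sum(train_out) / len(train_out)  # stays close to 1.0
```

Because survivors are scaled up during training, no correction is needed at evaluation time.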

`nn.MultiHeadAttention(d_model, num_heads, bias=True)`

Multi-head scaled dot-product attention. d_model must be divisible by num_heads:

attn = nn.MultiHeadAttention(d_model=64, num_heads=8)
out = attn(q, k, v)                 # q, k, v: (batch, seq, d_model)
out = attn(q, k, v, attn_mask=mask) # optional additive mask
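
For orientation, the computation inside one head can be sketched in plain Python. This is an illustration of scaled dot-product attention with an additive mask, not numpygrad's code: it omits the learned projections and the head splitting, and it assumes blocked positions carry -inf in the mask, a common convention the documentation does not spell out.

```python
import math


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention on lists of vectors."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        if mask is not None:
            # Additive mask: -inf removes a position after the softmax.
            scores = [s + m for s, m in zip(scores, mask[i])]
        weights = softmax(scores)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
causal = [[0.0, float("-inf")], [0.0, 0.0]]  # position 0 cannot see position 1
causal_out = attention(q, k, v, mask=causal)
head_dim = 64 // 8  # per-head width for d_model=64, num_heads=8
```

The divisibility requirement on d_model exists so that each head receives an equal slice of width d_model // num_heads.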

Activation modules

These wrap the functional activations as Module subclasses, useful inside Sequential:

nn.ReLU()
nn.GELU()
nn.Sigmoid()
nn.Tanh()
nn.SoftPlus()

Loss functions

`nn.cross_entropy_loss(logits, targets, reduction="mean")`

Cross-entropy loss for classification. Supports both 2D (N, C) and higher-dimensional (*, C) logits — the extra dimensions are flattened automatically:

# 2D: standard (batch, classes)
logits = model(x_batch)              # (32, 10)
loss = nn.cross_entropy_loss(logits, y_batch)
loss.backward()

# 3D: sequence models (batch, seq_len, vocab_size)
logits = lm(tokens)                  # (B, T, V)
loss = nn.cross_entropy_loss(logits, targets)   # targets: (B, T)

logits: shape (*, C) — raw (un-normalised) scores
targets: shape (*,) — integer class indices in [0, C)
reduction: "mean" (default) or "sum"
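
The flattening and the loss itself can be sketched without the library. The version below uses a log-sum-exp formulation for numerical stability, which is a standard choice and not necessarily what numpygrad does internally; cross_entropy is a hypothetical helper operating on nested lists.

```python
import math


def cross_entropy(logits_rows, targets, reduction="mean"):
    """Cross-entropy over rows of raw scores using a stable log-sum-exp."""
    losses = []
    for row, target in zip(logits_rows, targets):
        peak = max(row)  # subtract the max so exp() cannot overflow
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        losses.append(log_sum_exp - row[target])
    total = sum(losses)
    return total / len(losses) if reduction == "mean" else total


# (B, T, V) logits and (B, T) targets, flattened to (B*T, V) and (B*T,).
logits_3d = [[[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
             [[0.0, 1000.0, 0.0], [0.0, 0.0, 0.0]]]
targets_2d = [[0, 0], [1, 2]]
flat_logits = [row for seq in logits_3d for row in seq]
flat_targets = [t for seq in targets_2d for t in seq]

mean_loss = cross_entropy(flat_logits, flat_targets)
sum_loss = cross_entropy(flat_logits, flat_targets, reduction="sum")
```

Uniform logits over C classes give a loss of log(C), which is a handy sanity check for any implementation.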

`nn.mse(predictions, targets, reduction="mean", weight=None)`

Mean squared error loss for regression.

predictions, targets: same shape
weight: optional per-sample weight array (same leading dimension)
reduction: "mean" (default) or "sum"

pred = model(x_batch)           # (32, 1)
loss = nn.mse(pred, y_batch)
loss.backward()

Parameter initialisation (`nn.init`)

The nn.init submodule provides in-place parameter initialisation helpers, following the same convention as torch.nn.init. Every function modifies tensor.data in-place and returns the tensor:

import numpygrad as npg
import numpygrad.nn as nn

w = nn.Parameter(npg.zeros((128, 64)))
nn.init.kaiming_uniform_(w, nonlinearity="relu")

Basic

Function	Description
`nn.init.uniform_(tensor, low=-1, high=1)`	Uniform distribution over `[low, high]`
`nn.init.normal_(tensor, mean=0, std=1)`	Normal distribution
`nn.init.zeros_(tensor)`	Fill with zeros
`nn.init.ones_(tensor)`	Fill with ones

Kaiming (He)

Variance-preserving initialisation for networks with ReLU-family activations:

Function	Description
`nn.init.kaiming_uniform_(tensor, mode="fan_in", nonlinearity="relu")`	Kaiming uniform: \(\mathcal{U}(-\text{bound}, \text{bound})\) where \(\text{bound} = \sqrt{3} \cdot \text{gain} / \sqrt{\text{fan}}\)
`nn.init.kaiming_normal_(tensor, mode="fan_in", nonlinearity="relu")`	Kaiming normal: \(\mathcal{N}(0, \text{gain}^2/\text{fan})\)

Xavier (Glorot)

Variance-preserving initialisation for networks with symmetric activations (tanh, sigmoid):

Function	Description
`nn.init.xavier_uniform_(tensor, gain=1.0)`	Xavier uniform: \(\mathcal{U}(-b, b)\) where \(b = \text{gain} \cdot \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}\)
`nn.init.xavier_normal_(tensor, gain=1.0)`	Xavier normal: \(\mathcal{N}(0, \text{gain}^2 \cdot 2 / (\text{fan\_in} + \text{fan\_out}))\)

For the Kaiming functions, mode selects whether fan_in or fan_out is used as the fan, and nonlinearity sets the gain; recognised values are "relu", "gelu", "tanh", "sigmoid", "leaky_relu", "linear", and "identity". For all four functions, the fans are computed from the tensor shape: (fan_in, fan_out) for 2D tensors, and (C_in * KH * KW, C_out * KH * KW) for 4D (Conv) tensors.
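
The bounds in the tables can be checked by hand with a few lines of plain Python. The helpers below are hypothetical; they assume a 4D weight layout of (C_out, C_in, KH, KW) and the conventional ReLU gain of sqrt(2), neither of which is confirmed by this page.

```python
import math


def compute_fans(shape):
    """Return (fan_in, fan_out) using the conventions stated above."""
    if len(shape) == 2:
        return shape[0], shape[1]  # 2D: (fan_in, fan_out)
    if len(shape) == 4:
        # Assumed layout: (C_out, C_in, KH, KW).
        c_out, c_in, kh, kw = shape
        return c_in * kh * kw, c_out * kh * kw
    raise ValueError("only 2D and 4D shapes are handled in this sketch")


def kaiming_uniform_bound(shape, mode="fan_in", gain=math.sqrt(2.0)):
    """bound = sqrt(3) * gain / sqrt(fan); the default gain assumes ReLU."""
    fan_in, fan_out = compute_fans(shape)
    fan = fan_in if mode == "fan_in" else fan_out
    return math.sqrt(3.0) * gain / math.sqrt(fan)


def xavier_uniform_bound(shape, gain=1.0):
    """b = gain * sqrt(6 / (fan_in + fan_out))."""
    fan_in, fan_out = compute_fans(shape)
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


k_bound = kaiming_uniform_bound((128, 64))   # about 0.2165
x_bound = xavier_uniform_bound((128, 64))    # about 0.1768
conv_fans = compute_fans((32, 3, 3, 3))      # 3x3 kernel, 3 -> 32 channels
```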
