Neural Network (nn)

The numpygrad.nn module provides layers, activation modules, and loss functions. Import it as:

import numpygrad as npg    # array functions used in the examples (npg.zeros, npg.random.randn, ...)
import numpygrad.nn as nn

Module system

nn.Module

Base class for all layers and models. Subclass it and override forward:

class MyLayer(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.weight = nn.Parameter(npg.random.randn((4, 4)))

    def forward(self, x):
        return x @ self.weight

Key methods:

  • module(x) — calls forward(x) (via __call__)

  • module.parameters() — iterator over all Parameter objects in the module and its children, recursively

  • module.state_dict() returns a dict mapping parameter names to NumPy arrays

  • module.train() / module.eval() — switch training mode on/off (affects Dropout)

nn.Parameter

A subclass of Array that always has requires_grad=True. Assigning a Parameter as a module attribute automatically registers it with parameters():

self.bias = nn.Parameter(npg.zeros((8,)))
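
How attribute assignment can lead to registration is easiest to see in a toy version. The sketch below is not numpygrad's implementation: ToyParameter and ToyModule are hypothetical stand-ins, written in plain Python, that mimic the behaviour described above (registration on assignment, recursive parameters()).

```python
class ToyParameter:
    """Hypothetical stand-in for nn.Parameter: a value that always requires a gradient."""

    def __init__(self, data):
        self.data = data
        self.requires_grad = True


class ToyModule:
    """Hypothetical stand-in for nn.Module (illustration only)."""

    def __init__(self):
        # Create the registries without triggering our own __setattr__.
        object.__setattr__(self, "_params", {})
        object.__setattr__(self, "_children", {})

    def __setattr__(self, name, value):
        # Record parameters and child modules at the moment they are assigned.
        if isinstance(value, ToyParameter):
            self._params[name] = value
        elif isinstance(value, ToyModule):
            self._children[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        # Yield this module's own parameters, then recurse into its children.
        yield from self._params.values()
        for child in self._children.values():
            yield from child.parameters()


class Inner(ToyModule):
    def __init__(self):
        super().__init__()
        self.weight = ToyParameter([1.0, 2.0])


class Outer(ToyModule):
    def __init__(self):
        super().__init__()
        self.inner = Inner()
        self.bias = ToyParameter([0.0])


params = list(Outer().parameters())   # Outer.bias, then Inner.weight
```

Forgetting super().__init__() in a subclass would leave the registries uncreated in this sketch, which is one reason the example above calls it first.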

nn.Sequential

Chains modules in order:

model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
)
out = model(x)

Layers

nn.Linear(num_inputs, num_outputs, bias=True)

Fully connected layer: y = x @ W + b.

  • weight: Parameter of shape (num_inputs, num_outputs)

  • bias: Parameter of shape (num_outputs,), or absent when bias=False

layer = nn.Linear(8, 4)
out = layer(npg.random.randn((16, 8)))   # (16, 4)
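
The shape bookkeeping behind y = x @ W + b can be checked with plain NumPy. This is only an illustration of the formula and the documented weight layout, not the layer's actual code; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((16, 8))   # batch of 16 samples with 8 features each
W = rng.standard_normal((8, 4))    # laid out as (num_inputs, num_outputs)
b = np.zeros(4)                    # one bias per output, broadcast over the batch

y = x @ W + b                      # result has shape (16, 4)
```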

nn.MLP(input_dim, hidden_sizes, output_dim, activation="relu")

Multi-layer perceptron: stacked Linear layers separated by the chosen activation. activation can be "relu", "tanh", or "sigmoid":

model = nn.MLP(
    input_dim=784,
    hidden_sizes=[256, 128],
    output_dim=10,
    activation="relu",
)
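
A rough way to picture the resulting stack: assuming the layer sizes are formed by chaining input_dim, each entry of hidden_sizes, and output_dim (the usual MLP construction, which this page does not spell out), the Linear layers would have the following (in, out) sizes. mlp_layer_shapes is a hypothetical helper, not part of the library.

```python
def mlp_layer_shapes(input_dim, hidden_sizes, output_dim):
    """Return the assumed (in, out) size of each Linear layer in the stack."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    # Pair each size with its successor: (s0, s1), (s1, s2), ...
    return list(zip(sizes[:-1], sizes[1:]))


shapes = mlp_layer_shapes(784, [256, 128], 10)
```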

nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, bias=True)

2D convolutional layer. kernel_size, stride, and padding each accept an int or a (H, W) tuple:

conv = nn.Conv2d(3, 32, kernel_size=3, padding=1)
out = conv(npg.random.randn((8, 3, 28, 28)))   # (8, 32, 28, 28)
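
The output shape in the comment above follows from the standard convolution arithmetic, out = floor((in + 2 * padding - kernel) / stride) + 1 per spatial axis. The helper below is a hypothetical sketch of that formula, assuming no dilation and (N, C, H, W) input; it is not a numpygrad function.

```python
def _pair(value):
    """Expand an int to (value, value); pass a 2-tuple through unchanged."""
    return (value, value) if isinstance(value, int) else tuple(value)


def conv2d_output_shape(input_shape, out_channels, kernel_size, stride=1, padding=0):
    """Output shape of a 2D convolution without dilation, for (N, C, H, W) input."""
    n, _, h, w = input_shape
    (kh, kw), (sh, sw), (ph, pw) = _pair(kernel_size), _pair(stride), _pair(padding)
    h_out = (h + 2 * ph - kh) // sh + 1
    w_out = (w + 2 * pw - kw) // sw + 1
    return (n, out_channels, h_out, w_out)


# kernel_size=3 with padding=1 preserves the spatial size at stride 1.
same = conv2d_output_shape((8, 3, 28, 28), 32, kernel_size=3, padding=1)
# Adding stride=2 halves each spatial axis.
halved = conv2d_output_shape((8, 3, 28, 28), 32, kernel_size=3, stride=2, padding=1)
```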

nn.Embedding(num_embeddings, embedding_dim)

Lookup table mapping integer indices to dense vectors. Equivalent to an indexed row lookup with full gradient support:

embed = nn.Embedding(vocab_size, 64)
x = npg.array([3, 1, 4, 1, 5])   # integer indices
out = embed(x)                    # (5, 64)
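
To make "indexed row lookup with gradient support" concrete, here is a NumPy sketch of the forward lookup and the matching backward pass. It illustrates the general technique (gradient rows are scattered back and accumulated for repeated indices), not numpygrad's internals; the table contents are made up.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
table = np.arange(vocab_size * embedding_dim, dtype=float).reshape(vocab_size, embedding_dim)
idx = np.array([3, 1, 4, 1, 5])          # index 1 appears twice

# Forward: pick one row of the table per index.
out = table[idx]                         # shape (5, 4)

# Backward: each upstream gradient row is added to the row it was read from.
grad_out = np.ones_like(out)
grad_table = np.zeros_like(table)
np.add.at(grad_table, idx, grad_out)     # unbuffered add, so repeats accumulate
```

np.add.at is used instead of grad_table[idx] += grad_out because the latter would count a repeated index only once.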

nn.LayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True)

Layer normalisation over the last len(normalized_shape) dimensions. When elementwise_affine=True (the default), learnable weight (gamma) and bias (beta) parameters are added:

ln = nn.LayerNorm(512)
out = ln(npg.random.randn((4, 16, 512)))   # (4, 16, 512), normalised over dim=-1
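
The computation can be sketched in NumPy for the common case of a single normalised dimension. This assumes the standard formulation (biased variance, eps added inside the square root); whether numpygrad matches it detail for detail is not stated here, and layer_norm is a hypothetical helper.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise over the last axis, then apply the elementwise affine transform."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)          # biased variance (divides by N)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return x_hat * gamma + beta


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 512))
gamma = np.ones(512)    # identity scale
beta = np.zeros(512)    # zero shift
out = layer_norm(x, gamma, beta)   # each length-512 vector now has mean ~0, std ~1
```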

nn.Dropout(p=0.5)

Randomly zeros elements with probability p during training and rescales the remaining values by 1/(1-p) (inverted dropout). Dropout is a no-op during evaluation (after calling model.eval()):

drop = nn.Dropout(p=0.1)
out = drop(x)   # during training: zeroes each element with probability 0.1, scales the rest by 1/0.9
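
Inverted dropout as described above can be sketched in a few lines of NumPy. The dropout function below is a hypothetical illustration of the technique, not the module's code.

```python
import numpy as np


def dropout(x, p, training, rng):
    """Inverted dropout: zero with probability p, rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x                          # evaluation mode: no-op
    keep = rng.random(x.shape) >= p       # True with probability 1 - p
    return x * keep / (1.0 - p)


rng = np.random.default_rng(0)
x = np.ones((1000, 100))
train_out = dropout(x, p=0.1, training=True, rng=rng)
eval_out = dropout(x, p=0.1, training=False, rng=rng)
```

The rescaling keeps the expected value of each element unchanged, which is why no correction is needed at evaluation time.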

nn.MultiHeadAttention(d_model, num_heads, bias=True)

Multi-head scaled dot-product attention. d_model must be divisible by num_heads:

attn = nn.MultiHeadAttention(d_model=64, num_heads=8)
out = attn(q, k, v)                 # q, k, v: (batch, seq, d_model)
out = attn(q, k, v, attn_mask=mask) # optional additive mask
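
The core computation, splitting d_model across heads and applying scaled dot-product attention with an additive mask, can be sketched in NumPy. This is a simplified illustration: it omits the learned input and output projections that a full layer would normally include, and attention, split_heads, and softmax are hypothetical helpers rather than numpygrad functions.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def split_heads(x, num_heads):
    """(batch, seq, d_model) -> (batch, heads, seq, d_head)."""
    b, t, d = x.shape
    return x.reshape(b, t, num_heads, d // num_heads).transpose(0, 2, 1, 3)


def attention(q, k, v, num_heads, attn_mask=None):
    d_model = q.shape[-1]
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    qh, kh, vh = (split_heads(a, num_heads) for a in (q, k, v))
    scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    if attn_mask is not None:
        scores = scores + attn_mask           # additive: large negatives block positions
    weights = softmax(scores)                 # (batch, heads, seq_q, seq_k)
    out = weights @ vh                        # (batch, heads, seq_q, d_head)
    b, h, t, dh = out.shape
    return out.transpose(0, 2, 1, 3).reshape(b, t, h * dh), weights


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 5, 64)) for _ in range(3))
out, weights = attention(q, k, v, num_heads=8)

# Causal mask: position i may only attend to positions <= i.
causal = np.triu(np.full((5, 5), -1e9), k=1)
masked_out, masked_weights = attention(q, k, v, num_heads=8, attn_mask=causal)
```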

Activation modules

These wrap the functional activations as Module subclasses, useful inside Sequential:

nn.ReLU()
nn.GELU()
nn.Sigmoid()
nn.Tanh()
nn.SoftPlus()

Loss functions

nn.cross_entropy_loss(logits, targets, reduction="mean")

Cross-entropy loss for classification. Supports both 2D (N, C) and higher-dimensional (*, C) logits — the extra dimensions are flattened automatically:

# 2D: standard (batch, classes)
logits = model(x_batch)              # (32, 10)
loss = nn.cross_entropy_loss(logits, y_batch)
loss.backward()

# 3D: sequence models (batch, seq_len, vocab_size)
logits = lm(tokens)                  # (B, T, V)
loss = nn.cross_entropy_loss(logits, targets)   # targets: (B, T)

  • logits: shape (*, C) — raw (un-normalised) scores

  • targets: shape (*,) — integer class indices in [0, C)

  • reduction: "mean" (default) or "sum"
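
The flattening behaviour can be illustrated with a NumPy reference computation. This is a sketch of the standard technique (flatten to (N, C), take a numerically stable log-softmax, pick the target entries), not the library's implementation; cross_entropy is a hypothetical helper.

```python
import numpy as np


def cross_entropy(logits, targets, reduction="mean"):
    """Cross-entropy over the last axis; leading dimensions are flattened."""
    num_classes = logits.shape[-1]
    flat_logits = logits.reshape(-1, num_classes)     # (*, C) -> (N, C)
    flat_targets = targets.reshape(-1)                # (*,)   -> (N,)
    # Log-softmax with the row maximum subtracted for numerical stability.
    shifted = flat_logits - flat_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(flat_targets.size), flat_targets]
    return nll.mean() if reduction == "mean" else nll.sum()


# Uniform logits over 5 classes: every prediction has probability 1/5.
logits = np.zeros((2, 3, 5))                  # (B, T, V)
targets = np.zeros((2, 3), dtype=int)         # (B, T)
mean_loss = cross_entropy(logits, targets)
sum_loss = cross_entropy(logits, targets, reduction="sum")
```

With uniform logits the mean loss equals log(C), a handy sanity check for an untrained classifier.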

nn.mse(predictions, targets, reduction="mean", weight=None)

Mean squared error loss for regression.

  • predictions, targets: same shape

  • weight: optional per-sample weight array (same leading dimension)

  • reduction: "mean" (default) or "sum"

pred = model(x_batch)           # (32, 1)
loss = nn.mse(pred, y_batch)
loss.backward()
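
How weight interacts with reduction is not specified above. The NumPy sketch below shows one plausible reading, labelled as an assumption: each sample's squared error is multiplied by its weight, and "mean" then divides by the number of elements rather than by the sum of the weights. The mse function here is a hypothetical helper; check the library's behaviour before relying on this.

```python
import numpy as np


def mse(pred, target, reduction="mean", weight=None):
    """Squared error with optional per-sample weights (assumed semantics)."""
    sq = (pred - target) ** 2
    if weight is not None:
        # Reshape (N,) to (N, 1, ...) so the weights broadcast over trailing dims.
        w = np.asarray(weight, dtype=float).reshape((-1,) + (1,) * (sq.ndim - 1))
        sq = sq * w
    return sq.mean() if reduction == "mean" else sq.sum()


pred = np.array([[1.0], [3.0]])
target = np.array([[0.0], [1.0]])            # squared errors: 1 and 4

plain_mean = mse(pred, target)
plain_sum = mse(pred, target, reduction="sum")
weighted_mean = mse(pred, target, weight=[1.0, 0.0])   # second sample ignored
```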

Parameter initialisation (nn.init)

The nn.init submodule provides in-place parameter initialisation helpers, following the same convention as torch.nn.init. Every function modifies tensor.data in-place and returns the tensor:

import numpygrad as npg
import numpygrad.nn as nn

w = nn.Parameter(npg.zeros((128, 64)))
nn.init.kaiming_uniform_(w, nonlinearity="relu")

Basic

| Function | Description |
|---|---|
| nn.init.uniform_(tensor, low=-1, high=1) | Uniform distribution over [low, high] |
| nn.init.normal_(tensor, mean=0, std=1) | Normal distribution |
| nn.init.zeros_(tensor) | Fill with zeros |
| nn.init.ones_(tensor) | Fill with ones |

Kaiming (He)

Variance-preserving initialisation for networks with ReLU-family activations:

| Function | Description |
|---|---|
| nn.init.kaiming_uniform_(tensor, mode="fan_in", nonlinearity="relu") | Kaiming uniform: \(\mathcal{U}(-\text{bound}, \text{bound})\) where \(\text{bound} = \sqrt{3} \cdot \text{gain} / \sqrt{\text{fan}}\) |
| nn.init.kaiming_normal_(tensor, mode="fan_in", nonlinearity="relu") | Kaiming normal: \(\mathcal{N}(0, \text{gain}^2/\text{fan})\) |
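
As a worked instance of the two formulas, take fan = 128 and assume the conventional ReLU gain of sqrt(2) (the gain value is an assumption; this page does not list the gains). The helper names are hypothetical.

```python
import math


def kaiming_uniform_bound(fan, gain):
    """bound = sqrt(3) * gain / sqrt(fan)"""
    return math.sqrt(3.0) * gain / math.sqrt(fan)


def kaiming_normal_std(fan, gain):
    """std = gain / sqrt(fan), i.e. variance gain^2 / fan"""
    return gain / math.sqrt(fan)


relu_gain = math.sqrt(2.0)                    # assumed conventional gain for ReLU
bound = kaiming_uniform_bound(128, relu_gain)
std = kaiming_normal_std(128, relu_gain)      # sqrt(2 / 128) = 0.125
```

The factor sqrt(3) makes the two variants agree in variance: a uniform distribution on (-bound, bound) has variance bound^2 / 3, which here equals gain^2 / fan.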

Xavier (Glorot)

Variance-preserving initialisation for networks with symmetric activations (tanh, sigmoid):

| Function | Description |
|---|---|
| nn.init.xavier_uniform_(tensor, gain=1.0) | Xavier uniform: \(\mathcal{U}(-b, b)\) where \(b = \text{gain} \cdot \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}\) |
| nn.init.xavier_normal_(tensor, gain=1.0) | Xavier normal: \(\mathcal{N}(0, \text{gain}^2 \cdot 2 / (\text{fan\_in} + \text{fan\_out}))\) |
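
The Xavier formulas can be checked empirically by sampling with NumPy. This sketch draws from the two stated distributions directly for fan_in = 128 and fan_out = 64; it does not call nn.init, and the variable names are illustrative.

```python
import math

import numpy as np

fan_in, fan_out, gain = 128, 64, 1.0
bound = gain * math.sqrt(6.0 / (fan_in + fan_out))   # uniform limit b
std = gain * math.sqrt(2.0 / (fan_in + fan_out))     # normal standard deviation

rng = np.random.default_rng(0)
w_uniform = rng.uniform(-bound, bound, size=(fan_in, fan_out))
w_normal = rng.normal(0.0, std, size=(fan_in, fan_out))

# Both samples should show roughly the same spread, since bound**2 / 3 == std**2.
spread_uniform = w_uniform.std()
spread_normal = w_normal.std()
```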

For the Kaiming functions, mode selects whether fan_in or fan_out is used as fan, and nonlinearity sets the gain; recognised values are "relu", "gelu", "tanh", "sigmoid", "leaky_relu", "linear", and "identity". Both families compute the fans from the tensor shape: a 2D tensor's shape is read as (fan_in, fan_out), and a 4D (Conv) tensor gives fan_in = C_in * KH * KW and fan_out = C_out * KH * KW.
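
The fan rules can be written out as a small helper. This is a hypothetical sketch: compute_fans and select_fan are not library functions, and the 4D weight layout (C_out, C_in, KH, KW) is an assumption, since the layout is not stated here.

```python
def compute_fans(shape):
    """Return (fan_in, fan_out) for a 2D or 4D weight shape."""
    if len(shape) == 2:
        fan_in, fan_out = shape                  # read directly as (fan_in, fan_out)
    elif len(shape) == 4:
        c_out, c_in, kh, kw = shape              # assumed layout: (C_out, C_in, KH, KW)
        fan_in, fan_out = c_in * kh * kw, c_out * kh * kw
    else:
        raise ValueError(f"unsupported weight shape: {shape}")
    return fan_in, fan_out


def select_fan(shape, mode="fan_in"):
    """Pick the fan that the Kaiming formulas use, according to mode."""
    fan_in, fan_out = compute_fans(shape)
    if mode not in ("fan_in", "fan_out"):
        raise ValueError(f"unknown mode: {mode}")
    return fan_in if mode == "fan_in" else fan_out


linear_fans = compute_fans((128, 64))
conv_fans = compute_fans((32, 3, 3, 3))          # 3 input channels, 32 output channels, 3x3 kernel
```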