How Autograd Works

numpygrad uses define-by-run (also called dynamic) automatic differentiation. The computation graph is built implicitly as each operation executes; there is no separate “compilation” step.

The computation graph

Every Array that participates in a differentiable computation holds a reference to the Function that produced it (array.grad_fn) and to its input Array nodes (array.parents). Together these form a directed acyclic graph (DAG) where:

  • Nodes are Array objects.

  • Edges point from an output array back to its inputs.

  • Leaf nodes are arrays you created directly (e.g. with npg.array(...) or npg.random.randn(...)). They have no grad_fn.

A simple example:

import numpygrad as npg

a = npg.array([2.0], requires_grad=True)   # leaf
b = npg.array([3.0], requires_grad=True)   # leaf
c = a * b                                  # c.grad_fn = Mul
d = c + a                                  # d.grad_fn = Add

The graph for d looks like:

d (Add)
├── c (Mul)
│   ├── a (leaf)
│   └── b (leaf)
└── a (leaf)
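To make the define-by-run idea concrete, here is a minimal pure-Python sketch of how such a graph could be recorded while operations run. ToyArray and render are hypothetical names invented for this illustration; they mirror the grad_fn and parents attributes described above but are not numpygrad's implementation.

```python
# Toy illustration only: ToyArray is a hypothetical stand-in, not numpygrad's Array.
class ToyArray:
    def __init__(self, value, parents=(), grad_fn=None):
        self.value = value
        self.parents = parents    # input nodes (empty tuple for leaves)
        self.grad_fn = grad_fn    # name of the producing op (None for leaves)

    def __mul__(self, other):
        # Running the op computes the value AND records where it came from.
        return ToyArray(self.value * other.value, (self, other), "Mul")

    def __add__(self, other):
        return ToyArray(self.value + other.value, (self, other), "Add")


def render(node, names, indent=0):
    """Return the graph below `node` as indented lines, output first."""
    label = node.grad_fn if node.grad_fn else "leaf"
    lines = ["  " * indent + f"{names[id(node)]} ({label})"]
    for parent in node.parents:
        lines += render(parent, names, indent + 1)
    return lines


a = ToyArray(2.0)   # leaf
b = ToyArray(3.0)   # leaf
c = a * b
d = c + a

names = {id(a): "a", id(b): "b", id(c): "c", id(d): "d"}
graph_lines = render(d, names)
print("\n".join(graph_lines))
```

The printed tree has the same shape as the diagram above: a appears twice because two edges lead back to the same leaf object.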

Calling backward()

array.backward() traverses the graph in reverse topological order (each output is processed before its inputs), calling each Function’s backward method and accumulating gradients into array.grad for every leaf with requires_grad=True:

d.backward()
print(a.grad)   # ∂d/∂a = ∂/∂a[(a*b) + a] = b + 1 = 4.0
print(b.grad)   # ∂d/∂b = ∂/∂b[(a*b) + a] = a = 2.0
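The traversal can be sketched with a toy scalar autograd in pure Python. Node, local_grads and the depth-first ordering below are assumptions made for illustration and say nothing about how numpygrad organises its internals; the sketch only shows the general technique of a reverse topological sweep with the chain rule.

```python
# Toy scalar autograd; Node and its fields are hypothetical, not numpygrad's API.
class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # input nodes
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def backward(self):
        # Depth-first post-order gives a topological order (inputs first).
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0                    # seed: d(self)/d(self)
        for node in reversed(order):       # outputs before inputs
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local   # chain rule, accumulated


a = Node(2.0)
b = Node(3.0)
d = a * b + a
d.backward()
print(a.grad, b.grad)   # 4.0 2.0
```

Processing outputs first guarantees that a node's gradient is complete before it is pushed to its inputs, which is why a correctly receives both of its contributions (b from the product and 1 from the sum).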

By default, backward() starts with a scalar gradient of 1. For non-scalar outputs, pass an explicit gradient array matching the output's shape:

x = npg.array([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)   # leaf input
result = x * 2                                                # non-scalar output
result.backward(npg.ones_like(result).data)                   # seed with ones
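One way to read the explicit gradient argument is as the weights of a vector-Jacobian product: each output element's gradient is scaled by the matching seed entry before flowing back. The following pure-Python sketch works this through for an elementwise scaling by 2; scale and scale_backward are hypothetical helpers, not numpygrad functions.

```python
# Pure-Python sketch; scale and scale_backward are hypothetical helpers.
def scale(values, factor):
    """Forward pass: multiply every element of a nested list by `factor`."""
    return [[v * factor for v in row] for row in values]


def scale_backward(upstream, factor):
    """Backward pass: each input receives upstream * d(out)/d(in) = upstream * factor."""
    return [[g * factor for g in row] for row in upstream]


x = [[1.0, 2.0], [3.0, 4.0]]
result = scale(x, 2.0)

# Seeding with ones is equivalent to differentiating the sum of all outputs.
ones = [[1.0, 1.0], [1.0, 1.0]]
grad_all = scale_backward(ones, 2.0)

# Seeding with a one-hot array isolates the gradient of a single output.
pick_first = [[1.0, 0.0], [0.0, 0.0]]
grad_first = scale_backward(pick_first, 2.0)

print(grad_all)    # [[2.0, 2.0], [2.0, 2.0]]
print(grad_first)  # [[2.0, 0.0], [0.0, 0.0]]
```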

Gradient accumulation

Gradients accumulate in array.grad rather than being overwritten. This matches PyTorch’s behaviour and is required when an array appears multiple times in a graph (like a in the example above), because each use contributes one term to the total gradient. Accumulation also carries over from one backward() call to the next, so call optimizer.zero_grad() (or set array.grad = None) between training steps to reset the gradients.
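The effect of forgetting to reset can be seen in a small pure-Python sketch. Param, accumulate and backward_of_square are hypothetical names for this illustration; only the reset by assigning None follows the text above.

```python
# Hypothetical Param class for illustration; not numpygrad's Array.
class Param:
    def __init__(self, value):
        self.value = value
        self.grad = None

    def accumulate(self, g):
        # Add to any existing gradient instead of overwriting it.
        self.grad = g if self.grad is None else self.grad + g


def backward_of_square(p):
    """Backward pass for loss = p.value ** 2, so d(loss)/dp = 2 * p.value."""
    p.accumulate(2.0 * p.value)


w = Param(3.0)

backward_of_square(w)
first = w.grad            # 6.0: gradient of one pass

backward_of_square(w)
stale = w.grad            # 12.0: second pass was added to the first

w.grad = None             # reset between steps
backward_of_square(w)
fresh = w.grad            # 6.0 again

print(first, stale, fresh)   # 6.0 12.0 6.0
```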

Non-differentiable operations

Some Array methods (astype, nonzero, all, any, fill, sort, round) do not propagate gradients. They return a new array but do not attach a grad_fn.

Comparison operators (>, <, ==, etc.) also return arrays without gradient tracking, since they are not differentiable.
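A quick numerical check suggests why tracking such operations would carry no useful signal: a comparison or a rounding step is piecewise constant, so away from its jump points the slope is zero. The sketch below uses plain Python floats and a hypothetical central_difference helper; it does not call numpygrad.

```python
def central_difference(f, x, eps=1e-4):
    """Estimate the slope of f at x with a symmetric finite difference."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def greater_than_zero(x):
    """The comparison x > 0 as a 0.0/1.0 step function."""
    return float(x > 0)


# Sample points chosen away from the jumps (0 for the comparison, .5 for round).
comparison_slopes = [central_difference(greater_than_zero, x) for x in (-2.0, 0.5, 3.0)]
rounding_slopes = [central_difference(round, x) for x in (0.2, 1.3, 2.7)]

print(comparison_slopes)   # [0.0, 0.0, 0.0]
print(rounding_slopes)     # [0.0, 0.0, 0.0]
```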

Broadcasting and gradients

When an operation broadcasts one operand to match the shape of another, backward() automatically sums the upstream gradient over the broadcast axes, so the resulting gradient has the operand's original (smaller) shape. You do not need to handle this manually.
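The reduction can be worked through by hand for the common case of adding a length-3 row to every row of a 2 x 3 matrix. This pure-Python sketch uses nested lists and two hypothetical helpers, broadcast_add and add_backward; with NumPy arrays the same reduction would be upstream.sum(axis=0).

```python
def broadcast_add(matrix, row):
    """Add a length-n row to every row of an m x n matrix (nested lists)."""
    return [[x + r for x, r in zip(mrow, row)] for mrow in matrix]


def add_backward(upstream):
    """Gradients of broadcast_add w.r.t. (matrix, row) for a given upstream gradient."""
    grad_matrix = [list(g) for g in upstream]         # same shape: passes through
    grad_row = [sum(col) for col in zip(*upstream)]   # sum over the broadcast axis
    return grad_matrix, grad_row


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
bias = [10.0, 20.0, 30.0]
out = broadcast_add(x, bias)

upstream = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]   # gradient arriving from later ops
grad_x, grad_bias = add_backward(upstream)

print(grad_bias)   # [2.0, 2.0, 2.0]
```

Each bias element was used once per row, so its gradient is the sum of the two upstream entries in its column, and the result has the bias's original length of 3.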