How Autograd Works
numpygrad uses define-by-run (also called dynamic) automatic differentiation. The computation graph is built implicitly as you run operations — there is no separate “compilation” step.
The computation graph
Every Array that participates in a differentiable computation holds a
reference to the Function that produced it (array.grad_fn) and to its
input Array nodes (array.parents). Together these form a directed
acyclic graph (DAG) where:
Nodes are Array objects.
Edges point from an output array back to its inputs.
Leaf nodes are arrays you created directly (e.g. with npg.array(...) or npg.random.randn(...)). They have no grad_fn.
A simple example:
import numpygrad as npg
a = npg.array([2.0], requires_grad=True) # leaf
b = npg.array([3.0], requires_grad=True) # leaf
c = a * b # c.grad_fn = Mul
d = c + a # d.grad_fn = Add
The graph for d looks like:
d (Add)
├── c (Mul)
│   ├── a (leaf)
│   └── b (leaf)
└── a (leaf)
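To make the structure concrete, here is a minimal sketch of the bookkeeping described above. The Node class and render helper are hypothetical stand-ins written for illustration; they are not numpygrad's Array, and only the grad_fn and parents attributes mirror names from the text.

```python
class Node:
    """Hypothetical stand-in for an Array; models graph bookkeeping only."""

    def __init__(self, name, grad_fn=None, parents=()):
        self.name = name
        self.grad_fn = grad_fn          # name of the producing op, None for leaves
        self.parents = tuple(parents)   # input nodes of that op


def render(node, prefix="", is_root=True, is_last=True):
    """Return the lines of a tree drawing that follows parent edges."""
    label = f"{node.name} ({node.grad_fn or 'leaf'})"
    if is_root:
        lines = [label]
        child_prefix = ""
    else:
        lines = [prefix + ("└── " if is_last else "├── ") + label]
        child_prefix = prefix + ("    " if is_last else "│   ")
    for i, parent in enumerate(node.parents):
        last = i == len(node.parents) - 1
        lines += render(parent, child_prefix, False, last)
    return lines


a = Node("a")
b = Node("b")
c = Node("c", "Mul", (a, b))
d = Node("d", "Add", (c, a))

tree = render(d)
print("\n".join(tree))
```

Note that a is drawn twice but is a single object: both c and d hold a reference to the same node, which is why the structure is a DAG rather than a tree.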
Calling backward()
array.backward() traverses the graph in reverse topological order,
calling each Function’s backward method and accumulating gradients
into array.grad for every leaf with requires_grad=True:
d.backward()
print(a.grad) # d(d)/d(a) = d/da[(a*b) + a] = b + 1 = 4.0
print(b.grad) # d(d)/d(b) = d/db[(a*b)] = a = 2.0
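The traversal can be sketched in a few lines of plain Python. This is an illustrative scalar-only model, not numpygrad's implementation: the Value class and its backward_fn attribute are assumptions made for the example.

```python
class Value:
    """Hypothetical scalar node; not numpygrad's Array."""

    def __init__(self, data, parents=(), backward_fn=None):
        self.data = data
        self.parents = parents
        # Maps the upstream gradient to one gradient per parent.
        self.backward_fn = backward_fn
        self.grad = 0.0

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     lambda g: (g * other.data, g * self.data))

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     lambda g: (g, g))

    def backward(self):
        # Depth-first post-order yields a topological order (inputs first).
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0  # seed gradient for a scalar output
        # Walk outputs first, so a node's gradient is complete before it is used.
        for node in reversed(order):
            if node.backward_fn is None:
                continue  # leaf: nothing to propagate
            for parent, g in zip(node.parents, node.backward_fn(node.grad)):
                parent.grad += g  # accumulate, since a node may be reused


a = Value(2.0)
b = Value(3.0)
c = a * b
d = c + a
d.backward()
print(a.grad, b.grad)  # 4.0 2.0
```

The reverse topological order matters here: a receives a contribution from d directly and another through c, and both must be summed.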
By default, backward() seeds the traversal with a scalar gradient of 1. For
non-scalar outputs, pass an explicit gradient array:
out = npg.array([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
result = out * 2
result.backward(npg.ones_like(result).data)
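What the explicit gradient means can be checked with plain NumPy. The sketch below does not call numpygrad; it assumes the backward rule for multiplying by 2 is "upstream times 2" and compares that against a finite-difference estimate of the weighted sum of the outputs.

```python
import numpy as np

out = np.array([[1.0, 2.0], [3.0, 4.0]])
upstream = np.ones_like(out)     # plays the role of the array passed to backward()

# Analytic backward rule for result = out * 2.
grad_out = upstream * 2.0

# Finite-difference check: differentiate sum(result * upstream) w.r.t. each entry.
eps = 1e-6
base = np.sum(out * 2.0 * upstream)
numeric = np.zeros_like(out)
for idx in np.ndindex(out.shape):
    bumped = out.copy()
    bumped[idx] += eps
    numeric[idx] = (np.sum(bumped * 2.0 * upstream) - base) / eps

print(np.allclose(grad_out, numeric, atol=1e-4))  # True
```

In other words, passing an array of ones is equivalent to differentiating the sum of all output elements.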
Gradient accumulation
Gradients accumulate in array.grad rather than being overwritten. This
matches PyTorch’s behaviour and is required when an array appears multiple
times in a graph (like a in the example above). Call
optimizer.zero_grad() (or set array.grad = None) between training
steps to reset them.
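The effect of forgetting to reset can be seen in a small model of the accumulation rule. Param and backward_square are hypothetical helpers for this example only; they mimic the "add to grad" behaviour described above without using numpygrad.

```python
class Param:
    """Hypothetical leaf holding a value and an accumulated gradient."""

    def __init__(self, data):
        self.data = data
        self.grad = None

    def accumulate(self, g):
        # Add to the existing gradient instead of overwriting it.
        self.grad = g if self.grad is None else self.grad + g


def backward_square(p):
    # Backward pass for loss = p.data ** 2, whose derivative is 2 * p.data.
    p.accumulate(2.0 * p.data)


w = Param(3.0)

backward_square(w)
first = w.grad       # gradient after one pass

backward_square(w)
stale = w.grad       # second pass was added on top of the first

w.grad = None        # reset between steps
backward_square(w)
fresh = w.grad       # back to the single-pass gradient

print(first, stale, fresh)  # 6.0 12.0 6.0
```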
Non-differentiable operations
Some Array methods — astype, nonzero, all, any, fill,
sort, round — do not propagate gradients. They return a new array but
do not attach a grad_fn.
Comparison operators (>, <, ==, etc.) also return arrays without
gradient tracking, since they are not differentiable.
Broadcasting and gradients
When an operation broadcasts one operand to match the shape of another,
backward() automatically sums the upstream gradient over the broadcasted
axes to recover the gradient of the original (smaller) shape. You do not need
to handle this manually.
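For readers curious about what that reduction involves, here is a sketch in plain NumPy. The unbroadcast helper is a hypothetical name chosen for this example; numpygrad's internal function may be named and structured differently.

```python
import numpy as np


def unbroadcast(grad, shape):
    """Sum grad over broadcast axes so that it has the given shape again."""
    # Axes that broadcasting prepended on the left.
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Axes that had size 1 in the original and were stretched.
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad


# A (3,) bias added to a (2, 3) input is broadcast across 2 rows.
x = np.ones((2, 3))
bias = np.zeros((3,))
y = x + bias
upstream = np.ones_like(y)

grad_bias = unbroadcast(upstream, bias.shape)
print(grad_bias)        # [2. 2. 2.]

# A (2, 1) column is stretched across 3 columns.
col = np.zeros((2, 1))
grad_col = unbroadcast(np.ones((2, 3)), col.shape)
print(grad_col.shape)   # (2, 1)
```

Each bias element contributed to two output rows, so its gradient is the sum of the two upstream entries it touched.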