How Autograd Works
==================

numpygrad uses **define-by-run** (also called dynamic) automatic differentiation.
The computation graph is built implicitly as operations execute, so there is no
separate "compilation" step.

The computation graph
---------------------

Every ``Array`` produced by a differentiable operation holds a reference to
the ``Function`` that produced it (``array.grad_fn``) and to its input
``Array`` nodes (``array.parents``). Together these form a directed acyclic
graph (DAG) where:

- **Nodes** are ``Array`` objects.
- **Edges** point from an output array back to its inputs.
- **Leaf nodes** are arrays you created directly (e.g. with ``npg.array(...)``
  or ``npg.random.randn(...)``). They have no ``grad_fn``.

A simple example::

    import numpygrad as npg

    a = npg.array([2.0], requires_grad=True)   # leaf
    b = npg.array([3.0], requires_grad=True)   # leaf
    c = a * b                                  # c.grad_fn = Mul
    d = c + a                                  # d.grad_fn = Add

The graph for ``d`` looks like::

    d (Add)
    ├── c (Mul)
    │   ├── a (leaf)
    │   └── b (leaf)
    └── a (leaf)
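
To make the structure concrete, here is a minimal sketch in plain Python. It is
not numpygrad's implementation: the ``Node`` class and the ``describe`` helper
are hypothetical stand-ins that only mimic the ``grad_fn`` and ``parents``
bookkeeping described above.

```python
class Node:
    """Hypothetical stand-in for an Array: a value plus graph bookkeeping."""

    def __init__(self, name, value, grad_fn=None, parents=()):
        self.name = name
        self.value = value
        self.grad_fn = grad_fn   # name of the producing operation, None for leaves
        self.parents = parents   # input nodes, empty for leaves

    def __mul__(self, other):
        name = f"({self.name}*{other.name})"
        return Node(name, self.value * other.value, "Mul", (self, other))

    def __add__(self, other):
        name = f"({self.name}+{other.name})"
        return Node(name, self.value + other.value, "Add", (self, other))


def describe(node, depth=0):
    """Return one line per node, walking from the output back to its inputs."""
    label = node.grad_fn or "leaf"
    lines = ["    " * depth + f"{node.name} ({label})"]
    for parent in node.parents:
        lines.extend(describe(parent, depth + 1))
    return lines


a = Node("a", 2.0)   # leaf
b = Node("b", 3.0)   # leaf
c = a * b            # records "Mul" and the parents (a, b)
d = c + a            # records "Add" and the parents (c, a)

tree = describe(d)
print("\n".join(tree))
```

Note that ``a`` is printed twice but exists only once: both ``c`` and ``d``
reference the same object, which is why the structure is a DAG rather than a
tree.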

Calling ``backward()``
----------------------

``array.backward()`` traverses the graph in **reverse topological order**,
calling each ``Function``'s ``backward`` method and accumulating gradients
into ``array.grad`` for every leaf with ``requires_grad=True``::

    d.backward()
    print(a.grad)   # d(d)/d(a) = d/da[(a*b) + a] = b + 1 = 4.0
    print(b.grad)   # d(d)/d(b) = d/db[(a*b)] = a = 2.0
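
The traversal can be sketched in plain Python for scalar values. This is an
illustration of the technique under simplifying assumptions, not numpygrad's
code: the ``Node`` class and its ``local_grads`` attribute are hypothetical,
and each node stores the local derivative with respect to each parent instead
of a ``Function`` object.

```python
class Node:
    """Hypothetical scalar node that records how it was computed."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # input nodes
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def backward(self):
        # Depth-first post-order gives a topological order (inputs first).
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0  # seed: d(self)/d(self)
        # Reversed order guarantees a node's gradient is complete
        # before it is pushed on to that node's parents.
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad


a = Node(2.0)
b = Node(3.0)
c = a * b
d = c + a
d.backward()
print(a.grad, b.grad)  # 4.0 2.0
```

Unlike the behaviour described above, this sketch also stores a gradient on the
intermediate node ``c``; that keeps the code short and does not change the
leaf results.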

By default, ``backward()`` seeds the traversal with a scalar gradient of 1.
For non-scalar outputs, pass an explicit gradient array::

    out = npg.array([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
    result = out * 2                              # non-scalar, shape (2, 2)
    result.backward(npg.ones_like(result).data)   # seed with a gradient of ones
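
One way to read the explicit gradient is as a set of weights on the output
elements: a seed of ones yields the gradient of the sum of the outputs. The
following plain-Python sketch shows this for element-wise scaling. The
functions ``scale`` and ``scale_backward`` are hypothetical helpers written for
this illustration and are not part of numpygrad.

```python
def scale(x, factor):
    """Forward pass: multiply every element of a nested list by factor."""
    return [[factor * v for v in row] for row in x]


def scale_backward(seed, factor):
    """Backward pass: each input element receives factor times its seed entry."""
    return [[factor * g for g in row] for row in seed]


x = [[1.0, 2.0], [3.0, 4.0]]
result = scale(x, 2.0)

# A seed of ones gives the gradient of sum(result) with respect to x.
ones = [[1.0, 1.0], [1.0, 1.0]]
grad_ones = scale_backward(ones, 2.0)
print(grad_ones)      # [[2.0, 2.0], [2.0, 2.0]]

# A non-uniform seed weights each output element differently.
weights = [[1.0, 0.0], [0.0, 10.0]]
grad_weighted = scale_backward(weights, 2.0)
print(grad_weighted)  # [[2.0, 0.0], [0.0, 20.0]]
```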

Gradient accumulation
---------------------

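Gradients **accumulate** in ``array.grad`` rather than being overwritten. This
matches PyTorch's behaviour and is required when an array appears multiple
times in a graph: each use contributes its own gradient, and the contributions
must be summed (like ``a`` in the example above, which feeds both ``c`` and
``d``). Because accumulation also carries over between ``backward()`` calls,
call ``optimizer.zero_grad()`` (or set ``array.grad = None``) between training
steps to reset the gradients.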
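
The effect of forgetting to reset can be sketched in a few lines of plain
Python. The ``Param`` class and ``backward_square`` function are hypothetical
and only imitate the accumulate-then-reset pattern; they are not numpygrad
objects.

```python
class Param:
    """Hypothetical parameter whose gradient accumulates across calls."""

    def __init__(self, value):
        self.value = value
        self.grad = None

    def accumulate(self, g):
        # First contribution initialises the gradient, later ones are added.
        self.grad = g if self.grad is None else self.grad + g


def backward_square(p):
    # For loss = p.value ** 2 the derivative is 2 * p.value.
    p.accumulate(2.0 * p.value)


w = Param(3.0)

backward_square(w)
first = w.grad       # 6.0

backward_square(w)   # no reset: the new gradient is added to the old one
second = w.grad      # 12.0

w.grad = None        # reset, as you would between training steps
backward_square(w)
third = w.grad       # 6.0

print(first, second, third)
```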

Non-differentiable operations
-----------------------------

Some ``Array`` methods (``astype``, ``nonzero``, ``all``, ``any``, ``fill``,
``sort``, ``round``) do not propagate gradients. They return a new array with
no ``grad_fn`` attached, so the graph is cut at that point.

Comparison operators (``>``, ``<``, ``==``, etc.) also return arrays without
gradient tracking. Their results are piecewise constant, so the derivative is
zero wherever it is defined and carries no useful signal.
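
A quick numerical check illustrates why a comparison has nothing to
differentiate. The sketch below is plain Python and independent of numpygrad;
``greater_than_zero`` and ``central_difference`` are hypothetical helper names.

```python
def greater_than_zero(x):
    """A comparison expressed as a float-valued step function."""
    return float(x > 0.0)


def central_difference(f, x, eps=1e-6):
    """Numerical estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


# Away from the threshold the output never changes, so the slope is zero.
slopes = [central_difference(greater_than_zero, x) for x in (-2.0, -0.5, 0.5, 2.0)]
print(slopes)  # [0.0, 0.0, 0.0, 0.0]

# At the threshold the function jumps, so the estimate grows without bound
# as eps shrinks; no finite derivative exists there.
jump = central_difference(greater_than_zero, 0.0)
print(jump > 1e5)  # True
```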

Broadcasting and gradients
--------------------------

When an operation broadcasts one operand to match the shape of another,
``backward()`` automatically sums the upstream gradient over the broadcast
axes, so the resulting gradient has the operand's original (smaller) shape.
You do not need to handle this manually.
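
As an illustration of the summation, consider adding a length-3 bias to every
row of a 2x3 matrix. The sketch below uses nested Python lists and two
hypothetical helpers, ``add_bias`` and ``add_bias_backward``; it shows the
general rule and is not how numpygrad implements it.

```python
def add_bias(matrix, bias):
    """Forward pass: the bias is reused (broadcast) for every row."""
    return [[v + b for v, b in zip(row, bias)] for row in matrix]


def add_bias_backward(upstream):
    """Backward pass for matrix + bias.

    The matrix keeps the upstream gradient unchanged. The bias was used once
    per row, so its gradient is the sum of the upstream gradient over rows.
    """
    grad_matrix = [row[:] for row in upstream]
    grad_bias = [sum(column) for column in zip(*upstream)]
    return grad_matrix, grad_bias


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
bias = [10.0, 20.0, 30.0]
out = add_bias(x, bias)

upstream = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
grad_x, grad_bias = add_bias_backward(upstream)

print(grad_x)     # same 2x3 shape as x
print(grad_bias)  # [11.0, 22.0, 33.0], same shape as bias
```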