Operators
=========

All operators are available both as module-level functions (``npg.relu(x)``)
and as ``Array`` methods (``x.relu()``, where applicable). Unless noted
otherwise, every operator listed here is differentiable: it records itself
into the computation graph when any input has ``requires_grad=True``.

Element-wise
------------

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Function
     - Description
   * - ``npg.add(a, b)``
     - Element-wise addition (also ``a + b``)
   * - ``npg.mul(a, b)``
     - Element-wise multiplication (also ``a * b``)
   * - ``npg.pow(a, exponent)``
     - Element-wise power (also ``a ** exponent``)
   * - ``npg.exp(a)``
     - Element-wise :math:`e^x`
   * - ``npg.log(a)``
     - Natural logarithm (undefined for non-positive values)
   * - ``npg.abs(a)``
     - Absolute value
   * - ``npg.relu(a)``
     - ``max(0, x)`` element-wise
   * - ``npg.clip(a, min, max)``
     - Clamp values to ``[min, max]``
   * - ``npg.maximum(a, b)``
     - Element-wise max of two arrays
   * - ``npg.minimum(a, b)``
     - Element-wise min of two arrays

Reductions
----------

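Reduction functions default to ``axis=None`` (reduce over all axes) and
``keepdims=False``. ``npg.cumsum`` and ``npg.cumprod`` are the exceptions:
as listed below, they take no ``keepdims`` argument.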

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Function
     - Description
   * - ``npg.sum(a, axis, keepdims)``
     - Sum of elements
   * - ``npg.mean(a, axis, keepdims)``
     - Mean of elements
   * - ``npg.prod(a, axis, keepdims)``
     - Product of elements
   * - ``npg.max(a, axis, keepdims)``
     - Maximum value
   * - ``npg.min(a, axis, keepdims)``
     - Minimum value
   * - ``npg.argmax(a, axis, keepdims)``
     - Index of maximum value (no gradient)
   * - ``npg.var(a, axis, ddof, keepdims)``
     - Variance. ``ddof=0`` (population) or ``ddof=1`` (sample)
   * - ``npg.std(a, axis, ddof, keepdims)``
     - Standard deviation (``sqrt(var(...))``)
   * - ``npg.cumsum(a, axis)``
     - Cumulative sum along ``axis``
   * - ``npg.cumprod(a, axis)``
     - Cumulative product along ``axis``
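
The ``ddof`` argument only changes the divisor. As a minimal sketch in plain
Python (standard library only, not numpygrad code; the helper name
``variance`` is illustrative), assuming the usual convention of dividing the
summed squared deviations by ``n - ddof``::

    import math
    import statistics

    def variance(values, ddof=0):
        """Variance of a flat list, dividing by n - ddof."""
        n = len(values)
        mean = sum(values) / n
        squared_dev = sum((v - mean) ** 2 for v in values)
        return squared_dev / (n - ddof)

    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    pop_var = variance(data, ddof=0)      # population variance: 32 / 8
    sample_var = variance(data, ddof=1)   # sample variance: 32 / 7
    pop_std = math.sqrt(pop_var)          # std is the square root of var

    # Cross-check against the standard library.
    assert math.isclose(pop_var, statistics.pvariance(data))
    assert math.isclose(sample_var, statistics.variance(data))

With ``ddof=1`` the divisor shrinks from 8 to 7, so the sample variance comes
out slightly larger than the population variance.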

Activations
-----------

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Function
     - Description
   * - ``npg.softmax(a, axis=-1)``
     - Softmax along ``axis``
   * - ``npg.log_softmax(a, axis=-1)``
     - Log-softmax (numerically stable)
   * - ``npg.sigmoid(a)``
     - :math:`\sigma(x) = 1 / (1 + e^{-x})`
   * - ``npg.tanh(a)``
     - Hyperbolic tangent
   * - ``npg.softplus(a)``
     - :math:`\log(1 + e^x)` (smooth approximation of ReLU)
   * - ``npg.gelu(a)``
     - Gaussian Error Linear Unit (tanh approximation): :math:`0.5 x (1 + \tanh(\sqrt{2/\pi}(x + 0.044715 x^3)))`
   * - ``npg.relu(a)``
     - ``max(0, x)`` (also listed under element-wise)
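
Two of the entries above can be made concrete with scalar reference code. The
following is a minimal sketch in plain Python (standard library only). It is
not how numpygrad necessarily implements these operators, and the helper names
are illustrative. ``log_softmax`` uses the common log-sum-exp trick as one way
to obtain numerical stability, and ``gelu_tanh`` transcribes the formula from
the table::

    import math

    def log_softmax(row):
        """Stable log-softmax of a flat list via the log-sum-exp trick."""
        peak = max(row)
        # Subtracting the max keeps every exponent <= 0, so exp() cannot overflow.
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        return [v - log_sum_exp for v in row]

    def gelu_tanh(x):
        """Tanh approximation of GELU, as written in the table."""
        inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
        return 0.5 * x * (1.0 + math.tanh(inner))

    def gelu_exact(x):
        """Exact GELU, x * Phi(x), expressed with the error function."""
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

    logits = [1000.0, 1001.0, 1002.0]   # a naive math.exp(1000.0) would overflow
    log_probs = log_softmax(logits)
    total = sum(math.exp(lp) for lp in log_probs)   # probabilities sum to ~1

    # The approximation tracks the exact form closely at typical inputs.
    gap_at_one = abs(gelu_tanh(1.0) - gelu_exact(1.0))

The large logits are chosen on purpose: exponentiating them directly raises an
``OverflowError`` in Python, while the shifted form stays finite.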

Linear algebra
--------------

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Function
     - Description
   * - ``npg.matmul(a, b)`` / ``npg.mm(a, b)``
     - Matrix multiplication. Handles 1D (dot product), 2D, and batched 3D inputs.
   * - ``npg.dot(a, b)``
     - Dot product of two 1D or 2D arrays
   * - ``npg.norm(a, axis, keepdims)``
     - Frobenius / L2 norm
   * - ``npg.diagonal(a, offset, axis1, axis2)``
     - Extract diagonal elements
   * - ``npg.trace(a, offset)``
     - Sum of diagonal elements (``diagonal(a, offset).sum()``)
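
The relationship between ``diagonal``, ``trace``, and the Frobenius norm can be
sketched on nested lists. This is plain Python rather than numpygrad code, the
helper names are illustrative, and the sign convention for ``offset``
(positive selects diagonals above the main one, as in NumPy) is an
assumption::

    import math

    def diagonal(matrix, offset=0):
        """Diagonal of a 2D nested list; offset > 0 is above the main diagonal."""
        rows, cols = len(matrix), len(matrix[0])
        return [matrix[i][i + offset]
                for i in range(rows)
                if 0 <= i + offset < cols]

    def trace(matrix, offset=0):
        """Sum of the selected diagonal."""
        return sum(diagonal(matrix, offset))

    def frobenius_norm(matrix):
        """Square root of the sum of squared entries."""
        return math.sqrt(sum(v * v for row in matrix for v in row))

    m = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
    main_diag = diagonal(m)        # [1.0, 5.0, 9.0]
    upper_diag = diagonal(m, 1)    # [2.0, 6.0]
    lower_diag = diagonal(m, -1)   # [4.0, 8.0]
    total = trace(m)               # 15.0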

Shape transforms
----------------

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Function
     - Description
   * - ``npg.reshape(a, new_shape)``
     - Change shape without changing data. Returns a view when possible.
   * - ``npg.transpose(a, axes)``
     - Permute dimensions. ``axes`` is a tuple; ``None`` reverses the order of all axes.
   * - ``npg.flatten(a)``
     - Flatten to 1D (equivalent to ``reshape(a, (-1,))``)
   * - ``npg.unsqueeze(a, axis)``
     - Insert a size-1 dimension at ``axis``
   * - ``npg.squeeze(a, axis=None)``
     - Remove size-1 dimensions (all if ``axis=None``)
   * - ``npg.repeat(a, repeats, axis=None)``
     - Repeat elements along an axis
   * - ``npg.triu(a, k=0)``
     - Upper-triangular part of a matrix (zeros below diagonal ``k``)
   * - ``npg.split(a, split_size_or_sections, dim=0)``
     - Split array into chunks along ``dim``. ``split_size_or_sections`` is
       either an int chunk size (the last chunk may be smaller) or a list of
       per-chunk sizes. Returns a tuple of ``Array`` objects.
   * - ``npg.stack(arrays, axis=0)``
     - Stack a list of arrays along a new axis
   * - ``npg.cat(arrays, axis=0)``
     - Concatenate arrays along an existing axis
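
The two forms of ``split_size_or_sections`` can be illustrated on a flat list.
This is a plain-Python sketch, not numpygrad code. It assumes the int form is
a chunk size (matching the "last may be smaller" wording) and that a list of
sizes must sum to the length of the split dimension. Both helper names are
hypothetical::

    def split_sizes(length, split_size_or_sections):
        """Chunk lengths along the split dimension."""
        if isinstance(split_size_or_sections, int):
            size = split_size_or_sections
            full, remainder = divmod(length, size)
            # Equal chunks, plus one shorter chunk if the length is not a multiple.
            return [size] * full + ([remainder] if remainder else [])
        sizes = list(split_size_or_sections)
        if sum(sizes) != length:
            raise ValueError("section sizes must sum to the dimension length")
        return sizes

    def split_list(values, split_size_or_sections):
        """Split a flat list into a tuple of chunks."""
        chunks, start = [], 0
        for size in split_sizes(len(values), split_size_or_sections):
            chunks.append(values[start:start + size])
            start += size
        return tuple(chunks)

    data = list(range(10))
    by_size = split_list(data, 4)            # chunk lengths 4, 4, 2
    by_sections = split_list(data, [3, 7])   # chunk lengths 3, 7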

Convolution
-----------

``npg.conv2d(input, weight, bias=None, stride=1, padding=0)``

2D convolution with full backward support.

- ``input``: shape ``(N, C_in, H, W)``
- ``weight``: shape ``(C_out, C_in, KH, KW)``
- ``bias``: shape ``(C_out,)`` or ``None``
- ``stride`` and ``padding`` accept an int or a ``(H, W)`` tuple
- Output shape: ``(N, C_out, H_out, W_out)``
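
The list above does not spell out ``H_out`` and ``W_out``. Assuming standard
convolution arithmetic (symmetric zero padding, no dilation, floor division),
they can be sketched in plain Python as follows. ``conv2d_output_shape`` is a
hypothetical helper, not part of the documented API::

    def _pair(value):
        """Expand an int to an (H, W) tuple; pass tuples through."""
        return (value, value) if isinstance(value, int) else tuple(value)

    def conv2d_output_shape(input_shape, weight_shape, stride=1, padding=0):
        """Output shape (N, C_out, H_out, W_out) under the stated assumptions."""
        n, c_in, h, w = input_shape
        c_out, c_in_w, kh, kw = weight_shape
        if c_in != c_in_w:
            raise ValueError("input and weight disagree on C_in")
        sh, sw = _pair(stride)
        ph, pw = _pair(padding)
        h_out = (h + 2 * ph - kh) // sh + 1
        w_out = (w + 2 * pw - kw) // sw + 1
        return (n, c_out, h_out, w_out)

    same = conv2d_output_shape((2, 3, 32, 32), (16, 3, 3, 3), stride=1, padding=1)
    halved = conv2d_output_shape((2, 3, 32, 32), (16, 3, 3, 3), stride=2, padding=1)
    mixed = conv2d_output_shape((1, 1, 10, 12), (4, 1, 3, 5), stride=(1, 2), padding=(0, 2))

Under these assumptions a 3 x 3 kernel with ``stride=1`` and ``padding=1``
preserves the 32 x 32 spatial size, while ``stride=2`` halves it to 16 x 16.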

Example::

    import numpygrad as npg

    x = npg.random.randn((2, 3, 32, 32))                     # batch of 2 RGB images
    w = npg.random.randn((16, 3, 3, 3), requires_grad=True)  # 16 filters, 3x3 each
    out = npg.conv2d(x, w, stride=1, padding=1)              # (2, 16, 32, 32)

Special
-------

``npg.setitem(a, key, value)``

Differentiable in-place assignment. Equivalent to ``a[key] = value`` but
records the operation in the computation graph, allowing gradients to flow
through the assignment::

    a = npg.zeros((4,), requires_grad=True)
    b = npg.setitem(a, 2, npg.array([5.0]))
    b.sum().backward()
    print(a.grad)   # [1., 1., 0., 1.] (index 2 was overwritten, so its gradient is 0)

``npg.masked_fill(a, mask, value)``

Fill positions where boolean ``mask`` is ``True`` with scalar ``value``.
Broadcasts ``mask`` over ``a`` when ranks differ (e.g. a 2D causal mask
applied to a 4D attention score tensor)::

    # T is the sequence length; scores is 4D with trailing shape (T, T)
    mask = npg.triu(npg.ones((T, T)), k=1).view(1, 1, T, T)   # nonzero at future positions
    scores = scores.masked_fill(mask, float("-inf"))
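
To illustrate why a causal mask is filled with ``-inf``, here is a plain-Python
sketch on nested lists (standard library only; all helper names are
illustrative and none of this is numpygrad code). It applies one ``T x T``
mask to every matrix in a small batch, which is the effect broadcasting has,
and then takes a row-wise softmax::

    import math

    NEG_INF = float("-inf")

    def causal_mask(T):
        """True strictly above the diagonal, i.e. at future positions."""
        return [[col > row for col in range(T)] for row in range(T)]

    def masked_fill(matrix, mask, value):
        """Replace entries of a 2D nested list where mask is True."""
        return [[value if m else s for s, m in zip(s_row, m_row)]
                for s_row, m_row in zip(matrix, mask)]

    def softmax(row):
        """Softmax of a flat list; exp(-inf) evaluates to 0.0."""
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        return [e / total for e in exps]

    T = 3
    # A (batch, T, T) stack of all-zero scores; the same 2D mask is reused
    # for every matrix in the batch.
    scores = [[[0.0] * T for _ in range(T)] for _ in range(2)]
    mask = causal_mask(T)
    masked = [masked_fill(matrix, mask, NEG_INF) for matrix in scores]
    weights = [[softmax(row) for row in matrix] for matrix in masked]

    first_row = weights[0][0]    # [1.0, 0.0, 0.0]
    second_row = weights[0][1]   # [0.5, 0.5, 0.0]

Masked positions receive exactly zero attention weight after the softmax, so
each position attends only to itself and to earlier positions.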

Embedding lookup (``npg.ops.embedding``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Row-wise lookup into a weight matrix; used internally by ``nn.Embedding``::

    weight = npg.random.randn((vocab_size, embed_dim), requires_grad=True)
    indices = npg.array([0, 3, 1])
    out = npg.ops.embedding(weight, indices)   # (3, embed_dim)
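
The forward and backward passes of a row lookup can be sketched in plain
Python. This is not the numpygrad implementation, and both function names are
hypothetical. It shows the mathematically implied gradient: each output row
sends its gradient back to the weight row it was read from, so repeated
indices accumulate::

    def embedding_forward(weight, indices):
        """Pick one row of weight per index."""
        return [list(weight[i]) for i in indices]

    def embedding_backward(grad_out, indices, vocab_size):
        """Scatter-add output gradients back onto the rows that were read."""
        embed_dim = len(grad_out[0])
        grad_weight = [[0.0] * embed_dim for _ in range(vocab_size)]
        for row_grad, i in zip(grad_out, indices):
            for d in range(embed_dim):
                grad_weight[i][d] += row_grad[d]   # repeated indices accumulate
        return grad_weight

    weight = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
    indices = [0, 3, 0]                      # row 0 is looked up twice
    out = embedding_forward(weight, indices)
    grad_out = [[1.0, 1.0] for _ in indices]   # gradient of out.sum()
    grad_weight = embedding_backward(grad_out, indices, len(weight))

Here row 0 receives a gradient of 2 per element because it was read twice,
row 3 receives 1, and the rows that were never looked up receive 0.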