GPT-2 Character Language Model ================================ Source: ``examples/gpt2/main.py`` Overview -------- A GPT-2-style transformer trained as a character-level language model on Shakespeare's complete works. The architecture mirrors the original GPT-2 paper: stacked transformer blocks with causal (masked) self-attention, LayerNorm, and GeLU activations — all implemented in pure NumPy using numpygrad. Running ------- :: python -m examples.gpt2.main # downloads data, trains with defaults python -m examples.gpt2.main --help # see all options Selected options: - ``--context-size`` — sequence length seen by the model (default 64) - ``--num-blocks`` — number of transformer blocks (default 6) - ``--num-heads`` — attention heads per block (default 6) - ``--embedding-dim`` — model width (default 288) - ``--num-steps`` — training steps (default 2 048) - ``--batch-size`` — mini-batch size (default 32) - ``--temperature`` — sampling temperature for generation (default 1.0) Sample output ------------- A sample from the GPT-2 model before training: .. code-block:: text AH,.u,I.znUsANK.MK.L.J'NCzWv'rAnsyMJD.en.HJtnM.eB3J?LsnTA.h.zN.MN.MWN.Fwz.MNdYOsL MaMN.TN.JCMMNqZiMJenfP.$wZNvhvN'MNMNM.,M.LKlYZ.Tcx.UhaVMNgZ$nTYM.zmsebEUTcMNMCchQNEsx.c WTA.sW?.M?YpMN SMwAQDN!Hb.MvUxM yqWMNq.M.Uq.Zc.Mw,MGV.mJkNDnUwMN.MUMcHLZ'x.ME,ebF!.MaM And 45 minutes later, it speaks at least some kind of proto-English language! Lesson learned, LLMs should not be trained on CPU. .. code-block:: text COMINIUS: Nine, I two it-ay because Till 'twere but equied big last!' Blanger, un withouts, Thy friend rude, or wantouch Romeo. Here have kneels: ounis, keep you To be even find and full of second worse: I ree say! DUCHESS OF YORK: Nay, be thing? STINGS: The assible tender of the breast, To fall the absence in this tower itself abiliory. Will you be blessing fancy, To give outward upon. Mark for your Naples; Upon you have not past a meritent woman: The prottle golden person, have good, hour some do break together, our love; and pray.'d you I mean not to be dvertender, use themselves: Ah, our spirits are not: accentible 'Twise; and our enemyour as most say. LUCENTIO: Prithee, my life, he hath with you, and saved upon him. Here is receive against the head. I pray you'll saw 'Come hither come; Not mark the people I dower to ceaear The rashly of his audible, fair suns ever Their apt in thine eyes and weary not; My lord, is not only to too; compine. So, fare you And you are all to be friend and she will, let us giveli to friends? BENVOLIO: Alas is rememberance! ESTIAN: Ah, the thing that I trouble him, hear not with me. EDWARD: Call it in Rutland: we have not singled key, Yet I shall so, in my exe, He shall away on the wagow of yourselve?' do you Even o' the brother; but whereify hearese cadies The stewes of the fac officient counter. ROMEO: Is it a content our tongue. VINCENTIO: Ay, man, you should love your mind. BICA: Well, the ribunless lackers, convenient is awed upon, Even together, remembers his silvance and drink you! QUEEN ELIZABETH: Lords, her hath all of wealer inchange in post All fellow of these his own is fainted ofun, And she dismiss'd thy children, twi herable withcland. ROMEO: I dare thee more let than thou hats from the gits in art: Nay, let up, so, in the birdshires. Nurse: Romer, you may again Margaret, I have stood like to Choot, when may have it fought To like to spea Architecture ------------ The model follows the standard GPT-2 design: .. code-block:: none Tokens → Embedding + Positional Embedding → Dropout → N × TransformerBlock LayerNorm → CausalAttention → residual LayerNorm → MLP (GeLU) → residual → LayerNorm → Linear (vocabulary projection) Code walkthrough ---------------- **Config** All architectural hyperparameters live in a single dataclass:: @dataclasses.dataclass class GPT2Config: context_size: int = 64 num_blocks: int = 6 num_heads: int = 6 embedding_dim: int = 288 dropout: float = 0.0 vocab_size: int | None = None # set from dataset **Causal self-attention** Q, K, V are produced by a single fused projection then split:: class CausalAttention(nn.Module): def __init__(self, config): self.in_proj = nn.Linear(embed_dim, 3 * embed_dim, bias=False) self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False) # static upper-triangular causal mask (1 = masked) self.causal_mask = npg.triu(npg.ones((T, T)), k=1).view(1, 1, T, T) def forward(self, x): B, C, _ = x.shape q, k, v = self.in_proj(x).split(embed_dim, dim=2) # reshape: (B, C, H, head_dim) → (B, H, C, head_dim) q = q.view(B, C, num_heads, head_dim).transpose(1, 2) k = k.view(B, C, num_heads, head_dim).transpose(1, 2) v = v.view(B, C, num_heads, head_dim).transpose(1, 2) scores = q @ k.transpose(-2, -1) / npg.sqrt(head_dim) scores = scores.masked_fill(self.causal_mask[:, :, :C, :C], float("-inf")) weights = npg.softmax(scores, axis=-1) x = (weights @ v).transpose(1, 2).reshape(B, C, embed_dim) return self.out_proj(x) **MLP block with GeLU** :: class MLP(nn.Module): def __init__(self, config, dilation=4): self.up_proj = nn.Linear(embed_dim, dilation * embed_dim, bias=False) self.gelu = nn.GELU() self.down_proj = nn.Linear(dilation * embed_dim, embed_dim, bias=False) def forward(self, x): return self.down_proj(self.gelu(self.up_proj(x))) **Transformer block (pre-LayerNorm)** :: class Gpt2Block(nn.Module): def forward(self, x): x = x + self.attn(self.ln1(x)) # attention residual x = x + self.ffn(self.ln2(x)) # FFN residual return x **Full model — token + position embeddings** :: class GPT2(nn.Module): def forward(self, x): # x: (B, C) integer token ids positions = npg.arange(C) x = self.dropout(self.token_embedding(x) + self.position_embedding(positions)) x = self.blocks(x) return self.output_projection(self.output_ln(x)) # (B, C, vocab_size) **Loss — 3D logits** ``cross_entropy_loss`` flattens ``(B, T, V)`` logits internally:: logits = net(x) # (B, T, vocab_size) loss = nn.cross_entropy_loss(logits, target) # target: (B, T) loss.backward() **Autoregressive sampling** :: @npg.no_grad() def sample(net, context, max_new_tokens=100, temperature=1.0): for _ in range(max_new_tokens): logits = net(context[:, -context_size:]) / temperature probs = npg.softmax(logits[:, -1, :], axis=-1) idx = np.random.choice(vocab_size, p=probs.numpy().squeeze()) context = npg.cat((context, npg.array([[idx]])), axis=-1) return context **Training loop** :: optimizer = npg.optim.AdamW(net.parameters(), lr=1e-3) for step, (x, target) in enumerate(dataloader): optimizer.zero_grad() logits = net(x) loss = nn.cross_entropy_loss(logits, target) loss.backward() optimizer.step() New primitives used ------------------- This example exercises several numpygrad features added alongside it: - ``nn.LayerNorm`` — layer normalisation with learnable affine parameters - ``nn.Embedding`` — token and position embedding lookup tables - ``nn.Dropout`` — inverted dropout; a no-op after ``model.eval()`` - ``nn.GELU`` — GeLU activation module (tanh approximation) - ``nn.Sequential`` — chains the transformer blocks - ``nn.init`` — parameter initialisation helpers - ``npg.triu`` — creates the static upper-triangular causal mask - ``npg.split`` — splits the fused Q/K/V projection into three tensors - ``masked_fill`` — applies the causal mask (broadcasts 2D mask over 4D scores) - ``cross_entropy_loss`` — accepts ``(B, T, V)`` logits without manual reshape