GPT-2 Character Language Model

Source: examples/gpt2/main.py

Overview

A GPT-2-style transformer trained as a character-level language model on Shakespeare’s complete works. The architecture mirrors the original GPT-2 paper: stacked transformer blocks with causal (masked) self-attention, LayerNorm, and GeLU activations — all implemented in pure NumPy using numpygrad.

Running

python -m examples.gpt2.main              # downloads data, trains with defaults
python -m examples.gpt2.main --help       # see all options

Selected options:

--context-size — sequence length seen by the model (default 64)
--num-blocks — number of transformer blocks (default 6)
--num-heads — attention heads per block (default 6)
--embedding-dim — model width (default 288)
--num-steps — training steps (default 2 048)
--batch-size — mini-batch size (default 32)
--temperature — sampling temperature for generation (default 1.0)

Sample output

A sample from the GPT-2 model before training:

AH,.u,I.znUsANK.MK.L.J'NCzWv'rAnsyMJD.en.HJtnM.eB3J?LsnTA.h.zN.MN.MWN.Fwz.MNdYOsL MaMN.TN.JCMMNqZiMJenfP.$wZNvhvN'MNMNM.,M.LKlYZ.Tcx.UhaVMNgZ$nTYM.zmsebEUTcMNMCchQNEsx.c WTA.sW?.M?YpMN SMwAQDN!Hb.MvUxM yqWMNq.M.Uq.Zc.Mw,MGV.mJkNDnUwMN.MUMcHLZ'x.ME,ebF!.MaM

And 45 minutes later, it speaks at least some kind of proto-English language! Lesson learned, LLMs should not be trained on CPU.

COMINIUS:
Nine, I two it-ay because
Till 'twere but equied big last!' Blanger, un withouts,
Thy friend rude, or wantouch Romeo. Here have kneels: ounis, keep you
To be even find and full of second worse: I ree say!

DUCHESS OF YORK:
Nay, be thing?

STINGS:
The assible tender of the breast,
To fall the absence in this tower itself abiliory.
Will you be blessing fancy,
To give outward upon. Mark for your Naples;
Upon you have not past a meritent woman:
The prottle golden person, have good, hour some do break
together, our love; and pray.'d you
I mean not to be dvertender, use themselves:
Ah, our spirits are not: accentible
'Twise; and our enemyour as most say.

LUCENTIO:
Prithee, my life, he hath with you, and saved
upon him. Here is receive against the head.
I pray you'll saw 'Come hither come;
Not mark the people I dower to ceaear
The rashly of his audible, fair suns ever
Their apt in thine eyes and weary not;
My lord, is not only to too; compine. So, fare you
And you are all to be friend and she will, let us giveli to friends?

BENVOLIO:
Alas is rememberance!

ESTIAN:
Ah, the thing that I trouble him, hear not with me.

EDWARD:
Call it in Rutland: we have not singled key,
Yet I shall so, in my exe,
He shall away on the wagow of yourselve?' do you
Even o' the brother; but whereify hearese cadies
The stewes of the fac officient counter.

ROMEO:
Is it a content our tongue.

VINCENTIO:
Ay, man, you should love your mind.

BICA:
Well, the ribunless lackers, convenient is awed upon,
Even together, remembers his silvance and drink you!

QUEEN ELIZABETH:
Lords, her hath all of wealer inchange in post
All fellow of these his own is fainted ofun,
And she dismiss'd thy children, twi herable withcland.

ROMEO:
I dare thee more let than thou hats from the gits in art:
Nay, let up, so, in the birdshires.

Nurse:
Romer, you may again Margaret,
I have stood like to Choot, when may have it fought
To like to spea

Architecture

The model follows the standard GPT-2 design:

Tokens → Embedding + Positional Embedding
    → Dropout
    → N × TransformerBlock
        LayerNorm → CausalAttention → residual
        LayerNorm → MLP (GeLU) → residual
    → LayerNorm
    → Linear (vocabulary projection)

Code walkthrough

Config

All architectural hyperparameters live in a single dataclass:

@dataclasses.dataclass
class GPT2Config:
    context_size:  int   = 64
    num_blocks:    int   = 6
    num_heads:     int   = 6
    embedding_dim: int   = 288
    dropout:       float = 0.0
    vocab_size:    int | None = None   # set from dataset

Causal self-attention

Q, K, V are produced by a single fused projection then split:

class CausalAttention(nn.Module):
    def __init__(self, config):
        self.in_proj  = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        # static upper-triangular causal mask (1 = masked)
        self.causal_mask = npg.triu(npg.ones((T, T)), k=1).view(1, 1, T, T)

    def forward(self, x):
        B, C, _ = x.shape
        q, k, v = self.in_proj(x).split(embed_dim, dim=2)
        # reshape: (B, C, H, head_dim) → (B, H, C, head_dim)
        q = q.view(B, C, num_heads, head_dim).transpose(1, 2)
        k = k.view(B, C, num_heads, head_dim).transpose(1, 2)
        v = v.view(B, C, num_heads, head_dim).transpose(1, 2)

        scores  = q @ k.transpose(-2, -1) / npg.sqrt(head_dim)
        scores  = scores.masked_fill(self.causal_mask[:, :, :C, :C], float("-inf"))
        weights = npg.softmax(scores, axis=-1)
        x = (weights @ v).transpose(1, 2).reshape(B, C, embed_dim)
        return self.out_proj(x)

MLP block with GeLU

class MLP(nn.Module):
    def __init__(self, config, dilation=4):
        self.up_proj   = nn.Linear(embed_dim, dilation * embed_dim, bias=False)
        self.gelu      = nn.GELU()
        self.down_proj = nn.Linear(dilation * embed_dim, embed_dim, bias=False)

    def forward(self, x):
        return self.down_proj(self.gelu(self.up_proj(x)))

Transformer block (pre-LayerNorm)

class Gpt2Block(nn.Module):
    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # attention residual
        x = x + self.ffn(self.ln2(x))    # FFN residual
        return x

Full model — token + position embeddings

class GPT2(nn.Module):
    def forward(self, x):           # x: (B, C) integer token ids
        positions = npg.arange(C)
        x = self.dropout(self.token_embedding(x) + self.position_embedding(positions))
        x = self.blocks(x)
        return self.output_projection(self.output_ln(x))   # (B, C, vocab_size)

Loss — 3D logits

cross_entropy_loss flattens (B, T, V) logits internally:

logits = net(x)                               # (B, T, vocab_size)
loss   = nn.cross_entropy_loss(logits, target)  # target: (B, T)
loss.backward()

Autoregressive sampling

@npg.no_grad()
def sample(net, context, max_new_tokens=100, temperature=1.0):
    for _ in range(max_new_tokens):
        logits = net(context[:, -context_size:]) / temperature
        probs  = npg.softmax(logits[:, -1, :], axis=-1)
        idx    = np.random.choice(vocab_size, p=probs.numpy().squeeze())
        context = npg.cat((context, npg.array([[idx]])), axis=-1)
    return context

Training loop

optimizer = npg.optim.AdamW(net.parameters(), lr=1e-3)
for step, (x, target) in enumerate(dataloader):
    optimizer.zero_grad()
    logits = net(x)
    loss   = nn.cross_entropy_loss(logits, target)
    loss.backward()
    optimizer.step()

New primitives used

This example exercises several numpygrad features added alongside it:

nn.LayerNorm — layer normalisation with learnable affine parameters
nn.Embedding — token and position embedding lookup tables
nn.Dropout — inverted dropout; a no-op after model.eval()
nn.GELU — GeLU activation module (tanh approximation)
nn.Sequential — chains the transformer blocks
nn.init — parameter initialisation helpers
npg.triu — creates the static upper-triangular causal mask
npg.split — splits the fused Q/K/V projection into three tensors
masked_fill — applies the causal mask (broadcasts 2D mask over 4D scores)
cross_entropy_loss — accepts (B, T, V) logits without manual reshape