GPT-2 Character Language Model
Source: examples/gpt2/main.py
Overview
A GPT-2-style transformer trained as a character-level language model on Shakespeare’s complete works. The architecture mirrors the original GPT-2 paper: stacked transformer blocks with causal (masked) self-attention, LayerNorm, and GeLU activations — all implemented in pure NumPy using numpygrad.
Running
python -m examples.gpt2.main # downloads data, trains with defaults
python -m examples.gpt2.main --help # see all options
Selected options:
--context-size— sequence length seen by the model (default 64)--num-blocks— number of transformer blocks (default 6)--num-heads— attention heads per block (default 6)--embedding-dim— model width (default 288)--num-steps— training steps (default 2 048)--batch-size— mini-batch size (default 32)--temperature— sampling temperature for generation (default 1.0)
Sample output
A sample from the GPT-2 model before training:
AH,.u,I.znUsANK.MK.L.J'NCzWv'rAnsyMJD.en.HJtnM.eB3J?LsnTA.h.zN.MN.MWN.Fwz.MNdYOsL MaMN.TN.JCMMNqZiMJenfP.$wZNvhvN'MNMNM.,M.LKlYZ.Tcx.UhaVMNgZ$nTYM.zmsebEUTcMNMCchQNEsx.c WTA.sW?.M?YpMN SMwAQDN!Hb.MvUxM yqWMNq.M.Uq.Zc.Mw,MGV.mJkNDnUwMN.MUMcHLZ'x.ME,ebF!.MaM
And 45 minutes later, it speaks at least some kind of proto-English language! Lesson learned, LLMs should not be trained on CPU.
COMINIUS:
Nine, I two it-ay because
Till 'twere but equied big last!' Blanger, un withouts,
Thy friend rude, or wantouch Romeo. Here have kneels: ounis, keep you
To be even find and full of second worse: I ree say!
DUCHESS OF YORK:
Nay, be thing?
STINGS:
The assible tender of the breast,
To fall the absence in this tower itself abiliory.
Will you be blessing fancy,
To give outward upon. Mark for your Naples;
Upon you have not past a meritent woman:
The prottle golden person, have good, hour some do break
together, our love; and pray.'d you
I mean not to be dvertender, use themselves:
Ah, our spirits are not: accentible
'Twise; and our enemyour as most say.
LUCENTIO:
Prithee, my life, he hath with you, and saved
upon him. Here is receive against the head.
I pray you'll saw 'Come hither come;
Not mark the people I dower to ceaear
The rashly of his audible, fair suns ever
Their apt in thine eyes and weary not;
My lord, is not only to too; compine. So, fare you
And you are all to be friend and she will, let us giveli to friends?
BENVOLIO:
Alas is rememberance!
ESTIAN:
Ah, the thing that I trouble him, hear not with me.
EDWARD:
Call it in Rutland: we have not singled key,
Yet I shall so, in my exe,
He shall away on the wagow of yourselve?' do you
Even o' the brother; but whereify hearese cadies
The stewes of the fac officient counter.
ROMEO:
Is it a content our tongue.
VINCENTIO:
Ay, man, you should love your mind.
BICA:
Well, the ribunless lackers, convenient is awed upon,
Even together, remembers his silvance and drink you!
QUEEN ELIZABETH:
Lords, her hath all of wealer inchange in post
All fellow of these his own is fainted ofun,
And she dismiss'd thy children, twi herable withcland.
ROMEO:
I dare thee more let than thou hats from the gits in art:
Nay, let up, so, in the birdshires.
Nurse:
Romer, you may again Margaret,
I have stood like to Choot, when may have it fought
To like to spea
Architecture
The model follows the standard GPT-2 design:
Tokens → Embedding + Positional Embedding
→ Dropout
→ N × TransformerBlock
LayerNorm → CausalAttention → residual
LayerNorm → MLP (GeLU) → residual
→ LayerNorm
→ Linear (vocabulary projection)
Code walkthrough
Config
All architectural hyperparameters live in a single dataclass:
@dataclasses.dataclass
class GPT2Config:
context_size: int = 64
num_blocks: int = 6
num_heads: int = 6
embedding_dim: int = 288
dropout: float = 0.0
vocab_size: int | None = None # set from dataset
Causal self-attention
Q, K, V are produced by a single fused projection then split:
class CausalAttention(nn.Module):
def __init__(self, config):
self.in_proj = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)
# static upper-triangular causal mask (1 = masked)
self.causal_mask = npg.triu(npg.ones((T, T)), k=1).view(1, 1, T, T)
def forward(self, x):
B, C, _ = x.shape
q, k, v = self.in_proj(x).split(embed_dim, dim=2)
# reshape: (B, C, H, head_dim) → (B, H, C, head_dim)
q = q.view(B, C, num_heads, head_dim).transpose(1, 2)
k = k.view(B, C, num_heads, head_dim).transpose(1, 2)
v = v.view(B, C, num_heads, head_dim).transpose(1, 2)
scores = q @ k.transpose(-2, -1) / npg.sqrt(head_dim)
scores = scores.masked_fill(self.causal_mask[:, :, :C, :C], float("-inf"))
weights = npg.softmax(scores, axis=-1)
x = (weights @ v).transpose(1, 2).reshape(B, C, embed_dim)
return self.out_proj(x)
MLP block with GeLU
class MLP(nn.Module):
def __init__(self, config, dilation=4):
self.up_proj = nn.Linear(embed_dim, dilation * embed_dim, bias=False)
self.gelu = nn.GELU()
self.down_proj = nn.Linear(dilation * embed_dim, embed_dim, bias=False)
def forward(self, x):
return self.down_proj(self.gelu(self.up_proj(x)))
Transformer block (pre-LayerNorm)
class Gpt2Block(nn.Module):
def forward(self, x):
x = x + self.attn(self.ln1(x)) # attention residual
x = x + self.ffn(self.ln2(x)) # FFN residual
return x
Full model — token + position embeddings
class GPT2(nn.Module):
def forward(self, x): # x: (B, C) integer token ids
positions = npg.arange(C)
x = self.dropout(self.token_embedding(x) + self.position_embedding(positions))
x = self.blocks(x)
return self.output_projection(self.output_ln(x)) # (B, C, vocab_size)
Loss — 3D logits
cross_entropy_loss flattens (B, T, V) logits internally:
logits = net(x) # (B, T, vocab_size)
loss = nn.cross_entropy_loss(logits, target) # target: (B, T)
loss.backward()
Autoregressive sampling
@npg.no_grad()
def sample(net, context, max_new_tokens=100, temperature=1.0):
for _ in range(max_new_tokens):
logits = net(context[:, -context_size:]) / temperature
probs = npg.softmax(logits[:, -1, :], axis=-1)
idx = np.random.choice(vocab_size, p=probs.numpy().squeeze())
context = npg.cat((context, npg.array([[idx]])), axis=-1)
return context
Training loop
optimizer = npg.optim.AdamW(net.parameters(), lr=1e-3)
for step, (x, target) in enumerate(dataloader):
optimizer.zero_grad()
logits = net(x)
loss = nn.cross_entropy_loss(logits, target)
loss.backward()
optimizer.step()
New primitives used
This example exercises several numpygrad features added alongside it:
nn.LayerNorm— layer normalisation with learnable affine parametersnn.Embedding— token and position embedding lookup tablesnn.Dropout— inverted dropout; a no-op aftermodel.eval()nn.GELU— GeLU activation module (tanh approximation)nn.Sequential— chains the transformer blocksnn.init— parameter initialisation helpersnpg.triu— creates the static upper-triangular causal masknpg.split— splits the fused Q/K/V projection into three tensorsmasked_fill— applies the causal mask (broadcasts 2D mask over 4D scores)cross_entropy_loss— accepts(B, T, V)logits without manual reshape