Mini GPT

Architecture diagram (top to bottom): output logits [batch, T, 3000] via weight-tied projection · final LayerNorm · 5 × transformer block, each consisting of LayerNorm → Causal Self-Attention (8 heads, d_k = 32) + residual, then LayerNorm → Feed-Forward (d_ff = 512, GELU) + residual. The diagram also carries a "263k" label.

Highlights

  • Large configuration (7.07M parameters) achieves validation perplexity 8.08 after training on 104M tokens, more than halving the small model's perplexity of 17.75 with 6.7x the parameters.
  • Full stack built from scratch: 3,000-token BPE vocabulary, sinusoidal positional encoding, 5 pre-norm transformer blocks, 8-head causal self-attention (d_k = 32), d_ff = 512, and weight-tied output projection.
  • Embedding space encodes semantic structure: the analogy "princess - girl + boy" returns "prince" as the top cosine-similarity match, and nearest-neighbor probes reveal clean family-role and activity clusters.

Architecture

Input token IDs are mapped to 256-dimensional embeddings and summed with fixed sinusoidal positional encodings. Because these encodings carry no learned position parameters, they remove one source of overfitting at limited context lengths. The signal then passes through five identical transformer blocks in sequence. Each block is pre-norm: layer normalization is applied before each of the two sublayers, and a residual connection wraps each one. The first sublayer is 8-head causal self-attention (d_k = 32, so 8 × 32 = 256 matches the embedding width) with a lower-triangular mask that prevents any token from attending to later positions; the second is a two-layer feed-forward network with d_ff = 512 and GELU activation. A final layer norm stabilizes the representations before the output projection maps back to the 3,000-token vocabulary. The output projection shares its weight matrix with the token embedding table, which removes 768k parameters (3,000 × 256) without any quality loss; this standard efficiency choice also encourages the model to learn consistent token representations in input and output space. Total trainable parameters: 3.40M (medium) and 7.07M (large).
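The parameter figures above can be sanity-checked with a few lines of arithmetic. The sketch below is an assumption-laden reconstruction, not the project's own code: it assumes a bias on every linear layer and a gain plus bias on every LayerNorm, and the function name `count_parameters` is hypothetical.

```python
def count_parameters(vocab_size=3000, d_model=256, n_layers=5, d_ff=512):
    """Count trainable parameters for a pre-norm decoder with tied embeddings.

    Assumptions (not stated in the write-up): every linear layer has a bias,
    and every LayerNorm has a gain and a bias. Sinusoidal positions and the
    tied output projection contribute no parameters of their own.
    """
    embedding = vocab_size * d_model                # shared with the output projection
    attention = 4 * (d_model * d_model + d_model)   # Q, K, V and output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                 # two LayerNorms per block
    per_block = attention + feed_forward + layer_norms
    final_norm = 2 * d_model
    return {
        "embedding": embedding,
        "attention": attention,
        "feed_forward": feed_forward,
        "per_block": per_block,
        "total": embedding + n_layers * per_block + final_norm,
    }


counts = count_parameters()
for name, value in counts.items():
    print(f"{name:>12}: {value:,}")
```

Under these assumptions the embedding table accounts for 768,000 parameters and the total is 3,404,032, which lines up with the 3.40M reported for the medium configuration.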

Training

The dataset contains 104M training tokens drawn from a large corpus of short stories with a simple, consistent grammar. Token streams are concatenated and sliced into fixed-length context windows of 256 tokens. Each sample is an (input, target) pair in which the target is the input shifted by one position, the standard causal language modeling objective. Optimization uses AdamW (β1 = 0.9, β2 = 0.999, weight decay 0.1) with linear warmup over the first 500 steps followed by cosine decay. Gradient norm is clipped at 1.0. Mixed precision (torch.cuda.amp) is enabled on GPU, providing roughly 2x throughput with no accuracy penalty. At peak the training loop processes ~39,500 tokens per second, completing the medium-scale run in under 35 minutes. The initial cross-entropy loss of 7.37 is in the range of the uniform-prediction baseline of ln(3000) ≈ 8.01, which is consistent with a correct initialization and loss setup.
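The warmup-then-cosine schedule can be sketched as a plain function of the step count. This is a minimal illustration rather than the project's training code: the peak learning rate, the total step count, and the floor `min_lr` are not given in the write-up, so the values below are placeholders, and `lr_at_step` is a hypothetical name.

```python
import math


def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=500, min_lr=0.0):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to min_lr.

    peak_lr, total_steps and min_lr are illustrative placeholders; only the
    500-step warmup comes from the write-up.
    """
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr exactly.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr if training runs long
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 10_000  # hypothetical number of gradient steps
for s in (0, 250, 499, 500, 5_250, 10_000):
    print(s, f"{lr_at_step(s, total):.2e}")
```

In a PyTorch loop, a function like this could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the multiplier `lr_at_step(step, total) / peak_lr` instead of the absolute rate.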

Results

Three configurations were trained to convergence under identical conditions: same dataset, same optimizer, and same number of gradient steps, with only model capacity varied. Final validation perplexity: small (1.05M params) 17.75; medium (3.40M) 10.23; large (7.07M) 8.08. The gap is large relative to the compute cost. Perplexity can be read as the effective number of equally likely next tokens, so the large model narrows its choice to about 8 candidates where the small model hesitates among about 18, less than half the uncertainty. Qualitatively, the large model produces multi-sentence output that maintains a consistent setting and cast of characters across the full context window, while the small model loses coherence after one or two sentences. Temperature controls the sharpness of the output distribution: at t = 0.5 the output is most coherent but tends toward repetition; at t = 1.2 it introduces more novel elements but shows more syntactic drift. The sweet spot for this model size is t = 0.8 to 1.0.
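The effect of temperature is easiest to see on a small example. The sketch below divides the logits by t before the softmax and then samples from the result; the logits are invented for illustration and the function names are hypothetical, not taken from the project.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Return next-token probabilities after scaling logits by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for t in (0.5, 0.8, 1.2):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, which fits the repetition seen at t = 0.5; higher temperatures flatten the distribution and let less likely tokens through.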

What the embeddings learned

A well-trained language model should encode more than token co-occurrence statistics: it should organize the embedding space so that semantically related tokens cluster and relational structure is preserved under vector arithmetic. Nearest-neighbor probes show clean family-role clusters: the five closest tokens to "dad" are "Dad", "daddy", "mom", "parents", and "Daddy", with cosine similarities between 0.50 and 0.67. The analogy test "princess - girl + boy" returns "prince" as the top match at 0.66 cosine similarity. This suggests the model has encoded the royal/gender relationship as a consistent direction in embedding space, a property that emerges from language statistics alone, without any explicit supervision on semantic relationships.
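An analogy probe of this kind can be sketched in a few lines. The example below uses invented 3-dimensional vectors rather than the model's 256-dimensional embeddings, and it excludes the three query words from the candidates, which is a common convention but an assumption here; `cosine` and `analogy` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(embeddings, a, b, c, top_k=3):
    """Rank tokens by cosine similarity to embeddings[a] - embeddings[b] + embeddings[c]."""
    query = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [
        (cosine(query, vec), token)
        for token, vec in embeddings.items()
        if token not in (a, b, c)  # assumption: query words are excluded
    ]
    return sorted(candidates, reverse=True)[:top_k]


# Toy vectors, invented for illustration; axes are roughly (royalty, male, child).
toy = {
    "princess": [1.0, 0.0, 1.0],
    "prince": [1.0, 1.0, 1.0],
    "girl": [0.0, 0.0, 1.0],
    "boy": [0.0, 1.0, 1.0],
    "queen": [1.0, 0.0, 0.0],
    "dog": [0.0, 0.5, 0.2],
}
for score, token in analogy(toy, "princess", "girl", "boy"):
    print(f"{token}: {score:.2f}")
```

With the real model, `embeddings` would map each vocabulary token to its row of the tied embedding matrix, and the same ranking with a single word as the query gives the nearest-neighbor probe.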