Yes, I know I can’t compete with the big guns like Google and OpenAI, but that wasn’t the point of this project; the purpose was to learn and understand how to train an LLM.

Here is what I believe my pipeline will be, although I may come back and edit this if it changes.

  • Tokenize – Turn text into token IDs.
  • Embed – Map each ID to a vector.
  • Add positions – Learned position embeddings, so the model knows token order.
  • Transformer blocks – Each block has:
    • Causal self-attention – Each position can only attend to itself and earlier tokens (never future ones), implemented with a causal mask.
    • Feed-forward – Small MLP (e.g. linear → GELU → linear) per position.
  • LM head – Final linear layer from hidden size to vocab size → logits for the next token.
  • Training – Minimize cross-entropy between predicted next-token distribution and the actual next token.
  • Generation – Sample from that distribution, append the token, repeat.
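The whole pipeline above can be sketched end to end in NumPy. Everything here is illustrative: the sizes, weight names, single attention head, untrained random weights, and the omission of layer norm and multiple blocks are my simplifying assumptions, not a description of the final model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions, chosen for readability)
vocab_size, seq_len, d_model, d_ff = 50, 8, 16, 32

# Parameters: token embeddings, learned position embeddings,
# attention projections, feed-forward weights, LM head.
tok_emb = rng.normal(0, 0.02, (vocab_size, d_model))
pos_emb = rng.normal(0, 0.02, (seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(0, 0.02, (d_model, d_model)) for _ in range(4))
W1 = rng.normal(0, 0.02, (d_model, d_ff))
W2 = rng.normal(0, 0.02, (d_ff, d_model))
W_lm = rng.normal(0, 0.02, (d_model, vocab_size))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def forward(ids):
    T = len(ids)
    x = tok_emb[ids] + pos_emb[:T]                    # embed + add positions
    # One transformer block: causal self-attention (single head)...
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal = future tokens
    scores[mask] = -1e9                               # mask out the future before softmax
    x = x + softmax(scores) @ v @ Wo                  # attention + residual
    # ...then the per-position feed-forward (linear -> GELU -> linear)
    x = x + gelu(x @ W1) @ W2
    return x @ W_lm                                   # LM head: logits over the vocab

ids = rng.integers(0, vocab_size, seq_len)            # stand-in for real token IDs
logits = forward(ids)                                 # shape (seq_len, vocab_size)
probs = softmax(logits[-1])                           # next-token distribution
loss = -np.log(probs[ids[0]])                         # cross-entropy against one target token
next_id = int(probs.argmax())                         # greedy pick; sampling would draw from probs
```

Training would repeat the `loss` computation over every position with shifted targets and backpropagate; generation appends `next_id` to `ids` and calls `forward` again.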