Yes, I know I can’t compete with the big guns like Google and ChatGPT, but that wasn’t the point of this project; the purpose was to learn and understand how to train an LLM.
Here is what I believe my pipeline will be, although I may come back and edit this if it changes.
- Tokenize – Turn text into token IDs.
- Embed – Map each ID to a vector.
- Add positions – So the model knows order (learned position embeddings).
- Transformer blocks – Each block has:
  - Causal self-attention – Each position can only attend to past tokens (no future), implemented with a causal mask.
  - Feed-forward – Small MLP (e.g. linear → GELU → linear) applied at each position.
  - Residual connections (and layer normalization) around both sub-layers.
- LM head – Final linear layer from hidden size to vocab size → logits for the next token.
- Training – Minimize cross-entropy between predicted next-token distribution and the actual next token.
- Generation – Sample from that distribution, append the token, repeat.
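The whole pipeline above can be sketched as a single forward pass in NumPy. This is a toy with random (untrained) weights — a character-level vocabulary, one block, a single attention head, and no layer norm or backprop — so all the specifics (dimensions, GELU approximation, weight tying) are illustrative assumptions, not the final design. But it exercises every step in order:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Tokenize: toy character-level vocab (real models would use e.g. BPE) ---
text = "hello world"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = np.array([stoi[ch] for ch in text])

V, T, D = len(vocab), len(ids), 16  # vocab size, sequence length, hidden size

# --- Embed + add learned position embeddings (randomly initialised here) ---
tok_emb = rng.normal(0, 0.02, (V, D))
pos_emb = rng.normal(0, 0.02, (T, D))
x = tok_emb[ids] + pos_emb  # (T, D)

# --- Causal self-attention: mask out future positions ---
Wq, Wk, Wv = (rng.normal(0, 0.02, (D, D)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(D)                     # (T, T)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal = future
scores[mask] = -np.inf                            # softmax sends these to 0
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
x = x + attn @ v                                  # residual connection

# --- Feed-forward: linear -> GELU -> linear, applied at each position ---
def gelu(h):
    return 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))

W1 = rng.normal(0, 0.02, (D, 4 * D))
W2 = rng.normal(0, 0.02, (4 * D, D))
x = x + gelu(x @ W1) @ W2                         # residual connection

# --- LM head: project hidden states to vocab logits ---
logits = x @ tok_emb.T  # tied to the embedding matrix (a common choice)

# --- Training objective: cross-entropy on next-token prediction ---
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
targets = ids[1:]  # each position is trained to predict the next token
loss = -np.mean(np.log(probs[np.arange(T - 1), targets]))
print(f"untrained cross-entropy: {loss:.3f}")  # should sit near log(V)

# --- Generation: sample the next token from the last position's distribution ---
next_id = rng.choice(V, p=probs[-1])
print("sampled next char:", repr(vocab[next_id]))
```

Training would wrap the cross-entropy step in an optimizer and backpropagate through all the weight matrices; generation would loop the last step, appending each sampled token and running the forward pass again.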