Yes, I know I can’t compete with the big guns like Google and OpenAI, but that wasn’t the point of this project; the purpose was to learn and understand how to train an LLM.

Here is what I believe my pipeline will be, although I may come back and edit this if it changes.

  • Tokenize – Turn text into token IDs.
  • Embed – Map each ID to a vector.
  • Add positions – Learned position embeddings, so the model knows token order.
  • Transformer blocks – Each block has:
    • Causal self-attention – Each position can only attend to itself and earlier tokens (never future ones), implemented with a causal mask.
    • Feed-forward – Small MLP (e.g. linear → GELU → linear) per position.
  • LM head – Final linear layer from hidden size to vocab size → logits for the next token.
  • Training – Minimize cross-entropy between predicted next-token distribution and the actual next token.
  • Generation – Sample from that distribution, append the token, repeat.
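The whole pipeline above can be sketched end to end in NumPy. Everything here is illustrative: the sizes, weight names, single attention head, untrained random weights, and the omission of layer norm and multiple blocks are my simplifying assumptions, not a description of the final model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions, chosen for readability)
vocab_size, seq_len, d_model, d_ff = 50, 8, 16, 32

# Parameters: token embeddings, learned position embeddings,
# attention projections, feed-forward weights, LM head.
tok_emb = rng.normal(0, 0.02, (vocab_size, d_model))
pos_emb = rng.normal(0, 0.02, (seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(0, 0.02, (d_model, d_model)) for _ in range(4))
W1 = rng.normal(0, 0.02, (d_model, d_ff))
W2 = rng.normal(0, 0.02, (d_ff, d_model))
W_lm = rng.normal(0, 0.02, (d_model, vocab_size))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def forward(ids):
    T = len(ids)
    x = tok_emb[ids] + pos_emb[:T]                    # embed + add positions
    # One transformer block: causal self-attention (single head)...
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal = future tokens
    scores[mask] = -1e9                               # mask out the future before softmax
    x = x + softmax(scores) @ v @ Wo                  # attention + residual
    # ...then the per-position feed-forward (linear -> GELU -> linear)
    x = x + gelu(x @ W1) @ W2
    return x @ W_lm                                   # LM head: logits over the vocab

ids = rng.integers(0, vocab_size, seq_len)            # stand-in for real token IDs
logits = forward(ids)                                 # shape (seq_len, vocab_size)
probs = softmax(logits[-1])                           # next-token distribution
loss = -np.log(probs[ids[0]])                         # cross-entropy against one target token
next_id = int(probs.argmax())                         # greedy pick; sampling would draw from probs
```

Training would repeat the `loss` computation over every position with shifted targets and backpropagate; generation appends `next_id` to `ids` and calls `forward` again.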