For training my model, I used a Python library called PyTorch. PyTorch lets you build neural networks out of basic math operations on arrays of numbers, and it gives you four main building blocks: Tensors, Layers, Automatic Gradients, and Optimizers.
Here is a breakdown of what those actually are:
Tensor
A tensor is the array type that PyTorch uses for basically everything: a multi-dimensional array of numbers. It's what holds all of your model's 'data' (the weights, activation vectors, token IDs, etc.).
Each of your model's weights is stored in a tensor, but not every tensor is a weight. A batch of token IDs is a tensor too, for example.
1D tensor (a list of numbers):
tensor([ 0.1, -0.3, 0.5, 0.0, -0.2 ])
2D tensor (a grid/table):
tensor([[ 1, 2, 3 ],
[ 4, 5, 6 ],
[ 7, 8, 9 ]])
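For example, here's how those same tensors look when you actually create them in PyTorch (a quick sketch; the token IDs at the end are made-up numbers):

```python
import torch

# A 1D tensor (a vector of 5 numbers)
v = torch.tensor([0.1, -0.3, 0.5, 0.0, -0.2])
print(v.shape)  # torch.Size([5])

# A 2D tensor (a 3x3 grid)
m = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
print(m.shape)  # torch.Size([3, 3])

# A batch of token IDs is also just a tensor:
# here, 2 sequences of 4 IDs each (arbitrary example values)
token_ids = torch.tensor([[15, 203, 8, 91],
                          [7, 42, 42, 0]])
print(token_ids.shape)  # torch.Size([2, 4])
```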
What are a model's "weights"?
A model's weights are the only thing that changes when you train it. Together, the weights define one big, complex math recipe, and that recipe is adjusted based on how right or wrong the model's answers are. Over many training steps, the weights gradually adjust to give better and better answers. There is no "memory" of past questions, just a recipe of numbers and math that produces an answer to your question.
Layers
A layer has two parts:
- Fixed math (the operation) — e.g., "multiply the input by a matrix, add a vector," or "take max(0, x)." That part doesn't change.
- Weights — the numbers (the matrix, the bias vector, etc.) that the math uses. Those are what training updates.
So: a layer = a fixed operation + the weights it uses. The operation is the recipe; the weights are the learned numbers inside it.
One layer = one step: "take input, do this math with my weights, return output."
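As a concrete sketch, PyTorch's nn.Linear is a layer that bundles exactly these two parts: a fixed "multiply by a matrix, add a vector" operation, plus the weight and bias tensors it uses. (The sizes 3 and 2 here are arbitrary, just for illustration.)

```python
import torch
import torch.nn as nn

# A linear layer: the fixed operation is "multiply by a matrix, add a vector";
# layer.weight and layer.bias are the learned numbers it uses.
layer = nn.Linear(in_features=3, out_features=2)
print(layer.weight.shape)  # torch.Size([2, 3])
print(layer.bias.shape)    # torch.Size([2])

x = torch.randn(3)  # an input vector
y = layer(x)        # one step: take input, do the math with the weights, return output
print(y.shape)      # torch.Size([2])
```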
What does the operation actually do?
The operation is the fixed math a layer performs on its input. In a linear layer, for example, it multiplies the input by the weight matrix, then adds the bias vector.
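You can check this by writing the same math out by hand and comparing it to what the layer computes (a small sketch, with arbitrary sizes):

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
x = torch.randn(3)

# The layer's fixed operation, written out by hand:
# multiply the input by the weight matrix, then add the bias vector.
manual = x @ layer.weight.T + layer.bias

# The layer computes exactly the same thing.
assert torch.allclose(layer(x), manual)
```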
Automatic Gradients
How far the loss goes up or down when you nudge a weight is called the gradient of the loss with respect to that weight. You need one gradient per parameter, because the gradients are what tell training how to adjust each weight.
Automatic gradients (PyTorch's autograd) compute all of those gradients for you.
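A minimal sketch of autograd in action, using a made-up one-weight "model": we build a tiny loss from the weight, call backward(), and PyTorch fills in the gradient for us.

```python
import torch

# A single weight with gradient tracking turned on.
w = torch.tensor(2.0, requires_grad=True)

# A tiny made-up loss: (w * 3 - 12)^2, which is lowest when w = 4.
loss = (w * 3 - 12) ** 2

# autograd computes d(loss)/d(w) for us.
loss.backward()
print(w.grad)  # tensor(-36.)  -- i.e., 2 * (3*2 - 12) * 3
```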
When we talk about loss in training, what does it mean? What is loss?
Loss is a single number that measures how wrong a model's predictions are. The lower the number, the closer the model is to getting the answer correct; the higher the number, the further off it was. Over the course of training, we hope to see the loss decrease.
Training tries to continuously lower the loss number by updating the weights.
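For example, mean squared error is one common way to compute that single loss number: the average of (prediction - target)^2. (The prediction and target values here are made up.)

```python
import torch
import torch.nn as nn

predictions = torch.tensor([2.5, 0.0, 2.0])
targets     = torch.tensor([3.0, -0.5, 2.0])

# Mean squared error: average of (prediction - target)^2.
loss = nn.MSELoss()(predictions, targets)
print(loss)  # a single number: tensor(0.1667)
```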
Optimizer
An optimizer is what takes your gradients and actually updates your weights (common choices are SGD and Adam).