An LLM never sees text, only integers. Its inputs and outputs are all represented as sequences of token IDs (e.g., [7, 4, 10, 10, 12]). This is why we use a tokenizer to convert strings of text into integer IDs.
A tokenizer works as a bridge: it converts strings to IDs on the way in, and converts IDs back to strings on the way out.
- Going in: Turn text into IDs so the model can process it (encode).
- Going out: Turn the model's predicted IDs back into text so humans can read it (decode).
To build the tokenizer, we first need a vocabulary: a list of every character that can appear, each mapped to a unique ID.
We take a string of text, split it into individual characters, and remove any duplicates. Then we sort the remaining characters. Python sorts characters by Unicode code point, which is close to alphabetical order but places spaces, digits, and uppercase letters before lowercase letters.
Sorting the characters creates a stable order, so the same text always produces the same vocabulary.
chars = sorted(set(text))
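As a quick illustration of the vocabulary step, here is a minimal sketch. The sample string is a made-up example, not part of the tokenizer itself. Note how the space and the uppercase 'H' sort ahead of the lowercase letters.

```python
# Hypothetical sample text, chosen only for illustration.
text = "Hello world"

# set() drops duplicate characters; sorted() orders them by Unicode code point.
chars = sorted(set(text))

print(chars)       # [' ', 'H', 'd', 'e', 'l', 'o', 'r', 'w']
print(len(chars))  # 8
```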
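Here we build two dictionaries. stoi ("string to integer") maps each character to a unique integer, and itos ("integer to string") is the reverse lookup. Because the mapping is one-to-one, any text made of known characters can be written as a sequence of integers, one ID per character. This sequence is sometimes loosely called a vector, but it is simply a list of token IDs.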
self.stoi = {ch: i for i, ch in enumerate(chars)}
self.itos = {i: ch for i, ch in enumerate(chars)}
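The two dictionaries can be tried outside the class. The sketch below assumes a small made-up sample text and uses plain variables instead of self attributes; it shows that itos undoes stoi for every character in the vocabulary.

```python
# Hypothetical sample text, chosen only for illustration.
text = "hello world"
chars = sorted(set(text))  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']

stoi = {ch: i for i, ch in enumerate(chars)}  # character -> ID
itos = {i: ch for i, ch in enumerate(chars)}  # ID -> character

print(stoi["h"])  # 3
print(itos[3])    # h

# The two mappings are inverses of each other.
print(all(itos[stoi[ch]] == ch for ch in chars))  # True
```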
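Encoding uses the stoi dictionary to convert text into a list of IDs, one per character. For
the word "hello", for example, and a vocabulary where 'h'→7, 'e'→4, 'l'→10, 'o'→12, you get
[7, 4, 10, 10, 12]. The same character always gets the same ID, which is why 10 appears twice. Characters that are not in the vocabulary are silently skipped.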
def encode(self, text: str) -> list[int]:
    return [self.stoi[c] for c in text if c in self.stoi]
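To see the skipping behavior in action, here is a standalone sketch of the same logic as a plain function. The sample text and the free-standing encode function are assumptions for illustration only. Because the vocabulary is case-sensitive, "Hello" loses its first character when 'H' was never seen.

```python
# Hypothetical sample text, chosen only for illustration.
text = "hello world"
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}

def encode(s: str) -> list[int]:
    # Look up each character; characters missing from stoi are dropped.
    return [stoi[c] for c in s if c in stoi]

print(encode("hello"))  # [3, 2, 4, 4, 5]
print(encode("Hello"))  # [2, 4, 4, 5]  ('H' is not in the vocabulary)
```

Silently dropping characters keeps the code short, but it means encoding can lose information without warning.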
Decoding reverses the process: it looks up each ID in itos and joins the resulting characters back into a string. IDs that are not in the vocabulary are skipped.
def decode(self, ids: list[int]) -> str:
    return "".join(self.itos[i] for i in ids if i in self.itos)
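A standalone sketch of decoding, again using a made-up sample text and a plain function rather than the method. The ID 99 is a deliberately invalid value to show that unknown IDs are ignored.

```python
# Hypothetical sample text, chosen only for illustration.
text = "hello world"
itos = {i: ch for i, ch in enumerate(sorted(set(text)))}

def decode(ids: list[int]) -> str:
    # Look up each ID; IDs missing from itos are dropped.
    return "".join(itos[i] for i in ids if i in itos)

print(decode([3, 2, 4, 4, 5]))      # hello
print(decode([3, 2, 99, 4, 4, 5]))  # hello  (99 is not a valid ID)
```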
Full Code Sample:
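class CharTokenizer:
    def __init__(self, text: str):
        # Vocabulary: every distinct character in the text, in a stable sorted order.
        chars = sorted(set(text))
        self.vocab_size = len(chars)
        self.stoi = {ch: i for i, ch in enumerate(chars)}  # character -> ID
        self.itos = {i: ch for i, ch in enumerate(chars)}  # ID -> character

    def encode(self, text: str) -> list[int]:
        # Characters missing from the vocabulary are silently skipped.
        return [self.stoi[c] for c in text if c in self.stoi]

    def decode(self, ids: list[int]) -> str:
        # IDs missing from the vocabulary are silently skipped.
        return "".join(self.itos[i] for i in ids if i in self.itos)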
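Putting it together, the sketch below repeats the class so it runs on its own and then exercises a round trip. The corpus and the test strings are made-up examples. The round trip is exact only when every character of the input appeared in the text used to build the vocabulary.

```python
class CharTokenizer:
    def __init__(self, text: str):
        chars = sorted(set(text))
        self.vocab_size = len(chars)
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for i, ch in enumerate(chars)}

    def encode(self, text: str) -> list[int]:
        return [self.stoi[c] for c in text if c in self.stoi]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids if i in self.itos)


# Hypothetical corpus, chosen only for illustration.
tok = CharTokenizer("hello world")
print(tok.vocab_size)  # 8

# Every character of "low roll" is in the vocabulary, so the round trip is exact.
ids = tok.encode("low roll")
print(ids)              # [4, 5, 7, 0, 6, 5, 4, 4]
print(tok.decode(ids))  # low roll

# 'H' and '!' were never seen, so they are lost on the way through.
lossy = tok.decode(tok.encode("Hello!"))
print(lossy)  # ello
```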