TOKENIZER: LLM BUILD | Amy Jerkovich

An LLM never sees text, only integers. Its inputs and outputs are sequences of token IDs (e.g., [7, 4, 10, 10, 12]), which is why we need a tokenizer to convert strings of text into those IDs.

A tokenizer is the bridge between the two: it converts strings to IDs on the way in, and converts IDs back to strings on the way out.

Going in: Turn text into IDs so the model can process it (encode).
Going out: Turn the model's predicted IDs back into text so humans can read it (decode).

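We take a string of text, split it into individual characters, and remove any duplicates. Then we sort the remaining characters. Python's sorted() orders them by Unicode code point, so the result is only roughly alphabetical: spaces and punctuation come first, and uppercase letters come before lowercase ones.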

We do this to create the tokenizer's vocabulary: a list of every character that appears in the text, each mapped to a unique ID.

Sorting gives the characters a deterministic order (a Python set has no guaranteed order), so the same text always produces the same vocabulary and the same IDs.

chars = sorted(set(text))
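
As a small illustration, here is what that line produces for a short text. The sample string is made up for this sketch. Note that the order follows Unicode code points, so the space and comma come first and uppercase letters sort before lowercase ones.

```python
# Hypothetical sample text, chosen only for illustration.
text = "Hello, World"

# set() removes duplicates; sorted() orders by Unicode code point.
chars = sorted(set(text))

print(chars)
# [' ', ',', 'H', 'W', 'd', 'e', 'l', 'o', 'r']
print(len(chars))  # 9 unique characters
```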

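Here we build two dictionaries. stoi (short for "string to integer") maps each character to a unique integer ID, and itos ("integer to string") maps each ID back to its character. With them, any text made of vocabulary characters can be written as a list of numbers, one per character. That list is the token ID sequence the model receives.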

self.stoi = {ch: i for i, ch in enumerate(chars)}
self.itos = {i: ch for i, ch in enumerate(chars)}
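
A standalone sketch of the same two dictionaries, without the class around them, may make the mapping easier to see. The sample text "hello" is an assumption for illustration; a real vocabulary would be built from a much larger text.

```python
# Hypothetical sample text, chosen only for illustration.
text = "hello"
chars = sorted(set(text))  # ['e', 'h', 'l', 'o']

# Character -> ID and ID -> character, built from the same enumeration.
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

print(stoi)  # {'e': 0, 'h': 1, 'l': 2, 'o': 3}
print(itos)  # {0: 'e', 1: 'h', 2: 'l', 3: 'o'}

# The two dictionaries are inverses of each other.
inverse_ok = all(itos[stoi[ch]] == ch for ch in chars)
print(inverse_ok)  # True
```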

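Encoding uses the stoi dictionary to convert text into a list of IDs, one per character. For the word "hello", for example, with a vocabulary where 'h'→7, 'e'→4, 'l'→10, 'o'→12, you get [7, 4, 10, 10, 12]. The mapping is case-sensitive ('H' and 'h' are different characters), and the code below silently skips any character that is not in the vocabulary.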

def encode(self, text: str) -> list[int]:
    return [self.stoi[c] for c in text if c in self.stoi]
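
The sketch below shows the skipping behavior on a tiny vocabulary built from "hello" alone, so the IDs differ from the ones in the example above. The encode_strict function is a hypothetical variant, not part of the tokenizer in this document; it shows one way to fail loudly instead of dropping characters.

```python
# Hypothetical sample vocabulary, built from a single word for illustration.
text = "hello"
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}


def encode(s: str) -> list[int]:
    # Same logic as the tokenizer's encode: unknown characters are skipped.
    return [stoi[c] for c in s if c in stoi]


def encode_strict(s: str) -> list[int]:
    # Hypothetical alternative: raise instead of silently dropping characters.
    unknown = sorted(set(s) - set(stoi))
    if unknown:
        raise ValueError(f"characters not in vocabulary: {unknown}")
    return [stoi[c] for c in s]


print(encode("hello"))   # [1, 0, 2, 2, 3]
print(encode("Hello!"))  # [0, 2, 2, 3]  ('H' and '!' are dropped)
```

Silent skipping keeps encode from crashing on unexpected input, but it also means the encoded sequence can be shorter than the text, which is easy to miss.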

The decoder does the reverse: it uses itos to convert a list of IDs back into text, skipping any ID that is not in the vocabulary.

def decode(self, ids: list[int]) -> str:
    return "".join(self.itos[i] for i in ids if i in self.itos)
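
A minimal standalone sketch of decoding, again assuming a toy vocabulary built from "hello". The ID 99 is a made-up out-of-range value used to show that invalid IDs are skipped rather than raising an error.

```python
# Hypothetical sample vocabulary, built from a single word for illustration.
text = "hello"
itos = {i: ch for i, ch in enumerate(sorted(set(text)))}
# itos == {0: 'e', 1: 'h', 2: 'l', 3: 'o'}


def decode(ids: list[int]) -> str:
    # Same logic as the tokenizer's decode: unknown IDs are skipped.
    return "".join(itos[i] for i in ids if i in itos)


print(decode([1, 0, 2, 2, 3]))      # hello
print(decode([1, 0, 99, 2, 2, 3]))  # hello  (99 is not a valid ID)
```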

Full Code Sample:

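class CharTokenizer:
    """Character-level tokenizer: one ID per distinct character in the text."""

    def __init__(self, text: str):
        # Unique characters, sorted by code point for a deterministic order.
        chars = sorted(set(text))
        self.vocab_size = len(chars)
        # stoi: character -> ID; itos: ID -> character.
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for i, ch in enumerate(chars)}

    def encode(self, text: str) -> list[int]:
        # Characters missing from the vocabulary are silently skipped.
        return [self.stoi[c] for c in text if c in self.stoi]

    def decode(self, ids: list[int]) -> str:
        # IDs missing from the vocabulary are silently skipped.
        return "".join(self.itos[i] for i in ids if i in self.itos)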
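
A short usage sketch for the class follows. The class body is repeated only so the snippet runs on its own, and the training string "hello world" is an assumption for illustration. It shows that encoding and then decoding returns the original text when every character is in the vocabulary, and loses characters when some are not.

```python
class CharTokenizer:
    # Copied from the full code sample so this snippet is self-contained.
    def __init__(self, text: str):
        chars = sorted(set(text))
        self.vocab_size = len(chars)
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for i, ch in enumerate(chars)}

    def encode(self, text: str) -> list[int]:
        return [self.stoi[c] for c in text if c in self.stoi]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids if i in self.itos)


# Hypothetical training text, chosen only for illustration.
tok = CharTokenizer("hello world")
print(tok.vocab_size)  # 8

ids = tok.encode("hello")
print(ids)              # [3, 2, 4, 4, 5]
print(tok.decode(ids))  # hello

# 'H' and '!' are not in the vocabulary, so the round trip is lossy.
print(tok.decode(tok.encode("Hello!")))  # ello
```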