LLM Tokenization

Andrej Karpathy recently published a new lecture on large language model (LLM) tokenization. Tokenization is a key part of training LLMs, but tokenizers are trained separately from the model itself, with their own datasets and algorithms (e.g., Byte Pair Encoding).
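
For intuition, here is a minimal, toy sketch of the core BPE training loop: start from raw bytes and repeatedly merge the most frequent adjacent pair into a new token id. The corpus, variable names, and number of merges below are illustrative and are not taken from the lecture's code.

```python
# Toy Byte Pair Encoding (BPE) training sketch (illustrative, not the lecture's code)
from collections import Counter

def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "low lower lowest"          # toy corpus
ids = list(text.encode("utf-8"))   # start from raw bytes (ids 0..255)
num_merges = 10                    # the vocabulary grows by one id per merge
merges = {}

for step in range(num_merges):
    counts = get_pair_counts(ids)
    if not counts:
        break
    pair = counts.most_common(1)[0][0]   # most frequent adjacent pair
    new_id = 256 + step                  # next unused token id
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id

print(merges)  # learned merge rules, applied in the same order at encoding time
```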

In the lecture, Karpathy teaches how to implement a GPT tokenizer from scratch. He also discusses weird behaviors that trace back to tokenization.

"LLM Tokenization"

Figure Source: https://youtu.be/zduSFxRajkE?t=6711 (opens in a new tab)

Here is the text version of the list shown in the figure:

  • Why can't LLM spell words? Tokenization.
  • Why can't LLM do super simple string processing tasks like reversing a string? Tokenization.
  • Why is LLM worse at non-English languages (e.g. Japanese)? Tokenization.
  • Why is LLM bad at simple arithmetic? Tokenization.
  • Why did GPT-2 have more than necessary trouble coding in Python? Tokenization.
  • Why did my LLM abruptly halt when it sees the string "<|endoftext|>"? Tokenization.
  • What is this weird warning I get about a "trailing whitespace"? Tokenization.
  • Why does the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.
  • Why should I prefer to use YAML over JSON with LLMs? Tokenization.
  • Why is LLM not actually end-to-end language modeling? Tokenization.
  • What is the real root of suffering? Tokenization.

To improve the reliability of LLMs, it's important to understand how to prompt these models, which also involves understanding their limitations. While tokenizers get little attention at inference time (beyond the max_tokens configuration), good prompt engineering involves understanding the constraints and limitations inherent in tokenization, just as it involves understanding how to structure or format a prompt. For instance, a prompt may underperform because the model fails to handle an acronym or concept that isn't properly processed or tokenized, as in the sketch below. This is a very common problem that many LLM developers and researchers overlook.
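
As a quick illustration, the open-source tiktoken library can be used to inspect how a given string is split into tokens. The example strings below are illustrative, and cl100k_base is just one common encoding choice:

```python
# Inspect how strings are split into tokens (illustrative example)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

for text in ["SolidGoldMagikarp", "127 + 456", "strawberry"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")
```

Printing the individual token pieces like this often reveals surprising splits (e.g., a rare word or a number broken into unintuitive chunks), which helps explain why the model struggles with spelling, arithmetic, or unusual strings.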

A good tool for exploring tokenization is Tiktokenizer, which is what Karpathy uses in the lecture for demonstration purposes.