A token in a large language model (LLM) is the smallest unit of text that the model processes and generates. Tokens can represent different levels of linguistic units, such as characters, words, subwords, or phrases, depending on the tokenisation technique used. Each base model can use a different set of tokens. In models such as LLaMA, the tokens are mostly words and subwords, with a full set of individual characters also available for building words that are not in the token set.
Tokenisation
Tokenisation is the process of dividing a piece of text into tokens, each of which is mapped to an integer ID and then to a numerical vector representation called an embedding. Embeddings encode semantic and contextual information about a particular token, enabling LLMs to understand and generate coherent and relevant text. Tokens are the fundamental building blocks of LLMs, but they also introduce limitations, such as the maximum token limit that restricts the combined length of the input and output sequences.
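As a minimal sketch of this pipeline, the example below maps a sentence to tokens, then to token IDs, then to embedding vectors. The vocabulary, the tokenise helper, and the random embedding matrix are all illustrative stand-ins: a real LLM uses a trained subword tokeniser, a vocabulary of tens of thousands of tokens, and embedding vectors learned during training.

```python
import numpy as np

# Toy vocabulary mapping each token string to an integer ID.
vocab = {"I": 0, "love": 1, "NLP": 2, ".": 3}
embedding_dim = 8  # real models use hundreds or thousands of dimensions

# Embedding matrix with one row per token ID. Random here; learned in practice.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

def tokenise(text: str) -> list[str]:
    """Naive whitespace/punctuation split, standing in for a real tokeniser."""
    return text.replace(".", " .").split()

tokens = tokenise("I love NLP.")            # ['I', 'love', 'NLP', '.']
token_ids = [vocab[t] for t in tokens]      # [0, 1, 2, 3]
embeddings = embedding_matrix[token_ids]    # shape (4, 8): one vector per token
print(tokens, token_ids, embeddings.shape)
```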
There are different types of tokenisation, depending on the level of granularity and the purpose of the base model to be trained. Some common types are:
- Word tokenisation: This is the most basic and common type of tokenisation, where the text is split into words based on whitespace and punctuation. For example, the sentence “I love NLP.” would be split into four tokens: [“I”, “love”, “NLP”, “.”]. Beyond its use in LLMs, word tokenisation is useful for tasks like word frequency analysis, sentiment analysis, and topic modeling.
- Character tokenisation: This is where the text is split into individual characters, regardless of their meaning or context. For example, the word “NLP” would be split into three tokens: [“N”, “L”, “P”]. Sequences of characters are the classic way that computers represent text, and character tokenisation is useful for tasks like spelling correction and character-level language modeling. However, the fine granularity of character tokenisation greatly limits the ability of the resulting embeddings to encode semantic meaning.
- Subword tokenisation: This is where the text is split into smaller units that are not necessarily words, but meaningful segments of words. Subword vocabularies are typically learned from a training corpus using algorithms such as Byte-Pair Encoding (BPE) or WordPiece, and can be tuned to different levels of granularity depending on the task to be performed. Subword tokenisation is useful for tasks like machine translation, speech recognition, and text compression, and it is the most common approach in large language models (the sketch after this list contrasts word, character, and subword splits of an example sentence).
- Phrase tokenisation: This is where the text is split into groups of words that form meaningful units or expressions. For example, the sentence “She bought a new car yesterday.” could be split into three tokens: [“She”, “bought a new car”, “yesterday”]. Phrase tokenisation is useful for tasks like information extraction, question answering, and semantic parsing.
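As a concrete, if simplified, illustration of the first three granularities, the following Python sketch splits the same sentence at the word, character, and subword level. The word and character splits are naive string operations; the subword split assumes the Hugging Face transformers library is installed and uses the GPT-2 tokeniser as one widely available example, so the exact subword pieces depend on that model's learned vocabulary.

```python
# Contrasting word, character, and subword tokenisation on one sentence.
from transformers import AutoTokenizer  # assumes `pip install transformers`

sentence = "I love NLP."

# Word tokenisation: naive split on whitespace, with the full stop
# separated out as its own token.
word_tokens = sentence.replace(".", " .").split()
print("words:     ", word_tokens)        # ['I', 'love', 'NLP', '.']

# Character tokenisation: every character (including spaces) is a token.
char_tokens = list(sentence)
print("characters:", char_tokens)

# Subword tokenisation: a trained BPE vocabulary (GPT-2 here). The 'Ġ'
# prefix in GPT-2's tokens marks a preceding space; the exact split
# depends on the vocabulary the model was trained with.
subword_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print("subwords:  ", subword_tokenizer.tokenize(sentence))
```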
Example of Subword Tokenisation
As an example of subword tokenisation, consider the word “tokenisation” itself. This word could be split in a number of different ways (the sketch after this list shows how one real tokeniser splits both the British and American spellings):
- Into five tokens roughly aligned with syllables: [“tok”, “en”, “is”, “a”, “tion”]. Note that the tokens “tok” and “en” have no meaning on their own, while the tokens “is” and “a” happen to be spelled like English words whose meanings are unrelated to their role here.
- Into three tokens aligned with meaning: [“token”, “is”, “ation”]. Here the token “token” carries the core meaning of the word, and the suffix “ation” acts as a modifier of the earlier tokens. The middle token “is” also carries meaning as part of the “-isation” ending, but it can be confused with the English word of the same spelling, and in American English the token “iz” would be used in its place.
- With a suitable vocabulary, a split into just two tokens, “token” and “isation”, might also be used.
- If the word were common enough in the training material for the base model, it might also be assigned a single token, “tokenisation”, allowing for a more compact representation. The alternative American English spelling “tokenization” would then typically need its own token as well, for use when generating text in American English.
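The sketch below shows how to inspect the split chosen by one real subword tokeniser for both spellings. It again assumes the Hugging Face transformers library is installed and uses the GPT-2 tokeniser purely as an example; other models' vocabularies will split the word into different pieces, so the output is not the canonical split.

```python
# Inspect how one trained subword vocabulary splits the two spellings.
from transformers import AutoTokenizer  # assumes `pip install transformers`

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for word in ["tokenisation", "tokenization"]:
    pieces = tokenizer.tokenize(word)              # subword strings
    ids = tokenizer.convert_tokens_to_ids(pieces)  # their integer IDs
    print(f"{word!r} -> {pieces} -> ids {ids}")
```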