A '''token''' in a [[large language model]] is the smallest unit of text that the model processes and generates. Tokens can represent different levels of linguistic units, such as characters, words, subwords, or phrases, depending on the tokenisation technique used. Each [[base model]] can potentially use a different set of tokens. In models such as [[LLaMA]], the tokens used are mostly words and subwords, with a full set of individual characters also available for building words that are not in the token set.

== Tokenisation ==

Tokenisation is the process of dividing a piece of text into tokens, which are then mapped to numerical representations called [[embeddings]]. Embeddings encode semantic and contextual information about a particular token, enabling LLMs to understand and generate coherent and relevant text. Tokens are the fundamental building blocks of LLMs, but they also introduce limitations, such as the maximum token limit, which restricts the length of the input and output sequences.

There are different types of tokenisation, depending on the level of granularity and the purpose of the base model to be trained. Some common types are:

* '''Word tokenisation''': The most basic and common type of tokenisation, where the text is split into words based on whitespace and punctuation. For example, the sentence “I love NLP.” would be split into four tokens: [“I”, “love”, “NLP”, “.”]. As well as its use in LLMs, word tokenisation is useful for tasks like word frequency analysis, sentiment analysis, and topic modelling.
* '''Character tokenisation''': The text is split into individual characters, regardless of their meaning or context. For example, the word “NLP” would be split into three tokens: [“N”, “L”, “P”]. Character tokenisation is the classic way that computers represent text and is useful for tasks like spelling correction and character-level language modelling. The fine granularity of representation in character tokenisation greatly limits the ability of embeddings to encode semantic meaning.
* '''Subword tokenisation''': The text is split into smaller units that are not necessarily words, but meaningful segments of words. Subword tokenisation can be applied at different levels for different contexts, depending on the task to be performed. It is useful for tasks like machine translation, speech recognition, and text compression, and it is the most common approach in large language models (see the code sketch after this list).
* '''Phrase tokenisation''': The text is split into groups of words that form meaningful units or expressions. For example, the sentence “She bought a new car yesterday.” could be split into three tokens: [“She”, “bought a new car”, “yesterday”]. Phrase tokenisation is useful for tasks like information extraction, question answering, and semantic parsing.
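To make the subword case concrete, the following is a minimal sketch of greedy longest-match subword tokenisation over a small hand-picked vocabulary. The vocabulary, the function name <code>tokenise</code>, and the word-level input are illustrative assumptions only; real tokenisers such as LLaMA's learn their vocabularies from data (for example with byte-pair-encoding style algorithms) rather than matching a fixed list like this.

<syntaxhighlight lang="python">
# Minimal sketch: greedy longest-match subword tokenisation over a toy,
# hand-picked vocabulary (an illustrative assumption, not a real model
# vocabulary). Single characters act as a fallback, mirroring the
# character-level fallback described above.
TOY_VOCAB = {"token", "tok", "en", "is", "a", "tion", "ation", "isation"}

def tokenise(word: str, vocab: set) -> list:
    """Split a word into the longest matching vocabulary entries, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible substring first and shrink until a match is found.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenise("tokenisation", TOY_VOCAB))                # ['token', 'isation']
print(tokenise("tokenisation", TOY_VOCAB - {"isation"}))  # ['token', 'is', 'ation']
</syntaxhighlight>

Different vocabularies therefore produce different splits of the same word, which is the point developed in the example below.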
=== Example of Subword Tokenisation ===

As an example of subword tokenisation, consider the word "tokenisation". This word could be split in a number of different ways:

* Into five tokens aligned with syllables: ["tok", "en", "is", "a", "tion"]. Note that the tokens "tok" and "en" have no meaning on their own, and the tokens "is" and "a" have meanings unrelated to the English words with the same spellings.
* Into three tokens aligned with meaning: ["token", "is", "ation"]. Here the token "token" has a meaning of its own, one that aligns with the general meaning of the word "tokenisation". Similarly, the suffix "ation" has meaning as a modifier of the earlier tokens. The middle token "is" also has meaning, but it is subject to confusion with the English word of the same spelling, and in American English the token "iz" would be used in its place.
* In an appropriate grammar, a split into two tokens, "token" and "isation", might also be appropriate.
* If the word was common enough in the [[training material]] for the base model, it may also be assigned a single token, "tokenisation", allowing for a more compact representation. This would typically require that the synonymous token "tokenization" also be included in the vocabulary, for use when generating text in American English.
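Continuing the hypothetical sketch from the Tokenisation section (reusing <code>tokenise</code> and <code>TOY_VOCAB</code>), adding the whole word to the toy vocabulary yields the single-token, more compact representation described in the last point, while the American spelling falls back to individual characters wherever no matching subword exists:

<syntaxhighlight lang="python">
# Reuses tokenise() and TOY_VOCAB from the sketch in the Tokenisation section.
print(tokenise("tokenisation", TOY_VOCAB | {"tokenisation"}))
# ['tokenisation'] -- a single, more compact token

print(tokenise("tokenization", TOY_VOCAB | {"tokenisation"}))
# ['token', 'i', 'z', 'ation'] -- no 'iz' or 'tokenization' entry exists in
# the toy vocabulary, so that part falls back to individual characters
</syntaxhighlight>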