Inference is the process of generating outputs from an artificial intelligence model, typically by feeding the model inputs similar to those used in training.
In the context of a large language model, inference is the process of generating text or other outputs based on some input prompt or context. Inference can be used for various tasks, such as text completion, summarisation, translation, question answering, and more. Inference can also be interactive, where the user and the model have a conversation or collaborate on a creative task.
Process Description
The process of inference involves several steps to generate the output (a code sketch of the full loop follows the list):
- Tokenisation: This is the first step of inference, where the input text (prompt) is converted into a sequence of tokens that the LLM can understand. Tokens are usually words or subwords that represent the smallest units of meaning in a language. Different LLMs may use different tokenisation methods and vocabularies, which can affect the performance and compatibility of the models.
- Encoding: This is the step where the LLM processes the input tokens and generates a hidden representation for each token. The hidden representation captures the semantic and syntactic information of the token and its context. The LLM uses a deep neural network, usually based on the transformer architecture, to encode the input tokens.
- Decoding: This is the step where the LLM generates output tokens based on the hidden representations of the input tokens. The LLM uses another neural network, usually also based on the transformer architecture, and proceeds to generate new tokens one at a time. For each token to be generated, the neural network calculates the probability that each token in its vocabulary will appear. It then applies a token selection process to select one of those tokens to continue the output.
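As a rough illustration of this tokenise–encode–decode loop, the sketch below uses the Hugging Face transformers library with greedy selection (always taking the most probable next token). The model name "gpt2", the prompt text, and the 20-token limit are arbitrary choices made for the example, not part of any particular system.

<syntaxhighlight lang="python">
# Minimal greedy inference loop: tokenise the prompt, run the model to score
# the next token, append the most probable token, and repeat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # tokenisation vocabulary
model = AutoModelForCausalLM.from_pretrained("gpt2")   # transformer network
model.eval()

prompt = "Inference is the process of"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids   # text -> tokens

with torch.no_grad():
    for _ in range(20):                                # generate 20 new tokens
        logits = model(input_ids).logits               # encode + decode pass
        next_token_scores = logits[:, -1, :]           # scores for the next token
        next_token = next_token_scores.argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))                  # tokens -> text
</syntaxhighlight>

In practice, the greedy pick in the loop body is replaced by one of the token selection methods described in the next section.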
Token Selection
The process of selecting the next token to appear in the output is influenced by many of the model's inference hyperparameters, including the following (code sketches of these selection methods follow the list):
- Temperature: This is a parameter that controls the randomness or diversity of the output tokens. A higher temperature means more randomness and diversity, while a lower temperature means more certainty and conformity.
- Top_p: This is a parameter that controls the probability mass of the output tokens. The LLM assigns a probability to each possible output token based on its hidden representation. Top-p sampling means choosing an output token from the smallest set of tokens whose cumulative probability exceeds a threshold p.
- Top_k: This is a parameter that controls the number of output tokens to consider. The LLM assigns a probability to each possible output token based on its hidden representation. Top-k sampling means choosing an output token from the k most probable tokens.
- Typical_p: This is a parameter that controls how "typical" the output tokens are allowed to be. For each candidate token, the LLM compares the token's surprisal (negative log probability) with the expected surprisal (entropy) of the whole next-token distribution. Typical-p sampling means choosing an output token from the smallest set of tokens, ordered by how close their surprisal is to that expectation, whose cumulative probability exceeds a threshold p. For example, with p = 0.95, the model samples from the most typical tokens that together account for 95% of the probability mass. A higher p admits more candidates, while a lower p restricts generation to the most typical tokens.
- Beam search: This is a decoding strategy that explores several candidate output sequences in parallel. The LLM generates output tokens one by one until it reaches an end-of-sequence token or a maximum length limit. Beam search means keeping track of the b best output sequences (the beam width) at each step and expanding them with new output tokens until they are complete or terminated.
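The sketch below shows one common way temperature, top_k, and top_p are applied to the raw next-token scores (logits) before sampling. The function name, the toy logits, and the exact order of the filters are choices made for this example; real implementations differ in detail.

<syntaxhighlight lang="python">
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick one token id from raw logits using temperature, top-k, and top-p."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax -> probability per token

    order = np.argsort(probs)[::-1]            # token ids, most probable first
    keep = np.ones_like(order, dtype=bool)

    if top_k > 0:
        keep[top_k:] = False                   # top-k: keep only the k best tokens
    if top_p < 1.0:
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest set with mass >= p
        keep[cutoff:] = False

    kept_ids = order[keep]
    kept_probs = probs[kept_ids] / probs[kept_ids].sum()  # renormalise kept tokens
    return int(rng.choice(kept_ids, p=kept_probs))

# Toy example: a 5-token vocabulary with made-up scores.
print(sample_next_token([2.0, 1.5, 0.3, -1.0, -2.0], temperature=0.8, top_k=3, top_p=0.9))
</syntaxhighlight>

Lowering the temperature sharpens the distribution before the filters are applied, so the same top_k and top_p settings then keep fewer, more probable tokens.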
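Typical-p sampling can be illustrated as a filter that keeps the tokens whose surprisal is closest to the distribution's entropy. The sketch below is a rough rendering of that idea under the description above, not a reference implementation; the function name and toy distribution are invented for the example.

<syntaxhighlight lang="python">
import numpy as np

def typical_filter(probs, typical_p=0.95):
    """Return the token ids kept by (locally) typical sampling."""
    probs = np.asarray(probs, dtype=np.float64)
    surprisal = -np.log(probs)                        # information content per token
    entropy = np.sum(probs * surprisal)               # expected surprisal
    order = np.argsort(np.abs(surprisal - entropy))   # most "typical" tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, typical_p) + 1
    return order[:cutoff]                             # smallest typical set with mass >= p

# Toy distribution over a 4-token vocabulary.
print(typical_filter([0.5, 0.3, 0.15, 0.05], typical_p=0.9))
</syntaxhighlight>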
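Finally, a toy sketch of the beam search bookkeeping: the step_probs callback stands in for the model (it returns a probability for every token in the vocabulary given the tokens so far) and is an assumption of this example, as are the vocabulary size and end-of-sequence id.

<syntaxhighlight lang="python">
import numpy as np

def beam_search(step_probs, beam_width=3, max_len=5, eos_id=0):
    """Keep the beam_width best partial sequences at each generation step."""
    beams = [([], 0.0)]                              # (token ids, log probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos_id:      # finished sequences pass through
                candidates.append((tokens, score))
                continue
            probs = step_probs(tokens)               # model's next-token distribution
            for token_id, p in enumerate(probs):
                candidates.append((tokens + [token_id], score + np.log(p)))
        # keep only the beam_width highest-scoring sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Toy "model": a fixed distribution over a 4-token vocabulary (token 0 is end-of-sequence).
toy_model = lambda tokens: [0.1, 0.5, 0.3, 0.1]
for tokens, score in beam_search(toy_model, beam_width=2, max_len=3):
    print(tokens, round(score, 3))
</syntaxhighlight>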
See Also
- Transformer on Wikipedia
- Beam search on Wikipedia
- Batch Prompting: Efficient Inference with Large Language Model APIs. [1]
- Inference with Reference: Lossless Acceleration of Large Language Models. [2]
- Fast Distributed Inference Serving for Large Language Models. [3]
- FlexGen: High-Throughput Generative Inference of Large Language Models. [4]