Positional encoding (or positional embedding) is a way of adding information about the order or position of tokens in a sequence to a model that has no inherent notion of order. A transformer, for example, is built on attention mechanisms, which can learn the relevance or similarity between different parts of the input and output sequences but do not capture the sequential order of the tokens. Positional encoding therefore injects positional information into the model, so that it can distinguish between tokens that have the same meaning but occur at different positions.
Positional Encoding Mechanisms
Sinusoidal Positional Encoding
There are different ways of implementing positional encoding. The original approach was the sinusoidal positional encoding proposed by Vaswani et al. in Attention is All You Need (2017). This method uses sinusoidal functions to encode the position of each token as a vector of the same dimension as the token embedding. The formula for computing the positional encoding vector for a given position and dimension is:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position of the token in the sequence, i indexes the pairs of elements within the vector, and d_model is the size of the token embedding. The intuition behind this formula is that it creates a unique, periodic representation for each position, where different dimensions oscillate at different frequencies. This allows the model to capture both absolute and relative positions of the tokens, and to generalise to longer sequences than those seen during training.
The sinusoidal positional encoding is applied to the transformer model by adding it to the token embedding before feeding it to the encoder or decoder. The positional encoding vector has the same size as the token embedding vector, so that they can be summed element-wise. The resulting vector contains both semantic and positional information of the token, which can be processed by the attention and feed-forward layers of the transformer.
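The encoding and the element-wise addition described above can be sketched in NumPy; the function name, sequence length, and embedding size here are illustrative, not from the original paper:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) sinusoidal positional encoding matrix."""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# Add the encoding to (hypothetical) token embeddings element-wise,
# as done before the encoder/decoder stack.
embeddings = np.random.randn(10, 64)   # 10 tokens, d_model = 64
encoded = embeddings + sinusoidal_positional_encoding(10, 64)
```

Because the two matrices have the same shape, the sum is a plain element-wise addition; the resulting vectors carry both semantic and positional information.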
Rotary Positional Embedding (RoPE)
Rotary Position Embedding (RoPE) is a method for encoding positional information in transformer models proposed in the paper RoFormer: Enhanced Transformer with Rotary Position Embedding.
Prior position-encoding mechanisms, such as the sinusoidal encoding used in the original Transformer, directly added position embeddings to the input token embeddings. This can make the model less flexible when dealing with variable sequence lengths.
Unlike sinusoidal position encoding, RoPE encodes absolute position information using rotation matrices rather than by adding position embeddings. The query and key vectors in self-attention are multiplied by these rotation matrices before the attention scores are computed.
More specifically, each pair of dimensions in the query and key vectors is rotated by an angle proportional to the token's position in the sequence; the per-pair rotation frequencies are predetermined constants derived from the dimensionality of the embeddings. Because the angle difference between two positions depends only on their distance, the dot product between rotated queries and keys encodes relative position information between tokens.
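A minimal NumPy sketch of this rotation, treating each consecutive pair of dimensions as a 2-D vector rotated by a position-dependent angle (the function name and shapes are illustrative assumptions):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, d), d even.

    Pair (2i, 2i+1) at position p is rotated by angle p * base**(-2i/d).
    """
    seq_len, d = x.shape
    positions = np.arange(seq_len)[:, np.newaxis]       # (seq_len, 1)
    exponents = np.arange(0, d, 2) / d                  # (d/2,)
    freqs = (1.0 / base ** exponents)[np.newaxis, :]    # (1, d/2)
    angles = positions * freqs                          # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin   # standard 2-D rotation
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Rotate query and key projections before computing attention scores.
q = np.random.randn(8, 64)
k = np.random.randn(8, 64)
scores = rope(q) @ rope(k).T
```

Note that rotation is norm-preserving, which underlies the compatibility with linear attention discussed below.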
Some key advantages of RoPE:
- It naturally incorporates relative position information through the rotation operations. The inner product between query and key vectors will decay with increasing relative distance between tokens, which matches the intuition that distant tokens should have less connection.
- It keeps the norm of the token embeddings unchanged, so it can be flexibly combined with linear attention mechanisms like Performer.
- There are no restrictions on maximum sequence length like with learned absolute position embeddings.
- It showed improved performance over baseline Transformer models across machine translation, language modeling, and GLUE benchmark tasks.
RoPE embeddings have also opened up the potential for increasing context lengths beyond the training length through the application of RoPE scaling, which rescales the rotation angles (for example, by interpolating positions) at inference time.
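As an illustration, one common RoPE-scaling variant, linear position interpolation, simply divides positions by a scale factor so that a longer sequence maps onto the position range seen during training. This is a hedged sketch, not any particular library's API; the function name and parameters are assumptions:

```python
import numpy as np

def rope_scaled(x, scale=1.0, base=10000.0):
    """Rotary embedding with linear position interpolation.

    Positions are divided by `scale`, so a sequence `scale` times longer
    than the training length reuses the original range of rotation angles.
    """
    seq_len, d = x.shape
    positions = np.arange(seq_len)[:, np.newaxis] / scale   # interpolated positions
    exponents = np.arange(0, d, 2) / d
    freqs = (1.0 / base ** exponents)[np.newaxis, :]
    angles = positions * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

# With scale=4, position 4 receives the same rotation angles that
# position 1 would receive without scaling.
```

Other variants instead adjust the frequency base rather than the positions; both aim to keep angles within the range the model saw during training.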