Attention is All You Need

Published: 12 June 2017
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Institution: Google Brain
arXiv abstract: [[1]]
arXiv PDF: [[2]]

The paper Attention is all you need was published in 2017 by a team of researchers from Google Brain and Google Research. The paper proposed a novel architecture for sequence transduction, which is the task of transforming one sequence into another, such as machine translation, text summarisation, or speech recognition. The architecture, called the Transformer, is based entirely on attention mechanisms, which are a way of computing the relevance or similarity between different parts of the input and output sequences. The paper showed that the Transformer outperformed the previous state-of-the-art models, which were based on recurrent or convolutional neural networks, in terms of quality, speed, and scalability.

Novel Mechanisms

The paper introduced two attention mechanisms: scaled dot-product attention and multi-head attention. Scaled dot-product attention computes a weighted average of a set of value vectors, where the weights are derived from the dot products of a query vector with a set of key vectors. The dot products are divided by the square root of the key dimensionality, which keeps their magnitude in check for large dimensions and prevents the softmax function from saturating into regions with vanishingly small gradients. Multi-head attention applies scaled dot-product attention several times in parallel, each time with different learned linear projections of the queries, keys, and values, and concatenates the results. This allows the model to attend to information from different representation subspaces of the input and output sequences.
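The NumPy sketch below illustrates these two mechanisms under the paper's definition Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The array shapes, the softmax helper, and the omission of the final output projection are simplifications for illustration, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # one probability distribution per query
    return weights @ V                   # weighted average of the value vectors

def multi_head_attention(X_q, X_kv, head_projections):
    """head_projections: list of (W_q, W_k, W_v) tuples, one per head."""
    head_outputs = [
        # Each head attends within its own learned subspace of the inputs.
        scaled_dot_product_attention(X_q @ W_q, X_kv @ W_k, X_kv @ W_v)
        for W_q, W_k, W_v in head_projections
    ]
    # The paper concatenates the heads and applies a final output projection,
    # which is omitted here for brevity.
    return np.concatenate(head_outputs, axis=-1)
```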

The paper also relied on two further components: positional encoding and layer normalization. Because the model contains no recurrence and no convolution, it needs some way of injecting information about the position of each token in the sequence; the paper encoded absolute positions with sinusoidal functions of different frequencies and added the resulting vectors to the input embeddings, a scheme that also makes it easy for the model to attend by relative position. Layer normalization (Ba et al., 2016) normalizes the activations within each layer. The paper wrapped every sub-layer (self-attention or feed-forward) in a residual connection and applied layer normalization to the sum, i.e. LayerNorm(x + Sublayer(x)).
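A minimal sketch of both components, assuming an even model dimension and omitting the learnable gain and bias that layer normalization normally carries:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of position encodings (d_model assumed even)."""
    positions = np.arange(max_len)[:, None]              # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2)
    angles = positions / (10000.0 ** (even_dims / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)   # cosine on odd dimensions
    return pe                      # added element-wise to the input embeddings

def layer_norm(x, eps=1e-6):
    # Normalize each token's features to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # Arrangement described in the paper: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))
```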

Architecture

The paper presented the Transformer as an encoder-decoder model, where both the encoder and decoder consist of a stack of layers built from self-attention and feed-forward sub-layers. The encoder maps the input sequence to a sequence of continuous representations, and the decoder generates the output sequence one token at a time, conditioning on those representations and on the tokens it has already produced. Each decoder layer also contains an additional attention sub-layer that attends over the encoder outputs, and the decoder's self-attention is masked so that each position can only attend to earlier positions. The base model used six layers for both the encoder and the decoder, eight attention heads, a model dimension of 512, and a feed-forward inner dimension of 2048.
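For reference, the base-model hyperparameters reported in the paper are collected below as a Python dictionary, followed by an illustrative encoder layer that reuses the helper functions from the sketches above; the weight arguments are hypothetical placeholders rather than the paper's actual parameterization.

```python
# Hyperparameters of the "base" Transformer reported in the paper.
TRANSFORMER_BASE = {
    "num_encoder_layers": 6,
    "num_decoder_layers": 6,
    "d_model": 512,   # shared width of embeddings and all sub-layer outputs
    "num_heads": 8,
    "d_k": 64,        # per-head query/key width (d_model / num_heads)
    "d_v": 64,        # per-head value width
    "d_ff": 2048,     # inner width of the position-wise feed-forward sub-layer
    "dropout": 0.1,
}

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: two linear maps with a ReLU in between.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, head_projections, ff_weights):
    # Self-attention sub-layer, wrapped in a residual connection and layer norm.
    x = residual_sublayer(x, lambda h: multi_head_attention(h, h, head_projections))
    # Position-wise feed-forward sub-layer, wrapped the same way.
    x = residual_sublayer(x, lambda h: feed_forward(h, *ff_weights))
    return x
```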

Performance

The paper evaluated the Transformer on two machine translation tasks: WMT 2014 English-to-German and WMT 2014 English-to-French. During training it applied label smoothing (with ε = 0.1) as a regularizer, which hurts perplexity but improves accuracy and BLEU, and at inference time it generated outputs with beam search (beam size 4) and a length penalty (α = 0.6). The Transformer achieved 28.4 BLEU on English-to-German, improving over the best previously reported results, including ensembles, by more than 2 BLEU, and 41.8 BLEU on English-to-French, establishing a new single-model state of the art. The paper also showed that the Transformer was more parallelizable than recurrent or convolutional models, reaching these scores at a fraction of the training cost of the previous best systems.
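As a rough illustration of these training and decoding details, the sketch below shows one common formulation of label smoothing with the paper's ε = 0.1, and the length penalty of Wu et al. (2016), which the paper's beam search follows with α = 0.6. The exact smoothing distribution and penalty formula are assumptions taken from those standard formulations rather than spelled out in the paper itself.

```python
import numpy as np

def label_smoothed_targets(labels, vocab_size, epsilon=0.1):
    """Replace one-hot targets with (1 - eps) on the true token and eps spread elsewhere."""
    targets = np.full((len(labels), vocab_size), epsilon / (vocab_size - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - epsilon
    return targets

def smoothed_cross_entropy(log_probs, labels, epsilon=0.1):
    # Cross-entropy between the model's log-probabilities and the smoothed targets.
    targets = label_smoothed_targets(labels, log_probs.shape[-1], epsilon)
    return -(targets * log_probs).sum(axis=-1).mean()

def length_penalty(length, alpha=0.6):
    # Length normalization from Wu et al. (2016), used to rescore beam-search hypotheses.
    return ((5.0 + length) / 6.0) ** alpha
```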

The results suggested that attention is indeed all you need for sequence transduction: the model captures long-range dependencies and contextual information without any recurrence or convolution. The paper also showed that the approach scales to larger models and data sets (the "big" configuration gave the best results) and generalizes beyond translation, performing competitively on English constituency parsing.

Subsequent Work

The paper inspired many subsequent works that extended or improved the Transformer architecture for a wide range of NLP applications. It also laid the groundwork for large-scale pre-trained language models, such as GPT-3 and LLaMA, which have reshaped the field of NLP and deep learning.

See Also