Embeddings are numerical representations of words or phrases that capture their meaning and context. They allow NLP algorithms, and large language models in particular, to process language more effectively and accurately. Embeddings are also called word vectors, token vectors, or semantic vectors, because they represent the semantic similarity and relations between words in a vector space.
Generation of Embeddings
There are different ways to create embeddings, depending on the task and the data. Some of the most common methods are:
Prediction-based embeddings
These embeddings are learned by predicting words from their contexts, as in Word2Vec, or by fitting vectors to global word co-occurrence statistics, as in GloVe. They capture the syntactic and semantic information of words in a low-dimensional, dense vector space by learning from large amounts of text data. They also allow for operations like vector addition and subtraction, which can reveal interesting linguistic properties. For example, king − man + woman ≈ queen.
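The analogy arithmetic above can be sketched in a few lines. The vectors below are toy three-dimensional values invented for illustration (real Word2Vec or GloVe vectors typically have 100–300 dimensions and are learned from data):

```python
import math

# Toy 3-dimensional word vectors (illustrative values, not from a real model).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Compute king - man + woman, element by element.
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the vocabulary word whose vector is closest to the result.
nearest = max(vectors, key=lambda w: cosine(vectors[w], target))
print(nearest)  # -> queen
```

In a real system the search would exclude the query words themselves and run over a vocabulary of tens of thousands of words; here the toy values are chosen so the analogy resolves cleanly.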
Contextualized embeddings
These embeddings represent words or phrases in a specific context, as produced by transformer models. They capture the dynamic, context-dependent meaning of words in different situations by using deep neural networks that can encode both the left and the right context of a word. They also allow for fine-tuning and transfer learning, which can improve the performance of downstream NLP tasks.
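The key property is that the same word receives a different vector depending on its neighbours. Real transformers achieve this with many layers of self-attention; the sketch below fakes it with a single neighbour-averaging step over made-up two-dimensional static vectors, purely to show the context-dependence:

```python
# Toy static vectors (invented for illustration).
static = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank":  [0.5, 0.5],
    "the":   [0.1, 0.1],
}

def contextual(sentence, i):
    """Toy 'contextual' vector: average the word's own static vector
    with the static vectors of its left and right neighbours."""
    window = sentence[max(0, i - 1): i + 2]
    vecs = [static[w] for w in window]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

v1 = contextual(["the", "river", "bank"], 2)  # "bank" next to "river"
v2 = contextual(["the", "money", "bank"], 2)  # "bank" next to "money"
print(v1, v2)  # -> [0.75, 0.25] [0.25, 0.75]: same word, different vectors
```

A static embedding would assign "bank" the same vector in both sentences; the context-sensitive version pulls it toward "river" in one and "money" in the other.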
Frequency-based embeddings
These embeddings are based on the frequency of words or phrases in the training material, such as term frequency–inverse document frequency (TF-IDF) or a co-occurrence matrix. They capture the importance and distribution of words in a document or a collection of documents, but they do not capture syntactic or semantic information very well. They also tend to be high-dimensional and sparse, which can hurt the performance and memory usage of the algorithms that use them. These representations were common in earlier NLP work but are no longer popular for LLMs.
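TF-IDF is simple enough to compute by hand. A minimal sketch over a toy corpus (no smoothing; production libraries such as scikit-learn add smoothing and normalisation on top of this basic formula):

```python
import math

# Toy corpus: each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf_idf(term, doc, docs):
    """Basic TF-IDF: term frequency in one document, weighted down
    by how many documents the term appears in."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# "the" appears in most documents, so it scores low despite being frequent;
# "cat" appears in only one document, so it scores higher.
print(tf_idf("the", docs[0], docs))
print(tf_idf("cat", docs[0], docs))
```

Collecting these scores for every word in the vocabulary yields one long, sparse vector per document, which is exactly the high-dimensionality problem noted above.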
Embeddings are very useful for NLP applications, such as text classification, sentiment analysis, machine translation, question answering, and more. They enable NLP models to understand natural language better and produce more meaningful results.
Technical Structure
Embeddings are vectors in the sense that each one is a list of numbers assigning it a 'position' in a 'space' that represents language and meaning. For example, the embedding vectors of the 7B-parameter LLaMA model each contain 4,096 numbers. Each number in the list corresponds to a dimension in a vector space, and the position of the vector in that space reflects the token's semantic and syntactic properties.
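Concretely, a model's embeddings live in a lookup table with one row per vocabulary token; a token id indexes a row to retrieve its vector. A minimal sketch with a made-up two-word vocabulary and four dimensions (real LLM tables have tens of thousands of rows and thousands of columns):

```python
# Hypothetical vocabulary mapping tokens to row indices.
vocab = {"hello": 0, "world": 1}

# Embedding table: one row (vector) per token. Values are illustrative.
embedding_table = [
    [0.12, -0.48, 0.90, 0.05],   # row 0: vector for "hello"
    [-0.33, 0.71, 0.02, -0.64],  # row 1: vector for "world"
]

def embed(token):
    """Look up a token's embedding vector by its vocabulary index."""
    return embedding_table[vocab[token]]

print(embed("hello"))  # a list of numbers = a position in vector space
```

In a trained model the numbers in this table are learned parameters, updated during training like any other weights.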
Simplified Example
For example, imagine a two-dimensional vector space (a plane) where the x-axis represents the gender of a word and the y-axis represents its grammatical number. In this space, the word “he” would have a negative x-value and a zero y-value, meaning that it is masculine and singular. The word “she” would have a positive x-value and a zero y-value, meaning that it is feminine and singular. The word “they” would have a zero x-value and a positive y-value, meaning that it is neutral and plural. The word “them” would likewise have a zero x-value and a positive y-value; this toy space has no dimension for grammatical case, so it cannot distinguish “they” from “them”.
The distance between two vectors in this space indicates how similar or dissimilar they are in terms of gender and number. The angle between two vectors indicates how related or unrelated they are in terms of gender and number. For example, the angle between “he” and “she” would be 180 degrees, meaning that they are opposite in gender. The angle between “he” and “they” would be 90 degrees, meaning that they are orthogonal or unrelated in gender.
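These angles can be checked numerically. A minimal sketch using coordinates consistent with the toy gender/number space above:

```python
import math

# Toy coordinates: x = gender, y = grammatical number.
he   = [-1.0, 0.0]
she  = [ 1.0, 0.0]
they = [ 0.0, 1.0]

def angle_degrees(a, b):
    """Angle between two vectors, via the cosine of the angle."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return math.degrees(math.acos(dot / norm))

print(angle_degrees(he, she))   # -> 180.0 (opposite in gender)
print(angle_degrees(he, they))  # -> 90.0  (orthogonal: unrelated in gender)
```

The cosine of this angle (cosine similarity) is what embedding-based systems usually compute in practice, since it avoids the `acos` call and ranges conveniently from −1 to 1.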
Of course, this is a very simplified example of how embeddings work. In practice, the values of each token's embedding are learned as part of the base-model training process, and meaning is not usually attributed directly to individual dimensions. These learned models can capture more nuanced and diverse aspects of language, such as sentiment, topic, style, and tone.