The architecture powering large language models.
- Characteristics of self-attention:
- It uses parameter sharing to cope with long input passages of varying lengths, which would otherwise make fully-connected neural networks impractical.
- It contains connections between word representations that depend on the words themselves.
- A self-attention block takes in a sequence of vectors and returns the same number of vectors of the same size (sketched after this list).
- Each input vector is linearly mapped to a key, a value and a query.
- Each output vector is a weighted sum of the values produced from all the input vectors.
- The scalar weight is the attention that an output vector pays to an input vector, and is a non-linear function (softmax) of the key-query inner products.
- Values need to have the same dimension as the input/output vectors, whereas keys and queries can have a different dimension.
- Position encodings are added to reflect the fact that the order of words matters in conveying a message.
- The dot products are divided by the square root of the key/query dimension so that the softmax does not saturate and gradients remain large enough for efficient training.
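A minimal NumPy sketch of the scaled dot-product self-attention described above; the variable names, matrix shapes, and random toy inputs are illustrative assumptions, not taken from the references.

```python
import numpy as np

def softmax(z, axis=-1):
    # subtract the row max for numerical stability
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X holds N input vectors of size D as rows.
    Keys and queries share a dimension d_k; values have dimension D here,
    so the output has the same shape as the input."""
    Q = X @ W_q                                # (N, d_k) queries
    K = X @ W_k                                # (N, d_k) keys
    V = X @ W_v                                # (N, D)   values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # scaled key-query inner products
    A = softmax(scores, axis=-1)               # each output attends over all inputs
    return A @ V                               # weighted sums of the values

# toy usage with random parameters (illustrative only)
rng = np.random.default_rng(0)
N, D, d_k = 5, 8, 4
X = rng.normal(size=(N, D))
out = self_attention(X,
                     rng.normal(size=(D, d_k)),
                     rng.normal(size=(D, d_k)),
                     rng.normal(size=(D, D)))
print(out.shape)  # (5, 8): same number of vectors, same size
```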
- Multi-head attention:
- Perform h parallel attention computations, each with value dimension D/h, concatenate their outputs, and apply a final linear mapping (see the sketch below).
- It has been speculated that this makes the self-attention network more robust to bad initializations.
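A sketch of multi-head attention under the same toy NumPy conventions; the per-head weight lists and the final mixing matrix `Wc` are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wc):
    """X: (N, D). Wq, Wk, Wv are lists with one (D, D//h) matrix per head;
    Wc is the (D, D) linear map applied to the concatenated head outputs."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i               # each (N, D/h)
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wc               # (N, D)

rng = np.random.default_rng(0)
N, D, h = 6, 16, 4
X = rng.normal(size=(N, D))
Wq = [rng.normal(size=(D, D // h)) for _ in range(h)]
Wk = [rng.normal(size=(D, D // h)) for _ in range(h)]
Wv = [rng.normal(size=(D, D // h)) for _ in range(h)]
Wc = rng.normal(size=(D, D))
print(multi_head_attention(X, Wq, Wk, Wv, Wc).shape)         # (6, 16)
```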
- A transformer layer, sketched after this list, consists of
- a multi-head attention layer with a residual connection,
- a LayerNorm operation,
- a position-wise parallel MLP with a residual connection,
- another LayerNorm operation.
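A compact sketch of the layer composition just listed, assuming the post-LayerNorm ordering described above; a single attention head stands in for the multi-head block, and the learned LayerNorm scale/offset parameters are omitted for brevity.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(X, eps=1e-5):
    # normalize each position's vector; learned scale and offset omitted
    return (X - X.mean(axis=-1, keepdims=True)) / (X.std(axis=-1, keepdims=True) + eps)

def transformer_layer(X, Wq, Wk, Wv, W1, W2):
    """Self-attention + residual, LayerNorm, position-wise MLP + residual, LayerNorm."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # single head stands in for multi-head
    att = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V
    X = layer_norm(X + att)                     # residual connection, then LayerNorm
    mlp = np.maximum(0.0, X @ W1) @ W2          # the same two-layer ReLU MLP at every position
    return layer_norm(X + mlp)                  # residual connection, then LayerNorm

rng = np.random.default_rng(0)
N, D, D_ff = 6, 16, 32
X = rng.normal(size=(N, D))
out = transformer_layer(X,
                        rng.normal(size=(D, D)), rng.normal(size=(D, D)), rng.normal(size=(D, D)),
                        rng.normal(size=(D, D_ff)), rng.normal(size=(D_ff, D)))
print(out.shape)  # (6, 16)
```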
- Three types of transformer models:
- encoder model (e.g. BERT)
- decoder model (e.g. GPT-3)
- encoder-decoder model
- Self-supervised training allows the use of enormous amounts of data without the need for manual labels.
- An example is predicting missing words from sentences in a corpus.
- A decoder model is trained to maximize the log probability of the input text under the autoregressive model.
- Masked self-attention allows each position in the decoder to attend to all positions up to and including that position (sketched after this list).
- Beam search is used in practice to find a high-probability output sequence rather than greedily choosing the single most likely next token.
- Enormous language models are few-shot learners in that they can perform many tasks without fine-tuning.
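A sketch of the masked (causal) self-attention mentioned above: scores for future positions are set to minus infinity so that their softmax weight is exactly zero. The names and toy sizes are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X, Wq, Wk, Wv):
    """Causal self-attention: position i may only attend to positions <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])             # (N, N)
    N = scores.shape[0]
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)    # True strictly above the diagonal
    scores = np.where(mask, -np.inf, scores)            # future positions get zero weight
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
N, D = 5, 8
X = rng.normal(size=(N, D))
out = masked_self_attention(X, rng.normal(size=(D, D)),
                            rng.normal(size=(D, D)), rng.normal(size=(D, D)))
print(out.shape)  # (5, 8); row i depends only on inputs 0..i
```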
- Encoder-decoder attention allows decoder embeddings to attend to the encoder embeddings.
- The keys and values come from the output of the encoder.
- The queries come from the previous decoder layer (see the sketch below).
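A sketch of encoder-decoder (cross-)attention as described above: queries come from the decoder embeddings, while keys and values come from the encoder output. All names and shapes are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X_dec, X_enc, Wq, Wk, Wv):
    """Each of the N_dec decoder outputs is a weighted sum of the N_enc encoder values."""
    Q = X_dec @ Wq                                           # (N_dec, d_k) from the decoder
    K = X_enc @ Wk                                           # (N_enc, d_k) from the encoder
    V = X_enc @ Wv                                           # (N_enc, D)   from the encoder
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)     # (N_dec, N_enc)
    return A @ V                                             # (N_dec, D)

rng = np.random.default_rng(0)
N_enc, N_dec, D = 7, 4, 8
X_enc = rng.normal(size=(N_enc, D))
X_dec = rng.normal(size=(N_dec, D))
out = cross_attention(X_dec, X_enc,
                      rng.normal(size=(D, D)), rng.normal(size=(D, D)),
                      rng.normal(size=(D, D)))
print(out.shape)  # (4, 8)
```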
- Self-attention interactions need to be pruned to reduce time complexity for long sequences.
- The idea is similar to convolutional structures.
- Transformers for images:
- The quadratic complexity of self-attention poses a challenge for transformers to work on images.
- Compared with convolutional nets, transformers lack the convolutional inductive bias and therefore need extremely large amounts of training data to match or exceed them.
- The Vision Transformer (ViT) divides an image into patches and linearly maps each patch to a lower-dimensional embedding before adding 1D position encodings (sketched below).
- The Swin (shifted-window) Transformer processes images at multiple scales, shifting the attention windows at each scale.
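A sketch of the ViT-style patch embedding step, assuming non-overlapping square patches and learned 1D position encodings supplied as an array; the function name and toy sizes are assumptions, not the reference implementation.

```python
import numpy as np

def patch_embed(image, patch_size, W, pos_encoding):
    """Split an H x W x C image into non-overlapping patches, flatten each patch,
    map it linearly to a lower-dimensional embedding, and add a position encoding.
    W: (patch_size*patch_size*C, D), pos_encoding: (num_patches, D)."""
    H, Wd, C = image.shape
    p = patch_size
    patches = []
    for i in range(0, H, p):
        for j in range(0, Wd, p):
            patches.append(image[i:i+p, j:j+p, :].reshape(-1))
    patches = np.stack(patches)            # (num_patches, p*p*C)
    return patches @ W + pos_encoding      # (num_patches, D)

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
p, D = 8, 64
num_patches = (32 // p) * (32 // p)        # 16 patches
W = rng.normal(size=(p * p * 3, D))
pos = rng.normal(size=(num_patches, D))
tokens = patch_embed(image, p, W, pos)
print(tokens.shape)  # (16, 64): a sequence of patch tokens for the transformer
```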
References:
- Simon J. D. Prince. Understanding Deep Learning. MIT Press, 2023.
- A. Vaswani et al. Attention is All You Need. NIPS, 2017.