[Book Notes] Transformers

Devin Z
3 min read · Jan 29, 2024

The architecture powering large language models.

Photo: Golden Gate Bridge, San Francisco, December 30, 2023

  • Characteristics of self-attention:
    - It uses parameter sharing to cope with long input passages of varying lengths, which would otherwise make fully-connected neural networks impractical.
    - The strength of the connections between word representations depends on the words themselves: the attention weights are computed from the inputs rather than being fixed parameters.
  • A self-attention block takes in a sequence of vectors and returns the same number of vectors of the same size.
    - Each input vector is linearly mapped to a key, a value and a query.
    - Each output vector is a weighted sum of the values produced from all the input vectors.
    - The scalar weight is the attention that an output vector pays to an input vector, and is a non-linear function (softmax) of the key-query inner products.
    - Values need to have the same dimension as the input/output vectors, whereas keys and queries can have a different dimension.
Dot-product self-attention
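Below is a minimal NumPy sketch of the dot-product self-attention just described. The weight matrices and dimensions here are illustrative assumptions, not the book's notation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Dot-product self-attention over a sequence X of shape (N, D)."""
    Q = X @ Wq                        # queries, shape (N, Dk)
    K = X @ Wk                        # keys,    shape (N, Dk)
    V = X @ Wv                        # values,  shape (N, D), same size as the inputs/outputs
    A = softmax(Q @ K.T, axis=-1)     # attention weights: softmax of the key-query inner products
    return A @ V                      # each output is a weighted sum of all the values

# Illustrative sizes: 5 tokens, model width 8, key/query width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = self_attention(X, rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 8)))
print(out.shape)  # (5, 8): same number of vectors, same size
```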
  • Position encodings are added to reflect the fact that the order of words matters in conveying a message.
  • The dot products are scaled by the square root of the query/key dimension so that the softmax does not saturate, which would otherwise make gradients too small for efficient training.
Scaled dot-product self-attention
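A sketch of the scaling together with the sinusoidal position encodings from Vaswani et al. (2017); the sinusoidal scheme is one common choice, and the dimensions below are assumptions.

```python
import numpy as np

def scaled_attention_weights(Q, K):
    """Divide the dot products by sqrt of the query/key dimension before the softmax."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sinusoidal_position_encoding(n_positions, d_model):
    """Fixed sinusoidal encodings (Vaswani et al., 2017), added to the token embeddings."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

X = np.random.default_rng(0).normal(size=(5, 8))
X = X + sinusoidal_position_encoding(5, 8)    # inject word-order information before attention
```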
  • Multi-head attention:
    - Perform h attention operations ("heads") in parallel, each with value dimension D/h, and concatenate their outputs to recover dimension D.
    - It has been speculated that this makes the self-attention network more robust to bad initializations.
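A minimal sketch of multi-head attention as described above: h heads, each working in dimension D/h, whose outputs are concatenated; the final mixing projection Wo and the sizes are assumptions.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (N, D). Wq/Wk/Wv: lists of h per-head projections of shape (D, D//h). Wo: (D, D)."""
    N, D = X.shape
    heads = []
    for i in range(h):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]     # each head works in dimension D//h
        scores = Q @ K.T / np.sqrt(D // h)
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A = A / A.sum(axis=-1, keepdims=True)
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo        # concatenate heads back to width D

rng = np.random.default_rng(0)
N, D, h = 5, 8, 2
X = rng.normal(size=(N, D))
make_heads = lambda: [rng.normal(size=(D, D // h)) for _ in range(h)]
out = multi_head_attention(X, make_heads(), make_heads(), make_heads(), rng.normal(size=(D, D)), h)
print(out.shape)  # (5, 8)
```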
  • A transformer layer consists of
    - a multi-head attention layer with a residual connection,
    - a LayerNorm operation,
    - a position-wise parallel MLP with a residual connection,
    - another LayerNorm operation.
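The layer above, sketched in the post-norm ordering just listed (attention, residual, LayerNorm, position-wise MLP, residual, LayerNorm). The ReLU MLP, its hidden width, and the identity placeholder for attention are assumptions for illustration.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned scale/shift here)."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def transformer_layer(X, attention_fn, W1, b1, W2, b2):
    """Post-norm layer: attention + residual, LayerNorm, position-wise MLP + residual, LayerNorm."""
    X = layer_norm(X + attention_fn(X))               # multi-head attention with residual connection
    hidden = np.maximum(0.0, X @ W1 + b1)             # ReLU MLP applied to each position in parallel
    X = layer_norm(X + hidden @ W2 + b2)              # second residual connection + LayerNorm
    return X

rng = np.random.default_rng(0)
N, D, hidden = 5, 8, 32
X = rng.normal(size=(N, D))
attn = lambda Z: Z                                    # placeholder for the multi-head attention above
out = transformer_layer(X, attn,
                        rng.normal(size=(D, hidden)), np.zeros(hidden),
                        rng.normal(size=(hidden, D)), np.zeros(D))
print(out.shape)  # (5, 8)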
  • Three types of transformer models:
    - encoder model (e.g. BERT)
    - decoder model (e.g. GPT-3)
    - encoder-decoder model
  • Self-supervised training allows the use of enormous amounts of data without the need for manual labels.
    - An example is predicting missing words from sentences in a corpus.
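A toy illustration of the missing-word objective; the sentence and the `<mask>` token are made up for the example.

```python
import random

sentence = "the cat sat on the mat".split()
i = random.randrange(len(sentence))
target = sentence[i]
masked = sentence[:i] + ["<mask>"] + sentence[i + 1:]
# The model is trained to predict `target` at the masked position from the surrounding words,
# so the training labels come for free from raw text.
print(masked, "->", target)
```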
  • A decoder model is trained to maximize the log probability of the training text under the autoregressive model.
  • Masked self-attention lets each position in the decoder attend only to positions up to and including itself, so no information leaks from future tokens.
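A sketch of the causal mask and of the autoregressive training loss described in the two points above; the shapes and the probability interface are assumptions.

```python
import numpy as np

def causal_self_attention_weights(Q, K):
    """Masked (causal) attention: position i attends only to positions j <= i."""
    N, dk = Q.shape
    scores = Q @ K.T / np.sqrt(dk)
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # True above the diagonal (future tokens)
    scores = np.where(mask, -np.inf, scores)           # block attention to later positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def autoregressive_nll(token_ids, probs):
    """Training loss: negative log probability of each observed next token under the model."""
    # probs[t] is the model's distribution over the vocabulary at position t,
    # predicted from the preceding tokens; token_ids[t] is the observed token there.
    return -np.mean(np.log(probs[np.arange(len(token_ids)), token_ids]))
```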
  • Beam search is used in practice to generate a high-probability output sequence, rather than greedily committing to the single most likely next token at each step.
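A toy beam-search sketch; `next_token_probs` (a function returning a probability distribution over the vocabulary for a given prefix) and the beam width are assumed for illustration.

```python
import numpy as np

def beam_search(next_token_probs, start, n_steps, beam_width=3):
    """Keep the beam_width most probable partial sequences at each step."""
    beams = [(0.0, [start])]                      # (log probability, token sequence)
    for _ in range(n_steps):
        candidates = []
        for logp, seq in beams:
            probs = next_token_probs(seq)         # assumed interface: distribution over next token
            for tok, p in enumerate(probs):
                if p > 0:
                    candidates.append((logp + np.log(p), seq + [tok]))
        # Keep only the best beam_width extended sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams
```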
  • Enormous language models are few-shot learners in that they can perform many tasks from just a few examples supplied in the prompt, without fine-tuning.
  • Encoder-decoder attention allows decoder embeddings to attend to the encoder embeddings.
    - The keys and values come from the output of the encoder.
    - The queries come from the previous decoder layer.
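A sketch of encoder-decoder (cross-)attention as described above: queries come from the previous decoder layer, keys and values from the encoder output; the projection matrices are assumptions.

```python
import numpy as np

def cross_attention(decoder_X, encoder_out, Wq, Wk, Wv):
    """Encoder-decoder attention: queries from the decoder, keys/values from the encoder output."""
    Q = decoder_X @ Wq        # (N_dec, Dk), from the previous decoder layer
    K = encoder_out @ Wk      # (N_enc, Dk), from the encoder output
    V = encoder_out @ Wv      # (N_enc, Dv), from the encoder output
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V              # one output per decoder position
```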
  • For long sequences, self-attention interactions can be pruned to reduce the quadratic time complexity.
    - The idea is similar to convolutional structures (see the local-window sketch below).
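One common pruning scheme restricts each position to a local window, much like a convolution kernel; the window size below is illustrative.

```python
import numpy as np

def local_attention_mask(n_tokens, window=2):
    """Allow position i to attend only to positions within `window` of i, as a convolution would."""
    idx = np.arange(n_tokens)
    return np.abs(idx[:, None] - idx[None, :]) <= window   # True where attention is allowed

print(local_attention_mask(6, window=1).astype(int))
```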
  • Transformers for images:
    - The quadratic complexity of self-attention poses a challenge for transformers to work on images.
    - Compared with convolutional nets, transformers lack the convolutional inductive bias and therefore need extremely large amounts of training data to outperform them.
    - The Vision Transformer (ViT) divides an image into patches and linearly maps them to a lower dimension before adding 1D position encodings.
    - The Swin (shifted-window) Transformer processes images at multiple scales, computing attention within local windows that are shifted between successive layers at each scale.
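A sketch of the ViT-style patching step (split into non-overlapping patches, flatten, linear projection); the patch size and embedding width are assumptions, and 1D position encodings would then be added to the resulting tokens.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size, W_embed):
    """ViT-style tokenization: split an image (H, W, C) into non-overlapping patches and embed each."""
    H, W, C = image.shape
    p = patch_size
    patches = (image[:H - H % p, :W - W % p]        # crop so the image tiles exactly
               .reshape(H // p, p, W // p, p, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, p * p * C))             # (num_patches, p*p*C) flattened patches
    return patches @ W_embed                        # linear map to the transformer width

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))
tokens = image_to_patch_tokens(img, patch_size=16, W_embed=rng.normal(size=(16 * 16 * 3, 64)))
print(tokens.shape)  # (4, 64): a 2x2 grid of 16x16 patches
```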

References:

  1. Simon J. D. Prince. Understanding Deep Learning. MIT Press, 2023.
  2. A. Vaswani et al. Attention is All You Need. NIPS, 2017.
