[Book Notes] Transformers

Devin Z
3 min read · Jan 29, 2024

The architecture powering large language models.

Photo: Golden Gate Bridge, San Francisco, December 30, 2023

  • Characteristics of self-attention:
    - It uses parameter sharing to cope with long input passages of varying lengths, which would otherwise make fully-connected neural networks impractical.
    - The strength of the connections between word representations depends on the words themselves: the attention weights are computed from the inputs rather than being fixed parameters.
  • A self-attention block takes in a sequence of vectors and returns the same number of vectors of the same size.
    - Each input vector is linearly mapped to a key, a value and a query.
    - Each output vector is a weighted sum of the values produced from all the input vectors.
    - The scalar weight is the attention that an output vector pays to an input vector, and is a non-linear function (softmax) of the key-query inner products.
    - Values need to have the same dimension as the input/output vectors, whereas keys and queries can have a different dimension.
Dot-product self-attention
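Below is a minimal NumPy sketch of the dot-product self-attention just described. The weight matrices and dimensions here are illustrative assumptions, not the book's notation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Dot-product self-attention over a sequence X of shape (N, D)."""
    Q = X @ Wq                        # queries, shape (N, Dk)
    K = X @ Wk                        # keys,    shape (N, Dk)
    V = X @ Wv                        # values,  shape (N, D), same size as the inputs/outputs
    A = softmax(Q @ K.T, axis=-1)     # attention weights: softmax of the key-query inner products
    return A @ V                      # each output is a weighted sum of all the values

# Illustrative sizes: 5 tokens, model width 8, key/query width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = self_attention(X, rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 8)))
print(out.shape)  # (5, 8): same number of vectors, same size
```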
  • Position encodings are added to reflect the fact that the order of words matters in conveying a message.
  • The dot products are scaled by the square root of the query/key dimension so that the softmax does not saturate, which would otherwise make gradients too small for efficient training.
Scaled dot-product self-attention
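A sketch of the scaling together with the sinusoidal position encodings from Vaswani et al. (2017); the sinusoidal scheme is one common choice, and the dimensions below are assumptions.

```python
import numpy as np

def scaled_attention_weights(Q, K):
    """Divide the dot products by sqrt of the query/key dimension before the softmax."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sinusoidal_position_encoding(n_positions, d_model):
    """Fixed sinusoidal encodings (Vaswani et al., 2017), added to the token embeddings."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

X = np.random.default_rng(0).normal(size=(5, 8))
X = X + sinusoidal_position_encoding(5, 8)    # inject word-order information before attention
```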
  • Multi-head attention:
    - Perform h attention operations ("heads") in parallel, each with value dimension D/h, and concatenate their outputs to recover dimension D.
    - It has been speculated that this makes the self-attention network more robust to bad initializations.
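A minimal sketch of multi-head attention as described above: h heads, each working in dimension D/h, whose outputs are concatenated; the final mixing projection Wo and the sizes are assumptions.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (N, D). Wq/Wk/Wv: lists of h per-head projections of shape (D, D//h). Wo: (D, D)."""
    N, D = X.shape
    heads = []
    for i in range(h):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]     # each head works in dimension D//h
        scores = Q @ K.T / np.sqrt(D // h)
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A = A / A.sum(axis=-1, keepdims=True)
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo        # concatenate heads back to width D

rng = np.random.default_rng(0)
N, D, h = 5, 8, 2
X = rng.normal(size=(N, D))
make_heads = lambda: [rng.normal(size=(D, D // h)) for _ in range(h)]
out = multi_head_attention(X, make_heads(), make_heads(), make_heads(), rng.normal(size=(D, D)), h)
print(out.shape)  # (5, 8)
```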
  • A transformer layer consists of
    - a multi-head attention layer with a residual connection,
    - a LayerNorm operation,
    - a position-wise parallel MLP with a residual connection,
    - another LayerNorm operation.
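The layer above, sketched in the post-norm ordering just listed (attention, residual, LayerNorm, position-wise MLP, residual, LayerNorm). The ReLU MLP, its hidden width, and the identity placeholder for attention are assumptions for illustration.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned scale/shift here)."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def transformer_layer(X, attention_fn, W1, b1, W2, b2):
    """Post-norm layer: attention + residual, LayerNorm, position-wise MLP + residual, LayerNorm."""
    X = layer_norm(X + attention_fn(X))               # multi-head attention with residual connection
    hidden = np.maximum(0.0, X @ W1 + b1)             # ReLU MLP applied to each position in parallel
    X = layer_norm(X + hidden @ W2 + b2)              # second residual connection + LayerNorm
    return X

rng = np.random.default_rng(0)
N, D, hidden = 5, 8, 32
X = rng.normal(size=(N, D))
attn = lambda Z: Z                                    # placeholder for the multi-head attention above
out = transformer_layer(X, attn,
                        rng.normal(size=(D, hidden)), np.zeros(hidden),
                        rng.normal(size=(hidden, D)), np.zeros(D))
print(out.shape)  # (5, 8)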
  • Three types of transformer models:
    - encoder model (e.g. BERT)
    - decoder model (e.g. GPT-3)
    - encoder-decoder model
  • Self-supervised training allows the use of enormous amounts of data without the need for manual labels.
    - An example is predicting missing words from sentences in a corpus.
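A toy illustration of the missing-word objective; the sentence and the `<mask>` token are made up for the example.

```python
import random

sentence = "the cat sat on the mat".split()
i = random.randrange(len(sentence))
target = sentence[i]
masked = sentence[:i] + ["<mask>"] + sentence[i + 1:]
# The model is trained to predict `target` at the masked position from the surrounding words,
# so the training labels come for free from raw text.
print(masked, "->", target)
```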
  • A decoder model is trained to maximize the log probability of the training text under the autoregressive model.
  • Masked self-attention lets each position in the decoder attend only to positions up to and including itself, so no information leaks from future tokens.
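A sketch of the causal mask and of the autoregressive training loss described in the two points above; the shapes and the probability interface are assumptions.

```python
import numpy as np

def causal_self_attention_weights(Q, K):
    """Masked (causal) attention: position i attends only to positions j <= i."""
    N, dk = Q.shape
    scores = Q @ K.T / np.sqrt(dk)
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # True above the diagonal (future tokens)
    scores = np.where(mask, -np.inf, scores)           # block attention to later positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def autoregressive_nll(token_ids, probs):
    """Training loss: negative log probability of each observed next token under the model."""
    # probs[t] is the model's distribution over the vocabulary at position t,
    # predicted from the preceding tokens; token_ids[t] is the observed token there.
    return -np.mean(np.log(probs[np.arange(len(token_ids)), token_ids]))
```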
  • Beam search is used in practice to generate a high-probability output sequence, rather than greedily committing to the single most likely next token at each step.
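A toy beam-search sketch; `next_token_probs` (a function returning a probability distribution over the vocabulary for a given prefix) and the beam width are assumed for illustration.

```python
import numpy as np

def beam_search(next_token_probs, start, n_steps, beam_width=3):
    """Keep the beam_width most probable partial sequences at each step."""
    beams = [(0.0, [start])]                      # (log probability, token sequence)
    for _ in range(n_steps):
        candidates = []
        for logp, seq in beams:
            probs = next_token_probs(seq)         # assumed interface: distribution over next token
            for tok, p in enumerate(probs):
                if p > 0:
                    candidates.append((logp + np.log(p), seq + [tok]))
        # Keep only the best beam_width extended sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams
```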
  • Enormous language models are few-shot learners in that they can perform many tasks from just a few examples supplied in the prompt, without fine-tuning.
  • Encoder-decoder attention allows decoder embeddings to attend to the encoder embeddings.
    - The keys and values come from the output of the encoder.
    - The queries come from the previous decoder layer.
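A sketch of encoder-decoder (cross-)attention as described above: queries come from the previous decoder layer, keys and values from the encoder output; the projection matrices are assumptions.

```python
import numpy as np

def cross_attention(decoder_X, encoder_out, Wq, Wk, Wv):
    """Encoder-decoder attention: queries from the decoder, keys/values from the encoder output."""
    Q = decoder_X @ Wq        # (N_dec, Dk), from the previous decoder layer
    K = encoder_out @ Wk      # (N_enc, Dk), from the encoder output
    V = encoder_out @ Wv      # (N_enc, Dv), from the encoder output
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V              # one output per decoder position
```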
  • For long sequences, self-attention interactions can be pruned to reduce the quadratic time complexity.
    - The idea is similar to convolutional structures (see the local-window sketch below).
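One common pruning scheme restricts each position to a local window, much like a convolution kernel; the window size below is illustrative.

```python
import numpy as np

def local_attention_mask(n_tokens, window=2):
    """Allow position i to attend only to positions within `window` of i, as a convolution would."""
    idx = np.arange(n_tokens)
    return np.abs(idx[:, None] - idx[None, :]) <= window   # True where attention is allowed

print(local_attention_mask(6, window=1).astype(int))
```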
  • Transformers for images:
    - The quadratic complexity of self-attention poses a challenge for transformers to work on images.
    - Compared with convolutional nets, transformers lack the convolutional inductive bias and therefore need extremely large amounts of training data to outperform them.
    - The Vision Transformer (ViT) divides an image into patches and linearly maps them to a lower dimension before adding 1D position encodings.
    - The Swin (shifted-window) Transformer processes images at multiple scales, computing attention within local windows that are shifted between successive layers at each scale.
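A sketch of the ViT-style patching step (split into non-overlapping patches, flatten, linear projection); the patch size and embedding width are assumptions, and 1D position encodings would then be added to the resulting tokens.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size, W_embed):
    """ViT-style tokenization: split an image (H, W, C) into non-overlapping patches and embed each."""
    H, W, C = image.shape
    p = patch_size
    patches = (image[:H - H % p, :W - W % p]        # crop so the image tiles exactly
               .reshape(H // p, p, W // p, p, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, p * p * C))             # (num_patches, p*p*C) flattened patches
    return patches @ W_embed                        # linear map to the transformer width

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))
tokens = image_to_patch_tokens(img, patch_size=16, W_embed=rng.normal(size=(16 * 16 * 3, 64)))
print(tokens.shape)  # (4, 64): a 2x2 grid of 16x16 patches
```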

References:

  1. Simon J. D. Prince. Understanding Deep Learning. MIT Press, 2023.
  2. A. Vaswani et al. Attention is All You Need. NIPS, 2017.
