[Paper Notes] Wayformer: Motion Forecasting via Simple & Efficient Attention Networks

Devin Z
Feb 5, 2024

Application of transformer models in autonomous driving.

  • Challenges with behavior prediction of real-world agents:
    - The output is highly unstructured and multimodal.
    - The input consists of a heterogeneous mix of modalities.
  • Previous works require either excessive modality-specific tuning or extensive exploration of modeling options.
  • Two components of Wayformer:
    - a self-attention scene encoder
    - a cross-attention decoder
  • Multimodal data includes:
    - history of each agent
    - interactions of an agent with the closest context agents
    - closest segments in the roadgraph
    - closest traffic signal states
  • Before encoding:
    - Transform the scene into an agent’s ego-centric frame of reference.
    - Project different modalities into the same dimension.
    - Add learned positional embeddings to each modality (see the sketch below).
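A minimal sketch of this preprocessing step, assuming PyTorch. All names, feature sizes, and sequence lengths are illustrative placeholders rather than the paper's values:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Project every modality to a shared width, then add a learned,
    per-modality positional embedding (inputs: [batch, seq_len, feat_dim])."""
    def __init__(self, feat_dims: dict, seq_lens: dict, d_model: int = 256):
        super().__init__()
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, d_model) for name, dim in feat_dims.items()}
        )
        self.pos = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(seq_lens[name], d_model))
             for name in feat_dims}
        )

    def forward(self, inputs: dict) -> dict:
        return {name: self.proj[name](x) + self.pos[name]
                for name, x in inputs.items()}

# Hypothetical shapes for the four modalities listed above.
feat_dims = {"history": 8, "interactions": 8, "roadgraph": 16, "signals": 4}
seq_lens = {"history": 11, "interactions": 64, "roadgraph": 512, "signals": 16}
projector = ModalityProjector(feat_dims, seq_lens)
batch = {k: torch.randn(2, seq_lens[k], feat_dims[k]) for k in feat_dims}
tokens = projector(batch)  # every modality now has shape [2, seq_len, 256]
```

Once all modalities share the same width, their token sequences can be concatenated and attended over jointly, which is what the fusion strategies below differ on.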
  • Three types of fusion strategies:
    - In late fusion, each modality has a dedicated encoder, and the encoders' outputs are concatenated.
    - In early fusion, inputs from different modalities are concatenated before being fed into a cross-modal encoder.
    - In hierarchical fusion, model capacity is split between modality-specific self-attention encoders and a cross-modal encoder, stacked hierarchically (see the sketch below).
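A rough sketch of where the capacity sits in each strategy, reusing the projected tokens from the previous snippet and using stock nn.TransformerEncoder blocks as stand-ins for the paper's encoders (layer counts are arbitrary):

```python
import torch
import torch.nn as nn

def make_encoder(d_model: int = 256, layers: int = 2) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

tokens = {k: torch.randn(2, n, 256) for k, n in
          {"history": 11, "interactions": 64, "roadgraph": 512, "signals": 16}.items()}

# Early fusion: concatenate first; one cross-modal encoder sees everything.
early_out = make_encoder(layers=4)(torch.cat(list(tokens.values()), dim=1))

# Late fusion: a dedicated encoder per modality; outputs are concatenated.
late = {k: make_encoder(layers=4) for k in tokens}
late_out = torch.cat([late[k](v) for k, v in tokens.items()], dim=1)

# Hierarchical fusion: shallow per-modality encoders feed a cross-modal one.
per_mod = {k: make_encoder(layers=2) for k in tokens}
cross_out = make_encoder(layers=2)(
    torch.cat([per_mod[k](v) for k, v in tokens.items()], dim=1))
```

The three variants differ only in where the concatenation happens relative to the self-attention capacity.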
  • Scalability issues with transformer networks:
    - Self-attention is quadratic in the input sequence length.
    - Position-wise feed-forward networks are expensive sub-networks.
  • Two techniques to trade quality for efficiency:
    - Factorize multi-axis attention into separate attention layers over individual axes, e.g. one over the temporal dimension and another over the spatial dimension.
    - Map the high-dimensional input into a lower-dimensional latent space using learned latent queries (see the sketch below).
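A sketch of the latent-space technique, assuming a Perceiver-style learned latent query (sizes illustrative): a short set of learned queries cross-attends to the long multimodal sequence once, so later self-attention layers cost O(L²) in the latent length L instead of O(N²) in the input length N:

```python
import torch
import torch.nn as nn

class LatentQueryAttention(nn.Module):
    """Compress a length-N input into L learned latents via one cross-attention."""
    def __init__(self, d_model: int = 256, num_latents: int = 16, nhead: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d_model))
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [batch, N, d_model]
        q = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(q, x, x)  # queries = latents; keys/values = full input
        return out                   # [batch, L, d_model]

x = torch.randn(2, 603, 256)        # concatenated multimodal token sequence
pooled = LatentQueryAttention()(x)  # -> [2, 16, 256]
```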
  • The output format is the same as in MultiPath² and MultiPath++³.
    - The forecast for each agent is a Gaussian mixture model, where each mode is a sequence of states offset from a prior anchor trajectory.
    - Given an anchor trajectory, the scene-specific offsets at different time steps are assumed to be conditionally independent.
    - A classification head outputs the mixture likelihoods for different modes, which represent intent uncertainty.
    - A regression head outputs the means and covariances of the offset at each time step from each anchor trajectory, which describe control uncertainty.
    - The negative log-likelihood of a ground-truth trajectory decomposes into a classification loss on the mixture weights and a regression loss under the corresponding Gaussians (see the loss sketch below).
    - MultiPath predefines static anchor trajectories through a separate k-means clustering.
    - MultiPath++ learns anchor embeddings as part of the overall model training.
(Figure: MultiPath output format)
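A simplified sketch of the training loss this output format implies, assuming the hard-assignment scheme used in the MultiPath family: select the mode closest to the ground truth, apply a cross-entropy loss on the mixture weights (intent uncertainty) and a regression loss on that mode's per-step means (control uncertainty). Unit variances are assumed for brevity; the actual heads also regress covariances:

```python
import torch
import torch.nn.functional as F

def gmm_loss(logits: torch.Tensor, means: torch.Tensor, gt: torch.Tensor):
    """logits: [B, K] mixture logits; means: [B, K, T, 2] mode trajectories;
    gt: [B, T, 2] ground-truth trajectory."""
    # Hard assignment: the mode closest to ground truth (no gradient here).
    with torch.no_grad():
        dist = ((means - gt[:, None]) ** 2).sum(dim=(-1, -2))  # [B, K]
        best = dist.argmin(dim=-1)                             # [B]
    # Classification term: -log pi_k for the selected mode.
    cls = F.cross_entropy(logits, best)
    # Regression term: Gaussian NLL of gt under the selected mode's per-step
    # distributions, conditionally independent across time given the intent.
    sel = means[torch.arange(means.size(0)), best]             # [B, T, 2]
    reg = 0.5 * ((sel - gt) ** 2).sum(dim=-1).mean()
    return cls + reg

loss = gmm_loss(torch.randn(4, 6), torch.randn(4, 6, 80, 2), torch.randn(4, 80, 2))
```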
  • The GMM outputs are pruned into fewer modes using a trajectory aggregation algorithm.
    - Greedily select k centroid modes one at a time, each chosen to cover the maximum total probability among the modes not yet covered.
    - Iteratively refine the k centroids: assign each mode to its closest centroid, then update every centroid to the probability-weighted average of the modes assigned to it (see the sketch below).
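A minimal NumPy sketch of this aggregation under simplifying assumptions: modes are compared by endpoint distance, a mode counts as covered when it lies within a fixed radius of a centroid, and a single refinement pass is shown where the paper iterates. The helper name, radius, and distance measure are all placeholders:

```python
import numpy as np

def aggregate_modes(trajs, probs, k, radius=2.0):
    """Prune an M-mode GMM to k modes. trajs: [M, T, 2]; probs: [M]."""
    ends = trajs[:, -1]  # compare modes by their final positions
    d = np.linalg.norm(ends[:, None] - ends[None, :], axis=-1)  # [M, M]
    centroids, covered = [], np.zeros(len(trajs), dtype=bool)
    # Greedy step: repeatedly pick the mode covering the most uncovered mass.
    for _ in range(k):
        gain = ((d <= radius) & ~covered) @ probs
        c = int(gain.argmax())
        centroids.append(c)
        covered |= d[c] <= radius
    # Refinement step: assign every mode to its nearest centroid, then move
    # each centroid to the probability-weighted average of its members.
    assign = d[:, centroids].argmin(axis=1)  # [M], values in 0..k-1
    out_trajs, out_probs = [], []
    for j in range(k):
        members = assign == j
        w = probs[members] / probs[members].sum()
        out_trajs.append((w[:, None, None] * trajs[members]).sum(axis=0))
        out_probs.append(probs[members].sum())
    return np.stack(out_trajs), np.array(out_probs)

trajs, probs = np.random.randn(64, 80, 2), np.full(64, 1 / 64)
new_trajs, new_probs = aggregate_modes(trajs, probs, k=6)
```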
  • Limitations:
    - The same scene is encoded repeatedly for different ego-agents.
    - It cannot capture some important nuances in highly interactive scenes.
    - Possible futures for different agents are modeled separately.
    - Given the intent of an agent, the predicted states are temporally conditionally independent.
