[Paper Notes] Wayformer: Motion Forecasting via Simple & Efficient Attention Networks

Devin Z
Feb 5, 2024

Application of transformer models in autonomous driving.

  • Challenges with behavior prediction of real-world agents:
    - The output is highly unstructured and multimodal.
    - The input consists of a heterogeneous mix of modalities.
  • Previous works require either excessive modality-specific tuning or extensive exploration of modeling options.
  • Two components of Wayformer:
    - a self-attention scene encoder
    - a cross-attention decoder
  • Multimodal data includes:
    - history of each agent
    - interactions of an agent with the closest context agents
    - closest segments in the roadgraph
    - closest traffic signal states
  • Before encoding:
    - Transform the scene into an agent’s ego-centric frame of reference.
    - Project different modalities into the same dimension.
    - Add learned positional embeddings to each modality (see the sketch below).
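A minimal sketch of this preprocessing step, assuming PyTorch. All names, feature sizes, and sequence lengths are illustrative placeholders rather than the paper's values:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Project every modality to a shared width, then add a learned,
    per-modality positional embedding (inputs: [batch, seq_len, feat_dim])."""
    def __init__(self, feat_dims: dict, seq_lens: dict, d_model: int = 256):
        super().__init__()
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, d_model) for name, dim in feat_dims.items()}
        )
        self.pos = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(seq_lens[name], d_model))
             for name in feat_dims}
        )

    def forward(self, inputs: dict) -> dict:
        return {name: self.proj[name](x) + self.pos[name]
                for name, x in inputs.items()}

# Hypothetical shapes for the four modalities listed above.
feat_dims = {"history": 8, "interactions": 8, "roadgraph": 16, "signals": 4}
seq_lens = {"history": 11, "interactions": 64, "roadgraph": 512, "signals": 16}
projector = ModalityProjector(feat_dims, seq_lens)
batch = {k: torch.randn(2, seq_lens[k], feat_dims[k]) for k in feat_dims}
tokens = projector(batch)  # every modality now has shape [2, seq_len, 256]
```

Once all modalities share the same width, their token sequences can be concatenated and attended over jointly, which is what the fusion strategies below differ on.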
  • Three types of fusion strategies:
    - In late fusion, each modality has a dedicated encoder, and the encoders' outputs are concatenated.
    - In early fusion, inputs from different modalities are concatenated before being fed into a cross-modal encoder.
    - In hierarchical fusion, model capacity is split between modality-specific self-attention encoders and a cross-modal encoder, stacked hierarchically (see the sketch below).
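A rough sketch of where the capacity sits in each strategy, reusing the projected tokens from the previous snippet and using stock nn.TransformerEncoder blocks as stand-ins for the paper's encoders (layer counts are arbitrary):

```python
import torch
import torch.nn as nn

def make_encoder(d_model: int = 256, layers: int = 2) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

tokens = {k: torch.randn(2, n, 256) for k, n in
          {"history": 11, "interactions": 64, "roadgraph": 512, "signals": 16}.items()}

# Early fusion: concatenate first; one cross-modal encoder sees everything.
early_out = make_encoder(layers=4)(torch.cat(list(tokens.values()), dim=1))

# Late fusion: a dedicated encoder per modality; outputs are concatenated.
late = {k: make_encoder(layers=4) for k in tokens}
late_out = torch.cat([late[k](v) for k, v in tokens.items()], dim=1)

# Hierarchical fusion: shallow per-modality encoders feed a cross-modal one.
per_mod = {k: make_encoder(layers=2) for k in tokens}
cross_out = make_encoder(layers=2)(
    torch.cat([per_mod[k](v) for k, v in tokens.items()], dim=1))
```

The three variants differ only in where the concatenation happens relative to the self-attention capacity.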
  • Scalability issues with transformer networks:
    - Self-attention is quadratic in the input sequence length.
    - Position-wise feed-forward networks are expensive sub-networks.
  • Two techniques to trade quality for efficiency:
    - Factorize multi-axis attention into separate attention layers over individual axes, e.g. one over the temporal dimension and another over the spatial dimension.
    - Map the high-dimensional input into a lower-dimensional latent space using learned latent queries (see the sketch below).
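A sketch of the latent-space technique, assuming a Perceiver-style learned latent query (sizes illustrative): a short set of learned queries cross-attends to the long multimodal sequence once, so later self-attention layers cost O(L²) in the latent length L instead of O(N²) in the input length N:

```python
import torch
import torch.nn as nn

class LatentQueryAttention(nn.Module):
    """Compress a length-N input into L learned latents via one cross-attention."""
    def __init__(self, d_model: int = 256, num_latents: int = 16, nhead: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d_model))
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [batch, N, d_model]
        q = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(q, x, x)  # queries = latents; keys/values = full input
        return out                   # [batch, L, d_model]

x = torch.randn(2, 603, 256)        # concatenated multimodal token sequence
pooled = LatentQueryAttention()(x)  # -> [2, 16, 256]
```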
  • The output format is the same as in MultiPath² and MultiPath++³.
    - The forecast for each agent is a Gaussian mixture model, where each mode is a sequence of states offset from a prior anchor trajectory.
    - Given an anchor trajectory, the scene-specific offsets at different time steps are assumed to be conditionally independent.
    - A classification head outputs the mixture likelihoods for different modes, which represent intent uncertainty.
    - A regression head outputs the means and covariances of the offset at each time step from each anchor trajectory, which describe control uncertainty.
    - The negative log-likelihood of a ground-truth trajectory decomposes into a classification loss on the mixture weights and a regression loss under the corresponding Gaussians (see the loss sketch below).
    - MultiPath predefines static anchor trajectories through a separate k-means clustering.
    - MultiPath++ learns anchor embeddings as part of the overall model training.
(Figure: MultiPath output format)
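A simplified sketch of the training loss this output format implies, assuming the hard-assignment scheme used in the MultiPath family: select the mode closest to the ground truth, apply a cross-entropy loss on the mixture weights (intent uncertainty) and a regression loss on that mode's per-step means (control uncertainty). Unit variances are assumed for brevity; the actual heads also regress covariances:

```python
import torch
import torch.nn.functional as F

def gmm_loss(logits: torch.Tensor, means: torch.Tensor, gt: torch.Tensor):
    """logits: [B, K] mixture logits; means: [B, K, T, 2] mode trajectories;
    gt: [B, T, 2] ground-truth trajectory."""
    # Hard assignment: the mode closest to ground truth (no gradient here).
    with torch.no_grad():
        dist = ((means - gt[:, None]) ** 2).sum(dim=(-1, -2))  # [B, K]
        best = dist.argmin(dim=-1)                             # [B]
    # Classification term: -log pi_k for the selected mode.
    cls = F.cross_entropy(logits, best)
    # Regression term: Gaussian NLL of gt under the selected mode's per-step
    # distributions, conditionally independent across time given the intent.
    sel = means[torch.arange(means.size(0)), best]             # [B, T, 2]
    reg = 0.5 * ((sel - gt) ** 2).sum(dim=-1).mean()
    return cls + reg

loss = gmm_loss(torch.randn(4, 6), torch.randn(4, 6, 80, 2), torch.randn(4, 80, 2))
```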
  • The GMM outputs are pruned into fewer modes using a trajectory aggregation algorithm.
    - Greedily select k centroid modes one at a time, each chosen to cover the maximum total probability among the modes not yet covered.
    - Iteratively refine the k centroids: assign each mode to its closest centroid, then update every centroid to the probability-weighted average of the modes assigned to it (see the sketch below).
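A minimal NumPy sketch of this aggregation under simplifying assumptions: modes are compared by endpoint distance, a mode counts as covered when it lies within a fixed radius of a centroid, and a single refinement pass is shown where the paper iterates. The helper name, radius, and distance measure are all placeholders:

```python
import numpy as np

def aggregate_modes(trajs, probs, k, radius=2.0):
    """Prune an M-mode GMM to k modes. trajs: [M, T, 2]; probs: [M]."""
    ends = trajs[:, -1]  # compare modes by their final positions
    d = np.linalg.norm(ends[:, None] - ends[None, :], axis=-1)  # [M, M]
    centroids, covered = [], np.zeros(len(trajs), dtype=bool)
    # Greedy step: repeatedly pick the mode covering the most uncovered mass.
    for _ in range(k):
        gain = ((d <= radius) & ~covered) @ probs
        c = int(gain.argmax())
        centroids.append(c)
        covered |= d[c] <= radius
    # Refinement step: assign every mode to its nearest centroid, then move
    # each centroid to the probability-weighted average of its members.
    assign = d[:, centroids].argmin(axis=1)  # [M], values in 0..k-1
    out_trajs, out_probs = [], []
    for j in range(k):
        members = assign == j
        w = probs[members] / probs[members].sum()
        out_trajs.append((w[:, None, None] * trajs[members]).sum(axis=0))
        out_probs.append(probs[members].sum())
    return np.stack(out_trajs), np.array(out_probs)

trajs, probs = np.random.randn(64, 80, 2), np.full(64, 1 / 64)
new_trajs, new_probs = aggregate_modes(trajs, probs, k=6)
```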
  • Limitations:
    - The same scene is encoded repeatedly for different ego-agents.
    - It cannot capture some important nuances in highly interactive scenes.
    - Possible futures for different agents are modeled separately.
    - Given the intent of an agent, the predicted states are temporally conditionally independent.
