[Paper Notes] Wayformer: Motion Forecasting via Simple & Efficient Attention Networks
Application of transformer models in autonomous driving.
- Challenges with behavior prediction of real-world agents:
- The output is highly unstructured and multimodal.
- The input consists of a heterogeneous mix of modalities.
- Previous works require either excessive modality-specific tuning or extensive exploration of modeling options.
- Two components of Wayformer:
- a self-attention scene encoder
- a cross-attention decoder (a high-level sketch of the two components follows)
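A minimal sketch of how the two components compose, assuming scene inputs have already been fused into a single token sequence; the layer sizes, the `mode_queries` parameter, and the trajectory head are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class WayformerSketch(nn.Module):
    def __init__(self, d_model: int, num_modes: int, horizon: int):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Self-attention scene encoder over the fused scene tokens.
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Learned queries, one per output mode, for the cross-attention decoder.
        self.mode_queries = nn.Parameter(torch.randn(num_modes, d_model))
        self.decoder = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.traj_head = nn.Linear(d_model, horizon * 2)  # one (x, y) per future step

    def forward(self, scene_tokens: torch.Tensor) -> torch.Tensor:
        memory = self.encoder(scene_tokens)                        # [B, M, d_model]
        q = self.mode_queries.unsqueeze(0).expand(scene_tokens.size(0), -1, -1)
        modes, _ = self.decoder(q, memory, memory)                 # [B, num_modes, d_model]
        return self.traj_head(modes)                               # [B, num_modes, horizon*2]
```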
- Multimodal data includes:
- history of each agent
- interactions of an agent with the closest context agents
- closest segments in the roadgraph
- closest traffic signal states
- Before encoding:
- Transform the scene into an agent’s ego-centric frame of reference.
- Project different modalities into the same dimension.
- Add learned positional embeddings to different modalities (a minimal sketch of the last two steps follows).
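A sketch of the projection and positional-embedding steps, assuming each modality arrives as a `[batch, seq_len, feature_dim]` tensor already expressed in the ego-centric frame; the module and dictionary names are hypothetical:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Project each modality to a shared width and add a learned positional embedding."""
    def __init__(self, input_dims: dict, seq_lens: dict, d_model: int):
        super().__init__()
        # One linear projection per modality into the shared d_model space.
        self.proj = nn.ModuleDict(
            {m: nn.Linear(d, d_model) for m, d in input_dims.items()})
        # One learned positional embedding per modality.
        self.pos = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(seq_lens[m], d_model)) for m in input_dims})

    def forward(self, inputs: dict) -> dict:
        # inputs[m]: [batch, seq_lens[m], input_dims[m]] in the ego-centric frame
        return {m: self.proj[m](x) + self.pos[m] for m, x in inputs.items()}
```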
- Three types of fusion strategies (a sketch follows the list):
- In late fusion, each modality has a dedicated encoder, and the encoder outputs are concatenated.
- In early fusion, inputs from different modalities are concatenated before being fed into a single cross-modal encoder.
- In hierarchical fusion, model capacity is split between modality-specific self-attention encoders and a cross-modal encoder in a hierarchical fashion.
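A hedged sketch of how the three strategies differ in code, reusing the projected modality tokens from above; the encoder depths, head counts, and helper names are assumptions:

```python
import torch
import torch.nn as nn

def make_encoder(d_model: int, depth: int) -> nn.Module:
    # A plain transformer encoder stack; nhead and depth are placeholder choices.
    layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

def early_fusion(tokens: dict, cross_modal: nn.Module) -> torch.Tensor:
    # Concatenate all modality tokens first, then run one cross-modal encoder.
    return cross_modal(torch.cat(list(tokens.values()), dim=1))

def late_fusion(tokens: dict, per_modality: dict) -> torch.Tensor:
    # Each modality gets its own dedicated encoder; outputs are concatenated.
    return torch.cat([per_modality[m](x) for m, x in tokens.items()], dim=1)

def hierarchical_fusion(tokens: dict, per_modality: dict, cross_modal: nn.Module) -> torch.Tensor:
    # Shallow per-modality encoders followed by a cross-modal encoder.
    return cross_modal(late_fusion(tokens, per_modality))
```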
- Scalability issues with transformer networks:
- Self-attention is quadratic in the input sequence length.
- Position-wise feed-forward networks are expensive sub-networks.
- Two techniques to trade quality for efficiency (sketches below):
- Factorize multi-axis attention into separate attentions over different dimensions: one over the temporal dimension and another over the spatial dimension.
- Map the high-dimensional input into a lower-dimensional latent space via a small set of learned latent queries.
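Hedged sketches of both techniques; the tensor layout (`[batch, space, time, d_model]` for the factorized case) and the use of `torch.nn.MultiheadAttention` are illustrative choices, not the paper's implementation:

```python
import torch
import torch.nn as nn

class FactorizedSelfAttention(nn.Module):
    """Attend along time and space separately instead of jointly."""
    def __init__(self, d_model: int, num_heads: int = 4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, t, d = x.shape                            # [batch, space, time, d_model]
        h = x.reshape(b * s, t, d)                      # attention over the time axis
        h = self.temporal(h, h, h)[0].reshape(b, s, t, d)
        h = h.transpose(1, 2).reshape(b * t, s, d)      # attention over the space axis
        h = self.spatial(h, h, h)[0].reshape(b, t, s, d).transpose(1, 2)
        return h

class LatentQueryAttention(nn.Module):
    """N learned queries cross-attend to M input tokens (N << M), so later
    self-attention costs O(N^2) instead of O(M^2)."""
    def __init__(self, d_model: int, num_latents: int, num_heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, M, d_model] -> [batch, num_latents, d_model]
        q = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.attn(q, tokens, tokens)[0]
```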
- The output format is the same as in MultiPath² and MultiPath++³.
- The forecast for each agent is a Gaussian mixture model, where each mode is a sequence of states offset from a prior anchor trajectory.
- Given an anchor trajectory, the scene-specific offsets at different time steps are assumed to be conditionally independent.
- A classification head outputs the mixture likelihoods for different modes, which represent intent uncertainty.
- A regression head outputs the means and covariances of the offset at each time step from each anchor trajectory, which describe control uncertainty.
- The training objective is the log-probability of the ground-truth trajectory, which decomposes into a classification loss over mixture modes and a regression loss on the Gaussians (see the sketch below).
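One common way to implement such a loss, sketched here under my own assumptions: hard assignment of the ground truth to the closest mode, diagonal per-step covariances (the paper's exact parameterization, e.g. full 2x2 covariances, may differ), and illustrative tensor shapes:

```python
import torch
import torch.nn.functional as F

def gmm_loss(logits, means, log_stds, gt):
    # logits: [B, K]  means/log_stds: [B, K, T, 2]  gt: [B, T, 2]
    dists = ((means - gt[:, None]) ** 2).sum(dim=(-1, -2))   # [B, K]
    best = dists.argmin(dim=-1)                              # closest mode per sample
    cls_loss = F.cross_entropy(logits, best)                 # intent uncertainty
    idx = best[:, None, None, None].expand(-1, 1, *means.shape[2:])
    mu = means.gather(1, idx).squeeze(1)                     # [B, T, 2]
    ls = log_stds.gather(1, idx).squeeze(1)
    # Per-step independent Gaussian negative log-likelihood (control
    # uncertainty), dropping the constant term.
    nll = (ls + 0.5 * ((gt - mu) / ls.exp()) ** 2).sum(dim=(-1, -2)).mean()
    return cls_loss + nll
```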
- MultiPath predefines static anchor trajectories through a separate k-means clustering step (see the sketch below).
- MultiPath++ learns anchor embeddings as part of the overall model training.
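The static-anchor computation from the MultiPath bullet could look as follows; this is a sketch assuming ground-truth futures are available as fixed-length `[T, 2]` arrays in a shared frame, with `k` and all names hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

def compute_static_anchors(trajectories: np.ndarray, k: int = 64) -> np.ndarray:
    """trajectories: [num_samples, T, 2] ground-truth futures in a shared frame."""
    n, t, d = trajectories.shape
    flat = trajectories.reshape(n, t * d)           # flatten time and x/y into one vector
    km = KMeans(n_clusters=k, n_init=10).fit(flat)  # cluster in trajectory space
    return km.cluster_centers_.reshape(k, t, d)     # each centroid is an anchor trajectory
```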
- The GMM outputs are pruned into fewer modes using a trajectory aggregation algorithm (see the sketch after this list).
- Greedily select k centroid modes one by one, each covering the maximum total likelihood among the still-uncovered modes.
- Iteratively refine the k centroid modes: assign each mode to its closest centroid, then update each centroid to the likelihood-weighted average of the modes assigned to it.
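A sketch of the aggregation loop, assuming maximum pointwise displacement as the trajectory distance and a coverage radius `eps`; both choices are illustrative, not necessarily the paper's:

```python
import numpy as np

def aggregate_modes(trajs, probs, k, eps, iters=3):
    # trajs: [M, T, 2] GMM mode mean trajectories; probs: [M] mixture likelihoods
    def dist(a, b):  # maximum pointwise displacement between trajectories
        return np.linalg.norm(a - b, axis=-1).max(-1)

    centroids = []
    covered = np.zeros(len(trajs), dtype=bool)
    for _ in range(k):  # greedy selection: maximize newly covered likelihood
        gains = [probs[~covered & (dist(trajs, t) <= eps)].sum() for t in trajs]
        best = int(np.argmax(gains))
        centroids.append(trajs[best].copy())
        covered |= dist(trajs, trajs[best]) <= eps
    centroids = np.stack(centroids)

    for _ in range(iters):  # refinement: likelihood-weighted average per assignment
        d = np.stack([dist(trajs, c) for c in centroids])  # [k, M]
        assign = d.argmin(0)                               # closest centroid per mode
        for j in range(len(centroids)):
            m = assign == j
            if m.any():
                centroids[j] = np.average(trajs[m], axis=0, weights=probs[m])
    return centroids
```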
- Limitations:
- The same scene is encoded repeatedly for different ego-agents.
- It cannot capture some important nuances in highly interactive scenes.
- Possible futures for different agents are modeled separately.
- Given the intent of an agent, the predicted states are temporally conditionally independent.
References:
1. N. Nayakanti, et al. Wayformer: Motion Forecasting via Simple & Efficient Attention Networks. ICRA, 2023.
2. Y. Chai, et al. MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction. CoRL, 2019.
3. B. Varadarajan, et al. MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction. ICRA, 2022.