# [Paper Notes] Wayformer: Motion Forecasting via Simple & Efficient Attention Networks

Application of transformer models in autonomous driving.

- Challenges with behavior prediction of real-world agents:

- The output is highly unstructured and multimodal.

- The input consists of a heterogeneous mix of modalities.

- Previous works require either excessive modality-specific tuning or extensive exploration of modeling options.
- Two components of Wayformer:

- a self-attention scene encoder

- a cross-attention decoder

- Multimodal data includes:

- history of each agent

- interactions of an agent with the closest context agents

- closest segments in the roadgraph

- closest traffic signal states

- Before encoding:

- Transform the scene into an agent’s ego-centric frame of reference.

- Project different modalities into the same dimension.

- Add learned positional embeddings to different modalities.

- Three types of fusion strategies:

- In late fusion, each modality has a dedicated encoder and their outputs get concatenated together.

- In early fusion, inputs from different modalities are concatenated before being fed into a cross-modal encoder.

- In hierarchical fusion, the model capacity is split between modality-specific self-attention encoders and the cross-modal encoder in a hierarchical fashion.

- Scalability issues with transformer networks:

- Self-attention is quadratic in the input sequence length.

- Position-wise feed-forward networks are expensive sub-networks.

- Two techniques to trade quality for efficiency:

- Factorize a multi-axis attention into separate ones for different dimensions — one for temporal dimensions and another for spatial dimensions.
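As a rough illustration of the factorization (the shapes, the single head, and the identity query/key/value projections are simplifying assumptions for this sketch, not the paper's architecture):

```python
import numpy as np

def attention(x):
    """Single-head self-attention over the second-to-last axis.

    x: (..., L, D). For brevity, queries, keys, and values are all x itself;
    a real layer would use learned projections.
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])   # (..., L, L)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # softmax
    return weights @ x                                           # (..., L, D)

def factorized_attention(x):
    """Attend over the temporal axis, then the spatial (agent) axis.

    x: (A, T, D) scene tensor (A agents, T time steps, D features).
    Cost is O(A*T^2 + T*A^2) rather than O((A*T)^2) for joint attention
    over all A*T tokens.
    """
    x = attention(x)                     # temporal: each agent over its T steps
    x = attention(np.swapaxes(x, 0, 1))  # spatial: each step over the A agents
    return np.swapaxes(x, 0, 1)          # back to (A, T, D)

scene = np.random.randn(8, 11, 16)       # e.g. 8 agents, 11 steps, 16-dim features
out = factorized_attention(scene)
assert out.shape == scene.shape
```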

- Map the high-dimensional input into a lower-dimensional latent space.

- The output format is the same as in MultiPath² and MultiPath++³.

- The forecast for each agent is a Gaussian mixture model, where each mode is a sequence of states offset from a prior anchor trajectory.

- Given an anchor trajectory, the scene-specific offsets at different time steps are assumed to be conditionally independent.

- A classification head outputs the mixture likelihoods for different modes, which represent intent uncertainty.

- A regression head outputs the means and covariances of the offset at each time step from each anchor trajectory, which describe control uncertainty.
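Putting the two heads together, the predicted distribution over a future trajectory $s = (s_1, \dots, s_T)$ is a mixture that factorizes over time steps (the notation is assumed here: $\pi_k$ are the mixture likelihoods, $a^k_t$ the anchor states, and $\mu^k_t, \Sigma^k_t$ the regressed offset means and covariances):

$$
p(s \mid x) \;=\; \sum_{k=1}^{K} \pi_k(x) \prod_{t=1}^{T} \mathcal{N}\!\big(s_t \;\big|\; a^{k}_{t} + \mu^{k}_{t}(x),\ \Sigma^{k}_{t}(x)\big)
$$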

- The negative log-probability of a ground-truth trajectory decomposes into a classification loss over the mixture likelihoods and a regression loss for each Gaussian.

- MultiPath predefines static anchor trajectories through a separate k-means clustering.
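A minimal sketch of that offline step, assuming futures are given as sequences of $(x, y)$ waypoints (the data layout and the plain Lloyd's iterations are illustrative, not MultiPath's exact pipeline):

```python
import numpy as np

def kmeans_anchors(trajs, k, iters=20, seed=0):
    """Cluster ground-truth future trajectories into k static anchors.

    trajs: (N, T, 2) array of N futures with T (x, y) waypoints each.
    Returns (k, T, 2) anchor trajectories (cluster centroids).
    """
    n, t, d = trajs.shape
    flat = trajs.reshape(n, t * d)
    rng = np.random.default_rng(seed)
    centers = flat[rng.choice(n, size=k, replace=False)]  # random init
    for _ in range(iters):
        # assign each trajectory to its nearest centroid
        dists = ((flat[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # update each centroid; keep the old one if its cluster is empty
        for j in range(k):
            if (labels == j).any():
                centers[j] = flat[labels == j].mean(axis=0)
    return centers.reshape(k, t, d)

trajs = np.random.default_rng(1).normal(size=(100, 6, 2))
anchors = kmeans_anchors(trajs, k=4)
```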

- MultiPath++ learns anchor embeddings as part of the overall model training.

- The GMM outputs are pruned into fewer modes using a trajectory aggregation algorithm.

- Greedily select *k* centroid modes one by one to cover the maximum total likelihood among the uncovered modes.
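The greedy selection step might be sketched as follows (the max-over-waypoints distance metric and the coverage radius are assumptions for illustration):

```python
import numpy as np

def greedy_select(modes, probs, k, radius=2.0):
    """Greedy step of trajectory aggregation: pick k centroid modes that
    cover the most total likelihood.

    modes: (M, T, 2) mode trajectories; probs: (M,) mixture likelihoods.
    A mode counts as covered once some selected centroid is within
    `radius` at every waypoint.
    """
    m = len(modes)
    # pairwise distance: max over time of waypoint-to-waypoint distance
    dists = np.linalg.norm(modes[:, None] - modes[None, :], axis=-1).max(-1)
    covered = np.zeros(m, dtype=bool)
    picked = []
    for _ in range(k):
        # gain of candidate i = total likelihood of uncovered modes it covers
        gains = np.where(~covered[None, :] & (dists <= radius),
                         probs[None, :], 0.0).sum(axis=1)
        gains[picked] = -1.0            # never re-pick a centroid
        i = int(gains.argmax())
        picked.append(i)
        covered |= dists[i] <= radius   # mark everything this centroid covers
    return picked

rng = np.random.default_rng(0)
modes = rng.normal(size=(16, 6, 2))
probs = rng.random(16)
probs /= probs.sum()
picked = greedy_select(modes, probs, k=3)
```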

- Iteratively refine the *k* centroids by updating each to the likelihood-weighted average of the modes whose closest centroid it is.

- Limitations:

- The same scene is encoded repeatedly for different ego-agents.

- It cannot capture some important nuances in highly interactive scenes.

- Possible futures for different agents are modeled separately.

- Given the intent of an agent, the predicted states are temporally conditionally independent.

## References

- N. Nayakanti, et al. *Wayformer: Motion Forecasting via Simple & Efficient Attention Networks*. ICRA, 2023.

- Y. Chai, et al. *MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction*. CoRL, 2019.

- B. Varadarajan, et al. *MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction*. ICRA, 2022.