Next-watch classification (candidate generation) and watch-time prediction (ranking).
- Scale and non-functional requirements:
- ~1 billion parameters (majority in embeddings)
- hundreds of billions of examples
- many hours of video uploaded per second
- serving latency under tens of milliseconds
- Two-step recommendation:
- Candidate generation takes in coarse features and provides broad personalization and collaborative filtering.
- Ranking takes in a rich set of features and distinguishes relative importance among the high-recall candidates (see the sketch below).
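A minimal sketch of how the two stages compose at request time; `candidate_model`, `ranking_model`, `corpus_index`, and the attribute names are hypothetical stand-ins, not the paper's implementation.

```python
# Illustrative two-stage flow: candidate generation narrows millions of videos
# to a few hundred with coarse features, then ranking scores that short list
# with richer per-impression features. All objects here are hypothetical.

def recommend(user, context, corpus_index, candidate_model, ranking_model, k=25):
    # Stage 1: coarse features -> user vector -> approximate top-N neighbors.
    user_vector = candidate_model.embed(user.watch_history, user.search_history, context)
    candidates = corpus_index.top_n(user_vector, n=500)

    # Stage 2: rich features -> expected watch time for each surviving candidate.
    scored = [(v, ranking_model.expected_watch_time(user, context, v)) for v in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [v for v, _ in scored[:k]]
```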
- Candidate generation:
- The problem is formulated as a multi-class classification among millions of videos as the next to watch, given the user’s history and the context.
- The deep neural network with a softmax output layer is trained to minimize the cross-entropy loss.
- At serving time, an approximate top-k nearest-neighbor search (the most matching videos for a user-context pair) replaces the full softmax computation (see the sketch at the end of this section).
- Multivalent features, such as a user’s watch history or search history, are first mapped into a sequence of embeddings and then averaged into fixed-width dense vectors.
- Embeddings are learned jointly with the other neural network parameters; in fact, the main purpose of the network is to learn the user and video embeddings.
- Implicit feedback is used because it is orders of magnitude more abundant than explicit feedback.
- The age of a training example is fed as a feature to counteract the model's bias toward older videos, since users prefer fresh content.
- A fixed number of training examples are generated for each user to prevent a cohort of highly active users from dominating the loss.
- Some information, such as the latest search query, needs to be withheld to avoid overfitting the surrogate problem.
- Predicting a held-out watch would leak future information and ignore asymmetric consumption patterns.
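A toy NumPy sketch of the candidate-generation setup, under assumed shapes and with a single ReLU layer standing in for the paper's feed-forward tower: averaged history embeddings plus the example-age scalar feed the network, a softmax over the corpus defines the training target, and serving reduces to a top-k dot-product (nearest-neighbor) search.

```python
import numpy as np

rng = np.random.default_rng(0)
num_videos, embed_dim = 10_000, 64          # toy corpus; the real one has millions of videos
input_emb = rng.normal(scale=0.01, size=(num_videos, embed_dim))   # watch-history embeddings
video_emb = rng.normal(scale=0.01, size=(num_videos, embed_dim))   # softmax output embeddings
W = rng.normal(scale=0.01, size=(embed_dim + 1, embed_dim))        # +1 row for the example-age input

def user_vector(watched_ids, example_age_days):
    # Multivalent watch history -> averaged fixed-width vector, plus the example-age feature.
    watch_avg = input_emb[watched_ids].mean(axis=0)
    features = np.concatenate([watch_avg, [example_age_days]])
    return np.maximum(features @ W, 0.0)    # one ReLU layer stands in for the full tower

def next_watch_probs(u):
    # Training target: softmax over the whole corpus, trained with cross-entropy.
    logits = video_emb @ u
    logits -= logits.max()                  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def serve_top_k(u, k=10):
    # Serving shortcut: top-k by dot product is a nearest-neighbor search in the
    # video embedding space (approximate in production); no softmax is needed.
    return np.argsort(video_emb @ u)[::-1][:k]

u = user_vector(watched_ids=[3, 17, 256], example_age_days=0.0)    # age set to 0 at serving
print(next_watch_probs(u).sum())            # sums to 1.0
print(serve_top_k(u))
```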
- Ranking:
- The problem is formulated as predicting the watch time of each video impression. In contrast, predicting CTR would promote “clickbait”.
- Positive impressions are weighted by their observed watch times while negative impressions receive unit weight.
- Assuming the fraction of positive impressions is small, the odds learned by the weighted logistic regression approximate the expected watch time: over N impressions with k positives, the learned odds are (sum of watch times) / (N - k) ≈ E[T](1 + P) ≈ E[T] for small click probability P (see the check after this list).
- Query features are those about users and contexts, while item features are those about video impressions being scored.
- The most important signals are those that describe a user’s previous interaction with the item itself and other similar items.
- Churn: successive recommendation requests should not return identical lists; this is achieved by feeding the frequency of past video impressions as a feature, so a recently shown but unwatched video is demoted on the next request.
- For categorical features, the dimension of an embedding is roughly proportional to the logarithm of the ID space size (see the sketch below).
- Sharing embeddings for the same ID space across categorical features improves generalization, speeds up training and reduces memory requirements.
- Continuous features are mapped into [0, 1) through quantile normalization.
- Offline metrics are used extensively to guide development, but A/B testing has the final say on the effectiveness of the algorithm.
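A toy numeric check of the weighted-logistic-regression argument above; the impression counts and watch times are made up. It shows why exponentiating the serving-time logit yields an approximate expected watch time.

```python
# Toy numbers (not from the paper) showing why the learned odds track E[watch time]
# when positives are weighted by watch time and negatives get unit weight.
watch_times = [30.0, 120.0, 45.0]        # seconds watched on the k positive impressions
num_impressions = 1_000                  # N impressions in this slice
k = len(watch_times)

odds = sum(watch_times) / (num_impressions - k)       # sum(T_i) / (N - k)
expected_watch = sum(watch_times) / num_impressions   # E[T] over all N impressions
click_prob = k / num_impressions                      # P, small in practice

# odds = E[T] * N / (N - k) ≈ E[T] * (1 + P) ≈ E[T]; at serving time the ranking
# model therefore exponentiates its logit to score by approximate expected watch time.
print(odds, expected_watch * (1 + click_prob), expected_watch)
```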
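Small sketches of the feature-engineering points above. The multiplier in the dimension rule and the rank-based CDF mapping are assumptions; the paper only states proportionality to the log of the ID space and a [0, 1) quantile mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

def embedding_dim(vocab_size, multiplier=8):
    # Dimension roughly proportional to the logarithm of the ID space size.
    return int(round(multiplier * np.log(vocab_size)))

# One embedding table per ID space, shared by every categorical feature drawn
# from it (impression video ID, last watched video ID, ...), which cuts memory
# and lets rarely seen features benefit from frequently seen ones.
video_vocab = 100_000                    # toy size; the real corpus has millions of IDs
video_table = rng.normal(scale=0.01, size=(video_vocab, embedding_dim(video_vocab)))

def quantile_normalize(values):
    # Map a continuous feature into [0, 1) via its empirical CDF (rank / n).
    ranks = np.argsort(np.argsort(values))
    return ranks / len(values)

print(embedding_dim(video_vocab))                               # 92 with this multiplier
print(quantile_normalize(np.array([3.0, 100.0, 7.0, 7.5])))     # [0.   0.75 0.25 0.5 ]
```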
References:
- Covington, Paul, Jay Adams, and Emre Sargin. "Deep Neural Networks for YouTube Recommendations." Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16), 2016.