[Paper Notes] Deep Neural Networks for YouTube Recommendations

Devin Z
3 min read · Sep 18, 2024

Next watch classification and watch time prediction.

(Photo: YouTube Campus, San Bruno, September 10, 2024)
  • Scale and non-functional requirements:
    - ~1 billion parameters (majority in embeddings)
    - hundreds of billions of examples
    - many hours of video uploaded per second
    - serving latency under tens of milliseconds
  • Two-step recommendation:
    - Candidate generation takes in coarse features and provides broad personalization via collaborative filtering.
    - Ranking takes in a rich set of features and distinguishes the relative importance among the high-recall candidate set.
(Figure: Recommendation system architecture)
  • Candidate generation:
    - The problem is formulated as extreme multi-class classification: predicting which of millions of videos will be watched next, given the user’s history and the context.
    - A deep neural network with a softmax output layer is trained to minimize the cross-entropy loss.
    - At serving time, the softmax computation is replaced by a top-k nearest-neighbor search for the videos that best match a user-context pair (see the first two sketches after this list).
    - Multivalent features, such as a user’s watch history or search history, are first mapped into a sequence of embeddings and then averaged into fixed-width dense vectors.
    - Embeddings are learned jointly with the other neural network parameters; in a sense, the purpose of the network is to learn good user and video embeddings.
    - Implicit feedback is used because it is orders of magnitude more abundant than explicit feedback.
    - The age of a training example is fed as a feature to counteract the model’s bias toward older videos, since users prefer fresh content.
    - A fixed number of training examples are generated for each user to prevent a cohort of highly active users from dominating the loss.
    - Some information, such as the latest search query, needs to be withheld to avoid overfitting the surrogate problem.
    - Predicting a randomly held-out watch, rather than the next watch, would leak future information and ignore asymmetric consumption patterns.
  • Ranking:
    - The problem is formulated as predicting the watch time of each video impression. In contrast, predicting CTR would promote “clickbait”.
    - Positive impressions are weighted by their observed watch times while negative impressions receive unit weight.
    - Assuming the fraction of positive impressions is small, the odds learned by the logistic regression approximate the expected watch time (see the watch-time sketch after this list).
    - Query features are those about users and contexts, while item features are those about video impressions being scored.
    - The most important signals are those that describe a user’s previous interaction with the item itself and other similar items.
    - Churn: successive recommendation requests should not return identical lists; this is addressed by feeding the frequency of a video’s past impressions for the user as a feature.
    - For categorical features, the dimension of an embedding is approximately proportional to the logarithm of the ID space size.
    - Sharing embeddings for the same ID space across categorical features improves generalization, speeds up training and reduces memory requirements.
    - Continuous features are mapped into [0, 1) through quantile normalization (see the last sketch after this list).
  • Offline metrics are used extensively to guide development, but A/B testing has the final say on the effectiveness of the algorithm.
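
Below is a minimal PyTorch sketch of the candidate-generation tower described above; all names and sizes are illustrative assumptions rather than the paper’s actual configuration. Watch and search histories are averaged into fixed-width vectors, concatenated with the example-age feature, and passed through a ReLU tower; full cross-entropy stands in for the sampled softmax a real system would use over millions of classes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CandidateGenerator(nn.Module):
    """Illustrative candidate-generation tower; all sizes are assumptions."""

    def __init__(self, num_videos=500_000, num_tokens=100_000, dim=256):
        super().__init__()
        # mode="mean" averages a variable-length bag of IDs into a single
        # fixed-width vector (the multivalent-feature averaging above).
        self.watch_emb = nn.EmbeddingBag(num_videos, dim, mode="mean")
        self.search_emb = nn.EmbeddingBag(num_tokens, dim, mode="mean")
        self.tower = nn.Sequential(
            nn.Linear(2 * dim + 1, 512), nn.ReLU(),
            nn.Linear(512, dim), nn.ReLU(),
        )
        # The softmax weight matrix doubles as the table of video embeddings
        # used for nearest-neighbor retrieval at serving time.
        self.video_out = nn.Linear(dim, num_videos, bias=False)

    def user_vector(self, watched_ids, search_tokens, example_age):
        x = torch.cat([
            self.watch_emb(watched_ids),     # (B, dim) averaged watch history
            self.search_emb(search_tokens),  # (B, dim) averaged search history
            example_age.unsqueeze(1),        # (B, 1) freshness signal
        ], dim=1)
        return self.tower(x)

    def forward(self, watched_ids, search_tokens, example_age):
        # Logits over all videos; a production system would use a sampled
        # softmax here rather than materializing the full output.
        return self.video_out(self.user_vector(watched_ids, search_tokens, example_age))

# One training step, with watched_ids (B, W) and search_tokens (B, S):
#   loss = F.cross_entropy(model(watched_ids, search_tokens, age), next_watch_id)
```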
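
At serving time the softmax is skipped entirely. The sketch below scores every video by dot product with the user vector and takes the top k with exact torch.topk; a production system would use an approximate nearest-neighbor index instead. Per the paper, the example-age feature is set to zero at serving time.

```python
@torch.no_grad()
def topk_candidates(model, watched_ids, search_tokens, k=100):
    # Zero example age: score videos as of the end of the training window.
    age = torch.zeros(watched_ids.size(0))
    u = model.user_vector(watched_ids, search_tokens, age)  # (B, dim)
    scores = u @ model.video_out.weight.T                   # (B, num_videos)
    return scores.topk(k, dim=1).indices                    # (B, k) video IDs
```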
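
The weighted logistic regression for watch time can be written as an ordinary binary cross-entropy with per-example weights. With k positives of watch times Tᵢ among N impressions, the learned odds come out to ΣTᵢ / (N − k) ≈ E[T] / (1 − P) ≈ E[T] when the click fraction P is small, so exponentiating the logit at serving time yields the watch-time estimate. The function names below are illustrative.

```python
import torch
import torch.nn.functional as F

def weighted_lr_loss(logits, clicked, watch_time):
    # Positive impressions are weighted by their observed watch time;
    # negative impressions receive unit weight.
    weights = torch.where(clicked.bool(), watch_time, torch.ones_like(watch_time))
    return F.binary_cross_entropy_with_logits(logits, clicked.float(), weight=weights)

def expected_watch_time(logits):
    # The learned odds e^logit approximate the expected watch time
    # because the fraction of positive impressions is small.
    return torch.exp(logits)
```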
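
Finally, a sketch of the quantile normalization for continuous features: fit cut points on training data once, then map a raw value to its approximate empirical CDF so the result lands in [0, 1). The bin count is an arbitrary assumption.

```python
import numpy as np

def fit_quantiles(train_values, num_bins=1024):
    """Compute empirical quantile cut points from training data."""
    qs = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]  # interior quantiles
    return np.quantile(train_values, qs)

def quantile_normalize(x, cut_points):
    """Map raw values into [0, 1) via the empirical CDF."""
    ranks = np.searchsorted(cut_points, x, side="right")
    return ranks / (len(cut_points) + 1)
```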

References:

  1. Covington, Paul, Jay Adams, and Emre Sargin. “Deep Neural Networks for YouTube Recommendations.” Proceedings of the 10th ACM Conference on Recommender Systems. 2016.
