[Paper Notes] Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations

Devin Z
Sep 22, 2024


Two-tower neural network for scalable item retrieval.

  • Two phases of recommendation:
    - a scalable retrieval system, which is the subject of this paper;
    - a full-blown ranking system, which is the subject of the Wide & Deep paper.
  • Application:
    - The proposed method is used as one of the nominators that generate hundreds of candidates when recommending YouTube videos, conditioned on a seed video being watched by a user.
    - This was evaluated both in offline experiments on clicked video recall and in online A/B testing on user engagement improvement.
  • Challenges with representation learning:
    - extremely large item corpus
    - sparse user feedback
    - cold start for fresh items
  • Compared with DNNs, MF-based models (e.g. factorization machines) usually capture only second-order feature interactions.
  • Two-tower architecture (see the sketch below):
    - The left tower encodes user and context features.
    - The right tower encodes item features.
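
A minimal sketch of the two towers, assuming PyTorch (the paper's original implementation was in TensorFlow); the layer sizes and names here are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """A stack of fully connected ReLU layers mapping features to an embedding."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# u encodes user/context features, v encodes item features;
# the model's score for a (query, item) pair is the inner product.
u = Tower(in_dim=128, out_dim=64)   # left tower
v = Tower(in_dim=96, out_dim=64)    # right tower
scores = (u(torch.randn(32, 128)) * v(torch.randn(32, 96))).sum(dim=-1)
```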
Two-Tower DNN Model¹
  • The multi-class classification network is a degenerate two-tower DNN, where
    - The right tower is simplified to a single layer with item embeddings.
    - There is assumed to be a fixed vocabulary of items and training is assisted by negative sampling.
  • Theoretical modeling:
    - Queries and items are represented by feature vectors x and y respectively.
    - Separate embedding functions u and v are applied to query and item features.
    - The output of the model is the inner product of two embeddings.
    - Given a query, the probability of picking a candidate item is modeled by the softmax function.
    - Each datapoint in the training set consists of a query, an item, and an associated reward capturing user engagement with the item (e.g. watch time).
    - The loss function is a weighted log-likelihood, written out below.
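
With s(x, y) denoting the inner product of the two tower outputs, the softmax probability and the reward-weighted log-likelihood over T training examples are:

```latex
P(y \mid x; \theta) = \frac{e^{s(x,y)}}{\sum_{y' \in \mathcal{Y}} e^{s(x,y')}},
\qquad s(x, y) = \langle u(x; \theta),\, v(y; \theta) \rangle
```

```latex
L_T(\theta) = -\frac{1}{T} \sum_{i=1}^{T} r_i \cdot \log P(y_i \mid x_i; \theta)
```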
  • In practice, it's not feasible to score all candidate items, so only in-batch items are used as negatives for the queries in the same batch.
    - Because popular items appear in batches more often, this would leave them overly penalized as negatives.
    - To counteract that, the log of the item's estimated sampling probability is subtracted from the inner-product logit (a logQ correction); a sketch follows this list.
    - The sampling probability of an item is estimated by streaming frequency estimation.
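
A sketch of the corrected in-batch softmax loss, assuming PyTorch; the function name and tensor layout are mine, not the paper's:

```python
import torch
import torch.nn.functional as F

def corrected_in_batch_loss(q_emb: torch.Tensor,   # [B, D] query embeddings
                            i_emb: torch.Tensor,   # [B, D] positive-item embeddings
                            reward: torch.Tensor,  # [B] per-example reward r_i
                            log_p: torch.Tensor):  # [B] log of estimated sampling prob.
    """Reward-weighted softmax over in-batch negatives with the logQ correction."""
    logits = q_emb @ i_emb.T               # [B, B]: row b scores all in-batch items
    logits = logits - log_p.unsqueeze(0)   # subtract log(p_j) from item j's column
    labels = torch.arange(q_emb.size(0))   # the diagonal holds each row's positive
    per_example = F.cross_entropy(logits, labels, reduction="none")
    return (reward * per_example).mean()
```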
  • Streaming frequency estimation (sketched below):
    - Estimating an item's sampling probability reduces to estimating the average number of training steps between two consecutive hits of the item; this gap estimate is maintained in array B, and the probability is its reciprocal.
    - Array A records the latest step at which each item was sampled and is used to update array B.
    - The update step can be understood as SGD with a fixed learning rate.
    - In distributed training, arrays A and B are served by multiple parameter servers and are updated alongside the asynchronous SGD training of the neural networks.
    - Multiple hashing is adopted to mitigate over-estimation of item frequency under hash collisions, similar to the idea of a count-min sketch.
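
A pure-Python sketch of the A/B update with multiple hashing; the hash function, bucket count, and learning rate alpha are illustrative choices, not the paper's:

```python
import numpy as np

class FrequencyEstimator:
    """Streaming estimate of each item's sampling probability.

    A[i, h] is the last global step at which bucket h of hash i was hit;
    B[i, h] is a moving average of the gap between consecutive hits, so the
    sampling probability is roughly 1 / B. Taking the largest gap across
    several hash functions counters the frequency over-estimation that
    collisions cause, in the spirit of a count-min sketch.
    """

    def __init__(self, num_buckets: int, num_hashes: int = 3, alpha: float = 0.01):
        self.A = np.zeros((num_hashes, num_buckets))
        self.B = np.zeros((num_hashes, num_buckets))
        self.alpha = alpha                 # fixed learning rate of the update
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes

    def _bucket(self, item_id: str, i: int) -> int:
        return hash((i, item_id)) % self.num_buckets   # illustrative hash

    def update(self, item_id: str, step: int) -> None:
        for i in range(self.num_hashes):
            h = self._bucket(item_id, i)
            gap = step - self.A[i, h]
            self.B[i, h] = (1 - self.alpha) * self.B[i, h] + self.alpha * gap
            self.A[i, h] = step

    def prob(self, item_id: str) -> float:
        gaps = [self.B[i, self._bucket(item_id, i)] for i in range(self.num_hashes)]
        return 1.0 / max(max(gaps), 1.0)   # largest gap => most conservative estimate
```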
  • Additional details:
    - Embeddings are L2-normalized to improve trainability and retrieval quality.
    - Each logit is divided by a well-tuned temperature hyperparameter to sharpen the predictions (both shown below).
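
Both tricks in brief, assuming PyTorch; the stand-in tensors and the temperature value are purely illustrative:

```python
import torch
import torch.nn.functional as F

u_emb = torch.randn(32, 64)             # stand-ins for the two tower outputs
v_emb = torch.randn(32, 64)

tau = 0.05                              # illustrative; tuned as a hyperparameter
u_norm = F.normalize(u_emb, dim=-1)     # L2-normalize both embeddings so the
v_norm = F.normalize(v_emb, dim=-1)     # inner product becomes cosine similarity
logits = (u_norm * v_norm).sum(dim=-1) / tau  # dividing by tau < 1 sharpens softmax
```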
  • Feature engineering:
    - The final embedding for a multi-hot vector of categorical features (e.g. video topics) is a weighted sum of the embeddings for each of the values in the multi-hot vector.
    - Out-of-vocabulary entities are randomly assigned to a fixed set of hash buckets and a separate embedding is learned for each one.
    - The watch history of a user is treated as a bag of words and represented by the average of the watched videos' id embeddings (see the sketch below).
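
A sketch of all three feature tricks, assuming PyTorch; the bucket count, ids, and weights are made up for illustration:

```python
import torch
import torch.nn as nn

NUM_BUCKETS, DIM = 100_000, 64
emb = nn.Embedding(NUM_BUCKETS, DIM)    # shared id-embedding table

def bucketize(entity_id: str) -> int:
    # Out-of-vocabulary entities land in a fixed set of hash buckets,
    # each of which gets its own learned embedding row.
    return hash(entity_id) % NUM_BUCKETS

# Multi-hot categorical feature (e.g. video topics): weighted sum of embeddings.
topic_ids = torch.tensor([[bucketize("music"), bucketize("live")]])
topic_w   = torch.tensor([[0.7, 0.3]])
topic_vec = (emb(topic_ids) * topic_w.unsqueeze(-1)).sum(dim=1)   # [1, DIM]

# Watch history as a bag of words: average of the watched videos' id embeddings.
history_ids = torch.tensor([[bucketize("v1"), bucketize("v2"), bucketize("v3")]])
history_vec = emb(history_ids).mean(dim=1)                        # [1, DIM]
```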
  • Data engineering:
    - Training datasets are organized by day, and the model is trained in a streaming manner to keep up with the latest data distribution.

References:

  1. Yi, Xinyang, et al. "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations." Proceedings of the 13th ACM Conference on Recommender Systems. ACM, 2019.
