[Paper Notes] Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations

Devin Z
Sep 22, 2024


Two-tower neural network for scalable item retrieval.

(Cover photo: Sierra National Forest, August 31, 2024)
  • Two phases of recommendation:
    - a scalable retrieval system, which is the focus of this paper;
    - a full-blown ranking system, which is the focus of the Wide & Deep paper.
  • Application:
    - The proposed method is used as one of the nominators that generate hundreds of candidates when recommending YouTube videos, conditioned on a seed video being watched by a user.
    - It was evaluated both in offline experiments, measuring recall of clicked videos, and in online A/B tests, measuring improvements in user engagement.
  • Challenges with representation learning:
    - extremely large item corpus
    - sparse user feedback
    - cold-start for fresh items
  • Compared with DNNs, MF-based models are usually only capable of capturing second-order feature interactions (e.g. factorization machines; see the scoring function below).
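
For context, the standard factorization-machine score makes this second-order limitation explicit (this formula is from the FM literature, not from this paper):

$$
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
$$

The pairwise terms ⟨v_i, v_j⟩ x_i x_j are exactly second-order; a DNN tower can, in principle, learn higher-order interactions.
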
  • Two-tower architecture:
    - The left tower encodes user and context features.
    - The right tower encodes item features.
Two-Tower DNN Model¹
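
As an illustration of this two-tower architecture, here is a minimal Keras-style sketch. Layer sizes, input dimensions, and names are my own assumptions, not from the paper:

```python
import tensorflow as tf

def make_tower(input_dim: int, embedding_dim: int) -> tf.keras.Model:
    """A small MLP tower mapping raw features to an L2-normalized embedding."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(embedding_dim),
        # Normalized embeddings improve trainability and retrieval quality.
        tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=-1)),
    ])

query_tower = make_tower(input_dim=512, embedding_dim=64)  # user + context features
item_tower = make_tower(input_dim=256, embedding_dim=64)   # item features

def score(query_features, item_features):
    """The model output: inner product of the two tower embeddings."""
    u = query_tower(query_features)
    v = item_tower(item_features)
    return tf.reduce_sum(u * v, axis=-1)
```

At serving time, item embeddings are precomputed so that retrieval reduces to an approximate nearest-neighbor search over the item tower's outputs.
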
  • The multi-class classification network is a degenerate case of the two-tower DNN, where
    - the right tower is simplified to a single layer of item embeddings, and
    - a fixed vocabulary of items is assumed, with training assisted by negative sampling.
  • Theoretical modeling:
    - Queries and items are represented by feature vectors x and y respectively.
    - Separate embedding functions u and v are applied to query and item features.
    - The output of the model is the inner product of two embeddings.
    - Given a query, the probability of picking a candidate item is modeled by the softmax function.
    - Each training example consists of a query, an item, and an associated reward capturing the user's engagement with the item (e.g. watch time).
    - The loss function is a reward-weighted log-likelihood, shown below.
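
In the paper's notation, with θ the shared model parameters, 𝒴 the item corpus, and (x_i, y_i, r_i) the i-th training example:

$$
s(x, y) = \langle u(x; \theta),\, v(y; \theta) \rangle
$$

$$
\mathcal{P}(y \mid x; \theta) = \frac{e^{s(x, y)}}{\sum_{y' \in \mathcal{Y}} e^{s(x, y')}}
$$

$$
L_T(\theta) = -\frac{1}{T} \sum_{i=1}^{T} r_i \cdot \log \mathcal{P}(y_i \mid x_i; \theta)
$$
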
  • In practice it is infeasible to include all candidate items in the softmax, so only in-batch items serve as negatives for all queries in the same batch.
    - This results in popular items being overly penalized as negatives, since they appear in batches disproportionately often.
    - To counteract that, the log of the item's estimated sampling probability is subtracted from the inner-product logit, yielding the corrected in-batch softmax below.
    - The sampling probability of an item is estimated by streaming frequency estimation.
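
With p_j denoting the probability of item j being sampled into a batch B, the corrected logit and in-batch softmax from the paper are:

$$
s^{c}(x_i, y_j) = s(x_i, y_j) - \log p_j
$$

$$
\mathcal{P}_B(y_i \mid x_i; \theta) = \frac{e^{s^{c}(x_i, y_i)}}{e^{s^{c}(x_i, y_i)} + \sum_{j \in B,\, j \neq i} e^{s^{c}(x_i, y_j)}}
$$
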
  • Streaming frequency estimation:
    - Estimating the sampling probability of an item reduces to estimating the average number of steps between two consecutive hits of that item, which is maintained in an array B.
    - An array A records the latest step at which each item was sampled and is used to update B.
    - The update step can be understood as SGD with a fixed learning rate.
    - In distributed training, the arrays A and B are served by multiple parameter servers and are updated alongside the asynchronous SGD training of the neural networks.
    - Multiple hashing is adopted to mitigate over-estimation of item frequency under hash collisions, similar in spirit to a count-min sketch; a Python sketch of the estimator follows this list.
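
A minimal Python sketch of the streaming frequency estimator, assuming a single hash function and a fixed learning rate α (variable names are mine; the paper additionally uses multiple hashes):

```python
import numpy as np

class StreamingFrequencyEstimator:
    """Estimates each item's sampling probability from a stream of batches.

    B[h] tracks the average number of steps between two consecutive
    hits of an item; the sampling probability is then 1 / B[h].
    """

    def __init__(self, num_buckets: int, alpha: float = 0.01):
        self.alpha = alpha              # fixed learning rate
        self.A = np.zeros(num_buckets)  # latest step each item was sampled
        self.B = np.zeros(num_buckets)  # average gap between consecutive hits
        self.t = 0                      # global step, incremented per batch

    def _bucket(self, item_id) -> int:
        # Single hash for brevity; the paper uses multiple hashes
        # (count-min-sketch style) to mitigate collisions.
        return hash(item_id) % len(self.B)

    def update(self, batch_item_ids) -> None:
        """Called once per training batch with the ids of the sampled items."""
        self.t += 1
        for item_id in set(batch_item_ids):
            h = self._bucket(item_id)
            # SGD-style moving average of the gap, with a fixed step size.
            self.B[h] = (1 - self.alpha) * self.B[h] + self.alpha * (self.t - self.A[h])
            self.A[h] = self.t

    def sampling_prob(self, item_id) -> float:
        """Estimated probability that the item appears in a batch."""
        h = self._bucket(item_id)
        return 1.0 / max(self.B[h], 1.0)  # guard against unseen buckets
```

At training time, log(sampling_prob(item)) would be subtracted from the corresponding logit, as in the corrected softmax above.
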
  • Additional details:
    - Embeddings are normalized to improve trainability and retrieval quality.
    - Each logit is divided by a well-tuned temperature hyperparameter τ to sharpen the predictions, as shown below.
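
Putting both tricks together, the logit fed into the softmax becomes:

$$
s(x, y) = \frac{\langle u(x; \theta),\, v(y; \theta) \rangle}{\tau}, \quad \text{with } \lVert u \rVert = \lVert v \rVert = 1
$$
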
  • Feature engineering:
    - The final embedding for a multi-hot vector of categorical features (e.g. video topics) is a weighted sum of the embeddings of the values present in the vector.
    - Out-of-vocabulary entities are randomly assigned to a fixed set of hash buckets, with a separate embedding learned for each bucket.
    - The watch history of a user is treated as a bag of words and represented by the average of the watched videos' id embeddings; a sketch of these transforms follows this list.
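
A minimal numpy sketch of these three transforms. Function names and the table layouts (id-indexed embedding matrices) are my own, purely illustrative:

```python
import numpy as np

def multi_hot_embedding(value_ids, weights, table):
    """Weighted sum of the embeddings of a multi-hot categorical
    feature, e.g. video topics. `table` maps value id -> embedding row."""
    return np.sum([w * table[v] for v, w in zip(value_ids, weights)], axis=0)

def oov_embedding(entity, bucket_table):
    """Out-of-vocabulary entities are hashed into a fixed set of
    buckets, each with its own learned embedding."""
    return bucket_table[hash(entity) % len(bucket_table)]

def watch_history_embedding(video_ids, video_table):
    """Bag-of-words treatment of the watch history: the average of
    the watched videos' id embeddings."""
    return np.mean([video_table[v] for v in video_ids], axis=0)
```
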
  • Data engineering:
    - Training datasets are organized by days, and the model is trained in a streaming manner to keep up with shifts in the data distribution.

References:

  1. Yi, Xinyang, et al. "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations." Proceedings of the 13th ACM Conference on Recommender Systems. ACM, 2019.
