[Paper Notes] Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations

Devin Z
Sep 22, 2024


Two-tower neural network for scalable item retrieval.

  • Two phases of recommendation:
    - a scalable retrieval system, which is the subject of this paper;
    - a full-blown ranking system, which is the subject of the Wide & Deep paper.
  • Application:
    - The proposed method is used as one of the nominators that generate hundreds of candidates when recommending YouTube videos, conditioned on a seed video being watched by a user.
    - This was evaluated both in offline experiments on clicked video recall and in online A/B testing on user engagement improvement.
  • Challenges with representation learning:
    - extremely large item corpus
    - sparse user feedback
    - cold start for fresh items
  • Compared with DNNs, MF-based models (e.g. factorization machines) usually capture only second-order feature interactions.
  • Two-tower architecture (see the sketch below):
    - The left tower encodes user and context features.
    - The right tower encodes item features.
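
A minimal sketch of the two towers, assuming PyTorch (the paper's original implementation was in TensorFlow); the layer sizes and names here are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """A stack of fully connected ReLU layers mapping features to an embedding."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# u encodes user/context features, v encodes item features;
# the model's score for a (query, item) pair is the inner product.
u = Tower(in_dim=128, out_dim=64)   # left tower
v = Tower(in_dim=96, out_dim=64)    # right tower
scores = (u(torch.randn(32, 128)) * v(torch.randn(32, 96))).sum(dim=-1)
```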
Two-Tower DNN Model¹
  • The multi-class classification network is a degenerate two-tower DNN, where
    - The right tower is simplified to a single layer with item embeddings.
    - There is assumed to be a fixed vocabulary of items and training is assisted by negative sampling.
  • Theoretical modeling:
    - Queries and items are represented by feature vectors x and y respectively.
    - Separate embedding functions u and v are applied to query and item features.
    - The output of the model is the inner product of two embeddings.
    - Given a query, the probability of picking a candidate item is modeled by the softmax function.
    - Each datapoint in the training set consists of a query, an item, and an associated reward capturing user engagement with the item (e.g. watch time).
    - The loss function is a weighted log-likelihood, written out below.
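
With s(x, y) denoting the inner product of the two tower outputs, the softmax probability and the reward-weighted log-likelihood over T training examples are:

```latex
P(y \mid x; \theta) = \frac{e^{s(x,y)}}{\sum_{y' \in \mathcal{Y}} e^{s(x,y')}},
\qquad s(x, y) = \langle u(x; \theta),\, v(y; \theta) \rangle
```

```latex
L_T(\theta) = -\frac{1}{T} \sum_{i=1}^{T} r_i \cdot \log P(y_i \mid x_i; \theta)
```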
  • In practice, it's not feasible to score all candidate items, so only in-batch items are used as negatives for the queries in the same batch.
    - Because popular items appear in batches more often, this would leave them overly penalized as negatives.
    - To counteract that, the log of the item's estimated sampling probability is subtracted from the inner-product logit (a logQ correction); a sketch follows this list.
    - The sampling probability of an item is estimated by streaming frequency estimation.
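
A sketch of the corrected in-batch softmax loss, assuming PyTorch; the function name and tensor layout are mine, not the paper's:

```python
import torch
import torch.nn.functional as F

def corrected_in_batch_loss(q_emb: torch.Tensor,   # [B, D] query embeddings
                            i_emb: torch.Tensor,   # [B, D] positive-item embeddings
                            reward: torch.Tensor,  # [B] per-example reward r_i
                            log_p: torch.Tensor):  # [B] log of estimated sampling prob.
    """Reward-weighted softmax over in-batch negatives with the logQ correction."""
    logits = q_emb @ i_emb.T               # [B, B]: row b scores all in-batch items
    logits = logits - log_p.unsqueeze(0)   # subtract log(p_j) from item j's column
    labels = torch.arange(q_emb.size(0))   # the diagonal holds each row's positive
    per_example = F.cross_entropy(logits, labels, reduction="none")
    return (reward * per_example).mean()
```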
  • Streaming frequency estimation (sketched below):
    - Estimating an item's sampling probability reduces to estimating the average number of training steps between two consecutive hits of the item; this gap estimate is maintained in array B, and the probability is its reciprocal.
    - Array A records the latest step at which each item was sampled and is used to update array B.
    - The update step can be understood as SGD with a fixed learning rate.
    - In distributed training, arrays A and B are served by multiple parameter servers and are updated alongside the asynchronous SGD training of the neural networks.
    - Multiple hashing is adopted to mitigate over-estimation of item frequency under hash collisions, similar to the idea of a count-min sketch.
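
A pure-Python sketch of the A/B update with multiple hashing; the hash function, bucket count, and learning rate alpha are illustrative choices, not the paper's:

```python
import numpy as np

class FrequencyEstimator:
    """Streaming estimate of each item's sampling probability.

    A[i, h] is the last global step at which bucket h of hash i was hit;
    B[i, h] is a moving average of the gap between consecutive hits, so the
    sampling probability is roughly 1 / B. Taking the largest gap across
    several hash functions counters the frequency over-estimation that
    collisions cause, in the spirit of a count-min sketch.
    """

    def __init__(self, num_buckets: int, num_hashes: int = 3, alpha: float = 0.01):
        self.A = np.zeros((num_hashes, num_buckets))
        self.B = np.zeros((num_hashes, num_buckets))
        self.alpha = alpha                 # fixed learning rate of the update
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes

    def _bucket(self, item_id: str, i: int) -> int:
        return hash((i, item_id)) % self.num_buckets   # illustrative hash

    def update(self, item_id: str, step: int) -> None:
        for i in range(self.num_hashes):
            h = self._bucket(item_id, i)
            gap = step - self.A[i, h]
            self.B[i, h] = (1 - self.alpha) * self.B[i, h] + self.alpha * gap
            self.A[i, h] = step

    def prob(self, item_id: str) -> float:
        gaps = [self.B[i, self._bucket(item_id, i)] for i in range(self.num_hashes)]
        return 1.0 / max(max(gaps), 1.0)   # largest gap => most conservative estimate
```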
  • Additional details:
    - Embeddings are L2-normalized to improve trainability and retrieval quality.
    - Each logit is divided by a well-tuned temperature hyperparameter to sharpen the predictions (both shown below).
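
Both tricks in brief, assuming PyTorch; the stand-in tensors and the temperature value are purely illustrative:

```python
import torch
import torch.nn.functional as F

u_emb = torch.randn(32, 64)             # stand-ins for the two tower outputs
v_emb = torch.randn(32, 64)

tau = 0.05                              # illustrative; tuned as a hyperparameter
u_norm = F.normalize(u_emb, dim=-1)     # L2-normalize both embeddings so the
v_norm = F.normalize(v_emb, dim=-1)     # inner product becomes cosine similarity
logits = (u_norm * v_norm).sum(dim=-1) / tau  # dividing by tau < 1 sharpens softmax
```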
  • Feature engineering:
    - The final embedding for a multi-hot vector of categorical features (e.g. video topics) is a weighted sum of the embeddings for each of the values in the multi-hot vector.
    - Out-of-vocabulary entities are randomly assigned to a fixed set of hash buckets and a separate embedding is learned for each one.
    - The watch history of a user is treated as a bag of words and represented by the average of the watched videos' id embeddings (see the sketch below).
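
A sketch of all three feature tricks, assuming PyTorch; the bucket count, ids, and weights are made up for illustration:

```python
import torch
import torch.nn as nn

NUM_BUCKETS, DIM = 100_000, 64
emb = nn.Embedding(NUM_BUCKETS, DIM)    # shared id-embedding table

def bucketize(entity_id: str) -> int:
    # Out-of-vocabulary entities land in a fixed set of hash buckets,
    # each of which gets its own learned embedding row.
    return hash(entity_id) % NUM_BUCKETS

# Multi-hot categorical feature (e.g. video topics): weighted sum of embeddings.
topic_ids = torch.tensor([[bucketize("music"), bucketize("live")]])
topic_w   = torch.tensor([[0.7, 0.3]])
topic_vec = (emb(topic_ids) * topic_w.unsqueeze(-1)).sum(dim=1)   # [1, DIM]

# Watch history as a bag of words: average of the watched videos' id embeddings.
history_ids = torch.tensor([[bucketize("v1"), bucketize("v2"), bucketize("v3")]])
history_vec = emb(history_ids).mean(dim=1)                        # [1, DIM]
```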
  • Data engineering:
    - Training datasets are organized by day, and the model is trained in a streaming manner to keep up with the latest data distribution.

References:

  1. Yi, Xinyang, et al. "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations." Proceedings of the 13th ACM Conference on Recommender Systems. ACM, 2019.
