# [Paper Notes] Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations

Two-tower neural network for scalable item retrieval.

- Two phases of recommendation:

- a scalable retrieval system, which is what this paper talks about;

- a full-blown ranking system, which is what *Wide & Deep* talked about.

- Application:

- The proposed method is used as one of the nominators for generating hundreds of candidates when recommending YouTube videos conditioned on a feed video being watched by a user.

- This was evaluated both in offline experiments on clicked video recall and in online A/B testing on user engagement improvement.

- Challenges with representation learning:

- extremely large item corpus

- sparse user feedback

- cold-start for fresh items

- MF-based models are usually only capable of capturing second-order feature interactions (e.g. factorization machines), compared with DNNs.
- Two-tower architecture:

- The left tower encodes user and context features.

- The right tower encodes item features.

- The multi-class classification network is a degenerated two-tower DNN, where

- The right tower is simplified to a single layer with item embeddings.

- There is assumed to be a fixed vocabulary of items, and training is assisted by negative sampling.

- Theoretical modeling:

- Queries and items are represented by feature vectors *x* and *y* respectively.

- Separate embedding functions *u* and *v* are applied to query and item features.

- The output of the model is the inner product of two embeddings.

- Given a query, the probability of picking a candidate item is modeled by the softmax function.
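The scoring and softmax steps above can be sketched as follows; the embedding dimensions and the random stand-ins for the learned towers *u* and *v* are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
U = rng.normal(size=(1, 8))   # stand-in for u(x): embedding of one query
V = rng.normal(size=(5, 8))   # stand-in for v(y_j): embeddings of 5 candidate items

# The model output is the inner product s(x, y) = <u(x), v(y)>.
logits = (U @ V.T).ravel()

# Given the query, P(y_j | x) is modeled by the softmax over candidate items.
probs = softmax(logits)
```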

- Each datapoint in the training set consists of a query, an item, and an associated reward capturing user engagement with the item (e.g. watch time).

- The loss function is a weighted log-likelihood.

- In practice, it’s not feasible to include all candidate items, so only in-batch items are considered as negatives for all queries in the same batch.

- This results in popular items being overly penalized as negatives.

- To counteract that, the log of the estimated sampling probability of the item is deducted from the inner-product logit.

- The sampling probability of an item is estimated by streaming frequency estimation.
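The corrected in-batch softmax loss might be sketched as follows; the helper name, shapes, and the uniform sampling probabilities are illustrative assumptions (with a uniform estimate the correction cancels inside the softmax, whereas real per-item estimates vary).

```python
import numpy as np

def corrected_batch_softmax_loss(u, v, log_p, rewards):
    """In-batch softmax loss with the sampling-bias (logQ) correction.

    u: (B, d) query embeddings; v: (B, d) embeddings of each query's
    positive item; log_p: (B,) estimated log sampling probability of each
    in-batch item; rewards: (B,) per-example rewards (e.g. watch time).
    """
    logits = u @ v.T                  # s(x_i, y_j) for all in-batch pairs
    logits = logits - log_p[None, :]  # correction: subtract log p_j per item
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    # Weighted negative log-likelihood of the positive (diagonal) pairs.
    return -np.mean(rewards * np.diag(log_probs))

rng = np.random.default_rng(0)
B, d = 4, 8
u, v = rng.normal(size=(B, d)), rng.normal(size=(B, d))
log_p = np.log(np.full(B, 1e-2))  # pretend each item is sampled w.p. 0.01
loss = corrected_batch_softmax_loss(u, v, log_p, rewards=np.ones(B))
```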

- Streaming frequency estimation:

- To estimate the sampling probability of an item, it reduces to estimating the average number of steps between two consecutive hits of the item, which is maintained as array *B*.

- Array *A* stores the latest step at which an item was sampled, and is used for updating array *B*.

- The update step can be understood as SGD with a fixed learning rate.

- In distributed training, arrays *A*, *B* are served by multiple parameter servers and are updated along with asynchronous stochastic gradient descent training of the neural networks.
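The *A*/*B* update and the resulting probability estimate can be sketched as below; the array size, hash scheme, and learning rate α are illustrative assumptions, and the sketch also includes the count-min-sketch-style multiple hashing the notes mention.

```python
import numpy as np

class StreamingFrequencyEstimator:
    """Sketch of streaming frequency estimation with arrays A and B.

    A[h(y)] stores the last step item y was seen; B[h(y)] keeps a moving
    average of the gap between consecutive hits. The sampling probability
    is estimated as 1 / B[h(y)].
    """

    def __init__(self, num_hashes=2, size=1024, alpha=0.05, seed=0):
        self.alpha = alpha
        self.size = size
        rng = np.random.default_rng(seed)
        # Independent hash seeds emulate multiple hash functions.
        self.seeds = rng.integers(0, 2**31, size=num_hashes)
        self.A = np.zeros((num_hashes, size))  # last step each bucket was hit
        self.B = np.ones((num_hashes, size))   # estimated steps between hits

    def _buckets(self, item_id):
        return [hash((int(s), item_id)) % self.size for s in self.seeds]

    def update(self, item_id, step):
        for i, b in enumerate(self._buckets(item_id)):
            # SGD-style update of the average gap with a fixed learning rate.
            self.B[i, b] = (1 - self.alpha) * self.B[i, b] \
                + self.alpha * (step - self.A[i, b])
            self.A[i, b] = step

    def estimate_prob(self, item_id):
        # Collisions shrink the observed gap (over-estimating frequency),
        # so take the largest gap across the hash functions.
        gap = max(self.B[i, b] for i, b in enumerate(self._buckets(item_id)))
        return 1.0 / gap
```

For example, an item seen every 2 steps should converge to an estimated sampling probability near 0.5.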

- Multiple hashing is adopted to mitigate over-estimation of item frequency, similar to the idea of count-min sketch.

- Additional details:

- Embeddings are normalized to improve trainability and retrieval quality.
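Normalization together with temperature scaling of the logit can be sketched as follows; the temperature value here is illustrative, not the paper's tuned setting.

```python
import numpy as np

def scored_logit(u, v, temperature=0.05):
    """Inner product of L2-normalized embeddings, divided by a temperature.

    A temperature < 1 sharpens the resulting softmax; 0.05 is an
    illustrative value only.
    """
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(u @ v) / temperature
```

With identical inputs the normalized inner product is 1, so the logit is simply 1 / temperature.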

- Each logit is divided by a well-tuned temperature hyperparameter to sharpen the predictions.

- Feature engineering:

- The final embedding for a multi-hot vector of categorical features (e.g. video topics) is a weighted sum of the embeddings for each of the values in the multi-hot vector.

- Out-of-vocabulary entities are randomly assigned to a fixed set of hash buckets and a separate embedding is learned for each one.
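The weighted multi-hot embedding and OOV hashing might look like the sketch below; the vocabulary, bucket count, weights, and CRC32 hash are all illustrative assumptions.

```python
import zlib
import numpy as np

# Illustrative vocabulary and bucket count (not from the paper).
VOCAB = {"music": 0, "gaming": 1, "sports": 2}
NUM_OOV_BUCKETS = 4
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(VOCAB) + NUM_OOV_BUCKETS, 8))  # one row per id / bucket

def lookup(token):
    """In-vocabulary tokens get their own row; OOV tokens hash to a bucket."""
    if token in VOCAB:
        return VOCAB[token]
    return len(VOCAB) + zlib.crc32(token.encode()) % NUM_OOV_BUCKETS

def multi_hot_embedding(tokens, weights):
    """Weighted sum of value embeddings for one multi-hot categorical feature."""
    rows = np.stack([emb[lookup(t)] for t in tokens])
    return (np.asarray(weights)[:, None] * rows).sum(axis=0)

vec = multi_hot_embedding(["music", "vlogs"], [0.7, 0.3])  # "vlogs" is OOV
```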

- The watch history of a user is treated as a bag of words, represented by the average of the video id embeddings.

- Data engineering:

- Training datasets are organized by days, whereby the model is trained in a streaming manner to keep up with the latest data distribution shift.

## References:

- Yi, Xinyang, et al. *Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations*. 13th ACM Conference on Recommender Systems. ACM, 2019.