[Paper Notes] Wide & Deep Learning for Recommender Systems

Devin Z
3 min read · Sep 15, 2024


App recommendation ranking at the Google Play store.

Sierra National Forest, August 31, 2024
  • Two-step recommendation:
    - Retrieval system, which returns a short list of items as the candidate pool;
    - Ranking system, which ranks all candidate items by their scores.
  • The score of an item (e.g. an app) is the probability of a user action label (e.g. app acquisition), conditioned on three groups of features (see the formula just below this list):
    - user features
    - context features
    - item features
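Written out (notation mine, not the paper's), the ranking score of an item is:

```latex
s(\text{item}) = P\bigl(y = 1 \mid \mathbf{x}_{\text{user}},\ \mathbf{x}_{\text{context}},\ \mathbf{x}_{\text{item}}\bigr)
```

where y = 1 denotes the action of interest (e.g. the app was acquired).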
  • Two objectives:
    - Memorization: learning the frequent co-occurrence of items or features and exploiting the correlation available in the historical data.
    - Generalization: exploring new feature combinations that have never or rarely occurred in the past based on transitivity of correlation.
  • Background knowledge:
    - Collaborative filtering achieves memorization by using the sparse user-item interaction matrix.
    - Typically, serving a query entails finding the top-k similar items to the items that this user has historically interacted with.
    - Matrix factorization learns low-dimensional latent vectors for users and items (e.g. via SVD++).
    - Logistic regression is a simple, scalable and interpretable model for predicting the user action (e.g. CTR) from manually engineered (or GBDT-generated) features.
    - Factorization machines learn low-dimensional embeddings for features and model pairwise feature interactions via dot products of those embeddings.
  • Problems with past work:
    - Linear models generalize poorly to unseen feature interactions.
    - Embedding-based models can over-generalize when the user-item interactions are sparse and high-rank.
  • The approach of the paper:
    - The model sums the output logits of a wide component and a deep component.
    - The wide component is a generalized linear model with cross-product transformations of the categorical features.
    - The deep component is a feed-forward neural network with three ReLU layers.
    - The two components are trained jointly rather than ensembled after the fact (a minimal sketch follows this list).
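Below is a minimal TensorFlow/Keras sketch of the joint model, with toy feature sizes. In the paper, the wide part is optimized with FTRL (with L1 regularization) and the deep part with AdaGrad; this single-optimizer sketch glosses over that detail:

```python
import tensorflow as tf

# Hypothetical sizes for illustration only.
CROSS_DIM = 10_000   # multi-hot wide input incl. cross-product transformations
VOCAB = 1_000        # vocabulary of one categorical feature (real model has many)
EMBED_DIM = 32       # embedding dimension reported in the paper
CONT_DIM = 4         # number of continuous features, already normalized to [0, 1]

wide_in = tf.keras.Input(shape=(CROSS_DIM,), name="wide")
cat_in = tf.keras.Input(shape=(1,), dtype="int32", name="categorical")
cont_in = tf.keras.Input(shape=(CONT_DIM,), name="continuous")

# Deep component: embed categorical IDs, concatenate with continuous
# features, then the three ReLU layers from the paper (1024 -> 512 -> 256).
emb = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(VOCAB, EMBED_DIM)(cat_in))
deep = tf.keras.layers.Concatenate()([emb, cont_in])
for units in (1024, 512, 256):
    deep = tf.keras.layers.Dense(units, activation="relu")(deep)

# Joint prediction: one sigmoid over the sum of the wide (linear) logit and
# the deep logit, so both components are trained against the same loss.
logit = tf.keras.layers.Dense(1)(wide_in) + tf.keras.layers.Dense(1)(deep)
prob = tf.keras.layers.Activation("sigmoid")(logit)

model = tf.keras.Model([wide_in, cat_in, cont_in], prob)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

The key point is the single sigmoid over the summed logits: both components see the same gradient signal, which is what distinguishes joint training from a simple ensemble of separately trained models.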
  • System implementation:
    - The training set contains over 500 billion examples, each corresponding to one impression.
    - Categorical feature strings are mapped to integer IDs, which index 32-dimensional learned embeddings.
    - Continuous feature values are normalized to [0, 1] using their empirical CDF quantiles (sketched in the code below the figure caption).
    - All the feature embeddings are concatenated into a ~1200-dimensional dense vector before being fed into the deep component.
    - When a new training set arrives, the embeddings and the linear model weights are warm-started from the previous model.
    - Multithreading parallelism is used to reduce the serving latency.
Wide & Deep model structure for apps recommendation¹
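A rough sketch of the continuous-feature normalization mentioned above, following the paper's scheme of mapping a value to its empirical CDF discretized into n_q quantiles; function and variable names here are mine:

```python
import numpy as np

def fit_quantile_boundaries(train_values: np.ndarray, n_q: int = 10) -> np.ndarray:
    """Interior quantile boundaries (n_q - 1 of them) from training data."""
    probs = np.linspace(0.0, 1.0, n_q + 1)[1:-1]
    return np.quantile(train_values, probs)

def normalize(x: np.ndarray, boundaries: np.ndarray) -> np.ndarray:
    """Map a raw value in the i-th quantile to (i - 1) / (n_q - 1) in [0, 1]."""
    n_q = len(boundaries) + 1
    quantile_index = np.searchsorted(boundaries, x, side="right")
    return quantile_index / (n_q - 1)

# Example: a heavily skewed feature still ends up roughly uniform in [0, 1].
train = np.random.lognormal(size=10_000)
bounds = fit_quantile_boundaries(train, n_q=10)
print(normalize(np.array([0.1, 1.0, 10.0]), bounds))
```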
  • Evaluation metrics:
    - acquisition rates in online A/B tests
    - AUC on offline holdout data (a minimal computation is sketched below)
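For the offline metric, a minimal AUC sketch with scikit-learn; the arrays are placeholders standing in for holdout labels and model scores:

```python
from sklearn.metrics import roc_auc_score

labels = [0, 1, 1, 0, 1]            # 1 = app acquired on this impression
scores = [0.2, 0.8, 0.6, 0.3, 0.9]  # model's predicted acquisition probability
print(roc_auc_score(labels, scores))  # 1.0 here: every positive outranks every negative
```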

References:

  1. Cheng, Heng-Tze, et al. "Wide & Deep Learning for Recommender Systems." Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016.
