[Paper Notes] Real-time Personalization using Embeddings for Search Ranking at Airbnb
The skip-gram model for item embeddings.
- Unique problems at Airbnb:
- It’s a two-sided market, where one needs to optimize for both host and guest preferences.
- A user rarely consumes the same item twice (data sparseness).
- One listing can accept only one guest for a certain set of dates.
- Methodology:
- Use short-term in-session data to learn listing embeddings.
- Use long-term, sparse booking data to learn user-type and listing-type embeddings.
- Background knowledge:
- Embeddings map one-hot encodings of categorical features into dense vectors.
- The skip-gram model works by predicting a surrounding item given a central item.
- CBOW (continuous bag of words) works by predicting the central item given its surrounding items.
- In real-time serving systems, embeddings are typically cached in in-memory key-value stores and indexed for approximate nearest neighbor queries (via HNSW, IVF-PQ, etc.); a brute-force stand-in is sketched after this list.
- Negative sampling and hierarchical softmax are techniques for reducing computation during training.
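As a concrete illustration of the serving-time lookup above, here is a minimal sketch that retrieves the most similar listings by brute-force cosine similarity, standing in for a real ANN index such as HNSW or IVF-PQ; all identifiers and the random vectors are hypothetical, not taken from the paper.

```python
# Minimal sketch of the serving-time lookup: embeddings held in an in-memory dict
# (standing in for a key-value store) and queried by brute-force cosine similarity
# (standing in for an ANN index such as HNSW or IVF-PQ). All identifiers and the
# random vectors are hypothetical, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
dim = 32
# In production these vectors would be pre-trained and refreshed periodically.
listing_embeddings = {f"listing_{i}": rng.normal(size=dim) for i in range(1000)}

def most_similar(query_id, k=5):
    """Return the k listings with the highest cosine similarity to the query listing."""
    q = listing_embeddings[query_id]
    ids = [lid for lid in listing_embeddings if lid != query_id]
    mat = np.stack([listing_embeddings[lid] for lid in ids])
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [(ids[i], float(sims[i])) for i in top]

print(most_similar("listing_42"))
```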
- The skip-gram model for listing embeddings:
- The objective is to maximize the log likelihood of observing a surrounding clicked listing in the context window of a central clicked listing.
- Negative sampling is used to avoid the heavy computation of the softmax denominator when the vocabulary is large: the multi-class classification is approximated by a set of binary classifications (see the training-step sketch after this section).
- For sessions that end with the user booking a listing, the booked listing is added as a global context item, i.e., it is treated as context for every central listing in the session regardless of whether it falls within the context window.
- Congregated search: users typically search within a single market (i.e. the location they want to stay in). To account for this, random negative samples drawn from the same market as the central listing are explicitly added.
- The embedding of a newly created listing is initialized as the mean vector of the embeddings of the three closest existing listings, found heuristically (by geographic proximity, similar listing type, and price range); see the cold-start sketch after this section.
- It was experimentally confirmed that the listing embeddings encode similarity in geography, listing type, and price range.
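To make the objective and its two modifications concrete, below is a minimal sketch of one negative-sampling training pass over a click session: positives come from the context window plus the booked listing as global context, and extra random negatives are drawn from the central listing's market. The data layout, helper names, and hyperparameters (learning rate, window size, negative counts) are assumptions for illustration, not the paper's implementation.

```python
# Sketch of skip-gram with negative sampling over one click session, including
# (a) the booked listing as global context and (b) extra random negatives drawn
# from the same market as the central listing ("congregated search").
import numpy as np

rng = np.random.default_rng(1)
dim, lr = 32, 0.025
center_vecs = {}   # "input" embeddings, one per listing id
context_vecs = {}  # "output" embeddings, one per listing id

def vec(table, lid):
    if lid not in table:
        table[lid] = rng.normal(scale=0.1, size=dim)
    return table[lid]

def sgd_pair(center_id, other_id, label):
    """One logistic-loss update: label=1.0 for positives, 0.0 for negatives."""
    v, u = vec(center_vecs, center_id), vec(context_vecs, other_id)
    score = 1.0 / (1.0 + np.exp(-np.dot(v, u)))   # sigmoid(v . u)
    g = lr * (label - score)
    center_vecs[center_id] = v + g * u
    context_vecs[other_id] = u + g * v

def train_session(clicks, booked_id, market_listings, all_listings,
                  window=2, n_neg=5, n_market_neg=3):
    for i, center in enumerate(clicks):
        context = clicks[max(0, i - window): i] + clicks[i + 1: i + 1 + window]
        if booked_id is not None:
            context = context + [booked_id]      # booked listing as global context
        for pos in context:
            sgd_pair(center, pos, 1.0)
            # ordinary random negatives from the whole vocabulary
            # (a real implementation would exclude the positives)
            for neg in rng.choice(all_listings, size=n_neg):
                sgd_pair(center, neg, 0.0)
        # extra negatives drawn from the central listing's own market
        for neg in rng.choice(market_listings, size=n_market_neg):
            sgd_pair(center, neg, 0.0)

# Toy usage: a booked session within one market.
all_listings = [f"l{i}" for i in range(200)]
market = all_listings[:50]
train_session(["l1", "l2", "l3", "l4"], booked_id="l4",
              market_listings=market, all_listings=all_listings)
```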
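The cold-start rule for new listings can be sketched as below, assuming "closest" means geographically nearest among existing listings of the same type and price bucket that already have embeddings; the helper names and data layout are hypothetical.

```python
# Sketch of the cold-start rule: a new listing's embedding is the mean of the
# embeddings of a few similar existing listings. "Similar" is approximated here by
# same listing type / price bucket and smallest geographic distance; the exact
# filtering is an assumption about the heuristic.
import numpy as np

def cold_start_embedding(new_listing, candidates, embeddings, k=3):
    """candidates: list of dicts with 'id', 'lat', 'lng', 'type', 'price_bucket'."""
    pool = [c for c in candidates
            if c["type"] == new_listing["type"]
            and c["price_bucket"] == new_listing["price_bucket"]
            and c["id"] in embeddings]
    # squared lat/lng distance is enough for a nearest-neighbour ranking sketch
    pool.sort(key=lambda c: (c["lat"] - new_listing["lat"]) ** 2
                          + (c["lng"] - new_listing["lng"]) ** 2)
    nearest = pool[:k]
    if not nearest:
        return None  # caller would fall back to some default initialization
    return np.mean([embeddings[c["id"]] for c in nearest], axis=0)
```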
- User-type and listing-type embeddings:
- A rule-based mapping is defined to determine the type of a listing given its metadata.
- The type of a user is determined similarly, except that the latest booking history, if any, is also taken into account.
- The type of a listing or a user may change over time.
- The user_type and listing_type embeddings are learned in the same vector space from long-term cross-market booking events.
- A session is defined as a time-ordered sequence of (user_type, listing_type) pairs from a single user’s booking history.
- If a user_type is the central item, the surrounding items are positive and (random) negative listing_types.
- If a listing_type is the central item, the surrounding items are positive and (random) negative user_types.
- Host rejections are explicitly added as negative samples to reflect host-side preferences.
- Listing types that better match a user type have higher cosine similarity between their embeddings (a training-step and ranking sketch follows this section).
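Below is a minimal sketch of how the type-level training could combine booking positives, random negatives, and explicit host-rejection negatives in one shared vector space, plus a cosine-similarity ranking of listing types for a user type. Only the user_type-as-central-item direction is shown; the listing_type-as-central-item case is symmetric. The names and dict-based storage are illustrative assumptions, not the paper's code.

```python
# Sketch of a type-level update: booking positives, random negatives, and explicit
# host-rejection negatives, with user_types and listing_types in one vector space.
import numpy as np

rng = np.random.default_rng(2)
dim, lr = 32, 0.025
vectors = {}  # single table: user_type and listing_type keys share the space

def vec(key):
    if key not in vectors:
        vectors[key] = rng.normal(scale=0.1, size=dim)
    return vectors[key]

def sgd_pair(a, b, label):
    """One logistic-loss update: label=1.0 pulls a and b together, 0.0 pushes apart."""
    va, vb = vec(a), vec(b)
    score = 1.0 / (1.0 + np.exp(-np.dot(va, vb)))
    g = lr * (label - score)
    vectors[a] = va + g * vb
    vectors[b] = vb + g * va

def update_user_type(user_type, booked_listing_types, rejected_listing_types,
                     all_listing_types, n_neg=5):
    for lt in booked_listing_types:              # bookings are positives
        sgd_pair(user_type, lt, 1.0)
    for lt in rng.choice(all_listing_types, size=n_neg):
        sgd_pair(user_type, lt, 0.0)             # random negatives
    for lt in rejected_listing_types:            # host rejections pushed apart explicitly
        sgd_pair(user_type, lt, 0.0)

def rank_listing_types(user_type, listing_types):
    """Rank candidate listing_types by cosine similarity to the user_type."""
    u = vec(user_type)
    sims = {lt: float(np.dot(u, vec(lt)) /
                      (np.linalg.norm(u) * np.linalg.norm(vec(lt)) + 1e-9))
            for lt in listing_types}
    return sorted(sims.items(), key=lambda kv: -kv[1])

# Toy usage with hypothetical type ids.
all_lt = [f"listing_type_{i}" for i in range(20)]
update_user_type("user_type_3", ["listing_type_1"], ["listing_type_7"], all_lt)
print(rank_listing_types("user_type_3", all_lt)[:3])
```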
References:
- Mihajlo Grbovic and Haibin Cheng. Real-time Personalization using Embeddings for Search Ranking at Airbnb. KDD ’18, August 19–23, 2018, London, United Kingdom.