Deep Neural Networks for YouTube Recommendations

Challenge	Core Problem	Architectural Response
Scale	Millions of videos, billions of users, real-time serving	Two-stage funnel, ANN at retrieval, separate models per stage
Freshness	New content every second; historically-trained models bias toward old content	Example age feature, continuous training pipelines, exploration
Noise	Implicit signals only; no ground truth satisfaction; metadata quality low	Watch time label over CTR, per-user loss weighting, robust surrogates

Phase 02

Candidate Generation

Recommendation as Extreme Multiclass Classification

The paper reframes recommendation as a classification problem. Given user U in context C, what is the probability they watch video i next?

softmax P(w_t = i | U, C) = exp(v_i · u) / Σ_j∈V exp(v_j · u)

Where u ∈ R^N is the user embedding produced by the DNN, and v_i ∈ R^N are video embeddings. The network learns user embeddings such that dot product with relevant video embeddings is maximized.

Why Classification and Not Regression or Pairwise Ranking?

Alternatives

Regression: what label? sparse explicit ratings don't scale
Pairwise ranking: combinatorially expensive at millions of videos
Hierarchical softmax: tried and rejected — arbitrary tree splits hurt accuracy

Why Classification Wins

Natural label: the video the user actually watched next
Produces embeddings as a byproduct — reusable for ANN serving
Scales with negative sampling (100x+ speedup)
Non-linear generalization of matrix factorization

Sampled Softmax

Standard softmax over millions of classes is infeasible. The solution: sample ~thousands of negatives from the background distribution, compute cross-entropy loss only over this small set, and apply importance weighting to correct for sampling bias.

For each training example (user watched video i):
  positive:  video i
  negatives: ~thousands sampled randomly from corpus
  
Loss = cross-entropy over {positive ∪ sampled_negatives}
Correction: importance weighting for non-uniform sampling

Speedup: 100x+ over full softmax

Mini-Note 2.1

🔴 Memorize Cold

Sampled softmax: ~thousands negatives, importance weighted, 100x speedup
User embedding = DNN output layer (256-dim)
Video embedding = softmax output layer weights (same 256-dim space)
Why not hierarchical softmax: arbitrary tree splits degrade accuracy

🟢 Derive

Why classification → produces embeddings for ANN serving as a free byproduct
Why implicit feedback → volume wins at tail; explicit feedback too sparse
Why not pairwise ranking → combinatorially expensive at millions

Model Architecture: Embeddings, Averaging & the Tower Network

The core challenge: take heterogeneous, variable-length, sparse inputs and produce a fixed-size dense vector (the user embedding). Here is how each input type is handled:

Step 1: Sparse ID → dense embedding via lookup table
  Video ID (integer) → E ∈ R^(|V| × 256) → dense vector ∈ R^256
  1M videos × 256 dims = 256M params (bulk of model)

Step 2: Variable-length sequence → fixed vector via averaging
  [v1_emb, v2_emb, ..., v50_emb] → average → single 256-dim vector
  Tried: sum, max, average → averaging won

Step 3: All features concatenated → wide first layer
  watch_avg [256] + search_avg [256] + geo [32] + device [32] + scalars
  → wide ~600-800 dim concatenated vector

Step 4: Tower of ReLU layers
  Wide input → 2048 → 1024 → 512 → 256 (user embedding u)

Key non-obvious detail: The output layer weights of the softmax classifier ARE the video embeddings v_i. They live in the same 256-dim space as the user embedding — which is exactly why dot product similarity works for ANN retrieval at serving time.

Architecture Choice	Why	Trade-off
Averaging over RNN	Fast, parallelizable, no sequential dependency	Loses sequence order information
Wide → narrow tower	Information bottleneck; compress to meaningful 256-dim	Dimensionality limits representation capacity
Shared embedding tables	3x+ parameter reduction; rare entities generalize better	Same base; layers above learn role-specific transforms
Separate feature inputs despite shared embeddings	Same video plays different semantic roles in different features	More complex input layer

Depth experiments from the paper confirmed that each additional layer adds non-linear interaction modeling. Best configuration: 1024 → 512 → 256 ReLU layers.

Heterogeneous Signals: What Goes In and Why

Key advantage of DNN over matrix factorization: MF can only use the user-item interaction matrix. DNN can consume anything representable as a vector.

Signal	What It Captures	Representation	Key Design Choice
Watch history	Long-term taste	Avg of 50 video embeddings	ALL watches (not rec-driven); bag of 50
Search history	Short-term intent	Avg of unigram+bigram embeddings	Unordered bag; bigrams capture phrases
Geography	Regional priors	Learned embedding ~32-dim	High cardinality → embedding not scalar
Device	Context / consumption mode	Learned embedding ~32-dim	Mobile vs desktop preferences differ
Age / Gender	Demographic priors	Normalized scalar [0,1]	Binary/continuous → direct scalar input
Example age	Freshness correction	Scalar = t_max − t_watch	Set to 0 at serving (see 2.4)

Demographics serve as priors for cold start. A new user with no watch history gets a zero embedding for watch history — the model gracefully falls back to demographic-based population priors. As history accumulates, the model automatically downweights demographics.

Notably absent at candidate generation: video content features (title, description, audio), social signals, explicit feedback. Content features appear in modern systems but not in this 2016 architecture.

The Example Age Feature: Correcting Historical Bias

Precise definition: Example age = t_max − t_watch, where t_max is the last day of the training window and t_watch is when the watch event occurred. It is the age of the training example, not the age of the video.

The Problem: Sampling Bias Toward the Past

Within a 6-week training window, a video uploaded 5 weeks ago has 5 weeks of watch events. A video uploaded 3 days ago has 3 days. The model naively learns "old popular videos get watched more" — a sampling artifact, not reality.

The Fix: Teaching the Model About Time

During training:
  example_age = t_max - t_watch   (e.g., 0 to 42 days for 6-week window)
  Model learns: small example_age ↔ fresh content popularity spike
                large example_age ↔ established stable popularity

At serving:
  example_age = 0  (or slightly negative)
  Tells the model: "predict for right now"
  Model applies the fresh-content part of the learned popularity curve

What It IS

Age of the watch event (training example)
Correction for sampling bias
Teaches time-dependent popularity
Set to 0 at serving = "right now"

What It is NOT

Age of the video (upload date)
A freshness filter or score
Video-level recency signal
User recency signal

Mini-Note 2.4

🔴 Memorize Cold

example_age = t_max − t_watch (NOT video age)
Set to 0 (or slightly negative) at serving time
Fixes sampling bias, not video-level staleness

🟢 Derive

Why bias exists → more training examples for older videos due to elapsed time
Why setting 0 at serving works → applies fresh-content popularity curve learned during training

Label & Context Selection: The Surrogate Problem

The paper's strong claim: "The choice of this surrogate learning problem has an outsized importance on performance in A/B testing but is very difficult to measure with offline experiments."

Getting the training setup right matters more than model architecture.

The Rollback Fix

WRONG — Random Holdout:
  History: v1 v2 v3 [v4] v5 v6   (v4 held out)
  Input:   v1 v2 v3     v5 v6    ← sees future events!
  Problem: future leakage + ignores asymmetric consumption

CORRECT — Rollback (future watch prediction):
  Choose random cutpoint t in history
  Input:  everything BEFORE t       (v1 v2 v3)
  Label:  the NEXT watch AFTER t    (v4)
  Excluded: v5, v6                  (strict future — never seen)

The Search Query Contamination Problem

Including the most recent search query as a feature when training a homepage model causes the model to learn: "after searching X, recommend X results." It reproduces the search results page as homepage recommendations — disastrous in A/B tests.

Fix: Represent search queries as unordered bags of tokens. Discard sequence and recency. The model can no longer connect a specific search to specific results.

Four Data Pipeline Fixes

Fix	Problem Solved
Train on ALL watches (not rec-driven)	Closed feedback loop; new content starvation
Per-user example capping (fixed N)	Power user dominance; equal loss weighting
Rollback label (future watch)	Future leakage; asymmetric consumption patterns
Unordered bag of tokens for search	Search query contamination on homepage model

Meta-lesson: The biggest wins in this system came from fixing how data is used, not from making the model more complex. Data fixes often beat model fixes.

Serving: Approximate Nearest Neighbor, Not Softmax

The softmax denominator is constant across all videos — it doesn't affect ranking. So top-N retrieval reduces to finding video embeddings with the highest dot product with the user embedding. This is Maximum Inner Product Search (MIPS) — a nearest neighbor problem.

Top-N retrieval = argmax_i (v_i · u) — a MIPS problem, not full softmax

ANN Algorithm	Approach	Query Time	Trade-off
LSH (2016 approach)	Locality sensitive hashing to buckets	O(1) amortized	Recall degrades in high dimensions
HNSW	Multi-layer proximity graph navigation	O(log N)	High memory; excellent recall/speed
IVFPQ	Coarse clusters + product quantization	O(1) approx	Memory efficient; good for billion scale
ScaNN (Google prod)	Anisotropic quantization for MIPS	O(log N)	2x faster than HNSW; optimized for dot product

The A/B results were not sensitive to the choice of ANN algorithm — because the ranking stage downstream corrects retrieval errors. The two-stage design provides error tolerance: retrieval doesn't need to be perfect.

Serving Pipeline

User request
  → User embedding DNN forward pass (~5ms)
  → ANN lookup against pre-built index (~5ms)
  → Top-500 candidate video IDs
  → [Ranking stage]

Train/serve asymmetry: Training uses sampled softmax; serving uses ANN dot product. This works because sampled softmax IS essentially dot product optimization — but it's an implicit assumption. Modern two-tower models address this explicitly with contrastive loss.

Phase 03

Ranking

Why a Separate Ranker?

Dimension	Candidate Generation	Ranking
Items scored	Millions	~Hundreds
Compute per item	~0.1ms max	~1ms+ affordable
Feature availability	User-side only (don't know specific item yet)	Full (user × specific video) cross features
Objective	Broad relevance recall	Precise engagement prediction
Output	User embedding for ANN	Scalar score per video
Loss	Sampled softmax (classification)	Weighted logistic regression

Why Not CTR? The Clickbait Problem

CTR measures the quality of the thumbnail and title. Watch time measures the quality of the video itself. You can trick a click with a deceptive thumbnail — you cannot trick 8 minutes of attention.

The four jobs of the ranker beyond simple scoring: precise scoring with rich features, context specialization (homepage vs. watch-next vs. mobile), source blending (normalize candidates from CF, trending, search), and churn introduction (demote repeatedly ignored impressions).

Feature Taxonomy

The ranking model uses hundreds of features organized along three independent taxonomic dimensions:

Dimension	Types	Key Point
Value type	Categorical vs. Continuous	Categorical → embeddings; Continuous → quantile normalization
Cardinality	Univalent (single value) vs. Multivalent (set)	Multivalent → average embeddings before feeding to network
Compute timing	Query (user/context) vs. Impression (video-specific)	Query computed once per request; Impression computed per candidate

Feature Importance Hierarchy

The paper is explicit: the most important signals are those describing a user's previous interaction with the item and similar items.

Tier 1 (most powerful): User × item interaction history — channel watch count, topic recency. Generalizes via content-aware CF without content features.
Tier 2: Impression frequency — how many times shown without being watched. Encodes explicit rejection signal.
Tier 3: Candidate source metadata — which source nominated this, what score it received. Different sources have different precision.
Tier 4: Video-level features — upload age, view count, category. Less personalized, shared across users.
Tier 5: User-level features — demographics, device, history embedding. Already used in retrieval; less marginal signal at ranking.

Query vs. Impression compute pattern: Query features are computed ONCE per request and amortized across all ~500 candidates. Impression features are computed per candidate — their cost scales linearly. This guides where to invest in feature computation latency budget.

Shared Embeddings & Quantile Normalization

Shared Embedding Spaces

Multiple features reference the same entity (e.g., video ID of the impression, last watched video ID, seed video ID). One global embedding table per ID space; all features referencing the same space share the table. But each feature is fed separately into the network.

Why Share

3x+ parameter reduction (critical for memory)
Rare entities get gradient from multiple feature contexts
Semantic consistency across feature roles

Why Feed Separately

Same video plays different roles (impression vs. seed vs. last-watched)
Layers above learn role-specific transformations
Shared base + separate inputs = best of both worlds

OOV (out-of-vocabulary) values → zero embedding. Vocabulary truncated to top-N by frequency in clicked impressions. Embedding dimension scales with log(cardinality).

Quantile Normalization for Continuous Features

Neural networks are sensitive to input scale. Standard z-score fails for power-law distributed features (view counts, watch times) — outliers compress everything else.

x̃ = F(x) = ∫_-∞^x p(t) dt (CDF transform → uniform [0,1))

After computing x̃, feed three versions of each continuous feature: x̃, x̃², √x̃. This gives the network cheap access to super-linear and sub-linear relationships without manual feature engineering. Paper result: removing powers increased loss by 0.2%.

Mini-Note 3.3

🔴 Memorize Cold

Quantile norm: CDF transform → uniform [0,1)
Powers: feed x̃, x̃², √x̃ for every continuous feature
OOV → zero embedding (neutral signal)
Dim ∝ log(cardinality)

🟢 Derive

Why shared embeddings generalize better → gradient from multiple feature contexts
Why quantile over z-score → power law distributions; outlier robustness
Why powers → cheap non-linear expressivity without more model capacity

Modeling Expected Watch Time via Weighted Logistic Regression

Each training example is either a positive (shown, clicked, watched T_i seconds) or negative (shown, not clicked). The modification: weight positive examples by their observed watch time T_i; keep negative weight = 1.

Positive example (clicked, watched T_i seconds):
  Loss weight = T_i      ← watch time in seconds

Negative example (not clicked):
  Loss weight = 1        ← unit weight

Same cross-entropy loss, same architecture. One change.

What the Model Actually Learns

Working through the math: with watch-time weighting, the learned log-odds are approximately:

Learned odds ≈ E[T] × (1 + P) ≈ E[T] (since click probability P ≪ 1)

The model learns odds that closely approximate expected watch time E[T], not click probability. At serving time, using e^{f(x)} (exponential activation) directly outputs these odds — the quantity you want to rank by.

Empirical Validation

Configuration	Watch-time Weighted Pairwise Loss
No hidden layers (linear)	41.6%
512 → 256 ReLU	35.2%
1024 → 512 → 256 ReLU	34.6%
Equal weighting (positive = negative = 1)	+4.1% vs. watch-time weighted

Elegant insight: This is a poor man's multi-task model — it implicitly learns both P(click) and E[T|click] simultaneously through a single weighted loss, without explicit two-head architecture. Modern systems make this explicit with MTL.

Mini-Note 3.4

🔴 Memorize Cold

Positive weight = T_i (seconds watched)
Negative weight = 1
Serving activation = e^{f(x)}, NOT sigmoid
Equal weighting → +4.1% worse pairwise loss
Learned odds ≈ E[T] (since P ≪ 1)

🟢 Derive

Why watch-time weighting → high T_i positives dominate gradient → push toward E[T]
Why exponential at serving → logistic odds = e^f, learned odds ≈ E[T]
Why not direct regression → skewed distribution + class imbalance makes classification more stable

Phase 04

Systems Perspective

Training Pipeline

Data Generation Decisions

Decision	Choice	Reason
Which watches	ALL watches (not rec-driven)	Break feedback loop; new content surface
Examples per user	Fixed cap N	Equal loss weighting; prevent power user dominance
Label type	Future watch (rollback)	No future leakage; asymmetric patterns
Training window	Several weeks	Signal for rare items; recent enough

Different Negatives for Different Stages

Retrieval Negatives

Random from corpus
Importance weighted
Learns: relevant vs. random noise
Easy but effective at corpus scale

Ranking Negatives

Shown to user but NOT clicked
Much harder — passed retrieval filter
Learns: good impression vs. plausible-but-bad
Critical distinction for ranking quality

Most common silent failure: Training/serving skew — features computed differently at training vs. serving time. Silent degradation, hard to detect. Fix: use the same feature computation code (library) for both training and serving.

Distributed Training

Parameter server architecture with asynchronous SGD. Embedding tables (256M+ params) sharded across parameter servers. Workers fetch only the embedding rows needed per batch. Async SGD trades slightly stale gradients for much faster wall-clock time — the right trade-off at thousands of workers.

Offline vs. Online Evaluation

"Live A/B results are not always correlated with offline experiments." — this is not a minor caveat. It is a fundamental property of recommendation systems.

Four Sources of Divergence

Feedback loop: Offline data reflects old model's distribution. New model changes what data gets generated. Offline holdout is from a different distribution.
Surrogate misalignment: Offline metrics (MAP, NDCG) measure prediction accuracy. Online metrics measure total session value — correlation but not identity.
Position bias: Users click position 1 more than position 5 regardless of quality. Offline evaluation can't model presentation effects.
Long-term effects: Diversity reduces short-term CTR but improves long-term retention. Short A/B tests miss this.

Offline Metric Design Principle

Design offline metrics that mirror the online objective as closely as possible. YouTube's ranking offline metric: watch-time weighted pairwise loss — matches training objective → better correlation with online results.

Metric Hierarchy

Level	Metrics	Importance	Difficulty
Long-term	Day-N retention, DAU/MAU	Highest	Hardest to measure
Session	Session length, depth	High	Medium
Engagement	Watch time, completion rate	Medium	Easy
Satisfaction	Surveys, dislike rate	High	Sparse

Goodhart's Law: Every metric, if optimized hard enough, gets gamed by the model finding shortcuts. Watch time → autoplay traps. CTR → clickbait. Session length → rabbit holes. This is why multi-metric evaluation and human spot-checks are necessary.

Serving Infrastructure

End-to-End Latency Budget

Stage                    Items        Latency    Cost/item
────────────────────────────────────────────────────────
Feature lookup (user)    1            ~5ms       5ms
User embedding DNN       1            ~5ms       5ms
ANN index lookup         1M → 500     ~5ms       0.005ms
Feature lookup (video)   500          ~10ms      0.02ms
Ranking DNN scoring      500          ~20ms      0.04ms
Sort + filter            500          ~1ms       —
────────────────────────────────────────────────────────
Total                                 ~46ms

ANN: 0.005ms/item vs Ranking: 0.04ms/item → 8x more expensive
→ Confirms why ranking DNN can't run at retrieval scale

Feature Store Architecture

The feature store unifies batch (daily aggregates) and real-time (per-event streaming) features behind a single low-latency lookup interface. The critical rule: same feature computation code at training and serving time — prevents training/serving skew.

Caching Strategy

What to Cache	TTL	Reason
User embeddings	~5 minutes	Taste changes slowly; DNN forward pass saved
Video features	Minutes to hours	View count, category change slowly
Impression history	DO NOT CACHE	Must be fresh; stale = wrong churn behavior
Final ranked list	DO NOT CACHE	Personalization must be per-request

Graceful Degradation

Feature store slow → use cached features or defaults
Ranking DNN overloaded → fall back to retrieval score ordering
ANN index unavailable → fall back to popularity-based recommendations
Full service down → pre-cached trending videos

Post-processing layer: Between model output and what users see: de-duplication, diversity enforcement, policy filters (age restrictions, geographic licensing), ad insertion. This is rule-based, not ML. Always distinguish model output from final serving output.

Part II — Where Things Stand Today

The 2016 paper established the blueprint. What followed was a decade of refinement — better retrieval models, richer user representations, multi-objective training, and entirely new serving paradigms. This section traces that arc and examines the current state across the industry.

Phase 05

Evolution & State of Art

What Changed at YouTube Post-2016

2016

Single DNN retrieval + Single DNN ranking. Average pooling. Watch time objective. Random negatives. This paper.

2018

Multi-task learning introduced. MMoE (Multi-gate Mixture of Experts) solves conflicting gradient problem across tasks. Multiple objectives blended via weighted sum.

2019

Two-tower model formalized. Sequential attention replaces averaging. Sampling bias correction (subtract log p(v)). In-batch negatives. Position bias correction via shallow tower.

2020

Multimodal content embeddings for cold start. Hard negative mining (iterative). ScaNN published as production ANN algorithm.

2021+

Transformer-based user modeling. RL/bandit for exploration. Real-time feature updates (seconds). Continuous training pipelines.

MMoE — Solving Conflicting Objectives

Watch time alone creates pathological behavior (rabbit holes, autoplay traps). MMoE introduces task-specific gating networks over a mixture of expert networks — each task learns which experts to weight, allowing conflicting tasks to use different expert combinations without interfering.

Final ranking score:
  = w1 × E[watch_time]
  + w2 × P(like)
  + w3 × P(share)
  - w4 × P(dislike)
  - w5 × P(skip)

Weights tuned via A/B testing.
These values are closely guarded at any company.

Industry Evolution: Facebook, TikTok, Twitter

Platform	Key Innovation	Primary Objective	Unique Signal
Facebook/Meta	DLRM: explicit dot product feature crosses + hybrid CPU-GPU for terabyte embeddings	CTR + meaningful social interactions	Social graph engagement
TikTok	Cascade new content expansion; Monolith real-time embedding updates	Watch completion rate (% not seconds)	Pure exploration — no social graph required
Twitter/X	SimClusters community detection; GNN user modeling	Engagement + recency	Social proof (followed accounts' engagement)
YouTube (modern)	MMoE multi-task; transformer sequence; ScaNN retrieval	Multi-objective blend	Long-form watch time + satisfaction surveys

TikTok's Cascade Expansion (Cold Start Innovation)

New video uploaded
  → Shown to small random cohort (~few hundred users)
  → Measure: completion rate, likes, shares
  → If signals good → expand to larger cohort (×10)
  → Cascade until natural audience size reached
  → Poor signals → stays in long tail

Result: every creator gets a fair chance regardless of size.
        Small creator → viral is structurally possible.
        No content features needed — real engagement bootstraps CF.

Key Architecture Differences

TikTok uses completion rate, not watch time — normalizes across video lengths; absolute time meaningless for 15s vs. 10min content
Facebook needs hybrid CPU-GPU — embedding tables are terabytes; don't fit in GPU memory; dense MLP on GPU, sparse lookup on CPU
Twitter uses GNN — social graph structure carries signal pure CF misses; who you follow shapes what you want to see
Monolith (TikTok) — real-time embedding updates via collision-free hash map; no batch training boundary; model reflects last few seconds of behavior

Retrieval Evolution: Two-Tower → HNSW → ScaNN

Model Side Evolution

Stage	User Representation	Training Signal	Key Paper
2016	Single vector, avg pooling	Random negatives + importance weighting	This paper
2019	Symmetric two-tower	In-batch negatives + sampling bias correction	Google RecSys 2019
2020	Two-tower + hard negatives	Iterative hard negative mining	Facebook, Google
2022+	Multi-vector (per interest cluster)	Contrastive + MaxSim scoring	ColBERT-inspired

Index Side Evolution

HNSW: Multi-layer proximity graph. O(log N) query. Greedy navigation: start at sparse top layer, descend to dense bottom layer. Best recall/speed trade-off. High memory (full vectors in RAM). Industry standard via hnswlib, Faiss, Pinecone, Qdrant.

ScaNN (Google 2020): Anisotropic quantization — minimizes error specifically in the direction that affects dot product (MIPS), not uniform reconstruction error. 2x faster than HNSW at same recall. Production system at Google for YouTube retrieval.

Standard PQ:  minimize ||v - v̂||²  (uniform, direction-agnostic)
ScaNN:        minimize error in direction affecting u·v  
              → directionally aware → better MIPS accuracy per bit

Sampling bias correction (2019): Popular items appear too often as negatives → model under-recommends them. Fix: corrected_score = u·v − log(p(v)) where p(v) is item frequency in training data.

Ranking Evolution: DCN, Feature Crosses, Transformers

The Feature Interaction Problem

Deep MLPs can learn interactions but do so inefficiently — no inductive bias toward feature crosses, need enormous capacity, interactions are combinatorially explosive. This motivated explicit interaction modeling.

Model	Key Innovation	Year
Wide & Deep	Manual cross + MLP; memorization + generalization	Google 2016
DeepFM	FM replaces manual wide; shared embeddings; automatic crosses	Huawei 2017
DCN	Cross network: explicit polynomial feature interactions without combinatorial explosion	Google 2017
DIN	Candidate-aware attention over user history; user vector adapts per candidate	Alibaba 2018
DCN v2	Full matrix cross layer (vs. rank-1); richer interactions	Google 2021
SASRec	Causal transformer for sequential recommendation; O(n²) attention	2018

DCN Cross Network (Key Formula)

x_l+1 = x₀ · (w_l · x_l) + b_l + x_l After L layers → degree L+1 feature interactions

DIN: Candidate-Aware User Representation

Standard (YouTube 2016):
  User vector = avg(all history embeddings)
  → SAME vector regardless of candidate being scored

DIN:
  For candidate item c:
    a_i = attention(history_item_i, candidate_c)
    user_vector = Σ a_i × history_item_i_embedding
  → DIFFERENT user vector per candidate
  → Cooking video candidate → cooking history items get high weight
  → Sports video candidate → same history, different weights

Three-Stage Funnel (Modern Standard)

500 candidates from retrieval
      ↓
[Light Ranker: simple features, fast model ~1ms total]
      ↓
50 candidates
      ↓
[Heavy Ranker: transformer history, DCN v2, MTL ~20ms total]
      ↓
Final top-N

State of Art Today: Sequential Models & LLMs

Long Sequence Handling

Users on mature platforms have 50,000–100,000+ interaction histories. Quadratic transformer attention is infeasible. Three approaches:

Hierarchical modeling: Recent history (fine-grained attention) + long-term history (coarse clusters) + session (separate transformer)
TWIN (ByteDance 2023): Two-stage inside user modeling — lightweight retrieval over full history (O(n)) → expensive attention over top-K retrieved (O(K²)) where K ≪ n
Mamba / SSMs: O(n) inference with constant memory state — no quadratic bottleneck. Early but promising for long histories.

LLMs in RecSys: Honest Assessment

Use Case	Status	Why
LLM text embeddings as ranking features	✅ Production deployed	Offline, cached; cheap at serving; powerful for cold start
LLM for query understanding / intent	✅ Production deployed	Natural language → structured query; works well
Conversational recommendation	✅ Early production (Rufus, YouTube AI)	Multi-turn preference elicitation; genuinely new capability
LLM as direct ranker	🔬 Research only	500ms–2s latency; $0.01–0.10 per request — infeasible at scale
LLM replacing collaborative filtering	❌ Not working	CF captures revealed preference; LLM captures semantic similarity — weaker signal

Why LLMs can't replace CF at scale: CF captures revealed preference (what 1M users actually watched). LLM captures semantic similarity (X and Y are about similar topics). Revealed preference is a stronger engagement signal than semantic similarity. "1M users who watched X also watched Y" >> "X and Y are semantically similar."

The state-of-art production architecture: LLM-augmented two-stage pipeline. LLM handles cold start (text embeddings), query understanding, explanation generation, and conversational refinement. The two-stage funnel remains the backbone.

Phase 06

Design Principles

System Design Frameworks

Requirements Scoping (Always Do This First)

Scale:      How many users? Items? RPS? Latency budget?
Objective:  Watch time? CTR? Multi-objective? Guardrails?
Context:    Homepage vs. watch-next? Long-form vs. short-form?
Cold start: New users? New items? How frequent?

The Answer Structure That Works

Step 1: Two-stage funnel (always start here, derive from compute budget constraint)
Step 2: Retrieval deep dive — two-tower, contrastive loss, in-batch + hard negatives, ANN (HNSW/ScaNN)
Step 3: Ranking deep dive — DCN v2 cross network, DIN attention, multi-task heads, weighted logistic regression
Step 4: Training pipeline — rollback labels, all watches, per-user cap, different negatives per stage
Step 5: Serving infrastructure — feature store, latency budget, graceful degradation
Step 6: Evaluation — custom offline metric mirroring objective, A/B primary, retention long-term

Strong Answers for Common Probes

Question	Key Points to Hit
Why two stages?	Compute budget + feature availability at retrieval time + source blending. Derive from scale constraint.
New video cold start?	Content embeddings (LLM) + freshness retrieval channel + cascade expansion + example age feature
New user cold start?	Demographic priors → exploration to discover interests → CF takes over as history accumulates
Why watch time not CTR?	CTR = thumbnail quality. Watch time = video quality. Implemented via T_i-weighted logistic regression → odds ≈ E[T]
How to evaluate?	Offline: custom metric mirroring objective. Online A/B: watch time primary. Long-term: retention. Never conflate offline with online.
Failure modes?	Training/serving skew (silent, dangerous), feedback loop, popularity bias, embedding staleness, distribution shift
How to improve?	Hard negatives → sequential attention → multi-task MTL → cold start improvements → DCN v2 → three-stage funnel

The Meta-Skill: Framing Every Answer

1. State the problem first, then the solution. 2. Always state the trade-off. 3. Connect to business metrics. 4. Know the numbers. 5. Acknowledge what you'd measure before shipping.

Edge Cases & Common Misconceptions

Gotcha Question	What the Interviewer is Testing	Core Insight to Demonstrate
"What exactly is example age?"	Whether you confuse it with video age	t_max − t_watch (watch event age, NOT video age). Corrects sampling bias in training data.
"Why withhold search query?"	Understanding surrogate problem gaming	Including last search → model reproduces search results page on homepage. Unordered bag of tokens breaks the connection.
"Why rollback labels?"	Understanding future leakage	Random holdout sees future events + ignores asymmetric consumption. Rollback mirrors serving conditions exactly.
"Why per-user capping?"	Understanding loss function dynamics	Power law user distribution → power users dominate gradients without capping. Cap = equal loss weighting across users.
"Why not hierarchical softmax?"	Deep understanding of the rejection	No natural video hierarchy. Arbitrary tree splits make each binary decision harder. Sampled softmax avoids artificial structure.
"Walk me through the weighted LR math"	Can you derive learned odds ≈ E[T]?	T_i weights → learned log-odds ≈ Σ T_i / (N-k) ≈ E[T] × (1+P) ≈ E[T] since P ≪ 1
"Why train on all watches?"	Understanding feedback loops	Rec-driven only → closed loop → model reinforces own biases → new content starved → never discovered
"Train/serve asymmetry in retrieval?"	Whether you notice this subtlety	Trained with sampled softmax; served with ANN dot product. Works because softmax IS dot product optimization. Modern fix: contrastive loss + hard negatives.
"What's wrong with this paper?"	Intellectual honesty and depth	Single objective (watch time gaming), averaging (sequence order lost), train/serve asymmetry, no feedback loop/fairness discussion, offline metrics underspecified

What to Memorize Cold vs. What to Derive

Tier 1 — Numbers & Formulas

Memorize These Exactly

Candidate generation:
  Embedding dim: 256 | Bag size: 50 | Vocab: 1M videos + 1M search tokens
  Best depth: 1024 → 512 → 256 ReLU | Negatives: ~thousands (100x speedup)

Ranking:
  Best config: 1024 → 512 → 256 ReLU | Pairwise loss: 34.6%
  Equal weighting: +4.1% worse | No hidden layers: 41.6%

Key formulas:
  example_age = t_max − t_watch  (set to 0 at serving)
  Positive weight = T_i (seconds) | Negative weight = 1
  Learned odds ≈ E[T] × (1+P) ≈ E[T]  (since P ≪ 1)
  Serving activation = e^{f(x)}  NOT sigmoid
  Quantile norm: x̃ = F(x) = CDF transform → [0,1)
  Powers: feed x̃, x̃², √x̃ for every continuous feature

Tier 2 — Named Techniques

Name	One-line Description
Two-tower	Separate user + item encoders; score = dot product u·v
MMoE	Task-specific gating over mixture of experts; handles conflicting objectives
DCN v2	Cross network with full matrix; explicit polynomial feature interactions
DIN	Attention over user history weighted by relevance to candidate
SASRec	Causal transformer for sequential recommendation; O(n²) attention
HNSW	Multi-layer proximity graph; O(log N) ANN; high memory
ScaNN	Anisotropic quantization for MIPS; 2x faster than HNSW; Google production
TWIN	Two-stage inside user modeling: retrieve relevant history → attend (ByteDance)
Monolith	Real-time embedding updates via collision-free hash map; no batch training boundary
SimClusters	Community detection over social graph → sparse user representation (Twitter)
DLRM	Explicit dot product feature crosses + hybrid CPU-GPU for terabyte embeddings (Facebook)
Sampled softmax	Train over sampled negatives; 100x speedup; importance weighted
In-batch negatives	Other batch items as negatives; efficient + harder than random

Tier 3 — Non-Obvious Design Choices

Cannot Be Derived — Must Know

Train on ALL watches, not rec-driven only → breaks feedback loop
Per-user example capping (fixed N) → equal loss weighting across users
Rollback label (future watch, not random holdout) → no future leakage
Video embeddings = softmax output layer weights (same 256-dim as user)
Shared embedding tables, separate feature inputs → efficiency + role specialization
Quantile normalization (CDF), not z-score → handles power law distributions
Powers of continuous features: x̃, x̃², √x̃
Serving activation = e^{f(x)}, not σ(f(x))
Example age = 0 at serving (not at training)
Search queries as unordered bag of tokens (no sequence, no recency)
OOV → zero embedding (neutral, uninformative)
Retrieval negatives = random; Ranking negatives = shown but not clicked

Tier 4 — Derive From First Principles

Question	Derivation Path
Why two stages?	Scale → can't afford rich computation per item at millions → cheap broad retrieval → expensive precise ranking on small set
Why ANN?	Denominator doesn't affect ranking → MIPS problem → exact O(\|V\|) infeasible → approximate with tolerable accuracy loss → ranking corrects errors
Why implicit feedback?	Explicit ratings sparse → can't train tail content → implicit (completions) 1000x more data → volume wins despite noise
Why watch time not CTR?	CTR → clickbait problem → CTR = thumbnail quality → watch time = video quality → harder to game → better satisfaction proxy
Why independent scoring?	Joint scoring = exponential complexity → independent = embarrassingly parallel → batch all candidates → fast serving
Why shared embeddings?	Same entity in multiple features → separate tables = redundant + sparse training → shared = 3x reduction + more gradient per entity
Why quantile normalization?	Power law distributions → z-score fails → outliers compress everything → CDF maps any distribution to uniform [0,1) → well-conditioned inputs

The Five Connecting Patterns

Pattern 1: Scale is the master constraint — it determines everything else. At millions of items you can't have rich features. At hundreds you can.

Pattern 2: Data fixes often beat model fixes. Example age, rollback labels, per-user capping, all-watches policy — all data pipeline decisions with outsized impact.

Pattern 3: Every design choice is a cost/benefit trade-off. Always state both sides: averaging vs. RNN, random vs. hard negatives, shared vs. separate embeddings, single vs. multi-task.

Pattern 4: The surrogate chain has many links. True objective → surrogate metric → training label → model output → offline metric → online metric → true objective. Every link is an approximation. Know where the failures hide.

Pattern 5: Offline metrics ≠ online results, fundamentally. Offline is for iteration speed. A/B is ground truth. Never optimize offline metrics in a vacuum.

Quick Reference

"Design YouTube/TikTok recommendation"

Two-stage funnel → retrieval (two-tower, ANN) → ranking (DCN+DIN, multi-task) → training (rollback, all watches, per-user cap) → evaluation (custom offline + A/B watch time)

"Why two stages?"

Compute budget + feature availability (don't know item at retrieval time) + source blending. Derive from: scale → can't afford rich computation → stage decomposition.

"New video cold start?"

LLM text embeddings (available immediately) + separate freshness retrieval channel + cascade expansion (TikTok-style small cohort → measure → expand) + example age feature

"New user cold start?"

Demographic priors (zero vector → graceful fallback) → exploration to discover interests → CF takes over. Onboarding questions accelerate Phase 1→2.

"Why watch time not CTR?"

CTR = thumbnail quality (gameable by clickbait). Watch time = video quality (can't fake 8 minutes). Implemented via T_i-weighted logistic regression → odds ≈ E[T]. e^f(x) activation at serving.

"How to keep fresh?"

example_age = t_max − t_watch (set 0 at serving) + real-time feature store (Kafka → Flink → Redis) + continuous training + separate freshness retrieval channel

"How to evaluate?"

Offline: custom metric mirroring objective (watch-time pairwise loss). Online A/B: watch time primary, retention long-term. Offline ≠ online — trust A/B for final decisions.

"What can go wrong?"

Training/serving skew (silent, most dangerous) + feedback loop (rec-driven training only) + popularity bias (sampling bias correction) + embedding staleness + distribution shift

"How to improve this system?"

Hard negatives → DIN/transformer user modeling → MMoE multi-task → cold start improvements → DCN v2 feature interactions → three-stage funnel → position bias correction

"What's wrong with this paper?"

Single watch time objective (gameable) + averaging loses sequence + train/serve asymmetry + no feedback loop/fairness discussion + offline metrics underspecified

The Big Picture

Why This Paper Matters & The Three Core Challenges

The Two-Stage Architecture: Data Flow & Topology

Key Connection Points Between Stages

Candidate Generation

Recommendation as Extreme Multiclass Classification

Why Classification and Not Regression or Pairwise Ranking?

Sampled Softmax

Model Architecture: Embeddings, Averaging & the Tower Network

Heterogeneous Signals: What Goes In and Why

The Example Age Feature: Correcting Historical Bias

The Problem: Sampling Bias Toward the Past

The Fix: Teaching the Model About Time

Label & Context Selection: The Surrogate Problem

The Rollback Fix

The Search Query Contamination Problem

Four Data Pipeline Fixes

Serving: Approximate Nearest Neighbor, Not Softmax

Serving Pipeline

Ranking

Why a Separate Ranker?

Why Not CTR? The Clickbait Problem

Feature Taxonomy

Feature Importance Hierarchy

Shared Embeddings & Quantile Normalization

Shared Embedding Spaces

Quantile Normalization for Continuous Features

Modeling Expected Watch Time via Weighted Logistic Regression

What the Model Actually Learns

Empirical Validation

Systems Perspective

Training Pipeline

Data Generation Decisions

Different Negatives for Different Stages

Distributed Training

Offline vs. Online Evaluation

Four Sources of Divergence

Offline Metric Design Principle

Metric Hierarchy

Serving Infrastructure

End-to-End Latency Budget

Feature Store Architecture

Caching Strategy

Graceful Degradation

Evolution & State of Art

What Changed at YouTube Post-2016

MMoE — Solving Conflicting Objectives

Industry Evolution: Facebook, TikTok, Twitter

TikTok's Cascade Expansion (Cold Start Innovation)

Key Architecture Differences

Retrieval Evolution: Two-Tower → HNSW → ScaNN

Model Side Evolution

Index Side Evolution

Ranking Evolution: DCN, Feature Crosses, Transformers

The Feature Interaction Problem

DCN Cross Network (Key Formula)

DIN: Candidate-Aware User Representation

Three-Stage Funnel (Modern Standard)

State of Art Today: Sequential Models & LLMs

Long Sequence Handling

LLMs in RecSys: Honest Assessment

Design Principles

System Design Frameworks

Requirements Scoping (Always Do This First)

The Answer Structure That Works

Strong Answers for Common Probes

The Meta-Skill: Framing Every Answer

Edge Cases & Common Misconceptions

What to Memorize Cold vs. What to Derive

The Five Connecting Patterns

Quick Reference