Training cross-encoders

Hey, I am using cross-encoders for a recommender system. I am recommending items (item_name, importance for the user, item_category) to documents (notes and their categories).

item_name changes from user to user and I want a user-agnostic model, hence a cross-encoder instead of a classic recommender approach.

Has anybody tried randomly setting some of the features to "" to force the network to learn from each of them properly? A bit like random sampling of features.

Happy to get any tips on training this kind of model. I have some decent results so far, but I am still missing some nuances.

So far I have tried “cross-encoder/ms-marco-TinyBERT-L2-v2” and “cross-encoder/ms-marco-MiniLM-L2-v2” on 1 million pairs.


hmm…


Your problem, re-framed in the “retrieval + reranking” mental model

What you are doing is text relevance ranking:

  • “Query” side: item_name + item_category (+ importance)
  • “Document” side: note_text + note_categories
  • Output: a single relevance score used to rank notes for each item (or items for each note)

This is exactly the intended use of cross-encoders (often called rerankers): concatenate the two texts, run a transformer, predict relevance. The specific checkpoints you tested are trained for MS MARCO passage ranking, which is a different distribution than “item metadata vs personal notes”. (Hugging Face)
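For concreteness, here is a minimal sketch of that framing with the Sentence-Transformers CrossEncoder class (the field headers and texts are illustrative, not your actual schema):

```python
from sentence_transformers import CrossEncoder

# One of the checkpoints mentioned in the question
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L2-v2")

# "Query" side: item metadata rendered as text
query = "ITEM_NAME: wireless headphones [SEP] ITEM_CAT: electronics"

# "Document" side: note text + note categories
note = "NOTE_CAT: shopping [SEP] NOTE: need a gift for dad, something for music on the train"

# One relevance score per (query, document) pair; rank notes by this score
scores = model.predict([(query, note)])
print(scores)
```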

Also, you chose 2-layer (“L2”) models. They are fast, but they have limited capacity and are known to leave “nuance” on the table compared to deeper MiniLM variants. Sentence-Transformers even publishes a table showing L2 < L4 < L6 < L12 on standard reranking metrics. (sbert.net)


Why you get “decent results but missing nuance” (most common causes)

Cause 1: You are likely training a ranker like a classifier

If your training is “independent pair labels” (BCE on random pairs), but your production task is “for each item, rank many notes”, you often get:

  • good average accuracy
  • weak ordering among top candidates
  • brittle behavior on close calls (the “nuance” you feel missing)

Fix: evaluate and tune as a ranking problem using per-item candidate lists and ranking metrics. The CrossEncoder reranking evaluator computes MRR@10, NDCG@10, MAP for exactly that setup. (sbert.net)
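A minimal sketch of that evaluation, assuming the samples format used by recent Sentence-Transformers releases (one dict per query with its positive and negative documents); argument names can differ slightly between versions, so check the docs of your installed release:

```python
from sentence_transformers import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CrossEncoderRerankingEvaluator

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L2-v2")

# One entry per item (query): its labeled positive notes and sampled negative notes
samples = [
    {
        "query": "ITEM_NAME: wireless headphones [SEP] ITEM_CAT: electronics",
        "positive": ["NOTE: need a gift for dad, something for music on the train"],
        "negative": ["NOTE: renew car insurance before June", "NOTE: book dentist appointment"],
    },
    # ... one dict per item in the held-out dev set
]

evaluator = CrossEncoderRerankingEvaluator(samples=samples, name="dev")
print(evaluator(model))  # ranking metrics such as MRR@10, NDCG@10, MAP over the candidate lists
```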

Cause 2: Negatives are too easy (or secretly positive)

If negatives are sampled randomly, the model learns crude shortcuts and stops improving. If negatives include unlabeled positives, the model is trained to suppress real matches (common in recommender-like data where labels are incomplete).

Sentence-Transformers explicitly emphasizes that strong cross-encoders are trained to recognize hard negatives, and provides mine_hard_negatives() for this reason. (sbert.net)
The false-negative problem is widely discussed in retrieval training: “hard negatives” should be challenging but truly incorrect, and mining must be done carefully. (Hugging Face)

Cause 3: You are out-of-domain relative to MS MARCO

Your notes are not web passages. Your “queries” are not natural language questions. The model’s priors are wrong.

The MS MARCO model cards state they were trained on the MS MARCO passage ranking task and describe the intended reranking usage. (Hugging Face)

Cause 4: Feature dominance, leakage, and “importance”

You want a user-agnostic model. But you feed “importance for the user”, which is inherently user-conditioned.

Typical failure modes:

  • The model uses importance as a shortcut.
  • Importance correlates with exposure or labeling. You get leakage.
  • The numeric form is unstable (raw numbers rendered as text are hard for the model to compare).

Cause 5: Truncation and field formatting destroy the signal

Cross-encoders have a max token length. If note text is long or schema is sloppy, the model may never see the crucial part of the note, or it may not reliably know which token belongs to which field.

Cause 6: Overfitting happens fast for cross-encoders

This is a known quirk: cross-encoders can overfit quickly, so you need strong evaluation and early stopping. The Hugging Face Sentence-Transformers reranker training guide explicitly warns about this and recommends reranking evaluators plus load_best_model_at_end. (Hugging Face)


About your idea: randomly setting some features to ""

The concept is good. The implementation matters.

You are describing input/feature dropout: corrupt the input so the model cannot rely on a single feature and must build redundancy. That idea is old and valid. “Feature knock-out” style regularization is a formal version of “randomly remove features”. (DSpace)

But empty string "" is usually the wrong “missing value” for transformers

"" often becomes “no tokens”. That creates ambiguity:

  • “field missing” vs “field exists but empty”
  • training distribution differs from inference if you don’t do the same at serving time

Better: use explicit sentinel tokens per field, and drop fields at the field level.

Example (conceptual):

  • ITEM_NAME: {item_name_or_<MISSING_ITEM_NAME>}
  • ITEM_CAT: {item_cat_or_<MISSING_ITEM_CAT>}
  • IMPORTANCE: {bucket_or_<MISSING_IMPORTANCE>}
  • NOTE_CAT: {note_cat_or_<MISSING_NOTE_CAT>}
  • NOTE: {note_text}

Then do field dropout:

  • with probability p, replace a field’s value with its <MISSING_…> token
  • keep the field header (ITEM_NAME:) so the model knows what slot it is reading

This is the “safe” version of your idea.
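A small sketch of that field-level dropout, assuming a simple “header: value” template joined by [SEP]; the field names, probabilities, and sentinel strings are illustrative, and you may want to register the sentinels as special tokens in the tokenizer:

```python
import random

# Sentinel per field, used both for genuinely missing values and for dropped values
MISSING = {
    "ITEM_NAME": "<MISSING_ITEM_NAME>",
    "ITEM_CAT": "<MISSING_ITEM_CAT>",
    "IMPORTANCE": "<MISSING_IMPORTANCE>",
    "NOTE_CAT": "<MISSING_NOTE_CAT>",
}

# Per-field dropout probabilities (illustrative starting points; tune on your data)
DROP_P = {"ITEM_NAME": 0.10, "ITEM_CAT": 0.20, "IMPORTANCE": 0.35, "NOTE_CAT": 0.20}

def render(fields: dict, order: list, train: bool = True) -> str:
    """Render fields in a fixed order with headers; apply field dropout at train time only."""
    parts = []
    for name in order:
        value = fields.get(name) or MISSING[name]            # truly missing -> sentinel
        if train and random.random() < DROP_P.get(name, 0.0):
            value = MISSING[name]                            # dropped -> same sentinel
        parts.append(f"{name}: {value}")                     # the header is always kept
    return " [SEP] ".join(parts)

# Query side for one item; the note text itself usually stays intact on the document side
query = render({"ITEM_NAME": "wireless headphones", "ITEM_CAT": "electronics",
                "IMPORTANCE": "HIGH"}, order=["ITEM_NAME", "ITEM_CAT", "IMPORTANCE"])
```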


High-leverage solutions for your exact case

1) Make training and evaluation match the ranking task

Instead of thinking “pairs”, think “lists”.

For each item (query), create:

  • 1 positive note (or several)
  • K negative notes (prefer hard negatives)
  • evaluate on ranking metrics (NDCG@10, MRR@10)

The ST reranking evaluator is built for this. (sbert.net)

Why this fixes nuance: “nuance” lives in top-k ordering among similar candidates. Ranking metrics reward that. Pair accuracy often does not.


2) Upgrade your negatives: mine hard negatives from your real candidate generator

Hard negatives should look like what your system will actually consider at inference time.

A robust pipeline:

  1. Use a fast first-stage retriever (embedding search, BM25, hybrid).
  2. For each item, retrieve top K candidate notes.
  3. Positives are labeled matches.
  4. Negatives are top-ranked retrieved notes that are not positives.

Sentence-Transformers explicitly supports this workflow and provides mine_hard_negatives() to expand datasets with hard negatives. (sbert.net)
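A sketch of that mining step with mine_hard_negatives(); the exact argument names vary a bit across Sentence-Transformers versions, so treat these as indicative and check your installed release. The bi-encoder here is just a stand-in first-stage retriever:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

# Labeled (item-as-query, matching-note) pairs
pairs = Dataset.from_dict({
    "query": ["ITEM_NAME: wireless headphones [SEP] ITEM_CAT: electronics"],
    "answer": ["NOTE: need a gift for dad, something for music on the train"],
})

# Fast bi-encoder used only to retrieve "confusable" candidate notes
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Expand each pair with hard negatives mined from the note corpus
hard_pairs = mine_hard_negatives(
    pairs,
    embedder,
    num_negatives=10,   # negatives per positive
    range_min=10,       # skip the very top hits, which are often unlabeled positives
    range_max=100,      # stay within a realistic candidate pool
)
```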

Why this fixes nuance: the model learns to separate “almost right” from “right”, not just “random wrong” from “right”.

Pitfall: false negatives. If your labeling is incomplete, some “negatives” are truly relevant. That can directly erase nuance. Treat this as a first-class problem. (Hugging Face)

Mitigations:

  • exclude “near-duplicate” notes from negatives if you cannot label them reliably
  • consider weak labeling rules that mark some candidates as “unknown” rather than negative
  • keep negative sampling conservative and iterate

3) Handle “importance” so the model stays user-agnostic

If you truly want user-agnostic semantics, the cleanest pattern is:

  • Do not feed importance into the cross-encoder.
  • Use the cross-encoder to score semantic match only.
  • Combine with importance after scoring: final_score = semantic_score × importance_weight (see the sketch at the end of this subsection).

If you must include importance:

  • discretize into a few buckets (LOW/MED/HIGH)
  • apply field dropout to it so it cannot dominate
  • watch for performance collapse when you ablate that field
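A tiny sketch of the post-hoc combination; the bucket thresholds and weights are illustrative:

```python
def importance_bucket(importance: float) -> str:
    # Discretize so the value is stable if you ever do feed it to the model as text
    if importance < 0.33:
        return "LOW"
    if importance < 0.66:
        return "MED"
    return "HIGH"

# Keep the cross-encoder user-agnostic: it scores semantic match only,
# and the user-specific importance is folded back in after scoring.
IMPORTANCE_WEIGHT = {"LOW": 0.8, "MED": 1.0, "HIGH": 1.2}

def final_score(semantic_score: float, importance: float) -> float:
    return semantic_score * IMPORTANCE_WEIGHT[importance_bucket(importance)]
```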

4) Fix schema formatting so the model can reliably learn from each feature

Cross-encoders learn better when the input is stable and explicit.

Good practices:

  • always include field labels (ITEM_NAME:, ITEM_CAT: …)
  • keep field order fixed
  • use separators consistently (e.g., [SEP] between blocks)
  • normalize categories (consistent casing, synonyms merged)

Then verify with ablations:

  • evaluate with ITEM_NAME knocked out
  • evaluate with categories knocked out
  • evaluate with both present

If the score collapses when one field is removed, you have dominance.
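A short sketch of that ablation loop, re-using the trained model and the reranking evaluator from the earlier sketch; build_samples() is a hypothetical helper that re-renders the dev set with the named field replaced by its <MISSING_…> sentinel:

```python
from sentence_transformers.cross_encoder.evaluation import CrossEncoderRerankingEvaluator

# `model` is the trained CrossEncoder from the earlier sketch;
# `build_samples(knock_out=...)` is hypothetical and re-renders the dev set
# with the named field replaced by its <MISSING_...> sentinel.
for knocked_out in (None, "ITEM_NAME", "ITEM_CAT", "IMPORTANCE"):
    evaluator = CrossEncoderRerankingEvaluator(
        samples=build_samples(knock_out=knocked_out),
        name=f"ablate-{knocked_out}",
    )
    print(knocked_out, evaluator(model))
# A large drop when a single field is knocked out means the model leans heavily on that field.
```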


5) Address domain mismatch via continued pretraining on your notes

If notes are domain-specific, continue pretraining on in-domain text (DAPT), then fine-tune.

“Don’t Stop Pretraining” shows consistent gains from domain-adaptive and task-adaptive pretraining on unlabeled in-domain text before supervised fine-tuning. (arXiv)

This is often the cleanest way to recover nuance when labels are limited or style is idiosyncratic.
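A minimal DAPT sketch using plain masked-language-model continued pretraining via transformers; the backbone, file path, and hyperparameters are illustrative, and you would pick the backbone that matches the reranker you plan to fine-tune:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "microsoft/MiniLM-L12-H384-uncased"   # illustrative backbone choice
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Raw, unlabeled note text, one note per line (path is illustrative)
notes = load_dataset("text", data_files={"train": "notes.txt"})["train"]
notes = notes.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
                  batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="dapt-notes", num_train_epochs=1,
                         per_device_train_batch_size=32, learning_rate=5e-5)

Trainer(model=model, args=args, train_dataset=notes, data_collator=collator).train()
# Afterwards, fine-tune the cross-encoder on top of the adapted backbone.
```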


6) Check if you are capacity-limited by L2 models

The ST pretrained models table shows that performance increases with deeper MiniLM variants (L4/L6/L12), at the cost of lower docs/sec. (sbert.net)

Practical move:

  • run an offline evaluation with a stronger reranker (even temporarily)
  • measure the “ceiling”
  • decide whether to keep it, distill, or accept L2 limits

7) If you use many negatives, watch for overconfidence and score saturation

Recommendation training with negative sampling + BCE can produce overconfidence, which hurts stable ranking separation.

Two relevant results:

  • The shallow cross-encoder paper notes shallow models can benefit from generalized Binary Cross-Entropy (gBCE), originally successful in recommendation. (arXiv)
  • The gSASRec work introduces gBCE and argues it mitigates overconfidence from BCE with negative sampling. (arXiv)

You do not need to adopt gBCE immediately, but if your logits saturate and top-k ordering is unstable, this is a credible lever.
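If you do want to experiment in that direction, the core gBCE idea, as I read the gSASRec paper, is to temper the positive term of sampled BCE by raising the sigmoid of the positive score to a power β ≤ 1. A rough PyTorch sketch that leaves β as a plain hyperparameter (the paper derives it from the negative-sampling rate, so check there for the exact calibration):

```python
import torch
import torch.nn.functional as F

def gbce_loss(pos_logit: torch.Tensor, neg_logits: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    """Rough sketch of gBCE: down-weight the positive term via sigmoid(s+) ** beta.

    pos_logit:  (batch,)    score of the positive candidate
    neg_logits: (batch, k)  scores of k sampled negatives
    beta:       in (0, 1]; beta = 1 recovers plain sampled BCE
    """
    pos_term = beta * F.logsigmoid(pos_logit)          # log(sigmoid(s+) ** beta)
    neg_term = F.logsigmoid(-neg_logits).sum(dim=1)    # sum of log(1 - sigmoid(s-))
    return -(pos_term + neg_term).mean()
```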


A concrete recipe I would run for your system

Step A. Build ranking-style training units

For each item (a sketch follows this list):

  • q = template(item_name, item_cat, maybe importance_bucket)
  • retrieve top 100–500 candidate notes using your first-stage method
  • label positives
  • choose 10–50 negatives from retrieved candidates (hard negatives)
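A sketch of Step A with a bi-encoder as the stand-in first stage (swap in whatever your real candidate generator is); texts, model name, and thresholds are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # stand-in first stage

note_texts = [
    "NOTE_CAT: shopping [SEP] NOTE: gift ideas for dad, he likes music on the train",
    "NOTE_CAT: admin [SEP] NOTE: renew car insurance before June",
]
note_emb = embedder.encode(note_texts, convert_to_tensor=True)

def build_unit(item_query: str, positive_ids: set, k: int = 200, n_neg: int = 30) -> dict:
    """One ranking-style training unit: query, its positives, and hard-ish negatives."""
    q_emb = embedder.encode(item_query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, note_emb, top_k=min(k, len(note_texts)))[0]
    negatives = [note_texts[h["corpus_id"]] for h in hits
                 if h["corpus_id"] not in positive_ids][:n_neg]
    return {"query": item_query,
            "positive": [note_texts[i] for i in positive_ids],
            "negative": negatives}

unit = build_unit("ITEM_NAME: wireless headphones [SEP] ITEM_CAT: electronics", positive_ids={0})
```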

Step B. Train with strong evaluation and early stopping

  • use CrossEncoderRerankingEvaluator on a held-out dev set (training sketch after this list)
  • stop early and keep the best checkpoint, because cross-encoders overfit fast (Hugging Face)
  • track NDCG@10 and MRR@10 (sbert.net)
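A training sketch assuming the CrossEncoderTrainer API from Sentence-Transformers v4+ (as described in the HF reranker guide cited above); argument and logged-metric names vary across sentence-transformers / transformers versions, so verify them against your installed release:

```python
from datasets import Dataset
from sentence_transformers.cross_encoder import (CrossEncoder, CrossEncoderTrainer,
                                                 CrossEncoderTrainingArguments)
from sentence_transformers.cross_encoder.evaluation import CrossEncoderRerankingEvaluator
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L2-v2", num_labels=1)

# (query, note, label) rows built from the Step A units; tiny illustrative dataset
train_ds = Dataset.from_dict({
    "query": ["ITEM_NAME: wireless headphones [SEP] ITEM_CAT: electronics"] * 2,
    "note":  ["NOTE: gift ideas for dad, music on the train", "NOTE: renew car insurance"],
    "label": [1.0, 0.0],
})

# Held-out dev set in the reranking-evaluator format (one entry shown)
dev_evaluator = CrossEncoderRerankingEvaluator(
    samples=[{"query": "ITEM_NAME: running shoes [SEP] ITEM_CAT: sport",
              "positive": ["NOTE: start training for the half marathon"],
              "negative": ["NOTE: renew car insurance before June"]}],
    name="dev",
)

args = CrossEncoderTrainingArguments(
    output_dir="note-reranker",
    num_train_epochs=1,
    eval_strategy="steps", eval_steps=500, save_steps=500,   # "evaluation_strategy" on older versions
    load_best_model_at_end=True,                             # keep the best checkpoint
    metric_for_best_model="eval_dev_ndcg@10",                # exact metric key is version-dependent
)

trainer = CrossEncoderTrainer(model=model, args=args, train_dataset=train_ds,
                              loss=BinaryCrossEntropyLoss(model),
                              evaluator=dev_evaluator)
trainer.train()
```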

Step C. Add feature dropout correctly

  • field dropout with <MISSING_…> sentinels, not empty strings (DSpace)

  • start mild:

    • drop ITEM_NAME 5–15%
    • drop ITEM_CAT 10–30%
    • drop NOTE_CAT 10–30%
    • drop IMPORTANCE 20–50% if you keep it in-model
  • cap: do not drop everything at once

Step D. Do diagnostic ablations

Evaluate:

  • full input
  • no item_name
  • no categories
  • no importance

This tells you what the model actually uses and where nuance is coming from.

Step E. If nuance is still missing

  • try a deeper reranker (L4/L6) to test capacity ceiling (sbert.net)
  • do DAPT on your note corpus (arXiv)
  • tighten hard negatives mining (more “confusable” negatives) (sbert.net)

Common pitfalls to explicitly avoid

  • Using random negatives only. You will plateau early.
  • Feeding user-conditioned “importance” and expecting user-agnostic behavior.
  • Evaluating with only pairwise accuracy. It hides top-k failures.
  • Dropping fields by "" without sentinels. It introduces ambiguity.
  • Not matching training negatives to inference candidates. Distribution mismatch kills nuance.

Summary

  • Your “randomly blank features” idea is valid regularization, but use field dropout with explicit missing tokens, not "". (DSpace)
  • “Missing nuance” is usually easy negatives + classifier-style training + weak ranking evaluation. Fix with hard negatives and ranking metrics (MRR@10, NDCG@10). (sbert.net)
  • MS MARCO L2 models are fast but limited and out-of-domain. Test deeper rerankers and consider domain-adaptive pretraining on your notes. (sbert.net)
  • If you keep “importance”, treat it carefully or apply it after scoring to preserve user-agnostic semantics.
  • If you scale negatives aggressively, watch for overconfidence and consider gBCE-related ideas if needed. (arXiv)