What you’re seeing is a very common and very predictable behaviour of decoder‑only models when you turn a classification problem into a generation problem: the model learns the label distribution of the fine‑tuning set as a prior, and unless you actively counteract it, that prior dominates inference.
Let’s break this down cleanly and then walk through the techniques that actually work.
Why this happens
A decoder model trained to generate labels is essentially learning:
P(\text{label} \mid \text{input})
But during fine‑tuning, because the dataset is small and imbalanced, the model also implicitly learns:
P(\text{label}) \approx \text{empirical distribution of fine‑tuning set}
When the fine‑tuning set is small (10k) and the pretraining set is noisy, the model overfits to the clean but skewed distribution.
This is especially strong in decoder‑only architectures because they are trained autoregressively and treat the label token(s) as part of the language distribution.
Techniques that actually work
Below are the methods that reliably break the “distribution copying” behaviour.
- Loss Re‑weighting / Class‑Balanced Loss
This is the most direct fix.
You assign higher loss weight to minority classes and lower weight to majority classes.
Common strategies:
- Inverse frequency weighting
- Effective number of samples (Cui et al., 2019)
- Focal loss (helps with imbalance + noisy labels)
This prevents the model from learning the skewed prior.
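For concreteness, here is a minimal PyTorch sketch of the effective-number weighting (Cui et al., 2019), assuming hypothetical class counts and an unreduced per-example LM loss; adapt it to however your training loop exposes the loss.

```python
import torch

# Hypothetical class counts from the fine-tuning set (illustrative assumption).
class_counts = torch.tensor([6000.0, 2500.0, 1000.0, 400.0, 100.0])

# Effective number of samples (Cui et al., 2019): E_n = (1 - beta^n) / (1 - beta).
beta = 0.999
effective_num = (1.0 - beta ** class_counts) / (1.0 - beta)

# Weight each class by the inverse of its effective number, normalised so the
# weights sum to the number of classes (keeps the overall loss scale comparable).
weights = 1.0 / effective_num
weights = weights / weights.sum() * len(class_counts)

def weighted_lm_loss(per_example_loss: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # per_example_loss: (batch,) unreduced loss; labels: (batch,) gold class indices.
    # For a generative classifier, scale each example's LM loss by its class weight.
    return (per_example_loss * weights[labels]).mean()
```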
- Oversampling / Undersampling (but carefully)
For generation tasks, oversampling minority classes works surprisingly well.
But you must avoid:
- duplicating identical samples (causes overfitting)
- oversampling too aggressively (destroys natural priors)
A good rule of thumb:
- oversample until class counts are within 2–3× of each other rather than fully balanced (see the sampler sketch below).
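One way to implement the capped oversampling is a weighted sampler. The sketch below assumes PyTorch's `WeightedRandomSampler` and made-up class sizes, and boosts each class only until it is within roughly 3× of the largest one.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical label column for a ~10k-example fine-tuning set (illustrative assumption).
labels = torch.cat([torch.full((6000,), 0), torch.full((2500,), 1),
                    torch.full((1000,), 2), torch.full((400,), 3), torch.full((100,), 4)])

counts = torch.bincount(labels).float()

# Target: bring every class to within ~3x of the largest class, not to parity.
target = torch.clamp(counts, min=counts.max().item() / 3.0)

# Per-example sampling weight = how much that example's class needs to be boosted.
example_weights = (target / counts)[labels]

sampler = WeightedRandomSampler(example_weights.tolist(),
                                num_samples=len(labels),
                                replacement=True)
# Pass `sampler=sampler` to your DataLoader instead of `shuffle=True`.
```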
- Use instruction‑style prompts that force the model to ignore priors
This is underrated.
Instead of training the model to output just the label, train it to follow a task instruction:
Input: <text>
Task: Identify the correct topic regardless of frequency.
Output: <topic>
This reduces the model’s reliance on unconditional priors.
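A small helper along these lines keeps training and inference prompts consistent; the template itself is just an illustration, not a required format.

```python
def build_example(text: str, topic: str | None = None) -> str:
    """Format one example in instruction style (hypothetical template)."""
    prompt = (
        f"Input: {text}\n"
        "Task: Identify the correct topic regardless of frequency.\n"
        "Output:"
    )
    # During training, append the gold topic; at inference, stop after "Output:".
    return f"{prompt} {topic}" if topic is not None else prompt
```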
- Add a calibration layer at inference
Even if the model is biased, you can correct it post‑hoc.
Two strong methods:
a) Temperature scaling per class
You adjust logits so that rare classes are not suppressed.
b) Prior correction
If you know the true expected class distribution, you can apply Bayes correction:
P_{\text{corrected}}(y \mid x) \propto \frac{P_{\text{model}}(y \mid x)}{P_{\text{model}}(y)}
This is extremely effective for generative classifiers.
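A minimal sketch of the prior correction, assuming you can score each label's verbalizer token(s) to get log P_model(y|x), and that you have some estimate of P_model(y) (for example, the model's average prediction on a held-out set or on a content-free prompt):

```python
import torch

def prior_corrected_probs(label_logprobs: torch.Tensor,
                          model_prior: torch.Tensor) -> torch.Tensor:
    """
    label_logprobs: (batch, num_classes) log P_model(y|x), e.g. the log-probability
                    the decoder assigns to each label's verbalizer token(s).
    model_prior:    (num_classes,) estimate of P_model(y), e.g. averaged predictions
                    over a held-out set or a content-free prompt.
    """
    corrected = label_logprobs - torch.log(model_prior)   # Bayes correction in log space
    return torch.softmax(corrected, dim=-1)               # renormalise over classes
```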
- Use a two‑stage model: encoder classifier + decoder generator
This is a hybrid approach:
- Use a small encoder classifier trained with class‑balanced loss.
- Feed its predicted class into the decoder as a conditioning token.
This gives you:
- balanced classification
- rich generative output
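A rough sketch of the two-stage wiring using Hugging Face pipelines; the checkpoint names are placeholders for your own fine-tuned models, and the prompt layout is illustrative.

```python
from transformers import pipeline

# Placeholder checkpoints; substitute your own fine-tuned models.
classifier = pipeline("text-classification", model="your-org/encoder-topic-classifier")
generator = pipeline("text-generation", model="your-org/decoder-topic-generator")

def two_stage_predict(text: str) -> str:
    # Stage 1: the balanced encoder classifier picks the topic.
    topic = classifier(text)[0]["label"]
    # Stage 2: the decoder is conditioned on that topic via the prompt.
    prompt = f"Input: {text}\nTopic: {topic}\nOutput:"
    return generator(prompt, max_new_tokens=50)[0]["generated_text"]
```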
- Freeze most of the decoder during fine‑tuning
If you fine‑tune the entire decoder on a small imbalanced dataset, it will absolutely learn the skew.
Instead:
- freeze 90–99% of layers
- fine‑tune only the top 1–3 transformer blocks
- or use LoRA adapters
This preserves the pretrained distributional knowledge and reduces overfitting to the fine‑tuning distribution.
- Use synthetic balancing
If you can generate synthetic examples for minority classes (even using your own model), you can rebalance the dataset without oversampling.
This works well for topic generation because the model can generate paraphrases.
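A possible sketch of the paraphrase-based augmentation, assuming a Hugging Face text-generation pipeline and a placeholder checkpoint name; filter the outputs (e.g., by deduplication or a quality check) before adding them to the training set.

```python
from transformers import pipeline

# Placeholder checkpoint; any instruction-tuned model (or your own) can serve here.
paraphraser = pipeline("text-generation", model="your-org/your-decoder")

def augment_minority_class(examples: list[str], n_per_example: int = 3) -> list[str]:
    synthetic = []
    for text in examples:
        prompt = (
            "Paraphrase the following text, keeping its topic unchanged:\n"
            f"{text}\nParaphrase:"
        )
        outputs = paraphraser(prompt,
                              max_new_tokens=80,
                              do_sample=True,
                              temperature=0.9,
                              num_return_sequences=n_per_example,
                              return_full_text=False)
        synthetic.extend(o["generated_text"].strip() for o in outputs)
    return synthetic
```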
- Add a “label prior suppression” token
This is a trick used in some LLM classification setups.
You add a special token like:
<force_fairness>
And train the model that when this token is present, it should ignore prior label frequencies.
This works because decoder models are extremely sensitive to conditioning tokens.
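One way this might look in practice, assuming a Hugging Face tokenizer and treating `<force_fairness>` as an added special token; the token name and prompt layout are illustrative.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; adjust to your base model.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-decoder")
tokenizer.add_special_tokens({"additional_special_tokens": ["<force_fairness>"]})
# Remember to resize the model's embeddings after adding a token:
# model.resize_token_embeddings(len(tokenizer))

def with_prior_suppression(text: str) -> str:
    # Prepend the control token so the model learns to condition on it.
    return f"<force_fairness> Input: {text}\nOutput:"
```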
What I’d recommend for your exact setup
Given your description:
- large noisy pretraining set
- small clean but imbalanced fine‑tuning set
- decoder‑only architecture
- generation‑based classification
The most effective combination is:
- Freeze most of the model + LoRA adapters: prevents overfitting to skew.
- Class‑balanced loss (effective number of samples): prevents the model from copying the distribution.
- Instruction‑style prompting: reduces reliance on unconditional priors.
- Optional post‑hoc calibration: if you need perfect distribution alignment.
This combo usually fixes the issue without needing to modify the dataset too much.
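If it helps, here is a minimal sketch of the first ingredient, freezing the base model and training only LoRA adapters via the PEFT library; the checkpoint name and target module names are placeholders that depend on your architecture. Combine this with the weighted loss and instruction-style prompts sketched above to cover the first three recommendations.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; substitute your pretrained decoder.
model = AutoModelForCausalLM.from_pretrained("your-org/your-decoder")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; names vary by model
    task_type="CAUSAL_LM",
)

# get_peft_model freezes the base weights and trains only the low-rank adapters,
# which is what keeps the pretrained distributional knowledge intact.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```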