Why don’t we apply temperature scaling during training to match inference-time decoding?

I am a beginner in large language models, and while studying autoregressive training and decoding, I came across the following question. I apologize in advance if this is a basic or well-known issue, but I would really appreciate any clarification or insight.

In standard autoregressive language model training, model parameters are learned by maximizing the conditional log-likelihood of the data.
At time step t, the model computes logits from the hidden state:

logits_t = h_t @ W_out + b   # shape: [vocab_size]

During training, the conditional distribution is obtained using the standard softmax (i.e. temperature T = 1):

probs = softmax(logits_t)   # T = 1
loss = cross_entropy(probs, target_token)


where target_token is the ground-truth token x_t (teacher forcing).

However, during inference (decoding / generation), the model typically does not sample from this same distribution. Instead, it is common to apply temperature scaling before softmax:

probs_T = softmax(logits_t / T)   # T != 1

where T controls the smoothness / uncertainty of the output distribution.

Core Issue

From a train–inference consistency perspective, this introduces a clear asymmetry:

Training optimizes the model under softmax(logits) (T = 1)

Inference executes the model under softmax(logits / T) (T ≠ 1)

This leads to the following question.

Question

Is it theoretically or practically valid to apply temperature scaling during training as well, i.e. compute the cross-entropy loss using:

probs_T = softmax(logits_t / T)
loss = cross_entropy(probs_T, target_token)


instead of the standard T = 1 softmax, so that the learned conditional distribution more closely matches the distribution actually used at inference time?

Equivalently:

Could introducing temperature during training reduce the mismatch between training-time and decoding-time probability distributions?

Or does temperature scaling fundamentally belong only to the decoding stage, with theoretical reasons why it should not be included in training?


hmm… idk about math almost at all…:innocent:


1) What training is fitting

In standard autoregressive training, you define the model’s conditional distribution as:

p_\theta(x_t \mid x_{<t}) = \mathrm{softmax}(z_t)

where z_t are the logits at step t, and you maximize the log-likelihood of the observed next token (equivalently, minimize cross-entropy). This is maximum-likelihood estimation (MLE): you are fitting the distribution that best explains the data under your model class.
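
In code, that objective is just next-token cross-entropy under teacher forcing. A minimal sketch (shapes and the function name are illustrative, not from any particular library):

```python
import torch
import torch.nn.functional as F

def mle_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # logits: [batch, seq_len, vocab], tokens: [batch, seq_len]  (teacher forcing)
    pred = logits[:, :-1, :]          # position t predicts token t+1
    target = tokens[:, 1:]
    # cross_entropy applies log-softmax internally, i.e. it fits softmax(z_t) with T = 1
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```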


2) What temperature at inference is doing

At inference you often do not want to sample exactly from the learned distribution. You want a controllable policy for generation: more deterministic for factual/code, more diverse for brainstorming.

Temperature modifies the sampling distribution:

p_{\theta,T}(i \mid x_{<t}) = \mathrm{softmax}(z_t/T)_i

Libraries treat this explicitly as a generation-time logits transformation (a decoding control), not a training objective. For example, Transformers implements temperature as a logits warper (TemperatureLogitsWarper) that “modulates the logits distribution” during generation. (Hugging Face)
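
For example, with Transformers you pass temperature as a generation argument and it is applied only at decode time (the model name here is just an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # example model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The quick brown fox", return_tensors="pt")
# Temperature only affects sampling at generation time; the trained weights are untouched.
out = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```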

Key property: temperature is a monotonic scaling of logits, so it preserves ranking.


3) “Train–inference mismatch” is usually not what it looks like

3.1 If you decode greedily, temperature does not change the chosen token

Greedy decoding chooses:

\arg\max_i z_i

Scaling logits by a positive constant does not change their order, so the argmax token is identical for any T > 0.

So for greedy decoding, "T = 1 in training vs T ≠ 1 in inference" is not a mismatch in which token is selected; it only changes the probability values (which you are not using if you only take the argmax).
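
A tiny sanity check with toy logits:

```python
import torch

z = torch.tensor([2.0, 1.0, 0.5])   # toy logits
for T in (0.5, 1.0, 2.0):
    p = torch.softmax(z / T, dim=-1)
    # argmax is the same for every T > 0; only the probability values change
    print(T, p.argmax().item(), [round(x, 3) for x in p.tolist()])
```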

3.2 The bigger mismatch in autoregressive models is teacher forcing vs free-running

Training conditions on true prefixes, whereas inference conditions on model-generated prefixes. Scheduled Sampling is an early canonical method addressing exactly this discrepancy. (arXiv)

Temperature does not fix that exposure-bias-style mismatch.
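
For reference, the core trick of scheduled sampling looks roughly like this (a rough sketch only, not the exact algorithm from the paper; `model.step` is a hypothetical per-step decoder call, and the real method anneals `sample_rate` over training):

```python
import torch

def scheduled_sampling_inputs(model, gold_tokens, sample_rate):
    # gold_tokens: [batch, seq_len]. With probability sample_rate, condition on the
    # model's own previous prediction instead of the gold token.
    inp = gold_tokens[:, 0]                      # always start from the gold first token
    chosen = [inp]
    for t in range(1, gold_tokens.size(1)):
        logits_t = model.step(inp)               # hypothetical per-step decoder call
        model_pred = logits_t.argmax(dim=-1)     # or sample from softmax(logits_t)
        use_model = torch.rand(()).item() < sample_rate
        inp = model_pred if use_model else gold_tokens[:, t]
        chosen.append(inp)
    return torch.stack(chosen, dim=1)            # the input sequence actually fed to the loss
```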


4) If you apply temperature in the training loss, what objective are you optimizing?

Your proposed change is:

\mathcal{L}_T = -\log\big(\mathrm{softmax}(z/T)_y\big)

This is theoretically valid: it defines a coherent likelihood for a differently parameterized distribution.

But it changes what your "native model distribution" is: you are no longer doing MLE for \mathrm{softmax}(z); you are doing MLE for \mathrm{softmax}(z/T).

4.1 With a fixed constant temperature, much of it is just logit-scale reparameterization

If z = hW + b, then:

\mathrm{softmax}(z/T) = \mathrm{softmax}(h(W/T) + b/T)

So using a constant T is often close to "rescale the output layer." It does not fundamentally add new modeling power; it mainly changes the logit scale and training dynamics.
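
A quick numeric check of that identity (toy shapes):

```python
import torch

torch.manual_seed(0)
h, W, b, T = torch.randn(4, 16), torch.randn(16, 100), torch.randn(100), 2.0

lhs = torch.softmax((h @ W + b) / T, dim=-1)        # softmax(z / T)
rhs = torch.softmax(h @ (W / T) + b / T, dim=-1)    # rescaled output layer
print(torch.allclose(lhs, rhs, atol=1e-6))          # True
```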

4.2 It changes gradients and optimization dynamics (this can matter)

Let:

p_T = \mathrm{softmax}(z/T)

Then for a one-hot target y, the gradient w.r.t. the logits is:

\frac{\partial \mathcal{L}_T}{\partial z} = \frac{p_T - y}{T}

So training temperature:

  • changes the probability vector inside the gradient (because p_T depends on T),
  • scales the gradient magnitude by 1/T.

Empirically and theoretically, training temperature can affect learning dynamics and generalization; “Temperature check” studies this effect directly for softmax-cross-entropy training. (arXiv)
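
You can verify that closed form against autograd on toy logits:

```python
import torch
import torch.nn.functional as F

T = 2.0
z = torch.randn(5, requires_grad=True)        # toy logits for one position
y = torch.tensor(3)                           # target class index

loss = F.cross_entropy((z / T).unsqueeze(0), y.unsqueeze(0))
loss.backward()

p_T = torch.softmax(z.detach() / T, dim=-1)
y_onehot = F.one_hot(y, num_classes=5).float()
# autograd gradient matches the closed form (p_T - y) / T
print(torch.allclose(z.grad, (p_T - y_onehot) / T, atol=1e-6))   # True
```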


5) Why training with temperature usually does not solve your motivation (“match inference-time decoding”)

5.1 Inference temperature is a user/task knob, not a fixed property

In practice, temperature changes by application (QA, creative writing, code), and is often combined with top-p/top-k/repetition penalties. Transformers models this as a stack of logits processors/warpers at inference. (Hugging Face)

If you bake in one T at training time, you optimize for one generation style, but you still typically want adjustable inference behavior.

5.2 “Matching decoding” is bigger than temperature

Most real decoding is:

  • temperature
  • plus truncation (top-p/top-k)
  • plus penalties
  • sometimes beam search

So the "policy" is not differentiable end-to-end in the same way as MLE; changing only T in training usually does not align training with the actual generation procedure anyway.
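
To make that concrete, one sampling step with temperature plus top-k truncation looks roughly like this (a sketch; real decoders stack more processors on top, and the top-k/multinomial steps are the non-differentiable part):

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50):
    # logits: [vocab] raw scores for the next position.
    scaled = logits / temperature                      # 1) temperature
    topk_vals, topk_idx = torch.topk(scaled, top_k)    # 2) truncation (not differentiable)
    probs = torch.softmax(topk_vals, dim=-1)           #    renormalize over the kept tokens
    choice = torch.multinomial(probs, num_samples=1)   # 3) sample
    return topk_idx[choice]
```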

5.3 It doesn’t address the primary autoregressive train–test gap

If your concern is “training conditions differ from generation conditions,” Scheduled Sampling and related approaches target that directly. (arXiv)


6) When does temperature in training make sense? (Important “yes” cases)

6.1 Train-time softmax tempering as regularization

There is explicit work proposing exactly "divide the logits by T during training" to reduce overfitting / overly peaky distributions in low-resource translation:

  • Dabre & Fujita (2020) propose training-time softmax tempering and report improvements, analyzing entropy and gradients. (arXiv)
  • Follow-up work further analyzes the balance with label smoothing. (ACL Anthology)

So: training-time temperature is not “wrong”; it is a legitimate experimental knob—but usually justified as regularization/dynamics, not decode-consistency.

6.2 Knowledge distillation (standard, widely used)

Distillation explicitly raises softmax temperature to produce “soft targets,” then trains the student with the same temperature. (arXiv)
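
The usual distillation loss looks roughly like this (the T² factor keeps gradient magnitudes comparable across temperatures; the T and alpha values below are illustrative, not recommendations):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    # Soft-target term: KL between tempered teacher and tempered student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy at T = 1.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```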

6.3 Calibration (usually post-hoc)

If your concern is “probabilities are miscalibrated / overconfident,” standard practice is to fit a single temperature on validation data after training (post-hoc temperature scaling). (arXiv)
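
A minimal sketch of post-hoc temperature scaling, assuming `val_logits` and `val_labels` were already collected on a held-out set (the model weights stay frozen; only one scalar is fitted):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels):
    # Fit a single T > 0 on held-out logits by minimizing NLL.
    log_T = torch.zeros(1, requires_grad=True)   # parameterize T = exp(log_T) to keep it positive
    opt = torch.optim.LBFGS([log_T], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_T.exp().item()
```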


7) What I would recommend for a beginner (practical)

If your goal is better text generation

  1. **Train with standard cross-entropy on logits (native T = 1).**
  2. Tune inference-time decoding (temperature/top-p/top-k), treating these as policy knobs, exactly as libraries implement them. (Hugging Face)

If your goal is “less peaky / less overconfident” outputs

Try these before training-time temperature:

  • Label smoothing (soft targets) reduces overconfidence and can improve calibration/beam search behavior; see the one-line snippet after this list. (arXiv)
  • Confidence penalty / entropy regularization explicitly discourages low-entropy outputs and is evaluated on language modeling and translation. (arXiv)
  • Post-hoc temperature scaling if probability calibration is the goal. (arXiv)
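
For instance, label smoothing is a one-argument change in PyTorch (the 0.1 value is just a common choice, not a recommendation for your setup):

```python
import torch.nn.functional as F

# logits: [batch, vocab] raw scores, targets: [batch] class indices (placeholders).
# label_smoothing spreads a little probability mass over non-target tokens.
loss = F.cross_entropy(logits, targets, label_smoothing=0.1)
```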

If you still want to experiment with training-time temperature

Treat it as a regularization / training-dynamics hyperparameter, not a "consistency fix." Expect to retune the learning rate and monitor training stability; temperature can affect training dynamics in nontrivial ways. (arXiv)


8) Implementation tip (stable and correct)

In PyTorch-style frameworks, CrossEntropyLoss expects raw logits. If you want to train with temperature:

loss = F.cross_entropy(logits / T, target)

Avoid doing softmax manually before cross-entropy; stable implementations fuse log-softmax internally.
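
Putting it together, a minimal training step with a training-time temperature might look like this (a sketch; `model` and `batch` are placeholders, and the optimizer step is omitted):

```python
import torch.nn.functional as F

T = 2.0                                       # training temperature (hyperparameter)
logits = model(batch["input_ids"])            # placeholder model call -> [batch, seq_len, vocab]
loss = F.cross_entropy(
    (logits[:, :-1] / T).reshape(-1, logits.size(-1)),  # tempered raw logits, flattened
    batch["input_ids"][:, 1:].reshape(-1),              # shifted next-token targets
)
loss.backward()
```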


Takeaway

  • Training with T = 1 is MLE for the model's native distribution.

  • Using T ≠ 1 at inference is usually a deliberate decoding policy decision (implemented as a logits warper). (Hugging Face)

  • Training-time temperature can be valid and useful (regularization, distillation), but it is typically not the right lever if your motivation is simply “match inference-time sampling.” (arXiv)