I was finetuning Whisper, and later during inference I noticed a bunch of warnings that were not there when using the base model.
`generation_config` default values have been modified to match model-specific defaults: {'suppress_tokens': [1, 2, 7, 8, 9, 10, 14, 25, 26, 27, 28, 29, 31, 58, 59, 60, 61, 62, 63, 90, 91, 92, 93, 359, 503, 522, 542, 873, 893, 902, 918, 922, 931, 1350, 1853, 1982, 2460, 2627, 3246, 3253, 3268, 3536, 3846, 3961, 4183, 4667, 6585, 6647, 7273, 9061, 9383, 10428, 10929, 11938, 12033, 12331, 12562, 13793, 14157, 14635, 15265, 15618, 16553, 16604, 18362, 18956, 20075, 21675, 22520, 26130, 26161, 26435, 28279, 29464, 31650, 32302, 32470, 36865, 42863, 47425, 49870, 50254, 50258, 50358, 50359, 50360, 50361, 50362], 'begin_suppress_tokens': [220, 50257]}. If this is not desired, please set these values explicitly.
A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> will take precedence. Please check the docstring of <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> to see related `.generate()` flags.
A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> will take precedence. Please check the docstring of <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> to see related `.generate()` flags.
Being new to finetuning, I got worried that I had messed up the generation_config.json of my finetune. However, I did not see any difference besides forced_decoder_ids (which I quickly learned is deprecated, so it’s good that my finetune no longer has it) and “transformers_version”: “4.57.6” for my finetune versus “4.31.0.dev0” for the base model.
If I replace my transformers_version with “4.31.0.dev0”, the warnings disappear.
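By “replace” I mean literally patching the field in the file, e.g. with a throwaway script like this (the finetune path is a placeholder for wherever your model was saved):
import json

# Placeholder path; point this at your finetuned model's folder.
cfg_path = "./my-finetune/generation_config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# Pretend the file was written by the same transformers version that
# wrote the base model's config; this alone makes the warnings disappear for me.
cfg["transformers_version"] = "4.31.0.dev0"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)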
To verify that something fishy is going on with suppress_tokens and begin_suppress_tokens, I created a simple script that loads the base model and saves its generation_config.json using the latest transformers (not v5 yet, though). Then I load it back and run the base model with the new generation_config.json.
from transformers import WhisperForConditionalGeneration, AutoProcessor

model_path = "openai/whisper-small"
cache_dir = "./models"

model = WhisperForConditionalGeneration.from_pretrained(model_path, cache_dir=cache_dir)
processor = AutoProcessor.from_pretrained(model_path, cache_dir=cache_dir)

# disable deprecated behavior
model.generation_config.forced_decoder_ids = None

save_directory = "./output"
model.generation_config.save_pretrained(save_directory)
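A quick sanity diff between the original config from the Hub cache and the freshly saved one shows what changed (the snapshot path below is from my local cache and will differ per machine):
import json

orig_path = "./models/models--openai--whisper-small/snapshots/973afd24965f72e36ca33b3055d56a652f456b4d/generation_config.json"
new_path = "./output/generation_config.json"

with open(orig_path) as f:
    orig = json.load(f)
with open(new_path) as f:
    new = json.load(f)

# Print every key whose value differs between the two files.
for key in sorted(set(orig) | set(new)):
    if orig.get(key) != new.get(key):
        print(key, ":", orig.get(key), "->", new.get(key))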
Then I replace the base model’s generation_config.json with the one from “./output” and run the following script:
from transformers import pipeline

# I know this path is not supposed to be used this way,
# but I just want to reuse the same download folder
# with the replaced generation_config.json.
model_path = "./models/models--openai--whisper-small/snapshots/973afd24965f72e36ca33b3055d56a652f456b4d"

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_path,
)

result = pipe("./dumps/input/8467578323502011.wav", generate_kwargs={
    "language": "german",
    "task": "transcribe",
})
transcription = result["text"]
print(transcription)
Yep, the only difference was still transformers_version, I got the same warnings with the new version, and they went away once I changed transformers_version back.
Why does it detect modifications of suppress_tokens and begin_suppress_tokens if I load the base model with the original values, exactly as downloaded from the Hugging Face openai repository?
I suspect it has something to do with this description from the docs:
suppress_tokens (list[int], optional) — A list containing the non-speech tokens that will be used by the logit processor in the generate function. NON_SPEECH_TOKENS and NON_SPEECH_TOKENS_MULTI each correspond to the english-only and the multilingual model.
begin_suppress_tokens (list[int], optional, defaults to [220,50256]) — A list containing tokens that will be suppressed at the beginning of the sampling process. Initialized as the token for " " (blank_token_id) and the eos_token_id
This one looks suspicious: why is it [220, 50256] if the original OpenAI models have 50257 here?
"begin_suppress_tokens": [
220,
50257
],
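If I read the docs right, 50256 would be the english-only eos_token_id and 50257 the multilingual one, which would explain the docs default. A quick way to check this (the interpretation in the comments is my guess):
from transformers import AutoTokenizer

multi = AutoTokenizer.from_pretrained("openai/whisper-small")   # multilingual
en = AutoTokenizer.from_pretrained("openai/whisper-small.en")   # english-only

print(multi.convert_ids_to_tokens([220, 50257]))  # expecting blank + multilingual eos
print(en.convert_ids_to_tokens([220, 50256]))     # expecting blank + english-only eos
print(multi.eos_token_id, en.eos_token_id)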
NON_SPEECH_TOKENS and NON_SPEECH_TOKENS_MULTI might also differ from what is in the base generation_config.json, since the file clearly cannot contain both of them.
Then I thought: if transformers is smart enough to detect the difference, then it clearly should be able to just use its internal values and not care about the values in the file at all. So I removed suppress_tokens and begin_suppress_tokens from generation_config.json (a rough programmatic equivalent is sketched after the warning below). Great, now the only warning is:
`return_token_timestamps` is deprecated for WhisperFeatureExtractor and will be removed in Transformers v5. Use `return_attention_mask` instead, as the number of frames can be inferred from it.
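I edited the JSON by hand, but something like this should be roughly equivalent, assuming save_pretrained omits attributes that are set back to None (which I haven’t verified):
from transformers import GenerationConfig

# Drop the two keys so generate() falls back to transformers' internal
# defaults. Assumption: save_pretrained omits attributes set to None.
cfg = GenerationConfig.from_pretrained("./output")
cfg.suppress_tokens = None
cfg.begin_suppress_tokens = None
cfg.save_pretrained("./output")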
Still, it’s a bit sad that transformers can generate a generation_config.json that it later complains about. And the 50257 → 50256 difference is puzzling. Which number is right? Could this difference bite me someday?
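In the meantime, the least magical workaround I can think of is what the first warning itself suggests: setting the values explicitly at generate time. A sketch, untested beyond the happy path:
import numpy as np
from transformers import WhisperForConditionalGeneration, AutoProcessor, GenerationConfig

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = AutoProcessor.from_pretrained("openai/whisper-small")
gen_cfg = GenerationConfig.from_pretrained("openai/whisper-small")

# One second of silence, just to have valid input features for the demo.
inputs = processor(np.zeros(16000, dtype=np.float32), sampling_rate=16000, return_tensors="pt")

# "Please set these values explicitly", as the warning asks; whether
# this actually silences it is exactly what I am unsure about.
out = model.generate(
    inputs.input_features,
    suppress_tokens=gen_cfg.suppress_tokens,
    begin_suppress_tokens=gen_cfg.begin_suppress_tokens,
    language="german",
    task="transcribe",
)
print(processor.batch_decode(out, skip_special_tokens=True))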