Whisper Large V3 Turbo – Frisian (Common Voice 22.0)

Summary

  • Base: openai/whisper-large-v3-turbo
  • Task: ASR (transcribe)
  • Data: Common Voice 22.0 Frisian (fy-NL), custom splits:
    • Dev/Test: ~1 hour each, sampled from the original dev/test using clip durations to target 1 hour.
    • Train: original train plus "other", with any speakers appearing in the 1-hour dev/test subsets removed, and duplicate clips dropped. Train is speaker-disjoint from dev/test.
    • Audio normalized to 16 kHz; clips capped at 30 seconds.
  • Training (essentials): 5 epochs; bf16; effective batch 24 (per_device_train_batch_size 12 × grad_accum 2); lr 1e-5; warmup 500; weight decay ≈0.00248; linear scheduler.
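The training script itself is not part of this card; a minimal sketch of the hyperparameters above, assuming the standard transformers Seq2SeqTrainer setup (the output_dir name is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the configuration listed above; the actual
# training script is not included in this card.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-turbo-fy-cv22",  # placeholder
    num_train_epochs=5,
    per_device_train_batch_size=12,
    gradient_accumulation_steps=2,   # effective batch 12 x 2 = 24
    learning_rate=1e-5,
    warmup_steps=500,
    weight_decay=0.00248,
    lr_scheduler_type="linear",
    bf16=True,
)
```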

Evaluation (jiwer WER)

  • Dev (1h subset): WER 0.0479, loss 0.0544
  • Test (1h subset): WER 0.0574, loss 0.0628
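The WER figures above are computed with jiwer, which measures word-level edit distance normalized by the reference length. A stdlib sketch of that metric (an illustration, not jiwer's own implementation):

```python
# Word error rate = (substitutions + deletions + insertions) / reference words,
# i.e. word-level Levenshtein distance normalized by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# 1 deletion over 6 reference words -> WER 1/6
print(wer("de kat sit op it dak", "de kat sit op dak"))
```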

Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

model_id = "bartelds/whisper-large-v3-turbo-fy-cv22"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)
model.eval()

# Load audio as 16 kHz mono, matching the training data.
audio, sr = librosa.load("example.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(inputs.input_features.to(device))
text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print(text)
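Since training clips were capped at 30 seconds, recordings longer than that should be split before transcription (alternatively, the transformers ASR pipeline can do this via its chunk_length_s argument). A stdlib sketch of a hypothetical chunking helper using fixed windows with a small overlap, so words on a boundary land in at least one chunk:

```python
# Hypothetical helper: split a 16 kHz sample sequence into ~30 s windows
# with 1 s of overlap between consecutive windows.
def chunk_audio(samples, sr=16000, chunk_s=30.0, overlap_s=1.0):
    step = int((chunk_s - overlap_s) * sr)  # hop between window starts
    size = int(chunk_s * sr)                # samples per window
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + size])
        if start + size >= len(samples):
            break
    return chunks

# 65 s of dummy audio -> windows of 30 s, 30 s, and a 7 s remainder
chunks = chunk_audio([0.0] * (65 * 16000))
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 7.0]
```

Each chunk can then be fed through the processor/generate loop above; overlapping regions may produce duplicated words at the seams, which a merging step would need to handle.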

Notes & limits

  • Domain: Only Common Voice fy-NL; expect degradation on broadcast/telephone/noisy domains.
  • Encoder is frozen; gains come from decoder-side adaptation.
  • No punctuation/casing constraints; post-processing may be needed.
  • License follows Whisper (MIT); ensure downstream data complies.
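Given the lack of punctuation/casing constraints noted above, scoring or downstream use may want a normalization pass. A hypothetical stdlib helper (lowercasing, stripping ASCII punctuation, collapsing whitespace; adjust to your needs, e.g. to keep Frisian diacritics intact, which this does):

```python
import re
import string

# Hypothetical post-processing: lowercase, drop ASCII punctuation,
# collapse runs of whitespace. Diacritics (û, â, ...) are untouched.
def normalize(text: str) -> str:
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Goeie,  hoe giet it?"))  # -> "goeie hoe giet it"
```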