Whisper Large V3 Turbo – Frisian (Common Voice 22.0)
Summary
- Base: openai/whisper-large-v3-turbo
- Task: ASR (transcribe)
- Data: Common Voice 22.0 Frisian (fy-NL), custom splits:
- Dev/Test: ~1 hour each, sampled from the original dev/test using clip durations to target 1 hour.
- Train: original train plus "other", with any speakers that appear in the 1-hour dev/test subsets removed; duplicate clips removed. Train is speaker-disjoint from dev/test.
- Audio normalized to 16 kHz; clips capped at 30 seconds.
- Training (essentials): 5 epochs; bf16; effective batch ≈24 (per_device_train_batch_size 12 × grad_accum 2); lr 1e-5; warmup 500; weight decay ≈0.00248; linear scheduler.
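The split construction described above can be sketched as follows. This is a hypothetical reconstruction, not the exact pipeline: the greedy duration-based sampling and the metadata field names (`client_id`, `path`, `duration`) are assumptions about the Common Voice clip records.

```python
# Sketch of duration-targeted sampling and speaker-disjoint filtering.
# Field names (client_id, path, duration) are assumed, not confirmed.

def sample_hour(clips, target_s=3600.0):
    """Greedily take clips until roughly target_s seconds of audio."""
    subset, total = [], 0.0
    for clip in clips:
        if total >= target_s:
            break
        subset.append(clip)
        total += clip["duration"]
    return subset

def make_train(train_clips, held_out):
    """Drop train clips whose speaker appears in dev/test, then dedupe by path."""
    held_speakers = {c["client_id"] for c in held_out}
    seen_paths, out = set(), []
    for c in train_clips:
        if c["client_id"] in held_speakers or c["path"] in seen_paths:
            continue
        seen_paths.add(c["path"])
        out.append(c)
    return out
```

Greedy sampling slightly overshoots the one-hour target on the last clip, which matches the "~1 hour each" phrasing above.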
Evaluation (jiwer WER)
- Dev (1h subset): WER 0.0479, loss 0.0544
- Test (1h subset): WER 0.0574, loss 0.0628
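The reported numbers use `jiwer.wer`; for reference, word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal pure-Python equivalent:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

# One deleted word out of six reference words -> WER = 1/6
print(wer("de kat sit op it hiem", "de kat sit op hiem"))
```

A WER of 0.0574 on the test subset thus means roughly 1 word error per 17 reference words.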
Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

model_id = "bartelds/whisper-large-v3-turbo-fy-cv22"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Load audio at 16 kHz mono, matching the model's expected input.
audio, sr = librosa.load("example.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    ids = model.generate(inputs.input_features.to(device))

text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print(text)
```
Notes & limits
- Domain: Only Common Voice fy-NL; expect degradation on broadcast/telephone/noisy domains.
- Encoder is frozen; gains come from decoder-side adaptation.
- No punctuation/casing constraints; post-processing may be needed.
- License follows Whisper (MIT); ensure downstream data complies.
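Since outputs carry no punctuation/casing guarantees, a simple normalization pass may help before scoring or downstream matching. This is a generic sketch, not part of the released model or its evaluation:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation, keep letters/digits
    return " ".join(text.split())

print(normalize("Hjoed,  is it MOAI!"))  # -> "hjoed is it moai"
```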