Whisper Large V3 Turbo – Frisian (Common Voice 22.0)

Summary

  • Base: openai/whisper-large-v3-turbo
  • Task: ASR (transcribe)
  • Data: Common Voice 22.0 Frisian (fy-NL), custom splits:
    • Dev/Test: ~1 hour each, sampled from the original dev/test using clip durations to target 1 hour.
    • Train: original train plus "other", with any speakers appearing in the 1-hour dev/test subsets removed, and duplicate clips dropped. Train is speaker-disjoint from dev/test.
    • Audio normalized to 16 kHz; clips capped at 30 seconds.
  • Training (essentials): 5 epochs; bf16; effective batch 24 (per_device_train_batch_size 12 × grad_accum 2); lr 1e-5; warmup 500; weight decay ≈0.00248; linear scheduler.
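The training script itself is not part of this card; a minimal sketch of the hyperparameters above, assuming the standard transformers Seq2SeqTrainer setup (the output_dir name is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the configuration listed above; the actual
# training script is not included in this card.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-turbo-fy-cv22",  # placeholder
    num_train_epochs=5,
    per_device_train_batch_size=12,
    gradient_accumulation_steps=2,   # effective batch 12 x 2 = 24
    learning_rate=1e-5,
    warmup_steps=500,
    weight_decay=0.00248,
    lr_scheduler_type="linear",
    bf16=True,
)
```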

Evaluation (jiwer WER)

  • Dev (1h subset): WER 0.0479, loss 0.0544
  • Test (1h subset): WER 0.0574, loss 0.0628
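The WER figures above are computed with jiwer, which measures word-level edit distance normalized by the reference length. A stdlib sketch of that metric (an illustration, not jiwer's own implementation):

```python
# Word error rate = (substitutions + deletions + insertions) / reference words,
# i.e. word-level Levenshtein distance normalized by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# 1 deletion over 6 reference words -> WER 1/6
print(wer("de kat sit op it dak", "de kat sit op dak"))
```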

Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

model_id = "bartelds/whisper-large-v3-turbo-fy-cv22"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)
model.eval()

# Load audio as 16 kHz mono, matching the training data.
audio, sr = librosa.load("example.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(inputs.input_features.to(device))
text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print(text)
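Since training clips were capped at 30 seconds, recordings longer than that should be split before transcription (alternatively, the transformers ASR pipeline can do this via its chunk_length_s argument). A stdlib sketch of a hypothetical chunking helper using fixed windows with a small overlap, so words on a boundary land in at least one chunk:

```python
# Hypothetical helper: split a 16 kHz sample sequence into ~30 s windows
# with 1 s of overlap between consecutive windows.
def chunk_audio(samples, sr=16000, chunk_s=30.0, overlap_s=1.0):
    step = int((chunk_s - overlap_s) * sr)  # hop between window starts
    size = int(chunk_s * sr)                # samples per window
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + size])
        if start + size >= len(samples):
            break
    return chunks

# 65 s of dummy audio -> windows of 30 s, 30 s, and a 7 s remainder
chunks = chunk_audio([0.0] * (65 * 16000))
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 7.0]
```

Each chunk can then be fed through the processor/generate loop above; overlapping regions may produce duplicated words at the seams, which a merging step would need to handle.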

Notes & limits

  • Domain: Only Common Voice fy-NL; expect degradation on broadcast/telephone/noisy domains.
  • Encoder is frozen; gains come from decoder-side adaptation.
  • No punctuation/casing constraints; post-processing may be needed.
  • License follows Whisper (MIT); ensure downstream data complies.
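Given the lack of punctuation/casing constraints noted above, scoring or downstream use may want a normalization pass. A hypothetical stdlib helper (lowercasing, stripping ASCII punctuation, collapsing whitespace; adjust to your needs, e.g. to keep Frisian diacritics intact, which this does):

```python
import re
import string

# Hypothetical post-processing: lowercase, drop ASCII punctuation,
# collapse runs of whitespace. Diacritics (û, â, ...) are untouched.
def normalize(text: str) -> str:
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Goeie,  hoe giet it?"))  # -> "goeie hoe giet it"
```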