Whisper-Large-v3 Portuguese - CAPES Baseline (State-of-the-Art Comparison)

This model is a fine-tuned version of openai/whisper-large-v3 for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with the filtered CAPES synthetic dataset, using the original filtering methodology from our IEEE Access 2024 paper.

Purpose

This model serves as a direct comparison baseline with the current state-of-the-art Portuguese Whisper model (my-north-ai/whisper-large-v3-pt), which was trained using the same CAPES dataset and filtering approach. The purpose is to:

  1. Replicate state-of-the-art performance: Uses the same CAPES synthetic dataset (derived from academic thesis transcripts) and filtering methodology as the my-north-ai model
  2. Establish comparison baseline: Provides a controlled starting point to evaluate the effectiveness of the new WAVe filtering methodology
  3. Benchmark against established methods: Demonstrates performance using sentence-level filtering (the previous state-of-the-art approach)

What is CAPES?

CAPES is a synthetic Portuguese speech dataset generated from academic thesis transcripts using our previous methodology (Perezhohin et al., IEEE Access 2024). Key characteristics:

  • Source: Academic thesis transcripts (longer, more complex utterances than Common Voice)
  • Original size: ~55k synthetic samples
  • Filtering: Sentence-level quality assessment (our previous approach)
  • Filtered size: 33.2k samples retained after filtering
  • Usage: Combined with Common Voice 17.0 for ASR training

This dataset and filtering methodology were used to create the my-north-ai/whisper-large-v3-pt model, which achieved state-of-the-art results for Portuguese Whisper at the time.
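
To make the sentence-level decision concrete, below is a minimal illustrative sketch: one quality score is computed per utterance, and the whole sample is kept or dropped on that basis. The sentence_quality scorer and the threshold value are placeholders, not the scorer or cutoff actually used in the paper.

def sentence_quality(audio, transcript) -> float:
    """Hypothetical scorer: returns one quality score for the whole utterance."""
    raise NotImplementedError

QUALITY_THRESHOLD = 0.8  # assumed cutoff, for illustration only

def filter_sentence_level(samples):
    # Keep or drop each synthetic sample as a whole. A single utterance-level
    # score decides, so a localized TTS error inside a long utterance can be
    # averaged away and survive filtering.
    return [
        s for s in samples
        if sentence_quality(s["audio"], s["text"]) >= QUALITY_THRESHOLD
    ]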

Model Details

| Property | Value |
|---|---|
| Base Model | openai/whisper-large-v3 |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 1550M |
| Training Data | Common Voice 17.0 + CAPES Filtered (Sentence-Level) |
| Total Training Samples | ~55,000 |
| Sampling Rate | 16 kHz |
| Filtering Method | Sentence-level (IEEE Access 2024) |

Evaluation Results

This Model (whisper-large-v3-cv-capes-fs024-IEEE-pt)

| Metric | Value |
|---|---|
| Validation Loss | 0.0913 |
| Validation WER | 7.74% |
| Test WER (Common Voice) | 8.43% |
| Test WER (MLS) | 13.54% |
| Best Checkpoint | Step 400 |
| Max Training Steps | 1,080 |
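
The WER figures above can be reproduced against any labeled test set with the evaluate library; the snippet below is a generic sketch (the prediction and reference lists are placeholders, and text normalization choices can shift the absolute numbers).

import evaluate

# Word error rate between model outputs and ground-truth transcripts.
wer_metric = evaluate.load("wer")

predictions = ["..."]  # model transcriptions (placeholder)
references = ["..."]   # ground-truth transcripts (placeholder)

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")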

Comparison with WAVe Filtering and Other Approaches

| Training Data | Filtering Method | Samples | Max Steps | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| CAPES + CV (this model) | Sentence-level | ~55k | 1,080 | 8.43% | 13.54% |
| CAPES + CV (WAVe) | Word-level (q≥0.8) | ~45k | 880 | 7.95% | 6.89% |
| Our Synthetic + CV | Word-level (q≥0.5) | 41k | 805 | 8.33% | 10.27% |
| CV Only | None | 22k | 430 | 11.78% | 15.31% |

Key Performance Characteristics

  • State-of-the-art replication: Achieves comparable performance to my-north-ai/whisper-large-v3-pt
  • Strong in-domain: 8.43% Test WER on Common Voice
  • Baseline cross-domain: 13.54% MLS WER (establishes comparison point)
  • Large dataset: Uses 55k total samples (largest among comparisons)
  • Most training steps: Requires 1,080 max steps

Comparison with my-north-ai/whisper-large-v3-pt

This model replicates the training methodology used in my-north-ai/whisper-large-v3-pt:

| Aspect | my-north-ai/whisper-large-v3-pt | This Model |
|---|---|---|
| Base Model | whisper-large-v3 | whisper-large-v3 |
| Synthetic Data | CAPES | CAPES |
| Filtering | Sentence-level | Sentence-level |
| Real Data | Common Voice | Common Voice |
| Purpose | State-of-the-art Portuguese ASR | Comparison baseline for WAVe |

Both models use the same fundamental approach, making this an ideal baseline to evaluate the improvements from WAVe word-level filtering.

Training Data

Dataset Composition

| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real crowdsourced speech |
| CAPES Filtered (Sentence-level) | ~33,200 | Academic thesis-derived synthetic speech |
| Total | ~55,000 | |
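
A sketch of assembling this mixture with the datasets library follows; the identifier for the filtered CAPES split is a placeholder (no fixed dataset ID is given in this card), and both sources are assumed to share the same column schema.

from datasets import Audio, concatenate_datasets, load_dataset

cv = load_dataset("mozilla-foundation/common_voice_17_0", "pt", split="train")
capes = load_dataset("path/to/capes-filtered", split="train")  # placeholder ID

# Resample both sources to the 16 kHz rate Whisper expects.
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))
capes = capes.cast_column("audio", Audio(sampling_rate=16_000))

# Assumes matching columns (e.g. "audio", "text") in both datasets.
train = concatenate_datasets([cv, capes]).shuffle(seed=42)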

CAPES Dataset Characteristics

The CAPES synthetic dataset differs from our GPT-4o-generated data (the "Our Synthetic + CV" row in the comparison table above):

  • Source transcripts: Academic thesis texts (longer, more complex utterances)
  • Generation: Previous TTS methodology (IEEE Access 2024)
  • Filtering: Sentence-level quality assessment (not word-level)
  • Quality: Contains synthesis errors that can hide within extended passages

Training Procedure

Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
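
A Seq2SeqTrainingArguments sketch mirroring the table is shown below; the per-device batch size / gradient-accumulation split and the checkpointing settings are assumptions, since only the global batch size of 256 is reported.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-cv-capes-pt",  # placeholder path
    learning_rate=5e-6,
    per_device_train_batch_size=64,   # assumption: 64 x 4 accumulation = 256 global
    gradient_accumulation_steps=4,    # assumption; only the global size is reported
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,                    # assumption
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)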

Training Infrastructure

  • GPU: NVIDIA H200 (141GB VRAM)
  • Operating System: Ubuntu 22.04
  • Framework: Hugging Face Transformers

Usage

Transcription Pipeline

from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-capes-fs024-IEEE-pt",
    device="cuda"
)

# For recordings longer than 30 seconds, pass chunk_length_s=30 to the call below.
result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])

Direct Model Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-cv-capes-fs024-IEEE-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-cv-capes-fs024-IEEE-pt")
model.to("cuda")

# Whisper expects 16 kHz mono input; librosa resamples on load.
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Specifying Language

# Pin decoding to Portuguese transcription so language auto-detection is skipped.
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"

When to Use This Model

This model is ideal when:

  • Replicating the state of the art: you want to follow the established CAPES methodology
  • Establishing a comparison baseline: you are evaluating new filtering approaches against the previous state-of-the-art method
  • Maximizing dataset size: you prefer a larger synthetic mix (~55k total samples)

For better performance, consider the WAVe-filtered alternatives described in the next section.

Why WAVe Filtering Improves on This Approach

The successor model whisper-large-v3-cv-capes-filtered-pt applies WAVe word-level filtering to the same CAPES dataset and achieves:

  • 5.7% relative reduction in in-domain WER (7.95% vs 8.43% on Common Voice)
  • 49% relative reduction in cross-domain WER (6.89% vs 13.54% on MLS)
  • 18% fewer training steps (880 vs 1,080)
  • ~30% less synthetic data (23k vs 33k samples)

The size of this gap, especially cross-domain, demonstrates that word-level filtering is more effective than sentence-level filtering at catching synthesis errors in longer utterances.
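
For contrast with the sentence-level sketch earlier in this card, a word-level filter can be outlined as follows. The per-word scorer is hypothetical, only the q ≥ 0.8 threshold comes from the comparison table, and whether WAVe rejects whole samples or individual words is not detailed here; this sketch rejects whole samples.

def word_qualities(audio, transcript) -> list[float]:
    """Hypothetical scorer: one quality score per aligned word."""
    raise NotImplementedError

Q_MIN = 0.8  # word-level threshold reported for the WAVe-filtered model

def filter_word_level(samples):
    # One low-quality word is enough to reject a sample, so a localized
    # synthesis error cannot be averaged away by an otherwise clean utterance.
    return [
        s for s in samples
        if min(word_qualities(s["audio"], s["text"])) >= Q_MIN
    ]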

Limitations

  • Sentence-level filtering: Cannot detect localized synthesis errors within long utterances
  • Suboptimal cross-domain: 13.54% MLS WER vs 6.89% with WAVe filtering
  • Training inefficiency: Requires 1,080 steps vs 880 with WAVe filtering
  • Data volume: Uses more data yet performs worse than the word-level-filtered alternatives

Citation

This model uses the CAPES dataset and filtering methodology from our previous work:

@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}

License

Apache 2.0
