# Whisper-Large-v3 Portuguese - CAPES Baseline (State-of-the-Art Comparison)
This model is a fine-tuned version of openai/whisper-large-v3 for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with the CAPES filtered synthetic dataset using the original filtering methodology from our previous IEEE Access 2024 paper.
## Purpose
This model serves as a direct comparison baseline with the current state-of-the-art Portuguese Whisper model (my-north-ai/whisper-large-v3-pt), which was trained using the same CAPES dataset and filtering approach. The purpose is to:
- Replicate state-of-the-art performance: Uses the same CAPES synthetic dataset (derived from academic thesis recordings) and filtering methodology as the my-north-ai model
- Establish comparison baseline: Provides a controlled starting point to evaluate the effectiveness of the new WAVe filtering methodology
- Benchmark against established methods: Demonstrates performance using sentence-level filtering (the previous state-of-the-art approach)
## What is CAPES?
CAPES is a synthetic Portuguese speech dataset generated from academic thesis recordings using our previous methodology (Perezhohin et al., IEEE Access 2024). Key characteristics:
- Source: Academic thesis transcripts (longer, more complex utterances than Common Voice)
- Original size: ~55k synthetic samples
- Filtering: Sentence-level quality assessment (our previous approach)
- Filtered size: 33.2k samples retained after filtering
- Usage: Combined with Common Voice 17.0 for ASR training
This dataset and filtering methodology were used to create the my-north-ai/whisper-large-v3-pt model, which achieved state-of-the-art results for Portuguese Whisper at the time.
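To make "sentence-level quality assessment" concrete, the sketch below scores each synthetic clip as a whole and keeps or drops it in a single decision. It is a minimal illustration under assumptions: the similarity measure (character-level `SequenceMatcher` against an ASR back-transcription) and the 0.8 threshold are placeholders, not the exact procedure from the IEEE Access 2024 paper.

```python
from difflib import SequenceMatcher

def sentence_score(reference: str, back_transcription: str) -> float:
    # Crude quality proxy: character-level similarity between the TTS
    # input text and an ASR back-transcription of the synthetic clip
    return SequenceMatcher(None, reference.lower(), back_transcription.lower()).ratio()

def keep_clip(reference: str, back_transcription: str, threshold: float = 0.8) -> bool:
    # One score, one decision: the whole utterance is kept or dropped
    return sentence_score(reference, back_transcription) >= threshold

print(keep_clip("o modelo transcreve fala em português",
                "o modelo transcreve fala em portugues"))  # True
```

The property that matters here is the granularity: a single score per utterance, which is exactly what the WAVe comparison below revisits.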
## Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-large-v3 |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 1550M |
| Training Data | Common Voice 17.0 + CAPES Filtered (Sentence-Level) |
| Total Training Samples | ~55,000 |
| Sampling Rate | 16 kHz |
| Filtering Method | Sentence-level (IEEE Access 2024) |
## Evaluation Results
### This Model (whisper-large-v3-cv-capes-fs024-IEEE-pt)
| Metric | Value |
|---|---|
| Validation Loss | 0.0913 |
| Validation WER | 7.74% |
| Test WER (Common Voice) | 8.43% |
| Test WER (MLS) | 13.54% |
| Best Checkpoint | Step 400 |
| Max Training Steps | 1,080 |
### Comparison with WAVe Filtering and Other Approaches
| Training Data | Filtering Method | Samples | Max Steps | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| CAPES + CV (this model) | Sentence-level | ~55k | 1,080 | 8.43% | 13.54% |
| CAPES + CV (WAVe) | Word-level (q≥0.8) | ~45k | 880 | 7.95% | 6.89% |
| Our Synthetic + CV | Word-level (q≥0.5) | 41k | 805 | 8.33% | 10.27% |
| CV Only | None | 22k | 430 | 11.78% | 15.31% |
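For reference, the WER figures above follow the standard definition and can be recomputed with the Hugging Face `evaluate` library (which needs the `jiwer` backend installed); the reference/prediction strings below are placeholders:

```python
import evaluate

# WER = (substitutions + insertions + deletions) / reference word count
wer_metric = evaluate.load("wer")

references = ["o gato está no telhado"]    # ground-truth transcripts
predictions = ["o gato esta no telhado"]   # model outputs

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer * 100:.2f}%")  # 1 substitution / 5 words -> 20.00%
```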
### Key Performance Characteristics
- State-of-the-art replication: Achieves comparable performance to my-north-ai/whisper-large-v3-pt
- Strong in-domain: 8.43% Test WER on Common Voice
- Baseline cross-domain: 13.54% MLS WER (establishes comparison point)
- Large dataset: Uses 55k total samples (largest among comparisons)
- Most training steps: Requires 1,080 max steps
## Comparison with my-north-ai/whisper-large-v3-pt
This model replicates the training methodology used in my-north-ai/whisper-large-v3-pt:
| Aspect | my-north-ai/whisper-large-v3-pt | This Model |
|---|---|---|
| Base Model | whisper-large-v3 | whisper-large-v3 |
| Synthetic Data | CAPES | CAPES |
| Filtering | Sentence-level | Sentence-level |
| Real Data | Common Voice | Common Voice |
| Purpose | State-of-the-art Portuguese ASR | Comparison baseline for WAVe |
Both models use the same fundamental approach, making this an ideal baseline to evaluate the improvements from WAVe word-level filtering.
## Training Data
### Dataset Composition
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real crowdsourced speech |
| CAPES Filtered (Sentence-level) | ~33,200 | Academic thesis-derived synthetic speech |
| Total | ~55,000 | Combined training set |
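A rough sketch of how such a mixture can be assembled with the `datasets` library. Common Voice 17.0 is gated on the Hub (you must accept its terms), and the CAPES path and column names here are hypothetical, since the filtered set is loaded from local storage:

```python
from datasets import Audio, concatenate_datasets, load_dataset, load_from_disk

# Real crowdsourced speech (gated dataset; requires accepting the Hub terms)
cv = load_dataset("mozilla-foundation/common_voice_17_0", "pt", split="train")

# Filtered CAPES synthetic speech; path and schema are hypothetical
capes = load_from_disk("path/to/capes_filtered")

# Whisper's feature extractor expects 16 kHz audio
cv = cv.cast_column("audio", Audio(sampling_rate=16000))
capes = capes.cast_column("audio", Audio(sampling_rate=16000))

# Align the schemas, then concatenate and shuffle into one training set
cv = cv.select_columns(["audio", "sentence"])
capes = capes.select_columns(["audio", "sentence"])
train_set = concatenate_datasets([cv, capes]).shuffle(seed=42)
```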
### CAPES Dataset Characteristics
The CAPES synthetic dataset differs from our GPT-4o-generated data:
- Source text: Academic thesis transcripts (longer, more complex utterances than Common Voice)
- Generation: TTS pipeline from our previous methodology (IEEE Access 2024)
- Filtering: Sentence-level quality assessment (not word-level)
- Quality: Contains synthesis errors that may hide within extended passages
## Training Procedure
### Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
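The table maps onto `transformers.Seq2SeqTrainingArguments` roughly as follows. This is a sketch rather than the actual training script; in particular, how the global batch of 256 is split between per-device batch size and gradient accumulation is an assumption:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-cv-capes-fs024-IEEE-pt",
    learning_rate=5e-6,
    per_device_train_batch_size=32,   # 32 x 8 accumulation = 256 global (assumed split)
    gradient_accumulation_steps=8,
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,                    # keep checkpoints aligned with evaluations
    metric_for_best_model="eval_loss",
    greater_is_better=False,          # lower eval_loss is better
    load_best_model_at_end=True,      # best checkpoint was step 400
)
```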
### Training Infrastructure
- GPU: NVIDIA H200 (141 GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
## Usage
### Transcription Pipeline
```python
from transformers import pipeline

# Load the fine-tuned checkpoint as an ASR pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-capes-fs024-IEEE-pt",
    device="cuda",
)

result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
```
### Direct Model Usage
```python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# The processor bundles the feature extractor and tokenizer
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-cv-capes-fs024-IEEE-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-cv-capes-fs024-IEEE-pt")
model.to("cuda")

# Whisper expects 16 kHz mono audio
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

# Generate token IDs, then decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
### Specifying Language

```python
# Pin decoding to Portuguese transcription instead of relying on
# Whisper's automatic language detection
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
```

With recent transformers releases, the same settings can also be passed per call: `model.generate(input_features, language="pt", task="transcribe")`.
## When to Use This Model
This model is ideal when:
- Replicating the state of the art: you want to use the established CAPES methodology
- Comparison baseline: you are evaluating new filtering approaches against established methods
- Maximum dataset size: you prefer the larger combined training set (~55k total samples)
Consider WAVe-filtered alternatives for better performance:
- whisper-large-v3-cv-capes-filtered-pt: 49% better cross-domain (6.89% vs 13.54% MLS), 18% fewer steps
- whisper-large-v3-mixed-pt: Best overall with newer synthetic data
## Why WAVe Filtering Improves on This Approach
The successor model whisper-large-v3-cv-capes-filtered-pt applies WAVe word-level filtering to the same CAPES dataset and achieves:
- 5.7% better in-domain WER (7.95% vs 8.43%)
- 49% better cross-domain WER (6.89% vs 13.54% on MLS)
- 18% fewer training steps (880 vs 1,080)
- 30% less data (23k vs 33k synthetic samples)
This dramatic improvement demonstrates that word-level filtering is more effective than sentence-level filtering for detecting synthesis errors in longer utterances.
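Building on the sentence-level sketch above, the granularity difference can be illustrated with a toy example. The scores and aggregation below are illustrative assumptions, not the actual WAVe implementation; the threshold mirrors the q≥0.8 setting reported in the comparison table:

```python
def keep_sentence_level(word_qualities: list[float], threshold: float = 0.8) -> bool:
    # One averaged score per utterance: a few badly synthesized words
    # can be averaged away inside a long, otherwise-good sentence
    return sum(word_qualities) / len(word_qualities) >= threshold

def keep_word_level(word_qualities: list[float], threshold: float = 0.8) -> bool:
    # Every word must pass: localized synthesis errors cannot hide
    # behind a good sentence-level average
    return all(q >= threshold for q in word_qualities)

words = [0.95, 0.98, 0.35, 0.97]   # one corrupted word in a long utterance
print(keep_sentence_level(words))  # True  (average 0.81 clears 0.8)
print(keep_word_level(words))      # False (the 0.35 word fails)
```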
## Limitations
- Sentence-level filtering: Cannot detect localized synthesis errors within long utterances
- Suboptimal cross-domain: 13.54% MLS WER vs 6.89% with WAVe filtering
- Training inefficiency: Requires 1,080 steps vs 880 with WAVe filtering
- Data volume: Uses more data but achieves worse performance than filtered approaches
## Citation
This model uses the CAPES dataset and filtering methodology from our previous work:
```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```
## References
- Base Model: openai/whisper-large-v3
- Training Data (Real): mozilla-foundation/common_voice_17_0
- Comparison Model: my-north-ai/whisper-large-v3-pt
- Improved Version: whisper-large-v3-cv-capes-filtered-pt (with WAVe filtering)
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- CAPES Methodology: Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)
## License
Apache 2.0