Whisper-Large-v3 Portuguese - CAPES Baseline (State-of-the-Art Comparison)

This model is a fine-tuned version of openai/whisper-large-v3 for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with the filtered CAPES synthetic dataset, using the original filtering methodology from our IEEE Access 2024 paper.

Purpose

This model serves as a direct comparison baseline with the current state-of-the-art Portuguese Whisper model (my-north-ai/whisper-large-v3-pt), which was trained using the same CAPES dataset and filtering approach. The purpose is to:

  1. Replicate state-of-the-art performance: Uses the same CAPES synthetic dataset (derived from academic thesis transcripts) and filtering methodology as the my-north-ai model
  2. Establish comparison baseline: Provides a controlled starting point to evaluate the effectiveness of the new WAVe filtering methodology
  3. Benchmark against established methods: Demonstrates performance using sentence-level filtering (the previous state-of-the-art approach)

What is CAPES?

CAPES is a synthetic Portuguese speech dataset generated from academic thesis transcripts using our previous methodology (Perezhohin et al., IEEE Access 2024). Key characteristics:

  • Source: Academic thesis transcripts (longer, more complex utterances than Common Voice)
  • Original size: ~55k synthetic samples
  • Filtering: Sentence-level quality assessment (our previous approach)
  • Filtered size: 33.2k samples retained after filtering
  • Usage: Combined with Common Voice 17.0 for ASR training

This dataset and filtering methodology were used to create the my-north-ai/whisper-large-v3-pt model, which achieved state-of-the-art results for Portuguese Whisper at the time.
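
To make the sentence-level decision concrete, below is a minimal illustrative sketch: one quality score is computed per utterance, and the whole sample is kept or dropped on that basis. The sentence_quality scorer and the threshold value are placeholders, not the scorer or cutoff actually used in the paper.

def sentence_quality(audio, transcript) -> float:
    """Hypothetical scorer: returns one quality score for the whole utterance."""
    raise NotImplementedError

QUALITY_THRESHOLD = 0.8  # assumed cutoff, for illustration only

def filter_sentence_level(samples):
    # Keep or drop each synthetic sample as a whole. A single utterance-level
    # score decides, so a localized TTS error inside a long utterance can be
    # averaged away and survive filtering.
    return [
        s for s in samples
        if sentence_quality(s["audio"], s["text"]) >= QUALITY_THRESHOLD
    ]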

Model Details

| Property | Value |
|---|---|
| Base Model | openai/whisper-large-v3 |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 1550M |
| Training Data | Common Voice 17.0 + CAPES Filtered (Sentence-Level) |
| Total Training Samples | ~55,000 |
| Sampling Rate | 16 kHz |
| Filtering Method | Sentence-level (IEEE Access 2024) |

Evaluation Results

This Model (whisper-large-v3-cv-capes-fs024-IEEE-pt)

| Metric | Value |
|---|---|
| Validation Loss | 0.0913 |
| Validation WER | 7.74% |
| Test WER (Common Voice) | 8.43% |
| Test WER (MLS) | 13.54% |
| Best Checkpoint | Step 400 |
| Max Training Steps | 1,080 |
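
The WER figures above can be reproduced against any labeled test set with the evaluate library; the snippet below is a generic sketch (the prediction and reference lists are placeholders, and text normalization choices can shift the absolute numbers).

import evaluate

# Word error rate between model outputs and ground-truth transcripts.
wer_metric = evaluate.load("wer")

predictions = ["..."]  # model transcriptions (placeholder)
references = ["..."]   # ground-truth transcripts (placeholder)

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")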

Comparison with WAVe Filtering and Other Approaches

| Training Data | Filtering Method | Samples | Max Steps | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| CAPES + CV (this model) | Sentence-level | ~55k | 1,080 | 8.43% | 13.54% |
| CAPES + CV (WAVe) | Word-level (q≥0.8) | ~45k | 880 | 7.95% | 6.89% |
| Our Synthetic + CV | Word-level (q≥0.5) | 41k | 805 | 8.33% | 10.27% |
| CV Only | None | 22k | 430 | 11.78% | 15.31% |

Key Performance Characteristics

  • State-of-the-art replication: Achieves comparable performance to my-north-ai/whisper-large-v3-pt
  • Strong in-domain: 8.43% Test WER on Common Voice
  • Baseline cross-domain: 13.54% MLS WER (establishes comparison point)
  • Large dataset: Uses 55k total samples (largest among comparisons)
  • Most training steps: Requires 1,080 max steps

Comparison with my-north-ai/whisper-large-v3-pt

This model replicates the training methodology used in my-north-ai/whisper-large-v3-pt:

| Aspect | my-north-ai/whisper-large-v3-pt | This Model |
|---|---|---|
| Base Model | whisper-large-v3 | whisper-large-v3 |
| Synthetic Data | CAPES | CAPES |
| Filtering | Sentence-level | Sentence-level |
| Real Data | Common Voice | Common Voice |
| Purpose | State-of-the-art Portuguese ASR | Comparison baseline for WAVe |

Both models use the same fundamental approach, making this an ideal baseline to evaluate the improvements from WAVe word-level filtering.

Training Data

Dataset Composition

| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real crowdsourced speech |
| CAPES Filtered (Sentence-level) | ~33,200 | Academic thesis-derived synthetic speech |
| Total | ~55,000 | |
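
A sketch of assembling this mixture with the datasets library follows; the identifier for the filtered CAPES split is a placeholder (no fixed dataset ID is given in this card), and both sources are assumed to share the same column schema.

from datasets import Audio, concatenate_datasets, load_dataset

cv = load_dataset("mozilla-foundation/common_voice_17_0", "pt", split="train")
capes = load_dataset("path/to/capes-filtered", split="train")  # placeholder ID

# Resample both sources to the 16 kHz rate Whisper expects.
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))
capes = capes.cast_column("audio", Audio(sampling_rate=16_000))

# Assumes matching columns (e.g. "audio", "text") in both datasets.
train = concatenate_datasets([cv, capes]).shuffle(seed=42)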

CAPES Dataset Characteristics

The CAPES synthetic dataset differs from our GPT-4o-generated data (the "Our Synthetic + CV" row in the comparison table above):

  • Source transcripts: Academic thesis texts (longer, more complex utterances)
  • Generation: Previous TTS methodology (IEEE Access 2024)
  • Filtering: Sentence-level quality assessment (not word-level)
  • Quality: Contains synthesis errors that can hide within extended passages

Training Procedure

Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
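
A Seq2SeqTrainingArguments sketch mirroring the table is shown below; the per-device batch size / gradient-accumulation split and the checkpointing settings are assumptions, since only the global batch size of 256 is reported.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-cv-capes-pt",  # placeholder path
    learning_rate=5e-6,
    per_device_train_batch_size=64,   # assumption: 64 x 4 accumulation = 256 global
    gradient_accumulation_steps=4,    # assumption; only the global size is reported
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,                    # assumption
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)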

Training Infrastructure

  • GPU: NVIDIA H200 (141GB VRAM)
  • Operating System: Ubuntu 22.04
  • Framework: Hugging Face Transformers

Usage

Transcription Pipeline

from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-capes-fs024-IEEE-pt",
    device="cuda"
)

# For recordings longer than 30 seconds, pass chunk_length_s=30 to the call below.
result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])

Direct Model Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-cv-capes-fs024-IEEE-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-cv-capes-fs024-IEEE-pt")
model.to("cuda")

# Whisper expects 16 kHz mono input; librosa resamples on load.
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Specifying Language

# Pin decoding to Portuguese transcription so language auto-detection is skipped.
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"

When to Use This Model

This model is ideal when:

  • Replicating the state of the art: you want to follow the established CAPES methodology
  • Establishing a comparison baseline: you are evaluating new filtering approaches against the previous state-of-the-art method
  • Maximizing dataset size: you prefer a larger synthetic mix (~55k total samples)

For better performance, consider the WAVe-filtered alternatives described in the next section.

Why WAVe Filtering Improves on This Approach

The successor model whisper-large-v3-cv-capes-filtered-pt applies WAVe word-level filtering to the same CAPES dataset and achieves:

  • 5.7% relative reduction in in-domain WER (7.95% vs 8.43% on Common Voice)
  • 49% relative reduction in cross-domain WER (6.89% vs 13.54% on MLS)
  • 18% fewer training steps (880 vs 1,080)
  • ~30% less synthetic data (23k vs 33k samples)

The size of this gap, especially cross-domain, demonstrates that word-level filtering is more effective than sentence-level filtering at catching synthesis errors in longer utterances.
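
For contrast with the sentence-level sketch earlier in this card, a word-level filter can be outlined as follows. The per-word scorer is hypothetical, only the q ≥ 0.8 threshold comes from the comparison table, and whether WAVe rejects whole samples or individual words is not detailed here; this sketch rejects whole samples.

def word_qualities(audio, transcript) -> list[float]:
    """Hypothetical scorer: one quality score per aligned word."""
    raise NotImplementedError

Q_MIN = 0.8  # word-level threshold reported for the WAVe-filtered model

def filter_word_level(samples):
    # One low-quality word is enough to reject a sample, so a localized
    # synthesis error cannot be averaged away by an otherwise clean utterance.
    return [
        s for s in samples
        if min(word_qualities(s["audio"], s["text"])) >= Q_MIN
    ]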

Limitations

  • Sentence-level filtering: Cannot detect localized synthesis errors within long utterances
  • Suboptimal cross-domain: 13.54% MLS WER vs 6.89% with WAVe filtering
  • Training inefficiency: Requires 1,080 steps vs 880 with WAVe filtering
  • Data volume: Uses more data yet performs worse than the word-level-filtered alternatives

Citation

This model uses the CAPES dataset and filtering methodology from our previous work:

@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}

License

Apache 2.0
