Whisper-Tiny Portuguese - Full Synthetic Data (Unfiltered)
This model is a fine-tuned version of openai/whisper-tiny for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with all synthetic speech data without quality filtering, representing the maximum data augmentation approach.
Purpose
This model completes the evaluation of synthetic data augmentation strategies for Whisper-Tiny Portuguese by testing whether maximum data volume can compensate for architectural limitations. The results provide important insights:
Key Finding: Unfiltered synthetic data yields a lower in-domain WER (29.84%) than mid-high filtering (30.11%), but a higher one than high-quality filtering (29.33%). This reveals a non-monotonic relationship between data volume and performance for Tiny models.
| Configuration | Synthetic Samples | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|
| CV Only | 0 | 30.72% | 45.83% |
| High-Quality (best) | 7,312 | 29.33% | 44.18% |
| Unfiltered (this) | 21,968 | 29.84% | 46.54% |
| Mid-High | 19,181 | 30.11% | 47.25% |
Interestingly, including all data (even low-quality) outperforms the mid-high threshold, suggesting complex interactions between data quality and model capacity.
Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-tiny |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 39M |
| Training Data | Common Voice 17.0 + ALL Synthetic (Unfiltered) |
| Total Training Samples | 43,834 |
| Sampling Rate | 16kHz |
Evaluation Results
This Model (whisper-tiny-cv-full-synthetic-pt)
| Metric | Value |
|---|---|
| Validation Loss | 0.4517 |
| Validation WER | 28.06% |
| Test WER (Common Voice) | 29.84% |
| Test WER (MLS) | 46.54% |
| Best Checkpoint | Step 500 |
| Max Training Steps | 860 |
Comparison with Other Training Configurations (Whisper-Tiny Portuguese)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 430 | 0.4463 | 27.05% | 30.72% | 45.83% |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.4481 | 26.74% | 29.33% | 44.18% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.4550 | 26.95% | 30.11% | 47.25% |
| All Synthetic + CV (Unfiltered) | 860 | 0.4517 | 28.06% | 29.84% | 46.54% |
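The WER figures above are standard word-level edit-distance rates. As a reference point, a minimal pure-Python implementation is sketched below; it is not the exact scoring script used for these numbers, which may apply additional text normalization before comparison:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("esta" for "está") over 7 reference words ≈ 0.143
print(word_error_rate("o gato está em cima da mesa",
                      "o gato esta em cima da mesa"))
```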
Key Performance Characteristics
- Second-best in-domain: 29.84% Test WER (only 0.51 percentage points behind the best configuration)
- Better than mid-high: Outperforms q ≥ 0.5 threshold despite including low-quality data
- Worse cross-domain than baseline: 46.54% vs 45.83% MLS WER
- Most training steps: 860 max steps (100% more than baseline)
- Largest dataset: 43,834 samples, double the baseline's 21,866
Complete Portuguese Tiny Model Rankings
| Rank | Configuration | Test WER (CV) | Test WER (MLS) | Recommendation |
|---|---|---|---|---|
| 1 | High-Quality (q≥0.8) | 29.33% | 44.18% | Best choice |
| 2 | Unfiltered (this) | 29.84% | 46.54% | Research only |
| 3 | Mid-High (q≥0.5) | 30.11% | 47.25% | Not recommended |
| 4 | CV Only | 30.72% | 45.83% | Simple baseline |
Unexpected finding: Unfiltered data ranks #2, outperforming mid-high filtering. This suggests that for Tiny models, either strict filtering (q ≥ 0.8) or no filtering works better than moderate filtering.
Non-Monotonic Quality-Performance Relationship
For Tiny models, the relationship between data quality threshold and performance is non-monotonic:
Performance (CV WER):
Best → High-Quality (q≥0.8): 29.33%
#2 → Unfiltered: 29.84%
#3 → Mid-High (q≥0.5): 30.11%
Worst → CV Only: 30.72%
Interpretation:
- Strict filtering removes problematic samples effectively (high-quality is best)
- No filtering allows the model to see all patterns, including occasional useful low-quality samples
- Moderate filtering (q ≥ 0.5) may remove some useful borderline samples while keeping too much noise
This phenomenon doesn't occur with Large-v3, where mid-high filtering is optimal, suggesting it's specific to limited-capacity architectures.
Training Data
Dataset Composition
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript PT (all) | 21,968 | Complete TTS audio without filtering |
| Total | 43,834 | Combined training set |
WAVe Quality Distribution (For Reference)
| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 7,312 | 33.3% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 11,869 | 54.0% | ✓ |
| Low (q < 0.5) | 2,787 | 12.7% | ✓ |
| Total | 21,968 | 100% | All used |
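This model skips the filtering step entirely, but the threshold variants compared above amount to a simple filter over WAVe quality scores. A minimal sketch follows; the field name `quality` and the sample structure are assumptions, and the actual dataset schema may differ:

```python
def select_by_quality(samples, threshold=None):
    """Return samples whose WAVe quality score meets the threshold.

    threshold=None reproduces the unfiltered setup used by this model;
    0.8 and 0.5 reproduce the high-quality and mid-high variants.
    """
    if threshold is None:
        return list(samples)
    return [s for s in samples if s["quality"] >= threshold]

# Toy samples with hypothetical quality scores
synthetic = [
    {"audio": "a.wav", "quality": 0.91},
    {"audio": "b.wav", "quality": 0.63},
    {"audio": "c.wav", "quality": 0.42},
]

print(len(select_by_quality(synthetic)))       # unfiltered: 3
print(len(select_by_quality(synthetic, 0.8)))  # high-quality: 1
print(len(select_by_quality(synthetic, 0.5)))  # mid-high: 2
```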
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 5e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
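A hedged sketch of how these hyperparameters map onto Hugging Face `Seq2SeqTrainingArguments`; the output directory, save schedule, and single-device batch layout are assumptions not stated in the card:

```python
from transformers import Seq2SeqTrainingArguments

# Config fragment mirroring the hyperparameter table above.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-tiny-cv-full-synthetic-pt",  # assumed path
    learning_rate=5e-5,
    per_device_train_batch_size=256,  # global batch 256; single-GPU layout assumed
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,                     # assumed; must align with eval_steps
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    load_best_model_at_end=True,
)
```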
Training Infrastructure
- GPU: NVIDIA H200 (140GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
Usage
Transcription Pipeline
```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-tiny-cv-full-synthetic-pt",
    device="cuda",
)
result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
```
Direct Model Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-tiny-cv-full-synthetic-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-tiny-cv-full-synthetic-pt")
model.to("cuda")

audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Specifying Language
```python
# Pin decoding to Portuguese transcription (skips language auto-detection)
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
```
When to Use This Model
Generally not recommended for production. This model is useful for:
- Research purposes: Understanding non-monotonic quality-performance relationships
- Ablation studies: Complete picture of synthetic data effects on Tiny
- When filtering is unavailable: If you must use unfiltered synthetic data with Tiny
For production use:
- whisper-tiny-high-mixed-pt: Best Tiny (29.33% WER)
- whisper-tiny-cv-only-pt: Simplest setup
- whisper-large-v3-high-mixed-pt: Best overall (7.94% WER)
Research Conclusions
This model completes our analysis of synthetic data augmentation for Portuguese Tiny ASR:
Key Findings:
- High-quality filtering is optimal for Tiny: Only q ≥ 0.8 provides consistent improvement
- Unfiltered outperforms mid-high: Non-monotonic relationship specific to small models
- Cross-domain still degrades: Even best Tiny config (44.18% MLS) trails baseline Large-v3 (22.43%)
- Model capacity fundamentally limits benefit: at most 1.39 percentage points of WER improvement regardless of data strategy
Practical Recommendations:
| Scenario | Recommendation |
|---|---|
| Must use Tiny | Use high-quality filtered (q ≥ 0.8) |
| No filtering available | Unfiltered is second-best for Tiny |
| Cross-domain critical | Upgrade to Large-v3 |
| Simplest setup | CV-only baseline (only 1.39 percentage points behind the best) |
Limitations
- Worse cross-domain than baseline: 46.54% vs 45.83% MLS WER
- Not the best configuration: High-quality filtering is better
- 100% more training steps: Doubled compute for suboptimal results
- Limited architectural capacity: Cannot fully leverage data volume
Citation
This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:
```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```
References
- Base Model: openai/whisper-tiny
- Training Data (Real): mozilla-foundation/common_voice_17_0
- Training Data (Synthetic): yuriyvnv/synthetic_transcript_pt
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- Motivating Research: Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)
License
Apache 2.0