# Whisper-Tiny Portuguese - Full Synthetic Data (Unfiltered)

This model is a fine-tuned version of openai/whisper-tiny for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with all synthetic speech data without quality filtering, representing the maximum data augmentation approach.

## Purpose

This model completes the evaluation of synthetic data augmentation strategies for Whisper-Tiny Portuguese by testing whether maximum data volume can compensate for architectural limitations. The results provide important insights:

**Key Finding:** Unfiltered synthetic data yields better in-domain performance (29.84% WER) than mid-high filtering (30.11%) but worse than high-quality filtering (29.33%), revealing a non-monotonic relationship between data volume and performance for Tiny models.

| Configuration | Synthetic Samples | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|
| CV Only | 0 | 30.72% | 45.83% |
| High-Quality (best) | 7,312 | 29.33% | 44.18% |
| Unfiltered (this) | 21,968 | 29.84% | 46.54% |
| Mid-High | 19,181 | 30.11% | 47.25% |
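
All comparisons above use word error rate. For reference, a minimal, self-contained sketch of word-level WER (word edit distance divided by reference length) is shown below; the example strings are illustrative only, and production evaluation would typically use a library such as `jiwer` plus text normalization.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("esta" for "está") out of five reference words.
print(wer("o gato está no telhado", "o gato esta no telhado"))  # 0.2
```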

Interestingly, including all data (even low-quality) outperforms the mid-high threshold, suggesting complex interactions between data quality and model capacity.

## Model Details

| Property | Value |
|---|---|
| Base Model | openai/whisper-tiny |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 39M |
| Training Data | Common Voice 17.0 + all synthetic (unfiltered) |
| Total Training Samples | 43,834 |
| Sampling Rate | 16 kHz |

## Evaluation Results

### This Model (whisper-tiny-cv-full-synthetic-pt)

| Metric | Value |
|---|---|
| Validation Loss | 0.4517 |
| Validation WER | 28.06% |
| Test WER (Common Voice) | 29.84% |
| Test WER (MLS) | 46.54% |
| Best Checkpoint Step | 500 |
| Max Training Steps | 860 |

### Comparison with Other Training Configurations (Whisper-Tiny Portuguese)

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 430 | 0.4463 | 27.05% | 30.72% | 45.83% |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.4481 | 26.74% | 29.33% | 44.18% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.4550 | 26.95% | 30.11% | 47.25% |
| All Synthetic + CV (Unfiltered) | 860 | 0.4517 | 28.06% | 29.84% | 46.54% |

### Key Performance Characteristics

- **Second-best in-domain:** 29.84% test WER, only 0.51 percentage points behind the best configuration
- **Better than mid-high:** outperforms the q ≥ 0.5 threshold despite including low-quality data
- **Worse cross-domain than baseline:** 46.54% vs 45.83% MLS WER
- **Most training steps:** 860 max steps, double the CV-only baseline's 430
- **Largest dataset:** 43,834 samples, roughly double the baseline's 21,866

### Complete Portuguese Tiny Model Rankings

| Rank | Configuration | Test WER (CV) | Test WER (MLS) | Recommendation |
|---|---|---|---|---|
| 1 | High-Quality (q ≥ 0.8) | 29.33% | 44.18% | Best choice |
| 2 | Unfiltered (this) | 29.84% | 46.54% | Research only |
| 3 | Mid-High (q ≥ 0.5) | 30.11% | 47.25% | Not recommended |
| 4 | CV Only | 30.72% | 45.83% | Simple baseline |

**Unexpected finding:** Unfiltered data ranks #2, outperforming mid-high filtering. This suggests that for Tiny models, either strict filtering (q ≥ 0.8) or no filtering works better than moderate filtering.

### Non-Monotonic Quality-Performance Relationship

For Tiny models, the relationship between data quality threshold and performance is non-monotonic:

```
Performance (CV WER):
  Best  → High-Quality (q ≥ 0.8): 29.33%
  #2    → Unfiltered:             29.84%
  #3    → Mid-High (q ≥ 0.5):     30.11%
  Worst → CV Only:                30.72%
```

Interpretation:

- Strict filtering removes problematic samples effectively (high-quality is best)
- No filtering lets the model see all patterns, including occasionally useful low-quality samples
- Moderate filtering (q ≥ 0.5) may remove some useful borderline samples while still keeping too much noise

This phenomenon doesn't occur with Large-v3, where mid-high filtering is optimal, suggesting it's specific to limited-capacity architectures.

## Training Data

### Dataset Composition

| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript PT (all) | 21,968 | Complete TTS audio without filtering |
| **Total** | **43,834** | |

### WAVe Quality Distribution (For Reference)

| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 7,312 | 33.3% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 11,869 | 54.0% | ✓ |
| Low (q < 0.5) | 2,787 | 12.7% | ✓ |
| **Total** | **21,968** | **100%** | All used |
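
The four training configurations differ only in which WAVe threshold is applied to the synthetic pool before mixing with Common Voice. A minimal sketch of that selection step (the `quality` field and file names here are hypothetical, not the actual dataset schema):

```python
# Hypothetical records: each synthetic sample carries a WAVe quality score.
samples = [
    {"audio": "s1.wav", "quality": 0.91},
    {"audio": "s2.wav", "quality": 0.62},
    {"audio": "s3.wav", "quality": 0.34},
]

def select(samples, threshold):
    """Keep samples at or above a WAVe quality threshold (None = unfiltered)."""
    if threshold is None:
        return list(samples)
    return [s for s in samples if s["quality"] >= threshold]

print(len(select(samples, 0.8)))   # 1 -> high-quality split
print(len(select(samples, 0.5)))   # 2 -> mid-high split
print(len(select(samples, None)))  # 3 -> unfiltered (this model)
```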

## Training Procedure

### Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 5e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
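
The table above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a sketch, not the actual training script: any argument not listed in the table (`output_dir`, the exact optimizer string, the save strategy) is an assumption.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the hyperparameters above; unlisted values are assumptions.
args = Seq2SeqTrainingArguments(
    output_dir="whisper-tiny-cv-full-synthetic-pt",  # assumed
    learning_rate=5e-5,
    per_device_train_batch_size=256,  # global batch size 256 on a single GPU
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",  # older Transformers versions: evaluation_strategy
    eval_steps=50,
    save_strategy="steps",  # must match eval strategy for best-model loading
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```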

### Training Infrastructure

- **GPU:** NVIDIA H200 (140 GB VRAM)
- **Operating System:** Ubuntu 22.04
- **Framework:** Hugging Face Transformers

## Usage

### Transcription Pipeline

```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-tiny-cv-full-synthetic-pt",
    device="cuda",
)

result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
```

### Direct Model Usage

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-tiny-cv-full-synthetic-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-tiny-cv-full-synthetic-pt")
model.to("cuda")

# The model expects 16 kHz audio; resample on load.
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

### Specifying Language

To force Portuguese transcription, set the generation config before calling `generate`:

```python
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
```

Recent versions of Transformers also accept `language` and `task` as keyword arguments to `model.generate` directly.

## When to Use This Model

**Generally not recommended for production.** This model is useful for:

- **Research purposes:** understanding non-monotonic quality-performance relationships
- **Ablation studies:** a complete picture of synthetic data effects on Tiny
- **When filtering is unavailable:** if you must use unfiltered synthetic data with Tiny

For production use, prefer the high-quality filtered model (q ≥ 0.8), or upgrade to Large-v3 when cross-domain robustness is critical.

## Research Conclusions

This model completes our analysis of synthetic data augmentation for Portuguese Tiny ASR:

**Key Findings:**

1. **High-quality filtering is optimal for Tiny:** only q ≥ 0.8 provides a consistent improvement
2. **Unfiltered outperforms mid-high:** a non-monotonic relationship specific to small models
3. **Cross-domain still degrades:** even the best Tiny configuration (44.18% MLS) trails the baseline Large-v3 (22.43%)
4. **Model capacity fundamentally limits benefit:** at most a 1.39 percentage point improvement regardless of data strategy

**Practical Recommendations:**

| Scenario | Recommendation |
|---|---|
| Must use Tiny | Use high-quality filtered (q ≥ 0.8) |
| No filtering available | Unfiltered is second-best for Tiny |
| Cross-domain critical | Upgrade to Large-v3 |
| Simplest setup | CV-only baseline (only 1.39 points worse than best) |

## Limitations

- **Worse cross-domain than baseline:** 46.54% vs 45.83% MLS WER
- **Not the best configuration:** high-quality filtering performs better
- **Doubled training steps:** twice the baseline's compute for suboptimal results
- **Limited architectural capacity:** cannot fully leverage the added data volume

## Citation

This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:

```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```

## License

Apache 2.0
