# Whisper-Tiny Portuguese - Full Synthetic Data (Unfiltered)

This model is a fine-tuned version of openai/whisper-tiny for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with all synthetic speech data without quality filtering, representing the maximum data augmentation approach.

## Purpose

This model completes the evaluation of synthetic data augmentation strategies for Whisper-Tiny Portuguese by testing whether maximum data volume can compensate for architectural limitations. The results provide important insights:

**Key Finding:** Unfiltered synthetic data yields better in-domain performance (29.84% WER) than mid-high filtering (30.11%) but worse than high-quality filtering (29.33%), revealing a non-monotonic relationship between data volume and performance for Tiny models.

| Configuration | Synthetic Samples | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|
| CV Only | 0 | 30.72% | 45.83% |
| High-Quality (best) | 7,312 | 29.33% | 44.18% |
| Unfiltered (this) | 21,968 | 29.84% | 46.54% |
| Mid-High | 19,181 | 30.11% | 47.25% |
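
All comparisons above use word error rate. For reference, a minimal, self-contained sketch of word-level WER (word edit distance divided by reference length) is shown below; the example strings are illustrative only, and production evaluation would typically use a library such as `jiwer` plus text normalization.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("esta" for "está") out of five reference words.
print(wer("o gato está no telhado", "o gato esta no telhado"))  # 0.2
```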

Interestingly, including all data (even low-quality) outperforms the mid-high threshold, suggesting complex interactions between data quality and model capacity.

## Model Details

| Property | Value |
|---|---|
| Base Model | openai/whisper-tiny |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 39M |
| Training Data | Common Voice 17.0 + all synthetic (unfiltered) |
| Total Training Samples | 43,834 |
| Sampling Rate | 16 kHz |

## Evaluation Results

### This Model (whisper-tiny-cv-full-synthetic-pt)

| Metric | Value |
|---|---|
| Validation Loss | 0.4517 |
| Validation WER | 28.06% |
| Test WER (Common Voice) | 29.84% |
| Test WER (MLS) | 46.54% |
| Best Checkpoint Step | 500 |
| Max Training Steps | 860 |

### Comparison with Other Training Configurations (Whisper-Tiny Portuguese)

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 430 | 0.4463 | 27.05% | 30.72% | 45.83% |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.4481 | 26.74% | 29.33% | 44.18% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.4550 | 26.95% | 30.11% | 47.25% |
| All Synthetic + CV (Unfiltered) | 860 | 0.4517 | 28.06% | 29.84% | 46.54% |

### Key Performance Characteristics

- **Second-best in-domain:** 29.84% test WER, only 0.51 percentage points behind the best configuration
- **Better than mid-high:** outperforms the q ≥ 0.5 threshold despite including low-quality data
- **Worse cross-domain than baseline:** 46.54% vs 45.83% MLS WER
- **Most training steps:** 860 max steps, double the CV-only baseline's 430
- **Largest dataset:** 43,834 samples, roughly double the baseline's 21,866

### Complete Portuguese Tiny Model Rankings

| Rank | Configuration | Test WER (CV) | Test WER (MLS) | Recommendation |
|---|---|---|---|---|
| 1 | High-Quality (q ≥ 0.8) | 29.33% | 44.18% | Best choice |
| 2 | Unfiltered (this) | 29.84% | 46.54% | Research only |
| 3 | Mid-High (q ≥ 0.5) | 30.11% | 47.25% | Not recommended |
| 4 | CV Only | 30.72% | 45.83% | Simple baseline |

**Unexpected finding:** Unfiltered data ranks #2, outperforming mid-high filtering. This suggests that for Tiny models, either strict filtering (q ≥ 0.8) or no filtering works better than moderate filtering.

### Non-Monotonic Quality-Performance Relationship

For Tiny models, the relationship between data quality threshold and performance is non-monotonic:

```
Performance (CV WER):
  Best  → High-Quality (q ≥ 0.8): 29.33%
  #2    → Unfiltered:             29.84%
  #3    → Mid-High (q ≥ 0.5):     30.11%
  Worst → CV Only:                30.72%
```

Interpretation:

- Strict filtering removes problematic samples effectively (high-quality is best)
- No filtering lets the model see all patterns, including occasionally useful low-quality samples
- Moderate filtering (q ≥ 0.5) may remove some useful borderline samples while still keeping too much noise

This phenomenon doesn't occur with Large-v3, where mid-high filtering is optimal, suggesting it's specific to limited-capacity architectures.

## Training Data

### Dataset Composition

| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript PT (all) | 21,968 | Complete TTS audio without filtering |
| **Total** | **43,834** | |

### WAVe Quality Distribution (For Reference)

| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 7,312 | 33.3% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 11,869 | 54.0% | ✓ |
| Low (q < 0.5) | 2,787 | 12.7% | ✓ |
| **Total** | **21,968** | **100%** | All used |
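
The four training configurations differ only in which WAVe threshold is applied to the synthetic pool before mixing with Common Voice. A minimal sketch of that selection step (the `quality` field and file names here are hypothetical, not the actual dataset schema):

```python
# Hypothetical records: each synthetic sample carries a WAVe quality score.
samples = [
    {"audio": "s1.wav", "quality": 0.91},
    {"audio": "s2.wav", "quality": 0.62},
    {"audio": "s3.wav", "quality": 0.34},
]

def select(samples, threshold):
    """Keep samples at or above a WAVe quality threshold (None = unfiltered)."""
    if threshold is None:
        return list(samples)
    return [s for s in samples if s["quality"] >= threshold]

print(len(select(samples, 0.8)))   # 1 -> high-quality split
print(len(select(samples, 0.5)))   # 2 -> mid-high split
print(len(select(samples, None)))  # 3 -> unfiltered (this model)
```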

## Training Procedure

### Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 5e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
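
The table above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a sketch, not the actual training script: any argument not listed in the table (`output_dir`, the exact optimizer string, the save strategy) is an assumption.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the hyperparameters above; unlisted values are assumptions.
args = Seq2SeqTrainingArguments(
    output_dir="whisper-tiny-cv-full-synthetic-pt",  # assumed
    learning_rate=5e-5,
    per_device_train_batch_size=256,  # global batch size 256 on a single GPU
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",  # older Transformers versions: evaluation_strategy
    eval_steps=50,
    save_strategy="steps",  # must match eval strategy for best-model loading
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```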

### Training Infrastructure

- **GPU:** NVIDIA H200 (140 GB VRAM)
- **Operating System:** Ubuntu 22.04
- **Framework:** Hugging Face Transformers

## Usage

### Transcription Pipeline

```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-tiny-cv-full-synthetic-pt",
    device="cuda",
)

result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
```

### Direct Model Usage

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-tiny-cv-full-synthetic-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-tiny-cv-full-synthetic-pt")
model.to("cuda")

# The model expects 16 kHz audio; resample on load.
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

### Specifying Language

To force Portuguese transcription, set the generation config before calling `generate`:

```python
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
```

Recent versions of Transformers also accept `language` and `task` as keyword arguments to `model.generate` directly.

## When to Use This Model

**Generally not recommended for production.** This model is useful for:

- **Research purposes:** understanding non-monotonic quality-performance relationships
- **Ablation studies:** a complete picture of synthetic data effects on Tiny
- **When filtering is unavailable:** if you must use unfiltered synthetic data with Tiny

For production use, prefer the high-quality filtered model (q ≥ 0.8), or upgrade to Large-v3 when cross-domain robustness is critical.

## Research Conclusions

This model completes our analysis of synthetic data augmentation for Portuguese Tiny ASR:

**Key Findings:**

1. **High-quality filtering is optimal for Tiny:** only q ≥ 0.8 provides a consistent improvement
2. **Unfiltered outperforms mid-high:** a non-monotonic relationship specific to small models
3. **Cross-domain still degrades:** even the best Tiny configuration (44.18% MLS) trails the baseline Large-v3 (22.43%)
4. **Model capacity fundamentally limits benefit:** at most a 1.39 percentage point improvement regardless of data strategy

**Practical Recommendations:**

| Scenario | Recommendation |
|---|---|
| Must use Tiny | Use high-quality filtered (q ≥ 0.8) |
| No filtering available | Unfiltered is second-best for Tiny |
| Cross-domain critical | Upgrade to Large-v3 |
| Simplest setup | CV-only baseline (only 1.39 points worse than best) |

## Limitations

- **Worse cross-domain than baseline:** 46.54% vs 45.83% MLS WER
- **Not the best configuration:** high-quality filtering performs better
- **Doubled training steps:** twice the baseline's compute for suboptimal results
- **Limited architectural capacity:** cannot fully leverage the added data volume

## Citation

This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:

```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```

## License

Apache 2.0
