Model Card for shooding/faster-whisper-large-v3-zh-TW
A LoRA fine-tune of openai/whisper-large-v3 for Traditional Chinese (zh-TW) / Taiwanese Mandarin ASR, trained via Unsloth on the adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw dataset and converted to CTranslate2 float16 (faster-whisper format) for efficient CPU and GPU inference. Source code: https://github.com/shooding/taiwan-finetune
Model Details
Model Description
This model adapts Whisper large-v3 to Taiwanese Mandarin speech using parameter-efficient LoRA fine-tuning (only ~2% of parameters
trained). The resulting merged checkpoint is converted to CTranslate2 float16 for use with the faster-whisper library, providing
significantly faster inference with lower memory footprint compared to the original PyTorch model.
- Developed by: shooding
- Model type: Encoder-decoder speech transformer (Whisper architecture), CTranslate2 format
- Language(s): Traditional Chinese (zh-TW), Taiwanese Mandarin
- License: Apache 2.0
- Fine-tuned from: openai/whisper-large-v3
Model Sources
- Repository: https://huggingface.co/shooding/faster-whisper-large-v3-zh-TW
- Paper: N/A
- Demo: N/A
Uses
Direct Use
Transcribing Taiwanese Mandarin (zh-TW) audio files. Suitable for real-time or batch transcription pipelines using the
faster-whisper library.
Downstream Use
Can be integrated into voice assistants, subtitle generation tools, meeting transcription services, or any pipeline targeting Traditional Chinese speech recognition.
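For example, a minimal subtitle-generation sketch built on faster-whisper (the file names and the srt_timestamp helper are hypothetical, not part of this repository):

from faster_whisper import WhisperModel

def srt_timestamp(seconds: float) -> str:
    # Format seconds as an SRT timestamp, e.g. 00:01:02,345
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

model = WhisperModel("shooding/faster-whisper-large-v3-zh-TW", device="cuda", compute_type="float16")
segments, _ = model.transcribe("meeting.wav", language="zh", task="transcribe")

with open("meeting.srt", "w", encoding="utf-8") as srt:
    for index, seg in enumerate(segments, start=1):
        srt.write(f"{index}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n{seg.text.strip()}\n\n")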
Out-of-Scope Use
- Cantonese or Mainland Mandarin: performance may degrade due to accent and vocabulary differences
- Noisy or far-field audio: not tested under these conditions
- Languages other than Chinese: decoding is constrained to language=zh
How to Get Started with the Model
GPU (recommended):
from faster_whisper import WhisperModel

model = WhisperModel(
    "shooding/faster-whisper-large-v3-zh-TW",
    device="cuda",
    compute_type="float16",
)

segments, info = model.transcribe("audio.wav", language="zh", task="transcribe")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
CPU (int8 quantization):
model = WhisperModel(
    "shooding/faster-whisper-large-v3-zh-TW",
    device="cpu",
    compute_type="int8",
)
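Beyond the defaults shown above, transcribe accepts additional decoding options; for example, an explicit beam size and the built-in voice activity detection filter (both are standard faster-whisper arguments, not specific to this model):

segments, info = model.transcribe(
    "audio.wav",
    language="zh",
    task="transcribe",
    beam_size=5,       # explicit beam search width
    vad_filter=True,   # skip long silences using the built-in Silero VAD
)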
Training Details
Training Data
Dataset: adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw
A Taiwanese Mandarin speech corpus published by the Taiwan government, covering diverse speakers and domains. The dataset is loaded in streaming mode during training to avoid downloading the full corpus to local disk.
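For reference, a minimal sketch of streaming the dataset with the Hugging Face datasets library (the split name is an assumption; check the dataset card for the actual splits):

from datasets import load_dataset

# streaming=True yields examples lazily instead of downloading the whole corpus
train_stream = load_dataset(
    "adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw",
    split="train",       # assumed split name
    streaming=True,
)
sample = next(iter(train_stream))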
Training Procedure
Build Pipeline
1. Load openai/whisper-large-v3 via unsloth.FastModel, which auto-patches the conv1d fp16 type-mismatch bug (RuntimeError: Input type (float) and bias type (c10::Half))
2. Apply LoRA adapters (r=64, α=64, target modules: q_proj, v_proj)
3. Set generation_config: language=zh, task=transcribe, forced_decoder_ids=None
4. Fine-tune with Seq2SeqTrainer on the streaming dataset
5. Merge the LoRA adapters into the full model via save_pretrained_merged (merged_16bit)
6. Convert to CTranslate2: ct2-transformers-converter --quantization float16
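A condensed sketch of steps 1-3 (illustrative rather than the exact training script; Unsloth API names and signatures may differ across versions, and the gradient-checkpointing flag is an assumption):

from unsloth import FastModel

# 1. Load the base model; Unsloth patches the conv1d fp16 type mismatch on load
model, tokenizer = FastModel.from_pretrained("openai/whisper-large-v3")

# 2. Attach LoRA adapters matching the configuration listed below
model = FastModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0,
    bias="none",
    task_type=None,                        # required for Whisper, per the notes below
    use_gradient_checkpointing="unsloth",  # assumption: Unsloth's memory-saving checkpointing
)

# 3. Pin generation to Chinese transcription
model.generation_config.language = "zh"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None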
LoRA Configuration
- r: 64
- lora_alpha: 64
- target_modules: q_proj, v_proj
- lora_dropout: 0
- bias: none
- task_type: None (required for Whisper)
Training Hyperparameters
- Training regime: fp16 mixed precision (T4) / bf16 (A100+)
- max_steps: 2000
- per_device_train_batch_size: 4
- gradient_accumulation_steps: 4 (effective batch size 16)
- learning_rate: 1e-4
- warmup_steps: 100
- lr_scheduler_type: cosine
- optimizer: adamw_8bit (Unsloth)
- weight_decay: 0.001
- eval_steps / save_steps: 200
- best model metric: CER (lower is better)
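For orientation, the hyperparameters above roughly translate into the following Seq2SeqTrainingArguments (a sketch; output_dir and the fp16/bf16 switch are assumptions, and the original script may set additional flags):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-zh-tw-lora",  # hypothetical output path
    max_steps=2000,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,             # effective batch size 16
    learning_rate=1e-4,
    warmup_steps=100,
    lr_scheduler_type="cosine",
    optim="adamw_8bit",
    weight_decay=0.001,
    eval_strategy="steps",
    eval_steps=200,
    save_steps=200,
    fp16=True,                                 # use bf16=True on A100-class GPUs instead
    metric_for_best_model="cer",
    greater_is_better=False,
    load_best_model_at_end=True,
)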
Evaluation
Testing Data, Factors & Metrics
Testing Data
Held-out split of adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw (200 samples).
Metrics
CER (Character Error Rate): edit distance between predicted and reference character sequences divided by reference length. Lower is
better. Chinese has no word boundaries, making CER more appropriate than WER.
Note: Training-time CER is computed via argmax over decoder logits (greedy decoding). Production inference with beam search will
typically yield lower CER.
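A minimal CER computation sketch using the Hugging Face evaluate library (the example strings are illustrative; the card does not specify which CER implementation was used during training):

import evaluate

cer_metric = evaluate.load("cer")
predictions = ["今天天氣很好"]   # model output (hypothetical)
references  = ["今天天氣真好"]   # ground-truth transcript (hypothetical)
# one substituted character out of six reference characters -> CER ≈ 0.167
print(cer_metric.compute(predictions=predictions, references=references))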
Results
The model lowers CER on Taiwanese Mandarin compared to the untuned whisper-large-v3 baseline. Exact numbers on a standardized zh-TW benchmark are planned for a future update.
Summary
Improved zh-TW ASR accuracy vs. baseline Whisper large-v3, trainable on a single T4 GPU via LoRA.
Environmental Impact
- Hardware Type: Google Colab T4 GPU (16 GB VRAM)
- Hours used: ~2-4 hours
- Cloud Provider: Google (Colab)
- Compute Region: Not specified
- Carbon Emitted: Not measured. Use the ML CO2 Impact calculator for an estimate.
Technical Specifications
Model Architecture and Objective
Whisper large-v3 encoder-decoder transformer (~1.5B parameters). Fine-tuned with cross-entropy on (mel spectrogram, Chinese
transcript) pairs. LoRA keeps only ~2% of parameters trainable, reducing VRAM by 50%+.
Compute Infrastructure
Hardware
Google Colab T4 GPU (16 GB VRAM). Unsloth's LoRA implementation plus the adamw_8bit optimizer fit whisper-large-v3 within the T4 memory budget.
Software
- Unsloth: FastModel, LoRA, gradient checkpointing
- Transformers 4.56.2
- TRL 0.22.2
- CTranslate2: float16 conversion
- faster-whisper: inference
Citation
BibTeX:
@misc{shooding2026fasterwhisper_zhtw,
  author       = {shooding},
  title        = {faster-whisper-large-v3-zh-TW: LoRA fine-tune of Whisper large-v3 for Taiwanese Mandarin},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/shooding/faster-whisper-large-v3-zh-TW}},
}
Glossary
- CER: Character Error Rate, computed as edit distance / reference length. Standard metric for Chinese ASR.
- LoRA: Low-Rank Adaptation, which trains injected rank-decomposition matrices while leaving the base weights frozen.
- CTranslate2: Fast transformer inference engine supporting quantization and optimized CUDA/CPU kernels.
- faster-whisper: Whisper reimplemented with CTranslate2; typically around 4× faster with lower memory usage.
Model Card Authors
shooding
Model Card Contact
Open an issue on the model repository.