Model Card for shooding/faster-whisper-large-v3-zh-TW

A CTranslate2 float16 fine-tune of openai/whisper-large-v3 for Traditional Chinese (zh-TW) / Taiwanese Mandarin ASR. Fine-tuned with LoRA via Unsloth on the adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw dataset, then converted to faster-whisper format for efficient CPU and GPU inference. Source code: https://github.com/shooding/taiwan-finetune

Model Details

Model Description

This model adapts Whisper large-v3 to Taiwanese Mandarin speech using parameter-efficient LoRA fine-tuning (only ~2% of parameters trained). The resulting merged checkpoint is converted to CTranslate2 float16 for use with the faster-whisper library, providing significantly faster inference with lower memory footprint compared to the original PyTorch model.

  • Developed by: shooding
  • Model type: Encoder-decoder speech transformer (Whisper architecture), CTranslate2 format
  • Language(s): Traditional Chinese (zh-TW), Taiwanese Mandarin
  • License: Apache 2.0
  • Fine-tuned from: openai/whisper-large-v3

Model Sources

  • Repository (training code): https://github.com/shooding/taiwan-finetune

Uses

Direct Use

Transcribing Taiwanese Mandarin (zh-TW) audio files. Suitable for real-time or batch transcription pipelines using the faster-whisper library.

Downstream Use

Can be integrated into voice assistants, subtitle generation tools, meeting transcription services, or any pipeline targeting Traditional Chinese speech recognition.
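For the subtitle-generation use case, faster-whisper segments can be rendered into SRT format. A minimal sketch; the `srt_timestamp` and `to_srt` helper names are illustrative and not part of the faster-whisper API (segments are only assumed to carry the `.start`, `.end`, and `.text` attributes shown in the usage example below):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> '00:00:03,500'."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render segment objects (with .start, .end, .text) as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)
```

The iterator returned by `model.transcribe(...)` can be passed directly to `to_srt`.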

Out-of-Scope Use

  • Cantonese or Mainland Mandarin: performance may degrade due to accent and vocabulary differences
  • Noisy or far-field audio: not tested under these conditions
  • Languages other than Chinese: the model is constrained to language=zh

How to Get Started with the Model

GPU (recommended):

from faster_whisper import WhisperModel

model = WhisperModel(
    "shooding/faster-whisper-large-v3-zh-TW",
    device="cuda",
    compute_type="float16",
)

segments, info = model.transcribe("audio.wav", language="zh", task="transcribe")
for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")

CPU (int8 quantization):

model = WhisperModel(
    "shooding/faster-whisper-large-v3-zh-TW",
    device="cpu",
    compute_type="int8",
)

Training Details

Training Data

Dataset: adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw

A Taiwanese Mandarin speech corpus from the Taiwanese government, covering diverse speakers and domains. Loaded in streaming mode to avoid downloading the full corpus to local disk.

Training Procedure

Build Pipeline

1. Load openai/whisper-large-v3 via unsloth.FastModel, which auto-patches the conv1d fp16 type-mismatch bug (RuntimeError: Input type (float) and bias type (c10::Half))
2. Apply LoRA adapters (r=64, alpha=64, target modules: q_proj, v_proj)
3. Set generation_config: language=zh, task=transcribe, forced_decoder_ids=None
4. Fine-tune with Seq2SeqTrainer on the streaming dataset
5. Merge LoRA into the base weights via save_pretrained_merged (merged_16bit)
6. Convert to CTranslate2: ct2-transformers-converter --quantization float16
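Step 6 is a single CLI call. A sketch of the conversion command, where merged_model and ct2_model are placeholder directory names (the converter ships with the ctranslate2 Python package):

```shell
# Convert the merged Hugging Face checkpoint to CTranslate2 with float16 weights.
ct2-transformers-converter \
  --model merged_model \
  --output_dir ct2_model \
  --quantization float16
```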

LoRA Configuration

- r: 64
- lora_alpha: 64
- target_modules: q_proj, v_proj
- lora_dropout: 0
- bias: none
- task_type: None (required for Whisper)
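The configuration above corresponds to the keyword arguments typically passed to PEFT's LoraConfig (Unsloth wraps PEFT internally). A sketch shown as a plain dict, so the values can be inspected without importing any library:

```python
# LoRA configuration from the table above, as LoraConfig-style kwargs.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 64,
    "target_modules": ["q_proj", "v_proj"],
    "lora_dropout": 0.0,
    "bias": "none",
    "task_type": None,  # must stay None for Whisper (not a causal-LM task)
}

# With alpha == r, the LoRA scaling factor (alpha / r) is exactly 1.0,
# so adapter updates are applied at full strength.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
```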

Training Hyperparameters

- Training regime: fp16 mixed precision (T4) / bf16 (A100+)
- max_steps: 2000
- per_device_train_batch_size: 4
- gradient_accumulation_steps: 4 (effective batch size = 16)
- learning_rate: 1e-4
- warmup_steps: 100
- lr_scheduler_type: cosine
- optimizer: adamw_8bit (Unsloth)
- weight_decay: 0.001
- eval_steps / save_steps: 200
- metric for best model: CER (lower is better)
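The derived quantities among the hyperparameters above follow from simple arithmetic; a quick sanity check, assuming a single GPU as in the T4 setup:

```python
per_device_batch = 4
grad_accum = 4
max_steps = 2000

# Effective batch size per optimizer step (single GPU).
effective_batch = per_device_batch * grad_accum   # 16, as noted above

# Total training examples consumed over the full run.
examples_seen = effective_batch * max_steps
```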

Evaluation

Testing Data, Factors & Metrics

Testing Data

Held-out split of adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw (200 samples).

Metrics

CER (Character Error Rate): edit distance between predicted and reference character sequences divided by reference length. Lower is
better. Chinese has no word boundaries, making CER more appropriate than WER.
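CER as defined above can be computed with a standard dynamic-programming edit distance. A minimal pure-Python reference implementation (in practice, libraries such as jiwer or evaluate are commonly used instead):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[n] / m if m else float(n > 0)

# Example: one substituted character out of four -> CER = 0.25
```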

Note: Training-time CER is computed via argmax over decoder logits (greedy decoding). Production inference with beam search will
typically yield lower CER.

Results

The model improves CER on Taiwanese Mandarin compared to the untuned whisper-large-v3 baseline. Exact numbers on a standardized zh-TW benchmark are planned for a future update.

Summary

Improved zh-TW ASR accuracy vs. baseline Whisper large-v3, trainable on a single T4 GPU via LoRA.

Environmental Impact

- Hardware Type: Google Colab T4 GPU (16 GB VRAM)
- Hours used: ~2-4 hours
- Cloud Provider: Google (Colab)
- Compute Region: Not specified
- Carbon Emitted: Not measured. Use the ML CO2 Impact calculator for an estimate.

Technical Specifications

Model Architecture and Objective

Whisper large-v3 encoder-decoder transformer (~1.5B parameters). Fine-tuned with cross-entropy on (mel spectrogram, Chinese
transcript) pairs. LoRA keeps only ~2% of parameters trainable, reducing VRAM by 50%+.
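The ~2% figure can be sanity-checked with back-of-the-envelope arithmetic. The count below assumes LoRA on q_proj and v_proj in every attention block of whisper-large-v3 (32 encoder layers with self-attention, 32 decoder layers with self- and cross-attention, d_model = 1280); the exact set of patched modules may differ in Unsloth's implementation:

```python
d_model = 1280
r = 64

# Attention blocks: 32 encoder self-attn + 32 decoder self-attn + 32 decoder cross-attn.
attn_blocks = 32 + 32 * 2
lora_modules = attn_blocks * 2                 # q_proj and v_proj per block

# Each LoRA module adds A (d_model x r) and B (r x d_model) matrices.
params_per_module = r * (d_model + d_model)
lora_params = lora_modules * params_per_module

# whisper-large-v3 has roughly 1.54B parameters.
fraction = lora_params / 1.54e9                # ~0.02, i.e. ~2% trainable
```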

Compute Infrastructure

Hardware

Google Colab T4 GPU (16 GB VRAM). Unsloth's LoRA + adamw_8bit optimizer fit the large-v3 model within T4 budget.

Software

- Unsloth: FastModel, LoRA, gradient checkpointing
- Transformers 4.56.2
- TRL 0.22.2
- CTranslate2: float16 conversion
- faster-whisper: inference

Citation

BibTeX:

@misc{shooding2026fasterwhisper_zhtw,
  author       = {shooding},
  title        = {faster-whisper-large-v3-zh-TW: LoRA fine-tune of Whisper large-v3 for Taiwanese Mandarin},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/shooding/faster-whisper-large-v3-zh-TW}},
}

Glossary

- CER: Character Error Rate; edit distance divided by reference length. Standard metric for Chinese ASR.
- LoRA: Low-Rank Adaptation; trains injected rank-decomposition matrices, leaving base weights frozen.
- CTranslate2: Fast transformer inference engine supporting quantization and optimized CUDA/CPU kernels.
- faster-whisper: Whisper reimplemented with CTranslate2; typically around 4× faster with lower memory usage.

Model Card Authors

shooding

Model Card Contact

Open an issue on the model repository.