Spark-TTS Arabic (Fine-tuned on ClArTTS)

Fine-tuned version of SparkAudio/Spark-TTS-0.5B specialized for Arabic text-to-speech synthesis. The LLM component was fine-tuned on a subset of the ClArTTS dataset (Classical Arabic Text-to-Speech corpus), which contains 12 hours of high-quality single-speaker recordings.

📋 Model Description

Spark-TTS is a neural text-to-speech system that combines a language model (Qwen2) with a neural audio codec (BiCodec) for high-quality speech synthesis. This version has been specifically optimized for Arabic through fine-tuning on Classical Arabic speech data.

Architecture Components:

  • LLM (Qwen2): Fine-tuned for Arabic text-to-semantic token generation
  • BiCodec: Neural audio codec for semantic-to-audio token conversion (unchanged)
  • wav2vec2-large-xlsr-53: Speech encoder for voice cloning (unchanged)

What Changed: Only the LLM component was fine-tuned. The audio tokenizer and speech encoder remain identical to the base model, ensuring compatibility with the original Spark-TTS architecture.

Key Features:

  • Voice cloning with 5-30 seconds of reference audio
  • Natural prosody and intonation for Classical Arabic
  • Single-speaker consistency
  • Controllable generation parameters

🎯 Intended Use

Direct Use

  • Arabic audiobook narration (Classical/MSA)
  • Voice-over for Arabic educational content
  • Accessibility tools for Arabic text
  • Voice cloning for Arabic speakers
  • Arabic language learning applications

Downstream Use

Can be further fine-tuned for:

  • Dialectal Arabic variants (Egyptian, Levantine, Gulf)
  • Domain-specific terminology (religious texts, literature)
  • Multi-speaker scenarios
  • Emotional or expressive speech

Out-of-Scope Use

Not recommended for:

  • Real-time speech synthesis (model is relatively slow)
  • Non-diacritized Arabic text (requires tashkeel)
  • Languages other than Arabic
  • Singing or non-speech audio generation

🚨 Very Important Note: This model requires the official Spark-TTS repository for inference. The model files alone are not sufficient; you must clone the Spark-TTS repo and use its inference pipeline.

🚀 How to Use

Installation

First, clone the official Spark-TTS repository (required for inference):

# Clone Spark-TTS
git clone https://github.com/SparkAudio/Spark-TTS
cd Spark-TTS

# Install dependencies
pip install transformers soundfile huggingface_hub omegaconf torch

Download Model

from huggingface_hub import snapshot_download

# Download the fine-tuned model
model_dir = snapshot_download(
    repo_id="azeddinShr/Spark-TTS-Arabic-Complete",
    local_dir="./arabic_model"
)
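
Before loading the model, it can help to confirm that the download is complete. The sketch below assumes the fine-tuned repository mirrors the folder layout of the base Spark-TTS-0.5B checkpoint (a config.yaml plus LLM, BiCodec, and wav2vec2-large-xlsr-53 folders); that layout and the helper name missing_model_entries are assumptions for illustration, so adjust the names if the downloaded folder differs.

```python
from pathlib import Path

# Assumed layout, based on the base Spark-TTS-0.5B checkpoint; this
# fine-tuned repository may differ, so edit the tuple if it does.
EXPECTED_ENTRIES = ("config.yaml", "LLM", "BiCodec", "wav2vec2-large-xlsr-53")


def missing_model_entries(model_dir, expected=EXPECTED_ENTRIES):
    """Return the expected files/folders that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_model_entries("./arabic_model")
    if missing:
        print("Missing from ./arabic_model:", ", ".join(missing))
    else:
        print("All expected entries found.")
```

An empty list means every expected entry was found; anything else points at a partial download or a different layout.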

Setup Inference Environment

import sys
import torch
import soundfile as sf

# Add the Spark-TTS CLI folder to the import path
# (run this from inside the cloned Spark-TTS directory)
sys.path.insert(0, './cli')

# Import SparkTTS class
from SparkTTS import SparkTTS

# Initialize device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Load Model

# Load the fine-tuned Arabic model
tts = SparkTTS("./arabic_model", device)
print("✅ Model loaded successfully!")

Basic Text-to-Speech

# Prepare input text (must include diacritics)
text = "مَرْحَبًا بِكُمْ فِي نَمُوذَجِ تَحْوِيلِ النَّصِّ إِلَى كَلَامٍ بِاللُّغَةِ الْعَرَبِيَّةِ."

# Reference audio and its transcript
reference_audio = "path/to/reference.wav"  # 5-30 seconds of clear Arabic speech
reference_text = "النَّصُّ الْمُطَابِقُ لِلصَّوْتِ الْمَرْجِعِيِّ"

# Generate speech
wav = tts.inference(
    text,
    prompt_speech_path=reference_audio,
    prompt_text=reference_text
)

# Save output
sf.write("output.wav", wav, samplerate=16000)
print("✅ Audio generated!")

Advanced Generation with Parameters

# Generate with custom parameters
wav = tts.inference(
    text,
    prompt_speech_path=reference_audio,
    prompt_text=reference_text,
    temperature=0.8,      # Controls randomness (0.1-1.5, default: 0.8)
    top_k=50,            # Top-k sampling (default: 50)
    top_p=0.95           # Nucleus sampling (default: 0.95)
)

sf.write("output_custom.wav", wav, samplerate=16000)
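
To make the three sampling parameters more concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p are conventionally combined when picking one token from a list of scores. It is an illustration of the general technique only, not the code Spark-TTS runs; the function name sample_token and the toy logits are hypothetical.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Sample one index from logits using temperature, top-k and top-p."""
    # Temperature rescales the logits: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring candidates, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    indices, kept_probs = zip(*kept)
    return rng.choices(indices, weights=kept_probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
    print(sample_token(toy_logits, temperature=0.8, top_k=3, top_p=0.9))
```

Lower temperature, smaller top_k, or smaller top_p all narrow the set of candidates, which tends to give more stable but less varied speech; raising them does the opposite.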

⚠️ Important Requirements

Input Text Requirements

  • Diacritization (Tashkeel) is REQUIRED
  • Text must include full Arabic diacritics (فَتْحَة، كَسْرَة، ضَمَّة، سُكُون، etc.)
  • Use AI tools (ChatGPT, Claude) or online diacritizers to add tashkeel

Example:

  • ❌ Bad: "مرحبا بكم في النموذج"
  • ✅ Good: "مَرْحَبًا بِكُمْ فِي النَّمُوذَجِ"
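
A quick way to catch undiacritized input before synthesis is to count tashkeel marks relative to Arabic letters. The sketch below is a rough heuristic under stated assumptions: the helper names and the 0.5 threshold are hypothetical, and a passing score only indicates that marks are present, not that they are correct.

```python
# Arabic tashkeel marks: fathatan (U+064B) through sukun (U+0652).
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def diacritic_ratio(text):
    """Return the number of tashkeel marks per Arabic letter in text."""
    # U+0621..U+064A covers the basic Arabic letters (hamza through yeh).
    letters = sum(1 for ch in text if "\u0621" <= ch <= "\u064A")
    marks = sum(1 for ch in text if ch in TASHKEEL)
    return marks / letters if letters else 0.0


def looks_diacritized(text, threshold=0.5):
    """Heuristic: True when text carries at least `threshold` marks per letter."""
    return diacritic_ratio(text) >= threshold
```

Text without any marks scores 0.0 and is rejected; fully diacritized text typically scores well above the threshold because most letters carry at least one mark.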

Reference Audio Requirements

  • Duration: 5-30 seconds of clear speech
  • Quality: Clean recording, minimal background noise
  • Speaker: Single speaker only
  • Language: Arabic (preferably MSA or Classical)
  • Format: WAV file recommended
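
The duration requirement can be checked up front with Python's standard-library wave module, as in the sketch below. The helper names are hypothetical, and the approach assumes an uncompressed PCM WAV file; other formats would need a library such as soundfile instead.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_audio(path, min_seconds=5.0, max_seconds=30.0):
    """Raise ValueError if the clip falls outside the recommended range."""
    duration = wav_duration_seconds(path)
    if not min_seconds <= duration <= max_seconds:
        raise ValueError(
            f"Reference audio is {duration:.1f}s; expected "
            f"{min_seconds:.0f}-{max_seconds:.0f}s"
        )
    return duration
```

Calling check_reference_audio on the reference file before inference surfaces clips that are too short or too long; it does not assess noise level or speaker count.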

Reference Transcript Requirements

  • Must match reference audio exactly
  • Must include full diacritics
  • Text alignment is critical for quality

📊 Training Details

Training Data

Dataset: MBZUAI/ClArTTS

  • Full dataset size: 12 hours, 9,500 utterances
  • Training subset: 30% (~2,850 utterances)
  • Speaker: Single male speaker
  • Language: Classical Arabic (MSA)
  • Sample rate: 40.1 kHz (resampled to 24 kHz for training)
  • Text quality: Fully diacritized

Training Procedure

Fine-tuning Framework: Axolotl

Training Configuration:

Base Model: SparkAudio/Spark-TTS-0.5B (LLM component only)
Fine-tuning Method: Full fine-tuning (not LoRA)
Epochs: 20
Batch Size: 8 (1 per device × 8 gradient accumulation)
Learning Rate: 2e-4
Optimizer: AdamW (torch fused)
LR Scheduler: Cosine
Warmup Steps: 10
Sequence Length: 1024
Precision: bfloat16
Gradient Checkpointing: Enabled
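
For a rough sense of scale, the configuration above can be turned into an approximate optimizer-step count and learning-rate curve. This is a back-of-the-envelope sketch, assuming one utterance per training sample (no sequence packing), a single GPU, linear warmup, and cosine decay to zero; the actual step count and schedule used by the trainer may differ.

```python
import math

UTTERANCES = 2850          # ~30% of 9,500 utterances
EFFECTIVE_BATCH = 1 * 8    # per-device batch x gradient accumulation
EPOCHS = 20
PEAK_LR = 2e-4
WARMUP_STEPS = 10

# Assumes one utterance per sample (no sequence packing) and a single GPU.
steps_per_epoch = math.ceil(UTTERANCES / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(steps_per_epoch, total_steps)
    for step in (0, 5, 10, total_steps // 2, total_steps):
        print(step, f"{lr_at(step):.2e}")
```

Under these assumptions there are 357 optimizer steps per epoch and 7,140 in total, so the 10 warmup steps cover only a tiny fraction of training.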

Data Processing:

  1. Audio resampled to 24 kHz
  2. Semantic tokens extracted using BiCodec
  3. Training pairs: [text, semantic_tokens] created for LLM training
  4. Text normalized to lowercase during processing

Training Infrastructure:

  • Hardware: Single NVIDIA GPU (Colab)
  • Training Time: ~3-4 hours
  • Framework: PyTorch + Transformers + Axolotl

Data Preparation Steps:

# 1. Load ClArTTS from HuggingFace
# 2. Resample audio from 40.1 kHz → 24 kHz
# 3. Extract semantic tokens using BiCodec
# 4. Create metadata: [audio_path, text]
# 5. Generate training pairs: [text → semantic_tokens]
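
The resampling step involves a non-integer rate ratio. The sketch below works out the reduced up/down factors that a polyphase resampler would need, along with the resulting output length. It covers only the arithmetic; the actual resampling tool used for this model is not stated, and the helper name resampled_length is hypothetical.

```python
from fractions import Fraction

SOURCE_RATE = 40_100   # ClArTTS sample rate in Hz
TARGET_RATE = 24_000   # rate used for training in Hz

# Fraction reduces the ratio to lowest terms: 24000/40100 -> 240/401.
ratio = Fraction(TARGET_RATE, SOURCE_RATE)
up, down = ratio.numerator, ratio.denominator


def resampled_length(num_samples):
    """Number of output samples after resampling by up/down, rounded up."""
    return -(-num_samples * up // down)  # ceiling division


if __name__ == "__main__":
    print(up, down)
    print(resampled_length(SOURCE_RATE))  # one second of source audio
```

The resulting up and down values (240 and 401) are the kind of integer factors that polyphase resamplers, for example scipy.signal.resample_poly, take as arguments.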

Base Model:

@misc{sparktts2024,
  title={Spark-TTS: Zero-Shot Multi-Style Text-to-Speech via Large Language Models},
  author={SparkAudio Team},
  year={2024},
  url={https://github.com/SparkAudio/Spark-TTS}
}

Training Dataset:

@inproceedings{kulkarni2023clartts,
  author={Ajinkya Kulkarni and Atharva Kulkarni and Sara Shatnawi and Hanan Aldarmaki},
  title={ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus},
  year={2023},
  booktitle={INTERSPEECH 2023},
  pages={5511--5515},
  doi={10.21437/Interspeech.2023-2224}
}

👏 Acknowledgments

  • Base Model: SparkAudio Team for Spark-TTS-0.5B
  • Dataset: MBZUAI for ClArTTS corpus
  • Frameworks: Hugging Face Transformers, Axolotl, PyTorch

📄 License

Apache 2.0 (same as base model)

📧 Contact

For questions, collaboration, or support:


