StyleStream


StyleStream: Real-Time Zero-Shot Voice Style Conversion

Official PyTorch model weights for streamable voice style conversion in timbre, accent, and emotion.

StyleStream overview

Release note: To reduce voice-cloning misuse, this public release excludes the style encoder weights. Public inference uses curated target speaker embeddings, not arbitrary target-speaker cloning.

News

  • 2026/06/11: StyleStream offline / streaming inference code and weights are open sourced! 🔥 🔥 🔥
  • 2026/06/03: StyleStream was accepted to the INTERSPEECH 2026 long paper track! 🎉 🎉 🎉

Files

This Hugging Face repo hosts the public inference assets:

  • stylizer-no-style-enc.ckpt: stylizer checkpoint without style encoder weights
  • destylizer.ckpt: destylizer checkpoint
  • vocos_causal_best.ckpt: causal vocoder checkpoint
  • target_spkrs.tar: larger curated target speaker inventory

Small target examples and the full inference code are available in the GitHub repo:

https://github.com/Berkeley-Speech-Group/StyleStream

Download

Install or upgrade the Hugging Face CLI if needed (the hf command used below ships with recent huggingface_hub releases):

pip install -U huggingface_hub

From the StyleStream project root, download checkpoints:

hf download Louis0324/StyleStream \
  stylizer-no-style-enc.ckpt destylizer.ckpt vocos_causal_best.ckpt \
  --repo-type model --local-dir assets/ckpts

Download the larger target speaker inventory:

hf download Louis0324/StyleStream target_spkrs.tar --repo-type model --local-dir assets/target_spkrs

Expected local layout:

assets/ckpts/
  stylizer-no-style-enc.ckpt
  destylizer.ckpt
  vocos_causal_best.ckpt

assets/target_spkrs/
  target_spkrs.tar
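
The inventory arrives as a single .tar file, so it presumably has to be unpacked before the targets can be used. The following is a minimal Python sketch, assuming the default paths shown above. The helper names missing_checkpoints and extract_inventory are hypothetical and not part of the StyleStream code, and the internal layout of the archive is not documented here, so check the GitHub repo for where the extracted targets are expected to live.

```python
import tarfile
from pathlib import Path

# File names taken from the "Expected local layout" listing above.
CKPT_NAMES = (
    "stylizer-no-style-enc.ckpt",
    "destylizer.ckpt",
    "vocos_causal_best.ckpt",
)


def missing_checkpoints(ckpt_dir="assets/ckpts"):
    """Return the checkpoint file names that are not present in ckpt_dir."""
    ckpt_dir = Path(ckpt_dir)
    return [name for name in CKPT_NAMES if not (ckpt_dir / name).is_file()]


def extract_inventory(tar_path="assets/target_spkrs/target_spkrs.tar", out_dir=None):
    """Unpack the target speaker archive next to itself (or into out_dir)."""
    tar_path = Path(tar_path)
    out_dir = Path(out_dir) if out_dir is not None else tar_path.parent
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters.
            archive.extractall(out_dir, filter="data")
        else:
            archive.extractall(out_dir)
    return out_dir
```

From the project root, missing_checkpoints() should return an empty list once the checkpoint download has finished, and extract_inventory() unpacks the archive into assets/target_spkrs/.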

Usage

Clone the GitHub repo and follow its setup instructions:

git clone https://github.com/Berkeley-Speech-Group/StyleStream.git
cd StyleStream
pip install -r requirements.txt

Offline Streamlit app:

streamlit run inference/offline_app.py

Recommended streaming inference:

python inference/streaming.py

Use this terminal script for the fastest real-time performance. It runs a speed test before starting audio I/O, selects a streamable inference-step setting, and lets you switch target styles by typing a target index.
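
The idea behind choosing a streamable setting can be pictured as follows. This is only an illustrative sketch, not the code in inference/streaming.py: the function name, the candidate step counts, the chunk duration, and the safety margin are all assumptions. It treats a setting as streamable when processing one audio chunk takes less time than the chunk lasts, and picks the largest such step count.

```python
def pick_streamable_steps(latency_by_steps, chunk_seconds, margin=0.8):
    """Pick the largest step count whose measured latency fits the real-time budget.

    latency_by_steps: dict mapping inference-step count -> measured seconds per chunk.
    chunk_seconds: duration of audio covered by one chunk.
    margin: fraction of the chunk duration allowed for compute (assumed headroom).
    Returns None when no candidate is fast enough.
    """
    budget = chunk_seconds * margin
    feasible = [steps for steps, latency in latency_by_steps.items() if latency <= budget]
    return max(feasible) if feasible else None


# Hypothetical measurements, for illustration only (not StyleStream benchmarks).
measured = {1: 0.02, 2: 0.04, 4: 0.09, 8: 0.2}
print(pick_streamable_steps(measured, chunk_seconds=0.1))  # → 2
```

With these made-up numbers the compute budget is 80 ms per 100 ms chunk, so 4 and 8 steps are rejected and 2 steps is selected.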

Streaming Streamlit app:

streamlit run inference/streaming_app.py

Use this when you want browser-based target selection, audio device selection, live status, and speed-test visualization. It has the same core streaming functionality, but is slower because of Streamlit overhead.

Command-line examples:

./inference/run_inference_offline.sh
./inference/run_inference_simulate_streaming.sh

Style Inventory

Target styles use this folder format:

target_name/
  target_name.wav
  target_name.npy

The .wav file provides the target mel/acoustic context. The .npy file is the pre-extracted style embedding with shape [768].
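
The folder convention above can be checked before running inference. The following is a minimal sketch, assuming an already extracted inventory directory. list_targets and load_style_embedding are hypothetical helper names rather than StyleStream APIs, and NumPy is assumed to be installed for the embedding check.

```python
from pathlib import Path

EMBED_DIM = 768  # style embedding size stated above


def list_targets(inventory_dir):
    """Return sorted names of folders holding both <name>.wav and <name>.npy."""
    root = Path(inventory_dir)
    names = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        name = folder.name
        if (folder / f"{name}.wav").is_file() and (folder / f"{name}.npy").is_file():
            names.append(name)
    return names


def load_style_embedding(inventory_dir, name):
    """Load <name>/<name>.npy and check that it has shape (768,)."""
    import numpy as np  # imported lazily so list_targets works without NumPy

    emb = np.load(Path(inventory_dir) / name / f"{name}.npy")
    if emb.shape != (EMBED_DIM,):
        raise ValueError(f"{name}: expected shape ({EMBED_DIM},), got {emb.shape}")
    return emb
```

Folders that are missing either file are skipped by list_targets, which makes incomplete targets easy to spot before they cause an error at inference time.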

Intended Use

StyleStream is released for educational, research, and not-for-profit use. It is intended for voice style conversion research, benchmarking, comparison, and reproducible inference.

The public release does not include style encoder weights and does not support arbitrary target-speaker cloning.

License

The code is released under a research, educational, and not-for-profit software license. Commercial use requires prior written permission from The Regents of the University of California.

See the LICENSE file in this Hugging Face model repo:

https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE

Acknowledgements

F5-TTS: stylizer flow matching modules.

Citation

If you find StyleStream useful, please consider starring the GitHub repo and citing the paper:

@article{liu2026stylestream,
  title={StyleStream: Real-Time Zero-Shot Voice Style Conversion},
  author={Yisi Liu and Nicholas Lee and Gopala Anumanchipalli},
  journal={arXiv preprint arXiv:2602.20113},
  year={2026}
}