StyleStream: Real-Time Zero-Shot Voice Style Conversion
Official PyTorch model weights for streamable voice style conversion in timbre, accent, and emotion.
Release note: To reduce voice-cloning misuse, this public release excludes the style encoder weights. Public inference uses curated target speaker embeddings, not arbitrary target-speaker cloning.
News
- 2026/06/11: StyleStream offline / streaming inference code and weights are open sourced!
- 2026/06/03: StyleStream was accepted to the INTERSPEECH 2026 long paper track!
Files
This Hugging Face repo hosts the public inference assets:
- stylizer-no-style-enc.ckpt: stylizer checkpoint without style encoder weights
- destylizer.ckpt: destylizer checkpoint
- vocos_causal_best.ckpt: causal vocoder checkpoint
- target_spkrs.tar: larger curated target speaker inventory
Small target examples and the full inference code are available in the GitHub repo:
https://github.com/Berkeley-Speech-Group/StyleStream
Download
Install the Hugging Face CLI if needed (the hf command ships with recent huggingface_hub releases, so upgrade an older install):
pip install -U huggingface_hub
From the StyleStream project root, download checkpoints:
hf download Louis0324/StyleStream \
stylizer-no-style-enc.ckpt destylizer.ckpt vocos_causal_best.ckpt \
--repo-type model --local-dir assets/ckpts
Download the larger target speaker inventory:
hf download Louis0324/StyleStream target_spkrs.tar --repo-type model --local-dir assets/target_spkrs
Expected local layout:
assets/ckpts/
  stylizer-no-style-enc.ckpt
  destylizer.ckpt
  vocos_causal_best.ckpt
assets/target_spkrs/
  target_spkrs.tar
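Before running inference, it can help to confirm that the downloads landed where the layout above expects them. The following is a minimal sketch, not part of the StyleStream code: the helper names missing_assets and extract_inventory are hypothetical, and unpacking the archive next to the tar file is an assumption, since this README does not describe how target_spkrs.tar is consumed.

```python
import tarfile
from pathlib import Path

# File names are taken from the expected local layout listed above.
EXPECTED = {
    "assets/ckpts": [
        "stylizer-no-style-enc.ckpt",
        "destylizer.ckpt",
        "vocos_causal_best.ckpt",
    ],
    "assets/target_spkrs": ["target_spkrs.tar"],
}


def missing_assets(project_root="."):
    """Return the expected asset paths that are absent under project_root."""
    root = Path(project_root)
    return [
        (Path(folder) / name).as_posix()
        for folder, names in EXPECTED.items()
        for name in names
        if not (root / folder / name).is_file()
    ]


def extract_inventory(tar_path, dest_dir):
    """Unpack the speaker inventory archive into dest_dir.

    Assumption: the archive is a plain tar meant to be unpacked on disk;
    the README does not document this step.
    """
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(tar_path) as tar:
        tar.extractall(dest_dir, **kwargs)


if __name__ == "__main__":
    missing = missing_assets(".")
    print("All assets present." if not missing else f"Missing: {missing}")
```

Run it from the StyleStream project root. If the GitHub repo's setup instructions describe a different location for the unpacked inventory, follow those instead.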
Usage
Clone the GitHub repo and follow its setup instructions:
git clone https://github.com/Berkeley-Speech-Group/StyleStream.git
cd StyleStream
pip install -r requirements.txt
Offline Streamlit app:
streamlit run inference/offline_app.py
Recommended streaming inference:
python inference/streaming.py
Use this terminal script for the fastest real-time performance. It runs a speed test before opening audio I/O, selects an inference-step setting fast enough to stream, and lets you switch target styles by typing a target index.
Streaming Streamlit app:
streamlit run inference/streaming_app.py
Use this when you want browser-based target selection, audio device selection, live status, and speed-test visualization. It has the same core streaming functionality, but is slower because of Streamlit overhead.
Command-line examples:
./inference/run_inference_offline.sh
./inference/run_inference_simulate_streaming.sh
Style Inventory
Target styles use this folder format:
target_name/
  target_name.wav
  target_name.npy
The .wav file provides the target mel/acoustic context. The .npy file holds the pre-extracted style embedding, a vector of shape [768].
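A target folder can be sanity-checked against this format before it is used. The sketch below is illustrative only and assumes NumPy is installed; check_target and EMBED_DIM are hypothetical names, not part of the StyleStream code. It checks only that both files exist and that the embedding has the stated shape, not that the audio itself is valid.

```python
from pathlib import Path

import numpy as np

EMBED_DIM = 768  # style embedding shape stated above: [768]


def check_target(target_dir):
    """Return a list of problems found in one target folder (empty if OK)."""
    target_dir = Path(target_dir)
    name = target_dir.name  # both files must share the folder's name
    wav = target_dir / f"{name}.wav"
    npy = target_dir / f"{name}.npy"

    problems = []
    if not wav.is_file():
        problems.append(f"missing {wav.name}")
    if not npy.is_file():
        problems.append(f"missing {npy.name}")
    else:
        shape = np.load(npy).shape
        if shape != (EMBED_DIM,):
            problems.append(
                f"{npy.name} has shape {shape}, expected ({EMBED_DIM},)"
            )
    return problems
```

For example, check_target("assets/target_spkrs/some_target") returns an empty list when the folder matches the format; the path here is a placeholder, not a real inventory entry.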
Intended Use
StyleStream is released for educational, research, and not-for-profit use. It is intended for voice style conversion research, benchmarking, comparison, and reproducible inference.
The public release does not include style encoder weights and does not support arbitrary target-speaker cloning.
License
The code is released under a research, educational, and not-for-profit software license. Commercial use requires prior written permission from The Regents of the University of California.
See the LICENSE file in this Hugging Face model repo:
https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE
Acknowledgements
F5-TTS: stylizer flow matching modules.
Citation
If you find StyleStream useful, please consider starring the GitHub repo and citing the paper:
@article{liu2026stylestream,
title={StyleStream: Real-Time Zero-Shot Voice Style Conversion},
author={Yisi Liu and Nicholas Lee and Gopala Anumanchipalli},
journal={arXiv preprint arXiv:2602.20113},
year={2026}
}