Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
prithivMLmods 
posted an update 10 days ago
Post
3542
One speech model with seven voices, streamlined with multimodal capabilities for vision tasks. Performs vision(image-text) to audio inference with Qwen2.5-VL + VibeVoice-Realtime-0.5B. Vision to VibeVoice (EN) - The demo is live. 🗣️🔥

🤗 Vision-to-VibeVoice-en [Demo]: prithivMLmods/Vision-to-VibeVoice-en
✨ Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations
✨ Speech [VibeVoice-Realtime-0.5B]: microsoft/VibeVoice-Realtime-0.5B
✨ Vision [Qwen2.5-VL]: Qwen/Qwen2.5-VL-7B-Instruct

To know more about it, visit the app page or the respective model page!

Let's Go!!!

Multimodal is interesting stuff

·

@AbstractPhil Indeed, absolutely.

can this be trained with a custom voice?

·

@melvindave
Yes, I think. Kindly refer to the VibeVoice community materials.

https://github.com/vibevoice-community/VibeVoice-finetuning