The WAVe paper is officially out in the Information Sciences Journal.
You saw the PT and NL model releases earlier this year. This is the peer-reviewed paper behind them, with the full method, ablations, and downstream ASR evaluation.
Quick recap: WAVe is a 1B multimodal embedding model that filters synthetic speech at the word level, not the sentence level. On Portuguese ASR it cuts training steps by 34%, improves cross-domain generalization by 50%, and matches WER with 30% less synthetic data.
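If you want the one-paragraph intuition for the method: align synthetic audio to its transcript, embed each word on both sides, and drop the words where the two disagree. Here's a minimal sketch, assuming normalized encoder outputs and a pre-computed word alignment; the encoders, threshold, and data format are placeholders, not WAVe's actual internals:

```python
import numpy as np

def filter_words(word_spans, embed_audio, embed_text, threshold=0.6):
    """Keep only word-level (audio, text) pairs whose embeddings agree.

    word_spans: pre-aligned (waveform_slice, word_string) pairs.
    embed_audio / embed_text: callables returning L2-normalized vectors
    (stand-ins for the model's two encoders).
    """
    kept = []
    for audio, word in word_spans:
        a = embed_audio(audio)       # audio-side embedding of one word
        t = embed_text(word)         # text-side embedding of the same word
        score = float(np.dot(a, t))  # cosine similarity on unit vectors
        if score >= threshold:       # low scores flag garbled synthetic words
            kept.append((audio, word, score))
    return kept
```

The granularity is the whole point: a sentence-level filter discards an entire utterance over one bad word, while this keeps the good ones.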
I wanted to share a cool feature from my open-source, AI-native web browser, Vessel: persistent highlights!
You can highlight anything on the page and the context is provided to the agent. It's a fun way to learn about new stuff, synthesize info, or just deepen your understanding.
Since highlights are persistent, you can close the page and come back later, and your highlights will be exactly where you left them. I've found this particularly useful when reviewing technical blogs, model cards, etc.
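For anyone curious how persistence like this can work under the hood, here's a minimal sketch (Python for brevity, and emphatically not Vessel's actual code): store a quote-plus-context anchor per URL, in the spirit of the W3C TextQuoteSelector, and re-locate it on the next visit.

```python
import sqlite3

# Hypothetical schema for illustration; Vessel's actual storage will differ.
con = sqlite3.connect("highlights.db")
con.execute("""CREATE TABLE IF NOT EXISTS highlights (
    url TEXT, exact TEXT, prefix TEXT, suffix TEXT
)""")

def save_highlight(url, exact, prefix, suffix):
    # Quote plus surrounding context (a TextQuoteSelector-style anchor)
    # lets the highlight be re-located even if the page shifts a little.
    con.execute("INSERT INTO highlights VALUES (?, ?, ?, ?)",
                (url, exact, prefix, suffix))
    con.commit()

def highlights_for(url):
    # On revisit: fetch anchors, repaint them, and hand the quoted text
    # to the agent as context.
    return con.execute(
        "SELECT exact, prefix, suffix FROM highlights WHERE url = ?",
        (url,)).fetchall()
```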
So I built a multimodal video annotation pipeline in my spare time, as you do.
corpus-mill turns any long-form video with people on camera into a time-aligned event corpus across audio, vision, OCR, faces, brand observations, music, and clip-worthy moments. Runs entirely on local GPU because, and I cannot stress this enough, your footage has no business being on someone else's servers.
The honest origin: I needed real multimodal supervision data, and the public corpora are weirdly thin once you need per-frame / per-speaker / per-second labels with provenance, so I built my own. Then it grew. Then I looked up and it was 30K LOC and ~30 stages, and I thought, ok, maybe other people would want this.
Stack is the usual suspects: Whisper-large-v3 (faster-whisper), pyannote-3.1 (which secretly drags in 433 NeMo modules, surprise!), Qwen2.5-VL-7B for vision/OCR/shoppable detection, dlib + YuNet for faces, qwen2.5:7b / qwen3:14b via local Ollama for the LLM passes, chromaprint + PDQ for fingerprinting. Outputs as Parquet + SQLite. Apache 2.0.
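For a flavor of what one stage emits, here's roughly the ASR pass; the column names and provenance tag are illustrative, not corpus-mill's exact schema:

```python
import pandas as pd  # plus pyarrow for the Parquet writer
from faster_whisper import WhisperModel

# Word-timestamped ASR: one row per word, ready to join with the other
# modality streams on (start, end).
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, _info = model.transcribe("talk.mp4", word_timestamps=True)

rows = [
    {"start": w.start, "end": w.end, "word": w.word,
     "prob": w.probability, "source": "asr"}   # provenance tag per event
    for seg in segments
    for w in seg.words
]
pd.DataFrame(rows).to_parquet("events_asr.parquet", index=False)
```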
There's a Docker compose that works, after I spent a day discovering that faster-whisper wants CUDA 12 cuBLAS while pyannote 4 wants CUDA 13, and the answer is "install both, point LD_LIBRARY_PATH at the cu12 wheels, ship it." That's now baked in. You're welcome.
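Concretely, the cuBLAS trick works because the nvidia-*-cu12 pip wheels ship the CUDA 12 libraries; you just have to tell the loader where they live before the process starts. A sketch of the idea the compose file bakes in:

```python
import os

# The nvidia-cublas-cu12 / nvidia-cudnn-cu12 pip wheels ship the CUDA 12
# libs that faster-whisper's CTranslate2 backend wants to dlopen.
import nvidia.cublas.lib
import nvidia.cudnn.lib

paths = [os.path.dirname(nvidia.cublas.lib.__file__),
         os.path.dirname(nvidia.cudnn.lib.__file__)]

# LD_LIBRARY_PATH is read at process start, so print it for a shell wrapper:
#   export LD_LIBRARY_PATH="$(python cu12_paths.py):$LD_LIBRARY_PATH"
print(":".join(paths))
```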
Spare-time project, bugs are real, fixing them for your specific footage is on you. If you're training multimodal models and want a corpus pipeline you fully control on-prem, this might save you months. If not, the README is at least mildly entertaining.
Multimodal-Edge Demo, a node-based inference canvas, is now live on Spaces. It uses node-based Transformers for fast inference across 10+ edge-device multimodal models from the Hub, all within a single Space. The lineup includes models from the Qwen3.5, Qwen3-VL, Gemma 4, and LFM 2.5 VL series, with support for reasoning and grounding tasks.
This Space is a fork of the brilliant Eliahu/Model-Atlas, the official demo of "Charting and Navigating Hugging Face's Model Atlas" (Horwitz et al., arXiv 2503.10633). Their pre-computed HF model graph is the foundation of every node and edge you see, and we are deeply grateful for its open release.
The original atlas is a static snapshot of early 2025. Model Galaxy turns it into a living, multimodal map. We injected the 2026 trending originals that did not exist when the atlas was frozen: DeepSeek-V4, Hy3-preview, GLM-5.1, Kimi-K2, gpt-oss, Nemotron-3 Super / Nano / Omni, Hermes-4.3, Qwen3-Coder-Next, Llama-3.3, Granite-4.1, plus the latest multimodal releases (FLUX.2, ERNIE-Image, HunyuanImage / Video, LTX-2.3, Wan2.2, Kokoro-82M, VoxCPM2, Voxtral-TTS, whisper-v3-turbo, Gemma-4, Qwen3-Omni, Phi-4-mm), each with proper base_model lineage edges.
We also added the complete VIDRAFT Darwin family ontology: 120 nodes covering Darwin Core, AETHER, every brand variant (Rogue, AWAXIS, TenOS, Warecube), NOESIS-Darwin multimodal extensions, and 40+ community quantizations, the most complete Darwin lineage view anywhere.
The name "Galaxy" is now literal: our three injected clusters are re-laid out as logarithmic spiral galaxies, with bigger models near the bright cores and quantizations scattering to the outer arms ā just like real star mass distribution. A top-right toggle switches between Galaxy mode (deep-space gradient with 220 animated stars) and Atlas mode (clean white panels for reports). A 15-second progress bar narrates the render, and per-modality / per-company colors make every cluster legible at a glance.
Final scale: 22,480 nodes in the default Modalities atlas, 137,324 in the Large NLP atlas, and a 277-node compact Darwin + Trending view for instant exploration. Feedback and PRs welcome.
1️⃣ Build a solid RL env with Verifiers (Prime Intellect)
2️⃣ Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3️⃣ SFT warm-up to teach format
4️⃣ Group-based RL (CISPO) against opponents making 20-70% random moves (see the sketch after this list)
5️⃣ RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies
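For step 4️⃣, the core of the group-based update looks roughly like this. This is my paraphrase of a CISPO-style objective; the clip bound, group shape, and reward standardization are assumptions, not values from the actual run:

```python
import torch

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (num_groups, group_size); one group = several games played
    # from the same start state. Each game's advantage is its reward
    # standardized within its own group.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def cispo_loss(logp_new, logp_old, advantages):
    # Clipped importance-sampling weight with a stop-gradient: unlike
    # PPO's clip, clipped tokens still pass gradient through the
    # log-prob term, so no token's update is silently zeroed out.
    ratio = torch.exp(logp_new - logp_old)
    weight = torch.clamp(ratio, max=2.0).detach()  # upper bound is a knob
    return -(weight * advantages * logp_new).mean()
```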