s3nh PRO
Standard quantization places levels on a uniform grid. ICRB-Q places them on geodesics of the Fisher-Rao statistical manifold — the Riemannian manifold (M, g_F) where the metric tensor is the Fisher information. This means:
- High-Fisher-curvature regions (where small weight changes cause large output changes) get exponentially denser levels.
- Low-curvature, "flat" regions (e.g. many heads in early transformer layers) get coarse 2-bit or 3-bit quantization automatically.
The codebook construction reduces to solving: place 2^b points in parameter space to minimize expected geodesic distance from any weight to its nearest level.
This strictly generalizes AWQ's per-channel scaling (which is a zero-order approximation to this manifold geometry) and GPTQ's second-order correction (which is a local linearization).
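To make the codebook objective above concrete, here is a minimal sketch that swaps in a much simpler surrogate: a one-dimensional Lloyd iteration where each weight's squared error is scaled by an importance value standing in for diagonal Fisher information. It is not ICRB-Q and it does not compute geodesics. The function name `fisher_weighted_codebook`, the toy weights, and the importance values are all hypothetical.

```python
def fisher_weighted_codebook(weights, fisher, bits, iters=50):
    """Weighted 1-D Lloyd iteration (a Euclidean surrogate, not a geodesic solver)."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    # Start from a uniform grid, i.e. the standard-quantization baseline.
    levels = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        mass = [0.0] * k
        for w, f in zip(weights, fisher):
            # Assign each weight to its nearest level.
            j = min(range(k), key=lambda i: (w - levels[i]) ** 2)
            sums[j] += f * w
            mass[j] += f
        # Move each level to the importance-weighted mean of its weights;
        # levels with no assigned weights stay where they are.
        levels = [sums[i] / mass[i] if mass[i] > 0 else levels[i] for i in range(k)]
    return sorted(levels)


# Toy data (assumed): weights near zero are marked as far more sensitive.
weights = [-1.0, -0.9, -0.1, -0.05, 0.0, 0.05, 0.1, 0.9, 1.0]
fisher = [0.01, 0.01, 5.0, 5.0, 5.0, 5.0, 5.0, 0.01, 0.01]
levels = fisher_weighted_codebook(weights, fisher, bits=2)
print(levels)
```

Each level is pulled toward the weights with the largest importance values, which is the flat-space analogue of the idea described in the post. A faithful version would replace the squared Euclidean distance with a Fisher-Rao geodesic distance.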
New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.
We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.
The result is codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.
We trained codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.
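As a quick sanity check on those numbers, the token budget and the implied average throughput can be worked out directly. This sketch uses only the figures quoted above. Because "~78 hours" is approximate and may include evaluation or checkpointing time, the throughput is a rough estimate rather than a measured value.

```python
dataset_tokens = 10.2e9   # size of sutra-10B, from the post
epochs = 3                # from the post
hours = 78                # approximate wall-clock time, from the post

total_tokens = dataset_tokens * epochs
tokens_per_second = total_tokens / (hours * 3600)

print(f"{total_tokens / 1e9:.1f}B tokens, ~{tokens_per_second:,.0f} tokens/s on average")
```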
Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.
Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens
All datasets at multiple scales (10M, 100M, 1B, 10B) plus seed concepts and an SFT variant are in the Sutra Pedagogical Datasets collection.
- HD resolution - 1280×720 · 32 fps
- Per-frame keyboard and mouse inputs plus world state (player position, velocity, weapon, ...)
- HD stereo audio
- All 10 players' perspectives
https://huggingface.co/collections/blanchon/opencs2
it is with great pleasure i present to you my working one-click, completely free huggingface spaces deployment with 16GB ram.
repo : Tonic/hugging-claw (use git clone to inspect)
literally the one-click link : Tonic/hugging-claw
you can also run it locally and see for yourself :
docker run -it -p 7860:7860 --platform=linux/amd64 \
-e HF_TOKEN="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_TRUSTED_PROXIES="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_PASSWORD="YOUR_VALUE_HERE" \
-e OPENCLAW_CONTROL_UI_ALLOWED_ORIGINS="YOUR_VALUE_HERE" \
registry.hf.space/tonic-hugging-claw:latest
just a few quite minor details i'll take care of but i wanted to share here first
App is here : https://www.patreon.com/posts/137551634
Full tutorial on how to use and train : https://youtu.be/DPX3eBTuO_Y
Key findings from our research on optimal architectures for small language models:
→ Depth beats width: 32 layers outperform 12 layers at the same parameter count
→ Best-in-class factuality: 47.5% on TruthfulQA
→ 10x training efficiency using WSD (Warmup-Stable-Decay) conversion
→ Canon layers add only 0.13% parameters but improve reasoning
We trained on 1B tokens using the optimal 50-30-20 dataset mix (PDFs + filtered web + educational content), then converted to diffusion with just 100M additional tokens.
Blog: https://huggingface.co/blog/codelion/optimal-model-architecture
Model: codelion/dhara-70m
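For readers unfamiliar with the term, Warmup-Stable-Decay usually names a learning-rate schedule with three phases: a short ramp up, a long constant plateau, and a final decay. The sketch below shows only that shape. The function name `wsd_lr`, the peak rate, the phase fractions, and the linear decay are assumptions, not values from the post or the blog.

```python
def wsd_lr(step, total_steps, peak_lr=1e-3, warmup_frac=0.05,
           decay_frac=0.2, min_lr=0.0):
    """Learning rate at `step` for a warmup / stable / decay schedule.

    All defaults are illustrative assumptions.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear ramp from near zero up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak rate.
        return peak_lr
    # Decay phase: linear interpolation from peak_lr down to min_lr.
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr + (min_lr - peak_lr) * min(1.0, progress)


schedule = [wsd_lr(s, total_steps=1000) for s in range(1000)]
```

One commonly cited advantage of this shape is that a checkpoint from the stable phase can be resumed or branched without committing to a total step count in advance, since only the final decay depends on where training ends.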
mii-llm/nesso-4B
#Nesso-4B is a fine-tuned version of Qwen-4B, trained on a highly curated and balanced dataset designed specifically for multilingual agentic workflows and conversational use cases.
As shown in the video below, we simulate the new “cowork” from #Anthropic without any data sharing, all running on a consumer device. The model can be used to build agentic behavior in #privateAI environments.
Not every problem requires super intelligence: in many cases, intelligence at the edge is more than enough.
#Nesso4B #AgenticAI #PrivateAI #EdgeAI #OnDeviceAI
zai-org/GLM-OCR
✨ 0.9B
✨ MIT licensed
✨ Multimodal GLM-V architecture
✨ #1 on OmniDocBench v1.5 (94.62)
Following Rain-100M, we’re scaling up. Rain-v2 features a larger training dataset.
We’ve published a comprehensive blog covering the end-to-end journey—from raw data collection to rigorous evaluation and safety testing.
HF Repo: 🤗 raincandy-u/Rain-v2
Blog: 📚 https://angelkawaii.xyz/2026/01/29/rain-v2/
Special thanks to the open-source community and the SmolLM2 team for their foundational work! 🚀
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model (2502.02737)
Repo: raincandy-u/Rain-100M
Data: HuggingFaceFW/fineweb-edu, ~3B tokens, English only
Tokenizer: custom 16k BPE, context length 4096
Architecture: 12 Transformer layers, hidden size 768, 12 heads, MLP 2048, SiLU, bf16
Rain-100M is a raw base model (not instruction-tuned or safety-aligned), aimed at small-scale research, debugging training pipelines, and CPU/edge experiments. If you run evaluations, finetunes, or visualizations with it, I would be very interested in your results!
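The listed architecture can be checked against the "100M" in the name with a back-of-the-envelope parameter count. This sketch makes several assumptions the post does not state: a gated SiLU MLP (three weight matrices), full multi-head attention, tied input and output embeddings, a vocabulary of exactly 16,384, no bias terms, and normalization weights ignored. The function name `estimate_params` is hypothetical.

```python
def estimate_params(layers=12, hidden=768, mlp=2048, vocab=16_384,
                    gated_mlp=True, tied_embeddings=True):
    """Rough transformer parameter count (biases and norm weights ignored)."""
    attention = 4 * hidden * hidden        # Q, K, V and output projections
    mlp_matrices = 3 if gated_mlp else 2   # a gated MLP adds a third matrix
    feed_forward = mlp_matrices * hidden * mlp
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)
    return layers * (attention + feed_forward) + embeddings


total = estimate_params()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes to about 97.5M parameters, which is consistent with the model's name. Changing any of the assumed choices shifts the figure, so treat it as a plausibility check rather than the model's exact size.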
It's the first step of my spare-time projects: SFT on Qwen3-8B.
EduHelper is a child-friendly tutoring assistant fine-tuned from the Qwen3-8B base model using parameter-efficient fine-tuning (PEFT) with LoRA on the ajibawa-2023/Education-Young-Children dataset.
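For context, the idea behind LoRA is to freeze a weight matrix W and train only a low-rank update, so the layer computes W x + (alpha / r) * B (A x). The pure-Python sketch below illustrates that arithmetic on a tiny layer. It is not the training code used for EduHelper, and the dimensions, rank `r`, and scaling `alpha` are made-up values.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 8, 8, 2, 4   # illustrative sizes only

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, zero-initialized
x = [1.0] * d_in

y = lora_forward(W, A, B, x, alpha, r)

full_params = d_in * d_out         # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)   # parameters updated by LoRA
print(f"trainable: {lora_params} of {full_params}")
```

Because B starts at zero, the adapted layer initially reproduces the base model's output exactly, and training only has to learn the difference. The savings grow with layer size: the adapter cost scales with r * (d_in + d_out) rather than d_in * d_out.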
s3nh/EduHelp-8B
Glad to share my work, have a wonderful day!
psychotherapeutic preferences just landed on
Beck-8B as a base model, 13000 steps on an educational dataset.
Time to go further and build more 🥰
s3nh/EduHelp_Beck_8B
Thanks to @basilic_ai for computations <3