Zixi "Oz" Li PRO

OzTianlu

https://github.com/lizixi-0x2F

lizixi-0x2F

AI & ML interests

My research focuses on deep reasoning with small language models, Transformer architecture innovation, and knowledge distillation for efficient alignment and transfer.

Recent Activity


reacted to their post with 🤗 1 day ago

Post

903

https://github.com/lizixi-0x2F/March
I just released March, an open-source high-performance KV cache sharing library for LLM inference that uses Trie-based prefix deduplication.
When you run LLM services, you often see thousands of requests sharing the same system prompt and conversation history. But traditional KV cache systems store each sequence separately — duplicating the exact same data over and over again. Pure waste.
March uses a Trie structure to automatically detect and reuse identical token prefixes. Instead of storing [system_prompt + history] 1000 times, it's stored once. Everyone shares it.
- 80-97% memory reduction in prefix-heavy workloads (tested on SmolLM2-135M with 500 multi-turn conversations)
- Zero-copy queries — returns direct pointers into the memory pool, no expensive memcpy on the hot path
- Predictable memory usage — fixed-size page pool with O(L) complexity
- Trade-off: slightly slower than dict O(1) lookup, but the memory savings are worth it in production
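
The prefix sharing described above can be illustrated with a toy trie. This is a minimal sketch under stated assumptions, not March's actual API: `TrieNode`, `PrefixKVCache`, and the string stand-ins for KV entries are hypothetical names chosen for illustration, and a real implementation would hand out slots from a fixed-size page pool rather than a growing Python list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class TrieNode:
    """One node per distinct (prefix, token) pair; owns that token's KV slot."""
    kv_slot: Optional[int] = None
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class PrefixKVCache:
    """Hypothetical toy cache: identical token prefixes share the same slots."""

    def __init__(self) -> None:
        self.root = TrieNode()
        self.pool: List[str] = []  # stand-in for real key/value tensors

    def insert(self, tokens: Sequence[int]) -> List[int]:
        """Walk the trie, allocating a slot only for tokens not seen on this path."""
        node = self.root
        slots: List[int] = []
        for pos, tok in enumerate(tokens):
            child = node.children.get(tok)
            if child is None:
                child = TrieNode(kv_slot=len(self.pool))
                self.pool.append(f"kv(tok={tok}, pos={pos})")
                node.children[tok] = child
            slots.append(child.kv_slot)
            node = child
        return slots

    def longest_prefix(self, tokens: Sequence[int]) -> int:
        """Number of leading tokens already cached; cost is linear in len(tokens)."""
        node = self.root
        matched = 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            matched += 1
            node = nxt
        return matched


# Made-up workload: 1000 requests share a 50-token prefix, each adds 5 unique tokens.
system_prompt = list(range(50))
cache = PrefixKVCache()
naive_entries = 0
for i in range(1000):
    request = system_prompt + [10_000 + i * 5 + j for j in range(5)]
    cache.insert(request)
    naive_entries += len(request)  # what per-sequence storage would hold

saving = 1 - len(cache.pool) / naive_entries
print(f"trie entries: {len(cache.pool)}, naive entries: {naive_entries}, saving: {saving:.1%}")
```

In this toy workload the trie holds 5,050 entries where per-sequence storage would hold 55,000. That ratio comes entirely from the made-up request shape and is not a reproduction of the 80-97% measurement quoted in the post.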

1 reply

posted an update 1 day ago

Post

903

1 reply

reacted to reaperdoesntknow's post with 👍 3 days ago

Post

3241

We present a methodology for training small language models on CPU at FP32 precision that achieves capability-per-dollar efficiency orders of magnitude beyond GPU-based training. Across 15 models spanning four novel architecture families, namely Mixture of Attentions (MoA), cross-architecture fusion (Qemma), swarm intelligence (SAGI), and metric-space causal language models (DiscoverLM), total compute cost was $24 on a single AMD EPYC 9454P processor. We introduce seven methodological pillars: (1) FP32 precision preservation, with experiments demonstrating 5,810× single-operation error and 23,225× compounding error ratio for FP16 at network depth; (2) sparse cognitive architectures where 0.02–7% of parameters activate per token, matching CPU branching rather than GPU SIMD; (3) developmental curriculum training progressing from language to logic to transfer to depth; (4) continuous belt-fed data ingestion eliminating truncation waste; (5) hardware-native optimization for AMD Zen 4 via AOCL/OpenMP/NUMA-aware allocation; (6) self-regulating thermodynamic governance with emergent temperature measurement grounded in L2-star discrepancy; and (7) open-standard compute (AVX2 SIMD at FP32) free of proprietary vendor dependency. We argue that transformers were designed for GPU hardware rather than mathematical optimality, and that architectures designed for geometric correctness (metric-space attention, triangle inequality enforcement, sparse expert routing) naturally favor CPU execution. For sub-2B parameter models, CPU training produces more capable models at a fraction of the cost.
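
Pillar (1) above concerns rounding error compounding at lower precision. The following is a generic illustration of that effect, not the post's experiment: it uses only the standard library's `struct` half-precision (`"e"`) and single-precision (`"f"`) formats to round after every addition, and the helper names `to_fp16`, `to_fp32`, and `accumulate` are hypothetical. It makes no attempt to reproduce the 5,810× or 23,225× figures.

```python
import struct
from typing import Callable


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(step: float, n: int, rnd: Callable[[float], float]) -> float:
    """Add `step` to a running total n times, rounding after every addition."""
    total = rnd(0.0)
    step = rnd(step)
    for _ in range(n):
        total = rnd(total + step)
    return total


n_steps = 5000
exact = 0.1 * n_steps  # reference value computed in double precision

fp16_sum = accumulate(0.1, n_steps, to_fp16)
fp32_sum = accumulate(0.1, n_steps, to_fp32)
fp16_err = abs(fp16_sum - exact)
fp32_err = abs(fp32_sum - exact)

print(f"FP16 total: {fp16_sum}  (absolute error {fp16_err:.4f})")
print(f"FP32 total: {fp32_sum}  (absolute error {fp32_err:.6f})")
```

Once the running total grows large enough that half the spacing between adjacent FP16 values exceeds the step, each further addition rounds back to the same value and the FP16 total stops growing, while the FP32 total stays close to the reference.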

6 replies

updated a model 6 days ago

NoesisLab/Kai-30B-Instruct

Text Generation • 33B • Updated 6 days ago • 370 • 21

reacted to danielhanchen's post with 🔥 17 days ago

Post

3864

We collaborated with NVIDIA to teach you about Reinforcement Learning and RL environments. 💚 Learn:

• Why RL environments matter + how to build them
• When RL is better than SFT
• GRPO and RL best practices
• How verifiable rewards and RLVR work

Blog: https://unsloth.ai/blog/rl-environments

4 replies

updated a model 17 days ago

NoesisLab/Arcade-3B

Text Generation • 3B • Updated 17 days ago • 140 • 8

replied to their post 17 days ago

What an academic tone! New baselines, here.

upvoted an article 17 days ago

Article

Arcade-3B: SLM Optimization via Orthogonal Decoupling of Latent State Spaces

published an article 17 days ago

Article

Arcade-3B: SLM Optimization via Orthogonal Decoupling of Latent State Spaces

upvoted an article 17 days ago

Article

Arcade-3B: SLM Optimization Based on Orthogonal Decoupling of Hidden-Layer State Spaces (Chinese-language article)

published an article 17 days ago

Article

Arcade-3B: SLM Optimization Based on Orthogonal Decoupling of Hidden-Layer State Spaces (Chinese-language article)

reacted to their post with 🤗 18 days ago

Post

5381

Arcade-3B — SmolReasoner
NoesisLab/Arcade-3B
Arcade-3B is a 3B instruction-following and reasoning model built on SmolLM3-3B. It is the public release from the ARCADE project at NoesisLab, which investigates the State–Constraint Orthogonality Hypothesis: standard Transformer hidden states conflate factual content and reasoning structure in the same subspace, and explicitly decoupling them improves generalization.

5 replies

posted an update 18 days ago

Post

5381

5 replies

liked a model 18 days ago

NoesisLab/Arcade-3B

Text Generation • 3B • Updated 17 days ago • 140 • 8

published a model 18 days ago

NoesisLab/Arcade-3B

Text Generation • 3B • Updated 17 days ago • 140 • 8

liked a dataset 18 days ago

OpenDataArena/MMFineReason-SFT-123K-Qwen3-VL-235B-Thinking

Viewer • Updated Feb 3 • 123k • 680 • 78

liked a model 18 days ago

Tesslate/OmniCoder-9B

Text Generation • Updated 20 days ago • 29.1k • 551

reacted to smirki's post with 👍 19 days ago

Post

621

Introducing OmniCoder-9B

We trained a 9B coding agent on 425K real agentic trajectories from Claude Opus 4.6, GPT-5.4, GPT-5.3-Codex, and Gemini 3.1 Pro across Claude Code, OpenCode, Codex, and Droid scaffolding.

Results:
- GPQA Diamond: 83.8 pass@1 (166/198), 86.4 pass@3 — above GPT-OSS-120B (80.1), Qwen3.5-9B (81.7), and Claude Haiku 4.5 (73)
- AIME 2025: 90 pass@5 (27/30)
- Terminal-Bench 2.0: 28.1 (25/89) — +8.1 points over base model
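
The percentages above can be cross-checked against the raw counts given in parentheses. This is a small arithmetic sketch using only the figures quoted in the post; the dictionary layout and the `pct` helper are illustrative, and the implied base-model score is derived from the stated +8.1 delta rather than reported directly.

```python
# (correct, total, reported score) exactly as quoted in the post
reported = {
    "GPQA Diamond pass@1": (166, 198, 83.8),
    "AIME 2025 pass@5": (27, 30, 90.0),
    "Terminal-Bench 2.0": (25, 89, 28.1),
}


def pct(correct: int, total: int) -> float:
    """Score as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


recomputed = {name: pct(c, t) for name, (c, t, _) in reported.items()}
for name, (c, t, score) in reported.items():
    status = "matches" if recomputed[name] == score else "differs"
    print(f"{name}: {c}/{t} = {recomputed[name]} ({status} reported {score})")

# Base-model Terminal-Bench score implied by the "+8.1 points" claim.
implied_base = round(28.1 - 8.1, 1)
print(f"Implied base-model Terminal-Bench 2.0 score: {implied_base}")
```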

The key insight: we trained on what frontier agents actually do, including real tool calls, real error recovery, and real edit diffs. The model learns read-before-write patterns, responds to LSP diagnostics, and applies minimal diffs instead of full rewrites.

Base: Qwen3.5-9B. LoRA SFT, 4x H200, Axolotl, 99.35% packing efficiency.

Weights: huggingface.co/Tesslate/OmniCoder-9B
GGUF: huggingface.co/Tesslate/OmniCoder-9B-GGUF
Apache 2.0. Run it locally.
