metadata
language:
  - en
  - zh
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
  - qwen3
  - causal-lm
  - supervised-fine-tuning
  - math
  - reasoning
  - code
  - science
base_model: Qwen/Qwen3-4B-Base
model-index:
  - name: Qwen3-4B-SFT
    results:
      - task:
          type: text-generation
        dataset:
          name: AIME 2024
          type: aime2024
        metrics:
          - name: accuracy
            type: accuracy
            value: 20.8
      - task:
          type: text-generation
        dataset:
          name: AIME 2025
          type: aime2025
        metrics:
          - name: accuracy
            type: accuracy
            value: 19.4
      - task:
          type: text-generation
        dataset:
          name: AMC 2023
          type: amc2023
        metrics:
          - name: accuracy
            type: accuracy
            value: 58
      - task:
          type: text-generation
        dataset:
          name: GPQA-Diamond
          type: gpqa_diamond
        metrics:
          - name: accuracy
            type: accuracy
            value: 29.1

Qwen3-4B-SFT

Qwen3-4B-SFT is a reasoning-focused model derived from Qwen3-4B-Base through full-parameter supervised fine-tuning with the verl framework.

Reproducible 'warm-start' SFT bases are scarce in open-source practice. This model bridges the gap between base models and reinforcement-learning models: it is tuned for Chain-of-Thought (CoT) reasoning and instruction following, and is intended as a warm start for reinforcement learning.

| Dataset | Base (4B)† | Qwen3-4B-SFT (this model) | Improvement (percentage points) |
|---|---|---|---|
| AIME 2024 | 11.25% | 20.8% | +9.55 |
| AIME 2025 | 6.46% | 19.4% | +12.94 |
| AMC 2023 | 31.09% | 58.0% | +26.91 |
| GPQA-Diamond | 7.77% | 29.1% | +21.33 |

† Base (4B) figures are taken from arXiv:2602.10885.
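
The improvement column is an absolute difference in accuracy (percentage points), not a relative gain. A minimal sketch that recomputes it from the reported scores; the dictionary layout and the helper name `improvement_pp` are illustrative only:

```python
# (base, sft) accuracy in percent, copied from the results table.
scores = {
    "AIME 2024": (11.25, 20.8),
    "AIME 2025": (6.46, 19.4),
    "AMC 2023": (31.09, 58.0),
    "GPQA-Diamond": (7.77, 29.1),
}


def improvement_pp(base: float, sft: float) -> float:
    """Absolute improvement in percentage points, rounded to two decimals."""
    return round(sft - base, 2)


for name, (base, sft) in scores.items():
    print(f"{name}: +{improvement_pp(base, sft):.2f} pp")
```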

Qwen3-style reasoning and instruction following

Minimal pattern (illustrative):

<|im_start|>user
… Among options A–D, which is correct? Reason step by step and put the final letter in \boxed{}.
<|im_end|>

<|im_start|>assistant
<think>
Compare A vs B vs C vs D against the stem; eliminate …; D remains consistent with …
</think>
Step-by-step: … (short derivation in the visible channel)
Final answer: \boxed{D}
<|im_end|>

Use a large enough max_new_tokens on hard math so both the reasoning block and the visible \boxed{…} line fit before generation stops.
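
When scoring outputs that follow the pattern above, it can help to separate the `<think>` block from the visible answer and then pull out the `\boxed{…}` value. The following is a minimal sketch using only the standard library. The function names `split_reasoning` and `extract_boxed` are illustrative and not part of this release, and the parsing assumes the output matches the illustrative pattern shown earlier.

```python
import re
from typing import Optional, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, visible_answer) from a decoded assistant turn."""
    text = text.replace("<|im_end|>", "")
    match = THINK_RE.search(text)
    if match is None:
        # No reasoning block found: treat everything as the visible answer.
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last boxed answer, or None if absent or unclosed."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # closing brace never arrived, e.g. generation was cut off


sample = "<think>\nCompare options.\n</think>\nFinal answer: \\boxed{D}\n<|im_end|>"
reasoning, visible = split_reasoning(sample)
print(extract_boxed(visible))
```

A `None` result on a hard problem is often a sign that generation stopped before the boxed line was written, in which case raising max_new_tokens is the first thing to try.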

Configuration Notes

  • Template: Trained with the Qwen chat template; the model learns to end responses with <|im_end|> (token ID 151645).
  • Suggested Configuration:
    {
      "eos_token_id": 151645
    }
    

You may adjust settings according to your training or deployment needs.
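
One way to apply the suggested eos_token_id at inference time is sketched below with the Hugging Face transformers API. Several things here are assumptions rather than documented settings: the repository ID is taken from the citation URL, the 4096-token default is only a placeholder, `device_map="auto"` requires the accelerate package, and the helper names are hypothetical.

```python
EOS_TOKEN_ID = 151645  # <|im_end|>, as given in the configuration notes
MODEL_ID = "SeaFill2025/Qwen3-4B-SFT"  # assumed from the citation URL


def build_messages(question: str) -> list:
    """Wrap a single user question in the chat-message format."""
    return [{"role": "user", "content": question}]


def generation_kwargs(max_new_tokens: int = 4096) -> dict:
    """Keyword arguments for generate(); 4096 is an illustrative default."""
    return {"max_new_tokens": max_new_tokens, "eos_token_id": EOS_TOKEN_ID}


def generate(question: str, max_new_tokens: int = 4096) -> str:
    # Imported lazily so the helpers above work without torch installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # Render the conversation with the tokenizer's chat template.
    prompt = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **generation_kwargs(max_new_tokens))
    # Keep only the newly generated tokens, dropping the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("...")` downloads the checkpoint on first use; the two small helpers can be reused with other inference stacks that accept the same arguments.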

Training Infrastructure

  • Cluster: MeluXina Supercomputer (LuxProvide)
  • Node Config: 4 NVIDIA A100 GPUs per node
  • Final SFT Run: 12 node-hours (16× A100 for 3 hours)
  • Total R&D Investment: ~700 node-hours (includes data ablation, hyperparameter sweeps, and extensive benchmark evaluation)
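
The node-hour figures follow directly from the node configuration. A short sketch of the arithmetic, using only the numbers listed above (the GPU-hour conversions are derived here, not reported by the authors, and the helper names are illustrative):

```python
GPUS_PER_NODE = 4
FINAL_RUN_GPUS = 16
FINAL_RUN_HOURS = 3
TOTAL_NODE_HOURS = 700  # approximate figure from the list above


def node_hours(gpus: int, hours: float, gpus_per_node: int = GPUS_PER_NODE) -> float:
    """Convert a GPU count and wall-clock duration into node-hours."""
    return gpus / gpus_per_node * hours


def gpu_hours(node_hrs: float, gpus_per_node: int = GPUS_PER_NODE) -> float:
    """Convert node-hours into GPU-hours."""
    return node_hrs * gpus_per_node


final_node_hours = node_hours(FINAL_RUN_GPUS, FINAL_RUN_HOURS)
print(f"Final run: {final_node_hours:.0f} node-hours, "
      f"{gpu_hours(final_node_hours):.0f} GPU-hours")
print(f"Total R&D: about {gpu_hours(TOTAL_NODE_HOURS):.0f} GPU-hours")
print(f"Final run share of total: {final_node_hours / TOTAL_NODE_HOURS:.1%}")
```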

Limitations

  • Not optimized for factual correctness in all domains
  • May still produce hallucinations or unsafe outputs
  • Performance is sensitive to prompt style and decoding settings

Citation

If you use this model, please cite this checkpoint. BibTeX for this release:

@misc{qwen3-4b-sft-2026,
  title        = {{Qwen3-4B-SFT}: Supervised Fine-Tuned {Qwen3}-4B for Reasoning},
  author       = {Hongyang Li and Xiao Li and {Sea-Fill Community}},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SeaFill2025/Qwen3-4B-SFT}},
  note         = {Checkpoint trained with verl; warm-start for pre-RL alignment research. Maintained by Sea-Fill Community.}
}