tooltuned-qwen-3.5-4b

TL;DR

LoRA (rank 16) fine-tune of Qwen/Qwen3.5-4B for tool-calling, trained on Salesforce/xlam-function-calling-60k with Unsloth + TRL.

BFCL V4 results

Gate disclosure (v1.0): This adapter is published even though it misses the +3pp BFCL gate defined in the project brief (delta -8.30pp on the in-tree V3 evaluator). The regression is concentrated in the irrelevance and live_irrelevance categories; see ADR 0006 for the locked diagnosis and the Phase 3.5 remediation spec (deferred for v1.0).

Model                     Overall accuracy
Base (Qwen/Qwen3.5-4B)    87.3%
This adapter              79.0%
Delta                     -8.3pp
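
The gate arithmetic can be sketched as follows. This is a minimal illustration, assuming the gate is read as "tuned accuracy minus base accuracy must be at least +3pp"; the variable names are hypothetical and not taken from the project's evaluator.

```python
# Accuracies copied from the table above; the +3pp threshold comes from
# the gate disclosure. Reading the gate as (tuned - base) >= +3pp is an
# assumption made for this sketch.
BASE_ACC = 87.3
TUNED_ACC = 79.0
GATE_PP = 3.0  # minimum required improvement, in percentage points

delta_pp = round(TUNED_ACC - BASE_ACC, 1)
passes_gate = delta_pp >= GATE_PP
shortfall_pp = round(GATE_PP - delta_pp, 1)

print(f"delta = {delta_pp:+.1f}pp, gate = +{GATE_PP:.1f}pp, passes = {passes_gate}")
print(f"shortfall vs. gate = {shortfall_pp:.1f}pp")
```

Under that reading the adapter sits 11.3pp short of the gate: 8.3pp of regression plus the 3pp of required improvement.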

Per-category breakdown

Category                  Base     Tuned    Delta
irrelevance               80.0%    42.0%    -38.0pp
live_irrelevance          98.0%    78.0%    -20.0pp
live_multiple             78.0%    78.0%    +0.0pp
live_parallel             81.2%    68.8%    -12.5pp
live_parallel_multiple    95.8%    91.7%    -4.2pp
live_relevance            66.7%    77.8%    +11.1pp
live_simple               80.0%    74.0%    -6.0pp
multiple                  92.0%    90.0%    -2.0pp
parallel                  88.0%    88.0%    +0.0pp
parallel_multiple         98.0%    92.0%    -6.0pp
simple                    90.0%    88.0%    -2.0pp

n=458, evaluated 2026-05-13
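
To make the per-category breakdown easier to audit, here is a small sketch that ranks categories by reported delta and recomputes each delta from the rounded Base and Tuned columns. The numbers are copied from the table; the data layout and names are illustrative only. Two rows differ by 0.1pp when recomputed this way, presumably because the reported deltas were computed from unrounded accuracies (an assumption, since per-category sample counts are not listed here).

```python
# (base %, tuned %, reported delta pp), copied from the per-category table.
RESULTS = {
    "irrelevance": (80.0, 42.0, -38.0),
    "live_irrelevance": (98.0, 78.0, -20.0),
    "live_multiple": (78.0, 78.0, 0.0),
    "live_parallel": (81.2, 68.8, -12.5),
    "live_parallel_multiple": (95.8, 91.7, -4.2),
    "live_relevance": (66.7, 77.8, 11.1),
    "live_simple": (80.0, 74.0, -6.0),
    "multiple": (92.0, 90.0, -2.0),
    "parallel": (88.0, 88.0, 0.0),
    "parallel_multiple": (98.0, 92.0, -6.0),
    "simple": (90.0, 88.0, -2.0),
}

# Rank categories from worst regression to best improvement.
ranked = sorted(RESULTS, key=lambda name: RESULTS[name][2])

# Flag rows whose reported delta differs from (tuned - base) on the
# rounded columns by more than rounding noise.
mismatches = [
    name
    for name, (base, tuned, reported) in RESULTS.items()
    if abs(round(tuned - base, 1) - reported) > 0.05
]

for name in ranked[:3]:
    print(f"{name}: {RESULTS[name][2]:+.1f}pp")
print("recomputed delta differs for:", mismatches)
```

The two irrelevance categories head the ranking, which matches the gate disclosure's statement about where the regression is concentrated.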

Training data

  • Sources: xlam (Salesforce/xlam-function-calling-60k)
  • Samples used: 10,000
  • Validation fraction: 0.05
  • Held-out fraction: 0.05
  • Thinking-mode strategy: preserve (preserves Qwen 3.5's default reasoning trace; xLAM rows have no <think> content so the three strategies converge in practice)
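
As a rough illustration of the split sizes implied by these fractions, the sketch below partitions 10,000 row indices into train, validation, and held-out sets. It assumes both fractions apply to the 10,000 sampled rows and reuses the training seed (42) for shuffling; neither detail is confirmed here, and `split_indices` is a hypothetical helper, not the project's data pipeline.

```python
import random


def split_indices(n_rows, val_frac, heldout_frac, seed):
    """Shuffle row indices and cut them into train / validation / held-out."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # deterministic for a fixed seed
    n_val = round(n_rows * val_frac)
    n_heldout = round(n_rows * heldout_frac)
    val = indices[:n_val]
    heldout = indices[n_val:n_val + n_heldout]
    train = indices[n_val + n_heldout:]
    return train, val, heldout


# Assumption: fractions are taken over the 10,000 sampled rows.
train_idx, val_idx, heldout_idx = split_indices(10_000, 0.05, 0.05, seed=42)
print(len(train_idx), len(val_idx), len(heldout_idx))
```

Under these assumptions roughly 9,000 rows remain for training, with 500 each for validation and the held-out set.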

Training procedure

Supervised fine-tuning via Unsloth's FastLanguageModel + TRL's SFTTrainer. Adapter only — base weights are frozen. Single A100 (40 GB), bf16, gradient checkpointing on.

Hyperparameters

Knob                    Value
base_model              Qwen/Qwen3.5-4B
lora.rank               16
lora.alpha              32
lora.dropout            0.0
lora.target_modules     q_proj, k_proj, v_proj, o_proj
optimizer               adamw_8bit
learning_rate           0.0002
warmup_ratio            0.03
weight_decay            0.0
batch_size              16
grad_accum_steps        1
effective_batch_size    16
epochs                  1
max_steps               n/a
max_seq_len             2048
packing                 True
seed                    42
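
A few derived quantities follow directly from the table. The sketch below assumes standard LoRA scaling (alpha divided by rank, not the rank-stabilized variant) and a single device, as stated under Training procedure; the dictionary keys are illustrative and are not the project's config schema.

```python
# Values copied from the hyperparameter table.
hparams = {
    "lora_rank": 16,
    "lora_alpha": 32,
    "batch_size": 16,
    "grad_accum_steps": 1,
    "max_seq_len": 2048,
}

# Standard LoRA multiplies the low-rank update by alpha / rank (assumed here).
lora_scale = hparams["lora_alpha"] / hparams["lora_rank"]

# On a single device, one optimizer step sees batch_size * grad_accum_steps sequences.
effective_batch_size = hparams["batch_size"] * hparams["grad_accum_steps"]

# With packing, each sequence can hold up to max_seq_len tokens, so this is an upper bound.
max_tokens_per_step = effective_batch_size * hparams["max_seq_len"]

print(lora_scale, effective_batch_size, max_tokens_per_step)
```

The effective batch size of 16 agrees with the table's effective_batch_size row.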

Intended use

Function calling / tool use in chat agents. The adapter pairs with the base Qwen 3.5 4B chat template; pass tool schemas in the system prompt and the model emits <tool_call> blocks (or XML-tagged <function=...> calls; the inference helper parses both).
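
For readers wiring the adapter into their own agent loop, here is a minimal parsing sketch for both output styles. It is not the project's inference helper: the function name, the assumption that JSON calls carry "name" and "arguments" keys, and the assumption that XML-style calls nest <parameter=...> tags inside <function=...> are all illustrative guesses about the output shape.

```python
import json
import re

# Hypothetical patterns; the real inference helper may be stricter or looser.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def _coerce(value: str):
    """Read a parameter value as JSON when possible, else keep the raw string."""
    text = value.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def parse_tool_calls(completion: str) -> list[dict]:
    """Extract tool calls from a completion, accepting JSON or XML-tagged bodies."""
    calls = []
    for body in TOOL_CALL_RE.findall(completion):
        if body.lstrip().startswith("{"):
            # JSON style: {"name": ..., "arguments": {...}}
            try:
                obj = json.loads(body)
            except json.JSONDecodeError:
                continue  # skip malformed blocks instead of raising
            calls.append({"name": obj.get("name"), "arguments": obj.get("arguments", {})})
        else:
            # XML style: <function=name><parameter=key>value</parameter></function>
            for name, inner in FUNCTION_RE.findall(body):
                args = {key: _coerce(val) for key, val in PARAM_RE.findall(inner)}
                calls.append({"name": name, "arguments": args})
    return calls


sample = (
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>\n'
    "<tool_call>\n<function=add>\n<parameter=a>\n1\n</parameter>\n"
    "<parameter=b>\n2\n</parameter>\n</function>\n</tool_call>"
)
parsed = parse_tool_calls(sample)
print(parsed)
```

A completion with no <tool_call> block yields an empty list, which a caller can treat as "no tool needed". That case matters here given the regression in the irrelevance categories.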

Out of scope

  • Non-English instruction following (xLAM is English-only).
  • Long-context tool dialogues beyond 2,048 tokens — the adapter was trained at that sequence length.
  • Safety-critical decisions. The adapter inherits Qwen 3.5's safety profile; no additional alignment was applied.

Limitations

  • LoRA rank 16 is a known-safe default, not an ablated optimum. Higher ranks may move the BFCL number further; rank ablations are a stretch goal.
  • BFCL covers only one slice of tool-calling behavior; performance on task families outside that distribution (multi-turn agentic loops, fully novel APIs) is not directly measured.

License

Apache-2.0, matching the base model Qwen/Qwen3.5-4B.

Reproduction

Source: https://github.com/sukhrobnurali/tooltuned-qwen. Pinned versions live in pyproject.toml; the lockfile (uv.lock) is the reproducibility contract.

git clone https://github.com/sukhrobnurali/tooltuned-qwen
cd tooltuned-qwen
uv sync
# Run on Colab Pro A100; see notebooks/colab_main.ipynb

Training curves: https://wandb.ai/sukhrob-production/tooltuned-qwen.

Citation

@misc{nurali_tooltuned_qwen_2026,
  author       = {Sukhrob Nurali},
  title        = {tooltuned-qwen-3.5-4b: a tool-calling LoRA for Qwen 3.5 4B},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/sukhrobnurali/tooltuned-qwen-3.5-4b}}
}

Author

Sukhrob Nurali
