tooltuned-qwen-3.5-4b

TL;DR

LoRA (rank 16) fine-tune of Qwen/Qwen3.5-4B for tool-calling, trained on Salesforce/xlam-function-calling-60k with Unsloth + TRL.

Result on BFCL V4: 79.0% (base 87.3%, -8.3pp)
Adapter: sukhrobnurali/tooltuned-qwen-3.5-4b
Format: bf16 LoRA adapter (Unsloth advises against 4-bit quant for Qwen 3.5)

BFCL V4 results

Gate disclosure (v1.0): This adapter is published below the +3pp BFCL gate defined in the project brief (delta -8.30pp on the in-tree V3 evaluator). The regression is concentrated in irrelevance / live_irrelevance categories -- see ADR 0006 for the locked diagnosis and the Phase 3.5 remediation spec (deferred for v1.0).

Model	Overall accuracy
Base (Qwen/Qwen3.5-4B)	87.3%
This adapter	79.0%
Delta	-8.3pp

Per-category breakdown

Category	Base	Tuned	Delta
irrelevance	80.0%	42.0%	-38.0pp
live_irrelevance	98.0%	78.0%	-20.0pp
live_multiple	78.0%	78.0%	+0.0pp
live_parallel	81.2%	68.8%	-12.5pp
live_parallel_multiple	95.8%	91.7%	-4.2pp
live_relevance	66.7%	77.8%	+11.1pp
live_simple	80.0%	74.0%	-6.0pp
multiple	92.0%	90.0%	-2.0pp
parallel	88.0%	88.0%	+0.0pp
parallel_multiple	98.0%	92.0%	-6.0pp
simple	90.0%	88.0%	-2.0pp

n=458, evaluated 2026-05-13
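
The overall figures can be cross-checked against the per-category table. The sketch below assumes per-category sample counts that this card does not state: 50 for most categories, and 16, 24, and 18 for live_parallel, live_parallel_multiple, and live_relevance. These counts are inferred from the reported percentages and n=458, not read from the evaluator, so treat them as an assumption.

```python
# (base %, tuned %) copied from the per-category table above.
RESULTS = {
    "irrelevance": (80.0, 42.0),
    "live_irrelevance": (98.0, 78.0),
    "live_multiple": (78.0, 78.0),
    "live_parallel": (81.2, 68.8),
    "live_parallel_multiple": (95.8, 91.7),
    "live_relevance": (66.7, 77.8),
    "live_simple": (80.0, 74.0),
    "multiple": (92.0, 90.0),
    "parallel": (88.0, 88.0),
    "parallel_multiple": (98.0, 92.0),
    "simple": (90.0, 88.0),
}

# ASSUMPTION: sample counts are inferred, not stated in the card.
SMALL_CATEGORIES = {
    "live_parallel": 16,
    "live_parallel_multiple": 24,
    "live_relevance": 18,
}
DEFAULT_COUNT = 50


def sample_count(category: str) -> int:
    return SMALL_CATEGORIES.get(category, DEFAULT_COUNT)


def correct(pct: float, n: int) -> int:
    """Convert a rounded percentage back to a whole number of correct examples."""
    return round(pct * n / 100)


total = sum(sample_count(c) for c in RESULTS)
base_correct = sum(correct(b, sample_count(c)) for c, (b, _) in RESULTS.items())
tuned_correct = sum(correct(t, sample_count(c)) for c, (_, t) in RESULTS.items())

base_acc = 100 * base_correct / total
tuned_acc = 100 * tuned_correct / total

# Examples lost (positive) or gained (negative) per category after tuning.
lost = {
    c: correct(b, sample_count(c)) - correct(t, sample_count(c))
    for c, (b, t) in RESULTS.items()
}
irrelevance_lost = lost["irrelevance"] + lost["live_irrelevance"]

print(f"n={total} base={base_acc:.1f}% tuned={tuned_acc:.1f}% "
      f"delta={tuned_acc - base_acc:+.2f}pp")
print(f"irrelevance categories: {irrelevance_lost} of "
      f"{base_correct - tuned_correct} net examples lost")
```

Under these assumed counts, the example-weighted totals match the reported 87.3% and 79.0%, and the two irrelevance categories account for 29 of the 38 net examples lost.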

Training data

Sources: xlam (Salesforce/xlam-function-calling-60k)
Samples used: 10,000
Validation fraction: 0.05
Held-out fraction: 0.05
Thinking-mode strategy: preserve (preserves Qwen 3.5's default reasoning trace; xLAM rows have no <think> content so the three strategies converge in practice)
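
As a rough illustration of the split sizes, the sketch below assumes both fractions are taken from the 10,000 sampled rows. The card does not say whether the held-out fraction is applied before or after the validation split, so the exact counts in the repository may differ.

```python
SAMPLES_USED = 10_000
VALIDATION_FRACTION = 0.05
HELD_OUT_FRACTION = 0.05

# ASSUMPTION: both fractions are computed against the full 10,000 rows.
val_size = round(SAMPLES_USED * VALIDATION_FRACTION)
held_out_size = round(SAMPLES_USED * HELD_OUT_FRACTION)
train_size = SAMPLES_USED - val_size - held_out_size

print(f"train={train_size} validation={val_size} held_out={held_out_size}")
```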

Training procedure

Supervised fine-tuning via Unsloth's FastLanguageModel + TRL's SFTTrainer. Adapter only — base weights are frozen. Single A100 (40 GB), bf16, gradient checkpointing on.

Hyperparameters

Knob	Value
`base_model`	Qwen/Qwen3.5-4B
`lora.rank`	16
`lora.alpha`	32
`lora.dropout`	0.0
`lora.target_modules`	q_proj, k_proj, v_proj, o_proj
`optimizer`	adamw_8bit
`learning_rate`	0.0002
`warmup_ratio`	0.03
`weight_decay`	0.0
`batch_size`	16
`grad_accum_steps`	1
`effective_batch_size`	16
`epochs`	1
`max_steps`	n/a
`max_seq_len`	2048
`packing`	True
`seed`	42
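
A few quantities follow directly from the table. The sketch below derives the effective batch size and the standard LoRA scaling factor (alpha divided by rank), and shows how adapter size grows with rank for one adapted projection. The 4096 x 4096 projection shape is purely illustrative and is not the actual Qwen 3.5 4B dimension.

```python
# Values copied from the hyperparameter table above.
BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 1
LORA_RANK = 16
LORA_ALPHA = 32

# Examples consumed per optimizer step.
effective_batch_size = BATCH_SIZE * GRAD_ACCUM_STEPS

# Standard LoRA scales the low-rank update B @ A by alpha / rank.
lora_scale = LORA_ALPHA / LORA_RANK


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


# HYPOTHETICAL shape, for illustration only.
example_params = lora_params(d_in=4096, d_out=4096, rank=LORA_RANK)

print(f"effective_batch_size={effective_batch_size}")
print(f"lora_scale={lora_scale}")
print(f"params per 4096x4096 projection at rank {LORA_RANK}: {example_params}")
```

Because adapter size is linear in rank, doubling the rank doubles the trainable parameters for each of the four targeted projections.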

Intended use

Function calling / tool use in chat agents. The adapter pairs with the base Qwen 3.5 4B chat template; pass tool schemas in the system prompt and the model emits <tool_call> blocks (or XML-tagged <function=...> calls; the inference helper parses both).
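
For readers who want to handle both output styles without the repository's inference helper, here is a minimal parsing sketch. It is not the helper itself. The JSON payload shape (`name` and `arguments` keys) and the `<parameter=...>` tags inside `<function=...>` are assumptions about the output format, and `parse_tool_calls` is a hypothetical name.

```python
import json
import re
from typing import Any

# ASSUMPTION: JSON-style calls look like <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
# ASSUMPTION: XML-style calls look like <function=NAME><parameter=KEY>VALUE</parameter></function>
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\s*(.*?)\s*</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict[str, Any]]:
    """Extract tool calls from model output; returns [] when none are found.

    JSON-style calls are listed before XML-style calls, so the original
    ordering is not preserved if a single response mixes both styles.
    """
    calls: list[dict[str, Any]] = []

    # JSON-style: the body of each <tool_call> block is a JSON object.
    for body in TOOL_CALL_RE.findall(text):
        if not body.startswith("{"):
            continue  # probably an XML-style body, handled below
        try:
            obj = json.loads(body)
        except json.JSONDecodeError:
            continue  # skip malformed payloads rather than raising
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"],
                          "arguments": obj.get("arguments", {})})

    # XML-style: parameter values are kept as raw strings (no type coercion).
    for name, inner in FUNCTION_RE.findall(text):
        args = {key: value for key, value in PARAM_RE.findall(inner)}
        calls.append({"name": name, "arguments": args})

    return calls


if __name__ == "__main__":
    sample = ('<tool_call>\n{"name": "get_weather", '
              '"arguments": {"city": "Paris"}}\n</tool_call>')
    print(parse_tool_calls(sample))
```

An empty result is meaningful here: on irrelevance-style prompts, where no available tool fits the request, the desired behavior is that no call is emitted.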

Out of scope

Non-English instruction following (xLAM is English-only).
Long-context tool dialogues beyond 2,048 tokens — the adapter was trained at that sequence length.
Safety-critical decisions. The adapter inherits Qwen 3.5's safety profile; no additional alignment was applied.

Limitations

LoRA rank 16 is a known-safe default, not an ablated optimum. Higher ranks may shift the BFCL score in either direction; rank ablations are a stretch goal.
BFCL covers only one slice of tool-calling behavior; performance on task families outside that distribution (multi-turn agentic loops, fully novel APIs) is not directly measured.

License

Apache-2.0, matching the base model Qwen/Qwen3.5-4B.

Reproduction

Source: https://github.com/sukhrobnurali/tooltuned-qwen. Pinned versions live in pyproject.toml; the lockfile (uv.lock) is the reproducibility contract.

git clone https://github.com/sukhrobnurali/tooltuned-qwen
cd tooltuned-qwen
uv sync
# Run on Colab Pro A100; see notebooks/colab_main.ipynb

Training curves: https://wandb.ai/sukhrob-production/tooltuned-qwen.

Citation

@misc{nurali_tooltuned_qwen_2026,
  author       = {Sukhrob Nurali},
  title        = {tooltuned-qwen-3.5-4b: a tool-calling LoRA for Qwen 3.5 4B},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/sukhrobnurali/tooltuned-qwen-3.5-4b}}
}

Author

Sukhrob Nurali — sukhrobnurali@gmail.com
Hugging Face: sukhrobnurali
GitHub: sukhrobnurali
