The Patho-R1-3B model and its associated materials are released under the CC-BY-NC-ND 4.0 license. Access is restricted to non-commercial, academic research purposes only, with proper citation required. Any commercial usage, redistribution, or derivative work (including training models based on this model or generating datasets from its outputs) is strictly prohibited without prior written approval.
Users must register with an official institutional email address (generic domains such as @gmail, @qq, @hotmail, etc. will not be accepted). By requesting access, you confirm that your information is accurate and current, and that you agree to comply with all terms listed herein. If other members of your organization wish to use the model, they must register independently and agree to the same terms.

Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner

[arXiv] | [GitHub Repo] | [Cite]

Introduction📝

While vision-language models have shown impressive progress in general medical domains, pathology remains a challenging subfield due to its high-resolution image requirements and complex diagnostic reasoning.

To address this gap, we introduce Patho-R1-3B, a multimodal pathology reasoner designed to enhance diagnostic understanding through structured reasoning. Patho-R1-3B is trained using a three-stage pipeline:

Continued pretraining on 3.5M pathology figure-caption pairs for domain knowledge acquisition
Supervised fine-tuning on 500k expert-annotated Chain-of-Thought samples to encourage reasoning
Reinforcement learning with Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) to refine response quality
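
The third stage can be illustrated with a minimal sketch of the two ideas named in the method: decoupled clipping and dynamic sampling. This is not the training code used for Patho-R1-3B. The function names, the group-normalized advantage, and the values `eps_low=0.2` and `eps_high=0.28` are assumptions chosen for illustration only.

```python
from statistics import mean, pstdev


def keep_group(rewards):
    """Dynamic sampling: keep a prompt only if its sampled responses
    received different rewards (otherwise there is no learning signal)."""
    return len(set(rewards)) > 1


def group_advantages(rewards):
    """Normalize rewards within one group of responses to the same prompt.
    Returns None when all rewards are equal (such a group would be filtered)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return None
    return [(r - mu) / sigma for r in rewards]


def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Decoupled clip: the probability ratio is bounded by 1 - eps_low below
    and 1 + eps_high above, with the two bounds set independently."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (minimum) surrogate, as in PPO-style objectives.
    return min(ratio * advantage, clipped * advantage)


# Hypothetical rewards for three prompts, four sampled responses each.
groups = [[1, 1, 1, 1], [1, 0, 0, 1], [0, 0, 0, 0]]
kept = [g for g in groups if keep_group(g)]
print(kept)
print(group_advantages(kept[0]))
```

In this sketch only the second group survives filtering, because the other two contain identical rewards and would contribute zero advantage.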

Experimental results show that Patho-R1-3B achieves strong performance across key pathology tasks, including multiple choice questions and visual question answering, highlighting its potential for real-world pathology AI applications.

Quickstart🏃

The snippet below shows how to run the chat model with transformers and qwen_vl_utils:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "WenchuanZhang/Patho-R1-3B",
    torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("WenchuanZhang/Patho-R1-3B")

# Example question from the PathMMU test set (ground truth: D).
# Reasoning style options (choose one system prompt):
# - Chain-of-Draft (CoD), a concise reasoning prompting strategy:
#   You are a pathology expert, your task is to think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator. Use the following format:<think> Your step-by-step reasoning </think><answer> Your final answer </answer>
# - Chain-of-Thought (CoT), used in the messages below:
messages = [
    {
        "role": "system",
        "content": "You are a pathology expert, your task is to answer question step by step. Use the following format:<think> Your step-by-step reasoning </think><answer> Your final answer </answer>",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./images/example.jpg",
            },
            {"type": "text", "text": "What feature in the provided micrograph is indicative of chronic inflammation?\nA. Granuloma formation\nB. Multinucleated giant cells\nC. Neutrophilic infiltration\nD. Plasma cells with eccentrically placed nuclei"},
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
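
The two reasoning styles and the `<think>`/`<answer>` format above lend themselves to small helpers. The following is a minimal sketch, not part of the released code: `build_messages` and `parse_response` are hypothetical names, the prompts are copied from the snippet above, and parsing assumes the model actually follows the requested tag format (it returns `None` for any missing part).

```python
import re

# System prompts copied verbatim from the Quickstart snippet.
COT_PROMPT = (
    "You are a pathology expert, your task is to answer question step by step. "
    "Use the following format:<think> Your step-by-step reasoning </think>"
    "<answer> Your final answer </answer>"
)
COD_PROMPT = (
    "You are a pathology expert, your task is to think step by step, but only "
    "keep a minimum draft for each thinking step, with 5 words at most. Return "
    "the answer at the end of the response after a separator. Use the following "
    "format:<think> Your step-by-step reasoning </think>"
    "<answer> Your final answer </answer>"
)


def build_messages(image_path, question, style="cot"):
    """Build the chat messages for one image and question.
    style is "cot" (Chain-of-Thought) or "cod" (Chain-of-Draft)."""
    prompts = {"cot": COT_PROMPT, "cod": COD_PROMPT}
    return [
        {"role": "system", "content": prompts[style]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        },
    ]


def parse_response(text):
    """Split a response into (reasoning, answer); either may be None."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )
```

With these helpers, `messages = build_messages("./images/example.jpg", question, style="cod")` would replace the hand-written list, and `parse_response(output_text[0])` would separate the reasoning from the final answer.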

Acknowledgements🎖

We gratefully acknowledge the contributions of the open-source community, particularly the following projects which laid the foundation for various components of this work:

Qwen for providing powerful vision language models that significantly advanced our multimodal understanding and generation capabilities.
DocLayout-YOLO for document layout detection.
PaddleOCR for comprehensive optical character recognition.
ModelScope Swift for efficient model serving and deployment tools.
LLaMA-Factory for robust LLM training and fine-tuning pipelines.
VERL for its reinforcement learning training framework for large language models.
DeepSeek for high-quality models and infrastructure supporting text understanding.

We thank the authors and contributors of these repositories for their dedication and impactful work, which made our development of Patho-R1-3B possible.

Citation❤️

If you find our work helpful, a citation would be greatly appreciated:

@article{zhang2025patho,
  title={Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner},
  author={Zhang, Wenchuan and Zhang, Penghao and Guo, Jingru and Cheng, Tao and Chen, Jie and Zhang, Shuwan and Zhang, Zhang and Yi, Yuhao and Bu, Hong},
  journal={arXiv preprint arXiv:2505.11404},
  year={2025}
}

Safetensors | Model size: 4B params | Tensor type: BF16