---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# VLAA-Thinker-Qwen2VL-2B

This is a vision-language model built on the Qwen2VL architecture, as described in the paper *SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models*. It takes images and text as input and generates text as output.
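Below is a minimal inference sketch using the `transformers` library, following the standard Qwen2VL usage pattern. The repo id and example image URL are assumptions for illustration; check the project page for the official checkpoint name.

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Assumed repo id; verify against the project page.
model_id = "UCSC-VLAA/VLAA-Thinker-Qwen2VL-2B"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an example image (any PIL image works).
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt with one image and one text turn.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image step by step."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=512)
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```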

Project Page: https://ucsc-vlaa.github.io/VLAA-Thinking/

Code: https://github.com/UCSC-VLAA/VLAA-Thinking