Bounding boxes coordinates

#13

by ljoana - opened Feb 4, 2025

Discussion

ljoana

Feb 4, 2025

What rescaling should be done so that the bbox coordinates match the original image? I am seeing some mismatches but can't figure out what the issue is.

allenjs626

Feb 5, 2025

The bbox_2d coordinates are x1, y1, x2, y2 rather than x, y, w, h, and they are relative to the resized image size if you are resizing. For example:

from PIL import Image

image = Image.open(image_path)  # image_path: path to your input image
img_width, img_height = image.size
max_size = 1280
if max(image.size) > max_size:
    ratio = max_size / max(image.size)
    new_size = tuple(int(dim * ratio) for dim in image.size)
    # set each dimension to be a multiple of 28
    new_size = tuple(int(dim // 28) * 28 for dim in new_size)
    image = image.resize(new_size, Image.LANCZOS)
    img_width, img_height = image.size

then in the messages:

{
    "role": "user",
    "content": [
        {
            "type": "image",
            "image": f"file://{image_path}",
            "resized_width": img_width,
            "resized_height": img_height,
        },

.....
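Since the returned box is relative to the resized image, drawing it on the original image means scaling each axis back separately. The following is a minimal sketch of that step, not code from this thread; the function name `scale_bbox_to_original` and its arguments are hypothetical, and it assumes the box really is expressed in pixels of the resized image.

```python
def scale_bbox_to_original(bbox, resized_size, original_size):
    """Map an (x1, y1, x2, y2) box from resized-image pixels to original-image pixels.

    Hypothetical helper: assumes the model's box is in pixels of the resized image.
    """
    x1, y1, x2, y2 = bbox
    resized_w, resized_h = resized_size
    orig_w, orig_h = original_size
    # Scale x by the width ratio and y by the height ratio; the two ratios
    # can differ when each side was rounded down to a multiple of 28.
    return (
        round(x1 * orig_w / resized_w),
        round(y1 * orig_h / resized_h),
        round(x2 * orig_w / resized_w),
        round(y2 * orig_h / resized_h),
    )


# A box found on a 1260x700 resized image, mapped back to 1920x1080.
print(scale_bbox_to_original((126, 70, 630, 350), (1260, 700), (1920, 1080)))
```

If you draw directly on the resized image instead, no scaling is needed.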

Phreak87

Jun 16, 2025

•

edited Jun 16, 2025

I cannot manage to get the coordinates right. Please help!

I have an image with traffic signs and would like to detect the stop and bus signs.
The original image is 1920x1080 pixels. With Max_Pixel set to 1280, I scale it down
to an image size of 1260x700 (blocks of 28 pixels, below 1280; X: 45x28, Y: 25x28).

The prompt for the 7B model is: "Locate the Stop-sign and return the location in the form of coordinates in the format {'bbox_2d': [x1, y1, x2, y2]}."
I run the detection on the scaled image, passed as Base64.

The result in X always looks good, but Y is offset (though I wonder why it is too small for the stop sign and too large for the bus sign).

I use Ollama for this model, so the complete Ollama call is:

------------------------------JSON-----------------------------------
[
    {
        "role": "system",
        "content": "You are a knowledgeable, efficient, and direct AI assistant. \r\nProvide concise answers, focusing on the key information needed. \r\nOffer suggestions tactfully when appropriate to improve outcomes. \r\nEngage in productive collaboration with the user."
    },
    {
        "role": "user",
        "content": "Locate the Bus-sign and return the location in the form of coordinates in the format {'bbox_2d': [x1, y1, x2, y2]}.",
        "Images": [
            "iVBORw0KGgoAAAANSUhEUgAABOw ... ly4cOHChQsXLlwMG7gk1oULFy5cuHDhwoULFy5cDBM4zv8HKdhcabzmo1oAAAAASUVORK5CYII="
        ]
    }
]

Annotation is:

            For Each item In items
                Dim X1 As Integer = CInt(item("bbox_2d")(0))
                Dim Y1 As Integer = CInt(item("bbox_2d")(1))
                Dim X2 As Integer = CInt(item("bbox_2d")(2))
                Dim Y2 As Integer = CInt(item("bbox_2d")(3))
                BMP2.Draw(New Rectangle(X1, Y1, X2 - X1, Y2 - Y1), New Bgra(0, 0, 255, 255), 2)
            Next
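For readers following along in Python, the same step (read each `bbox_2d`, then turn x1, y1, x2, y2 into x, y, width, height for drawing) might look like the sketch below. It is an illustration only: the helper names are hypothetical, and it assumes the model's reply contains the box in the `{'bbox_2d': [x1, y1, x2, y2]}` form requested by the prompt, with either single or double quotes.

```python
import re

# Matches bbox_2d followed by a bracketed list, with or without quotes around the key.
BBOX_PATTERN = re.compile(r"bbox_2d['\"]?\s*:\s*\[([^\]]*)\]")


def extract_bboxes(reply: str) -> list:
    """Return every (x1, y1, x2, y2) box found in the model's reply text."""
    boxes = []
    for match in BBOX_PATTERN.finditer(reply):
        parts = [part.strip() for part in match.group(1).split(",")]
        if len(parts) != 4:
            continue  # skip malformed boxes instead of guessing
        try:
            boxes.append(tuple(float(part) for part in parts))
        except ValueError:
            continue  # skip boxes with non-numeric entries
    return boxes


def to_rectangle(box):
    """Convert (x1, y1, x2, y2) to (x, y, width, height), as the drawing call expects."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


for box in extract_bboxes("{'bbox_2d': [10, 20, 110, 220]}"):
    print(to_rectangle(box))
```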

It could also be a problem with Ollama, because there is no option (at least I have not found one) to set
"resized_width": img_width,
"resized_height": img_height,

Maybe you have some suggestions for how to get Y to the correct position.
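The resize arithmetic described in this post (1920x1080 scaled to 1260x700) can be checked with a short sketch. The function name `fit_to_patch_grid` is hypothetical, and integer division is used here in place of the floating-point ratio from the earlier snippet, which is an assumption made to keep the arithmetic exact.

```python
def fit_to_patch_grid(width, height, max_side=1280, patch=28):
    """Shrink so the longest side is at most max_side, then floor each side to a multiple of patch."""
    longest = max(width, height)
    if longest > max_side:
        # Integer arithmetic avoids floating-point rounding surprises.
        width = width * max_side // longest
        height = height * max_side // longest
        width = (width // patch) * patch
        height = (height // patch) * patch
    return width, height


resized_w, resized_h = fit_to_patch_grid(1920, 1080)
print(resized_w, resized_h)  # 1260 700

# Per-axis factors for mapping a box back onto the 1920x1080 original.
scale_x = 1920 / resized_w
scale_y = 1080 / resized_h
print(scale_x, scale_y)
```

Because each side is floored to a multiple of 28 independently, the horizontal and vertical scale factors are slightly different (about 1.524 and 1.543 here), so a single factor applied to both axes would shift Y on the original image. Whether that accounts for the offset reported in this post is not established by the thread.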

ljoana

Jun 16, 2025

•

edited Jun 16, 2025

Hi @Phreak87, I struggled with that as well. I attempted to explain it in this Medium post: https://medium.com/@levchevajoana/qwen2-5-vl-with-mlx-vlm-c4329b40ab87. If you have any questions, I'll try to help.

Phreak87

Jun 17, 2025

Thank you so much for your help!

The grounding does not seem to work correctly in the 7B variants. With the
3B-parameter model it worked on the first try (it detected and annotated the horse sign):

wanghongyu1111

Jan 4

Same question. Have you resolved the bbox offset issue on Qwen2.5-VL-7B?

wanghongyu1111

Jan 6

•

edited Jan 6

Hi @ljoana, have you resolved the bbox offset issue on Qwen2.5-VL-7B?
