Instructions to use Qwen/Qwen2.5-VL-7B-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Qwen/Qwen2.5-VL-7B-Instruct with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("image-text-to-text", model="Qwen/Qwen2.5-VL-7B-Instruct") messages = [ { "role": "user", "content": [ {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"}, {"type": "text", "text": "What animal is on the candy?"} ] }, ] pipe(text=messages)# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct") model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct") messages = [ { "role": "user", "content": [ {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"}, {"type": "text", "text": "What animal is on the candy?"} ] }, ] inputs = processor.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Qwen/Qwen2.5-VL-7B-Instruct with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "Qwen/Qwen2.5-VL-7B-Instruct" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Qwen/Qwen2.5-VL-7B-Instruct", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }'Use Docker
docker model run hf.co/Qwen/Qwen2.5-VL-7B-Instruct
- SGLang
How to use Qwen/Qwen2.5-VL-7B-Instruct with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "Qwen/Qwen2.5-VL-7B-Instruct" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Qwen/Qwen2.5-VL-7B-Instruct", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "Qwen/Qwen2.5-VL-7B-Instruct" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Qwen/Qwen2.5-VL-7B-Instruct", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }' - Docker Model Runner
How to use Qwen/Qwen2.5-VL-7B-Instruct with Docker Model Runner:
docker model run hf.co/Qwen/Qwen2.5-VL-7B-Instruct
Bounding boxes coordinates
What rescaling should be done so that the bbox coordinates are matching the original image? I am seeing some mismatches but can't seem to figure what's the issue.
The bbox_2d coordinates are x1, y1, x2, y2 rather than x,y,w,h. And they will be relative to your resized image size if you are resizing. For example:
image = Image.open(image_path)
img_width, img_height = image.size
max_size = 1280
if max(image.size) > max_size:
ratio = max_size / max(image.size)
new_size = tuple(int(dim * ratio) for dim in image.size)
# set each dimension to be a multiple of 28
new_size = tuple(int(dim // 28) * 28 for dim in new_size)
image = image.resize(new_size, Image.LANCZOS)
img_width, img_height = image.size
then in the messages:
{
"role": "user",
"content": [
{
"type": "image",
"image": f"file://{image_path}",
"resized_width": img_width,
"resized_height": img_height,
},
.....
i cannot manage to get the coordinates right ... Please help!
i have a image with traffic-signs and like to detect the stop/bus-sign.
the original image has 1920*1080 pixels. With Max_Pixel 1280 i scale down
to a image-size of 1260x700 (28 Pixel Blocks, smaller 1280, X:45x28, Y:25x28)
Prompt for the 7b Model is: "Locate the Stop-sign and return the location in the form of coordinates in the format {'bbox_2d': [x1, y1, x2, y2]}."
i run the detection on the scaled image as Base64.
Result in X looks always good but y is offset (but wondering why stop is too less and bus is too much ...).

i use OLLAMA for this model - so the complete Ollama call is:
------------------------------JSON-----------------------------------
[
{
"role": "system",
"content": "You are a knowledgeable, efficient, and direct AI assistant. \r\nProvide concise answers, focusing on the key information needed. \r\nOffer suggestions tactfully when appropriate to improve outcomes. \r\nEngage in productive collaboration with the user."
},
{
"role": "user",
"content": "Locate the Bus-sign and return the location in the form of coordinates in the format {'bbox_2d': [x1, y1, x2, y2]}.",
"Images": [
"iVBORw0KGgoAAAANSUhEUgAABOw ... ly4cOHChQsXLlwMG7gk1oULFy5cuHDhwoULFy5cDBM4zv8HKdhcabzmo1oAAAAASUVORK5CYII="
]
}
]
Annotation is:
For Each item In items
Dim X1 As Integer = CInt(item("bbox_2d")(0))
Dim Y1 As Integer = CInt(item("bbox_2d")(1))
Dim X2 As Integer = CInt(item("bbox_2d")(2))
Dim Y2 As Integer = CInt(item("bbox_2d")(3))
BMP2.Draw(New Rectangle(X1, Y1, X2 - X1, Y2 - Y1), New Bgra(0, 0, 255, 255), 2)
Next
Could be also a problem with Ollama because there is no option (at least i don't found any) to set
"resized_width": img_width,
"resized_height": img_height,
maybe you have any sugestions how the y could be set to the correct position.
i cannot manage to get the coordinates right ... Please help!
i have a image with traffic-signs and like to detect the stop/bus-sign.
the original image has 1920*1080 pixels. With Max_Pixel 1280 i scale down
to a image-size of 1260x700 (28 Pixel Blocks, smaller 1280, X:45x28, Y:25x28)Prompt for the 7b Model is: "Locate the Stop-sign and return the location in the form of coordinates in the format {'bbox_2d': [x1, y1, x2, y2]}."
i run the detection on the scaled image as Base64.Result in X looks always good but y is offset (but wondering why stop is too less and bus is too much ...).
i use OLLAMA for this model - so the complete Ollama call is:
------------------------------JSON-----------------------------------
[
{
"role": "system",
"content": "You are a knowledgeable, efficient, and direct AI assistant. \r\nProvide concise answers, focusing on the key information needed. \r\nOffer suggestions tactfully when appropriate to improve outcomes. \r\nEngage in productive collaboration with the user."
},
{
"role": "user",
"content": "Locate the Bus-sign and return the location in the form of coordinates in the format {'bbox_2d': [x1, y1, x2, y2]}.",
"Images": [
"iVBORw0KGgoAAAANSUhEUgAABOw ... ly4cOHChQsXLlwMG7gk1oULFy5cuHDhwoULFy5cDBM4zv8HKdhcabzmo1oAAAAASUVORK5CYII="
]
}
]Annotation is:
For Each item In items Dim X1 As Integer = CInt(item("bbox_2d")(0)) Dim Y1 As Integer = CInt(item("bbox_2d")(1)) Dim X2 As Integer = CInt(item("bbox_2d")(2)) Dim Y2 As Integer = CInt(item("bbox_2d")(3)) BMP2.Draw(New Rectangle(X1, Y1, X2 - X1, Y2 - Y1), New Bgra(0, 0, 255, 255), 2) NextCould be also a problem with Ollama because there is no option (at least i don't found any) to set
"resized_width": img_width,
"resized_height": img_height,maybe you have any sugestions how the y could be set to the correct position.
Hi @Phreak87 , I struggled with that as well, I attempted to explain it in this medium post https://medium.com/@levchevajoana/qwen2-5-vl-with-mlx-vlm-c4329b40ab87. If you have any questions I’ll try to help.
same question,have you resolved the bbox offset issue on qwen2.5vl - 7B?
i cannot manage to get the coordinates right ... Please help!
i have a image with traffic-signs and like to detect the stop/bus-sign.
the original image has 1920*1080 pixels. With Max_Pixel 1280 i scale down
to a image-size of 1260x700 (28 Pixel Blocks, smaller 1280, X:45x28, Y:25x28)Prompt for the 7b Model is: "Locate the Stop-sign and return the location in the form of coordinates in the format {'bbox_2d': [x1, y1, x2, y2]}."
i run the detection on the scaled image as Base64.Result in X looks always good but y is offset (but wondering why stop is too less and bus is too much ...).
i use OLLAMA for this model - so the complete Ollama call is:
------------------------------JSON-----------------------------------
[
{
"role": "system",
"content": "You are a knowledgeable, efficient, and direct AI assistant. \r\nProvide concise answers, focusing on the key information needed. \r\nOffer suggestions tactfully when appropriate to improve outcomes. \r\nEngage in productive collaboration with the user."
},
{
"role": "user",
"content": "Locate the Bus-sign and return the location in the form of coordinates in the format {'bbox_2d': [x1, y1, x2, y2]}.",
"Images": [
"iVBORw0KGgoAAAANSUhEUgAABOw ... ly4cOHChQsXLlwMG7gk1oULFy5cuHDhwoULFy5cDBM4zv8HKdhcabzmo1oAAAAASUVORK5CYII="
]
}
]Annotation is:
For Each item In items Dim X1 As Integer = CInt(item("bbox_2d")(0)) Dim Y1 As Integer = CInt(item("bbox_2d")(1)) Dim X2 As Integer = CInt(item("bbox_2d")(2)) Dim Y2 As Integer = CInt(item("bbox_2d")(3)) BMP2.Draw(New Rectangle(X1, Y1, X2 - X1, Y2 - Y1), New Bgra(0, 0, 255, 255), 2) NextCould be also a problem with Ollama because there is no option (at least i don't found any) to set
"resized_width": img_width,
"resized_height": img_height,maybe you have any sugestions how the y could be set to the correct position.
Hi @Phreak87 , I struggled with that as well, I attempted to explain it in this medium post https://medium.com/@levchevajoana/qwen2-5-vl-with-mlx-vlm-c4329b40ab87. If you have any questions I’ll try to help.
hi, @ljoana . have you resolved the bbox offset issue on qwen2.5vl - 7B?
