	---
	license: apache-2.0
	language:
	- multilingual
	base_model:
	- Qwen/Qwen2-VL-7B-Instruct
	tags:
	- mmeb
	- multimodal-embedding
	pipeline_tag: feature-extraction
	---
	# Ops-MM-embedding-v1-7B

	Ops-MM-embedding-v1-7B is a dense, large-scale multimodal embedding model developed and open-sourced by the Alibaba Cloud OpenSearch-AI team, fine-tuned from Qwen2-VL.


	## Key Features

	### Unified Multimodal Embeddings
	- Encodes text, images, text-image pairs, visual documents, and videos (by treating video frames as multiple image inputs) into a unified embedding space for cross-modal retrieval.
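
Because all modalities share one embedding space, cross-modal retrieval reduces to a nearest-neighbor search over vectors. The following is a minimal sketch of that idea using NumPy and toy 3-dimensional vectors in place of real model outputs. The helper names `l2_normalize` and `retrieve` are hypothetical and are not part of this model's API.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def retrieve(query_emb: np.ndarray, corpus_emb: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Return, for each query row, the indices of the top_k most similar corpus rows."""
    scores = l2_normalize(query_emb) @ l2_normalize(corpus_emb).T
    # Negate the scores so argsort orders from most to least similar.
    return np.argsort(-scores, axis=1)[:, :top_k]


# Toy vectors standing in for text-query and image-corpus embeddings (assumed values).
text_queries = np.array([[1.0, 0.1, 0.0], [0.0, 0.2, 1.0]])
image_corpus = np.array([[0.9, 0.0, 0.1], [0.1, 1.0, 0.0], [0.0, 0.1, 0.8]])

ranked = retrieve(text_queries, image_corpus)
print(ranked.tolist())
```

In practice the two arrays would come from the model's text and image embedding calls, and the same ranking step would apply regardless of which modality is the query.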

	### High Performance on MMEB
	- Achieves state-of-the-art (SOTA) results among models of similar scale on the MMEB-V2 and MMEB-Image benchmarks (as of 2025-07-03).

	### Multilingual Capabilities
	- Ops-MM-embedding-v1-7B achieves SOTA performance among dense models on the ViDoRe-v2 benchmark, demonstrating strong cross-lingual generalization.



	## Training data

	MMEB-train, CC-3M, and the ColPali training set.


	## Performance

	### MMEB-V2

	| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
	| ------------------------ | -------------- | ------- | ------------- | ------------- | -------------- |
	| seed-1.6-embedding | unknown | 71.27 | 77.78 | 55.34 | 73.44 |
	| Ops-MM-embedding-v1-7B | 8.29 | 67.61 | 72.72 | 53.76 | 70.34 |
	| Ops-MM-embedding-v1-2B | 2.21 | 63.44 | 69.03 | 47.56 | 66.96 |
	| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.02 | 64.85 | 34.85 | 65.36 |
	| gme-Qwen2-VL-7B-Instruct | 8.29 | 57.83 | 55.95 | 38.43 | 75.18 |
	| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.08 | 51.89 | 33.64 | 72.71 |


	### MMEB-Image

	The table below compares models of similar size on the MMEB-Image benchmark.

	| Model | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
	| ------------------------------------- | -------------- | ------------- | ----- | ----- | ------ | ------ |
	| Ops-MM-embedding-v1-7B | 8.29 | 72.72 | 69.65 | 69.58 | 73.09 | 87.15 |
	| QQMM-embed | 8.297 | 72.175 | 70.07 | 69.52 | 71.175 | 87.075 |
	| B3_Qwen2_7B | 8.29 | 72 | 70 | 66.5 | 74.1 | 84.6 |
	| UniME(LLaVA-OneVision-7B-LoRA-Res336) | 8.03 | 70.7 | 66.8 | 66.6 | 70.5 | 90.9 |
	| LLaVE-7B | 8.03 | 70.3 | 65.7 | 65.4 | 70.9 | 91.9 |
	| UNITE-Instruct-7B | 8.29 | 70.3 | 68.3 | 65.1 | 71.6 | 84.8 |


	### ViDoRe-v2

	| Model | Avg | ESG Restaurant Human | MIT Bio Multi. | Econ Macro Multi. | ESG Restaurant Synth. Multi. |
	| ---------------------- | --------- | -------------------- | -------------- | ----------------- | ---------------------------- |
	| gme-7B | 55.61 | 63.37 | 49.49 | 54.21 | 55.38 |
	| seed 1.6 embedding | 56.57 | 63.3 | 57.14 | 53.85 | 51.99 |
	| Ops-MM-embedding-v1-7B | 59.59 | 66.27 | 54.34 | 60.92 | 56.82 |
	| Ops-MM-embedding-v1-2B | 53.18 | 58.57 | 52.87 | 47.89 | 53.39 |
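
The Avg column appears to be the unweighted mean of the four per-dataset scores. The model card does not state this; it is an inference from the numbers. The short check below recomputes the mean from the values in the ViDoRe-v2 table and compares it with the reported figure.

```python
# Per-dataset scores copied from the ViDoRe-v2 table, in column order:
# ESG Restaurant Human, MIT Bio Multi., Econ Macro Multi., ESG Restaurant Synth. Multi.
scores = {
    "gme-7B": [63.37, 49.49, 54.21, 55.38],
    "seed 1.6 embedding": [63.3, 57.14, 53.85, 51.99],
    "Ops-MM-embedding-v1-7B": [66.27, 54.34, 60.92, 56.82],
    "Ops-MM-embedding-v1-2B": [58.57, 52.87, 47.89, 53.39],
}

# Avg column as reported in the same table.
reported_avg = {
    "gme-7B": 55.61,
    "seed 1.6 embedding": 56.57,
    "Ops-MM-embedding-v1-7B": 59.59,
    "Ops-MM-embedding-v1-2B": 53.18,
}

# Assumption: Avg is the plain arithmetic mean over the four datasets.
computed_avg = {name: sum(vals) / len(vals) for name, vals in scores.items()}

for name in scores:
    print(f"{name}: computed {computed_avg[name]:.2f}, reported {reported_avg[name]:.2f}")
```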



	## Usage

	```python
	from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image


	model = OpsMMEmbeddingV1(
	    "OpenSearch-AI/Ops-MM-embedding-v1-7B",
	    device="cuda",
	    attn_implementation="flash_attention_2",
	)

	t2i_prompt = "Find an image that matches the given text."
	texts = [
	    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
	    "Alibaba office.",
	    "Alibaba office.",
	]
	images = [
	    "https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg",
	    "https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg",
	    "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Alibaba_Binjiang_Park.jpg/1024px-Alibaba_Binjiang_Park.jpg",
	]

	images = [fetch_image(image) for image in images]

	# Text and image embedding
	text_embeddings = model.get_text_embeddings(texts)
	image_embeddings = model.get_image_embeddings(images)
	print('Text and image embeddings', (text_embeddings @ image_embeddings.T).tolist())

	# Fused embeddings: each text is encoded together with its paired image
	text_with_image_embeddings = model.get_fused_embeddings(texts=texts, images=images, instruction=t2i_prompt)
	print('Fused embeddings', (text_with_image_embeddings @ text_with_image_embeddings.T).tolist())

	# Multi-image embeddings
	multi_images = [
	    [images[0]],
	    [images[1], images[2]],
	]
	multi_image_embeddings = model.get_image_embeddings(multi_images)
	print('Multi-image embeddings', (multi_image_embeddings @ multi_image_embeddings.T).tolist())

	```
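
The Key Features section notes that videos are handled by treating frames as multiple image inputs. The model card does not say how frames should be chosen, so the sketch below assumes simple uniform sampling. `sample_frame_indices` is a hypothetical helper, not part of `ops_mm_embedding_v1`.

```python
import numpy as np


def sample_frame_indices(num_frames: int, num_samples: int = 8) -> list[int]:
    """Pick up to num_samples evenly spaced frame indices from a clip of num_frames frames."""
    if num_frames <= 0:
        return []
    # Never request more samples than there are frames.
    num_samples = min(num_samples, num_frames)
    # Spread the samples from the first frame to the last, then round to whole indices.
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int).tolist()


indices = sample_frame_indices(num_frames=300, num_samples=4)
print(indices)
```

The frames at these indices could then be decoded to images with a video reader of your choice and passed as one inner list to `get_image_embeddings`, in the same way as the multi-image example above.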