Transformers documentation

Ernie 4.5 VL MoE

This model was released on 2025-06-30 and added to Hugging Face Transformers on 2025-12-19.

Overview

The Ernie 4.5 VL MoE model was released by Baidu as part of the Ernie 4.5 model family. The family contains several architectures and model sizes. The Vision-Language series specifically is built on a novel multimodal heterogeneous structure that shares some parameters across modalities and dedicates others to a single modality. This is most apparent in the Mixture of Experts (MoE) layers, which are composed of:

Dedicated Text Experts
Dedicated Vision Experts
Shared Experts

This architecture enhances multimodal understanding without compromising performance on text-related tasks, and can even improve it. A more detailed breakdown is given in the Technical Report.
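To make the split between dedicated and shared experts more concrete, here is a minimal pure-Python sketch of a modality-aware MoE layer. It is a toy illustration and not the Transformers implementation: the function names (`make_scaling_expert`, `moe_layer`), the softmax-over-top-k weighting, and the scaling "experts" are all assumptions chosen for readability. The only ideas taken from the description above are that text tokens are routed to text experts, vision tokens to vision experts, and shared experts see every token.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_scaling_expert(scale: float) -> Expert:
    """Toy expert (hypothetical): multiplies every feature by a constant."""
    return lambda x: [scale * v for v in x]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(
    tokens: Sequence[Vector],
    modalities: Sequence[str],
    router_logits: Sequence[Sequence[float]],
    text_experts: Sequence[Expert],
    vision_experts: Sequence[Expert],
    shared_experts: Sequence[Expert],
    top_k: int = 1,
) -> List[Vector]:
    """Route each token within its modality's expert pool, then add shared experts."""
    outputs = []
    for token, modality, logits in zip(tokens, modalities, router_logits):
        # Dedicated pool: a token only competes for experts of its own modality.
        pool = text_experts if modality == "text" else vision_experts
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
        weights = softmax([logits[i] for i in top])

        mixed = [0.0] * len(token)
        for weight, idx in zip(weights, top):
            expert_out = pool[idx](token)
            mixed = [m + weight * o for m, o in zip(mixed, expert_out)]

        # Shared experts process every token, regardless of modality.
        for expert in shared_experts:
            mixed = [m + o for m, o in zip(mixed, expert(token))]
        outputs.append(mixed)
    return outputs


text_experts = [make_scaling_expert(1.0), make_scaling_expert(2.0)]
vision_experts = [make_scaling_expert(10.0), make_scaling_expert(20.0)]
shared_experts = [make_scaling_expert(0.5)]

outputs = moe_layer(
    tokens=[[1.0, 2.0], [1.0, 2.0]],
    modalities=["text", "vision"],
    router_logits=[[0.0, 5.0], [5.0, 0.0]],  # made-up router scores
    text_experts=text_experts,
    vision_experts=vision_experts,
    shared_experts=shared_experts,
)
print(outputs)  # [[2.5, 5.0], [10.5, 21.0]]
```

The two input tokens are identical, yet they produce different outputs because each is handled by a different dedicated pool, while the shared expert contributes the same term to both.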

Other models from the family can be found at Ernie 4.5 and at Ernie 4.5 MoE.

Usage

Ernie 4.5 VL MoE generates text from an image or a video, either through Pipeline or through the AutoModel class. Using the model with video input is similar to using it with image input: the processor prepares the video data and the model generates text based on the content of the video.
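On the input side, the two modalities differ only in the media entry of the chat message. The sketch below shows this with a small helper; `build_message` is a hypothetical name introduced here for illustration, the image URL is a placeholder, and the image entry is assumed to follow the same `{"type": ..., "url": ...}` convention as the video entry used in this section.

```python
from typing import Dict, List


def build_message(media_type: str, url: str, prompt: str) -> List[Dict]:
    """Build a single-turn chat message with one text entry and one media entry.

    Hypothetical helper, not part of Transformers.
    """
    if media_type not in ("image", "video"):
        raise ValueError(f"Unsupported media type: {media_type!r}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": media_type, "url": url},
            ],
        }
    ]


video_message = build_message(
    "video",
    "https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/tiny_video.mp4",
    "Please describe what you can see during this video.",
)

# Placeholder image URL, used only for illustration.
image_message = build_message(
    "image",
    "https://example.com/cat.png",
    "What is shown in this image?",
)
```

The complete video example below passes a message of this shape to `processor.apply_chat_template`.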

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "baidu/ERNIE-4.5-VL-28B-A3B-PT",
    dtype="auto",
    device_map="auto",  # Use tp_plan="auto" instead to enable Tensor Parallelism!
    revision="refs/pr/11",
)
processor = AutoProcessor.from_pretrained("baidu/ERNIE-4.5-VL-28B-A3B-PT", revision="refs/pr/11")
message = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please describe what you can see during this video."},
            {
                "type": "video",
                "url": "https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/tiny_video.mp4",
            },
        ],
    }
]

# Render the chat template, load the video, and tokenize in a single call
inputs = processor.apply_chat_template(
    message,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

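# generate() returns the prompt tokens followed by the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the model's answer is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]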
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
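The trimming step in the example above exists because the generated sequences start with the prompt tokens. The following sketch reproduces that step on plain Python lists so it can be run without loading a model; `trim_prompt_tokens` is a hypothetical helper name and the token ids are made up.

```python
from typing import List, Sequence


def trim_prompt_tokens(
    input_ids: Sequence[Sequence[int]],
    generated_ids: Sequence[Sequence[int]],
) -> List[List[int]]:
    """Remove the leading prompt tokens from each generated sequence."""
    return [
        list(out_ids[len(in_ids):])
        for in_ids, out_ids in zip(input_ids, generated_ids)
    ]


# Made-up token ids: each generated row repeats its prompt, then adds two new tokens.
input_ids = [[101, 7, 8], [101, 9, 10]]
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

trimmed = trim_prompt_tokens(input_ids, generated_ids)
print(trimmed)  # [[42, 43], [44, 45]]
```

Without this step, decoding would return the prompt text followed by the answer rather than the answer alone.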

Ernie4_5_VLMoeConfig

class transformers.Ernie4_5_VLMoeConfig

Ernie4_5_VLMoeTextConfig

class transformers.Ernie4_5_VLMoeTextConfig

Ernie4_5_VLMoeVisionConfig

class transformers.Ernie4_5_VLMoeVisionConfig

Ernie4_5_VLMoeImageProcessor

class transformers.Ernie4_5_VLMoeImageProcessor

preprocess

Ernie4_5_VLMoeImageProcessorFast

class transformers.Ernie4_5_VLMoeImageProcessorFast

preprocess

Ernie4_5_VLMoeVideoProcessor

class transformers.Ernie4_5_VLMoeVideoProcessor

preprocess

Ernie4_5_VLMoeProcessor

class transformers.Ernie4_5_VLMoeProcessor

__call__

Ernie4_5_VLMoeTextModel

class transformers.Ernie4_5_VLMoeTextModel

forward

Ernie4_5_VLMoeVisionTransformerPretrainedModel

class transformers.Ernie4_5_VLMoeVisionTransformerPretrainedModel

forward

Ernie4_5_VLMoeVariableResolutionResamplerModel

class transformers.Ernie4_5_VLMoeVariableResolutionResamplerModel

forward

Ernie4_5_VLMoeModel

class transformers.Ernie4_5_VLMoeModel

forward

get_video_features

get_image_features

Ernie4_5_VLMoeForConditionalGeneration

class transformers.Ernie4_5_VLMoeForConditionalGeneration

forward

get_video_features

get_image_features

call