Impossible to deploy using vLLM 0.20.0

#6
by hdnh2006 - opened

Hello! I have 2x RTX4000 PRO GPUs, and there seems to be something wrong with MoE models:

vllm serve RedHatAI/gemma-4-26B-A4B-it-NVFP4 --max-model-len 190000 --reasoning-parser gemma4 --tool-call-parser gemma4 --enable-auto-tool-choice --max-num-batched-tokens 4096 --tensor-parallel-size 2 --gpu-memory-utilization 0.95
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:299] 
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.0
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:299]   █▄█▀ █     █     █     █  model   RedHatAI/gemma-4-26B-A4B-it-NVFP4
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:299] 
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:233] non-default args: {'model_tag': 'RedHatAI/gemma-4-26B-A4B-it-NVFP4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'model': 'RedHatAI/gemma-4-26B-A4B-it-NVFP4', 'max_model_len': 190000, 'reasoning_parser': 'gemma4', 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.95, 'max_num_batched_tokens': 4096}
(APIServer pid=10923) WARNING 05-08 12:34:18 [envs.py:1818] Unknown vLLM environment variable detected: VLLM_TEST_ENDPOINT
(APIServer pid=10923) WARNING 05-08 12:34:18 [envs.py:1818] Unknown vLLM environment variable detected: VLLM_MODEL
config.json: 19.7kB [00:00, 24.6MB/s]
processor_config.json: 1.69kB [00:00, 4.75MB/s]
(APIServer pid=10923) INFO 05-08 12:34:20 [model.py:555] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=10923) INFO 05-08 12:34:20 [model.py:1680] Using max model len 190000
(APIServer pid=10923) INFO 05-08 12:34:21 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=10923) INFO 05-08 12:34:21 [nixl_utils.py:32] NIXL is available
(APIServer pid=10923) INFO 05-08 12:34:21 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=10923) INFO 05-08 12:34:21 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=10923) INFO 05-08 12:34:21 [vllm.py:840] Asynchronous scheduling is enabled.
(APIServer pid=10923) INFO 05-08 12:34:21 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=10923) WARNING 05-08 12:34:21 [cuda.py:233] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
tokenizer_config.json: 2.09kB [00:00, 5.39MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:01<00:00, 22.6MB/s]
chat_template.jinja: 16.4kB [00:00, 28.9MB/s]
(APIServer pid=10923) INFO 05-08 12:34:27 [compilation.py:303] Enabled custom fusions: act_quant
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 203/203 [00:00<00:00, 1.11MB/s]
INFO 05-08 12:34:37 [nixl_utils.py:32] NIXL is available
(EngineCore pid=11088) INFO 05-08 12:34:37 [core.py:109] Initializing a V1 LLM engine (v0.20.0) with config: model='RedHatAI/gemma-4-26B-A4B-it-NVFP4', speculative_config=None, tokenizer='RedHatAI/gemma-4-26B-A4B-it-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=190000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='gemma4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=RedHatAI/gemma-4-26B-A4B-it-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=11088) WARNING 05-08 12:34:37 [multiproc_executor.py:1029] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=11088) INFO 05-08 12:34:37 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.2 (local), world_size=2, local_world_size=2
INFO 05-08 12:34:42 [nixl_utils.py:32] NIXL is available
INFO 05-08 12:34:42 [nixl_utils.py:32] NIXL is available
(Worker pid=11191) INFO 05-08 12:34:46 [parallel_state.py:1402] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:37113 backend=nccl
(Worker pid=11192) INFO 05-08 12:34:46 [parallel_state.py:1402] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:37113 backend=nccl
(Worker pid=11191) INFO 05-08 12:34:47 [pynccl.py:111] vLLM is using nccl==2.28.9
(Worker pid=11191) WARNING 05-08 12:34:47 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=11192) WARNING 05-08 12:34:47 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=11191) INFO 05-08 12:34:47 [parallel_state.py:1715] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [gpu_model_runner.py:4777] Starting to load model RedHatAI/gemma-4-26B-A4B-it-NVFP4...
(Worker_TP1 pid=11192) INFO 05-08 12:34:48 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(Worker_TP1 pid=11192) INFO 05-08 12:34:48 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP1 pid=11192) INFO 05-08 12:34:48 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [vllm.py:840] Asynchronous scheduling is enabled.
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [compilation.py:303] Enabled custom fusions: act_quant
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [__init__.py:683] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [nvfp4.py:280] Using 'VLLM_CUTLASS' NvFp4 MoE backend out of potential backends: ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTEDSL_BATCHED', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN', 'EMULATION'].
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
model.safetensors.index.json: 4.55MB [00:00, 24.4MB/s]
(Worker_TP1 pid=11192) INFO 05-08 12:37:08 [weight_utils.py:615] Time spent downloading weights for RedHatAI/gemma-4-26B-A4B-it-NVFP4: 138.685520 seconds
(Worker_TP0 pid=11191) INFO 05-08 12:37:09 [weight_utils.py:904] Filesystem type for checkpoints: OVERLAY. Checkpoint size: 15.30 GiB. Available RAM: 237.08 GiB.
(Worker_TP0 pid=11191) INFO 05-08 12:37:09 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (OVERLAY) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:06<00:00,  6.95s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:06<00:00,  6.95s/it]
(Worker_TP0 pid=11191) 
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     worker = WorkerProc(*args, **kwargs)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 619, in __init__
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     self.worker.load_model()
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4793, in load_model
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     self.model = model_loader.load_model(
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 80, in load_model
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     process_weights_after_loading(model, model_config, target_device)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     quant_method.process_weights_after_loading(module)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a4_nvfp4.py", line 208, in process_weights_after_loading
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     ) = convert_to_nvfp4_moe_kernel_format(
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/oracle/nvfp4.py", line 346, in convert_to_nvfp4_moe_kernel_format
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     ) = prepare_nvfp4_moe_layer_for_fi_or_cutlass(
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py", line 379, in prepare_nvfp4_moe_layer_for_fi_or_cutlass
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     raise NotImplementedError(
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870] NotImplementedError: ('Intermediate size padding for w1 and w3, for %s NvFp4 backend, but this is not currently supported', 'VLLM_CUTLASS')
(Worker_TP0 pid=11191) INFO 05-08 12:37:16 [default_loader.py:384] Loading weights took 7.29 seconds
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     worker = WorkerProc(*args, **kwargs)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 619, in __init__
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     self.worker.load_model()
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4793, in load_model
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     self.model = model_loader.load_model(
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 80, in load_model
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     process_weights_after_loading(model, model_config, target_device)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     quant_method.process_weights_after_loading(module)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a4_nvfp4.py", line 208, in process_weights_after_loading
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     ) = convert_to_nvfp4_moe_kernel_format(
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/oracle/nvfp4.py", line 346, in convert_to_nvfp4_moe_kernel_format
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     ) = prepare_nvfp4_moe_layer_for_fi_or_cutlass(
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py", line 379, in prepare_nvfp4_moe_layer_for_fi_or_cutlass
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     raise NotImplementedError(
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870] NotImplementedError: ('Intermediate size padding for w1 and w3, for %s NvFp4 backend, but this is not currently supported', 'VLLM_CUTLASS')
[rank0]:[W508 12:37:17.358092543 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136] EngineCore failed to start.
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     super().__init__(
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 118, in __init__
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 107, in __init__
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     super().__init__(vllm_config)
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     self._init_executor()
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 200, in _init_executor
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 747, in wait_for_ready
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     raise e from None
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore pid=11088) Process EngineCore:
(EngineCore pid=11088) Traceback (most recent call last):
(EngineCore pid=11088)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=11088)     self.run()
(EngineCore pid=11088)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=11088)     self._target(*self._args, **self._kwargs)
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1140, in run_engine_core
(EngineCore pid=11088)     raise e
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=11088)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=11088)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=11088)     return func(*args, **kwargs)
(EngineCore pid=11088)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=11088)     super().__init__(
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 118, in __init__
(EngineCore pid=11088)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=11088)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 107, in __init__
(EngineCore pid=11088)     super().__init__(vllm_config)
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=11088)     return func(*args, **kwargs)
(EngineCore pid=11088)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=11088)     self._init_executor()
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 200, in _init_executor
(EngineCore pid=11088)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore pid=11088)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 747, in wait_for_ready
(EngineCore pid=11088)     raise e from None
(EngineCore pid=11088) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=10923) Traceback (most recent call last):
(APIServer pid=10923)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=10923)     sys.exit(main())
(APIServer pid=10923)              ^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=10923)     args.dispatch_function(args)
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=10923)     uvloop.run(run_server(args))
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=10923)     return __asyncio.run(
(APIServer pid=10923)            ^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=10923)     return runner.run(main)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=10923)     return self._loop.run_until_complete(task)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=10923)     return await main
(APIServer pid=10923)            ^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 678, in run_server
(APIServer pid=10923)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=10923)     async with build_async_engine_client(
(APIServer pid=10923)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=10923)     return await anext(self.gen)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=10923)     async with build_async_engine_client_from_engine_args(
(APIServer pid=10923)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=10923)     return await anext(self.gen)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=10923)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=10923)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=10923)     return cls(
(APIServer pid=10923)            ^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=10923)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=10923)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=10923)     return func(*args, **kwargs)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=10923)     return AsyncMPClient(*client_args)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=10923)     return func(*args, **kwargs)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=10923)     super().__init__(
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=10923)     with launch_core_engines(
(APIServer pid=10923)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=10923)     next(self.gen)
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1119, in launch_core_engines
(APIServer pid=10923)     wait_for_engine_startup(
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1178, in wait_for_engine_startup
(APIServer pid=10923)     raise RuntimeError(
(APIServer pid=10923) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

The problem seems to be with tensor parallelism. On a single RTX5000 PRO (--tensor-parallel-size 1) the same model works fine:

vllm serve RedHatAI/gemma-4-26B-A4B-it-NVFP4 --max-model-len 190000   --reasoning-parser gemma4   --tool-call-parser gemma4 --enable-auto-tool-choice --max-num-batched-tokens 4096 --tensor-parallel-size 1 --gpu-memory-utilization 0.95
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:299] 
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.0
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:299]   █▄█▀ █     █     █     █  model   RedHatAI/gemma-4-26B-A4B-it-NVFP4
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:299] 
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:233] non-default args: {'model_tag': 'RedHatAI/gemma-4-26B-A4B-it-NVFP4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'model': 'RedHatAI/gemma-4-26B-A4B-it-NVFP4', 'max_model_len': 190000, 'reasoning_parser': 'gemma4', 'gpu_memory_utilization': 0.95, 'max_num_batched_tokens': 4096}
(APIServer pid=1550) WARNING 05-08 12:45:00 [envs.py:1818] Unknown vLLM environment variable detected: VLLM_TEST_ENDPOINT
(APIServer pid=1550) WARNING 05-08 12:45:00 [envs.py:1818] Unknown vLLM environment variable detected: VLLM_MODEL
(APIServer pid=1550) INFO 05-08 12:45:01 [model.py:555] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=1550) INFO 05-08 12:45:01 [model.py:1680] Using max model len 190000
(APIServer pid=1550) INFO 05-08 12:45:02 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=1550) INFO 05-08 12:45:02 [nixl_utils.py:32] NIXL is available
(APIServer pid=1550) INFO 05-08 12:45:02 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=1550) INFO 05-08 12:45:02 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=1550) INFO 05-08 12:45:02 [vllm.py:840] Asynchronous scheduling is enabled.
(APIServer pid=1550) INFO 05-08 12:45:02 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1550) WARNING 05-08 12:45:02 [cuda.py:233] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
tokenizer_config.json: 2.09kB [00:00, 3.19MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:01<00:00, 31.4MB/s]
chat_template.jinja: 16.4kB [00:00, 30.8MB/s]
(APIServer pid=1550) INFO 05-08 12:45:08 [compilation.py:303] Enabled custom fusions: act_quant
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 203/203 [00:00<00:00, 666kB/s]
INFO 05-08 12:45:19 [nixl_utils.py:32] NIXL is available
(EngineCore pid=1794) INFO 05-08 12:45:19 [core.py:109] Initializing a V1 LLM engine (v0.20.0) with config: model='RedHatAI/gemma-4-26B-A4B-it-NVFP4', speculative_config=None, tokenizer='RedHatAI/gemma-4-26B-A4B-it-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=190000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='gemma4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=RedHatAI/gemma-4-26B-A4B-it-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=1794) INFO 05-08 12:45:24 [parallel_state.py:1402] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.4:37261 backend=nccl
(EngineCore pid=1794) INFO 05-08 12:45:24 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=1794) INFO 05-08 12:45:24 [gpu_model_runner.py:4777] Starting to load model RedHatAI/gemma-4-26B-A4B-it-NVFP4...
(EngineCore pid=1794) INFO 05-08 12:45:25 [vllm.py:840] Asynchronous scheduling is enabled.
(EngineCore pid=1794) INFO 05-08 12:45:25 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=1794) INFO 05-08 12:45:25 [compilation.py:303] Enabled custom fusions: act_quant
(EngineCore pid=1794) INFO 05-08 12:45:25 [__init__.py:683] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(EngineCore pid=1794) INFO 05-08 12:45:25 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=1794) INFO 05-08 12:45:25 [nvfp4.py:280] Using 'VLLM_CUTLASS' NvFp4 MoE backend out of potential backends: ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTEDSL_BATCHED', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN', 'EMULATION'].
(EngineCore pid=1794) INFO 05-08 12:45:26 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
model.safetensors.index.json: 4.55MB [00:00, 270MB/s]
(EngineCore pid=1794) INFO 05-08 12:45:50 [weight_utils.py:615] Time spent downloading weights for RedHatAI/gemma-4-26B-A4B-it-NVFP4: 24.042237 seconds
(EngineCore pid=1794) INFO 05-08 12:45:51 [weight_utils.py:904] Filesystem type for checkpoints: OVERLAY. Checkpoint size: 15.30 GiB. Available RAM: 472.99 GiB.
(EngineCore pid=1794) INFO 05-08 12:45:51 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (OVERLAY) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:10<00:00, 10.52s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:10<00:00, 10.52s/it]
(EngineCore pid=1794) 
(EngineCore pid=1794) INFO 05-08 12:46:02 [default_loader.py:384] Loading weights took 10.90 seconds
(EngineCore pid=1794) INFO 05-08 12:46:02 [nvfp4.py:485] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=1794) WARNING 05-08 12:46:02 [compressed_tensors_w4a4_nvfp4.py:97] In NVFP4 linear, the global scale for input or weight are different for parallel layers (e.g. q_proj, k_proj, v_proj). This  will likely result in reduced accuracy. Please verify the model accuracy. Consider using a checkpoint with a shared global NVFP4 scale for fused layers.
(EngineCore pid=1794) INFO 05-08 12:46:02 [gpu_model_runner.py:4879] Model loading took 15.76 GiB memory and 37.162847 seconds
(EngineCore pid=1794) INFO 05-08 12:46:03 [gpu_model_runner.py:5820] Encoder cache will be initialized with a budget of 4096 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=1794) /usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py:2353: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
(EngineCore pid=1794)   warnings.warn(
(EngineCore pid=1794) WARNING 05-08 12:46:24 [op.py:241] Priority not set for op rms_norm, using native implementation.
(EngineCore pid=1794) INFO 05-08 12:46:37 [backends.py:1069] Using cache directory: /workspace/.vllm_cache/torch_compile_cache/ced645544a/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=1794) INFO 05-08 12:46:37 [backends.py:1128] Dynamo bytecode transform time: 11.58 s
(EngineCore pid=1794) INFO 05-08 12:46:49 [backends.py:376] Cache the graph of compile range (1, 4096) for later use
(EngineCore pid=1794) INFO 05-08 12:47:13 [backends.py:391] Compiling a graph for compile range (1, 4096) takes 34.78 s
(EngineCore pid=1794) INFO 05-08 12:47:20 [decorators.py:668] saved AOT compiled function to /workspace/.vllm_cache/torch_compile_cache/torch_aot_compile/6fd79cfd16265f6994ed0d633fe7b3f5f084090dd1eb03b0281559213bb88470/rank_0_0/model
(EngineCore pid=1794) INFO 05-08 12:47:20 [monitor.py:53] torch.compile took 54.71 s in total
(EngineCore pid=1794) INFO 05-08 12:47:22 [monitor.py:81] Initial profiling/warmup run took 2.01 s
(EngineCore pid=1794) INFO 05-08 12:47:31 [gpu_model_runner.py:5963] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=1794) INFO 05-08 12:47:37 [gpu_model_runner.py:6042] Estimated CUDA graph memory: 1.12 GiB total
(EngineCore pid=1794) INFO 05-08 12:47:37 [gpu_worker.py:440] Available KV cache memory: 26.5 GiB
(EngineCore pid=1794) INFO 05-08 12:47:37 [gpu_worker.py:455] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9500 is equivalent to --gpu-memory-utilization=0.9263 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9737. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=1794) INFO 05-08 12:47:37 [kv_cache_utils.py:1711] GPU KV cache size: 115,776 tokens
(EngineCore pid=1794) INFO 05-08 12:47:37 [kv_cache_utils.py:1716] Maximum concurrency for 190,000 tokens per request: 5.76x
(EngineCore pid=1794) 2026-05-08 12:47:37,879 - INFO - autotuner.py:457 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
[AutoTuner]: Tuning fp4_gemm: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 82.11profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 104.15profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 106.36profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 115.32profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 87.51profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 87.66profile/s]
(EngineCore pid=1794) 2026-05-08 12:47:39,203 - INFO - autotuner.py:466 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████| 51/51 [00:06<00:00,  7.91it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:10<00:00,  3.32it/s]
(EngineCore pid=1794) INFO 05-08 12:47:57 [gpu_model_runner.py:6133] Graph capturing finished in 18 secs, took 0.98 GiB
(EngineCore pid=1794) INFO 05-08 12:47:57 [gpu_worker.py:599] CUDA graph pool memory: 0.98 GiB (actual), 1.12 GiB (estimated), difference: 0.14 GiB (14.1%).
(EngineCore pid=1794) INFO 05-08 12:47:57 [core.py:299] init engine (profile, create kv cache, warmup model) took 114.47 s (compilation: 54.71 s)
(EngineCore pid=1794) INFO 05-08 12:47:57 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1550) INFO 05-08 12:47:57 [api_server.py:598] Supported tasks: ['generate']
(APIServer pid=1550) INFO 05-08 12:47:58 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1550) WARNING 05-08 12:47:58 [model.py:1437] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1550) INFO 05-08 12:48:06 [hf.py:314] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1550) /usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py:2353: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
(APIServer pid=1550)   warnings.warn(
(APIServer pid=1550) INFO 05-08 12:48:31 [base.py:233] Multi-modal warmup completed in 24.898s
(APIServer pid=1550) /usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py:2353: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
(APIServer pid=1550)   warnings.warn(
(APIServer pid=1550) INFO 05-08 12:48:49 [base.py:233] Readonly multi-modal warmup completed in 17.886s
(APIServer pid=1550) INFO 05-08 12:48:49 [api_server.py:602] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:37] Available routes are:
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1550) INFO:     Started server process [1550]
(APIServer pid=1550) INFO:     Waiting for application startup.
(APIServer pid=1550) INFO:     Application startup complete.

Any idea why it fails with --tensor-parallel-size 2?
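
Since the final RuntimeError only says "See root cause above", the useful lines are probably the ones logged by processes other than the APIServer. In case it helps anyone reproducing this, here is a minimal sketch for pulling those lines out of a saved log. It assumes the log was redirected to a file and that every relevant line carries a "(Name pid=N)" prefix like the ones above; the function name and the marker list are my own and not part of vLLM.

```python
import re

# Matches prefixes such as "(APIServer pid=10923) " or "(EngineCore pid=1794) ".
PREFIX = re.compile(r"^\((\w+) pid=(\d+)\)\s?")

# Substrings treated as signs of a failure (an assumption, adjust as needed).
MARKERS = ("ERROR", "Traceback", "Error:")


def non_api_server_errors(log_text):
    """Return error-looking lines logged by any process except the APIServer."""
    hits = []
    for line in log_text.splitlines():
        match = PREFIX.match(line)
        # Skip lines without a process prefix and lines from the APIServer.
        if match is None or match.group(1) == "APIServer":
            continue
        if any(marker in line for marker in MARKERS):
            hits.append(line)
    return hits


# Usage (hypothetical file name):
#   with open("vllm_tp2.log", encoding="utf-8") as f:
#       print("\n".join(non_api_server_errors(f.read())))
```

This only filters text. It does not change anything in vLLM, and it will miss root-cause lines that are printed without a process prefix.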
