Impossible to deploy using vLLM 0.20.0

by hdnh2006 - opened 21 days ago

hdnh2006

Hello! I have 2x RTX4000 PRO GPUs, and something seems to be wrong with MoE models:

vllm serve RedHatAI/gemma-4-26B-A4B-it-NVFP4 --max-model-len 190000 --reasoning-parser gemma4 --tool-call-parser gemma4 --enable-auto-tool-choice --max-num-batched-tokens 4096 --tensor-parallel-size 2 --gpu-memory-utilization 0.95
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:299] 
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.0
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:299]   █▄█▀ █     █     █     █  model   RedHatAI/gemma-4-26B-A4B-it-NVFP4
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:299] 
(APIServer pid=10923) INFO 05-08 12:34:18 [utils.py:233] non-default args: {'model_tag': 'RedHatAI/gemma-4-26B-A4B-it-NVFP4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'model': 'RedHatAI/gemma-4-26B-A4B-it-NVFP4', 'max_model_len': 190000, 'reasoning_parser': 'gemma4', 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.95, 'max_num_batched_tokens': 4096}
(APIServer pid=10923) WARNING 05-08 12:34:18 [envs.py:1818] Unknown vLLM environment variable detected: VLLM_TEST_ENDPOINT
(APIServer pid=10923) WARNING 05-08 12:34:18 [envs.py:1818] Unknown vLLM environment variable detected: VLLM_MODEL
config.json: 19.7kB [00:00, 24.6MB/s]
processor_config.json: 1.69kB [00:00, 4.75MB/s]
(APIServer pid=10923) INFO 05-08 12:34:20 [model.py:555] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=10923) INFO 05-08 12:34:20 [model.py:1680] Using max model len 190000
(APIServer pid=10923) INFO 05-08 12:34:21 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=10923) INFO 05-08 12:34:21 [nixl_utils.py:32] NIXL is available
(APIServer pid=10923) INFO 05-08 12:34:21 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=10923) INFO 05-08 12:34:21 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=10923) INFO 05-08 12:34:21 [vllm.py:840] Asynchronous scheduling is enabled.
(APIServer pid=10923) INFO 05-08 12:34:21 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=10923) WARNING 05-08 12:34:21 [cuda.py:233] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
tokenizer_config.json: 2.09kB [00:00, 5.39MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:01<00:00, 22.6MB/s]
chat_template.jinja: 16.4kB [00:00, 28.9MB/s]
(APIServer pid=10923) INFO 05-08 12:34:27 [compilation.py:303] Enabled custom fusions: act_quant
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 203/203 [00:00<00:00, 1.11MB/s]
INFO 05-08 12:34:37 [nixl_utils.py:32] NIXL is available
(EngineCore pid=11088) INFO 05-08 12:34:37 [core.py:109] Initializing a V1 LLM engine (v0.20.0) with config: model='RedHatAI/gemma-4-26B-A4B-it-NVFP4', speculative_config=None, tokenizer='RedHatAI/gemma-4-26B-A4B-it-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=190000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='gemma4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=RedHatAI/gemma-4-26B-A4B-it-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=11088) WARNING 05-08 12:34:37 [multiproc_executor.py:1029] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=11088) INFO 05-08 12:34:37 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.2 (local), world_size=2, local_world_size=2
INFO 05-08 12:34:42 [nixl_utils.py:32] NIXL is available
INFO 05-08 12:34:42 [nixl_utils.py:32] NIXL is available
(Worker pid=11191) INFO 05-08 12:34:46 [parallel_state.py:1402] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:37113 backend=nccl
(Worker pid=11192) INFO 05-08 12:34:46 [parallel_state.py:1402] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:37113 backend=nccl
(Worker pid=11191) INFO 05-08 12:34:47 [pynccl.py:111] vLLM is using nccl==2.28.9
(Worker pid=11191) WARNING 05-08 12:34:47 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=11192) WARNING 05-08 12:34:47 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=11191) INFO 05-08 12:34:47 [parallel_state.py:1715] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [gpu_model_runner.py:4777] Starting to load model RedHatAI/gemma-4-26B-A4B-it-NVFP4...
(Worker_TP1 pid=11192) INFO 05-08 12:34:48 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(Worker_TP1 pid=11192) INFO 05-08 12:34:48 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP1 pid=11192) INFO 05-08 12:34:48 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [vllm.py:840] Asynchronous scheduling is enabled.
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [compilation.py:303] Enabled custom fusions: act_quant
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [__init__.py:683] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [nvfp4.py:280] Using 'VLLM_CUTLASS' NvFp4 MoE backend out of potential backends: ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTEDSL_BATCHED', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN', 'EMULATION'].
(Worker_TP0 pid=11191) INFO 05-08 12:34:48 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
model.safetensors.index.json: 4.55MB [00:00, 24.4MB/s]
(Worker_TP1 pid=11192) INFO 05-08 12:37:08 [weight_utils.py:615] Time spent downloading weights for RedHatAI/gemma-4-26B-A4B-it-NVFP4: 138.685520 seconds
(Worker_TP0 pid=11191) INFO 05-08 12:37:09 [weight_utils.py:904] Filesystem type for checkpoints: OVERLAY. Checkpoint size: 15.30 GiB. Available RAM: 237.08 GiB.
(Worker_TP0 pid=11191) INFO 05-08 12:37:09 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (OVERLAY) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:06<00:00,  6.95s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:06<00:00,  6.95s/it]
(Worker_TP0 pid=11191) 
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     worker = WorkerProc(*args, **kwargs)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 619, in __init__
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     self.worker.load_model()
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4793, in load_model
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     self.model = model_loader.load_model(
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 80, in load_model
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     process_weights_after_loading(model, model_config, target_device)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     quant_method.process_weights_after_loading(module)
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a4_nvfp4.py", line 208, in process_weights_after_loading
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     ) = convert_to_nvfp4_moe_kernel_format(
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/oracle/nvfp4.py", line 346, in convert_to_nvfp4_moe_kernel_format
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     ) = prepare_nvfp4_moe_layer_for_fi_or_cutlass(
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py", line 379, in prepare_nvfp4_moe_layer_for_fi_or_cutlass
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870]     raise NotImplementedError(
(Worker_TP1 pid=11192) ERROR 05-08 12:37:16 [multiproc_executor.py:870] NotImplementedError: ('Intermediate size padding for w1 and w3, for %s NvFp4 backend, but this is not currently supported', 'VLLM_CUTLASS')
(Worker_TP0 pid=11191) INFO 05-08 12:37:16 [default_loader.py:384] Loading weights took 7.29 seconds
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     worker = WorkerProc(*args, **kwargs)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 619, in __init__
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     self.worker.load_model()
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4793, in load_model
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     self.model = model_loader.load_model(
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 80, in load_model
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     process_weights_after_loading(model, model_config, target_device)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     quant_method.process_weights_after_loading(module)
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a4_nvfp4.py", line 208, in process_weights_after_loading
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     ) = convert_to_nvfp4_moe_kernel_format(
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/oracle/nvfp4.py", line 346, in convert_to_nvfp4_moe_kernel_format
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     ) = prepare_nvfp4_moe_layer_for_fi_or_cutlass(
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py", line 379, in prepare_nvfp4_moe_layer_for_fi_or_cutlass
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870]     raise NotImplementedError(
(Worker_TP0 pid=11191) ERROR 05-08 12:37:17 [multiproc_executor.py:870] NotImplementedError: ('Intermediate size padding for w1 and w3, for %s NvFp4 backend, but this is not currently supported', 'VLLM_CUTLASS')
[rank0]:[W508 12:37:17.358092543 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136] EngineCore failed to start.
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     super().__init__(
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 118, in __init__
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 107, in __init__
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     super().__init__(vllm_config)
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     self._init_executor()
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 200, in _init_executor
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 747, in wait_for_ready
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136]     raise e from None
(EngineCore pid=11088) ERROR 05-08 12:37:18 [core.py:1136] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore pid=11088) Process EngineCore:
(EngineCore pid=11088) Traceback (most recent call last):
(EngineCore pid=11088)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=11088)     self.run()
(EngineCore pid=11088)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=11088)     self._target(*self._args, **self._kwargs)
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1140, in run_engine_core
(EngineCore pid=11088)     raise e
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=11088)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=11088)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=11088)     return func(*args, **kwargs)
(EngineCore pid=11088)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=11088)     super().__init__(
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 118, in __init__
(EngineCore pid=11088)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=11088)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 107, in __init__
(EngineCore pid=11088)     super().__init__(vllm_config)
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=11088)     return func(*args, **kwargs)
(EngineCore pid=11088)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=11088)     self._init_executor()
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 200, in _init_executor
(EngineCore pid=11088)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore pid=11088)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11088)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 747, in wait_for_ready
(EngineCore pid=11088)     raise e from None
(EngineCore pid=11088) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=10923) Traceback (most recent call last):
(APIServer pid=10923)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=10923)     sys.exit(main())
(APIServer pid=10923)              ^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=10923)     args.dispatch_function(args)
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=10923)     uvloop.run(run_server(args))
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=10923)     return __asyncio.run(
(APIServer pid=10923)            ^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=10923)     return runner.run(main)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=10923)     return self._loop.run_until_complete(task)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=10923)     return await main
(APIServer pid=10923)            ^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 678, in run_server
(APIServer pid=10923)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=10923)     async with build_async_engine_client(
(APIServer pid=10923)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=10923)     return await anext(self.gen)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=10923)     async with build_async_engine_client_from_engine_args(
(APIServer pid=10923)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=10923)     return await anext(self.gen)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=10923)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=10923)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=10923)     return cls(
(APIServer pid=10923)            ^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=10923)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=10923)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=10923)     return func(*args, **kwargs)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=10923)     return AsyncMPClient(*client_args)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=10923)     return func(*args, **kwargs)
(APIServer pid=10923)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=10923)     super().__init__(
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=10923)     with launch_core_engines(
(APIServer pid=10923)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=10923)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=10923)     next(self.gen)
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1119, in launch_core_engines
(APIServer pid=10923)     wait_for_engine_startup(
(APIServer pid=10923)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1178, in wait_for_engine_startup
(APIServer pid=10923)     raise RuntimeError(
(APIServer pid=10923) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
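
The final `RuntimeError` only says "See root cause above", so the useful line is somewhere earlier in the output. A minimal sketch for pulling candidate root-cause lines out of a saved log, assuming the `(Process pid=N) LEVEL ...` prefix seen in the log above also applies to error lines. The `Worker_TP0` process name and the sample error message are hypothetical, used only for illustration:

```python
import re

# Prefix format assumed from the log above, e.g. "(APIServer pid=10923) INFO 05-08 ..."
LINE_RE = re.compile(
    r"^\((?P<proc>\w+) pid=(?P<pid>\d+)\)\s+(?P<level>INFO|WARNING|ERROR)\b"
)


def first_errors(lines, skip_procs=("APIServer",)):
    """Return ERROR-level lines logged by processes other than the API server."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("level") == "ERROR" and m.group("proc") not in skip_procs:
            hits.append(line.rstrip("\n"))
    return hits


# Hypothetical sample: the middle line is invented to show what a match looks like.
sample = [
    "(APIServer pid=10923) INFO 05-08 12:34:20 [model.py:1680] Using max model len 190000",
    "(Worker_TP0 pid=11001) ERROR 05-08 12:35:00 [hypothetical.py:1] hypothetical worker failure",
    "(APIServer pid=10923) RuntimeError: Engine core initialization failed.",
]

for hit in first_errors(sample):
    print(hit)
```

To run it against a real log, pass an open file instead of `sample`, for example `first_errors(open("vllm.log"))`, where `vllm.log` is whatever file the server output was redirected to.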

hdnh2006

21 days ago

It looks like the problem is with tensor parallelism: with --tensor-parallel-size 1 on a single RTX5000 PRO, it works fine:

vllm serve RedHatAI/gemma-4-26B-A4B-it-NVFP4 --max-model-len 190000   --reasoning-parser gemma4   --tool-call-parser gemma4 --enable-auto-tool-choice --max-num-batched-tokens 4096 --tensor-parallel-size 1 --gpu-memory-utilization 0.95
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:299] 
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.0
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:299]   █▄█▀ █     █     █     █  model   RedHatAI/gemma-4-26B-A4B-it-NVFP4
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:299] 
(APIServer pid=1550) INFO 05-08 12:45:00 [utils.py:233] non-default args: {'model_tag': 'RedHatAI/gemma-4-26B-A4B-it-NVFP4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'model': 'RedHatAI/gemma-4-26B-A4B-it-NVFP4', 'max_model_len': 190000, 'reasoning_parser': 'gemma4', 'gpu_memory_utilization': 0.95, 'max_num_batched_tokens': 4096}
(APIServer pid=1550) WARNING 05-08 12:45:00 [envs.py:1818] Unknown vLLM environment variable detected: VLLM_TEST_ENDPOINT
(APIServer pid=1550) WARNING 05-08 12:45:00 [envs.py:1818] Unknown vLLM environment variable detected: VLLM_MODEL
(APIServer pid=1550) INFO 05-08 12:45:01 [model.py:555] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=1550) INFO 05-08 12:45:01 [model.py:1680] Using max model len 190000
(APIServer pid=1550) INFO 05-08 12:45:02 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=1550) INFO 05-08 12:45:02 [nixl_utils.py:32] NIXL is available
(APIServer pid=1550) INFO 05-08 12:45:02 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=1550) INFO 05-08 12:45:02 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=1550) INFO 05-08 12:45:02 [vllm.py:840] Asynchronous scheduling is enabled.
(APIServer pid=1550) INFO 05-08 12:45:02 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1550) WARNING 05-08 12:45:02 [cuda.py:233] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
tokenizer_config.json: 2.09kB [00:00, 3.19MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:01<00:00, 31.4MB/s]
chat_template.jinja: 16.4kB [00:00, 30.8MB/s]
(APIServer pid=1550) INFO 05-08 12:45:08 [compilation.py:303] Enabled custom fusions: act_quant
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 203/203 [00:00<00:00, 666kB/s]
INFO 05-08 12:45:19 [nixl_utils.py:32] NIXL is available
(EngineCore pid=1794) INFO 05-08 12:45:19 [core.py:109] Initializing a V1 LLM engine (v0.20.0) with config: model='RedHatAI/gemma-4-26B-A4B-it-NVFP4', speculative_config=None, tokenizer='RedHatAI/gemma-4-26B-A4B-it-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=190000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='gemma4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=RedHatAI/gemma-4-26B-A4B-it-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=1794) INFO 05-08 12:45:24 [parallel_state.py:1402] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.4:37261 backend=nccl
(EngineCore pid=1794) INFO 05-08 12:45:24 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=1794) INFO 05-08 12:45:24 [gpu_model_runner.py:4777] Starting to load model RedHatAI/gemma-4-26B-A4B-it-NVFP4...
(EngineCore pid=1794) INFO 05-08 12:45:25 [vllm.py:840] Asynchronous scheduling is enabled.
(EngineCore pid=1794) INFO 05-08 12:45:25 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=1794) INFO 05-08 12:45:25 [compilation.py:303] Enabled custom fusions: act_quant
(EngineCore pid=1794) INFO 05-08 12:45:25 [__init__.py:683] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(EngineCore pid=1794) INFO 05-08 12:45:25 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=1794) INFO 05-08 12:45:25 [nvfp4.py:280] Using 'VLLM_CUTLASS' NvFp4 MoE backend out of potential backends: ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTEDSL_BATCHED', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN', 'EMULATION'].
(EngineCore pid=1794) INFO 05-08 12:45:26 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
model.safetensors.index.json: 4.55MB [00:00, 270MB/s]
(EngineCore pid=1794) INFO 05-08 12:45:50 [weight_utils.py:615] Time spent downloading weights for RedHatAI/gemma-4-26B-A4B-it-NVFP4: 24.042237 seconds
(EngineCore pid=1794) INFO 05-08 12:45:51 [weight_utils.py:904] Filesystem type for checkpoints: OVERLAY. Checkpoint size: 15.30 GiB. Available RAM: 472.99 GiB.
(EngineCore pid=1794) INFO 05-08 12:45:51 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (OVERLAY) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:10<00:00, 10.52s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:10<00:00, 10.52s/it]
(EngineCore pid=1794) 
(EngineCore pid=1794) INFO 05-08 12:46:02 [default_loader.py:384] Loading weights took 10.90 seconds
(EngineCore pid=1794) INFO 05-08 12:46:02 [nvfp4.py:485] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=1794) WARNING 05-08 12:46:02 [compressed_tensors_w4a4_nvfp4.py:97] In NVFP4 linear, the global scale for input or weight are different for parallel layers (e.g. q_proj, k_proj, v_proj). This  will likely result in reduced accuracy. Please verify the model accuracy. Consider using a checkpoint with a shared global NVFP4 scale for fused layers.
(EngineCore pid=1794) INFO 05-08 12:46:02 [gpu_model_runner.py:4879] Model loading took 15.76 GiB memory and 37.162847 seconds
(EngineCore pid=1794) INFO 05-08 12:46:03 [gpu_model_runner.py:5820] Encoder cache will be initialized with a budget of 4096 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=1794) /usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py:2353: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
(EngineCore pid=1794)   warnings.warn(
(EngineCore pid=1794) WARNING 05-08 12:46:24 [op.py:241] Priority not set for op rms_norm, using native implementation.
(EngineCore pid=1794) INFO 05-08 12:46:37 [backends.py:1069] Using cache directory: /workspace/.vllm_cache/torch_compile_cache/ced645544a/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=1794) INFO 05-08 12:46:37 [backends.py:1128] Dynamo bytecode transform time: 11.58 s
(EngineCore pid=1794) INFO 05-08 12:46:49 [backends.py:376] Cache the graph of compile range (1, 4096) for later use
(EngineCore pid=1794) INFO 05-08 12:47:13 [backends.py:391] Compiling a graph for compile range (1, 4096) takes 34.78 s
(EngineCore pid=1794) INFO 05-08 12:47:20 [decorators.py:668] saved AOT compiled function to /workspace/.vllm_cache/torch_compile_cache/torch_aot_compile/6fd79cfd16265f6994ed0d633fe7b3f5f084090dd1eb03b0281559213bb88470/rank_0_0/model
(EngineCore pid=1794) INFO 05-08 12:47:20 [monitor.py:53] torch.compile took 54.71 s in total
(EngineCore pid=1794) INFO 05-08 12:47:22 [monitor.py:81] Initial profiling/warmup run took 2.01 s
(EngineCore pid=1794) INFO 05-08 12:47:31 [gpu_model_runner.py:5963] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=1794) INFO 05-08 12:47:37 [gpu_model_runner.py:6042] Estimated CUDA graph memory: 1.12 GiB total
(EngineCore pid=1794) INFO 05-08 12:47:37 [gpu_worker.py:440] Available KV cache memory: 26.5 GiB
(EngineCore pid=1794) INFO 05-08 12:47:37 [gpu_worker.py:455] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9500 is equivalent to --gpu-memory-utilization=0.9263 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9737. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=1794) INFO 05-08 12:47:37 [kv_cache_utils.py:1711] GPU KV cache size: 115,776 tokens
(EngineCore pid=1794) INFO 05-08 12:47:37 [kv_cache_utils.py:1716] Maximum concurrency for 190,000 tokens per request: 5.76x
(EngineCore pid=1794) 2026-05-08 12:47:37,879 - INFO - autotuner.py:457 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
[AutoTuner]: Tuning fp4_gemm: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 82.11profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 104.15profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 106.36profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 115.32profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 87.51profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 87.66profile/s]
(EngineCore pid=1794) 2026-05-08 12:47:39,203 - INFO - autotuner.py:466 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████| 51/51 [00:06<00:00,  7.91it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:10<00:00,  3.32it/s]
(EngineCore pid=1794) INFO 05-08 12:47:57 [gpu_model_runner.py:6133] Graph capturing finished in 18 secs, took 0.98 GiB
(EngineCore pid=1794) INFO 05-08 12:47:57 [gpu_worker.py:599] CUDA graph pool memory: 0.98 GiB (actual), 1.12 GiB (estimated), difference: 0.14 GiB (14.1%).
(EngineCore pid=1794) INFO 05-08 12:47:57 [core.py:299] init engine (profile, create kv cache, warmup model) took 114.47 s (compilation: 54.71 s)
(EngineCore pid=1794) INFO 05-08 12:47:57 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1550) INFO 05-08 12:47:57 [api_server.py:598] Supported tasks: ['generate']
(APIServer pid=1550) INFO 05-08 12:47:58 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1550) WARNING 05-08 12:47:58 [model.py:1437] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1550) INFO 05-08 12:48:06 [hf.py:314] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1550) /usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py:2353: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
(APIServer pid=1550)   warnings.warn(
(APIServer pid=1550) INFO 05-08 12:48:31 [base.py:233] Multi-modal warmup completed in 24.898s
(APIServer pid=1550) /usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py:2353: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
(APIServer pid=1550)   warnings.warn(
(APIServer pid=1550) INFO 05-08 12:48:49 [base.py:233] Readonly multi-modal warmup completed in 17.886s
(APIServer pid=1550) INFO 05-08 12:48:49 [api_server.py:602] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:37] Available routes are:
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1550) INFO 05-08 12:48:49 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1550) INFO:     Started server process [1550]
(APIServer pid=1550) INFO:     Waiting for application startup.
(APIServer pid=1550) INFO:     Application startup complete.
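
As a side note, the `gpu_worker.py:455` message in the log above can be checked with simple arithmetic. This is a rough sketch using only numbers from that log. The last step assumes the reserved fraction equals the CUDA graph estimate divided by total device memory, which the log does not state:

```python
requested = 0.95      # --gpu-memory-utilization passed on the command line
equivalent = 0.9263   # value the log reports as equivalent without CUDA graph profiling
cudagraph_gib = 1.12  # "Estimated CUDA graph memory" from the log

# Share of GPU memory that profiling sets aside for CUDA graphs.
graph_fraction = requested - equivalent

# Value the log suggests in order to keep the previous effective KV cache size.
suggested = requested + graph_fraction

# Assumption: graph_fraction = cudagraph_gib / total device memory.
implied_total_gib = cudagraph_gib / graph_fraction

print(f"graph fraction:     {graph_fraction:.4f}")
print(f"suggested setting:  {suggested:.4f}")
print(f"implied GPU memory: {implied_total_gib:.1f} GiB")
```

The suggested value matches the 0.9737 printed in the log, and under the stated assumption the implied device memory comes out at roughly 47 GiB.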

hdnh2006

21 days ago

Any idea?
