How does one run GLM 4.6V using the vLLM docker image?

#19 · opened by kldzj

If we need to manually upgrade to transformers v5, is there currently no way to run this model using vLLM's v0.12.0 docker image?

Create a new file called GLM-4.6V.Dockerfile with the following content:

FROM vllm/vllm-openai:nightly
RUN uv pip install transformers==5.0.0rc0 --upgrade --no-deps --system
RUN uv pip install huggingface-hub --upgrade --no-deps --system

Then create a custom docker image by running:

docker build . -f GLM-4.6V.Dockerfile -t vllm/vllm-openai:glm46v

Then start a container from this custom vLLM image:

docker run -it \
  --gpus all \
  --ipc host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:glm46v \
     --model zai-org/GLM-4.6V \
     --tool-call-parser glm45 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --enable-expert-parallel \
     --allowed-local-media-path / \
     --mm-encoder-tp-mode data \
     --mm-processor-cache-type shm
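Once the container is up, you can sanity-check it with a request to the OpenAI-compatible endpoint. This is just a minimal sketch; the prompt and image URL are placeholders you would replace with your own:

# the image URL below is a placeholder - swap in a real image
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.6V",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }]
  }'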

Did you manage to get correct responses from the model for non-streaming requests?
For me, reasoning and response both end up in either content or reasoning_content, with the respective other field being empty (on 0.13.0 everything lands in content, on nightly everything lands in reasoning_content).
Only when streaming the request is the output correct, without reasoning traces in content.
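A quick way to see this behaviour is to send a non-streaming request and print both fields. A sketch, assuming jq is installed and the server is on localhost:8000 (the prompt is arbitrary):

# non-streaming request: compare content vs reasoning_content
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "zai-org/GLM-4.6V", "messages": [{"role": "user", "content": "Why is the sky blue?"}], "stream": false}' \
  | jq '.choices[0].message | {content, reasoning_content}'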

For me it is not working. Here is my docker compose service:

  vllm:
    runtime: nvidia
    container_name: vllm
    build:
      context: ../
      dockerfile: docker/stack/glm/Dockerfile
    environment:
      NVIDIA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
      HUGGING_FACE_HUB_TOKEN: ${HUGGING_FACE_HUB_TOKEN}
    command: >
      ${VLLM_MODEL}
      --port 8000
      --dtype auto
      --host 0.0.0.0
      --max-num-seqs 1
      --tensor-parallel-size 8
      --tool-call-parser glm45
      --reasoning-parser glm45
      --enable-auto-tool-choice
      --gpu-memory-utilization 0.85
      --served-model-name ${VLLM_MODEL}
      --media-io-kwargs '{"video": {"num_frames": -1}}'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 8
              capabilities: [gpu]
    shm_size: 64g
    ipc: host
    networks:
      - vllm
      - payslip_check
    volumes:
      - glm_cache:/root/.cache/huggingface

And here is my Dockerfile:

FROM vllm/vllm-openai:nightly
RUN uv pip install transformers==5.0.0rc0 --upgrade --no-deps --system
RUN uv pip install huggingface-hub --upgrade --no-deps --system

I have the following error:

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 4 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@GianSPF Maybe try with 2 GPUs and TP 2 for now? I can only say that for me it works with 2x RTX 3090 on Ubuntu 24.04 with CUDA 13, also using Docker, although I didn't install huggingface-hub inside my container and I just ran pip install instead of uv pip install, but I don't know if that makes a difference.

Also, you might want to include the logs above the RuntimeError, as the message below it is a warning that seems to be generated because of the error and shutdown. Those error logs might help with analyzing your specific problem.

Maybe also add information about which GPUs you are using.
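For example, something like this shows which GPUs you have and whether other processes are already holding VRAM (plain nvidia-smi, nothing vLLM-specific):

# list GPUs and current memory usage / processes
nvidia-smi
# or just the GPU names
nvidia-smi -L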

@meganoob1337

I updated the vLLM docker compose to 2 GPUs and TP 2.

2x H100
Ubuntu 24.04 LTS for NVIDIA® GPUs (CUDA® 12)

Do you have any idea why I'm having these errors? Here are the logs:

(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751] WorkerProc failed to start.
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751] Traceback (most recent call last):
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 722, in worker_main
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     worker = WorkerProc(*args, **kwargs)
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 562, in __init__
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     self.worker.load_model()
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 275, in load_model
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3858, in load_model
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     raise e
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3781, in load_model
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     self.model = model_loader.load_model(
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     model = initialize_model(
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]             ^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_1v.py", line 1451, in __init__
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     self.language_model = init_vllm_registered_model(
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 359, in init_vllm_registered_model
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     return initialize_model(vllm_config=vllm_config, prefix=prefix)
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py", line 658, in __init__
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     self.model = Glm4MoeModel(
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]                  ^^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 291, in __init__
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     old_init(self, **kwargs)
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py", line 428, in __init__
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]                                                     ^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py", line 430, in <lambda>
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     lambda prefix: Glm4MoeDecoderLayer(
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]                    ^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py", line 363, in __init__
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     self.mlp = Glm4MoE(
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]                ^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py", line 181, in __init__
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     self.experts = SharedFusedMoE(
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]                    ^^^^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/shared_fused_moe.py", line 28, in __init__
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     super().__init__(**kwargs)
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 643, in __init__
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     self.quant_method.create_weights(layer=self, **moe_quant_params)
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py", line 151, in create_weights
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     torch.empty(
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 103, in __torch_function__
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]     return func(*args, **kwargs)
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=537) ERROR 01-07 00:59:32 [multiproc_executor.py:751] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.38 GiB. GPU 0 has a total capacity of 79.10 GiB of which 967.75 MiB is free. Process 9275 has 78.14 GiB memory in use. Of the allocated memory 76.09 GiB is allocated by PyTorch, and 81.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(Worker_TP0 pid=537) INFO 01-07 00:59:32 [multiproc_executor.py:709] Parent process exited, terminating worker
(Worker_TP1 pid=538) INFO 01-07 00:59:32 [multiproc_executor.py:709] Parent process exited, terminating worker
[rank0]:[W107 00:59:33.114190029 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=404) Process EngineCore_DP0:
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895] EngineCore failed to start.
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895] Traceback (most recent call last):
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 886, in run_engine_core
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 651, in __init__
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]     super().__init__(
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 97, in __init__
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]     super().__init__(vllm_config)
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]     self._init_executor()
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 172, in _init_executor
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 660, in wait_for_ready
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895]     raise e from None
(EngineCore_DP0 pid=404) ERROR 01-07 00:59:35 [core.py:895] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=404) Traceback (most recent call last):
(EngineCore_DP0 pid=404)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=404)     self.run()
(EngineCore_DP0 pid=404)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=404)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=404)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 899, in run_engine_core
(EngineCore_DP0 pid=404)     raise e
(EngineCore_DP0 pid=404)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 886, in run_engine_core
(EngineCore_DP0 pid=404)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=404)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=404)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 651, in __init__
(EngineCore_DP0 pid=404)     super().__init__(
(EngineCore_DP0 pid=404)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=404)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=404)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=404)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 97, in __init__
(EngineCore_DP0 pid=404)     super().__init__(vllm_config)
(EngineCore_DP0 pid=404)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=404)     self._init_executor()
(EngineCore_DP0 pid=404)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 172, in _init_executor
(EngineCore_DP0 pid=404)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=404)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=404)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 660, in wait_for_ready
(EngineCore_DP0 pid=404)     raise e from None
(EngineCore_DP0 pid=404) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1324, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1343, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 171, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 212, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 824, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 479, in __init__
(APIServer pid=1)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 921, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 980, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.38 GiB. GPU 0 has a total capacity of 79.10 GiB of which 967.75 MiB is free. Process 9275 has 78.14 GiB memory in use. Of the allocated memory 76.09 GiB is allocated by PyTorch, and 81.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This is the culprit.
Maybe reduce your context length? Or check if any processes are using VRAM, or reduce --gpu-memory-utilization from 0.85 to something lower. If anything else is using VRAM on the machine, the percentage isn't always 100% accurate in my experience, so try lowering it until everything fits.
Maybe also restart your PC and try on a clean machine.

EDIT:
Ah wait, we are in the 4.6V repo, not 4.6V-Flash.
You might need a higher TP, but I would still play around with the max length for now. Try to get it running with something like 1k context and then you can see how much space you have for KV cache (it is in the logs if you search for "concurrency").
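As a rough sketch, that would mean adjusting these flags in the command: section of the compose file (the numbers are only a starting point to tune, not a known-good configuration for this model):

      --max-model-len 1024
      --gpu-memory-utilization 0.80
      --tensor-parallel-size 8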
