1. Introduction

The previous chapter covered installing and deploying vLLM. This post records the installation and runtime problems encountered along the way.

Previous article: vLLM (1) Private installation, deployment, and configuration. vLLM is one of the mainstream LLM serving frameworks, standing out in enterprise production environments for its efficient memory management, continuous batching, and tensor parallelism. It optimizes KV cache management with the PagedAttention algorithm, supports GPU acceleration and continuous batching, ships with API key verification, and is compatible with Hugging Face models and the OpenAI API. Installation only requires creating a Python 3.12 virtual environment and installing with pip. A later post will focus on multimodal use cases. https://blog.csdn.net/yilvqingtai/article/details/149633838

2. Server hardware and software configuration

  • 4× NVIDIA A10 24 GB GPUs
  • The NVIDIA GPU driver is already installed on the server
  • CUDA 12.8 is already installed on the server
(base) root@jinhu:/usr/local# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0
  • Default CUDA install location
(base) root@jinhu:/usr/local# ls
bin      btjdk     cuda     cuda-12.8  freetype  include  libiconv  openssl  share
btgojdk  bttomcat  cuda-12  etc        games     lib      man       sbin     src

The server originally had CUDA 12.0 installed. It had been installed several times without being cleaned up, so after upgrading to 12.8 the redundant cuda and cuda-12 directories were removed.

3. Running vLLM fails with libcuda.so not found

3.1 Installed vLLM version

vLLM 0.10.0 was installed in a conda virtual environment with Python 3.12.

(vLLM_cuda128_env_python312) root@jinhu:~/comfyDownFile# pip show vllm
Name: vllm
Version: 0.10.0
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email: 
License-Expression: Apache-2.0
Location: /root/anaconda3/envs/vLLM_cuda128_env_python312/lib/python3.12/site-packages
Requires: aiohttp, blake3, cachetools, cbor2, cloudpickle, compressed-tensors, depyf, diskcache, einops, fastapi, filelock, gguf, huggingface-hub, lark, llguidance, lm-format-enforcer, mistral_common, msgspec, ninja, numba, numpy, openai, opencv-python-headless, outlines_core, partial-json-parser, pillow, prometheus-fastapi-instrumentator, prometheus_client, protobuf, psutil, py-cpuinfo, pybase64, pydantic, python-json-logger, pyyaml, pyzmq, ray, regex, requests, scipy, sentencepiece, setuptools, six, tiktoken, tokenizers, torch, torchaudio, torchvision, tqdm, transformers, typing_extensions, watchfiles, xformers, xgrammar
Required-by: 
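
Before loading any model, it is worth verifying the environment from inside the same conda env. A minimal sanity-check sketch (nothing below is specific to this server; the expected values are just what this setup should report):

# Quick sanity check inside the vLLM virtual environment
import torch
import vllm

print("vllm:", vllm.__version__)                 # expected: 0.10.0
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())   # expected: 4 on this A10 server
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))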

3.2 Error when running the model offline

Running the Qwen model offline:

from vllm import LLM

llm = LLM(model="/home/vLLM/models/Qwen/Qwen3-0___6B",
          trust_remote_code=True,
          tensor_parallel_size=2,
          gpu_memory_utilization=0.8,
          max_model_len=4096,
)

The exception log is as follows:

INFO 07-26 02:17:08 [config.py:1604] Using max model len 4096
INFO 07-26 02:17:08 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-26 02:17:09 [core.py:572] Waiting for init message from front-end.
INFO 07-26 02:17:09 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/home/vLLM/models/Qwen/Qwen3-0___6B', speculative_config=None, tokenizer='/home/vLLM/models/Qwen/Qwen3-0___6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/vLLM/models/Qwen/Qwen3-0___6B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
ERROR 07-26 02:17:09 [core.py:632] EngineCore failed to start.
ERROR 07-26 02:17:09 [core.py:632] Traceback (most recent call last):
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 623, in run_engine_core
ERROR 07-26 02:17:09 [core.py:632]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-26 02:17:09 [core.py:632]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 441, in __init__
ERROR 07-26 02:17:09 [core.py:632]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 77, in __init__
ERROR 07-26 02:17:09 [core.py:632]     self.model_executor = executor_class(vllm_config)
ERROR 07-26 02:17:09 [core.py:632]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-26 02:17:09 [core.py:632]     self._init_executor()
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 07-26 02:17:09 [core.py:632]     self.collective_rpc("init_worker", args=([kwargs], ))
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 07-26 02:17:09 [core.py:632]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-26 02:17:09 [core.py:632]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2985, in run_method
ERROR 07-26 02:17:09 [core.py:632]     return func(*args, **kwargs)
ERROR 07-26 02:17:09 [core.py:632]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 556, in init_worker
ERROR 07-26 02:17:09 [core.py:632]     worker_class = resolve_obj_by_qualname(
ERROR 07-26 02:17:09 [core.py:632]                    ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2539, in resolve_obj_by_qualname
ERROR 07-26 02:17:09 [core.py:632]     module = importlib.import_module(module_name)
ERROR 07-26 02:17:09 [core.py:632]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/importlib/__init__.py", line 90, in import_module
ERROR 07-26 02:17:09 [core.py:632]     return _bootstrap._gcd_import(name[level:], package, level)
ERROR 07-26 02:17:09 [core.py:632]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
ERROR 07-26 02:17:09 [core.py:632]   File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
ERROR 07-26 02:17:09 [core.py:632]   File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
ERROR 07-26 02:17:09 [core.py:632]   File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
ERROR 07-26 02:17:09 [core.py:632]   File "<frozen importlib._bootstrap_external>", line 999, in exec_module
ERROR 07-26 02:17:09 [core.py:632]   File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 33, in <module>
ERROR 07-26 02:17:09 [core.py:632]     from vllm.v1.worker.gpu_model_runner import GPUModelRunner
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 33, in <module>
ERROR 07-26 02:17:09 [core.py:632]     from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaBase
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/mamba_mixer2.py", line 29, in <module>
ERROR 07-26 02:17:09 [core.py:632]     from vllm.model_executor.layers.mamba.ops.ssd_combined import (
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/ssd_combined.py", line 15, in <module>
ERROR 07-26 02:17:09 [core.py:632]     from .ssd_bmm import _bmm_chunk_fwd
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/ssd_bmm.py", line 16, in <module>
ERROR 07-26 02:17:09 [core.py:632]     @triton.autotune(
ERROR 07-26 02:17:09 [core.py:632]      ^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 378, in decorator
ERROR 07-26 02:17:09 [core.py:632]     return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
ERROR 07-26 02:17:09 [core.py:632]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 130, in __init__
ERROR 07-26 02:17:09 [core.py:632]     self.do_bench = driver.active.get_benchmarker()
ERROR 07-26 02:17:09 [core.py:632]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/driver.py", line 23, in __getattr__
ERROR 07-26 02:17:09 [core.py:632]     self._initialize_obj()
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
ERROR 07-26 02:17:09 [core.py:632]     self._obj = self._init_fn()
ERROR 07-26 02:17:09 [core.py:632]                 ^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/driver.py", line 9, in _create_driver
ERROR 07-26 02:17:09 [core.py:632]     return actives[0]()
ERROR 07-26 02:17:09 [core.py:632]            ^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 535, in __init__
ERROR 07-26 02:17:09 [core.py:632]     self.utils = CudaUtils()  # TODO: make static
ERROR 07-26 02:17:09 [core.py:632]                  ^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 89, in __init__
ERROR 07-26 02:17:09 [core.py:632]     mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
ERROR 07-26 02:17:09 [core.py:632]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 66, in compile_module_from_src
ERROR 07-26 02:17:09 [core.py:632]     so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
ERROR 07-26 02:17:09 [core.py:632]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/build.py", line 36, in _build
ERROR 07-26 02:17:09 [core.py:632]     subprocess.check_call(cc_cmd, stdout=subprocess.DEVNULL)
ERROR 07-26 02:17:09 [core.py:632]   File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/subprocess.py", line 413, in check_call
ERROR 07-26 02:17:09 [core.py:632]     raise CalledProcessError(retcode, cmd)
ERROR 07-26 02:17:09 [core.py:632] subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpnpkuzanv/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpnpkuzanv/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpnpkuzanv', '-I/root/anaconda3/envs/vLLMenv_python312/include/python3.12']' returned non-zero exit status 1.
Process EngineCore_0:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 636, in run_engine_core
    raise e
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 623, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 441, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 77, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
    self.collective_rpc("init_worker", args=([kwargs], ))
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2985, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 556, in init_worker
    worker_class = resolve_obj_by_qualname(
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2539, in resolve_obj_by_qualname
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 33, in <module>
    from vllm.v1.worker.gpu_model_runner import GPUModelRunner
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 33, in <module>
    from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaBase
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/mamba_mixer2.py", line 29, in <module>
    from vllm.model_executor.layers.mamba.ops.ssd_combined import (
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/ssd_combined.py", line 15, in <module>
    from .ssd_bmm import _bmm_chunk_fwd
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/ssd_bmm.py", line 16, in <module>
    @triton.autotune(
     ^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 378, in decorator
    return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 130, in __init__
    self.do_bench = driver.active.get_benchmarker()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/driver.py", line 23, in __getattr__
    self._initialize_obj()
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
    self._obj = self._init_fn()
                ^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/driver.py", line 9, in _create_driver
    return actives[0]()
           ^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 535, in __init__
    self.utils = CudaUtils()  # TODO: make static
                 ^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 89, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 66, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/build.py", line 36, in _build
    subprocess.check_call(cc_cmd, stdout=subprocess.DEVNULL)
  File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpnpkuzanv/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpnpkuzanv/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpnpkuzanv', '-I/root/anaconda3/envs/vLLMenv_python312/include/python3.12']' returned non-zero exit status 1.
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 3
      1 from vllm import LLM
----> 3 llm = LLM(model="/home/vLLM/models/Qwen/Qwen3-0___6B",
      4           trust_remote_code=True,
      5           max_model_len=4096,
      6 )

File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/entrypoints/llm.py:273, in LLM.__init__(self, model, task, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, hf_token, hf_overrides, mm_processor_kwargs, override_pooler_config, compilation_config, **kwargs)
    243 engine_args = EngineArgs(
    244     model=model,
    245     task=task,
   (...)    269     **kwargs,
    270 )
    272 # Create the Engine (autoselects V0 vs V1)
--> 273 self.llm_engine = LLMEngine.from_engine_args(
    274     engine_args=engine_args, usage_context=UsageContext.LLM_CLASS)
    275 self.engine_class = type(self.llm_engine)
    277 self.request_counter = Counter()

File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/engine/llm_engine.py:497, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
    494     from vllm.v1.engine.llm_engine import LLMEngine as V1LLMEngine
    495     engine_cls = V1LLMEngine
--> 497 return engine_cls.from_vllm_config(
    498     vllm_config=vllm_config,
    499     usage_context=usage_context,
    500     stat_loggers=stat_loggers,
    501     disable_log_stats=engine_args.disable_log_stats,
    502 )

File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py:126, in LLMEngine.from_vllm_config(cls, vllm_config, usage_context, stat_loggers, disable_log_stats)
    118 @classmethod
    119 def from_vllm_config(
    120     cls,
   (...)    124     disable_log_stats: bool = False,
    125 ) -> "LLMEngine":
--> 126     return cls(vllm_config=vllm_config,
    127                executor_class=Executor.get_class(vllm_config),
    128                log_stats=(not disable_log_stats),
    129                usage_context=usage_context,
    130                stat_loggers=stat_loggers,
    131                multiprocess_mode=envs.VLLM_ENABLE_V1_MULTIPROCESSING)

File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py:103, in LLMEngine.__init__(self, vllm_config, executor_class, log_stats, usage_context, stat_loggers, mm_registry, use_cached_outputs, multiprocess_mode)
     99 self.output_processor = OutputProcessor(self.tokenizer,
    100                                         log_stats=self.log_stats)
    102 # EngineCore (gets EngineCoreRequests and gives EngineCoreOutputs)
--> 103 self.engine_core = EngineCoreClient.make_client(
    104     multiprocess_mode=multiprocess_mode,
    105     asyncio_mode=False,
    106     vllm_config=vllm_config,
    107     executor_class=executor_class,
    108     log_stats=self.log_stats,
    109 )
    111 if not multiprocess_mode:
    112     # for v0 compatibility
    113     self.model_executor = self.engine_core.engine_core.model_executor  # type: ignore

File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core_client.py:77, in EngineCoreClient.make_client(multiprocess_mode, asyncio_mode, vllm_config, executor_class, log_stats)
     73     return EngineCoreClient.make_async_mp_client(
     74         vllm_config, executor_class, log_stats)
     76 if multiprocess_mode and not asyncio_mode:
---> 77     return SyncMPClient(vllm_config, executor_class, log_stats)
     79 return InprocClient(vllm_config, executor_class, log_stats)

File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core_client.py:514, in SyncMPClient.__init__(self, vllm_config, executor_class, log_stats)
    512 def __init__(self, vllm_config: VllmConfig, executor_class: type[Executor],
    513              log_stats: bool):
--> 514     super().__init__(
    515         asyncio_mode=False,
    516         vllm_config=vllm_config,
    517         executor_class=executor_class,
    518         log_stats=log_stats,
    519     )
    521     self.is_dp = self.vllm_config.parallel_config.data_parallel_size > 1
    522     self.outputs_queue = queue.Queue[Union[EngineCoreOutputs, Exception]]()

File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core_client.py:408, in MPClient.__init__(self, asyncio_mode, vllm_config, executor_class, log_stats, client_addresses)
    404     self.stats_update_address = client_addresses.get(
    405         "stats_update_address")
    406 else:
    407     # Engines are managed by this client.
--> 408     with launch_core_engines(vllm_config, executor_class,
    409                              log_stats) as (engine_manager,
    410                                             coordinator,
    411                                             addresses):
    412         self.resources.coordinator = coordinator
    413         self.resources.engine_manager = engine_manager

File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/contextlib.py:144, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    142 if typ is None:
    143     try:
--> 144         next(self.gen)
    145     except StopIteration:
    146         return False

File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/utils.py:697, in launch_core_engines(vllm_config, executor_class, log_stats, num_api_servers)
    694 yield local_engine_manager, coordinator, addresses
    696 # Now wait for engines to start.
--> 697 wait_for_engine_startup(
    698     handshake_socket,
    699     addresses,
    700     engines_to_handshake,
    701     parallel_config,
    702     vllm_config.cache_config,
    703     local_engine_manager,
    704     coordinator.proc if coordinator else None,
    705 )

File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/utils.py:750, in wait_for_engine_startup(handshake_socket, addresses, core_engines, parallel_config, cache_config, proc_manager, coord_process)
    748     if coord_process is not None and coord_process.exitcode is not None:
    749         finished[coord_process.name] = coord_process.exitcode
--> 750     raise RuntimeError("Engine core initialization failed. "
    751                        "See root cause above. "
    752                        f"Failed core proc(s): {finished}")
    754 # Receive HELLO and READY messages from the input socket.
    755 eng_identity, ready_msg_bytes = handshake_socket.recv_multipart()

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

The key error lines:

subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpnpkuzanv/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpnpkuzanv/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpnpkuzanv', '-I/root/anaconda3/envs/vLLMenv_python312/include/python3.12']' returned non-zero exit status 1.
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status

3.3 Root cause analysis

/usr/bin/ld: cannot find -lcuda: No such file or directory

This error means that, at the linking stage, the system could not find the CUDA driver library (libcuda.so or libcuda.a).

Possible causes (a small diagnostic sketch follows this list):

  1. CUDA Toolkit not installed: the CUDA Toolkit may be missing, or the installed version may be incompatible.
  2. CUDA library path not configured: even with the Toolkit installed, the system library path (e.g. LD_LIBRARY_PATH) may not include the directory containing the CUDA libraries, or the linker configuration (e.g. ld.so.conf) may not be set up correctly.
  3. Missing symbolic link: the CUDA library file exists, but the required symlink is missing (for example, an unversioned libcuda.so pointing at the versioned file).
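
To narrow down which of these causes applies, the runtime library (libcuda.so.1) and the unversioned link-time name (libcuda.so) can be checked separately. The paths below are the usual ones on Ubuntu-style systems and are an assumption, not part of the original logs:

# Diagnose why the linker cannot resolve -lcuda
import ctypes
import ctypes.util
import glob

# Runtime check: the dynamic loader only ever needs libcuda.so.1
try:
    ctypes.CDLL("libcuda.so.1")
    print("libcuda.so.1 loads: the driver library is installed and visible to ld.so")
except OSError as e:
    print("libcuda.so.1 failed to load:", e)

# Link-time check: gcc -lcuda needs an unversioned libcuda.so on the search path
print("find_library('cuda'):", ctypes.util.find_library("cuda"))
print("candidates:", glob.glob("/usr/lib/x86_64-linux-gnu/libcuda.so*"))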

4. Solution

4.1 Locate the libcuda.so file

Use the following commands to check whether libcuda.so exists on the system.

Searching the cuda-12.8 install directory turns up nothing:

(base) root@jinhu:/usr/local# ls /usr/local/cuda-12.8/lib64 | grep libcuda.so
(base) root@jinhu:/usr/local#

A global search finds it under /usr/lib/x86_64-linux-gnu/:

(base) root@jinhu:/usr/local/cuda-12.8/lib64# sudo find / -name 'libcuda.so*'
/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/stubs/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so.570.148.08

On the differences between the libcuda.so copies in different directories:

When the official NVIDIA driver is installed, it places libcuda.so under /usr/lib/x86_64-linux-gnu/.

libcuda.so is the core library of the CUDA Driver API. It provides the interface through which applications talk directly to the underlying NVIDIA GPU driver. Any program that uses CUDA (whether via the Driver API directly or via the Runtime API) ultimately links against this library at runtime.

  • Purpose: shared system-wide, for all users and applications.

  1. libcuda.so belongs to the driver, not to the Toolkit:

    • The key point: libcuda.so is a core component of the NVIDIA GPU driver. It is installed by the nvidia-driver package, not by the cuda-toolkit package.

    • The CUDA Toolkit mainly provides development tools (nvcc, Nsight), runtime libraries (libcudart.so), math libraries (libcublas.so, libcufft.so), header files, samples, and so on. It does not include, and is not responsible for installing, the low-level GPU driver library libcuda.so.
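
This split is easy to see at runtime: the Driver API version comes from libcuda.so (shipped with the driver), while the toolkit version that PyTorch was built against is reported separately. A rough sketch, assuming a standard Linux driver install:

# Driver library (libcuda.so, from nvidia-driver) vs toolkit version (CUDA runtime)
import ctypes
import torch

libcuda = ctypes.CDLL("libcuda.so.1")                 # installed by the GPU driver, not the Toolkit
version = ctypes.c_int()
libcuda.cuDriverGetVersion(ctypes.byref(version))
print("driver API version:", version.value)           # e.g. 12080 means CUDA 12.8 support
print("torch built with CUDA:", torch.version.cuda)   # toolkit version PyTorch was compiled against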

4.2 Set the environment variables

Configure the CUDA paths in the shell environment.

Run the command:

 vi  ~/.bashrc

Append the following at the end of the file:

export PATH=/usr/local/cuda-12.8/bin:$PATH
# export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

Make the changes take effect:

source ~/.bashrc
(base) root@jinhu:/usr/local/cuda-12.8/lib64# sudo find / -name 'libcuda.so*'
/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/stubs/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so.570.148.08
/var/lib/docker/overlay2/84850e86f40d73dbae057eafc1875cb57327a85ad6b5a8f771d28b664612f455/diff/usr/local/cuda-12.8/compat/libcuda.so
/var/lib/docker/overlay2/84850e86f40d73dbae057eafc1875cb57327a85ad6b5a8f771d28b664612f455/diff/usr/local/cuda-12.8/compat/libcuda.so.1
/var/lib/docker/overlay2/84850e86f40d73dbae057eafc1875cb57327a85ad6b5a8f771d28b664612f455/diff/usr/local/cuda-12.8/compat/libcuda.so.570.86.10
(base) root@jinhu:/usr/local/cuda-12.8/lib64# echo "/usr/lib/x86_64-linux-gnu" | sudo tee /etc/ld.so.conf.d/cuda.conf
/usr/lib/x86_64-linux-gnu
(base) root@jinhu:/usr/local/cuda-12.8/lib64# sudo ldconfig
(base) root@jinhu:/usr/local/cuda-12.8/lib64# sudo ln -s /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so
(base) root@jinhu:/usr/local/cuda-12.8/lib64# vi  ~/.bashrc
(base) root@jinhu:/usr/local/cuda-12.8/lib64# source ~/.bashrc
(base) root@jinhu:/usr/local/cuda-12.8/lib64# vi  ~/.bashrc
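
Before restarting vLLM, the failing step can be reproduced in isolation: Triton was only compiling a small C file with gcc and linking it against -lcuda. A hedged sketch of the same link test (a throwaway C file, no CUDA calls needed):

# Reproduce Triton's failing build step: link a trivial shared object against -lcuda.
# If this prints "link ok", the "/usr/bin/ld: cannot find -lcuda" error should be gone.
import pathlib
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "main.c"
    src.write_text("int main(void) { return 0; }\n")
    cmd = ["/usr/bin/gcc", str(src), "-shared", "-fPIC",
           "-o", str(pathlib.Path(tmp) / "probe.so"),
           "-L/usr/lib/x86_64-linux-gnu", "-lcuda"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print("link ok" if result.returncode == 0 else result.stderr)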

4.3 Running again succeeds

from vllm import LLM

llm = LLM(model="/home/vLLM/models/Qwen/Qwen3-0___6B",
          trust_remote_code=True,
          tensor_parallel_size=2,
          gpu_memory_utilization=0.8,
          max_model_len=4096,
)

The startup log is as follows:

INFO 07-30 04:34:13 [config.py:1604] Using max model len 4096
INFO 07-30 04:34:13 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-30 04:34:14 [core.py:572] Waiting for init message from front-end.
INFO 07-30 04:34:14 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/home/vLLM/models/Qwen/Qwen3-0___6B', speculative_config=None, tokenizer='/home/vLLM/models/Qwen/Qwen3-0___6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/vLLM/models/Qwen/Qwen3-0___6B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 07-30 04:34:14 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 36 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-30 04:34:14 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_b244038f'), local_subscribe_addr='ipc:///tmp/d7c52630-a47f-4adc-8772-5b36db25d7b0', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:16 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_17a5fe6c'), local_subscribe_addr='ipc:///tmp/470f026d-1b8b-4ea8-88d6-7d283f0bb2fd', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:16 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_e5c2d241'), local_subscribe_addr='ipc:///tmp/b8c85095-e2b6-4631-97ec-6a0e031730c2', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=1003186) (VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:17 [__init__.py:1375] Found nccl from library libnccl.so.2
INFO 07-30 04:34:17 [__init__.py:1375] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=1003186) (VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:17 [pynccl.py:70] vLLM is using nccl==2.26.2
INFO 07-30 04:34:17 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:17 [custom_all_reduce_utils.py:208] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=0 pid=1003186) (VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:36 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
INFO 07-30 04:34:36 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=0 pid=1003186) (VllmWorker rank=1 pid=1003187) WARNING 07-30 04:34:36 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 07-30 04:34:36 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:36 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_6b68bda1'), local_subscribe_addr='ipc:///tmp/da4bb46b-0a31-4756-9b7e-2027a321dace', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:36 [parallel_state.py:1102] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=1 pid=1003187) WARNING 07-30 04:34:36 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:36 [gpu_model_runner.py:1843] Starting to load model /home/vLLM/models/Qwen/Qwen3-0___6B...
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:36 [parallel_state.py:1102] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=0 pid=1003186) WARNING 07-30 04:34:36 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:36 [gpu_model_runner.py:1843] Starting to load model /home/vLLM/models/Qwen/Qwen3-0___6B...
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:37 [gpu_model_runner.py:1875] Loading model from scratch...
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:37 [gpu_model_runner.py:1875] Loading model from scratch...
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:37 [cuda.py:290] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:37 [cuda.py:290] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.09it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.06it/s]
(VllmWorker rank=0 pid=1003186) 
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:37 [default_loader.py:262] Loading weights took 0.28 seconds
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:37 [default_loader.py:262] Loading weights took 0.27 seconds
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:38 [gpu_model_runner.py:1892] Model loading took 0.5660 GiB and 0.451207 seconds
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:38 [gpu_model_runner.py:1892] Model loading took 0.5660 GiB and 0.444961 seconds
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:47 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/566a023fde/rank_0_0/backbone for vLLM's torch.compile
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:47 [backends.py:541] Dynamo bytecode transform time: 8.56 s
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:47 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/566a023fde/rank_1_0/backbone for vLLM's torch.compile
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:47 [backends.py:541] Dynamo bytecode transform time: 8.88 s
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:53 [backends.py:194] Cache the graph for dynamic shape for later use
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:54 [backends.py:194] Cache the graph for dynamic shape for later use
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:35:21 [backends.py:215] Compiling a graph for dynamic shape takes 33.70 s
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:35:21 [backends.py:215] Compiling a graph for dynamic shape takes 33.62 s
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:35:32 [monitor.py:34] torch.compile takes 42.50 s in total
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:35:32 [monitor.py:34] torch.compile takes 42.26 s in total
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:35:34 [gpu_worker.py:255] Available KV cache memory: 15.57 GiB
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:35:34 [gpu_worker.py:255] Available KV cache memory: 15.57 GiB
INFO 07-30 04:35:34 [kv_cache_utils.py:833] GPU KV cache size: 291,552 tokens
INFO 07-30 04:35:34 [kv_cache_utils.py:837] Maximum concurrency for 4,096 tokens per request: 71.18x
INFO 07-30 04:35:34 [kv_cache_utils.py:833] GPU KV cache size: 291,552 tokens
INFO 07-30 04:35:34 [kv_cache_utils.py:837] Maximum concurrency for 4,096 tokens per request: 71.18x
Capturing CUDA graph shapes: 100%|██████████| 67/67 [00:04<00:00, 16.41it/s]
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:35:39 [gpu_model_runner.py:2485] Graph capturing finished in 5 secs, took 0.64 GiB
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:35:39 [gpu_model_runner.py:2485] Graph capturing finished in 5 secs, took 0.64 GiB
INFO 07-30 04:35:39 [core.py:193] init engine (profile, create kv cache, warmup model) took 61.32 seconds
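
With the engine initialized, a short generation call confirms end-to-end inference works. A minimal smoke test (the prompt and sampling settings below are arbitrary, not from the original post):

# Smoke test using the `llm` object created above
from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=64)
outputs = llm.generate(["Briefly introduce the Qwen3 model."], sampling_params)
for output in outputs:
    print(output.outputs[0].text)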
