vLLM (2): Fixing the "libcuda.so not found" problem in an on-premises CUDA install
This post records how a missing-libcuda.so problem was resolved while deploying the vLLM framework on a four-GPU A10 server. Key points: 1) confirm the CUDA 12.8 environment and driver installation; 2) discover that libcuda.so lives in /usr/lib/x86_64-linux-gnu/ rather than in the CUDA install directory; 3) fix the problem by adjusting the LD_LIBRARY_PATH environment variable, creating a symbolic link, and updating ldconfig; 4) finally get the Qwen3-0.6B model running on the server.
1. Introduction
The previous article covered installing and deploying vLLM. This one records a problem we ran into while getting it to run.
2. Server hardware and software
- 4x NVIDIA A10 (24 GB each)
- NVIDIA GPU driver already installed
- CUDA 12.8 already installed
(base) root@jinhu:/usr/local# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0
- Default CUDA install location
(base) root@jinhu:/usr/local# ls
bin btjdk cuda cuda-12.8 freetype include libiconv openssl share
btgojdk bttomcat cuda-12 etc games lib man sbin src
The server originally had CUDA 12.0. Leftovers from repeated installs were never removed, so after upgrading to 12.8 the redundant cuda and cuda-12 directories were deleted.
3. Running vLLM fails with libcuda.so not found
3.1 Installed vLLM version
vLLM was installed in a conda virtual environment with python=3.12; the current vLLM version is 0.10.0.
(vLLM_cuda128_env_python312) root@jinhu:~/comfyDownFile# pip show vllm
Name: vllm
Version: 0.10.0
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email:
License-Expression: Apache-2.0
Location: /root/anaconda3/envs/vLLM_cuda128_env_python312/lib/python3.12/site-packages
Requires: aiohttp, blake3, cachetools, cbor2, cloudpickle, compressed-tensors, depyf, diskcache, einops, fastapi, filelock, gguf, huggingface-hub, lark, llguidance, lm-format-enforcer, mistral_common, msgspec, ninja, numba, numpy, openai, opencv-python-headless, outlines_core, partial-json-parser, pillow, prometheus-fastapi-instrumentator, prometheus_client, protobuf, psutil, py-cpuinfo, pybase64, pydantic, python-json-logger, pyyaml, pyzmq, ray, regex, requests, scipy, sentencepiece, setuptools, six, tiktoken, tokenizers, torch, torchaudio, torchvision, tqdm, transformers, typing_extensions, watchfiles, xformers, xgrammar
Required-by:
3.2 Error when running a model offline
Loading the Qwen model offline:
from vllm import LLM
llm = LLM(model="/home/vLLM/models/Qwen/Qwen3-0___6B",
trust_remote_code=True,
tensor_parallel_size=2,
gpu_memory_utilization=0.8,
max_model_len=4096,
)
The exception log:
INFO 07-26 02:17:08 [config.py:1604] Using max model len 4096
INFO 07-26 02:17:08 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-26 02:17:09 [core.py:572] Waiting for init message from front-end.
INFO 07-26 02:17:09 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/home/vLLM/models/Qwen/Qwen3-0___6B', speculative_config=None, tokenizer='/home/vLLM/models/Qwen/Qwen3-0___6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/vLLM/models/Qwen/Qwen3-0___6B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
ERROR 07-26 02:17:09 [core.py:632] EngineCore failed to start.
ERROR 07-26 02:17:09 [core.py:632] Traceback (most recent call last):
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 623, in run_engine_core
ERROR 07-26 02:17:09 [core.py:632] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 441, in __init__
ERROR 07-26 02:17:09 [core.py:632] super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 77, in __init__
ERROR 07-26 02:17:09 [core.py:632] self.model_executor = executor_class(vllm_config)
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-26 02:17:09 [core.py:632] self._init_executor()
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 07-26 02:17:09 [core.py:632] self.collective_rpc("init_worker", args=([kwargs], ))
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 07-26 02:17:09 [core.py:632] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2985, in run_method
ERROR 07-26 02:17:09 [core.py:632] return func(*args, **kwargs)
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 556, in init_worker
ERROR 07-26 02:17:09 [core.py:632] worker_class = resolve_obj_by_qualname(
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2539, in resolve_obj_by_qualname
ERROR 07-26 02:17:09 [core.py:632] module = importlib.import_module(module_name)
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/importlib/__init__.py", line 90, in import_module
ERROR 07-26 02:17:09 [core.py:632] return _bootstrap._gcd_import(name[level:], package, level)
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
ERROR 07-26 02:17:09 [core.py:632] File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
ERROR 07-26 02:17:09 [core.py:632] File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
ERROR 07-26 02:17:09 [core.py:632] File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
ERROR 07-26 02:17:09 [core.py:632] File "<frozen importlib._bootstrap_external>", line 999, in exec_module
ERROR 07-26 02:17:09 [core.py:632] File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 33, in <module>
ERROR 07-26 02:17:09 [core.py:632] from vllm.v1.worker.gpu_model_runner import GPUModelRunner
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 33, in <module>
ERROR 07-26 02:17:09 [core.py:632] from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaBase
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/mamba_mixer2.py", line 29, in <module>
ERROR 07-26 02:17:09 [core.py:632] from vllm.model_executor.layers.mamba.ops.ssd_combined import (
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/ssd_combined.py", line 15, in <module>
ERROR 07-26 02:17:09 [core.py:632] from .ssd_bmm import _bmm_chunk_fwd
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/ssd_bmm.py", line 16, in <module>
ERROR 07-26 02:17:09 [core.py:632] @triton.autotune(
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 378, in decorator
ERROR 07-26 02:17:09 [core.py:632] return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 130, in __init__
ERROR 07-26 02:17:09 [core.py:632] self.do_bench = driver.active.get_benchmarker()
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/driver.py", line 23, in __getattr__
ERROR 07-26 02:17:09 [core.py:632] self._initialize_obj()
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
ERROR 07-26 02:17:09 [core.py:632] self._obj = self._init_fn()
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/driver.py", line 9, in _create_driver
ERROR 07-26 02:17:09 [core.py:632] return actives[0]()
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 535, in __init__
ERROR 07-26 02:17:09 [core.py:632] self.utils = CudaUtils() # TODO: make static
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 89, in __init__
ERROR 07-26 02:17:09 [core.py:632] mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 66, in compile_module_from_src
ERROR 07-26 02:17:09 [core.py:632] so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
ERROR 07-26 02:17:09 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/build.py", line 36, in _build
ERROR 07-26 02:17:09 [core.py:632] subprocess.check_call(cc_cmd, stdout=subprocess.DEVNULL)
ERROR 07-26 02:17:09 [core.py:632] File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/subprocess.py", line 413, in check_call
ERROR 07-26 02:17:09 [core.py:632] raise CalledProcessError(retcode, cmd)
ERROR 07-26 02:17:09 [core.py:632] subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpnpkuzanv/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpnpkuzanv/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpnpkuzanv', '-I/root/anaconda3/envs/vLLMenv_python312/include/python3.12']' returned non-zero exit status 1.
Process EngineCore_0:
Traceback (most recent call last):
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 636, in run_engine_core
raise e
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 623, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 441, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 77, in __init__
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
self._init_executor()
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
self.collective_rpc("init_worker", args=([kwargs], ))
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2985, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 556, in init_worker
worker_class = resolve_obj_by_qualname(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2539, in resolve_obj_by_qualname
module = importlib.import_module(module_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/importlib/__init__.py", line 90, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 999, in exec_module
File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 33, in <module>
from vllm.v1.worker.gpu_model_runner import GPUModelRunner
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 33, in <module>
from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaBase
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/mamba_mixer2.py", line 29, in <module>
from vllm.model_executor.layers.mamba.ops.ssd_combined import (
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/ssd_combined.py", line 15, in <module>
from .ssd_bmm import _bmm_chunk_fwd
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/ssd_bmm.py", line 16, in <module>
@triton.autotune(
^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 378, in decorator
return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 130, in __init__
self.do_bench = driver.active.get_benchmarker()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/driver.py", line 9, in _create_driver
return actives[0]()
^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 535, in __init__
self.utils = CudaUtils() # TODO: make static
^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 89, in __init__
mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 66, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/runtime/build.py", line 36, in _build
subprocess.check_call(cc_cmd, stdout=subprocess.DEVNULL)
File "/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpnpkuzanv/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpnpkuzanv/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpnpkuzanv', '-I/root/anaconda3/envs/vLLMenv_python312/include/python3.12']' returned non-zero exit status 1.
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[4], line 3
1 from vllm import LLM
----> 3 llm = LLM(model="/home/vLLM/models/Qwen/Qwen3-0___6B",
4 trust_remote_code=True,
5 max_model_len=4096,
6 )
File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/entrypoints/llm.py:273, in LLM.__init__(self, model, task, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, hf_token, hf_overrides, mm_processor_kwargs, override_pooler_config, compilation_config, **kwargs)
243 engine_args = EngineArgs(
244 model=model,
245 task=task,
(...) 269 **kwargs,
270 )
272 # Create the Engine (autoselects V0 vs V1)
--> 273 self.llm_engine = LLMEngine.from_engine_args(
274 engine_args=engine_args, usage_context=UsageContext.LLM_CLASS)
275 self.engine_class = type(self.llm_engine)
277 self.request_counter = Counter()
File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/engine/llm_engine.py:497, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
494 from vllm.v1.engine.llm_engine import LLMEngine as V1LLMEngine
495 engine_cls = V1LLMEngine
--> 497 return engine_cls.from_vllm_config(
498 vllm_config=vllm_config,
499 usage_context=usage_context,
500 stat_loggers=stat_loggers,
501 disable_log_stats=engine_args.disable_log_stats,
502 )
File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py:126, in LLMEngine.from_vllm_config(cls, vllm_config, usage_context, stat_loggers, disable_log_stats)
118 @classmethod
119 def from_vllm_config(
120 cls,
(...) 124 disable_log_stats: bool = False,
125 ) -> "LLMEngine":
--> 126 return cls(vllm_config=vllm_config,
127 executor_class=Executor.get_class(vllm_config),
128 log_stats=(not disable_log_stats),
129 usage_context=usage_context,
130 stat_loggers=stat_loggers,
131 multiprocess_mode=envs.VLLM_ENABLE_V1_MULTIPROCESSING)
File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py:103, in LLMEngine.__init__(self, vllm_config, executor_class, log_stats, usage_context, stat_loggers, mm_registry, use_cached_outputs, multiprocess_mode)
99 self.output_processor = OutputProcessor(self.tokenizer,
100 log_stats=self.log_stats)
102 # EngineCore (gets EngineCoreRequests and gives EngineCoreOutputs)
--> 103 self.engine_core = EngineCoreClient.make_client(
104 multiprocess_mode=multiprocess_mode,
105 asyncio_mode=False,
106 vllm_config=vllm_config,
107 executor_class=executor_class,
108 log_stats=self.log_stats,
109 )
111 if not multiprocess_mode:
112 # for v0 compatibility
113 self.model_executor = self.engine_core.engine_core.model_executor # type: ignore
File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core_client.py:77, in EngineCoreClient.make_client(multiprocess_mode, asyncio_mode, vllm_config, executor_class, log_stats)
73 return EngineCoreClient.make_async_mp_client(
74 vllm_config, executor_class, log_stats)
76 if multiprocess_mode and not asyncio_mode:
---> 77 return SyncMPClient(vllm_config, executor_class, log_stats)
79 return InprocClient(vllm_config, executor_class, log_stats)
File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core_client.py:514, in SyncMPClient.__init__(self, vllm_config, executor_class, log_stats)
512 def __init__(self, vllm_config: VllmConfig, executor_class: type[Executor],
513 log_stats: bool):
--> 514 super().__init__(
515 asyncio_mode=False,
516 vllm_config=vllm_config,
517 executor_class=executor_class,
518 log_stats=log_stats,
519 )
521 self.is_dp = self.vllm_config.parallel_config.data_parallel_size > 1
522 self.outputs_queue = queue.Queue[Union[EngineCoreOutputs, Exception]]()
File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/core_client.py:408, in MPClient.__init__(self, asyncio_mode, vllm_config, executor_class, log_stats, client_addresses)
404 self.stats_update_address = client_addresses.get(
405 "stats_update_address")
406 else:
407 # Engines are managed by this client.
--> 408 with launch_core_engines(vllm_config, executor_class,
409 log_stats) as (engine_manager,
410 coordinator,
411 addresses):
412 self.resources.coordinator = coordinator
413 self.resources.engine_manager = engine_manager
File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/contextlib.py:144, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
142 if typ is None:
143 try:
--> 144 next(self.gen)
145 except StopIteration:
146 return False
File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/utils.py:697, in launch_core_engines(vllm_config, executor_class, log_stats, num_api_servers)
694 yield local_engine_manager, coordinator, addresses
696 # Now wait for engines to start.
--> 697 wait_for_engine_startup(
698 handshake_socket,
699 addresses,
700 engines_to_handshake,
701 parallel_config,
702 vllm_config.cache_config,
703 local_engine_manager,
704 coordinator.proc if coordinator else None,
705 )
File ~/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/vllm/v1/engine/utils.py:750, in wait_for_engine_startup(handshake_socket, addresses, core_engines, parallel_config, cache_config, proc_manager, coord_process)
748 if coord_process is not None and coord_process.exitcode is not None:
749 finished[coord_process.name] = coord_process.exitcode
--> 750 raise RuntimeError("Engine core initialization failed. "
751 "See root cause above. "
752 f"Failed core proc(s): {finished}")
754 # Receive HELLO and READY messages from the input socket.
755 eng_identity, ready_msg_bytes = handshake_socket.recv_multipart()
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
The core error lines:
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpnpkuzanv/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpnpkuzanv/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/root/anaconda3/envs/vLLMenv_python312/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpnpkuzanv', '-I/root/anaconda3/envs/vLLMenv_python312/include/python3.12']' returned non-zero exit status 1.
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
3.3 Root-cause analysis
/usr/bin/ld: cannot find -lcuda: No such file or directory
This error means that at link time the system cannot find the CUDA driver library (libcuda.so or libcuda.a).
Possible causes:
- CUDA Toolkit not installed: the CUDA Toolkit may be missing, or an incompatible version may be installed.
- CUDA library path not set correctly: even with the Toolkit installed, the library search path (e.g. LD_LIBRARY_PATH) may not include the directory containing the CUDA libraries, or the linker configuration (e.g. ld.so.conf) may not be set up properly.
- Missing symbolic link: the library file may exist but lack the expected symlink (e.g. an unversioned libcuda.so pointing at the versioned file).
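The three causes above can be checked from Python before touching any configuration. Below is a small diagnostic sketch; the candidate directories are typical Linux locations and are assumptions, not guaranteed on every distro. Note that `gcc ... -lcuda` specifically needs an unversioned `libcuda.so`, so finding only `libcuda.so.1` still leaves the link step failing:

```python
import ctypes.util
import os

def find_libcuda():
    """Look for the CUDA driver library roughly the way the linker would."""
    # ctypes.util.find_library consults the ldconfig cache, so a hit here
    # means the dynamic linker already knows about libcuda.
    hit = ctypes.util.find_library("cuda")
    if hit:
        return hit
    # Fall back to scanning common driver install locations (illustrative).
    candidates = [
        "/usr/lib/x86_64-linux-gnu",
        "/usr/lib64",
        "/usr/local/cuda/lib64/stubs",
    ]
    for d in candidates:
        for name in ("libcuda.so", "libcuda.so.1"):
            path = os.path.join(d, name)
            if os.path.exists(path):
                return path
    return None

if __name__ == "__main__":
    print(find_libcuda() or "libcuda not found -- expect '-lcuda' link errors")
```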
4. Solution
4.1 Locate libcuda.so
Use the commands below to check whether libcuda.so exists on the system.
Searching the cuda-12.8 install directory turns up nothing:
(base) root@jinhu:/usr/local# ls /usr/local/cuda-12.8/lib64 | grep libcuda.so
(base) root@jinhu:/usr/local#
A global search finds it under /usr/lib/x86_64-linux-gnu/:
(base) root@jinhu:/usr/local/cuda-12.8/lib64# sudo find / -name 'libcuda.so*'
/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/stubs/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so.570.148.08
On the difference between the libcuda.so copies in different directories:
When the official NVIDIA driver is installed, it ships libcuda.so and places it in /usr/lib/x86_64-linux-gnu/.
libcuda.so is the core library of the CUDA driver API. It is the interface through which applications talk directly to the underlying NVIDIA GPU driver; any CUDA program (whether it uses the driver API directly or goes through the runtime API) ultimately links against this library at run time.
Purpose: shared system-wide, for all users and applications.
libcuda.so belongs to the driver, not to the Toolkit:
The key point is that libcuda.so is a core component of the NVIDIA GPU driver. It is installed by the nvidia-driver package, not by the cuda-toolkit package. The CUDA Toolkit mainly provides development tools (nvcc, nsight), runtime libraries (libcudart.so), math libraries (libcublas.so, libcufft.so), headers, and samples; it neither contains nor is responsible for installing the low-level driver library libcuda.so.
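To see this driver/Toolkit split on your own machine, here is a quick sketch. It assumes the typical Debian/Ubuntu-style layout used in this post; the directory paths are illustrative, not authoritative:

```python
import glob
import os

# Typical homes for the Toolkit's runtime libraries vs. the driver library.
TOOLKIT_DIRS = glob.glob("/usr/local/cuda*/lib64")
DRIVER_DIRS = ["/usr/lib/x86_64-linux-gnu", "/usr/lib64"]

def locate(libname, dirs):
    """Return every file matching libname* in the given directories."""
    hits = []
    for d in dirs:
        hits.extend(glob.glob(os.path.join(d, libname + "*")))
    return sorted(hits)

if __name__ == "__main__":
    # libcudart.* comes from the cuda-toolkit package;
    # libcuda.* comes from the nvidia-driver package.
    print("libcudart (Toolkit):", locate("libcudart.so", TOOLKIT_DIRS) or "not found")
    print("libcuda   (driver): ", locate("libcuda.so", DRIVER_DIRS) or "not found")
```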
4.2 Setting environment variables
Add the CUDA paths to the global shell profile.
Run:
vi ~/.bashrc
Append the following at the end:
export PATH=/usr/local/cuda-12.8/bin:$PATH
# export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
Reload the profile so the change takes effect:
source ~/.bashrc
(base) root@jinhu:/usr/local/cuda-12.8/lib64# sudo find / -name 'libcuda.so*'
/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/stubs/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so.570.148.08
/var/lib/docker/overlay2/84850e86f40d73dbae057eafc1875cb57327a85ad6b5a8f771d28b664612f455/diff/usr/local/cuda-12.8/compat/libcuda.so
/var/lib/docker/overlay2/84850e86f40d73dbae057eafc1875cb57327a85ad6b5a8f771d28b664612f455/diff/usr/local/cuda-12.8/compat/libcuda.so.1
/var/lib/docker/overlay2/84850e86f40d73dbae057eafc1875cb57327a85ad6b5a8f771d28b664612f455/diff/usr/local/cuda-12.8/compat/libcuda.so.570.86.10
Then register the directory with the dynamic linker and create the missing unversioned libcuda.so symlink:
(base) root@jinhu:/usr/local/cuda-12.8/lib64# echo "/usr/lib/x86_64-linux-gnu" | sudo tee /etc/ld.so.conf.d/cuda.conf
/usr/lib/x86_64-linux-gnu
(base) root@jinhu:/usr/local/cuda-12.8/lib64# sudo ldconfig
(base) root@jinhu:/usr/local/cuda-12.8/lib64# sudo ln -s /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so
(base) root@jinhu:/usr/local/cuda-12.8/lib64# vi ~/.bashrc
(base) root@jinhu:/usr/local/cuda-12.8/lib64# source ~/.bashrc
(base) root@jinhu:/usr/local/cuda-12.8/lib64# vi ~/.bashrc
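Before rerunning vLLM, it is worth verifying that the driver library now resolves. The original failure was at link time, but once ldconfig knows the directory and the unversioned symlink exists, both linking and loading should work. A minimal loadability check (it tests only that the library can be dlopen'ed, not that the GPU works):

```python
import ctypes

def cuda_driver_loadable() -> bool:
    """Return True if the CUDA driver library can be dlopen'ed."""
    for name in ("libcuda.so", "libcuda.so.1"):
        try:
            ctypes.CDLL(name)
            return True
        except OSError:
            # Not found under this name; try the next candidate.
            continue
    return False

if __name__ == "__main__":
    print("libcuda loadable:", cuda_driver_loadable())
```

On the repaired server this should print True; if it still prints False, recheck the LD_LIBRARY_PATH export and the symlink created above.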
4.3 Running again succeeds
from vllm import LLM
llm = LLM(model="/home/vLLM/models/Qwen/Qwen3-0___6B",
trust_remote_code=True,
tensor_parallel_size=2,
gpu_memory_utilization=0.8,
max_model_len=4096,
)
INFO 07-30 04:34:13 [config.py:1604] Using max model len 4096
INFO 07-30 04:34:13 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-30 04:34:14 [core.py:572] Waiting for init message from front-end.
INFO 07-30 04:34:14 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/home/vLLM/models/Qwen/Qwen3-0___6B', speculative_config=None, tokenizer='/home/vLLM/models/Qwen/Qwen3-0___6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/vLLM/models/Qwen/Qwen3-0___6B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 07-30 04:34:14 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 36 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-30 04:34:14 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_b244038f'), local_subscribe_addr='ipc:///tmp/d7c52630-a47f-4adc-8772-5b36db25d7b0', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:16 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_17a5fe6c'), local_subscribe_addr='ipc:///tmp/470f026d-1b8b-4ea8-88d6-7d283f0bb2fd', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:16 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_e5c2d241'), local_subscribe_addr='ipc:///tmp/b8c85095-e2b6-4631-97ec-6a0e031730c2', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=1003186) (VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:17 [__init__.py:1375] Found nccl from library libnccl.so.2
INFO 07-30 04:34:17 [__init__.py:1375] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=1003186) (VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:17 [pynccl.py:70] vLLM is using nccl==2.26.2
INFO 07-30 04:34:17 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:17 [custom_all_reduce_utils.py:208] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=0 pid=1003186) (VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:36 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
INFO 07-30 04:34:36 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=0 pid=1003186) (VllmWorker rank=1 pid=1003187) WARNING 07-30 04:34:36 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 07-30 04:34:36 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:36 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_6b68bda1'), local_subscribe_addr='ipc:///tmp/da4bb46b-0a31-4756-9b7e-2027a321dace', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:36 [parallel_state.py:1102] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=1 pid=1003187) WARNING 07-30 04:34:36 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:36 [gpu_model_runner.py:1843] Starting to load model /home/vLLM/models/Qwen/Qwen3-0___6B...
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:36 [parallel_state.py:1102] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=0 pid=1003186) WARNING 07-30 04:34:36 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:36 [gpu_model_runner.py:1843] Starting to load model /home/vLLM/models/Qwen/Qwen3-0___6B...
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:37 [gpu_model_runner.py:1875] Loading model from scratch...
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:37 [gpu_model_runner.py:1875] Loading model from scratch...
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:37 [cuda.py:290] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:37 [cuda.py:290] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.09it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.06it/s]
(VllmWorker rank=0 pid=1003186)
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:37 [default_loader.py:262] Loading weights took 0.28 seconds
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:37 [default_loader.py:262] Loading weights took 0.27 seconds
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:38 [gpu_model_runner.py:1892] Model loading took 0.5660 GiB and 0.451207 seconds
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:38 [gpu_model_runner.py:1892] Model loading took 0.5660 GiB and 0.444961 seconds
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:47 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/566a023fde/rank_0_0/backbone for vLLM's torch.compile
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:47 [backends.py:541] Dynamo bytecode transform time: 8.56 s
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:47 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/566a023fde/rank_1_0/backbone for vLLM's torch.compile
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:47 [backends.py:541] Dynamo bytecode transform time: 8.88 s
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:34:53 [backends.py:194] Cache the graph for dynamic shape for later use
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:34:54 [backends.py:194] Cache the graph for dynamic shape for later use
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:35:21 [backends.py:215] Compiling a graph for dynamic shape takes 33.70 s
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:35:21 [backends.py:215] Compiling a graph for dynamic shape takes 33.62 s
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:35:32 [monitor.py:34] torch.compile takes 42.50 s in total
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:35:32 [monitor.py:34] torch.compile takes 42.26 s in total
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:35:34 [gpu_worker.py:255] Available KV cache memory: 15.57 GiB
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:35:34 [gpu_worker.py:255] Available KV cache memory: 15.57 GiB
INFO 07-30 04:35:34 [kv_cache_utils.py:833] GPU KV cache size: 291,552 tokens
INFO 07-30 04:35:34 [kv_cache_utils.py:837] Maximum concurrency for 4,096 tokens per request: 71.18x
INFO 07-30 04:35:34 [kv_cache_utils.py:833] GPU KV cache size: 291,552 tokens
INFO 07-30 04:35:34 [kv_cache_utils.py:837] Maximum concurrency for 4,096 tokens per request: 71.18x
Capturing CUDA graph shapes: 100%|██████████| 67/67 [00:04<00:00, 16.41it/s]
(VllmWorker rank=0 pid=1003186) INFO 07-30 04:35:39 [gpu_model_runner.py:2485] Graph capturing finished in 5 secs, took 0.64 GiB
(VllmWorker rank=1 pid=1003187) INFO 07-30 04:35:39 [gpu_model_runner.py:2485] Graph capturing finished in 5 secs, took 0.64 GiB
INFO 07-30 04:35:39 [core.py:193] init engine (profile, create kv cache, warmup model) took 61.32 seconds
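The engine now initializes cleanly, and the log is worth reading closely. For example, the "Maximum concurrency: 71.18x" figure is not arbitrary: it is the GPU KV cache capacity in tokens divided by the configured max sequence length per request (291,552 / 4,096). A minimal sketch of that arithmetic, using the numbers from the log above:

```python
# Reproduce the "Maximum concurrency" figure from the vLLM startup log:
# KV cache capacity (in tokens) divided by max_seq_len per request.
kv_cache_tokens = 291_552   # from "GPU KV cache size: 291,552 tokens"
max_seq_len = 4_096         # from the engine config (max_seq_len=4096)

max_concurrency = kv_cache_tokens / max_seq_len
print(f"Maximum concurrency: {max_concurrency:.2f}x")  # prints 71.18x
```

In other words, with 15.57 GiB of free memory available for KV cache per GPU, the engine can hold roughly 71 full-length (4,096-token) requests' worth of KV cache at once; raising `max_model_len` or lowering `gpu_memory_utilization` will shrink this number accordingly.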