[cuda] Default synchronization in asynchronous memory copies


adream307 · Published 2020-08-03 16:23:55

Memcpy

In the reference documentation, each memcpy function is categorized as synchronous or asynchronous, corresponding to the definitions below.

Synchronous

All transfers involving Unified Memory regions are fully synchronous with respect to the host.
For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.
For transfers from pinned host memory to device memory, the function is synchronous with respect to the host.
For transfers from device to either pageable or pinned host memory, the function returns only once the copy has completed.
For transfers from device memory to device memory, no host-side synchronization is performed.
For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.

Asynchronous

For transfers from device memory to pageable host memory, the function will return only once the copy has completed.
For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread.
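The "Asynchronous" rules quoted above can be condensed into a small lookup function. The following is only a sketch of those three rules in plain C++: the `Mem` enum and the `asyncCopyBlocksHost` function are hypothetical names invented for illustration and are not part of the CUDA API, and Unified Memory is deliberately left out.

```cpp
// Hypothetical names for illustration only; these are not CUDA API types.
enum class Mem { PageableHost, PinnedHost, Device };

// Returns true if, according to the "Asynchronous" rules quoted above, an
// asynchronous copy call (for example cudaMemcpyAsync) is expected to return
// only after the copy has completed. Unified Memory is not modeled here.
bool asyncCopyBlocksHost(Mem src, Mem dst) {
    const bool srcIsHost = (src != Mem::Device);
    const bool dstIsHost = (dst != Mem::Device);

    // Any host memory to any host memory: fully synchronous with the host.
    if (srcIsHost && dstIsHost) return true;

    // Device memory to pageable host memory: returns once the copy is done.
    if (src == Mem::Device && dst == Mem::PageableHost) return true;

    // All other transfers: fully asynchronous.
    return false;
}
```

Under this reading, `asyncCopyBlocksHost(Mem::Device, Mem::PageableHost)` is true while `asyncCopyBlocksHost(Mem::Device, Mem::PinnedHost)` is false, which is the pair of cases the test program in this post compares.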

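According to the documentation quoted above, when copying from device to host, if the host memory is pageable, the copy is synchronous rather than asynchronous even when cudaMemcpyAsync is used. The test program is as follows: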

#include <cuda_runtime.h>
#include <stdint.h>
#include <assert.h>
#include <chrono>
#include <iostream>

int main()
{
    void *d_ptr = nullptr;
    void *h1_ptr = nullptr;
    void *h2_ptr = nullptr;

    cudaStream_t s0 = 0;

    int64_t mem_size = 4*1024*1024*1024LL;

    cudaMalloc(&d_ptr, mem_size);
    assert(d_ptr != nullptr);

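    // cudaMallocHost returns page-locked (pinned) host memory.
    cudaMallocHost(&h1_ptr, mem_size);
    assert(h1_ptr != nullptr);

    // Plain new returns pageable host memory. Note that new throws
    // std::bad_alloc on failure, so this assert can never fire.
    h2_ptr = new char[mem_size];
    assert(h2_ptr != nullptr);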

    cudaStreamCreateWithFlags(&s0, cudaStreamNonBlocking);

    
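    // Warm-up: small copies to both buffers so that one-time setup cost
    // is not included in the timed copies below.
    cudaMemcpyAsync(h1_ptr,d_ptr,1024*1024,cudaMemcpyDeviceToHost,s0);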
    cudaMemcpyAsync(h2_ptr,d_ptr,1024*1024,cudaMemcpyDeviceToHost,s0);
    cudaMemcpyAsync(h1_ptr,d_ptr,1024*1024,cudaMemcpyDeviceToHost,s0);
    cudaMemcpyAsync(h2_ptr,d_ptr,1024*1024,cudaMemcpyDeviceToHost,s0);
    cudaMemcpyAsync(h1_ptr,d_ptr,1024*1024,cudaMemcpyDeviceToHost,s0);
    cudaMemcpyAsync(h2_ptr,d_ptr,1024*1024,cudaMemcpyDeviceToHost,s0);

    cudaStreamSynchronize(s0);

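    // Timed copy into pinned memory: span1 is the time for the call to
    // return, span2 is the time until the copy has actually completed.
    auto t0 = std::chrono::high_resolution_clock::now();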
    cudaMemcpyAsync(h1_ptr,d_ptr,mem_size,cudaMemcpyDeviceToHost,s0);
    auto t1 = std::chrono::high_resolution_clock::now();
    auto span = (std::chrono::duration<double, std::milli>(t1 - t0)).count();
    cudaStreamSynchronize(s0);
    auto t2=std::chrono::high_resolution_clock::now();
    auto span2 = (std::chrono::duration<double, std::milli>(t2 - t0)).count();
    std::cout << "span1 " << span << std::endl;
    std::cout << "span2 " << span2 << std::endl;


    // Timed copy into pageable memory: span3 is the time for the call to return.
    t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpyAsync(h2_ptr,d_ptr,mem_size,cudaMemcpyDeviceToHost,s0);
    t1 = std::chrono::high_resolution_clock::now();
    span = (std::chrono::duration<double, std::milli>(t1 - t0)).count();

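    cudaStreamSynchronize(s0);
    std::cout << "span3 " << span << std::endl;

    // Release resources.
    cudaStreamDestroy(s0);
    cudaFreeHost(h1_ptr);
    delete[] static_cast<char *>(h2_ptr);
    cudaFree(d_ptr);
    return 0;
}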

The output (times in milliseconds) is:

span1 0.003362
span2 339.151
span3 1824.59

In the test program, h1_ptr is allocated with cudaMallocHost, so it is pinned rather than pageable, whereas h2_ptr is allocated with new, so it is pageable.

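The results show that the copy to h1_ptr is asynchronous: the call returns after about 0.003 ms, while the transfer itself takes about 339 ms. The copy to h2_ptr is synchronous: the call returns only after the copy has completed, about 1825 ms later.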
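If a buffer must come from new or malloc, one option that may avoid the blocking behavior is to page-lock it in place with cudaHostRegister before the copy. The sketch below was not part of the original test and has not been timed here. The 256 MiB size and the `CHECK` macro are assumptions made for illustration, and whether the call actually returns early should be confirmed on the target system.

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Illustrative helper (not a CUDA API): abort if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

int main() {
    // Assumed size for the sketch; the test above used 4 GiB.
    const size_t bytes = 256u * 1024 * 1024;

    void *d_ptr = nullptr;
    CHECK(cudaMalloc(&d_ptr, bytes));

    // Ordinary pageable allocation, then page-locked in place.
    char *h_ptr = new char[bytes];
    CHECK(cudaHostRegister(h_ptr, bytes, cudaHostRegisterDefault));

    cudaStream_t s;
    CHECK(cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking));

    auto t0 = std::chrono::high_resolution_clock::now();
    CHECK(cudaMemcpyAsync(h_ptr, d_ptr, bytes, cudaMemcpyDeviceToHost, s));
    auto t1 = std::chrono::high_resolution_clock::now();
    CHECK(cudaStreamSynchronize(s));
    auto t2 = std::chrono::high_resolution_clock::now();

    // Time for the call to return versus time for the copy to finish.
    std::printf("call returned after %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    std::printf("copy finished after %.3f ms\n",
                std::chrono::duration<double, std::milli>(t2 - t0).count());

    // Unregister before freeing the host buffer.
    CHECK(cudaStreamDestroy(s));
    CHECK(cudaHostUnregister(h_ptr));
    delete[] h_ptr;
    CHECK(cudaFree(d_ptr));
    return 0;
}
```

Registering memory has a cost of its own, so this approach mainly makes sense when the same buffer is reused for many transfers.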
