Memcpy

In the reference documentation, each memcpy function is categorized as synchronous or asynchronous, corresponding to the definitions below.

Synchronous
  • All transfers involving Unified Memory regions are fully synchronous with respect to the host.
  • For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.
  • For transfers from pinned host memory to device memory, the function is synchronous with respect to the host.
  • For transfers from device to either pageable or pinned host memory, the function returns only once the copy has completed.
  • For transfers from device memory to device memory, no host-side synchronization is performed.
  • For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
Asynchronous
  • For transfers from device memory to pageable host memory, the function will return only once the copy has completed.
  • For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
  • For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread.

按照上述的参考文献,从 devicehost 拷贝数据,如果 hostpageable,那么即使使用 cudaMemcpyAsync ,也是同步的,不是异步的,测试程序如下:

#include <cuda_runtime.h>
#include <stdint.h>
#include <assert.h>
#include <chrono>
#include <iostream>

int main()
{
    void *d_ptr = nullptr;
    void *h1_ptr = nullptr;
    void *h2_ptr = nullptr;

    cudaStream_t s0 = 0;

    int64_t mem_size = 4*1024*1024*1024LL;

    cudaMalloc(&d_ptr, mem_size);
    assert(d_ptr != nullptr);

    cudaMallocHost(&h1_ptr, mem_size);
    assert(h1_ptr != nullptr);

    h2_ptr = new char[mem_size];
    assert(h2_ptr != nullptr);

    cudaStreamCreateWithFlags(&s0, cudaStreamNonBlocking);

    
    cudaMemcpyAsync(h1_ptr,d_ptr,1024*1024,cudaMemcpyDeviceToHost,s0);
    cudaMemcpyAsync(h2_ptr,d_ptr,1024*1024,cudaMemcpyDeviceToHost,s0);
    cudaMemcpyAsync(h1_ptr,d_ptr,1024*1024,cudaMemcpyDeviceToHost,s0);
    cudaMemcpyAsync(h2_ptr,d_ptr,1024*1024,cudaMemcpyDeviceToHost,s0);
    cudaMemcpyAsync(h1_ptr,d_ptr,1024*1024,cudaMemcpyDeviceToHost,s0);
    cudaMemcpyAsync(h2_ptr,d_ptr,1024*1024,cudaMemcpyDeviceToHost,s0);

    cudaStreamSynchronize(s0);

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpyAsync(h1_ptr,d_ptr,mem_size,cudaMemcpyDeviceToHost,s0);
    auto t1 = std::chrono::high_resolution_clock::now();
    auto span = (std::chrono::duration<double, std::milli>(t1 - t0)).count();
    cudaStreamSynchronize(s0);
    auto t2=std::chrono::high_resolution_clock::now();
    auto span2 = (std::chrono::duration<double, std::milli>(t2 - t0)).count();
    std::cout << "span1 " << span << std::endl;
    std::cout << "span2 " << span2 << std::endl;


    t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpyAsync(h2_ptr,d_ptr,mem_size,cudaMemcpyDeviceToHost,s0);
    t1 = std::chrono::high_resolution_clock::now();
    span = (std::chrono::duration<double, std::milli>(t1 - t0)).count();

    cudaStreamSynchronize(s0);
    std::cout << "span3 " << span << std::endl;
}

输出如下:

span1 0.003362
span2 339.151
span3 1824.59

在测试程序中 h1_ptrcudaMallocHost 申请的,所以不是 pageble,而 h2_ptrnew 申请的,所以属于 pageable

根据测试结果可以分发现往 h1_ptr拷贝数据是异步的,往 h2_ptr 拷贝数据是同步的,必须等数据拷贝完成才返回。

Logo

欢迎来到FlagOS开发社区,这里是一个汇聚了AI开发者、数据科学家、机器学习爱好者以及业界专家的活力平台。我们致力于成为业内领先的Triton技术交流与应用分享的殿堂,为推动人工智能技术的普及与深化应用贡献力量。

更多推荐