Streams
Learning CUDA to Solve Scientific Problems

Miguel Cárdenas Montes
Centro de Investigaciones Energéticas Medioambientales y Tecnológicas (CIEMAT), Madrid, Spain
miguel.cardenas@ciemat.es
2010

Table of Contents
  1 Objectives
  2 Pinned-Memory
  3 Streams

Objectives

To use streams in order to improve the performance.

Technical issues: streams and pinned memory.

Pinned-Memory I

Until now, the instruction malloc() has been used to allocate memory on the host. However, the CUDA runtime offers its own mechanism for allocating host memory: cudaHostAlloc().

There is a significant difference between the memory that malloc() allocates and the memory that cudaHostAlloc() allocates.

Pinned-Memory II

The instruction malloc() allocates standard, pageable host memory. The instruction cudaHostAlloc() allocates a buffer of page-locked host memory, also called pinned memory.

Page-locked memory guarantees that the operating system will never page this memory out to disk, which ensures its residency in physical memory.

Pinned-Memory III

Knowing the physical address of a buffer, the GPU can use direct memory access (DMA) to copy data to or from the host. Since DMA copies proceed without intervention from the CPU, the CPU could simultaneously be paging these buffers out to disk or relocating their physical addresses by updating the operating system's page tables. The possibility of the CPU moving pageable data means that using pinned memory for a DMA copy is essential.

On the warning side, the computer running the application needs to have available physical memory for every page-locked buffer, since these buffers can never be swapped out to disk. The use of pinned memory should therefore be restricted to buffers that will be used as a source or destination in calls to cudaMemcpy(), and they should be freed as soon as they are no longer needed.

In fact, even when you attempt to perform a memory copy with pageable memory, the CUDA driver still uses DMA to transfer the buffer to the GPU. The copy therefore happens twice: first from the pageable system buffer to a page-locked "staging" buffer, and then from the page-locked buffer to the GPU.

Pinned-Memory IV

To allocate host memory as pinned memory, the instruction cudaHostAlloc() has to be used. For freeing the memory allocated this way, the instruction is cudaFreeHost().

    cudaHostAlloc( (void**)&a, size * sizeof( *a ), cudaHostAllocDefault );
    cudaFreeHost( a );
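As a rough illustration of the cost of the extra staging copy, the following sketch times the same host-to-device transfer from a pageable buffer and from a page-locked buffer. The 64 MB buffer size and the check() helper are arbitrary choices made for this example only; they are not prescribed by the course code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <cuda_runtime.h>

    #define SIZE (64 * 1024 * 1024)   /* 64 MB test buffer, an arbitrary choice */

    /* minimal error-checking helper for this sketch */
    static void check( cudaError_t err ) {
        if (err != cudaSuccess) {
            printf( "%s\n", cudaGetErrorString( err ) );
            exit( 1 );
        }
    }

    /* time a single host-to-device copy of SIZE bytes with CUDA events */
    static float time_copy( void *src, void *dst ) {
        cudaEvent_t start, stop;
        float ms;
        check( cudaEventCreate( &start ) );
        check( cudaEventCreate( &stop ) );
        check( cudaEventRecord( start, 0 ) );
        check( cudaMemcpy( dst, src, SIZE, cudaMemcpyHostToDevice ) );
        check( cudaEventRecord( stop, 0 ) );
        check( cudaEventSynchronize( stop ) );
        check( cudaEventElapsedTime( &ms, start, stop ) );
        check( cudaEventDestroy( start ) );
        check( cudaEventDestroy( stop ) );
        return ms;
    }

    int main( void ) {
        void *pageable = malloc( SIZE );       /* standard, pageable host memory */
        void *pinned, *dev;
        check( cudaHostAlloc( &pinned, SIZE, cudaHostAllocDefault ) );  /* page-locked */
        check( cudaMalloc( &dev, SIZE ) );
        memset( pageable, 0, SIZE );
        memset( pinned, 0, SIZE );

        printf( "pageable: %3.1f ms\n", time_copy( pageable, dev ) );
        printf( "pinned:   %3.1f ms\n", time_copy( pinned, dev ) );

        free( pageable );
        check( cudaFreeHost( pinned ) );
        check( cudaFree( dev ) );
        return 0;
    }

On typical hardware the pinned copy is noticeably faster, because the driver can DMA directly from the buffer instead of staging it first.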
Streams

Single Stream I

Streams can play an important role in accelerating applications.

A CUDA stream represents a queue of GPU operations that get executed in a specific order. Operations such as kernel launches, memory copies, and event starts and stops can be placed into a stream. The order in which operations are added to the stream specifies the order in which they will be executed.
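The listings that follow wrap every CUDA runtime call in a HANDLE_ERROR macro. This macro belongs to the supporting code of the course and is not shown on the slides; a minimal sketch of such a wrapper, assuming the intent is simply to print the error string and abort, could be:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* print the CUDA error string with its location and abort on failure */
    static void HandleError( cudaError_t err, const char *file, int line ) {
        if (err != cudaSuccess) {
            printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
            exit( EXIT_FAILURE );
        }
    }
    #define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))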
Single Stream II

First of all, the chosen device must support the capability termed device overlap. A GPU supporting this feature possesses the capacity to execute a CUDA kernel while simultaneously performing a copy between device and host memory.

    int main( void ) {
        cudaDeviceProp prop;
        int whichDevice;
        HANDLE_ERROR( cudaGetDevice( &whichDevice ) );
        HANDLE_ERROR( cudaGetDeviceProperties( &prop, whichDevice ) );
        if (!prop.deviceOverlap) {
            printf( "Device will not handle overlaps" );
            return 0;
        }

Single Stream III

If the device supports overlapping, the timing events are created, page-locked host memory is allocated, and the stream is created with the instruction cudaStreamCreate().

    cudaEvent_t start, stop;
    float elapsedTime;

    // start the timers
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    // initialize the stream
    cudaStream_t stream;
    HANDLE_ERROR( cudaStreamCreate( &stream ) );

    // allocate page-locked memory
    int *host_a, *host_b, *host_c;
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_a, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_b, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_c, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );

Single Stream IV

Then the memory is allocated on the GPU and the host arrays are filled with random integers.

    // allocate the memory on the GPU
    int *dev_a, *dev_b, *dev_c;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N*sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N*sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N*sizeof(int) ) );

    for (int i=0; i<FULL_DATA_SIZE; i++) {
        host_a[i] = rand();
        host_b[i] = rand();
    }

Single Stream V

The call cudaMemcpyAsync() places a request to perform a memory copy into the stream specified by the argument stream. When the call returns, there is no guarantee that the copy has even been performed; the only guarantee is that it will be performed before the next operation placed into the same stream.

The use of cudaMemcpyAsync() requires host buffers allocated with cudaHostAlloc(). The kernel invocation also takes the stream as an argument.

    // now loop over the full data, in chunks of N elements
    for (int i=0; i<FULL_DATA_SIZE; i+=N) {
        // copy the locked memory to the device, asynchronously
        HANDLE_ERROR( cudaMemcpyAsync( dev_a, host_a+i, N*sizeof(int), cudaMemcpyHostToDevice, stream ) );
        HANDLE_ERROR( cudaMemcpyAsync( dev_b, host_b+i, N*sizeof(int), cudaMemcpyHostToDevice, stream ) );

        kernel<<< N/256, 256, 0, stream >>>( dev_a, dev_b, dev_c );

        // copy the result chunk back from the device to locked memory
        HANDLE_ERROR( cudaMemcpyAsync( host_c+i, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost, stream ) );
    }

Single Stream VI

When the loop has terminated, there could still be a bit of work queued up for the GPU to finish. It is necessary to synchronize with the host in order to guarantee that all the queued tasks have been completed. After the synchronization, the timer can be stopped.

    // wait until all the work queued in the stream has completed
    HANDLE_ERROR( cudaStreamSynchronize( stream ) );

Single Stream VII

After the synchronization the timer is stopped. Finally, a dummy kernel is used, together with the data-size definitions.

    #define N (1024*1024)
    #define FULL_DATA_SIZE (N*20)

    __global__ void kernel( int *a, int *b, int *c ) {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        if (idx < N) {
            int idx1 = (idx + 1) % 256;
            int idx2 = (idx + 2) % 256;
            float as = (a[idx] + a[idx1] + a[idx2]) / 3.0f;
            float bs = (b[idx] + b[idx1] + b[idx2]) / 3.0f;
            c[idx] = (as + bs) / 2;
        }
    }

    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime, start, stop ) );
    printf( "Time taken: %3.1f ms\n", elapsedTime );

Single Stream VIII

The memory can then be cleaned up. Before exiting the application, the stream has to be destroyed.

    // cleanup the stream and the memory
    HANDLE_ERROR( cudaFreeHost( host_a ) );
    HANDLE_ERROR( cudaFreeHost( host_b ) );
    HANDLE_ERROR( cudaFreeHost( host_c ) );
    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_c ) );
    HANDLE_ERROR( cudaStreamDestroy( stream ) );
    return 0;
    }

Kernel invocation, parameters

Any call to a __global__ function must specify the execution configuration for that call. The execution configuration defines the dimension of the grid and of the blocks that will be used to execute the function on the device, as well as the associated stream. When using the runtime API, the execution configuration is specified by inserting an expression of the form <<<Dg, Db, Ns, S>>> between the function name and the parenthesized argument list. An example follows the list.

  - Dg is of type dim3 and specifies the dimension and size of the grid, such that Dg.x * Dg.y equals the number of blocks being launched; Dg.z must be equal to 1.
  - Db is of type dim3 and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block.
  - Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per block for this call, in addition to the statically allocated memory; this dynamically allocated memory is used by any variable declared as an external array; Ns is an optional argument which defaults to 0.
  - S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.
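A launch using all four arguments could look as follows. The kernel, the sizes, and the variable names are hypothetical, constructed only to illustrate the <<<Dg, Db, Ns, S>>> form; error checking is omitted for brevity.

    #include <cuda_runtime.h>

    // hypothetical kernel: stages its input in dynamically allocated shared memory
    __global__ void scale( const float *in, float *out, float factor ) {
        extern __shared__ float tile[];    // sized at launch time by the Ns argument
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        tile[threadIdx.x] = in[idx];
        __syncthreads();
        out[idx] = factor * tile[threadIdx.x];
    }

    int main( void ) {
        const int n = 64 * 256;
        float *dev_in, *dev_out;
        cudaStream_t stream;
        cudaMalloc( (void**)&dev_in,  n * sizeof(float) );
        cudaMalloc( (void**)&dev_out, n * sizeof(float) );
        cudaMemset( dev_in, 0, n * sizeof(float) );
        cudaStreamCreate( &stream );

        dim3 Dg( 64, 1, 1 );               // grid:  Dg.x * Dg.y blocks, Dg.z == 1
        dim3 Db( 256, 1, 1 );              // block: Db.x * Db.y * Db.z threads
        size_t Ns = Db.x * sizeof(float);  // dynamic shared memory per block, in bytes
        scale<<< Dg, Db, Ns, stream >>>( dev_in, dev_out, 2.0f );

        cudaStreamSynchronize( stream );
        cudaStreamDestroy( stream );
        cudaFree( dev_in );
        cudaFree( dev_out );
        return 0;
    }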
Multiple Streams I

At the beginning of the previous example, the device-overlap capability was checked, and the computation was broken into chunks. The underlying idea for improving the performance is to divide the computational task into chunks and to overlap their execution.

The newer NVIDIA GPUs support simultaneously: kernel execution, and two memory copies (one to the device and one from the device).

Multiple Streams II

For multiple streams, each of them must be created.

    // initialize the streams
    cudaStream_t stream0, stream1;
    HANDLE_ERROR( cudaStreamCreate( &stream0 ) );
    HANDLE_ERROR( cudaStreamCreate( &stream1 ) );

Multiple Streams III

All actions must be duplicated: buffer allocations, kernel invocations, synchronization, and clean-up.

    int *host_a, *host_b, *host_c;
    int *dev_a0, *dev_b0, *dev_c0;   // for stream 0
    int *dev_a1, *dev_b1, *dev_c1;   // for stream 1

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a0, N*sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b0, N*sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c0, N*sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a1, N*sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b1, N*sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c1, N*sizeof(int) ) );

    // allocate page-locked memory
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_a, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_b, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_c, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );

    for (int i=0; i<FULL_DATA_SIZE; i++) {
        host_a[i] = rand();
        host_b[i] = rand();
    }

Multiple Streams IV

Each iteration of the loop now processes two chunks: the first in stream0 and the second in stream1.

    // now loop over the full data, two chunks at a time
    for (int i=0; i<FULL_DATA_SIZE; i+=N*2) {
        // copy the locked memory to the device, asynchronously, in stream0
        HANDLE_ERROR( cudaMemcpyAsync( dev_a0, host_a+i, N*sizeof(int), cudaMemcpyHostToDevice, stream0 ) );
        HANDLE_ERROR( cudaMemcpyAsync( dev_b0, host_b+i, N*sizeof(int), cudaMemcpyHostToDevice, stream0 ) );

        kernel<<< N/256, 256, 0, stream0 >>>( dev_a0, dev_b0, dev_c0 );

        // copy back the data from the device to locked memory
        HANDLE_ERROR( cudaMemcpyAsync( host_c+i, dev_c0, N*sizeof(int), cudaMemcpyDeviceToHost, stream0 ) );

        // the same sequence for the next chunk, in stream1
        HANDLE_ERROR( cudaMemcpyAsync( dev_a1, host_a+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream1 ) );
        HANDLE_ERROR( cudaMemcpyAsync( dev_b1, host_b+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream1 ) );

        kernel<<< N/256, 256, 0, stream1 >>>( dev_a1, dev_b1, dev_c1 );

        HANDLE_ERROR( cudaMemcpyAsync( host_c+i+N, dev_c1, N*sizeof(int), cudaMemcpyDeviceToHost, stream1 ) );
    }

    // synchronization
    // cleanup
    // streams destruction

Performance Considerations I

Users should take care with the sequence of operations queued in the streams: it is very easy to inadvertently block the copies or the kernel executions of another stream. To alleviate this problem, it suffices to enqueue the operations breadth-first across the streams rather than depth-first.

    // now loop over the full data, two chunks at a time
    for (int i=0; i<FULL_DATA_SIZE; i+=N*2) {
        // enqueue the copies of a into stream0 and stream1
        HANDLE_ERROR( cudaMemcpyAsync( dev_a0, host_a+i, N*sizeof(int), cudaMemcpyHostToDevice, stream0 ) );
        HANDLE_ERROR( cudaMemcpyAsync( dev_a1, host_a+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream1 ) );
        // enqueue the copies of b into stream0 and stream1
        HANDLE_ERROR( cudaMemcpyAsync( dev_b0, host_b+i, N*sizeof(int), cudaMemcpyHostToDevice, stream0 ) );
        HANDLE_ERROR( cudaMemcpyAsync( dev_b1, host_b+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream1 ) );
        // enqueue the kernels in stream0 and stream1
        kernel<<< N/256, 256, 0, stream0 >>>( dev_a0, dev_b0, dev_c0 );
        kernel<<< N/256, 256, 0, stream1 >>>( dev_a1, dev_b1, dev_c1 );
        // enqueue the copies of the results back to locked memory
        HANDLE_ERROR( cudaMemcpyAsync( host_c+i, dev_c0, N*sizeof(int), cudaMemcpyDeviceToHost, stream0 ) );
        HANDLE_ERROR( cudaMemcpyAsync( host_c+i+N, dev_c1, N*sizeof(int), cudaMemcpyDeviceToHost, stream1 ) );
    }

    // synchronization
    // cleanup
    // streams destruction
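The three placeholder comments are not expanded on the slides. Following the pattern of the single-stream version, and assuming the same start/stop events and elapsedTime variable have been created as before, the closing steps could look roughly like this sketch:

    // wait until both streams have finished all the queued work
    HANDLE_ERROR( cudaStreamSynchronize( stream0 ) );
    HANDLE_ERROR( cudaStreamSynchronize( stream1 ) );

    // stop the timer, as in the single-stream version
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime, start, stop ) );
    printf( "Time taken: %3.1f ms\n", elapsedTime );

    // cleanup: free the page-locked and the device memory
    HANDLE_ERROR( cudaFreeHost( host_a ) );
    HANDLE_ERROR( cudaFreeHost( host_b ) );
    HANDLE_ERROR( cudaFreeHost( host_c ) );
    HANDLE_ERROR( cudaFree( dev_a0 ) );
    HANDLE_ERROR( cudaFree( dev_b0 ) );
    HANDLE_ERROR( cudaFree( dev_c0 ) );
    HANDLE_ERROR( cudaFree( dev_a1 ) );
    HANDLE_ERROR( cudaFree( dev_b1 ) );
    HANDLE_ERROR( cudaFree( dev_c1 ) );

    // destroy the streams before exiting
    HANDLE_ERROR( cudaStreamDestroy( stream0 ) );
    HANDLE_ERROR( cudaStreamDestroy( stream1 ) );
    return 0;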
Thanks

Questions? More questions?