
CUDA C/C++ Streams and Concurrency




Transcription of CUDA C/C++ Streams and Concurrency

CUDA C/C++ Streams and Concurrency
Steve Rennich, NVIDIA

Concurrency
- The ability to perform multiple CUDA operations simultaneously (beyond multi-threaded parallelism):
  - CUDA kernel <<<>>>
  - cudaMemcpyAsync (HostToDevice)
  - cudaMemcpyAsync (DeviceToHost)
  - operations on the CPU
- The Fermi architecture (compute capability 2.0+) can simultaneously support:
  - up to 16 CUDA kernels on the GPU
  - 2 cudaMemcpyAsyncs (must be in different directions)
  - computation on the CPU

Streams
- Stream: a sequence of operations that execute in issue-order on the GPU.
- Streams are the programming model used to effect concurrency:
  - CUDA operations in different streams may run concurrently.
  - CUDA operations from different streams may be interleaved.

Concurrency Example
- Serial: cudaMemcpyAsync(H2D), kernel<<<>>>, and cudaMemcpyAsync(D2H) execute one after the other along the time axis.
- Concurrent: with streams, the kernel overlaps the D2H copy (and later copies overlap later kernels K2, K3, ...), giving a performance improvement.

Amount of Concurrency
- Serial (1x): kernel<<<>>>, cudaMemcpyAsync(H2D), and cudaMemcpyAsync(D2H) with no overlap.
- 2-way concurrency (up to 2x): kernel overlapped with one copy direction.
- 3-way concurrency (up to 3x): kernel overlapped with both copy directions.
- 4-way concurrency (3x+): the above plus concurrent computation on the CPU.
- 4+ way concurrency: additionally overlapping multiple kernels (K1..K4).
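
The multi-way cases above correspond to splitting the work into independent chunks and pipelining them across several streams. The following is a minimal sketch of that pattern, assuming pinned host buffers in/out, device buffers d_in/d_out, a placeholder kernel process, and previously defined N, grid, and block (none of these names come from the slides):

    const int nStreams = 4;
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate ( &streams[i] );

    int chunk = N / nStreams;                          // assume N divides evenly into independent chunks
    for (int i = 0; i < nStreams; ++i) {
        int off = i * chunk;
        cudaMemcpyAsync ( d_in + off, in + off, chunk * sizeof(double),
                          cudaMemcpyHostToDevice, streams[i] );                    // H2D for chunk i
        process <<< grid, block, 0, streams[i] >>> ( d_in + off, d_out + off, chunk );  // kernel for chunk i
        cudaMemcpyAsync ( out + off, d_out + off, chunk * sizeof(double),
                          cudaMemcpyDeviceToHost, streams[i] );                    // D2H for chunk i
    }
    cudaDeviceSynchronize ();                          // wait for all streams to finish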

Example: Tiled DGEMM (m = n = 8192, k = 288)
- CPU (4-core Westmere X5670, MKL): 43 Gflops.
- GPU (C2070):
  - serial: 125 Gflops
  - 2-way: 177 Gflops
  - 3-way: 262 Gflops
- GPU + CPU:
  - 4-way concurrency: 282 Gflops
  - up to 330 Gflops for larger rank
- Obtain maximum performance by leveraging concurrency; with all communication hidden, this effectively removes the device memory size limitation.
- (NVIDIA Visual Profiler (nvvp) timeline shows the default stream, streams 1-4, and the CPU.)

The Default Stream (aka Stream '0')
- The stream used when no stream is specified.
- Completely synchronous with respect to host and device, as if cudaDeviceSynchronize() were inserted before and after every CUDA operation.
- Exceptions (asynchronous with respect to the host):
  - kernel launches in the default stream
  - cudaMemcpy*Async
  - cudaMemset*Async
  - cudaMemcpy within the same device
  - H2D cudaMemcpy of 64 kB or less

Requirements for Concurrency
- CUDA operations must be in different, non-0 streams.
- cudaMemcpyAsync must use host memory that is 'pinned' (page-locked), allocated using cudaMallocHost() or cudaHostAlloc(); a short sketch follows this list.
- Sufficient resources must be available:
  - cudaMemcpyAsyncs in different directions
  - device resources (SMEM, registers, blocks, etc.)
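
As a small illustration of the pinned-memory requirement, the sketch below contrasts a pageable and a page-locked host buffer; the buffer names, byte count, and stream are placeholders rather than names from the slides. An asynchronous copy from pageable memory is legal but will not overlap with other CUDA operations; a copy from pinned memory can.

    double* h_pageable = (double*) malloc ( bytes );             // ordinary pageable host memory
    cudaMemcpyAsync ( d_buf, h_pageable, bytes,
                      cudaMemcpyHostToDevice, stream1 );         // works, but will not overlap

    double* h_pinned;
    cudaMallocHost ( &h_pinned, bytes );                         // page-locked (pinned) host memory
    cudaMemcpyAsync ( d_buf, h_pinned, bytes,
                      cudaMemcpyHostToDevice, stream1 );         // can overlap with kernels and D2H copies

    cudaFreeHost ( h_pinned );
    free ( h_pageable );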

Simple Example: Synchronous
    cudaMalloc ( &dev1, size ) ;
    double* host1 = (double*) malloc ( size ) ;
    ..
    cudaMemcpy ( dev1, host1, size, H2D ) ;
    kernel2 <<< grid, block, 0 >>> ( .., dev2, .. ) ;
    kernel3 <<< grid, block, 0 >>> ( .., dev3, .. ) ;
    cudaMemcpy ( host4, dev4, size, D2H ) ;
    ..
- Completely synchronous: all CUDA operations in the default stream are synchronous.

Simple Example: Asynchronous, No Streams
    cudaMalloc ( &dev1, size ) ;
    double* host1 = (double*) malloc ( size ) ;
    ..
    cudaMemcpy ( dev1, host1, size, H2D ) ;
    kernel2 <<< grid, block >>> ( .., dev2, .. ) ;
    some_CPU_method ();
    kernel3 <<< grid, block >>> ( .., dev3, .. ) ;
    cudaMemcpy ( host4, dev4, size, D2H ) ;
    ..
- Potentially overlapped: GPU kernels are asynchronous with respect to the host by default, so some_CPU_method() can run while the kernels execute.

Simple Example: Asynchronous with Streams
    cudaStream_t stream1, stream2, stream3, stream4 ;
    cudaStreamCreate ( &stream1 ) ;
    ..
    cudaMalloc ( &dev1, size ) ;
    cudaMallocHost ( &host1, size ) ;            // pinned memory required on the host
    ..
    cudaMemcpyAsync ( dev1, host1, size, H2D, stream1 ) ;
    kernel2 <<< grid, block, 0, stream2 >>> ( .., dev2, .. ) ;
    kernel3 <<< grid, block, 0, stream3 >>> ( .., dev3, .. ) ;
    cudaMemcpyAsync ( host4, dev4, size, D2H, stream4 ) ;
    some_CPU_method ();
    ..
- Potentially overlapped: fully asynchronous / concurrent.
- Data used by concurrent operations should be independent.

Explicit Synchronization
- Synchronize everything:
  - cudaDeviceSynchronize () blocks the host until all issued CUDA calls are complete.
- Synchronize with respect to a specific stream:
  - cudaStreamSynchronize ( streamid ) blocks the host until all CUDA calls in streamid are complete.
- Synchronize using events:
  - create specific 'events' within streams to use for synchronization:
    - cudaEventRecord ( event, streamid )
    - cudaEventSynchronize ( event )
    - cudaStreamWaitEvent ( stream, event )
    - cudaEventQuery ( event )

Explicit Synchronization Example
- Resolve a cross-stream dependency using an event:

    {
        cudaEvent_t event ;
        cudaEventCreate ( &event ) ;                             // create event

        cudaMemcpyAsync ( d_in, in, size, H2D, stream1 ) ;       // 1) H2D copy of new input
        cudaEventRecord ( event, stream1 ) ;                     // record event

        cudaMemcpyAsync ( out, d_out, size, D2H, stream2 ) ;     // 2) D2H copy of previous result
        cudaStreamWaitEvent ( stream2, event ) ;                 // wait for event in stream1
        kernel <<< grid, block, 0, stream2 >>> ( d_in, d_out ) ; // 3) must wait for 1 and 2

        asynchronousCPUmethod ( .. ) ;                           // asynchronous CPU method, overlaps the GPU work
    }

Implicit Synchronization
- These operations implicitly synchronize all other CUDA operations:
  - page-locked memory allocation: cudaMallocHost, cudaHostAlloc
  - device memory allocation: cudaMalloc
  - non-Async versions of memory operations: cudaMemcpy* (no Async suffix), cudaMemset* (no Async suffix)

- A change to the L1/shared memory configuration (cudaDeviceSetCacheConfig) also synchronizes implicitly.

Stream Scheduling
- Fermi hardware has 3 queues:
  - 1 compute engine queue
  - 2 copy engine queues: one for H2D and one for D2H
- CUDA operations are dispatched to hardware in the sequence they were issued and placed in the relevant queue; stream dependencies between engine queues are maintained, but are lost within an engine queue.
- A CUDA operation is dispatched from an engine queue if:
  - preceding calls in the same stream have completed,
  - preceding calls in the same queue have been dispatched, and
  - resources are available.
- CUDA kernels may be executed concurrently if they are in different streams; thread blocks for a given kernel are scheduled if all thread blocks for preceding kernels have been scheduled and SM resources are still available.
- Note: a blocked operation blocks all other operations in the queue, even operations in other streams.

Example: Blocked Queue (stream 1 issued first)
- Two streams; stream 1 is issued first.
  - Stream 1: HDa1, HDb1, K1, DH1 (issued first)
  - Stream 2: DH2 (completely independent of stream 1)
- Signals between queues enforce synchronization; CUDA operations are added to the queues in issue order, and within a queue stream dependencies are lost.
- Result: DH1 blocks the completely independent DH2 in the D2H queue; runtime = 5.

Example: Blocked Queue (stream 2 issued first)
- Same two streams, but stream 2 is issued first.
  - Stream 1: HDa1, HDb1, K1, DH1
  - Stream 2: DH2 (issued first)
- Signals between queues enforce synchronization; CUDA operations are added to the queues in issue order, and within a queue stream dependencies are lost.
- Result: DH2 no longer waits behind DH1; runtime = 4. Issue order matters!
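
In code, the two cases differ only in the order the host issues the calls; the buffer, kernel, and stream names below are placeholders, not from the slides.

    // Stream 1 issued first: DH1 enters the D2H engine queue ahead of DH2, so the
    // independent DH2 waits behind it (runtime = 5 in the figure).
    cudaMemcpyAsync ( da1, ha1, size, cudaMemcpyHostToDevice, stream1 );   // HDa1
    cudaMemcpyAsync ( db1, hb1, size, cudaMemcpyHostToDevice, stream1 );   // HDb1
    k1 <<< grid, block, 0, stream1 >>> ( da1, db1 );                       // K1
    cudaMemcpyAsync ( hr1, dr1, size, cudaMemcpyDeviceToHost, stream1 );   // DH1
    cudaMemcpyAsync ( hr2, dr2, size, cudaMemcpyDeviceToHost, stream2 );   // DH2

    // Stream 2 issued first: the independent DH2 is at the head of the D2H queue and
    // no longer waits on stream 1 (runtime = 4).
    cudaMemcpyAsync ( hr2, dr2, size, cudaMemcpyDeviceToHost, stream2 );   // DH2
    cudaMemcpyAsync ( da1, ha1, size, cudaMemcpyHostToDevice, stream1 );   // HDa1
    ..                                                                     // HDb1, K1, DH1 as above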

Example: Blocked Kernel
- Two streams, each just issuing CUDA kernels:
  - Stream 1: Ka1, Kb1
  - Stream 2: Ka2, Kb2
- The kernels are similar in size and each fills only part of the SM resources, so two can run concurrently.
- Issue depth first (Ka1, Kb1, Ka2, Kb2): Kb1 sits between Ka1 and Ka2 in the compute queue, so Ka2 cannot be dispatched until Kb1 has been; runtime = 3.
- Issue breadth first (Ka1, Ka2, Kb1, Kb2): the independent kernels sit next to each other in the queue and run concurrently; runtime = 2. Issue order matters!

Example: Optimal Concurrency Can Depend on Kernel Execution Time
- Two streams, again just issuing CUDA kernels, but the kernels have different 'sizes' (execution times):
  - Stream 1: Ka1 {2}, Kb1 {1}
  - Stream 2: Kc2 {1}, Kd2 {2}
- The kernels fill only part of the SM resources.
- Depth-first issue: runtime = 5; breadth-first issue: runtime = 4; a custom issue order: runtime = 3.
- Issue order matters, and execution time matters!

Concurrent Kernel Scheduling
- Concurrent kernel scheduling is special.
- Normally, a signal is inserted into the queues after an operation to launch the next operation in the same stream.
- For the compute engine queue, to enable concurrent kernels, when compute kernels are issued sequentially this signal is delayed until after the last sequential compute kernel.
- In some situations this delay of signals can block other queues.
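
To make the blocked-kernel example above concrete, the sketch below issues the same four kernels depth first and then breadth first; the kernel names, arguments, grid/block, and streams are placeholders.

    // Depth first: all of stream 1, then all of stream 2. Kb1 sits between Ka1 and Ka2
    // in the compute queue, so Ka2 is not dispatched until Kb1 is (runtime = 3 above).
    kernelA <<< grid, block, 0, stream1 >>> ( d_a );   // Ka1
    kernelB <<< grid, block, 0, stream1 >>> ( d_a );   // Kb1
    kernelA <<< grid, block, 0, stream2 >>> ( d_b );   // Ka2
    kernelB <<< grid, block, 0, stream2 >>> ( d_b );   // Kb2

    // Breadth first: alternate between the streams, so independent kernels are adjacent
    // in the compute queue and can run concurrently (runtime = 2 above).
    kernelA <<< grid, block, 0, stream1 >>> ( d_a );   // Ka1
    kernelA <<< grid, block, 0, stream2 >>> ( d_b );   // Ka2
    kernelB <<< grid, block, 0, stream1 >>> ( d_a );   // Kb1
    kernelB <<< grid, block, 0, stream2 >>> ( d_b );   // Kb2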

Example: Concurrent Kernels and Blocking (breadth first)
- Three streams, each performing (HD, K, DH).
- Breadth-first issue: the sequentially issued kernels K1, K2, K3 delay their signals, which blocks the cudaMemcpyAsync(D2H) operations behind them; runtime = 7.

Example: Concurrent Kernels and Blocking (depth first)
- The same three streams, each performing (HD, K, DH).
- Depth-first issue: the kernels are no longer issued sequentially, so the signals are not delayed; runtime = 5. Depth first is 'usually' best for Fermi.

Previous Architectures
- Compute capability 1.0+: support for GPU / CPU concurrency.
- Compute capability 1.1+ (e.g., C1060): adds support for asynchronous memcopies, with a single copy engine (some exceptions; check using the asyncEngineCount device property).
- Compute capability 2.0+ (e.g., C2050): adds support for concurrent GPU kernels (some exceptions; check using the concurrentKernels device property) and adds a second copy engine to support bidirectional memcopies (some exceptions; check using the asyncEngineCount device property).
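
The device properties named above can be queried at runtime; below is a small self-contained sketch (device 0 is an assumption):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main () {
        cudaDeviceProp prop;
        cudaGetDeviceProperties ( &prop, 0 );                            // device 0 assumed
        printf ( "asyncEngineCount  = %d\n", prop.asyncEngineCount );    // 1 = one copy engine, 2 = bidirectional copies
        printf ( "concurrentKernels = %d\n", prop.concurrentKernels );   // 1 = concurrent kernels supported
        return 0;
    }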

Additional Details
- It is difficult to get more than 4 kernels to run concurrently.
- Concurrency can be disabled with the environment variable CUDA_LAUNCH_BLOCKING.
- cudaStreamQuery can be used to separate sequential kernels and prevent delaying signals.
- Kernels that use more than 8 textures cannot run concurrently.
- Switching the L1/shared configuration will break concurrency.
- To run concurrently, CUDA operations must have no more than 62 intervening CUDA operations; that is, in 'issue order' they must not be separated by more than 62 other issues. Further operations are serialized.
- cudaEvent_t is useful for timing, but for performance use cudaEventCreateWithFlags ( &event, cudaEventDisableTiming ).

Concurrency Guidelines
- Code to the programming model: streams. Future devices will continually improve the hardware representation of the streams model.
- Pay attention to issue order; it can make a difference.
- Pay attention to resources and operations that can break concurrency:
  - anything in the default stream
  - events and synchronization
  - stream queries
  - L1/shared configuration changes
  - more than 8 textures
- Use tools (Visual Profiler, Parallel Nsight) to visualize concurrency (but these do not currently show concurrent kernels).

Thank you. Questions?
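
Two of the tips in 'Additional Details' above, creating a synchronization-only event and separating sequentially issued kernels with cudaStreamQuery, look roughly like the following sketch; the event, kernel, argument, and stream names are placeholders.

    // Event used only for inter-stream synchronization: disable timing to reduce overhead.
    cudaEvent_t event;
    cudaEventCreateWithFlags ( &event, cudaEventDisableTiming );

    // A stream query between the launches separates the sequentially issued kernels,
    // so the delayed kernel signal does not block the copy engine queues.
    kernelA <<< grid, block, 0, stream1 >>> ( d_a );
    cudaStreamQuery ( stream1 );                    // returns immediately; does not block the host
    kernelB <<< grid, block, 0, stream2 >>> ( d_b );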

