CUDA C++ Programming Guide - NVIDIA Developer

February 2022. CUDA C++ Programming Guide. Design Guide. Changes from Version Added Graph Memory Nodes. Formalized Asynchronous SIMT Programming. Table of Contents. Chapter 1. The Benefits of Using CUDA : A General-Purpose Parallel Computing Platform and Programming A Scalable Programming Document Chapter 2. Programming Thread Memory Heterogeneous Asynchronous SIMT Programming Asynchronous Compute 3. Programming Compilation with Compilation Offline Just-in-Time Binary PTX Application C++ 64-Bit CUDA Device Device Memory L2 Access L2 cache Set-Aside for Persisting L2 Policy for Persisting L2 Access L2 Persistence Reset L2 Access to Manage Utilization of L2 set-aside Query L2 cache Control L2 Cache Set-Aside Size for Persisting Memory Shared Page-Locked Host Portable Write-Combining Mapped Asynchronous Concurrent Concurrent Execution between Host and Concurrent Kernel Overlap of Data Transfer and Kernel Concurrent Data CUDA Synchronous Multi-Device Device Device Stream and Event Peer-to-Peer

Memory Peer-to-Peer Memory Unified Virtual Address Interprocess Error Call Texture and Surface Texture Surface CUDA Read/Write Graphics OpenGL Direct3D SLI External Resource Vulkan OpenGL Direct3D 12 Direct3D 11 NVIDIA Software Communication Interface Interoperability (NVSCI) CUDA User Versioning and Compute Mode Tesla Compute Cluster Mode for 4. Hardware SIMT Hardware 5. Performance Overall Performance Optimization Maximize Application Device Multiprocessor Occupancy Maximize Memory Data Transfer between Host and Device Memory Maximize Instruction Arithmetic Control Flow Synchronization Minimize Memory A.

CUDA-Enabled B. C++ Language Function Execution Space Undefined __noinline__ and Variable Memory Space Built-in Vector char, short, int, long, longlong, float, Built-in Memory Fence Synchronization Mathematical Texture Texture Object tex1Dfetch(), tex1D(), tex1DLod(), tex1DGrad(), tex2D(), tex2DLod(), tex2DGrad(), tex3D(), tex3DLod(), tex3DGrad(), tex1DLayered(), tex1DLayeredLod(), tex1DLayeredGrad(), tex2DLayered(), tex2DLayeredLod(), tex2DLayeredGrad(), texCubemap(), texCubemapLod(), texCubemapLayered(), texCubemapLayeredLod(), tex2Dgather(). Texture Reference tex1Dfetch().

tex1D(), tex1DLod(), tex1DGrad(), tex2D(), tex2DLod(), tex2DGrad(), tex3D(), tex3DLod(), tex3DGrad(), tex1DLayered(), tex1DLayeredLod(), tex1DLayeredGrad(), tex2DLayered(), tex2DLayeredLod(), tex2DLayeredGrad(), texCubemap(), texCubemapLod(), texCubemapLayered(), texCubemapLayeredLod(), tex2Dgather(). Surface Surface Object surf1Dread(), surf2Dread(), surf2Dwrite(), surf3Dread(), surf3Dwrite(), surf1DLayeredread(), surf1DLayeredwrite(), surf2DLayeredread(), surf2DLayeredwrite(), surfCubemapread(), surfCubemapwrite(), surfCubemapLayeredread(), surfCubemapLayeredwrite(). Surface Reference surf1Dread().

surf2Dread(), surf2Dwrite(), surf3Dread(), surf3Dwrite(), surf1DLayeredread(), surf1DLayeredwrite(), surf2DLayeredread(), surf2DLayeredwrite(), surfCubemapread(), surfCubemapwrite(), surfCubemapLayeredread(), surfCubemapLayeredwrite(). Read-Only Data Cache Load Load Functions Using Cache Store Functions Using Cache Time Atomic Arithmetic atomicAdd(), atomicSub(), atomicExch(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicCAS(). Bitwise atomicAnd(), atomicOr(), atomicXor(). Address Space Predicate __isGlobal(), __isShared(), __isConstant(), __isLocal(). Address Space Conversion __cvta_generic_to_global().

__cvta_generic_to_shared(), __cvta_generic_to_constant(), __cvta_generic_to_local(), __cvta_global_to_generic(), __cvta_shared_to_generic(), __cvta_constant_to_generic(), __cvta_local_to_generic(). Alloca Compiler Optimization Hint __builtin_assume_aligned(), __builtin_assume(), __assume(), __builtin_expect(), __builtin_unreachable(). Warp Vote Warp Match Warp Reduce Warp Shuffle Broadcast of a single value across a Inclusive plus-scan across sub-partitions of 8 Reduction across a Nanosleep Warp matrix Alternate Floating Double Sub-byte Element Types & Matrix Asynchronous Simple Synchronization Temporal Splitting and Five Stages of Bootstrap Initialization, Expected Arrival Count, and A Barrier's Phase: Arrival, Countdown, Completion, and Spatial Partitioning (also known as Warp Specialization).

Early Exit (Dropping out of Participation) Memory Barrier Primitives Data Memory Barrier Primitives Asynchronous Data memcpy_async Copy and Compute Pattern - Staging Data Through Shared Without With Asynchronous Data Copies using Performance Guidance for Trivially Warp Entanglement - Warp Entanglement - Warp Entanglement - Keep Commit and Arrive-On Operations Asynchronous Data Copies using Single-Stage Asynchronous Data Copies using Multi-Stage Asynchronous Data Copies using Pipeline Pipeline Primitives memcpy_async Commit Wait Arrive On Barrier Profiler Counter Trap Breakpoint Formatted Format Associated Host-Side Dynamic Global Memory

Allocation and Heap Memory Interoperability with Host Memory Per Thread Per Thread Block Allocation Persisting Between Kernel Execution Launch #pragma SIMD Video Diagnostic C. Cooperative What's New in CUDA Programming Model Composition Group Implicit Thread Block Grid Multi Grid Explicit Thread Block Coalesced Group Group Data Data inclusive_scan and Grid Multi-Device D. CUDA Dynamic Execution Environment and Memory Execution Parent and Child Scope of CUDA Streams and Ordering and Device Memory Coherence and Programming CUDA C++ Device-Side Kernel Device Memory API Errors and Launch API Device-side Launch from Kernel Launch Parameter Buffer Toolkit Support for Dynamic Including Device Runtime API in CUDA Compiling and Programming Dynamic-parallelism-enabled Kernel Implementation Restrictions and E.

Virtual Memory Query for Allocating Physical Shareable Memory Memory Compressible Reserving a Virtual Address Virtual Aliasing Mapping Control Access F. Stream Ordered Memory Query for API Fundamentals (cudaMallocAsync and cudaFreeAsync) Memory Pools and the Default/Implicit Explicit Physical Page Caching Resource Usage Memory Reuse Disabling Reuse Device Accessibility for Multi-GPU IPC Memory Creating and Sharing IPC Memory Set Access in the Importing Creating and Sharing Allocations from an Exported IPC Export Pool IPC Import Pool Synchronization API cudaMemcpyAsync Current Context/Device cuPointerGetAttribute Pointer G.

Graph Memory Support and API Graph Node Stream Accessing and Freeing Graph Memory Outside of the Allocating Optimized Memory Address Reuse within a Physical Memory Management and Performance First Launch / Physical Memory Peer Peer Access with Graph Node Peer Access with Stream H. Mathematical Standard Intrinsic I. C++ Language C++11 Language C++14 Language C++17 Language Host Compiler Preprocessor Device Memory Space __managed__ Memory Space Volatile Assignment Address Run Time Type Information (RTTI) Exception Standard External Implicitly-declared and explicitly-defaulted Function Static Variables within Function Function Friend Operator Data Function Virtual Virtual Base Anonymous Trigraphs and Const-qualified Long Deprecation Noreturn [[likely]] / [[unlikely]]
