NVIDIA CUDA Programming Guide

NVIDIA CUDA C Programming Guide
Version […], 4/16/2012

Changes from Version […]

Updated Chapter 4, Chapter 5, and Appendix F to include information on devices of compute capability […].
Replaced each reference to "processor core" with "multiprocessor" in Section […].
Replaced Table A-1 by a reference to […].
Added new Section […] on the warp shuffle functions.

Table of Contents

Chapter 1. Introduction
  From Graphics Processing to General-Purpose Parallel Computing
  CUDA: a General-Purpose Parallel Computing Architecture
  A Scalable Programming Model
  Document's Structure

Chapter 2. Programming Model
  Kernels
  Thread Hierarchy
  Memory Hierarchy
  Heterogeneous Programming
  Compute Capability

Chapter 3. Programming Interface
  Compilation with NVCC
  Compilation Workflow
  Offline Compilation
  Just-in-Time Compilation
  Binary Compatibility
  PTX Compatibility
  Application Compatibility
  C/C++ Compatibility
  64-Bit Compatibility
  CUDA C Runtime
  Initialization
  Device Memory
  Shared Memory
  Page-Locked Host Memory
  Portable Memory
  Write-Combining Memory
  Mapped Memory
  Asynchronous Concurrent Execution
  Concurrent Execution between Host and Device
  Overlap of Data Transfer and Kernel Execution
  Concurrent Kernel Execution
  Concurrent Data Transfers
  Streams
  Events
  Synchronous Calls
  Multi-Device System
  Device Enumeration
  Device Selection
  Stream and Event Behavior
  Peer-to-Peer Memory Access
  Peer-to-Peer Memory Copy
  Unified Virtual Address Space
  Error Checking
  Call Stack
  Texture and Surface Memory
  Texture Memory
  Surface Memory
  CUDA Arrays
  Read/Write Coherency
  Graphics Interoperability
  OpenGL Interoperability
  Direct3D Interoperability
  SLI Interoperability
  Versioning and Compatibility
  Compute Modes
  Mode Switches
  Tesla Compute Cluster Mode for Windows

Chapter 4. Hardware Implementation
  SIMT Architecture
  Hardware Multithreading

Chapter 5. Performance Guidelines
  Overall Performance Optimization Strategies
  Maximize Utilization
  Application Level
  Device Level
  Multiprocessor Level
  Maximize Memory Throughput
  Data Transfer between Host and Device
  Device Memory Accesses
  Global Memory
  Local Memory
  Shared Memory
  Constant Memory
  Texture and Surface Memory
  Maximize Instruction Throughput
  Arithmetic Instructions
  Control Flow Instructions
  Synchronization Instruction

Appendix A. CUDA-Enabled GPUs

Appendix B. C Language Extensions
  Function Type Qualifiers
  __device__
  __global__
  __host__
  __noinline__ and __forceinline__
  Variable Type Qualifiers
  __device__
  __constant__
  __shared__
  __restrict__
  Built-in Vector Types
  char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, longlong1, ulonglong1, longlong2, ulonglong2, float1, float2, float3, float4, double1, double2
  dim3
  Built-in Variables
  gridDim
  blockIdx
  blockDim
  threadIdx
  warpSize
  Memory Fence Functions
  Synchronization Functions
  Mathematical Functions
  Texture Functions
  tex1Dfetch()
  tex1D()
  tex2D()
  tex3D()
  tex1DLayered()
  tex2DLayered()
  texCubemap()
  texCubemapLayered()
  tex2Dgather()
  Surface Functions
  surf1Dread()
  surf1Dwrite()
  surf2Dread()
  surf2Dwrite()
  surf3Dread()
  surf3Dwrite()
  surf1DLayeredread()
  surf1DLayeredwrite()
  surf2DLayeredread()
  surf2DLayeredwrite()
  surfCubemapread()
  surfCubemapwrite()
  surfCubemapLayeredread()
  surfCubemapLayeredwrite()
  Time Function
  Atomic Functions
  Arithmetic Functions
  atomicAdd()
  atomicSub()
  atomicExch()
  atomicMin()
  atomicMax()
  atomicInc()
  atomicDec()
  atomicCAS()
  Bitwise Functions
  atomicAnd()
  atomicOr()
  atomicXor()
  Warp Vote Functions
  Warp Shuffle Functions
  Synopsis
  Description
  Return Value
  Notes
  Examples
  Broadcast of a single value across a warp
  Inclusive plus-scan across sub-partitions of 8 threads
  Reduction across a warp
  Profiler Counter Function
  Assertion
  Formatted Output
  Format Specifiers
  Limitations
  Associated Host-Side API
  Examples
  Dynamic Global Memory Allocation
  Heap Memory Allocation
  Interoperability with Host Memory API
  Examples
  Per Thread Allocation
  Per Thread Block Allocation
  Allocation Persisting Between Kernel Launches
  Execution Configuration
  Launch Bounds
  #pragma unroll

Appendix C. Mathematical Functions
  Standard Functions
  Single-Precision Floating-Point Functions
  Double-Precision Floating-Point Functions
  Intrinsic Functions
  Single-Precision Floating-Point Functions
  Double-Precision Floating-Point Functions

Appendix D. C/C++ Language Support
  Code Samples
  Data Aggregation Class
  Derived Class
  Class Template
  Function Template
  Functor Class
  Restrictions
  Qualifiers
  Device Memory Qualifiers
  Volatile Qualifier
  Pointers
  Operators
  Assignment Operator
  Address Operator
  Functions
  Function Parameters
  Static Variables within Function
  Function Pointers
  Function Recursion
  Classes
  Data Members
  Function Members
  Constructors and Destructors
  Virtual Functions
  Virtual Base Classes
  Windows-Specific
  Templates

Appendix E. Texture Fetching
  Nearest-Point Sampling
  Linear Filtering
  Table Lookup

Appendix F. Compute Capabilities
  Features and Technical Specifications
  Floating-Point Standard
  Compute Capability
  Architecture
  Global Memory
  Devices of Compute Capability […] and […]
  Devices of Compute Capability […] and […]
  Shared Memory
  32-Bit Strided Access
  32-Bit Broadcast Access
