
CUDA C++ Best Practices Guide - NVIDIA Developer

CUDA C++ Best Practices Guide, Design Guide | January 2022

Table of Contents

What Is This Document?
Who Should Read This Guide?
Assess, Parallelize, Optimize, Deploy
Recommendations and Best Practices
Chapter 1. Assessing Your Application
Chapter 2. Heterogeneous Computing
  Differences between Host and Device
  What Runs on a CUDA-Enabled Device?
Chapter 3. Application Profiling
  Creating the Profile
  Identifying Hotspots
  Understanding Scaling
  Strong Scaling and Amdahl's Law
  Weak Scaling and Gustafson's Law
  Applying Strong and Weak Scaling
Chapter 4. Parallelizing Your Application
Chapter 5. Getting Started
  Parallel Libraries
  Parallelizing Compilers
  Coding to Expose Parallelism
Chapter 6. Getting the Right Answer
  Reference Comparison
  Unit Testing
  Numerical Accuracy and Precision
  Single vs. Double Precision
  Floating Point Math Is not Associative
  IEEE 754 Compliance
  x86 80-bit Computations
Chapter 7. Optimizing CUDA Applications
Chapter 8. Performance Metrics
  Using CPU Timers
  Using CUDA GPU Timers
  Theoretical Bandwidth Calculation
  Effective Bandwidth Calculation
  Throughput Reported by Visual Profiler
Chapter 9. Memory Optimizations
  Data Transfer Between Host and Device
  Pinned Memory
  Asynchronous and Overlapping Transfers with Computation
  Zero Copy
  Unified Virtual Addressing
  Device Memory Spaces
  Coalesced Access to Global Memory
  A Simple Access Pattern
  A Sequential but Misaligned Access Pattern
  Effects of Misaligned Accesses
  Strided Accesses
  L2 Cache
  L2 Cache Access Window
  Tuning the Access Window Hit-Ratio
  Shared Memory
  Shared Memory and Memory Banks
  Shared Memory in Matrix Multiplication (C=AB)
  Shared Memory in Matrix Multiplication (C=AAT)
  Asynchronous Copy from Global Memory to Shared Memory
  Local Memory
  Texture Memory
  Additional Texture Capabilities
  Constant Memory
  Registers
  NUMA Best Practices
Chapter 10. Execution Configuration Optimizations
  Calculating Occupancy
  Hiding Register Dependencies
  Thread and Block Heuristics
  Effects of Shared Memory
  Concurrent Kernel Execution
  Multiple Contexts
Chapter 11. Instruction Optimization
  Arithmetic Instructions
  Division Modulo Operations
  Loop Counters Signed vs. Unsigned
  Reciprocal Square Root
  Other Arithmetic Instructions
  Exponentiation With Small Fractional Arguments
  Math Libraries
  Precision-related Compiler Flags
  Memory Instructions
Chapter 12. Control Flow
  Branching and Divergence
  Branch Predication
Chapter 13. Deploying CUDA Applications
Chapter 14. Understanding the Programming Environment
  CUDA Compute Capability
  Additional Hardware Data
  Which Compute Capability Target
  CUDA Runtime
Chapter 15. CUDA Compatibility Developer's Guide
  CUDA Toolkit Versioning
  Source Compatibility
  Binary Compatibility
  CUDA Binary (cubin) Compatibility
  CUDA Compatibility Across Minor Releases
  Existing CUDA Applications within Minor Versions of CUDA
  Handling New CUDA Features and Driver APIs
  Using PTX
  Dynamic Code Generation
Chapter 16. Preparing for Deployment
  Testing for CUDA Availability
  Error Handling
  Building for Maximum Compatibility
  Distributing the CUDA Runtime and Libraries
  CUDA Toolkit Library Redistribution
  Which Files to Redistribute
  Where to Install Redistributed CUDA Libraries
Chapter 17. Deployment Infrastructure Tools
  Queryable state
  Modifiable state
  Cluster Management Tools
  Compiler JIT Cache Management Tools
Appendix A. Recommendations and Best Practices
  Overall Performance Optimization Strategies
Appendix B. nvcc Compiler Switches

List of Figures

Figure 1. Timeline comparison for copy and kernel execution
Figure 2. Memory spaces on a CUDA device
Figure 3. Coalesced access
Figure 4. Misaligned sequential addresses that fall within five 32-byte segments
Figure 5. Performance of offsetCopy kernel
Figure 6. Adjacent threads accessing memory with a stride of 2
Figure 7. Performance of strideCopy kernel
Figure 8. Mapping Persistent data accesses to set-aside L2 in sliding window
Figure 9. The performance of the sliding-window benchmark with fixed hit-ratio
Figure 10. The performance of the sliding-window benchmark with tuned hit-ratio
Figure 11. Block-column matrix multiplied by block-row matrix
Figure 12. Computing a row of a tile
Figure 13. Comparing Synchronous vs Asynchronous Copy from Global Memory to Shared Memory
Figure 14. Comparing Performance of Synchronous vs Asynchronous Copy from Global Memory to Shared Memory
Figure 15. Using the CUDA Occupancy Calculator to project GPU multiprocessor occupancy
Figure 16. Sample CUDA configuration data reported by deviceQuery
Figure 17. Components of CUDA
Figure 18. CUDA Toolkit and Minimum Driver Versions

List of Tables

Table 1. Salient Features of Device Memory
Table 2. Performance Improvements Optimizing C = AB Matrix Multiply
Table 3. Performance Improvements Optimizing C = AAT Matrix Multiplication
Table 4. Useful Features for tex1D(), tex2D(), and tex3D() Fetches
Table 5. Formulae for exponentiation by small fractions

Preface

What Is This Document?

This Best Practices Guide is a manual to help developers obtain the best performance from NVIDIA CUDA GPUs. It presents established parallelization and optimization techniques and explains coding metaphors and idioms that can greatly simplify programming for CUDA-capable GPU architectures. While the contents can be used as a reference manual, you should be aware that some topics are revisited in different contexts as various programming and configuration topics are explored.

As a result, it is recommended that first-time readers proceed through the Guide sequentially. This approach will greatly improve your understanding of effective programming practices and enable you to better use the Guide for reference later.

Who Should Read This Guide?

The discussions in this Guide all use the C++ programming language, so you should be comfortable reading C++ code. This Guide refers to and relies on several other documents that you should have at your disposal for reference, all of which are available at no cost from the CUDA website https://.
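As a brief illustration of the kind of CUDA C++ code the Guide's discussions assume, the sketch below is not taken from the Guide itself: it is a minimal SAXPY example using the common grid-stride-loop idiom, and the kernel name, problem size, and launch configuration are arbitrary choices made only for this example.

#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles several elements, so one launch
// configuration works for any problem size.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 20;            // arbitrary problem size for the example
    float *x = nullptr, *y = nullptr;
    // Managed (unified) memory keeps the sketch short; explicit transfers,
    // pinned memory, and asynchronous copies are separate optimization topics.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<256, 256>>>(n, 3.0f, x, y);   // arbitrary launch configuration

    // Synchronize and surface any launch or runtime error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("y[0] = %f (expected 5.0)\n", y[0]);

    cudaFree(x);
    cudaFree(y);
    return 0;
}

Compile with nvcc, for example: nvcc saxpy.cu -o saxpy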

