
How a GPU Works

Kayvon Fatahalian, 15-462 (Fall 2011)

Today
1. Review: the graphics pipeline
2. History: a few old GPUs
3. How a modern GPU works (and why it is so fast!)
4. Closer look at a real GPU design: NVIDIA GTX 285


Part 1: The graphics pipeline (an abstraction)

Vertex processing
Vertices (v0 through v5 in the slide's example) are transformed into screen space. Each vertex is transformed independently.

Primitive processing
The vertices v0 through v5 are grouped into primitives (triangles).
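The vertex and primitive steps above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the function names (`to_screen`, `assemble_triangles`), the 4x4 matrix convention, the 640x480 viewport, and the six sample vertices are all assumptions made for the example.

```python
# Hypothetical sketch: each vertex is transformed on its own, then vertices
# are grouped three at a time into triangles.

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def to_screen(vertex, mvp, width, height):
    """Transform one 3D vertex to screen space. No other vertex is consulted."""
    x, y, z, w = mat_vec(mvp, [vertex[0], vertex[1], vertex[2], 1.0])
    ndc_x, ndc_y, ndc_z = x / w, y / w, z / w      # perspective divide
    sx = (ndc_x + 1.0) * 0.5 * width               # [-1, 1] -> [0, width]
    sy = (1.0 - ndc_y) * 0.5 * height              # flip so +y points up
    return (sx, sy, ndc_z)

def assemble_triangles(verts):
    """Group consecutive vertices into triangles (leftover vertices are dropped)."""
    usable = len(verts) - len(verts) % 3
    return [tuple(verts[i:i + 3]) for i in range(0, usable, 3)]

IDENTITY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
vertices = [(-1, -1, 0), (1, -1, 0), (0, 1, 0), (-1, 1, 0), (1, 1, 0), (0, -1, 0)]

screen = [to_screen(v, IDENTITY, 640, 480) for v in vertices]
# Because no vertex depends on another, any processing order gives the same result.
screen_reversed = [to_screen(v, IDENTITY, 640, 480) for v in reversed(vertices)]
triangles = assemble_triangles(screen)
```

The list comprehension could be split across any number of workers without changing the output, which is the property the slide emphasizes.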

Vertices are then organized into primitives that are clipped and culled.

Rasterization
Primitives are rasterized into pixel fragments. Each primitive is rasterized independently.

Fragment processing
Fragments are shaded to compute a color at each pixel. Each fragment is processed independently.

Pixel operations
Fragments are blended into the frame buffer at their pixel locations (the z-buffer determines visibility).

Pipeline entities
Vertices (v0 through v5), primitives, fragments, shaded fragments, pixels.

Graphics pipeline
[Figure: the pipeline stages Vertex Generation, Vertex Processing, Primitive Generation, Primitive Processing, Fragment Generation, Fragment Processing, and Pixel Operations, each marked as either fixed-function or programmable. The stages pass along a vertex stream, a primitive stream, and a fragment stream, and they access memory buffers: vertex data buffers, textures, and the output image (pixels).]

Part 2: Graphics architectures (implementations of the graphics pipeline)

Independent
What's so important about independent computations?

Silicon Graphics RealityEngine (1993)
A "graphics supercomputer". [Figure: the pipeline stages mapped onto the machine.]

Pre-1999 PC 3D graphics accelerator
Examples: 3dfx Voodoo, NVIDIA RIVA TNT. [Figure: the pipeline stages shown beside a CPU and an accelerator whose blocks are labeled Clip/cull/rasterize, Tex (two units), and Pixel operations.]

GPU circa 1999
NVIDIA GeForce 256. [Figure: the pipeline stages divided between CPU and GPU.]

Direct3D 9 programmability: 2002
ATI Radeon 9700. [Figure: four vertex (Vtx) units, Clip/cull/rasterize, eight fragment (Frag) units each paired with a texture (Tex) unit, and Pixel operations.]

Direct3D 10 programmability: 2006
NVIDIA GeForce 8800 ("unified shading" GPU). [Figure: an array of sixteen Core blocks together with Tex, Pixel op, Clip/Cull/Rast, and Scheduler blocks.]
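The Clip/Cull/Rast and Pixel op blocks in these designs implement the rasterization and pixel-operation stages from Part 1 in hardware. As a rough software sketch of what those two stages do, the Python below rasterizes triangles into fragments and resolves visibility with a z-buffer. Everything here is an illustrative assumption: the function names, the tiny 4x4 frame buffer, the use of one constant depth per triangle (real rasterizers interpolate depth per fragment), and plain replacement instead of blending.

```python
# Hypothetical sketch of rasterization followed by z-buffered pixel operations.

def edge(a, b, p):
    """Signed area term: its sign tells which side of edge a->b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(tri, depth, color, width, height):
    """Return one fragment (x, y, depth, color) per pixel center covered by tri."""
    a, b, c = tri
    fragments = []
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)                      # sample at the pixel center
            w0, w1, w2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
            all_non_negative = w0 >= 0 and w1 >= 0 and w2 >= 0
            all_non_positive = w0 <= 0 and w1 <= 0 and w2 <= 0
            if all_non_negative or all_non_positive:    # accept either winding
                fragments.append((x, y, depth, color))
    return fragments

def pixel_ops(fragments, width, height):
    """Keep, for every pixel, the color of the nearest fragment (smallest depth)."""
    zbuf = [[float("inf")] * width for _ in range(height)]
    frame = [[None] * width for _ in range(height)]
    for x, y, z, color in fragments:
        if z < zbuf[y][x]:
            zbuf[y][x] = z
            frame[y][x] = color
    return frame

W = H = 4
near = rasterize(((0, 0), (4, 0), (0, 4)), 0.2, "red", W, H)
far = rasterize(((0, 0), (4, 0), (4, 4)), 0.8, "blue", W, H)

frame = pixel_ops(near + far, W, H)
frame_other_order = pixel_ops(far + near, W, H)
```

Each triangle is rasterized without reference to the other, and the z-buffer makes the final image the same regardless of the order in which the fragments arrive (as long as no two fragments at a pixel share the same depth).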

Part 3: How a shader core works (three key ideas)

GPUs are fast
Intel Core i7 Quad Core: ~100 GFLOPS peak, 730 million transistors (obtainable if you code your program to use 4 threads and SSE vector instructions).
AMD Radeon HD 5870: peak in the TFLOPS range, with transistors numbering in the billions (obtainable if you write OpenGL programs like you've done in this class).

A diffuse reflectance shader

    sampler mySamp;
    Texture2D<float3> myTex;
    float3 lightDir;

    float4 diffuseShader(float3 norm, float2 uv)
    {
      float3 kd;
      kd = myTex.Sample(mySamp, uv);
      // Clamp bounds 0.0 and 1.0 are assumed; the literals were lost in transcription.
      kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
      // Alpha of 1.0 is likewise assumed.
      return float4(kd, 1.0);
    }

Shader programming model: fragments are processed independently, but there is no explicit parallel programming. Each fragment follows its own logical sequence of control.
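To make the programming model concrete, here is a minimal Python model of the same shader. It is a sketch under stated assumptions rather than the course's code: `sample` is a hypothetical nearest-neighbour texture lookup, the clamp range [0, 1] and the alpha value 1.0 are assumed, and the one-texel texture and fragment list are made-up inputs.

```python
# Hypothetical Python model of the diffuse shader: one call per fragment,
# with no shared state between calls.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def sample(texture, uv):
    """Nearest-neighbour lookup into a 2D grid of RGB tuples; uv is in [0, 1]."""
    height, width = len(texture), len(texture[0])
    x = min(int(uv[0] * width), width - 1)
    y = min(int(uv[1] * height), height - 1)
    return texture[y][x]

def diffuse_shader(norm, uv, texture, light_dir):
    kd = sample(texture, uv)
    n_dot_l = clamp(dot(light_dir, norm), 0.0, 1.0)   # assumed clamp range
    return tuple(c * n_dot_l for c in kd) + (1.0,)    # assumed alpha of 1.0

texture = [[(1.0, 0.5, 0.25)]]
light_dir = (0.0, 0.0, 1.0)
fragments = [
    ((0.0, 0.0, 1.0), (0.5, 0.5)),    # surface facing the light
    ((0.0, 0.0, -1.0), (0.5, 0.5)),   # surface facing away from the light
]

shaded = [diffuse_shader(n, uv, texture, light_dir) for n, uv in fragments]
```

The shader body is written as if only one fragment existed; the parallelism comes entirely from the fact that the calls do not depend on one another.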


[Figure caption: "Big Guy, lookin' diffuse"]

Compile shader
One unshaded fragment input record goes in; one shaded fragment output record comes out. The diffuseShader source is compiled into the following instruction sequence. (The literal operands l(0.0) and l(1.0) are assumed; the transcription dropped the values inside l( ).)

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul    r3, v0, cb0[0]
    madd   r3, v1, cb0[1], r3
    madd   r3, v2, cb0[2], r3
    clmp   r3, r3, l(0.0), l(1.0)
    mul    o0, r0, r3
    mul    o1, r1, r3
    mul    o2, r2, r3
    mov    o3, l(1.0)

Execute shader
[Figure: a simple processor core made of three parts: Fetch/Decode, ALU (Execute), and Execution Context. The core runs the <diffuseShader> instruction sequence one instruction after another.]
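The execution step can be illustrated with a toy interpreter. This is only a sketch of the idea: the register semantics are assumptions (scalar registers, `sample` writing its three color components to r0, r1, and r2, literals 0.0 and 1.0), and `run` and `texture_lookup` are hypothetical names. The loop header plays the role of Fetch/Decode, the arithmetic plays the role of the ALU, and the `context` dictionary plays the role of the Execution Context.

```python
# Hypothetical toy interpreter for the <diffuseShader> instruction sequence.

PROGRAM = [
    ("sample", "r0", "v4", "t0", "s0"),
    ("mul",    "r3", "v0", "cb0[0]"),
    ("madd",   "r3", "v1", "cb0[1]", "r3"),
    ("madd",   "r3", "v2", "cb0[2]", "r3"),
    ("clmp",   "r3", "r3", 0.0, 1.0),     # assumed literals
    ("mul",    "o0", "r0", "r3"),
    ("mul",    "o1", "r1", "r3"),
    ("mul",    "o2", "r2", "r3"),
    ("mov",    "o3", 1.0),                # assumed literal
]

def run(program, context, texture_lookup):
    """Execute the program for one fragment; context holds that fragment's registers."""
    def val(operand):
        # Strings name registers; anything else is a literal value.
        return context[operand] if isinstance(operand, str) else operand

    for op, dst, *src in program:                      # fetch and decode
        if op == "sample":                             # execute
            r, g, b = texture_lookup(context[src[0]])
            context["r0"], context["r1"], context["r2"] = r, g, b
        elif op == "mul":
            context[dst] = val(src[0]) * val(src[1])
        elif op == "madd":
            context[dst] = val(src[0]) * val(src[1]) + val(src[2])
        elif op == "clmp":
            context[dst] = max(val(src[1]), min(val(src[2]), val(src[0])))
        elif op == "mov":
            context[dst] = val(src[0])
        else:
            raise ValueError("unknown opcode: " + op)
    return (context["o0"], context["o1"], context["o2"], context["o3"])

def make_context(normal):
    """Input record for one fragment: normal in v0..v2, uv in v4, light in cb0."""
    return {
        "v0": normal[0], "v1": normal[1], "v2": normal[2], "v4": (0.5, 0.5),
        "cb0[0]": 0.0, "cb0[1]": 0.0, "cb0[2]": 1.0,
    }

def texture_lookup(uv):
    return (1.0, 0.5, 0.25)    # constant texture, for illustration only

lit = run(PROGRAM, make_context((0.0, 0.0, 1.0)), texture_lookup)
unlit = run(PROGRAM, make_context((0.0, 0.0, -1.0)), texture_lookup)
```

Each fragment gets its own context, so running the program for one fragment never touches the registers of another.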

CPU-style cores
[Figure: a core containing Fetch/Decode, ALU (Execute), and Execution Context, plus a data cache (a big one), out-of-order control logic, a fancy branch predictor, and a memory pre-fetcher.]

Slimming down
Idea #1: Remove components that help a single instruction stream run fast.
[Figure: the core reduced to Fetch/Decode, ALU (Execute), and Execution Context.]

Two cores (two fragments in parallel)
[Figure: two cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context. Fragment 1 and fragment 2 each run their own copy of the <diffuseShader> instruction stream.]

Four cores (four fragments in parallel)

Sixteen cores (sixteen fragments in parallel)
16 cores = 16 simultaneous instruction streams.

Instruction stream sharing
But... many fragments should be able to share an instruction stream!
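The closing observation can be illustrated with a small counting experiment. The Python below is a sketch under assumptions, not a model of any real GPU: it uses a made-up two-instruction program, hypothetical register names (`n_dot_l`, `kd`), and counts an "instruction fetch" each time an instruction is decoded. It compares one instruction stream per fragment against a single stream applied to all fragments.

```python
# Hypothetical sketch: separate instruction streams versus one shared stream.

PROGRAM = [
    ("clmp", "r3", "n_dot_l", 0.0, 1.0),
    ("mul",  "o0", "kd", "r3"),
]

def execute(instr, ctx):
    """Apply one decoded instruction to one fragment's execution context."""
    op, dst, *src = instr
    def val(operand):
        return ctx[operand] if isinstance(operand, str) else operand
    if op == "mul":
        ctx[dst] = val(src[0]) * val(src[1])
    elif op == "clmp":
        ctx[dst] = max(val(src[1]), min(val(src[2]), val(src[0])))
    else:
        raise ValueError("unknown opcode: " + op)

def make_contexts(n):
    """n fragments whose lighting term ranges from -1.0 to 1.0."""
    return [{"n_dot_l": i / (n - 1) * 2.0 - 1.0, "kd": 0.5} for i in range(n)]

def run_separate_streams(program, contexts):
    """One core per fragment: every core fetches every instruction itself."""
    fetches = 0
    for ctx in contexts:
        for instr in program:
            fetches += 1
            execute(instr, ctx)
    return fetches

def run_shared_stream(program, contexts):
    """One fetch/decode per instruction, applied to every fragment in turn."""
    fetches = 0
    for instr in program:
        fetches += 1
        for ctx in contexts:
            execute(instr, ctx)
    return fetches

separate = make_contexts(16)
shared = make_contexts(16)
separate_fetches = run_separate_streams(PROGRAM, separate)
shared_fetches = run_shared_stream(PROGRAM, shared)
separate_out = [ctx["o0"] for ctx in separate]
shared_out = [ctx["o0"] for ctx in shared]
```

Both schedules produce identical outputs because every fragment runs the same program on its own data; the shared version simply pays the fetch and decode cost once per instruction instead of once per instruction per fragment.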

