Case Study - Computational Fluid Dynamics (CFD) …

Transcription of Case Study - Computational Fluid Dynamics (CFD) …

Case Study - Computational Fluid Dynamics (CFD) using Graphics Processing Units
Aaron F. Shinn
Mechanical Science and Engineering Dept., UIUC
Summer School 2009: Many-Core Processors for Science and Engineering Applications
Outline: Introduction, Case Study, Key Lessons
Shinn, CFD using GPUs, 1 / 30

What is CFD?
- Computational Fluid Dynamics: solve the governing equations of fluid motion numerically
  - Conservation of Mass (Continuity Equation)
  - Conservation of Momentum (Newton's 2nd Law)
  - Conservation of Energy (1st Law of Thermodynamics)
- Coupled set of nonlinear Partial Differential Equations (PDEs)
- Solution time can be very long, which makes GPUs very attractive

General Governing Equations
Conservation of Mass:
  \frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{u}) = 0
Conservation of Momentum:
  \rho \frac{D\mathbf{u}}{Dt} = -\nabla p + \nabla \cdot \boldsymbol{\tau}
Conservation of Energy:
  \rho C_p \frac{DT}{Dt} = \beta T \frac{Dp}{Dt} + \nabla \cdot (k \nabla T) + \Phi
where the viscous stress tensor is
  \tau_{ij} = \mu \left( \frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i} \right) + \delta_{ij} \lambda (\nabla \cdot \mathbf{u})
and the substantial derivative is

  \frac{D(\,)}{Dt} = \frac{\partial (\,)}{\partial t} + \mathbf{u} \cdot \nabla (\,)

Overview of Case Study
- Illustrate CFD implementation issues with a real research example
- CU-FLOW: general-purpose Cartesian-based 3D Navier-Stokes solver written in C/CUDA for GPUs
- First implementation of a fractional-step/multigrid Navier-Stokes solver for Large-Eddy Simulations (LES) of turbulence on GPUs
- Many different variations of this code were created
- Countless hours spent on algorithm design, optimizations, and debugging!

Governing Equations for this Study
3D incompressible Navier-Stokes equations
Conservation of Mass:
  \nabla \cdot \mathbf{u} = 0
Conservation of Momentum:
  \frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} = -\frac{1}{\rho} \nabla p + [viscous term missing in source]

Numerical Methodology
- Discretized via the Finite-Volume Method on a staggered Cartesian mesh

- Smagorinsky subgrid-scale (SGS) model used for turbulence modeling
- Solved equations with the fractional-step method
- Pressure-Poisson equation (PPE) solved using [...]
- Geometric multigrid used for convergence acceleration of the PPE
- Temporal advancement: explicit 2nd-order Adams-Bashforth
- Spatial derivatives: 2nd-order central differencing

Geometric Multigrid: V-cycle
Figure: Multigrid V-cycle, where S = smooth, R = restrict residual, P = prolongate. Only three mesh levels are shown for clarity.

Multigrid: How good is it?
- Consider a unit-square 2D domain and solve the Laplace equation \nabla^2 \phi = 0 on that domain
- Multigrid converges in just a few iterations, whereas using a single grid takes thousands!
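The smooth / restrict / prolongate pattern of the V-cycle can be sketched in a few lines of C. This is only an illustrative 1D model problem (-u'' = f on the unit interval with zero Dirichlet boundaries), not the CU-FLOW implementation: the function names, the choice of Gauss-Seidel with two pre- and post-smoothing sweeps, full-weighting restriction, and linear interpolation are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Gauss-Seidel sweeps for -u'' = f on [0,1] with n intervals (points 0..n).
   u[0] and u[n] hold Dirichlet boundary values and are never updated. */
void smooth(double *u, const double *f, int n, int sweeps) {
    double h2 = 1.0 / ((double)n * n);
    for (int s = 0; s < sweeps; s++)
        for (int i = 1; i < n; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h2 * f[i]);
}

/* Residual r = f - A u, with A the 3-point discretization of -d2/dx2. */
void residual(const double *u, const double *f, double *r, int n) {
    double inv_h2 = (double)n * n;
    r[0] = r[n] = 0.0;
    for (int i = 1; i < n; i++)
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) * inv_h2;
}

/* One V-cycle: S (smooth), R (restrict residual), recurse, P (prolongate), S. */
void v_cycle(double *u, const double *f, int n) {
    assert(n >= 2 && (n & (n - 1)) == 0);      /* n must be a power of two */
    if (n == 2) {                              /* coarsest level: one unknown, solve exactly */
        u[1] = 0.5 * (u[0] + u[2] + 0.25 * f[1]);
        return;
    }
    int nc = n / 2;
    double *r  = calloc((size_t)n + 1, sizeof *r);
    double *rc = calloc((size_t)nc + 1, sizeof *rc);
    double *ec = calloc((size_t)nc + 1, sizeof *ec);  /* coarse correction, zero initial guess */
    assert(r && rc && ec);

    smooth(u, f, n, 2);                               /* S: pre-smoothing */
    residual(u, f, r, n);
    for (int i = 1; i < nc; i++)                      /* R: full-weighting restriction */
        rc[i] = 0.25 * (r[2 * i - 1] + 2.0 * r[2 * i] + r[2 * i + 1]);
    v_cycle(ec, rc, nc);                              /* approximate A e = r on the coarse grid */
    for (int i = 1; i < n; i++)                       /* P: linear interpolation, add correction */
        u[i] += (i % 2 == 0) ? ec[i / 2] : 0.5 * (ec[i / 2] + ec[i / 2 + 1]);
    smooth(u, f, n, 2);                               /* S: post-smoothing */

    free(r); free(rc); free(ec);
}

/* Max-norm residual after `iterations` V-cycles (use_multigrid != 0) or
   single-grid Gauss-Seidel sweeps (use_multigrid == 0), for f = 1 and u = 0
   initially. The initial max-norm residual is 1, so the return value is also
   the reduction ratio. */
double residual_ratio(int n, int iterations, int use_multigrid) {
    double *u = calloc((size_t)n + 1, sizeof *u);
    double *f = malloc(((size_t)n + 1) * sizeof *f);
    double *r = malloc(((size_t)n + 1) * sizeof *r);
    assert(u && f && r);
    for (int i = 0; i <= n; i++) f[i] = 1.0;
    for (int it = 0; it < iterations; it++) {
        if (use_multigrid) v_cycle(u, f, n);
        else smooth(u, f, n, 1);
    }
    residual(u, f, r, n);
    double rmax = 0.0;
    for (int i = 1; i < n; i++) {
        double a = r[i] < 0.0 ? -r[i] : r[i];
        if (a > rmax) rmax = a;
    }
    free(u); free(f); free(r);
    return rmax;
}
```

Comparing `residual_ratio(256, 10, 1)` with `residual_ratio(256, 40, 0)` gives both runs the same number of fine-grid sweeps (ten V-cycles with four fine sweeps each versus forty single-grid sweeps), so the difference in residual reduction comes from the coarse-grid corrections alone.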

Figure: Residuals of multigrid and single grid for the solution of the Laplace equation on a 256x256 grid, tolerance = 10^[exponent missing in source].

Layout of CU-FLOW code
Preprocessing on CPU
- set [...] and generate mesh
- copy data to GPU
Time-stepping loop controlled on CPU
  for (n = 1; n <= nsteps; n++) {
    Processing solution on GPU (call kernels)
    - advance velocity from u^n to the intermediate velocity (Adams-Bashforth)
    - advance p^n to p^{n+1} (Multigrid V-cycle)
    - advance the intermediate velocity to u^{n+1}
  } // end time-stepping loop
Postprocessing on CPU
- copy data from GPU
- write plot files

Mapping between threads and cells
Figure: Correspondence between GPU grid and computational mesh.

Multithreading Multigrid
- Optimal block size may conflict with mesh level dimensions.

- Example: would like a 4x4x4 mesh as the coarsest level, but 32x1x8 is the optimal block size. Cannot map one-to-one because the block dimensions exceed the mesh dimensions.
- Question: how to resolve this conflict?
- Possible solution: set the block size based on the mesh level

Multithreading Multigrid
Host code for calling a kernel
  // define *fine mesh* dimensions of the blocks
  #define bx_f 32
  #define by_f 1
  #define bz_f 8
  // define *coarse mesh* dimensions of the blocks
  #define bx_c 4
  #define by_c 4
  #define bz_c 4
  ..
  for (n = 1; n <= ngrid; n++) {
    // use block size for coarse mesh by default
    bx = bx_c; by = by_c; bz = bz_c;
    // for finer meshes, use better block size
    if (nx[n]%bx_f == 0 && ny[n]%by_f == 0) {
      bx = bx_f; by = by_f; bz = bz_f;
    }
    dim3 block(bx, by, bz);
    dim3 grid(nx[n]/bx, ny[n]/by);
    kernel<<<grid, block>>>(.., n, ..);
  }
  ..

Multithreading Multigrid
Device code for kernel
  __global__ void kernel(.., n, ..)
  {
    // i = tx + 2, j = ty + 2 (offset thread indices to mesh indices)
    i = threadIdx.x + blockIdx.x * blockDim.x + 2;
    j = threadIdx.y + blockIdx.y * blockDim.y + 2;
    // the grid is 2D, so each block marches through z in slabs of blockDim.z cells
    for (slice = 0; slice < nz[n] / blockDim.z; slice++) {
      k = threadIdx.z + slice * blockDim.z + 2;
      m = i + (j-1)*(nx[n]+2) + (k-1)*(nx[n]+2)*(ny[n]+2) + begin[n] - 1;
      .. kernel computations ..
    }
  }

CUDA implementation of Red-Black Gauss-Seidel
- Color the grid like a checkerboard to enable parallel processing of pressure
- First update the red pressures, then update the black pressures
Figure: 2D example of red-black coloring of a grid.

CUDA implementation of Red-Black Gauss-Seidel
Updating pressure: host code
  for (icyc = 1; icyc <= ncyc; icyc++) {   // go through all V-cycles
    for (n = ngrid; n >= 1; n--) {          // downleg of V-cycle
      // use block size for coarse mesh by default
      bx = bx_c; by = by_c; bz = bz_c;
      // for finer meshes, use better block size
      if (nx[n]%bx_f == 0 && ny[n]%by_f == 0) {
        bx = bx_f; by = by_f; bz = bz_f;
      }
      dim3 block(bx, by, bz);
      dim3 grid(nx[n]/bx, ny[n]/by);
      for (iswp = 1; iswp <= nswp; iswp++) {
        red_kernel<<<grid, block>>>(.., n, ..);
        black_kernel<<<grid, block>>>(.., n, ..);
      }
      ..

CUDA implementation of Red-Black Gauss-Seidel
red kernel: device code
  __global__ void red_kernel( .. ) {
    i = threadIdx.x + blockIdx.x * blockDim.x + 2;
    j = threadIdx.y + blockIdx.y * blockDim.y + 2;
    for (slice = 0; slice < nz_d[n] / blockDim.z; slice++) {
      k = threadIdx.z + slice * blockDim.z + 2;
      if ( (i+j+k)%2 == 0 ) { // test if red cell
        m = i + (j-1)*(nx[n]+2) + (k-1)*(nx[n]+2)*(ny[n]+2) + begin[n] - 1;
        // indices of the six neighbouring cells (locals renamed so they do not shadow the arrays)
        ixm = xm[m]; ixp = xp[m];
        iym = ym[m]; iyp = yp[m];
        izm = zm[m]; izp = zp[m];
        res = (aw_d[m] * pressure_d[ixm] + ae_d[m] * pressure_d[ixp] +
               as_d[m] * pressure_d[iym] + an_d[m] * pressure_d[iyp] +
               al_d[m] * pressure_d[izm] + ah_d[m] * pressure_d[izp] +
               resc_d[m]) / ap_d[m];
        pressure_d[m] = relxp*(res) + (1.0 - relxp)*pressure_d[m];
      } // end if
    } // end slice
  } // end red_kernel

Profiling of CU-FLOW
Red-black Gauss-Seidel kernels consume over 2/3 of GPU time!
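The cell indexing and the red/black update used by the kernels above can be checked on the CPU with a small C reference. This is a sketch under stated assumptions, not the CU-FLOW code: it uses a single mesh level, unit neighbour coefficients with a centre coefficient of 6 (a plain 7-point Laplacian), and the hypothetical helper names `cell_index`, `sweep_color`, and `red_black_laplace_error`. Only the index formula and the relaxation expression follow the slides.

```c
#include <assert.h>
#include <stdlib.h>

/* Linear index of cell (i,j,k), counted from 1, on a level with one ghost
   layer on every side. `begin` is the index of the level's first cell inside
   a packed array, so cell (1,1,1) maps to `begin` (same formula as the slide). */
int cell_index(int i, int j, int k, int nx, int ny, int begin) {
    return i + (j - 1) * (nx + 2) + (k - 1) * (nx + 2) * (ny + 2) + begin - 1;
}

/* Update all interior cells of one colour: color 0 = red ((i+j+k) even),
   color 1 = black. Cells of one colour only read cells of the other colour,
   which is what makes the update safe to run in parallel on a GPU. */
void sweep_color(double *p, const double *src, int nx, int ny, int nz,
                 int begin, int color, double relxp) {
    int sx = 1, sy = nx + 2, sz = (nx + 2) * (ny + 2);   /* neighbour strides */
    for (int k = 2; k <= nz + 1; k++)
        for (int j = 2; j <= ny + 1; j++)
            for (int i = 2; i <= nx + 1; i++) {
                if ((i + j + k) % 2 != color) continue;
                int m = cell_index(i, j, k, nx, ny, begin);
                double res = (p[m - sx] + p[m + sx] + p[m - sy] + p[m + sy] +
                              p[m - sz] + p[m + sz] + src[m]) / 6.0;
                p[m] = relxp * res + (1.0 - relxp) * p[m];
            }
}

/* Laplace test problem: ghost cells fixed at 1, interior started at 0.
   Returns the largest |p - 1| over the interior after `sweeps` red+black passes. */
double red_black_laplace_error(int nx, int ny, int nz, int sweeps) {
    int total = (nx + 2) * (ny + 2) * (nz + 2);
    double *p = malloc((size_t)total * sizeof *p);
    double *src = calloc((size_t)total, sizeof *src);   /* zero source term */
    assert(p && src);
    for (int m = 0; m < total; m++) p[m] = 1.0;
    for (int k = 2; k <= nz + 1; k++)
        for (int j = 2; j <= ny + 1; j++)
            for (int i = 2; i <= nx + 1; i++)
                p[cell_index(i, j, k, nx, ny, 0)] = 0.0;

    for (int s = 0; s < sweeps; s++) {
        sweep_color(p, src, nx, ny, nz, 0, 0, 1.0);   /* red pass   */
        sweep_color(p, src, nx, ny, nz, 0, 1, 1.0);   /* black pass */
    }

    double err = 0.0;
    for (int k = 2; k <= nz + 1; k++)
        for (int j = 2; j <= ny + 1; j++)
            for (int i = 2; i <= nx + 1; i++) {
                double d = p[cell_index(i, j, k, nx, ny, 0)] - 1.0;
                if (d < 0.0) d = -d;
                if (d > err) err = d;
            }
    free(p); free(src);
    return err;
}
```

With all boundary values equal to 1 and no source term, the exact solution is 1 everywhere, so `red_black_laplace_error(4, 4, 4, 100)` should return a value close to zero. A reference of this kind is a convenient way to validate a GPU kernel's output cell by cell.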

Must optimize the red-black Gauss-Seidel kernels.

CUDA implementation of Red-Black Gauss-Seidel
Memory management in red-black kernels
- Global memory: easiest, but slow
- Shared memory: gives marginally better performance, perhaps due to low data reuse or handling of boundary halos for each sub-domain in shared memory
- Texture memory: fetch device memory through textures instead of expensive global memory loads. Currently working on this. This is an alternative to avoid uncoalesced memory accesses.

Computational Resources
- GPU version: CUDA, CPU version: Fortran.
- Single precision used for all calculations.
- Dell Precision 690 Workstation (Linux: Red Hat Enterprise 5)
- CPU: [...] GHz Intel Xeon
- GPU: NVIDIA Tesla C1060 (~1 teraFLOP)

Laminar Flow in 3D Lid-Driven Cube
Figure: Computational domain for 3D lid-driven cube.

- Re_L = 1000
- mesh: 128x128x128, constant mesh spacing

Turbulent Flow in 3D Square Duct
Figure: Computational domain for 3D square duct.
- Re_\tau = 360
- mesh: 256x64x64, 3% geometric stretching

3D square duct (Re_\tau = 360)
Figure: Contours and velocity vectors of instantaneous streamwise velocity in the cross-flow plane at x = [...].

3D square duct (Re_\tau = 360)
Figure: Velocity vectors of the mean flow field in the cross-flow plane: (a) present GPU simulation, (b) Madabhushi and Vanka.

Speedup of GPU vs. CPU
Performance of GPU versus CPU for the first 100 time-steps of simulation, with block size bx=by=bz=4.
Table 1: Laminar flow in lid-driven cube
  mesh    Fortran code (sec)    CUDA code (sec)    speedup (CPU/GPU)
  [values missing in source]
Table 2: Turbulent flow in a square duct
  mesh    Fortran code (sec)    CUDA code (sec)    speedup (CPU/GPU)
  [values missing in source]

Speedup of GPU vs. CPU
Performance of GPU versus CPU for the first 100 time-steps of simulation, with block size bx=by=bz=4 on coarser meshes and bx=32, by=1, bz=8 on finer meshes.
Table 1: Laminar flow in lid-driven cube
  mesh    Fortran code (sec)    GPU code (sec)    speedup (CPU/GPU)
  [values missing in source]
Speedup improved by a factor of [...] for the 128x128x128 case.
Table 2: Turbulent flow in square duct
  mesh    Fortran code (sec)    GPU code (sec)    speedup (CPU/GPU)
  [values missing in source]
Speedup improved by a factor of [...] for the 256x64x64 case.

Key Lessons
- Speedup of the GPU scaled with the problem size; the largest problem size yielded the maximum speedup.

