Case Study - Computational Fluid Dynamics (CFD) …

Case Study - Computational Fluid Dynamics (CFD) using Graphics Processing Units
Aaron F. Shinn
Mechanical Science and Engineering Dept., UIUC
Summer School 2009: Many-Core Processors for Science and Engineering Applications

Outline: Introduction; Case Study (Overview, Implementation, Results); Key Lessons

What is CFD?
- Computational Fluid Dynamics: solve the governing equations of fluid motion numerically
  - Conservation of Mass (Continuity Equation)
  - Conservation of Momentum (Newton's 2nd Law)
  - Conservation of Energy (1st Law of Thermodynamics)
- Coupled set of nonlinear Partial Differential Equations (PDEs)
- Solution time can be very long, which makes GPUs very attractive

General Governing Equations
Conservation of Mass:
$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{u}) = 0$$
Conservation of Momentum:
$$\rho \frac{D\mathbf{u}}{Dt} = -\nabla p + \nabla \cdot \boldsymbol{\tau}$$
Conservation of Energy:
$$\rho C_p \frac{DT}{Dt} = \beta T \frac{Dp}{Dt} + \nabla \cdot (k \nabla T) + \Phi$$
Viscous stress tensor:
$$\tau_{ij} = \mu \left( \frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i} \right) + \delta_{ij} \lambda (\nabla \cdot \mathbf{u})$$
Substantial derivative:
$$\frac{D(\,)}{Dt} = \frac{\partial (\,)}{\partial t} + \mathbf{u} \cdot \nabla (\,)$$

Overview of Case Study
- Illustrate CFD implementation issues with a real research example: CU-FLOW.

- General-purpose Cartesian-based 3D Navier-Stokes solver written in C/CUDA for GPUs
- First implementation of a fractional-step/multigrid Navier-Stokes solver for Large-Eddy Simulations (LES) of turbulence on GPUs
- Many different variations of this code were created
- Countless hours spent on algorithm design, optimizations, and debugging!

Governing Equations for this Study
3D incompressible Navier-Stokes equations
Conservation of Mass:
$$\nabla \cdot \mathbf{u} = 0$$
Conservation of Momentum:
$$\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} = -\frac{1}{\rho} \nabla p + \nu \nabla^2 \mathbf{u}$$

Numerical Methodology
- Discretized via the Finite-Volume Method on a staggered Cartesian mesh.
- Smagorinsky SGS model used for turbulence modeling.
- Equations solved with a fractional-step method.
- Pressure-Poisson equation (PPE) solved using red-black Gauss-Seidel.
- Geometric multigrid used for convergence acceleration of the PPE.
- Temporal advancement: explicit 2nd-order Adams-Bashforth.
- Spatial derivatives: 2nd-order central differencing.

Geometric Multigrid: V-cycle
Figure: Multigrid V-cycle, where S = smooth, R = restrict residual, P = prolongate.
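The explicit 2nd-order Adams-Bashforth advancement listed under Numerical Methodology has the generic form u^{n+1} = u^n + dt (3/2 f^n - 1/2 f^{n-1}). The following is a minimal C sketch on a scalar model problem, not CU-FLOW code: the test equation du/dt = -u, the names `rhs` and `ab2_integrate`, and the forward-Euler start-up step are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical model right-hand side f(u) = -u, standing in for the
   discretized convective and viscous terms of the momentum equation. */
double rhs(double u)
{
    return -u;
}

/* Integrate du/dt = rhs(u) over nsteps steps of size dt with AB2:
   u^{n+1} = u^n + dt * (1.5 * f^n - 0.5 * f^{n-1}).
   The first step uses forward Euler because f^{n-1} does not exist yet. */
double ab2_integrate(double u0, double dt, int nsteps)
{
    assert(dt > 0.0 && nsteps >= 0);
    double u = u0;
    if (nsteps == 0)
        return u;
    double f_old = rhs(u);
    u += dt * f_old;                       /* Euler start-up step */
    for (int n = 1; n < nsteps; n++) {
        double f_new = rhs(u);
        u += dt * (1.5 * f_new - 0.5 * f_old);
        f_old = f_new;                     /* keep f^{n-1} for the next step */
    }
    return u;
}
```

Calling `ab2_integrate(1.0, 0.01, 100)` advances the model problem to t = 1, where the exact solution is exp(-1). The scheme needs only one stored right-hand side from the previous step, which is part of why explicit multistep schemes are cheap in memory.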

Only three mesh levels are shown in the V-cycle figure for clarity.

Multigrid: How good is it?
- Consider a unit-square 2D domain and solve the Laplace equation $\nabla^2 \phi = 0$ on that domain.
- Multigrid converges in just a few iterations, whereas using a single grid takes thousands!
Figure: Residuals of multigrid and single grid for solution of the Laplace equation on a 256x256 grid, tolerance = 10^[...].

Layout of CU-FLOW code
Preprocessing on CPU
- set up and generate mesh
- copy data to GPU
Time-stepping loop controlled on CPU
    for (n = 1; n <= nsteps; n++) {
        Processing solution on GPU (call kernels)
        - advance velocity from $\mathbf{u}^n$ to $\hat{\mathbf{u}}$ (Adams-Bashforth)
        - advance $p^n$ to $p^{n+1}$ (Multigrid V-cycle)
        - advance $\hat{\mathbf{u}}$ to $\mathbf{u}^{n+1}$
    } // end time-stepping loop
Postprocessing on CPU
- copy data from GPU
- write plot files

Mapping between threads and cells
Figure: Correspondence between GPU grid and computational mesh.

Multithreading Multigrid
- Optimal block size may conflict with mesh level dimensions.

- Example: would like a 4x4x4 mesh as the coarsest level, but 32x1x8 is the optimal block size. Cannot map one-to-one because the block dimensions exceed the mesh.
- Question: how to resolve this conflict?
- Possible solution: set block size based on mesh dimensions.

Multithreading Multigrid: host code for calling a kernel
    // define *fine mesh* dimensions of the blocks
    #define bx_f 32
    #define by_f 1
    #define bz_f 8
    // define *coarse mesh* dimensions of the blocks
    #define bx_c 4
    #define by_c 4
    #define bz_c 4
    ..
    for (n = 1; n <= ngrid; n++) {
        // use block size for coarse mesh by default
        bx = bx_c; by = by_c; bz = bz_c;
        // for finer meshes, use better block size
        if (nx[n] % bx_f == 0 && ny[n] % by_f == 0) {
            bx = bx_f; by = by_f; bz = bz_f;
        }
        dim3 block(bx, by, bz);
        dim3 grid(nx[n] / bx, ny[n] / by);
        kernel<<<grid, block>>>(.., n, ..);
    }
    ..

Multithreading Multigrid: device code for kernel
    __global__ void kernel(.., n, ..)
    {
        // i = tx + 2, j = ty + 2 (offset thread indices to mesh indices)
        i = threadIdx.x + blockIdx.x * blockDim.x + 2;
        j = threadIdx.y + blockIdx.y * blockDim.y + 2;
        // the 2D grid covers x and y; each block walks through z in slabs of blockDim.z cells
        for (slice = 0; slice < nz[n] / blockDim.z; slice++) {
            k = threadIdx.z + slice * blockDim.z + 2;
            // flat index of cell (i,j,k) on mesh level n, including ghost cells
            m = i + (j-1)*(nx[n]+2) + (k-1)*(nx[n]+2)*(ny[n]+2) + begin[n] - 1;
            .. kernel computations ..
        }
    }

CUDA implementation of Red-Black Gauss-Seidel
- Color the grid like a checkerboard to enable parallel processing of the pressure update.
- First update the red pressures, then update the black pressures.
Figure: 2D example of red-black coloring of a grid.

CUDA implementation of Red-Black Gauss-Seidel: updating pressure, host code
    for (icyc = 1; icyc <= ncyc; icyc++) {      // go through all V-cycles
        for (n = ngrid; n >= 1; n--) {          // downleg of V-cycle
            // use block size for coarse mesh by default
            bx = bx_c; by = by_c; bz = bz_c;
            // for finer meshes, use better block size
            if (nx[n] % bx_f == 0 && ny[n] % by_f == 0) {
                bx = bx_f; by = by_f; bz = bz_f;
            }
            dim3 block(bx, by, bz);
            dim3 grid(nx[n] / bx, ny[n] / by);
            for (iswp = 1; iswp <= nswp; iswp++) {
                red_kernel<<<grid, block>>>(.., n, ..);
                black_kernel<<<grid, block>>>(.., n, ..);
            }
            ..
        }
        ..
    }

CUDA implementation of Red-Black Gauss-Seidel: red kernel, device code
    __global__ void red_kernel(..)
    {
        i = threadIdx.x + blockIdx.x * blockDim.x + 2;
        j = threadIdx.y + blockIdx.y * blockDim.y + 2;
        for (slice = 0; slice < nz_d[n] / blockDim.z; slice++) {
            k = threadIdx.z + slice * blockDim.z + 2;
            if ((i + j + k) % 2 == 0) {   // test if red cell
                m = i + (j-1)*(nx[n]+2) + (k-1)*(nx[n]+2)*(ny[n]+2) + begin[n] - 1;
                // flat indices of the six neighbors of cell m
                xm = xm_d[m]; xp = xp_d[m];
                ym = ym_d[m]; yp = yp_d[m];
                zm = zm_d[m]; zp = zp_d[m];
                res = (aw_d[m] * pressure_d[xm] + ae_d[m] * pressure_d[xp] +
                       as_d[m] * pressure_d[ym] + an_d[m] * pressure_d[yp] +
                       al_d[m] * pressure_d[zm] + ah_d[m] * pressure_d[zp] +
                       resc_d[m]) / ap_d[m];
                // relaxed update with factor relxp
                pressure_d[m] = relxp * res + (1.0 - relxp) * pressure_d[m];
            } // end if
        } // end slice
    } // end kernel

Profiling of CU-FLOW
- Red-black Gauss-Seidel kernels consume over 2/3 of GPU time!
- Must optimize the red-black Gauss-Seidel kernels.

CUDA implementation of Red-Black Gauss-Seidel: memory management in red-black kernels
- Global memory: easiest, but slow.
- Shared memory: gives only marginally better performance, perhaps due to low data reuse or the handling of boundary halos for each sub-domain in shared memory.
- Texture memory: fetch device memory through textures instead of expensive global memory loads.
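The red-black update and the flat cell index used in the kernels can be tried out on the CPU. The sketch below is plain C, not the CU-FLOW kernels: it assumes unit neighbor coefficients with ap = 6 (a uniform-mesh Laplace problem), a single mesh level with begin[n] = 1 and zero-based storage, and the hypothetical names `cell_index`, `sweep_color`, and `redblack_demo`.

```c
#include <assert.h>
#include <stdlib.h>

/* Zero-based flat index of cell (i, j, k). Indices count from 1 and each
   direction carries one ghost cell per side, so interior cells are 2..n+1.
   This is the slides' m with begin[n] = 1, shifted to zero-based storage. */
int cell_index(int i, int j, int k, int nx, int ny)
{
    return (i - 1) + (j - 1) * (nx + 2) + (k - 1) * (nx + 2) * (ny + 2);
}

/* Relax every interior cell of one color: color 0 = red ((i+j+k) even),
   color 1 = black. Unit neighbor coefficients and ap = 6 are assumed. */
void sweep_color(double *p, int nx, int ny, int nz, int color, double relxp)
{
    const int sy = nx + 2;               /* stride between j-neighbors */
    const int sz = (nx + 2) * (ny + 2);  /* stride between k-neighbors */
    for (int k = 2; k <= nz + 1; k++)
        for (int j = 2; j <= ny + 1; j++)
            for (int i = 2; i <= nx + 1; i++) {
                if ((i + j + k) % 2 != color)
                    continue;
                int m = cell_index(i, j, k, nx, ny);
                double res = (p[m - 1] + p[m + 1] + p[m - sy] + p[m + sy]
                              + p[m - sz] + p[m + sz]) / 6.0;
                p[m] = relxp * res + (1.0 - relxp) * p[m];
            }
}

/* Hypothetical driver: Laplace problem on an n^3 interior whose ghost cells
   are fixed to the harmonic field i + 2j + 3k. Returns the maximum error
   left in the interior after nsweeps red-black sweeps. */
double redblack_demo(int n, int nsweeps, double relxp)
{
    assert(n > 0 && relxp > 0.0 && relxp < 2.0);
    const int w = n + 2;
    double *p = calloc((size_t)w * w * w, sizeof *p);
    assert(p != NULL);
    for (int k = 1; k <= w; k++)
        for (int j = 1; j <= w; j++)
            for (int i = 1; i <= w; i++)
                if (i == 1 || i == w || j == 1 || j == w || k == 1 || k == w)
                    p[cell_index(i, j, k, n, n)] = i + 2.0 * j + 3.0 * k;
    for (int s = 0; s < nsweeps; s++) {
        sweep_color(p, n, n, n, 0, relxp);   /* red cells first */
        sweep_color(p, n, n, n, 1, relxp);   /* then black cells */
    }
    double err = 0.0;
    for (int k = 2; k <= n + 1; k++)
        for (int j = 2; j <= n + 1; j++)
            for (int i = 2; i <= n + 1; i++) {
                double d = p[cell_index(i, j, k, n, n)] - (i + 2.0 * j + 3.0 * k);
                if (d < 0.0)
                    d = -d;
                if (d > err)
                    err = d;
            }
    free(p);
    return err;
}
```

Every neighbor of a red cell is black and vice versa, so all cells of one color can be updated independently. That independence is what lets the slides assign one thread per cell within a color. On the GPU the triple loop disappears and each thread evaluates the color test for its own (i, j, k).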

Currently working on the texture-memory version. It is an alternative that avoids uncoalesced accesses.

Computational Resources
- GPU version: CUDA; CPU version: Fortran.
- Single precision used for all calculations.
- Dell Precision 690 Workstation (Linux: Red Hat Enterprise 5)
- CPU: [...] GHz Intel Xeon
- GPU: NVIDIA Tesla C1060 (~1 teraFLOP)

Laminar Flow in 3D Lid-Driven Cube
Figure: Computational domain for 3D lid-driven cube.
- Re_L = 1000
- mesh: 128x128x128, constant mesh spacing

Laminar Flow in 3D Lid-Driven Cube
Figure: results for the lid-driven cube.

Turbulent Flow in 3D Square Duct
Figure: Computational domain for 3D square duct.
- Re_τ = 360
- mesh: 256x64x64, 3% geometric stretching

3D Square Duct (Re_τ = 360)
Figure: Contours and velocity vectors of instantaneous streamwise velocity in cross-flow plane at x = [...].

3D Square Duct (Re_τ = 360)
Figure: Velocity vectors of mean flowfield in cross-flow plane. (a) present GPU simulation, (b) Madabhushi and Vanka.

Speedup of GPU vs. CPU
Performance of GPU versus CPU for the first 100 time-steps of simulation, with block size bx = by = bz = 4.
Table 1: Laminar flow in lid-driven cube
mesh | Fortran code (sec) | CUDA code (sec) | speedup (CPU/GPU)
[...] | [...] | [...] | [...]
Table 2: Turbulent flow in a square duct
mesh | Fortran code (sec) | CUDA code (sec) | speedup (CPU/GPU)
[...] | [...] | [...] | [...]

Speedup of GPU vs. CPU
Performance of GPU versus CPU for the first 100 time-steps of simulation, with block size bx = by = bz = 4 on coarser meshes and bx = 32, by = 1, bz = 8 on finer meshes.
Table 1: Laminar flow in lid-driven cube
mesh | Fortran code (sec) | GPU code (sec) | speedup (CPU/GPU)
[...] | [...] | [...] | [...]
Speedup improved by a factor of [...] for the 128x128x128 case.
Table 2: Turbulent flow in square duct
mesh | Fortran code (sec) | GPU code (sec) | speedup (CPU/GPU)
[...] | [...] | [...] | [...]
Speedup improved by a factor of [...] for the 256x64x64 case.

Key Lessons
- Speedup of the GPU scaled with the problem size; the largest problem size yielded the maximum speedup.
- Single precision did not appreciably affect the results, even for turbulent flows.

- Global memory is the easiest to use, but the worst for memory latency.
- Need global residuals to observe convergence. This requires cudaMemcpy between CPU and GPU, which is very expensive, so decide when you really need to see the residuals.

Key Lessons
- Optimization can be a time drain. Need to decide when the code is good enough.
- Two possibilities:
  - Code is complete and just needs porting to CUDA, so there is more time to optimize.
  - Code is not complete: need to add physics features, write in CUDA, and tune. May need to spend more time on the physics algorithm and get what you can out of minimal time coding in CUDA.

Future Work
- Model complex geometries in flow using the Immersed Boundary Method (IBM)
- Multi-GPU capability: collaborating with John Stone, [...]

References
[1] H. Ku, R. Hirsh, and T. Taylor. A Pseudospectral Method for Solution of the Three-Dimensional Incompressible Navier-Stokes Equations. Journal of Computational Physics, 70:439-462, 1987.
[2] Madabhushi and Vanka. Large eddy simulation of turbulence-driven secondary flow in a square duct. Physics of Fluids A, 3(11):2734-2745, 1991.
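Returning to the Key Lessons point that global residuals are expensive to bring back to the host: one way to act on it is to evaluate the residual only every few sweeps. The sketch below is plain C on a 1D Laplace model problem, with a counter standing in for the device-to-host copy. The names `gs_sweep`, `max_residual`, and `solve_sparse_checks`, and the parameter `check_every`, are assumptions for illustration and do not come from CU-FLOW.

```c
#include <assert.h>

/* One Gauss-Seidel sweep for u'' = 0 on interior points 1..n of u[0..n+1]. */
void gs_sweep(double *u, int n)
{
    for (int i = 1; i <= n; i++)
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);
}

/* Maximum absolute residual |u[i-1] - 2u[i] + u[i+1]| over the interior.
   On a GPU, the result of this reduction is what would be copied to the host. */
double max_residual(const double *u, int n)
{
    double r = 0.0;
    for (int i = 1; i <= n; i++) {
        double d = u[i - 1] - 2.0 * u[i] + u[i + 1];
        if (d < 0.0)
            d = -d;
        if (d > r)
            r = d;
    }
    return r;
}

/* Sweep until the residual drops below tol, evaluating it only every
   check_every sweeps. Returns the number of sweeps performed; *nchecks
   counts residual evaluations (stand-ins for device-to-host copies). */
int solve_sparse_checks(double *u, int n, double tol, int check_every,
                        int max_sweeps, int *nchecks)
{
    assert(check_every > 0);
    *nchecks = 0;
    int sweeps = 0;
    while (sweeps < max_sweeps) {
        gs_sweep(u, n);
        sweeps++;
        if (sweeps % check_every == 0) {
            (*nchecks)++;
            if (max_residual(u, n) < tol)
                break;
        }
    }
    return sweeps;
}
```

Checking every 25 sweeps instead of every sweep costs at most 24 extra sweeps in this sketch, while cutting the number of residual evaluations by a factor of about 25. Whether that trade is worthwhile on a GPU depends on the relative cost of a sweep and a transfer, which the slides do not quantify.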
