CUDA Assignment Help: CME213 Heat Diffusion Equation


Introduction

Write a CUDA program to solve the heat diffusion equation. The background may look intimidating, but the actual workload is only implementing a few GPU functions.

Requirement

This assignment builds on the previous assignment’s theme of examining memory
access patterns. You will implement a finite difference solver for the 2D heat
diffusion equation in different ways to examine the performance
characteristics of the different implementations.

Background on the heat diffusion PDE

To solve this PDE, we are going to discretize both the temporal and the
spatial derivatives. To do this, we define a two-dimensional grid G(i, j), 1 ≤
i ≤ n(x), 1 ≤ j ≤ n(y), where we denote by n(x) (resp. n(y) ) the number of
points on the x-axis (resp. y-axis). At each time-step, we will evaluate the
temperature and its derivatives at these gridpoints.
While we will consistently use a first order discretization scheme for the
temporal derivative, we will use 2nd, 4th or 8th order discretization of the
spatial derivative.
If we denote by T(i, j) the temperature at time t at point (i, j) of the grid,
the 2nd-order spatial discretization scheme can be written as follows.
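
Written out (a standard form of this explicit update, using the notation above; C(x) and C(y) are the Courant numbers defined next, and all values on the right-hand side are taken at time t):

T_new(i, j) = T(i, j) + C(x) * [ T(i+1, j) - 2 T(i, j) + T(i-1, j) ]
                      + C(y) * [ T(i, j+1) - 2 T(i, j) + T(i, j-1) ]

where T_new(i, j) denotes the temperature at the next time step t + Δt.
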
The C(x) (xfcl in the code) and C(y) (yfcl in the code) constants are called
Courant numbers. They depend on the temporal discretization step, as well as
the spatial discretization step. To ensure the stability of the discretization
scheme, they have to be less than a maximum value given by the Courant-
Friedrichs-Lewy condition. You do not have to worry about this, because the
starter code already takes care of picking the temporal discretization step so
as to maximize the Courant numbers while ensuring stability.
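(For instance, for the 2nd-order scheme above the classical stability requirement is C(x) + C(y) ≤ 1/2.)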
The starter code also contains host and device functions named stencil which
contain the coefficients that go into the update equation. Therefore, you do
not need to figure out how to implement the different order updates. You only
need to understand how this function works and pass in the arguments
correctly.
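
Purely as an illustration of what such a helper does (the actual stencil function in the starter code has its own signature and also covers the 4th- and 8th-order cases), a 2nd-order device stencil might look like this:

// Illustrative sketch only, not the starter code's stencil(): applies the
// 2nd-order update at interior point (i, j) of a row-major grid of width gx.
__device__ float stencil2(const float* curr, int gx, int i, int j,
                          float xfcl, float yfcl) {
    float c = curr[j * gx + i];
    return c + xfcl * (curr[j * gx + i + 1] - 2.f * c + curr[j * gx + i - 1])
             + yfcl * (curr[(j + 1) * gx + i] - 2.f * c + curr[(j - 1) * gx + i]);
}
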

Boundary conditions

The starter code contains the functions that will update the boundary
conditions for you (see the file BC.h, in particular the function
gpuUpdateBCsOnly); these update the points on the border, which has a width of
order / 2. This way, you do not have to worry about the size of the stencil as
you approach the wall.

Various implementations

In this programming assignment, we are going to implement 2 different kernels
(and you can do a third one for extra credit):

  • Global (function gpuStencil): this kernel will use global memory and each thread will update exactly one point of the mesh. You should use a 2D grid with n(x) × n(y) threads total.
  • Block (function gpuStencilLoop): this kernel will also use global memory. Each thread will update numYPerStep points of the mesh (these points form a vertical line). You should also use a 2D grid with n(x) × n(y) / numYPerStep threads total (a launch-configuration sketch for both of these kernels follows this list).
  • (Extra Credit) Shared (function gpuShared): this kernel will use shared memory. A group of threads must load a piece of the mesh in shared memory and then compute the new values of the temperatures on the mesh. Each thread will load and update several elements.
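
For illustration, launch configurations along these lines could be used for the first two kernels; the block dimensions are arbitrary placeholders, and nx, ny, and numYPerStep are assumed to be integers already in scope:

// Illustrative launch configurations only (block sizes are placeholders).
#define BLOCK_X 32
#define BLOCK_Y 8

dim3 threads(BLOCK_X, BLOCK_Y);

// Global kernel: one thread per interior grid point.
dim3 blocks_global((nx + BLOCK_X - 1) / BLOCK_X,
                   (ny + BLOCK_Y - 1) / BLOCK_Y);

// Block kernel: each thread covers numYPerStep points along y, so the grid
// needs numYPerStep times fewer rows of thread blocks.
dim3 blocks_loop((nx + BLOCK_X - 1) / BLOCK_X,
                 (ny + BLOCK_Y * numYPerStep - 1) / (BLOCK_Y * numYPerStep));
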

Parameter file

The parameters used in the computation are read from the file params.in. You
will need to modify some parameters (see description of the starter code) in
this file to answer some of the questions. Note that this file will not be
submitted through the submission script.
Here is a list of parameters that are used in the order they appear in the
file:
int nx_, ny_; // number of grid points in each dimension
double lx_, ly_; // extent of physical domain in each dimension
int iters_; // number of iterations to do
int order_; // order of discretization
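
For example, a params.in describing a 1024 × 1024 grid with a 4th-order scheme could look like the following (the domain extent and iteration count below are placeholder values):

1024 1024
1.0 1.0
100
4
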

Starter code

The starter code is composed of the following files:

  • main.cu — This is the CUDA driver file. Do not modify this file.
  • gpuStencil.cu — This is the file containing the kernels. You will need to modify this file.
  • Makefile — make will build the binary. make clean will remove build files as well as debug output. Do not modify this file.
  • params.in — This file contains a basic set of parameters. For debugging, performance testing, and to answer the questions, you will need to modify this file. The only parameters you should modify are nx, ny (line 1) and order (line 4). This file however will not get submitted through the script.
  • simParams.h and simParams.cpp — These files contain the data structure necessary to handle the parameters of the problem. Do not modify these files.
  • Grid.h and Grid.cu — These files contain the data structure that models the grid used to solve the PDE. Do not modify this file.
  • BC.h — This file contains the class boundary_conditions that will allow us to update the boundary conditions during the simulation. Do not modify this file.
  • hw3.sh — This script is used to submit jobs to the queue.

    Note The files in the starter code contain some additional information about
    the implementation in the form of comments. Additionally, the CPU
    implementation should provide a clearer picture of the method and should aid
    your GPU work.

Running the program

Type make to compile the code. Once this is done, you can use the command:
$ ./main [-g] [-b] [-s]
where -g stands for global, -b for block, and -s for shared. Note that you can
run several implementations at the same time (for instance ./main -g -b to run
the global and block implementations).
The output produced by the program will contain:

  • The time and bandwidth for the CPU implementation, as well as for the implementations that you specified when running the program.
  • A report of the errors for the GPU implementations. Namely, the program will output:
    • The L2 norm of the final solution from the CPU implementation (i.e. the reference).
    • The L∞ norm of the relative error between the reference solution and the GPU solution.
    • The L2 norm of the error normalized by the L2 norm of the CPU implementation.

Typical ranges for the errors (for the parameters that you will be using) are:

  • [10^-7, 10^-5] for the L∞ norm error.
  • [10^-8, 10^-6] for the L2 norm error.

Search for TODO in gpuStencil.cu to see where you need to implement code.

Question 1

(30 points) Implement the function gpuStencil that runs the solver using
global memory. You must also fill in the function gpuComputation. The
difference (in terms of the norms) between your implementation and the
reference should be in the expected range.
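
A rough sketch of the one-point-per-thread idea is given below. It assumes a row-major padded grid of width gx = nx + 2, float data, and inlines the 2nd-order update instead of calling the provided stencil function; your gpuStencil must follow the starter code's interface and support all orders, so treat this only as a starting point.

// Sketch only: one thread updates one interior point (2nd order, row-major).
// curr/next are the current and next temperature arrays, gx the padded width,
// nx/ny the number of interior points, xfcl/yfcl the Courant numbers.
__global__ void gpuStencilSketch(float* next, const float* curr, int gx,
                                 int nx, int ny, float xfcl, float yfcl) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // interior x index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // interior y index
    if (i < nx && j < ny) {
        int x = i + 1;  // +1 skips the border (width order / 2 = 1 here)
        int y = j + 1;
        int idx = y * gx + x;
        float c = curr[idx];
        next[idx] = c + xfcl * (curr[idx + 1] - 2.f * c + curr[idx - 1])
                      + yfcl * (curr[idx + gx] - 2.f * c + curr[idx - gx]);
    }
}
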

Question 2

(35 points) Implement the function gpuStencilLoop that runs the solver using
global memory but where each thread computes several points on the grid. You
must also fill in the function gpuComputationLoop.
The difference (in terms of the norms) between your implementation and the
reference should be in the expected range.
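
A sketch of the same idea with the numYPerStep loop, under the same assumptions (and with the same caveats) as the sketch for Question 1:

// Sketch only: each thread updates up to numYPerStep vertically adjacent points.
template <int numYPerStep>
__global__ void gpuStencilLoopSketch(float* next, const float* curr, int gx,
                                     int nx, int ny, float xfcl, float yfcl) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int jStart = (blockIdx.y * blockDim.y + threadIdx.y) * numYPerStep;
    if (i >= nx) return;
    int x = i + 1;  // skip the border of width 1 (2nd order)
    for (int k = 0; k < numYPerStep; ++k) {
        int j = jStart + k;
        if (j >= ny) break;
        int idx = (j + 1) * gx + x;
        float c = curr[idx];
        next[idx] = c + xfcl * (curr[idx + 1] - 2.f * c + curr[idx - 1])
                      + yfcl * (curr[idx + gx] - 2.f * c + curr[idx - gx]);
    }
}
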

Question 3

(15 points) Plot the bandwidth (GB/s) as a function of the grid size (in units
of MegaPoints) for the following grid sizes: 256×256; 512×512; 1024×1024;
2048×2048; 4096×4096.
You must have 2 plots (or 3 plots if you choose to do the extra credit) as
follows:

  1. For order = 4, plot the bandwidth for the 2 (or 3) different algorithms.
  2. For the block implementation, plot the bandwidth for the 3 different orders.
  3. If you implemented the shared algorithm, plot the bandwidth for the 3 different orders.

Question 4

(20 points) Which kernel (global, block or shared) and which order gives the
best performance (your answer may depend on the grid size)? Explain the
performance results you got in Question 3.

Question 5

(20 points Extra Credit) Implement the function gpuShared that runs the solver
using shared memory. You should also fill in the function
gpuComputationShared. Note that you also have to answer the parts of Questions 3 and 4 that involve the shared implementation.
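
One possible shape for the shared-memory kernel, sketched for the 2nd-order case only (halo width 1), with placeholder tile sizes and the same padded-grid assumptions as the earlier sketches; the real gpuShared must handle all orders and match the starter code's interface.

// Sketch only (2nd order, halo = 1): each block computes a TX x TY tile of
// interior points and first stages a (TX+2) x (TY+2) patch in shared memory.
#define TX 32
#define TY 32

__global__ void gpuSharedSketch(float* next, const float* curr, int gx,
                                int nx, int ny, float xfcl, float yfcl) {
    // Assumed launched with blockDim.x == TX and blockDim.y <= TY, so each
    // thread loads and updates several elements, as the assignment describes.
    __shared__ float tile[TY + 2][TX + 2];

    int x0 = blockIdx.x * TX + 1;   // first interior x of this tile
    int y0 = blockIdx.y * TY + 1;   // first interior y of this tile

    // Cooperative load of the tile plus its halo: each thread strides over
    // the shared-memory patch. The padded grid has (nx+2) x (ny+2) points.
    for (int sy = threadIdx.y; sy < TY + 2; sy += blockDim.y) {
        for (int sx = threadIdx.x; sx < TX + 2; sx += blockDim.x) {
            int gxi = x0 - 1 + sx;
            int gyi = y0 - 1 + sy;
            if (gxi < gx && gyi < ny + 2) {
                tile[sy][sx] = curr[gyi * gx + gxi];
            }
        }
    }
    __syncthreads();

    // Each thread updates its column of tile points from shared memory.
    int sx = threadIdx.x;
    int xg = x0 + sx;
    for (int sy = threadIdx.y; sy < TY; sy += blockDim.y) {
        int yg = y0 + sy;
        if (xg <= nx && yg <= ny) {
            float c = tile[sy + 1][sx + 1];
            next[yg * gx + xg] =
                c + xfcl * (tile[sy + 1][sx + 2] - 2.f * c + tile[sy + 1][sx])
                  + yfcl * (tile[sy + 2][sx + 1] - 2.f * c + tile[sy][sx + 1]);
        }
    }
}
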
Total number of points: 100 (+20 Extra credit)

