C++代写:COMP4300SharedMemoryParallelAlgorithm


用CUDA和OpenMP代写几个并行算法。

Assignment Setup and Submission

This assignment must be submitted electronically. To obtain a copy of the
template source files, you must fork the gitlab project ps-ass2 into your own
gitlab area, add the user as a Reporter to your project, and push the final
version of your files into your fork of the ps-ass2 gitlab project before the
deadline.
The files that will be marked are a report ps-ass2Rep.pdf and the C files
parAdvect.c, parAdvect.cu.

Learning Objectives

  • Implement shared memory parallel algorithm using the OpenMP and CUDA programming models.
  • Gain an understanding of some of the performance and programability implications of SMP and GPU systems.

Setup

Your assignment project directory will contain a ps-ass2Rep.pdf, which you
must overwrite with your own report. It also contains two sub-directories,
openmp and cuda. The former contains a test program advectTest.c, a file
serAdvect.c containing serial advection functions, some header files, and a
template OpenMP advection solver parAdvect.c. The test programs can be built
using the command make.
The usage for the test program is:
OMP_NUM_THREADS=p ./testAdvect [-P P] [-x] M N [r]
The test program operates much like that for Assignment 1 except as follows.
The -P option invokes an optimization where the parallel region is over all
timesteps of the simulation, and P by Q block distribution is used to
parallelize the threads, where p=PQ. The -x option is used to invoke an
optional extra optimization.
The directory cuda is similar, except the test program is called
advectTest.cu, and the template CUDA parallel solver file is called
parAdvect.cu. The usage for the test program is:
./testAdvect [-h] [-s] [-g Gx[,Gy]] [-b Bx[,By]] [-o] [-w w] M N [r]
with default values of Gx=Gy=Bx=By=r=1. (Gx,Gy) specifies the grid dimensions
of the GPU kernels; (Bx,By) specifies the block dimensions.
The option -h runs the solver on the host; this may be useful for comparing
the ‘error’ of the GPU runs (and for comparing GPU and CPU speeds). The option
-s forces a serial implementation (run on a single GPU thread); all other
options are ignored. If neither of -h,-s,-o are given, Gx,Gy thread blocks of
size Bx,By are used in a 2D GPU parallelization of the solver. If -o is
specified, an optimized GPU implementation is used, which may use the
parameter w as well.
You should read the files and familiarize yourself with what they contain.

Part 1 (OpenMP): Tasks

Experimentation for this section should be done on the Raijin supercomputer.
Results should use a batch job run on a single supercomputer node.

Parallelization via 1D Decomposition and Simple Directives

Parallelize the prototype function omp1dAdvect() in parAdvect.c. Each
(advection update) loop nesting must have its own parallel region, and the
parallelization must be over one of the two loop indices (hint: this can be
done via simple OMP parallel for directives).
There are various ways this can be done: (i) parallelize the outer or inner
loop, (ii) interchange the loop order and (iii) schedule the iterations in a
block or cyclic fashion. Choose combinations which:

  • maximize performance.
  • maximize the number of OpenMP parallel region entry/exits.
  • maximize shared cache misses involving coherent reads.
  • maximize shared cache misses involving coherent writes.
    The boundary update loops can potentially be parallelized as well. Experiment
    with this for the purpose of improving the performance of case (1).
    Cut-and-paste the directive / loop-nesting combinations for each case in your
    report, and discuss why you chose them. Similarly discuss your strategy in
    parallelizing the boundary update loops, and whether and by how much it
    improved the performance of case (i). Choosing suitable M, N, r, record the
    performance at p = 8,16 and put the results in your report, discussing the
    differences.
    Now, leave the best performing parallelization combinations in your file.

Performance Modelling of Shared Memory Programs

Let ts denote the cost of a parallel region entry/exits, and tw,R and tw,W
denote the cost per (double) word of a cache miss for coherent read and
writes, respectively. From your measurements, derive values for these
parameters for both p = 8,16.
Construct a performance model for the best-performing variant (hint: it should
be very similar to the 1D model you used in Assignment 1, and you can use the
same value of tf). Compare your model’s prediction with the previous results
for the best-performing variant (these should match - why?) and for when MN is
twice as large, and for the both of these for p = 4,12.

Parallelization via 2D Decomposition and an Extended Parallel Region

In the function omp2dAdvect(), write an advection solver such there is only a
single parallel region (over all timesteps), and the parallelization is over a
P by Q block distribution. Hint: use the code of omp1dAdvect() minus the
directives as a starting point.
For suitable measurements with p=16 and varying P, compare performance.
Discuss whether there is any gain from the 2D distribution. Comparing this
with the 1D vs 2D case for the MPI version of the solver, and explain any
differences with the relative advantages. Comparing with your results in Q1
for the P=1 case, was there any advantage in having a single parallel region?

Optional.

Find and describe some new optimization that could potentially improve the
performance of the OpenMP advection solver. If possible, implement and
evaluate this on Raijin. Put your code in ompAdvectExtra(), which is activated
by the -x option.

Part 2 (CUDA): Tasks

Unless otherwise specified, experimental results for this section should be
made on the Jetson system, as described in Prac 5.

Baseline GPU Implementation - 1D

Using the code of cudaSerAdvect() and its kernels in serAdvect.cu as a
starting point, implement a solver whose field update kernels operate on Gx x
Gy thread blocks of size Bx x By to (without restrictions, except you may
assume Bx*By ≤ the maximum thread block size). You may choose what, if any,
parallelization strategy you apply for the boundary updates (justify this in
your report).
In parAdvect.cu, you will need to create new kernel functions, and your outer-
level solver calling these must be implemented in cuda2DAdvect(). Hint: to
help in debugging, replace the kernels from serAdvect.cu one by one with your
own, and do 1D parallelizations first.
Perform experiments to determine the effects of varying Gx,Gy,Bx,By (given
some reasonably sized problem to solve). Report and discuss the optimal
combination, including its speedups against the single GPU and host (ARM)
cores, and any results of interest (including a comparison of 1 x B vs B x 1
blocks). Also perform suitable experiments to determine the overhead of a
kernel invocation, and report your findings.

Optimized GPU Implementation

In cudaOptAdvect(), create an optimized solver and its associated kernels. It
should be significantly faster than the previous version. For ideas of what
optimizations you might consider, read the paper Optimized Three-Dimensional
Stencil Computation on Fermi and Kepler GPUs by Vizitiu et al. In your report,
describe your optimization(s) and their rationale. Perform suitable
experiments which should demonstrate the efficacy of your approach, and
discuss them in your report.

Comparison of the Programming Models

Based on your experiences in this course, comment on the programability
(relative ease of designing, implementing and debugging programs) in the three
paradigms you have used so far: OpenMP (shared memory), CUDA (streaming device
programming) and MPI (distributed memory). Compare this with the speedups
obtained.

Optional

Port your codes to the Raijin GPUs, and perform some experiments to compare
your results with those on Jetson (warning: the batch queues for the GPU nodes
are known for their long wait times!). Any code changes should not go in your
submitted version of parAdvect.cu; instead, describe these in your report.

Requirements

Your code must be written using C/OpenMP or Cuda, as appropriate. It should be
written with good programming style. It should be well commented, so as to
enable the reader (this includes the tutor/marker!) to understand how it
works.
In your report, include:

  • A disclaimer with your name and student number stating what (if any) contributions from others (not including course staff) are in your submission. Note that significant contributions may require a revision of your final mark, as this is intended to be an individual assignment.
  • Your answers to all the questions given above.
  • Details of any notable deficiencies, bugs, or issues with your work.
  • Any feedback on what you found helpful in doing this work, or what from the course could be improved in helping you to carry out this work.

文章作者: SafePoker
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 SafePoker !
  目录