CUDA Assignment: CS314 Efficient Parallel Graph Matching


Implement an efficient parallel graph matching algorithm using GPU programming.
![Graph Matching](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Maximal-matching.svg/450px-Maximal-matching.svg.png)

Background

Graph Matching Problem

Given a graph G = (V, E), where V is the set of vertices (also called nodes) and E is the set of edges, a matching M in G is a set of pairwise non-adjacent edges, meaning that no two edges in M share a common vertex. A vertex is matched if it is an endpoint of one of the edges in the matching, and unmatched if it does not belong to any edge in the matching. We show examples of possible matchings for a given graph.
A maximum matching can be defined as a matching in which the total weight of the edges in the matching is maximized. In the figure, (c) is a maximum matching, with a total edge weight of 7, while (a) and (b) have total weights of 3 and 2, respectively.

Parallel Graph Matching

Most well-known matching algorithms, such as the blossom algorithm, are embarrassingly sequential and hard to parallelize. In this project, we will adopt the handshaking-based algorithm, which is amenable to parallelization on GPUs.
In the handshaking-based algorithm, a vertex v extends a hand to one of its neighbors, and that neighbor must be on the maximum-weight edge incident to v. If two vertices shake hands, the edge between these two vertices is added to the matching. For example, in (b), node A extends a hand to D since edge (A,D) has the largest weight among all edges incident to node A, and nodes C and F shake hands because they extend hands to each other.
It is possible that multiple incident edges of a node have the maximum weight. In this project, we let the algorithm pick the neighbor vertex that has the smallest vertex index. For example, among the maximum-weight neighbors of vertex E, we pick vertex B, since it has the smallest index (in alphabetical order) among all of E's neighbors on edges of maximum weight 4.
The handshaking algorithm needs to run one or more passes. A one-pass handshaking checks all nodes once and only once. At the end of every pass, we remove all matched nodes and check whether any unmatched nodes remain. If the remaining unmatched nodes are still connected by edges, another pass of handshaking must be performed. We repeat this until no more edges can be added to the matching. In Fig. 2, we show two passes of handshaking.
The handshaking algorithm is highly data parallel, since each vertex is
processed independently and the find-maximum-weight-neighbor step involves
only reads to shared data. It is a greedy algorithm that attempts to maximize
the total weight in the matching.

Implementing Parallel Graph Matching

Data Structure

The graph is encoded as an edge list consisting of three arrays: src, dst, and
weight, such that src[n] is the source node for the n-th edge, dst[n] is the
destination node for the n-th edge, and weight[n] is the weight of the n-th
edge. The graph is undirected, so if src[n]=x and dst[n]=y then there exists
an edge m such that src[m]=y and dst[m]=x.
An example of this representation is shown in the figure. The example graph, with five vertices, has a total of six undirected edges. The edge list is arranged such that edges with the same source node are placed together, and edges with the same source node are sorted by destination index. The relative order of the edges is important; please do not modify it.
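As a separate concrete illustration, consider a hypothetical 3-node triangle graph (not the five-vertex example graph above): edge (0,1) of weight 5, edge (0,2) of weight 2, and edge (1,2) of weight 4. Stored as described, each undirected edge appears twice:

```c
/* Hypothetical triangle graph: edges (0,1) w=5, (0,2) w=2, (1,2) w=4.
 * Each undirected edge is stored in both directions, grouped by source
 * node and sorted by destination node within each group. */
int src[]    = {0, 0, 1, 1, 2, 2};
int dst[]    = {1, 2, 0, 2, 0, 1};
int weight[] = {5, 2, 5, 4, 2, 4};
```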
We have provided graph I/O functions for you. The code given to you will read and parse the graphs stored in matrix format. After the graph is parsed, the three arrays src[ ], dst[ ], and weight[ ] contain the graph information. Pointers to these three arrays are stored in the GraphData struct, which looks like this:

```c
struct GraphData {
    int numNodes;
    int numEdges;
    int * src;
    int * dst;
    int * weight;
};
```
The program contains four GPU kernel functions, three of which you are
required to implement (while the fourth is already implemented). Their
arguments (input and output) are all described in the file src/gpuHeaders.cuh.
Your implementations should go inside the appropriate files in the
src/gpu_required/ directory.
We show in Fig. 1 an example of the data produced during the first iteration of the handshaking algorithm, based on the same graph as in our edge list example. The notation used in Fig. 1, including the keep_edges, new_indices, strongNeighbor, and matches arrays, will be used consistently in the following subsections; please refer to Fig. 1 for a specific example of the results of each of the functions described below.
Note that by default, the program uses provided CPU code to serially identify the strongest neighbors. For information about the extra credit part of this project, see Section 6.

Update Matches

With the strongNeighbor results calculated, we can now perform handshaking to
update the matching results. This kernel function looks for node-pairs which
are both each other’s maximum-weight neighbor, and matches them.
```c
__global__ void check_handshaking_gpu(int * strongNeighbor, int * matches, int numNodes);
```
The strongNeighbor array is the input, such that strongNeighbor[x]=y if the x-th node's strongest neighbor is the y-th node. The matches array is the output, such that matches[x]=y if the x-th node has been matched with the y-th node. Note that some nodes may already be matched from prior iterations, and they should not be changed. An unmatched node is indicated by the value UNMATCHED, defined in utils.hpp; we use -1 to represent UNMATCHED.
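Below is a minimal sketch of one way this kernel could look; it is not the reference solution. It assumes UNMATCHED is -1 (as stated above) and uses the strided thread mapping described in the Thread Assignment section.

```c
/* Sketch: match mutually-strongest neighbor pairs. Assumes UNMATCHED == -1. */
__global__ void check_handshaking_gpu(int * strongNeighbor, int * matches,
                                      int numNodes) {
    int numThreads = blockDim.x * gridDim.x;
    for (int v = blockIdx.x * blockDim.x + threadIdx.x;
         v < numNodes; v += numThreads) {
        if (matches[v] != -1) continue;         /* keep matches from prior passes */
        int w = strongNeighbor[v];
        if (w >= 0 && strongNeighbor[w] == v) { /* mutual handshake */
            matches[v] = w;  /* the thread handling w writes matches[w] = v */
        }
    }
}
```

Each side of a matched pair is written by its own thread, so no synchronization is needed.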

Edge Filtering

To filter out the edges we no longer need, we must first distinguish them from the edges to keep. A vertex can be matched with only one other vertex; therefore, if an edge has a source or destination that has already been matched, we can discard it. Otherwise, we must keep it.
```c
__global__ void markFilterEdges_gpu(int * src, int * dst, int * matches, int * keepEdges, int numEdges);
```
The input arrays for this function are src (the source array from the edge list), dst (the destination array from the edge list), and matches (the current node matches). The output is keepEdges. The src, dst, and keepEdges arrays each have numEdges elements.
After this function is complete, keepEdges[x] should be 1 if we want to keep the x-th edge, and 0 otherwise.
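A minimal sketch under the same assumptions (UNMATCHED is -1; names match the signature above), not the reference solution:

```c
/* Sketch: keep edge x only if neither of its endpoints is matched yet. */
__global__ void markFilterEdges_gpu(int * src, int * dst, int * matches,
                                    int * keepEdges, int numEdges) {
    int numThreads = blockDim.x * gridDim.x;
    for (int x = blockIdx.x * blockDim.x + threadIdx.x;
         x < numEdges; x += numThreads) {
        keepEdges[x] = (matches[src[x]] == -1 && matches[dst[x]] == -1) ? 1 : 0;
    }
}
```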

Get New Edge Indices

Once we know which edges to keep versus which edges to filter, we can
determine each edge’s new index in the edge list. This is performed with an
exclusive prefix sum.
```c
__global__ void exclusive_prefix_sum_gpu(int * oldSum, int * newSum, int distance, int numElements);
```
The input array is oldSum, and the output array is newSum. Each of these arrays has numElements elements. The distance variable tells us which stride to use at each iteration; this function will be called multiple times with varying distance values. A sketch of the parallel algorithm is in the Lecture 22 slides, page 40.
Since this is an exclusive prefix sum, we do not want the sum to include the current element. In other words, once all iterations are complete, element x of the result should equal the sum of input elements 0 through x-1.
The first time exclusive_prefix_sum_gpu is launched, it will be given a distance value of 0. In this case, instead of performing addition, it should merely shift everything one element to the right: when distance is 0, each element of the output should be set to the previous element of the input (and the first element should be set to 0, the identity for addition). That is how we implement the exclusivity, after which we can operate like an inclusive prefix scan.
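A minimal sketch of one step, assuming the host launches the kernel repeatedly with distance = 0, 1, 2, 4, ... (swapping the oldSum/newSum buffers between launches) until the distance reaches numElements; again a sketch, not the reference solution:

```c
/* Sketch of one step of a Hillis-Steele-style scan, made exclusive by
 * the initial distance == 0 shift. */
__global__ void exclusive_prefix_sum_gpu(int * oldSum, int * newSum,
                                         int distance, int numElements) {
    int numThreads = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numElements; i += numThreads) {
        if (distance == 0) {
            newSum[i] = (i == 0) ? 0 : oldSum[i - 1];     /* shift right; element 0 = 0 */
        } else if (i >= distance) {
            newSum[i] = oldSum[i] + oldSum[i - distance]; /* add strided partner */
        } else {
            newSum[i] = oldSum[i];                        /* no partner; copy through */
        }
    }
}
```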

Filter Edges

This function is already implemented for you. When the program reaches this
point, we know where the edges will go in the filtered edge list, but we need
to repack them for use in the next iteration of matching.
```c
__global__ void packGraph_gpu(int * newSrc, int * oldSrc, int * newDst, int * oldDst, int * newWeight, int * oldWeight, int * edgeMap, int numEdges);
```
Here oldSrc, oldDst, and oldWeight are the edge list before filtering, each
with numEdges elements. The edgeMap array, with (numEdges+1) elements,
contains the results of the exclusive prefix sum from the previous step. The
newSrc, newDst, and newWeight arrays are the output of this kernel function.
After this kernel function is complete, for every edge x that we want to keep, it will be the case that newSrc[edgeMap[x]] = oldSrc[x]. The dst and weight arrays are handled similarly.
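Since this function is provided, the sketch below (not the shipped code) only illustrates the scatter it performs. It assumes edge x was kept exactly when the prefix sum increases at x, i.e. edgeMap[x+1] > edgeMap[x], which is a valid test because edgeMap has numEdges+1 elements:

```c
/* Illustrative sketch of the repacking step: every kept edge x is
 * scattered to its new slot edgeMap[x]. */
__global__ void packGraph_sketch(int * newSrc, int * oldSrc, int * newDst,
                                 int * oldDst, int * newWeight, int * oldWeight,
                                 int * edgeMap, int numEdges) {
    int numThreads = blockDim.x * gridDim.x;
    for (int x = blockIdx.x * blockDim.x + threadIdx.x;
         x < numEdges; x += numThreads) {
        if (edgeMap[x + 1] > edgeMap[x]) {   /* edge x survived filtering */
            int n = edgeMap[x];
            newSrc[n]    = oldSrc[x];
            newDst[n]    = oldDst[x];
            newWeight[n] = oldWeight[x];
        }
    }
}
```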

GPU Memory

You are expected to handle allocation, initialization, and de-allocation of GPU memory yourself. You must use the function cudaMalloc to allocate GPU memory. For initialization, you can use cudaMemcpy to copy memory between the GPU and the CPU. Finally, you must use cudaFree for de-allocation.
You can find examples of how to use these functions inside the Lecture 19 and
Lecture 20 slides from class.
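A minimal allocate/copy/free pattern is sketched below; d_src, bytes, and the GraphData instance named graph are illustrative names, and real code should check the cudaError_t returned by each call:

```c
/* Sketch: move the src array to the GPU and back. */
int * d_src = NULL;
size_t bytes = graph.numEdges * sizeof(int);
cudaMalloc((void**)&d_src, bytes);                           /* allocate GPU memory */
cudaMemcpy(d_src, graph.src, bytes, cudaMemcpyHostToDevice); /* CPU -> GPU */
/* ... launch kernels that read or write d_src ... */
cudaMemcpy(graph.src, d_src, bytes, cudaMemcpyDeviceToHost); /* GPU -> CPU */
cudaFree(d_src);                                             /* free GPU memory */
```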

Thread Assignment

It is important to map tasks to threads in a reasonable manner. Furthermore,
although GPU kernels can be launched with a large number of threads, we may
not want to do so because it incurs significant kernel launch overhead. In
this project, we do not guarantee the number of threads allocated for each
kernel is the same as the number of tasks to perform. Your code within a
kernel needs to account for that.
For example, suppose we are working on an array with 1024 elements but only have 256 threads. Each thread should then work on four elements. To ensure memory coalescing, the first thread should handle array[0], array[256], array[512], and array[768], while the second thread handles array[1], array[257], array[513], and array[769], and so on. This mapping distributes the work evenly across threads, while giving the GPU device the opportunity to combine adjacent memory requests performed by adjacent threads.
We use one-dimensional grids and one-dimensional thread-blocks, making it
straightforward for you to calculate both the thread’s ID, and the total
number of threads launched by the kernel function. Each thread can query the
total number of threads launched using the expression blockDim.x * gridDim.x .
Also note that the number of tasks to process is not necessarily a multiple of the number of threads; your code will need condition checks to account for that. For the three kernels you are required to implement, the number of tasks is the same as the size of the output array, which is given as an argument.
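Putting this together, the canonical pattern inside each kernel is a strided loop with a bounds check; a minimal sketch with placeholder work:

```c
/* Sketch: strided mapping of numTasks tasks onto however many threads were
 * launched. Adjacent threads touch adjacent elements (coalesced), and the
 * loop condition handles numTasks not being a multiple of numThreads. */
__global__ void example_kernel(int * array, int numTasks) {
    int tid        = blockIdx.x * blockDim.x + threadIdx.x; /* global thread ID */
    int numThreads = blockDim.x * gridDim.x;                /* total threads */
    for (int i = tid; i < numTasks; i += numThreads) {
        array[i] *= 2;  /* placeholder: process element i */
    }
}
```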

Grading

Your programs will be graded primarily on functionality. Grading will be done
on ilab machines. You will receive 0 credit if we cannot compile and run your
code on ilab machines.
Although we have provided you with some testcases, we will use secret test
cases when we perform grading. Additionally, we will run your code with
varying numbers of threads.
During grading, your submission will be run with a (generous) time limit, to
ensure the program terminates even if it gets stuck. We may run your kernel
functions separately (alongside our own implementations of the other
functions), or altogether.

How to Get Started

The code package for you to start with is provided on Sakai. Create your own directory on the ilab cluster (in your home directory or any other directory of yours) and copy the entire provided project folder into it. Make sure that the read, write, and execute permissions for group and others are disabled (chmod go-rwx [directory_name]).
DO NOT change any files except onewaywrapper.cu, the existing files in the
gpu_required directory, and the file in the gpu_extra_credit directory. No
other code files will be considered when we grade your submission.
The various parts of the code you're expected to modify are preceded and followed by comments stating "YOUR CODE GOES BELOW" and "YOUR CODE GOES ABOVE", respectively. This includes the contents of kernel functions, as well as the allocation and de-allocation of GPU memory.

Compilation and Execution

Compilation: We have provided a Makefile in the sample code package.

  • make should compile all code into an executable called match.
  • make submit should generate a tar file containing your current code, which you must manually submit to Sakai.
  • make clean should remove all files generated by Make, including object files, executables, and the tar file.
  • make debug compiles the executable with flags that disable optimizations and add debugging information. Note that you may need to run make clean before switching between the debug and default versions.

Execution: The match program performs one-way matching on a given graph. It accepts the following arguments:

  • -input [file] Use the specified graph file as input.
  • -output [file] Write the matching results to the specified file.
  • -threads [number] Launch approximately the specified number of threads when calling a GPU kernel.
  • -device [number] Run all GPU kernels on the specified GPU device.

The -input flag is mandatory; if the other flags are not used, the program will use default settings, such as writing the results to a file named out.txt.
Program Output: The program will write the matching results to the specified output file, and print the approximate time it spent. We have handled the functionality of generating the output file.

Test Cases

We have provided input and expected output for several randomly generated
testcases. Each of the input files has a filename “inputX.mtx” for some value
X. The corresponding solution has filename “outputX.txt”.
For example, if you fully and correctly implement the required parts of this
project, then the command ./match -input testcases/input5.mtx -output out.txt
-threads 1024 would produce a file named out.txt that’s identical to
testcases/output5.txt. (You can also use a different number of threads than
1024, and should get the same output.)
Note that you can use the diff command to compare two text files. For example,
diff out.txt testcases/output9.txt will either display a list of differences
between those two files, or display nothing if they are identical.

Debugging

You can use cuda-gdb to help detect and debug errors in your GPU kernel
functions. Make sure to use the command set cuda memcheck on when you first
launch cuda-gdb.
Before debugging, make sure to compile your code with the make debug command.
This might make your code run slower, but will allow cuda-gdb to retrieve more
high-level information about any memory errors it detects.

GPUs are a finite resource!

You can run the program nvidia-smi to view the GPUs on your current machine,
including some information about which ones are currently in use. If the
project executable tries to run on a GPU device that is already used by many
users, then it may be unable to run correctly, getting stuck or throwing
unusual errors (e.g. “out of memory”).
We've added an optional argument to the program that lets you select the GPU device to use. For example, if nvidia-smi tells you that there are four GPU devices but devices 0, 1, and 2 are all in use, then you might run the match program with the flag "-device 3" so that it will use device #3. Be aware, however, that some of the ilab computers have only one GPU. If this isn't sufficient to get the program working (i.e. there are no available GPU devices), then you may need to switch to a different ilab machine.

What to Submit

You should invoke make submit to generate the file submit_me.tar containing
your GPU code. This only generates the file; it does not automatically submit
it. You need to submit submit_me.tar to Sakai.
If you change your code after generating this tar file, don’t forget to
regenerate (and resubmit) the tar afterward.
If you complete the extra credit, then you should additionally submit your
3-page report in PDF format.
Please DO NOT submit any other code, executable, mtx files, matching results,
etc.

Extra-credit

The extra credit part of this project concerns the parallel implementation of
finding the strongest neighbors (i.e. producing the strongNeighbor array in
Fig. 1).
If you choose to complete the extra credit, you must design the parallel algorithm for finding the maximal neighbor of each node. The factors to consider include, but are not limited to, the following (a sketch of one possible approach appears after this list):

  • Load balancing. Each node has a different number of neighbors, so partitioning the work according to node may not be a good idea.
  • Thread divergence. Each thread in each warp (i.e. each group of 32 consecutive threads) should perform the same operations, or else parallelism is reduced.
  • Memory coalescing. A memory operation is more efficient when consecutive threads are accessing consecutive memory locations.
You can use as many GPU kernel functions as you feel are necessary; they should all be placed inside the extra.cu file. You will call these kernels at the appropriate place inside onewaywrapper.cu. If necessary, you are permitted to define additional GPU arrays inside onewaywrapper.cu. Your implementation will be invoked if and only if you use the -extra flag on the command line when you run the match program.
For full points on the extra credit, you need to write a three-page report including your algorithm, the implementation results, and an evaluation.
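As one example of a design that addresses the load-balancing factor (though not fully the other two), the hedged sketch below partitions work per edge rather than per node and uses the 64-bit atomicMax overload (available on devices of compute capability 3.5 and higher) to select each node's strongest neighbor. Here strongNeighborPacked is a hypothetical auxiliary array of numNodes unsigned long longs initialized to 0, and non-negative edge weights are assumed:

```c
/* Sketch: per-edge argmax via a packed 64-bit atomicMax. The weight goes
 * in the high 32 bits; the bit-inverted destination index goes in the low
 * 32 bits, so ties in weight are broken toward the SMALLER index. A second
 * pass (not shown) unpacks strongNeighbor[v] as
 * 0xFFFFFFFFu - (unsigned)(packed & 0xFFFFFFFFu); nodes with no remaining
 * edges keep the initial value 0 and need special handling there. */
__global__ void mark_strongest_gpu(int * src, int * dst, int * weight,
                                   unsigned long long * strongNeighborPacked,
                                   int numEdges) {
    int numThreads = blockDim.x * gridDim.x;
    for (int e = blockIdx.x * blockDim.x + threadIdx.x;
         e < numEdges; e += numThreads) {
        unsigned long long packed =
            ((unsigned long long)(unsigned)weight[e] << 32) |
            (unsigned long long)(0xFFFFFFFFu - (unsigned)dst[e]);
        atomicMax(&strongNeighborPacked[src[e]], packed);
    }
}
```

Reads of src, dst, and weight are coalesced, but the atomics may contend on high-degree nodes; a segmented-reduction design could avoid that contention at the cost of more complexity.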

Questions

Questions regarding this project can be posted on the Sakai forum. Good luck!

