C++ Programming Assignment: CS516 Parallelizing Single-Source Shortest Path on GPU


Using the CUDA library, implement the Single-Source Shortest Path algorithm; the code must be able to run on a GPU.

Requirement

In this project, you will parallelize a single source shortest path (SSSP)
algorithm on GPUs. An SSSP algorithm finds the shortest paths from a source
vertex to every other vertex in a graph. It is an important and also an
interesting problem that admits many potential algorithmic choices. SSSP finds
uses in many applications including route planning, AI, 3D modeling, and
social networks [2].
In this project, both undergraduate and graduate students need to implement
two versions of parallel SSSP in CUDA. Every graduate student needs to
implement a third version; it is optional for undergraduate students.
You need to submit both code and report. The code comprises 60% and the report
comprises 40% of the project 2 grade. We will provide real-world graphs. You
need to evaluate your implementations on each of the graphs and report the
results for different stages of computation.
You will receive 0 credit if we cannot compile/run your code on ilab machines
or if your code fails correctness tests. All grading will be done on ilab
machines.
The interface of your implemented code needs to meet the requirements defined
in Section 3.2. We have provided sample code package for you to start with.
The following three sections describe the three different versions of SSSP.

Bellman-Ford Algorithm

Before describing the Bellman-Ford algorithm, we will describe the input file
format and some necessary data structures for the shortest path algorithm.
The input file is a plain text file that contains only edges. Each line in the
input file corresponds to an edge in the graph and has at least two vertex
indices separated by space or tab. The first number specifies the start point
of the edge of interest and the second one specifies the end point (the node
that is pointed to in the edge). The third number (optional) contains the
weight of the edge; if it is not specified, you need to set the edge weight to
1 by default.
An example graph and edge list is provided in Figure 1 (a) and (b)
respectively. The source node by default is always the node with the index 0
in the graph. Note that the edge list in the input file is not necessarily
always sorted by starting node index or destination node index. You might need
to sort the list of edges yourself after reading in the edge list. A potential
data structure you can use (after sorting the edge list by starting node
index) is presented in Figure 1 (b), in which the outgoing edges from a node v
are represented as the range of edges from index A[v] to A[v+1]-1 in the edge
list E. You can also sort the edges by the destination node index and
represent the incoming edges as contiguous segments, each of which corresponds
to edges pointing to the same node. This is similar to the compressed sparse
row (CSR) format for sparse matrices.
The sequential Bellman-Ford algorithm checks all edges at every iteration and
updates a node if the current estimate of its distance from the source node
can be reduced. The number of iterations is at most the same as the number of
vertices if no negative cycle exists.
The sequential Bellman-Ford algorithm is easy to implement, but not as
efficient as Dijkstra’s algorithm. However, it is amenable to
parallelization. The first two required versions of parallel implementations
are both based on the Bellman-Ford algorithm. An example for the Bellman-Ford
algorithm is presented in Figure 1 (c). In this example, it takes four
iterations before the distance value of every node in the graph stops
changing. The node values at the beginning of every iteration are provided in
Figure 1.
In the parallel Bellman-Ford algorithm, we exploit the parallelism of edge
processing at every iteration. For example, in Figure 1 (c), each of the seven
edges is checked at every iteration; we can distribute these seven edges
evenly across different processors such that each processor is responsible for
the same number of edges.
The previous node list array distance_prev and the current node list array
distance_cur in Listing 2 represent, respectively, the result from the last
iteration and the updated result at the end of the current iteration. Every
iteration here corresponds to one kernel invocation (Line 19 in Listing 2).
This is an out-of-core implementation, which means that once nodes are
updated, their distances are not immediately visible to other co-running
threads in the same kernel invocation. We swap the pointers to these two
arrays in Line 21 of Listing 2 before calling the kernel in the next
iteration.
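Under these assumptions, the out-of-core scheme might look like the sketch below. This is not Listing 2 itself; all names are illustrative, and it assumes unreached nodes hold a large sentinel well below INT_MAX so the addition cannot overflow:

```cuda
// Illustrative out-of-core relaxation kernel. It reads only distance_prev and
// writes only distance_cur, so an update made during this launch is not
// visible to co-running threads until the next launch.
__global__ void relax_edges(const int* src, const int* dst, const int* w,
                            int num_edges, const int* distance_prev,
                            int* distance_cur, int* changed) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_edges;
         i += gridDim.x * blockDim.x) {              // grid-stride loop
        int nd = distance_prev[src[i]] + w[i];
        if (nd < distance_prev[dst[i]]) {
            atomicMin(&distance_cur[dst[i]], nd);    // many edges may share dst[i]
            *changed = 1;
        }
    }
}

// Host loop sketch: one kernel launch per iteration, then swap the buffers.
// for (int it = 0; it < num_nodes && host_changed; ++it) {
//     cudaMemcpy(d_cur, d_prev, bytes, cudaMemcpyDeviceToDevice); // keep old values
//     cudaMemset(d_changed, 0, sizeof(int));
//     relax_edges<<<num_blocks, block_size>>>(d_src, d_dst, d_w, num_edges,
//                                             d_prev, d_cur, d_changed);
//     cudaMemcpy(&host_changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
//     std::swap(d_prev, d_cur);   // the pointer swap from Line 21 of Listing 2
// }
```

The device-to-device copy before each launch keeps nodes that are not updated at their previous distances.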

What to Report?

  • Execution configuration: you need to report results for the five following (block size, block number) configurations: (256, 8), (384, 5), (512, 4), (768, 2), (1024, 2). The edge list will be evenly divided among different thread warps.
  • Edge list organization: you need to report the running time for three cases: when the edge list is in its original order, when it is sorted by source node, and when it is sorted by destination node.
  • In-core configuration: you need to implement the in-core method and report the running time for all graphs. Note that in Listing 2, we used two arrays for the node list: one represents the state before the iteration and the other represents the state after the iteration. You will need to use one single array “distance” instead, removing distance_prev and distance_cur.
    You need to make other changes correspondingly; for example, the parameters to
    the kernel function need to be updated, the code that swaps the two pointers
    needs to be removed, and the code that checks whether any node was updated
    needs to be updated.
    You also need to report the running time for every graph when the edge list is
    sorted by starting node and when it is sorted by destination node.
  • Using shared memory: you will need to implement a segment-scan type method and report the running time. You only need to run the segscan method on the edge list sorted by destination index, in which the incoming edges for the same node can be processed by consecutive threads (or iterations). You only need to use the out-of-core processing mechanism. Since finding the minimum of a segment of values is similar to computing the sum of a segment of values, you are required to adapt the segment scan code from project 1 to find the minimum distance for every segment before you atomically update it using atomicMin.
    The CUDA atomic intrinsic functions are documented here:
    http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
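One possible shape for the shared-memory segment-min kernel, assuming the edge list is sorted by destination and the block size is at most 256, is sketched below. This is illustrative only; the actual segmented-scan code should be adapted from project 1, and it again assumes distances use a sentinel well below INT_MAX:

```cuda
#include <limits.h>

// Each block loads a tile of candidate distances into shared memory, runs a
// segmented min-scan (segments = runs of equal destination), and issues one
// atomicMin per segment instead of one per edge.
__global__ void segmin_update(const int* src, const int* dst, const int* w,
                              int num_edges, const int* distance_prev,
                              int* distance_cur) {
    __shared__ int val[256];  // candidate distance per edge in the tile
    __shared__ int key[256];  // destination node per edge (the segment id)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    val[t] = (i < num_edges) ? distance_prev[src[i]] + w[i] : INT_MAX;
    key[t] = (i < num_edges) ? dst[i] : -1;
    __syncthreads();
    // Inclusive segmented min-scan: combine only within equal keys. Because
    // the edges are sorted by destination, equal keys are contiguous.
    for (int off = 1; off < blockDim.x; off <<= 1) {
        int v = (t >= off && key[t] == key[t - off]) ? val[t - off] : INT_MAX;
        __syncthreads();
        if (v < val[t]) val[t] = v;
        __syncthreads();
    }
    // The last thread of each segment now holds the segment minimum.
    if (i < num_edges && (t == blockDim.x - 1 || key[t] != key[t + 1]))
        atomicMin(&distance_cur[key[t]], val[t]);
}
```

Segments that straddle a block boundary are still handled correctly because both partial segments issue their own atomicMin.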

Work Efficient Algorithm

We can improve Implementation 1 by reducing the edges that need to be
processed at each iteration.
For any directed edge, if the distance of the starting point of the edge (the
source node) did not change in the last iteration, it cannot affect the node
on the other end of the edge (the destination node). Thus, we can skip this
edge when performing the comparison-and-update computation
“if (distance[u] + w < distance[v]) …”.
In this implementation, you will need to filter out the edges whose starting
point did not change in the last iteration and only keep the edges that might
lead to a node distance update. We refer to the edges that need to be checked
as to-process edges. To achieve this, you can use a parallel prefix sum
operation.
Recall that the (exclusive) prefix sum is an operation that scans a sequence
of numbers x0, x1, x2, …, xn and produces a second sequence of numbers y0, y1,
…, yn such that y0 = 0, y1 = x0, y2 = x0 + x1, and so on. Similar to the
summation operation, the prefix sum operation can be performed within
logarithmic time steps [1]. Assuming xi stores the number of to-process edges
identified by warp i and T stores all the to-process edges, yi gives the total
number of to-process edges gathered by warps warp0, warp1, …, warp(i-1), and
thus gives the index of the first to-process edge found by warp i in the list
T. That is, the first to-process edge detected by warp i will be stored at
T[yi]. An example is given in Figure 2.
You may implement the prefix sum operation for gathering to-process edges in
three steps. In this project, we assume that the total number of threads will
not be greater than 2048. We also assume that one thread warp will gather to-
process edges from a contiguous range of edges in the edge list L, such that
warp i will process the range from index (L.length/warp_num) * i to index
(L.length/warp_num) * (i+1) - 1 (inclusive).

  1. Every warp counts the number of edges that need to be processed in its assigned edge range. x[warpid] specifies the number of to-process edges that are identified by warp warpid. Specifically, you can use warp voting functions for fast identification of the to-process edges by threads in the same warp. The “unsigned int __ballot(int predicate)” function in CUDA returns a value with the N-th bit set for each lane N in the warp whose predicate is nonzero. If you combine the __ballot function with the __popc function, you can count the number of threads in a warp that have the predicate set to nonzero. In our example, you can use “mask = __ballot(current_edge.src == changed); count = __popc(mask);” to get count, the number of lanes in the warp that have seen the predicate (current_edge.src == changed) evaluate to true. More information about the ballot function can be found here: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/
  2. Since we assume the total number of threads is 2048, the number of warps is 64. In this step, we launch a kernel that has the same number of threads as the number of warps in the last step (and the block number is 1). At this step, we perform a block level parallel prefix sum to get the offset of every warp’s first to-process edge in the final to-process edge list T .
  3. We let every warp copy its identified to-process edge indices to the array T. If a warp’s offset is y[i] and it has detected k to-process edges (not necessarily contiguous in L), those edges occupy T[y[i]] through T[y[i]+k-1]. Specifically, you can also use the ballot and the integer intrinsic functions to help you determine the index of each to-process edge in the final list T.
    More details about the integer intrinsic functions can be found here:
    http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__INT.html
    Note that we will focus on improving the performance of the computation stage,
    not the performance of the filtering stage. You can implement the filtering
    stage in different ways as long as the relative ordering of the to-process
    edges is the same as in the complete list L.
    After filtering out the edges that do not need to be considered, you will
    evenly distribute the to-process edges among different thread warps and
    perform the same iterative process as described in Implementation 1.

What to Report?

  • Use the same execution configuration as specified in Implementation 1.
  • You need to report the running time for the computation stage and the filtering stage separately.
  • You need to report the results for the edge list sorted by starting point and sorted by destination, for both the out-of-core and in-core processing kernels.
  • You need to perform data analysis, summarize what you have observed and discuss (reason about) the differences in various configurations.

A Further Optimization (Optional to Undergraduate Students)

In the third implementation, you will design and implement your own
optimization strategy. The optimization should be a locality optimization, a
priority queue based optimization, or a combination of both. You will need to
specify clearly which optimization you have applied. Your implementation does
not necessarily need to always run faster than the first two implementations.
However, you will need to describe why you think your proposed optimization
might help, and provide some intuition, i.e., use a small example.
Similar to the previous two implementations, you only have to optimize the
time for the computation stage, not the filtering stage or other overhead for
the optimization approach. You will still have to report the filtering time or
the overhead.
Here we provide some heuristics on locality optimization and priority queue
based optimization that you can use.

Locality Optimization

You can use a graph partitioning method to divide one kernel invocation into
multiple kernel invocations to optimize last-level cache performance. The GPUs
we have on the ilab cluster (in Hill 120) each have a 256KB last-level cache.
The entire data set cannot fit into the cache at once. Therefore, we can
divide one kernel (if there is sufficient workload) into multiple kernels such
that each kernel’s (frequently-used) data set fits into the cache.
To partition the workload, we partition the edges of the graph. We will
provide a balanced graph edge partition library for you, and details will be
released later on piazza. Note that the provided library only performs
balanced partitioning; if you want to partition the tasks into variable-size
components, you might need to perform hierarchical partitioning, for instance,
using bisection each time and only splitting the individual sub-partition that
you want to split.
Note that this optimization may only apply to Implementation 1, since only in
Implementation 1 are the edges to be processed in every original kernel
invocation the same.

Priority Queue

In our first two implementations in this project, we treat every edge as
having the same priority. Dijkstra’s algorithm and the near-far tile method
[2] let edges that potentially point to smaller-distance nodes have higher
priority. The basic idea is to maintain a priority queue. At any one time, we
take the first n tasks from the priority queue and process them, which further
generates new tasks that can be inserted into the priority queue (which might
require some reordering operations in the queue or deletion of tasks in the
queue).
If you choose the priority queue based approach, you will need to try at least
one heuristic for the priority function. The near-far tile method uses the
distance as the heuristic; that is, the edges in the work list that come from
nodes whose distance is within a given bound have higher priority.
Another potential heuristic you can try is to give the highest priority to
nodes that have the fewest un-updated (immediate) predecessor nodes; here an
immediate predecessor node is one with an edge pointing to the node of
interest. Specifically, if all of a node’s predecessor nodes have been updated
in previous iteration(s), this node has the highest priority and its incoming
edges will be placed at the beginning of the priority queue. Next we place the
edges that point to nodes with exactly one un-updated immediate predecessor,
and after that the edges that point to nodes with two un-updated immediate
predecessors. Note that when ordering the edges in the list based on the
priority order, the list should be the actual work-efficient work list; that
is, we do not consider any edge whose starting point has not changed.
In this exploration, you can maintain the priority queue (insertion and
deletion) on the GPU or the CPU. As mentioned before, our focus in this
project is not on reducing the cost of a parallel priority queue
implementation. Rather, we would like to explore heuristics that will lead to
a good parallel SSSP implementation. Once you discover a good heuristic, you
can try to improve the parallelization of the priority queue on GPUs, which is
not required in this project (but can be a good topic for your project 3).


Author: SafePoker
License: Unless otherwise noted, all articles on this blog are licensed under CC BY 4.0. Please credit SafePoker when reposting.