Use OpenMP to write a program that performs matrix computation in parallel.
OpenMP Parallel Programming Project
In this project, you will implement a parallelized version of sparse matrix-vector
multiplication (SpMV). You are given a sequential C implementation of
the SpMV code, and your job is to parallelize it using OpenMP. The following
four sections explain the sequential C implementation and clarify the kinds
of parallelism you are restricted to in this project.
Background
Matrix-vector Multiplication
Matrix-vector multiplication is an important and fundamental operation in many
computational problems. Given an M x N matrix A and an input vector b of length
N, the result vector res of length M is defined by res[i] = sum over j of
A[i][j] * b[j]. The complexity of calculating res is O(MN). However, if a
matrix has a large number of zeros, we can reduce the number of multiplication
operations, because any number multiplied by zero is zero. Matrices of this
type are called sparse matrices. For example, for a matrix with only five
non-zero entries, only five multiplications are needed to calculate res,
regardless of whether b is sparse.
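For reference, a minimal dense implementation of res = A * b might look like
the sketch below (the names M, N, A, vec, and res are illustrative and mirror
the sequential code later in this document):

int i, j;
for (i = 0; i < M; i++) {
    res[i] = 0;
    /* Every entry of row i is visited, including zeros,
       so the total work is O(M * N). */
    for (j = 0; j < N; j++)
        res[i] += A[i][j] * vec[j];
}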
Notice that the rows of res are computed independently of one another. In this
project you will exploit this parallelism in sparse matrix-vector
multiplication and parallelize the computation using OpenMP.
Handling Sparse Matrix Input
In this project you work with sparse matrices stored in Coordinate Format (COO).
Specifically, we use sparse matrices from Matrix Market with the .mtx file
extension. COO stores a list of (row, column, value) tuples.
In the source code, you are provided with matrix I/O functions and the code to
load the sparse matrix in COO format. The sparse matrix A is loaded and stored
into a one-dimensional array val[ ], which contains only the non-zero elements
of the matrix A. The row and column indices of every entry in val[ ] are
represented using two arrays, rIndex[ ] and cIndex[ ]: the non-zero value
val[i] is located at row rIndex[i] and column cIndex[i] in the matrix A.
Ideally, the tuples (row, column, value) in the COO format are sorted by row
index and then by column index. However, in mtx files this is not always the
case. In the source code, we pre-process the loaded matrix entries and sort
the val[ ] values so that, after pre-processing, the elements in the array
val[ ] are ordered by row index, and then by column index, both in ascending
order.
Two one-dimensional arrays, rsIndex[ ] and reIndex[ ], give the beginning and
ending positions of each row of matrix A in the array val[ ]: the non-zero
values of the i-th row of matrix A are stored in the val[ ] array in locations
val[rsIndex[i]] through val[reIndex[i]] (inclusive).
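As a small hypothetical example (assuming 0-based indices and an inclusive
reIndex, consistent with the sequential loop below), consider a 3 x 3 matrix
whose only non-zero entries are A[0][0] = 5, A[1][1] = 8, A[1][2] = 3, and
A[2][2] = 6. After loading and sorting, the arrays would be:

val[ ]     = { 5, 8, 3, 6 }
rIndex[ ]  = { 0, 1, 1, 2 }
cIndex[ ]  = { 0, 1, 2, 2 }
rsIndex[ ] = { 0, 1, 3 }   /* first position of each row in val[ ] */
reIndex[ ] = { 0, 2, 3 }   /* last position of each row in val[ ]  */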
An input vector is stored in a file in the following format: the first element
in the file is the vector size, and the rest of the file contains the values
of the vector, sorted by row index in ascending order, as illustrated below.
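For instance, a vector of size 4 would be stored as follows (the values are
hypothetical, and one entry per line is assumed):

4
0.5
1.25
-3.0
2.0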
In the source code, we also provide vector I/O functions to load and store the
vector b from a file. The one-dimensional array vec[ ] stores the vector for
sparse matrix-vector multiplication. Please do not modify the matrix and
vector I/O code!
Sequential SPMV Code
With the above variable definitions, the following sequential algorithm is
used to calculate A * b, where A is the sparse matrix and b is the input
vector. The result is stored in res[ ].
int i, j;
for (i = 0; i < M; i++) {
    res[i] = 0;
    /* Row i's non-zero entries occupy val[rsIndex[i]] .. val[reIndex[i]]
       (inclusive), so only non-zero products are computed. */
    for (j = rsIndex[i]; j <= reIndex[i]; j++) {
        int tmp = cIndex[j];          /* column index of val[j] */
        res[i] += val[j] * vec[tmp];
    }
}
The sequential code has a two-level nested loop. The outer i loop corresponds
to the multiplication of a row of the matrix A with the column vector b. The
inner j loop accumulates the products for that row. The non-zero entries in A
are stored in val[ ]; rsIndex[i] and reIndex[i] represent the starting and
ending positions of row i's non-zero elements in the array val[ ]. The code is
in "spmv_seq.c".
To compile this program, you can type:
make sequential
To run this program, you type:
./spmv_seq [matrix-market-filename] [input-vector-filename]
Parallelization
In this project, you are asked to exploit only loop-level parallelism in the
given sequential SpMV program. You will express loop-level parallelism through
OpenMP pragmas, i.e., variations of #pragma omp parallel. A detailed
description of the OpenMP pragmas can be found at
https://computing.llnl.gov/tutorials/openMP/
.
You will need to choose the right loop level(s) to parallelize, use the best
loop scheduling parameters for each parallelized loop, and learn how to
improve the parallelization based on feedback.
You will need to submit three OpenMP versions of the sparse matrix-vector
program, named spmv_N_thread_static.c, spmv_N_thread_dynamic.c, and
spmv_N_thread_newalg.c.
The requirements for each are as follows:
- spmv_N_thread_static.c: A version that uses the static schedule for parallelization. You can exploit parallelism at only a single loop level, and you need to choose the right loop level based on dependence analysis. The "reduction" clause is not allowed in this version. (A pragma sketch follows this list.)
- spmv_N_thread_dynamic.c: A version that uses the dynamic schedule for parallelization. Similarly, you can exploit parallelism at only a single loop level, and you need to choose the right loop level based on dependence analysis. The "reduction" clause is not allowed in this version.
- spmv_N_thread_newalg.c: A version that uses your own strategy for parallelization. You can exploit parallelism at a single loop level or at two loop levels, as long as the parallelized program produces correct output.
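For illustration only, a static-schedule version might annotate the outer loop
as sketched below; the loop level, chunk size, and data-sharing clauses are
exactly what you are asked to choose yourself, and thread_num is the runtime
thread-count variable described later in this document:

#pragma omp parallel for num_threads(thread_num) schedule(static) private(j)
for (i = 0; i < M; i++) {
    res[i] = 0;                       /* each thread writes distinct res[i] */
    for (j = rsIndex[i]; j <= reIndex[i]; j++)
        res[i] += val[j] * vec[cIndex[j]];
}

Replacing schedule(static) with schedule(dynamic), optionally with a chunk
size, gives the corresponding dynamic version.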
Feel free to add preprocessing steps before this loop for optimization
purposes; for instance, you can add code to determine the best chunk size for
any given sparse matrix, or you can modify the loop for better load balancing
and locality. You can also use other parallelization strategies, such as
"reduction"-type parallelism. The goal is to make the parallel version of SpMV
as fast as possible.
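As one entirely illustrative heuristic for such a preprocessing step, you
might derive a chunk size from the average number of non-zeros per row (the
constant 4096 below is an arbitrary assumption, not a recommended value):

/* Hypothetical heuristic: aim for a roughly fixed amount of work
   (non-zero multiplications) per scheduled chunk. */
long nnz = 0;
int target_nnz_per_chunk = 4096;      /* arbitrary illustrative constant */
for (int i = 0; i < M; i++)
    nnz += reIndex[i] - rsIndex[i] + 1;
int chunk = (int)(target_nnz_per_chunk / (nnz / M + 1));
if (chunk < 1) chunk = 1;

The resulting value could then be passed as the chunk argument of a schedule
clause, e.g. schedule(dynamic, chunk).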
In all three cases, the number of threads is set as a runtime parameter. That
is, the parallel program takes not only an input matrix and vector, but also
the number of threads. For instance, the dynamic schedule version takes three
input parameters:
./spmv_parallel_dynamic [matrix-filename] [vector-filename] [thread-num]
The thread number is stored in the "thread_num" variable in the program. If
you need to use the number of threads anywhere else in the program, please
make sure it is consistent with the "thread_num" variable.
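For example, one common way to honor this variable (a sketch; whether you use
the library call below or a num_threads clause on each pragma is up to you):

#include <omp.h>

/* Make every subsequent parallel region use thread_num threads,
   unless a num_threads clause overrides it. */
omp_set_num_threads(thread_num);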
In OpenMP, with a static schedule, loop iterations are divided into pieces of
size chunk and then statically assigned to threads; that is, the assignment of
iterations to threads is fixed before the loop executes. If chunk is not
specified, the iterations are divided contiguously and evenly (if possible)
among the threads.
With a dynamic schedule, loop iterations are divided into pieces of size
chunk and dynamically scheduled among the threads: when a thread finishes one
chunk, it is dynamically assigned another from a work queue shared by all
threads. If you do not specify the chunk size, it is 1 by default.
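Concretely, the chunk size is given as the second argument of the schedule
clause; the value 32 below is just an illustration:

/* Blocks of 32 consecutive iterations, assigned round-robin
   to threads when the loop is entered: */
#pragma omp parallel for schedule(static, 32)
for (i = 0; i < M; i++) { /* ... process row i ... */ }

/* Blocks of 32 consecutive iterations, handed to whichever
   thread becomes idle next: */
#pragma omp parallel for schedule(dynamic, 32)
for (i = 0; i < M; i++) { /* ... process row i ... */ }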
The static schedule reduces the runtime overhead of accessing the shared work
queue; however, it may not provide the best load balancing, since the number
of iterations in the inner loop (the j loop) is unknown at compile time and
may differ across i iterations: the number of non-zero entries may differ
from row to row.
The dynamic schedule provides better load balancing, since it determines at
runtime how many tasks (iterations) each thread is assigned: a chunk is handed
to whichever thread becomes idle during execution. However, the dynamic
schedule requires extra code, generated at compile time, to communicate with
and update the shared work queue.
In the third file you submit, which uses the parallelization strategy you
design, you are encouraged to transform the loop and/or add additional code
for better load balancing. You are also encouraged to try the other
optimization strategies mentioned above. The provided source code includes
timing functionality, so you can check the running time of each version of
the code.
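If you want to time an individual region yourself (the provided harness may
already cover this), omp_get_wtime() is the portable OpenMP option; a minimal
sketch:

#include <omp.h>
#include <stdio.h>

double t0 = omp_get_wtime();
/* ... the parallel SpMV loop ... */
double t1 = omp_get_wtime();
printf("spmv took %f seconds\n", t1 - t0);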
Hint: to make your parallel program work correctly and efficiently, shared and
private variables should be explicitly specified using data-sharing attribute
clauses in OpenMP.
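For example, using default(none) forces every variable to be classified
explicitly (a sketch; the variable names follow the sequential code, and the
loop variable i is predetermined private by OpenMP):

#pragma omp parallel for num_threads(thread_num) schedule(static) \
    default(none) shared(M, res, val, vec, rsIndex, reIndex, cIndex) private(j)
for (i = 0; i < M; i++) { /* ... as in the sequential code ... */ }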
We will grade your projects mainly based on correct parallelization parameters
and correct program output (after parallelization). However, for better
performance testing, we encourage you to run the program on a machine that has
less interference from other users, for instance, your own computer. The ilab
machines usually have multiple users active at the same time, and the parallel
performance measured on a busy machine may be noisy. To design a better
parallelization strategy in the third version of your code, you may want
accurate performance numbers. However, you still have to make sure your code
runs on the ilab machines; in the end, we will use an idle ilab machine to
test your program.
File Description
In this part we provide a detailed description of the provided files, as well
as the test cases. You only need to modify the following files for this
project.
- spmv_N_thread_static.c is provided for you to implement the static schedule parallel strategy. Its usage is:
./pstatic [matrix-market-filename] [input-vector-filename] [thread-num]
- spmv_N_thread_dynamic.c is provided for you to implement the dynamic schedule parallel strategy. Its usage is:
./pdynamic [matrix-market-filename] [input-vector-filename] [thread-num]
- spmv_N_thread_newalg.c is provided for you to implement your own parallel strategy. Its usage is:
./pnewalg [matrix-market-filename] [input-vector-filename] [thread-num]
The computation result, the output vector, will be saved in "output.txt" for
grading purposes. We will grade based on both the correctness and the
efficiency of your parallelized code.
Several other files are provided:
- mmio.h, mmio.c: the Matrix Market I/O library, which can be downloaded from http://math.nist.gov/MatrixMarket . We call the functions defined in this library to load the .mtx matrix file.
- utils.h, utils.c: two utility functions. "getmul" calculates the matrix-vector multiplication after loading the data; the "checkerror" function compares its result with the one produced by your parallel code. The program prints "Calculation Error!" if the two results differ.
- spmv_seq.c: the sequential implementation of matrix-vector multiplication. Its usage is:
./spmv_seq [matrix-market-filename] [input-vector-filename]
- vector_generator.sh: generates a vector file. Its usage is:
./vector_generator.sh [vector-size]
The generated file follows the format described above.
- testcases: a folder that stores the matrix and vector examples for testing. You are given only one test case: cop20k_A.mtx as the input matrix and cop20k_A_vector.txt as the input vector.
However, the input vector is not provided. Use the vector_generator.sh script
above to generate an input vector of any size. The vector size has to match
the number of columns in the input matrix. To find the dimensions of the input
matrix, look in the mtx file: the first non-comment line (lines beginning with
% are comments) specifies the number of rows, columns, and non-zero entries of
the given matrix.
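For example, an mtx file for the small hypothetical matrix used earlier in
this document would look like this (the header line may vary by matrix type,
and note that mtx entries are 1-based, while the in-memory arrays after
loading are 0-based):

%%MatrixMarket matrix coordinate real general
3 3 4
1 1 5.0
2 2 8.0
2 3 3.0
3 3 6.0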
Do not change the Makefile; otherwise your code cannot be compiled!