An entry-level CUDA exercise: writing the PageRank algorithm on the GPU. The core framework of the algorithm is already provided, though; what mainly needs to be implemented is the GPU memory allocation and deallocation, along with some basic iterative operations.

Introduction
In this programming assignment you will use NVIDIA’s Compute Unified Device
Architecture (CUDA) language to implement a basic string shift algorithm and
the pagerank algorithm. In the process you will learn how to write general
purpose GPU programming applications and consider some optimization
techniques.
You must turn in your own copy of the assignment as described below. You may
discuss the assignment with your peers, but you may not share answers. Please
direct your questions about the assignment to the forum.
CUDA
“C for CUDA” is a programming language subset and extension of the C
programming language, and is commonly referenced as simply CUDA. Many
languages support wrappers for CUDA, but in this class we will develop in C
for CUDA and compile with nvcc.
The programmer creates a general purpose kernel to be run on a GPU, analogous
to a function or method on a CPU. The compiler allows you to run C++ code on
the CPU and the CUDA code on the device (GPU).
The first step you should take in any CUDA program is to move the data from
the host memory to device memory. The function calls cudaMalloc and cudaMemcpy
allocate and copy data, respectively. cudaMalloc will allocate a specified
number of bytes in the device main memory and return a pointer to the memory
block, similar to malloc in C. You should not try to dereference a pointer
allocated with cudaMalloc from a host function.
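As a minimal sketch (the variable names here are illustrative, not taken from the assignment's starter code), a device allocation looks like:

```cuda
// Allocate space on the device for n characters. cudaMalloc takes a
// pointer-to-pointer and a size in bytes, and returns an error code
// rather than the pointer itself.
char *d_str = NULL;
cudaError_t err = cudaMalloc((void **)&d_str, n * sizeof(char));
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
}
// d_str now points into device memory; do not dereference it on the host.
```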
The second step is to use cudaMemcpy from the CUDA API to transfer a block of
memory from the host to the device. You can also use this function to copy
memory from the device to the host. It takes four parameters, a pointer to the
device memory, a pointer to the host memory, a size, and the direction to move
data (cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost). We have already
provided the code to copy the string from the host memory to the device memory
space, and to copy it back after calling your shift kernel.
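The round trip can be sketched as follows (again with illustrative names; the starter code handles this for you in Problem 1):

```cuda
// Copy n bytes of the host string to the device, then copy the
// result back after the kernel has run. Note the argument order:
// (destination, source, size, direction).
cudaMemcpy(d_str, h_str, n, cudaMemcpyHostToDevice);
/* ... kernel launch goes here ... */
cudaMemcpy(h_str, d_str, n, cudaMemcpyDeviceToHost);
cudaFree(d_str);  // release the device allocation when done
```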
Kernels are launched in CUDA using the syntax kernelName<<<numBlocks, threadsPerBlock>>>(args). The arguments inside the triple chevrons specify the number of thread blocks and the number of threads per block to be launched for the kernel. The arguments to the kernel itself are passed by value, as in normal C/C++ functions.
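For example, launching a hypothetical shift kernel over n characters might look like this (the kernel name and the block size of 256 are assumptions, not values from the starter code):

```cuda
// 256 threads per block is a common starting point; round the
// number of blocks up so that every element gets a thread.
int threadsPerBlock = 256;
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
shiftKernel<<<numBlocks, threadsPerBlock>>>(d_str, shiftAmount, n);
```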
There are some read-only variables that all threads running on the device
possess. The three most valuable to you for this assignment are blockIdx,
blockDim, and threadIdx. Each of these variables contains fields x, y, and z.
blockIdx contains the x, y, and z coordinates of the thread block where this thread is located. blockDim contains the dimensions of the thread block in which the thread resides. threadIdx contains the indices of this thread within its thread block.
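Combining these variables gives each thread a unique global index, which is the idiomatic first line of most one-dimensional kernels:

```cuda
__global__ void exampleKernel(char *data, int n) {
    // blockIdx.x selects the block, blockDim.x is the block width,
    // and threadIdx.x is this thread's position within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Guard: the last block may be only partially full.
        // ... operate on data[i] ...
    }
}
```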
Problem 1 String Shift
The purpose of this problem is to give you experience writing your first
simple CUDA program. This program will help us examine how various factors can
affect the achieved memory bandwidth. The program will take an input string
and shift each character by a constant amount. You will do this in three
different ways:
- by shifting one byte at a time.
- by shifting 4 bytes at a time.
- and finally by shifting 8 bytes at a time.
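The first variant can be sketched roughly as follows (the kernel signature here is an assumption; the starter code's interface may differ):

```cuda
// Shift each character of the input by a constant amount,
// one byte per thread.
__global__ void shiftByteKernel(char *out, const char *in,
                                char shift, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] + shift;
    }
}
```

The 4-byte and 8-byte variants follow the same pattern but access the string through wider pointer types, so each thread moves more data per memory transaction; comparing the three lets you observe the effect on achieved bandwidth.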
You should be able to take the files we give you and type make main_q1 to build the executable. The executable takes one argument: the number of times to double the input file in size. For debugging it is recommended to use a value of 0. The executable will run, but since the CUDA code hasn't been written yet (that's your job), it will report errors and quit.
Problem 2 PageRank
PageRank was the link analysis algorithm responsible (in part) for the success
of Google. It generates a score for every node in a graph by considering the
number of in-links and out-links of a node. We are going to compute a simplified model of PageRank which, in every iteration, recomputes the vector of PageRank scores from the scores of the previous iteration.
In the actual algorithm this operation is performed until the change between
successive vectors is sufficiently small. In our case we will choose a fixed
number of iterations to more easily compare performance across various numbers
of nodes and edges. If you wish to learn more about the algorithm itself,
check http://en.wikipedia.org/wiki/PageRank.
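To make one iteration concrete, here is a hedged serial C sketch of a simplified damped update (the edge-list layout, function name, and damping factor are illustrative assumptions, not the assignment's actual interface): each node starts from a small "teleport" term, then receives each in-neighbor's current score divided by that neighbor's out-degree.

```c
/* One simplified PageRank iteration on a graph given as an edge list:
 * for every edge (src, dst), node dst receives src's current score
 * divided by src's out-degree, scaled by the damping factor d.
 * This layout is an illustrative assumption for the sketch. */
void pagerank_iteration(double *next, const double *curr,
                        const int *src, const int *dst, int num_edges,
                        const int *out_degree, int num_nodes, double d) {
    for (int i = 0; i < num_nodes; i++) {
        next[i] = (1.0 - d) / num_nodes;   /* teleport term */
    }
    for (int e = 0; e < num_edges; e++) {
        next[dst[e]] += d * curr[src[e]] / out_degree[src[e]];
    }
}
```

Porting this to CUDA means mapping the per-node and per-edge loops onto threads; a fixed iteration count then makes timing comparisons straightforward.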
Summary
The focus of this assignment is on becoming familiar with how CUDA works, so overall it is not very difficult.