Introduction
结合了神经网络的高级的CUDA,用GPU进行Neural
Network的算法编写进行数字识别。核心算法部分框架都已提供,需要实现的主要是相关数据结构在GPU中的申请和释放。
In this project, you will implement a neural network using CUDA to identify
digits from handwritten images (a specific case of image classification
problem). Neural networks are widely used in machine learning problems,
specifically in the domains of image processing, computer vision and natural
language processing. Recently, there has been a lot of excitement regarding
deep learning, which basically uses more advanced variants of the simpler
neural network (NN) we cover here. Therefore, being able to train large neural
networks efficiently is important and is the goal of this project.
The main purpose of this Part I is to introduce you to the project and give
you a picture of what needs to be done. Details of the starter code,
submission instructions and full details of the grading steps will be handed
out in Part II.
Data and Notation
We will be using the MNIST dataset, which consists of greyscale 28 × 28 pixel
images of handwritten digits from 0 to 9.
The dataset is divided into two parts — 60,000 training data and 10,000 test
data. We will use the training data to learn the parameters of our neural
network (described later), and the test data to measure the performance of the
learned network.
In the training process, one issue will be overfitting on the training data.
To avoid this, a standard practice is to perform a cross-validation — a
technique to measure how the model will generalize on an independent dataset.
Cross-validation is carried out by further dividing the training data into two
sets - a training set and a validation set. The validation set is a small
portion (usually 10%) of the training dataset. We then perform the training on
the training set (excluding the validation set) and evaluate our model on the
validation set. There are different types of cross-validation, and we will
only do a single holdout for validation because of computational issues. We
use insights from this validation to improve our model.
The goal of the test set is to perform a final evaluation of our model on the
unseen data. It is not meant to be used to in the training process.
Neural Networks
Neurons
To describe neural networks, we will begin by describing the simplest possible
neural network, one which comprises a single “neuron.”
This is a good time for us to discuss the calculation of the derivative of the
sigmoid function with respect to its input, since we are going to use it
significantly in the following sections.
The neuron performs a linear transform on the input and then applies a non-
linear transformation (sigmoid, in this case).
When a single neuron is used, we are limited to a binary classification. Take
the example of a cancer tumor. We may build a neural network that takes as
input the size of the tumor, its location, and the length of time it has been
there. Based on these three pieces of information, the network needs to
determine whether the tumor is benign or malignant. This is a true/false type
of determination. In our previous model using the sigmoid function, we can
interpret the output of the network.
Our task is therefore to learn the parameters of the network W and b so that
it achieves the best accuracy or precision on the test set.
In the project we need to extend this concept to multiple classes. Instead of
a simple true/false output, we need to decide which digits from 0 to 9 is
shown on the input image. This requires a full neural network.
Fully-connected neural network
Figure 4 shows a fully-connected neural network with 1 input layer, 1 hidden
layer, and 1 output layer. We call such a network to be a two-layered neural
network (ignoring the input layer as it is trivially present)
In our problem, we are trying to determine the digit associated with each
image. We will call this digit the “label” associated with the image (using
the neural network terminology). The total number of labels is denoted C. In
our case C = 10, since we are trying to determine digits 0 to 9.
The last layer is special. This is the output of our network. In the project,
we have C = 10 output nodes. Each node represent a possible digit. We will see
later on how the output vector y can be interpreted to determine the digit
that is guessed by the network for a given image.
Feed-forward
The nice thing about neural networks is that they are highly modular. Layer
L[i] does not need to know whether its input is the input layer itself or the
output of L[i−1]. L[i] computes its activations
Training
Recall that our objective is to learn the parameters of the neural network
such that it gets the best accuracy on a set of data points, which we call the
dev set. Let y be the one-hot vector denoting the class of the input, i.e.,
y[c] = 1 if c is the correct label, 0, otherwise. We want P (label = c|x) to
be the highest.
Without going into the mathematical details, we will use the following general
expression to determine the error of our neural network.
The above cost measures the error or our “dissatisfaction” with the output of
the network. The more certain the network is about the correct label (high P
(y = c|x)), the lower our cost will be.
Clearly, we should choose the parameters that minimize this cost. This is an
optimization problem, and is usually solved using the method of Gradient
Descent (described below).
Our neural network applies a non-linear function to the input because of the
sigmoid and softmax functions. However, if we make W very small, the network
becomes nearly linear because T is itself very small. To optimize the neural
network, we often add a penalization term for the magnitude of W in order to
control the non-linearity of the network. There is no rigorous justification
for this penalization. It is found to work well in practice and is easy to
use. With the penalization term.
Gradient Descent
Gradient Descent is an iterative algorithm for finding local minima of a
function.
In practice, we often do not compute J using all the input images. Instead, we
subdivide the input into mini-batches containing M images. We process one
mini-batch at a time. For each mini-batch, we calculate J, and update the
network coefficients. Then, process the next batch, until all images are
processed. See Listing 5 for the pseudo-code. In the code, an epoch (in the
machine learning lingo) is an iteration over the entire data set of N images.
This approach usually leads to faster convergence because we update the
network coefficients more often and are able to learn faster.
Backpropagation
Backpropagation is the process of updating the neural network coefficients.
This involves computing the gradient of multi-variable functions using the
chain rule.
Your Tasks
You task is to implement a parallel training process for our two layer neural
network utilizing both CUDA and MPI. You should focus on parallelizing each
iteration that processes a batch of images. The work should be parallelized
among a certain number of MPI nodes (e.g., 4 MPI nodes), and the computation
on each node should be accelerated by GPU kernels. You’ll need to carefully
analyze the neural network training algorithm, the overhead of the MPI and
CPU-GPU communication and decide the granularity of parallelism. For example,
you may choose to fork (distribute data among MPI nodes) and join (collect,
reduce on one MPI node) on every matrix operation, or fork at the beginning,
keep intermediate results local at each node and join at the end, or a
combination of both.
Possible optimizations may include:
- GPU accelerated matrix operation: The training process contains GEMM (General Matrix-matrix Multiplication) and element-wise matrix operations, which should be accelerated by GPU. The key to optimizing the GPU kernels involves a wise choice of blockSize and gridSize, memory coalescing, usage of shared memory, etc.
- Communication overhead: Both MPI and GPU computation introduce communication overheads. While some of the communication are required, some can be avoided.
Outline of parallelization strategies for CUDA and MPI
- GEMM CUDA implementation: A GEMM operation can be expressed as D = a * A * B + b * C. Some BLAS libraries perform in-place computation that saves the result D in the memory space of C, as cuBLAS does. This is possible if C is no longer used after the computation.
We first outline the basic implementation (“Algorithm 1”). The simplest
implementation is to have one thread for each element in the result D. Each
thread reads the required row of A, column of B, and an element of C to
compute the output element in D.
For Algorithm 2, try to think of a better implementation. For example, use a
blocking algorithm and take advantage of the shared memory. One approach is to
have each thread block (e.g., a block of 3232 threads) compute a sub-matrix
(of size 3232) in the output matrix. Blocks from matrices A and B can be
loaded into shared memory, with each thread reading one element of each sub-
matrix.
Each thread then updates its entry in the sub-matrix of D. A loop is used to
multiply all the required entries in A and B. For an even more optimized and a
possible “A+ grade” implementation, please refer to section 5.
We intentionally do not explain the details of these algorithms. It’s for you
to fill the blanks and perhaps come up with better ideas! - MPI implementation: For each batch of images, you should subdivide the input images into smaller image batches and use MPI communication methods to distribute the input data and Neural Network parameters among MPI nodes, and perform GEMM and other computations in parallel. The resulting network coefficient updates should be reduced and sent to all nodes.