Introduction
用C实现PageRank算法,同时需要对大数据量节点的情况进行优化。
Task Description
In this assignment your task is to implement the PageRank algorithm in C using
the power method described below and then optimise and parallelise your code
to ensure peak performance is achieved.
The PageRank algorithm was developed in 1996 by Larry Page and Sergey Brin
when they were graduate students at Stanford University. Google and other
search engines compare words in search phrases to words in web pages and use
ranking algorithms to determine the most relevant results.
PageRank assigns a score to a set of web pages that indicates their
importance. The underlying idea behind PageRank is to model a user who is
clicking on web pages and following links from one page to another. In this
framework, important pages are those which have incoming links from many other
pages, or have incoming links from other pages with a high PageRank score, or
both.
You are encouraged to ask questions on Ed using the assignments category. As
with any assignment, make sure that your work is your own, and you do not
share your code or solutions with other students.
Working on your assignment
You can work on this assignment on your own computer or the lab machines.
Students can also take advantage of the online code editor on Ed. Simply
navigate to the assignment page and you are able to run, edit and submit code
from within your browser. We recommend that you use Safari or Chrome.
It is important that you continually back up your assignment files onto your
own machine, flash drives, external hard drives and cloud storage providers.
You are encouraged to submit your assignment while you are in the process of
completing it. By submitting you will obtain some feedback of your progress.
The PageRank algorithm
PageRank is an iterative algorithm that is repeated until a stopping criteria
is met. The last iteration gives us the result of the search, which is a score
per web page. A high score indicates a very relevant web page whereas a low
score indicates a not so relevant web page for a search. Sorting the web pages
by their scores in descending order gives us the order for the result list of
a search query.
Example
Our example has four web pages: S = {A, B, C, D}. In this example, d = 0.85
and c = 0.005.
The referencing structure of the web pages A, B, C, and D is given in the
graph below.
Each node represents a web page.
Edges in the graph indicate that the source of the edge is linking to the
destination of the edge.
At this point the algorithm terminates and the values of P(4) are the final
PageRank scores for each of the pages. If this were a query, the resulting
ranks would be A, C, B, D where we arbitrarily rank page A before page C since
they have the same score.
Implementation details
Your program must implement the PageRank algorithm using the power method
described above.
The header file pagerank.h contains an init function that processes the list
of pages and edges from standard input and stores a linked list of pages and
corresponding inlinks in the following structs
struct page {
char* name; // name of this page
size_t index; // index of this page
size_t noutlinks; // number of outlinks from this page
node* inlinks; // linked list of pages with inlinks to this page
};
struct node {
page* page; // pointer to the page data structure
node* next; // pointer to the next page in the list
};
—|—
Your program must produce no errors when built on the lab machines and ed with
the command:
$ clang -O1 -std=gnu11 -march=native -lm -pthread pagerank.c -o pagerank
Your program output must match the exact output format shown by the prototype
and on Ed. You are encouraged to submit your assignment while you are working
on it, so you can obtain some feedback.
Program input
<dampening factor>
<number of pages>
<name of web page>
...
<number of edges>
<source page> <destination page>
...
The first line specifies the dampening factor to be used by the program. The
second line defines the number of pages to be declared, followed by a list of
page names with one name per line. Then, the number of edges in the graph is
specified, followed by a list of edges with one edge per line.
The init function reads the input and terminates outputting an error message
if the input is invalid.
The input is considered invalid if a page name exceeds the maximum length, the
dampening factor is not in the range 0 < d < 1, a page is declared twice, or
an edge is defined to a nonexistent page.
Example
0.85
A
B
C
D
D A
D B
D C
B A
B C
This example has four web pages with names: A, B, C, and D
There are five edges: D - A, D - B, D - C, B - A, and B - C
The output is the list of scores for each web page, for example:
A 0.30778524
B 0.21606622
C 0.30778524
D 0.16836329
The scores are ordered in the same order they were defined in the input.
For printing the score, use the format string “%s %.8lf\n”
For all test cases set e = 0.005, use the constant defined as EPSILON