将一个单线程的 Web Crawler 改成多线程运行。
![Web
Crawler](https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/WebCrawlerArchitecture.svg/220px-
WebCrawlerArchitecture.svg.png)
Threaded Web Crawler
I have completed a simple single-threaded Web Crawler that uses this class
from a 310 assignment. Your task is to
- Replace the non-recursive breadth-first graph traversal with a multi-threaded implementation using a fixed- size Java ThreadPool (i.e., instead of enqueing URLs you’d submit a new task to the pool that creates a new WebCrawler object that submits more tasks…) Note that it may require you to rip apart the HTMLLinks class and refactor the solution entirely.
- Modify the program so that it displays the 10 most commonly occurring URLs (local or remote) encountered, in descending order, together with a count of many times that URL appeared.
Implementation Notes
Consider what data structure might be useful in displaying the top 10 URLs
without having to a) perform a brute-force linear search or 2), sort all the
URLs. (hint: it can be done in O(n + klgn) time)
Here is a simple example of how one can use a ThreadPool to limit the number
of active threads:
import java.util.concurrent.*;
public class ThreadPoolExample
{
private static final int MAX_THREADS = 10; // Try Different values!!!
public static void main(String args[])
{
ExecutorService service = Executors.newFixedThreadPool(MAX_THREADS);
for (int i=0; i<100000; i++) // Create lots of threads
{
service.submit(new Task(i));
}
}
}
final class Task implements Runnable
{
private int p_taskId;
public Task(int id)
{
p_taskId = id;
}
@Override
public void run()
{
System.out.println(“Task ID : “ + p_taskId +” performed by “
+ Thread.currentThread().getName());
}
}
—|—
Note that the run() method of a Runnable can not be parameterized. So we
typically implement a parameterized constructor that is passed references to
objects shared by various threads (i.e., a shared memory model of
communication); obviously you need to use “thread-safe” objects and data
structures (can you say “Vector” instead of “ArrayList”; “ConcurrentHashMap”,
etc?). Consider what shared “global” data or data structures need to be
accessible by each thread (and by “global” I mean passed to the thread via the
constructor). You must be careful to avoid “race conditions” (e.g., testing
and subsequently setting) that might result in inconsistency; correct programs
must guarantee correctness, not depend on “luck.”
Also note that we never explicitly call run(); this is the method that will be
invoked when a Thread is started (explicitly by calling start() or implicitly
when the ThreadPool wraps one of its limited number of Thread objects around a
Runnable object and starts the thread).
The main() above is somewhat deficient in that it never terminates. Obviously
we want our main program to wait for all threads to complete before we can go
on to display the most common URLs and then exit. One way to do that is to
“shutdown” the thread pool and wait (indefinitely) for it to “drain”:
service.shutdown();
try
{
service.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
} catch (InterruptedException e) { }
…
—|—
If all tasks to be executed have been submitted before we call shutdown() this
works perfectly. However, this wont work in our case as threads will attempt
to submit new tasks at the same time the main thread is calling shutdown(),
preventing new tasks from being added as shutdown “bars the door” to the
service and then waits for the existing threads to complete. (Note, however,
that this is the approach signal-handlers should take).
Another approach is to have the last task “turn out the lights”, as it were,
by shutting down the thread pool; main() would still awaitTermination() but
would rely on a thread shutting the service down after it determines there is
no more work to be done. Can this work and, if so, how? I’m leaving this as an
open-question for discussion in class; ASK!
Analysis
I want you to test non-threaded and threaded versions of your program.;
execute each a number of times and average the runtimes. You’ll need to
minimally modify the non-threaded version so that both versions report the top
10 most common links. Your threaded version might experiment with various
sizes for your thread-pool.
In addition to turning in the program code for your threaded version, answer
the following questions for the remaining 15 points:
- Which version ran faster and why?
- How did the thread-pool size affect the threaded version, if at all?
- Do you think the results would be different if the crawler crawled beyond the initial URL’s host? Why or why not?
Submit your analysis and your multi-threaded WebCrawler.java program only (do
not include HTMLLinks.java unless you’ve made changes). Do not include your
analysis in the code…