代写Java数据结构作业,作业分为两部分,上半部分的理论知识以及下半部分的编程设计。
Introduction
This homework has two parts: in Part A you will write up answers to analytical
problems related to the lectures in the past week, both to confirm and extend
your understanding of the abstract principles discussed; in Part B you will
write code to implement this understanding, and to practice your Java coding
skills. I suggest you read this whole assignment carefully and for Part B, it
is definitely worth thinking about your solution for a bit before launching
Dr. Java and beginning to type. In addition to the requirements for the Java
problems, you must observe the following requirements (for all homework
submitted in this course):
- All programs should be literate, i.e., easily understandable by a human (say, your grader) and follow the Java Style Guidelines for CS112 posted on the class web site;
- All files for this homework should be submitted using WebSubmit, following the instructions on the class web site;
- You may not use data structure libraries such as ArrayList, since we are learning to write Java from the ground up and you must learn how these libraries are built; however, you are free to use (unless specifically directed otherwise) the basic libraries String, Character, Scanner, and Math;
- You may freely use code from the class web site, the textbook, or lecture (unless specifically directed otherwise) as long as you cite the source in your comments; but you may NEVER use code from the web or other students’ programs—this will be considered plagiarism and penalized accordingly.
Part A: Analytical Problems
In these questions we will first explore the binary heap data structure, and
then consider the general problem of hashing by considering the two basic
paradigms, using a simple hash function. In the first one, involving separate
chaining, we will make this realistic by stored (key, value) pairs; the hash
function is applied to the key alone, but the entire pair is stored in the
node. For the second, involving linear probing, we will simply store the key
in the array slot.
Write your solutions in a file hw11.txt, hw11.pdf, hw11.rtf, etc. and submit
by the due date and time.
- Insert the following sequence of keys into an intially empty 2-3 tree. You should write all the trees down as you do the transformations (this is what would be expected on an exam) but for the purposes of grading this exercise you can just draw the final tree that results.
- Consider the following 2-3 tree:
Suppose we count ONLY comparisons between two keys, and NOT checks to see if
the key exists, or checks to see if a pointer is null. (i) How many
comparisons would necessary to find the key 10? (ii) How about 140? (iii) How
about 60? (iv) Which key(s) occurring at leaf nodes would require a minimum
number of comparisons? State which and how many comparisons. (v) Which key(s)
would require a maximum number of comparisons? State which and how many
comparisons. (vi) What is the average number of comparisons to find the keys
in this tree (count for all and then divide by the size of the tree). - Insert the following keys into a binary heap, showing the structure of the tree (draw them sides as in the homeworks, or some other way) and the contents of the array A at the indicated points. At the end, give the values of the indicated variables.
- Let us consider using a hash table for a symbol table storing (key, value) pairs, where the key is used as input to the hash function. Insert the following key-value pairs into a hash table implemented with the technique of separate chaining with an array H of size 7. For each insertion, use the hash function hash(x) = (3 * x % 7) on the key part and store the pair in the bucket at H[ hash(key )]. Assume nodes are defined as class Node{ int key; String value; Node next; }.
* A. Show the hash table after all insertions are performed. What is the worst-case lookup time (in the number of comparisons), and what is the average-case lookup (average the number of comparisons over all keys in the table)?
* B. Perform the following operations on the table from (a) and show the table that would result
* C. What is the worst-case lookup and what is the average-case lookup for keys in the table that resulted from part (b)?
* D. Give a sequence of keys (just the keys, not pairs) that would all hash to the same location for a table of size 7, creating a worst-case hash table (i.e., start with an empty table, and create a linked list worst-case table).
* E. If we insert M keys into a hash table with N buckets, creating a worst-case table, what is the worst case time for looking up a key? Express in terms of N and M, using Theta(…) notation.
* F. If we insert M keys into a hash table with N buckets, suppose the hash table arranges itself in the best possible way, what is the worst-case lookup? Express in terms of N and M, using Theta(…) notation. - Suppose now we have a hash table L using the strategy of linear probing, which just stores integer keys (not key-value pairs). The size of the array is N = 11 and the hash function is hash(k) = 3*k%11. Consider the following series of insertions.
* A. Show L after all insertions were performed.
* B. What is the worst-case time to search for one of the four keys 4, 2, 1, or 28?
* C. Perform the following operations on the table from (a) and show the table that would result.
* D. What is the worst-case time for search for one of the keys 4, 2, 1, 28, 23, 63, 19, or 13 in the table from part (c)?
* E. What is the average-case time in this table (total the number of comparisons used for searching for each key, and divide by the number of keys, i.e., 8)?
* F. Perform the following operations on the table from part (c) and show the table that would result.
* G. Describe the worst-case of a hash table using linear probing. Use Theta(….) notation to characterize the number of comparisons in this case when searching the hash table for a given item which is present in the table.
Part B: Programming Problems
Problem B.1 (Lab 11)
Submit the file IntStack.java from Lab 11 as problem B.1.
Problem B.2 (Lab 12)
Submit your text file Lab12.txt as part of your HW 11 submission.
In the remaining problems, you will develop a realistic program which
implements a text-retrieval system similar to Google.
You will extend a basic program called Minipedia that maintains a database of
articles on various subjects. We supply a folder of ]2200 articles culled from
Wikipedia (first paragraphs only), two test articles Test1 and Test2, and also
the code to access this collection. The basic user interaction is provided, as
well as a bare-bones implementation of the database as a dumb unordered list.
You must extend the functionality in several interesting ways, and also
provide a more intelligent data structure using a hash table.
Problem B.3 Article Table
For this problem, you will develop a hash-table based version of the
DumbList.java code.
Step One
Download the Code Distribution (ZIP) and unzip it in the folder where you will
create your solution. It has the following files (the files marked with
asterisk (*) you should NOT modify in any way):
- articles.zip* – A zipped archive of a folder called “articles” which contains 2511 Wikipedia articles in text form;
- Article.java* – A class to hold an article, including the title and the body of the article;
- DatabaseIterator.java* – A class which reads the text files in the articles folder which returns one article after another when next() is called;
- DumbList.java – An unordered array of articles, which you can add, delete, and lookup.
- Minipedia.java – The client code which provides a user interface and interacts with the other classes to provide basic functionality for adding, deleting, and looking up articles. It does NOT do search in the body of the articles, and this is the primary functionality that you will be adding.
First, you should unzip the articles folder and browse around and get a sense
for what an article looks like. Then say “wow, that’s a lot of articles” (this
step optional). Then compile and run the Minipedia.java client, and experiment
with the interface, looking up articles (look at the folder to see titles—you
will need to get the title exactly right to look it up). Finally, look through
the code and understand what the program does.
Note on running Minipedia with the articles folder:
If you get an error either compiling or running Minipedia.java in OS X, e.g.,
The problem is that OS X puts a hidden file in the articles folder when you
create it. You must remove this file. Please open up the Terminal and do the
following: (1) Type in ‘cd ‘ (just two letters and a space) and then drag the
articles folder into the Terminal app. You should see the complete name of the
article folder following the ‘cd ‘. Then (only after dragging the folder in)
hit enter. Now you are in the article folder. (2) Then type ‘rm .DS_Store’ and
hit enter. This will delete the file. (3) Close the terminal window.
After this, Minipedia.java should compile.
Step Two
Next you must rewrite the DumbList.java program as a program ArticleTable.java
which will store articles in a hash table using the technique of separate
chaining, instead of using a “dumb list.” It should have the same
functionality (as specified below). You should initialize the table by simply
inserting each article from the list returned by getArticleList(db) in line
65.
Your hash table should be implemented as described in lecture 11/29 and in my
notes on hash tables. Please use a table size which is a prime number around
2500 (Google “prime numbers” to see a list). You should use the article titles
as the keys to insert into the table.
You may use any “good” hash function for String, e.g., the two found here work
well: HTML.
Your public methods should include the following for inserting, deleting, and
retrieval an article based on its exact title
Hash Table Iterator In addition, you must write your ArticleTable class to be
Iterable, as in Lab 09.
The iterator will need to go through each bucket and each node in each bucket.
Hint: use two cursors, one an index into the array and one a pointer to a Node
in a bucket; initialize the iterator by finding the first non-empty bucket and
point to the first Node in that bucket; when you advance the cursors, you will
need to figure out how to advanced to the next Node, even if it in a different
bucket.
You, of course, will write a hash table which has Article values and String
keys (=titles). Write your class in a file ArticleTable.java and provide a
unit test that verifies that it works properly, and can add, delete, and
lookup articles, as well as enumerate all articles in the table using the
iterator. You MUST test all the features of your code by writing your own unit
test, which will be part of the grade. I strongly suggest that you modify the
Minipedia code to use this new implementation and test it that works (this is
not a requirement).
Problem B.4 Cosine Similarity Calculator
In this problem you must write a program file TermFrequencyTable.java which
will store the words from two Strings (i.e., documents) in such a way that you
can easily calculate the “cosine” similarity of the two.
Calculating the Cosine Similarity
To calculate the similarity of the two documents, you must take the union of
the two word lists A and B (this is the total vocabulary of the two documents,
minus any “blacklisted words”—very common words are often not used to
calculate similarity measures) and then calculate the new term frequency
vector of each document with respect to this total list of words. For example,
if document A is “the man with the hat ran up to the man with the dog” and
document B is “a man with a hat approached a dog and a man”, and the black
list is {“the”, “a”, “to”, “with”, “up”}, then we have the following word
lists which record how many times each word occurred in each document, and
then compile a list of the total vocabulary (minus black-listed words):
Now we must calculate the term frequency vector for each test, indicating how
many times each word in the total list occurs in each of the two documents (of
course, some entries will be 0):
Finally, we calculate the cosine similarity using the formula:
(Note that the indices here start at 1 and end at n, instead of starting at 0
and ending at n-1; this is just a mathematician’s way of writing sequences;
you should always think in terms of 0 to n-1.) More explanation of the cosine
similarity may be found here.
This means the documents are very similar, but not identical (with respect to
non-black-listed words). The cosine similarity is literally the cosine of the
angle between two vectors in N-dimensional space (you’ll learn about this in
Linear Algebra), so it is always between 0 and 1.
You should write this class using a separate-chaining table with a table size
which is a relatively small prime around size 100 (the articles are relatively
short and you need to store only the vocabulary of two articles). Your buckets
should be composed of Nodes using something similar to the following.
The basic idea here is to put all the terms in two strings (documents) into
the table, and count how the number of occurrences of each term in each
document. The total list of all words is just the total of all words in the
hash table. At the end, you may return the cosine similarity of the terms
contained in it; if you go through the entire table, the list of all (non-
duplicate) terms is the term list, and the list of the term frequencies
contained in termFreq[] contains the term frequency vectors.
In the formula above, the sequence Ai is the sequence of integers found in
p.termFreq[0] for all nodes p, and the sequence Bi is the sequence found in
p.termFreq[1]. You need to calculate the product of these integers (for the
numerator) and the squares of each (for the denominator). You must therefore
use a method for iterating through all nodes in the table; you do not need to
provide an interface to do this, but you will need to iterate through all the
entries (two for loops should do it!).
Write your class in a file TermFrequencyTable.java and provide a unit test
that verifies that it works properly. Test three examples, one with the exact
same terms in each document (e.g., “A B” and “A A B B”, the c.s. should be
1.0), one where there is no common term between the two documents (“A B” and
“C D”, the c.s. should be 0.0), and also an example which produces a c.s.
between 0 and 1, perhaps “CS112 HW11” and “CS112 HW11 HW11” which should be
0.9487.
Problem B.5: ArticleHeap
You must write a file ArticleHeap.java which implements a priority queue for
Articles, ordered by the cosine similarity (which is a double field in the
Article class, and has a method compareCS(…), take a look there). The method
getMax() should return the article in the heap with the largest
cosineSimilarity field. You should not need to make any changes to the Article
class to get this to work.
You should be able to write this easily using the code for the MaxHeap from
the web site.
The method getMax() in your class should throw a HeapUnderflowException if you
call the method when the heap is empty. You MUST create a Unit Test which
verifies that your heap works properly and throws this exception in the case
of heap underflow. And you will need to call the getMax() inside of a try-
catch block.
Put the exception definition at the end of your file, e.g.,
Hand in your code as ArticleHeap.java with an appropriate Unit Test.
Problem B.6: MiniGoogle
For this problem you must use Minipedia.java from the code distribution (which
currently is implemented using the DumbList.java data structure to search for
titles, insert, and delete articles) as a template to write a program
MiniGoogle.java,which will allow for searching the text of an article using
keywords, using Cosine Similarity (see previous).
The steps in this process are as follows:
To find articles most similar to a search phrase P:
- Use the ArticleTable iterator to enumerate all the articles in the ArticleTable;
- For each article, compare the body of the article with the search phrase P (as if they are two documents), by first preprocessing the strings (described below), then using the TermFrequencyTable class to calculate the Cosine Similarity (calling it once for each article to compare with the search phrase);
- Insert any articles with a cosine similarity greater than 0.001 into the ArticleHeap;
- Call getMax() for this heap to print out the top three matching articles; if no articles are found, print out “No matching articles found!” but if only 1 or 2 are found, simply print them out.
Note that you will construct a TermFrequencyTable for each article that you
compare with the search phrase.
Grading
We will be grading this using a grading client that builds an ArticleTable as
in the main method of your MiniGoogle, and then calls phraseSearch(…) for
various test cases.
You should develop your code as an interactive program which uses Minipedia as
a basis to create the program MiniGoogle, and make sure that it performs as
you expect.