Java Assignment: CS201 Information Retrieval


Using the TF-IDF algorithm, together with the Map, Set, and List data structures, implement an information retrieval system.

Goals

  • Translate written descriptions of behavior into code.
  • Practice representing state in a class.
  • Practice interacting with the Map, Set, and List abstractions.
  • Test code using unit tests.

Downloading and importing the starter code

As in previous assignments, download and save (but do not decompress) the
provided archive file containing the starter code. Then import it into Eclipse
in the same way; you should end up with an information-retrieval project in
the “Project Explorer”.

Search engine behavior

For the purposes of this assignment, a search engine is a stateful object that
“knows about” a set of documents and supports various queries on those
documents and their contents.
Documents are identified by a unique ID and consist of a sequence of terms.
Terms are always the lowercase versions of words; operations on the search
engine and documents should therefore be case-insensitive. Documents are
added one-by-one to the search engine.
The search engine functions as an index. That is, given a term, the search
engine can return the set of documents (that it knows about) that contain that
term.
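For instance, with hypothetical method signatures (the starter code defines the real ones), the behavior might look like this:

```java
SearchEngine engine = new SearchEngine();
engine.addDocument("doc1", "The QUICK brown fox");
engine.addDocument("doc2", "A lazy dog");
// "quick" appears (case-insensitively) only in doc1.
engine.indexLookup("quick");  // returns the set {"doc1"}
```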
The search engine can also find a list of documents (again, from among the set
it knows about) relevant to a given term, ordered from most-relevant to least-
relevant. It does so using a specific version of the tf-idf statistic, which
sounds intimidating but is actually fairly straightforward to calculate - so
long as you have the data structures to support doing so.
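For orientation, the common textbook formulation of the statistic is tf-idf(t, d) = tf(t, d) × idf(t), where tf(t, d) is how often term t appears in document d, and idf(t) = log(N / df(t)), with N the number of documents the engine knows about and df(t) the number of those documents containing t. Treat this only as background: the exact variant the tests expect is spelled out in the starter code's javadoc.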

What to do

The SearchEngine needs to keep track of the documents for two things: to do
index lookups of terms, returning a set of documents (in indexLookup), and to
compute the two components of the tf-idf statistic (in termFrequency and
inverseDocumentFrequency). You can hold this state with whatever data
structures you like, but my suggestions follow.
I suggest you get addDocument and indexLookup working first. To support the
index, a straightforward mapping of terms to DocumentIds will work (to be
clear: a Map from each term to the Set of IDs of documents containing it). It
turns out you don't need to create this structure; you can use the one you'll
make to support tf-idf instead, but creating this Map might be a good warmup.
In any case, declare the structure(s) as instance variables, create the empty
structure(s) you'll use in the constructor, fill it/them in addDocument, and
examine it/them in indexLookup. When turning the document itself into terms,
use the same approach as in Assignment 05: String.split using "\\W+", and
remember to call toLowerCase on the result.
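A minimal sketch of this warmup index, assuming documents are identified by String IDs and that addDocument takes an ID plus the document's full text (the starter code's actual signatures are authoritative):

```java
import java.util.*;

public class SearchEngine {
    // Warmup index: each term maps to the set of IDs of documents containing it.
    private Map<String, Set<String>> index;

    public SearchEngine() {
        index = new HashMap<>();
    }

    public void addDocument(String documentId, String contents) {
        // Split on runs of non-word characters and lowercase each term.
        for (String word : contents.split("\\W+")) {
            String term = word.toLowerCase();
            if (term.isEmpty()) {
                continue;
            }
            // Create the set for this term the first time we see it.
            index.putIfAbsent(term, new HashSet<>());
            index.get(term).add(documentId);
        }
    }

    public Set<String> indexLookup(String term) {
        // Return an empty set (rather than null) for unknown terms.
        return index.getOrDefault(term.toLowerCase(), new HashSet<>());
    }
}
```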
termFrequency requires that you compute the number of times a given term
appears in a given document. This suggests you should have a data structure
that keeps track of the count of each term in a document: a Map from terms to
counts. But this frequency-counting structure is per-document; you need to
keep track of each document's counts. So overall, I suggest a Map of Maps: the
outer map goes from DocumentIds to the inner frequency-counting structure.
You'll have to update addDocument to populate and update these structures. Be
sure to get clear in your head the different times you'll use get, put,
containsKey, and getOrDefault.
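Continuing the sketch above, the nested structure and a termFrequency built on top of it might look like this (the parameter order and return type here are assumptions; check the starter code's javadoc):

```java
// Outer map: document ID -> (term -> number of occurrences in that document).
private Map<String, Map<String, Integer>> termCounts = new HashMap<>();

// Inside addDocument, after lowercasing each term, something like:
//   termCounts.putIfAbsent(documentId, new HashMap<>());
//   Map<String, Integer> counts = termCounts.get(documentId);
//   counts.put(term, counts.getOrDefault(term, 0) + 1);

public int termFrequency(String documentId, String term) {
    // Unknown documents or terms have a frequency of zero.
    Map<String, Integer> counts = termCounts.getOrDefault(documentId, new HashMap<>());
    return counts.getOrDefault(term.toLowerCase(), 0);
}
```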
Once you have the structure described above, inverseDocumentFrequency is fairly
straightforward. Be sure to read the javadoc comment above the method for the
exact equation the tests are expecting. Use Math.log to compute the logarithm
(not Math.log10 or Math.log2).
Use these two methods to compute a given document-term pair’s tfIdf.
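As a sketch only, using the common log(N / df) formulation as a stand-in (the exact equation must come from the javadoc, and this version assumes the term appears in at least one known document):

```java
public double inverseDocumentFrequency(String term) {
    // How many known documents contain this term at all?
    int documentsContaining = indexLookup(term).size();
    int totalDocuments = termCounts.size();
    // Placeholder formula: natural log of (total documents / documents containing the term).
    // Substitute the exact equation from the method's javadoc; the tests depend on it.
    return Math.log((double) totalDocuments / documentsContaining);
}

public double tfIdf(String documentId, String term) {
    // tf-idf for a document-term pair is the product of the two components.
    return termFrequency(documentId, term) * inverseDocumentFrequency(term);
}
```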
Finally, implement relevanceLookup, which returns a list of all documents
containing a given term, sorted from largest tf-idf to smallest. You’ll
probably need to implement TfIdfComparator, but note that no tests test the
comparator directly, so if you have another method in mind to sort the list,
go ahead. If you do implement it, make sure it returns a value that will
result in the list being sorted largest-to-smallest, and mind the tie-breaker
requirement.
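One possible shape for relevanceLookup, assuming it returns a List of document IDs and using ascending document ID as a placeholder tie-breaker (use the assignment's actual tie-breaker rule):

```java
public List<String> relevanceLookup(String term) {
    List<String> results = new ArrayList<>(indexLookup(term));
    // Sort from largest tf-idf to smallest.
    results.sort((a, b) -> {
        int byTfIdf = Double.compare(tfIdf(b, term), tfIdf(a, term)); // reversed for descending order
        if (byTfIdf != 0) {
            return byTfIdf;
        }
        return a.compareTo(b); // placeholder tie-breaker: ascending document ID
    });
    return results;
}
```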

