使用提供的NLTK库,按照描述中提供的编程步骤,代写实现一个文件检索系统。
Materials Provided
Download the file 4115 6115 Assignment Files.zip from the module homepage,
which unzips to give a folder containing the following data and code files,
for use in the assignment:
data files: documents.txt, queries.txt, cacm_gold_std.txt, stoplist.txt example_results_file.txt
code files: read_documents.py, eval_ir.py
The file documents.txt contains a collection of documents which record
publications in the CACM (Communications of the Association for Computing
Machinery). Each document is a short record of a CACM paper, including its
title, author(s), and abstract - although one or other of these (especially
abstract) may be absent for a given document. The file queries.txt contains a
set of IR queries for use against this collection. (These are ‘old-style’
queries, where users might write an entire paragraph describing their
interest.) The file cacm_gold_std.txt is a ‘gold standard’ identifying the
documents that have been judged relevant to each query. These files constitute
a standard test set that has been used for evaluating IR systems (although it
is now somewhat dated, not least by being very small by modern standards).
Code files: Inspect the files documents.txt and queries.txt. You will see they
have a common format, where each document or query comes enclosed within (XML-
style) open/close document tags, that also specify a numeric identifier for
the document or query. The code in the read_documents.py allows convenient
access to documents/queries via a simple iteration. In particular, if we
create an instance of the ReadDocuments class (supplying the file name of the
document collection or query set), then we can access the documents from that
collection in the manner of a simple iteration, as illustrated in the
following code example:
from read_documents import ReadDocuments
documents = ReadDocuments(‘documents.txt’)
for doc in documents:
print(“ID: “, doc.docid)
print(doc.lines[0])
—|—
Here, the document is returned as an instance of a Document class, that has
two attributes: an attribute docid whose (integer) value is the numeric
identifier of the document (or query), and an attribute lines whose value is a
list of strings for the lines of text in the document. The example code above
prints the identifier and first line of each document in the collection. You
are free to employ the code in read_documents.py (i.e. by importing it) in
building your own system (but this is optional).
The script eval_ir.py calculates system performance scores. It requires
systems to produce a results file in a standard format, listing the documents
deemed relevant for each query. Run the script with its ‘help’ option (-h) for
instructions on its use, and on the required format of the results file. An
example of a results file is provided as example results file.txt, so you can
try the scorer out. (This file, btw, is real output from a previous student
assignment, and its performance is pretty much at the upper limit of what is
achievable on this task.)
Task Description
Your task is to implement a document retrieval system, based on the vector
space model, and to evaluate its performance over the CACM test collection
under alternative configurations, arising from choices that might include the
following:
- stoplist: whether a stoplist is used or not (to exclude less useful terms)
- stemming: whether or not stemming is applied to terms.
- term weighting: whether term weights in vectors are binary, or are term frequencies, or use the TF.IDF approach
Your retrieval system should be implemented in Python, and it should run on
all platforms.
What to Submit
Your assignment work should be submitted electronically via MOLE, and should
include:
- Your Python code, plus a README file explaining how to run it. Credit will be given in regard to the extent of the implementation achieved, e.g. some credit for code that can index the collection; more credit for further aspects implemented; and so on, up to producing a full retrieval system that can be evaluated against the test set. Some amount of credit will also be assigned in regard to the elegance of your code, its presentation (i.e. comments, etc), and its ease of use (according to the instructions in the README).
- A short report (as a pdf file), which should NOT EXCEED 3 PAGES IN LENGTH. The report should include a brief description of your system, including the extent of the implementation achieved (this is especially important if you have not completed the entire implementation). The report should also present the performance results you have collected, under different configurations, and any conclusions that you draw from your analysis of these results. Graphs/tables may be used in presenting your results, if this aids exposition.
Subtasks and a Possible Breakdown of Work
To help you break the work down into more manageable portions, the following
notes suggest a subdivision of the work into specific subtasks, and a possible
sequencing thereof. You are not required to follow this suggested work
programme, but you should read the notes anyway (since they contain various
instructions/useful suggestions).
- Command line options/flags: It is strongly preferred that you use command line options to parameterise your code’s behaviour, for specifying input/output files, etc.
This example has options for the stoplist (-s), the collection (-c), and for
naming the index file (-i). The option -I is boolean, serving to ‘switch’ some
aspect of behaviour (e.g. between the code reading a pre-existing index file
or creating a new one). See the lecture slides on using the getopt module. - Stoplist: You might begin by defining a function (or class) to read and store the list of stop words (supplied in stoplist.txt). (Remember to strip linebreaks, or you will fail to match words correctly.) Despite its name, you should not use a list to store the stopwords, as stopword testing must be fast. Instead, Python’s set data structure is a suitable option.
- Tokenisation / preprocessing: You can choose your own approach to tokenising the input, but a simple approach should suffice, e.g. simply extracting the maximal alphabetic sequences from text using a suitable regex, and mapping them to lowercase.
- Stemming: The NLTK library provides an implemention of the Porter stemmer, whose use is illustrated in the following:
- Index data structure: Your inverted index stores the counts of terms in different documents. A suitable data structure is a two-level dictionary, i.e. one which maps terms to (embedded dictionaries that map from) document ids to counts. You will need an indexing function, to count terms in the document collection and populate this data structure.
- Storing / loading the index: In general, a retrieval system’s index is not computed dynamically each time the system is run. Rather the index, once computed, is stored to disk, to be reloaded from there as required. Ideally, your system would also work this way, suggesting the need for a function to print the inverted index out to a file (in some suitable format), and another function to read such a file to populate an empty index.
- One program or two?: One choice is whether to have two separate programs for indexing and retrieval or just one, i.e. one program to index the collection and write the index to file, and another program to read this index in, and use it to retrieve documents, or a single program that does both tasks as different parameterised behaviours. Clearly, however, there is common functionality required by both indexing and retrieval stages (e.g. use of stoplist, tokenisation, etc.), and it is preferable to avoid redundancy in your code.
- Simple boolean retrieval: A good way to test your inverted index is to use it for simple conjunctive boolean retrieval (in contrast to the ranked retrieval we aim to achieve overall). The inverted index records the set of documents that contain any one query term. If we compute the intersection of the document sets for several query terms (e.g. parallel and computing), this should identify precisely the documents containing all of the query terms.
- Computing required values: Various numeric values that derive from the document collection are required later for term weighting and for calculating query/document similarity. Computing these values when the collection is indexed implies the need to store them alongside the index file. Rather than increasing the number of record files that need to be created/reloaded, a simpler approach is to compute these values instead directly from the inverted index.
- Ranked retrieval for single queries: You should now be in a position to write a function to do ranked retrieval for single queries.
- Retrieving for the full query set: Finally, extend your code so that it can ‘batch process’ the entire query set, returning results in the format required by the scoring script.