Introduction
用Python代写一个Hashing的程序,对自然语言进行统计排序,由于是自然语言,存在大量的标点符号,以及大小写的处理。
给的测试文件篇幅较大,都是几千字的,调试还是比较花时间的。
此外还需要找到合适的Hashing大小,使得collision最小,这个就是对n进行循环,找到collision最小的那个数即可。
Requirement
Write a program that uses hashing for the following problem. Given a natural
language text, generate a table of distinct words with the number of
occurrences of each word in the text. A word is defined as a series of
alphanumeric characters. Note that punctuation and spaces can both (separately
or together) delineate a word. There are some hyphenated words in the text;
for these words there will be no space before or after the hyphen and the
hyphen should be considered part of the word. Capitalized versions of a word
are considered the same as the lowercase version, so feel free to change all
words to either upper- or lowercase. Plural possessives and contractions
should remain and be counted as a unique word (such as can’t or widows’).
You must use the provided input file, which contains the first chapter of
Tolkien’s “The Hobbit” found on herePreview the documentView in a new window.
Your resulting program will be an interactive one allowing the user to type in
a word (that may or may not appear in the table) and receive the number of
occurrences of that word in the text until the user wishes to quit.
The challenge
Find the combination of hash function and collision resolution that will
minimize the collisions. A collision is defined as an attempt to store a new
word in a location already occupied by another word. Consequently, it is
possible for a new word to collide more than once before it is finally stored.
To qualify for full credit on the assignment, you must have fewer than 625
collisions and your table size must be appropriate for the number of words
(see below) and the collision resolution used. Your instructor was able to
reach 514 collisions simply using a version of one of the hash functions
discussed in class and linear probe. The student(s) with the least collisions
and proper table size will receive 5 bonus points.
Hints
There are 663 unique words in the file (using the above definition of a unique
word). There are 1976 words in total.
Program description
Your program should hash the values from the file and then print (neatly
formatted) the number of collisions, number of unique words and total number
of words to the standard output (screen) before allowing the user to input a
word to see the number of times it appeared. The user should be able to input
as many words as desired before choosing to quit.
Suggestions & Notes
Reading in one word at a time from the file, create a function that pre-
processes a word, changing capital letters to lowercase and removing any
quotation marks or other unnecessary punctuation that may have been read from
the file. Then send the pre-processed string into your hash function.
When counting collisions, do not count additional instances of a word. So you
will want to keep a “temporary” collision count for a word until you know if
the word is already in the table.
Note that the table size must be “reasonable” and depends upon the number of
entries expected and the collision resolution used. Using a larger table than
required by these factors (simply to reduce collisions) will mean a deduction.
The final load factor must be between 0.5 and 0.7.
Get your program up and running by using one of the hash functions discussed
in class and a simple collision resolution (such as linear probe). Make sure
you are pre-processing your words correctly, etc. before attempting to
minimize the collisions.