This assignment consists of three parts: Regular Expressions, Text Statistics, and Word-sequence
Analysis. Complete the corresponding functions according to the requirements below.
Regular expressions
Part 1
Many words beginning with the letters “sl” have related meanings. Consider,
“slip”, “slide”, “slosh”, “slick”, and “slather”, for example. The “sl” root
comes from Proto-Indo-European, the proposed language ancestor of Greek,
Latin, and Sanskrit.
We will limit ourselves to words beginning with “sl”. While that will miss words
like “re-slide”, it avoids words like “island”.
Write a function findall_sl(text) that takes a text string, searches it
with re.findall(), and returns the result. You must supply the appropriate RE
argument to re.findall() so that it searches for words beginning with
“sl”. The “s” may be capitalized or not.
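A minimal sketch of one possible RE, assuming \b word boundaries are acceptable. Note one limitation: the hyphen in “re-slide” creates a word boundary, so this version would still yield the partial match “slide” there; a stricter boundary could exclude that.

```python
import re

def findall_sl(text):
    # \b anchors at a word boundary, so "island" is not matched;
    # [sS] allows an optional capital, and \w* takes the rest of the word.
    return re.findall(r"\b[sS]l\w*", text)
```

For example, findall_sl("slip and Slide on the island") returns ["slip", "Slide"].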
Part 2
Write a function findall_triple_vowel(text) that takes a text string,
searches it with re.findall(), and returns the result. You must supply the
appropriate RE argument to re.findall() so that it searches for anything
(e.g., words, abbreviations, and Roman numerals) that contains three or more
consecutive, identical vowels. This search should be case-insensitive.
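One way to sketch this without backreference groups (which would change what re.findall() returns) is to spell out a triple of each vowel as alternatives. “Three or more” is still covered, because any longer run contains a triple and the surrounding \w* absorbs the rest:

```python
import re

def findall_triple_vowel(text):
    # \w* on both sides captures the whole token; the alternation lists a
    # triple of each vowel, and IGNORECASE makes "III" match "iii".
    return re.findall(r"\w*(?:aaa|eee|iii|ooo|uuu)\w*", text, re.IGNORECASE)
```

For example, findall_triple_vowel("He screamed aaargh during act III") returns ["aaargh", "III"].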
Part 3
We want to find anything that relates to the 1980s. Create a RE to search for
references to the decade or its years.
Among the things you would want to find are “1984”, “‘80s”, and “eighties”.
You are not expected to search for terms relating to things that happened
during the decade, such as the Berlin Wall being torn down. Only search for
explicit references to the years.
Think about what search terms are relevant, and combine them into one RE.
There is a lot of variation in solutions on this problem because of the loose
specification.
Write a function findall_80s(text) that takes a text string, searches it
with re.findall(), and returns the result. Allow the beginning of the search
term to be capitalized. You must supply the appropriate RE argument to
re.findall() so that it searches for such terms.
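Because the specification is loose, many REs are acceptable. One sketch that catches the three examples above (the exact set of variants is a design choice; if your texts use curly apostrophes, add those to the pattern as well):

```python
import re

def findall_80s(text):
    # 198\ds?     -> the years 1980-1989, plus "1980s"
    # '?80s       -> "'80s" or "80s"
    # [Ee]ighties -> "eighties", with an optional capital
    return re.findall(r"198\ds?|'?80s|[Ee]ighties", text)
```

For example, findall_80s("Back in 1984 the '80s and the Eighties") returns ["1984", "'80s", "Eighties"].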
Text statistics
Combine your file reading, word filtering, and statistics functions from
previous exercises and assignments to examine some properties of text files.
For simplicity, we’ll consider any word, abbreviation, or number to be a word.
Furthermore, consider words to be case-insensitive, so that “the” and “The”
are counted as the same word. However, we will not attempt any stemming, so
“spell”, “spells”, and “spelling” are all considered distinct.
Part 4
Write a function count_distinct_words(filename) that returns a count of
the number of distinct words that occur in the provided text file.
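A sketch, assuming a word is any maximal run of word characters (the word-finding RE from your previous exercises may differ). The helper count_distinct_in_text is a hypothetical name introduced here so the counting logic can be exercised without a file:

```python
import re

def count_distinct_words(filename):
    with open(filename) as file:
        return count_distinct_in_text(file.read())

def count_distinct_in_text(text):
    # Hypothetical helper: lowercase first so "the" and "The" collapse,
    # then count the distinct tokens with a set.
    words = re.findall(r"\w+", text.lower())
    return len(set(words))
```

For example, count_distinct_in_text("The the cat sat. Cat!") returns 3 (the, cat, sat).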
Part 5
Write a function median_word(filename) that returns a word whose number of
occurrences is the median of all the words’ occurrence counts. (As before in
the course, use the lower median.) Note that multiple words can have the
median number of occurrences; you should report only one of them. If the file
contains no words, the function should return None.
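A sketch using the same word convention as Part 4. The helper median_word_of_text is a hypothetical name introduced so the median logic can be exercised without a file:

```python
import re
from collections import Counter

def median_word(filename):
    with open(filename) as file:
        return median_word_of_text(file.read())

def median_word_of_text(text):
    # Hypothetical helper: count case-insensitive word occurrences,
    # then find the lower median of the counts.
    counts = Counter(re.findall(r"\w+", text.lower()))
    if not counts:
        return None
    ordered = sorted(counts.values())
    median = ordered[(len(ordered) - 1) // 2]  # lower median
    for word, count in counts.items():
        if count == median:
            return word
```

For example, in the text "a a a b b c" the counts are 3, 2, and 1, the lower median is 2, and "b" is returned.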
Word-sequence analysis
In class, you generalized your original word-counting program to
word-sequence (i.e., n-gram) counting. Also in class, you wrote code to find word
frequencies and word successors. In the following exercises, you’ll combine
all these ideas, plus further generalize them, to determine the frequencies of
word-sequence successors, also known as a Markov chain.
These problems take a list of filenames, rather than just a single filename,
so that we can train a Markov chain on multiple texts. To be more specific,
let’s say one file contains “a b c” and another contains “d e f”. The
resulting Markov chain should contain the n-grams for both separate files.
However, it should not contain any n-grams like “c d” that would result from
concatenating the file contents.
These problems have two additional parameters that the in-class exercises
didn’t. One indicates whether we should treat punctuation like words. The
motivation for this is that when analyzing an author’s style, we would want to
look not only at the word usage but also at the punctuation usage.
Another indicates whether we should treat words as case-sensitive or not. The
in-class exercises treated, for example, “the” and “The” as distinct words,
but we might also want to treat them as equivalent.
Again, you’ve written most of the necessary code before. You now need to
combine the pieces appropriately. We strongly encourage you to decompose your
code into smaller useful functions. Use the same word-finding RE given in the
video.
On the next assignment, you’ll use Markov chains to generate text.
Part 6
Define a function wordseq_successor_counts(filename_list, seq_size,
include_punc, is_case_sensitive). It returns a default dictionary, where
each key is a seq_size-element tuple of words. Each key’s value is a Counter
mapping distinct successor words to their counts.
For example, in comp130_EightDaysAWeek.txt, the phrase “love babe” occurs
eight times. Six times it is followed by a comma, and twice by “just”. If the
sequence size is two and we are counting punctuation, then ("love", "babe")
should be a key that maps to Counter({"," : 6, "just" : 2}).
Hints: Start with your code for wordseq_counts_file() and modify it,
rather than calling it. At first, work on a simplified version that ignores
the last two parameters, then see where you need to add conditionals for those
arguments.
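One way to structure such a solution is sketched below. The token RE is an assumption standing in for the one given in the video, and add_text is a hypothetical helper that tallies one text at a time, so that n-grams never span file boundaries:

```python
import re
from collections import defaultdict, Counter

# Assumed token RE: either a run of word characters (with apostrophes)
# or a single punctuation mark, so punctuation can be treated like words.
TOKEN_RE = r"[\w']+|[.,!?;:]"

def wordseq_successor_counts(filename_list, seq_size, include_punc,
                             is_case_sensitive):
    successors = defaultdict(Counter)
    for filename in filename_list:
        with open(filename) as file:
            add_text(successors, file.read(), seq_size,
                     include_punc, is_case_sensitive)
    return successors

def add_text(successors, text, seq_size, include_punc, is_case_sensitive):
    # Hypothetical helper: tally one text's n-gram successors into
    # `successors`, so each file contributes only its own n-grams.
    if not is_case_sensitive:
        text = text.lower()
    tokens = re.findall(TOKEN_RE, text)
    if not include_punc:
        tokens = [t for t in tokens if re.fullmatch(r"[\w']+", t)]
    # Slide a seq_size window over the tokens; count each window's successor.
    for i in range(len(tokens) - seq_size):
        key = tuple(tokens[i:i + seq_size])
        successors[key][tokens[i + seq_size]] += 1
```

With the example files containing "a b c" and "d e f" and a sequence size of one, the result has entries for ("a",), ("b",), ("d",), and ("e",), but no ("c",) key, since "c" has no successor within its own file.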
Part 7
Define a function wordseq_successor_frequencies(filename_list, seq_size,
include_punc, is_case_sensitive). It returns a default dictionary, where
each key is a seq_size-element tuple of words. Each key’s value is a regular
(non-default) dictionary mapping distinct successor words to their
frequencies, i.e., their percentage of occurrence.
Continuing our example, ("love", "babe") should be a key that maps to the
dictionary {"," : 0.75, "just" : 0.25}.
This function should call the previous function, then create a new dictionary
where the inner dictionaries (mapping strings to frequencies) are created from
the corresponding Counters (mapping strings to counts).
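The counts-to-frequencies conversion can be sketched on its own. Assuming a Part 6 function exists, wordseq_successor_frequencies would call it and pass the result through a helper like this (frequencies_from_counts is a hypothetical name):

```python
from collections import defaultdict, Counter

def frequencies_from_counts(counts):
    # Hypothetical helper: convert each Counter of successor counts into a
    # regular dict of successor frequencies (fractions that sum to 1).
    frequencies = defaultdict(dict)
    for key, counter in counts.items():
        total = sum(counter.values())
        frequencies[key] = {word: n / total for word, n in counter.items()}
    return frequencies
```

Applied to the running example, Counter({"," : 6, "just" : 2}) becomes {"," : 0.75, "just" : 0.25}.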