Introduction
To do this homework, you will need to download the dependency_treebank data
for NLTK (see “Important note” in Lab 5 for how to download NLTK data).
You should ideally also install these packages:
- matplotlib ( http://matplotlib.org/users/installing.html )
- numpy ( http://www.numpy.org/ )
If you have trouble installing these packages, you can post to Piazza for help
or ask course staff. However, it is also possible to do this homework without
these packages – they are only needed for plotting the histogram in Q2, which
you can get a picture of from a classmate if needed. (If you take this route
you will also need to comment out some parts of the homework file where
functions from matplotlib and numpy are imported and called.)
(matplotlib and numpy may already be installed if you are using Anaconda or
some other distribution. Type import matplotlib and import numpy in a Python
console to find out.)
Instructions
This homework comes with a file, hw4.py, which contains places for you to
fill in, marked in lines including “STUDENT” or “XX”.
Completing the questions below involves either completing functions (Q1, Q3,
Q5), or completing code snippets and giving verbal answers (Q2, Q4) that print
out when the file is run in Python. Only the sentences in bold in the
questions below require answering by changing the Python file. (For example,
do not include code in the file to compute the histogram referred to in Q2, as
this will mess with our grading scripts.)
You should rename the file in the format hw4_lastname.py. This file is all
that should be submitted on myCourses.
We should be able to run your script on the command line (= command prompt on
Windows), without errors, using
python hw4_lastname.py
We should also be able to import your script (without errors) when running
python in interactive mode, by typing
import hw4_lastname
after changing the directory to be in the same directory as your script.
Make sure to include a note specifying any collaborators in your submitted HW
(see collaboration policy in the syllabus). This should be in comments at the
top of hw4_lastname.py, or in comments with your submission on myCourses.
Preliminaries
hw4.py contains code to be referred to below, and places for you to fill in.
Run hw4.py in a Python console/interactive mode (e.g., run hw4.py in IPython)
before proceeding.
Part of speech (POS) tagging is one of the basic steps in developing many
Natural Language Processing tools, and is often also a first step in starting
to annotate and analyze a corpus in a new language. In this lab, we will
explore POS tagging and use a (very) simple POS tagger using an already
annotated corpus.
A POS tagset is the set of Part-of-Speech tags used for annotating a
particular corpus. The Penn Tagset is one such tagset which is widely used for
English. Click on the link and have a look at the tagset.
For this homework, we consider a small part of the Penn Treebank POS annotated
data. This data consists of around 3900 sentences, where each word is
annotated with its POS tag using the Penn POS tagset. To access the data, our
code (in hw4.py ) first imports the dependency_treebank from the nltk.corpus
package using the command
from nltk.corpus import dependency_treebank
We then extract the tagged sentences using the following command:
tsents = dependency_treebank.tagged_sents()
tsents contains a list of tagged sentences. A tagged sentence is a list of
pairs, where each pair consists of a word and its POS tag. A pair is just a
“tuple” with two members. Once you’ve loaded hw4.py, tsents[0] contains the
first tagged sentence. tsents[0][0] gives the first tuple in the first
sentence, which is a (word, tag) pair, and tsents[0][0][0] gives you the word
from that pair, tsents[0][0][1] its tag.
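For example, with the standard NLTK dependency_treebank data, you should see
something like:
>>> tsents[0][0]
('Pierre', 'NNP')
>>> tsents[0][0][0]
'Pierre'
>>> tsents[0][0][1]
'NNP'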
Question 1
Complete the code in the tag_distribution function. This function:
- takes as input a list of tagged sentences in NLTK format (like tsents, or tsents[:10] for the first ten sentences)
- returns a dictionary with POS tags as keys and the number of word tokens with that tag as values.
Some sample inputs to test your tag_distribution function with are given in
comments in hw4.py.
This function can then be used to construct a frequency distribution of POS
tags in the Penn Treebank data, by running:
freqDist = tag_distribution(tsents)
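For reference, here is a minimal sketch of one way such a function could be
written (the skeleton in hw4.py may structure this differently):

def tag_distribution(tagged_sents):
    # Count the number of word tokens carrying each POS tag.
    counts = {}
    for sent in tagged_sents:
        for word, tag in sent:
            counts[tag] = counts.get(tag, 0) + 1
    return counts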
Question 2
Using the function plot_histogram, plot a histogram of the tag distribution
with tags on the x-axis and their counts on the y-axis, ordered by descending
frequency. Hint: To sort the items (i.e., key-value pairs) in a dictionary by
their values, you can use:
sorted(mydict.items(), key=lambda x: x[1])
(Pass reverse=True to sorted to put the largest counts first.)
a). Describe the distribution you see in the histogram, in 1-2 sentences.
(What does it say about how frequent different parts of speech are in
English?)
Write code to determine the 5 most frequent POS tags using your frequency
dictionary, and use it in answering this question:
b). What are the 5 most frequent POS tags in this data? Do they agree with
your intuition about what the most frequent parts of speech are in English?
(Briefly explain.)
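One way to find the five most frequent tags, building on the sorting hint
above (a sketch, assuming freqDist has been computed as in Q1):

# Sort (tag, count) pairs by descending count and keep the first five.
top5 = sorted(freqDist.items(), key=lambda x: x[1], reverse=True)[:5]
print(top5)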
Question 3
Construct a conditional frequency distribution (CFD) by completing the code in
the word_tag_distribution function. A CFD is a dictionary whose values are
themselves distributions, keyed by context or condition. (You also constructed
a CFD in Homework 3: a CFD is one kind of “dictionary of dictionaries”.) In
our case, the conditions (i.e., the keys) are words, and each value is a
frequency distribution of the tags for that word.
For example, for the word “book”, the value of your CFD should be a frequency
distribution of the POS tags that occur with book. Once you have completed
word_tag_distribution correctly, you can construct the conditional frequency
distribution for words (and their POS tags) in this corpus by running:
>>> condFreqDist = word_tag_distribution(tsents)
This dictionary should give:
>>> condFreqDist['book']
{'NN': 7, 'VB': 1}
This means that the word “book” occurs 7 times as a noun and once as a verb in
the Penn Treebank sentences.
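A minimal sketch of one way to build such a CFD with plain dictionaries
(hw4.py may organize this differently):

def word_tag_distribution(tagged_sents):
    # Map each word to a dictionary of tag counts for that word.
    cfd = {}
    for sent in tagged_sents:
        for word, tag in sent:
            tag_counts = cfd.setdefault(word, {})
            tag_counts[tag] = tag_counts.get(tag, 0) + 1
    return cfd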
Question 4
Using your CFD, write code to compute the level of tag ambiguity in the
corpus. That is, on average, how many different tags does each word (type)
have? Then, write code to compute the level of tag ambiguity for just the
first 2000 sentences in the corpus, the first 1000, and the first 500. (Note
that this is roughly 50%, 25%, 10% of sentences, as there are 3914 sentences.)
(This involves filling in the lines of hw4.py where tagAmbiguity, etc. are
defined.)
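For instance, the average number of distinct tags per word type could be
computed roughly as follows (a sketch, assuming condFreqDist from Q3; the
variable names in hw4.py may differ):

# Each value of the CFD is a dict of tags for one word type, so its
# length is the number of distinct tags that word was seen with.
tagAmbiguity = sum(len(tags) for tags in condFreqDist.values()) / float(len(condFreqDist))
# For the smaller corpora, rebuild the CFD from a slice, e.g.:
condFreqDist2000 = word_tag_distribution(tsents[:2000])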
a). What is the level of tag ambiguity in the corpus? For the first 50%, 25%,
10% of the corpus?
b). You should observe a pattern in the tag ambiguity numbers: the amount of
tag ambiguity should either steadily increase or decrease as the size of the
corpus is increased. Explain in 1-3 sentences why you might expect this
pattern. (You may also be able to think of reasons why you might expect the
opposite pattern; if so, you could list these as well.)
Question 5
The homework file contains a simple POS tagger, called a unigram tagger, in
the function unigram_tagger. This function takes three arguments:
- The first argument is a conditional frequency distribution, which can be generated using the word_tag_distribution function you completed above.
- The second argument is the most frequent POS tag in the corpus.
- The third argument is a sentence that needs to be tagged.
The goal of this function is to tag the sentence using probabilities from the
CFD and the most frequent POS tag. The function uses a helper function called
ut1, which processes a single word. If the word has been seen (i.e., it is
present in the CFD), ut1 assigns the most frequent tag for that word. For
unseen words (not present in the CFD), ut1 assigns the overall most frequent
POS tag, passed in as the second argument.
For example, running the tagger using “NN” as the most frequent POS tag (after
defining condFreqDist as above), on a test sentence:
unigram_tagger(condFreqDist, 'NN', 'This is a test')
should give this output:
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]
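To make the logic concrete, here is a rough sketch of how such a tagger and
its ut1 helper might look (the actual code in hw4.py may differ in details,
e.g. in how the sentence is split into words):

def unigram_tagger(cfd, most_frequent_tag, sentence):
    def ut1(word):
        # Seen word: assign its most frequent tag in the CFD.
        if word in cfd:
            return max(cfd[word], key=cfd[word].get)
        # Unseen word: fall back to the overall most frequent tag.
        return most_frequent_tag
    return [(word, ut1(word)) for word in sentence.split()]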
a). Why is this called a “unigram tagger”? Your answer (1-2 sentences) should
make reference to the concepts of “unigrams” and “unigram probability”
discussed in class.
Run the tagger to tag the sentences “the bank has money” and “you can bank on
it”. Look at the POS tags.
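For example (with condFreqDist defined as above):
unigram_tagger(condFreqDist, 'NN', 'the bank has money')
unigram_tagger(condFreqDist, 'NN', 'you can bank on it')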
b). What errors are there in the POS tags? What caused the errors? Explain why
an HMM tagger (discussed in class) would probably not make the same error(s)
(if trained on enough data). (Your answer can be up to a paragraph.)