代写NLP作业,对音乐的作者进行分类与识别。作业提供了框架以及相关文档,按照要求一步一步往下写即可。
Core Description
For the core, you will implement a program that creates a model of a music
artist’s lyrics. This model receives lyric data as input and ultimately
generates new lyrics in the style of that artist. To do this, you will
leverage an NLP concept called an n-gram and use an NLP technique called
language modeling.
Your understanding of the linked concepts and definitions is crucial to your
success, so make sure to understand n-grams, language modeling, Python
dictionaries as taught in the warmup, and classes and inheritance in Python
before attempting to implement the core.
The core does not require you to include any external libraries beyond what
has already been included for you. Use of any other external libraries is
prohibited on this part of the project.
Core Structure
In the language-models/folder, you will find four files which contain class
definitions: nGramModel.py, unigramModel.py, bigramModel.py, and
trigramModel.py. You must complete the prepData, weightedChoice, and
getNextToken functions in nGramModel.py. You must also complete the
trainModel, trainingDataHasNGram, and getCandidateDictionary functions in each
of the other three files.
In the root CreativeAI repository, there is a file called generate.py, which
will be the driver for generating both lyrics and music. For the core, you
will implement the trainLyricsModels, selectNGramModel, generateSentence, and
runLyricsGenerator functions; these functions will be called, directly or
indirectly, by main, which is written for you.
We recommend that you implement the functions in the order they are listed in
the spec; start with prepData and work your way down to runLyricsGenerator.
Getting New Lyrics (Optional)
If your group chooses to use lyrics from an artist other than the Beatles, you
can use the web scraper we have written to get the lyrics of the new artist
and save them in the data/lyrics directory for you. A web scraper is a program
that gets information from web pages: ours, which lives in the data/scrapers
directory.
If you navigate to the data/scrapers folder and run the lyricsWikiaScraper.py
file, you will be prompted to input the name of an artist. If that artist is
found on lyrics.wikia.com, the program will make a folder in the data/lyrics
directory for that artist, and save each of the artist’s songs as a .txt file
in that folder.
Explanation of Functions to Implement
prepData
The purpose of this function is to take input data in the form of a list of
lists, and return a copy of that list with symbols added to both ends of each
inner list.
For the core, these inner lists will be sentences, which are represented as
lists of strings. The symbols added to the beginning of each sentence will be
^::^ followed by ^:::^, and the symbol added to the end of each sentence will
be $:::$
. These are arbitrary symbols, but make sure to use them exactly
and in the correct order.
For example, if the function is passed this list of lists:
[ [‘hey’, ‘jude’], [‘yellow’, ‘submarine’] ]
Then it would return a new list that looks like this:
[ [‘^::^’, ‘^:::^’, ‘hey’, ‘jude’, ‘$:::$’], [‘^::^’, ‘^:::^’, ‘yellow’, ‘submarine’, ‘$:::$’] ]
The purpose of adding two symbols at the beginning of each sentence is so that
you can look at a trigram containing only the first English word of that
sentence. This captures information about which words are most likely to begin
a sentence; without these symbols, you would not be able to use the trigam
model at the beginning of sentences because there would be no trigrams to look
at until the third word.
The purpose of adding a symbol to the end of each sentence is to be able to
generate sentence endings. If you ever see $:::$
while generating a
sentence in the generateSentence function, you know the sentence is complete.
trainModel
This function trains the NGramModel child classes on the input data by
building their dictionary of n-grams and respective counts, self.nGramCounts.
Note that the special starting and ending symbols also count as words for all
NGramModels, which is why you should use the return value of prepData before
you create the self.nGramCounts dictionary for each language model.
- For the unigram model, self.nGramCounts will be a one-dimensional dictionary of {unigram: unigramCount} pairs, where each unique unigram is somewhere in the input data, and unigramCount is the number of times the model saw that particular unigram appear in the data. The unigram model should not consider the special symbols ‘^::^’ and ‘^:::^’ as words, but it should consider the ending symbol
$:::$
as a word. The bigram and trigram modles should consider all special symbols as words. - For the bigram model, the dictionary will be two-dimensional. It will be structured as {unigramOne: {unigramTwo: bigramCount_two big parantheses_, where bigramCount is the count of how many times this model has seen unigramOne + unigramTwo appear as a bigram in the input data. For example, if the only song you were looking at was Strawberry Fields Forever, part of the BigramModel’s self.nGramCounts dictionary would look like this:
self.nGramCounts = {
‘strawberry’ : {‘fields’ : 10},
‘fields’ : {‘forever’ : 6, ‘$:::$’ : 4}
}
—|— - For the trigram model, the dictionary will be three-dimensional. It will be structured as {unigramOne: {unigramTwo: {unigramThree: trigramCount_two big parantheses_}, where trigramCount is the count of how many times this model has seen unigramOne + unigramTwo + unigramThree appear as a trigram in the input data.
trainingDataHasNGram
This function takes a sentence, which is a list of strings, and returns True
if a particular language model can be used to determine the next token to add
to that sentence, given the n-grams that the language model has stored in
self.nGramCounts. If not, it returns False.
- For the unigram model, this function returns True if the unigram model knows about any words at all: in other words, its self.nGramCounts dictionary is not empty. This is because a unigram model only looks at the frequency of single words, regardless of their context.
- For the bigram model, this function returns True if the model has seen the last word in the current sentence at the start of a bigram. Hint: which “dimension” of the bigram model’s self.nGramCounts would contain this information?
- For the trigram model, this function returns True if the model has seen the second-to-last and last words in the current sentence at the start of a trigram, in that order.
getCandidateDictionary
This function returns a dictionary of candidate next words to be added to the
current sentence. More specifically, it returns the set of words that are
legal to follow the sentence passed in, given the particular language model’s
training data. So it looks at the sentence, figures out what word the model
thinks can follow the last words in the sentence, and returns that set of
words and counts. Note: when you write this function, you may assume that that
the trainingDataHasNGram function for this specific language model instance
has returned True.
For each n-gram model, this function will look at the last n - 1 words in the
current sentence, index into self.nGramCounts using those words, and return a
dictionary of possible n-th words and their counts. For example, the unigram
model is an n-gram model for which n = 1, so the unigram model looks at the
previous 0 words in the sentence. Therefore, the unigram model sees every word
in its training data as a candidate; in other words, the unigram model version
of getCandidateDictionary should return its entire self.nGramCounts
dictionary. Based on this knowledge, what dictionaries should the bigram and
trigram models return?
Hint: the indexing method you use here will be syntactically very similar to
what you did in trainingDataHasNGram.
weightedChoice
This function takes a dictionary as input and chooses a key in that dictionary
to return, using the dictionary’s values to compute probabilities of each key
being chosen. It will be used in getNextToken as a helper function in choosing
a next word for a randomly generated sentence.
Suppose your input dictionary was this:
{‘north’: 4, ‘south’: 1, ‘east’: 3, ‘west’: 2}
Here’s how to choose a key to return from the above dictionary. First, make
two lists: one to represent all the keys in the dictionary, and one to
represent all the values in the dictionary. Here’s a table where the “token”
column is our list of keys, and the “count” column is our list of values.
For example, using the table above, say we generated 7 as our random number.
Then we see the word at index 2, “east”, is the first word with a cumulative
count greater than 7, since the word’s cumulative value is 8. Therefore, we
return the token at index 2.
getNextToken
This function should be very short. It does three things:
- Call getCandidateDictionary with the current sentence as an argument.
- Pass the return value of getCandidateDictionary to weightedChoice.
- Return whatever weightedChoice returns.
At a high level, the effect of doing this is getting a list of candidate next
words for the current sentence, choosing a next word for the sentence based on
the weights of each candidate word, and then finally returning that chosen
word.
trainLyricsModels
This function takes a name of a directory where an artist’s lyrics are stored
- for example, the_beatles - and loads those lyrics, using an instance of the
prewritten DataLoader class, into a list called dataLoader.lyrics. It then
makes a list of language models and trains those language models using the
text in dataLoader.lyrics. Remember that you wrote functions to train
instances of language models!
This function ultimately returns the list of trained language models.
selectNGramModel
This function takes a trained list of language models and a sentence, which is
a list of strings. It first checks if the trigram model in the list can be
used to pick the next word for the sentence; if so, it returns the trigram
model. If not, it attempts to back off to a bigram model and returns the
bigram model if possible. As a last resort, it returns the unigram model.
Remember that you wrote a function that checks whether or not a particular
NGramModel can be used to pick the next word for a sentence.
generateSentence
This function adds a word one at a time to a sentence until it decides that
the sentence is done. Remember that you wrote a function to select a language
model to use, and another function that picks a next word for a sentence using
a language model!
Determining whether a sentence is done can come about one of two ways: either
the sentenceTooLong function, which is written for you, returns True because
the sentence is too long, or the next token chosen for the sentence is the
special ending symbol $:::$
. If either of these two events happen, the
sentence is done.
This function returns a list of strings representing a sentence. The returned
list should not contain any of the starting or ending symbols!
runLyricsGenerator
This function takes a list of trained language models and calls the functions
you wrote above to generate a verse one, a verse two, and a chorus. Each
verse/chorus is a list containing 4 sentences, where sentences are lists of
strings. Note: when you call generateSentence in this function, you can choose
what value you would like desiredLength (the goal length of the sentence) to
be - play around with different values and see what gets the best output.
After you create the verseOne, verseTwo, and chorus lists, pass those lists
into the printSongLyrics function, which is written for you. This will print
out the song that you created in a nice song-like format.
Explanation of Given Functions
sentenceTooLong
This function takes two parameters, an integer desiredLength and an integer
currentLength. desiredLength is how the length that you would like your
sentence to be. It would be unnatural if all sentences were the same length,
so sentenceTooLong enables your sentence to end at the desired length, plus or
minus a small amount. The random.gauss function generates a random number
close to currentLength; if the random number is larger than desiredLength, the
function returns True. Otherwise the function returns False.
Increasing the value of STDEV in this function will increase the randomness,
leading to more varying sentence lengths.
printSongLyrics
This function takes three parameters which are lists of lists of strings:
verseOne, verseTwo, and chorus. It then prints out the song in this order:
verse one, chorus, verse two, chorus.
getUserInput
This function takes three parameters: teamName, which should be the name of
your group; lyricsSource, which should be the name of the artist that you’re
generating lyrics for; and musicSource, which should be the name of the source
from which you got your music data for the reach.
The function returns a user’s choice between 1 and 3, looping while the user
does not input a valid choice. Choice 1 is for generating lyrics; choice 2 is
for generating music; and choice 3 is to quit the program.
main
This function first trains instances of language models on the lyrics and
music data by calling the trainLyricsModels and trainMusicModels functions.
Then, it calls getUserInput and uses the return value of that function to
either generate new lyrics by calling runLyricsGenerator, or generate a song
by calling runMusicGenerator. Note that the trainMusicModels and
runMusicGenerator functions don’t need to be touched for the core.
At the beginning of main there are several string variables to hold your
group’s name, the name of the artist you’re using, etc. Make sure to update
these values with your team’s name and your choices of data.
How to Test Your Program
At the bottom of every file that you will edit for this project, you will find
this line:
if name == ‘main‘:
—|—
Below this line, you should add any testing code that you need. In fact, we
have already provided you with a few lines for testing in each file - add to
these as you develop your project, so you can ensure that your individual
functions work before putting it all together.
To run your tests, just run the file that contains the tests as a Python
program. For example, if you’re writing tests for unigramModel.py, you can
either run that file as a Python program in PyCharm, or navigate to the
directory where it’s stored via the command line and type:
python unigramModel.py
Note that in generate.py, the main function is called for you - this is the
driver program that will ask you to input an option for either generating
lyrics, generating music, or quitting the program. You probably will not want
main to run when you’re testing smaller functions in generate.py, so you can
comment main out for testing purposes. Just make sure to uncomment it and test
it as a whole before submitting your final product.
Tips for Speeding Up Your Program
If your program is taking a long time to load the data and train the models,
it’s likely that inefficiencies in your code are slowing down your program.
The most common cause of inefficiency is too many nested loops in your
trainModel functions. For example, if you have 10 words, and you run through
the words once for each word in the list (i.e. 10 times), that will be 100
steps total, which is not too bad. But if you have 10,000 words in the
dataset, and you look at each one 10,000 times, then that will be 100,000,000,
which is bad.
Each version of the trainModel function can be written correctly with at most
two levels of nested for loops, and a typical program should not take more
than around 30 seconds to load. Try experimenting with different loop
structures if your program is taking too long to load.
How to Run Your Program to Generate Lyrics
If you are using PyCharm, open generate.py and click “Run…” in the top
navigation bar. If you are working from the command line, navigate to the root
directory where your CreativeAI project is stored and type:
python generate.py
Even if you have not implemented any of the functions in the project, the
starter code should work out of the box. Therefore, you can play around with
it and get a feel for how the driver in main works.