Practice basic Python syntax and the use of lists and strings.
![List](https://upload.wikimedia.org/wikipedia/commons/thumb/3/37/Singly_linked_list.png/220px-Singly_linked_list.png)
Requirement
Please read this handout on strings before you start. To submit your solution,
compress all files together into one .zip file. Login to handins.ccs.neu.edu
with your Khoury account, click on the appropriate assignment, and upload that
single zip file. You may submit multiple times right up until the deadline; we
will grade only the most recent submission.
Your solution will be evaluated according to our grading rubric. The written
component accounts for 15% of your overall HW5 score; the programming
component for 85%. We’ll give written feedback when appropriate as well;
please come to office hours with any questions about grading.
You are permitted two “late day” passes in the semester; each one allows you
to submit a homework up to 24 hours late. You may apply both to this homework, or
split them between two homeworks. The Hand-in Server automatically applies one
late-day token for every 24 hours (or part of 24 hours) you submit past the
deadline.
Written Component
- Filename: written.txt
Please open a plaintext file (you can do this in IDLE or any plaintext editor
you like, such as TextEdit or NotePad) and type out your answers to the
questions below. You can type your answers into Python to confirm, but answer
the questions first!
Written #1
For each of the Python snippets below, what will be printed to the terminal?
Be specific about linebreaks and indentation where applicable.
s = "tug"
t = "boat"
print(2 * (s + t))

s = "tug"
t = "boat"
print(s + (2 * t))

s = "tug"
t = "boat"
print('bark' in 2 * (s + t))
Written #2
Write one line of Python code that will return, from the string s, every third
letter working backwards from the end. Here are some examples – the same one
line of code should work on all of them:
s = "madamimadam"
# should return "mama"

s = "helloworld"
# should return "dolh"

s = "racecar"
# should return "rer"
Written #3
For the string s = "boston red sox", what does each of the Python expressions
produce? If it would produce an error, specify what type of error.
s[11:13]
s[11:14]
s[11:15]
Programming Component (85% of this HW)
We’re branching out from the last homework – go ahead and use any loops you
like, and any list/string/tuple functions we’ve covered in class, or that
you’ve found yourself. Here are some useful ones (you may or may not need them
for this HW, but they’re always good to know!):
- in: Tells you whether a list contains a value
- split: Turns a string into a list
- join: Turns a list into a string
- index: Returns the position in a list of the first instance of a given value
You may not use any other data structures, like dictionaries. Stick with
list/string/tuples for now and we’ll cover dictionaries in class soon!
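As a quick refresher, the four operations listed above can be tried on a throwaway sentence. This is only an illustrative sketch; the sentence and variable names are made up and are not part of the starter code.

```python
sentence = "to be or not to be"

# split: turn a string into a list, cutting at each space
words = sentence.split(" ")

# in: check whether the list contains a value
has_not = "not" in words
has_maybe = "maybe" in words

# index: position of the FIRST instance of a value
first_be = words.index("be")

# join: glue a list of strings back into one string
dashed = "-".join(words)

print(words)               # ['to', 'be', 'or', 'not', 'to', 'be']
print(has_not, has_maybe)  # True False
print(first_be)            # 1
print(dashed)              # to-be-or-not-to-be
```

Note that index raises a ValueError when the value is not in the list, so it is usually worth checking with in first.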
Programming #1
- Filenames:
  - classify_functions.py (your functions)
  - test_classify_functions.py (test at least one of your functions)
  - classify99.py (your driver)
  - classify_data.py (starter code)
This week we’re delving into data mining! One of the most useful techniques in
the data science world is called classification. A classification algorithm
works as follows:
- We decide ahead of time on several categories.
- We start with a training set, where we know which category each member of the set belongs to.
- We then work on a testing set. We don’t know which category a testing value belongs to, but we attempt to find the most-similar category based on the values already there.
- We check to see if we were correct. But even if we were wrong, we have new information! The testing value gets attached to the category it belongs to and hopefully we can be more and more accurate as time goes on.
For this homework, we’re going to do the first three steps. Here’s our
classification setup for this homework:
- Categories: Jake, Rosa, Holt, and Gina.
- Training set (in the starter code): This data has already been classified into the four categories.
  - For each person, calculate their five most-frequently-used words (they don’t need to be ranked or differentiated, just a list of 5 things they say most frequently).
- Testing set (also part of the starter code, a list called TESTING):
  - The quotes in the test set are not yet classified; your algorithm will figure out which category they belong to.
  - If a quote in the test set is an exact match for a quote in the training set, that’s your person!
  - Otherwise, compare the test quote to the most-frequently-used words of each category in the training set. Whichever category has the most words in common with the test quote, that’s your person! If there’s a tie, or the test quote has no words in common with anyone, pick a person at random.
- The driver:
  - There’s no user interaction; just print the results for the test set, one after the other.
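The decision rule described above (exact match first, then most shared words, then a random pick among ties) might be sketched roughly as follows. Everything here is an assumption made for illustration: the layout of the training data as (name, quotes) tuples, the precomputed favourites list, the placeholder names and quotes, and the choice to compare exact matches case-insensitively. The real starter code may be organized differently.

```python
import random


def classify(quote, training, favourites, stopwords):
    """Guess which person said `quote`.

    training   -- list of (name, list_of_quotes) tuples      (assumed layout)
    favourites -- list of (name, list_of_top_words) tuples   (assumed layout)
    """
    quote = quote.lower()

    # Step 1: an exact match wins outright (stopwords are NOT skipped here).
    for name, quotes in training:
        for known in quotes:
            if known.lower() == quote:
                return name

    # Step 2: count how many of each person's favourite words the quote uses.
    test_words = [w for w in quote.split(" ") if w and w not in stopwords]
    scores = []
    for name, top_words in favourites:
        shared = 0
        for word in top_words:
            if word in test_words:
                shared += 1
        scores.append(shared)

    # Step 3: pick at random among everyone tied for the best score.
    # "No words in common with anyone" is simply a tie at zero.
    best = max(scores)
    tied = [favourites[i][0] for i in range(len(scores)) if scores[i] == best]
    return random.choice(tied)


# Made-up placeholder data, NOT the real starter code.
TRAINING = [("ana", ["cool cool cool", "no doubt"]),
            ("ben", ["paperwork is important"])]
FAVOURITES = [("ana", ["cool", "doubt", "title", "sleuth", "noice"]),
              ("ben", ["paperwork", "important", "binder", "rules", "order"])]
STOPWORDS = ["is", "the", "a", "in"]

print(classify("No doubt", TRAINING, FAVOURITES, STOPWORDS))
print(classify("the binder is in order", TRAINING, FAVOURITES, STOPWORDS))
```

Because every branch returns a name, each quote always lands in exactly one category.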
Testing Requirements
- Your functions will be defined in the file classify_functions.py
- You must submit test code for at least one function. Obviously, they should all work! But we’ll run your test file and look for expected/actual comparisons on one of your functions.
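One possible shape for a test file with expected/actual comparisons is sketched below. The function under test, count_shared, is a hypothetical helper defined inline so the example runs on its own; in a real submission the function would presumably be imported from classify_functions.py instead.

```python
def count_shared(words, favourites):
    """Hypothetical helper: how many of `favourites` appear in `words`."""
    shared = 0
    for word in favourites:
        if word in words:
            shared += 1
    return shared


def check(description, actual, expected):
    """Print one expected/actual comparison and report whether it passed."""
    status = "PASS" if actual == expected else "FAIL"
    print(status, "-", description, "| expected:", expected, "| actual:", actual)
    return actual == expected


def run_tests():
    results = [
        check("two shared words", count_shared(["a", "b", "c"], ["b", "c", "d"]), 2),
        check("nothing shared", count_shared(["a"], ["b"]), 0),
        check("empty quote", count_shared([], ["b"]), 0),
    ]
    return all(results)


run_tests()
```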
Helpful Hints
- The starter file has a list of “stopwords”, which should be skipped in both the training and testing data when comparing frequently-used-words, but NOT when looking for an exact match.
- Every test quote needs to be put into ONE category. “We couldn’t find a match” is not a good result for this one.
- All of the training set words and testing set words should be converted to lowercase (or all to uppercase… either way, as long as you’re consistent. The word Charles should be the same as the word charles)
- The only separator of words is a space. Punctuation has been removed from the training set, and you can assume the user doesn’t type any punctuation either.
- There are still some apostrophes in there, and you can simply treat them as ordinary letters. (The word it’s is different than its for example.)
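The hints above suggest a small word-cleaning step followed by a frequency count. A minimal sketch, assuming no dictionaries are allowed and using hypothetical function names (clean_words, most_frequent) and made-up sample quotes:

```python
def clean_words(quote, stopwords):
    """Lowercase a quote, split it on spaces, and drop stopwords.

    Apostrophes are left alone, so "it's" and "its" stay different words.
    """
    words = []
    for word in quote.lower().split(" "):
        if word != "" and word not in stopwords:
            words.append(word)
    return words


def most_frequent(quotes, stopwords, how_many=5):
    """Return up to `how_many` of the most-used words across `quotes`."""
    all_words = []
    for quote in quotes:
        all_words += clean_words(quote, stopwords)

    # Build a list of distinct words without using a dictionary or set.
    unique = []
    for word in all_words:
        if word not in unique:
            unique.append(word)

    # Sort the distinct words by how often each appears, most frequent first.
    unique.sort(key=all_words.count, reverse=True)
    return unique[:how_many]


print(clean_words("Charles it's THE best", ["the"]))
# ['charles', "it's", 'best']
print(most_frequent(["the dog barks", "the dog sleeps", "a dog barks"], ["the", "a"], 2))
# ['dog', 'barks']
```

Counting with list.count inside a sort is slow for large inputs, but it keeps the sketch within the list/string/tuple restriction.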
Example of running the program (for the first three test sentences; there are
more in the starter code)
Like most AI-type algorithms, this one is imperfect. If you know the show,
you’ll see that it won’t correctly place every quote from the testing set.
That’s OK! Overall, you should see pretty good results.
AMAZING points: These final two points may be awarded if you’ve completed the
rest of the assignment perfectly and blown us away with…
- For each test quote you place into a category, tell the user the stats, too – whether it was an exact match; or if not, how many words the quote had in common with the “winning” person.
Programming #2
- Filenames: driver must be in speller.py, but any other files you need are ok too.
- Starter code: search_log.py
We all interact with spellcheckers all the time, and some of us even rely on
them maybe a little too much. You’d think that spell correction algorithms
would rely on dictionaries as a source of correct spellings, but sometimes
they don’t. For a search engine doing spelling correction, for example,
dictionaries don’t quite do the job because we search for proper names and
nicknames and titles so much.
So instead, some spell corrections rely on users. If people misspell a word,
they’ll get it wrong a bunch of different ways, but there’s only one way to
get it right – so the most common way the word is typed is often the correct
one. For example, if you dig into Google’s query logs from 2007, you might see
3 instances of “Brittany Spears”, 3 of “Britany Spears”, 4 of “Britney
Speares” and 10 of the correct “Britney Spears”. (This would be for like 5
minutes’ worth of data. We were obsessed with Britney in ’07.)
For this part of the homework, you’ll use query logs related to some of the
most popular search terms on Google in 2019. We’ll work with one word at a
time, and not worry about sentences or phrases.
You’ll prompt the user for one of the top search words, and, if it’s
misspelled, find the likely correction for it. You’ll use the following:
- Hamming distance: The number of letters where two words differ (a blank space vs an actual letter counts as a difference). The Hamming distance between northeastern and northwestern is 2. The Hamming distance between hi and hello is 4.
- Convergence of the correct word: The query logs you’ll start with reflect what we said above – a word is misspelled a bunch of different ways, but spelled correctly the plurality of the time.
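The Hamming distance as defined above (with missing positions counted as differences) could be computed along these lines. The function name is an assumption; the two printed cases are the examples given in the definition.

```python
def hamming_distance(first, second):
    """Count the positions at which two words differ.

    Positions that exist in only one of the words (a blank space versus
    an actual letter) each count as one difference.
    """
    # Every extra letter in the longer word is a difference.
    differences = abs(len(first) - len(second))

    # zip stops at the end of the shorter word, so only the
    # overlapping positions are compared here.
    for letter_a, letter_b in zip(first, second):
        if letter_a != letter_b:
            differences += 1
    return differences


print(hamming_distance("northeastern", "northwestern"))  # 2
print(hamming_distance("hi", "hello"))                   # 4
```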
Requirements
- Start by prompting the user to enter a word, one from the top 2019 searches (“Antonio”, “Brown”, “Nipsey”, “Hustle”, “Hurricane”, “Dorian”, “Disney”, “Plus”).
- You can assume that the user gives you good input – they either type one of those words, or they’re trying to, so it’s a legit misspelling and not a completely different word.
- Find the top-two contenders (the two words with the smallest Hamming distance), from the search_log.py starter file listed above. You can copy/paste the list of words, or import the file.
- Of the top two, the one that appears most frequently in the search logs is probably the winner, so spit it back out with a “did you mean…”
- If the user spelled the word correctly, don’t offer a correction.
- Note that Hamming distance is not a perfect predictor, so your algorithm won’t work all the time, and that’s expected – no spell correction works in every instance. We’ll expect it to work for most reasonable misspellings, though, including for words that don’t appear in the logs.
- Convert the user’s input to lowercase, so they can enter Antonio or antONIO and it’s all the same.
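Putting the requirements together, one possible sketch of the correction step is shown below. SAMPLE_LOG is a made-up stand-in for the real query log in search_log.py, the function names are hypothetical, and treating "the winner equals the input" as "spelled correctly" is an assumption about how to meet the no-correction requirement.

```python
def hamming_distance(first, second):
    """Differing positions, counting missing letters as differences."""
    differences = abs(len(first) - len(second))
    for letter_a, letter_b in zip(first, second):
        if letter_a != letter_b:
            differences += 1
    return differences


def suggest(typed, log):
    """Return the likely correction for `typed`, or None if it looks right."""
    typed = typed.lower()
    log = [entry.lower() for entry in log]

    # Distinct spellings seen in the log, in order of first appearance.
    candidates = []
    for entry in log:
        if entry not in candidates:
            candidates.append(entry)

    # Top two contenders: the smallest Hamming distances to the input.
    candidates.sort(key=lambda word: hamming_distance(typed, word))
    top_two = candidates[:2]

    # Of those two, the spelling that appears most often in the log wins.
    winner = max(top_two, key=log.count)
    if winner == typed:
        return None
    return winner


# Made-up sample data, NOT the contents of the real starter file.
SAMPLE_LOG = ["disney", "disney", "disney", "disnee", "dinsey",
              "plus", "plus", "pluss", "brown", "brown", "browne"]

for attempt in ["disny", "plos", "Brown"]:
    correction = suggest(attempt, SAMPLE_LOG)
    if correction is None:
        print(attempt, "looks right to us")
    else:
        print("Did you mean", correction + "?")
```

With this sample log, "plos" never appears in the data but still resolves to "plus", because "plus" is both the closest spelling and the more frequent of the top two.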
Sample Program (a few runs of it – note that plos is not in the logs but we
correct it correctly; brown is spelled right in the first place so we don’t
offer a correction):
AMAZING points: These final two points may be awarded if you’ve completed the
rest of the assignment perfectly and blown us away with…
- No single function, including main, is longer than 30 lines.