探索 Twitterverse
信息,完成相关函数的开发。
![Twitterverse](https://upload.wikimedia.org/wikipedia/en/thumb/9/9f/Twitter_bird_logo_2012.svg/100px-
Twitter_bird_logo_2012.svg.png)
Goals of this Assignment
- Practice working with structured files.
- Practice building and using dictionaries.
- Write unittests for a function.
- Use top-down design to break a problem down into subtasks and implement helper functions to complete those tasks.
Note the last goal in the list – in this assignment, you will be doing
significantly more design of functions than in the previous ones. We will give
you a smaller number of functions to write, and you will write helper
functions to solve the subtasks of each of those functions.
Introduction: The Twitterverse
This handout page explains the problem you will be solving, the data you will
be working with, and the tasks you need to complete. Please read it carefully
– you will need to read all of the information here to complete the assignment
successfully.
Twitter is a social networking website where users can post very short
messages known as “tweets”. Each Twitter user can choose to “follow” other
users, which means that they see those users’ tweets. A Twitter user sees the
tweets of users they are “following”, and their tweets are seen by their
“followers” (the users who follow them).
All the “follow” connections define a network among Twitter users, and it’s
quite interesting to look for patterns in the connections. Tools like
Twiangulate ( http://twiangulate.com/search/
) let you explore questions like “what
connections do two users have in common?”. In this assignment, you’ll write
functions that let you ask questions (or “queries”) about a Twitter dataset.
While it is possible to get data directly from Twitter, we have simplified
things for you and provided the data already stored in a file.
Your Tasks
For this assignment, you are required to:
- Write the 6 required functions described later in this handout.
- Write helper functions as needed. All helper functions should have complete docstrings, but there are no other requirements for your helper functions.
- Write unittests for one of the required functions, all_followers.
This is your first experience designing a program of this size. You will have
to do far more planning than in the previous two assignments. It is likely you
will need to break most of the required functions down into subtasks and write
helper functions for those subtasks.
The Twitter Data File
A Twitter data file contains a series of one or more user profiles, one after
the other. Each user profile has the following elements, in this order:
- A line containing a non-blank, non-empty username. You may assume that usernames are unique (i.e. a single username will not occur more than once in the file), and that usernames do not contain any whitespace. Usernames are case sensitive.
- A line for the user’s actual name. If they did not provide a name, this line will be blank.
- A line for the user’s location, or a blank line if they did not provide one.
- A line for the URL of a website, or a blank line if they did not provide one.
- Zero or more lines for the user’s bio, then a line with nothing but the keyword ENDBIO on it. This marks the end of the bio, and is not considered part of it. (You may assume that no bio has the string ENDBIO within it.) If the user did not provide a bio, the ENDBIO line will come immediately after the website line, with no blank line in between.
- Zero or more lines each containing a single username of someone that this user is following, then a line with the keyword END on it. (You may assume that no one has END as their username.) A user cannot follow themselves. You may assume that every user that appears in this section of the file has a user profile in the Twitter data file.
A blank line is one that contains only whitespace characters. For this
assignment, whitespace characters include spaces, tabs, and newlines.
Notice that the keywords ( ENDBIO and END ) act as separators in this file.
All of their letters are capitalized, and the keywords contain no punctuation.
You may assume that no keyword separators will appear as any part of the
content of a user profile.
Examples
Here is a sample user profile that might occur among many in a file:
tomCruise
Tom Cruise
Los Angeles, CA http://www.tomcruise.com
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
ENDBIO
katieH
NicoleKidman
END
The file data.txt in the provided starter files is a small example of a
complete Twitter data file (and was made by hand). The file rdata.txt is a
much larger example (and is made from real data extracted from Twitter). These
should help you confirm your understanding of the file format and will be
useful in testing your program.
Cycles in the data
Although a user cannot follow themselves, there can be “cycles” such as this:
user A can be following B who is following A. This is the shortest possible
cycle. Of course, cycles can be longer.
The Query File
Note that the word “query” just means “question”. In computer science, we use
it to specify a request for information.
For this assignment, a query will be provided in a file. Below we will review
the high-level parts of the query, look at an example, and then describe the
query file format.
Overview
A query has three components: a search specification, a filter specification,
and a sorting specification.
The search specification describes how to generate a list of Twitter
usernames, starting with an initial username (a list of length one) and then
finding their followers or people they are following, then people that are
those people’s followers or who they are following, and so on. When processing
the search specification, don’t try to do anything to avoid cycles. For
instance, if the search specification says to find the people who user A is
following, and from there the people they are following, you could find
yourself back at user A. Don’t try to avoid that.
After processing the search specification, we have a list of Twitter
usernames. Its length could be zero. For example, if the initial username is
‘dianelynnhorton’ and the search specification contains a single ‘followers’
keyword, then the number of results of the search will be zero if
‘dianelynnhorton’ has no followers.
The filter specification describes how to filter the list of usernames
produced by the search specification. The filtering can be based on
- whether or not they are following a particular user,
- whether or not a particular user is their follower,
- whether or not their name contains a particular string (case-insensitive), or
- whether or not their location contains a particular string (case-insensitive).
After processing the filter specification, we have a possibly reduced list of
usernames.
Once the search results have been found and filtered, the sorting
specification describes how the results should be sorted. Prior to the sorting
step, the order of usernames does not matter.
Example query
Here is an example query:
SEARCH
tomCruise
following
following
following
FILTER
following c
location-includes a
SORT
popularity
The search specification in this particular query has four steps.
- Start with a list containing the username to start the search from; i.e. [‘tomCruise’]. Let’s call that list L1.
- The search keyword ‘following’ says to replace each username p in L1 with the usernames of the users who p is following. This yields a new list, L2.
- For the next ‘following’ keyword, we start with L2 and repeat the same operation as in the previous step, yielding another list, L3.
- For the final ‘following’ keyword, we start with L3 and repeat that operation one last time, yielding list L4.
Notice that each step yields a list of zero or more usernames that is the
input to the next step. There should be no duplicates in the final results
list. Duplicates should be removed after each step.
The Twitter data file diagram_data.txt in the provided files contains the
follower/following relationships as represented by this diagram.
For those relationships, the search specification above would yield this list
of usernames: [‘i’, ‘j’, ‘h’, ‘k’, ‘tomCruise’] . Make sure that you can see
how the four lists, ending with this final one, are generated. Notice that the
final list contains the users you can get to in three “steps” of the
“following” relationship, starting from ‘tomCruise’ .
The final list generated by the search specification becomes the input to the
filter specification. For our current example, the filter specification says
that the list should be filtered in this way: a user should be kept only if
they are following user ‘c’ and has a location that includes the string ‘a’ .
Based on the query above, the final filtered list would be [‘tomCruise’] .
The sorting specification says to order the users according to their
popularity.
Now let’s look in detail at the exact format and meaning of a query.
Overall format of a query
The format of the query data file is as described below, in this order:
- A line containing the keyword SEARCH.
- The search specification.
- A line containing the keyword FILTER.
- The filter specification.
- A line containing the keyword SORT.
- The sorting specification.
Notice that the keywords above all act as separators in this file and are in
all capital letters. There are other keywords (such as following which can be
part of a search specification) that are not separators; they are not
capitalized.
Format and meaning of the search specification
The format of a search specification is as shown below, in this order:
- A line containing a single username
- Zero or more lines, in any order, each containing one of:
- the keyword followers
- the keyword following
Each step in the search specification involves creating a new list of users by
going through each user in the current list and finding their followers or who
they are following. Importantly, at each step, we want to create the new
results list by replacing each user from the current list with the users that
they are following or who are their followers.
The possible search operations for a given user are:
- followers: this will get all usernames for users who are following the given user (i.e. that have the given username in the value list paired with the key ‘following’ in their data in the Twitterverse dictionary – more on this later).
- following : this will get all usernames for users who the given user is following (i.e. the value list paired with the key ‘following’ in the data associated with the given user in the Twitterverse dictionary).
Note that, if the list of usernames is [‘A’] , and we perform a ‘followers’ or
‘following’ operation,
‘A’ will not be in the resulting list, because users cannot follow themselves.
If ‘A’ is one of multiple usernames in the current list, then ‘A’ can appear
in the results list, but only if one of the other usernames produces ‘A’ from
the search operation.
The final list of results should not contain any duplicates. Duplicates should
be removed after each operation is performed. The order of the final list of
results from the search operation does not matter.
Format and meaning of the filter specification
The format of a filter specification is:
- Zero or more lines, in any order, each containing one of the following:
- the keyword name-includes, a space, and a string to match
- the keyword location-includes, a space, and a string to match
- the keyword bio-includes, a space, and a string to match
- the keyword follower, a space, and a string which is a valid username
- the keyword following, a space, and a string which is a valid username
Each of the lines has exactly one space within it (the space separating the
keyword from the next string), and each keyword will appear at most once per
filter specification.
The filters are “additive” in that each filter step should be applied to the
list that resulted from the previous filter step. That is, you start with the
list from the search specification, filter it based on the first filter step
to get a new list L1, filter L1 in the second filter step to get L2, and so
on.
The ‘following’ and ‘follower’ filters will add users in the new list if they
are following a particular user, or if a particular user is their follower.
These filters are defined as:
- The ‘following’ filter adds only users who are following provided username.
- The ‘follower’ filter adds only users who the provided username is following.
The filters that have ‘includes’ as part of the keyword will do a simple
substring search on the string representing the relevant data for the user in
the Twitterverse dictionary. This means that if the given string occurs
anywhere in a user’s name (for a ‘location-includes’ filter) or a user’s
location (for a ‘location-includes’ filter), then that user is to be kept in
the list. The substring search should be case-insensitive. For example, if the
filter specifies users whose locations include the string “USA”, then users
with location “USA”, “Modesto, California, USA”, “kansas, usa” or “Musala,
Bulgaria” would be kept, and users with location “United States of America”
would be excluded. This is far from perfect, but don’t try to improve on it.
The order of the final list of results from the filter operation does not
matter.
Format and meaning of the sorting specification
The output from your program must include every user whose username is still
in the list after the search and filter specifications have been processed.
The sorting specification describes how the final list of results should be
ordered. It consists of one line containing one of these keywords: ‘username’
, ‘name’ , ‘popularity’ .
The users in the final results list must be ordered as indicated by the
keyword of the sorting specification. Output that is sorted by username or
name should be in alphabetical order, so that strings starting with ‘a’ are at
the beginning of the output. Although usernames are unique, names are not. If
any users have the same name, sort them by username. When sorting users by
popularity, the most popular users should come first. A user’s popularity is
defined to be the number of followers they have. (Not the number of people
they are following!) If any users are tied for popularity, sort them by
username.
We can use Python’s built-in sort and/or string comparison for sorting by
username. However, we’ll have to find another way to sort by username name or
popularity. There is more information on sorting is later in the “Sorting your
output” section.
Data Structures
To increase the readability of our code, we will use dictionaries to represent
both the Twitter data and the query in our program. In this section, we’ll
look at the format of the data and query dictionaries.
The Twitterverse Data Dictionary
The type of the Twitterverse dictionary is Dict[str, Dict[str, object]]. (We
use the type object here to indicate that the values in the inner dictionary
can have different types.) More specifically, each key in the Twitterverse
dictionary is the username of a Twitter user, and the value associated with
that key is a dictionary containing additional information about the user.
Each of these inner dictionaries will contain the keys the ‘name’, ‘location’,
‘web’, ‘bio’ and ‘following’. The value associated with key is a list of zero
or more strings, and the values associated with the rest of the keys are
strings (and may be the empty string).
For example, if the following user information is included in our Twitter data
file…
tomCruise
Tom Cruise
Los Angeles, CA
http://www.tomcruise.com
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
ENDBIO
katieH
NicoleKidman
END
… then this key/value pair will be in our Twitterverse dictionary.
‘tomCruise’: {‘name’: ‘Tom Cruise’,
‘bio’: ‘Official TomCruise.com crew tweets. We love you guys!\nVisit us at Facebook!’,
‘location’: ‘Los Angeles, CA’,
‘web’: ‘http://www.tomcruise.com‘,
‘following’: [‘katieH’, ‘NicoleKidman’]}
Notice that the newlines at the end of each line are removed from the data
stored in the Twitterverse dictionary, with the exception of the bio
information, which needs to keep its inner newlines so that we could
reconstruct the bio format the user chose. In general, you should remove
leading and trailing whitespace from each line.
The usernames in the following list should be in the same order as in the data
file.
And another example: if the following user information is included in our
Twitter data file (note the blank lines for name, location, and web, and zero
lines for bio and following)…
quietTweeter
ENDBIO
END
… then this key/value pair will be in our Twitterverse dictionary.
‘quietTweeter’: {‘name’: ‘’,
‘bio’: ‘’,
‘location’: ‘’,
‘web’: ‘’,
‘following’: []}
Note that dictionaries are not ordered, so don’t worry if your dictionary
output is in a different order, as long as the items are the same.
The Query Dictionary
We will also use dictionaries to represent queries in our program. The query
dictionary contains three items; the key ‘search’, whose value is a search
specification dictionary, the key ‘filter’, whose value is a filter
specification dictionary , and the key ‘sorting’, whose value is a sorting
specification
The search specification dictionary contains the keys ‘username’ and
‘operations’, whose values are the string representing the username to start
at, and a list of strings representing the operations to perform,
respectively. Note that the list that is the value associated with the key
‘operations’ may be empty, if there is only one line between the SEARCH and
FILTER keywords in the query file. If there is only one line between SEARCH
and FILTER, then it represents the username to start at, and there are no
operations to be performed in the search step. The operations must be
performed in the order that they are in the list.
The filter specification is a dictionary with strings representing filter
keywords as keys, and the strings to match as values. Note that this
dictionary may be empty if there are no filters to apply, and that each filter
keyword will appear at most once in the filter section of a query file.
The sorting specification is a single string value representing the order to
sort the results in.
For example, if our query file looks like this…
SEARCH
tomCruise
following
following
following
FILTER
following
KatieH
location-includes USA
SORT
popularity
… then the query dictionary will have these key/value pairs.
{ ‘search’: {‘username’: ‘tomCruise’, ‘operations’: [‘following’, ‘following’, ‘following’]},
‘filter’: {‘following’: ‘KatieH’, ‘location-includes’: ‘USA’},
‘sorting’: {‘sort-by’: popularity’two big parantheses
The main program
Once you have correctly implemented the functions in
twitterverse_functions.py, execution of the main program
(twitterverse_program.py) will:
- Read a data file and produce the Twitterverse data dictionary.
- Read a query file and produce the query dictionary.
- Compute the results of the search operations.
- Apply the filters to the search results.
- Sort the final results.
You may try running this file on the provided data and query files, as well as
your own files that you have created for testing. Note that just running the
your code. twitterverse_program.py is NOT sufficient to test
Required Testing ( unittest )
Write and submit a set of unittests for the all_followers function in the file
test_all_followers.py . You should create Twitterverse dictionaries as inputs
for your test cases inside the test_all_followers.py file. Do NOT call open in
your test file, or anywhere else in your code.
Files to Download
All of the files included in the download for the assignment are listed in
this section. These files must all be placed in the same folder.
Please download the Assignment 3 files and extract the zip archive. The
following paragraphs explain the files you have been given.
- Python Code
- The starter code for this assignment is in the file twitterverse_functions.py. This is the primary file you will be editing.
- The main program file is twitterverse_program.py. It is complete and must not be changed.
- The starter code for the required unittests is in the file test_all_followers.py. You will edit and submit this file as well.
- We are providing a checker as usual: a3_checker.py. It is complete and must not be changed.
- Sample Twitter Data and Sample Queries
- We have provided a number of text files containing Twitter data, and queries on that data. All files in the provided zip file that contain data in their filename are Twitter data files, and all files in the provided zip file that contain query in their filename are query files. Put these files in the same directory as your other A3 files.
- The query files and the file query1.txt, query2.txt and query3.txt are queries on the data in data.txt, and the file query4.txt is a query on the data in rdata.txt.
We are providing a type-check module that can be used to test whether your
functions in twitterverse_functions.py a3_checker.py have the correct
parameter and return types. To use the type checker, place in the same folder
(directory) as your twitterverse_functions.py and run it.
If the type-checks pass: the output will tell you that the typechecker passed
(and what it means for the typechecker to pass!). If the typechecker passes,
then the parameters and return types match the assignment specification for
each of the functions.
If the type-checks fail: Look carefully at the message provided. One or more
of your parameter or return types does not match the assignment specification.
Fix your code and re-run the tests. Make sure the tests pass before
submitting.
We will do a much more thorough job of testing your code once you hand it in,
so be sure that you have thoroughly tested it yourself. With the functions in
this assignment, there are many more possible cases to test (and cases where
your code could go wrong). If you want to get a great mark on the correctness
of your functions, do a great job of testing your functions under all possible
conditions. Then we won’t be able to find any errors that you haven’t already
fixed!