Georgia Tech的Data and Visual Analytics的作业,还是和上学期这门课一样的难。工程量巨大,SQlite, Python,
JavaScript以及第三方的软件都有涉及到。
Part 1: Collecting and visualizing Twitter data
Q1
You will use the Twitter REST API to retrieve (1) followers, (2) followers of
followers, (3) friends and (4) friends of friends of a user on Twitter (a
Twitter friend is someone you follow and a Twitter follower is someone who
follows you).
a. The Twitter REST API allows developers to retrieve data from Twitter. It
uses the OAuth mechanism to authenticate developers who request access to
data. Here’s how you can set up your own developer account to get started:
Check the developer agreement checkbox and click on ‘Create your Twitter
application’. Once your request is approved, you can click ‘Keys and Access
Tokens’ to view your ‘API key’ and ‘API secret’. You will also need to
generate your access token by clicking the ‘Create my access token’ button.
After this step, you are ready to make authenticated API calls to fetch data.
Important notes and hints:
- Twitter limits how fast you can make API calls. For example, the limit while making GET calls for friends is 15 requests per 15 minutes.
- Refer to the rate limits chart for different API calls.
- Set appropriate timeout intervals in the code while making requests.
- An API endpoint may return different results for the same request.
b. Search for followers of the Twitter screen name “PoloChau”. Use the API to
retrieve the first 10 followers. Further, for each of them, use the API to
find their 10 followers. - Read the documentation for getting followers of a Twitter user.
- You code will write the results to followers.csv.
- Grading distribution is given in the boilerplate code.
Note: followerscreenname represents the source and username represents
the target for an edge in a directed graph. You will be adding these column
headers to the CSV file in a later question.
c. Search for friends of the Twitter screen name “PoloChau “. Use the API to
retrieve the first 10 friends. Further, for each of the 10 friends, use the
API to find their 10 friends. - Read the documentation for getting friends of a Twitter user.
- You code will write the results to friends.csv.
- Grading distribution is given in the boilerplate code.
Note: username represents the source and friendscreenname represents the
target for an edge in a directed graph. You will be adding these column
headers to the CSV file in a later question.
If a user has fewer than 10 followers or friends, the API will return as many
as it can find. Your code should be flexible to work with whatever data the
API endpoint returns.
Q2
Visualize the network of friends and followers obtained previously using
Gephi.
Note: Make sure your system fulfils all requirements for running Gephi.
a. Go through the Gephi quickstart guide.
b. Insert Source, Target as the first line in both followers.csv and
friends.csv. Each line in both files now represents a directed edge with the
format source, target. Import all the edges contained in these files using
Data Laboratory.
Note: Remember to check the “create missing nodes” option while importing
since we don’t have an explicit nodes file.
c. Visualize the graph and submit a snapshot of a visually meaningful view of
this graph.
Here are some general guidelines for a visually meaningful graph:
- Keep edge crossing to a minimum, and avoid as much node overlap as possible.
- Keep the graph compact and symmetric if possible.
- Whenever possible, show node labels. If showing all node labels create too much visual complexity, try showing those for the “important” nodes.
- Using colors, sizes, thicknesses, etc. to convey information.
- Using nodes’ spatial positions to convey information (e.g., “clusters” or groups).
Experiment with Gephi’s features, such as graph layouts, changing node size
and color, edge thickness, etc. The objective of this task is to familiarize
yourself with Gephi and hence is a fairly open ended task.
d. Using Gephi’s builtin functions, compute and report the following metrics
for your graph: - Average node degree
- Diameter of the graph
- Average path length
Briefly explain the intuitive meaning of each metric in your own words. You
will learn about these metrics in the “graphs” lectures.
Part 2: Using SQLite
The following questions help refresh your memory about SQL or get you started
with SQLite , which is a lightweight, serverless embedded database that can
easily handle up to multiple GBs of data. SQLite is great for building
prototypes and sharing data (all data stored in a single crossplatform file).
a. Import data: Create an SQLite database called rt.db .
Note : You can use SQLite’s built in feature to import data from files (
https://www.sqlite.org/cli.html#section_3 : .separator STRING and
.import FILE TABLE)
b. Build indexes: Create two indexes that will speed up subsequent join
operations:
An index called movies_primary_index in the movies table for the movie_id
attribute
An index called movies_secondary_index in ratings table for the movie_id
attribute
c. Find the total number of movies that are reviewed by at least 500 reviewers
and with average ratings >= 3.5.
Output format:
movie_count
d. Finding most reviewed movies: List all the movies with at least 2500
reviews. Sort the movies by the review count (high to low) then by their names
(alphabetical order) for those who may have the same review counts.
Output format:
movie_id, movie_name, review_count
e. Finding best films: Find the top 10 movies (highest average ratings). Sort
the movies by their average ratings (high to low) then by their names
(alphabetical order).
Output format:
movie_id, movie_name, avg_rating
f. Finding the best movies with the most reviews: Find the top 8 movies with
the highest average ratings that are rated by at least 1000 users. Sort the
results by the movies’ average rating (from high to low), then by the movies’
names (alphabetical order), and then genres (alphabetical order).
Output format:
movie_name, avg_rating, review_count, movie_genre
g. Creating views: Create a view (virtual table) called common_interests from
the data, such that: for each movie with exactly 10 reviews, show its
reviewers in pairs, for all unique reviewer combinations. User IDs should be
ranked in ascending order, and within a pair, the first user ID should be
strictly smaller than the second ID. For example, movie M has 10 reviews,
rated by reviewers 1,2,3,4,5,6,7,8,9,10. You would show “(1, 2, M)”, “(1, 3,
M)”, …, “(1, 10, M)”, “(2, 3, M)”, … , “(2, 10, M)”, etc. This example has 45
such pairs.
The view should have the format:
common_interests(user_id1, user_id2, movie_name)
Full points will only be awarded for queries that use joins.
Note: Remember that creating a view will produce no output, so you should
test your view with a few simple select statements during development.
h. Calculate the total number of such pairs created from the view made in part
g.
Output format:
common_interest_count
i. SQLite supports simple but powerful Full Text Search (FTS) for fast
textbased querying (FTS documentation).
Import the movie overview data from movieoverview.txt into a new FTS table
(in rt.db) called movie_overview with the schema:
movie_overview(id integer, name text, year integer, overview text, popularity decimal)
j. Explain your understanding of FTS performance in comparison with a SQL
‘like’ query and why FTS may perform better (hint: try SQLite’s EXPLAIN
command). Write down your explanation in fewer than 50 words in “fts.txt”.
Part 3: D3 Warmup and Tutorial
- Go through the D3 tutorial here.
- Complete steps 01-09 (Complete through “09. The power of data()”).
- This is a simple and important tutorial which lays the groundwork for Homework 2.
Note: We recommend using Mozilla Firefox or Google Chrome, since they have
relatively robust builtin developer tools.
Part 4: OpenRefine
a. Watch the videos on the OpenRefine’s homepage for an overview of its
features. Download OpenRefine (latest release : 2.6 r.c2 )
b. Import Dataset:
- Launch OpenRefine. It opens in a browser (127.0.0.1:3333).
- Download the dataset
- Choose “Create Project” > This Computer > “menu.csv”. Click “Next”.
- You will now see a preview of the dataset. Click “Create Project” in the upper right corner.
c. Clean/Refine the data:Note: OpenRefine maintains a log of all changes. You can undo changes. See
the “Undo/Redo” button on the upper left corner.