Write a small program about the US election that draws a heat map from Twitter data.
Requirement
Sentiment analysis is the process of computationally identifying a writer’s
attitude towards a topic expressed in a piece of text. Some companies apply
sentiment analysis to opinions expressed in social media about their products.
In this assignment, we are providing you with all tweets generated in the
second week of November and you are going to use that data to generate a
geographic visualization of the sentiment expressed about particular topics.
As an example, consider the following map that shows how people feel about
Justin Bieber using the sentiments expressed in their tweets. States that are
red have the most positive view, while states that are dark blue have the most
negative view; yellow represents a more neutral view, while states in gray
have insufficient data.
To generate this image, thousands of tweets that included the word “bieber”
were collected. Each tweet contained the latitude and longitude of the tweet’s
location, which could be used to associate the tweet with a state. To
determine if the tweet was overall positive or negative, the individual words
in the tweet were analyzed.
Words were assigned a score between -1 and +1 using a pre-defined dictionary
of word sentiments. For example, a few of the words in the dictionary and
their scores include,
‘DEPLORABLE’ = -1.0
‘BAD’ = -0.625
‘GOOD’ = 0.875
‘EXCELLENT’ = 1.0
If a word of the tweet is not found in the sentiment dictionary, it is
ignored. The overall sentiment of the tweet is the average of the sentiment
scores that are found. If no sentiment scores are found for any of the words
of the tweet, this tweet is ignored. The overall sentiment of a state is
computed as the average sentiment score for all tweets that are associated
with that state (ignoring those tweets that did not have a sentiment score).
The state’s sentiment score is then mapped to a color between blue (negative)
and red (positive) using a prescribed color gradient.
Data provided
There is a file on Moodle called tweets.zip that includes nine json files of
tweets collected using the Twitter API. Some of the files have a timestamp,
while others do not have a timestamp. All of the files contain the text in the
tweet and the latitude and longitude of the tweeter.
Code provided
There are several files provided in finalProjectFiles.zip that provide the
functionality for calculating the sentiment from the tweet text and
graphically rendering the sentiment for each state. The files include,
- geo.py contains a GeoPosition class to represent a geographic location in terms of latitude and longitude. Each tweet will have a latitude and longitude that can be used to get its location relative to the states. State descriptions also have a latitude and longitude. Also included in GeoPosition is a distance method that properly computes the shortest distance between two geographic locations (based on the distance traveled on the great circle that connects them). The class also provides methods latitude and longitude to access the individual components of a position.
- tweet.py contains the Tweet class. An instance of that class represents a single Twitter message. The class includes the following methods:
- message() – returns a string that comprises the full body of the tweet
- position() – returns a GeoPosition instance describing the location of the tweet.
- timestamp() – returns a datetime instance describing the day and time at which the tweet was posted. (This information is only relevant for the extra credit challenge.)
- state.py defines a State class used to represent information about a state. Each state has a standard two-letter abbreviation (e.g., MO for Missouri), that is returned by the abbrev() method. The boundaries of each state are defined with a series of geographic positions. The relevant information about State for you is that the State class supports a method, centroid(), that returns a single GeoPosition for the centroid of the state. Informally, the centroid is an “average” of all positions in the state, which can be used as an approximation for the entire state for determining the closest state for a tweet.
- us_states.py module contains the actual data needed for representing the United States. You will not need to examine this file; it will be used by other parts of the project.
- country.py defines a Country class that handles the actual rendering of the states. It supports the following two methods:
- setFillColor(stateCode, color)
This method causes the state with the given two-letter state code (e.g., ‘MO’)
to be filled with the given color (specified either as a string or an RGB
triple).
- setTitle(title)
This method sets the title of the window (it is ‘United States’ by default).
- colors.py provides support for translating the numeric “sentiment” values into an appropriate color based on a fixed gradient suggested by Cynthia Brewer of Penn State University. In particular, the module defines a method:
get_sentiment_color(sentimentValue)
that returns an RGB triple of an appropriate color for the given numeric sentiment value. If None is sent as a parameter, it returns the color gray (which is different than the color indicated by a neutral sentiment of 0.0).
- parse.py includes load_sentiments to load the sentiments dictionary.
- The data folder contains the raw data for sentiment scores and tweets.
- The samples folder contains four examples of complete images for the respective terms: bacon, bieber, cat, and dog. The bieber image is the one shown at the beginning of this page; others can be viewed for bacon, cat, and dog.
What you need to do
You need to use the data and code provided to generate a sentiment analysis on
some topic. All of your code should go in the file trends.py. The file
currently has a very basic class definition for a SentimentAnalysis class that
loads the sentiments dictionary, the states list, and the Country instance.
Your code needs to read in the data files you are using. There are nine files
provided; you can use either the files with the created date or the ones
without the created date. You only want to include tweets that have a
specified search term, hashtag, or keyword. For example, if you are analyzing
the sentiment towards the recent election, you might want to include tweets
only if they include Hillary or Trump in the text. You need to write the code
to filter the data.
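One possible shape for the filter is sketched below. It assumes a case-insensitive substring match, which the assignment does not prescribe; the function name matches_any_term and the sample messages are hypothetical and are not part of the provided files.

```python
def matches_any_term(message, search_terms):
    """Return True if the message contains at least one search term.

    Matching is a case-insensitive substring test, so 'trump' also matches
    '#Trump2016'. That is an assumption; a stricter rule could compare
    whole words instead.
    """
    lowered = message.lower()
    return any(term.lower() in lowered for term in search_terms)


# Hypothetical sample messages, used only to illustrate the filter.
messages = [
    "Hillary speaks in Ohio tonight",
    "I love bacon",
    "Watching the TRUMP rally",
]
selected = [m for m in messages if matches_any_term(m, ["Hillary", "Trump"])]
print(selected)
```

In trends.py the message text would come from the message() method of each Tweet rather than from a list of strings.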
Your primary tasks in this assignment are to loop through the provided data,
and for each tweet that you include, compute the average sentiment for that
tweet. You can do that by breaking the tweet into a sequence of words and
looking up each word in the sentiment dictionary. The sentiment for the tweet
is the average of all word sentiments for the tweet.
For example, if the original tweet were
justin bieber…doesn’t deserve the award..eminem deserves it.
The words of the tweet should be considered:
[‘justin’, ‘bieber’, ‘doesn’, ‘t’, ‘deserve’, ‘the’, ‘award’, ‘eminem’, ‘deserves’, ‘it’]
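The word splitting and averaging described above can be sketched as follows. This is a minimal illustration, not the required implementation: it assumes that words are runs of ASCII letters and that the dictionary keys are lowercase, and the names extract_words, tweet_sentiment, and sample_sentiments are hypothetical.

```python
import re


def extract_words(message):
    """Split a message into lowercase words made of ASCII letters only."""
    return re.findall(r"[a-z]+", message.lower())


def tweet_sentiment(message, sentiments):
    """Average the scores of the words found in the dictionary.

    Returns None when no word has a score, so the caller can skip the tweet.
    """
    scores = [sentiments[w] for w in extract_words(message) if w in sentiments]
    if not scores:
        return None
    return sum(scores) / len(scores)


# Hypothetical miniature dictionary; the real one comes from load_sentiments.
sample_sentiments = {"good": 0.875, "bad": -0.625}

words = extract_words("justin bieber…doesn’t deserve the award..eminem deserves it.")
print(words)
print(tweet_sentiment("good song, bad video", sample_sentiments))  # 0.125
print(tweet_sentiment("no scored words here", sample_sentiments))  # None
```

Returning None rather than 0.0 keeps an unscored tweet distinct from a genuinely neutral one, which matters because unscored tweets must be left out of the state average.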
Assuming the tweet has a sentiment score (that is, at least one word of the
tweet was identified in the sentiments dictionary), assign this tweet’s
sentiment score to the “closest” state. The rule that you should use is to
assign the tweet to whichever state has its centroid closest to the location
of the tweet. This is an imperfect rule (for example, because tweets from New
York City will actually be closer to the centroids of Connecticut and New
Jersey than to the centroid of New York state); but it is an easy rule to
implement, and it will do for now.
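To illustrate the closest-centroid rule, here is a self-contained sketch. In the project you would call the distance method of GeoPosition on the value returned by centroid(); the haversine function below is only a stand-in so that the example runs on its own. The names great_circle_km, closest_state, and sample_centroids are hypothetical, and the coordinates are rough values chosen for illustration, not taken from us_states.py.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error before taking asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def closest_state(lat, lon, centroids):
    """Return the abbreviation whose centroid is nearest to (lat, lon)."""
    return min(
        centroids,
        key=lambda abbrev: great_circle_km(lat, lon, *centroids[abbrev]),
    )


# Rough, illustrative centroids (not taken from us_states.py).
sample_centroids = {
    "NY": (42.9, -75.5),
    "NJ": (40.2, -74.7),
    "CT": (41.6, -72.7),
}

# New York City is nearer to the New Jersey centroid than to New York's.
print(closest_state(40.71, -74.01, sample_centroids))  # NJ
```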
Once you have scored all tweets and assigned those scores to the appropriate
state, compute the cumulative sentiment for each state as the average of all
sentiments that were assigned. Then use that sentiment to pick an appropriate
color (using the get_sentiment_color function from our colors module), and set
the state’s color in the visualization.
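The per-state averaging step could look like the sketch below. It assumes each scored tweet has already been reduced to a (state abbreviation, score) pair; the function name average_by_state and the sample data are hypothetical.

```python
from collections import defaultdict


def average_by_state(scored_tweets, all_states):
    """Average tweet scores per state.

    scored_tweets: iterable of (state_abbrev, score) pairs.
    Returns a dict mapping every state in all_states to its mean score,
    or to None when no scored tweet was assigned to it.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for abbrev, score in scored_tweets:
        totals[abbrev] += score
        counts[abbrev] += 1
    return {
        abbrev: (totals[abbrev] / counts[abbrev] if counts[abbrev] else None)
        for abbrev in all_states
    }


# Hypothetical (state, score) pairs for illustration.
pairs = [("MO", 0.5), ("MO", -0.25), ("KS", 1.0)]
averages = average_by_state(pairs, ["MO", "KS", "NE"])
print(averages)  # {'MO': 0.125, 'KS': 1.0, 'NE': None}
```

Each value can then be passed to get_sentiment_color, which returns gray for None, and the resulting color handed to setFillColor together with the state's abbreviation.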
You should feel free to define any additional functions within the trends.py
file that help you organize your code in a clearer and more modular fashion.
Command-line arguments
Your program needs to take the search terms as command-line arguments, such as
>> python trends.py Trump #MakeAmericaGreatAgain
if you want to include tweets that match either of the search terms provided.
If you only want one search term, you would call your program using
>> python trends.py Hillary
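A minimal sketch of reading the search terms is shown below. It assumes that every argument after the script name is a search term; the function name parse_search_terms and the usage message are hypothetical. Note that in some shells, such as bash, an unquoted word beginning with # starts a comment, so a hashtag argument may need quoting, for example '#MakeAmericaGreatAgain'.

```python
import sys


def parse_search_terms(argv=None):
    """Return the search terms given after the script name.

    Falls back to sys.argv when no list is passed, and exits with a
    usage message when no search term was supplied.
    """
    if argv is None:
        argv = sys.argv
    terms = argv[1:]
    if not terms:
        raise SystemExit("usage: python trends.py TERM [TERM ...]")
    return terms


# Simulated argument list; in trends.py this would be sys.argv.
example_argv = ["trends.py", "Trump", "#MakeAmericaGreatAgain"]
print(parse_search_terms(example_argv))  # ['Trump', '#MakeAmericaGreatAgain']
```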
Some options for how you could use this data
- Determine what people are saying in different states. This could include the sentiment only, or the sentiment weighted by the volume of tweets in a state.
- Examine median sentiment values instead of the average sentiment.
- Compare the results of different keywords or hashtags in the results.
- Compare results by region instead of individual states.
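As one illustration of the median option, the sketch below summarizes the tweet scores of a single state using Python's standard statistics module. The function name summarize_scores and the sample scores are hypothetical; the count is included because it is the tweet volume you would need for a volume-weighted comparison.

```python
from statistics import mean, median


def summarize_scores(scores):
    """Return the mean, median, and number of tweet scores for one state."""
    return {"mean": mean(scores), "median": median(scores), "count": len(scores)}


# Hypothetical tweet scores for one state; a single strongly negative tweet
# pulls the mean down to 0.0 while the median stays at 0.25.
sample_scores = [-1.0, 0.25, 0.25, 0.5]
summary = summarize_scores(sample_scores)
print(summary)  # {'mean': 0.0, 'median': 0.25, 'count': 4}
```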
Report
Write a short, 1-2 page report describing what you did and any interesting
results you generated. Your report should include the following three
sections:
Purpose: What is the purpose of the assignment?
Procedure: What did you do? What code did you write? What functionality did
you implement? What analysis did you do on the data?
Results: What were the results of the project? How did sentiments in different
states compare to each other?