Python代写:CS614DataCleaningandRecordLinkage


用Python三方库对数据进行处理,练习Data Cleaning和Record Linkage的方法。

Requirement

You may work alone or in a pair on this assignment.
In this assignment you will take a pair of datasets containing restaurant
names and addresses, clean them, and link them, i.e., find records in the two
datasets that refer to the same restaurant.
This assignment will give you extensive experience in data cleaning and
munging with pandas, as well as experience in using matplotlib, and in
probabilistic record linkage.
Here is an overview of the various tasks you’ll perform. Some of these will
make more sense after you have read the full description.

  1. The restaurant datasets are lists of names and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.txt and fodors.txt. Your first task is to create a pandas dataframe for each consisting of four columns: (i) the original full name and address string, and extracted columns consisting of (ii) the restaurant name, (iii) the city, and (iv) the address (except for the city). Call the dataframes zagat and fodors.
  2. For record linkage you will use a training dataset of known links, available in the file known_links.txt. Your next task is to read this dataset into a pandas dataframe and find the correspondence between the pairs in the known links and the rows in zagat and fodors. Call a pair from the known links a match, and the dataframe matches.
  3. You will also need a dataset of pairs that are not matches. We will call such a pair an unmatch. You will create this by choosing random pairs from zagat and fodors. Call this the unmatches dataframe. (The terms match and unmatch follow the convention in record linkage. )
  4. You will use the Jaro-Winkler distance to measure the similarity between a field in a pair of records (one each from zagat and fodors). Your fourth task will be to explore the distribution of the Jaro-Winkler scores when a pair is a match and compare it to when the pair is an unmatch. (This step is not strictly required for record linkage, but will help clarify the rationale behind the approach we adopt.) For this you will plot histograms for the distance values for six cases, and create a pdf file containing all your histograms.
  5. To link records, you will use (a simplified version of) the probabilistic record linkage approach. In this you will represent the distances between the fields of a pair of restaurants as a vector of three integers. There are 27 possible such vectors. Your next task is to find the probability of each vector given whether the pair it came from is a match or an unmatch. Here you will use the matches and unmatches dataframes.
  6. Next, you will order the vectors, and partition them into three sets: match_vectors, possible_vectors, and unmatch_vectors. If a test pair maps to a vector in the first set, we classify it as a match; if it maps to a vector in the second set, we classify it as a possible match; else we classify it as an unmatch. (Possible matches need to be verified manually.)
  7. It will then be time to find matches. You will iterate through all potential pairs, for each find the corresponding vector, and the set it belongs to. Based on the set you will mark it as a match, possible match, or unmatch. You will print the number of each type you find, and also store all matches in a csv file.
  8. Iterating through all possible pairs is time consuming in general. In the final step, you will repeat the above step but only consider potential pairs that also have the same value for either name, city, or address.

Create restaurant dataframes

The restaurant datasets are lists of names and addresses provided by the
restaurant review companies Zagat and Fodor’s, and are in the files zagat.txt
and fodors.txt. Your first task is to create a pandas dataframe for each
consisting of four columns: (i) the original full name and address string,
(ii) the restaurant name, (iii) the city, and (iv) the address (except for the
city). You will extract the latter three columns from the first, and use them
for record linkage later on.

Extracting new fields from text

pandas provides the function extract() that takes a regular expression to
create new fields from text data. Assume, e.g., that the original name and
address strings from Zagat are in a column named nameaddress in zagat. Then
the call:
zagat[‘nameaddress’].str.extract(r’^([^\d])(\d.)$’, expand=True)
—|—
returns a dataframe with two columnseach corresponding to a group in the
regular expressionthe first containing the longest prefix of the nameaddress
string that contains no digits, and the second containing the rest of the
string. The columns of the returned dataframe can be appended to the zagat
dataframe using the pandas function concat().
Use (possibly multiple calls to) extract() to create the name, city, and
address columns. We found the following useful in this data munging process:

  • Build the regular expression(s) you need by iterative refinementlook at some of the data, try out an initial regular expression, observe where it works and where it fails, and improve upon it. If the regular expression doesn’t match a given string, extract returns null values. So if you want to look at the strings where your regular expression failed, use, e.g., zagat.ix[df.iloc[:,0].isnull()] , in which df is the name of the dataframe returned by extract().
  • Even so, you may not find a single regular expression that extracts all three columns. It may be easier to extract the columns in stages, using separate calls to extract(). Further, there may be special cases that are unlike all other data and are best handled by hard-coding their values. In the approximately 850 rows in the combined data, we hard-coded the extraction for about 20 restaurants. You should not hard-code more than 50 cases.
  • Addresses often start with a street number. But not always, and sometimes the restaurant name itself has a number as well.
  • City names often begin after the name of a road, i.e., after a substring ending with “Blvd.”, “Ave.”, “PCH”, or “Broadway”. It is simple to build a complete list of such suffixes by iterative refinement.
  • We recommend using a jupyter notebook, at least for the iterative refinement process.
  • You should be comfortable with the different indexing and selection mechanisms in pandas.
  • It is best not to name the restaurant name column name, because pandas dataframes have a predefined attribute called name, and you get unexpected results if you use the dataframename.columnname style of referring to a column.

Create the matches dataframe

For record linkage, you will use a training dataset of known links, available
in the file known_links.txt. It contains 50 pairs of restaurant names and
addresses that are known to refer to the same restaurant. The first in the
pair is from Zagat and the second from Fodor’s. Your next data munging task is
to read this dataset into a pandas dataframe called matches. You will find
read_csv() unsuitable for this due to the irregular formatting in
known_links.txt. For example, sometimes a restaurant string is in a single
line, and sometimes split across two lines. We recommend writing custom Python
code to read the dataset.
You will also find that not all restaurant strings are exactly equal to the
corresponding ones in the Zagat and Fodor’s datasets, and will have to correct
those. This correspondence will be useful during record linkage.

Create the unmatches dataframe

It turns out that for record linkage you also need a dataset of unmatches. For
this choose 1000 random restaurants from zagat and pair them with 1000 random
restaurants from fodors. There are several ways to make random choices:
Python’s random library, Numpy functions for creating random arrays, or pandas
own sample function for sampling a dataframe.
Technically the random pairs may include some pairs that actually match, but
the overwhelming number will be unmatches, and we’ll ignore this minor source
of error.
Why are we using 1000 random pairs for unmatches, when we only have 50 for
matches? Because we can, and the more pairs we have, the better the accuracy
of our record linkage. Of course, having more matches would also have been
helpful as well, but that requires expensive manual work.
Caution. The random choices you make will influence the rest of your results.
This makes it hard to debug a program if the bug only shows up for some random
choices. So during the development and debugging process, it is useful to seed
the random generator with a fixed number. E.g., when using pandas the
following call:
zagat.sample(1000, replace = True, random_state = 1234)
—|—
always returns the same 1000 rows because the random state is fixed through
the seed 1234. Once your code is bug-free you should remove the random_state
option.

Explore Jaro-Winkler scores, plot histograms

To link records we take a potential pair, one each from zagat and fodors, and
decide if their strings are similar enough. Our results are better when the
strings are broken up into natural components, which you have already
accomplishedinto the components name, city, and address.
When the components are strings, a natural measure of similarity, or rather
the distance between two strings, is the edit distance, also known as the
Levenshtein distance. It is defined as the minimum number of single-character
insertions, deletions, and substitutions required to convert one string to
another. In practice, however, the Levenshtein distance is computationally
intensive, and a related measure, called the Jaro-Winkler distance is used
instead.
The exact definition of the Jaro-Winkler distance is somewhat technical (but
available here). Also, although it is called a distance, it actually measures
the similarity between two strings: a complete match (“Medici” and “Medici”)
has a score of 1, and a complete mismatch (“Medici” and “Subway”) has a score
of 0.
You will use the Python library jellyfish to compute the Jaro-Winkler distance
between two strings. Install it using the standard:
sudo pip3 install -U jellyfish
To find the Jaro-Winkler distance use, e.g.,:
import jellyfish
score = jellyfish.jaro_winkler(“medici”, “noodles etc.”)
—|—
Your next task is to explore the distribution of the Jaro-Winkler distance
between names, cities, and addresses of pairs of restaurants, both when the
pairs are matches and when they are unmatches.
Your specific task here is to create a pdf file, named histograms.pdf
containing six histograms, labelled appropriately. The first pair of
histograms is for the Jaro-Winkler distances between names for restaurants,
one each for matches and unmatches. The second and third are similar, except
for cities and addresses.
Here you will find useful the pandas function plot.hist(), and the
matplotlib.pyplot functions figure(), subplot(), xlabel(), title(), and
savefig(). See, e.g., the tutorial on working with multiple figures. Our
histograms are in histograms_sample.pdf. Your results would vary because of
different random choices, and possibly different extraction processing.

Estimate vector probabilities given match or unmatch

Observe that the probability of a match does not increase monotonically with
the similarity for a field; rather, there are clusters in which this
probability is high. For instance, in our results, for the address field, the
probability is lower in the range 0.80-0.95 than just outside this range. The
clusters are even more pronounced for the unmatch pairs.
The above implies that a single “cut-off” on the similarity for a field, or
even a group of fields will not serve our purpose. So we break-up the range of
similarity values into discrete chunks, and determine the probability of a
match for each chunk separately from that of neighboring chunks. Each chunk
(in the 3-dimensional space of name, city, and address similarities) can be
identified by a vector as described below.
We have adopted (a simplified version of) the probabilistic record linkage
approach proposed by Fellegi and Sunter. Provided in utils.py is a simple
utility function get_jw_category() that takes a Jaro-Winkler distance and
returns an integer category between 0 to 2, essentially breaking the range of
the Jaro-Winkler score into three blocks. We apply this on all three fields:
names, cities, and addresses. Thus for any pair of restaurants, we can create
a vector (name_category, city_category, address_category) that represents the
similarity of the fields for that pair. Whether, during record linkage, the
pair should be classified as a match or unmatch, or something in between, will
depend on certain probabilities associated with the corresponding vector.
We determine whether a vector should be classified as a match, possible match,
or unmatch, based on estimates of the probability that a pair results in that
vector when given that the pair is a match, as well as when given that the
pair is an unmatch. Formally, assume that (r1 , r2 ) is a potential pair, v(r1
, r2 ) is the vector formed from its field similarities, and V is the set of
all possible vectors.
Compute estimates for the former by iterating through all the pairs in
matches, determining their vectors, and counting the frequency of each of 27
possible vectors during this process. The relative frequency for a vector w is
the estimate of the probability for w given a matching pair. Similarly find an
estimate for the probability given an unmatch by iterating through vectors for
all pairs in unmatches.

Partition vectors into match, possible match, and unmatch sets

How should we partition the set of vectors V into the three different sets:
match_vectors, possible_vectors, and unmatch_vectors? It depends on our
tolerance for two kinds of errors:

  • false positive rate, the probability that we classify an actual unmatch as a match, and
  • false negative rate, the probability that we classify an actual match as an unmatch.
    Assume that u is the maximum false positive rate and is the maximum false
    negative rate we are willing to tolerate, and given these are satisfied we
    want to maximize the number of our matches.
    In order to partition the set of vectors V , we do the following
  1. Assign all vectors w that have m(w) = 0 and u(w) = 0 to possible_vectors.
  2. Order the rest of the vectors such that the vectors with u(w) = 0 are in front of everyone else, and the remaining are in decreasing (non-increasing) order of m(w)/u(w) .
  3. Beginning from the first vector in the order above find the last vector w1 such that the cumulative sum of u(w) values (for w from first to w1 ) is at most . Add all vectors from the first till w1 (inclusive) to match_vectors.
  4. Beginning from the last vector in the reverse order, find the furthest-from-last vector w2 such that the cumulative sum of m(w) values is at most . Add vectors from the last till w2 (inclusive) to unmatch_vectors.
  5. Add any remaining vectors (in the ones that were ordered) to possible_vectors.
    You should first convince yourself that the above approach makes sense. You
    will then realize that the above only works when w1 happens to be ahead of w2.
    Work out the version of the algorithm when this is not the case, and implement
    the complete algorithm.

Find matches and count possible matches and unmatches

Once you have the three sets for given maximum false positive and false
negative rates, simply iterate through every pair in zagat and fodor to
determine counts of matches, possible matches, and unmatches.

Block on name, city, or address

Iterating through all potential pairs is computationally prohibitive for large
datasets. So the number of potential pairs is often reduced by applying a
constraint, such as only considering pairs that have the same value for a
blocking key. Implement a version of the record linking algorithm in which
either name, city, or address is supplied as a blocking key.

Task checklist

Write Python code, with appropriate auxiliary functions and documentation that
implements the following function:
find_matches(mu, lambda_, outfile, block_on)
—|—
which takes mu, the maximum false positive rate, lambda_, the maximum false
negative rate, outfile, the file name in which to store the matches, and block
to indicate the blocking key to use, if any. It should create the file
histograms.pdf, and store all the matches it finds in outfile in the csv
format (Zagat name and address followed by Fodor’s name and address), and
returns a 3tuple containing the number of matches, possible matches, and
unmatches it found.
In particular, ensure you

  1. Create zagat, fodors, matches, and unmatches.
  2. Plot histograms for names, city, and address similarity for pairs from matches and unmatches.
  3. Determine relative frequencies for 27 vectors corresponding to pairs from matches and unmatches.
  4. Order vectors and create sets match_vectors, possible_vectors, and unmatch_vectors.
  5. Count the number of matches, possible matches, and unmatches for all potential pairs from zagat and fodors, possibly after applying a blocking key.
  6. Store the matches in a csv file.

Getting started

Follow these start-up instructions if you plan to work alone.
Follow these start-up instructions if you are going to work with the same
partner as you had for PA #2 or PA #3.
Follow these start-up instructions if you are going to work in a new pair.
Once you follow these instructions, your repository will contain a directory
named pa4,

Submission

Follow these submission instructions if you are working alone.
Follow these submission instructions if you are working in a pair.


文章作者: SafePoker
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 SafePoker !
  目录