Python Assignment Help: CSC108H Working With arxiv Metadata

Published: 2023-12-01

Complete a basic Python assignment: practise using Dict and analyse data from arXiv.org.
![arXiv](https://upload.wikimedia.org/wikipedia/commons/thumb/6/6a/ArXiv-org_screenshot_20140706.png/600px-ArXiv-org_screenshot_20140706.png)

Goals of this Assignment

In this assignment, you will practise working with files, building and using
dictionaries, designing functions using the Function Design Recipe, and
writing unit tests.

Introduction: arxiv.org

arxiv.org (https://arxiv.org/) is a free
distribution service and an open-access archive for nearly two million
scholarly articles in the fields of physics, mathematics, computer science,
quantitative biology, quantitative finance, statistics, electrical engineering
and systems science, and economics. arXiv is pronounced "archive"
(https://en.wikipedia.org/wiki/ArXiv).
The arxiv.org (https://arxiv.org/) maintainers believe
in open, free, and accessible information. In addition to free and easy access
to the articles themselves, arxiv.org also provides ways to access its
metadata (https://arxiv.org/help/bulk_data). This metadata includes information
such as the article’s unique identification number, author(s), title,
abstract, the date the article was added to the arxiv and when it was last
modified, licence under which the article was published, etc. This metadata is
used by a variety of research tools that investigate scientific research
trends, provide intelligent scientific search techniques, and in many other
areas.
To make this assignment more manageable for you, we have extracted a sample of
arxiv’s metadata, simplified it, and created a text file you will use as input
to your program.

The Metadata File

The metadata file contains a series of one or more article descriptions, one
after the other. Each article description has the following elements, in
order:

A line containing a unique identifier of the article. An identifier will not occur more than once in the file, and it does not contain any whitespace.
A line containing the article’s title. If we do not have title information, this line will be blank.
A line containing the date the article was created, or a blank line if this information is not provided. The date is formatted YYYY-MM-DD.
A line containing the date the article was last modified, or a blank line if this information is not provided. The date is formatted YYYY-MM-DD.
Zero or more lines with the article’s author(s). Each line contains an author’s last name, followed by a comma, followed by the author’s first name(s). There is always exactly one comma on the author line. Note that there may be whitespace and/or punctuation other than commas included in an author’s last name and/or first name. Immediately after the zero or more author lines, the file contains a single blank line. If we do not have any author information for an article, then the blank line will come immediately after the modification date line.
Zero or more lines of text containing the abstract of the article. Immediately after the abstract, the file contains a line with the keyword END on it (and nothing else other than perhaps whitespace). You may assume that a line with only END in it does not occur in any other context in the metadata file, i.e. it always signifies that an article description is over. Unless this is the last line in the file, the next line will contain the identifier of the next article, and so on.
You can assume that any file we test your code with has this structure. You do
not need to handle any invalid file formats.
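As a rough illustration of the structure described above, the sketch below groups the raw lines of a metadata file into one list per article, using the END line as the separator. The function name split_into_articles and the sample lines are hypothetical and not part of the starter code.

```python
from typing import List


def split_into_articles(lines: List[str]) -> List[List[str]]:
    """Return one list of newline-stripped lines per article description.

    The END line that closes each description is used as a separator
    and is not included in the result.
    """
    articles = []
    current = []
    for line in lines:
        if line.strip() == 'END':
            # END (possibly surrounded by whitespace) closes the article.
            articles.append(current)
            current = []
        else:
            current.append(line.rstrip('\n'))
    return articles


# Hypothetical sample: one full article, then one with only an identifier.
sample = ['008\n', 'A Title\n', '2021-09-01\n', '\n', 'Ponce,Marcelo\n',
          '\n', 'Some abstract.\n', 'END\n',
          '042\n', '\n', '\n', '\n', '\n', 'END\n']
blocks = split_into_articles(sample)
print(len(blocks))
print(blocks[1])
```

In the second block, the four blank lines stand for the missing title, the missing created date, the missing modified date, and the blank line that closes the (empty) author list.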

Example Metadata File

Here is an example metadata file (also provided later in the starter code,
a3.zip):
008
Intro to CS is the Best Course Ever
2021-09-01

Ponce,Marcelo
Tafliovich,Anya Y.

We present clear evidence that Introduction to
Computer Science is the best course.
END
031
Calculus is the Best Course Ever

2021-09-02
Breuss,Nataliya

We discuss the reasons why Calculus I
is the best course.
END
067
Discrete Mathematics is the Best Course Ever
2021-09-02
2021-10-01
Pancer,Richard
Bretscher,Anna

We explain why Discrete Mathematics is the best course of all times.
END
827
University of Toronto is the Best University
2021-08-20
2021-10-02
Ponce,Marcelo
Bretscher,Anna
Tafliovich,Anya Y.

We show a formal proof that the University of
Toronto is the best university.
END
042

2021-05-04
2021-05-05

This is a very strange article with no title
and no authors.
END
This metadata file contains information on five articles with unique
identifiers '008', '031', '067', '827', and '042'. Notice that the
following information is not provided in the file: modified date in article
'008', created date in article '031', and title and authors in article '042'.
All these are valid cases and your code should deal with them. Also notice
that an abstract can occupy zero, one, or more lines in the input file.

Storing the Arxiv Metadata

We will use a dictionary to maintain the arxiv metadata. Let us look in detail
at the format of this dictionary. The types below are defined in constants.py
and we have imported them into arxiv_functions.py for use in your type
contracts.

Type NameType

We will store the names of authors as tuples of two strings: the author’s last
name(s) and the author’s first name(s). For example, the author Anna Bretscher
would be listed in the metadata file as 'Bretscher,Anna' and will be stored as
('Bretscher', 'Anna'). Note that there may be punctuation and/or whitespace
included in an author’s last name and/or first name, and we need to keep all
this information. The only exception is that there are no commas in an
author’s first or last name. For example, 'Tafliovich,Anya Y.',
'Van Dyke,Mary-Ellen' and 'Sklodowska Curie,Marie Salomea' are all valid input
lines, and should be stored as ('Tafliovich', 'Anya Y.'),
('Van Dyke', 'Mary-Ellen') and ('Sklodowska Curie', 'Marie Salomea'),
respectively. A line like 'Tafliovich,Anya,Y.' is not valid, since it contains
two commas and we cannot tell which part is the first name and which is the
last name. You will only have to deal with valid input in this assignment. We
will also make the simplification that all authors will have both first and
last names.
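A minimal sketch of turning one author line into a name tuple, assuming the line is valid (exactly one comma). The helper name parse_author is hypothetical, and the NameType alias here is only assumed to mirror the one in constants.py.

```python
from typing import Tuple

# Assumed to mirror the NameType definition in constants.py.
NameType = Tuple[str, str]


def parse_author(line: str) -> NameType:
    """Return the (last name, first name) tuple for a valid author line."""
    # Drop the trailing newline, then split at the single comma. Spaces
    # and punctuation inside either name are kept as they are.
    last, first = line.strip().split(',')
    return (last, first)


print(parse_author('Tafliovich,Anya Y.\n'))
print(parse_author('Sklodowska Curie,Marie Salomea\n'))
```

Unpacking into exactly two names means an invalid line with two commas would raise a ValueError instead of silently producing a wrong tuple.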

Type ArticleType

The file constants.py in the starter code defines the following constants that
you should use instead of the literal strings. Below are the current values of
the constants.
ID = 'identifier'
TITLE = 'title'
CREATED = 'created'
MODIFIED = 'modified'
AUTHORS = 'authors'
ABSTRACT = 'abstract'
We will store information about a single article in a dictionary that maps ID,
TITLE, CREATED, MODIFIED, AUTHORS, and ABSTRACT to the corresponding
values. The value for each piece of information is of type str, except for
the value associated with key AUTHORS, which is a List of NameType. If an
element is not provided in the metadata file, then the value associated with
that key will be empty (i.e. the empty string, or in the case of no authors,
an empty list). The list of authors will be sorted in lexicographic order.
(removed Nov 12)
For example, the article with the identifier '008' in our example input file
above will be stored in the following dictionary:
{ID: '008',
 TITLE: 'Intro to CS is the Best Course Ever',
 CREATED: '2021-09-01',
 MODIFIED: '',
 AUTHORS: [('Ponce', 'Marcelo'), ('Tafliovich', 'Anya Y.')],
 ABSTRACT: 'We present clear evidence that Introduction to\nComputer Science is the best course.'}
Notice that since the fourth line in the specification is blank, the value
corresponding to key MODIFIED is the empty string. Also notice that the final
newline character on each line is not included in any of the stored values,
except for the newline characters inside the abstract; we keep those! Take a
careful look at the starter file example_data.txt (same as the example above)
and the corresponding dictionary EXAMPLE_ARXIV defined in the file
arxiv_functions.py for more examples.
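The sketch below shows one possible way to build such a dictionary from the lines of a single article description. It assumes the lines have already had their newlines removed and that the END line has been dropped. The function name build_article and the ArticleType alias are assumptions rather than the starter code's definitions, and authors are kept in file order here.

```python
from typing import Dict, List, Tuple, Union

# Constant values as listed in the handout.
ID = 'identifier'
TITLE = 'title'
CREATED = 'created'
MODIFIED = 'modified'
AUTHORS = 'authors'
ABSTRACT = 'abstract'

# Assumed aliases; the real ones live in constants.py.
NameType = Tuple[str, str]
ArticleType = Dict[str, Union[str, List[NameType]]]


def build_article(block: List[str]) -> ArticleType:
    """Return the article dictionary for the newline-stripped lines in block."""
    identifier, title, created, modified = block[:4]
    # Author lines are never blank, so the first blank line at or after
    # index 4 is the one that closes the author list.
    blank = block.index('', 4)
    authors = [tuple(line.split(',')) for line in block[4:blank]]
    # Rejoin the abstract lines, keeping the newlines between them.
    abstract = '\n'.join(block[blank + 1:])
    return {ID: identifier, TITLE: title, CREATED: created,
            MODIFIED: modified, AUTHORS: authors, ABSTRACT: abstract}


block_008 = ['008', 'Intro to CS is the Best Course Ever', '2021-09-01', '',
             'Ponce,Marcelo', 'Tafliovich,Anya Y.', '',
             'We present clear evidence that Introduction to',
             'Computer Science is the best course.']
article = build_article(block_008)
print(article[MODIFIED] == '')
print(article[AUTHORS])
```

Joining with '\n' adds newlines only between abstract lines, so the stored abstract has no trailing newline, matching the dictionary for article '008' shown above.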

Type ArxivType

Finally, we will store the entire arxiv metadata in a dictionary that maps
article identifiers to articles, i.e. to values of type ArticleType. The
key/value pair in this dictionary that corresponds to the above article is:
'008': {
    ID: '008',
    TITLE: 'Intro to CS is the Best Course Ever',
    CREATED: '2021-09-01',
    MODIFIED: '',
    AUTHORS: [('Ponce', 'Marcelo'), ('Tafliovich', 'Anya Y.')],
    ABSTRACT: 'We present clear evidence that Introduction to\nComputer Science is the best course.'
}
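To make the nesting concrete, here is a small sketch of a two-article dictionary of this shape and a few lookups into it. The variable names are illustrative only. The two entries are written out from the example metadata file and are not copied from EXAMPLE_ARXIV.

```python
# Constant values as listed in the handout.
ID = 'identifier'
TITLE = 'title'
CREATED = 'created'
MODIFIED = 'modified'
AUTHORS = 'authors'
ABSTRACT = 'abstract'

arxiv = {
    '008': {
        ID: '008',
        TITLE: 'Intro to CS is the Best Course Ever',
        CREATED: '2021-09-01',
        MODIFIED: '',
        AUTHORS: [('Ponce', 'Marcelo'), ('Tafliovich', 'Anya Y.')],
        ABSTRACT: ('We present clear evidence that Introduction to\n'
                   'Computer Science is the best course.'),
    },
    '042': {
        ID: '042',
        TITLE: '',
        CREATED: '2021-05-04',
        MODIFIED: '2021-05-05',
        AUTHORS: [],
        ABSTRACT: ('This is a very strange article with no title\n'
                   'and no authors.'),
    },
}

# Outer key: article identifier. Inner key: one of the constants.
title = arxiv['008'][TITLE]
# AUTHORS maps to a list of (last name, first name) tuples.
first_author_last_name = arxiv['008'][AUTHORS][0][0]
# An empty list is falsy, so this finds articles with no authors.
ids_without_authors = [key for key, article in arxiv.items()
                       if not article[AUTHORS]]
print(title, first_author_last_name, ids_without_authors)
```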
The diagram below shows a picture of the dictionary that represents some of
the articles in the example_data.txt file using the constants provided in
constants.py.

Required Functions

In the starter code file arxiv_functions.py, follow the Function Design
Recipe to complete the following functions. In addition, you must add some
helper functions (i.e. functions that you design yourself) to aid with the
implementation of these required functions. Helper functions also require
complete docstrings. We strongly recommend you also use the suggested helper
functions in the table below; we give you these hints to make your programming
task easier.
Some indicators that you should consider writing a new helper function, or
using something you’ve already written as a helper, are:

Rewriting code to solve a task you have already solved in another function
Getting a warning from the checker that your function is too long
Getting a warning from the checker that your function has too many nested blocks or too many branches
Realizing that your function can be broken down into smaller sub-problems (with a helper function for each)
For each of the functions below, other than read_arxiv_file, write at least
two examples that use the constant EXAMPLE_ARXIV. If your helper function
takes an open file as an argument, you do NOT need to write any examples in
that function’s docstring. Otherwise, for any helper functions you add, write
at least two examples in the docstring.
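As an illustration of what two docstring examples might look like, here is a sketch of a small helper written in the style of the Function Design Recipe. The function count_authors is hypothetical (it is not one of the required functions), and the EXAMPLE_ARXIV defined here is a cut-down stand-in for the real constant in arxiv_functions.py.

```python
from typing import Dict, List, Tuple, Union

AUTHORS = 'authors'  # value as listed in the handout

# Assumed aliases; the real ones live in constants.py.
NameType = Tuple[str, str]
ArticleType = Dict[str, Union[str, List[NameType]]]
ArxivType = Dict[str, ArticleType]

# Cut-down stand-in for the real EXAMPLE_ARXIV constant.
EXAMPLE_ARXIV = {
    '008': {AUTHORS: [('Ponce', 'Marcelo'), ('Tafliovich', 'Anya Y.')]},
    '042': {AUTHORS: []},
}


def count_authors(id_to_article: ArxivType, article_id: str) -> int:
    """Return the number of authors of the article with identifier
    article_id in id_to_article.

    >>> count_authors(EXAMPLE_ARXIV, '008')
    2
    >>> count_authors(EXAMPLE_ARXIV, '042')
    0
    """
    return len(id_to_article[article_id][AUTHORS])


print(count_authors(EXAMPLE_ARXIV, '008'))
```

Saved to a file, the docstring examples can be checked by running python -m doctest -v on that file.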
Your functions should not mutate their arguments, unless the description says
that is what they do.
A note on sorting: Throughout the assignment, we ask for lists to be sorted in
lexicographic order. This is the order that Python sorts in (such as when you
call list.sort). You do not have to write your own sorting code (unless you
want to!).
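The short sketch below shows Python's built-in ordering on a list of name tuples, and how sorted can be used to avoid mutating an argument. The function name and the second pair of names are made up for illustration.

```python
from typing import List, Tuple

NameType = Tuple[str, str]  # assumed to mirror constants.py


def names_in_order(names: List[NameType]) -> List[NameType]:
    """Return a new list with names in lexicographic order."""
    # sorted builds a new list; calling names.sort() here would instead
    # reorder the caller's list in place.
    return sorted(names)


authors = [('Ponce', 'Marcelo'), ('Bretscher', 'Anna'),
           ('Tafliovich', 'Anya Y.')]
in_order = names_in_order(authors)
print(in_order)
print(authors[0])  # the original list still starts with Ponce

# Tuples compare element by element, and strings compare by character
# code, so every uppercase letter sorts before every lowercase letter.
mixed = names_in_order([('de Souza', 'Ana'), ('Zhang', 'Wei')])
print(mixed)
```

Because uppercase letters come before lowercase ones in this ordering, 'Zhang' sorts ahead of 'de Souza'.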
We have broken the components of the assignment down into five Tasks, grouping
related functions together. Some tasks are easier than others, and you can do
the tasks in any order. As in the previous assignments, we’ll be marking each
function mostly separately (however there will be some overlap when functions
call other functions).