Coursework help for a Python basics assignment: implement XML parsing and text comparison.

Requirement
For all the tasks you have to write comments in your code that briefly
explain what is going on (i.e. in the .py file itself you have to write small
comments). In addition, you must write a report in which you describe your
programs in more detail and explain your choice of solutions. The scope of the
report is 4-5 pages (for groups of two students 6-8 pages and for groups of
three students 8-10 pages). You may attach as many attachments as you like.
- Take a screenshot of a run of each program (i.e. run the program and take a screenshot of the result that appears).
- Attach the screenshots as attachments (even if the program only works partially).
- All your programs (i.e. the .py files themselves) must be submitted as attachments.
Task 1 xml
a) Write a program that is called with one argument (an input file) from the
terminal window. The program must load the input file, which is in TEI XML
('iso-8859-1'), and it must print to the terminal: the title of the file (the
text in element sourceDesc / bib / title), the author (the text in element
sourceDesc / bib / author), and the number of quote elements (elements with
tag q). Call your program find_quoter.py. Try the program like this: python
find_quoter.py 'fair_tei.xml'.
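Part a) could be approached roughly as follows. This is only a sketch, assuming the standard-library module xml.etree.ElementTree and assuming the file may or may not declare an XML namespace (which is why tags are compared by local name). The helper names local_name, find_text, summarize and main are hypothetical; the element path is taken from the task text.

```python
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def find_text(root, path):
    """Text of the first element reached by following the local tag names in path."""
    candidates = [e for e in root.iter() if local_name(e.tag) == path[0]]
    for name in path[1:]:
        # Step down one level: keep only direct children with the wanted name.
        candidates = [c for e in candidates for c in e if local_name(c.tag) == name]
    if not candidates:
        return None
    return "".join(candidates[0].itertext()).strip()


def summarize(root):
    """Return (title, author, number of q elements) for a parsed document."""
    title = find_text(root, ["sourceDesc", "bib", "title"])
    author = find_text(root, ["sourceDesc", "bib", "author"])
    q_count = sum(1 for e in root.iter() if local_name(e.tag) == "q")
    return title, author, q_count


def main(argv):
    """Entry point: argv[1] is the input file."""
    if len(argv) != 2:
        print("usage: python find_quoter.py INPUT.xml")
        return 1
    # The parser takes the encoding from the XML declaration. Assumption: if
    # the declaration is missing, ET.XMLParser(encoding="iso-8859-1") can be
    # passed to ET.parse via its parser argument instead.
    root = ET.parse(argv[1]).getroot()
    title, author, q_count = summarize(root)
    print("Title:", title)
    print("Author:", author)
    print("Number of q elements:", q_count)
    return 0
```

To run it from the terminal, the file would end with a call to main(sys.argv) under the usual if __name__ == "__main__": guard.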
b) Write a new program that is called with two arguments in the terminal
window (an input file and an output file). The program should load the same
input file as before, but it must write the following to the output file: the
title (the text of the element title), and the words in the texts of p and q
elements that appear at least 3 times in the corpus. Call your program
find_freq.py. Try the program like this: python find_freq.py 'fair_tei.xml'
'outq.txt'.
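A possible shape for part b) is sketched below. It assumes that a q nested inside a p should be counted only once, so it stops descending as soon as it reaches a p or q element; the task text does not settle this, so treat it as one reasonable reading. The tokenizing regular expression, the output layout and all function names are hypothetical choices.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def collect_texts(elem, wanted=("p", "q")):
    """Yield the full text of each outermost p or q element under elem."""
    if local_name(elem.tag) in wanted:
        # itertext() already includes the text of nested elements, so do not
        # descend further; otherwise a q inside a p would be counted twice.
        yield "".join(elem.itertext())
    else:
        for child in elem:
            yield from collect_texts(child, wanted)


def frequent_words(root, min_count=3):
    """(word, count) pairs for words occurring at least min_count times."""
    counts = Counter()
    for text in collect_texts(root):
        counts.update(re.findall(r"\w+(?:['’]\w+)*", text.lower()))
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


def first_title(root):
    """Text of the first title element in document order, or None."""
    for elem in root.iter():
        if local_name(elem.tag) == "title":
            return "".join(elem.itertext()).strip()
    return None


def write_report(title, words, out_path):
    """Write the title and one 'word<TAB>count' line per frequent word."""
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f"Title: {title}\n")
        for word, n in words:
            out.write(f"{word}\t{n}\n")
```

With the two command-line arguments in hand, the program would parse the input with ET.parse, then call write_report(first_title(root), frequent_words(root), output_path).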
Task 2 Frequency lists and text comparison in Python
Files you need for this program: vivaldi_positiv.txt and vivaldi_negativ.txt.
Imagine that you are a communications officer at Vivaldi and need to form an
overview of Vivaldi reviews. vivaldi_positiv and vivaldi_negativ contain a
sample of user reviews from Trustpilot and Tripadvisor (the reviews are
randomly selected, so they do not necessarily give a fair picture of users’
opinions about Vivaldi).
Write a program in Python that looks at the words used in texts in different
ways. Remember that it is important to think of uppercase / lowercase letters
and punctuation when working with texts. Use vivaldi_positiv and
vivaldi_negativ as examples. Call your program lex_analysis.py
- Your program should make a frequency list of the words in vivaldi_positiv
and a frequency list of the words in vivaldi_negativ (1-word frequency lists).
The two lists must be sorted with the most frequent word at the top. You will
need dictionaries in your solution. Use a stop word list to filter out the
most “contentless” words (create your own stop word list or find one online).
Print the two frequency lists to text_out.txt.
- Make frequency lists of word pairs (bigrams) and 3-word combinations
(trigrams) for each text. The lists must be sorted with the most frequent
bigrams / trigrams at the top and printed to text_out.txt.
- Your program should print to the same file (text_out.txt) the words that
ONLY occur in vivaldi_positiv (and thus not in vivaldi_negativ) and the words
that ONLY occur in vivaldi_negativ (and thus not in vivaldi_positiv).
- Describe the trends you see in text_out.txt, i.e. try to put into words how
the different frequency lists can contribute to an overview of the content in
large volumes of text.
- Provide at least one suggestion (and preferably several) on how to improve
/ extend the program so that it could be even more widely used in an overall
analysis of the content of Vivaldi’s reviews.
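The frequency lists described in the points above could be built along these lines. This is a minimal sketch: the stop word set is a deliberately tiny, hypothetical placeholder, the regular expression is one possible way to handle case and punctuation, and the function names are illustrative only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "to", "of", "in", "it", "i"}


def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation."""
    return re.findall(r"\w+(?:['’]\w+)*", text.lower())


def word_frequencies(tokens, stop_words=STOP_WORDS):
    """Dictionary of word -> count, ignoring stop words."""
    freq = {}
    for token in tokens:
        if token not in stop_words:
            freq[token] = freq.get(token, 0) + 1
    return freq


def ngram_frequencies(tokens, n):
    """Counter of n-grams, each a tuple of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def sorted_by_count(freq):
    """(item, count) pairs, most frequent first; ties in alphabetical order."""
    return sorted(freq.items(), key=lambda pair: (-pair[1], pair[0]))
```

Here the n-grams are built from the unfiltered token list and may span sentence boundaries. Whether to remove stop words before forming bigrams and trigrams is a design choice worth discussing in the report.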
Remember to print appropriate headings in your text_out.txt, so that it is
easy to find your way around the file (for example, which frequency list
belongs to which file, etc.).
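The words that occur in only one of the two texts, and the headings in the output file, might be handled as in the sketch below. It assumes both texts have already been split into lowercase word lists; exclusive_words and write_section are hypothetical helper names, and the underlined heading style is just one option.

```python
def exclusive_words(tokens_a, tokens_b):
    """Words that occur in tokens_a but never in tokens_b, sorted alphabetically."""
    return sorted(set(tokens_a) - set(tokens_b))


def write_section(out, heading, lines):
    """Write a heading, an underline, the given lines, and a blank line."""
    out.write(heading + "\n")
    out.write("-" * len(heading) + "\n")
    for line in lines:
        out.write(line + "\n")
    out.write("\n")
```

For example, with text_out.txt opened for writing as out, write_section(out, "Words only in vivaldi_positiv", exclusive_words(positive_tokens, negative_tokens)) would add one clearly labelled section; positive_tokens and negative_tokens stand for the two word lists.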
Task 3 Frequency lists with Unix commands
In this task, using the command line in the terminal window, you must achieve
some of the same results that you obtained using the Python program in the
above task. Here, too, you should reflect on upper / lower case letters and
punctuation. Explain the commands in detail.
- Use Unix commands to create a frequency list of the 25 most frequent words
from either vivaldi_positiv or vivaldi_negativ (1-word frequency list). Write
the frequency list to a new file that you call freq_vivaldi.txt. Use a stop
word list to filter out the most “contentless” words (create your own stop
word list or find one online).
- Also make frequency lists of bigrams and trigrams for either
vivaldi_negativ or vivaldi_positiv. Also print these lists to
freq_vivaldi.txt.
- Make at least one more relevant study of the texts, of your own choice.
Attach screenshots showing your commands and freq_vivaldi.txt.
Task 4 Automatic Bibliography Generation
In this assignment, you must create a program that automatically generates a
bibliography from a collection of books. The books are given to you in a CSV
file that you have to load in Python and print as a sorted bibliography.
Write a program that loads the file litteratur.csv and prints a sorted
bibliography with the books from the file. Call your program literature.py.
Your bibliography should look as described below under “Formatting of
bibliography”, where an explanation is also given on how the books can be
found in the csv file (“Description of csv file”).
An example of how your program might work (for example, if you created a
function called csv2lit):
>>> fil = "litteratur.csv"
>>> csv2lit(fil)
Litteraturliste
S., Leon. Linear Algebra with applications. Pearson. 8. (2010)
A., Turing. Solvable and unsolvable problems. Penguin Books. 1. (1954)
Here the bibliography contains only the first two books from the file, but
your program must print a bibliography with all the books.
Description of csv file
The books to work with are in the file litteratur.csv. A CSV file is a text
file that is formatted in a very specific way: each line in the file
corresponds to (in our case) one book, and for each line / book several pieces
of information are specified. The pieces of information are separated by the
semicolon symbol, “;”. Thus, the first lines from litteratur.csv
First Name, Last Name, Title, Publisher, Edition, Year
Alan; Turing; Solvable and unsolvable problems; Penguin Books; 1; 1954
Steven; Leon; Linear algebra with applications; Pearson; 8; 2010
are understood as
First name  Last name  Title                             Publisher      Edition  Year
Alan        Turing     Solvable and unsolvable problems  Penguin Books  1        1954
Steven      Leon       Linear algebra with applications  Pearson        8        2010
where the first line contains headings for the different pieces of information.
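Loading such a file could look like the sketch below, which uses the standard-library csv module with ";" as the delimiter. The function name read_books, the dictionary keys and the default encoding are assumptions made for illustration; the task does not state the file's encoding.

```python
import csv


def read_books(path, encoding="utf-8"):
    """Return one dict per book from a semicolon-separated file with a heading line."""
    fields = ["first", "last", "title", "publisher", "edition", "year"]
    books = []
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle, delimiter=";")
        next(reader, None)  # skip the line with the headings
        for row in reader:
            # Remove the spaces that follow each semicolon in the file.
            values = [value.strip() for value in row]
            if len(values) == len(fields):  # ignore blank or malformed lines
                books.append(dict(zip(fields, values)))
    return books
```

Each returned dictionary keeps all values as strings, which is enough for formatting and for sorting by last name.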
Formatting of bibliography
The books in your bibliography must be printed in the following format:
F., Surname. Title. Publisher. Edition. (Year)
where the first name is abbreviated to its initial followed by a period, the
year is written in parentheses, the first name and surname are separated by a
comma, and the different values (name, title, publisher, edition and year) are
separated by a period. In order to obtain full marks for the assignment, the
bibliography must be sorted by the author’s last name.
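The format and the sorting rule can be illustrated with a short sketch. It assumes each book is available as a tuple (first name, last name, title, publisher, edition, year) of strings; format_entry and bibliography are hypothetical names, and sorting case-insensitively is an extra assumption.

```python
def format_entry(first, last, title, publisher, edition, year):
    """Format one book as 'F., Surname. Title. Publisher. Edition. (Year)'."""
    return f"{first[0]}., {last}. {title}. {publisher}. {edition}. ({year})"


def bibliography(books):
    """Return the formatted entries, sorted by the author's last name."""
    ordered = sorted(books, key=lambda book: book[1].lower())
    return [format_entry(*book) for book in ordered]
```

For the two books shown in the description of the csv file, bibliography would return "S., Leon. Linear algebra with applications. Pearson. 8. (2010)" followed by "A., Turing. Solvable and unsolvable problems. Penguin Books. 1. (1954)".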