Introduction
Write an HTML tag parser that detects incorrectly paired tags; the exercise tests the use of a Stack.
Requirement
A markup language is a language that annotates text so that the computer can
manipulate the text. Most markup languages are human readable because the
annotations are written in a way to distinguish them from the text. The most
important feature of a markup language is that the tags it uses to indicate
annotations should be easy to distinguish from the document content.
One of the most well-known markup languages is the one commonly used to create
web pages, called HTML, or “Hypertext Markup Language”. In HTML, tags appear
in “angle brackets”. When you load a Web page in your browser, you do not see
the tags themselves: the browser interprets the tags as instructions on how to
format the text for display.
Most tags in HTML are used in pairs to indicate where an effect starts and
ends. For example:
<p>this is a paragraph of text written in HTML</p>
Here tag p represents the start of a paragraph, and tag /p indicates where that
paragraph ends.
Other tags include tag b, which places the enclosed text in bold font, and
tag i, which indicates that the enclosed text is italic.
Note that “end” tags look just like the “start” tags, except for the addition
of a forward slash “/” after the “<” symbol.
Sets of tags are often nested inside other sets of tags. For example, an
ordered list is a list of numbered bullets.
You specify the start of an ordered list with the tag ol, and the end with
/ol. Within the ordered list, you identify items to be numbered with the tags
li (for “list item”) and /li. For example, the following specification:
<ol>
<li>First item</li>
<li>Second item</li>
<li>Third item</li>
</ol>
would result in the following:
1. First item
2. Second item
3. Third item
Notice how you start the ordered list with the ol tag, specify three list
items with matching li and /li tags, and then close the ordered list with the
/ol tag.
You may have noticed that the pattern of using matching tags strongly
resembles the pattern of matching parentheses that we discussed in class: when
you use parentheses, brackets, and braces, they have to match in reverse
order, such as “{[()]}”. A pattern such as “[(])” would be incorrect since the
right bracket does not match the left parenthesis. Similarly, an HTML pattern
such as ol li /ol /li would be incorrect since the closing tags are in the
wrong order.
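The parenthesis-matching idea can be sketched in a few lines of Python. This is only an illustration, not the version discussed in class: the function name `is_balanced` is hypothetical, and a plain list stands in for the stack.

```python
def is_balanced(text):
    """Return True if every bracket in text is closed in reverse order of opening."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []  # opening symbols still waiting for a match
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recently opened symbol.
            if not stack or stack.pop() != pairs[ch]:
                return False
    # Anything left on the stack was never closed.
    return not stack


print(is_balanced("{[()]}"))  # True
print(is_balanced("[(])"))    # False
```

The HTML checker follows the same pattern, with start tags in place of opening brackets and end tags in place of closing brackets.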
The aim of this question is to write an “HTML Checker” program that takes as
input an HTML file, and produces a report indicating whether or not the tags
are correctly matched.
Just as the parenthesis checker uses a stack to store symbols waiting for a
match to be found, your program should also use a stack. You should include
the implementation of the Stack ADT discussed in class.
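For reference, a minimal Stack ADT might look like the sketch below. The implementation from class is not reproduced in this handout, so the method names (`push`, `pop`, `peek`, `is_empty`, `size`) and the `__str__` format are assumptions; adapt them to the version you were given.

```python
class Stack:
    """A list-backed stack; the end of the list is the top (assumed design)."""

    def __init__(self):
        self._items = []

    def is_empty(self):
        return not self._items

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def peek(self):
        if not self._items:
            raise IndexError("peek at empty stack")
        return self._items[-1]

    def size(self):
        return len(self._items)

    def __str__(self):
        # Bottom of the stack first, e.g. [html, body, b]
        return "[" + ", ".join(str(item) for item in self._items) + "]"


s = Stack()
s.push("html")
s.push("body")
print(s)  # [html, body]
```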
Input: The sample test files for your program (test1.html, test2.html,
test3.html, test4.html, test5.html) can be downloaded from the course website.
You can open the test files with a text editor, e.g. Notepad++. The test files
cover different scenarios: test1.html and test2.html have balanced tags,
whereas the rest of the test files have unbalanced tags.
Processing the input file
- The first task your program must do is read in an HTML file and extract the tags. A simple strategy for doing this would be to write a function “getTags” that:
  - reads one character at a time from the data file, throwing everything away until it gets to a “<”. (Discard the “<” as well.)
  - reads one character at a time, appending it to a string, until it gets to a “>” or whitespace. (Discard the “>” as well.)
  - appends the tag to a list.
  - returns the list of tags found.
- Make sure you account for end-of-file conditions in getTags. If you have completed everything correctly, invoking getTags now gives you a list of all the tags, both start and end tags.
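The getTags strategy above could be sketched as follows. This is one possible reading of the steps, not a reference solution. It assumes getTags receives an already opened text file (any object with a `read` method), which is a choice made here so the example can run on an in-memory string.

```python
import io


def getTags(source):
    """Extract tag names from source, an open text file or similar object."""
    tags = []
    while True:
        # Step 1: throw characters away until "<" or end of file.
        ch = source.read(1)
        while ch and ch != "<":
            ch = source.read(1)
        if not ch:  # read(1) returns "" at end of file
            return tags

        # Step 2: collect characters until ">", whitespace, or end of file.
        tag = ""
        ch = source.read(1)
        while ch and ch != ">" and not ch.isspace():
            tag += ch
            ch = source.read(1)

        # Step 3: append the tag to the list.
        if tag:
            tags.append(tag)
        if not ch:
            return tags


sample = io.StringIO("<html><body><b>hi</b><br /></body></html>")
print(getTags(sample))
# ['html', 'body', 'b', '/b', 'br', '/body', '/html']
```

For a real test file, the call would look like `with open("test1.html") as f: tags = getTags(f)`. Because a tag also ends at whitespace, any attributes after the tag name are skipped by step 1 on the next pass.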
HTML Tag Checker
- Write a function called “checkTags” that iterates through your list of tags, looking for matches.
- If there is a mismatch of beginning and ending tags, print an error message (see output section below) and terminate.
- If the whole list of tags has been processed and there is no mismatch, print a confirmation message (see output section below).
- If, at the end of the list, there are tags remaining on the stack, print a message (see output section below) along with the tags remaining on the stack.
- In addition, have your program build a list called “VALIDTAGS”. As you iterate through your list of tags, check to see if the tag appears in VALIDTAGS. If it doesn’t, add it to VALIDTAGS and print a confirmation message (see output section below).
Output
The output of your program should include the following:
- A printout of your list of tags (the result of getTags).
- One line for each tag as you process it, explaining the action and showing the current contents of the stack. You may have to modify your ADT to allow for the information to be displayed properly. Some examples are:
Tag b pushed: stack is now [html, body, b]
Tag /b matches top of stack: stack is now [html, body]
Tag ul pushed: stack is now [html, body, ul]
- A message every time you add a tag to VALIDTAGS. For example:
New tag XXX found and added to list of valid tags
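A minimal sketch of checkTags that produces output lines in the format above is shown here. Several details are assumptions, not part of the handout: a plain list stands in for the Stack ADT, only start tags are recorded in VALIDTAGS, the function returns True or False, and the wording of the error and final messages is invented for illustration.

```python
def checkTags(tags):
    """Report on each tag in turn; return True if all tags match (assumed)."""
    stack = []       # plain list standing in for the Stack ADT
    VALIDTAGS = []   # each distinct start tag seen so far (assumed scope)

    def show():
        return "[" + ", ".join(stack) + "]"

    for tag in tags:
        if not tag.startswith("/"):
            # Start tag: record it if new, then push it.
            if tag not in VALIDTAGS:
                VALIDTAGS.append(tag)
                print(f"New tag {tag} found and added to list of valid tags")
            stack.append(tag)
            print(f"Tag {tag} pushed: stack is now {show()}")
        elif stack and stack[-1] == tag[1:]:
            # End tag that matches the most recent start tag.
            stack.pop()
            print(f"Tag {tag} matches top of stack: stack is now {show()}")
        else:
            # Assumed wording for the mismatch message.
            print(f"Error: tag {tag} does not match top of stack: stack is {show()}")
            return False

    if stack:
        # Assumed wording for leftover start tags.
        print(f"Processing complete. Unmatched tags remain on stack: {show()}")
        return False
    print("Processing complete. No mismatches found.")
    return True


checkTags(["html", "body", "b", "/b", "/body", "/html"])
```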
The Twist
There are some tags that do not need matching start and end tags! One example
is br. This tag is used to indicate a line break at the current location.
Another is meta, which is used to provide special information (“metadata”)
about a webpage. There is one more, which is left for you to identify in your
data files.
If you followed the instructions above correctly, your HTML checker will
notice that there are three tags that don’t have a match. Teach your program
that this is okay for these three cases by maintaining a list called
EXCEPTIONS which you hard-code into your main program. They will appear in
your list of tags just as any other tags. However, when you begin your
iteration through the list and you encounter one of these, you do not need to
push it on the stack since you won’t be waiting for a close tag. Instead, just
print an output line such as:
Tag br does not need to match: stack is still [html, body, b]
and continue.
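The exception handling can be sketched as a single step applied to each tag. The helper name `processTag`, the use of a plain list as the stack, and raising ValueError on a mismatch are all assumptions for illustration. EXCEPTIONS lists only the two tags named above; the third is left for you to identify.

```python
EXCEPTIONS = ["br", "meta"]  # third tag intentionally omitted: find it in your data files


def processTag(tag, stack):
    """Apply one tag to stack (a list) and return the report line.

    Hypothetical helper: raises ValueError when an end tag does not match.
    """
    def show():
        return "[" + ", ".join(stack) + "]"

    if tag in EXCEPTIONS:
        # No push: there is no close tag to wait for.
        return f"Tag {tag} does not need to match: stack is still {show()}"
    if not tag.startswith("/"):
        stack.append(tag)
        return f"Tag {tag} pushed: stack is now {show()}"
    if stack and stack[-1] == tag[1:]:
        stack.pop()
        return f"Tag {tag} matches top of stack: stack is now {show()}"
    raise ValueError(f"tag {tag} does not match top of stack {show()}")


stack = []
for tag in ["html", "body", "b", "br", "/b"]:
    print(processTag(tag, stack))
```

Checking EXCEPTIONS before anything else means these tags never reach the stack, so the matching logic for ordinary tags stays unchanged.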