这次需要代写一个HTML的解析器,主要考察Data Structure中Stack和Queue的应用。
Requirement
You will need the support files HTMLTag.java, HTMLTagType.java,
HTMLParser.java, and HTML Main.java from the resources button on the course
website; place these in the same folder as your program or project. If you are
using EZclipse, the files will be automatically downloaded for you. You should
not modify the provided files. The code you submit must work properly with the
unmodified versions.
Although this assignment relates to websites and HTML, you will not need to
know how to write HTML to complete the assignment.
What is HTML?
Web pages are written in a language called “Hypertext Markup Language”, or
HTML. An HTML file consists of text surrounded by markings called tags. Tags
give information to the text, such as formatting (bold, italic, etc.) or
layout (paragraph, table, list). Some tags specify comments or information
about the document (header, title, document type).
A tag consists of a named element between less-than <
and greater-than >
symbols. For example, the tag for making text bold uses the element b and is
written as <b>
.
Many tags apply to a range of text, in which case a pair of tags is used:
- An opening tag (indicating the start of the range), which is written as:
<name>
- A closing tag (indicating the end of the range), which is written as:
</name>
So to make some text bold on a page, one would surround the text with opening
and closing b tags:
like this
Tags can be nested to combine effects. For example:
bold italic
Some tags, such as the br tag (which inserts a line break) or the img (which
inserts an image), do not cover a range of text and are considered to be
“self-closing.” Self-closing tags do not need a closing tag; for a line break,
only<br>
is needed. Some web developers write self-closing tags with an
optional / before the >, such as<br />
.
Some tags can have attributes. For example, the tag<img src="cat.jpg">
specifies an image from the file cat.jpg. The element is img, and the rest of
the text such as src are attributes. In this assignment, you will not have to
worry about attributes. The parser will store them for you, and you can just
ignore them entirely.
One problem on the web is that many developers make mistakes in their HTML
code. All tags that cover a range must eventually be closed, but some
developers forget to close their tags. Also, whenever a tag is nested inside
another tag,<b><i>like this</i></b>
, the inner tag (i for italic, here)
must be closed before the outer tag is closed. So the following tags are not
valid HTML, because the</i>
should appear first:<b><i>this is invalid</b></i>
In this assignment you will write a class that stores the contents of an HTML
page and is able to fix any invalid HTML. Your HTMLManager will use stacks and
queues to figure out whether the tags match and fix any mistakes it finds.
Instructor-provided code will read HTML pages from files and break them apart
into tags for you; it’s your job to store the tags and fix the tags if there
is a mismatch.
Implementation Details
Write one class called HTMLManager.java that can handle pages with all types
of HTMLTags (including self-closing tags). Your class must have the
constructors/methods listed in the next section.
When using the Stack and Queue classes, you may only use the methods discussed
in lecture, namely: add/push, remove/pop, peek, isEmpty, size and toString.
public HTMLManager(Queue page)
The constructor takes in a Queue of HTMLTags that make up an HTML page. If the
queue passed is null, the constructor throws an IllegalArgumentException. An
empty queue (size 0) is allowed. You should store a Queue as a field to manage
your tags, however, the constructor should not directly store a reference to
the given Queue to avoid exposing internal state to the client.
public void add(HTMLTag tag)
Appends the given HTMLTag to the end of the queue of HTMLTags being managed.
If the HTMLTag passed in is null, the method should throw an
IllegalArgumentException.
public void removeAll(HTMLTag tag)
Removes all occurrences of the given HTMLTag from the queue of HTMLTags being
managed. If the HTMLTag passed in is null, the method should throw an
IllegalArgumentException.
public List getTags()
Returns a list of HTMLTags being managed. To avoid exposing internal state to
your client, make sure to make a copy rather than just returning a reference
to any internal field.
public void fixHTML()
This method will try to fix the page’s HTML if there are any missing or extra
tags. The algorithm that you will use is a simplified version of what Chrome,
Firefox, and other browsers do when they try to display a broken webpage. This
method should build up a correct version of the HTMLTags being managed, and
update the instance to store the corrected version of the HTML when it is
finished. To fix the HTML you will analyze the tags stored in the HTMLManager
using a Stack.
The basic idea of the algorithm is to process the page tag by tag. For each
tag, you will check to see if it has a matching tag later in the page in the
correct place. Since self-closing tags don’t have to match anything, whenever
you see one, you can simply add it directly to the correct version of the
tags. For opening tags, we assume that the writer of the HTML page intended to
actually include the tag; so, like with the self- closing tag, we add it to
the result. However, we need to keep track of if we have found its match; so,
it should also be added to a Stack. If we find a closing tag, we must figure
out if it is in the right place or not. In particular:
- If the opening tag at the top of the stack matches the closing tag we found, then it matches, and you should update the state according. (Hint: You probably want to edit the stack and the result.)
- If the opening tag at the top of the stack does not match the closing tag we found, then the writer of the HTML page made a mistake. To fix the mistake, you should add a new closing tag that matches the opening one at the top of the stack (so that it remains balanced). You should keep on adding closing tags to your result as long as the opening tag on top of the stack does not match. If you close all unclosed open tags on the stack, then the closing tag we found doesn’t match any open tags and should be discarded.
- If, at any point, you find a closing tag that has no matching open tag, just discard it.
Your Test Case (HTMLManagerTest.java)
In addition to HTMLManager.java, you will create tests to verify your class’s
removeAll method. Create a file HTMLManagerTest.java to thoroughly test the
removeAll method (a starter file is available on the website). Your tests
should call removeAll at least three times, removing three different elements.
You should compare the tags stored in your HTMLManager after each call with
the expected Queue of tags.
Your testing program must produce at least three lines of output indicating
whether each of your tests passed. You will be evaluated partly on how
complete your tests are (whether they try every possible input and state
configuration).
Development Strategy
The best way to write code is in stages. If you attempt to write everything at
once, it will be significantly more difficult to debug, because any bugs are
likely not isolated to a single place. For this assignment we will provide you
with a development strategy. We will also provide you with some correct output
that you can check using the Output Comparison Tool on the website.
We suggest that you develop the program in the follow four stages:
- In this stage, we want to decide what field(s) belong in the HTMLManager class. These field(s) should be initialized in the constructor.
- In this stage, we want to make sure that the constructor is working by printing out the page we constructed. The best way to do this is to write getTags() and print out the returned List.
- In this stage, we want to implement the method that edit the page we currently have. You should implement add(HTMLTag tag) and removeAll(HTMLTag tag).
- Finally, we want to write the difficult method: fixHTML(). You can test that things are working by using the Output Comparison Tool on the course website.