Introduction
Each semester I develop one or more projects that require you to implement
many of the design patterns and APIs discussed in class. The project for this
semester is an instruction-level simulator of a common CPU instruction set
architecture called MIPS.
MIPS is a RISC (reduced instruction set computer) architecture adaptable to a
wide variety of applications, ranging from embedded systems (e.g. PIC32) and
game consoles (e.g. the original PlayStation) to high-end workstations and
servers (e.g. SGI systems). We will be
simulating the first version, MIPS Release I, without co-processors. This CPU
is ideally suited to our purposes since it has (relatively) few instructions,
only three instruction formats, and two addressing modes. It
uses a 32-bit word size and has 32 general-purpose registers.
Our simulator, which we will call simmips, will read a defined subset of MIPS
assembly source code and simulate program execution, similar, but not
identical, to SPIM and MARS. The code and tests will be developed over the
first part of the semester in milestones 0-2. We will then write a graphical
debugger and visualization tool in milestones 3-4 during the last part of the
semester.
The milestones will be used for grading and thus are motivation to not
procrastinate. They correspond closely to the schedule of material that we
will be covering in class. Your progress will be tracked through a series of
private git repositories on GitHub.
The goal of this first milestone is to develop the lexer for our simulator.
This will allow you to get accustomed to the course workflow, gain experience
reading specifications, and refresh your programming skills. You will write a
lexer for MIPS assembly source files, producing a sequence of tokens.
This assignment requires very little C++ experience beyond basic data
structures. If you struggle with this assignment, you are woefully ill-prepared
for this course. This means you should devote additional time to it and seek
advice from the TAs and me, or, as a last resort, drop the course.
Recommended Background Reading
- Section 4 of “MIPS Assembly Language Programming using QtSpim”, by Ed Jorgensen, 2016.
- Lexing
- CMake Workflow on Windows
Basics of MIPS Assembly
The role of a compiler is to translate from a higher-level representation of a
computation, called the source language, to a lower one, called the target
language. Traditionally, when you are programming in a high-level language
like C or C++, the target language is assembly. Individual files, or modules,
are then passed through another compiler, called an assembler, to produce
object (machine) code. The object code from multiple modules is combined
(usually with pre-compiled library code) by the linker to produce the final
executable machine code. Modern compilers typically subsume high-level compilation and
assembly into a single program, although they can emit intermediate languages.
Some high-level languages (e.g. Java, C#, Lua, PLT Scheme) compile to a
portable intermediate language called bytecode, which is then compiled to
machine code as needed (so-called just-in-time or JIT).
At each stage of compilation, the code moves from syntax that is more
human-readable to syntax that is less so, while becoming more efficient for hardware
execution. Assembly is the last language in this process that is symbolic
enough that humans can interpret it without serious effort (one can decode and
write/understand machine code, but it is a lot of work). Assembly language is
expressed in plain text files called assembly source files. Like all languages
it has a specific syntax, what constitutes a well-formed sequence of
characters, and semantics, what the program does (its side effects).
Our simulator will only simulate execution of a single module of code; thus,
we will have no need for a linker. It will also not handle conditional
assembly or macros.
It will be helpful to see an example first before diving into the details of
the assembly source format in later milestones. What follows is a simple
program with C-like translations given in the comments.
# this is a comment, any characters after # up to newline are ignored
        .data              # start the data section, where variables are declared
var1:   .byte 1            # char var1 = 1;
var2:   .half 65526        # short var2 = 65526;
var3:   .word 1            # int var3 = 1;
var4:   .word 2            # int var4 = 2;
var5:   .word 0            # int var5 = 0;
        .text              # start the code section, a sequence of instructions
main:                      # compute var5 = var3 + var4;
        lw $t0, var3
        lw $t1, var4
        add $t3, $t0, $t1
        sw $t3, var5
Some initial things of note. Comments begin with the # character and are not
semantically important to the execution of the program. The program contains
several directives to the assembler. These begin with a . character. The most
prominent are .data and .text, which control whether the assembler is laying
out and initializing memory or processing computations as instructions. Also
notable are labels, strings that end in the character ':'.
Lexing the Assembly Source
MIPS assembly programs, like that above, are written in plain ASCII text. In
the assembly file format there are several syntactically meaningful tokens:
- EOL (LF or ASCII code point 0x0a) indicates the end of a source line has been reached (no value)
- SEP (a comma or ASCII code point 0x2c) is the separator for lists (no value)
- OPEN_PAREN (an open parenthesis or ASCII code point 0x28) indicates the start of an indirect address (no value)
- CLOSE_PAREN (a close parenthesis or ASCII code point 0x29) indicates the end of an indirect address (no value)
- STRING_DELIM (double quote or ASCII code point 0x22) is used to wrap literal strings (no value)
- EQUAL (an equal sign or ASCII code point 0x3d) is used in a constant assignment (no value)
- STRING is any other whitespace-separated string, where whitespace is any of ASCII code points 0x20, 0x0d, 0x09, 0x0b, or 0x0c (value is the string)
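As an illustrative sketch only (not part of the required interface), the single-character tokens in this list could be recognized by a helper like the one below. The helper name charToken and the header it includes are assumptions; the TokenType names are those given in the enumeration later in this handout.

#include "token.hpp"  // assumed to declare the TokenType enumeration used below

// Hypothetical helper: map a single syntactically meaningful character to its
// token type. Any other non-whitespace character becomes part of a STRING.
TokenType charToken(char c) {
  switch (c) {
    case '\n': return EOL;           // 0x0a, end of line
    case ',':  return SEP;           // 0x2c, list separator
    case '(':  return OPEN_PAREN;    // 0x28, start of indirect address
    case ')':  return CLOSE_PAREN;   // 0x29, end of indirect address
    case '"':  return STRING_DELIM;  // 0x22, string delimiter
    case '=':  return EQUAL;         // 0x3d, constant assignment
    default:   return STRING;        // accumulated into a whitespace-separated string
  }
}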
The job of the lexer is to convert the input text to a sequence of these
tokens. It should discard any comments during this process by ignoring any
input from the character # up to a newline. It should also record the line
number in the source file on which each token appeared.
The interface is a C++ free function with the signature:
/* Function to convert an input text stream
 * into a sequence of tokens.
 */
TokenSequence lexer(std::istream &);
This function is declared in the file lexer.hpp and should be implemented in
the file lexer.cpp. A TokenSequence is a list of type Token:
typedef std::list<Token> TokenSequence;
where a Token comprises a type, the line number where the token originated, and
the string value:
enum TokenType {
EOL,
SEP,
EQUAL,
OPEN_PAREN,
CLOSE_PAREN,
STRING_DELIM,
STRING,
ERROR
};
class Token {
public:
// construct a token type on line with empty value
Token(TokenType type, std::size_t line);
// construct a token type on line with value
Token(TokenType type, std::size_t line, const std::string &value);
// return the token type
TokenType type() const;
// return the token's originating source line
std::size_t line() const;
// return the token's value
std::string value() const;
};
The Token and TokenSequence types are defined and implemented in the module
token.hpp/token.cpp.
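As a hedged illustration of how these pieces fit together (not something you need to submit), a small driver could lex an in-memory stream and walk the resulting sequence as shown below; the use of lexer.hpp as the single include is an assumption of this sketch.

#include <iostream>
#include <sstream>
#include "lexer.hpp"  // assumed to pull in Token, TokenType, and TokenSequence

// Illustrative driver only: lex a two-line program held in a string and
// print the line number and value of each token.
int main() {
  std::istringstream input(".data\nvar1: .byte 1\n");
  TokenSequence tokens = lexer(input);
  for (const Token &tok : tokens) {
    std::cout << tok.line() << ": '" << tok.value() << "'\n";
  }
  return 0;
}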
Example
Consider the following contrived example of an assembly source file.
.data
LENGTH = 10
array: .space LENGTH
str: .asciiz "the (end)"
.text
main: lw $t0, array
lw $t1, ($t0)
This would produce the following sequence of tokens, written as (type, value);
the line breaks below are only for readability.
(STRING,".data") (EOL,"")
(STRING,"LENGTH") (EQUAL,"") (STRING,"10") (EOL,"")
(STRING,"array:") (STRING,".space") (STRING,"LENGTH") (EOL,"")
(STRING,"str:") (STRING,".asciiz") (STRING_DELIM,"") (STRING,"the (end)") (STRING_DELIM,"") (EOL,"")
(STRING,".text") (EOL,"")
(STRING,"main:") (STRING,"lw") (STRING,"$t0") (SEP,"") (STRING,"array") (EOL,"")
(STRING,"lw") (STRING,"$t1") (SEP,"") (OPEN_PAREN,"") (STRING,"$t0") (CLOSE_PAREN,"") (EOL,"")
Note in particular that any text other than ASCII code 0x0a (newline) in between
a pair of double quotes is treated literally (not as potential tokens).
The lexer should detect the following errors (only):
- A string must be enclosed in matching double quotes (") and cannot span multiple lines.
- A pair of ( and ) must match and cannot span multiple lines.
For example, each of the two inputs below would produce a lexing error.
.data
var: .ascii "this is
a string"

.text
lw $t0, ($t1
)
On the first error detected, the last token in the returned sequence should
have the type ERROR, and its value should be an informative error message that
starts with the string "Error:". For example, the token sequence resulting from
the second error case above might be:
(STRING,".text") (EOL,"") (STRING,"lw") (STRING,"$t0") (SEP,"") (OPEN_PAREN,"") (STRING,"$t1") (ERROR,"Error: unmatched paren on line 2")
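To make the error behavior concrete, here is one possible shape for the quoted-string portion of a lexer. The helper name lexQuotedString, its parameters, and the exact message wording are assumptions of this sketch, not requirements.

#include <cstddef>
#include <istream>
#include <string>
#include "lexer.hpp"  // assumed to provide Token, TokenType, and TokenSequence

// Sketch: consume a quoted string whose opening " has just been read.
// Everything up to the closing " is literal text; a newline or end of
// input before the closing quote yields a single ERROR token.
void lexQuotedString(std::istream &in, std::size_t line, TokenSequence &tokens) {
  tokens.push_back(Token(STRING_DELIM, line));
  std::string text;
  char c;
  while (in.get(c)) {
    if (c == '"') {  // closing delimiter found
      tokens.push_back(Token(STRING, line, text));
      tokens.push_back(Token(STRING_DELIM, line));
      return;
    }
    if (c == '\n') {
      break;  // strings cannot span multiple lines
    }
    text.push_back(c);  // all other characters are taken literally
  }
  tokens.push_back(Token(ERROR, line,
      "Error: unterminated string on line " + std::to_string(line)));
}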
Instructions
Accept the GitHub invitation above and then clone the git repository to your
local machine. The initial git repository contains several files. The types
for TokenSequence are defined for you in the header lexer.hpp . Your task in
this milestone is to implement the function lexer in the file lexer.cpp . You
can define additional units of code (classes, helper functions) in lexer.cpp ,
but do not change the provided type definitions, nor the lexer function
signature.
Unit tests for your function are provided in the file test_lexer.cpp using the
Catch testing framework (see meeting 7). Do not modify them (other than to
perhaps add temporary debugging output). The included CMakeLists.txt file sets
up building these tests for you. Just configure the build, run the build, and
then run the tests. You should use git to commit versions of your program
source (only) as you go along, demonstrating good incremental programming
technique.
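For reference, a Catch test case for the lexer generally looks like the hypothetical example below; the actual inputs and assertions in the provided test_lexer.cpp may differ.

#include "catch.hpp"
#include "lexer.hpp"

#include <sstream>

// Hypothetical test in the style of the provided suite: a single directive
// line should lex to a STRING token followed by an EOL token.
TEST_CASE("lex a single directive line", "[lexer]") {
  std::istringstream input(".data\n");
  TokenSequence tokens = lexer(input);

  REQUIRE(tokens.size() == 2);
  REQUIRE(tokens.front().type() == STRING);
  REQUIRE(tokens.front().value() == ".data");
  REQUIRE(tokens.back().type() == EOL);
}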
Steps to build and run the tests in the reference environment (after vagrant
ssh) are as follows. To configure the build:
cmake /vagrant
To run the build:
make
or
cmake --build .
To run the unit tests:
make test
or
cmake --build . --target test
The test output is placed in Testing/Temporary/LastTest.log. You can also just
run the tests directly:
./unit_tests
To configure and run the build in strict mode (increased warnings, warnings
become errors):
cmake -DSTRICT=True /vagrant
make clean; make
Submission
To submit your assignment:
- Tag the git commit that you wish to be considered for grading as "final".
- Push this change to GitHub.
If you need to tag a different version of your code as final (before the due
date) simply create and push a new tag appending a monotonically increasing
number to final, e.g. final2, final3, etc.
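For example, assuming your remote is named origin (the default after cloning), the commands are roughly:

git tag final            # mark the commit currently checked out
git push origin final    # publish the tag to GitHub

# Later, if a newer commit should be graded instead:
git tag final2
git push origin final2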
Be sure you have committed all the changes you intend to. It is a good idea to
re-clone your repository into a separate directory and double check it is what
you intend to submit. Failure to complete these steps by the due date will
result in a failed submission.