用C++代写一个简易的C语言的 Lexical analysis
,也就是Compiler的Preprocessing过程中的一环。相比Compilation, Assembly和Linking,已经非常简单了。
In this part of the course project, you will be implementing a basic tokenizer
for a subset of C language that can convert a sequence of characters passed to
the actual compilation stage of a C compiler (or rather a compiler for our
subset of C language) into a sequence of strings each holding a C-language
token. Any remaining preprocessing directives in the input should be stripped
out. For an overview of the stages of building a C language program, see:
https://calleerlandsson.com/the-four-stages-of-compiling-a-c-program/
for
more information on the preprocessor, see:
https://gcc.gnu.org/onlinedocs/cpp/Preprocessor-Output.html
A token, for the purposes of this Course Project and in a syntactically
correct program, is defined as the smallest number of consecutive characters
that can be surrounded by whitespace without changing the functionality of the
code.
Space, tab, linefeed, carriage-return, formfeed, vertical-tab, and newline
characters are called “white-space characters” because they serve the same
purpose as the spaces between words and lines on a printed page — they make
reading easier. Tokens are delimited (bounded) by white-space characters and
by other tokens, such as operators and punctuation.
https://msdn.microsoft.com/en-us/library/e9a023cx.aspx
The punctuation and special characters in the C character set have various
uses, from organizing program text to defining the tasks that the compiler or
the compiled program carries out. They do not specify an operation to be
performed. Some punctuation symbols are also operators (see Operators). The
compiler determines their use from context. [ https://msdn.microsoft.com/en-
us/library/eb1htw0t.aspx ](https://msdn.microsoft.com/en-
us/library/eb1htw0t.aspx) .
String literals are a type of token. You will have to research what they and
character literals can look like. Hint: here is a link with a useful tool to
add and remove escape characters from a string:
http://www.freeformatter.com/java-dotnet-escape.html
–it should be valid
for C-language too.
Our subset of C language will use the following punctuators:
Id | Symbol | Description |
---|---|---|
1 | ( | opening parenthesis |
2 | ) | closing parenthesis |
3 | : | colon |
4 | ; | semicolon |
5 | * | indirection |
Here is the list of operators for our subset of C language: | ||
Category | Operator | Associativity |
— | — | — |
Prostfix | ++ – | Left to right |
Unary | ~ & | Right to left |
Divisive | / % | Left to right |
Additive | + - | Left to right |
Shift | << >> | Left to right |
Conditional Operator | ? | Right to left |
Assignment | = += -= *= /= %= >>= <<= &= ^= | Right to left |
For example, |
sum = 3 + 2;
can be split into the following set of tokens:
sum = 3 + 2 ;
When the compiler interprets tokens, it includes as many characters as
possible in a single token before moving on to the next token. Because of this
behavior, the compiler may not interpret tokens as you intended if they are
not properly separated by white space. Consider the following expression:
i+++j
In this example, the compiler first makes the longest possible operator (++)
from the three plus signs, then processes the remaining plus sign as an
addition operator (+). Thus, the expression is interpreted as (i++) + (j), not
(i) + (++j). https://msdn.microsoft.com/en-us/library/44kh05a0.aspx
A template including the required function for you to implement has been
provided. You may implement any other helper functions, but you are not
allowed to remove or change the structure of the function ‘tokenizeCodeStrip’.
Take care with floating point numbers. For example, 1.575E1 is equivalent to
15.75. So, “1.575E1” is one token.
Refer to the BNF text files in the provided template. They contain Backus-Naur
Form (BNF) grammars for our project.
Note, when testing your code we may use some tests that are correct according
to our C-subset grammar, some tests that include things not present in the
grammar (like operators or punctuators not present in the grammar), and some
tests that have other sorts of errors (like unterminated strings). When you
find something not included in our subset grammar, including unterminated
strings, issue a warning, ignore the problematic characters, push an empty
string onto the back of the vector of strings to be returned, and then
continue on tokenizing the remaining characters.