Compiler assignment help: CS1167 Basic Tokenizer


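Write a simple lexical analyzer for the C language in C++. Lexical analysis
sits at the boundary between preprocessing and compilation in a compiler
pipeline. Compared with compilation, assembly, and linking, it is quite simple.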
In this part of the course project, you will be implementing a basic tokenizer
for a subset of C language that can convert a sequence of characters passed to
the actual compilation stage of a C compiler (or rather a compiler for our
subset of C language) into a sequence of strings each holding a C-language
token. Any remaining preprocessing directives in the input should be stripped
out. For an overview of the stages of building a C language program, see:
https://calleerlandsson.com/the-four-stages-of-compiling-a-c-program/
For more information on the preprocessor, see:
https://gcc.gnu.org/onlinedocs/cpp/Preprocessor-Output.html
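As a rough illustration of stripping leftover directives, the sketch below drops every line whose first non-blank character is `#`, which covers the linemarkers described in the GCC page above. The helper name `stripDirectives` is hypothetical and not part of the provided template. The sketch also assumes a `#` at the start of a line is never part of a token, which may not match the assignment's grammar exactly.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not from the provided template): returns `code`
// with every line whose first non-blank character is '#' removed.
// Each kept line is re-emitted followed by '\n'.
std::string stripDirectives(const std::string& code) {
    std::istringstream in(code);
    std::string line;
    std::string out;
    while (std::getline(in, line)) {
        std::size_t first = line.find_first_not_of(" \t");
        if (first != std::string::npos && line[first] == '#') {
            continue;  // linemarker or leftover directive: drop the line
        }
        out += line;
        out += '\n';
    }
    return out;
}
```

Running the tokenizer on the result of such a pass keeps the token-level code free of directive handling.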
A token, for the purposes of this Course Project and in a syntactically
correct program, is defined as the smallest number of consecutive characters
that can be surrounded by whitespace without changing the functionality of the
code.
Space, tab, linefeed, carriage-return, formfeed, vertical-tab, and newline
characters are called “white-space characters” because they serve the same
purpose as the spaces between words and lines on a printed page — they make
reading easier. Tokens are delimited (bounded) by white-space characters and
by other tokens, such as operators and punctuation.
https://msdn.microsoft.com/en-us/library/e9a023cx.aspx
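The white-space rule can be sketched as a small predicate plus a skip loop. The function names here are illustrative assumptions, not names from the template. The character set is the one listed in the quotation above (linefeed and newline are the same character, `'\n'`, in C++ source).

```cpp
#include <string>

// Hypothetical predicate: true for the white-space characters listed above.
bool isWhitespaceChar(char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\r' || c == '\f' || c == '\v';
}

// Hypothetical helper: returns the index of the first non-white-space
// character at or after `pos`, or s.size() if there is none.
std::size_t skipWhitespace(const std::string& s, std::size_t pos) {
    while (pos < s.size() && isWhitespaceChar(s[pos])) {
        ++pos;
    }
    return pos;
}
```

A tokenizer loop would typically call `skipWhitespace` before trying to read each token.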

The punctuation and special characters in the C character set have various
uses, from organizing program text to defining the tasks that the compiler or
the compiled program carries out. They do not specify an operation to be
performed. Some punctuation symbols are also operators (see Operators). The
compiler determines their use from context.
https://msdn.microsoft.com/en-us/library/eb1htw0t.aspx
String literals are a type of token. You will have to research what they and
character literals can look like. Hint: here is a link with a useful tool to
add and remove escape characters from a string:
http://www.freeformatter.com/java-dotnet-escape.html
(it should be valid for the C language too).
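One possible way to find the end of a string or character literal is sketched below. It assumes that a backslash always escapes the following character and that a literal cannot span a raw newline. The BNF files in the template remain the authority on what is actually allowed. The name `scanQuoted` is hypothetical.

```cpp
#include <string>

// Hypothetical helper: s[start] must be the opening quote (' or ").
// Returns the index one past the matching closing quote, or
// std::string::npos if the literal is unterminated.
std::size_t scanQuoted(const std::string& s, std::size_t start) {
    const char quote = s[start];
    for (std::size_t i = start + 1; i < s.size(); ++i) {
        if (s[i] == '\\') {
            ++i;       // skip the escaped character as well
            continue;
        }
        if (s[i] == '\n') {
            break;     // assumption: a raw newline ends the search
        }
        if (s[i] == quote) {
            return i + 1;
        }
    }
    return std::string::npos;
}
```

With this shape, an escaped quote such as `\"` inside a string does not end the literal, and an `npos` result can feed the unterminated-string handling described later in the assignment.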
Our subset of C language will use the following punctuators:

Id  Symbol  Description
1   (       opening parenthesis
2   )       closing parenthesis
3   :       colon
4   ;       semicolon
5   *       indirection
Here is the list of operators for our subset of C language:
Category              Operator                              Associativity
Postfix               ++ --                                 Left to right
Unary                 ~ &                                   Right to left
Divisive              / %                                   Left to right
Additive              + -                                   Left to right
Shift                 << >>                                 Left to right
Conditional Operator  ?                                     Right to left
Assignment            = += -= *= /= %= >>= <<= &= ^=        Right to left
For example,
sum = 3 + 2;

can be split into the following set of tokens:
sum = 3 + 2 ;
When the compiler interprets tokens, it includes as many characters as
possible in a single token before moving on to the next token. Because of this
behavior, the compiler may not interpret tokens as you intended if they are
not properly separated by white space. Consider the following expression:
i+++j
In this example, the compiler first makes the longest possible operator (++)
from the three plus signs, then processes the remaining plus sign as an
addition operator (+). Thus, the expression is interpreted as (i++) + (j), not
(i) + (++j). https://msdn.microsoft.com/en-us/library/44kh05a0.aspx
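The longest-match rule can be sketched by trying candidate operators from longest to shortest. The list below is copied from the punctuator and operator tables above. The function name `longestOperator` and the choice of a linear search are assumptions made for illustration only.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the longest operator or punctuator that
// starts at s[pos], or an empty string if none matches.
std::string longestOperator(const std::string& s, std::size_t pos) {
    // Ordered longest first, so the first hit is the longest match.
    static const std::vector<std::string> ops = {
        ">>=", "<<=",
        "++", "--", "<<", ">>", "+=", "-=", "*=", "/=", "%=", "&=", "^=",
        "~", "&", "/", "%", "+", "-", "?", "=", "(", ")", ":", ";", "*"};
    for (const std::string& op : ops) {
        if (s.compare(pos, op.size(), op) == 0) {
            return op;
        }
    }
    return "";
}
```

Applied to `i+++j`, a call at index 1 yields `++` and a call at index 3 yields `+`, which matches the (i++) + (j) reading.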
A template including the required function for you to implement has been
provided. You may implement any other helper functions, but you are not
allowed to remove or change the structure of the function ‘tokenizeCodeStrip’.
Take care with floating point numbers. For example, 1.575E1 is equivalent to
15.75. So, “1.575E1” is one token.
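A minimal sketch of scanning such a number as a single token follows. It assumes the shape digits, optional fraction, optional exponent with an optional sign, and it ignores suffixes such as `f` or `L`. The BNF files should be checked for the exact grammar. `scanNumber` is a hypothetical name.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns the index one past the numeric literal
// that starts at s[pos]. An exponent is consumed only if at least one
// digit follows the 'e'/'E' and the optional sign.
std::size_t scanNumber(const std::string& s, std::size_t pos) {
    auto isDigit = [&s](std::size_t i) {
        return i < s.size() &&
               std::isdigit(static_cast<unsigned char>(s[i])) != 0;
    };
    std::size_t i = pos;
    while (isDigit(i)) ++i;                    // integer part
    if (i < s.size() && s[i] == '.') {
        ++i;
        while (isDigit(i)) ++i;                // fractional part
    }
    if (i < s.size() && (s[i] == 'e' || s[i] == 'E')) {
        std::size_t j = i + 1;
        if (j < s.size() && (s[j] == '+' || s[j] == '-')) ++j;
        if (isDigit(j)) {
            while (isDigit(j)) ++j;            // exponent digits
            i = j;
        }
    }
    return i;
}
```

For the input `1.575E1;` starting at index 0, the scan stops at the semicolon, so `1.575E1` comes out as one token.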
Refer to the BNF text files in the provided template. They contain Backus-Naur
Form (BNF) grammars for our project.
Note, when testing your code we may use some tests that are correct according
to our C-subset grammar, some tests that include things not present in the
grammar (like operators or punctuators not present in the grammar), and some
tests that have other sorts of errors (like unterminated strings). When you
find something not included in our subset grammar, including unterminated
strings, issue a warning, ignore the problematic characters, push an empty
string onto the back of the vector of strings to be returned, and then
continue on tokenizing the remaining characters.
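The recovery policy might look like the toy loop below. It is deliberately not the template's `tokenizeCodeStrip`, whose signature is not shown here. It recognizes only alphanumeric runs and a few single-character tokens, and it assumes that one empty string is pushed per offending character, which is one possible reading of the instructions above.

```cpp
#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Toy sketch (hypothetical name and token set): shows the
// warn, skip, push "" and continue policy for unsupported characters.
std::vector<std::string> tokenizeWithRecovery(const std::string& code) {
    const std::string singles = "();:*=+-/%~&?";
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < code.size()) {
        const unsigned char c = static_cast<unsigned char>(code[i]);
        if (std::isspace(c)) {
            ++i;                                   // white space separates tokens
        } else if (std::isalnum(c) || c == '_') {
            const std::size_t start = i;
            while (i < code.size() &&
                   (std::isalnum(static_cast<unsigned char>(code[i])) ||
                    code[i] == '_')) {
                ++i;
            }
            tokens.push_back(code.substr(start, i - start));
        } else if (singles.find(code[i]) != std::string::npos) {
            tokens.push_back(std::string(1, code[i]));
            ++i;
        } else {
            std::cerr << "warning: unsupported character '" << code[i]
                      << "' ignored\n";
            tokens.push_back("");                  // placeholder for the error
            ++i;                                   // then keep tokenizing
        }
    }
    return tokens;
}
```

With the input `a = b @ 2;`, this sketch returns the tokens `a`, `=`, `b`, an empty string for `@`, `2`, and `;`, and writes one warning to standard error.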


Author: SafePoker
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit SafePoker when reposting.