用C++实现一个 CPU
的简易指令模拟器,需要使用STL库。
![CPU](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Board_with_SPARC64_VIIIfx_processors_on_display_in_Fujitsu_HQ.JPG/220px-
Board_with_SPARC64_VIIIfx_processors_on_display_in_Fujitsu_HQ.JPG)
Objective
Students will develop an in-order, fully functional C++ CPU simulator that
executes appropriately binary-formatted programs. The CPU supports flexible
hardware structures and program parameters, may run in debug or execution
mode, and provides performance statistics on executed programs. The simulator
will be coded and realized using various classes and structures in C++’s
Standard
Template Library (STL).
Background Information
CPU Simulator
Object-oriented programmers are often required to code simulators that model a
simulated representation of a physical system. In this project, we will
simulate a simple in-order Central Processing Unit (CPU) which will replicate
very basic processor functionality in software. We will recreate various parts
of the CPU and observe its states, instruction, and data-flow using a Debug
Mode. This will allow us to trace instructions as they flow through the CPU to
ensure proper functionality, while obtaining an appreciation for computer
architecture. Once the simulator is completely coded and bug free, it may be
run in execution mode, reflecting al terminal environment where the program
will be executed on a simulated CPU.
There are numerous CPU simulators that have been created over the past
decades. Many of the simulators are designed either for 1) educational
purposes, or 2) for research/commercial purposes.
In the context of education, CPU simulators are mainly used to study single
and multi-processor systems, where one may tune various parameters to view
performance effects of design choices, and study CPU structure optimization.
In the context of research/commercial simulators, computer architects must
first simulate a physical system and observe potential performance gains
before proceeding to create the actual hardware system and fabricating the
chip. If performance is negligible, another route or design must be
considered. Since tuning or redesigning a simulator is associated with minimal
in cost in comparison to fabricating a processor chip (that may or may not
work), simulators are widely used as a gateway for determining performance
potential before proceeding to the actual hardware design.
The CPU simulator we will be designing in this project will not be modeling a
complete physical system: a computing system comprises of many layers beyond
the scope of this project. Instead, we will be simulating a very simple CPU
for educational purposes, however the processor will still be capable of
executing various programs and obtaining respective performance statistics
from varying parameters. You may refer to this simulator and CPU design
principles in your future engineering courses as well, such as ENSC254 and
ENSC350.
Overview of a simple CPU pipeline
An instruction which enters a CPU is processed incrementally in a series of
steps. All instructions are assigned a number, or ID, as they enter the CPU.
Our CPU is an in-order processor, meaning that instructions must be processed
in the order which they enter the CPU. Therefore instructions must be
identified using a numbering system to process instructions in-order; the
younger the instruction, the greater the instruction’s ID value.
The series of steps required to execute an instruction is referred to as a
pipeline. Since each instruction is divided and processed in a series of
steps, several successive instructions may overlap in the pipeline at a given
time.
Our CPU consists of two memory units: 1) instruction memory which contains all
the programs instructions, and 2) a register file, which holds the program’s
variable data (data is signed integers)
Consider the CPU as an assembly line for processing instructions with the
following stages:
Instruction Fetch (IF)
The first step is to read an instruction of our program (fetched) from
instruction memory according to the address specified by a variable (or
“register” in hardware terminology) called the Program Counter (PC). The PC is
incremented every time an instruction is fetched from the instruction memory.
Therefore, the value held by the PC dictates the location, or address, of the
next instruction to be read from the instruction memory. We will assume the
first instruction of a program is located at PC=0, where instruction_memory[0]
contains the first instruction that will be fetched and processed in the
pipeline.
A “fetch width” parameter may be specified as well, indicating the number of
instructions that may be obtained from instruction memory simultaneously at a
given time. One unit of time is referred to as a clock cycle in CPU
terminology.
Instruction Decode (ID)
Once an instruction is fetched, it is decoded according to the processor’s
Instruction Set Architecture (ISA). The ISA stipulates the CPU’s supported
“instruction set” i.e. instruction operations and their formats recognized and
used by the processor to interpret instructions. The decoding process allows
the CPU to extract information from the fetched instruction. Specifically, the
decoder extracts an instruction’s i) input data locations to be read from the
register file (referred to as source operands), ii) operation, and iii) the
output destination/location where the result will be written to in the
register file.
Once the instruction’s information is extracted, it is placed in two separate
CPU structures: 1) Instruction Queue (IQ), and 2) Retire/Commit Buffer, or
more commonly referred to as the Reorder Buffer (ROB).
The IQ is a finite entry queue which buffers instructions until they are ready
for execution. An instruction is ready once all its source operands are marked
as “valid”.
The ROB is a finite FIFO list which manages and ensures the safe eviction or
“retirement” of instructions from the pipeline after execution. This
retirement process is completed during the Retirement / Commit stage. The ROB
has a much bigger role in complex CPUs, as its name implies, however in the
context of our project, the ROB will only be used to guarantee that the
instructions are executed and released from the pipeline in the order from
which they were fetched.
Instruction fetch and decode will be performed as one step in our CPU
simulator.
Dispatch/Read/Execute/Writeback (Rd/EXE/WB)
The term issue, interchangeable with the term dispatch, is the process of
releasing an instruction from the IQ and proceeding to execution.
In this step, instructions in the IQ wait until all previous instruction have
been issued. An instruction may be issued once all of its source operands are
ready to be read from the register file, and resources are available for
execution.
Accordingly in this second pipeline stage, when an instruction is ready in the
IQ: i) the instruction is released from the IQ (dispatch) , ii) its input
operands are read from the register file, the iii) operation is executed using
source operand data, iv) the result is written back to the register file at
the instruction’s specified destination register, and v) the destination
register is broadcasted to the IQ to inform younger instructions that the
contents in the register are ready to be read.
The CPU will implement all 5 of these substeps as one pipeline stage in the
CPU simulator. More details pertaining to each step of this multi-step process
are provided below:
Dispatch (Issue): The instructions in the IQ are monitored for operand
readiness. Since instructions are dependent on one another, a “consumer”
instruction can not execute before a “producer” instruction has finished
executing. For instance:
Instruction 1: Z = A + B; (“produces” results of variable Z)
Instruction 2: C = Z + F; (“consumes” the contents of variable Z)
As the example illustrates, Instruction 2 can not execute until the contents
of Z are computed and written to the variable Z, signifying that instruction 1
must first execute. Instruction 2 may thereafter proceed in the next cycle to
read the contents of the Z operand and executed its operation. Consequently,
these two instructions can not dispatch nor execute in the same cycle due to
this read-after-write (RAW) dependency. We refer to this issue validation
process as monitoring an instruction for “valid” operands.
Once a given instruction’s operands are valid, the instruction may be
dispatched (or “issued”) for execution. Since we are implementing an in-order
CPU, this implies that the oldest instruction in the IQ must be ready for
execution, implementing a FIFO (First-In, First-Out) scheme; if the oldest
instruction is not ready, subsequent younger instructions can not proceed to
execute even if they have valid operand data.
An “ISSUE_WIDTH” parameter may also be specified at this stage, indicating the
maximum number of ready/valid in-order instructions that the may execute on
the CPU in one clock cycle. This also implies that there are “ISSUE_WIDTH”
number of functional units that exist to execute the instructions
simultaneously. If multiple execution units exist, the processor model is
referred to as Superscalar.
Read: As previously mentioned, the working set of program variables are held
in the register file. Registers are the fastest access memory in a computing
system, yet the most expensive to implement.
Once an instruction is dispatched, an instruction’s source operands must be
read from the register file to obtain their respective data values. For
instance, in the example above, variable Z is associated with a value, let’s
say 5. This value must be read from the register file, and hence Z represents
a memory address which holds the data content of our variable. In this case, Z
would represent an address (or location) in the register file which must be
accesses to obtain the value of the variable. In the case of our CPU
simulator, we use a finite set of numbered “registers” representing our
program variables, which are named with the prefix “r”. We will work with a
set of 16 registers, and our operands will be named r0, r1… r15 accordingly.
Execute: Once the input operand contents have been read from the register
file, they are passed to an Arithmetic Logic Unit (ALU). The ALU is a
functional unit which performs a specified input operation on the values input
to the unit. Our CPU’s ALU takes 3 inputs: two input operands, and a code
representing the operation to be performed on the input values, referred to as
an opcode. Based on this information, the ALU computes the operation on the
input operand values, and produces an output result.
Writeback: The result generated by the ALU is redirected and written back to
the register file, at the address specified by the “destination” operand
extracted from the instruction at decode.
Once the contents have been written, the instruction is marked in the ROB as
valid.
Retirement/Commit
Retiring an instruction refers to the safe eviction of an instruction from the
pipeline. As mentioned previously, the role of the ROB is to guarantee that
the instructions are executed and released from the pipeline in the order from
which they were fetched. Accordingly, in the Commit stage, the ROB’s head
entry, considering a FIFO list structure, is accessed first. If the head of
the list is valid, i.e. has been successfully executed, then it may be safely
popped from the ROB. Once popped, the instruction is no longer present in the
pipeline. If it is not valid, then it must wait in the ROB until it has been
successfully executed.
A “COMMIT_WIDTH” parameter may also be specified in the CPU which indicates
how many valid in-order instructions may be popped off the list
simultaneously.
Referring back to Fig. 1, it is evident that since instructions are processed
in a series of stages, assuming a FETCH_WIDTH and ISSUE_WIDTH of 1
instruction, up to three instructions may be in the midst of being processed
by the CPU pipeline simultaneously. As FETCH_WIDTH and ISSUE_WIDTH parameter
increase, it is possible that the CPU may increase performance as more
instructions exist in the pipeline. This however depends on the number of RAW
dependencies and “branch” instructions inherit of the program.
Instruction Set Architectures (ISA)
The ISA is a functional definition of the operations, instruction encodings,
and storage that must be supported by a given processor. The role of the ISA
is to divide labour between hardware and software 1
, allowing for
universal software compatibility across various generations and styles of
processor models.
Using the ISA, the programmer may design a software program which is compiled
and converted to a set of instructions encoded for a given ISA as machine
language. The processor which supports the target ISA obtains the formatted
instructions and executes the software program. Thus, the ISA acts as a
hardware/software interface allowing programmers and computer architects to
work independently with the same computing objective.
The simulated CPU designed in this project will abide by an Instruction Set
Architecture (ISA) referred to as the Two Fifty One (TFO) ISA. Instructions
are read as machine language by the CPU, which we will interpret as assembly
language in this document. For our sake, the general format for reading
assembly is as follows: operation destination_reg, operand_source1,
operand_source2
For instance, if an assembly statement were written as follows: add r5, r6, r7
This would translate to r5 = r6+r7, where our variables are expressed in terms
of register locations.
Side Note: A word on binary for those who have not yet taken ENSC252: A binary
number is expressed in the base-2 numeral system, and represents a decimal
number as a set of 1s and 0s. Each 1 or 0 digit in a binary number is referred
to as a bit. Depending on the position of the bit, b i , a given decimal
number may be expressed using the following formula, assuming a binary number
consisting of n bits.
Where bit i may only take the value of 1 or 0 and is multiplied by a power of
2 depending on the location of the digit. The left most bit represents the
most significant bit/value, whereas the right-most bit signifies the least
significant bit. If you were given the binary number “01010” for instance,
this would equate to the decimal value.
The TFO ISA supports four instruction type formats: 1) Register type (R-type),
2) Immediate type (I-type),
3) Jump type (J-type) and 4) Prompt type (P-type). The instruction format type
may be obtained from the two left-most bits (or Most Significant Bits (MSB) 31
and 30) of the given instruction as shown below. Instructions are represented
in this section in binary.
All programs and their respective instructions will be provided to you for
this project as unsigned integers.
To extract each instruction’s information, the unsigned integer must be
converted to its binary equivalent, and processed as a string of 1s and 0s,
each digit referred to as a bit. The bits must be decoded as shown above to
obtain the field encodings, depending on the type of instruction extracted
from bits[31,30]. Specifications for each instruction format are provided
here.
If an instruction’s left most bits are “00”, the instruction is processed by
the decoder as an R-Type instruction. Rtype instructions only operate on
registers. In this case, both source operands must be read from the register
file. The address of the input operands src1 and src2 may be obtained from
instruction bits 20 through 16, and 15 through 11, respectively. The bits 29
through 26 represent the opcode operation performed on the input operand data,
which are all passed to the functional unit during execute. Once the
instruction has executed, the result is written back to the dest register
specified in bits 25 through 21. Bits 10 through 0 remain unused for this
instruction type.
Supported instructions: add, sub, mult, div, mod, mov (see Table I for
specific formats and descriptions)
A special type of instruction is also supported by R-Types, called print,
which prints the src1 value to standard output (cout). This is used to present
the user with the program’s final result.
If an instruction’s left most bits are “01”, the instruction is processed by
the decoder as an I-Type instruction. I-type instructions have one register
source operand, and one 16bit immediate source operand. The address of the
input operand src1 is obtained from bits 20 through 16, and the immediate
value from bits 15 through 0, respectively. In the case, only one value must
be read from the register file, and the second operand will be the immediate
value extracted from the instruction, passed directly to the ALU. Bits 29
through 26 represent the opcode (operation code) to be passed to the
functional unit during execute. Once the instruction is executed, the result
is written back to the dest register specified in bits 25 through 21.
Note that there are special types of I-type instructions in the TFO ISA,
referred to as branches, which redirect a program’s flow. More details and
processing specifications of these types of I-instructions will be provided in
the next section for a full elaboration on the topic.