This page goes through some technical details of how Mips2cs works. The aim is to give a sort of overview of the system, and to provide some "context" for those who may wish to go diving into the source code.
It is assumed that the reader has some familiarity with MIPS assembly language and with Haskell and C# programming.
The basic "pipeline" of Mips2cs is summarized in the following diagram:
The sequence of events during a typical run is as follows:
The following sections give more details.
The Intermediate representation is comprised of three main elements: Exprs, CondExprs and Statements. A full definition of these can be found in the source code (in module Mips2cs.Intermediate); here we give a brief description of each.
Represents an arbitrary integer expression. Examples:
(LitExpr 42)
– the number 42 (LoadRegExpr "v0")
– the contents of MIPS register v0
(LoadMemExpr MemWord (LitExpr 0x1234))
– the 32-bit word currently stored at address 0x1234 (UnExpr Negate (LoadRegExpr "s1"))
– the expression (-s1)
(BinExpr Add (LoadRegExpr "t2") (LitExpr 1))
– the expression (t2 + 1)
(VarExpr "foo")
– the current contents of the variable foo
(see below) Represents a condition (boolean expression). Examples:
(BinCond Equal (LoadRegExpr "v0") (LitExpr 23))
– true if v0
is equal to 23 (BinCond LessThan (LoadRegExpr "t0") (LoadRegExpr "t1"))
– true if t0
is less than t1
(LitCond True)
– always true Represents an operation performed by the MIPS machine. Examples include:
(StoreReg "v0" (LitExpr 2))
– writes the number 2 into register v0
. (StoreMem MemByte (LitExpr 0x1234) (LitExpr 42))
– writes the byte 42 into memory location 0x1234. (Jump (BinCond Equal (LoadRegExpr "v0") (LitExpr 0)) 0x1234 0x5678)
– if v0==0
, then goto address 0x1234, else goto 0x5678. (Jump (LitCond True) 0x1234 0x5678)
– jumps unconditionally to 0x1234. (IndirectJump (LoadRegExpr "ra"))
– jumps to the address pointed to by the ra
register. (Let "foo" (LitExpr 31))
– creates a variable called foo
and assigns the value 31 to it. (See below for more about variables.) The Intermediate language contains the concept of variables. Variables are created by the Let
statement and can be used in expressions via the VarExpr
construct (see above).
Variables must follow two rules:
Variables are used extensively during the optimization passes of Mips2cs (this will be discussed further in the section on optimization, below).
The idea is that the entire MIPS machine code program can be converted into an equivalent program in the Intermediate language. To do this, each MIPS instruction is converted into zero or more Statements. This is usually straightforward. For example:
addiu v1, a0, 1
→ StoreReg "v1" (BinExpr Add (LoadRegExpr "a0") (LitExpr 1))
subu a0, a1, s0
→ StoreReg "a0" (BinExpr Sub (LoadRegExpr "a1") (LoadRegExpr "s0"))
mult t0, t1
→ StoreReg "hi" (BinExpr MultHi (LoadRegExpr "t0") (LoadRegExpr "t1"))
and StoreReg "lo" (BinExpr Mult (LoadRegExpr "t0") (LoadRegExpr "t1"))
lw s0, 4(sp)
→ StoreReg "s0" (LoadMemExpr MemWord (BinExpr Add (LitExpr 4) (LoadRegExpr "sp")))
sb s0, 4(sp)
→ StoreMem MemByte (LoadRegExpr "s0") (BinExpr Add (LitExpr 4) (LoadRegExpr "sp"))
jr ra
→ IndirectJump (LoadRegExpr "ra")
j 0x1234
→ Jump (LitCond True) 0x1234 0x1234
Branches require special handling because of the MIPS delay slot mechanism. Mips2cs creates two Statements for a branch instruction: one for the delay slot, and one for the branch itself. In addition, any registers referenced in the branch condition are saved (using Let statements) to ensure that the correct value gets used.
For example, the following disassembly:
400084: 1060000c beqz v1,4000b8 400088: 24630001 addiu v1,v1,1
would be translated to the following three Statements:
Let "delay1" (LoadRegExpr "v1") StoreReg "v1" (BinExpr Add (LoadRegExpr "v1") (LitExpr 1)) Jump (BinCond Equal (VarExpr "delay1") (LitExpr 0)) 0x4000b8 0x40008c
Note that the Let is essential in this example; without it, the wrong value of v1 would be used in the branch condition.
After generating the Statements for each machine code instruction, we organize them into basic blocks. (A basic block is a chunk of code that has exactly one entry point and one exit point.)
In Mips2cs basic blocks are represented by
The following invariants are maintained:
The results of the basic block analysis are used extensively during the optimization and code generation passes.
Once the Intermediate code has been produced, Mips2cs runs a series of optimization passes on it, in order to make it shorter and simpler. (This results in a shorter and faster C# program at the end of the process.)
The optimizations are done by the module Mips2cs.Simplifier, which runs several optimization passes, as follows.
We run a pass to simplify Exprs at compile time. This uses simple pattern matching to calculate compile time constants (e.g. "1 + 2" is changed into "3") and apply arithmetic identities (e.g. "x + 0" is changed into "x", for any expression x).
A naive translation of machine code into C# would tend to produce sequences like the following:
v0 = s0 + s1; v0 = v0 + s2; v0 = v0 + 10;
(where each line corresponds to a single machine code instruction). However, it would be more natural to write this in a single line, as follows:
v0 = s0 + s1 + s2 + 10;
Not only does this result in a shorter C# program, it also results in a smaller final compiled executable (at least when using the Microsoft C# compiler – which appears to generate a more efficient bytecode sequence for the second form of the code, as compared to the first).
Therefore, we introduce two optimization passes which are designed to turn the first form of the code into the second. These are as follows:
If a register is written multiple times, all writes bar the last can be changed into local variables (i.e. Let statements). For example, the following:
StoreReg "v0" (BinExpr Add (LoadRegExpr "s0") (LoadRegExpr "s1")) StoreReg "v0" (BinExpr Add (LoadRegExpr "v0") (LoadRegExpr "s2")) StoreReg "v0" (BinExpr Add (LoadRegExpr "v0") (LitExpr 10))
can be changed into:
Let "dummy_var_1" (BinExpr Add (LoadRegExpr "s0") (LoadRegExpr "s1")) Let "dummy_var_2" (BinExpr Add (VarExpr "dummy_var_1") (LoadRegExpr "s2")) StoreReg "v0" (BinExpr Add (VarExpr "dummy_var_2") (LitExpr 10))
Because "Lets" are, by definition, local variables, it is legitimate to substitute the right-hand side of the Let directly into the expression where the variable is used, and then delete the Let. So, continuing our example, we may first substitute for dummy_var_1
to obtain:
Let "dummy_var_2" (BinExpr Add (BinExpr Add (LoadRegExpr "s0") (LoadRegExpr "s1")) (LoadRegExpr "s2")) StoreReg "v0" (BinExpr Add (VarExpr "dummy_var_2") (LitExpr 10))
and then substitute for dummy_var_2 to obtain:
StoreReg "v0" (BinExpr Add (BinExpr Add (BinExpr Add (LoadRegExpr "s0") (LoadRegExpr "s1")) (LoadRegExpr "s2")) (LitExpr 10))
We have now arrived at the desired form of the code, i.e. one single assignment instead of three separate assignments.
Note that there are some cases where we do not want to substitute out a Let statement. These are as follows:
If we have something like
Let "dummy_var_1" (some huge complex expression) StoreReg "v0" (VarExpr "dummy_var_1") StoreReg "v1" (VarExpr "dummy_var_1")
then we don't want to substitute for dummy_var_1
, because we will then be evaluating the huge complex expression more than once. Therefore, before doing a substitution, Mips2cs checks that it is not going to result in repeated calculation being done.
Sometimes it is not safe to substitute a variable, e.g. in
Let "foo" (LoadRegExpr "v1") StoreReg "v1" (LitExpr 1) StoreReg "v2" (VarExpr "foo")
we cannot substitute (LoadRegExpr "v1")
in place of (VarExpr "foo")
, because the value of v1 has changed in the interim. We refer to this situation as a data hazard, and in such cases, substitution is not performed.
We say that a register is "dead" if its value is no longer needed after the current basic block completes. In other words, future basic blocks will not read from that register (or at least, they will always write to it before they read it).
Mips2cs includes a liveness analysis pass which can detect dead registers in many cases. If a dead register is found, the StoreReg
statement (where the dead value gets written) is changed into a Let
. The substitution pass is then run again; this often allows the Let
to be substituted into other expressions, or even removed entirely.
A common example is the MIPS multiply instruction which writes a 64-bit result to the hi
and lo
registers. In many cases, the hi
value is never used. In these cases, the liveness analysis (and following substitution pass) is usually able to remove the assignment to hi
entirely.
There are also examples where liveness analysis does not remove a value entirely, but it does lead to further optimizations being made. For an example see the Worked example.
Once we have our optimized Intermediate code we can use it to generate the final C# code. By this point, all of the hard optimization work has been done, so we can translate the Intermediate code into C# in a relatively straightforward and mechanical way.
An example of the C# code generated by Mips2cs is given in cs_example.html.
Points to note are the following:
MipsCallbacks
interface allows the MIPS code to call back to arbitrary C# code. MipsVM
class contains the state of the MIPS machine (data members) as well as the translated MIPS code (methods). int
variables in the MipsVM
class. int[][]
array (the first index being the page number, and the second being the offset within the page). malloc
or new
) without having to copy the entire memory contents. InitMem()
. ReadWord
, WriteWord
etc.) rather than being done inline. Exec
methods (Exec64
, Exec65
etc). There is also a top-level Run
method which dispatches to the appropriate Exec
method (depending on the current position within the code). Run
method, is that the smaller Exec
methods are (in theory) more JIT-friendly, and should therefore run faster. (However, this has not been tested in practice, nor has the optimal size of the Exec
methods been determined.) Exec
methods themselves are structured as large switch statements. Each basic block of the original code becomes a single case
block in one of these methods. StoreReg
operations become assignments, e.g. StoreReg "v1" (BinExpr Add (LoadRegExpr "s0") (LoadRegExpr "s1")))
translates into v1 = s0 + s1;
. Let
statements become assignments to local variables. Jump
statements translate into goto
(if the destination is within the same Exec
method) or return
(if we need to go to a different Exec
method; in this case, the code goes back up to Run()
, which then sends us back down into the new Exec()
method). Jump
has a (non-trivial) condition attached, it is converted to an if
. IndirectJump
statements need special handling because we need to map the jump address into a basic block number. This mapping is handled by a hash table (called case_table
in the code). If you want to see how the above works in a real example then you may want to look at the Worked example page. This illustrates the complete pipeline (including translation to Intermediate code, optimization, and C# code generation) for a simple factorial function.