=========================== QBE Intermediate Language =========================== - Table of Contents ------------------- 1. <@ Basic Concepts > * <@ Input Files > * <@ BNF Notation > * <@ Sigils > 2. <@ Types > * <@ Simple Types > * <@ Subtyping > 3. <@ Constants > 4. <@ Definitions > * <@ Aggregate Types > * <@ Data > * <@ Functions > 5. <@ Control > * <@ Blocks > * <@ Jumps > 6. <@ Instructions > * <@ Arithmetic and Bits > * <@ Memory > * <@ Comparisons > * <@ Conversions > * <@ Cast > * <@ Call > * <@ Phi > 7. <@ Instructions Index > - 1. Basic Concepts ------------------- The intermediate language (IL) is a higher-level language than the machine's assembly language. It smoothes most of the irregularities of the underlying hardware and allows an infinite number of temporaries to be used. This higher abstraction level allows frontend programmers to focus on language design issues. ~ Input Files ~~~~~~~~~~~~~ The intermediate language is provided to QBE as text files. Usually, one file is generated per each compilation unit of the frontend input language. An IL file is a sequence of <@ Definitions > for data, functions, and types. Once processed by QBE, the resulting file can be assembled and linked using a standard toolchain (e.g., GNU binutils). Here is a complete "Hello World" IL file, it defines a function that prints to the screen. Since the string is not a first class object (only the pointer is) it is defined outside the function's body. # Define the string constant. data $str = { b "hello world", b 0 } function w $main() { @start # Call the puts function with $str as argument. %r =w call $puts(l $str) ret 0 } If you have read the LLVM language reference, you might recognize the above example. In comparison, QBE makes a much lighter use of types and the syntax is terser. ~ BNF Notation ~~~~~~~~~~~~~~ The language syntax is vaporously described in the sections below using BNF syntax. The different BNF constructs used are listed below. * Keywords are enclosed between quotes; * `... | ...` expresses disjunctions; * `[ ... ]` marks some syntax as optional; * `( ... ),` designates a comma-separated list of the enclosed syntax; * `...*` and `...+` are used for arbitrary and at-least-once repetition. ~ Sigils ~~~~~~~~ The intermediate language makes heavy use of sigils, all user-defined names are prefixed with a sigil. This is to avoid keyword conflicts, and also to quickly spot the scope and kind of an identifier. * `:` is for user-defined <@ Aggregate Types> * `$` is for globals (represented by a pointer) * `%` is for function-scope temporaries * `@` is for block labels In BNF syntax, we use `?IDENT` to designate an identifier starting with the sigil `?`. - 2. Types ---------- ~ Simple Types ~~~~~~~~~~~~~~ `bnf BASETY := 'w' | 'l' | 's' | 'd' # Base types EXTTY := BASETY | 'b' | 'h' # Extended types The IL makes very minimal use of types. By design, the types used are restricted to what is necessary for unambiguous compilation to machine code and C interfacing. Unlike LLVM, QBE is not using types as a means to safety; they are only here for semantic purposes. The four base types are `w` (word), `l` (long), `s` (single), and `d` (double), they stand respectively for 32-bit and 64-bit integers, and 32-bit and 64-bit floating-point numbers. There are no pointer types available; pointers are typed by an integer type sufficiently wide to represent all memory addresses (e.g., `l` on 64-bit architectures). Temporaries in the IL can only have a basic type. Extended types contain base types plus `b` (byte) and `h` (half word), respectively for 8-bit and 16-bit integers. They are used in <@ Aggregate Types> and <@ Data> definitions. For C interfacing, the IL also provides user-defined aggregate types. The syntax used to designate them is `:foo`. Details about their definition are given in the <@ Aggregate Types > section. ~ Subtyping ~~~~~~~~~~~ The IL has a minimal subtyping feature for integer types. Any value of type `l` can be used in a `w` context. In that case, only the 32 least significant bits of the word value are used. Make note that it is the inverse of the usual subtyping on integers (in C, we can safely use an `int` where a `long` is expected). A long value cannot be used in word context. The rationale is that a word can be signed or unsigned, so extending it to a long can be done in two ways, either by zero-extension, or by sign-extension. - 3. Constants -------------- `bnf CONST := ['-'] NUMBER # Decimal integer | 's_' FP # Single-precision float | 'd_' FP # Double-precision float | $IDENT # Global symbol Throughout the IL, constants are specified with a unified syntax and semantics. Constants are immediates, meaning that they can be used directly in instructions; there is no need for a "load constant" instruction. The representation of integers is two's complement. Floating-point numbers are represented using the single-precision and double-precision formats of the IEEE 754 standard. Constants specify a sequence of bits and are untyped. They are always parsed as 64-bit blobs. Depending on the context surrounding a constant, only some of its bits are used. For example, in the program below, the two variables defined have the same value since the first operand of the substraction is a word (32-bit) context. %x =w sub -1, 0 %y =w sub 4294967295, 0 Because specifying floating-point constants by their bits makes the code less readable, syntactic sugar is provided to express them. Standard scientific notation is used with a prefix of `s_` for single and `d_` for double-precision numbers. Once again, the following example defines twice the same double-precision constant. %x =d add d_0, d_-1 %y =d add d_0, -4616189618054758400 Global symbols can also be used directly as constants; they will be resolved and turned into actual numeric constants by the linker. - 4. Definitions ---------------- Definitions are the essential components of an IL file. They can define three types of objects: Aggregate types, data, and functions. Aggregate types are never exported and do not compile to any code. Data and function definitions have file scope and are mutually recursive (even across IL files). Their visibility can be controlled using the `export` keyword. ~ Aggregate Types ~~~~~~~~~~~~~~~~~ `bnf TYPEDEF := # Regular type 'type' :IDENT '=' [ 'align' NUMBER ] '{' ( EXTTY [ NUMBER ] ), '}' | # Opaque type 'type' :IDENT '=' 'align' NUMBER '{' NUMBER '}' Aggregate type definitions start with the `type` keyword. They have file scope, but types must be defined before their first use. The inner structure of a type is expressed by a comma-separated list of <@ Simple Types> enclosed in curly braces. type :fourfloats = { s, s, d, d } For ease of generation, a trailing comma is tolerated by the parser. In case many items of the same type are sequenced (like in a C array), the shorter array syntax can be used. type :abyteandmanywords = { b, w 100 } By default, the alignment of an aggregate type is the maximum alignment of its members. The alignment can be explicitly specified by the programmer. type :cryptovector = align 16 { w 4 } Opaque types are used when the inner structure of an aggregate cannot be specified; the alignment for opaque types is mandatory. They are defined by simply enclosing their size between curly braces. type :opaque = align 16 { 32 } ~ Data ~~~~~~ `bnf DATADEF := ['export'] 'data' $IDENT '=' '{' ( EXTTY DATAITEM+ | 'z' NUMBER ), '}' DATAITEM := $IDENT [ '+' NUMBER ] # Symbol and offset | '"' ... '"' # String | CONST # Constant Data definitions define objects that will be emitted in the compiled file. They can be local to the file or exported with global visibility to the whole program. They define a global identifier (starting with the sigil `$`), that will contain a pointer to the object specified by the definition. Objects are described by a sequence of fields that start with a type letter. This letter can either be an extended type, or the `z` letter. If the letter used is an extended type, the data item following specifies the bits to be stored in the field. When several data items follow a letter, they initialize multiple fields of the same size. The members of a struct will be packed. This means that padding has to be emitted by the frontend when necessary. Alignment of the whole data objects can be manually specified, and when no alignment is provided, the maximum alignment of the platform is used. When the `z` letter is used the number following indicates the size of the field, the contents of the field are zero initialized. It can be used to add padding between fields or zero-initialize big arrays. Here are various examples of data definitions. # Three 32-bit values 1, 2, and 3 # followed by a 0 byte. data $a = { w 1 2 3, b 0 } # A thousand bytes 0 initialized. data $b = { z 1000 } # An object containing two 64-bit # fields, one with all bits sets and the # other containing a pointer to the # object itself. data $c = { l -1, l $c } ~ Functions ~~~~~~~~~~~ `bnf FUNCDEF := ['export'] 'function' [BASETY | :IDENT] $IDENT PARAMS '{' BLOCK+ '}' PARAMS := '(' ( (BASETY | :IDENT) %IDENT ), ')' Function definitions contain the actual code to emit in the compiled file. They define a global symbol that contains a pointer to the function code. This pointer can be used in call instructions or stored in memory. The type given right before the function name is the return type of the function. All return values of this function must have the return type. If the return type is missing, the function cannot return any value. The parameter list is a comma separated list of temporary names prefixed by types. The types are used to correctly implement C compatibility. When an argument has an aggregate type, is is set on entry of the function to a pointer to the aggregate passed by the caller. In the example below, we have to use a load instruction to get the value of the first (and only) member of the struct. type :one = { w } function w $getone(:one %p) { @start %val =w loadw %p ret %val } Since global symbols are defined mutually recursive, there is no need for function declarations: A function can be referenced before its definition. Similarly, functions from other modules can be used without previous declarations. All the type information is provided in the call instructions. The syntax and semantics for the body of functions are described in the <@ Control > section. - 5. Control ------------ The IL represents programs as textual transcriptions of control flow graphs. The control flow is serialized as a sequence of blocks of straight-line code and connected using jump instructions. ~ Blocks ~~~~~~~~ `bnf BLOCK := @IDENT # Block label PHI* # Phi instructions INST* # Regular instructions JUMP # Jump or return All blocks have a name that is specified by a label at their beginning. Then follows a sequence of instructions that have "fall-through" flow. Finally one jump terminates the block. The jump can either transfer control to another block of the same function or return, they are described further below. The first block in a function must not be the target of any jump in the program. If this need is encountered, the frontend can always insert an empty prelude block at the beginning of the function. When one block jumps to the next block in the IL file, it is not necessary to give the jump instruction, it will be automatically added by the parser. For example the start block in the example below jumps directly to the loop block. function $loop() { @start @loop %x =w phi @start 100, @loop %x1 %x1 =w sub %x, 1 jnz %x1, @loop, @end @end ret } ~ Jumps ~~~~~~~ `bnf JUMP := 'jmp' @IDENT # Unconditional | 'jnz' VAL, @IDENT, @IDENT # Conditional | 'ret' [ VAL ] # Return A jump instruction ends every block and transfers the control to another program location. The target of a jump must never be the first block in a function. The three kinds of jumps available are described in the following list. 1. Unconditional jump. Simply jumps to another block of the same function. 2. Conditional jump. When its word argument is non-zero, it jumps to its first label argument; otherwise it jumps to the other label. The argument must be of word type, because of subtyping a long argument can be passed, but only its least significant 32 bits will be compared to 0. 3. Function return. Terminates the execution of the current function, optionally returning a value to the caller. The value returned must have the type given in the function prototype. If the function prototype does not specify a return type, no return value can be used. - 6. Instructions ----------------- Instructions are the smallest piece of code in the IL, they form the body of <@ Blocks >. The IL uses a three-address code, which means that one instruction computes an operation between two operands and assigns the result to a third one. An instruction has both a name and a return type, this return type is a base type that defines the size of the instruction's result. The type of the arguments can be unambiguously inferred using the instruction name and the return type. For example, for all arithmetic instructions, the type of the arguments is the same as the return type. The two additions below are valid if `%y` is a word or a long (because of <@ Subtyping >). %x =w add 0, %y %z =w add %x, %x Some instructions, like comparisons and memory loads have operand types that differ from their return types. For instance, two floating points can be compared to give a word result (0 if the comparison succeeds, 1 if it fails). %c =w cgts %a, %b In the example above, both operands have to have single type. This is made explicit by the instruction suffix. The types of instructions are described below using a short type string. A type string specifies all the valid return types an instruction can have, its arity, and the type of its arguments in function of its return type. Type strings begin with acceptable return types, then follows, in parentheses, the possible types for the arguments. If the n-th return type of the type string is used for an instruction, the arguments must use the n-th type listed for them in the type string. When an instruction does not have a return type, the type string only contains the types of the arguments. The following abbreviations are used. * `T` stands for `wlsd` * `I` stands for `wl` * `F` stands for `sd` * `m` stands for the type of pointers on the target, on 64-bit architectures it is the same as `l` For example, consider the type string `wl(F)`, it mentions that the instruction has only one argument and that if the return type used is long, the argument must be of type double. ~ Arithmetic and Bits ~~~~~~~~~~~~~~~~~~~~~ * `add`, `sub`, `div`, `mul` -- `T(T,T)` * `udiv`, `rem`, `urem` -- `I(I,I)` * `or`, `xor`, `and` -- `I(I,I)` * `sar`, `shr`, `shl` -- `I(I,ww)` The base arithmetic instructions in the first bullet are available for all types, integers and floating points. When `div` is used with word or long return type, the arguments are treated as signed. The unsigned integral division is available as `udiv` instruction. When the result of a division is not an integer, it is truncated towards zero. The signed and unsigned remainder operations are available as `rem` and `urem`. The sign of the remainder is the same as the one of the dividend. Its magnitude is smaller than the divisor's. These two instructions and `udiv` are only available with integer arguments and result. Bitwise OR, AND, and XOR operations are available for both integer types. Logical operations of typical programming languages can be implemented using <@ Comparisons > and <@ Jumps >. Shift instructions `sar`, `shr`, and `shl` shift right or left their first operand by the amount in the second operand. The shifting amount is taken modulo the size of the result type. Shifting right can either preserve the sign of the value (using `sar`), or fill the newly freed bits with zeroes (using `shr`). Shifting left always fills the freed bits with zeroes. Remark that an arithmetic shift right (`sar`) is only equivalent to a division by a power of two for non-negative numbers. This is because the shift right "truncates" towards minus infinity, while the division truncates towards zero. ~ Memory ~~~~~~~~ * Store instructions. * `stored` -- `(d,m)` * `stores` -- `(s,m)` * `storel` -- `(l,m)` * `storew` -- `(w,m)` * `storeh` -- `(w,m)` * `storeb` -- `(w,m)` Store instructions exist to store a value of any base type and any extended type. Since halfwords and bytes are not first class in the IL, `storeh` and `storeb` take a word as argument. Only the first 16 or 8 bits of this word will be stored in memory at the address specified in the second argument. * Load instructions. * `loadd` -- `d(m)` * `loads` -- `s(m)` * `loadl` -- `l(m)` * `loadsw`, `loaduw` -- `I(mm)` * `loadsh`, `loaduh` -- `I(mm)` * `loadsb`, `loadub` -- `I(mm)` For types smaller than long, two variants of the load instruction is available: one will sign extend the value loaded, while the other will zero extend it. Remark that all loads smaller than long can load to either a long or a word. The two instructions `loadsw` and `loaduw` have the same effect when they are used to define a word temporary. A `loadw` instruction is provided as syntactic sugar for `loadsw` to make explicit that the extension mechanism used is irrelevant. * Stack allocation. * `alloc4` -- `m(l)` * `alloc8` -- `m(l)` * `alloc16` -- `m(l)` These instructions allocate a chunk of memory on the stack. The number ending the instruction name is the alignment required for the allocated slot. QBE will make sure that the returned address is a multiple of that alignment value. Stack allocation instructions are used, for example, when compiling the C local variables, because their address can be taken. When compiling Fortran, temporaries can be used directly instead, because it is illegal to take the address of a variable. The following example makes use some of the memory instructions. Pointers are stored in long temporaries. %A0 =l alloc4 8 # stack allocate an array A of 2 words %A1 =l add %A0, 4 storew 43, %A0 # A[0] <- 43 storew 255, %A1 # A[1] <- 255 %v1 =w loadw %A0 # %v1 <- A[0] as word %v2 =w loadsb %A1 # %v2 <- A[1] as signed byte %v3 =w add %v1, %v2 # %v3 is 42 here ~ Comparisons ~~~~~~~~~~~~~ Comparison instructions return an integer value (either a word or a long), and compare values of arbitrary types. The value returned is 1 if the two operands satisfy the comparison relation, and 0 otherwise. The names of comparisons respect a standard naming scheme in three parts. 1. All comparisons start with the letter `c`. 2. Then comes a comparison type. The following types are available for integer comparisons: * `eq` for equality * `ne` for inequality * `sle` for signed lower or equal * `slt` for signed lower * `sge` for signed greater or equal * `sgt` for signed greater * `ule` for unsigned lower or equal * `ult` for unsigned lower * `uge` for unsigned greater or equal * `ugt` for unsigned greater Floating point comparisons use one of these types: * `eq` for equality * `ne` for inequality * `le` for lower or equal * `lt` for lower * `ge` for greater or equal * `gt` for greater * `o` for ordered (no operand is a NaN) * `uo` for unordered (at least one operand is a NaN) Because floating point types always have a sign bit, all the comparisons available are signed. 3. Finally, the instruction name is terminated with a basic type suffix precising the type of the operands to be compared. For example, `cod` (`I(dd,dd)`) compares two double-precision floating point numbers and returns 1 if the two floating points are not NaNs, and 0 otherwise. The `csltw` (`I(ww,ww)`) instruction compares two words representing signed numbers and returns 1 when the first argument is smaller than the second one. ~ Conversions ~~~~~~~~~~~~~ Conversion operations allow to change the representation of a value, possibly modifying it if the target type cannot hold the value of the source type. Conversions can extend the precision of a temporary (e.g., from signed 8-bit to 32-bit), or convert a floating point into an integer and vice versa. * `extsw`, `extuw` -- `l(w)` * `extsh`, `extuh` -- `I(ww)` * `extsb`, `extub` -- `I(ww)` * `exts` -- `d(s)` * `truncd` -- `s(d)` * `stosi` -- `I(ss)` * `dtosi` -- `I(dd)` * `swtof` -- `F(ww)` * `sltof` -- `F(ll)` Extending the precision of a temporary is done using the `ext` family of instructions. Because QBE types do not precise the signedness (like in LLVM), extension instructions exist to sign-extend and zero-extend a value. For example, `extsb` takes a word argument and sign-extend the 8 least-significant bits to a full word or long, depending on the return type. The instructions `exts` and `truncd` are provided to change the precision of a floating point value. When the double argument of `truncd` cannot be represented as a single-precision floating point, it is truncated towards zero. Converting between signed integers and floating points is done using `stosi` (single to signed integer), `dtosi` (double to signed integer), `swtof` (signed word to float), and `sltof` (signed long to float). These instructions only handle signed integers, conversion to and from unsigned types are not yet supported. Because of <@ Subtyping >, there is no need to have an instruction to lower the precision of an integer temporary. ~ Cast ~~~~~~ The `cast` instruction reinterprets the bits of a value of a given type into another type of the same width. * `cast` -- `wlsd(sdwl)` It can be used to make bitwise operations on the representation of floating point numbers. For example the following program will compute the opposite of the single-precision floating point number `%f` into `%rs`. %b0 =w cast %f %b1 =w xor 2147483648, %b0 # flip the msb %rs =s cast %b1 ~ Call ~~~~~~ `bnf CALL := [ %IDENT '=' ( BASETY | :IDENT ) ] 'call' VAL PARAMS PARAMS := '(' ( (BASETY | :IDENT) %IDENT ), ')' The call instruction is special in many ways. It is not a three-address instruction and requires the type of all its arguments to be given. Also, the return type can be either a base type or an aggregate type. These specificities are required to compile calls with C compatibility (i.e., to respect the ABI). When an aggregate type is used as argument type or return type, the value repectively passed or returned needs to be a pointer to a memory location holding the value. This is because aggregate types are not first-class citizens of the IL. Call instructions are currently required to define a return temporary, even for functions returning no values. The temporary can very well be ignored (not used) when necessary. ~ Phi ~~~~~ `bnf PHI := %IDENT '=' BASETY 'phi' ( @IDENT VAL ), First and foremost, phi instructions are NOT necessary when writing a frontend to QBE. One solution to avoid having to deal with SSA form is to use stack allocated variables for all source program variables and perform assignments and lookups using <@ Memory > operations. This is what LLVM users typically do. Another solution is to simply emit code that is not in SSA form! Contrary to LLVM, QBE is able to fixup programs not in SSA form without requiring the boilerplate of loading and storing in memory. For example, the following program will be correctly compiled by QBE. @start %x =w copy 100 %s =w copy 0 @loop %s =w add %s, %x %x =w sub %x, 1 jnz %x, @loop, @end @end ret %s Now, if you want to know what phi instructions are and how to use them in QBE, you can read the following. Phi instructions are specific to SSA form. In SSA form values can only be assigned once, without phi instructions, this requirement is too strong to represent many programs. For example consider the following C program. int f(int x) { int y; if (x) y = 1; else y = 2; return y; } The variable `y` is assigned twice, the solution to translate it in SSA form is to insert a phi instruction. @ifstmt jnz %x, @ift, @iff @ift jmp @retstmt @iff jmp @retstmt @retstmt %y =w phi @ift 1, @iff 2 ret %y Phi instructions return one of their arguments depending on where the control came from. In the example, `%y` is set to 1 if the `@ift` branch is taken, and it is set to 2 otherwise. An important remark about phi instructions is that QBE assumes that if a variable is defined by a phi it respects all the SSA invariants. So it is critical to not use phi instructions unless you know exactly what you are doing. - 7. Instructions Index ----------------------- * <@ Arithmetic and Bits >: * `add` * `and` * `div` * `mul` * `or` * `rem` * `sar` * `shl` * `shr` * `sub` * `udiv` * `urem` * `xor` * <@ Memory >: * `alloc16` * `alloc4` * `alloc8` * `loadd` * `loadl` * `loads` * `loadsb` * `loadsh` * `loadsw` * `loadub` * `loaduh` * `loaduw` * `loadw` * `storeb` * `stored` * `storeh` * `storel` * `stores` * `storew` * <@ Comparisons >: * `ceqd` * `ceql` * `ceqs` * `ceqw` * `cged` * `cges` * `cgtd` * `cgts` * `cled` * `cles` * `cltd` * `clts` * `cned` * `cnel` * `cnes` * `cnew` * `cod` * `cos` * `csgel` * `csgew` * `csgtl` * `csgtw` * `cslel` * `cslew` * `csltl` * `csltw` * `cugel` * `cugew` * `cugtl` * `cugtw` * `culel` * `culew` * `cultl` * `cultw` * `cuod` * `cuos` * <@ Conversions >: * `dtosi` * `exts` * `extsb` * `extsh` * `extsw` * `extub` * `extuh` * `extuw` * `sltof` * `stosi` * `swtof` * `truncd` * <@ Cast > : * `cast` * <@ Call >: * `call` * <@ Phi >: * `phi` * <@ Jumps >: * `jmp` * `jnz` * `ret`