=========================== QBE Intermediate Language =========================== - Table of Contents ------------------- 1. <@ Basic Concepts > * <@ Input Files > * <@ BNF Notation > * <@ Sigils > 2. <@ Types > * <@ Simple Types > * <@ Subtyping > 3. <@ Immediate Constants > 4. <@ Definitions > * <@ Aggregate Types > * <@ Data > * <@ Functions > 5. <@ Control > * <@ Blocks > * <@ Instructions > * <@ Jumps > 6. Integer Instructions * Arithmetic * Memory * Comparisons 7. Floating Point Instructions * Arithmetic * Memory * Comparisons 8. Special Instructions * Conversions and Extensions * Casts * Phi - 1. Basic Concepts ------------------- The intermediate language (IL) is a higher-level language than the machine's assembly language. It smoothes most of the irregularities of the underlying hardware and allows an infinite number of temporaries to be used. This higher abstraction level allows frontend programmers to focus on language design issues. ~ Input Files ~~~~~~~~~~~~~ The intermediate language is provided to QBE as text files. Usually, one file is generated per each compilation unit of the frontend input language. An IL file is a sequence of <@ Definitions > for data, functions, and types. Once processed by QBE, the resulting file can be assembled and linked using a standard toolchain (e.g. GNU binutils). Here is a complete "Hello World" IL file, it defines a function that prints to the screen. Since the string is not a first class object (only the pointer is) it is defined outside the function's body. # Define the string constant. data $str = { b "hello world", b 0 } function w $main() { @start # Call the puts function with $str as argument. %r =w call $puts(l $str) ret 0 } If you have read the LLVM language reference, you might recognize the above example. In comparison, QBE makes a much lighter use of types and the syntax is terser. ~ BNF Notation ~~~~~~~~~~~~~~ The language syntax is vaporously described in the sections below using BNF syntax. The different BNF constructs used are listed below. * Keywords are enclosed between quotes; * `... | ...` expresses disjunctions; * `[ ... ]` marks some syntax as optional; * `( ... ),` designates a comma-separated list of the enclosed syntax; * `...*` and `...+` as used for arbitrary and at-least-once repetition. ~ Sigils ~~~~~~~~ The intermediate language makes heavy use of sigils, all user-defined names are prefixed with a sigil. This is to avoid keyword conflicts, and also to quickly spot the scope and kind of an identifier. * `:` is for user-defined <@ Aggregate Types> * `$` is for globals (represented by a pointer) * `%` is for function-scope temporaries * `@` is for block labels In BNF syntax, we use `?IDENT` to designate an identifier starting with the sigil `?`. - 2. Types ---------- ~ Simple Types ~~~~~~~~~~~~~~ `bnf BASETY := 'w' | 'l' | 's' | 'd' # Base types EXTTY := BASETY | 'h' | 'b' # Extended types We makes very minimal use of types. The types used are only what is necessary for unambiguous compilation to machine code and C interfacing. The four base types are `w` (word), `l` (long), `s` (single), and `d` (double), they stand respectively for 32 bits and 64 bits integers, and 32 bits and 64 bits floating points. Temporaries in the IL can only have a basic type. Extended types contain base types and add `h` (half word) and `b` (byte), respectively for 16 bits and 8 bits integers. They are used in <@ Aggregate Types> and <@ Data> definitions. The IL also provides user-defined aggregate types, these are used for C interfacing. The syntax used to designate them is `:foo`. Details about their definition are given in the <@ Aggregate Types > section. ~ Subtyping ~~~~~~~~~~~ The IL has a minimal subtyping feature for integer types. Any value of type `l` can be used in a `w` context. When that happens only the 32 least significant bits of the word value are used. Note that a long value must not be used in word context. The rationale is that the 32 high bits of the extended long value could very well be zeroes or the result of a sign extension of the word. - 3. Immediate Constants ------------------------ `bnf CONST := ['-'] NUMBER # Decimal integer | 's_' FP # Single-precision float | 'd_' FP # Double-precision float | $IDENT # Global symbol Throughout the IL, constants are specified with a unified syntax and semantics. Constants are immediates, meaning that they can be used directly in instructions; there is no need for a "load constant" instruction. The representation of integers is two's complement. Floating point numbers are represented using the single-precision and double-precision formats of the EEE 754 standard. Consants specify a sequence of bits and are untyped. They are always parsed as 64 bits blobs. Depending on the context surrounding one constant, only some of its bits are used. For example, in the program below, the two variables defined have the same value since the fist operand of the substraction is a word (32 bits) context. %x =w sub -1, 0 %y =w sub 4294967295, 0 Because specifying floating point constants by their bits makes the code less readable, syntactic sugar is provided to express them. Standard scientific notation is used with a prefix of `s_` for single and `d_` for double-precision numbers. Once again, the following example defines twice the same double-precision constant. %x =d add d_0, d_-1 %y =d add d_0, -4616189618054758400 Global symbols can also be used directly as constants, they will be resolved and turned to actual numeric constants by the linker. - 4. Definitions ---------------- Definitions are the essential components of an IL file. They can define three types of objects: Aggregate types, data, and functions. Aggregate types are never exported and do not compile to any code. Data and function definitions have file scope and are mutually recursive (even across IL files). Their visibility can be controlled using the `export` keyword. ~ Aggregate Types ~~~~~~~~~~~~~~~~~ `bnf TYPEDEF := # Regular type 'type' :IDENT '=' [ 'align' NUMBER ] '{' ( EXTTY [ NUMBER ] ), '}' | # Opaque type 'type' :IDENT '=' 'align' NUMBER '{' NUMBER '}' Aggregate type definitions start with the `type` keyword. They have file scope but types must be defined before their first use. The inner structure of a type is expressed by a comma separated list of <@ Simple Types> enclosed in curly braces. type :fourfloats = { s, s, d, d } For ease of generation, a trailing comma is tolerated by the parser. In case many items of the same type are sequenced (like in a C array), the sorter array syntax can be used. type :abyteandmanywords = { b, w 100 } By default, the alignment of an aggregate type is the maximum alignment of its members. The alignment can be explicitely specified by the programmer Opaque types are used when the inner structure of an aggregate cannot be specified, the alignment for opaque types is mandatory. They are defined by simply enclosing their size between curly braces. type :opaque = align 16 { 32 } ~ Data ~~~~~~ `bnf DATADEF := ['export'] 'data' $IDENT '=' '{' ( EXTTY DATAITEM+ | 'z' NUMBER ), '}' DATAITEM := $IDENT [ '+' NUMBER ] # Symbol and offset | '"' ... '"' # String | CONST # Constant Data definitions define objects that will be emitted in the compiled file. They can be local to the file or exported with global visibility to the whole program. They define a global identifier (starting with the sigil `$`), that will contain a pointer to the object specified by the definition. Objects are described by a sequence of fields that start with a type letter. This letter can either be an extended type, or the `z` letter. If the letter used is an extended type, the data item following specifies the bits to be stored in the field. When several data items follow a letter, they initialize multiple fields of the same size. The members of a struct will be packed. This means that padding has to be emitted by the frontend when necessary. Alignment of the whole data objects can be manually specified, and when no alignment is provided, the maximum alignment of the platform is used. When the `z` letter is used the number following indicates the size of the field, the contents of the field are zero initialized. It can be used to add padding between fields or zero-initialize big arrays. Here are various examples of data definitions. # Three 32 bits values 1, 2, and 3 # followed by a 0 byte. data $a = { w 1 2 3, b 0 } # A thousand bytes 0 initialized. data $b = { z 1000 } # An object containing two 64 bits # fields, one with all bits sets and the # other containing a pointer to the # object itself. data $c = { l -1, l $c } ~ Functions ~~~~~~~~~~~ `bnf FUNCDEF := ['export'] 'function' [BASETY | :IDENT] $IDENT PARAMS '{' BLOCK+ '}' PARAMS := '(' ( (BASETY | :IDENT) %IDENT ), ')' Function definitions contain the actual code to emit in the compiled file. They define a global symbol that contains a pointer to the function code. This pointer can be used in call instructions or stored in memory. The type given right before the function name is the return type of the function. All return values of this function must have the return type. If the return type is missing, the function cannot return any value. The parameter list is a comma separated list of temporary names prefixed by types. The types are used to correctly implement C compatibility. When an argument has an aggregate type, is is set on entry of the function to a pointer to the aggregate passed by the caller. In the example below, we have to use a load instruction to get the value of the first (and only) member of the struct. type :one = { w } function w $getone(:one %p) { @start %val =w loadw %p ret %val } Since global symbols are defined mutually recursive, there is no need for function declarations: A function can be referenced before its definition. Similarly, functions from other modules can be used without previous declarations. All the type information is provided in the call instructions. The syntax and semantics for the body of functions are described in the <@ Control > section. - 5. Control ------------ The IL represents programs as textual transcriptions of control flow graphs. The control flow is serialized as a sequence of blocks of straight-line code and connected using jump instructions. ~ Blocks ~~~~~~~~ `bnf BLOCK := @IDENT # Block label PHI* # Phi instructions INST* # Regular instructions JUMP # Jump or return All blocks have a name that is specified by a label at their beginning. Then follows a sequence of instructions that have "fall-through" flow. Finally one jump terminates the block. The jump can either transfer control to another block of the same function or return, they are described further below. The first block in a function must not be the target of any jump in the program. If this need is encountered, the frontend can always insert an empty prelude block at the beginning of the function. When one block jumps to the next block in the IL file, it is not necessary to give the jump instruction, it will be automatically added by the parser. For example the start block in the example below jumps directly to the loop block. function $loop() { @start @loop %x =w phi @start 100, @loop %x1 %x1 =w sub %x, 1 jnz %x1, @loop, @end @end ret } ~ Instructions ~~~~~~~~~~~~~~ Regular instructions are described in details in sections below. The IL uses a three-address code, which means that one instruction computes an operation between two operands and assigns the result to a third one. An instruction has both a name and a return type, this return type is a base type that defines the size of the instruction's result. The type of the arguments can be unambiguously inferred using the instruction name and the return type. For example, for all arithmetic instructions, the type of the arguments is the same as the return type. The two additions below are valid if `%y` is a word or a long (because of <@ Subtyping >). %x =w add 0, %y %z =w add %x, %x Some instructions, like comparisons and memory loads have operand types that differ from their return types. For instance, two floating points can be compared to give a word result (0 if the comparison succeeds, 1 if it fails). %c =w cgts %a, %b In the example above, both operands have to have single type. This is made explicit by the instruction suffix. ~ Jumps ~~~~~~~ `bnf JUMP := 'jmp' @IDENT # Unconditional | 'jnz' VAL, @IDENT, @IDENT # Conditional | 'ret' [ VAL ] # Return A jump instruction ends every block and transfers the control to another program location. The target of a jump must never be the first block in a function. The three kinds of jumps available are described in the following list. 1. Unconditional jump. Simply jumps to another block of the same function. 2. Conditional jump. When its word argument is non-zero, it jumps to its first label argument; otherwise it jumps to the other label. The argument must be of word type, because of subtyping a long argument can be passed, but only its least significant 32 bits will be compared to 0. 3. Function return. Terminates the execution of the current function, optionally returning a value to the caller. The value returned must have the type given in the function prototype. If the function prototype does not specify a return type, no return value can be used.