summary refs log tree commit diff
path: root/doc/il.txt
blob: 08dc2a66bfaa10ca2c99da938356fe00557f084e (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
                ===========================
                 QBE Intermediate Language
                ===========================



- Table of Contents
-------------------

  1. <@ Basic Concepts >
      * <@ Input Files >
      * <@ BNF Notation >
      * <@ Sigils >
  2. <@ Types >
      * <@ Simple Types >
      * <@ Subtyping >
  3. <@ Immediate Constants >
  4. <@ Definitions >
      * <@ Aggregate Types >
      * <@ Data >
      * <@ Functions >
  5. <@ Control >
      * <@ Blocks >
      * <@ Instructions >
      * <@ Jumps >
  6. <@ Regular Instructions >
      * <@ Arithmetic >
      * <@ Memory >
      * Comparisons
  7. Special Instructions
      * Conversions
      * Casts
      * Phi

- 1. Basic Concepts
-------------------

The intermediate language (IL) is a higher-level language
than the machine's assembly language.  It smoothes most
of the irregularities of the underlying hardware and
allows an infinite number of temporaries to be used.
This higher abstraction level allows frontend programmers
to focus on language design issues.

~ Input Files
~~~~~~~~~~~~~

The intermediate language is provided to QBE as text files.
Usually, one file is generated per each compilation unit of
the frontend input language.  An IL file is a sequence of
<@ Definitions > for data, functions, and types.  Once
processed by QBE, the resulting file can be assembled and
linked using a standard toolchain (e.g. GNU binutils).

Here is a complete "Hello World" IL file, it defines a
function that prints to the screen.  Since the string is
not a first class object (only the pointer is) it is
defined outside the function's body.

    # Define the string constant.
    data $str = { b "hello world", b 0 }

    function w $main() {
    @start
            # Call the puts function with $str as argument.
            %r =w call $puts(l $str)
            ret 0
    }

If you have read the LLVM language reference, you might
recognize the above example.  In comparison, QBE makes a
much lighter use of types and the syntax is terser.

~ BNF Notation
~~~~~~~~~~~~~~

The language syntax is vaporously described in the sections
below using BNF syntax.  The different BNF constructs used
are listed below.

  * Keywords are enclosed between quotes;
  * `... | ...` expresses disjunctions;
  * `[ ... ]` marks some syntax as optional;
  * `( ... ),` designates a comma-separated list of the
    enclosed syntax;
  * `...*` and `...+` as used for arbitrary and
    at-least-once repetition.

~ Sigils
~~~~~~~~

The intermediate language makes heavy use of sigils, all
user-defined names are prefixed with a sigil.  This is
to avoid keyword conflicts, and also to quickly spot the
scope and kind of an identifier.

 * `:` is for user-defined <@ Aggregate Types>
 * `$` is for globals (represented by a pointer)
 * `%` is for function-scope temporaries
 * `@` is for block labels

In BNF syntax, we use `?IDENT` to designate an identifier
starting with the sigil `?`.

- 2. Types
----------

~ Simple Types
~~~~~~~~~~~~~~

    `bnf
    BASETY := 'w' | 'l' | 's' | 'd'  # Base types
    EXTTY  := BASETY    | 'h' | 'b'  # Extended types

The IL makes very minimal use of types.  By design, the types
used are restricted to what is necessary for unambiguous
compilation to machine code and C interfacing.  Unlike LLVM,
QBE is not using types as a mean to safety,  they are only
here for semantics purposes.

The four base types are `w` (word), `l` (long), `s` (single),
and `d` (double), they stand respectively for 32 bits and
64 bits integers, and 32 bits and 64 bits floating points.
There are no pointer types available, pointers are typed
by an integer type sufficiently wide to represent all memory
addresses (e.g. `l` on x64).  Temporaries in the IL can only
have a basic type.

Extended types contain base types and add `h` (half word)
and `b` (byte), respectively for 16 bits and 8 bits integers.
They are used in <@ Aggregate Types> and <@ Data> definitions.

For C interfacing, the IL also provides user-defined aggregate
types.  The syntax used to designate them is `:foo`.  Details
about their definition are given in the <@ Aggregate Types >
section.

~ Subtyping
~~~~~~~~~~~

The IL has a minimal subtyping feature for integer types.
Any value of type `l` can be used in a `w` context.  When that
happens only the 32 least significant bits of the word value
are used.

Make note that it is the inverse of the usual subtyping on
integers (in C, we can safely use an `int` where a `long`
is expected).  A long value cannot be used in word context.
The rationale is that a word can be signed or unsigned, so
extending it to a long can be done in two ways, either
by zero-extension, or by sign-extension.

- 3. Immediate Constants
------------------------

    `bnf
    CONST :=
        ['-'] NUMBER  # Decimal integer
      | 's_' FP       # Single-precision float
      | 'd_' FP       # Double-precision float
      | $IDENT        # Global symbol

Throughout the IL, constants are specified with a unified
syntax and semantics.  Constants are immediates, meaning
that they can be used directly in instructions; there is
no need for a "load constant" instruction.

The representation of integers is two's complement.
Floating point numbers are represented using the
single-precision and double-precision formats of the
EEE 754 standard.

Consants specify a sequence of bits and are untyped.
They are always parsed as 64 bits blobs.  Depending on
the context surrounding one constant, only some of its
bits are used.  For example, in the program below, the
two variables defined have the same value since the fist
operand of the substraction is a word (32 bits) context.

    %x =w sub -1, 0
    %y =w sub 4294967295, 0

Because specifying floating point constants by their bits
makes the code less readable, syntactic sugar is provided
to express them.  Standard scientific notation is used with
a prefix of `s_` for single and `d_` for double-precision
numbers.  Once again, the following example defines twice
the same double-precision constant.

    %x =d add d_0, d_-1
    %y =d add d_0, -4616189618054758400

Global symbols can also be used directly as constants,
they will be resolved and turned to actual numeric
constants by the linker.

- 4. Definitions
----------------

Definitions are the essential components of an IL file.
They can define three types of objects: Aggregate types,
data, and functions.  Aggregate types are never exported
and do not compile to any code.  Data and function
definitions have file scope and are mutually recursive
(even across IL files).  Their visibility can be controlled
using the `export` keyword.

~ Aggregate Types
~~~~~~~~~~~~~~~~~

    `bnf
    TYPEDEF :=
        # Regular type
        'type' :IDENT '=' [ 'align' NUMBER ]
        '{'
            ( EXTTY [ NUMBER ] ),
        '}'
      | # Opaque type
        'type' :IDENT '=' 'align' NUMBER '{' NUMBER '}'

Aggregate type definitions start with the `type` keyword.
They have file scope but types must be defined before their
first use.  The inner structure of a type is expressed by a
comma separated list of <@ Simple Types> enclosed in curly
braces.

    type :fourfloats = { s, s, d, d }

For ease of generation, a trailing comma is tolerated by
the parser.  In case many items of the same type are
sequenced (like in a C array), the sorter array syntax
can be used.

    type :abyteandmanywords = { b, w 100 }

By default, the alignment of an aggregate type is the
maximum alignment of its members.  The alignment can be
explicitely specified by the programmer

Opaque types are used when the inner structure of an
aggregate cannot be specified, the alignment for opaque
types is mandatory.  They are defined by simply enclosing
their size between curly braces.

    type :opaque = align 16 { 32 }

~ Data
~~~~~~

    `bnf
    DATADEF :=
        ['export'] 'data' $IDENT '='
        '{'
            ( EXTTY DATAITEM+
            | 'z'   NUMBER ),
        '}'

    DATAITEM :=
        $IDENT [ '+' NUMBER ]  # Symbol and offset
      |  '"' ... '"'           # String
      |  CONST                 # Constant

Data definitions define objects that will be emitted in the
compiled file.  They can be local to the file or exported
with global visibility to the whole program.

They define a global identifier (starting with the sigil
`$`), that will contain a pointer to the object specified
by the definition.

Objects are described by a sequence of fields that start with
a type letter.  This letter can either be an extended type,
or the `z` letter.  If the letter used is an extended type,
the data item following specifies the bits to be stored in
the field.  When several data items follow a letter, they
initialize multiple fields of the same size.

The members of a struct will be packed.  This means that
padding has to be emitted by the frontend when necessary.
Alignment of the whole data objects can be manually specified,
and when no alignment is provided, the maximum alignment of
the platform is used.

When the `z` letter is used the number following indicates
the size of the field, the contents of the field are zero
initialized.  It can be used to add padding between fields
or zero-initialize big arrays.

Here are various examples of data definitions.

    # Three 32 bits values 1, 2, and 3
    # followed by a 0 byte.
    data $a = { w 1 2 3, b 0 }

    # A thousand bytes 0 initialized.
    data $b = { z 1000 }

    # An object containing two 64 bits
    # fields, one with all bits sets and the
    # other containing a pointer to the
    # object itself.
    data $c = { l -1, l $c }

~ Functions
~~~~~~~~~~~

    `bnf
    FUNCDEF :=
        ['export'] 'function' [BASETY | :IDENT] $IDENT PARAMS
        '{'
           BLOCK+
        '}'

    PARAMS := '(' ( (BASETY | :IDENT) %IDENT ), ')'

Function definitions contain the actual code to emit in
the compiled file.  They define a global symbol that
contains a pointer to the function code.  This pointer
can be used in call instructions or stored in memory.

The type given right before the function name is the
return type of the function.  All return values of this
function must have the return type.  If the return
type is missing, the function cannot return any value.

The parameter list is a comma separated list of
temporary names prefixed by types.  The types are used
to correctly implement C compatibility.  When an argument
has an aggregate type, is is set on entry of the
function to a pointer to the aggregate passed by the
caller.  In the example below, we have to use a load
instruction to get the value of the first (and only)
member of the struct.

    type :one = { w }

    function w $getone(:one %p) {
    @start
            %val =w loadw %p
            ret %val
    }

Since global symbols are defined mutually recursive,
there is no need for function declarations: A function
can be referenced before its definition.
Similarly, functions from other modules can be used
without previous declarations.  All the type information
is provided in the call instructions.

The syntax and semantics for the body of functions
are described in the <@ Control > section.

- 5. Control
------------

The IL represents programs as textual transcriptions of
control flow graphs.  The control flow is serialized as
a sequence of blocks of straight-line code and connected
using jump instructions.

~ Blocks
~~~~~~~~

    `bnf
    BLOCK :=
        @IDENT    # Block label
        PHI*      # Phi instructions
        INST*     # Regular instructions
        JUMP      # Jump or return

All blocks have a name that is specified by a label at
their beginning.  Then follows a sequence of instructions
that have "fall-through" flow.  Finally one jump terminates
the block.  The jump can either transfer control to another
block of the same function or return, they are described
further below.

The first block in a function must not be the target of
any jump in the program.  If this need is encountered,
the frontend can always insert an empty prelude block
at the beginning of the function.

When one block jumps to the next block in the IL file,
it is not necessary to give the jump instruction, it
will be automatically added by the parser.  For example
the start block in the example below jumps directly
to the loop block.

    function $loop() {
    @start
    @loop
            %x =w phi @start 100, @loop %x1
            %x1 =w sub %x, 1
            jnz %x1, @loop, @end
    @end
            ret
    }

~ Instructions
~~~~~~~~~~~~~~

Regular instructions are described in details in sections
below.  The IL uses a three-address code, which means that
one instruction computes an operation between two operands
and assigns the result to a third one.

An instruction has both a name and a return type, this
return type is a base type that defines the size of the
instruction's result.  The type of the arguments can be
unambiguously inferred using the instruction name and the
return type.  For example, for all arithmetic instructions,
the type of the arguments is the same as the return type.
The two additions below are valid if `%y` is a word or a long
(because of <@ Subtyping >).

    %x =w add 0, %y
    %z =w add %x, %x

Some instructions, like comparisons and memory loads
have operand types that differ from their return types.
For instance, two floating points can be compared to give a
word result (0 if the comparison succeeds, 1 if it fails).

    %c =w cgts %a, %b

In the example above, both operands have to have single type.
This is made explicit by the instruction suffix.

~ Jumps
~~~~~~~

    `bnf
    JUMP :=
        'jmp' @IDENT               # Unconditional
      | 'jnz' VAL, @IDENT, @IDENT  # Conditional
      | 'ret' [ VAL ]              # Return

A jump instruction ends every block and transfers the
control to another program location.  The target of
a jump must never be the first block in a function.
The three kinds of jumps available are described in
the following list.

 1. Unconditional jump.

    Simply jumps to another block of the same function.

 2. Conditional jump.

    When its word argument is non-zero, it jumps to its
    first label argument; otherwise it jumps to the other
    label.  The argument must be of word type, because of
    subtyping a long argument can be passed, but only its
    least significant 32 bits will be compared to 0.

 3. Function return.

    Terminates the execution of the current function,
    optionally returning a value to the caller.  The value
    returned must have the type given in the function
    prototype.  If the function prototype does not specify
    a return type, no return value can be used.

- 6. Regular Instructions
-------------------------

Instructions are the smallest piece of code in the IL, they
form the body of <@ Blocks >.

The types of instructions is described below using a short
type string.  A type string specifies all the valid return
types an instruction can have, its arity, and the type of
its arguments in function of its return type.

Type strings begin with acceptable return types, then
follows, in parentheses, the possible types for the arguments.
If the n-th return type of the type string is used for an
instruction, the arguments must use the n-th type listed for
them in the type string.  When an instruction does not have a
return type, the type string only contains the types of the
arguments.

The following abbreviations are used.

  * `T` stands for `wlsd`
  * `I` stands for `wl`
  * `F` stands for `sd`
  * `m` stands for the type of pointers on the target, on
    x64 it is the same as `l`

For example, consider the type string `wl(F)`, it mentions
that the instruction has only one argument and that if the
return type used is long, the argument must be of type double.


~ Arithmetic
~~~~~~~~~~~~

  * `add`, `sub`, `div`, `mul` -- `T(T,T)`

    The base arithmetic instructions are available for all
    types, integers and floating points.  When the division
    is used with word or long return type, the arguments are
    treated as signed.  The other instructions are sign
    agnositc.

  * `udiv` -- `I(I,I)`

    An unsigned division, to use on integer types only when
    the integers represented are unsigned.

  * `rem`, `urem` -- `I(I,I)`

    The remainder operator in signed and unsigned variants.
    When the result is not an integer, it is truncated towards
    zero.  For `rem`, the sign of the remainder is the same
    as the one of the dividend.

~ Memory
~~~~~~~~

  * Store instructions.

      * `stored` -- `(d,m)`
      * `stores` -- `(s,m)`
      * `storel` -- `(l,m)`
      * `storew` -- `(w,m)`
      * `storeh`, `storeb` -- `(w,m)`

    Store instructions exist to store a value of any base type
    and any extended type.  Since halfwords and bytes are not
    first class in the IL, `storeh` and `storeb` take a word
    as argument.  Only the first 16 or 8 bits of this word will
    be stored in memory at the address specified in the second
    argument.

  * Load instructions.

      * `loadd` -- `d(m)`
      * `loads` -- `s(m)`
      * `loadl` -- `l(m)`
      * `loadsw`, `loadzw` -- `I(mm)`
      * `loadsh`, `loadzh` -- `I(mm)`
      * `loadsb`, `loadzb` -- `I(mm)`