b3bbe1d by Ahmed S. Darwish at 2010-11-16 1
2
15681eb by Ahmed S. Darwish at 2010-02-18 3
			Notes on the Cute Kernel
4
649633f by Ahmed S. Darwish at 2010-07-29 5
6
TOC:
7
----
8
9
(1)  On Intel Documents
10
(2)  On Data Alignment
11
(3)  On little-endian byte ordering
12
(4)  On Real Mode
13
(5)  On accessing fixed disk drivers (hard-disks)
14
(6)  On x86 jumps
15
(7)  On referencing global symbols
16
(8)  The 8259A Programmable Interrupt Controller
17
(9)  On x86 SMP initialization sequence
18
(10) On linkers 'relocation error's
19
(11) When %RIP-relativeness goes wrong
20
(12) Testing libc()-like kernel code
21
(13) On SMP Memory ordering models
22
(14) On the C 'volatile' type qualifier
23
(15) On the vaguely documented x86 16-bit protected mode
24
(16) Using light debugging techniques
25
(17) Musings on Memory Allocation
26
(18) On Virtual Memory
27
(19) On Partitionting the 64-bit kernel address space
28
(20) Debugging -O3 triggered problems
29
(21) The Zone!
30
(22) Debugging periodic timers using statistical data
5bc1d68 by Ahmed S. Darwish at 2010-11-16 31
(23) The Naughty Multiply
649633f by Ahmed S. Darwish at 2010-07-29 32
33
550f8c7 by Ahmed S. Darwish at 2009-08-25 34
On Intel Documents:
35
-------------------
36
37
* Although the newer intel documents are much more sexy and detailed, the
38
  original i386 one describes the concepts in a much more bottom-up,
39
  undergraduate-book like, no forward-references way.
40
41
* Somehow the new intel documents beginning from the 1997 i686 document
42
  assumes some x86 familarity that wasn't obviously assumed in the original
43
  document.
44
45
* For an example, the 'alignment' topic is throughly explained in
46
  the i386 manual, what is it and why is it important to performance,
47
  and how the processor act if given a misaligned address, *before* ever
48
  using the words '16bit-aligned' and '32bit-aligned'.
49
50
  The new intel documents assume you know subtle topics as above and
51
  talks about critical issues related to their understandings freely
52
  *before* they become throughly discussed. Only at i686 chapter 5
53
  alignment is throughly described although several critical alignment
54
  related topics have already been discussed.
55
ded3ad0 by Ahmed S. Darwish at 2009-09-03 56
* Case 2: on why we should disable interrupts while switching to protected
57
  mode?
58
59
  New Intel document: Software must guarantee that no exceptions or interrupts
60
  are generated during the mode switching operation.
61
62
  Original 386 document: ``The IDTR may be loaded in either real-address or
63
  protected mode. However, the format of the interrupt table for protected mode
64
  is different than that for real-address mode. It is not possible to change to
65
  protected mode and change interrupt table formats at the same time; therefore,
66
  it is inevitable that, if IDTR selects an interrupt table, it will have the
67
  wrong format at some time. An interrupt or exception that occurs at this time
68
  will have unpredictable results. To avoid this unpredictability, interrupts
69
  should remain disabled until interrupt handlers are in place and a valid IDT
70
  has been created in protected mode.''
71
72
* Case 3: on why should we long jump after switching to protected mode?
73
74
  New intel document: ``The JMP or CALL instruction immediately after the MOV
75
  CR0 instruction changes the flow of execution and serializes the processor.''
76
77
  Original 386 document: ``Immediately after setting the PE flag, the
78
  initialization code must flush the processor's instruction prefetch queue by
79
  executing a JMP instruction. The 80386 fetches and decodes instructions and
80
  addresses before they are used; however, after a change into protected mode,
81
  the prefetched instruction information (which pertains to real-address mode)
82
  is no longer valid. A JMP forces the processor to discard the invalid
83
  information.''
84
550f8c7 by Ahmed S. Darwish at 2009-08-25 85
  This is a recurring pattern.
86
87
* What's IA-32e in new Intel documents?
88
  Intel has really messed up the namings of their 64bit processors. The
89
  most logical choice I think was IA-64, but unfortunately that was taken
90
  by the big flop in processor's history: the Itanium. When Intel waked
91
  up and created their first amd64 compatible processor, they named it EMT64.
92
  
93
  Now intel in its new official documents call the x86_64 architecture
94
  "intel 64" and call those processor's 64bit mode IA-32e.
95
96
97
On Data Alignment:
98
------------------
99
100
* Note: In x86 speak, word = 2bytes/16bits, double word = 4bytes/32bits
101
102
* Each byte within a word has its own address, and the smaller of the
103
  addresses represent the address of the whole word.
104
105
* Each byte within the double word has its own address. The smallest of the byte
106
  addresses is the address of the double word.
107
108
* ``Note that words need not be aligned at even-numbered addresses and
109
  doublewords need not be aligned at addresses evenly divisible by four.
110
  This allows maximum flexibility in data structures (e.g., records containing
111
  mixed byte, word, and doubleword items) and efficiency in memory
112
  utilization.
113
7bc027c by Ahmed S. Darwish at 2010-07-24 114
  ** When used in a configuration with a 32-bit bus, actual transfers of
115
  data between processor and memory take place in units of doublewords (4bytes)
116
  beginning at addresses evenly divisible by four **
550f8c7 by Ahmed S. Darwish at 2009-08-25 117
118
  However, the processor converts requests for misaligned words or doublewords
119
  into the appropriate sequences of requests acceptable to the memory interface.
7bc027c by Ahmed S. Darwish at 2010-07-24 120
  ** Such misaligned data transfers reduce performance by requiring extra memory
550f8c7 by Ahmed S. Darwish at 2009-08-25 121
  cycles. For maximum performance, data structures (including stacks) should
122
  be designed in such a way that, whenever possible, word operands are aligned
123
  at even addresses and doubleword operands are aligned at addresses evenly
7bc027c by Ahmed S. Darwish at 2010-07-24 124
  divisible by four. **
550f8c7 by Ahmed S. Darwish at 2009-08-25 125
126
  Due to instruction prefetching and queuing within the CPU, there is no
127
  requirement for instructions to be aligned on word or doubleword boundaries.
128
  However, a slight increase in speed results if the target addresses of control
129
  transfers are evenly divisible by four.''
130
  	    	       		    	   -- Intel 386 Reference Manual
131
132
* ``The reason for this is that the processor requires two memory accesses to
133
  make an unaligned memory access; whereas, aligned accesses require only one
134
  memory access. A word or doubleword operand that crosses a 4-byte boundary
135
  or a quadword operand that crosses an 8-byte boundary is considered unaligned
136
  and requires two separate memory bus cycles to access it; a word that starts
137
  on an odd address but does not cross a word boundary is considered aligned
138
  and can still be accessed in one bus cycle.''
139
      	  	   	       	       	   -- Intel 686 Reference Manual
140
141
* x86 4KB page aligned addresses:
142
143
  Decimal:
144
  10 ^ 0 = 1
145
  10 ^ 1 = 10
146
  10 ^ 2 = 100
147
  10 ^ 3 = 1000
148
  ...
149
150
  Binary:
151
  2  ^ 0 = 1
152
  2  ^ 1 = 10
153
  2  ^ 2 = 100
154
  ...
155
  2  ^ 10 = 10000000000 = 1024 = 1KB
156
  2  ^ 12 = 1000000000000 = 2^2 * 1KB = 4KB (x86 Page size/aligned-address)
157
  ...
f654c49 by Ahmed S. Darwish at 2009-09-07 158
159
* Common alignments offsets:
160
161
  1byte alignment = 2 ^ 0  -- any x86 RAM address
162
163
  2byte alignment = 2 ^ 1 (0b10) -- 16bit aligned address
164
  = any address divisable by 2
165
  = 0x0, 0x2, 0x4, 0x6, 0x8, ..., 0xa, 0xc, 0xe, 0x10, 0x12, ...
550f8c7 by Ahmed S. Darwish at 2009-08-25 166
  
f654c49 by Ahmed S. Darwish at 2009-09-07 167
  4byte alignment = 2 ^ 2 (0b100) -- 32bit aligned address
168
  = any address divisable by 4
169
  = 0x0, 0x4, 0x8, 0xc, 0x10, 0x14, 0x18, 0x1c, 0x20, ...
170
171
  8byte alignment = 2 ^ 3 (0b1000) - GDT cell index/size
172
  = any address divisable by 8
173
  = 0x0, 0x8, 0x10, 0x18, 0x20, ...
174
     |    |    |
175
   null  %cs  %ds
176
177
  16byte alignment = 2 ^ 4 (0b10000) - GDT table recommended alignment
178
  = any address divisable by 16
179
  = 0x0, 0x10, 0x20, 0x30, 0x40, ..., 0xf0, 0x100, 0x110, 0x120, ...
180
ded3ad0 by Ahmed S. Darwish at 2009-09-03 181
* Useful numbers:
182
183
  0x100   = 256 bytes
e4d1afe by Ahmed S. Darwish at 2010-07-29 184
  0x200   = 512 bytes  (bootsector size, number of entries in PML{4, 3, 2} tables)
ded3ad0 by Ahmed S. Darwish at 2009-09-03 185
  0x400   = 1KB
f654c49 by Ahmed S. Darwish at 2009-09-07 186
  0x1000  = 4KB        (x86 normal page size/alignment)
ded3ad0 by Ahmed S. Darwish at 2009-09-03 187
  0x8000  = 32KB
188
  0xffff  = 64KB -1    (max offset for a real-mode segment)
189
  0x10000 = 64KB
f654c49 by Ahmed S. Darwish at 2009-09-07 190
  0x80000 = 512KB      (our initial kernel size!)
550f8c7 by Ahmed S. Darwish at 2009-08-25 191
b3bbe1d by Ahmed S. Darwish at 2010-11-16 192
812cb17 by Ahmed S. Darwish at 2009-10-28 193
On little-endian byte ordering:
194
-------------------------------
195
196
* From AMD64 documents:
197
198
  "The x86 and AMD64 architectures address memory using little-endian
199
  byte-ordering. Multibyte values are stored with their least-significant byte at
200
  the lowest byte address, and they are illustrated with their least significant
201
  byte at the right side. Strings are illustrated in reverse order, because the
202
  addresses of their bytes increase from right to left."
203
204
  Value =
205
206
  BYTE-7  BYTE-6  BYTE-5  BYTE-4  BYTE-3  BYTE-2  BYTE-1  BYTE-0
207
208
  Actual Mem Layout:
209
210
  BYTE-0  BYTE-1  BYTE-2  BYTE-3  BYTE-4  BYTE-5  BYTE-6  BYTE-7
211
  ^
212
  |--- Address start here
213
214
  For illustrative purposes only - not actual mem layout:
215
216
  GREATEST                 DATA STRUCTURE
217
  ADDRESS
218
   31              23              15              7             0 <--BIT
219
  +---------------+---------------+---------------+---------------+   OFFSET
220
  |    BYTE 7     |    BYTE 6     |    BYTE 5     |    BYTE 4     |4
221
  |---------------+---------------+---------------+---------------|  SMALLEST
222
  |    BYTE 3          BYTE 2          BYTE 1          BYTE 0     |0 ADDRESS
223
  +---------------+---------------+---------------+---------------+^
224
                                                                   |
225
                                             Address start here ---|
226
227
* Classical little-endian ordering:
228
229
  .long  0x00cf9200 = 00 92 CF 00
230
                      ^
231
                      |--- Address start
232
  =
233
234
  +---------------+---------------+---------------+---------------+
235
  |      00       |      CF       |      92       |       00      |
236
  +---------------+---------------+---------------+---------------+^
237
                                                  Address start ---|
238
239
* GAS concatenation:
240
241
  .long  0x00cf9200, 0x0000ffff  != .quad 0x00cf92000000ffff
242
243
  .long  0x00cf9200, 0x0000ffff
244
  =
245
  .long  0x00cf9200
246
  .long  0x0000ffff
247
  =
248
  .quad  0x0000ffff00cf9200
249
  =
250
  (actual layout in memory)
251
  00 92 CF 00  FF FF 00 00
252
  ^
253
  |--- Address start
254
  =
255
  (for illustrative purposes only - _NOT_ actual mem layout)
256
  +---------------+---------------+---------------+---------------+
257
  |      00       |      00       |      FF       |       FF      |
258
  |---------------+---------------+---------------+---------------|
259
  |      00       |      CF       |      92       |       00      |
260
  +---------------+---------------+---------------+---------------+^
261
                                                  Address start ---|
262
  BUT:
263
264
  .long  0x0000ffff
265
  .long  0x00cf9200
266
  =
267
  .quad  0x00cf92000000ffff
268
  =
269
  (for illustrative purposes only - _NOT_ actual mem layout)
270
  +---------------+---------------+---------------+---------------+
271
  |      00       |      CF       |       92      |      00       |4
272
  |---------------+---------------+---------------+---------------|
273
  |      00       |      00       |       FF      |      FF       |
274
  +---------------+---------------+---------------+---------------+^
275
                                                  Address start ---|
276
  =
277
  (actual layout in memory)
278
  FF FF 00 00  00 92 CF 00
279
  ^
280
  |--- Address start
281
282
* So:
283
284
  test.S:                           |  print.c:
285
                                    |
286
  .global test                      |  #include <stdint.h>
287
  test:                             |  extern uint64_t test;
288
        .quad 0x00cf92000000ffff    |  int main(void) { printf("test = 0x%x\n", test); }
289
290
  output: test = 0x0000ffff
291
  as the 64-bit value is presented as: (address-start) FF FF 00 00 00 92 ... (address-end)
292
293
* The String "WXYZ" is represented in memory as:
294
295
  'W' 'X' 'Y' 'Z'
296
  ^
297
  |--Address start
298
299
  And in the usual memory diagrams:
300
301
  +---------------+---------------+---------------+---------------+
302
  |      'Z'      |      'Y'      |      'X'      |      'W'      |
303
  +---------------+---------------+---------------+---------------+^
304
                                                  Address start ---|
305
b3bbe1d by Ahmed S. Darwish at 2010-11-16 306
0db0564 by Ahmed S. Darwish at 2009-08-30 307
On Real Mode:
308
-------------
309
310
* To achieve maximum backward compatibility, x86 processors starts in the 16bit
311
  8086-compatible real-mode. The 8086 had a complete 16-bit architecture - 16-bit
312
  internal registers, 16-bit data bus, and a *20-bit* external address bus (1 MB
313
  of physical memory).
314
315
  Because the 8086 processor had 16-bit index registers and memory pointers,
316
  it can effectively address *only* up to 64 KB of memory, which wasn't acceptable
317
  even by that time. 
318
319
* So, to address the memory beyond 64 KB (the remaining 20bit 1MB address space)
320
  with 16bit registers, Intel created a `segmented' memory addressing scheme.
321
322
  We want to access 20bit address space, namely 0xfffff at max. The 8086
323
  processor's registers are only 16bit, namely a max value of 0xffff. To solve
324
  this issue, the 8086 processor used *two* registers to address physical memory.
325
  The two registers are a segment described as a base address, and an offset within
326
  that segment.
327
328
  The goal: addressing up to the max 20bit value; 0xfffff:
329
330
  f 0 0 0 0     <-- 16bit value (segment base address) hex left shifted once to
331
      	  	    address the 4 highest order bits in the goal.
332
            +   <-- A simple adding operation
333
334
    f f f f	<-- 16bit value, max offset we can address
335
336
    -------
337
338
  f f f f f	<-- The goal (max 20bit value we can address)
339
  
340
  So physical memory in the 8086 is accessed through a `segment:offset' scheme
341
342
  where physical address = (segment * 16) + offset
343
344
* Note that the above segment model is very primitive that there are really no
5097e48 by Ahmed S. Darwish at 2009-08-30 345
  real `segments' identified by the processor. A *physical* 'segment' is
346
  basically just a 16-bit aligned physical address address with an offset
347
  `limit' of 64K enforced by the processor registers limit.
0db0564 by Ahmed S. Darwish at 2009-08-30 348
349
  Example:
350
5097e48 by Ahmed S. Darwish at 2009-08-30 351
  a000:0010 = a010:0000         /* Overlapping 'segments' */
352
  f000:ffff = ffff:000f         /* Maximum address using different 'segments' */
353
  40:100    = 50:00             /* Last byte in Bios Data Area */
354
355
  It's important to note that a segment register can take any value ranging
356
  from 0 to 0xffff including the non-16bit-aligned addresses. It's on *physical*
357
  memory where a segment base address is always 16-bit aligned due to the left
358
  shift, not the segment register value itself.
359
360
  Example of completely valid segment register values and their physical
361
  addresses:
362
363
  Segment register value    |   Segment physical address
364
  -------------------------------------------------------
365
          0xffff            |           0xffff0
366
          0x1abc            |           0x1abc0
367
          0x4242            |           0x42420
368
          ...               |           ...
369
          0xyyyy            |           0xyyyy0, y = /[0-f]/
0db0564 by Ahmed S. Darwish at 2009-08-30 370
371
* Some calculations:
372
373
  biggest segment address possible = 0xffff
374
375
  biggest offset possible = 0xffff
376
377
  biggest *physical* address that can act as a segment address =
f654c49 by Ahmed S. Darwish at 2009-09-07 378
    biggest 20-bit address which is 16-byte aligned = 0xffff0
0db0564 by Ahmed S. Darwish at 2009-08-30 379
  
380
* NOTE: There's a huge difference between real-mode `segments' and segments
381
  provided in 386+ protected-mode processors. The 16bit value in a segmented
382
  protected-mode logical address is what's called a `segment selector' which
383
  is basically a pointer to a `segment descriptor' stored in either the GDT or
384
  the LDT. The segment descriptor includes info like the protected-mode segment
385
  base, limit, access permissions, present bit, etc.
386
387
  On the other hand, the real-mode `segments' are just primitive addresses
388
  that get hex left shifted in address calculations.
550f8c7 by Ahmed S. Darwish at 2009-08-25 389
c6c1eef by Ahmed S. Darwish at 2009-09-01 390
* Note that using this scheme, code segments are limited to a size of 64KB.
391
  To reach more code, we had to jump to what's called 'far pointers', which
392
  not only contains an offset relative to current code segment, but also a
393
  new address for the code segment.
394
395
  AT&T syntax for far jumps: `ljmp  $SEGMENT, $OFFSET' where 'segment' and
396
  'offset' are naturally 16bit values.
397
398
  Usage example: after loading the kernel to ram at its chosen segment which
399
  is usually far away from the mbr code segment, we 'far jump' to this segment
400
  with offset 0 to execute the loaded kernel code and go beyond the mbr's 512
401
  bytes size limit. This is the same technique used by GRUB to loads its stages.
402
403
* The above terminology is also the root of 386+ protected mode 'far pointer'
404
  which is a 'logical address' containing a segment selector (16 bit) and an
7bc027c by Ahmed S. Darwish at 2010-07-24 405
  offset (32 bit) relative to the segment pointed by that selector.
c6c1eef by Ahmed S. Darwish at 2009-09-01 406
b3bbe1d by Ahmed S. Darwish at 2010-11-16 407
4b12840 by Ahmed S. Darwish at 2009-08-30 408
On accessing fixed disk drivers (hard-disks):
409
---------------------------------------------
410
411
* There's the usual heads, tracks, sectors, and cylinders. 
412
413
  Track: when the head is positioned over a point on the fixed disk, ready
414
  to read or write, the disk platter surface spins underneath it, tracing a
415
  full circle. This circle is a track. A disk surface contains may tracks.
416
417
  Sectors: Fixed disk systems divides each track into short arcs called
418
  sectors, each sector usually holds 512 bytes. This is the smallest unit
419
  of storage, as tracks would mean wasting a lot of space for small files.
420
421
  Cylinder: all the heads move in and out together, so each head is always
422
  physically located at the same track number; we can't have one head at
423
  track 0, and another at 250. Because of this, often the track location of
424
  the heads is not referred to as a track number but rather as a cylinder
425
  number.
426
c6c1eef by Ahmed S. Darwish at 2009-09-01 427
  Disk size = bytes per sector * sectors per track * number of cylinders *
428
              number of heads
429
4b12840 by Ahmed S. Darwish at 2009-08-30 430
* BIOS fixed disk services are provided through the INT 13h service
431
  routine. It programs the hard-disk controller; it never writes directly
432
  from the processor to the hard-disk drive. The hard-disk controller
433
  communicates directly between the hard-disk, the processor, and system
434
  memory.
435
c6c1eef by Ahmed S. Darwish at 2009-09-01 436
* To support disks beyoned the 8GB limit enforced by original IBM bios
437
  design (max of 1024 cylinder, 256 heads, and 63 sectors), new INT13
438
  bios services were added to bios which offers simple sector addressing
439
  using a logical zero based sector number (0, 1, 2, ..)
440
441
  This addressing scheme is easier than the Cylinder-Head-Sector model
442
  and is supported in all bios chips post-1995, so we use it exclusively.
443
444
  Check the boot sector code for more details.
b46a295 by Ahmed S. Darwish at 2009-10-11 445
e4d1afe by Ahmed S. Darwish at 2010-07-29 446
7bc027c by Ahmed S. Darwish at 2010-07-24 447
On x86 jumps:
448
-------------
449
450
* References:
451
  - Intel i686 Manual, vol. 2, Instruction Set Reference, Chapter 3: JMP
452
  - Sun Microsystems, [AT&T] x86 Assembly language reference manual
e4d1afe by Ahmed S. Darwish at 2010-07-29 453
  - Cute: boot/bootsect.S, boot/head.S, boot/e820.S, boot/trampoline.S
7bc027c by Ahmed S. Darwish at 2010-07-24 454
455
* Judging by my early naive trial-and-error experiences and the nature of
456
  newbies questions at osdev.org forums, the most common failure point while
457
  developing bootloaders and early kernel stages is either an intra or an
458
  inter-module jump. Because of this, we'll clarify the most used x86 jumps
459
  here, detailing their AT&T syntax and opcodes. We don't use the Intel
460
  syntax in our kernel, so it won't be discussed here; it's already clearly
461
  defined in the official Intel manuals.
462
463
  First, let's stress that a 'jump' is only a mechanism to disrupt the CPU
464
  execution flow, moving EIP to a different code (hopefully not data!) region.
465
466
  Generally, x86 jumps can be classified to either near or far. Near jumps
467
  are 'intra-segment', meaning that they only move within the limits of the
468
  currently setup and cashed code segment. Meanwhile, far jumps are
469
  'inter-segment': they take a (mostly new) code segment descriptor and an
470
  offset within such segment.
471
472
  The kind of displacement in such jumps can also be either relative or
96aef36 by Ahmed S. Darwish at 2010-07-27 473
  absolute. Relative addresses (wrt the next instruction address) were
474
  introduced to save opcodes size. [*]
475
476
  Absolute addresses are very important when relative offsets to an
477
  outside-of-code-object destination address cannot be validly calculated
478
  at link-time. Check the 'When %RIP-relativeness goes wrong' section's
479
  summary for further details.
7bc027c by Ahmed S. Darwish at 2010-07-24 480
481
  Examples of jumps:
482
483
	<AT&T Syntax>			  |	<Objdump Disassembly>
484
485
  [A] Near jump, 8-bit relative displacement (-128):  [short jump]
486
487
  1:	movl $0xdeadbeef, %eax		  | af:     b8 ef be ad de
488
	jmp  1b				  | b4:     eb f9
489
490
  [B] Near jump, 8-bit relative displacement (+127):  [short jump]
491
492
  1:	movl $0xdeadbeef, %eax	      	  | af:     b8 ef be ad de
493
	jmp  2f				  | b4:     eb 00
494
  2:	movl $0xcafebabe, %ebx		  | b6:     bb be ba fe ca
495
496
  [C] Near jump, 16-bit relative displacement:
497
498
	[Not supported by GAS]
499
500
  [D] Near jump, 32-bit relative displacement:
501
502
	jmp    dst			  | 12:     e9 f8 ff 0f 00
503
  .skip 0xffff8				  | ...
504
  dst:  movl   $0xdeadbeef, %eax	  | 10000f: b8 ef be ad de
505
506
  [E] Near jump, absolute indirect displacement (register):
507
508
	mov   $0x100000, %eax             | af:     b8 00 00 10 00
509
	jmp   *%eax			  | b4:     ff e0
510
511
  [F] Near jump, absolute indirect displacement (memory reference):
512
513
	jmpl   *(0x7c00 + mem)		  | af:     ff 25 b5 7c 00 00
514
  mem:	.long  0x100000                   | b5:     00 00 10 00
515
516
  (AT&T syntax uses the '*' operator for indirect jumps)
517
518
  Note that the _only_ available way for specifying _near_ absolute
519
  displacements is to encode such displacements 'indirectly' (i.e. not
520
  in the operands themselves) using a register or a memory reference.
521
522
  [G] Far jump, absolute 32-bit far pointer in operand (16-bit %cs
523
  index + 16-bit offset):
524
525
  .code16				  |
526
	ljmp   $0x08, $(0x7c00 + rel)	  | 99:     ea 9e 7c 08 00
527
  .code32				  |
528
  rel:	movl   $0x10000, %esi             | 9e:     be 00 00 01 00
529
530
  [H] Far jump, absolute 48-bit far pointer in operand (16-bit %cs
531
  index + 32-bit offset):
532
533
  .code32				  |
534
	# KERNEL_CS = 0x08		  |
535
	# startup_64 = 	0x100112	  |
536
	ljmp   $KERNEL_CS, $startup_64	  | 10006f: ea 12 01 10 00 08 00
537
538
  [I] Far jumps, 32-bit or 48-bit absolute indirect displacement
539
  (memory reference):
540
541
	[Rarely used, let's only know that they exist]
542
96aef36 by Ahmed S. Darwish at 2010-07-27 543
  [*] footnote: On x86-32, only jumps and calls had the relative addressing
544
  facility. To save opcode size, especially while handling 64-bit wide virtual
545
  addresses, AMD introduced a global and default %RIP-relative addressing
546
  scheme for most ops, including most important, dereferencing global data.
547
548
e4d1afe by Ahmed S. Darwish at 2010-07-29 549
On referencing global symbols:
550
------------------------------
551
552
* References:
553
  - GNU LD Manual, 'Using LD', Section 3.5.4 -'Source Code Reference'
554
  - Cute: include/idt.h, kernel/idt.S
555
556
* To gracefully fail instead of triple faulting the machine, one of the first
557
  things implemented while writing the C part of the kernel was setting up CPU
558
  exception handlers. To setup these handlers, we needed access to the IDT table
559
  and its descriptor; both were already defined in assembly files. What are the
560
  rules for accessing these data structures natively in C?
561
562
  When a global symbol is declared in C, two things happen. The compiler reserves
563
  enough space in the executable .data or .bss section holding the _value_ of the
564
  symbol. Second, it creates an entry in the program's symbol table which holds
565
  the symbol _address_ (in that section). For a small example, check below
566
  program and its disassembly:
567
568
  --> globals.c (a1):		|  --> [disassembly, $gcc -O3]:
569
  int x;			|  0000000000600800 <y>: 11 11 00 00 ... (.data)
570
  int *y = 0x1111;		|  0000000000600818 <x>: ...  (.bss, 8 bytes)
571
				|
572
  int main(void) {		|  0000000000400460 <main>:
573
    x = 0xdeadbeef;		|  400460: movq 0x200399(%rip),%rax  #600800 <y>
574
    *y = 0xcafebabe;		|  400467: movl $0xdeadbeef,0x2003a7(%rip) #600818 <x>
575
  }				|  400471: movl $0xcafebabe,(%rax)
576
				|  400477: retq
577
578
  As you can see, accessing the global 'x' required a single memory dereference
579
  of the address 0x600818 stored in the program's symbol table. Meanwhile,
580
  dereferencing the pointer 'y' required TWO memory dereferences: one for
581
  dereferencing the variable 'y' address 0x600800, getting its value/contents,
582
  and _another_ dereference for such value stored (0x1111); after all, a pointer
583
  is just a special 8-byte variable.
584
585
* Furthermore, check below program which has several symbols defined in an
586
  assembly file with different treatments of such symbols (and their data) in C:
587
588
  --> data.S (a2):		|  --> app.c (a2):
589
  .section .data		|  uint32_t x;
590
  .globl x			|  extern uint32_t *y;
591
  x:				|  extern uint32_t z[];
592
	.long 0xdeadbeef	|
593
				|  int main(void) {
594
  .globl y			|    printf("&x = 0x%lx\n", &x);
595
  y:				|    printf("&y = 0x%lx\n", &y);
596
	.long 0xcafebabe	|    printf("&z = 0x%lx\n", &z);
597
				|
598
  .globl z			|    printf("x = 0x%x\n", x);
599
  z:				|    printf("y = 0x%x\n", y);
600
	.long 0xeeeeeeed	|    ...    /* further code: check the output */
601
				|  }
602
603
  $ gcc -O3 data.S app.c -o output.elf
604
  $ objdump -D output.elf -j .data
605
  0000000000600af0 <x>:
606
  600af0:		ef be ad de be ba
607
  0000000000600af4 <y>:
608
  600af4:		be ba fe ca ed
609
  0000000000600af8 <z>:
610
  600af8:		ed ee ee ee
611
612
  $ ./output.elf
613
  'int x'   is a global located at program area (&x = ): 0x600af0
614
  'int *y'  is a global located at program area (&y = ): 0x600af4
615
  'int z[]' is a global located at program area (&z = ): 0x600af8
616
617
  x = 0xdeadbeef		|  *x = <not-applicable>
618
  y = 0xcafebabe		|  *y = <segmentation fault> (0xcafebabe unmapped)
619
  z = 0x600af8			|  *z = 0xeeeeeeed
620
621
  x[0] = <not-applicable>
622
  y[0] = <segmentation fault> since 0xcafebabe is not mapped
623
  z[0] = 0xeeeeeeed
624
625
  As seen from above introspection, y is just a global variable holding
626
  0xcafebabe. y _address_ is '0x600af4', buts its _value_ is 0xcafebabe. Thus,
627
  '*y' will produce two memory dereferences, and the second of the two will
628
  result in a CPU exception (segmentation fault) since 0xcafebabe is not
629
  mapped.
630
631
* So, if we have a structure like the below IDT table descriptor:
632
633
  /* Located at ffffffff8010e240
634
   * Binary value (little endian): 00 10 80 16 11 00 00 80 ff ff */
635
  .globl idtdesc
636
  idtdesc:
637
	.word  idt_end - idt		# limit
638
	.quad  VIRTUAL(KTEXT_PHYS(idt))	# base
639
640
  we can _not_ reference it in C as
641
642
  struct idt_descriptor {
643
	uint16_t limit;
644
	uint64_t base;
645
  } __packed;
646
647
  extern struct idt_descriptor *idtdesc;	/* WRONG!! */
648
  idtdesc->limit = 0;				/* Page-fault exception! */
649
650
  since in above case, the variable pointer _address_ (&idtdesc = ) will be
651
  0x10e240, and the pointer _value_ (idtdesc = ) will equal 0x8000001116801000.
652
  Thus, dereferencing such pointer as in 'idtdesc->limit' above will in fact
653
  dereference 0x10e240 first, then afterwards dereference the unmapped address
654
  0x8000001116801000! Final result: a CPU page fault exception.
655
656
  The only wrong thing above was specifying the wrong type for the 'idtdesc'
657
  label. Having the type as the struct itself instead of a pointer will let the
658
  struct address be 0x10e240, and its value will equal 0xffff8000001116801000
659
  which _is_ the desired result:
660
661
  extern const struct idt_descriptor idtdesc;
662
  idtdesc.limit = 0;				/* Safe!! */
663
664
* Now, if we want to reference the IDT table itself:
665
666
  /* IDT_GATES: number of entries in the table */
667
  .align 16
668
  .globl idt
669
  idt:
670
	.skip  IDT_GATES * 16
671
  idt_end:
672
673
  Assuming a pre-defined C type of 'struct idt_gate', representing a single IDT
674
  table entry structure, all what we need is an array of such types. Namely:
675
676
  extern struct idt_gate idt[IDT_GATES];
677
678
  Note that after all, an array 'A' is a special kind of pointer where A = &A.
679
  (label value and address are the same, equalling link-time label address
680
  stored in the symbol table)
681
682
* Finally, let's reference below exception handler stubs, where they catch the
683
  exception vector number through their positions in the stubs table:
684
685
  /* Each stub code size = IDT_STUB_SIZE bytes */
686
  .globl idt_exception_stubs
687
  idt_exception_stubs:
688
        i = 0
689
	.rept  EXCEPTION_GATES_CNT
690
	movq   $i, %rsi
691
	jmp    default_exception_handler
692
	i = i + 1
693
	.endr
694
695
  Well, let's first remember here that we're pointing to code, so we're only
696
  interested in  each stub address,  not by any mean its contents. So, after
697
  all, since this is a table, an array fits our goals nicely. What will be the
698
  base object type? We only care about addresses, so let the types be 'char*':
699
  char pointers makes pointer arithmetic very simple. An initial try can be
700
701
  extern const char idt_exception_stubs[EXCEPTION_GATES_CNT];
702
703
  Nice, but by this, each table cell is of size 1 byte (char). We want each
704
  table cell to have the size IDT_STUB_SIZE chars (bytes). Namely:
705
706
  idt_exception_stubs[EXCEPTION_GATES_CNT][IDT_STUB_SIZE]
707
708
  which makes the compiler automatically calculate any stub address for us.
709
  If we want the address of the third stub, we'll simply write:
710
711
  void *addr = &idt_exception_stubs[3];
712
713
* Summary: the simple declaration 'extern int x' directly reference the 4 bytes
714
  marked as 'x' in the executable, where 'x' address is specified at link-time
715
  in the program's symbol table. The operator [] in the declaration
716
  'extern type SYMBOL[]' leaves it open only to the number of base type objects
717
  which are present, not *where* they are. SYMBOL[n] just adds more static-time
718
  safety; SYMBOL[n][s] let the compiler automatically calculates an entry (of
719
  size 's')index for us.
720
721
  Meanwhile, the '*' operator in a declaration 'extern TYPE *P' only binds
722
  where the _8-byte variable_ P reside in the executable (its address), NOT
723
  where the base type object is. The base-object itself can be reached only by
724
  a double dereference: dereference of P address first, then a dereference of
725
  P's content itself (which is, in turn, an 8-byte address).
726
727
b46a295 by Ahmed S. Darwish at 2009-10-11 728
The 8259A Programmable Interrupt Controller:
729
--------------------------------------------
730
b15ccb7 by Ahmed S. Darwish at 2009-11-01 731
* References:
732
  - The Intel 8259A PIC datasheet.
733
  - The superb information in "The undocumented PC", chapter 17 - "Interrupt
734
  control and NMI".
735
736
  In subsequent text, and as said by the sheet, IRR = "Interrupt Request
737
  Register", ISR="In-Service Register", IMR = "Interrupt Mask Register".
b46a295 by Ahmed S. Darwish at 2009-10-11 738
739
* As described by Brendan Trotter (bcos) on #osdev, and the sheet:
740
741
  An interrupt request is executed by raising an IR input (low to high) and
742
  holding it high until it is acknowledged (Edge Triggered Mode). Where
743
  "acknowledged" means doing so by the PIC chip, not by the CPU/OS . Also, in
744
  this case "acknowledges" just means that it recognises it and sets the
745
  corresponding flag in the IRR
746
747
  Basically, when an interrupt is received the PIC sets the corresponding flag in
748
  the IRR. When the PIC sends the IRQ to the CPU it clears the flag in the IRR
749
  and sets the flag in the ISR. And when the CPU sends EOI the PIC clears the
750
  flag in the ISR. The processor sends an EOI through the the INTA pin.
751
752
  If the PIC sends a high priority IRQ to the CPU and is waiting for EOI when a
753
  low priority IRQ occurs, then the PIC must remember that the low priority IRQ
754
  occured (so that it can tell the CPU about it when the CPU is finished with the
755
  high priority IRQ), by setting its IRR bit.
756
757
  The 8259 cannot be used with level interrupts. This is because when
758
  an interrupt line goes true, it is latched internally in the IRR.
759
760
* Spuruious interrupts:
761
762
  As said by the sheet: "If no interrupt request is present at step 4 of either
763
  sequence (i.e., the request was too short in duration), the 8259A will issue an
764
  interrupt level 7. Both the vectoring bytes and the CAS lines will look like an
765
  interrupt number 7 was requested."
766
767
  And as said in undoc pc: "Although a noise glitch should never occur, a poor
768
  bus design or badly glitch will trigger an invalid interrupt. If the glitch
769
  occurs on IRQ 0 to IRQ7, interrupt F (which is mapped to IRQ 7) will be
770
  called. The service handler for interrupt Fh and 77H (assuming the PIC wasn't
771
  re-mapped to more sane vector-numbers) should read the in-service register to
772
  confirm that a real interrupt occurred. An invalid interrupt has In-Service bit
773
  7 cleared, and a valid interrupt has valid bit 7 set.
774
775
* So why should the PIC bother sending the CPU a spurious 7 IRQ? it could've just
776
  ignored the noise glitch source. The problem is that it might have told the CPU
777
  it's about to get an interrupt, and if it did (step 1, 2, and 3), it can not
778
  now decode the interrupt info since the line has been deasserted. Either the
779
  assert period was too short that the IRQ number was not latched internally, or
780
  the PIC just recognized it was a noise glitch due to the very short assertion
781
  period. So it gives the programmed IRQ 7 vector number to the CPU. This avoids
782
  unccessary communication with the real devices attached to IRQ7.
783
784
* Take care that, as said by linux kernel comments, the two spurious IRQs 7 and
785
  15 usually resulting from the 8259A-1|2 PICs do occur even if all the IRQs are
786
  masked (IMR = 0xff).
787
788
* An example of an interrupt (refs: undoc pc + intel sheet):
789
790
  1- The adapter card requests service, by driving the hardware IRQ line 5 from
791
     low to high
792
793
  2- The IRQ 5 line is connected to the PIC. The edge-triggered request is
794
     recorded in the controllers Interrupt-Request Register (IRR). In this case,
795
     the IRR bit 5 is set.
796
797
  3- The PIC first checks if the Interrupt Mask Register (IMR) to see if IRQ5 is
798
     allowed (not masked, IMR bit 5 = 0). Assuming that bit clear, the PIC moves
799
     to the next step.
800
801
  4- Using its priority resolver uint, the conntroller checks if any higher
802
     priority interrupts are active or in progress. If so, no further action is
803
     done till all higher priority irqs are serviced.
804
805
  5- Once all higher irqs are serviced, the controller flags the CPU over the
806
     dedicated INT hardware line leading to the CPU.
807
808
  6- If the CPU has hardware interrupts enabled, the CPU returns an interrupt
809
     acknowledgment to the PIC using the INTA pin.
810
811
  7- Directly after receiving a pulse from the INTA pin, the relevant In-Service
812
     register bit (bit 5) is set, and the corresponding IRR bit is reset.
813
814
  8- The processor sends another INTA pulse. During this pulse, the PIC sends
815
     a programmed 8-bit pointer back to the CPU, according the raised IRQ. In
816
     the default bios programmed PIC case, the pic will return the interrupt
817
     value 0x0D for IRQ5. 
818
819
  9- In an instruction boundary, the cpu interrupts the current process and
820
     jumps to the code address specified in the IDT entry 0xD. For x86-64 cpus,
821
     entry 0xD = IDT base (specified through lidt) + 0xD * 16.
822
823
 10- One of the key aspects of the interrupt handler specified in the IDT entry
824
     must make is to remove the interrupt request made by the adapter. The line
825
     maybe shared, and several ISRs got executed, where they determine first if
826
     they are servicing the right device using the "interrupt pending" register
827
     found in PCI devices, etc.
828
829
 11- At the end of the interrupt handler, the handler sends the value 0x20 to
830
     the PIC command port to indicate an End-of-Interrupt (EOI). The EOI command
831
     clears the IRQ5 bit in the In-Service Regisger.(*) The case is different in
832
     case of Automatic EOI (AEOI) mode, and is described later.
833
834
 12- By clearing the IRQ5 bit in the ISR, the chip is now ready to accept another
835
     interrupt from IRQ5 or any lower priority interrupt. This completes the
836
     interrupt cycle.
837
b3bbe1d by Ahmed S. Darwish at 2010-11-16 838
b15ccb7 by Ahmed S. Darwish at 2009-11-01 839
* On x86 SMP initialization sequence:
840
-------------------------------------
841
842
* References:
843
  - Intel MP specification v1.4, Appendix B - Operating System Programming
844
  Guideleins.
845
  - Intel Software Dev. Manual vol3A, Chapter 8, 8.4 - Multiple-Processor (MP)
846
  initialization.
847
  - Intel Software Dev. Manual vol3B, Appendix C - MP Initialization for P6
848
  Family Processors.
849
850
* Generally, there're two base x86 MP initialization sequences. One for the
851
  processors with the external APIC, and a second for processors with internal
852
  APIC (P6+ x86 processors).
853
854
  We won't bother ourself with the processors with external APICs (80486 and
855
  early Pentiums) since we're only targetting long mode.
856
857
  All long mode capable processors contains an integrated apic and comply with
858
  the new Intel MP bootup sequence, where no FIPI or BIPI are needed.
859
860
  On this new sequence, the INIT IPI on AP cores let them wait for the SIPI to
861
  get their starting execution vector instead of _unconditionally_ executing the
862
  BIOS init code. This avoids the ugly dependency on AT+ BIOSes hacks like the
863
  warm-reset vector to let the AP cores start executing from a custom address
864
  instead of BIOS init code. Search the MP spec for `Warm-reset vector'.
865
866
* After system-wide bootup, or reset, the MP initialization protocol is going to
867
  be executed. In this protocol, the hardware setup the APIC IDs for each logical
868
  core according to system topology and assign one of the cores as the bootstrap
869
  core and the others as APs (Application processors/cores).
870
871
  The core nominated as the BSC will fetch and begin executing BIOS bootstrap
872
  code at the reset vector physical address 0xfffffff0.
873
874
  The remaining cores (setup as APs) will enter a "wait-for-SIPI" state.
875
876
* Afterwars, the BIOS init code builds its ACPI and MP tables and count the
877
  number of AP cores, and update its tables entries for each new AP core
878
  found. For the algorithm, check Intel's volume 3A 8.4.4 MP Initialization
879
  Example.
880
881
* At the end, the BIOS puts each AP core to halted state with interrupts
882
  disabled. By the nature of the halted state with interrupts disabled, only
883
  INIT, #NMI, #SMI are able to start the cores again.
884
885
  IMPORTANT: The MP protocol will be executed only after a power-up or RESET. If
886
  the MP protocol has been completed and a BSP has been chosen, subsequent INITs
887
  (either to a specific processor or system wide) do not cause the MP protocol to
888
  be repeated. Instead, each processor examines its BSP flag (in the APIC_BASE
889
  MSR) to determine whether it should execute the BIOS boot-strap code (if it is
890
  the BSP) or enter a wait-for-SIPI state (if it is an AP).
891
892
  So any INIT to any of the AP cores by the OS boot code will let it enter a
893
  'wait-for-SIPI' state to be able to get its start execution vector.
894
895
* Note that the cores get started with all its states uninitialized. We have
896
  to update each core from real to long mode, and update its MMU and other
897
  system tables to match the bootstrap.
898
899
  Kernels usually call this part the SMP `trampoline', since it let the cores
900
  jump from its raw uninitialized state to a state similar to the bootstrap's
901
  one.
902
b3bbe1d by Ahmed S. Darwish at 2010-11-16 903
34568a8 by Ahmed S. Darwish at 2009-11-08 904
* On linkers 'relocation error's:
905
---------------------------------
906
907
* References:
908
  - Relocation types (R_X86_64_*): System V ABI AMD64 supplement,
909
    section 4.4 - Relocation
910
  - The Executable and Linkable Format (ELF) spec, Book I - Relocation
911
912
* One of the errors that faced me at a number of sensitive places like the
913
  kernel head, the SMP trampoline, and the linker script was LD linker's
914
  'relocation error'. Due to their relative opaqueness, they puzzled me to
915
  a great extent at first. An example of some of those errors from linking
916
  a preliminary trampoline code to the rest of the kernel image was:
917
918
  trampoline.o: In function `trampoline':
919
  (.text+0x1): relocation truncated to fit: R_X86_64_16 against `.text'
920
  trampoline.o: In function `_start':
921
  (.text+0xc): relocation truncated to fit: R_X86_64_16 against `.data'
922
  trampoline.o: In function `_start':
923
  (.text+0x11): relocation truncated to fit: R_X86_64_16 against `.data'
924
  trampoline.o: In function `_start':
925
  (.text+0x14): relocation truncated to fit: R_X86_64_32 against `.bss'
926
  trampoline.o: In function `_start':
927
  (.text+0x1d): relocation truncated to fit: R_X86_64_32 against `.bss'
928
  trampoline.o: In function `_start':
929
  (.text+0x21): relocation truncated to fit: R_X86_64_32 against `.bss'
930
  trampoline.o: In function `_start':
931
  (.text+0x27): relocation truncated to fit: R_X86_64_32 against `.bss'
932
  trampoline.o: In function `_start':
933
  (.text+0x2b): relocation truncated to fit: R_X86_64_32 against `.bss'
934
935
* We need to understand relocation to parse above errors. Relocation Information
936
  sections contain information about unresolved references. Since compilers and
937
  assemblers do not know at what absolute memory address a symbol will be
938
  allocated, and since they are unaware of definitions of symbols in other files,
939
  __every__ reference to such a symbol will create a relocation entry.
940
941
  The relocation entry will point to the code address where the reference is
942
  being made, and to the symbol table entry that contains the symbol that is
943
  referenced. The linker will use this information to fill in the correct address
944
  after it has allocated addresses to all symbols.
945
946
* Dumping the ELF object trampoline.o relocation section ($readelf -a) leads to:
947
948
  Relocation section '.rela.text' at offset 0x438 contains 8 entries:
949
    Offset          Info           Type           Sym. Value    Sym. Name + Addend
950
  000000000001  00010000000c R_X86_64_16       0000000000000000 .text + 5
951
  00000000000c  00020000000c R_X86_64_16       0000000000000000 .data + 0
952
  000000000011  00020000000c R_X86_64_16       0000000000000000 .data + 6
953
  000000000014  00030000000a R_X86_64_32       0000000000000000 .bss + 0
954
  00000000001d  00030000000a R_X86_64_32       0000000000000000 .bss + 0
955
  000000000021  00030000000a R_X86_64_32       0000000000000000 .bss + 1007
956
  000000000027  00030000000a R_X86_64_32       0000000000000000 .bss + 0
957
  00000000002b  00030000000a R_X86_64_32       0000000000000000 .bss + 2007
958
959
  The relocation table fields are described in the ELF specification. In summary:
960
961
  - Offset: location to apply relocation action relative to section start
962
  - Info: symbol table index wrt which the relocation must be made
963
  - Type: from the AMD64 ABI. _16: max bit width = 16 (rmode). _32: 32 bits max
964
  - Addend: constant addend used to compute the value to be stored in the
965
    relocatable field. Ex: `lgdt $(gdt + 0x10)' will print `.data + 10' where
966
    the addend = 0x10
967
968
* Note that all relocatable fields are filled with zeroes in the trampoline
969
  ELF object trampoline.o:
970
971
  # .code16 (real-mode)
972
  0000000000000000 <trampoline>:
973
   0:   ea 00 00 00 60     ljmp   $0x6000,$0x0 # ljmp $0x6000, $_start
974
975
  0000000000000005 <_start>:
976
   5:   e8 2a 00           call   32 <print>   # RIP-relative (7 + 0x002a = 32)
977
   8:   fa                 cli
978
   9:   0f 01 1e 00 00     lidtw  0x0          # lidt $idtdesc
979
   e:   0f 01 16 00 00     lgdtw  0x0          # lgdt $gdtdesc
980
  13:   b8 00 00 00 00     mov    $0x0,%eax    # movl %cr3, %eax
981
  18:   0f 22 d8           mov    %eax,%cr3
982
  1b:   c7 05 00 00 00     movw   $0x0,(%di)   # movl $(init_pgt+0x1007), init_pgt
983
  20:   00 00 00 00 00
984
  25:   c7 05 00 00 00     movw   $0x0,(%di)   # movl $(init_pgt+0x2007), init_pgt
985
  2a:   00 00 00 00 00
986
  2f:   f4                 hlt
987
  30:   eb fd              jmp    2f <_start+0x2a>
988
989
  0000000000000032 <print>:
990
991
  Also note that the offsets in the error messages (.text+0x1), (.text+0x11), ..
992
  are the same offsets reported in the relocation table entries, and are the
993
  same ones filled with zero in the above disassembly.
994
995
* With above information in hand, we can simply deduce the errors cause. We link
996
  the trampoline assembly code against our long mode kernel, but the kernel linker
997
  script sets up the virtual addresses base at 0xffffffff80000000.
998
999
  So the linker, obeying the kernel script, tries to assign huge (> -2GB) virtual
1000
  addresses to our symbols. The linker then discovers that since the trampoline
1001
  to-be-relocated symbols are setup with the R_X86_64_16 model, it's limited to a
1002
  width of 16 bits, and limited to 32 bits in case of R_X86_64_32. Ofcourse those
1003
  bits aren't sufficient, so it has to truncate the addresses producing a wrong
1004
  final executable.
1005
1006
  In a nutshell, there was a reported error _for each_ relocatable symbol used,
1007
  reflecting the number of relocation table entries with a type/bit-width that is
1008
  insufficient to the linker script setup 64-bit virtual addresses.
0aefbcf by Ahmed S. Darwish at 2009-11-21 1009
b3bbe1d by Ahmed S. Darwish at 2010-11-16 1010
0aefbcf by Ahmed S. Darwish at 2009-11-21 1011
* When %RIP-relativeness goes wrong:
1012
------------------------------------
1013
1014
* References:
1015
  - Intel Instruction Set Reference, A-M, vol. 2A. 3.2-CALL--Call procedure.
1016
1017
* One of the interesting errors I faced in jumping away from head or trampoline
1018
  code to C code was %rip-relativeness related errors. Here are two interesting
1019
  cases:
1020
1021
  In head.S, after switching to long mode, and setting up the higher-half
1022
  virtual mappings, we jump to C code:
1023
1024
  .code64
1025
  startup_64:
1026
	movq   $stack_end, %rsp
1027
	...
1028
	# .. Setup identity mappings
1029
	# .. Setup high-half virtual addresses mappings
1030
	# .. Apply the page tables
1031
	# .. let %RIP be its virtual address equivalent
1032
	# .. let %RSP, and GDTR be their virtual addresses equivalent
1033
	# .. Let NULL pointers segfault and avoid physical-address dependent
1034
	#    code: remove identity mappings
1035
	...
1036
	call   kernel_start		# C method: void kernel_start(void);
1037
1038
  But unfortunately we get the machine constantly triple-faulting. As indicated
1039
  by Bochs debugging output:
1040
1041
  [CPU1 ] BxError: Encountered an unknown instruction b1=0x0e (signalling #UD)
1042
  [CPU1 ] modrm was 0x00, nnn was 0, rm was 6
1043
  [CPU1 ] exception(0x06): error_code=0000
1044
1045
  Disassembling the object file, we see that everything is at place. And `callq'
1046
  points to 0xffffffff80109250; the `kernel_start' method:
1047
1048
  # head.o
1049
  00000000001000b9 <startup_64>:
1050
    1000b9:       48 c7 c4 00 50 10 00    mov    $0x105000,%rsp
1051
    ...
1052
    10012f:       e8 1c 91 00 80          callq  ffffffff80109250 <kernel_start>
1053
15681eb by Ahmed S. Darwish at 2010-02-18 1054
  Or does it?
0aefbcf by Ahmed S. Darwish at 2009-11-21 1055
1056
  A closer inspection at the binary disassembly reveals that the call is a
1057
  near, relative, displacement relative to next instruction, 32-bit displacement
1058
  sign-extended to 64-bits in long mode call.
1059
1060
  So the instruction is at 0x10012f, with a length of 5 bytes, relative
1061
  displacemented is calculated from 0x10012f + 5 = 0x100134.
1062
1063
  Code wants to call `kernel_start', so the relative displacemented =
1064
  0xffffffff80109250 - 0x100134 = 0xffffffff8000911c, which is the value encoded
1065
  in the `callq' instruction after sign-extension in a little-endian form. So
1066
  what's wrong?
1067
1068
  Notice that by the time the CPU is executing the `callq', it's actually using a
1069
  __different__ %RIP than 0x10012f. The %RIP is in its (%RIP + 0xffffffff80000000)
1070
  virtual form as the identity mappings are to be removed.
1071
1072
  This clearly messes up the %rip-relative-displacement calculations, and sends
1073
  the CPUs to unpredictable areas, leading to unknown opcode exceptions and
1074
  other horrors.
1075
1076
* Another similar case was faced in the trampoline code, when it jumps away to
1077
  the rest of AP cores C init code. The `callq' instruction was just sending
1078
  the CPU to irregular places, with horrifying results.
1079
1080
  Why the callq misbehaved in trampoline? Because the trampoline code is linked
1081
  to the kernel, but actually  copied to a below-1MB 4KB-page, and then executed
1082
  from there. This messed up external %rip-relativeness.
1083
1084
  Calls to internal trampoline methods worked because those methods retained their
1085
  relative distances between each other when they were copied. On the other hand,
1086
  and by definition, external distances was not kept when calling outside code
1087
  (like the C application_start() code):
1088
1089
  # trampoline.o
1090
  ffffffff8010ae06 <startup_64>:
1091
    ffffffff8010ae06:       48 8d a4 24 00 00 00    lea    -0x80000000(%rsp),%rsp
1092
    ffffffff8010ae0d:       80
1093
    ...
1094
    ffffffff8010ae2e:       e8 0d e3 ff ff          callq  ffffffff80109140 <application_start>
1095
    ffffffff8010ae33:       90                      nop
1096
1097
* Solution: The two examples are special cases of PIC, position-independent code.
1098
  The trampoline is PIC cause it's executed from different position than where
1099
  it's linked. The head.S code also applies as parts of it get executed using a
1100
  different %rip than what it's linked as.
1101
1102
  In the two above cases, we can simply solve them by specifying the absolute
1103
  address of destination code instead of using relative x86 calls. As said in
1104
  Intel instruction reference, x86-64 has a near call, with absolute indirect
1105
  addressing facility. It can be used in GNU AS as follows:
1106
1107
	movq   $application_start, %rax
1108
	jmpq   *%rax
1109
1110
  And also for head.S:
1111
1112
	movq   $kernel_start, %rax
1113
	callq  *%rax
1114
1115
  # head.o
1116
  00000000001000b9 <startup_64>:
1117
    1000b9:       48 c7 c4 00 50 10 00    mov    $0x105000,%rsp
1118
    ...
1119
    10012f:       48 c7 c0 50 92 10 80    mov    $0xffffffff80109250,%rax
1120
    100136:       ff d0                   callq  *%rax
1121
1122
  which solves our problem nicely.
a26984e by Ahmed S. Darwish at 2009-11-27 1123
96aef36 by Ahmed S. Darwish at 2010-07-27 1124
* Summary: For a code object which is run from a _different_ address than what
1125
  it was originally linked to, a 'jmp' or 'call' op to a label (or an address)
1126
  outside of that unlucky object needs absolute addresses instead of the
1127
  _default_ %rip-relative ones. In this case, the 'next opcode address', being
1128
  the base of %RIP-relativeness calculations, is not the same at both link-time
1129
  and run-time; at run-time, the linker-calculated relative offset of the
1130
  destination becomes completely invalid due to the _now-different_ %RIP base,
1131
  jumping intentionally to the unknown.
1132
1133
  Meanwhile, relative jumps and calls to labels within the same object are still
1134
  valid: the relative distance is still maintained within the object itself, no
1135
  matter where it's running from. This kind of internal relativeness is usually
1136
  calculated by the assembler (instead of the linker) and is safe here.
1137
31351ce by Ahmed S. Darwish at 2010-02-18 1138
1139
Testing libc()-like kernel code:
1140
--------------------------------
1141
1142
* An interesting way to test libc()-like kernel code (e.g. std string methods)
1143
  is to replace system's libc implementation with our own. Afterwards, we run a
1144
  huge program (Firefox, OpenOffice or gcc) and observe the results.
1145
1146
  Those tests usually cover huge number of cases, and give a good indicator of
1147
  tested kern code stability. I've usually found that even the smallest mistake
1148
  in such tests produce either a non-behaving app or a direct segfault.
1149
1150
* Unix linkers LD_PRELOAD feature can help us immensely. Assuming our string
1151
  methods resides at 'string_lib.c', we do something similar to:
1152
1153
  # Position-independent object, suitable for dynamic linking
904ce3a by Ahmed S. Darwish at 2010-11-18 1154
  $ gcc --std=gnu99 -O3 -fPIC -c string_lib.c -o string_lib.o
31351ce by Ahmed S. Darwish at 2010-02-18 1155
1156
  # Produce shared object (instead of final executable) that can
1157
  # be linked with other objects to form an executable.
1158
  $ gcc -shared -o string_lib.so string_lib.o
1159
1160
  $ LD_PRELOAD='./string_lib.so' firefox-bin
1161
b3bbe1d by Ahmed S. Darwish at 2010-11-16 1162
  Note: always run these tests with the binaries themselves instead of
1163
  their bash-script wrappers. Such shell wrappers don't play very well
1164
  with LD_PRELOAD at times.
1165
1166
* It is interesting to note that a 30-second run of Firefox produced
1167
  655,227 memset() and memcpy() calls! A 1-minute run of Google Chrome
1168
  produced 1,889,087 calls! (counted by `wc -l').
1169
31351ce by Ahmed S. Darwish at 2010-02-18 1170
a26984e by Ahmed S. Darwish at 2009-11-27 1171
On SMP Memory ordering models:
1172
------------------------------
1173
1174
* References:
1175
  - Kourosh Gharachorloo, "Memory Consistency Models for Shared-Memory
1176
    Multiprocessors", Digital Equipment (DEC) with Stanford and DARPA support
1177
  - Paul McKenny, Memory ordering in modern processors, 2007.09.19 Draft
1178
  - Paul McKenny, Memory Barriers: a Hardware View for Software Hackers, IBM
1179
  - Linux Kernel Documentation/memory-barriers.txt, kernel.org
1180
  - DEC/HP OpenVMS Ask the Wizard, http://www.webcitation.org/5laUMQfyi
1181
  - Intel Manuals vol3A, chapter8, MP Management, 8.2 - Memory Ordering
1182
  - AMD64 Architecture Programmer's Manual, chapter 7 - Memory System
1183
1184
* To understand memory ordering models correctly, we need to understand which
1185
  parts of our code are loads, which are stores, and which are in fact both.
1186
  Below example should cover basic cases of shared global variables usage,
1187
  compare the C code with its assembly output:
1188
1189
  int *global_symbol = (int *)0xdeadbeef;
1190
1191
  int main(void) {
1192
	/* 1) memory load + dependent memory store to stack */
1193
	int t = global_symbol;
1194
1195
	/* 2) memory store */
1196
	global_symbol = (int *)0xbbbb;
1197
1198
	/* A memory load + dependent memory store */
1199
	*global_symbol = 0xcccc;
1200
  }
1201
1202
  Disassembly of section .data:
1203
  ...
1204
  0000000000600840 <global_symbol>:
1205
  600840:       ef
1206
  600841:       be ad de 00 00
1207
1208
  Disassembly of section .text:
1209
  ...
1210
  0000000000400448 <main>:
1211
  400448:       55                      push   %rbp
1212
  400449:       48 89 e5                mov    %rsp,%rbp
1213
  # 1) Mem. load + store to stack
1214
  40044c:       48 8b 05 dd 03 20 00    mov    0x2003dd(%rip),%rax    # 600830 <global_symbol>
1215
  400453:       89 45 fc                mov    %eax,-0x4(%rbp)
1216
  # 2) Mem. store
1217
  400456:       48 c7 05 cf 03 20 00    movq   $0xbbbb,0x2003cf(%rip) # 600830 <global_symbol>
1218
  40045d:       bb bb 00 00
1219
  # 3) Mem. load + store to pointed value
1220
  400461:       48 8b 05 c8 03 20 00    mov    0x2003c8(%rip),%rax    # 600830 <global_symbol>
1221
  400468:       c7 00 cc cc 00 00       movl   $0xcccc,(%rax)
1222
  40046e:       c9                      leaveq
1223
  40046f:       c3                      retq
1224
1225
* Terminology of memory orders:
1226
  - Program order: the memory operations order specified in object code.
1227
  - Perceived order: the order a CPU perceives of its and other CPUs memory ops.
1228
10e19e7 by Ahmed S. Darwish at 2009-12-04 1229
* Memory ordering is a tradeoff between programming ease and performance. On
1230
  uniprocessors, many ops are assumed to execute in program order. Processors can
1231
  provide the illusion of sequensiality by only preserving sequential order among
1232
  memory operations to the same location.
1233
1234
  Sequential consistency defined by Lamport provides a simple model to programmers,
1235
  but it adversely affects efficiency and performance: the illusion trick provided
1236
  in uniprocessors can't be used on MP systems to provide sequential consistency.
1237
1238
  Weak ordering distinguishes between ordinary memory operations and memory ops
1239
  used for synchronization. By ensuring consistency only at the synchronization
1240
  points, weak ordering allows ordinary operations in between pairs of
1241
  synchronizations to be reordered with respect to one another.
1242
1243
  Such specifications can be considered system-centric since they in effect expose
1244
  the low-level system optimizations to the programmer.
1245
a26984e by Ahmed S. Darwish at 2009-11-27 1246
* General rules for all architectures:
1247
  - A given CPU will always _perceive_ its own memory operations as occuring
1248
    in program order, without a need for memory barriers.
1249
  - Naturally-aligned loads and stores are always atomic
1250
  - Code that uses standard synchronization primitives (spinlocks, RCU, ..)
1251
    should not need explicit mem barriers. Only tricky code that bypasses those
1252
    primitives need those barriers. An example of a linked list lockless search
1253
    is given below.
1254
1255
* Example:
1256
1257
  Writer CPU:
1258
  -----------
1259
  # Linked list head is `smack_known'
1260
  struct smack_known *smack_known;
1261
  mutex_lock(&smack_known_lock);
1262
  if (skp == NULL) {
1263
	skp = kzalloc(sizeof(struct smack_known), GFP_KERNEL);
1264
	if (skp != NULL) {
10e19e7 by Ahmed S. Darwish at 2009-12-04 1265
  1-		skp->smk_next = smack_known;
1266
  2-		skp->smk_secid = smack_next_secid++;
1267
  3-		skp->smk_cipso = NULL;
a26984e by Ahmed S. Darwish at 2009-11-27 1268
		/* 1- BUG: Write barrier needed! */
10e19e7 by Ahmed S. Darwish at 2009-12-04 1269
  4-		smack_known = skp;
a26984e by Ahmed S. Darwish at 2009-11-27 1270
	}
1271
  }
1272
  mutex_unlock(&smack_known_lock);
1273
1274
  Reader CPU:
1275
  -----------
1276
  # Just locklessly access the list
1277
  struct smack_label *skp;
1278
  skp = smack_known;
1279
  while (skp) {
1280
	/* 2- BUG: Alpha CPUs!! */
10e19e7 by Ahmed S. Darwish at 2009-12-04 1281
  5-	if (skp->smk_secid == 10) {
a26984e by Ahmed S. Darwish at 2009-11-27 1282
		...
1283
	}
1284
	...
1285
	skp = skp->next;
1286
  }
1287
1288
  Bug 1:
1289
  ------
1290
  In the writer CPU, the latest 3 stores can be re-ordered in any combination.
1291
  In arhitectures that reorder stores like Alpha, Itanum, POWER, and SPARC, this
1292
  may lead other CPUs to perceive an incomplete node added to the list: a garbage
1293
  smk_secid value. This may even lead to graver results if you notice that the
1294
  first line (skp->smk_next = smack_known) is actually a load followed by a
1295
  store. So the sequence is:
1296
1297
  Load from smack known
1298
  store to ->smk_next address
1299
  ...
1300
  store to smack_known
1301
1302
  The cpu can reorder the last stores, leading to a linked list head with a
1303
  bogus next pointer; effectively a user-space segmentation fault or worse,
1304
  a CPU exception if in kernel mode.
1305
1306
  Bug 2:
1307
  ------
1308
  As said by OpenVMS documenatation (Ask the Wizard):
1309
1310
  "Your producer must issue a 'memory barrier' instruction after writing the data
1311
  to shared memory and before inserting it on the queue; likewise, your consumer
1312
  _must_ issue a memory barrier instruction after removing an item from the queue
1313
  and before reading from its memory.  Otherwise, you risk seeing stale data,
1314
  since, while the Alpha processor does provide coherent memory, it does not
1315
  provide implicit ordering of reads and writes.  (That is, the write of the
1316
  producer's data might reach memory after the write of the queue, such that the
1317
  consumer might read the new item from the queue but get the previous values
1318
  from the item's memory."
1319
1320
  But how does the consumer sees the pointer before its data even if write
1321
  barriers were used?
1322
1323
  As said by McKenny, the producer's write barrier will guarantee that the cache
1324
  invalidates performed by the first 3 lines  will reach the interconnect before
1325
  that of the last line, but it "makes absolutely no guarantee about the order in
1326
  which the new values will reach the reading CPU’s core."
1327
1328
  So assume smack_known, and smack_known->smk_secid are on different cache banks.
1329
  Because of the write barrier, the producer's data will reach the interconnect
1330
  before updating the queue. The bank holding smack_known is idle, so its line
1331
  gets invalidated and updated, while the bank holding smk_secid was busy, so the
1332
  cache invalidates for the new element are being delayed. This leads to seeing
1333
  new value for the pointer, but old cached values for the new element at line 5.
1334
1335
  Linux kernel documentation explains by stating "some versions of the Alpha CPU
1336
  have a split data cache, permitting them to have two semantically-related cache
1337
  lines updated at separate times.  This is where the data dependency barrier
1338
  really becomes necessary as this synchronises both caches with the memory
1339
  coherence system, thus making it seem like pointer changes vs new data occur in
1340
  the right order.
1341
1342
* On the x86 side, only below basic memory access is guaranteed to be atomic:
1343
  - Readnig or writing a byte
1344
  - Reading or writing a word aligned on 16-bit boundary
1345
  - Reading or writing a doubleword aligned on 32-bit boundary
1346
  - Reading or writing a quadword aligned on 64-bit boundary
1347
1348
  On P6 family processors and up:
1349
  - Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within
1350
    a cache line. (but how do we know whether a memory access fits a cache line?)
10e19e7 by Ahmed S. Darwish at 2009-12-04 1351
b3bbe1d by Ahmed S. Darwish at 2010-11-16 1352
10e19e7 by Ahmed S. Darwish at 2009-12-04 1353
On the C 'volatile' type qualifier:
1354
-----------------------------------
1355
1356
* References:
1357
  - Eric Eid & John Regehr, "Volatiles Are Miscompiled, and What to Do about It",
1358
    Proceedings of the Eighth ACM and IEEE International Conference.
1359
  - Attached Linux kernel's "volatile-considered-harmful.txt" document, added
1360
    with various core developers thoughts copied from LKML.
1361
  - ISO C99 Committee Draft - September 7, 2007
fea07b7 by Ahmed S. Darwish at 2010-01-21 1362
  - My OSDEV forums post: http://forum.osdev.org/viewtopic.php?t=21210&start=17
10e19e7 by Ahmed S. Darwish at 2009-12-04 1363
1364
* As said by the C standard, "An object that has volatile-qualified type may be
1365
  modified in ways unknown to the implementation or have other unknown side
1366
  effects. Therefore any expression referring to such an object shall be evaluated
1367
  strictly according to the rules of the abstract machine"
1368
1369
  "Actions on objects so declared shall not be optimized out by an implementation,
1370
  or reordered except as permitted by the rules for evaluating expressions".
1371
1372
  It's in a practical manner, "for every read or write to a volatile variable
1373
  that would be performed by a straightforward interpreter for C, exactly one
1374
  load from or store to the memory locations allocated to the variable must be
1375
  performed".
1376
1377
* Volatile is used when we don't want the memory access to be re-ordered, cached,
1378
  or optimized away, as in accessing memory mapped I/O registers. Another case is
1379
  where global variables get changed behind our back by other execution paths as
1380
  in interrupt handlers or other threads:
1381
1382
  int var_set_by_irq_handler = 0;
1383
  ..
1384
  while (var_set_by_irq_handler == 0);
1385
1386
  The compiler has every right to optimize the loop to an infinite loop. To avoid
1387
  this, and force the compiler to access the global variable memory at each
1388
  iteration, we use volatile, since volatile objects "shall not be optimized out
1389
  by an implementation".
1390
1391
* Take care that the volatile is a type qualifier as `constant'. So in below case:
1392
1393
  static volatile uint64_t *msg = NULL;		/* (1) */
1394
  while (msg == NULL)
1395
	asm volatile ("pause");
1396
1397
  The above loop will still be optimized to an infinite loop. The declaration
1398
  states that the memory area _pointed_ by `msg' is volatile, not that the pointer
1399
  itself is volatile. So the above declaration should've been:
1400
1401
  static uint64_t * volatile msg = NULL;	/* (2) */
1402
1403
  specifying a volatile pointer. Declaration at (1) is perfectly fine though if
1404
  we're accessing the variable as in:
1405
1406
  while (*msg == 0x1234)
1407
	asm volatile ("pause");
1408
1409
  cause now we're referencing the memory declared as volatile. The compiler must
1410
  fetch the memory pointed by 'msg' at each iteration.
1411
1412
* As seen from above example and the ACM paper, volatiles are really subtle in C
1413
  and many industrial compilers get them wrong. I prefer the Linux way of handling
1414
  the issue, namely __minimizing__ 'volatile' as far as possible, and depending on
1415
  explicit compiler and cpu _barriers_, and SMP primitives as needed.
1416
b4776d7 by Ahmed S. Darwish at 2010-04-27 1417
* So instead of:
10e19e7 by Ahmed S. Darwish at 2009-12-04 1418
1419
  static uint64_t * volatile msg = NULL;
1420
  while (msg == NULL)
1421
	asm volatile ("pause");
1422
1423
  We use an explicit compiler barrier:
1424
1425
  static uint64_t *msg = NULL;
1426
  while (msg == NULL)
1427
	asm volatile ("pause": : :"memory");
1428
b4776d7 by Ahmed S. Darwish at 2010-04-27 1429
  The compiler barrier will assure "GCC not to keep memory values cached in
1430
  registers across the assembler instruction and not to optimize stores or loads
1431
  to that memory". In other words, it will force gcc to reload any values from
1432
  memory over the busy loop.
1433
1434
* Instead of:
10e19e7 by Ahmed S. Darwish at 2009-12-04 1435
1436
  volatile uint32_t *ioregsel = (uint32_t *)ioapic_base(apic);
1437
  volatile uint32_t *iowin = (uint32_t *)(ioapic_base(apic) + 0x10);
1438
  *ioregsel = reg;
1439
  return *iowin;
1440
b4776d7 by Ahmed S. Darwish at 2010-04-27 1441
  where we don't want to the compiler to re-order the last two 'independent'
1442
  read and write cause this re-ordering will imply different semantics to the
1443
  programmed hardware.
1444
10e19e7 by Ahmed S. Darwish at 2009-12-04 1445
  We use MMIO register accessors, which internally includes compiler barriers
1446
  and any needed volatile qualifiers (See comments at top of include/mmio.h
1447
  for further rational):
1448
1449
  uint32_t *ioregsel = (uint32_t *)ioapic_base(apic);
1450
  uint32_t *iowin = (uint32_t *)(ioapic_base(apic) + 0x10);
1451
  writel(reg, ioregsel);
1452
  return readl(iowin);
1453
1454
* Finally, check attached LKML messages for extra cases and reasonings by Linus
1455
  and Andi Kleen.
3f896f4 by Ahmed S. Darwish at 2009-12-12 1456
b3bbe1d by Ahmed S. Darwish at 2010-11-16 1457
3f896f4 by Ahmed S. Darwish at 2009-12-12 1458
On the vaguely documented x86 16-bit protected mode:
1459
----------------------------------------------------
1460
1461
* References:
1462
  - Rick A. Hohensee, "The well-factored 386", LKML,
1463
    http://lkml.org/lkml/2003/7/28/55, http://www.webcitation.org/5luRetOqF
1464
  - Intel manuals vol3A, chapter 3, 3.4.3 - segment registers
1465
  - Intel Manuals vol3A, chapter9, Processor Management and Initialization,
1466
    9.9.2 - Switching Back to Real-Address Mode
1467
  - Robert R. Collins, "The Segment Descriptor Cache", Dr. Dobb's Journal
1468
    Undocumented Corner, http://www.rcollins.org/ddj/Aug98/Aug98.html,
1469
    http://www.webcitation.org/5lxH9NFGa
1470
1471
* The x86 architecture determins the default operand and address size using
1472
  the D bit in the code segment. If D is set, it indicates 32-bit addresses
1473
  and operands. Otherwise, we're running in 16-bit mode, using 16-bit
1474
  addresses and operands. There's a similar C bit for 64-bit long mode.
1475
1476
* One of the states that really puzzled me a lot was the state occurring in
1477
  the transition from bootup real-mode to 32-bit protected mode. Namely,
1478
  bewteen enabling protected mode (setting PE = 1) and getting executed in
1479
  32-bit by using a 32-bit (D = 1) code segment through a far jump. Setting
1480
  PE and %cs together is not atomic, and there's clearly something in-between.
1481
1482
  An example from our bootsector:
1483
1484
	/* Set the protected mode PE flag */
1485
	movl   %cr0, %eax
1486
	orl    $0x1, %eax
1487
	movl   %eax, %cr0
1488
1489
	jmp    flush_prefetch:
1490
1491
	/* What're the rules covering this state? we're in
1492
	 * pmode but the cpu is still fetching and executing
1493
	 * ops in 16-bits! */
1494
	flush_prefetch:
1495
	movw   $(gdt_ds - gdt), %ax
1496
	movw   %ax, %ds
1497
	movw   %ax, %es
1498
	movw   %ax, %ss
1499
	DATA32 ljmp $0x08, $(reloc - _start + 0x7c00)
1500
1501
* To exhaustively think about this, we need to cover all the permuatations
1502
  of PE (pmode enabled) and the %cs D bit:
1503
1504
  (1)   PE=0    Dbit=0          Canonical (un)Real Mode
1505
  (2)   PE=1    Dbit=0          16-bit protected mode
1506
  (3)   PE=1    Dbit=1          Canonical 32-bit Protected Mode
1507
  (4)   PE=0    Dbit=1          ..
1508
                         _______________
1509
                        |    WAKE-UP    |
1510
                        |canonical rmode|
1511
                        |  PE=0 use16   |
1512
                        |_______________|
1513
                            A        |set PE to 1
1514
                            |        |
1515
                 set PE to 0|        |
1516
                         ___|________V__
1517
                        |     16-bit    |
1518
                        |     pmode     |
1519
                        |   PE=1 use16  |
1520
                        |_______________|
1521
                            A       |far xfer
1522
                            |       | to use32 CS
1523
                far xfer to |       |
1524
                    use16 CS|       |
1525
                         ___|_______V___
1526
                        |     32-bit    |
1527
                        |     pmode     |
1528
                        |   PE=1 use32  |
1529
                        |_______________|
1530
                            A        |set PE to 0
1531
                            |        |
1532
                 set PE to 1|        |
1533
                         ___|________V__
1534
                        |  Forreal Mode |
1535
                        |      Mode     |
1536
                        |   PE=0 use32  |
1537
                        |_______________|
1538
1539
* We can not jump from (1) to (3) directly through a far jump using a D=1 code
1540
  segment, since we'll be in canonical rmode, and %cs will always be interpreted
1541
  as in the segmented 8086 model of (%cs << 4 + offset).
1542
1543
  But how can the cpu move to 32-bit execution, if 32-bit execution is controlled
1544
  by the D bit, but we can't change the D bit due to the 8086 semanticss?
1545
1546
  The answer is to provide a switch. If this switch is on, the cpu will the use
1547
  the 386 semantics of segments (and be able to modify the segment caches). If
1548
  it's clear, the cpu will still use the same segmentation circuites but it won't
1549
  be able to modify most of the segment caches fields like limit, and the D bit,
1550
  (the base field can still be changed within 2-bytes size), due to the
1551
  constraints of the 8086 semantics.
1552
1553
  This 'switch' is the PE (pmode Enabled) bit, which acts __independently__ of
1554
  whether the machine is running on 16 or 32-bit mode (D bit is set or clear)
1555
1556
  So under this 16-bit pmode (PE set, D clear), if we do:
1557
1558
	movw   $0x10, %ax
1559
	movw   %ax, %ds
1560
1561
  we're in fact using %ds as a segment selector, and are also updating the
1562
  segment caches with the values found in the third GDT cell.
1563
1564
  But if we're in canonical 16-bit rmode (PE & D clear), if we does the same:
1565
1566
	movw   $0x10, %ax
1567
	movw   %ax, %ds
1568
1569
  we're in fact using (%ds << 4) as the base field in the segment caches.
1570
1571
* We cannot also jump directly from (3) 32-bit pmode, to cacnonical/bootup rmode
1572
  at (1). One should change the code and data segmenst limit to 64K and the D bit
1573
  to 0 __before__ clearing the PE switch: we can not change any of those segments
1574
  cached fields (except the base) once the PE get cleared.
1575
1576
  Intel's "switching back to Real-Address mode" section also states the same
1577
  order, but without making the reasoning of this order very clear.
1578
1579
  Intel also elaborates that "the segment registers must be loaded with non-null
1580
  segment selectors or the segment registers will be unusable in real-address
1581
  mode." Simply because the segment will be marked in the segment caches as
1582
  invalid (P = 0), and as we learned from above, we won't be able to change any
1583
  of those cached bits after clearing PE. Due to this cached P=0 bit, even the
1584
  usual act in canonical rmode of changing the base within the 2-bytes limit
1585
  won't be possible, rendering the segment as unusable, as this is their only
1586
  effective action in 8086 semantics.
1587
1588
* Finally, the difference between 8086 and 386 segmentation is superficial. If
1589
  the segment was marked as P=1 in the segment caches (valid), 8086 semantics
1590
  just change the base field (limited to 2-bytes size) of the equivalent 386
1591
  segment. They all function using the same registers and simple circuity. Intel
1592
  elaborates: "Note that if the segment registers are not reloaded, execution
1593
  continues using the descriptor attributes loaded during protected mode."
34429b5 by Ahmed S. Darwish at 2010-01-13 1594
1595
1596
Using light debugging techniques:
1597
---------------------------------
1598
1599
* References:
1600
  - Ulrich Drepper, "How to Write Shared Libraries", 2.1 Data Definitions,
1601
    http://www.webcitation.org/5mkmxB9Fx
1602
  - Ian Lance Taylor, "Linkers part 5", http://www.airs.com/blog/archives/42,
1603
    http://www.webcitation.org/5mkmiZ8FN
1604
  - Binutils bug report, "ELF linker loads member of archive for common symbol",
1605
    http://sources.redhat.com/bugzilla/show_bug.cgi?id=1811
1606
    http://www.webcitation.org/5mknnyUe2
1607
  - Minor bug-report at http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42726
1608
1609
* Today while writing page allocator test code, I faced a very strange bug: once
1610
  any of the kernel's text, data, or bss sections exceeds something close to 20
1611
  KBs, the MPtables code panic()s at different and unpredictable locations. The
1612
  failure is something like:
1613
1614
     PANIC: MPTables - 1217939456 CPUs found, only 32 supported
1615
1616
* To give some context, there's a system global called 'nr_cpus' representing
1617
  the number of CPUs in the system. So far, we know the number of CPUs from the
1618
  MP tables Conf entries, where each CPU entry represents a CPU in the system.
1619
  Thus, finding 8 CPU entries means that the machine has 8 usable cores.
1620
1621
  This global var 'nr_cpus' is only increased at one and _only_ one place in
1622
  the system, at the MP tables parsing code:
1623
1624
  /* We call this for each MP CPU entry in the system */
1625
  static void parse_cpu(void *addr)
1626
  {
1627
	struct mpc_cpu *cpu = addr;
1628
1629
	if (!cpu->enabled)
1630
		return;
1631
1632
	if (nr_cpus >= CPUS_MAX) {
1633
		panic("MP - %d CPUs found, only %d supported",
1634
		      nr_cpus, CPUS_MAX);
1635
	}
1636
	...
1637
1638
	++nr_cpus;
1639
  }
1640
1641
* Inspecting the BIOS MP tables more closely, the MP pointer table is at 0xfbb60
1642
  and its checksum and signature are valid. The MP configuration table header was
1643
  also valid and checksummed at 0xfba00.
1644
1645
  The problem, it seems, is that the conf table entries got somehow corrupted
1646
  (thus the wrong number of cpus value), or there's a bug in our parser code.
1647
1648
* Unfortunately, and as usual for such problems, inserting printk()s at any
1649
  location skews the results randomly, making any printk() useless.
1650
1651
  So we try less invasive debugging techniques like writing to a memory area
1652
  using a _single_ inlined asm instruction, and inspecting this memory area
1653
  using an emulator afterwars. This worked and made the problem repeatable!
1654
1655
* First, and to test the parser validity, we choose a memory location that we
1656
  know before hand it's going to be zero; I chose physical 0x500. In parse_cpu
1657
  we add a very small asm instruction as follows:
1658
1659
        /* 8-byte instruction */
1660
        asm volatile("incq (0xffffffff80000500);");
1661
1662
	if (nr_cpus >= CPUS_MAX) {
1663
		panic("MP - %d CPUs found, only %d supported",
1664
		      nr_cpus, CPUS_MAX);
1665
	}
1666
1667
  And thankfully this instruction was light enough, so the system paniced
1668
  at the same point. Our panic() just calls printk() and hlt the system. So
1669
  after the system halts, we move to the qemu console handler and check the
1670
  value at address 0x500. Good for us, we found the value there to be 1, so
1671
  it seems our parser is behaving correctly.
1672
1673
* Now the problem is getting more mysterious. If the parser is behaving
1674
  correctly, and the value at address 0x500 is completely valid, how come
1675
  'nr_cpus' have those very strange values? Remember that we put the 'incl
1676
  0x500' instruction at the same place of '++nr_cpus', so the two addresses
1677
  should have the exact same values or something is _completely_ messed up.
1678
1679
  The 'nr_cpus' variable still yielded the value of "1217939456".
1680
1681
* So what did really corrupt the BSS-stored variable nr_cpus at this _very_
1682
  early stage?
1683
1684
  The very first thing we do in the C part of the kernel is to clear the entire
1685
  BSS section to zero. So assuming the kernel was copied/setup correctly (which
1686
  we're not very sure of yet), the problem probably occurs _between_ clearing
1687
  the BSS section and parsing the MP tables.
1688
1689
  OK, so we'll try to check the value of the 'nr_cpus' variable at different
1690
  places in the kernel between clearing BSS and MP parsing using this 14-byte
1691
  code segment:
1692
1693
        asm volatile ("movl %0, %%eax;"
1694
                      "movl %%eax, (0xffffffff80000500); hlt;"
1695
                      : : "m"(nr_cpus) : "memory");
1696
1697
  And then use the qemu console to read the address 0x500 'xp 0x500'.
1698
1699
* Very important discovery: Adding the above asm fragment _directly_ after
1700
  clearning the BSS section yielded the same corrupt value of "1217939456"
1701
  for the 'nr_cpus' variable! Something is seriously wrong cause 'nr_cpus'
1702
  is clearly in the BSS section:
1703
1704
  $ readelf -a kernel.elf
1705
1706
  Section Headers:
1707
    [Nr] Name              Type             Address           Offset
1708
         Size              EntSize          Flags  Link  Info  Align
1709
    ...
1710
    [ 8] .bss              NOBITS           ffffffff8010f5e0  0010f5e0
1711
         00000000000028c8  0000000000000000  WA       0     0     32
1712
1713
  Symbol table '.symtab' contains 183 entries:
1714
     Num:    Value          Size Type    Bind   Vis      Ndx Name
1715
     ...
1716
     151: ffffffff80111d80     4 OBJECT  GLOBAL DEFAULT    8 nr_cpus
1717
1718
* I think we're getting closer. Notice from above ELF analysis, the BSS
1719
  section starts at 0xffffffff8010f600, and it has a length of 0x28c8:
1720
1721
  This means that from reading the ELF:
1722
1723
     BSS start = 0xffffffff8010f5e0
1724
     BSS end = 0xffffffff8010f5e0 + 0x28c8 = 0xffffffff800111ea8
1725
     'nr_cpus' is at 0xffffffff80111d80
1726
1727
  But very strangely, here are the results of printing __bss_start and
1728
  __bss_end symbols exported by our kernel linker script we wrote since the
1729
  dawn of time:
1730
1731
     BSS start = 0xffffffff8010f5e0
1732
  !! BSS end = 0xffffffff801115e0
1733
     'nr_cpus' is outside of this region!
1734
1735
  As you see, somehow the kernel sees a _trimmed_ BSS section from the
1736
  linker script, and to add insult to injury, 'nr_cpus' is not included
1737
  in that trimmed region.
1738
1739
* Things began to make sense: the kernel obviously has a misguided view about
1740
  its bss section. This is disastrous cause at this level, we're the ones
1741
  responsible of clearing this BSS area before use. Ofcourse, we directly move
1742
  to our linker script BSS section to see what's wrong:
1743
1744
  SECTIONS
1745
  {
1746
          ...
1747
           .bss : {
1748
                   __bss_start = .;
1749
                   *(EXCLUDE_FILE (*head.o *e820.o) .bss)
1750
                   __bss_end = .;
1751
           }
1752
           __kernel_end = .;
1753
          ...
1754
  }
1755
1756
  Things should be fine and everything is at its place, but the corruption still
1757
  exists. So I tried a small modification of:
1758
1759
           .bss : {
1760
                   __bss_start = .;
1761
                   *(EXCLUDE_FILE (*head.o *e820.o) .bss)
1762
           }
1763
           __bss_end = .;
1764
           __kernel_end = .;
1765
1766
  And things worked magically! Ofcourse leaving the issue as this will be immature
1767
  as we didn't understand yet why this trivial change produced this huge effect.
1768
1769
* Trying to debug the issue further, we restore the linker script to its original
1770
  state and let the linker be more verbose by using the -M flag. Inspecting the
1771
  output, a very interesting thing came up when the linker displayed the final
1772
  memory map:
1773
1774
  ('...' denotes trimmed output lines)
1775
1776
  Allocating common symbols				// mm, ..
1777
  Common symbol       size              file
1778
1779
  test_symbol         0x8               mmap.o
1780
  ioapic_descs        0x60              ioapic.o
1781
  nr_ioapics          0x4               ioapic.o
1782
  nr_cpus             0x4               smpboot.o
1783
  mp_irqs             0x700             mptables.o
1784
  nr_mpcirqs          0x4               mptables.o
1785
  cpu_descs           0x100             smpboot.o
1786
1787
  Linker script and memory map
1788
1789
                 0x0000000000100000                . = 0x100000
1790
1791
  .text.head     0x0000000000100000      0x310
1792
  ...
1793
  .bss           0xffffffff8010f5e0     0x28c8 load address 0x000000000010692c
1794
                 0xffffffff8010f5e0                __bss_start = .
1795
  *(EXCLUDE_FILE(*e820.o *head.o) .bss)
1796
  .bss           0xffffffff8010f5e0        0x0 common.o
1797
  .bss           0xffffffff8010f5e0        0x0 main.o
1798
  .bss           0xffffffff801105e0        0x0 i8259.o
1799
  .bss           0xffffffff801105e0        0x0 apic.o
1800
  ...
1801
  .bss           0xffffffff80110610        0x0 lib/string.o
1802
  .bss           0xffffffff80110620      0xfc0 lib/printf.o
1803
                 0xffffffff801115e0                __bss_end = .
1804
  COMMON         0xffffffff801115e0       0x64 ioapic.o
1805
                 0xffffffff801115e0                ioapic_descs // Outside of BSS!
1806
                 0xffffffff80111640                nr_ioapics   // Outside of BSS!
1807
  *fill*         0xffffffff80111644       0x1c 00
1808
  COMMON         0xffffffff80111660      0x704 mptables.o
1809
                 0xffffffff80111660                mp_irqs      // Outside of BSS!
1810
                 0xffffffff80111d60                nr_mpcirqs   // Outside of BSS!
1811
  *fill*         0xffffffff80111d64       0x1c 00
1812
  COMMON         0xffffffff80111d80      0x120 smpboot.o
1813
                 0xffffffff80111d80                nr_cpus      // Outside of BSS!
1814
                 0xffffffff80111da0                cpu_descs    // Outside of BSS!
1815
  COMMON         0xffffffff80111ea0        0x8 mmap.o
1816
                 0xffffffff80111ea0                test_symbol  // Outside of BSS!
1817
                 0xffffffff80111ea8                __kernel_end = .
1818
1819
  So at last we got it, the linker script didn't put variables like nr_cpus,
1820
  nr_ioapics etc in the BSS section, thus they were not cleared in our clear_bss()
1821
  function. As you can see, those symbols are allocated past the __bss_end symbol.
1822
1823
* OK, so we inspect the LD manual for this 'COMMON' section, to find:
1824
1825
  "A special notation is needed for common symbols, because in many object file
1826
  formats common symbols do not have a particular input section. The linker treats
1827
  common symbols as though they are in an input section named ‘COMMON’ ..."
1828
1829
* Unfortunately the documentation was a bit vague, and there was no explanation
1830
  of what "common symbols" were really used for. So we dig deeper to find this
1831
  great explanation from Ulrich's paper on shared libraries:
1832
1833
  "Common variables are widely used in fortran, but they got used in C and C++
1834
  as well to work around mistakes of programmers. Since in the early days people
1835
  used to drop the 'extern' keyword from variable definitions, in the same way
1836
  it is possible to drop it from function declaration, the compiler often had
1837
  multiple definitions of the same variable in different files.
1838
1839
  To help the poor and clueless programmer, the C/C++ compiler normally generates
1840
  common variables for uninitialized definitions such as `int foo;'
1841
1842
  For common variables, there can be more than one definition, and they all get
1843
  unified in one location in the output file .. their values does not need to be
1844
  stored in the ELF file."
1845
1846
* And also from Ulrich, a description of -fno-common which is more accurate than
1847
  gcc's manual:
1848
1849
  "If the programmer uses the compiler command-line option -fno-common, the
1850
  generated code will contain uninitialized variables instead of common variables
1851
  if a vriable definition has no initializer. ..
1852
1853
  The result at run-time is the same as for common variable, no value stored in
1854
  the file. But the representation in the object file is different [BSS section],
1855
  and it allows the linker to find multiple definitions and flag them as errors."
1856
1857
* And finally from Ian Lance Taylor's series on linkers:
1858
1859
  "A section index of SHN_COMMON (0xfff2) indicates a common symbol. Common
1860
  symbols were invented to handle Fortran common blocks, and they are also often
1861
  used for uninitialized global variables in C.
1862
1863
  A common symbol has unusual semantics. Common symbols have a value of zero, but
1864
  set the size field to the desired size. If one object file has a common symbol
1865
  and another has a definition, the common symbol is treated as an undefined
1866
  reference.
1867
1868
  If there is no definition for a common symbol, the program linker acts as though
1869
  it saw a definition initialized to zero of the appropriate size. Two object
1870
  files may have common symbols of different sizes, in which case the program
1871
  linker will use the largest size.
1872
1873
  Implementing common symbol semantics across shared libraries is a touchy
1874
  subject, somewhat helped by the recent introduction of a type for common symbols
1875
  as well as a special section index (see the discussion of symbol types below)."
1876
1877
  (Also check nm(1) on "C"/COMMON symbols, and --warn-common on ld(1))
1878
1879
* And by now everything is super-clear, and the fix is only adding the COMMON
1880
  symbols within the output object BSS boundaries:
1881
1882
	.bss : {
1883
		__bss_start = .;
1884
		*(EXCLUDE_FILE (*head.o *e820.o) .bss)
1885
		*(EXCLUDE_FILE (*head.o *e820.o) COMMON)
1886
		__bss_end = .;
1887
	}
1888
1889
  What a bug this was.
fea07b7 by Ahmed S. Darwish at 2010-01-21 1890
1891
9d5e00e by Ahmed S. Darwish at 2010-02-02 1892
Musings on Memory Allocation:
fea07b7 by Ahmed S. Darwish at 2010-01-21 1893
-----------------------------
1894
1895
* References:
1896
  - Uresh Vahalia, "Unix Internals: the new frontiers", Chapter 12 - Kernel
1897
    Memory Allocation
1898
  - Kirk McKusick, J. Karles, "Design of a General Purpose Memory Allocator
1899
    for the 4.3BSD UNIX Kernel", Berkely, Usenix 1988 conference
1900
  - Daniel Bovet, Marco Cesati, "Understanding the Linux Kernel", Chapter 8 -
1901
    Memory Management, Section 8.1 - Page Frame Management
1902
  - Kirk McKusick, George Neville, "The Design and Implementation of the
1903
    FreeBSD Operating System", Section 5.3 - Kernel Memory Management
1904
  - Maurice Bach, "The Design of the UNIX Operating System", Chapter 9 -
1905
    Memory Management Policies
1906
1907
* In all Unices, memory is page-based, and the page is the smallest unit of
1908
  allocation, protection, address translation, and mapping. X86-64 does not
1909
  even provide segmentation.
1910
1911
* The very first step for writing a MM subsystem is writing the page-level
1912
  allocator. Due to virtual memory, the page-allocator doesn't need to find
1913
  contiguous free pages to satisfy its request: non-contigous frames mapped as
1914
  contiguous addresses shall be enough.
1915
1916
  The kernel often needs chunks of memory much less of size than a page, there's
1917
  where the 'kernel memory allocator' job comes into play for allocating this
1918
  kind of memory within the pages. It depends internally on the page allocator.
1919
1920
* Early Unix like SVR2 avoided dynamic memory altogether, and used fixed-size
1921
  tables for vnodes and proc structures. On the other hand, when memory was
1922
  required for holding temporary pathnames, they borrowed buffers from the block
1923
  buffer cache (Check chapter 3 of Maurice's book).
1924
1925
  Another case was the 4.2BSD, which in McKusick wordings had "at least ten
1926
  different memory allocators. Some of them handle large blocks, some handle
1927
  small chained data structures, and others include information to describe
1928
  I/O operations. .. Multiple allocators reduce the effeciency with which memory
1929
  is used. The kernel ends up with many different free lists of memory instead
1930
  of a single freelist from which all allocation can be drawn"
1931
1932
* Currently, our page allocator is interfaced through two simple functions:
1933
  get_free_page() which returns a page frame descriptor, and free_page(page)
1934
  which frees given page frame. The naming is taken from Linux, which are the
1935
  historic names Torvalds used in his 0.01 relase and are still in-use today.
1936
1937
* Now that we've finished writing our page allocator, we focus on kmalloc()-like
1938
  kernel memory allocators. A memory allocator utilization factor =
1939
1940
	          total memory 'requested' by the kmalloc() caller
1941
	-------------------------------------------------------------------
1942
	  total memory 'required' by allocator to satisfy those requests
1943
1944
  As said by McKusick, Good memory utilization in the kernel is more important
1945
  than in user processes. Pages in the process address space that are in 'required'
1946
  but not in the 'requested' pool need not tie up physical memory. But the kernel
1947
  is not typically paged out, so any not-well-utilized kernel-allocated memory
1948
  pages are held forever, and can not be used for other purposes.
1949
1950
  It's typical of user-space memory allocators not to release unused memory in
1951
  the required pool to avoid the syscall overhead. Kernel allocators are required
1952
  to release such memory as quick possible though, due to above point of holding
1953
  kernel memory pages forever, and because the kernal can  quickly manipulate its
1954
  own page maps.
1955
1956
* Not to degrade performace, memory returned by the allocator must be aligned
1957
  according to the machine requirements. I guess for our x86-64 case, the safest
1958
  haven is 8-byte alignemnt.
1959
1960
  Speaking of speed, McKusick suggests that the allocator need only to be fast for
1961
  small pieces of memory. Inspecting a time-shared machine at Berkely, he noticed
1962
  that "large allocations occur infrequently, and are typically for long-lived
1963
  objects such as buffers to hold the superblock for a mounted file system".
1964
1965
* For a nice survey of kernel memory allocators, check referenced 'Unix Internals'
1966
  chapter. The 4.2BSD allocator, is covered under 'Simple Power-of-Two Free Lists',
1967
  and the 4.3BSD allocator, explained in the McKusick paper, is clearly described
1968
  under 'The McKusick-Karles Allocator' section; Vahalia is a great writer.
1969
1970
* Before jumping to sketching our own allocator, we finally survey some of the
1971
  allocators used in modern unix kernels:
1972
1973
  In Linux, the page allocator provides contiguous pages using a binary buddy
1974
  system instead of depending on hardware circuity to emulate this continuity.
1975
  This, in Daniel Bovet words, offers the big advantage of leaving the kernel
1976
  paging tables unchanged, which leads to less average memory access times cause
1977
  such modifications lets the CPU flush its Translation-Lookaside Buffers.
1978
1979
  NOTE! The smallest unit returned by this buddy page allocatoris one PAGE. The
1980
  groups of 2, 4, 8, 16, .., 1024 are a physically contigous set of page frames.
1981
  This is not a kernel memory allocator (KMA) like the McKusick-Karles one; this
1982
  is a page allocator that uses the buddy algorithm for serving _contigous pages_
1983
  instead.
1984
1985
  The page allocator main structure used is mem_map, which holds all the page
1986
  descriptors of the specific memory zone this page allocator belongs, like
1987
  ZONE_DMA, HIGHMEM, etc. Notice that mem_map is also the historic naming Linus
1988
  used for his page allocator in the 0.01 release. That page allocator was done
1989
  using a bitmap, which as we can see, fits the 'mem_map' naming better.
1990
1991
  Above this page allocator in Linux exists the regular sl[au]b object-caching
1992
  allocator, which requests pages from the page allocator upon need.
1993
1994
  In FreeBSD, the kernel uses the 4.3BSD allocator described in the McKusick
1995
  paper for small allocations. For large fixed-size allocations, it uses the
1996
  Zone/slab allocator instead.
1997
1998
* In our kernel, we assign high weights to simplicity without sacrificying much
1999
  speed and memory resources.
2000
2001
  Our page allocator's main structure is the Page Frame Descriptor Table
2002
  (pfdtable; Unix SVR2 terminology), which is an array/table that contains all
2003
  the system's available (non e820-reserved; above kernel memory area) physical
2004
  pages descriptors.
2005
2006
  Free pages descriptors are connected in place by a linked list pfdfree_head
2007
  as the list head. Allocation and Reclamation is a simple O(1) matter of adding
2008
  to or removing from the pfdfree_head list.
2009
2010
  The kernel memory allocator is the 4.3BSD McKusick-Karles allocator. I chose
2011
  it cause of its beautiful simplicity, speed, and space effeciency for small
2012
  allocations. We shall develop a slab-like caching allocator for big objects
2013
  once the need arise; we're obviously influenced by FreeBSD in this area.
15681eb by Ahmed S. Darwish at 2010-02-18 2014
2015
2016
On Virtual Memory:
2017
------------------
2018
2019
* References:
2020
  - Uresh Vahalia, "Unix Internals", Chapter 13 - Virtual Memory
2021
  - Peter J. Denning, 'Virtual Memory', Princeton, ACM Computing Surveys
2022
  - Maurice Bach, "The Design of the UNIX Operating System", Chapter 9 -
2023
    Memory Management Policies
2024
  - Jonathan Corbet, "Four-level page tables", LWN.net,
2025
    http://lwn.net/Articles/106177/, http://www.webcitation.org/5ncgFSpzm
2026
2027
* First, please check our detailed notes on the great Dennings's virtual memory
2028
  paper; the paper covers a huge number of now very-well-established concepts.
2029
  Below paragraphs assumes familiarity with the concepts covered in that paper.
2030
2031
* To cause considerable reduction in the amount of virtual-to-physical mapping
2032
  information that needs to be stored, methods like segmentation and paging
2033
  were invented. In the words of Denning, "each method groups information into
2034
  blocks, a block being a set of contiguous addresses in address space. The
2035
  entries in the mapping table will refer now to blocks, which are _far_ less
2036
  numerous than addresses in a process address space".
2037
2038
  In other words, instead of storing mapping information for each address, we
2039
  store the mapping info fore each equally-sized _page_ or for each _segment_.
2040
2041
* In paging, even while storing mapping information for each page instead of
2042
  for each address, the page table can consume a huge area of memory. For
2043
  example, in a fictional 32-bit machine with a page size of 1KB, and an
2044
  address space of 2GB, we need: 2^31 / 2^10 = 2^21 entries. For an entry size
2045
  of 32-bit, 4-bytes: page table size = 2^21 * 2^2 = 2^23 = 2^3 * 2^20 = 8MB.
06f422f by Ahmed S. Darwish at 2010-07-27 2046
  So we'll need a _contiguous_ 8MB table for _each_ system process address space
2047
  which is definitely not tolerable.
15681eb by Ahmed S. Darwish at 2010-02-18 2048
2049
* Analyzing the problem, we see that the root cause is having mapping information
2050
  for non-used pages; a process rarely use its entire address space anyway. It's
2051
  also common to have a virtual address-space which is much larger than machine's
2052
  available physical memory. To solve this problem we either:
2053
2054
  - have segmented page tables. Each segment of the process has its own page table
2055
  which is just large enough to hold the valid address range for that segment.
2056
2057
  - second is having the page table itself paged, which means an additional higher
2058
  level page table is used to map the lower-level page table. With such a multi-
2059
  teired hierarchy, we need to allocate only those pages of the lower-level table
2060
  that map used/valid addresses of the process. This is the common approach.
2061
2062
2063
On Partitionting the 64-bit kernel address space:
2064
-------------------------------------------------
2065
2066
* References:
2067
  - Andi Kleen, "Porting Linux to x86-64", SuSe Labs
2068
  - Michael Matz, Hubicka, Andrea Jaeger, Mark Mitchell, "System V Application
2069
    Binary Interface, AMD64 processor supplement", SuSe Labs, Codesourcey
2070
  - Linux Kernel linux-2.6/Documentation/x86/x86_64/mm.txt, kernel.org
2071
  - Our kernel comments at head.S and include/paging.h
2072
  - MSDN Library, Win32 and COM development, "64-bit Virtual Address Space",
2073
    http://msdn.microsoft.com/en-us/library/aa384271(VS.85).aspx
2074
    http://www.webcitation.org/5nckv6YBf
2075
  - Intel Manuals volume 2A, A-M Instructions Reference
2076
2077
* Partitioning the address space of a long mode kernel is an interesting challenge.
2078
  There are so many tradeoffs to be taken care of. First, let's remember that we
2079
  have full 64-bit space. This has obvious potentials, but also some drawbacks that
2080
  need be taken care of.
2081
2082
* The first drawback of using a full 64-bit address space is code size. Imagine when
2083
  every far jump, global data access, and function call uses a full 64-bit address.
06f422f by Ahmed S. Darwish at 2010-07-27 2084
  The difference is outlied in below pure x86-32 app:
15681eb by Ahmed S. Darwish at 2010-02-18 2085
2086
	/* File1: sum.c */
2087
	int global;
2088
	int sum() {
2089
		return 0;
2090
	}
2091
2092
	/* File2: app.c */
2093
	extern int global;
06f422f by Ahmed S. Darwish at 2010-07-27 2094
	int x, y;
15681eb by Ahmed S. Darwish at 2010-02-18 2095
	int main() {
2096
		global = 0xdead;
2097
		x = 0xceeb;
2098
		return sum();
2099
	}
2100
2101
	# -m32: 32-bit compilation and linkage
2102
	$ gcc -O3 -m32 sum.c app.c -o app
2103
2104
  which generates the very small code below:
2105
2106
	$ objdump -D app
2107
	0804956c <x>:	/* 4-byte address */
2108
	 804956c:       00 00                   add    %al,(%eax)
2109
         ...
2110
	08049570 <y>:	/* 4-byte address */
2111
	 8049570:       00 00                   add    %al,(%eax)
2112
2113
	08048370 <main>:
06f422f by Ahmed S. Darwish at 2010-07-27 2114
	/* 10-bytes in total
2115
	 * 4-byte absolute address dereference */
2116
	8048376:       c7 05 68 95 04 08 ad     movl   $0xdead,0x8049568
2117
	804837d:       de 00 00
2118
2119
	/* 10-bytes in total
2120
	 * 4-byte absolute address dereference */
2121
	8048380:       c7 05 6c 95 04 08 eb     movl   $0xceeb,0x804956c
2122
	8048387:       ce 00 00
2123
2124
	/* Only 5 bytes in total
2125
	 * 4-byte relative offset */
2126
	8048394:       e8 c7 ff ff ff           call   8048360 <sum>
2127
	...
15681eb by Ahmed S. Darwish at 2010-02-18 2128
2129
  Now imagine if above app used full 64-bit virtual addresses, each 4-byte address
2130
  dereference above would have consumed full 8-bytes; even some ops would consume
2131
  140% extra size. Inspect below results:
2132
2133
	$ gcc -m64 -mcmodel=large -c sum.c -o sum.o
2134
	$ gcc -m64 -mcmodel=large -c app.c -o app.o
2135
	$ ld  -Ttext 0xffff800000000000 app.o sum.o -o app
06f422f by Ahmed S. Darwish at 2010-07-27 2136
        $ objdump -D app | less
15681eb by Ahmed S. Darwish at 2010-02-18 2137
2138
	ffff8000002000cc <x>:   /* .. 8-byte address */
2139
	ffff8000002000cc:       00 00                   add    %al,(%rax)
2140
         ...
2141
	ffff8000002000d0 <y>:   /* .. 8-byte address */
2142
         ...
2143
2144
	ffff800000000024 <main>:
2145
06f422f by Ahmed S. Darwish at 2010-07-27 2146
	/* 8-bytes address, 15-bytes ops size in total. Compare
2147
	 * with the 10-byte 32-bit equivalent ops (50% increase) */
2148
	ffff800000000028:       48 b8 c8 00 20 00 00    mov    $0xffff8000002000c8,%rax
2149
	ffff80000000002f:       80 ff ff
2150
	ffff800000000032:       c7 00 ad de 00 00       movl   $0xdead,(%rax)
2151
2152
	/* 8-bytes address, 15-bytes ops size in total. Compare
2153
	 * with the 10-byte 32-bit equivalent ops (50% increase) */
2154
	ffff800000000038:       48 b8 cc 00 20 00 00    mov    $0xffff8000002000cc,%rax
2155
	ffff80000000003f:       80 ff ff
2156
	ffff800000000042:       c7 00 eb ce 00 00       movl   $0xceeb,(%rax)
2157
	...
2158
2159
	/* 8-byte address, *12-bytes* ops size in total. Compare
2160
	 * with the 5-byte 32-bit equivalent ops (140% increase!) */
2161
	ffff80000000005d:       48 ba 00 00 00 00 00    mov    $0xffff800000000000,%rdx
2162
	ffff800000000064:       80 ff ff
2163
	ffff800000000067:       ff d2                   callq  *%rdx
15681eb by Ahmed S. Darwish at 2010-02-18 2164
2165
* As you can see from above, naively using the 64-bit address space can lead to
2166
  diastrous effects on code size, and thus, the cache footprint. To solve this, some
2167
  help from both the kernel and the compiler is needed.
2168
2169
  First, let's focus on user-space. To let user-space avoid this large code-size hit,
2170
  it's common to assign to it the 0 -> 4GB+ virtual space, and let the compiler
2171
  generate 32-bit addresses for code and global data. This is for example, how the Linux
2172
  ecosystem does it. The kernel assigns below virtual addresses to user-space (for each
2173
  process):
2174
2175
	0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
2176
	hole caused by [48:63] sign extension
2177
2178
  And let gcc emits 32-bit addressing for code and global data using the -mcmodel=small
2179
  parameter, which is the default memory model. %rip-relative addressing is also used
06f422f by Ahmed S. Darwish at 2010-07-27 2180
  by default at this model for dereferencing object globals; it already consumes same
2181
  amount of opcode size relative to x86-32's in-operand 32-bit absoulte dereference.
15681eb by Ahmed S. Darwish at 2010-02-18 2182
2183
* Now how are we going to produce similar effect in kernel-space code when the
2184
  0 -> 4G+ space has been already reserved by user-space?
2185
2186
  Thanks to the elegance of the 2's complement model, the common case is to use
2187
  negative 32-bit addresses for kernel linking. This leads to negative 32-bit
2188
  discplacement for kernel ops. We then let the CPU implcitly (automatically)
2189
  sign-extend those discplacements, leading us to use the top 64-bit address space,
2190
  but with the size-effeciency of 32-bit addresses!
2191
2192
  An example is overdue. Imagine linking our kernel with the 0xffffffff80000000
2193
  as base. %rax points to a physical address in our text or data area. To let it
2194
  be virtual, we add the base to it. Mainly we want to do:
2195
2196
	%rax = %rax + 0xffffffff80000000
2197
2198
  Now, thanks to the sign-extension trick, the above instruction can be size-
2199
  -effeciently encoded as if the displacement was 32-bit as follows:
2200
2201
	ffffffff8010b347: 48 05 00 00 00 80       add    $0xffffffff80000000,%rax
2202
2203
  As you can see, only 4-bytes are used for displacement (0x80000000). The REX.W
2204
  prefix (0x48) with the x86 ADD opcode (05) will let the x86-64 cpu automatically
2205
  sign extend the 32-bit immediate 0x80000000 to $0xffffffff80000000 and achieve
2206
  our goals.
2207
2208
* Here's a more complete explanation from our kernel comments:
2209
2210
  Using -2GB as a base empowers the compiler to directly encode symbolic
2211
  references as 32-bit values, and let the CPU implicitly sign-extended those to
2212
  64-bits due to the REX.W prefix. Thus, we use the top 64-bit address space, but
2213
  with 32-bit size effeciency! Check below example:
2214
2215
  Passing global strings to memcpy:
2216
2217
   48 c7 c6 80 24 8b 80    mov    $0xffffffff808b2480,%rsi
2218
   48 c7 c7 60 d8 10 80    mov    $0xffffffff8010d860,%rdi
2219
   e8 f4 39 00 00          callq  ffffffff8010bb00 <memcpy>
2220
2221
  To avoid the size-overhead of 64-bit offsets in function calls, we limit the
2222
  kernel text to a 32-bit virtual area. Thus, an x86-64 callq will only consume a
2223
  4-byte offset: relative space between next op and the destination label.
2224
06f422f by Ahmed S. Darwish at 2010-07-27 2225
  To dereference kernel globals (instead of just passing their addresses), we'll
2226
  use %rip-relative addressing within + or - 2GB limit, avoiding the size-overhead
2227
  of referencing a full 8-byte address.
15681eb by Ahmed S. Darwish at 2010-02-18 2228
2229
  All of this was in fact designed by the SysV AMD64 ABI developers, and is called
2230
  the 'kernel code model'. We use this code  model, and let userspace use the
2231
  'small' or the 'medium' one. It's interesting to note that NT userspace defaults
2232
  to the medium model, while most UNIX implementations default to the 'small' one.
2233
2234
* To get a state of the size difference, check the difference of function call
2235
  size overhead between the 'large' and the 'kernel' code models:
2236
2237
  $ objdump -j .text -D kernel.elf | grep call		# -mcmodel=kernel
2238
2239
  ffffffff801090c8:       e8 e3 45 00 00          callq  ffffffff8010d6b0 <vsnprintf>
2240
  ffffffff801090df:       e8 ec 48 00 00          callq  ffffffff8010d9d0 <printk>
2241
  ffffffff80109105:       e8 86 41 00 00          callq  ffffffff8010d290 <memset>
2242
  ffffffff8010b97b:       e8 70 f9 ff ff          callq  ffffffff8010b2f0 <spin_lock>
2243
  ffffffff8010ba61:       e8 6a ff ff ff          callq  ffffffff8010b9d0 <get_free_page>
2244
  ffffffff8010bdd8:       e8 83 fc ff ff          callq  ffffffff8010ba60 <get_zeroed_page>
2245
  ffffffff8010c749:       e8 a2 f8 ff ff          callq  ffffffff8010bff0 <__kmalloc>
2246
  ...
2247
2248
  $ objdump -j .text -D kernel.elf | grep call		# -mcmodel=large
2249
2250
  /* Notice the ~%140 increase in size per call */
2251
  ffffffff801090cb:       48 b8 d0 e6 10 80 ff    mov    $0xffffffff8010e6d0,%rax
2252
  ffffffff801090d2:       ff ff ff
2253
  ffffffff801090d5:       ff d0                   callq  *%rax
2254
  ffffffff801091a3:       48 bb 00 ea 10 80 ff    mov    $0xffffffff8010ea00,%rbx
2255
  ffffffff801091aa:       ff ff ff
2256
  ffffffff801091b9:       ff d3                   callq  *%rbx
2257
  ...
2258
2259
  So, while admittedly using -mcmodel=large makes things a bit easier for the 64-bit
06f422f by Ahmed S. Darwish at 2010-07-27 2260
  kernel developer (as it gives them the freedom of basing the kernel on any desired
2261
  virtual address), it induces too much size overhead that it's worth the effort
2262
  avoiding it.
2263
2264
  For example, since Cute uses -mcmodel=kernel, we have two mappings in our system,
2265
  one for kernel text and data (-2GB), and _another_ mapping for the entire physical
2266
  memory discovered at system boot. Extra care was needed while developing this in
2267
  an already very complicated subsystem: memory management.
2268
2269
  This is also the same reason  why Linux's amd64 (unlike ia32) __PAGE_OFFSET is
2270
  different from the kernel linkage base at __START_KERNEL_MAP = -2GB.
15681eb by Ahmed S. Darwish at 2010-02-18 2271
2272
* So, keeping above points in mind, we'll use below strategy for dividing our kernel
2273
  space addresses:
2274
2275
  0x0000000000000000 -> 0x00007fffffffffff	userspace mappings
2276
  0x0000800000000000 -> 0x0000ffffffffffff	not canoncial; cannot be used
2277
  0xffff800000000000 -> 0xffffc00000000000	maps for all physical memory (64TB max)
2278
  0xffffffff80000000 -> 0xffffffffc0000000	kernel text and data map
9a0481a by Ahmed S. Darwish at 2010-02-24 2279
2280
2281
* Debugging -O3 triggered problems:
2282
-----------------------------------
2283
2284
* While adding a good number of GCC warning request flags to detect bugs as early
2285
  as possible, I enabled the -O3 switch only to discover that it triggered a good
2286
  number of hidden kernel bugs.
2287
2288
  First bug led to several exceptions. Upon investigating the CPU's reported
2289
  faulting instruction, I found its an SSE x86 op. SSE ops require special %cr0
2290
  and %cr4 setup that we're not ready for.
2291
2292
  The solution was simply instructing GCC to avoid emitting SSE ops using its
2293
  -mno-see{2,3,4} and -mno-3dnow control flags.
2294
2295
* Second bug was way uglier. It was only triggered at -O3 and with Kmalloc test
2296
  code (and only kmalloc test code, not any of the rest of the tests) enabled.
2297
2298
  The faults were a bit random, from continuous triple faults to rubbish on the
2299
  screen by printk(). This also means that printk is not available for debugging.
2300
2301
  Switching to -O2 mute all those faults ..
2302
2303
* An important debugging tool when printf is not available is inspecting the final
2304
  kernel ELF. Having the ELF at hand helps us at:
2305
2306
  - directly knowing the exact addresses of all global symbols
2307
2308
  - translate any %rip address to the exact place in the executable. This aids in
2309
    knowing things like whether a core had deadlock on a spinlock (%rip inifinitely
2310
    tied at the spin loop), the exact x86 op of a faulting %rip address, etc
2311
2312
* Applying above methods, we inspect the global vga_buffer array. This array get
2313
  its contents directly copied to the vga card, so we can have a nice view about
2314
  the source of screen corruption:
2315
2316
  $ readelf --syms kernel.elf | grep vga_buffer
2317
2318
  114: ffffffff8010eb60  4000 OBJECT  LOCAL  DEFAULT    6 vga_buffer
2319
2320
  Halting the CPU and inspecting physical address 0x10eb60 contents, we find that:
2321
  At -O2, well-formatted vga text: 0f 75 0f 43 0f 64 0f 74 - 'Cute'
2322
  At -O3, it's rubbish: 9f 19 9e fa 9f 9f ae 80 ...
2323
2324
  Further inspecting the printk() code, I found there's no apparent problem in
2325
  the code itself (or its generated assembly). The printk() code was mainly passed
2326
  rubbish strings, instead of the expected valid ones.
2327
2328
* Inspecting the issue further, I've put a global string:
2329
2330
  static char string[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
2331
2332
  $ readelf --syms kernel.elf | grep string
2333
2334
  114: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS string.c
2335
  118: ffffffff8010e680    37 OBJECT  LOCAL  DEFAULT    7 string
2336
2337
  Inspecting those mem locations at -O2, yielded the expected byte values:
2338
  41 42 43 44 45 46 47 48 ..
2339
2340
  At -O3, however, below values were found:
2341
  10 80 48 89 f2 48 03 53 08 48 01 eb
2342
2343
* Checking the unexpected bytes sequence above, I smelled x86 opcodes, which
2344
  they truly were! They were from the disassembly:
2345
2346
  ffffffff80108140 <e820_init>:
2347
   ...
2348
   ffffffff801081b7:       48 c7 c7 18 e3 10 80    mov    $0xffffffff8010e318,%rdi
2349
   ffffffff801081be:       48 89 f2                mov    %rsi,%rdx
2350
   ffffffff801081c1:       48 03 53 08             add    0x8(%rbx),%rdx
2351
   ffffffff801081c5:       48 01 eb                add    %rbp,%rbx
2352
   ...
2353
2354
  So we got a lead, but it's just an indication of something utterly and
2355
  completely messed-up.
2356
2357
* Trying to inspect the issue further, I halted the CPU and dumped first 10MB
2358
  of RAM before the crash. There were some really surprising results.
2359
2360
  Searching for some of the strings we send to printk() led to nothing, nada,
2361
  nil. The strings we send, they must be there at the kernel binary in ASCII
2362
  form, aren't simply there. Instead most of them are replaced by x86 ops.
2363
2364
* Thinking about the situation, a change to kernel binary at such scale can
2365
  mostly happen by a serious bug at bootstrap code.
2366
2367
  Checking the first 3 lines of our bootstrap disassembly, where it moves
2368
  kernel code from bootstrap BSS to its expected place, made everything
2369
  clear:
2370
2371
  kern/kernel.elf:     file format elf64-x86-64
2372
2373
  Disassembly of section .text.head:
2374
2375
  0000000000100000 <startup_32>:
2376
  100000:       be a0 03 10 00          mov    $0x1003a0,%esi
2377
  100005:       bf 04 80 10 00          mov    $0x108004,%edi <-- [1]
2378
  10000a:       b9 00 00 02 00          mov    $0x20000,%ecx
2379
2380
  [1] As you can see, if the kernel image we're copying away from bootstrap
2381
  BSS to its right place exceeded a size of 0x8000 bytes, the part of the
2382
  image > 0x8000 will be overriden, which leads to the mess we were in.
2383
2384
  As in C's memmove(), we now avoid such corruption by copying kernel code
2385
  to a safe temporary buffer first, then copy it again to the final
2386
  overlapping place.
2387
2388
  (Sometimes reading disassembly makes you catch bugs quicker than reading
2389
  source code. Although in this case the source was really in assembly, when
2390
  we saw the _numbers_ instead of just the symbols exported by the linker
2391
  script, we caught the bug rather instantaneously)
2392
2393
* Important lessons:
2394
2395
  - Take extreme care of overlaps while coding x86 'rep movsb' and friends.
2396
    It's good to dynamically assure that:
2397
4f55712 by Ahmed S. Darwish at 2010-11-16 2398
    + src < dst, AND dst >= (src + len)		.. i.e., bad overlap
9a0481a by Ahmed S. Darwish at 2010-02-24 2399
    + OR src >= dst				.. i.e., safe overlap
2400
2401
    Any other condition means direct corruption.
2402
2403
    The 'safe' overlap is only so if we are using 'movsb' instead of movsl
4f55712 by Ahmed S. Darwish at 2010-11-16 2404
    or movsq. Remember that this 'safe' overlap corrupts the source, but
2405
    keeps the destination data valid.
2406
2407
  - To be safe against such kind of overlaps in C parts of the kernel,
2408
    we've added below memcpy() asserts:
2409
2410
    if (src >  dst && (src - dst) < 8) OR
2411
       (src <= dst && (dst - src) < len):
2412
         panic()						[1]
2413
2414
    which can be reduced by logic (check Appendix) to:
2415
2416
    if (dst + 8 > src) AND (src + len > dst):
2417
        panic();						[2]
2418
2419
    The first check is used cause of 'movsq' 8-byte block copy. We've
2420
    intentionally transformed substraction in the '>' and '<' expressions
2421
    to addition cause of the _unsigned_ pointer values.
2422
2423
    ----------------------------------------------------------------------
2424
    |					Appendix
2425
    |
2426
    | It's easy to prove equivalence of the 'if' conditions marked
2427
    | with [1] and [2] above; let:
2428
    |
2429
    | A eq 'src > dst'		==>   ~A eq 'src <= dst'
2430
    | B eq '(src - dst) < 8'
2431
    | C eq '(dst - src) < len)'
2432
    | D eq '(A and B) or (~A and C)'
2433
    | src >= 0, dst >= 0, len >= 0
2434
    |
2435
    | We want to prove that:
2436
    | D   is fully equivalent to   (B and C)
2437
    |
2438
    | From the premises, we can see that:
2439
    | if (B and C) is true, then
2440
    | '(A and B) or (~A and C)' is true, then
2441
    | D is true						   (I)
2442
    |
2443
    | From the premises, we can also see that:
2444
    | if A is true, then C is true			   (II)
2445
    | if ~A is true, then B is true			   (III)
2446
    |
2447
    | From premises and (II):
2448
    | if (A and B) is true, then (A and B and C) is true   (IV)
2449
    |
2450
    | From premises and (III):
2451
    | if (~A and C) is true, then (~A and C and B) is true (V)
2452
    |
2453
    | From premises, (IV), and (V):
2454
    | if D is true, then (B and C) is true		   (VI)
2455
    |
2456
    | From (I) and (VI):
2457
    | D   is fully equivalent to   (B and C)		   Q.E.D.
2458
    ----------------------------------------------------------------------
5cc7a1d by Ahmed S. Darwish at 2010-03-20 2459
2460
* Global lessons:
2461
2462
  - Disassembly is a great tool for debugging kernel code, since a lot of things
2463
    taken for granted at user-space is not guaranteed here anymore. Abstraction
2464
    cut both ways indeed.
2465
2466
  - It's _very important_ to deeply test each code path, no matter how trivial
2467
    it seems (basic stuff, but sometimes we forget the basics).
2468
2469
  - Memory dumping of critical kernel regions is really valuable; remember to use
2470
    it as needed.
2471
2472
  - It's important to test the code with different environment combinations:
2473
    different optimization levels: GCC's -O {0, 1, 2, 3, s}
2474
    different memory sizes: Qemu's -m {2, 3, .., 500, .., 2000, ..}
2475
    different number of cores: -smp {1, 2, 3, ..}
2476
    different CPU speeds: Bochs's ips= parameter
2477
    different BIOSes: different versions of Bochs Bios, different PC forms, ..
2478
b3bbe1d by Ahmed S. Darwish at 2010-11-16 2479
5cc7a1d by Ahmed S. Darwish at 2010-03-20 2480
* The Zone:
2481
-----------
2482
2483
  (Or the bug that took me around 6 days to solve)
2484
2485
* While writing the context switch code, I observed a very ugly bug. Somehow the
2486
  montonic PIT interrupts that got triggered every 1 millisecond corrupts kernel
2487
  state.
2488
2489
  Symptoms: random exceptions and random assert() failures allover the place.
2490
2491
* So I first thought there must be something wrong in the timer handler itself.
2492
  So what if the PIT handler did exactly nothing but acknowledging the local APIC?
2493
  (It must acknowledge the APIC so future IRQs get handled)
2494
2495
  The handler code was successfully reduced to:
2496
2497
	push   %rax
2498
	movq   $(VIRTUAL(APIC_PHBASE) + APIC_EOI), %rax
2499
	movl   $0,(%rax)
2500
	pop    %rax
2501
	iretq
2502
2503
  But still the same buggy behavior prevailed.
2504
2505
* After reducing the handler code to a number of bytes, with no positive results,
2506
  I thought let's check the IDT validity. Inspecting the IDT cells date yielded
2507
  nothing interesting.
2508
2509
  Also checking the %rflags, bit by bit, led to nothing interesting.
2510
2511
  So I thought maybe we should setup the IDT cells as 'trap' instead of 'interrupt'
2512
  gates, but this is ofcourse wrong. Trap gates leaves the IF flag set while
2513
  servicing the interrupt, which means handlers can get interrupted indefinitely,
2514
  and our stack can get overflowed in a matter of milliseconds.
2515
2516
* OK, so I thought instead of focusing on the IRQ handlers and the IDT, lets focus
2517
  on the _code_ that get interrupted itself. The failed code was usually the
2518
  general memory allocator kmalloc() test cases.
2519
2520
  After inspecting the interrupted code, I found nothing interesting.
2521
2522
* To let the problem domain be smaller, and after minimizing the handler code, I
2523
  thought lets also minimize the interrupted code and see. Having a bunch of
2524
  printk()s in a tight loop, with the timer handler on, triggered the problem!
2525
2526
  Here the second day finishes. Either something is really messed up, or I'm a
2527
  shitty developer that can't even write a stable printk().
2528
2529
  Furthermore, running any testcases after enabling the interrupts make them
2530
  fail miserably. From all the debugging I've done in the last 6 months, I
2531
  began to feel the instincts of such 'deep' kernel bugs: this is definitely
2532
  one of them.
2533
2534
* So, I began to feel hopeless. I'll just inspect the x86 ops that triggered
2535
  the exceptions, hoping to find any pattern in the code that trigger the
2536
  bug. Definitely a pattern was clearly there; some examples of the failing
2537
  x86 ops:
2538
2539
	mov    %rax, 0x0(%rbp)
2540
	movzbl (%rax), %eax
2541
	movl   $0x2, 0x8(%rax)
2542
	movl   $0x1, 0x4(%rax)
2543
	...
2544
2545
  And the most triggered, probably the _only_ triggered, exception was the page
2546
  fault #PF exception, vector 14. The faulting (unmapped) address was things
2547
  like %cr2=0x0, %cr2=0x8, %cr2=0x200020, etc.
2548
2549
  From above pattern, I thought: who's really messing up with our registers,
2550
  getting them loaded with weird values like 0, 0x8, etc?
2551
2552
* So, we're in the verge of the third day end with no direct results, and nothing
2553
  known about the source of the bug. Having nothing specific to think about, I
2554
  inspected the critical kernel paths: the IDT setup, IOAPIC entries setup, the
2555
  GDT, the bootsector, early kernel boot setup, printk() code, etc.
2556
2557
  Funnily, cause I had a 'global' view about the kernel while doing this deep
2558
  inspection, 9 issues scattered around was discovered that needs to be fixed.
2559
  Unfortunately, fixing those 9 issues didn't really solve our main problem, but
2560
  it was nice nonetheless to discover them.
2561
2562
* So we're now in the fifth day, and nothing is clear. Hopelessly, I swap our
2563
  printk code with FreeBSD's one. Things work fine, but once we enable interrupts
2564
  and the PIT in monotonic mode, similar erroneous behavior appears. Swapping
2565
  the FreeBSD code with the Linux code, and again, leading to the same failures.
2566
2567
  So I was right, something 'deep' is messed up, not just the nature of the
2568
  interrupted code.
2569
2570
  Another hopeless try was swapping the data segment descriptor index @ the GDT
2571
  from 0x10 to 0x18. Something interesting was discovered; the page faults
2572
  addresses occurs from values originating from %cs, %ss, and %rflags; it was
2573
  completely repeatable. Sounds familiar? Yes, they should! Those are the values
2574
  pushed by the CPU before invoking IRQ handlers!
2575
2576
  So somehow, registers that get dereferenced get the wrong values from the
2577
  state pushed and popped by the CPU before and after invoking an IRQ handler.
2578
  Now _this_ is an interesting discovery.
2579
2580
  So, how the %ss, %cs, and %rflags values got 'leaked' to our general-purpose
2581
  registers? The stack _must_ have been corrupted somehow ..
2582
2583
* Trying different directions, we minimize the timer handler even further only to
2584
  the 'iretq' x86 op, but how do we acknowledge the local APIC this way?
2585
2586
  In fact, we dont. I just masked all the IOAPIC entries, and let the PIT trigger
2587
  through the PIC. The PIC was then programmed in 'Automatic EOI' mode, where no
2588
  acknowledgment was needed. This reduced the handler only to:
2589
2590
	iretq
2591
2592
  Even with this super-minimal handler, the bug gets triggered :(
2593
2594
  It was also interesting that, again, the #PF handler reported bad memory values
2595
  resembling %rflags, %ss, etc.
2596
2597
* So after minimizing the timer handler to the absolute architecturally minimum,
2598
  we try to minimize the faulting code even further. After some hours of
2599
  disassembly, I find that at -O1 the VGA code alone (without the printk() code
2600
  that called it) triggers the problem.
2601
2602
  After even further inspection and disassembly (and wasted hours), I find that
2603
  not only we can limit the code to VGA writing, but only to the string function
2604
  memcpy().
2605
2606
  So, now, all I have is the timer interrupt and an infinite loop containing
2607
  memcpy, and the code still fails!
2608
2609
  And, yes, the memcpy() code was valid (ehem, Dijkestra's 'Humble Programmer')
2610
2611
* So I go crazy stripping the entire kernel. All what was left was code that
2612
  re-programs the PIC, the PIT setup code, and the string function memcpy()
2613
  and   the    code     still    failed.
2614
2615
  That was, well, weird. We really have nothing now except a very small number of
2616
  lines of code and I don't know what to do.
2617
2618
  Once we enable interrupts, the memcpy() function fail with similar patterns to
2619
  the one described above (page faults, with faulting memory values similar to
2620
  %ss and %rflags)
2621
2622
  Well, you can imagine the state I'm in after all those days.
2623
2624
* Another step for more explicitness was disabling the PIT, and triggering its
2625
  handler manually using software interrupts. I've inject software interrupts
2626
  around strategic C statements like '*src = *dst', etc.
2627
2628
* So at the middle of the middle of the sixth day, I disassemble the memcpy()
2629
  function, draw the stack on paper, and trace. Now something became directly
2630
  apparent. Notice those instructions (gcc -O0):
2631
2632
  /* Classical stack setup */
2633
  ffffffff80109c88:       55                      push   %rbp
2634
  ffffffff80109c89:       48 89 e5                mov    %rsp,%rbp
2635
  ffffffff80109c8c:       48 89 7d e8             mov    %rdi,-0x18(%rbp)
2636
  ffffffff80109c90:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
2637
2638
  /* Bochs magic breakpoint */
2639
  ffffffff80109caa:       66 87 db                xchg   %bx,%bx
2640
2641
  /* Our manual software interrupt */
2642
  ffffffff80109cad:       cd f0                   int    $0xf0
2643
2644
  /* Failing code, specially last line */
2645
  ffffffff80109caf:       48 8b 45 f8             mov    -0x8(%rbp),%rax
2646
  ffffffff80109cb3:       0f b6 10                movzbl (%rax),%edx
2647
  ffffffff80109cb6:       48 8b 45 f0             mov    -0x10(%rbp),%rax
2648
  ffffffff80109cba:       88 10                   mov    %dl,(%rax)
2649
2650
  Notice something? Look again. See those `mov -0x8(%rbp), %rax' and
2651
  `mov -0x10(%rbp), %rax) ops?
2652
2653
  Now check the top of the function, especially the stack pointer setup. Can't
2654
  you notice something?
2655
2656
  YES, GCC used areas of the stack _B E L O W_ the stack pointer to keep parts
2657
  of the leaf function local state. Now when the interrupt was triggered at
2658
  `int $0xf0', the CPU rightfully pushed CPU state in that order:
2659
2660
	%ss
2661
	%rsp
2662
	%rflags
2663
	%cs
2664
	%rip
2665
2666
  Which meant that parts of that function local state get corrupted through
2667
  the implicit CPU stack usage!
2668
2669
  This is also the reason we found faulting addresses of values resembling %ss
2670
  and %rflags. Our processor modified our pointers with those!
2671
2672
* Wow, THAT WAS INTERESTING and worthwhile to invest the week in. Inspecting the
2673
  AMD64 ABI to see why GCC uses this behavior, which is _certainly_ interrupt and
2674
  signals unsafe, I found the answer: it's 'The Red Zone'.
2675
2676
  Now I've really had enough for today, see the discussion of this zone at the
2677
  AMD64 ABI document.
2678
2679
* Solution:
2680
2681
  Either:
2682
  [1] Ignore the 128-byte region in the IRQ handler  OR
2683
  [2] Let the compiler avoid using this Red Zone
2684
2685
  We can't really use [1], since the CPU pushes its state behind our backs. We
2686
  can't let it wind 128-bytes of the stack. Also the kernel stack is a very
2687
  valuable resource, and we shouldn't waste 128-Bytes of it to save an x86 op
2688
  or two.
2689
2690
  Thus, we pick the second option. That's also what the mainstream folks like
2691
  {Open, Free}BSD and Linux do (which I discovered too lately).
2692
2693
* Lessons:
2694
2695
  - When faced with triple faults, inspect the following:
2696
    - the IDT table: if it has been overwritten, then the machine will definitely
2697
      triple fault.
2698
    - stack sanity: If the stack was not sane, then pushes and pops will lead to
2699
      faults, leading to triple faults.
2700
    - MM structures sanity. If the pagetables have been erroneously overwritten,
2701
      the stack will get erroneously mapped.
2702
    - Kernel stack overflow: corrupt stack = triple faults
2703
2704
  - If you noticed that an interrupt randomly triggers a problem in the kernel,
2705
    it can be good to disable interrupts (avoid distractions), then find the most
2706
    affected code path and _manually_ inject interrupts there using software IRQs.
2707
2708
  - The best possible way to let IRQ handlers be as minimal as possible is to use
2709
    the PIC in Automatic EOI (AEOI) mode. They will get the handler minimized to
2710
    the 2 byte `iretq' x86-64 op.
2711
2712
  - Let code binary single-stepping be our last debugging resort to make sure we
2713
    understand the algorithms and code we've just written well enough.
c65c8b1 by Ahmed S. Darwish at 2010-04-25 2714
b3bbe1d by Ahmed S. Darwish at 2010-11-16 2715
c65c8b1 by Ahmed S. Darwish at 2010-04-25 2716
Debugging periodic timers using statistical data:
2717
-------------------------------------------------
2718
2719
* A problem appeared while running the scheduler with the APIC timer instead of
2720
  the PIT. Using observation, it was clear that the running intervals given
2721
  to processes by the APIC timer are twice as long as the ones given by the PIT.
2722
  Both timers were supposedly programmed to produce 40 millisecond intervals,
2723
  so where's the fault?
2724
2725
  Note: we are talking about monotonic PIT and APIC interrupts since those are
2726
  the only ones used by the scheduler; we don't use the one-shot variants here.
2727
2728
* One of the best ways I found to debug this problem, instead of depending on
2729
  shaky heuristics, is to store the occurrence of the timer ticks interrupts in
2730
  a statistical manner.
2731
2732
  An example of this statistical method would be finding the number of timer
2733
  ticks occurrence after a sequence of independently-programmed delay intervals.
2734
2735
* How to do so without sacrificing simplicity?
2736
2737
  Setup the scheduler periodic timer through the PIT or the APIC:
2738
  `pit_monotonic(40)' OR `apic_monotonic(40)'.
2739
2740
  Setup a counter representing the number of scheduler timer ticks so far:
2741
2742
  .globl ticks_counter
2743
  ticks_counter:
2744
	.quad 0
2745
2746
  .globl timer_handler
2747
  timer_handler:
2748
	...
2749
	incq   ticks_counter
2750
	...
2751
2752
  Then, do _independent_ time delays and store the number of timer ticks so far
2753
  after each delay. This way, we can have a statistical comparison between the
2754
  two timers, the one used for the delay and the one used for the ticks. An
2755
  example of this would be:
2756
2757
	/* ticks[i]: number of scheduler timer ticks after the `i'th delay
2758
	 * interval, having in mind that we've setup both the ticks and delay
2759
	 * timers with the same time priod.
2760
	 *
2761
	 * If `1 2 3 4 5 6 ..' is printed, then the two timers are in harmony.
2762
	 * Having the sequence '3 5 7 9 11 ..' means that the ticks timer runs
2763
	 * twice as fast as the delay one. */
2764
	for (int i = 0; i < 100; i++)
2765
		pit_mdelay(52), ticks[i] = ticks_counter;
2766
	for (int j = 0; j < 100; j++)
2767
		printk("%d ", ticks[j]);
2768
2769
  Comparing the statistical results, we found that the PIT in monotonic mode
2770
  ticks twice as fast as the PIT programmed in delay mode. It also ticked twice
2771
  as fast as the APIC timer in delay mode. This heavily suggests something wrong
2772
  in our PIT monotonic-mode setup!
2773
2774
  Meanwhile, setting up the scheduler ticks using the APIC timer ran in perfect
2775
  harmony with the PIT-programmed delays. This heavily re-enforces above paragr-
2776
  aph reasoned speculation: something is wrong in our PIT periodic mode code.
2777
2778
* Now that the problem was isolated, the error was a misunderstanding of the
2779
  PIT's mode-3 behavior; mode-2 fitted our goals better.
2780
2781
  Interestingly, I found this method really helpful for testing periodic timers
2782
  code in general. We already test timer delays code by calibrating its produced
2783
  delays against wall clock time; this method complement the tests by also
2784
  testing the monotonic/periodic interrupts code.
b4776d7 by Ahmed S. Darwish at 2010-04-27 2785
2786
  Note that for testing, specifying a large delay period catches more periodic
2787
  timers accuracy errors. Values from 50 ms and up proved to be good.
2788
2789
* A similar test-case for periodic timers is comparing their statistical
2790
  ticks occurrences relative to each other. This will guarantee that all system
2791
  periodic timers really measure the same concept of time!
5bc1d68 by Ahmed S. Darwish at 2010-11-16 2792
2793
2794
The Naughty Multiply:
2795
---------------------
2796
2797
* References:
2798
  - Intel Instruction Set Reference, A-M, vol. 2A, MUL.
2799
  - GCC manual, 'Using the GNU Compiler Collection'
2800
2801
* So I was writing a casual number of test-cases, where this sequence of code:
2802
2803
  void run_tests(void) {
2804
	memset(_arr, 0xFF, 0x100);
2805
	__memcpy_forward(_arr, _arr + 20, 10);
2806
	__memcpy_forward(_arr + 20, _arr, 10);
2807
  }
2808
2809
  produced a NULL-dereference exception. As printed by our exception handler:
2810
2811
	Exception: vector=14, errcode=0x2, %cs=0x8, %rip=0xffffffff8010b6f7,
2812
	%ss=0x0, %rsp=0xffffffff80104ff0, %rflags=0x200202, %cr2=0x0,
2813
	%cr3=0x3ffef000, %cr4=0x20
2814
2815
  It was weird: the code of these two functions was a very simple and
2816
  straightforward inline GCC assembly.
2817
2818
  The problem was clearly related somehow to registers-usage pressure (the
2819
  compiler register allocator decisions) in the mother function run_tests().
2820
  Removing one memcpy() above, or adding a printk() in a random place in that
2821
  function would make the NULL exception disappear.
2822
2823
* Disassembling the area around the faulting opcode, we found:
2824
2825
  ffffffff8010b6a0 <string_run_tests>:
2826
  ffffffff8010b6a0:       48 c7 c2 20 fb 10 80    mov    $0xffffffff8010fb20,%rdx
2827
  ffffffff8010b6a7:       31 f6                   xor    %esi,%esi
2828
  ffffffff8010b6a9:       b8 ff 00 00 00          mov    $0xff,%eax
2829
  ffffffff8010b6ae:       48 89 d7                mov    %rdx,%rdi
2830
  ffffffff8010b6b1:       b9 00 01 00 00          mov    $0x100,%ecx
2831
  ffffffff8010b6b6:       48 89 ce                mov    %rcx,%rsi
2832
  ffffffff8010b6b9:       48 83 e1 07             and    $0x7,%rcx
2833
  ffffffff8010b6bd:       f3 aa                   rep stos %al,%es:(%rdi)   (+)
2834
  ffffffff8010b6bf:       48 89 f1                mov    %rsi,%rcx
2835
  ffffffff8010b6c2:       48 c1 e9 03             shr    $0x3,%rcx
2836
  ffffffff8010b6c6:       48 89 c6                mov    %rax,%rsi
2837
  ffffffff8010b6c9:       48 b8 01 01 01 01 01    mov    $0x101010101010101,%rax
2838
  ffffffff8010b6d0:       01 01 01
2839
  ffffffff8010b6d3:       48 f7 e6                mul    %rsi
2840
  ffffffff8010b6d6:       f3 48 ab                rep stos %rax,%es:(%rdi)  (++)
2841
  ffffffff8010b6d9:       41 b9 0a 00 00 00       mov    $0xa,%r9d
2842
  ffffffff8010b6df:       31 c0                   xor    %eax,%eax
2843
  ffffffff8010b6e1:       48 c7 c6 34 fb 10 80    mov    $0xffffffff8010fb34,%rsi
2844
  ffffffff8010b6e8:       48 89 d7                mov    %rdx,%rdi
2845
  ffffffff8010b6eb:       44 89 c9                mov    %r9d,%ecx
2846
  ffffffff8010b6ee:       41 89 c0                mov    %eax,%r8d
2847
  ffffffff8010b6f1:       41 89 c8                mov    %ecx,%r8d
2848
  ffffffff8010b6f4:       83 e1 07                and    $0x7,%ecx
2849
  ffffffff8010b6f7:       f3 a4                   rep movsb %ds:(%rsi),%es:(%rdi)
2850
2851
  From our exception handler, we can see that the faulting instruction
2852
  is the above 'rep movsb'. We can also deduce that (+) and (++) ops are
2853
  produced for memset(), and the rest are the memcpy() opcodes.
2854
2855
* Tracing the code by hand, I reached the faulty instruction without
2856
  catching the NULL dereference. A null dereference at the 'movsb'
2857
  instruction can only mean that either %rsi = 0, or %rdi = 0, or both
2858
  had magically turned to zeroes.
2859
2860
  Wondering about the state, I overwrote this movsb instruction in the
2861
  final kernel binary with a Bochs magic breakpoint code (66 87 db)
2862
  using a simple hex editor.
2863
2864
  After moving to the debugging console, thanks to the magic breakpoint,
2865
  here was the Bochs registers dump:
2866
2867
  (0) Magic breakpoint
2868
  CPU0:
2869
  rax: 0x00000000:00000000 rcx: 0x00000000:00000002
2870
  rdx: 0x00000000:00000000 rbx: 0xffffffff:801129a8
2871
  rsp: 0xffffffff:80104ff0 rbp: 0x00000000:00000000
2872
  rsi: 0xffffffff:8010fb34 rdi: 0x00000000:00000000
2873
  r8 : 0x00000000:0000000a r9 : 0x00000000:0000000a
2874
  r10: 0x0f200f20:0f200f20 r11: 0x66656463:62613938
2875
  r12: 0x00000000:00000000 r13: 0x00000000:00000000
2876
  r14: 0x00000000:00000000 r15: 0x00000000:00000000
2877
  rip: 0xffffffff:8010b6fa
2878
  eflags 0x00200202: ID vip vif ac vm rf nt IOPL=0 of df IF tf sf zf af pf cf
2879
2880
  One thing was clearly noted: for some reason, %rdx (and %rdi by consequence)
2881
  have magically turned to zeroes, leading to the NULL dereference!
2882
2883
* Finding that no explicit opcode zeroed %rdx above, then the zeroing
2884
  must have been a side-effect of some instruction above the faulty
2885
  dereference. Looking at the Intel Manual for cues, I found that
2886
  opcode: it was the 'mul %rsi' one.
2887
2888
  So why didn't GCC know about this? Because _I_ did not know about
2889
  it while coding, thus I did not explicitly tell GCC that this nice
2890
  register will get implicitly clobbered.
2891
2892
  GCC, naively thought that the register value was unmodified, and
2893
  rightfully re-used that register value at address 0x8010b6e8. GCC
2894
  is really competent wrt to optimizations. The faulty code was:
2895
2896
	asm volatile (
2897
	    [...]
2898
	    "mov  %[ch], %[tmp];"
2899
	    "mov  $0x0101010101010101, %[ch];"
2900
	    "mul  %[tmp];"
2901
	    "rep  stosq;"
2902
	    : [tmp] "=&r"(tmp), [dst] "=&D"(d0), [ch] "=&a"(uch),
2903
	      [len] "=&c"(ulen)
2904
	    : "[dst]"(dst), "[ch]"(uch), "[len]"(ulen), "[tmp]"(tmp)
2905
	    : "memory");
2906
2907
  As you can see, %rdx was not mentiond as a clobbered register, and
2908
  not even as an output register. GCC had all the rights to believe
2909
  that this register value was not modified in that snippet.
2910
2911
  In another way, the clobber list tells gcc which registers (or
2912
  memory) are changed by the snippet, but not listed as an output.
2913
2914
* The simple solution was adding the 'rdx' register to the assembly
2915
  snippet clobber list. Now the compiler output looks like:
2916
2917
  ffffffff8010b6a0 <string_run_tests>:
2918
  ffffffff8010b6a0:       49 c7 c0 20 fb 10 80    mov    $0xffffffff8010fb20,%r8
2919
  [...]
2920
  ffffffff8010b6e8:       48 89 d7                mov    %r8,%rdi
2921
  [...]
2922
  ffffffff8010b6f7:       f3 a4                   rep movsb %ds:(%rsi),%es:(%rdi)
2923
2924
  where the compiler uses the new register %r8, entirely avoiding
2925
  the %rdx register in the process.
2926
2927
* From investigation, it seems GCC need to also be notified if the
2928
  assembly snippet alters the condition code register %rflags. We now
2929
  handle this by adding `cc' to the clobber list while using ops like
2930
  'andq', 'popfq', 'decq', and 'shrq'.
2931
2932
  Cases where we forgot to clobber 'rcx' after using opcodes like
2933
  'rep XX' were also caught. For this, we only need to set 'rcx' as a
2934
  double input+output register; GCC will infer the rest.
2935
2936
  Another faulty case was having 'rsi' be only set as an input register
2937
  amid the existence of 'movsq' in the assembly snippet.
2938
2939
* Lesson:
2940
2941
  While I did study the privileged x86 ring0 opcodes carefully (mainly
2942
  Intel's manual vol. 3A, and AMD vol. 2), the basic x86 opcodes were
2943
  learned 'on the go'. The problem we faced today is a small price for
2944
  that; I should re-read Intel's vol. 1 docs way more carefully.
2945
2946
  To make-up for this, and also using the Intel manuals, I investigated
2947
  the side effects of __every__ opcode used in the kernel inlined
2948
  assembly snippets. That's how I caught the %rflags, %rcx, and %rsi
2949
  cases mentioned in the above point!