| 1 |
|
| 2 |
|
| 3 |
Notes on the Cute Kernel |
| 4 |
|
| 5 |
|
| 6 |
TOC: |
| 7 |
---- |
| 8 |
|
| 9 |
(1) On Intel Documents |
| 10 |
(2) On Data Alignment |
| 11 |
(3) On little-endian byte ordering |
| 12 |
(4) On Real Mode |
| 13 |
(5) On accessing fixed disk drivers (hard-disks) |
| 14 |
(6) On x86 jumps |
| 15 |
(7) On referencing global symbols |
| 16 |
(8) The 8259A Programmable Interrupt Controller |
| 17 |
(9) On x86 SMP initialization sequence |
| 18 |
(10) On linkers 'relocation error's |
| 19 |
(11) When %RIP-relativeness goes wrong |
| 20 |
(12) Testing libc()-like kernel code |
| 21 |
(13) On SMP Memory ordering models |
| 22 |
(14) On the C 'volatile' type qualifier |
| 23 |
(15) On the vaguely documented x86 16-bit protected mode |
| 24 |
(16) Using light debugging techniques |
| 25 |
(17) Musings on Memory Allocation |
| 26 |
(18) On Virtual Memory |
| 27 |
(19) On Partitionting the 64-bit kernel address space |
| 28 |
(20) Debugging -O3 triggered problems |
| 29 |
(21) The Zone! |
| 30 |
(22) Debugging periodic timers using statistical data |
| 31 |
(23) The Naughty Multiply |
| 32 |
|
| 33 |
|
| 34 |
On Intel Documents: |
| 35 |
------------------- |
| 36 |
|
| 37 |
* Although the newer intel documents are much more sexy and detailed, the |
| 38 |
original i386 one describes the concepts in a much more bottom-up, |
| 39 |
undergraduate-book like, no forward-references way. |
| 40 |
|
| 41 |
* Somehow the new intel documents beginning from the 1997 i686 document |
| 42 |
assumes some x86 familarity that wasn't obviously assumed in the original |
| 43 |
document. |
| 44 |
|
| 45 |
* For an example, the 'alignment' topic is throughly explained in |
| 46 |
the i386 manual, what is it and why is it important to performance, |
| 47 |
and how the processor act if given a misaligned address, *before* ever |
| 48 |
using the words '16bit-aligned' and '32bit-aligned'. |
| 49 |
|
| 50 |
The new intel documents assume you know subtle topics as above and |
| 51 |
talks about critical issues related to their understandings freely |
| 52 |
*before* they become throughly discussed. Only at i686 chapter 5 |
| 53 |
alignment is throughly described although several critical alignment |
| 54 |
related topics have already been discussed. |
| 55 |
|
| 56 |
* Case 2: on why we should disable interrupts while switching to protected |
| 57 |
mode? |
| 58 |
|
| 59 |
New Intel document: Software must guarantee that no exceptions or interrupts |
| 60 |
are generated during the mode switching operation. |
| 61 |
|
| 62 |
Original 386 document: ``The IDTR may be loaded in either real-address or |
| 63 |
protected mode. However, the format of the interrupt table for protected mode |
| 64 |
is different than that for real-address mode. It is not possible to change to |
| 65 |
protected mode and change interrupt table formats at the same time; therefore, |
| 66 |
it is inevitable that, if IDTR selects an interrupt table, it will have the |
| 67 |
wrong format at some time. An interrupt or exception that occurs at this time |
| 68 |
will have unpredictable results. To avoid this unpredictability, interrupts |
| 69 |
should remain disabled until interrupt handlers are in place and a valid IDT |
| 70 |
has been created in protected mode.'' |
| 71 |
|
| 72 |
* Case 3: on why should we long jump after switching to protected mode? |
| 73 |
|
| 74 |
New intel document: ``The JMP or CALL instruction immediately after the MOV |
| 75 |
CR0 instruction changes the flow of execution and serializes the processor.'' |
| 76 |
|
| 77 |
Original 386 document: ``Immediately after setting the PE flag, the |
| 78 |
initialization code must flush the processor's instruction prefetch queue by |
| 79 |
executing a JMP instruction. The 80386 fetches and decodes instructions and |
| 80 |
addresses before they are used; however, after a change into protected mode, |
| 81 |
the prefetched instruction information (which pertains to real-address mode) |
| 82 |
is no longer valid. A JMP forces the processor to discard the invalid |
| 83 |
information.'' |
| 84 |
|
| 85 |
This is a recurring pattern. |
| 86 |
|
| 87 |
* What's IA-32e in new Intel documents? |
| 88 |
Intel has really messed up the namings of their 64bit processors. The |
| 89 |
most logical choice I think was IA-64, but unfortunately that was taken |
| 90 |
by the big flop in processor's history: the Itanium. When Intel waked |
| 91 |
up and created their first amd64 compatible processor, they named it EMT64. |
| 92 |
|
| 93 |
Now intel in its new official documents call the x86_64 architecture |
| 94 |
"intel 64" and call those processor's 64bit mode IA-32e. |
| 95 |
|
| 96 |
|
| 97 |
On Data Alignment: |
| 98 |
------------------ |
| 99 |
|
| 100 |
* Note: In x86 speak, word = 2bytes/16bits, double word = 4bytes/32bits |
| 101 |
|
| 102 |
* Each byte within a word has its own address, and the smaller of the |
| 103 |
addresses represent the address of the whole word. |
| 104 |
|
| 105 |
* Each byte within the double word has its own address. The smallest of the byte |
| 106 |
addresses is the address of the double word. |
| 107 |
|
| 108 |
* ``Note that words need not be aligned at even-numbered addresses and |
| 109 |
doublewords need not be aligned at addresses evenly divisible by four. |
| 110 |
This allows maximum flexibility in data structures (e.g., records containing |
| 111 |
mixed byte, word, and doubleword items) and efficiency in memory |
| 112 |
utilization. |
| 113 |
|
| 114 |
** When used in a configuration with a 32-bit bus, actual transfers of |
| 115 |
data between processor and memory take place in units of doublewords (4bytes) |
| 116 |
beginning at addresses evenly divisible by four ** |
| 117 |
|
| 118 |
However, the processor converts requests for misaligned words or doublewords |
| 119 |
into the appropriate sequences of requests acceptable to the memory interface. |
| 120 |
** Such misaligned data transfers reduce performance by requiring extra memory |
| 121 |
cycles. For maximum performance, data structures (including stacks) should |
| 122 |
be designed in such a way that, whenever possible, word operands are aligned |
| 123 |
at even addresses and doubleword operands are aligned at addresses evenly |
| 124 |
divisible by four. ** |
| 125 |
|
| 126 |
Due to instruction prefetching and queuing within the CPU, there is no |
| 127 |
requirement for instructions to be aligned on word or doubleword boundaries. |
| 128 |
However, a slight increase in speed results if the target addresses of control |
| 129 |
transfers are evenly divisible by four.'' |
| 130 |
-- Intel 386 Reference Manual |
| 131 |
|
| 132 |
* ``The reason for this is that the processor requires two memory accesses to |
| 133 |
make an unaligned memory access; whereas, aligned accesses require only one |
| 134 |
memory access. A word or doubleword operand that crosses a 4-byte boundary |
| 135 |
or a quadword operand that crosses an 8-byte boundary is considered unaligned |
| 136 |
and requires two separate memory bus cycles to access it; a word that starts |
| 137 |
on an odd address but does not cross a word boundary is considered aligned |
| 138 |
and can still be accessed in one bus cycle.'' |
| 139 |
-- Intel 686 Reference Manual |
| 140 |
|
| 141 |
* x86 4KB page aligned addresses: |
| 142 |
|
| 143 |
Decimal: |
| 144 |
10 ^ 0 = 1 |
| 145 |
10 ^ 1 = 10 |
| 146 |
10 ^ 2 = 100 |
| 147 |
10 ^ 3 = 1000 |
| 148 |
... |
| 149 |
|
| 150 |
Binary: |
| 151 |
2 ^ 0 = 1 |
| 152 |
2 ^ 1 = 10 |
| 153 |
2 ^ 2 = 100 |
| 154 |
... |
| 155 |
2 ^ 10 = 10000000000 = 1024 = 1KB |
| 156 |
2 ^ 12 = 1000000000000 = 2^2 * 1KB = 4KB (x86 Page size/aligned-address) |
| 157 |
... |
| 158 |
|
| 159 |
* Common alignments offsets: |
| 160 |
|
| 161 |
1byte alignment = 2 ^ 0 -- any x86 RAM address |
| 162 |
|
| 163 |
2byte alignment = 2 ^ 1 (0b10) -- 16bit aligned address |
| 164 |
= any address divisable by 2 |
| 165 |
= 0x0, 0x2, 0x4, 0x6, 0x8, ..., 0xa, 0xc, 0xe, 0x10, 0x12, ... |
| 166 |
|
| 167 |
4byte alignment = 2 ^ 2 (0b100) -- 32bit aligned address |
| 168 |
= any address divisable by 4 |
| 169 |
= 0x0, 0x4, 0x8, 0xc, 0x10, 0x14, 0x18, 0x1c, 0x20, ... |
| 170 |
|
| 171 |
8byte alignment = 2 ^ 3 (0b1000) - GDT cell index/size |
| 172 |
= any address divisable by 8 |
| 173 |
= 0x0, 0x8, 0x10, 0x18, 0x20, ... |
| 174 |
| | | |
| 175 |
null %cs %ds |
| 176 |
|
| 177 |
16byte alignment = 2 ^ 4 (0b10000) - GDT table recommended alignment |
| 178 |
= any address divisable by 16 |
| 179 |
= 0x0, 0x10, 0x20, 0x30, 0x40, ..., 0xf0, 0x100, 0x110, 0x120, ... |
| 180 |
|
| 181 |
* Useful numbers: |
| 182 |
|
| 183 |
0x100 = 256 bytes |
| 184 |
0x200 = 512 bytes (bootsector size, number of entries in PML{4, 3, 2} tables) |
| 185 |
0x400 = 1KB |
| 186 |
0x1000 = 4KB (x86 normal page size/alignment) |
| 187 |
0x8000 = 32KB |
| 188 |
0xffff = 64KB -1 (max offset for a real-mode segment) |
| 189 |
0x10000 = 64KB |
| 190 |
0x80000 = 512KB (our initial kernel size!) |
| 191 |
|
| 192 |
|
| 193 |
On little-endian byte ordering: |
| 194 |
------------------------------- |
| 195 |
|
| 196 |
* From AMD64 documents: |
| 197 |
|
| 198 |
"The x86 and AMD64 architectures address memory using little-endian |
| 199 |
byte-ordering. Multibyte values are stored with their least-significant byte at |
| 200 |
the lowest byte address, and they are illustrated with their least significant |
| 201 |
byte at the right side. Strings are illustrated in reverse order, because the |
| 202 |
addresses of their bytes increase from right to left." |
| 203 |
|
| 204 |
Value = |
| 205 |
|
| 206 |
BYTE-7 BYTE-6 BYTE-5 BYTE-4 BYTE-3 BYTE-2 BYTE-1 BYTE-0 |
| 207 |
|
| 208 |
Actual Mem Layout: |
| 209 |
|
| 210 |
BYTE-0 BYTE-1 BYTE-2 BYTE-3 BYTE-4 BYTE-5 BYTE-6 BYTE-7 |
| 211 |
^ |
| 212 |
|--- Address start here |
| 213 |
|
| 214 |
For illustrative purposes only - not actual mem layout: |
| 215 |
|
| 216 |
GREATEST DATA STRUCTURE |
| 217 |
ADDRESS |
| 218 |
31 23 15 7 0 <--BIT |
| 219 |
+---------------+---------------+---------------+---------------+ OFFSET |
| 220 |
| BYTE 7 | BYTE 6 | BYTE 5 | BYTE 4 |4 |
| 221 |
|---------------+---------------+---------------+---------------| SMALLEST |
| 222 |
| BYTE 3 BYTE 2 BYTE 1 BYTE 0 |0 ADDRESS |
| 223 |
+---------------+---------------+---------------+---------------+^ |
| 224 |
| |
| 225 |
Address start here ---| |
| 226 |
|
| 227 |
* Classical little-endian ordering: |
| 228 |
|
| 229 |
.long 0x00cf9200 = 00 92 CF 00 |
| 230 |
^ |
| 231 |
|--- Address start |
| 232 |
= |
| 233 |
|
| 234 |
+---------------+---------------+---------------+---------------+ |
| 235 |
| 00 | CF | 92 | 00 | |
| 236 |
+---------------+---------------+---------------+---------------+^ |
| 237 |
Address start ---| |
| 238 |
|
| 239 |
* GAS concatenation: |
| 240 |
|
| 241 |
.long 0x00cf9200, 0x0000ffff != .quad 0x00cf92000000ffff |
| 242 |
|
| 243 |
.long 0x00cf9200, 0x0000ffff |
| 244 |
= |
| 245 |
.long 0x00cf9200 |
| 246 |
.long 0x0000ffff |
| 247 |
= |
| 248 |
.quad 0x0000ffff00cf9200 |
| 249 |
= |
| 250 |
(actual layout in memory) |
| 251 |
00 92 CF 00 FF FF 00 00 |
| 252 |
^ |
| 253 |
|--- Address start |
| 254 |
= |
| 255 |
(for illustrative purposes only - _NOT_ actual mem layout) |
| 256 |
+---------------+---------------+---------------+---------------+ |
| 257 |
| 00 | 00 | FF | FF | |
| 258 |
|---------------+---------------+---------------+---------------| |
| 259 |
| 00 | CF | 92 | 00 | |
| 260 |
+---------------+---------------+---------------+---------------+^ |
| 261 |
Address start ---| |
| 262 |
BUT: |
| 263 |
|
| 264 |
.long 0x0000ffff |
| 265 |
.long 0x00cf9200 |
| 266 |
= |
| 267 |
.quad 0x00cf92000000ffff |
| 268 |
= |
| 269 |
(for illustrative purposes only - _NOT_ actual mem layout) |
| 270 |
+---------------+---------------+---------------+---------------+ |
| 271 |
| 00 | CF | 92 | 00 |4 |
| 272 |
|---------------+---------------+---------------+---------------| |
| 273 |
| 00 | 00 | FF | FF | |
| 274 |
+---------------+---------------+---------------+---------------+^ |
| 275 |
Address start ---| |
| 276 |
= |
| 277 |
(actual layout in memory) |
| 278 |
FF FF 00 00 00 92 CF 00 |
| 279 |
^ |
| 280 |
|--- Address start |
| 281 |
|
| 282 |
* So: |
| 283 |
|
| 284 |
test.S: | print.c: |
| 285 |
| |
| 286 |
.global test | #include <stdint.h> |
| 287 |
test: | extern uint64_t test; |
| 288 |
.quad 0x00cf92000000ffff | int main(void) { printf("test = 0x%x\n", test); } |
| 289 |
|
| 290 |
output: test = 0x0000ffff |
| 291 |
as the 64-bit value is presented as: (address-start) FF FF 00 00 00 92 ... (address-end) |
| 292 |
|
| 293 |
* The String "WXYZ" is represented in memory as: |
| 294 |
|
| 295 |
'W' 'X' 'Y' 'Z' |
| 296 |
^ |
| 297 |
|--Address start |
| 298 |
|
| 299 |
And in the usual memory diagrams: |
| 300 |
|
| 301 |
+---------------+---------------+---------------+---------------+ |
| 302 |
| 'Z' | 'Y' | 'X' | 'W' | |
| 303 |
+---------------+---------------+---------------+---------------+^ |
| 304 |
Address start ---| |
| 305 |
|
| 306 |
|
| 307 |
On Real Mode: |
| 308 |
------------- |
| 309 |
|
| 310 |
* To achieve maximum backward compatibility, x86 processors starts in the 16bit |
| 311 |
8086-compatible real-mode. The 8086 had a complete 16-bit architecture - 16-bit |
| 312 |
internal registers, 16-bit data bus, and a *20-bit* external address bus (1 MB |
| 313 |
of physical memory). |
| 314 |
|
| 315 |
Because the 8086 processor had 16-bit index registers and memory pointers, |
| 316 |
it can effectively address *only* up to 64 KB of memory, which wasn't acceptable |
| 317 |
even by that time. |
| 318 |
|
| 319 |
* So, to address the memory beyond 64 KB (the remaining 20bit 1MB address space) |
| 320 |
with 16bit registers, Intel created a `segmented' memory addressing scheme. |
| 321 |
|
| 322 |
We want to access 20bit address space, namely 0xfffff at max. The 8086 |
| 323 |
processor's registers are only 16bit, namely a max value of 0xffff. To solve |
| 324 |
this issue, the 8086 processor used *two* registers to address physical memory. |
| 325 |
The two registers are a segment described as a base address, and an offset within |
| 326 |
that segment. |
| 327 |
|
| 328 |
The goal: addressing up to the max 20bit value; 0xfffff: |
| 329 |
|
| 330 |
f 0 0 0 0 <-- 16bit value (segment base address) hex left shifted once to |
| 331 |
address the 4 highest order bits in the goal. |
| 332 |
+ <-- A simple adding operation |
| 333 |
|
| 334 |
f f f f <-- 16bit value, max offset we can address |
| 335 |
|
| 336 |
------- |
| 337 |
|
| 338 |
f f f f f <-- The goal (max 20bit value we can address) |
| 339 |
|
| 340 |
So physical memory in the 8086 is accessed through a `segment:offset' scheme |
| 341 |
|
| 342 |
where physical address = (segment * 16) + offset |
| 343 |
|
| 344 |
* Note that the above segment model is very primitive that there are really no |
| 345 |
real `segments' identified by the processor. A *physical* 'segment' is |
| 346 |
basically just a 16-bit aligned physical address address with an offset |
| 347 |
`limit' of 64K enforced by the processor registers limit. |
| 348 |
|
| 349 |
Example: |
| 350 |
|
| 351 |
a000:0010 = a010:0000 /* Overlapping 'segments' */ |
| 352 |
f000:ffff = ffff:000f /* Maximum address using different 'segments' */ |
| 353 |
40:100 = 50:00 /* Last byte in Bios Data Area */ |
| 354 |
|
| 355 |
It's important to note that a segment register can take any value ranging |
| 356 |
from 0 to 0xffff including the non-16bit-aligned addresses. It's on *physical* |
| 357 |
memory where a segment base address is always 16-bit aligned due to the left |
| 358 |
shift, not the segment register value itself. |
| 359 |
|
| 360 |
Example of completely valid segment register values and their physical |
| 361 |
addresses: |
| 362 |
|
| 363 |
Segment register value | Segment physical address |
| 364 |
------------------------------------------------------- |
| 365 |
0xffff | 0xffff0 |
| 366 |
0x1abc | 0x1abc0 |
| 367 |
0x4242 | 0x42420 |
| 368 |
... | ... |
| 369 |
0xyyyy | 0xyyyy0, y = /[0-f]/ |
| 370 |
|
| 371 |
* Some calculations: |
| 372 |
|
| 373 |
biggest segment address possible = 0xffff |
| 374 |
|
| 375 |
biggest offset possible = 0xffff |
| 376 |
|
| 377 |
biggest *physical* address that can act as a segment address = |
| 378 |
biggest 20-bit address which is 16-byte aligned = 0xffff0 |
| 379 |
|
| 380 |
* NOTE: There's a huge difference between real-mode `segments' and segments |
| 381 |
provided in 386+ protected-mode processors. The 16bit value in a segmented |
| 382 |
protected-mode logical address is what's called a `segment selector' which |
| 383 |
is basically a pointer to a `segment descriptor' stored in either the GDT or |
| 384 |
the LDT. The segment descriptor includes info like the protected-mode segment |
| 385 |
base, limit, access permissions, present bit, etc. |
| 386 |
|
| 387 |
On the other hand, the real-mode `segments' are just primitive addresses |
| 388 |
that get hex left shifted in address calculations. |
| 389 |
|
| 390 |
* Note that using this scheme, code segments are limited to a size of 64KB. |
| 391 |
To reach more code, we had to jump to what's called 'far pointers', which |
| 392 |
not only contains an offset relative to current code segment, but also a |
| 393 |
new address for the code segment. |
| 394 |
|
| 395 |
AT&T syntax for far jumps: `ljmp $SEGMENT, $OFFSET' where 'segment' and |
| 396 |
'offset' are naturally 16bit values. |
| 397 |
|
| 398 |
Usage example: after loading the kernel to ram at its chosen segment which |
| 399 |
is usually far away from the mbr code segment, we 'far jump' to this segment |
| 400 |
with offset 0 to execute the loaded kernel code and go beyond the mbr's 512 |
| 401 |
bytes size limit. This is the same technique used by GRUB to loads its stages. |
| 402 |
|
| 403 |
* The above terminology is also the root of 386+ protected mode 'far pointer' |
| 404 |
which is a 'logical address' containing a segment selector (16 bit) and an |
| 405 |
offset (32 bit) relative to the segment pointed by that selector. |
| 406 |
|
| 407 |
|
| 408 |
On accessing fixed disk drivers (hard-disks): |
| 409 |
--------------------------------------------- |
| 410 |
|
| 411 |
* There's the usual heads, tracks, sectors, and cylinders. |
| 412 |
|
| 413 |
Track: when the head is positioned over a point on the fixed disk, ready |
| 414 |
to read or write, the disk platter surface spins underneath it, tracing a |
| 415 |
full circle. This circle is a track. A disk surface contains may tracks. |
| 416 |
|
| 417 |
Sectors: Fixed disk systems divides each track into short arcs called |
| 418 |
sectors, each sector usually holds 512 bytes. This is the smallest unit |
| 419 |
of storage, as tracks would mean wasting a lot of space for small files. |
| 420 |
|
| 421 |
Cylinder: all the heads move in and out together, so each head is always |
| 422 |
physically located at the same track number; we can't have one head at |
| 423 |
track 0, and another at 250. Because of this, often the track location of |
| 424 |
the heads is not referred to as a track number but rather as a cylinder |
| 425 |
number. |
| 426 |
|
| 427 |
Disk size = bytes per sector * sectors per track * number of cylinders * |
| 428 |
number of heads |
| 429 |
|
| 430 |
* BIOS fixed disk services are provided through the INT 13h service |
| 431 |
routine. It programs the hard-disk controller; it never writes directly |
| 432 |
from the processor to the hard-disk drive. The hard-disk controller |
| 433 |
communicates directly between the hard-disk, the processor, and system |
| 434 |
memory. |
| 435 |
|
| 436 |
* To support disks beyoned the 8GB limit enforced by original IBM bios |
| 437 |
design (max of 1024 cylinder, 256 heads, and 63 sectors), new INT13 |
| 438 |
bios services were added to bios which offers simple sector addressing |
| 439 |
using a logical zero based sector number (0, 1, 2, ..) |
| 440 |
|
| 441 |
This addressing scheme is easier than the Cylinder-Head-Sector model |
| 442 |
and is supported in all bios chips post-1995, so we use it exclusively. |
| 443 |
|
| 444 |
Check the boot sector code for more details. |
| 445 |
|
| 446 |
|
| 447 |
On x86 jumps: |
| 448 |
------------- |
| 449 |
|
| 450 |
* References: |
| 451 |
- Intel i686 Manual, vol. 2, Instruction Set Reference, Chapter 3: JMP |
| 452 |
- Sun Microsystems, [AT&T] x86 Assembly language reference manual |
| 453 |
- Cute: boot/bootsect.S, boot/head.S, boot/e820.S, boot/trampoline.S |
| 454 |
|
| 455 |
* Judging by my early naive trial-and-error experiences and the nature of |
| 456 |
newbies questions at osdev.org forums, the most common failure point while |
| 457 |
developing bootloaders and early kernel stages is either an intra or an |
| 458 |
inter-module jump. Because of this, we'll clarify the most used x86 jumps |
| 459 |
here, detailing their AT&T syntax and opcodes. We don't use the Intel |
| 460 |
syntax in our kernel, so it won't be discussed here; it's already clearly |
| 461 |
defined in the official Intel manuals. |
| 462 |
|
| 463 |
First, let's stress that a 'jump' is only a mechanism to disrupt the CPU |
| 464 |
execution flow, moving EIP to a different code (hopefully not data!) region. |
| 465 |
|
| 466 |
Generally, x86 jumps can be classified to either near or far. Near jumps |
| 467 |
are 'intra-segment', meaning that they only move within the limits of the |
| 468 |
currently setup and cashed code segment. Meanwhile, far jumps are |
| 469 |
'inter-segment': they take a (mostly new) code segment descriptor and an |
| 470 |
offset within such segment. |
| 471 |
|
| 472 |
The kind of displacement in such jumps can also be either relative or |
| 473 |
absolute. Relative addresses (wrt the next instruction address) were |
| 474 |
introduced to save opcodes size. [*] |
| 475 |
|
| 476 |
Absolute addresses are very important when relative offsets to an |
| 477 |
outside-of-code-object destination address cannot be validly calculated |
| 478 |
at link-time. Check the 'When %RIP-relativeness goes wrong' section's |
| 479 |
summary for further details. |
| 480 |
|
| 481 |
Examples of jumps: |
| 482 |
|
| 483 |
<AT&T Syntax> | <Objdump Disassembly> |
| 484 |
|
| 485 |
[A] Near jump, 8-bit relative displacement (-128): [short jump] |
| 486 |
|
| 487 |
1: movl $0xdeadbeef, %eax | af: b8 ef be ad de |
| 488 |
jmp 1b | b4: eb f9 |
| 489 |
|
| 490 |
[B] Near jump, 8-bit relative displacement (+127): [short jump] |
| 491 |
|
| 492 |
1: movl $0xdeadbeef, %eax | af: b8 ef be ad de |
| 493 |
jmp 2f | b4: eb 00 |
| 494 |
2: movl $0xcafebabe, %ebx | b6: bb be ba fe ca |
| 495 |
|
| 496 |
[C] Near jump, 16-bit relative displacement: |
| 497 |
|
| 498 |
[Not supported by GAS] |
| 499 |
|
| 500 |
[D] Near jump, 32-bit relative displacement: |
| 501 |
|
| 502 |
jmp dst | 12: e9 f8 ff 0f 00 |
| 503 |
.skip 0xffff8 | ... |
| 504 |
dst: movl $0xdeadbeef, %eax | 10000f: b8 ef be ad de |
| 505 |
|
| 506 |
[E] Near jump, absolute indirect displacement (register): |
| 507 |
|
| 508 |
mov $0x100000, %eax | af: b8 00 00 10 00 |
| 509 |
jmp *%eax | b4: ff e0 |
| 510 |
|
| 511 |
[F] Near jump, absolute indirect displacement (memory reference): |
| 512 |
|
| 513 |
jmpl *(0x7c00 + mem) | af: ff 25 b5 7c 00 00 |
| 514 |
mem: .long 0x100000 | b5: 00 00 10 00 |
| 515 |
|
| 516 |
(AT&T syntax uses the '*' operator for indirect jumps) |
| 517 |
|
| 518 |
Note that the _only_ available way for specifying _near_ absolute |
| 519 |
displacements is to encode such displacements 'indirectly' (i.e. not |
| 520 |
in the operands themselves) using a register or a memory reference. |
| 521 |
|
| 522 |
[G] Far jump, absolute 32-bit far pointer in operand (16-bit %cs |
| 523 |
index + 16-bit offset): |
| 524 |
|
| 525 |
.code16 | |
| 526 |
ljmp $0x08, $(0x7c00 + rel) | 99: ea 9e 7c 08 00 |
| 527 |
.code32 | |
| 528 |
rel: movl $0x10000, %esi | 9e: be 00 00 01 00 |
| 529 |
|
| 530 |
[H] Far jump, absolute 48-bit far pointer in operand (16-bit %cs |
| 531 |
index + 32-bit offset): |
| 532 |
|
| 533 |
.code32 | |
| 534 |
# KERNEL_CS = 0x08 | |
| 535 |
# startup_64 = 0x100112 | |
| 536 |
ljmp $KERNEL_CS, $startup_64 | 10006f: ea 12 01 10 00 08 00 |
| 537 |
|
| 538 |
[I] Far jumps, 32-bit or 48-bit absolute indirect displacement |
| 539 |
(memory reference): |
| 540 |
|
| 541 |
[Rarely used, let's only know that they exist] |
| 542 |
|
| 543 |
[*] footnote: On x86-32, only jumps and calls had the relative addressing |
| 544 |
facility. To save opcode size, especially while handling 64-bit wide virtual |
| 545 |
addresses, AMD introduced a global and default %RIP-relative addressing |
| 546 |
scheme for most ops, including most important, dereferencing global data. |
| 547 |
|
| 548 |
|
| 549 |
On referencing global symbols: |
| 550 |
------------------------------ |
| 551 |
|
| 552 |
* References: |
| 553 |
- GNU LD Manual, 'Using LD', Section 3.5.4 -'Source Code Reference' |
| 554 |
- Cute: include/idt.h, kernel/idt.S |
| 555 |
|
| 556 |
* To gracefully fail instead of triple faulting the machine, one of the first |
| 557 |
things implemented while writing the C part of the kernel was setting up CPU |
| 558 |
exception handlers. To setup these handlers, we needed access to the IDT table |
| 559 |
and its descriptor; both were already defined in assembly files. What are the |
| 560 |
rules for accessing these data structures natively in C? |
| 561 |
|
| 562 |
When a global symbol is declared in C, two things happen. The compiler reserves |
| 563 |
enough space in the executable .data or .bss section holding the _value_ of the |
| 564 |
symbol. Second, it creates an entry in the program's symbol table which holds |
| 565 |
the symbol _address_ (in that section). For a small example, check below |
| 566 |
program and its disassembly: |
| 567 |
|
| 568 |
--> globals.c (a1): | --> [disassembly, $gcc -O3]: |
| 569 |
int x; | 0000000000600800 <y>: 11 11 00 00 ... (.data) |
| 570 |
int *y = 0x1111; | 0000000000600818 <x>: ... (.bss, 8 bytes) |
| 571 |
| |
| 572 |
int main(void) { | 0000000000400460 <main>: |
| 573 |
x = 0xdeadbeef; | 400460: movq 0x200399(%rip),%rax #600800 <y> |
| 574 |
*y = 0xcafebabe; | 400467: movl $0xdeadbeef,0x2003a7(%rip) #600818 <x> |
| 575 |
} | 400471: movl $0xcafebabe,(%rax) |
| 576 |
| 400477: retq |
| 577 |
|
| 578 |
As you can see, accessing the global 'x' required a single memory dereference |
| 579 |
of the address 0x600818 stored in the program's symbol table. Meanwhile, |
| 580 |
dereferencing the pointer 'y' required TWO memory dereferences: one for |
| 581 |
dereferencing the variable 'y' address 0x600800, getting its value/contents, |
| 582 |
and _another_ dereference for such value stored (0x1111); after all, a pointer |
| 583 |
is just a special 8-byte variable. |
| 584 |
|
| 585 |
* Furthermore, check below program which has several symbols defined in an |
| 586 |
assembly file with different treatments of such symbols (and their data) in C: |
| 587 |
|
| 588 |
--> data.S (a2): | --> app.c (a2): |
| 589 |
.section .data | uint32_t x; |
| 590 |
.globl x | extern uint32_t *y; |
| 591 |
x: | extern uint32_t z[]; |
| 592 |
.long 0xdeadbeef | |
| 593 |
| int main(void) { |
| 594 |
.globl y | printf("&x = 0x%lx\n", &x); |
| 595 |
y: | printf("&y = 0x%lx\n", &y); |
| 596 |
.long 0xcafebabe | printf("&z = 0x%lx\n", &z); |
| 597 |
| |
| 598 |
.globl z | printf("x = 0x%x\n", x); |
| 599 |
z: | printf("y = 0x%x\n", y); |
| 600 |
.long 0xeeeeeeed | ... /* further code: check the output */ |
| 601 |
| } |
| 602 |
|
| 603 |
$ gcc -O3 data.S app.c -o output.elf |
| 604 |
$ objdump -D output.elf -j .data |
| 605 |
0000000000600af0 <x>: |
| 606 |
600af0: ef be ad de be ba |
| 607 |
0000000000600af4 <y>: |
| 608 |
600af4: be ba fe ca ed |
| 609 |
0000000000600af8 <z>: |
| 610 |
600af8: ed ee ee ee |
| 611 |
|
| 612 |
$ ./output.elf |
| 613 |
'int x' is a global located at program area (&x = ): 0x600af0 |
| 614 |
'int *y' is a global located at program area (&y = ): 0x600af4 |
| 615 |
'int z[]' is a global located at program area (&z = ): 0x600af8 |
| 616 |
|
| 617 |
x = 0xdeadbeef | *x = <not-applicable> |
| 618 |
y = 0xcafebabe | *y = <segmentation fault> (0xcafebabe unmapped) |
| 619 |
z = 0x600af8 | *z = 0xeeeeeeed |
| 620 |
|
| 621 |
x[0] = <not-applicable> |
| 622 |
y[0] = <segmentation fault> since 0xcafebabe is not mapped |
| 623 |
z[0] = 0xeeeeeeed |
| 624 |
|
| 625 |
As seen from above introspection, y is just a global variable holding |
| 626 |
0xcafebabe. y _address_ is '0x600af4', buts its _value_ is 0xcafebabe. Thus, |
| 627 |
'*y' will produce two memory dereferences, and the second of the two will |
| 628 |
result in a CPU exception (segmentation fault) since 0xcafebabe is not |
| 629 |
mapped. |
| 630 |
|
| 631 |
* So, if we have a structure like the below IDT table descriptor: |
| 632 |
|
| 633 |
/* Located at ffffffff8010e240 |
| 634 |
* Binary value (little endian): 00 10 80 16 11 00 00 80 ff ff */ |
| 635 |
.globl idtdesc |
| 636 |
idtdesc: |
| 637 |
.word idt_end - idt # limit |
| 638 |
.quad VIRTUAL(KTEXT_PHYS(idt)) # base |
| 639 |
|
| 640 |
we can _not_ reference it in C as |
| 641 |
|
| 642 |
struct idt_descriptor { |
| 643 |
uint16_t limit; |
| 644 |
uint64_t base; |
| 645 |
} __packed; |
| 646 |
|
| 647 |
extern struct idt_descriptor *idtdesc; /* WRONG!! */ |
| 648 |
idtdesc->limit = 0; /* Page-fault exception! */ |
| 649 |
|
| 650 |
since in above case, the variable pointer _address_ (&idtdesc = ) will be |
| 651 |
0x10e240, and the pointer _value_ (idtdesc = ) will equal 0x8000001116801000. |
| 652 |
Thus, dereferencing such pointer as in 'idtdesc->limit' above will in fact |
| 653 |
dereference 0x10e240 first, then afterwards dereference the unmapped address |
| 654 |
0x8000001116801000! Final result: a CPU page fault exception. |
| 655 |
|
| 656 |
The only wrong thing above was specifying the wrong type for the 'idtdesc' |
| 657 |
label. Having the type as the struct itself instead of a pointer will let the |
| 658 |
struct address be 0x10e240, and its value will equal 0xffff8000001116801000 |
| 659 |
which _is_ the desired result: |
| 660 |
|
| 661 |
extern const struct idt_descriptor idtdesc; |
| 662 |
idtdesc.limit = 0; /* Safe!! */ |
| 663 |
|
| 664 |
* Now, if we want to reference the IDT table itself: |
| 665 |
|
| 666 |
/* IDT_GATES: number of entries in the table */ |
| 667 |
.align 16 |
| 668 |
.globl idt |
| 669 |
idt: |
| 670 |
.skip IDT_GATES * 16 |
| 671 |
idt_end: |
| 672 |
|
| 673 |
Assuming a pre-defined C type of 'struct idt_gate', representing a single IDT |
| 674 |
table entry structure, all what we need is an array of such types. Namely: |
| 675 |
|
| 676 |
extern struct idt_gate idt[IDT_GATES]; |
| 677 |
|
| 678 |
Note that after all, an array 'A' is a special kind of pointer where A = &A. |
| 679 |
(label value and address are the same, equalling link-time label address |
| 680 |
stored in the symbol table) |
| 681 |
|
| 682 |
* Finally, let's reference below exception handler stubs, where they catch the |
| 683 |
exception vector number through their positions in the stubs table: |
| 684 |
|
| 685 |
/* Each stub code size = IDT_STUB_SIZE bytes */ |
| 686 |
.globl idt_exception_stubs |
| 687 |
idt_exception_stubs: |
| 688 |
i = 0 |
| 689 |
.rept EXCEPTION_GATES_CNT |
| 690 |
movq $i, %rsi |
| 691 |
jmp default_exception_handler |
| 692 |
i = i + 1 |
| 693 |
.endr |
| 694 |
|
| 695 |
Well, let's first remember here that we're pointing to code, so we're only |
| 696 |
interested in each stub address, not by any mean its contents. So, after |
| 697 |
all, since this is a table, an array fits our goals nicely. What will be the |
| 698 |
base object type? We only care about addresses, so let the types be 'char*': |
| 699 |
char pointers makes pointer arithmetic very simple. An initial try can be |
| 700 |
|
| 701 |
extern const char idt_exception_stubs[EXCEPTION_GATES_CNT]; |
| 702 |
|
| 703 |
Nice, but by this, each table cell is of size 1 byte (char). We want each |
| 704 |
table cell to have the size IDT_STUB_SIZE chars (bytes). Namely: |
| 705 |
|
| 706 |
idt_exception_stubs[EXCEPTION_GATES_CNT][IDT_STUB_SIZE] |
| 707 |
|
| 708 |
which makes the compiler automatically calculate any stub address for us. |
| 709 |
If we want the address of the third stub, we'll simply write: |
| 710 |
|
| 711 |
void *addr = &idt_exception_stubs[3]; |
| 712 |
|
| 713 |
* Summary: the simple declaration 'extern int x' directly reference the 4 bytes |
| 714 |
marked as 'x' in the executable, where 'x' address is specified at link-time |
| 715 |
in the program's symbol table. The operator [] in the declaration |
| 716 |
'extern type SYMBOL[]' leaves it open only to the number of base type objects |
| 717 |
which are present, not *where* they are. SYMBOL[n] just adds more static-time |
| 718 |
safety; SYMBOL[n][s] let the compiler automatically calculates an entry (of |
| 719 |
size 's')index for us. |
| 720 |
|
| 721 |
Meanwhile, the '*' operator in a declaration 'extern TYPE *P' only binds |
| 722 |
where the _8-byte variable_ P reside in the executable (its address), NOT |
| 723 |
where the base type object is. The base-object itself can be reached only by |
| 724 |
a double dereference: dereference of P address first, then a dereference of |
| 725 |
P's content itself (which is, in turn, an 8-byte address). |
| 726 |
|
| 727 |
|
| 728 |
The 8259A Programmable Interrupt Controller: |
| 729 |
-------------------------------------------- |
| 730 |
|
| 731 |
* References: |
| 732 |
- The Intel 8259A PIC datasheet. |
| 733 |
- The superb information in "The undocumented PC", chapter 17 - "Interrupt |
| 734 |
control and NMI". |
| 735 |
|
| 736 |
In subsequent text, and as said by the sheet, IRR = "Interrupt Request |
| 737 |
Register", ISR="In-Service Register", IMR = "Interrupt Mask Register". |
| 738 |
|
| 739 |
* As described by Brendan Trotter (bcos) on #osdev, and the sheet: |
| 740 |
|
| 741 |
An interrupt request is executed by raising an IR input (low to high) and |
| 742 |
holding it high until it is acknowledged (Edge Triggered Mode). Where |
| 743 |
"acknowledged" means doing so by the PIC chip, not by the CPU/OS . Also, in |
| 744 |
this case "acknowledges" just means that it recognises it and sets the |
| 745 |
corresponding flag in the IRR |
| 746 |
|
| 747 |
Basically, when an interrupt is received the PIC sets the corresponding flag in |
| 748 |
the IRR. When the PIC sends the IRQ to the CPU it clears the flag in the IRR |
| 749 |
and sets the flag in the ISR. And when the CPU sends EOI the PIC clears the |
| 750 |
flag in the ISR. The processor sends an EOI through the the INTA pin. |
| 751 |
|
| 752 |
If the PIC sends a high priority IRQ to the CPU and is waiting for EOI when a |
| 753 |
low priority IRQ occurs, then the PIC must remember that the low priority IRQ |
| 754 |
occured (so that it can tell the CPU about it when the CPU is finished with the |
| 755 |
high priority IRQ), by setting its IRR bit. |
| 756 |
|
| 757 |
The 8259 cannot be used with level interrupts. This is because when |
| 758 |
an interrupt line goes true, it is latched internally in the IRR. |
| 759 |
|
| 760 |
* Spuruious interrupts: |
| 761 |
|
| 762 |
As said by the sheet: "If no interrupt request is present at step 4 of either |
| 763 |
sequence (i.e., the request was too short in duration), the 8259A will issue an |
| 764 |
interrupt level 7. Both the vectoring bytes and the CAS lines will look like an |
| 765 |
interrupt number 7 was requested." |
| 766 |
|
| 767 |
And as said in undoc pc: "Although a noise glitch should never occur, a poor |
| 768 |
bus design or badly glitch will trigger an invalid interrupt. If the glitch |
| 769 |
occurs on IRQ 0 to IRQ7, interrupt F (which is mapped to IRQ 7) will be |
| 770 |
called. The service handler for interrupt Fh and 77H (assuming the PIC wasn't |
| 771 |
re-mapped to more sane vector-numbers) should read the in-service register to |
| 772 |
confirm that a real interrupt occurred. An invalid interrupt has In-Service bit |
| 773 |
7 cleared, and a valid interrupt has valid bit 7 set. |
| 774 |
|
| 775 |
* So why should the PIC bother sending the CPU a spurious 7 IRQ? it could've just |
| 776 |
ignored the noise glitch source. The problem is that it might have told the CPU |
| 777 |
it's about to get an interrupt, and if it did (step 1, 2, and 3), it can not |
| 778 |
now decode the interrupt info since the line has been deasserted. Either the |
| 779 |
assert period was too short that the IRQ number was not latched internally, or |
| 780 |
the PIC just recognized it was a noise glitch due to the very short assertion |
| 781 |
period. So it gives the programmed IRQ 7 vector number to the CPU. This avoids |
| 782 |
unccessary communication with the real devices attached to IRQ7. |
| 783 |
|
| 784 |
* Take care that, as said by linux kernel comments, the two spurious IRQs 7 and |
| 785 |
15 usually resulting from the 8259A-1|2 PICs do occur even if all the IRQs are |
| 786 |
masked (IMR = 0xff). |
| 787 |
|
| 788 |
* An example of an interrupt (refs: undoc pc + intel sheet): |
| 789 |
|
| 790 |
1- The adapter card requests service, by driving the hardware IRQ line 5 from |
| 791 |
low to high |
| 792 |
|
| 793 |
2- The IRQ 5 line is connected to the PIC. The edge-triggered request is |
| 794 |
recorded in the controllers Interrupt-Request Register (IRR). In this case, |
| 795 |
the IRR bit 5 is set. |
| 796 |
|
| 797 |
3- The PIC first checks if the Interrupt Mask Register (IMR) to see if IRQ5 is |
| 798 |
allowed (not masked, IMR bit 5 = 0). Assuming that bit clear, the PIC moves |
| 799 |
to the next step. |
| 800 |
|
| 801 |
4- Using its priority resolver uint, the conntroller checks if any higher |
| 802 |
priority interrupts are active or in progress. If so, no further action is |
| 803 |
done till all higher priority irqs are serviced. |
| 804 |
|
| 805 |
5- Once all higher irqs are serviced, the controller flags the CPU over the |
| 806 |
dedicated INT hardware line leading to the CPU. |
| 807 |
|
| 808 |
6- If the CPU has hardware interrupts enabled, the CPU returns an interrupt |
| 809 |
acknowledgment to the PIC using the INTA pin. |
| 810 |
|
| 811 |
7- Directly after receiving a pulse from the INTA pin, the relevant In-Service |
| 812 |
register bit (bit 5) is set, and the corresponding IRR bit is reset. |
| 813 |
|
| 814 |
8- The processor sends another INTA pulse. During this pulse, the PIC sends |
| 815 |
a programmed 8-bit pointer back to the CPU, according the raised IRQ. In |
| 816 |
the default bios programmed PIC case, the pic will return the interrupt |
| 817 |
value 0x0D for IRQ5. |
| 818 |
|
| 819 |
9- In an instruction boundary, the cpu interrupts the current process and |
| 820 |
jumps to the code address specified in the IDT entry 0xD. For x86-64 cpus, |
| 821 |
entry 0xD = IDT base (specified through lidt) + 0xD * 16. |
| 822 |
|
| 823 |
10- One of the key aspects of the interrupt handler specified in the IDT entry |
| 824 |
must make is to remove the interrupt request made by the adapter. The line |
| 825 |
maybe shared, and several ISRs got executed, where they determine first if |
| 826 |
they are servicing the right device using the "interrupt pending" register |
| 827 |
found in PCI devices, etc. |
| 828 |
|
| 829 |
11- At the end of the interrupt handler, the handler sends the value 0x20 to |
| 830 |
the PIC command port to indicate an End-of-Interrupt (EOI). The EOI command |
| 831 |
clears the IRQ5 bit in the In-Service Regisger.(*) The case is different in |
| 832 |
case of Automatic EOI (AEOI) mode, and is described later. |
| 833 |
|
| 834 |
12- By clearing the IRQ5 bit in the ISR, the chip is now ready to accept another |
| 835 |
interrupt from IRQ5 or any lower priority interrupt. This completes the |
| 836 |
interrupt cycle. |
| 837 |
|
| 838 |
|
| 839 |
* On x86 SMP initialization sequence: |
| 840 |
------------------------------------- |
| 841 |
|
| 842 |
* References: |
| 843 |
- Intel MP specification v1.4, Appendix B - Operating System Programming |
| 844 |
Guideleins. |
| 845 |
- Intel Software Dev. Manual vol3A, Chapter 8, 8.4 - Multiple-Processor (MP) |
| 846 |
initialization. |
| 847 |
- Intel Software Dev. Manual vol3B, Appendix C - MP Initialization for P6 |
| 848 |
Family Processors. |
| 849 |
|
| 850 |
* Generally, there're two base x86 MP initialization sequences. One for the |
| 851 |
processors with the external APIC, and a second for processors with internal |
| 852 |
APIC (P6+ x86 processors). |
| 853 |
|
| 854 |
We won't bother ourself with the processors with external APICs (80486 and |
| 855 |
early Pentiums) since we're only targetting long mode. |
| 856 |
|
| 857 |
All long mode capable processors contains an integrated apic and comply with |
| 858 |
the new Intel MP bootup sequence, where no FIPI or BIPI are needed. |
| 859 |
|
| 860 |
On this new sequence, the INIT IPI on AP cores let them wait for the SIPI to |
| 861 |
get their starting execution vector instead of _unconditionally_ executing the |
| 862 |
BIOS init code. This avoids the ugly dependency on AT+ BIOSes hacks like the |
| 863 |
warm-reset vector to let the AP cores start executing from a custom address |
| 864 |
instead of BIOS init code. Search the MP spec for `Warm-reset vector'. |
| 865 |
|
| 866 |
* After system-wide bootup, or reset, the MP initialization protocol is going to |
| 867 |
be executed. In this protocol, the hardware setup the APIC IDs for each logical |
| 868 |
core according to system topology and assign one of the cores as the bootstrap |
| 869 |
core and the others as APs (Application processors/cores). |
| 870 |
|
| 871 |
The core nominated as the BSC will fetch and begin executing BIOS bootstrap |
| 872 |
code at the reset vector physical address 0xfffffff0. |
| 873 |
|
| 874 |
The remaining cores (setup as APs) will enter a "wait-for-SIPI" state. |
| 875 |
|
| 876 |
* Afterwars, the BIOS init code builds its ACPI and MP tables and count the |
| 877 |
number of AP cores, and update its tables entries for each new AP core |
| 878 |
found. For the algorithm, check Intel's volume 3A 8.4.4 MP Initialization |
| 879 |
Example. |
| 880 |
|
| 881 |
* At the end, the BIOS puts each AP core to halted state with interrupts |
| 882 |
disabled. By the nature of the halted state with interrupts disabled, only |
| 883 |
INIT, #NMI, #SMI are able to start the cores again. |
| 884 |
|
| 885 |
IMPORTANT: The MP protocol will be executed only after a power-up or RESET. If |
| 886 |
the MP protocol has been completed and a BSP has been chosen, subsequent INITs |
| 887 |
(either to a specific processor or system wide) do not cause the MP protocol to |
| 888 |
be repeated. Instead, each processor examines its BSP flag (in the APIC_BASE |
| 889 |
MSR) to determine whether it should execute the BIOS boot-strap code (if it is |
| 890 |
the BSP) or enter a wait-for-SIPI state (if it is an AP). |
| 891 |
|
| 892 |
So any INIT to any of the AP cores by the OS boot code will let it enter a |
| 893 |
'wait-for-SIPI' state to be able to get its start execution vector. |
| 894 |
|
| 895 |
* Note that the cores get started with all its states uninitialized. We have |
| 896 |
to update each core from real to long mode, and update its MMU and other |
| 897 |
system tables to match the bootstrap. |
| 898 |
|
| 899 |
Kernels usually call this part the SMP `trampoline', since it let the cores |
| 900 |
jump from its raw uninitialized state to a state similar to the bootstrap's |
| 901 |
one. |
| 902 |
|
| 903 |
|
| 904 |
* On linkers 'relocation error's: |
| 905 |
--------------------------------- |
| 906 |
|
| 907 |
* References: |
| 908 |
- Relocation types (R_X86_64_*): System V ABI AMD64 supplement, |
| 909 |
section 4.4 - Relocation |
| 910 |
- The Executable and Linkable Format (ELF) spec, Book I - Relocation |
| 911 |
|
| 912 |
* One of the errors that faced me at a number of sensitive places like the |
| 913 |
kernel head, the SMP trampoline, and the linker script was LD linker's |
| 914 |
'relocation error'. Due to their relative opaqueness, they puzzled me to |
| 915 |
a great extent at first. An example of some of those errors from linking |
| 916 |
a preliminary trampoline code to the rest of the kernel image was: |
| 917 |
|
| 918 |
trampoline.o: In function `trampoline': |
| 919 |
(.text+0x1): relocation truncated to fit: R_X86_64_16 against `.text' |
| 920 |
trampoline.o: In function `_start': |
| 921 |
(.text+0xc): relocation truncated to fit: R_X86_64_16 against `.data' |
| 922 |
trampoline.o: In function `_start': |
| 923 |
(.text+0x11): relocation truncated to fit: R_X86_64_16 against `.data' |
| 924 |
trampoline.o: In function `_start': |
| 925 |
(.text+0x14): relocation truncated to fit: R_X86_64_32 against `.bss' |
| 926 |
trampoline.o: In function `_start': |
| 927 |
(.text+0x1d): relocation truncated to fit: R_X86_64_32 against `.bss' |
| 928 |
trampoline.o: In function `_start': |
| 929 |
(.text+0x21): relocation truncated to fit: R_X86_64_32 against `.bss' |
| 930 |
trampoline.o: In function `_start': |
| 931 |
(.text+0x27): relocation truncated to fit: R_X86_64_32 against `.bss' |
| 932 |
trampoline.o: In function `_start': |
| 933 |
(.text+0x2b): relocation truncated to fit: R_X86_64_32 against `.bss' |
| 934 |
|
| 935 |
* We need to understand relocation to parse above errors. Relocation Information |
| 936 |
sections contain information about unresolved references. Since compilers and |
| 937 |
assemblers do not know at what absolute memory address a symbol will be |
| 938 |
allocated, and since they are unaware of definitions of symbols in other files, |
| 939 |
__every__ reference to such a symbol will create a relocation entry. |
| 940 |
|
| 941 |
The relocation entry will point to the code address where the reference is |
| 942 |
being made, and to the symbol table entry that contains the symbol that is |
| 943 |
referenced. The linker will use this information to fill in the correct address |
| 944 |
after it has allocated addresses to all symbols. |
| 945 |
|
| 946 |
* Dumping the ELF object trampoline.o relocation section ($readelf -a) leads to: |
| 947 |
|
| 948 |
Relocation section '.rela.text' at offset 0x438 contains 8 entries: |
| 949 |
Offset Info Type Sym. Value Sym. Name + Addend |
| 950 |
000000000001 00010000000c R_X86_64_16 0000000000000000 .text + 5 |
| 951 |
00000000000c 00020000000c R_X86_64_16 0000000000000000 .data + 0 |
| 952 |
000000000011 00020000000c R_X86_64_16 0000000000000000 .data + 6 |
| 953 |
000000000014 00030000000a R_X86_64_32 0000000000000000 .bss + 0 |
| 954 |
00000000001d 00030000000a R_X86_64_32 0000000000000000 .bss + 0 |
| 955 |
000000000021 00030000000a R_X86_64_32 0000000000000000 .bss + 1007 |
| 956 |
000000000027 00030000000a R_X86_64_32 0000000000000000 .bss + 0 |
| 957 |
00000000002b 00030000000a R_X86_64_32 0000000000000000 .bss + 2007 |
| 958 |
|
| 959 |
The relocation table fields are described in the ELF specification. In summary: |
| 960 |
|
| 961 |
- Offset: location to apply relocation action relative to section start |
| 962 |
- Info: symbol table index wrt which the relocation must be made |
| 963 |
- Type: from the AMD64 ABI. _16: max bit width = 16 (rmode). _32: 32 bits max |
| 964 |
- Addend: constant addend used to compute the value to be stored in the |
| 965 |
relocatable field. Ex: `lgdt $(gdt + 0x10)' will print `.data + 10' where |
| 966 |
the addend = 0x10 |
| 967 |
|
| 968 |
* Note that all relocatable fields are filled with zeroes in the trampoline |
| 969 |
ELF object trampoline.o: |
| 970 |
|
| 971 |
# .code16 (real-mode) |
| 972 |
0000000000000000 <trampoline>: |
| 973 |
0: ea 00 00 00 60 ljmp $0x6000,$0x0 # ljmp $0x6000, $_start |
| 974 |
|
| 975 |
0000000000000005 <_start>: |
| 976 |
5: e8 2a 00 call 32 <print> # RIP-relative (7 + 0x002a = 32) |
| 977 |
8: fa cli |
| 978 |
9: 0f 01 1e 00 00 lidtw 0x0 # lidt $idtdesc |
| 979 |
e: 0f 01 16 00 00 lgdtw 0x0 # lgdt $gdtdesc |
| 980 |
13: b8 00 00 00 00 mov $0x0,%eax # movl %cr3, %eax |
| 981 |
18: 0f 22 d8 mov %eax,%cr3 |
| 982 |
1b: c7 05 00 00 00 movw $0x0,(%di) # movl $(init_pgt+0x1007), init_pgt |
| 983 |
20: 00 00 00 00 00 |
| 984 |
25: c7 05 00 00 00 movw $0x0,(%di) # movl $(init_pgt+0x2007), init_pgt |
| 985 |
2a: 00 00 00 00 00 |
| 986 |
2f: f4 hlt |
| 987 |
30: eb fd jmp 2f <_start+0x2a> |
| 988 |
|
| 989 |
0000000000000032 <print>: |
| 990 |
|
| 991 |
Also note that the offsets in the error messages (.text+0x1), (.text+0x11), .. |
| 992 |
are the same offsets reported in the relocation table entries, and are the |
| 993 |
same ones filled with zero in the above disassembly. |
| 994 |
|
| 995 |
* With above information in hand, we can simply deduce the errors cause. We link |
| 996 |
the trampoline assembly code against our long mode kernel, but the kernel linker |
| 997 |
script sets up the virtual addresses base at 0xffffffff80000000. |
| 998 |
|
| 999 |
So the linker, obeying the kernel script, tries to assign huge (> -2GB) virtual |
| 1000 |
addresses to our symbols. The linker then discovers that since the trampoline |
| 1001 |
to-be-relocated symbols are setup with the R_X86_64_16 model, it's limited to a |
| 1002 |
width of 16 bits, and limited to 32 bits in case of R_X86_64_32. Ofcourse those |
| 1003 |
bits aren't sufficient, so it has to truncate the addresses producing a wrong |
| 1004 |
final executable. |
| 1005 |
|
| 1006 |
In a nutshell, there was a reported error _for each_ relocatable symbol used, |
| 1007 |
reflecting the number of relocation table entries with a type/bit-width that is |
| 1008 |
insufficient to the linker script setup 64-bit virtual addresses. |
| 1009 |
|
| 1010 |
|
| 1011 |
* When %RIP-relativeness goes wrong: |
| 1012 |
------------------------------------ |
| 1013 |
|
| 1014 |
* References: |
| 1015 |
- Intel Instruction Set Reference, A-M, vol. 2A. 3.2-CALL--Call procedure. |
| 1016 |
|
| 1017 |
* One of the interesting errors I faced in jumping away from head or trampoline |
| 1018 |
code to C code was %rip-relativeness related errors. Here are two interesting |
| 1019 |
cases: |
| 1020 |
|
| 1021 |
In head.S, after switching to long mode, and setting up the higher-half |
| 1022 |
virtual mappings, we jump to C code: |
| 1023 |
|
| 1024 |
.code64 |
| 1025 |
startup_64: |
| 1026 |
movq $stack_end, %rsp |
| 1027 |
... |
| 1028 |
# .. Setup identity mappings |
| 1029 |
# .. Setup high-half virtual addresses mappings |
| 1030 |
# .. Apply the page tables |
| 1031 |
# .. let %RIP be its virtual address equivalent |
| 1032 |
# .. let %RSP, and GDTR be their virtual addresses equivalent |
| 1033 |
# .. Let NULL pointers segfault and avoid physical-address dependent |
| 1034 |
# code: remove identity mappings |
| 1035 |
... |
| 1036 |
call kernel_start # C method: void kernel_start(void); |
| 1037 |
|
| 1038 |
But unfortunately we get the machine constantly triple-faulting. As indicated |
| 1039 |
by Bochs debugging output: |
| 1040 |
|
| 1041 |
[CPU1 ] BxError: Encountered an unknown instruction b1=0x0e (signalling #UD) |
| 1042 |
[CPU1 ] modrm was 0x00, nnn was 0, rm was 6 |
| 1043 |
[CPU1 ] exception(0x06): error_code=0000 |
| 1044 |
|
| 1045 |
Disassembling the object file, we see that everything is at place. And `callq' |
| 1046 |
points to 0xffffffff80109250; the `kernel_start' method: |
| 1047 |
|
| 1048 |
# head.o |
| 1049 |
00000000001000b9 <startup_64>: |
| 1050 |
1000b9: 48 c7 c4 00 50 10 00 mov $0x105000,%rsp |
| 1051 |
... |
| 1052 |
10012f: e8 1c 91 00 80 callq ffffffff80109250 <kernel_start> |
| 1053 |
|
| 1054 |
Or does it? |
| 1055 |
|
| 1056 |
A closer inspection at the binary disassembly reveals that the call is a |
| 1057 |
near, relative, displacement relative to next instruction, 32-bit displacement |
| 1058 |
sign-extended to 64-bits in long mode call. |
| 1059 |
|
| 1060 |
So the instruction is at 0x10012f, with a length of 5 bytes, relative |
| 1061 |
displacemented is calculated from 0x10012f + 5 = 0x100134. |
| 1062 |
|
| 1063 |
Code wants to call `kernel_start', so the relative displacemented = |
| 1064 |
0xffffffff80109250 - 0x100134 = 0xffffffff8000911c, which is the value encoded |
| 1065 |
in the `callq' instruction after sign-extension in a little-endian form. So |
| 1066 |
what's wrong? |
| 1067 |
|
| 1068 |
Notice that by the time the CPU is executing the `callq', it's actually using a |
| 1069 |
__different__ %RIP than 0x10012f. The %RIP is in its (%RIP + 0xffffffff80000000) |
| 1070 |
virtual form as the identity mappings are to be removed. |
| 1071 |
|
| 1072 |
This clearly messes up the %rip-relative-displacement calculations, and sends |
| 1073 |
the CPUs to unpredictable areas, leading to unknown opcode exceptions and |
| 1074 |
other horrors. |
| 1075 |
|
| 1076 |
* Another similar case was faced in the trampoline code, when it jumps away to |
| 1077 |
the rest of AP cores C init code. The `callq' instruction was just sending |
| 1078 |
the CPU to irregular places, with horrifying results. |
| 1079 |
|
| 1080 |
Why the callq misbehaved in trampoline? Because the trampoline code is linked |
| 1081 |
to the kernel, but actually copied to a below-1MB 4KB-page, and then executed |
| 1082 |
from there. This messed up external %rip-relativeness. |
| 1083 |
|
| 1084 |
Calls to internal trampoline methods worked because those methods retained their |
| 1085 |
relative distances between each other when they were copied. On the other hand, |
| 1086 |
and by definition, external distances was not kept when calling outside code |
| 1087 |
(like the C application_start() code): |
| 1088 |
|
| 1089 |
# trampoline.o |
| 1090 |
ffffffff8010ae06 <startup_64>: |
| 1091 |
ffffffff8010ae06: 48 8d a4 24 00 00 00 lea -0x80000000(%rsp),%rsp |
| 1092 |
ffffffff8010ae0d: 80 |
| 1093 |
... |
| 1094 |
ffffffff8010ae2e: e8 0d e3 ff ff callq ffffffff80109140 <application_start> |
| 1095 |
ffffffff8010ae33: 90 nop |
| 1096 |
|
| 1097 |
* Solution: The two examples are special cases of PIC, position-independent code. |
| 1098 |
The trampoline is PIC cause it's executed from different position than where |
| 1099 |
it's linked. The head.S code also applies as parts of it get executed using a |
| 1100 |
different %rip than what it's linked as. |
| 1101 |
|
| 1102 |
In the two above cases, we can simply solve them by specifying the absolute |
| 1103 |
address of destination code instead of using relative x86 calls. As said in |
| 1104 |
Intel instruction reference, x86-64 has a near call, with absolute indirect |
| 1105 |
addressing facility. It can be used in GNU AS as follows: |
| 1106 |
|
| 1107 |
movq $application_start, %rax |
| 1108 |
jmpq *%rax |
| 1109 |
|
| 1110 |
And also for head.S: |
| 1111 |
|
| 1112 |
movq $kernel_start, %rax |
| 1113 |
callq *%rax |
| 1114 |
|
| 1115 |
# head.o |
| 1116 |
00000000001000b9 <startup_64>: |
| 1117 |
1000b9: 48 c7 c4 00 50 10 00 mov $0x105000,%rsp |
| 1118 |
... |
| 1119 |
10012f: 48 c7 c0 50 92 10 80 mov $0xffffffff80109250,%rax |
| 1120 |
100136: ff d0 callq *%rax |
| 1121 |
|
| 1122 |
which solves our problem nicely. |
| 1123 |
|
| 1124 |
* Summary: For a code object which is run from a _different_ address than what |
| 1125 |
it was originally linked to, a 'jmp' or 'call' op to a label (or an address) |
| 1126 |
outside of that unlucky object needs absolute addresses instead of the |
| 1127 |
_default_ %rip-relative ones. In this case, the 'next opcode address', being |
| 1128 |
the base of %RIP-relativeness calculations, is not the same at both link-time |
| 1129 |
and run-time; at run-time, the linker-calculated relative offset of the |
| 1130 |
destination becomes completely invalid due to the _now-different_ %RIP base, |
| 1131 |
jumping intentionally to the unknown. |
| 1132 |
|
| 1133 |
Meanwhile, relative jumps and calls to labels within the same object are still |
| 1134 |
valid: the relative distance is still maintained within the object itself, no |
| 1135 |
matter where it's running from. This kind of internal relativeness is usually |
| 1136 |
calculated by the assembler (instead of the linker) and is safe here. |
| 1137 |
|
| 1138 |
|
| 1139 |
Testing libc()-like kernel code: |
| 1140 |
-------------------------------- |
| 1141 |
|
| 1142 |
* An interesting way to test libc()-like kernel code (e.g. std string methods) |
| 1143 |
is to replace system's libc implementation with our own. Afterwards, we run a |
| 1144 |
huge program (Firefox, OpenOffice or gcc) and observe the results. |
| 1145 |
|
| 1146 |
Those tests usually cover huge number of cases, and give a good indicator of |
| 1147 |
tested kern code stability. I've usually found that even the smallest mistake |
| 1148 |
in such tests produce either a non-behaving app or a direct segfault. |
| 1149 |
|
| 1150 |
* Unix linkers LD_PRELOAD feature can help us immensely. Assuming our string |
| 1151 |
methods resides at 'string_lib.c', we do something similar to: |
| 1152 |
|
| 1153 |
# Position-independent object, suitable for dynamic linking |
| 1154 |
$ gcc --std=gnu99 -O3 -fPIC -c string_lib.c -o string_lib.o |
| 1155 |
|
| 1156 |
# Produce shared object (instead of final executable) that can |
| 1157 |
# be linked with other objects to form an executable. |
| 1158 |
$ gcc -shared -o string_lib.so string_lib.o |
| 1159 |
|
| 1160 |
$ LD_PRELOAD='./string_lib.so' firefox-bin |
| 1161 |
|
| 1162 |
Note: always run these tests with the binaries themselves instead of |
| 1163 |
their bash-script wrappers. Such shell wrappers don't play very well |
| 1164 |
with LD_PRELOAD at times. |
| 1165 |
|
| 1166 |
* It is interesting to note that a 30-second run of Firefox produced |
| 1167 |
655,227 memset() and memcpy() calls! A 1-minute run of Google Chrome |
| 1168 |
produced 1,889,087 calls! (counted by `wc -l'). |
| 1169 |
|
| 1170 |
|
| 1171 |
On SMP Memory ordering models: |
| 1172 |
------------------------------ |
| 1173 |
|
| 1174 |
* References: |
| 1175 |
- Kourosh Gharachorloo, "Memory Consistency Models for Shared-Memory |
| 1176 |
Multiprocessors", Digital Equipment (DEC) with Stanford and DARPA support |
| 1177 |
- Paul McKenny, Memory ordering in modern processors, 2007.09.19 Draft |
| 1178 |
- Paul McKenny, Memory Barriers: a Hardware View for Software Hackers, IBM |
| 1179 |
- Linux Kernel Documentation/memory-barriers.txt, kernel.org |
| 1180 |
- DEC/HP OpenVMS Ask the Wizard, http://www.webcitation.org/5laUMQfyi |
| 1181 |
- Intel Manuals vol3A, chapter8, MP Management, 8.2 - Memory Ordering |
| 1182 |
- AMD64 Architecture Programmer's Manual, chapter 7 - Memory System |
| 1183 |
|
| 1184 |
* To understand memory ordering models correctly, we need to understand which |
| 1185 |
parts of our code are loads, which are stores, and which are in fact both. |
| 1186 |
Below example should cover basic cases of shared global variables usage, |
| 1187 |
compare the C code with its assembly output: |
| 1188 |
|
| 1189 |
int *global_symbol = (int *)0xdeadbeef; |
| 1190 |
|
| 1191 |
int main(void) { |
| 1192 |
/* 1) memory load + dependent memory store to stack */ |
| 1193 |
int t = global_symbol; |
| 1194 |
|
| 1195 |
/* 2) memory store */ |
| 1196 |
global_symbol = (int *)0xbbbb; |
| 1197 |
|
| 1198 |
/* A memory load + dependent memory store */ |
| 1199 |
*global_symbol = 0xcccc; |
| 1200 |
} |
| 1201 |
|
| 1202 |
Disassembly of section .data: |
| 1203 |
... |
| 1204 |
0000000000600840 <global_symbol>: |
| 1205 |
600840: ef |
| 1206 |
600841: be ad de 00 00 |
| 1207 |
|
| 1208 |
Disassembly of section .text: |
| 1209 |
... |
| 1210 |
0000000000400448 <main>: |
| 1211 |
400448: 55 push %rbp |
| 1212 |
400449: 48 89 e5 mov %rsp,%rbp |
| 1213 |
# 1) Mem. load + store to stack |
| 1214 |
40044c: 48 8b 05 dd 03 20 00 mov 0x2003dd(%rip),%rax # 600830 <global_symbol> |
| 1215 |
400453: 89 45 fc mov %eax,-0x4(%rbp) |
| 1216 |
# 2) Mem. store |
| 1217 |
400456: 48 c7 05 cf 03 20 00 movq $0xbbbb,0x2003cf(%rip) # 600830 <global_symbol> |
| 1218 |
40045d: bb bb 00 00 |
| 1219 |
# 3) Mem. load + store to pointed value |
| 1220 |
400461: 48 8b 05 c8 03 20 00 mov 0x2003c8(%rip),%rax # 600830 <global_symbol> |
| 1221 |
400468: c7 00 cc cc 00 00 movl $0xcccc,(%rax) |
| 1222 |
40046e: c9 leaveq |
| 1223 |
40046f: c3 retq |
| 1224 |
|
| 1225 |
* Terminology of memory orders: |
| 1226 |
- Program order: the memory operations order specified in object code. |
| 1227 |
- Perceived order: the order a CPU perceives of its and other CPUs memory ops. |
| 1228 |
|
| 1229 |
* Memory ordering is a tradeoff between programming ease and performance. On |
| 1230 |
uniprocessors, many ops are assumed to execute in program order. Processors can |
| 1231 |
provide the illusion of sequensiality by only preserving sequential order among |
| 1232 |
memory operations to the same location. |
| 1233 |
|
| 1234 |
Sequential consistency defined by Lamport provides a simple model to programmers, |
| 1235 |
but it adversely affects efficiency and performance: the illusion trick provided |
| 1236 |
in uniprocessors can't be used on MP systems to provide sequential consistency. |
| 1237 |
|
| 1238 |
Weak ordering distinguishes between ordinary memory operations and memory ops |
| 1239 |
used for synchronization. By ensuring consistency only at the synchronization |
| 1240 |
points, weak ordering allows ordinary operations in between pairs of |
| 1241 |
synchronizations to be reordered with respect to one another. |
| 1242 |
|
| 1243 |
Such specifications can be considered system-centric since they in effect expose |
| 1244 |
the low-level system optimizations to the programmer. |
| 1245 |
|
| 1246 |
* General rules for all architectures: |
| 1247 |
- A given CPU will always _perceive_ its own memory operations as occuring |
| 1248 |
in program order, without a need for memory barriers. |
| 1249 |
- Naturally-aligned loads and stores are always atomic |
| 1250 |
- Code that uses standard synchronization primitives (spinlocks, RCU, ..) |
| 1251 |
should not need explicit mem barriers. Only tricky code that bypasses those |
| 1252 |
primitives need those barriers. An example of a linked list lockless search |
| 1253 |
is given below. |
| 1254 |
|
| 1255 |
* Example: |
| 1256 |
|
| 1257 |
Writer CPU: |
| 1258 |
----------- |
| 1259 |
# Linked list head is `smack_known' |
| 1260 |
struct smack_known *smack_known; |
| 1261 |
mutex_lock(&smack_known_lock); |
| 1262 |
if (skp == NULL) { |
| 1263 |
skp = kzalloc(sizeof(struct smack_known), GFP_KERNEL); |
| 1264 |
if (skp != NULL) { |
| 1265 |
1- skp->smk_next = smack_known; |
| 1266 |
2- skp->smk_secid = smack_next_secid++; |
| 1267 |
3- skp->smk_cipso = NULL; |
| 1268 |
/* 1- BUG: Write barrier needed! */ |
| 1269 |
4- smack_known = skp; |
| 1270 |
} |
| 1271 |
} |
| 1272 |
mutex_unlock(&smack_known_lock); |
| 1273 |
|
| 1274 |
Reader CPU: |
| 1275 |
----------- |
| 1276 |
# Just locklessly access the list |
| 1277 |
struct smack_label *skp; |
| 1278 |
skp = smack_known; |
| 1279 |
while (skp) { |
| 1280 |
/* 2- BUG: Alpha CPUs!! */ |
| 1281 |
5- if (skp->smk_secid == 10) { |
| 1282 |
... |
| 1283 |
} |
| 1284 |
... |
| 1285 |
skp = skp->next; |
| 1286 |
} |
| 1287 |
|
| 1288 |
Bug 1: |
| 1289 |
------ |
| 1290 |
In the writer CPU, the latest 3 stores can be re-ordered in any combination. |
| 1291 |
In arhitectures that reorder stores like Alpha, Itanum, POWER, and SPARC, this |
| 1292 |
may lead other CPUs to perceive an incomplete node added to the list: a garbage |
| 1293 |
smk_secid value. This may even lead to graver results if you notice that the |
| 1294 |
first line (skp->smk_next = smack_known) is actually a load followed by a |
| 1295 |
store. So the sequence is: |
| 1296 |
|
| 1297 |
Load from smack known |
| 1298 |
store to ->smk_next address |
| 1299 |
... |
| 1300 |
store to smack_known |
| 1301 |
|
| 1302 |
The cpu can reorder the last stores, leading to a linked list head with a |
| 1303 |
bogus next pointer; effectively a user-space segmentation fault or worse, |
| 1304 |
a CPU exception if in kernel mode. |
| 1305 |
|
| 1306 |
Bug 2: |
| 1307 |
------ |
| 1308 |
As said by OpenVMS documenatation (Ask the Wizard): |
| 1309 |
|
| 1310 |
"Your producer must issue a 'memory barrier' instruction after writing the data |
| 1311 |
to shared memory and before inserting it on the queue; likewise, your consumer |
| 1312 |
_must_ issue a memory barrier instruction after removing an item from the queue |
| 1313 |
and before reading from its memory. Otherwise, you risk seeing stale data, |
| 1314 |
since, while the Alpha processor does provide coherent memory, it does not |
| 1315 |
provide implicit ordering of reads and writes. (That is, the write of the |
| 1316 |
producer's data might reach memory after the write of the queue, such that the |
| 1317 |
consumer might read the new item from the queue but get the previous values |
| 1318 |
from the item's memory." |
| 1319 |
|
| 1320 |
But how does the consumer sees the pointer before its data even if write |
| 1321 |
barriers were used? |
| 1322 |
|
| 1323 |
As said by McKenny, the producer's write barrier will guarantee that the cache |
| 1324 |
invalidates performed by the first 3 lines will reach the interconnect before |
| 1325 |
that of the last line, but it "makes absolutely no guarantee about the order in |
| 1326 |
which the new values will reach the reading CPU’s core." |
| 1327 |
|
| 1328 |
So assume smack_known, and smack_known->smk_secid are on different cache banks. |
| 1329 |
Because of the write barrier, the producer's data will reach the interconnect |
| 1330 |
before updating the queue. The bank holding smack_known is idle, so its line |
| 1331 |
gets invalidated and updated, while the bank holding smk_secid was busy, so the |
| 1332 |
cache invalidates for the new element are being delayed. This leads to seeing |
| 1333 |
new value for the pointer, but old cached values for the new element at line 5. |
| 1334 |
|
| 1335 |
Linux kernel documentation explains by stating "some versions of the Alpha CPU |
| 1336 |
have a split data cache, permitting them to have two semantically-related cache |
| 1337 |
lines updated at separate times. This is where the data dependency barrier |
| 1338 |
really becomes necessary as this synchronises both caches with the memory |
| 1339 |
coherence system, thus making it seem like pointer changes vs new data occur in |
| 1340 |
the right order. |
| 1341 |
|
| 1342 |
* On the x86 side, only below basic memory access is guaranteed to be atomic: |
| 1343 |
- Readnig or writing a byte |
| 1344 |
- Reading or writing a word aligned on 16-bit boundary |
| 1345 |
- Reading or writing a doubleword aligned on 32-bit boundary |
| 1346 |
- Reading or writing a quadword aligned on 64-bit boundary |
| 1347 |
|
| 1348 |
On P6 family processors and up: |
| 1349 |
- Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within |
| 1350 |
a cache line. (but how do we know whether a memory access fits a cache line?) |
| 1351 |
|
| 1352 |
|
| 1353 |
On the C 'volatile' type qualifier: |
| 1354 |
----------------------------------- |
| 1355 |
|
| 1356 |
* References: |
| 1357 |
- Eric Eid & John Regehr, "Volatiles Are Miscompiled, and What to Do about It", |
| 1358 |
Proceedings of the Eighth ACM and IEEE International Conference. |
| 1359 |
- Attached Linux kernel's "volatile-considered-harmful.txt" document, added |
| 1360 |
with various core developers thoughts copied from LKML. |
| 1361 |
- ISO C99 Committee Draft - September 7, 2007 |
| 1362 |
- My OSDEV forums post: http://forum.osdev.org/viewtopic.php?t=21210&start=17 |
| 1363 |
|
| 1364 |
* As said by the C standard, "An object that has volatile-qualified type may be |
| 1365 |
modified in ways unknown to the implementation or have other unknown side |
| 1366 |
effects. Therefore any expression referring to such an object shall be evaluated |
| 1367 |
strictly according to the rules of the abstract machine" |
| 1368 |
|
| 1369 |
"Actions on objects so declared shall not be optimized out by an implementation, |
| 1370 |
or reordered except as permitted by the rules for evaluating expressions". |
| 1371 |
|
| 1372 |
It's in a practical manner, "for every read or write to a volatile variable |
| 1373 |
that would be performed by a straightforward interpreter for C, exactly one |
| 1374 |
load from or store to the memory locations allocated to the variable must be |
| 1375 |
performed". |
| 1376 |
|
| 1377 |
* Volatile is used when we don't want the memory access to be re-ordered, cached, |
| 1378 |
or optimized away, as in accessing memory mapped I/O registers. Another case is |
| 1379 |
where global variables get changed behind our back by other execution paths as |
| 1380 |
in interrupt handlers or other threads: |
| 1381 |
|
| 1382 |
int var_set_by_irq_handler = 0; |
| 1383 |
.. |
| 1384 |
while (var_set_by_irq_handler == 0); |
| 1385 |
|
| 1386 |
The compiler has every right to optimize the loop to an infinite loop. To avoid |
| 1387 |
this, and force the compiler to access the global variable memory at each |
| 1388 |
iteration, we use volatile, since volatile objects "shall not be optimized out |
| 1389 |
by an implementation". |
| 1390 |
|
| 1391 |
* Take care that the volatile is a type qualifier as `constant'. So in below case: |
| 1392 |
|
| 1393 |
static volatile uint64_t *msg = NULL; /* (1) */ |
| 1394 |
while (msg == NULL) |
| 1395 |
asm volatile ("pause"); |
| 1396 |
|
| 1397 |
The above loop will still be optimized to an infinite loop. The declaration |
| 1398 |
states that the memory area _pointed_ by `msg' is volatile, not that the pointer |
| 1399 |
itself is volatile. So the above declaration should've been: |
| 1400 |
|
| 1401 |
static uint64_t * volatile msg = NULL; /* (2) */ |
| 1402 |
|
| 1403 |
specifying a volatile pointer. Declaration at (1) is perfectly fine though if |
| 1404 |
we're accessing the variable as in: |
| 1405 |
|
| 1406 |
while (*msg == 0x1234) |
| 1407 |
asm volatile ("pause"); |
| 1408 |
|
| 1409 |
cause now we're referencing the memory declared as volatile. The compiler must |
| 1410 |
fetch the memory pointed by 'msg' at each iteration. |
| 1411 |
|
| 1412 |
* As seen from above example and the ACM paper, volatiles are really subtle in C |
| 1413 |
and many industrial compilers get them wrong. I prefer the Linux way of handling |
| 1414 |
the issue, namely __minimizing__ 'volatile' as far as possible, and depending on |
| 1415 |
explicit compiler and cpu _barriers_, and SMP primitives as needed. |
| 1416 |
|
| 1417 |
* So instead of: |
| 1418 |
|
| 1419 |
static uint64_t * volatile msg = NULL; |
| 1420 |
while (msg == NULL) |
| 1421 |
asm volatile ("pause"); |
| 1422 |
|
| 1423 |
We use an explicit compiler barrier: |
| 1424 |
|
| 1425 |
static uint64_t *msg = NULL; |
| 1426 |
while (msg == NULL) |
| 1427 |
asm volatile ("pause": : :"memory"); |
| 1428 |
|
| 1429 |
The compiler barrier will assure "GCC not to keep memory values cached in |
| 1430 |
registers across the assembler instruction and not to optimize stores or loads |
| 1431 |
to that memory". In other words, it will force gcc to reload any values from |
| 1432 |
memory over the busy loop. |
| 1433 |
|
| 1434 |
* Instead of: |
| 1435 |
|
| 1436 |
volatile uint32_t *ioregsel = (uint32_t *)ioapic_base(apic); |
| 1437 |
volatile uint32_t *iowin = (uint32_t *)(ioapic_base(apic) + 0x10); |
| 1438 |
*ioregsel = reg; |
| 1439 |
return *iowin; |
| 1440 |
|
| 1441 |
where we don't want to the compiler to re-order the last two 'independent' |
| 1442 |
read and write cause this re-ordering will imply different semantics to the |
| 1443 |
programmed hardware. |
| 1444 |
|
| 1445 |
We use MMIO register accessors, which internally includes compiler barriers |
| 1446 |
and any needed volatile qualifiers (See comments at top of include/mmio.h |
| 1447 |
for further rational): |
| 1448 |
|
| 1449 |
uint32_t *ioregsel = (uint32_t *)ioapic_base(apic); |
| 1450 |
uint32_t *iowin = (uint32_t *)(ioapic_base(apic) + 0x10); |
| 1451 |
writel(reg, ioregsel); |
| 1452 |
return readl(iowin); |
| 1453 |
|
| 1454 |
* Finally, check attached LKML messages for extra cases and reasonings by Linus |
| 1455 |
and Andi Kleen. |
| 1456 |
|
| 1457 |
|
| 1458 |
On the vaguely documented x86 16-bit protected mode: |
| 1459 |
---------------------------------------------------- |
| 1460 |
|
| 1461 |
* References: |
| 1462 |
- Rick A. Hohensee, "The well-factored 386", LKML, |
| 1463 |
http://lkml.org/lkml/2003/7/28/55, http://www.webcitation.org/5luRetOqF |
| 1464 |
- Intel manuals vol3A, chapter 3, 3.4.3 - segment registers |
| 1465 |
- Intel Manuals vol3A, chapter9, Processor Management and Initialization, |
| 1466 |
9.9.2 - Switching Back to Real-Address Mode |
| 1467 |
- Robert R. Collins, "The Segment Descriptor Cache", Dr. Dobb's Journal |
| 1468 |
Undocumented Corner, http://www.rcollins.org/ddj/Aug98/Aug98.html, |
| 1469 |
http://www.webcitation.org/5lxH9NFGa |
| 1470 |
|
| 1471 |
* The x86 architecture determins the default operand and address size using |
| 1472 |
the D bit in the code segment. If D is set, it indicates 32-bit addresses |
| 1473 |
and operands. Otherwise, we're running in 16-bit mode, using 16-bit |
| 1474 |
addresses and operands. There's a similar C bit for 64-bit long mode. |
| 1475 |
|
| 1476 |
* One of the states that really puzzled me a lot was the state occurring in |
| 1477 |
the transition from bootup real-mode to 32-bit protected mode. Namely, |
| 1478 |
bewteen enabling protected mode (setting PE = 1) and getting executed in |
| 1479 |
32-bit by using a 32-bit (D = 1) code segment through a far jump. Setting |
| 1480 |
PE and %cs together is not atomic, and there's clearly something in-between. |
| 1481 |
|
| 1482 |
An example from our bootsector: |
| 1483 |
|
| 1484 |
/* Set the protected mode PE flag */ |
| 1485 |
movl %cr0, %eax |
| 1486 |
orl $0x1, %eax |
| 1487 |
movl %eax, %cr0 |
| 1488 |
|
| 1489 |
jmp flush_prefetch: |
| 1490 |
|
| 1491 |
/* What're the rules covering this state? we're in |
| 1492 |
* pmode but the cpu is still fetching and executing |
| 1493 |
* ops in 16-bits! */ |
| 1494 |
flush_prefetch: |
| 1495 |
movw $(gdt_ds - gdt), %ax |
| 1496 |
movw %ax, %ds |
| 1497 |
movw %ax, %es |
| 1498 |
movw %ax, %ss |
| 1499 |
DATA32 ljmp $0x08, $(reloc - _start + 0x7c00) |
| 1500 |
|
| 1501 |
* To exhaustively think about this, we need to cover all the permuatations |
| 1502 |
of PE (pmode enabled) and the %cs D bit: |
| 1503 |
|
| 1504 |
(1) PE=0 Dbit=0 Canonical (un)Real Mode |
| 1505 |
(2) PE=1 Dbit=0 16-bit protected mode |
| 1506 |
(3) PE=1 Dbit=1 Canonical 32-bit Protected Mode |
| 1507 |
(4) PE=0 Dbit=1 .. |
| 1508 |
_______________ |
| 1509 |
| WAKE-UP | |
| 1510 |
|canonical rmode| |
| 1511 |
| PE=0 use16 | |
| 1512 |
|_______________| |
| 1513 |
A |set PE to 1 |
| 1514 |
| | |
| 1515 |
set PE to 0| | |
| 1516 |
___|________V__ |
| 1517 |
| 16-bit | |
| 1518 |
| pmode | |
| 1519 |
| PE=1 use16 | |
| 1520 |
|_______________| |
| 1521 |
A |far xfer |
| 1522 |
| | to use32 CS |
| 1523 |
far xfer to | | |
| 1524 |
use16 CS| | |
| 1525 |
___|_______V___ |
| 1526 |
| 32-bit | |
| 1527 |
| pmode | |
| 1528 |
| PE=1 use32 | |
| 1529 |
|_______________| |
| 1530 |
A |set PE to 0 |
| 1531 |
| | |
| 1532 |
set PE to 1| | |
| 1533 |
___|________V__ |
| 1534 |
| Forreal Mode | |
| 1535 |
| Mode | |
| 1536 |
| PE=0 use32 | |
| 1537 |
|_______________| |
| 1538 |
|
| 1539 |
* We can not jump from (1) to (3) directly through a far jump using a D=1 code |
| 1540 |
segment, since we'll be in canonical rmode, and %cs will always be interpreted |
| 1541 |
as in the segmented 8086 model of (%cs << 4 + offset). |
| 1542 |
|
| 1543 |
But how can the cpu move to 32-bit execution, if 32-bit execution is controlled |
| 1544 |
by the D bit, but we can't change the D bit due to the 8086 semanticss? |
| 1545 |
|
| 1546 |
The answer is to provide a switch. If this switch is on, the cpu will the use |
| 1547 |
the 386 semantics of segments (and be able to modify the segment caches). If |
| 1548 |
it's clear, the cpu will still use the same segmentation circuites but it won't |
| 1549 |
be able to modify most of the segment caches fields like limit, and the D bit, |
| 1550 |
(the base field can still be changed within 2-bytes size), due to the |
| 1551 |
constraints of the 8086 semantics. |
| 1552 |
|
| 1553 |
This 'switch' is the PE (pmode Enabled) bit, which acts __independently__ of |
| 1554 |
whether the machine is running on 16 or 32-bit mode (D bit is set or clear) |
| 1555 |
|
| 1556 |
So under this 16-bit pmode (PE set, D clear), if we do: |
| 1557 |
|
| 1558 |
movw $0x10, %ax |
| 1559 |
movw %ax, %ds |
| 1560 |
|
| 1561 |
we're in fact using %ds as a segment selector, and are also updating the |
| 1562 |
segment caches with the values found in the third GDT cell. |
| 1563 |
|
| 1564 |
But if we're in canonical 16-bit rmode (PE & D clear), if we does the same: |
| 1565 |
|
| 1566 |
movw $0x10, %ax |
| 1567 |
movw %ax, %ds |
| 1568 |
|
| 1569 |
we're in fact using (%ds << 4) as the base field in the segment caches. |
| 1570 |
|
| 1571 |
* We cannot also jump directly from (3) 32-bit pmode, to cacnonical/bootup rmode |
| 1572 |
at (1). One should change the code and data segmenst limit to 64K and the D bit |
| 1573 |
to 0 __before__ clearing the PE switch: we can not change any of those segments |
| 1574 |
cached fields (except the base) once the PE get cleared. |
| 1575 |
|
| 1576 |
Intel's "switching back to Real-Address mode" section also states the same |
| 1577 |
order, but without making the reasoning of this order very clear. |
| 1578 |
|
| 1579 |
Intel also elaborates that "the segment registers must be loaded with non-null |
| 1580 |
segment selectors or the segment registers will be unusable in real-address |
| 1581 |
mode." Simply because the segment will be marked in the segment caches as |
| 1582 |
invalid (P = 0), and as we learned from above, we won't be able to change any |
| 1583 |
of those cached bits after clearing PE. Due to this cached P=0 bit, even the |
| 1584 |
usual act in canonical rmode of changing the base within the 2-bytes limit |
| 1585 |
won't be possible, rendering the segment as unusable, as this is their only |
| 1586 |
effective action in 8086 semantics. |
| 1587 |
|
| 1588 |
* Finally, the difference between 8086 and 386 segmentation is superficial. If |
| 1589 |
the segment was marked as P=1 in the segment caches (valid), 8086 semantics |
| 1590 |
just change the base field (limited to 2-bytes size) of the equivalent 386 |
| 1591 |
segment. They all function using the same registers and simple circuity. Intel |
| 1592 |
elaborates: "Note that if the segment registers are not reloaded, execution |
| 1593 |
continues using the descriptor attributes loaded during protected mode." |
| 1594 |
|
| 1595 |
|
| 1596 |
Using light debugging techniques: |
| 1597 |
--------------------------------- |
| 1598 |
|
| 1599 |
* References: |
| 1600 |
- Ulrich Drepper, "How to Write Shared Libraries", 2.1 Data Definitions, |
| 1601 |
http://www.webcitation.org/5mkmxB9Fx |
| 1602 |
- Ian Lance Taylor, "Linkers part 5", http://www.airs.com/blog/archives/42, |
| 1603 |
http://www.webcitation.org/5mkmiZ8FN |
| 1604 |
- Binutils bug report, "ELF linker loads member of archive for common symbol", |
| 1605 |
http://sources.redhat.com/bugzilla/show_bug.cgi?id=1811 |
| 1606 |
http://www.webcitation.org/5mknnyUe2 |
| 1607 |
- Minor bug-report at http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42726 |
| 1608 |
|
| 1609 |
* Today while writing page allocator test code, I faced a very strange bug: once |
| 1610 |
any of the kernel's text, data, or bss sections exceeds something close to 20 |
| 1611 |
KBs, the MPtables code panic()s at different and unpredictable locations. The |
| 1612 |
failure is something like: |
| 1613 |
|
| 1614 |
PANIC: MPTables - 1217939456 CPUs found, only 32 supported |
| 1615 |
|
| 1616 |
* To give some context, there's a system global called 'nr_cpus' representing |
| 1617 |
the number of CPUs in the system. So far, we know the number of CPUs from the |
| 1618 |
MP tables Conf entries, where each CPU entry represents a CPU in the system. |
| 1619 |
Thus, finding 8 CPU entries means that the machine has 8 usable cores. |
| 1620 |
|
| 1621 |
This global var 'nr_cpus' is only increased at one and _only_ one place in |
| 1622 |
the system, at the MP tables parsing code: |
| 1623 |
|
| 1624 |
/* We call this for each MP CPU entry in the system */ |
| 1625 |
static void parse_cpu(void *addr) |
| 1626 |
{ |
| 1627 |
struct mpc_cpu *cpu = addr; |
| 1628 |
|
| 1629 |
if (!cpu->enabled) |
| 1630 |
return; |
| 1631 |
|
| 1632 |
if (nr_cpus >= CPUS_MAX) { |
| 1633 |
panic("MP - %d CPUs found, only %d supported", |
| 1634 |
nr_cpus, CPUS_MAX); |
| 1635 |
} |
| 1636 |
... |
| 1637 |
|
| 1638 |
++nr_cpus; |
| 1639 |
} |
| 1640 |
|
| 1641 |
* Inspecting the BIOS MP tables more closely, the MP pointer table is at 0xfbb60 |
| 1642 |
and its checksum and signature are valid. The MP configuration table header was |
| 1643 |
also valid and checksummed at 0xfba00. |
| 1644 |
|
| 1645 |
The problem, it seems, is that the conf table entries got somehow corrupted |
| 1646 |
(thus the wrong number of cpus value), or there's a bug in our parser code. |
| 1647 |
|
| 1648 |
* Unfortunately, and as usual for such problems, inserting printk()s at any |
| 1649 |
location skews the results randomly, making any printk() useless. |
| 1650 |
|
| 1651 |
So we try less invasive debugging techniques like writing to a memory area |
| 1652 |
using a _single_ inlined asm instruction, and inspecting this memory area |
| 1653 |
using an emulator afterwars. This worked and made the problem repeatable! |
| 1654 |
|
| 1655 |
* First, and to test the parser validity, we choose a memory location that we |
| 1656 |
know before hand it's going to be zero; I chose physical 0x500. In parse_cpu |
| 1657 |
we add a very small asm instruction as follows: |
| 1658 |
|
| 1659 |
/* 8-byte instruction */ |
| 1660 |
asm volatile("incq (0xffffffff80000500);"); |
| 1661 |
|
| 1662 |
if (nr_cpus >= CPUS_MAX) { |
| 1663 |
panic("MP - %d CPUs found, only %d supported", |
| 1664 |
nr_cpus, CPUS_MAX); |
| 1665 |
} |
| 1666 |
|
| 1667 |
And thankfully this instruction was light enough, so the system paniced |
| 1668 |
at the same point. Our panic() just calls printk() and hlt the system. So |
| 1669 |
after the system halts, we move to the qemu console handler and check the |
| 1670 |
value at address 0x500. Good for us, we found the value there to be 1, so |
| 1671 |
it seems our parser is behaving correctly. |
| 1672 |
|
| 1673 |
* Now the problem is getting more mysterious. If the parser is behaving |
| 1674 |
correctly, and the value at address 0x500 is completely valid, how come |
| 1675 |
'nr_cpus' have those very strange values? Remember that we put the 'incl |
| 1676 |
0x500' instruction at the same place of '++nr_cpus', so the two addresses |
| 1677 |
should have the exact same values or something is _completely_ messed up. |
| 1678 |
|
| 1679 |
The 'nr_cpus' variable still yielded the value of "1217939456". |
| 1680 |
|
| 1681 |
* So what did really corrupt the BSS-stored variable nr_cpus at this _very_ |
| 1682 |
early stage? |
| 1683 |
|
| 1684 |
The very first thing we do in the C part of the kernel is to clear the entire |
| 1685 |
BSS section to zero. So assuming the kernel was copied/setup correctly (which |
| 1686 |
we're not very sure of yet), the problem probably occurs _between_ clearing |
| 1687 |
the BSS section and parsing the MP tables. |
| 1688 |
|
| 1689 |
OK, so we'll try to check the value of the 'nr_cpus' variable at different |
| 1690 |
places in the kernel between clearing BSS and MP parsing using this 14-byte |
| 1691 |
code segment: |
| 1692 |
|
| 1693 |
asm volatile ("movl %0, %%eax;" |
| 1694 |
"movl %%eax, (0xffffffff80000500); hlt;" |
| 1695 |
: : "m"(nr_cpus) : "memory"); |
| 1696 |
|
| 1697 |
And then use the qemu console to read the address 0x500 'xp 0x500'. |
| 1698 |
|
| 1699 |
* Very important discovery: Adding the above asm fragment _directly_ after |
| 1700 |
clearning the BSS section yielded the same corrupt value of "1217939456" |
| 1701 |
for the 'nr_cpus' variable! Something is seriously wrong cause 'nr_cpus' |
| 1702 |
is clearly in the BSS section: |
| 1703 |
|
| 1704 |
$ readelf -a kernel.elf |
| 1705 |
|
| 1706 |
Section Headers: |
| 1707 |
[Nr] Name Type Address Offset |
| 1708 |
Size EntSize Flags Link Info Align |
| 1709 |
... |
| 1710 |
[ 8] .bss NOBITS ffffffff8010f5e0 0010f5e0 |
| 1711 |
00000000000028c8 0000000000000000 WA 0 0 32 |
| 1712 |
|
| 1713 |
Symbol table '.symtab' contains 183 entries: |
| 1714 |
Num: Value Size Type Bind Vis Ndx Name |
| 1715 |
... |
| 1716 |
151: ffffffff80111d80 4 OBJECT GLOBAL DEFAULT 8 nr_cpus |
| 1717 |
|
| 1718 |
* I think we're getting closer. Notice from above ELF analysis, the BSS |
| 1719 |
section starts at 0xffffffff8010f600, and it has a length of 0x28c8: |
| 1720 |
|
| 1721 |
This means that from reading the ELF: |
| 1722 |
|
| 1723 |
BSS start = 0xffffffff8010f5e0 |
| 1724 |
BSS end = 0xffffffff8010f5e0 + 0x28c8 = 0xffffffff800111ea8 |
| 1725 |
'nr_cpus' is at 0xffffffff80111d80 |
| 1726 |
|
| 1727 |
But very strangely, here are the results of printing __bss_start and |
| 1728 |
__bss_end symbols exported by our kernel linker script we wrote since the |
| 1729 |
dawn of time: |
| 1730 |
|
| 1731 |
BSS start = 0xffffffff8010f5e0 |
| 1732 |
!! BSS end = 0xffffffff801115e0 |
| 1733 |
'nr_cpus' is outside of this region! |
| 1734 |
|
| 1735 |
As you see, somehow the kernel sees a _trimmed_ BSS section from the |
| 1736 |
linker script, and to add insult to injury, 'nr_cpus' is not included |
| 1737 |
in that trimmed region. |
| 1738 |
|
| 1739 |
* Things began to make sense: the kernel obviously has a misguided view about |
| 1740 |
its bss section. This is disastrous cause at this level, we're the ones |
| 1741 |
responsible of clearing this BSS area before use. Ofcourse, we directly move |
| 1742 |
to our linker script BSS section to see what's wrong: |
| 1743 |
|
| 1744 |
SECTIONS |
| 1745 |
{ |
| 1746 |
... |
| 1747 |
.bss : { |
| 1748 |
__bss_start = .; |
| 1749 |
*(EXCLUDE_FILE (*head.o *e820.o) .bss) |
| 1750 |
__bss_end = .; |
| 1751 |
} |
| 1752 |
__kernel_end = .; |
| 1753 |
... |
| 1754 |
} |
| 1755 |
|
| 1756 |
Things should be fine and everything is at its place, but the corruption still |
| 1757 |
exists. So I tried a small modification of: |
| 1758 |
|
| 1759 |
.bss : { |
| 1760 |
__bss_start = .; |
| 1761 |
*(EXCLUDE_FILE (*head.o *e820.o) .bss) |
| 1762 |
} |
| 1763 |
__bss_end = .; |
| 1764 |
__kernel_end = .; |
| 1765 |
|
| 1766 |
And things worked magically! Ofcourse leaving the issue as this will be immature |
| 1767 |
as we didn't understand yet why this trivial change produced this huge effect. |
| 1768 |
|
| 1769 |
* Trying to debug the issue further, we restore the linker script to its original |
| 1770 |
state and let the linker be more verbose by using the -M flag. Inspecting the |
| 1771 |
output, a very interesting thing came up when the linker displayed the final |
| 1772 |
memory map: |
| 1773 |
|
| 1774 |
('...' denotes trimmed output lines) |
| 1775 |
|
| 1776 |
Allocating common symbols // mm, .. |
| 1777 |
Common symbol size file |
| 1778 |
|
| 1779 |
test_symbol 0x8 mmap.o |
| 1780 |
ioapic_descs 0x60 ioapic.o |
| 1781 |
nr_ioapics 0x4 ioapic.o |
| 1782 |
nr_cpus 0x4 smpboot.o |
| 1783 |
mp_irqs 0x700 mptables.o |
| 1784 |
nr_mpcirqs 0x4 mptables.o |
| 1785 |
cpu_descs 0x100 smpboot.o |
| 1786 |
|
| 1787 |
Linker script and memory map |
| 1788 |
|
| 1789 |
0x0000000000100000 . = 0x100000 |
| 1790 |
|
| 1791 |
.text.head 0x0000000000100000 0x310 |
| 1792 |
... |
| 1793 |
.bss 0xffffffff8010f5e0 0x28c8 load address 0x000000000010692c |
| 1794 |
0xffffffff8010f5e0 __bss_start = . |
| 1795 |
*(EXCLUDE_FILE(*e820.o *head.o) .bss) |
| 1796 |
.bss 0xffffffff8010f5e0 0x0 common.o |
| 1797 |
.bss 0xffffffff8010f5e0 0x0 main.o |
| 1798 |
.bss 0xffffffff801105e0 0x0 i8259.o |
| 1799 |
.bss 0xffffffff801105e0 0x0 apic.o |
| 1800 |
... |
| 1801 |
.bss 0xffffffff80110610 0x0 lib/string.o |
| 1802 |
.bss 0xffffffff80110620 0xfc0 lib/printf.o |
| 1803 |
0xffffffff801115e0 __bss_end = . |
| 1804 |
COMMON 0xffffffff801115e0 0x64 ioapic.o |
| 1805 |
0xffffffff801115e0 ioapic_descs // Outside of BSS! |
| 1806 |
0xffffffff80111640 nr_ioapics // Outside of BSS! |
| 1807 |
*fill* 0xffffffff80111644 0x1c 00 |
| 1808 |
COMMON 0xffffffff80111660 0x704 mptables.o |
| 1809 |
0xffffffff80111660 mp_irqs // Outside of BSS! |
| 1810 |
0xffffffff80111d60 nr_mpcirqs // Outside of BSS! |
| 1811 |
*fill* 0xffffffff80111d64 0x1c 00 |
| 1812 |
COMMON 0xffffffff80111d80 0x120 smpboot.o |
| 1813 |
0xffffffff80111d80 nr_cpus // Outside of BSS! |
| 1814 |
0xffffffff80111da0 cpu_descs // Outside of BSS! |
| 1815 |
COMMON 0xffffffff80111ea0 0x8 mmap.o |
| 1816 |
0xffffffff80111ea0 test_symbol // Outside of BSS! |
| 1817 |
0xffffffff80111ea8 __kernel_end = . |
| 1818 |
|
| 1819 |
So at last we got it, the linker script didn't put variables like nr_cpus, |
| 1820 |
nr_ioapics etc in the BSS section, thus they were not cleared in our clear_bss() |
| 1821 |
function. As you can see, those symbols are allocated past the __bss_end symbol. |
| 1822 |
|
| 1823 |
* OK, so we inspect the LD manual for this 'COMMON' section, to find: |
| 1824 |
|
| 1825 |
"A special notation is needed for common symbols, because in many object file |
| 1826 |
formats common symbols do not have a particular input section. The linker treats |
| 1827 |
common symbols as though they are in an input section named ‘COMMON’ ..." |
| 1828 |
|
| 1829 |
* Unfortunately the documentation was a bit vague, and there was no explanation |
| 1830 |
of what "common symbols" were really used for. So we dig deeper to find this |
| 1831 |
great explanation from Ulrich's paper on shared libraries: |
| 1832 |
|
| 1833 |
"Common variables are widely used in fortran, but they got used in C and C++ |
| 1834 |
as well to work around mistakes of programmers. Since in the early days people |
| 1835 |
used to drop the 'extern' keyword from variable definitions, in the same way |
| 1836 |
it is possible to drop it from function declaration, the compiler often had |
| 1837 |
multiple definitions of the same variable in different files. |
| 1838 |
|
| 1839 |
To help the poor and clueless programmer, the C/C++ compiler normally generates |
| 1840 |
common variables for uninitialized definitions such as `int foo;' |
| 1841 |
|
| 1842 |
For common variables, there can be more than one definition, and they all get |
| 1843 |
unified in one location in the output file .. their values does not need to be |
| 1844 |
stored in the ELF file." |
| 1845 |
|
| 1846 |
* And also from Ulrich, a description of -fno-common which is more accurate than |
| 1847 |
gcc's manual: |
| 1848 |
|
| 1849 |
"If the programmer uses the compiler command-line option -fno-common, the |
| 1850 |
generated code will contain uninitialized variables instead of common variables |
| 1851 |
if a vriable definition has no initializer. .. |
| 1852 |
|
| 1853 |
The result at run-time is the same as for common variable, no value stored in |
| 1854 |
the file. But the representation in the object file is different [BSS section], |
| 1855 |
and it allows the linker to find multiple definitions and flag them as errors." |
| 1856 |
|
| 1857 |
* And finally from Ian Lance Taylor's series on linkers: |
| 1858 |
|
| 1859 |
"A section index of SHN_COMMON (0xfff2) indicates a common symbol. Common |
| 1860 |
symbols were invented to handle Fortran common blocks, and they are also often |
| 1861 |
used for uninitialized global variables in C. |
| 1862 |
|
| 1863 |
A common symbol has unusual semantics. Common symbols have a value of zero, but |
| 1864 |
set the size field to the desired size. If one object file has a common symbol |
| 1865 |
and another has a definition, the common symbol is treated as an undefined |
| 1866 |
reference. |
| 1867 |
|
| 1868 |
If there is no definition for a common symbol, the program linker acts as though |
| 1869 |
it saw a definition initialized to zero of the appropriate size. Two object |
| 1870 |
files may have common symbols of different sizes, in which case the program |
| 1871 |
linker will use the largest size. |
| 1872 |
|
| 1873 |
Implementing common symbol semantics across shared libraries is a touchy |
| 1874 |
subject, somewhat helped by the recent introduction of a type for common symbols |
| 1875 |
as well as a special section index (see the discussion of symbol types below)." |
| 1876 |
|
| 1877 |
(Also check nm(1) on "C"/COMMON symbols, and --warn-common on ld(1)) |
| 1878 |
|
| 1879 |
* And by now everything is super-clear, and the fix is only adding the COMMON |
| 1880 |
symbols within the output object BSS boundaries: |
| 1881 |
|
| 1882 |
.bss : { |
| 1883 |
__bss_start = .; |
| 1884 |
*(EXCLUDE_FILE (*head.o *e820.o) .bss) |
| 1885 |
*(EXCLUDE_FILE (*head.o *e820.o) COMMON) |
| 1886 |
__bss_end = .; |
| 1887 |
} |
| 1888 |
|
| 1889 |
What a bug this was. |
| 1890 |
|
| 1891 |
|
| 1892 |
Musings on Memory Allocation: |
| 1893 |
----------------------------- |
| 1894 |
|
| 1895 |
* References: |
| 1896 |
- Uresh Vahalia, "Unix Internals: the new frontiers", Chapter 12 - Kernel |
| 1897 |
Memory Allocation |
| 1898 |
- Kirk McKusick, J. Karles, "Design of a General Purpose Memory Allocator |
| 1899 |
for the 4.3BSD UNIX Kernel", Berkely, Usenix 1988 conference |
| 1900 |
- Daniel Bovet, Marco Cesati, "Understanding the Linux Kernel", Chapter 8 - |
| 1901 |
Memory Management, Section 8.1 - Page Frame Management |
| 1902 |
- Kirk McKusick, George Neville, "The Design and Implementation of the |
| 1903 |
FreeBSD Operating System", Section 5.3 - Kernel Memory Management |
| 1904 |
- Maurice Bach, "The Design of the UNIX Operating System", Chapter 9 - |
| 1905 |
Memory Management Policies |
| 1906 |
|
| 1907 |
* In all Unices, memory is page-based, and the page is the smallest unit of |
| 1908 |
allocation, protection, address translation, and mapping. X86-64 does not |
| 1909 |
even provide segmentation. |
| 1910 |
|
| 1911 |
* The very first step for writing a MM subsystem is writing the page-level |
| 1912 |
allocator. Due to virtual memory, the page-allocator doesn't need to find |
| 1913 |
contiguous free pages to satisfy its request: non-contigous frames mapped as |
| 1914 |
contiguous addresses shall be enough. |
| 1915 |
|
| 1916 |
The kernel often needs chunks of memory much less of size than a page, there's |
| 1917 |
where the 'kernel memory allocator' job comes into play for allocating this |
| 1918 |
kind of memory within the pages. It depends internally on the page allocator. |
| 1919 |
|
| 1920 |
* Early Unix like SVR2 avoided dynamic memory altogether, and used fixed-size |
| 1921 |
tables for vnodes and proc structures. On the other hand, when memory was |
| 1922 |
required for holding temporary pathnames, they borrowed buffers from the block |
| 1923 |
buffer cache (Check chapter 3 of Maurice's book). |
| 1924 |
|
| 1925 |
Another case was the 4.2BSD, which in McKusick wordings had "at least ten |
| 1926 |
different memory allocators. Some of them handle large blocks, some handle |
| 1927 |
small chained data structures, and others include information to describe |
| 1928 |
I/O operations. .. Multiple allocators reduce the effeciency with which memory |
| 1929 |
is used. The kernel ends up with many different free lists of memory instead |
| 1930 |
of a single freelist from which all allocation can be drawn" |
| 1931 |
|
| 1932 |
* Currently, our page allocator is interfaced through two simple functions: |
| 1933 |
get_free_page() which returns a page frame descriptor, and free_page(page) |
| 1934 |
which frees given page frame. The naming is taken from Linux, which are the |
| 1935 |
historic names Torvalds used in his 0.01 relase and are still in-use today. |
| 1936 |
|
| 1937 |
* Now that we've finished writing our page allocator, we focus on kmalloc()-like |
| 1938 |
kernel memory allocators. A memory allocator utilization factor = |
| 1939 |
|
| 1940 |
total memory 'requested' by the kmalloc() caller |
| 1941 |
------------------------------------------------------------------- |
| 1942 |
total memory 'required' by allocator to satisfy those requests |
| 1943 |
|
| 1944 |
As said by McKusick, Good memory utilization in the kernel is more important |
| 1945 |
than in user processes. Pages in the process address space that are in 'required' |
| 1946 |
but not in the 'requested' pool need not tie up physical memory. But the kernel |
| 1947 |
is not typically paged out, so any not-well-utilized kernel-allocated memory |
| 1948 |
pages are held forever, and can not be used for other purposes. |
| 1949 |
|
| 1950 |
It's typical of user-space memory allocators not to release unused memory in |
| 1951 |
the required pool to avoid the syscall overhead. Kernel allocators are required |
| 1952 |
to release such memory as quick possible though, due to above point of holding |
| 1953 |
kernel memory pages forever, and because the kernal can quickly manipulate its |
| 1954 |
own page maps. |
| 1955 |
|
| 1956 |
* Not to degrade performace, memory returned by the allocator must be aligned |
| 1957 |
according to the machine requirements. I guess for our x86-64 case, the safest |
| 1958 |
haven is 8-byte alignemnt. |
| 1959 |
|
| 1960 |
Speaking of speed, McKusick suggests that the allocator need only to be fast for |
| 1961 |
small pieces of memory. Inspecting a time-shared machine at Berkely, he noticed |
| 1962 |
that "large allocations occur infrequently, and are typically for long-lived |
| 1963 |
objects such as buffers to hold the superblock for a mounted file system". |
| 1964 |
|
| 1965 |
* For a nice survey of kernel memory allocators, check referenced 'Unix Internals' |
| 1966 |
chapter. The 4.2BSD allocator, is covered under 'Simple Power-of-Two Free Lists', |
| 1967 |
and the 4.3BSD allocator, explained in the McKusick paper, is clearly described |
| 1968 |
under 'The McKusick-Karles Allocator' section; Vahalia is a great writer. |
| 1969 |
|
| 1970 |
* Before jumping to sketching our own allocator, we finally survey some of the |
| 1971 |
allocators used in modern unix kernels: |
| 1972 |
|
| 1973 |
In Linux, the page allocator provides contiguous pages using a binary buddy |
| 1974 |
system instead of depending on hardware circuity to emulate this continuity. |
| 1975 |
This, in Daniel Bovet words, offers the big advantage of leaving the kernel |
| 1976 |
paging tables unchanged, which leads to less average memory access times cause |
| 1977 |
such modifications lets the CPU flush its Translation-Lookaside Buffers. |
| 1978 |
|
| 1979 |
NOTE! The smallest unit returned by this buddy page allocatoris one PAGE. The |
| 1980 |
groups of 2, 4, 8, 16, .., 1024 are a physically contigous set of page frames. |
| 1981 |
This is not a kernel memory allocator (KMA) like the McKusick-Karles one; this |
| 1982 |
is a page allocator that uses the buddy algorithm for serving _contigous pages_ |
| 1983 |
instead. |
| 1984 |
|
| 1985 |
The page allocator main structure used is mem_map, which holds all the page |
| 1986 |
descriptors of the specific memory zone this page allocator belongs, like |
| 1987 |
ZONE_DMA, HIGHMEM, etc. Notice that mem_map is also the historic naming Linus |
| 1988 |
used for his page allocator in the 0.01 release. That page allocator was done |
| 1989 |
using a bitmap, which as we can see, fits the 'mem_map' naming better. |
| 1990 |
|
| 1991 |
Above this page allocator in Linux exists the regular sl[au]b object-caching |
| 1992 |
allocator, which requests pages from the page allocator upon need. |
| 1993 |
|
| 1994 |
In FreeBSD, the kernel uses the 4.3BSD allocator described in the McKusick |
| 1995 |
paper for small allocations. For large fixed-size allocations, it uses the |
| 1996 |
Zone/slab allocator instead. |
| 1997 |
|
| 1998 |
* In our kernel, we assign high weights to simplicity without sacrificying much |
| 1999 |
speed and memory resources. |
| 2000 |
|
| 2001 |
Our page allocator's main structure is the Page Frame Descriptor Table |
| 2002 |
(pfdtable; Unix SVR2 terminology), which is an array/table that contains all |
| 2003 |
the system's available (non e820-reserved; above kernel memory area) physical |
| 2004 |
pages descriptors. |
| 2005 |
|
| 2006 |
Free pages descriptors are connected in place by a linked list pfdfree_head |
| 2007 |
as the list head. Allocation and Reclamation is a simple O(1) matter of adding |
| 2008 |
to or removing from the pfdfree_head list. |
| 2009 |
|
| 2010 |
The kernel memory allocator is the 4.3BSD McKusick-Karles allocator. I chose |
| 2011 |
it cause of its beautiful simplicity, speed, and space effeciency for small |
| 2012 |
allocations. We shall develop a slab-like caching allocator for big objects |
| 2013 |
once the need arise; we're obviously influenced by FreeBSD in this area. |
| 2014 |
|
| 2015 |
|
| 2016 |
On Virtual Memory: |
| 2017 |
------------------ |
| 2018 |
|
| 2019 |
* References: |
| 2020 |
- Uresh Vahalia, "Unix Internals", Chapter 13 - Virtual Memory |
| 2021 |
- Peter J. Denning, 'Virtual Memory', Princeton, ACM Computing Surveys |
| 2022 |
- Maurice Bach, "The Design of the UNIX Operating System", Chapter 9 - |
| 2023 |
Memory Management Policies |
| 2024 |
- Jonathan Corbet, "Four-level page tables", LWN.net, |
| 2025 |
http://lwn.net/Articles/106177/, http://www.webcitation.org/5ncgFSpzm |
| 2026 |
|
| 2027 |
* First, please check our detailed notes on the great Dennings's virtual memory |
| 2028 |
paper; the paper covers a huge number of now very-well-established concepts. |
| 2029 |
Below paragraphs assumes familiarity with the concepts covered in that paper. |
| 2030 |
|
| 2031 |
* To cause considerable reduction in the amount of virtual-to-physical mapping |
| 2032 |
information that needs to be stored, methods like segmentation and paging |
| 2033 |
were invented. In the words of Denning, "each method groups information into |
| 2034 |
blocks, a block being a set of contiguous addresses in address space. The |
| 2035 |
entries in the mapping table will refer now to blocks, which are _far_ less |
| 2036 |
numerous than addresses in a process address space". |
| 2037 |
|
| 2038 |
In other words, instead of storing mapping information for each address, we |
| 2039 |
store the mapping info fore each equally-sized _page_ or for each _segment_. |
| 2040 |
|
| 2041 |
* In paging, even while storing mapping information for each page instead of |
| 2042 |
for each address, the page table can consume a huge area of memory. For |
| 2043 |
example, in a fictional 32-bit machine with a page size of 1KB, and an |
| 2044 |
address space of 2GB, we need: 2^31 / 2^10 = 2^21 entries. For an entry size |
| 2045 |
of 32-bit, 4-bytes: page table size = 2^21 * 2^2 = 2^23 = 2^3 * 2^20 = 8MB. |
| 2046 |
So we'll need a _contiguous_ 8MB table for _each_ system process address space |
| 2047 |
which is definitely not tolerable. |
| 2048 |
|
| 2049 |
* Analyzing the problem, we see that the root cause is having mapping information |
| 2050 |
for non-used pages; a process rarely use its entire address space anyway. It's |
| 2051 |
also common to have a virtual address-space which is much larger than machine's |
| 2052 |
available physical memory. To solve this problem we either: |
| 2053 |
|
| 2054 |
- have segmented page tables. Each segment of the process has its own page table |
| 2055 |
which is just large enough to hold the valid address range for that segment. |
| 2056 |
|
| 2057 |
- second is having the page table itself paged, which means an additional higher |
| 2058 |
level page table is used to map the lower-level page table. With such a multi- |
| 2059 |
teired hierarchy, we need to allocate only those pages of the lower-level table |
| 2060 |
that map used/valid addresses of the process. This is the common approach. |
| 2061 |
|
| 2062 |
|
| 2063 |
On Partitionting the 64-bit kernel address space: |
| 2064 |
------------------------------------------------- |
| 2065 |
|
| 2066 |
* References: |
| 2067 |
- Andi Kleen, "Porting Linux to x86-64", SuSe Labs |
| 2068 |
- Michael Matz, Hubicka, Andrea Jaeger, Mark Mitchell, "System V Application |
| 2069 |
Binary Interface, AMD64 processor supplement", SuSe Labs, Codesourcey |
| 2070 |
- Linux Kernel linux-2.6/Documentation/x86/x86_64/mm.txt, kernel.org |
| 2071 |
- Our kernel comments at head.S and include/paging.h |
| 2072 |
- MSDN Library, Win32 and COM development, "64-bit Virtual Address Space", |
| 2073 |
http://msdn.microsoft.com/en-us/library/aa384271(VS.85).aspx |
| 2074 |
http://www.webcitation.org/5nckv6YBf |
| 2075 |
- Intel Manuals volume 2A, A-M Instructions Reference |
| 2076 |
|
| 2077 |
* Partitioning the address space of a long mode kernel is an interesting challenge. |
| 2078 |
There are so many tradeoffs to be taken care of. First, let's remember that we |
| 2079 |
have full 64-bit space. This has obvious potentials, but also some drawbacks that |
| 2080 |
need be taken care of. |
| 2081 |
|
| 2082 |
* The first drawback of using a full 64-bit address space is code size. Imagine when |
| 2083 |
every far jump, global data access, and function call uses a full 64-bit address. |
| 2084 |
The difference is outlied in below pure x86-32 app: |
| 2085 |
|
| 2086 |
/* File1: sum.c */ |
| 2087 |
int global; |
| 2088 |
int sum() { |
| 2089 |
return 0; |
| 2090 |
} |
| 2091 |
|
| 2092 |
/* File2: app.c */ |
| 2093 |
extern int global; |
| 2094 |
int x, y; |
| 2095 |
int main() { |
| 2096 |
global = 0xdead; |
| 2097 |
x = 0xceeb; |
| 2098 |
return sum(); |
| 2099 |
} |
| 2100 |
|
| 2101 |
# -m32: 32-bit compilation and linkage |
| 2102 |
$ gcc -O3 -m32 sum.c app.c -o app |
| 2103 |
|
| 2104 |
which generates the very small code below: |
| 2105 |
|
| 2106 |
$ objdump -D app |
| 2107 |
0804956c <x>: /* 4-byte address */ |
| 2108 |
804956c: 00 00 add %al,(%eax) |
| 2109 |
... |
| 2110 |
08049570 <y>: /* 4-byte address */ |
| 2111 |
8049570: 00 00 add %al,(%eax) |
| 2112 |
|
| 2113 |
08048370 <main>: |
| 2114 |
/* 10-bytes in total |
| 2115 |
* 4-byte absolute address dereference */ |
| 2116 |
8048376: c7 05 68 95 04 08 ad movl $0xdead,0x8049568 |
| 2117 |
804837d: de 00 00 |
| 2118 |
|
| 2119 |
/* 10-bytes in total |
| 2120 |
* 4-byte absolute address dereference */ |
| 2121 |
8048380: c7 05 6c 95 04 08 eb movl $0xceeb,0x804956c |
| 2122 |
8048387: ce 00 00 |
| 2123 |
|
| 2124 |
/* Only 5 bytes in total |
| 2125 |
* 4-byte relative offset */ |
| 2126 |
8048394: e8 c7 ff ff ff call 8048360 <sum> |
| 2127 |
... |
| 2128 |
|
| 2129 |
Now imagine if above app used full 64-bit virtual addresses, each 4-byte address |
| 2130 |
dereference above would have consumed full 8-bytes; even some ops would consume |
| 2131 |
140% extra size. Inspect below results: |
| 2132 |
|
| 2133 |
$ gcc -m64 -mcmodel=large -c sum.c -o sum.o |
| 2134 |
$ gcc -m64 -mcmodel=large -c app.c -o app.o |
| 2135 |
$ ld -Ttext 0xffff800000000000 app.o sum.o -o app |
| 2136 |
$ objdump -D app | less |
| 2137 |
|
| 2138 |
ffff8000002000cc <x>: /* .. 8-byte address */ |
| 2139 |
ffff8000002000cc: 00 00 add %al,(%rax) |
| 2140 |
... |
| 2141 |
ffff8000002000d0 <y>: /* .. 8-byte address */ |
| 2142 |
... |
| 2143 |
|
| 2144 |
ffff800000000024 <main>: |
| 2145 |
|
| 2146 |
/* 8-bytes address, 15-bytes ops size in total. Compare |
| 2147 |
* with the 10-byte 32-bit equivalent ops (50% increase) */ |
| 2148 |
ffff800000000028: 48 b8 c8 00 20 00 00 mov $0xffff8000002000c8,%rax |
| 2149 |
ffff80000000002f: 80 ff ff |
| 2150 |
ffff800000000032: c7 00 ad de 00 00 movl $0xdead,(%rax) |
| 2151 |
|
| 2152 |
/* 8-bytes address, 15-bytes ops size in total. Compare |
| 2153 |
* with the 10-byte 32-bit equivalent ops (50% increase) */ |
| 2154 |
ffff800000000038: 48 b8 cc 00 20 00 00 mov $0xffff8000002000cc,%rax |
| 2155 |
ffff80000000003f: 80 ff ff |
| 2156 |
ffff800000000042: c7 00 eb ce 00 00 movl $0xceeb,(%rax) |
| 2157 |
... |
| 2158 |
|
| 2159 |
/* 8-byte address, *12-bytes* ops size in total. Compare |
| 2160 |
* with the 5-byte 32-bit equivalent ops (140% increase!) */ |
| 2161 |
ffff80000000005d: 48 ba 00 00 00 00 00 mov $0xffff800000000000,%rdx |
| 2162 |
ffff800000000064: 80 ff ff |
| 2163 |
ffff800000000067: ff d2 callq *%rdx |
| 2164 |
|
| 2165 |
* As you can see from above, naively using the 64-bit address space can lead to |
| 2166 |
diastrous effects on code size, and thus, the cache footprint. To solve this, some |
| 2167 |
help from both the kernel and the compiler is needed. |
| 2168 |
|
| 2169 |
First, let's focus on user-space. To let user-space avoid this large code-size hit, |
| 2170 |
it's common to assign to it the 0 -> 4GB+ virtual space, and let the compiler |
| 2171 |
generate 32-bit addresses for code and global data. This is for example, how the Linux |
| 2172 |
ecosystem does it. The kernel assigns below virtual addresses to user-space (for each |
| 2173 |
process): |
| 2174 |
|
| 2175 |
0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm |
| 2176 |
hole caused by [48:63] sign extension |
| 2177 |
|
| 2178 |
And let gcc emits 32-bit addressing for code and global data using the -mcmodel=small |
| 2179 |
parameter, which is the default memory model. %rip-relative addressing is also used |
| 2180 |
by default at this model for dereferencing object globals; it already consumes same |
| 2181 |
amount of opcode size relative to x86-32's in-operand 32-bit absoulte dereference. |
| 2182 |
|
| 2183 |
* Now how are we going to produce similar effect in kernel-space code when the |
| 2184 |
0 -> 4G+ space has been already reserved by user-space? |
| 2185 |
|
| 2186 |
Thanks to the elegance of the 2's complement model, the common case is to use |
| 2187 |
negative 32-bit addresses for kernel linking. This leads to negative 32-bit |
| 2188 |
discplacement for kernel ops. We then let the CPU implcitly (automatically) |
| 2189 |
sign-extend those discplacements, leading us to use the top 64-bit address space, |
| 2190 |
but with the size-effeciency of 32-bit addresses! |
| 2191 |
|
| 2192 |
An example is overdue. Imagine linking our kernel with the 0xffffffff80000000 |
| 2193 |
as base. %rax points to a physical address in our text or data area. To let it |
| 2194 |
be virtual, we add the base to it. Mainly we want to do: |
| 2195 |
|
| 2196 |
%rax = %rax + 0xffffffff80000000 |
| 2197 |
|
| 2198 |
Now, thanks to the sign-extension trick, the above instruction can be size- |
| 2199 |
-effeciently encoded as if the displacement was 32-bit as follows: |
| 2200 |
|
| 2201 |
ffffffff8010b347: 48 05 00 00 00 80 add $0xffffffff80000000,%rax |
| 2202 |
|
| 2203 |
As you can see, only 4-bytes are used for displacement (0x80000000). The REX.W |
| 2204 |
prefix (0x48) with the x86 ADD opcode (05) will let the x86-64 cpu automatically |
| 2205 |
sign extend the 32-bit immediate 0x80000000 to $0xffffffff80000000 and achieve |
| 2206 |
our goals. |
| 2207 |
|
| 2208 |
* Here's a more complete explanation from our kernel comments: |
| 2209 |
|
| 2210 |
Using -2GB as a base empowers the compiler to directly encode symbolic |
| 2211 |
references as 32-bit values, and let the CPU implicitly sign-extended those to |
| 2212 |
64-bits due to the REX.W prefix. Thus, we use the top 64-bit address space, but |
| 2213 |
with 32-bit size effeciency! Check below example: |
| 2214 |
|
| 2215 |
Passing global strings to memcpy: |
| 2216 |
|
| 2217 |
48 c7 c6 80 24 8b 80 mov $0xffffffff808b2480,%rsi |
| 2218 |
48 c7 c7 60 d8 10 80 mov $0xffffffff8010d860,%rdi |
| 2219 |
e8 f4 39 00 00 callq ffffffff8010bb00 <memcpy> |
| 2220 |
|
| 2221 |
To avoid the size-overhead of 64-bit offsets in function calls, we limit the |
| 2222 |
kernel text to a 32-bit virtual area. Thus, an x86-64 callq will only consume a |
| 2223 |
4-byte offset: relative space between next op and the destination label. |
| 2224 |
|
| 2225 |
To dereference kernel globals (instead of just passing their addresses), we'll |
| 2226 |
use %rip-relative addressing within + or - 2GB limit, avoiding the size-overhead |
| 2227 |
of referencing a full 8-byte address. |
| 2228 |
|
| 2229 |
All of this was in fact designed by the SysV AMD64 ABI developers, and is called |
| 2230 |
the 'kernel code model'. We use this code model, and let userspace use the |
| 2231 |
'small' or the 'medium' one. It's interesting to note that NT userspace defaults |
| 2232 |
to the medium model, while most UNIX implementations default to the 'small' one. |
| 2233 |
|
| 2234 |
* To get a state of the size difference, check the difference of function call |
| 2235 |
size overhead between the 'large' and the 'kernel' code models: |
| 2236 |
|
| 2237 |
$ objdump -j .text -D kernel.elf | grep call # -mcmodel=kernel |
| 2238 |
|
| 2239 |
ffffffff801090c8: e8 e3 45 00 00 callq ffffffff8010d6b0 <vsnprintf> |
| 2240 |
ffffffff801090df: e8 ec 48 00 00 callq ffffffff8010d9d0 <printk> |
| 2241 |
ffffffff80109105: e8 86 41 00 00 callq ffffffff8010d290 <memset> |
| 2242 |
ffffffff8010b97b: e8 70 f9 ff ff callq ffffffff8010b2f0 <spin_lock> |
| 2243 |
ffffffff8010ba61: e8 6a ff ff ff callq ffffffff8010b9d0 <get_free_page> |
| 2244 |
ffffffff8010bdd8: e8 83 fc ff ff callq ffffffff8010ba60 <get_zeroed_page> |
| 2245 |
ffffffff8010c749: e8 a2 f8 ff ff callq ffffffff8010bff0 <__kmalloc> |
| 2246 |
... |
| 2247 |
|
| 2248 |
$ objdump -j .text -D kernel.elf | grep call # -mcmodel=large |
| 2249 |
|
| 2250 |
/* Notice the ~%140 increase in size per call */ |
| 2251 |
ffffffff801090cb: 48 b8 d0 e6 10 80 ff mov $0xffffffff8010e6d0,%rax |
| 2252 |
ffffffff801090d2: ff ff ff |
| 2253 |
ffffffff801090d5: ff d0 callq *%rax |
| 2254 |
ffffffff801091a3: 48 bb 00 ea 10 80 ff mov $0xffffffff8010ea00,%rbx |
| 2255 |
ffffffff801091aa: ff ff ff |
| 2256 |
ffffffff801091b9: ff d3 callq *%rbx |
| 2257 |
... |
| 2258 |
|
| 2259 |
So, while admittedly using -mcmodel=large makes things a bit easier for the 64-bit |
| 2260 |
kernel developer (as it gives them the freedom of basing the kernel on any desired |
| 2261 |
virtual address), it induces too much size overhead that it's worth the effort |
| 2262 |
avoiding it. |
| 2263 |
|
| 2264 |
For example, since Cute uses -mcmodel=kernel, we have two mappings in our system, |
| 2265 |
one for kernel text and data (-2GB), and _another_ mapping for the entire physical |
| 2266 |
memory discovered at system boot. Extra care was needed while developing this in |
| 2267 |
an already very complicated subsystem: memory management. |
| 2268 |
|
| 2269 |
This is also the same reason why Linux's amd64 (unlike ia32) __PAGE_OFFSET is |
| 2270 |
different from the kernel linkage base at __START_KERNEL_MAP = -2GB. |
| 2271 |
|
| 2272 |
* So, keeping above points in mind, we'll use below strategy for dividing our kernel |
| 2273 |
space addresses: |
| 2274 |
|
| 2275 |
0x0000000000000000 -> 0x00007fffffffffff userspace mappings |
| 2276 |
0x0000800000000000 -> 0x0000ffffffffffff not canoncial; cannot be used |
| 2277 |
0xffff800000000000 -> 0xffffc00000000000 maps for all physical memory (64TB max) |
| 2278 |
0xffffffff80000000 -> 0xffffffffc0000000 kernel text and data map |
| 2279 |
|
| 2280 |
|
| 2281 |
* Debugging -O3 triggered problems: |
| 2282 |
----------------------------------- |
| 2283 |
|
| 2284 |
* While adding a good number of GCC warning request flags to detect bugs as early |
| 2285 |
as possible, I enabled the -O3 switch only to discover that it triggered a good |
| 2286 |
number of hidden kernel bugs. |
| 2287 |
|
| 2288 |
First bug led to several exceptions. Upon investigating the CPU's reported |
| 2289 |
faulting instruction, I found its an SSE x86 op. SSE ops require special %cr0 |
| 2290 |
and %cr4 setup that we're not ready for. |
| 2291 |
|
| 2292 |
The solution was simply instructing GCC to avoid emitting SSE ops using its |
| 2293 |
-mno-see{2,3,4} and -mno-3dnow control flags. |
| 2294 |
|
| 2295 |
* Second bug was way uglier. It was only triggered at -O3 and with Kmalloc test |
| 2296 |
code (and only kmalloc test code, not any of the rest of the tests) enabled. |
| 2297 |
|
| 2298 |
The faults were a bit random, from continuous triple faults to rubbish on the |
| 2299 |
screen by printk(). This also means that printk is not available for debugging. |
| 2300 |
|
| 2301 |
Switching to -O2 mute all those faults .. |
| 2302 |
|
| 2303 |
* An important debugging tool when printf is not available is inspecting the final |
| 2304 |
kernel ELF. Having the ELF at hand helps us at: |
| 2305 |
|
| 2306 |
- directly knowing the exact addresses of all global symbols |
| 2307 |
|
| 2308 |
- translate any %rip address to the exact place in the executable. This aids in |
| 2309 |
knowing things like whether a core had deadlock on a spinlock (%rip inifinitely |
| 2310 |
tied at the spin loop), the exact x86 op of a faulting %rip address, etc |
| 2311 |
|
| 2312 |
* Applying above methods, we inspect the global vga_buffer array. This array get |
| 2313 |
its contents directly copied to the vga card, so we can have a nice view about |
| 2314 |
the source of screen corruption: |
| 2315 |
|
| 2316 |
$ readelf --syms kernel.elf | grep vga_buffer |
| 2317 |
|
| 2318 |
114: ffffffff8010eb60 4000 OBJECT LOCAL DEFAULT 6 vga_buffer |
| 2319 |
|
| 2320 |
Halting the CPU and inspecting physical address 0x10eb60 contents, we find that: |
| 2321 |
At -O2, well-formatted vga text: 0f 75 0f 43 0f 64 0f 74 - 'Cute' |
| 2322 |
At -O3, it's rubbish: 9f 19 9e fa 9f 9f ae 80 ... |
| 2323 |
|
| 2324 |
Further inspecting the printk() code, I found there's no apparent problem in |
| 2325 |
the code itself (or its generated assembly). The printk() code was mainly passed |
| 2326 |
rubbish strings, instead of the expected valid ones. |
| 2327 |
|
| 2328 |
* Inspecting the issue further, I've put a global string: |
| 2329 |
|
| 2330 |
static char string[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"; |
| 2331 |
|
| 2332 |
$ readelf --syms kernel.elf | grep string |
| 2333 |
|
| 2334 |
114: 0000000000000000 0 FILE LOCAL DEFAULT ABS string.c |
| 2335 |
118: ffffffff8010e680 37 OBJECT LOCAL DEFAULT 7 string |
| 2336 |
|
| 2337 |
Inspecting those mem locations at -O2, yielded the expected byte values: |
| 2338 |
41 42 43 44 45 46 47 48 .. |
| 2339 |
|
| 2340 |
At -O3, however, below values were found: |
| 2341 |
10 80 48 89 f2 48 03 53 08 48 01 eb |
| 2342 |
|
| 2343 |
* Checking the unexpected bytes sequence above, I smelled x86 opcodes, which |
| 2344 |
they truly were! They were from the disassembly: |
| 2345 |
|
| 2346 |
ffffffff80108140 <e820_init>: |
| 2347 |
... |
| 2348 |
ffffffff801081b7: 48 c7 c7 18 e3 10 80 mov $0xffffffff8010e318,%rdi |
| 2349 |
ffffffff801081be: 48 89 f2 mov %rsi,%rdx |
| 2350 |
ffffffff801081c1: 48 03 53 08 add 0x8(%rbx),%rdx |
| 2351 |
ffffffff801081c5: 48 01 eb add %rbp,%rbx |
| 2352 |
... |
| 2353 |
|
| 2354 |
So we got a lead, but it's just an indication of something utterly and |
| 2355 |
completely messed-up. |
| 2356 |
|
| 2357 |
* Trying to inspect the issue further, I halted the CPU and dumped first 10MB |
| 2358 |
of RAM before the crash. There were some really surprising results. |
| 2359 |
|
| 2360 |
Searching for some of the strings we send to printk() led to nothing, nada, |
| 2361 |
nil. The strings we send, they must be there at the kernel binary in ASCII |
| 2362 |
form, aren't simply there. Instead most of them are replaced by x86 ops. |
| 2363 |
|
| 2364 |
* Thinking about the situation, a change to kernel binary at such scale can |
| 2365 |
mostly happen by a serious bug at bootstrap code. |
| 2366 |
|
| 2367 |
Checking the first 3 lines of our bootstrap disassembly, where it moves |
| 2368 |
kernel code from bootstrap BSS to its expected place, made everything |
| 2369 |
clear: |
| 2370 |
|
| 2371 |
kern/kernel.elf: file format elf64-x86-64 |
| 2372 |
|
| 2373 |
Disassembly of section .text.head: |
| 2374 |
|
| 2375 |
0000000000100000 <startup_32>: |
| 2376 |
100000: be a0 03 10 00 mov $0x1003a0,%esi |
| 2377 |
100005: bf 04 80 10 00 mov $0x108004,%edi <-- [1] |
| 2378 |
10000a: b9 00 00 02 00 mov $0x20000,%ecx |
| 2379 |
|
| 2380 |
[1] As you can see, if the kernel image we're copying away from bootstrap |
| 2381 |
BSS to its right place exceeded a size of 0x8000 bytes, the part of the |
| 2382 |
image > 0x8000 will be overriden, which leads to the mess we were in. |
| 2383 |
|
| 2384 |
As in C's memmove(), we now avoid such corruption by copying kernel code |
| 2385 |
to a safe temporary buffer first, then copy it again to the final |
| 2386 |
overlapping place. |
| 2387 |
|
| 2388 |
(Sometimes reading disassembly makes you catch bugs quicker than reading |
| 2389 |
source code. Although in this case the source was really in assembly, when |
| 2390 |
we saw the _numbers_ instead of just the symbols exported by the linker |
| 2391 |
script, we caught the bug rather instantaneously) |
| 2392 |
|
| 2393 |
* Important lessons: |
| 2394 |
|
| 2395 |
- Take extreme care of overlaps while coding x86 'rep movsb' and friends. |
| 2396 |
It's good to dynamically assure that: |
| 2397 |
|
| 2398 |
+ src < dst, AND dst >= (src + len) .. i.e., bad overlap |
| 2399 |
+ OR src >= dst .. i.e., safe overlap |
| 2400 |
|
| 2401 |
Any other condition means direct corruption. |
| 2402 |
|
| 2403 |
The 'safe' overlap is only so if we are using 'movsb' instead of movsl |
| 2404 |
or movsq. Remember that this 'safe' overlap corrupts the source, but |
| 2405 |
keeps the destination data valid. |
| 2406 |
|
| 2407 |
- To be safe against such kind of overlaps in C parts of the kernel, |
| 2408 |
we've added below memcpy() asserts: |
| 2409 |
|
| 2410 |
if (src > dst && (src - dst) < 8) OR |
| 2411 |
(src <= dst && (dst - src) < len): |
| 2412 |
panic() [1] |
| 2413 |
|
| 2414 |
which can be reduced by logic (check Appendix) to: |
| 2415 |
|
| 2416 |
if (dst + 8 > src) AND (src + len > dst): |
| 2417 |
panic(); [2] |
| 2418 |
|
| 2419 |
The first check is used cause of 'movsq' 8-byte block copy. We've |
| 2420 |
intentionally transformed substraction in the '>' and '<' expressions |
| 2421 |
to addition cause of the _unsigned_ pointer values. |
| 2422 |
|
| 2423 |
---------------------------------------------------------------------- |
| 2424 |
| Appendix |
| 2425 |
| |
| 2426 |
| It's easy to prove equivalence of the 'if' conditions marked |
| 2427 |
| with [1] and [2] above; let: |
| 2428 |
| |
| 2429 |
| A eq 'src > dst' ==> ~A eq 'src <= dst' |
| 2430 |
| B eq '(src - dst) < 8' |
| 2431 |
| C eq '(dst - src) < len)' |
| 2432 |
| D eq '(A and B) or (~A and C)' |
| 2433 |
| src >= 0, dst >= 0, len >= 0 |
| 2434 |
| |
| 2435 |
| We want to prove that: |
| 2436 |
| D is fully equivalent to (B and C) |
| 2437 |
| |
| 2438 |
| From the premises, we can see that: |
| 2439 |
| if (B and C) is true, then |
| 2440 |
| '(A and B) or (~A and C)' is true, then |
| 2441 |
| D is true (I) |
| 2442 |
| |
| 2443 |
| From the premises, we can also see that: |
| 2444 |
| if A is true, then C is true (II) |
| 2445 |
| if ~A is true, then B is true (III) |
| 2446 |
| |
| 2447 |
| From premises and (II): |
| 2448 |
| if (A and B) is true, then (A and B and C) is true (IV) |
| 2449 |
| |
| 2450 |
| From premises and (III): |
| 2451 |
| if (~A and C) is true, then (~A and C and B) is true (V) |
| 2452 |
| |
| 2453 |
| From premises, (IV), and (V): |
| 2454 |
| if D is true, then (B and C) is true (VI) |
| 2455 |
| |
| 2456 |
| From (I) and (VI): |
| 2457 |
| D is fully equivalent to (B and C) Q.E.D. |
| 2458 |
---------------------------------------------------------------------- |
| 2459 |
|
| 2460 |
* Global lessons: |
| 2461 |
|
| 2462 |
- Disassembly is a great tool for debugging kernel code, since a lot of things |
| 2463 |
taken for granted at user-space is not guaranteed here anymore. Abstraction |
| 2464 |
cut both ways indeed. |
| 2465 |
|
| 2466 |
- It's _very important_ to deeply test each code path, no matter how trivial |
| 2467 |
it seems (basic stuff, but sometimes we forget the basics). |
| 2468 |
|
| 2469 |
- Memory dumping of critical kernel regions is really valuable; remember to use |
| 2470 |
it as needed. |
| 2471 |
|
| 2472 |
- It's important to test the code with different environment combinations: |
| 2473 |
different optimization levels: GCC's -O {0, 1, 2, 3, s} |
| 2474 |
different memory sizes: Qemu's -m {2, 3, .., 500, .., 2000, ..} |
| 2475 |
different number of cores: -smp {1, 2, 3, ..} |
| 2476 |
different CPU speeds: Bochs's ips= parameter |
| 2477 |
different BIOSes: different versions of Bochs Bios, different PC forms, .. |
| 2478 |
|
| 2479 |
|
| 2480 |
* The Zone: |
| 2481 |
----------- |
| 2482 |
|
| 2483 |
(Or the bug that took me around 6 days to solve) |
| 2484 |
|
| 2485 |
* While writing the context switch code, I observed a very ugly bug. Somehow the |
| 2486 |
montonic PIT interrupts that got triggered every 1 millisecond corrupts kernel |
| 2487 |
state. |
| 2488 |
|
| 2489 |
Symptoms: random exceptions and random assert() failures allover the place. |
| 2490 |
|
| 2491 |
* So I first thought there must be something wrong in the timer handler itself. |
| 2492 |
So what if the PIT handler did exactly nothing but acknowledging the local APIC? |
| 2493 |
(It must acknowledge the APIC so future IRQs get handled) |
| 2494 |
|
| 2495 |
The handler code was successfully reduced to: |
| 2496 |
|
| 2497 |
push %rax |
| 2498 |
movq $(VIRTUAL(APIC_PHBASE) + APIC_EOI), %rax |
| 2499 |
movl $0,(%rax) |
| 2500 |
pop %rax |
| 2501 |
iretq |
| 2502 |
|
| 2503 |
But still the same buggy behavior prevailed. |
| 2504 |
|
| 2505 |
* After reducing the handler code to a number of bytes, with no positive results, |
| 2506 |
I thought let's check the IDT validity. Inspecting the IDT cells date yielded |
| 2507 |
nothing interesting. |
| 2508 |
|
| 2509 |
Also checking the %rflags, bit by bit, led to nothing interesting. |
| 2510 |
|
| 2511 |
So I thought maybe we should setup the IDT cells as 'trap' instead of 'interrupt' |
| 2512 |
gates, but this is ofcourse wrong. Trap gates leaves the IF flag set while |
| 2513 |
servicing the interrupt, which means handlers can get interrupted indefinitely, |
| 2514 |
and our stack can get overflowed in a matter of milliseconds. |
| 2515 |
|
| 2516 |
* OK, so I thought instead of focusing on the IRQ handlers and the IDT, lets focus |
| 2517 |
on the _code_ that get interrupted itself. The failed code was usually the |
| 2518 |
general memory allocator kmalloc() test cases. |
| 2519 |
|
| 2520 |
After inspecting the interrupted code, I found nothing interesting. |
| 2521 |
|
| 2522 |
* To let the problem domain be smaller, and after minimizing the handler code, I |
| 2523 |
thought lets also minimize the interrupted code and see. Having a bunch of |
| 2524 |
printk()s in a tight loop, with the timer handler on, triggered the problem! |
| 2525 |
|
| 2526 |
Here the second day finishes. Either something is really messed up, or I'm a |
| 2527 |
shitty developer that can't even write a stable printk(). |
| 2528 |
|
| 2529 |
Furthermore, running any testcases after enabling the interrupts make them |
| 2530 |
fail miserably. From all the debugging I've done in the last 6 months, I |
| 2531 |
began to feel the instincts of such 'deep' kernel bugs: this is definitely |
| 2532 |
one of them. |
| 2533 |
|
| 2534 |
* So, I began to feel hopeless. I'll just inspect the x86 ops that triggered |
| 2535 |
the exceptions, hoping to find any pattern in the code that trigger the |
| 2536 |
bug. Definitely a pattern was clearly there; some examples of the failing |
| 2537 |
x86 ops: |
| 2538 |
|
| 2539 |
mov %rax, 0x0(%rbp) |
| 2540 |
movzbl (%rax), %eax |
| 2541 |
movl $0x2, 0x8(%rax) |
| 2542 |
movl $0x1, 0x4(%rax) |
| 2543 |
... |
| 2544 |
|
| 2545 |
And the most triggered, probably the _only_ triggered, exception was the page |
| 2546 |
fault #PF exception, vector 14. The faulting (unmapped) address was things |
| 2547 |
like %cr2=0x0, %cr2=0x8, %cr2=0x200020, etc. |
| 2548 |
|
| 2549 |
From above pattern, I thought: who's really messing up with our registers, |
| 2550 |
getting them loaded with weird values like 0, 0x8, etc? |
| 2551 |
|
| 2552 |
* So, we're in the verge of the third day end with no direct results, and nothing |
| 2553 |
known about the source of the bug. Having nothing specific to think about, I |
| 2554 |
inspected the critical kernel paths: the IDT setup, IOAPIC entries setup, the |
| 2555 |
GDT, the bootsector, early kernel boot setup, printk() code, etc. |
| 2556 |
|
| 2557 |
Funnily, cause I had a 'global' view about the kernel while doing this deep |
| 2558 |
inspection, 9 issues scattered around was discovered that needs to be fixed. |
| 2559 |
Unfortunately, fixing those 9 issues didn't really solve our main problem, but |
| 2560 |
it was nice nonetheless to discover them. |
| 2561 |
|
| 2562 |
* So we're now in the fifth day, and nothing is clear. Hopelessly, I swap our |
| 2563 |
printk code with FreeBSD's one. Things work fine, but once we enable interrupts |
| 2564 |
and the PIT in monotonic mode, similar erroneous behavior appears. Swapping |
| 2565 |
the FreeBSD code with the Linux code, and again, leading to the same failures. |
| 2566 |
|
| 2567 |
So I was right, something 'deep' is messed up, not just the nature of the |
| 2568 |
interrupted code. |
| 2569 |
|
| 2570 |
Another hopeless try was swapping the data segment descriptor index @ the GDT |
| 2571 |
from 0x10 to 0x18. Something interesting was discovered; the page faults |
| 2572 |
addresses occurs from values originating from %cs, %ss, and %rflags; it was |
| 2573 |
completely repeatable. Sounds familiar? Yes, they should! Those are the values |
| 2574 |
pushed by the CPU before invoking IRQ handlers! |
| 2575 |
|
| 2576 |
So somehow, registers that get dereferenced get the wrong values from the |
| 2577 |
state pushed and popped by the CPU before and after invoking an IRQ handler. |
| 2578 |
Now _this_ is an interesting discovery. |
| 2579 |
|
| 2580 |
So, how the %ss, %cs, and %rflags values got 'leaked' to our general-purpose |
| 2581 |
registers? The stack _must_ have been corrupted somehow .. |
| 2582 |
|
| 2583 |
* Trying different directions, we minimize the timer handler even further only to |
| 2584 |
the 'iretq' x86 op, but how do we acknowledge the local APIC this way? |
| 2585 |
|
| 2586 |
In fact, we dont. I just masked all the IOAPIC entries, and let the PIT trigger |
| 2587 |
through the PIC. The PIC was then programmed in 'Automatic EOI' mode, where no |
| 2588 |
acknowledgment was needed. This reduced the handler only to: |
| 2589 |
|
| 2590 |
iretq |
| 2591 |
|
| 2592 |
Even with this super-minimal handler, the bug gets triggered :( |
| 2593 |
|
| 2594 |
It was also interesting that, again, the #PF handler reported bad memory values |
| 2595 |
resembling %rflags, %ss, etc. |
| 2596 |
|
| 2597 |
* So after minimizing the timer handler to the absolute architecturally minimum, |
| 2598 |
we try to minimize the faulting code even further. After some hours of |
| 2599 |
disassembly, I find that at -O1 the VGA code alone (without the printk() code |
| 2600 |
that called it) triggers the problem. |
| 2601 |
|
| 2602 |
After even further inspection and disassembly (and wasted hours), I find that |
| 2603 |
not only we can limit the code to VGA writing, but only to the string function |
| 2604 |
memcpy(). |
| 2605 |
|
| 2606 |
So, now, all I have is the timer interrupt and an infinite loop containing |
| 2607 |
memcpy, and the code still fails! |
| 2608 |
|
| 2609 |
And, yes, the memcpy() code was valid (ehem, Dijkestra's 'Humble Programmer') |
| 2610 |
|
| 2611 |
* So I go crazy stripping the entire kernel. All what was left was code that |
| 2612 |
re-programs the PIC, the PIT setup code, and the string function memcpy() |
| 2613 |
and the code still failed. |
| 2614 |
|
| 2615 |
That was, well, weird. We really have nothing now except a very small number of |
| 2616 |
lines of code and I don't know what to do. |
| 2617 |
|
| 2618 |
Once we enable interrupts, the memcpy() function fail with similar patterns to |
| 2619 |
the one described above (page faults, with faulting memory values similar to |
| 2620 |
%ss and %rflags) |
| 2621 |
|
| 2622 |
Well, you can imagine the state I'm in after all those days. |
| 2623 |
|
| 2624 |
* Another step for more explicitness was disabling the PIT, and triggering its |
| 2625 |
handler manually using software interrupts. I've inject software interrupts |
| 2626 |
around strategic C statements like '*src = *dst', etc. |
| 2627 |
|
| 2628 |
* So at the middle of the middle of the sixth day, I disassemble the memcpy() |
| 2629 |
function, draw the stack on paper, and trace. Now something became directly |
| 2630 |
apparent. Notice those instructions (gcc -O0): |
| 2631 |
|
| 2632 |
/* Classical stack setup */ |
| 2633 |
ffffffff80109c88: 55 push %rbp |
| 2634 |
ffffffff80109c89: 48 89 e5 mov %rsp,%rbp |
| 2635 |
ffffffff80109c8c: 48 89 7d e8 mov %rdi,-0x18(%rbp) |
| 2636 |
ffffffff80109c90: 48 89 75 e0 mov %rsi,-0x20(%rbp) |
| 2637 |
|
| 2638 |
/* Bochs magic breakpoint */ |
| 2639 |
ffffffff80109caa: 66 87 db xchg %bx,%bx |
| 2640 |
|
| 2641 |
/* Our manual software interrupt */ |
| 2642 |
ffffffff80109cad: cd f0 int $0xf0 |
| 2643 |
|
| 2644 |
/* Failing code, specially last line */ |
| 2645 |
ffffffff80109caf: 48 8b 45 f8 mov -0x8(%rbp),%rax |
| 2646 |
ffffffff80109cb3: 0f b6 10 movzbl (%rax),%edx |
| 2647 |
ffffffff80109cb6: 48 8b 45 f0 mov -0x10(%rbp),%rax |
| 2648 |
ffffffff80109cba: 88 10 mov %dl,(%rax) |
| 2649 |
|
| 2650 |
Notice something? Look again. See those `mov -0x8(%rbp), %rax' and |
| 2651 |
`mov -0x10(%rbp), %rax) ops? |
| 2652 |
|
| 2653 |
Now check the top of the function, especially the stack pointer setup. Can't |
| 2654 |
you notice something? |
| 2655 |
|
| 2656 |
YES, GCC used areas of the stack _B E L O W_ the stack pointer to keep parts |
| 2657 |
of the leaf function local state. Now when the interrupt was triggered at |
| 2658 |
`int $0xf0', the CPU rightfully pushed CPU state in that order: |
| 2659 |
|
| 2660 |
%ss |
| 2661 |
%rsp |
| 2662 |
%rflags |
| 2663 |
%cs |
| 2664 |
%rip |
| 2665 |
|
| 2666 |
Which meant that parts of that function local state get corrupted through |
| 2667 |
the implicit CPU stack usage! |
| 2668 |
|
| 2669 |
This is also the reason we found faulting addresses of values resembling %ss |
| 2670 |
and %rflags. Our processor modified our pointers with those! |
| 2671 |
|
| 2672 |
* Wow, THAT WAS INTERESTING and worthwhile to invest the week in. Inspecting the |
| 2673 |
AMD64 ABI to see why GCC uses this behavior, which is _certainly_ interrupt and |
| 2674 |
signals unsafe, I found the answer: it's 'The Red Zone'. |
| 2675 |
|
| 2676 |
Now I've really had enough for today, see the discussion of this zone at the |
| 2677 |
AMD64 ABI document. |
| 2678 |
|
| 2679 |
* Solution: |
| 2680 |
|
| 2681 |
Either: |
| 2682 |
[1] Ignore the 128-byte region in the IRQ handler OR |
| 2683 |
[2] Let the compiler avoid using this Red Zone |
| 2684 |
|
| 2685 |
We can't really use [1], since the CPU pushes its state behind our backs. We |
| 2686 |
can't let it wind 128-bytes of the stack. Also the kernel stack is a very |
| 2687 |
valuable resource, and we shouldn't waste 128-Bytes of it to save an x86 op |
| 2688 |
or two. |
| 2689 |
|
| 2690 |
Thus, we pick the second option. That's also what the mainstream folks like |
| 2691 |
{Open, Free}BSD and Linux do (which I discovered too lately). |
| 2692 |
|
| 2693 |
* Lessons: |
| 2694 |
|
| 2695 |
- When faced with triple faults, inspect the following: |
| 2696 |
- the IDT table: if it has been overwritten, then the machine will definitely |
| 2697 |
triple fault. |
| 2698 |
- stack sanity: If the stack was not sane, then pushes and pops will lead to |
| 2699 |
faults, leading to triple faults. |
| 2700 |
- MM structures sanity. If the pagetables have been erroneously overwritten, |
| 2701 |
the stack will get erroneously mapped. |
| 2702 |
- Kernel stack overflow: corrupt stack = triple faults |
| 2703 |
|
| 2704 |
- If you noticed that an interrupt randomly triggers a problem in the kernel, |
| 2705 |
it can be good to disable interrupts (avoid distractions), then find the most |
| 2706 |
affected code path and _manually_ inject interrupts there using software IRQs. |
| 2707 |
|
| 2708 |
- The best possible way to let IRQ handlers be as minimal as possible is to use |
| 2709 |
the PIC in Automatic EOI (AEOI) mode. They will get the handler minimized to |
| 2710 |
the 2 byte `iretq' x86-64 op. |
| 2711 |
|
| 2712 |
- Let code binary single-stepping be our last debugging resort to make sure we |
| 2713 |
understand the algorithms and code we've just written well enough. |
| 2714 |
|
| 2715 |
|
| 2716 |
Debugging periodic timers using statistical data: |
| 2717 |
------------------------------------------------- |
| 2718 |
|
| 2719 |
* A problem appeared while running the scheduler with the APIC timer instead of |
| 2720 |
the PIT. Using observation, it was clear that the running intervals given |
| 2721 |
to processes by the APIC timer are twice as long as the ones given by the PIT. |
| 2722 |
Both timers were supposedly programmed to produce 40 millisecond intervals, |
| 2723 |
so where's the fault? |
| 2724 |
|
| 2725 |
Note: we are talking about monotonic PIT and APIC interrupts since those are |
| 2726 |
the only ones used by the scheduler; we don't use the one-shot variants here. |
| 2727 |
|
| 2728 |
* One of the best ways I found to debug this problem, instead of depending on |
| 2729 |
shaky heuristics, is to store the occurrence of the timer ticks interrupts in |
| 2730 |
a statistical manner. |
| 2731 |
|
| 2732 |
An example of this statistical method would be finding the number of timer |
| 2733 |
ticks occurrence after a sequence of independently-programmed delay intervals. |
| 2734 |
|
| 2735 |
* How to do so without sacrificing simplicity? |
| 2736 |
|
| 2737 |
Setup the scheduler periodic timer through the PIT or the APIC: |
| 2738 |
`pit_monotonic(40)' OR `apic_monotonic(40)'. |
| 2739 |
|
| 2740 |
Setup a counter representing the number of scheduler timer ticks so far: |
| 2741 |
|
| 2742 |
.globl ticks_counter |
| 2743 |
ticks_counter: |
| 2744 |
.quad 0 |
| 2745 |
|
| 2746 |
.globl timer_handler |
| 2747 |
timer_handler: |
| 2748 |
... |
| 2749 |
incq ticks_counter |
| 2750 |
... |
| 2751 |
|
| 2752 |
Then, do _independent_ time delays and store the number of timer ticks so far |
| 2753 |
after each delay. This way, we can have a statistical comparison between the |
| 2754 |
two timers, the one used for the delay and the one used for the ticks. An |
| 2755 |
example of this would be: |
| 2756 |
|
| 2757 |
/* ticks[i]: number of scheduler timer ticks after the `i'th delay |
| 2758 |
* interval, having in mind that we've setup both the ticks and delay |
| 2759 |
* timers with the same time priod. |
| 2760 |
* |
| 2761 |
* If `1 2 3 4 5 6 ..' is printed, then the two timers are in harmony. |
| 2762 |
* Having the sequence '3 5 7 9 11 ..' means that the ticks timer runs |
| 2763 |
* twice as fast as the delay one. */ |
| 2764 |
for (int i = 0; i < 100; i++) |
| 2765 |
pit_mdelay(52), ticks[i] = ticks_counter; |
| 2766 |
for (int j = 0; j < 100; j++) |
| 2767 |
printk("%d ", ticks[j]); |
| 2768 |
|
| 2769 |
Comparing the statistical results, we found that the PIT in monotonic mode |
| 2770 |
ticks twice as fast as the PIT programmed in delay mode. It also ticked twice |
| 2771 |
as fast as the APIC timer in delay mode. This heavily suggests something wrong |
| 2772 |
in our PIT monotonic-mode setup! |
| 2773 |
|
| 2774 |
Meanwhile, setting up the scheduler ticks using the APIC timer ran in perfect |
| 2775 |
harmony with the PIT-programmed delays. This heavily re-enforces above paragr- |
| 2776 |
aph reasoned speculation: something is wrong in our PIT periodic mode code. |
| 2777 |
|
| 2778 |
* Now that the problem was isolated, the error was a misunderstanding of the |
| 2779 |
PIT's mode-3 behavior; mode-2 fitted our goals better. |
| 2780 |
|
| 2781 |
Interestingly, I found this method really helpful for testing periodic timers |
| 2782 |
code in general. We already test timer delays code by calibrating its produced |
| 2783 |
delays against wall clock time; this method complement the tests by also |
| 2784 |
testing the monotonic/periodic interrupts code. |
| 2785 |
|
| 2786 |
Note that for testing, specifying a large delay period catches more periodic |
| 2787 |
timers accuracy errors. Values from 50 ms and up proved to be good. |
| 2788 |
|
| 2789 |
* A similar test-case for periodic timers is comparing their statistical |
| 2790 |
ticks occurrences relative to each other. This will guarantee that all system |
| 2791 |
periodic timers really measure the same concept of time! |
| 2792 |
|
| 2793 |
|
| 2794 |
The Naughty Multiply: |
| 2795 |
--------------------- |
| 2796 |
|
| 2797 |
* References: |
| 2798 |
- Intel Instruction Set Reference, A-M, vol. 2A, MUL. |
| 2799 |
- GCC manual, 'Using the GNU Compiler Collection' |
| 2800 |
|
| 2801 |
* So I was writing a casual number of test-cases, where this sequence of code: |
| 2802 |
|
| 2803 |
void run_tests(void) { |
| 2804 |
memset(_arr, 0xFF, 0x100); |
| 2805 |
__memcpy_forward(_arr, _arr + 20, 10); |
| 2806 |
__memcpy_forward(_arr + 20, _arr, 10); |
| 2807 |
} |
| 2808 |
|
| 2809 |
produced a NULL-dereference exception. As printed by our exception handler: |
| 2810 |
|
| 2811 |
Exception: vector=14, errcode=0x2, %cs=0x8, %rip=0xffffffff8010b6f7, |
| 2812 |
%ss=0x0, %rsp=0xffffffff80104ff0, %rflags=0x200202, %cr2=0x0, |
| 2813 |
%cr3=0x3ffef000, %cr4=0x20 |
| 2814 |
|
| 2815 |
It was weird: the code of these two functions was a very simple and |
| 2816 |
straightforward inline GCC assembly. |
| 2817 |
|
| 2818 |
The problem was clearly related somehow to registers-usage pressure (the |
| 2819 |
compiler register allocator decisions) in the mother function run_tests(). |
| 2820 |
Removing one memcpy() above, or adding a printk() in a random place in that |
| 2821 |
function would make the NULL exception disappear. |
| 2822 |
|
| 2823 |
* Disassembling the area around the faulting opcode, we found: |
| 2824 |
|
| 2825 |
ffffffff8010b6a0 <string_run_tests>: |
| 2826 |
ffffffff8010b6a0: 48 c7 c2 20 fb 10 80 mov $0xffffffff8010fb20,%rdx |
| 2827 |
ffffffff8010b6a7: 31 f6 xor %esi,%esi |
| 2828 |
ffffffff8010b6a9: b8 ff 00 00 00 mov $0xff,%eax |
| 2829 |
ffffffff8010b6ae: 48 89 d7 mov %rdx,%rdi |
| 2830 |
ffffffff8010b6b1: b9 00 01 00 00 mov $0x100,%ecx |
| 2831 |
ffffffff8010b6b6: 48 89 ce mov %rcx,%rsi |
| 2832 |
ffffffff8010b6b9: 48 83 e1 07 and $0x7,%rcx |
| 2833 |
ffffffff8010b6bd: f3 aa rep stos %al,%es:(%rdi) (+) |
| 2834 |
ffffffff8010b6bf: 48 89 f1 mov %rsi,%rcx |
| 2835 |
ffffffff8010b6c2: 48 c1 e9 03 shr $0x3,%rcx |
| 2836 |
ffffffff8010b6c6: 48 89 c6 mov %rax,%rsi |
| 2837 |
ffffffff8010b6c9: 48 b8 01 01 01 01 01 mov $0x101010101010101,%rax |
| 2838 |
ffffffff8010b6d0: 01 01 01 |
| 2839 |
ffffffff8010b6d3: 48 f7 e6 mul %rsi |
| 2840 |
ffffffff8010b6d6: f3 48 ab rep stos %rax,%es:(%rdi) (++) |
| 2841 |
ffffffff8010b6d9: 41 b9 0a 00 00 00 mov $0xa,%r9d |
| 2842 |
ffffffff8010b6df: 31 c0 xor %eax,%eax |
| 2843 |
ffffffff8010b6e1: 48 c7 c6 34 fb 10 80 mov $0xffffffff8010fb34,%rsi |
| 2844 |
ffffffff8010b6e8: 48 89 d7 mov %rdx,%rdi |
| 2845 |
ffffffff8010b6eb: 44 89 c9 mov %r9d,%ecx |
| 2846 |
ffffffff8010b6ee: 41 89 c0 mov %eax,%r8d |
| 2847 |
ffffffff8010b6f1: 41 89 c8 mov %ecx,%r8d |
| 2848 |
ffffffff8010b6f4: 83 e1 07 and $0x7,%ecx |
| 2849 |
ffffffff8010b6f7: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi) |
| 2850 |
|
| 2851 |
From our exception handler, we can see that the faulting instruction |
| 2852 |
is the above 'rep movsb'. We can also deduce that (+) and (++) ops are |
| 2853 |
produced for memset(), and the rest are the memcpy() opcodes. |
| 2854 |
|
| 2855 |
* Tracing the code by hand, I reached the faulty instruction without |
| 2856 |
catching the NULL dereference. A null dereference at the 'movsb' |
| 2857 |
instruction can only mean that either %rsi = 0, or %rdi = 0, or both |
| 2858 |
had magically turned to zeroes. |
| 2859 |
|
| 2860 |
Wondering about the state, I overwrote this movsb instruction in the |
| 2861 |
final kernel binary with a Bochs magic breakpoint code (66 87 db) |
| 2862 |
using a simple hex editor. |
| 2863 |
|
| 2864 |
After moving to the debugging console, thanks to the magic breakpoint, |
| 2865 |
here was the Bochs registers dump: |
| 2866 |
|
| 2867 |
(0) Magic breakpoint |
| 2868 |
CPU0: |
| 2869 |
rax: 0x00000000:00000000 rcx: 0x00000000:00000002 |
| 2870 |
rdx: 0x00000000:00000000 rbx: 0xffffffff:801129a8 |
| 2871 |
rsp: 0xffffffff:80104ff0 rbp: 0x00000000:00000000 |
| 2872 |
rsi: 0xffffffff:8010fb34 rdi: 0x00000000:00000000 |
| 2873 |
r8 : 0x00000000:0000000a r9 : 0x00000000:0000000a |
| 2874 |
r10: 0x0f200f20:0f200f20 r11: 0x66656463:62613938 |
| 2875 |
r12: 0x00000000:00000000 r13: 0x00000000:00000000 |
| 2876 |
r14: 0x00000000:00000000 r15: 0x00000000:00000000 |
| 2877 |
rip: 0xffffffff:8010b6fa |
| 2878 |
eflags 0x00200202: ID vip vif ac vm rf nt IOPL=0 of df IF tf sf zf af pf cf |
| 2879 |
|
| 2880 |
One thing was clearly noted: for some reason, %rdx (and %rdi by consequence) |
| 2881 |
have magically turned to zeroes, leading to the NULL dereference! |
| 2882 |
|
| 2883 |
* Finding that no explicit opcode zeroed %rdx above, then the zeroing |
| 2884 |
must have been a side-effect of some instruction above the faulty |
| 2885 |
dereference. Looking at the Intel Manual for cues, I found that |
| 2886 |
opcode: it was the 'mul %rsi' one. |
| 2887 |
|
| 2888 |
So why didn't GCC know about this? Because _I_ did not know about |
| 2889 |
it while coding, thus I did not explicitly tell GCC that this nice |
| 2890 |
register will get implicitly clobbered. |
| 2891 |
|
| 2892 |
GCC, naively thought that the register value was unmodified, and |
| 2893 |
rightfully re-used that register value at address 0x8010b6e8. GCC |
| 2894 |
is really competent wrt to optimizations. The faulty code was: |
| 2895 |
|
| 2896 |
asm volatile ( |
| 2897 |
[...] |
| 2898 |
"mov %[ch], %[tmp];" |
| 2899 |
"mov $0x0101010101010101, %[ch];" |
| 2900 |
"mul %[tmp];" |
| 2901 |
"rep stosq;" |
| 2902 |
: [tmp] "=&r"(tmp), [dst] "=&D"(d0), [ch] "=&a"(uch), |
| 2903 |
[len] "=&c"(ulen) |
| 2904 |
: "[dst]"(dst), "[ch]"(uch), "[len]"(ulen), "[tmp]"(tmp) |
| 2905 |
: "memory"); |
| 2906 |
|
| 2907 |
As you can see, %rdx was not mentiond as a clobbered register, and |
| 2908 |
not even as an output register. GCC had all the rights to believe |
| 2909 |
that this register value was not modified in that snippet. |
| 2910 |
|
| 2911 |
In another way, the clobber list tells gcc which registers (or |
| 2912 |
memory) are changed by the snippet, but not listed as an output. |
| 2913 |
|
| 2914 |
* The simple solution was adding the 'rdx' register to the assembly |
| 2915 |
snippet clobber list. Now the compiler output looks like: |
| 2916 |
|
| 2917 |
ffffffff8010b6a0 <string_run_tests>: |
| 2918 |
ffffffff8010b6a0: 49 c7 c0 20 fb 10 80 mov $0xffffffff8010fb20,%r8 |
| 2919 |
[...] |
| 2920 |
ffffffff8010b6e8: 48 89 d7 mov %r8,%rdi |
| 2921 |
[...] |
| 2922 |
ffffffff8010b6f7: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi) |
| 2923 |
|
| 2924 |
where the compiler uses the new register %r8, entirely avoiding |
| 2925 |
the %rdx register in the process. |
| 2926 |
|
| 2927 |
* From investigation, it seems GCC need to also be notified if the |
| 2928 |
assembly snippet alters the condition code register %rflags. We now |
| 2929 |
handle this by adding `cc' to the clobber list while using ops like |
| 2930 |
'andq', 'popfq', 'decq', and 'shrq'. |
| 2931 |
|
| 2932 |
Cases where we forgot to clobber 'rcx' after using opcodes like |
| 2933 |
'rep XX' were also caught. For this, we only need to set 'rcx' as a |
| 2934 |
double input+output register; GCC will infer the rest. |
| 2935 |
|
| 2936 |
Another faulty case was having 'rsi' be only set as an input register |
| 2937 |
amid the existence of 'movsq' in the assembly snippet. |
| 2938 |
|
| 2939 |
* Lesson: |
| 2940 |
|
| 2941 |
While I did study the privileged x86 ring0 opcodes carefully (mainly |
| 2942 |
Intel's manual vol. 3A, and AMD vol. 2), the basic x86 opcodes were |
| 2943 |
learned 'on the go'. The problem we faced today is a small price for |
| 2944 |
that; I should re-read Intel's vol. 1 docs way more carefully. |
| 2945 |
|
| 2946 |
To make-up for this, and also using the Intel manuals, I investigated |
| 2947 |
the side effects of __every__ opcode used in the kernel inlined |
| 2948 |
assembly snippets. That's how I caught the %rflags, %rcx, and %rsi |
| 2949 |
cases mentioned in the above point! |