Notes on the Cute Kernel TOC: ---- (1) On Intel Documents (2) On Data Alignment (3) On little-endian byte ordering (4) On Real Mode (5) On accessing fixed disk drivers (hard-disks) (6) On x86 jumps (7) On referencing global symbols (8) The 8259A Programmable Interrupt Controller (9) On x86 SMP initialization sequence (10) On linkers 'relocation error's (11) When %RIP-relativeness goes wrong (12) Testing libc()-like kernel code (13) On SMP Memory ordering models (14) On the C 'volatile' type qualifier (15) On the vaguely documented x86 16-bit protected mode (16) Using light debugging techniques (17) Musings on Memory Allocation (18) On Virtual Memory (19) On Partitionting the 64-bit kernel address space (20) Debugging -O3 triggered problems (21) The Zone! (22) Debugging periodic timers using statistical data (23) The Naughty Multiply On Intel Documents: ------------------- * Although the newer intel documents are much more sexy and detailed, the original i386 one describes the concepts in a much more bottom-up, undergraduate-book like, no forward-references way. * Somehow the new intel documents beginning from the 1997 i686 document assumes some x86 familarity that wasn't obviously assumed in the original document. * For an example, the 'alignment' topic is throughly explained in the i386 manual, what is it and why is it important to performance, and how the processor act if given a misaligned address, *before* ever using the words '16bit-aligned' and '32bit-aligned'. The new intel documents assume you know subtle topics as above and talks about critical issues related to their understandings freely *before* they become throughly discussed. Only at i686 chapter 5 alignment is throughly described although several critical alignment related topics have already been discussed. * Case 2: on why we should disable interrupts while switching to protected mode? New Intel document: Software must guarantee that no exceptions or interrupts are generated during the mode switching operation. Original 386 document: ``The IDTR may be loaded in either real-address or protected mode. However, the format of the interrupt table for protected mode is different than that for real-address mode. It is not possible to change to protected mode and change interrupt table formats at the same time; therefore, it is inevitable that, if IDTR selects an interrupt table, it will have the wrong format at some time. An interrupt or exception that occurs at this time will have unpredictable results. To avoid this unpredictability, interrupts should remain disabled until interrupt handlers are in place and a valid IDT has been created in protected mode.'' * Case 3: on why should we long jump after switching to protected mode? New intel document: ``The JMP or CALL instruction immediately after the MOV CR0 instruction changes the flow of execution and serializes the processor.'' Original 386 document: ``Immediately after setting the PE flag, the initialization code must flush the processor's instruction prefetch queue by executing a JMP instruction. The 80386 fetches and decodes instructions and addresses before they are used; however, after a change into protected mode, the prefetched instruction information (which pertains to real-address mode) is no longer valid. A JMP forces the processor to discard the invalid information.'' This is a recurring pattern. * What's IA-32e in new Intel documents? Intel has really messed up the namings of their 64bit processors. The most logical choice I think was IA-64, but unfortunately that was taken by the big flop in processor's history: the Itanium. When Intel waked up and created their first amd64 compatible processor, they named it EMT64. Now intel in its new official documents call the x86_64 architecture "intel 64" and call those processor's 64bit mode IA-32e. On Data Alignment: ------------------ * Note: In x86 speak, word = 2bytes/16bits, double word = 4bytes/32bits * Each byte within a word has its own address, and the smaller of the addresses represent the address of the whole word. * Each byte within the double word has its own address. The smallest of the byte addresses is the address of the double word. * ``Note that words need not be aligned at even-numbered addresses and doublewords need not be aligned at addresses evenly divisible by four. This allows maximum flexibility in data structures (e.g., records containing mixed byte, word, and doubleword items) and efficiency in memory utilization. ** When used in a configuration with a 32-bit bus, actual transfers of data between processor and memory take place in units of doublewords (4bytes) beginning at addresses evenly divisible by four ** However, the processor converts requests for misaligned words or doublewords into the appropriate sequences of requests acceptable to the memory interface. ** Such misaligned data transfers reduce performance by requiring extra memory cycles. For maximum performance, data structures (including stacks) should be designed in such a way that, whenever possible, word operands are aligned at even addresses and doubleword operands are aligned at addresses evenly divisible by four. ** Due to instruction prefetching and queuing within the CPU, there is no requirement for instructions to be aligned on word or doubleword boundaries. However, a slight increase in speed results if the target addresses of control transfers are evenly divisible by four.'' -- Intel 386 Reference Manual * ``The reason for this is that the processor requires two memory accesses to make an unaligned memory access; whereas, aligned accesses require only one memory access. A word or doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses an 8-byte boundary is considered unaligned and requires two separate memory bus cycles to access it; a word that starts on an odd address but does not cross a word boundary is considered aligned and can still be accessed in one bus cycle.'' -- Intel 686 Reference Manual * x86 4KB page aligned addresses: Decimal: 10 ^ 0 = 1 10 ^ 1 = 10 10 ^ 2 = 100 10 ^ 3 = 1000 ... Binary: 2 ^ 0 = 1 2 ^ 1 = 10 2 ^ 2 = 100 ... 2 ^ 10 = 10000000000 = 1024 = 1KB 2 ^ 12 = 1000000000000 = 2^2 * 1KB = 4KB (x86 Page size/aligned-address) ... * Common alignments offsets: 1byte alignment = 2 ^ 0 -- any x86 RAM address 2byte alignment = 2 ^ 1 (0b10) -- 16bit aligned address = any address divisable by 2 = 0x0, 0x2, 0x4, 0x6, 0x8, ..., 0xa, 0xc, 0xe, 0x10, 0x12, ... 4byte alignment = 2 ^ 2 (0b100) -- 32bit aligned address = any address divisable by 4 = 0x0, 0x4, 0x8, 0xc, 0x10, 0x14, 0x18, 0x1c, 0x20, ... 8byte alignment = 2 ^ 3 (0b1000) - GDT cell index/size = any address divisable by 8 = 0x0, 0x8, 0x10, 0x18, 0x20, ... | | | null %cs %ds 16byte alignment = 2 ^ 4 (0b10000) - GDT table recommended alignment = any address divisable by 16 = 0x0, 0x10, 0x20, 0x30, 0x40, ..., 0xf0, 0x100, 0x110, 0x120, ... * Useful numbers: 0x100 = 256 bytes 0x200 = 512 bytes (bootsector size, number of entries in PML{4, 3, 2} tables) 0x400 = 1KB 0x1000 = 4KB (x86 normal page size/alignment) 0x8000 = 32KB 0xffff = 64KB -1 (max offset for a real-mode segment) 0x10000 = 64KB 0x80000 = 512KB (our initial kernel size!) On little-endian byte ordering: ------------------------------- * From AMD64 documents: "The x86 and AMD64 architectures address memory using little-endian byte-ordering. Multibyte values are stored with their least-significant byte at the lowest byte address, and they are illustrated with their least significant byte at the right side. Strings are illustrated in reverse order, because the addresses of their bytes increase from right to left." Value = BYTE-7 BYTE-6 BYTE-5 BYTE-4 BYTE-3 BYTE-2 BYTE-1 BYTE-0 Actual Mem Layout: BYTE-0 BYTE-1 BYTE-2 BYTE-3 BYTE-4 BYTE-5 BYTE-6 BYTE-7 ^ |--- Address start here For illustrative purposes only - not actual mem layout: GREATEST DATA STRUCTURE ADDRESS 31 23 15 7 0 <--BIT +---------------+---------------+---------------+---------------+ OFFSET | BYTE 7 | BYTE 6 | BYTE 5 | BYTE 4 |4 |---------------+---------------+---------------+---------------| SMALLEST | BYTE 3 BYTE 2 BYTE 1 BYTE 0 |0 ADDRESS +---------------+---------------+---------------+---------------+^ | Address start here ---| * Classical little-endian ordering: .long 0x00cf9200 = 00 92 CF 00 ^ |--- Address start = +---------------+---------------+---------------+---------------+ | 00 | CF | 92 | 00 | +---------------+---------------+---------------+---------------+^ Address start ---| * GAS concatenation: .long 0x00cf9200, 0x0000ffff != .quad 0x00cf92000000ffff .long 0x00cf9200, 0x0000ffff = .long 0x00cf9200 .long 0x0000ffff = .quad 0x0000ffff00cf9200 = (actual layout in memory) 00 92 CF 00 FF FF 00 00 ^ |--- Address start = (for illustrative purposes only - _NOT_ actual mem layout) +---------------+---------------+---------------+---------------+ | 00 | 00 | FF | FF | |---------------+---------------+---------------+---------------| | 00 | CF | 92 | 00 | +---------------+---------------+---------------+---------------+^ Address start ---| BUT: .long 0x0000ffff .long 0x00cf9200 = .quad 0x00cf92000000ffff = (for illustrative purposes only - _NOT_ actual mem layout) +---------------+---------------+---------------+---------------+ | 00 | CF | 92 | 00 |4 |---------------+---------------+---------------+---------------| | 00 | 00 | FF | FF | +---------------+---------------+---------------+---------------+^ Address start ---| = (actual layout in memory) FF FF 00 00 00 92 CF 00 ^ |--- Address start * So: test.S: | print.c: | .global test | #include test: | extern uint64_t test; .quad 0x00cf92000000ffff | int main(void) { printf("test = 0x%x\n", test); } output: test = 0x0000ffff as the 64-bit value is presented as: (address-start) FF FF 00 00 00 92 ... (address-end) * The String "WXYZ" is represented in memory as: 'W' 'X' 'Y' 'Z' ^ |--Address start And in the usual memory diagrams: +---------------+---------------+---------------+---------------+ | 'Z' | 'Y' | 'X' | 'W' | +---------------+---------------+---------------+---------------+^ Address start ---| On Real Mode: ------------- * To achieve maximum backward compatibility, x86 processors starts in the 16bit 8086-compatible real-mode. The 8086 had a complete 16-bit architecture - 16-bit internal registers, 16-bit data bus, and a *20-bit* external address bus (1 MB of physical memory). Because the 8086 processor had 16-bit index registers and memory pointers, it can effectively address *only* up to 64 KB of memory, which wasn't acceptable even by that time. * So, to address the memory beyond 64 KB (the remaining 20bit 1MB address space) with 16bit registers, Intel created a `segmented' memory addressing scheme. We want to access 20bit address space, namely 0xfffff at max. The 8086 processor's registers are only 16bit, namely a max value of 0xffff. To solve this issue, the 8086 processor used *two* registers to address physical memory. The two registers are a segment described as a base address, and an offset within that segment. The goal: addressing up to the max 20bit value; 0xfffff: f 0 0 0 0 <-- 16bit value (segment base address) hex left shifted once to address the 4 highest order bits in the goal. + <-- A simple adding operation f f f f <-- 16bit value, max offset we can address ------- f f f f f <-- The goal (max 20bit value we can address) So physical memory in the 8086 is accessed through a `segment:offset' scheme where physical address = (segment * 16) + offset * Note that the above segment model is very primitive that there are really no real `segments' identified by the processor. A *physical* 'segment' is basically just a 16-bit aligned physical address address with an offset `limit' of 64K enforced by the processor registers limit. Example: a000:0010 = a010:0000 /* Overlapping 'segments' */ f000:ffff = ffff:000f /* Maximum address using different 'segments' */ 40:100 = 50:00 /* Last byte in Bios Data Area */ It's important to note that a segment register can take any value ranging from 0 to 0xffff including the non-16bit-aligned addresses. It's on *physical* memory where a segment base address is always 16-bit aligned due to the left shift, not the segment register value itself. Example of completely valid segment register values and their physical addresses: Segment register value | Segment physical address ------------------------------------------------------- 0xffff | 0xffff0 0x1abc | 0x1abc0 0x4242 | 0x42420 ... | ... 0xyyyy | 0xyyyy0, y = /[0-f]/ * Some calculations: biggest segment address possible = 0xffff biggest offset possible = 0xffff biggest *physical* address that can act as a segment address = biggest 20-bit address which is 16-byte aligned = 0xffff0 * NOTE: There's a huge difference between real-mode `segments' and segments provided in 386+ protected-mode processors. The 16bit value in a segmented protected-mode logical address is what's called a `segment selector' which is basically a pointer to a `segment descriptor' stored in either the GDT or the LDT. The segment descriptor includes info like the protected-mode segment base, limit, access permissions, present bit, etc. On the other hand, the real-mode `segments' are just primitive addresses that get hex left shifted in address calculations. * Note that using this scheme, code segments are limited to a size of 64KB. To reach more code, we had to jump to what's called 'far pointers', which not only contains an offset relative to current code segment, but also a new address for the code segment. AT&T syntax for far jumps: `ljmp $SEGMENT, $OFFSET' where 'segment' and 'offset' are naturally 16bit values. Usage example: after loading the kernel to ram at its chosen segment which is usually far away from the mbr code segment, we 'far jump' to this segment with offset 0 to execute the loaded kernel code and go beyond the mbr's 512 bytes size limit. This is the same technique used by GRUB to loads its stages. * The above terminology is also the root of 386+ protected mode 'far pointer' which is a 'logical address' containing a segment selector (16 bit) and an offset (32 bit) relative to the segment pointed by that selector. On accessing fixed disk drivers (hard-disks): --------------------------------------------- * There's the usual heads, tracks, sectors, and cylinders. Track: when the head is positioned over a point on the fixed disk, ready to read or write, the disk platter surface spins underneath it, tracing a full circle. This circle is a track. A disk surface contains may tracks. Sectors: Fixed disk systems divides each track into short arcs called sectors, each sector usually holds 512 bytes. This is the smallest unit of storage, as tracks would mean wasting a lot of space for small files. Cylinder: all the heads move in and out together, so each head is always physically located at the same track number; we can't have one head at track 0, and another at 250. Because of this, often the track location of the heads is not referred to as a track number but rather as a cylinder number. Disk size = bytes per sector * sectors per track * number of cylinders * number of heads * BIOS fixed disk services are provided through the INT 13h service routine. It programs the hard-disk controller; it never writes directly from the processor to the hard-disk drive. The hard-disk controller communicates directly between the hard-disk, the processor, and system memory. * To support disks beyoned the 8GB limit enforced by original IBM bios design (max of 1024 cylinder, 256 heads, and 63 sectors), new INT13 bios services were added to bios which offers simple sector addressing using a logical zero based sector number (0, 1, 2, ..) This addressing scheme is easier than the Cylinder-Head-Sector model and is supported in all bios chips post-1995, so we use it exclusively. Check the boot sector code for more details. On x86 jumps: ------------- * References: - Intel i686 Manual, vol. 2, Instruction Set Reference, Chapter 3: JMP - Sun Microsystems, [AT&T] x86 Assembly language reference manual - Cute: boot/bootsect.S, boot/head.S, boot/e820.S, boot/trampoline.S * Judging by my early naive trial-and-error experiences and the nature of newbies questions at osdev.org forums, the most common failure point while developing bootloaders and early kernel stages is either an intra or an inter-module jump. Because of this, we'll clarify the most used x86 jumps here, detailing their AT&T syntax and opcodes. We don't use the Intel syntax in our kernel, so it won't be discussed here; it's already clearly defined in the official Intel manuals. First, let's stress that a 'jump' is only a mechanism to disrupt the CPU execution flow, moving EIP to a different code (hopefully not data!) region. Generally, x86 jumps can be classified to either near or far. Near jumps are 'intra-segment', meaning that they only move within the limits of the currently setup and cashed code segment. Meanwhile, far jumps are 'inter-segment': they take a (mostly new) code segment descriptor and an offset within such segment. The kind of displacement in such jumps can also be either relative or absolute. Relative addresses (wrt the next instruction address) were introduced to save opcodes size. [*] Absolute addresses are very important when relative offsets to an outside-of-code-object destination address cannot be validly calculated at link-time. Check the 'When %RIP-relativeness goes wrong' section's summary for further details. Examples of jumps: | [A] Near jump, 8-bit relative displacement (-128): [short jump] 1: movl $0xdeadbeef, %eax | af: b8 ef be ad de jmp 1b | b4: eb f9 [B] Near jump, 8-bit relative displacement (+127): [short jump] 1: movl $0xdeadbeef, %eax | af: b8 ef be ad de jmp 2f | b4: eb 00 2: movl $0xcafebabe, %ebx | b6: bb be ba fe ca [C] Near jump, 16-bit relative displacement: [Not supported by GAS] [D] Near jump, 32-bit relative displacement: jmp dst | 12: e9 f8 ff 0f 00 .skip 0xffff8 | ... dst: movl $0xdeadbeef, %eax | 10000f: b8 ef be ad de [E] Near jump, absolute indirect displacement (register): mov $0x100000, %eax | af: b8 00 00 10 00 jmp *%eax | b4: ff e0 [F] Near jump, absolute indirect displacement (memory reference): jmpl *(0x7c00 + mem) | af: ff 25 b5 7c 00 00 mem: .long 0x100000 | b5: 00 00 10 00 (AT&T syntax uses the '*' operator for indirect jumps) Note that the _only_ available way for specifying _near_ absolute displacements is to encode such displacements 'indirectly' (i.e. not in the operands themselves) using a register or a memory reference. [G] Far jump, absolute 32-bit far pointer in operand (16-bit %cs index + 16-bit offset): .code16 | ljmp $0x08, $(0x7c00 + rel) | 99: ea 9e 7c 08 00 .code32 | rel: movl $0x10000, %esi | 9e: be 00 00 01 00 [H] Far jump, absolute 48-bit far pointer in operand (16-bit %cs index + 32-bit offset): .code32 | # KERNEL_CS = 0x08 | # startup_64 = 0x100112 | ljmp $KERNEL_CS, $startup_64 | 10006f: ea 12 01 10 00 08 00 [I] Far jumps, 32-bit or 48-bit absolute indirect displacement (memory reference): [Rarely used, let's only know that they exist] [*] footnote: On x86-32, only jumps and calls had the relative addressing facility. To save opcode size, especially while handling 64-bit wide virtual addresses, AMD introduced a global and default %RIP-relative addressing scheme for most ops, including most important, dereferencing global data. On referencing global symbols: ------------------------------ * References: - GNU LD Manual, 'Using LD', Section 3.5.4 -'Source Code Reference' - Cute: include/idt.h, kernel/idt.S * To gracefully fail instead of triple faulting the machine, one of the first things implemented while writing the C part of the kernel was setting up CPU exception handlers. To setup these handlers, we needed access to the IDT table and its descriptor; both were already defined in assembly files. What are the rules for accessing these data structures natively in C? When a global symbol is declared in C, two things happen. The compiler reserves enough space in the executable .data or .bss section holding the _value_ of the symbol. Second, it creates an entry in the program's symbol table which holds the symbol _address_ (in that section). For a small example, check below program and its disassembly: --> globals.c (a1): | --> [disassembly, $gcc -O3]: int x; | 0000000000600800 : 11 11 00 00 ... (.data) int *y = 0x1111; | 0000000000600818 : ... (.bss, 8 bytes) | int main(void) { | 0000000000400460
: x = 0xdeadbeef; | 400460: movq 0x200399(%rip),%rax #600800 *y = 0xcafebabe; | 400467: movl $0xdeadbeef,0x2003a7(%rip) #600818 } | 400471: movl $0xcafebabe,(%rax) | 400477: retq As you can see, accessing the global 'x' required a single memory dereference of the address 0x600818 stored in the program's symbol table. Meanwhile, dereferencing the pointer 'y' required TWO memory dereferences: one for dereferencing the variable 'y' address 0x600800, getting its value/contents, and _another_ dereference for such value stored (0x1111); after all, a pointer is just a special 8-byte variable. * Furthermore, check below program which has several symbols defined in an assembly file with different treatments of such symbols (and their data) in C: --> data.S (a2): | --> app.c (a2): .section .data | uint32_t x; .globl x | extern uint32_t *y; x: | extern uint32_t z[]; .long 0xdeadbeef | | int main(void) { .globl y | printf("&x = 0x%lx\n", &x); y: | printf("&y = 0x%lx\n", &y); .long 0xcafebabe | printf("&z = 0x%lx\n", &z); | .globl z | printf("x = 0x%x\n", x); z: | printf("y = 0x%x\n", y); .long 0xeeeeeeed | ... /* further code: check the output */ | } $ gcc -O3 data.S app.c -o output.elf $ objdump -D output.elf -j .data 0000000000600af0 : 600af0: ef be ad de be ba 0000000000600af4 : 600af4: be ba fe ca ed 0000000000600af8 : 600af8: ed ee ee ee $ ./output.elf 'int x' is a global located at program area (&x = ): 0x600af0 'int *y' is a global located at program area (&y = ): 0x600af4 'int z[]' is a global located at program area (&z = ): 0x600af8 x = 0xdeadbeef | *x = y = 0xcafebabe | *y = (0xcafebabe unmapped) z = 0x600af8 | *z = 0xeeeeeeed x[0] = y[0] = since 0xcafebabe is not mapped z[0] = 0xeeeeeeed As seen from above introspection, y is just a global variable holding 0xcafebabe. y _address_ is '0x600af4', buts its _value_ is 0xcafebabe. Thus, '*y' will produce two memory dereferences, and the second of the two will result in a CPU exception (segmentation fault) since 0xcafebabe is not mapped. * So, if we have a structure like the below IDT table descriptor: /* Located at ffffffff8010e240 * Binary value (little endian): 00 10 80 16 11 00 00 80 ff ff */ .globl idtdesc idtdesc: .word idt_end - idt # limit .quad VIRTUAL(KTEXT_PHYS(idt)) # base we can _not_ reference it in C as struct idt_descriptor { uint16_t limit; uint64_t base; } __packed; extern struct idt_descriptor *idtdesc; /* WRONG!! */ idtdesc->limit = 0; /* Page-fault exception! */ since in above case, the variable pointer _address_ (&idtdesc = ) will be 0x10e240, and the pointer _value_ (idtdesc = ) will equal 0x8000001116801000. Thus, dereferencing such pointer as in 'idtdesc->limit' above will in fact dereference 0x10e240 first, then afterwards dereference the unmapped address 0x8000001116801000! Final result: a CPU page fault exception. The only wrong thing above was specifying the wrong type for the 'idtdesc' label. Having the type as the struct itself instead of a pointer will let the struct address be 0x10e240, and its value will equal 0xffff8000001116801000 which _is_ the desired result: extern const struct idt_descriptor idtdesc; idtdesc.limit = 0; /* Safe!! */ * Now, if we want to reference the IDT table itself: /* IDT_GATES: number of entries in the table */ .align 16 .globl idt idt: .skip IDT_GATES * 16 idt_end: Assuming a pre-defined C type of 'struct idt_gate', representing a single IDT table entry structure, all what we need is an array of such types. Namely: extern struct idt_gate idt[IDT_GATES]; Note that after all, an array 'A' is a special kind of pointer where A = &A. (label value and address are the same, equalling link-time label address stored in the symbol table) * Finally, let's reference below exception handler stubs, where they catch the exception vector number through their positions in the stubs table: /* Each stub code size = IDT_STUB_SIZE bytes */ .globl idt_exception_stubs idt_exception_stubs: i = 0 .rept EXCEPTION_GATES_CNT movq $i, %rsi jmp default_exception_handler i = i + 1 .endr Well, let's first remember here that we're pointing to code, so we're only interested in each stub address, not by any mean its contents. So, after all, since this is a table, an array fits our goals nicely. What will be the base object type? We only care about addresses, so let the types be 'char*': char pointers makes pointer arithmetic very simple. An initial try can be extern const char idt_exception_stubs[EXCEPTION_GATES_CNT]; Nice, but by this, each table cell is of size 1 byte (char). We want each table cell to have the size IDT_STUB_SIZE chars (bytes). Namely: idt_exception_stubs[EXCEPTION_GATES_CNT][IDT_STUB_SIZE] which makes the compiler automatically calculate any stub address for us. If we want the address of the third stub, we'll simply write: void *addr = &idt_exception_stubs[3]; * Summary: the simple declaration 'extern int x' directly reference the 4 bytes marked as 'x' in the executable, where 'x' address is specified at link-time in the program's symbol table. The operator [] in the declaration 'extern type SYMBOL[]' leaves it open only to the number of base type objects which are present, not *where* they are. SYMBOL[n] just adds more static-time safety; SYMBOL[n][s] let the compiler automatically calculates an entry (of size 's')index for us. Meanwhile, the '*' operator in a declaration 'extern TYPE *P' only binds where the _8-byte variable_ P reside in the executable (its address), NOT where the base type object is. The base-object itself can be reached only by a double dereference: dereference of P address first, then a dereference of P's content itself (which is, in turn, an 8-byte address). The 8259A Programmable Interrupt Controller: -------------------------------------------- * References: - The Intel 8259A PIC datasheet. - The superb information in "The undocumented PC", chapter 17 - "Interrupt control and NMI". In subsequent text, and as said by the sheet, IRR = "Interrupt Request Register", ISR="In-Service Register", IMR = "Interrupt Mask Register". * As described by Brendan Trotter (bcos) on #osdev, and the sheet: An interrupt request is executed by raising an IR input (low to high) and holding it high until it is acknowledged (Edge Triggered Mode). Where "acknowledged" means doing so by the PIC chip, not by the CPU/OS . Also, in this case "acknowledges" just means that it recognises it and sets the corresponding flag in the IRR Basically, when an interrupt is received the PIC sets the corresponding flag in the IRR. When the PIC sends the IRQ to the CPU it clears the flag in the IRR and sets the flag in the ISR. And when the CPU sends EOI the PIC clears the flag in the ISR. The processor sends an EOI through the the INTA pin. If the PIC sends a high priority IRQ to the CPU and is waiting for EOI when a low priority IRQ occurs, then the PIC must remember that the low priority IRQ occured (so that it can tell the CPU about it when the CPU is finished with the high priority IRQ), by setting its IRR bit. The 8259 cannot be used with level interrupts. This is because when an interrupt line goes true, it is latched internally in the IRR. * Spuruious interrupts: As said by the sheet: "If no interrupt request is present at step 4 of either sequence (i.e., the request was too short in duration), the 8259A will issue an interrupt level 7. Both the vectoring bytes and the CAS lines will look like an interrupt number 7 was requested." And as said in undoc pc: "Although a noise glitch should never occur, a poor bus design or badly glitch will trigger an invalid interrupt. If the glitch occurs on IRQ 0 to IRQ7, interrupt F (which is mapped to IRQ 7) will be called. The service handler for interrupt Fh and 77H (assuming the PIC wasn't re-mapped to more sane vector-numbers) should read the in-service register to confirm that a real interrupt occurred. An invalid interrupt has In-Service bit 7 cleared, and a valid interrupt has valid bit 7 set. * So why should the PIC bother sending the CPU a spurious 7 IRQ? it could've just ignored the noise glitch source. The problem is that it might have told the CPU it's about to get an interrupt, and if it did (step 1, 2, and 3), it can not now decode the interrupt info since the line has been deasserted. Either the assert period was too short that the IRQ number was not latched internally, or the PIC just recognized it was a noise glitch due to the very short assertion period. So it gives the programmed IRQ 7 vector number to the CPU. This avoids unccessary communication with the real devices attached to IRQ7. * Take care that, as said by linux kernel comments, the two spurious IRQs 7 and 15 usually resulting from the 8259A-1|2 PICs do occur even if all the IRQs are masked (IMR = 0xff). * An example of an interrupt (refs: undoc pc + intel sheet): 1- The adapter card requests service, by driving the hardware IRQ line 5 from low to high 2- The IRQ 5 line is connected to the PIC. The edge-triggered request is recorded in the controllers Interrupt-Request Register (IRR). In this case, the IRR bit 5 is set. 3- The PIC first checks if the Interrupt Mask Register (IMR) to see if IRQ5 is allowed (not masked, IMR bit 5 = 0). Assuming that bit clear, the PIC moves to the next step. 4- Using its priority resolver uint, the conntroller checks if any higher priority interrupts are active or in progress. If so, no further action is done till all higher priority irqs are serviced. 5- Once all higher irqs are serviced, the controller flags the CPU over the dedicated INT hardware line leading to the CPU. 6- If the CPU has hardware interrupts enabled, the CPU returns an interrupt acknowledgment to the PIC using the INTA pin. 7- Directly after receiving a pulse from the INTA pin, the relevant In-Service register bit (bit 5) is set, and the corresponding IRR bit is reset. 8- The processor sends another INTA pulse. During this pulse, the PIC sends a programmed 8-bit pointer back to the CPU, according the raised IRQ. In the default bios programmed PIC case, the pic will return the interrupt value 0x0D for IRQ5. 9- In an instruction boundary, the cpu interrupts the current process and jumps to the code address specified in the IDT entry 0xD. For x86-64 cpus, entry 0xD = IDT base (specified through lidt) + 0xD * 16. 10- One of the key aspects of the interrupt handler specified in the IDT entry must make is to remove the interrupt request made by the adapter. The line maybe shared, and several ISRs got executed, where they determine first if they are servicing the right device using the "interrupt pending" register found in PCI devices, etc. 11- At the end of the interrupt handler, the handler sends the value 0x20 to the PIC command port to indicate an End-of-Interrupt (EOI). The EOI command clears the IRQ5 bit in the In-Service Regisger.(*) The case is different in case of Automatic EOI (AEOI) mode, and is described later. 12- By clearing the IRQ5 bit in the ISR, the chip is now ready to accept another interrupt from IRQ5 or any lower priority interrupt. This completes the interrupt cycle. * On x86 SMP initialization sequence: ------------------------------------- * References: - Intel MP specification v1.4, Appendix B - Operating System Programming Guideleins. - Intel Software Dev. Manual vol3A, Chapter 8, 8.4 - Multiple-Processor (MP) initialization. - Intel Software Dev. Manual vol3B, Appendix C - MP Initialization for P6 Family Processors. * Generally, there're two base x86 MP initialization sequences. One for the processors with the external APIC, and a second for processors with internal APIC (P6+ x86 processors). We won't bother ourself with the processors with external APICs (80486 and early Pentiums) since we're only targetting long mode. All long mode capable processors contains an integrated apic and comply with the new Intel MP bootup sequence, where no FIPI or BIPI are needed. On this new sequence, the INIT IPI on AP cores let them wait for the SIPI to get their starting execution vector instead of _unconditionally_ executing the BIOS init code. This avoids the ugly dependency on AT+ BIOSes hacks like the warm-reset vector to let the AP cores start executing from a custom address instead of BIOS init code. Search the MP spec for `Warm-reset vector'. * After system-wide bootup, or reset, the MP initialization protocol is going to be executed. In this protocol, the hardware setup the APIC IDs for each logical core according to system topology and assign one of the cores as the bootstrap core and the others as APs (Application processors/cores). The core nominated as the BSC will fetch and begin executing BIOS bootstrap code at the reset vector physical address 0xfffffff0. The remaining cores (setup as APs) will enter a "wait-for-SIPI" state. * Afterwars, the BIOS init code builds its ACPI and MP tables and count the number of AP cores, and update its tables entries for each new AP core found. For the algorithm, check Intel's volume 3A 8.4.4 MP Initialization Example. * At the end, the BIOS puts each AP core to halted state with interrupts disabled. By the nature of the halted state with interrupts disabled, only INIT, #NMI, #SMI are able to start the cores again. IMPORTANT: The MP protocol will be executed only after a power-up or RESET. If the MP protocol has been completed and a BSP has been chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP protocol to be repeated. Instead, each processor examines its BSP flag (in the APIC_BASE MSR) to determine whether it should execute the BIOS boot-strap code (if it is the BSP) or enter a wait-for-SIPI state (if it is an AP). So any INIT to any of the AP cores by the OS boot code will let it enter a 'wait-for-SIPI' state to be able to get its start execution vector. * Note that the cores get started with all its states uninitialized. We have to update each core from real to long mode, and update its MMU and other system tables to match the bootstrap. Kernels usually call this part the SMP `trampoline', since it let the cores jump from its raw uninitialized state to a state similar to the bootstrap's one. * On linkers 'relocation error's: --------------------------------- * References: - Relocation types (R_X86_64_*): System V ABI AMD64 supplement, section 4.4 - Relocation - The Executable and Linkable Format (ELF) spec, Book I - Relocation * One of the errors that faced me at a number of sensitive places like the kernel head, the SMP trampoline, and the linker script was LD linker's 'relocation error'. Due to their relative opaqueness, they puzzled me to a great extent at first. An example of some of those errors from linking a preliminary trampoline code to the rest of the kernel image was: trampoline.o: In function `trampoline': (.text+0x1): relocation truncated to fit: R_X86_64_16 against `.text' trampoline.o: In function `_start': (.text+0xc): relocation truncated to fit: R_X86_64_16 against `.data' trampoline.o: In function `_start': (.text+0x11): relocation truncated to fit: R_X86_64_16 against `.data' trampoline.o: In function `_start': (.text+0x14): relocation truncated to fit: R_X86_64_32 against `.bss' trampoline.o: In function `_start': (.text+0x1d): relocation truncated to fit: R_X86_64_32 against `.bss' trampoline.o: In function `_start': (.text+0x21): relocation truncated to fit: R_X86_64_32 against `.bss' trampoline.o: In function `_start': (.text+0x27): relocation truncated to fit: R_X86_64_32 against `.bss' trampoline.o: In function `_start': (.text+0x2b): relocation truncated to fit: R_X86_64_32 against `.bss' * We need to understand relocation to parse above errors. Relocation Information sections contain information about unresolved references. Since compilers and assemblers do not know at what absolute memory address a symbol will be allocated, and since they are unaware of definitions of symbols in other files, __every__ reference to such a symbol will create a relocation entry. The relocation entry will point to the code address where the reference is being made, and to the symbol table entry that contains the symbol that is referenced. The linker will use this information to fill in the correct address after it has allocated addresses to all symbols. * Dumping the ELF object trampoline.o relocation section ($readelf -a) leads to: Relocation section '.rela.text' at offset 0x438 contains 8 entries: Offset Info Type Sym. Value Sym. Name + Addend 000000000001 00010000000c R_X86_64_16 0000000000000000 .text + 5 00000000000c 00020000000c R_X86_64_16 0000000000000000 .data + 0 000000000011 00020000000c R_X86_64_16 0000000000000000 .data + 6 000000000014 00030000000a R_X86_64_32 0000000000000000 .bss + 0 00000000001d 00030000000a R_X86_64_32 0000000000000000 .bss + 0 000000000021 00030000000a R_X86_64_32 0000000000000000 .bss + 1007 000000000027 00030000000a R_X86_64_32 0000000000000000 .bss + 0 00000000002b 00030000000a R_X86_64_32 0000000000000000 .bss + 2007 The relocation table fields are described in the ELF specification. In summary: - Offset: location to apply relocation action relative to section start - Info: symbol table index wrt which the relocation must be made - Type: from the AMD64 ABI. _16: max bit width = 16 (rmode). _32: 32 bits max - Addend: constant addend used to compute the value to be stored in the relocatable field. Ex: `lgdt $(gdt + 0x10)' will print `.data + 10' where the addend = 0x10 * Note that all relocatable fields are filled with zeroes in the trampoline ELF object trampoline.o: # .code16 (real-mode) 0000000000000000 : 0: ea 00 00 00 60 ljmp $0x6000,$0x0 # ljmp $0x6000, $_start 0000000000000005 <_start>: 5: e8 2a 00 call 32 # RIP-relative (7 + 0x002a = 32) 8: fa cli 9: 0f 01 1e 00 00 lidtw 0x0 # lidt $idtdesc e: 0f 01 16 00 00 lgdtw 0x0 # lgdt $gdtdesc 13: b8 00 00 00 00 mov $0x0,%eax # movl %cr3, %eax 18: 0f 22 d8 mov %eax,%cr3 1b: c7 05 00 00 00 movw $0x0,(%di) # movl $(init_pgt+0x1007), init_pgt 20: 00 00 00 00 00 25: c7 05 00 00 00 movw $0x0,(%di) # movl $(init_pgt+0x2007), init_pgt 2a: 00 00 00 00 00 2f: f4 hlt 30: eb fd jmp 2f <_start+0x2a> 0000000000000032 : Also note that the offsets in the error messages (.text+0x1), (.text+0x11), .. are the same offsets reported in the relocation table entries, and are the same ones filled with zero in the above disassembly. * With above information in hand, we can simply deduce the errors cause. We link the trampoline assembly code against our long mode kernel, but the kernel linker script sets up the virtual addresses base at 0xffffffff80000000. So the linker, obeying the kernel script, tries to assign huge (> -2GB) virtual addresses to our symbols. The linker then discovers that since the trampoline to-be-relocated symbols are setup with the R_X86_64_16 model, it's limited to a width of 16 bits, and limited to 32 bits in case of R_X86_64_32. Ofcourse those bits aren't sufficient, so it has to truncate the addresses producing a wrong final executable. In a nutshell, there was a reported error _for each_ relocatable symbol used, reflecting the number of relocation table entries with a type/bit-width that is insufficient to the linker script setup 64-bit virtual addresses. * When %RIP-relativeness goes wrong: ------------------------------------ * References: - Intel Instruction Set Reference, A-M, vol. 2A. 3.2-CALL--Call procedure. * One of the interesting errors I faced in jumping away from head or trampoline code to C code was %rip-relativeness related errors. Here are two interesting cases: In head.S, after switching to long mode, and setting up the higher-half virtual mappings, we jump to C code: .code64 startup_64: movq $stack_end, %rsp ... # .. Setup identity mappings # .. Setup high-half virtual addresses mappings # .. Apply the page tables # .. let %RIP be its virtual address equivalent # .. let %RSP, and GDTR be their virtual addresses equivalent # .. Let NULL pointers segfault and avoid physical-address dependent # code: remove identity mappings ... call kernel_start # C method: void kernel_start(void); But unfortunately we get the machine constantly triple-faulting. As indicated by Bochs debugging output: [CPU1 ] BxError: Encountered an unknown instruction b1=0x0e (signalling #UD) [CPU1 ] modrm was 0x00, nnn was 0, rm was 6 [CPU1 ] exception(0x06): error_code=0000 Disassembling the object file, we see that everything is at place. And `callq' points to 0xffffffff80109250; the `kernel_start' method: # head.o 00000000001000b9 : 1000b9: 48 c7 c4 00 50 10 00 mov $0x105000,%rsp ... 10012f: e8 1c 91 00 80 callq ffffffff80109250 Or does it? A closer inspection at the binary disassembly reveals that the call is a near, relative, displacement relative to next instruction, 32-bit displacement sign-extended to 64-bits in long mode call. So the instruction is at 0x10012f, with a length of 5 bytes, relative displacemented is calculated from 0x10012f + 5 = 0x100134. Code wants to call `kernel_start', so the relative displacemented = 0xffffffff80109250 - 0x100134 = 0xffffffff8000911c, which is the value encoded in the `callq' instruction after sign-extension in a little-endian form. So what's wrong? Notice that by the time the CPU is executing the `callq', it's actually using a __different__ %RIP than 0x10012f. The %RIP is in its (%RIP + 0xffffffff80000000) virtual form as the identity mappings are to be removed. This clearly messes up the %rip-relative-displacement calculations, and sends the CPUs to unpredictable areas, leading to unknown opcode exceptions and other horrors. * Another similar case was faced in the trampoline code, when it jumps away to the rest of AP cores C init code. The `callq' instruction was just sending the CPU to irregular places, with horrifying results. Why the callq misbehaved in trampoline? Because the trampoline code is linked to the kernel, but actually copied to a below-1MB 4KB-page, and then executed from there. This messed up external %rip-relativeness. Calls to internal trampoline methods worked because those methods retained their relative distances between each other when they were copied. On the other hand, and by definition, external distances was not kept when calling outside code (like the C application_start() code): # trampoline.o ffffffff8010ae06 : ffffffff8010ae06: 48 8d a4 24 00 00 00 lea -0x80000000(%rsp),%rsp ffffffff8010ae0d: 80 ... ffffffff8010ae2e: e8 0d e3 ff ff callq ffffffff80109140 ffffffff8010ae33: 90 nop * Solution: The two examples are special cases of PIC, position-independent code. The trampoline is PIC cause it's executed from different position than where it's linked. The head.S code also applies as parts of it get executed using a different %rip than what it's linked as. In the two above cases, we can simply solve them by specifying the absolute address of destination code instead of using relative x86 calls. As said in Intel instruction reference, x86-64 has a near call, with absolute indirect addressing facility. It can be used in GNU AS as follows: movq $application_start, %rax jmpq *%rax And also for head.S: movq $kernel_start, %rax callq *%rax # head.o 00000000001000b9 : 1000b9: 48 c7 c4 00 50 10 00 mov $0x105000,%rsp ... 10012f: 48 c7 c0 50 92 10 80 mov $0xffffffff80109250,%rax 100136: ff d0 callq *%rax which solves our problem nicely. * Summary: For a code object which is run from a _different_ address than what it was originally linked to, a 'jmp' or 'call' op to a label (or an address) outside of that unlucky object needs absolute addresses instead of the _default_ %rip-relative ones. In this case, the 'next opcode address', being the base of %RIP-relativeness calculations, is not the same at both link-time and run-time; at run-time, the linker-calculated relative offset of the destination becomes completely invalid due to the _now-different_ %RIP base, jumping intentionally to the unknown. Meanwhile, relative jumps and calls to labels within the same object are still valid: the relative distance is still maintained within the object itself, no matter where it's running from. This kind of internal relativeness is usually calculated by the assembler (instead of the linker) and is safe here. Testing libc()-like kernel code: -------------------------------- * An interesting way to test libc()-like kernel code (e.g. std string methods) is to replace system's libc implementation with our own. Afterwards, we run a huge program (Firefox, OpenOffice or gcc) and observe the results. Those tests usually cover huge number of cases, and give a good indicator of tested kern code stability. I've usually found that even the smallest mistake in such tests produce either a non-behaving app or a direct segfault. * Unix linkers LD_PRELOAD feature can help us immensely. Assuming our string methods resides at 'string_lib.c', we do something similar to: # Position-independent object, suitable for dynamic linking $ gcc --std=gnu99 -O3 -fPIC -c string_lib.c -o string_lib.o # Produce shared object (instead of final executable) that can # be linked with other objects to form an executable. $ gcc -shared -o string_lib.so string_lib.o $ LD_PRELOAD='./string_lib.so' firefox-bin Note: always run these tests with the binaries themselves instead of their bash-script wrappers. Such shell wrappers don't play very well with LD_PRELOAD at times. * It is interesting to note that a 30-second run of Firefox produced 655,227 memset() and memcpy() calls! A 1-minute run of Google Chrome produced 1,889,087 calls! (counted by `wc -l'). On SMP Memory ordering models: ------------------------------ * References: - Kourosh Gharachorloo, "Memory Consistency Models for Shared-Memory Multiprocessors", Digital Equipment (DEC) with Stanford and DARPA support - Paul McKenny, Memory ordering in modern processors, 2007.09.19 Draft - Paul McKenny, Memory Barriers: a Hardware View for Software Hackers, IBM - Linux Kernel Documentation/memory-barriers.txt, kernel.org - DEC/HP OpenVMS Ask the Wizard, http://www.webcitation.org/5laUMQfyi - Intel Manuals vol3A, chapter8, MP Management, 8.2 - Memory Ordering - AMD64 Architecture Programmer's Manual, chapter 7 - Memory System * To understand memory ordering models correctly, we need to understand which parts of our code are loads, which are stores, and which are in fact both. Below example should cover basic cases of shared global variables usage, compare the C code with its assembly output: int *global_symbol = (int *)0xdeadbeef; int main(void) { /* 1) memory load + dependent memory store to stack */ int t = global_symbol; /* 2) memory store */ global_symbol = (int *)0xbbbb; /* A memory load + dependent memory store */ *global_symbol = 0xcccc; } Disassembly of section .data: ... 0000000000600840 : 600840: ef 600841: be ad de 00 00 Disassembly of section .text: ... 0000000000400448
: 400448: 55 push %rbp 400449: 48 89 e5 mov %rsp,%rbp # 1) Mem. load + store to stack 40044c: 48 8b 05 dd 03 20 00 mov 0x2003dd(%rip),%rax # 600830 400453: 89 45 fc mov %eax,-0x4(%rbp) # 2) Mem. store 400456: 48 c7 05 cf 03 20 00 movq $0xbbbb,0x2003cf(%rip) # 600830 40045d: bb bb 00 00 # 3) Mem. load + store to pointed value 400461: 48 8b 05 c8 03 20 00 mov 0x2003c8(%rip),%rax # 600830 400468: c7 00 cc cc 00 00 movl $0xcccc,(%rax) 40046e: c9 leaveq 40046f: c3 retq * Terminology of memory orders: - Program order: the memory operations order specified in object code. - Perceived order: the order a CPU perceives of its and other CPUs memory ops. * Memory ordering is a tradeoff between programming ease and performance. On uniprocessors, many ops are assumed to execute in program order. Processors can provide the illusion of sequensiality by only preserving sequential order among memory operations to the same location. Sequential consistency defined by Lamport provides a simple model to programmers, but it adversely affects efficiency and performance: the illusion trick provided in uniprocessors can't be used on MP systems to provide sequential consistency. Weak ordering distinguishes between ordinary memory operations and memory ops used for synchronization. By ensuring consistency only at the synchronization points, weak ordering allows ordinary operations in between pairs of synchronizations to be reordered with respect to one another. Such specifications can be considered system-centric since they in effect expose the low-level system optimizations to the programmer. * General rules for all architectures: - A given CPU will always _perceive_ its own memory operations as occuring in program order, without a need for memory barriers. - Naturally-aligned loads and stores are always atomic - Code that uses standard synchronization primitives (spinlocks, RCU, ..) should not need explicit mem barriers. Only tricky code that bypasses those primitives need those barriers. An example of a linked list lockless search is given below. * Example: Writer CPU: ----------- # Linked list head is `smack_known' struct smack_known *smack_known; mutex_lock(&smack_known_lock); if (skp == NULL) { skp = kzalloc(sizeof(struct smack_known), GFP_KERNEL); if (skp != NULL) { 1- skp->smk_next = smack_known; 2- skp->smk_secid = smack_next_secid++; 3- skp->smk_cipso = NULL; /* 1- BUG: Write barrier needed! */ 4- smack_known = skp; } } mutex_unlock(&smack_known_lock); Reader CPU: ----------- # Just locklessly access the list struct smack_label *skp; skp = smack_known; while (skp) { /* 2- BUG: Alpha CPUs!! */ 5- if (skp->smk_secid == 10) { ... } ... skp = skp->next; } Bug 1: ------ In the writer CPU, the latest 3 stores can be re-ordered in any combination. In arhitectures that reorder stores like Alpha, Itanum, POWER, and SPARC, this may lead other CPUs to perceive an incomplete node added to the list: a garbage smk_secid value. This may even lead to graver results if you notice that the first line (skp->smk_next = smack_known) is actually a load followed by a store. So the sequence is: Load from smack known store to ->smk_next address ... store to smack_known The cpu can reorder the last stores, leading to a linked list head with a bogus next pointer; effectively a user-space segmentation fault or worse, a CPU exception if in kernel mode. Bug 2: ------ As said by OpenVMS documenatation (Ask the Wizard): "Your producer must issue a 'memory barrier' instruction after writing the data to shared memory and before inserting it on the queue; likewise, your consumer _must_ issue a memory barrier instruction after removing an item from the queue and before reading from its memory. Otherwise, you risk seeing stale data, since, while the Alpha processor does provide coherent memory, it does not provide implicit ordering of reads and writes. (That is, the write of the producer's data might reach memory after the write of the queue, such that the consumer might read the new item from the queue but get the previous values from the item's memory." But how does the consumer sees the pointer before its data even if write barriers were used? As said by McKenny, the producer's write barrier will guarantee that the cache invalidates performed by the first 3 lines will reach the interconnect before that of the last line, but it "makes absolutely no guarantee about the order in which the new values will reach the reading CPU’s core." So assume smack_known, and smack_known->smk_secid are on different cache banks. Because of the write barrier, the producer's data will reach the interconnect before updating the queue. The bank holding smack_known is idle, so its line gets invalidated and updated, while the bank holding smk_secid was busy, so the cache invalidates for the new element are being delayed. This leads to seeing new value for the pointer, but old cached values for the new element at line 5. Linux kernel documentation explains by stating "some versions of the Alpha CPU have a split data cache, permitting them to have two semantically-related cache lines updated at separate times. This is where the data dependency barrier really becomes necessary as this synchronises both caches with the memory coherence system, thus making it seem like pointer changes vs new data occur in the right order. * On the x86 side, only below basic memory access is guaranteed to be atomic: - Readnig or writing a byte - Reading or writing a word aligned on 16-bit boundary - Reading or writing a doubleword aligned on 32-bit boundary - Reading or writing a quadword aligned on 64-bit boundary On P6 family processors and up: - Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line. (but how do we know whether a memory access fits a cache line?) On the C 'volatile' type qualifier: ----------------------------------- * References: - Eric Eid & John Regehr, "Volatiles Are Miscompiled, and What to Do about It", Proceedings of the Eighth ACM and IEEE International Conference. - Attached Linux kernel's "volatile-considered-harmful.txt" document, added with various core developers thoughts copied from LKML. - ISO C99 Committee Draft - September 7, 2007 - My OSDEV forums post: http://forum.osdev.org/viewtopic.php?t=21210&start=17 * As said by the C standard, "An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine" "Actions on objects so declared shall not be optimized out by an implementation, or reordered except as permitted by the rules for evaluating expressions". It's in a practical manner, "for every read or write to a volatile variable that would be performed by a straightforward interpreter for C, exactly one load from or store to the memory locations allocated to the variable must be performed". * Volatile is used when we don't want the memory access to be re-ordered, cached, or optimized away, as in accessing memory mapped I/O registers. Another case is where global variables get changed behind our back by other execution paths as in interrupt handlers or other threads: int var_set_by_irq_handler = 0; .. while (var_set_by_irq_handler == 0); The compiler has every right to optimize the loop to an infinite loop. To avoid this, and force the compiler to access the global variable memory at each iteration, we use volatile, since volatile objects "shall not be optimized out by an implementation". * Take care that the volatile is a type qualifier as `constant'. So in below case: static volatile uint64_t *msg = NULL; /* (1) */ while (msg == NULL) asm volatile ("pause"); The above loop will still be optimized to an infinite loop. The declaration states that the memory area _pointed_ by `msg' is volatile, not that the pointer itself is volatile. So the above declaration should've been: static uint64_t * volatile msg = NULL; /* (2) */ specifying a volatile pointer. Declaration at (1) is perfectly fine though if we're accessing the variable as in: while (*msg == 0x1234) asm volatile ("pause"); cause now we're referencing the memory declared as volatile. The compiler must fetch the memory pointed by 'msg' at each iteration. * As seen from above example and the ACM paper, volatiles are really subtle in C and many industrial compilers get them wrong. I prefer the Linux way of handling the issue, namely __minimizing__ 'volatile' as far as possible, and depending on explicit compiler and cpu _barriers_, and SMP primitives as needed. * So instead of: static uint64_t * volatile msg = NULL; while (msg == NULL) asm volatile ("pause"); We use an explicit compiler barrier: static uint64_t *msg = NULL; while (msg == NULL) asm volatile ("pause": : :"memory"); The compiler barrier will assure "GCC not to keep memory values cached in registers across the assembler instruction and not to optimize stores or loads to that memory". In other words, it will force gcc to reload any values from memory over the busy loop. * Instead of: volatile uint32_t *ioregsel = (uint32_t *)ioapic_base(apic); volatile uint32_t *iowin = (uint32_t *)(ioapic_base(apic) + 0x10); *ioregsel = reg; return *iowin; where we don't want to the compiler to re-order the last two 'independent' read and write cause this re-ordering will imply different semantics to the programmed hardware. We use MMIO register accessors, which internally includes compiler barriers and any needed volatile qualifiers (See comments at top of include/mmio.h for further rational): uint32_t *ioregsel = (uint32_t *)ioapic_base(apic); uint32_t *iowin = (uint32_t *)(ioapic_base(apic) + 0x10); writel(reg, ioregsel); return readl(iowin); * Finally, check attached LKML messages for extra cases and reasonings by Linus and Andi Kleen. On the vaguely documented x86 16-bit protected mode: ---------------------------------------------------- * References: - Rick A. Hohensee, "The well-factored 386", LKML, http://lkml.org/lkml/2003/7/28/55, http://www.webcitation.org/5luRetOqF - Intel manuals vol3A, chapter 3, 3.4.3 - segment registers - Intel Manuals vol3A, chapter9, Processor Management and Initialization, 9.9.2 - Switching Back to Real-Address Mode - Robert R. Collins, "The Segment Descriptor Cache", Dr. Dobb's Journal Undocumented Corner, http://www.rcollins.org/ddj/Aug98/Aug98.html, http://www.webcitation.org/5lxH9NFGa * The x86 architecture determins the default operand and address size using the D bit in the code segment. If D is set, it indicates 32-bit addresses and operands. Otherwise, we're running in 16-bit mode, using 16-bit addresses and operands. There's a similar C bit for 64-bit long mode. * One of the states that really puzzled me a lot was the state occurring in the transition from bootup real-mode to 32-bit protected mode. Namely, bewteen enabling protected mode (setting PE = 1) and getting executed in 32-bit by using a 32-bit (D = 1) code segment through a far jump. Setting PE and %cs together is not atomic, and there's clearly something in-between. An example from our bootsector: /* Set the protected mode PE flag */ movl %cr0, %eax orl $0x1, %eax movl %eax, %cr0 jmp flush_prefetch: /* What're the rules covering this state? we're in * pmode but the cpu is still fetching and executing * ops in 16-bits! */ flush_prefetch: movw $(gdt_ds - gdt), %ax movw %ax, %ds movw %ax, %es movw %ax, %ss DATA32 ljmp $0x08, $(reloc - _start + 0x7c00) * To exhaustively think about this, we need to cover all the permuatations of PE (pmode enabled) and the %cs D bit: (1) PE=0 Dbit=0 Canonical (un)Real Mode (2) PE=1 Dbit=0 16-bit protected mode (3) PE=1 Dbit=1 Canonical 32-bit Protected Mode (4) PE=0 Dbit=1 .. _______________ | WAKE-UP | |canonical rmode| | PE=0 use16 | |_______________| A |set PE to 1 | | set PE to 0| | ___|________V__ | 16-bit | | pmode | | PE=1 use16 | |_______________| A |far xfer | | to use32 CS far xfer to | | use16 CS| | ___|_______V___ | 32-bit | | pmode | | PE=1 use32 | |_______________| A |set PE to 0 | | set PE to 1| | ___|________V__ | Forreal Mode | | Mode | | PE=0 use32 | |_______________| * We can not jump from (1) to (3) directly through a far jump using a D=1 code segment, since we'll be in canonical rmode, and %cs will always be interpreted as in the segmented 8086 model of (%cs << 4 + offset). But how can the cpu move to 32-bit execution, if 32-bit execution is controlled by the D bit, but we can't change the D bit due to the 8086 semanticss? The answer is to provide a switch. If this switch is on, the cpu will the use the 386 semantics of segments (and be able to modify the segment caches). If it's clear, the cpu will still use the same segmentation circuites but it won't be able to modify most of the segment caches fields like limit, and the D bit, (the base field can still be changed within 2-bytes size), due to the constraints of the 8086 semantics. This 'switch' is the PE (pmode Enabled) bit, which acts __independently__ of whether the machine is running on 16 or 32-bit mode (D bit is set or clear) So under this 16-bit pmode (PE set, D clear), if we do: movw $0x10, %ax movw %ax, %ds we're in fact using %ds as a segment selector, and are also updating the segment caches with the values found in the third GDT cell. But if we're in canonical 16-bit rmode (PE & D clear), if we does the same: movw $0x10, %ax movw %ax, %ds we're in fact using (%ds << 4) as the base field in the segment caches. * We cannot also jump directly from (3) 32-bit pmode, to cacnonical/bootup rmode at (1). One should change the code and data segmenst limit to 64K and the D bit to 0 __before__ clearing the PE switch: we can not change any of those segments cached fields (except the base) once the PE get cleared. Intel's "switching back to Real-Address mode" section also states the same order, but without making the reasoning of this order very clear. Intel also elaborates that "the segment registers must be loaded with non-null segment selectors or the segment registers will be unusable in real-address mode." Simply because the segment will be marked in the segment caches as invalid (P = 0), and as we learned from above, we won't be able to change any of those cached bits after clearing PE. Due to this cached P=0 bit, even the usual act in canonical rmode of changing the base within the 2-bytes limit won't be possible, rendering the segment as unusable, as this is their only effective action in 8086 semantics. * Finally, the difference between 8086 and 386 segmentation is superficial. If the segment was marked as P=1 in the segment caches (valid), 8086 semantics just change the base field (limited to 2-bytes size) of the equivalent 386 segment. They all function using the same registers and simple circuity. Intel elaborates: "Note that if the segment registers are not reloaded, execution continues using the descriptor attributes loaded during protected mode." Using light debugging techniques: --------------------------------- * References: - Ulrich Drepper, "How to Write Shared Libraries", 2.1 Data Definitions, http://www.webcitation.org/5mkmxB9Fx - Ian Lance Taylor, "Linkers part 5", http://www.airs.com/blog/archives/42, http://www.webcitation.org/5mkmiZ8FN - Binutils bug report, "ELF linker loads member of archive for common symbol", http://sources.redhat.com/bugzilla/show_bug.cgi?id=1811 http://www.webcitation.org/5mknnyUe2 - Minor bug-report at http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42726 * Today while writing page allocator test code, I faced a very strange bug: once any of the kernel's text, data, or bss sections exceeds something close to 20 KBs, the MPtables code panic()s at different and unpredictable locations. The failure is something like: PANIC: MPTables - 1217939456 CPUs found, only 32 supported * To give some context, there's a system global called 'nr_cpus' representing the number of CPUs in the system. So far, we know the number of CPUs from the MP tables Conf entries, where each CPU entry represents a CPU in the system. Thus, finding 8 CPU entries means that the machine has 8 usable cores. This global var 'nr_cpus' is only increased at one and _only_ one place in the system, at the MP tables parsing code: /* We call this for each MP CPU entry in the system */ static void parse_cpu(void *addr) { struct mpc_cpu *cpu = addr; if (!cpu->enabled) return; if (nr_cpus >= CPUS_MAX) { panic("MP - %d CPUs found, only %d supported", nr_cpus, CPUS_MAX); } ... ++nr_cpus; } * Inspecting the BIOS MP tables more closely, the MP pointer table is at 0xfbb60 and its checksum and signature are valid. The MP configuration table header was also valid and checksummed at 0xfba00. The problem, it seems, is that the conf table entries got somehow corrupted (thus the wrong number of cpus value), or there's a bug in our parser code. * Unfortunately, and as usual for such problems, inserting printk()s at any location skews the results randomly, making any printk() useless. So we try less invasive debugging techniques like writing to a memory area using a _single_ inlined asm instruction, and inspecting this memory area using an emulator afterwars. This worked and made the problem repeatable! * First, and to test the parser validity, we choose a memory location that we know before hand it's going to be zero; I chose physical 0x500. In parse_cpu we add a very small asm instruction as follows: /* 8-byte instruction */ asm volatile("incq (0xffffffff80000500);"); if (nr_cpus >= CPUS_MAX) { panic("MP - %d CPUs found, only %d supported", nr_cpus, CPUS_MAX); } And thankfully this instruction was light enough, so the system paniced at the same point. Our panic() just calls printk() and hlt the system. So after the system halts, we move to the qemu console handler and check the value at address 0x500. Good for us, we found the value there to be 1, so it seems our parser is behaving correctly. * Now the problem is getting more mysterious. If the parser is behaving correctly, and the value at address 0x500 is completely valid, how come 'nr_cpus' have those very strange values? Remember that we put the 'incl 0x500' instruction at the same place of '++nr_cpus', so the two addresses should have the exact same values or something is _completely_ messed up. The 'nr_cpus' variable still yielded the value of "1217939456". * So what did really corrupt the BSS-stored variable nr_cpus at this _very_ early stage? The very first thing we do in the C part of the kernel is to clear the entire BSS section to zero. So assuming the kernel was copied/setup correctly (which we're not very sure of yet), the problem probably occurs _between_ clearing the BSS section and parsing the MP tables. OK, so we'll try to check the value of the 'nr_cpus' variable at different places in the kernel between clearing BSS and MP parsing using this 14-byte code segment: asm volatile ("movl %0, %%eax;" "movl %%eax, (0xffffffff80000500); hlt;" : : "m"(nr_cpus) : "memory"); And then use the qemu console to read the address 0x500 'xp 0x500'. * Very important discovery: Adding the above asm fragment _directly_ after clearning the BSS section yielded the same corrupt value of "1217939456" for the 'nr_cpus' variable! Something is seriously wrong cause 'nr_cpus' is clearly in the BSS section: $ readelf -a kernel.elf Section Headers: [Nr] Name Type Address Offset Size EntSize Flags Link Info Align ... [ 8] .bss NOBITS ffffffff8010f5e0 0010f5e0 00000000000028c8 0000000000000000 WA 0 0 32 Symbol table '.symtab' contains 183 entries: Num: Value Size Type Bind Vis Ndx Name ... 151: ffffffff80111d80 4 OBJECT GLOBAL DEFAULT 8 nr_cpus * I think we're getting closer. Notice from above ELF analysis, the BSS section starts at 0xffffffff8010f600, and it has a length of 0x28c8: This means that from reading the ELF: BSS start = 0xffffffff8010f5e0 BSS end = 0xffffffff8010f5e0 + 0x28c8 = 0xffffffff800111ea8 'nr_cpus' is at 0xffffffff80111d80 But very strangely, here are the results of printing __bss_start and __bss_end symbols exported by our kernel linker script we wrote since the dawn of time: BSS start = 0xffffffff8010f5e0 !! BSS end = 0xffffffff801115e0 'nr_cpus' is outside of this region! As you see, somehow the kernel sees a _trimmed_ BSS section from the linker script, and to add insult to injury, 'nr_cpus' is not included in that trimmed region. * Things began to make sense: the kernel obviously has a misguided view about its bss section. This is disastrous cause at this level, we're the ones responsible of clearing this BSS area before use. Ofcourse, we directly move to our linker script BSS section to see what's wrong: SECTIONS { ... .bss : { __bss_start = .; *(EXCLUDE_FILE (*head.o *e820.o) .bss) __bss_end = .; } __kernel_end = .; ... } Things should be fine and everything is at its place, but the corruption still exists. So I tried a small modification of: .bss : { __bss_start = .; *(EXCLUDE_FILE (*head.o *e820.o) .bss) } __bss_end = .; __kernel_end = .; And things worked magically! Ofcourse leaving the issue as this will be immature as we didn't understand yet why this trivial change produced this huge effect. * Trying to debug the issue further, we restore the linker script to its original state and let the linker be more verbose by using the -M flag. Inspecting the output, a very interesting thing came up when the linker displayed the final memory map: ('...' denotes trimmed output lines) Allocating common symbols // mm, .. Common symbol size file test_symbol 0x8 mmap.o ioapic_descs 0x60 ioapic.o nr_ioapics 0x4 ioapic.o nr_cpus 0x4 smpboot.o mp_irqs 0x700 mptables.o nr_mpcirqs 0x4 mptables.o cpu_descs 0x100 smpboot.o Linker script and memory map 0x0000000000100000 . = 0x100000 .text.head 0x0000000000100000 0x310 ... .bss 0xffffffff8010f5e0 0x28c8 load address 0x000000000010692c 0xffffffff8010f5e0 __bss_start = . *(EXCLUDE_FILE(*e820.o *head.o) .bss) .bss 0xffffffff8010f5e0 0x0 common.o .bss 0xffffffff8010f5e0 0x0 main.o .bss 0xffffffff801105e0 0x0 i8259.o .bss 0xffffffff801105e0 0x0 apic.o ... .bss 0xffffffff80110610 0x0 lib/string.o .bss 0xffffffff80110620 0xfc0 lib/printf.o 0xffffffff801115e0 __bss_end = . COMMON 0xffffffff801115e0 0x64 ioapic.o 0xffffffff801115e0 ioapic_descs // Outside of BSS! 0xffffffff80111640 nr_ioapics // Outside of BSS! *fill* 0xffffffff80111644 0x1c 00 COMMON 0xffffffff80111660 0x704 mptables.o 0xffffffff80111660 mp_irqs // Outside of BSS! 0xffffffff80111d60 nr_mpcirqs // Outside of BSS! *fill* 0xffffffff80111d64 0x1c 00 COMMON 0xffffffff80111d80 0x120 smpboot.o 0xffffffff80111d80 nr_cpus // Outside of BSS! 0xffffffff80111da0 cpu_descs // Outside of BSS! COMMON 0xffffffff80111ea0 0x8 mmap.o 0xffffffff80111ea0 test_symbol // Outside of BSS! 0xffffffff80111ea8 __kernel_end = . So at last we got it, the linker script didn't put variables like nr_cpus, nr_ioapics etc in the BSS section, thus they were not cleared in our clear_bss() function. As you can see, those symbols are allocated past the __bss_end symbol. * OK, so we inspect the LD manual for this 'COMMON' section, to find: "A special notation is needed for common symbols, because in many object file formats common symbols do not have a particular input section. The linker treats common symbols as though they are in an input section named ‘COMMON’ ..." * Unfortunately the documentation was a bit vague, and there was no explanation of what "common symbols" were really used for. So we dig deeper to find this great explanation from Ulrich's paper on shared libraries: "Common variables are widely used in fortran, but they got used in C and C++ as well to work around mistakes of programmers. Since in the early days people used to drop the 'extern' keyword from variable definitions, in the same way it is possible to drop it from function declaration, the compiler often had multiple definitions of the same variable in different files. To help the poor and clueless programmer, the C/C++ compiler normally generates common variables for uninitialized definitions such as `int foo;' For common variables, there can be more than one definition, and they all get unified in one location in the output file .. their values does not need to be stored in the ELF file." * And also from Ulrich, a description of -fno-common which is more accurate than gcc's manual: "If the programmer uses the compiler command-line option -fno-common, the generated code will contain uninitialized variables instead of common variables if a vriable definition has no initializer. .. The result at run-time is the same as for common variable, no value stored in the file. But the representation in the object file is different [BSS section], and it allows the linker to find multiple definitions and flag them as errors." * And finally from Ian Lance Taylor's series on linkers: "A section index of SHN_COMMON (0xfff2) indicates a common symbol. Common symbols were invented to handle Fortran common blocks, and they are also often used for uninitialized global variables in C. A common symbol has unusual semantics. Common symbols have a value of zero, but set the size field to the desired size. If one object file has a common symbol and another has a definition, the common symbol is treated as an undefined reference. If there is no definition for a common symbol, the program linker acts as though it saw a definition initialized to zero of the appropriate size. Two object files may have common symbols of different sizes, in which case the program linker will use the largest size. Implementing common symbol semantics across shared libraries is a touchy subject, somewhat helped by the recent introduction of a type for common symbols as well as a special section index (see the discussion of symbol types below)." (Also check nm(1) on "C"/COMMON symbols, and --warn-common on ld(1)) * And by now everything is super-clear, and the fix is only adding the COMMON symbols within the output object BSS boundaries: .bss : { __bss_start = .; *(EXCLUDE_FILE (*head.o *e820.o) .bss) *(EXCLUDE_FILE (*head.o *e820.o) COMMON) __bss_end = .; } What a bug this was. Musings on Memory Allocation: ----------------------------- * References: - Uresh Vahalia, "Unix Internals: the new frontiers", Chapter 12 - Kernel Memory Allocation - Kirk McKusick, J. Karles, "Design of a General Purpose Memory Allocator for the 4.3BSD UNIX Kernel", Berkely, Usenix 1988 conference - Daniel Bovet, Marco Cesati, "Understanding the Linux Kernel", Chapter 8 - Memory Management, Section 8.1 - Page Frame Management - Kirk McKusick, George Neville, "The Design and Implementation of the FreeBSD Operating System", Section 5.3 - Kernel Memory Management - Maurice Bach, "The Design of the UNIX Operating System", Chapter 9 - Memory Management Policies * In all Unices, memory is page-based, and the page is the smallest unit of allocation, protection, address translation, and mapping. X86-64 does not even provide segmentation. * The very first step for writing a MM subsystem is writing the page-level allocator. Due to virtual memory, the page-allocator doesn't need to find contiguous free pages to satisfy its request: non-contigous frames mapped as contiguous addresses shall be enough. The kernel often needs chunks of memory much less of size than a page, there's where the 'kernel memory allocator' job comes into play for allocating this kind of memory within the pages. It depends internally on the page allocator. * Early Unix like SVR2 avoided dynamic memory altogether, and used fixed-size tables for vnodes and proc structures. On the other hand, when memory was required for holding temporary pathnames, they borrowed buffers from the block buffer cache (Check chapter 3 of Maurice's book). Another case was the 4.2BSD, which in McKusick wordings had "at least ten different memory allocators. Some of them handle large blocks, some handle small chained data structures, and others include information to describe I/O operations. .. Multiple allocators reduce the effeciency with which memory is used. The kernel ends up with many different free lists of memory instead of a single freelist from which all allocation can be drawn" * Currently, our page allocator is interfaced through two simple functions: get_free_page() which returns a page frame descriptor, and free_page(page) which frees given page frame. The naming is taken from Linux, which are the historic names Torvalds used in his 0.01 relase and are still in-use today. * Now that we've finished writing our page allocator, we focus on kmalloc()-like kernel memory allocators. A memory allocator utilization factor = total memory 'requested' by the kmalloc() caller ------------------------------------------------------------------- total memory 'required' by allocator to satisfy those requests As said by McKusick, Good memory utilization in the kernel is more important than in user processes. Pages in the process address space that are in 'required' but not in the 'requested' pool need not tie up physical memory. But the kernel is not typically paged out, so any not-well-utilized kernel-allocated memory pages are held forever, and can not be used for other purposes. It's typical of user-space memory allocators not to release unused memory in the required pool to avoid the syscall overhead. Kernel allocators are required to release such memory as quick possible though, due to above point of holding kernel memory pages forever, and because the kernal can quickly manipulate its own page maps. * Not to degrade performace, memory returned by the allocator must be aligned according to the machine requirements. I guess for our x86-64 case, the safest haven is 8-byte alignemnt. Speaking of speed, McKusick suggests that the allocator need only to be fast for small pieces of memory. Inspecting a time-shared machine at Berkely, he noticed that "large allocations occur infrequently, and are typically for long-lived objects such as buffers to hold the superblock for a mounted file system". * For a nice survey of kernel memory allocators, check referenced 'Unix Internals' chapter. The 4.2BSD allocator, is covered under 'Simple Power-of-Two Free Lists', and the 4.3BSD allocator, explained in the McKusick paper, is clearly described under 'The McKusick-Karles Allocator' section; Vahalia is a great writer. * Before jumping to sketching our own allocator, we finally survey some of the allocators used in modern unix kernels: In Linux, the page allocator provides contiguous pages using a binary buddy system instead of depending on hardware circuity to emulate this continuity. This, in Daniel Bovet words, offers the big advantage of leaving the kernel paging tables unchanged, which leads to less average memory access times cause such modifications lets the CPU flush its Translation-Lookaside Buffers. NOTE! The smallest unit returned by this buddy page allocatoris one PAGE. The groups of 2, 4, 8, 16, .., 1024 are a physically contigous set of page frames. This is not a kernel memory allocator (KMA) like the McKusick-Karles one; this is a page allocator that uses the buddy algorithm for serving _contigous pages_ instead. The page allocator main structure used is mem_map, which holds all the page descriptors of the specific memory zone this page allocator belongs, like ZONE_DMA, HIGHMEM, etc. Notice that mem_map is also the historic naming Linus used for his page allocator in the 0.01 release. That page allocator was done using a bitmap, which as we can see, fits the 'mem_map' naming better. Above this page allocator in Linux exists the regular sl[au]b object-caching allocator, which requests pages from the page allocator upon need. In FreeBSD, the kernel uses the 4.3BSD allocator described in the McKusick paper for small allocations. For large fixed-size allocations, it uses the Zone/slab allocator instead. * In our kernel, we assign high weights to simplicity without sacrificying much speed and memory resources. Our page allocator's main structure is the Page Frame Descriptor Table (pfdtable; Unix SVR2 terminology), which is an array/table that contains all the system's available (non e820-reserved; above kernel memory area) physical pages descriptors. Free pages descriptors are connected in place by a linked list pfdfree_head as the list head. Allocation and Reclamation is a simple O(1) matter of adding to or removing from the pfdfree_head list. The kernel memory allocator is the 4.3BSD McKusick-Karles allocator. I chose it cause of its beautiful simplicity, speed, and space effeciency for small allocations. We shall develop a slab-like caching allocator for big objects once the need arise; we're obviously influenced by FreeBSD in this area. On Virtual Memory: ------------------ * References: - Uresh Vahalia, "Unix Internals", Chapter 13 - Virtual Memory - Peter J. Denning, 'Virtual Memory', Princeton, ACM Computing Surveys - Maurice Bach, "The Design of the UNIX Operating System", Chapter 9 - Memory Management Policies - Jonathan Corbet, "Four-level page tables", LWN.net, http://lwn.net/Articles/106177/, http://www.webcitation.org/5ncgFSpzm * First, please check our detailed notes on the great Dennings's virtual memory paper; the paper covers a huge number of now very-well-established concepts. Below paragraphs assumes familiarity with the concepts covered in that paper. * To cause considerable reduction in the amount of virtual-to-physical mapping information that needs to be stored, methods like segmentation and paging were invented. In the words of Denning, "each method groups information into blocks, a block being a set of contiguous addresses in address space. The entries in the mapping table will refer now to blocks, which are _far_ less numerous than addresses in a process address space". In other words, instead of storing mapping information for each address, we store the mapping info fore each equally-sized _page_ or for each _segment_. * In paging, even while storing mapping information for each page instead of for each address, the page table can consume a huge area of memory. For example, in a fictional 32-bit machine with a page size of 1KB, and an address space of 2GB, we need: 2^31 / 2^10 = 2^21 entries. For an entry size of 32-bit, 4-bytes: page table size = 2^21 * 2^2 = 2^23 = 2^3 * 2^20 = 8MB. So we'll need a _contiguous_ 8MB table for _each_ system process address space which is definitely not tolerable. * Analyzing the problem, we see that the root cause is having mapping information for non-used pages; a process rarely use its entire address space anyway. It's also common to have a virtual address-space which is much larger than machine's available physical memory. To solve this problem we either: - have segmented page tables. Each segment of the process has its own page table which is just large enough to hold the valid address range for that segment. - second is having the page table itself paged, which means an additional higher level page table is used to map the lower-level page table. With such a multi- teired hierarchy, we need to allocate only those pages of the lower-level table that map used/valid addresses of the process. This is the common approach. On Partitionting the 64-bit kernel address space: ------------------------------------------------- * References: - Andi Kleen, "Porting Linux to x86-64", SuSe Labs - Michael Matz, Hubicka, Andrea Jaeger, Mark Mitchell, "System V Application Binary Interface, AMD64 processor supplement", SuSe Labs, Codesourcey - Linux Kernel linux-2.6/Documentation/x86/x86_64/mm.txt, kernel.org - Our kernel comments at head.S and include/paging.h - MSDN Library, Win32 and COM development, "64-bit Virtual Address Space", http://msdn.microsoft.com/en-us/library/aa384271(VS.85).aspx http://www.webcitation.org/5nckv6YBf - Intel Manuals volume 2A, A-M Instructions Reference * Partitioning the address space of a long mode kernel is an interesting challenge. There are so many tradeoffs to be taken care of. First, let's remember that we have full 64-bit space. This has obvious potentials, but also some drawbacks that need be taken care of. * The first drawback of using a full 64-bit address space is code size. Imagine when every far jump, global data access, and function call uses a full 64-bit address. The difference is outlied in below pure x86-32 app: /* File1: sum.c */ int global; int sum() { return 0; } /* File2: app.c */ extern int global; int x, y; int main() { global = 0xdead; x = 0xceeb; return sum(); } # -m32: 32-bit compilation and linkage $ gcc -O3 -m32 sum.c app.c -o app which generates the very small code below: $ objdump -D app 0804956c : /* 4-byte address */ 804956c: 00 00 add %al,(%eax) ... 08049570 : /* 4-byte address */ 8049570: 00 00 add %al,(%eax) 08048370
: /* 10-bytes in total * 4-byte absolute address dereference */ 8048376: c7 05 68 95 04 08 ad movl $0xdead,0x8049568 804837d: de 00 00 /* 10-bytes in total * 4-byte absolute address dereference */ 8048380: c7 05 6c 95 04 08 eb movl $0xceeb,0x804956c 8048387: ce 00 00 /* Only 5 bytes in total * 4-byte relative offset */ 8048394: e8 c7 ff ff ff call 8048360 ... Now imagine if above app used full 64-bit virtual addresses, each 4-byte address dereference above would have consumed full 8-bytes; even some ops would consume 140% extra size. Inspect below results: $ gcc -m64 -mcmodel=large -c sum.c -o sum.o $ gcc -m64 -mcmodel=large -c app.c -o app.o $ ld -Ttext 0xffff800000000000 app.o sum.o -o app $ objdump -D app | less ffff8000002000cc : /* .. 8-byte address */ ffff8000002000cc: 00 00 add %al,(%rax) ... ffff8000002000d0 : /* .. 8-byte address */ ... ffff800000000024
: /* 8-bytes address, 15-bytes ops size in total. Compare * with the 10-byte 32-bit equivalent ops (50% increase) */ ffff800000000028: 48 b8 c8 00 20 00 00 mov $0xffff8000002000c8,%rax ffff80000000002f: 80 ff ff ffff800000000032: c7 00 ad de 00 00 movl $0xdead,(%rax) /* 8-bytes address, 15-bytes ops size in total. Compare * with the 10-byte 32-bit equivalent ops (50% increase) */ ffff800000000038: 48 b8 cc 00 20 00 00 mov $0xffff8000002000cc,%rax ffff80000000003f: 80 ff ff ffff800000000042: c7 00 eb ce 00 00 movl $0xceeb,(%rax) ... /* 8-byte address, *12-bytes* ops size in total. Compare * with the 5-byte 32-bit equivalent ops (140% increase!) */ ffff80000000005d: 48 ba 00 00 00 00 00 mov $0xffff800000000000,%rdx ffff800000000064: 80 ff ff ffff800000000067: ff d2 callq *%rdx * As you can see from above, naively using the 64-bit address space can lead to diastrous effects on code size, and thus, the cache footprint. To solve this, some help from both the kernel and the compiler is needed. First, let's focus on user-space. To let user-space avoid this large code-size hit, it's common to assign to it the 0 -> 4GB+ virtual space, and let the compiler generate 32-bit addresses for code and global data. This is for example, how the Linux ecosystem does it. The kernel assigns below virtual addresses to user-space (for each process): 0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm hole caused by [48:63] sign extension And let gcc emits 32-bit addressing for code and global data using the -mcmodel=small parameter, which is the default memory model. %rip-relative addressing is also used by default at this model for dereferencing object globals; it already consumes same amount of opcode size relative to x86-32's in-operand 32-bit absoulte dereference. * Now how are we going to produce similar effect in kernel-space code when the 0 -> 4G+ space has been already reserved by user-space? Thanks to the elegance of the 2's complement model, the common case is to use negative 32-bit addresses for kernel linking. This leads to negative 32-bit discplacement for kernel ops. We then let the CPU implcitly (automatically) sign-extend those discplacements, leading us to use the top 64-bit address space, but with the size-effeciency of 32-bit addresses! An example is overdue. Imagine linking our kernel with the 0xffffffff80000000 as base. %rax points to a physical address in our text or data area. To let it be virtual, we add the base to it. Mainly we want to do: %rax = %rax + 0xffffffff80000000 Now, thanks to the sign-extension trick, the above instruction can be size- -effeciently encoded as if the displacement was 32-bit as follows: ffffffff8010b347: 48 05 00 00 00 80 add $0xffffffff80000000,%rax As you can see, only 4-bytes are used for displacement (0x80000000). The REX.W prefix (0x48) with the x86 ADD opcode (05) will let the x86-64 cpu automatically sign extend the 32-bit immediate 0x80000000 to $0xffffffff80000000 and achieve our goals. * Here's a more complete explanation from our kernel comments: Using -2GB as a base empowers the compiler to directly encode symbolic references as 32-bit values, and let the CPU implicitly sign-extended those to 64-bits due to the REX.W prefix. Thus, we use the top 64-bit address space, but with 32-bit size effeciency! Check below example: Passing global strings to memcpy: 48 c7 c6 80 24 8b 80 mov $0xffffffff808b2480,%rsi 48 c7 c7 60 d8 10 80 mov $0xffffffff8010d860,%rdi e8 f4 39 00 00 callq ffffffff8010bb00 To avoid the size-overhead of 64-bit offsets in function calls, we limit the kernel text to a 32-bit virtual area. Thus, an x86-64 callq will only consume a 4-byte offset: relative space between next op and the destination label. To dereference kernel globals (instead of just passing their addresses), we'll use %rip-relative addressing within + or - 2GB limit, avoiding the size-overhead of referencing a full 8-byte address. All of this was in fact designed by the SysV AMD64 ABI developers, and is called the 'kernel code model'. We use this code model, and let userspace use the 'small' or the 'medium' one. It's interesting to note that NT userspace defaults to the medium model, while most UNIX implementations default to the 'small' one. * To get a state of the size difference, check the difference of function call size overhead between the 'large' and the 'kernel' code models: $ objdump -j .text -D kernel.elf | grep call # -mcmodel=kernel ffffffff801090c8: e8 e3 45 00 00 callq ffffffff8010d6b0 ffffffff801090df: e8 ec 48 00 00 callq ffffffff8010d9d0 ffffffff80109105: e8 86 41 00 00 callq ffffffff8010d290 ffffffff8010b97b: e8 70 f9 ff ff callq ffffffff8010b2f0 ffffffff8010ba61: e8 6a ff ff ff callq ffffffff8010b9d0 ffffffff8010bdd8: e8 83 fc ff ff callq ffffffff8010ba60 ffffffff8010c749: e8 a2 f8 ff ff callq ffffffff8010bff0 <__kmalloc> ... $ objdump -j .text -D kernel.elf | grep call # -mcmodel=large /* Notice the ~%140 increase in size per call */ ffffffff801090cb: 48 b8 d0 e6 10 80 ff mov $0xffffffff8010e6d0,%rax ffffffff801090d2: ff ff ff ffffffff801090d5: ff d0 callq *%rax ffffffff801091a3: 48 bb 00 ea 10 80 ff mov $0xffffffff8010ea00,%rbx ffffffff801091aa: ff ff ff ffffffff801091b9: ff d3 callq *%rbx ... So, while admittedly using -mcmodel=large makes things a bit easier for the 64-bit kernel developer (as it gives them the freedom of basing the kernel on any desired virtual address), it induces too much size overhead that it's worth the effort avoiding it. For example, since Cute uses -mcmodel=kernel, we have two mappings in our system, one for kernel text and data (-2GB), and _another_ mapping for the entire physical memory discovered at system boot. Extra care was needed while developing this in an already very complicated subsystem: memory management. This is also the same reason why Linux's amd64 (unlike ia32) __PAGE_OFFSET is different from the kernel linkage base at __START_KERNEL_MAP = -2GB. * So, keeping above points in mind, we'll use below strategy for dividing our kernel space addresses: 0x0000000000000000 -> 0x00007fffffffffff userspace mappings 0x0000800000000000 -> 0x0000ffffffffffff not canoncial; cannot be used 0xffff800000000000 -> 0xffffc00000000000 maps for all physical memory (64TB max) 0xffffffff80000000 -> 0xffffffffc0000000 kernel text and data map * Debugging -O3 triggered problems: ----------------------------------- * While adding a good number of GCC warning request flags to detect bugs as early as possible, I enabled the -O3 switch only to discover that it triggered a good number of hidden kernel bugs. First bug led to several exceptions. Upon investigating the CPU's reported faulting instruction, I found its an SSE x86 op. SSE ops require special %cr0 and %cr4 setup that we're not ready for. The solution was simply instructing GCC to avoid emitting SSE ops using its -mno-see{2,3,4} and -mno-3dnow control flags. * Second bug was way uglier. It was only triggered at -O3 and with Kmalloc test code (and only kmalloc test code, not any of the rest of the tests) enabled. The faults were a bit random, from continuous triple faults to rubbish on the screen by printk(). This also means that printk is not available for debugging. Switching to -O2 mute all those faults .. * An important debugging tool when printf is not available is inspecting the final kernel ELF. Having the ELF at hand helps us at: - directly knowing the exact addresses of all global symbols - translate any %rip address to the exact place in the executable. This aids in knowing things like whether a core had deadlock on a spinlock (%rip inifinitely tied at the spin loop), the exact x86 op of a faulting %rip address, etc * Applying above methods, we inspect the global vga_buffer array. This array get its contents directly copied to the vga card, so we can have a nice view about the source of screen corruption: $ readelf --syms kernel.elf | grep vga_buffer 114: ffffffff8010eb60 4000 OBJECT LOCAL DEFAULT 6 vga_buffer Halting the CPU and inspecting physical address 0x10eb60 contents, we find that: At -O2, well-formatted vga text: 0f 75 0f 43 0f 64 0f 74 - 'Cute' At -O3, it's rubbish: 9f 19 9e fa 9f 9f ae 80 ... Further inspecting the printk() code, I found there's no apparent problem in the code itself (or its generated assembly). The printk() code was mainly passed rubbish strings, instead of the expected valid ones. * Inspecting the issue further, I've put a global string: static char string[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"; $ readelf --syms kernel.elf | grep string 114: 0000000000000000 0 FILE LOCAL DEFAULT ABS string.c 118: ffffffff8010e680 37 OBJECT LOCAL DEFAULT 7 string Inspecting those mem locations at -O2, yielded the expected byte values: 41 42 43 44 45 46 47 48 .. At -O3, however, below values were found: 10 80 48 89 f2 48 03 53 08 48 01 eb * Checking the unexpected bytes sequence above, I smelled x86 opcodes, which they truly were! They were from the disassembly: ffffffff80108140 : ... ffffffff801081b7: 48 c7 c7 18 e3 10 80 mov $0xffffffff8010e318,%rdi ffffffff801081be: 48 89 f2 mov %rsi,%rdx ffffffff801081c1: 48 03 53 08 add 0x8(%rbx),%rdx ffffffff801081c5: 48 01 eb add %rbp,%rbx ... So we got a lead, but it's just an indication of something utterly and completely messed-up. * Trying to inspect the issue further, I halted the CPU and dumped first 10MB of RAM before the crash. There were some really surprising results. Searching for some of the strings we send to printk() led to nothing, nada, nil. The strings we send, they must be there at the kernel binary in ASCII form, aren't simply there. Instead most of them are replaced by x86 ops. * Thinking about the situation, a change to kernel binary at such scale can mostly happen by a serious bug at bootstrap code. Checking the first 3 lines of our bootstrap disassembly, where it moves kernel code from bootstrap BSS to its expected place, made everything clear: kern/kernel.elf: file format elf64-x86-64 Disassembly of section .text.head: 0000000000100000 : 100000: be a0 03 10 00 mov $0x1003a0,%esi 100005: bf 04 80 10 00 mov $0x108004,%edi <-- [1] 10000a: b9 00 00 02 00 mov $0x20000,%ecx [1] As you can see, if the kernel image we're copying away from bootstrap BSS to its right place exceeded a size of 0x8000 bytes, the part of the image > 0x8000 will be overriden, which leads to the mess we were in. As in C's memmove(), we now avoid such corruption by copying kernel code to a safe temporary buffer first, then copy it again to the final overlapping place. (Sometimes reading disassembly makes you catch bugs quicker than reading source code. Although in this case the source was really in assembly, when we saw the _numbers_ instead of just the symbols exported by the linker script, we caught the bug rather instantaneously) * Important lessons: - Take extreme care of overlaps while coding x86 'rep movsb' and friends. It's good to dynamically assure that: + src < dst, AND dst >= (src + len) .. i.e., bad overlap + OR src >= dst .. i.e., safe overlap Any other condition means direct corruption. The 'safe' overlap is only so if we are using 'movsb' instead of movsl or movsq. Remember that this 'safe' overlap corrupts the source, but keeps the destination data valid. - To be safe against such kind of overlaps in C parts of the kernel, we've added below memcpy() asserts: if (src > dst && (src - dst) < 8) OR (src <= dst && (dst - src) < len): panic() [1] which can be reduced by logic (check Appendix) to: if (dst + 8 > src) AND (src + len > dst): panic(); [2] The first check is used cause of 'movsq' 8-byte block copy. We've intentionally transformed substraction in the '>' and '<' expressions to addition cause of the _unsigned_ pointer values. ---------------------------------------------------------------------- | Appendix | | It's easy to prove equivalence of the 'if' conditions marked | with [1] and [2] above; let: | | A eq 'src > dst' ==> ~A eq 'src <= dst' | B eq '(src - dst) < 8' | C eq '(dst - src) < len)' | D eq '(A and B) or (~A and C)' | src >= 0, dst >= 0, len >= 0 | | We want to prove that: | D is fully equivalent to (B and C) | | From the premises, we can see that: | if (B and C) is true, then | '(A and B) or (~A and C)' is true, then | D is true (I) | | From the premises, we can also see that: | if A is true, then C is true (II) | if ~A is true, then B is true (III) | | From premises and (II): | if (A and B) is true, then (A and B and C) is true (IV) | | From premises and (III): | if (~A and C) is true, then (~A and C and B) is true (V) | | From premises, (IV), and (V): | if D is true, then (B and C) is true (VI) | | From (I) and (VI): | D is fully equivalent to (B and C) Q.E.D. ---------------------------------------------------------------------- * Global lessons: - Disassembly is a great tool for debugging kernel code, since a lot of things taken for granted at user-space is not guaranteed here anymore. Abstraction cut both ways indeed. - It's _very important_ to deeply test each code path, no matter how trivial it seems (basic stuff, but sometimes we forget the basics). - Memory dumping of critical kernel regions is really valuable; remember to use it as needed. - It's important to test the code with different environment combinations: different optimization levels: GCC's -O {0, 1, 2, 3, s} different memory sizes: Qemu's -m {2, 3, .., 500, .., 2000, ..} different number of cores: -smp {1, 2, 3, ..} different CPU speeds: Bochs's ips= parameter different BIOSes: different versions of Bochs Bios, different PC forms, .. * The Zone: ----------- (Or the bug that took me around 6 days to solve) * While writing the context switch code, I observed a very ugly bug. Somehow the montonic PIT interrupts that got triggered every 1 millisecond corrupts kernel state. Symptoms: random exceptions and random assert() failures allover the place. * So I first thought there must be something wrong in the timer handler itself. So what if the PIT handler did exactly nothing but acknowledging the local APIC? (It must acknowledge the APIC so future IRQs get handled) The handler code was successfully reduced to: push %rax movq $(VIRTUAL(APIC_PHBASE) + APIC_EOI), %rax movl $0,(%rax) pop %rax iretq But still the same buggy behavior prevailed. * After reducing the handler code to a number of bytes, with no positive results, I thought let's check the IDT validity. Inspecting the IDT cells date yielded nothing interesting. Also checking the %rflags, bit by bit, led to nothing interesting. So I thought maybe we should setup the IDT cells as 'trap' instead of 'interrupt' gates, but this is ofcourse wrong. Trap gates leaves the IF flag set while servicing the interrupt, which means handlers can get interrupted indefinitely, and our stack can get overflowed in a matter of milliseconds. * OK, so I thought instead of focusing on the IRQ handlers and the IDT, lets focus on the _code_ that get interrupted itself. The failed code was usually the general memory allocator kmalloc() test cases. After inspecting the interrupted code, I found nothing interesting. * To let the problem domain be smaller, and after minimizing the handler code, I thought lets also minimize the interrupted code and see. Having a bunch of printk()s in a tight loop, with the timer handler on, triggered the problem! Here the second day finishes. Either something is really messed up, or I'm a shitty developer that can't even write a stable printk(). Furthermore, running any testcases after enabling the interrupts make them fail miserably. From all the debugging I've done in the last 6 months, I began to feel the instincts of such 'deep' kernel bugs: this is definitely one of them. * So, I began to feel hopeless. I'll just inspect the x86 ops that triggered the exceptions, hoping to find any pattern in the code that trigger the bug. Definitely a pattern was clearly there; some examples of the failing x86 ops: mov %rax, 0x0(%rbp) movzbl (%rax), %eax movl $0x2, 0x8(%rax) movl $0x1, 0x4(%rax) ... And the most triggered, probably the _only_ triggered, exception was the page fault #PF exception, vector 14. The faulting (unmapped) address was things like %cr2=0x0, %cr2=0x8, %cr2=0x200020, etc. From above pattern, I thought: who's really messing up with our registers, getting them loaded with weird values like 0, 0x8, etc? * So, we're in the verge of the third day end with no direct results, and nothing known about the source of the bug. Having nothing specific to think about, I inspected the critical kernel paths: the IDT setup, IOAPIC entries setup, the GDT, the bootsector, early kernel boot setup, printk() code, etc. Funnily, cause I had a 'global' view about the kernel while doing this deep inspection, 9 issues scattered around was discovered that needs to be fixed. Unfortunately, fixing those 9 issues didn't really solve our main problem, but it was nice nonetheless to discover them. * So we're now in the fifth day, and nothing is clear. Hopelessly, I swap our printk code with FreeBSD's one. Things work fine, but once we enable interrupts and the PIT in monotonic mode, similar erroneous behavior appears. Swapping the FreeBSD code with the Linux code, and again, leading to the same failures. So I was right, something 'deep' is messed up, not just the nature of the interrupted code. Another hopeless try was swapping the data segment descriptor index @ the GDT from 0x10 to 0x18. Something interesting was discovered; the page faults addresses occurs from values originating from %cs, %ss, and %rflags; it was completely repeatable. Sounds familiar? Yes, they should! Those are the values pushed by the CPU before invoking IRQ handlers! So somehow, registers that get dereferenced get the wrong values from the state pushed and popped by the CPU before and after invoking an IRQ handler. Now _this_ is an interesting discovery. So, how the %ss, %cs, and %rflags values got 'leaked' to our general-purpose registers? The stack _must_ have been corrupted somehow .. * Trying different directions, we minimize the timer handler even further only to the 'iretq' x86 op, but how do we acknowledge the local APIC this way? In fact, we dont. I just masked all the IOAPIC entries, and let the PIT trigger through the PIC. The PIC was then programmed in 'Automatic EOI' mode, where no acknowledgment was needed. This reduced the handler only to: iretq Even with this super-minimal handler, the bug gets triggered :( It was also interesting that, again, the #PF handler reported bad memory values resembling %rflags, %ss, etc. * So after minimizing the timer handler to the absolute architecturally minimum, we try to minimize the faulting code even further. After some hours of disassembly, I find that at -O1 the VGA code alone (without the printk() code that called it) triggers the problem. After even further inspection and disassembly (and wasted hours), I find that not only we can limit the code to VGA writing, but only to the string function memcpy(). So, now, all I have is the timer interrupt and an infinite loop containing memcpy, and the code still fails! And, yes, the memcpy() code was valid (ehem, Dijkestra's 'Humble Programmer') * So I go crazy stripping the entire kernel. All what was left was code that re-programs the PIC, the PIT setup code, and the string function memcpy() and the code still failed. That was, well, weird. We really have nothing now except a very small number of lines of code and I don't know what to do. Once we enable interrupts, the memcpy() function fail with similar patterns to the one described above (page faults, with faulting memory values similar to %ss and %rflags) Well, you can imagine the state I'm in after all those days. * Another step for more explicitness was disabling the PIT, and triggering its handler manually using software interrupts. I've inject software interrupts around strategic C statements like '*src = *dst', etc. * So at the middle of the middle of the sixth day, I disassemble the memcpy() function, draw the stack on paper, and trace. Now something became directly apparent. Notice those instructions (gcc -O0): /* Classical stack setup */ ffffffff80109c88: 55 push %rbp ffffffff80109c89: 48 89 e5 mov %rsp,%rbp ffffffff80109c8c: 48 89 7d e8 mov %rdi,-0x18(%rbp) ffffffff80109c90: 48 89 75 e0 mov %rsi,-0x20(%rbp) /* Bochs magic breakpoint */ ffffffff80109caa: 66 87 db xchg %bx,%bx /* Our manual software interrupt */ ffffffff80109cad: cd f0 int $0xf0 /* Failing code, specially last line */ ffffffff80109caf: 48 8b 45 f8 mov -0x8(%rbp),%rax ffffffff80109cb3: 0f b6 10 movzbl (%rax),%edx ffffffff80109cb6: 48 8b 45 f0 mov -0x10(%rbp),%rax ffffffff80109cba: 88 10 mov %dl,(%rax) Notice something? Look again. See those `mov -0x8(%rbp), %rax' and `mov -0x10(%rbp), %rax) ops? Now check the top of the function, especially the stack pointer setup. Can't you notice something? YES, GCC used areas of the stack _B E L O W_ the stack pointer to keep parts of the leaf function local state. Now when the interrupt was triggered at `int $0xf0', the CPU rightfully pushed CPU state in that order: %ss %rsp %rflags %cs %rip Which meant that parts of that function local state get corrupted through the implicit CPU stack usage! This is also the reason we found faulting addresses of values resembling %ss and %rflags. Our processor modified our pointers with those! * Wow, THAT WAS INTERESTING and worthwhile to invest the week in. Inspecting the AMD64 ABI to see why GCC uses this behavior, which is _certainly_ interrupt and signals unsafe, I found the answer: it's 'The Red Zone'. Now I've really had enough for today, see the discussion of this zone at the AMD64 ABI document. * Solution: Either: [1] Ignore the 128-byte region in the IRQ handler OR [2] Let the compiler avoid using this Red Zone We can't really use [1], since the CPU pushes its state behind our backs. We can't let it wind 128-bytes of the stack. Also the kernel stack is a very valuable resource, and we shouldn't waste 128-Bytes of it to save an x86 op or two. Thus, we pick the second option. That's also what the mainstream folks like {Open, Free}BSD and Linux do (which I discovered too lately). * Lessons: - When faced with triple faults, inspect the following: - the IDT table: if it has been overwritten, then the machine will definitely triple fault. - stack sanity: If the stack was not sane, then pushes and pops will lead to faults, leading to triple faults. - MM structures sanity. If the pagetables have been erroneously overwritten, the stack will get erroneously mapped. - Kernel stack overflow: corrupt stack = triple faults - If you noticed that an interrupt randomly triggers a problem in the kernel, it can be good to disable interrupts (avoid distractions), then find the most affected code path and _manually_ inject interrupts there using software IRQs. - The best possible way to let IRQ handlers be as minimal as possible is to use the PIC in Automatic EOI (AEOI) mode. They will get the handler minimized to the 2 byte `iretq' x86-64 op. - Let code binary single-stepping be our last debugging resort to make sure we understand the algorithms and code we've just written well enough. Debugging periodic timers using statistical data: ------------------------------------------------- * A problem appeared while running the scheduler with the APIC timer instead of the PIT. Using observation, it was clear that the running intervals given to processes by the APIC timer are twice as long as the ones given by the PIT. Both timers were supposedly programmed to produce 40 millisecond intervals, so where's the fault? Note: we are talking about monotonic PIT and APIC interrupts since those are the only ones used by the scheduler; we don't use the one-shot variants here. * One of the best ways I found to debug this problem, instead of depending on shaky heuristics, is to store the occurrence of the timer ticks interrupts in a statistical manner. An example of this statistical method would be finding the number of timer ticks occurrence after a sequence of independently-programmed delay intervals. * How to do so without sacrificing simplicity? Setup the scheduler periodic timer through the PIT or the APIC: `pit_monotonic(40)' OR `apic_monotonic(40)'. Setup a counter representing the number of scheduler timer ticks so far: .globl ticks_counter ticks_counter: .quad 0 .globl timer_handler timer_handler: ... incq ticks_counter ... Then, do _independent_ time delays and store the number of timer ticks so far after each delay. This way, we can have a statistical comparison between the two timers, the one used for the delay and the one used for the ticks. An example of this would be: /* ticks[i]: number of scheduler timer ticks after the `i'th delay * interval, having in mind that we've setup both the ticks and delay * timers with the same time priod. * * If `1 2 3 4 5 6 ..' is printed, then the two timers are in harmony. * Having the sequence '3 5 7 9 11 ..' means that the ticks timer runs * twice as fast as the delay one. */ for (int i = 0; i < 100; i++) pit_mdelay(52), ticks[i] = ticks_counter; for (int j = 0; j < 100; j++) printk("%d ", ticks[j]); Comparing the statistical results, we found that the PIT in monotonic mode ticks twice as fast as the PIT programmed in delay mode. It also ticked twice as fast as the APIC timer in delay mode. This heavily suggests something wrong in our PIT monotonic-mode setup! Meanwhile, setting up the scheduler ticks using the APIC timer ran in perfect harmony with the PIT-programmed delays. This heavily re-enforces above paragr- aph reasoned speculation: something is wrong in our PIT periodic mode code. * Now that the problem was isolated, the error was a misunderstanding of the PIT's mode-3 behavior; mode-2 fitted our goals better. Interestingly, I found this method really helpful for testing periodic timers code in general. We already test timer delays code by calibrating its produced delays against wall clock time; this method complement the tests by also testing the monotonic/periodic interrupts code. Note that for testing, specifying a large delay period catches more periodic timers accuracy errors. Values from 50 ms and up proved to be good. * A similar test-case for periodic timers is comparing their statistical ticks occurrences relative to each other. This will guarantee that all system periodic timers really measure the same concept of time! The Naughty Multiply: --------------------- * References: - Intel Instruction Set Reference, A-M, vol. 2A, MUL. - GCC manual, 'Using the GNU Compiler Collection' * So I was writing a casual number of test-cases, where this sequence of code: void run_tests(void) { memset(_arr, 0xFF, 0x100); __memcpy_forward(_arr, _arr + 20, 10); __memcpy_forward(_arr + 20, _arr, 10); } produced a NULL-dereference exception. As printed by our exception handler: Exception: vector=14, errcode=0x2, %cs=0x8, %rip=0xffffffff8010b6f7, %ss=0x0, %rsp=0xffffffff80104ff0, %rflags=0x200202, %cr2=0x0, %cr3=0x3ffef000, %cr4=0x20 It was weird: the code of these two functions was a very simple and straightforward inline GCC assembly. The problem was clearly related somehow to registers-usage pressure (the compiler register allocator decisions) in the mother function run_tests(). Removing one memcpy() above, or adding a printk() in a random place in that function would make the NULL exception disappear. * Disassembling the area around the faulting opcode, we found: ffffffff8010b6a0 : ffffffff8010b6a0: 48 c7 c2 20 fb 10 80 mov $0xffffffff8010fb20,%rdx ffffffff8010b6a7: 31 f6 xor %esi,%esi ffffffff8010b6a9: b8 ff 00 00 00 mov $0xff,%eax ffffffff8010b6ae: 48 89 d7 mov %rdx,%rdi ffffffff8010b6b1: b9 00 01 00 00 mov $0x100,%ecx ffffffff8010b6b6: 48 89 ce mov %rcx,%rsi ffffffff8010b6b9: 48 83 e1 07 and $0x7,%rcx ffffffff8010b6bd: f3 aa rep stos %al,%es:(%rdi) (+) ffffffff8010b6bf: 48 89 f1 mov %rsi,%rcx ffffffff8010b6c2: 48 c1 e9 03 shr $0x3,%rcx ffffffff8010b6c6: 48 89 c6 mov %rax,%rsi ffffffff8010b6c9: 48 b8 01 01 01 01 01 mov $0x101010101010101,%rax ffffffff8010b6d0: 01 01 01 ffffffff8010b6d3: 48 f7 e6 mul %rsi ffffffff8010b6d6: f3 48 ab rep stos %rax,%es:(%rdi) (++) ffffffff8010b6d9: 41 b9 0a 00 00 00 mov $0xa,%r9d ffffffff8010b6df: 31 c0 xor %eax,%eax ffffffff8010b6e1: 48 c7 c6 34 fb 10 80 mov $0xffffffff8010fb34,%rsi ffffffff8010b6e8: 48 89 d7 mov %rdx,%rdi ffffffff8010b6eb: 44 89 c9 mov %r9d,%ecx ffffffff8010b6ee: 41 89 c0 mov %eax,%r8d ffffffff8010b6f1: 41 89 c8 mov %ecx,%r8d ffffffff8010b6f4: 83 e1 07 and $0x7,%ecx ffffffff8010b6f7: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi) From our exception handler, we can see that the faulting instruction is the above 'rep movsb'. We can also deduce that (+) and (++) ops are produced for memset(), and the rest are the memcpy() opcodes. * Tracing the code by hand, I reached the faulty instruction without catching the NULL dereference. A null dereference at the 'movsb' instruction can only mean that either %rsi = 0, or %rdi = 0, or both had magically turned to zeroes. Wondering about the state, I overwrote this movsb instruction in the final kernel binary with a Bochs magic breakpoint code (66 87 db) using a simple hex editor. After moving to the debugging console, thanks to the magic breakpoint, here was the Bochs registers dump: (0) Magic breakpoint CPU0: rax: 0x00000000:00000000 rcx: 0x00000000:00000002 rdx: 0x00000000:00000000 rbx: 0xffffffff:801129a8 rsp: 0xffffffff:80104ff0 rbp: 0x00000000:00000000 rsi: 0xffffffff:8010fb34 rdi: 0x00000000:00000000 r8 : 0x00000000:0000000a r9 : 0x00000000:0000000a r10: 0x0f200f20:0f200f20 r11: 0x66656463:62613938 r12: 0x00000000:00000000 r13: 0x00000000:00000000 r14: 0x00000000:00000000 r15: 0x00000000:00000000 rip: 0xffffffff:8010b6fa eflags 0x00200202: ID vip vif ac vm rf nt IOPL=0 of df IF tf sf zf af pf cf One thing was clearly noted: for some reason, %rdx (and %rdi by consequence) have magically turned to zeroes, leading to the NULL dereference! * Finding that no explicit opcode zeroed %rdx above, then the zeroing must have been a side-effect of some instruction above the faulty dereference. Looking at the Intel Manual for cues, I found that opcode: it was the 'mul %rsi' one. So why didn't GCC know about this? Because _I_ did not know about it while coding, thus I did not explicitly tell GCC that this nice register will get implicitly clobbered. GCC, naively thought that the register value was unmodified, and rightfully re-used that register value at address 0x8010b6e8. GCC is really competent wrt to optimizations. The faulty code was: asm volatile ( [...] "mov %[ch], %[tmp];" "mov $0x0101010101010101, %[ch];" "mul %[tmp];" "rep stosq;" : [tmp] "=&r"(tmp), [dst] "=&D"(d0), [ch] "=&a"(uch), [len] "=&c"(ulen) : "[dst]"(dst), "[ch]"(uch), "[len]"(ulen), "[tmp]"(tmp) : "memory"); As you can see, %rdx was not mentiond as a clobbered register, and not even as an output register. GCC had all the rights to believe that this register value was not modified in that snippet. In another way, the clobber list tells gcc which registers (or memory) are changed by the snippet, but not listed as an output. * The simple solution was adding the 'rdx' register to the assembly snippet clobber list. Now the compiler output looks like: ffffffff8010b6a0 : ffffffff8010b6a0: 49 c7 c0 20 fb 10 80 mov $0xffffffff8010fb20,%r8 [...] ffffffff8010b6e8: 48 89 d7 mov %r8,%rdi [...] ffffffff8010b6f7: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi) where the compiler uses the new register %r8, entirely avoiding the %rdx register in the process. * From investigation, it seems GCC need to also be notified if the assembly snippet alters the condition code register %rflags. We now handle this by adding `cc' to the clobber list while using ops like 'andq', 'popfq', 'decq', and 'shrq'. Cases where we forgot to clobber 'rcx' after using opcodes like 'rep XX' were also caught. For this, we only need to set 'rcx' as a double input+output register; GCC will infer the rest. Another faulty case was having 'rsi' be only set as an input register amid the existence of 'movsq' in the assembly snippet. * Lesson: While I did study the privileged x86 ring0 opcodes carefully (mainly Intel's manual vol. 3A, and AMD vol. 2), the basic x86 opcodes were learned 'on the go'. The problem we faced today is a small price for that; I should re-read Intel's vol. 1 docs way more carefully. To make-up for this, and also using the Intel manuals, I investigated the side effects of __every__ opcode used in the kernel inlined assembly snippets. That's how I caught the %rflags, %rcx, and %rsi cases mentioned in the above point!