| b3bbe1d by Ahmed S. Darwish at 2010-11-16 |
1 |
|
|
2 |
|
| 15681eb by Ahmed S. Darwish at 2010-02-18 |
3 |
Notes on the Cute Kernel |
|
4 |
|
| 649633f by Ahmed S. Darwish at 2010-07-29 |
5 |
|
|
6 |
TOC: |
|
7 |
---- |
|
8 |
|
|
9 |
(1) On Intel Documents |
|
10 |
(2) On Data Alignment |
|
11 |
(3) On little-endian byte ordering |
|
12 |
(4) On Real Mode |
|
13 |
(5) On accessing fixed disk drivers (hard-disks) |
|
14 |
(6) On x86 jumps |
|
15 |
(7) On referencing global symbols |
|
16 |
(8) The 8259A Programmable Interrupt Controller |
|
17 |
(9) On x86 SMP initialization sequence |
|
18 |
(10) On linkers 'relocation error's |
|
19 |
(11) When %RIP-relativeness goes wrong |
|
20 |
(12) Testing libc()-like kernel code |
|
21 |
(13) On SMP Memory ordering models |
|
22 |
(14) On the C 'volatile' type qualifier |
|
23 |
(15) On the vaguely documented x86 16-bit protected mode |
|
24 |
(16) Using light debugging techniques |
|
25 |
(17) Musings on Memory Allocation |
|
26 |
(18) On Virtual Memory |
|
27 |
(19) On Partitionting the 64-bit kernel address space |
|
28 |
(20) Debugging -O3 triggered problems |
|
29 |
(21) The Zone! |
|
30 |
(22) Debugging periodic timers using statistical data |
| 5bc1d68 by Ahmed S. Darwish at 2010-11-16 |
31 |
(23) The Naughty Multiply |
| 649633f by Ahmed S. Darwish at 2010-07-29 |
32 |
|
|
33 |
|
| 550f8c7 by Ahmed S. Darwish at 2009-08-25 |
34 |
On Intel Documents: |
|
35 |
------------------- |
|
36 |
|
|
37 |
* Although the newer intel documents are much more sexy and detailed, the |
|
38 |
original i386 one describes the concepts in a much more bottom-up, |
|
39 |
undergraduate-book like, no forward-references way. |
|
40 |
|
|
41 |
* Somehow the new intel documents beginning from the 1997 i686 document |
|
42 |
assumes some x86 familarity that wasn't obviously assumed in the original |
|
43 |
document. |
|
44 |
|
|
45 |
* For an example, the 'alignment' topic is throughly explained in |
|
46 |
the i386 manual, what is it and why is it important to performance, |
|
47 |
and how the processor act if given a misaligned address, *before* ever |
|
48 |
using the words '16bit-aligned' and '32bit-aligned'. |
|
49 |
|
|
50 |
The new intel documents assume you know subtle topics as above and |
|
51 |
talks about critical issues related to their understandings freely |
|
52 |
*before* they become throughly discussed. Only at i686 chapter 5 |
|
53 |
alignment is throughly described although several critical alignment |
|
54 |
related topics have already been discussed. |
|
55 |
|
| ded3ad0 by Ahmed S. Darwish at 2009-09-03 |
56 |
* Case 2: on why we should disable interrupts while switching to protected |
|
57 |
mode? |
|
58 |
|
|
59 |
New Intel document: Software must guarantee that no exceptions or interrupts |
|
60 |
are generated during the mode switching operation. |
|
61 |
|
|
62 |
Original 386 document: ``The IDTR may be loaded in either real-address or |
|
63 |
protected mode. However, the format of the interrupt table for protected mode |
|
64 |
is different than that for real-address mode. It is not possible to change to |
|
65 |
protected mode and change interrupt table formats at the same time; therefore, |
|
66 |
it is inevitable that, if IDTR selects an interrupt table, it will have the |
|
67 |
wrong format at some time. An interrupt or exception that occurs at this time |
|
68 |
will have unpredictable results. To avoid this unpredictability, interrupts |
|
69 |
should remain disabled until interrupt handlers are in place and a valid IDT |
|
70 |
has been created in protected mode.'' |
|
71 |
|
|
72 |
* Case 3: on why should we long jump after switching to protected mode? |
|
73 |
|
|
74 |
New intel document: ``The JMP or CALL instruction immediately after the MOV |
|
75 |
CR0 instruction changes the flow of execution and serializes the processor.'' |
|
76 |
|
|
77 |
Original 386 document: ``Immediately after setting the PE flag, the |
|
78 |
initialization code must flush the processor's instruction prefetch queue by |
|
79 |
executing a JMP instruction. The 80386 fetches and decodes instructions and |
|
80 |
addresses before they are used; however, after a change into protected mode, |
|
81 |
the prefetched instruction information (which pertains to real-address mode) |
|
82 |
is no longer valid. A JMP forces the processor to discard the invalid |
|
83 |
information.'' |
|
84 |
|
| 550f8c7 by Ahmed S. Darwish at 2009-08-25 |
85 |
This is a recurring pattern. |
|
86 |
|
|
87 |
* What's IA-32e in new Intel documents? |
|
88 |
Intel has really messed up the namings of their 64bit processors. The |
|
89 |
most logical choice I think was IA-64, but unfortunately that was taken |
|
90 |
by the big flop in processor's history: the Itanium. When Intel waked |
|
91 |
up and created their first amd64 compatible processor, they named it EMT64. |
|
92 |
|
|
93 |
Now intel in its new official documents call the x86_64 architecture |
|
94 |
"intel 64" and call those processor's 64bit mode IA-32e. |
|
95 |
|
|
96 |
|
|
97 |
On Data Alignment: |
|
98 |
------------------ |
|
99 |
|
|
100 |
* Note: In x86 speak, word = 2bytes/16bits, double word = 4bytes/32bits |
|
101 |
|
|
102 |
* Each byte within a word has its own address, and the smaller of the |
|
103 |
addresses represent the address of the whole word. |
|
104 |
|
|
105 |
* Each byte within the double word has its own address. The smallest of the byte |
|
106 |
addresses is the address of the double word. |
|
107 |
|
|
108 |
* ``Note that words need not be aligned at even-numbered addresses and |
|
109 |
doublewords need not be aligned at addresses evenly divisible by four. |
|
110 |
This allows maximum flexibility in data structures (e.g., records containing |
|
111 |
mixed byte, word, and doubleword items) and efficiency in memory |
|
112 |
utilization. |
|
113 |
|
| 7bc027c by Ahmed S. Darwish at 2010-07-24 |
114 |
** When used in a configuration with a 32-bit bus, actual transfers of |
|
115 |
data between processor and memory take place in units of doublewords (4bytes) |
|
116 |
beginning at addresses evenly divisible by four ** |
| 550f8c7 by Ahmed S. Darwish at 2009-08-25 |
117 |
|
|
118 |
However, the processor converts requests for misaligned words or doublewords |
|
119 |
into the appropriate sequences of requests acceptable to the memory interface. |
| 7bc027c by Ahmed S. Darwish at 2010-07-24 |
120 |
** Such misaligned data transfers reduce performance by requiring extra memory |
| 550f8c7 by Ahmed S. Darwish at 2009-08-25 |
121 |
cycles. For maximum performance, data structures (including stacks) should |
|
122 |
be designed in such a way that, whenever possible, word operands are aligned |
|
123 |
at even addresses and doubleword operands are aligned at addresses evenly |
| 7bc027c by Ahmed S. Darwish at 2010-07-24 |
124 |
divisible by four. ** |
| 550f8c7 by Ahmed S. Darwish at 2009-08-25 |
125 |
|
|
126 |
Due to instruction prefetching and queuing within the CPU, there is no |
|
127 |
requirement for instructions to be aligned on word or doubleword boundaries. |
|
128 |
However, a slight increase in speed results if the target addresses of control |
|
129 |
transfers are evenly divisible by four.'' |
|
130 |
-- Intel 386 Reference Manual |
|
131 |
|
|
132 |
* ``The reason for this is that the processor requires two memory accesses to |
|
133 |
make an unaligned memory access; whereas, aligned accesses require only one |
|
134 |
memory access. A word or doubleword operand that crosses a 4-byte boundary |
|
135 |
or a quadword operand that crosses an 8-byte boundary is considered unaligned |
|
136 |
and requires two separate memory bus cycles to access it; a word that starts |
|
137 |
on an odd address but does not cross a word boundary is considered aligned |
|
138 |
and can still be accessed in one bus cycle.'' |
|
139 |
-- Intel 686 Reference Manual |
|
140 |
|
|
141 |
* x86 4KB page aligned addresses: |
|
142 |
|
|
143 |
Decimal: |
|
144 |
10 ^ 0 = 1 |
|
145 |
10 ^ 1 = 10 |
|
146 |
10 ^ 2 = 100 |
|
147 |
10 ^ 3 = 1000 |
|
148 |
... |
|
149 |
|
|
150 |
Binary: |
|
151 |
2 ^ 0 = 1 |
|
152 |
2 ^ 1 = 10 |
|
153 |
2 ^ 2 = 100 |
|
154 |
... |
|
155 |
2 ^ 10 = 10000000000 = 1024 = 1KB |
|
156 |
2 ^ 12 = 1000000000000 = 2^2 * 1KB = 4KB (x86 Page size/aligned-address) |
|
157 |
... |
| f654c49 by Ahmed S. Darwish at 2009-09-07 |
158 |
|
|
159 |
* Common alignments offsets: |
|
160 |
|
|
161 |
1byte alignment = 2 ^ 0 -- any x86 RAM address |
|
162 |
|
|
163 |
2byte alignment = 2 ^ 1 (0b10) -- 16bit aligned address |
|
164 |
= any address divisable by 2 |
|
165 |
= 0x0, 0x2, 0x4, 0x6, 0x8, ..., 0xa, 0xc, 0xe, 0x10, 0x12, ... |
| 550f8c7 by Ahmed S. Darwish at 2009-08-25 |
166 |
|
| f654c49 by Ahmed S. Darwish at 2009-09-07 |
167 |
4byte alignment = 2 ^ 2 (0b100) -- 32bit aligned address |
|
168 |
= any address divisable by 4 |
|
169 |
= 0x0, 0x4, 0x8, 0xc, 0x10, 0x14, 0x18, 0x1c, 0x20, ... |
|
170 |
|
|
171 |
8byte alignment = 2 ^ 3 (0b1000) - GDT cell index/size |
|
172 |
= any address divisable by 8 |
|
173 |
= 0x0, 0x8, 0x10, 0x18, 0x20, ... |
|
174 |
| | | |
|
175 |
null %cs %ds |
|
176 |
|
|
177 |
16byte alignment = 2 ^ 4 (0b10000) - GDT table recommended alignment |
|
178 |
= any address divisable by 16 |
|
179 |
= 0x0, 0x10, 0x20, 0x30, 0x40, ..., 0xf0, 0x100, 0x110, 0x120, ... |
|
180 |
|
| ded3ad0 by Ahmed S. Darwish at 2009-09-03 |
181 |
* Useful numbers: |
|
182 |
|
|
183 |
0x100 = 256 bytes |
| e4d1afe by Ahmed S. Darwish at 2010-07-29 |
184 |
0x200 = 512 bytes (bootsector size, number of entries in PML{4, 3, 2} tables) |
| ded3ad0 by Ahmed S. Darwish at 2009-09-03 |
185 |
0x400 = 1KB |
| f654c49 by Ahmed S. Darwish at 2009-09-07 |
186 |
0x1000 = 4KB (x86 normal page size/alignment) |
| ded3ad0 by Ahmed S. Darwish at 2009-09-03 |
187 |
0x8000 = 32KB |
|
188 |
0xffff = 64KB -1 (max offset for a real-mode segment) |
|
189 |
0x10000 = 64KB |
| f654c49 by Ahmed S. Darwish at 2009-09-07 |
190 |
0x80000 = 512KB (our initial kernel size!) |
| 550f8c7 by Ahmed S. Darwish at 2009-08-25 |
191 |
|
| b3bbe1d by Ahmed S. Darwish at 2010-11-16 |
192 |
|
| 812cb17 by Ahmed S. Darwish at 2009-10-28 |
193 |
On little-endian byte ordering: |
|
194 |
------------------------------- |
|
195 |
|
|
196 |
* From AMD64 documents: |
|
197 |
|
|
198 |
"The x86 and AMD64 architectures address memory using little-endian |
|
199 |
byte-ordering. Multibyte values are stored with their least-significant byte at |
|
200 |
the lowest byte address, and they are illustrated with their least significant |
|
201 |
byte at the right side. Strings are illustrated in reverse order, because the |
|
202 |
addresses of their bytes increase from right to left." |
|
203 |
|
|
204 |
Value = |
|
205 |
|
|
206 |
BYTE-7 BYTE-6 BYTE-5 BYTE-4 BYTE-3 BYTE-2 BYTE-1 BYTE-0 |
|
207 |
|
|
208 |
Actual Mem Layout: |
|
209 |
|
|
210 |
BYTE-0 BYTE-1 BYTE-2 BYTE-3 BYTE-4 BYTE-5 BYTE-6 BYTE-7 |
|
211 |
^ |
|
212 |
|--- Address start here |
|
213 |
|
|
214 |
For illustrative purposes only - not actual mem layout: |
|
215 |
|
|
216 |
GREATEST DATA STRUCTURE |
|
217 |
ADDRESS |
|
218 |
31 23 15 7 0 <--BIT |
|
219 |
+---------------+---------------+---------------+---------------+ OFFSET |
|
220 |
| BYTE 7 | BYTE 6 | BYTE 5 | BYTE 4 |4 |
|
221 |
|---------------+---------------+---------------+---------------| SMALLEST |
|
222 |
| BYTE 3 BYTE 2 BYTE 1 BYTE 0 |0 ADDRESS |
|
223 |
+---------------+---------------+---------------+---------------+^ |
|
224 |
| |
|
225 |
Address start here ---| |
|
226 |
|
|
227 |
* Classical little-endian ordering: |
|
228 |
|
|
229 |
.long 0x00cf9200 = 00 92 CF 00 |
|
230 |
^ |
|
231 |
|--- Address start |
|
232 |
= |
|
233 |
|
|
234 |
+---------------+---------------+---------------+---------------+ |
|
235 |
| 00 | CF | 92 | 00 | |
|
236 |
+---------------+---------------+---------------+---------------+^ |
|
237 |
Address start ---| |
|
238 |
|
|
239 |
* GAS concatenation: |
|
240 |
|
|
241 |
.long 0x00cf9200, 0x0000ffff != .quad 0x00cf92000000ffff |
|
242 |
|
|
243 |
.long 0x00cf9200, 0x0000ffff |
|
244 |
= |
|
245 |
.long 0x00cf9200 |
|
246 |
.long 0x0000ffff |
|
247 |
= |
|
248 |
.quad 0x0000ffff00cf9200 |
|
249 |
= |
|
250 |
(actual layout in memory) |
|
251 |
00 92 CF 00 FF FF 00 00 |
|
252 |
^ |
|
253 |
|--- Address start |
|
254 |
= |
|
255 |
(for illustrative purposes only - _NOT_ actual mem layout) |
|
256 |
+---------------+---------------+---------------+---------------+ |
|
257 |
| 00 | 00 | FF | FF | |
|
258 |
|---------------+---------------+---------------+---------------| |
|
259 |
| 00 | CF | 92 | 00 | |
|
260 |
+---------------+---------------+---------------+---------------+^ |
|
261 |
Address start ---| |
|
262 |
BUT: |
|
263 |
|
|
264 |
.long 0x0000ffff |
|
265 |
.long 0x00cf9200 |
|
266 |
= |
|
267 |
.quad 0x00cf92000000ffff |
|
268 |
= |
|
269 |
(for illustrative purposes only - _NOT_ actual mem layout) |
|
270 |
+---------------+---------------+---------------+---------------+ |
|
271 |
| 00 | CF | 92 | 00 |4 |
|
272 |
|---------------+---------------+---------------+---------------| |
|
273 |
| 00 | 00 | FF | FF | |
|
274 |
+---------------+---------------+---------------+---------------+^ |
|
275 |
Address start ---| |
|
276 |
= |
|
277 |
(actual layout in memory) |
|
278 |
FF FF 00 00 00 92 CF 00 |
|
279 |
^ |
|
280 |
|--- Address start |
|
281 |
|
|
282 |
* So: |
|
283 |
|
|
284 |
test.S: | print.c: |
|
285 |
| |
|
286 |
.global test | #include <stdint.h> |
|
287 |
test: | extern uint64_t test; |
|
288 |
.quad 0x00cf92000000ffff | int main(void) { printf("test = 0x%x\n", test); } |
|
289 |
|
|
290 |
output: test = 0x0000ffff |
|
291 |
as the 64-bit value is presented as: (address-start) FF FF 00 00 00 92 ... (address-end) |
|
292 |
|
|
293 |
* The String "WXYZ" is represented in memory as: |
|
294 |
|
|
295 |
'W' 'X' 'Y' 'Z' |
|
296 |
^ |
|
297 |
|--Address start |
|
298 |
|
|
299 |
And in the usual memory diagrams: |
|
300 |
|
|
301 |
+---------------+---------------+---------------+---------------+ |
|
302 |
| 'Z' | 'Y' | 'X' | 'W' | |
|
303 |
+---------------+---------------+---------------+---------------+^ |
|
304 |
Address start ---| |
|
305 |
|
| b3bbe1d by Ahmed S. Darwish at 2010-11-16 |
306 |
|
| 0db0564 by Ahmed S. Darwish at 2009-08-30 |
307 |
On Real Mode: |
|
308 |
------------- |
|
309 |
|
|
310 |
* To achieve maximum backward compatibility, x86 processors starts in the 16bit |
|
311 |
8086-compatible real-mode. The 8086 had a complete 16-bit architecture - 16-bit |
|
312 |
internal registers, 16-bit data bus, and a *20-bit* external address bus (1 MB |
|
313 |
of physical memory). |
|
314 |
|
|
315 |
Because the 8086 processor had 16-bit index registers and memory pointers, |
|
316 |
it can effectively address *only* up to 64 KB of memory, which wasn't acceptable |
|
317 |
even by that time. |
|
318 |
|
|
319 |
* So, to address the memory beyond 64 KB (the remaining 20bit 1MB address space) |
|
320 |
with 16bit registers, Intel created a `segmented' memory addressing scheme. |
|
321 |
|
|
322 |
We want to access 20bit address space, namely 0xfffff at max. The 8086 |
|
323 |
processor's registers are only 16bit, namely a max value of 0xffff. To solve |
|
324 |
this issue, the 8086 processor used *two* registers to address physical memory. |
|
325 |
The two registers are a segment described as a base address, and an offset within |
|
326 |
that segment. |
|
327 |
|
|
328 |
The goal: addressing up to the max 20bit value; 0xfffff: |
|
329 |
|
|
330 |
f 0 0 0 0 <-- 16bit value (segment base address) hex left shifted once to |
|
331 |
address the 4 highest order bits in the goal. |
|
332 |
+ <-- A simple adding operation |
|
333 |
|
|
334 |
f f f f <-- 16bit value, max offset we can address |
|
335 |
|
|
336 |
------- |
|
337 |
|
|
338 |
f f f f f <-- The goal (max 20bit value we can address) |
|
339 |
|
|
340 |
So physical memory in the 8086 is accessed through a `segment:offset' scheme |
|
341 |
|
|
342 |
where physical address = (segment * 16) + offset |
|
343 |
|
|
344 |
* Note that the above segment model is very primitive that there are really no |
| 5097e48 by Ahmed S. Darwish at 2009-08-30 |
345 |
real `segments' identified by the processor. A *physical* 'segment' is |
|
346 |
basically just a 16-bit aligned physical address address with an offset |
|
347 |
`limit' of 64K enforced by the processor registers limit. |
| 0db0564 by Ahmed S. Darwish at 2009-08-30 |
348 |
|
|
349 |
Example: |
|
350 |
|
| 5097e48 by Ahmed S. Darwish at 2009-08-30 |
351 |
a000:0010 = a010:0000 /* Overlapping 'segments' */ |
|
352 |
f000:ffff = ffff:000f /* Maximum address using different 'segments' */ |
|
353 |
40:100 = 50:00 /* Last byte in Bios Data Area */ |
|
354 |
|
|
355 |
It's important to note that a segment register can take any value ranging |
|
356 |
from 0 to 0xffff including the non-16bit-aligned addresses. It's on *physical* |
|
357 |
memory where a segment base address is always 16-bit aligned due to the left |
|
358 |
shift, not the segment register value itself. |
|
359 |
|
|
360 |
Example of completely valid segment register values and their physical |
|
361 |
addresses: |
|
362 |
|
|
363 |
Segment register value | Segment physical address |
|
364 |
------------------------------------------------------- |
|
365 |
0xffff | 0xffff0 |
|
366 |
0x1abc | 0x1abc0 |
|
367 |
0x4242 | 0x42420 |
|
368 |
... | ... |
|
369 |
0xyyyy | 0xyyyy0, y = /[0-f]/ |
| 0db0564 by Ahmed S. Darwish at 2009-08-30 |
370 |
|
|
371 |
* Some calculations: |
|
372 |
|
|
373 |
biggest segment address possible = 0xffff |
|
374 |
|
|
375 |
biggest offset possible = 0xffff |
|
376 |
|
|
377 |
biggest *physical* address that can act as a segment address = |
| f654c49 by Ahmed S. Darwish at 2009-09-07 |
378 |
biggest 20-bit address which is 16-byte aligned = 0xffff0 |
| 0db0564 by Ahmed S. Darwish at 2009-08-30 |
379 |
|
|
380 |
* NOTE: There's a huge difference between real-mode `segments' and segments |
|
381 |
provided in 386+ protected-mode processors. The 16bit value in a segmented |
|
382 |
protected-mode logical address is what's called a `segment selector' which |
|
383 |
is basically a pointer to a `segment descriptor' stored in either the GDT or |
|
384 |
the LDT. The segment descriptor includes info like the protected-mode segment |
|
385 |
base, limit, access permissions, present bit, etc. |
|
386 |
|
|
387 |
On the other hand, the real-mode `segments' are just primitive addresses |
|
388 |
that get hex left shifted in address calculations. |
| 550f8c7 by Ahmed S. Darwish at 2009-08-25 |
389 |
|
| c6c1eef by Ahmed S. Darwish at 2009-09-01 |
390 |
* Note that using this scheme, code segments are limited to a size of 64KB. |
|
391 |
To reach more code, we had to jump to what's called 'far pointers', which |
|
392 |
not only contains an offset relative to current code segment, but also a |
|
393 |
new address for the code segment. |
|
394 |
|
|
395 |
AT&T syntax for far jumps: `ljmp $SEGMENT, $OFFSET' where 'segment' and |
|
396 |
'offset' are naturally 16bit values. |
|
397 |
|
|
398 |
Usage example: after loading the kernel to ram at its chosen segment which |
|
399 |
is usually far away from the mbr code segment, we 'far jump' to this segment |
|
400 |
with offset 0 to execute the loaded kernel code and go beyond the mbr's 512 |
|
401 |
bytes size limit. This is the same technique used by GRUB to loads its stages. |
|
402 |
|
|
403 |
* The above terminology is also the root of 386+ protected mode 'far pointer' |
|
404 |
which is a 'logical address' containing a segment selector (16 bit) and an |
| 7bc027c by Ahmed S. Darwish at 2010-07-24 |
405 |
offset (32 bit) relative to the segment pointed by that selector. |
| c6c1eef by Ahmed S. Darwish at 2009-09-01 |
406 |
|
| b3bbe1d by Ahmed S. Darwish at 2010-11-16 |
407 |
|
| 4b12840 by Ahmed S. Darwish at 2009-08-30 |
408 |
On accessing fixed disk drivers (hard-disks): |
|
409 |
--------------------------------------------- |
|
410 |
|
|
411 |
* There's the usual heads, tracks, sectors, and cylinders. |
|
412 |
|
|
413 |
Track: when the head is positioned over a point on the fixed disk, ready |
|
414 |
to read or write, the disk platter surface spins underneath it, tracing a |
|
415 |
full circle. This circle is a track. A disk surface contains may tracks. |
|
416 |
|
|
417 |
Sectors: Fixed disk systems divides each track into short arcs called |
|
418 |
sectors, each sector usually holds 512 bytes. This is the smallest unit |
|
419 |
of storage, as tracks would mean wasting a lot of space for small files. |
|
420 |
|
|
421 |
Cylinder: all the heads move in and out together, so each head is always |
|
422 |
physically located at the same track number; we can't have one head at |
|
423 |
track 0, and another at 250. Because of this, often the track location of |
|
424 |
the heads is not referred to as a track number but rather as a cylinder |
|
425 |
number. |
|
426 |
|
| c6c1eef by Ahmed S. Darwish at 2009-09-01 |
427 |
Disk size = bytes per sector * sectors per track * number of cylinders * |
|
428 |
number of heads |
|
429 |
|
| 4b12840 by Ahmed S. Darwish at 2009-08-30 |
430 |
* BIOS fixed disk services are provided through the INT 13h service |
|
431 |
routine. It programs the hard-disk controller; it never writes directly |
|
432 |
from the processor to the hard-disk drive. The hard-disk controller |
|
433 |
communicates directly between the hard-disk, the processor, and system |
|
434 |
memory. |
|
435 |
|
| c6c1eef by Ahmed S. Darwish at 2009-09-01 |
436 |
* To support disks beyoned the 8GB limit enforced by original IBM bios |
|
437 |
design (max of 1024 cylinder, 256 heads, and 63 sectors), new INT13 |
|
438 |
bios services were added to bios which offers simple sector addressing |
|
439 |
using a logical zero based sector number (0, 1, 2, ..) |
|
440 |
|
|
441 |
This addressing scheme is easier than the Cylinder-Head-Sector model |
|
442 |
and is supported in all bios chips post-1995, so we use it exclusively. |
|
443 |
|
|
444 |
Check the boot sector code for more details. |
| b46a295 by Ahmed S. Darwish at 2009-10-11 |
445 |
|
| e4d1afe by Ahmed S. Darwish at 2010-07-29 |
446 |
|
| 7bc027c by Ahmed S. Darwish at 2010-07-24 |
447 |
On x86 jumps: |
|
448 |
------------- |
|
449 |
|
|
450 |
* References: |
|
451 |
- Intel i686 Manual, vol. 2, Instruction Set Reference, Chapter 3: JMP |
|
452 |
- Sun Microsystems, [AT&T] x86 Assembly language reference manual |
| e4d1afe by Ahmed S. Darwish at 2010-07-29 |
453 |
- Cute: boot/bootsect.S, boot/head.S, boot/e820.S, boot/trampoline.S |
| 7bc027c by Ahmed S. Darwish at 2010-07-24 |
454 |
|
|
455 |
* Judging by my early naive trial-and-error experiences and the nature of |
|
456 |
newbies questions at osdev.org forums, the most common failure point while |
|
457 |
developing bootloaders and early kernel stages is either an intra or an |
|
458 |
inter-module jump. Because of this, we'll clarify the most used x86 jumps |
|
459 |
here, detailing their AT&T syntax and opcodes. We don't use the Intel |
|
460 |
syntax in our kernel, so it won't be discussed here; it's already clearly |
|
461 |
defined in the official Intel manuals. |
|
462 |
|
|
463 |
First, let's stress that a 'jump' is only a mechanism to disrupt the CPU |
|
464 |
execution flow, moving EIP to a different code (hopefully not data!) region. |
|
465 |
|
|
466 |
Generally, x86 jumps can be classified to either near or far. Near jumps |
|
467 |
are 'intra-segment', meaning that they only move within the limits of the |
|
468 |
currently setup and cashed code segment. Meanwhile, far jumps are |
|
469 |
'inter-segment': they take a (mostly new) code segment descriptor and an |
|
470 |
offset within such segment. |
|
471 |
|
|
472 |
The kind of displacement in such jumps can also be either relative or |
| 96aef36 by Ahmed S. Darwish at 2010-07-27 |
473 |
absolute. Relative addresses (wrt the next instruction address) were |
|
474 |
introduced to save opcodes size. [*] |
|
475 |
|
|
476 |
Absolute addresses are very important when relative offsets to an |
|
477 |
outside-of-code-object destination address cannot be validly calculated |
|
478 |
at link-time. Check the 'When %RIP-relativeness goes wrong' section's |
|
479 |
summary for further details. |
| 7bc027c by Ahmed S. Darwish at 2010-07-24 |
480 |
|
|
481 |
Examples of jumps: |
|
482 |
|
|
483 |
<AT&T Syntax> | <Objdump Disassembly> |
|
484 |
|
|
485 |
[A] Near jump, 8-bit relative displacement (-128): [short jump] |
|
486 |
|
|
487 |
1: movl $0xdeadbeef, %eax | af: b8 ef be ad de |
|
488 |
jmp 1b | b4: eb f9 |
|
489 |
|
|
490 |
[B] Near jump, 8-bit relative displacement (+127): [short jump] |
|
491 |
|
|
492 |
1: movl $0xdeadbeef, %eax | af: b8 ef be ad de |
|
493 |
jmp 2f | b4: eb 00 |
|
494 |
2: movl $0xcafebabe, %ebx | b6: bb be ba fe ca |
|
495 |
|
|
496 |
[C] Near jump, 16-bit relative displacement: |
|
497 |
|
|
498 |
[Not supported by GAS] |
|
499 |
|
|
500 |
[D] Near jump, 32-bit relative displacement: |
|
501 |
|
|
502 |
jmp dst | 12: e9 f8 ff 0f 00 |
|
503 |
.skip 0xffff8 | ... |
|
504 |
dst: movl $0xdeadbeef, %eax | 10000f: b8 ef be ad de |
|
505 |
|
|
506 |
[E] Near jump, absolute indirect displacement (register): |
|
507 |
|
|
508 |
mov $0x100000, %eax | af: b8 00 00 10 00 |
|
509 |
jmp *%eax | b4: ff e0 |
|
510 |
|
|
511 |
[F] Near jump, absolute indirect displacement (memory reference): |
|
512 |
|
|
513 |
jmpl *(0x7c00 + mem) | af: ff 25 b5 7c 00 00 |
|
514 |
mem: .long 0x100000 | b5: 00 00 10 00 |
|
515 |
|
|
516 |
(AT&T syntax uses the '*' operator for indirect jumps) |
|
517 |
|
|
518 |
Note that the _only_ available way for specifying _near_ absolute |
|
519 |
displacements is to encode such displacements 'indirectly' (i.e. not |
|
520 |
in the operands themselves) using a register or a memory reference. |
|
521 |
|
|
522 |
[G] Far jump, absolute 32-bit far pointer in operand (16-bit %cs |
|
523 |
index + 16-bit offset): |
|
524 |
|
|
525 |
.code16 | |
|
526 |
ljmp $0x08, $(0x7c00 + rel) | 99: ea 9e 7c 08 00 |
|
527 |
.code32 | |
|
528 |
rel: movl $0x10000, %esi | 9e: be 00 00 01 00 |
|
529 |
|
|
530 |
[H] Far jump, absolute 48-bit far pointer in operand (16-bit %cs |
|
531 |
index + 32-bit offset): |
|
532 |
|
|
533 |
.code32 | |
|
534 |
# KERNEL_CS = 0x08 | |
|
535 |
# startup_64 = 0x100112 | |
|
536 |
ljmp $KERNEL_CS, $startup_64 | 10006f: ea 12 01 10 00 08 00 |
|
537 |
|
|
538 |
[I] Far jumps, 32-bit or 48-bit absolute indirect displacement |
|
539 |
(memory reference): |
|
540 |
|
|
541 |
[Rarely used, let's only know that they exist] |
|
542 |
|
| 96aef36 by Ahmed S. Darwish at 2010-07-27 |
543 |
[*] footnote: On x86-32, only jumps and calls had the relative addressing |
|
544 |
facility. To save opcode size, especially while handling 64-bit wide virtual |
|
545 |
addresses, AMD introduced a global and default %RIP-relative addressing |
|
546 |
scheme for most ops, including most important, dereferencing global data. |
|
547 |
|
|
548 |
|
| e4d1afe by Ahmed S. Darwish at 2010-07-29 |
549 |
On referencing global symbols: |
|
550 |
------------------------------ |
|
551 |
|
|
552 |
* References: |
|
553 |
- GNU LD Manual, 'Using LD', Section 3.5.4 -'Source Code Reference' |
|
554 |
- Cute: include/idt.h, kernel/idt.S |
|
555 |
|
|
556 |
* To gracefully fail instead of triple faulting the machine, one of the first |
|
557 |
things implemented while writing the C part of the kernel was setting up CPU |
|
558 |
exception handlers. To setup these handlers, we needed access to the IDT table |
|
559 |
and its descriptor; both were already defined in assembly files. What are the |
|
560 |
rules for accessing these data structures natively in C? |
|
561 |
|
|
562 |
When a global symbol is declared in C, two things happen. The compiler reserves |
|
563 |
enough space in the executable .data or .bss section holding the _value_ of the |
|
564 |
symbol. Second, it creates an entry in the program's symbol table which holds |
|
565 |
the symbol _address_ (in that section). For a small example, check below |
|
566 |
program and its disassembly: |
|
567 |
|
|
568 |
--> globals.c (a1): | --> [disassembly, $gcc -O3]: |
|
569 |
int x; | 0000000000600800 <y>: 11 11 00 00 ... (.data) |
|
570 |
int *y = 0x1111; | 0000000000600818 <x>: ... (.bss, 8 bytes) |
|
571 |
| |
|
572 |
int main(void) { | 0000000000400460 <main>: |
|
573 |
x = 0xdeadbeef; | 400460: movq 0x200399(%rip),%rax #600800 <y> |
|
574 |
*y = 0xcafebabe; | 400467: movl $0xdeadbeef,0x2003a7(%rip) #600818 <x> |
|
575 |
} | 400471: movl $0xcafebabe,(%rax) |
|
576 |
| 400477: retq |
|
577 |
|
|
578 |
As you can see, accessing the global 'x' required a single memory dereference |
|
579 |
of the address 0x600818 stored in the program's symbol table. Meanwhile, |
|
580 |
dereferencing the pointer 'y' required TWO memory dereferences: one for |
|
581 |
dereferencing the variable 'y' address 0x600800, getting its value/contents, |
|
582 |
and _another_ dereference for such value stored (0x1111); after all, a pointer |
|
583 |
is just a special 8-byte variable. |
|
584 |
|
|
585 |
* Furthermore, check below program which has several symbols defined in an |
|
586 |
assembly file with different treatments of such symbols (and their data) in C: |
|
587 |
|
|
588 |
--> data.S (a2): | --> app.c (a2): |
|
589 |
.section .data | uint32_t x; |
|
590 |
.globl x | extern uint32_t *y; |
|
591 |
x: | extern uint32_t z[]; |
|
592 |
.long 0xdeadbeef | |
|
593 |
| int main(void) { |
|
594 |
.globl y | printf("&x = 0x%lx\n", &x); |
|
595 |
y: | printf("&y = 0x%lx\n", &y); |
|
596 |
.long 0xcafebabe | printf("&z = 0x%lx\n", &z); |
|
597 |
| |
|
598 |
.globl z | printf("x = 0x%x\n", x); |
|
599 |
z: | printf("y = 0x%x\n", y); |
|
600 |
.long 0xeeeeeeed | ... /* further code: check the output */ |
|
601 |
| } |
|
602 |
|
|
603 |
$ gcc -O3 data.S app.c -o output.elf |
|
604 |
$ objdump -D output.elf -j .data |
|
605 |
0000000000600af0 <x>: |
|
606 |
600af0: ef be ad de be ba |
|
607 |
0000000000600af4 <y>: |
|
608 |
600af4: be ba fe ca ed |
|
609 |
0000000000600af8 <z>: |
|
610 |
600af8: ed ee ee ee |
|
611 |
|
|
612 |
$ ./output.elf |
|
613 |
'int x' is a global located at program area (&x = ): 0x600af0 |
|
614 |
'int *y' is a global located at program area (&y = ): 0x600af4 |
|
615 |
'int z[]' is a global located at program area (&z = ): 0x600af8 |
|
616 |
|
|
617 |
x = 0xdeadbeef | *x = <not-applicable> |
|
618 |
y = 0xcafebabe | *y = <segmentation fault> (0xcafebabe unmapped) |
|
619 |
z = 0x600af8 | *z = 0xeeeeeeed |
|
620 |
|
|
621 |
x[0] = <not-applicable> |
|
622 |
y[0] = <segmentation fault> since 0xcafebabe is not mapped |
|
623 |
z[0] = 0xeeeeeeed |
|
624 |
|
|
625 |
As seen from above introspection, y is just a global variable holding |
|
626 |
0xcafebabe. y _address_ is '0x600af4', buts its _value_ is 0xcafebabe. Thus, |
|
627 |
'*y' will produce two memory dereferences, and the second of the two will |
|
628 |
result in a CPU exception (segmentation fault) since 0xcafebabe is not |
|
629 |
mapped. |
|
630 |
|
|
631 |
* So, if we have a structure like the below IDT table descriptor: |
|
632 |
|
|
633 |
/* Located at ffffffff8010e240 |
|
634 |
* Binary value (little endian): 00 10 80 16 11 00 00 80 ff ff */ |
|
635 |
.globl idtdesc |
|
636 |
idtdesc: |
|
637 |
.word idt_end - idt # limit |
|
638 |
.quad VIRTUAL(KTEXT_PHYS(idt)) # base |
|
639 |
|
|
640 |
we can _not_ reference it in C as |
|
641 |
|
|
642 |
struct idt_descriptor { |
|
643 |
uint16_t limit; |
|
644 |
uint64_t base; |
|
645 |
} __packed; |
|
646 |
|
|
647 |
extern struct idt_descriptor *idtdesc; /* WRONG!! */ |
|
648 |
idtdesc->limit = 0; /* Page-fault exception! */ |
|
649 |
|
|
650 |
since in above case, the variable pointer _address_ (&idtdesc = ) will be |
|
651 |
0x10e240, and the pointer _value_ (idtdesc = ) will equal 0x8000001116801000. |
|
652 |
Thus, dereferencing such pointer as in 'idtdesc->limit' above will in fact |
|
653 |
dereference 0x10e240 first, then afterwards dereference the unmapped address |
|
654 |
0x8000001116801000! Final result: a CPU page fault exception. |
|
655 |
|
|
656 |
The only wrong thing above was specifying the wrong type for the 'idtdesc' |
|
657 |
label. Having the type as the struct itself instead of a pointer will let the |
|
658 |
struct address be 0x10e240, and its value will equal 0xffff8000001116801000 |
|
659 |
which _is_ the desired result: |
|
660 |
|
|
661 |
extern const struct idt_descriptor idtdesc; |
|
662 |
idtdesc.limit = 0; /* Safe!! */ |
|
663 |
|
|
664 |
* Now, if we want to reference the IDT table itself: |
|
665 |
|
|
666 |
/* IDT_GATES: number of entries in the table */ |
|
667 |
.align 16 |
|
668 |
.globl idt |
|
669 |
idt: |
|
670 |
.skip IDT_GATES * 16 |
|
671 |
idt_end: |
|
672 |
|
|
673 |
Assuming a pre-defined C type of 'struct idt_gate', representing a single IDT |
|
674 |
table entry structure, all what we need is an array of such types. Namely: |
|
675 |
|
|
676 |
extern struct idt_gate idt[IDT_GATES]; |
|
677 |
|
|
678 |
Note that after all, an array 'A' is a special kind of pointer where A = &A. |
|
679 |
(label value and address are the same, equalling link-time label address |
|
680 |
stored in the symbol table) |
|
681 |
|
|
682 |
* Finally, let's reference below exception handler stubs, where they catch the |
|
683 |
exception vector number through their positions in the stubs table: |
|
684 |
|
|
685 |
/* Each stub code size = IDT_STUB_SIZE bytes */ |
|
686 |
.globl idt_exception_stubs |
|
687 |
idt_exception_stubs: |
|
688 |
i = 0 |
|
689 |
.rept EXCEPTION_GATES_CNT |
|
690 |
movq $i, %rsi |
|
691 |
jmp default_exception_handler |
|
692 |
i = i + 1 |
|
693 |
.endr |
|
694 |
|
|
695 |
Well, let's first remember here that we're pointing to code, so we're only |
|
696 |
interested in each stub address, not by any mean its contents. So, after |
|
697 |
all, since this is a table, an array fits our goals nicely. What will be the |
|
698 |
base object type? We only care about addresses, so let the types be 'char*': |
|
699 |
char pointers makes pointer arithmetic very simple. An initial try can be |
|
700 |
|
|
701 |
extern const char idt_exception_stubs[EXCEPTION_GATES_CNT]; |
|
702 |
|
|
703 |
Nice, but by this, each table cell is of size 1 byte (char). We want each |
|
704 |
table cell to have the size IDT_STUB_SIZE chars (bytes). Namely: |
|
705 |
|
|
706 |
idt_exception_stubs[EXCEPTION_GATES_CNT][IDT_STUB_SIZE] |
|
707 |
|
|
708 |
which makes the compiler automatically calculate any stub address for us. |
|
709 |
If we want the address of the third stub, we'll simply write: |
|
710 |
|
|
711 |
void *addr = &idt_exception_stubs[3]; |
|
712 |
|
|
713 |
* Summary: the simple declaration 'extern int x' directly reference the 4 bytes |
|
714 |
marked as 'x' in the executable, where 'x' address is specified at link-time |
|
715 |
in the program's symbol table. The operator [] in the declaration |
|
716 |
'extern type SYMBOL[]' leaves it open only to the number of base type objects |
|
717 |
which are present, not *where* they are. SYMBOL[n] just adds more static-time |
|
718 |
safety; SYMBOL[n][s] let the compiler automatically calculates an entry (of |
|
719 |
size 's')index for us. |
|
720 |
|
|
721 |
Meanwhile, the '*' operator in a declaration 'extern TYPE *P' only binds |
|
722 |
where the _8-byte variable_ P reside in the executable (its address), NOT |
|
723 |
where the base type object is. The base-object itself can be reached only by |
|
724 |
a double dereference: dereference of P address first, then a dereference of |
|
725 |
P's content itself (which is, in turn, an 8-byte address). |
|
726 |
|
|
727 |
|
| b46a295 by Ahmed S. Darwish at 2009-10-11 |
728 |
The 8259A Programmable Interrupt Controller: |
|
729 |
-------------------------------------------- |
|
730 |
|
| b15ccb7 by Ahmed S. Darwish at 2009-11-01 |
731 |
* References: |
|
732 |
- The Intel 8259A PIC datasheet. |
|
733 |
- The superb information in "The undocumented PC", chapter 17 - "Interrupt |
|
734 |
control and NMI". |
|
735 |
|
|
736 |
In subsequent text, and as said by the sheet, IRR = "Interrupt Request |
|
737 |
Register", ISR="In-Service Register", IMR = "Interrupt Mask Register". |
| b46a295 by Ahmed S. Darwish at 2009-10-11 |
738 |
|
|
739 |
* As described by Brendan Trotter (bcos) on #osdev, and the sheet: |
|
740 |
|
|
741 |
An interrupt request is executed by raising an IR input (low to high) and |
|
742 |
holding it high until it is acknowledged (Edge Triggered Mode). Where |
|
743 |
"acknowledged" means doing so by the PIC chip, not by the CPU/OS . Also, in |
|
744 |
this case "acknowledges" just means that it recognises it and sets the |
|
745 |
corresponding flag in the IRR |
|
746 |
|
|
747 |
Basically, when an interrupt is received the PIC sets the corresponding flag in |
|
748 |
the IRR. When the PIC sends the IRQ to the CPU it clears the flag in the IRR |
|
749 |
and sets the flag in the ISR. And when the CPU sends EOI the PIC clears the |
|
750 |
flag in the ISR. The processor sends an EOI through the the INTA pin. |
|
751 |
|
|
752 |
If the PIC sends a high priority IRQ to the CPU and is waiting for EOI when a |
|
753 |
low priority IRQ occurs, then the PIC must remember that the low priority IRQ |
|
754 |
occured (so that it can tell the CPU about it when the CPU is finished with the |
|
755 |
high priority IRQ), by setting its IRR bit. |
|
756 |
|
|
757 |
The 8259 cannot be used with level interrupts. This is because when |
|
758 |
an interrupt line goes true, it is latched internally in the IRR. |
|
759 |
|
|
760 |
* Spuruious interrupts: |
|
761 |
|
|
762 |
As said by the sheet: "If no interrupt request is present at step 4 of either |
|
763 |
sequence (i.e., the request was too short in duration), the 8259A will issue an |
|
764 |
interrupt level 7. Both the vectoring bytes and the CAS lines will look like an |
|
765 |
interrupt number 7 was requested." |
|
766 |
|
|
767 |
And as said in undoc pc: "Although a noise glitch should never occur, a poor |
|
768 |
bus design or badly glitch will trigger an invalid interrupt. If the glitch |
|
769 |
occurs on IRQ 0 to IRQ7, interrupt F (which is mapped to IRQ 7) will be |
|
770 |
called. The service handler for interrupt Fh and 77H (assuming the PIC wasn't |
|
771 |
re-mapped to more sane vector-numbers) should read the in-service register to |
|
772 |
confirm that a real interrupt occurred. An invalid interrupt has In-Service bit |
|
773 |
7 cleared, and a valid interrupt has valid bit 7 set. |
|
774 |
|
|
775 |
* So why should the PIC bother sending the CPU a spurious 7 IRQ? it could've just |
|
776 |
ignored the noise glitch source. The problem is that it might have told the CPU |
|
777 |
it's about to get an interrupt, and if it did (step 1, 2, and 3), it can not |
|
778 |
now decode the interrupt info since the line has been deasserted. Either the |
|
779 |
assert period was too short that the IRQ number was not latched internally, or |
|
780 |
the PIC just recognized it was a noise glitch due to the very short assertion |
|
781 |
period. So it gives the programmed IRQ 7 vector number to the CPU. This avoids |
|
782 |
unccessary communication with the real devices attached to IRQ7. |
|
783 |
|
|
784 |
* Take care that, as said by linux kernel comments, the two spurious IRQs 7 and |
|
785 |
15 usually resulting from the 8259A-1|2 PICs do occur even if all the IRQs are |
|
786 |
masked (IMR = 0xff). |
|
787 |
|
|
788 |
* An example of an interrupt (refs: undoc pc + intel sheet): |
|
789 |
|
|
790 |
1- The adapter card requests service, by driving the hardware IRQ line 5 from |
|
791 |
low to high |
|
792 |
|
|
793 |
2- The IRQ 5 line is connected to the PIC. The edge-triggered request is |
|
794 |
recorded in the controllers Interrupt-Request Register (IRR). In this case, |
|
795 |
the IRR bit 5 is set. |
|
796 |
|
|
797 |
3- The PIC first checks if the Interrupt Mask Register (IMR) to see if IRQ5 is |
|
798 |
allowed (not masked, IMR bit 5 = 0). Assuming that bit clear, the PIC moves |
|
799 |
to the next step. |
|
800 |
|
|
801 |
4- Using its priority resolver uint, the conntroller checks if any higher |
|
802 |
priority interrupts are active or in progress. If so, no further action is |
|
803 |
done till all higher priority irqs are serviced. |
|
804 |
|
|
805 |
5- Once all higher irqs are serviced, the controller flags the CPU over the |
|
806 |
dedicated INT hardware line leading to the CPU. |
|
807 |
|
|
808 |
6- If the CPU has hardware interrupts enabled, the CPU returns an interrupt |
|
809 |
acknowledgment to the PIC using the INTA pin. |
|
810 |
|
|
811 |
7- Directly after receiving a pulse from the INTA pin, the relevant In-Service |
|
812 |
register bit (bit 5) is set, and the corresponding IRR bit is reset. |
|
813 |
|
|
814 |
8- The processor sends another INTA pulse. During this pulse, the PIC sends |
|
815 |
a programmed 8-bit pointer back to the CPU, according the raised IRQ. In |
|
816 |
the default bios programmed PIC case, the pic will return the interrupt |
|
817 |
value 0x0D for IRQ5. |
|
818 |
|
|
819 |
9- In an instruction boundary, the cpu interrupts the current process and |
|
820 |
jumps to the code address specified in the IDT entry 0xD. For x86-64 cpus, |
|
821 |
entry 0xD = IDT base (specified through lidt) + 0xD * 16. |
|
822 |
|
|
823 |
10- One of the key aspects of the interrupt handler specified in the IDT entry |
|
824 |
must make is to remove the interrupt request made by the adapter. The line |
|
825 |
maybe shared, and several ISRs got executed, where they determine first if |
|
826 |
they are servicing the right device using the "interrupt pending" register |
|
827 |
found in PCI devices, etc. |
|
828 |
|
|
829 |
11- At the end of the interrupt handler, the handler sends the value 0x20 to |
|
830 |
the PIC command port to indicate an End-of-Interrupt (EOI). The EOI command |
|
831 |
clears the IRQ5 bit in the In-Service Regisger.(*) The case is different in |
|
832 |
case of Automatic EOI (AEOI) mode, and is described later. |
|
833 |
|
|
834 |
12- By clearing the IRQ5 bit in the ISR, the chip is now ready to accept another |
|
835 |
interrupt from IRQ5 or any lower priority interrupt. This completes the |
|
836 |
interrupt cycle. |
|
837 |
|
| b3bbe1d by Ahmed S. Darwish at 2010-11-16 |
838 |
|
| b15ccb7 by Ahmed S. Darwish at 2009-11-01 |
839 |
* On x86 SMP initialization sequence: |
|
840 |
------------------------------------- |
|
841 |
|
|
842 |
* References: |
|
843 |
- Intel MP specification v1.4, Appendix B - Operating System Programming |
|
844 |
Guideleins. |
|
845 |
- Intel Software Dev. Manual vol3A, Chapter 8, 8.4 - Multiple-Processor (MP) |
|
846 |
initialization. |
|
847 |
- Intel Software Dev. Manual vol3B, Appendix C - MP Initialization for P6 |
|
848 |
Family Processors. |
|
849 |
|
|
850 |
* Generally, there're two base x86 MP initialization sequences. One for the |
|
851 |
processors with the external APIC, and a second for processors with internal |
|
852 |
APIC (P6+ x86 processors). |
|
853 |
|
|
854 |
We won't bother ourself with the processors with external APICs (80486 and |
|
855 |
early Pentiums) since we're only targetting long mode. |
|
856 |
|
|
857 |
All long mode capable processors contains an integrated apic and comply with |
|
858 |
the new Intel MP bootup sequence, where no FIPI or BIPI are needed. |
|
859 |
|
|
860 |
On this new sequence, the INIT IPI on AP cores let them wait for the SIPI to |
|
861 |
get their starting execution vector instead of _unconditionally_ executing the |
|
862 |
BIOS init code. This avoids the ugly dependency on AT+ BIOSes hacks like the |
|
863 |
warm-reset vector to let the AP cores start executing from a custom address |
|
864 |
instead of BIOS init code. Search the MP spec for `Warm-reset vector'. |
|
865 |
|
|
866 |
* After system-wide bootup, or reset, the MP initialization protocol is going to |
|
867 |
be executed. In this protocol, the hardware setup the APIC IDs for each logical |
|
868 |
core according to system topology and assign one of the cores as the bootstrap |
|
869 |
core and the others as APs (Application processors/cores). |
|
870 |
|
|
871 |
The core nominated as the BSC will fetch and begin executing BIOS bootstrap |
|
872 |
code at the reset vector physical address 0xfffffff0. |
|
873 |
|
|
874 |
The remaining cores (setup as APs) will enter a "wait-for-SIPI" state. |
|
875 |
|
|
876 |
* Afterwars, the BIOS init code builds its ACPI and MP tables and count the |
|
877 |
number of AP cores, and update its tables entries for each new AP core |
|
878 |
found. For the algorithm, check Intel's volume 3A 8.4.4 MP Initialization |
|
879 |
Example. |
|
880 |
|
|
881 |
* At the end, the BIOS puts each AP core to halted state with interrupts |
|
882 |
disabled. By the nature of the halted state with interrupts disabled, only |
|
883 |
INIT, #NMI, #SMI are able to start the cores again. |
|
884 |
|
|
885 |
IMPORTANT: The MP protocol will be executed only after a power-up or RESET. If |
|
886 |
the MP protocol has been completed and a BSP has been chosen, subsequent INITs |
|
887 |
(either to a specific processor or system wide) do not cause the MP protocol to |
|
888 |
be repeated. Instead, each processor examines its BSP flag (in the APIC_BASE |
|
889 |
MSR) to determine whether it should execute the BIOS boot-strap code (if it is |
|
890 |
the BSP) or enter a wait-for-SIPI state (if it is an AP). |
|
891 |
|
|
892 |
So any INIT to any of the AP cores by the OS boot code will let it enter a |
|
893 |
'wait-for-SIPI' state to be able to get its start execution vector. |
|
894 |
|
|
895 |
* Note that the cores get started with all its states uninitialized. We have |
|
896 |
to update each core from real to long mode, and update its MMU and other |
|
897 |
system tables to match the bootstrap. |
|
898 |
|
|
899 |
Kernels usually call this part the SMP `trampoline', since it let the cores |
|
900 |
jump from its raw uninitialized state to a state similar to the bootstrap's |
|
901 |
one. |
|
902 |
|
| b3bbe1d by Ahmed S. Darwish at 2010-11-16 |
903 |
|
| 34568a8 by Ahmed S. Darwish at 2009-11-08 |
904 |
* On linkers 'relocation error's: |
|
905 |
--------------------------------- |
|
906 |
|
|
907 |
* References: |
|
908 |
- Relocation types (R_X86_64_*): System V ABI AMD64 supplement, |
|
909 |
section 4.4 - Relocation |
|
910 |
- The Executable and Linkable Format (ELF) spec, Book I - Relocation |
|
911 |
|
|
912 |
* One of the errors that faced me at a number of sensitive places like the |
|
913 |
kernel head, the SMP trampoline, and the linker script was LD linker's |
|
914 |
'relocation error'. Due to their relative opaqueness, they puzzled me to |
|
915 |
a great extent at first. An example of some of those errors from linking |
|
916 |
a preliminary trampoline code to the rest of the kernel image was: |
|
917 |
|
|
918 |
trampoline.o: In function `trampoline': |
|
919 |
(.text+0x1): relocation truncated to fit: R_X86_64_16 against `.text' |
|
920 |
trampoline.o: In function `_start': |
|
921 |
(.text+0xc): relocation truncated to fit: R_X86_64_16 against `.data' |
|
922 |
trampoline.o: In function `_start': |
|
923 |
(.text+0x11): relocation truncated to fit: R_X86_64_16 against `.data' |
|
924 |
trampoline.o: In function `_start': |
|
925 |
(.text+0x14): relocation truncated to fit: R_X86_64_32 against `.bss' |
|
926 |
trampoline.o: In function `_start': |
|
927 |
(.text+0x1d): relocation truncated to fit: R_X86_64_32 against `.bss' |
|
928 |
trampoline.o: In function `_start': |
|
929 |
(.text+0x21): relocation truncated to fit: R_X86_64_32 against `.bss' |
|
930 |
trampoline.o: In function `_start': |
|
931 |
(.text+0x27): relocation truncated to fit: R_X86_64_32 against `.bss' |
|
932 |
trampoline.o: In function `_start': |
|
933 |
(.text+0x2b): relocation truncated to fit: R_X86_64_32 against `.bss' |
|
934 |
|
|
935 |
* We need to understand relocation to parse above errors. Relocation Information |
|
936 |
sections contain information about unresolved references. Since compilers and |
|
937 |
assemblers do not know at what absolute memory address a symbol will be |
|
938 |
allocated, and since they are unaware of definitions of symbols in other files, |
|
939 |
__every__ reference to such a symbol will create a relocation entry. |
|
940 |
|
|
941 |
The relocation entry will point to the code address where the reference is |
|
942 |
being made, and to the symbol table entry that contains the symbol that is |
|
943 |
referenced. The linker will use this information to fill in the correct address |
|
944 |
after it has allocated addresses to all symbols. |
|
945 |
|
|
946 |
* Dumping the ELF object trampoline.o relocation section ($readelf -a) leads to: |
|
947 |
|
|
948 |
Relocation section '.rela.text' at offset 0x438 contains 8 entries: |
|
949 |
Offset Info Type Sym. Value Sym. Name + Addend |
|
950 |
000000000001 00010000000c R_X86_64_16 0000000000000000 .text + 5 |
|
951 |
00000000000c 00020000000c R_X86_64_16 0000000000000000 .data + 0 |
|
952 |
000000000011 00020000000c R_X86_64_16 0000000000000000 .data + 6 |
|
953 |
000000000014 00030000000a R_X86_64_32 0000000000000000 .bss + 0 |
|
954 |
00000000001d 00030000000a R_X86_64_32 0000000000000000 .bss + 0 |
|
955 |
000000000021 00030000000a R_X86_64_32 0000000000000000 .bss + 1007 |
|
956 |
000000000027 00030000000a R_X86_64_32 0000000000000000 .bss + 0 |
|
957 |
00000000002b 00030000000a R_X86_64_32 0000000000000000 .bss + 2007 |
|
958 |
|
|
959 |
The relocation table fields are described in the ELF specification. In summary: |
|
960 |
|
|
961 |
- Offset: location to apply relocation action relative to section start |
|
962 |
- Info: symbol table index wrt which the relocation must be made |
|
963 |
- Type: from the AMD64 ABI. _16: max bit width = 16 (rmode). _32: 32 bits max |
|
964 |
- Addend: constant addend used to compute the value to be stored in the |
|
965 |
relocatable field. Ex: `lgdt $(gdt + 0x10)' will print `.data + 10' where |
|
966 |
the addend = 0x10 |
|
967 |
|
|
968 |
* Note that all relocatable fields are filled with zeroes in the trampoline |
|
969 |
ELF object trampoline.o: |
|
970 |
|
|
971 |
# .code16 (real-mode) |
|
972 |
0000000000000000 <trampoline>: |
|
973 |
0: ea 00 00 00 60 ljmp $0x6000,$0x0 # ljmp $0x6000, $_start |
|
974 |
|
|
975 |
0000000000000005 <_start>: |
|
976 |
5: e8 2a 00 call 32 <print> # RIP-relative (7 + 0x002a = 32) |
|
977 |
8: fa cli |
|
978 |
9: 0f 01 1e 00 00 lidtw 0x0 # lidt $idtdesc |
|
979 |
e: 0f 01 16 00 00 lgdtw 0x0 # lgdt $gdtdesc |
|
980 |
13: b8 00 00 00 00 mov $0x0,%eax # movl %cr3, %eax |
|
981 |
18: 0f 22 d8 mov %eax,%cr3 |
|
982 |
1b: c7 05 00 00 00 movw $0x0,(%di) # movl $(init_pgt+0x1007), init_pgt |
|
983 |
20: 00 00 00 00 00 |
|
984 |
25: c7 05 00 00 00 movw $0x0,(%di) # movl $(init_pgt+0x2007), init_pgt |
|
985 |
2a: 00 00 00 00 00 |
|
986 |
2f: f4 hlt |
|
987 |
30: eb fd jmp 2f <_start+0x2a> |
|
988 |
|
|
989 |
0000000000000032 <print>: |
|
990 |
|
|
991 |
Also note that the offsets in the error messages (.text+0x1), (.text+0x11), .. |
|
992 |
are the same offsets reported in the relocation table entries, and are the |
|
993 |
same ones filled with zero in the above disassembly. |
|
994 |
|
|
995 |
* With above information in hand, we can simply deduce the errors cause. We link |
|
996 |
the trampoline assembly code against our long mode kernel, but the kernel linker |
|
997 |
script sets up the virtual addresses base at 0xffffffff80000000. |
|
998 |
|
|
999 |
So the linker, obeying the kernel script, tries to assign huge (> -2GB) virtual |
|
1000 |
addresses to our symbols. The linker then discovers that since the trampoline |
|
1001 |
to-be-relocated symbols are setup with the R_X86_64_16 model, it's limited to a |
|
1002 |
width of 16 bits, and limited to 32 bits in case of R_X86_64_32. Ofcourse those |
|
1003 |
bits aren't sufficient, so it has to truncate the addresses producing a wrong |
|
1004 |
final executable. |
|
1005 |
|
|
1006 |
In a nutshell, there was a reported error _for each_ relocatable symbol used, |
|
1007 |
reflecting the number of relocation table entries with a type/bit-width that is |
|
1008 |
insufficient to the linker script setup 64-bit virtual addresses. |
| 0aefbcf by Ahmed S. Darwish at 2009-11-21 |
1009 |
|
| b3bbe1d by Ahmed S. Darwish at 2010-11-16 |
1010 |
|
| 0aefbcf by Ahmed S. Darwish at 2009-11-21 |
1011 |
* When %RIP-relativeness goes wrong: |
|
1012 |
------------------------------------ |
|
1013 |
|
|
1014 |
* References: |
|
1015 |
- Intel Instruction Set Reference, A-M, vol. 2A. 3.2-CALL--Call procedure. |
|
1016 |
|
|
1017 |
* One of the interesting errors I faced in jumping away from head or trampoline |
|
1018 |
code to C code was %rip-relativeness related errors. Here are two interesting |
|
1019 |
cases: |
|
1020 |
|
|
1021 |
In head.S, after switching to long mode, and setting up the higher-half |
|
1022 |
virtual mappings, we jump to C code: |
|
1023 |
|
|
1024 |
.code64 |
|
1025 |
startup_64: |
|
1026 |
movq $stack_end, %rsp |
|
1027 |
... |
|
1028 |
# .. Setup identity mappings |
|
1029 |
# .. Setup high-half virtual addresses mappings |
|
1030 |
# .. Apply the page tables |
|
1031 |
# .. let %RIP be its virtual address equivalent |
|
1032 |
# .. let %RSP, and GDTR be their virtual addresses equivalent |
|
1033 |
# .. Let NULL pointers segfault and avoid physical-address dependent |
|
1034 |
# code: remove identity mappings |
|
1035 |
... |
|
1036 |
call kernel_start # C method: void kernel_start(void); |
|
1037 |
|
|
1038 |
But unfortunately we get the machine constantly triple-faulting. As indicated |
|
1039 |
by Bochs debugging output: |
|
1040 |
|
|
1041 |
[CPU1 ] BxError: Encountered an unknown instruction b1=0x0e (signalling #UD) |
|
1042 |
[CPU1 ] modrm was 0x00, nnn was 0, rm was 6 |
|
1043 |
[CPU1 ] exception(0x06): error_code=0000 |
|
1044 |
|
|
1045 |
Disassembling the object file, we see that everything is at place. And `callq' |
|
1046 |
points to 0xffffffff80109250; the `kernel_start' method: |
|
1047 |
|
|
1048 |
# head.o |
|
1049 |
00000000001000b9 <startup_64>: |
|
1050 |
1000b9: 48 c7 c4 00 50 10 00 mov $0x105000,%rsp |
|
1051 |
... |
|
1052 |
10012f: e8 1c 91 00 80 callq ffffffff80109250 <kernel_start> |
|
1053 |
|
| 15681eb by Ahmed S. Darwish at 2010-02-18 |
1054 |
Or does it? |
| 0aefbcf by Ahmed S. Darwish at 2009-11-21 |
1055 |
|
|
1056 |
A closer inspection at the binary disassembly reveals that the call is a |
|
1057 |
near, relative, displacement relative to next instruction, 32-bit displacement |
|
1058 |
sign-extended to 64-bits in long mode call. |
|
1059 |
|
|
1060 |
So the instruction is at 0x10012f, with a length of 5 bytes, relative |
|
1061 |
displacemented is calculated from 0x10012f + 5 = 0x100134. |
|
1062 |
|
|
1063 |
Code wants to call `kernel_start', so the relative displacemented = |
|
1064 |
0xffffffff80109250 - 0x100134 = 0xffffffff8000911c, which is the value encoded |
|
1065 |
in the `callq' instruction after sign-extension in a little-endian form. So |
|
1066 |
what's wrong? |
|
1067 |
|
|
1068 |
Notice that by the time the CPU is executing the `callq', it's actually using a |
|
1069 |
__different__ %RIP than 0x10012f. The %RIP is in its (%RIP + 0xffffffff80000000) |
|
1070 |
virtual form as the identity mappings are to be removed. |
|
1071 |
|
|
1072 |
This clearly messes up the %rip-relative-displacement calculations, and sends |
|
1073 |
the CPUs to unpredictable areas, leading to unknown opcode exceptions and |
|
1074 |
other horrors. |
|
1075 |
|
|
1076 |
* Another similar case was faced in the trampoline code, when it jumps away to |
|
1077 |
the rest of AP cores C init code. The `callq' instruction was just sending |
|
1078 |
the CPU to irregular places, with horrifying results. |
|
1079 |
|
|
1080 |
Why the callq misbehaved in trampoline? Because the trampoline code is linked |
|
1081 |
to the kernel, but actually copied to a below-1MB 4KB-page, and then executed |
|
1082 |
from there. This messed up external %rip-relativeness. |
|
1083 |
|
|
1084 |
Calls to internal trampoline methods worked because those methods retained their |
|
1085 |
relative distances between each other when they were copied. On the other hand, |
|
1086 |
and by definition, external distances was not kept when calling outside code |
|
1087 |
(like the C application_start() code): |
|
1088 |
|
|
1089 |
# trampoline.o |
|
1090 |
ffffffff8010ae06 <startup_64>: |
|
1091 |
ffffffff8010ae06: 48 8d a4 24 00 00 00 lea -0x80000000(%rsp),%rsp |
|
1092 |
ffffffff8010ae0d: 80 |
|
1093 |
... |
|
1094 |
ffffffff8010ae2e: e8 0d e3 ff ff callq ffffffff80109140 <application_start> |
|
1095 |
ffffffff8010ae33: 90 nop |
|
1096 |
|
|
1097 |
* Solution: The two examples are special cases of PIC, position-independent code. |
|
1098 |
The trampoline is PIC cause it's executed from different position than where |
|
1099 |
it's linked. The head.S code also applies as parts of it get executed using a |
|
1100 |
different %rip than what it's linked as. |
|
1101 |
|
|
1102 |
In the two above cases, we can simply solve them by specifying the absolute |
|
1103 |
address of destination code instead of using relative x86 calls. As said in |
|
1104 |
Intel instruction reference, x86-64 has a near call, with absolute indirect |
|
1105 |
addressing facility. It can be used in GNU AS as follows: |
|
1106 |
|
|
1107 |
movq $application_start, %rax |
|
1108 |
jmpq *%rax |
|
1109 |
|
|
1110 |
And also for head.S: |
|
1111 |
|
|
1112 |
movq $kernel_start, %rax |
|
1113 |
callq *%rax |
|
1114 |
|
|
1115 |
# head.o |
|
1116 |
00000000001000b9 <startup_64>: |
|
1117 |
1000b9: 48 c7 c4 00 50 10 00 mov $0x105000,%rsp |
|
1118 |
... |
|
1119 |
10012f: 48 c7 c0 50 92 10 80 mov $0xffffffff80109250,%rax |
|
1120 |
100136: ff d0 callq *%rax |
|
1121 |
|
|
1122 |
which solves our problem nicely. |
| a26984e by Ahmed S. Darwish at 2009-11-27 |
1123 |
|
| 96aef36 by Ahmed S. Darwish at 2010-07-27 |
1124 |
* Summary: For a code object which is run from a _different_ address than what |
|
1125 |
it was originally linked to, a 'jmp' or 'call' op to a label (or an address) |
|
1126 |
outside of that unlucky object needs absolute addresses instead of the |
|
1127 |
_default_ %rip-relative ones. In this case, the 'next opcode address', being |
|
1128 |
the base of %RIP-relativeness calculations, is not the same at both link-time |
|
1129 |
and run-time; at run-time, the linker-calculated relative offset of the |
|
1130 |
destination becomes completely invalid due to the _now-different_ %RIP base, |
|
1131 |
jumping intentionally to the unknown. |
|
1132 |
|
|
1133 |
Meanwhile, relative jumps and calls to labels within the same object are still |
|
1134 |
valid: the relative distance is still maintained within the object itself, no |
|
1135 |
matter where it's running from. This kind of internal relativeness is usually |
|
1136 |
calculated by the assembler (instead of the linker) and is safe here. |
|
1137 |
|
| 31351ce by Ahmed S. Darwish at 2010-02-18 |
1138 |
|
|
1139 |
Testing libc()-like kernel code: |
|
1140 |
-------------------------------- |
|
1141 |
|
|
1142 |
* An interesting way to test libc()-like kernel code (e.g. std string methods) |
|
1143 |
is to replace system's libc implementation with our own. Afterwards, we run a |
|
1144 |
huge program (Firefox, OpenOffice or gcc) and observe the results. |
|
1145 |
|
|
1146 |
Those tests usually cover huge number of cases, and give a good indicator of |
|
1147 |
tested kern code stability. I've usually found that even the smallest mistake |
|
1148 |
in such tests produce either a non-behaving app or a direct segfault. |
|
1149 |
|
|
1150 |
* Unix linkers LD_PRELOAD feature can help us immensely. Assuming our string |
|
1151 |
methods resides at 'string_lib.c', we do something similar to: |
|
1152 |
|
|
1153 |
# Position-independent object, suitable for dynamic linking |
| 904ce3a by Ahmed S. Darwish at 2010-11-18 |
1154 |
$ gcc --std=gnu99 -O3 -fPIC -c string_lib.c -o string_lib.o |
| 31351ce by Ahmed S. Darwish at 2010-02-18 |
1155 |
|
|
1156 |
# Produce shared object (instead of final executable) that can |
|
1157 |
# be linked with other objects to form an executable. |
|
1158 |
$ gcc -shared -o string_lib.so string_lib.o |
|
1159 |
|
|
1160 |
$ LD_PRELOAD='./string_lib.so' firefox-bin |
|
1161 |
|
| b3bbe1d by Ahmed S. Darwish at 2010-11-16 |
1162 |
Note: always run these tests with the binaries themselves instead of |
|
1163 |
their bash-script wrappers. Such shell wrappers don't play very well |
|
1164 |
with LD_PRELOAD at times. |
|
1165 |
|
|
1166 |
* It is interesting to note that a 30-second run of Firefox produced |
|
1167 |
655,227 memset() and memcpy() calls! A 1-minute run of Google Chrome |
|
1168 |
produced 1,889,087 calls! (counted by `wc -l'). |
|
1169 |
|
| 31351ce by Ahmed S. Darwish at 2010-02-18 |
1170 |
|
| a26984e by Ahmed S. Darwish at 2009-11-27 |
1171 |
On SMP Memory ordering models: |
|
1172 |
------------------------------ |
|
1173 |
|
|
1174 |
* References: |
|
1175 |
- Kourosh Gharachorloo, "Memory Consistency Models for Shared-Memory |
|
1176 |
Multiprocessors", Digital Equipment (DEC) with Stanford and DARPA support |
|
1177 |
- Paul McKenny, Memory ordering in modern processors, 2007.09.19 Draft |
|
1178 |
- Paul McKenny, Memory Barriers: a Hardware View for Software Hackers, IBM |
|
1179 |
- Linux Kernel Documentation/memory-barriers.txt, kernel.org |
|
1180 |
- DEC/HP OpenVMS Ask the Wizard, http://www.webcitation.org/5laUMQfyi |
|
1181 |
- Intel Manuals vol3A, chapter8, MP Management, 8.2 - Memory Ordering |
|
1182 |
- AMD64 Architecture Programmer's Manual, chapter 7 - Memory System |
|
1183 |
|
|
1184 |
* To understand memory ordering models correctly, we need to understand which |
|
1185 |
parts of our code are loads, which are stores, and which are in fact both. |
|
1186 |
Below example should cover basic cases of shared global variables usage, |
|
1187 |
compare the C code with its assembly output: |
|
1188 |
|
|
1189 |
int *global_symbol = (int *)0xdeadbeef; |
|
1190 |
|
|
1191 |
int main(void) { |
|
1192 |
/* 1) memory load + dependent memory store to stack */ |
|
1193 |
int t = global_symbol; |
|
1194 |
|
|
1195 |
/* 2) memory store */ |
|
1196 |
global_symbol = (int *)0xbbbb; |
|
1197 |
|
|
1198 |
/* A memory load + dependent memory store */ |
|
1199 |
*global_symbol = 0xcccc; |
|
1200 |
} |
|
1201 |
|
|
1202 |
Disassembly of section .data: |
|
1203 |
... |
|
1204 |
0000000000600840 <global_symbol>: |
|
1205 |
600840: ef |
|
1206 |
600841: be ad de 00 00 |
|
1207 |
|
|
1208 |
Disassembly of section .text: |
|
1209 |
... |
|
1210 |
0000000000400448 <main>: |
|
1211 |
400448: 55 push %rbp |
|
1212 |
400449: 48 89 e5 mov %rsp,%rbp |
|
1213 |
# 1) Mem. load + store to stack |
|
1214 |
40044c: 48 8b 05 dd 03 20 00 mov 0x2003dd(%rip),%rax # 600830 <global_symbol> |
|
1215 |
400453: 89 45 fc mov %eax,-0x4(%rbp) |
|
1216 |
# 2) Mem. store |
|
1217 |
400456: 48 c7 05 cf 03 20 00 movq $0xbbbb,0x2003cf(%rip) # 600830 <global_symbol> |
|
1218 |
40045d: bb bb 00 00 |
|
1219 |
# 3) Mem. load + store to pointed value |
|
1220 |
400461: 48 8b 05 c8 03 20 00 mov 0x2003c8(%rip),%rax # 600830 <global_symbol> |
|
1221 |
400468: c7 00 cc cc 00 00 movl $0xcccc,(%rax) |
|
1222 |
40046e: c9 leaveq |
|
1223 |
40046f: c3 retq |
|
1224 |
|
|
1225 |
* Terminology of memory orders: |
|
1226 |
- Program order: the memory operations order specified in object code. |
|
1227 |
- Perceived order: the order a CPU perceives of its and other CPUs memory ops. |
|
1228 |
|
| 10e19e7 by Ahmed S. Darwish at 2009-12-04 |
1229 |
* Memory ordering is a tradeoff between programming ease and performance. On |
|
1230 |
uniprocessors, many ops are assumed to execute in program order. Processors can |
|
1231 |
provide the illusion of sequensiality by only preserving sequential order among |
|
1232 |
memory operations to the same location. |
|
1233 |
|
|
1234 |
Sequential consistency defined by Lamport provides a simple model to programmers, |
|
1235 |
but it adversely affects efficiency and performance: the illusion trick provided |
|
1236 |
in uniprocessors can't be used on MP systems to provide sequential consistency. |
|
1237 |
|
|
1238 |
Weak ordering distinguishes between ordinary memory operations and memory ops |
|
1239 |
used for synchronization. By ensuring consistency only at the synchronization |
|
1240 |
points, weak ordering allows ordinary operations in between pairs of |
|
1241 |
synchronizations to be reordered with respect to one another. |
|
1242 |
|
|
1243 |
Such specifications can be considered system-centric since they in effect expose |
|
1244 |
the low-level system optimizations to the programmer. |
|
1245 |
|
| a26984e by Ahmed S. Darwish at 2009-11-27 |
1246 |
* General rules for all architectures: |
|
1247 |
- A given CPU will always _perceive_ its own memory operations as occuring |
|
1248 |
in program order, without a need for memory barriers. |
|
1249 |
- Naturally-aligned loads and stores are always atomic |
|
1250 |
- Code that uses standard synchronization primitives (spinlocks, RCU, ..) |
|
1251 |
should not need explicit mem barriers. Only tricky code that bypasses those |
|
1252 |
primitives need those barriers. An example of a linked list lockless search |
|
1253 |
is given below. |
|
1254 |
|
|
1255 |
* Example: |
|
1256 |
|
|
1257 |
Writer CPU: |
|
1258 |
----------- |
|
1259 |
# Linked list head is `smack_known' |
|
1260 |
struct smack_known *smack_known; |
|
1261 |
mutex_lock(&smack_known_lock); |
|
1262 |
if (skp == NULL) { |
|
1263 |
skp = kzalloc(sizeof(struct smack_known), GFP_KERNEL); |
|
1264 |
if (skp != NULL) { |
| 10e19e7 by Ahmed S. Darwish at 2009-12-04 |
1265 |
1- skp->smk_next = smack_known; |
|
1266 |
2- skp->smk_secid = smack_next_secid++; |
|
1267 |
3- skp->smk_cipso = NULL; |
| a26984e by Ahmed S. Darwish at 2009-11-27 |
1268 |
/* 1- BUG: Write barrier needed! */ |
| 10e19e7 by Ahmed S. Darwish at 2009-12-04 |
1269 |
4- smack_known = skp; |
| a26984e by Ahmed S. Darwish at 2009-11-27 |
1270 |
} |
|
1271 |
} |
|
1272 |
mutex_unlock(&smack_known_lock); |
|
1273 |
|
|
1274 |
Reader CPU: |
|
1275 |
----------- |
|
1276 |
# Just locklessly access the list |
|
1277 |
struct smack_label *skp; |
|
1278 |
skp = smack_known; |
|
1279 |
while (skp) { |
|
1280 |
/* 2- BUG: Alpha CPUs!! */ |
| 10e19e7 by Ahmed S. Darwish at 2009-12-04 |
1281 |
5- if (skp->smk_secid == 10) { |
| a26984e by Ahmed S. Darwish at 2009-11-27 |
1282 |
... |
|
1283 |
} |
|
1284 |
... |
|
1285 |
skp = skp->next; |
|
1286 |
} |
|
1287 |
|
|
1288 |
Bug 1: |
|
1289 |
------ |
|
1290 |
In the writer CPU, the latest 3 stores can be re-ordered in any combination. |
|
1291 |
In arhitectures that reorder stores like Alpha, Itanum, POWER, and SPARC, this |
|
1292 |
may lead other CPUs to perceive an incomplete node added to the list: a garbage |
|
1293 |
smk_secid value. This may even lead to graver results if you notice that the |
|
1294 |
first line (skp->smk_next = smack_known) is actually a load followed by a |
|
1295 |
store. So the sequence is: |
|
1296 |
|
|
1297 |
Load from smack known |
|
1298 |
store to ->smk_next address |
|
1299 |
... |
|
1300 |
store to smack_known |
|
1301 |
|
|
1302 |
The cpu can reorder the last stores, leading to a linked list head with a |
|
1303 |
bogus next pointer; effectively a user-space segmentation fault or worse, |
|
1304 |
a CPU exception if in kernel mode. |
|
1305 |
|
|
1306 |
Bug 2: |
|
1307 |
------ |
|
1308 |
As said by OpenVMS documenatation (Ask the Wizard): |
|
1309 |
|
|
1310 |
"Your producer must issue a 'memory barrier' instruction after writing the data |
|
1311 |
to shared memory and before inserting it on the queue; likewise, your consumer |
|
1312 |
_must_ issue a memory barrier instruction after removing an item from the queue |
|
1313 |
and before reading from its memory. Otherwise, you risk seeing stale data, |
|
1314 |
since, while the Alpha processor does provide coherent memory, it does not |
|
1315 |
provide implicit ordering of reads and writes. (That is, the write of the |
|
1316 |
producer's data might reach memory after the write of the queue, such that the |
|
1317 |
consumer might read the new item from the queue but get the previous values |
|
1318 |
from the item's memory." |
|
1319 |
|
|
1320 |
But how does the consumer sees the pointer before its data even if write |
|
1321 |
barriers were used? |
|
1322 |
|
|
1323 |
As said by McKenny, the producer's write barrier will guarantee that the cache |
|
1324 |
invalidates performed by the first 3 lines will reach the interconnect before |
|
1325 |
that of the last line, but it "makes absolutely no guarantee about the order in |
|
1326 |
which the new values will reach the reading CPU’s core." |
|
1327 |
|
|
1328 |
So assume smack_known, and smack_known->smk_secid are on different cache banks. |
|
1329 |
Because of the write barrier, the producer's data will reach the interconnect |
|
1330 |
before updating the queue. The bank holding smack_known is idle, so its line |
|
1331 |
gets invalidated and updated, while the bank holding smk_secid was busy, so the |
|
1332 |
cache invalidates for the new element are being delayed. This leads to seeing |
|
1333 |
new value for the pointer, but old cached values for the new element at line 5. |
|
1334 |
|
|
1335 |
Linux kernel documentation explains by stating "some versions of the Alpha CPU |
|
1336 |
have a split data cache, permitting them to have two semantically-related cache |
|
1337 |
lines updated at separate times. This is where the data dependency barrier |
|
1338 |
really becomes necessary as this synchronises both caches with the memory |
|
1339 |
coherence system, thus making it seem like pointer changes vs new data occur in |
|
1340 |
the right order. |
|
1341 |
|
|
1342 |
* On the x86 side, only below basic memory access is guaranteed to be atomic: |
|
1343 |
- Readnig or writing a byte |
|
1344 |
- Reading or writing a word aligned on 16-bit boundary |
|
1345 |
- Reading or writing a doubleword aligned on 32-bit boundary |
|
1346 |
- Reading or writing a quadword aligned on 64-bit boundary |
|
1347 |
|
|
1348 |
On P6 family processors and up: |
|
1349 |
- Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within |
|
1350 |
a cache line. (but how do we know whether a memory access fits a cache line?) |
| 10e19e7 by Ahmed S. Darwish at 2009-12-04 |
1351 |
|
| b3bbe1d by Ahmed S. Darwish at 2010-11-16 |
1352 |
|
| 10e19e7 by Ahmed S. Darwish at 2009-12-04 |
1353 |
On the C 'volatile' type qualifier: |
|
1354 |
----------------------------------- |
|
1355 |
|
|
1356 |
* References: |
|
1357 |
- Eric Eid & John Regehr, "Volatiles Are Miscompiled, and What to Do about It", |
|
1358 |
Proceedings of the Eighth ACM and IEEE International Conference. |
|
1359 |
- Attached Linux kernel's "volatile-considered-harmful.txt" document, added |
|
1360 |
with various core developers thoughts copied from LKML. |
|
1361 |
- ISO C99 Committee Draft - September 7, 2007 |
| fea07b7 by Ahmed S. Darwish at 2010-01-21 |
1362 |
- My OSDEV forums post: http://forum.osdev.org/viewtopic.php?t=21210&start=17 |
| 10e19e7 by Ahmed S. Darwish at 2009-12-04 |
1363 |
|
|
1364 |
* As said by the C standard, "An object that has volatile-qualified type may be |
|
1365 |
modified in ways unknown to the implementation or have other unknown side |
|
1366 |
effects. Therefore any expression referring to such an object shall be evaluated |
|
1367 |
strictly according to the rules of the abstract machine" |
|
1368 |
|
|
1369 |
"Actions on objects so declared shall not be optimized out by an implementation, |
|
1370 |
or reordered except as permitted by the rules for evaluating expressions". |
|
1371 |
|
|
1372 |
It's in a practical manner, "for every read or write to a volatile variable |
|
1373 |
that would be performed by a straightforward interpreter for C, exactly one |
|
1374 |
load from or store to the memory locations allocated to the variable must be |
|
1375 |
performed". |
|
1376 |
|
|
1377 |
* Volatile is used when we don't want the memory access to be re-ordered, cached, |
|
1378 |
or optimized away, as in accessing memory mapped I/O registers. Another case is |
|
1379 |
where global variables get changed behind our back by other execution paths as |
|
1380 |
in interrupt handlers or other threads: |
|
1381 |
|
|
1382 |
int var_set_by_irq_handler = 0; |
|
1383 |
.. |
|
1384 |
while (var_set_by_irq_handler == 0); |
|
1385 |
|
|
1386 |
The compiler has every right to optimize the loop to an infinite loop. To avoid |
|
1387 |
this, and force the compiler to access the global variable memory at each |
|
1388 |
iteration, we use volatile, since volatile objects "shall not be optimized out |
|
1389 |
by an implementation". |
|
1390 |
|
|
1391 |
* Take care that the volatile is a type qualifier as `constant'. So in below case: |
|
1392 |
|
|
1393 |
static volatile uint64_t *msg = NULL; /* (1) */ |
|
1394 |
while (msg == NULL) |
|
1395 |
asm volatile ("pause"); |
|
1396 |
|
|
1397 |
The above loop will still be optimized to an infinite loop. The declaration |
|
1398 |
states that the memory area _pointed_ by `msg' is volatile, not that the pointer |
|
1399 |
itself is volatile. So the above declaration should've been: |
|
1400 |
|
|
1401 |
static uint64_t * volatile msg = NULL; /* (2) */ |
|
1402 |
|
|
1403 |
specifying a volatile pointer. Declaration at (1) is perfectly fine though if |
|
1404 |
we're accessing the variable as in: |
|
1405 |
|
|
1406 |
while (*msg == 0x1234) |
|
1407 |
asm volatile ("pause"); |
|
1408 |
|
|
1409 |
cause now we're referencing the memory declared as volatile. The compiler must |
|
1410 |
fetch the memory pointed by 'msg' at each iteration. |
|
1411 |
|
|
1412 |
* As seen from above example and the ACM paper, volatiles are really subtle in C |
|
1413 |
and many industrial compilers get them wrong. I prefer the Linux way of handling |
|
1414 |
the issue, namely __minimizing__ 'volatile' as far as possible, and depending on |
|
1415 |
explicit compiler and cpu _barriers_, and SMP primitives as needed. |
|
1416 |
|
| b4776d7 by Ahmed S. Darwish at 2010-04-27 |
1417 |
* So instead of: |
| 10e19e7 by Ahmed S. Darwish at 2009-12-04 |
1418 |
|
|
1419 |
static uint64_t * volatile msg = NULL; |
|
1420 |
while (msg == NULL) |
|
1421 |
asm volatile ("pause"); |
|
1422 |
|
|
1423 |
We use an explicit compiler barrier: |
|
1424 |
|
|
1425 |
static uint64_t *msg = NULL; |
|
1426 |
while (msg == NULL) |
|
1427 |
asm volatile ("pause": : :"memory"); |
|
1428 |
|
| b4776d7 by Ahmed S. Darwish at 2010-04-27 |
1429 |
The compiler barrier will assure "GCC not to keep memory values cached in |
|
1430 |
registers across the assembler instruction and not to optimize stores or loads |
|
1431 |
to that memory". In other words, it will force gcc to reload any values from |
|
1432 |
memory over the busy loop. |
|
1433 |
|
|
1434 |
* Instead of: |
| 10e19e7 by Ahmed S. Darwish at 2009-12-04 |
1435 |
|
|
1436 |
volatile uint32_t *ioregsel = (uint32_t *)ioapic_base(apic); |
|
1437 |
volatile uint32_t *iowin = (uint32_t *)(ioapic_base(apic) + 0x10); |
|
1438 |
*ioregsel = reg; |
|
1439 |
return *iowin; |
|
1440 |
|
| b4776d7 by Ahmed S. Darwish at 2010-04-27 |
1441 |
where we don't want to the compiler to re-order the last two 'independent' |
|
1442 |
read and write cause this re-ordering will imply different semantics to the |
|
1443 |
programmed hardware. |
|
1444 |
|
| 10e19e7 by Ahmed S. Darwish at 2009-12-04 |
1445 |
We use MMIO register accessors, which internally includes compiler barriers |
|
1446 |
and any needed volatile qualifiers (See comments at top of include/mmio.h |
|
1447 |
for further rational): |
|
1448 |
|
|
1449 |
uint32_t *ioregsel = (uint32_t *)ioapic_base(apic); |
|
1450 |
uint32_t *iowin = (uint32_t *)(ioapic_base(apic) + 0x10); |
|
1451 |
writel(reg, ioregsel); |
|
1452 |
return readl(iowin); |
|
1453 |
|
|
1454 |
* Finally, check attached LKML messages for extra cases and reasonings by Linus |
|
1455 |
and Andi Kleen. |
| 3f896f4 by Ahmed S. Darwish at 2009-12-12 |
1456 |
|
| b3bbe1d by Ahmed S. Darwish at 2010-11-16 |
1457 |
|
| 3f896f4 by Ahmed S. Darwish at 2009-12-12 |
1458 |
On the vaguely documented x86 16-bit protected mode: |
|
1459 |
---------------------------------------------------- |
|
1460 |
|
|
1461 |
* References: |
|
1462 |
- Rick A. Hohensee, "The well-factored 386", LKML, |
|
1463 |
http://lkml.org/lkml/2003/7/28/55, http://www.webcitation.org/5luRetOqF |
|
1464 |
- Intel manuals vol3A, chapter 3, 3.4.3 - segment registers |
|
1465 |
- Intel Manuals vol3A, chapter9, Processor Management and Initialization, |
|
1466 |
9.9.2 - Switching Back to Real-Address Mode |
|
1467 |
- Robert R. Collins, "The Segment Descriptor Cache", Dr. Dobb's Journal |
|
1468 |
Undocumented Corner, http://www.rcollins.org/ddj/Aug98/Aug98.html, |
|
1469 |
http://www.webcitation.org/5lxH9NFGa |
|
1470 |
|
|
1471 |
* The x86 architecture determins the default operand and address size using |
|
1472 |
the D bit in the code segment. If D is set, it indicates 32-bit addresses |
|
1473 |
and operands. Otherwise, we're running in 16-bit mode, using 16-bit |
|
1474 |
addresses and operands. There's a similar C bit for 64-bit long mode. |
|
1475 |
|
|
1476 |
* One of the states that really puzzled me a lot was the state occurring in |
|
1477 |
the transition from bootup real-mode to 32-bit protected mode. Namely, |
|
1478 |
bewteen enabling protected mode (setting PE = 1) and getting executed in |
|
1479 |
32-bit by using a 32-bit (D = 1) code segment through a far jump. Setting |
|
1480 |
PE and %cs together is not atomic, and there's clearly something in-between. |
|
1481 |
|
|
1482 |
An example from our bootsector: |
|
1483 |
|
|
1484 |
/* Set the protected mode PE flag */ |
|
1485 |
movl %cr0, %eax |
|
1486 |
orl $0x1, %eax |
|
1487 |
movl %eax, %cr0 |
|
1488 |
|
|
1489 |
jmp flush_prefetch: |
|
1490 |
|
|
1491 |
/* What're the rules covering this state? we're in |
|
1492 |
* pmode but the cpu is still fetching and executing |
|
1493 |
* ops in 16-bits! */ |
|
1494 |
flush_prefetch: |
|
1495 |
movw $(gdt_ds - gdt), %ax |
|
1496 |
movw %ax, %ds |
|
1497 |
movw %ax, %es |
|
1498 |
movw %ax, %ss |
|
1499 |
DATA32 ljmp $0x08, $(reloc - _start + 0x7c00) |
|
1500 |
|
|
1501 |
* To exhaustively think about this, we need to cover all the permuatations |
|
1502 |
of PE (pmode enabled) and the %cs D bit: |
|
1503 |
|
|
1504 |
(1) PE=0 Dbit=0 Canonical (un)Real Mode |
|
1505 |
(2) PE=1 Dbit=0 16-bit protected mode |
|
1506 |
(3) PE=1 Dbit=1 Canonical 32-bit Protected Mode |
|
1507 |
(4) PE=0 Dbit=1 .. |
|
1508 |
_______________ |
|
1509 |
| WAKE-UP | |
|
1510 |
|canonical rmode| |
|
1511 |
| PE=0 use16 | |
|
1512 |
|_______________| |
|
1513 |
A |set PE to 1 |
|
1514 |
| | |
|
1515 |
set PE to 0| | |
|
1516 |
___|________V__ |
|
1517 |
| 16-bit | |
|
1518 |
| pmode | |
|
1519 |
| PE=1 use16 | |
|
1520 |
|_______________| |
|
1521 |
A |far xfer |
|
1522 |
| | to use32 CS |
|
1523 |
far xfer to | | |
|
1524 |
use16 CS| | |
|
1525 |
___|_______V___ |
|
1526 |
| 32-bit | |
|
1527 |
| pmode | |
|
1528 |
| PE=1 use32 | |
|
1529 |
|_______________| |
|
1530 |
A |set PE to 0 |
|
1531 |
| | |
|
1532 |
set PE to 1| | |
|
1533 |
___|________V__ |
|
1534 |
| Forreal Mode | |
|
1535 |
| Mode | |
|
1536 |
| PE=0 use32 | |
|
1537 |
|_______________| |
|
1538 |
|
|
1539 |
* We can not jump from (1) to (3) directly through a far jump using a D=1 code |
|
1540 |
segment, since we'll be in canonical rmode, and %cs will always be interpreted |
|
1541 |
as in the segmented 8086 model of (%cs << 4 + offset). |
|
1542 |
|
|
1543 |
But how can the cpu move to 32-bit execution, if 32-bit execution is controlled |
|
1544 |
by the D bit, but we can't change the D bit due to the 8086 semanticss? |
|
1545 |
|
|
1546 |
The answer is to provide a switch. If this switch is on, the cpu will the use |
|
1547 |
the 386 semantics of segments (and be able to modify the segment caches). If |
|
1548 |
it's clear, the cpu will still use the same segmentation circuites but it won't |
|
1549 |
be able to modify most of the segment caches fields like limit, and the D bit, |
|
1550 |
(the base field can still be changed within 2-bytes size), due to the |
|
1551 |
constraints of the 8086 semantics. |
|
1552 |
|
|
1553 |
This 'switch' is the PE (pmode Enabled) bit, which acts __independently__ of |
|
1554 |
whether the machine is running on 16 or 32-bit mode (D bit is set or clear) |
|
1555 |
|
|
1556 |
So under this 16-bit pmode (PE set, D clear), if we do: |
|
1557 |
|
|
1558 |
movw $0x10, %ax |
|
1559 |
movw %ax, %ds |
|
1560 |
|
|
1561 |
we're in fact using %ds as a segment selector, and are also updating the |
|
1562 |
segment caches with the values found in the third GDT cell. |
|
1563 |
|
|
1564 |
But if we're in canonical 16-bit rmode (PE & D clear), if we does the same: |
|
1565 |
|
|
1566 |
movw $0x10, %ax |
|
1567 |
movw %ax, %ds |
|
1568 |
|
|
1569 |
we're in fact using (%ds << 4) as the base field in the segment caches. |
|
1570 |
|
|
1571 |
* We cannot also jump directly from (3) 32-bit pmode, to cacnonical/bootup rmode |
|
1572 |
at (1). One should change the code and data segmenst limit to 64K and the D bit |
|
1573 |
to 0 __before__ clearing the PE switch: we can not change any of those segments |
|
1574 |
cached fields (except the base) once the PE get cleared. |
|
1575 |
|
|
1576 |
Intel's "switching back to Real-Address mode" section also states the same |
|
1577 |
order, but without making the reasoning of this order very clear. |
|
1578 |
|
|
1579 |
Intel also elaborates that "the segment registers must be loaded with non-null |
|
1580 |
segment selectors or the segment registers will be unusable in real-address |
|
1581 |
mode." Simply because the segment will be marked in the segment caches as |
|
1582 |
invalid (P = 0), and as we learned from above, we won't be able to change any |
|
1583 |
of those cached bits after clearing PE. Due to this cached P=0 bit, even the |
|
1584 |
usual act in canonical rmode of changing the base within the 2-bytes limit |
|
1585 |
won't be possible, rendering the segment as unusable, as this is their only |
|
1586 |
effective action in 8086 semantics. |
|
1587 |
|
|
1588 |
* Finally, the difference between 8086 and 386 segmentation is superficial. If |
|
1589 |
the segment was marked as P=1 in the segment caches (valid), 8086 semantics |
|
1590 |
just change the base field (limited to 2-bytes size) of the equivalent 386 |
|
1591 |
segment. They all function using the same registers and simple circuity. Intel |
|
1592 |
elaborates: "Note that if the segment registers are not reloaded, execution |
|
1593 |
continues using the descriptor attributes loaded during protected mode." |
| 34429b5 by Ahmed S. Darwish at 2010-01-13 |
1594 |
|
|
1595 |
|
|
1596 |
Using light debugging techniques: |
|
1597 |
--------------------------------- |
|
1598 |
|
|
1599 |
* References: |
|
1600 |
- Ulrich Drepper, "How to Write Shared Libraries", 2.1 Data Definitions, |
|
1601 |
http://www.webcitation.org/5mkmxB9Fx |
|
1602 |
- Ian Lance Taylor, "Linkers part 5", http://www.airs.com/blog/archives/42, |
|
1603 |
http://www.webcitation.org/5mkmiZ8FN |
|
1604 |
- Binutils bug report, "ELF linker loads member of archive for common symbol", |
|
1605 |
http://sources.redhat.com/bugzilla/show_bug.cgi?id=1811 |
|
1606 |
http://www.webcitation.org/5mknnyUe2 |
|
1607 |
- Minor bug-report at http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42726 |
|
1608 |
|
|
1609 |
* Today while writing page allocator test code, I faced a very strange bug: once |
|
1610 |
any of the kernel's text, data, or bss sections exceeds something close to 20 |
|
1611 |
KBs, the MPtables code panic()s at different and unpredictable locations. The |
|
1612 |
failure is something like: |
|
1613 |
|
|
1614 |
PANIC: MPTables - 1217939456 CPUs found, only 32 supported |
|
1615 |
|
|
1616 |
* To give some context, there's a system global called 'nr_cpus' representing |
|
1617 |
the number of CPUs in the system. So far, we know the number of CPUs from the |
|
1618 |
MP tables Conf entries, where each CPU entry represents a CPU in the system. |
|
1619 |
Thus, finding 8 CPU entries means that the machine has 8 usable cores. |
|
1620 |
|
|
1621 |
This global var 'nr_cpus' is only increased at one and _only_ one place in |
|
1622 |
the system, at the MP tables parsing code: |
|
1623 |
|
|
1624 |
/* We call this for each MP CPU entry in the system */ |
|
1625 |
static void parse_cpu(void *addr) |
|
1626 |
{ |
|
1627 |
struct mpc_cpu *cpu = addr; |
|
1628 |
|
|
1629 |
if (!cpu->enabled) |
|
1630 |
return; |
|
1631 |
|
|
1632 |
if (nr_cpus >= CPUS_MAX) { |
|
1633 |
panic("MP - %d CPUs found, only %d supported", |
|
1634 |
nr_cpus, CPUS_MAX); |
|
1635 |
} |
|
1636 |
... |
|
1637 |
|
|
1638 |
++nr_cpus; |
|
1639 |
} |
|
1640 |
|
|
1641 |
* Inspecting the BIOS MP tables more closely, the MP pointer table is at 0xfbb60 |
|
1642 |
and its checksum and signature are valid. The MP configuration table header was |
|
1643 |
also valid and checksummed at 0xfba00. |
|
1644 |
|
|
1645 |
The problem, it seems, is that the conf table entries got somehow corrupted |
|
1646 |
(thus the wrong number of cpus value), or there's a bug in our parser code. |
|
1647 |
|
|
1648 |
* Unfortunately, and as usual for such problems, inserting printk()s at any |
|
1649 |
location skews the results randomly, making any printk() useless. |
|
1650 |
|
|
1651 |
So we try less invasive debugging techniques like writing to a memory area |
|
1652 |
using a _single_ inlined asm instruction, and inspecting this memory area |
|
1653 |
using an emulator afterwars. This worked and made the problem repeatable! |
|
1654 |
|
|
1655 |
* First, and to test the parser validity, we choose a memory location that we |
|
1656 |
know before hand it's going to be zero; I chose physical 0x500. In parse_cpu |
|
1657 |
we add a very small asm instruction as follows: |
|
1658 |
|
|
1659 |
/* 8-byte instruction */ |
|
1660 |
asm volatile("incq (0xffffffff80000500);"); |
|
1661 |
|
|
1662 |
if (nr_cpus >= CPUS_MAX) { |
|
1663 |
panic("MP - %d CPUs found, only %d supported", |
|
1664 |
nr_cpus, CPUS_MAX); |
|
1665 |
} |
|
1666 |
|
|
1667 |
And thankfully this instruction was light enough, so the system paniced |
|
1668 |
at the same point. Our panic() just calls printk() and hlt the system. So |
|
1669 |
after the system halts, we move to the qemu console handler and check the |
|
1670 |
value at address 0x500. Good for us, we found the value there to be 1, so |
|
1671 |
it seems our parser is behaving correctly. |
|
1672 |
|
|
1673 |
* Now the problem is getting more mysterious. If the parser is behaving |
|
1674 |
correctly, and the value at address 0x500 is completely valid, how come |
|
1675 |
'nr_cpus' have those very strange values? Remember that we put the 'incl |
|
1676 |
0x500' instruction at the same place of '++nr_cpus', so the two addresses |
|
1677 |
should have the exact same values or something is _completely_ messed up. |
|
1678 |
|
|
1679 |
The 'nr_cpus' variable still yielded the value of "1217939456". |
|
1680 |
|
|
1681 |
* So what did really corrupt the BSS-stored variable nr_cpus at this _very_ |
|
1682 |
early stage? |
|
1683 |
|
|
1684 |
The very first thing we do in the C part of the kernel is to clear the entire |
|
1685 |
BSS section to zero. So assuming the kernel was copied/setup correctly (which |
|
1686 |
we're not very sure of yet), the problem probably occurs _between_ clearing |
|
1687 |
the BSS section and parsing the MP tables. |
|
1688 |
|
|
1689 |
OK, so we'll try to check the value of the 'nr_cpus' variable at different |
|
1690 |
places in the kernel between clearing BSS and MP parsing using this 14-byte |
|
1691 |
code segment: |
|
1692 |
|
|
1693 |
asm volatile ("movl %0, %%eax;" |
|
1694 |
"movl %%eax, (0xffffffff80000500); hlt;" |
|
1695 |
: : "m"(nr_cpus) : "memory"); |
|
1696 |
|
|
1697 |
And then use the qemu console to read the address 0x500 'xp 0x500'. |
|
1698 |
|
|
1699 |
* Very important discovery: Adding the above asm fragment _directly_ after |
|
1700 |
clearning the BSS section yielded the same corrupt value of "1217939456" |
|
1701 |
for the 'nr_cpus' variable! Something is seriously wrong cause 'nr_cpus' |
|
1702 |
is clearly in the BSS section: |
|
1703 |
|
|
1704 |
$ readelf -a kernel.elf |
|
1705 |
|
|
1706 |
Section Headers: |
|
1707 |
[Nr] Name Type Address Offset |
|
1708 |
Size EntSize Flags Link Info Align |
|
1709 |
... |
|
1710 |
[ 8] .bss NOBITS ffffffff8010f5e0 0010f5e0 |
|
1711 |
00000000000028c8 0000000000000000 WA 0 0 32 |
|
1712 |
|
|
1713 |
Symbol table '.symtab' contains 183 entries: |
|
1714 |
Num: Value Size Type Bind Vis Ndx Name |
|
1715 |
... |
|
1716 |
151: ffffffff80111d80 4 OBJECT GLOBAL DEFAULT 8 nr_cpus |
|
1717 |
|
|
1718 |
* I think we're getting closer. Notice from above ELF analysis, the BSS |
|
1719 |
section starts at 0xffffffff8010f600, and it has a length of 0x28c8: |
|
1720 |
|
|
1721 |
This means that from reading the ELF: |
|
1722 |
|
|
1723 |
BSS start = 0xffffffff8010f5e0 |
|
1724 |
BSS end = 0xffffffff8010f5e0 + 0x28c8 = 0xffffffff800111ea8 |
|
1725 |
'nr_cpus' is at 0xffffffff80111d80 |
|
1726 |
|
|
1727 |
But very strangely, here are the results of printing __bss_start and |
|
1728 |
__bss_end symbols exported by our kernel linker script we wrote since the |
|
1729 |
dawn of time: |
|
1730 |
|
|
1731 |
BSS start = 0xffffffff8010f5e0 |
|
1732 |
!! BSS end = 0xffffffff801115e0 |
|
1733 |
'nr_cpus' is outside of this region! |
|
1734 |
|
|
1735 |
As you see, somehow the kernel sees a _trimmed_ BSS section from the |
|
1736 |
linker script, and to add insult to injury, 'nr_cpus' is not included |
|
1737 |
in that trimmed region. |
|
1738 |
|
|
1739 |
* Things began to make sense: the kernel obviously has a misguided view about |
|
1740 |
its bss section. This is disastrous cause at this level, we're the ones |
|
1741 |
responsible of clearing this BSS area before use. Ofcourse, we directly move |
|
1742 |
to our linker script BSS section to see what's wrong: |
|
1743 |
|
|
1744 |
SECTIONS |
|
1745 |
{ |
|
1746 |
... |
|
1747 |
.bss : { |
|
1748 |
__bss_start = .; |
|
1749 |
*(EXCLUDE_FILE (*head.o *e820.o) .bss) |
|
1750 |
__bss_end = .; |
|
1751 |
} |
|
1752 |
__kernel_end = .; |
|
1753 |
... |
|
1754 |
} |
|
1755 |
|
|
1756 |
Things should be fine and everything is at its place, but the corruption still |
|
1757 |
exists. So I tried a small modification of: |
|
1758 |
|
|
1759 |
.bss : { |
|
1760 |
__bss_start = .; |
|
1761 |
*(EXCLUDE_FILE (*head.o *e820.o) .bss) |
|
1762 |
} |
|
1763 |
__bss_end = .; |
|
1764 |
__kernel_end = .; |
|
1765 |
|
|
1766 |
And things worked magically! Ofcourse leaving the issue as this will be immature |
|
1767 |
as we didn't understand yet why this trivial change produced this huge effect. |
|
1768 |
|
|
1769 |
* Trying to debug the issue further, we restore the linker script to its original |
|
1770 |
state and let the linker be more verbose by using the -M flag. Inspecting the |
|
1771 |
output, a very interesting thing came up when the linker displayed the final |
|
1772 |
memory map: |
|
1773 |
|
|
1774 |
('...' denotes trimmed output lines) |
|
1775 |
|
|
1776 |
Allocating common symbols // mm, .. |
|
1777 |
Common symbol size file |
|
1778 |
|
|
1779 |
test_symbol 0x8 mmap.o |
|
1780 |
ioapic_descs 0x60 ioapic.o |
|
1781 |
nr_ioapics 0x4 ioapic.o |
|
1782 |
nr_cpus 0x4 smpboot.o |
|
1783 |
mp_irqs 0x700 mptables.o |
|
1784 |
nr_mpcirqs 0x4 mptables.o |
|
1785 |
cpu_descs 0x100 smpboot.o |
|
1786 |
|
|
1787 |
Linker script and memory map |
|
1788 |
|
|
1789 |
0x0000000000100000 . = 0x100000 |
|
1790 |
|
|
1791 |
.text.head 0x0000000000100000 0x310 |
|
1792 |
... |
|
1793 |
.bss 0xffffffff8010f5e0 0x28c8 load address 0x000000000010692c |
|
1794 |
0xffffffff8010f5e0 __bss_start = . |
|
1795 |
*(EXCLUDE_FILE(*e820.o *head.o) .bss) |
|
1796 |
.bss 0xffffffff8010f5e0 0x0 common.o |
|
1797 |
.bss 0xffffffff8010f5e0 0x0 main.o |
|
1798 |
.bss 0xffffffff801105e0 0x0 i8259.o |
|
1799 |
.bss 0xffffffff801105e0 0x0 apic.o |
|
1800 |
... |
|
1801 |
.bss 0xffffffff80110610 0x0 lib/string.o |
|
1802 |
.bss 0xffffffff80110620 0xfc0 lib/printf.o |
|
1803 |
0xffffffff801115e0 __bss_end = . |
|
1804 |
COMMON 0xffffffff801115e0 0x64 ioapic.o |
|
1805 |
0xffffffff801115e0 ioapic_descs // Outside of BSS! |
|
1806 |
0xffffffff80111640 nr_ioapics // Outside of BSS! |
|
1807 |
*fill* 0xffffffff80111644 0x1c 00 |
|
1808 |
COMMON 0xffffffff80111660 0x704 mptables.o |
|
1809 |
0xffffffff80111660 mp_irqs // Outside of BSS! |
|
1810 |
0xffffffff80111d60 nr_mpcirqs // Outside of BSS! |
|
1811 |
*fill* 0xffffffff80111d64 0x1c 00 |
|
1812 |
COMMON 0xffffffff80111d80 0x120 smpboot.o |
|
1813 |
0xffffffff80111d80 nr_cpus // Outside of BSS! |
|
1814 |
0xffffffff80111da0 cpu_descs // Outside of BSS! |
|
1815 |
COMMON 0xffffffff80111ea0 0x8 mmap.o |
|
1816 |
0xffffffff80111ea0 test_symbol // Outside of BSS! |
|
1817 |
0xffffffff80111ea8 __kernel_end = . |
|
1818 |
|
|
1819 |
So at last we got it, the linker script didn't put variables like nr_cpus, |
|
1820 |
nr_ioapics etc in the BSS section, thus they were not cleared in our clear_bss() |
|
1821 |
function. As you can see, those symbols are allocated past the __bss_end symbol. |
|
1822 |
|
|
1823 |
* OK, so we inspect the LD manual for this 'COMMON' section, to find: |
|
1824 |
|
|
1825 |
"A special notation is needed for common symbols, because in many object file |
|
1826 |
formats common symbols do not have a particular input section. The linker treats |
|
1827 |
common symbols as though they are in an input section named ‘COMMON’ ..." |
|
1828 |
|
|
1829 |
* Unfortunately the documentation was a bit vague, and there was no explanation |
|
1830 |
of what "common symbols" were really used for. So we dig deeper to find this |
|
1831 |
great explanation from Ulrich's paper on shared libraries: |
|
1832 |
|
|
1833 |
"Common variables are widely used in fortran, but they got used in C and C++ |
|
1834 |
as well to work around mistakes of programmers. Since in the early days people |
|
1835 |
used to drop the 'extern' keyword from variable definitions, in the same way |
|
1836 |
it is possible to drop it from function declaration, the compiler often had |
|
1837 |
multiple definitions of the same variable in different files. |
|
1838 |
|
|
1839 |
To help the poor and clueless programmer, the C/C++ compiler normally generates |
|
1840 |
common variables for uninitialized definitions such as `int foo;' |
|
1841 |
|
|
1842 |
For common variables, there can be more than one definition, and they all get |
|
1843 |
unified in one location in the output file .. their values does not need to be |
|
1844 |
stored in the ELF file." |
|
1845 |
|
|
1846 |
* And also from Ulrich, a description of -fno-common which is more accurate than |
|
1847 |
gcc's manual: |
|
1848 |
|
|
1849 |
"If the programmer uses the compiler command-line option -fno-common, the |
|
1850 |
generated code will contain uninitialized variables instead of common variables |
|
1851 |
if a vriable definition has no initializer. .. |
|
1852 |
|
|
1853 |
The result at run-time is the same as for common variable, no value stored in |
|
1854 |
the file. But the representation in the object file is different [BSS section], |
|
1855 |
and it allows the linker to find multiple definitions and flag them as errors." |
|
1856 |
|
|
1857 |
* And finally from Ian Lance Taylor's series on linkers: |
|
1858 |
|
|
1859 |
"A section index of SHN_COMMON (0xfff2) indicates a common symbol. Common |
|
1860 |
symbols were invented to handle Fortran common blocks, and they are also often |
|
1861 |
used for uninitialized global variables in C. |
|
1862 |
|
|
1863 |
A common symbol has unusual semantics. Common symbols have a value of zero, but |
|
1864 |
set the size field to the desired size. If one object file has a common symbol |
|
1865 |
and another has a definition, the common symbol is treated as an undefined |
|
1866 |
reference. |
|
1867 |
|
|
1868 |
If there is no definition for a common symbol, the program linker acts as though |
|
1869 |
it saw a definition initialized to zero of the appropriate size. Two object |
|
1870 |
files may have common symbols of different sizes, in which case the program |
|
1871 |
linker will use the largest size. |
|
1872 |
|
|
1873 |
Implementing common symbol semantics across shared libraries is a touchy |
|
1874 |
subject, somewhat helped by the recent introduction of a type for common symbols |
|
1875 |
as well as a special section index (see the discussion of symbol types below)." |
|
1876 |
|
|
1877 |
(Also check nm(1) on "C"/COMMON symbols, and --warn-common on ld(1)) |
|
1878 |
|
|
1879 |
* And by now everything is super-clear, and the fix is only adding the COMMON |
|
1880 |
symbols within the output object BSS boundaries: |
|
1881 |
|
|
1882 |
.bss : { |
|
1883 |
__bss_start = .; |
|
1884 |
*(EXCLUDE_FILE (*head.o *e820.o) .bss) |
|
1885 |
*(EXCLUDE_FILE (*head.o *e820.o) COMMON) |
|
1886 |
__bss_end = .; |
|
1887 |
} |
|
1888 |
|
|
1889 |
What a bug this was. |
| fea07b7 by Ahmed S. Darwish at 2010-01-21 |
1890 |
|
|
1891 |
|
| 9d5e00e by Ahmed S. Darwish at 2010-02-02 |
1892 |
Musings on Memory Allocation: |
| fea07b7 by Ahmed S. Darwish at 2010-01-21 |
1893 |
----------------------------- |
|
1894 |
|
|
1895 |
* References: |
|
1896 |
- Uresh Vahalia, "Unix Internals: the new frontiers", Chapter 12 - Kernel |
|
1897 |
Memory Allocation |
|
1898 |
- Kirk McKusick, J. Karles, "Design of a General Purpose Memory Allocator |
|
1899 |
for the 4.3BSD UNIX Kernel", Berkely, Usenix 1988 conference |
|
1900 |
- Daniel Bovet, Marco Cesati, "Understanding the Linux Kernel", Chapter 8 - |
|
1901 |
Memory Management, Section 8.1 - Page Frame Management |
|
1902 |
- Kirk McKusick, George Neville, "The Design and Implementation of the |
|
1903 |
FreeBSD Operating System", Section 5.3 - Kernel Memory Management |
|
1904 |
- Maurice Bach, "The Design of the UNIX Operating System", Chapter 9 - |
|
1905 |
Memory Management Policies |
|
1906 |
|
|
1907 |
* In all Unices, memory is page-based, and the page is the smallest unit of |
|
1908 |
allocation, protection, address translation, and mapping. X86-64 does not |
|
1909 |
even provide segmentation. |
|
1910 |
|
|
1911 |
* The very first step for writing a MM subsystem is writing the page-level |
|
1912 |
allocator. Due to virtual memory, the page-allocator doesn't need to find |
|
1913 |
contiguous free pages to satisfy its request: non-contigous frames mapped as |
|
1914 |
contiguous addresses shall be enough. |
|
1915 |
|
|
1916 |
The kernel often needs chunks of memory much less of size than a page, there's |
|
1917 |
where the 'kernel memory allocator' job comes into play for allocating this |
|
1918 |
kind of memory within the pages. It depends internally on the page allocator. |
|
1919 |
|
|
1920 |
* Early Unix like SVR2 avoided dynamic memory altogether, and used fixed-size |
|
1921 |
tables for vnodes and proc structures. On the other hand, when memory was |
|
1922 |
required for holding temporary pathnames, they borrowed buffers from the block |
|
1923 |
buffer cache (Check chapter 3 of Maurice's book). |
|
1924 |
|
|
1925 |
Another case was the 4.2BSD, which in McKusick wordings had "at least ten |
|
1926 |
different memory allocators. Some of them handle large blocks, some handle |
|
1927 |
small chained data structures, and others include information to describe |
|
1928 |
I/O operations. .. Multiple allocators reduce the effeciency with which memory |
|
1929 |
is used. The kernel ends up with many different free lists of memory instead |
|
1930 |
of a single freelist from which all allocation can be drawn" |
|
1931 |
|
|
1932 |
* Currently, our page allocator is interfaced through two simple functions: |
|
1933 |
get_free_page() which returns a page frame descriptor, and free_page(page) |
|
1934 |
which frees given page frame. The naming is taken from Linux, which are the |
|
1935 |
historic names Torvalds used in his 0.01 relase and are still in-use today. |
|
1936 |
|
|
1937 |
* Now that we've finished writing our page allocator, we focus on kmalloc()-like |
|
1938 |
kernel memory allocators. A memory allocator utilization factor = |
|
1939 |
|
|
1940 |
total memory 'requested' by the kmalloc() caller |
|
1941 |
------------------------------------------------------------------- |
|
1942 |
total memory 'required' by allocator to satisfy those requests |
|
1943 |
|
|
1944 |
As said by McKusick, Good memory utilization in the kernel is more important |
|
1945 |
than in user processes. Pages in the process address space that are in 'required' |
|
1946 |
but not in the 'requested' pool need not tie up physical memory. But the kernel |
|
1947 |
is not typically paged out, so any not-well-utilized kernel-allocated memory |
|
1948 |
pages are held forever, and can not be used for other purposes. |
|
1949 |
|
|
1950 |
It's typical of user-space memory allocators not to release unused memory in |
|
1951 |
the required pool to avoid the syscall overhead. Kernel allocators are required |
|
1952 |
to release such memory as quick possible though, due to above point of holding |
|
1953 |
kernel memory pages forever, and because the kernal can quickly manipulate its |
|
1954 |
own page maps. |
|
1955 |
|
|
1956 |
* Not to degrade performace, memory returned by the allocator must be aligned |
|
1957 |
according to the machine requirements. I guess for our x86-64 case, the safest |
|
1958 |
haven is 8-byte alignemnt. |
|
1959 |
|
|
1960 |
Speaking of speed, McKusick suggests that the allocator need only to be fast for |
|
1961 |
small pieces of memory. Inspecting a time-shared machine at Berkely, he noticed |
|
1962 |
that "large allocations occur infrequently, and are typically for long-lived |
|
1963 |
objects such as buffers to hold the superblock for a mounted file system". |
|
1964 |
|
|
1965 |
* For a nice survey of kernel memory allocators, check referenced 'Unix Internals' |
|
1966 |
chapter. The 4.2BSD allocator, is covered under 'Simple Power-of-Two Free Lists', |
|
1967 |
and the 4.3BSD allocator, explained in the McKusick paper, is clearly described |
|
1968 |
under 'The McKusick-Karles Allocator' section; Vahalia is a great writer. |
|
1969 |
|
|
1970 |
* Before jumping to sketching our own allocator, we finally survey some of the |
|
1971 |
allocators used in modern unix kernels: |
|
1972 |
|
|
1973 |
In Linux, the page allocator provides contiguous pages using a binary buddy |
|
1974 |
system instead of depending on hardware circuity to emulate this continuity. |
|
1975 |
This, in Daniel Bovet words, offers the big advantage of leaving the kernel |
|
1976 |
paging tables unchanged, which leads to less average memory access times cause |
|
1977 |
such modifications lets the CPU flush its Translation-Lookaside Buffers. |
|
1978 |
|
|
1979 |
NOTE! The smallest unit returned by this buddy page allocatoris one PAGE. The |
|
1980 |
groups of 2, 4, 8, 16, .., 1024 are a physically contigous set of page frames. |
|
1981 |
This is not a kernel memory allocator (KMA) like the McKusick-Karles one; this |
|
1982 |
is a page allocator that uses the buddy algorithm for serving _contigous pages_ |
|
1983 |
instead. |
|
1984 |
|
|
1985 |
The page allocator main structure used is mem_map, which holds all the page |
|
1986 |
descriptors of the specific memory zone this page allocator belongs, like |
|
1987 |
ZONE_DMA, HIGHMEM, etc. Notice that mem_map is also the historic naming Linus |
|
1988 |
used for his page allocator in the 0.01 release. That page allocator was done |
|
1989 |
using a bitmap, which as we can see, fits the 'mem_map' naming better. |
|
1990 |
|
|
1991 |
Above this page allocator in Linux exists the regular sl[au]b object-caching |
|
1992 |
allocator, which requests pages from the page allocator upon need. |
|
1993 |
|
|
1994 |
In FreeBSD, the kernel uses the 4.3BSD allocator described in the McKusick |
|
1995 |
paper for small allocations. For large fixed-size allocations, it uses the |
|
1996 |
Zone/slab allocator instead. |
|
1997 |
|
|
1998 |
* In our kernel, we assign high weights to simplicity without sacrificying much |
|
1999 |
speed and memory resources. |
|
2000 |
|
|
2001 |
Our page allocator's main structure is the Page Frame Descriptor Table |
|
2002 |
(pfdtable; Unix SVR2 terminology), which is an array/table that contains all |
|
2003 |
the system's available (non e820-reserved; above kernel memory area) physical |
|
2004 |
pages descriptors. |
|
2005 |
|
|
2006 |
Free pages descriptors are connected in place by a linked list pfdfree_head |
|
2007 |
as the list head. Allocation and Reclamation is a simple O(1) matter of adding |
|
2008 |
to or removing from the pfdfree_head list. |
|
2009 |
|
|
2010 |
The kernel memory allocator is the 4.3BSD McKusick-Karles allocator. I chose |
|
2011 |
it cause of its beautiful simplicity, speed, and space effeciency for small |
|
2012 |
allocations. We shall develop a slab-like caching allocator for big objects |
|
2013 |
once the need arise; we're obviously influenced by FreeBSD in this area. |
| 15681eb by Ahmed S. Darwish at 2010-02-18 |
2014 |
|
|
2015 |
|
|
2016 |
On Virtual Memory: |
|
2017 |
------------------ |
|
2018 |
|
|
2019 |
* References: |
|
2020 |
- Uresh Vahalia, "Unix Internals", Chapter 13 - Virtual Memory |
|
2021 |
- Peter J. Denning, 'Virtual Memory', Princeton, ACM Computing Surveys |
|
2022 |
- Maurice Bach, "The Design of the UNIX Operating System", Chapter 9 - |
|
2023 |
Memory Management Policies |
|
2024 |
- Jonathan Corbet, "Four-level page tables", LWN.net, |
|
2025 |
http://lwn.net/Articles/106177/, http://www.webcitation.org/5ncgFSpzm |
|
2026 |
|
|
2027 |
* First, please check our detailed notes on the great Dennings's virtual memory |
|
2028 |
paper; the paper covers a huge number of now very-well-established concepts. |
|
2029 |
Below paragraphs assumes familiarity with the concepts covered in that paper. |
|
2030 |
|
|
2031 |
* To cause considerable reduction in the amount of virtual-to-physical mapping |
|
2032 |
information that needs to be stored, methods like segmentation and paging |
|
2033 |
were invented. In the words of Denning, "each method groups information into |
|
2034 |
blocks, a block being a set of contiguous addresses in address space. The |
|
2035 |
entries in the mapping table will refer now to blocks, which are _far_ less |
|
2036 |
numerous than addresses in a process address space". |
|
2037 |
|
|
2038 |
In other words, instead of storing mapping information for each address, we |
|
2039 |
store the mapping info fore each equally-sized _page_ or for each _segment_. |
|
2040 |
|
|
2041 |
* In paging, even while storing mapping information for each page instead of |
|
2042 |
for each address, the page table can consume a huge area of memory. For |
|
2043 |
example, in a fictional 32-bit machine with a page size of 1KB, and an |
|
2044 |
address space of 2GB, we need: 2^31 / 2^10 = 2^21 entries. For an entry size |
|
2045 |
of 32-bit, 4-bytes: page table size = 2^21 * 2^2 = 2^23 = 2^3 * 2^20 = 8MB. |
| 06f422f by Ahmed S. Darwish at 2010-07-27 |
2046 |
So we'll need a _contiguous_ 8MB table for _each_ system process address space |
|
2047 |
which is definitely not tolerable. |
| 15681eb by Ahmed S. Darwish at 2010-02-18 |
2048 |
|
|
2049 |
* Analyzing the problem, we see that the root cause is having mapping information |
|
2050 |
for non-used pages; a process rarely use its entire address space anyway. It's |
|
2051 |
also common to have a virtual address-space which is much larger than machine's |
|
2052 |
available physical memory. To solve this problem we either: |
|
2053 |
|
|
2054 |
- have segmented page tables. Each segment of the process has its own page table |
|
2055 |
which is just large enough to hold the valid address range for that segment. |
|
2056 |
|
|
2057 |
- second is having the page table itself paged, which means an additional higher |
|
2058 |
level page table is used to map the lower-level page table. With such a multi- |
|
2059 |
teired hierarchy, we need to allocate only those pages of the lower-level table |
|
2060 |
that map used/valid addresses of the process. This is the common approach. |
|
2061 |
|
|
2062 |
|
|
2063 |
On Partitionting the 64-bit kernel address space: |
|
2064 |
------------------------------------------------- |
|
2065 |
|
|
2066 |
* References: |
|
2067 |
- Andi Kleen, "Porting Linux to x86-64", SuSe Labs |
|
2068 |
- Michael Matz, Hubicka, Andrea Jaeger, Mark Mitchell, "System V Application |
|
2069 |
Binary Interface, AMD64 processor supplement", SuSe Labs, Codesourcey |
|
2070 |
- Linux Kernel linux-2.6/Documentation/x86/x86_64/mm.txt, kernel.org |
|
2071 |
- Our kernel comments at head.S and include/paging.h |
|
2072 |
- MSDN Library, Win32 and COM development, "64-bit Virtual Address Space", |
|
2073 |
http://msdn.microsoft.com/en-us/library/aa384271(VS.85).aspx |
|
2074 |
http://www.webcitation.org/5nckv6YBf |
|
2075 |
- Intel Manuals volume 2A, A-M Instructions Reference |
|
2076 |
|
|
2077 |
* Partitioning the address space of a long mode kernel is an interesting challenge. |
|
2078 |
There are so many tradeoffs to be taken care of. First, let's remember that we |
|
2079 |
have full 64-bit space. This has obvious potentials, but also some drawbacks that |
|
2080 |
need be taken care of. |
|
2081 |
|
|
2082 |
* The first drawback of using a full 64-bit address space is code size. Imagine when |
|
2083 |
every far jump, global data access, and function call uses a full 64-bit address. |
| 06f422f by Ahmed S. Darwish at 2010-07-27 |
2084 |
The difference is outlied in below pure x86-32 app: |
| 15681eb by Ahmed S. Darwish at 2010-02-18 |
2085 |
|
|
2086 |
/* File1: sum.c */ |
|
2087 |
int global; |
|
2088 |
int sum() { |
|
2089 |
return 0; |
|
2090 |
} |
|
2091 |
|
|
2092 |
/* File2: app.c */ |
|
2093 |
extern int global; |
| 06f422f by Ahmed S. Darwish at 2010-07-27 |
2094 |
int x, y; |
| 15681eb by Ahmed S. Darwish at 2010-02-18 |
2095 |
int main() { |
|
2096 |
global = 0xdead; |
|
2097 |
x = 0xceeb; |
|
2098 |
return sum(); |
|
2099 |
} |
|
2100 |
|
|
2101 |
# -m32: 32-bit compilation and linkage |
|
2102 |
$ gcc -O3 -m32 sum.c app.c -o app |
|
2103 |
|
|
2104 |
which generates the very small code below: |
|
2105 |
|
|
2106 |
$ objdump -D app |
|
2107 |
0804956c <x>: /* 4-byte address */ |
|
2108 |
804956c: 00 00 add %al,(%eax) |
|
2109 |
... |
|
2110 |
08049570 <y>: /* 4-byte address */ |
|
2111 |
8049570: 00 00 add %al,(%eax) |
|
2112 |
|
|
2113 |
08048370 <main>: |
| 06f422f by Ahmed S. Darwish at 2010-07-27 |
2114 |
/* 10-bytes in total |
|
2115 |
* 4-byte absolute address dereference */ |
|
2116 |
8048376: c7 05 68 95 04 08 ad movl $0xdead,0x8049568 |
|
2117 |
804837d: de 00 00 |
|
2118 |
|
|
2119 |
/* 10-bytes in total |
|
2120 |
* 4-byte absolute address dereference */ |
|
2121 |
8048380: c7 05 6c 95 04 08 eb movl $0xceeb,0x804956c |
|
2122 |
8048387: ce 00 00 |
|
2123 |
|
|
2124 |
/* Only 5 bytes in total |
|
2125 |
* 4-byte relative offset */ |
|
2126 |
8048394: e8 c7 ff ff ff call 8048360 <sum> |
|
2127 |
... |
| 15681eb by Ahmed S. Darwish at 2010-02-18 |
2128 |
|
|
2129 |
Now imagine if above app used full 64-bit virtual addresses, each 4-byte address |
|
2130 |
dereference above would have consumed full 8-bytes; even some ops would consume |
|
2131 |
140% extra size. Inspect below results: |
|
2132 |
|
|
2133 |
$ gcc -m64 -mcmodel=large -c sum.c -o sum.o |
|
2134 |
$ gcc -m64 -mcmodel=large -c app.c -o app.o |
|
2135 |
$ ld -Ttext 0xffff800000000000 app.o sum.o -o app |
| 06f422f by Ahmed S. Darwish at 2010-07-27 |
2136 |
$ objdump -D app | less |
| 15681eb by Ahmed S. Darwish at 2010-02-18 |
2137 |
|
|
2138 |
ffff8000002000cc <x>: /* .. 8-byte address */ |
|
2139 |
ffff8000002000cc: 00 00 add %al,(%rax) |
|
2140 |
... |
|
2141 |
ffff8000002000d0 <y>: /* .. 8-byte address */ |
|
2142 |
... |
|
2143 |
|
|
2144 |
ffff800000000024 <main>: |
|
2145 |
|
| 06f422f by Ahmed S. Darwish at 2010-07-27 |
2146 |
/* 8-bytes address, 15-bytes ops size in total. Compare |
|
2147 |
* with the 10-byte 32-bit equivalent ops (50% increase) */ |
|
2148 |
ffff800000000028: 48 b8 c8 00 20 00 00 mov $0xffff8000002000c8,%rax |
|
2149 |
ffff80000000002f: 80 ff ff |
|
2150 |
ffff800000000032: c7 00 ad de 00 00 movl $0xdead,(%rax) |
|
2151 |
|
|
2152 |
/* 8-bytes address, 15-bytes ops size in total. Compare |
|
2153 |
* with the 10-byte 32-bit equivalent ops (50% increase) */ |
|
2154 |
ffff800000000038: 48 b8 cc 00 20 00 00 mov $0xffff8000002000cc,%rax |
|
2155 |
ffff80000000003f: 80 ff ff |
|
2156 |
ffff800000000042: c7 00 eb ce 00 00 movl $0xceeb,(%rax) |
|
2157 |
... |
|
2158 |
|
|
2159 |
/* 8-byte address, *12-bytes* ops size in total. Compare |
|
2160 |
* with the 5-byte 32-bit equivalent ops (140% increase!) */ |
|
2161 |
ffff80000000005d: 48 ba 00 00 00 00 00 mov $0xffff800000000000,%rdx |
|
2162 |
ffff800000000064: 80 ff ff |
|
2163 |
ffff800000000067: ff d2 callq *%rdx |
| 15681eb by Ahmed S. Darwish at 2010-02-18 |
2164 |
|
|
2165 |
* As you can see from above, naively using the 64-bit address space can lead to |
|
2166 |
diastrous effects on code size, and thus, the cache footprint. To solve this, some |
|
2167 |
help from both the kernel and the compiler is needed. |
|
2168 |
|
|
2169 |
First, let's focus on user-space. To let user-space avoid this large code-size hit, |
|
2170 |
it's common to assign to it the 0 -> 4GB+ virtual space, and let the compiler |
|
2171 |
generate 32-bit addresses for code and global data. This is for example, how the Linux |
|
2172 |
ecosystem does it. The kernel assigns below virtual addresses to user-space (for each |
|
2173 |
process): |
|
2174 |
|
|
2175 |
0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm |
|
2176 |
hole caused by [48:63] sign extension |
|
2177 |
|
|
2178 |
And let gcc emits 32-bit addressing for code and global data using the -mcmodel=small |
|
2179 |
parameter, which is the default memory model. %rip-relative addressing is also used |
| 06f422f by Ahmed S. Darwish at 2010-07-27 |
2180 |
by default at this model for dereferencing object globals; it already consumes same |
|
2181 |
amount of opcode size relative to x86-32's in-operand 32-bit absoulte dereference. |
| 15681eb by Ahmed S. Darwish at 2010-02-18 |
2182 |
|
|
2183 |
* Now how are we going to produce similar effect in kernel-space code when the |
|
2184 |
0 -> 4G+ space has been already reserved by user-space? |
|
2185 |
|
|
2186 |
Thanks to the elegance of the 2's complement model, the common case is to use |
|
2187 |
negative 32-bit addresses for kernel linking. This leads to negative 32-bit |
|
2188 |
discplacement for kernel ops. We then let the CPU implcitly (automatically) |
|
2189 |
sign-extend those discplacements, leading us to use the top 64-bit address space, |
|
2190 |
but with the size-effeciency of 32-bit addresses! |
|
2191 |
|
|
2192 |
An example is overdue. Imagine linking our kernel with the 0xffffffff80000000 |
|
2193 |
as base. %rax points to a physical address in our text or data area. To let it |
|
2194 |
be virtual, we add the base to it. Mainly we want to do: |
|
2195 |
|
|
2196 |
%rax = %rax + 0xffffffff80000000 |
|
2197 |
|
|
2198 |
Now, thanks to the sign-extension trick, the above instruction can be size- |
|
2199 |
-effeciently encoded as if the displacement was 32-bit as follows: |
|
2200 |
|
|
2201 |
ffffffff8010b347: 48 05 00 00 00 80 add $0xffffffff80000000,%rax |
|
2202 |
|
|
2203 |
As you can see, only 4-bytes are used for displacement (0x80000000). The REX.W |
|
2204 |
prefix (0x48) with the x86 ADD opcode (05) will let the x86-64 cpu automatically |
|
2205 |
sign extend the 32-bit immediate 0x80000000 to $0xffffffff80000000 and achieve |
|
2206 |
our goals. |
|
2207 |
|
|
2208 |
* Here's a more complete explanation from our kernel comments: |
|
2209 |
|
|
2210 |
Using -2GB as a base empowers the compiler to directly encode symbolic |
|
2211 |
references as 32-bit values, and let the CPU implicitly sign-extended those to |
|
2212 |
64-bits due to the REX.W prefix. Thus, we use the top 64-bit address space, but |
|
2213 |
with 32-bit size effeciency! Check below example: |
|
2214 |
|
|
2215 |
Passing global strings to memcpy: |
|
2216 |
|
|
2217 |
48 c7 c6 80 24 8b 80 mov $0xffffffff808b2480,%rsi |
|
2218 |
48 c7 c7 60 d8 10 80 mov $0xffffffff8010d860,%rdi |
|
2219 |
e8 f4 39 00 00 callq ffffffff8010bb00 <memcpy> |
|
2220 |
|
|
2221 |
To avoid the size-overhead of 64-bit offsets in function calls, we limit the |
|
2222 |
kernel text to a 32-bit virtual area. Thus, an x86-64 callq will only consume a |
|
2223 |
4-byte offset: relative space between next op and the destination label. |
|
2224 |
|
| 06f422f by Ahmed S. Darwish at 2010-07-27 |
2225 |
To dereference kernel globals (instead of just passing their addresses), we'll |
|
2226 |
use %rip-relative addressing within + or - 2GB limit, avoiding the size-overhead |
|
2227 |
of referencing a full 8-byte address. |
| 15681eb by Ahmed S. Darwish at 2010-02-18 |
2228 |
|
|
2229 |
All of this was in fact designed by the SysV AMD64 ABI developers, and is called |
|
2230 |
the 'kernel code model'. We use this code model, and let userspace use the |
|
2231 |
'small' or the 'medium' one. It's interesting to note that NT userspace defaults |
|
2232 |
to the medium model, while most UNIX implementations default to the 'small' one. |
|
2233 |
|
|
2234 |
* To get a state of the size difference, check the difference of function call |
|
2235 |
size overhead between the 'large' and the 'kernel' code models: |
|
2236 |
|
|
2237 |
$ objdump -j .text -D kernel.elf | grep call # -mcmodel=kernel |
|
2238 |
|
|
2239 |
ffffffff801090c8: e8 e3 45 00 00 callq ffffffff8010d6b0 <vsnprintf> |
|
2240 |
ffffffff801090df: e8 ec 48 00 00 callq ffffffff8010d9d0 <printk> |
|
2241 |
ffffffff80109105: e8 86 41 00 00 callq ffffffff8010d290 <memset> |
|
2242 |
ffffffff8010b97b: e8 70 f9 ff ff callq ffffffff8010b2f0 <spin_lock> |
|
2243 |
ffffffff8010ba61: e8 6a ff ff ff callq ffffffff8010b9d0 <get_free_page> |
|
2244 |
ffffffff8010bdd8: e8 83 fc ff ff callq ffffffff8010ba60 <get_zeroed_page> |
|
2245 |
ffffffff8010c749: e8 a2 f8 ff ff callq ffffffff8010bff0 <__kmalloc> |
|
2246 |
... |
|
2247 |
|
|
2248 |
$ objdump -j .text -D kernel.elf | grep call # -mcmodel=large |
|
2249 |
|
|
2250 |
/* Notice the ~%140 increase in size per call */ |
|
2251 |
ffffffff801090cb: 48 b8 d0 e6 10 80 ff mov $0xffffffff8010e6d0,%rax |
|
2252 |
ffffffff801090d2: ff ff ff |
|
2253 |
ffffffff801090d5: ff d0 callq *%rax |
|
2254 |
ffffffff801091a3: 48 bb 00 ea 10 80 ff mov $0xffffffff8010ea00,%rbx |
|
2255 |
ffffffff801091aa: ff ff ff |
|
2256 |
ffffffff801091b9: ff d3 callq *%rbx |
|
2257 |
... |
|
2258 |
|
|
2259 |
So, while admittedly using -mcmodel=large makes things a bit easier for the 64-bit |
| 06f422f by Ahmed S. Darwish at 2010-07-27 |
2260 |
kernel developer (as it gives them the freedom of basing the kernel on any desired |
|
2261 |
virtual address), it induces too much size overhead that it's worth the effort |
|
2262 |
avoiding it. |
|
2263 |
|
|
2264 |
For example, since Cute uses -mcmodel=kernel, we have two mappings in our system, |
|
2265 |
one for kernel text and data (-2GB), and _another_ mapping for the entire physical |
|
2266 |
memory discovered at system boot. Extra care was needed while developing this in |
|
2267 |
an already very complicated subsystem: memory management. |
|
2268 |
|
|
2269 |
This is also the same reason why Linux's amd64 (unlike ia32) __PAGE_OFFSET is |
|
2270 |
different from the kernel linkage base at __START_KERNEL_MAP = -2GB. |
| 15681eb by Ahmed S. Darwish at 2010-02-18 |
2271 |
|
|
2272 |
* So, keeping above points in mind, we'll use below strategy for dividing our kernel |
|
2273 |
space addresses: |
|
2274 |
|
|
2275 |
0x0000000000000000 -> 0x00007fffffffffff userspace mappings |
|
2276 |
0x0000800000000000 -> 0x0000ffffffffffff not canoncial; cannot be used |
|
2277 |
0xffff800000000000 -> 0xffffc00000000000 maps for all physical memory (64TB max) |
|
2278 |
0xffffffff80000000 -> 0xffffffffc0000000 kernel text and data map |
| 9a0481a by Ahmed S. Darwish at 2010-02-24 |
2279 |
|
|
2280 |
|
|
2281 |
* Debugging -O3 triggered problems: |
|
2282 |
----------------------------------- |
|
2283 |
|
|
2284 |
* While adding a good number of GCC warning request flags to detect bugs as early |
|
2285 |
as possible, I enabled the -O3 switch only to discover that it triggered a good |
|
2286 |
number of hidden kernel bugs. |
|
2287 |
|
|
2288 |
First bug led to several exceptions. Upon investigating the CPU's reported |
|
2289 |
faulting instruction, I found its an SSE x86 op. SSE ops require special %cr0 |
|
2290 |
and %cr4 setup that we're not ready for. |
|
2291 |
|
|
2292 |
The solution was simply instructing GCC to avoid emitting SSE ops using its |
|
2293 |
-mno-see{2,3,4} and -mno-3dnow control flags. |
|
2294 |
|
|
2295 |
* Second bug was way uglier. It was only triggered at -O3 and with Kmalloc test |
|
2296 |
code (and only kmalloc test code, not any of the rest of the tests) enabled. |
|
2297 |
|
|
2298 |
The faults were a bit random, from continuous triple faults to rubbish on the |
|
2299 |
screen by printk(). This also means that printk is not available for debugging. |
|
2300 |
|
|
2301 |
Switching to -O2 mute all those faults .. |
|
2302 |
|
|
2303 |
* An important debugging tool when printf is not available is inspecting the final |
|
2304 |
kernel ELF. Having the ELF at hand helps us at: |
|
2305 |
|
|
2306 |
- directly knowing the exact addresses of all global symbols |
|
2307 |
|
|
2308 |
- translate any %rip address to the exact place in the executable. This aids in |
|
2309 |
knowing things like whether a core had deadlock on a spinlock (%rip inifinitely |
|
2310 |
tied at the spin loop), the exact x86 op of a faulting %rip address, etc |
|
2311 |
|
|
2312 |
* Applying above methods, we inspect the global vga_buffer array. This array get |
|
2313 |
its contents directly copied to the vga card, so we can have a nice view about |
|
2314 |
the source of screen corruption: |
|
2315 |
|
|
2316 |
$ readelf --syms kernel.elf | grep vga_buffer |
|
2317 |
|
|
2318 |
114: ffffffff8010eb60 4000 OBJECT LOCAL DEFAULT 6 vga_buffer |
|
2319 |
|
|
2320 |
Halting the CPU and inspecting physical address 0x10eb60 contents, we find that: |
|
2321 |
At -O2, well-formatted vga text: 0f 75 0f 43 0f 64 0f 74 - 'Cute' |
|
2322 |
At -O3, it's rubbish: 9f 19 9e fa 9f 9f ae 80 ... |
|
2323 |
|
|
2324 |
Further inspecting the printk() code, I found there's no apparent problem in |
|
2325 |
the code itself (or its generated assembly). The printk() code was mainly passed |
|
2326 |
rubbish strings, instead of the expected valid ones. |
|
2327 |
|
|
2328 |
* Inspecting the issue further, I've put a global string: |
|
2329 |
|
|
2330 |
static char string[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"; |
|
2331 |
|
|
2332 |
$ readelf --syms kernel.elf | grep string |
|
2333 |
|
|
2334 |
114: 0000000000000000 0 FILE LOCAL DEFAULT ABS string.c |
|
2335 |
118: ffffffff8010e680 37 OBJECT LOCAL DEFAULT 7 string |
|
2336 |
|
|
2337 |
Inspecting those mem locations at -O2, yielded the expected byte values: |
|
2338 |
41 42 43 44 45 46 47 48 .. |
|
2339 |
|
|
2340 |
At -O3, however, below values were found: |
|
2341 |
10 80 48 89 f2 48 03 53 08 48 01 eb |
|
2342 |
|
|
2343 |
* Checking the unexpected bytes sequence above, I smelled x86 opcodes, which |
|
2344 |
they truly were! They were from the disassembly: |
|
2345 |
|
|
2346 |
ffffffff80108140 <e820_init>: |
|
2347 |
... |
|
2348 |
ffffffff801081b7: 48 c7 c7 18 e3 10 80 mov $0xffffffff8010e318,%rdi |
|
2349 |
ffffffff801081be: 48 89 f2 mov %rsi,%rdx |
|
2350 |
ffffffff801081c1: 48 03 53 08 add 0x8(%rbx),%rdx |
|
2351 |
ffffffff801081c5: 48 01 eb add %rbp,%rbx |
|
2352 |
... |
|
2353 |
|
|
2354 |
So we got a lead, but it's just an indication of something utterly and |
|
2355 |
completely messed-up. |
|
2356 |
|
|
2357 |
* Trying to inspect the issue further, I halted the CPU and dumped first 10MB |
|
2358 |
of RAM before the crash. There were some really surprising results. |
|
2359 |
|
|
2360 |
Searching for some of the strings we send to printk() led to nothing, nada, |
|
2361 |
nil. The strings we send, they must be there at the kernel binary in ASCII |
|
2362 |
form, aren't simply there. Instead most of them are replaced by x86 ops. |
|
2363 |
|
|
2364 |
* Thinking about the situation, a change to kernel binary at such scale can |
|
2365 |
mostly happen by a serious bug at bootstrap code. |
|
2366 |
|
|
2367 |
Checking the first 3 lines of our bootstrap disassembly, where it moves |
|
2368 |
kernel code from bootstrap BSS to its expected place, made everything |
|
2369 |
clear: |
|
2370 |
|
|
2371 |
kern/kernel.elf: file format elf64-x86-64 |
|
2372 |
|
|
2373 |
Disassembly of section .text.head: |
|
2374 |
|
|
2375 |
0000000000100000 <startup_32>: |
|
2376 |
100000: be a0 03 10 00 mov $0x1003a0,%esi |
|
2377 |
100005: bf 04 80 10 00 mov $0x108004,%edi <-- [1] |
|
2378 |
10000a: b9 00 00 02 00 mov $0x20000,%ecx |
|
2379 |
|
|
2380 |
[1] As you can see, if the kernel image we're copying away from bootstrap |
|
2381 |
BSS to its right place exceeded a size of 0x8000 bytes, the part of the |
|
2382 |
image > 0x8000 will be overriden, which leads to the mess we were in. |
|
2383 |
|
|
2384 |
As in C's memmove(), we now avoid such corruption by copying kernel code |
|
2385 |
to a safe temporary buffer first, then copy it again to the final |
|
2386 |
overlapping place. |
|
2387 |
|
|
2388 |
(Sometimes reading disassembly makes you catch bugs quicker than reading |
|
2389 |
source code. Although in this case the source was really in assembly, when |
|
2390 |
we saw the _numbers_ instead of just the symbols exported by the linker |
|
2391 |
script, we caught the bug rather instantaneously) |
|
2392 |
|
|
2393 |
* Important lessons: |
|
2394 |
|
|
2395 |
- Take extreme care of overlaps while coding x86 'rep movsb' and friends. |
|
2396 |
It's good to dynamically assure that: |
|
2397 |
|
| 4f55712 by Ahmed S. Darwish at 2010-11-16 |
2398 |
+ src < dst, AND dst >= (src + len) .. i.e., bad overlap |
| 9a0481a by Ahmed S. Darwish at 2010-02-24 |
2399 |
+ OR src >= dst .. i.e., safe overlap |
|
2400 |
|
|
2401 |
Any other condition means direct corruption. |
|
2402 |
|
|
2403 |
The 'safe' overlap is only so if we are using 'movsb' instead of movsl |
| 4f55712 by Ahmed S. Darwish at 2010-11-16 |
2404 |
or movsq. Remember that this 'safe' overlap corrupts the source, but |
|
2405 |
keeps the destination data valid. |
|
2406 |
|
|
2407 |
- To be safe against such kind of overlaps in C parts of the kernel, |
|
2408 |
we've added below memcpy() asserts: |
|
2409 |
|
|
2410 |
if (src > dst && (src - dst) < 8) OR |
|
2411 |
(src <= dst && (dst - src) < len): |
|
2412 |
panic() [1] |
|
2413 |
|
|
2414 |
which can be reduced by logic (check Appendix) to: |
|
2415 |
|
|
2416 |
if (dst + 8 > src) AND (src + len > dst): |
|
2417 |
panic(); [2] |
|
2418 |
|
|
2419 |
The first check is used cause of 'movsq' 8-byte block copy. We've |
|
2420 |
intentionally transformed substraction in the '>' and '<' expressions |
|
2421 |
to addition cause of the _unsigned_ pointer values. |
|
2422 |
|
|
2423 |
---------------------------------------------------------------------- |
|
2424 |
| Appendix |
|
2425 |
| |
|
2426 |
| It's easy to prove equivalence of the 'if' conditions marked |
|
2427 |
| with [1] and [2] above; let: |
|
2428 |
| |
|
2429 |
| A eq 'src > dst' ==> ~A eq 'src <= dst' |
|
2430 |
| B eq '(src - dst) < 8' |
|
2431 |
| C eq '(dst - src) < len)' |
|
2432 |
| D eq '(A and B) or (~A and C)' |
|
2433 |
| src >= 0, dst >= 0, len >= 0 |
|
2434 |
| |
|
2435 |
| We want to prove that: |
|
2436 |
| D is fully equivalent to (B and C) |
|
2437 |
| |
|
2438 |
| From the premises, we can see that: |
|
2439 |
| if (B and C) is true, then |
|
2440 |
| '(A and B) or (~A and C)' is true, then |
|
2441 |
| D is true (I) |
|
2442 |
| |
|
2443 |
| From the premises, we can also see that: |
|
2444 |
| if A is true, then C is true (II) |
|
2445 |
| if ~A is true, then B is true (III) |
|
2446 |
| |
|
2447 |
| From premises and (II): |
|
2448 |
| if (A and B) is true, then (A and B and C) is true (IV) |
|
2449 |
| |
|
2450 |
| From premises and (III): |
|
2451 |
| if (~A and C) is true, then (~A and C and B) is true (V) |
|
2452 |
| |
|
2453 |
| From premises, (IV), and (V): |
|
2454 |
| if D is true, then (B and C) is true (VI) |
|
2455 |
| |
|
2456 |
| From (I) and (VI): |
|
2457 |
| D is fully equivalent to (B and C) Q.E.D. |
|
2458 |
---------------------------------------------------------------------- |
| 5cc7a1d by Ahmed S. Darwish at 2010-03-20 |
2459 |
|
|
2460 |
* Global lessons: |
|
2461 |
|
|
2462 |
- Disassembly is a great tool for debugging kernel code, since a lot of things |
|
2463 |
taken for granted at user-space is not guaranteed here anymore. Abstraction |
|
2464 |
cut both ways indeed. |
|
2465 |
|
|
2466 |
- It's _very important_ to deeply test each code path, no matter how trivial |
|
2467 |
it seems (basic stuff, but sometimes we forget the basics). |
|
2468 |
|
|
2469 |
- Memory dumping of critical kernel regions is really valuable; remember to use |
|
2470 |
it as needed. |
|
2471 |
|
|
2472 |
- It's important to test the code with different environment combinations: |
|
2473 |
different optimization levels: GCC's -O {0, 1, 2, 3, s} |
|
2474 |
different memory sizes: Qemu's -m {2, 3, .., 500, .., 2000, ..} |
|
2475 |
different number of cores: -smp {1, 2, 3, ..} |
|
2476 |
different CPU speeds: Bochs's ips= parameter |
|
2477 |
different BIOSes: different versions of Bochs Bios, different PC forms, .. |
|
2478 |
|
| b3bbe1d by Ahmed S. Darwish at 2010-11-16 |
2479 |
|
| 5cc7a1d by Ahmed S. Darwish at 2010-03-20 |
2480 |
* The Zone: |
|
2481 |
----------- |
|
2482 |
|
|
2483 |
(Or the bug that took me around 6 days to solve) |
|
2484 |
|
|
2485 |
* While writing the context switch code, I observed a very ugly bug. Somehow the |
|
2486 |
montonic PIT interrupts that got triggered every 1 millisecond corrupts kernel |
|
2487 |
state. |
|
2488 |
|
|
2489 |
Symptoms: random exceptions and random assert() failures allover the place. |
|
2490 |
|
|
2491 |
* So I first thought there must be something wrong in the timer handler itself. |
|
2492 |
So what if the PIT handler did exactly nothing but acknowledging the local APIC? |
|
2493 |
(It must acknowledge the APIC so future IRQs get handled) |
|
2494 |
|
|
2495 |
The handler code was successfully reduced to: |
|
2496 |
|
|
2497 |
push %rax |
|
2498 |
movq $(VIRTUAL(APIC_PHBASE) + APIC_EOI), %rax |
|
2499 |
movl $0,(%rax) |
|
2500 |
pop %rax |
|
2501 |
iretq |
|
2502 |
|
|
2503 |
But still the same buggy behavior prevailed. |
|
2504 |
|
|
2505 |
* After reducing the handler code to a number of bytes, with no positive results, |
|
2506 |
I thought let's check the IDT validity. Inspecting the IDT cells date yielded |
|
2507 |
nothing interesting. |
|
2508 |
|
|
2509 |
Also checking the %rflags, bit by bit, led to nothing interesting. |
|
2510 |
|
|
2511 |
So I thought maybe we should setup the IDT cells as 'trap' instead of 'interrupt' |
|
2512 |
gates, but this is ofcourse wrong. Trap gates leaves the IF flag set while |
|
2513 |
servicing the interrupt, which means handlers can get interrupted indefinitely, |
|
2514 |
and our stack can get overflowed in a matter of milliseconds. |
|
2515 |
|
|
2516 |
* OK, so I thought instead of focusing on the IRQ handlers and the IDT, lets focus |
|
2517 |
on the _code_ that get interrupted itself. The failed code was usually the |
|
2518 |
general memory allocator kmalloc() test cases. |
|
2519 |
|
|
2520 |
After inspecting the interrupted code, I found nothing interesting. |
|
2521 |
|
|
2522 |
* To let the problem domain be smaller, and after minimizing the handler code, I |
|
2523 |
thought lets also minimize the interrupted code and see. Having a bunch of |
|
2524 |
printk()s in a tight loop, with the timer handler on, triggered the problem! |
|
2525 |
|
|
2526 |
Here the second day finishes. Either something is really messed up, or I'm a |
|
2527 |
shitty developer that can't even write a stable printk(). |
|
2528 |
|
|
2529 |
Furthermore, running any testcases after enabling the interrupts make them |
|
2530 |
fail miserably. From all the debugging I've done in the last 6 months, I |
|
2531 |
began to feel the instincts of such 'deep' kernel bugs: this is definitely |
|
2532 |
one of them. |
|
2533 |
|
|
2534 |
* So, I began to feel hopeless. I'll just inspect the x86 ops that triggered |
|
2535 |
the exceptions, hoping to find any pattern in the code that trigger the |
|
2536 |
bug. Definitely a pattern was clearly there; some examples of the failing |
|
2537 |
x86 ops: |
|
2538 |
|
|
2539 |
mov %rax, 0x0(%rbp) |
|
2540 |
movzbl (%rax), %eax |
|
2541 |
movl $0x2, 0x8(%rax) |
|
2542 |
movl $0x1, 0x4(%rax) |
|
2543 |
... |
|
2544 |
|
|
2545 |
And the most triggered, probably the _only_ triggered, exception was the page |
|
2546 |
fault #PF exception, vector 14. The faulting (unmapped) address was things |
|
2547 |
like %cr2=0x0, %cr2=0x8, %cr2=0x200020, etc. |
|
2548 |
|
|
2549 |
From above pattern, I thought: who's really messing up with our registers, |
|
2550 |
getting them loaded with weird values like 0, 0x8, etc? |
|
2551 |
|
|
2552 |
* So, we're in the verge of the third day end with no direct results, and nothing |
|
2553 |
known about the source of the bug. Having nothing specific to think about, I |
|
2554 |
inspected the critical kernel paths: the IDT setup, IOAPIC entries setup, the |
|
2555 |
GDT, the bootsector, early kernel boot setup, printk() code, etc. |
|
2556 |
|
|
2557 |
Funnily, cause I had a 'global' view about the kernel while doing this deep |
|
2558 |
inspection, 9 issues scattered around was discovered that needs to be fixed. |
|
2559 |
Unfortunately, fixing those 9 issues didn't really solve our main problem, but |
|
2560 |
it was nice nonetheless to discover them. |
|
2561 |
|
|
2562 |
* So we're now in the fifth day, and nothing is clear. Hopelessly, I swap our |
|
2563 |
printk code with FreeBSD's one. Things work fine, but once we enable interrupts |
|
2564 |
and the PIT in monotonic mode, similar erroneous behavior appears. Swapping |
|
2565 |
the FreeBSD code with the Linux code, and again, leading to the same failures. |
|
2566 |
|
|
2567 |
So I was right, something 'deep' is messed up, not just the nature of the |
|
2568 |
interrupted code. |
|
2569 |
|
|
2570 |
Another hopeless try was swapping the data segment descriptor index @ the GDT |
|
2571 |
from 0x10 to 0x18. Something interesting was discovered; the page faults |
|
2572 |
addresses occurs from values originating from %cs, %ss, and %rflags; it was |
|
2573 |
completely repeatable. Sounds familiar? Yes, they should! Those are the values |
|
2574 |
pushed by the CPU before invoking IRQ handlers! |
|
2575 |
|
|
2576 |
So somehow, registers that get dereferenced get the wrong values from the |
|
2577 |
state pushed and popped by the CPU before and after invoking an IRQ handler. |
|
2578 |
Now _this_ is an interesting discovery. |
|
2579 |
|
|
2580 |
So, how the %ss, %cs, and %rflags values got 'leaked' to our general-purpose |
|
2581 |
registers? The stack _must_ have been corrupted somehow .. |
|
2582 |
|
|
2583 |
* Trying different directions, we minimize the timer handler even further only to |
|
2584 |
the 'iretq' x86 op, but how do we acknowledge the local APIC this way? |
|
2585 |
|
|
2586 |
In fact, we dont. I just masked all the IOAPIC entries, and let the PIT trigger |
|
2587 |
through the PIC. The PIC was then programmed in 'Automatic EOI' mode, where no |
|
2588 |
acknowledgment was needed. This reduced the handler only to: |
|
2589 |
|
|
2590 |
iretq |
|
2591 |
|
|
2592 |
Even with this super-minimal handler, the bug gets triggered :( |
|
2593 |
|
|
2594 |
It was also interesting that, again, the #PF handler reported bad memory values |
|
2595 |
resembling %rflags, %ss, etc. |
|
2596 |
|
|
2597 |
* So after minimizing the timer handler to the absolute architecturally minimum, |
|
2598 |
we try to minimize the faulting code even further. After some hours of |
|
2599 |
disassembly, I find that at -O1 the VGA code alone (without the printk() code |
|
2600 |
that called it) triggers the problem. |
|
2601 |
|
|
2602 |
After even further inspection and disassembly (and wasted hours), I find that |
|
2603 |
not only we can limit the code to VGA writing, but only to the string function |
|
2604 |
memcpy(). |
|
2605 |
|
|
2606 |
So, now, all I have is the timer interrupt and an infinite loop containing |
|
2607 |
memcpy, and the code still fails! |
|
2608 |
|
|
2609 |
And, yes, the memcpy() code was valid (ehem, Dijkestra's 'Humble Programmer') |
|
2610 |
|
|
2611 |
* So I go crazy stripping the entire kernel. All what was left was code that |
|
2612 |
re-programs the PIC, the PIT setup code, and the string function memcpy() |
|
2613 |
and the code still failed. |
|
2614 |
|
|
2615 |
That was, well, weird. We really have nothing now except a very small number of |
|
2616 |
lines of code and I don't know what to do. |
|
2617 |
|
|
2618 |
Once we enable interrupts, the memcpy() function fail with similar patterns to |
|
2619 |
the one described above (page faults, with faulting memory values similar to |
|
2620 |
%ss and %rflags) |
|
2621 |
|
|
2622 |
Well, you can imagine the state I'm in after all those days. |
|
2623 |
|
|
2624 |
* Another step for more explicitness was disabling the PIT, and triggering its |
|
2625 |
handler manually using software interrupts. I've inject software interrupts |
|
2626 |
around strategic C statements like '*src = *dst', etc. |
|
2627 |
|
|
2628 |
* So at the middle of the middle of the sixth day, I disassemble the memcpy() |
|
2629 |
function, draw the stack on paper, and trace. Now something became directly |
|
2630 |
apparent. Notice those instructions (gcc -O0): |
|
2631 |
|
|
2632 |
/* Classical stack setup */ |
|
2633 |
ffffffff80109c88: 55 push %rbp |
|
2634 |
ffffffff80109c89: 48 89 e5 mov %rsp,%rbp |
|
2635 |
ffffffff80109c8c: 48 89 7d e8 mov %rdi,-0x18(%rbp) |
|
2636 |
ffffffff80109c90: 48 89 75 e0 mov %rsi,-0x20(%rbp) |
|
2637 |
|
|
2638 |
/* Bochs magic breakpoint */ |
|
2639 |
ffffffff80109caa: 66 87 db xchg %bx,%bx |
|
2640 |
|
|
2641 |
/* Our manual software interrupt */ |
|
2642 |
ffffffff80109cad: cd f0 int $0xf0 |
|
2643 |
|
|
2644 |
/* Failing code, specially last line */ |
|
2645 |
ffffffff80109caf: 48 8b 45 f8 mov -0x8(%rbp),%rax |
|
2646 |
ffffffff80109cb3: 0f b6 10 movzbl (%rax),%edx |
|
2647 |
ffffffff80109cb6: 48 8b 45 f0 mov -0x10(%rbp),%rax |
|
2648 |
ffffffff80109cba: 88 10 mov %dl,(%rax) |
|
2649 |
|
|
2650 |
Notice something? Look again. See those `mov -0x8(%rbp), %rax' and |
|
2651 |
`mov -0x10(%rbp), %rax) ops? |
|
2652 |
|
|
2653 |
Now check the top of the function, especially the stack pointer setup. Can't |
|
2654 |
you notice something? |
|
2655 |
|
|
2656 |
YES, GCC used areas of the stack _B E L O W_ the stack pointer to keep parts |
|
2657 |
of the leaf function local state. Now when the interrupt was triggered at |
|
2658 |
`int $0xf0', the CPU rightfully pushed CPU state in that order: |
|
2659 |
|
|
2660 |
%ss |
|
2661 |
%rsp |
|
2662 |
%rflags |
|
2663 |
%cs |
|
2664 |
%rip |
|
2665 |
|
|
2666 |
Which meant that parts of that function local state get corrupted through |
|
2667 |
the implicit CPU stack usage! |
|
2668 |
|
|
2669 |
This is also the reason we found faulting addresses of values resembling %ss |
|
2670 |
and %rflags. Our processor modified our pointers with those! |
|
2671 |
|
|
2672 |
* Wow, THAT WAS INTERESTING and worthwhile to invest the week in. Inspecting the |
|
2673 |
AMD64 ABI to see why GCC uses this behavior, which is _certainly_ interrupt and |
|
2674 |
signals unsafe, I found the answer: it's 'The Red Zone'. |
|
2675 |
|
|
2676 |
Now I've really had enough for today, see the discussion of this zone at the |
|
2677 |
AMD64 ABI document. |
|
2678 |
|
|
2679 |
* Solution: |
|
2680 |
|
|
2681 |
Either: |
|
2682 |
[1] Ignore the 128-byte region in the IRQ handler OR |
|
2683 |
[2] Let the compiler avoid using this Red Zone |
|
2684 |
|
|
2685 |
We can't really use [1], since the CPU pushes its state behind our backs. We |
|
2686 |
can't let it wind 128-bytes of the stack. Also the kernel stack is a very |
|
2687 |
valuable resource, and we shouldn't waste 128-Bytes of it to save an x86 op |
|
2688 |
or two. |
|
2689 |
|
|
2690 |
Thus, we pick the second option. That's also what the mainstream folks like |
|
2691 |
{Open, Free}BSD and Linux do (which I discovered too lately). |
|
2692 |
|
|
2693 |
* Lessons: |
|
2694 |
|
|
2695 |
- When faced with triple faults, inspect the following: |
|
2696 |
- the IDT table: if it has been overwritten, then the machine will definitely |
|
2697 |
triple fault. |
|
2698 |
- stack sanity: If the stack was not sane, then pushes and pops will lead to |
|
2699 |
faults, leading to triple faults. |
|
2700 |
- MM structures sanity. If the pagetables have been erroneously overwritten, |
|
2701 |
the stack will get erroneously mapped. |
|
2702 |
- Kernel stack overflow: corrupt stack = triple faults |
|
2703 |
|
|
2704 |
- If you noticed that an interrupt randomly triggers a problem in the kernel, |
|
2705 |
it can be good to disable interrupts (avoid distractions), then find the most |
|
2706 |
affected code path and _manually_ inject interrupts there using software IRQs. |
|
2707 |
|
|
2708 |
- The best possible way to let IRQ handlers be as minimal as possible is to use |
|
2709 |
the PIC in Automatic EOI (AEOI) mode. They will get the handler minimized to |
|
2710 |
the 2 byte `iretq' x86-64 op. |
|
2711 |
|
|
2712 |
- Let code binary single-stepping be our last debugging resort to make sure we |
|
2713 |
understand the algorithms and code we've just written well enough. |
| c65c8b1 by Ahmed S. Darwish at 2010-04-25 |
2714 |
|
| b3bbe1d by Ahmed S. Darwish at 2010-11-16 |
2715 |
|
| c65c8b1 by Ahmed S. Darwish at 2010-04-25 |
2716 |
Debugging periodic timers using statistical data: |
|
2717 |
------------------------------------------------- |
|
2718 |
|
|
2719 |
* A problem appeared while running the scheduler with the APIC timer instead of |
|
2720 |
the PIT. Using observation, it was clear that the running intervals given |
|
2721 |
to processes by the APIC timer are twice as long as the ones given by the PIT. |
|
2722 |
Both timers were supposedly programmed to produce 40 millisecond intervals, |
|
2723 |
so where's the fault? |
|
2724 |
|
|
2725 |
Note: we are talking about monotonic PIT and APIC interrupts since those are |
|
2726 |
the only ones used by the scheduler; we don't use the one-shot variants here. |
|
2727 |
|
|
2728 |
* One of the best ways I found to debug this problem, instead of depending on |
|
2729 |
shaky heuristics, is to store the occurrence of the timer ticks interrupts in |
|
2730 |
a statistical manner. |
|
2731 |
|
|
2732 |
An example of this statistical method would be finding the number of timer |
|
2733 |
ticks occurrence after a sequence of independently-programmed delay intervals. |
|
2734 |
|
|
2735 |
* How to do so without sacrificing simplicity? |
|
2736 |
|
|
2737 |
Setup the scheduler periodic timer through the PIT or the APIC: |
|
2738 |
`pit_monotonic(40)' OR `apic_monotonic(40)'. |
|
2739 |
|
|
2740 |
Setup a counter representing the number of scheduler timer ticks so far: |
|
2741 |
|
|
2742 |
.globl ticks_counter |
|
2743 |
ticks_counter: |
|
2744 |
.quad 0 |
|
2745 |
|
|
2746 |
.globl timer_handler |
|
2747 |
timer_handler: |
|
2748 |
... |
|
2749 |
incq ticks_counter |
|
2750 |
... |
|
2751 |
|
|
2752 |
Then, do _independent_ time delays and store the number of timer ticks so far |
|
2753 |
after each delay. This way, we can have a statistical comparison between the |
|
2754 |
two timers, the one used for the delay and the one used for the ticks. An |
|
2755 |
example of this would be: |
|
2756 |
|
|
2757 |
/* ticks[i]: number of scheduler timer ticks after the `i'th delay |
|
2758 |
* interval, having in mind that we've setup both the ticks and delay |
|
2759 |
* timers with the same time priod. |
|
2760 |
* |
|
2761 |
* If `1 2 3 4 5 6 ..' is printed, then the two timers are in harmony. |
|
2762 |
* Having the sequence '3 5 7 9 11 ..' means that the ticks timer runs |
|
2763 |
* twice as fast as the delay one. */ |
|
2764 |
for (int i = 0; i < 100; i++) |
|
2765 |
pit_mdelay(52), ticks[i] = ticks_counter; |
|
2766 |
for (int j = 0; j < 100; j++) |
|
2767 |
printk("%d ", ticks[j]); |
|
2768 |
|
|
2769 |
Comparing the statistical results, we found that the PIT in monotonic mode |
|
2770 |
ticks twice as fast as the PIT programmed in delay mode. It also ticked twice |
|
2771 |
as fast as the APIC timer in delay mode. This heavily suggests something wrong |
|
2772 |
in our PIT monotonic-mode setup! |
|
2773 |
|
|
2774 |
Meanwhile, setting up the scheduler ticks using the APIC timer ran in perfect |
|
2775 |
harmony with the PIT-programmed delays. This heavily re-enforces above paragr- |
|
2776 |
aph reasoned speculation: something is wrong in our PIT periodic mode code. |
|
2777 |
|
|
2778 |
* Now that the problem was isolated, the error was a misunderstanding of the |
|
2779 |
PIT's mode-3 behavior; mode-2 fitted our goals better. |
|
2780 |
|
|
2781 |
Interestingly, I found this method really helpful for testing periodic timers |
|
2782 |
code in general. We already test timer delays code by calibrating its produced |
|
2783 |
delays against wall clock time; this method complement the tests by also |
|
2784 |
testing the monotonic/periodic interrupts code. |
| b4776d7 by Ahmed S. Darwish at 2010-04-27 |
2785 |
|
|
2786 |
Note that for testing, specifying a large delay period catches more periodic |
|
2787 |
timers accuracy errors. Values from 50 ms and up proved to be good. |
|
2788 |
|
|
2789 |
* A similar test-case for periodic timers is comparing their statistical |
|
2790 |
ticks occurrences relative to each other. This will guarantee that all system |
|
2791 |
periodic timers really measure the same concept of time! |
| 5bc1d68 by Ahmed S. Darwish at 2010-11-16 |
2792 |
|
|
2793 |
|
|
2794 |
The Naughty Multiply: |
|
2795 |
--------------------- |
|
2796 |
|
|
2797 |
* References: |
|
2798 |
- Intel Instruction Set Reference, A-M, vol. 2A, MUL. |
|
2799 |
- GCC manual, 'Using the GNU Compiler Collection' |
|
2800 |
|
|
2801 |
* So I was writing a casual number of test-cases, where this sequence of code: |
|
2802 |
|
|
2803 |
void run_tests(void) { |
|
2804 |
memset(_arr, 0xFF, 0x100); |
|
2805 |
__memcpy_forward(_arr, _arr + 20, 10); |
|
2806 |
__memcpy_forward(_arr + 20, _arr, 10); |
|
2807 |
} |
|
2808 |
|
|
2809 |
produced a NULL-dereference exception. As printed by our exception handler: |
|
2810 |
|
|
2811 |
Exception: vector=14, errcode=0x2, %cs=0x8, %rip=0xffffffff8010b6f7, |
|
2812 |
%ss=0x0, %rsp=0xffffffff80104ff0, %rflags=0x200202, %cr2=0x0, |
|
2813 |
%cr3=0x3ffef000, %cr4=0x20 |
|
2814 |
|
|
2815 |
It was weird: the code of these two functions was a very simple and |
|
2816 |
straightforward inline GCC assembly. |
|
2817 |
|
|
2818 |
The problem was clearly related somehow to registers-usage pressure (the |
|
2819 |
compiler register allocator decisions) in the mother function run_tests(). |
|
2820 |
Removing one memcpy() above, or adding a printk() in a random place in that |
|
2821 |
function would make the NULL exception disappear. |
|
2822 |
|
|
2823 |
* Disassembling the area around the faulting opcode, we found: |
|
2824 |
|
|
2825 |
ffffffff8010b6a0 <string_run_tests>: |
|
2826 |
ffffffff8010b6a0: 48 c7 c2 20 fb 10 80 mov $0xffffffff8010fb20,%rdx |
|
2827 |
ffffffff8010b6a7: 31 f6 xor %esi,%esi |
|
2828 |
ffffffff8010b6a9: b8 ff 00 00 00 mov $0xff,%eax |
|
2829 |
ffffffff8010b6ae: 48 89 d7 mov %rdx,%rdi |
|
2830 |
ffffffff8010b6b1: b9 00 01 00 00 mov $0x100,%ecx |
|
2831 |
ffffffff8010b6b6: 48 89 ce mov %rcx,%rsi |
|
2832 |
ffffffff8010b6b9: 48 83 e1 07 and $0x7,%rcx |
|
2833 |
ffffffff8010b6bd: f3 aa rep stos %al,%es:(%rdi) (+) |
|
2834 |
ffffffff8010b6bf: 48 89 f1 mov %rsi,%rcx |
|
2835 |
ffffffff8010b6c2: 48 c1 e9 03 shr $0x3,%rcx |
|
2836 |
ffffffff8010b6c6: 48 89 c6 mov %rax,%rsi |
|
2837 |
ffffffff8010b6c9: 48 b8 01 01 01 01 01 mov $0x101010101010101,%rax |
|
2838 |
ffffffff8010b6d0: 01 01 01 |
|
2839 |
ffffffff8010b6d3: 48 f7 e6 mul %rsi |
|
2840 |
ffffffff8010b6d6: f3 48 ab rep stos %rax,%es:(%rdi) (++) |
|
2841 |
ffffffff8010b6d9: 41 b9 0a 00 00 00 mov $0xa,%r9d |
|
2842 |
ffffffff8010b6df: 31 c0 xor %eax,%eax |
|
2843 |
ffffffff8010b6e1: 48 c7 c6 34 fb 10 80 mov $0xffffffff8010fb34,%rsi |
|
2844 |
ffffffff8010b6e8: 48 89 d7 mov %rdx,%rdi |
|
2845 |
ffffffff8010b6eb: 44 89 c9 mov %r9d,%ecx |
|
2846 |
ffffffff8010b6ee: 41 89 c0 mov %eax,%r8d |
|
2847 |
ffffffff8010b6f1: 41 89 c8 mov %ecx,%r8d |
|
2848 |
ffffffff8010b6f4: 83 e1 07 and $0x7,%ecx |
|
2849 |
ffffffff8010b6f7: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi) |
|
2850 |
|
|
2851 |
From our exception handler, we can see that the faulting instruction |
|
2852 |
is the above 'rep movsb'. We can also deduce that (+) and (++) ops are |
|
2853 |
produced for memset(), and the rest are the memcpy() opcodes. |
|
2854 |
|
|
2855 |
* Tracing the code by hand, I reached the faulty instruction without |
|
2856 |
catching the NULL dereference. A null dereference at the 'movsb' |
|
2857 |
instruction can only mean that either %rsi = 0, or %rdi = 0, or both |
|
2858 |
had magically turned to zeroes. |
|
2859 |
|
|
2860 |
Wondering about the state, I overwrote this movsb instruction in the |
|
2861 |
final kernel binary with a Bochs magic breakpoint code (66 87 db) |
|
2862 |
using a simple hex editor. |
|
2863 |
|
|
2864 |
After moving to the debugging console, thanks to the magic breakpoint, |
|
2865 |
here was the Bochs registers dump: |
|
2866 |
|
|
2867 |
(0) Magic breakpoint |
|
2868 |
CPU0: |
|
2869 |
rax: 0x00000000:00000000 rcx: 0x00000000:00000002 |
|
2870 |
rdx: 0x00000000:00000000 rbx: 0xffffffff:801129a8 |
|
2871 |
rsp: 0xffffffff:80104ff0 rbp: 0x00000000:00000000 |
|
2872 |
rsi: 0xffffffff:8010fb34 rdi: 0x00000000:00000000 |
|
2873 |
r8 : 0x00000000:0000000a r9 : 0x00000000:0000000a |
|
2874 |
r10: 0x0f200f20:0f200f20 r11: 0x66656463:62613938 |
|
2875 |
r12: 0x00000000:00000000 r13: 0x00000000:00000000 |
|
2876 |
r14: 0x00000000:00000000 r15: 0x00000000:00000000 |
|
2877 |
rip: 0xffffffff:8010b6fa |
|
2878 |
eflags 0x00200202: ID vip vif ac vm rf nt IOPL=0 of df IF tf sf zf af pf cf |
|
2879 |
|
|
2880 |
One thing was clearly noted: for some reason, %rdx (and %rdi by consequence) |
|
2881 |
have magically turned to zeroes, leading to the NULL dereference! |
|
2882 |
|
|
2883 |
* Finding that no explicit opcode zeroed %rdx above, then the zeroing |
|
2884 |
must have been a side-effect of some instruction above the faulty |
|
2885 |
dereference. Looking at the Intel Manual for cues, I found that |
|
2886 |
opcode: it was the 'mul %rsi' one. |
|
2887 |
|
|
2888 |
So why didn't GCC know about this? Because _I_ did not know about |
|
2889 |
it while coding, thus I did not explicitly tell GCC that this nice |
|
2890 |
register will get implicitly clobbered. |
|
2891 |
|
|
2892 |
GCC, naively thought that the register value was unmodified, and |
|
2893 |
rightfully re-used that register value at address 0x8010b6e8. GCC |
|
2894 |
is really competent wrt to optimizations. The faulty code was: |
|
2895 |
|
|
2896 |
asm volatile ( |
|
2897 |
[...] |
|
2898 |
"mov %[ch], %[tmp];" |
|
2899 |
"mov $0x0101010101010101, %[ch];" |
|
2900 |
"mul %[tmp];" |
|
2901 |
"rep stosq;" |
|
2902 |
: [tmp] "=&r"(tmp), [dst] "=&D"(d0), [ch] "=&a"(uch), |
|
2903 |
[len] "=&c"(ulen) |
|
2904 |
: "[dst]"(dst), "[ch]"(uch), "[len]"(ulen), "[tmp]"(tmp) |
|
2905 |
: "memory"); |
|
2906 |
|
|
2907 |
As you can see, %rdx was not mentiond as a clobbered register, and |
|
2908 |
not even as an output register. GCC had all the rights to believe |
|
2909 |
that this register value was not modified in that snippet. |
|
2910 |
|
|
2911 |
In another way, the clobber list tells gcc which registers (or |
|
2912 |
memory) are changed by the snippet, but not listed as an output. |
|
2913 |
|
|
2914 |
* The simple solution was adding the 'rdx' register to the assembly |
|
2915 |
snippet clobber list. Now the compiler output looks like: |
|
2916 |
|
|
2917 |
ffffffff8010b6a0 <string_run_tests>: |
|
2918 |
ffffffff8010b6a0: 49 c7 c0 20 fb 10 80 mov $0xffffffff8010fb20,%r8 |
|
2919 |
[...] |
|
2920 |
ffffffff8010b6e8: 48 89 d7 mov %r8,%rdi |
|
2921 |
[...] |
|
2922 |
ffffffff8010b6f7: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi) |
|
2923 |
|
|
2924 |
where the compiler uses the new register %r8, entirely avoiding |
|
2925 |
the %rdx register in the process. |
|
2926 |
|
|
2927 |
* From investigation, it seems GCC need to also be notified if the |
|
2928 |
assembly snippet alters the condition code register %rflags. We now |
|
2929 |
handle this by adding `cc' to the clobber list while using ops like |
|
2930 |
'andq', 'popfq', 'decq', and 'shrq'. |
|
2931 |
|
|
2932 |
Cases where we forgot to clobber 'rcx' after using opcodes like |
|
2933 |
'rep XX' were also caught. For this, we only need to set 'rcx' as a |
|
2934 |
double input+output register; GCC will infer the rest. |
|
2935 |
|
|
2936 |
Another faulty case was having 'rsi' be only set as an input register |
|
2937 |
amid the existence of 'movsq' in the assembly snippet. |
|
2938 |
|
|
2939 |
* Lesson: |
|
2940 |
|
|
2941 |
While I did study the privileged x86 ring0 opcodes carefully (mainly |
|
2942 |
Intel's manual vol. 3A, and AMD vol. 2), the basic x86 opcodes were |
|
2943 |
learned 'on the go'. The problem we faced today is a small price for |
|
2944 |
that; I should re-read Intel's vol. 1 docs way more carefully. |
|
2945 |
|
|
2946 |
To make-up for this, and also using the Intel manuals, I investigated |
|
2947 |
the side effects of __every__ opcode used in the kernel inlined |
|
2948 |
assembly snippets. That's how I caught the %rflags, %rcx, and %rsi |
|
2949 |
cases mentioned in the above point! |