Commit 34429b5b6638e1a787b65bc8909928175fb30c49.patch
- Diff rendering mode:
- inline
- side by side
CuteNotes.txt
(296 / 0)
|   | |||
| 1213 | 1213 | segment. They all function using the same registers and simple circuity. Intel | |
| 1214 | 1214 | elaborates: "Note that if the segment registers are not reloaded, execution | |
| 1215 | 1215 | continues using the descriptor attributes loaded during protected mode." | |
| 1216 | |||
| 1217 | |||
| 1218 | Using light debugging techniques: | ||
| 1219 | --------------------------------- | ||
| 1220 | |||
| 1221 | * References: | ||
| 1222 | - Ulrich Drepper, "How to Write Shared Libraries", 2.1 Data Definitions, | ||
| 1223 | http://www.webcitation.org/5mkmxB9Fx | ||
| 1224 | - Ian Lance Taylor, "Linkers part 5", http://www.airs.com/blog/archives/42, | ||
| 1225 | http://www.webcitation.org/5mkmiZ8FN | ||
| 1226 | - Binutils bug report, "ELF linker loads member of archive for common symbol", | ||
| 1227 | http://sources.redhat.com/bugzilla/show_bug.cgi?id=1811 | ||
| 1228 | http://www.webcitation.org/5mknnyUe2 | ||
| 1229 | - Minor bug-report at http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42726 | ||
| 1230 | |||
| 1231 | * Today while writing page allocator test code, I faced a very strange bug: once | ||
| 1232 | any of the kernel's text, data, or bss sections exceeds something close to 20 | ||
| 1233 | KBs, the MPtables code panic()s at different and unpredictable locations. The | ||
| 1234 | failure is something like: | ||
| 1235 | |||
| 1236 | PANIC: MPTables - 1217939456 CPUs found, only 32 supported | ||
| 1237 | |||
| 1238 | * To give some context, there's a system global called 'nr_cpus' representing | ||
| 1239 | the number of CPUs in the system. So far, we know the number of CPUs from the | ||
| 1240 | MP tables Conf entries, where each CPU entry represents a CPU in the system. | ||
| 1241 | Thus, finding 8 CPU entries means that the machine has 8 usable cores. | ||
| 1242 | |||
| 1243 | This global var 'nr_cpus' is only increased at one and _only_ one place in | ||
| 1244 | the system, at the MP tables parsing code: | ||
| 1245 | |||
| 1246 | /* We call this for each MP CPU entry in the system */ | ||
| 1247 | static void parse_cpu(void *addr) | ||
| 1248 | { | ||
| 1249 | struct mpc_cpu *cpu = addr; | ||
| 1250 | |||
| 1251 | if (!cpu->enabled) | ||
| 1252 | return; | ||
| 1253 | |||
| 1254 | if (nr_cpus >= CPUS_MAX) { | ||
| 1255 | panic("MP - %d CPUs found, only %d supported", | ||
| 1256 | nr_cpus, CPUS_MAX); | ||
| 1257 | } | ||
| 1258 | ... | ||
| 1259 | |||
| 1260 | ++nr_cpus; | ||
| 1261 | } | ||
| 1262 | |||
| 1263 | * Inspecting the BIOS MP tables more closely, the MP pointer table is at 0xfbb60 | ||
| 1264 | and its checksum and signature are valid. The MP configuration table header was | ||
| 1265 | also valid and checksummed at 0xfba00. | ||
| 1266 | |||
| 1267 | The problem, it seems, is that the conf table entries got somehow corrupted | ||
| 1268 | (thus the wrong number of cpus value), or there's a bug in our parser code. | ||
| 1269 | |||
| 1270 | * Unfortunately, and as usual for such problems, inserting printk()s at any | ||
| 1271 | location skews the results randomly, making any printk() useless. | ||
| 1272 | |||
| 1273 | So we try less invasive debugging techniques like writing to a memory area | ||
| 1274 | using a _single_ inlined asm instruction, and inspecting this memory area | ||
| 1275 | using an emulator afterwars. This worked and made the problem repeatable! | ||
| 1276 | |||
| 1277 | * First, and to test the parser validity, we choose a memory location that we | ||
| 1278 | know before hand it's going to be zero; I chose physical 0x500. In parse_cpu | ||
| 1279 | we add a very small asm instruction as follows: | ||
| 1280 | |||
| 1281 | /* 8-byte instruction */ | ||
| 1282 | asm volatile("incq (0xffffffff80000500);"); | ||
| 1283 | |||
| 1284 | if (nr_cpus >= CPUS_MAX) { | ||
| 1285 | panic("MP - %d CPUs found, only %d supported", | ||
| 1286 | nr_cpus, CPUS_MAX); | ||
| 1287 | } | ||
| 1288 | |||
| 1289 | And thankfully this instruction was light enough, so the system paniced | ||
| 1290 | at the same point. Our panic() just calls printk() and hlt the system. So | ||
| 1291 | after the system halts, we move to the qemu console handler and check the | ||
| 1292 | value at address 0x500. Good for us, we found the value there to be 1, so | ||
| 1293 | it seems our parser is behaving correctly. | ||
| 1294 | |||
| 1295 | * Now the problem is getting more mysterious. If the parser is behaving | ||
| 1296 | correctly, and the value at address 0x500 is completely valid, how come | ||
| 1297 | 'nr_cpus' have those very strange values? Remember that we put the 'incl | ||
| 1298 | 0x500' instruction at the same place of '++nr_cpus', so the two addresses | ||
| 1299 | should have the exact same values or something is _completely_ messed up. | ||
| 1300 | |||
| 1301 | The 'nr_cpus' variable still yielded the value of "1217939456". | ||
| 1302 | |||
| 1303 | * So what did really corrupt the BSS-stored variable nr_cpus at this _very_ | ||
| 1304 | early stage? | ||
| 1305 | |||
| 1306 | The very first thing we do in the C part of the kernel is to clear the entire | ||
| 1307 | BSS section to zero. So assuming the kernel was copied/setup correctly (which | ||
| 1308 | we're not very sure of yet), the problem probably occurs _between_ clearing | ||
| 1309 | the BSS section and parsing the MP tables. | ||
| 1310 | |||
| 1311 | OK, so we'll try to check the value of the 'nr_cpus' variable at different | ||
| 1312 | places in the kernel between clearing BSS and MP parsing using this 14-byte | ||
| 1313 | code segment: | ||
| 1314 | |||
| 1315 | asm volatile ("movl %0, %%eax;" | ||
| 1316 | "movl %%eax, (0xffffffff80000500); hlt;" | ||
| 1317 | : : "m"(nr_cpus) : "memory"); | ||
| 1318 | |||
| 1319 | And then use the qemu console to read the address 0x500 'xp 0x500'. | ||
| 1320 | |||
| 1321 | * Very important discovery: Adding the above asm fragment _directly_ after | ||
| 1322 | clearning the BSS section yielded the same corrupt value of "1217939456" | ||
| 1323 | for the 'nr_cpus' variable! Something is seriously wrong cause 'nr_cpus' | ||
| 1324 | is clearly in the BSS section: | ||
| 1325 | |||
| 1326 | $ readelf -a kernel.elf | ||
| 1327 | |||
| 1328 | Section Headers: | ||
| 1329 | [Nr] Name Type Address Offset | ||
| 1330 | Size EntSize Flags Link Info Align | ||
| 1331 | ... | ||
| 1332 | [ 8] .bss NOBITS ffffffff8010f5e0 0010f5e0 | ||
| 1333 | 00000000000028c8 0000000000000000 WA 0 0 32 | ||
| 1334 | |||
| 1335 | Symbol table '.symtab' contains 183 entries: | ||
| 1336 | Num: Value Size Type Bind Vis Ndx Name | ||
| 1337 | ... | ||
| 1338 | 151: ffffffff80111d80 4 OBJECT GLOBAL DEFAULT 8 nr_cpus | ||
| 1339 | |||
| 1340 | * I think we're getting closer. Notice from above ELF analysis, the BSS | ||
| 1341 | section starts at 0xffffffff8010f600, and it has a length of 0x28c8: | ||
| 1342 | |||
| 1343 | This means that from reading the ELF: | ||
| 1344 | |||
| 1345 | BSS start = 0xffffffff8010f5e0 | ||
| 1346 | BSS end = 0xffffffff8010f5e0 + 0x28c8 = 0xffffffff800111ea8 | ||
| 1347 | 'nr_cpus' is at 0xffffffff80111d80 | ||
| 1348 | |||
| 1349 | But very strangely, here are the results of printing __bss_start and | ||
| 1350 | __bss_end symbols exported by our kernel linker script we wrote since the | ||
| 1351 | dawn of time: | ||
| 1352 | |||
| 1353 | BSS start = 0xffffffff8010f5e0 | ||
| 1354 | !! BSS end = 0xffffffff801115e0 | ||
| 1355 | 'nr_cpus' is outside of this region! | ||
| 1356 | |||
| 1357 | As you see, somehow the kernel sees a _trimmed_ BSS section from the | ||
| 1358 | linker script, and to add insult to injury, 'nr_cpus' is not included | ||
| 1359 | in that trimmed region. | ||
| 1360 | |||
| 1361 | * Things began to make sense: the kernel obviously has a misguided view about | ||
| 1362 | its bss section. This is disastrous cause at this level, we're the ones | ||
| 1363 | responsible of clearing this BSS area before use. Ofcourse, we directly move | ||
| 1364 | to our linker script BSS section to see what's wrong: | ||
| 1365 | |||
| 1366 | SECTIONS | ||
| 1367 | { | ||
| 1368 | ... | ||
| 1369 | .bss : { | ||
| 1370 | __bss_start = .; | ||
| 1371 | *(EXCLUDE_FILE (*head.o *e820.o) .bss) | ||
| 1372 | __bss_end = .; | ||
| 1373 | } | ||
| 1374 | __kernel_end = .; | ||
| 1375 | ... | ||
| 1376 | } | ||
| 1377 | |||
| 1378 | Things should be fine and everything is at its place, but the corruption still | ||
| 1379 | exists. So I tried a small modification of: | ||
| 1380 | |||
| 1381 | .bss : { | ||
| 1382 | __bss_start = .; | ||
| 1383 | *(EXCLUDE_FILE (*head.o *e820.o) .bss) | ||
| 1384 | } | ||
| 1385 | __bss_end = .; | ||
| 1386 | __kernel_end = .; | ||
| 1387 | |||
| 1388 | And things worked magically! Ofcourse leaving the issue as this will be immature | ||
| 1389 | as we didn't understand yet why this trivial change produced this huge effect. | ||
| 1390 | |||
| 1391 | * Trying to debug the issue further, we restore the linker script to its original | ||
| 1392 | state and let the linker be more verbose by using the -M flag. Inspecting the | ||
| 1393 | output, a very interesting thing came up when the linker displayed the final | ||
| 1394 | memory map: | ||
| 1395 | |||
| 1396 | ('...' denotes trimmed output lines) | ||
| 1397 | |||
| 1398 | Allocating common symbols // mm, .. | ||
| 1399 | Common symbol size file | ||
| 1400 | |||
| 1401 | test_symbol 0x8 mmap.o | ||
| 1402 | ioapic_descs 0x60 ioapic.o | ||
| 1403 | nr_ioapics 0x4 ioapic.o | ||
| 1404 | nr_cpus 0x4 smpboot.o | ||
| 1405 | mp_irqs 0x700 mptables.o | ||
| 1406 | nr_mpcirqs 0x4 mptables.o | ||
| 1407 | cpu_descs 0x100 smpboot.o | ||
| 1408 | |||
| 1409 | Linker script and memory map | ||
| 1410 | |||
| 1411 | 0x0000000000100000 . = 0x100000 | ||
| 1412 | |||
| 1413 | .text.head 0x0000000000100000 0x310 | ||
| 1414 | ... | ||
| 1415 | .bss 0xffffffff8010f5e0 0x28c8 load address 0x000000000010692c | ||
| 1416 | 0xffffffff8010f5e0 __bss_start = . | ||
| 1417 | *(EXCLUDE_FILE(*e820.o *head.o) .bss) | ||
| 1418 | .bss 0xffffffff8010f5e0 0x0 common.o | ||
| 1419 | .bss 0xffffffff8010f5e0 0x0 main.o | ||
| 1420 | .bss 0xffffffff801105e0 0x0 i8259.o | ||
| 1421 | .bss 0xffffffff801105e0 0x0 apic.o | ||
| 1422 | ... | ||
| 1423 | .bss 0xffffffff80110610 0x0 lib/string.o | ||
| 1424 | .bss 0xffffffff80110620 0xfc0 lib/printf.o | ||
| 1425 | 0xffffffff801115e0 __bss_end = . | ||
| 1426 | COMMON 0xffffffff801115e0 0x64 ioapic.o | ||
| 1427 | 0xffffffff801115e0 ioapic_descs // Outside of BSS! | ||
| 1428 | 0xffffffff80111640 nr_ioapics // Outside of BSS! | ||
| 1429 | *fill* 0xffffffff80111644 0x1c 00 | ||
| 1430 | COMMON 0xffffffff80111660 0x704 mptables.o | ||
| 1431 | 0xffffffff80111660 mp_irqs // Outside of BSS! | ||
| 1432 | 0xffffffff80111d60 nr_mpcirqs // Outside of BSS! | ||
| 1433 | *fill* 0xffffffff80111d64 0x1c 00 | ||
| 1434 | COMMON 0xffffffff80111d80 0x120 smpboot.o | ||
| 1435 | 0xffffffff80111d80 nr_cpus // Outside of BSS! | ||
| 1436 | 0xffffffff80111da0 cpu_descs // Outside of BSS! | ||
| 1437 | COMMON 0xffffffff80111ea0 0x8 mmap.o | ||
| 1438 | 0xffffffff80111ea0 test_symbol // Outside of BSS! | ||
| 1439 | 0xffffffff80111ea8 __kernel_end = . | ||
| 1440 | |||
| 1441 | So at last we got it, the linker script didn't put variables like nr_cpus, | ||
| 1442 | nr_ioapics etc in the BSS section, thus they were not cleared in our clear_bss() | ||
| 1443 | function. As you can see, those symbols are allocated past the __bss_end symbol. | ||
| 1444 | |||
| 1445 | * OK, so we inspect the LD manual for this 'COMMON' section, to find: | ||
| 1446 | |||
| 1447 | "A special notation is needed for common symbols, because in many object file | ||
| 1448 | formats common symbols do not have a particular input section. The linker treats | ||
| 1449 | common symbols as though they are in an input section named ‘COMMON’ ..." | ||
| 1450 | |||
| 1451 | * Unfortunately the documentation was a bit vague, and there was no explanation | ||
| 1452 | of what "common symbols" were really used for. So we dig deeper to find this | ||
| 1453 | great explanation from Ulrich's paper on shared libraries: | ||
| 1454 | |||
| 1455 | "Common variables are widely used in fortran, but they got used in C and C++ | ||
| 1456 | as well to work around mistakes of programmers. Since in the early days people | ||
| 1457 | used to drop the 'extern' keyword from variable definitions, in the same way | ||
| 1458 | it is possible to drop it from function declaration, the compiler often had | ||
| 1459 | multiple definitions of the same variable in different files. | ||
| 1460 | |||
| 1461 | To help the poor and clueless programmer, the C/C++ compiler normally generates | ||
| 1462 | common variables for uninitialized definitions such as `int foo;' | ||
| 1463 | |||
| 1464 | For common variables, there can be more than one definition, and they all get | ||
| 1465 | unified in one location in the output file .. their values does not need to be | ||
| 1466 | stored in the ELF file." | ||
| 1467 | |||
| 1468 | * And also from Ulrich, a description of -fno-common which is more accurate than | ||
| 1469 | gcc's manual: | ||
| 1470 | |||
| 1471 | "If the programmer uses the compiler command-line option -fno-common, the | ||
| 1472 | generated code will contain uninitialized variables instead of common variables | ||
| 1473 | if a vriable definition has no initializer. .. | ||
| 1474 | |||
| 1475 | The result at run-time is the same as for common variable, no value stored in | ||
| 1476 | the file. But the representation in the object file is different [BSS section], | ||
| 1477 | and it allows the linker to find multiple definitions and flag them as errors." | ||
| 1478 | |||
| 1479 | * And finally from Ian Lance Taylor's series on linkers: | ||
| 1480 | |||
| 1481 | "A section index of SHN_COMMON (0xfff2) indicates a common symbol. Common | ||
| 1482 | symbols were invented to handle Fortran common blocks, and they are also often | ||
| 1483 | used for uninitialized global variables in C. | ||
| 1484 | |||
| 1485 | A common symbol has unusual semantics. Common symbols have a value of zero, but | ||
| 1486 | set the size field to the desired size. If one object file has a common symbol | ||
| 1487 | and another has a definition, the common symbol is treated as an undefined | ||
| 1488 | reference. | ||
| 1489 | |||
| 1490 | If there is no definition for a common symbol, the program linker acts as though | ||
| 1491 | it saw a definition initialized to zero of the appropriate size. Two object | ||
| 1492 | files may have common symbols of different sizes, in which case the program | ||
| 1493 | linker will use the largest size. | ||
| 1494 | |||
| 1495 | Implementing common symbol semantics across shared libraries is a touchy | ||
| 1496 | subject, somewhat helped by the recent introduction of a type for common symbols | ||
| 1497 | as well as a special section index (see the discussion of symbol types below)." | ||
| 1498 | |||
| 1499 | (Also check nm(1) on "C"/COMMON symbols, and --warn-common on ld(1)) | ||
| 1500 | |||
| 1501 | * And by now everything is super-clear, and the fix is only adding the COMMON | ||
| 1502 | symbols within the output object BSS boundaries: | ||
| 1503 | |||
| 1504 | .bss : { | ||
| 1505 | __bss_start = .; | ||
| 1506 | *(EXCLUDE_FILE (*head.o *e820.o) .bss) | ||
| 1507 | *(EXCLUDE_FILE (*head.o *e820.o) COMMON) | ||
| 1508 | __bss_end = .; | ||
| 1509 | } | ||
| 1510 | |||
| 1511 | What a bug this was. |

