Commit 5cc7a1ded5e6b741657fea903edd1b2250eb12d4.patch
- Diff rendering mode:
- inline
- side by side
CuteNotes.txt
(254 / 0)
|   | |||
| 2049 | 2049 | ||
| 2050 | 2050 | - We may add asserts to memcpy() and friends assuring no such diastrous | |
| 2051 | 2051 | overlaps occur in the C side of the kernel. | |
| 2052 | |||
| 2053 | * Global lessons: | ||
| 2054 | |||
| 2055 | - Disassembly is a great tool for debugging kernel code, since a lot of things | ||
| 2056 | taken for granted at user-space is not guaranteed here anymore. Abstraction | ||
| 2057 | cut both ways indeed. | ||
| 2058 | |||
| 2059 | - It's _very important_ to deeply test each code path, no matter how trivial | ||
| 2060 | it seems (basic stuff, but sometimes we forget the basics). | ||
| 2061 | |||
| 2062 | - Memory dumping of critical kernel regions is really valuable; remember to use | ||
| 2063 | it as needed. | ||
| 2064 | |||
| 2065 | - It's important to test the code with different environment combinations: | ||
| 2066 | different optimization levels: GCC's -O {0, 1, 2, 3, s} | ||
| 2067 | different memory sizes: Qemu's -m {2, 3, .., 500, .., 2000, ..} | ||
| 2068 | different number of cores: -smp {1, 2, 3, ..} | ||
| 2069 | different CPU speeds: Bochs's ips= parameter | ||
| 2070 | different BIOSes: different versions of Bochs Bios, different PC forms, .. | ||
| 2071 | |||
| 2072 | * The Zone: | ||
| 2073 | ----------- | ||
| 2074 | |||
| 2075 | (Or the bug that took me around 6 days to solve) | ||
| 2076 | |||
| 2077 | * While writing the context switch code, I observed a very ugly bug. Somehow the | ||
| 2078 | montonic PIT interrupts that got triggered every 1 millisecond corrupts kernel | ||
| 2079 | state. | ||
| 2080 | |||
| 2081 | Symptoms: random exceptions and random assert() failures allover the place. | ||
| 2082 | |||
| 2083 | * So I first thought there must be something wrong in the timer handler itself. | ||
| 2084 | So what if the PIT handler did exactly nothing but acknowledging the local APIC? | ||
| 2085 | (It must acknowledge the APIC so future IRQs get handled) | ||
| 2086 | |||
| 2087 | The handler code was successfully reduced to: | ||
| 2088 | |||
| 2089 | push %rax | ||
| 2090 | movq $(VIRTUAL(APIC_PHBASE) + APIC_EOI), %rax | ||
| 2091 | movl $0,(%rax) | ||
| 2092 | pop %rax | ||
| 2093 | iretq | ||
| 2094 | |||
| 2095 | But still the same buggy behavior prevailed. | ||
| 2096 | |||
| 2097 | * After reducing the handler code to a number of bytes, with no positive results, | ||
| 2098 | I thought let's check the IDT validity. Inspecting the IDT cells date yielded | ||
| 2099 | nothing interesting. | ||
| 2100 | |||
| 2101 | Also checking the %rflags, bit by bit, led to nothing interesting. | ||
| 2102 | |||
| 2103 | So I thought maybe we should setup the IDT cells as 'trap' instead of 'interrupt' | ||
| 2104 | gates, but this is ofcourse wrong. Trap gates leaves the IF flag set while | ||
| 2105 | servicing the interrupt, which means handlers can get interrupted indefinitely, | ||
| 2106 | and our stack can get overflowed in a matter of milliseconds. | ||
| 2107 | |||
| 2108 | * OK, so I thought instead of focusing on the IRQ handlers and the IDT, lets focus | ||
| 2109 | on the _code_ that get interrupted itself. The failed code was usually the | ||
| 2110 | general memory allocator kmalloc() test cases. | ||
| 2111 | |||
| 2112 | After inspecting the interrupted code, I found nothing interesting. | ||
| 2113 | |||
| 2114 | * To let the problem domain be smaller, and after minimizing the handler code, I | ||
| 2115 | thought lets also minimize the interrupted code and see. Having a bunch of | ||
| 2116 | printk()s in a tight loop, with the timer handler on, triggered the problem! | ||
| 2117 | |||
| 2118 | Here the second day finishes. Either something is really messed up, or I'm a | ||
| 2119 | shitty developer that can't even write a stable printk(). | ||
| 2120 | |||
| 2121 | Furthermore, running any testcases after enabling the interrupts make them | ||
| 2122 | fail miserably. From all the debugging I've done in the last 6 months, I | ||
| 2123 | began to feel the instincts of such 'deep' kernel bugs: this is definitely | ||
| 2124 | one of them. | ||
| 2125 | |||
| 2126 | * So, I began to feel hopeless. I'll just inspect the x86 ops that triggered | ||
| 2127 | the exceptions, hoping to find any pattern in the code that trigger the | ||
| 2128 | bug. Definitely a pattern was clearly there; some examples of the failing | ||
| 2129 | x86 ops: | ||
| 2130 | |||
| 2131 | mov %rax, 0x0(%rbp) | ||
| 2132 | movzbl (%rax), %eax | ||
| 2133 | movl $0x2, 0x8(%rax) | ||
| 2134 | movl $0x1, 0x4(%rax) | ||
| 2135 | ... | ||
| 2136 | |||
| 2137 | And the most triggered, probably the _only_ triggered, exception was the page | ||
| 2138 | fault #PF exception, vector 14. The faulting (unmapped) address was things | ||
| 2139 | like %cr2=0x0, %cr2=0x8, %cr2=0x200020, etc. | ||
| 2140 | |||
| 2141 | From above pattern, I thought: who's really messing up with our registers, | ||
| 2142 | getting them loaded with weird values like 0, 0x8, etc? | ||
| 2143 | |||
| 2144 | * So, we're in the verge of the third day end with no direct results, and nothing | ||
| 2145 | known about the source of the bug. Having nothing specific to think about, I | ||
| 2146 | inspected the critical kernel paths: the IDT setup, IOAPIC entries setup, the | ||
| 2147 | GDT, the bootsector, early kernel boot setup, printk() code, etc. | ||
| 2148 | |||
| 2149 | Funnily, cause I had a 'global' view about the kernel while doing this deep | ||
| 2150 | inspection, 9 issues scattered around was discovered that needs to be fixed. | ||
| 2151 | Unfortunately, fixing those 9 issues didn't really solve our main problem, but | ||
| 2152 | it was nice nonetheless to discover them. | ||
| 2153 | |||
| 2154 | * So we're now in the fifth day, and nothing is clear. Hopelessly, I swap our | ||
| 2155 | printk code with FreeBSD's one. Things work fine, but once we enable interrupts | ||
| 2156 | and the PIT in monotonic mode, similar erroneous behavior appears. Swapping | ||
| 2157 | the FreeBSD code with the Linux code, and again, leading to the same failures. | ||
| 2158 | |||
| 2159 | So I was right, something 'deep' is messed up, not just the nature of the | ||
| 2160 | interrupted code. | ||
| 2161 | |||
| 2162 | Another hopeless try was swapping the data segment descriptor index @ the GDT | ||
| 2163 | from 0x10 to 0x18. Something interesting was discovered; the page faults | ||
| 2164 | addresses occurs from values originating from %cs, %ss, and %rflags; it was | ||
| 2165 | completely repeatable. Sounds familiar? Yes, they should! Those are the values | ||
| 2166 | pushed by the CPU before invoking IRQ handlers! | ||
| 2167 | |||
| 2168 | So somehow, registers that get dereferenced get the wrong values from the | ||
| 2169 | state pushed and popped by the CPU before and after invoking an IRQ handler. | ||
| 2170 | Now _this_ is an interesting discovery. | ||
| 2171 | |||
| 2172 | So, how the %ss, %cs, and %rflags values got 'leaked' to our general-purpose | ||
| 2173 | registers? The stack _must_ have been corrupted somehow .. | ||
| 2174 | |||
| 2175 | * Trying different directions, we minimize the timer handler even further only to | ||
| 2176 | the 'iretq' x86 op, but how do we acknowledge the local APIC this way? | ||
| 2177 | |||
| 2178 | In fact, we dont. I just masked all the IOAPIC entries, and let the PIT trigger | ||
| 2179 | through the PIC. The PIC was then programmed in 'Automatic EOI' mode, where no | ||
| 2180 | acknowledgment was needed. This reduced the handler only to: | ||
| 2181 | |||
| 2182 | iretq | ||
| 2183 | |||
| 2184 | Even with this super-minimal handler, the bug gets triggered :( | ||
| 2185 | |||
| 2186 | It was also interesting that, again, the #PF handler reported bad memory values | ||
| 2187 | resembling %rflags, %ss, etc. | ||
| 2188 | |||
| 2189 | * So after minimizing the timer handler to the absolute architecturally minimum, | ||
| 2190 | we try to minimize the faulting code even further. After some hours of | ||
| 2191 | disassembly, I find that at -O1 the VGA code alone (without the printk() code | ||
| 2192 | that called it) triggers the problem. | ||
| 2193 | |||
| 2194 | After even further inspection and disassembly (and wasted hours), I find that | ||
| 2195 | not only we can limit the code to VGA writing, but only to the string function | ||
| 2196 | memcpy(). | ||
| 2197 | |||
| 2198 | So, now, all I have is the timer interrupt and an infinite loop containing | ||
| 2199 | memcpy, and the code still fails! | ||
| 2200 | |||
| 2201 | And, yes, the memcpy() code was valid (ehem, Dijkestra's 'Humble Programmer') | ||
| 2202 | |||
| 2203 | * So I go crazy stripping the entire kernel. All what was left was code that | ||
| 2204 | re-programs the PIC, the PIT setup code, and the string function memcpy() | ||
| 2205 | and the code still failed. | ||
| 2206 | |||
| 2207 | That was, well, weird. We really have nothing now except a very small number of | ||
| 2208 | lines of code and I don't know what to do. | ||
| 2209 | |||
| 2210 | Once we enable interrupts, the memcpy() function fail with similar patterns to | ||
| 2211 | the one described above (page faults, with faulting memory values similar to | ||
| 2212 | %ss and %rflags) | ||
| 2213 | |||
| 2214 | Well, you can imagine the state I'm in after all those days. | ||
| 2215 | |||
| 2216 | * Another step for more explicitness was disabling the PIT, and triggering its | ||
| 2217 | handler manually using software interrupts. I've inject software interrupts | ||
| 2218 | around strategic C statements like '*src = *dst', etc. | ||
| 2219 | |||
| 2220 | * So at the middle of the middle of the sixth day, I disassemble the memcpy() | ||
| 2221 | function, draw the stack on paper, and trace. Now something became directly | ||
| 2222 | apparent. Notice those instructions (gcc -O0): | ||
| 2223 | |||
| 2224 | /* Classical stack setup */ | ||
| 2225 | ffffffff80109c88: 55 push %rbp | ||
| 2226 | ffffffff80109c89: 48 89 e5 mov %rsp,%rbp | ||
| 2227 | ffffffff80109c8c: 48 89 7d e8 mov %rdi,-0x18(%rbp) | ||
| 2228 | ffffffff80109c90: 48 89 75 e0 mov %rsi,-0x20(%rbp) | ||
| 2229 | |||
| 2230 | /* Bochs magic breakpoint */ | ||
| 2231 | ffffffff80109caa: 66 87 db xchg %bx,%bx | ||
| 2232 | |||
| 2233 | /* Our manual software interrupt */ | ||
| 2234 | ffffffff80109cad: cd f0 int $0xf0 | ||
| 2235 | |||
| 2236 | /* Failing code, specially last line */ | ||
| 2237 | ffffffff80109caf: 48 8b 45 f8 mov -0x8(%rbp),%rax | ||
| 2238 | ffffffff80109cb3: 0f b6 10 movzbl (%rax),%edx | ||
| 2239 | ffffffff80109cb6: 48 8b 45 f0 mov -0x10(%rbp),%rax | ||
| 2240 | ffffffff80109cba: 88 10 mov %dl,(%rax) | ||
| 2241 | |||
| 2242 | Notice something? Look again. See those `mov -0x8(%rbp), %rax' and | ||
| 2243 | `mov -0x10(%rbp), %rax) ops? | ||
| 2244 | |||
| 2245 | Now check the top of the function, especially the stack pointer setup. Can't | ||
| 2246 | you notice something? | ||
| 2247 | |||
| 2248 | YES, GCC used areas of the stack _B E L O W_ the stack pointer to keep parts | ||
| 2249 | of the leaf function local state. Now when the interrupt was triggered at | ||
| 2250 | `int $0xf0', the CPU rightfully pushed CPU state in that order: | ||
| 2251 | |||
| 2252 | %ss | ||
| 2253 | %rsp | ||
| 2254 | %rflags | ||
| 2255 | %cs | ||
| 2256 | %rip | ||
| 2257 | |||
| 2258 | Which meant that parts of that function local state get corrupted through | ||
| 2259 | the implicit CPU stack usage! | ||
| 2260 | |||
| 2261 | This is also the reason we found faulting addresses of values resembling %ss | ||
| 2262 | and %rflags. Our processor modified our pointers with those! | ||
| 2263 | |||
| 2264 | * Wow, THAT WAS INTERESTING and worthwhile to invest the week in. Inspecting the | ||
| 2265 | AMD64 ABI to see why GCC uses this behavior, which is _certainly_ interrupt and | ||
| 2266 | signals unsafe, I found the answer: it's 'The Red Zone'. | ||
| 2267 | |||
| 2268 | Now I've really had enough for today, see the discussion of this zone at the | ||
| 2269 | AMD64 ABI document. | ||
| 2270 | |||
| 2271 | * Solution: | ||
| 2272 | |||
| 2273 | Either: | ||
| 2274 | [1] Ignore the 128-byte region in the IRQ handler OR | ||
| 2275 | [2] Let the compiler avoid using this Red Zone | ||
| 2276 | |||
| 2277 | We can't really use [1], since the CPU pushes its state behind our backs. We | ||
| 2278 | can't let it wind 128-bytes of the stack. Also the kernel stack is a very | ||
| 2279 | valuable resource, and we shouldn't waste 128-Bytes of it to save an x86 op | ||
| 2280 | or two. | ||
| 2281 | |||
| 2282 | Thus, we pick the second option. That's also what the mainstream folks like | ||
| 2283 | {Open, Free}BSD and Linux do (which I discovered too lately). | ||
| 2284 | |||
| 2285 | * Lessons: | ||
| 2286 | |||
| 2287 | - When faced with triple faults, inspect the following: | ||
| 2288 | - the IDT table: if it has been overwritten, then the machine will definitely | ||
| 2289 | triple fault. | ||
| 2290 | - stack sanity: If the stack was not sane, then pushes and pops will lead to | ||
| 2291 | faults, leading to triple faults. | ||
| 2292 | - MM structures sanity. If the pagetables have been erroneously overwritten, | ||
| 2293 | the stack will get erroneously mapped. | ||
| 2294 | - Kernel stack overflow: corrupt stack = triple faults | ||
| 2295 | |||
| 2296 | - If you noticed that an interrupt randomly triggers a problem in the kernel, | ||
| 2297 | it can be good to disable interrupts (avoid distractions), then find the most | ||
| 2298 | affected code path and _manually_ inject interrupts there using software IRQs. | ||
| 2299 | |||
| 2300 | - The best possible way to let IRQ handlers be as minimal as possible is to use | ||
| 2301 | the PIC in Automatic EOI (AEOI) mode. They will get the handler minimized to | ||
| 2302 | the 2 byte `iretq' x86-64 op. | ||
| 2303 | |||
| 2304 | - Let code binary single-stepping be our last debugging resort to make sure we | ||
| 2305 | understand the algorithms and code we've just written well enough. |

