Commit 5cc7a1ded5e6b741657fea903edd1b2250eb12d4.patch

  • avatar
  • darwish
  • Sat Mar 20 13:18:05 CET 2010
On the Zone!
CuteNotes.txt
(254 / 0)
  
20492049
20502050 - We may add asserts to memcpy() and friends assuring no such diastrous
20512051 overlaps occur in the C side of the kernel.
2052
2053* Global lessons:
2054
2055 - Disassembly is a great tool for debugging kernel code, since a lot of things
2056 taken for granted at user-space is not guaranteed here anymore. Abstraction
2057 cut both ways indeed.
2058
2059 - It's _very important_ to deeply test each code path, no matter how trivial
2060 it seems (basic stuff, but sometimes we forget the basics).
2061
2062 - Memory dumping of critical kernel regions is really valuable; remember to use
2063 it as needed.
2064
2065 - It's important to test the code with different environment combinations:
2066 different optimization levels: GCC's -O {0, 1, 2, 3, s}
2067 different memory sizes: Qemu's -m {2, 3, .., 500, .., 2000, ..}
2068 different number of cores: -smp {1, 2, 3, ..}
2069 different CPU speeds: Bochs's ips= parameter
2070 different BIOSes: different versions of Bochs Bios, different PC forms, ..
2071
2072* The Zone:
2073-----------
2074
2075 (Or the bug that took me around 6 days to solve)
2076
2077* While writing the context switch code, I observed a very ugly bug. Somehow the
2078 montonic PIT interrupts that got triggered every 1 millisecond corrupts kernel
2079 state.
2080
2081 Symptoms: random exceptions and random assert() failures allover the place.
2082
2083* So I first thought there must be something wrong in the timer handler itself.
2084 So what if the PIT handler did exactly nothing but acknowledging the local APIC?
2085 (It must acknowledge the APIC so future IRQs get handled)
2086
2087 The handler code was successfully reduced to:
2088
2089 push %rax
2090 movq $(VIRTUAL(APIC_PHBASE) + APIC_EOI), %rax
2091 movl $0,(%rax)
2092 pop %rax
2093 iretq
2094
2095 But still the same buggy behavior prevailed.
2096
2097* After reducing the handler code to a number of bytes, with no positive results,
2098 I thought let's check the IDT validity. Inspecting the IDT cells date yielded
2099 nothing interesting.
2100
2101 Also checking the %rflags, bit by bit, led to nothing interesting.
2102
2103 So I thought maybe we should setup the IDT cells as 'trap' instead of 'interrupt'
2104 gates, but this is ofcourse wrong. Trap gates leaves the IF flag set while
2105 servicing the interrupt, which means handlers can get interrupted indefinitely,
2106 and our stack can get overflowed in a matter of milliseconds.
2107
2108* OK, so I thought instead of focusing on the IRQ handlers and the IDT, lets focus
2109 on the _code_ that get interrupted itself. The failed code was usually the
2110 general memory allocator kmalloc() test cases.
2111
2112 After inspecting the interrupted code, I found nothing interesting.
2113
2114* To let the problem domain be smaller, and after minimizing the handler code, I
2115 thought lets also minimize the interrupted code and see. Having a bunch of
2116 printk()s in a tight loop, with the timer handler on, triggered the problem!
2117
2118 Here the second day finishes. Either something is really messed up, or I'm a
2119 shitty developer that can't even write a stable printk().
2120
2121 Furthermore, running any testcases after enabling the interrupts make them
2122 fail miserably. From all the debugging I've done in the last 6 months, I
2123 began to feel the instincts of such 'deep' kernel bugs: this is definitely
2124 one of them.
2125
2126* So, I began to feel hopeless. I'll just inspect the x86 ops that triggered
2127 the exceptions, hoping to find any pattern in the code that trigger the
2128 bug. Definitely a pattern was clearly there; some examples of the failing
2129 x86 ops:
2130
2131 mov %rax, 0x0(%rbp)
2132 movzbl (%rax), %eax
2133 movl $0x2, 0x8(%rax)
2134 movl $0x1, 0x4(%rax)
2135 ...
2136
2137 And the most triggered, probably the _only_ triggered, exception was the page
2138 fault #PF exception, vector 14. The faulting (unmapped) address was things
2139 like %cr2=0x0, %cr2=0x8, %cr2=0x200020, etc.
2140
2141 From above pattern, I thought: who's really messing up with our registers,
2142 getting them loaded with weird values like 0, 0x8, etc?
2143
2144* So, we're in the verge of the third day end with no direct results, and nothing
2145 known about the source of the bug. Having nothing specific to think about, I
2146 inspected the critical kernel paths: the IDT setup, IOAPIC entries setup, the
2147 GDT, the bootsector, early kernel boot setup, printk() code, etc.
2148
2149 Funnily, cause I had a 'global' view about the kernel while doing this deep
2150 inspection, 9 issues scattered around was discovered that needs to be fixed.
2151 Unfortunately, fixing those 9 issues didn't really solve our main problem, but
2152 it was nice nonetheless to discover them.
2153
2154* So we're now in the fifth day, and nothing is clear. Hopelessly, I swap our
2155 printk code with FreeBSD's one. Things work fine, but once we enable interrupts
2156 and the PIT in monotonic mode, similar erroneous behavior appears. Swapping
2157 the FreeBSD code with the Linux code, and again, leading to the same failures.
2158
2159 So I was right, something 'deep' is messed up, not just the nature of the
2160 interrupted code.
2161
2162 Another hopeless try was swapping the data segment descriptor index @ the GDT
2163 from 0x10 to 0x18. Something interesting was discovered; the page faults
2164 addresses occurs from values originating from %cs, %ss, and %rflags; it was
2165 completely repeatable. Sounds familiar? Yes, they should! Those are the values
2166 pushed by the CPU before invoking IRQ handlers!
2167
2168 So somehow, registers that get dereferenced get the wrong values from the
2169 state pushed and popped by the CPU before and after invoking an IRQ handler.
2170 Now _this_ is an interesting discovery.
2171
2172 So, how the %ss, %cs, and %rflags values got 'leaked' to our general-purpose
2173 registers? The stack _must_ have been corrupted somehow ..
2174
2175* Trying different directions, we minimize the timer handler even further only to
2176 the 'iretq' x86 op, but how do we acknowledge the local APIC this way?
2177
2178 In fact, we dont. I just masked all the IOAPIC entries, and let the PIT trigger
2179 through the PIC. The PIC was then programmed in 'Automatic EOI' mode, where no
2180 acknowledgment was needed. This reduced the handler only to:
2181
2182 iretq
2183
2184 Even with this super-minimal handler, the bug gets triggered :(
2185
2186 It was also interesting that, again, the #PF handler reported bad memory values
2187 resembling %rflags, %ss, etc.
2188
2189* So after minimizing the timer handler to the absolute architecturally minimum,
2190 we try to minimize the faulting code even further. After some hours of
2191 disassembly, I find that at -O1 the VGA code alone (without the printk() code
2192 that called it) triggers the problem.
2193
2194 After even further inspection and disassembly (and wasted hours), I find that
2195 not only we can limit the code to VGA writing, but only to the string function
2196 memcpy().
2197
2198 So, now, all I have is the timer interrupt and an infinite loop containing
2199 memcpy, and the code still fails!
2200
2201 And, yes, the memcpy() code was valid (ehem, Dijkestra's 'Humble Programmer')
2202
2203* So I go crazy stripping the entire kernel. All what was left was code that
2204 re-programs the PIC, the PIT setup code, and the string function memcpy()
2205 and the code still failed.
2206
2207 That was, well, weird. We really have nothing now except a very small number of
2208 lines of code and I don't know what to do.
2209
2210 Once we enable interrupts, the memcpy() function fail with similar patterns to
2211 the one described above (page faults, with faulting memory values similar to
2212 %ss and %rflags)
2213
2214 Well, you can imagine the state I'm in after all those days.
2215
2216* Another step for more explicitness was disabling the PIT, and triggering its
2217 handler manually using software interrupts. I've inject software interrupts
2218 around strategic C statements like '*src = *dst', etc.
2219
2220* So at the middle of the middle of the sixth day, I disassemble the memcpy()
2221 function, draw the stack on paper, and trace. Now something became directly
2222 apparent. Notice those instructions (gcc -O0):
2223
2224 /* Classical stack setup */
2225 ffffffff80109c88: 55 push %rbp
2226 ffffffff80109c89: 48 89 e5 mov %rsp,%rbp
2227 ffffffff80109c8c: 48 89 7d e8 mov %rdi,-0x18(%rbp)
2228 ffffffff80109c90: 48 89 75 e0 mov %rsi,-0x20(%rbp)
2229
2230 /* Bochs magic breakpoint */
2231 ffffffff80109caa: 66 87 db xchg %bx,%bx
2232
2233 /* Our manual software interrupt */
2234 ffffffff80109cad: cd f0 int $0xf0
2235
2236 /* Failing code, specially last line */
2237 ffffffff80109caf: 48 8b 45 f8 mov -0x8(%rbp),%rax
2238 ffffffff80109cb3: 0f b6 10 movzbl (%rax),%edx
2239 ffffffff80109cb6: 48 8b 45 f0 mov -0x10(%rbp),%rax
2240 ffffffff80109cba: 88 10 mov %dl,(%rax)
2241
2242 Notice something? Look again. See those `mov -0x8(%rbp), %rax' and
2243 `mov -0x10(%rbp), %rax) ops?
2244
2245 Now check the top of the function, especially the stack pointer setup. Can't
2246 you notice something?
2247
2248 YES, GCC used areas of the stack _B E L O W_ the stack pointer to keep parts
2249 of the leaf function local state. Now when the interrupt was triggered at
2250 `int $0xf0', the CPU rightfully pushed CPU state in that order:
2251
2252 %ss
2253 %rsp
2254 %rflags
2255 %cs
2256 %rip
2257
2258 Which meant that parts of that function local state get corrupted through
2259 the implicit CPU stack usage!
2260
2261 This is also the reason we found faulting addresses of values resembling %ss
2262 and %rflags. Our processor modified our pointers with those!
2263
2264* Wow, THAT WAS INTERESTING and worthwhile to invest the week in. Inspecting the
2265 AMD64 ABI to see why GCC uses this behavior, which is _certainly_ interrupt and
2266 signals unsafe, I found the answer: it's 'The Red Zone'.
2267
2268 Now I've really had enough for today, see the discussion of this zone at the
2269 AMD64 ABI document.
2270
2271* Solution:
2272
2273 Either:
2274 [1] Ignore the 128-byte region in the IRQ handler OR
2275 [2] Let the compiler avoid using this Red Zone
2276
2277 We can't really use [1], since the CPU pushes its state behind our backs. We
2278 can't let it wind 128-bytes of the stack. Also the kernel stack is a very
2279 valuable resource, and we shouldn't waste 128-Bytes of it to save an x86 op
2280 or two.
2281
2282 Thus, we pick the second option. That's also what the mainstream folks like
2283 {Open, Free}BSD and Linux do (which I discovered too lately).
2284
2285* Lessons:
2286
2287 - When faced with triple faults, inspect the following:
2288 - the IDT table: if it has been overwritten, then the machine will definitely
2289 triple fault.
2290 - stack sanity: If the stack was not sane, then pushes and pops will lead to
2291 faults, leading to triple faults.
2292 - MM structures sanity. If the pagetables have been erroneously overwritten,
2293 the stack will get erroneously mapped.
2294 - Kernel stack overflow: corrupt stack = triple faults
2295
2296 - If you noticed that an interrupt randomly triggers a problem in the kernel,
2297 it can be good to disable interrupts (avoid distractions), then find the most
2298 affected code path and _manually_ inject interrupts there using software IRQs.
2299
2300 - The best possible way to let IRQ handlers be as minimal as possible is to use
2301 the PIC in Automatic EOI (AEOI) mode. They will get the handler minimized to
2302 the 2 byte `iretq' x86-64 op.
2303
2304 - Let code binary single-stepping be our last debugging resort to make sure we
2305 understand the algorithms and code we've just written well enough.