Commit 5bc1d687e4ffcdba6b02ade8faa8a0fb6f67dbeb.patch

  • avatar
  • darwish
  • Tue Nov 16 05:30:07 CET 2010
Notes: New section: The naughty multiply.
CuteNotes.txt
(159 / 0)
  
2626(20) Debugging -O3 triggered problems
2727(21) The Zone!
2828(22) Debugging periodic timers using statistical data
29(23) The Naughty Multiply
2930
3031
3132On Intel Documents:
27192719* A similar test-case for periodic timers is comparing their statistical
27202720 ticks occurrences relative to each other. This will guarantee that all system
27212721 periodic timers really measure the same concept of time!
2722
2723
2724The Naughty Multiply:
2725---------------------
2726
2727* References:
2728 - Intel Instruction Set Reference, A-M, vol. 2A, MUL.
2729 - GCC manual, 'Using the GNU Compiler Collection'
2730
2731* So I was writing a casual number of test-cases, where this sequence of code:
2732
2733 void run_tests(void) {
2734 memset(_arr, 0xFF, 0x100);
2735 __memcpy_forward(_arr, _arr + 20, 10);
2736 __memcpy_forward(_arr + 20, _arr, 10);
2737 }
2738
2739 produced a NULL-dereference exception. As printed by our exception handler:
2740
2741 Exception: vector=14, errcode=0x2, %cs=0x8, %rip=0xffffffff8010b6f7,
2742 %ss=0x0, %rsp=0xffffffff80104ff0, %rflags=0x200202, %cr2=0x0,
2743 %cr3=0x3ffef000, %cr4=0x20
2744
2745 It was weird: the code of these two functions was a very simple and
2746 straightforward inline GCC assembly.
2747
2748 The problem was clearly related somehow to registers-usage pressure (the
2749 compiler register allocator decisions) in the mother function run_tests().
2750 Removing one memcpy() above, or adding a printk() in a random place in that
2751 function would make the NULL exception disappear.
2752
2753* Disassembling the area around the faulting opcode, we found:
2754
2755 ffffffff8010b6a0 <string_run_tests>:
2756 ffffffff8010b6a0: 48 c7 c2 20 fb 10 80 mov $0xffffffff8010fb20,%rdx
2757 ffffffff8010b6a7: 31 f6 xor %esi,%esi
2758 ffffffff8010b6a9: b8 ff 00 00 00 mov $0xff,%eax
2759 ffffffff8010b6ae: 48 89 d7 mov %rdx,%rdi
2760 ffffffff8010b6b1: b9 00 01 00 00 mov $0x100,%ecx
2761 ffffffff8010b6b6: 48 89 ce mov %rcx,%rsi
2762 ffffffff8010b6b9: 48 83 e1 07 and $0x7,%rcx
2763 ffffffff8010b6bd: f3 aa rep stos %al,%es:(%rdi) (+)
2764 ffffffff8010b6bf: 48 89 f1 mov %rsi,%rcx
2765 ffffffff8010b6c2: 48 c1 e9 03 shr $0x3,%rcx
2766 ffffffff8010b6c6: 48 89 c6 mov %rax,%rsi
2767 ffffffff8010b6c9: 48 b8 01 01 01 01 01 mov $0x101010101010101,%rax
2768 ffffffff8010b6d0: 01 01 01
2769 ffffffff8010b6d3: 48 f7 e6 mul %rsi
2770 ffffffff8010b6d6: f3 48 ab rep stos %rax,%es:(%rdi) (++)
2771 ffffffff8010b6d9: 41 b9 0a 00 00 00 mov $0xa,%r9d
2772 ffffffff8010b6df: 31 c0 xor %eax,%eax
2773 ffffffff8010b6e1: 48 c7 c6 34 fb 10 80 mov $0xffffffff8010fb34,%rsi
2774 ffffffff8010b6e8: 48 89 d7 mov %rdx,%rdi
2775 ffffffff8010b6eb: 44 89 c9 mov %r9d,%ecx
2776 ffffffff8010b6ee: 41 89 c0 mov %eax,%r8d
2777 ffffffff8010b6f1: 41 89 c8 mov %ecx,%r8d
2778 ffffffff8010b6f4: 83 e1 07 and $0x7,%ecx
2779 ffffffff8010b6f7: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
2780
2781 From our exception handler, we can see that the faulting instruction
2782 is the above 'rep movsb'. We can also deduce that (+) and (++) ops are
2783 produced for memset(), and the rest are the memcpy() opcodes.
2784
2785* Tracing the code by hand, I reached the faulty instruction without
2786 catching the NULL dereference. A null dereference at the 'movsb'
2787 instruction can only mean that either %rsi = 0, or %rdi = 0, or both
2788 had magically turned to zeroes.
2789
2790 Wondering about the state, I overwrote this movsb instruction in the
2791 final kernel binary with a Bochs magic breakpoint code (66 87 db)
2792 using a simple hex editor.
2793
2794 After moving to the debugging console, thanks to the magic breakpoint,
2795 here was the Bochs registers dump:
2796
2797 (0) Magic breakpoint
2798 CPU0:
2799 rax: 0x00000000:00000000 rcx: 0x00000000:00000002
2800 rdx: 0x00000000:00000000 rbx: 0xffffffff:801129a8
2801 rsp: 0xffffffff:80104ff0 rbp: 0x00000000:00000000
2802 rsi: 0xffffffff:8010fb34 rdi: 0x00000000:00000000
2803 r8 : 0x00000000:0000000a r9 : 0x00000000:0000000a
2804 r10: 0x0f200f20:0f200f20 r11: 0x66656463:62613938
2805 r12: 0x00000000:00000000 r13: 0x00000000:00000000
2806 r14: 0x00000000:00000000 r15: 0x00000000:00000000
2807 rip: 0xffffffff:8010b6fa
2808 eflags 0x00200202: ID vip vif ac vm rf nt IOPL=0 of df IF tf sf zf af pf cf
2809
2810 One thing was clearly noted: for some reason, %rdx (and %rdi by consequence)
2811 have magically turned to zeroes, leading to the NULL dereference!
2812
2813* Finding that no explicit opcode zeroed %rdx above, then the zeroing
2814 must have been a side-effect of some instruction above the faulty
2815 dereference. Looking at the Intel Manual for cues, I found that
2816 opcode: it was the 'mul %rsi' one.
2817
2818 So why didn't GCC know about this? Because _I_ did not know about
2819 it while coding, thus I did not explicitly tell GCC that this nice
2820 register will get implicitly clobbered.
2821
2822 GCC, naively thought that the register value was unmodified, and
2823 rightfully re-used that register value at address 0x8010b6e8. GCC
2824 is really competent wrt to optimizations. The faulty code was:
2825
2826 asm volatile (
2827 [...]
2828 "mov %[ch], %[tmp];"
2829 "mov $0x0101010101010101, %[ch];"
2830 "mul %[tmp];"
2831 "rep stosq;"
2832 : [tmp] "=&r"(tmp), [dst] "=&D"(d0), [ch] "=&a"(uch),
2833 [len] "=&c"(ulen)
2834 : "[dst]"(dst), "[ch]"(uch), "[len]"(ulen), "[tmp]"(tmp)
2835 : "memory");
2836
2837 As you can see, %rdx was not mentiond as a clobbered register, and
2838 not even as an output register. GCC had all the rights to believe
2839 that this register value was not modified in that snippet.
2840
2841 In another way, the clobber list tells gcc which registers (or
2842 memory) are changed by the snippet, but not listed as an output.
2843
2844* The simple solution was adding the 'rdx' register to the assembly
2845 snippet clobber list. Now the compiler output looks like:
2846
2847 ffffffff8010b6a0 <string_run_tests>:
2848 ffffffff8010b6a0: 49 c7 c0 20 fb 10 80 mov $0xffffffff8010fb20,%r8
2849 [...]
2850 ffffffff8010b6e8: 48 89 d7 mov %r8,%rdi
2851 [...]
2852 ffffffff8010b6f7: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
2853
2854 where the compiler uses the new register %r8, entirely avoiding
2855 the %rdx register in the process.
2856
2857* From investigation, it seems GCC need to also be notified if the
2858 assembly snippet alters the condition code register %rflags. We now
2859 handle this by adding `cc' to the clobber list while using ops like
2860 'andq', 'popfq', 'decq', and 'shrq'.
2861
2862 Cases where we forgot to clobber 'rcx' after using opcodes like
2863 'rep XX' were also caught. For this, we only need to set 'rcx' as a
2864 double input+output register; GCC will infer the rest.
2865
2866 Another faulty case was having 'rsi' be only set as an input register
2867 amid the existence of 'movsq' in the assembly snippet.
2868
2869* Lesson:
2870
2871 While I did study the privileged x86 ring0 opcodes carefully (mainly
2872 Intel's manual vol. 3A, and AMD vol. 2), the basic x86 opcodes were
2873 learned 'on the go'. The problem we faced today is a small price for
2874 that; I should re-read Intel's vol. 1 docs way more carefully.
2875
2876 To make-up for this, and also using the Intel manuals, I investigated
2877 the side effects of __every__ opcode used in the kernel inlined
2878 assembly snippets. That's how I caught the %rflags, %rcx, and %rsi
2879 cases mentioned in the above point!