Commit 5bc1d687e4ffcdba6b02ade8faa8a0fb6f67dbeb.patch
- Diff rendering mode:
- inline
- side by side
CuteNotes.txt
(159 / 0)
|   | |||
| 26 | 26 | (20) Debugging -O3 triggered problems | |
| 27 | 27 | (21) The Zone! | |
| 28 | 28 | (22) Debugging periodic timers using statistical data | |
| 29 | (23) The Naughty Multiply | ||
| 29 | 30 | ||
| 30 | 31 | ||
| 31 | 32 | On Intel Documents: | |
| … | … | ||
| 2719 | 2719 | * A similar test-case for periodic timers is comparing their statistical | |
| 2720 | 2720 | ticks occurrences relative to each other. This will guarantee that all system | |
| 2721 | 2721 | periodic timers really measure the same concept of time! | |
| 2722 | |||
| 2723 | |||
| 2724 | The Naughty Multiply: | ||
| 2725 | --------------------- | ||
| 2726 | |||
| 2727 | * References: | ||
| 2728 | - Intel Instruction Set Reference, A-M, vol. 2A, MUL. | ||
| 2729 | - GCC manual, 'Using the GNU Compiler Collection' | ||
| 2730 | |||
| 2731 | * So I was writing a casual number of test-cases, where this sequence of code: | ||
| 2732 | |||
| 2733 | void run_tests(void) { | ||
| 2734 | memset(_arr, 0xFF, 0x100); | ||
| 2735 | __memcpy_forward(_arr, _arr + 20, 10); | ||
| 2736 | __memcpy_forward(_arr + 20, _arr, 10); | ||
| 2737 | } | ||
| 2738 | |||
| 2739 | produced a NULL-dereference exception. As printed by our exception handler: | ||
| 2740 | |||
| 2741 | Exception: vector=14, errcode=0x2, %cs=0x8, %rip=0xffffffff8010b6f7, | ||
| 2742 | %ss=0x0, %rsp=0xffffffff80104ff0, %rflags=0x200202, %cr2=0x0, | ||
| 2743 | %cr3=0x3ffef000, %cr4=0x20 | ||
| 2744 | |||
| 2745 | It was weird: the code of these two functions was a very simple and | ||
| 2746 | straightforward inline GCC assembly. | ||
| 2747 | |||
| 2748 | The problem was clearly related somehow to registers-usage pressure (the | ||
| 2749 | compiler register allocator decisions) in the mother function run_tests(). | ||
| 2750 | Removing one memcpy() above, or adding a printk() in a random place in that | ||
| 2751 | function would make the NULL exception disappear. | ||
| 2752 | |||
| 2753 | * Disassembling the area around the faulting opcode, we found: | ||
| 2754 | |||
| 2755 | ffffffff8010b6a0 <string_run_tests>: | ||
| 2756 | ffffffff8010b6a0: 48 c7 c2 20 fb 10 80 mov $0xffffffff8010fb20,%rdx | ||
| 2757 | ffffffff8010b6a7: 31 f6 xor %esi,%esi | ||
| 2758 | ffffffff8010b6a9: b8 ff 00 00 00 mov $0xff,%eax | ||
| 2759 | ffffffff8010b6ae: 48 89 d7 mov %rdx,%rdi | ||
| 2760 | ffffffff8010b6b1: b9 00 01 00 00 mov $0x100,%ecx | ||
| 2761 | ffffffff8010b6b6: 48 89 ce mov %rcx,%rsi | ||
| 2762 | ffffffff8010b6b9: 48 83 e1 07 and $0x7,%rcx | ||
| 2763 | ffffffff8010b6bd: f3 aa rep stos %al,%es:(%rdi) (+) | ||
| 2764 | ffffffff8010b6bf: 48 89 f1 mov %rsi,%rcx | ||
| 2765 | ffffffff8010b6c2: 48 c1 e9 03 shr $0x3,%rcx | ||
| 2766 | ffffffff8010b6c6: 48 89 c6 mov %rax,%rsi | ||
| 2767 | ffffffff8010b6c9: 48 b8 01 01 01 01 01 mov $0x101010101010101,%rax | ||
| 2768 | ffffffff8010b6d0: 01 01 01 | ||
| 2769 | ffffffff8010b6d3: 48 f7 e6 mul %rsi | ||
| 2770 | ffffffff8010b6d6: f3 48 ab rep stos %rax,%es:(%rdi) (++) | ||
| 2771 | ffffffff8010b6d9: 41 b9 0a 00 00 00 mov $0xa,%r9d | ||
| 2772 | ffffffff8010b6df: 31 c0 xor %eax,%eax | ||
| 2773 | ffffffff8010b6e1: 48 c7 c6 34 fb 10 80 mov $0xffffffff8010fb34,%rsi | ||
| 2774 | ffffffff8010b6e8: 48 89 d7 mov %rdx,%rdi | ||
| 2775 | ffffffff8010b6eb: 44 89 c9 mov %r9d,%ecx | ||
| 2776 | ffffffff8010b6ee: 41 89 c0 mov %eax,%r8d | ||
| 2777 | ffffffff8010b6f1: 41 89 c8 mov %ecx,%r8d | ||
| 2778 | ffffffff8010b6f4: 83 e1 07 and $0x7,%ecx | ||
| 2779 | ffffffff8010b6f7: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi) | ||
| 2780 | |||
| 2781 | From our exception handler, we can see that the faulting instruction | ||
| 2782 | is the above 'rep movsb'. We can also deduce that (+) and (++) ops are | ||
| 2783 | produced for memset(), and the rest are the memcpy() opcodes. | ||
| 2784 | |||
| 2785 | * Tracing the code by hand, I reached the faulty instruction without | ||
| 2786 | catching the NULL dereference. A null dereference at the 'movsb' | ||
| 2787 | instruction can only mean that either %rsi = 0, or %rdi = 0, or both | ||
| 2788 | had magically turned to zeroes. | ||
| 2789 | |||
| 2790 | Wondering about the state, I overwrote this movsb instruction in the | ||
| 2791 | final kernel binary with a Bochs magic breakpoint code (66 87 db) | ||
| 2792 | using a simple hex editor. | ||
| 2793 | |||
| 2794 | After moving to the debugging console, thanks to the magic breakpoint, | ||
| 2795 | here was the Bochs registers dump: | ||
| 2796 | |||
| 2797 | (0) Magic breakpoint | ||
| 2798 | CPU0: | ||
| 2799 | rax: 0x00000000:00000000 rcx: 0x00000000:00000002 | ||
| 2800 | rdx: 0x00000000:00000000 rbx: 0xffffffff:801129a8 | ||
| 2801 | rsp: 0xffffffff:80104ff0 rbp: 0x00000000:00000000 | ||
| 2802 | rsi: 0xffffffff:8010fb34 rdi: 0x00000000:00000000 | ||
| 2803 | r8 : 0x00000000:0000000a r9 : 0x00000000:0000000a | ||
| 2804 | r10: 0x0f200f20:0f200f20 r11: 0x66656463:62613938 | ||
| 2805 | r12: 0x00000000:00000000 r13: 0x00000000:00000000 | ||
| 2806 | r14: 0x00000000:00000000 r15: 0x00000000:00000000 | ||
| 2807 | rip: 0xffffffff:8010b6fa | ||
| 2808 | eflags 0x00200202: ID vip vif ac vm rf nt IOPL=0 of df IF tf sf zf af pf cf | ||
| 2809 | |||
| 2810 | One thing was clearly noted: for some reason, %rdx (and %rdi by consequence) | ||
| 2811 | have magically turned to zeroes, leading to the NULL dereference! | ||
| 2812 | |||
| 2813 | * Finding that no explicit opcode zeroed %rdx above, then the zeroing | ||
| 2814 | must have been a side-effect of some instruction above the faulty | ||
| 2815 | dereference. Looking at the Intel Manual for cues, I found that | ||
| 2816 | opcode: it was the 'mul %rsi' one. | ||
| 2817 | |||
| 2818 | So why didn't GCC know about this? Because _I_ did not know about | ||
| 2819 | it while coding, thus I did not explicitly tell GCC that this nice | ||
| 2820 | register will get implicitly clobbered. | ||
| 2821 | |||
| 2822 | GCC, naively thought that the register value was unmodified, and | ||
| 2823 | rightfully re-used that register value at address 0x8010b6e8. GCC | ||
| 2824 | is really competent wrt to optimizations. The faulty code was: | ||
| 2825 | |||
| 2826 | asm volatile ( | ||
| 2827 | [...] | ||
| 2828 | "mov %[ch], %[tmp];" | ||
| 2829 | "mov $0x0101010101010101, %[ch];" | ||
| 2830 | "mul %[tmp];" | ||
| 2831 | "rep stosq;" | ||
| 2832 | : [tmp] "=&r"(tmp), [dst] "=&D"(d0), [ch] "=&a"(uch), | ||
| 2833 | [len] "=&c"(ulen) | ||
| 2834 | : "[dst]"(dst), "[ch]"(uch), "[len]"(ulen), "[tmp]"(tmp) | ||
| 2835 | : "memory"); | ||
| 2836 | |||
| 2837 | As you can see, %rdx was not mentiond as a clobbered register, and | ||
| 2838 | not even as an output register. GCC had all the rights to believe | ||
| 2839 | that this register value was not modified in that snippet. | ||
| 2840 | |||
| 2841 | In another way, the clobber list tells gcc which registers (or | ||
| 2842 | memory) are changed by the snippet, but not listed as an output. | ||
| 2843 | |||
| 2844 | * The simple solution was adding the 'rdx' register to the assembly | ||
| 2845 | snippet clobber list. Now the compiler output looks like: | ||
| 2846 | |||
| 2847 | ffffffff8010b6a0 <string_run_tests>: | ||
| 2848 | ffffffff8010b6a0: 49 c7 c0 20 fb 10 80 mov $0xffffffff8010fb20,%r8 | ||
| 2849 | [...] | ||
| 2850 | ffffffff8010b6e8: 48 89 d7 mov %r8,%rdi | ||
| 2851 | [...] | ||
| 2852 | ffffffff8010b6f7: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi) | ||
| 2853 | |||
| 2854 | where the compiler uses the new register %r8, entirely avoiding | ||
| 2855 | the %rdx register in the process. | ||
| 2856 | |||
| 2857 | * From investigation, it seems GCC need to also be notified if the | ||
| 2858 | assembly snippet alters the condition code register %rflags. We now | ||
| 2859 | handle this by adding `cc' to the clobber list while using ops like | ||
| 2860 | 'andq', 'popfq', 'decq', and 'shrq'. | ||
| 2861 | |||
| 2862 | Cases where we forgot to clobber 'rcx' after using opcodes like | ||
| 2863 | 'rep XX' were also caught. For this, we only need to set 'rcx' as a | ||
| 2864 | double input+output register; GCC will infer the rest. | ||
| 2865 | |||
| 2866 | Another faulty case was having 'rsi' be only set as an input register | ||
| 2867 | amid the existence of 'movsq' in the assembly snippet. | ||
| 2868 | |||
| 2869 | * Lesson: | ||
| 2870 | |||
| 2871 | While I did study the privileged x86 ring0 opcodes carefully (mainly | ||
| 2872 | Intel's manual vol. 3A, and AMD vol. 2), the basic x86 opcodes were | ||
| 2873 | learned 'on the go'. The problem we faced today is a small price for | ||
| 2874 | that; I should re-read Intel's vol. 1 docs way more carefully. | ||
| 2875 | |||
| 2876 | To make-up for this, and also using the Intel manuals, I investigated | ||
| 2877 | the side effects of __every__ opcode used in the kernel inlined | ||
| 2878 | assembly snippets. That's how I caught the %rflags, %rcx, and %rsi | ||
| 2879 | cases mentioned in the above point! |

