| 1 |
Goals, Design and Implementation of the |
| 2 |
new ultra-scalable O(1) scheduler |
| 3 |
|
| 4 |
|
| 5 |
This is an edited version of an email Ingo Molnar sent to |
| 6 |
lkml on 4 Jan 2002. It describes the goals, design, and |
| 7 |
implementation of Ingo's new ultra-scalable O(1) scheduler. |
| 8 |
Edits by Robert Love, 18 April 2002, to add corrections and |
| 9 |
account for changes to the scheduler since its debut. |
| 10 |
|
| 11 |
|
| 12 |
Goal |
| 13 |
==== |
| 14 |
|
| 15 |
The main goal of the new scheduler is to keep all the good things we know |
| 16 |
and love about the current Linux scheduler: |
| 17 |
|
| 18 |
- good interactive performance even during high load: if the user |
| 19 |
types or clicks then the system must react instantly and must execute |
| 20 |
the user tasks smoothly, even during considerable background load. |
| 21 |
|
| 22 |
- good scheduling/wakeup performance with 1-2 runnable processes. |
| 23 |
|
| 24 |
- fairness: no process should stay without any timeslice for any |
| 25 |
unreasonable amount of time. No process should get an unjustly high |
| 26 |
amount of CPU time. |
| 27 |
|
| 28 |
- priorities: less important tasks can be started with lower priority, |
| 29 |
more important tasks with higher priority. |
| 30 |
|
| 31 |
- SMP efficiency: no CPU should stay idle if there is work to do. |
| 32 |
|
| 33 |
- SMP affinity: processes which run on one CPU should stay affine to |
| 34 |
that CPU. Processes should not bounce between CPUs too frequently. |
| 35 |
|
| 36 |
- plus additional scheduler features: RT scheduling, CPU binding. |
| 37 |
|
| 38 |
and the goal is also to add a few new things: |
| 39 |
|
| 40 |
- fully O(1) scheduling. Are you tired of the recalculation loop |
| 41 |
blowing the L1 cache away every now and then? Do you think the goodness |
| 42 |
loop is taking a bit too long to finish if there are lots of runnable |
| 43 |
processes? This new scheduler takes no prisoners: wakeup(), schedule(), |
| 44 |
the timer interrupt are all O(1) algorithms. There is no recalculation |
| 45 |
loop. There is no goodness loop either. |
| 46 |
|
| 47 |
- 'perfect' SMP scalability. With the new scheduler there is no 'big' |
| 48 |
runqueue_lock anymore - it's all per-CPU runqueues and locks - two |
| 49 |
tasks on two separate CPUs can wake up, schedule and context-switch |
| 50 |
completely in parallel, without any interlocking. All |
| 51 |
scheduling-relevant data is structured for maximum scalability. |
| 52 |
|
| 53 |
- better SMP affinity. The old scheduler has a particular weakness that |
| 54 |
causes the random bouncing of tasks between CPUs if/when higher |
| 55 |
priority/interactive tasks, this was observed and reported by many |
| 56 |
people. The reason is that the timeslice recalculation loop first needs |
| 57 |
every currently running task to consume its timeslice. But when this |
| 58 |
happens on eg. an 8-way system, then this property starves an |
| 59 |
increasing number of CPUs from executing any process. Once the last |
| 60 |
task that has a timeslice left has finished using up that timeslice, |
| 61 |
the recalculation loop is triggered and other CPUs can start executing |
| 62 |
tasks again - after having idled around for a number of timer ticks. |
| 63 |
The more CPUs, the worse this effect. |
| 64 |
|
| 65 |
Furthermore, this same effect causes the bouncing effect as well: |
| 66 |
whenever there is such a 'timeslice squeeze' of the global runqueue, |
| 67 |
idle processors start executing tasks which are not affine to that CPU. |
| 68 |
(because the affine tasks have finished off their timeslices already.) |
| 69 |
|
| 70 |
The new scheduler solves this problem by distributing timeslices on a |
| 71 |
per-CPU basis, without having any global synchronization or |
| 72 |
recalculation. |
| 73 |
|
| 74 |
- batch scheduling. A significant proportion of computing-intensive tasks |
| 75 |
benefit from batch-scheduling, where timeslices are long and processes |
| 76 |
are roundrobin scheduled. |
| 77 |
|
| 78 |
[ I don't know if the following is still true; |
| 79 |
to enter BATCH-mode please use schedtool. SCHED_BATCH processes, are, |
| 80 |
however, interruptible by SCHED_NORMAL processes anytime to garantee |
| 81 |
interactiveness. -Freek ] |
| 82 |
|
| 83 |
The new scheduler does such batch-scheduling of the lowest priority tasks - |
| 84 |
so nice +19 jobs will get 'batch-scheduled' automatically. With this |
| 85 |
scheduler, nice +19 jobs are in essence SCHED_IDLE, from an |
| 86 |
interactiveness point of view. |
| 87 |
|
| 88 |
- handle extreme loads more smoothly, without breakdown and scheduling |
| 89 |
storms. |
| 90 |
|
| 91 |
- O(1) RT scheduling. For those RT folks who are paranoid about the |
| 92 |
O(nr_running) property of the goodness loop and the recalculation loop. |
| 93 |
|
| 94 |
- run fork()ed children before the parent. Andrea has pointed out the |
| 95 |
advantages of this a few months ago, but patches for this feature |
| 96 |
do not work with the old scheduler as well as they should, |
| 97 |
because idle processes often steal the new child before the fork()ing |
| 98 |
CPU gets to execute it. |
| 99 |
|
| 100 |
|
| 101 |
Design |
| 102 |
====== |
| 103 |
|
| 104 |
the core of the new scheduler are the following mechanizms: |
| 105 |
|
| 106 |
- *two*, priority-ordered 'priority arrays' per CPU. There is an 'active' |
| 107 |
array and an 'expired' array. The active array contains all tasks that |
| 108 |
are affine to this CPU and have timeslices left. The expired array |
| 109 |
contains all tasks which have used up their timeslices - but this array |
| 110 |
is kept sorted as well. The active and expired array is not accessed |
| 111 |
directly, it's accessed through two pointers in the per-CPU runqueue |
| 112 |
structure. If all active tasks are used up then we 'switch' the two |
| 113 |
pointers and from now on the ready-to-go (former-) expired array is the |
| 114 |
active array - and the empty active array serves as the new collector |
| 115 |
for expired tasks. |
| 116 |
|
| 117 |
- there is a 64-bit bitmap cache for array indices. Finding the highest |
| 118 |
priority task is thus a matter of two x86 BSFL bit-search instructions. |
| 119 |
|
| 120 |
the split-array solution enables us to have an arbitrary number of active |
| 121 |
and expired tasks, and the recalculation of timeslices can be done |
| 122 |
immediately when the timeslice expires. Because the arrays are always |
| 123 |
access through the pointers in the runqueue, switching the two arrays can |
| 124 |
be done very quickly. |
| 125 |
|
| 126 |
this is a hybride priority-list approach coupled with roundrobin |
| 127 |
scheduling and the array-switch method of distributing timeslices. |
| 128 |
|
| 129 |
- there is a per-task 'load estimator'. |
| 130 |
|
| 131 |
one of the toughest things to get right is good interactive feel during |
| 132 |
heavy system load. While playing with various scheduler variants i found |
| 133 |
that the best interactive feel is achieved not by 'boosting' interactive |
| 134 |
tasks, but by 'punishing' tasks that want to use more CPU time than there |
| 135 |
is available. This method is also much easier to do in an O(1) fashion. |
| 136 |
|
| 137 |
to establish the actual 'load' the task contributes to the system, a |
| 138 |
complex-looking but pretty accurate method is used: there is a 4-entry |
| 139 |
'history' ringbuffer of the task's activities during the last 4 seconds. |
| 140 |
This ringbuffer is operated without much overhead. The entries tell the |
| 141 |
scheduler a pretty accurate load-history of the task: has it used up more |
| 142 |
CPU time or less during the past N seconds. [the size '4' and the interval |
| 143 |
of 4x 1 seconds was found by lots of experimentation - this part is |
| 144 |
flexible and can be changed in both directions.] |
| 145 |
|
| 146 |
the penalty a task gets for generating more load than the CPU can handle |
| 147 |
is a priority decrease - there is a maximum amount to this penalty |
| 148 |
relative to their static priority, so even fully CPU-bound tasks will |
| 149 |
observe each other's priorities, and will share the CPU accordingly. |
| 150 |
|
| 151 |
the SMP load-balancer can be extended/switched with additional parallel |
| 152 |
computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs |
| 153 |
can be supported easily by changing the load-balancer. Right now it's |
| 154 |
tuned for my SMP systems. |
| 155 |
|
| 156 |
i skipped the prev->mm == next->mm advantage - no workload i know of shows |
| 157 |
any sensitivity to this. It can be added back by sacrificing O(1) |
| 158 |
schedule() [the current and one-lower priority list can be searched for a |
| 159 |
that->mm == current->mm condition], but costs a fair number of cycles |
| 160 |
during a number of important workloads, so i wanted to avoid this as much |
| 161 |
as possible. |
| 162 |
|
| 163 |
- the SMP idle-task startup code was still racy and the new scheduler |
| 164 |
triggered this. So i streamlined the idle-setup code a bit. We do not call |
| 165 |
into schedule() before all processors have started up fully and all idle |
| 166 |
threads are in place. |
| 167 |
|
| 168 |
- the patch also cleans up a number of aspects of sched.c - moves code |
| 169 |
into other areas of the kernel where it's appropriate, and simplifies |
| 170 |
certain code paths and data constructs. As a result, the new scheduler's |
| 171 |
code is smaller than the old one. |
| 172 |
|
| 173 |
Ingo |