1
		   Goals, Design and Implementation of the
2
		      new ultra-scalable O(1) scheduler
3
4
5
  This is an edited version of an email Ingo Molnar sent to
6
  lkml on 4 Jan 2002.  It describes the goals, design, and
7
  implementation of Ingo's new ultra-scalable O(1) scheduler.
8
  Edits by Robert Love, 18 April 2002, to add corrections and
9
  account for changes to the scheduler since its debut.
10
11
12
Goal
13
====
14
15
The main goal of the new scheduler is to keep all the good things we know
16
and love about the current Linux scheduler:
17
18
 - good interactive performance even during high load: if the user
19
   types or clicks then the system must react instantly and must execute
20
   the user tasks smoothly, even during considerable background load.
21
22
 - good scheduling/wakeup performance with 1-2 runnable processes.
23
24
 - fairness: no process should stay without any timeslice for any
25
   unreasonable amount of time. No process should get an unjustly high
26
   amount of CPU time.
27
28
 - priorities: less important tasks can be started with lower priority,
29
   more important tasks with higher priority.
30
31
 - SMP efficiency: no CPU should stay idle if there is work to do.
32
33
 - SMP affinity: processes which run on one CPU should stay affine to
34
   that CPU. Processes should not bounce between CPUs too frequently.
35
36
 - plus additional scheduler features: RT scheduling, CPU binding.
37
38
and the goal is also to add a few new things:
39
40
 - fully O(1) scheduling. Are you tired of the recalculation loop
41
   blowing the L1 cache away every now and then? Do you think the goodness
42
   loop is taking a bit too long to finish if there are lots of runnable
43
   processes? This new scheduler takes no prisoners: wakeup(), schedule(),
44
   the timer interrupt are all O(1) algorithms. There is no recalculation
45
   loop. There is no goodness loop either.
46
47
 - 'perfect' SMP scalability. With the new scheduler there is no 'big'
48
   runqueue_lock anymore - it's all per-CPU runqueues and locks - two
49
   tasks on two separate CPUs can wake up, schedule and context-switch
50
   completely in parallel, without any interlocking. All
51
   scheduling-relevant data is structured for maximum scalability.
52
53
 - better SMP affinity. The old scheduler has a particular weakness that
54
   causes the random bouncing of tasks between CPUs if/when higher
55
   priority/interactive tasks, this was observed and reported by many
56
   people. The reason is that the timeslice recalculation loop first needs
57
   every currently running task to consume its timeslice. But when this
58
   happens on eg. an 8-way system, then this property starves an
59
   increasing number of CPUs from executing any process. Once the last
60
   task that has a timeslice left has finished using up that timeslice,
61
   the recalculation loop is triggered and other CPUs can start executing
62
   tasks again - after having idled around for a number of timer ticks.
63
   The more CPUs, the worse this effect.
64
65
   Furthermore, this same effect causes the bouncing effect as well:
66
   whenever there is such a 'timeslice squeeze' of the global runqueue,
67
   idle processors start executing tasks which are not affine to that CPU.
68
   (because the affine tasks have finished off their timeslices already.)
69
70
   The new scheduler solves this problem by distributing timeslices on a
71
   per-CPU basis, without having any global synchronization or
72
   recalculation.
73
74
 - batch scheduling. A significant proportion of computing-intensive tasks
75
   benefit from batch-scheduling, where timeslices are long and processes
76
   are roundrobin scheduled.
77
  
78
   [ I don't know if the following is still true;
79
   to enter BATCH-mode please use schedtool. SCHED_BATCH processes, are,
80
   however, interruptible by SCHED_NORMAL processes anytime to garantee
81
   interactiveness. -Freek ]
82
83
   The new scheduler does such batch-scheduling of the lowest priority tasks -
84
   so nice +19 jobs will get 'batch-scheduled' automatically. With this
85
   scheduler, nice +19 jobs are in essence SCHED_IDLE, from an
86
   interactiveness point of view.
87
88
 - handle extreme loads more smoothly, without breakdown and scheduling
89
   storms.
90
91
 - O(1) RT scheduling. For those RT folks who are paranoid about the
92
   O(nr_running) property of the goodness loop and the recalculation loop.
93
94
 - run fork()ed children before the parent. Andrea has pointed out the
95
   advantages of this a few months ago, but patches for this feature
96
   do not work with the old scheduler as well as they should,
97
   because idle processes often steal the new child before the fork()ing
98
   CPU gets to execute it.
99
100
101
Design
102
======
103
104
the core of the new scheduler are the following mechanizms:
105
106
 - *two*, priority-ordered 'priority arrays' per CPU. There is an 'active'
107
   array and an 'expired' array. The active array contains all tasks that
108
   are affine to this CPU and have timeslices left. The expired array
109
   contains all tasks which have used up their timeslices - but this array
110
   is kept sorted as well. The active and expired array is not accessed
111
   directly, it's accessed through two pointers in the per-CPU runqueue
112
   structure. If all active tasks are used up then we 'switch' the two
113
   pointers and from now on the ready-to-go (former-) expired array is the
114
   active array - and the empty active array serves as the new collector
115
   for expired tasks.
116
117
 - there is a 64-bit bitmap cache for array indices. Finding the highest
118
   priority task is thus a matter of two x86 BSFL bit-search instructions.
119
120
the split-array solution enables us to have an arbitrary number of active
121
and expired tasks, and the recalculation of timeslices can be done
122
immediately when the timeslice expires. Because the arrays are always
123
access through the pointers in the runqueue, switching the two arrays can
124
be done very quickly.
125
126
this is a hybride priority-list approach coupled with roundrobin
127
scheduling and the array-switch method of distributing timeslices.
128
129
 - there is a per-task 'load estimator'.
130
131
one of the toughest things to get right is good interactive feel during
132
heavy system load. While playing with various scheduler variants i found
133
that the best interactive feel is achieved not by 'boosting' interactive
134
tasks, but by 'punishing' tasks that want to use more CPU time than there
135
is available. This method is also much easier to do in an O(1) fashion.
136
137
to establish the actual 'load' the task contributes to the system, a
138
complex-looking but pretty accurate method is used: there is a 4-entry
139
'history' ringbuffer of the task's activities during the last 4 seconds.
140
This ringbuffer is operated without much overhead. The entries tell the
141
scheduler a pretty accurate load-history of the task: has it used up more
142
CPU time or less during the past N seconds. [the size '4' and the interval
143
of 4x 1 seconds was found by lots of experimentation - this part is
144
flexible and can be changed in both directions.]
145
146
the penalty a task gets for generating more load than the CPU can handle
147
is a priority decrease - there is a maximum amount to this penalty
148
relative to their static priority, so even fully CPU-bound tasks will
149
observe each other's priorities, and will share the CPU accordingly.
150
151
the SMP load-balancer can be extended/switched with additional parallel
152
computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs
153
can be supported easily by changing the load-balancer. Right now it's
154
tuned for my SMP systems.
155
156
i skipped the prev->mm == next->mm advantage - no workload i know of shows
157
any sensitivity to this. It can be added back by sacrificing O(1)
158
schedule() [the current and one-lower priority list can be searched for a
159
that->mm == current->mm condition], but costs a fair number of cycles
160
during a number of important workloads, so i wanted to avoid this as much
161
as possible.
162
163
- the SMP idle-task startup code was still racy and the new scheduler
164
triggered this. So i streamlined the idle-setup code a bit. We do not call
165
into schedule() before all processors have started up fully and all idle
166
threads are in place.
167
168
- the patch also cleans up a number of aspects of sched.c - moves code
169
into other areas of the kernel where it's appropriate, and simplifies
170
certain code paths and data constructs. As a result, the new scheduler's
171
code is smaller than the old one.
172
173
	Ingo