1
2
				Schedulers
3
				----------
4
5
6
Study notes for a number of Scheduling-related papers, which served as
7
a preparation for building an SMP scheduling base for my “Cute” kernel.
8
9
The __primary sources__ (papers from the 60s, 80s, 90s, and 2000s) of
10
the following topics are discussed:
11
12
	- Multi-level Feedback Queues: that algorithm is the cornerstone
13
	  of old & new schedulers like SVR2/3/4, the BSDs, Linux O(1), NT,
14
	  Staircase, and Staircase Deadline algorithms.
15
16
	- SVR4 and Solaris event-driven scheduling: including the
17
	  disadvantages I faced while implementing it.
18
19
	- Spinlocks: why was it badly-needed and invented by VMS engineers,
20
	  and on missing an important condition of critical section theory
21
	  in its original design.
22
23
	- Per-CPU Data Areas: how VAX/VMS originally handled it, how the
24
	  VMS-based NT kernel did it afterwards, and how the AMD x86-64
25
	  architecture specifically kept the %fs and %gs segments cause
26
	  of that original NT design.
27
28
	- Finding the current thread state using stack masking: why was
29
	  that design engineered by VMS on the VAX architecture, and how
30
	  it was transfered to DEC OSF/1 and unnecessarily later to Linux.
31
32
	- DEC OSF/1 and VAX/VMS symmetric multiprocessing: these kernels
33
	  were the origin of now-classical ideas like spinlocks, per-CPU
34
	  runqueues, interactions between spinlocks and kernel preemption,
35
	  and thread scheduling soft and hard affinity!
36
37
38
2010-2011, Ahmed S. Darwish <darwish.07@gmail.com>
39
40
41
NOTE: To get things going at first, I start with a very basic overview of
42
general-purpose thread scheduling using course notes by R. Arpaci-Dusseau.
43
If you're familiar with the basics, move directly to the legendary F.J.
44
Corbato's “An Experimental Time-Sharing System” paper.
45
46
NOTE-2: You can find a cached copy of *all* the discussed papers at
47
      http://gitorious.org/cute-os/references/trees/master/papers/sched
48
49
NOTE-3: Turn UTF-8 on if you're reading this inside a web browser.
50
51
52
* TOC:
53
  ----
54
55
  - Arpaci00, Operating Systems Notes, 'Scheduling: Introduction'
56
  - Corbato62, 'An Experimental Time-Sharing System', MIT
57
  - Corbato63, 'The Compatible Time-Sharing System - A Programmer's Guide'
58
  - Dingledine09, 'Performance Improvements on Tor'
59
  - Arpaci00, OS Notes, 'Scheduling: The Multi-Level Feedback Queue'
60
  - Arpaci98, 'Multilevel Feedback Queue Scheduling in Solaris'
61
  - Oracle02, Man Pages Section 4: File Formats, ts_dptbl(4)
62
  - Black90, 'Concurrency and Parallelism in the Mach Operating System'
63
  - Denham94, 'DEC OSF/1 Version 3.0 Symmetric Multiprocessing Implementation'
64
  - Gamache88, 'VMS Symmetric Multiprocessing'
65
  - Sliberchatz08, Bibliography from the 'Operating System Concepts' book
66
67
68
* Remzi Arpaci-Dusseau, Operating System Notes, “Scheduling:
69
  Introduction”, University of Wisconsin-Madison, 2000:
70
  ----------------------------------------------------------
71
72
  The included portion of Remzi's notes (ch. 6, 7, and 8) build a light-
73
  weight (undergrad-level) introduction to general-purpose OS scheduling.
74
  These documents also include a number of good references.
75
76
  Here are the important points in these notes + analysis of mine:
77
78
  - Some of the metrics in measuring schedulers include 'Turnaround time',
79
    which is the time taken between registering a job for execution and
80
    its completion/return/finish. Mainly:
81
82
    T_{turnaround} = T_{finish} - T_{arrival}
83
84
    Another metric, caring more about latency, is the 'response time'
85
    which is the time the job waits in the ready queue till some CPU time
86
    is provided:
87
88
    T_{response} = T_{firstrun} - T_{arrival}
89
90
  - The cost of context switching does not arise from the OS actions of
91
    saving and restoring a few registers. For example, we do context
92
    switching in our kernel using only 37 x86 opcodes: they are mainly
93
    register moves and stack pops and pushes.
94
95
    The real cost is that "when programs run, they build up a great deal
96
    of state in CPU caches, TLBs, branch predictors, and other on-chip
97
    hardware. Switching to another job causes this state to be flushed
98
    and new state relevant to the currently-running job to be brought in,
99
    which may exact a noticeable performance cost."
100
101
  - Round-robin, the simplest practical scheduler possible, involves the
102
    act of __cycling__ on the runqueue's list of processes, running each
103
    job for a fixed time slice (unit). Assuming processes A, B, C, and D,
104
    the scheduler run the jobs as follows
105
106
    | A | B | C | D | A | B | C | D | A | B | C | D | A | B | C | ...
107
108
    each for an equal amount of time. Advantages include implementation
109
    simplicity, good average response times (see disadvantages though),
110
    and being starvation-free by design.
111
112
    Disadvantages include very bad behaviour on system saturation, cause
113
    the system will swap-in and out programs for each (supposedly small)
114
    time slice [*], and response-time rising linearly with the number of
115
    active jobs in the system, even for highly-interactive processes.
116
117
    [*] this is a point handled by the Corbato paper below
118
119
  - In practice, most of processes will issue a number of blocking I/O
120
    requests during their allocated time quanta, which leads to moving
121
    them out of the runqueue altogether, and returning back to such queue
122
    after I/O completion.
123
124
    HOW are these jobs treated after returning back can be as simple as
125
    considering them new processes (classical round-robin), or to perform
126
    timing calculations and modify their priority (assuming 'priorities'
127
    support in the scheduler, or a similar concept).
128
129
  - Finally, there's a nice paragraph on the separation of mechanism and
130
    policy in OS Design. For example, stuff like programming the system
131
    ticks handler and context-switching code are scheduling mechanisms
132
    (mechanically how the jobs get interleaved). Scheduling policies
133
    (or disciplines) is what you're currently reading is about.
134
135
    In real-time POSIX for example, there are several programmable
136
    scheduling 'policies' that can be setup through sched_setscheduler(2).
137
    Such policies include SCHED_FIFO, SCHED_RR, and SCHED_OTHER.
138
139
140
* F.J. Corbato et al., “An Experimental Time-Sharing System”, AIEE-IRE
141
  Proceedings of the May 1-3, 1962, Spring Joint Computer Conference, MIT:
142
  ------------------------------------------------------------------------
143
144
  This is one of the earliest papers on time-sharing schedulers. The core
145
  of the paper (beside its overall historical importance) is the section
146
  'A Multi-Level Scheduling Algorithm', which is in modern terminology now
147
  called the 'Multi-level Feedback Queue' (MLFQ). Important points:
148
149
  - The main problem the algorithm tries to solve is system saturation:
150
    a) total size of active user programs exceed high-speed memory
151
    b) too  many active user programs to maintain an adequate response at
152
    each user console.
153
    c) high performance cost of context switching
154
155
    According to the authors, a good solution is to provide a procedure
156
    which gives graceful degradation of the response time and effective
157
    real-time computation speed of the *large and long-running users*.
158
159
  - The paper argues that a simple round-robin scheduler will not work in
160
    case of saturation cause it will cause an 'abrupt collapse of service'
161
    due to the sudden burden of time required to swap programs in-and-out
162
    of secondary memory such as a disk.
163
164
    The multi-level scheduling algorithm was thus presented to improve the
165
    saturation performance of a time-sharing system.
166
167
  - We now delve into the scheduler algorithm details: from the very first,
168
    the scheduler partitions its runqueue to several priority queues
169
    l_0, l_1, ..., l_L. ('L' is to be calculated later)
170
171
  - In general, the bigger the program memory footprint (size), the less
172
    queue priority it's assigned to. But, the less the queue priority, the
173
    bigger the runtime for each of its tasks in number of quantum.
174
175
    Mathematically: l = [ log_2( [w_p/w_q] + 1 ) ]
176
177
    where each task at queue 'l' runs for 2^l quanta, and that if
178
    l_a < l_b, then l_a has a higher priority than l_b. We can deduce
179
    several things from above equation:
180
181
    a) if      0 < w_p <    w_q, [w_p/w_q] =  0, runtime =  1 quantum |-- 1
182
    b) if    w_q < w_p <  2*w_q, [w_p/w_q] =  1, runtime =  2 quanta  -|
183
    c) if  2*w_q < w_p <  3*w_q, [w_p/w_q] =  2, runtime =  2 quanta  -|-- 2
184
    d) if  3*w_q < w_p <  4*w_q, [w_p/w_q] =  3, runtime =  4 quanta  ..|
185
    e) if  4*w_q < w_p <  5*w_q, [w_p/w_q] =  4, runtime =  4 quanta  ..|
186
    f) if  5*w_q < w_p <  6*w_q: [w_p/w_q] =  5, runtime =  4 quanta  ..|-- 4
187
    g) if  6*w_q < w_p <  7*w_q: [w_p/w_q] =  6, runtime =  4 quanta  ..|
188
    h) if  7*w_q < w_p <  8*w_q: [w_p/w_q] =  7, runtime =  8 quanta  +++
189
    ...								      +++|-- 8
190
    p) if 15*w_q < w_p < 16*w_q: [w_p/w_q] = 15, runtime = 16 quanta  xxxx
191
    ...								      xxxx|-- 16
192
    _) if 31*w_q < w_p < 32*w_q: [w_p/w_q] = 31, runtime = 32 quanta  @@@@@
193
    ...								      @@@@@|-- 32
194
    _) if 63*w_q < w_p < 64*w_q: [w_p/w_q] = 63, runtime = 64 quanta
195
196
    Note: at point 'a)', queue 'l' is equal to 0: the highest possible
197
    queue in priority.
198
199
    By above design, the bigger the program, the bigger its allocated
200
    number of quanta, making the time-overhead of swapping negligible.
201
    Thus, optimizing the turnaround time of CPU-bound processes.
202
203
  - The above point in scheduler design _tries_ to solve the first part of
204
    the problem the algorithm tackles: the stated point 'a) total size of
205
    active user programs exceed high-speed memory' above.
206
207
    From now on, the paper tackles second part of the problem, 'b) too many
208
    active user programs to maintain an adequate response'.
209
210
  - The scheduler __always__ begin with the tasks at the head of the
211
    highest priority queue (l_0). After that queue gets exhausted, it moves
212
    to the direct following queue l_1, and so on  <== (I)
213
214
    Tasks in a single queue get run in a round-robin fashion.
215
216
  - If while executing a task at queue 'l' (during the 2**l quantum), a
217
    task got assigned to a higher priority queue 'l*', the current
218
    executing task get put at the __head__ of the 'l'th queue, and the
219
    process at (I) is restarted from queue 'l*', and so on.
220
221
    In other words, a task cannot get executed while another task is
222
    waiting at a higher priority queue.
223
224
  - If a program in queue 'l' has not 'completed' (did not issue any I/O)
225
    within its 2**l-quantum runtime, it's placed at the __tail__ of the
226
    less-priority queue 'l+1'
227
228
  - A powerful point in the algorithm above (beside broad bounds on system
229
    performance) is that the classification procedure for tasks is entirely
230
    automatic: it depends on performance and program size rather than
231
    static declarations or (the possibly malicious) users hopes.
232
233
  - As once stated above, the main theme around this scheduler is improving
234
    the __saturation performance__ of the system.
235
236
    This performance improvement on saturation exists cause as a user taxes
237
    the system, the degradation of service occurs progressively starting
238
    with users with large or long-running programs (least-priority queues)
239
    first.
240
241
  - Starvation: there's a probable case of a big number of active high
242
    priority tasks starving less-priority CPU-bound tasks out of execution.
243
244
    The paper __models__ a worst cast of having N tasks, each running at
245
    queue 'l', and each relinquishing the CPU __directly__ before their
246
    2**l quantum runtime, and thus each also returning to the the same-
247
    priority queue, with the condition of returning to the runqueue only
248
    't_u'-seconds later (time for finishing I/O and, or, user response).
249
250
    If the system will always be busy at queue 'l' above, starving all less
251
    priority queues 'l + x' for x > 0, then the queue 'l' must never be
252
    empty; we must have:
253
254
	N * 2^l * q >= t_u
255
	2^l         >= (t_u / N*q)
256
	l           >= log_2 (t_u / N*q)
257
258
    where 'N' is the number of tasks at queue 'l'.
259
260
    From above, we've found an upper-bound on the priority queue that can
261
    be serviced: queues > 'l' above can possibly starve if number of tasks
262
    reach N. Thus, we have a guarantee that all
263
264
	l_a        <= [ log_2 (t_u / N * q) ]
265
266
    will be serviced and will __not__ get starved (given the estimated
267
    response time t_u), where '[]' denotes the integral part of enclosed
268
    equation. The author calls this queue the 'highest serviced level'
269
    queue.
270
271
  - The equation above has set a bound on the starvation problem. So if we
272
    set the max possible 'highest serviced queue' to the max possible queue
273
    on the system, we have a __theoretical guarantee__ of no starvation in
274
    the system.
275
276
    (Providing this guarantee is very hard for modern-day general purpose
277
    operating systems: we cannot easily put a limit on the number of tasks
278
    'N', and we cannot even have a valid estimation of 't_u', the I/O
279
    or user response finish time.
280
281
    These values can be calculated easily if the target hardware and work
282
    load is previously known; this is the case in the paper [IBM 7090],
283
    and this can also be the case for small and predictable devices like a
284
    small network router.)
285
286
  - By now, we've analyzed most of the paper sans the response time
287
    analysis.
288
289
    While these response time equations are intellectually simulating, they
290
    offer very broad bounds that are not entirely practical for a general
291
    purpose scheduler, similar to the paper's way of handling starvation in
292
    the above point.
293
294
295
* F.J. Corbato et al., “The Compatible Time-Sharing System: A Programmer's
296
  Guide”, The MIT Computation Center, MIT Press, 1963:
297
  ------------------------------------------------------------------------
298
299
  This manual gives a very nice overview of the historical time-sharing
300
  hands-on experience. It's a worthwhile read while studying the CTSS
301
  scheduler paper above.
302
303
  A useful demonstration of a programmer's session on CTSS is also
304
  presented in the Appendix. Interesting text editor!! it makes me
305
  remember the original UNIX editor 'ed'.
306
307
308
* Roger Dingledine, “Performance Improvements on Tor”, 2009:
309
  ----------------------------------------------------------
310
311
  Now why am I including a paper on Tor while we're dealing with
312
  schedulers? Because the Tor network tries to solve a problem very
313
  similar to the one Corbato was designing his scheduler model against:
314
315
  ``Section 1 described mechanisms to let low-volume streams have a
316
  chance at competing with high-volume streams. Without those mechanisms,
317
  normal web browsing users will always get squeezed out by people
318
  pulling down larger content and tolerating high latency.
319
320
  But the next problem is that some users simply add more load than the
321
  network can handle. Just making sure that all the load gets handled
322
  __fairly__ isn't enough if there’s too much load in the first place.(+)
323
324
  When we originally designed Tor, we aimed for high throughput. We
325
  figured that providing high throughput would mean we get good latency
326
  properties for free. However, now that it’s clear we have several user
327
  profiles trying to use the Tor network at once, we need to consider
328
  changing some of those design choices. Some of these changes would
329
  aim for better latency and worse throughput. (++) ''
330
331
  The statement marked with (+) matches word-by-word the problem
332
  statement of the Multi-Level Feedback Queue paper: to gracefuly handle
333
  'system saturation' while maintaining an acceptable level of system
334
  responsiveness. Thus, avoiding the 'abrupt collapse of service' caused
335
  by OS scheduler's round-robin algorithm or Tor's full fairness in high
336
  overlad.
337
338
  A main theme of the MLFQ solution, and Tor's one proposed at (++), is
339
  not to let throughput-driven jobs (compilers in schedulers case, an
340
  700-MB ISO download in the Tor case) deprive latency-senitive jobs
341
  out of quickly-needed resources. Examples of such latency-sensive
342
  apps include an editor or web-browser GUI scrolling in OS schedulers,
343
  and servicing HTML websites in Tor.
344
345
346
* Remzi Arpaci-Dusseau, Operating Systems Notes, The MLFQ, 2000:
347
  --------------------------------------------------------------
348
349
  Here, the MLFQ algorithm we studied from __primary sources__ above is
350
  re-discussed in modern terms. Important points:
351
352
  - The hardest point in designing schedulers is to minimize response time
353
    for interactive jobs, while also minimizing overall system turnround
354
    time.
355
356
  - As we know by now, MLFQ varies the priority of a job based on its
357
    observational behaviour: learning from jobs as they run, and using
358
    history to predict their future behaviour. Historical state is kept in
359
    the scheduler priority queues.
360
361
    This is an important systems design pattern of learning from history:
362
    observing current trends and learning from the past to predict the
363
    future. Such pattern is common in CPUs branch prediction, working set
364
    rules, and in networking protocols congestion control mechanisms.
365
366
    Remzi advises us though that 'such approaches work when jobs have
367
    __phases of behavior__ and are thus predictable; of course, one must
368
    be careful with such techniques, as they can easily be wrong and drive
369
    a system to make worse decisions than they would have with no knowledge
370
    at all.'
371
372
  - There are two important problems in MLFQs:
373
374
    a) Starvation: One of the possible ways to handle it is the well-known
375
    'aging principle': increasing the priority as the process waits. Check
376
    below parenthesized notes though; it's not that simple.
377
378
    (Note1: The original paper theoretically 'solved' this issue by giving
379
    bounds on starvation. These bounds can only work if we have a true
380
    estimate of I/O response times _and_ of the jobs number in the system.
381
    This is practically impossible for general-purpose scheduling.)
382
383
    (Note2: There are several ways to __practically__ apply aging methods.
384
    Some of these ways are discussed in David Black's paper 'Scheduling
385
    support for concurrency and parallelism in the Mach Operating System.')
386
387
    b) Gaming the system: A possible solution is to improve the algorithm
388
    accounting of CPU time, making it less viable to gaming. Check the
389
    notes for further details.
390
391
392
* Remzi Arpaci-Dusseau, “Multilevel Feedback Queue Scheduling in Solaris”,
393
  1998  &&  Solaris, Man Pages Section 4: File Formats, ts_dptbl(4), 2002:
394
  ------------------------------------------------------------------------
395
396
  From now on, we check how various real-world kernels implement MLFQ
397
  scheduling. Beginning with Solaris, here are the important points:
398
399
  - For regular jobs, the kernel has 60 priority queues. A job begins in
400
    the middle, at priority 29. As in classical MLFQ, CPU-bound jobs get
401
    assigned smaller priorities over time (with bigger slices), and I/O
402
    tasks move up, getting scheduled whenever they have jobs to do.
403
404
    Now here is Solaris 'dispatcher parameter table' defaults:
405
406
    Priority	q (ms)	tqexp	slpret	maxwait lwait
407
    0		200	0	50	0	50
408
    ...		vvv	v	vv	v	vv
409
    9		vvv	v	vv	v	vv
410
    10		160	v	51	v	51
411
    11		vvv     1	vv	v	vv
412
    12		vvv	2	vv	v	vv
413
    13		vvv	3	vv	v	vv
414
    14		vvv	4	vv	v	vv
415
    15		vvv	5	vv	v	vv
416
    16		vvv	6	vv	v	vv
417
    17		vvv	7	vv	v	vv
418
    18		vvv	8	vv	v	vv
419
    19		vvv	9	vv	v	vv
420
    20		120	10	52	v	52
421
    21		vvv	11	vv	v	vv
422
    22		vvv	12	vv	v	vv
423
    23		vvv	13	vv	v	vv
424
    24		vvv	14	vv	v	vv
425
    25		vvv	15	vv	v	vv
426
    26		vvv	16	vv	v	vv
427
    27		vvv	17	vv	v	vv
428
    28		vvv	18	vv	v	vv
429
    29		vvv	19	vv	v	vv
430
    30		80	20	53	v	53
431
    31		vv	21	vv	v	vv
432
    33		vv	22	vv	v	vv
433
    33		vv	23	vv	v	vv
434
    34		vv	24	vv	v	vv
435
    35		vv	25	54	v	54
436
    36		vv	26	vv	v	vv
437
    37		vv	27	vv	v	vv
438
    38		vv	28	vv	v	vv
439
    39		vv	29	vv	v	vv
440
    40		40	30	55	v	55
441
    41		vv	31	vv	v	vv
442
    42		vv	32	vv	v	vv
443
    43		vv	33	vv	v	vv
444
    44		vv	34	vv	v	vv
445
    45		vv	35	56	v	56
446
    46		vv	36	57	v	57
447
    47		vv	37	58	v	58
448
    48		vv	38	vv	v	vv
449
    49		vv	39	vv	v	59
450
    50		vv	40	vv	v	vv
451
    51		vv	41	vv	v	vv
452
    52		vv	42	vv	v	vv
453
    53		vv	43	vv	v	vv
454
    54		vv	44	vv	v	vv
455
    55		vv	45	vv	v	vv
456
    56		vv	46	vv	v	vv
457
    57		vv	47	vv	v	vv
458
    58		vv	48	vv	v	vv
459
    59		20	49	59	32000	vv
460
461
    tqexp: time quantum expired priority, the new priority the thread is set
462
    to when it uses its entire time quantum. As you can see, any process
463
    using its entire slice get its priority decreased by __10 levels__.
464
465
    slpret: sleep return priority value, the new priority the thread is set
466
    to after waking up (from a sleep event like network or disk I/O.)
467
    The thread priority is increased to 50 or above after an I/O operation!
468
469
    maxwait: a starvation avoidance mechanism to compensate threads waiting
470
    for a relatively long time in the dispatcher queue. 'maxwait' is such
471
    waiting time threshold. If a thread waited more than maxwait in the
472
    queues, its priority is boosted up to the respective 'lwait' priority.
473
474
    Each thread has a dispatcher_wait ('dispwait') counter, which is reset
475
    when the thread is inserted into the dispatch queue (following a time-
476
    quantum expiration, or cause of wakeup from I/O or locking). This
477
    field is incremented once per second for all processes waiting in the
478
    dispatcher queues. If such counter exceeded 'maxwait', then the thread
479
    priority is incremented to 'lwait' -- a priority > 50 by default.
480
481
    Note that the default 'maxwait' for Solaris is 0: just 1 second and
482
    all the system processes priority will be boosted up above 50. This
483
    avoids starving CPU-bound too much.
484
485
    [Sol-*]:
486
    (The values above have some very interesting details: Note that after
487
    1-second, processes in the lowest (0-9) range will move to priority 50,
488
    processes in the range (10-19) will move to 51, (20-29) to priority 52,
489
    and so on. By this, high-priority processes response times will NOT be
490
    affected because they will still run on queue 59 instantly.
491
492
    Now, how can this prevent starvation? Can't processes keep going in
493
    and out very quickly, starving queues < 59?. Indeed that was the
494
    mathematically modeled worst case in the original CTSS paper.
495
496
    Now comes the second important detail: even if the worst case above
497
    occured (highly unlikely in practice, except if done maliciously),
498
    you'll see that the 'lwait' field of the 50-59 priorites is 59. Thus,
499
    in the second __starvation round__, all will move to 59 and execute
500
    in a round-robin manner. This solves the starvation problem elegantly.
501
502
    Summary: starvation can never exceed 2 seconds + waiting time in prio
503
    queue 59. All while maintaining system responsiveness in the process.)
504
505
  - NOTE: I've coded such scheduler in my kernel, and its disadvantages
506
    began to appear directly after the first day of benchmarking: too much
507
    heuristics and voo-doo constants!
508
509
    Scheduling 100% CPU-hog threads showed the ugly side of that algorithm:
510
    the combination of a big number of dynamic priority levels with huge
511
    and different slice lengthes per each priority kept starving the low
512
    priority processes till the newly-queued CPU hogs moved up to their
513
    equivalent low-priority level.
514
515
    While the new CPU hogs were moving up, the older low-priority tasks
516
    faced complete starvation for 1+ seconds. The algorithm thus boosted
517
    these threads to high-priority levels, but this __starved__ the newer
518
    threads that are now residing in relatively lower-priority queues. This
519
    effectively made all system processes enter an infinite loop of
520
    1+-second starvation and then boosting -- an upleasent situation.
521
522
    Decreasing the number of priority levels and making the slices more fair
523
    helped the situation, but kept introducing __new__ corner-cases. After
524
    a number of days, I gave-up and went coding a more theoretically-sound
525
    scheduling algorithm.
526
527
    (The benchmark was simply plotting different kernel threads properties
528
    over time. Such properties included threads total runtime, average
529
    runtime per dispatch, total wait in runqueues, average wait in runqueues,
530
    number of priority boosts [cause of starvation], number of preemptions
531
    cause of a higher-priority thread becoming runnable, and number of
532
    preemptions cause of a thread time-slice end:
533
534
    (a) code version 1, 6 threads:
535
        http://localf.googlepages.com/pub/mlfq/fairness.gif
536
    (b) code version 1, 100 threads:
537
        http://localf.googlepages.com/pub/mlfq/fairness100.gif
538
    (c) code version 2 (after sanitizing the voo-doos), 6 threads:
539
        http://localf.googlepages.com/pub/mlfq/fairness-improved.gif
540
    (d) code version 2, 100 threads (things became good!!):
541
        http://localf.googlepages.com/pub/mlfq/fairness-improved.gif
542
    (e) unfortunately, the _new_ heuristics above had corner cases:
543
        http://localf.googlepages.com/pub/mlfq/fairness-improved-pathological.gif
544
545
    It seems I'm not the only one facing this though; I've found some refs
546
    in academia on how SVR4-based schedulers [like Solaris] were very hard
547
    to tweak and adapt to different workloads.)
548
549
550
* David L. Black, “Scheduling Support for Concurrency and Parallelism in
551
  the Mach Operating System”, IEEE Computer Journal, Carnegie Mellon, 1990:
552
  -------------------------------------------------------------------------
553
554
  This is a very interesting article, but we will concentrate on its MLFQ
555
  deficiencies discussion for now (Note: always remember Black's point of
556
  creating an idle thread per CPU, not to corrupt kernel stack on
557
  multiprocessor systems). Important points:
558
559
  - Multics used a decreasing priority MLFQ (as was exactly described in
560
    its original paper) and discovered the major deficiency: on a heavily
561
    loaded system with many short-lived jobs, the priority of a CPU-bound
562
    job can decrease until little or no CPU-time is assigned to it. The
563
    author notes that such problem (starvation) also existed on a number
564
    of UNIX kernels.
565
566
  - To handle this problem, it's necessary to elevate priorities in some
567
    manner. VAX/VMS elevates a low application priority when it issues an
568
    event like I/O completion.
569
570
    The author states that this negatively affect interactive applications
571
    which consume large amount of processor time, since they may not issue
572
    priority-raising events to an effective degree.
573
574
    (Nonetheless, this approach is effective, response-time wise, for
575
    handling software with 'phases' of I/O and CPU-bound activity if the
576
    CPU-bound activity is long but not dominant. This is a point the
577
    original CTSS scheduler did not address: a job's transition from a
578
    CPU-bound to an I/O-bound phase.)
579
580
  - A (non-exclusive) second way to handle this is processor usage aging.
581
    To compensate lower-priority jobs, we forget (decay) their past history
582
    in some way, usually in an exponential decay manner.
583
584
    (Exponential decay: N(t) = N(0) * (decay_factor)**t. Assuming a decay
585
     factor of 0.5, and N(0)=10, then N(1)=5, and N(2) = (10/2)/2 = 0.25)
586
587
  - Solaris, as we've seen, uses BOTH of the prevention methods above:
588
    processes priorities get elevated after completion of an I/O event,
589
    AND the kernel also forgets (age) __most__ of the current processes
590
    usage history after one second (coalescing each 10 priorities to one,
591
    while maintaining the relative difference between them).
592
593
  - 4.3BSD: it solely uses the aging method to handle starvation, with a
594
    CPU-usage decay factor that's based on system load. The bigger the
595
    system load, the less usage decay, avoiding jobs from getting crammed
596
    in high priority queues on high system load.
597
598
    Priority of the __currently running__ thread get re-calculated after
599
    each 4 ticks of its runtime, decreasing linearly with job's CPU usage.
600
    If a higher priority was discovered after this decrease, the thread
601
    is preempted.
602
603
    Routines responsible for waking up a thread or adding a new thread to
604
    the runqueue check such thread priority against the currently running
605
    thread. If it's higher, a reschedule operation is requested.
606
607
    (XXX: Unix Internals discusses the problem of classical SVR2/3 decay,
608
    its equations, and the new BSD decay method. Add these notes here as
609
    a supplement to the original paper discussion.)
610
611
612
* Jeffrey M. Denham et al., “DEC OSF/1 Version 3.0 Symmetric Multiprocessing
613
  Implementation”, Digital Technical Journal, volume 6 number 3, Summer 1994:
614
  ---------------------------------------------------------------------------
615
616
  This paper is gold, and is comparable in quality to the CTSS one. It
617
  discusses a huge number of now-classical SMP features:
618
619
    + Accessing the 'current' thread state by masking the stack pointer,
620
      regardless of the CPU we're currently running on.
621
    + Per-CPU runqueues for thread scheduling!
622
    + SMP scheduling soft affinity
623
    + Interactions between spinlocks and kernel preemption
624
    + Striking a balance between fine-grained locking and lockless algorithms
625
626
  A good number of above ideas were of a VAX/VMS origin. Important notes:
627
628
  - The task of an SMP kernel is to improve system performance as CPUs are
629
    added without compromising quality. DEC OSF/1 goal was to provide a
630
    performant Unix SMP system over the Alpha architecture.
631
632
    That hardware architecture provided shared memory, symmetric access to
633
    system I/O busses, atomic read-and-set instructions, and interprocessor
634
    interrupts (IPIs).
635
636
    NOTE: This is very close to the Intel MultiProcessor (MP) architecture.
637
638
  - As we know by now, classical Unix kernels insured its consistency by:
639
     a] disabling interrupts in critical regions accessed by IRQ handlers
640
        (and thus handling the concurrency of asynchronous IRQ handlers)
641
     b] allowing only one process to be in kernel context at a time
642
        (and thus handling the virtual concurrency provided by time-sharing)
643
644
    These protections aren't enough in case of multicore systems, since
645
    they provide real concurrent execution of kernel code paths.
646
647
    An evolutionary solution to that problem was the use of 'funneling':
648
    allowing certain kernel subsystems to be executed only in one CPU at
649
    _any_ given time. These subsystems was then iteratively and slowly
650
    moved to local fine-grained locking.
651
652
    NOTE: Linux used that 'funneling' method in its early days using the
653
    'Big Kernel Lock (BKL)'; FreeBSD did the same using a 'Global Mutex'.
654
    Both kernels moved to fine-grained locking iteratively. By virtue of
655
    being started in Winter of 2009, my kernel incorporates fine-grained
656
    locking from the start.
657
658
  - It's interesting to note that the first version of DEC OSF/1 provided
659
    full kernel-mode preemption in its real-time version of the kernel.
660
    Solaris was also made fully preemptive with fine-grained locking very
661
    early on. Linux and FreeBSD caught on around 10 years later.
662
663
  - The paper then discusses how kernel preemption should be disabled by the
664
    entrance of spinlocks. It also discusses the now-very-common idea of
665
    counting the depth of spin locks acquired using a reference count, and
666
    preempting the kernel if and only if that reference count = 0.
667
668
    NOTE 1: This is equivalent to Robert Love's Linux kernel preemption
669
    patch. A google search reveals that this patch got merged in 2001.
670
671
    NOTE 2: FreeBSD on the other hand completely disables local interrupts
672
    upon spin lock entrance, automatically disabling preemption in the
673
    process (at a slight performance cost). I do the same in my kernel.
674
675
    NOTE 3: (cut-and-paste from my kernel comments) Disabling preemption
676
    in spinlock-protected regions is a necessity because:
677
     a) If a thread holding a spin lock got preempted, it will let a
678
        possibly huge number of other threads consume their entire time
679
        slice (which is in the order of milliseconds) just spinning!
680
     b) If that preempted thread (which is now in the scheduler runqueue)
681
        got killed, we can deadlock the entire system!
682
     c) Classical race conditions may arise if the newly scheduled thread
683
        accessed the same critical-region the older thread was accessing
684
        while being preempted.
685
        That's why spin locks are also needed in preemptive Uniprocessor
686
        kernels: ignoring their spinning behavior, they act as atomic
687
        critical region markers.
688
689
    There's also a warning about accessing per-CPU variables while kernel
690
    preemption is enabled. Modern kernels disable preemption in that state.
691
692
    NOTE 4: (also from my kernel comments) Disabling preemption while
693
    accessing a per-CPU variable is necessary:
694
     a) We might get preempted and then rescheduled on a different CPU.
695
     b) If the _preempting_ thread used the same per-CPU variable, that's
696
        a classical race condition.
697
698
    The paper notes that due to the complexities of kernel locking, a
699
    "coherent locking architecture with automated debugging facilities
700
    was needed to ship a reliable product on time."
701
702
  - One problem faced by SMP is finding the descriptor of the current
703
    process context. In a uniprocessor machine, that can be simply done
704
    by a global memory variable that is changed before context switching.
705
706
    On SMP, there's a "current" process descriptor for each CPU. This
707
    descriptor can be located by finding the process number we're running
708
    on, and then accessing an array of per-cpu data based on that number.
709
710
    In Unix kernels, the current thread state is dereferenced a huge number
711
    of times. Needing to know the current CPU number before each of these
712
    dereferences (a costly hardware operation) was unacceptable.
713
714
    NOTE 1: In contemporary x86 CPUs, this is costly since it needs reading
715
    the APIC registers for each dereference. APIC registers accessors induce
716
    higher latency than the usual general-purpose registers.
717
718
    Even assuming that this operation wasn't costly from the hardware side,
719
    we'll have to disable, then enable, kernel preemption each time we want
720
    to access that thread state; also unacceptable.
721
722
    DEC OSF/1 solved this by storing a pointer to the currently running
723
    process descriptor at the very bottom of each thread stack. By simply
724
    masking the stack pointer, they can instantly access that thread state,
725
    __regardless__ of the CPU that thread is currently running on.
726
727
    NOTE 2: The paper states that putting the 'current' thread state pointer
728
    in the per-CPU area, and then setting it before each context switch,
729
    would frequently trash other CPUs caches.  This is true in the regular
730
    case, but aligning such per-CPU array elements on a cache-line boundary
731
    (64 or 128 bytes in x86 CPUs) would solve that problem.  It's possible
732
    that the Alpha architecture had different semantics though.
733
734
    NOTE 3: Linux x86-32 used that stack-masking design for a long time. I
735
    guess it was transfered to its developers by folklore.
736
737
      NOTE: What about the risks, econet anyone?
738
739
    NOTE 4: On the x86 kind of the equation, it's common to _permanently_
740
    assign a segment register with the address of the relevant per-CPU area
741
    (aligned on a cache-boundary) for each CPU. This avoids the inefficiency
742
    of accessing the APIC for every per-CPU data dereference.
743
744
      NOTE: Not all CPU architectures have the luxury of a such a register,
745
      which is left _intact_ by the basic system ABI definition. The absence
746
      of such register in the VAX architecture was the main reason of the
747
      stack-masking idea in VMS. Explore the 'VMS Symmetric Multiprocessing'
748
      paper, DTJ vol.1#7, for further historical context.
749
750
      NOTE: Reading the historical (preliminary) AMD64 architecture document
751
      (January 2001, Rev C), it seems that AMD kept the %fs and %gs registers
752
      because Windows NT used %gs as a per-CPU area pointer. Interestingly,
753
      NT is, afterall, a VMS-based kernel!
754
755
  - Per-CPU runqueues: Paper begins by noting that "scheduling threads out of
756
    a global runqueue is highly effective at keeping the N highest-priority
757
    threads running". It then states the disadvantages of that method: lock
758
    contention over the global runqueue and losing CPU cache state due to
759
    uncontrolled bouncing of threads between CPUs.
760
761
    NOTE: Having a global runqueue and taking CPU affinity in consideration
762
    is not mutually exclusive. Brainfuck Scheduler (BFS) and the Linux-2.4
763
    epoch scheduler is an example of this.
764
765
    The DEC designers thus added "soft affinity" to timeshare threads: these
766
    threads are softly bound to the processor they last run on. Nevertheless,
767
    if too much time passed since last run, they can get scheduled elsewhere:
768
    the CPU state was probably lost anyway. It's more beneficial to directly
769
    run that thread on another less-loaded CPU than wait for the current one.
770
771
  - Load balancing: As a summary of the presented algorithm, let
772
773
    TS(i): The estimated number of ticks available to time-sharing threads
774
           by CPU #i in a scheduling period
775
    TTS  : Total number of system ticks available to time-sharing threads
776
    N    : Total number of runnable time-sharing threads in the system
777
    D(i) : Ideal number of runnable time-sharing threads assigned to CPU #i
778
    C(i) : Current number of time-sharing threads assigned to CPU #i
779
780
    we have
781
782
    TTS  = TS(1) + TS(2) + ... + TS(k), k = number of system CPUs
783
    D(i) = (TS(i) / TTS) * N
784
785
    Every second, the kernel calculates D(i) for each CPU. If a CPU had a
786
    C(i) < D(i), and another one had C(i) > D(i), the former is marked as
787
    "out of balance". When such unbalanced CPU reschedules, it "steals"
788
    tasks from the overloaded CPU. Tasks get stolen only if they did not
789
    run on their softly-bound CPU for a configurable amount of time, thus
790
    preserving soft affinity usefulness as much as possible.
791
792
  - The paper finishes by stating the software engineering tactics needed
793
    to safely develop such an ambitious code base. Remember that this was
794
    the 90s.  After studying that paper, it's clear that the two most
795
    important tactics were:
796
797
       - Code review, and
798
       - Iterative development
799
800
    So, well, things haven't changed a lot from the software engineering
801
    side!
802
803
804
* Rodney N. Gamache && Kathleen D. Morse, “VMS Symmetric Multiprocessing”,
805
  Digital Technical Journal, volume 1 number 7, pp. 57-73, August 1988:
806
  ------------------------------------------------------------------------
807
808
  NOTE: Below discussion assumes reading our notes on DEC OSF/1 v3.0 first.
809
810
  "Systems software engineers must now design effective ways to utilize
811
  systems with six, eight, or even more CPUs", begins this late 80s paper!
812
813
  The SMP additions to VAX/VMS were motiviated by the, back then new, VAX
814
  6200 architecture. From other papers in the _same_ journal issue, that
815
  hardware design had:
816
817
   + Symmetric shared memory
818
   + Cache coherency in hardware (always a bless!)
819
   + First-level cache for instructions, and a second-level write-through
820
     one for both data and instructions
821
   + Broadcast and single-destination inter-processor interrupts
822
823
  This paper INTRODUCED a good number of now-classical SMP features:
824
825
   + Spinlocks! (also _coined_ that term)
826
   + Accessing the Per-CPU area (and thus the current thread state) by
827
     masking the CPU stack pointer register.
828
   + SMP scheduling hard affinity! (also _coined_ that term)
829
830
  Having above context in mind, here were the important points:
831
832
  - VMS engineers first considered using a global lock[*] to protect the
833
    kernel in an SMP machine. They (rightfully) ditched the idea later and
834
    moved directly to the ambitious goal of fine-grained locking.
835
836
    NOTE: Compare this to the DEC OSF/1 method, where they first used a
837
    global lock, and then fine-grained the system iteratively.
838
839
    [*] Big Kernel Lock in Linux jargon, Global Mutex in the BSDs
840
841
  - Inventing spinlocks: Regular VMS mutexes must be acquired in a process
842
    context, but the kernel needed SMP protection in different other
843
    contexts. That's why spinlocks were created; they can get acquired by
844
    any CPU in any context (as long as certain sanity rules are held).
845
846
    NOTE 1: A classical case is protecting the video-RAM and printf() buffers.
847
    The kernel usually want to access such resource in all possible contexts,
848
    including boot context, exception and even IRQ context, not just inside
849
    process contexts.
850
851
    Usually, a mutex is acquired or "owned" by a certain process; spinlocks
852
    on the other hand are acquired by the CPU as a whole. Spinlocks was
853
    implemented in VMS using an interlocked bit-test-and-set op.
854
855
    NOTE 2: It's still implemented similarly these days, but it might be
856
    faster to use an interlocked swap. A missing issue in that paper design
857
    was making threads lock spinning time be finite (bound); that's an
858
    important (third) condition in the theory of critical sections.
859
860
      NOTE: That got handled in Linux using N.Piggin's ticket spinlocks.
861
862
  - VMS added two important debugging aids to spin locks:
863
864
    a) each spinlock was assigned a rank (order). All spinlocks must be
865
       acquired in that order, eliminating any possible deadlocks.
866
    b) each spinlock structure saved the last eight program counters (PC/
867
       x86 %RIP) that acquired or released such lock.
868
869
    NOTE: Can this ranking method dynamically work in my kernel? All my
870
    locks are strictly local so far, but that might change once VFS and
871
    file-systems are added (e.g. inode locks).
872
873
  - Per-CPU Area: I _love_ how the authors state their thinking process in
874
    that section. As in any SMP-capable kernel, VMS needed a per-CPU data
875
    area to put things as the current thread state pointer and other per-
876
    CPU bookkeeping.
877
878
    The most effecient way to do that is to have a general-purpose register
879
    _permanently_ set to the virtual address of the relevant per-CPU area.
880
    Such a register did not exist in VAX, so instead of creating it in the
881
    VAX hardware, they searched for a way to _overload_ a current register
882
    with that purpose.
883
884
    Since each CPU must have a unique stack to safely handle interrupts,
885
    the current thread stack pointer can be safely overloaded with that
886
    task. They put a pointer to the per-CPU area at the bottom of such
887
    stacks, which can be reached by simply masking the stack pointer.
888
889
    NOTE: See MUCH more details on this in our DEC OSF/1 discussion above.
890
891
  - CPU Affinity: They maintained a mask inside each thread descriptor,
892
    where each CPU is represented by a single bit. Setting only one bit in
893
    that mask is equivilant to hard binding that thread to a single CPU.
894
    Such affinity rules are naturally enforced by the scheduler.
895
896
    NOTE: Solaris and Linux roughly use the same method for hard affinity.
897
898
899
* Bibliography from “Operating System Concepts”, Sliberchatz et al., 2007:
900
  ------------------------------------------------------------------------
901
902
  Feedback queues were originally implemented on the CTSS system described
903
  in Corbato et al.[1]. This feedback queue scheduling system was analyzed
904
  by Scharge[2]. The preemptive priority scheduling algorithm was suggested
905
  by Kleinrock[3].
906
907
  [1] Fernando J. Corbato et al., 'An Experimental Time-Sharing System', 
908
      Proceedings of the AFIPS Fall Joint Computer Conference (1962)
909
910
  [2] L. E. Scharge, 'The Queue M/G/I with Feedback to Lower Priority
911
      Queues', Management Science, Volume 13 (1967)
912
913
  [3] L. Kleinrock, 'Queuing Systems, Volume II: Computer Applications,
914
      Wiley Interscience (1975)
915
916
  Anderson et al.[4], Lewis and Berg[5], and Philbin et al[6] discuss
917
  thread scheduling.
918
919
  [4] T. E. Anderson et al., 'The Performance Implications of Thread
920
      Management Alternatives for Shared-Memory Multiprocessors', IEEE
921
      Transactions on Computers, Volume 38, Number 12 (1989)
922
923
  [5] B. Lewis and D. Berg, 'Multithreaded Programming with Pthreads,
924
      Sun Microsystems Press (1998)
925
926
  [6] J. Philbin et al., 'Thread Scheduling for Cache Locality',
927
      Architectural Support for Programming Language and Operating Systems,
928
      (1996)
929
930
  Scheduling techniques that take into account information regarding
931
  process execution times from previous runs are described in Fisher[7],
932
  Hall et al.[8], and Lowney et al.[9]
933
934
  [7] J. A. Fisher, 'Trace Scheduling: A Technique for Global Microcode
935
      Dispatching', IEEE Transactions on Computers, Volume 30, Number 7
936
      (1981)
937
938
  [8] L. Hall et al., 'Scheduling to Minimize Average Completion Time:
939
      Off-line and On-line Algorithms', SODA: ACM-SIAM Symposium on
940
      Discrete Algorithms (1996)
941
942
  [9] P. G. Lowney et al., 'The Multiflow Trace Scheduling Compiler',
943
      Journal of Supercomputing, Volume 7, Number 1-2 (1993)
944
945
  Fair-share schedulers are covered by Henry[10], Woodside[11], and Kay
946
  Kay and Lauder[12]
947
948
  [10] G. Henry, 'The Fair Share Scheduler', AT&T Bell Laboratories
949
       Technical Journal (1984)
950
951
  [11] C. Woodside, 'Controllability of Computer Performance Tradeoffs
952
       Obtained Using Controlled-Share Queue Schedulers', IEEE
953
       Transactions on Software Engineering, Volume SE-12, Number 10
954
       (1986)
955
956
  [12] J. Kay and P. Lauder, 'Fair Share Scheduler', Communications of the
957
       ACM, Volume 31, Number 1 (1988)
958
959
  Siddha et al[13] discusses __MODERN__ challenges on multicore systems.
960
961
  [13] S. Siddha et al, 'Process Scheduling Challenges in the Era of
962
       Multi-Core Processors', Intel Technology Journal, Volume 11 (2007)
963
964
  Multiprocessor scheduling was discussed by Tucker and Gupta[14], Zahorjan
965
  and McCann[15], Fietelson and Rudolph[16], Leutenegger and Vernnon[17],
966
  Blumofe and Leisersonn[18], Polychronopoulos and Kuck[19], and Lucco[20].
967
968
  [14] A. Tucker and A. Gupta, 'Process Control and Scheduling Issues for
969
       Multiprogrammed Shared-Memory Multiprocessors', Proceedings of the
970
       ACM Symposium on Operating Systems Principles (1989)
971
972
  [15] J. Zahorjan and C. McCann, 'Processor Scheduling in Shared-Memory
973
       Multiprocessors', Proceedings of the Conference on Measurement and
974
       Modeling of Computer Systems (1990)
975
976
  [16] D. Fietelson and L. Rudolph, 'Mapping and Scheduling in a Shared
977
       Parallel Environment Using Distributed Hierarchial Control', 
978
       Proceedings of the International Conference on Parallel Processing,
979
       (1990)
980
981
  [17] S. Leutenegger and M. Vernnon, 'The Performance of Multiprogrammed
982
       Multiprocessor Scheduling Policies', Proceedings of the Conference
983
       on Measurement and Modeling of Computer Systems (1990)
984
985
  [18] R. Blumofe and C. Leisersonn, 'Scheduling Multi-threaded
986
       Computations by Word Stealing', Proceedings of the Annual Symposium
987
       on Foundations of Computer Science (1994)
988
989
  [19] C. D. Polychronopoulos and D. J. Kuck, 'Guided Self-Scheduling:
990
       'A Practical Scheduling Scheme for Parallel Supercomputers', IEEE
991
       Transactions on Computers, Volume 36, Number 12, (1987)
992
993
  [20] S. Lucco, 'A Dynamic Scheduling Method for Irregular Parallel
994
       Programs', Proceedings of the conference on Programming Languages
995
       Design and Implementation, (1992)
996
997
  The authors then list the standard books covering well-known general
998
  purpose operating systems kernels like SVR2, the BSDs, Linux, NT, and
999
  Solaris by Bach, Kirk McKusick, Bovet and Cesati, Solomon and
1000
  Russionovich, and Mauro and McDougal respectively.