| 1 |
|
| 2 |
Schedulers
|
| 3 |
----------
|
| 4 |
|
| 5 |
|
| 6 |
Study notes for a number of Scheduling-related papers, which served as
|
| 7 |
a preparation for building an SMP scheduling base for my “Cute” kernel.
|
| 8 |
|
| 9 |
The __primary sources__ (papers from the 60s, 80s, 90s, and 2000s) of
|
| 10 |
the following topics are discussed:
|
| 11 |
|
| 12 |
- Multi-level Feedback Queues: that algorithm is the cornerstone
|
| 13 |
of old & new schedulers like SVR2/3/4, the BSDs, Linux O(1), NT,
|
| 14 |
Staircase, and Staircase Deadline algorithms.
|
| 15 |
|
| 16 |
- SVR4 and Solaris event-driven scheduling: including the
|
| 17 |
disadvantages I faced while implementing it.
|
| 18 |
|
| 19 |
- Spinlocks: why was it badly-needed and invented by VMS engineers,
|
| 20 |
and on missing an important condition of critical section theory
|
| 21 |
in its original design.
|
| 22 |
|
| 23 |
- Per-CPU Data Areas: how VAX/VMS originally handled it, how the
|
| 24 |
VMS-based NT kernel did it afterwards, and how the AMD x86-64
|
| 25 |
architecture specifically kept the %fs and %gs segments cause
|
| 26 |
of that original NT design.
|
| 27 |
|
| 28 |
- Finding the current thread state using stack masking: why was
|
| 29 |
that design engineered by VMS on the VAX architecture, and how
|
| 30 |
it was transfered to DEC OSF/1 and unnecessarily later to Linux.
|
| 31 |
|
| 32 |
- DEC OSF/1 and VAX/VMS symmetric multiprocessing: these kernels
|
| 33 |
were the origin of now-classical ideas like spinlocks, per-CPU
|
| 34 |
runqueues, interactions between spinlocks and kernel preemption,
|
| 35 |
and thread scheduling soft and hard affinity!
|
| 36 |
|
| 37 |
|
| 38 |
2010-2011, Ahmed S. Darwish <darwish.07@gmail.com>
|
| 39 |
|
| 40 |
|
| 41 |
NOTE: To get things going at first, I start with a very basic overview of
|
| 42 |
general-purpose thread scheduling using course notes by R. Arpaci-Dusseau.
|
| 43 |
If you're familiar with the basics, move directly to the legendary F.J.
|
| 44 |
Corbato's “An Experimental Time-Sharing System” paper.
|
| 45 |
|
| 46 |
NOTE-2: You can find a cached copy of *all* the discussed papers at
|
| 47 |
http://gitorious.org/cute-os/references/trees/master/papers/sched
|
| 48 |
|
| 49 |
NOTE-3: Turn UTF-8 on if you're reading this inside a web browser.
|
| 50 |
|
| 51 |
|
| 52 |
* TOC:
|
| 53 |
----
|
| 54 |
|
| 55 |
- Arpaci00, Operating Systems Notes, 'Scheduling: Introduction'
|
| 56 |
- Corbato62, 'An Experimental Time-Sharing System', MIT
|
| 57 |
- Corbato63, 'The Compatible Time-Sharing System - A Programmer's Guide'
|
| 58 |
- Dingledine09, 'Performance Improvements on Tor'
|
| 59 |
- Arpaci00, OS Notes, 'Scheduling: The Multi-Level Feedback Queue'
|
| 60 |
- Arpaci98, 'Multilevel Feedback Queue Scheduling in Solaris'
|
| 61 |
- Oracle02, Man Pages Section 4: File Formats, ts_dptbl(4)
|
| 62 |
- Black90, 'Concurrency and Parallelism in the Mach Operating System'
|
| 63 |
- Denham94, 'DEC OSF/1 Version 3.0 Symmetric Multiprocessing Implementation'
|
| 64 |
- Gamache88, 'VMS Symmetric Multiprocessing'
|
| 65 |
- Sliberchatz08, Bibliography from the 'Operating System Concepts' book
|
| 66 |
|
| 67 |
|
| 68 |
* Remzi Arpaci-Dusseau, Operating System Notes, “Scheduling:
|
| 69 |
Introduction”, University of Wisconsin-Madison, 2000:
|
| 70 |
----------------------------------------------------------
|
| 71 |
|
| 72 |
The included portion of Remzi's notes (ch. 6, 7, and 8) build a light-
|
| 73 |
weight (undergrad-level) introduction to general-purpose OS scheduling.
|
| 74 |
These documents also include a number of good references.
|
| 75 |
|
| 76 |
Here are the important points in these notes + analysis of mine:
|
| 77 |
|
| 78 |
- Some of the metrics in measuring schedulers include 'Turnaround time',
|
| 79 |
which is the time taken between registering a job for execution and
|
| 80 |
its completion/return/finish. Mainly:
|
| 81 |
|
| 82 |
T_{turnaround} = T_{finish} - T_{arrival}
|
| 83 |
|
| 84 |
Another metric, caring more about latency, is the 'response time'
|
| 85 |
which is the time the job waits in the ready queue till some CPU time
|
| 86 |
is provided:
|
| 87 |
|
| 88 |
T_{response} = T_{firstrun} - T_{arrival}
|
| 89 |
|
| 90 |
- The cost of context switching does not arise from the OS actions of
|
| 91 |
saving and restoring a few registers. For example, we do context
|
| 92 |
switching in our kernel using only 37 x86 opcodes: they are mainly
|
| 93 |
register moves and stack pops and pushes.
|
| 94 |
|
| 95 |
The real cost is that "when programs run, they build up a great deal
|
| 96 |
of state in CPU caches, TLBs, branch predictors, and other on-chip
|
| 97 |
hardware. Switching to another job causes this state to be flushed
|
| 98 |
and new state relevant to the currently-running job to be brought in,
|
| 99 |
which may exact a noticeable performance cost."
|
| 100 |
|
| 101 |
- Round-robin, the simplest practical scheduler possible, involves the
|
| 102 |
act of __cycling__ on the runqueue's list of processes, running each
|
| 103 |
job for a fixed time slice (unit). Assuming processes A, B, C, and D,
|
| 104 |
the scheduler run the jobs as follows
|
| 105 |
|
| 106 |
| A | B | C | D | A | B | C | D | A | B | C | D | A | B | C | ...
|
| 107 |
|
| 108 |
each for an equal amount of time. Advantages include implementation
|
| 109 |
simplicity, good average response times (see disadvantages though),
|
| 110 |
and being starvation-free by design.
|
| 111 |
|
| 112 |
Disadvantages include very bad behaviour on system saturation, cause
|
| 113 |
the system will swap-in and out programs for each (supposedly small)
|
| 114 |
time slice [*], and response-time rising linearly with the number of
|
| 115 |
active jobs in the system, even for highly-interactive processes.
|
| 116 |
|
| 117 |
[*] this is a point handled by the Corbato paper below
|
| 118 |
|
| 119 |
- In practice, most of processes will issue a number of blocking I/O
|
| 120 |
requests during their allocated time quanta, which leads to moving
|
| 121 |
them out of the runqueue altogether, and returning back to such queue
|
| 122 |
after I/O completion.
|
| 123 |
|
| 124 |
HOW are these jobs treated after returning back can be as simple as
|
| 125 |
considering them new processes (classical round-robin), or to perform
|
| 126 |
timing calculations and modify their priority (assuming 'priorities'
|
| 127 |
support in the scheduler, or a similar concept).
|
| 128 |
|
| 129 |
- Finally, there's a nice paragraph on the separation of mechanism and
|
| 130 |
policy in OS Design. For example, stuff like programming the system
|
| 131 |
ticks handler and context-switching code are scheduling mechanisms
|
| 132 |
(mechanically how the jobs get interleaved). Scheduling policies
|
| 133 |
(or disciplines) is what you're currently reading is about.
|
| 134 |
|
| 135 |
In real-time POSIX for example, there are several programmable
|
| 136 |
scheduling 'policies' that can be setup through sched_setscheduler(2).
|
| 137 |
Such policies include SCHED_FIFO, SCHED_RR, and SCHED_OTHER.
|
| 138 |
|
| 139 |
|
| 140 |
* F.J. Corbato et al., “An Experimental Time-Sharing System”, AIEE-IRE
|
| 141 |
Proceedings of the May 1-3, 1962, Spring Joint Computer Conference, MIT:
|
| 142 |
------------------------------------------------------------------------
|
| 143 |
|
| 144 |
This is one of the earliest papers on time-sharing schedulers. The core
|
| 145 |
of the paper (beside its overall historical importance) is the section
|
| 146 |
'A Multi-Level Scheduling Algorithm', which is in modern terminology now
|
| 147 |
called the 'Multi-level Feedback Queue' (MLFQ). Important points:
|
| 148 |
|
| 149 |
- The main problem the algorithm tries to solve is system saturation:
|
| 150 |
a) total size of active user programs exceed high-speed memory
|
| 151 |
b) too many active user programs to maintain an adequate response at
|
| 152 |
each user console.
|
| 153 |
c) high performance cost of context switching
|
| 154 |
|
| 155 |
According to the authors, a good solution is to provide a procedure
|
| 156 |
which gives graceful degradation of the response time and effective
|
| 157 |
real-time computation speed of the *large and long-running users*.
|
| 158 |
|
| 159 |
- The paper argues that a simple round-robin scheduler will not work in
|
| 160 |
case of saturation cause it will cause an 'abrupt collapse of service'
|
| 161 |
due to the sudden burden of time required to swap programs in-and-out
|
| 162 |
of secondary memory such as a disk.
|
| 163 |
|
| 164 |
The multi-level scheduling algorithm was thus presented to improve the
|
| 165 |
saturation performance of a time-sharing system.
|
| 166 |
|
| 167 |
- We now delve into the scheduler algorithm details: from the very first,
|
| 168 |
the scheduler partitions its runqueue to several priority queues
|
| 169 |
l_0, l_1, ..., l_L. ('L' is to be calculated later)
|
| 170 |
|
| 171 |
- In general, the bigger the program memory footprint (size), the less
|
| 172 |
queue priority it's assigned to. But, the less the queue priority, the
|
| 173 |
bigger the runtime for each of its tasks in number of quantum.
|
| 174 |
|
| 175 |
Mathematically: l = [ log_2( [w_p/w_q] + 1 ) ]
|
| 176 |
|
| 177 |
where each task at queue 'l' runs for 2^l quanta, and that if
|
| 178 |
l_a < l_b, then l_a has a higher priority than l_b. We can deduce
|
| 179 |
several things from above equation:
|
| 180 |
|
| 181 |
a) if 0 < w_p < w_q, [w_p/w_q] = 0, runtime = 1 quantum |-- 1
|
| 182 |
b) if w_q < w_p < 2*w_q, [w_p/w_q] = 1, runtime = 2 quanta -|
|
| 183 |
c) if 2*w_q < w_p < 3*w_q, [w_p/w_q] = 2, runtime = 2 quanta -|-- 2
|
| 184 |
d) if 3*w_q < w_p < 4*w_q, [w_p/w_q] = 3, runtime = 4 quanta ..|
|
| 185 |
e) if 4*w_q < w_p < 5*w_q, [w_p/w_q] = 4, runtime = 4 quanta ..|
|
| 186 |
f) if 5*w_q < w_p < 6*w_q: [w_p/w_q] = 5, runtime = 4 quanta ..|-- 4
|
| 187 |
g) if 6*w_q < w_p < 7*w_q: [w_p/w_q] = 6, runtime = 4 quanta ..|
|
| 188 |
h) if 7*w_q < w_p < 8*w_q: [w_p/w_q] = 7, runtime = 8 quanta +++
|
| 189 |
... +++|-- 8
|
| 190 |
p) if 15*w_q < w_p < 16*w_q: [w_p/w_q] = 15, runtime = 16 quanta xxxx
|
| 191 |
... xxxx|-- 16
|
| 192 |
_) if 31*w_q < w_p < 32*w_q: [w_p/w_q] = 31, runtime = 32 quanta @@@@@
|
| 193 |
... @@@@@|-- 32
|
| 194 |
_) if 63*w_q < w_p < 64*w_q: [w_p/w_q] = 63, runtime = 64 quanta
|
| 195 |
|
| 196 |
Note: at point 'a)', queue 'l' is equal to 0: the highest possible
|
| 197 |
queue in priority.
|
| 198 |
|
| 199 |
By above design, the bigger the program, the bigger its allocated
|
| 200 |
number of quanta, making the time-overhead of swapping negligible.
|
| 201 |
Thus, optimizing the turnaround time of CPU-bound processes.
|
| 202 |
|
| 203 |
- The above point in scheduler design _tries_ to solve the first part of
|
| 204 |
the problem the algorithm tackles: the stated point 'a) total size of
|
| 205 |
active user programs exceed high-speed memory' above.
|
| 206 |
|
| 207 |
From now on, the paper tackles second part of the problem, 'b) too many
|
| 208 |
active user programs to maintain an adequate response'.
|
| 209 |
|
| 210 |
- The scheduler __always__ begin with the tasks at the head of the
|
| 211 |
highest priority queue (l_0). After that queue gets exhausted, it moves
|
| 212 |
to the direct following queue l_1, and so on <== (I)
|
| 213 |
|
| 214 |
Tasks in a single queue get run in a round-robin fashion.
|
| 215 |
|
| 216 |
- If while executing a task at queue 'l' (during the 2**l quantum), a
|
| 217 |
task got assigned to a higher priority queue 'l*', the current
|
| 218 |
executing task get put at the __head__ of the 'l'th queue, and the
|
| 219 |
process at (I) is restarted from queue 'l*', and so on.
|
| 220 |
|
| 221 |
In other words, a task cannot get executed while another task is
|
| 222 |
waiting at a higher priority queue.
|
| 223 |
|
| 224 |
- If a program in queue 'l' has not 'completed' (did not issue any I/O)
|
| 225 |
within its 2**l-quantum runtime, it's placed at the __tail__ of the
|
| 226 |
less-priority queue 'l+1'
|
| 227 |
|
| 228 |
- A powerful point in the algorithm above (beside broad bounds on system
|
| 229 |
performance) is that the classification procedure for tasks is entirely
|
| 230 |
automatic: it depends on performance and program size rather than
|
| 231 |
static declarations or (the possibly malicious) users hopes.
|
| 232 |
|
| 233 |
- As once stated above, the main theme around this scheduler is improving
|
| 234 |
the __saturation performance__ of the system.
|
| 235 |
|
| 236 |
This performance improvement on saturation exists cause as a user taxes
|
| 237 |
the system, the degradation of service occurs progressively starting
|
| 238 |
with users with large or long-running programs (least-priority queues)
|
| 239 |
first.
|
| 240 |
|
| 241 |
- Starvation: there's a probable case of a big number of active high
|
| 242 |
priority tasks starving less-priority CPU-bound tasks out of execution.
|
| 243 |
|
| 244 |
The paper __models__ a worst cast of having N tasks, each running at
|
| 245 |
queue 'l', and each relinquishing the CPU __directly__ before their
|
| 246 |
2**l quantum runtime, and thus each also returning to the the same-
|
| 247 |
priority queue, with the condition of returning to the runqueue only
|
| 248 |
't_u'-seconds later (time for finishing I/O and, or, user response).
|
| 249 |
|
| 250 |
If the system will always be busy at queue 'l' above, starving all less
|
| 251 |
priority queues 'l + x' for x > 0, then the queue 'l' must never be
|
| 252 |
empty; we must have:
|
| 253 |
|
| 254 |
N * 2^l * q >= t_u
|
| 255 |
2^l >= (t_u / N*q)
|
| 256 |
l >= log_2 (t_u / N*q)
|
| 257 |
|
| 258 |
where 'N' is the number of tasks at queue 'l'.
|
| 259 |
|
| 260 |
From above, we've found an upper-bound on the priority queue that can
|
| 261 |
be serviced: queues > 'l' above can possibly starve if number of tasks
|
| 262 |
reach N. Thus, we have a guarantee that all
|
| 263 |
|
| 264 |
l_a <= [ log_2 (t_u / N * q) ]
|
| 265 |
|
| 266 |
will be serviced and will __not__ get starved (given the estimated
|
| 267 |
response time t_u), where '[]' denotes the integral part of enclosed
|
| 268 |
equation. The author calls this queue the 'highest serviced level'
|
| 269 |
queue.
|
| 270 |
|
| 271 |
- The equation above has set a bound on the starvation problem. So if we
|
| 272 |
set the max possible 'highest serviced queue' to the max possible queue
|
| 273 |
on the system, we have a __theoretical guarantee__ of no starvation in
|
| 274 |
the system.
|
| 275 |
|
| 276 |
(Providing this guarantee is very hard for modern-day general purpose
|
| 277 |
operating systems: we cannot easily put a limit on the number of tasks
|
| 278 |
'N', and we cannot even have a valid estimation of 't_u', the I/O
|
| 279 |
or user response finish time.
|
| 280 |
|
| 281 |
These values can be calculated easily if the target hardware and work
|
| 282 |
load is previously known; this is the case in the paper [IBM 7090],
|
| 283 |
and this can also be the case for small and predictable devices like a
|
| 284 |
small network router.)
|
| 285 |
|
| 286 |
- By now, we've analyzed most of the paper sans the response time
|
| 287 |
analysis.
|
| 288 |
|
| 289 |
While these response time equations are intellectually simulating, they
|
| 290 |
offer very broad bounds that are not entirely practical for a general
|
| 291 |
purpose scheduler, similar to the paper's way of handling starvation in
|
| 292 |
the above point.
|
| 293 |
|
| 294 |
|
| 295 |
* F.J. Corbato et al., “The Compatible Time-Sharing System: A Programmer's
|
| 296 |
Guide”, The MIT Computation Center, MIT Press, 1963:
|
| 297 |
------------------------------------------------------------------------
|
| 298 |
|
| 299 |
This manual gives a very nice overview of the historical time-sharing
|
| 300 |
hands-on experience. It's a worthwhile read while studying the CTSS
|
| 301 |
scheduler paper above.
|
| 302 |
|
| 303 |
A useful demonstration of a programmer's session on CTSS is also
|
| 304 |
presented in the Appendix. Interesting text editor!! it makes me
|
| 305 |
remember the original UNIX editor 'ed'.
|
| 306 |
|
| 307 |
|
| 308 |
* Roger Dingledine, “Performance Improvements on Tor”, 2009:
|
| 309 |
----------------------------------------------------------
|
| 310 |
|
| 311 |
Now why am I including a paper on Tor while we're dealing with
|
| 312 |
schedulers? Because the Tor network tries to solve a problem very
|
| 313 |
similar to the one Corbato was designing his scheduler model against:
|
| 314 |
|
| 315 |
``Section 1 described mechanisms to let low-volume streams have a
|
| 316 |
chance at competing with high-volume streams. Without those mechanisms,
|
| 317 |
normal web browsing users will always get squeezed out by people
|
| 318 |
pulling down larger content and tolerating high latency.
|
| 319 |
|
| 320 |
But the next problem is that some users simply add more load than the
|
| 321 |
network can handle. Just making sure that all the load gets handled
|
| 322 |
__fairly__ isn't enough if there’s too much load in the first place.(+)
|
| 323 |
|
| 324 |
When we originally designed Tor, we aimed for high throughput. We
|
| 325 |
figured that providing high throughput would mean we get good latency
|
| 326 |
properties for free. However, now that it’s clear we have several user
|
| 327 |
profiles trying to use the Tor network at once, we need to consider
|
| 328 |
changing some of those design choices. Some of these changes would
|
| 329 |
aim for better latency and worse throughput. (++) ''
|
| 330 |
|
| 331 |
The statement marked with (+) matches word-by-word the problem
|
| 332 |
statement of the Multi-Level Feedback Queue paper: to gracefuly handle
|
| 333 |
'system saturation' while maintaining an acceptable level of system
|
| 334 |
responsiveness. Thus, avoiding the 'abrupt collapse of service' caused
|
| 335 |
by OS scheduler's round-robin algorithm or Tor's full fairness in high
|
| 336 |
overlad.
|
| 337 |
|
| 338 |
A main theme of the MLFQ solution, and Tor's one proposed at (++), is
|
| 339 |
not to let throughput-driven jobs (compilers in schedulers case, an
|
| 340 |
700-MB ISO download in the Tor case) deprive latency-senitive jobs
|
| 341 |
out of quickly-needed resources. Examples of such latency-sensive
|
| 342 |
apps include an editor or web-browser GUI scrolling in OS schedulers,
|
| 343 |
and servicing HTML websites in Tor.
|
| 344 |
|
| 345 |
|
| 346 |
* Remzi Arpaci-Dusseau, Operating Systems Notes, The MLFQ, 2000:
|
| 347 |
--------------------------------------------------------------
|
| 348 |
|
| 349 |
Here, the MLFQ algorithm we studied from __primary sources__ above is
|
| 350 |
re-discussed in modern terms. Important points:
|
| 351 |
|
| 352 |
- The hardest point in designing schedulers is to minimize response time
|
| 353 |
for interactive jobs, while also minimizing overall system turnround
|
| 354 |
time.
|
| 355 |
|
| 356 |
- As we know by now, MLFQ varies the priority of a job based on its
|
| 357 |
observational behaviour: learning from jobs as they run, and using
|
| 358 |
history to predict their future behaviour. Historical state is kept in
|
| 359 |
the scheduler priority queues.
|
| 360 |
|
| 361 |
This is an important systems design pattern of learning from history:
|
| 362 |
observing current trends and learning from the past to predict the
|
| 363 |
future. Such pattern is common in CPUs branch prediction, working set
|
| 364 |
rules, and in networking protocols congestion control mechanisms.
|
| 365 |
|
| 366 |
Remzi advises us though that 'such approaches work when jobs have
|
| 367 |
__phases of behavior__ and are thus predictable; of course, one must
|
| 368 |
be careful with such techniques, as they can easily be wrong and drive
|
| 369 |
a system to make worse decisions than they would have with no knowledge
|
| 370 |
at all.'
|
| 371 |
|
| 372 |
- There are two important problems in MLFQs:
|
| 373 |
|
| 374 |
a) Starvation: One of the possible ways to handle it is the well-known
|
| 375 |
'aging principle': increasing the priority as the process waits. Check
|
| 376 |
below parenthesized notes though; it's not that simple.
|
| 377 |
|
| 378 |
(Note1: The original paper theoretically 'solved' this issue by giving
|
| 379 |
bounds on starvation. These bounds can only work if we have a true
|
| 380 |
estimate of I/O response times _and_ of the jobs number in the system.
|
| 381 |
This is practically impossible for general-purpose scheduling.)
|
| 382 |
|
| 383 |
(Note2: There are several ways to __practically__ apply aging methods.
|
| 384 |
Some of these ways are discussed in David Black's paper 'Scheduling
|
| 385 |
support for concurrency and parallelism in the Mach Operating System.')
|
| 386 |
|
| 387 |
b) Gaming the system: A possible solution is to improve the algorithm
|
| 388 |
accounting of CPU time, making it less viable to gaming. Check the
|
| 389 |
notes for further details.
|
| 390 |
|
| 391 |
|
| 392 |
* Remzi Arpaci-Dusseau, “Multilevel Feedback Queue Scheduling in Solaris”,
|
| 393 |
1998 && Solaris, Man Pages Section 4: File Formats, ts_dptbl(4), 2002:
|
| 394 |
------------------------------------------------------------------------
|
| 395 |
|
| 396 |
From now on, we check how various real-world kernels implement MLFQ
|
| 397 |
scheduling. Beginning with Solaris, here are the important points:
|
| 398 |
|
| 399 |
- For regular jobs, the kernel has 60 priority queues. A job begins in
|
| 400 |
the middle, at priority 29. As in classical MLFQ, CPU-bound jobs get
|
| 401 |
assigned smaller priorities over time (with bigger slices), and I/O
|
| 402 |
tasks move up, getting scheduled whenever they have jobs to do.
|
| 403 |
|
| 404 |
Now here is Solaris 'dispatcher parameter table' defaults:
|
| 405 |
|
| 406 |
Priority q (ms) tqexp slpret maxwait lwait
|
| 407 |
0 200 0 50 0 50
|
| 408 |
... vvv v vv v vv
|
| 409 |
9 vvv v vv v vv
|
| 410 |
10 160 v 51 v 51
|
| 411 |
11 vvv 1 vv v vv
|
| 412 |
12 vvv 2 vv v vv
|
| 413 |
13 vvv 3 vv v vv
|
| 414 |
14 vvv 4 vv v vv
|
| 415 |
15 vvv 5 vv v vv
|
| 416 |
16 vvv 6 vv v vv
|
| 417 |
17 vvv 7 vv v vv
|
| 418 |
18 vvv 8 vv v vv
|
| 419 |
19 vvv 9 vv v vv
|
| 420 |
20 120 10 52 v 52
|
| 421 |
21 vvv 11 vv v vv
|
| 422 |
22 vvv 12 vv v vv
|
| 423 |
23 vvv 13 vv v vv
|
| 424 |
24 vvv 14 vv v vv
|
| 425 |
25 vvv 15 vv v vv
|
| 426 |
26 vvv 16 vv v vv
|
| 427 |
27 vvv 17 vv v vv
|
| 428 |
28 vvv 18 vv v vv
|
| 429 |
29 vvv 19 vv v vv
|
| 430 |
30 80 20 53 v 53
|
| 431 |
31 vv 21 vv v vv
|
| 432 |
33 vv 22 vv v vv
|
| 433 |
33 vv 23 vv v vv
|
| 434 |
34 vv 24 vv v vv
|
| 435 |
35 vv 25 54 v 54
|
| 436 |
36 vv 26 vv v vv
|
| 437 |
37 vv 27 vv v vv
|
| 438 |
38 vv 28 vv v vv
|
| 439 |
39 vv 29 vv v vv
|
| 440 |
40 40 30 55 v 55
|
| 441 |
41 vv 31 vv v vv
|
| 442 |
42 vv 32 vv v vv
|
| 443 |
43 vv 33 vv v vv
|
| 444 |
44 vv 34 vv v vv
|
| 445 |
45 vv 35 56 v 56
|
| 446 |
46 vv 36 57 v 57
|
| 447 |
47 vv 37 58 v 58
|
| 448 |
48 vv 38 vv v vv
|
| 449 |
49 vv 39 vv v 59
|
| 450 |
50 vv 40 vv v vv
|
| 451 |
51 vv 41 vv v vv
|
| 452 |
52 vv 42 vv v vv
|
| 453 |
53 vv 43 vv v vv
|
| 454 |
54 vv 44 vv v vv
|
| 455 |
55 vv 45 vv v vv
|
| 456 |
56 vv 46 vv v vv
|
| 457 |
57 vv 47 vv v vv
|
| 458 |
58 vv 48 vv v vv
|
| 459 |
59 20 49 59 32000 vv
|
| 460 |
|
| 461 |
tqexp: time quantum expired priority, the new priority the thread is set
|
| 462 |
to when it uses its entire time quantum. As you can see, any process
|
| 463 |
using its entire slice get its priority decreased by __10 levels__.
|
| 464 |
|
| 465 |
slpret: sleep return priority value, the new priority the thread is set
|
| 466 |
to after waking up (from a sleep event like network or disk I/O.)
|
| 467 |
The thread priority is increased to 50 or above after an I/O operation!
|
| 468 |
|
| 469 |
maxwait: a starvation avoidance mechanism to compensate threads waiting
|
| 470 |
for a relatively long time in the dispatcher queue. 'maxwait' is such
|
| 471 |
waiting time threshold. If a thread waited more than maxwait in the
|
| 472 |
queues, its priority is boosted up to the respective 'lwait' priority.
|
| 473 |
|
| 474 |
Each thread has a dispatcher_wait ('dispwait') counter, which is reset
|
| 475 |
when the thread is inserted into the dispatch queue (following a time-
|
| 476 |
quantum expiration, or cause of wakeup from I/O or locking). This
|
| 477 |
field is incremented once per second for all processes waiting in the
|
| 478 |
dispatcher queues. If such counter exceeded 'maxwait', then the thread
|
| 479 |
priority is incremented to 'lwait' -- a priority > 50 by default.
|
| 480 |
|
| 481 |
Note that the default 'maxwait' for Solaris is 0: just 1 second and
|
| 482 |
all the system processes priority will be boosted up above 50. This
|
| 483 |
avoids starving CPU-bound too much.
|
| 484 |
|
| 485 |
[Sol-*]:
|
| 486 |
(The values above have some very interesting details: Note that after
|
| 487 |
1-second, processes in the lowest (0-9) range will move to priority 50,
|
| 488 |
processes in the range (10-19) will move to 51, (20-29) to priority 52,
|
| 489 |
and so on. By this, high-priority processes response times will NOT be
|
| 490 |
affected because they will still run on queue 59 instantly.
|
| 491 |
|
| 492 |
Now, how can this prevent starvation? Can't processes keep going in
|
| 493 |
and out very quickly, starving queues < 59?. Indeed that was the
|
| 494 |
mathematically modeled worst case in the original CTSS paper.
|
| 495 |
|
| 496 |
Now comes the second important detail: even if the worst case above
|
| 497 |
occured (highly unlikely in practice, except if done maliciously),
|
| 498 |
you'll see that the 'lwait' field of the 50-59 priorites is 59. Thus,
|
| 499 |
in the second __starvation round__, all will move to 59 and execute
|
| 500 |
in a round-robin manner. This solves the starvation problem elegantly.
|
| 501 |
|
| 502 |
Summary: starvation can never exceed 2 seconds + waiting time in prio
|
| 503 |
queue 59. All while maintaining system responsiveness in the process.)
|
| 504 |
|
| 505 |
- NOTE: I've coded such scheduler in my kernel, and its disadvantages
|
| 506 |
began to appear directly after the first day of benchmarking: too much
|
| 507 |
heuristics and voo-doo constants!
|
| 508 |
|
| 509 |
Scheduling 100% CPU-hog threads showed the ugly side of that algorithm:
|
| 510 |
the combination of a big number of dynamic priority levels with huge
|
| 511 |
and different slice lengthes per each priority kept starving the low
|
| 512 |
priority processes till the newly-queued CPU hogs moved up to their
|
| 513 |
equivalent low-priority level.
|
| 514 |
|
| 515 |
While the new CPU hogs were moving up, the older low-priority tasks
|
| 516 |
faced complete starvation for 1+ seconds. The algorithm thus boosted
|
| 517 |
these threads to high-priority levels, but this __starved__ the newer
|
| 518 |
threads that are now residing in relatively lower-priority queues. This
|
| 519 |
effectively made all system processes enter an infinite loop of
|
| 520 |
1+-second starvation and then boosting -- an upleasent situation.
|
| 521 |
|
| 522 |
Decreasing the number of priority levels and making the slices more fair
|
| 523 |
helped the situation, but kept introducing __new__ corner-cases. After
|
| 524 |
a number of days, I gave-up and went coding a more theoretically-sound
|
| 525 |
scheduling algorithm.
|
| 526 |
|
| 527 |
(The benchmark was simply plotting different kernel threads properties
|
| 528 |
over time. Such properties included threads total runtime, average
|
| 529 |
runtime per dispatch, total wait in runqueues, average wait in runqueues,
|
| 530 |
number of priority boosts [cause of starvation], number of preemptions
|
| 531 |
cause of a higher-priority thread becoming runnable, and number of
|
| 532 |
preemptions cause of a thread time-slice end:
|
| 533 |
|
| 534 |
(a) code version 1, 6 threads:
|
| 535 |
http://localf.googlepages.com/pub/mlfq/fairness.gif
|
| 536 |
(b) code version 1, 100 threads:
|
| 537 |
http://localf.googlepages.com/pub/mlfq/fairness100.gif
|
| 538 |
(c) code version 2 (after sanitizing the voo-doos), 6 threads:
|
| 539 |
http://localf.googlepages.com/pub/mlfq/fairness-improved.gif
|
| 540 |
(d) code version 2, 100 threads (things became good!!):
|
| 541 |
http://localf.googlepages.com/pub/mlfq/fairness-improved.gif
|
| 542 |
(e) unfortunately, the _new_ heuristics above had corner cases:
|
| 543 |
http://localf.googlepages.com/pub/mlfq/fairness-improved-pathological.gif
|
| 544 |
|
| 545 |
It seems I'm not the only one facing this though; I've found some refs
|
| 546 |
in academia on how SVR4-based schedulers [like Solaris] were very hard
|
| 547 |
to tweak and adapt to different workloads.)
|
| 548 |
|
| 549 |
|
| 550 |
* David L. Black, “Scheduling Support for Concurrency and Parallelism in
|
| 551 |
the Mach Operating System”, IEEE Computer Journal, Carnegie Mellon, 1990:
|
| 552 |
-------------------------------------------------------------------------
|
| 553 |
|
| 554 |
This is a very interesting article, but we will concentrate on its MLFQ
|
| 555 |
deficiencies discussion for now (Note: always remember Black's point of
|
| 556 |
creating an idle thread per CPU, not to corrupt kernel stack on
|
| 557 |
multiprocessor systems). Important points:
|
| 558 |
|
| 559 |
- Multics used a decreasing priority MLFQ (as was exactly described in
|
| 560 |
its original paper) and discovered the major deficiency: on a heavily
|
| 561 |
loaded system with many short-lived jobs, the priority of a CPU-bound
|
| 562 |
job can decrease until little or no CPU-time is assigned to it. The
|
| 563 |
author notes that such problem (starvation) also existed on a number
|
| 564 |
of UNIX kernels.
|
| 565 |
|
| 566 |
- To handle this problem, it's necessary to elevate priorities in some
|
| 567 |
manner. VAX/VMS elevates a low application priority when it issues an
|
| 568 |
event like I/O completion.
|
| 569 |
|
| 570 |
The author states that this negatively affect interactive applications
|
| 571 |
which consume large amount of processor time, since they may not issue
|
| 572 |
priority-raising events to an effective degree.
|
| 573 |
|
| 574 |
(Nonetheless, this approach is effective, response-time wise, for
|
| 575 |
handling software with 'phases' of I/O and CPU-bound activity if the
|
| 576 |
CPU-bound activity is long but not dominant. This is a point the
|
| 577 |
original CTSS scheduler did not address: a job's transition from a
|
| 578 |
CPU-bound to an I/O-bound phase.)
|
| 579 |
|
| 580 |
- A (non-exclusive) second way to handle this is processor usage aging.
|
| 581 |
To compensate lower-priority jobs, we forget (decay) their past history
|
| 582 |
in some way, usually in an exponential decay manner.
|
| 583 |
|
| 584 |
(Exponential decay: N(t) = N(0) * (decay_factor)**t. Assuming a decay
|
| 585 |
factor of 0.5, and N(0)=10, then N(1)=5, and N(2) = (10/2)/2 = 0.25)
|
| 586 |
|
| 587 |
- Solaris, as we've seen, uses BOTH of the prevention methods above:
|
| 588 |
processes priorities get elevated after completion of an I/O event,
|
| 589 |
AND the kernel also forgets (age) __most__ of the current processes
|
| 590 |
usage history after one second (coalescing each 10 priorities to one,
|
| 591 |
while maintaining the relative difference between them).
|
| 592 |
|
| 593 |
- 4.3BSD: it solely uses the aging method to handle starvation, with a
|
| 594 |
CPU-usage decay factor that's based on system load. The bigger the
|
| 595 |
system load, the less usage decay, avoiding jobs from getting crammed
|
| 596 |
in high priority queues on high system load.
|
| 597 |
|
| 598 |
Priority of the __currently running__ thread get re-calculated after
|
| 599 |
each 4 ticks of its runtime, decreasing linearly with job's CPU usage.
|
| 600 |
If a higher priority was discovered after this decrease, the thread
|
| 601 |
is preempted.
|
| 602 |
|
| 603 |
Routines responsible for waking up a thread or adding a new thread to
|
| 604 |
the runqueue check such thread priority against the currently running
|
| 605 |
thread. If it's higher, a reschedule operation is requested.
|
| 606 |
|
| 607 |
(XXX: Unix Internals discusses the problem of classical SVR2/3 decay,
|
| 608 |
its equations, and the new BSD decay method. Add these notes here as
|
| 609 |
a supplement to the original paper discussion.)
|
| 610 |
|
| 611 |
|
| 612 |
* Jeffrey M. Denham et al., “DEC OSF/1 Version 3.0 Symmetric Multiprocessing
|
| 613 |
Implementation”, Digital Technical Journal, volume 6 number 3, Summer 1994:
|
| 614 |
---------------------------------------------------------------------------
|
| 615 |
|
| 616 |
This paper is gold, and is comparable in quality to the CTSS one. It
|
| 617 |
discusses a huge number of now-classical SMP features:
|
| 618 |
|
| 619 |
+ Accessing the 'current' thread state by masking the stack pointer,
|
| 620 |
regardless of the CPU we're currently running on.
|
| 621 |
+ Per-CPU runqueues for thread scheduling!
|
| 622 |
+ SMP scheduling soft affinity
|
| 623 |
+ Interactions between spinlocks and kernel preemption
|
| 624 |
+ Striking a balance between fine-grained locking and lockless algorithms
|
| 625 |
|
| 626 |
A good number of above ideas were of a VAX/VMS origin. Important notes:
|
| 627 |
|
| 628 |
- The task of an SMP kernel is to improve system performance as CPUs are
|
| 629 |
added without compromising quality. DEC OSF/1 goal was to provide a
|
| 630 |
performant Unix SMP system over the Alpha architecture.
|
| 631 |
|
| 632 |
That hardware architecture provided shared memory, symmetric access to
|
| 633 |
system I/O busses, atomic read-and-set instructions, and interprocessor
|
| 634 |
interrupts (IPIs).
|
| 635 |
|
| 636 |
NOTE: This is very close to the Intel MultiProcessor (MP) architecture.
|
| 637 |
|
| 638 |
- As we know by now, classical Unix kernels insured its consistency by:
|
| 639 |
a] disabling interrupts in critical regions accessed by IRQ handlers
|
| 640 |
(and thus handling the concurrency of asynchronous IRQ handlers)
|
| 641 |
b] allowing only one process to be in kernel context at a time
|
| 642 |
(and thus handling the virtual concurrency provided by time-sharing)
|
| 643 |
|
| 644 |
These protections aren't enough in case of multicore systems, since
|
| 645 |
they provide real concurrent execution of kernel code paths.
|
| 646 |
|
| 647 |
An evolutionary solution to that problem was the use of 'funneling':
|
| 648 |
allowing certain kernel subsystems to be executed only in one CPU at
|
| 649 |
_any_ given time. These subsystems was then iteratively and slowly
|
| 650 |
moved to local fine-grained locking.
|
| 651 |
|
| 652 |
NOTE: Linux used that 'funneling' method in its early days using the
|
| 653 |
'Big Kernel Lock (BKL)'; FreeBSD did the same using a 'Global Mutex'.
|
| 654 |
Both kernels moved to fine-grained locking iteratively. By virtue of
|
| 655 |
being started in Winter of 2009, my kernel incorporates fine-grained
|
| 656 |
locking from the start.
|
| 657 |
|
| 658 |
- It's interesting to note that the first version of DEC OSF/1 provided
|
| 659 |
full kernel-mode preemption in its real-time version of the kernel.
|
| 660 |
Solaris was also made fully preemptive with fine-grained locking very
|
| 661 |
early on. Linux and FreeBSD caught on around 10 years later.
|
| 662 |
|
| 663 |
- The paper then discusses how kernel preemption should be disabled by the
|
| 664 |
entrance of spinlocks. It also discusses the now-very-common idea of
|
| 665 |
counting the depth of spin locks acquired using a reference count, and
|
| 666 |
preempting the kernel if and only if that reference count = 0.
|
| 667 |
|
| 668 |
NOTE 1: This is equivalent to Robert Love's Linux kernel preemption
|
| 669 |
patch. A google search reveals that this patch got merged in 2001.
|
| 670 |
|
| 671 |
NOTE 2: FreeBSD on the other hand completely disables local interrupts
|
| 672 |
upon spin lock entrance, automatically disabling preemption in the
|
| 673 |
process (at a slight performance cost). I do the same in my kernel.
|
| 674 |
|
| 675 |
NOTE 3: (cut-and-paste from my kernel comments) Disabling preemption
|
| 676 |
in spinlock-protected regions is a necessity because:
|
| 677 |
a) If a thread holding a spin lock got preempted, it will let a
|
| 678 |
possibly huge number of other threads consume their entire time
|
| 679 |
slice (which is in the order of milliseconds) just spinning!
|
| 680 |
b) If that preempted thread (which is now in the scheduler runqueue)
|
| 681 |
got killed, we can deadlock the entire system!
|
| 682 |
c) Classical race conditions may arise if the newly scheduled thread
|
| 683 |
accessed the same critical-region the older thread was accessing
|
| 684 |
while being preempted.
|
| 685 |
That's why spin locks are also needed in preemptive Uniprocessor
|
| 686 |
kernels: ignoring their spinning behavior, they act as atomic
|
| 687 |
critical region markers.
|
| 688 |
|
| 689 |
There's also a warning about accessing per-CPU variables while kernel
|
| 690 |
preemption is enabled. Modern kernels disable preemption in that state.
|
| 691 |
|
| 692 |
NOTE 4: (also from my kernel comments) Disabling preemption while
|
| 693 |
accessing a per-CPU variable is necessary:
|
| 694 |
a) We might get preempted and then rescheduled on a different CPU.
|
| 695 |
b) If the _preempting_ thread used the same per-CPU variable, that's
|
| 696 |
a classical race condition.
|
| 697 |
|
| 698 |
The paper notes that due to the complexities of kernel locking, a
|
| 699 |
"coherent locking architecture with automated debugging facilities
|
| 700 |
was needed to ship a reliable product on time."
|
| 701 |
|
| 702 |
- One problem faced by SMP is finding the descriptor of the current
|
| 703 |
process context. In a uniprocessor machine, that can be simply done
|
| 704 |
by a global memory variable that is changed before context switching.
|
| 705 |
|
| 706 |
On SMP, there's a "current" process descriptor for each CPU. This
|
| 707 |
descriptor can be located by finding the process number we're running
|
| 708 |
on, and then accessing an array of per-cpu data based on that number.
|
| 709 |
|
| 710 |
In Unix kernels, the current thread state is dereferenced a huge number
|
| 711 |
of times. Needing to know the current CPU number before each of these
|
| 712 |
dereferences (a costly hardware operation) was unacceptable.
|
| 713 |
|
| 714 |
NOTE 1: In contemporary x86 CPUs, this is costly since it needs reading
|
| 715 |
the APIC registers for each dereference. APIC registers accessors induce
|
| 716 |
higher latency than the usual general-purpose registers.
|
| 717 |
|
| 718 |
Even assuming that this operation wasn't costly from the hardware side,
|
| 719 |
we'll have to disable, then enable, kernel preemption each time we want
|
| 720 |
to access that thread state; also unacceptable.
|
| 721 |
|
| 722 |
DEC OSF/1 solved this by storing a pointer to the currently running
|
| 723 |
process descriptor at the very bottom of each thread stack. By simply
|
| 724 |
masking the stack pointer, they can instantly access that thread state,
|
| 725 |
__regardless__ of the CPU that thread is currently running on.
|
| 726 |
|
| 727 |
NOTE 2: The paper states that putting the 'current' thread state pointer
|
| 728 |
in the per-CPU area, and then setting it before each context switch,
|
| 729 |
would frequently trash other CPUs caches. This is true in the regular
|
| 730 |
case, but aligning such per-CPU array elements on a cache-line boundary
|
| 731 |
(64 or 128 bytes in x86 CPUs) would solve that problem. It's possible
|
| 732 |
that the Alpha architecture had different semantics though.
|
| 733 |
|
| 734 |
NOTE 3: Linux x86-32 used that stack-masking design for a long time. I
|
| 735 |
guess it was transfered to its developers by folklore.
|
| 736 |
|
| 737 |
NOTE: What about the risks, econet anyone?
|
| 738 |
|
| 739 |
NOTE 4: On the x86 kind of the equation, it's common to _permanently_
|
| 740 |
assign a segment register with the address of the relevant per-CPU area
|
| 741 |
(aligned on a cache-boundary) for each CPU. This avoids the inefficiency
|
| 742 |
of accessing the APIC for every per-CPU data dereference.
|
| 743 |
|
| 744 |
NOTE: Not all CPU architectures have the luxury of a such a register,
|
| 745 |
which is left _intact_ by the basic system ABI definition. The absence
|
| 746 |
of such register in the VAX architecture was the main reason of the
|
| 747 |
stack-masking idea in VMS. Explore the 'VMS Symmetric Multiprocessing'
|
| 748 |
paper, DTJ vol.1#7, for further historical context.
|
| 749 |
|
| 750 |
NOTE: Reading the historical (preliminary) AMD64 architecture document
|
| 751 |
(January 2001, Rev C), it seems that AMD kept the %fs and %gs registers
|
| 752 |
because Windows NT used %gs as a per-CPU area pointer. Interestingly,
|
| 753 |
NT is, afterall, a VMS-based kernel!
|
| 754 |
|
| 755 |
- Per-CPU runqueues: Paper begins by noting that "scheduling threads out of
|
| 756 |
a global runqueue is highly effective at keeping the N highest-priority
|
| 757 |
threads running". It then states the disadvantages of that method: lock
|
| 758 |
contention over the global runqueue and losing CPU cache state due to
|
| 759 |
uncontrolled bouncing of threads between CPUs.
|
| 760 |
|
| 761 |
NOTE: Having a global runqueue and taking CPU affinity in consideration
|
| 762 |
is not mutually exclusive. Brainfuck Scheduler (BFS) and the Linux-2.4
|
| 763 |
epoch scheduler is an example of this.
|
| 764 |
|
| 765 |
The DEC designers thus added "soft affinity" to timeshare threads: these
|
| 766 |
threads are softly bound to the processor they last run on. Nevertheless,
|
| 767 |
if too much time passed since last run, they can get scheduled elsewhere:
|
| 768 |
the CPU state was probably lost anyway. It's more beneficial to directly
|
| 769 |
run that thread on another less-loaded CPU than wait for the current one.
|
| 770 |
|
| 771 |
- Load balancing: As a summary of the presented algorithm, let
|
| 772 |
|
| 773 |
TS(i): The estimated number of ticks available to time-sharing threads
|
| 774 |
by CPU #i in a scheduling period
|
| 775 |
TTS : Total number of system ticks available to time-sharing threads
|
| 776 |
N : Total number of runnable time-sharing threads in the system
|
| 777 |
D(i) : Ideal number of runnable time-sharing threads assigned to CPU #i
|
| 778 |
C(i) : Current number of time-sharing threads assigned to CPU #i
|
| 779 |
|
| 780 |
we have
|
| 781 |
|
| 782 |
TTS = TS(1) + TS(2) + ... + TS(k), k = number of system CPUs
|
| 783 |
D(i) = (TS(i) / TTS) * N
|
| 784 |
|
| 785 |
Every second, the kernel calculates D(i) for each CPU. If a CPU had a
|
| 786 |
C(i) < D(i), and another one had C(i) > D(i), the former is marked as
|
| 787 |
"out of balance". When such unbalanced CPU reschedules, it "steals"
|
| 788 |
tasks from the overloaded CPU. Tasks get stolen only if they did not
|
| 789 |
run on their softly-bound CPU for a configurable amount of time, thus
|
| 790 |
preserving soft affinity usefulness as much as possible.
|
| 791 |
|
| 792 |
- The paper finishes by stating the software engineering tactics needed
|
| 793 |
to safely develop such an ambitious code base. Remember that this was
|
| 794 |
the 90s. After studying that paper, it's clear that the two most
|
| 795 |
important tactics were:
|
| 796 |
|
| 797 |
- Code review, and
|
| 798 |
- Iterative development
|
| 799 |
|
| 800 |
So, well, things haven't changed a lot from the software engineering
|
| 801 |
side!
|
| 802 |
|
| 803 |
|
| 804 |
* Rodney N. Gamache && Kathleen D. Morse, “VMS Symmetric Multiprocessing”,
|
| 805 |
Digital Technical Journal, volume 1 number 7, pp. 57-73, August 1988:
|
| 806 |
------------------------------------------------------------------------
|
| 807 |
|
| 808 |
NOTE: Below discussion assumes reading our notes on DEC OSF/1 v3.0 first.
|
| 809 |
|
| 810 |
"Systems software engineers must now design effective ways to utilize
|
| 811 |
systems with six, eight, or even more CPUs", begins this late 80s paper!
|
| 812 |
|
| 813 |
The SMP additions to VAX/VMS were motiviated by the, back then new, VAX
|
| 814 |
6200 architecture. From other papers in the _same_ journal issue, that
|
| 815 |
hardware design had:
|
| 816 |
|
| 817 |
+ Symmetric shared memory
|
| 818 |
+ Cache coherency in hardware (always a bless!)
|
| 819 |
+ First-level cache for instructions, and a second-level write-through
|
| 820 |
one for both data and instructions
|
| 821 |
+ Broadcast and single-destination inter-processor interrupts
|
| 822 |
|
| 823 |
This paper INTRODUCED a good number of now-classical SMP features:
|
| 824 |
|
| 825 |
+ Spinlocks! (also _coined_ that term)
|
| 826 |
+ Accessing the Per-CPU area (and thus the current thread state) by
|
| 827 |
masking the CPU stack pointer register.
|
| 828 |
+ SMP scheduling hard affinity! (also _coined_ that term)
|
| 829 |
|
| 830 |
Having above context in mind, here were the important points:
|
| 831 |
|
| 832 |
- VMS engineers first considered using a global lock[*] to protect the
|
| 833 |
kernel in an SMP machine. They (rightfully) ditched the idea later and
|
| 834 |
moved directly to the ambitious goal of fine-grained locking.
|
| 835 |
|
| 836 |
NOTE: Compare this to the DEC OSF/1 method, where they first used a
|
| 837 |
global lock, and then fine-grained the system iteratively.
|
| 838 |
|
| 839 |
[*] Big Kernel Lock in Linux jargon, Global Mutex in the BSDs
|
| 840 |
|
| 841 |
- Inventing spinlocks: Regular VMS mutexes must be acquired in a process
|
| 842 |
context, but the kernel needed SMP protection in different other
|
| 843 |
contexts. That's why spinlocks were created; they can get acquired by
|
| 844 |
any CPU in any context (as long as certain sanity rules are held).
|
| 845 |
|
| 846 |
NOTE 1: A classical case is protecting the video-RAM and printf() buffers.
|
| 847 |
The kernel usually want to access such resource in all possible contexts,
|
| 848 |
including boot context, exception and even IRQ context, not just inside
|
| 849 |
process contexts.
|
| 850 |
|
| 851 |
Usually, a mutex is acquired or "owned" by a certain process; spinlocks
|
| 852 |
on the other hand are acquired by the CPU as a whole. Spinlocks was
|
| 853 |
implemented in VMS using an interlocked bit-test-and-set op.
|
| 854 |
|
| 855 |
NOTE 2: It's still implemented similarly these days, but it might be
|
| 856 |
faster to use an interlocked swap. A missing issue in that paper design
|
| 857 |
was making threads lock spinning time be finite (bound); that's an
|
| 858 |
important (third) condition in the theory of critical sections.
|
| 859 |
|
| 860 |
NOTE: That got handled in Linux using N.Piggin's ticket spinlocks.
|
| 861 |
|
| 862 |
- VMS added two important debugging aids to spin locks:
|
| 863 |
|
| 864 |
a) each spinlock was assigned a rank (order). All spinlocks must be
|
| 865 |
acquired in that order, eliminating any possible deadlocks.
|
| 866 |
b) each spinlock structure saved the last eight program counters (PC/
|
| 867 |
x86 %RIP) that acquired or released such lock.
|
| 868 |
|
| 869 |
NOTE: Can this ranking method dynamically work in my kernel? All my
|
| 870 |
locks are strictly local so far, but that might change once VFS and
|
| 871 |
file-systems are added (e.g. inode locks).
|
| 872 |
|
| 873 |
- Per-CPU Area: I _love_ how the authors state their thinking process in
|
| 874 |
that section. As in any SMP-capable kernel, VMS needed a per-CPU data
|
| 875 |
area to put things as the current thread state pointer and other per-
|
| 876 |
CPU bookkeeping.
|
| 877 |
|
| 878 |
The most effecient way to do that is to have a general-purpose register
|
| 879 |
_permanently_ set to the virtual address of the relevant per-CPU area.
|
| 880 |
Such a register did not exist in VAX, so instead of creating it in the
|
| 881 |
VAX hardware, they searched for a way to _overload_ a current register
|
| 882 |
with that purpose.
|
| 883 |
|
| 884 |
Since each CPU must have a unique stack to safely handle interrupts,
|
| 885 |
the current thread stack pointer can be safely overloaded with that
|
| 886 |
task. They put a pointer to the per-CPU area at the bottom of such
|
| 887 |
stacks, which can be reached by simply masking the stack pointer.
|
| 888 |
|
| 889 |
NOTE: See MUCH more details on this in our DEC OSF/1 discussion above.
|
| 890 |
|
| 891 |
- CPU Affinity: They maintained a mask inside each thread descriptor,
|
| 892 |
where each CPU is represented by a single bit. Setting only one bit in
|
| 893 |
that mask is equivilant to hard binding that thread to a single CPU.
|
| 894 |
Such affinity rules are naturally enforced by the scheduler.
|
| 895 |
|
| 896 |
NOTE: Solaris and Linux roughly use the same method for hard affinity.
|
| 897 |
|
| 898 |
|
| 899 |
* Bibliography from “Operating System Concepts”, Sliberchatz et al., 2007:
|
| 900 |
------------------------------------------------------------------------
|
| 901 |
|
| 902 |
Feedback queues were originally implemented on the CTSS system described
|
| 903 |
in Corbato et al.[1]. This feedback queue scheduling system was analyzed
|
| 904 |
by Scharge[2]. The preemptive priority scheduling algorithm was suggested
|
| 905 |
by Kleinrock[3].
|
| 906 |
|
| 907 |
[1] Fernando J. Corbato et al., 'An Experimental Time-Sharing System',
|
| 908 |
Proceedings of the AFIPS Fall Joint Computer Conference (1962)
|
| 909 |
|
| 910 |
[2] L. E. Scharge, 'The Queue M/G/I with Feedback to Lower Priority
|
| 911 |
Queues', Management Science, Volume 13 (1967)
|
| 912 |
|
| 913 |
[3] L. Kleinrock, 'Queuing Systems, Volume II: Computer Applications,
|
| 914 |
Wiley Interscience (1975)
|
| 915 |
|
| 916 |
Anderson et al.[4], Lewis and Berg[5], and Philbin et al[6] discuss
|
| 917 |
thread scheduling.
|
| 918 |
|
| 919 |
[4] T. E. Anderson et al., 'The Performance Implications of Thread
|
| 920 |
Management Alternatives for Shared-Memory Multiprocessors', IEEE
|
| 921 |
Transactions on Computers, Volume 38, Number 12 (1989)
|
| 922 |
|
| 923 |
[5] B. Lewis and D. Berg, 'Multithreaded Programming with Pthreads,
|
| 924 |
Sun Microsystems Press (1998)
|
| 925 |
|
| 926 |
[6] J. Philbin et al., 'Thread Scheduling for Cache Locality',
|
| 927 |
Architectural Support for Programming Language and Operating Systems,
|
| 928 |
(1996)
|
| 929 |
|
| 930 |
Scheduling techniques that take into account information regarding
|
| 931 |
process execution times from previous runs are described in Fisher[7],
|
| 932 |
Hall et al.[8], and Lowney et al.[9]
|
| 933 |
|
| 934 |
[7] J. A. Fisher, 'Trace Scheduling: A Technique for Global Microcode
|
| 935 |
Dispatching', IEEE Transactions on Computers, Volume 30, Number 7
|
| 936 |
(1981)
|
| 937 |
|
| 938 |
[8] L. Hall et al., 'Scheduling to Minimize Average Completion Time:
|
| 939 |
Off-line and On-line Algorithms', SODA: ACM-SIAM Symposium on
|
| 940 |
Discrete Algorithms (1996)
|
| 941 |
|
| 942 |
[9] P. G. Lowney et al., 'The Multiflow Trace Scheduling Compiler',
|
| 943 |
Journal of Supercomputing, Volume 7, Number 1-2 (1993)
|
| 944 |
|
| 945 |
Fair-share schedulers are covered by Henry[10], Woodside[11], and Kay
|
| 946 |
Kay and Lauder[12]
|
| 947 |
|
| 948 |
[10] G. Henry, 'The Fair Share Scheduler', AT&T Bell Laboratories
|
| 949 |
Technical Journal (1984)
|
| 950 |
|
| 951 |
[11] C. Woodside, 'Controllability of Computer Performance Tradeoffs
|
| 952 |
Obtained Using Controlled-Share Queue Schedulers', IEEE
|
| 953 |
Transactions on Software Engineering, Volume SE-12, Number 10
|
| 954 |
(1986)
|
| 955 |
|
| 956 |
[12] J. Kay and P. Lauder, 'Fair Share Scheduler', Communications of the
|
| 957 |
ACM, Volume 31, Number 1 (1988)
|
| 958 |
|
| 959 |
Siddha et al[13] discusses __MODERN__ challenges on multicore systems.
|
| 960 |
|
| 961 |
[13] S. Siddha et al, 'Process Scheduling Challenges in the Era of
|
| 962 |
Multi-Core Processors', Intel Technology Journal, Volume 11 (2007)
|
| 963 |
|
| 964 |
Multiprocessor scheduling was discussed by Tucker and Gupta[14], Zahorjan
|
| 965 |
and McCann[15], Fietelson and Rudolph[16], Leutenegger and Vernnon[17],
|
| 966 |
Blumofe and Leisersonn[18], Polychronopoulos and Kuck[19], and Lucco[20].
|
| 967 |
|
| 968 |
[14] A. Tucker and A. Gupta, 'Process Control and Scheduling Issues for
|
| 969 |
Multiprogrammed Shared-Memory Multiprocessors', Proceedings of the
|
| 970 |
ACM Symposium on Operating Systems Principles (1989)
|
| 971 |
|
| 972 |
[15] J. Zahorjan and C. McCann, 'Processor Scheduling in Shared-Memory
|
| 973 |
Multiprocessors', Proceedings of the Conference on Measurement and
|
| 974 |
Modeling of Computer Systems (1990)
|
| 975 |
|
| 976 |
[16] D. Fietelson and L. Rudolph, 'Mapping and Scheduling in a Shared
|
| 977 |
Parallel Environment Using Distributed Hierarchial Control',
|
| 978 |
Proceedings of the International Conference on Parallel Processing,
|
| 979 |
(1990)
|
| 980 |
|
| 981 |
[17] S. Leutenegger and M. Vernnon, 'The Performance of Multiprogrammed
|
| 982 |
Multiprocessor Scheduling Policies', Proceedings of the Conference
|
| 983 |
on Measurement and Modeling of Computer Systems (1990)
|
| 984 |
|
| 985 |
[18] R. Blumofe and C. Leisersonn, 'Scheduling Multi-threaded
|
| 986 |
Computations by Word Stealing', Proceedings of the Annual Symposium
|
| 987 |
on Foundations of Computer Science (1994)
|
| 988 |
|
| 989 |
[19] C. D. Polychronopoulos and D. J. Kuck, 'Guided Self-Scheduling:
|
| 990 |
'A Practical Scheduling Scheme for Parallel Supercomputers', IEEE
|
| 991 |
Transactions on Computers, Volume 36, Number 12, (1987)
|
| 992 |
|
| 993 |
[20] S. Lucco, 'A Dynamic Scheduling Method for Irregular Parallel
|
| 994 |
Programs', Proceedings of the conference on Programming Languages
|
| 995 |
Design and Implementation, (1992)
|
| 996 |
|
| 997 |
The authors then list the standard books covering well-known general
|
| 998 |
purpose operating systems kernels like SVR2, the BSDs, Linux, NT, and
|
| 999 |
Solaris by Bach, Kirk McKusick, Bovet and Cesati, Solomon and
|
| 1000 |
Russionovich, and Mauro and McDougal respectively.
|