Schedulers ---------- Study notes for a number of Scheduling-related papers, which served as a preparation for building an SMP scheduling base for my “Cute” kernel. The __primary sources__ (papers from the 60s, 80s, 90s, and 2000s) of the following topics are discussed: - Multi-level Feedback Queues: that algorithm is the cornerstone of old & new schedulers like SVR2/3/4, the BSDs, Linux O(1), NT, Staircase, and Staircase Deadline algorithms. - SVR4 and Solaris event-driven scheduling: including the disadvantages I faced while implementing it. - Spinlocks: why was it badly-needed and invented by VMS engineers, and on missing an important condition of critical section theory in its original design. - Per-CPU Data Areas: how VAX/VMS originally handled it, how the VMS-based NT kernel did it afterwards, and how the AMD x86-64 architecture specifically kept the %fs and %gs segments cause of that original NT design. - Finding the current thread state using stack masking: why was that design engineered by VMS on the VAX architecture, and how it was transfered to DEC OSF/1 and unnecessarily later to Linux. - DEC OSF/1 and VAX/VMS symmetric multiprocessing: these kernels were the origin of now-classical ideas like spinlocks, per-CPU runqueues, interactions between spinlocks and kernel preemption, and thread scheduling soft and hard affinity! 2010-2011, Ahmed S. Darwish NOTE: To get things going at first, I start with a very basic overview of general-purpose thread scheduling using course notes by R. Arpaci-Dusseau. If you're familiar with the basics, move directly to the legendary F.J. Corbato's “An Experimental Time-Sharing System” paper. NOTE-2: You can find a cached copy of *all* the discussed papers at http://gitorious.org/cute-os/references/trees/master/papers/sched NOTE-3: Turn UTF-8 on if you're reading this inside a web browser. * TOC: ---- - Arpaci00, Operating Systems Notes, 'Scheduling: Introduction' - Corbato62, 'An Experimental Time-Sharing System', MIT - Corbato63, 'The Compatible Time-Sharing System - A Programmer's Guide' - Dingledine09, 'Performance Improvements on Tor' - Arpaci00, OS Notes, 'Scheduling: The Multi-Level Feedback Queue' - Arpaci98, 'Multilevel Feedback Queue Scheduling in Solaris' - Oracle02, Man Pages Section 4: File Formats, ts_dptbl(4) - Black90, 'Concurrency and Parallelism in the Mach Operating System' - Denham94, 'DEC OSF/1 Version 3.0 Symmetric Multiprocessing Implementation' - Gamache88, 'VMS Symmetric Multiprocessing' - Sliberchatz08, Bibliography from the 'Operating System Concepts' book * Remzi Arpaci-Dusseau, Operating System Notes, “Scheduling: Introduction”, University of Wisconsin-Madison, 2000: ---------------------------------------------------------- The included portion of Remzi's notes (ch. 6, 7, and 8) build a light- weight (undergrad-level) introduction to general-purpose OS scheduling. These documents also include a number of good references. Here are the important points in these notes + analysis of mine: - Some of the metrics in measuring schedulers include 'Turnaround time', which is the time taken between registering a job for execution and its completion/return/finish. Mainly: T_{turnaround} = T_{finish} - T_{arrival} Another metric, caring more about latency, is the 'response time' which is the time the job waits in the ready queue till some CPU time is provided: T_{response} = T_{firstrun} - T_{arrival} - The cost of context switching does not arise from the OS actions of saving and restoring a few registers. For example, we do context switching in our kernel using only 37 x86 opcodes: they are mainly register moves and stack pops and pushes. The real cost is that "when programs run, they build up a great deal of state in CPU caches, TLBs, branch predictors, and other on-chip hardware. Switching to another job causes this state to be flushed and new state relevant to the currently-running job to be brought in, which may exact a noticeable performance cost." - Round-robin, the simplest practical scheduler possible, involves the act of __cycling__ on the runqueue's list of processes, running each job for a fixed time slice (unit). Assuming processes A, B, C, and D, the scheduler run the jobs as follows | A | B | C | D | A | B | C | D | A | B | C | D | A | B | C | ... each for an equal amount of time. Advantages include implementation simplicity, good average response times (see disadvantages though), and being starvation-free by design. Disadvantages include very bad behaviour on system saturation, cause the system will swap-in and out programs for each (supposedly small) time slice [*], and response-time rising linearly with the number of active jobs in the system, even for highly-interactive processes. [*] this is a point handled by the Corbato paper below - In practice, most of processes will issue a number of blocking I/O requests during their allocated time quanta, which leads to moving them out of the runqueue altogether, and returning back to such queue after I/O completion. HOW are these jobs treated after returning back can be as simple as considering them new processes (classical round-robin), or to perform timing calculations and modify their priority (assuming 'priorities' support in the scheduler, or a similar concept). - Finally, there's a nice paragraph on the separation of mechanism and policy in OS Design. For example, stuff like programming the system ticks handler and context-switching code are scheduling mechanisms (mechanically how the jobs get interleaved). Scheduling policies (or disciplines) is what you're currently reading is about. In real-time POSIX for example, there are several programmable scheduling 'policies' that can be setup through sched_setscheduler(2). Such policies include SCHED_FIFO, SCHED_RR, and SCHED_OTHER. * F.J. Corbato et al., “An Experimental Time-Sharing System”, AIEE-IRE Proceedings of the May 1-3, 1962, Spring Joint Computer Conference, MIT: ------------------------------------------------------------------------ This is one of the earliest papers on time-sharing schedulers. The core of the paper (beside its overall historical importance) is the section 'A Multi-Level Scheduling Algorithm', which is in modern terminology now called the 'Multi-level Feedback Queue' (MLFQ). Important points: - The main problem the algorithm tries to solve is system saturation: a) total size of active user programs exceed high-speed memory b) too many active user programs to maintain an adequate response at each user console. c) high performance cost of context switching According to the authors, a good solution is to provide a procedure which gives graceful degradation of the response time and effective real-time computation speed of the *large and long-running users*. - The paper argues that a simple round-robin scheduler will not work in case of saturation cause it will cause an 'abrupt collapse of service' due to the sudden burden of time required to swap programs in-and-out of secondary memory such as a disk. The multi-level scheduling algorithm was thus presented to improve the saturation performance of a time-sharing system. - We now delve into the scheduler algorithm details: from the very first, the scheduler partitions its runqueue to several priority queues l_0, l_1, ..., l_L. ('L' is to be calculated later) - In general, the bigger the program memory footprint (size), the less queue priority it's assigned to. But, the less the queue priority, the bigger the runtime for each of its tasks in number of quantum. Mathematically: l = [ log_2( [w_p/w_q] + 1 ) ] where each task at queue 'l' runs for 2^l quanta, and that if l_a < l_b, then l_a has a higher priority than l_b. We can deduce several things from above equation: a) if 0 < w_p < w_q, [w_p/w_q] = 0, runtime = 1 quantum |-- 1 b) if w_q < w_p < 2*w_q, [w_p/w_q] = 1, runtime = 2 quanta -| c) if 2*w_q < w_p < 3*w_q, [w_p/w_q] = 2, runtime = 2 quanta -|-- 2 d) if 3*w_q < w_p < 4*w_q, [w_p/w_q] = 3, runtime = 4 quanta ..| e) if 4*w_q < w_p < 5*w_q, [w_p/w_q] = 4, runtime = 4 quanta ..| f) if 5*w_q < w_p < 6*w_q: [w_p/w_q] = 5, runtime = 4 quanta ..|-- 4 g) if 6*w_q < w_p < 7*w_q: [w_p/w_q] = 6, runtime = 4 quanta ..| h) if 7*w_q < w_p < 8*w_q: [w_p/w_q] = 7, runtime = 8 quanta +++ ... +++|-- 8 p) if 15*w_q < w_p < 16*w_q: [w_p/w_q] = 15, runtime = 16 quanta xxxx ... xxxx|-- 16 _) if 31*w_q < w_p < 32*w_q: [w_p/w_q] = 31, runtime = 32 quanta @@@@@ ... @@@@@|-- 32 _) if 63*w_q < w_p < 64*w_q: [w_p/w_q] = 63, runtime = 64 quanta Note: at point 'a)', queue 'l' is equal to 0: the highest possible queue in priority. By above design, the bigger the program, the bigger its allocated number of quanta, making the time-overhead of swapping negligible. Thus, optimizing the turnaround time of CPU-bound processes. - The above point in scheduler design _tries_ to solve the first part of the problem the algorithm tackles: the stated point 'a) total size of active user programs exceed high-speed memory' above. From now on, the paper tackles second part of the problem, 'b) too many active user programs to maintain an adequate response'. - The scheduler __always__ begin with the tasks at the head of the highest priority queue (l_0). After that queue gets exhausted, it moves to the direct following queue l_1, and so on <== (I) Tasks in a single queue get run in a round-robin fashion. - If while executing a task at queue 'l' (during the 2**l quantum), a task got assigned to a higher priority queue 'l*', the current executing task get put at the __head__ of the 'l'th queue, and the process at (I) is restarted from queue 'l*', and so on. In other words, a task cannot get executed while another task is waiting at a higher priority queue. - If a program in queue 'l' has not 'completed' (did not issue any I/O) within its 2**l-quantum runtime, it's placed at the __tail__ of the less-priority queue 'l+1' - A powerful point in the algorithm above (beside broad bounds on system performance) is that the classification procedure for tasks is entirely automatic: it depends on performance and program size rather than static declarations or (the possibly malicious) users hopes. - As once stated above, the main theme around this scheduler is improving the __saturation performance__ of the system. This performance improvement on saturation exists cause as a user taxes the system, the degradation of service occurs progressively starting with users with large or long-running programs (least-priority queues) first. - Starvation: there's a probable case of a big number of active high priority tasks starving less-priority CPU-bound tasks out of execution. The paper __models__ a worst cast of having N tasks, each running at queue 'l', and each relinquishing the CPU __directly__ before their 2**l quantum runtime, and thus each also returning to the the same- priority queue, with the condition of returning to the runqueue only 't_u'-seconds later (time for finishing I/O and, or, user response). If the system will always be busy at queue 'l' above, starving all less priority queues 'l + x' for x > 0, then the queue 'l' must never be empty; we must have: N * 2^l * q >= t_u 2^l >= (t_u / N*q) l >= log_2 (t_u / N*q) where 'N' is the number of tasks at queue 'l'. From above, we've found an upper-bound on the priority queue that can be serviced: queues > 'l' above can possibly starve if number of tasks reach N. Thus, we have a guarantee that all l_a <= [ log_2 (t_u / N * q) ] will be serviced and will __not__ get starved (given the estimated response time t_u), where '[]' denotes the integral part of enclosed equation. The author calls this queue the 'highest serviced level' queue. - The equation above has set a bound on the starvation problem. So if we set the max possible 'highest serviced queue' to the max possible queue on the system, we have a __theoretical guarantee__ of no starvation in the system. (Providing this guarantee is very hard for modern-day general purpose operating systems: we cannot easily put a limit on the number of tasks 'N', and we cannot even have a valid estimation of 't_u', the I/O or user response finish time. These values can be calculated easily if the target hardware and work load is previously known; this is the case in the paper [IBM 7090], and this can also be the case for small and predictable devices like a small network router.) - By now, we've analyzed most of the paper sans the response time analysis. While these response time equations are intellectually simulating, they offer very broad bounds that are not entirely practical for a general purpose scheduler, similar to the paper's way of handling starvation in the above point. * F.J. Corbato et al., “The Compatible Time-Sharing System: A Programmer's Guide”, The MIT Computation Center, MIT Press, 1963: ------------------------------------------------------------------------ This manual gives a very nice overview of the historical time-sharing hands-on experience. It's a worthwhile read while studying the CTSS scheduler paper above. A useful demonstration of a programmer's session on CTSS is also presented in the Appendix. Interesting text editor!! it makes me remember the original UNIX editor 'ed'. * Roger Dingledine, “Performance Improvements on Tor”, 2009: ---------------------------------------------------------- Now why am I including a paper on Tor while we're dealing with schedulers? Because the Tor network tries to solve a problem very similar to the one Corbato was designing his scheduler model against: ``Section 1 described mechanisms to let low-volume streams have a chance at competing with high-volume streams. Without those mechanisms, normal web browsing users will always get squeezed out by people pulling down larger content and tolerating high latency. But the next problem is that some users simply add more load than the network can handle. Just making sure that all the load gets handled __fairly__ isn't enough if there’s too much load in the first place.(+) When we originally designed Tor, we aimed for high throughput. We figured that providing high throughput would mean we get good latency properties for free. However, now that it’s clear we have several user profiles trying to use the Tor network at once, we need to consider changing some of those design choices. Some of these changes would aim for better latency and worse throughput. (++) '' The statement marked with (+) matches word-by-word the problem statement of the Multi-Level Feedback Queue paper: to gracefuly handle 'system saturation' while maintaining an acceptable level of system responsiveness. Thus, avoiding the 'abrupt collapse of service' caused by OS scheduler's round-robin algorithm or Tor's full fairness in high overlad. A main theme of the MLFQ solution, and Tor's one proposed at (++), is not to let throughput-driven jobs (compilers in schedulers case, an 700-MB ISO download in the Tor case) deprive latency-senitive jobs out of quickly-needed resources. Examples of such latency-sensive apps include an editor or web-browser GUI scrolling in OS schedulers, and servicing HTML websites in Tor. * Remzi Arpaci-Dusseau, Operating Systems Notes, The MLFQ, 2000: -------------------------------------------------------------- Here, the MLFQ algorithm we studied from __primary sources__ above is re-discussed in modern terms. Important points: - The hardest point in designing schedulers is to minimize response time for interactive jobs, while also minimizing overall system turnround time. - As we know by now, MLFQ varies the priority of a job based on its observational behaviour: learning from jobs as they run, and using history to predict their future behaviour. Historical state is kept in the scheduler priority queues. This is an important systems design pattern of learning from history: observing current trends and learning from the past to predict the future. Such pattern is common in CPUs branch prediction, working set rules, and in networking protocols congestion control mechanisms. Remzi advises us though that 'such approaches work when jobs have __phases of behavior__ and are thus predictable; of course, one must be careful with such techniques, as they can easily be wrong and drive a system to make worse decisions than they would have with no knowledge at all.' - There are two important problems in MLFQs: a) Starvation: One of the possible ways to handle it is the well-known 'aging principle': increasing the priority as the process waits. Check below parenthesized notes though; it's not that simple. (Note1: The original paper theoretically 'solved' this issue by giving bounds on starvation. These bounds can only work if we have a true estimate of I/O response times _and_ of the jobs number in the system. This is practically impossible for general-purpose scheduling.) (Note2: There are several ways to __practically__ apply aging methods. Some of these ways are discussed in David Black's paper 'Scheduling support for concurrency and parallelism in the Mach Operating System.') b) Gaming the system: A possible solution is to improve the algorithm accounting of CPU time, making it less viable to gaming. Check the notes for further details. * Remzi Arpaci-Dusseau, “Multilevel Feedback Queue Scheduling in Solaris”, 1998 && Solaris, Man Pages Section 4: File Formats, ts_dptbl(4), 2002: ------------------------------------------------------------------------ From now on, we check how various real-world kernels implement MLFQ scheduling. Beginning with Solaris, here are the important points: - For regular jobs, the kernel has 60 priority queues. A job begins in the middle, at priority 29. As in classical MLFQ, CPU-bound jobs get assigned smaller priorities over time (with bigger slices), and I/O tasks move up, getting scheduled whenever they have jobs to do. Now here is Solaris 'dispatcher parameter table' defaults: Priority q (ms) tqexp slpret maxwait lwait 0 200 0 50 0 50 ... vvv v vv v vv 9 vvv v vv v vv 10 160 v 51 v 51 11 vvv 1 vv v vv 12 vvv 2 vv v vv 13 vvv 3 vv v vv 14 vvv 4 vv v vv 15 vvv 5 vv v vv 16 vvv 6 vv v vv 17 vvv 7 vv v vv 18 vvv 8 vv v vv 19 vvv 9 vv v vv 20 120 10 52 v 52 21 vvv 11 vv v vv 22 vvv 12 vv v vv 23 vvv 13 vv v vv 24 vvv 14 vv v vv 25 vvv 15 vv v vv 26 vvv 16 vv v vv 27 vvv 17 vv v vv 28 vvv 18 vv v vv 29 vvv 19 vv v vv 30 80 20 53 v 53 31 vv 21 vv v vv 33 vv 22 vv v vv 33 vv 23 vv v vv 34 vv 24 vv v vv 35 vv 25 54 v 54 36 vv 26 vv v vv 37 vv 27 vv v vv 38 vv 28 vv v vv 39 vv 29 vv v vv 40 40 30 55 v 55 41 vv 31 vv v vv 42 vv 32 vv v vv 43 vv 33 vv v vv 44 vv 34 vv v vv 45 vv 35 56 v 56 46 vv 36 57 v 57 47 vv 37 58 v 58 48 vv 38 vv v vv 49 vv 39 vv v 59 50 vv 40 vv v vv 51 vv 41 vv v vv 52 vv 42 vv v vv 53 vv 43 vv v vv 54 vv 44 vv v vv 55 vv 45 vv v vv 56 vv 46 vv v vv 57 vv 47 vv v vv 58 vv 48 vv v vv 59 20 49 59 32000 vv tqexp: time quantum expired priority, the new priority the thread is set to when it uses its entire time quantum. As you can see, any process using its entire slice get its priority decreased by __10 levels__. slpret: sleep return priority value, the new priority the thread is set to after waking up (from a sleep event like network or disk I/O.) The thread priority is increased to 50 or above after an I/O operation! maxwait: a starvation avoidance mechanism to compensate threads waiting for a relatively long time in the dispatcher queue. 'maxwait' is such waiting time threshold. If a thread waited more than maxwait in the queues, its priority is boosted up to the respective 'lwait' priority. Each thread has a dispatcher_wait ('dispwait') counter, which is reset when the thread is inserted into the dispatch queue (following a time- quantum expiration, or cause of wakeup from I/O or locking). This field is incremented once per second for all processes waiting in the dispatcher queues. If such counter exceeded 'maxwait', then the thread priority is incremented to 'lwait' -- a priority > 50 by default. Note that the default 'maxwait' for Solaris is 0: just 1 second and all the system processes priority will be boosted up above 50. This avoids starving CPU-bound too much. [Sol-*]: (The values above have some very interesting details: Note that after 1-second, processes in the lowest (0-9) range will move to priority 50, processes in the range (10-19) will move to 51, (20-29) to priority 52, and so on. By this, high-priority processes response times will NOT be affected because they will still run on queue 59 instantly. Now, how can this prevent starvation? Can't processes keep going in and out very quickly, starving queues < 59?. Indeed that was the mathematically modeled worst case in the original CTSS paper. Now comes the second important detail: even if the worst case above occured (highly unlikely in practice, except if done maliciously), you'll see that the 'lwait' field of the 50-59 priorites is 59. Thus, in the second __starvation round__, all will move to 59 and execute in a round-robin manner. This solves the starvation problem elegantly. Summary: starvation can never exceed 2 seconds + waiting time in prio queue 59. All while maintaining system responsiveness in the process.) - NOTE: I've coded such scheduler in my kernel, and its disadvantages began to appear directly after the first day of benchmarking: too much heuristics and voo-doo constants! Scheduling 100% CPU-hog threads showed the ugly side of that algorithm: the combination of a big number of dynamic priority levels with huge and different slice lengthes per each priority kept starving the low priority processes till the newly-queued CPU hogs moved up to their equivalent low-priority level. While the new CPU hogs were moving up, the older low-priority tasks faced complete starvation for 1+ seconds. The algorithm thus boosted these threads to high-priority levels, but this __starved__ the newer threads that are now residing in relatively lower-priority queues. This effectively made all system processes enter an infinite loop of 1+-second starvation and then boosting -- an upleasent situation. Decreasing the number of priority levels and making the slices more fair helped the situation, but kept introducing __new__ corner-cases. After a number of days, I gave-up and went coding a more theoretically-sound scheduling algorithm. (The benchmark was simply plotting different kernel threads properties over time. Such properties included threads total runtime, average runtime per dispatch, total wait in runqueues, average wait in runqueues, number of priority boosts [cause of starvation], number of preemptions cause of a higher-priority thread becoming runnable, and number of preemptions cause of a thread time-slice end: (a) code version 1, 6 threads: http://localf.googlepages.com/pub/mlfq/fairness.gif (b) code version 1, 100 threads: http://localf.googlepages.com/pub/mlfq/fairness100.gif (c) code version 2 (after sanitizing the voo-doos), 6 threads: http://localf.googlepages.com/pub/mlfq/fairness-improved.gif (d) code version 2, 100 threads (things became good!!): http://localf.googlepages.com/pub/mlfq/fairness-improved.gif (e) unfortunately, the _new_ heuristics above had corner cases: http://localf.googlepages.com/pub/mlfq/fairness-improved-pathological.gif It seems I'm not the only one facing this though; I've found some refs in academia on how SVR4-based schedulers [like Solaris] were very hard to tweak and adapt to different workloads.) * David L. Black, “Scheduling Support for Concurrency and Parallelism in the Mach Operating System”, IEEE Computer Journal, Carnegie Mellon, 1990: ------------------------------------------------------------------------- This is a very interesting article, but we will concentrate on its MLFQ deficiencies discussion for now (Note: always remember Black's point of creating an idle thread per CPU, not to corrupt kernel stack on multiprocessor systems). Important points: - Multics used a decreasing priority MLFQ (as was exactly described in its original paper) and discovered the major deficiency: on a heavily loaded system with many short-lived jobs, the priority of a CPU-bound job can decrease until little or no CPU-time is assigned to it. The author notes that such problem (starvation) also existed on a number of UNIX kernels. - To handle this problem, it's necessary to elevate priorities in some manner. VAX/VMS elevates a low application priority when it issues an event like I/O completion. The author states that this negatively affect interactive applications which consume large amount of processor time, since they may not issue priority-raising events to an effective degree. (Nonetheless, this approach is effective, response-time wise, for handling software with 'phases' of I/O and CPU-bound activity if the CPU-bound activity is long but not dominant. This is a point the original CTSS scheduler did not address: a job's transition from a CPU-bound to an I/O-bound phase.) - A (non-exclusive) second way to handle this is processor usage aging. To compensate lower-priority jobs, we forget (decay) their past history in some way, usually in an exponential decay manner. (Exponential decay: N(t) = N(0) * (decay_factor)**t. Assuming a decay factor of 0.5, and N(0)=10, then N(1)=5, and N(2) = (10/2)/2 = 0.25) - Solaris, as we've seen, uses BOTH of the prevention methods above: processes priorities get elevated after completion of an I/O event, AND the kernel also forgets (age) __most__ of the current processes usage history after one second (coalescing each 10 priorities to one, while maintaining the relative difference between them). - 4.3BSD: it solely uses the aging method to handle starvation, with a CPU-usage decay factor that's based on system load. The bigger the system load, the less usage decay, avoiding jobs from getting crammed in high priority queues on high system load. Priority of the __currently running__ thread get re-calculated after each 4 ticks of its runtime, decreasing linearly with job's CPU usage. If a higher priority was discovered after this decrease, the thread is preempted. Routines responsible for waking up a thread or adding a new thread to the runqueue check such thread priority against the currently running thread. If it's higher, a reschedule operation is requested. (XXX: Unix Internals discusses the problem of classical SVR2/3 decay, its equations, and the new BSD decay method. Add these notes here as a supplement to the original paper discussion.) * Jeffrey M. Denham et al., “DEC OSF/1 Version 3.0 Symmetric Multiprocessing Implementation”, Digital Technical Journal, volume 6 number 3, Summer 1994: --------------------------------------------------------------------------- This paper is gold, and is comparable in quality to the CTSS one. It discusses a huge number of now-classical SMP features: + Accessing the 'current' thread state by masking the stack pointer, regardless of the CPU we're currently running on. + Per-CPU runqueues for thread scheduling! + SMP scheduling soft affinity + Interactions between spinlocks and kernel preemption + Striking a balance between fine-grained locking and lockless algorithms A good number of above ideas were of a VAX/VMS origin. Important notes: - The task of an SMP kernel is to improve system performance as CPUs are added without compromising quality. DEC OSF/1 goal was to provide a performant Unix SMP system over the Alpha architecture. That hardware architecture provided shared memory, symmetric access to system I/O busses, atomic read-and-set instructions, and interprocessor interrupts (IPIs). NOTE: This is very close to the Intel MultiProcessor (MP) architecture. - As we know by now, classical Unix kernels insured its consistency by: a] disabling interrupts in critical regions accessed by IRQ handlers (and thus handling the concurrency of asynchronous IRQ handlers) b] allowing only one process to be in kernel context at a time (and thus handling the virtual concurrency provided by time-sharing) These protections aren't enough in case of multicore systems, since they provide real concurrent execution of kernel code paths. An evolutionary solution to that problem was the use of 'funneling': allowing certain kernel subsystems to be executed only in one CPU at _any_ given time. These subsystems was then iteratively and slowly moved to local fine-grained locking. NOTE: Linux used that 'funneling' method in its early days using the 'Big Kernel Lock (BKL)'; FreeBSD did the same using a 'Global Mutex'. Both kernels moved to fine-grained locking iteratively. By virtue of being started in Winter of 2009, my kernel incorporates fine-grained locking from the start. - It's interesting to note that the first version of DEC OSF/1 provided full kernel-mode preemption in its real-time version of the kernel. Solaris was also made fully preemptive with fine-grained locking very early on. Linux and FreeBSD caught on around 10 years later. - The paper then discusses how kernel preemption should be disabled by the entrance of spinlocks. It also discusses the now-very-common idea of counting the depth of spin locks acquired using a reference count, and preempting the kernel if and only if that reference count = 0. NOTE 1: This is equivalent to Robert Love's Linux kernel preemption patch. A google search reveals that this patch got merged in 2001. NOTE 2: FreeBSD on the other hand completely disables local interrupts upon spin lock entrance, automatically disabling preemption in the process (at a slight performance cost). I do the same in my kernel. NOTE 3: (cut-and-paste from my kernel comments) Disabling preemption in spinlock-protected regions is a necessity because: a) If a thread holding a spin lock got preempted, it will let a possibly huge number of other threads consume their entire time slice (which is in the order of milliseconds) just spinning! b) If that preempted thread (which is now in the scheduler runqueue) got killed, we can deadlock the entire system! c) Classical race conditions may arise if the newly scheduled thread accessed the same critical-region the older thread was accessing while being preempted. That's why spin locks are also needed in preemptive Uniprocessor kernels: ignoring their spinning behavior, they act as atomic critical region markers. There's also a warning about accessing per-CPU variables while kernel preemption is enabled. Modern kernels disable preemption in that state. NOTE 4: (also from my kernel comments) Disabling preemption while accessing a per-CPU variable is necessary: a) We might get preempted and then rescheduled on a different CPU. b) If the _preempting_ thread used the same per-CPU variable, that's a classical race condition. The paper notes that due to the complexities of kernel locking, a "coherent locking architecture with automated debugging facilities was needed to ship a reliable product on time." - One problem faced by SMP is finding the descriptor of the current process context. In a uniprocessor machine, that can be simply done by a global memory variable that is changed before context switching. On SMP, there's a "current" process descriptor for each CPU. This descriptor can be located by finding the process number we're running on, and then accessing an array of per-cpu data based on that number. In Unix kernels, the current thread state is dereferenced a huge number of times. Needing to know the current CPU number before each of these dereferences (a costly hardware operation) was unacceptable. NOTE 1: In contemporary x86 CPUs, this is costly since it needs reading the APIC registers for each dereference. APIC registers accessors induce higher latency than the usual general-purpose registers. Even assuming that this operation wasn't costly from the hardware side, we'll have to disable, then enable, kernel preemption each time we want to access that thread state; also unacceptable. DEC OSF/1 solved this by storing a pointer to the currently running process descriptor at the very bottom of each thread stack. By simply masking the stack pointer, they can instantly access that thread state, __regardless__ of the CPU that thread is currently running on. NOTE 2: The paper states that putting the 'current' thread state pointer in the per-CPU area, and then setting it before each context switch, would frequently trash other CPUs caches. This is true in the regular case, but aligning such per-CPU array elements on a cache-line boundary (64 or 128 bytes in x86 CPUs) would solve that problem. It's possible that the Alpha architecture had different semantics though. NOTE 3: Linux x86-32 used that stack-masking design for a long time. I guess it was transfered to its developers by folklore. NOTE: What about the risks, econet anyone? NOTE 4: On the x86 kind of the equation, it's common to _permanently_ assign a segment register with the address of the relevant per-CPU area (aligned on a cache-boundary) for each CPU. This avoids the inefficiency of accessing the APIC for every per-CPU data dereference. NOTE: Not all CPU architectures have the luxury of a such a register, which is left _intact_ by the basic system ABI definition. The absence of such register in the VAX architecture was the main reason of the stack-masking idea in VMS. Explore the 'VMS Symmetric Multiprocessing' paper, DTJ vol.1#7, for further historical context. NOTE: Reading the historical (preliminary) AMD64 architecture document (January 2001, Rev C), it seems that AMD kept the %fs and %gs registers because Windows NT used %gs as a per-CPU area pointer. Interestingly, NT is, afterall, a VMS-based kernel! - Per-CPU runqueues: Paper begins by noting that "scheduling threads out of a global runqueue is highly effective at keeping the N highest-priority threads running". It then states the disadvantages of that method: lock contention over the global runqueue and losing CPU cache state due to uncontrolled bouncing of threads between CPUs. NOTE: Having a global runqueue and taking CPU affinity in consideration is not mutually exclusive. Brainfuck Scheduler (BFS) and the Linux-2.4 epoch scheduler is an example of this. The DEC designers thus added "soft affinity" to timeshare threads: these threads are softly bound to the processor they last run on. Nevertheless, if too much time passed since last run, they can get scheduled elsewhere: the CPU state was probably lost anyway. It's more beneficial to directly run that thread on another less-loaded CPU than wait for the current one. - Load balancing: As a summary of the presented algorithm, let TS(i): The estimated number of ticks available to time-sharing threads by CPU #i in a scheduling period TTS : Total number of system ticks available to time-sharing threads N : Total number of runnable time-sharing threads in the system D(i) : Ideal number of runnable time-sharing threads assigned to CPU #i C(i) : Current number of time-sharing threads assigned to CPU #i we have TTS = TS(1) + TS(2) + ... + TS(k), k = number of system CPUs D(i) = (TS(i) / TTS) * N Every second, the kernel calculates D(i) for each CPU. If a CPU had a C(i) < D(i), and another one had C(i) > D(i), the former is marked as "out of balance". When such unbalanced CPU reschedules, it "steals" tasks from the overloaded CPU. Tasks get stolen only if they did not run on their softly-bound CPU for a configurable amount of time, thus preserving soft affinity usefulness as much as possible. - The paper finishes by stating the software engineering tactics needed to safely develop such an ambitious code base. Remember that this was the 90s. After studying that paper, it's clear that the two most important tactics were: - Code review, and - Iterative development So, well, things haven't changed a lot from the software engineering side! * Rodney N. Gamache && Kathleen D. Morse, “VMS Symmetric Multiprocessing”, Digital Technical Journal, volume 1 number 7, pp. 57-73, August 1988: ------------------------------------------------------------------------ NOTE: Below discussion assumes reading our notes on DEC OSF/1 v3.0 first. "Systems software engineers must now design effective ways to utilize systems with six, eight, or even more CPUs", begins this late 80s paper! The SMP additions to VAX/VMS were motiviated by the, back then new, VAX 6200 architecture. From other papers in the _same_ journal issue, that hardware design had: + Symmetric shared memory + Cache coherency in hardware (always a bless!) + First-level cache for instructions, and a second-level write-through one for both data and instructions + Broadcast and single-destination inter-processor interrupts This paper INTRODUCED a good number of now-classical SMP features: + Spinlocks! (also _coined_ that term) + Accessing the Per-CPU area (and thus the current thread state) by masking the CPU stack pointer register. + SMP scheduling hard affinity! (also _coined_ that term) Having above context in mind, here were the important points: - VMS engineers first considered using a global lock[*] to protect the kernel in an SMP machine. They (rightfully) ditched the idea later and moved directly to the ambitious goal of fine-grained locking. NOTE: Compare this to the DEC OSF/1 method, where they first used a global lock, and then fine-grained the system iteratively. [*] Big Kernel Lock in Linux jargon, Global Mutex in the BSDs - Inventing spinlocks: Regular VMS mutexes must be acquired in a process context, but the kernel needed SMP protection in different other contexts. That's why spinlocks were created; they can get acquired by any CPU in any context (as long as certain sanity rules are held). NOTE 1: A classical case is protecting the video-RAM and printf() buffers. The kernel usually want to access such resource in all possible contexts, including boot context, exception and even IRQ context, not just inside process contexts. Usually, a mutex is acquired or "owned" by a certain process; spinlocks on the other hand are acquired by the CPU as a whole. Spinlocks was implemented in VMS using an interlocked bit-test-and-set op. NOTE 2: It's still implemented similarly these days, but it might be faster to use an interlocked swap. A missing issue in that paper design was making threads lock spinning time be finite (bound); that's an important (third) condition in the theory of critical sections. NOTE: That got handled in Linux using N.Piggin's ticket spinlocks. - VMS added two important debugging aids to spin locks: a) each spinlock was assigned a rank (order). All spinlocks must be acquired in that order, eliminating any possible deadlocks. b) each spinlock structure saved the last eight program counters (PC/ x86 %RIP) that acquired or released such lock. NOTE: Can this ranking method dynamically work in my kernel? All my locks are strictly local so far, but that might change once VFS and file-systems are added (e.g. inode locks). - Per-CPU Area: I _love_ how the authors state their thinking process in that section. As in any SMP-capable kernel, VMS needed a per-CPU data area to put things as the current thread state pointer and other per- CPU bookkeeping. The most effecient way to do that is to have a general-purpose register _permanently_ set to the virtual address of the relevant per-CPU area. Such a register did not exist in VAX, so instead of creating it in the VAX hardware, they searched for a way to _overload_ a current register with that purpose. Since each CPU must have a unique stack to safely handle interrupts, the current thread stack pointer can be safely overloaded with that task. They put a pointer to the per-CPU area at the bottom of such stacks, which can be reached by simply masking the stack pointer. NOTE: See MUCH more details on this in our DEC OSF/1 discussion above. - CPU Affinity: They maintained a mask inside each thread descriptor, where each CPU is represented by a single bit. Setting only one bit in that mask is equivilant to hard binding that thread to a single CPU. Such affinity rules are naturally enforced by the scheduler. NOTE: Solaris and Linux roughly use the same method for hard affinity. * Bibliography from “Operating System Concepts”, Sliberchatz et al., 2007: ------------------------------------------------------------------------ Feedback queues were originally implemented on the CTSS system described in Corbato et al.[1]. This feedback queue scheduling system was analyzed by Scharge[2]. The preemptive priority scheduling algorithm was suggested by Kleinrock[3]. [1] Fernando J. Corbato et al., 'An Experimental Time-Sharing System', Proceedings of the AFIPS Fall Joint Computer Conference (1962) [2] L. E. Scharge, 'The Queue M/G/I with Feedback to Lower Priority Queues', Management Science, Volume 13 (1967) [3] L. Kleinrock, 'Queuing Systems, Volume II: Computer Applications, Wiley Interscience (1975) Anderson et al.[4], Lewis and Berg[5], and Philbin et al[6] discuss thread scheduling. [4] T. E. Anderson et al., 'The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors', IEEE Transactions on Computers, Volume 38, Number 12 (1989) [5] B. Lewis and D. Berg, 'Multithreaded Programming with Pthreads, Sun Microsystems Press (1998) [6] J. Philbin et al., 'Thread Scheduling for Cache Locality', Architectural Support for Programming Language and Operating Systems, (1996) Scheduling techniques that take into account information regarding process execution times from previous runs are described in Fisher[7], Hall et al.[8], and Lowney et al.[9] [7] J. A. Fisher, 'Trace Scheduling: A Technique for Global Microcode Dispatching', IEEE Transactions on Computers, Volume 30, Number 7 (1981) [8] L. Hall et al., 'Scheduling to Minimize Average Completion Time: Off-line and On-line Algorithms', SODA: ACM-SIAM Symposium on Discrete Algorithms (1996) [9] P. G. Lowney et al., 'The Multiflow Trace Scheduling Compiler', Journal of Supercomputing, Volume 7, Number 1-2 (1993) Fair-share schedulers are covered by Henry[10], Woodside[11], and Kay Kay and Lauder[12] [10] G. Henry, 'The Fair Share Scheduler', AT&T Bell Laboratories Technical Journal (1984) [11] C. Woodside, 'Controllability of Computer Performance Tradeoffs Obtained Using Controlled-Share Queue Schedulers', IEEE Transactions on Software Engineering, Volume SE-12, Number 10 (1986) [12] J. Kay and P. Lauder, 'Fair Share Scheduler', Communications of the ACM, Volume 31, Number 1 (1988) Siddha et al[13] discusses __MODERN__ challenges on multicore systems. [13] S. Siddha et al, 'Process Scheduling Challenges in the Era of Multi-Core Processors', Intel Technology Journal, Volume 11 (2007) Multiprocessor scheduling was discussed by Tucker and Gupta[14], Zahorjan and McCann[15], Fietelson and Rudolph[16], Leutenegger and Vernnon[17], Blumofe and Leisersonn[18], Polychronopoulos and Kuck[19], and Lucco[20]. [14] A. Tucker and A. Gupta, 'Process Control and Scheduling Issues for Multiprogrammed Shared-Memory Multiprocessors', Proceedings of the ACM Symposium on Operating Systems Principles (1989) [15] J. Zahorjan and C. McCann, 'Processor Scheduling in Shared-Memory Multiprocessors', Proceedings of the Conference on Measurement and Modeling of Computer Systems (1990) [16] D. Fietelson and L. Rudolph, 'Mapping and Scheduling in a Shared Parallel Environment Using Distributed Hierarchial Control', Proceedings of the International Conference on Parallel Processing, (1990) [17] S. Leutenegger and M. Vernnon, 'The Performance of Multiprogrammed Multiprocessor Scheduling Policies', Proceedings of the Conference on Measurement and Modeling of Computer Systems (1990) [18] R. Blumofe and C. Leisersonn, 'Scheduling Multi-threaded Computations by Word Stealing', Proceedings of the Annual Symposium on Foundations of Computer Science (1994) [19] C. D. Polychronopoulos and D. J. Kuck, 'Guided Self-Scheduling: 'A Practical Scheduling Scheme for Parallel Supercomputers', IEEE Transactions on Computers, Volume 36, Number 12, (1987) [20] S. Lucco, 'A Dynamic Scheduling Method for Irregular Parallel Programs', Proceedings of the conference on Programming Languages Design and Implementation, (1992) The authors then list the standard books covering well-known general purpose operating systems kernels like SVR2, the BSDs, Linux, NT, and Solaris by Bach, Kirk McKusick, Bovet and Cesati, Solomon and Russionovich, and Mauro and McDougal respectively.