schedtool
Copyright (C) 2002-2006 Freek
Released under GPL, version 2 (see LICENSE)
Use at your own risk.
Inspired by setbatch (C) 2002 Ingo Molnar
Suggestions are welcome.

CONTENT:
--------

-About
-Usage / description of schedtool
-A complex example
-Static Priority
-Policies reviewed
-Thanks
-Appendix A: An introduction to Multi-Level-Feedback-Queue-scheduling



ABOUT:
------

schedtool was born because there was no tool to change or query
all CPU-scheduling policies under Linux in one handy command.
Support for CPU-affinity has also been added and, most recently,
(re-)nicing of processes.
Thus, schedtool is the definitive interface to Linux's scheduler.

It can be used to avoid skipping in A/V-applications, to lock
processes onto certain CPUs on SMP/NUMA systems, which may be
beneficial for networking or benchmarks, or to adjust the nice-levels
of less important jobs to maintain a high degree of interactive
responsiveness under high load.

All output, even errors, goes to STDOUT to ease piping.

If you don't know about scheduling policies, you probably don't want to
use this program - or you should first read "man sched_setscheduler".

Certain modes (as of this writing: SCHED_IDLEPRIO and SCHED_ISO) need a
patched kernel. See INSTALL for details.



USAGE:
------

There are 3 modes of operation: query, set, and execute a new process.

QUERY PROCESS(ES):

#> schedtool <LIST_OF_PIDs>

This will print all the information it can obtain for the processes with
the given <PIDs>.


SET PROCESS(ES):

I) scheduling policy (detailed discussion in "POLICY OVERVIEW")

#> schedtool -<MODE> <LIST_OF_PIDs>

where <MODE> is one of:

	N or 0:		for SCHED_NORMAL
	F or 1:		for SCHED_FIFO
	R or 2:		for SCHED_RR
	B or 3:		for SCHED_BATCH
	I or 4:		for SCHED_ISO

example:
#> schedtool -B <PIDs>


II) static priority

.. is mandatory for SCHED_FIFO and SCHED_RR.
STATIC_PRIO is a number from 1 to 99; higher values mean higher priority in
that scheduling class (relative to other processes in the same class).

example:
#> schedtool -R -p 20 <PIDs>

III) CPU-affinity

example:
#> schedtool -a 0x3 <PIDs>

IV) nice-level

example:
#> schedtool -n 10 <PIDs>


Of course you can combine policy with affinity and nice in one call.


EXECUTE A NEW PROCESS:

example:
#> schedtool [SCHED_PARAMETERS_LIKE_ABOVE] -e command -arg1 -arg2 file

This will execute "command -arg1 -arg2 file" just as if you had typed
exactly that at the prompt.



CPU-affinity:

To give PIDs or a command a certain CPU-affinity, use the -a switch.
The value is used as a simple bitmask: a bit set to 1 means the
PID may run on that CPU, a bit unset (0) means it MUST NOT.

The following picture uses only 16 bits for the sake of example.
The resulting value is the bitwise OR of the single values for each CPU.
CPU0 (the first CPU in your system) is denoted by the least significant bit
(here, the one on the right side).

  CPU 0-----------,
 CPU 1-----------,|
  ...            ||
mask             VV		means				value  == dec
--------------------------------------------------------------------------
0000 0000 0000 0001	->	run only on CPU0	->	0x1    ==  1
0000 0000 0000 1001	->	run on CPU0 AND CPU3	->	0x9    ==  9
0000 0000 0000 1111	->	run on CPU0-CPU3	->	0xF    == 15
1111 1111 1111 1111	->	run on CPU0-CPU15	->	0xFFFF == 65535

To set back to the default (PID may run on all CPUs), use the mask
0xFFFFFFFF (the kernel will automatically reduce it to the max number of CPUs).

As a short mnemonic rule, each 'F' denotes a set of 4 CPUs
(0xF: all 4 CPUs, 0xFF: all 8 CPUs, and so on ...)


Since version 1.1.1 a new list mode is supported, allowing you to
specify the target CPUs without doing bit-juggling. To separate the
different CPUs, use a ',':

Run on CPU0 and CPU1:
#> schedtool -a 0,1 <PIDs>
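
The relation between the list form and the mask form can be sketched in a few
lines of Python. This is only an illustration of the bit arithmetic described
above, not schedtool's own code; the helper names cpus_to_mask and
mask_to_cpus are made up for this example.

```python
def cpus_to_mask(cpus):
    """Bitwise-OR one bit per CPU number; CPU0 is the least significant bit."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def mask_to_cpus(mask):
    """Return the sorted list of CPU numbers whose bit is set in mask."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


if __name__ == "__main__":
    # The four rows of the picture above: list form -> hex mask == decimal
    for cpus in ([0], [0, 3], [0, 1, 2, 3], list(range(16))):
        mask = cpus_to_mask(cpus)
        print(f"{cpus} -> {mask:#x} == {mask}")
```

For instance, cpus_to_mask([0, 1]) gives 0x3, so "-a 0,1" and "-a 0x3" name
the same set of CPUs.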



A COMPLEX EXAMPLE:
------------------
#> schedtool -R -p 50 -a 0x2 -e mplayer file.avi

Execute mplayer file.avi with
	-SCHED_RR,
	-static priority 50,
	-affinity 0x2 (run only on CPU1).



ABOUT STATIC PRIORITY:
----------------------
Static priority is something completely different from the nice-level. The
nice-level is added to the dynamic priority, and the higher it gets, the more
the process is "punished" ([2]). The static priority, by contrast, is used to
find the next process to run in the current scheduling class; the higher it is,
the more the process is preferred >in general< over others, e.g. when it
becomes ready after a blocking action. It may also preempt another,
lower-prioritized process.

STATIC_PRIO can't be assigned to SCHED_NORMAL or SCHED_BATCH. The
code won't prevent you from trying (a warning is printed - think UNIX), but
you may get an error later at the setting-call.

Versions 1.2.4 and later support a probe mode like sched-utils: when given the
-r parameter, schedtool displays each policy's min and max priority.

#> schedtool -r
N: SCHED_NORMAL  : prio_min 0, prio_max 0
F: SCHED_FIFO    : prio_min 1, prio_max 99
R: SCHED_RR      : prio_min 1, prio_max 99
B: SCHED_BATCH   : prio_min 0, prio_max 0
I: SCHED_ISO     : policy not implemented
D: SCHED_IDLEPRIO: policy not implemented
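
A similar probe can be sketched with Python's standard os module, which wraps
sched_get_priority_min(2) and sched_get_priority_max(2). This is not how
schedtool itself is implemented; the POLICIES table and the probe() function
are made up for this example. It assumes a Linux system, where the kernel's
SCHED_NORMAL is exposed to userspace as SCHED_OTHER. The patch-only policies
SCHED_ISO and SCHED_IDLEPRIO are left out.

```python
import os

# (letter, name as printed by schedtool, name of the constant in the os module)
POLICIES = [
    ("N", "SCHED_NORMAL", "SCHED_OTHER"),
    ("F", "SCHED_FIFO", "SCHED_FIFO"),
    ("R", "SCHED_RR", "SCHED_RR"),
    ("B", "SCHED_BATCH", "SCHED_BATCH"),
]


def probe():
    """Return (letter, name, prio_min, prio_max) for each policy the os module knows."""
    rows = []
    for letter, name, attr in POLICIES:
        policy = getattr(os, attr, None)
        if policy is None:
            continue  # constant not available on this platform
        rows.append((letter, name,
                     os.sched_get_priority_min(policy),
                     os.sched_get_priority_max(policy)))
    return rows


if __name__ == "__main__":
    for letter, name, lo, hi in probe():
        print(f"{letter}: {name:<14}: prio_min {lo}, prio_max {hi}")
```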



POLICY OVERVIEW + WHERE TO USE:
-------------------------------
SCHED_NORMAL
is the standard scheduling policy and good for the average
job with reasonable interaction.


SCHED_FIFO and SCHED_RR
are for real-time constraints.
Don't use them for ordinary jobs: a real-time process won't let
lower-priority processes run until it gets blocked by a system-call like
read() or actively frees itself from the CPU via the system-call
sched_yield(2). SCHED_RR additionally rotates between processes of equal
priority in time-slices, which adds context-switching overhead; SCHED_FIFO
has no time-slice at all.


SCHED_BATCH
is encouraged for long-running, non-interactive
processes. The timeslice is considerably longer (1.5s, I think);
these processes can, however, be interrupted at almost any time by other
ones to guarantee interactivity.
Processes won't get any interactive boosts.

Users are encouraged to set their computing jobs to SCHED_BATCH. Or, as
admin of a compute-server, you could set their shells to SCHED_BATCH
via the login-script.
SCHED_BATCH has been included in mainline kernels since 2.6.16.


SCHED_ISO [patch needed, see INSTALL]
is a new mode, currently only in Con's patches, to mimic the
real-time class for non-root users. To quote Con:

"Any task trying to start as real time that doesn't have authority to do so
will be set to SCHED_ISO. This is a non-expiring scheduler policy designed to
guarantee a timeslice within a reasonable latency while preventing starvation.
Good for gaming, video at the limits of hardware, video capture etc.
It is best set using the schedtool by a normal user trying to start something
as SCHED_RR." [ http://kerneltrap.org/node/view/2159 ]

SCHED_ISO is now somewhat deprecated; on newer kernels SCHED_RR is possible
for normal users, albeit only to a limited extent.


SCHED_IDLEPRIO [patch needed, see INSTALL]
SCHED_IDLEPRIO was formerly called SCHED_BATCH in the -ck patchset; the
-ck SCHED_BATCH has nothing to do with the mainline SCHED_BATCH!
Under this policy the process does not get any interactive boost
(through sleeping etc.) and receives only CPU time that would otherwise
be idle.

For more information, the file SCHED_DESIGN gives a good overview, but
be warned that *some* things may have been outdated by the new O(1)-patches.
Then proceed to the man-page for sched_setscheduler(2) - it gives a very good
overview and is _highly_ recommended.



FINAL WORDS / CONTACT:
----------------------
If you feel you are able to make this software better or you can report
some numbers with the different scheduling policies, please contact me.
Feedback is appreciated.
Please use freshmeat.net's "contact author"-feature to do so.



THANKS:
-------
Thanks fly out to (in no particular order)

o Ingo Molnar
o Con Kolivas for suggesting the -e switch and submitting the patch for SCHED_ISO
o Samuli Kärkkäinen, the quality-verification-engineer
o my girlfriend and supporting friends



- -- - -- -

[2]:
A bit simplified - it's not all that easy :-) Go on to Appendix A for
an example of how scheduling is performed in Solaris.

[3]: (see also [4])
Nice level and dynamic priority are somewhat "strange": sometimes, higher
values mean higher priority (to be put on CPU when process ready); sometimes,
lower values mean higher priority.

At the moment, I can confirm that my system behaves as follows:
- The nice-level is SUBTRACTED from the dynamic priority; thus giving a
process nice -10 means INCREASING its priority (valuewise) by 10 points.
- In the end, higher values mean higher priority.
Use "ps -eO pri,nice" and see for yourself.



APPENDIX A: INTRODUCTION TO MULTI-LEVEL-FEEDBACK-QUEUE-SCHEDULING
-----------------------------------------------------------------
This appendix uses information originating from the "System
Programming I"-course at my university; the examples use Solaris
2.X, but I think Linux does it in a >similar< (albeit not that
overcomplicated) way.

Solaris has 60 wait queues for the class TS (TimeSharing); there are
other classes, such as System and RealTime, as well.
The TS queues look like this (the table is basically a set of rules):

Level    ts_quantum    ts_tqexp     ts_maxwait    ts_lwait   ts_slpret
0        200           0            0             50         50
.        .             .            .             .          .
.        .             .            .             .          .
.        .             .            .             .          .
44       40            34           0             55         55
45       40            35           0             56         56
.        .             .            .             .          .
.        .             .            .             .          .
.        .             .            .             .          .
59       20            49           32000         59         59


You can display all these numbers on a Solaris-box using
# dispadmin -c TS -g

Level:
	just a queue ID.

ts_quantum:
	the maximum timeslice - the maximum time the process
	is allowed to run continuously until it's interrupted and another
	process is physically put on the CPU.

ts_tqexp:
	if the process uses its timeslice entirely, it's put into that
	queue [cf. Level].

ts_maxwait:
	maximum time, in seconds, that a process may wait in that queue
	without being run.

ts_lwait:
	if a process waits longer than ts_maxwait in the current queue,
	it's put into that queue.

ts_slpret:
	queue to put the process in after it was blocked, e.g. in a
	syscall.


Let's start an imaginary process and look at what happens:
Start ->
-> queue 59, ts_quantum  20ms -> queue 49, ts_quantum  40ms
-> queue 39, ts_quantum  80ms -> queue 29, ts_quantum 120ms
-> queue 19, ts_quantum 160ms -> queue  9, ts_quantum 200ms

You see how this purely number-crunching process is put into queues that
allow it to use the CPU for longer and longer stretches.


Now let the process perform a blocking action:

queue  0, ts_quantum 200ms, after e.g. 100ms blocking call! -->
queue 50, ts_quantum  40ms, after e.g.  20ms blocking call! -->
queue 58, ts_quantum  40ms

Now you see how this process is "punished" or, from another point of
view, how the scheduler assumes that this is an interactive process which
computes a bit and then outputs data; so it's put into a queue whose
timeslice roughly matches the computing time until this output occurs.
So the scheduler knows quite a lot about the current state of the
machine and can plan accordingly.
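
The two walks above can be replayed with a small Python sketch. It is not
taken from Solaris or from schedtool; the names TS_TABLE, next_level and walk
are made up for this example. The table holds only the levels and values that
appear in this appendix; everything the text does not state is left as None
rather than guessed. The rule applied is the one described for ts_tqexp and
ts_slpret: a process that uses up its quantum moves to ts_tqexp, one that
blocks earlier moves to ts_slpret.

```python
# level: (ts_quantum in ms, ts_tqexp, ts_slpret); None = not given in the text
TS_TABLE = {
    59: (20, 49, 59),
    58: (40, None, None),
    50: (40, None, 58),
    49: (40, 39, None),
    39: (80, 29, None),
    29: (120, 19, None),
    19: (160, 9, None),
    9: (200, None, None),
    0: (200, 0, 50),
}


def next_level(level, ran_ms):
    """Queue the process lands in after running ran_ms at the given level."""
    quantum, tqexp, slpret = TS_TABLE[level]
    # Quantum used up -> ts_tqexp; blocked before that -> ts_slpret.
    target = tqexp if ran_ms >= quantum else slpret
    if target is None:
        raise KeyError(f"no rule known for level {level}")
    return target


def walk(start, run_times_ms):
    """Return the list of queues visited, starting at start."""
    levels = [start]
    for ran_ms in run_times_ms:
        levels.append(next_level(levels[-1], ran_ms))
    return levels


if __name__ == "__main__":
    print(walk(59, [20, 40, 80, 120, 160]))  # [59, 49, 39, 29, 19, 9]
    print(walk(0, [100, 20]))                # [0, 50, 58]
```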

The dynamic priority is something like an age - the higher[4] it is, the more
likely you are to get a seat :) (the CPU).
There are 4 rules:
-if a process is not run, it ages - the dynamic priority rises.
-if a process is running, the dynamic priority is lowered.
-the process with the (at the moment) highest priority is put onto the CPU.
-processes with lower priority are/can be interrupted by processes with
higher priority.

This guarantees that no process runs for too long while others
wait for too long.
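
The four rules can be tried out with a toy simulation in Python. This is a
deliberately simplified sketch, not what Solaris or Linux really do: the
function name simulate and the step sizes age=1 and penalty=2 are arbitrary
assumptions. The fourth rule (preemption) shows up only implicitly, because
the choice is made anew at every tick.

```python
def simulate(priorities, ticks, age=1, penalty=2):
    """Run `ticks` scheduling decisions; return the name chosen at each tick."""
    prio = dict(priorities)
    schedule = []
    for _ in range(ticks):
        # Highest dynamic priority gets the CPU (ties: first in dict order).
        running = max(prio, key=prio.get)
        schedule.append(running)
        for name in prio:
            if name == running:
                prio[name] -= penalty  # running lowers the dynamic priority
            else:
                prio[name] += age      # waiting processes age
    return schedule


if __name__ == "__main__":
    print(simulate({"hog": 10, "low": 0}, 6))
```

Even with a head start of 10 points, "hog" loses the CPU to "low" at the fifth
tick, which is the no-starvation property described above.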

	-End Of Documentation

- -- -

[4]:
higher (priority) in the sense of more important to run in the near future;
higher does not automatically mean a higher value in its PCB[5].

[5]:
Process Control Block, a structure where important accounting and
other useful information is stored, usually only used by the kernel.