Performance Tuning Ultrix 4.1
Corey Satten
corey at milton.u.washington.edu
Wed May 1 02:03:31 AEST 1991
(This is cross-posted to unix-wizards because it may also apply to 4.3BSD)
Performance Tuning a DEC Ultrix 4.1 Workstation
Round 2
Corey Satten, corey at cac.washington.edu
Networks and Distributed Computing
University of Washington
Seattle, Washington
April 1991
This is a follow-up to work first posted in September 1990.
History:
Our department is using a rather maximally configured DECstation as
a time-sharing host. It is a DEC 5000 running Ultrix 4.1 and has six
disks, mostly 660 meg or 1 gigabyte. It serves /usr/local/bin via NFS
to about a dozen workstations; talks to several printers; is the
departmental electronic mail machine; hosts some campus wide mailing
lists; is our anonymous FTP server; is one of two campus default domain
nameservers; and is also the time-sharing host for about 20 X-terminals plus
a dozen or more other users connected via telnet. We are supporting
about 250 megs of swap space on roughly 43 megabytes of the 56 megabyte
physical memory. A 'ps aux' listing usually has 400+ processes in it.
In my September posting, I described how tweaking some global
variables in the kernel allowed us to improve performance by paging more
and swapping less and maintaining a larger chunk of free memory. Several
people on campus and in netland followed our lead and reported similar
improvements. (Global kernel variables can be conveniently tweaked on
a running system with the kmem program included with this posting.)
Our system, thus tweaked, spoiled us by the times it was fast and
frustrated us by the times it wasn't. Occasionally we had reports of
very large (20+ second) character echo delays experienced by one user
while others in the same environment running the same programs saw no
delays. Furthermore, large programs tended to lose too many pages to
run satisfactorily. These continuing problems, plus my gut feeling that
56 megabytes should really be enough memory for what we're doing,
motivated me to continue investigating.
Current Work:
Close examination of our system revealed that we had some processes
which had not run in a very long time (perhaps days) but which still
had a significant RSS. Scrutiny of the source combined with some
careful experiments led me to discover that data pages never page out
under normal circumstances even though all the complicated code to do
so is there. Judging from the code, this was a conscious decision made
in BSD Unix. Modern Unix systems running X windows tend to have more
idle processes than ever before and those processes tend to have larger
data spaces than their non-windowing ancestors. Thus, not paging data
pages causes memory to be tied up with junk which only swapping can
remove. On our system, I estimated perhaps 20 megabytes fell into
this category.
So why don't these data pages eventually swap out? When free memory
becomes less than "desfree", the kernel looks for processes which have
been sleeping longer than "maxslp" (usually 20 seconds) to swap out.
To my horror, I discovered that it always starts looking at the beginning
of the process table and stops when it has satisfied the need for free
memory. A quick modification to "ps" to print the processes in
process-table order and also print the number of times each had swapped
confirmed this. No process in the last half of our process table had
ever swapped and those at the front had swapped a lot. The extreme
prejudice which is directed against processes living at the front of
the table could well explain the horrible performance some users
occasionally reported that the rest of us didn't see.
Completely rectifying both the swapping and paging problems described
above requires kernel changes. I believe that fixing the data paging
problem has the bigger effect, however, and later I will describe a
partial workaround for it which may work for those of you who can't
change your kernels. (You might also try asking your DEC rep to supply
these changes in binary form.)
To fix the swapping problem, I modified the FORALLPROC macro used
by vm_sched.c to begin swapping where it last left off so, in the long
run, process table position has no predictable effect on swapping and
long time sleepers will eventually swap out. To achieve this, I also
needed to make other small changes to vm_sched.c.
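The difference between the two scan orders can be seen in a small
user-space simulation. This is only a sketch under stated assumptions,
not the kernel code: the table size NPROC, the function names and the
premise that every slot holds an eligible sleeper are all hypothetical,
chosen purely for illustration.

```c
#include <assert.h>
#include <string.h>

#define NPROC 8                 /* hypothetical process-table size */

static int swapouts[NPROC];     /* per-slot swap-out counts */
static int cursor;              /* where the resumable scan stopped */

/* Stock behaviour: every pass starts at slot 0 and stops once `need`
 * victims have been found, so the same front slots are always picked. */
void scan_from_start(int need)
{
    int i;

    assert(need >= 0 && need <= NPROC);
    for (i = 0; i < NPROC && need > 0; i++, need--)
        swapouts[i]++;
}

/* Modified behaviour: the scan resumes where the previous pass left
 * off and wraps around the end of the table. */
void scan_resumable(int need)
{
    assert(need >= 0 && need <= NPROC);
    for (; need > 0; need--) {
        swapouts[cursor]++;
        cursor = (cursor + 1) % NPROC;
    }
}

/* Clear the counters and the saved position between experiments. */
void reset_counts(void)
{
    memset(swapouts, 0, sizeof swapouts);
    cursor = 0;
}
```

Calling scan_from_start(2) four times charges all eight swapouts to
slots 0 and 1, while four calls to scan_resumable(2) charge exactly one
swapout to each of the eight slots.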
To fix the data paging problem I simply changed the two places which
prevented data paging in vm_page.c. In both vm_sched.c and vm_page.c,
I inserted global variables which I can set and test at runtime to
experiment. Since data pages can be more expensive to page out than
text pages, BSD and Ultrix place two limits on data pageouts:
1) only maxpgio/4 data pages per second will be paged out;
2) data pages are paged out only if the process they belong to has
an RSS >= (saferss - sleeptime), where sleeptime is the number of
seconds since the process last ran.
Our system seems to be doing fine with these defaults.
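The two limits can be written as a single predicate. The sketch below
follows the comparisons visible in the vm_page.c diff, but the function
name and the way the values are passed in are hypothetical; in the
kernel they are globals and fields of the proc structure.

```c
#include <assert.h>

/* Returns 1 if a dirty data page may be pushed out, 0 if it must stay.
 *   rssize  - resident set size of the owning process, in pages
 *   slptime - seconds since the owning process last ran
 *   pushes  - data pages already pushed during the current second
 */
int data_page_may_push(long rssize, int slptime, int pushes,
                       int saferss, int maxpgio)
{
    assert(rssize >= 0 && slptime >= 0 && pushes >= 0);

    /* Limit 1: do not saturate the paging device. */
    if (pushes > maxpgio / 4)
        return 0;

    /* Limit 2: leave a small resident set alone; the protected size
     * shrinks by one page for every second the process has slept. */
    if (rssize < (long)saferss - slptime)
        return 0;

    return 1;
}
```

With saferss at 6, a process holding 4 pages is protected while it is
running, but after 3 seconds of sleep the threshold has dropped to 3
pages and its data pages become candidates.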
If you can't change your kernel but you still want to try to get
data pages to page out you may be able to use the following trick to
force all processes to exceed their soft memory limit. Rename /etc/init
to /etc/init.orig and replace /etc/init with the following trivial
program:
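#include <sys/time.h>
#include <sys/resource.h>

main(argc, argv, envp)
    int argc;
    char *argv[], *envp[];
{
    struct rlimit rlp;

    /* lower only the soft RSS limit; the hard limit is left alone */
    getrlimit(RLIMIT_RSS, &rlp);
    rlp.rlim_cur = 1;    /* zero may be better if it works */
    setrlimit(RLIMIT_RSS, &rlp);

    /* the limit is inherited by the real init and all its children */
    execve("/etc/init.orig", argv, envp);
    return 1;            /* reached only if the exec failed */
}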
This will effectively nice them all (which shouldn't matter since it
happens to them all uniformly) and allow their data pages to be paged
out as long as the resident set size is greater than the value of
"saferss" (6 pages) minus the idle time of the process. Unfortunately,
DEC changed p_rssize from signed long to unsigned long so as soon as a
process has been idle longer than saferss seconds the data pages stop
paging out again. The workaround for those who can't re-compile is to
set saferss to 127 which will guarantee the subtraction never goes
negative and idle processes will eventually lose data pages (although
not as quickly as with a fixed kernel and a smaller value of saferss).
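The effect of the unsigned field can be reproduced outside the kernel.
In the sketch below the function names are hypothetical; the two
comparisons correspond to the stock test and to the test with the
(long) cast added in the vm_page.c diff. The upper bound of 127 on the
sleep time is an assumption taken from the saferss = 127 workaround.

```c
#include <assert.h>

/* Stock comparison with an unsigned RSS field: the right-hand side is
 * converted to unsigned long, so a negative difference wraps to a huge
 * value and the test is true for every realistic RSS. */
int keep_page_unsigned(unsigned long rssize, int saferss, int slptime)
{
    assert(slptime >= 0 && slptime <= 127);
    return rssize < (unsigned long)(saferss - slptime);
}

/* Comparison with the (long) cast: a negative right-hand side stays
 * negative, no RSS is below it, and the page is free to go. */
int keep_page_signed(unsigned long rssize, int saferss, int slptime)
{
    assert(slptime >= 0 && slptime <= 127);
    return (long)rssize < (long)(saferss - slptime);
}
```

For a process with 50 resident pages that has slept 20 seconds and
saferss at 6, keep_page_unsigned() returns 1 (the page is held) while
keep_page_signed() returns 0. With saferss raised to 127 the unsigned
form no longer wraps, and it returns 0 once the process has slept long
enough for 127 minus the sleep time to fall to 50 or below.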
For those of you with source to Ultrix 4.1, context diffs of my
changes are appended below. I have manually deleted my monitoring
changes from the diffs to make the functional changes clearer and the
diffs about 90% smaller. In addition, there are several things to note.
First, I applied the FORALLPROC change only to vm_sched.c by naming
the changed file procNDC.h and changing the #include in vm_sched.c
accordingly. Second, the p++ which is uniformly changed to ++p is a
vestige of earlier changes which have been removed. The only lingering
effect is to enlarge the context of this context-diff to include the
entire macro (which is why I left it). Third, even though the new
FORALLPROC code is complicated, it maintains the desirable timing
characteristics of the original since the new stuff only executes twice
for each invocation of the FORALLPROC macro.
One final note for kernel hacking purists: in vm_sched.c, where
processes are swapped when RSS reaches zero, what I really want is to
have them swap when RSS is zero AND the number of swapouts of the process
is less than 2 AND the number of pageins is greater than zero. I didn't
trust myself to implement that change, since the per-process counts of
swapouts and pageins are in the u.u_ru structure and I was unsure how
to access them properly. As a result, our nfsd and biod processes
aimlessly swap in and out, fortunately at negligible cost.
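The stricter rule described above was not implemented, so the following
is only a sketch of the intended condition. The function name and its
parameters are hypothetical; in the kernel the two counters would have
to be read from the u.u_ru structure.

```c
#include <assert.h>

/* Returns 1 if an idle process should be swapped out under the
 * stricter rule.
 *   rssize  - resident set size, in pages
 *   nswap   - number of times the process has been swapped out so far
 *   pageins - number of pages the process has paged in so far
 */
int should_swap_idle(long rssize, long nswap, long pageins)
{
    assert(rssize >= 0 && nswap >= 0 && pageins >= 0);
    return rssize == 0 && nswap < 2 && pageins > 0;
}
```

Under this rule a process whose RSS is zero but which has never paged
anything in, or which has already been swapped out twice, would be left
alone instead of being swapped again.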
We are now running with the following kernel parameters poked into
/dev/kmem by my kmem program (appended somewhere below).
    lotsfree  128 -> 768   /* begin scanning for pages with 3meg free */
    desfree    64 -> 512   /* begin swapping sleeping procs at 2meg */
    coreyf1    xx -> 40    /* ... if sleeping longer than 40 seconds */
    minfree    28 -> 128   /* consider swapping running procs at .5meg */
The elevated scan rate we needed when data pages didn't page seems to be
unnecessary now. Slow scanning on our system encounters plenty of stale
pages to keep the free list replenished.
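The kernel variables count pages while the comments in the table give
sizes in megabytes. The conversion can be checked with a few lines of
C. The 4096-byte page size is an assumption here, although it is the
size consistent with 768 pages being 3 meg; the function name is
hypothetical.

```c
#include <assert.h>

#define PAGE_BYTES 4096L    /* assumed page size, see the text above */

/* Convert a threshold expressed in pages to kilobytes. */
long pages_to_kbytes(long pages)
{
    assert(pages >= 0);
    return pages * PAGE_BYTES / 1024;
}
```

Under that assumption the new lotsfree, desfree and minfree values
come to 3072, 2048 and 512 kilobytes, against 512, 256 and 112
kilobytes for the stock values. Going by the usage comment in kmem.c,
the four settings could presumably be applied in one invocation of the
form: kmem -w lotsfree=768 desfree=512 coreyf1=40 minfree=128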
Results:
1) Our system now almost never swaps jobs because of memory shortfall.
2) The active real memory (displayed by vmstat -v) is about double
what it was before -- about 25 megabytes -- with spikes as high
as 37 megabytes. We never saw such high numbers before.
3) RSS of idle processes eventually reaches zero and these processes
are then gently swapped out.
4) Paging activity (scan rate, etc.) is much lower than before -- often
the system goes for minutes with a scan rate of zero.
5) Interactive response is good on FrameMaker and other medium size
programs which were formerly frustratingly slow.
6) Large programs (such as cc -O on perl version 3's eval.c where the
optimizer grows to 20 megabytes) can run effectively -- cpu idle
time drops to zero for the entire 2-minute optimizer run and
interactive response in other applications remains good.
7) The load average on the machine is somewhat lower and the spikes
are considerably lower.
8) We are considering increasing our file-system buffer-cache from
the default 15% since most of our idle time now seems to be
file-system related rather than virtual memory related.
--------
Corey Satten, corey at cac.washington.edu
Networks and Distributed Computing
University of Washington
(206)543-5611
The kmem program and context diffs follow:
*** /usr/src/Ultrix-4.1-RISC/sys/h/proc.h Fri Jul 6 07:18:00 1990
--- /usr/src/Ultrix-4.1-RISC/sys/h/procNDC.h Fri Apr 5 21:13:18 1991
***************
*** 398,434 ****
* out of the For loop, and not one of the inner While loops
*/
! #define NEXTPROC { pp++; goto _a ; }
#define FORALLPROC(X) { \
! register unsigned long *_bp; \
! register struct proc *pp = proc; \
register unsigned long _mask; \
\
/* \
* for the whole index into the table \
*/ \
! for ( _bp = proc_bitmap; \
! _bp < &proc_bitmap[max_proc_index] ; _bp++ ) { \
/* \
* If any bits in this longword are used, \
* find the associated structures \
*/ \
! if (_mask = *_bp) { \
_a: if (_mask) { \
_b: if ((_mask&1) == 0) { \
_mask = _mask >> 1; \
! pp++; \
goto _b; \
} \
_mask = _mask >> 1; \
{ X } \
! pp++; \
goto _a; \
} else { \
if (_mask = ((pp-proc)%32)) \
! pp += 32 - _mask; \
} \
} else pp += 32; \
} \
}
--- 398,452 ----
* out of the For loop, and not one of the inner While loops
*/
! #define NEXTPROC { ++pp; goto _a ; }
+ #define INC_bp(X,Y) (X < &proc_bitmap[max_proc_index-1] ? \
+ ++X : (Y=proc, X=proc_bitmap))
+
#define FORALLPROC(X) { \
! static unsigned long *_bp = proc_bitmap; \
! register struct proc *pp = proc + 32*(_bp-proc_bitmap); \
! static struct proc *_opp = 0; \
register unsigned long _mask; \
+ register unsigned long *_bpe = _bp; \
+ register int _more; \
+ unsigned long _maskmask; \
\
/* \
* for the whole index into the table \
*/ \
! for (_more = 2; ; INC_bp(_bp,pp)) { \
! if (_bp == _bpe) \
! if (--_more) { \
! int i = _opp-pp+1; \
! _maskmask = ~0; \
! if (i<32 && i>=0) \
! _maskmask <<= i; \
! else _maskmask = 0; \
! _mask = *_bp & _maskmask; \
! } else _mask = *_bp & ~_maskmask; \
! else _mask = *_bp; \
/* \
* If any bits in this longword are used, \
* find the associated structures \
*/ \
! if (_mask) { \
_a: if (_mask) { \
_b: if ((_mask&1) == 0) { \
_mask = _mask >> 1; \
! ++pp; \
goto _b; \
} \
_mask = _mask >> 1; \
+ _opp = pp; \
{ X } \
! ++pp; \
goto _a; \
} else { \
if (_mask = ((pp-proc)%32)) \
! pp += 32-_mask; \
} \
} else pp += 32; \
+ if (!_more) break; \
} \
}
*** /usr/src/Ultrix-4.1-RISC/sys/vm/vm_page.c_orig Tue Jul 17 12:30:21 1990
--- /usr/src/Ultrix-4.1-RISC/sys/vm/vm_page.c Tue Apr 23 12:58:18 1991
***************
*** 274,279 ****
--- 274,283 ----
int nohash = 0; /* turn on/off hashing */
int nobufcache = 1; /* turn on/off buf cache for data */
+ /* symbols added for performance prodding at request of corey at cac */
+ int coreyp1 = 0; /* data does page out (stock is coreyp1 = 1) */
+ int coreyp2 = 10; /* data pageouts per second (was maxpgio/4) */
+
extern int swapfrag;
/*
* Handle a page fault.
***************
*** 1318,1324 ****
(void) splx(s);
return(0);
}
! if ((rp->p_vm & (SSEQL|SUANOM)) == 0 &&
rp->p_rssize <= rp->p_maxrss) {
smp_unlock(seg_lock);
smp_unlock(&lk_cmap);
--- 1322,1328 ----
(void) splx(s);
return(0);
}
! if ((rp->p_vm & (SSEQL|SUANOM)) == 0 && coreyp1 &&
rp->p_rssize <= rp->p_maxrss) {
smp_unlock(seg_lock);
smp_unlock(&lk_cmap);
***************
*** 1332,1338 ****
* Guarantee a minimal investment in data
* space for jobs in balance set.
*/
! if (rp->p_rssize < saferss - rp->p_slptime) {
smp_unlock(&lk_p_vm);
smp_unlock(&lk_cmap);
(void) splx(s);
--- 1336,1342 ----
* Guarantee a minimal investment in data
* space for jobs in balance set.
*/
! if ((long)rp->p_rssize < saferss - rp->p_slptime) {
smp_unlock(&lk_p_vm);
smp_unlock(&lk_cmap);
(void) splx(s);
***************
*** 1371,1377 ****
* Limit pushes to avoid saturating
* pageout device.
*/
! (pushes > maxpgio / 4)) {
if (seg_lock != &lk_p_vm)
smp_unlock(&lk_p_vm);
smp_unlock(seg_lock);
--- 1375,1381 ----
* Limit pushes to avoid saturating
* pageout device.
*/
! (pushes > coreyp2 /* was maxpgio / 4 */)) {
if (seg_lock != &lk_p_vm)
smp_unlock(&lk_p_vm);
smp_unlock(seg_lock);
*** /usr/src/Ultrix-4.1-RISC/sys/vm/vm_sched.c_orig Fri Jul 6 06:41:49 1990
--- /usr/src/Ultrix-4.1-RISC/sys/vm/vm_sched.c Thu Apr 25 10:12:23 1991
***************
*** 103,109 ****
#include "../h/seg.h"
#include "../h/dir.h"
#include "../h/user.h"
! #include "../h/proc.h"
#include "../h/text.h"
#include "../h/vm.h"
#include "../h/cmap.h"
--- 103,109 ----
#include "../h/seg.h"
#include "../h/dir.h"
#include "../h/user.h"
! #include "../h/procNDC.h"
#include "../h/text.h"
#include "../h/vm.h"
#include "../h/cmap.h"
***************
*** 143,148 ****
--- 143,154 ----
int minfree = 0;
int desfree = 0;
int lotsfree= 0;
+
+ /* symbols added for performance prodding at request of corey at cac */
+ int swload = TO_FIX(2);
+ int coreyf0 = 1; /* goto loop after swapout */
+ int coreyf1 = 20; /* softswap enabled after this many seconds */
+ int coreyf4 = 1; /* don't swap when RSS=0 */
#endif mips
#ifdef vax
***************
*** 291,297 ****
(avenrun[0] >= 2 && imax(avefree, avefree30) < desfree &&
#endif vax
#ifdef mips
! (avenrun[0] >= TO_FIX(2) && imax(avefree, avefree30) < desfree &&
#endif mips
(rate.v_pgin + rate.v_pgout > maxpgio || avefree < minfree))) {
desperate = 1;
--- 297,307 ----
(avenrun[0] >= 2 && imax(avefree, avefree30) < desfree &&
#endif vax
#ifdef mips
! /*
! * symbol "swload" added for performance prodding at
! * request of corey at cac 26 Feb 91
! */
! (avenrun[0] >= swload && imax(avefree, avefree30) < desfree &&
#endif mips
(rate.v_pgin + rate.v_pgout > maxpgio || avefree < minfree))) {
desperate = 1;
***************
*** 340,353 ****
case SSLEEP:
case SSTOP:
! if ((freemem < desfree || pp->p_rssize == 0) &&
! pp->p_slptime > maxslp &&
(!pp->p_textp || (pp->p_textp->x_flag&(XLOCK|XNOSW))==0) &&
swappable(pp)) {
/*
* Kick out deadwood.
*/
pp->p_sched &= ~SLOAD;
smp_unlock(&lk_rq);
--- 350,366 ----
case SSLEEP:
case SSTOP:
! if (coreyf1 &&
! (freemem < desfree || (pp->p_rssize == 0 && coreyf4)) &&
! pp->p_slptime > coreyf1 /* was maxslp */ &&
(!pp->p_textp || (pp->p_textp->x_flag&(XLOCK|XNOSW))==0) &&
swappable(pp)) {
+ int breakout;
/*
* Kick out deadwood.
*/
+ breakout = pp->p_rssize ? 1 : 1-coreyf4;
pp->p_sched &= ~SLOAD;
smp_unlock(&lk_rq);
***************
*** 357,363 ****
goto loop;
#endif vax
#ifdef mips
! NEXTPROC;
#endif mips
}
smp_unlock(&lk_rq);
--- 370,381 ----
goto loop;
#endif vax
#ifdef mips
! if (coreyf0 && breakout) {
! goto loop;
! }
! else {
! NEXTPROC;
! }
#endif mips
}
smp_unlock(&lk_rq);
***************
*** 540,545 ****
--- 558,564 ----
if (sleeper < pp->p_slptime) {
p = pp;
sleeper = pp->p_slptime;
+ if (sleeper == 127) return(p); /* Corey */
}
} else if (!sleeper && (pp->p_stat==SRUN||pp->p_stat==SSLEEP)) {
rppri = pp->p_rssize;
: ----- cut here ----- cut here ----- cut here ----- cut here -----
: This is a "shell archive". Save everything after the cut mark
: in a file called thisstuff, then feed it to sh by typing sh thisstuff.
: SHAR archive format. Archive created Thu Apr 25 10:26:03 PDT 1991
echo x - kmem.c
echo '-rw-r--r-- 1 corey 3925 Apr 25 10:24 kmem.c (as sent)'
sed 's/^-//' >kmem.c <<'+FUNKY+STUFF+'
-/*
- * a tool to use in place of adb (on systems without adb) which lets you
- * peek and poke at the values of kernel variables in /dev/kmem
- *
- * usage: kmem [-s#] var1 var2 ... varN
- * or
- * usage: kmem -w var1=val1 var2=val2 ... varN=valN
- *
- * If -s# is given, loop every # seconds and repeat. This is handy for
- * watching variables like freemem or debugging flags. The following simple
- * awk script can postprocess the output filtering out values which don't
- * change:
- * { if (NF > 2) date = $0
- * else if (seen[$1] != $2) {
- * seen[$1] = $2
- * if (date != "") {
- * print ""; print date
- * date = ""
- * }
- * print
- * }
- * }
- *
- * Corey Satten, corey at cac.washington.edu, 9/6/90 - Ultrix 4.0 version
- */
-#include<stdio.h>
-#include<nlist.h>
-#include<sys/file.h>
-#include <time.h>
-
-struct nlist *nl; /* how we find locations of names */
-int *nv; /* the new values for each name */
-int w_flag = 0; /* write new values? */
-char *file = "/vmunix"; /* default file to read symbols from */
-int kmem;
-
-main(argc, argv)
- int argc;
- char *argv[];
-{
- int f; /* walks argv upto index of first non-flag */
- int i; /* walks through remaining arguments */
- int value = 0;
- int rc = 0;
- int sleeptime = 0; /* if set nonzero with -s#, repeat every # secs */
-
- /*
- * flag parsing
- */
- for (f=1; f<argc && *(argv[f]) == '-'; ++f) {
- switch(argv[f][1]) {
- default:
- fprintf(stderr, "%s: unknown flag -%c\n", argv[0], argv[f][1]);
- exit(1);
- case 'w':
- w_flag = 1;
- break;
- case 's':
- sscanf(argv[f][2] ? argv[f]+2 : argv[++f], "%d", &sleeptime);
- break;
- case 'f':
- file = argv[++f];
- break;
- }
- }
-
- /*
- * handle the remaining arguments as either symname or symname=value
- * depending on whether -w (w_flag) was specified.
- */
-
- nl = (struct nlist *) malloc( sizeof(*nl) * (argc-f+1) );
- nv = (int *) malloc( sizeof(int) * (argc-f+1) );
- if (!nv || !nl) {perror("malloc"); exit(1);};
-
- for (i=0; i<argc-f; ++i) {
- char *name = (char *)malloc(strlen(argv[i+f])+1);
-
- if (!name) {perror("malloc"); exit(1);};
- rc = sscanf(argv[i+f], "%[^=]=%d", name, &value);
- if (rc - w_flag != 1) {
- fprintf(stderr, "%s: bad argument: %s\n", argv[0], argv[i+f]);
- exit(1);
- }
- nl[i].n_name = name;
- nv[i] = value;
- }
- nl[i].n_name = "";
-
- /*
- * now figure out where to read/write in /dev/kmem and do it
- */
-
- nlist(file, nl);
-
- kmem = open("/dev/kmem", w_flag ? O_RDWR : O_RDONLY);
- if (kmem < 0) {
- perror("/dev/kmem open");
- exit(1);
- }
-
- sleeploop:
-
- if (sleeptime) {
- putchar('\n'); date();
- }
-
- for (i=0; i<argc-f; ++i) {
- long seekto = (long)nl[i].n_value;
-
- if (nl[i].n_type == 0) {
- fprintf(stderr, "%s: symbol `%s' not found in namelist of %s\n",
- argv[0], nl[i].n_name, file);
- /*
- * We promise to do all writes in command line order, so if one
- * is going to fail, we'd best bail out rather than continue.
- */
- if (w_flag) exit(2);
- else continue;
- }
- if ( lseek(kmem, seekto, 0) != seekto ) {
- perror("/dev/kmem lseek"); exit(2);
- }
- if ( read(kmem, &value, sizeof(int)) != sizeof(int) ) {
- perror("/dev/kmem read"); exit(2);
- }
-
- printf("%s(0x%x)\t%d", nl[i].n_name, nl[i].n_value, value);
-
- if (w_flag) {
- if ( lseek(kmem, seekto, 0) != seekto ) {
- perror("/dev/kmem lseek"); exit(2);
- }
- value = nv[i];
- printf(" -> %d", value);
- if ( write(kmem, &value, sizeof(int)) != sizeof(int) ) {
- perror("/dev/kmem write"); exit(2);
- }
- }
- putchar('\n');
- }
-
- if (sleeptime) {
- fflush(stdout);
- sleep(sleeptime);
- goto sleeploop;
- }
-}
-/*
- * print ascii version of current date and time
- */
-date() {
- char *at;
- static char db[30];
- int i;
-
- time(&i);
- at = asctime(localtime(&i));
- strcpy(db, at+4);
- db[20] = 0;
- puts(db);
- }
+FUNKY+STUFF+
chmod u=rw,g=r,o=r kmem.c
ls -l kmem.c
exit 0