Performance Tuning Ultrix 4.1

Corey Satten corey at milton.u.washington.edu
Wed May 1 02:03:31 AEST 1991


(This is cross-posted to unix-wizards because it may also apply to 4.3BSD)


	    Performance Tuning a DEC Ultrix 4.1 Workstation

			       Round 2

		Corey Satten, corey at cac.washington.edu
		  Networks and Distributed Computing
		       University of Washington
			  Seattle, Washington
			      April 1991



This is a follow-up to work first posted in September 1990.

History:

      Our department is using a rather maximally configured DECstation as
  a time-sharing host.  It is a DEC 5000 running Ultrix 4.1 and has six
  disks, mostly 660 meg or 1 gigabyte.  It serves /usr/local/bin via NFS
  to about a dozen workstations; talks to several printers; is the
  departmental electronic mail machine; hosts some campus wide mailing
  lists; is our anonymous FTP server; is one of two campus default domain
  nameservers; and is also the time-sharing host for about 20 X-terminals
  plus a dozen or more other users connected via telnet.  We are supporting
  about 250 megs of swap space on roughly 43 megabytes of the 56 megabyte
  physical memory.  A 'ps aux' listing usually has 400+ processes in it.

      In my September posting, I described how tweaking some global
  variables in the kernel allowed us to improve performance by paging more
  and swapping less and maintaining a larger chunk of free memory.  Several
  people on campus and in netland followed our lead and reported similar
  improvements.  (Global kernel variables can be conveniently tweaked on
  a running system with the kmem program included with this posting.)

      Our system, thus tweaked, spoiled us by the times it was fast and
  frustrated us by the times it wasn't.  Occasionally we had reports of
  very large (20+ second) character echo delays experienced by one user
  while others in the same environment running the same programs saw no
  delays.  Furthermore, large programs tended to lose too many pages to
  run satisfactorily.  These continuing problems, plus my gut feeling that
  56 megabytes should really be enough memory for what we're doing,
  motivated me to continue investigating.

Current Work:

      Close examination of our system revealed that we had some processes
  which had not run in a very long time (perhaps days) but which still
  had a significant RSS.  Scrutiny of the source combined with some
  careful experiments led me to discover that data pages never page out
  under normal circumstances even though all the complicated code to do
  so is there.  Judging from the code, this was a conscious decision made
  in BSD Unix.  Modern Unix systems running X windows tend to have more
  idle processes than ever before and those processes tend to have larger
  data spaces than their non-windowing ancestors.  Thus, not paging data
  pages causes memory to be tied up with junk which only swapping can
  remove.  On our system, I estimated perhaps 20 megabytes fell into
  this category.

      So why don't these data pages eventually swap out?  When free memory
  becomes less than "desfree", the kernel looks for processes which have
  been sleeping longer than "maxslp" (usually 20 seconds) to swap out.
  To my horror, I discovered that it always starts looking at the beginning
  of the process table and stops when it has satisfied the need for free
  memory.  A quick modification to "ps" to print the processes in
  process-table order and also print the number of times each had swapped
  confirmed this.  No process in the last half of our process table had
  ever swapped and those at the front had swapped a lot.  The extreme
  prejudice which is directed against processes living at the front of
  the table could well explain the horrible performance some users
  occasionally reported that the rest of us didn't see.

      Completely rectifying both the swapping and paging problems described
  above requires kernel changes.  However, I believe fixing the data paging
  problem has the biggest effect, and later I will describe a partial
  workaround for it which may work for those of you who can't change your
  kernels.  (You might also try asking your DEC rep to supply these changes
  in binary form.)

      To fix the swapping problem, I modified the FORALLPROC macro used
  by vm_sched.c to begin swapping where it last left off so, in the long
  run, process table position has no predictable effect on swapping and
  long-time sleepers will eventually swap out.  To achieve this, I also
  needed to make other small changes to vm_sched.c.

      To fix the data paging problem I simply changed the two places which
  prevented data paging in vm_page.c.  In both vm_sched.c and vm_page.c,
  I inserted global variables which I can set and test at runtime to
  experiment.  Since data pages can be more expensive to page out than
  text pages, BSD and Ultrix have 2 limits on data pageouts.  1) only
  maxpgio/4 data pages per second will page out.  2) data pages are only
  paged out if the process they belong to has an RSS > (saferss - sleeptime)
  where sleeptime is the number of seconds since the process has run.
  Our system seems to be doing fine with these defaults.

      If you can't change your kernel but you still want to try to get
  data pages to page out you may be able to use the following trick to
  force all processes to exceed their soft memory limit.  Rename /etc/init
  to /etc/init.orig and replace /etc/init with the following trivial
  program:

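	#include <sys/time.h>
	#include <sys/resource.h>

	/* Lower the soft RSS limit, then become the real init. */
	main(argc, argv, envp)
	    int argc;
	    char *argv[], *envp[];
	{
	    struct rlimit rlp;

	    getrlimit(RLIMIT_RSS,&rlp);	/* keep the hard limit as it was */
	    rlp.rlim_cur = 1;		/* zero may be better if it works */
	    setrlimit(RLIMIT_RSS,&rlp);	/* inherited by all descendants */
	    execve("/etc/init.orig", argv, envp);
	    return 1;			/* reached only if execve failed */
	}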
  
  This will effectively nice them all (which shouldn't matter since it
  happens to them all uniformly) and allow their data pages to be paged
  out as long as the resident set size is greater than the value of
  "saferss" (6 pages) minus the idle time of the process.  Unfortunately,
  DEC changed p_rssize from signed long to unsigned long, so as soon as a
  process has been idle longer than saferss seconds, the subtraction goes
  negative, the unsigned comparison treats it as a huge value, and the data
  pages stop paging out again.  The workaround for those who can't re-compile
  is to set saferss to 127, which guarantees the subtraction never goes
  negative, so idle processes will eventually lose data pages (although
  not as quickly as with a fixed kernel and a smaller value of saferss).

      For those of you with source to Ultrix 4.1, context diffs of my
  changes are appended below.  I have manually deleted my monitoring
  changes from the diffs to make the functional changes clearer and the
  diffs about 90% smaller.  In addition, there are several things to note.
  First, I applied the FORALLPROC change only to vm_sched.c by naming
  the changed file procNDC.h and changing the #include in vm_sched.c
  accordingly.  Second, the p++ which is uniformly changed to ++p is a
  vestige of earlier changes which have been removed.  The only lingering
  effect is to enlarge the context of this context-diff to include the
  entire macro (which is why I left it).  Third, even though the new
  FORALLPROC code is complicated, it maintains the desirable timing
  characteristics of the original since the new stuff only executes twice
  for each invocation of the FORALLPROC macro.

      One final note for kernel hacking purists: in vm_sched.c, where
  processes are swapped when RSS reaches zero, what I really want is to
  have them swap when RSS is zero AND the number of swapouts of the process
  is less than 2 AND the number of pageins is greater than zero.  I didn't
  trust myself to implement that change, since the per-process counts of
  swapouts and pageins are in the u.u_ru structure and I was unsure how
  to access them properly.  As a result, our nfsd and biod processes
  aimlessly swap in and out, fortunately at negligible cost.

      We are now running with the following kernel parameters poked into
  /dev/kmem by my kmem program (appended somewhere below).

	lotsfree 128 -> 768	/* begin scanning for pages with 3meg free */
	desfree   64 -> 512	/* begin swapping sleeping procs at 2meg */
	coreyf1   xx -> 40	/* ... if sleeping longer than 40 seconds */
	minfree   28 -> 128	/* consider swapping running procs at .5meg */

  The elevated scan rate we needed when data pages didn't page seems to be
  unnecessary now.  Slow scanning on our system encounters plenty of stale
  pages to keep the free list replenished.

Results:

  1) Our system now almost never swaps jobs because of memory shortfall.
  2) The active real memory (displayed by vmstat -v) is about double
     what it was before -- about 25 megabytes -- with spikes as high
     as 37 megabytes.  We never saw such high numbers before.
  3) RSS of idle processes eventually reaches zero and these processes
     are then gently swapped out.
  4) Paging activity (scan rate, etc.) is much lower than before -- often
     the system goes for minutes with a scan rate of zero.
  5) Interactive response is good on FrameMaker and other medium size
     programs which were formerly frustratingly slow.
  6) Large programs (such as cc -O on perl version 3's eval.c where the
     optimizer grows to 20 megabytes) can run effectively -- cpu idle
     time drops to zero for the entire 2-minute optimizer run and
     interactive response in other applications remains good.
  7) The load average on the machine is somewhat lower and the spikes
     are considerably lower.
  8) We are considering increasing our file-system buffer-cache from
     the default 15% since most of our idle time now seems to be
     file-system related rather than virtual memory related.

--------
Corey Satten, corey at cac.washington.edu
Networks and Distributed Computing
University of Washington
(206)543-5611

The context diffs and the kmem program follow:

*** /usr/src/Ultrix-4.1-RISC/sys/h/proc.h	Fri Jul  6 07:18:00 1990
--- /usr/src/Ultrix-4.1-RISC/sys/h/procNDC.h	Fri Apr  5 21:13:18 1991
***************
*** 398,434 ****
   * out of the For loop, and not one of the inner While loops
   */
  
! #define NEXTPROC	{ pp++; goto _a ; }
  
  #define FORALLPROC(X) {						\
! 	register unsigned long *_bp;				\
! 	register struct proc *pp = proc;			\
  	register unsigned long _mask;				\
  								\
  	/*							\
  	 * for the whole index into the table			\
  	 */							\
! 	for ( _bp = proc_bitmap;				\
! 		_bp < &proc_bitmap[max_proc_index] ; _bp++ ) {	\
  		/*						\
  		 * If any bits in this longword are used,	\
  		 * find the associated structures		\
  		 */						\
! 		if (_mask = *_bp) {				\
  			_a: if (_mask) {			\
  				_b: if ((_mask&1) == 0) {	\
  					_mask = _mask >> 1;	\
! 					pp++;			\
  					goto _b;		\
  				}				\
  				_mask = _mask >> 1;		\
  				{ X }				\
! 				pp++;				\
  				goto _a;			\
  			} else {				\
  				if (_mask = ((pp-proc)%32))	\
! 					pp += 32 - _mask;	\
  			}					\
  		} else pp += 32;				\
  	}							\
  }
--- 398,452 ----
   * out of the For loop, and not one of the inner While loops
   */
  
! #define NEXTPROC	{ ++pp; goto _a ; }
  
+ #define INC_bp(X,Y) (X < &proc_bitmap[max_proc_index-1] ? \
+ 			++X : (Y=proc, X=proc_bitmap))
+ 
  #define FORALLPROC(X) {						\
! 	static unsigned long *_bp = proc_bitmap;		\
! 	register struct proc *pp = proc + 32*(_bp-proc_bitmap);	\
! 	static struct proc *_opp = 0;				\
  	register unsigned long _mask;				\
+ 	register unsigned long *_bpe = _bp;			\
+ 	register int _more;					\
+ 	unsigned long _maskmask;				\
  								\
  	/*							\
  	 * for the whole index into the table			\
  	 */							\
! 	for (_more = 2; ; INC_bp(_bp,pp)) {			\
! 		if (_bp == _bpe)				\
! 			if (--_more) {				\
! 				int i = _opp-pp+1;		\
! 				_maskmask = ~0;			\
! 				if (i<32 && i>=0)		\
! 				    _maskmask <<= i;		\
! 				else _maskmask = 0;		\
! 				_mask = *_bp & _maskmask;	\
! 			} else _mask = *_bp & ~_maskmask;	\
! 		else _mask = *_bp;				\
  		/*						\
  		 * If any bits in this longword are used,	\
  		 * find the associated structures		\
  		 */						\
! 		if (_mask) {					\
  			_a: if (_mask) {			\
  				_b: if ((_mask&1) == 0) {	\
  					_mask = _mask >> 1;	\
! 					++pp;			\
  					goto _b;		\
  				}				\
  				_mask = _mask >> 1;		\
+ 				_opp = pp;			\
  				{ X }				\
! 				++pp;				\
  				goto _a;			\
  			} else {				\
  				if (_mask = ((pp-proc)%32))	\
! 					pp += 32-_mask;		\
  			}					\
  		} else pp += 32;				\
+ 	if (!_more) break;					\
  	}							\
  }


*** /usr/src/Ultrix-4.1-RISC/sys/vm/vm_page.c_orig	Tue Jul 17 12:30:21 1990
--- /usr/src/Ultrix-4.1-RISC/sys/vm/vm_page.c	Tue Apr 23 12:58:18 1991
***************
*** 274,279 ****
--- 274,283 ----
  int	nohash = 0;		/* turn on/off hashing */
  int	nobufcache = 1;		/* turn on/off buf cache for data */
  
+ /* symbols added for performance prodding at request of corey at cac */
+ int     coreyp1 = 0;            /* data does page out (stock is coreyp1 = 1) */
+ int     coreyp2 = 10;           /* data pageouts per second (was maxpgio/4) */
+ 
  extern int swapfrag;
  /*
   * Handle a page fault.
***************
*** 1318,1324 ****
  			(void) splx(s);
  			return(0);
  		}
! 		if ((rp->p_vm & (SSEQL|SUANOM)) == 0 &&
  		    rp->p_rssize <= rp->p_maxrss) {
  			smp_unlock(seg_lock);
  			smp_unlock(&lk_cmap);
--- 1322,1328 ----
  			(void) splx(s);
  			return(0);
  		}
! 		if ((rp->p_vm & (SSEQL|SUANOM)) == 0 && coreyp1 &&
  		    rp->p_rssize <= rp->p_maxrss) {
  			smp_unlock(seg_lock);
  			smp_unlock(&lk_cmap);
***************
*** 1332,1338 ****
  		 * Guarantee a minimal investment in data
  		 * space for jobs in balance set.
  		 */
! 		if (rp->p_rssize < saferss - rp->p_slptime) {
  			smp_unlock(&lk_p_vm);
  			smp_unlock(&lk_cmap);
  			(void) splx(s);
--- 1336,1342 ----
  		 * Guarantee a minimal investment in data
  		 * space for jobs in balance set.
  		 */
! 		if ((long)rp->p_rssize < saferss - rp->p_slptime) {
  			smp_unlock(&lk_p_vm);
  			smp_unlock(&lk_cmap);
  			(void) splx(s);
***************
*** 1371,1377 ****
  		 * Limit pushes to avoid saturating
  		 * pageout device.
  		 */
! 		    (pushes > maxpgio / 4)) {
  			if (seg_lock != &lk_p_vm)
  				smp_unlock(&lk_p_vm);
  			smp_unlock(seg_lock);
--- 1375,1381 ----
  		 * Limit pushes to avoid saturating
  		 * pageout device.
  		 */
! 		    (pushes > coreyp2 /* was maxpgio / 4 */)) {
  			if (seg_lock != &lk_p_vm)
  				smp_unlock(&lk_p_vm);
  			smp_unlock(seg_lock);



*** /usr/src/Ultrix-4.1-RISC/sys/vm/vm_sched.c_orig	Fri Jul  6 06:41:49 1990
--- /usr/src/Ultrix-4.1-RISC/sys/vm/vm_sched.c		Thu Apr 25 10:12:23 1991
***************
*** 103,109 ****
  #include "../h/seg.h"
  #include "../h/dir.h"
  #include "../h/user.h"
! #include "../h/proc.h"
  #include "../h/text.h"
  #include "../h/vm.h"
  #include "../h/cmap.h"
--- 103,109 ----
  #include "../h/seg.h"
  #include "../h/dir.h"
  #include "../h/user.h"
! #include "../h/procNDC.h"
  #include "../h/text.h"
  #include "../h/vm.h"
  #include "../h/cmap.h"
***************
*** 143,148 ****
--- 143,154 ----
  int     minfree = 0;
  int     desfree = 0;
  int	lotsfree= 0;
+ 
+ /* symbols added for performance prodding at request of corey at cac */
+ int     swload = TO_FIX(2);
+ int	coreyf0 = 1;		/* goto loop after swapout */
+ int	coreyf1 = 20;		/* softswap enabled after this many seconds */
+ int	coreyf4 = 1;		/* don't swap when RSS=0 */
  #endif mips
  
  #ifdef vax
***************
*** 291,297 ****
  	    (avenrun[0] >= 2 && imax(avefree, avefree30) < desfree &&
  #endif vax
  #ifdef mips
!             (avenrun[0] >= TO_FIX(2) && imax(avefree, avefree30) < desfree &&
  #endif mips
  	    (rate.v_pgin + rate.v_pgout > maxpgio || avefree < minfree))) {
  		desperate = 1;
--- 297,307 ----
  	    (avenrun[0] >= 2 && imax(avefree, avefree30) < desfree &&
  #endif vax
  #ifdef mips
! 	    /* 
! 	     * symbol "swload" added for performance prodding at
! 	     * request of corey at cac 26 Feb 91
! 	     */
!             (avenrun[0] >= swload && imax(avefree, avefree30) < desfree &&
  #endif mips
  	    (rate.v_pgin + rate.v_pgout > maxpgio || avefree < minfree))) {
  		desperate = 1;
***************
*** 340,353 ****
  
  	case SSLEEP:
  	case SSTOP:
! 		if ((freemem < desfree || pp->p_rssize == 0) &&
! 		    pp->p_slptime > maxslp &&
  		   (!pp->p_textp || (pp->p_textp->x_flag&(XLOCK|XNOSW))==0) &&
  		    swappable(pp)) {
  			/*
  			 * Kick out deadwood.
  			 */
  
  			pp->p_sched &= ~SLOAD;
  
  			smp_unlock(&lk_rq);
--- 350,366 ----
  
  	case SSLEEP:
  	case SSTOP:
! 		if (coreyf1 &&
! 		   (freemem < desfree || (pp->p_rssize == 0 && coreyf4)) &&
! 		    pp->p_slptime > coreyf1 /* was maxslp */ &&
  		   (!pp->p_textp || (pp->p_textp->x_flag&(XLOCK|XNOSW))==0) &&
  		    swappable(pp)) {
+ 			int breakout;
  			/*
  			 * Kick out deadwood.
  			 */
  
+ 			breakout = pp->p_rssize ? 1 : 1-coreyf4;
  			pp->p_sched &= ~SLOAD;
  
  			smp_unlock(&lk_rq);
***************
*** 357,363 ****
  			goto loop;
  #endif vax
  #ifdef mips
! 			NEXTPROC;
  #endif mips
  		} 
  	        smp_unlock(&lk_rq);
--- 370,381 ----
  			goto loop;
  #endif vax
  #ifdef mips
! 			if (coreyf0 && breakout) {
! 			    goto loop;
! 			    }
! 			else {
! 			    NEXTPROC;
! 			    }
  #endif mips
  		} 
  	        smp_unlock(&lk_rq);
***************
*** 540,545 ****
--- 558,564 ----
  			if (sleeper < pp->p_slptime) {
  				p = pp;
  				sleeper = pp->p_slptime;
+ 				if (sleeper == 127) return(p);	/* Corey */
  			}
  		} else if (!sleeper && (pp->p_stat==SRUN||pp->p_stat==SSLEEP)) {
  			rppri = pp->p_rssize;



: ----- cut here ----- cut here ----- cut here ----- cut here -----
: This is a "shell archive".  Save everything after the cut mark
: in a file called thisstuff, then feed it to sh by typing sh thisstuff.
: SHAR archive format.  Archive created Thu Apr 25 10:26:03 PDT 1991
echo x - kmem.c
echo '-rw-r--r--  1 corey       3925 Apr 25 10:24 kmem.c    (as sent)'
sed 's/^-//' >kmem.c <<'+FUNKY+STUFF+'
-/*
- * a tool to use in place of adb (on systems without adb) which lets you
- * peek and poke at the values of kernel variables in /dev/kmem
- *
- * usage:	kmem [-s#] var1 var2 ... varN
- *  or
- * usage:	kmem -w var1=val1 var2=val2 ... varN=valN
- *
- * If -s# is given, loop every # seconds and repeat.  This is handy for
- * watching variables like freemem or debugging flags.  The following simple
- * awk script can postprocess the output filtering out values which don't
- * change:
- *		{ 	if (NF > 2) date = $0
- *			else if (seen[$1] != $2) {
- *			    seen[$1] = $2
- *			    if (date != "") {
- *				print ""; print date
- *				date = ""
- *				}
- *			    print
- *			    }
- * 		 }
- *
- * Corey Satten, corey at cac.washington.edu, 9/6/90 - Ultrix 4.0 version
- */
-#include<stdio.h>
-#include<nlist.h>
-#include<sys/file.h>
-#include <time.h>
-
-struct nlist *nl;		/* how we find locations of names */
-int *nv;			/* the new values for each name */
-int w_flag = 0;			/* write new values? */
-char *file = "/vmunix";		/* default file to read symbols from */
-int kmem;
-
-main(argc, argv)
-    int argc;
-    char *argv[];
-{
-    int f;			/* walks argv upto index of first non-flag */
-    int i;			/* walks through remaining arguments */
-    int value = 0;
-    int rc = 0;
-    int sleeptime = 0;		/* if set nonzero with -s#, repeat every # secs */
-
-    /*
-     * flag parsing
-     */
-    for (f=1; f<argc && *(argv[f]) == '-'; ++f) {
-	switch(argv[f][1]) {
-	default:
-	    fprintf(stderr, "%s: unknown flag -%c\n", argv[0], argv[f][1]);
-	    exit(1);
-	case 'w':
-	    w_flag = 1;
-	    break;
-	case 's':
-	    sscanf(argv[f][2] ? argv[f]+2 : argv[++f], "%d", &sleeptime);
-	    break;
-	case 'f':
-	    file = argv[++f];
-	    break;
-	}
-    }
-
-    /*
-     * handle the remaining arguments as either symname or symname=value
-     * depending on whether -w (w_flag) was specified.
-     */
-
-    nl = (struct nlist *) malloc( sizeof(*nl) * (argc-f+1) );
-    nv = (int *) malloc( sizeof(int) * (argc-f+1) );
-    if (!nv || !nl) {perror("malloc"); exit(1);};
-
-    for (i=0; i<argc-f; ++i) {
-	char *name = (char *)malloc(strlen(argv[i+f])+1);
-
-	if (!name) {perror("malloc"); exit(1);};
-	rc = sscanf(argv[i+f], "%[^=]=%d", name, &value);
-	if (rc - w_flag != 1) {
-	    fprintf(stderr, "%s: bad argument: %s\n", argv[0], argv[i+f]);
-	    exit(1);
-	    }
-	nl[i].n_name = name;
-	nv[i] = value;
-	}
-    nl[i].n_name = "";
-
-    /*
-     * now figure out where to read/write in /dev/kmem and do it
-     */
-    
-    nlist(file, nl);
-
-    kmem = open("/dev/kmem", w_flag ? O_RDWR : O_RDONLY);
-    if (kmem < 0) {
-	perror("/dev/kmem open");
-	exit(1);
-	}
-
-    sleeploop:
-
-	if (sleeptime) {
-	    putchar('\n'); date();
-	    }
-
-	for (i=0; i<argc-f; ++i) {
-	    long seekto = (long)nl[i].n_value;
-
-	    if (nl[i].n_type == 0) {
-		fprintf(stderr, "%s: symbol `%s' not found in namelist of %s\n",
-		    argv[0], nl[i].n_name, file);
-	    /*
-	     *  We promise to do all writes in command line order, so if one
-	     *  is going to fail, we'd best bail out rather than continue.
-	     */
-		if (w_flag) exit(2);
-		else	continue;
-		}
-	    if ( lseek(kmem, seekto, 0) != seekto ) {
-		perror("/dev/kmem lseek"); exit(2);
-		}
-	    if ( read(kmem, &value, sizeof(int)) != sizeof(int) ) {
-		perror("/dev/kmem read"); exit(2);
-		}
-
-	    printf("%s(0x%x)\t%d", nl[i].n_name, nl[i].n_value, value);
-
-	    if (w_flag) {
-		if ( lseek(kmem, seekto, 0) != seekto ) {
-		    perror("/dev/kmem lseek"); exit(2);
-		    }
-		value = nv[i];
-		printf(" -> %d", value);
-		if ( write(kmem, &value, sizeof(int)) != sizeof(int) ) {
-		    perror("/dev/kmem write"); exit(2);
-		    }
-		}
-	    putchar('\n');
-	    }
-
-    if (sleeptime) {
-	fflush(stdout);
-	sleep(sleeptime);
-	goto sleeploop;
-	}
-}
-/*
- * print ascii version of current date and time
- */
-date() {
-    char *at;
-    static char db[30];
-    int i;
-
-    time(&i);
-    at = asctime(localtime(&i));
-    strcpy(db, at+4);
-    db[20] = 0;
-    puts(db);
-    }
+FUNKY+STUFF+
chmod u=rw,g=r,o=r kmem.c
ls -l kmem.c
exit 0


