Summary - rsize, wsize
Monty Mullig
monty at delphi.bsd.uchicago.edu
Wed Jun 14 08:40:39 AEST 1989
The following is a summary of the responses I received to my posting
about rsize and wsize for NFS-mounted partitions. The first entry
is a summary of my original posting. Thanks to those who responded.
--monty
Summary of results:
Each trial timed a wc and a cp run on a 9.5MB file, with all activity
on this file on the test partition.
trial 1: read/write sizes using default (8k)
fstab entry for /u1 partition:
delphi:/u1 /u1 nfs rw 0 0
average cp: 1:33.6 (93.6s)
average wc: 1:45.6 (105.6s)
trial 2: read/write sizes of 2048, timeo=100
fstab entry for /u1 partition:
delphi:/u1 /u1 nfs rw,rsize=2048,wsize=2048,timeo=100 0 0
average cp: 4:50.3 (290.3s) +210.2% over defaults ave
average wc: 2:09.3 (129.3s) + 22.4% over defaults ave
trial 3: read/write sizes of 1024, timeo=100
fstab entry for /u1 partition:
delphi:/u1 /u1 nfs rw,rsize=1024,wsize=1024,timeo=100 0 0
average cp time: 1:48.6 (108.6s) 16.0% over defaults
average wc time: 1:45.0 (105.0s) -0.6% over defaults
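For reference, the "% over defaults" figures can be rechecked from the averaged times with a small shell helper (trial-1 averages hard-coded from above; the +210.2% quoted for trial 2's cp comes out as +210.1% here, presumably because the quoted averages are themselves rounded):

```shell
# pct_over time base -> percent difference from the trial-1 (8k default) average
pct_over() {
    awk -v t="$1" -v b="$2" 'BEGIN { printf "%+.1f%%\n", (t - b) * 100 / b }'
}

# trial-1 averages: cp 93.6s, wc 105.6s
pct_over 290.3 93.6     # trial 2 cp
pct_over 129.3 105.6    # trial 2 wc
pct_over 108.6 93.6     # trial 3 cp
pct_over 105.0 105.6    # trial 3 wc
```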
>----------------------------------------------<
Date: Thu, 25 May 89 22:28:46 EDT
From: dan at flash.bellcore.com (Daniel Strick)
The default rsize/wsize is 8k. The recommendation that these
parameters be reduced to 2k or 1k was originally suggested
to preserve the functionality of old ethernet interfaces
with only 2k of buffer space (beyond which packets must
be dropped). It turns out that in addition to the limitation
in the old interfaces, there are kernel buffer resources
that can be exceeded (usually happens when a fast machine
blasts away at a slower one) and therefore the rsize/wsize
reduction recommendation is periodically repeated even
though the old ethernet interface is history.
If the destination of the nfs data is not overrun, the
default 8k rsize/wsize should be marginally most
efficient. This is reflected by your 8k and 1k tests.
I don't know what happened during your 2k tests (you win
a cigar). Perhaps the maximum ethernet packet size of
roughly 1500 bytes is relevant.
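As a rough sketch of why the 1500-byte limit might matter: an 8k NFS transfer becomes one UDP datagram that IP must fragment to fit the Ethernet MTU. Assuming the usual 20-byte IP and 8-byte UDP headers (and ignoring RPC/NFS overhead, so this is a lower bound), the fragment count works out as:

```shell
# Number of Ethernet-size IP fragments for one NFS transfer of a given size.
nfs_frags() {
    awk -v size="$1" 'BEGIN {
        per_frag = 1500 - 20              # IP payload bytes per fragment
        datagram = size + 8               # UDP header on the front
        print int((datagram + per_frag - 1) / per_frag)
    }'
}

nfs_frags 8192    # the default rsize/wsize: a burst of back-to-back frames
nfs_frags 1024    # fits in a single Ethernet frame
```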
>-----------------------------------------<
Date: Thu, 25 May 89 22:28:40 EDT
From: hedrick at geneva.rutgers.edu
There's no reason to decrease rsize and wsize between Sun 3's and 4's
on the same Ethernet. Rsize and wsize are a hack, for use only with
Ethernet controllers that don't have enough buffering to receive 6
back to back packets. The 3Com Ethernet cards used on most Sun 2's
have this problem. So you want to reduce wsize on a 3 that has
mounted a 2 or rsize on a 2 that has mounted a 3. Some gateways or
bridges have trouble with large numbers of back to back packets also.
This may be load-dependent. The newest cisco hardware works fine with
default settings, as they are now using controller cards with lots of
on-board buffering. Older cisco gateways (particularly those using
3Com controller cards, but sometimes the Interlan cards have trouble
also) need reduced rsize and wsize. I assume the same may be true of
other vendors. Finally, if you have a link that tends to lose packets
(e.g. a noisy serial line), reducing the sizes could help too. If you
lose one packet you have to resend the whole bunch, so reducing the
size of the bunch could help. But you'd need very high error rates
before you'd see this. If you don't have one of these special
situations where you need a smaller size, then the defaults do better,
since they decrease the RPC processing overhead needed to handle a
given amount of data.
Your test had a server that was faster than the client. If you had
the reverse, e.g. a Sun 4 client and a Sun 3 server, when the client
writes data to the server, the server may get overrun. Generally we
suggest reducing the number of biod's rather than using rsize and
wsize, but if you needed to throttle just one particular mount, wsize
might be the way to do it. We've never seen trouble due to the server
being faster than the client.
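As a concrete illustration of throttling just one mount (the slow server's hostname and path here are made up, not from the posting), only the problem mount's fstab entry would carry a reduced wsize, while other mounts keep the defaults:

```
# hypothetical /etc/fstab: throttle writes to one slow server only
slowserv:/export/data /data nfs rw,wsize=2048 0 0
delphi:/u1 /u1 nfs rw 0 0
```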
>-------------------------------------<
Date: Fri, 26 May 89 11:05:19 EDT
From: jas at proteon.com (John A. Shriver)
The default rsize and wsize are 8192.
The problem is that they send one giant UDP packet of wsize, and let
IP fragmentation make it small enough to go across the Ethernet. For
8192, that's six packets. These packets are sent as a *very* fast
burst. If any of the fragments get lost, all are useless because of
the IP unique ID. This message explains:
Date: Sat, 28 Dec 85 19:00:04 est
From: Larry Allen <apollo!lwa at uw-beaver.arpa>
Subject: ip fragmentation follies
I've been playing with IP fragmentation/reassembly and have discovered a
major crock in the Berkeley way of doing things. This may have been
noticed by someone before, but I hadn't really thought about it.
What caused me to notice this was claims by some people (namely Sun)
that using very large IP packets and using IP-level fragmentation makes
protocols like NFS run faster. This makes some sense (less
context-switching, etc), so we decided to try it. We quickly noticed a
problem, though: if a fragmented packet has to be retransmitted (eg
because one of the fragments was dropped somewhere) the fragments of the
retransmitted packet are not and can not be merged with those of the
original packet! Why? Because the Berkeley code has no notion of
IP-level retransmission, and hence assigns a new IP-level packet
identifier to each and every IP packet it transmits! And since the
IP-level identifier is the only way the receiver can tell whether two
fragments belong to the same packet, this means that the fragments of a
retransmitted packet can never be combined with those of the original.
What all this means in practice is this: for a fragmented IP packet to
get through to its receiver, all the fragments resulting from a single
transmission of that packet must get through. If a single fragment is
lost, all the other fragments resulting from that transmission of the
packet are useless and will never be recombined with fragments from past
or future transmissions of the same packet.
All this explains (or at least partially explains) why
people running 4.2 TCP connections across the Arpanet using 1024-byte
packets were losing so badly. If the probability of fragment lossage is
even moderately high, it will often take three or more tries to get a
fragmented packet across the net. Meanwhile, of course, the useless
fragments from previous transmissions are sitting on reassembly queues
in the receiver (for 15 seconds, I think?), tying up buffering resources
and increasing the chances that fragments will be dropped in the future!
In the current Berkeley code, it's possible to imagine workarounds for
this problem for TCP: because TCP is in the kernel, it could have a side
hook into the IP layer to tell it "this packet is a retransmission,
don't give it a new IP identifier". For protocols like UDP, however, the
acknowledgment and retransmission functions are done outside of the
kernel, and the only kernel interface that's available is Berkeley's
socket calls (sendto, recvfrom, etc). Needless to say, the socket
interface gives you 1) no way to find out what IP identifier a packet
was sent with; 2) No way to specify the IP identifier to use on an
outgoing packet.
I don't really have any idea what to do about this problem. And, it's
not entirely Berkeley's fault; the BBN TCP/IP for 4.1bsd did the same
thing... In any case, until there's a fix I don't think using IP
fragmentation/reassembly when talking to 4.2bsd systems is a very good
idea.
-Larry
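Larry's "three or more tries" figure is easy to sanity-check. If each fragment is lost independently with probability p, the whole 6-fragment packet gets through with probability (1-p)^6, so on average 1/(1-p)^6 transmissions are needed (independent loss is of course a simplification):

```shell
# Expected number of transmissions of an n-fragment packet when each
# fragment is independently lost with probability p.
expected_tries() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 1 / (1 - p) ^ n }'
}

expected_tries 0.05 6
expected_tries 0.10 6
expected_tries 0.20 6    # moderately high loss: nearly four tries per packet
```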
Well, the important thing is that this only matters when packets are
being lost. The least likely time for that to happen is on an idle
network at midnight. The net has to be busy. Also, the problem is
for receiving data on the slow host (3/50 in your case). Try reading
two large files, from two different file servers, at the same time,
with your 3/50. That will start causing it to lose packets.
For files the 3/50 mounts, you may only need to set the rsize lower,
the wsize may be fine.
The case is not that smaller rsize/wsize improves performance. The
case is that if you are losing enough packets to blow performance to
hell, lowering rsize/wsize will save your ass.
Much of this should be greatly improved when SunOS 4.1 comes out, with
adaptive retransmission in NFS.
>----------------------------------------------<
Date: Fri, 26 May 1989 11:25-EDT
From: David.Maynard at K.GP.CS.CMU.EDU
About 2 years ago I did some fairly extensive benchmarks on the rsize,
wsize, and timeo options. In addition to having machines of different
speeds (Sun-2/120 vs. Sun-3/160), I had to deal with LANbridges and IP
routers on a heavily loaded network. About 6 months ago I did some
more limited tests using a Sun-3/50 instead of the Sun-2 on a similarly
convoluted network. These tests were done under 3.X so things could be
very different under 4.0. In addition, the client machines had local
disks so I was not affected by page/swap traffic that might change your
results.
First to answer your question, the default maximum transfer size is
8192 unless the server is a Sun-2 with the 3Com ethernet board. This
corresponds to the page size on most of the newer Suns so you only need
one transfer to get a whole page.
In most cases, the default rsize and wsize settings should work well.
Problems generally arise if your combination of hardware and loading
prevents one of the machines from handling a fairly steady stream of
large packets. Two possible sources of such problems are: 1) speed
differences between the client and the server, and 2) limitations in
the network itself.
If the server machine is much faster than the client, then what the
server considers a steady stream of packets may be an unmanageable
flood to the client. With Sun-2's this could be a real problem. I've
also heard of people having similar problems between Sun-3's and
Sun-4's. In this case, the load on the client plays a major role in
how bad the problem is.
The second source of problems is limitations in the network itself. On
Sun-2's with the 3Com controller, the network interface doesn't deal
well with packets longer than 4K. If your network has IP routers or
bridges, these network links can greatly limit your ability to transfer
streams of large packets. Some IP routers are especially notorious for
dropping things under heavy loads.
The key to minimizing these problems for NFS is limiting the overhead
of having large numbers of small packets while reducing the number of
retransmissions due to dropped or late packets.
To get a feel for how your network behaves, try using 'spray' with
various packet sizes. It isn't as accurate as NFS tests, but is easier
to do while others are working. Be sure to spray both from client to
server and from server to client. By comparing the percentage of
packets dropped in the two directions you can get an idea of how CPU
speed differences might affect NFS (although only roughly since spray
represents the extreme case of streaming packets). Then, look at the
bandwidth numbers for the different sizes. Bandwidth should increase
as packet size increases (reduced overhead). This is why you want
rsize and wsize to be as large as possible. However, the number of
dropped packets also tends to increase with size. Unlike spray, NFS
has to retransmit dropped packets, so dropped packets can greatly
reduce NFS performance. If your network has routers, you will also
probably notice a drop-off point where performance degrades rapidly for
larger packets.
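The two-direction comparison reduces to a drop percentage per direction. A tiny helper makes the asymmetry explicit (the packet counts below are made-up examples, not real spray output):

```shell
# drop_pct sent received -> percentage of packets dropped in one direction
drop_pct() {
    awk -v s="$1" -v r="$2" 'BEGIN { printf "%.1f%%\n", (s - r) * 100 / s }'
}

# e.g. counts read off two spray runs between a fast server and a slow client:
drop_pct 1162 1043    # server -> client: the slow end loses packets
drop_pct 1162 1160    # client -> server: almost nothing lost
```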
Once you have an idea of how the network behaves, start doing NFS tests
with 'cp,' 'wc,' or your favorite command. Adjust rsize and wsize from
8192 down to 1024. Also, adjust the timeo option from 7 (the default)
up to 20 or so. For each test look at the elapsed time for the
commands AND the statistics reported by 'nfsstat' on the client.
(Remember to zero the nfsstat statistics between tests.) The 'Client
rpc' data reported by nfsstat will tell you how many (if any) of the
calls timed out (i.e., were dropped or were too late). You want to
keep the number of retransmissions low to get the best performance.
One way of reducing retransmissions is to increase the timeo option.
However, increasing the timeout introduces a delay before dropped
packets are retried. With timeo=100, it will be 10 seconds before a
dropped packet is retried! This delay can really hurt NFS
performance. Even on a bad network I have found that limiting the
timeo to 10 or less gives me the best overall performance. On the
other hand, that extra 3/10 of a second (timeo=10 versus the default
of 7) greatly reduces the number of timeouts for our particular
network.
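Since timeo is expressed in tenths of a second, the values discussed above convert as follows:

```shell
# timeo is in tenths of a second; convert a few of the values used above.
timeo_secs() {
    awk -v t="$1" 'BEGIN { printf "%.1f\n", t / 10 }'
}

timeo_secs 7      # the default: 0.7 seconds before a retry
timeo_secs 10     # a modest increase: 1.0 second
timeo_secs 100    # the trial setting: a full 10 seconds
```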
To comment on your specific results, I would suspect that either you
don't have a problem and you should just use the defaults, or that your
tests were skewed by the large timeo values. One quick way to tell is
to look at the nfsstat results on a client that has been running for
a while under normal load. If the percentage of client rpc calls that
have timed out is greater than 1/2% of the total, then you should
probably do some more rsize and wsize tests. Because of our heavily
loaded network and routers, I get the best performance when around 1%
of the packets time out.
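That 1/2% rule of thumb is just timeouts divided by total calls from the "Client rpc" section of nfsstat. A sketch of the arithmetic with hand-entered counters (the counter values here are hypothetical, and real nfsstat layouts vary, so the counts are supplied directly rather than parsed):

```shell
# rpc_timeout_check calls timeouts -> timeout rate and a verdict based on
# the 1/2% rule of thumb from the text.
rpc_timeout_check() {
    awk -v calls="$1" -v timeo="$2" 'BEGIN {
        pct = timeo * 100 / calls
        verdict = (pct > 0.5) ? "worth retesting rsize/wsize" : "leave the defaults"
        printf "%.2f%% timed out: %s\n", pct, verdict
    }'
}

rpc_timeout_check 48213 197    # hypothetical counters: below threshold
rpc_timeout_check 48213 610    # hypothetical counters: above threshold
```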
I hope you don't mind the long explanation. I guess it might be more
appropriate for Sun-Spots where it might help someone who isn't
familiar with the background and hasn't already done a lot of tests.
Anyway, I hope it helps.
More information about the Comp.sys.sun
mailing list