Loss of RFNMs on ARPAnet hosts (the REAL FIX)

Wed Oct 30 04:22:41 AEST 1985

Index:	/sys/vaxif/if_acc.c 4.2BSD

APOLOGIA:
	The previous fix posted was **** ALL WRONG. ****

	My colleague who tracked down this bug did not (by his own admission)
	explain the nature of the bug sufficiently, hence the wrong
	'fix'.  This person, who will remain nameless, has suffered
	and will continue to suffer the pains of the damned
	because *I* ended up looking stupid on USENET. (I know, USENET
	is full of stupid-looking people, but I was saving that for
	net.singles).
	Thanks to Art Berggreen <Art at ACC.ARPA> for his analysis of the
	problem (included below) and to my nameless colleagues for spending
	hours poring over logic diagrams to figure out just how this bloody
	box works.

NOTE:	This is **not applicable** unless the modifications from Chris Kent
	(cak at purdue.ARPA, posted 21 March 1984) have been made to
	/sys/netinet/tcp_output.c.  These modifications advertise a
	maximum TCP segment size that is tuned per network interface.

Description:
	Connections to certain hosts on the ARPAnet will start failing with
	"out of buffer space" messages.  Doing a 'netstat -h' shows
	that the host (or the gateway to it) has a RFNM count of 8.

	The RFNM count never drops below 8 and so the network path is
	unusable until the system is rebooted.

	The problem lies in the LH/DH-11 IMP interface.
	Sometimes (most likely always) it fails to set the <END OF MESSAGE>
	flag in the control & status register if the input buffer fills
	at the same moment that <LAST BIT SIGNAL> from the
	IMP comes up.

	This causes the LH/DH driver to append the next
	incoming message from the IMP to the previous message.
	This process (appending of messages) will continue until
	a message SHORTER than the input buffer size is sent --
	a RFNM response does nicely.

	This results in the LOSS of the succeeding messages (e.g. RFNMs)
	since the 1822 protocol handling code expects to get only
	<ONE> message from the LH/DH at a time.
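
	As a rough illustration only (a toy model, not code from the
	driver), the effect on message counts can be sketched as below.
	The name messages_delivered() and the 12-byte RFNM length used in
	the usage note are assumptions made for the example, and every
	message is assumed to fit within one input buffer.

```c
#include <stddef.h>

/*
 * Toy model of the behaviour described above; NOT driver code.
 * msg_len[] holds the length in bytes of each message the IMP sends,
 * and every length is assumed to be <= buf_bytes.  Returns how many
 * separate messages the 1822 code would be handed.
 */
int
messages_delivered(const int *msg_len, size_t n, int buf_bytes)
{
	int delivered = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/*
		 * Modelled bug: a message that fills the buffer exactly
		 * loses its end-of-message indication, so its data stays
		 * chained to whatever arrives next and is not handed up
		 * on its own.
		 */
		if (msg_len[i] == buf_bytes)
			continue;
		/* A shorter message ends the chain; one message goes up. */
		delivered++;
	}
	return delivered;
}
```

	Under this model, feeding the lengths {1018, 12, 1018, 1018, 12, 12}
	through a 1018-byte buffer hands up only 3 messages, while a
	1020-byte buffer hands up all 6.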

	This problem happens when the IMP MTU is advertised as the TCP
	maximum segment size (as is done by the TCP changes from cak at purdue).
	This allows an incoming message to be 1006 + 12 bytes long, which
	equals the size of the 1018 byte input buffer in
	the IMP (I believe) and so exercises the bug in the LH/DH.

	The described problem would appear to happen ONLY if a message
	from the IMP is one word longer than the buffer being read into.
	When the buffer fills, leaving the data that contains the Last
	Bit in the LH/DH data buffer, the Receive DMA terminates and
	the EOM flag is NOT ON (because the user has not yet DMA'd 
	the End-of-Message into memory).  What should happen when the
	Receive DMA is restarted is that the remaining word is read into
	memory and the DMA terminates with the EOM flag ON.  If the internal
	EOM status is lost when the DMA is restarted, the following message
	will be concatenated with the end of the previous message.

	A better solution than reducing IMPMTU (which doesn't really
	fix the problem) would be to use I/O buffers that are slightly
	larger than IMPMTU (and of course setting the Receive Byte Counter
	to be larger than any expected message). 
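
	The word-count arithmetic behind this suggestion can be sketched
	as follows.  This is a minimal sketch, not the driver itself: the
	helper names are hypothetical, IMPMTU is assumed to be the 1018
	bytes quoted above, and the receive counter is assumed to start
	at minus the buffer size in 16-bit words and count up toward zero.

```c
#include <assert.h>

/* Assumed value: the 1018-byte figure quoted above (1006 + 12). */
#define IMPMTU	1018

/* Value to load into the receive word counter for a given buffer. */
int
initial_wc(int buf_bytes)
{
	assert(buf_bytes > 0 && buf_bytes % 2 == 0);
	return -(buf_bytes / 2);
}

/* Counter value left after a message of msg_bytes has been read in. */
int
residual_wc(int buf_bytes, int msg_bytes)
{
	assert(msg_bytes >= 0 && msg_bytes <= buf_bytes);
	return initial_wc(buf_bytes) + (msg_bytes + 1) / 2;
}

/*
 * Message length recovered from the residual count.  Multiplication
 * is used here because left-shifting a negative int is not portable C.
 */
int
received_len(int buf_bytes, int wc)
{
	return buf_bytes + wc * 2;
}

/* Nonzero if the transfer stopped before the buffer was full. */
int
ended_short(int wc)
{
	return wc < 0;
}
```

	With a buffer of IMPMTU bytes, a full-sized message leaves the
	counter at 0, so the buffer fills just as the message ends.  With
	IMPMTU+2 bytes the counter is left at -1 and the recovered length
	is still IMPMTU, so a full-sized message always ends short of the
	end of the buffer.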

Fix:
	/sys/vaxif/if_acc.c:


163c164
< 	     (int)btoc(IMPMTU)) == 0) {
---
> 	     (int)btoc(IMPMTU+2)) == 0) {
190c191
< 	addr->iwc = -(IMPMTU >> 1);
---
> 	addr->iwc = -((IMPMTU + 2) >> 1);
328,330c329,331
< 	len = IMPMTU + (addr->iwc << 1);
< 	if (len < 0 || len > IMPMTU) {
< 		printf("acc%d: bad length=%d\n", len);
---
> 	len = IMPMTU+2 + (addr->iwc << 1);
> 	if (len < 0 || len > IMPMTU+2) {
> 		printf("acc%d: bad length=%d\n", unit, len);
362c363
< 	addr->iwc = -(IMPMTU >> 1);
---
> 	addr->iwc = -((IMPMTU + 2)>> 1);

This fix really does the job properly.
-- 
Shouter-To-Dead-Parrots @ Univ. of Texas Computation Center; Austin, Texas  

"All life is a blur of Republicans and meat." -Zippy the Pinhead

	clyde at ngp.UTEXAS.EDU, clyde at sally.UTEXAS.EDU
	...!ihnp4!ut-ngp!clyde, ...!allegra!ut-ngp!clyde