Unix V/386 driver help (LONG)

Fri Aug 10 04:39:20 AEST 1990

This message has been cross-posted to the following groups:
		386users
		info-ibmpc
		Sun-386i
		Unix-Wizards

			   IMPORTANT CAVEAT
			 *******************

This message concerns a difficult problem with a Unix V/386
device driver for a RISC coprocessor based on AMD's 29000 chip. If you
do not have low level 80386, AT Bus and/or Unix V/386 experience, you
should probably skip the message.

			      BACKGROUND
			      **********

We have written a Unix V/386 device driver for this coprocessor.  It
provides between 16 and 25 MIPS, and 4 M-Whetstones of computing
horsepower.

AMD specify a "High Level Interface" (HIF) for people developing for
boards based upon the 29000. The HIF specifies a set of available
system traps that emulate several Unix system and library calls. Some
of them can be satisfied by a kernel that runs on the 29000, but
others require the services of the host 386 operating system.

In practice this gives the programmer access to regular Unix-like I/O
to and from the host filesystem, as well as a number of system calls
such as time(), and library calls such as getenv(). The net effect is
that development for the board can be carried out under Unix (or DOS)
and then once the code runs, simply recompiled for the 29000.

The device driver, however, has to support these HIF requests,
particularly those concerned with file I/O. It is obviously not
entirely normal for a driver to do file I/O .....  For commercial
reasons, I can't discuss here exactly what solution we reached, but
it's not relevant, since the problems discussed below exist even in a
miniature version that does not support file I/O.

The board is targeted as a RIP (raster image processor) for PostScript
work, and also as a high speed general coprocessor. We are using it
for both purposes: to drive an 800x400 dpi laser engine, and to carry
out some pretty intensive and proprietary image processing
(halftoning, grey scaling and compression).

The driver is operational, at least to the extent that I can run
Whetstone tests, our image processing code, a clone of PostScript
(with some provisos), and a variety of other programs.

You should be aware that the company that designed this board did not
design it with the AT bus spec in mind, inasmuch as they do not
pulse the interrupt line. Instead, the line stays high until the
interrupt handler reads from a register on the board. We have hassled
them A LOT about this, but under DOS, it seems to cause no problems.
Using a kernel debugger, I have verified that our Unix V/386 system
(Interactive's 386/ix) is indeed initializing the PIC in edge-triggered
mode, and there is other confirmatory evidence of this too.
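
To make the edge/level distinction concrete, here is a minimal sketch
in C of how one input of an interrupt controller might respond to a
line that is held high. It is a toy model, not 8259A or 386/ix code,
and the names pic_input, pic_tick() and count_deliveries() are
hypothetical. It assumes the handler completes between samples but
never reads the board register, so the line never drops.

```c
/* Toy model of one interrupt controller input. All names are hypothetical. */

enum trigger_mode { EDGE_TRIGGERED, LEVEL_TRIGGERED };

struct pic_input {
	enum trigger_mode mode;
	int last_level;		/* line level seen on the previous sample */
};

/*
 * Sample the line once. Returns 1 if an interrupt would be delivered
 * on this sample, 0 otherwise. The handler is assumed to have
 * finished before the next sample is taken.
 */
int
pic_tick (struct pic_input *in, int level)
{
	int deliver;

	if (in->mode == EDGE_TRIGGERED)
		deliver = (level && !in->last_level);	/* low-to-high only */
	else
		deliver = level;			/* fires while high */

	in->last_level = level;
	return deliver;
}

/* Hold the line high for `ticks' samples and count the deliveries. */
int
count_deliveries (enum trigger_mode mode, int ticks)
{
	struct pic_input in;
	int i, n = 0;

	in.mode = mode;
	in.last_level = 0;
	for (i = 0; i < ticks; i++)
		n += pic_tick (&in, 1);
	return n;
}
```

In this model a line held high for 50 samples yields one delivery in
edge-triggered mode and 50 in level-triggered mode, which is why the
PIC mode matters for a board that does not pulse its line.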

			      THE BEAST
			      *********

When a program running on the 29000 makes a call that requires the
services of the host (such as time()), the board generates an
interrupt. The interrupt handler reads both a register and a memory
location on the board to determine the nature of the request.  It
services it, writes some status values to memory on the coprocessor and
everyone is happy. The interrupts are actually generated by a kernel
that runs on the 29000, which itself has to generate a few extra
interrupts at startup and termination (see below).
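
The handshake described above can be sketched roughly as follows. This
is not our driver code: the structure layout, the field names and the
functions read_irq_register() and hif_service() are all hypothetical,
and the board is modelled as a plain struct rather than real mapped
memory. The request numbers 49 (TIME) and 1 (EXIT) are the ones listed
in the table later in this message.

```c
/* Sketch of the host-side handshake. Names and layout are hypothetical. */

#define HIF_EXIT	1
#define HIF_TIME	49

struct board {
	volatile int irq_pending;	/* cleared when the host reads the register */
	volatile long request;		/* request number placed by the 29000 */
	volatile long result;		/* value handed back to the 29000 */
	volatile long status;		/* 0 = ok, -1 = unsupported request */
};

/* Reading the register is what makes the board drop its interrupt line. */
int
read_irq_register (struct board *b)
{
	int was_pending = b->irq_pending;

	b->irq_pending = 0;
	return was_pending;
}

/*
 * Service one request. `now' stands in for the host clock so that the
 * sketch stays self-contained. Returns 1 if a request was handled,
 * 0 if the board had no interrupt pending.
 */
int
hif_service (struct board *b, long now)
{
	if (!read_irq_register (b))
		return 0;		/* not our interrupt */

	switch (b->request) {
	case HIF_TIME:
		b->result = now;
		b->status = 0;
		break;
	case HIF_EXIT:
		b->result = 0;
		b->status = 0;
		break;
	default:
		b->status = -1;		/* request not modelled here */
		break;
	}
	return 1;
}
```

The point of the sketch is the ordering: the register read that
releases the interrupt line happens first, and the status words are
written back only after the request has been serviced.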

Consider the following program:

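#include <stdlib.h>	/* atoi(), exit() */
#include <time.h>	/* time() */

int
main (argc, argv)

int argc;
char *argv[];

{
	int intcnt;

	if (argc < 2)		/* the call count is required */
	   exit (1);

	intcnt = atoi (argv[1]);

	/* each time() call is one request to the host */
	while (intcnt-- > 0)
	   time ((time_t *) 0);

	exit (0);
}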

The program, with an argument of 1, generates 8 interrupts. These
service the following requests, made either by the on-board kernel
or by the above program.

REQUEST  FUNCTION  SOURCE        REASON

20       WRITE     29000 kernel  (startup message)
20       WRITE     29000 kernel  (startup message part II)
66       COPYARGS  29000 kernel  (get arguments for program)
49       TIME      program       (get time)
18       CLOSE     program       (close stdin)
18       CLOSE     program       (close stderr)
18       CLOSE     program       (close stdout)
1        EXIT      program       (exit)

You will hopefully see that this program generates (n+7) interrupts,
where n is the value of argv[1]. It calls time() n times, each of
which generates an interrupt, and also has the overhead of the startup
and shutdown interrupts regardless of the value of argv[1].
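
As a quick check on the arithmetic, the count can be written out in C.
The function name expected_interrupts() and the two constants are
hypothetical; the values 3 and 4 are simply counted from the table
above.

```c
/* Fixed startup and shutdown requests, counted from the table above. */
#define STARTUP_INTERRUPTS	3	/* WRITE, WRITE, COPYARGS */
#define SHUTDOWN_INTERRUPTS	4	/* CLOSE x 3, EXIT */

/* Interrupts expected for a run that calls time() n times. */
long
expected_interrupts (long n)
{
	return n + STARTUP_INTERRUPTS + SHUTDOWN_INTERRUPTS;
}
```

With n = 1 this gives the 8 interrupts listed in the table.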

All is fine. HOWEVER, if the argument is increased, then weird things
begin to happen. When I say weird, I mean that the machine reboots.
Just like that. Reboots ......

What is the value of argv[1] when this happens ? Good question. It
varies between 200 and 1000, and does not seem to be deterministic.
However, if I jump into the kernel debugger at some point whilst this
program is running (which is pretty difficult with a 25 MIPS board
:-)), then at certain times I see a LARGE number (30-60) of traps
built up when I issue the "stack" command. The traps are of the same
type as the interrupt vector used by the board, i.e. if the board uses
IRQ 10, there will be lots of "trap A"s.

What seems doubly weird is that if I look at the stack I notice:

	i) only the trap on the top of the stack calls
	   cmnint(), which from my disassembler adventures
	   with the kernel debugger is what actually calls
	   the interrupt handler defined in ivect (declared
	   in config.c, built and compiled at kernel build time).

	ii) the rest call some other function.

	iii) what this function is seems to depend upon the
	     IRQ used by the board. If I use IRQ 10, then
	     the extra traps on the stack call timein(),
	     whose function appears to be checking the
	     timeout() stack (the array "callout") for
	     functions that should be called after a given
	     period of time.

	     However, if the board uses IRQ 15, then the
	     extras call clock_int(), although a call to
	     timein() is also on the stack, apparently
	     from the call to clock_int() generated by
	     the prior trap.
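
For readers unfamiliar with the timeout() mechanism mentioned in (iii),
here is a toy sketch of a callout table in C. I do not have the kernel
source, so this is only a model of the general idea and not 386/ix
code: apart from the array name "callout", every name here
(toy_timeout(), toy_timein(), record_fire() and the struct fields) is
hypothetical, and it stores absolute tick counts for simplicity.

```c
/* Toy callout table. Not 386/ix code; names are hypothetical. */

#define NCALL	8

struct callo {
	long c_time;		/* tick at which the entry is due */
	void (*c_func) (int);	/* function to call, or 0 if the slot is free */
	int c_arg;		/* argument passed to c_func */
};

struct callo callout[NCALL];
long now_ticks;			/* ticks seen so far */
int fired_total;		/* updated by the example callback */

/* Example callback: just records that it ran. */
void
record_fire (int arg)
{
	fired_total += arg;
}

/* Ask for func(arg) to be called `ticks' clock ticks from now. */
int
toy_timeout (void (*func) (int), int arg, long ticks)
{
	int i;

	for (i = 0; i < NCALL; i++) {
		if (callout[i].c_func == 0) {
			callout[i].c_time = now_ticks + ticks;
			callout[i].c_func = func;
			callout[i].c_arg = arg;
			return 0;
		}
	}
	return -1;		/* table full */
}

/* Called once per clock tick: run and free every entry that is due. */
int
toy_timein (void)
{
	int i, ran = 0;

	now_ticks++;
	for (i = 0; i < NCALL; i++) {
		if (callout[i].c_func != 0 && callout[i].c_time <= now_ticks) {
			void (*f) (int) = callout[i].c_func;

			callout[i].c_func = 0;	/* free the slot before calling */
			f (callout[i].c_arg);
			ran++;
		}
	}
	return ran;
}
```

In this model the table is only walked from the clock tick, which is
why seeing the equivalent routine called from a trap belonging to the
board's IRQ looks so odd.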

SOME QUESTIONS:
***************

1) What causes a given trap, of a given type, to call a
	particular kernel function instead of another ? Is
	this vectored by some low level 386 hardware stuff,
	or is there a layer of the Unix kernel that routes
	this ?

2) Does the presence of 30-60 traps built up on the stack
        indicate a real problem ? Can the debugger be trusted
	when it reports this information ?

3) Could an overflow of the stack reboot the machine ?
	If I single step through the kernel when the stack
	builds up like this, I can reboot the machine on
	a single machine instruction .... I don't
	know what that instruction is, however ...

4) Could a board that generates "level-triggered" interrupts
	cause multiple traps in this way ? When I say
	level-triggered, I mean that the board does NOT
	generate a pulse, but instead drives its interrupt line
	high until something reads from one of its registers.

	We (and the board's makers) tried testing this under DOS,
	and could not generate multiple interrupts, but that's
	not necessarily a fair test.

5) What the hell is going on ?

I appreciate that there is a lot here to digest and a lot of areas for
problems to arise. I have had some Unix driver experience before, but
simply do not have the knowledge or the access to the kernel source to
know what is happening at this level. There seem to be extremely few
people around who have the kind of experience and knowledge to deal
with this type of question - DOS folk don't know anything about
the way Unix handles interrupts, whilst Unix folks tend not to have 
a very deep knowledge about interactions between hardware and the kernel.

Basically, we are stuck with this, and any assistance or help you can
offer will be much appreciated. 

I do not subscribe to any of the lists to which this has been submitted,
so please reach me by mail at: pauld at scenic.wa.com

Thanks for your time and expertise.

-- paul

Paul Barton-Davis			<pauld at scenic.wa.com>
ScenicSoft, Inc.			
(206) 776-7760
			"Industry without art is brutality"