alloca wars
Gordon Burditt
gordon at sneaky.TANDY.COM
Sun Aug 7 08:41:47 AEST 1988
> > 2.) Alloca is less available than malloc/free.
> On a M68k with a decent OS, alloca is not more than a few lines of assembly
> code, correct? (Judging by the size of the GNU assembly alloca)...
> If one needs alloca and it is not available, why not write a quick alloca?
> If the machine architecture or OS is braindamaged, then one would have
> to program around it. GNU does that also.
If the compiler knows about alloca [Read: recognizes the name as special
and refrains from doing certain things because of it, OR recognizes the name
as special and implements it as a built-in], , it could be made to work
nicely. If the compiler doesn't know about alloca you can have a real
headache trying to implement it as it was intended (memory on the stack
frame), even with a "nice" CPU.
Further, even with the ultimate alloca implementation, GNU GCC with
alloca built into the compiler, it still gets it wrong.
Take, for example, the following 68000 linkage conventions:
d0-d1, a0-a1 are clobbered across calls. d2-d7, a2-a5 are preserved across
calls. a6 is the frame pointer. a7 is the stack pointer. Args are
on the stack, first arg at the lowest address. CALLER POPS ARGS FROM THE
STACK. Function return value is in d0 or d0/d1 (for doubles).
These are not negotiable, unless I scrap all my libraries and the compiler
that came with the system, and actually seem fairly reasonable. These are
used on Tandy's 68k Xenix compiler, and some other systems, like the sun
(2 and/or 3) version of the GNU compiler, which I have adapted to also
work on a Tandy 6000.
Ok, so alloca should be simple, right?
- Leave a2-a6 and d2-d7 alone, so you don't have to save them.
- Pop the return address into a register (a0 probably). [1 instruction]
- Pick up the requested amount of memory and round it up to
a multiple of 2. [2-3 instructions]
- Subtract the rounded amount of memory from the stack pointer.
[1 instruction]
- Put the stack pointer + the size of the argument ( = 4) in d0
(because the caller is going to adjust the stack to remove
that argument). [1-2 instructions]
- Return to the saved return address. [1 instruction]
[Nitpickers: I really don't care if I'm one or two instructions off in the
above estimates]
Simple, huh? So how come bison infinite loops?
Typical function entry/exit code for Tandy's compiler:
.globl _foo
_foo:
link a6,#-<stack frame size>
moveml #<<bunch of registers>>,-(a7)
...
moveml (a7)+,#<<bunch of registers>> | AARRRGGGGHHHH!!!!
unlk a6
rts
The "bunch of registers" is determined from what is actually used. There
are variations of this code if 0 or 1 registers need to be saved.
Notice the instruction marked "AARRRGGGGHHHH". If alloca moves the stack
pointer, then it has to allocate some extra memory and leave a copy of the
saved registers in them, so this instruction can pop them off. (Why didn't
the compiler writer use "moveml <offset>(a6),#<bunch of registers>" instead?
I don't care, really, but it might be because the code it actually generates
is 2 bytes shorter (and 1 clock cycle faster on a 68020).) So, my first shot
at alloca trashes all of the registers in the caller of the caller of alloca.
Ok, how much memory? 40 bytes (d2-d7 and a2-a5), regardless of whether the
function actually saves them or not. Why not find out what's actually saved?
To do this, you'd need to find the entry point of the function and disassemble
a few instructions. The frame pointer points to the return point in the
caller of the caller of alloca. However, in 68000 code, given the address
of the end of an instruction and the contents of memory, you cannot always
uniquely find the address of the beginning of it. (Ok, maybe 99% of the time
you can, but having any need for a "panic: alloca can't find it's caller -
program aborted" message is unacceptable). Besides, if the caller of alloca
was called through a function pointer (I DID NOT say alloca was called through
a function pointer! although that wouldn't matter much anyway), the address
might be and usually is in a0, and has been destroyed before alloca is called.
Now, on to another issue: function calls and arguments. Assume that the
compiler has a maximum of N temporary locations in registers to store
computed arguments that haven't been pushed on the stack yet. (Suppose for
this example that N = 2). Assume that the compiler pushes its arguments onto
the stack in an endian order (I did NOT say which end!). The arguments and
return types of foo and bar are (char *), and bar returns its first argument.
How does the compiler evaluate this (foo is presumed to have 2N+3 arguments):
foo(bar(alloca(20)), bar(alloca(20)), bar(alloca(20)), alloca(40),
bar(alloca(20)), bar(alloca(20)), bar(alloca(20)) );
Can anyone find ANY compiler on a system with caller-pops-args linkage
conventions that gets this right? Or for that matter, any compiler on
ANY system where alloca really allocates memory on the stack?
To correctly evaluate this, the compiler would have to evaluate
"bar(alloca(20))" 3 times, store the results somewhere other than pushing
on the stack, and since it has only 2 registers, one of them has to be
elsewhere (stack frame relative temporary seems to be the only place left),
evaluate alloca(40), at which point it had better not have pushed anything on
the stack, and then evaluate "bar(alloca(20))" 3 more times. THEN it fetches
the 7 values from assorted storage and pushes them on the stack.
Now, consider a compiler that doesn't know alloca is special. Can you
imagine the laughter if it always generated the above sequence for
every function, alloca or not? Storing stuff on the stack frame, then
copying it with a push-contents-of-memory-on-stack instruction? Inefficient
code in the extreme ... .
What it's really going to do is evaluate "bar(alloca(20))", push the result on
the stack, repeat 2 more times, evaluate "alloca(40)", push the result on
the stack, and repeat evaluating and pushing "bar(alloca(20))" 3 times.
foo() will then be called with all but the last-pushed argument in the wrong
place. This is, in fact, how gcc 1.24 evaluates it, WITH AND WITHOUT
using the builtin alloca. With the builtin and without the optimizer, the
alloca() becomes 2 instructions (fix stack and move return value to d0).
The round-up-to-multiple-of-2 gets done at compile time since the arguments
to the alloca calls were constant.
Consider the GNU compiler feature "defer-pop". If a compiler defers
taking arguments off the stack after a function call, and doesn't recognize
alloca as a clue to turn off this behavior, and there is no flag to
control it, then I have to throw up my hands and abandon any attempt to
write alloca. Supposedly GCC does recognize alloca as special for this
purpose, and in any case, it has flags to control it.
Back to my Tandy 68k machine. To accomodate the args-on-the-stack problem,
I have to make alloca copy more of the stack frame to handle the worst-case
of args pushed on the stack when it's called. There is no worst-case, but 5
args ( = 20 more bytes) might be good enough.
I got lucky. Tandy's compiler references auto variables relative to the
stack frame pointer. If it referenced them relative to the stack pointer,
I'd be out of luck.
So now it works like this:
- Leave a2-a6 and d2-d7 alone, so you don't have to save them.
- Pop the return address into a register (a0 probably). [1 instruction]
- Pick up the requested amount of memory and round it up to
a multiple of 2, and add 60. [2-3 instructions]
- Subtract the rounded amount of memory from the stack pointer.
[1 instruction]
- Copy 60 bytes from the old stack pointer + 4 to the new stack
pointer + 4 for length 60. [15 instructions, straightline.
Could be less but slower with a loop]
- Put the stack pointer + the size of the argument plus the
copied stack ( = 64) in d0 (because the caller is going to adjust
the stack to remove that argument). [1-2 instructions]
- Return to the saved return address. [1 instruction]
So I have a version of alloca that allocates its memory on the stack frame
and wastes 60 bytes per call. If a compiler uses defer-pop, it's
going to screw up. If there are more than 5 arguments on the stack when
it's called, it's going to screw up. But bison doesn't infinite loop
any more.
Why don't I post it? Several reasons:
- It's too compiler- and system- specific to be of much use to many people.
- The waste of memory per call is embarrasing.
- It was debugged using bison, and according to discussions going on in
gnu.gcc, it is probably subject to the GNU Public License, and I'd have to
post the source to everything I used it in. I don't have that much disk
space, and neither does the net.
- It still doesn't work right, in ways I indicated.
Gordon L. Burditt
killer!ninja!sneaky!gordon
More information about the Comp.lang.c
mailing list