Wanted - Auto function prototype generator
Michael Condict
mnc at m10ux.UUCP
Mon Feb 6 15:13:07 AEST 1989
In <943 at ubu.warwick.UUCP>, geoff at warwick.UUCP (Geoff Rimmer) writes:
> Has anyone written, or are they in the middle of writing a program
> that vaguely fits the following description?
>
> The program should scan thru a set of C source files, checking each
> function definition. If the function is extern, it should place a
> function prototype in the relevant header file . . .
>
> Geoff Rimmer, Computer Science, Warwick University, England.
> geoff at uk.ac.warwick.emerald
Funny you should ask. I just got done writing a sed script that prints
function prototypes, given C source files as input. It simply outputs
the prototypes on standard output, but you can easily augment it with a shell
script to append them to header files. (Also, it puts the arg-types in
comments, so as not to offend pre-ANSI compilers, but it is trivial to
have it not put out the comment delimiters.)
I wrote this sed script (actually, three that are pipelined together) as
part of an informal investigation of the programming power and efficiency of
sed, one of the old workhorses of UNIX that is too often bypassed in favor of
the trendier awk.
The results were somewhat surprising to me, and will perhaps be more so to
some of you awk addicts out there. For your edification and amusement I've
enclosed the complete text of the sed scripts, plus encapsulating shell
script. It operates as a standard UNIX filter, so the C files can be given
as arguments to the shell script or can be piped into it.
Before that, however, here are some statistics on the sed scripts and their
performance (ctags is a BSD utility that scans C source files at a similar
level of analysis, looking for function definitions):
# Non-comment lines: 87
# Non-comment words: 165 (as reported by wc)
# Non-comment chars: 1214
Speed w.r.t. ctags (4.3 BSD): 4 times slower
Speed w.r.t. lint or cc: much faster
I offer a direct challenge to those of you who think awk is better than sed
for almost everything. Show me a version of this program in awk that is
either smaller or faster or both. I am genuinely interested in knowing
whether this can be done.
The rest of this article consists of some discussion of the interesting
parts of the implementation, followed by the files. Hit n now if you don't
care about it. The files consist of a shell script (fdecs) followed by three
sed scripts (fdecs[1-3].sed) and a test input file (test.c) with the
corresponding output (test.out).
The first sed script removes preprocessor lines and comments and changes all
single- and double-quoted strings to the token 0. This removes all syntax
from the code that might confuse the part that does the real work.
All comments, backslashed newlines and preprocessor definitions are correctly
handled, as far as I know, regardless of formatting. The deleting of comments
is particularly interesting, depending as it does on delicate reasoning about
the possibility of a "*" token being immediately followed by a "/"
character in correct C programs.
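
As a rough illustration of the two tricks just described, here is a single-line sketch in sh and sed. The helper names strip_one_comment and zero_strings are made up for this example. The substitutions are taken from fdecs1.sed, minus the loop that keeps reading lines until the closing delimiter shows up, so the sketch assumes the whole comment sits on one line and that only double-quoted strings need handling.

```shell
#!/bin/sh
# Sketch only: one-line version of the marker trick used in fdecs1.sed.
# strip_one_comment and zero_strings are hypothetical helper names.

# Remove the first comment on a line. The opening delimiter becomes @ and
# the first closing delimiter becomes $, after any existing @ and $ have
# been blanked out, so that "@[^$]*\$" can only match the comment itself.
strip_one_comment() {
    sed '/\/\*/{
        s/@/ /g
        s:/\*:@:
        s/\$/ /g
        s:\*/:$:
        s/@[^$]*\$/ /
    }'
}

# Blank out escaped quotes first, then replace each quoted string by 0.
zero_strings() {
    sed -e 's/\\"/ /g' -e 's/"[^"]*"/0/g'
}

printf '%s\n' 'a = b /*oops*/*c;' | strip_one_comment   # prints: a = b  *c;
printf '%s\n' 'char *abc = "hi \"Joe\"";' | zero_strings  # prints: char *abc = 0;
```

The first input line is the awkward case from test.c: the closing delimiter is immediately followed by a "*" that belongs to the code, and taking the first occurrence of the closing delimiter still gives the right answer.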
The second sed script deletes the contents of all top-level {} pairs,
thus nulling out function bodies, struct field definitions and so on.
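
The brace-emptying step can be sketched roughly as follows. This is a simplified single-line variant, not the text of fdecs2.sed: the helper name collapse_braces is made up, and the sketch assumes that all braces of a definition are on one input line. It repeatedly replaces the innermost {...} pair by a "#" marker, which is safe because the first script has already deleted every preprocessor line, and finally turns each marker into an empty {}.

```shell
#!/bin/sh
# Sketch only: collapse nested {...} on a single line to an empty {}.
# collapse_braces is a hypothetical helper name.

collapse_braces() {
    sed ':again
s/{[^{}]*}/#/g
t again
s/#/{}/g'
}

printf '%s\n' 'int f(x) { if (x) { y(); } return 0; }' | collapse_braces
# prints: int f(x) {}
```

Each pass can only match a pair that contains no other braces, so nesting is peeled off from the inside out, and a marker left by an earlier pass is swallowed when the enclosing pair is matched.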
The third sed script finds function definitions and edits them into a
prototype. It puts both the names and the types of the arguments in the
prototype declaration. I can't be sure that the prototype syntax is strictly
ANSI conforming, since I don't have an ANSI compiler, but it looks right
(and what do you want for free?).
The interesting part of this sed script is that it allows the argument type
declarations to be in a different order from the argument list inside the
parens, and any argument in the list that does not appear in a type
declaration correctly defaults to int. The implementation depends on a
little-known feature of the \1,\2,... notation of sed, namely that \1 is defined as
soon as the first \) is encountered, and can therefore usefully appear later
in the same pattern as the \). This is a significant extension beyond the
power of standard regular expressions, since it allows, e.g., the matching of
a...aba...a, where the number of a's before and after the b must be equal.
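
A small sketch of that feature, for anyone who wants to try it. The helper names equal_runs and copy_type are made up. The second one is only a toy version of the type-copying idea, restricted to one lowercase argument name and a one-word type, and is not the expression used in fdecs3.sed.

```shell
#!/bin/sh
# Sketch only: \1 used later in the same pattern that defines it.
# equal_runs and copy_type are hypothetical helper names.

# Print only lines of the form a...aba...a with equally long runs of a's.
equal_runs() {
    sed -n '/^\(a*\)b\1$/p'
}

# Find the declaration whose name equals the arg inside the parens (\1)
# and move its type (\2) in front of the arg.
copy_type() {
    sed 's/(\([a-z]*\)) \([a-z]*\) \1;/(\2 \1)/'
}

printf '%s\n' aabaa aabaaa b ab | equal_runs   # prints: aabaa and b
printf '%s\n' 'f(x) char x;' | copy_type       # prints: f(char x)
```

In copy_type the \1 inside the pattern is what ties the declaration after the ")" to the matching name in the argument list; a declaration for some other name leaves the line untouched.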
Very few restrictions are placed on the form and content of the C source
files, and most of them are quite reasonable. See the shell script for details.
If you use preprocessor macros to significantly alter the syntax of function
definitions, then the files have to be filtered through "cc -E" before the sed
scripts can understand them. This possibility is provided for in the shell
script.
One more thing -- BSD sed is assumed. The scripts will not work under System V
sed without a minor, but infuriating, change: every comment except the one on
the first line has to be removed. Feel free to make that change if you have to
suffer under this version of sed. There is a further problem on Sys V: a
backslash inside [] is interpreted literally as itself, a bug that prevents
\n from being specified as one of the characters in a set, among other things.
Use in good health.
Michael Condict {att|allegra}!m10ux!mnc
AT&T Bell Labs (201)582-5911 MH 3B-416
Murray Hill, NJ
-----------------------------------------------------------------------------
Cut into the specified files at lines beginning with "-----" and remove
"X" from beginning of other lines:
---------------------------- fdecs -------------------------------------------
X
X# This script, given C source files, finds all the function definitions
X# of the form "type_decl f(a,b,...) type_decl a; type_decl b; ... { body }",
X# and outputs a file of extern func declarations suitable for a .h file.
X# The only restriction is that the type_decl's must be no more complicated
X# than what can be formed with names, * and []. That is, the type_decl's may
X# not contain "(*x)[]" or "struct ... {", although, e.g., "struct A" is
X# allowed.
X#
X# The types and names of the args are placed in a comment inside the parens.
X# The output is one line of the form
X#
X# type_decl f(/* type_decl a, type_decl b, . . . */);
X#
X# for each function f defined in the source files. Note that the output
X# is a legal ANSI-C function prototype if the comment delimiters are
X# removed, but with the comments it is acceptable to old C compilers.
X
XSEDDIR=.
X
XIflag=""
Xwhile [ "$#" -gt 0 ]; do
X case "$1" in
X -I*) Iflag="$Iflag $1"; shift
X ;;
X -*) echo "Usage: $0 [ -Iincl_dir ] ... file ..." >&2
X echo " (use -I only if C preprocessor is wanted)" >&2
X exit 1 # without this the loop never consumes the bad flag
X ;;
X *) break ;;
X esac
Xdone
X
Xif [ "$#" -eq 0 ] ; then
X sed -f $SEDDIR/fdecs1.sed |
X sed -f $SEDDIR/fdecs2.sed | sed -f $SEDDIR/fdecs3.sed
Xelif [ "X$Iflag" != X ] ; then
X # User wants to run the C files through the preprocessor, with
X # the specified -I flag. We delete the contents of the
X # .h files from the preprocessor's output. We also need to
X # delete any occurrences of '^$filename:', which some versions
X # of cc -E put out at the beginning of every file:
X cc -E $Iflag "$@" | \
X sed -e '#n No automatic printing
X
X # Delete contents of included .h files:
X : chkhdr
X /^# [1-9][0-9]* ".*\.h"/{
X : delhdr
X n
X /^#/!b delhdr
X b chkhdr
X }
X # Delete "filename:" inserted by "cc -E":
X /^[^ ]*\.[ch]:[ ]*$/d
X ' \
X -f $SEDDIR/fdecs1.sed \
X -e 'p' |
X sed -f $SEDDIR/fdecs2.sed | sed -f $SEDDIR/fdecs3.sed
Xelse
X sed -f $SEDDIR/fdecs1.sed "$@" |
X sed -f $SEDDIR/fdecs2.sed | sed -f $SEDDIR/fdecs3.sed
Xfi
X
---------------------------- fdecs1.sed ---------------------------------------
X# This sed script, given C source files, deletes preprocessor lines and comments
X# and changes single- and double-quoted strings to the constant 0, allowing easy
X# analysis of the remaining tokens by another sed script.
X
X# Concatenate lines ending with backslash:
X: morebs
X/\\$/ {
X N
X s/\\\n//
X b morebs
X}
X
X# Get rid of blank lines:
X/^[ ]*$/d
X
X# Delete comments:
X: delcom
X/\/\*/{
X # Change first comment delim to @ (after eliminating existing @'s):
X s/@/ /g
X s:/\*:@:
X
X # Read until we have the end comment:
X : morecm
X /\*\//!{
X N
X b morecm
X }
X
X # Get rid of any $'s:
X s/\$/ /g
X
X # First occurrence of */ is guaranteed to be the corresponding end
X # comment, because it is otherwise not legal C, so:
X s:\*/:$:
X s/@[^$]*\$/ /
X
X b delcom
X}
X
X# Delete preprocessor constructs:
X/^#/d
X
X# Get rid of single and double-quoted strings, whose contents could confuse us:
Xs/\\"/ /g
Xs/"[^"]*"/0/g
Xs/\\'/ /g
Xs/'[^']*'/0/g
X
X# Get rid of blank lines:
X/^[ ]*$/d
X
---------------------------- fdecs2.sed ---------------------------------------
X# Given C-source files that have been simplified by deleting all "#" lines,
X# comments and changing quoted strings to "0", this script
X# deletes the contents of all {} pairs that are function bodies (i.e.
X# are preceded by ")" or ";"):
X
X# WRONG -- for now we delete the contents of all {}s:
X
X: delcbrc
X/{/{
X s/{[^{}]*/{/g
X # Read until we have at least one } in the buffer:
X : getcbrc
X /}/!{
X N
X s/{[^{}]*/{/g
X b getcbrc
X }
X s/{[^{}]*}/#/g
X b delcbrc
X}
Xs/#/{}/g
X
---------------------------- fdecs3.sed ---------------------------------------
X# This sed script expects C source code that has been simplified by being
X# passed through fdecs1.sed and fdecs2.sed. The former deletes preprocessor
X# lines and comments and changes single and double quoted strings to 0. The
X# latter deletes the contents of all top-level {...} constructs.
X#
X# It outputs one extern func declaration for each func definition in the
X# C source, suitable for use in a header file. See the fdecs shell script for
X# more details.
X
X#DBG: i\
X#DBG: -----------------------------------------------------------------------
X#DBG:
X
X: doline
X
X#DBG: i\
X#DBG: ------- At doline:
X#DBG: p
X
X# Read until we have enough syntax to process:
X: getsbr
X/[;{]/!{ # This will ensure that we have either a ";" or a "{"
X N
X b getsbr
X}
X
X# Format whitespace consistently:
Xs/\n/ /g
Xs/[ ][ ]*/ /g
X
X#DBG: i\
X#DBG: ------- After formatting whitespace:
X#DBG: p
X
X# If the first semicolon in the buffer is not preceded by what looks like
X# a certain part of a function header (the end of the arg list and begin-
X# ning of the arg-type decls, which must be of the form ") name"), then it
X# is the end of a non-func declaration and can be deleted:
X/;/{
X /^[^;]*) *[a-zA-Z_{]/!{
X s/^[^;]*;//
X /^ *$/d
X b doline
X }
X}
X
X#DBG: i\
X#DBG: ------- After deleting any non-func declaration not containing {}:
X#DBG: p
X
X# If we don't yet have ") {" or "; {", we need to read more:
X/[;)] *{}/!{
X N
X b doline
X}
X
X# Format the buffer more consistently:
Xs/^ *//
Xs/( */(/g
Xs/ *)/)/g
Xs/ *, */, /g
Xs/ *; */; /g
Xs/ *\*/ \*/g
Xs/\* */\*/g
X
X#DBG: i\
X#DBG: ------- After consistent formatting:
X#DBG: p
X
X#NOTE: The following assumes that there is only one func decl in the buffer,
X# which will be true unless some joker puts more than one func definition
X# on a single line, which is so odd as to be not worth considering
X
X# If there are no args:
X/() *{/b no_arg
X # Insert a marker in front of each arg, where the type will go:
X # NOTE: This also goes in front of args in the type decls following ")"
X s/, /, %/g
X s/(/(, %/
X
X # For each arg that has a type definition after the ")", copy its
X # type in front of the arg in the arg list, a la ANSI C prototypes:
X t more_t
X : more_t
X # Following gets type decls of the form:
X # type_decl arg..."
X s/, %\([A-Za-z_][A-Za-z0-9_]*\)\(.*[);] *\)\([A-Za-z_][A-Za-z0-9_{} ]*\)\( *\)\(\**\1[\[\]]*\)\([,;]\)/, \3 \5\2\3\4@\6/
X # Following gets type decls of the form:
X # type_decl other_arg, ..., %this_arg..."
X s/, %\([A-Za-z_][A-Za-z0-9_]*\)\(.*[);] *\)\([A-Za-z_][A-Za-z0-9_{} ]*\)\( [^;]*, %\)\(\**\1[\[\]]*\)\([,;]\)/, \3 \5\2\3\4@\6/
X#DBG: i\
X#DBG: ------- After putting one arg type in front of arg:
X#DBG: p
X t more_t
X
X # Remove the ", " in front of the first arg:
X s/(, /(/
X
X # Any remaining % markers indicate untyped args, which default to int:
X s/%/int /g
X
X # Comment the args and delete everything after the ")":
X s?(\(.*\)).*?(/* \1 */);?
X
X # Get rid of any register declarations among the args, replacing them
X # with int if there was no type name given with register:
X / register/{
X s/\([A-Za-z_0-9]\) register\([^A-Za-z0-9_]\)/\1\2/g
X s/ register\([^A-Za-z0-9_][^A-Za-z0-9_]*[A-Za-z_][A-Za-z0-9_]*\),/ int\1,/g
X s/ register\([^A-Za-z0-9_]\)/ \1/g
X }
X
X # May have introduced multiple blanks, so:
X s/ */ /g
X
X b done_a
X: no_arg
X # Get rid of the empty body (i.e., ") ... {}"):
X s/).*/);/
X: done_a
X
---------------------------- test.c ---------------------------------------
X#define APAP\
X 37
X# /*hi*/ define GOO(x) y
X
Xchar *abc = "hi \"Joe\"";
X/* this is
X * a comment
X */
Xstruct A_S {
X int wopper /**** a *** b *** c *//*again*/ ;
X}; int
Xf
X(x, /* a * in a comment */
X yoohoo) /**/ /* a /* b */ char *yoohoo;
X{
X int a, b, c = '\'';
X char * quote="h#w \
X#bo{ut @hat?";
X a = b /*oops*/*c; /****************/
X} enum goober {a,b};
X struct A_S *george(x) struct {int x;
X float y;} x; { return 0; }
X
Xtypedef int bar;
Xstruct A_S * * george2(moo, x, glop, foo) struct {
X int q[13]; float y;} x[];
X bar moo , *foo[];
X struct A_S *glop;
X/*a*/{
X return 0;
X}
X
X/* Try various combinations of register arg decls:*/
Xflop(a_1, b) register a_1; { return 0; }
Xstruct BB {int f,g;} floop(a_1, b_1) register char *a_1; float register*b_1;
X{ struct BB j; return j;}
X
X/* Test arg names that are substrings of one another: */
Xchar sub1(abc, abcdef) int* abcdef; float abc; { return 0; }
---------------------------- test.out ---------------------------------------
Xint f(/* int x, char *yoohoo */);
Xstruct A_S *george(/* struct {} x */);
Xstruct A_S **george2(/* bar moo, struct {} x[], struct A_S *glop, bar *foo[] */);
Xflop(/* int a_1, int b */);
Xstruct BB {} floop(/* char *a_1, float *b_1 */);
Xchar sub1(/* float abc, int *abcdef */);