YACCable C and C++ grammars; Flex scanner; yacc debugging tool
Jim Roskind x5570
jar at io.UUCP
Sat Jun 16 03:09:46 AEST 1990
FILENAME: FREEGRAM4.TXT
AUTHOR: Jim Roskind
Independent Consultant
516 Latania Palm Drive
Indialantic FL 32903
(407)729-4348
jar at ileaf.com
or ...uunet!leafusa!jar
6/6/90
Dear C++ and C Grammar User,
I have written a YACC debugging tool, and a set of grammars for C and
C++ in order to use them within my own personal project development.
I have made my work in this area available to other developers at no
charge with the hope that they would use my work. I believe the
entire C++ community can benefit from such standardization. If any
of the copyright notices on the grammars (which are VERY liberal)
prevent using my work, please notify me of the problem.
Note that the grammars can each be processed by YACC, but they are
very clean, and make NO USE of the precedence setting (i.e.: %prec)
or associativity setting (i.e.:%assoc) constructs of YACC. This
feature should make them easily portable to other parser generator
input format. This "cleanliness" fact also provides brutal exposure
of all the complex constructs in C++, and the complexity of the
grammar as a whole (the C++ grammar is 2 to 3 times as large as the C
grammar) reflects this exposure.
The files included in this set are:
FREEGRM4.TXT This introductory file
GRAMMAR4.TXT Parsing ambiguities in C++, and in my grammar
CPP4.Y My YACC compatible C++ grammar
C4.Y My YACC compatible, ANSI C conformant grammar
CPP4.L Flex input file defining a C++ lexical analyser
SKELGRPH.C A hacked file from the Berkeley YACC distribution.
Aside from the addition of several files, this release of my grammar
corrects a few problems located in my prior release. This release
will hopefully conclude compliance with C++ 2.0, and allow me to
release a C++ 2.1 compliant grammar (re: nested types).
Since my first public release of my grammar, I have received a number
of requests. One of the most common requests was for a lexical
analyser to go with the grammar. This release of the grammar
provides such a a "bare bones" lexical analyser. The analyser does
not support preprocessing, or even comment removal. In addition,
since I have not included a symbol table, or semantic actions in the
grammar to maintain proper context (i.e., current scope), typedef
names and struct/class/union/enum tags are not *really* defined. To
allow users to experiment with my grammar without a symbol table, my
lexer assumes that if the first letter of the name is upper case,
then then name is a type name. This hack is far from sufficient for
parsing full blown programs, but it is more than sufficient for
experimenting with the grammar to determine the acceptability of a
token sequence, and to understand how my grammar parsed the sequence.
Since I did not believe that a lexical analyser alone would be
sufficient to assist many people with playing with my grammar, I have
also provided the basis for a tool to explain what a grammar is
doing. Specifically, I have modified a file that is included in the
Berkeley YACC distribution so that parsers generated by such a YACC
would automatically display a syntax tree in graphical-ASCII format
during a parse. The instructions for using and building this yacc
tool are presented in the next section. Note that there are no
significant special hooks in my grammar or parser to excite this yacc
tool, and the tool can be used equally well on any grammar that you
are working with.
I have posted all 6 files to comp.lang.c++, to make this information
as available as possible to users and developers. I will also post
this introductory note to comp.compilers, and comp.lang.c. I am
arranging for archival support via several ftp sites, and updates
will be posted to those sites. I will also try to get the source to
Berkeley YACC posted to these ftp sites, although it is certainly
available at more central sites.
HOW TO USE THE DISTRIBUTION FILES TO PLAY WITH THE C++ GRAMMAR
Note that the following instructions assume that you have the
Berkeley YACC source on hand. You can actually use any YACC to
process the grammar, but running the resulting demo (which has no
semantic actions) will tend to be quite boring. If you can't get
hold of the Berkeley YACC, you should at least try to enable the
"debugging options" in your parser to so that you can see in some way
what reductions are taking place. (Hint: search for the letters
"debug" in the C file that your yacc produces...).
1) Get the entire source for Berkeley YACC 1.4 2/25/90
2) Verify that you can make the YACC
3) Rename SKELETON.C to SKELOLD.C, and implant my SKELGRPH.C
in that directory as SKELETON.C
4) Make the yacc using this new SKELETON.C
5) Using the above yacc, process my CPP4.Y file
yacc -dvl cpp4.y
The result should be a file y.tab.c, and y.tab.h
6) Using Flex (replacement for lex) process my CPP4.L file
flex cpp4.l
the result should be yy.lex.c
7) Compile the two files
cc -o cpp4 y.tab.c yy.lex.c
the result should be an executable called cpp4
8) Set the environment variable YYDEBUG to 6
setenv YYDEBUG 6
If you don't do this, the graphical output will not appear!
9) Run the program cpp4
cpp4
10) Try the input:
int a;
11) You should see a nice parse tree. Enjoy. Note that
the lexer DOES NOT INCLUDE A SYMBOL TABLE, and does
NOT KEEP TRACK OF CURRENT SCOPES. The hack (see the
CPP4.L file for details) is to assume that any identifier
that begins with a capital letter is a typedef name
Send complaints about code that doesn't parse "correctly".
INTRODUCTION TO 3/3/90 Version of my release
In light of the recent debate on comp.lang.c++ about the
"impossibility" of generating a YACC grammar for C++ (that doesn't
also accept every possible string of tokens), providing these
grammars at this point is especially nice. Rather than addressing
each of the ad hoc arguments that such an endeavor is impossible (the
least convincing of which said effectively: "... heard it was
impossible, ... something about LALR ...", and the most correct of
which said: "...it is not possible to create a simple YACC grammar
for C++.."), I am offering my grammars as my rather complete
commentary on the topic. (I'll probably get some flaming due to this
comment, but I had to read all the stuff that was posted, that told
me what I had done was impossible. ;-) ) My general feeling is that
it is a shame for opinions to be confused with facts, and for
impossibility to be confused with difficulty or complexity.
Developing the C grammar (that is intended to be compatible with the
ANSI C standard) was relatively straight forward (compared to the C++
grammar). The one difficulty in this process was the desire to avoid
use of %prec and %assoc constructs in YACC, which would tend to
obscure ambiguities. Since I didn't know what ambiguities were lying
in wait in C++, obscuring ambiguities was unacceptable. It took
several weeks to remove the conflicts that typically appear, and the
tedious process exposed several ambiguities that are not currently
disambiguated by the ANSI standard. The quality of the C grammar is
(IMHO) dramatically higher than what has been made available within
the public domain. Specifically, a C grammar's support of
redefinition of typedef names within inner scopes (the most difficult
area of the grammar) is typically excluded from public domain
grammar, and even excluded from most grammars that are supplied
commercially with parser generators! I expect that this grammar will
be very useful in the development of C related tools.
The development of the C++ grammar (initially compatible with version
1.2, but enhanced to support version 2.0 specifications as they were
made available) was anything but straight forward. The requirement
that I set to NOT USE %prec and %assoc proved both a blessing and a
curse. The blessing was that I could see what the problems were in
the language, the curse was that there were A LOT of conflicts (I can
recall times during the development effort when the number of
conflicts was well in excess of 200). Initially I was unaware that
many other attempts had been made and failed, and I went ahead
"blindly" trying to resolve the conflicts in my grammar. After
raising the issues that I noticed with Bjarne Stroustrup, I became
aware that there were some very significant syntactical ambiguities
within the current C++ language. Fortunately, by the time I first
spoke to Dr. Stroustrup, I had already derived some results that
other attempts had not uncovered. Encouraged by my results, I
continued on despite hearing ever louder claims that my goal was
"impossible".
Towards the end of the development of the C++ grammar, which took
roughly 3 months of my time, I began to see the folly in part of my
quest. I came to the conclusion that further attempts to modify my
grammar, so as to defer resolutions of ambiguities, would lead to an
unreadable language. Specifically, my feeling was that I was entering
into a battle of wits with the compiler, and the compiler was
starting to win. It was beginning to be the case that the parser
COULD figure out what I said, but I couldn't. Indeed, even examples
in a version of the C++ 2.0 reference manual demonstrated this
problem (my parser could parse some examples that neither I nor the
authors parsed correctly!). At this point I decided to stop my quest
to FURTHER defer resolutions of ambiguities, and let the grammar
commit in one direction (always in favor of declarations), at the
late point that is provided by my grammar. If this direction proved
"incorrect in light of the context that followed", then I generated a
syntax error. I believe this strategy provides ample room for
expressiveness. In support of this expressiveness, I have (based on
my discussions with language experts) deferred disambiguation far
longer than other attempts at producing an LR(1) grammar. I would
strongly argue that any code that my grammar identifies as having a
"syntax error" (based on "premature" disambiguation), but cfront
allows, should ABSOLUTELY be rewritten in a less ambiguous (and hence
more portable) form. As a final comment, by virtue of the consistent
methodology used to build my grammar, there are a number of clearly
unambiguous constructs that my grammar CAN parse, and cfront 2.0
cannot (not to mention recent versions of GNU's G++ and Zortech's C++
compiler).
One major motivation for using the C++ grammar that I have provided
is that it is capable of supporting old style function definitions
(e.g.: main(argc, argv) int argc; char*argv[]; {...} ). To my
knowledge, no other C++ parser is currently capable of this feat. I
believe this capability was removed from the C++ specification in
order to reduce ambiguities in a specific implementation (cfront). As
my grammar demonstrates, this action was not necessary. Supporting
old style function definition GREATLY eases the transition to the use
of C++ from traditional C. I expect that as some parsers begin to
support this option, that other parsing engines will be forced in
this direction by a competitive marketplace. Using my grammar, and
the standards it implies, appears to be a very straightforward
approach to this support.
A second motivation for using my grammar is that it can be processed
by YACC. The advantage in this fact lies with YACC's capability to
identify ambiguities. For software manufacturers that are heavily
concerned with correctness, this is an INCREDIBLE advantage. My
experience with hand written parsers (which usually represent a
translation by a fallible human from a grammar to parsing code) is
that they evolve and become more correct with time. Ambiguous cases
are often misparsed, without the author ever realizing there was a
conflict! In contrast, either a YACC grammar supports a given
construct, or it doesn't. If a YACC grammar supports a construct,
the semantic interpretation is usually rather straight forward. The
likelihood of internal errors in the parser is therefore
SIGNIFICANTLY reduced. The fact the the grammars I supplied are free
of %assoc and %prec operators, implies the grammar are fairly
portable, and the conflicts are open to peer code review (and not
obscured).
If you find any errors in my grammars, I would be DELIGHTED to hear
mention of them!!!! These should fall into one of the following
categories:
1) The grammar left out the following features of C++...
or
2) The grammar mis-parses the following sequences...
or
3) The discussion of the following ambiguity is
incorrect...
4) The grammar could be simplified as follows...
Please send correspondence of this form to jar at ileaf.com. My response
to 1's will be to add the feature (if possible!); feel sad that I
made a mistake; and feel glad that YOU found it. I will have a
similar response to 2's. Responses of type 3 are GREAT, but I
haven't found many folks that really get into YACC ambiguities, so I
have low expectations... feel free to surprise me!!! :-) :-). Items
of type 3 are interesting, but since simplicity is in the eye of the
beholder, such suggestions are subject to debate. I would be
interested in seeing suggestions in this area with the constraint
that they do not increase the number of conflicts in the grammar!
Please use YACC to check your suggestions before submitting them. (It
is often AMAZING how the slightest change in the grammar can change
the number of conflicts, and it took a great deal of work to reach
the current state). Distribution site(s) will be set up to distribute
updates and or corrections. Postings about the presence of
corrections will be made on the net.
Since the two grammars (C and C++) were generated in parallel, you
should be able to compare non-terminals directly. This will
hopefully make it easier to identify the complexities involved with
the C++ grammar, and "ignore" those that result from standard ANSI C.
In both cases I have left the standard if-if-else ambiguity intact.
In the case of ANSI C grammar, this is the only shift-reduce conflict
in the grammar. Although there are a number of conflicts in the C++
grammar, there are actually very few classes of problems. In order to
disambiguate the C++ grammar enough that YACC can figure out what to
do, I was commonly forced to "inline expand" non-terminals found in
the C grammar. This expansion allowed YACC to defer disambiguation
until it was possible for an LR(1) parser to understand the context.
The unfortunate consequence of this inline expansion is a large
growth in the number of rules, and the presence of an effective
"multiplier" in most cases where conflicts do occur. As a result, any
conflicts that arise are multiplied by a factor corresponding to the
number of rules I had to list separately. I have grouped the C++
grammar conflicts in the "Status" section of the GRAMMAR.TXT paper,
but you are welcome to explore my grammars using YACC directly (be
warned that you will need a robust version of YACC to handle the C++
sized grammar). PLEASE do not be put off by the number of conflicts
in the C++ grammar. There are VERY FEW CONFLICTS, but my elaborated
grammar confuses the count.
The GRAMMAR4.TXT paper is FAR from a publishable quality paper, but
it discusses many of the issues involved in ambiguities in my
grammar, and more generally in the C++ language. I hope GRAMMAR4.TXT
it is a vast improvement over "nothing at all", but apologize in
advance for lack of polished structure, and the presence of many
typos (which must surely be present). I hope you find this
almost-paper interesting. My attempts at documenting conflicts have
certainly clarified the problems in my mind. Based on my experience
with the conflicts I have identified, most current compilers and
translator fall prey to the situations that I have uncovered. I hope
that other compilers, that do not make use of the grammar I have made
available, will at least seek to standardize the resolution of the
problems identified.
As a small commercial message... I am a freelance consultant, and I
travel far and wide to perform contracts. If you like the work that
I am presenting in this group of documents, and would like to see a
resume or at least talk, please feel free to contact me.
I hope that the grammars that I have provided, will lead to many
successful C++ processing projects.
Jim Roskind
Independent Consultant
516 Latania Palm Drive
Indialantic FL 32903
(407)729-4348
jar at ileaf.com
or ...!uunet!leafusa!jar
More information about the Comp.lang.c
mailing list