% Copyright 2012-2014, Alexander Shibakov
% Copyright 2002-2014 Free Software Foundation, Inc.
% This file is part of SPLinT
%
% SPLinT is free software: you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation, either version 3 of the License, or
% (at your option) any later version.
%
% SPLinT is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
% GNU General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with SPLinT. If not, see <http://www.gnu.org/licenses/>.
\input limbo.sty
\input frontmatter.sty
\def\optimization{5}
\input yy.sty
% multi-column output
\input dcols.sty
\let\currentparsernamespace\parsernamespace
\let\parsernamespace\mainnamespace
\let\currenttokeneq\tokeneq
\def\tokeneq#1#2{\prettytoken{#1}}
\input bo.tok % re-use token equivalence table to set the typesetting of tokens
\let\tokeneq\currenttokeneq
\input btokenset.sty
% index entries
\let\parsernamespace\indexpseudonamespace
\prettywordpair{emptyrhs}{$\circ$ {\rm(empty rhs)}}%
\prettywordpair{inline_action}{$\diamond$ {\rm(inline action)}}%
\prettywordpair{TOKEN}{{\tt TOKEN} {\rm(example)}}%
\prettywordpair{token}{{\tt "token"} {\rm(example)}}%
\let\parsernamespace\currentparsernamespace
\let\hostparsernamespace\mainnamespace % the namespace where tokens are looked up
% for typesetting purposes
\immediate\openout\exampletable=\jobname.exl
\def\nontitle#1{{\ttl #1}}
\def\cite[#1]{%
\def\next{#1}\setbox0=\hbox{l}%
[\ifx\next\empty$\,$\hbox{\vrule width\wd0 height\ht0 depth\dp0}$\,$\else \locallink{#1bibref}#1\endlink\fi]%
}
\let\oldN\N
\let\N\textN
\let\M\textM
\defreserved{Y}{\.{Y}}
\showlastactiontrue
@**Introduction.
\setupfootnotes
\splint\footnote{I was tempted to call the package {\tt ParLALRgram}
which stands for Parsing {\sc LALR} Grammars or {\tt PinT} for
`Parsing in \TeX' but both sounded too generic.} (Simple Parsing and
Lexing in \TeX, or, following the great GNU
tradition of creating recursive names, \splint\ Parses Languages
in \TeX) is a system (or
rather a m\'elange of systems) designed to
facilitate developing parsing macros in \TeX\ and (to a lesser
degree) documenting parsers written in other languages. As
an application, a parser for \bison\ input file syntax has been
developed, along with a macro collection that makes it possible to
design and pretty print \bison\ grammars using \CWEB.
Developing software in \CWEB\ involves two programs. The first of these is
\CTANGLE\ that outputs the actual code, intended to be in
\Cee. In reality, \CTANGLE\ cares very little about the language it
produces. Exceptions are \Cee\ comments and |@[#line@]| directives that might
confuse lesser software, although \bison\ is all too happy to swallow them
(there are also some \Cee\ specific constructs that \CTANGLE\ tries to
recognize). \CTANGLE's main function is to rearrange the text of the
program as written by the programmer (in a way that, hopefully,
emphasizes the internal logic of the code) into an appropriate
sequence (e.g.~all variable declarations must textually precede their
use). All that is required to adapt \CTANGLE\ to produce \bison\
output is some very rudimentary post- and pre-processing.
Our main concern is thus \CWEAVE\ that not only pretty prints the
program but also creates an index, cross-references all the
sections, etc. Getting \CWEAVE\ to pretty print a language other than
\Cee\ requires some additional attention. A true digital warrior would
probably try to decipher \CWEAVE's output `in the raw' but, alas, my
WebFu is not that strong. The loophole comes in the form of a rarely
(for a good reason) used \CWEB\ command: the verbatim (\.{@@=...@@>})
output. The material to be output by this construct undergoes minimal
processing and is put inside \.{\\vb\{}$\ldots$\.{\}}. All that is
needed now is a way to process this virtually straight text inside \TeX.
@*1 Using the \bison\ parser.
The process of using \splint\ for writing parsing macros in \TeX\ is
treated in considerable detail later in this document. We begin,
instead, by outlining how one such parser can be used to pretty print a
\bison\ grammar. Following the convention mentioned above and putting
all non-\Cee\ code inside \CWEAVE's verbatim blocks, consider the
following (meaningless) code fragment. The fragment contains a mixture
of \Cee\ and \bison\ code, the former appears outside of the verbatim blocks.
\begindemo
^@@= non_terminal: @@>
^@@= term.1 term.2 {@@> a = b; @@=}@@>
^@@= **H term.3 other_term {@@> $$ = $1; @@=}@@>
^@@= **H still more terms {@@> f($1); @@=}@@>
^@@= ; @@>
\enddemo
The fragment above will appear as (the output of \CTANGLE\ can be
examined in \.{sill.y})
@=
@G
non_terminal:
term.1 term.2 {@> a = b; @=}
| term.3 other_term {@> $$ = $1; @=}
| still more terms {@> f($1); @=}
;
@g
@ $\ldots$ if the syntax is correct.
In case it is a bit off, the parser will give up and
you will see a different result. The code in the fragment below is easily
recognizable, and some parts of it (all of \Cee\ code, in fact) are
still pretty printed in \CWEAVE. Only the verbatim portion is left
unprocessed.
@=
@G
whoops
term.1 term.2 {@>@+ a = b; @+@=}
| term.3 other_term {@>@+ $$ = $1; @+@=}
| still more terms {@>@+ f($1); @+@=}
;
@g
@ The \TeX\ header that makes such output possible is quite plain. In this example
(i.e.\ this very file) it consists of
\begindemo
^\input limbo.sty
^\input frontmatter.sty
^\input yy.sty
\nooutput
\enddemo
The first two lines are presented here merely for completeness: there is
no parsing-relevant code in them. The line that
follows loads the macros that implement the parsing and scanning
machinery. This is enough to set up all the basic
mechanisms used by the parsing and lexing macros. The rest of the header
provides a few definitions to fine tune the typesetting of
grammar productions. It starts with
\begindemo
^\let\currentparsernamespace\parsernamespace
^ \let\parsernamespace\mainnamespace
^ \let\currenttokeneq\tokeneq
^ \def\tokeneq#1#2{\prettytoken{#1}}
^ \input bo.tok % re-use token equivalence table to set the typesetting of tokens
^ \let\tokeneq\currenttokeneq
^ \input btokenset.sty
\nooutput
\enddemo
We will have a chance to discuss all the \.{\\}$\ldots$\.{namespace}
macros later; at this point it will suffice to say that the lines
above are responsible for controlling the typesetting of term names. The
file \.{bo.tok} consists of a number of lines like the ones below:
\begindemo
^\tokeneq {STRING}{{34}{115}{116}{114}{105}{110}{103}{34}}
^\tokeneq {PERCENT_TOKEN}{{34}{37}{116}{111}{107}{101}{110}{34}}
\nooutput
\enddemo
The cryptic looking sequences of integers above are strings of {\sc ASCII}
codes of the letters that form the name \bison\ uses when it needs to
refer to the corresponding token (thus, the second one is
\toksa{}\numberstochars{34}{37}{116}{111}{107}{101}{110}{34}\end
\.{\the\toksa} which might help explain why such an elaborate scheme
has been chosen). The macro \.{\\tokeneq} is defined in
\.{yymisc.sty}, which in turn is input by \.{yy.sty} but what about
the token names themselves? In this case they were extracted
automatically from the \CWEB\ source file by the parser during the
\CWEAVE\ processing stage. All of these definitions can be
overwritten to get the desired output (say, one might want to typeset
\.{ID} in a roman font, as `identifier'; all that needs to be done is
a macro that says \.{\\prettywordpair\{ID\}\{\{\\rm
identifier\}\}}). The file \.{btokenset.sty} input above contains a
number of such definitions.
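For example, a couple of such redefinitions (issued inside the
appropriate namespace, as in the header above) might look like the
lines below; the treatment of \.{ID} was suggested a moment ago,
while \.{INT} is merely an illustrative token name:
\begindemo
^\prettywordpair{ID}{{\rm identifier}}
^\prettywordpair{INT}{{\rm integer}}
\enddemo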
@ To round off this short overview, I must mention a caveat associated
with using the macros in this collection: while one of the greatest
advantages of using \CWEB\ is its ability to rearrange the code in a
very flexible way, the parser will either give up or produce
unintended output if this feature is abused while describing the
grammar. For example, in the code below
@=
@G
next_term:
stuff @> @ @={@> a = f( x ); @=}
@g
@@;
@ the line titled |@| is intended to be a rule defined
later. Notice that while it seems that the parser was able to recognize
the first code fragment as a valid \bison\ input, it misplaced the
|@|, having erroneously assumed it to be a part of
the action code for this grammar (later on we will go into the details of
why it is necessary to collect all the non-verbatim output of \CWEAVE,
even the one that contains no interesting \Cee\ code; hint: it has
something to do with money (\.{\$}), also known as math and the way
\CWEAVE\ processes the `gaps' between verbatim sections). The production
line that follows did not fare as well: the parser gave up. There
is simply no point in including such a small language fragment as a
valid input for the grammar the parser uses to process the verbatim
output.
@=
@G
more stuff in this line {@> @[b = g(y);@]@=}
@g
@ Finally, if you forget that only the verbatim part of the output is
looked at by the parser you might get something unrecognizable, such
as
@=
but not all of it
@ To correct this, one can provide a more complete grammar fragment to
allow the parser to complete its task successfully. In some cases,
this imposes too strict a constraint on the programmer. Instead, the
parser that pretty prints \bison\ grammars allows one to add {\it
hidden context\/} to the code fragments above. The context is added
inside \.{\\vb} sections using \CWEB's \.{@@t}$\ldots$\.{@@>} facility. The \CTANGLE\
output is not affected by this while the code above can now be typeset as:
@=
@G
next_term:
stuff @> @t}\vb{\formatlocal{\let\peekstash\stashtoterm}}{@> @ @t}\vb{FAKE}{@> @={@> a = f( x ); @=}
@g
@@;
@ $\ldots$ even a single line can now be displayed properly.
@=
@G
@g
@g
@t}\vb{\formatlocal{\skipheader} FAKE:}{@>
@G
more stuff in this line {@> b = g( y ); @=}
@g
@ With enough hidden context, even a small rule fragment can be
typeset as intended. The `action star' was inserted to reveal some of
the context.
@=
@G
@g
@g
@t}\vb{\formatlocal{\skipheader} FAKE:}{@>
@G
but not all of it
@g
@t}\vb{\{\stashed{$\star$}\}}{@>
@ What makes all of this even more confusing is that \CTANGLE\ will
have no trouble outputting this as a(n almost, due to the
intentionally bad \.{whoops} production above) valid \bison\ file
(as can be checked by looking into \.{sill.y}). The author
happens to think that one should not fragment the software into pieces
that are too small: \bison\ is not \Cee\ so it makes sense to write
\bison\ code differently. However, if the logic behind your code
organization demands such fine fragmentation, hidden context provides
you with a tool to show it off. A look inside the source of this
document shows that adding hidden context can be a bit ugly so it is
not recommended for routine use. The short example above is output in
the file below.
@(sill.y@>=
@@;
@*1 On debugging. This concludes a short introduction to the \bison\
grammar pretty printing using this macro collection. It would be
incomplete, however, without any reference to debugging\footnote{Here
we are talking about debugging the output produced by \CWEAVE\ when
the included \bison\ parser is used, {\it not\/} debugging parsers
written with the help of this software: the latter topic is covered in more
detail later on}. There is a
fair amount of debugging information that the macros can output;
unfortunately, very little of it is tailored to the {\it use\/} of the
macros in the \bison\ parser. Most of it is designed to help {\it
build\/} a new parser. If you find that the parser gives up too often
or even crashes (the latter is most certainly a bug in the parser
itself), the first approach is to make sure that your code {\it
compiles\/}, i.e.\ forget about the printed output and try to see if
the `real' \bison\ accepts the code (just the syntax, no need to
worry about conflicts and such).
If this does not shed any light on why the macros seem to fail, turn
on the debugging output by saying \.{\\trace$\ldots$true} for various
trace macros. This can produce {\it a lot\/} of output, even for
small fragments, so turn it on only for a section at a time. If you
need still {\it more\/} details of the inner workings of the parser
and the lexer, various other debugging conditionals are available. For
example, \.{\\yyflexdebugtrue} turns on the debugging output for the
scanner. There are a number of such conditionals that are discussed in
the commentary for the appropriate \TeX\ macros.
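In practice, a debugging session for a single section might be set up
as in the sketch below; \.{\\yyflexdebugtrue} was mentioned above,
whereas \.{\\traceactionstrue} merely stands in for one of the
\.{\\trace$\ldots$true} switches and is not necessarily the name of an
actual macro:
\begindemo
^\traceactionstrue
^\yyflexdebugtrue
^ ... % the section being debugged
^\traceactionsfalse
^\yyflexdebugfalse
\enddemo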
Remember, what you are seeing at this point is the parsing process of
the \bison\ input file, not the one for {\it your\/} grammar (which
might not even be complete at this point). However, if this fails, you
are on your own: drop me a line if you figure out how to fix any bugs
you find.
@*1 Terminology. We now list a few definitions of the concepts used
repeatedly in this documentation. Most of this terminology is
rather standard. Formal precision is not the goal here, and intuitive
explanations are substituted whenever possible.
{%
\def\aterm#1{\item{\sqebullet}{\ttl #1}: \ignorespaces}%
\setbox0=\hbox{\sqebullet\enspace}
\parindent=0pt
\advance\parindent by \wd0
\smallskip
\aterm{bison parser} while, strictly speaking, not a formally defined
term, this combination will always stand for one of the parsers generated
by this package designed to parse a subset of the `official' grammar for
\bison\ input files. All of these parsers are described later in
this documentation. The term {\it main parser\/} will be
used as a substitute in example documentation for the same purpose.
\aterm{driver} a generic but poorly defined concept. In this
documentation it is used predominantly to mean both the \Cee\ code and
the resulting executable that outputs the \TeX\ macros that contain the
parser tables, token values, etc., for the parsers built by the user. It
is understood that the \Cee\ code of the `driver' is unchanged and the
information about the parser itself is obtained by {\it including\/} the \Cee\
file produced by \bison\ in the `driver' (see the examples supplied
with the package).
\aterm{lexer} a synonym for {\it scanner}, a subroutine that performs the {\it
lexical analysis\/} phase of the parsing process, i.e.\ groups various
characters from the input stream into parser {\it tokens}.
\aterm{namespace} this is an overused bit of terminology meaning a
set of names grouped together according to some relatively
well defined principle. In a language without a well developed type
system (such as \TeX) it is usually accompanied by a specially designed
naming scheme. {\it Parser namespaces\/} are commonly used in this
documentation to mean a collection of all the data structures describing a
parser and its state, including tables, stacks, etc., named by using the
`root' name (say \.{\\yytable}) and adding the name of the parser (for
example, \.{[main]}). To support this naming scheme, a number of
macros work in unison to create and rename the `data macros' accordingly.
\aterm{symbolic switch} a macro (or an associative array of macros)
that lets the \TeX\ parser generated by the package associate {\it
symbolic term names\/} with the terms. Unlike the `real' parser, the
parser created with this suite requires some extra setup as explained
in the included examples (one can also consult the source for this
documentation which creates but does not use a symbolic switch).
\aterm{symbolic term name} a (relatively new) way to refer to stack
values in \bison. In addition to using the `positional' names such as
\.{\$}$n$ to refer to term values, one can utilize the new syntax:
\.{\$}\.{[}{\it name\/}\.{]}. The `{\it name}' can be assigned by the
user or can be the name of the nonterminal or token used in the
productions.
\aterm{term} in a narrow sense, an `element' of a grammar. Instead of
a long-winded definition, an example such as \prodstyle{ID} should
suffice. Terms are further classified into {\it terminals\/} (tokens)
and {\it nonterminals\/} (which can be intuitively thought of as
composite terms).
\aterm{token} in short, an element of a set. Usually encoded as an
integer by most parsers, an indivisible {\it term\/}
produced for the parser by the scanner. \TeX's scanner uses a more
sophisticated token classification, for example, $($character code,
character category$)$ pairs, etc.
}
@** Languages, scanners, parsers, and \TeX. % Or $\ldots$
$$\vbox{\halign to\hsize{\kern-1.5pt\it#\hfil\tabskip0pt plus1fil\cr
Tokens and tables keep macros in check.\cr
Make 'em with \bison, use \.{WEAVE} as a tool.\cr
Add \TeX\ and \CTANGLE, and \Cee\ to the pool.\cr
Reduce 'em with actions, look forward, not back.\cr
Macros, productions, recursion and stack!\cr
\noalign{\vskip2pt}
\omit\hfil\eightpoint Computer generated (most likely)\cr}}
$$
\def\recount#1{${}^{(#1)}$}%
In order to understand the parsing routines in this collection,
it would help to gain some familiarity with the internals of the
parsers produced by \bison\ for its intended target: \Cee. A person
looking inside a parser delivered by \bison\ would
quickly discover that the parsing procedure itself (|yyparse|)
occupies a rather small portion of the file. If (s)he were to further
reduce the size of the file by removing all the preprocessor
directives intended to anticipate every conceivable combination of the
operating system, compiler, and \Cee\ dialect, and various reporting
and error logging functions, it would become very clear that the most
valuable product of \bison's labor is a collection of integer {\it
tables\/} that control the actions of the parser routine. Moreover,
the routine itself is an extremely concise and well-structured loop
composed of |goto|'s and a number of numerical conditionals. If one
were to think of a way of accessing arrays and processing conditionals
in the language of one's choice, once the tables produced by \bison\
have been converted into a form suitable for the consumption by the
appropriate language engine, the parser implementation becomes
straightforward. Or nearly so.
The {\it scanning\/} (or {\it lexing\/}) step of this process---a way
to convert a stream of symbols into a stream of integers, also
deserves some attention here. There are a number of excellent tools
written to automate this step in much the same fashion as \bison\
automates the generation of parsers. One such tool, \flex, though
(in the opinion of this author) slightly lacking in simplicity and
elegance compared to \bison, was used to implement the lexer for
this software suite. Lexing in \TeX\ will be discussed in considerable
detail later in this manual.
The language of interest in our case is, of course, \TeX, so our
future discussion will revolve around the five elements mentioned
above: \recount{1}data structures (mainly arrays and stacks),
\recount{2}converting
\bison's output into a form suitable for \TeX's consumption,
\recount{3}processing raw streams of \TeX's tokens and converting them into
streams of parser tokens, \recount{4}the implementation of \bison's
|yyparse| in \TeX, and, finally, \recount{5}producing \TeX\ output via {\it
syntax-directed translation} (which requires an appropriate
abstraction to represent \bison's actions inside \TeX). We shall
begin by discussing the parsing process itself.
@*1 Arrays, stacks and the parser.
Let us briefly examine the programming environment offered by \TeX.
Designed for typesetting, \TeX's remarkable language
provides a layer of macro processing atop of a set of commands that
produce the output fulfilling its primary mission: delivering page
layouts. In The \TeX book, macro {\it expansion\/} is likened to
mastication, whereas \TeX's main product, the typographic output is the
result of its `digestion' process. Not everything that goes through
\TeX's digestive tract ends up leaving a trace on the final page: a
file full of \.{\\relax}'s will produce no output, even though
\.{\\relax} is not a macro, and thus would have to be processed by
\TeX\ at the lowest level.
It is time to describe the details of defining suitable data structures
in \TeX. At first glance, \TeX\ provides rather standard means of
organizing and using general memory. At the core of its generic
programming environment is an array of \.{\\count}$\,n$ {\it
registers\/}, which may be viewed as general purpose integer variables
that are randomly accessible by their indices. The integer arithmetic
machinery offered by \TeX\ is spartan but quite adequate for the sort of
operations a parser would perform: mostly additions and
comparisons.
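To illustrate, nearly all of the arithmetic a parser performs fits in
the following few lines (the register number is arbitrary):
\begindemo
^\count37=12
^\advance\count37 by 5
^\ifnum\count37>16
^ ...
^\fi
\enddemo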
Is the \.{\\count} array a good way to store tables in \TeX? Probably
not. The first factor is the {\it size\/} of this array: only 256
\.{\\count} registers exist in a standard \TeX\ (the actual number of
such registers on a typical machine running \TeX\ is significantly
higher but this author is a great believer in standards, and to his
knowledge, none of the standardization efforts in the \TeX\ world has
resulted in anything even close to the definitive masterpiece that is
The \TeX book). The issue of size can be mitigated to some extent by
using a number of other similar arrays used by \TeX\ (\.{\\catcode},
\.{\\uccode}, \.{\\dimen}, \.{\\sfcode} and others can be used for
this purpose as long as one takes care to restore the `sane' values
before control is handed off to \TeX's typesetting mechanisms). If a
table has to span several such arrays, however, the complexity of
accessing code would have to increase significantly, and the issue of
size would still haunt the programmer.
The second factor is the use of several registers by \TeX\ for special
purposes (in addition, some of these registers can only store a
limited range of values). Thus, the first 10 \.{\\count} registers are
used by plain \TeX\ for (well, {\it intended\/} for, anyway) the
purposes of page accounting: their values would have to be carefully
saved and restored before and after each parsing call,
respectively. Other registers (\.{\\catcode} in particular) have even
more disrupting effects on \TeX's internal mechanisms. While all of
this can be managed (after all, using \TeX\ as an arithmetic engine,
the way a parser does, suspends the need for any typographic or other
specialized functions controlled by these arrays), the added
complexity of using several memory banks simultaneously and the speed penalty
caused by the need to store and restore register values make this
approach much less attractive.
What other means of storing arrays are provided by \TeX? Essentially,
only three options remain: \.{\\toks} registers, macros holding whole
arrays, and associative arrays accessed through
\.{\\csname}$\,\ldots\,$\.{\\endcsname}. In the first two cases, if care
is taken to store such arrays in an
appropriate form, one can use \TeX's \.{\\ifcase} primitive to access
individual elements. The trade-off is the speed of such
access: it is {\it linear\/} in the size of the array for most
operations, and worse than that for others, such as removing the last
item of an array. Using clever ways
of organizing such arrays, one can improve the linear access time to
$O(\log n)$ by simply modifying the access macros but at the moment, a
straightforward \.{\\ifcase} is used after expanding a list macro or
the contents of a \.{\\toks}$\,n$ register in an {\it un\/}optimized
parser. An {\it optimized\/} parser uses associative arrays.
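As a sketch of the unoptimized access (both names below are invented
for the sake of the example), a small table and its \.{\\ifcase}
lookup might take the following form:
\begindemo
^\newcount\tableindex
^\def\smalltable{\ifcase\tableindex 5\or 13\or 8\or 21\else 0\fi}
^\tableindex=2
^\count0=\smalltable\relax
\enddemo
After these assignments, \.{\\count0} holds $8$: the \.{\\ifcase}
simply `walks' to the entry indexed by \.{\\tableindex} (starting
at~$0$) in linear time.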
The array discussion above is just as applicable to {\it stacks\/}
(indeed, an array is the most common form of stack
implementation). Since stacks pop up and disappear frequently (what
else are stacks to do?), list macros are usually used to store
them. The optimized parser uses a separate \.{\\count} register to
keep track of the top of the stack in the appropriate associative
array.
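A list macro implementation of a stack can be as simple as the sketch
below (the names are illustrative, \.{\\pop} assumes a nonempty
stack, and the \.{\\edef} presumes expansion-safe items such as
integers):
\begindemo
^\def\stack{}% the stack: a list of brace groups, top first
^\def\push#1{\edef\stack{{#1}\stack}}% prepend an item
^\def\pop{\expandafter\popoff\stack\endlist}% split off the top
^\def\popoff#1#2\endlist{\def\stacktop{#1}\def\stack{#2}}
\enddemo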
Let us now switch our attention
to the code that implements the parser and scanner {\it functions\/}.
If one has spent some time writing \TeX\ macros of any sophistication
(or any macros, for that matter), (s)he must be familiar with the general
feeling of frustration and the desire to `just call a function here and move
on'. Macros produce {\it tokens\/}, however, and tokens must either
expand to nothing or stay and be contributed to your input, or worse,
be out of place and produce an error. One way to sustain a stream
of execution with macros is {\it tail recursion\/} (i.e.~always expanding the
{\it last token left standing}).
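A minimal example of the technique (with invented names) is the
countdown below: the \.{\\expandafter} disposes of the \.{\\fi}
first, so that \.{\\countdown} is indeed the last token left
standing.
\begindemo
^\newcount\steps
^\def\countdown{%
^ \ifnum\steps>0
^ \advance\steps by -1
^ \expandafter\countdown
^ \fi
^}
\enddemo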
As we have already discussed, \bison's
|yyparse()| is a well laid out loop organized as a sequence of
|goto|'s (no reason to become religious about structured programming
here). This fact, and the following well known trick, make \Cee\ to \TeX\
translation almost straightforward.
% The macro mess below looks painful but this is the only place such layout is used
% The approach can be easily generalized and put in limbo.sty but it seems
% a bit redundant at this point.
\newcount\piccount
\newdimen\lasthsize
\setbox5=\vtop{
\demomargin=0pt
\let\demoastyle\empty
\begindemo
^label A: ...
\nooutput
^ if**L**Krm(condition)**N
^ goto C;
\nooutput
^label B: ...
\nooutput
^ goto A;
\nooutput
^label C: ...
\nooutput
\enddemo
}
\dp5=\z@@
\setbox3=\vtop{
\demomargin=0pt
\let\demoastyle\empty
\begindemo
^\if**L**Krm(condition)**N
^ \let\next=\labelC
^\else
^ \let\next=\labelAtail
\enddemo
}
\dp3=\z@@
\newdimen\lastdepth
\def\startfitpar{%
\bgroup
\lasthsize=\hsize
\advance\lasthsize-1.5in
\vsize=\baselineskip
\topskip=\z@@
\setbox0\box2 % empty it
% this sounds good at first but there is no good way to pull the insertions out after the
% box manipulations that follow;
% insertions will thus be contributed to whatever page was being worked on when the
% picture insertions {\it started}; hence, if these happen to start at the very top of the page,
% any insertion that follows will be contributed to the previous page; we correct this for footnotes
% below
% \holdinginserts=1
\output{%
\global\setbox2=\vbox{
\ifvoid2
\else
\prevdepth=\dp2
\unvbox2
\fi
\lastdepth=\dp255
\unvbox255
% this would be tempting, however, the \eject that follows should disappear
% in addition, one really should not be playing with page breaking in the middle of
% such tricky insertions
% \penalty\outputpenalty
% \kern-\lastdepth % to make sure \baselineskip is accounted for
}%
}\eject
\output{%
\setbox0=\vbox{%
\unvbox255%
}% \lastbox would almost work ... if not for insertions
\global\advance\piccount1
\global\setbox2=\vbox{%
\prevdepth=\dp2 \unvbox2
\hbox to\hsize{%
\ifnum\piccount<15
\hbox to1.5in{%
\ifnum\piccount=1
\ \box5
\fi
\hfill}%
\fi
\box0 \hfill
\ifnum\piccount=1
\box3 \ %
\fi
\ifvoid\footins % reinsert footnotes
\else
\insert\footins{\unvbox\footins}%
\fi
}%
}%
}%
\parshape=15
0pt 2.7in
0pt 2.7in
0pt 2.7in
0pt 2.7in
0pt 2.7in
0pt 2.7in
0pt 2.7in
0pt \lasthsize
0pt \lasthsize
0pt \lasthsize
0pt \lasthsize
0pt \lasthsize
0pt \lasthsize
0pt \lasthsize
0pt \hsize
}
\def\endfitpar{%
\par
\eject
\egroup
% see the comment above
% \holdinginserts=0
\prevdepth=\dp2
\unvbox2
}
\startfitpar
\noindent Given the code on the left (where |goto|'s
are the only means of branching but can appear inside conditionals),
one way to translate it into \TeX\ is to define a set of macros (call
them \.{\\labelA}, \.{\\labelAtail} and so forth for clarity) that end in
\.{\\next} (a common name for this purpose). Now, \.{\\labelA} will
implement the code that comes between \.{label A:} and \.{goto C;},
whereas \.{\\labelAtail} is responsible for the code after \.{goto C;}
and before \.{label B:}
(provided no other |goto|'s intervene which can always be
arranged). The conditional which precedes \.{goto C;} can now be written in
\TeX\ as presented on the right, where (condition) is an appropriate
translation of the corresponding condition
in the code being translated (usually, one of `$=$' or `$\not=$'). Further
details can be extracted from the \TeX\ code that implements these
functions where the corresponding \Cee\ code is presented alongside
the macros that mimic its functionality%
\footnote{Running the risk of overloading the reader with details, the author
would like to note that the actual implementation follows a {\it slightly\/} different
route in order to avoid any \.{\\let} assignments or changing the
meaning of \.{\\next}}.
This concludes an overview of the general approach.
It is time to consider the way characters get consumed
on the lower levels of the macro hierarchy and the interaction between the different
layers of the package.
\endfitpar
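Assembled into a single macro, the translation of \.{label A:} above
might thus look as follows; as the footnote points out, the actual
implementation avoids the \.{\\let} assignments, and \.{\\condition}
and \.{\\met} merely stand for an appropriate numeric test:
\begindemo
^\def\labelA{%
^ ... % the code between label A: and the conditional
^ \ifnum\condition=\met
^ \let\next=\labelC
^ \else
^ \let\next=\labelAtail
^ \fi
^ \next % expand whichever label was selected
^}
\enddemo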
@*1 \TeX\ into tokens.
Thus far we have covered the ideas
behind items \recount{1} and \recount{4} on our list. It is time to
discuss the lowest level of processing done by these macros:
converting \TeX's tokens into the tokens consumed by the parser,
i.e.\ part~\recount{3} of the plan. Perhaps it would be most appropriate
to begin by defining the term {\it token}.
As commonly defined, a token is simply an element of a set. Depending on
how much structure the said set possesses, a token can be represented by
an integer or a more complicated data structure. In the discussion
below, we will be dealing with two kinds of tokens: the tokens
consumed by the parsers and the \TeX\ tokens seen by the input
routines. The latter play the role of {\it characters\/} that combine
to become the former. \bison's internal representation for its tokens
is non-negative integers so this is what a scanner must
produce.
\TeX's tokens are a good deal more sophisticated: they can be
either pairs $(c_{\rm ch}, c_{\rm cat})$, where $c_{\rm ch}$ is the
character code and $c_{\rm cat}$ is \TeX's category code ($1$ and $2$ for
group characters, $5$ for end of line, etc.), or {\it control
sequences\/}, such as \.{\\relax}. Some of these tokens (control
sequences and {\it active}, i.e.~category~13 characters) can have
complicated internal structure (expansion). The situation is further
complicated by \TeX's \.{\\let} facility, which can create
`character-like' control sequences, and the lack of conditionals
to distinguish them from the `real' characters. Finally, not all pairs
can appear as part of the input (say, there is no $(n, 0)$ token for
any $n$, in the terminology above).
The scanner expects to see {\it characters} in its input, which are
represented by their {\sc ASCII} codes, i.e.~integers between $0$ and
$255$ (actually, a more general notion of the Unicode character is
supported but we will not discuss it further). Before character codes
appear as the input to the scanner, however, and make its integer
table-driven mechanism `tick', a lot of work must be done to collect
and process the stream of \TeX\ tokens produced after \CWEAVE\ is done
with your input. This work becomes further complicated when the
typesetting routines that interpret the parser's output must sneak
outside of the parsed stream of text (which is structured by the
parser) and insert the original \TeX\ code produced by \CWEAVE\ into
the page.
\splint\ comes with a customizable input routine of
moderate complexity (\.{\\yyinput}) that classifies all \TeX\ tokens
into seven categories: `normal' spaces (i.e.~category~10 tokens,
skipped by \TeX's parameter scanning mechanism),
`explicit' spaces (includes the control sequences \.{\\let} to \.{\ },
as well as \.{\\\ }), groups ({\it avoid} using \.{\\bgroup} and \.{\\egroup} in
your input but `real', \.{\{}$\ldots$\.{\}} groups are fine), active
characters, normal characters (of all character categories that can
appear in \TeX\ input, including \.{\$}, \.{\^}, \.{\#}, \.{a}--\.{Z},
etc.), single letter control sequences, and multi-letter control
sequences. Each of these categories can be processed separately to
`fine-tune' the input routine to the problem at hand. The input
routine is not very fast; instead, flexibility was the main
goal. Therefore, if speed is desirable, a customized input routine
is a great place to start. As an example, a minimalistic
\.{\\yyinputtrivial} macro is included.
When \.{\\yyinput} `returns' by calling \.{\\yyreturn} (which is a
macro you design), your lexing routines have access to three
registers: \.{\\yycp@@}, that holds the character value of the
character just consumed by \.{\\yyinput}, \.{\\yybyte}, that most of
the time holds the token just removed from the input,
and \.{\\yybytepure}, that (again, with very few
exceptions) holds a `normalized' version of the read character (i.e.~a
character of the same character code as \.{\\yycp@@}, and category~11
(to be even more precise (and to use nested parentheses), `normalized'
characters have the same category code as the current category code of
\.{@@})).
Most of the time it is the character code one needs (say, in the case
of \.{\\\{}, \.{\\\}}, \.{\\\&} and so on) but under some circumstances the
distinction is important (outside of \.{\\vb\{}$\ldots$\.{\}}, the sequence
\.{\\1} has nothing to do with the digit `\.{1}'). This mechanism
makes it easy to examine the consumed token. It also forms
the foundation of the `hidden context' passing mechanism described later.
The remainder of this section discusses the internals of \.{\\yyinput}
and some of the design trade-offs one has to make while working on
processing general \TeX\ token streams. It is typeset in `small print'
and can be skipped if desired.
\smallskip
\begingroup
\abovedisplayskip=5pt%
\abovedisplayshortskip=2pt%
\belowdisplayskip=5pt%
\belowdisplayshortskip=2pt%
\fnotesstart=1
\fnotesspan=2
\noofcolumns=2
\icgap=1em%
\eightpoint
\linecount=73
\setmcparams
\def\.#1{{\chardef\\=`\\\chardef\&=`\&\tt #1}}%
\dsskip=0pt%
\begindoublecols
To examine every token in its path (including spaces that are easy to
skip), the input routine uses one of the two well-known {\sc \TeX}nologies:
\.{\\futurelet\\next\\examinenext} or the equally effective
\hbox{\.{\\afterassignment\\next\\let={\tt\char"20}}}.
Recursively inserting one of these sequences, \.{\\yyinput} can go
through any list of tokens, as long as it knows where to stop
(i.e.~return an end of file character). The
signal to stop is provided by the \.{\\yyeof}
primitive which should not appear in any `ordinary' text
presented for parsing, other than for the purpose of providing such a
stop signal. Even the dependence on \.{\\yyeof} can be eliminated if
one is willing to invest the time in writing macros that juggle \TeX's
\.{\\token} registers and only limit oneself to input from such
registers (which is, aside from an obvious efficiency hit, a strain on
\TeX's memory, as you have to store multiple (3 in the general case)
copies of your input to be able to back up when the lexer makes a
wrong choice). There does not seem to be a way of doing it unless the
text has been stored in a \.{\\token} register first (or storing the
whole input as a {\it parameter\/} for the appropriate macro: this
scheme is remarkably powerful and leads to {\it expandable\/} versions
of very complicated macros, although the amount of effort required to
write such macros grows at a frightening rate). All of these are
non-issues for the text inside \.{\\vb\{}$\ldots$\.{\}} and the care that
\.{\\yyinput} takes in processing characters inside such lists is an
overkill. In a more `hostile' environment (such as the one encountered
by the now obsolete \.{\\Tex} macros), this extra attention to detail pays
off in the form of a more robust input mechanism.
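The lookahead loop described above can be sketched as follows. This is a minimal illustration, {\it not\/} the actual \.{\\yyinput}: the names \.{\\scanloop}, \.{\\examinenext}, \.{\\consumeone}, and the \.{\\relax}-based stand-in for \.{\\yyeof} are all invented for the example.

```tex
% A minimal sketch of the \futurelet-based token loop; NOT the real
% \yyinput. All names below are illustrative.
\let\myeof\relax % stand-in for the \yyeof end marker
\def\scanloop{\futurelet\next\examinenext}
\def\examinenext{%
  \ifx\next\myeof
    \expandafter\stopscanning
  \else
    \expandafter\consumeone
  \fi
}
\def\consumeone#1{% grabs the token just examined in \next
  % ... classify and process #1 here ...
  \scanloop % recurse on the remaining input
}
\def\stopscanning#1{}% discard the end marker and stop
% Usage: \scanloop a b c\myeof
% Caveat: grabbing #1 strips outer braces and skips spaces, which is
% precisely why the real \yyinput needs the extra machinery discussed
% in the surrounding text.
```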
One subtlety deserves a special mention here, as it can be important
to the designer of `higher-level' scanning macros. Two types of tokens
are extremely difficult to deal with whenever \TeX's own lexing
mechanisms are used: (implicit) spaces and, even more so, braces. We
will only discuss braces here; however, almost everything that follows
applies equally well to spaces (category 10 tokens, to be precise), with
a few simplifications (or complications, in a couple of places). To
understand the difficulty, let's consider one of the approaches above:
$$
\.{\\futurelet\\next\\examinenext}.
$$
The macro \.{\\examinenext}
usually looks at \.{\\next} and inserts another macro (usually also called
\.{\\next}) at the very end of its expansion list. This macro usually
takes one parameter, to consume the next token. This mechanism works
flawlessly, until the lexer encounters a \.{\{}br\.{,}sp\.{\}}ace. The \.{\\next}
sequence, seen by \.{\\examinenext} contains a lot of information
about the brace ahead: it knows its category code (left brace, so $1$), its
character code (in case there was, say, a \.{\\catcode`\\[=1{\tt\char`\ }}
earlier) but not whether it is a `real' brace (i.e.\ a character
\.{\{}$_1$) or an implicit one (a \.{\\bgroup}). There is no way to find
that out until the control sequence `launched' by \.{\\examinenext}
sees the token as a parameter.
If the next token is a `real' brace, however,
\.{\\examinenext}'s successor will never see the token itself: the
braces are stripped by \TeX's scanning mechanism. Even if it finds a
\.{\\bgroup} as the parameter, there is no guarantee that the actual
input was not \.{\{\\bgroup\}}. One way to handle this is by using
\.{\\string} ahead of any consumption of the next token. If prior to
expanding \.{\\string} care has been taken to set the \.{\\escapechar}
appropriately (remember, we know the character code in advance), as
soon as one sees a character with \.{\\escapechar}'s character code,
(s)he knows that an implicit brace has just been seen. One added
complication to all this is that a very determined programmer can
insert an {\it active\/} character (using, say, the \.{\\uccode}
mechanism) that has the {\it same\/} character code as the {\it
brace\/} token that it has been \.{\\let} to! Setting this possibility
aside, the \.{\\string} mechanism (or, its cousin, \.{\\meaning}) is
not perfect: both produce a sequence of category 12 and 10 tokens. If
it is indeed a brace character that we just saw, we can consume the next
token and move on but what if this was a control sequence? After all,
just as easily as \.{\\string} makes a control sequence into characters,
a \.{\\csname}$\,\ldots\,$\.{\\endcsname} pair will make any sequence of
characters into a control sequence. Huh~$\ldots$
What we need is a backup mechanism: if one has a copy of the
token sequence ahead, one can use \.{\\string} to see if it is a real
brace first, and if it is, consume it and move on (the active character
case can be handled as the implicit case below, with one extra backup
to count how many tokens have been consumed). At this point one has to {\it
reinsert\/} the brace in case, at some point, a future `back up'
requires that the rest of the tokens are removed from the output (to
avoid `\.{Too many \}'s}' complaints from \TeX). This can be done by using
the \.{\\iftrue\{\\else\}\\fi} trick but of course, some bookkeeping is
needed to keep track of how far inside the brace groups we
are.
If it is an implicit brace, more work is needed: read all the
characters that \.{\\string} produced (and maybe more), then remember
the number of characters consumed. Remove the rest of the input using
the method described above and restart the scanning from the same point,
knowing that the next token can be scanned as a parameter.
Another strategy is to design a general enough macro that counts
tokens in a token register and simply recount the tokens after every
brace has been consumed.
Either way, it takes a lot of work. If anyone would
like to pursue the counting strategy, simple counting macros
are provided in \.{/examples/count/count.sty}.
The macros in this example
supply a very general counting mechanism that does not depend on
\.{\\yyeof} (or {\it any\/} other token) being `special' and can count the
tokens in any token register, as long as none of those tokens is an
\.{\\outer} control sequence. In other words, if the macro is used
immediately after the assignment to the token register, it should
always produce a correct count.
Needless to say, if such a general mechanism is desired, one has to
look elsewhere. The added complications of treating spaces (\TeX\
tends to ignore them most of the time) make this a torturous exercise
in \TeX's macro wizardry. The included \.{\\yyinput} has two ways of
dealing with braces: strip them or view the whole group as a
token. Pick one or write a different \.{\\yyinput}. Spaces, implicit
or explicit, are reported as a specially selected character code and
consumed with a likeness of
$$
\hbox{\.{\\afterassignment\\moveon\\let\\next={\tt\char`\ }}}.
$$
Now that a steady stream of character codes is arriving at \.{\\yylex}
after \.{\\yyreturn}, the job of converting it into numerical tokens
is performed by the {\it scanner} (or {\it lexer\/}, or {\it tokenizer\/},
or even {\it tokener}), discussed in the next section.
\enddoublecols
\endgroup
@*1 Lexing in \TeX. In a typical system that uses a parser to process
text, the parsing pass is usually split into several stages: the raw
input, the lexical analysis (or simply {\it lexing}), and the parsing
proper. The {\it lexing\/} (also called {\it scanning}, we use these
terms interchangeably) clumps various sequences of characters into
{\it tokens\/} to facilitate the parsing stage. The reasons for this
particular hierarchy are largely pragmatic and are partially historic
(there is no reason that {\it parsing\/} cannot be done in multiple
phases, as well, although it usually isn't).
If one remembers a few basic facts from formal language theory, it
becomes obvious that a lexer, which parses {\it regular\/} languages,
can (theoretically) be replaced by an {\sc LALR} parser, which parses {\it
context-free\/} ones (or some subset thereof, which is
still a superset of all regular languages). A common justification given for
creating specialized lexers is efficiency and speed. The
reality is somewhat more subtle. While we do care about the efficiency of
parsing in \TeX, having a specialized scanner is important for
a number of different reasons.
The real advantage of having a dedicated scanner is the ease with which it
can match incomplete inputs and back up. A parser can, of course,
{\it recognize\/} any valid input that is also acceptable to a lexer, as well
as {\it reject\/} any input that does not form a valid token. Between
those two extremes, however, lies a whole realm of options that a
traditional parser will have great difficulty exploring. Thus, to
mention just one example, it
is relatively easy to set up a DFA\footnote{Which stands for
Deterministic Finite Automaton, a common (and mathematically unique)
way of implementing a scanner for regular languages. Incidentally, the {\sc
LALR} mentioned above is short for Look-Ahead {\sc LR}, where {\sc LR} stands
for `scanning Left to right, producing a Rightmost derivation'.}
so that the {\it longest\/}
matching input is accepted. The only straightforward way to do this
with a traditional parser is to parse longer and longer inputs again
and again. While this process can be optimized to a certain degree,
the fact that a parser has a {\it stack\/} to maintain limits its
ability to back up.
As an aside, the mechanism by which \CWEB\ assembles its `scraps'
into chunks of recognized code is essentially iterative lexing,
very similar to what a human does to make sense of complicated
texts. Instead of trying to match the longest running piece of text,
\CWEB\ simply looks for patterns to combine inputs into larger
chunks, which can later be further combined. Note that this is not
quite the same as the approach taken by, say, {\sc GLR} parsers, where
the parser must match the {\it whole\/} input or declare a
failure. Where a \CWEB-type parser may settle for the first available
match (or the longest available), a {\sc GLR} parser must try {\it
all\/} possible matches or use an algorithm to reject the majority of
the ones that are bound to fail in the end.
This `\CWEB\ way' is also different from a traditional `strict' {\sc
LR} parser/scanner approach and certainly deserves serious
consideration when the text to be parsed possesses some rigid
structure but the parser is only allowed to process it one small
fragment at a time.
Returning to the present macro suite, the lexer produced by \flex\
uses integer tables similar to those employed by \bison\ so the
usual {\sc\TeX}niques used in implementing \.{\\yyparse} are fully
applicable to \.{\\yylex}.
An additional advantage of having a \flex\ scanner implemented
as part of the suite is the availability of the original \bison\ scanner,
written in \Cee, for use by the macro package.
This said, the code generated by \flex\ contains a few idiosyncrasies
not present in the \bison\ output. These `quirks' mostly involve
handling of end-of-input and error conditions. A quick glance at the
\.{\\yylex} implementation will reveal a rather extensive collection of
macros designed to deal with end-of-input actions.
Another difficulty one has to face in translating \flex\ output into
\TeX\ is a somewhat unstructured namespace delivered in the final
output (this is partially due to the \POSIX\ standard that \flex\
strives to follow). One consequence of this `messy' approach is that the
writer of a \flex\ scanner targeted to \TeX\ has to declare \flex\
`states' (more properly called {\it subautomata}) twice: first for the
benefit of \flex\ itself, and then again, in the {\it \Cee\ preamble\/}
portion of the code to output the states to be used by the action code
in the lexer. The \.{Define\_State($\ldots$)} macro is provided for this
purpose. It can be used explicitly by the programmer or be
inserted by a specially designed parser.
Using \CWEB\ helps to keep these declarations together.
The `hand-off' from the scanner to the parser is implemented
through a pair of registers: \.{\\yylval}, a token register
containing the value of the returned token, and \.{\\yychar}, a
\.{\\count} register that contains the numerical value of the
token to be returned.
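As a minimal sketch of this hand-off (the registers are, of course, allocated by the package itself, and \.{\\returntoken} is a purely illustrative name, not a package macro):

```tex
% Illustrative hand-off: the scanner deposits the token's numeric code
% and its semantic value before returning to the parser.
\newtoks\yylval  % value of the returned token (in reality allocated
\newcount\yychar % by the package; repeated here for completeness)
\def\returntoken#1#2{% #1: numeric token code, #2: semantic value
  \yychar=#1\relax
  \yylval={#2}%
}
% e.g. inside a scanner action, for a hypothetical token code 258:
% \returntoken{258}{some token value}
```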
Upon matching a token, the scanner passes one crucial piece of
information to the user: the character sequence representing the token
just matched (\.{\\yytext}). This is not the whole story
though. There are three more token sequences that are made available
to the parser writer whenever a token is matched.
The first of these is simply a `normalized' version of
\.{\\yytext} (called \.{\\yytextpure}). In most cases it
is a sequence of \TeX\ tokens with the same character codes as the one
in \.{\\yytext} but with their category codes set to 11. In
cases when the tokens in \.{\\yytext} are {\it not}
$(c_{\rm ch}, c_{\rm cat})$ pairs, a few simple
conventions are followed, some of which will be explained below. This
sequence is provided merely for convenience and its typical use is to
generate a key for an associative array.
The other two sequences are special `stream pointers' that provide
access to the extended scanner mechanism in order to implement passing
of `formatting hints' to the parser without introducing any changes to
the original grammar. As the mechanism itself and the motivation
behind it are somewhat subtle, let me spend a few moments discussing
the range of formatting options desirable in a generic pretty-printer.
Unlike the strict parsers employed by most compilers, a parser designed
for pretty printing cannot afford to be too picky about the structure
of its input (\cite[Go] calls such parsers `loose'). To provide
a simple illustration, an isolated identifier, such as `\.{lg\_integer}'
can be a type name, a variable name, or a structure tag (in a language like
\Cee\ for example). If one expects the pretty printer to typeset this
identifier in a correct style, some context must be supplied, as
well. There are several strategies a pretty printer can employ to get
a hold of the necessary context. Perhaps the simplest way to handle
this, and to reduce the complexity of the pretty printing algorithm is
to insist on the user providing enough context for the parser to do
its job. For short examples like the one above, this is an acceptable
strategy. Unfortunately, it is easy to come up with longer snippets of
grammatically deficient text that a pretty printer should be expected
to handle. Some pretty printers, such as the one employed by \CWEB\
and its ilk (the original \.{WEB}, \.{FWEB}), use a very flexible
bottom-up technique that tries to make sense of as large a portion of
the text as it can before outputting the result (see also \cite[Wo],
which implements a similar algorithm in \LaTeX).
The expectation is that this algorithm will handle the majority (about
90\%? it would be interesting to carry out a study in the spirit of
the ones discussed in \cite[Jo] to find out) of the
cases with the remaining few left for the author to correct. The
question is, how can such a correction be applied?
\CWEB\ itself provides two rather different mechanisms for handling
these exceptions. The first uses direct typesetting commands (for
example, \.{@@/} and \.{@@\#} for canceling and
introducing a line break, resp.) to change the typographic output.
The second (preferred) way is to supply {\it hidden context\/} to the
pretty-printer. Two commands, \.{@@;} and
\.{@@[}$\ldots$\.{@@]} are used for this purpose. The
former introduces a `virtual semicolon' that acts in every way like a
real one except that it is not typeset (it is not output in the source file
generated by \CTANGLE, either, but this has nothing to do with pretty
printing, so I will not mention \CTANGLE\ anymore). For
instance, from the parser's point of view, if the preceding text was
parsed as a `scrap' of type {\it exp}, the addition of \.{@@;}
will make it into a `scrap' of type {\it stmt\/} in \CWEB's
parlance. The second construct (\.{@@[}$\ldots$\.{@@]}),
is used to create an {\it exp\/} scrap out of whatever happens to be
inside the brackets.
This is a powerful tool at the author's disposal. Stylistically,
this is the right way to handle exceptions as it forces the writer to
emphasize the {\it logical\/} structure of the formal
text. If the pretty printing style is changed
extensively later, the texts with such hidden contexts should be able to
survive intact in the final document (as an example, using a break
after every statement in \Cee\ may no longer be considered
appropriate, so any forced break introduced to support this convention
would now have to be removed, whereas \.{@@;}'s would simply
quietly disappear into the background).
The same hidden context idea has another important advantage: with
careful grammar fragmenting (facilitated by \CWEB's or any other
literate programming tool's `hypertext' structure) and a more diverse
hidden context (or even arbitrary hidden text) mechanism, it is
possible to use a strict parser to parse incomplete language
fragments. For example, the productions that are needed to parse
\Cee's expressions form a complete subset of the grammar. If the
grammar's `start' symbol is changed to {\it expression\/} (instead of
the {\it translation-unit\/} as it is in the full \Cee\ grammar), a
variety of incomplete \Cee\ fragments can now be parsed and
pretty-printed. Whenever such granularity is still too `coarse',
carefully supplied hidden context will give the pretty printer enough
information to adequately process each fragment. A number of such {\it
sub}-parsers can be tried on each fragment (this may sound
computationally expensive; in practice, however, a carefully chosen
hierarchy of parsers will finish the job rather quickly) until the
correct parser produces the desired output (this approach is similar
to, although not quite the same as, the one employed by {\it generalized
LR parsers}).
This somewhat lengthy discussion brings us to the question directly
related to the tools described in this article: how does one provide
typographical hints or hidden context to the parser?
One obvious solution is to build such hints directly into the
grammar. The parser designer can, for instance, add new tokens
(say, \.{BREAK\_LINE}) to the grammar and extend the
production set to incorporate the new additions. The risk of
introducing new conflicts into the grammar is low (although not
entirely non-existent, due to the lookahead limitations of LR(1)
grammars) and the changes required are easy, although very tedious, to
incorporate.
In addition to being labor intensive, this solution has two other
significant shortcomings: it alters the original grammar and hides its
logical structure; it also `bakes in' the pretty-printing conventions
into the language structure (making `hidden' context much less
`stealthy'). It does avoid the `synchronicity problem' mentioned
below.
A marginally better technique is to introduce a new regular expression
recognizable by the scanner which will then do all the necessary
bookkeeping upon matching the sequence. All the difficulties with
altering the grammar mentioned above apply in this case, as well, only
at the `lexical analysis level'. At a minimum, the set of tokens
matched by the scanner would have to be changed.
A much better approach involves inserting the hints at the input stage and
passing this information to the scanner and parser as part of the token `values'. The
hints themselves can masquerade as characters ignored by the scanner
(white space, for example) and preprocessed by a specially designed
input routine. The scanner then simply passes on the values to the
parser. This makes hints, in effect, invisible.
The difficulty lies in synchronizing the token production with the
parser. This subtle complication is very familiar to anyone who has
designed \TeX's output routines: the parser and the lexer are not
synchronous, in the sense that the scanner might be reading several
(in the case of the general LR$(n)$ parsers) tokens ahead of the
parser before deciding on how to proceed (the same way \TeX\ can
consume a whole paragraph's worth of text before exercising its page
builder).
If we simple-mindedly let the scanner return every hint it has encountered
so far, we may end up feeding the parser the hints meant for the token
that appears {\it after\/} the fragment the parser is currently working
on. In other words, when the scanner `backs up' it must correctly back
up the hints as well.
This is exactly what the scanner produced by the tools in this package
does: along with the main stream of tokens meant for the parser, it
produces two hidden streams (called the \.{\\format} stream and
the \.{\\stash} stream) and provides the parser with two
strings (currently only strings of digits are used although arbitrary
sequences of \TeX\ tokens can be used as pointers) with the promise
that {\it all the `hints' between the beginning of the corresponding
stream and the point labeled by the current stream pointer appeared
among the characters up to and, possibly, including the ones matched
as the current token}. The macros to extract the relevant parts of the
streams (\.{\\yyreadfifo} and its cousins) are provided for the
convenience of the parser designer. The interested reader can consult
the input routine macros for the details of the internal
representation of the streams.
In the interest of full disclosure, let me point out that this simple
technique puts a significant strain on \TeX's
computational resources: the lowest level macros, the ones that handle
character input and are thus executed (sometimes multiple times) for
{\it every\/} character in the input stream, are rather complicated and
therefore slow. Whenever the use of such streams is not desired, a simpler
input routine can be written to speed up the process (see
\.{\\yyinputtrivial} for a working example of such a macro).
Finally, while probably not directly related to the present
discussion, this approach has one more interesting feature: after the
parser is finished, the parser output and the streams exist
`statically', fully available for any last minute preprocessing or for
debugging purposes, if necessary. Under most circumstances, the parser
output is `executed' and the macros in the output are the ones reading
the various streams using the pointers supplied at the parsing stage
(at least, this is the case for all the parsers supplied with the
package).
@*1 Inside semantic actions: switch statements and `functions' in \TeX.
Now you have a lexer for your input, and a grammar ready to be put into
action (we will talk about actions a bit later). It is time to discuss
how the tables produced by \bison\ get converted into \TeX\ {\it macros\/}
that drive the parser in {\it \TeX}.
The tables that drive the \bison\ input parsers
are collected in the various \.{\{b,d,f,g,n\}yytab.tex} files and in \.{small\_tab.tex}. Each
of these files contains the tables that implement a specific parser
used during different stages of processing.
Their exact function is well explained
in the source file produced by \bison\ ({\it how} this is done is
explained elsewhere, see \cite[Ah] for a good reference). It would
suffice to mention here that there are three types of tables in this
file: \recount{1}numerical tables such as \.{\\yytable} and
\.{\\yycheck} (both are either \TeX\ token registers in an
unoptimized parser or associative arrays in an optimized version, as
discussed below),
\recount{2}a string array \.{\\yytname}, and \recount{3}an action
switch. The action switch is what gets called when the parser does a
{\it reduction}. It is easy to notice that the numerical tables come
`premade' whereas the string array consisting of the token names
is initialized in a form that is harder to recognize. This is
intentional: this form of initialization is designed to allow the
widest range of characters to appear inside names. The macros that do
this reside in
\.{yymisc.sty}. The generated table files also contain
constant and token declarations used by the parser.
The description of the process used to output \bison\ tables in an
appropriate form continues in the section about
\locallink{bsfile}outputting \TeX\ tables\endlink, we pick it up here
with the description of the syntax-directed translation and the
actions. The line
$$
\.{\\switchon\\next\\in\\currentswitch}
$$
is responsible for calling an appropriate action in the current
switch, as is easy to infer. A {\it switch\/} is also a macro that
consists of strings of \TeX\ tokens intermixed with \TeX\ macros
inside braces. Each group of macros
gets executed whenever the character or the group of characters in
\.{\\next} matches a substring preceding the braced group. If there
are two different substrings
that match, only the earliest group of macros gets expanded.
Before a state is
used, a special control sequence,
\.{\\setspecialcharsfrom\\switchname} can be used to put the \TeX\
tokens in a form suitable for consumption by \.{\\switchon}. The
most important step it performs is that it {\it turns every token in the
list into a character with the same character code and category
12\/}. Thus \.{\\\{} becomes \.{\{}$_{12}$. There are other ways of
inserting tokens into a state: enclosing a token or a string of tokens in
\.{\\raw...\\raw} adds it to the state macro unchanged. If you have
a sequence of category 12 characters you want to add to the state, put
it after \.{\\classexpand} (such sequences are usually prepared by the
\.{\\setspecialchars} macro that uses the token tables generated by
\bison\ from your grammar).
You can give a case a readable label (say, \.{brackets}) and enclose
this label in \.{\\raw}$\ldots$\.{\\raw}. A word of caution: an `a'
inside of \.{\\raw}$\ldots$\.{\\raw} (which is most likely an
\.{a}$_{11}$ unless you played with category codes before loading the
\.{\\switchon} macros) and the one outside it are two different
characters, as one is no longer a letter (category 11) in the eyes of
\TeX\ whereas the other one still is. For this reason one should not
use characters other than letters in h\.{\{}is\.{,}er\.{\}} state
names: the way a state picks an action does not distinguish between,
say, a `\.{(}' in `\.{(letter)}' and a standalone `\.{(}' and may
pick an action that you did not intend. This applies even if `\.{(}'
is not among the characters explicitly inserted in the state macro: if
an action for a given character is not found in the state macro, the
\.{\\switchon} macro will insert a current \.{\\default} action
instead, which most often you would want to be \.{\\yylex} or
\.{\\yyinput} (i.e.\ skip this token). If `\.{(}' or `\.{)}' matches
the braced group that follows `\.{(letter)}' chaos may ensue (most
likely \TeX\ will keep reading past the \.{\\end} or \.{\\yyeof} that
should have terminated the input). Make the names of character
categories as unique as possible: the \.{\\switchon} is simply a
string matching mechanism, with the added distinction between
characters of different categories.
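To illustrate, here is a hedged sketch of how a switch might be assembled and used, following the description above; \.{\\switchon}, \.{\\setspecialcharsfrom}, and \.{\\raw} are the package's macros, while the switch contents and the action macros are invented for the example.

```tex
% An illustrative switch: strings of tokens, each followed by a braced
% group of actions. \handleab and \handleopen are made-up names.
\def\myswitch{%
  ab{\handleab}%              action for the string `ab'
  \raw open\raw{\handleopen}% a readable label, inserted verbatim
}
\setspecialcharsfrom\myswitch % turn the tokens into category 12 characters
% later, with the characters to dispatch on stored in \next:
% \switchon\next\in\myswitch
```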
Finally, the construct \.{\\statecomment}{\it
anything\/}\.{\\statecomment} allows you to insert comments in the
state sequence (note that the state {\it name\/} is put at the
beginning of the state macro (by \.{\\setspecialcharsfrom})
in the form of a special control sequence
that expands to nothing: this elaborate scheme is needed because
another control sequence can be \.{\\let} to the state macro which
makes the debugging information difficult to decipher). The debugging
mode for the lexer implemented with these macros is activated by
\.{\\tracedfatrue}.
The functionality of the \.{\\switchon} macros (for `historical'
reasons, one can also use \.{\\action} as a synonym) has been
implemented in a number of other macro packages (see \cite[Fi] that
discusses the well-known and widely used \.{\\CASE} and \.{\\FIND}
macros). The macros in this collection have the additional property
that the only assignments that persist after the \.{\\switchon}
completes are the ones performed by the user code inside the selected
case.
This last property of the switch macros is implemented using another
mechanism that is part of this macro suite: the `subroutine-like'
macros, \.{\\begingroup}$\ldots$\.{\\tokreturn}. For examples, the
interested reader can take a look at the macros included with the
package. A typical use is
\.{\\begingroup}$\ldots$\.{\\tokreturn\{\}\{\\toks0 \}\{\}} which will
preserve all the changes to \.{\\toks0} and have no other side effects
(if, for example, in typical \TeX\ vernacular, \.{\\next} is used
to implement tail recursion inside the group, after the
\.{\\tokreturn}, \.{\\next} will still have the same value it
had before the group was entered). This functionality comes at the
expense of some computational efficiency.
This covers most of the routine computations inside semantic actions,
all that is left is a way to `tap' into the stack automaton
built by \bison\ using an interface similar to the special
\.{\$$n$} variables utilized by the `genuine' \bison\ parsers
(i.e.\ written in \Cee\ or any other target language supported by
\bison).
This role is played by the several varieties of \.{\\yy$\,p$} command
sequences (for the sake of completeness, $p$ stands for one of \.{($n$)},
\.{[{\rm name}]}, \.{]{\rm name}[} or $n$, here $n$ is a
string of digits, and a `name' is any name acceptable as a symbolic
name for a term in \bison). Instead
of going into the minutiae of the various flavors of \.{\\yy}-macros, let me
just mention that one can get by with only two `idioms' and still
be able to write parsers of arbitrary sophistication:
\.{\\yy($n$)} can be treated as a token register containing the
value of the $n$-th term of the rule's right hand side, $n>0$. The left
hand side of a production is accessed through \.{\\yyval}. A
convenient shortcut is \.{\\yy0\{{\rm \TeX\space material}\}} which
will expand the \TeX\ material inside the braces. Thus, a simple way
to concatenate the values of the first two production terms is
\.{\\yy0\{\\the\\yy(1)\\the\\yy(2)\}}. The included \bison\
parser can also be used to provide support for `symbolic names',
analogous to \bison's \.{{\$}[{\rm name}]} but this requires a
bit more effort on the user's part to initialize such support. It
could make the parser more readable and maintainable, however.
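As an illustration of the two idioms, the semantic actions for a right-recursive list might look as follows; the grammar fragment and the macro names are invented, and the exact way actions are embedded in the \bison\ input is simplified here.

```tex
% Hypothetical semantic actions for a list grammar, using only the two
% idioms described above (rule names and macros are invented):
%   list: item           uses \listbase
%   list: list ',' item  uses \listappend
\def\listbase{\yy0{\the\yy(1)}}% value of the single item
\def\listappend{\yy0{\the\yy(1), \the\yy(3)}}% old list, comma, new item
```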
Naturally, a parser writer may need a number of other data
abstractions to complete the task. Since these are highly dependent on
the nature of the processing the parser is supposed to provide, we
refer the interested reader to the parsers included in the package as
a source of examples of such specialized data structures.
One last remark about the parser operation is worth making here:
the parser automaton itself does not make any \.{\\global}
assignments. This (along with some careful semantic action writing)
can be used to `localize' the effects of the parser operation and,
most importantly, to create `reentrant' parsers that can, e.g.\ call
{\it themselves\/} recursively.
@*1 `Optimization'.
By default, the generated parser and scanner keep all of their tables
in separate token registers. Each stack is kept in a single macro (this
description is further complicated by the support for parser {\it
namespaces\/} that exists even for unoptimized parsers, but this
subtlety will not be mentioned again---see the macros in the package
for further details). Thus, every time a table
is accessed, it has to be expanded, making the table access latency
linear in {\it the size of the table}. The same holds for stacks and
the action `switches', of
course. While keeping the parser tables (which are immutable) in token
registers does not have any better rationale than saving the control
sequence memory (the most abundant memory in \TeX), this way of
storing {\it stacks} does have an advantage when multiple parsers get
to play simultaneously. All one has to do to switch from one parser to
another is to save the state by renaming the stack control sequences
accordingly.
When the parser and scanner are `optimized', all these control
sequences are `spread over' appropriate associative arrays. One caveat
to be aware of: the action switches for both the parser and the scanner
have to be output differently (a command line option is used to
control this) for optimized and unoptimized parsers. While it is
certainly possible to optimize only some of the parsers (if your
document uses several) or even only some {\it parts\/} of a given
parser (or scanner), the details of how to do this are rather
technical and are left for the reader to discover by reading the
examples supplied with the package. At least at the beginning it is
easier to simply set the highest optimization level and use it
consistently throughout the document.
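The state-saving trick mentioned above amounts to a few \.{\\let}
assignments. In the sketch below, \.{\\yymainstack} and
\.{\\yyinnerstack} are hypothetical names; the actual stack control
sequences used by the package differ:

```latex
\let\yysavedstack\yymainstack   % save the outer parser's stack
\let\yymainstack\yyinnerstack   % install the inner parser's stack
% ... run the inner parser ...
\let\yymainstack\yysavedstack   % restore the outer parser's state
```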
@*1 {\it \TeX\/} with a different {\sl slant} or do you C an escape?.
%\def\texnspace{other}
Some \TeX\ productions below probably look like alien script.
The authors of \cite[Er] cite a number of reasons why pretty printing
\TeX\ is, in general, a nearly impossible task. The macros included with
the package follow a very straightforward strategy and do not try to
be very comprehensive. Instead, the burden of presenting \TeX\ code in
a readable form is placed on the programmer. Appropriate hints can be
supplied by means of indenting the code, using assignments ($=$) where
appropriate, etc. If you would rather look at straight \TeX\
instead, the line \.{\\def\\texnspace\{other\}} at the beginning of
this section can be uncommented and
|TeX_( "/noexpand/inmath{/yy0{/yy1{}}}" );| becomes
\def\texnspace{other}%
|TeX_( "/noexpand/inmath{/yy0{/yy1{}}}" );|.
\def\texnspace{texline}%
There is, however, more to this story. A look at the actual file will
reveal that the line above was typed as
$$
\.{TeX\_( "/noexpand/inmath\{/yy0\{/yy1\{\}\}\}" );}
$$
The `escape character' is leaning the other way!
The lore of \TeX\ is uncompromising: `\.{\\}' is {\it the\/} escape
character. What is the reason to avoid it in this case?
The mystery is not very deep: `\.{/}' was chosen as an escape character
by the parser macros (a quick glance at \.{?yytab.tex} will reveal as
much). There is, of course, nothing sacred (other than tradition,
which this author is trying his hardest to follow) about what character code
the escape character has. The reason for this choice is straightforward: `\.{\\}' is
a special character in \Cee\ as well (also an `escape', in fact). The line
\.{TeX\_( "..." );} is a {\it macro call\/} that expands to an ordinary
function call in \Cee. This function simply prints out (almost `as-is')
the line in parentheses. An attempt at \.{TeX\_( "\\noexpand" );} would result in
\numberlinestrue
\begindemo
^
^oexpand
\enddemo
\numberlinesfalse
Other escape combinations\footnote{Here is a full list of {\it
defined\/} escaped characters in \Cee: \.{\\a}, \.{\\b}, \.{\\f}, \.{\\n},
\.{\\r}, \.{\\t}, \.{\\v}, \.{\\}{$[$\it octal digit$]$}, \.{\\'},
\.{\\"}, \.{\\?}, \.{\\\\}, \.{\\x}, \.{\\u}, \.{\\U}. Note that the
last three combinations must be followed by a specific string of
characters to appear in the input without generating errors.} are
even worse: most are simply undefined. If anyone feels trapped without
an escape, however, the same line can be typed as
$$
\.{TeX\_( "\\\\noexpand\\\\inmath\{\\\\yy0\{\\\\yy1\{\}\}\}" );}
$$
Twice the escape!
If one were to look closer at the code, another oddity stands
out: there are no \.{\$}'s anywhere in sight.
The big money, \.{\$}, is a beloved character in
\bison. It is used in action code to reference the values of the
appropriate terms in a production. If mathematics pays your bills, use
\.{\\inmath} instead.
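To compare the two conventions side by side (a sketch; the rule itself
is made up): where a \Cee\ action in \bison\ would read
\.{\{ \$\$ = \$1; \}}, the corresponding action here is

```latex
% copy the value of the first term to the left hand side;
% this is the \TeX\ counterpart of the \Cee\ action { $$ = $1; }
@[TeX_( "/yy0{/the/yy(1)}" );@]@;
```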
@*1 The \bison\ parser(s). Let's take a short break for a broad overview of the input file.
The basic structure is that of an ordinary \bison\ file that produces
plain \Cee\ output. The \Cee\ actions, however, are programmed to output \TeX.
@s TeX_ TeX
@(bg.yy@>=
@G Switch to generic mode.
%{@> @ @=%}
@g
@@;
@G
%union {@> @ @=}
%{@> @ @=%}
@g
@@;
@G
%%
@g
@@;
@@;
@@;
@G
%%
@g
@ Bootstrap mode is next. The reason for a separate bootstrap parser is to
collect the minimal amount of information to `spool up' the `production'
parsers. To understand the mechanics and the reasons behind it, consider what happens
following a declaration such as \.{\%token TOKEN "token"}
(or, as it would be typeset by the macros in this package
`\prodstyle{\%token} \.{TOKEN} \.{token}'; see the index entries for
more details)%
\idxinline{TOKEN}\idxinline{token}.
The two names for the same token are treated very differently. \.{TOKEN} becomes
an |enum| constant in the \Cee\ parser generated by \bison. Even when
that parser becomes part of the `driver' program that outputs the \TeX\
version of the parser tables, there is no easy way to output the {\it
names\/} of the appropriate |enum| constants. The other name
(\.{"token"}) becomes an entry in the |yytname| array. These names
can be output by either the `driver' or \TeX\ itself after the
\.{\\yytname} table has been input. The scanner, on the other hand,
will use the first version (\.{TOKEN}). Therefore, it is important to
establish an equivalence between the two versions of the name. In the
`real' parser, the token values are output in a special header
file. Hence, one has to either parse the header file to establish the
equivalences or find some other means to find out the numerical values
of the tokens.
One approach is to parse the file containing the {\it declarations\/}
and extract the equivalences between the names from it. This is the
function of the bootstrap parser. Since the lexer is reused, some
token values need to be known in advance (and the rest either ignored
or replaced by some `made up' values). These tokens are `hard coded'
into the parser file generated by \bison\ and output using a special
function. The switch `|@[#define@]@; BISON_BOOTSTRAP_MODE|' tells the `driver'
program to output the hard coded token values.
@q Bizarre looking way of typing #define is due to the awkward way@>
@q \CWEB\ treats switching in and out of $-mode in inline \Cee@>
Note that the equivalence of the two versions of token names would
have to be established every time a `string version' of a token is
declared in the \bison\ file and the `macro name version' of the token
is used by the corresponding scanner. To establish this equivalence,
however, the bootstrapping parser below is not always necessary (see
the \.{xxpression} example, specifically, the file \.{xxpression.w} in
the \.{examples} directory for an example of using a different parser
for this purpose). The reason it is necessary here is that a parser
for an appropriate subset of the \bison\ syntax is not yet available
(indeed, {\it any\/} functional parser for a \bison\ syntax subset
would have to use the same scanner (unless you want to write a custom
scanner for it), which would need to know how to output tokens, for
which it would need a parser for a subset of \bison\ syntax $\ldots$
it is a `chicken and egg' problem). Hence the name `bootstrap'. Once a
functional parser for a large enough subset of the \bison\ input
grammar is operational, {\it it\/} can be used to pair up the token
names.
The second function of the bootstrap parser is to collect information
about the scanner's states. The mechanism is slightly different for
states. While the token equivalences are collected purely in
`\TeX\ mode', the bootstrap parser collects all the state names into a
special \Cee\ header file. The reason is simple: unlike the token
values, the numerical values of the scanner states are not passed to
the `driver' program in any data structure and are instead defined as
ordinary macros. The header file supplies the information the `driver'
program needs to output the state values.
An additional subtlety in the case of state value output is that the
main lexer for the \bison\ grammar utilizes states extensively and thus
cannot be easily used with the bootstrap parser before the state
values are known. The solution is to substitute a very simple scanner barely
capable of lexing state declarations. Such a scanner is implemented
in \.{ssffo.w} (the somewhat cryptic name stands for `{\bf s}imple {\bf s}canner
{\bf f}or {\bf f}lex {\bf o}ptions').
\saveparseoutputtrue
@(bb.yy@>=
@G Switch to generic mode.
%{
@g
@@;
#define BISON_BOOTSTRAP_MODE
@G
%}
@g
@@;
@G
%union {@> @ @=}
%{@> @ @=%}
@g
@@;
@G
%%
@g
@@;
@@;
@<\flex\ options parser productions@>@;
@@;
@@;
@G
%%
@g
@ The prologue parser is responsible for parsing various grammar
declarations as well as parser options.
\saveparseoutputfalse
%\traceparserstatestrue
%\tracestackstrue
%\tracerulestrue
%\traceactionstrue
\saveparseoutputtrue
@(bd.yy@>=
@G Switch to generic mode.
%{@> @ @=%}
@g
@@;
@G
%union {@> @ @=}
%{@> @ @=%}
@g
@@;
@G
%%
@g
@@;
@@;
@@;
@G
%%
@g
@ The full \bison\ input parser is used when a complete \bison\ file is
expected. It is also capable of parsing a `skeleton' of such a file,
similar to the one that follows this paragraph.
\traceparserstatesfalse
\tracestacksfalse
\tracerulesfalse
\traceactionsfalse
\checktablefalse
\saveparseoutputfalse
@(bf.yy@>=
@G Switch to generic mode.
%{@> @ @=%}
@g
@@;
@G
%union {@> @ @=}
%{@> @ @=%}
@g
@@;
@G
%%
@g
@@;
@@;
@@;
@@;
@G
%%
@g
@ The first two options are essential for the parser operation. The
start symbol can be set implicitly by listing the appropriate
production first.
@q %define lr.type canonical-lr @>
@q Make note of this and lexing too much lookahead and the \stashed trick@>
@q Explain other options @>
@=
@G
%token-table
%debug
%start input
@g
@*2 Grammar rules. Most of the original comments present in
the grammar file used by \bison\ itself have been preserved and appear in
{\it italics\/} at the beginning of each appropriate section.
To facilitate the {\it bootstrapping\/} of the parser (see above), some
declarations have been separated into their own sections. Also, a
number of new rules have been introduced to create a hierarchy of
`subparsers' that parse subsets of the grammar. We begin by listing
most of the tokens used by the grammar. Only the string versions are
kept in the |yytname| array, which, in part, is the reason for a
special bootstrapping parser, as explained earlier.
@=
@G
%token GRAM_EOF 0 "end of file"
%token STRING "string"
%token PERCENT_TOKEN "%token"
%token PERCENT_NTERM "%nterm"
%token PERCENT_TYPE "%type"
%token PERCENT_DESTRUCTOR "%destructor"
%token PERCENT_PRINTER "%printer"
%token PERCENT_LEFT "%left"
%token PERCENT_RIGHT "%right"
%token PERCENT_NONASSOC "%nonassoc"
%token PERCENT_PRECEDENCE "%precedence"
%token PERCENT_PREC "%prec"
%token PERCENT_DPREC "%dprec"
%token PERCENT_MERGE "%merge"
@g
@@;
@ We continue with the list of tokens below, following the layout of
the original parser.
@=
@G
%token
PERCENT_CODE "%code"
PERCENT_DEFAULT_PREC "%default-prec"
PERCENT_DEFINE "%define"
PERCENT_DEFINES "%defines"
PERCENT_ERROR_VERBOSE "%error-verbose"
PERCENT_EXPECT "%expect"
PERCENT_EXPECT_RR "%expect-rr"
PERCENT_FLAG "%<flag>"
PERCENT_FILE_PREFIX "%file-prefix"
PERCENT_GLR_PARSER "%glr-parser"
PERCENT_INITIAL_ACTION "%initial-action"
PERCENT_LANGUAGE "%language"
PERCENT_NAME_PREFIX "%name-prefix"
PERCENT_NO_DEFAULT_PREC "%no-default-prec"
PERCENT_NO_LINES "%no-lines"
PERCENT_NONDETERMINISTIC_PARSER
"%nondeterministic-parser"
PERCENT_OUTPUT "%output"
PERCENT_REQUIRE "%require"
PERCENT_SKELETON "%skeleton"
PERCENT_START "%start"
PERCENT_TOKEN_TABLE "%token-table"
PERCENT_VERBOSE "%verbose"
PERCENT_YACC "%yacc"
;
%token BRACED_CODE "{...}"
%token BRACED_PREDICATE "%?{...}"
%token BRACKETED_ID "[identifier]"
%token CHAR "char"
%token EPILOGUE "epilogue"
%token EQUAL "="
%token ID "identifier"
%token ID_COLON "identifier:"
%token PERCENT_PERCENT "%%"
%token PIPE "|"
%token PROLOGUE "%{...%}"
%token SEMICOLON ";"
%token TAG "<tag>"
%token TAG_ANY "<*>"
%token TAG_NONE "<>"
%token INT "integer"
%token PERCENT_PARAM "%param";
@g
@ Extra tokens for typesetting \flex\ state
declarations and options are declared in addition to the ones that a
standard \bison\ parser recognizes.
@=
@G
%token FLEX_OPTION FLEX_STATE_X FLEX_STATE_S
@g
@ We are ready to describe the top levels of the parse tree. The first
`subparser' we consider is a `full' parser, that is, the parser that
expects a full grammar file, complete with the prologue, declarations,
etc. This parser can be used to extract information from the grammar
that is otherwise absent from the executable code generated by
\bison. This includes, for example, the `name' part of
\.{\$}\.{[}{\rm name}\.{]}.
This parser is therefore used to generate the `symbolic
switch' to provide support for symbolic term names similar to
`genuine' \bison's \.{\$}\.{[}$\ldots$\.{]} syntax.
@=
@t}\vb{\inline}{@>
@G
input:
prologue_declarations
"%%" grammar epilogue.opt {@> @ @=}
;
@g
@ The action of the parser in this case is simply to separate the
accumulated `parse tree' from the auxiliary information carried by the
parser on the stack.
@=
@[TeX_( "/getsecond{/yy(3)}/to/toksa" );@]@; /* extract grammar contents */
@[TeX_( "/yy0{/the/toksa}/table=/yy(0)" );@]@;
@ Another subgrammar deals with the syntax of isolated \bison\ rules. This is
the most commonly used `subparser', since a cluster of rules is the most
natural `unit' to include in a \CWEB\ file.
@=
@t}\vb{\inline}{@>
@G
input:
grammar epilogue.opt {@> TeX_( "/getsecond{/yy(1)}/to/table" ); @=}
;
@g
@ The bootstrap parser has a very narrow set of goals: it is concerned
only with \prodstyle{\%token} and \prodstyle{\%nterm} declarations, in
order to supply the token information to the lexer (since, as noted
above, such information is not kept in the |yytname| array). It also
extends the syntax of a \prodstyle{grammar\_declaration} by allowing a
declaration with or without a semicolon at the end (the latter is only
allowed in the prologue). This works since the token declarations have
been carefully separated from the rest of the grammar in different
\CWEB\ sections. The range of tokens understood by the bootstrap
parser is limited, hence most of the other rules are ignored.
@=
@t}\vb{\inline}{@>
@G
input:
grammar_declarations {@> TeX_( "/table=/yy(1)" ); @=}
;
@g
@t}\vb{\resetf}{@>
@G
grammar_declarations:
symbol_declaration semi.opt {@> @ @=}
| flex_declaration semi.opt {@> @ @=}
| grammar_declarations
symbol_declaration semi.opt {@> TeX_( "/yy0{/the/yy(1)/the/yy(2)}" ); @=}
| grammar_declarations
flex_declaration semi.opt {@> TeX_( "/yy0{/the/yy(1)/the/yy(2)}" ); @=}
;
@g
@t}\vb{\inline\flatten}{@>
@G
semi.opt: {} | ";" {};
@g
@ The following is perhaps the most common action performed by the
parser. It is done automatically by the parser code, but since this
feature is undocumented, we supply an explicit action in each case.
@=
@[TeX_( "/yy0{/the/yy(1)}" );@]@;
@ Next, a subgrammar for processing prologue declarations. Finer
differentiation is possible but the `subparsers' described here work
pretty well and impose a mild style on the grammar writer.
@=
@t}\vb{\inline}{@>
@G
input:
prologue_declarations epilogue.opt {@> TeX_( "/getsecond{/yy(1)}/to/table" ); @=}
| prologue_declarations
"%%" "%%" EPILOGUE {@> TeX_( "/getsecond{/yy(1)}/to/table" ); @=}
| prologue_declarations
"%%" "%%" {@> TeX_( "/getsecond{/yy(1)}/to/table" ); @=}
;
@g
@ {\it Declarations: before the first \prodstyle{\%\%}}. We are now
ready to deal with the specifics of the declarations themselves. The
\.{\\grammar} macro is a `structure', whose first `field' is the
grammar itself, whereas the second carries the type of the last
declaration added to the grammar.
@=
@G
prologue_declarations:
{@> TeX_( "/yy0{/nx/grammar{}{/nx/empty}}" ); @=}
| prologue_declarations
prologue_declaration {@> @ @=}
;
@g
@ @=
@@;
@ Here is a list of most of the kinds of declarations that can appear in
the prologue. The scanner returns the `stream pointers' for all the
keywords so the declaration `structures' pass on those pointers to the
grammar list. The original syntax has been left intact even though for
the purposes of this parser some of the inline rules are unnecessary.
@=
@G
prologue_declaration:
grammar_declaration {@> @ @=}
| "%{...%}" {@> TeX_( "/yy0{/nx/prologuecode/the/yy(1)}" ); @=}
| "%" {@> TeX_( "/yy0{/nx/optionflag/the/yy(1)}" ); @=}
| "%define" variable value {@> TeX_( "/yy0{/nx/vardef{/the/yy(2)}{/the/yy(3)}/the/yy(1)}" ); @=}
| "%defines" {@> TeX_( "/yy0{/nx/optionflag{defines}{}/the/yy(1)}" ); @=}
| "%defines" STRING {@> @[TeX_( "/toksa{defines}" );@]@+@ @=}
| "%error-verbose" {@> TeX_( "/yy0{/nx/optionflag{error verbose}{}/the/yy(1)}" ); @=}
| "%expect" INT {@> @[TeX_( "/toksa{expect}" );@]@+@ @=}
| "%expect-rr" INT {@> @[TeX_( "/toksa{expect-rr}" );@]@+@ @=}
| "%file-prefix" STRING {@> @[TeX_( "/toksa{file prefix}" );@]@+@ @=}
| "%glr-parser" {@> TeX_( "/yy0{/nx/optionflag{glr parser}{}/the/yy(1)}" ); @=}
| "%initial-action" "{...}" {@> TeX_( "/yy0{/nx/initaction/the/yy(2)}" ); @=}
| "%language" STRING {@> @[TeX_( "/toksa{language}" );@]@+@ @=}
| "%name-prefix" STRING {@> @[TeX_( "/toksa{name prefix}" );@]@+@ @=}
| "%no-lines" {@> TeX_( "/yy0{/nx/optionflag{no lines}{}/the/yy(1)}" ); @=}
| "%nondeterministic-parser" {@> TeX_( "/yy0{/nx/optionflag{nondet. parser}{}/the/yy(1)}" ); @=}
| "%output" STRING {@> @[TeX_( "/toksa{output}" );@]@+@ @=}
@g
@t}\vb{\flatten}{@>
@G
| "%param" {}
params {@> TeX_( "/yy0{/nx/paramdef{/the/yy(3)}/the/yy(1)}" ); @=}
@g
@t}\vb{\fold}{@>
@G
| "%require" STRING {@> @[TeX_( "/toksa{require}" );@]@+@ @=}
| "%skeleton" STRING {@> @[TeX_( "/toksa{skeleton}" );@]@+@ @=}
| "%token-table" {@> TeX_( "/yy0{/nx/optionflag{token table}{}/the/yy(1)}" ); @=}
| "%verbose" {@> TeX_( "/yy0{/nx/optionflag{verbose}{}/the/yy(1)}" ); @=}
| "%yacc" {@> TeX_( "/yy0{/nx/optionflag{yacc}{}/the/yy(1)}" ); @=}
| ";" {@> TeX_( "/yy0{/nx/empty}" ); @=}
;
params:
params "{...}" {@> TeX_( "/yy0{/the/yy(1)/nx/braceit/the/yy(2)}" ); @=}
| "{...}" {@> TeX_( "/yy0{/nx/braceit/the/yy(1)}" ); @=}
;
@g
@ This is a typical parser action: encapsulate the `type' of the
construct just parsed and attach some auxiliary info, in this case the
stream pointers.
@=
@[TeX_( "/yy0{/nx/oneparametricoption{/the/toksa}{/the/yy(2)}/the/yy(1)}" );@]@;
@ Some extra declarations to typeset \flex\ options and
declarations. These are not part of the \bison\ syntax but their
structure is similar enough that they can be included in the grammar.
@=
@G
prologue_declaration:
flex_declaration {@> @ @=}
;
@g
@<\flex\ options parser productions@>@;
@ The syntax of \flex\ options was extracted from the \flex\ documentation,
so it is not guaranteed to be correct.
@<\flex\ options parser productions@>=
@G
flex_declaration:
FLEX_OPTION flex_option_list {@> @ @=}
| flex_state symbols.1 {@> @ @=}
;
flex_state:
FLEX_STATE_X {@> TeX_( "/yy0{/nx/flexxstatedecls/the/yy(1)}" ); @=}
| FLEX_STATE_S {@> TeX_( "/yy0{/nx/flexsstatedecls/the/yy(1)}" ); @=}
;
flex_option_list:
flex_option {@> @ @=}
| flex_option_list flex_option {@> @ @=}
;
flex_option:
ID {@> TeX_( "/yy0{/nx/flexoptionpair{/the/yy(1)}{}}" ); @=}
| ID "=" symbol {@> TeX_( "/yy0{/nx/flexoptionpair{/the/yy(1)}{/the/yy(3)}}" ); @=}
;
@g
@ @=
@[TeX_( "/yy0{/nx/flexoptiondecls{/the/yy(2)}/the/yy(1)}" );@]@;
@ @=
@[TeX_( "/getfirst{/yy(1)}/to/toksa" );@]@;
@[TeX_( "/getsecond{/yy(1)}/to/toksb" );@]@;
@[TeX_( "/getthird{/yy(1)}/to/toksc" );@]@;
@[TeX_( "/yy0{/the/toksa{/the/yy(2)}{/the/toksb}{/the/toksc}}" );@]@;
@ @=
@[TeX_( "/getsecond{/yy(2)}/to/toksa" );@]@; /* the identifier */
@[TeX_( "/getfourth{/toksa}/to/toksb" );@]@; /* the format pointer */
@[TeX_( "/getfifth{/toksa}/to/toksc" );@]@; /* the stash pointer */
@[TeX_( "/yy0{/the/yy(1)/nx/hspace{/the/toksb}{/the/toksc}/the/yy(2)}" );@]@;
@ {\it Grammar declarations}. These declarations can appear in both
the prologue and the rules sections. Their treatment is very similar to
that of prologue-only options.
@=
@G
grammar_declaration:
precedence_declaration {@> @ @=}
| symbol_declaration {@> @ @=}
| "%start" symbol {@> @[TeX_( "/toksa{start}" );@]@+@ @=}
| code_props_type "{...}" generic_symlist {@> @ @=}
| "%default-prec" {@> TeX_( "/yy0{/nx/optionflag{default prec.}{}/the/yy(1)}" ); @=}
| "%no-default-prec" {@> TeX_( "/yy0{/nx/optionflag{no default prec.}{}/the/yy(1)}" ); @=}
| "%code" "{...}" {@> TeX_( "/yy0{/nx/codeassoc{code}{}/the/yy(2)/the/yy(1)}" ); @=}
| "%code" ID "{...}" {@> TeX_( "/yy0{/nx/codeassoc{code}{/the/yy(2)}/the/yy(3)/the/yy(1)}" ); @=}
;
code_props_type:
"%destructor" {@> TeX_( "/yy0{{destructor}/the/yy(1)}" ); @=}
| "%printer" {@> TeX_( "/yy0{{printer}/the/yy(1)}" ); @=}
;
@g
@ @=
@[TeX_( "/getfirst{/yy(1)}/to/toksa" );@]@; /* name of the property */
@[TeX_( "/getfirst{/yy(2)}/to/toksb" );@]@; /* contents of the braced code */
@[TeX_( "/getsecond{/yy(2)}/to/toksc" );@]@; /* braced code format pointer */
@[TeX_( "/getthird{/yy(2)}/to/toksd" );@]@; /* braced code stash pointer */
@[TeX_( "/getsecond{/yy(1)}/to/tokse" );@]@; /* code format pointer */
@[TeX_( "/getthird{/yy(1)}/to/toksf" );@]@; /* code stash pointer */
@[TeX_( "/yy0{/nx/codepropstype{/the/toksa}{/the/toksb}{/the/yy(3)}{/the/toksc}{/the/toksd}{/the/tokse}{/the/toksf}}" );@]@;
@ @=
@G
%token PERCENT_UNION "%union";
@g
@ @=
@t}\vb{\inline\flatten}{@>
@G
union_name:
{@> TeX_( "/yy0{}" ); @=}
| ID {@> @ @=}
;
grammar_declaration:
"%union" union_name "{...}" {@> @ @=}
;
symbol_declaration:
"%type" TAG symbols.1 {@> @ @=}
;
@g
@t}\vb{\resetf\flatten}{@>
@G
precedence_declaration:
precedence_declarator tag.opt symbols.prec {@> @ @=}
;
precedence_declarator:
"%left" {@> TeX_( "/yy0{/nx/preckind{left}/the/yy(1)}" ); @=}
| "%right" {@> TeX_( "/yy0{/nx/preckind{right}/the/yy(1)}" ); @=}
| "%nonassoc" {@> TeX_( "/yy0{/nx/preckind{nonassoc}/the/yy(1)}" ); @=}
| "%precedence" {@> TeX_( "/yy0{/nx/preckind{precedence}/the/yy(1)}" ); @=}
;
@g
@t}\vb{\inline}{@>
@G
tag.opt:
{@> TeX_( "/yy0{}" ); @=}
| TAG {@> @ @=}
;
@g
@ @=
@[TeX_( "/yy0{/nx/codeassoc{union}{/the/yy(2)}/the/yy(3)/the/yy(1)}" );@]@;
@ @=
@[TeX_( "/yy0{/nx/typedecls{/the/yy(2)}{/the/yy(3)}/the/yy(1)}" );@]@;
@ @=
@[TeX_( "/getthird{/yy(1)}/to/toksa" );@]@; /* format pointer */
@[TeX_( "/getfourth{/yy(1)}/to/toksb" );@]@; /* stash pointer */
@[TeX_( "/getsecond{/yy(1)}/to/toksc" );@]@; /* kind of precedence */
@[TeX_( "/yy0{/nx/precdecls{/the/toksc}{/the/yy(2)}{/the/yy(3)}{/the/toksa}{/the/toksb}}" );@]@;
@ The bootstrap grammar forms the smallest subset of the full grammar.
@=
@@;
@ These are the two most important rules for the bootstrap parser.
@=
@t}\vb{\flatten}{@>
@G
symbol_declaration:
"%nterm" {} symbol_defs.1 {@> TeX_( "/yy0{/nx/ntermdecls{/the/yy(3)}/the/yy(1)}" ); @=}
@g
@t}\vb{\fold\flatten}{@>
@G
| "%token" {} symbol_defs.1 {@> TeX_( "/yy0{/nx/tokendecls{/the/yy(3)}/the/yy(1)}" ); @=}
;
@g
@ {\it Just like \prodstyle{symbols.1} but accept \prodstyle{INT} for
the sake of \POSIX}. Perhaps the only point worth mentioning here is
the inserted separator (\.{\\hspace}). Like any other separator, it takes
two parameters, stream pointers. In this case, however, both pointers are null
since there seems to be no other meaningful assignment. If any
formatting or stash information is needed, it can be extracted by the
symbols themselves.
@=
@G
symbols.prec:
symbol.prec {@> @ @=}
| symbols.prec symbol.prec {@> TeX_( "/yy0{/the/yy(1)/nx/hspace{0}{0}/the/yy(2)}" ); @=}
;
symbol.prec:
symbol {@> TeX_( "/yy0{/nx/symbolprec{/the/yy(1)}{}}" ); @=}
| symbol INT {@> TeX_( "/yy0{/nx/symbolprec{/the/yy(1)}{/the/yy(2)}}" ); @=}
;
@g
@ {\it One or more symbols to be \prodstyle{\%type}'d}.
@=
@@;
@ @=
@G
symbols.1:
symbol {@> @ @=}
| symbols.1 symbol {@> TeX_( "/yy0{/the/yy(1)/nx/hspace{0}{0}/the/yy(2)}" ); @=}
;
@g
@ @=
@G
generic_symlist:
generic_symlist_item {@> @ @=}
| generic_symlist generic_symlist_item {@> TeX_( "/yy0{/the/yy(1)/nx/hspace{0}{0}/the/yy(2)}" ); @=}
;
@g
@t}\vb{\flatten\inline}{@>
@G
generic_symlist_item:
symbol {@> @ @=}
| tag {@> @ @=}
;
tag:
TAG {@> @ @=}
| "<*>" {@> @ @=}
| "<>" {@> @ @=}
;
@g
@ {\it One token definition}.
@=
@G
symbol_def:
TAG {@> @ @=}
@g
@t}\vb{\flatten}{@>
@G
| id {@> TeX_( "/yy0{/nx/onesymbol{/the/yy(1)}{}{}}" ); @=}
| id INT {@> TeX_( "/yy0{/nx/onesymbol{/the/yy(1)}{/the/yy(2)}{}}" ); @=}
| id string_as_id {@> TeX_( "/yy0{/nx/onesymbol{/the/yy(1)}{}{/the/yy(2)}}" ); @=}
| id INT string_as_id {@> TeX_( "/yy0{/nx/onesymbol{/the/yy(1)}{/the/yy(2)}{/the/yy(3)}}" ); @=}
;
@g
@ {\it One or more symbol definitions}.
@=
@G
symbol_defs.1:
symbol_def {@> @ @=}
| symbol_defs.1 symbol_def {@> @ @=}
;
@g
@ @=
@[TeX_( "/getsecond{/yy(2)}/to/toksa" );@]@; /* the identifier */
@[TeX_( "/getfourth{/toksa}/to/toksb" );@]@; /* the format pointer */
@[TeX_( "/getfifth{/toksa}/to/toksc" );@]@; /* the stash pointer */
@[TeX_( "/yy0{/the/yy(1)/nx/hspace{/the/toksb}{/the/toksc}/the/yy(2)}" );@]@;
@ {\it The grammar section: between the two
\prodstyle{\%\%}'s}. Finally, the following few short sections define
the syntax of \bison's rules.
@=
@G
grammar:
rules_or_grammar_declaration {@> @ @=}
| grammar rules_or_grammar_declaration {@> @ @=}
;
@g
@ {\it As a \bison\ extension, one can use the grammar declarations in the
body of the grammar}. What follows is the syntax of the right hand
side of a grammar rule.
@=
@G
rules_or_grammar_declaration:
rules {@> @ @=}
| grammar_declaration ";" {@> @ @=}
| error ";" {@> TeX_( "/errmessage{parsing error!}" ); @=}
;
@g
@t}\vb{\flatten\inline}{@>
@G
rules:
id_colon named_ref.opt {@> TeX_( "/relax" ); @=}
rhses.1 {@> @ @=}
;
@g
@t}\vb{\resetf}{@>
@G
rhses.1[o]:
rhs {@> @ @=}
| rhses.1[a] "|"[b] {@> @ @=}[c]
rhs[d] {@> @ @=}
| rhses.1 ";" {@> @ @=}
;
@g
@ The next few actions describe what happens when a left hand side is
attached to a rule.
@=
@[TeX_( "/getfirst{/yy(1)}/to/toksa" );@]@;
@[TeX_( "/yy0{/nx/grammar{/the/yy(1)}{/the/toksa}}" );@]@;
@ @=
@[TeX_( "/getthird{/yy(1)}/to/toksa" );@]@; /* type of the last rule */
@[TeX_( "/getsecond{/yy(1)}/to/toksc" );@]@; /* accumulated rules */
@[TeX_( "/getfirst{/yy(2)}/to/toksb" );@]@; /* type of the new rule */
@[TeX_( "/let/default/positionswitchdefault" );@]@;
@[TeX_( "/switchon{/the/toksb}/in/positionswitch" );@]@; /* determine the position of the first token in the group */
@[TeX_( "/edef/next{/the/toksa}" );@]@;
@[TeX_( "/edef/default{/the/toksb}" );@]@; /* reuse \.{\\default} */
@[TeX_( "/ifx/next/default" );@]@;
@[TeX_( " /let/default/separatorswitchdefaulteq" );@]@;
@[TeX_( " /switchon{/the/toksa}/in/separatorswitcheq" );@]@;
@[TeX_( "/else" );@]@;
@[TeX_( " /concat/toksa/toksb" );@]@;
@[TeX_( " /let/default/separatorswitchdefaultneq" );@]@;
@[TeX_( " /switchon{/the/toksa}/in/separatorswitchneq" );@]@;
@[TeX_( "/fi" );@]@;
@[TeX_( "/yy0{/nx/grammar{/the/toksc/the/postoks/the/toksd/the/yy(2)}{/the/toksb}}" );@]@;
@ @=
@[TeX_( "/getsecond{/yy(1)}/to/toksa" );@]@; /* \.{\\prodheader} */
@[TeX_( "/getsecond{/toksa}/to/toksb" );@]@; /* \.{\\idit} */
@[TeX_( "/getfourth{/toksb}/to/toksc" );@]@; /* format stream pointer */
@[TeX_( "/getfifth{/toksb}/to/toksd" );@]@; /* stash stream pointer */
@[TeX_( "/getthird{/yy(1)}/to/toksb" );@]@; /* \.{\\rules} */
@[TeX_( "/yy0{/nx/oneproduction{/the/toksa/the/toksb}{/the/toksc}{/the/toksd}}" );@]@;
@ @=
@[TeX_( "/getfourth{/yy(1)}/to/toksa" );@]@; /* format stream pointer */
@[TeX_( "/getfifth{/yy(1)}/to/toksb" );@]@; /* stash stream pointer */
@[TeX_( "/yy0{/nx/pcluster{/nx/prodheader{/the/yy(1)}{/the/yy(2)}" );@]@;
@[TeX_( " {/the/toksa}{/the/toksb}}{/the/yy(4)}}" );@]@;
@ It is important to format the right hand side properly, since we
would like to indicate that an action is inlined by
indentation. The `format' of the \.{\\rhs} `structure' includes the
stash pointers and a `boolean' indicating whether the right hand side ends
with an action. Since the action can be implicit, this decision has to
be postponed until, say, a semicolon is seen.
No formatting or stash pointers are added for such an implicit action.
@=
@[TeX_( "/rhsbool{/yy(1)}/to/toksa /the/toksa" );@]@;
@[TeX_( "/getthird{/yy(1)}/to/toksb" );@]@; /* the format pointer */
@[TeX_( "/getfourth{/yy(1)}/to/toksc" );@]@; /* the stash pointer */
@[TeX_( "/ifrhsfull" );@]@;
@[TeX_( " /yy0{/nx/rules{/the/yy(1)}{/the/toksb}{/the/toksc}}" );@]@;
@[TeX_( "/else" );@]@; /* it does not end with an action, fake one */
@[TeX_( " /rhscont{/yy(1)}/to/toksa" );@]@; /* rules */
@[TeX_( " /edef/next{/the/toksa}" );@]@;
@[TeX_( " /ifx/next/empty" );@]@;
@[TeX_( " /toksa{/emptyterm}" );@]@;
@[TeX_( " /fi" );@]@;
@[TeX_( " /yy0{/nx/rules{/nx/rhs{/the/toksa/nx/rarhssep{0}{0}" );@]@;
@[TeX_( " /nx/actbraces{}{}{0}{0}/nx/bdend}{}{/nx/rhsfulltrue}}{/the/toksb}{/the/toksc}}" );@]@;
@[TeX_( "/fi" );@]@;
@ @=
@[TeX_( "/rhscont{/yy(1)}/to{/yy(0)}" );@]@;
@[TeX_( "/yy0{/the/yy(0)/nx/midf/the/yy(2)}" );@]@;
@ No pointers are provided for an {\it implicit\/} action.
@=
@[TeX_( "/rhsbool{/yy(4)}/to/toksa /the/toksa" );@]@;
@[TeX_( "/ifrhsfull" );@]@;
@[TeX_( " /yy0{/nx/rules{/the/yy(3)/nx/rrhssep/the/yy(2)/the/yy(4)}/the/yy(2)}" );@]@;
@[TeX_( "/else" );@]@;
@[TeX_( " /rhscont{/yy(4)}/to/toksa" );@]@;
@[TeX_( " /edef/next{/the/toksa}" );@]@;
@[TeX_( " /ifx/next/empty" );@]@;
@[TeX_( " /toksa{/emptyterm}" );@]@;
@[TeX_( " /fi" );@]@;
@[TeX_( " /yy0{/nx/rules{/the/yy(3)/nx/rrhssep/the/yy(2)" );@]@;
@[TeX_( " /nx/rhs{/the/toksa/nx/rarhssep{0}{0}" );@]@; /* streams have already been grabbed */
@[TeX_( " /nx/actbraces{}{}{0}{0}/nx/bdend}{}{/nx/rhsfulltrue}}/the/yy(2)}" );@]@;
@[TeX_( "/fi" );@]@;
@ @=
@@;
@ @=
@G
%token PERCENT_EMPTY "%empty";
@g
@ The centerpiece of the grammar is the syntax of the right hand side
of a production. Various `precedence hints' must be attached to an
appropriate portion of the rule, just before an action (which can
be inline, implicit, or both in this case).
@=
@G
rhs:
{@> @ @=}
| rhs symbol named_ref.opt {@> @ @=}
| rhs "{...}" named_ref.opt {@> @ @=}
| rhs "%?{...}" {@> @ @=}
| rhs "%empty" {@> @ @=}
| rhs "%prec" symbol {@> @ @=}
| rhs "%dprec" INT {@> @ @=}
| rhs "%merge" TAG {@> @ @=}
;
named_ref.opt:
{@> @ @=}
| BRACKETED_ID {@> @ @=}
;
@g
@ @=
@[TeX_( "/yy0{/nx/rhs{}{}{/nx/rhsfullfalse}}" );@]@;
@ @=
@[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@;
@[TeX_( "/rhscnct{/yy(1)}/to/toksb" );@]@;
@[TeX_( "/edef/next{/the/toksb}" );@]@;
@[TeX_( "/ifx/next/empty" );@]@;
@[TeX_( "/else" );@]@;
@[TeX_( " /getfourth{/yy(2)}/to/toksc" );@]@;
@[TeX_( " /getfifth{/yy(2)}/to/toksd" );@]@;
@[TeX_( " /appendr/toksb{{/the/toksc}{/the/toksd}}" );@]@;
@[TeX_( "/fi" );@]@;
@[TeX_( "/yy0{/nx/rhs{/the/toksa/the/toksb" );@]@;
@[TeX_( " /nx/termname{/the/yy(2)}{/the/yy(3)}}{/nx/hspace}{/nx/rhsfullfalse}}" );@]@;
@ @=
@[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@;
@[TeX_( "/rhsbool{/yy(1)}/to/toksb /the/toksb" );@]@;
@[TeX_( "/ifrhsfull" );@]@; /* the first half ends with an action */
@[TeX_( " /appendr/toksa{/nx/arhssep{0}{0}/nx/emptyterm}" );@]@; /* no pointers to streams */
@[TeX_( "/fi" );@]@;
@[TeX_( "/edef/next{/the/toksa}" );@]@;
@[TeX_( "/ifx/next/empty" );@]@;
@[TeX_( " /toksa{/emptyterm}" );@]@;
@[TeX_( "/fi" );@]@;
@[TeX_( "/getfirst{/yy(2)}/to/toksb" );@]@; /* the contents of the braced code */
@[TeX_( "/getsecond{/yy(2)}/to/toksc" );@]@; /* the format stream pointer */
@[TeX_( "/getthird{/yy(2)}/to/toksd" );@]@; /* the stash stream pointer */
@[TeX_( "/yy0{/nx/rhs{/the/toksa/nx/rarhssep{/the/toksc}{/the/toksd}" );@]@;
@[TeX_( " /nx/actbraces{/the/toksb}{/the/yy(3)}{/the/toksc}{/the/toksd}/nx/bdend}" );@]@;
@[TeX_( " {/nx/arhssep}{/nx/rhsfulltrue}}" );@]@;
@ @=
@[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@;
@[TeX_( "/rhsbool{/yy(1)}/to/toksb /the/toksb" );@]@;
@[TeX_( "/ifrhsfull" );@]@; /* the first half ends with an action */
@[TeX_( " /appendr/toksa{/nx/arhssep{0}{0}/nx/emptyterm}" );@]@; /* no pointers to streams */
@[TeX_( "/fi" );@]@;
@[TeX_( "/edef/next{/the/toksa}" );@]@;
@[TeX_( "/ifx/next/empty" );@]@;
@[TeX_( " /toksa{/emptyterm}" );@]@;
@[TeX_( "/fi" );@]@;
@[TeX_( "/getfirst{/yy(2)}/to/toksb" );@]@; /* the contents of the braced code */
@[TeX_( "/getsecond{/yy(2)}/to/toksc" );@]@; /* the format stream pointer */
@[TeX_( "/getthird{/yy(2)}/to/toksd" );@]@; /* the stash stream pointer */
@[TeX_( "/yy0{/nx/rhs{/the/toksa/nx/rarhssep{/the/toksc}{/the/toksd}" );@]@;
@[TeX_( " /nx/bpredicate{/the/toksb}{}{/the/toksc}{/the/toksd}/nx/bdend}" );@]@;
@[TeX_( " {/nx/arhssep}{/nx/rhsfulltrue}}" );@]@;
@ @=
@[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@;
@[TeX_( "/rhscnct{/yy(1)}/to/toksb" );@]@;
@[TeX_( "/edef/next{/the/toksb}" );@]@;
@[TeX_( "/ifx/next/empty" );@]@;
@[TeX_( "/else" );@]@;
@[TeX_( " /getfourth{/yy(2)}/to/toksc" );@]@;
@[TeX_( " /getfifth{/yy(2)}/to/toksd" );@]@;
@[TeX_( " /appendr/toksb{{/the/toksc}{/the/toksd}}" );@]@;
@[TeX_( "/fi" );@]@;
@[TeX_( "/yy0{/nx/rhs{/the/toksa/the/toksb" );@]@;
@[TeX_( " /nx/emptyterm}{/nx/hspace}{/nx/rhsfullfalse}}" );@]@;
@ @=
@[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@;
@[TeX_( "/rhscnct{/yy(1)}/to/toksb" );@]@;
@[TeX_( "/rhsbool{/yy(1)}/to/toksc /the/toksc" );@]@;
@[TeX_( "/ifrhsfull" );@]@;
@[TeX_( " /yy0{/nx/sprecop{/the/yy(3)}/the/yy(2)}" );@]@; /* reuse \.{\\yyval} */
@[TeX_( " /supplybdirective/toksa/yyval" );@]@; /* the directive is `absorbed' by the action */
@[TeX_( " /yy0{/nx/rhs{/the/toksa}{/the/toksb}{/nx/rhsfulltrue}}" );@]@;
@[TeX_( "/else" );@]@;
@[TeX_( " /yy0{/nx/rhs{/the/toksa" );@]@;
@[TeX_( " /nx/sprecop{/the/yy(3)}/the/yy(2)}{/the/toksb}{/nx/rhsfullfalse}}" );@]@;
@[TeX_( "/fi" );@]@;
@ @=
@[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@;
@[TeX_( "/rhscnct{/yy(1)}/to/toksb" );@]@;
@[TeX_( "/rhsbool{/yy(1)}/to/toksc /the/toksc" );@]@;
@[TeX_( "/ifrhsfull" );@]@;
@[TeX_( " /yy0{/nx/dprecop{/the/yy(3)}/the/yy(2)}" );@]@; /* reuse \.{\\yyval} */
@[TeX_( " /supplybdirective/toksa/yyval" );@]@; /* the directive is `absorbed' by the action */
@[TeX_( " /yy0{/nx/rhs{/the/toksa}{/the/toksb}{/nx/rhsfulltrue}}" );@]@;
@[TeX_( "/else" );@]@;
@[TeX_( " /yy0{/nx/rhs{/the/toksa" );@]@;
@[TeX_( " /nx/dprecop{/the/yy(3)}/the/yy(2)}{/the/toksb}{/nx/rhsfullfalse}}" );@]@;
@[TeX_( "/fi" );@]@;
@ @=
@[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@;
@[TeX_( "/rhscnct{/yy(1)}/to/toksb" );@]@;
@[TeX_( "/rhsbool{/yy(1)}/to/toksc /the/toksc" );@]@;
@[TeX_( "/ifrhsfull" );@]@;
@[TeX_( " /yy0{/nx/mergeop{/the/yy(3)}/the/yy(2)}" );@]@; /* reuse \.{\\yyval} */
@[TeX_( " /supplybdirective/toksa/yyval" );@]@; /* the directive is `absorbed' by the action */
@[TeX_( " /yy0{/nx/rhs{/the/toksa}{/the/toksb}{/nx/rhsfulltrue}}" );@]@;
@[TeX_( "/else" );@]@;
@[TeX_( " /yy0{/nx/rhs{/the/toksa" );@]@;
@[TeX_( " /nx/mergeop{/the/yy(3)}/the/yy(2)}{/the/toksb}{/nx/rhsfullfalse}}" );@]@;
@[TeX_( "/fi" );@]@;
@ @=
@[TeX_( "/yy0{}" );@]@;
@ @=
@@;
@ Identifiers.
{\it Identifiers are returned as |uniqstr| values by the scanner.
Depending on their use, we may need to make them genuine symbols.} We,
on the other hand, simply copy the values returned by the scanner.
@=
@G
id:
ID {@> @ @=}
| CHAR {@> @ @=}
;
@g
@ @=
@@;
@ @=
@G
symbol:
id {@> @ @=}
| string_as_id {@> @ @=}
;
@g
@ @=
@t}\vb{\inline}{@>
@G
id_colon:
ID_COLON {@> @ @=}
;
@g
@ A string used as an \prodstyle{ID}.
@=
@t}\vb{\inline}{@>
@G
string_as_id:
STRING {@> @ @=}
;
@g
@ The remainder of the action code is trivial, but we have reserved
placeholders for the appropriate actions in case the parser gains some
sophistication in processing low-level types (or starts expecting
different types from the scanner).
@=
@@;
@ @=
@@;
@ @=
@@;
@ @=
@@;
@ @=
@@;
@ @=
@@;
@ {\it Variable and value.
The \prodstyle{STRING} form of variable is deprecated and is not \.{M4}-friendly.
For example, \.{M4} fails for \.{\%define "[" "value"}.}
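For illustration, the accepted (and deprecated) forms of
\prodstyle{variable} and \prodstyle{value} might appear in a grammar file as
follows (the settings themselves are hypothetical):

```
%define api.prefix {hyp}     /* ID variable, braced value */
%define parse.trace          /* ID variable, empty value */
%define hyp.option "full"    /* ID variable, STRING value */
%define "hyp.option" "full"  /* deprecated STRING variable; breaks under M4 */
```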
@=
@t}\vb{\flatten\inline}{@>
@G
variable:
ID {@> @ @=}
| STRING {@> @ @=}
;
value:
{@> TeX_( "/yy0{}" ); @=}
| ID {@> @ @=}
| STRING {@> @ @=}
| "{...}" {@> TeX_( "/yy0{/nx/bracedvalue/the/yy(1)}" ); @=}
;
@g
@ @=
@t}\vb{\flatten\inline}{@>
@G
epilogue.opt:
{@> TeX_( "/yy0{}" ); @=}
| "%%" EPILOGUE {}
;
@g
@ \Cee\ preamble for the grammar parser. In this case, there are no `real' actions that our
grammar performs, only \TeX\ output, so this section is empty.
@=
@ \Cee\ postamble for the grammar parser. It is tricky to insert function definitions that use \bison's internal types:
they must be placed where those internal definitions are already visible,
yet before the definitions are put to use.
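For concreteness, a non-stub version of \.{yyprint} could dispatch on the
token type and print the matching member of the semantic value union. A
minimal sketch, assuming a stand-in \.{YYSTYPE} and hypothetical token
numbers (in the real parser both come from the \bison-generated header):

```c
#include <stdio.h>

/* Sketch only: hypothetical stand-ins for the bison-generated types. */
typedef union { int ival; const char *sval; } YYSTYPE;
enum { HYP_INT = 258, HYP_ID = 259 }; /* hypothetical token numbers */

/* A non-stub yyprint: dispatch on the token type and print the
   corresponding member of the semantic value union. */
static void yyprint( FILE *file, int type, YYSTYPE value )
{
    switch ( type ) {
    case HYP_INT: fprintf( file, " %d", value.ival );   break;
    case HYP_ID:  fprintf( file, " '%s'", value.sval ); break;
    default: break; /* tokens without interesting values */
    }
}
```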
@=
#define YYPRINT(file, type, value) yyprint (file, type, value)
static void yyprint (FILE *file, int type, YYSTYPE value){}
@ @=
@@;
@@;
@ @=
void bootstrap_tokens( char *bootstrap_token_format ) {
#define _register_token_d(name) fprintf( tables_out, bootstrap_token_format, #name, name, #name );
@@;
#undef _register_token_d
}
@ \namedspot{bootstraptokens}Here is the minimal list of tokens needed
to make the lexer just operational enough to extract the rest of the
token information from the grammar.
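Each line below expands, via the macro defined earlier, into a single
\.{fprintf} call that emits one bootstrap record per token. A stand-alone
sketch of that expansion, with a hypothetical token number and
caller-supplied format (the real ones come from the \bison\ header and the
caller of \.{bootstrap\_tokens}):

```c
#include <stdio.h>

/* Hypothetical token number; really taken from the bison-generated header. */
#define INT 258

/* Stand-alone equivalent of _register_token_d: write one bootstrap
   record for token `name' through the caller-supplied format, which
   consumes (string, int, string), i.e. the stringized name, the token
   number, and the name again. */
#define _register_token_d(buf, len, fmt, name) \
    snprintf( buf, len, fmt, #name, name, #name )
```

Thus `_register_token_d(line, sizeof line, "%s %d %s", INT)` leaves
`"INT 258 INT"` in `line`, one such record per token in the list below.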
@=
_register_token_d(INT)@;
_register_token_d(ID)@;
_register_token_d(CHAR)@;
_register_token_d(STRING)@;
_register_token_d(TAG)@;
_register_token_d(SEMICOLON)@;
_register_token_d(PERCENT_TOKEN)@;
_register_token_d(PERCENT_NTERM)@;
_register_token_d(FLEX_STATE_X)@;
_register_token_d(FLEX_STATE_S)@;
@ Union of types.
@=