Appendix A. HTML Grammar
For the most part, the exact syntax of an HTML or XHTML document is not rigidly
enforced by a browser. This gives authors wide latitude in creating
documents and gives rise to documents that work on most browsers, but
are actually incompatible with the HTML and XHTML standards. Stick to
the standards unless your documents are fly-by-night affairs.
The standards explicitly define the ordering and nesting of tags and
document elements. This syntax is
embedded within the appropriate Document Type Definition and is not
readily understood by those not versed in SGML (for HTML 4.01, see
Appendix D, "The HTML 4.01 DTD") or XML (for XHTML 1.0, see Appendix E, "The XHTML 1.0 DTD"). Accordingly, we provide an alternate
definition of the allowable HTML and XHTML syntax, using a fairly
common tool called a "grammar."
Grammar, whether it defines English sentences or HTML documents, is
just a set of rules that indicates the order of language elements.
These language elements can be divided into two sets:
terminal (the actual words of the language) and
nonterminal (all other grammatical rules). In
HTML and XHTML, the words correspond to the embedded markup tags and
text in a document.
To use the grammar to create a valid document, follow the order of
the rules to see where the tags and text may be placed.
A.1. Grammatical Conventions
We use a number of typographic and
punctuation conventions to make our grammar easy to understand.
A.1.1. Typographic and Naming Conventions
For our grammar, we denote the terminals
with a bold, monospaced Courier typeface. The nonterminals appear in
italicized text.
We also use a simple naming convention for the majority of our
nonterminals: if one defines the syntax of a specific tag, its name
will be the tag name followed by _tag. If a
nonterminal defines the various language elements that may be nested
within a certain tag, its name will be the tag name followed by
_content.
For example, if you are wondering exactly which elements are allowed
within an <a> tag, you can look for the
a_content rule within the grammar. Similarly, to
determine the correct syntax of a definition list created with the
<dl> tag, look for the
dl_tag rule.
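As a quick illustration of the naming convention, the following Python sketch derives both rule names from a tag. The helper name rule_names is hypothetical and not part of the grammar; because the convention covers only the majority of nonterminals, a derived name is not guaranteed to exist in the grammar.

```python
def rule_names(tag):
    """Return the (syntax rule, content rule) names for a tag such as '<dl>'.

    Hypothetical helper: it only applies the _tag / _content naming
    convention and does not check that the rules exist in the grammar.
    """
    name = tag.strip("<>/ ").lower()  # "<dl>" -> "dl"
    return name + "_tag", name + "_content"


print(rule_names("<a>"))   # ('a_tag', 'a_content')
print(rule_names("<dl>"))  # ('dl_tag', 'dl_content')
```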
A.1.2. Punctuation Conventions
Each rule in the grammar starts with the
rule's name, followed by the replacement symbol (::=) and the
rule's value. We've intentionally kept the grammar
simple, but we do use three punctuation elements to denote
alternation, repetition, and optional elements in the grammar.
A.1.2.1. Alternation
Alternation indicates a rule may actually have several different
values, and you must choose exactly one of them. Vertical bars (|)
separate the alternatives for the rule.
For example, the heading rule is equivalent to
any one of six HTML heading tags, and so appears in the table as:
    heading ::= h1_tag
              | h2_tag
              | h3_tag
              | h4_tag
              | h5_tag
              | h6_tag
The heading rule tells us that wherever the
heading nonterminal appears in a rule, you can
replace it with exactly one of the actual heading tags.
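Alternation can be sketched in Python as choosing one entry from a list. The dictionary ALTERNATIVES and the function expansions are assumptions made for this illustration, not part of the grammar itself; each occurrence of heading is replaced independently by exactly one of its six alternatives.

```python
from itertools import product

# Hypothetical encoding: a nonterminal maps to its list of alternatives.
ALTERNATIVES = {
    "heading": ["h1_tag", "h2_tag", "h3_tag", "h4_tag", "h5_tag", "h6_tag"],
}


def expansions(symbols):
    """Yield every sequence produced by replacing each alternation
    nonterminal in `symbols` with exactly one of its alternatives.
    Symbols without an entry in ALTERNATIVES are left unchanged."""
    choices = [ALTERNATIVES.get(symbol, [symbol]) for symbol in symbols]
    for combination in product(*choices):
        yield list(combination)


for sequence in expansions(["heading", "plain_text"]):
    print(sequence)  # six lines, from ['h1_tag', 'plain_text'] to ['h6_tag', 'plain_text']
```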
A.1.2.2. Repetition
Repetition indicates that an element within a rule may be repeated
some number of times. Repeated elements are enclosed in curly braces
({...}). When the minimum number of repetitions is other than one, the
closing brace carries a subscripted number giving that minimum; a
subscript of 0, for example, means zero or more.
For example, the <ul> tag may only contain
<li> tags, or it may actually be empty. The
rule, therefore, is:
    ul_tag ::= <ul>
                 {li_tag}₀
               </ul>
The rule says that the syntax of the <ul>
tag requires the <ul> tag, zero or more
<li> tags, followed by a closing
</ul> tag.
We spread this rule across several lines and indented some of the
elements only to make it more readable; it does not imply that your
documents must actually be formatted this way.
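The repetition rule can be checked mechanically. The Python sketch below assumes a document has already been reduced to a list of grammar symbols; the function names match_repeat and matches_ul are hypothetical, and the default minimum of one reflects our reading of the subscript convention described above.

```python
def match_repeat(tokens, start, symbol, minimum=1):
    """Consume consecutive `symbol` entries beginning at index `start`.

    Returns the index just past the run, or None if the run is shorter
    than `minimum`.
    """
    end = start
    while end < len(tokens) and tokens[end] == symbol:
        end += 1
    return end if end - start >= minimum else None


def matches_ul(tokens):
    """Check a token list against: ul_tag ::= <ul> {li_tag}0 </ul>"""
    tokens = list(tokens)
    if not tokens or tokens[0] != "<ul>":
        return False
    end = match_repeat(tokens, 1, "li_tag", minimum=0)
    return end is not None and tokens[end:] == ["</ul>"]


print(matches_ul(["<ul>", "</ul>"]))                      # True
print(matches_ul(["<ul>", "li_tag", "li_tag", "</ul>"]))  # True
print(matches_ul(["<ul>", "p_tag", "</ul>"]))             # False
```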
A.1.2.3. Optional elements
Some elements may appear in a document, but are not required.
Optional elements are enclosed in square brackets ([ and ]).
The <table> tag, for example, has an
optional caption:
    table_tag ::= <table>
                    [caption_tag]
                    {tr_tag}₀
                  </table>
The rule says that a table begins with the
<table> tag, followed by an optional
caption and zero or more table-row tags, and ends with the
</table> tag.
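An optional element either appears once, in its place, or not at all. The following Python sketch checks a list of grammar symbols against the table_tag rule; the function name matches_table and the token-list representation are assumptions made for illustration.

```python
def matches_table(tokens):
    """Check a token list against:
    table_tag ::= <table> [caption_tag] {tr_tag}0 </table>
    """
    tokens = list(tokens)
    if len(tokens) < 2 or tokens[0] != "<table>" or tokens[-1] != "</table>":
        return False
    body = tokens[1:-1]
    # Optional element: skip at most one caption, and only in first position.
    if body and body[0] == "caption_tag":
        body = body[1:]
    # Repetition with a minimum of zero: everything left must be a row.
    return all(token == "tr_tag" for token in body)


print(matches_table(["<table>", "</table>"]))                           # True
print(matches_table(["<table>", "caption_tag", "tr_tag", "</table>"]))  # True
print(matches_table(["<table>", "tr_tag", "caption_tag", "</table>"]))  # False
```

A caption that follows a row, or a second caption, fails the check because the rule allows the caption only once and only before the rows.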
A.1.3. More Details
Our grammar stops at the tag level; it does not delve further to show
the syntax of each tag, including tag attributes. For these details,
refer to the Quick Reference card included with this book.
A.1.4. Predefined Nonterminals
The HTML and XHTML standards define a few specific kinds of content
that correspond to various types of text. We use these content types
throughout the grammar. They are:
- literal_text
  Text is interpreted exactly as specified; no character entities or
  style tags are recognized.
- plain_text
  Regular characters in the document character encoding, along with
  character entities denoted by the ampersand character.
- style_text
  Like plain_text, with physical- and content-based style tags allowed.
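The difference between literal_text and plain_text can be sketched in Python. The two function names are hypothetical, and html.unescape from the Python standard library stands in for a browser's entity handling; it follows HTML5 rules, so it is only an approximation of the behavior the HTML 4.01 and XHTML 1.0 standards describe. style_text is not modeled here, since it would also require recognizing tags.

```python
import html


def read_literal_text(raw):
    """literal_text: characters are taken exactly as written."""
    return raw


def read_plain_text(raw):
    """plain_text: character entities (introduced by '&') are decoded."""
    return html.unescape(raw)


sample = "AT&amp;T &copy; 2002"
print(read_literal_text(sample))  # AT&amp;T &copy; 2002
print(read_plain_text(sample))    # AT&T © 2002
```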
Copyright © 2002 O'Reilly & Associates. All rights reserved.