

Overusing Optional SGML Features Is Dangerous

Another question is what to do with the many optional features in the SGML standard. Many of these have to do with minimizing tags—saving you from typing and making raw SGML easier to read (in case you ever have to!). But SGML minimization provides synonyms for things SGML can already express, so you can always do without it. Unlike synonyms in human languages, which always have subtle differences, a minimized SGML tag expresses exactly the same element structure as if it were unminimized, so using it doesn’t add any subtle special meanings to your documents. It may make typing or reading the tag easier, but who wants to literally type tags anyway? Better interfaces are available. For instance, you can pick element types from menus and display them as icons, in the margins, or on request.

There are some good reasons to avoid using much minimization in your SGML data (perhaps this should be called minimizing minimization?). This is especially true on the Web, for a couple of reasons.

First, systems vary in which optional features they support. Although most general SGML tools support the SHORTTAG and OMITTAG features, right now Web clients have much more limited, HTML-specific parsers. Even if they are extended to support DTDs other than HTML, they may not learn to handle marked sections, omitted start tags, and other capabilities (or they may not get the harder features exactly correct). By keeping your SGML as simple as possible, you can choose from a wider variety of tools to work with it.

If you avoid minimization, you can even use completely SGML-ignorant tools effectively. A global change from <LI> to <ITEM> is a fine way to change all instances of one kind of element into another (but don’t forget to change </LI> to </ITEM> as well!). That is, it’s fine unless you happened to use the empty tag <> or omitted some start or end tags entirely. The same snags come up with the Find command in a generic editor if you minimize: Searching for <P> doesn’t do much good if you left the tag out, to be implied via SGML’s minimization features. If you plan to convert your SGML to HTML for Web delivery, this may be important to think about.
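As a rough illustration of the point above, here is a small Python sketch of the kind of SGML-ignorant search and replace a generic tool might do. The function names rename_element and count_start_tags are invented for this example, and the code assumes the default SGML delimiters. It only sees tags that are literally present in the text.

```python
import re

def rename_element(sgml, old, new):
    """Rename literal <old ...> and </old> tags by plain text substitution."""
    # The lookahead requires whitespace or ">" after the name, so renaming
    # LI leaves <LIST> and </LIST> alone.
    pattern = re.compile(r"<(/?)" + re.escape(old) + r"(?=[\s>])", re.IGNORECASE)
    return pattern.sub(lambda m: "<" + m.group(1) + new, sgml)

def count_start_tags(sgml, name):
    """Count literal start tags for one element type, as a Find command would."""
    pattern = re.compile(r"<" + re.escape(name) + r"(?=[\s>])", re.IGNORECASE)
    return len(pattern.findall(sgml))

full = "<LIST><LI>One</LI><LI>Two</LI></LIST>"
minimized = "<LIST><LI>One</LI><>Two</LI></LIST>"  # second item uses an empty start tag

print(rename_element(full, "LI", "ITEM"))
print(count_start_tags(full, "LI"))       # 2: every item is visible to the search
print(count_start_tags(minimized, "LI"))  # 1: the item behind <> is invisible
```

With the fully tagged list, the text search finds both items. With the minimized list it finds only one, even though an SGML parser that supports SHORTTAG would normally report two list items.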

Another reason to avoid minimization is that you may want to be able to ship small pieces of an SGML document around. There’s no guarantee that a piece of SGML can be interpreted correctly if it’s taken out of context (the same thing is true in most languages, even English). An SGML document that doesn’t use much minimization has a much better chance of being interpreted correctly than one that does. Think about what an SGML parser would have to do if it got an SGML portion like this:

    <p>This is a sample/short paragraph</p>

You can probably interpret it correctly; it sure looks like a paragraph element with a few words in it. And it is, so long as a few things are assumed (in addition to assuming that delimiters such as < haven’t been redefined in the SGML declaration):

  The piece didn’t come out of someplace buried deep down, like inside a quoted attribute value (the SGML technical term is that the piece should start in CON mode).
  There are no marked sections left open, except maybe ones with the INCLUDE keyword.
  The piece didn’t come out of the middle of a CDATA element, a CDATA entity, and so on.
  The piece didn’t come out of the middle of a comment, processing instruction, or something else.
  There are no NET-enabling start tags hanging around. Those are a minimization capability in SGML, where if you code a start tag as <X/ rather than <X>, the next / in text content counts as if it were the end tag </X>. If one of those is pending, the slash in “sample/short” changes meaning and serves as an end tag, not as text content.


Caution:  
Here are SGML examples to show the context problems described. In the first four cases, the <p> isn’t really a start tag. In the last example, the final </P> would probably be reported as an SGML syntax error (because the earlier slash ended the paragraph already). You should avoid cases like these in your SGML if you anticipate having servers ship out pieces of it on demand.
    <revision original="<p>This is a sample/short paragraph</p>">

    <![ IGNORE [ <p>This is a sample/short paragraph</p> ]]>

    <!ENTITY notags CDATA "<p>This is a sample/short paragraph</p>">
    ...
    <EXAMPLE>&notags;</EXAMPLE>

    <!-- deleted 4/2/95: <p>This is a sample/short paragraph</P> -->

    <SEC/ ...<p>This is a sample/short paragraph</P>...

There are not very many possible problems, and you can completely avoid most of them by deciding only to ship pieces that amount to whole elements, and to skip using a few SGML constructs that can have long-distance effects. The ones that pose the most problems for shipping pieces of an SGML file around in isolation are these:

  Omitted or empty start tags
  #CURRENT attributes
  NET-enabling start tags
  Marked sections (you can also solve this problem if your server or conversion process can be set up to resolve marked sections before shipping data; it can literally delete the IGNORE marked sections, and so on)
  Declared content (RCDATA, CDATA, or EMPTY)

All these structures can have long-lasting effects that change how an SGML parser must interpret the incoming characters. If you avoid them, you can just ship any element out of the SGML stream and it is possible to parse it and get the start tags and end tags right (you do still have to include the DTD and declaration subset, or a way to get them, such as via a URL).
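The marked-section preprocessing mentioned in the list above could look something like the following Python sketch. This is only a minimal illustration, not a full SGML implementation: the function name resolve_marked_sections is invented here, and the code assumes default delimiters, a single literal status keyword, and no nested marked sections. Sections whose keyword is a parameter entity reference are not matched and are left untouched.

```python
import re

# Matches a marked section with one literal status keyword, for example
# "<![ IGNORE [ ... ]]>". Assumes default delimiters and no nesting.
MARKED_SECTION = re.compile(r"<!\[\s*(\w+)\s*\[(.*?)\]\]>", re.DOTALL)

def resolve_marked_sections(sgml):
    """Delete IGNORE sections and unwrap INCLUDE sections before shipping."""
    def replace(match):
        keyword = match.group(1).upper()
        body = match.group(2)
        if keyword == "IGNORE":
            return ""            # drop the ignored text entirely
        if keyword == "INCLUDE":
            return body          # keep the content, drop the wrapper
        return match.group(0)    # CDATA, RCDATA, TEMP: leave as is
    return MARKED_SECTION.sub(replace, sgml)

sample = "<sec><![ IGNORE [ <p>old text</p> ]]><![ INCLUDE [<p>new text</p>]]></sec>"
print(resolve_marked_sections(sample))  # <sec><p>new text</p></sec>
```

A real server or conversion process would more likely rely on an SGML parser for this step, since only a parser can tell a genuine marked section from look-alike text inside a comment or attribute value.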

Remember that none of these structures are errors. They are all legal, valid SGML capabilities. If you’re working in a generic SGML environment, they should all work just fine (unless the software has a bug, or the author creating the SGML misuses something). The precautions mentioned here are merely guidelines to help make the SGML easier to transport in Web-like environments where you simply can’t afford to send entire documents in one fell swoop. Since these particular SGML capabilities are not commonly used anyway, you probably won’t have to worry about them.
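If you want to confirm that a file really does avoid these constructs, a crude text scan can catch some of them. The Python sketch below is an assumption-laden illustration: the names CHECKS and find_risky_constructs are invented for this example, it assumes default delimiters, and it only recognizes a NET-enabling start tag when the slash directly follows the element name. Omitted tags, #CURRENT attributes, and declared content depend on the DTD, so a scan like this cannot see them.

```python
import re

# Patterns for constructs that are visible in the raw text.
CHECKS = {
    "empty start tag": re.compile(r"<>"),
    "NET-enabling start tag": re.compile(r"<[A-Za-z][A-Za-z0-9.-]*/"),
    "marked section": re.compile(r"<!\["),
}

def find_risky_constructs(sgml):
    """Return (line_number, label) pairs for lines that match a check."""
    findings = []
    for number, line in enumerate(sgml.splitlines(), start=1):
        for label, pattern in CHECKS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings

sample = """<sec>
<p>Plain paragraph</p>
<>Second paragraph</p>
<em/short/ form
<![ IGNORE [ <p>hidden</p> ]]>
</sec>"""

for number, label in find_risky_constructs(sample):
    print(number, label)
```

Because the scan knows nothing about SGML, it can also flag text that merely looks like one of these constructs, such as a marked-section opener quoted inside a comment. Treat its output as a list of places to inspect by hand.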

