Special Edition Using SGML: Should You Upgrade To SGML?

What Kind of Maintenance Is Needed?

There are two areas where HTML files run into maintenance problems that SGML can help with:

• Links tend to break over time.

• HTML itself changes through improvements such as new tags.

The URLs and other identifiers that HTML uses for links are very powerful, but the most common kind right now, the URL, is also fragile. A URL names a specific machine on the Internet and a specific directory and file name on that machine (technically this doesn’t have to be true, but in practice it almost always is). This method has an obvious maintenance problem: what if the file moves? A URL-based link can break in any of these ways:

• The owner moves or renames the file, or any of its containing directories (say, to install a bigger disk with a different name).

• The owner creates a new version of the file in the same place and moves the old one elsewhere (there’s an interesting question about which version old links should take you to, but we needn’t get into that here).

• The owner’s machine gets a new domain name on the Internet (for example, if someone else trademarks the name the owner had).

• The owner moves to a new company or school and takes all of his data with him.

The Internet Engineering Task Force (IETF) is working hard on Uniform Resource Names or URNs, which let links specify names instead of specific locations. This is like specifying a paper book by author and title, as opposed to “the fifth book on the third shelf in the living room at 153 Main Street.” URNs will make links a lot safer against simple changes like the ones just mentioned.

SGML already provides a similar solution for part of the problem: names called Formal Public Identifiers (FPIs), which identify entire documents or other data objects. SGML IDs for particular places within documents can be used both in general SGML and in HTML. By using FPIs or URNs to identify documents, you can ignore where documents live. When a document is really needed (such as when the reader clicks a link to it), the name is sent off to a “name server” that looks it up and tells where the nearest copy is. This works a lot like library catalogs and like the Internet routing system used for e-mail and other communications.
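The lookup step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real name server works: the catalog, the FPI text, the mirror URL, and the function names `resolve` and `move_document` are all invented for the example, and the question of which copy is "nearest" is left out.

```python
# Hypothetical catalog: each name maps to the known copies of the document.
# The FPI text and the URLs are invented for illustration only.
CATALOG = {
    "-//ABC Corp//DOCUMENT August 1995 Review//EN": [
        "http://www.abc.com/u/xyz/docs/aug95/review.htm",
        "http://mirror.example.org/abc/aug95/review.htm",
    ],
}


def resolve(name, catalog=CATALOG):
    """Return the known locations for a document name, preferred copy first."""
    try:
        return list(catalog[name])
    except KeyError:
        raise LookupError(f"no known location for {name!r}") from None


def move_document(name, old_url, new_url, catalog=CATALOG):
    """Record that one copy moved; links that use the name need no change."""
    catalog[name] = [new_url if url == old_url else url for url in catalog[name]]
```

A link that stores only the name keeps working after the file moves, because only the catalog entry has to be updated, not every document that points at it.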

Note:
You can make HTML links a little safer against change by using the new BASE feature. Very often, a document will have many links that go to nearly the same place as the document itself, such as to several different files living in the same directory on the same network server, or in neighboring directories. When this happens, the beginnings of the URLs on those links are all the same, such as:

http://www.abc.com/u/xyz/docs/aug95/review.htm
http://www.abc.com/u/xyz/docs/aug95/recipe.htm

Instead of putting the full URL on every link, you can “factor out” the common part and put it on the BASE element in the header. The links all get much shorter, but the bigger plus is that you can update them all in one step if the server or a directory moves:

<BASE HREF="http://www.abc.com/u/xyz/docs/aug95/"> ... <A HREF="review.htm"> ... <A HREF="recipe.htm">
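How a short link is combined with the base can be tried out with `urljoin` from Python's standard `urllib.parse` module. The sketch below uses the two file names from the example above; the helper name `resolve_links` and the "moved" base URL are assumptions made up for the illustration.

```python
from urllib.parse import urljoin


def resolve_links(base, relative_links):
    """Resolve each relative link against the document's base URL."""
    return [urljoin(base, link) for link in relative_links]


links = ["review.htm", "recipe.htm"]

# Base URL taken from the example in the text.
old = resolve_links("http://www.abc.com/u/xyz/docs/aug95/", links)

# Hypothetical new location after the directory moves: only the base changes.
new = resolve_links("http://www.abc.com/archive/aug95/", links)
```

Both lists are built from the same two short links; changing the one base string is the "one step" update.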

HTML is constantly being improved. While this is a good thing, it also poses compatibility problems. In HTML 1.0, <P> was not so much the start of an SGML element as a substitute for the Return key. It was an EMPTY element, so the content of the paragraph was never actually part of the P element, and there was normally no <P> tag before the first paragraph in any section. This was fixed in HTML 2.0, but funny things can happen if you view an old document in a new browser or vice versa (for example, you might not get a new line for the first paragraph after a heading).

A newer issue is tables: HTML 2.1 adds a way to mark up tables and get good formatting for them; they can even adjust automatically when the reader changes the window width. But what about tables in earlier documents? Authors often deleted their tables entirely, but when they couldn’t, they had to type tables up e-mail style, using HTML’s preformatted-text tag (<PRE>) and putting in lots of spaces (shown here as dots):

    <PRE>
    ....China....1400.million
    ....India.....800.million
    ....USA.......250.million
    ....France.....50.million
    ....Canada.....25.million
    </PRE>

These will still work in a new browser (because the <PRE> tag is still around), but they don’t get the capabilities that the new table markup provides. They can’t adapt cleanly to different window widths, you can’t wrap text within a single cell, and so on. So you can end up with awful effects like this:

        China    1400
    million
        India     800
    million
        USA       250
    million
        France     50
    million
        Canada     25
    million

To get the new capabilities, you have to go in and actually change the documents. This is one reason it’s considered bad form in SGML to use spaces for formatting. SGML helps you avoid this painful updating because you can represent your documents in whatever form makes sense for the documents themselves. That form is much less likely to change than the way you have to express it in one fixed DTD or system.

With SGML, if you need to accommodate software that doesn’t handle your markup structures, you can use a “down-translation,” that is, a process that throws away anything in your markup that a certain HTML version can’t handle. For tables, you can mark them up in any table DTD you want (CALS is the most popular) and use a program as needed to translate them to a simpler form, even the HTML 1.0 preformatted kind. Then when table support is common in browsers, you just throw the down-translation program away and deliver the same data without conversion.
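A down-translation of this kind can be quite small. The Python sketch below assumes the table has already been read out of its SGML markup into rows of cell strings (that parsing step is not shown), and the function name `table_to_pre` is invented for the example. It produces the preformatted form using the population figures from earlier in this section.

```python
def table_to_pre(rows):
    """Render rows of cell strings as fixed-width text inside <PRE>."""
    # Width of each column is the length of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = ["<PRE>"]
    for row in rows:
        # Left-align the first column, right-align the remaining columns.
        cells = [row[0].ljust(widths[0])]
        cells += [cell.rjust(w) for cell, w in zip(row[1:], widths[1:])]
        lines.append("  ".join(cells))
    lines.append("</PRE>")
    return "\n".join(lines)


# Structured source data; in practice this would come from the table markup.
ROWS = [
    ("China", "1400 million"),
    ("India", "800 million"),
    ("USA", "250 million"),
    ("France", "50 million"),
    ("Canada", "25 million"),
]

pre_text = table_to_pre(ROWS)
```

The structured rows remain the master copy. The `<PRE>` text is a throwaway product that can be regenerated at any time, or dropped once browsers handle real table markup.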

This works where “up-translation” won’t because computers are so much better at throwing information away than creating it. Tables are a lot like the earlier example with italics: if your DTD distinguishes book titles and a few (or a thousand!) other kinds of italics, it’s easy to write a program to turn all of them into just <I> for HTML-only browsers. The reverse is much harder.
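The italics case can be sketched the same way. The element names below (`BOOKTITLE`, `EMPH`, `FOREIGN`) are hypothetical stand-ins for whatever a richer DTD might define, and the regular expression is a simplification that assumes no ">" appears inside an attribute value; a real converter would use an SGML parser.

```python
import re

# Hypothetical element names; a real DTD would define its own.
ITALIC_ELEMENTS = ("BOOKTITLE", "EMPH", "FOREIGN")


def down_translate(text, names=ITALIC_ELEMENTS):
    """Replace start and end tags of the named elements with <I> and </I>."""
    pattern = re.compile(
        r"<(/?)(?:%s)\b[^>]*>" % "|".join(map(re.escape, names)),
        re.IGNORECASE,
    )
    # group(1) is "/" for an end tag and empty for a start tag.
    return pattern.sub(lambda m: "<%sI>" % m.group(1), text)


sample = "See <BOOKTITLE>Moby Dick</BOOKTITLE>, <FOREIGN>passim</FOREIGN>."
flattened = down_translate(sample)
```

Going the other way would mean deciding, for every `<I>` in the flattened text, whether it was a book title, a foreign phrase, or plain emphasis, and nothing in the output records that.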
