Special Edition Using SGML: Should You Upgrade To SGML?


The only thing to do about tag abuse syndrome is to make sure you have the right types of tags available. Few people would use H6 for small caps if there were a more appropriate emphasis element available. That is exactly why SGML is important for the Web; a lot of documents contain elements that don’t fit into the HTML set. Here are some kinds of tags that aren’t available in HTML:

• Poetry and drama: STANZA, VERSE, SPEECH, ROLES

• Computer manual-speak: COMMAND, RESPONSE, MENUNAME

• Bibliographies, card catalogs, and the like: AUTHOR, TITLE, PUBLISHER, EDITION, SUBJECT-CODE, DATE

• Back-of-book indexes: ENTRY, SUBENTRY, PAGEREF

• Dictionaries: ENTRY (of many levels), PRONUNCIATION, ETYMOLOGY, DEFINITION, SAMPLE-QUOTATION

This problem will continue to exist even though later versions of HTML add many useful new tags—no one can predict all the kinds of documents that people will invent. SGML provides the solution, because when you need a new kind of element, you can create it. You can avoid problems by trying not to force every kind of document into a single mold (just as you don’t try to make a single vehicle do the work of a bicycle, car, and Mack truck).

From time to time as you tag a document, you might feel as if the right tag just isn’t available. How often this happens is a good way to tell how well the DTD you’re using fits the document you’re working with. If the fit is too poor, the time may come to extend the DTD, or switch to an entirely different one—though this shouldn’t happen very often. It’s better to use the right DTD for each job than to force-fit; to be able to do this, users must have software that handles SGML generically rather than forcing data into any one mold.

Tip:
Moving data from one DTD to another is sometimes easy and sometimes not. It helps to have at least a little skill with a programming tool like Perl, as well as with SGML. If the two DTDs use similar structures and mostly differ in tag names, the job may be as simple as running some global changes to rename tags. If you aren’t using much SGML minimization, non-SGML tools like Perl or even a word processor’s “Search and Replace” command may be enough, because all the tags are right there: you can search for a string like <P> and change it to <PARA> (but remember to allow for tags with attributes!). On the other hand, if you’re using a lot of omitted tags, or changing to a very different DTD where you have to add or subtract containers, re-order things, and so on, it can be a lot more work.
There are also special tools available to help transform SGML documents in this way. Among them are OmniMark from Software Exoterica, the SGML Hammer from SoftQuad, and Balise from AIS.
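
As a rough illustration of the simple renaming case described in this tip, here is a minimal Python sketch. It assumes fully tagged data with no SGML minimization (no omitted or short tags) and no tag-like strings hidden inside attribute values or marked sections. The function name rename_tag and the sample text are hypothetical, made up for this example. The pattern matches both start and end tags and leaves any attributes untouched.

```python
import re

def rename_tag(text, old, new):
    """Rename every <old ...> start tag and </old> end tag to <new ...> / </new>.

    Assumes un-minimized markup: every tag is written out in full.
    """
    # The lookahead requires whitespace or ">" right after the name, so
    # renaming P does not touch <PARA> or <PRE>, while <P ID="x1"> still
    # matches and keeps its attributes.
    pattern = re.compile(r"<(/?)" + re.escape(old) + r"(?=[\s>])", re.IGNORECASE)
    return pattern.sub(lambda m: "<" + m.group(1) + new, text)

sample = '<P>First.</P><P ID="x1">Second.</P><PARA>Third.</PARA><PRE>code</PRE>'
converted = rename_tag(sample, "P", "PARA")
print(converted)
```

A plain search for the string <P> would miss the second paragraph because of its ID attribute, which is exactly the trap the tip warns about. Anything beyond renaming (adding containers, re-ordering elements) calls for a real SGML parser or one of the transformation tools named above.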

What Data Is Already in SGML?

A lot of data is already available in SGML, and a lot of that has already gone onto the Web. Because SGML was adopted first by large organizations (after all, they had the biggest document problems to solve), those organizations have been able to make a lot of data available.

From Commercial Publishers

Many publishers are moving to SGML for all their documents. Some want to preserve their investment so they can reproduce books even after the latest whiz-bang word processor is history. Some want to simplify the data conversion they do when authors send in their drafts. Some want to support new forms of multimedia delivery, information retrieval, and so on.

One of the earliest success stories for SGML in publishing is the many-volume Oxford English Dictionary (OED). For many decades, the entire OED used rooms full of 3×5 cards. But in the early 1980s, the publishers decided to go electronic. They worked with the University of Waterloo and developed sophisticated conversion programs to get the whole dictionary into SGML. One of the hardest tasks was teasing apart 25 or so different uses for italics in the scanned text: book titles, foreign words, emphasis, word origins, and so on. This is just a severe case of tag-abuse syndrome (one they couldn’t avoid, since they had to work from scanned text, and scanners can’t tell you much about distinctions other than font choice). Success in this conversion made it much easier to keep the dictionary up to date; it has also resulted in a great electronic edition that can be searched in very sophisticated ways. Because of the up-front tagging work, if you ask for all the words with Latin origins, you don’t also get all the places where “Latin” happens to show up as an emphatic word or in a book title.
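
The difference between a structure-aware search and a plain text search can be sketched in a few lines of Python. This is only an illustration, not the OED's actual markup or search software: the sample data is invented, the HEADWORD and EMPH element names are hypothetical, and the fragment is written as well-formed XML-style markup so that Python's standard xml.etree.ElementTree parser can read it.

```python
import xml.etree.ElementTree as ET

# Invented sample data; ENTRY, ETYMOLOGY, DEFINITION, and SAMPLE-QUOTATION
# follow the dictionary tags listed earlier, HEADWORD and EMPH are assumed.
sample = """
<DICTIONARY>
  <ENTRY><HEADWORD>aquarium</HEADWORD>
    <ETYMOLOGY>Latin</ETYMOLOGY>
    <DEFINITION>A tank for fish.</DEFINITION></ENTRY>
  <ENTRY><HEADWORD>kindergarten</HEADWORD>
    <ETYMOLOGY>German</ETYMOLOGY>
    <DEFINITION>A school for young children.</DEFINITION>
    <SAMPLE-QUOTATION>She taught <EMPH>Latin</EMPH> there.</SAMPLE-QUOTATION></ENTRY>
</DICTIONARY>
"""

root = ET.fromstring(sample)

def words_with_origin(root, language):
    """Headwords whose ETYMOLOGY element mentions the given language."""
    hits = []
    for entry in root.iter("ENTRY"):
        etym = entry.find("ETYMOLOGY")
        if etym is not None and language in "".join(etym.itertext()):
            hits.append(entry.findtext("HEADWORD"))
    return hits

def entries_mentioning(root, word):
    """Headwords of entries where the word appears anywhere in the text."""
    return [entry.findtext("HEADWORD")
            for entry in root.iter("ENTRY")
            if word in "".join(entry.itertext())]

structured = words_with_origin(root, "Latin")
plain = entries_mentioning(root, "Latin")
print(structured)
print(plain)
```

The structured query returns only the entry whose word origin is Latin, while the plain text search also picks up the entry that merely uses “Latin” as an emphasized word in a quotation.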

Another major SGML publishing project is the Chadwyck-Healey English Poetry Database. This project is collecting all English poetry from the earliest stages of English up to 1900 and publishing it on a series of CD-ROMs with sophisticated search software. Some of it everyone has read, some of it only an English professor could love—but it’s all going to be there, in SGML.

Journal publishers have recently started using SGML to speed up the review and publishing cycle (see fig. 19.1). Platform and format independence make it easier to ship files to the many people involved. The fact that all kinds of software—from authoring to online and paper delivery systems—can now deal with SGML also makes it a good common format for them.

Fig. 19.1 SGML is being used for a variety of sophisticated documents, including technical and scientific journals. Screen shot courtesy of Lightbinders, San Francisco (http://lbin.com).
