Special Edition Using SGML:Integrating SGML and HTML Environments

The Online Computer Library Center (OCLC)

http://www.oclc.org:5046

The Online Computer Library Center (OCLC) has put a lot of searchable data on the Web. It also provides a place to get Panorama, a product that can serve as a helper application so that you can read SGML documents across the Web. You can find SoftQuad’s Panorama at www.sq.com/htmlsgml/htmlsgml.htm, as well as on the CD-ROM included with this book.

OCLC also provides a useful tool called Fred, which does two main things. First, you can send it an SGML document without a DTD, and it will send you back a DTD that matches. This won’t be an ideal DTD, but it will at least get your document to parse, which opens the door to using SGML software to go farther. Second, Fred can do certain kinds of translation on SGML data, so it can help you move data from one DTD to another (even to HTML, or to non-SGML formats like TeX). You can reach Fred at www.oclc.org/fred/.
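To make the first of those jobs concrete, here is a minimal sketch of guessing a loose DTD from a tagged document. It is not how Fred itself works; the function name infer_dtd, the sample memo, and the simplifying assumptions (every element has explicit start and end tags, no attributes, no entities) are all invented for illustration.

```python
import re

# One token is a start tag, an end tag, or a run of character data.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>|([^<]+)")

def infer_dtd(document):
    """Build loose ELEMENT declarations from a fully tagged document instance."""
    children = {}   # element name -> child element names, in first-seen order
    has_text = {}   # element name -> True if it directly contains character data
    stack = []      # currently open elements, innermost last
    for slash, name, text in TOKEN.findall(document):
        if text:
            # Whitespace between tags does not count as character data.
            if stack and text.strip():
                has_text[stack[-1]] = True
        elif slash:
            if stack:
                stack.pop()
        else:
            name = name.lower()
            children.setdefault(name, [])
            has_text.setdefault(name, False)
            if stack and name not in children[stack[-1]]:
                children[stack[-1]].append(name)
            stack.append(name)
    lines = []
    for name, kids in children.items():
        parts = (["#PCDATA"] if has_text[name] else []) + kids
        if not parts:
            # Nothing was seen inside the element; allow text as a safe guess.
            parts = ["#PCDATA"]
        lines.append("<!ELEMENT {0} - - ({1})*>".format(name, "|".join(parts)))
    return "\n".join(lines)

# Hypothetical document instance with no DTD.
SAMPLE = "<memo><to>Staff</to><body><p>Read <emph>this</emph> today.</p></body></memo>"
dtd = infer_dtd(SAMPLE)
print(dtd)
```

A DTD guessed this way accepts the document it came from, but it says nothing about required order or how often an element may occur. That is the sense in which such a DTD is not ideal and only gets the document to parse.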

The big advantages these sites make use of are: 1) SGML’s capability to adjust to whatever structure is appropriate for a given document; 2) its capability to handle very large applications when needed; and 3) its capability to serve as an information-rich source from which you can generate many other formats at will (such as HTML).

Compromises To Be Made

Given these strengths and successes, you might wonder about the associated costs: how much extra time and effort are required to get the benefits? The biggest factor is that using SGML is an investment; you have to put in more up front to get even more return down the road. Here are some of the ways that investment shows up:

Conversion Can Be Expensive

Getting your data into a more effective SGML DTD will probably cost more than just getting it into HTML (this might mean time or money or both, depending on what you’re doing and how you’re doing it). Any off-the-shelf scanner with good OCR software can get you a text with the font changes and paragraph breaks marked. From there, it’s pretty easy to get to HTML tags, such as <P>, <H1>, <I>, and <B>.

Tables, figures, equations, multi-column text, and footnotes are harder because scanners can’t do as much to identify them for you. But if you have simple documents, a little scanning, a few global changes, and some proofreading can get them to be at least readable on the Web.

With more general SGML DTDs, you could just pretend most of the element types weren’t there, and mark everything up using only a few types of tags. This can be useful as a starting point, especially if you’re working from scanned documents or other unstructured data. But when you have a more powerful DTD, it’s a shame not to make more use of it. In many DTDs, you have the possibility of making more useful distinctions, and using them lets you provide more features and flexibility to your readers.

For example, you could create a software manual using either HTML or DocBook. DocBook gives you many more choices of element types, including types for all the kinds of special “names” you run into in a software manual. There are tags you can use to mark whether “Open” is being used as a menu item name, a button name, a command to type, and much more. In HTML, there aren’t enough tags to go around, so you can’t separate those types; you have to lump things into more general categories, like italic or bold.

Why does this matter? Not so much because of formatting, though that’s a factor. The more important reasons have to do with flexibility. For example, if there’s no distinction in the tagging, you can’t provide readers with a reliable way to search for the word “Open” where it’s mentioned as a menu item without getting all the other “Opens” that are around. If they’re in a hurry to look up how to use the “Open” menu item, they may not be happy about this. In the same way, if there’s no distinction, you can’t provide readers with formatting options, such as a way for them to ask to see all descriptions of error messages displayed in red today. That too can be a very handy option.
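The search problem can be sketched in a few lines. The two sample strings below are invented for illustration; the element names guimenuitem, guibutton, and command are assumed from DocBook releases that include them, and find_in_element is a hypothetical helper, not part of any SGML tool.

```python
import re

# Hypothetical sentence tagged with descriptive, DocBook-style element types.
DOCBOOK_SAMPLE = (
    "<para>Choose <guimenuitem>Open</guimenuitem> from the File menu, "
    "or click the <guibutton>Open</guibutton> button. "
    "You can also type <command>open</command> at the prompt.</para>"
)

# The same sentence after everything has been lumped into bold in HTML.
HTML_SAMPLE = (
    "<P>Choose <B>Open</B> from the File menu, "
    "or click the <B>Open</B> button. "
    "You can also type <B>open</B> at the prompt.</P>"
)

def find_in_element(text, element, word):
    """Return each occurrence of word that is the whole content of the given element type."""
    pattern = re.compile(
        r"<{0}>\s*({1})\s*</{0}>".format(re.escape(element), re.escape(word)),
        re.IGNORECASE,
    )
    return pattern.findall(text)

# With descriptive tags, the search can ask for menu items only.
menu_hits = find_in_element(DOCBOOK_SAMPLE, "guimenuitem", "Open")

# With only bold available, every kind of "Open" comes back together.
bold_hits = find_in_element(HTML_SAMPLE, "B", "Open")

print(menu_hits)   # ['Open']
print(bold_hits)   # ['Open', 'Open', 'open']
```

The descriptive version returns the single menu item, while the bold version returns all three mentions with no way to tell them apart. The same distinction is what would let a viewer restyle one element type, such as error messages, without touching the others.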

People may not expect this kind of sophisticated searching or flexible formatting from an HTML file, but they probably will expect it if you use DocBook, since DocBook provides these capabilities and most DocBook users make use of them.

On the other hand, to use that available power you have to do more detailed tagging, and that takes some work. If there are only a dozen distinct tags actually used, it doesn’t matter much whether they’re selected out of HTML or DocBook.

It’s much less work if authors put in such distinctions when they write than it is to add them later. To continue the last example, documentation authors ought to know at the moment they write the word “Open” whether they mean a menu item name or something else. They probably have to do some action in their word processor to mark it (such as underline it or put it in quotes). Because of this, it isn’t much extra work to pick “Mark as Menu Name” rather than “Mark as Underlined”—and if they have to do more than just one thing (like make it italic and put it in quotes), they should be ahead of the game by using a descriptive SGML tag rather than a bunch of specific formatting commands.

By contrast, adding tags later can be a big pain. Someone has to read through the whole document and decide which format changes were made for which reasons, and which entirely unmarked parts of the document should be identified in some new way. This may not cost a lot, but it certainly costs something.

The important questions are, “How much can you afford to mark up?” and “Which things are the most important?” Factors in deciding which things are important include: how the document will be formatted, how users can search it, how servers can process it for other uses (like converting it to HTML versions, to various output forms, and so on), and how users can refer to it.
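The point about servers converting a document to HTML can be illustrated with a small sketch. The mapping table, the function name down_convert, and the sample sentence are all assumptions made for this example; a real conversion would go through an SGML parser and handle attributes, entities, and nesting rules.

```python
import re

# Assumed mapping from descriptive element types to HTML presentation tags.
TO_HTML = {
    "para": "P",
    "guimenuitem": "B",
    "guibutton": "B",
    "command": "TT",
    "errorname": "I",
}

# Matches a simple start or end tag with no attributes.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")

def down_convert(text, mapping=TO_HTML):
    """Replace each known start or end tag with its HTML counterpart; drop unknown tags."""
    def swap(match):
        slash, name = match.group(1), match.group(2).lower()
        target = mapping.get(name)
        return "<{0}{1}>".format(slash, target) if target else ""
    return TAG.sub(swap, text)

source = (
    "<para>Choose <guimenuitem>Open</guimenuitem> "
    "or type <command>open</command>.</para>"
)
html = down_convert(source)
print(html)   # <P>Choose <B>Open</B> or type <TT>open</TT>.</P>
```

Notice that the conversion only runs one way: menu items and buttons both become bold, so the distinction cannot be recovered from the HTML afterward. That is one reason to decide early which distinctions are worth marking up in the source.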
