The Online Computer Library Center (OCLC) has put a lot of searchable data on the Web. It also provides a place to get Panorama, a product that can serve as a helper application for reading SGML documents across the Web. You can find SoftQuad's Panorama at www.sq.com/htmlsgml/htmlsgml.htm, as well as on the CD-ROM included with this book.
OCLC also provides a useful tool called Fred, which does two main things. First, you can send it an SGML document without a DTD, and it will send you back a DTD that matches. This won't be an ideal DTD, but it will at least get your document to parse, which opens the door to using SGML software to go further. Second, Fred can do certain kinds of translation on SGML data, so it can help you move data from one DTD to another (even to HTML, or to non-SGML formats like TeX). You can reach Fred at www.oclc.org/fred/.
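To make the idea of deriving a DTD from a document more concrete, here is a minimal Python sketch. It is not Fred's method (the text does not describe how Fred works), and it rests on two assumptions: the input is well-formed enough for Python's standard xml.etree.ElementTree parser, which real SGML with omitted tags is not, and the declarations are written XML-style. The function name infer_dtd and the sample memo are hypothetical. The sketch records which children each element type contains and emits a deliberately loose declaration for each, which is the "not ideal, but it parses" kind of result described above.

```python
import xml.etree.ElementTree as ET


def infer_dtd(document: str) -> str:
    """Return loose <!ELEMENT ...> declarations that the given document satisfies.

    Hypothetical helper for illustration; assumes well-formed, XML-style markup.
    """
    root = ET.fromstring(document)
    children = {}   # element name -> set of child element names seen
    has_text = {}   # element name -> True if any instance holds non-blank text

    for elem in root.iter():
        kids = children.setdefault(elem.tag, set())
        text_seen = bool(elem.text and elem.text.strip())
        for child in elem:
            kids.add(child.tag)
            # Text after a child's end tag belongs to the parent's content.
            if child.tail and child.tail.strip():
                text_seen = True
        has_text[elem.tag] = has_text.get(elem.tag, False) or text_seen

    lines = []
    for name in sorted(children):
        kids = sorted(children[name])
        if has_text[name]:
            # Mixed content: character data plus any child seen, in any order.
            model = "(" + " | ".join(["#PCDATA"] + kids) + ")*"
        elif kids:
            # Element content: any child seen, in any order, any number of times.
            model = "(" + " | ".join(kids) + ")*"
        else:
            model = "EMPTY"
        lines.append(f"<!ELEMENT {name} {model}>")
    return "\n".join(lines)


sample = "<memo><to>Pat</to><body><para>See the <emph>new</emph> draft.</para></body></memo>"
print(infer_dtd(sample))
```

Every content model here is a repeatable "any of these" group, so the result accepts the sample document but says nothing about order or required parts. Tightening it is the work a hand-written DTD would do.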
The big advantages these sites make use of are: 1) SGML's capability to adjust to whatever structure is appropriate for a given document; 2) its capability to handle very large applications when needed; and 3) its capability to serve as an information-rich source from which you can generate many other formats at will (such as HTML).
Given these strengths and successes, you might wonder about the associated costs: how much extra time and effort are required to get the benefits. The biggest factor is that using SGML is an investment; you have to put in more up front to get even more return down the road. Here are some ways that shows up:
Getting your data into a more effective SGML DTD will probably cost more than just getting it into HTML (this might mean time or money or both, depending on what you're doing and how you're doing it). Any off-the-shelf scanner with good OCR software can get you a text with the font changes and paragraph breaks marked. From there, it's pretty easy to get to HTML tags, such as <P>, <H1>, <I>, and <B>.
Tables, figures, equations, multi-column text, and footnotes are harder because scanners can't do as much to identify them for you. But if you have simple documents, a little scanning, a few global changes, and some proofreading can get them to be at least readable on the Web.
With more general SGML DTDs, you could just pretend most of the element types weren't there, and mark everything up using only a few types of tags. This can be useful as a starting point, especially if you're working from scanned documents or other unstructured data. But when you have a more powerful DTD, it's a shame not to make more use of it. In many DTDs, you have the possibility of making more useful distinctions, and using them lets you provide more features and flexibility to your readers.
For example, you could create a software manual using either HTML or DocBook. DocBook gives you many more choices of element types, including types for all the kinds of special names you run into in a software manual. For instance, there are tags you can use to mark whether "Open" is being used as a menu item name, a button name, a command to type, and much more. In HTML, there aren't enough tags to go around, so you can't separate those types; you have to lump things into more general categories, like "italic" or "bold."
Why does this matter? Not so much because of formatting, though that's a factor. The more important reasons have to do with flexibility. For example, if there's no distinction in the tagging, you can't provide readers with a reliable way to search for the word "Open" where it's mentioned as a menu item without getting all the other "Opens" that are around. If they're in a hurry to look up how to use the Open menu item, they may not be happy about this. In the same way, if there's no distinction, you can't provide readers with formatting options, such as a way for them to ask to see all descriptions of error messages displayed in red today. That too can be a very handy option.
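The searching point can be sketched in a few lines of Python. The element names guimenuitem, guibutton, and command are DocBook's, but the manual fragment is made up, the helper find_mentions is hypothetical, and the fragment is written as well-formed XML only so that Python's standard parser can read it.

```python
import xml.etree.ElementTree as ET

# Made-up manual fragment; the element names are DocBook's, the text is illustrative.
MANUAL = """
<chapter>
  <para>Choose <guimenuitem>Open</guimenuitem> from the File menu.</para>
  <para>Click the <guibutton>Open</guibutton> button to confirm.</para>
  <para>At the prompt, type <command>open</command> and a file name.</para>
  <para>Leave the cover open while the printer warms up.</para>
</chapter>
"""


def find_mentions(root, word, element_type):
    """Return the paragraphs in which `word` appears tagged as `element_type`."""
    hits = []
    for para in root.iter("para"):
        for elem in para.iter(element_type):
            if (elem.text or "").strip().lower() == word.lower():
                # itertext() gathers the paragraph's text across nested elements.
                hits.append(" ".join("".join(para.itertext()).split()))
                break
    return hits


root = ET.fromstring(MANUAL)
menu_hits = find_mentions(root, "Open", "guimenuitem")
print(menu_hits)

# A plain-text search cannot tell the four uses apart.
all_hits = [p for p in root.iter("para") if "open" in "".join(p.itertext()).lower()]
print(len(menu_hits), "tagged hit versus", len(all_hits), "plain-text hits")
```

If the same four sentences had been marked up with only italic or bold, the element-type filter would have nothing to work with, and the reader would be left with the plain-text result.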
People may not expect this kind of sophisticated searching or flexible formatting from an HTML file, but they probably will expect it if you use DocBook, since DocBook provides these capabilities and most DocBook users make use of them.
On the other hand, to use that available power you have to do more detailed tagging, and that takes some work. If there are only a dozen distinct tags actually used, it doesn't matter much whether they're selected out of HTML or DocBook.
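One rough way to see how much of a DTD a document really uses is to count its distinct element types. The following Python sketch does that; the function name tag_inventory and the sample document are hypothetical, and the input is assumed to be well-formed enough for Python's standard XML parser.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_inventory(document: str) -> Counter:
    """Count how many times each element type occurs in a well-formed document."""
    root = ET.fromstring(document)
    return Counter(elem.tag for elem in root.iter())


sample = "<doc><title>Setup</title><p>Run <b>install</b>.</p><p>Then <b>restart</b>.</p></doc>"
inventory = tag_inventory(sample)
for name, count in inventory.most_common():
    print(f"{name}: {count}")
print(len(inventory), "distinct element types in use")
```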
It's much less work if authors put in such distinctions when they write than it is to add them later. To continue the last example, documentation authors ought to know at the moment they write the word "Open" whether they mean a menu item name or something else. They probably have to do some action in their word processor to mark it (such as underline it or put it in quotes). Because of this, it isn't much extra work to pick "Mark as Menu Name" rather than "Mark as Underlined," and if they have to do more than just one thing (like make it italic and put it in quotes), they should be ahead of the game by using a descriptive SGML tag rather than a bunch of specific formatting commands.
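The trade between one descriptive tag and several formatting commands can be illustrated with a small style map in Python. This is a sketch under stated assumptions, not any particular product's style sheet mechanism: the STYLES table and the render function are hypothetical, the element names are borrowed from DocBook, and the source is written as well-formed XML so the standard library can parse it.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical style sheet: each descriptive element type maps to the text
# emitted before and after its content. Changing an entry restyles every use.
STYLES = {
    "para": ("<P>", "</P>"),
    "guimenuitem": ("<I>&quot;", "&quot;</I>"),   # italic and quoted
    "guibutton": ("<B>", "</B>"),
    "errortext": ('<FONT COLOR="red">', "</FONT>"),
}


def render(elem, styles):
    """Convert an element tree to HTML text using the given style map."""
    before, after = styles.get(elem.tag, ("", ""))
    parts = [before, escape(elem.text or "")]
    for child in elem:
        parts.append(render(child, styles))
        # Text following the child's end tag is still part of this element.
        parts.append(escape(child.tail or ""))
    parts.append(after)
    return "".join(parts)


source = "<para>Choose <guimenuitem>Open</guimenuitem>, then click <guibutton>OK</guibutton>.</para>"
print(render(ET.fromstring(source), STYLES))
```

The author marks the menu item once; the italic-plus-quotes treatment lives in a single table entry, so switching every menu item to underlining, or showing error text in a different color, means editing the map rather than the document.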
On the other hand, it can be a big pain to go through and add tags later. To do it, someone has to go through and read the whole document and decide which format changes were done for which reasons, or which entirely unmarked parts of the document should be identified in some new way. This may not cost a lot, but it certainly costs something.
The important questions are "How much can you afford to mark up?" and "Which things are the most important?" Factors in deciding which things are important include: how the document will be formatted, how users can search it, how servers can process it for other uses (like converting it to HTML versions, to various output forms, and so on), and how users can refer to it.