Special Edition Using SGML: Should You Upgrade To SGML?

Why Is This Data in SGML, Not HTML?

Because of all these users, there is a lot of SGML data out there. Why did all these companies choose SGML instead of HTML? Mostly because it’s a generic solution; it lets them use tags appropriate to the kinds of documents each one cares about. This means describing the document parts themselves rather than how they should appear on today’s output device. This generic approach is why SGML data outlasts the programs that process it, and that can mean huge long-term savings. HTML can do this for a limited number of cases, but not in general. There are other reasons for using SGML:

• Scalability. SGML has features, such as entity management, that make it easier to work with large documents. A printed airplane manual often outweighs the plane itself, and the documentation system had better not choke.

• Validation. SGML’s ability to check whether documents really conform to the publisher’s rules is important in industry, especially in the current world of liability lawsuits. However, validating a document doesn’t ensure that it makes sense, any more than correct spelling does.

• Information retrieval. Big documents are hard to work with, and SGML tagging puts in the “hooks” you need to make search and retrieval software work much better. True containers for big organizing units are especially helpful here, like CHAPTER and SECTION instead of just H1 and H2.

• Version management. High-tech manuals and ancient literature share a common problem: they come in many versions, and it can make a big difference which one you get. Although not a true version-management system by itself, SGML has features that form a good foundation for one (marked sections, attributes, modularity, boilerplating, and so on).

• Customizable presentations. This relates to version management, too. Because SGML doesn’t predefine formatting and layout, delivery tools can customize the display for each user as needed: show extra hints for novices, hide secret information, and so on. This is what Ted Nelson (he invented the term hypertext) calls stretchtext: the document should smoothly expand and contract to match the user’s interests.

• Access for the print-disabled. Again, because SGML gets away from formatting details, it is easy to convert SGML documents for delivery in Braille, via text-to-speech converters, and so on. Several books have been converted this way in record time.

All these advantages apply to paper production, online delivery, and information retrieval. But once you lay out pages for print, most of these advantages disappear; once all the lines and page breaks are set, the page representation takes over and getting back to the structure is very difficult.
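The customizable-presentation point above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the AUDIENCE attribute, its values, and the sample paragraphs are invented for this example, and the fragment is written as well-formed XML so that Python's standard library can parse it (full SGML would need a real SGML parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. AUDIENCE is an invented attribute, not part of
# any standard DTD; untagged paragraphs are meant for every reader.
DOC = """<SECTION>
<P>Open the access panel.</P>
<P AUDIENCE="novice">The panel is the grey door marked A4.</P>
<P AUDIENCE="internal">Supplier part number is confidential.</P>
<P>Replace the filter.</P>
</SECTION>"""

def render_for(xml_text, audiences):
    """Return the paragraph texts visible to a reader in the given audiences."""
    root = ET.fromstring(xml_text)
    visible = []
    for para in root.iter("P"):
        audience = para.get("AUDIENCE")
        # Show untagged paragraphs always, tagged ones only on request.
        if audience is None or audience in audiences:
            visible.append(para.text)
    return visible

# An expert sees two paragraphs; a novice also sees the extra hint.
expert_view = render_for(DOC, set())
novice_view = render_for(DOC, {"novice"})
```

Because the source never says how a paragraph should look, the same file can serve both readers; the choice is made at delivery time, not at authoring time.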

Five Questions To Ask about My Data

Given all the advantages of generic SGML for big projects, yet all the simplicity of HTML for simple ones, how do you decide which way to go? There are five questions you can ask that will help you choose.

What Functionality Do I Need?

If your documents fit the HTML model and consist mostly of the kinds of elements HTML provides, HTML is probably a good choice. This is especially true if the documents are also small (tens of pages, not thousands). But if you have big documents or documents with special structures or elements, SGML will take you a lot farther.

If you need to do information retrieval, SGML is also better. You can search HTML, but you can’t easily pin down just where hits are. This is because the HTML tags don’t divide data up as finely as you can with full SGML, and HTML doesn’t typically tag large units such as sections (the tags have only been added in the latest revision, and they’re still optional).
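To make the retrieval point concrete, here is a minimal sketch of how true containers let a search report where each hit is. The element names (BOOK, CHAPTER, SECTION, TITLE, P) and the sample text are assumptions made up for this example, and the fragment is written as well-formed XML so that Python's standard library can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with real containers for chapters and sections.
DOC = """
<BOOK>
  <CHAPTER><TITLE>Engines</TITLE>
    <SECTION><TITLE>Starting</TITLE>
      <P>Check the fuel valve before starting.</P>
    </SECTION>
    <SECTION><TITLE>Shutdown</TITLE>
      <P>Close the fuel valve after shutdown.</P>
      <P>Log the engine hours.</P>
    </SECTION>
  </CHAPTER>
  <CHAPTER><TITLE>Landing Gear</TITLE>
    <SECTION><TITLE>Inspection</TITLE>
      <P>Inspect the tires for wear.</P>
    </SECTION>
  </CHAPTER>
</BOOK>
"""

def find_hits(xml_text, word):
    """Return (chapter title, section title, paragraph text) for each hit."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for chapter in root.iter("CHAPTER"):
        for section in chapter.iter("SECTION"):
            for para in section.iter("P"):
                text = para.text or ""
                if word.lower() in text.lower():
                    # The enclosing containers say exactly where the hit is.
                    hits.append((chapter.findtext("TITLE"),
                                 section.findtext("TITLE"),
                                 text))
    return hits

for chapter, section, text in find_hits(DOC, "fuel"):
    print(chapter, ">", section, ":", text)
```

With only H1 and H2 headings and no enclosing elements, a program would have to guess where each section ends; here the nesting itself carries that information.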

Finally, if you need to deliver in more forms than just the Web, you should consider SGML. Tools are available to turn SGML not only into Web pages, but into paper pages, most kinds of word processor files, CD-ROM publications, Braille, and many other forms. This can all be done with HTML in theory, but it’s harder in practice.

Do I Need Flexible Data Interchange?

SGML eases data interchange in several ways. Because it helps you avoid using tags for things they don’t quite fit, your data is easier to move to other systems, especially if the tags can take advantage of finer distinctions. For example, if you tag book titles, emphasized words, and foreign words as <I> in HTML, you have a problem when you move to something that can distinguish book titles and emphasis, such as a program to extract and index bibliographies. If you make the finer distinctions, you have a choice later whether to treat the items the same or differently.

Computers are pretty bad at sorting things into meaningful categories when they look the same. You almost need artificial intelligence to decide which italic text is a book title and which is something else. The good news is that computers are really good at the opposite task; if you’ve already marked up book titles and emphasized words as different things (say, <TI> and <EMPH>), it’s no problem at all for a computer to show them both as italic.
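The easy direction can be sketched in a few lines. This is an illustration only: the STYLE table, the function names, and the sample sentence are invented here, and the simple pattern handles only tags without attributes.

```python
import re

# Hypothetical presentation rule: both descriptive tags display as italic.
STYLE = {"TI": "I", "EMPH": "I"}

def to_presentation(text):
    """Rewrite descriptive tags as their display tags; leave other tags alone."""
    def swap(match):
        slash, name = match.group(1), match.group(2)
        return "<%s%s>" % (slash, STYLE.get(name.upper(), name))
    return re.sub(r"<(/?)([A-Za-z][A-Za-z0-9]*)>", swap, text)

def book_titles(text):
    """Collect book titles, which is only possible while <TI> is still distinct."""
    return re.findall(r"<TI>(.*?)</TI>", text)

sample = "I <EMPH>really</EMPH> liked <TI>Moby-Dick</TI>."
print(to_presentation(sample))  # I <I>really</I> liked <I>Moby-Dick</I>.
print(book_titles(sample))      # ['Moby-Dick']
```

Going the other way is the hard part: once both are plain <I>, no table lookup can tell the title from the emphasized word.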

Because of this, interchange down the road is much easier if you break things up early and make as many distinctions as practical. On the other hand, each distinction may be a little extra work, so you need to balance long-term flexibility against how much time and effort you can put in up front. To figure out this balance, be sure to consider just how long you think your data will last (you’re safest to at least double your first guess) and how important your data is.

Importance and lifespan don’t always go together. Stock quotes are pretty important when they’re current, but after a year, only a few specialists ever look at them. At the other extreme, some literature that started out on stone tablets thousands of years ago is still important. Where does your data fit?
