Special Edition Using SGML: Practicalities of Working with SGML on the Web

SGML Requires More Thinking Up Front

SGML has many advantages for managing documents on the Web, providing more features and greater flexibility for your readers. However, these advantages do not come for free. Authoring SGML generally takes more work than authoring HTML, and fewer tools are available. Two key areas require more work. First, you have to choose a DTD and (depending on that choice) perhaps do more detailed tagging than HTML would need. Second, you probably have to deal with more hierarchical structures instead of a fairly flat “list of paragraphs” model.

DTD Choice, Design, and Modification

If you’re using HTML, you’ve already chosen a DTD, and the only details are deciding which proprietary extensions to permit (if any), and which revision of the standard DTD to choose. With generic SGML, you have many more options, and you need to see if there is a DTD already available for the kind of data you’re dealing with. If not, and if no other one seems similar enough to use (as-is or after minor tweaking), you may end up creating a whole new DTD, and maintaining it as time goes on. For all the details on creating and maintaining DTDs, see Part III, “Content Modeling: Developing the DTD.”

A related decision is just how to use the tags available. You can distinguish fine levels of detail, or you can throw many things into each of a few big buckets, or most likely of all, you can do some of each. The finer the distinctions you make, the more accurately you can search and the more refined you can make your formatting. On the other hand, each distinction you make implies a certain commitment; if you’re tagging book titles as separate from italic for emphasis, the user may be surprised if you mis-tag one as the other.

Whenever you make tagging distinctions that go beyond marking up fonts, indents, and other obvious physical features, you’re making human decisions. A scanner and good OCR software can flag every place where the text turns italic, bold, and back again, but it can’t tell you why. For that, you need human intervention. Most of the time, the decisions are easy for people: it’s seldom hard to distinguish a title from a foreign phrase. Doing it really well on complicated texts with fine-grained markup is harder. Telling an etymology from a field-of-study marker in a dictionary entry takes more skill; marking up the syllable structures in poetry, or telling a theme from an allusion in literature, is more difficult still, though literary scholars may do exactly that for the texts they study.

Decide up front how much tagging you need for your purposes; think about how much effort you can apply, and then find a sensible balance. Write the decisions down somewhere, because after tagging a lot of text the goals may get fuzzy. The goals are also useful information for end users of the text: they help specify what people can expect to do with it. For this reason, it is often useful to include a summary of such decisions in the text itself, for example in a special element put in just for that purpose.

Need To Think More Hierarchically

Hierarchies buy a lot of power over flatter, one- or two-level structures. At the same time, they can make for extra data preparation work, because most word processors have very little hierarchy. For example, most word processors have no notion of a “list,” only list items, which is why they can’t automatically keep list items numbered as you insert and delete them while editing. The same problem comes up with large containers, such as chapters and sections, and with blocks embedded in the middle of paragraphs. Moving data from such word processors to SGML requires some extra work: setting up the converter to group adjacent list items into a list, to notice that the unindented “paragraph” following an embedded block is really a continuation of the paragraph element that came before it, and so on.
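The list-grouping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flat input format (pairs of style name and text), the style names "para" and "item", and the output tag names are all hypothetical, not taken from any particular word processor or DTD.

```python
from itertools import groupby

# Hypothetical flat export from a word processor: (style, text) pairs.
paragraphs = [
    ("para", "Before you begin:"),
    ("item", "Choose a DTD."),
    ("item", "Decide how finely to tag."),
    ("para", "Then start converting."),
]

def group_list_items(flat):
    """Wrap each run of adjacent 'item' paragraphs in a single 'list' node."""
    tree = []
    for style, run in groupby(flat, key=lambda p: p[0]):
        texts = [text for _, text in run]
        if style == "item":
            tree.append(("list", [("item", t) for t in texts]))
        else:
            tree.extend((style, t) for t in texts)
    return tree

def to_markup(tree):
    """Serialize the nested structure using illustrative tag names."""
    out = []
    for name, content in tree:
        if name == "list":
            out.append("<list>")
            out.extend(f"  <item>{t}</item>" for _, t in content)
            out.append("</list>")
        else:
            out.append(f"<{name}>{content}</{name}>")
    return "\n".join(out)

result = group_list_items(paragraphs)
print(to_markup(result))
```

A real converter would also have to handle nested lists and continuation paragraphs, which need more context than a simple run of identical styles.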

SGML Assures More Consistency and Flexibility

The payoff for all that extra up-front thinking is consistency. Once you have your data in SGML using an appropriate DTD, you can depend on it for a wider range of processing. Because SGML lets you validate document structure against the DTD, you can guarantee that the structures your processing relies on are actually present. For example, knowing that every section has a title means you have less to worry about when setting up a table of contents. Knowing that chapters can’t show up inside of footnotes lets you optimize queries, simplify formatting specifications, and so on.
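As a rough illustration of the table-of-contents case, the sketch below builds a list of section titles from a small sample. It assumes a well-formed, XML-style stand-in for the SGML source so that Python's standard `xml.etree.ElementTree` can read it; the element names `chapter`, `section`, `title`, and `para` are hypothetical. With a validated document the missing-title branch should never run, which is exactly the simplification that validation buys.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element names are illustrative only.
SAMPLE = """<chapter>
  <title>Practicalities</title>
  <section><title>DTD Choice</title><para>Pick a DTD.</para></section>
  <section><title>Hierarchy</title><para>Think in trees.</para></section>
</chapter>"""

def build_toc(root):
    """Collect section titles, relying on the rule that each section has one."""
    toc = []
    for section in root.iter("section"):
        title = section.find("title")
        if title is None:
            # A validating parser would already have rejected this document.
            raise ValueError("section without a title")
        toc.append(title.text)
    return toc

toc = build_toc(ET.fromstring(SAMPLE))
print(toc)
```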

At the same time, SGML is more flexible than HTML alone. If you need a new kind of element, you can create it. Since it’s very hard to think of everything in advance, this kind of extensibility pays off in the long run.

SGML Helps Make Allowances for HTML

As important as consistency and flexibility, however, is the added power you get for Web delivery. Having your data in SGML lets you express a lot of information that helps you deliver better HTML.

If the data “knows” the difference between conceptual classes of elements (because the tagging distinguishes them), you can choose at any moment whether to collapse them into a single HTML (or other) element, or to keep them distinct. This is most effective under the integrated conversion model discussed earlier, but it also helps in batch models.
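The choice between collapsing and keeping distinctions can be pictured as two mapping tables applied to the same source data. The following is a minimal sketch, not a real conversion tool: the source element names (`booktitle`, `foreign`, `emph`) and the sample sentence are hypothetical, and the data is modeled as simple (element, text) pairs rather than parsed SGML.

```python
from html import escape

# Hypothetical source element names mapped to HTML elements.
COLLAPSED = {"booktitle": "i", "foreign": "i", "emph": "i"}
DISTINCT = {"booktitle": "cite", "foreign": "i", "emph": "em"}

def render(fragments, mapping):
    """Render (element, text) pairs as HTML using the chosen mapping."""
    parts = []
    for name, text in fragments:
        if name is None:  # plain text between elements
            parts.append(escape(text))
        else:
            tag = mapping[name]
            parts.append(f"<{tag}>{escape(text)}</{tag}>")
    return "".join(parts)

sentence = [
    (None, "She read "),
    ("booktitle", "Moby-Dick"),
    (None, " "),
    ("emph", "twice"),
    (None, "."),
]

collapsed_html = render(sentence, COLLAPSED)
distinct_html = render(sentence, DISTINCT)
print(collapsed_html)
print(distinct_html)
```

Because the source keeps the distinction, switching between the two outputs is only a matter of choosing a different table at delivery time.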

Because you can keep more distinctions than HTML provides, you can have more information. If you have more information, you can do more with your data. You’ve already seen how you can customize HTML data for different clients, different versions of each client, different users, and different views.

Another useful technique is to filter views down to show only the information that’s relevant to the current user and that user’s current needs. For example, a user can do a query and the server can pick out just the few pieces of the SGML document(s) that are relevant. Those pieces are then tagged in HTML and sent back as a valid document. The pieces might not be valid if they were just pulled out and stuck end-to-end (for example, the result might be missing required elements, such as front matter, titles, and so on). But the conversion tool can map them into a structure that conforms to HTML or some other specific DTD.
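The wrapping step might look something like the sketch below. It assumes the query has already selected the relevant pieces; the `hits` data, the function name `wrap_as_html`, and the choice of `h2` and `p` for the output are all hypothetical. The point is only that the tool supplies the required scaffolding (here `html`, `head`, `title`, and `body`) that the raw fragments lack.

```python
from html import escape

# Hypothetical pieces picked out by a query: (section title, paragraph text).
hits = [
    ("DTD Choice", "With generic SGML, you have many more options."),
    ("Hierarchy", "Most word processors have very little hierarchy."),
]

def wrap_as_html(query, pieces):
    """Supply the required HTML scaffolding around extracted fragments."""
    body = []
    for title, text in pieces:
        body.append(f"<h2>{escape(title)}</h2>")
        body.append(f"<p>{escape(text)}</p>")
    return "\n".join([
        "<html>",
        f"<head><title>Results for {escape(query)}</title></head>",
        "<body>",
        *body,
        "</body>",
        "</html>",
    ])

page = wrap_as_html("hierarchy", hits)
print(page)
```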

Note:
In publishing (especially legal publishing), this is called boilerplating—you grab pieces of documents from a collection of useful bits and assemble the document you need from the pieces. SGML gives you the information about your information that makes this practical.
