Special Edition Using SGML: Practicalities of Working with SGML on the Web

Table of Contents

Many common element types map one-for-one, such as PARA to P. Some other elements in DocBook don’t have an HTML equivalent, so this conversion table says “untag,” meaning that the converter shouldn’t give them any start or end tags at all, but should just send their content through untagged.

The last two cases are more interesting. In HTML, headings of different levels have different names: H1, H2, and so on. These elements just contain titles, not whole sections. But DocBook has containers, such as CHAPTER and SECTION, that really are the entire units. CHAPTER and SECTION elements both have titles that appear in sub-elements. The titles are tagged TITLE regardless of whether they’re titles of CHAPTERs or of SECTIONs. Programs can tell from context which are chapter titles and which are section titles. This kind of difference in tag usage arises all the time when you convert data from one SGML DTD to another.

Because of this difference, it wouldn’t be accurate to just convert all DocBook TITLE elements into the same HTML element (such as H1). If you did, chapter titles and section titles would end up looking exactly the same. Instead, you would convert titles within CHAPTER elements to H2, but titles within SECTION elements to H3. (Of course, any title within a SECTION is also inside a larger CHAPTER, but for converting, you care only about the element the title is directly within.) The last lines of the example above are there to cover these cases. They state that certain TITLE elements qualify to turn into H2, and certain others into H3.

Putting a container name in front of a tag name (with some character like a slash to separate them) is one way to state such a restriction. Many conversion programs provide a method like this, and you can get far better results using it.

Remember, though, that a string such as chapter/title or list/item/p isn’t really a tag name—you won’t see <chapter/title> in a valid SGML file. Most people call these qualified tag names or qualified GIs. These are used in various programs that process SGML, not in SGML itself.
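As a rough illustration of how a conversion program might apply qualified names, here is a minimal Python sketch. It is not the conversion table discussed above or any particular product's method: the RULES table, the lowercase element names, and the lookup and convert functions are all hypothetical, and the sample input is assumed to be fully tagged, well-formed markup (no SGML minimization), so that Python's standard XML parser can read it.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical conversion table (an assumption for illustration only).
# Keys are plain element names or qualified names "parent/child".
# A value of None means "untag": emit the content with no tags around it.
RULES = {
    "para": "p",
    "emphasis": "em",
    "chapter": None,
    "section": None,
    "chapter/title": "h2",
    "section/title": "h3",
}


def lookup(name, parent):
    """Return the HTML tag for an element, preferring a qualified rule."""
    if parent is not None:
        qualified = f"{parent}/{name}"
        if qualified in RULES:
            return RULES[qualified]
    # Elements with no rule at all are treated as "untag".
    return RULES.get(name)


def convert(elem, parent=None):
    """Convert an element and its descendants to an HTML string."""
    tag = lookup(elem.tag, parent)
    parts = []
    if tag:
        parts.append(f"<{tag}>")
    parts.append(escape(elem.text or "", quote=False))
    for child in elem:
        # Only the direct container matters, so pass elem.tag as the parent.
        parts.append(convert(child, elem.tag))
        parts.append(escape(child.tail or "", quote=False))
    if tag:
        parts.append(f"</{tag}>")
    return "".join(parts)


SAMPLE = (
    "<chapter><title>Intro</title>"
    "<section><title>Scope</title>"
    "<para>Some <emphasis>text</emphasis>.</para>"
    "</section></chapter>"
)

html_out = convert(ET.fromstring(SAMPLE))
print(html_out)
# prints: <h2>Intro</h2><h3>Scope</h3><p>Some <em>text</em>.</p>
```

The qualified rule is checked first, and the plain name is only a fallback. That ordering is what lets the same TITLE tag become H2 in one place and H3 in another.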

Note:
To see how well you remember all the SGML minimization rules, try answering this question: “If <chapter/title> did show up in an SGML file, would it necessarily be a syntax error, or would it mean something valid to the SGML parser (and if it’s valid, what would it mean)?”

Two products are already available that translate SGML documents to HTML on the fly and serve them to the Web. One is from Open Text. It uses script programs to do the conversion and works with standard Web servers to ship out data from an indexed SGML form. The other is DynaWeb from Electronic Book Technologies, which does much the same thing except that the translation is done within the program, based on the same stylesheet tools used for formatting SGML for CD-ROM delivery, print, and other uses. Both of these tools are highly effective for delivering SGML on the Web, can handle very large and complex documents, and include fast search and retrieval features.

Generic Conversion Tools

The second major category of “SGML to HTML” Web solutions is pretty similar to the first, except that conversion is done ahead of time in a batch process. That is, authors or publishers prepare their data in SGML as usual, but rather than install the SGML on the server (to be converted on the fly), they convert the SGML to HTML and put the HTML on the server. Figure 20.4 illustrates the process. The SGML to HTML converter does not have to run on the same system as the HTML server.

Fig. 20.4 You can also convert SGML documents to HTML ahead of time using any generic conversion tool, and then install the HTML on a Web server.

Even though the two processes are fairly similar, the choice between them has real consequences. The pluses of the batch-conversion approach are:

• You save a lot of processor time in the long run, because the data is only processed once (or at least, only once per revision or modification).

• You can use any conversion tool you want; it doesn’t have to have the capability to be hooked up to a Web server. UNIX gurus can use sed, awk, and similar tools; programmers can use C or Icon or any other programming language; and lots of people can use Perl.
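The “only once per revision” point can be sketched as a small batch driver. This is a minimal illustration under stated assumptions, not a description of any real tool: the .sgm file extension, the function names, and the convert_text callable (which stands in for whatever converter you choose) are all hypothetical.

```python
from pathlib import Path


def needs_conversion(src: Path, dst: Path) -> bool:
    """True if the HTML output is missing or older than its SGML source."""
    return not dst.exists() or src.stat().st_mtime > dst.stat().st_mtime


def batch_convert(src_dir, out_dir, convert_text):
    """Convert every *.sgm file in src_dir that changed since the last run.

    convert_text is any function taking the source text and returning HTML
    text. Returns the names of the HTML files that were (re)written.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in sorted(src_dir.glob("*.sgm")):
        dst = out_dir / (src.stem + ".html")
        if needs_conversion(src, dst):
            html_text = convert_text(src.read_text(encoding="utf-8"))
            dst.write_text(html_text, encoding="utf-8")
            converted.append(dst.name)
    return converted
```

Running batch_convert a second time with no changes to the sources does no conversion work, and the output directory can then be copied to whatever machine runs the Web server.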

On the other hand, there are some minuses:

• You have to maintain two copies of all the data, one in HTML and one in SGML.

• If you want to take advantage of a newer version of HTML, you have to reconvert all your data, not just change the conversion filters.

• You have to ship the same HTML to all clients. You miss a big advantage of the integrated conversion approach: you can’t customize what you send.
