Special Edition Using SGML: Should You Upgrade To SGML?

Chapter 19
Should You Upgrade To SGML?

The World Wide Web has brought more attention to SGML than anything else. Most WWW documents (other than bit-mapped graphics) are SGML documents that use the HTML DTD. If you’re using HTML, you’re using SGML, although there’s much more to SGML. On the other hand, most Web browsers don’t support any other DTDs besides HTML. This means that all the other SGML data in the world can’t be browsed easily on the Web. (But take heart! You’ll learn about several solutions in this chapter.)

This chapter begins by telling you how SGML relates to HTML and what’s already happening with SGML on the Web. Then you learn about the practical issues: how to decide whether to go with HTML or SGML for your Web data, and how to take advantage of each one’s strengths while avoiding its weaknesses.

In this chapter, you will learn how to:

• Describe the relationship between HTML and SGML

• Understand what data is already in SGML

• Understand why this data is in SGML, not HTML

• Decide whether your data belongs in SGML

• Overcome the challenges of upgrading

• Benefit the most from upgrading

How HTML and SGML Relate

People often say that HTML is a subset of SGML. This is nearly right, but it’s a bit more complicated. Technically, HTML is an application of SGML. This means that it’s really a DTD: a set of tags and rules for where the tags can go. SGML is a language for composing DTDs that fit various kinds of documents. There are many applications, and therefore many DTDs (HTML, the DTD for the World Wide Web, is probably the best-known one).
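
To make the idea of “a set of tags and rules for where the tags can go” concrete, here is a minimal sketch in Python. It is only a toy: a real DTD is written in SGML’s own declaration syntax and can also express ordering, repetition, and attributes, none of which is modeled here. The element names and rules in TOY_DTD, and the function find_violations, are made up for illustration and are not taken from the HTML DTD.

```python
# Toy illustration only: a real SGML DTD is written in SGML's own
# declaration syntax and is far richer than this dictionary.
# Each (hypothetical, simplified) element type maps to the element
# types allowed directly inside it. "#text" stands for character data.
TOY_DTD = {
    "body": {"h1", "p", "ul"},
    "h1": {"#text", "em"},
    "p": {"#text", "em", "a"},
    "ul": {"li"},
    "li": {"#text", "em", "a"},
    "em": {"#text"},
    "a": {"#text"},
}


def find_violations(node, dtd=TOY_DTD):
    """Return a message for every child the toy DTD does not allow.

    A node is a (name, children) tuple; a text node is a plain string.
    Only containment is checked, not the order of the children.
    """
    name, children = node
    if name not in dtd:
        return [f"unknown element type: {name}"]
    problems = []
    for child in children:
        child_name = "#text" if isinstance(child, str) else child[0]
        if child_name not in dtd[name]:
            problems.append(f"{child_name} is not allowed inside {name}")
        if not isinstance(child, str):
            # Check the child's own contents as well.
            problems.extend(find_violations(child, dtd))
    return problems


good = ("body", [("h1", ["Intro"]), ("p", ["Some ", ("em", ["text"])])])
bad = ("body", [("p", ["Text ", ("li", ["stray item"])])])

print(find_violations(good))
print(find_violations(bad))
```

The first document follows the toy rules, so nothing is reported for it; the second puts a list item directly inside a paragraph, which the toy rules do not allow. An SGML parser does the same kind of checking, only against the real DTD.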

You already know that a DTD is always designed for some particular type of document: business letters, aircraft manuals, poetry, and so on. An important question when deciding whether to put some data in HTML or another SGML DTD is “What kind of documents is the HTML DTD meant for?”

Here is a sample of the kinds of tags that exist in HTML (the new version 3.0 of HTML is being finalized even as I write, and further improvements are still coming, so this list will improve a bit very soon). First, HTML has a lot of tags for marking up common kinds of structures (not all of which are listed here):

• Headings: H1, H2...

• Divisions (the actual big containers like chapters and sections, that contain headings and other data): DIV

• Basic document blocks (paragraphs, block quotations, footnotes, various kinds of lists): P, BQ, FN, OL, UL, DL

• Tables and equations (only in newer browsers): these involve many different element types

• Text emphasis: EM, STRONG

• Hypermedia links: A, IMG

• Interactive forms: INPUT, TEXTAREA

HTML also includes several element types that express formatting rather than structure. These pose some portability problems, but they can be useful in cases where you simply must have a certain layout:

• Font changing, such as for getting bold and italic type: B, I

• Various extensions that work only with certain browsers: BLINK, FONT, etc.

• Forced line breaks (mostly used in code samples, “pre-formatted text,” and similar examples): BR, PRE

• Drawing rules, boxes, and so on: HR

From the selection of element types, you can easily see the kinds of documents HTML is best for: fairly simple documents with sections, paragraphs, lists, and the like. In fact, most of the HTML element types are pretty generic; nearly every DTD has paragraphs and lists in it. One place where HTML excels, however, is in linking. Although it only has a couple of element types for links, they can use URLs to point to any data anywhere in the world. For more details on HTML, you may want to read Que’s Special Edition Using HTML.

So, why use other SGML DTDs? The main reason is that not all documents consist of only these basic kinds of elements. Whenever you run across some other kind of element, you have to “cheat” to express it in HTML. A very common example is the level-6 heading element in HTML (H6). Because the first browsers formatted H6 headings in small caps and there was no text emphasis tag that would give the same effect, people got in the habit of using H6 to mean “small caps.” Of course, some people also use H6 as a heading, and many people use it both ways.

This works fine—until something changes. Suppose that a browser comes along that lets users adjust the text styles for different tags, for example. Someone changes H6 to look like something besides small caps, and everyone who was counting on small caps is surprised. Sometimes this won’t matter, but it might; what if the user wants all the headings big and all the text emphasis small? Or what if the user is blind? When his browser runs across an H6 element, it wouldn’t do any good for his browser to put it in large type, so instead maybe its computer-generated voice says “section” and reads the heading loudly; in the same way, maybe such a browser is not supposed to do anything special for small caps.
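
The surprise can be sketched in a few lines of Python. Everything here is hypothetical: the style names, the DEFAULT_STYLES table, and the render function stand in for whatever a real browser does internally. The point is only that a style is attached to the tag, not to the author’s intent.

```python
# Hypothetical style table; the names and values are illustrative only
# and do not describe any particular browser.
DEFAULT_STYLES = {"h1": "large bold", "h6": "small caps", "p": "plain"}


def render(elements, user_styles=None):
    """Label each (tag, text) pair with the style chosen for its tag.

    Styles set by the user override the defaults.
    """
    styles = {**DEFAULT_STYLES, **(user_styles or {})}
    return [f"[{styles.get(tag, 'plain')}] {text}" for tag, text in elements]


# The author used H6 here only to get small caps for a company name.
doc = [("h1", "Annual Report"), ("h6", "Acme Corp"), ("p", "Body text.")]

print(render(doc))                        # defaults: H6 looks like small caps
print(render(doc, {"h6": "large bold"}))  # user restyles H6 as a real heading
```

With the default table, the company name comes out in small caps as the author hoped. Once the user restyles H6, the same text is rendered like a heading, because the program has no way to tell a real H6 heading from an H6 that was borrowed for its looks.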

The most important problem, though, is that you might later want to use the tags for something completely different from formatting. What if a browser is really friendly and makes automatic outlines by grabbing all the headings? Or what if you want to do a search, but only for text in headings (you might want to do that because a word that occurs in a heading is probably more important than one that just occurs in the main text)?
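
Both uses are easy to sketch with Python’s standard html.parser module. This is a rough illustration under simple assumptions, not how any particular browser or search engine works; the names HeadingCollector, build_outline, and headings_containing, and the sample document, are invented for the example.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect a (level, text) pair for every H1..H6 element."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []    # finished (level, text) pairs
        self._level = None   # level of the heading being read, if any
        self._parts = []     # text pieces of the current heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.outline.append((self._level, "".join(self._parts).strip()))
            self._level = None


def build_outline(html_text):
    """Return the document's headings in order as (level, text) pairs."""
    collector = HeadingCollector()
    collector.feed(html_text)
    collector.close()
    return collector.outline


def headings_containing(html_text, word):
    """Search heading text only, ignoring case."""
    return [text for _, text in build_outline(html_text)
            if word.lower() in text.lower()]


# H6 is used here for its small-caps look, not as a heading.
sample = """
<h1>Annual Report</h1>
<p>Prepared by <h6>Acme Corp</h6> staff.</p>
<h2>Sales</h2>
"""

for level, text in build_outline(sample):
    print("  " * (level - 1) + text)

print(headings_containing(sample, "acme"))
```

Because the company name was tagged H6 to get small caps, it turns up in the outline as a deeply nested “section” and is returned by the headings-only search, even though it was never a heading at all.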

Using a tag because it gets the right formatting effect is always a problem, usually a delayed one; it works fine when you do it, but the “gotcha” comes later. People working with the distant ancestors of SGML made up a name for this: “tag abuse syndrome.”
