Previous | Table of Contents | Next |
There are two areas where HTML files run into maintenance problems that SGML can help with: links that break when files move, and changes to HTML itself.
While the URLs and other identifiers that HTML uses for links are very powerful, the most common kind right now, the URL itself, is also fragile. A URL names a specific machine on the Internet, and a specific directory and file name on that machine (technically, this doesn't have to be true, but in practice it almost always is). This method has an obvious maintenance problem: what if the file moves? A URL-based link can break whenever any of those parts changes: the machine name, the directory, or the file name.
The Internet Engineering Task Force (IETF) is working hard on Uniform Resource Names or URNs, which let links specify names instead of specific locations. This is like specifying a paper book by author and title, as opposed to the fifth book on the third shelf in the living room at 153 Main Street. URNs will make links a lot safer against simple changes like the ones just mentioned.
SGML provides a similar solution for part of the problem already, through names called Formal Public Identifiers or FPIs for entire documents or other data objects. SGML IDs for particular places within documents can be used both in general SGML and in HTML. By using FPIs or URNs to identify documents, you can ignore where documents live. When a document is really needed (such as when the reader clicks a link to it), the name is sent off to a name server that looks it up and tells where the nearest copy is. This works a lot like library catalogs and like the Internet routing system used for e-mail and other communications.
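The lookup step described above can be sketched in a few lines of Python. This is only an illustration, not how any real name server works: the catalog contents, the FPI string, the mirror host, and the `resolve` function are all invented for the example, and "nearest copy" is simplified to "first copy listed".

```python
# Hypothetical catalog mapping location-independent names to known copies.
# The FPI and the mirror URL are invented for illustration only.
CATALOG = {
    "-//ABC Corp//DOCUMENT Restaurant Review August 1995//EN": [
        "http://www.abc.com/u/xyz/docs/aug95/review.htm",
        "http://mirror.abc.com/docs/aug95/review.htm",
    ],
}


def resolve(name, catalog=CATALOG):
    """Return the first known location for a name, or None if it is unknown.

    A real name server would pick the nearest copy; taking the first
    entry is a stand-in for that choice.
    """
    locations = catalog.get(name, [])
    return locations[0] if locations else None
```

If the document moves, only the catalog entry changes; every link that uses the name keeps working.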
Note:
You can make HTML links a little safer against change by using the new BASE feature. Very often, a document will have many links that go to nearly the same place as the document itself, such as to several different files living in the same directory on the same network server, or in neighboring directories. When this happens, the beginning of the URL on each of those links is the same, such as http://www.abc.com/u/xyz/docs/aug95/ in the example that follows. Instead of putting the full URL on every link, you can factor out the common part and put it on the BASE element in the header. The links all get much shorter, but the bigger plus is that you can update them all in one step if the server or a directory moves:
<BASE ID=b1 HREF="http://www.abc.com/u/xyz/docs/aug95/">
...
<A BASE=b1 HREF="review.htm">
...
<A BASE=b1 HREF="recipe.htm">
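To see why factoring out the base helps, here is a small Python sketch of roughly what a browser does when it combines a base URL with short relative links. It uses the standard library's `urllib.parse.urljoin`; the `resolve_links` helper and the "moved" base URL are assumptions made up for the example.

```python
from urllib.parse import urljoin

BASE = "http://www.abc.com/u/xyz/docs/aug95/"  # from the example above
LINKS = ["review.htm", "recipe.htm"]           # the short relative links


def resolve_links(base, hrefs):
    """Combine each relative link with the base URL."""
    return [urljoin(base, href) for href in hrefs]


# Full URLs as they resolve today.
resolved = resolve_links(BASE, LINKS)

# If the directory moves (hypothetical new location), changing the one
# base value updates every link at once.
moved = resolve_links("http://www.abc.com/archive/aug95/", LINKS)
```

Only the base string changes between the two calls; the links themselves are untouched.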
HTML is constantly being improved. While this is a good thing, it also poses compatibility problems. In HTML 1.0, <P> was not so much the start of an SGML element as a substitute for the Return key. It was an EMPTY element, so the content of the paragraph was never actually part of the P element, and there was normally no <P> tag before the first paragraph in any section. This has been fixed in HTML 2.0, but funny things can happen if you view an old document in a new browser or vice-versa (for example, you might not get a new line for the first paragraph after a heading).
A newer issue is tables: HTML 2.1 adds a way to mark up tables and get good formatting for them; they can even adjust automatically when the reader changes the window width. But what about tables in earlier documents? Authors often deleted their tables entirely, but when they couldn't, they had to type tables up e-mail style, using HTML's preformatted-text tag (<PRE>) and putting in lots of spaces:
<PRE>
....China....1400.million
....India.....800.million
....USA.......250.million
....France.....50.million
....Canada.....25.million
</PRE>
These will still work in a new browser (because the <PRE> tag is still around), but they don't get the advantages or capabilities that the new table support offers. They won't re-wrap to different window widths, you can't wrap text within a single cell, and so on. So you can end up with awful effects like this:
China 1400 million India 800 million USA 250 million France 50 million Canada 25 million
To get the new capabilities, you have to go in and actually change the documents. This is one reason it's considered bad form in SGML to use spaces for formatting. SGML helps you avoid this painful updating because you can represent your documents in whatever form makes sense for the documents themselves. That form is much less likely to change than the way you have to express it in one fixed DTD or system.
With SGML, if you need to accommodate software that doesn't handle your markup structures, you can use a down-translation, that is, a process that throws away anything in your markup that a certain HTML version can't handle. For tables, you can mark them up in any table DTD you want (CALS is the most popular) and use a program as needed to translate them to a simpler form, even the HTML 1.0 preformatted kind. Then when table support is common in browsers, you just throw the down-translation program away and deliver the same data without conversion.
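As a rough illustration of such a down-translation, the Python sketch below turns table data held as structured rows into the space-padded <PRE> form shown earlier. The row data comes from the example above; the `table_to_pre` function, its parameters, and the choice to keep the table as a list of tuples (rather than parsing a real table DTD such as CALS) are assumptions made to keep the sketch short.

```python
# Table content kept in a structured form (a stand-in for real table markup).
ROWS = [
    ("China", "1400 million"),
    ("India", "800 million"),
    ("USA", "250 million"),
    ("France", "50 million"),
    ("Canada", "25 million"),
]


def table_to_pre(rows, indent=4, gap=3):
    """Down-translate two-column rows into a <PRE> block padded with spaces.

    Names are left-aligned and values right-aligned, so the column
    structure survives only as whitespace.
    """
    name_width = max(len(name) for name, _ in rows)
    value_width = max(len(value) for _, value in rows)
    lines = ["<PRE>"]
    for name, value in rows:
        lines.append(
            " " * indent
            + name.ljust(name_width)
            + " " * gap
            + value.rjust(value_width)
        )
    lines.append("</PRE>")
    return "\n".join(lines)


pre_text = table_to_pre(ROWS)
```

The structured rows remain the master copy. Once browsers handle tables directly, the function can be dropped and the rows delivered as real table markup.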
This works where up-translation won't because computers are so much better at throwing information away than creating it. Tables are a lot like the earlier example with italics: if your DTD distinguishes book titles and a few (or a thousand!) other kinds of italics, it's easy to write a program to turn all of them into just <I> for HTML-only browsers. The reverse is much harder.
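The italics case can be sketched the same way. The element names below (BOOKTITLE, EMPH, FOREIGN) are hypothetical stand-ins for whatever a richer DTD might define, and the regular expression only handles simple tags without attributes, so treat this as a sketch rather than a real SGML processor.

```python
import re

# Hypothetical element types that a richer DTD renders in italics.
ITALIC_ELEMENTS = ("BOOKTITLE", "EMPH", "FOREIGN")


def down_translate(text, names=ITALIC_ELEMENTS):
    """Replace every listed start or end tag with a plain <I> or </I>."""
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(r"<(/?)(?:%s)>" % alternatives, re.IGNORECASE)
    # Group 1 is the optional slash, so end tags stay end tags.
    return pattern.sub(r"<\g<1>I>", text)


sample = "I read <BOOKTITLE>Moby Dick</BOOKTITLE> with <EMPH>great</EMPH> interest."
flattened = down_translate(sample)
```

After the translation, nothing in the output says which <I> was a book title and which was emphasis, which is why going the other way would need a person (or a lot of guesswork) to put that information back.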