

An engine with really poor SGML support may decide to ignore half the marked section and restart indexing after the ">" in the middle, or may count the comment’s content as part of the document content. An engine with pretty poor SGML support may get those cases right, but still happily find tissue paper and maybe even IGNORE, even though they’re not part of the document’s content. A good SGML engine will exclude all these non-content things from content searches. A remarkably good engine will give you the choice, and even let you specifically ask for searches in marked sections or comments (though that isn’t a very common thing to want).
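
To make the distinction concrete, here is a small sketch in Python. The sample markup is made up for illustration, and the regular expressions are nowhere near a real SGML parser (they ignore nesting, INCLUDE sections, and entities); the point is only that a plain substring search happily matches text inside a comment or an IGNORE marked section, while even a crude pre-filter avoids the false hit.

  import re

  # Made-up sample document: "tissue paper" appears only inside a
  # comment and an IGNORE marked section, so it is not document content.
  doc = """<PARA>Wrap the part in <!-- tissue paper? --> bubble wrap.</PARA>
  <![ IGNORE [ <PARA>Or use tissue paper instead.</PARA> ]]>"""

  # A naive engine's substring search sees everything, markup and all.
  print("tissue paper" in doc)                 # True: a false hit

  # Strip comments and IGNORE marked sections before searching.
  content = re.sub(r"<!--.*?-->", "", doc, flags=re.DOTALL)
  content = re.sub(r"<!\[\s*IGNORE\s*\[.*?\]\]>", "", content, flags=re.DOTALL)
  print("tissue paper" in content)             # False: correctly excluded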

Other tools may parse correctly and be able to tell tags apart from content, but still not index tags, or index tags but not attributes, or index only a few specially chosen elements such as AUTHOR and TITLE. A few let you index as many element types as you want, but impose a space or speed penalty for each one or for indexing larger container elements.

If you want to do SGML searching, you need a retrieval engine that knows about the SGML syntax and structure. The more it knows, the better, at least about structure—it’s less likely you’ll need to do searches that worry about how the markup was typed. For example, it’s very likely someone will want to retrieve just those cross-references that are inside footnotes; but unlikely that they’ll want to retrieve all lists where there are two spaces between the tag name and the TYPE attribute (<LIST  TYPE=NUMBERED>), as opposed to one space (<LIST TYPE=NUMBERED>). Be sure you know exactly what level of SGML support you want from your retrieval engine, and whether you can get it.
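
The footnote case is a question about the document tree, not about how the tags were typed. Here is a minimal sketch in Python of that kind of structural query, assuming the SGML has already been normalized to well-formed XML; the FOOTNOTE and XREF element names are made up for the example.

  import xml.etree.ElementTree as ET

  # Made-up structure, already normalized to XML for the example.
  doc = ET.fromstring("""<CHAPTER>
    <PARA>See <XREF REFID="ch2"/> for details.</PARA>
    <FOOTNOTE><PARA>But compare <XREF REFID="ch5"/>.</PARA></FOOTNOTE>
  </CHAPTER>""")

  # Structure-aware query: only cross-references inside footnotes.
  hits = [xref.get("REFID")
          for footnote in doc.iter("FOOTNOTE")
          for xref in footnote.iter("XREF")]
  print(hits)   # ['ch5']: the cross-reference outside a footnote is skipped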

The Big Document Problem

Conversion-based products address one more big issue not yet mentioned. Many SGML files are big. Because of network speed limitations, it isn’t easy to put documents much over a few hundred kilobytes (say, about 50-100 pages) on the Web. Because the Web method is normally to fetch a whole file at a time, putting a whole manual or book in one file imposes a burden on the network itself and on the end user who has to wait for it. This approach has several problems:

  Downloading takes a long time before you can see the whole document, which is annoying for the end user.
  While the long download is happening, part of your processing power and network capacity is used up. If you share your network connection with others, all those extra bytes slow them down, too.
  Most HTML browsers crash or get confused on big files because they try to load them entirely into memory at once (try creating a really big HTML file and feeding it to your favorite browser to see what happens).
  Big files are hard to navigate with just a scroll bar.

Because of such problems, you don’t see a lot of whole books on the Web. Many books are available for downloading from the Internet, but few show up in HTML for easy interactive reading. It would just take too long. A 400-page book is about 800K of text (not counting any graphics)—here’s how long that takes to transfer over connections of different speeds:


Speed                 Time
--------------------  ----------
9600 baud modem       14 minutes
28800 baud modem      5 minutes
ISDN connection       2 minutes
256K Internet line    30 seconds
T1 Internet line      5 seconds
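
The times in the table work out from simple arithmetic. As a rough sanity check (the per-byte overheads here are assumptions, not figures from the text): an async modem moves about 10 bits per byte once start and stop bits are counted, and the network lines are taken at their nominal rates with 8 bits per byte, ignoring packet overhead.

  # Rough arithmetic behind the table: 800K of text over each link.
  BOOK_BYTES = 800 * 1024

  links = {
      "9600 baud modem":    (9600,    10),   # bits/second, bits per byte
      "28800 baud modem":   (28800,   10),
      "ISDN connection":    (64000,    8),
      "256K Internet line": (256000,   8),
      "T1 Internet line":   (1544000,  8),
  }

  for name, (bps, bits_per_byte) in links.items():
      seconds = BOOK_BYTES * bits_per_byte / bps
      print(f"{name:20s} {seconds:6.0f} s  ({seconds / 60:4.1f} min)")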

These figures don’t count processor time to parse all of that file, load it into memory, and format it. Netscape 1.1 takes 2 1/2 minutes to load an 800K HTML file from the local hard disk on my Mac Duo 210 and needs 5M of RAM. If I then resize the window, it needs 9M and another 2 1/2 minutes.

The figures also don’t account for sharing a network connection among users. If each of your readers has a private top-notch Internet connection, this may work fine—but how many do? Few of us can afford a private T1 line; it takes a lot of users to justify the expense. But on a shared line, no one gets the full bandwidth for very long.

Waiting minutes like that may be fine for downloading, but it’s not good enough for browsing. When you open a book, you should be able to start seeing it within a second, and if you immediately drag the scroll bar to the bottom, you shouldn’t have to twiddle your thumbs or go for coffee.

To put a book on the Web with reasonable performance, you pretty much have to break it up into small pieces. That brings us back to the servers that convert SGML to HTML; they can be set up to notice when you ask for something huge and to send a subset instead of sending everything. So if you ask for “hamlet.sgm,” you can get the first scene instead of the entire play, and be there in a fraction of the time. DynaWeb is an example of a server that does just that.

You also can set up a server to always tack on an HTML link button called Next, so when someone reads to the end of Act I Scene 1, a single click takes them further.

Better yet, a server can extract any subset of the SGML document, not just a single block. For example, it can spot the difference between a request for “Hamlet” and a request for “Hamlet Act I Scene 1”. For the first, it can send a table of contents or other navigation aid—maybe a list of the acts and scenes with their first lines or titles. Since the server determines what HTML really gets sent, it can tag each line as an HTML link that takes you to the right act or scene—instant Web navigation. For the second case, where the user makes a specific request for a manageable chunk of data, the server can just convert to HTML and send it.
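
Here is a minimal sketch in Python of that request routing. The document store and the SGML-to-HTML conversion are faked with tiny in-memory stand-ins so the example runs on its own; none of this reflects DynaWeb’s actual interface.

  # Pretend document store: the play, already broken into scene-sized chunks.
  play = {
      "Act I Scene 1": "<SPEECH>Who's there?</SPEECH>",
      "Act I Scene 2": "<SPEECH>Though yet of Hamlet our dear brother's death...</SPEECH>",
  }

  def to_html(sgml_fragment):
      """Stand-in for a real SGML-to-HTML converter."""
      return sgml_fragment.replace("SPEECH", "P")

  def serve(request):
      if request == "Hamlet":
          # Whole-document request: answer with a table of contents of
          # links rather than the entire play.
          items = "".join(f'<LI><A HREF="/Hamlet/{name}">{name}</A></LI>'
                          for name in play)
          return f"<UL>{items}</UL>"
      # Specific, manageable chunk: convert just that piece, and tack on
      # a Next link when there is a following scene.
      names = list(play)
      i = names.index(request)
      html = to_html(play[request])
      if i + 1 < len(names):
          html += f'<P><A HREF="/Hamlet/{names[i + 1]}">Next</A></P>'
      return html

  print(serve("Hamlet"))           # a table of contents with links
  print(serve("Act I Scene 1"))    # one scene, converted, plus a Next link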

This kind of filtering can happen either on the fly or in batch. Doing it on the fly greatly eases document management for whoever owns the data. They can keep things in natural units that are dictated by the material, rather than in artificial units dictated by the speed of the Internet. And if they publish or revise a book, they don’t have to re-break it up every time it changes.

