An engine with really poor SGML support may decide to ignore half the marked section, and re-start indexing after the ">" in the middle, or may count the comment's content as part of the document content. An engine with pretty poor SGML support may get those cases right, but still happily find tissue paper and maybe even IGNORE, even though they're not part of the document's content. A good SGML engine will exclude all these non-content things from content searches. A remarkably good engine will give you the choice, and even let you specifically ask for searches in marked sections or comments (though that isn't a real common thing to want).
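As a crude illustration of the distinction (a regular-expression toy, not a real SGML parser; real comment and marked-section handling is more involved than this), stripping these constructs before indexing is what keeps them out of content searches:

```python
import re

# Toy document: a comment and an IGNORE marked section, neither of
# which is part of the document's content.
doc = """<P>Wrap the gift in <!-- maybe use tissue paper -->
<![ IGNORE [ draft-only text ]]>plain paper.</P>"""

content = re.sub(r"<!--.*?-->", "", doc, flags=re.S)       # drop comments
content = re.sub(r"<!\[\s*IGNORE\s*\[.*?\]\]>", "",        # drop IGNORE sections
                 content, flags=re.S)

assert "tissue paper" not in content   # a content search should miss these
assert "IGNORE" not in content
assert "plain paper" in content        # real content survives
```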
Other tools may parse correctly and be able to tell tags apart from content, but still not index tags, or index tags but not attributes, or index only a few specially chosen elements such as AUTHOR and TITLE. A few let you index as many element types as you want, but impose a space or speed penalty for each one or for indexing larger container elements.
If you want to do SGML searching, you need a retrieval engine that knows about the SGML syntax and structure. The more it knows, the better, at least about structure; it's less likely you'll need to do searches that worry about how the markup was typed. For example, it's very likely someone will want to retrieve just those cross-references that are inside footnotes; but unlikely that they'll want to retrieve all lists where there are two spaces between the tag name and the TYPE attribute (<LIST  TYPE=NUMBERED>), as opposed to one space (<LIST TYPE=NUMBERED>). Be sure you know exactly what level of SGML support you want from your retrieval engine, and whether you can get it.
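That structural kind of query can be sketched with Python's ElementTree over an XML stand-in (an assumption made for illustration; a real SGML-aware engine would run an equivalent query over its own parsed structure):

```python
import xml.etree.ElementTree as ET

# XML stand-in for an SGML document with cross-references (XREF)
# both inside and outside a footnote.
doc = ET.fromstring(
    "<BOOK><P>See <XREF target='ch2'/>.</P>"
    "<FOOTNOTE>But compare <XREF target='ch3'/>.</FOOTNOTE></BOOK>")

# Structure-aware query: only the cross-references inside footnotes.
in_footnotes = [x.get("target")
                for fn in doc.iter("FOOTNOTE")
                for x in fn.iter("XREF")]
print(in_footnotes)   # ['ch3']
```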
Conversion-based products address one more big issue not yet mentioned. Many SGML files are big. Because of network speed limitations, it isn't easy to put documents much over a few hundred kilobytes (say, about 50-100 pages) on the Web. Because the Web method is normally to fetch a whole file at a time, putting a whole manual or book in one file imposes a burden on the network itself and on the end user who has to wait for it. This causes several problems.
Because of such problems, you don't see a lot of whole books on the Web. Many books are available for downloading from the Internet, but few show up in HTML for easy interactive reading. It just would take too long. A 400-page book is about 800K of text (not counting any graphics); here's how long that takes with different speed connections:
| Speed | Time |
|---|---|
| 9600 baud modem | 14 minutes |
| 28800 baud modem | 5 minutes |
| ISDN connection | 2 minutes |
| 256K Internet line | 30 seconds |
| T1 Internet line | 5 seconds |
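The rough arithmetic behind the table can be checked. The bits-per-byte figures (10 for async modems, counting start and stop bits; 8 for digital lines) and the 64 Kbps ISDN rate are assumptions, not stated in the text:

```python
# Arithmetic behind the table: an 800K book at various line speeds.
# Assumed: async modems move ~10 bits per byte (8 data bits plus
# start/stop framing), digital lines 8; ISDN taken as 64 Kbps.
BOOK_BYTES = 800 * 1000

lines = [
    ("9600 baud modem",    9_600,     10),
    ("28800 baud modem",   28_800,    10),
    ("ISDN (64 Kbps)",     64_000,     8),
    ("256K Internet line", 256_000,    8),
    ("T1 (1.544 Mbps)",    1_544_000,  8),
]

for name, bps, bits_per_byte in lines:
    secs = BOOK_BYTES * bits_per_byte / bps
    print(f"{name:18s} ~{secs:5.0f} seconds")
```

The results round to the figures in the table: about 833 seconds (14 minutes) at 9600 baud, down to a few seconds over a T1.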
These figures don't count processor time to parse all of that file, load it into memory, and format it. Netscape 1.1 takes 2 1/2 minutes to load an 800K HTML file from the local hard disk on my Mac Duo 210 and needs 5M of RAM. If I then resize the window, it needs 9M and another 2 1/2 minutes.
The figures also don't count sharing network connections among users. If each of your readers has a private top-notch Internet connection, this may work fine; but how many do? Few of us can afford a private T1 line; it takes a lot of users to justify the expense. But on a shared line, no one gets the full bandwidth for very long.
When you open a book, you should be able to start seeing it within a second; and if you immediately drag the scroll bar to the bottom, you shouldn't have to twiddle your thumbs or go for coffee. Waiting minutes for a file is fine for downloading, but not good enough for browsing.
To put a book on the Web with reasonable performance, you pretty much have to break it up into small pieces. That brings us back to the servers that convert SGML to HTML; they can be set up to notice if you ask for something huge, and they can send a subset instead of sending everything. So if you ask for hamlet.sgm, instead of the entire play, you can get the first scene and be there in a fraction of the time. DynaWeb is an example of a server that does just that.
You also can set up a server to always tack on an HTML link button called Next, so when someone reads to the end of Act I Scene 1, a single click takes them further.
Better yet, a server can extract any subset of the SGML document, not just a single block. For example, it can spot the difference between a request for Hamlet and a request for Hamlet Act I Scene 1. For the first, it can send a table of contents or other navigation aid: maybe a list of the acts and scenes with their first lines or titles. Since the server determines what HTML really gets sent, it can tag each line as an HTML link that takes you to the right act or scene: instant Web navigation. For the second case, where the user makes a specific request for a manageable chunk of data, the server can just convert to HTML and send it.
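A toy sketch of that dispatch logic (the names, the document representation, and the size limit are all made up for illustration; this is not DynaWeb's actual code): if the whole work is requested and it's big, send a linked table of contents; if one division is requested, convert just that chunk:

```python
SIZE_LIMIT = 500  # toy threshold in bytes; a real server would tune this

def handle_request(doc, div=None):
    """Return HTML for a (title, divisions) document, where divisions
    is a list of (division-title, text) pairs standing in for scenes."""
    title, divisions = doc
    if div is not None:                      # a specific, manageable chunk
        d_title, text = divisions[div]
        return f"<h2>{d_title}</h2><p>{text}</p>"
    if sum(len(t) for _, t in divisions) <= SIZE_LIMIT:
        return "".join(f"<h2>{d}</h2><p>{t}</p>" for d, t in divisions)
    # Too big to send whole: send a table of contents instead, with
    # each entry tagged as an HTML link to the division it names.
    items = "".join(f'<li><a href="?div={i}">{d}</a></li>'
                    for i, (d, _) in enumerate(divisions))
    return f"<h1>{title}</h1><ul>{items}</ul>"

hamlet = ("Hamlet", [("Act I Scene 1", "Who's there? " * 100),
                     ("Act I Scene 2", "Though yet... " * 100)])
print(handle_request(hamlet)[:20])           # a TOC, not the whole play
print(handle_request(hamlet, div=0)[:20])    # just the first scene
```

Asking for the whole play gets back the linked table of contents; asking for `div=0` gets the first scene alone, in a fraction of the time.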
This kind of filtering can happen either on the fly or in batch. Doing it on the fly greatly eases document management for whoever owns the data. They can keep things in natural units that are dictated by the material, rather than in artificial units dictated by the speed of the Internet. And if they publish or revise a book, they don't have to re-break it up every time it changes.