Special Edition Using SGML: Practicalities of Working with SGML on the Web


This last point is a big one. Current Web clients tend to add proprietary extensions to HTML. Using the extensions for client X provides nice additions for users of client X, but users of every other client see the wrong thing. With integrated conversion, you can check which client is calling in, look up a different conversion filter for each one, and send data that’s optimized for just that client. This allows publishers to present their material in the best possible way for each reader without forcing them all to buy and use the same Web client. That independence, after all, is one of the advantages HTML is supposed to provide in the first place.

The same argument applies as new versions of the HTML standard come out. Some clients will support HTML 2.1 earlier than others. With integrated conversion, you can customize not only for client extensions, but also for client versions and the level of HTML each one supports.
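
As a rough illustration of picking a conversion filter per client, the sketch below inspects the User-Agent string a client sends with its request. The filter names and the `choose_filter` function are hypothetical, invented here for illustration; a real server would map them to whatever conversion filters it actually has.

```python
# Hypothetical filter names, keyed by a substring expected in the User-Agent.
FILTERS = {
    "lynx": "text_only_filter",
    "mosaic": "mosaic_filter",
    "mozilla": "netscape_filter",  # Netscape identifies itself as "Mozilla"
}

# Assumed fallback for clients the server does not recognize.
DEFAULT_FILTER = "plain_html_filter"


def choose_filter(user_agent: str) -> str:
    """Return the name of the conversion filter to use for this client."""
    agent = user_agent.lower()
    for marker, filter_name in FILTERS.items():
        if marker in agent:
            return filter_name
    return DEFAULT_FILTER


print(choose_filter("Lynx/2.4 libwww/2.14"))
print(choose_filter("SomeOtherClient/1.0"))
```

The same lookup could be extended to key on the client's version number, so that a newer release receives a newer level of HTML.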

Note:
Customization is a nice feature for everyone, but it’s especially important for visually impaired users. They have good reasons to prefer special Web clients that are compatible with text-to-speech systems and provide usable presentations of documents. They may prefer to get a textual description of each graphic rather than the image itself (they shouldn’t have to waste the bandwidth to get both if they aren’t going to use the image anyway). Integrated conversion permits you to tailor the data for many different users.

Theoretically, you can get the same advantage by running many different batch conversions and saving a separate variation of the HTML files for each client: mybook.lynx, mybook.mosaic, mybook.netscape, and so on. Practically, this costs an enormous amount of space and raises huge data management problems (the chance of remembering to update every variant document every time something changes is pretty small). Each variant of each HTML file also needs a separate URL, so you have to make each conversion filter find all the links and change them during conversion. For example, a link from mybook.lynx should point to yourbook.lynx, not yourbook.mosaic.
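
To give a feel for the link-rewriting burden in the batch approach, here is a minimal sketch of what each filter would have to do. It assumes, purely for illustration, that the converted files link to each other with a ".html" suffix and that each variant uses its client name as the suffix; `retarget_links` is a hypothetical helper, not part of any real tool.

```python
import re

# Matches a local HREF value ending in ".html". Values containing ":" (such as
# absolute http: URLs pointing at other people's documents) are left alone.
LOCAL_LINK = re.compile(r'(href="[^":]*)\.html(")', re.IGNORECASE)


def retarget_links(html: str, variant: str) -> str:
    """Point every local link at the same variant as the page that contains it."""
    return LOCAL_LINK.sub(lambda m: f"{m.group(1)}.{variant}{m.group(2)}", html)


page = '<A HREF="yourbook.html">Your book</A>'
print(retarget_links(page, "lynx"))
```

Every variant needs its own pass like this one, which is part of why the batch approach becomes hard to manage.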

Even worse, any documents elsewhere that point to yours can only point to specific URLs (and, of course, you can’t fix other people’s documents). If someone has a URL that points to the “regular” HTML version of a document and mails it to a visually impaired friend, it does little good. The recipient may not know that other more appropriate versions of the document exist, or how to get to them. You could make all the filters put “see also” links at the top of every document, but that gets cluttered and painful compared to having customization just work automatically.

Retrieval Engines

Retrieval engines work in combination with integrated or batch conversion solutions. With a batch solution, you would index the HTML version(s); with an integrated solution, you index the SGML. Both work fine for simple text searches. The difference comes if you want to search the SGML structure.

For example, suppose your SGML documents distinguish emphasis from foreign phrases from book titles, even though all will be tagged just <I> (italic) in HTML. If your retrieval engine is working from the SGML, clients can request searches that distinguish these types even though what they get back doesn’t distinguish them. Because searching on the Web is mainly done at the server end, this can be a big advantage, and with HTML forms you can make very nice interfaces for doing structured searches.
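
The idea can be sketched in a few lines. The element names (`emph`, `foreign`, `booktitle`), the sample data, and both functions are assumptions made up for this illustration; a real retrieval engine would build its index from parsed SGML rather than from a hand-written list.

```python
# Three distinct SGML element types (hypothetical names) that all become <I> in HTML.
TO_HTML = {"emph": "I", "foreign": "I", "booktitle": "I"}

# A toy "index": (element type, text) pairs as they might come out of an SGML parser.
SPANS = [
    ("booktitle", "Emma"),
    ("emph", "Emma"),
    ("foreign", "de facto"),
]


def search(spans, word, element=None):
    """Find spans containing word, optionally restricted to one element type."""
    word = word.lower()
    return [
        (el, text)
        for el, text in spans
        if word in text.lower() and (element is None or el == element)
    ]


def render(element, text):
    """Convert a hit to HTML for delivery; the element distinction is lost here."""
    tag = TO_HTML[element]
    return f"<{tag}>{text}</{tag}>"


for el, text in search(SPANS, "emma", element="booktitle"):
    print(render(el, text))
```

The search can tell the book title from the emphasized name because it runs against the SGML, even though both hits would be delivered to the client as the same italic markup.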

Although there isn’t space to talk in detail about the features and limitations of various search engines, it’s worth noting that “SGML support” can cover a wide range. At the weakest extreme, some tools simply index SGML files without knowing SGML. In that case, searching for “section” would return three hits for this piece of the file:

    <section><title>This section discusses dogs.</section>

That’s not likely to be what you want: two of the hits are tags, not content. Also, if you try to use such an engine to limit hits to words occurring in certain contexts, it gets messy. To find “dog” in <section>, you have to use a regular expression or “wildcard” search, something like:

    <section>.*dog.*</section>

This search may be slower than a plain string search, but a more important problem is that getting the query just right takes a lot of work. Here are some cases the query above will miss:

• What if there are attributes on the section start-tag?

• What if the start-tags or end-tags are minimized (short tags) or omitted?

• What if the wildcard match includes more tags? For example, the expression just shown would match “<section>Hi</section> <p>dog</p> <section>bye</section>”.
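
These failure cases are easy to reproduce. The sketch below uses Python's `re` module as a stand-in for a structure-blind search engine; the helper names and the extra sample strings are assumptions added for illustration.

```python
import re

SAMPLE = "<section><title>This section discusses dogs.</section>"

# The wildcard query from the text, written as a regular expression.
WILDCARD = re.compile(r"<section>.*dog.*</section>")


def naive_hits(text: str, word: str) -> int:
    """Count occurrences the way a structure-blind index would, tags included."""
    return text.count(word)


def wildcard_finds(text: str) -> bool:
    """Report whether the wildcard query matches anywhere in the text."""
    return WILDCARD.search(text) is not None


# Two of these three hits are tag names, not content.
print(naive_hits(SAMPLE, "section"))

# Missed: an attribute on the start-tag defeats the literal "<section>".
print(wildcard_finds('<section id="s1">A dog barks.</section>'))

# Missed: the end-tag has been omitted.
print(wildcard_finds("<section>A dog barks.<section>Next"))

# False hit: "dog" is in a <p> between two sections, not inside either one.
print(wildcard_finds("<section>Hi</section> <p>dog</p> <section>bye</section>"))
```

Each case could be patched with a more elaborate expression, but the query quickly becomes harder to write and check than the search it was meant to express.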

Some other tools deal with SGML by turning indexing off when they see < and back on when they see >. This, of course, rules out any structure searching, and it also gets things wrong in many cases such as marked sections and minimization. One good test for any search engine that claims to support SGML is to see what it does for an ignored marked section (for details on marked sections, see Chapter 16):

    <p>The most important thing is
    <![ IGNORE [ tissue paper <!-- the text up to the next ">" isn’t really
    content --> ]]>
    friends.</p>
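
To see why this test is revealing, here is a minimal sketch of the "off at <, on at >" strategy applied to that sample. The `naive_index` function is a hypothetical stand-in for such a tool, and the sample uses a plain ASCII apostrophe for simplicity.

```python
import re

SAMPLE = """<p>The most important thing is
<![ IGNORE [ tissue paper <!-- the text up to the next ">" isn't really
content --> ]]>
friends.</p>"""


def naive_index(text: str) -> list[str]:
    """Index words, switching indexing off at '<' and back on at '>'."""
    indexing = True
    kept = []
    for ch in text:
        if ch == "<":
            indexing = False
        elif ch == ">":
            indexing = True
            kept.append(" ")  # keep words on either side of a tag apart
        elif indexing:
            kept.append(ch)
    return re.findall(r"[A-Za-z']+", "".join(kept))


print(naive_index(SAMPLE))
```

The quoted ">" inside the comment switches indexing back on too early, so words such as "really" and "content" end up in the index even though the whole marked section is supposed to be ignored. A tool that actually understands SGML would index only "The most important thing is friends."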
