XML: A Primer:XML: Building Structures

CHAPTER 3
XML: Building Structures

Because a complete set of formatting tools is available to make documents appear as they were intended, we can forget about what the documents look like for a little while. Using XML requires a different focus: designers must examine the way their documents are built rather than the way they are formatted. If you remember diagramming sentences in English class (it’s okay if you hated it), you’ve been through the drill before, although at a different level than we’ll be using here. Instead of looking at structures at the sentence level, we’ll be examining structures at the document level, identifying titles, sections, subsections, paragraphs, lists, figures, item numbers, and item descriptions rather than nouns, verbs, and prepositions. XML offers developers the opportunity to create documents with built-in frameworks that make it much easier to produce consistent results time after time.

Browsers and Parsers

The explosive growth of the Web was made possible by several factors: the simplicity of HTML, the relative ease of setting up an HTTP (Web) server, and the rapid proliferation of browsers, of which Netscape and Microsoft Internet Explorer are just the most prominent. Even the earliest browsers, created at CERN in 1991, were meant to give users quick access to documents, letting them move from document to document without any complex transactions getting in the way. This emphasis on quick access was one of HTML’s largest breaks with SGML, just as significant as using markup tags for formatting as well as structure. HTML browsers didn’t worry about checking document syntax; instead, they parsed the document (the computer equivalent of reading it) and presented their results. The results weren’t, and aren’t, always pretty. Finding a missing end tag in a large, heavily formatted document is a difficult task at best, requiring designers to compare their code with the results generated by a particular browser. HTML browsers have always been very forgiving, but the results they present are not.

Outside the mainstream of browser development were a few browsers that did validate HTML to some extent. For instance, Arena, a W3C testbed browser, had an option to indicate broken tags. The direction taken by the market, however, was clearly in favor of putting something on the screen, however odd, rather than pestering readers with error messages.

SGML was always more focused on parsing documents than presenting them. A parsing program takes a large file, usually text, and breaks it down into its component parts. In SGML, this meant that a program would examine a file, compare it to a document type definition (DTD), break the document into its component parts, and validate the document against that definition. “Broken” SGML is easy to find, although the reasons for its being broken are frequently more difficult to determine. Because SGML parsers usually validate files, and because SGML is not directly concerned with formatting, programs for managing SGML paid a lot more attention to making sure that an SGML document was properly coded than to presenting it attractively. Bad markup could prevent a document management system’s sophisticated tools for storing, indexing, and reusing information from accepting the document at all.
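The parse-then-validate sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard library's ElementTree parser does not validate against DTDs, so the `ALLOWED_CHILDREN` table is a hypothetical stand-in for a real document definition, and the element names (`document`, `section`, `title`, `paragraph`) are invented for the example. A real DTD also constrains the order and number of child elements, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model standing in for a DTD: each element name maps
# to the set of child element names it is allowed to contain.
ALLOWED_CHILDREN = {
    "document": {"title", "section"},
    "section": {"title", "paragraph"},
    "title": set(),
    "paragraph": set(),
}


def validate(element):
    """Return a list of structure errors found at or below this element."""
    errors = []
    allowed = ALLOWED_CHILDREN.get(element.tag, set())
    for child in element:
        if child.tag not in allowed:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            # Only descend into children that the definition permits here.
            errors.extend(validate(child))
    return errors


GOOD = "<document><title>T</title><section><paragraph>Text</paragraph></section></document>"
BAD = "<document><title>T</title><section><title>S</title><figure/></section></document>"

# Step 1: parse the text into its component parts (a tree of elements).
# Step 2: compare that tree against the definition.
good_errors = validate(ET.fromstring(GOOD))
bad_errors = validate(ET.fromstring(BAD))
```

The second document is broken only in the validation sense: its tags are properly nested, but it uses an element the definition does not permit inside a section. That is the kind of problem a document management system built on validation would refuse to accept.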

Browsers combine a parsing engine with a presentation engine. HTML browsers don’t validate HTML even against their internal definitions. Instead, they parse the file and do what they can with the tags they can understand. If they can’t understand a tag, they ignore it. If they’re missing a closing tag, they take their best guess at where it would most likely have been. Attributes may have an effect or may be ignored, depending on whether the browser understands them. This uncontrolled model has made it much easier for amateurs to publish Web pages, increasing the number of authors dramatically. The downside is a lot of poorly written HTML. The same situation has allowed browser developers to get away with private additions to the language, since one company’s additions wouldn’t “break” another company’s browser; the page simply might not look as good there, because the unrecognized tags are ignored. Although pages might look best on a particular browser, the worst that could happen to a user opening them in the “wrong” browser was missing or oddly presented information. Designers who wanted to make everything look perfect in every browser were bound to be disappointed, but at least a lowest common denominator of development was available.
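The forgiving behavior described above is easy to observe with a tolerant parser. The following is a minimal sketch using Python's standard `html.parser` module as a stand-in for a browser's parsing engine; it is not how any particular browser is implemented, and the `TagCollector` class and `SLOPPY` sample are names invented for this example. The sample markup contains an unclosed `<b>` tag and a nonstandard `<blink>` tag that is never closed either.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records what a tolerant parser sees, without judging it."""

    def __init__(self):
        super().__init__()
        self.opened = []  # start tags, in document order
        self.closed = []  # end tags, in document order
        self.text = []    # non-blank runs of character data

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


# <b> and <blink> are never closed; a tolerant parser raises no error.
SLOPPY = "<p>Hello <b>world<blink>!</p>"

collector = TagCollector()
collector.feed(SLOPPY)
collector.close()
```

After the run, `collector.opened` holds three tags and `collector.closed` holds only one, yet no exception was raised. Deciding what to do about the mismatch is left entirely to the presentation side, which is why different browsers could show the same sloppy page in different ways.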

XML doesn’t go nearly as far as SGML in requiring conformance to standards, but it may still come as a shock to HTML developers. XML standards refer to processors (parsers), not to browsers, because much XML development will be intended for machine-readable data applications rather than graphically exciting Web pages. Netscape and Microsoft have both announced plans for integrating XML into their browsers, but browsers are only one part of the XML toolset. XML still allows for a good deal of HTML-style free-form development, but it enforces its rules much more strictly than HTML browsers do, as we’ll see. Developers will be happiest in the long run if they use the strongest set of structure-building tools available, but not all applications may need that level of effort. It’s still possible to create documents easily in a format that both validating parsers and browsers can understand.
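To get a feel for that stricter enforcement, consider this small sketch, which uses Python's standard `xml.etree.ElementTree` module as one example of an XML processor. The helper name `is_well_formed` and the sample strings are assumptions made for illustration. Note that this parser checks only that the markup is well formed (every tag closed and properly nested); it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return (True, "") if the markup parses as XML, else (False, reason)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as error:
        # The processor stops at the first problem instead of guessing.
        return False, str(error)
    return True, ""


CLOSED = "<p>Hello <b>world</b>!</p>"
UNCLOSED = "<p>Hello <b>world!</p>"  # the <b> element is never closed

closed_ok, _ = is_well_formed(CLOSED)
unclosed_ok, reason = is_well_formed(UNCLOSED)
```

An HTML browser would take its best guess at where the missing `</b>` belongs. The XML processor rejects the second string outright and reports the problem in `reason`, leaving it to the author to fix the document.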

Browsers are only a tiny part of the XML vision. They remain extremely useful, presenting XML workers with a friendly and accessible tool for looking at their code, but the browser is only a window to a much larger project.
