

CHAPTER 5
Mortar and Bricks: Document Type Definitions

Now that we’ve explored some of the theoretical aspects of XML document creation, it’s time to open the toolbox. Although the tools for creating validated XML may look a little strange, the logic behind them is not much more complicated than the logic of markup languages like HTML. Creating a Document Type Definition (DTD) is not an easy process, but a well-written DTD repays the effort in later savings. Even if you don’t plan to write your own DTDs, knowing how to read them may prove useful during a late night of work on documents created with a poorly documented DTD.

Many of the features described here work in several different places. A few work only in external DTDs, some work only in DTDs (internal or external), and some work only in documents. SGML developers need to be careful: many constructs that SGML allows in a variety of places are restricted in XML to a subset of their previous functionality.

Parsing: An Introduction

Basically, parsing is just the interpretation of text. Computers can’t really read, but they can interpret text files. Markup languages simply aid this interpretation, specifying explicitly to computers (or occasionally to humans) the nature of chunks of text. In HTML this is fairly straightforward: putting text between a start tag and an end tag means that the text is to be formatted in a particular way. <I>This is italic.</I> should produce: This is italic. HTML browsers understand that all characters placed between the sequence <I> and the sequence </I> should be displayed in italic.
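To make the idea concrete, here is a minimal sketch in Python using the standard library's html.parser module. The class name ItalicFinder and the helper italic_runs are invented for this illustration. A real browser does far more; this shows only the single rule of collecting the text that sits between <I> and </I>.

```python
from html.parser import HTMLParser


class ItalicFinder(HTMLParser):
    """Collect every run of text that appears inside <I>...</I>."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # how many <I> elements are currently open
        self.italic_text = []   # text found while depth > 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case, so <I> arrives as "i".
        if tag == "i":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "i" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.italic_text.append(data)


def italic_runs(html):
    """Return the list of text chunks an HTML browser would show in italic."""
    finder = ItalicFinder()
    finder.feed(html)
    finder.close()
    return finder.italic_text


print(italic_runs("Plain. <I>This is italic.</I> Plain again."))
# prints ['This is italic.']
```

The rule itself (italic between the two tags) is written into the program, which is the sense in which an HTML browser's interpretation is hard-wired.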

SGML and XML take a more sophisticated approach to interpreting text. HTML browsers interpret the text according to a hard-wired set of rules, created by the browser developer based on their interpretation of HTML and the various standards surrounding it, and they do the best job they can of parsing whatever they are given. XML and SGML parsers, on the other hand, check the document’s markup to make sure it fits a set of rules. XML parsers check at least for well-formedness, the minimal set of rules described in Chapter 3. SGML, and XML whenever validation is requested, also require documents to conform to a more complex set of rules laid out in a document type definition (DTD), which the document type declaration contains or points to. Documents that conform to a DTD are said to be valid; parsers that can interpret DTDs and check documents against their strictures are called validating parsers.
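The well-formedness check can be tried out with a short Python sketch. It assumes the standard library's xml.etree.ElementTree module, which is a non-validating parser: it rejects markup that is not well-formed but does not check a document against a DTD. The function name is_well_formed and the two sample strings are invented for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<note><to>Reader</to></note>"
bad = "<note><to>Reader</note></to>"   # end tags overlap instead of nesting

print(is_well_formed(good))  # prints True
print(is_well_formed(bad))   # prints False
```

Checking validity would take a validating parser and a DTD in addition to this step; passing the test above says nothing about whether the element names are the ones a DTD allows.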

The XML specification refers to two components of a larger system for reading and interpreting XML. The first is the XML processor, which is the parser described previously. The job of the XML processor is to load the XML and any related files, check that they follow all the necessary rules, and build a document tree structure that can be passed on to the application. The second is the application, the part of the system that acts upon that tree structure, processing the data it contains. The application could be a browser that displays the information in the tree structure on the screen or a printing application that formats the information for a printer. It also could be a reader application that turns the computerized text into audio for blind users. The application doesn’t need to produce output that humans can use. Instead, it could treat the XML as control information for machine tools or as a set of orders that need immediate shipping. The XML application can implement just about any data-dependent process.
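The division of labor between processor and application might be sketched as follows in Python. The orders document, its element and attribute names, and the functions process, rush_orders, and describe are all hypothetical, chosen only to echo the shipping example; xml.etree.ElementTree stands in for the XML processor.

```python
import xml.etree.ElementTree as ET

# A made-up document: orders, some of which need immediate shipping.
ORDERS = """<orders>
  <order id="1001" rush="yes"><item>bricks</item></order>
  <order id="1002" rush="no"><item>mortar</item></order>
</orders>"""


def process(text):
    """Processor role: check the markup and hand back a document tree."""
    return ET.fromstring(text)   # raises ET.ParseError if not well-formed


def rush_orders(tree):
    """One application: pick out the orders that must ship immediately."""
    return [o.get("id") for o in tree.iter("order") if o.get("rush") == "yes"]


def describe(tree):
    """Another application: produce a line of text per order for display."""
    return [f"Order {o.get('id')}: {o.findtext('item')}" for o in tree.iter("order")]


tree = process(ORDERS)
print(rush_orders(tree))  # prints ['1001']
print(describe(tree))     # prints ['Order 1001: bricks', 'Order 1002: mortar']
```

Both applications receive the same tree from the same processor; only what they do with it differs.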

This separation of markup from formatting makes interpreting XML more complex than interpreting HTML. The <I> tag means the same thing on any HTML browser—start italics here—but it’s not that simple with XML. <I> could mean start italics, or it could indicate an ice cream flavor, or it could signal comments about IBM. In fact, it could signal just about anything. In HTML, the browser combines the parser and the application and follows a somewhat strict set of rules for how it interprets particular tags. XML is quite flexible about the final interpretation of the marked-up data, although it is far more strict about the markup itself.
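As a small illustration of that flexibility, the Python sketch below feeds one document to two different applications. The menu document and the functions as_italics and as_flavors are invented for this example, and wrapping text in asterisks is just a stand-in for real italic formatting.

```python
import xml.etree.ElementTree as ET

# The parser only cares that the markup is well-formed; it attaches no
# meaning to the element name I.
doc = ET.fromstring("<menu><I>pistachio</I><I>vanilla</I></menu>")


def as_italics(tree):
    """Application 1: treat <I> as 'start italics' (shown here as *text*)."""
    return " ".join(f"*{e.text}*" for e in tree.iter("I"))


def as_flavors(tree):
    """Application 2: treat <I> as an ice cream flavor and list them sorted."""
    return sorted(e.text for e in tree.iter("I"))


print(as_italics(doc))  # prints *pistachio* *vanilla*
print(as_flavors(doc))  # prints ['pistachio', 'vanilla']
```

The markup is identical in both cases and must satisfy the same rules; its meaning comes entirely from the application that reads the tree.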

