XML: A Primer:Mortar and Bricks: Document Type Definitions

Data Structures

Before we can start building document structures, we must look at the data underneath. As HTML developers discovered when they tried to convert old files to HTML, the use of characters like <, >, and & for markup creates significant problems. A site I once converted from a Macintosh HyperCard stack, for example, worked very well except for the page on AT&T, which had missing characters throughout. XML applies a few SGML solutions to HTML’s problems, adding some apparent complexity but solving the problems much more thoroughly.

Data Types

Although XML has simplified SGML’s data types considerably, we need to consider a number of issues that never cropped up in HTML. XML allows for two types of data in documents: #PCDATA, or parsed character data, which is ordinary marked-up character data, and CDATA, which is character data without any markup. CDATA is useful where a document contains no markup but many <, >, and & symbols. By default, XML assumes that all information is PCDATA, except for the contents of attributes, which are normally assumed to be CDATA. PCDATA and CDATA will receive considerably more use when we actually define some elements and attributes.

CDATA can also play a role directly in a document, marking a section as pure character data. To declare a section as CDATA, mark its beginning with <![CDATA[ and its end with ]]>. (This will fail if the data includes any ]]> sequences, which should be a highly unusual occurrence.) For example, the CDATA section in

  <?xml version="1.0" encoding="UTF-8"?>
  <DOCUMENT><![CDATA[@#X! <<<<    >>>> & <<<<<
  >>>>>]]></DOCUMENT>

will be interpreted as the characters “@#X! <<<< >>>> & <<<<< >>>>>” and will not generate a parsing error. When queried for the text in this document, MSXML returns the CDATA information:

    C:\msxml>jview msxml -t http://127.0.0.1/cdata.xml
    @#X! <<<<  >>>> &   <<<<< >>>>>

If this text weren’t “escaped” with the CDATA section, parsers would stop at the first < sign because it appears to open a tag with no proper closing. Even though using CDATA prohibits developers from using markup in a section of text, the tradeoff may be worth it if the section doesn’t require markup anyway. If the section needs markup, replace the offending characters with their entity equivalents, as discussed later in the chapter.
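
The effect of the CDATA wrapper can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module rather than MSXML; the names CDATA_DOC, BARE_DOC, document_text, and is_well_formed are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The document from the text, with the encoding declaration in lowercase.
CDATA_DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT><![CDATA[@#X! <<<<    >>>> & <<<<<
>>>>>]]></DOCUMENT>"""

# The same characters without the CDATA wrapper (hypothetical counterexample).
BARE_DOC = b"<DOCUMENT>@#X! <<<<    >>>> & <<<<< >>>>></DOCUMENT>"


def document_text(xml_bytes):
    """Parse the bytes and return the character data of the root element."""
    return ET.fromstring(xml_bytes).text


def is_well_formed(xml_bytes):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # The CDATA section comes back as plain characters, markup symbols included.
    print(repr(document_text(CDATA_DOC)))
    # Without the wrapper, the parser rejects the document.
    print(is_well_formed(BARE_DOC))
```

The parser hands back the contents of the CDATA section unchanged, while the unwrapped version fails as soon as the parser meets the stray < characters.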

The use of CDATA to escape text is actually a specific example of a more general SGML technique: marked sections. Marked sections follow a <![keyword[data…]]> syntax. Even though developers may use this syntax for CDATA, the SGML RCDATA, TEMP, IGNORE, and INCLUDE keywords are not available in XML documents. IGNORE and INCLUDE marked sections are, however, available within DTDs and will be covered later in the chapter.

One missing feature, which many developers have complained about, is that XML doesn’t have any way to require that elements contain more specific types of data, like text or numbers or currency. Several proposals for this “stronger typing” are under development. One proposal would apply the standard types of SQL, the relational database standard, to XML elements. In the meantime, any such enforcement of data types will be the responsibility of the processing application, not the XML parser.
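
Until such typing exists, the check has to live in application code. The sketch below shows one way an application might enforce a numeric type itself; the PRICE element and the read_price function are hypothetical, and Python's standard library stands in for whatever processing application is actually used.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation


def read_price(xml_text):
    """Return the content of a (hypothetical) PRICE element as a Decimal.

    The XML parser only guarantees well-formed character data; the numeric
    check below is the application's own responsibility.
    """
    element = ET.fromstring(xml_text)
    raw = (element.text or "").strip()
    try:
        return Decimal(raw)
    except InvalidOperation:
        raise ValueError(f"PRICE must be a number, got {raw!r}")


if __name__ == "__main__":
    print(read_price("<PRICE>19.95</PRICE>"))
    try:
        read_price("<PRICE>cheap</PRICE>")
    except ValueError as error:
        print(error)
```

Both documents are equally acceptable to the parser; only the application-level check tells them apart.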

Entities

XML offers two kinds of entities: general entities and parameter entities. HTML developers will be familiar with using predefined general entities for encoding unusual characters and characters used for markup (the infamous <, >, and &). Although defined in DTDs, general entities are used to add information to documents, substituting their value for the entity reference, which takes the form &name;. Parameter entities are defined and used only within DTDs. They can save developers typing, as do general entities, but they also give developers tremendous power to include other DTDs and other information in their DTD. Parameter entities allow developers to reuse and subset older DTDs, avoiding the perpetual reinvention of the wheel and making the expansion of previously existing DTDs easier.

General entities are used throughout HTML to provide representations of characters that are either outside the basic ASCII character set or interfere with markup. XML has fewer problems with this for several reasons. The character encodings already described will allow developers to include characters from other languages and writing systems more directly, easing the need for those kinds of character-representation entities. The option to “escape” characters with CDATA sections also makes it easier to include the characters that interfere with markup, especially for large chunks. Still, CDATA is somewhat clunky for regular use. XML includes a few built-in entities, although not nearly as many as HTML. XML’s five built-in entities are listed in Table 5.2.

**Table 5.2** Entities built into the XML standard
	Entity	Character Represented
	&amp;	ampersand (&)
	&lt;	less-than sign (<)
	&gt;	greater-than sign (>)
	&apos;	apostrophe (')
	&quot;	quote (")
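
As a rough illustration of the table, the sketch below parses a small document that uses all five built-in entities and also goes the other way, escaping raw text. It relies on Python's standard xml.etree.ElementTree and xml.sax.saxutils modules; the NOTE element and the helper names are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A hypothetical document that uses each of the five built-in entities.
NOTE = "<NOTE>AT&amp;T: 1 &lt; 2, 3 &gt; 2, it&apos;s &quot;true&quot;</NOTE>"


def note_text():
    """Parse NOTE and return its text with the entities expanded."""
    return ET.fromstring(NOTE).text


def escape_all(text):
    """Replace all five markup-sensitive characters with entity references.

    escape() handles &, <, and > by default; the extra mapping adds the
    apostrophe and the double quote.
    """
    return escape(text, {"'": "&apos;", '"': "&quot;"})


if __name__ == "__main__":
    print(note_text())
    print(escape_all("AT&T <\"it's\">"))
```

The parser needs no DTD to expand these references, because the five names are defined by XML itself.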

Even though these entities are certainly useful and help developers keep their content out of the way of markup, they offer only the tiniest taste of the powerful things XML can do with general entities. Developers can define entities just as they can define elements. General entities make many complex and annoying tasks easy, especially when it comes to filling in boilerplate text. The syntax for defining a general entity is straightforward:

  <!ENTITY Name EntityDefinition>

The name of the entity must be composed of letters, digits, periods, dashes, underscores, or colons, and must begin with a letter or an underscore. The entity definition may contain any valid markup and must be enclosed in quotes. The syntax for using an entity in the markup is also simple:

  &Name;

The ampersand must appear at the start, and the semicolon must be at the end. No whitespace is permitted between the ampersand, the name, and the semicolon.
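
To see a declaration and a reference working together, here is a minimal sketch that declares an entity in a document's internal DTD subset and lets a parser expand it. The MEMO element, the company entity, and its value are invented for the example, and Python's standard xml.etree.ElementTree module is assumed as the parser.

```python
import xml.etree.ElementTree as ET

# A hypothetical memo: the entity is declared in the internal DTD subset
# and referenced once in the content.
MEMO = """<!DOCTYPE MEMO [
  <!ENTITY company "Acme Widget Company">
]>
<MEMO>Welcome to &company;.</MEMO>"""

# The same memo with whitespace inside the reference, which is not allowed.
BROKEN_MEMO = MEMO.replace("&company;", "& company ;")


def expand(xml_text):
    """Parse the document and return the root text with entities expanded."""
    return ET.fromstring(xml_text).text


def parses(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(expand(MEMO))
    print(parses(BROKEN_MEMO))
```

Changing the quoted value in the declaration changes every place the reference appears, which is what makes general entities handy for boilerplate text.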
