

<?xml?>: A Very Special Processing Instruction

Valid XML documents should always begin with the XML declaration. The XML declaration contains version information, encoding information, and information about whether the document relies on any external DTDs. Even though the contents of the XML declaration give only a very broad idea of the kind of XML document that follows, they provide some critical basic information to the parsers that interpret the document.

The XML declaration uses SGML processing instruction syntax. Technically, processing instructions, which begin with <? and end with ?>, don’t directly affect the SGML. Instead, they provide instructions to outside programs, like formatters, that are less concerned with the syntactical structure of a document than they are with making it look as its designers intended. (XML processing instructions must end with ?>, rather than the > that standard SGML uses.) Most processing instructions intended for outside formatters have their own syntax; they needn’t follow typical SGML syntax. Although it would be unusual,

  <?Jimmy - use the burnt umber crayon for this. ?>

could be an acceptable processing instruction if the processing application were a child named Jimmy. A more typical processing instruction might be

  <?FormatWhiz azure-embossed-type?>

FormatWhiz would probably (although not necessarily) be the name of the processing application, whereas the remainder of the instructions specify unusual formatting, possibly for a business card or wedding invitation.
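To see how a parser hands a processing instruction off to an application, consider the rough sketch below. It uses Python's standard xml.parsers.expat module to collect the target and data of each instruction. The card element and the collect_instructions helper are invented for this illustration; FormatWhiz is the hypothetical formatter described above.

```python
import xml.parsers.expat


def collect_instructions(document):
    """Return (target, data) pairs for every processing instruction found."""
    found = []
    parser = xml.parsers.expat.ParserCreate()
    # expat calls this handler once per processing instruction, passing the
    # target (the name right after "<?") and the remaining text as data.
    parser.ProcessingInstructionHandler = (
        lambda target, data: found.append((target, data))
    )
    parser.Parse(document, True)
    return found


# Hypothetical document: a made-up <card> element preceded by the
# FormatWhiz instruction from the text.
sample = "<?FormatWhiz azure-embossed-type?><card>Jane Doe</card>"
instructions = collect_instructions(sample)
print(instructions)
```

The parser does nothing with the instruction itself. It simply passes the target and the data along, and deciding what azure-embossed-type means is left entirely to the application.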

Processing instructions have been condemned as a diabolical means of creating unnecessarily complicated SGML that doesn’t transfer well between different parsers, but the XML working group appears to have settled on this syntax as the most appropriate way of telling the parser how to handle the document that follows.

Even though using processing instructions in XML is permissible, always avoid processing instructions that begin with <?xml or <?XML, in any mix of upper and lower case. Names that start with those letters are expressly reserved for future use by the XML specifications.

The XML declaration includes several parts: the opening <?xml, version information, the encoding declaration, the standalone declaration, and the closing ?>. The declaration as a whole is optional, and documents can still be well-formed without one. When a declaration is present, XML 1.0 requires the version information, while the encoding and standalone declarations may be left out; parsers fall back on default values for them. Unlike the HTML element, the XML declaration has no closing tag. </?xml> should never appear in a document. The XML declaration is an opening statement, an instruction, and nothing more.
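The shape of the declaration can be sketched in a few lines of Python. The build_declaration and is_well_formed helpers below are invented for this illustration, and the doc element is a placeholder. The sketch assumes the ordering that XML 1.0 defines (version first, then encoding, then standalone) and uses the standard xml.parsers.expat module only as a convenient well-formedness check.

```python
import xml.parsers.expat


def build_declaration(version="1.0", encoding=None, standalone=None):
    """Assemble an XML declaration, leaving out any part that is None."""
    pieces = [f'version="{version}"']
    if encoding is not None:
        pieces.append(f'encoding="{encoding}"')
    if standalone is not None:
        pieces.append(f'standalone="{standalone}"')
    # The parts are emitted in a fixed order: version, encoding, standalone.
    return "<?xml " + " ".join(pieces) + "?>"


def is_well_formed(document):
    """Return True if expat parses the document without raising an error."""
    try:
        xml.parsers.expat.ParserCreate().Parse(document, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


full = build_declaration(encoding="UTF-8", standalone="yes")
print(full)
print(is_well_formed(full + "<doc/>"))   # declaration with every part
print(is_well_formed("<doc/>"))          # no declaration at all
print(is_well_formed("<doc/></?xml>"))   # a "closing tag" for the declaration
```

Only the last document is rejected, which matches the point that the declaration opens a document and is never closed.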

The version information in this first version of XML is quite simple: version="1.0". Version 1.0 will be the default, providing a base for all future implementations of XML. Whatever happens to the standard, leaving out the version completely or specifying version 1.0 should mean that the documents and document type declarations written for version 1.0 will be interpreted as originally intended even when XML reaches version 7.3 or even 20.0. Unlike HTML, which arrived as version 0.9 when it became publicly available, XML has been at version 1.0 since the working drafts first appeared.

The standalone declaration announces whether a document contains references to external document type declarations. The value may be either "yes" or "no". If no standalone declaration appears, the default is "no". Valid documents are required to provide an honest answer for this declaration. Documents may make references to external entities and still claim "yes", but may not refer to external DTDs. The Proposed Recommendation suggests that any XML document can be converted into a standalone document for processing if necessary, and some simple applications may well choose to reject all documents that are not standalone documents. In general, however, document developers who are building sets of valid documents will most likely answer "no" or leave out this declaration entirely.
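An application that wants to reject documents that are not standalone first has to find out what the document claims. The sketch below is one way to do that with the XmlDeclHandler callback of Python's standard xml.parsers.expat module. The standalone_value helper is invented for this example, and it applies the default of "no" described above.

```python
import xml.parsers.expat


def standalone_value(document):
    """Return "yes" or "no", treating a missing declaration as "no"."""
    reported = []
    parser = xml.parsers.expat.ParserCreate()
    # expat passes standalone as 1 for "yes", 0 for "no", and -1 when the
    # standalone declaration was left out.
    parser.XmlDeclHandler = (
        lambda version, encoding, standalone: reported.append(standalone)
    )
    parser.Parse(document, True)
    # No XML declaration, or no standalone part, both fall back to "no".
    return "yes" if reported and reported[0] == 1 else "no"


print(standalone_value('<?xml version="1.0" standalone="yes"?><doc/>'))
print(standalone_value('<?xml version="1.0" standalone="no"?><doc/>'))
print(standalone_value('<?xml version="1.0"?><doc/>'))
```

Note that this reads only the claim. It makes no attempt to verify that a document answering "yes" really avoids external DTDs.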

The standalone declaration replaces the RMD declaration that appeared in earlier working drafts.

Encoding addresses complex issues related to internationalization. XML allows developers to specify which of several different character encoding schemes should be applied to a document. The default scheme is UTF-8, which represents the ASCII characters (including most of the characters used in English) directly as single bytes with values of 0–127, and provides multibyte encodings for Unicode characters with higher values. UCS-2, which XML parsers are required to support, applies the Unicode/ISO/IEC 10646 standards, which extend the character space to 16 bits, allowing values from 0 to 65,535, a very significant expansion that allows the inclusion of most modern languages. (There are still significant problems with Chinese characters and a few other character sets that remain under negotiation.) XML parsers can (although they aren’t required to) support several other encodings, including ISO 8859-1 through ISO 8859-9, which represent most European languages, and EUC-JP, Shift_JIS, and ISO-2022-JP, which represent Japanese. The encoding scheme’s name must always be enclosed in single or double quotes and described using the Latin character set.
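The difference between these schemes is easiest to see in the bytes themselves. The short Python sketch below is only an illustration. Python has no codec named UCS-2, so it borrows the utf-16-be codec, which yields the same two bytes per character for the characters used here. The name element and its content are made up for the example.

```python
import xml.dom.minidom

# UTF-8 keeps ASCII characters (values 0-127) as single bytes...
ascii_bytes = "A".encode("utf-8")
# ...and spends two or more bytes on characters with higher values.
# "\u00e9" is the letter e with an acute accent.
accented_bytes = "\u00e9".encode("utf-8")
# A 16-bit encoding uses two bytes even for an ASCII character.
wide_bytes = "A".encode("utf-16-be")

print(ascii_bytes, accented_bytes, wide_bytes)

# A document stored as ISO 8859-1 names its encoding in the declaration so
# that the parser knows to read the single byte 0xE9 as the accented letter.
latin1_document = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\u00e9</name>'
).encode("iso-8859-1")
parsed = xml.dom.minidom.parseString(latin1_document)
text = parsed.documentElement.firstChild.data
print(text)
```

Without the encoding declaration, the parser would assume UTF-8 and would not be able to make sense of the lone 0xE9 byte.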

Encoding has the same case-sensitivity problem that version has. Watch the standard and check your parser or browser documentation for the latest updates.

