XML: A Primer: XML: Building Structures

Building Blocks

XML uses structures that are very similar to HTML, which isn’t surprising given their shared roots in SGML. Underneath those structures is a language for defining structures that gives XML the power HTML lacks. We’ll start by examining the structures on the surface of HTML and then dig down until we’ve uncovered enough to begin building some simple XML documents.

Elements and Tags

Even though HTML designers have used the terms element and tag fairly interchangeably, the difference between them is significant. A tag is a single piece of markup, such as <P>, </P>, or <LI>. An element is a fully formed application of those tags: an opening tag, a closing tag, and everything between them. A paragraph element might look like this:

  <P>This is a <EM>sample</EM> paragraph element. It
  includes several other elements, including an
  emphasis (EM) element that includes the word
  'sample' and a <B>bold</B> element that includes
  the word bold.</P>

This text includes six tags (three opening, three closing) but only three elements. The paragraph element contains the two other elements. The nesting rules, which have frustrated designers since HTML first appeared, rely on this distinction between tags and elements. The following code is illegal HTML, even though some browsers will render it properly anyway:

  <B>This is bold. <I>This is bold italic.</B> This
  is italic.</I>

In my word processor, I get away with this every day. I don’t need to convert text back to normal before I’m allowed to add additional formatting; formats are understood to be additive and can layer on top of each other. If the <B> and <I> tags were simply formatting switches, this code would work the same way. It doesn’t work that way in HTML (or XML or SGML), however. That kind of code attempts to create elements whose beginning and end tags overlap, as shown in Figure 3.1.

Figure 3.1 Overlapping elements are prohibited.
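
The overlap rule can be checked mechanically by keeping a stack of open elements: every end tag has to match the element opened most recently. The sketch below is only an illustration of that idea, built on Python's standard html.parser module. The names NestingChecker and find_overlaps are hypothetical, and the sketch assumes every start tag is supposed to have an end tag, which is not true of HTML elements such as BR.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Keeps a stack of open elements and records end tags that overlap."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # elements opened but not yet closed
        self.problems = []       # (end tag seen, element that should close first)

    def handle_starttag(self, tag, attrs):
        self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if self.open_elements and self.open_elements[-1] == tag:
            # Proper nesting: the end tag closes the innermost open element.
            self.open_elements.pop()
        else:
            expected = self.open_elements[-1] if self.open_elements else None
            self.problems.append((tag, expected))


def find_overlaps(markup):
    """Return a list of (end tag, expected end tag) pairs; empty if nested."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


overlapping = "<B>This is bold. <I>This is bold italic.</B> This is italic.</I>"
nested = "<B>This is bold. <I>This is bold italic.</I></B>"

print(find_overlaps(overlapping))  # [('b', 'i')]: </B> arrives while I is open
print(find_overlaps(nested))       # []
```

Note that html.parser reports tag names in lowercase, which is why the result mentions b and i rather than B and I.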

Overlapping elements produce all kinds of ambiguity, especially when content (and formatting) gets more complicated. In HTML, the result shown in Figure 3.2 can be produced properly in any of the following three ways:

  <B>This is bold.</B> <I><B>This is bold
  italic.</B></I> <I>This is italic.</I>

  <B>This is bold.</B> <I><B>This is bold
  italic.</B> This is italic.</I>

  <B>This is bold. <I>This is bold italic.</I></B>
  <I>This is italic.</I>

Figure 3.2 Bold, bold italic, and italic.

Each of these takes more tags than the overlapping version, but the beginning and end tags do not overlap. The first variation uses the most markup but is probably the safest way to create this text, especially if you anticipate cutting and pasting or otherwise moving it around. The other two variations use fewer tags and take advantage of HTML’s ability to nest elements, letting an element absorb the formatting of the element surrounding it. The only difference between them is whether the bold italic text is a bold element nested in the italic element that follows it (the second variation) or an italic element nested in the bold element that precedes it (the third). Figure 3.3 shows the three variations on nesting taking place here.

Figure 3.3 Acceptable element creations.

I always recommend creating the most containerized solution (like the first variation here), so that you can pick up elements and move them around without worrying about losing half your structures.
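
One way to convince yourself that the three variations really are equivalent is to walk each one and record which formatting elements are open around every run of text. The following is a minimal sketch of that check using Python's standard html.parser module. RunCollector and formatting_runs are hypothetical names, and the sketch assumes its input is already properly nested.

```python
from html.parser import HTMLParser


class RunCollector(HTMLParser):
    """Records each run of text with the set of elements open around it."""

    def __init__(self):
        super().__init__()
        self.active = []  # stack of currently open elements
        self.runs = []    # (text, frozenset of open element names)

    def handle_starttag(self, tag, attrs):
        self.active.append(tag)

    def handle_endtag(self, tag):
        # Assumes properly nested input: only the innermost element closes.
        if self.active and self.active[-1] == tag:
            self.active.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace and line breaks
        if text:
            self.runs.append((text, frozenset(self.active)))


def formatting_runs(markup):
    collector = RunCollector()
    collector.feed(markup)
    collector.close()
    return collector.runs


variations = [
    "<B>This is bold.</B> <I><B>This is bold italic.</B></I> <I>This is italic.</I>",
    "<B>This is bold.</B> <I><B>This is bold italic.</B> This is italic.</I>",
    "<B>This is bold. <I>This is bold italic.</I></B> <I>This is italic.</I>",
]

results = [formatting_runs(v) for v in variations]
print(results[0] == results[1] == results[2])  # True
```

All three variations yield the same three runs: bold, bold plus italic, and italic. What differs is only how many separate elements carry that formatting, which is what matters when you later move pieces around.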

XML is more particular than HTML in a few other significant respects. HTML is very forgiving about missing closing tags and missing quotes around attribute values, and can sometimes even forgive stray <, >, and & symbols in the text. Browsers tend to parse HTML as well as they can and do not worry too much about where an element is supposed to end. Many HTML elements, including the commonly used IMG, HR, and BR elements, don’t normally have closing tags.

XML is not this forgiving. For starters, unlike HTML (and even SGML), it’s case-sensitive: tags must match in capitalization as well as in name. To be well-formed, the minimal acceptable level of XML compliance, a document must have a closing tag of some kind for every opening tag, as well as quotes around all attribute values. Leaving these tags and quotes off should generate an error when the document is loaded. “Empty” elements, which don’t bracket any text, must either be given closing tags (for example, <BR></BR>) or have a slash at the end of their tag (for example, <BR/>), which acts as an equivalent. Most HTML browsers are much happier about the useless end tag than the mysterious slash at the end of the normal tag, so the end tag is probably the better way to handle this situation until XML browsers become the standard.

Unescaped markup symbols in the text may not always generate errors, but they will certainly cause problems and keep the document from being technically well-formed. The entities for encoding <, >, and & are the same in XML as they were in HTML (&lt;, &gt;, and &amp;), easing that transition.
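
These rules are easy to try out against an XML parser. The sketch below uses the parser in Python's standard xml.etree.ElementTree module, which is one implementation and not necessarily how any particular browser behaves. The helper is_well_formed, the sample documents, and the file name logo.gif are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Hypothetical sample documents, one per rule discussed above.
samples = {
    "matching case": "<P>Fine</P>",
    "mismatched case": "<P>Broken</p>",
    "missing end tag": "<DOC><P>Broken</DOC>",
    "unquoted attribute": "<IMG SRC=logo.gif></IMG>",
    "quoted attribute": '<IMG SRC="logo.gif"></IMG>',
    "empty element with end tag": "<DOC>one<BR></BR>two</DOC>",
    "empty element with slash": "<DOC>one<BR/>two</DOC>",
    "stray ampersand": "<P>AT&T</P>",
    "escaped ampersand": "<P>" + escape("AT&T") + "</P>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")

# escape() replaces the three markup symbols with their entities.
print(escape("<, >, and &"))  # &lt;, &gt;, and &amp;
```

With this parser, the mismatched-case, missing-end-tag, unquoted-attribute, and stray-ampersand samples are rejected, and the rest are accepted.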

At this point, it is unclear how tightly browsers and even parsers will enforce these rules. Given their HTML heritage, browsers may remain more forgiving than the standard. Despite that likely forgiveness, I strongly recommend complying with these rules. You’ll have better-structured documents that are easier to debug and manipulate as a result.
