Special Edition Using SGML: Following Good SGML Practice

Dealing with Mixed-Content Models

A mixed-content model contains character data mixed with elements or entities. Mixed-content models can parse cleanly, but only when they are written in a safe form. The best advice is to avoid them altogether, because they can lead to unpredictable parsing errors that are difficult to track down. Here is an example of mixed content:

    <!ELEMENT para - - (#PCDATA, textstuf)   >

Mixed-content models are a bad practice because, inside such an element, the parser treats characters that you would normally consider insignificant as real data. Consider carriage returns, for example. When the parser runs into a carriage return in a <PARA> element, it immediately treats it as part of the content, whether you want it there or not.

Note:
The tricky part about these bad models is that the DTD parses. You might parse a hundred documents that have a DTD with a faulty content model. Sooner or later, if you don’t eliminate these practices from your DTDs, you’ll run across a document instance that will give you a problem. If you run into some oddball parsing errors that you don’t recognize, consider whether a bad content model might be the cause.
Mixed-content models cause parsing errors because the parser interprets carriage returns, tabs, and space characters as data. The DTD might pass the parser test, but when you come to some specific document instances, you will have problems. The parsing results can vary among parsers, but mixed content nearly always causes you headaches.

Figure 14.14 and figure 14.15 show a DTD declaration and document instance of mixed content.

Fig. 14.14 This is an example of a mixed-content model.

Fig. 14.15 Here is the problem text from the mixed-content model shown in figure 14.14.

The markup shown in figure 14.15 does not parse because the parser treats the carriage return (and any spaces) after each </ULIST> end tag as character data at a point where the content model does not allow data, and it gives you an error. However, the markup makes logical sense and should pass parsing. That’s the problem with using mixed-content models.
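
The figures are not reproduced here, so the following is only a hypothetical sketch of the kind of markup that runs into this problem. It is not the book's figure 14.14 or 14.15. It assumes the faulty declaration is the (#PCDATA,ulist*) form mentioned in the note later in this section, and the text content is invented for illustration. Exact behavior may differ among parsers.

    <!-- Assumed declarations (hypothetical, for illustration only) -->
    <!ELEMENT para  - - (#PCDATA,ulist*) >
    <!ELEMENT ulist - - (#PCDATA) >

    <!-- Hypothetical instance: the line break after the first </ulist>
         counts as character data, but at that point the model allows
         only another ulist or the end of para -->
    <para>Some introductory text
    <ulist>first item</ulist>
    <ulist>second item</ulist>
    </para>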

If you change your document instance to the markup in figure 14.16, it passes parsing.

Fig. 14.16 This is one way to fix the parsing problem shown in figure 14.15.

I recommend that you do not fiddle with where you place your tags. That is not the real problem. Even though the markup in figure 14.16 passes parsing, fix the content model instead. The best policy is to avoid mixed-content models altogether. If you need character data in the model, one option is to make an entity declaration for it and define it as #PCDATA. Another is to add to the content model a proxy for the data that you want. You’ll be much better off in the long run.
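
One way to read the "proxy" advice above is to wrap the character data in its own element, so that the parent has element content only. The sketch below assumes that reading. The element name text is hypothetical and does not come from the book.

    <!-- para now contains only elements, so line breaks between
         the child elements are not treated as data -->
    <!ELEMENT para  - - (text | ulist)+ >

    <!-- text is a hypothetical proxy element that holds the character data -->
    <!ELEMENT text  - - (#PCDATA) >
    <!ELEMENT ulist - - (#PCDATA) >

The trade-off is extra markup in every paragraph, in exchange for a content model in which white space between elements no longer matters.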

Note:
If you decide that you must use mixed content, remember these two rules: Put everything inside one group, and make the entire group repeatable. For example:
    <!ELEMENT article - - (para)+ >
    <!ELEMENT para - - (#PCDATA,ulist)* >
    <!ELEMENT ulist - - (#PCDATA) >

This parses cleanly because the * occurrence indicator now applies to the whole group, as in (#PCDATA,ulist)*, rather than to ulist alone, as in (#PCDATA,ulist*). If you must use mixed content, use it this way.

Dealing with Ambiguous Content Models

Ambiguous content models cause more parsing errors. A content model is ambiguous when the parser cannot decide whether an element that it encounters belongs to the content model of one structure or to the content model of another. The root cause is that the structures were not defined clearly in the first place.

Your parser must be clear about which content model the element it encounters belongs to. If it comes across an <a> element and both <X> and <Y> contain an <a> element, the parser must not have any confusion about which <a> element is in question.

Consider the following DTD:

    <!ELEMENT article - - (title, (note,caution)?, (note|figure|list|p)+)>

If you have dealt with these ambiguous models before, you can recognize this from 100 yards away. When the parser gets to the <note> element, does it belong to the second or third part of the model? You can’t tell from the way this content model currently exists. The second part of the group is optional. The parser has no way of knowing whether to require a caution when it sees the first note.

Figure 14.17 is a document instance that illustrates the confusion caused by an ambiguous content model.

Fig. 14.17 This is an example of a document marked up according to an ambiguous content model.

This is an ambiguous content model. The parser does not know whether to look for a caution following the note.
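
One possible way to remove this ambiguity, offered as a sketch rather than the book's own fix, is to give the optional note and caution pair a container element of its own. The element name warning is hypothetical. Because its start tag is required, the parser knows as soon as it sees a bare <note> that the note belongs to the repeatable group.

    <!-- warning is a hypothetical wrapper for the note and caution pair -->
    <!ELEMENT article - - (title, warning?, (note|figure|list|p)+) >
    <!ELEMENT warning - - (note, caution) >

This changes the markup that authors must enter, so it fits best when the note and caution pair really is a distinct structure in your documents.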

There are other possible ambiguous content models. Imagine two elements that can appear in sequence and have similar content, where the first element allows its end tag to be omitted and the second element allows its start tag to be omitted. What do you do when the first element is followed by the second element? For example:

    <!ELEMENT A - O ((H1|H2),(P|L|G)+)*>
    <!ELEMENT B O - (H1,(P|L|G)*)>

Figure 14.18 shows the document instance.

Fig. 14.18 This is another example of a document marked up according to an ambiguous content model.

If you were the parser, you would not know whether the third paragraph is part of the first element or the second element. You would return an error.
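
A minimal sketch of one way to avoid this situation, assuming that you can give up tag minimization for these two elements, is to require both tags on each. The boundary between A and B is then always explicit in the document instance.

    <!-- Both start and end tags are required (- -), so the parser never
         has to guess where A ends and B begins -->
    <!ELEMENT A - - ((H1|H2),(P|L|G)+)* >
    <!ELEMENT B - - (H1,(P|L|G)*) >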
