

Data Gathering

As you gather sample data for each document type to be incorporated into your SGML system, make sure the samples span the range of what you will encounter in your typical document production environment. Table 23.3 lists the kinds of samples to gather for use in your document analysis process.

Table 23.3 Data Gathering: Document Samples To Look For

Type                 Characteristics
-------------------  ----------------------------------------------------------
Typical              Average sample of the document type. Representative of the document class.
Simple               Simple version of the document type. Complete, yet uncomplicated, sample of the document class.
“Sample from Hell”   Horrendously complex permutation of the document type. Uses many, if not most, of the variations and complexities of the document class.
Update Portions      Representative “chunks” of the document (or source data) at the level at which the document is updated or maintained.

The typical document should be used as your baseline. It represents the normal version of the document class that will be encountered in everyday production.

The simple document example can serve as a good sanity check on your document design (through your DTD), ensuring that your document model is not too complex to work well with simple versions of the class. When you are happy with how your document model works with typical and simple versions of the document, you’re ready for the stress test: the sample from Hell.

The sample from Hell is the document sample that many people dread. It is often full of exceptions to the rule. Often it doesn’t fit the model for various and sundry reasons: “We just put Not Applicable in the section for control systems for the X-95 because it’s the only model that doesn’t have one,” or “All of our Part Breakdown documents have 12 standard sections except for the Loxomatic model, which has 238.” In short, this type of sample is the one with the black motorcycle jacket; it breaks all the rules!

The samples with the exceptions will allow you to make decisions on what to do with your document model: Should you make it generic enough so that the sample from Hell fits in or do you require changes in the document that breaks the rules? (No easy answer here; your decision will depend on your particular situation.)
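To make the trade-off concrete, here is a minimal sketch of the X-95 case as a DTD fragment. The element names (Manual, Overview, Controls, Maint) and the two parameter entities are invented for illustration; they are not taken from any real DTD. Switching which entity is set to INCLUDE selects either the strict model, in which every manual must contain a control systems section, or the generic one, in which that section is optional.

```sgml
<!-- Hypothetical names throughout. Set exactly one entity to INCLUDE. -->
<!ENTITY % strict  "INCLUDE">
<!ENTITY % generic "IGNORE">

<![ %strict; [
<!-- Strict model: the X-95 manual must change to fit (for example,
     by carrying a Controls section that says Not Applicable). -->
<!ELEMENT Manual - - (Overview, Controls, Maint)>
]]>

<![ %generic; [
<!-- Generic model: Controls is optional, so the X-95 manual fits as is. -->
<!ELEMENT Manual - - (Overview, Controls?, Maint)>
]]>

<!ELEMENT (Overview | Controls | Maint) - - (#PCDATA)>
```

The strict version keeps every manual uniform at the cost of filler sections; the generic version accepts the exception but can no longer guarantee that the section is present.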


Tip:  
If you find yourself with a lot of samples that break the rules, you might be in a situation where two (or more) distinct classes of documents are being forced into one document model. If this is the case, you might consider breaking them down into separate document classes.

For documents that undergo frequent updates, looking at the typical chunks that are updated is very useful. Because these portions are often the most dynamic parts of a document class, your analysis might lead you to give them special treatment in your document model (to make them more accessible, more amenable to database manipulation, and so on).
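One possible form of special treatment, sketched here under assumptions, is to give the maintained chunk its own element with a unique identifier and revision attributes, so that a single chunk can be located, replaced, or stored as a database record. The element and attribute names (Task, Step, id, revision, revdate) are hypothetical.

```sgml
<!-- Hypothetical maintained chunk: each Task can be found by its ID
     and carries its own revision information. -->
<!ELEMENT Task - - (Title, Step+)>
<!ATTLIST Task
          id        ID     #REQUIRED
          revision  CDATA  #IMPLIED
          revdate   CDATA  #IMPLIED>
<!ELEMENT (Title | Step) - - (#PCDATA)>
```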

Document Analysis

Because you examined document analysis in some detail previously in Part II, you don’t need to examine all of the details here. Instead, take a look at some of the fundamentals of good document analysis:

  Use a top-down approach
  Work from the outside in
  Develop your document model
  Validate your design
  Iterate, iterate, iterate

When performing your document analysis, start from the beginning (or top) of your document and work your way down. Before you begin, you should review your goals.

As you examine your sample documents, work your way through the document sections. Working from the outside in, concentrate on the borders or boundaries between the high-level sections before you look at their contents.

For example, in working through a structure for a book, your initial pass of high-level objects could result in a document model that looks like figure 23.1. Note that only the high-level objects have been identified at this point (TitlePg, TableContents, Foreword, and so on).


Fig. 23.1  Document model with high-level objects only.
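Expressed as a DTD fragment, a first pass of this kind might look like the sketch below. TitlePg, TableContents, and Foreword are the names mentioned in the text; Book and Chapter, along with the order and occurrence of the objects, are assumptions made for illustration. Note also that a name as long as TableContents exceeds the NAMELEN limit of 8 in the reference concrete syntax, so the sketch assumes an SGML declaration that raises that limit.

```sgml
<!-- First pass: boundaries only. TitlePg, TableContents, and Foreword
     are named in the text; Book and Chapter are hypothetical. -->
<!ELEMENT Book - - (TitlePg, TableContents, Foreword, Chapter+)>

<!-- Contents not analyzed yet, so each object accepts anything for now. -->
<!ELEMENT (TitlePg | TableContents | Foreword | Chapter) - - ANY>
```

Declaring the contents as ANY is only a placeholder. It lets you settle the boundaries between objects first and tighten each content model in a later pass.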

As you step through the process, you will work your way into the high-level objects, defining the objects contained within them. You will continue iterating this approach until you have defined all of the objects (or elements) within your document model (see fig. 23.2).


Fig. 23.2  Document model with low-level objects included.
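As a rough illustration of this later stage, the fragment below opens up two high-level objects and defines what they contain. Only TitlePg comes from the text. Chapter and every lower-level name (Title, Author, Para, List, Item, Section) are guesses at plausible contents, not a transcription of figure 23.2.

```sgml
<!-- Later pass: two high-level objects opened up. TitlePg comes from the
     text; Chapter and every lower-level name are hypothetical. -->
<!ELEMENT TitlePg - - (Title, Author+)>
<!ELEMENT Chapter - - (Title, (Para | List)+, Section*)>
<!ELEMENT Section - - (Title, (Para | List)+)>
<!ELEMENT List    - - (Item+)>
<!ELEMENT (Title | Author | Para | Item) - - (#PCDATA)>
```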

When you are satisfied with your document model, it’s time to test it against your sample documents. In this process, you check to see that it works well with your samples (including the sample from Hell). It’s quite possible that you will find some problems in a few places that will require another iterative pass through the process.
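In practice, testing usually means marking up each sample and running it through a validating SGML parser. The following self-contained sketch pairs a cut-down, hypothetical model with a simple sample document; the element names and the sample text are invented for illustration.

```sgml
<!DOCTYPE Book [
<!-- Cut-down, hypothetical model so that the example is self-contained. -->
<!ELEMENT Book    - - (TitlePg, Foreword, Chapter+)>
<!ELEMENT Chapter - - (Title, Para+)>
<!ELEMENT (TitlePg | Foreword | Title | Para) - - (#PCDATA)>
]>
<Book>
<TitlePg>Widget Owner's Guide</TitlePg>
<Foreword>Read this guide before first use.</Foreword>
<Chapter>
<Title>Getting Started</Title>
<Para>Unpack the widget and check the parts list.</Para>
</Chapter>
</Book>
```

If a sample needs something the model forbids, such as a Chapter with no Title, the parser reports an error at that point, which shows you where the next iterative pass should focus.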

Repeating (or iterating) the steps in the process is a standard part of developing SGML document models. In fact, it’s a common facet of many of the processes in SGML projects.

When performing document analysis, it’s common to develop the document model for one type of document class and then perform the whole process all over again with the next document class. All of this iteration might seem unusual at first, but eventually it will feel routine and natural.

