

Converting AnyCorp’s Documents. Going back to AnyCorp for a moment, imagine a slight change to the situation. You have been asked to put the information into SGML, but there is a catch: you must also convert the legacy documents produced over the past 20 years into the new format.

“No big deal,” you think. You’ll just dig up the disks containing the data, read them into your SGML authoring package, and add the necessary tags. In going about locating the data, however, you discover that the format of these documents has changed several times during the last 20 years:

  For the first 5 years or so, the documents were created by a typist (no electronic data is available, only hardcopy documents).
  For the next 7 years, they were prepared on proprietary word processing hardware. (This hardware was replaced years ago and your computer doesn’t read the disks for these files.) You have hardcopy versions of these documents as well.
  Documents produced since then can generally be read by some version of a word processing program that you have.

In this situation, you realize that some of the documents must be scanned with a document scanner to bring them into your computer. As you look at some of the older documents, you realize that their quality is sometimes poor (see fig. 5.2).


Fig. 5.2  Legacy data conversions of hardcopy documents are sometimes dependent on poor-quality originals.

The approach you worked out earlier for creating new documents also applies to existing legacy data, with one addition: you need a separate plan for the conversion itself.


• See “What You’re Going To Do,” earlier in this chapter

The conversion plan includes a number of steps:

  Convert legacy data into a common electronic format
  Convert legacy data into a common structural format
  Convert legacy data into a tagged SGML document instance

To get your documents into a common electronic format, you must first decide what that format will be. (For many situations, plain ASCII text works best.) From there, you can choose to scan the hardcopy originals into raster image files, and then convert them to text via an optical character recognition program. For some document originals, actually retyping them may be easier.

For documents that you already have in electronic form, a format conversion program might be able to do the conversion for you.


Tip:  
When performing data conversions, be sure to visually confirm the quality of the conversion! Optical character recognition and data format conversions will not provide 100 percent accuracy.
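A small script can help you decide where to look first during that visual check. The following is only a sketch: the patterns are assumed heuristics for common OCR misreads, not a complete list, and the function name `flag_for_review` and the sample text are made up for illustration. It does not replace reading the converted text against the original.

```python
import re

# Heuristic patterns that often indicate OCR misreads.
# These are assumptions for illustration, not an exhaustive list.
SUSPICIOUS = [
    (re.compile(r"[^\x00-\x7F]"), "non-ASCII character"),
    (re.compile(r"\b[A-Za-z]+\d+[A-Za-z]+\b"), "digit inside a word"),
    (re.compile(r"([A-Za-z])\1{3,}"), "letter repeated 4+ times"),
]


def flag_for_review(text):
    """Return (line_number, reason, line) tuples for lines worth a visual check."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern, reason in SUSPICIOUS:
            if pattern.search(line):
                findings.append((number, reason, line))
    return findings


# Made-up sample: "Rep1ace" has the digit 1 where the letter l belongs.
sample = "Product Advisory\nRep1ace the valve assembly\nSerial numbers 100 to 250\n"
for number, reason, line in flag_for_review(sample):
    print(f"line {number}: {reason}: {line}")
```

Lines that pass these checks can still contain errors, so treat the output as a list of places to start, not as proof that the rest is clean.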

Once you have your documents converted into a common electronic format, you will need to convert them into a common structural format. Often, legacy data that spans a number of years also spans a number of structural formats. (People change the look and structure of a document over the years.)

In the case of your product advisory, take a moment to examine its current structure (see fig. 5.3).


Fig. 5.3  The current version of your document has a specific collection of data objects in a specific order.

Notice that specific data objects (or elements) occur in a specific order in this document. For example, there is an advisory number, followed by an advisory type, a date, a revision date, and a subject.

Now take a look at a sample of this document as it looked some 18 years earlier (see fig. 5.4).


Fig. 5.4  Earlier versions of documents often have different document structures and content.

Notice that this earlier version of the document looks different. The identifying data at the top (category, date, product, serial #, and title) is not the same as the current version. In the process of converting your documents into a common structural format, you’ll have to decide how to map these earlier data objects into your current element structure.
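One way to make the mapping decision explicit is to write it down as a table before converting anything. The sketch below assumes a particular mapping (category to advisory type, title to subject) purely for illustration; the element names such as `advnum` and `revdate` are hypothetical, and your own DTD and editorial decisions determine the real ones.

```python
# Hypothetical mapping from legacy field names to current element names.
# None means "no direct counterpart; needs an editorial decision".
FIELD_MAP = {
    "category": "advtype",
    "date": "date",
    "title": "subject",
    "product": None,
    "serial #": None,
}

# Elements the current structure expects, in order (names are assumptions).
CURRENT_ELEMENTS = ["advnum", "advtype", "date", "revdate", "subject"]


def map_legacy_record(legacy):
    """Split a legacy record into mapped fields, unmapped fields, and gaps."""
    mapped, unmapped = {}, {}
    for name, value in legacy.items():
        target = FIELD_MAP.get(name)
        if target is None:
            unmapped[name] = value
        else:
            mapped[target] = value
    # Current elements that no legacy field supplied.
    missing = [name for name in CURRENT_ELEMENTS if name not in mapped]
    return mapped, unmapped, missing


# Made-up legacy record for illustration.
legacy = {
    "category": "Safety",
    "date": "1978-03-14",
    "product": "Widget",
    "serial #": "100-250",
    "title": "Valve inspection",
}
mapped, unmapped, missing = map_legacy_record(legacy)
print("mapped:  ", mapped)
print("unmapped:", unmapped)
print("missing: ", missing)
```

Reporting the unmapped and missing fields separately keeps the hard cases visible: here the product and serial number have nowhere to go, and the advisory number and revision date have no legacy source.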

In addition to developing a mapping approach and converting your documents into a common structural format, you’ll also have to convert them into SGML, ensuring that the final converted document is properly tagged as a valid SGML document instance.
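The tagging step itself can be quite mechanical once the data is in a common structure. The following sketch writes a flat record out with explicit start and end tags. The element names (`advisory`, `advnum`, and so on) are hypothetical, the document type declaration is left out because it depends on your DTD, and the output is only tagged text: whether it is a valid instance is for a validating parser to decide.

```python
def escape_sgml(text):
    """Replace markup-significant characters with entity references.

    Assumes the DTD declares the `amp` and `lt` entities; SGML has no
    predefined entities, so this depends on your DTD.
    """
    return text.replace("&", "&amp;").replace("<", "&lt;")


def to_sgml(record, root="advisory",
            order=("advnum", "advtype", "date", "revdate", "subject")):
    """Serialize a flat record as tagged text with explicit end tags."""
    lines = [f"<{root}>"]
    for name in order:
        # Elements absent from the record are skipped, not emitted empty.
        if name in record:
            lines.append(f"<{name}>{escape_sgml(record[name])}</{name}>")
    lines.append(f"</{root}>")
    return "\n".join(lines)


# Made-up record for illustration.
print(to_sgml({"advnum": "A-101", "subject": "Pumps & valves"}))
```

Writing every end tag explicitly, even where a DTD might allow omission, tends to make conversion errors easier to spot later.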


Note:  
The actual order of the conversion steps can depend on the nature of your legacy data. In some cases, the common structural format conversion and the conversion into SGML may be performed in the same processing step.

Selecting appropriate tools to perform your conversions can be easy or extremely complex, depending on many interrelated factors. The quantity of legacy data, your ongoing document volume and complexity, the data formats, and many other issues can factor into your choices.


Tip:  
In many cases, you might want to have legacy data conversions performed by a data conversion company familiar with SGML conversions.

As you have seen, the conversion of legacy data into SGML can get complicated. Before performing any conversions of a large number of documents, it can be very helpful to map out the steps in the process in some detail.

Parsing DTDs and Documents

When filtering documents from other formats into SGML, parsing takes on a much greater importance. Parsing your DTDs is the same as in the ground-up approach; you must parse them after each modification you make.

Converted documents are different beasts entirely. Because no conversions of complex documents are perfect, errors in the conversion process creep in. As a result, it’s wise to parse all converted documents (document instances) with a validating parser.
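Parsing every converted document is easier to keep up if it is automated. The sketch below runs a parser command over a list of files and collects the ones that fail. The function name `parse_all` is made up, `command` is a placeholder for whatever validating parser you use, and treating a non-zero exit status as "errors were reported" is an assumption you should check against your parser's documentation.

```python
import subprocess
import sys


def parse_all(paths, command):
    """Run `command + [path]` for each path; return {path: stderr} for failures.

    A non-zero exit status is treated as "the parser reported errors",
    which is an assumption about the parser you plug in.
    """
    failures = {}
    for path in paths:
        result = subprocess.run(
            command + [str(path)], capture_output=True, text=True
        )
        if result.returncode != 0:
            failures[str(path)] = result.stderr.strip()
    return failures


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage (hypothetical): script.py PARSER_COMMAND file1.sgm file2.sgm ...
    for path, errors in parse_all(sys.argv[2:], [sys.argv[1]]).items():
        print(f"{path}:\n{errors}\n")
```

With OpenSP installed, for example, the command might be `["onsgmls", "-s"]`, where `-s` suppresses the normal output so that only error messages remain. Any parser that reads a file named on the command line can be substituted.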

Early conversions generally produce many parsing errors. If your heart sinks when you see all of these errors, don’t feel too bad. As you tweak both your input documents and your conversion programs, the number, variety, and frequency of errors can be greatly reduced.
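Counting errors by kind after each conversion run is one way to see whether your tweaks are working. The sketch below assumes each message looks like `source:line:column:severity: text`, a shape used by some SGML parsers; your parser's format may differ, and the sample messages are invented for illustration.

```python
import re
from collections import Counter

# Assumed message shape: "<source>:<line>:<column>:<severity>: <text>".
LOCATION_PREFIX = re.compile(r"^.*?:\d+:\d+:[A-Z]:\s*")
QUOTED = re.compile(r'"[^"]*"')


def tally_errors(messages):
    """Count messages by kind, ignoring location and quoted names."""
    kinds = Counter()
    for message in messages:
        text = LOCATION_PREFIX.sub("", message.strip())
        if text:
            # Replace quoted names so the same kind of error groups together.
            kinds[QUOTED.sub('"..."', text)] += 1
    return kinds


# Invented sample messages from one conversion run.
run = [
    'adv001.sgm:12:5:E: element "SERIAL" undefined',
    'adv001.sgm:30:1:E: element "PRODUCT" undefined',
    'adv002.sgm:4:9:E: end tag for "DATE" omitted',
]
for kind, count in tally_errors(run).most_common():
    print(f"{count:4d}  {kind}")
```

Comparing these tallies from one run to the next shows which kinds of error a change to the conversion program removed and which it introduced.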

On occasion, you may hear people say that they don’t parse all of their converted documents, that it isn’t necessary, and so on. Don’t believe it! The number of possible tag combinations within even a simple DTD is tremendous. For very similar documents, a regular parsing step may show only a few minor errors for a long time. However, just when you think that the parsing step is unnecessary, some major parsing errors surface. Think of parsing as cheap insurance for ensuring solid, reliable SGML documents.


Tip:  
Parsing can be a strange, mysterious, and complex experience. If you encounter parsing errors that you can’t track back to identifiable SGML errors, consider cross-checking with another parser. No parser is perfect, and parsing is an extremely difficult task to perform!

