

Converting from an RTF File

The best way to convert from RTF is to use an SGML converter that can read RTF files. There are several—including three Word for Windows add-ons, EBT’s Dynatag, and WordPerfect’s SGML module—but they all run under Windows. You can also try an RTF-to-HTML converter, such as the public domain rtf2html, SoftQuad’s HoTMetaL Pro, or ClarisWorks. These programs start the process of tagging your document. You then must convert the HTML tags to their SGML equivalents and insert any other tags that do not have HTML equivalents.

If that is not possible, try to convert the RTF yourself, using the regular expression feature of an ASCII editor or a scripting language. In either case, it is important to have RTF that is as clean and consistent as possible. Author the document with consistent styles, and design those styles to handle interparagraph spacing, indentation, and other formatting features, so that extra returns and tabs are not inserted. You can also use inline text styles (such as bold, italic, and underline) to indicate phrase-level elements, provided that each one is reserved for a single purpose. For example, you might use italic to indicate only foreign words and underline to indicate only technical terms.
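
As a rough illustration of the regular expression approach, the following Python sketch turns simple inline RTF groups into phrase-level elements. The element names foreign and term are hypothetical, and the pattern assumes groups that are not nested and that contain a single control word followed by a space.

```python
import re

# Hypothetical mapping from RTF inline control words to SGML element names.
# \i is italic and \ul is underline; the element names are placeholders.
INLINE_MAP = {"i": "foreign", "ul": "term"}

# Matches a simple, non-nested group such as {\i Zeitgeist} or {\ul parser}.
INLINE_GROUP = re.compile(r"\{\\(i|ul) ([^{}]*)\}")


def tag_inline(text: str) -> str:
    """Replace simple inline RTF groups with SGML phrase-level elements."""

    def repl(match: re.Match) -> str:
        name = INLINE_MAP[match.group(1)]
        return f"<{name}>{match.group(2)}</{name}>"

    return INLINE_GROUP.sub(repl, text)


print(tag_inline(r"The {\i Zeitgeist} favors a {\ul parser}."))
```

This only works if the styles were applied consistently. Italic used for emphasis in one place and for foreign words in another cannot be told apart by a pattern like this.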

If you do not have access to other software or are working with many authors, a word processing program provides a good way to get text into electronic form and on its way to being tagged in SGML.


Caution:  
Not all RTF is created equal. Certain word processors that can output RTF do not preserve style information. Without styles, this approach does not work.

Suppose, for example, that you are using Word. The steps are:

1.  In the Word file, edit all the styles under the Styles menu item so that they have the same characteristics as the Normal style. In other words, the text remains assigned to a particular style, but when you look at it, it appears uniformly as Normal on the screen.
2.  Save the document as RTF.
3.  Open the RTF document with another editor, such as Qued/M, Alpha, or BBEdit. Look at the top of the file—where the RTF definition is—to see the style definitions. In the forest of curly braces, you will find a section labeled stylesheet. It contains style definitions, each of which is enclosed in curly braces. A style definition looks like this:
       {\s1 \f22 \sbasedon222\snext1 FT;}

\s1 gives the number of the style (here, 1), and FT is its name. The material in between is not important for this purpose. Keep a list of the paired names and numbers, because the numbers are what you will look for in the body of the document.
4.  Delete all the RTF information at the top of the file to the point where the text begins.
5.  Look for RTF control words inside the document that are repeated on nearly every paragraph, and use a global change command to delete them. Examples are \pard\plain and \sa240\sl240. You want to be left with only style numbers in the form \sx, where x is a number, and inline styles, such as bold and italic, in the form {\b …}. Note that paragraphs end with \par, which you also need to preserve.
6.  Convert the style numbers to your element names, using global changes and regular expressions. Check carefully that you are not losing information as you do this. You might be able to write a program that does the conversion.
7.  When you have converted all you can automatically, do the rest by hand. It is difficult to automate document conversion completely with this system because RTF does not map easily to SGML. RTF contains less structural information than SGML, and the information it does contain is not applied uniformly across paragraph boundaries.
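
The steps above can be sketched in Python as follows. This is a minimal sketch, not a complete converter. It assumes style definitions shaped like the one shown in step 3, a body that has already been cleaned down to \sx markers and \par terminators, and a hypothetical second style (BT) and hypothetical element names (title, para).

```python
import re

# A style definition such as {\s1 \f22 \sbasedon222\snext1 FT;}
# Group 1 is the style number, group 2 is the style name before the ";".
STYLE_DEF = re.compile(r"\{\\s(\d+)[^{};]*? ([^\\{};]+);\}")

# A cleaned paragraph: \sN, the text, then \par (but not \pard).
PARAGRAPH = re.compile(r"\\s(\d+) ?(.*?)\\par\b\s*", re.DOTALL)


def parse_styles(header: str) -> dict:
    """Return a mapping of style number (as a string) to style name."""
    return {number: name for number, name in STYLE_DEF.findall(header)}


def convert_paragraphs(body: str, styles: dict, elements: dict) -> str:
    """Wrap each styled paragraph in the element mapped to its style."""

    def repl(match: re.Match) -> str:
        element = elements.get(styles.get(match.group(1)))
        if element is None:
            # Leave unmapped paragraphs untouched for hand editing.
            return match.group(0)
        return f"<{element}>{match.group(2).strip()}</{element}>\n"

    return PARAGRAPH.sub(repl, body)


# Sample input; the BT style and both element names are made up.
header = r"{\stylesheet{\s1 \f22 \sbasedon222\snext1 FT;}{\s2 \f22 \snext2 BT;}}"
body = r"\s1 Converting from RTF\par \s2 Use clean styles.\par \s9 Unknown\par"

styles = parse_styles(header)
result = convert_paragraphs(body, styles, {"FT": "title", "BT": "para"})
print(result)
```

Paragraphs whose style has no mapping are passed through unchanged, which makes it easy to search for leftover \s markers when you do the remaining work by hand.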

Converting from SGML to Another DTD or Data Format

This kind of conversion is generally much easier than converting from another format into SGML. Both data formats are defined, and the DTD documents the structure of the source document. With a valid SGML document, you rarely have to guess what a document creator might have intended when they put a list item inside a chapter heading. The best way to take advantage of the regularity of structure in an SGML document is to process it with an SGML parser and perform the conversion based on the parser's output.
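
As an illustration of converting from parser output, the sketch below assumes a parser that writes one event per line in an ESIS-style format, as parsers in the sgmls family do: a line beginning with ( starts an element, ) ends it, and - carries character data. The exact output of your parser may differ, and the element names and target tags in TAG_MAP are hypothetical.

```python
# Hypothetical mapping from source element names to target tags.
TAG_MAP = {"CHAPTER": "div", "TITLE": "h1", "PARA": "p"}


def convert_events(lines) -> str:
    """Build target markup from ESIS-style event lines (an assumed format)."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        kind, rest = line[0], line[1:]
        if kind == "(" and rest in TAG_MAP:
            out.append(f"<{TAG_MAP[rest]}>")
        elif kind == ")" and rest in TAG_MAP:
            out.append(f"</{TAG_MAP[rest]}>")
        elif kind == "-":
            # Treat the two-character sequence \n in data as a line break.
            out.append(rest.replace("\\n", "\n"))
        # Other event types (attributes, end-of-document marker) are ignored.
    return "".join(out)


events = [
    "(CHAPTER",
    "(TITLE",
    "-Converting SGML",
    ")TITLE",
    "(PARA",
    "-Use a parser.",
    ")PARA",
    ")CHAPTER",
    "C",
]
print(convert_events(events))
```

Because the parser has already validated the document and resolved markup minimization, the conversion code never has to guess where an element begins or ends.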

You can often convert SGML by using an editor rather than a specialized tool. This technique is generally not suitable for complex DTDs with deeply nested structures. In such DTDs, the output created by a tag often depends on the tags surrounding it. This kind of dependency is impossible to handle in an editor unless the relevant tags are adjacent in the file.

You can use a plain Perl or Tcl script for simple DTDs when an editor cannot do the job. To handle simple context dependencies, keep track of a global state as the script processes the document. This approach is workable as long as the context is strictly limited. Attempting to track the interaction of many global states generally leads to a program that is hard to maintain and prone to obscure failures.
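
The same idea is sketched here in Python rather than Perl or Tcl. One flag, in_section, is the only state the script tracks, and it decides whether a title becomes a first-level or second-level heading. The element names, the target tags, and the assumption that tags carry no attributes and sections do not nest are all illustrative.

```python
import re

# A tag without attributes, or a run of text between tags.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)>|([^<]+)")


def convert(sgml: str) -> str:
    """Convert a simple document, choosing heading level from one state flag."""
    in_section = False  # the single piece of global state
    out = []
    for close, name, text in TOKEN.findall(sgml):
        name = name.lower()
        if text:
            out.append(text)
        elif name == "section":
            # Entering a section sets the flag; leaving it clears the flag.
            in_section = not close
        elif name == "title":
            level = "h2" if in_section else "h1"
            out.append(f"<{close}{level}>")
        elif name == "para":
            out.append(f"<{close}p>")
        # Any other tag (for example chapter) is dropped.
    return "".join(out)


source = (
    "<chapter><title>One</title>"
    "<section><title>Two</title><para>Text</para></section>"
    "</chapter>"
)
print(convert(source))
```

A single flag is easy to reason about. Once nested sections, lists inside tables, and similar interactions come into play, each needs its own state, and at that point a parser-based tool is the safer choice.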

From Here…

This chapter examined a variety of tools and strategies that you can use for editing, viewing, and converting SGML documents on a Macintosh. Although commercial offerings exist for every task except conversion, they can be expensive for small projects. However, good public domain software is available to fill in the gaps. If you can work in a mixed environment, you can use the Mac for editing and printing and do conversion on another platform, such as Windows or UNIX. If, however, you intend to use only the Macintosh, with a little ingenuity you can still put together a powerful suite of tools and complete your project successfully.

For more information, refer to the following:

  Chapter 26, “Tools for the PC: Authoring, Viewing, and Utilities,” examines the various SGML tools available for the PC.
  Chapter 28, “Other Tools and Environments,” surveys other SGML tools available on a number of computer platforms for performing data conversion, validation, transformation, and document viewing.
  Part VIII, “Becoming an Electronic Publisher,” examines the issues involved in moving into electronic publishing.

