XML: A Primer: Let Data Be Data

Back to the Origins: Structure and SGML

When Tim Berners-Lee created HTML in 1991, he based it on a more powerful but vastly more complex markup language called SGML, the Standard Generalized Markup Language. SGML had been around in various forms for 20 years, but its complexity had hobbled its adoption by organizations outside of publishing, government, and large-scale information processing. SGML markup, management, and processing were specialized skills, mastered by a small group of government, corporate, and academic users.

Developers who complain that the HTML standards are developing too slowly should look back at the tortured pace of SGML’s development. First conceived in the late 1960s, the Generalized Markup Language (GML) was created at IBM in 1969 by researchers coincidentally named Goldfarb, Mosher, and Lorie. Charles Goldfarb went on to chair the American National Standards Institute (ANSI) committee on Computer Languages for the Processing of Text in 1978, after GML had become an important standard in publishing. In 1980, the committee released its first working draft, and by 1983 the sixth working draft was adopted by users including the Internal Revenue Service and the Department of Defense, which mandated that its largest contractors use SGML as well. In 1984, the committee expanded into a group of collaborating committees developing standards for the International Organization for Standardization (ISO) as well as ANSI. In 1986, eight years into the standards process, SGML became ISO 8879:1986. Work on SGML continues, of course. A group of committees evaluates changes regularly, including projects on scripted style sheets, multimedia, link extensions, and a variety of document management issues.

Unlike HTML, SGML doesn’t specify how text should be presented. SGML is not a formatting language, nor even a particular markup language. SGML is a specification that allows people to create their own markup languages. It specifies content identifiers that make it easy to format text consistently and that allow document management systems to locate information quickly. SGML is well-suited for projects that involve large quantities of similarly structured data, such as catalogs, manuals, listings, transcripts, and statistical abstracts. SGML is a favorite of the federal government, as well as IBM and other large companies. It makes it easy for a centralized team to develop specifications for data structures and to create a Document Type Definition (DTD) that can then be applied to documents throughout the organization.

More importantly in many cases, documents created with SGML are easy to port to different formats. Because SGML uses content-based markup rather than format-based markup, it’s easy to change the formatting rules depending on whether a document is being output to a dot-matrix line printer, a laser printer, a four-color press, a CD-ROM, a Web site, or even audio speakers. Design teams can work alongside the original DTD designers, or follow up on their work, to present information in different styles that are appropriate to particular output media. Computerized storage systems can also treat the documents as small databases, querying them with searches based on content tags and index information. Companies that use the same information repeatedly (for proposals, for instance) can benefit greatly from having prefabricated text ready to be dropped into new documents. It doesn’t make the writing any better, but it does make it easier to find.
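The idea of one content-marked source feeding several outputs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses XML and the standard library's `xml.etree.ElementTree` as a stand-in, because Python ships no SGML parser, and the `catalog`, `item`, `name`, and `price` element names, along with every function name, are hypothetical and not drawn from any real DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog, marked up by content rather than by appearance.
SOURCE = """
<catalog>
  <item><name>Widget</name><price>9.95</price></item>
  <item><name>Gadget</name><price>24.50</price></item>
</catalog>
"""

root = ET.fromstring(SOURCE)


def render_plain(catalog):
    """Fixed-width lines, as might suit a line printer."""
    return [
        f"{item.findtext('name'):<10}{item.findtext('price'):>8}"
        for item in catalog.iter("item")
    ]


def render_html(catalog):
    """The same content as an HTML list, as might suit a Web page."""
    rows = "".join(
        f"<LI><B>{item.findtext('name')}</B> ${item.findtext('price')}</LI>"
        for item in catalog.iter("item")
    )
    return f"<UL>{rows}</UL>"


def names_under(catalog, limit):
    """Query the document like a small database, keyed on content tags."""
    return [
        item.findtext("name")
        for item in catalog.iter("item")
        if float(item.findtext("price")) < limit
    ]


if __name__ == "__main__":
    print("\n".join(render_plain(root)))
    print(render_html(root))
    print(names_under(root, 10.0))
```

Neither rendering function changes the source. Only the formatting rules differ, and the query relies on the same content tags that the formatters use.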

HTML: Decaf SGML?

Most of what SGML contributed to HTML was syntax: a markup language that used the now familiar <TAG ATTRIBUTE=VALUE> Content here </TAG> style. Some of SGML’s intent to separate content from formatting survived as well, as evidenced by the wildly divergent interpretations of common tags by different browsers. By describing elements with terms like <EM> for emphasis and <ADDRESS> for address information, Berners-Lee created a simple formatting language that was flexible enough to handle many different kinds of information. The <H1> through <H6> tags spoke of levels of headings, providing a somewhat natural structure to documents. The <HEAD> and <BODY> tags separated meta-information (the <TITLE>, at first) from the visible text of a document. Most important, the <A HREF="destination.html"> anchor tags provided a simple yet powerful structure for hypertext links.
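To make the tag, attribute, and content pieces of that syntax concrete, the following sketch walks a short fragment with Python's standard `html.parser` module and records each piece as it appears. The sample markup and the `TagLister` and `list_events` names are assumptions made up for this illustration. Note that this parser reports tag and attribute names in lowercase, whatever case the source uses.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collects start tags (with attributes), text content, and end tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.events.append(("text", data.strip()))


def list_events(markup):
    """Parse a fragment and return the recorded events in document order."""
    parser = TagLister()
    parser.feed(markup)
    parser.close()
    return parser.events


# Hypothetical fragment in the style described above.
SAMPLE = '<H1>Links</H1><A HREF="destination.html">Go there</A>'
events = list_events(SAMPLE)

if __name__ == "__main__":
    for event in events:
        print(event)
```

The anchor's destination surfaces as an ordinary attribute value, which is all a program needs in order to follow the link.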

HTML did a wonderful job of simplifying SGML and putting markup into the hands of amateurs, a necessary move for broadening markup’s appeal. Ironically, Tim Berners-Lee never really intended for users to have to enter codes by hand. The initial experiments at CERN used a simple markup processor that managed the codes invisibly. While HTML remained a small collection of tags with only a few attributes, this friendly model made it easy for authors to get started using the Web to exchange papers and share information.

As we’ve seen, however, HTML was ill-suited to a world that had been spoiled by the control WYSIWYG tools had already given designers. Although it was clear that hyperlinks and the Web’s incredible ease of use were good things, there were rumblings from the start that these academics had created a useless formatting language. At the beginning of the browser wars, it was clear that HTML’s extremely simple formatting tools were not going to be accepted in the long run by designers and developers who wanted to create documents on the Web that were as detailed (allowing for screen resolution) as documents on paper. The inherent flexibility of simple standards lacked appeal. Tags whose sole duty was formatting sprawled across the HTML landscape, with <B> and <I> and eventually <FONT> receiving much more use than <EM> or <ADDRESS>. Designers hand-crafted HTML, mixing and matching tags to achieve the precise appearance they wanted without regard to document structure. Because the HTML tags were used only to specify formatting, with no alternative formatting structures, HTML was doomed to life as a formatting language instead of a structured framework for documents. All the problems cataloged earlier began to grow, springing from fundamental flaws in the originally brilliant nature of HTML.
