XML: A Primer: Let Data Be Data

The HTML Explosion

When the World Wide Web first received widespread attention in 1994, a small army of amateur and professional designers set out to create the most exciting pages they could. Many left quickly, disappointed by the dearth of HTML formatting tools. The strange differences between browsers made it difficult to predict what any page would look like, and corporate users demanded a level of control over their electronic documents that was similar to the control they had over paper documents. For a while, these complaints nearly throttled HTML development. Like many Internet technologies before it, HTML was spread by enthusiasts who were building pages for fun, not for hire. It was simple enough to learn in a day or two, and it offered a whole new reading and authoring experience. The momentum generated by these early enthusiasts and the press coverage they received gave HTML the potential to become the next big thing.

For the Web to become an economically viable marketplace, however, it had to change. Designers and customers wanted to create pages that looked exactly the way they intended, with a level of control comparable to that provided by the average desktop publishing system. The Web still isn’t at that level for the average user, but the tools have become a lot more convenient. Web design has become a specialty of designers and communications specialists the world over, making it possible for companies and individuals to post complex, if not always visually pleasing, sites.

Tables were a huge step forward. They made it possible to create documents that bore some resemblance to the traditional grid systems used in many print designs. Continuing improvements in image map technology made it easy for frustrated designers to create their own point-and-click interfaces when HTML just couldn’t produce what they needed. Frames and pop-up windows let developers focus on elements instead of having to rebuild entire screens of information every time they wanted to change something. The <FONT> tag made it possible to specify text presentation much more precisely than structure-based formatting had allowed. The escalating competition between Microsoft and Netscape added all kinds of tools to the palette as the companies fought for market share and mind share. Netscape created <BLINK>, and Microsoft countered with <MARQUEE>. Both companies created extensions to the elements and the attributes of HTML, frustrating the W3C and confusing developers. Worse still, the two companies did not implement even the shared tags in exactly the same way. Spacing varied, colors could change, and carefully aligned elements would scatter across the page.

At the same time, the number of pages on the Web was exploding. Sites routinely grew to include 10,000 pages or more, organized loosely in hierarchical schemes concocted by developers who knew little about hypertext and less about organization. Many sites were organized according to chaotic directory structures built by developers who were used to the structures of FTP archives and gopher sites, the predecessors of the Web. Large sites presented difficulties to the managers who had to keep up with them and the users who attempted to read them. Navigating hypertext was a strange art form all its own, a blend of organizational skill, memory, good design, and sheer luck. Search engines arrived to help users find their way, but it quickly became clear that librarians could never keep up with the explosive growth of this new medium.

Automated tools appeared as crawlers and robots began searching the enormous swamp of Web documents. Some merely indexed titles, whereas more sophisticated ones began to index the entire contents of a page. AltaVista, a search engine created by Digital to demonstrate and promote its Alpha processor, brought a brute-force approach to the problem, applying multiple processors that shared gigabytes of memory and enormous bandwidth to indexing the Web. Although AltaVista and the other search engines can and do provide a service, they work on the broadest of criteria: the complete contents of a document. We’ve managed to confer a lot of intelligence on search engines, even letting them identify languages and handle word forms, but we’re still a very long way from teaching them to read, categorize, and organize documents without us having to specify which part is what.

This combination of volume and increasing complexity of formatting has led developers to wonder if there might be a better way. HTML has come a long way, quickly, but the limitations of a markup language designed for formatting are beginning to chafe. As the browser wars continue, designers are beginning to demand an alternative to letting the browser determine the presentation of individual tags. The limitations of search engines become more apparent every time the Web doubles in size. Finally, as the Web grows more omnipresent, the limitations of HTML for presenting information that doesn’t easily fit the standard text and graphics model are becoming more pressing. Developers need to be able to create their own tags, and they need to be able to do so in a way that works with other people’s browsers.
