

Tools for Programming XML

At this point, the leading contender for XML development appears to be Java. Java has a significant advantage over other languages and development tools for a simple reason: like XML, it was built for Unicode from the ground up. The requirement that parsers handle the full 2-byte Unicode canonical encoding causes problems for C++ (although there are workarounds), Perl (where a key part of the language will need an upgrade), and every other development tool that expects characters to occupy a single byte. As a result, much of the XML development work currently under way is being done in Java.
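To make the Unicode point concrete, here is a minimal sketch of Java decoding 2-byte Unicode text into a string. It assumes a modern JDK (the `StandardCharsets` class is newer than the tools discussed here), and the class and method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    // Decode bytes assumed to hold UTF-16 text. A leading byte order mark,
    // if present, selects the byte order; otherwise big-endian is assumed.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // A small XML fragment containing non-ASCII (Japanese) characters.
        String xml = "<greeting>\u3053\u3093\u306B\u3061\u306F</greeting>";

        // Encode to UTF-16 bytes, then decode them back into a String.
        byte[] encoded = xml.getBytes(StandardCharsets.UTF_16);
        String decoded = decodeUtf16(encoded);

        // A Java char is a 16-bit unit, so no special string type is needed.
        System.out.println("bytes per char: " + Character.BYTES);
        System.out.println("round trip preserved text: " + decoded.equals(xml));
    }
}
```

Because every Java `char` is already 16 bits wide, the same `String` type holds ASCII markup and non-ASCII content alike, which is the advantage described above.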

Java’s structures are also a good match for XML: its object hierarchies map easily onto XML’s nested elements. Java also provides easy interfaces between classes and objects, making it very simple to add a generic parser to a data processing application. Even though Java’s facilities for handling text are not the most advanced, they are more than a match for the level of processing required by XML parsing. Java applets, as we’ll see in the next chapter, also fit well with several of the Web-based possibilities for XML.
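As a rough illustration of plugging a generic parser into an application, the sketch below parses a small document and walks the resulting element hierarchy. It assumes the JAXP and DOM interfaces (`javax.xml.parsers`, `org.w3c.dom`) bundled with current JDKs, which are not necessarily the parsers available when this was written; the class name, method names, and sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ElementTreeDemo {
    // Parse an XML string and return an indented outline of its element names.
    static String outlineOf(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder out = new StringBuilder();
            outline(doc, 0, out);
            return out.toString();
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("could not parse document", e);
        }
    }

    // Recursively visit the tree; each nested element is indented two more spaces.
    static void outline(Node node, int depth, StringBuilder out) {
        int childDepth = depth;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(node.getNodeName()).append('\n');
            childDepth = depth + 1;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            outline(children.item(i), childDepth, out);
        }
    }

    public static void main(String[] args) {
        String xml = "<book><chapter><title>Tools</title></chapter></book>";
        System.out.print(outlineOf(xml));
    }
}
```

The application code never touches the parsing details; it only receives a tree of objects whose nesting mirrors the nesting of the elements.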

C++ is also a viable environment for XML development. Like Java, its object structures can embrace nested element structures quite easily. Even though adding classes to a C++ project is somewhat more complex than it is with Java, there is no lack of powerful C++ tools. C++ is already in use for a wide variety of data processing projects, including markup processing, and libraries are available. Unfortunately, most of the C++ world still expects to see single-byte characters, making it fairly difficult to work with Unicode. Documents encoded in UTF-8 should work well with C++, because UTF-8 stores ASCII characters (including the markup delimiters) as single bytes, and tools for C++ Unicode development are starting to appear.
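The reason UTF-8 suits byte-oriented code can be checked with a few lines. The sketch below is written in Java, assuming a modern JDK, purely to inspect the encoded bytes; the class and helper names are invented for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8Demo {
    // Number of bytes the text occupies in the given encoding.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // Count how many times a given byte value occurs in the data.
    static int countByte(byte[] data, byte target) {
        int count = 0;
        for (byte b : data) {
            if (b == target) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<name>Zo\u00EB</name>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);

        // ASCII markup stays one byte per character in UTF-8;
        // the same text takes two bytes per character in UTF-16.
        System.out.println("UTF-8 bytes:  " + utf8.length);
        System.out.println("UTF-16 bytes: " + byteLength(xml, StandardCharsets.UTF_16BE));

        // Bytes of a multi-byte UTF-8 character are never in the ASCII range,
        // so a byte-wise scan for '<' finds only real markup delimiters.
        System.out.println("'<' bytes found: " + countByte(utf8, (byte) '<'));
    }
}
```

A program that scans for `<` and `>` one byte at a time therefore still finds the markup correctly in a UTF-8 document, even though it cannot interpret the non-ASCII characters between the tags.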

Perl has been the text hacker’s choice for years, helping developers blast through seemingly impossible barriers with a few lines of code. Perl’s rich support for regular expressions has helped thousands of programmers create CGI scripts, writing HTML and interpreting the data sent back by forms. At the same time, Perl has helped developers implement changes across entire sites, addressing challenges like changing all the legal notices on a site overnight with elegance and ease. Perl use is hardly limited to HTML: SGML developers have used Perl to find problems in their documents and fix them as automatically as possible. Unfortunately, Perl has (at present) no support for Unicode. Although workarounds are possible, the regular expression engine that drives much of Perl will need a thorough rebuilding before it is capable of handling Unicode XML.

A Unicode module for Perl is available at http://www.perl.com/CPAN-local/modules/modules/by-module/Unicode/. It doesn’t provide full Unicode support, but it may be a good place to start.

Also, even though Perl is an excellent choice for utility programs, it might not be the best choice for creating reusable validating parsers. Writing one is certainly possible, and someone out there may be able to do it in ten lines of code; it just probably isn’t the best solution for complex projects. For a good exploration of the issues surrounding Perl and XML, read Michael Leventhal’s excellent article entitled “XML: Can the Desperate Perl Hacker Do It?” in XML: Principles, Tools, and Techniques, the Winter 1997 issue of the World Wide Web Journal. Similar problems haunt most of the popular UNIX scripting languages.

Does Unicode matter anyway? The answer is probably “not yet.” Unicode has been slow to take off because of limited application support. Both XML and Java, however, require support at their foundations for processing Unicode characters (though not necessarily for displaying them, which is more a matter of the operating system and the available font sets). Unicode has begun picking up steam, with native support available in both Microsoft’s Windows NT and Sun’s Solaris 2.6 operating systems. Java and XML are two key components for the future of document processing, so expect to see more action in the Unicode field.

