Tue Mar 14 19:56:00 2006
This is a quickstart guide for processing XML in Perl using SAX, the Simple API for XML. It is targeted at people who already know XSLT or DOM programming and want to take their first steps with SAX using some of the already available modules. This guide assumes knowledge of XML and, of course, Perl.
The SAX specification, currently at version 2, is quite different from other methods of XML processing. While DOM and XSLT are methods that operate on whole documents at a time, SAX is a streaming API. XML tags, attributes, text, etc. trigger events which are sent along a pipe of filters and handlers. Do not confuse this with "XML Stream", which is an entirely different thing.
With SAX, we can easily chain several processors and the XML events travel through each of them. Because processing takes place as the source data is read, memory consumption is usually much lower than with document-based approaches, and it does not grow with document size. (Actually, it may grow - some modules need to store data which they can't process right away, but overall this effect is usually very small compared to the document size.) SAX pipelines are often faster as well, since modules don't need to search for elements they want to process - they simply sit and wait until something interesting comes along, passing through the rest.
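There are several classes of SAX modules: parsers, which read XML and turn it into events; generators, which produce events from some other source; filters, which receive events, change them and pass them on; and handlers, which sit at the end of a pipeline and consume the events, for example by writing XML.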
As you may already have guessed, to enjoy good SAX we need either a parser or a generator, plus a handler. Most of the time we want to do something with our XML, so we need filters as well. There are exceptions, though: by splitting up the pipeline, we can feed multiple handlers, or we can combine data from more than one source, or we might just want to feed XML into a handler without any filter in between.
This is really a lot like playing in a chemistry lab, plugging the various devices together, combining or separating flows, feeding in some goo at the top, collecting distilled alcohol at the bottom. Or something like that ;-)
For SAX processing, we need at least the XML::SAX package installed, and XML::SAX::Writer as well. It is highly recommended to install XML::LibXML, as the parser contained therein is the best choice if we want fast and feature-complete parsing. And of course, we need all the filter modules we will encounter during this tutorial. All these modules are easily installed through the CPAN shell; only XML::LibXML requires you to get libxml2 first. Binary packages of libxml2 are part of almost all free operating system distributions (major Linux distros, the various *BSD systems), and its source code is available at http://www.xmlsoft.org.
For our first example, we will try to write XML from XML input. Not very much of a deal, but a minimal example. Put "Hello World" into your source document and you have the canonical "Hello World"-example :-)
We create two files: an XML file to be processed, and our Perl script.
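A minimal sketch of what the two files might contain. The file name hello.xml and the exact document content are assumptions; so that the listing can be run on its own, the script writes the sample document itself before parsing it, whereas in practice you would create that file by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::SAX::ParserFactory;
use XML::SAX::Writer;

# Create the sample input document (assumed name: hello.xml).
open my $fh, '>', 'hello.xml' or die "cannot write hello.xml: $!";
print {$fh} <<'XML';
<?xml version="1.0"?>
<demo>
 Hello, World!
</demo>
XML
close $fh or die "cannot close hello.xml: $!";

# The handler at the end of the pipeline: no Output argument is given,
# so the writer prints to standard output.
my $writer = XML::SAX::Writer->new;

# The parser at the start of the pipeline, obtained from the factory.
# Handler points to the next (and here, only) step.
my $parser = XML::SAX::ParserFactory->parser(Handler => $writer);

# Parse the file; events flow from the parser straight into the writer.
$parser->parse_uri('hello.xml');
```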
Note the unusual object creation via a factory class. XML::SAX::ParserFactory keeps track of all installed SAX parsers and chooses one among them. There are various ways to influence this decision, but the default rule is to use the last parser that was installed. So usually, we don't need to worry, it will just work out.
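One way to influence the choice is sketched below, assuming the `$XML::SAX::ParserPackage` variable and the `XML::SAX->parsers` method behave as described in the XML::SAX documentation. XML::SAX::PurePerl is used here only because it ships with XML::SAX; any installed SAX2 parser class could be named instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::SAX;
use XML::SAX::ParserFactory;
use XML::SAX::Writer;

# List the SAX parsers that XML::SAX knows about on this system.
my @installed = map { $_->{Name} } @{ XML::SAX->parsers };
print "Installed parser: $_\n" for @installed;

# Ask the factory for one specific parser class instead of the default.
$XML::SAX::ParserPackage = 'XML::SAX::PurePerl';

my $parser = XML::SAX::ParserFactory->parser(
    Handler => XML::SAX::Writer->new,
);
print "Factory returned: ", ref($parser), "\n";
```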
What happens if we run this? The XML is written to the terminal window, since we didn't tell XML::SAX::Writer where to put it. Output looks like this:
<?xml version='1.0'?><demo> Hello, World! </demo>[shell prompt appears here]
Note how the shell prompt appears right after the closing tag, without a newline. This and the missing newline before the root tag are the only hints that some processing took place. But this is what we wanted: we just wanted to keep the XML as it is. Those two missing newlines didn't change the meaning of the XML at all, since they were outside the root tag.
Now let's move on to something useful. We will add a filter to sort a list of records.
Imagine this list of books you are considering buying:
Now, that is a nice list, but it would be much more useful if it were sorted by price, since a book that is more expensive usually contains more information. Having acquired new XML processing skills, let's use a filter to sort the books. Luckily, there is XML::Filter::Sort, which does exactly what we want.
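A sketch of such a script, under several assumptions: the file names books.xml and books-sorted.xml, the element names book, title and price, and the placeholder records are all made up for illustration, and the Record and Keys options are used as I understand them from the XML::Filter::Sort documentation. The script writes the sample list itself so that it can be run on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::SAX::ParserFactory;
use XML::SAX::Writer;
use XML::Filter::Sort;

# Create a placeholder book list (hypothetical data and element names).
open my $fh, '>', 'books.xml' or die "cannot write books.xml: $!";
print {$fh} <<'XML';
<?xml version="1.0"?>
<books>
  <book><title>Placeholder Book A</title><price>30.00</price></book>
  <book><title>Placeholder Book B</title><price>10.00</price></book>
  <book><title>Placeholder Book C</title><price>20.00</price></book>
</books>
XML
close $fh or die "cannot close books.xml: $!";

# Last step: write the resulting XML to a file instead of the terminal.
my $writer = XML::SAX::Writer->new(Output => 'books-sorted.xml');

# Middle step: sort <book> records by their <price> child, numerically,
# in ascending order, and hand the events on to the writer.
my $sorter = XML::Filter::Sort->new(
    Record  => 'book',
    Keys    => [ [ 'price', 'num', 'asc' ] ],
    Handler => $writer,
);

# First step: the parser feeds its events into the sort filter.
my $parser = XML::SAX::ParserFactory->parser(Handler => $sorter);
$parser->parse_uri('books.xml');
```

Changing 'asc' to 'desc' should put the most expensive book first.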
This looks very much like our first example. This time, we tell XML::SAX::Writer to write the output into a file, and we added a filter between parser and handler. Note how each module receives a Handler argument which points to the next step in the pipeline. XML::Filter::Sort takes two more arguments telling it what and how to sort. For details on this, see the man page for it.
Try it! The result should look like this:
Again, we see XML::SAX::Writer's habit of removing the newline before the root tag, and the final newline is missing as well. But all the internal white space stays the same. And, as we wanted, the book records are now sorted by price. The filter did its duty.
Setting up pipelines like this can be very tedious if you plan to do more complex things. There is a cool module to ease the task of setting up the pipeline: XML::SAX::Machines. Using this module, our script looks like this:
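A sketch of the XML::SAX::Machines version, reusing the assumed books.xml layout (hypothetical element names and placeholder records) and recreating the file so the script runs on its own. Passing a filehandle reference such as `\*STDOUT` as the last stage follows the pattern in the XML::SAX::Machines documentation; this sketch therefore prints to standard output, which you can redirect to a file from the shell.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::SAX::Machines qw(Pipeline);
use XML::Filter::Sort;

# Recreate the placeholder book list (hypothetical data).
open my $fh, '>', 'books.xml' or die "cannot write books.xml: $!";
print {$fh} <<'XML';
<?xml version="1.0"?>
<books>
  <book><title>Placeholder Book A</title><price>30.00</price></book>
  <book><title>Placeholder Book B</title><price>10.00</price></book>
  <book><title>Placeholder Book C</title><price>20.00</price></book>
</books>
XML
close $fh or die "cannot close books.xml: $!";

# The stages are listed in processing order. The sort filter needs
# arguments, so we create it ourselves; the parser in front of it and the
# XML::SAX::Writer behind it are set up by the machine.
my $pipeline = Pipeline(
    XML::Filter::Sort->new(
        Record => 'book',
        Keys   => [ [ 'price', 'num', 'asc' ] ],
    ),
    \*STDOUT,
);

$pipeline->parse_uri('books.xml');
```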
This does exactly the same as our last example. Go ahead, test it.
This is much more readable, and easier as well. Loading and creating the parser and the writer is done automatically. If a filter doesn't take extra arguments, loading and creating it can also be done by XML::SAX::Machines for you, so you save a lot of typing. And think about the advantage if you had 5 or 10 filters instead of just one.
XML::SAX::Machines has lots of other cool features to make complex processing dead easy. Maybe there will be a tutorial about it; until then, read its man page, which is well documented.
As the last point, let's take a look at what's available to be used with our new skills.
Browsing CPAN, we see a lot of filter modules. Unfortunately, many of them are not yet SAX2 compatible, so take care and check if the filter you intend to use is actually usable. If it refers to PerlSAX (and not to SAX2), it won't work out-of-the-box. There is a SAX1<->SAX2 translator, but it isn't well tested, so do not expect it to work flawlessly. There are handlers and generators as well. One highlight you should check out is XML::Handler::AxPoint, which creates high quality presentations in PDF format. But as with filters, check the manual pages of generators and handlers to see if they use SAX2 or the older SAX1/PerlSAX.
You will probably want to write your own filters soon. There is an excellent tutorial about this at "/FIXME/TODO/find link" in http:. Moreover, be sure to check out XML::Filter::Dispatcher (currently in beta stage), which works a bit like XSLT/XPath, but as a SAX filter. Using this, you can write your own filters even more easily than with SAX alone. As a side note, if you have already worked with XML Namespaces, you will probably like to hear that SAX2 is fully namespace-aware, so namespaces can be accessed and processed correctly.
And if you are in need of documentation, try the man page of XML::SAX::Machines and the filter modules you want to use. If you want a deeper insight into how it all works, look at XML::SAX and its accompanying modules/docs. Also, be sure to search at http://www.xml.com for articles about SAX2.
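To give a feel for what a hand-written filter involves, here is a minimal sketch built on XML::SAX::Base. The package name MyFilter::UpperCase is hypothetical; the filter overrides only the characters event, upper-cases the text, and lets the base class pass every other event through unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

{
    package MyFilter::UpperCase;    # hypothetical example filter
    use strict;
    use warnings;
    use base 'XML::SAX::Base';

    # Called for every chunk of text between tags.
    sub characters {
        my ($self, $data) = @_;
        # Work on a copy so the upstream event hash is left untouched.
        my %event = %$data;
        $event{Data} = uc $event{Data};
        # Forward the modified event to the next step in the pipeline.
        $self->SUPER::characters(\%event);
    }
}

use XML::SAX::ParserFactory;
use XML::SAX::Writer;

# Collect the output in a scalar so we can inspect it afterwards.
my $output = '';
my $writer = XML::SAX::Writer->new(Output => \$output);
my $filter = MyFilter::UpperCase->new(Handler => $writer);
my $parser = XML::SAX::ParserFactory->parser(Handler => $filter);

$parser->parse_string('<demo>Hello, World!</demo>');
print $output, "\n";
```

Because the filter inherits from XML::SAX::Base, it takes the same Handler argument as the ready-made modules and can be dropped into any of the pipelines shown earlier.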
written by Jörg Walter, mailto:j.walter@creITve.de