Tue Mar 14 19:56:00 2006

Perl SAX Quickstart

This is a quickstart guide for processing XML in Perl using SAX, the Simple API for XML. It is targeted at people who already know XSLT or DOM programming and want to take their first steps with SAX using some of the modules that are already available. This guide assumes knowledge of XML and, of course, Perl.

Simple API for XML

The SAX specification, currently at version 2, is quite different from other methods of XML processing. While DOM and XSLT are methods that operate on whole documents at a time, SAX is a streaming API. XML tags, attributes, text, etc. trigger events which are sent along a pipe of filters and handlers. Do not confuse this with "XML Stream", which is an entirely different thing.

With SAX, we can easily chain several processors and the XML events travel through each of them. Because processing takes place as the source data is read, memory consumption is usually much lower than with document-based approaches, and it does not grow with document size. (Actually, it may grow - some modules need to store data which they can't process right away, but overall this effect is usually very small compared to the document size.) SAX pipelines are often faster as well, since modules don't need to search for elements they want to process - they simply sit and wait until something interesting comes along, passing through the rest.
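
To make the idea of events more concrete, here is a minimal sketch of a handler that does nothing but record the events it receives. The package name EventLogger and its Log field are invented for this illustration; XML::SAX::Base and XML::SAX::ParserFactory are part of the XML::SAX distribution, and the parser and handler roles are explained in the following sections.

```perl
use strict; # always use strict

# A hypothetical handler (name invented for this sketch) that records
# the SAX events it receives instead of producing XML.
package EventLogger;
use base 'XML::SAX::Base';

sub start_element {
    my ($self, $element) = @_;
    push @{ $self->{Log} }, "start_element $element->{Name}";
}

sub end_element {
    my ($self, $element) = @_;
    push @{ $self->{Log} }, "end_element $element->{Name}";
}

sub characters {
    my ($self, $text) = @_;
    # ignore whitespace-only chunks to keep the log short
    push @{ $self->{Log} }, "characters '$text->{Data}'"
        if $text->{Data} =~ /\S/;
}

package main;
use XML::SAX::ParserFactory;

# the parser sends its events to the logger
my $logger = EventLogger->new(Log => []);
my $parser = XML::SAX::ParserFactory->parser(Handler => $logger);
$parser->parse_string('<demo><greeting>Hello</greeting></demo>');

my @events = @{ $logger->{Log} };
print "$_\n" for @events;
```

Each tag and each piece of text arrives as a separate method call, in document order. A parser is free to deliver one stretch of text in several characters events, so code should not rely on getting it in one piece.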

The building blocks

There are several classes of SAX modules:

  • XML Parsers

    The parsers are responsible for getting XML out of text. Input is obtained via plain strings or from file handles. There are several parser modules, based on the various XML libraries available. Among others, we can use libxml, expat or a parser written in pure Perl.

  • Generators (or Drivers)

    These are similar to parsers, but they don't parse straight XML but some other data format, generating XML on-the-fly. Generators give you the power to process non-XML data with the same tools as XML. For example, there are generators to create SAX events from DBI queries or from Excel files.

  • Filters

    Filters do the main work. They are responsible for transforming the XML in the way you want. Filters can be chained, and there are special filters to split an event stream into two separate pipelines, and to merge them again. This way, processing can have arbitrarily complex structures. Examples of filters are an XInclude processor or a sorting filter.

  • Handlers

    Handlers are the end points of XML processing. This is usually just XML::SAX::Writer, which writes the XML into a file or a string. But there are other handlers, for example the fascinating XML::Handler::AxPoint handler which creates professional quality PDF presentations rivalling PowerPoint slides.
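
The generator role can be sketched in a few lines: anything that calls the SAX event methods on a handler in the right order acts as a driver. The following is an illustration only, not how any particular CPAN generator works. The sample data and variable names are invented, and the sketch assumes that XML::SAX::Writer accepts a reference to a scalar as its Output.

```perl
use strict; # always use strict
use XML::SAX::Writer;

# invented sample data standing in for a non-XML source
my @names = ('Writing Apache Modules with Perl and C', 'The AxKit Book');

# the handler at the end of our one-step pipeline
my $xml = '';
my $handler = XML::SAX::Writer->new(Output => \$xml);

# act as a generator: send the events ourselves, in document order
$handler->start_document({});
$handler->start_element({ Name => 'books', Attributes => {} });
for my $name (@names) {
    $handler->start_element({ Name => 'name', Attributes => {} });
    $handler->characters({ Data => $name });
    $handler->end_element({ Name => 'name' });
}
$handler->end_element({ Name => 'books' });
$handler->end_document({});

print "$xml\n";
```

Because the handler only sees events, it cannot tell whether they came from a parser or from a loop like this one. That is what lets non-XML data flow through the same filters as real XML.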

How things fit together

You may already have guessed it: to enjoy good SAX we need either a parser or a generator, and a handler. Most of the time we want to do something with our XML, so we need filters as well. There are exceptions, though: by splitting up the pipeline, we can feed multiple handlers, or we can combine data from more than one source, or we might just want to feed XML into a handler without any filter in between.

This is really a lot like playing in a chemistry lab, plugging the various devices together, combining or separating flows, feeding in some goo at the top, collecting distilled alcohol at the bottom. Or something like that ;-)

Prerequisites

For SAX processing, we need at least the XML::SAX package installed, and XML::SAX::Writer as well. It is highly recommended to install XML::LibXML, as the parser contained therein is the best choice if we want fast and feature-complete parsing. And of course, we need all the filter modules we will encounter during this tutorial. All these modules are easily installed through the CPAN shell; only for XML::LibXML do you need to get libxml2 first. Binary packages of libxml2 are part of almost all free operating system distributions (major Linux distros, the various *BSD systems), and its source code is available at http://www.xmlsoft.org.

Our first experiment

For our first example, we will try to write XML from XML input. Not very much of a deal, but a minimal example. Put "Hello World" into your source document and you have the canonical "Hello World"-example :-)

We create two files, an XML file to be processed, and our Perl script.

  • demo.xml
    <?xml version="1.0"?>
    <demo>
      Hello, World!
    </demo>
  • demo.pl
    # load modules
    use strict; # always use strict
    use XML::SAX::ParserFactory;
    use XML::SAX::Writer;
    
    # create the pipeline, back-to-front
    my $handler = XML::SAX::Writer->new();
    my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);
    
    # now process the file
    $parser->parse_uri("demo.xml");

Note the unusual object creation via a factory class. XML::SAX::ParserFactory keeps track of all installed SAX parsers and chooses one among them. There are various ways to influence this decision, but the default rule is to use the last parser that was installed. So usually, we don't need to worry, it will just work out.
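
One of the ways to influence the decision is the package variable $XML::SAX::ParserPackage, which the factory consults before falling back to its default. The sketch below asks for XML::SAX::PurePerl, the pure Perl parser that ships with XML::SAX. Treat it as an illustration and check the XML::SAX::ParserFactory man page for the other mechanisms.

```perl
use strict; # always use strict
use XML::SAX;
use XML::SAX::ParserFactory;
use XML::SAX::Writer;

# list the parsers XML::SAX knows about
print "installed: $_->{Name}\n" for @{ XML::SAX->parsers };

my $handler = XML::SAX::Writer->new();
my $parser;
{
    # request the pure Perl parser, for this block only
    local $XML::SAX::ParserPackage = 'XML::SAX::PurePerl';
    $parser = XML::SAX::ParserFactory->parser(Handler => $handler);
}

# show which class the factory actually gave us
print "using: ", ref($parser), "\n";
```

Using local keeps the override confined to the block, so other code that calls the factory still gets the default parser.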

What happens if we run this? The XML is written to the terminal window, since we didn't tell XML::SAX::Writer where to put it. Output looks like this:

<?xml version='1.0'?><demo>
  Hello, World!
</demo>[shell prompt appears here]

Note how the shell prompt appears right after the closing tag, without a newline. This and the missing newline before the root tag are the only hints that some processing took place. But this is what we wanted: to keep the XML as it is. Those two missing newlines didn't change the meaning of the XML at all, since they were outside the root tag.
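
If the missing final newline is a nuisance, one option is to collect the output in a string and print it ourselves. This sketch assumes that the Output option of XML::SAX::Writer accepts a reference to a scalar. The variable names are arbitrary, and the document is passed in as a string so that the script does not depend on demo.xml.

```perl
use strict; # always use strict
use XML::SAX::ParserFactory;
use XML::SAX::Writer;

# collect the output in a string instead of printing it right away
my $output = '';
my $handler = XML::SAX::Writer->new(Output => \$output);
my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);

# same content as demo.xml, passed in as a string
$parser->parse_string(
    "<?xml version=\"1.0\"?>\n<demo>\n  Hello, World!\n</demo>\n"
);

# add the final newline ourselves
print $output, "\n";
```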

Doing something useful

Now let's move on to something useful. We will add a filter to sort a list of records.

Imagine this list of books you are considering buying:

  • books.xml
    <?xml version="1.0"?>
    <books>
      <book>
        <isbn>0672322404</isbn>
        <name>mod_perl Developer's Cookbook</name>
        <url>http://www.modperlcookbook.org/</url>
        <price>48</price>
      </book>
      <book>
        <isbn>not yet published</isbn>
        <name>The AxKit Book</name>
        <url>unknown</url>
        <price>0.0</price>
      </book>
      <book>
        <name>Writing Apache Modules with Perl and C</name>
        <isbn>1-56592-567-X</isbn>
        <price>47</price>
        <url>http://www.modperl.com</url>
      </book>
    </books>

Now, that is a nice list, but it would be much more useful if it were sorted by price, since a book that is more expensive usually contains more information. Having acquired new XML processing skills, let's use a filter to sort the books. Luckily, there is XML::Filter::Sort, which does exactly what we want.

  • booksort.pl
    # load modules
    use strict; # always use strict
    use XML::SAX::ParserFactory;
    use XML::SAX::Writer;
    use XML::Filter::Sort;
    
    # create the pipeline, back-to-front
    my $handler = XML::SAX::Writer->new(Output => "sorted_books.xml");
    my $sorter = XML::Filter::Sort->new(
        Record => 'book',
        Keys => 'price',
        Handler => $handler,
    );
    my $parser = XML::SAX::ParserFactory->parser(Handler => $sorter);
    
    # now process the file
    $parser->parse_uri("books.xml");

This looks very much like our first example. This time, we tell XML::SAX::Writer to write the output into a file, and we added a filter between parser and handler. Note how each module receives a Handler argument which points to the next step in the pipeline. XML::Filter::Sort takes two more arguments telling it what and how to sort. For details on this, see the man page for it.
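
One detail worth checking in that man page: if I read it correctly, keys are compared as text unless told otherwise, so prices such as 9 and 100 would not come out in numeric order. The sketch below assumes the documented array form of Keys, where each key is given as path, comparison and order, and it uses invented sample data passed in as a string. Verify the exact option values against the XML::Filter::Sort documentation for your installed version.

```perl
use strict; # always use strict
use XML::SAX::ParserFactory;
use XML::SAX::Writer;
use XML::Filter::Sort;

# invented sample data; note that "100" sorts before "9" as text
my $books = '<books>'
          . '<book><name>B</name><price>100</price></book>'
          . '<book><name>A</name><price>9</price></book>'
          . '<book><name>C</name><price>48</price></book>'
          . '</books>';

# create the pipeline, back-to-front
my $sorted = '';
my $handler = XML::SAX::Writer->new(Output => \$sorted);
my $sorter = XML::Filter::Sort->new(
    Record  => 'book',
    # one key: path, comparison type, sort order (assumed values)
    Keys    => [ [ 'price', 'num', 'asc' ] ],
    Handler => $handler,
);
my $parser = XML::SAX::ParserFactory->parser(Handler => $sorter);

$parser->parse_string($books);
print "$sorted\n";
```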

Try it! The result should look like this:

  • sorted_books.xml
    <?xml version='1.0'?><books>
      <book>
        <isbn>not yet published</isbn>
        <name>The AxKit Book</name>
        <url>unknown</url>
        <price>0.0</price>
      </book>
      <book>
        <name>Writing Apache Modules with Perl and C</name>
        <isbn>1-56592-567-X</isbn>
        <price>47</price>
        <url>http://www.modperl.com</url>
      </book>
      <book>
        <isbn>0672322404</isbn>
        <name>mod_perl Developer&apos;s Cookbook</name>
        <url>http://www.modperlcookbook.org/</url>
        <price>48</price>
      </book>
    </books>

Again, we see XML::SAX::Writer's habit of removing the newline before the root tag, and the final newline is missing as well. But all the internal white space stays the same. And this is what we wanted: the book records are now sorted by price. The filter did its duty.

How to make it better

Setting up pipelines like this can be very tedious if you plan to do more complex things. There is a cool module to ease the task of setting up the pipeline: XML::SAX::Machines. Using this module, our script looks like this:

  • booksort_machines.pl
    # load modules
    use strict; # always use strict
    use XML::SAX::Machines qw(:all);
    use XML::Filter::Sort;
    
    # create the pipeline, the easy way
    my $sorter = XML::Filter::Sort->new(
        Record => 'book',
        Keys => 'price',
    );
    my $pipeline = Pipeline($sorter, ">sorted_books.xml");
    
    # now process the file
    $pipeline->parse_uri("books.xml");

This does exactly the same as our last example. Go ahead, test it.

This is much more readable, and easier as well. Loading and creating the parser and the writer is done automatically. If a filter doesn't take extra arguments, loading and creating it can also be done by XML::SAX::Machines for you, so you save a lot of typing. And think about the advantage if you had 5 or 10 filters instead of just one.

XML::SAX::Machines has lots of other cool features to make complex processing dead easy. Maybe there will be a tutorial about it; until then, read its man page, which is well documented.

What can I do?

As the last point, let's take a look at what's available to be used with our new skills.

Browsing CPAN, we see a lot of filter modules. Unfortunately, many of them are not yet SAX2 compatible, so take care and check if the filter you intend to use is actually usable. If it refers to PerlSAX (and not to SAX2), it won't work out-of-the-box. There is a SAX1<->SAX2 translator, but it isn't well tested, so do not expect it to work flawlessly. There are handlers and generators as well. One highlight you should check out is XML::Handler::AxPoint, which creates high quality presentations in PDF format. But as with filters, check the manual pages of generators and handlers to see if they use SAX2 or the older SAX1/PerlSAX.

You will probably want to write your own filters soon. There is an excellent tutorial about this at "/FIXME/TODO/find link" in http:. Moreover, be sure to check out XML::Filter::Dispatcher (currently in beta stage), which works a bit like XSLT/XPath, but as a SAX filter. Using this, you can write your own filters even more easily than with SAX alone. As a side note, if you have already worked with XML Namespaces, you will probably like to hear that SAX2 is fully namespace-aware, so namespaces can be accessed and processed correctly.
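
To give a first impression of what a hand-written filter looks like, here is a minimal sketch built on XML::SAX::Base, the base class that comes with XML::SAX. The package name DropElement and its Drop option are invented for this example. The filter removes every element with the given name, together with its content, and passes everything else on to the next step.

```perl
use strict; # always use strict

# A hypothetical filter (name and Drop option invented for this sketch)
# that removes all elements with a given name, including their content.
package DropElement;
use base 'XML::SAX::Base';

sub start_element {
    my ($self, $element) = @_;
    # start skipping at a matching element, keep counting while inside one
    $self->{Skipping}++
        if $self->{Skipping} || $element->{Name} eq $self->{Drop};
    $self->SUPER::start_element($element) unless $self->{Skipping};
}

sub end_element {
    my ($self, $element) = @_;
    if ($self->{Skipping}) {
        $self->{Skipping}--;
        return;
    }
    $self->SUPER::end_element($element);
}

sub characters {
    my ($self, $text) = @_;
    $self->SUPER::characters($text) unless $self->{Skipping};
}

package main;
use XML::SAX::ParserFactory;
use XML::SAX::Writer;

# create the pipeline, back-to-front
my $result = '';
my $writer = XML::SAX::Writer->new(Output => \$result);
my $filter = DropElement->new(Drop => 'url', Handler => $writer);
my $parser = XML::SAX::ParserFactory->parser(Handler => $filter);

$parser->parse_string(
    '<book><name>The AxKit Book</name><url>unknown</url></book>'
);
print "$result\n";
```

Calling the SUPER method is what forwards an event to the next handler; leaving the call out drops the event. The element data also carries LocalName and NamespaceURI next to Name, so a namespace-aware variant could compare those instead.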

And if you are in need of documentation, try the man page of XML::SAX::Machines and the filter modules you want to use. If you want a deeper insight into how it all works, look at XML::SAX and its accompanying modules/docs. Also, be sure to search at http://www.xml.com for articles about SAX2.


written by Jörg Walter, mailto:j.walter@creITve.de

