NAME

XML::Twig - A perl module for processing huge XML documents in tree mode.

SYNOPSIS

Note that this documentation is intended as a reference to the module.

Complete docs, including a tutorial, examples, an easier to use HTML version, a quick reference card and a FAQ are available at http://www.xmltwig.com/xmltwig

Small documents (loaded in memory as a tree):

  my $twig=XML::Twig->new();    # create the twig
  $twig->parsefile( 'doc.xml'); # build it
  my_process( $twig);           # use twig methods to process it 
  $twig->print;                 # output the twig

Huge documents (processed in combined stream/tree mode):

  # at most one div will be loaded in memory
  my $twig=XML::Twig->new(   
    twig_handlers => 
      { title   => sub { $_->set_tag( 'h2') }, # change title tags to h2
        para    => sub { $_->set_tag( 'p')  }, # change para to p
        hidden  => sub { $_->delete;       },  # remove hidden elements
        list    => \&my_list_process,          # process list elements
        div     => sub { $_[0]->flush;     },  # output and free memory
      },
    pretty_print => 'indented',                # output will be nicely formatted
    empty_tags   => 'html',                    # outputs <empty_tag />
                         );
  $twig->parsefile( 'doc.xml');                # parse the document, calling the handlers
  $twig->flush;                                # flush the end of the document

See XML::Twig 101 for other ways to use the module, as a filter for example.

DESCRIPTION

This module provides a way to process XML documents. It is built on top of XML::Parser.

The module offers a tree interface to the document, while allowing you to output the parts of it that have been completely processed.

It keeps resource usage (CPU and memory) to a minimum by building the tree only for the parts of the document that need actual processing, through the use of the twig_roots and twig_print_outside_roots options. The finish and finish_print methods also help to improve performance.

XML::Twig tries to make simple things easy, so it does its best to take care of a lot of the (usually) annoying (but sometimes necessary) features that come with XML and XML::Parser.

XML::Twig 101

XML::Twig can be used either on "small" XML documents (that fit in memory) or on huge ones, by processing parts of the document and outputting or discarding them once they are processed.

Loading an XML document and processing it

  my $t= XML::Twig->new();
  $t->parse( '<d><title>title</title><para>p 1</para><para>p 2</para></d>');
  my $root= $t->root;
  $root->set_tag( 'html');              # change doc to html
  my $title= $root->first_child( 'title'); # get the title
  $title->set_tag( 'h1');               # turn it into h1
  my @para= $root->children( 'para');   # get the para children
  foreach my $para (@para)
    { $para->set_tag( 'p'); }           # turn them into p
  $t->print;                            # output the document

Other useful methods include:

att: $elt->att( 'foo') returns the foo attribute of an element,

set_att: $elt->set_att( foo => "bar") sets the foo attribute to the value bar,

next_sibling: $elt->next_sibling returns the next sibling in the document (in the example $title->next_sibling is the first para; you can also, and actually should, use $elt->next_sibling( 'para') to get it).

The document can also be transformed through the use of the cut, copy, paste and move methods, for example: $title->cut; $title->paste( after => $p);

And much, much more, see XML::Twig::Elt.
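As a minimal sketch of the methods listed above, the following script reads and sets attributes, then moves the title after the first para. The nb and status attributes are made up for the illustration, and sprint is used instead of print so that the result ends up in a string.

```perl
use strict;
use warnings;
use XML::Twig;

my $t = XML::Twig->new();
# the nb attribute is hypothetical, it is only there to have something to read
$t->parse( '<d><title nb="1">title</title><para>p 1</para><para>p 2</para></d>');

my $root  = $t->root;
my $title = $root->first_child( 'title');

my $nb = $title->att( 'nb');                     # read an attribute
$title->set_att( status => 'draft');             # add (or change) an attribute

my $first_para = $title->next_sibling( 'para');  # first para after the title

# move the title after the first para
$title->cut;
$title->paste( after => $first_para);

my $result = $t->sprint;                         # the document, as a string
print "$result\n";
```

After the paste, the title sits between the two para elements and carries both attributes.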

Processing an XML document chunk by chunk

One of the strengths of XML::Twig is that it lets you work with files that do not fit in memory (storing an XML document in memory as a tree is quite memory-expensive, the expansion factor often being around 10).

To do this you can define handlers, which will be called once a specific element has been completely parsed. In these handlers you can access the element and process it as you see fit, using the navigation and the cut-n-paste methods, plus lots of convenient ones like prefix. Once the element is completely processed you can then flush it, which will output it and free the memory. You can also purge it if you don't need to output it (if you are just extracting some data from the document, for example). The handler will be called again once the next relevant element has been parsed.

  my $t= XML::Twig->new( twig_handlers => 
                          { section => \&section,
                            para    => sub { $_->set_tag( 'p'); },
                          },
                       );
  $t->parsefile( 'doc.xml');
  $t->flush; # don't forget to flush one last time in the end or anything
             # after the last </section> tag will not be output 
    
  # the handler is called once a section is completely parsed, ie when 
  # the end tag for section is found, it receives the twig itself and
  # the element (including all its sub-elements) as arguments
  sub section 
    { my( $t, $section)= @_;      # arguments for all twig_handlers
      $section->set_tag( 'div');  # change the tag name, my favourite method...
      # let's use the attribute nb as a prefix to the title
      my $title= $section->first_child( 'title'); # find the title
      my $nb= $title->att( 'nb'); # get the attribute
      $title->prefix( "$nb - ");  # easy isn't it?
      $section->flush;            # outputs the section and frees memory
    }
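When the document only needs to be read, not output, purge can replace flush. The sketch below is one way to do it, under the assumption that all you want is the list of section titles; the sample document and the @titles array are made up for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my @titles;   # data extracted from the document

my $t = XML::Twig->new(
    twig_handlers => {
        section => sub {
            my( $t, $section) = @_;
            my $title = $section->first_child( 'title');
            push @titles, $title->text if $title;
            $t->purge;   # free what has been parsed so far, without any output
        },
    },
);

# a small inline document stands in for a huge file here
$t->parse( '<doc><section><title>One</title><para>a</para></section>'
         . '<section><title>Two</title><para>b</para></section></doc>');

print join( ', ', @titles), "\n";   # prints "One, Two"
```

The data has to be copied out of the element before the purge, as the element is no longer available afterwards.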

        

There is of course more to it: you can trigger handlers on more elaborate conditions than just the name of the element, section/title for example.

  my $t= XML::Twig->new( twig_handlers => 
                           { 'section/title' => sub { $_->print } }
                       )
                  ->parsefile( 'doc.xml');

Here sub { $_->print } simply prints the current element ($_ is aliased to the element in the handler).

You can also trigger a handler on a test on an attribute:

  my $t= XML::Twig->new( twig_handlers => 
                           { 'section[@level="1"]' => sub { $_->print } }
                       )
                  ->parsefile( 'doc.xml');

You can also use start_tag_handlers to process an element as soon as the start tag is found. Besides prefix you can also use suffix, which adds text at the end of the element.
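The following sketch combines the two: a start tag handler records the id of each section as soon as its start tag is seen (at that point the attributes are available but the content is not), while a regular handler wraps the title text using prefix and suffix. The id attribute, the brackets and the @seen array are just assumptions made for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my @seen;   # ids of the sections, recorded when their start tag is parsed

my $t = XML::Twig->new(
    start_tag_handlers => {
        # called on <section ...>, before the content of the section is parsed
        section => sub { my( $t, $elt) = @_; push @seen, $elt->att( 'id'); },
    },
    twig_handlers => {
        # called on </title>, once the title is complete
        title => sub { $_->prefix( '['); $_->suffix( ']'); },
    },
);
$t->parse( '<doc><section id="s1"><title>Intro</title></section></doc>');

my $title_text = $t->root->first_child( 'section')->first_child( 'title')->text;
print "$title_text\n";   # prints "[Intro]"
```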

Processing just parts of an XML document

The twig_roots mode builds only the required sub-trees from the document. Anything outside of the twig roots will just be ignored:

  my $t= XML::Twig->new( 
       # the twig will include just the root and selected titles 
           twig_roots   => { 'section/title' => \&print_n_purge,
                             'annex/title'   => \&print_n_purge
           }
                      );
  $t->parsefile( 'doc.xml');
  
  sub print_n_purge 
    { my( $t, $elt)= @_;
      print $elt->text;    # print the text (including sub-element texts)
      $t->purge;           # frees the memory
    }

You can use that mode when you want to process parts of a document but are not interested in the rest, and you don't want to pay the price, either in time or memory, of building the tree for it.

Building an XML filter

You can combine the twig_roots and the twig_print_outside_roots options to build filters, which let you modify selected elements and will output the rest of the document as is.

This would convert prices in $ to prices in Euro in a document:

  my $t= XML::Twig->new( 
           twig_roots   => { 'price' => \&convert, },   # process prices 
           twig_print_outside_roots => 1,               # print the rest
                      );
  $t->parsefile( 'doc.xml');
 
  sub convert 
    { my( $t, $price)= @_;
      my $currency=  $price->att( 'currency');          # get the currency
      if( $currency eq 'USD')
        { my $usd_price= $price->text;                  # get the price
          # %rate is just a conversion table 
          my $euro_price= $usd_price * $rate{usd2euro};
          $price->set_text( $euro_price);               # set the new price
          $price->set_att( currency => 'EUR');          # don't forget this!
        }
      $price->print;                                    # output the price
    }

XML::Twig and various versions of Perl, XML::Parser and expat:

Before being uploaded to CPAN, XML::Twig 3.22 has been tested under the following environments:

XML::Twig is a lot more sensitive to variations in versions of perl, XML::Parser and expat than to the OS, so this should cover some reasonable configurations.

The "recommended configuration" is perl 5.8.3+ (for good Unicode support), XML::Parser 2.31+ and expat 1.95.5+

See http://testers.cpan.org/search?request=dist&dist=XML-Twig for the CPAN testers reports on XML::Twig, which list all tested configurations.

An Atom feed of the CPAN Testers results is available at http://xmltwig.com/rss/twig_testers.rss

Finally:

When in doubt, upgrade expat, XML::Parser and Scalar::Util

For some optional features, XML::Twig also depends on some additional modules. The complete list, which depends somewhat on the version of Perl that you are running, is given by running t/zz_dump_config.t

Simplifying XML processing

CLASSES

XML::Twig uses a very limited number of classes. The one you are most likely to use is XML::Twig, of course, which represents a complete XML document, including the document itself (the root of the document is available through root), its handlers, its input or output filters... The other main class is XML::Twig::Elt, which models an XML element. Element here has a very wide definition: it can be a regular element, but also text, with an element tag of #PCDATA (or #CDATA), an entity (tag is #ENT), a Processing Instruction (#PI) or a comment (#COMMENT).

Those are the 2 commonly used classes.
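To see the wide definition of element at work, here is a small sketch that lists the tags of the children of a mixed-content element. The sample document is made up; the point is only that the text shows up as a child in its own right, with #PCDATA as its tag.

```perl
use strict;
use warnings;
use XML::Twig;

my $t = XML::Twig->new();
$t->parse( '<p>hello <b>world</b></p>');

# the children of p are a text element and a regular element
my @tags = map { $_->tag } $t->root->children;
print "@tags\n";   # prints "#PCDATA b"

# is_text tells the text elements apart from the regular ones
my @text_children = grep { $_->is_text } $t->root->children;
print scalar( @text_children), " text child\n";
```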

You might want to look at the elt_class option if you want to subclass XML::Twig::Elt.

Attributes are just attached to their parent element, they are not objects per se. (Please use the provided methods att and set_att to access them: if you access them as a hash, your code becomes implementation dependent and might break in the future.)

Other classes that are seldom used are XML::Twig::Entity_list and XML::Twig::Entity.

If you use XML::Twig::XPath instead of XML::Twig, elements are then created as XML::Twig::XPath::Elt.

METHODS

XML::Twig

A twig is a subclass of XML::Parser, so all XML::Parser methods can be called on a twig object, including parse and parsefile. setHandlers, on the other hand, cannot be used, see BUGS.