XML::Twig Examples: Extract skeleton from document
I thought the answer to an email I received yesterday could be interesting:
Subject: document skeleton
I would like to extract elements by name but maintain the
context. For example if I have the XML below:
<book>
<title>My Book</title>
<chapter><title>First Title</title><num>1</num><text>chapter 1 text</text></chapter>
<chapter><title>Second Title</title><num>2</num><text>chapter 2 text</text></chapter>
<index><title>Index</title><text>index text.</text></index>
</book>
and I would like to extract all "title" elements ending up with the
following XML:
<book>
<title>My Book</title>
<chapter><title>First Title</title></chapter>
<chapter><title>Second Title</title></chapter>
<index><title>Index</title></index>
</book>
Here is one way to do this with XML::Twig:
You need to take advantage of the fact that twig_handlers are called once an element has been completely parsed, so inner elements are processed before the elements that contain them. As you parse the tree, every time you hit a title you can mark it, and all its ancestors, as "to keep". Any element that is still not marked when its own handler is called (you create a handler on all elements by using '_all_' as the condition) can be safely discarded: if it included a title, it would have been marked when the title was processed.
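To see that ordering in action, here is a minimal sketch that records the name of each element as its handler fires. The @order array and the shortened sample document are only there for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my @order;   # element names, in the order their handlers are called

XML::Twig->new( twig_handlers =>
                  { # '_all_' is triggered for every element, once it is fully parsed
                    _all_ => sub { push @order, $_->tag; },
                  },
              )
         ->parse( '<book><chapter><title>First Title</title><num>1</num></chapter></book>');

print join( ' ', @order), "\n";   # prints: title num chapter book
```

The title and num come first, then the chapter that contains them, and the root last, which is why an ancestor can rely on marks set by its descendants.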
For added efficiency you can flush the tree anytime you hit an element marked as a keeper, or you can just wait until the end of the parsing, at which point your twig will only include the elements you want.
There are several ways to mark the elements as "to keep": you could store them in a hash (rather than a list, as you will want direct access to them), or use "invisible attributes" as I did below. Attributes (and elements) whose name starts with a # are not output by the XML export methods (print, sprint, flush...), so you can test a #keep attribute on the element, but it will be ignored by flush.
#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

XML::Twig->new( twig_handlers =>
                  { # mark titles and their ancestors as keepers
                    # '#att' attributes are not output by flush
                    title => sub { foreach my $keep ( $_, $_->ancestors)
                                     { $keep->set_att( '#keep' => 1); }
                                 },
                    # called for all elements (including titles)
                    _all_ => sub { if( $_->att( '#keep')) { $_->flush;  }
                                   else                   { $_->delete; }
                                 },
                  },
                pretty_print => 'indented',
              )
         ->parsefile( "doc.xml");
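The two alternatives mentioned earlier (waiting until the end of the parsing instead of flushing, and keeping the marks in a hash instead of in attributes) can be combined. What follows is only a sketch under a few assumptions: the %keep hash and the $skeleton variable are names made up for this example, the document is parsed from a string rather than from doc.xml, and elements are identified by their memory address through refaddr from Scalar::Util. That last choice is only safe because nothing is flushed here, so the kept elements stay alive and their addresses cannot be reused by later elements.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use Scalar::Util qw(refaddr);
use XML::Twig;

# the sample document from the email
my $xml = <<'XML';
<book>
<title>My Book</title>
<chapter><title>First Title</title><num>1</num><text>chapter 1 text</text></chapter>
<chapter><title>Second Title</title><num>2</num><text>chapter 2 text</text></chapter>
<index><title>Index</title><text>index text.</text></index>
</book>
XML

my %keep;   # address of each element to keep => 1 (hypothetical name)

my $twig= XML::Twig->new( twig_handlers =>
                            { # record titles and their ancestors in the hash
                              title => sub { foreach my $keep ( $_, $_->ancestors)
                                               { $keep{ refaddr( $keep)} = 1; }
                                           },
                              # no flush: unmarked elements are just removed from the tree
                              _all_ => sub { $_->delete unless $keep{ refaddr( $_)}; },
                            },
                          pretty_print => 'indented',
                        );
$twig->parse( $xml);

# at this point the twig only includes the elements we want
my $skeleton= $twig->sprint;
print $skeleton;
```

The trade-off is memory: the whole skeleton stays in the twig until the end, which is fine for small documents, while the flush version releases each kept element as soon as it has been output.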