<?xml version="1.0"?>
<!-- <!DOCTYPE tutorial SYSTEM "tutorial.dtd"> -->
<tutorial>

<h1>Processing XML efficiently with Perl and XML::Twig</h1>
<author>Michel Rodriguez &lt;mirod@xmltwig.com&gt;</author>
<date>2003-03-31</date>

<section>

<h2>Introduction</h2>

<p>XML::Twig is a Perl module used to process XML documents efficiently.
</p>

<p>Twig offers a tree-oriented interface to a document while still allowing the processing of documents of any size. I think the current buzzword for it would be something like "push-pull" processing ;--)</p>

<p>When I was younger I wanted to grow up and write a tool that would allow people to process text the way they wanted, offering tons of
features, various ways to achieve the same result, not forcing them into any processing model but allowing them to use the one they felt the
most comfortable with. Eventually I grew up and I realized a guy named Larry Wall had already written a language named Perl... Darn! So
as I was quite involved in dealing with SGML, then XML documents, I decided to settle for the next best thing: writing a module that would
allow people to process XML the way they wanted, offering them tons of features, various ways... you get the point.</p>

<p>So I wrote XML::Twig. XML::Twig gives you a tree interface to XML documents... if you want. It also lets you dump parts of the tree, set
callbacks during processing, both on tags and on subtrees, process only part of the tree... you name it. The only thing XML::Twig does not do is follow standards (except XML of course). Consider yourself warned!</p>

<p>This talk is aimed at programmers who want to process XML data with the
XML::Twig module.</p>

<p>It will go from the basic functionalities of the module to its most
advanced uses, offering numerous code examples, from HTML conversion to
database integration.</p>

<p>XML::Twig is a Perl module offering a push-pull processing model of XML
data. In other words it lets you build a tree from an XML document, while
letting you output the results of your processing as the tree is built. But more on
that later...</p>

<p>This tutorial is available in XML (<a
href="yapc_xmltwig.xml">yapc_xmltwig.xml</a>), converted to html using the <a
href="talk2html">talk2html</a> script (which uses XML::Twig).</p>

<p>The latest version of the XML::Twig tutorial can be found on the
	<a href="http://xmltwig.com/xmltwig/">XML::Twig page</a>.</p>

<h4>Knowledge</h4>

<p>Prior knowledge of Perl, especially its object-oriented aspects and regular
expressions, will probably help the reader. Familiarity with the DBI module
wouldn't hurt either, but the examples are simple and detailed enough to offer
a first introduction to database processing using Perl.</p>

<p>Very little prior knowledge of XML is assumed, although a selection of
related links is offered and would be of interest to the complete beginner.</p>

<h3>Alternatives to XML::Twig</h3>

<p>Of course other ways of processing XML documents exist, both using Perl and
other languages, especially Java and Python.</p>

<p>You can find information on Perl modules on the <a href="http://perl-xml.sourceforge.net/">Perl-XML FAQ</a>, for a list of Python XML
resources see <a href="http://www.python.org/topics/xml/">Python and XML Processing</a> and for a list of Java XML resources see <a
href="http://java.sun.com/xml/">Java (TM) Technology and XML</a>.</p>
</section>

<section>

<h2>Introduction to XML</h2>

<h3>What is XML</h3>

<p>XML could be described as "HTML on steroids". Or conversely as "SGML on Prozac".</p>
<p>XML is a markup language, just like HTML, using the same basic syntax: 
pointy brackets, attributes... It is just slightly more dictatorial than HTML: tags 
MUST be closed, attributes MUST be enclosed in quotes, either single or double.
</p>

<p>In fact it is little more than comma-separated files, except that fields are somewhat documented (by the element name and by attributes)
and that they can be nested, thus defining a tree structure instead of a table.
</p>
<p>What XML brings is syntactic consistency, allowing the same tools to be used to
process all XML files, and a host of associated standards to do formatting, 
transformation, linking...</p>

<p>The complexity of XML stems from 2 main facts:</p>
<ul><li>in order to "unleash the power of XML" you have to design the "right" 
XML for your system, through DTDs (and soon schemas),</li>
    <li>the associated standards, such as CSS, XSL, DOM, XSLT, XPath, XLink, 
        XInclude: you often need them to do anything useful with XML, but their 
        sheer number is quite overwhelming.</li>
</ul>

<h3>XML example</h3>
<p>A simple example would be: <example desc="A simple XML document">simple_doc.xml</example>.</p>

<h3>Resources</h3>

<p>The best resource on XML, and SGML by the way, is certainly Robin Cover's <a href="http://www.oasis-open.org/cover/sgml-xml.html">SGML/XML Web Page</a>, which links to everything else anyway. <a
		href="http://xml.com/">XML.com</a> and <a
		href="http://xmlhack.com/">xmlhack</a> are 2 good sites respectively for detailed
articles on XML and for the latest news on the topic.</p> 

<h3>XML used in this tutorial</h3>

<p>Just a word on the XML I use in this tutorial.</p>

<p>XML is usually used for 2 purposes these days: either purely to store data,
to be exchanged between 2 pieces of software, or to store documents, possibly
including data, that are destined to be printed or displayed on the web.</p>

<h4>Data oriented XML</h4>

<p>Data-oriented XML should be tagged according to a DTD that faithfully
represents the data. We will see examples of that in the section about data
base integration.</p>

<h4>Document oriented XML</h4>

<p>For document-oriented XML, after using SGML then XML for nearly 8 years, in
all sorts of flavors and according to all sorts of DTDs, I have become a firm
believer in what I'd call "HTML++". By this I mean that as much as possible of
the HTML DTD should be used for text. There is really no need to redefine
paragraphs, lists, code, headers etc... Structuring elements can be added, such
as sections, possibly typed ones: that's one +. Specific inline elements, for
domain-relevant data, such as part numbers and prices in a catalog, standard
references in a standard, etc... constitute the second +. Links can either use
the familiar &lt;a> tag or use different tags, possibly typed.</p>

<p><a
		href="http://www.xmlnews.org/">XMLnews</a> is a good example of such a DTD.</p>

<p>Starting from the <a
href="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">XHTML DTD</a>
and adding the extra elements is definitely the easiest way to create that kind
of DTD.</p>

<p>Although I did not use a DTD for this tutorial it would look like:</p>
<pre><![CDATA[  <!ELEMENT tutorial (h1, section+)>
  <!ELEMENT section (html_stuff)>
  <!-- html_stuff is just the usual html content, plus a couple of elements: -->
  <!-- a link to a resource, so they can be gathered -->
  <!ELEMENT resource EMPTY>
  <!ATTLIST resource refid IDREF #REQUIRED>
  <!-- a method from XML::Twig, so it can be linked to the doc -->
  <!ELEMENT method (#PCDATA)>
  <!ATTLIST method class CDATA #REQUIRED>
  <!-- a code example, contains the file name -->
  <!ELEMENT example (#PCDATA)>
  <!ATTLIST example desc CDATA #REQUIRED>
  ]]></pre>

</section>

<section>

<h2>Introduction to XML::Twig</h2>

<h3>XML::Parser</h3>

<p>XML::Parser, first developed by Larry Wall and now
supported by Clark Cooper, is the basis of most other XML modules. It includes
a non-validating parser, Expat, written by James Clark, who amongst other feats
also wrote the nsgmls parser for SGML.</p>

<p>XML::Parser allows calling software to set handlers on parsing events. Those
events include start tags (and XML::Parser gives the name of the tag and the
attributes), end tags, text, processing instructions etc... </p>
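<p>To make the event model concrete, here is a minimal sketch of XML::Parser
handlers. The inline document and the <tt>@events</tt> log are assumptions
made for illustration; only the <tt>Handlers</tt> option and the
<tt>Start</tt>, <tt>End</tt> and <tt>Char</tt> events come from XML::Parser
itself.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Parser;

# hypothetical sample document
my $xml = '<stats><player id="p1"><name>Houston, Allan</name></player></stats>';

my @events;   # log of the parsing events, in the order Expat reports them

my $parser = XML::Parser->new(
    Handlers => {
        # start tag: the Expat object, the tag name, then the attributes
        Start => sub { my( $expat, $tag, %atts)= @_;
                       push @events, "start $tag"
                          . join( '', map { " $_=$atts{$_}" } sort keys %atts);
                     },
        # end tag: the Expat object and the tag name
        End   => sub { my( $expat, $tag)= @_;  push @events, "end $tag"; },
        # text: a single piece of text can be delivered in several calls
        Char  => sub { my( $expat, $text)= @_; push @events, "text $text"; },
    },
);

$parser->parse( $xml);
print "$_\n" for @events;
]]></pre>

<p>Everything else, such as remembering which element is open or building a
tree, is left to the calling code. This is the work that XML::Twig takes
over.</p>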

<h3>XML::Twig</h3>

<p>XML::Twig is a sub-class of XML::Parser that allows higher level processing
of XML.  XML::Twig offers a tree interface to a document, both once the
document has been completely parsed and during the parsing by allowing handlers
to be defined on elements. Additional methods help managing the resources
needed by XML::Twig.</p>

<p>A whole bunch of methods can be used on elements in the twig, to navigate it,
transform it, create new elements...</p> 

<h3>Why use XML::Twig</h3>
<p>XML::Twig is only one of the dozen or so Perl modules that process XML.
Other popular ones are XML::DOM, XML::Simple, XML::PYX, XML::Grove or just
plain vanilla XML::Parser.</p>

<p>So why would you use XML::Twig?</p>

<ul><li>you need to process huge documents efficiently,</li>
    <li>PYX is not quite powerful enough,</li>
    <li>the XML data is too complex for XML::Simple to handle,</li>
    <li>the processing is hard to write in XML::Parser,</li>
    <li>the document is too big to load conveniently in XML::DOM,</li>
    <li>XSLT is a pain to write.</li>
</ul>
<p>XML::Twig uses a tree-based processing model, lets you control how much of
the tree is loaded in memory at once, and is very perlish, up to
TIMTOWTDI and DWIM.</p>


</section>

<section>

<h2>First Examples</h2>

<h3>Full-tree mode</h3>

<h4>Creating and navigating the twig</h4>

<p>Now let's see our first code example. The purpose of this one is to reorder
a list of elements on the value of an attribute.</p>

<p>The DTD is quite simple: <example desc="The stats DTD">stats.dtd</example></p>
<p>And the data is: </p>
<table bgcolor="#00FFFF" border="1" width="100%"><tr><td><pre><![CDATA[
<?xml version="1.0"?>
<!DOCTYPE stats SYSTEM "stats.dtd">
<stats><player><name>Houston, Allan</name><g>69</g><ppg>20.1</ppg><rpg>3.4</rpg><apg>2.8</apg><blk>14</blk></player>
<player><name>Sprewell, Latrell</name><g>69</g><ppg>19.2</ppg><rpg>4.5</rpg><apg>4.0</apg><blk>15</blk></player>
<player><name>Ewing, Patrick</name><g>49</g><ppg>14.6</ppg><rpg>10.0</rpg><apg>1.0</apg><blk>68</blk></player>
</stats>
]]></pre></td></tr></table>


<p>The complete <a href="nba.xml">xml data</a>.</p>

<p>The script is <example class="code" desc="Reordering an XML file">ex1_1.pl</example>.</p>

<p>Note how we get the root of the twig using the <method
class="twig">root</method> method,  then use the <method
class="elt">children</method> method to get the list of players.</p> <p>The
<method class="elt">first_child</method> method is used to navigate the twig.
It accepts an optional parameter, the gi (the element name) we are interested in; if the
parameter is omitted the first child, whatever its gi, is returned. Other
navigation methods are <method class="elt">last_child</method>, <method
class="elt">prev_sibling</method>, <method class="elt">next_sibling</method> and
<method class="elt">parent</method>. They all return <tt>undef</tt> if no
element is found.</p>

<p>The <method class="elt">text</method> method returns the... text of the element,
including the text of all elements included in it, without any tags. Other methods used to
retrieve the content of an element include <method class="elt">print</method>,
which prints the element, from its start tag to its end tag inclusive,
along with the content (and tags) of all included elements, and <method
class="elt">sprint</method>, which returns the string that <tt>print</tt>
prints, and accepts an optional parameter which excludes the element tags when
true.</p>
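<p>The navigation methods described above can be combined as in the following
sketch. It is not the ex1_1.pl script itself: the inline data, the choice of
<tt>blk</tt> as the sort field and the descending order are assumptions made
for illustration.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

# a short extract of the stats document, inlined to keep the sketch self-contained
my $xml = join '',
  '<stats>',
  '<player><name>Houston, Allan</name><g>69</g><ppg>20.1</ppg><blk>14</blk></player>',
  '<player><name>Sprewell, Latrell</name><g>69</g><ppg>19.2</ppg><blk>15</blk></player>',
  '<player><name>Ewing, Patrick</name><g>49</g><ppg>14.6</ppg><blk>68</blk></player>',
  '</stats>';

my $field = 'blk';                        # the stat to sort on

my $twig = XML::Twig->new;
$twig->parse( $xml);                      # full-tree mode: the whole twig is built

my $root    = $twig->root;                # the stats element
my @players = $root->children( 'player'); # the list of player elements

# sort in descending order of the chosen stat
my @sorted = sort {     $b->first_child( $field)->text
                    <=> $a->first_child( $field)->text
                  } @players;

my @names = map { $_->first_child( 'name')->text } @sorted;
print "$_\n" for @names;

# sprint returns what print would output: the element with its tags
my $first = $sorted[0]->sprint;
]]></pre>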

<h4>Modifying the twig</h4>

<p>Another example, in which we will create new elements: our statistics
include the  total number of blocks for each player, but in order to find out
the best blocker in our selection we want the number of blocks per game, and we
want to store it in the document (conveniently the DTD allows for an optional
<tt>blg</tt> element).</p>

<p>Here is the <example class="code" desc="Creating a new element">ex1_2.pl</example>.</p>

<p>The <method class="elt">paste</method> method accepts 4 different position
arguments, each followed by the reference element:</p>

<ul compact="compact">

  <li><tt>first_child</tt>: pastes the element as the first child of the
reference element</li>

   <li><tt>last_child</tt>: pastes the element as the last child of the
reference element</li>

    <li><tt>before</tt>: pastes the element before the reference element</li>

   <li><tt>after</tt>: pastes the element after the reference element</li>

 </ul>

<p>You can omit <tt>first_child</tt> and just write <tt>$elt->paste(
$ref)</tt>. What you  can't do is paste an element that already belongs to a
document: that will cause a fatal error.</p>
<p>An important feature of the <tt>paste</tt> method is that it is called
on the element <b>being pasted</b>: <tt>$child->paste( $parent)</tt> and 
not the other way around.</p>
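<p>As an illustration of <tt>paste</tt>, here is a minimal sketch that adds a
<tt>blg</tt> element to each player. It is not the ex1_2.pl script: the inline
data and the 3-decimal formatting are assumptions.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my $xml = join '',
  '<stats>',
  '<player><name>Houston, Allan</name><g>69</g><blk>14</blk></player>',
  '<player><name>Ewing, Patrick</name><g>49</g><blk>68</blk></player>',
  '</stats>';

my $twig = XML::Twig->new;
$twig->parse( $xml);

foreach my $player ( $twig->root->children( 'player'))
  { my $g   = $player->first_child( 'g')->text;
    my $blk = $player->first_child( 'blk')->text;

    # a new, free-standing element: it does not belong to any twig yet
    my $blg = XML::Twig::Elt->new( 'blg', sprintf( "%.3f", $blk / $g));

    # paste is called on the new element, the player is the reference
    $blg->paste( 'last_child', $player);
  }

my $out = $twig->sprint;     # the updated document, as a string
print $out, "\n";
]]></pre>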

<p>Note that the output is now generated by the <method
class="twig">print</method> method, instead of regular print statements, and
that the extra line returns that we had inserted in the file have disappeared.
We will see a little later how to keep them around.</p>

<h3>Twig handlers</h3>

<p>Another way to accomplish the same task, a more
"twig-ish" way, would be to set a handler on the <tt>player</tt> element. A
handler is attached to an element name through the twig_handlers option when the
twig is created. The subroutine is called every time an element with
that name has been completely parsed, with 2 parameters: the
twig itself and the element.</p>
<p>Note that the handler is called <b>as soon as the element is completely parsed</b>.
That means that the handler will be called when the end tag for that element is parsed.
A somewhat surprising consequence of that is that if you set twig handlers on nested 
elements, the handlers on the <b>inner</b> elements will be called <b>before</b> the
handlers on the <b>outer</b> elements.</p>
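<p>The calling order is easy to check with a small sketch. The 2-player
document and the <tt>@order</tt> log are assumptions made for
illustration.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my @order;    # records the handlers in the order they are called

my $twig = XML::Twig->new(
    twig_handlers => {
        # inner element: its end tag comes first, so this handler runs first
        name   => sub { my( $t, $name)= @_;
                        push @order, 'name:' . $name->text;
                      },
        # outer element: called once its own end tag has been parsed
        player => sub { my( $t, $player)= @_;
                        push @order, 'player:' . $player->first_child( 'name')->text;
                      },
    },
);

$twig->parse( '<stats><player><name>A</name><g>1</g></player>'
            . '<player><name>B</name><g>2</g></player></stats>');

print "$_\n" for @order;
]]></pre>

<p>Because the <tt>player</tt> handler runs last, it can rely on the whole
sub-tree of the player, <tt>name</tt> included, being available.</p>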

<p>Here is the <example class="code" desc="Using twig_handlers">ex1_3.pl</example>.</p>

<p>This is basically similar to the
previous example, except the interesting code is in the handler instead of
being in the loop. It gets more interesting in the next section though...</p>

<h3>The flush and purge methods</h3>

<h4>The flush method</h4>

<p>Now in the previous examples the whole document was being loaded, then
printed. This is not very memory efficient, especially as once a player has
been updated it is never used again.</p>

<p>Hence the use of the <method
class="twig">flush</method> method. The <method class="twig">flush</method>
method just dumps the twig that has been parsed so far. It takes care of
printing the proper closing tags when needed and deleting the printed elements,
thus allowing the memory to be reused for the rest of the processing. It does
not delete the parents of the current element (but might delete most of their
children), so they are still available when navigating the twig.</p>

<p>Here is the <example class="code" desc="Using the flush method">ex1_4.pl</example>.</p>

<p>Still very similar to the previous example, except that instead of printing
the whole twig at the end of the processing the calls to <method
class="twig">flush</method> at the end of <tt>player</tt> ensure that each
player element stays in memory for just as long as it is needed.</p>

<p><strong>Note</strong>: as of XML::Twig 3.23, there is no longer any need to 
call <tt>flush</tt> one last time after the document is completely parsed. If the
document was flushed, then it will be "auto-flushed" (to the same filehandle used
for the first flush) after the parse.</p>
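<p>Here is a minimal sketch of a handler that flushes each player as soon as
it has been updated. It is not the ex1_4.pl script: the inline data and the
in-memory filehandle, used here only so that the output can be inspected, are
assumptions. Calling <tt>flush</tt> without argument prints to the currently
selected filehandle, usually STDOUT.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my $xml = join '',
  '<stats>',
  '<player><name>Houston, Allan</name><g>69</g><blk>14</blk></player>',
  '<player><name>Ewing, Patrick</name><g>49</g><blk>68</blk></player>',
  '</stats>';

my $out = '';
open( my $fh, '>', \$out) or die "cannot open in-memory file: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        player => sub { my( $t, $player)= @_;
                        my $g   = $player->first_child( 'g')->text;
                        my $blk = $player->first_child( 'blk')->text;
                        my $blg = XML::Twig::Elt->new( 'blg', sprintf( "%.3f", $blk / $g));
                        $blg->paste( 'last_child', $player);
                        # output what has been parsed so far, then free it
                        $t->flush( $fh);
                      },
    },
);

$twig->parse( $xml);   # with XML::Twig 3.23 or later the end of the document is auto-flushed
close $fh;

print $out, "\n";
]]></pre>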

<h4>The purge method</h4>

<p>The <tt>flush</tt> method is useful if you want to output the
modified document. But you might not always want that. Suppose you just want to
output the leader in a category:</p>

<p>Here is the <example class="code" desc="Using the
purge method">ex1_5.pl</example>.</p>

<p>Very simple, yet very memory
efficient. You still get the advantage of local tree-processing, having access
to the whole <tt>player</tt> sub-tree, while not having to pay the price of
loading the whole document in memory.</p>
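<p>A minimal sketch of this use of <tt>purge</tt> could look like the code
below. It is not the ex1_5.pl script: the inline data, the <tt>ppg</tt>
category and the variable names are assumptions.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my $xml = join '',
  '<stats>',
  '<player><name>Houston, Allan</name><ppg>20.1</ppg></player>',
  '<player><name>Sprewell, Latrell</name><ppg>19.2</ppg></player>',
  '<player><name>Ewing, Patrick</name><ppg>14.6</ppg></player>',
  '</stats>';

my $field = 'ppg';                 # the category we are interested in
my( $leader, $max)= ( '', -1);

my $twig = XML::Twig->new(
    twig_handlers => {
        player => sub { my( $t, $player)= @_;
                        my $value = $player->first_child( $field)->text;
                        if( $value > $max)
                          { $max    = $value;
                            $leader = $player->first_child( 'name')->text;
                          }
                        $t->purge;   # free the memory, without printing anything
                      },
    },
);

$twig->parse( $xml);
print "Leader in $field: $leader ($max)\n";
]]></pre>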

<p>But wait! There's more...</p>

<h3>The twig_roots option</h3>

<p>Actually in the previous example we build the
complete twig for each player element, even though we are really only
interested in the name and one of the sub-elements. It's OK as the xml file we
are working on is not too big, but it can be a problem, both in terms of speed
and memory, for bigger files. Fortunately XML::Twig offers a way to build the twig
only for those elements we are interested in.</p>

<p>The twig_roots option, set
when the twig is created, gives a list (well, actually a hash) of elements for
which the twig will be built. Other elements will be ignored. The result is a
twig that includes the root of the document (we need a root for the tree in any
case) and the twig_roots elements as children of that root. For each element in
the twig_roots list the whole sub-tree is built.</p>

<p>Here is the <example
desc="Using the twig_roots option">ex1_6.pl</example>.</p>

<p>The virtual twig
built (looking for the leader in ppg) is <tt>&lt;stats>&lt;name>Houston,
Allan&lt;/name>&lt;ppg>20.1&lt;/ppg>&lt;name>Sprewell,
Latrell&lt;/name>&lt;ppg>19.2&lt;/ppg>...&lt;/stats></tt>. The script doesn't
spend memory storing useless information on other stats, nor time building the
twig for those stats.</p>
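<p>The following sketch shows one way to use twig_roots to find the leader in
ppg. It is not the ex1_6.pl script: the inline data is an assumption, and the
sketch simply inspects the reduced twig once the parse is over instead of
using handlers.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my $xml = join '',
  '<stats>',
  '<player><name>Houston, Allan</name><g>69</g><ppg>20.1</ppg><blk>14</blk></player>',
  '<player><name>Sprewell, Latrell</name><g>69</g><ppg>19.2</ppg><blk>15</blk></player>',
  '<player><name>Ewing, Patrick</name><g>49</g><ppg>14.6</ppg><blk>68</blk></player>',
  '</stats>';

# the twig is built only for the name and ppg elements
my $twig = XML::Twig->new( twig_roots => { name => 1, ppg => 1 });
$twig->parse( $xml);

# the twig roots are now children of the document root
my @names = map { $_->text } $twig->root->children( 'name');
my @ppg   = map { $_->text } $twig->root->children( 'ppg');

# index of the player with the highest ppg
my( $best)= sort { $ppg[$b] <=> $ppg[$a] } 0..$#ppg;
print "Leader in ppg: $names[$best] ($ppg[$best])\n";

# the other stats never made it into the twig
my @kept = map { $_->gi } $twig->root->children;
]]></pre>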

<h3>The twig_print_outside_roots option</h3>

<p>Now suppose all we want to do is remove a statistical category from the
document. Ideally we would like to build as little of the twig as possible,
using the twig_roots option, but we also want most of the document to be
output as-is. twig_print_outside_roots to the rescue! By setting that option when
we create the twig anything outside of the twig_roots elements will simply be
printed.</p> <p>Here is the <example class="code" desc="Using the twig_print_outside_roots
option">ex1_7.pl</example>.</p>

<p>Note the use of the <method
class="elt">cut</method> method, which just removes the element from the
twig. It is also possible to use <method
class="elt">delete</method> instead of <tt>cut</tt>. The difference is that cut keeps the
element around (so it can be for example <method
class="elt">paste</method>d somewhere else), while <tt>delete</tt> destroys it (and
frees up the memory it used). </p>
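<p>Here is a minimal sketch of this technique, removing the <tt>blk</tt>
category. It is not the ex1_7.pl script: the inline data is an assumption, and
so is the use of a filehandle as the value of twig_print_outside_roots, which
is done here only to capture the output (a plain true value prints to the
currently selected filehandle).</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my $xml = join '',
  '<stats>',
  '<player><name>Houston, Allan</name><g>69</g><blk>14</blk></player>',
  '<player><name>Ewing, Patrick</name><g>49</g><blk>68</blk></player>',
  '</stats>';

my $out = '';
open( my $fh, '>', \$out) or die "cannot open in-memory file: $!";

my $twig = XML::Twig->new(
    # the twig is built only for blk elements, which are then destroyed
    twig_roots               => { blk => sub { my( $t, $blk)= @_; $blk->delete; } },
    # everything else is copied to the output as it is parsed
    twig_print_outside_roots => $fh,
);

$twig->parse( $xml);
close $fh;

print $out, "\n";
]]></pre>

<p>Nothing inside a twig root is printed unless the handler prints it, so
deleting the element is enough to drop it from the output.</p>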

<p>And of course, as There's More Than One Way To Do It, here is a
real short script that does the same thing, just in a more lazy way (and
actually a slightly faster but  more memory intensive one).</p>

<p>The <example
desc="A lazy example">ex1_8.pl</example>.</p>

<p>Figuring out how it works is
left as an exercise for the reader (hint: twig_print_outside_roots does just what
its name suggests, no more).</p>

<h3>A simple HTML+ converter</h3>

<p>Now with
what we've learned so far we are just a couple of additional tricks away from
building a simple "HTML+" converter. The <tt>+</tt> here means that we can
include additional inline elements in an HTML document. Provided of course that
the HTML document is a well-formed XML instance (and I admit this can be hard to
achieve).</p>

<p>So here is the <example class="code" desc="A simple HTML+
converter">xml2html1.pl</example>. It runs on the <a
href="html_plus.xml">html_plus.xml</a> file and includes itself in the
output:<a href="html_plus.html">html_plus.html</a></p>

<p>We use 3 new methods here:</p>

<ul>

<li><method class="elt">set_gi</method>, predictably sets the gi
(the name, gi means <em>generic identifier</em>, it comes from  sgml) of the
element, the <method class="elt">gi</method> method returns the gi of an
element</li>

<li><method class="elt">set_att</method> sets (and creates if it
does not exist already) an attribute to a value, the <method
class="elt">att</method> method retrieves the value of an attribute,</li>

<li><method class="elt">insert</method> creates an element which is inserted
within another element: the new element is the only child of the initial
element and all children of the initial element become children of the new
element. The method returns the new element.</li>

</ul>

<p>Also note the neat trick (thanks to Clark Cooper for this one) that consists in setting the handler as a sub that just adds an extra parameter to the usual ones: <tt>sub {
make(@_, 'tt') }</tt>.</p>
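<p>The 3 methods and the closure trick can be put together as in the
following sketch. It is not the xml2html1.pl script: the inline document, the
body of <tt>make</tt> and the <tt>#name</tt> form of the link are assumptions
made for illustration.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        # the anonymous sub passes the usual 2 arguments plus the target gi
        method => sub { make( @_, 'tt') },
    },
);

$twig->parse( '<p>Use <method class="elt">paste</method> here.</p>');

my $html = $twig->root->sprint;
print $html, "\n";

# turns the element into $new_gi and wraps its content in a link
sub make
  { my( $t, $elt, $new_gi)= @_;
    my $name = $elt->text;                # the name of the method
    $elt->del_att( 'class');              # not wanted in the output
    $elt->set_gi( $new_gi);               # method becomes tt
    my $link = $elt->insert( 'a');        # a is now the only child of tt
    $link->set_att( href => "#$name");    # hypothetical link target
  }
]]></pre>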

<h3>Setting handlers for elements in context</h3>
<p>An additional option is to set handlers not for elements but for elements in a given context: instead of giving just the gi of the element you can use an XPath-like expression in the twig_handlers (as well as in the twig_roots) argument.</p>
<p>A valid path can be of the form <tt>/root/elt1/elt2</tt> for a complete path to the element, or <tt>elt1/elt2</tt> for a partial path.</p>
<p>Note that this path is given in the <b>original</b> document, not in the current twig.</p>
<p>So if we want to convert the simple document we saw in the XML examples we would write the conversion as in <example class="code" desc="Using path instead of gi's">ex1_9.pl</example>.</p>

<p>When we process the <tt>doc</tt> element the title has already
been processed, so we have to look for a <tt>h1</tt> child.</p>
<p>We also use two new methods here: <method class="elt">erase</method> removes
the element and pastes all of its children as children of the element's parent.
The effect on the output is that the tag has been erased from the document. <method class="elt">set_text</method> sets the textual content of the element.</p>
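<p>Here is a minimal sketch of handlers set on paths. It is not the ex1_9.pl
script: the inline document, with its <tt>doc</tt>, <tt>section</tt> and
<tt>title</tt> elements, and the target HTML tags are assumptions.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        # complete path: only the title of the document itself
        '/doc/title'    => sub { my( $t, $title)= @_; $title->set_gi( 'h1'); },
        # partial path: any title directly in a section
        'section/title' => sub { my( $t, $title)= @_; $title->set_gi( 'h2'); },
        # the content of the section is kept, its tags are dropped
        'section'       => sub { my( $t, $section)= @_; $section->erase; },
        'doc'           => sub { my( $t, $doc)= @_;
                                 # the title has already been processed: it is now an h1
                                 my $h1 = $doc->first_child( 'h1');
                                 $h1->set_text( $h1->text . ' (converted)');
                                 $doc->set_gi( 'body');
                               },
    },
);

$twig->parse( '<doc><title>Doc</title>'
            . '<section><title>Part 1</title><p>text</p></section></doc>');

my $html = $twig->root->sprint;
print $html, "\n";
]]></pre>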

</section>

<section>

<h2>Data base integration</h2>

<p>We now have all the tools we need to build documents that include data 
straight out of relational data bases. The only decision we have to make is how to design our documents, and our DTDs: are we going to include entire tables or single values, and how?</p>

<p>Here are some simple examples of what can be done:</p>

<h3>Including a table</h3>

<p>For this first example we will include a whole table in the document.</p>
<p>The document we use is <example desc="An XML document including a relationnal table">books1.xml</example>, where the table is
generated by the <tt>&lt;rel_table query="SELECT code, name, price FROM books"/> </tt> tag.</p>
<p>The code in <example class="code" desc="Including a relationnal table in an XML document">ex2_1.pl</example> mixes DBI and XML::Twig to build the table.</p>
<p>This code can also be used to process slightly trickier queries, as in  <example desc="Another XML document including a relationnal table!">books2.xml</example>.</p>

<h3>Including values from a table</h3>

<p>Depending on how generic, and how convenient to write we want the queries to be, several options are possible. Here are a couple:</p>
<p>The first document is <example desc="A document including generic queries">books3.xml</example>, which includes very generic queries.</p>
<p>It can be processed using the <example class="code" desc="Including values from a relationnal table in an XML document">ex2_2.pl</example> script.</p>

<p>A shorter but less generic way would be a document like <example desc="A document including generic queries">books3.xml</example>.</p>
<p>It can be processed using the <example class="code" desc="Including values from a relationnal table in an XML document (alt)">ex2_3.pl</example> script.</p>


<h3>Dumping an XML table into a data base table</h3>
<p>We are now going to fill a relational table from an XML file, which could come from another, incompatible, data base for example.</p>
<p>The XML file looks like this: <example desc="An XML file linking players with their team">teams_extract.xml</example> (the whole file is in <a href="teams.xml">teams.xml</a>).</p>
<p>The script to load the table: <example class="code" desc="Dumping an XML table into a data base table">ex2_4.pl</example> is pretty simple, the only notable features being the fact that we prepare the SQL statement once and then bind parameters to it, and that the <method class="twig">purge</method> does not delete the parent element of a name.</p> 
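<p>The 2 points above can be sketched as follows. This is not the ex2_4.pl
script: the layout of the XML (a <tt>team</tt> element with a <tt>name</tt>
attribute, containing <tt>player</tt> elements), the table definition and the
use of an in-memory SQLite data base through DBD::SQLite are all assumptions
made to keep the sketch self-contained.</p>

<pre><![CDATA[use strict;
use warnings;
use DBI;
use XML::Twig;

# hypothetical layout, the real teams.xml may differ
my $xml = join '',
  '<teams>',
  '<team name="Knicks"><player>Houston, Allan</player><player>Ewing, Patrick</player></team>',
  '<team name="Kings"><player>Stojakovic, Predrag</player></team>',
  '</teams>';

# assumes DBD::SQLite is installed; any other DBI driver would do
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });
$dbh->do( 'CREATE TABLE team (player TEXT, team TEXT)');

# the statement is prepared once...
my $insert = $dbh->prepare( 'INSERT INTO team (player, team) VALUES (?, ?)');

my $twig = XML::Twig->new(
    twig_handlers => {
        player => sub { my( $t, $player)= @_;
                        # purge never deletes the ancestors of the current
                        # element, so the team is still there for every player
                        my $team = $player->parent->att( 'name');
                        # ...and executed with new values for each player
                        $insert->execute( $player->text, $team);
                        $t->purge;
                      },
    },
);
$twig->parse( $xml);

my( $count)= $dbh->selectrow_array( 'SELECT COUNT(*) FROM team');
my( $team) = $dbh->selectrow_array( 'SELECT team FROM team WHERE player = ?',
                                    undef, 'Ewing, Patrick');
print "$count players loaded, Ewing plays for the $team\n";
]]></pre>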
</section>

<section>

<h2>Other features</h2>

<p>Now let's see some other features of XML::Twig, beyond the basic examples.</p>

<h3>Using the finish and finish_print methods</h3>
<p>Sometimes all we need is to extract or update part of the document. In this case there is no reason to bother with building the twig for the rest of the
document. We just want to be done with it and exit, or go through the rest of the document and just output it. That's what the <method class="twig">finish</method>
and <method class="twig">finish_print</method> methods provide.</p>
<p><method class="twig">finish</method> calls Expat's finish method. It unsets all handlers (including internal ones that set context), but Expat continues parsing to the end of the document or until it finds an error. It should finish up a lot faster than with the handlers set.</p>
<p><method class="twig">finish_print</method> stops the twig processing, flushes the twig and proceeds to finish printing the document as fast as possible.</p>
<p>So here is <example class="code" desc="Using the finish method">ex3_1.pl</example>, which just displays a stat for a player then finishes parsing. Note that the document is still checked for well-formedness,
the script will exit with an error if the document is not well-formed XML.</p>
<p>Probably more interesting is <example class="code" desc="Using the finish_print method">ex3_2.pl</example> which updates the stats for a player.</p>
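<p>A minimal sketch of <tt>finish</tt> is shown below. It is not the ex3_1.pl
script: the inline data, the player looked for and the <tt>$calls</tt>
counter, which is there only to show that the handler is no longer called
after <tt>finish</tt>, are assumptions.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my $xml = join '',
  '<stats>',
  '<player><name>Houston, Allan</name><ppg>20.1</ppg></player>',
  '<player><name>Sprewell, Latrell</name><ppg>19.2</ppg></player>',
  '<player><name>Ewing, Patrick</name><ppg>14.6</ppg></player>',
  '</stats>';

my $wanted = 'Sprewell, Latrell';
my $calls  = 0;       # number of times the handler has been called
my $ppg;

my $twig = XML::Twig->new(
    twig_handlers => {
        player => sub { my( $t, $player)= @_;
                        $calls++;
                        if( $player->first_child( 'name')->text eq $wanted)
                          { $ppg= $player->first_child( 'ppg')->text;
                            # we have what we want: no more handlers, Expat
                            # just checks the rest of the document
                            $t->finish;
                          }
                      },
    },
);

$twig->parse( $xml);
print "$wanted: $ppg ppg (handler called $calls times)\n";
]]></pre>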
 
<h3>Using set_id and elt_id methods</h3>

<p>For some applications, especially when the whole document is loaded in memory, it can be very convenient to get direct access to elements through an
ID attribute. XML::Twig provides such a feature. By default if an element has
an attribute named <tt>id</tt> then a hash id => element is created. This
hash can be accessed through the <method class="elt">id</method>, <method class="elt">set_id</method> and <method class="elt">del_id</method> methods on
an element, and an element can be retrieved from a twig using the <method class="twig">elt_id</method> method on the twig.</p>
<p>The name of the ID attribute can be changed when the twig is created by using the <method class="twig_new">Id</method> option.</p>
<p>The id attribute can still be accessed through the <method class="elt">att</method>, <method class="elt">set_att</method> and <method class="elt">del_att</method> methods on the element but in this case the id
hash will not be updated.</p>

<p><example class="code" desc="Using set_id">ex3_3.pl</example> is an example
of the <method class="elt">set_id</method> method.</p>

<p><example class="code" desc="using elt_id">ex3_4.pl</example> uses <method class="twig">elt_id</method> on the updated XML document to display the name
of a player with a given id. <tt>perl ex3_3.pl | perl ex3_4.pl 050</tt> will then display <b>player050: Stojakovic, Predrag</b>.</p>
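<p>Both steps can be sketched in a single script. This is not ex3_3.pl or
ex3_4.pl: the inline data is an assumption, and the <tt>playerNNN</tt> form
of the id simply follows the output shown above.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my $xml = join '',
  '<stats>',
  '<player><name>Houston, Allan</name><ppg>20.1</ppg></player>',
  '<player><name>Sprewell, Latrell</name><ppg>19.2</ppg></player>',
  '</stats>';

my $twig = XML::Twig->new;
$twig->parse( $xml);

# give an id to each player: set_id sets the attribute AND updates the id hash
my $nb = 0;
foreach my $player ( $twig->root->children( 'player'))
  { $player->set_id( sprintf( 'player%03d', ++$nb)); }

# direct access to an element through its id
my $player = $twig->elt_id( 'player002');
my $line   = $player->id . ': ' . $player->first_child( 'name')->text;
print "$line\n";

my $out = $twig->sprint;    # the id attributes are part of the document
]]></pre>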

<h3>Comparing the order of 2 elements</h3>
<p>XML::Twig also offers methods to compare the order of 2 elements in the document. <method class="elt">before</method> and <method class="elt">after</method> are based on the <method class="elt">cmp</method> method. An element is before
another one if its opening tag is before the opening tag of the other element.
Otherwise it is after. The 2 elements are equal if they are... equal!</p>
<p><example class="code" desc="Using before and after">ex3_5.pl</example>
shows how to use those methods. You can run it on an ordered and "id'ed"
document this way: <tt>perl ex1_1.pl blk | perl ex3_3.pl | perl ex3_5.pl 001 015</tt>.</p>
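<p>A minimal sketch of these methods follows. It is not the ex3_5.pl script:
the inline data and the variable names are assumptions.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse( '<stats><player><name>Houston, Allan</name></player>'
            . '<player><name>Sprewell, Latrell</name></player></stats>');

my( $houston, $sprewell)= $twig->root->children( 'player');

my $is_before = $houston->before( $sprewell);   # true: its start tag comes first
my $is_after  = $houston->after( $sprewell);    # false
my $order     = $houston->cmp( $sprewell);      # -1, like the cmp operator
my $same      = $houston->cmp( $houston);       #  0: an element is equal to itself

# a child is after its parent: the start tag of the parent comes first
my $child_after = $houston->first_child( 'name')->after( $houston);

print "Houston is ", ( $is_before ? 'before' : 'after'), " Sprewell\n";
]]></pre>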

<h3>The next_elt method</h3>
<p>Although the <tt>next_sibling</tt> and <tt>first_child</tt> methods are
often the most convenient way to navigate there are some cases where another 
method is easier to use: the <method class="elt">next_elt</method> method makes 
it easier to go through all the elements in a sub-tree.</p>
<p>The <tt>next_elt</tt> of an element is the first element opened after the
open tag of the element. This is either the first child of the element, or its
next sibling, or the next sibling of one of its ancestors. Note that as usual PCDATA is considered an element.</p>
<p>This method has 2 forms: </p>
<ul><li><tt>$elt->next_elt</tt> returns simply the next element,</li>
    <li><tt>$elt->next_elt( $subtree_root)</tt> returns the next element, or
        <tt>undef</tt> if the next element would be outside of the 
            $subtree_root element.</li>
</ul>
<p><example class="code" desc="Using next_elt">ex3_6.pl</example> shows how
to use <tt>next_elt</tt> to list all the methods in the html_plus.xml document.
</p>
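<p>The loop below is a minimal sketch of the second form. It is not the
ex3_6.pl script: the inline document stands in for html_plus.xml and the
variable names are assumptions.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse( '<doc><p>Use <method>paste</method> or <method>cut</method>.</p>'
            . '<p>Done</p></doc>');

my $root = $twig->root;
my( @seen, @methods);

# next_elt returns undef once we leave the sub-tree, which ends the loop
my $elt = $root;
while( $elt= $elt->next_elt( $root))
  { push @seen, $elt->gi;                       # text shows up as #PCDATA
    push @methods, $elt->text if( $elt->gi eq 'method');
  }

print "methods: @methods\n";
print "elements seen: @seen\n";
]]></pre>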

<h3>Pretty printing</h3>
<p>By popular demand I have included a number of pretty printing options,
both for documents and for data.</p>
<p>The useful options to pretty print a document are:</p>
<ul><li><tt>none</tt>: the default, no \n is used,</li>
    <li><tt>nsgmls</tt>: nsgmls style, with \n added within tags,</li>
    <li><tt>nice</tt>: adds \n wherever possible (<b>NOT SAFE</b>),</li>
    <li><tt>indented</tt>: same as nice plus indents elements (<b>NOT 
        SAFE</b>).</li>
</ul>
<p>The NOT SAFE options can produce invalid XML (that would not conform 
to the original DTD) in some cases. I have included them anyway because
it rarely happens with simple DTDs and they look good!</p>
<p>The <example class="code" desc="Pretty printing a document">ex3_7.pl</example> example shows the pretty printer.</p>
<p>The output is <example desc="A pretty printed document">ex3_7.res</example>.</p> 
<p>To pretty print tables 2 options can be used (besides the faithful 
<tt>none</tt>):</p>
<ul><li><tt>record_c</tt>: compact, one record per line,</li>
    <li><tt>record</tt>: one field per line.</li>
</ul>
<p>The <example class="code" desc="Pretty printing a table">ex3_8.pl</example> example shows the pretty printer.</p>
<p>The output is <example desc="A pretty printed table">ex3_8.res</example>.</p> 

<p>These options can be set either when creating the twig, using the PrettyPrint
option, by using the PrettyPrint option in the print method (on a twig or on an 
element) or by using the <method class="elt">set_pretty_print</method>
method either on a twig or on an element. Note that the setting is actually 
global at the moment.</p>
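<p>Here is a minimal sketch of the different ways to set the style. It is not
the ex3_7.pl or ex3_8.pl script: the inline data is an assumption, and the
option is written <tt>pretty_print</tt>, the lower-case spelling that recent
versions of the module accept along with <tt>PrettyPrint</tt>.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

# the style can be set when the twig is created...
my $twig = XML::Twig->new( pretty_print => 'indented');
$twig->parse( '<stats><player><name>Houston, Allan</name><g>69</g></player></stats>');

my $indented = $twig->sprint;          # one element per line, indented

# ...or changed later
$twig->set_pretty_print( 'record_c');
my $compact = $twig->sprint;           # one player per line

# the setting is global: put it back to the default when done
$twig->set_pretty_print( 'none');
my $flat = $twig->sprint;              # no line return added

print "indented:\n$indented\nrecord_c:\n$compact\nnone:\n$flat\n";
]]></pre>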


</section>
<section>

<h2>Advanced features</h2>

<p>or "I hope you don't need those"</p>

<h3>Using StartTagHandlers</h3>
<p>Sometimes you might want to just change a tag name, or store some attributes,
BEFORE the whole tree for the element is built. This is often the case when you need to <method class="twig">flush</method> the twig while in the element. Then changing the element name for example will only change the end tag, as the start tag will have been output by the time you try to change it.</p>

<p>In that case you can use the StartTagHandlers option when you create the twig, which will call a handler when the start tag of the element is found. 
The arguments passed to the handler will be the twig and the element. The element will be empty at that point but the attributes will be there.</p>

<p><example class="code" desc="Using StartTagHandlers">ex4_0.pl</example> demonstrates the use of StartTagHandlers to change the tags in an XML document.
</p>

<p>The other new feature used in this script is the <tt>_all_</tt> keyword in
the twig_handlers option. This calls the handler (which in this case just flushes the twig) for every single element in the document. Another keyword, <tt>_default_</tt> calls a handler for each element that does not have a handler. <tt>_all_</tt> and <tt>_default_</tt> can be used both with StartTagHandlers and with the twig_handlers option.
</p> 
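<p>A minimal sketch of a start tag handler is given below. It is not the
ex4_0.pl script: the inline data, the <tt>athlete</tt> tag and the
<tt>@seen</tt> log are assumptions, and the option is written
<tt>start_tag_handlers</tt>, the lower-case spelling of StartTagHandlers.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my @seen;

my $twig = XML::Twig->new(
    start_tag_handlers => {
        player => sub { my( $t, $player)= @_;
                        # only the start tag has been parsed at this point:
                        # the attributes are there, the children are not
                        my $content = defined $player->first_child ? 'filled' : 'empty';
                        push @seen, $player->att( 'id') . ":$content";
                        # changing the name here changes both tags in the output
                        $player->set_gi( 'athlete');
                      },
    },
);

$twig->parse( '<stats><player id="p1"><name>A</name></player>'
            . '<player id="p2"><name>B</name></player></stats>');

my $out = $twig->root->sprint;
print "@seen\n$out\n";
]]></pre>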

<h3>Purging part of the tree</h3>
<p>Sometimes, especially when converting an XML file to several HTML ones it
is convenient to purge the twig only up to the next-to-last sibling, not
up to the current one. Hence the <method class="twig">purge_up_to</method>
and <method class="twig">flush_up_to</method> methods.</p>
<p>Here is an example of how to use them to list the difference in a given
stat between 2 consecutive players. <example class="code" desc="Purging part of the tree">ex4_1.pl</example> can receive the output from ex1_1.pl.</p>
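<p>The idea can be sketched as follows. This is not the ex4_1.pl script: the
inline data, the <tt>ppg</tt> field and the <tt>@diffs</tt> list are
assumptions.</p>

<pre><![CDATA[use strict;
use warnings;
use XML::Twig;

my $xml = join '',
  '<stats>',
  '<player><name>Houston, Allan</name><ppg>20.1</ppg></player>',
  '<player><name>Sprewell, Latrell</name><ppg>19.2</ppg></player>',
  '<player><name>Ewing, Patrick</name><ppg>14.6</ppg></player>',
  '</stats>';

my $field = 'ppg';
my @diffs;

my $twig = XML::Twig->new(
    twig_handlers => {
        player => sub { my( $t, $player)= @_;
                        # nothing to compare the first player with
                        my $prev = $player->prev_sibling( 'player') or return 1;
                        push @diffs, sprintf( '%.1f',   $prev->first_child( $field)->text
                                                      - $player->first_child( $field)->text);
                        # purge the previous player and everything before it,
                        # the current player stays for the next comparison
                        $t->purge_up_to( $prev);
                      },
    },
);

$twig->parse( $xml);
print "differences in $field: @diffs\n";
]]></pre>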

<h3>Fun with overloading</h3>
<p>I just thought I'd mention, because I think it's cool, that you can overload
the comparison operators to use the <method class="elt">cmp</method> method to
compare elements in a twig.</p>

<p>So just insert these lines in your script:</p>

<pre><![CDATA[package XML::Twig::Elt;

use overload  cmp  => \&cmp,
             'lt'  => \&lt,
             'le'  => \&le,
             'gt'  => \&gt,
             'ge'  => \&ge,
             '+='  => \&suffix,
             '-='  => \&prefix,
             '>>'  => \&suffix,
             '<<'  => \&prefix,
             fallback => 1,
;]]>
</pre>

<p>Then you will be able to write <code>if( $elt1 le $elt2) { print "$elt1 is
before $elt2\n"; }</code>. As an added bonus you get 2 new ways to prefix or
suffix an element: </p>

<pre>$elt += "suffix";
$elt -= "prefix";
$elt &lt;&lt; "prefix";
$elt >> "suffix";
</pre>

<p>This is just syntactic sugar, and IMHO pretty useless (hence it is not
included in the module), plus it slows the module down by a good 30%. It's cute
though, and if you don't care about speed and need to do a lot of comparisons
of elements it can be handy.</p>

</section>

<section>
<h2>Under the hood</h2>

<p>Now let's have a look under the hood at some of the things that go on in XML::Twig from a developer stand point.</p>

<h3>Speedup</h3>

<p>I think one of the most interesting features of XML::Twig is the optimization step that takes place when the module is installed.</p>

<p>The module is written in pure OO style, with accessors for every field of the objects, even inside the module. But as we all know method calls are expensive.
So an optimization pass replaces method calls by hash accesses where possible.</p>

<p>For example <tt>$elt->parent</tt> is replaced by<tt> $elt->{parent}</tt> and 
<tt>$elt->set_parent( $parent)</tt> is replaced by <tt>$elt->{parent}= $parent</tt>.</p>

<p>The <example desc="The speedup tool">speedup</example> is pretty simple,
just a bunch of substitutions, and certainly not foolproof (it would crash
miserably if I were to use brackets in the argument list of a method). It
works pretty well though, and if it fails then the non-regression tests will
catch the problem. It could be improved by using the new regexp features of Perl 5.6 to fix this.</p>

<p>The result is an improvement of about 30% of the speed of the module.</p>
<p>Speedup could also be used to... speedup a production script, with the caveat that as XML::Twig implementation changes it might be necessary to re-run the
tool with new versions of the module.</p>
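<p>To give an idea of the kind of substitutions involved, here is a minimal
sketch. It is not the actual speedup tool: the list of fields, the regular
expressions and the <tt>speedup</tt> function are assumptions, and the sketch
has the same weakness with brackets in the argument list.</p>

<pre><![CDATA[use strict;
use warnings;

# hypothetical list of fields that have plain accessors
my @fields = qw( parent first_child last_child prev_sibling next_sibling);
my $fields = join '|', @fields;

sub speedup
  { my( $code)= @_;
    # setters first: $elt->set_parent( $p) becomes $elt->{parent}= $p
    # (the argument must not include brackets)
    $code =~ s/->set_($fields)\(\s*([^()]*?)\s*\)/->{$1}= $2/g;
    # then getters called without arguments: $elt->parent becomes $elt->{parent}
    $code =~ s/->($fields)\b(?!\s*\()/->{$1}/g;
    return $code;
  }

my $before = 'my $p= $elt->parent; $elt->set_parent( $new); $elt->first_child( "a");';
my $after  = speedup( $before);

print "$before\n$after\n";
]]></pre>

<p>Calls that take an argument, like <tt>first_child</tt> with a gi, are left
alone: they do more than a simple hash access.</p>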

<h3>Element names "compression"</h3>
<p>A minor optimization in XML::Twig is that element names, which are stored as hash <i>values</i> are replaced by an index in an array holding all names.</p>
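<p>The principle can be sketched in a few lines. This is not the code of the
module: the <tt>%gi2index</tt> and <tt>@index2gi</tt> variables and the 2
functions are assumptions made for illustration.</p>

<pre><![CDATA[use strict;
use warnings;

my %gi2index;    # name  => index
my @index2gi;    # index => name, each name is stored only once

# returns the index of a name, adding the name to the table if needed
sub gi_index
  { my( $gi)= @_;
    unless( exists $gi2index{$gi})
      { push @index2gi, $gi;
        $gi2index{$gi}= $#index2gi;
      }
    return $gi2index{$gi};
  }

# returns the name of an element from the index stored in its hash
sub gi_of
  { my( $elt)= @_;
    return $index2gi[ $elt->{gi}];
  }

# 5 elements but only 3 different names
my @elts = map { +{ gi => gi_index( $_) } } qw( stats player name player name);

print scalar( @index2gi), " names stored for ", scalar( @elts), " elements\n";
]]></pre>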

<h3>Failed optimizations</h3>

<p>Not all attempts at optimizing XML::Twig succeeded, so I think it might be useful for me to share at least my biggest failure in this area...</p>
<p>Twig elements are stored in hashes, one element per hash. In order to reduce the potential overhead of too much memory being allocated for each one of them I tried to store elements in global arrays, each array storing one field
for all the elements: instead of the parent of an element being stored in 
$elt->{parent} it was stored in $parent[$elt], $elt being a blessed scalar.</p>
<p>It did not work.</p>
<p>The twig was just as big and slower to access than the original version.</p>
<p>Oh well... there goes 2 days of work...</p>


</section>

<section>

<h2>Reference</h2>

<p>The <a href="twig_dev.html">XML::Twig documentation</a>.</p>

</section>

<copyright><p>(c) 2000 Michel Rodriguez<br/> This tutorial is free
documentation. It can be redistributed and modified under the same terms as
perl itself</p></copyright>

</tutorial>
