Perl XML::LibXML $node->findnodes($xpath) finds nodes it shouldn't

RedGrittyBrick picture RedGrittyBrick · Aug 14, 2012 · Viewed 20.6k times · Source

Here's some code I am having problems with, I process some XML and in a method in an OO class I extract an element from each of several nodes that repeat in the document. There should only be one such element in the subtree for each node but my code gets all elements as if it is operating on the document as a whole.

Because I only expected to get oine element I only use the zeroth element of an array, this leads my function to output the wrong value (and its the same for all items in the document)

Here's some simplified code that illustrates the problem

$ cat t4.pl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $xml = <<EndXML;
<Envelope>
  <Body>
    <Reply>
      <List>
        <Item>
          <Id>8b9a</Id>
          <Message>
            <Response>
              <Identifier>55D</Identifier>
            </Response>
          </Message>
        </Item>
        <Item>
          <Id>5350</Id>
          <Message>
            <Response>
              <Identifier>56D</Identifier>
            </Response>
          </Message>
        </Item>
      </List>
    </Reply>
  </Body>
</Envelope>
EndXML

my $foo = Foo->new();

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_string( $xml );
my @list   = $doc->getElementsByTagName( 'Item' );

for my $item ( @list ) {

    my $id = get( $item, 'Id' );
    my @messages = $item->getElementsByLocalName( 'Message' );

    for my $message ( @messages ) {

        my @children = $message->getChildNodes();

        for my $child ( @children ) {

            my $name = $child->nodeName;

            if ( $name eq 'Response' ) {
                print "child is a Response\n";
                $foo->do( $child, $id );
            }
            elsif ( $name eq 'text' ) {

                # ignore whitespace between elements
            }
            else {
                print "child name is '$name'\n";
            }
        }    # child
    }    # Message
}    # Item

# ..............................................

sub get {
    my ( $node, $name ) = @_;

    my $value   = "(Element $name not found)";
    my @targets = $node->getElementsByTagName( $name );

    if ( @targets ) {
        my $target = $targets[0];
        $value = $target->textContent;
    }

    return $value;
}

# ..............................................

package Foo;

sub new {
    my $self = {};
    bless $self;
    return $self;
}

sub do {
    my $self = shift;
    my ( $node, $id ) = @_;

    print '-' x 70, "\n", ' ' x 12, $node->toString( 1 ), "\n", '-' x 70, "\n";

    my @identifiers = $node->findnodes( '//Identifier' );
    print "do() found ", scalar @identifiers, " Identifiers\n";

    print "$id, ", $identifiers[0]->textContent, "\n\n";
}

Here's the output

$ perl t4.pl
child is a Response
----------------------------------------------------------------------
            <Response>
              <Identifier>55D</Identifier>
            </Response>
----------------------------------------------------------------------
do() found 2 Identifiers
8b9a, 55D

child is a Response
----------------------------------------------------------------------
            <Response>
              <Identifier>56D</Identifier>
            </Response>
----------------------------------------------------------------------
do() found 2 Identifiers
5350, 55D

I was expecting

do() found 1 Identifiers

I was expecting the last line to be

5350, 56D

I am using an old version of XML::LibXML due to platform issues.

Q: Does the problem exist in later versions or am I doing something wrong?

Answer

Borodin picture Borodin · Aug 14, 2012

From the documentation of XPath 1.0

//para selects all the para descendants of the document root

(emphasis my own). So your call

$node->findnodes( '//Identifier' )

is ignoring the context node $node and searching for all Identifier elements anywhere in the document

To get all Identifier descendants of the context node you must add a dot, like this

$node->findnodes('.//Identifier');

but since $node is always a Response element and Identifier is a direct child of Response you can just write

$node->findnodes('Identifier');



You seem to have got yourself a little tied up writing this. I know you have cut the code down as an example, but do you really need the separate package? Much can be done with judicious application of XPath.

The most obvious change is that you don't need to loop through all children - you can simply pick out the ones you're interested in.

This refactored code may be worth reading

use strict;
use warnings;

use XML::LibXML;

my $parser = XML::LibXML->new;
my $doc    = $parser->parse_fh(*DATA);

for my $item ( $doc->findnodes('//Item') ) {

    print "\n";

    my ($id) = $item->findvalue('Id');
    printf "Item Id: %s\n", $item->findvalue('Id');

    my @messages = $item->findnodes('Message');

    for my $message (@messages) {
        my ($response) = $message->findnodes('Response');
        printf "Response Identifier: %s\n", $response->findvalue('Identifier');
    }
}

__DATA__
<Envelope>
  <Body>
    <Reply>
      <List>
        <Item>
          <Id>8b9a</Id>
          <Message>
            <Response>
              <Identifier>55D</Identifier>
            </Response>
          </Message>
        </Item>
        <Item>
          <Id>5350</Id>
          <Message>
            <Response>
              <Identifier>56D</Identifier>
            </Response>
          </Message>
        </Item>
      </List>
    </Reply>
  </Body>
</Envelope>

output

Item Id: 8b9a
Response Identifier: 55D

Item Id: 5350
Response Identifier: 56D