Perl and Penn Treebank musings
Jul. 18th, 2003 04:46 pm
Hmm... I'm looking very closely at creating a CPAN module for dealing with the Penn Treebank markup format. (This is a way of marking up the grammatical structure of sentences.)
( (INTJ (UH Hello) (-DFL- E_S) ))
( (S
    (NP-SBJ (DT this) )
    (VP (VBZ is)
      (NP-PRD (NNS Lois) ))
    (. .) (-DFL- E_S) ))
It's actually an elegant format, and now that Perl 5.8 ships with Text::Balanced, parsing it takes only a rather elegant snippet of code:
use Text::Balanced 'extract_bracketed';  # thanks Damian!

sub get_tags {
    # pass it a complete constituent, it returns the tag plus a list
    # of its subconstituents. If subconstituents themselves have
    # structure, then they will be arrayrefs
    local $_ = shift;
    my ($tag, $children) = /^ \( (\S*) \s (.*\S) \s* \) $/sx
        or return;    # not a well-formed constituent: return nothing
    my @children;
    # test length rather than truth, so a lone "0" token is not dropped
    while (length $children) {
        # in scalar context, extract_bracketed removes the leading
        # bracketed chunk from $children and returns it (undef if none)
        my $child = extract_bracketed($children, '()');
        if (defined $child) {
            # child is itself a constituent
            $child = [ get_tags($child) ];
        }
        else {
            # this is a word; we're done
            ($child, $children) = ($children, '');
            warn "trouble -- two tokens in preterminal" if @children;
        }
        push @children, $child;
    }
    return ($tag, @children);
}
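For what it's worth, here is a rough sketch of how the returned structure might be consumed. The `leaves` helper is hypothetical (it is not part of Text::Balanced or of any existing module), and the tree literal is written out by hand, on the assumption that get_tags returns exactly the tag-plus-arrayrefs shape described in its comments for the second example sentence.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a constituent given as (tag, children...),
# where nested constituents are arrayrefs and words are plain strings,
# and return a flat list of [tag, word] pairs in sentence order.
sub leaves {
    my ($tag, @children) = @_;
    my @pairs;
    for my $child (@children) {
        if (ref $child eq 'ARRAY') {
            push @pairs, leaves(@$child);     # recurse into a subconstituent
        }
        else {
            push @pairs, [ $tag, $child ];    # a word under its preterminal tag
        }
    }
    return @pairs;
}

# Hand-written stand-in for the parse of "this is Lois ."
# (assumed shape; the outermost wrapper has an empty tag).
my @tree = ('',
    [ 'S',
        [ 'NP-SBJ', [ 'DT', 'this' ] ],
        [ 'VP', [ 'VBZ', 'is' ], [ 'NP-PRD', [ 'NNS', 'Lois' ] ] ],
        [ '.', '.' ],
        [ '-DFL-', 'E_S' ],
    ],
);

# Print each word with its part-of-speech tag, e.g. "this/DT".
print join(' ', map { "$_->[1]/$_->[0]" } leaves(@tree)), "\n";
```

Keeping words as plain strings and constituents as arrayrefs means a single `ref` check is enough to tell the two apart, which is about all a walker like this needs.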