tags:

views:

35

answers:

1

I have a collection of parse trees, and they are in this ascii representation where indentation determines the structure (and closing brackets are implicit). I need to convert them to s-expressions so that parentheses determine the structure. It's a little bit like python's significant whitespace vs. braces. The input format is a vertical representation of trees, like so:

STA:fcl
=S:np
==DN:pron-dem("tia" <*> <Dem> <Du> <dem> DET P NOM)     Tiaj
==H:n("akuzo" <act> <sd> P NOM) akuzoj
=fA:adv("certe")        certe
=P:v-fin("dauxri" <va+TEMP> <mv> FUT VFIN)      dauxros
.

Should become:

(STA:fcl (S:np (DN:pron-dem Tiaj) (H:n akuzoj)) (fA:adv certe) (P:v-fin dauxros) .)

I have code that almost does it, but not quite. There's always a missing paren somewhere; it's getting very frustrating. Should I use a proper parser, maybe a CFG? The current (messy) code is at http://github.com/andreasvc/eodop/blob/master/arbobanko.py

+1  A: 

Focusing only on the example you're giving in this Q, and the Q's title about converting vertical trees to S-expressions, something like...:

import re
import sys

samp='''S
=NP
==(DT +def) the
== (N +ani) man
=VP
==V walks'''.splitlines()

relinelev = re.compile(r'(=*)(.*)')
reclean = re.compile(r'\s*\((\S+)[^)]*\)')

def clean(line):
  return reclean.sub(r'\1', line)

def reparse(tree=samp):
  stack = [-1]
  for line in tree:
    equals, rest = relinelev.match(line).groups()
    linelev = len(equals)
    while linelev < stack[-1]:
      sys.stdout.softspace = False
      print ')',
      curlev = stack.pop()
    if linelev == stack[-1]:
      sys.stdout.softspace = False
      print ')',
    else:
      stack.append(linelev)
    print '(%s' % clean(rest),
  while stack[-1] >= 0:
    sys.stdout.softspace = False
    print ')',
    stack.pop()
  print

reparse()

seems to work, and outputs

(S (NP (DT the) (N man)) (VP (V walks)))

I realize you're trying to do much more "cleaning" than I'm doing here, but that can be concentrated in the clean function, leaving reparse to deal with the Q's title. If you don't want to print as you go, but rather return the result as a string, the changes are of course quite minor:

def reparse(tree=samp):
  stack = [-1]
  result = []
  for line in tree:
    equals, rest = relinelev.match(line).groups()
    linelev = len(equals)
    while linelev < stack[-1]:
      result[-1] += ')'
      curlev = stack.pop()
    if linelev == stack[-1]:
      result[-1] += ')'
    else:
      stack.append(linelev)
    result.append('(%s' % clean(rest))
  while stack[-1] >= 0:
    result[-1] += ')'
    stack.pop()
  return ' '.join(result)
Alex Martelli
Thanks a bunch! I didn't even know about splitlines(). Is it OK if I add your code to my git repository? I'll GPL it and give you credit, of course.Your code fails when the tree has trailing leaves at the root level, a problem I also encountered in one of the x incorrect iterations of my code... I'll update the Q to reflect this.
Andreas
With your current sample and a slightly different `clean` function (which removes the parenthesized part and following spaces) my code gives `(STA:fcl (S:np (DN:pron-dem Tiaj) (H:n akuzoj)) (fA:adv certe) (P:v-fin dauxros))` which seems to be exactly what you require so I don't see in what sense it "fails when the tree has trailing leaves at the root level", whatever it means. Please edit your Q again to show an example in which my code _fails_, otherwise how can I fix it? And sure, once the code is correct (or at least its failure cases well documented) you can use it.
Alex Martelli
If you look at my example carefully you'll see what I mean: there is a period at the last line, and it should become part of the tree just before the final closing paren. But now that I think of it, it seems weird because if it's a child of the root STA:fcl, then it should be indented with one equals sign. Perhaps I should just use an ad-hoc fix to add the periods after conversion.
Andreas
Yes, I definitely recommend an ad-hoc fix since as you say it doesn't follow the same rules as other nodes -- if there's always a period at the very end, then first remove it, run my code above, then change the last closed paren into `.)` -- not too hard.
Alex Martelli
Perhaps it is because periods are not considered part of the sentence, and that both the sentence and the period are children of an implicit root node. The periods are not always there, eg. in headlines they are not (because they are only sentence fragments). Either way, thanks for your help, I consider the problem solved now.
Andreas