I have a large (~50 MB) file containing poorly formatted XML that describes documents and their properties between <item> </item> tags, and I want to extract the text of all English documents.

Python's standard XML parsing utilities (dom, sax, expat) choke on the bad formatting, and more forgiving libraries (sgmllib, BeautifulSoup) parse the entire file and take too long.

<item>
  <title>some title</title>
  <author>john doe</author>
  <lang>en</lang>
  <document> .... </document>
</item>

Does anyone know a way to extract the text between <document> </document> only when <lang> is en, without parsing the entire document?

Additional information: Why it's "poorly formatted"

Some of the documents have an element <dc:link></dc:link> which causes problems with the parsers. Python's xml.dom.minidom complains:

ExpatError: unbound prefix: line 13, column 0
A: 

You'll need an event-oriented parser, like SAX or, in .NET, System.Xml.XmlReader.
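
For Python, a minimal sketch of the event-oriented approach with the standard `xml.sax` module might look like the following. It assumes the only defect is the unbound `dc` prefix (with its default settings `xml.sax` does no namespace processing, so the prefix is treated as part of the tag name), that the items sit under a single root element, and that <lang> and <document> contain plain text only. The class and function names are made up for illustration.

```python
import xml.sax


class EnglishDocHandler(xml.sax.ContentHandler):
    """Collects the <document> text of every <item> whose <lang> is 'en'."""

    def __init__(self):
        super().__init__()
        self.docs = []      # extracted document texts
        self._tag = None    # name of the element currently open
        self._lang = []     # text pieces seen inside <lang>
        self._doc = []      # text pieces seen inside <document>

    def startElement(self, name, attrs):
        if name == "item":
            # A new item starts: forget what the previous one contained.
            self._lang, self._doc = [], []
        self._tag = name

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._tag == "lang":
            self._lang.append(content)
        elif self._tag == "document":
            self._doc.append(content)

    def endElement(self, name):
        if name == "item" and "".join(self._lang).strip() == "en":
            self.docs.append("".join(self._doc))
        self._tag = None


def extract_english_sax(data):
    """Parse XML given as bytes and return the English document texts."""
    handler = EnglishDocHandler()
    xml.sax.parseString(data, handler)
    return handler.docs
```

For a file on disk, `xml.sax.parse(path, handler)` can be used instead of `parseString`, so the file is read incrementally rather than loaded as a whole. If the input is broken in other ways (unclosed tags, junk between items), the parser will still raise an error.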

Rubens Farias
A: 

Depending on how (and how badly) the document is 'broken', it might be possible to write a simple filter in Perl or Python that fixes it enough to pass XML well-formedness tests, so that it can be loaded into a DOM or processed with XSLT.

Can you add some examples of what's wrong with the input?

Jim Garrison
Thanks for the reply! Yes, the error was in one of the elements: **ExpatError: unbound prefix: line 13, column 0**. Line 13: <dc:link>...</dc:link>. Apparently it's the `dc` namespace. I'll try to figure out how to bind this...
trope
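
One possible filter of the kind suggested above, sketched in Python: rather than binding the `dc` prefix, it rewrites `dc:` in tag names to `dc_` so the prefix disappears, then hands the result to `xml.etree.ElementTree`. This assumes the unbound prefix only appears in element tags, that it is the only well-formedness problem, and that the items sit under a single root element. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches the start of an opening or closing tag that uses the dc prefix.
DC_TAG = re.compile(r"<(/?)dc:")


def neutralize_dc_prefix(text):
    """Turn <dc:link> / </dc:link> into <dc_link> / </dc_link>."""
    return DC_TAG.sub(r"<\1dc_", text)


def extract_english_filtered(text):
    """Return the <document> text of every <item> whose <lang> is 'en'."""
    root = ET.fromstring(neutralize_dc_prefix(text))
    return [
        item.findtext("document")
        for item in root.iter("item")
        if (item.findtext("lang") or "").strip() == "en"
    ]
```

Note that this still parses the whole file, just with the faster expat-based parser instead of a forgiving one.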
A: 

If you are OK with Java, I think VTD-XML would handle those undefined prefixes without any issues.

vtd-xml-author
A: 

If you have gawk:

gawk 'BEGIN{
 RS="</item>"            # each record is one <item> block (plus any junk before it)
 startpat="<document>"
 endpat="</document>"
 lpat=length(startpat)
}
/<lang>en<\/lang>/{      # only handle records marked as English
    match($0,startpat)
    start=RSTART
    match($0,endpat)
    end=RSTART
    # print the text between the two tags
    print substr($0,start+lpat,end-(start+lpat))
}' file

Output:

$ more file
Junk
Junk
<item>
  <title>some title</title>
  <author>john doe</author>
  <lang>en</lang>
  <document> text
         i want blah ............  </document>
</item>
junk
junk
<item>
  <title>some title</title>
  <author>jane doe</author>
  <lang>ch</lang>
  <document> junk text
           ..       ............ </document>
</item>
junk
blahblah..
<item>
  <title>some title</title>
  <author>GI joe</author>
  <lang>en</lang>
  <document>  text i want ..... in one line  </document>
</item>
aksfh
aslkfj
dflkas

$ ./shell.sh
 text
         i want blah ............
  text i want ..... in one line
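
Since the question mentions Python, here is a rough sketch of the same record-splitting idea in Python. It is an illustration rather than a translation of the gawk script: the function name and the `chunk_size` parameter are made up, and it assumes that `</item>`, `<lang>en</lang>`, `<document>` and `</document>` appear literally in the text, exactly as in the sample.

```python
import io


def english_documents(stream, chunk_size=1 << 16):
    """Yield the text between <document> and </document> for English items."""
    start_tag, end_tag = "<document>", "</document>"
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last "</item>" is a list of complete records;
        # the remainder stays in the buffer until more text arrives.
        *records, buffer = buffer.split("</item>")
        for record in records:
            if "<lang>en</lang>" not in record:
                continue
            start = record.find(start_tag)
            end = record.find(end_tag)
            if start != -1 and end > start:
                yield record[start + len(start_tag):end]


if __name__ == "__main__":
    demo = io.StringIO(
        "junk<item><lang>en</lang><document> text i want </document></item>junk"
    )
    for text in english_documents(demo):
        print(text)
```

Reading in chunks keeps memory use small, although a 50 MB file would also fit in memory in one piece. With a real file, pass an open text-mode file object instead of the `io.StringIO` used in the demo.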
ghostdog74
Thank you very much - this is exactly what I was looking for. Something that doesn't care about the XML convention.
trope
A: 

Try extracting the text with PilotEdit, http://www.pilotedit.com

  1. Open the XML file

<batitemhere>
<item>
  <title>some title</title>
  <author>john doe</author>
  <lang>en</lang>
  <document>document 1</document>
</item>
<item>
  <title>some title</title>
  <author>john doe
  <lang>en</lang>
  <document> document content </document>
</item>
<item>
    <titlebad>some title</title>
  <author bad>john doe
  <title>some title</title>
  <author>john doe
  <lang>en</lang>
  <document> multiline 
  text here </document>
</item>
  2. Click the "Sort" button on the toolbar
  3. Select the radio button "Compare over a string defined by regular expression" and fill in the following regular expression and target string:

Regular Expression: <item>[]*<lang>en</lang>[]*<document>[]*</document>[]*</item>

Target String: %06

  4. Click the "Copy target string to clipboard" button to extract the following text:

document 1

document content

multiline

text here
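
A comparable single-pattern extraction can be sketched with Python's `re` module. This is not PilotEdit's syntax, only an approximation of the same idea: the pattern and the function name are assumptions for illustration, and the lookahead `(?!</item>)` is there so that a match cannot run from one item into the next.

```python
import re

# <item> ... <lang>en</lang> ... <document>TEXT</document>, all within one item.
ENGLISH_DOC = re.compile(
    r"<item>(?:(?!</item>).)*?<lang>en</lang>"
    r"(?:(?!</item>).)*?<document>(.*?)</document>",
    re.DOTALL,
)


def english_docs_regex(text):
    """Return the <document> text of every item that contains <lang>en</lang>."""
    return ENGLISH_DOC.findall(text)
```

Like the PilotEdit expression, this expects <lang> to come before <document> inside each item, and it tolerates unclosed tags elsewhere in the item because nothing is actually parsed as XML.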

Dracoder