ansaurus

Question

Extract text from particular elements of a large poorly formatted XML file

Answer 1

A:

You'll need an event-oriented parser, such as SAX or, in .NET, System.Xml.XmlReader.
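As a rough illustration of the event-driven approach, here is a minimal Python sketch using the standard library's `xml.sax`. The element names (`item`, `lang`, `document`) are taken from the sample data posted elsewhere in this thread; the class and function names are made up for the example. Note that SAX still requires well-formed input, so this only helps once the file parses at all.

```python
import xml.sax


class EnglishDocumentHandler(xml.sax.ContentHandler):
    """Collect <document> text from every <item> whose <lang> is 'en'."""

    def __init__(self):
        super().__init__()
        self.documents = []   # text of the matching <document> elements
        self._current = None  # name of the element currently being read
        self._fields = {}     # text gathered so far for the current <item>

    def startElement(self, name, attrs):
        if name == "item":
            self._fields = {}
        self._current = name

    def characters(self, content):
        # characters() may fire several times per element, so append.
        if self._current in ("lang", "document"):
            self._fields[self._current] = (
                self._fields.get(self._current, "") + content
            )

    def endElement(self, name):
        if name == "item" and self._fields.get("lang", "").strip() == "en":
            self.documents.append(self._fields.get("document", "").strip())
        self._current = None


def extract_english_documents(xml_text):
    """Hypothetical helper: parse a string and return the matching texts."""
    handler = EnglishDocumentHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.documents
```

For a large file, `xml.sax.parse(path, handler)` streams from disk instead of loading everything into memory, which is the main reason to prefer this style over a DOM.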

Rubens Farias 2009-11-10 20:17:31

Answer 2

A:

Depending on how (and how badly) the document is 'broken', it might be possible to write a simple filter in Perl or Python that fixes it enough to pass XML well-formedness checks, so that it can be loaded into a DOM or processed with XSLT.

Can you add some examples of what's wrong with the input?

Jim Garrison 2009-11-10 20:18:11

Thanks for the reply! Yes, the error was in one of the attributes: **ExpatError: unbound prefix: line 13, column 0**. Line 13 is `<dc:link>...</dc:link>`, so apparently the problem is the `dc` namespace. I'll try to figure out how to bind this...
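One possible way to bind the prefix, sketched in Python under a few assumptions: the file has a single root element, `dc` is meant to be Dublin Core (any URI would satisfy the parser), and nothing else is wrong with the input. The helper name `bind_prefix` is hypothetical. It is the kind of small pre-parse filter suggested above, not a general repair tool.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: "dc" refers to Dublin Core. The parser only needs *some* URI.
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"


def bind_prefix(xml_text, prefix="dc", uri=DC_NAMESPACE):
    """Declare `prefix` on the first start tag so the parser can resolve it."""
    declaration = f' xmlns:{prefix}="{uri}"'
    # The pattern requires a name character right after "<", so it skips
    # "<?xml ...?>", "<!-- ... -->" and "<!DOCTYPE ...>" and hits the root tag.
    return re.sub(
        r"<[A-Za-z_][\w.:\-]*",
        lambda m: m.group(0) + declaration,
        xml_text,
        count=1,
    )


def parse_with_bound_prefix(xml_text):
    """Parse text that uses an undeclared dc: prefix."""
    return ET.fromstring(bind_prefix(xml_text))
```

Elements that used the prefix then show up with the namespace in their tag, for example `{http://purl.org/dc/elements/1.1/}link`.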

trope 2009-11-10 20:27:53

Answer 3

A:

I think that if you are OK with Java, VTD-XML would work without any issues caused by those undefined prefixes...

vtd-xml-author 2009-11-11 01:00:01

Answer 4

A:

If you have gawk:

gawk 'BEGIN{
 RS="</item>"               # read one <item> record at a time
 startpat="<document>"
 endpat="</document>"
 lpat=length(startpat)
}
/<lang>en<\/lang>/{         # only records whose language is English
    match($0,startpat)
    start=RSTART+lpat       # first character after <document>
    match($0,endpat)
    stop=RSTART             # position where </document> begins
    print substr($0,start,stop-start)
}' file

Sample input and output:

$ more file
Junk
Junk
<item>
  <title>some title</title>
  <author>john doe</author>
  <lang>en</lang>
  <document> text
         i want blah ............  </document>
</item>
junk
junk
<item>
  <title>some title</title>
  <author>jane doe</author>
  <lang>ch</lang>
  <document> junk text
           ..       ............ </document>
</item>
junk
blahblah..
<item>
  <title>some title</title>
  <author>GI joe</author>
  <lang>en</lang>
  <document>  text i want ..... in one line  </document>
</item>
aksfh
aslkfj
dflkas

$ ./shell.sh
 text
         i want blah ............
  text i want ..... in one line
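For comparison, roughly the same pattern-matching idea can be sketched in Python with the standard `re` module. This is not part of the gawk answer; the function name is made up, and unlike gawk's record-by-record reading it assumes the whole file fits in memory.

```python
import re

# Non-greedy matches so each pattern stops at the nearest closing tag.
ITEM_RE = re.compile(r"<item>(.*?)</item>", re.DOTALL)
DOCUMENT_RE = re.compile(r"<document>(.*?)</document>", re.DOTALL)


def extract_documents(text, lang="en"):
    """Return the <document> text of every <item> with the given <lang>."""
    results = []
    for item in ITEM_RE.findall(text):
        if f"<lang>{lang}</lang>" not in item:
            continue  # wrong language, skip this item
        match = DOCUMENT_RE.search(item)
        if match:
            results.append(match.group(1))
    return results
```

Like the gawk version, this ignores XML rules entirely, so junk between items and unclosed tags inside an item do not stop it.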

ghostdog74 2009-11-11 01:24:10

Thank you very much - this is exactly what I was looking for. Something that doesn't care about the XML convention.

trope 2009-11-11 02:28:29

Answer 5

A:

Try extracting the text with PilotEdit (http://www.pilotedit.com):

1. Open the XML file:

<batitemhere>
<item>
  <title>some title</title>
  <author>john doe</author>
  <lang>en</lang>
  <document>document 1</document>
</item>
<item>
  <title>some title</title>
  <author>john doe
  <lang>en</lang>
  <document> document content </document>
</item>
<item>
    <titlebad>some title</title>
  <author bad>john doe
  <title>some title</title>
  <author>john doe
  <lang>en</lang>
  <document> multiline 
  text here </document>
</item>

2. Click the "Sort" button on the toolbar.
3. Select the radio button "Compare over a string defined by regular expression" and fill in the following regular expression and target string:

Regular Expression: <item>[]*<lang>en</lang>[]*<document>[]*</document>[]*</item>

Target String: %06

4. Click the "Copy target string to clipboard" button to extract the following text:

document 1

document content

multiline

text here

Dracoder 2009-11-11 12:44:15
