I have some ebooks in XML format. The books' pages are marked using processing instructions (e.g. <?pg 01?>). I need to extract the content of the book in plain text, one page at a time, and save each page as a text file. What's the best way of doing this?
I would probably use Castor to do this. It's a Java tool that allows you to specify bindings to Java objects, which you can then output as text to a file.
You need an ebook renderer for the format your books are in (and I highly doubt that it's XML if they use backslashes as processing instructions). Also, XPath works wonders if all you want to do is get the actual text: simply use //text() to select all the text nodes.
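As a rough illustration of the //text() suggestion, here is a minimal sketch using the JDK's built-in javax.xml.xpath API. The class name AllText and the method extract are made up for this example. Note that it returns the whole book as one string: the page markers are processing instructions, not text nodes, so the page boundaries are lost.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: concatenates every text node of an XML document.
class AllText {
    static String extract(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // "//text()" selects all text nodes in document order.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//text()",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getNodeValue());
            }
            return sb.toString();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }
}
```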
You could try converting it to YAML and editing it in a word processor--then a simple macro should fix it right up.
I just browsed for this XML to YAML conversion utility--it's small but I didn't test it or anything.
http://svn.pyyaml.org/pyyaml-legacy/trunk/experimental/XmlYaml/convertyaml_map.py
Use an XSL stylesheet with <xsl:output method="text"/>.
You can even debug stylesheets in Eclipse nowadays.
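A minimal sketch of that idea, run from Java through the JDK's javax.xml.transform API. The stylesheet is only an assumption about what such a transform could look like: it relies on the built-in template rules to copy text and adds one template that prints a "[page NN]" marker for each pg processing instruction. Since XSLT 1.0 writes a single output document, the result still has to be split into files afterwards. The class name XsltText and the marker format are invented for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies a text-output XSLT 1.0 stylesheet to a book.
class XsltText {
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        // Emit a marker line wherever a <?pg NN?> instruction occurs.
        + "<xsl:template match=\"processing-instruction('pg')\">"
        + "<xsl:text>&#10;[page </xsl:text>"
        + "<xsl:value-of select=\".\"/>"
        + "<xsl:text>]&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String toText(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```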
The easiest way, assuming you need to integrate this into a Java program (as the tag implies), is probably to use a SAX parser, such as the one XMLReader provides. You write a ContentHandler with callbacks for text and processing instructions.
When your p-i handler is called, you open a new output file.
When your text handler is called, you copy the character data to the currently open output file.
This tutorial has some helpful example code.
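The handler described above might look something like the sketch below. It is only one way to do it, under a few assumptions: the class name PageSplitter, the page-NN.txt file naming, and the choice to buffer each page in memory and write the files at the end (rather than opening a file inside the p-i callback) are all mine. Text that appears before the first pg instruction is dropped.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data per <?pg NN?> marker.
class PageSplitter extends DefaultHandler {
    // Page label -> accumulated text, kept in document order.
    private final Map<String, StringBuilder> pages = new LinkedHashMap<>();
    private StringBuilder current; // null until the first page marker

    @Override
    public void processingInstruction(String target, String data) {
        if ("pg".equals(target)) {
            String label = data == null ? "" : data.trim();
            // A new marker starts (or resumes) the buffer for that page.
            current = pages.computeIfAbsent(label, k -> new StringBuilder());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    // Parses the source and returns the text of each page, keyed by label.
    static Map<String, String> split(InputSource source) {
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            PageSplitter handler = new PageSplitter();
            reader.setContentHandler(handler);
            reader.parse(source);
            Map<String, String> result = new LinkedHashMap<>();
            handler.pages.forEach((label, text) -> result.put(label, text.toString()));
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the book", e);
        }
    }

    // Writes one UTF-8 text file per page, e.g. page-01.txt (assumed naming).
    static void writePages(Map<String, String> pages, Path dir) {
        try {
            Files.createDirectories(dir);
            for (Map.Entry<String, String> page : pages.entrySet()) {
                Path file = dir.resolve("page-" + page.getKey() + ".txt");
                Files.write(file, page.getValue().getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a file called book.xml (a placeholder name), usage would be along the lines of PageSplitter.writePages(PageSplitter.split(new InputSource("book.xml")), Paths.get("pages")). For very large books, writing to a file directly from the callbacks, as described above, avoids holding every page in memory.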
However if you don't need to integrate this into a Java program, I might use XSLT 2.0 (Saxon is free). XSLT 1.0 will not allow multiple output documents, but XSLT 2.0 will, and it will also make grouping by "milestone markup" (your "pg" processing instructions) easier. If you're interested in this approach, just ask... and give more info about the structure of the input document.
P.S. Even if you do need to integrate this into a Java program, you can call XSLT from Java; Saxon, for example, is written in Java. However, I think if you're just processing PIs and text, it would be less effort to use a SAX parser.