I'm looking for a neat and efficient way to replace characters in an XML document. There is a replacement table defined for almost 12,000 UTF-8 characters. Most of them are to be replaced by a single character, but some must be replaced by two or even three characters (e.g. Greek theta should become TH). The documents can be bulky (100 MB+). How can I do this in Java? I came up with the idea of using XSLT, but I'm not sure it is the best option.
A:
Have a look at SAX, which lets you see each individual part of the XML document as it passes by. You can then act on the text nodes and do the manipulation you need.
The problem with XSLT is that most implementations need the whole input tree in memory, which is typically around 10 times the size on disk. The only streaming XSLT processor I know of is the commercial edition of Saxon (but that would be perfect for your needs).
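A minimal sketch of the SAX approach, under some assumptions: the class name `ReplacingFilter` and the helper `transform` are hypothetical, the table is a plain `Map<Character, String>`, and the output is written through the JDK's identity `Transformer` fed from a `SAXSource`. For a real 100 MB file you would pass streams instead of strings. Only `characters()` events are rewritten here; attribute values are left untouched.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: sits between the parser and the serializer and
// rewrites text content one chunk at a time, so no tree is ever built.
class ReplacingFilter extends XMLFilterImpl {
    private final Map<Character, String> table;

    ReplacingFilter(XMLReader parent, Map<Character, String> table) {
        super(parent);
        this.table = table;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        StringBuilder sb = new StringBuilder(length);
        for (int i = start; i < start + length; i++) {
            String replacement = table.get(ch[i]);
            if (replacement != null) {
                sb.append(replacement);   // may be 1, 2 or 3 characters long
            } else {
                sb.append(ch[i]);         // not in the table: keep as-is
            }
        }
        char[] out = sb.toString().toCharArray();
        super.characters(out, 0, out.length);
    }

    // Illustrative helper: parse the input through the filter and serialize the result.
    static String transform(String xml, Map<Character, String> table) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        ReplacingFilter filter = new ReplacingFilter(reader, table);

        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        identity.transform(
                new SAXSource(filter, new InputSource(new StringReader(xml))),
                new StreamResult(writer));
        return writer.toString();
    }
}
```

Because the filter only sees one chunk of text at a time, memory use should stay roughly constant regardless of document size.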
Thorbjørn Ravn Andersen
2010-05-19 13:14:00
+3
A:
In my experience, String.replace(..) is very slow. I used to parse 100 MB KML files using that API and the performance was just bad. Then I pre-compiled the regular expression using Pattern.compile(..), and that worked a whole lot faster.
limc
2010-05-19 13:17:42
Good point. Blinded by the fact that there's no need to treat it as XML, I entirely forgot to think about the best solution for the Java part of it.
Paul Butcher
2010-05-19 13:22:44
As I said, the replacement is not a straightforward 'foo' to 'bar'. There is a big mapping table that contains 12,000 replacements. This is why I was considering loading the mapping into a HashMap<Character,String> and then checking each character of the text content of the XML elements against that map. What about that?
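A rough sketch of that idea, assuming a hypothetical `CharTable` class that wraps the map. Note that a `Character` key only covers the Basic Multilingual Plane; if the table contained supplementary characters, a map keyed by code point (`Map<Integer, String>`) would be needed instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper around the 12,000-entry replacement table.
class CharTable {
    private final Map<Character, String> table = new HashMap<>();

    void put(char from, String to) {
        table.put(from, to);
    }

    // One hash lookup per input character; replacements may be longer than one char.
    String apply(CharSequence text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = table.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With `put('\u03b8', "TH")`, calling `apply("\u03b8eta")` yields `"THeta"`.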
pregzt
2010-05-19 13:27:09
Yes, I think you need that mapping table, whether it's in a map or stored in a database with Hibernate handling the caching for you. Perhaps you could have a precompiled regex that first scans the entire XML for non-alphanumeric characters. Then, for each of these characters, check the map to see whether it has an entry, and if it does, create another precompiled regex to perform that particular character replacement. Maybe this is not the best solution, but I'm just throwing out ideas here.
limc
2010-05-19 13:37:41
To add a little more to my strategy above, the whole point is not to loop through the entire 12K keys if it is not needed.
limc
2010-05-19 13:40:07
Thanks, limc. This is what we've decided to do. We'll use a regex to check whether the text values contain any nonstandard characters and then perform the replacement using the in-memory mapping.
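Roughly, that could look like the sketch below. It assumes "nonstandard" means non-ASCII, and the class name `NonAsciiReplacer` is made up for illustration. The precompiled pattern gives a fast path for text with nothing to replace, and only the matched characters are looked up in the map.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex pre-check first, map lookup only for matched characters.
class NonAsciiReplacer {
    // Assumption: every character outside ASCII is a replacement candidate.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\p{ASCII}]");

    static String replace(String text, Map<Character, String> table) {
        Matcher m = NON_ASCII.matcher(text);
        if (!m.find()) {
            return text; // fast path: nothing to replace, no copy made
        }
        StringBuffer out = new StringBuffer(text.length());
        do {
            String match = m.group();
            String replacement = table.get(match.charAt(0));
            // Characters missing from the table are copied through unchanged.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(replacement != null ? replacement : match));
        } while (m.find());
        m.appendTail(out);
        return out.toString();
    }
}
```

`Matcher.quoteReplacement` is used so that a replacement string containing `$` or `\` is inserted literally.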
pregzt
2010-05-19 13:42:33