
I currently have a Java SAX parser that is extracting some info from a 30GB XML file. Presently it is:

  • reading each XML node,
  • storing it into a String object,
  • running some regexes on the string,
  • storing the results to the database,

for several million elements. I'm running this on a computer with 16GB of memory, but the memory is not being fully utilized. Is there a simple way to dynamically 'buffer' about 10GB worth of data from the input file? I suspect I could hand-roll a producer/consumer multithreaded version of this (loading the objects on one side, using them and discarding on the other), but damnit, XML is ancient now, are there no efficient libraries to crunch 'em?
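
Roughly what I have in mind for the producer/consumer version (a minimal sketch using a bounded BlockingQueue; the names, queue capacity, and sentinel are illustrative, not a real implementation):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class Pipeline {
    // unique sentinel object that tells the consumer to stop (poison pill)
    private static final String POISON = new String("EOF");

    public static void main(String[] args) throws InterruptedException {
        // bounded queue: the SAX producer blocks when the consumer falls behind,
        // which caps memory use instead of buffering the whole 30GB file
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        Thread consumer = new Thread(() -> {
            try {
                String text;
                while ((text = queue.take()) != POISON) {
                    // run the regexes on text and store the results to the database
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // the SAX handler's endElement() would call queue.put(nodeText) per node
        queue.put("example node text");
        queue.put(POISON);
        consumer.join();
    }
}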

+2  A: 

No Java experience, sorry, but maybe you should change the parser? SAX should work sequentially and there should be no need to buffer most of the file ...

gabr
+1  A: 

I don't really understand what you're trying to do with this huge amount of XML, but I get the impression that

  • using XML was wrong for the data stored
  • you are buffering way beyond what you should do (and you are giving up all advantages of SAX parsing by doing so)

Apart from that: XML is not ancient and is in massive and active use. What do you think all those interactive web sites are using for their interactive elements?

Thorsten79
By ancient I meant to say 'been around for a while' -- and was hoping to find a library that mines it faster.
Achille
A: 

I would suggest first importing your massive XML file into a native XML database (such as eXist if you are looking for open-source options; never tested it myself), and then performing iterative paged queries to process your data a small chunk at a time.

Joannes Vermorel
+2  A: 

SAX is, essentially, "event driven", so the only state you should be holding on to from element to element is state that is relevant to that element, rather than to the document as a whole. What other state are you maintaining, and why? As each "complete" node (or set of nodes) comes by, you should be discarding it.
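
A rough sketch of that per-element lifecycle in a handler (the actual processing is left as a comment):

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PerElementHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0); // reset per-element state; nothing outlives the element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // accumulate only the current element's text
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // process text.toString() here (regexes, database write), then let it go
    }
}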

Will Hartung
I am discarding them; I am simply trying to parse the data faster.
Achille
+1  A: 

Are you being slowed down by multiple small commits to your db? It sounds like you would be writing to the database almost all the time from your program, and making sure you don't commit too often could improve performance. Preparing your statements and other standard bulk-processing tricks could also help.
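
A minimal sketch of batched inserts with a PreparedStatement and manual commits; the connection URL, table, and batch size below are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsert {
    public static void main(String[] args) throws Exception {
        // "jdbc:..." stands in for your real connection string
        try (Connection conn = DriverManager.getConnection("jdbc:...", "user", "pass")) {
            conn.setAutoCommit(false); // avoid one commit per row
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO results (value) VALUES (?)")) {
                for (int i = 0; i < 1_000_000; i++) {
                    ps.setString(1, "extracted value " + i);
                    ps.addBatch();
                    if ((i + 1) % 10_000 == 0) { // flush and commit in chunks
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch(); // flush the final partial batch
                conn.commit();
            }
        }
    }
}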

Other than this early comment, we need more info - do you have a profiler handy that can show what makes things run slowly?

Oskar
I temporarily removed the database operations and it still takes a long time. Any suggestion on a Java profiler?
Achille
sorry, not my world. Add that to your original post and enter a new question asking specifically for a Java profiler (I'm sure the gurus are online and waiting) - good luck!
Oskar
A: 

You may want to try StAX instead of SAX; I hear it's better for that sort of thing (I haven't used it myself).
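
For what it's worth, a minimal StAX (pull-parsing) sketch; the file name and element name are placeholders, and getElementText() assumes the element holds only text:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxExample {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(
                new BufferedInputStream(new FileInputStream("huge.xml")));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "record".equals(reader.getLocalName())) {
                String text = reader.getElementText(); // consumes up to the end tag
                // run the regexes on text, write to the database, then discard it
            }
        }
        reader.close();
    }
}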

SCdF
+2  A: 

First, try to find out what's slowing you down.

  • How much faster is the parser when you parse from memory?
  • Does using a BufferedInputStream with a large size help?

Is it easy to split up the XML file? In general, shuffling through 30 GiB of any kind of data will take some time, since you have to load it from the hard drive first, so you are always limited by disk speed. Can you distribute the load to several machines, maybe by using something like Hadoop?
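
A minimal sketch of the buffered-stream suggestion using the standard JAXP SAX API; the 64MB buffer size is just an example to tune:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class BufferedParse {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // 64MB buffer; the size parameter is an int, so it is capped below 2GB
        try (BufferedInputStream in = new BufferedInputStream(
                new FileInputStream("huge.xml"), 64 * 1024 * 1024)) {
            parser.parse(in, new DefaultHandler() { /* your callbacks here */ });
        }
    }
}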

Torsten Marek
Increasing the BufferedInputStream size did indeed help, but I am limited to 2GB (the buffer size parameter is a signed int).
Achille
Are you using a 32-bit or 64-bit system?
Torsten Marek
+1  A: 

You can use the JiBX library and bind your XML "nodes" to objects that represent them. You can even overload an ArrayList; then, when x number of objects are added, perform the regexes all at once (presumably using the method on your object that performs this logic) and then save them to the database, before allowing the "add" method to finish once again.

JiBX is hosted on SourceForge: JiBX

To elaborate: you can bind your XML as a "collection" of these specialized String holders. Because you define this as a collection, you must choose what collection type to use. You can then specify your own ArrayList implementation.

Override the add method as follows (note that ArrayList.add returns boolean, so the override must as well):

public boolean add(Object o) {
    boolean added = super.add(o);
    if (size() >= YOUR_DEFINED_THRESHOLD) {
        flushObjects();
    }
    return added;
}

YOUR_DEFINED_THRESHOLD is how many objects you want to store in the ArrayList before it has to be flushed out to the database. flushObjects() is simply the method that performs this logic. The method will block the addition of objects from the XML file until this process is complete. However, this is OK; the overhead of the database will probably be much greater than that of file reading and parsing anyway.
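
Putting that together, a self-contained sketch; the class name and threshold are illustrative, and flushObjects() stands in for the regex-plus-database logic:

import java.util.ArrayList;

public class FlushingList extends ArrayList<String> {
    private static final int YOUR_DEFINED_THRESHOLD = 10_000; // illustrative value

    @Override
    public boolean add(String element) {
        boolean added = super.add(element);
        if (size() >= YOUR_DEFINED_THRESHOLD) {
            flushObjects();
        }
        return added;
    }

    private void flushObjects() {
        // run the regexes and write the results to the database here...
        clear(); // ...then discard the processed elements
    }
}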

MetroidFan2002
+3  A: 
  1. Just to cover the bases, is Java able to use your 16GB? You (obviously) need to be on a 64-bit OS, and you need to run Java with -d64 -Xmx10g (or however much memory you want to allocate to it).

  2. It is highly unlikely that memory is a limiting factor for what you're doing, so you really shouldn't see it fully utilized. You should be either IO-bound or CPU-bound. Most likely it'll be IO. If it is IO, make sure you're buffering your streams, and then you're pretty much done; the only thing you can do is buy a faster hard drive.

  3. If you really are CPU-bound, it's possible that you're bottlenecking at the regexes rather than the XML parsing (see the sketch after this list).

    See this (which references this)

  4. If your bottleneck is at SAX, you can try other implementations. Off the top of my head, I can think of the following alternatives:

    • StAX (there are multiple implementations; Woodstox is one of the fastest)
    • Javolution
    • Roll your own using JFlex
    • Roll your own ad hoc, e.g. using regex

    For the last two, the more constrained your XML subset is, the more efficient you can make it.

  5. It's very hard to say, but as others mentioned, an XML-native database might be a good alternative for you. I have limited experience with those, but I know that at least Berkeley DB XML supports XPath-based indices.
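
On point 3, a small illustration of the usual regex fix: compile the Pattern once and reuse it, rather than recompiling per node (the pattern itself is made up):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexReuse {
    // compiled once; String.matches() would recompile an equivalent pattern per call
    private static final Pattern ID = Pattern.compile("id=(\\d+)");

    static String extractId(String nodeText) {
        Matcher m = ID.matcher(nodeText);
        return m.find() ? m.group(1) : null;
    }
}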

ykaganovich
A: 

If the data in the XML is order-independent, can you multi-thread the process to split the file up, or run multiple processes starting at different locations in the file? If you're not I/O-bound, that should help speed it along.
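
A sketch of the multi-threaded variant, assuming the big file has already been split into independent, well-formed chunk files (the directory name is a placeholder):

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ParallelChunks {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        File[] chunks = new File("chunks").listFiles(); // pre-split, well-formed files
        for (File chunk : chunks) {
            pool.submit(() -> {
                try {
                    // SAXParser is not thread-safe, so create one per task
                    SAXParserFactory.newInstance().newSAXParser()
                            .parse(chunk, new DefaultHandler() { /* callbacks */ });
                } catch (Exception e) {
                    e.printStackTrace(); // real code should handle failures properly
                }
            });
        }
        pool.shutdown();
    }
}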

18Rabbit