I have to deal with a directory of about 2 million XML files that need to be processed.

I've already solved the processing by distributing the work between machines and threads using queues, and everything works fine.

But now the big problem is the bottleneck of reading the directory with the 2 million files in order to fill the queues incrementally.

I've tried using the File.listFiles() method, but it gives me a java.lang.OutOfMemoryError: Java heap space. Any ideas?

+1  A: 

At first you could try to increase the memory of your JVM by passing e.g. -Xmx1024m.
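For example, the flag goes on the command line that launches the processing job (the main class name here is hypothetical):

java -Xmx1024m com.example.XmlProcessor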

InsertNickHere
I have a feeling this won't fix the problem, and the JVM will just run out of memory *slightly* later.
Piskvor
@Piskvor If so, I guess there is no way to solve this issue. Whatever you use to parse the OS file system will need a certain amount of bytes - with 2 million files this can quickly become too much.
InsertNickHere
@InsertNickHere: you don't need to keep all your data in RAM at the same time.
Piskvor
+3  A: 

Why do you store 2 million files in the same directory anyway? I can imagine it slows down access terribly on the OS level already.

I would definitely want to have them divided into subdirectories (e.g. by date/time of creation) before processing. But if that is not possible for some reason, could it be done during processing? E.g. move 1000 files queued for Process1 into Directory1, another 1000 files for Process2 into Directory2, etc. Then each process/thread sees only the (limited number of) files portioned for it.
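A minimal sketch of that partitioning idea, assuming the files can simply be renamed into per-batch subdirectories with plain java.io (the directory and batch names are made up for illustration):

import java.io.File;

public class Partitioner {
    // Moves the entries of a flat directory into numbered batch subdirectories
    // of at most batchSize files each, using only java.io (no Java 7 needed).
    public static void partition(File sourceDir, int batchSize) {
        String[] names = sourceDir.list();   // names only, cheaper than File objects
        if (names == null) {
            throw new IllegalArgumentException("Not a directory: " + sourceDir);
        }
        for (int i = 0; i < names.length; i++) {
            File batchDir = new File(sourceDir, "batch" + (i / batchSize));
            batchDir.mkdirs();               // create the batch directory on demand
            new File(sourceDir, names[i]).renameTo(new File(batchDir, names[i]));
        }
    }
}

Once portioned like this, each worker only ever needs to list its own, small batch directory.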

Péter Török
Dividing them is a problem in its own right. I'm thinking about that as well, using OS/bash functions. It is not possible to do it while processing, because the exception comes when trying to list the directory programmatically.
Fgblanch
A: 

Please post the full stack trace of the OOM exception to identify where the bottleneck is, as well as a short, complete Java program showing the behaviour you see.

It is most likely because you collect all of the two million entries in memory, and they don't fit. Can you increase heap space?

Thorbjørn Ravn Andersen
+8  A: 
aioobe
Java 7 is not an option right now. Currently I'm trying the filter option. Thankfully the files have a hierarchy written in the filename, so this option could work.
Fgblanch
@aioobe, effectively it didn't work. I've found the filenames are "guessable" :) so I'll do it the other way around: generate the filenames and then go to the folder and try to reach them. Thanks a lot for your help.
Fgblanch
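A sketch of that reversed approach, assuming a purely hypothetical naming scheme (doc<N>.xml); the real pattern would be whatever hierarchy is encoded in the actual filenames:

import java.io.File;

public class NameGenerator {
    // Instead of listing the huge directory, generate candidate names and
    // probe for each file directly; File.exists() never loads a listing.
    public static void enqueueExisting(File dir, long maxId) {
        for (long id = 0; id < maxId; id++) {
            File candidate = new File(dir, "doc" + id + ".xml");   // hypothetical pattern
            if (candidate.exists()) {
                // hand the file over to the processing queue here
            }
        }
    }
}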
+1  A: 

Use File.list() instead of File.listFiles() - the String objects it returns consume less memory than the File objects, and (more importantly, depending on the location of the directory) they don't contain the full path name.

Then, construct File objects as needed when processing the result.

However, this will not work for arbitrarily large directories either. It's an overall better idea to organize your files in a hierarchy of directories so that no single directory has more than a few thousand entries.
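A minimal sketch of that approach (the directory path is hypothetical):

import java.io.File;

public class ListNames {
    public static void main(String[] args) {
        File dir = new File("/data/xml");       // hypothetical location
        String[] names = dir.list();            // plain Strings, much lighter than File objects
        if (names == null) {
            System.err.println("Not a directory: " + dir);
            return;
        }
        for (String name : names) {
            File f = new File(dir, name);       // build the File only when it is actually processed
            // enqueue f for processing here
        }
    }
}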

Michael Borgwardt
A: 

If the file names follow certain rules, you can use File.list(filter) instead of File.listFiles() to get manageable portions of the file listing.
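For example, assuming the names share a common, known prefix per portion (the prefix rule here is hypothetical):

import java.io.File;
import java.io.FilenameFilter;

public class FilteredListing {
    // Lists only the entries whose names start with the given prefix,
    // so each pass over the directory returns a manageable portion.
    public static String[] listPortion(File dir, final String prefix) {
        return dir.list(new FilenameFilter() {
            public boolean accept(File d, String name) {
                return name.startsWith(prefix);   // adapt to the real naming rules
            }
        });
    }
}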

atzz
A: 

Try this, it works for me, but I didn't have so many documents...

File dir = new File("directory");
String[] children = dir.list();
if (children == null) {
  // Either dir does not exist or is not a directory
  System.out.print("Directory doesn't exist\n");
} else {
  for (int i = 0; i < children.length; i++) {
    // Get filename of file or directory
    String filename = children[i];
  }
}
mujer esponja
+1  A: 

This is untested and an absolute hack, but you might want to try something like this anyway:

Process process = Runtime.getRuntime().exec(new String[]{"ls", "/path"});
BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
String line;
while (null != (line = reader.readLine())) {
    // each line is one file name; hand it to the processing queue here
}
Jörn Horstmann