Hi,

I’m working with Lucene 2.4.0 on JDK 1.6.0_07. I consistently receive “OutOfMemoryError: Java heap space” when trying to index large text files.

Example 1: Indexing a 5 MB text file runs out of memory with a 64 MB max. heap size. So I increased the max. heap size to 512 MB. This worked for the 5 MB text file, but Lucene still used 84 MB of heap space to do this. Why so much?

The class FreqProxTermsWriterPerField appears to be the biggest memory consumer by far according to JConsole and the TPTP Memory Profiling plugin for Eclipse Ganymede.

Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB max. heap size. Increasing the max. heap size to 1024 MB works, but Lucene uses 826 MB of heap space while indexing it. That still seems like far too much memory. I’m sure larger files would trigger the error as well, since memory usage appears to scale with file size.

I’m on a Windows XP SP2 platform with 2 GB of RAM. So what is the best practice for indexing large files? Here is a code snippet that I’m using:

// Index the content of a text file.
private Boolean saveTXTFile(File textFile, Document textDocument) throws MyException {

    try {
        Boolean isFile = textFile.isFile();
        Boolean hasTextExtension = textFile.getName().endsWith(".txt");

        if (isFile && hasTextExtension) {
            System.out.println("File " + textFile.getCanonicalPath() + " is being indexed");
            Reader textFileReader = new FileReader(textFile);
            if (textDocument == null)
                textDocument = new Document();
            textDocument.add(new Field("content", textFileReader));
            indexWriter.addDocument(textDocument);   // BREAKS HERE!!!!
        }
    } catch (FileNotFoundException fnfe) {
        System.out.println(fnfe.getMessage());
        return false;
    } catch (CorruptIndexException cie) {
        throw new MyException("The index has become corrupt.");
    } catch (IOException ioe) {
        System.out.println(ioe.getMessage());
        return false;
    }
    return true;
}

Thanks much,

Paul

+1  A: 

Profiling is the only way to determine the cause of such large memory consumption.

Also, in your code you are not closing the file handles, IndexReaders, and IndexWriters, which is perhaps the culprit for the OOM.

Narayan
I'm using JConsole and TPTP profiling for Eclipse. When I try to index the 5 MB file with a 64 MB maximum heap space, I run out of memory very quickly.
Paul Murdoch
A: 

How do you initialize indexWriter? Are you using a RAMDirectory? If you are, try moving to an FSDirectory. The paper Scaling Lucene and Solr may also help.
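For reference, a minimal sketch of opening the writer over an on-disk FSDirectory with Lucene 2.4 might look like the following; the class name and index path are placeholders:

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.FSDirectory;

public class FsDirectoryExample {
    public static void main(String[] args) throws IOException {
        // Open an on-disk index rather than a RAMDirectory (the path is a placeholder).
        FSDirectory dir = FSDirectory.getDirectory("C:/lucene-index");
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true, MaxFieldLength.UNLIMITED);
        // ... add documents here ...
        writer.close();
    }
}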

Yuval F
Here is how I initialize the IndexWriter: indexWriter = new IndexWriter(indexDirectory, new StandardAnalyzer(), createFlag, MaxFieldLength.UNLIMITED); indexWriter.setMergeScheduler(new org.apache.lucene.index.SerialMergeScheduler()); indexWriter.setRAMBufferSizeMB(32); indexWriter.setMergeFactor(1000); indexWriter.setMaxFieldLength(Integer.MAX_VALUE); indexWriter.setUseCompoundFile(false); indexWriter.close(); I'm not using RAMDirectory. Thanks.
Paul Murdoch
Sorry about the formatting. I have to learn how to repost instead of comment. I don't use RAMDirectory.
Paul Murdoch
A: 

You can set the IndexWriter to flush based on memory usage or on the number of documents - I would suggest setting it to flush based on memory and seeing if this fixes your issue. My guess is that your entire index is living in memory because you never flush it to disk.
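A minimal sketch of the two flush triggers mentioned above, assuming Lucene 2.4's IndexWriter API; the 48 MB and 1000-document values are arbitrary examples to tune:

import org.apache.lucene.index.IndexWriter;

public class FlushSettingsExample {
    // Configure when the writer flushes its in-memory buffer to disk.
    public static void configureFlushing(IndexWriter writer) {
        writer.setRAMBufferSizeMB(48);   // flush once buffered documents reach ~48 MB ...
        writer.setMaxBufferedDocs(1000); // ... or after 1000 buffered documents, whichever comes first
    }
}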

Gandalf
Here is how I initialize the IndexWriter: indexWriter = new IndexWriter(indexDirectory, new StandardAnalyzer(), createFlag, MaxFieldLength.UNLIMITED); indexWriter.setMergeScheduler(new org.apache.lucene.index.SerialMergeScheduler()); indexWriter.setRAMBufferSizeMB(32); // should flush right? indexWriter.setMergeFactor(1000); indexWriter.setMaxFieldLength(Integer.MAX_VALUE); indexWriter.setUseCompoundFile(false); indexWriter.close();
Paul Murdoch
Sorry about the formatting. I didn't know it was going to do that. The line indexWriter.setRAMBufferSizeMB(32) should flush when 32 MB of heap space is used, right?
Paul Murdoch
+1  A: 

In response to the comment on Gandalf's answer:

I can see you are setting the mergeFactor to 1000.

The API documentation says:

setMergeFactor

public void setMergeFactor(int mergeFactor)

Determines how often segment indices are merged by addDocument(). With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indices that are interactively maintained.

This is a convenience method; it uses more RAM as you increase the mergeFactor.

What I would suggest is to set it to something like 15 (adjust on a trial-and-error basis), complemented with setRAMBufferSizeMB. Also call commit(), then optimize(), and then close() on the IndexWriter object (you could put all these calls in one method and invoke it when you are finished with the index).
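A rough sketch of that sequence, assuming Lucene 2.4's IndexWriter API; the class and method names are placeholders and the values are only starting points to tune:

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;

public class IndexingHelper {

    // Apply the suggested settings right after the writer is created.
    public static void configure(IndexWriter writer) {
        writer.setMergeFactor(15);     // smaller mergeFactor -> less RAM while indexing
        writer.setRAMBufferSizeMB(32); // flush buffered documents to disk at ~32 MB
    }

    // Call once all documents have been added.
    public static void finish(IndexWriter writer) throws IOException {
        writer.commit();   // flush and sync pending changes
        writer.optimize(); // merge segments (expensive, but compacts the index)
        writer.close();    // release file handles and the write lock
    }
}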

Post back with your results and feedback =]

Narayan
Thanks for the response. I will try some smaller values for mergeFactor. I've resorted to paging the file, which is very slow and eventually reaches an OOM point anyway. Of course, if I increase the JVM max heap space, the 5 MB file will index very quickly. However, I will sometimes be indexing very large files (250 MB+), and I can't change the JVM heap size dynamically. After benchmarking, it seems a file's size is about 5% of the heap space needed to index it in one shot (i.e. roughly 20x the file size in heap). Unfortunately, I don't have that much RAM. So unless someone has an answer, I'm going to try to make paging work and make it fast too.
Paul Murdoch
It doesn't matter; **setMergeFactor** needs to be set appropriately. We have successfully indexed JDBC result sets of more than 10 GB on dev machines with 2 GB of RAM. Can you post the full code here?
Narayan
Narayan - see my reposted code. Thanks.
Paul Murdoch
A: 

I suppose the only way to repost is to answer my own question. This is a very trimmed-down version of my text indexer class, but it has everything that is important for this question. I have a 5 MB file full of unique text terms and am still hitting OOM with a 64 MB heap. I'm benchmarking with text files of 250 MB or larger. So far I see no way to avoid the OOM for large files. Please help.

Thanks,

Paul

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;

public class TextIndexer {

    static IndexWriter indexWriter = null;
    static String indexDirectory = null;

    public TextIndexer() {
        this.createIndex();
        Document doc = new Document();
        File file = new File("C:5MBOfUniqueTerms.txt");
        this.saveTXTFile(file, doc);
    }

    // create the index
    private void createIndex() {
        try {
            indexWriter = new IndexWriter(indexDirectory, new StandardAnalyzer(),
                    true, MaxFieldLength.UNLIMITED);
            indexWriter.setMergeScheduler(new org.apache.lucene.index.SerialMergeScheduler());
            indexWriter.setRAMBufferSizeMB(32);
            indexWriter.setMergeFactor(16);
            indexWriter.setMaxMergeDocs(32);
            indexWriter.setMaxFieldLength(Integer.MAX_VALUE);
            indexWriter.setUseCompoundFile(false);
        } catch (CorruptIndexException cie) {
            cie.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }

    // index the content of a text file
    private void saveTXTFile(File textFile, Document textDocument) {
        try {
            Boolean isFile = textFile.isFile();
            Boolean hasTextExtension = textFile.getName().endsWith(".txt");

            if (isFile && hasTextExtension) {
                System.out.println("File " + textFile.getCanonicalPath() + " is being indexed");
                Reader textFileReader = new FileReader(textFile);
                if (textDocument == null)
                    textDocument = new Document();
                textDocument.add(new Field("content", textFileReader));
                indexWriter.addDocument(textDocument);
            }
        } catch (FileNotFoundException fnfe) {
            fnfe.printStackTrace();
        } catch (CorruptIndexException cie) {
            cie.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }

}

Paul Murdoch
A: 

We experienced some similar "out of memory" problems earlier this year when building our search indexes for our maven repository search engine at jarvana.com. We were building the indexes on a 64 bit Windows Vista quad core machine but we were running 32 bit Java and 32 bit Eclipse. We had 1.5 GB of RAM allocated for the JVM. We used Lucene 2.3.2. The application indexes about 100GB of mostly compressed data and our indexes end up being about 20GB.

We tried a bunch of things, such as flushing the IndexWriter, explicitly calling the garbage collector via System.gc(), trying to dereference everything possible, etc. We used JConsole to monitor memory usage. Strangely, we would quite often still run into “OutOfMemoryError: Java heap space” errors when they should not have occurred, based on what we were seeing in JConsole. We tried switching to different versions of 32 bit Java, and this did not help.

We eventually switched to 64 bit Java and 64 bit Eclipse. When we did this, our heap memory crashes during indexing disappeared when running with 1.5GB allocated to the 64 bit JVM. In addition, switching to 64 bit Java let us allocate more memory to the JVM (we switched to 3GB), which sped up our indexing.

Not sure exactly what to suggest if you're on XP. For us, our OutOfMemoryError issues seemed to relate to something about Windows Vista 64 and 32 bit Java. Perhaps switching to running on a different machine (Linux, Mac, different Windows) might help. I don't know if our problems are gone for good, but they appear to be gone for now.

Deron Eriksson
A: 

For Hibernate users (on MySQL) who are also using Grails (via the Searchable plugin):

I kept getting OOM errors when indexing 3M rows and 5GB total of data.

These settings seem to have fixed the problem w/o requiring me to write any custom indexers.

Here are some things to try:

Compass settings:

        'compass.engine.mergeFactor':'500',
        'compass.engine.maxBufferedDocs':'1000'

and for Hibernate (not sure if it's necessary, but it might be helping, especially with MySQL, which has JDBC result streaming disabled by default; see the Connector/J implementation notes [1]):

        hibernate.jdbc.batch_size = 50  
        hibernate.jdbc.fetch_size = 30
        hibernate.jdbc.use_scrollable_resultset=true

Also, specifically for MySQL, I had to add some URL parameters to the JDBC connection string.

        url = "jdbc:mysql://127.0.0.1/mydb?defaultFetchSize=500&useCursorFetch=true"

(update: with the url parameters, memory doesn't go above 500MB)

In any case, I'm now able to build my Lucene/Compass index with less than a 2 GB heap. Previously I needed 8 GB to avoid OOM. Hope this helps someone.

[1]: http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html "MySQL streaming JDBC result set"

It turns out that when setting useCursorFetch=true on MySQL, the JVM doesn't use much memory, but MySQL writes a temp file to manage the buffered response. For some reason, on my machine this file was growing to over 50 GB and bringing the machine to a halt. I discovered that setting entitiesIndexer = new PaginationHibernateIndexEntitiesIndexer() instead of the default new ScrollableHibernateIndexEntitiesIndexer() on the HibernateGpsDevice makes the indexer break the queries up into small batches of fetchCount. Now I can index my data without using up too much memory on either the JVM or the MySQL side.
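For illustration only, a rough sketch of that device configuration in plain Java. The package names, the setEntitiesIndexer/setFetchCount setters, and the constructor arguments are assumptions based on Compass 2.x's JavaBean-style configuration, so check them against the Compass version you're running:

import org.hibernate.SessionFactory;

import org.compass.gps.device.hibernate.HibernateGpsDevice;
import org.compass.gps.device.hibernate.indexer.PaginationHibernateIndexEntitiesIndexer;

public class PaginationIndexerConfig {
    // Swap the default scrollable indexer for the pagination indexer so the device
    // queries in fetch-sized batches (method and package names assumed, see above).
    public static HibernateGpsDevice buildDevice(SessionFactory sessionFactory) {
        HibernateGpsDevice device = new HibernateGpsDevice("hibernate", sessionFactory);
        device.setEntitiesIndexer(new PaginationHibernateIndexEntitiesIndexer());
        device.setFetchCount(200); // rows per batch (arbitrary example value)
        return device;
    }
}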