I need to index around 10 GB of data. Each of my "documents" is pretty small: think basic info about a product, about 20 fields of data, most only a few words long. Only one column is indexed; the rest are just stored. I'm grabbing the data from text files, so that part is pretty fast.
Current indexing speed is only about 40 MB per hour; I've heard other people say they have achieved 100x that. Smaller files (around 20 MB) index quite quickly, in about 5 minutes. However, when I loop through all of my data files (about 50 files totaling 10 GB), the growth of the index seems to slow down a lot as time goes on. Any ideas on how I can speed up the indexing, or what indexing speed I should realistically expect?
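In case it helps to quantify the slowdown, here's roughly how I could time each file and log throughput; this is just a sketch, not my current code (dataFolder and the IndexFile helper are only placeholders for the loop shown further down):
// Sketch only: log MB/hour per file to see whether throughput really drops
// as the index grows. IndexFile is a hypothetical wrapper around the per-file loop below.
foreach (string file in System.IO.Directory.GetFiles(dataFolder, "*.txt"))
{
    System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
    IndexFile(file);
    sw.Stop();

    double mb = new System.IO.FileInfo(file).Length / (1024.0 * 1024.0);
    Console.WriteLine("{0}: {1:F1} MB in {2:F0} s ({3:F0} MB/hour)",
        System.IO.Path.GetFileName(file), mb, sw.Elapsed.TotalSeconds,
        mb / sw.Elapsed.TotalHours);
}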
On a side note, I've noticed the API in the .NET port doesn't seem to contain all of the same methods as the original in Java...
Update: here are snippets of the indexing C# code. First I set things up:
directory = FSDirectory.GetDirectory(@txtIndexFolder.Text, true);
iwriter = new IndexWriter(directory, analyzer, true);
iwriter.SetMaxFieldLength(25000);
iwriter.SetMergeFactor(1000);
iwriter.SetMaxBufferedDocs(Convert.ToInt16(txtBuffer.Text));
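For context, the setup in full looks roughly like the following; the StandardAnalyzer and the plain indexPath string are assumptions on my part (the analyzer construction and the TextBox values aren't shown above), and the comments reflect my understanding of each setting:
// Roughly the full setup, using the Lucene.Net.Analysis.Standard, Lucene.Net.Index,
// and Lucene.Net.Store namespaces. StandardAnalyzer and indexPath are assumed here.
Analyzer analyzer = new StandardAnalyzer();
FSDirectory directory = FSDirectory.GetDirectory(indexPath, true);  // true = create/overwrite the index
IndexWriter iwriter = new IndexWriter(directory, analyzer, true);   // true = build a new index

iwriter.SetMaxFieldLength(25000);   // max tokens indexed per field
iwriter.SetMergeFactor(1000);       // segments allowed to accumulate before a merge
iwriter.SetMaxBufferedDocs(500);    // docs buffered in RAM before flushing (500 is just a placeholder)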
Then I read from a tab-delimited data file:
using (System.IO.TextReader tr = System.IO.File.OpenText(File))
{
    string line;
    while ((line = tr.ReadLine()) != null)
    {
        string[] items = line.Split('\t');
Then I create the fields and add the document to the index:
// stored but not indexed
fldName = new Field("Name", items[4], Field.Store.YES, Field.Index.NO);
doc.Add(fldName);
fldUPC = new Field("UPC", items[10], Field.Store.YES, Field.Index.NO);
doc.Add(fldUPC);
// the one indexed (tokenized) field, built from several columns
string Contents = items[4] + " " + items[5] + " " + items[9] + " " + items[10] + " " + items[11] + " " + items[23] + " " + items[24];
fldContents = new Field("Contents", Contents, Field.Store.NO, Field.Index.TOKENIZED);
doc.Add(fldContents);
...
iwriter.AddDocument(doc);
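Putting those pieces together, the per-file loop is roughly shaped like this; I've assumed a fresh Document is created for every line (the Document construction isn't in the snippets above), and IndexFile is just a name for illustration:
// Sketch of the whole per-file loop, assuming one new Document per line.
void IndexFile(string file)
{
    using (System.IO.TextReader tr = System.IO.File.OpenText(file))
    {
        string line;
        while ((line = tr.ReadLine()) != null)
        {
            string[] items = line.Split('\t');

            Document doc = new Document();
            doc.Add(new Field("Name", items[4], Field.Store.YES, Field.Index.NO));
            doc.Add(new Field("UPC", items[10], Field.Store.YES, Field.Index.NO));

            string contents = items[4] + " " + items[5] + " " + items[9] + " " +
                              items[10] + " " + items[11] + " " + items[23] + " " + items[24];
            doc.Add(new Field("Contents", contents, Field.Store.NO, Field.Index.TOKENIZED));

            // ... the other stored fields ...
            iwriter.AddDocument(doc);
        }
    }
}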
Once it's completely done indexing:
iwriter.Optimize();
iwriter.Close();