Afternoon chaps,

Trying to index a 1.7 million row table with the Zend port of Lucene. On small tests of a few thousand rows it's worked perfectly, but as soon as I try to increase to a few tens of thousands of rows, it times out. Obviously I could increase the time PHP allows the script to run, but seeing as 360 seconds gets me ~10,000 rows, I'd hate to think how long it'd take to do 1.7 million.

I've also tried making the script index a few thousand rows, refresh, and then index the next few thousand, but doing this clears the index each time.

Any ideas guys?

Thanks :)

A: 

Try speeding it up by selecting only the fields you require from that table.

If this is something to run as a cron job, or a worker, then it must be running from the CLI, and for that I don't see why changing the timeout would be a bad thing. You only have to build the index once. After that, new records or updates to existing ones are only small updates to your Lucene database.
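To illustrate the "build once, then update" idea: a minimal sketch, assuming the index lives at ./data/index and the Zend Framework autoloader is already set up. The key point is that Zend_Search_Lucene::create() wipes any existing index at that path, while ::open() appends to it.

<?php
require_once 'Zend/Search/Lucene.php';

$path = './data/index';

// Open the existing index if it's there; create it only on the first run.
try {
    $index = Zend_Search_Lucene::open($path);
} catch (Zend_Search_Lucene_Exception $e) {
    $index = Zend_Search_Lucene::create($path);
}

// New rows just become new documents; no need to rebuild from scratch.
// (Field names here are assumptions for the sketch.)
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('id', 42));
$doc->addField(Zend_Search_Lucene_Field::Text('company', 'Example Ltd'));
$index->addDocument($doc);
$index->commit();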

Htbaa
Thanks for the quick reply Htbaa :) I've just got rid of 2 fields, so it's now only 3 (id, company name and postcode). I'm just running some tests and I'll see how that improves things. Ramping up the run time and letting it take as long as it takes is an option, although it's a pain for testing ;) I need to look into updating Lucene indexes, as I was unaware that you actually could. Thanks again!
thebluefox
See http://framework.zend.com/manual/en/zend.search.lucene.index-creation.html for updating documents and the index.
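For reference, Zend_Search_Lucene has no in-place update: the pattern that page describes is delete-then-re-add. A hedged sketch (field name 'id' is an assumption; note the field must be indexed, e.g. a Keyword field rather than UnIndexed, for this lookup to work):

// Find the internal ids of documents whose 'id' field is '42' and delete them,
// then add the replacement document.
$term = new Zend_Search_Lucene_Index_Term('42', 'id');
foreach ($index->termDocs($term) as $internalId) {
    $index->delete($internalId);
}

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Keyword('id', '42'));
$doc->addField(Zend_Search_Lucene_Field::Text('company', 'Example Ltd (renamed)'));
$index->addDocument($doc);
$index->commit();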
Htbaa
A: 

Some info for you all - posting as an answer so I can use the code styles.

$sql = "SELECT id, company, psearch FROM businesses";
$result = $db->query($sql); // Run SQL

$feeds = array();

$x = 0;
while ($record = $result->fetch_assoc()) {
    $feeds[$x]['id']      = $record['id'];
    $feeds[$x]['company'] = $record['company'];
    $feeds[$x]['psearch'] = $record['psearch'];
    $x++;
}

// grab each feed
foreach ($feeds as $feed) {
    $doc = new Zend_Search_Lucene_Document();

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('id', $feed['id']));
    $doc->addField(Zend_Search_Lucene_Field::Text('company', $feed['company']));
    $doc->addField(Zend_Search_Lucene_Field::Text('psearch', $feed['psearch']));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('link', 'http://www.google.com'));

    //echo "Adding: " . $feed['company'] . "-" . $feed['psearch'] . "\n";

    $index->addDocument($doc);
}

$index->commit();

(I've used google.com as a temp link)
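One way to work around the timeout with code like this is to index in batches and re-open (rather than re-create) the index between runs, so each invocation appends to what's already there instead of clearing it. A sketch under those assumptions (the offset file and batch size are hypothetical details for illustration):

// Hypothetical batch runner: index $batchSize rows per invocation,
// resuming from the offset recorded by the previous run.
$batchSize  = 10000;
$offsetFile = './data/last_offset';
$offset = file_exists($offsetFile) ? (int) file_get_contents($offsetFile) : 0;

$index = file_exists('./data/index')
    ? Zend_Search_Lucene::open('./data/index')    // append to existing index
    : Zend_Search_Lucene::create('./data/index'); // first run only

$sql = "SELECT id, company, psearch FROM businesses LIMIT $offset, $batchSize";
$result = $db->query($sql);
while ($record = $result->fetch_assoc()) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('id', $record['id']));
    $doc->addField(Zend_Search_Lucene_Field::Text('company', $record['company']));
    $doc->addField(Zend_Search_Lucene_Field::Text('psearch', $record['psearch']));
    $index->addDocument($doc);
}
$index->commit();

file_put_contents($offsetFile, $offset + $batchSize);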

The server it's running on is a local install of Ubuntu 8.10 with 3 GB of RAM and a dual Pentium 3.2 GHz chip.

thebluefox
Why are you double looping? Seems like you could get by with a single loop just fine.
gms8994
I'd like to share some performance tips with you: 1. You're running two loops, each over 1.7 million records. 2. Zend_Db_Table_Abstract::setDefaultMetadataCache($cache) will notably speed up your querying.
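Folding the two loops into one, as the comments above suggest, avoids buffering all 1.7 million rows in the $feeds array before indexing starts. A minimal sketch, assuming $db and $index are already set up as in the snippet above:

// Single pass: index each row as it is fetched instead of
// building the whole $feeds array first.
$result = $db->query("SELECT id, company, psearch FROM businesses");

while ($record = $result->fetch_assoc()) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('id', $record['id']));
    $doc->addField(Zend_Search_Lucene_Field::Text('company', $record['company']));
    $doc->addField(Zend_Search_Lucene_Field::Text('psearch', $record['psearch']));
    $index->addDocument($doc);
}

$index->commit();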
tawfekov
+2  A: 

I'm sorry to say it, because the developer of Zend_Search_Lucene is a friend and he has worked really hard on it, but unfortunately it's not suitable for creating indexes on data sets of any nontrivial size.

Use Apache Solr to create indexes. In my tests, Solr creates indexes more than 300x faster than Zend_Search_Lucene.

You could use Zend_Search_Lucene to issue queries against the index you created with Apache Solr.

Of course you could also use the PHP PECL Solr extension, which I would recommend.
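To give a feel for the PECL Solr extension: a minimal sketch, assuming a Solr core is running on localhost:8983 at /solr with fields matching those used in the question (the field names and connection details are assumptions):

$client = new SolrClient(array(
    'hostname' => 'localhost',
    'port'     => 8983,
    'path'     => '/solr',
));

// Build and submit one document; in practice you'd do this per DB row
// and commit once at the end of the batch.
$doc = new SolrInputDocument();
$doc->addField('id', 1);
$doc->addField('company', 'Example Ltd');
$doc->addField('psearch', 'AB1 2CD');

$client->addDocument($doc);
$client->commit();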

Bill Karwin
Bill, thanks for the heads up, I've moved over to using Solr now and have got 1.7 million rows indexing in roughly 8 minutes; doesn't that make you feel warm inside! I have run into one problem though... http://stackoverflow.com/questions/2668279/using-solr-and-zends-lucene-port-together but it was definitely the right move, thank you!
thebluefox