Afternoon chaps,

Trying to index a 1.7 million row table with the Zend port of Lucene. On small tests of a few thousand rows it's worked perfectly, but as soon as I try to increase to a few tens of thousands of rows, it times out. Obviously I could increase the time PHP allows the script to run, but seeing as 360 seconds gets me ~10,000 rows, I'd hate to think how long it'd take to do 1.7 million.

I've also tried making the script index a few thousand rows, refresh, and then index the next few thousand, but doing this clears the index each time.

Any ideas guys?

Thanks :)

A: 

Try speeding it up by selecting only the fields you require from that table.

If this is something to run as a cron job, or a worker, then it must be running from the CLI, and for that I don't see why changing the timeout would be a bad thing. You only have to build the index once. After that, new records or updates to existing ones are only small updates to your Lucene database.
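To illustrate the "build once, then update" idea: a minimal sketch, assuming the index lives at ./data/index and the Zend Framework autoloader is already set up. The key point is that Zend_Search_Lucene::create() wipes any existing index at that path, while ::open() appends to it.

<?php
require_once 'Zend/Search/Lucene.php';

$path = './data/index';

// Open the existing index if it's there; create it only on the first run.
try {
    $index = Zend_Search_Lucene::open($path);
} catch (Zend_Search_Lucene_Exception $e) {
    $index = Zend_Search_Lucene::create($path);
}

// New rows just become new documents; no need to rebuild from scratch.
// (Field names here are assumptions for the sketch.)
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('id', 42));
$doc->addField(Zend_Search_Lucene_Field::Text('company', 'Example Ltd'));
$index->addDocument($doc);
$index->commit();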

Htbaa
Thanks for the quick reply Htbaa :) I've just got rid of 2 fields, so it's now only 3 (id, company name and postcode). I'm just running some tests and I'll see how that improves things. Ramping up the run time and letting it take as long as it takes is an option, although it's a pain for testing ;) I need to look into updating Lucene indexes, as I was unaware that you actually could. Thanks again!
thebluefox
See http://framework.zend.com/manual/en/zend.search.lucene.index-creation.html for updating documents and the index.
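For reference, Zend_Search_Lucene has no in-place update: the pattern that page describes is delete-then-re-add. A hedged sketch (field name 'id' is an assumption; note the field must be indexed, e.g. a Keyword field rather than UnIndexed, for this lookup to work):

// Find the internal ids of documents whose 'id' field is '42' and delete them,
// then add the replacement document.
$term = new Zend_Search_Lucene_Index_Term('42', 'id');
foreach ($index->termDocs($term) as $internalId) {
    $index->delete($internalId);
}

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Keyword('id', '42'));
$doc->addField(Zend_Search_Lucene_Field::Text('company', 'Example Ltd (renamed)'));
$index->addDocument($doc);
$index->commit();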
Htbaa
A: 

Some info for you all - posting as an answer so I can use the code styles.

$sql = "SELECT id, company, psearch FROM businesses";
$result = $db->query($sql); // Run SQL

$feeds = array();

$x = 0;
while ($record = $result->fetch_assoc()) {
    $feeds[$x]['id']      = $record['id'];
    $feeds[$x]['company'] = $record['company'];
    $feeds[$x]['psearch'] = $record['psearch'];
    $x++;
}

// grab each feed
foreach ($feeds as $feed) {
    $doc = new Zend_Search_Lucene_Document();

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('id', $feed['id']));
    $doc->addField(Zend_Search_Lucene_Field::Text('company', $feed['company']));
    $doc->addField(Zend_Search_Lucene_Field::Text('psearch', $feed['psearch']));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('link', 'http://www.google.com'));

    //echo "Adding: " . $feed['company'] . "-" . $feed['psearch'] . "\n";

    $index->addDocument($doc);
}

$index->commit();

(I've used google.com as a temp link)
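One way to work around the timeout with code like this is to index in batches and re-open (rather than re-create) the index between runs, so each invocation appends to what's already there instead of clearing it. A sketch under those assumptions (the offset file and batch size are hypothetical details for illustration):

// Hypothetical batch runner: index $batchSize rows per invocation,
// resuming from the offset recorded by the previous run.
$batchSize  = 10000;
$offsetFile = './data/last_offset';
$offset = file_exists($offsetFile) ? (int) file_get_contents($offsetFile) : 0;

$index = file_exists('./data/index')
    ? Zend_Search_Lucene::open('./data/index')    // append to existing index
    : Zend_Search_Lucene::create('./data/index'); // first run only

$sql = "SELECT id, company, psearch FROM businesses LIMIT $offset, $batchSize";
$result = $db->query($sql);
while ($record = $result->fetch_assoc()) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('id', $record['id']));
    $doc->addField(Zend_Search_Lucene_Field::Text('company', $record['company']));
    $doc->addField(Zend_Search_Lucene_Field::Text('psearch', $record['psearch']));
    $index->addDocument($doc);
}
$index->commit();

file_put_contents($offsetFile, $offset + $batchSize);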

The server it's running on is a local install of Ubuntu 8.10 with 3 GB of RAM and a dual Pentium 3.2 GHz chip.

thebluefox
Why are you double looping? Seems like you could get by with a single loop just fine.
gms8994
I'd like to share some performance tips with you: 1. You're running two loops, each over 1.7 million records. 2. Zend_Db_Table_Abstract::setDefaultMetadataCache($cache) will notably speed up your querying.
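Folding the two loops into one, as the comments above suggest, avoids buffering all 1.7 million rows in the $feeds array before indexing starts. A minimal sketch, assuming $db and $index are already set up as in the snippet above:

// Single pass: index each row as it is fetched instead of
// building the whole $feeds array first.
$result = $db->query("SELECT id, company, psearch FROM businesses");

while ($record = $result->fetch_assoc()) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('id', $record['id']));
    $doc->addField(Zend_Search_Lucene_Field::Text('company', $record['company']));
    $doc->addField(Zend_Search_Lucene_Field::Text('psearch', $record['psearch']));
    $index->addDocument($doc);
}

$index->commit();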
tawfekov
+2  A: 

I'm sorry to say it, because the developer of Zend_Search_Lucene is a friend and he has worked really hard on it, but unfortunately it's not suitable for creating indexes on data sets of any nontrivial size.

Use Apache Solr to create indexes. In my tests, Solr creates indexes more than 300x faster than Zend_Search_Lucene.

You could use Zend_Search_Lucene to issue queries against the index you created with Apache Solr.

Of course you could also use the PHP PECL Solr extension, which I would recommend.
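To give a feel for the PECL Solr extension: a minimal sketch, assuming a Solr core is running on localhost:8983 at /solr with fields matching those used in the question (the field names and connection details are assumptions):

$client = new SolrClient(array(
    'hostname' => 'localhost',
    'port'     => 8983,
    'path'     => '/solr',
));

// Build and submit one document; in practice you'd do this per DB row
// and commit once at the end of the batch.
$doc = new SolrInputDocument();
$doc->addField('id', 1);
$doc->addField('company', 'Example Ltd');
$doc->addField('psearch', 'AB1 2CD');

$client->addDocument($doc);
$client->commit();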

Bill Karwin
Bill, thanks for the heads up, I've moved over to using Solr now and have got 1.7 million rows indexing in roughly 8 minutes; doesn't that make you feel warm inside! I have run into one problem though... http://stackoverflow.com/questions/2668279/using-solr-and-zends-lucene-port-together but it was definitely the right move, thank you!
thebluefox