I've been working on a crawler script to index all the pages on my site using Zend Lucene search. The script runs, but for some reason it does not find the other links on the pages. The problem seems to occur when the script reaches the find method:

$hits = $index->find('url:'.$targets[$i]);

When I execute the script, there are no hits in the array, so the crawler indexes only the starting URI. Any ideas on what I can try?

A: 

There is a tool for viewing a Lucene index called Luke. It lets you see what has been indexed and test some searches.

Are you sure that the url field is indexed when you create the index? It is possible that you are only storing the value rather than making it searchable. A field added with:

$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $url));

won't be found, as it isn't indexed.
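
A minimal sketch of the difference, assuming Zend Framework 1 is on the include path. The index directory, the example URL, and the `fetched` field are hypothetical names chosen for illustration. A `Keyword` field is stored and indexed as a single untokenized term, so an exact term query on `url` can match it, while an `UnIndexed` field is only stored.

```php
<?php
require_once 'Zend/Loader/Autoloader.php';
Zend_Loader_Autoloader::getInstance();

// Hypothetical index location and page URL, for illustration only.
$indexDir = sys_get_temp_dir() . '/crawler-index-demo';
$url      = 'http://example.com/about';

$index = Zend_Search_Lucene::create($indexDir);

$doc = new Zend_Search_Lucene_Document();
// Keyword: stored and indexed as one untokenized term, so it is searchable.
$doc->addField(Zend_Search_Lucene_Field::Keyword('url', $url));
// UnIndexed: stored and returned with hits, but not searchable.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('fetched', date('c')));
$index->addDocument($doc);
$index->commit();

// Build the term query by hand so that neither the query parser nor the
// analyzer gets a chance to split the URL at ':' and '/'.
$term  = new Zend_Search_Lucene_Index_Term($url, 'url');
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits  = $index->find($query);

echo count($hits), " hit(s)\n";
```

Passing a query object to find() instead of the string 'url:' . $url may matter here, because a URL contains characters that have special meaning in the Lucene query syntax.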

Chris
A: 

If you have numbers in your index, this will help.

To have numbers recognized, set the following as the default analyzer:

Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());

For more information, see http://framework.zend.com/manual/en/zend.search.lucene.extending.html
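
As a rough illustration of what changing the analyzer does, the sketch below tokenizes the same string with the letters-only analyzer and with the TextNum one. It assumes Zend Framework 1 is on the include path; the helper termTexts() and the sample string are hypothetical and not part of the library.

```php
<?php
require_once 'Zend/Loader/Autoloader.php';
Zend_Loader_Autoloader::getInstance();

// Hypothetical helper: return the term texts an analyzer produces for a string.
function termTexts(Zend_Search_Lucene_Analysis_Analyzer $analyzer, $text)
{
    $terms = array();
    foreach ($analyzer->tokenize($text) as $token) {
        $terms[] = $token->getTermText();
    }
    return $terms;
}

$sample = 'Page2 of 10'; // made-up input containing digits

// Letters-only analyzer: digits are not part of any token.
$lettersOnly = new Zend_Search_Lucene_Analysis_Analyzer_Common_Text_CaseInsensitive();
// TextNum analyzer: tokens may contain both letters and digits.
$withNumbers = new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive();

echo implode(' | ', termTexts($lettersOnly, $sample)), "\n";
echo implode(' | ', termTexts($withNumbers, $sample)), "\n";

// Set the default before both indexing and searching, so that the
// same tokenization is applied on each side.
Zend_Search_Lucene_Analysis_Analyzer::setDefault($withNumbers);
```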