I am using Lucene in PHP (the Zend Framework implementation), and I cannot search on a field that contains a number.

Here is the data in the index:

      ts      |    contents
--------------+-----------------
  1236917100  | dog cat gerbil
  1236630752  |  cow pig goat
  1235680249  | lion tiger bear
  nonnumeric  | bass goby trout

My problem: A query for "ts:1236630752" returns no hits. However, a query for "ts:nonnumeric" returns a hit.

I am storing "ts" as a keyword field, which according to documentation "is not tokenized, but is indexed and stored. Useful for non-text fields, e.g. date or url." I have tried treating it as a "text" field, but the behavior is the same except that a query for "ts:*" returns nothing when ts is text.

I'm using Zend Framework 1.7 (just downloaded the latest 3 days ago) and PHP 5.2.9. Here is my code:

<?php

//=========================================================
// Initializes Zend Framework (Zend_Loader).
//=========================================================
set_include_path(realpath('../library') . PATH_SEPARATOR . get_include_path());
require_once('Zend/Loader.php');
Zend_Loader::registerAutoload();

//=========================================================
// Delete existing index and create a new one
//=========================================================
define('SEARCH_INDEX', 'test_search_index');
if(file_exists(SEARCH_INDEX))
  foreach(scandir(SEARCH_INDEX) as $file)
    if(!is_dir(SEARCH_INDEX . "/$file"))  // test the path inside the index dir, not the cwd
      unlink(SEARCH_INDEX . "/$file");

$index = Zend_Search_Lucene::create(SEARCH_INDEX);

//=========================================================
// Create this data in index:
//         ts      |    contents
//   --------------+-----------------
//     1236917100  | dog cat gerbil
//     1236630752  |  cow pig goat
//     1235680249  | lion tiger bear
//     nonnumeric  | bass goby trout
//=========================================================

function add_to_index($index, $ts, $contents) {
  $doc = new Zend_Search_Lucene_Document();
  $doc->addField(Zend_Search_Lucene_Field::Keyword('ts', $ts));
  $doc->addField(Zend_Search_Lucene_Field::Text('contents', $contents));
  $index->addDocument($doc);
}

add_to_index($index, '1236917100', 'dog cat gerbil');
add_to_index($index, '1236630752', 'cow pig goat');
add_to_index($index, '1235680249', 'lion tiger bear');
add_to_index($index, 'nonnumeric', 'bass goby trout');

//=========================================================
// Run some test queries and output results
//=========================================================

echo '<html><body><pre>';

function run_query($index, $query) {
  echo "Running query:  $query\n";
  $hits = $index->find($query);
  echo 'Got ' . count($hits) . " hits.\n";
  foreach($hits as $hit)
    echo "  ts='$hit->ts', contents='$hit->contents'\n";
  echo "\n";
}

run_query($index, 'pig');           //1 hit
run_query($index, 'ts:1236630752'); //0 hits
run_query($index, '1236630752');    //0 hits
run_query($index, 'ts:pig');        //0 hits
run_query($index, 'contents:pig');  //1 hits
run_query($index, 'ts:[1236630700 TO 1236630800]'); //0 hits (range query)
run_query($index, 'ts:*');          //4 hits if ts is keyword, 1 hit otherwise
run_query($index, 'nonnumeric');    //1 hits
run_query($index, 'ts:nonnumeric'); //1 hits
run_query($index, 'trout');         //1 hits

Output

Running query:  pig
Got 1 hits.
  ts='1236630752', contents='cow pig goat'

Running query:  ts:1236630752
Got 0 hits.

Running query:  1236630752
Got 0 hits.

Running query:  ts:pig
Got 0 hits.

Running query:  contents:pig
Got 1 hits.
  ts='1236630752', contents='cow pig goat'

Running query:  ts:[1236630700 TO 1236630800]
Got 0 hits.

Running query:  ts:*
Got 4 hits.
  ts='1236917100', contents='dog cat gerbil'
  ts='1236630752', contents='cow pig goat'
  ts='1235680249', contents='lion tiger bear'
  ts='nonnumeric', contents='bass goby trout'

Running query:  nonnumeric
Got 1 hits.
  ts='nonnumeric', contents='bass goby trout'

Running query:  ts:nonnumeric
Got 1 hits.
  ts='nonnumeric', contents='bass goby trout'

Running query:  trout
Got 1 hits.
  ts='nonnumeric', contents='bass goby trout'
+2  A: 

I'm used to using Lucene under Java, so I can't tell whether your code is correct, but it seems the field is being tokenized in a manner that strips out anything except [a-zA-Z].

It may help shed light on the situation to use an index explorer tool like http://www.getopt.org/luke/ to see exactly what is in the index.

Kris
+1  A: 

The find() method tokenizes the query, and with the default analyzer your numbers will be pretty much ignored. If you want to search for a number, you have to construct the query through the API or use an alternate analyzer that includes numeric values.
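To make the effect concrete, here is a minimal sketch of what a letters-only tokenizer does to a query term. The function `letters_only_tokens` is a hypothetical stand-in written for illustration, not the Zend analyzer itself, and it assumes the default analyzer keeps only runs of ASCII letters. Under that assumption a purely numeric term produces no tokens at all, so nothing is left to match against the untokenized value stored in the keyword field.

```php
<?php
// Hypothetical stand-in for a letters-only, case-insensitive analyzer:
// keep runs of ASCII letters, lowercase them, drop everything else.
function letters_only_tokens($text) {
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

foreach (array('nonnumeric', '1236630752', 'abc123def') as $queryTerm) {
    $tokens = letters_only_tokens($queryTerm);
    echo $queryTerm, ' => [', implode(', ', $tokens), "]\n";
}
// nonnumeric => [nonnumeric]
// 1236630752 => []
// abc123def => [abc, def]
```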

http://framework.zend.com/manual/en/zend.search.lucene.searching.html

It is important to note that the query parser uses the standard analyzer to tokenize separate parts of query string. Thus all transformations which are applied to indexed text are also applied to query strings.

The standard analyzer may transform the query string to lower case for case-insensitivity, remove stop-words, and stem among other transformations.

The API method doesn't transform or filter input terms in any way. It's therefore more suitable for computer generated or untokenized fields.
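As a rough sketch of the query-construction route, the snippet below builds a term query and a range query directly, so the values are never passed through the analyzer. It assumes Zend Framework 1.x is on the include path and that the index from the question already exists in `test_search_index`; the hit counts are printed rather than predicted.

```php
<?php
// Assumes Zend Framework 1.x is on the include path and that the
// question's index already exists in 'test_search_index'.
require_once 'Zend/Loader.php';
Zend_Loader::registerAutoload();

$index = Zend_Search_Lucene::open('test_search_index');

// Exact match on the untokenized keyword field; the value is used verbatim.
$term  = new Zend_Search_Lucene_Index_Term('1236630752', 'ts');
$query = new Zend_Search_Lucene_Search_Query_Term($term);
echo 'Term query: ', count($index->find($query)), " hits\n";

// Range over the same field. Terms are compared as strings, so this only
// behaves numerically while every value has the same number of digits.
$from  = new Zend_Search_Lucene_Index_Term('1236630700', 'ts');
$to    = new Zend_Search_Lucene_Index_Term('1236630800', 'ts');
$range = new Zend_Search_Lucene_Search_Query_Range($from, $to, true);
echo 'Range query: ', count($index->find($range)), " hits\n";
```

Because terms compare lexicographically, timestamps of differing lengths would need zero-padding at index time for the range to make sense.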

Zoredache
Note that newer versions of Zend Search Lucene include an alphanumeric analyzer; you just have to set it as default. Make sure to include this near the beginning of your indexing script as well as before you run $index->find():

Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
Robert Elwell
A: 

Maybe you could try the Sphinx engine. I am running a site with more than 1 million records, and Sphinx is incredibly fast! The PHP API is also very easy to use.

http://www.sphinxsearch.com/powered.html

jipipayo
A: 

Hello there!

Very good response, it worked, but what if I would like something trickier? Let's say that I would like both analyzers combined, so that it still tokenizes the strings by "breaking" them when it encounters numbers, but also indexes the numbers.

Would that be possible? Thanks!

Later edit: I'm pretty interested in this because I opened the two classes, Utf8 and Utf8Num, and I saw only ONE difference, the regular expression rule:

  • In Utf8 it's '/[\p{L}]+/u'
  • In Utf8Num it's '/[\p{L}\p{N}]+/u'

Any suggestions?

I did it! The new rule is

  • /[a-zA-Z]+|[0-9]+/ for Text
  • /[\p{L}]+|[\p{N}]+/u for UTF-8

Enjoy :) I know I will :D

If you didn't understand exactly what it does, here's an explanation of the three filters, using the example string "ghi678jkl41mn-1000":

  • Utf8 simple (standard): keywords "ghi", "jkl", "mn"
  • Utf8Num (standard): keywords "ghi678jkl41mn", "1000" (which includes numbers but it's NOT cool)
  • Utf8NumComplex (by applying my rule): keywords "ghi", "678", "jkl", "41", "mn", "1000"
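
The three results above can be reproduced outside the analyzer classes with plain preg_match_all. This is only a sketch of the regular expressions themselves; the helper name `tokens_for_rule` is made up for the illustration, and the real analyzers also do other work (such as lowercasing) that is not shown here.

```php
<?php
// Hypothetical helper: return every substring of $text matched by $pattern.
function tokens_for_rule($pattern, $text) {
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$sample = 'ghi678jkl41mn-1000';
$rules  = array(
    'Utf8'           => '/[\p{L}]+/u',           // letters only
    'Utf8Num'        => '/[\p{L}\p{N}]+/u',      // mixed letter/digit runs
    'Utf8NumComplex' => '/[\p{L}]+|[\p{N}]+/u',  // letter runs OR digit runs
);

foreach ($rules as $name => $pattern) {
    echo $name, ': ', implode(', ', tokens_for_rule($pattern, $sample)), "\n";
}
// Utf8: ghi, jkl, mn
// Utf8Num: ghi678jkl41mn, 1000
// Utf8NumComplex: ghi, 678, jkl, 41, mn, 1000
```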
tXK