views:

1181

answers:

1

How can I convert a Google search query to something I can feed PostgreSQL's to_tsquery() ?

If there's no existing library out there, how should I go about parsing a Google search query in a language like PHP?

For example, I'd like to take the following Google-ish search query:

("used cars" OR "new cars") -ford -mistubishi

And turn it into a to_tsquery()-friendly string:

('used cars' | 'new cars') & !ford & !mistubishi

I can fudge this with regexes, but that's the best I can do. Is there some robust lexical analysis method of going about this? I'd like to be able to support extended search operators too (like Google's site: and intitle:) that will apply to different database fields, and thus would need to be separated from the tsquery string.

UPDATE: I realize that with special operators this becomes a Google to SQL WHERE-clause conversion, rather than a Google to tsquery conversion. But the WHERE clause may contain one or more tsqueries.

For example, the Google-style query:

((color:blue OR "4x4") OR style:coupe) -color:red used

Should produce an SQL WHERE-clause like this:

WHERE to_tsvector(description) MATCH to_tsquery('used')
  AND color <> 'red'
  AND ( (color = 'blue' OR to_tsvector(description) MATCH to_tsquery('4x4') )
    OR style = 'coupe'
  );

I'm not sure if the above is possible with regex?

+1  A: 

Honest, I think regular expressions are the way to go with something like this. Just the same, this was a fun exercise. The code below is very prototypal - in fact, you'll see that I didn't even implement the lexer itself - I just faked the output. I'd like to continue it but I just don't have more spare time today.

Also, there definitely a lot more work to be done here in terms of supporting other types of search operators and the like.

Basically, the idea is that a certain type of query is lexed then parsed into a common format (in this case, a QueryExpression instance) which is then rendered back out as another type of query.

<?php

ini_set( "display_errors", "on" );
error_reporting( E_ALL );

interface ILexer
{
    public function execute( $str );
    public function getTokens();
}

interface IParser
{
    public function __construct( iLexer $lexer );
    public function parse( $input );
    public function addToken( $token );
}

class GoogleQueryLexer implements ILexer
{
    private $tokenStack = array();

    public function execute( $str )
    {
     $chars = str_split( $str );
     foreach ( $chars as $char )
     {
      // add to self::$tokenStack per your rules
     }

     //'("used cars" OR "new cars") -ford -mistubishi'
     $this->tokenStack = array(
       '('
      , 'used cars'
      , 'or new cars'
      , ')'
      , '-ford'
      , '-mitsubishi'
     );
    }

    public function getTokens()
    {
     return $this->tokenStack;
    }
}

class GoogleQueryParser implements IParser
{
    protected $lexer;

    public function __construct( iLexer $lexer )
    {
     $this->lexer = $lexer;
    }

    public function addToken( $token )
    {
     $this->tokenStack[] = $token;
    }

    public function parse( $input )
    {
     $this->lexer->execute( $input );
     $tokens = $this->lexer->getTokens();

     $expression = new QueryExpression();

     foreach ( $tokens as $token )
     {
      $expression = $this->processToken( $token, $expression );
     }

     return $expression;
    }

    protected function processToken( $token, QueryExpression $expression )
    {
     switch ( $token )
     {
      case '(':
       return $expression->initiateSubExpression();
       break;
      case ')':
       return $expression->getParentExpression();
       break;
      default:
       $modifier = $token[0];
       $phrase  = substr( $token, 1 );
       switch ( $modifier )
       {
        case '-':
         $expression->addExclusionPhrase( $phrase );
         break;
        case '+':
         $expression->addPhrase( $phrase );
         break;
        default:
         $operator = trim( substr( $token, 0, strpos( $token, ' ' ) ) );
         $phrase  = trim( substr( $token, strpos( $token, ' ' ) ) );
         switch ( strtolower( $operator ) )
         {
          case 'and':
           $expression->addAndPhrase( $phrase );
           break;
          case 'or':
           $expression->addOrPhrase( $phrase );
           break;
          default:
           $expression->addPhrase( $token );
         }
       }
     }
     return $expression;
    }
}

class QueryExpression
{
    protected $phrases = array();
    protected $subExpressions = array();
    protected $parent;

    public function __construct( $parent=null )
    {
     $this->parent = $parent;
    }

    public function initiateSubExpression()
    {
     $expression = new self( $this );
     $this->subExpressions[] = $expression;
     return $expression;
    }

    public function getPhrases()
    {
     return $this->phrases;
    }

    public function getSubExpressions()
    {
     return $this->subExpressions;
    }

    public function getParentExpression()
    {
     return $this->parent;
    }

    protected function addQueryPhrase( QueryPhrase $phrase )
    {
     $this->phrases[] = $phrase;
    }

    public function addPhrase( $input )
    {
     $this->addQueryPhrase( new QueryPhrase( $input ) );
    }

    public function addOrPhrase( $input )
    {
     $this->addQueryPhrase( new QueryPhrase( $input, QueryPhrase::MODE_OR ) );
    }

    public function addAndPhrase( $input )
    {
     $this->addQueryPhrase( new QueryPhrase( $input, QueryPhrase::MODE_AND ) );
    }

    public function addExclusionPhrase( $input )
    {
     $this->addQueryPhrase( new QueryPhrase( $input, QueryPhrase::MODE_EXCLUDE ) );
    }
}

class QueryPhrase
{
    const MODE_DEFAULT = 1;
    const MODE_OR = 2;
    const MODE_AND = 3;
    const MODE_EXCLUDE = 4;

    protected $phrase;
    protected $mode;

    public function __construct( $input, $mode=self::MODE_DEFAULT )
    {
     $this->phrase = $input;
     $this->mode = $mode;
    }

    public function getMode()
    {
     return $this->mode;
    }

    public function __toString()
    {
     return $this->phrase;
    }
}

class TsqueryBuilder
{
    protected $expression;
    protected $query;

    public function __construct( QueryExpression $expression )
    {
     $this->query = trim( $this->processExpression( $expression ), ' &|' );
    }

    public function getResult()
    {
     return $this->query;
    }

    protected function processExpression( QueryExpression $expression )
    {
     $query = '';
     $phrases = $expression->getPhrases();
     $subExpressions = $expression->getSubExpressions();

     foreach ( $phrases as $phrase )
     {
      $format = "'%s' ";
      switch ( $phrase->getMode() )
      {
       case QueryPhrase::MODE_AND :
        $format = "& '%s' ";
        break;
       case QueryPhrase::MODE_OR :
        $format = "| '%s' ";
        break;
       case QueryPhrase::MODE_EXCLUDE :
        $format = "& !'%s' ";
        break;
      }
      $query .= sprintf( $format, str_replace( "'", "\\'", $phrase ) );
     }

     foreach ( $subExpressions as $subExpression )
     {
      $query .= "& (" . $this->processExpression( $subExpression ) . ")";
     }
     return $query;
    }
}

$parser = new GoogleQueryParser( new GoogleQueryLexer() );

$queryBuilder = new TsqueryBuilder( $parser->parse( '("used cars" OR "new cars") -ford -mistubishi' ) );

echo $queryBuilder->getResult();
Peter Bailey
Wow that's a nice answer! :-) I'm interested to know what regex(s) you would use since you think regex is the way to do it? I have tried writing some regexs but I'm stuck on how to handle nested parentheses?
well, I don't think you have to since they wouldn't change between your desired input and output formats - you really just need to convert the operators. If you like, I'll post a reply with a regular expression-based solution as well.
Peter Bailey
I'd be interested to see the regex solution if it's no trouble to post. I realized though that once special operators are involved it's no longer a google-to-tsquery situation, but it's google-to-sql-where-clause situation. I'll edit the post to show an example.
One small bug in GoogleQueryLexer::execute, on the assignment of tokenStack: 'used cars' needs a space between the opening quote and the first letter, or else the resulting query just contains 'cars'. Is this the correct fix or is there a problem in GoogleQueryParser::processToken?
Ok, fixed the bug. It was in GoogleQueryParser::processToken(). I'll try to get around to a regex version tomorrow, but I have to be honest - your new example is _much_ more complicated.
Peter Bailey
Hmm. Should I maybe look at using flex/bison to make a PECL extension to handle this conversion? Or is that overkill?
To explain that last comment: I assumed complicated means the PHP implementation will be slow, but I guess that isn't necessarily the case.
One other trivial bug triggered by the second last line. The return value of $parser->parse() is passed to the constructor of TsqueryBuilder, but it doesn't return anything. I added a return statement to GoogleQueryParser::parse to fix. I could've used getExpression(), I'm not sure what's preferred.
Fixed again. Somehow the version I had working on my test server isn't what I ended up pasting here, sorry about that. Complicated can definitely mean slower, but we're already doing lexical parsing with a high level language - it will always be slow compared to languages better suited for that.
Peter Bailey