views:

182

answers:

2

I have search strings, similar to the one bellow:

energy food "olympics 2010" Terrorism OR "government" OR cups NOT transport

and I need to parse it with PHP5 to detect if the content belongs to any of the following clusters:

  • AllWords array
  • AnyWords array
  • NotWords array

These are the rules i have set:

  1. If it has OR before or after the word or quoted words if belongs to AnyWord.
  2. If it has a NOT before word or quoted words it belongs to NotWords
  3. If it has 0 or more more spaces before the word or quoted phrase it belongs to AllWords.

So the end result should be something similar to:

AllWords: (energy, food, "olympics 2010")
AnyWords: (terrorism, "government", cups)
NotWords: (Transport)

What would be a good way to do this?

+3  A: 

If you want to do this with Regex, be aware that your parsing will break on stupid user input (the user, not the input =) ).

I'd try the following Regexes.

NotWords:

(?<=NOT\s)\b((?!NOT|OR)\w+|"[^"]+")\b

AllWords:

(?<!OR\s)\b((?!NOT|OR)\w+|"[^"]+")\b(?!\s+OR)

AnyWords: Well.. the rest. =) They are not that easy to spot, since I do not know how to put "OR behind it or OR in front of it" into regex. Maybe you could join the results from the three regexes

(?<=OR\s)\b((?!NOT|OR)\w+|"[^"]+")\b(?!\s+OR)
(?<=OR\s)\b((?!NOT|OR)\w+|"[^"]+")\b(?=\s+OR)
(?<!OR\s)\b((?!NOT|OR)\w+|"[^"]+")\b(?=\s+OR)

Problems: These require exactly one space between modifier words and expressions. PHP only supports lookbehinds for fixes length expressions, so I see no way around that, sorry. You could just use \b(\w+|"[^"]+")\b to split the input, and parse the resulting array manually.

Jens
Hi Jens, \b(\w+|"[^"]+")\b to parse the input seems like a good solution because of the regex restriction, then I can use a for loop to look behind or after the array bucket to see if there is a NOT or OR and act accordingly.
Benjamin Ortuzar
+2  A: 

This is an excellent example of how an test-first driven approach can help you arrive at a solution. It might not be the very best one, but having tests written allow you to refactor with confidence and instantly see if you break any of the existing tests. Anyway, you could set up a few tests like:

public function setUp () {
  $this->searchParser = new App_Search_Parser();
}

public function testSingleWordParsesToAllWords () {
  $this->searchParser->parse('Transport');
  $this->assertEquals(
     $this->searchParser->getAllWords(), 
     array('Transport')
  );
  $this->assertEquals($this->searchParser->getNotWords(), array());
  $this->assertEquals($this->searchParser->getAnyWords());
}

public function testParseOfCombinedSearchString () {
   $query = 'energy food "olympics 2010" Terrorism ' . 
            'OR "government" OR cups NOT transport';
   $this->searchParser->parse($query);

  $this->assertEquals(
     $this->searchParser->getAllWords(), 
     array('energy', 'food', 'olympics 2010')
  );
  $this->assertEquals(
     $this->searchParser->getNotWords(), 
     array('Transport')
  );
  $this->assertEquals(
     $this->searchParser->getAnyWords(),
     array( 'terrorism', 'government', 'cups')
  );
}

Other good tests would include:

  • testParseTwoWords
  • testParseTwoWordsWithOr
  • testParseSimpleWithNot
  • testParseInvalid
    • Here you have to decide what invalid input looks like and how you interpret it, i.e:
    • 'NOT Transport': Search for anything that doesn't contain Transport or inform the user that he has to include at least one search term too?
    • 'OR energy': Is it ok to begin with a combinator?
    • 'food OR NOT energy': Does this mean "search for food or anything that doesn't contain energy", or does it mean "search for food and not energy", or doesn't it mean anything? (i.e. throw exception, return false or whatnot)
  • testParseEmpty

Then, write the tests one by one, and write a simple solution that passes the test. Then refactor and make it right, and run again to see that you still pass the test. Once a test passes and the code is refactored, then write the next test and repeat the procedure. Add more tests as you find special cases and refactor the code so that it passes all tests. If you break a test, back-up and re-write the code (not the test!) such that it passes.

As for how you can solve this problem, look into preg_match, strtok or rely simply loop through the string adding up tokens as you go.

PatrikAkerstrand