Hi!
How can I count the number of words between two words?

   $txt = "tükörfúrógép banana orange lime, tükörfúrógép cherry árvíztűrő orange lyon
    cat lime mac tükörfúrógép cat orange lime cat árvíztűrő
    tükörfúrógép banana orange lime
    orange lime cat árvíztűrő";

The two words are 'árvíztűrő' and 'tükörfúrógép'.
I need this to be returned:
tükörfúrógép cherry árvíztűrő
tükörfúrógép cat orange lime cat árvíztűrő
tükörfúrógép banana orange lime orange lime cat árvíztűrő

So far I have this regular expression:

preg_match_all('@((tükörfúrógép(.*)?árvíztűrő)(árvíztűrő(.*)?tükörfúrógép))@sui',$txt,$m);
+7  A: 

I have several things to point out:

  1. You can't do it in one regex. Regex matching is forward-only, so the reversed match order requires a second regex.
  2. You use (.*)?, but you mean (.*?).
  3. To acquire correct matches, you must ensure that the left boundary of your expression cannot occur in the middle of a match.
  4. You should denote word boundaries (\b) around your delimiter words to ensure whole-word matches. EDIT: While this is correct in theory, it does not work for Unicode input in PHP.
  5. You should switch the PHP locale to Hungarian (it is Hungarian, right?) before calling preg_match_all(), because the locale has an influence on what's considered a word boundary in PHP. EDIT: The meaning of \b does in fact not change with the selected locale.
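
Points 2 and 3 can be seen on a small input. This is only a minimal sketch: the ASCII words `start` and `end` are made-up stand-ins for the two delimiter words, chosen so that Unicode issues do not get in the way.

```php
<?php
// Hypothetical sample text; "start" and "end" stand in for the delimiter words.
$sample = 'start a start b end c end';

// (.*)? is an optional greedy group: it still grabs as much as possible.
preg_match('/start(.*)?end/', $sample, $greedy);

// (.*?) is lazy: it stops at the first "end", but may still span an inner "start".
preg_match('/start(.*?)end/', $sample, $lazy);

// The look-ahead keeps the left delimiter from occurring inside the match.
preg_match('/(start)((?:(?!\1).)*?)end/', $sample, $tempered);

echo $greedy[0], "\n";   // start a start b end c end
echo $lazy[0], "\n";     // start a start b end
echo $tempered[0], "\n"; // start b end
```

Only the third pattern returns the innermost span, which is what the question asks for.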

That being said, regex #1 is:

(\btükörfúrógép\b)((?:(?!\1).)*?)\bárvíztűrő\b

and regex #2 is analogous, just with reversed delimiter words.

Regex explanation:

(               # match group 1:
  \b            #   a word boundary
  tükörfúrógép  #   your first delimiter word
  \b            #   a word boundary
)               # end match group 1
(               # match group 2:
  (?:           #   non-capturing group:
    (?!         #     look-ahead:
      \1        #       must not be followed by delimiter word 1
    )           #     end look-ahead
    .           #     match any next char (includes \n with the "s" switch)
  )*?           #   end non-capturing group, repeat as often as necessary
)               # end match group 2 (this is the one you look for)
\b              # a word boundary
árvíztűrő       # your second delimiter word
\b              # a word boundary

UPDATE: With PHP's poor Unicode string support, you will be forced to use expressions like these as replacements for \b:

$before = '(?<=^|[^\p{L}])';
$after  = '(?=[^\p{L}]|$)';

This suggestion has been taken from another question.
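
Put together with regex #1, the replacements might be used roughly like this. It is a sketch, assuming the script file is saved as UTF-8 and that the "u" modifier (UTF-8 mode) is added next to "s"; the variable `$pattern` and the whitespace clean-up in the loop are illustrative additions.

```php
<?php
// The sample text from the question.
$txt = "tükörfúrógép banana orange lime, tükörfúrógép cherry árvíztűrő orange lyon
cat lime mac tükörfúrógép cat orange lime cat árvíztűrő
tükörfúrógép banana orange lime
orange lime cat árvíztűrő";

// Unicode-aware replacements for \b, as given above.
$before = '(?<=^|[^\p{L}])';
$after  = '(?=[^\p{L}]|$)';

// Regex #1 with \b swapped out; "u" = UTF-8 mode, "s" = dot matches newlines.
$pattern = '/(' . $before . 'tükörfúrógép' . $after . ')'
         . '((?:(?!\1).)*?)'
         . $before . 'árvíztűrő' . $after . '/su';

preg_match_all($pattern, $txt, $m, PREG_SET_ORDER);

foreach ($m as $match) {
    // Collapse line breaks and runs of spaces so each match prints on one line.
    echo preg_replace('/\s+/u', ' ', $match[0]), "\n";
}
```

On the sample text this should print the three lines requested in the question.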

Tomalak
This returns an empty array: Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) )
turbod
PS: Well, to be completely honest - you *can* do it in one regex, by concatenating regex #1 and regex #2 like this `#1|#2`. It's up to you if you consider the resulting expression worthwhile. ;-)
Tomalak
@turbod: What does a simple `\bárvíztűrő\b` give you?
Tomalak
I'm currently researching the way `\b` works with PHP PCRE and Unicode strings. It looks like the locale does *not* have an influence, and an alternative must be used for "international" word boundaries. Once I have found something, I'll update my answer.
Tomalak
`setLocale(LC_ALL, 'hu_HU.utf8'); preg_match_all('@\bárvíztűrő\b@', $txt, $m); print_r($m);` This returns an empty array.
turbod
@turbod: Yeah, as I said that's because `\b` does not change meaning based on the locale. Take out all `\b` and try again.
Tomalak
Thanks Tomalak! This expression works: `((?<!\pL)tükörfúrógép(?!\pL))((?:(?!\1).)*?)(?<!\pL)árvíztűrő(?!\pL)|((?<!\pL)árvíztűrő(?!\pL))((?:(?!\1).)*?)(?<!\pL)tükörfúrógép(?!\pL)`
turbod
@turbod: Your look-around for Unicode letters is *almost* correct - it does not account for start-of-string and end-of-string conditions. See my update.
Tomalak
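
The two-pass approach from point 1 of the answer, with the start-of-string and end-of-string handling from the update, could be sketched as follows. The helper name `spans_between` and the word-splitting step are illustrative assumptions, not part of the answer, and the script file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: returns the text found between $first and $second,
// one entry per match, using the Unicode-aware boundaries from the update.
function spans_between($txt, $first, $second)
{
    $before = '(?<=^|[^\p{L}])';
    $after  = '(?=[^\p{L}]|$)';
    $pattern = '/(' . $before . preg_quote($first, '/') . $after . ')'
             . '((?:(?!\1).)*?)'
             . $before . preg_quote($second, '/') . $after . '/su';

    // 0 (no match) and false (error) both end up as an empty result.
    if (!preg_match_all($pattern, $txt, $m)) {
        return array();
    }
    return $m[2]; // group 2 holds the text between the delimiters
}

// The sample text from the question.
$txt = "tükörfúrógép banana orange lime, tükörfúrógép cherry árvíztűrő orange lyon
cat lime mac tükörfúrógép cat orange lime cat árvíztűrő
tükörfúrógép banana orange lime
orange lime cat árvíztűrő";

// One pass per match order.
$forward  = spans_between($txt, 'tükörfúrógép', 'árvíztűrő');
$backward = spans_between($txt, 'árvíztűrő', 'tükörfúrógép');

foreach (array_merge($forward, $backward) as $inner) {
    // Treat every run of letters as a word; empty pieces are dropped.
    $words = preg_split('/[^\p{L}]+/u', $inner, -1, PREG_SPLIT_NO_EMPTY);
    echo count($words), ': ', implode(' ', $words), "\n";
}
```

Running the two orders separately keeps each pattern simple and avoids juggling back-reference numbers inside one long alternation.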
+1  A: 

Instead of a huge, confusing regexp, why not write a few lines using various string functions?

Example:

$start = strpos($txt, 'árvíztűrő') + 9; // position of first char after 'árvíztűrő'
$end   = strpos($txt, 'tükörfúrógép', $start);
$inner = substr($txt, $start, $end - $start);
$words = preg_split("/[\s,]+/", $inner);
$num   = count($words);

Of course, this will eat up memory if you have some gigantic input string...

Kelsey Rider
Sorry, but this does not work.
turbod
Ah, what did it do? Looking at it now, one possible problem is that your accented characters probably aren't in the ASCII set, so the byte length of 'árvíztűrő' may be more than 9...
Kelsey Rider
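
The length problem raised in the last comment can be avoided by letting strlen() supply the offset instead of a hard-coded 9. This is a sketch under assumptions: the shortened sample text is made up, the script file is saved as UTF-8, and the missing-delimiter checks are an illustrative addition.

```php
<?php
// Shortened, hypothetical sample; the delimiters are the ones from the question.
$txt    = 'lyon árvíztűrő orange lyon cat tükörfúrógép lime';
$first  = 'árvíztűrő';
$second = 'tükörfúrógép';

// strlen() counts bytes, just like strpos() and substr(), so the offsets agree.
echo strlen($first), "\n"; // 13 bytes in UTF-8, not 9

$pos = strpos($txt, $first);
$end = ($pos === false) ? false : strpos($txt, $second, $pos + strlen($first));

if ($end !== false) {
    $start = $pos + strlen($first);
    $inner = substr($txt, $start, $end - $start);
    // PREG_SPLIT_NO_EMPTY drops the empty strings caused by surrounding spaces.
    $words = preg_split('/[\s,]+/u', $inner, -1, PREG_SPLIT_NO_EMPTY);
    echo count($words), "\n"; // 3 (orange, lyon, cat)
}
```

Like the original snippet, this only looks at the first occurrence of each word and does not check for whole-word matches.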
+3  A: 

To count the words between two words you can simply use:

count(explode(" ", "lime orange banana"));

And a function that returns an array with matches and counts will be:

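function count_between_words($text, $first, $second, $case_sensitive = false)
{
    // "s" lets the dot match newlines, "u" treats pattern and subject as UTF-8.
    $pattern = '/(' . $first . ')((?:(?!\\1).)*?)' . $second . '/su' . ($case_sensitive ? "" : "i");

    // Collapse runs of whitespace so that words are separated by a single space.
    $text = preg_replace("/\\s+/", " ", $text);

    if(!preg_match_all($pattern, $text, $results, PREG_SET_ORDER))
        return array();

    $data = array();

    foreach($results as $result)
    {
        $result[2] = trim($result[2]);
        // explode() is used because split() was removed in PHP 7.
        $data[] = array("match" => $result[0], "words" => $result[2], "count" => count(explode(" ", $result[2])));
    }

    return $data;
}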

$result = count_between_words($txt, "tükörfúrógép", "árvíztűrő");

echo "<pre>" . print_r($result, true) . "</pre>";

Result will be:

Array
(
    [0] => Array
    (
        [match] => tükörfúrógép cherry árvíztűrő
        [words] => cherry
        [count] => 1
    )

    [1] => Array
    (
        [match] => tükörfúrógép cat orange lime cat árvíztűrő
        [words] => cat orange lime cat
        [count] => 4
    )

    [2] => Array
    (
        [match] => tükörfúrógép banana orange lime orange lime cat árvíztűrő
        [words] => banana orange lime orange lime cat
        [count] => 6
    )
)
Wiliam
Thanks William! It's great! But what happens if you reverse the order of the parameters? For example: `$result = count_between_words($txt, "árvíztűrő", "tükörfúrógép");`
turbod
Searching in reverse is not a logic error, it is a completely different search. Why? :o
Wiliam
+1 for providing a self-contained solution. The regex however needs some improvement because it makes assumptions that may or may not be true (namely: `\s*` and `[^,]+?`) and can produce false negatives because of this.
Tomalak
Reverse will return: " árvíztűrő orange lyon cat lime mac tükörfúrógép" (5) and "árvíztűrő tükörfúrógép" (0)
Wiliam
Tomalak, I used \s* to trim the result contained in ([^,]+?), but you are right. Given the example he gave us, and assuming a normal human-written post, this would be OK, and errors can easily be fixed. The same goes for [^,]: in human-written text a comma separates clauses, and if you don't use it, this example will return a false positive. (Ah! Thanks for the point!)
Wiliam
I think that assuming that a comma is a significant delimiter in a complex, human-produced text is putting too much faith in the grammatical abilities of the average human. ;-) The question stated "between these two words", and as long as the definition is not more precise, I would refrain from making assumptions about the nature of the input. :-) *(PS: This site uses a Twitter style @-reply system. Unless you use it, your comment might go unnoticed by the one you are talking to.)*
Tomalak
@Tomalak, yes, I saw that after my last comment. In response to your comment: you are right again. I improved the function with your regex, and today I learned what (?!) does in a regex :D
Wiliam
@turbod, why do you want to reverse it?
Wiliam
First day, already learned something. Good start. :-) *(PS, again: Check out http://meta.stackoverflow.com/questions/38600/ for a way to make comment replies easy.)*
Tomalak
@Wiliam I was just curious
turbod
@Tomalak: OK, I installed the fast reply script, I need it hehe
Wiliam