ansaurus

Question

How can I count the number of words between two words?

Answer 1

+7 A:

I have several things to point out:

You can't do it in one regex. A regex matches forward only, so the reversed word order requires a second regex.
You use (.*)?, but you mean (.*?)
To acquire correct matches, you must ensure that the left boundary of your expression cannot occur in the middle of the match.
~~You should denote word boundaries (\b) around your delimiter words to ensure whole-word matches.~~ EDIT: While this is correct in theory, it does not work for Unicode input in PHP.
~~You should switch the PHP locale to Hungarian (it is Hungarian, right?) before calling preg_match_all(), because the locale has an influence on what's considered a word boundary in PHP.~~ EDIT: The meaning of \b does in fact not change with the selected locale.

That being said, regex #1 is:

(\btükörfúrógép\b)((?:(?!\1).)*?)\bárvíztűrő\b

and regex #2 is analogous, just with reversed delimiter words.

Regex explanation:

(               # match group 1:
  \b            #   a word boundary
  tükörfúrógép  #   your first delimiter word
  \b            #   a word boundary
)               # end match group 1
(               # match group 2:
  (?:           #   non-capturing group:
    (?!         #     look-ahead:
      \1        #       must not be followed by delimiter word 1
    )           #     end look-ahead
    .           #     match any next char (includes \n with the "s" switch)
  )*?           #   end non-capturing group, repeat as often as necessary
)               # end match group 2 (this is the one you look for)
\b              # a word boundary
árvíztűrő       # your second delimiter word
\b              # a word boundary
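The commented layout above can be fed to PCRE almost as-is by using the `x` (extended) modifier. The following is only a minimal sketch: it assumes plain ASCII delimiter words (`start` and `end`, chosen purely for illustration) so that `\b` behaves predictably, and the sample strings are made up.

```php
<?php
// Sketch only: "start" and "end" are hypothetical ASCII delimiter words.
// The "x" modifier lets the pattern carry whitespace and # comments,
// the "s" modifier lets "." match newlines as well.
$pattern = '/
    ( \b start \b )        # group 1: the first delimiter word
    ( (?: (?!\1) . )*? )   # group 2: any chars, as long as group 1 does not start here
    \b end \b              # the second delimiter word
/sx';

// Simple case: one delimiter pair.
preg_match($pattern, 'start one two end', $m);
var_dump($m[2]);

// The left delimiter occurs twice: the look-ahead forces the match
// to begin at the second "start", so group 2 is the innermost span.
preg_match($pattern, 'start a start b c end', $m);
var_dump($m[2]);
```

The second call illustrates the point about the left boundary: without the `(?!\1)` look-ahead, group 2 would also swallow the inner `start`.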

UPDATE: With PHP's ~~pathetic~~poor Unicode string support, you will be forced to use expressions like these as replacements for \b:

$before = '(?<=^|[^\p{L}])';
$after  = '(?=[^\p{L}]|$)';

This suggestion has been taken from another question.
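Putting the pieces together might look roughly like the sketch below. It assumes UTF-8 input and the `u` modifier; the function name `words_between()`, the sample text, and the choice to count every run of Unicode letters as one word are illustrative assumptions, not part of the original answer.

```php
<?php
// Sketch only: $before and $after are the boundary replacements from above,
// everything else (names, sample text, word definition) is an assumption.
$before = '(?<=^|[^\p{L}])';
$after  = '(?=[^\p{L}]|$)';

function words_between($txt, $first, $second, $before, $after)
{
    $a = preg_quote($first, '/');
    $b = preg_quote($second, '/');

    // Same structure as regex #1, with \b swapped for the look-arounds.
    $pattern = '/' . $before . '(' . $a . ')' . $after
             . '((?:(?!\1).)*?)'
             . $before . $b . $after . '/su';

    $counts = array();
    if (preg_match_all($pattern, $txt, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // Count each run of Unicode letters in group 2 as one word.
            $counts[] = preg_match_all('/\p{L}+/u', $m[2], $unused);
        }
    }
    return $counts;
}

$txt = 'tükörfúrógép cherry árvíztűrő, tükörfúrógép cat orange lime árvíztűrő';

// Regex #1: words from "tükörfúrógép" to "árvíztűrő".
print_r(words_between($txt, 'tükörfúrógép', 'árvíztűrő', $before, $after));

// Regex #2: the same call with the delimiter words swapped.
print_r(words_between($txt, 'árvíztűrő', 'tükörfúrógép', $before, $after));
```

Calling the function twice with swapped arguments corresponds to the "second regex" mentioned at the top of the answer.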

Tomalak 2010-07-21 07:24:41

This returns an empty array: Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) )

turbod 2010-07-21 07:39:17

PS: Well, to be completely honest - you *can* do it in one regex, by concatenating regex #1 and regex #2 like this `#1|#2`. It's up to you if you consider the resulting expression worthwhile. ;-)

Tomalak 2010-07-21 07:43:47
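One thing to watch when concatenating: in a plain `#1|#2` pattern the capture groups of the second alternative are numbered 3 and 4, so its back-reference has to be renumbered. A possible way around this, sketched below, is PCRE's branch-reset group `(?|...)`, assuming the installed PCRE version supports it. The function name, the ASCII sample words, and the negative look-arounds used as letter boundaries are illustrative assumptions.

```php
<?php
// Sketch only: match the two delimiter words in either order with one pattern.
// Inside (?| ... | ... ) the capture groups restart at 1 in each alternative,
// so \1 always refers to the opening word of whichever branch matched.
function spans_either_order($txt, $wordA, $wordB)
{
    $a = preg_quote($wordA, '/');
    $b = preg_quote($wordB, '/');
    $l = '(?<!\p{L})'; // no letter directly before
    $r = '(?!\p{L})';  // no letter directly after

    $pattern = '/(?|'
             . $l . '(' . $a . ')' . $r . '((?:(?!\1).)*?)' . $l . $b . $r
             . '|'
             . $l . '(' . $b . ')' . $r . '((?:(?!\1).)*?)' . $l . $a . $r
             . ')/su';

    preg_match_all($pattern, $txt, $matches, PREG_SET_ORDER);

    $spans = array();
    foreach ($matches as $m) {
        $spans[] = array('from' => $m[1], 'between' => trim($m[2]));
    }
    return $spans;
}

print_r(spans_either_order('start one two end, end three start', 'start', 'end'));
```

Matches do not overlap, so a delimiter word that closes one span cannot also open the next one.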

@turbod: What does a simple `\bárvíztűrő\b` give you?

Tomalak 2010-07-21 07:45:34

I'm currently researching the way `\b` works with PHP PCRE and Unicode strings. Looks like the locale does *not* have an influence, and an alternative must be used for "international" word boundaries. When I find something, I'll update my answer.

Tomalak 2010-07-21 07:53:44

setlocale(LC_ALL, 'hu_HU.utf8'); preg_match_all('@\bárvíztűrő\b@', $txt, $m); print_r($m); This returns an empty array.

turbod 2010-07-21 07:54:04

@turbod: Yeah, as I said that's because `\b` does not change meaning based on the locale. Take out all `\b` and try again.

Tomalak 2010-07-21 08:00:12

Thanks Tomalak! This expression works! ((?<!\pL)tükörfúrógép(?!\pL))((?:(?!\1).)*?)(?<!\pL)árvíztűrő(?!\pL)|((?<!\pL)árvíztűrő(?!\pL))((?:(?!\1).)*?)(?<!\pL)tükörfúrógép(?!\pL)

turbod 2010-07-21 08:02:52

@turbod: Your look-around for Unicode letters is *almost* correct - it does not account for start-of-string and end-of-string conditions. See my update.

Tomalak 2010-07-21 08:10:35
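The two look-around styles can be compared at the string edges with a quick check like the one below. As far as I can tell, a negative look-behind such as `(?<!\pL)` also succeeds when there is no preceding character at all, so both styles should agree here. The variable names and sample strings are made up for the test.

```php
<?php
// Sketch only: compare both "letter boundary" styles at the string edges.
$word = 'árvíztűrő';
$negative = '/(?<!\p{L})' . $word . '(?!\p{L})/u';           // style from the comment above
$positive = '/(?<=^|[^\p{L}])' . $word . '(?=[^\p{L}]|$)/u'; // style from the answer's update

// Word alone, word in the middle of a sentence, word embedded in a longer word.
$samples = array('árvíztűrő', 'az árvíztűrő gép', 'megárvíztűrők');

foreach ($samples as $txt) {
    printf("%s: negative=%d positive=%d\n",
        $txt, preg_match($negative, $txt), preg_match($positive, $txt));
}
```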

Answer 2

+1 A:

Instead of a huge, confusing regexp, why not write a few lines using various string functions?

Example:

$start = strpos($txt, 'árvíztűrő') + 9; // position of first char after 'árvíztűrő'
$end   = strpos($txt, 'tükörfúrógép', $start);
$inner = substr($txt, $start, $end - $start);
$words = preg_split("/[\s,]+/", $inner);
$num   = count($words);

Of course, this will eat up memory if you have some gigantic input string...

Kelsey Rider 2010-07-21 07:35:50

Sorry, but this does not work.

turbod 2010-07-21 08:03:53

Ah - what did it do? Looking at it now, a possible problem that comes to mind is that your funny accented characters probably aren't in the ASCII set and so the length of 'árvíztűrő' may be more than 9...

Kelsey Rider 2010-07-21 10:11:12
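A byte-safe variant of the string-function approach might look like the sketch below. `strpos()` and `substr()` work on bytes, and in UTF-8 the word 'árvíztűrő' takes 13 bytes rather than 9, so the offset is computed with `strlen()` instead of a hard-coded number. The function name, the `null` return for missing delimiters, and the sample text are assumptions made for illustration.

```php
<?php
// Sketch only: strpos()/substr() approach with byte-based offsets.
function count_words_between($txt, $first, $second)
{
    $pos = strpos($txt, $first);
    if ($pos === false) {
        return null; // first delimiter word not found
    }

    // strlen() returns the length in bytes, matching strpos() and substr().
    $start = $pos + strlen($first);

    $end = strpos($txt, $second, $start);
    if ($end === false) {
        return null; // second delimiter word not found after the first
    }

    $inner = substr($txt, $start, $end - $start);

    // PREG_SPLIT_NO_EMPTY drops the empty pieces caused by leading
    // and trailing separators, which would otherwise inflate the count.
    $words = preg_split('/[\s,]+/u', $inner, -1, PREG_SPLIT_NO_EMPTY);
    return count($words);
}

echo count_words_between('x árvíztűrő cat, orange lime tükörfúrógép y', 'árvíztűrő', 'tükörfúrógép');
```

Unlike the regex solutions, this only looks at the first occurrence and does not check that the delimiters are whole words.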

Answer 3

+3 A:

To count words between two words you can easily use:

count(explode(" ", "lime orange banana"));

And a function that returns an array with matches and counts will be:

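function count_between_words($text, $first, $second, $case_sensitive = false)
{
    // Escape the delimiter words so regex metacharacters in them are harmless.
    // "s" lets "." match newlines, "u" treats pattern and text as UTF-8.
    $pattern = '/(' . preg_quote($first, '/') . ')((?:(?!\\1).)*?)' . preg_quote($second, '/') . '/su' . ($case_sensitive ? "" : "i");

    // Collapse runs of whitespace to single spaces before matching.
    if(!preg_match_all($pattern, preg_replace('/\\s+/u', " ", $text), $results, PREG_SET_ORDER))
        return array();

    $data = array();

    foreach($results as $result)
    {
        $result[2] = trim($result[2]);
        // split() was removed in PHP 7; PREG_SPLIT_NO_EMPTY makes an empty span count as 0 words.
        $words = preg_split('/ +/', $result[2], -1, PREG_SPLIT_NO_EMPTY);
        $data[] = array("match" => $result[0], "words" => $result[2], "count" => count($words));
    }

    return $data;
}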

$result = count_between_words($txt, "tükörfúrógép", "árvíztűrő");

echo "<pre>" . print_r($result, true) . "</pre>";

Result will be:

Array
(
    [0] => Array
    (
        [match] => tükörfúrógép cherry árvíztűrő
        [words] => cherry
        [count] => 1
    )

    [1] => Array
    (
        [match] => tükörfúrógép cat orange lime cat árvíztűrő
        [words] => cat orange lime cat
        [count] => 4
    )

    [2] => Array
    (
        [match] => tükörfúrógép banana orange lime orange lime cat árvíztűrő
        [words] => banana orange lime orange lime cat
        [count] => 6
    )
)

Wiliam 2010-07-21 08:00:37

Thanks William! It's great! But what happens if you reverse the order of the parameters? For example: $result = count_between_words($txt, "árvíztűrő", "tükörfúrógép");

turbod 2010-07-21 08:09:26

Searching in reverse is not a logic error, it is a completely different search. Why? :o

Wiliam 2010-07-21 08:17:59

+1 for providing a self-contained solution. The regex however needs some improvement because it makes assumptions that may or may not be true (namely: `\s*` and `[^,]+?`) and can produce false negatives because of this.

Tomalak 2010-07-21 08:25:12

Reverse will return: " árvíztűrő orange lyon cat lime mac tükörfúrógép" (5) and "árvíztűrő tükörfúrógép" (0)

Wiliam 2010-07-21 08:25:30

Tomalak, I used \s* to trim the result contained in ([^,]+?), but you are right. Judging by the example he gave us, and assuming a normal human-written post, this will be OK, and errors can be easily fixed. The same goes for [^,]: in human-written text a comma separates clauses, and if you don't use it, this example will return a false positive. (Ah! Thanks for the point!)

Wiliam 2010-07-21 08:29:42

I think that assuming that a comma is a significant delimiter in a complex, human-produced text is putting too much faith in the grammatical abilities of the average human. ;-) The question stated "between these two words", and as long as the definition is not more precise, I would refrain from making assumptions about the nature of the input. :-) *(PS: This site uses a Twitter style @-reply system. Unless you use it, your comment might go unnoticed by the one you are talking to.)*

Tomalak 2010-07-21 08:38:53

@Tomalak, yes, I saw that after my last comment. In response to your comment, you are right again. I improved the function with your regex, and I learned today what (?!) does in a regex :D

Wiliam 2010-07-21 08:55:54

@turbod, why do you want to reverse it?

Wiliam 2010-07-21 08:57:01

First day, already learned something. Good start. :-) *(PS, again: Check out http://meta.stackoverflow.com/questions/38600/ for a way to make comment replies easy.)*

Tomalak 2010-07-21 08:57:05

@Wiliam I was just curious

turbod 2010-07-21 09:47:05

@Tomalak: OK, I installed the fast reply script, I need it hehe

Wiliam 2010-08-20 10:49:20
