ansaurus

Question

Answer 1

+2 A:

Tokenize - strtok.

<?php
$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$delim = " \n\t,.!?:;"; // double quotes, so that \n and \t are read as newline and tab

$tok = strtok($text, $delim);

while ($tok !== false) {
    echo "Word=$tok<br />";
    $tok = strtok($delim);
}
?>
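
A minimal sketch of the same strtok() loop wrapped in a function that returns the words as an array instead of echoing them. The function name tokenize_words is hypothetical, and the delimiter list is only an assumption about which characters should separate words. The delimiters are written in a double-quoted string because PHP interprets \n and \t only there.

```php
<?php
// Hypothetical helper name; it wraps the strtok() loop from the answer.
function tokenize_words(string $text, string $delim = " \n\t,.!?:;"): array
{
    $words = [];
    // The first call receives the text, later calls only the delimiters.
    $tok = strtok($text, $delim);
    while ($tok !== false) {
        $words[] = $tok;
        $tok = strtok($delim);
    }
    return $words;
}

print_r(tokenize_words("Commas, full stops.\nExclamation marks, too!"));
// contains: Commas, full, stops, Exclamation, marks, too
```

Any character missing from the delimiter list stays attached to the neighbouring word, so the list has to be extended by hand for other punctuation.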

eed3si9n 2009-04-26 10:23:26

Thank you, I think this function does this well.

2009-04-26 10:33:42

This won't work if you get a : or ; or any other punctuation character you haven't accounted for.

marcog 2009-04-26 10:41:05

@marcog, I added : and ;. Doesn't \p{P} catch apostrophes and hyphens?

eed3si9n 2009-04-26 10:57:05

What about cases such as quoting? My updated answer discriminates between these cases.

marcog 2009-04-26 11:23:42

Answer 2

A:

Do:

str_word_count($text, 1);

Or, if you need Unicode support:

function str_word_count_Helper($string, $format = 0, $search = null)
{
    $result = array();
    $matches = array();

    // (string) cast: preg_quote() should not receive null when $search is omitted
    if (preg_match_all('~[\p{L}\p{Mn}\p{Pd}\'\x{2019}' . preg_quote((string) $search, '~') . ']+~u', $string, $matches) > 0)
    {
        $result = $matches[0];
    }

    if ($format == 0)
    {
        return count($result);
    }

    return $result;
}

Alix Axel 2009-04-26 10:24:48

Thanks but this wouldn't work. "Fri3nd" wouldn't be extracted but it should.

2009-04-26 10:29:19

I don't understand why "Fri3nd" should be extracted. Removed from the array, broken down into "Fri3" and "nd" (or similar)? O.o

David Thomas 2009-04-26 11:07:34

If you want to consider numbers as words just do str_word_count_Helper($string, 1, '0123456789');
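
A short usage sketch of that call, to show the effect of the third argument. The helper is copied from the answer so the snippet runs on its own (with $search cast to a string, which is an adjustment made here), and the sample sentence is made up.

```php
<?php
// Copy of the helper from the answer; the (string) cast is an addition.
function str_word_count_Helper($string, $format = 0, $search = null)
{
    $result = array();
    $matches = array();

    $pattern = '~[\p{L}\p{Mn}\p{Pd}\'\x{2019}' . preg_quote((string) $search, '~') . ']+~u';
    if (preg_match_all($pattern, $string, $matches) > 0) {
        $result = $matches[0];
    }

    return ($format == 0) ? count($result) : $result;
}

$text = 'My Fri3nd said hello';

// Letters only: the digit breaks the word in two.
print_r(str_word_count_Helper($text, 1));
// contains: My, Fri, nd, said, hello

// Digits added to the character class: the word stays whole.
print_r(str_word_count_Helper($text, 1, '0123456789'));
// contains: My, Fri3nd, said, hello
```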

Alix Axel 2009-04-26 11:56:18

Answer 3

+4 A:

Use the class \p{P}, which matches any Unicode punctuation character, combined with the \s whitespace class.

// The u modifier makes PCRE treat the pattern and $text as UTF-8.
$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u', $text, -1, PREG_SPLIT_NO_EMPTY);

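This will split on a group of one or more whitespace characters, but also suck in any punctuation characters directly next to that whitespace. It also matches punctuation characters at the beginning or end of the string. This discriminates cases such as "don't" and "he said 'ouch!'": the apostrophe inside "don't" has no whitespace next to it and is kept, while the quotes and the exclamation mark around "ouch" are removed.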
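
A minimal sketch of how this split behaves on the two cases mentioned, assuming UTF-8 input and therefore using the pattern with the u modifier. The wrapper name split_words and the sample sentences are made up for illustration.

```php
<?php
// Hypothetical wrapper; the pattern is the one from the answer, with /u.
function split_words(string $text): array
{
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on a regex error such as invalid UTF-8.
    return $parts === false ? [] : $parts;
}

print_r(split_words("He said 'ouch!' and I don't care."));
// contains: He, said, ouch, and, I, don't, care

print_r(split_words('Äpfel, Öl und Straße!'));
// contains: Äpfel, Öl, und, Straße
```

Punctuation with no whitespace next to it is never a split point, so a missing space after a full stop ("stops.Exclamation") would leave the two words joined.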

marcog 2009-04-26 10:24:50

+1, not sure, tho, how this will deal with äöüß. Does regex normally classify äöüß as word characters?

Peter Perháč 2009-04-26 10:28:24

Thank you. This would probably work for English texts, but I also want to extract German umlauts (ä, ö, ü), the "ß" and numbers in a string. The "\W" wouldn't extract "Fri3nd", would it?

2009-04-26 10:31:36

Seems it does not, but updated answer with something similar that works.

marcog 2009-04-26 10:34:40

Updated answer works with Perl (which PHP's regexes are based on): $ echo "äöüß, test" | perl -e 'while (<>) { if (/([\p{P}\s]+)/) { print "$1\n"; } }'

marcog 2009-04-26 10:37:49

+1 You hit it with the Unicode character properties!

Gumbo 2009-04-26 10:40:18

Should one split don't into don and t?

eed3si9n 2009-04-26 10:59:41

Updated it to handle such a case :)

marcog 2009-04-26 11:20:57

Thanks, marcog, it works perfectly! But is it really better than my updated code above? Actually, what is the difference between our approaches? Is one faster than the other?

2009-04-26 11:39:21

In your approach you're specifying the non-punctuation characters. You will therefore be missing some cases, e.g. á. Why try to specify them manually when the whole set of Unicode punctuation characters has already been defined? And as eed3si9n pointed out with my original answer, yours will break up words such as don't.

marcog 2009-04-26 12:24:56

Convinced me! ;) Thanks!

2009-04-26 16:01:27

Answer 4

A:

You can also use PHP's strtok() function to fetch tokens from a large string. You can use it like this:

 $result = array();
 // your original string
 $text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
 // pass strtok() your string and the delimiter that separates the tokens; here, words are separated by a space.
 $word = strtok($text,' ');
 while ( $word !== false ) {
     $result[] = $word;
     $word = strtok(' ');
 }

See the PHP documentation for strtok() for more.
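
Splitting on a space alone leaves punctuation attached to the tokens ("marks," or "too!"). One possible way around that, sketched here as an assumption and not as part of the answer, is to trim a chosen set of punctuation characters from each token. The character list is illustrative only.

```php
<?php
$text = 'Exclamation marks, too! Question marks?';
$result = array();

$word = strtok($text, ' ');
while ($word !== false) {
    // trim() with a character list strips leading and trailing punctuation only,
    // so an apostrophe inside a word such as "don't" is left alone.
    $clean = trim($word, ',.!?:;');
    if ($clean !== '') {
        $result[] = $clean;
    }
    $word = strtok(' ');
}

print_r($result);
// contains: Exclamation, marks, too, Question, marks
```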

farzad 2009-04-26 10:29:45

Answer 5

+1 A:

I would first convert the string to lowercase before splitting it up. That would make the i modifier and the array processing afterwards unnecessary. Additionally, I would use the \W shorthand for non-word characters and add a + quantifier.

$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$result = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

Edit: Use the Unicode character properties instead of \W, as marcog suggested. Something like [\p{P}\p{Z}] (punctuation and separator characters) would cover the characters more specifically than \W.
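
A rough sketch of that variant, assuming UTF-8 input (hence the u modifier). mb_strtolower() from the mbstring extension is used when it is available, since strtolower() does not lowercase non-ASCII letters; that choice and the sample text are assumptions, not part of the answer.

```php
<?php
$text = 'Grüße, Fri3nd! Don\'t stop.';

// Assumption: prefer the multibyte-aware function when mbstring is loaded.
$lower = function_exists('mb_strtolower')
    ? mb_strtolower($text, 'UTF-8')
    : strtolower($text);

// Split on runs of punctuation (\p{P}) and separator (\p{Z}) characters.
$result = preg_split('/[\p{P}\p{Z}]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);

print_r($result);
// contains: grüße, fri3nd, don, t, stop
```

Two things to note: the apostrophe counts as punctuation, so "don't" is split into "don" and "t", and tabs and newlines are control characters and do not fall under \p{Z}, so \s may need to be added to the class for multi-line text.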

Gumbo 2009-04-26 10:35:09

Thanks, the idea to perform strtolower() beforehand is very good. I'll use this.

2009-04-26 10:40:05

Split a text into single words
