ansaurus

Question

Answer 1

+4 A:

$words = preg_split('#[\\s.]#', $string, -1, PREG_SPLIT_NO_EMPTY);

The \\s will match any whitespace character (space, tab, newline, etc.). The . will match a literal dot, because it has no special meaning inside a character class. If you want to split on more characters, just add them after the . (with the exceptions that a [, a ], a \ and the # delimiter should be escaped with a backslash, and a - should be the first or last character in the class, or be escaped as well).

It will return for your above sentence:

array(8) {
  [0]=>
  string(2) "My"
  [1]=>
  string(4) "name"
  [2]=>
  string(2) "is"
  [3]=>
  string(3) "Bob"
  [4]=>
  string(3) "I'm"
  [5]=>
  string(3) "104"
  [6]=>
  string(3) "yrs"
  [7]=>
  string(3) "old"
}
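
As a minimal sketch of the escaping rules described above: the sample string and the extra separator characters below are made up for illustration and are not taken from the question.

```php
<?php
// Hypothetical sample input; only the set of separators matters here.
$string = "one-two three]four#five,six.seven";

// Inside single quotes PHP passes \s, \] and \# through unchanged
// (writing \\s yields the same pattern).
// ] and the # delimiter are escaped; - sits last so it is taken literally.
$words = preg_split('#[\s.,\]\#-]#', $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

Choosing a delimiter that does not occur in the class (for example / or ~) would avoid having to escape the #.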

ircmaxell 2010-08-05 17:17:24

Heh, good to know about the PREG_SPLIT_NO_EMPTY - avoids needing the `+` I used in my answer.

Peter Boughton 2010-08-05 17:24:58

Need some help. Can you tell me how to add a comma (`,`), `<word>` and `</word>` to the list of characters that separate words? So `<word>company,firm,business</word>` would be separated into 5 elements, with `<word>` and `</word>` included among them.

Click Upvote 2010-08-05 19:09:45

Adding a comma is easy: `'#[\\s.,]#'`... As for `<word>` and `</word>`, http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html (In other words, NEVER parse XML/HTML with a regex)...
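
For the two fixed, literal tags asked about above (not arbitrary XML/HTML, where the warning about regexes still applies), one possible sketch is to capture the tags and keep them with PREG_SPLIT_DELIM_CAPTURE. The variable names are illustrative only.

```php
<?php
$string = '<word>company,firm,business</word>';

// The capturing group keeps <word> and </word> in the result;
// whitespace, dots and commas match outside the group and are discarded.
$flags = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;
$parts = preg_split('#(</?word>)|[\s.,]+#', $string, -1, $flags);

print_r($parts);
```

This relies on the tags being exactly `<word>` and `</word>`, with no attributes or nesting; anything more general is a job for an XML parser.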

ircmaxell 2010-08-05 19:15:00

Answer 2

+2 A:

Two ways to do this, either inclusive or exclusive, by splitting on either of the following:

Use "word characters" (\w, which covers letters, digits and the underscore), plus common "connectors" (apostrophe, hyphen, etc.), and negate the whole group:

[^\w'-]+

Or specify what you consider non-word characters (spaces, dots, colons, parens, etc):

[\s.;:()]+

(In both cases, the + makes a run of consecutive separators count as a single split point, so no empty strings appear between words.)

Certain characters need to be escaped in character classes - for details see http://www.regular-expressions.info/charclass.html
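
A rough sketch of both patterns in use with preg_split, assuming the sample sentence used elsewhere on this page. The delimiters and variable names are choices made for this example.

```php
<?php
$string = "My name is Bob.I'm 104 yrs old.";

// Negated class: split on anything that is not a word character, ' or -.
$negated = preg_split("/[^\\w'-]+/", $string, -1, PREG_SPLIT_NO_EMPTY);

// Explicit list: split on the characters named as separators.
$listed = preg_split('/[\s.;:()]+/', $string, -1, PREG_SPLIT_NO_EMPTY);

// Without the flag, the trailing "." still leaves one empty string at the end.
$unflagged = preg_split('/[\s.;:()]+/', $string);

print_r($negated);
```

The + only merges neighbouring separators; a separator at the very start or end of the string still produces an empty element unless PREG_SPLIT_NO_EMPTY is passed.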

Peter Boughton 2010-08-05 17:18:04

Answer 3

A:

Check out the word boundary anchor (\b) or the word character class (\w) to isolate individual words from whitespace and punctuation.

Paul Sasik 2010-08-05 17:18:23

This will fail on words such as `I'm`, which was explicitly stated as a requirement in the question.
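
To illustrate the point, here is a small sketch comparing a plain \b\w+\b match with a class that also allows the apostrophe and hyphen. It assumes the sample sentence used elsewhere on this page.

```php
<?php
$string = "My name is Bob.I'm 104 yrs old.";

// \b\w+\b treats the apostrophe as a boundary, so I'm becomes I and m.
preg_match_all('/\b\w+\b/', $string, $m);
$boundary = $m[0];

// Allowing ' and - inside the token keeps I'm together.
preg_match_all("/[\\w'-]+/", $string, $m);
$connected = $m[0];

print_r($boundary);
print_r($connected);
```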

Peter Boughton 2010-08-05 17:23:15

Answer 4

+8 A:

What about str_word_count()?:

For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.

Example:

$str = "My name is Bob.I'm 104 yrs old."; 
print_r(str_word_count($str, 1, '0123456789'));

gives:

Array
(
    [0] => My
    [1] => name
    [2] => is
    [3] => Bob
    [4] => I'm
    [5] => 104
    [6] => yrs
    [7] => old
)

The third parameter takes a string which defines which additional characters should be considered as "word characters".
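
As a quick sketch of what the third parameter changes, here is the same call with and without the extra characters. The second argument, 1, asks for the words as an array rather than a count. This assumes the default locale.

```php
<?php
$str = "My name is Bob.I'm 104 yrs old.";

// Without extra characters, digits are not word characters, so 104 is dropped.
$plain = str_word_count($str, 1);

// Passing the digits as the third argument keeps 104 as a word.
$withDigits = str_word_count($str, 1, '0123456789');

print_r($plain);
print_r($withDigits);
```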

Felix Kling 2010-08-05 17:19:11

Yep, using built-in functions is always the preferred option, because they're much more likely to cater for potential edge cases.

Peter Boughton 2010-08-05 17:22:36

Very well done. + 1

Ryan Kinal 2010-08-05 17:22:40

Answer 5

A:

have a look at preg_split

$words = preg_split('/\W+/', $sentence, -1, PREG_SPLIT_NO_EMPTY); // split on runs of non-word characters, dropping empty strings

this will obviously split »I'm« into ›I‹ and ›m‹

knittl 2010-08-05 17:19:44

As you noted, this will fail on words such as `I'm`, which was explicitly stated as a requirement in the question.

Peter Boughton 2010-08-05 17:23:37

Regexp for parsing words from sentence
