I need a regular expression to parse words from a sentence or a paragraph. Some separators that should be used are spaces and dots. So in:

My name is Bob.I'm 104 yrs old.

Bob and I'm are separated even though there is only a dot between them, not a space.

Any other common word separators should also be included.

+4  A: 
$words = preg_split('#[\\s.]#', $string, -1, PREG_SPLIT_NO_EMPTY);

The \s (written \\s in the code above, since PHP reads a doubled backslash in a single-quoted string as one backslash) will match all whitespace characters (such as space, tab, new line, etc). The . will match, well, a literal dot... If you wanted to add more characters, just add them after the . (with the exceptions that a [, a ] and a # must be escaped with a backslash, and a - must be escaped or placed first or last in the list)...

It will return for your above sentence:

array(8) {
  [0]=>
  string(2) "My"
  [1]=>
  string(4) "name"
  [2]=>
  string(2) "is"
  [3]=>
  string(3) "Bob"
  [4]=>
  string(3) "I'm"
  [5]=>
  string(3) "104"
  [6]=>
  string(3) "yrs"
  [7]=>
  string(3) "old"
}
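
As a rough sketch of the "just add them" advice, the class can be extended like this. The extra separators and the sample strings are assumptions chosen for illustration; they are not part of the original answer.

```php
<?php
// Sketch: extra separators (comma, semicolon, ! and ?) are assumed for illustration.
$string = "Hello,world;this is Bob.I'm here!";
$words = preg_split('#[\\s.,;!?]#', $string, -1, PREG_SPLIT_NO_EMPTY);
print_r($words); // Hello, world, this, is, Bob, I'm, here

// Escaping inside the class: the delimiter # needs a backslash,
// and the hyphen is placed last so it is not read as a range.
$parts = preg_split('#[\\s.,\\#-]#', 'red#green-blue,one two', -1, PREG_SPLIT_NO_EMPTY);
print_r($parts); // red, green, blue, one, two
```

Note that the second pattern also breaks hyphenated words apart, which may or may not be what you want.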
ircmaxell
Heh, good to know about the PREG_SPLIT_NO_EMPTY - avoids needing the `+` I used in my answer.
Peter Boughton
Need some help. Can you tell me how to add a comma (`,`), `<word>` and `</word>` to the list of characters that separate words? So `<word>company,firm,business</word>` would be separated into 5 elements, with `<word>` and `</word>` included among them.
Click Upvote
Adding a comma is easy: `'#[\\s.,]#'`... As for `<word>` and `</word>`, http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html (In other words, NEVER parse XML/HTML with a regex)...
ircmaxell
+2  A: 

Two ways to do this, either inclusive or exclusive, by splitting on either of the following:

Use "word characters", plus common "connectors" (apostrophe, hyphen, etc.), and negate the whole group:

[^\w'-]+

Or specify what you consider non-word characters (spaces, dots, colons, parens, etc):

[\s.;:()]+

(In both cases, the + treats a run of consecutive separators as a single split point, so no empty strings appear between words.)

Certain characters need to be escaped in character classes - for details see http://www.regular-expressions.info/charclass.html
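
Both patterns might be used with preg_split roughly as follows. This is only a sketch: the sample sentence is taken from the question, and the PREG_SPLIT_NO_EMPTY flag is borrowed from another answer here, because the + on its own still leaves an empty string when the text ends with a separator.

```php
<?php
$string = "My name is Bob.I'm 104 yrs old.";

// Inclusive: anything that is not a word character, apostrophe or hyphen
// counts as a separator.
$inclusive = preg_split("/[^\\w'-]+/", $string, -1, PREG_SPLIT_NO_EMPTY);

// Exclusive: only the listed characters count as separators.
$exclusive = preg_split('/[\\s.;:()]+/', $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($inclusive); // My, name, is, Bob, I'm, 104, yrs, old
print_r($exclusive); // same result for this sentence
```

The two approaches diverge on input that contains characters in neither list, such as a comma: the inclusive pattern splits on it, the exclusive one does not.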

Peter Boughton
A: 

Check out the word boundary anchor (\b), together with the word character class (\w), to isolate individual words from whitespace and punctuation.
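
One possible way to apply this is to match words rather than split on separators. The sketch below is an assumption about how the anchor could be used, not something spelled out in this answer. In particular, the class [\w'-] is added here so that apostrophes and hyphens stay inside a word.

```php
<?php
// Sketch only: the character class [\w'-] is an assumption, not part of the answer above.
$sentence = "My name is Bob.I'm 104 yrs old.";

// \b anchors each match at a word boundary; the class lets apostrophes
// and hyphens stay inside a word instead of ending it.
preg_match_all("/\\b[\\w'-]+\\b/", $sentence, $matches);
$words = $matches[0];

print_r($words); // My, name, is, Bob, I'm, 104, yrs, old
```

With plain \w+ between the anchors, I'm would come back as two matches, I and m.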

Paul Sasik
This will fail on words such as `I'm`, which was explicitly stated as a requirement in the question.
Peter Boughton
+8  A: 

What about str_word_count()?:

For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.

Example:

$str = "My name is Bob.I'm 104 yrs old."; 
print_r(str_word_count($str, 1, '0123456789'));

gives:

Array
(
    [0] => My
    [1] => name
    [2] => is
    [3] => Bob
    [4] => I'm
    [5] => 104
    [6] => yrs
    [7] => old
)

The third parameter takes a string which defines which additional characters should be considered as "word characters".
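
To see what the third parameter changes, here is a small sketch comparing the call with and without it, using the same sentence. The variable names are made up for illustration.

```php
<?php
$str = "My name is Bob.I'm 104 yrs old.";

// Without the extra character list, digits are not word characters,
// so "104" is dropped from the result.
$withoutDigits = str_word_count($str, 1);

// With the digits listed, "104" is kept as a word.
$withDigits = str_word_count($str, 1, '0123456789');

print_r($withoutDigits); // My, name, is, Bob, I'm, yrs, old
print_r($withDigits);    // My, name, is, Bob, I'm, 104, yrs, old
```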

Felix Kling
Yep, using built-in functions is always the preferred option, because they're much more likely to cater for potential edge cases.
Peter Boughton
Very well done. + 1
Ryan Kinal
A: 

Have a look at preg_split:

$words = preg_split('/\W+/', $sentence); // split on non-word-characters

This will obviously split »I'm« into ›I‹ and ›m‹.
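
For the sentence from the question, the result looks roughly like the sketch below. It also shows a trailing empty string, because the text ends with a separator. The PREG_SPLIT_NO_EMPTY variant is borrowed from another answer in this thread.

```php
<?php
$sentence = "My name is Bob.I'm 104 yrs old.";

// Plain \W+ split: the apostrophe is a non-word character, so "I'm" breaks in two,
// and the final "." leaves an empty string at the end of the array.
$words = preg_split('/\W+/', $sentence);
print_r($words); // My, name, is, Bob, I, m, 104, yrs, old, (empty string)

// Same split with empty pieces dropped.
$clean = preg_split('/\W+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);
print_r($clean); // My, name, is, Bob, I, m, 104, yrs, old
```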

knittl
As you noted, this will fail on words such as `I'm`, which was explicitly stated as a requirement in the question.
Peter Boughton