Hi, I am trying to split the text into words:

$delimiterList = array(" ", ".", "-", ",", ";", "_", ":",
           "!", "?", "/", "(", ")", "[", "]", "{", "}", "<", ">", "\r", "\n",
           '"');
$words = mb_split($delimiterList, $string);

This works fine for plain words, but I am stuck on some cases that involve numbers.

For example, given the text "Look at this.My score is 3.14, and I am happy about it.", the resulting array is:

[0]=>Look,
[1]=>at,
[2]=>this,
[3]=>My,
[4]=>score,
[5]=>is,
[6]=>3,
[7]=>14,
[8]=>and, ....

So 3.14 is also split into 3 and 14, which should not happen in my case. A period should separate two words, but not two numbers. The result should be:

[0]=>Look,
[1]=>at,
[2]=>this,
[3]=>My,
[4]=>score,
[5]=>is,
[6]=>3.14,
[7]=>and, ....

But I have no idea how to avoid these cases.

Does anybody have an idea how to solve this problem?

Thanks, Granit

+4  A: 

Take a look at strtok. It lets you change the set of delimiters between calls, so you can break the string apart manually in a while loop, pushing each split-off word onto an array.
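
A minimal sketch of a strtok-based loop, under some assumptions: rather than switching delimiter sets mid-loop, it tokenises on every delimiter from the question except the period, then decides per token whether a period is a decimal point. The function name `splitWords` and the decimal test are illustrative and not part of the original answer.

```php
<?php
// Hypothetical helper: tokenise with strtok(), keeping decimals such as 3.14 whole.
function splitWords(string $text): array
{
    // Delimiters from the question, minus "." which is handled per token below.
    $delims = " -,;_:!?/()[]{}<>\r\n\"";
    $words = [];

    for ($tok = strtok($text, $delims); $tok !== false; $tok = strtok($delims)) {
        // Drop periods at the edges, e.g. the sentence-final one in "it.".
        $tok = trim($tok, '.');
        if ($tok === '') {
            continue;
        }
        if (preg_match('/^\d+\.\d+$/', $tok)) {
            // Digits on both sides of a single period: treat it as a decimal point.
            $words[] = $tok;
        } else {
            // Otherwise a period separates words, as in "this.My".
            foreach (explode('.', $tok) as $part) {
                if ($part !== '') {
                    $words[] = $part;
                }
            }
        }
    }
    return $words;
}

$words = splitWords("Look at this.My score is 3.14, and I am happy about it.");
```

Because strtok does not report which delimiter ended a token, the sketch sidesteps the problem by never letting strtok consume the period in the first place.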

Jeff Ober
Thank you Jeff. I almost have the solution thanks to you, but I have a small problem. I have a delimiter list; is it possible to know exactly which delimiter was matched? Right now I can check whether two consecutive tokens are numbers and join them, but I need to know what was between them.
Granit
+1 .. I hate to call strtok() the most reliable bet, but in his case, it applies.
Tim Post
Granit: not that I am aware of.
Jeff Ober
A: 

Use ". ", instead of ".", in $delimiterList.

powtac
You cannot rely on that. I should also be able to process this.is.a.text, for example.
Granit
When do you have "this.is.a.text" and don't want to split it as described in your question?
powtac
+5  A: 

Or use regex :)

<?php
$str = "Look at this.My score is 3.14, and I am happy about it.";

// alternative to handle Marko's example (updated)
// /([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|(?<=\D)[:,.\-]|[:,.\-](?=\D))/

// -1 means "no limit"; passing null for the limit is deprecated in newer PHP versions
var_dump(preg_split('/([\s\-_,:;?!\/\(\)\[\]{}<>\r\n"]|(?<!\d)\.(?!\d))/',
                    $str, -1, PREG_SPLIT_NO_EMPTY));

array(13) {
  [0]=>
  string(4) "Look"
  [1]=>
  string(2) "at"
  [2]=>
  string(4) "this"
  [3]=>
  string(2) "My"
  [4]=>
  string(5) "score"
  [5]=>
  string(2) "is"
  [6]=>
  string(4) "3.14"
  [7]=>
  string(3) "and"
  [8]=>
  string(1) "I"
  [9]=>
  string(2) "am"
  [10]=>
  string(5) "happy"
  [11]=>
  string(5) "about"
  [12]=>
  string(2) "it"
}
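
As a rough check of the alternative pattern in the comment of the snippet above, the following sketch runs it against a longer sample sentence taken from elsewhere in this thread. The variable names are illustrative; the pattern itself is copied verbatim.

```php
<?php
// The alternative pattern from the commented line, copied verbatim.
$pattern = '/([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|(?<=\D)[:,.\-]|[:,.\-](?=\D))/';

$text = "Look at this.My score is 3.14, and I am happy about it. "
      . "It's not 334,3 and today's not 2009-12-12 11:12:13.";

// -1 means no limit; PREG_SPLIT_NO_EMPTY drops the empty strings
// left between adjacent delimiters.
$words = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
```

Here `:`, `,`, `.` and `-` only act as delimiters when a non-digit sits on at least one side, so 3.14, 334,3, 2009-12-12 and 11:12:13 each stay in one piece. The slash is still in the unconditional character class, so 3/14 would still be split.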
ptomli
What about 3,14 and 3/14? It splits them.
Granit
The commented one works very well, I mean /([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|[:,.\-](?=\D)|[:,.\-](?=\D))/. Thank you very, very much, ptomli!
Granit
I've actually messed up a bit there, I'll edit it quickly.
ptomli
+1  A: 

My first idea was preg_match_all('/\w+/', $string, $matches); but that gives a result similar to the one you already have. The problem is that a dot between digits is ambiguous: it can be a decimal point or the end of a sentence, so we need to transform the string in a way that removes the double meaning.

For example in this sentence we have several parts that we'd like to keep as one word: "Look at this.My score is 3.14, and I am happy about it. It's not 334,3 and today's not 2009-12-12 11:12:13.".

We start by building a search->replace dictionary to encode the exceptions into something that's not going to get split:

$encode = array(
    '/(\d+?)\.(\d+?)/' => '\\1DOT\\2',
    '/(\d+?),(\d+?)/' => '\\1COMMA\\2',
    '/(\d+?)-(\d+?)-(\d+?) (\d+?):(\d+?):(\d+?)/' => '\\1DASH\\2DASH\\3SPACE\\4COLON\\5COLON\\6'
);

Next, we encode the exceptions:

foreach ($encode as $regex => $repl) {
    $string = preg_replace($regex, $repl, $string);
}

Split the string:

preg_match_all('/\w+/', $string, $matches);

And convert the encoded words back:

$decode = array(
    'search' =>  array('DOT', 'COMMA', 'DASH', 'SPACE', 'COLON'),
    'replace' => array('.',   ',',     '-',    ' ',     ':'    )
);
foreach ($matches as $k => $v) {
    $matches[$k] = str_replace($decode['search'], $decode['replace'], $v);
}

$matches[0] now contains the original sentence split into words, with the exceptions kept intact.
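
The steps above can be put together roughly as follows. This is a sketch under assumptions: the wrapper name `splitKeepingNumbers` is hypothetical, the patterns use greedy `\d+` instead of the lazy form, and the placeholder words (DOT, COMMA, and so on) are assumed not to occur in the input text.

```php
<?php
// Hypothetical wrapper around the encode -> split -> decode steps.
function splitKeepingNumbers(string $text): array
{
    // Encode separators inside numbers and timestamps as placeholder words.
    $encode = [
        '/(\d+)\.(\d+)/' => '${1}DOT${2}',
        '/(\d+),(\d+)/'  => '${1}COMMA${2}',
        '/(\d+)-(\d+)-(\d+) (\d+):(\d+):(\d+)/'
            => '${1}DASH${2}DASH${3}SPACE${4}COLON${5}COLON${6}',
    ];
    foreach ($encode as $regex => $replacement) {
        $text = preg_replace($regex, $replacement, $text);
    }

    // Split on runs of word characters; placeholders survive inside a word.
    preg_match_all('/\w+/', $text, $matches);

    // Decode the placeholders in every matched word.
    return str_replace(
        ['DOT', 'COMMA', 'DASH', 'SPACE', 'COLON'],
        ['.',   ',',     '-',    ' ',     ':'],
        $matches[0]
    );
}

$words = splitKeepingNumbers(
    "Look at this.My score is 3.14, and I am happy about it. "
    . "It's not 334,3 and today's not 2009-12-12 11:12:13."
);
```

Two limitations follow from the design: a real word that contains a placeholder in capitals (for example "DOTTED") would be altered by the decode step, and `\w+` treats the apostrophe as a separator, so "It's" comes back as "It" and "s".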

You can make the regexes used for the exceptions as simple or as complex as you like, but some ambiguity is always going to get through, for example two sentences where the first ends with a number and the next begins with one: "Number of the counting shall be 3.3 only and nothing but the 3.5 is right out."

Marko