ansaurus

Question

Split text into words problem PHP, complicated problem

Answer 1

+4 A:

Take a look at strtok. It lets you change the delimiter characters on each call, so you can break the string apart manually in a while loop, pushing each split-off word onto an array.
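A minimal sketch of that loop, for illustration only. The function name splitWithStrtok and the default delimiter list are assumptions; the asker's own delimiter list is not shown in this thread.

```php
<?php
// Hypothetical helper: tokenize with strtok() and collect the words in an array.
function splitWithStrtok(string $text, string $delimiters = " \t\r\n.,;:!?"): array
{
    $words = [];
    // The first call receives the string to tokenize; every later call passes
    // only a delimiter list, which is allowed to differ from call to call.
    $token = strtok($text, $delimiters);
    while ($token !== false) {
        $words[] = $token;
        $token = strtok($delimiters);
    }
    return $words;
}

$words = splitWithStrtok("Look at this.My score is high");
```

With this fixed delimiter list a value such as 3.14 would still be cut at the dot; keeping it whole would mean switching the delimiter list between calls or rejoining tokens afterwards.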

Jeff Ober 2009-10-21 13:01:37

Thank you, Jeff. I almost have the solution thanks to you, but I have one small problem. I have a delimiter list; is it possible to know exactly which delimiter was matched? Right now I can check whether two consecutive tokens are numbers and join them, but I need to know what was between them.

Granit 2009-10-21 13:24:26

+1 .. I hate to call strtok() the most reliable bet, but in his case, it applies.

Tim Post 2009-10-21 13:33:03

Granit: not that I am aware of.

Jeff Ober 2009-10-21 13:40:55
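strtok() does not report which delimiter ended a token. One alternative, which is not what this answer proposes, is preg_split() with the PREG_SPLIT_DELIM_CAPTURE flag: it returns the captured delimiters interleaved with the tokens, so two numeric tokens can be rejoined with whatever stood between them. The sketch below assumes a small delimiter set (dot, comma, whitespace) and a hypothetical function name.

```php
<?php
// Hypothetical helper: split on ".", "," and whitespace, but glue
// "digits . digits" and "digits , digits" back together.
function splitAndRejoinNumbers(string $text): array
{
    // The parentheses capture each delimiter, so it appears in the result.
    $parts = preg_split('/([.,\s])/', $text, -1,
                        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $words = [];
    $count = count($parts);
    for ($i = 0; $i < $count; $i++) {
        $part = $parts[$i];
        // Number, then "." or ",", then number: rejoin using the real delimiter.
        if (ctype_digit($part) && $i + 2 < $count
            && ($parts[$i + 1] === '.' || $parts[$i + 1] === ',')
            && ctype_digit($parts[$i + 2])) {
            $words[] = $part . $parts[$i + 1] . $parts[$i + 2];
            $i += 2;
            continue;
        }
        // Skip parts that are just a delimiter.
        if (preg_match('/^[.,\s]$/', $part) === 1) {
            continue;
        }
        $words[] = $part;
    }
    return $words;
}

$words = splitAndRejoinNumbers("My score is 3.14, or 3,14.");
```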

Answer 2

A:

Use ". ", instead of ".", in $delimiterList.

powtac 2009-10-21 13:04:25

You can't be sure of that. I should also be able to process this.is.a.text, for example.

Granit 2009-10-21 13:08:19

When do you have "this.is.a.text" and don't want to split it, as mentioned in your question?

powtac 2009-10-21 15:19:44

Answer 3

+5 A:

Or use regex :)

<?php
$str = "Look at this.My score is 3.14, and I am happy about it.";

// alternative to handle Marko's example (updated)
// /([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|(?<=\D)[:,.\-]|[:,.\-](?=\D))/

var_dump(preg_split('/([\s\-_,:;?!\/\(\)\[\]{}<>\r\n"]|(?<!\d)\.(?!\d))/',
                    $str, -1, PREG_SPLIT_NO_EMPTY));

Output:

array(13) {
  [0]=>
  string(4) "Look"
  [1]=>
  string(2) "at"
  [2]=>
  string(4) "this"
  [3]=>
  string(2) "My"
  [4]=>
  string(5) "score"
  [5]=>
  string(2) "is"
  [6]=>
  string(4) "3.14"
  [7]=>
  string(3) "and"
  [8]=>
  string(1) "I"
  [9]=>
  string(2) "am"
  [10]=>
  string(5) "happy"
  [11]=>
  string(5) "about"
  [12]=>
  string(2) "it"
}

ptomli 2009-10-21 13:43:09

What about 3,14 and 3/14? It splits them.

Granit 2009-10-21 14:08:32

The commented one works very well, I mean /([\s_;?!\/\[\]{}<>\r\n"]|\.$|[:,.\-](?=\D)|[:,.\-](?=\D))/. Thank you very, very much, ptomli!

Granit 2009-10-21 14:12:24

I've actually messed up a bit there, I'll edit it quickly.

ptomli 2009-10-21 14:33:47
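For reference, a sketch of how the commented-out alternative pattern in the answer's code behaves on digit-separated values. The pattern is copied from that comment; the sample string and variable names are assumptions made for illustration.

```php
<?php
// ":", ",", "." and "-" split only when a non-digit sits on at least one side;
// a "." at the very end of the string always splits.
$pattern = '/([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|(?<=\D)[:,.\-]|[:,.\-](?=\D))/';
$sample  = "Look at this.My score is 3,14 on 2009-12-12.";
$words   = preg_split($pattern, $sample, -1, PREG_SPLIT_NO_EMPTY);
```

Because "/" is still in the first character class, 3/14 is still split by this pattern. Taking \/ out of that class would keep it together, at the cost of no longer splitting on slashes anywhere.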

Answer 4

+1 A:

My first idea was preg_match_all('/\w+/', $string, $matches); but that gives a result similar to the one you've got. The problem is that a dot between numbers is ambiguous: it can be either a decimal point or the end of a sentence, so we need to transform the string in a way that removes the double meaning.
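To make the ambiguity concrete, here is a small sketch of that first idea. The sample string is shortened from the example sentence used in this thread.

```php
<?php
// \w+ matches runs of letters, digits and underscores, so "." always ends a word.
$string = "My score is 3.14";
preg_match_all('/\w+/', $string, $matches);
$words = $matches[0]; // index 0 holds the full-pattern matches
```

Here 3.14 comes back as two separate words, "3" and "14".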

For example in this sentence we have several parts that we'd like to keep as one word: "Look at this.My score is 3.14, and I am happy about it. It's not 334,3 and today's not 2009-12-12 11:12:13.".

We start by building a search->replace dictionary to encode the exceptions into something that's not going to get split:

$encode = array(
    '/(\d+?)\.(\d+?)/' => '\\1DOT\\2',
    '/(\d+?),(\d+?)/' => '\\1COMMA\\2',
    '/(\d+?)-(\d+?)-(\d+?) (\d+?):(\d+?):(\d+?)/' => '\\1DASH\\2DASH\\3SPACE\\4COLON\\5COLON\\6'
);

Next, we encode the exceptions:

foreach ($encode as $regex => $repl) {
    $string = preg_replace($regex, $repl, $string);
}

Split the string:

preg_match_all('/\w+/', $string, $matches);

And convert the encoded word back:

$decode = array(
    'search' =>  array('DOT', 'COMMA', 'DASH', 'SPACE', 'COLON'),
    'replace' => array('.',   ',',     '-',    ' ',     ':'    )
);
foreach ($matches as $k => $v) {
    $matches[$k] = str_replace($decode['search'], $decode['replace'], $v);
}

$matches[0] now contains the words of the original sentence, with the exceptions kept intact.
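The steps above can be folded into one function. This is a sketch under stated assumptions, not the answer's exact code: the function name is hypothetical, only the decimal-dot and decimal-comma rules are included, and the quantifiers are greedy rather than lazy.

```php
<?php
// Hypothetical helper: encode protected separators, split on \w+, then decode.
function splitWordsKeepingNumbers(string $text): array
{
    // Assumption: the placeholders DOT and COMMA never occur in the input.
    // A word such as "DOTTED" would be corrupted when decoding.
    $encode = [
        '/(\d+)\.(\d+)/' => '${1}DOT${2}',
        '/(\d+),(\d+)/'  => '${1}COMMA${2}',
    ];
    foreach ($encode as $regex => $replacement) {
        $text = preg_replace($regex, $replacement, $text);
    }

    // Split: the full-pattern matches end up in $matches[0].
    preg_match_all('/\w+/', $text, $matches);

    // Decode the placeholders back into the original separators.
    return str_replace(['DOT', 'COMMA'], ['.', ','], $matches[0]);
}

$words = splitWordsKeepingNumbers("Look at this.My score is 3.14, not 334,3.");
```

Using ${1} rather than \\1 in the replacement keeps the group reference unambiguous when it is followed directly by other text.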

You can make the regexes used for the exceptions as simple or as complex as you like, but some ambiguity is always going to get through, for example two sentences where the first ends and the next begins with a number: Number of the counting shall be 3.3 only and nothing but the 3.5 is right out..

Marko 2009-10-21 13:48:20
