If I have a description like:

"We prefer questions that can be answered, not just discussed. Provide details. Write clearly and simply."

and all I want is "We prefer questions that can be answered, not just discussed."

I figure I could search with a regular expression such as "[.!?]", find the position of the match, and then take a substr of the main string. I imagine this is a common thing to do, though, so I'm hoping someone has a snippet lying around.

Thanks!
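
The approach described in the question can be sketched roughly as follows. This is a minimal illustration, not a robust solution: the helper name first_sentence_by_offset is hypothetical, and it cuts at the very first ".", "!" or "?" it finds, so abbreviations and decimals will trip it up.

```php
<?php
// Hypothetical helper name; not from the original post.
function first_sentence_by_offset($text) {
    // PREG_OFFSET_CAPTURE makes preg_match() report the byte offset of the match.
    if (preg_match('/[.!?]/', $text, $m, PREG_OFFSET_CAPTURE)) {
        // $m[0][1] is the offset of the terminator; +1 keeps it in the result.
        return substr($text, 0, $m[0][1] + 1);
    }
    // No terminator found: return the input unchanged.
    return $text;
}

$text = "We prefer questions that can be answered, not just discussed. Provide details. Write clearly and simply.";
echo first_sentence_by_offset($text), "\n";
// prints: We prefer questions that can be answered, not just discussed.
```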

+3  A: 
<?php
$text = "We prefer questions that can be answered, not just discussed. Provide details. Write clearly and simply.";
$array = explode('.',$text);
$text = $array[0];
?>
Jason
+1 to this response. It should be noted though that this will explode on all .'s (i.e. the period character). So if the sentence contains abbreviations such as 'i.e.' or 'e.g.' you will run into problems. Apart from that it's the easiest option.
mdec
However, not all sentences end with ".". I'm pretty sure I need something that also deals with "!" and "?", so I think it would have to use a regexp.
FilmJ
You can further split elements of $array by '!', '?', etc.
Jason
But you can't dynamically select which to split by.
Ian Elliott
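
One way to handle several terminators in a single pass, sketched with preg_split() instead of explode(). The helper name first_sentence_split and its $terminators parameter are assumptions made for illustration; the set of terminators is passed in as a plain string so it can be chosen at run time.

```php
<?php
// Hypothetical helper; the name and the default terminator set are assumptions.
function first_sentence_split($text, $terminators = '.!?') {
    // preg_quote() escapes the terminators so they are safe inside a character class.
    $class = '[' . preg_quote($terminators, '/') . ']';
    // The lookbehind splits just after a terminator, so the punctuation stays attached.
    // A limit of 2 stops after the first split point.
    $parts = preg_split('/(?<=' . $class . ')/', $text, 2);
    return $parts[0];
}

echo first_sentence_split("Is this it? Yes. Done!"), "\n";  // prints: Is this it?
echo first_sentence_split("Wow. Really! Yes.", "!"), "\n"; // prints: Wow. Really!
```

Like the plain explode() version, this still cuts at abbreviations and decimal points.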
A: 
 $parts = explode('.', $s, 2);
 $first = reset($parts); // reset() takes its argument by reference, so it needs a variable
p00ya
+7  A: 

A slightly more costly expression, but one that is more adaptable if you want to treat several types of punctuation as sentence terminators.

   $sentence = preg_replace('/([^?!.]*.).*/', '\\1', $string);

To match a termination character only when it is followed by whitespace or the end of the string:

   $sentence = preg_replace('/(.*?[?!.](?=\s|$)).*/', '\\1', $string);
Ian Elliott
Thanks for this. I suppose I can accept the cost, as it will be cached.
FilmJ
Actually, I just realized this was missing one piece. Because it grabs everything up to the end, it drops the actual punctuation character. A "." at the end of the search expression within the parens seems to resolve it: preg_replace('/([^?!.]*.).*/', '\\1', $str);
FilmJ
You must have grabbed the code before I modified :) If you look again that's what I posted.
Ian Elliott
Yes, I saw that right after I posted my comment. Someone below makes the point that it should be a period (or other sentence terminator) followed by at least one blank space (to allow for domain names, for example). I took a stab at it but wasn't able to figure out the right expression, and adding "\s" didn't work.
FilmJ
This regex will fail if the string contains a real number such as 3.14, it will then snip it at the first decimal point.
dyve
Test string for previous comment: We prefer prices below US$ 7.50. Any higher, we won't buy.
dyve
That wasn't in the requirements given, but can be easily changed by checking for a whitespace character `\s`
Ian Elliott
FWIW, just adding \s didn't work for me (see above). Thanks guys, this is a helpful snippet.
FilmJ
Yeah, I realized afterwards that a simple `\s` wouldn't suffice, so I included an example using a positive lookahead to find whitespace.
Ian Elliott
Nice work Ian. Didn't see your improved regex so I provided an alternative below. Yours looks more elegant though. Kudos.
dyve
Okay, so not to beat a dead horse here, but I recently tried to use this code on results returned from YouTube's API, and strangely, with Playlist Feeds it did not work as expected. I then used dyve's solution, and it did. I wonder if Unicode strings are a factor.
FilmJ
A: 

current(explode(".",$input));

Lasiaf
A: 

I'd probably use one of the many substring/string-split functions in PHP (some mentioned here already). But also look for ". " OR ".\n" (and possibly ".\r\n") instead of just ".", in case the sentence contains a period that isn't followed by a space. I think it will improve the likelihood of you getting genuine results.

Example, searching for just "." on:

"I like stackoverflow.com."

Will get you:

"I like stackoverflow."

When really, I'm sure you'd prefer:

"I like stackoverflow.com."

And once you have that basic search, you'll probably come across one or two occasions where it may miss something. Tune as you run with it!
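
A minimal sketch of that idea using only plain string functions. The helper name first_sentence_plain and the exact list of needles are assumptions, and only "." is handled here; "!" and "?" would need their own needles.

```php
<?php
// Hypothetical helper; the name and the needle list are assumptions.
function first_sentence_plain($text) {
    $best = false;
    // Look for a period followed by a space or a line break, and keep the earliest hit.
    foreach (array(". ", ".\n", ".\r\n") as $needle) {
        $pos = strpos($text, $needle);
        if ($pos !== false && ($best === false || $pos < $best)) {
            $best = $pos;
        }
    }
    // $best is the offset of the period itself; +1 keeps it in the result.
    return $best === false ? $text : substr($text, 0, $best + 1);
}

echo first_sentence_plain("I like stackoverflow.com. It is useful."), "\n";
// prints: I like stackoverflow.com.
```

When no needle is found, for example when the only sentence ends the string, the input is returned unchanged.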

Omega
Most strings probably won't have newlines inside them.
Ian Elliott
I do think however that many strings (and some in my project) will have URLs... so it would be good to figure out the solution for that, though the answer accepted above is good for now.
FilmJ
A: 

My previous regex seemed to work in the tester but not in actual PHP. I have edited this answer to provide full, working PHP code, and an improved regex.

$string = 'A simple test!';
var_dump(get_first_sentence($string));

$string = 'A simple test without a character to end the sentence';
var_dump(get_first_sentence($string));

$string = '... But what about me?';
var_dump(get_first_sentence($string));

$string = 'We at StackOverflow.com prefer prices below US$ 7.50. Really, we do.';
var_dump(get_first_sentence($string));

$string = 'This will probably break after this pause .... or won\'t it?';
var_dump(get_first_sentence($string));

function get_first_sentence($string) {
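    // Lazy quantifiers (.*?) stop at the first terminator followed by whitespace;
    // the greedy original ran on to the last one when the text had three or more sentences.
    $array = preg_split('/(^.*?\w+.*?[.?!]\s)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    // With no match preg_split() returns a single element, so guard $array[1].
    return trim($array[0] . (isset($array[1]) ? $array[1] : ''));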
}
dyve
This doesn't appear to work actually. Did you change it since you first posted?
FilmJ
Sorry, rewrote it and it is now working PHP code.
dyve
So this not only works now, but in the end it actually handled my real-world problem, whereas Ian's did not (though at first it did). As I commented above, perhaps this is because the results are Unicode strings. Not sure, but food for thought. Thanks for the function; I'll definitely use it again and again.
FilmJ
A: 

This is a genuinely hard problem. I recommend looking into an NLP package if you require robust results. A tokenizer can identify sentence-ending characters ("?", ".", ";", etc., depending on your intended use), and you can split on those.

Kevin Peterson