I'm trying to split a string on its punctuation, but the string may contain URLs (which conveniently contain all the typical punctuation marks).

I have a basic working knowledge of RegEx, but not enough to help me out here. This is what I was using when I discovered the problem:

$text[$i] = preg_split('/[\.\?!\-]+/', $post->text);

(this also accounts for multiple consecutive punctuation characters - ellipses, !!!!, ????, ?!?, etc)

How would I split a string on the punctuation while maintaining the integrity of URLs? Thanks!

Edit:

My apologies...an example would be something along the lines of a tweet:

"Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value ."

The results should look something like this:

[0] => "Blah blah blah?"
[1] => "A sentence."
[2] => "Here's a link: http://somelink.com?key=value ."
A: 

Is there a pattern that your non-URL punctuation marks follow? In most English sentences, punctuation marks are followed (or sometimes preceded) by a space character. I don't know what your source text is like, but that MIGHT be a reliable way to do it, because the punctuation marks inside a URL will NOT have a space on either side. A URL could still END with a punctuation mark followed by a space, though, so it also depends on the URLs you anticipate.
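A minimal sketch of the whitespace idea, assuming sentence-ending punctuation is limited to `.`, `?` and `!` and is always followed by whitespace. The function name `splitOnSpacedPunctuation` is made up for illustration; the split happens at whitespace that directly follows one of those marks, so punctuation inside a URL is left alone.

```php
<?php
// Hypothetical helper: split only at whitespace that directly follows ., ? or !
function splitOnSpacedPunctuation(string $text): array
{
    // (?<=[.?!]) is a fixed-length lookbehind, so the punctuation itself
    // stays attached to the end of the preceding piece.
    return preg_split('/(?<=[.?!])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$tweet = "Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value .";
print_r(splitOnSpacedPunctuation($tweet));
```

For the tweet in the question this yields the three pieces the asker listed. It will still break on abbreviations followed by a space, such as "Dr. Smith".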

Another approach (if you don't mind doing this in stages) is to remove all of the URLs from the string and then do the rest of your processing on the result. That only works if you don't need the URLs. If you need to preserve the URLs, you can add placeholder strings on either side of each URL, such as ">>>>http://placeholder.com<<<<", and then when you split on punctuation, be sure to exclude any punctuation that occurs between >>>> and <<<<. Afterwards, you would have to remove the >>>> and <<<<.
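A rough sketch of the staged approach. Instead of wrapping each URL in >>>> and <<<<, this variant swaps the whole URL for a punctuation-free token and restores it after the split, which is a slightly different technique from the one described above. The function name `splitWithMaskedUrls`, the `https?://\S+` URL pattern, and the `\x01` token format are all assumptions for illustration.

```php
<?php
// Hypothetical helper: mask URLs, split on punctuation, then put the URLs back.
function splitWithMaskedUrls(string $text): array
{
    $tokens = [];
    $urls = [];

    // Stage 1: replace every URL with a token containing no punctuation.
    // Assumes the text never contains the control character \x01.
    $masked = preg_replace_callback(
        '/https?:\/\/\S+/',
        function (array $m) use (&$tokens, &$urls): string {
            $token = "\x01" . count($urls) . "\x01";
            $tokens[] = $token;
            $urls[] = $m[0];
            return $token;
        },
        $text
    );

    // Stage 2: split on the original punctuation set, restore URLs, tidy up.
    $result = [];
    foreach (preg_split('/[.?!\-]+/', $masked) as $piece) {
        $piece = trim(str_replace($tokens, $urls, $piece));
        if ($piece !== '') {
            $result[] = $piece;
        }
    }
    return $result;
}
```

Like the pattern in the question, this discards the punctuation it splits on, so the example tweet comes back as "Blah blah blah", "A sentence" and "Here's a link: http://somelink.com?key=value".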

FrustratedWithFormsDesigner
I thought about that - a period being followed by a space - but while it does happen the majority of the time, it's not sure-fire, so if there's a 100% accurate way I'd prefer that. Your second suggestion might be that very method; I was just hoping for a way that wouldn't incur extra overhead and could be done in one fell swoop. There has to be a way to find out whether a punctuation character is part of a contiguous URL?
Magsol
@Magsol - I agree that *just* the "period space" isn't enough, especially for folks that abbreviate "doctor" as "Dr." "This belongs to Dr. Smith and his son." should not be rendered, "This belongs to Dr." "Smith and his son."
warren
@warren: Another very good point. I suppose titles could be accounted for, but it wouldn't be pretty (or easy), and it also highlights other ambiguous uses of punctuation...such as ellipses within a single sentence. Oy.
Magsol
A: 

What you're doing here isn't quite splitting on punctuation, because you're trying to keep the punctuation in one of the split items. You're also attempting to discard the whitespace afterwards, but don't seem to have covered that in your question.

I would tackle this in the following way: split your input string with a regular expression which matches punctuation or a URL, and keep the pieces, including the separators. Then iterate over the items, and for each separator decide whether it was punctuation, in which case you can strip trailing whitespace and move it to the end of the previous item, or a URL, in which case you just join it with the preceding and following items.

In PHP, you can keep the delimiters using something like this:

$text[$i] = preg_split('/([\.\?!\-]+|https?:\/\/\S+)/', $post->text, -1, PREG_SPLIT_DELIM_CAPTURE);

where the PREG_SPLIT_DELIM_CAPTURE flag is explained in the documentation as:

If this flag is set, parenthesized expression in the delimiter pattern will be captured and returned as well.
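The iterate-and-rejoin step described above could look roughly like this. It is only a sketch: the function name `splitSentencesKeepingUrls` is hypothetical, the URL alternative is placed first in the pattern, and it relies on `preg_split` returning text and delimiters alternately when there is a single capture group and no `PREG_SPLIT_NO_EMPTY`.

```php
<?php
// Hypothetical helper: split with captured delimiters, then reassemble.
function splitSentencesKeepingUrls(string $text): array
{
    // Note the -1: the third argument is the limit, the fourth the flags.
    $parts = preg_split(
        '/(https?:\/\/\S+|[.?!\-]+)/',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    $sentences = [];
    $current = '';
    foreach ($parts as $i => $part) {
        $current .= $part;
        // Odd indices hold the captured delimiters.
        $isDelimiter = ($i % 2 === 1);
        $isUrl = (bool) preg_match('/^https?:\/\//', $part);
        if ($isDelimiter && !$isUrl) {
            // Punctuation closes the current item; a URL just stays inside it.
            $sentences[] = trim($current);
            $current = '';
        }
    }
    if (trim($current) !== '') {
        $sentences[] = trim($current);
    }
    return $sentences;
}
```

On the tweet from the question this gives "Blah blah blah?", "A sentence." and "Here's a link: http://somelink.com?key=value .". Because `-` is in the punctuation set, hyphenated words are split as well, and `\S+` will swallow any punctuation glued to the end of a URL.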

Tim
I didn't mention the whitespace because it honestly doesn't matter; if it's easier to split on punctuation (and keep the delimiters, as you pointed out) and maintain the whitespace (it can go on either token, doesn't matter), that's fine; otherwise, the reverse is fine as well. Whichever is simpler.
Magsol
A: 

This regex produces the example you've given:

/(?<!http[^\s]{0,2048})[\.\?\!\-]+\B/

It matches your punctuation set only when it is not preceded by 'http' followed by a run of up to 2048 non-whitespace characters, i.e. only when it is not inside a URL. The trailing \B prevents a hyphenated word from causing a split.

but...

This input:

Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value.blah blah blah...

won't split the value.blah into two... but I think a URL-matching regex would have the same problem, as 'value.blah' could be part of a valid URL. I think your data, coming from Twitter users, will be very inconsistent and therefore hard to clean up, even if you go for FrustratedWithFormsDes' second suggestion.
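One caveat for PHP: as far as I know, PCRE does not accept a lookbehind of that variable length, so `preg_split` is likely to reject the pattern above at compile time. A possible substitute, not the same regex, is to let the URL match first and then throw that match away with the `(*SKIP)(*FAIL)` backtracking verbs. The sketch below assumes URLs look like `https?://\S+`; the function name `splitOutsideUrls` is made up.

```php
<?php
// Hypothetical helper: split on punctuation that is not inside a URL.
function splitOutsideUrls(string $text): array
{
    // First alternative: consume a whole URL, then (*SKIP)(*FAIL) discards the
    // match and resumes scanning after it. Second alternative: the original
    // punctuation set, with \B so a hyphen inside a word does not split.
    $pattern = '/https?:\/\/\S+(*SKIP)(*FAIL)|[.?!\-]+\B/';
    $pieces = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

    // Trim each piece and drop anything that was only whitespace.
    $pieces = array_map('trim', $pieces);
    return array_values(array_filter($pieces, fn(string $s): bool => $s !== ''));
}
```

This drops the punctuation it splits on, so the first example comes back as "Blah blah blah", "A sentence" and "Here's a link: http://somelink.com?key=value". The second input behaves as described above: value.blah stays attached to the URL.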

beggs
I agree there are still fringe cases that would be difficult to capture 100% of the time. But you raise a very valid point regarding punctuation just after the URL; that wasn't something I'd considered, nor am I sure how to deal with that.
Magsol