ansaurus

Question

Regex to split on punctuation excluding URLs

Answer 1

A:

Is there a pattern that your non-URL punctuation marks follow? In most English sentences, punctuation marks are followed (or sometimes preceded) by a space character. I don't know what your source text is like, but that MIGHT be a reliable way to do it, because the punctuation marks inside a URL will NOT have a space on either side. A URL could still END with a punctuation mark followed by a space, though, so it also depends on the URLs you anticipate.

Another approach (if you don't mind doing this in stages) is to remove all of the URLs from the string and then do the rest of your processing on the result. That only works if you don't need the URLs. If you need to preserve them, you can add placeholder strings on either side of each URL, such as ">>>>http://placeholder.com<<<<", and then when you split on punctuation, be sure to exclude any punctuation that occurs between >>>> and <<<<. Afterwards, you would have to remove the >>>> and <<<<.
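A minimal sketch of the staged idea, written in Python for illustration. It is a variation on the markers described above: instead of wrapping each URL in >>>> and <<<<, it swaps the whole URL for a punctuation-free token, so the split step needs no exclusion rule. The names `split_preserving_urls`, `URL_RE`, `CHUNK_RE` and the NUL-delimited token are all hypothetical, and the sketch assumes the text contains no NUL characters and that only `.`, `?` and `!` end a chunk.

```python
import re

URL_RE = re.compile(r"https?://\S+")           # rough URL matcher (assumption)
PLACEHOLDER_RE = re.compile(r"\x00(\d+)\x00")  # hypothetical token: NUL, index, NUL
# A chunk is text plus the punctuation run that ends it, or trailing text with none.
CHUNK_RE = re.compile(r"[^.?!]*[.?!]+|[^.?!]+")


def split_preserving_urls(text):
    urls = []

    def stash(match):
        # Remember the URL and leave a punctuation-free token in its place.
        urls.append(match.group(0))
        return f"\x00{len(urls) - 1}\x00"

    masked = URL_RE.sub(stash, text)
    chunks = [c.strip() for c in CHUNK_RE.findall(masked)]
    # Put the original URLs back into each non-empty chunk.
    return [
        PLACEHOLDER_RE.sub(lambda m: urls[int(m.group(1))], c)
        for c in chunks
        if c
    ]


print(split_preserving_urls("Read http://example.com/a.b?x=1 now! Bye."))
```

Note that `\S+` also swallows a period typed directly after a URL, so this sketch does not settle the "URL ending with punctuation" case mentioned above.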

FrustratedWithFormsDesigner 2009-10-30 03:35:52

I thought about that - a period being followed by a space - but while it does happen the majority of the time, it's not sure-fire, so if there's a 100% accurate way I'd prefer that. Your second suggestion might be that very method; I was just hoping for a way that wouldn't incur extra overhead and could be done in one fell swoop. There has to be a way to find if a punctuation character is part of a contiguous URL?

Magsol 2009-10-30 04:15:11

@Magsol - I agree that *just* the "period space" isn't enough, especially for folks that abbreviate "doctor" as "Dr.": "This belongs to Dr. Smith and his son." should not be split into "This belongs to Dr." and "Smith and his son."

warren 2009-10-30 05:10:31

@warren: Another very good point. I suppose titles could be accounted for, but it wouldn't be pretty (or easy), and it also highlights other ambiguous uses of punctuation...such as ellipses within a single sentence. Oy.

Magsol 2009-10-30 16:51:17

Answer 2

+1 A:

What you're doing here isn't quite splitting on punctuation, because you're trying to keep the punctuation in one of the split items. You're also attempting to discard the whitespace afterwards, but don't seem to have covered that in your question.

I would tackle this in the following way: split your input string with a regular expression which matches punctuation or a URL, and keep the pieces, including the separators. Then iterate over the items, and for each separator decide whether it was punctuation, in which case you can strip trailing whitespace and move it to the end of the previous item, or a URL, in which case you just join it with the preceding and following items.

In PHP, you can keep the delimiters using something like this:

// The third argument of preg_split is the limit (-1 = no limit); flags go fourth.
$text[$i] = preg_split('/([\.\?!\-]+|https?:\/\/\S+)/', $post->text, -1, PREG_SPLIT_DELIM_CAPTURE);

where the PREG_SPLIT_DELIM_CAPTURE flag is explained in the documentation as:

If this flag is set, parenthesized expression in the delimiter pattern will be captured and returned as well.
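The whole split-then-reassemble procedure could be sketched roughly as follows. This is Python rather than PHP, purely for illustration: `re.split` with a capturing group returns the separators too, much like PREG_SPLIT_DELIM_CAPTURE. The function name `split_sentences` is hypothetical, and the hyphen is left out of the punctuation set here (an assumption, to avoid splitting hyphenated words).

```python
import re

# The capturing group makes re.split return the separators as well.
# Scanning left to right, the URL alternative matches at the "h" of "http"
# and consumes the whole URL, so its inner dots never count as punctuation.
SEPARATOR_RE = re.compile(r"(https?://\S+|[.?!]+)")


def split_sentences(text):
    parts = SEPARATOR_RE.split(text)
    sentences = []
    current = ""
    # With one capturing group, parts alternate: text, separator, text, ...
    for index, part in enumerate(parts):
        is_separator = index % 2 == 1
        if is_separator and not part.startswith("http"):
            # Punctuation: attach it to the current item and close that item.
            sentences.append((current + part).strip())
            current = ""
        else:
            # Plain text or a URL: join it onto the item being built.
            current += part
    if current.strip():
        sentences.append(current.strip())
    return sentences


sample = "Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value see it!"
for sentence in split_sentences(sample):
    print(sentence)
```

Whitespace ends up stripped from both ends of each item; keeping it on one side instead would only mean dropping the `strip()` calls.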

Tim 2009-10-30 05:01:34

I didn't mention the whitespace because it honestly doesn't matter; if it's easier to split on punctuation (and keep the delimiters, as you pointed out) and maintain the white space (it can go on either token, doesn't matter), that's fine; otherwise, the reverse is fine as well. Whichever is simpler.

Magsol 2009-10-30 14:54:52

Answer 3

A:

This regex produces the example you've given:

/(?<!http[^\s]{0,2048})[\.\?\!\-]+\B/

It looks for your punctuation set where it is not preceded by 'http' followed by up to 2048 non-whitespace characters, i.e. where it is not inside a URL. The trailing \B prevents a hyphenated word from causing a split.

but...

This input:

Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value.blah blah blah...

won't split the value.blah into two... but I think a URL-matching regex would have the same problem, as 'value.blah' could be part of a valid URL. I think your data, coming from Twitter users, will be very inconsistent and therefore hard to clean up, even if you go for FrustratedWithFormsDes' second suggestion.
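To make the ambiguity concrete, here is a small Python sketch of one possible heuristic: match the URL directly and stop before any run of `.`, `?` or `!` that is followed by whitespace or the end of the text. Python's `re` module rejects variable-length lookbehind such as the one above, which is another reason this sketch matches the URL itself. `URL_RE` and `find_urls` are hypothetical names, and the rule "trailing punctuation belongs to the sentence, not the URL" is an assumption that will be wrong for some real URLs.

```python
import re

# Lazy \S+? grows the URL one character at a time until the lookahead holds:
# an optional run of sentence punctuation, then whitespace or end of text.
URL_RE = re.compile(r"https?://\S+?(?=[.?!]*(?:\s|$))")


def find_urls(text):
    return URL_RE.findall(text)


print(find_urls("See http://example.com/page. Next one: "
                "http://somelink.com?key=value.blah blah..."))
```

The first URL loses its trailing period, but `value.blah` stays part of the second URL, because nothing in the text distinguishes it from a legitimate query string.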

beggs 2009-10-30 05:06:15

I agree there are still fringe cases that would be difficult to capture 100% of the time. But you raise a very valid point regarding punctuation just after the URL; that wasn't something I'd considered, nor am I sure how to deal with that.

Magsol 2009-10-30 14:57:09
