ansaurus

Question

Need RegEx to return first paragraph or first n words

Answer 1

A:

1. Use an HTML parser to get the first paragraph, flattening its structure (i.e. remove decorating HTML tags inside the paragraph).
2. Search for the position of the nth whitespace character.
3. Take the substring from 0 to that position.
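
The three steps above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: it uses the standard library's html.parser rather than any particular parser the answer had in mind, the names FirstParagraphText, first_paragraph and first_n_words are made up for this example, and whitespace is collapsed first so that the nth whitespace character coincides with the end of the nth word.

```python
from html.parser import HTMLParser


class FirstParagraphText(HTMLParser):
    """Collects the text of the first <p> element, dropping nested tags."""

    def __init__(self):
        super().__init__()
        self.in_p = False   # currently inside the first <p>
        self.done = False   # first <p> has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done and not self.in_p:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            self.done = True

    def handle_data(self, data):
        # Only text is kept, so <b>, <br/> etc. are flattened away.
        if self.in_p:
            self.parts.append(data)


def first_paragraph(html):
    """Step 1: text of the first paragraph with whitespace collapsed."""
    parser = FirstParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def first_n_words(text, n):
    """Steps 2 and 3: cut the text at the nth whitespace character."""
    count = 0
    for i, ch in enumerate(text):
        if ch.isspace():
            count += 1
            if count == n:
                return text[:i]
    return text  # fewer than n whitespace characters: keep everything
```

For example, first_n_words(first_paragraph(html), 3) would return the first three words of the first paragraph, with any inline tags already removed.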

edit: I removed the regex proposal for steps 2 and 3, since it was wrong (thanks to the commenter). Also, the HTML structure needs to be flattened.

Svante 2009-05-07 12:42:25

Inside a character class, \b matches a backspace character. Also, the problem definition seems to have been changed since you posted this; \w and \W aren't going to cut it.

Alan Moore 2009-05-07 15:14:43

Answer 2

+4 A:

OK, complete re-edit to acknowledge the new "spec" :)

I'm pretty sure you can't do that with one regex. The best tool definitely is an HTML parser. The closest I can get with regexes is a two-step approach.

First, isolate each paragraph's contents with:

<p>(.*?)</p>

You need to set RegexOptions.Singleline if paragraphs can span multiple lines.
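
As a rough illustration of this first step, here is the same pattern in Python, where re.DOTALL plays the role of RegexOptions.Singleline. The function name paragraph_contents is made up for this sketch. Note that the pattern only matches a bare <p> tag without attributes.

```python
import re

# re.DOTALL lets "." match newlines, like RegexOptions.Singleline in .NET.
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def paragraph_contents(html):
    """Return the contents of every <p>...</p> pair, in document order."""
    return PARAGRAPH.findall(html)
```

Without the flag, a paragraph whose text contains a line break would be skipped, because "." stops at the newline.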

Then, in a next step, iterate over your matches and apply the following regex once on each match's Groups[1].Value:

((?:(\S+\s+){1,6})\w+)

That will match up to the first seven items separated by spaces/tabs/newlines (at least two are required), ignoring any punctuation or other non-word characters trailing the last item.

BUT it will treat a tag separated by spaces as one of those items, i.e., in

One, two three <br\> four five six seven

it will only match up to "six". I guess that, regex-wise, there's no way around that.
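
To make the behaviour concrete, here is a small Python sketch of the second step. The function name first_seven_items is made up for this example; the pattern is the one given above, unchanged.

```python
import re

# Up to six "item + whitespace" groups, then the word characters of one more item.
FIRST_WORDS = re.compile(r"((?:(\S+\s+){1,6})\w+)")


def first_seven_items(paragraph):
    """Return the leading items of the paragraph text, or None if nothing matches."""
    match = FIRST_WORDS.search(paragraph)
    return match.group(1) if match else None


plain = first_seven_items("One, two three four five six seven. Eight")
# plain == "One, two three four five six seven"

tagged = first_seven_items(r"One, two three <br\> four five six seven")
# tagged == r"One, two three <br\> four five six"  (the tag counts as an item)
```

The pattern needs at least one whitespace-separated pair, so a paragraph consisting of a single word yields no match at all.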

Tim Pietzcker 2009-05-07 12:47:10

This is perfect - Cheers! I know that there will never be nested p tags, so RegEx is a good fit.

Milky Joe 2009-05-07 13:01:00

Thanks for your efforts - I really appreciate it (and thanks for pointing out the oversight with my original "spec").

Milky Joe 2009-05-08 10:53:00
