I'm looking for a RegEx that returns either the first [n] words in a paragraph or, if the paragraph contains fewer than [n] words, the complete paragraph.

For example, assuming I need, at most, the first 7 words:

<p>one two <tag>three</tag> four five, six seven eight nine ten.</p><p>ignore</p>

I'd get:

one two <tag>three</tag> four five, six seven

And the same RegEx on a paragraph containing fewer than the requested number of words:

<p>one two <tag>three</tag> four five.</p><p>ignore</p>

Would simply return:

one two <tag>three</tag> four five.

My attempt at the problem resulted in the following RegEx:

^(?:\<p.*?\>)((?:\w+\b.*?){1,7}).*(?:\</p\>)

However, this returns just the first word - "one". It doesn't work. I think the .*? (after the \w+\b) is causing problems.

Where am I going wrong? Can anyone present a RegEx that will work?

FYI, I'm using .NET 3.5's RegEx engine (via C#).

Many thanks

A: 
  1. Use an HTML parser to get the first paragraph, flattening its structure (i.e. remove decorating HTML tags inside the paragraph).
  2. Search for the position of the nth whitespace character.
  3. Take the substring from 0 to that position.

edit: I removed the regex proposal for steps 2 and 3, since it was wrong (thanks to the commenter). Also, the HTML structure needs to be flattened.
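The three steps can be sketched roughly as follows. This is only an illustration, written in Python with the standard-library `html.parser` rather than C# (an assumption made for brevity); the class `FirstParagraphText` and the function `first_words` are hypothetical names. Note that flattening drops the inner `<tag>` markup, so the output differs from the tagged output shown in the question.

```python
import re
from html.parser import HTMLParser


class FirstParagraphText(HTMLParser):
    """Collects the text of the first <p> element, ignoring inner tags."""

    def __init__(self):
        super().__init__()
        self.state = "before"  # "before" -> "inside" -> "after" the first <p>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.state == "before":
            self.state = "inside"

    def handle_endtag(self, tag):
        if tag == "p" and self.state == "inside":
            self.state = "after"

    def handle_data(self, data):
        # Only text nodes are kept, which flattens the paragraph.
        if self.state == "inside":
            self.chunks.append(data)


def first_words(html, n):
    """Return the flattened first paragraph, cut at the n-th whitespace run."""
    parser = FirstParagraphText()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    # Step 2: find the n-th run of whitespace; step 3: cut the string there.
    for count, gap in enumerate(re.finditer(r"\s+", text), start=1):
        if count == n:
            return text[:gap.start()]
    return text  # fewer than n gaps: the whole paragraph is returned
```

With the question's first example and n = 7 this yields `one two three four five, six seven`; a paragraph with fewer words comes back whole.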

Svante
Inside a character class, \b matches a backspace character. Also, the problem definition seems to have been changed since you posted this; \w and \W aren't going to cut it.
Alan Moore
+4  A: 

OK, complete re-edit to acknowledge the new "spec" :)

I'm pretty sure you can't do that with one regex. The best tool definitely is an HTML parser. The closest I can get with regexes is a two-step approach.

First, isolate each paragraph's contents with:

<p>(.*?)</p>

You need to set RegexOptions.Singleline if paragraphs can span multiple lines.
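As a rough illustration of this first step, here is a sketch in Python rather than C# (an assumption made for brevity). `re.DOTALL` plays the role of `RegexOptions.Singleline`, and the helper name `paragraph_contents` is hypothetical.

```python
import re

# Lazy match between <p> and </p>. DOTALL lets "." match line breaks,
# which is the effect RegexOptions.Singleline has in .NET.
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def paragraph_contents(html):
    """Return the contents of every <p>...</p> pair, in document order."""
    return PARAGRAPH.findall(html)
```

As written, the pattern only recognises a bare `<p>`; an opening tag with attributes would need something like `<p[^>]*>` instead.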

Then, in a next step, iterate over your matches and apply the following regex once on each match's Groups[1].Value:

((?:(\S+\s+){1,6})\w+)

That will match up to the first seven items separated by spaces/tabs/newlines, ignoring any punctuation or other non-word characters that trail the last item.
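To see what this second regex does with the two paragraphs from the question, here is a small sketch, again in Python instead of C# (an assumption; the pattern itself is unchanged). The helper name `first_items` is hypothetical.

```python
import re

# Up to six "item + whitespace" pairs, followed by one final run of word chars.
FIRST_SEVEN = re.compile(r"((?:(\S+\s+){1,6})\w+)")


def first_items(paragraph):
    """Return the matched leading items, or None if the pattern finds nothing."""
    match = FIRST_SEVEN.search(paragraph)
    return match.group(1) if match else None


long_result = first_items("one two <tag>three</tag> four five, six seven eight nine ten.")
short_result = first_items("one two <tag>three</tag> four five.")
# long_result  is "one two <tag>three</tag> four five, six seven"
# short_result is "one two <tag>three</tag> four five" (the final "." is dropped)
```

One thing to watch for: because of `{1,6}`, the pattern needs at least one whitespace-separated item before the final word, so a one-word paragraph produces no match and would have to be handled separately.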

BUT it will treat a tag separated by spaces as one of those items, i.e. in

One, two three <br/> four five six seven

it will only match up to "six". I guess that regex-wise, there's no way around that.
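The limitation is easy to reproduce. A quick check, sketched in Python rather than C# (an assumption), with a self-closing `<br/>` standing in for the space-separated tag:

```python
import re

FIRST_SEVEN = re.compile(r"((?:(\S+\s+){1,6})\w+)")

text = "One, two three <br/> four five six seven"
matched = FIRST_SEVEN.search(text).group(1)

# Splitting the match on whitespace shows the tag occupying one of the
# seven slots, which is why the last real word is left out.
items = matched.split()
```

Here `matched` ends at "six", and the fourth entry of `items` is the tag itself.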

Tim Pietzcker
This is perfect - Cheers! I know that there will never be nested p tags, so RegEx is a good fit.
Milky Joe
Thanks for your efforts - I really appreciate it (and thanks for pointing out the oversight with my original "spec")
Milky Joe