I have the following string:

<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4

and want to know whether it contains text3 inside a div that comes right after prefix:

prefix<div>...text3...</div>

but I don't know how to write a regex for that, since I can't use [^<]+ because the divs can contain a strong tag inside.

Please help

EDIT:

  1. Div tags after prefix are guaranteed not to be nested
  2. Language is C#
  3. text4 is very long, so the regex must not scan past the closing div

EDIT2: I don't want to use an HTML parser; this can be achieved easily (and MUCH faster) with a regex. The HTML here is simple: no attributes in tags and no nested divs. Even a small percentage of wrong answers is acceptable in my case.

A: 

For C# + HtmlAgilityPack you can do something like:

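using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Drop everything up to and including the first "prefix" that starts outside a tag.
InputString = Regex.Replace(InputString, "^(?:[^<]+?|<[^>]*>)*?prefix", "");

HtmlDocument doc = new HtmlDocument();

doc.LoadHtml(InputString);

// XPath contains() takes two arguments: the node to search (.) and the substring.
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[contains(., 'text3')]");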

Removing the prefix with a regex is still not a good way of dealing with it. Ideally you would use HtmlAgilityPack to find where prefix occurs in the DOM, translate that into a position in the string, and then take a substring(pos,len) (or equivalent) so that only the relevant text is examined (a similar method also avoids looking at text4).
I'm afraid I can't translate all that into code right now; hopefully someone else can help there.


(original answer, before extra details provided)
Here is a JavaScript + jQuery solution:

var InputString = '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';

InputString = InputString.replace(/^.*?prefix/,'');

var MatchingDivs = jQuery('div:contains(text3)','<div>'+InputString+'</div>');

console.log(MatchingDivs.get());

This makes use of jQuery's ability to accept a context as second argument (though it appears this needs to be wrapped in div tags to actually work).

Peter Boughton
Splitting by `prefix` and then trying to parse one of the resulting substrings could also result in parse errors if prefix occurs within a tag. (I haven't used jQuery, though, so I don't know how it would behave in such a situation.)
David
Yeah, that bit is certainly not great, but my brain isn't awake enough to come up with a proper solution for it. :( I have improved it slightly by switching to a replace though.
Peter Boughton
+2  A: 

If you turn off the "greedy" option, you should be able to just use something like prefix<div>.*text3.*</div>. (If the <div> is allowed to have attributes, use prefix<div[^>]*>.*text3.*</div> instead.)

Numerous improvements could be made to this in order to take account of unusual spacing, >s within quotes, </div> within quotes, etc.

Patterns like prefix<div>...<div></div>text3</div> would be more difficult. You might have to capture all of the occurrences of the div tag so that you could count how many div tags were open at a given time.
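
A minimal sketch of that counting idea, written in JavaScript like the snippet in the other answer. The helper name containsInDivAfterPrefix is hypothetical, and the sketch assumes it is enough to inspect the first div block that follows the first occurrence of prefix; it also tolerates simple attributes on the div tags.

```javascript
// Hypothetical helper: does `needle` occur inside the first <div> block
// that follows `prefix`, allowing nested divs inside that block?
function containsInDivAfterPrefix(html, prefix, needle) {
  const start = html.indexOf(prefix);
  if (start === -1) return false;

  // Scan only opening and closing div tags located after the prefix.
  const tagRe = /<(\/?)div\b[^>]*>/g;
  tagRe.lastIndex = start + prefix.length;

  let depth = 0;        // number of div tags currently open
  let blockStart = -1;  // index where the outermost div's content begins
  let m;
  while ((m = tagRe.exec(html)) !== null) {
    if (m[1] === '') {
      // Opening tag: remember where the outermost block's content starts.
      if (depth === 0) blockStart = tagRe.lastIndex;
      depth++;
    } else if (depth > 0) {
      depth--;
      if (depth === 0) {
        // Outermost div closed: search its content and stop scanning.
        return html.slice(blockStart, m.index).includes(needle);
      }
    }
  }
  return false; // no complete div block after the prefix
}
```

Because the function returns as soon as the outermost div closes, it never looks at the text that follows it. For the nested example above, containsInDivAfterPrefix('prefix<div>a<div></div>text3</div>', 'prefix', 'text3') returns true.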

EDIT: Oops, turning off the greedy option won't always give the right result, even in examples other than the one above. Probably better just to capture all occurrences of the div tag and go from there. As noted above by Peter, HTML is not a regular language and so you can't use regular expressions to do everything you might want with it.
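
To illustrate how the non-greedy pattern can go wrong, here is a small JavaScript check. The input string is made up for the example, and \s* is added only to tolerate whitespace after prefix.

```javascript
// Lazy version of the pattern; \s* tolerates the space after "prefix".
const lazy = /prefix\s*<div>.*?text3.*?<\/div>/;

// text3 sits in the SECOND div after prefix, so ideally this would not match.
const input = 'prefix <div>text1</div> other <div>text3</div> text4';

const m = input.match(lazy);
console.log(m !== null); // true: a false positive
console.log(m[0]);       // prefix <div>text1</div> other <div>text3</div>
```

The lazy .*? only tries the shortest match first; when text3 is not found inside the first div, it keeps expanding across the closing tag until it finds one somewhere later in the string.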

David
A: 

This is my new regex:

prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>

It seems to work OK.
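
A quick way to check the pattern, sketched in JavaScript (the constructs used here, character classes and the negative lookahead (?!...), are also supported by .NET's Regex class, where the / needs no escaping). The tolerant variant with \s* is an assumed tweak that is not part of the posted regex, and the wrongDiv string is made up for the test.

```javascript
// Pattern as posted; only the "/" is escaped for a JavaScript regex literal.
const posted = /prefix<div>([^<]*<(?!\/div>))*[^<]*text3([^<]*<(?!\/div>))*[^<]*<\/div>/;

// Assumed tweak (not in the post): allow whitespace between prefix and <div>.
const tolerant = /prefix\s*<div>([^<]*<(?!\/div>))*[^<]*text3([^<]*<(?!\/div>))*[^<]*<\/div>/;

const sample = '<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4';
const wrongDiv = 'prefix<div>text1</div> other <div>text3</div> text4';

console.log(posted.test(sample));     // false: the sample has a space after "prefix"
console.log(tolerant.test(sample));   // true
console.log(tolerant.test(wrongDiv)); // false: the match cannot cross </div>
```

Each repetition of ([^<]*<(?!\/div>)) consumes text up to one < that does not start </div>, so the match can pass through tags such as <strong> but always stops at the first closing div and never scans into text4.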

Poma