ansaurus

Question

How do I find an HTML div that contains specific text after a text prefix?

Answer 1

A:

For C# + HtmlAgilityPack you can do something like:

InputString = Regex.Replace(InputString,"^(?:[^<]+?|<[^>]*>)*?prefix","");

HtmlDocument doc = new HtmlDocument();

doc.LoadHtml(InputString);

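// XPath contains() takes two arguments; "." is the string value of the div (all of its descendant text).
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[contains(., 'text3')]");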

Stripping the prefix with a regex is still not a good way of dealing with it, because the regex knows nothing about the document structure. Ideally you'd use HtmlAgilityPack to find where prefix occurs in the DOM, translate that into a position in the string, then take a substring(pos,len) (or equivalent) so that only the relevant text is examined (a similar method would also let you avoid looking at text4).
I'm afraid I can't translate all that into code right now; hopefully someone else can help there.
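A rough sketch of the "find the prefix outside of tags, then take a substring" idea, written in plain JavaScript rather than with HtmlAgilityPack. The function name textAfterPrefix and the token regex are made up for illustration and are not part of any library. The tokenizer is deliberately naive: it assumes no > inside attribute values, comments or script blocks, and it will not find a prefix that is split across tags.

```javascript
// Return the markup that follows the first occurrence of `prefix` found in a
// text run (not inside a tag), or null if there is no such occurrence.
function textAfterPrefix(html, prefix) {
  // Each match is either a whole tag ("<...>") or a run of text between tags.
  for (const m of html.matchAll(/<[^>]*>|[^<]+/g)) {
    const token = m[0];
    if (token.startsWith('<')) continue; // a prefix inside a tag is ignored
    const at = token.indexOf(prefix);
    if (at !== -1) {
      // m.index is the token's position in the original string.
      return html.slice(m.index + at + prefix.length);
    }
  }
  return null;
}

const sample =
  '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';
console.log(textAfterPrefix(sample, 'prefix'));
```

The returned substring can then be handed to whatever parser you are using, in place of the regex replace.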

(original answer, before extra details provided)
Here is a JavaScript + jQuery solution:

var InputString = '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';

InputString = InputString.replace(/^.*?prefix/,'');

var MatchingDivs = jQuery('div:contains(text3)','<div>'+InputString+'</div>');

console.log(MatchingDivs.get());

This makes use of jQuery's ability to accept a context as second argument (though it appears this needs to be wrapped in div tags to actually work).

Peter Boughton 2010-08-06 11:51:38

Splitting by `prefix` and then trying to parse one of the resulting substrings could also result in parse errors if prefix occurs within a tag. (I haven't used jQuery, though, so I don't know how it would behave in such a situation.)

David 2010-08-06 12:04:30

Yeah, that bit is certainly not great, but my brain isn't awake enough to come up with a proper solution for it. :( I have improved it slightly by switching to a replace though.

Peter Boughton 2010-08-06 12:06:32

Answer 2

+2 A:

If you turn off the "greedy" option, you should be able to just use something like prefix<div>.*text3.*</div>. (If the <div> is allowed to have attributes, use prefix<div[^>]*>.*text3.*</div> instead.)
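For illustration, here is roughly what that pattern looks like in JavaScript, which has no global "greedy" switch, so the lazy quantifier .*? is used instead (that reading of the answer is an assumption). The \s* after prefix is also an addition, so that the space in the sample input is tolerated, and findDivAfterPrefix is a hypothetical helper name.

```javascript
// Lazy (non-greedy) version of the pattern. The "s" flag lets "." match
// newlines, so the div may span several lines.
const lazyDivPattern = /prefix\s*<div[^>]*>.*?text3.*?<\/div>/s;

// Return the matched text (from "prefix" to the closing tag) or null.
function findDivAfterPrefix(html) {
  const m = html.match(lazyDivPattern);
  return m ? m[0] : null;
}

const sample =
  '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';
console.log(findDivAfterPrefix(sample));

// Limitation: when the first div after the prefix does not contain text3,
// the lazy match simply runs on into the next div.
console.log(findDivAfterPrefix('prefix<div>a</div><div>text3</div>'));
```

The second call returns the whole input, spanning two divs, which is the kind of wrong result the pattern can produce even when lazy.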

Numerous improvements could be made to this in order to take account of unusual spacing, >s within quotes, </div> within quotes, etc.

Patterns like prefix<div>...<div></div>text3</div> would be more difficult. You might have to capture all of the occurrences of the div tag so that you could count how many div tags were open at a given time.

EDIT: Oops, turning off the greedy option won't always give the right result, even in examples other than the one above. Probably better just to capture all occurrences of the div tag and go from there. As noted above by Peter, HTML is not a regular language and so you can't use regular expressions to do everything you might want with it.
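A minimal sketch of the "capture every div tag and count how many are open" approach, in JavaScript. The name findDivContaining is hypothetical, and several simplifications are assumed: prefix is located with a plain indexOf (so a prefix inside a tag is not detected), the needle is searched for in the raw markup of the div (tags and attributes included), and quoted > characters are not handled.

```javascript
// Return the first div that closes after `prefix` and contains `needle`,
// taking nested divs into account, or null if there is none.
function findDivContaining(html, prefix, needle) {
  const start = html.indexOf(prefix);
  if (start === -1) return null;

  // Matches opening and closing div tags; group 1 is "/" for a closing tag.
  const tagPattern = /<(\/?)div\b[^>]*>/gi;
  tagPattern.lastIndex = start + prefix.length; // only scan after the prefix

  const openStack = []; // start positions of the divs that are currently open
  let m;
  while ((m = tagPattern.exec(html)) !== null) {
    if (m[1] === '') {
      openStack.push(m.index);
    } else if (openStack.length > 0) {
      // A closing tag pairs with the most recently opened div.
      const openIndex = openStack.pop();
      const candidate = html.slice(openIndex, m.index + m[0].length);
      if (candidate.includes(needle)) return candidate;
    }
  }
  return null;
}

console.log(findDivContaining('prefix<div><div></div>text3</div>', 'prefix', 'text3'));
```

Because inner divs close first, this returns the innermost div that contains the text. For anything beyond quick checks a real HTML parser is still the safer choice.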

David 2010-08-06 11:54:14

Answer 3

A:

This is my new regex:

prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>

Seems to work OK.
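A quick way to try the pattern out, transcribed as a JavaScript regex literal. The only changes are escaping the / characters and making the groups non-capturing; matchesPattern is a hypothetical helper and the inputs are made up for the check.

```javascript
// The pattern from this answer as a JavaScript literal.
const divWithText3 =
  /prefix<div>(?:[^<]*<(?!\/div>))*[^<]*text3(?:[^<]*<(?!\/div>))*[^<]*<\/div>/;

function matchesPattern(html) {
  return divWithText3.test(html);
}

// Inline tags inside the div are fine.
console.log(matchesPattern('prefix<div>text1 <strong>text2</strong> text3 </div> text4'));

// The match does not run on into a later div.
console.log(matchesPattern('prefix<div>a</div><div>text3</div>'));

// Limitations: a nested div, or whitespace between prefix and <div>,
// prevents a match.
console.log(matchesPattern('prefix<div><div></div>text3</div>'));
console.log(matchesPattern('prefix <div>text3</div>'));
```

Only the first call matches. The negative lookahead (?!/div>) is what stops the pattern at the first closing tag, which is also why nested divs are not supported.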

Poma 2010-08-06 13:27:29
