I have a regex that I tested in Expresso, where it works like a charm. But when I try to use it in JavaScript, it gives an error. Firebug says:

invalid quantifier ?><div\b[^>]*>(?<DEPTH>)|<\/div>(?<-DEPTH>)|.?)*(?(DEPTH)(?!))<\/div>

the regex:

<div\b[^>]*>(?><div\b[^>]*>(?<DEPTH>)|</div>(?<-DEPTH>)|.?)*(?(DEPTH)(?!))</div>

The regex matches nested html-divs such as:

<div id="foo"><div>blubb</div><div foobar>blubb</div></div>

Is the javascript regex only a subset?

edit: I have to strip the div's, including the text between them, away.

<div id="foo"><div>blubb</div><div foobar>blubb</div></div>some
non html...

only the "some non html..." should stay. So I think I can't use any htmlparser?

+4  A: 

Is the javascript regex only a subset?

No, they are different - there are a variety of Regular Expression engines out there, and they each have different features/quirks.

C#'s regex engine has more features than JavaScript's, but the JS one is not derived from it, so it isn't a subset.

Here's a couple of pages that document the differences:

And that whole website (regular-expressions.info) is well worth browsing to learn more about regex.


The regex matches nested html-divs

It probably doesn't, not in all cases.

And it certainly won't be possible with a single JS regex, since JS doesn't support balancing groups (the DEPTH constructs) or atomic groups, among other things.
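To see the difference concretely, here is a minimal sketch, assuming a current JavaScript engine (a modern browser or Node.js), that hands the .NET pattern from the question to the RegExp constructor. The variable names are illustrative only. The engine rejects the pattern while compiling it, at the atomic group `(?>`, before it even reaches the balancing groups.

```javascript
// The .NET pattern from the question, written as a string so that the
// failure happens at runtime instead of when the script itself is parsed.
const dotNetPattern =
  '<div\\b[^>]*>(?><div\\b[^>]*>(?<DEPTH>)|</div>(?<-DEPTH>)|.?)*(?(DEPTH)(?!))</div>';

let compileError = null;
try {
  new RegExp(dotNetPattern);
} catch (e) {
  compileError = e; // JavaScript refuses to compile the pattern
}

console.log(compileError instanceof SyntaxError); // true
console.log(compileError && compileError.message); // wording differs per engine
```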

You're using the wrong tool for this job - parsing HTML should be done with a proper HTML parser/selector, then analysing the DOM to find the nested divs.

Anything that implements Sizzle should do (e.g. jQuery, Dojo Toolkit, and others).

For example, something like jQuery('div:has(div)') or dojo.query('div:has(div)') or similar, should find nested divs (i.e. select all divs which have a div nested inside them), and will correctly cope with assorted quirks which can be complex if not impossible with a single regex.


edit: I have to strip the div's, including the text between them, away.
<div id="foo"><div>blubb</div><div foobar>blubb</div></div>some non html...
only the "some non html..." should stay. So I think I can't use any htmlparser?

No - that is even more reason to use an HTML parser, and not attempt a messy regex hack.

jQuery('#foo div').remove()

That will remove all DIVs nested inside #foo, and leave the HTML text node in place.

Depending on your precise requirements, the selector might need changing, but this is absolutely a task for a tool that is designed to understand HTML.

Peter Boughton
I have edited my question and added an example. Because of that, I think I can't use jQuery. Using it was my first approach, but it won't work.
chriszero
chriszero, I've updated my answer - using jQuery (or equiv) is the right way to handle this. If you're having problems with that approach, I suggest a question specific to that, with details of what/why you're trying to achieve (i.e. provide concise context instead of just some specific details.)
Peter Boughton
+1  A: 

Of course, today's JavaScript doesn't support atomic groups or recursive regexes, but you could easily build a quick & dirty solution by piecewise stripping of tags from the HTML source. If other solutions are too complicated and the structure of the documents is predictable, you could do something like:

 function stripme(tag, code)
 {
   var strp = code;
   // The greedy (.*) involves backtracking: one match spans from the first
   // opening tag to the last closing tag on the same line (. does not match
   // newlines).
   var regexp = new RegExp('<'+tag+'[^>]*?>(.*)</'+tag+'>');
   // Repeat until no tag pair is left; the contents captured by (.*) are
   // available in RegExp.$1 (if needed).
   while( strp.match(regexp) )
     strp = strp.replace(regexp, '');
   return strp;
 }

This will work with, for example:

 var html ='<div id="foo"><div>blubb</div><div foobar>blubb</div></div>some non html...';

when invoked with, e.g.:

 window.onload = function() { var stripped=stripme('div', html); alert(stripped); }

BTW, if possible, always use a DOM parser or JavaScript library, as recommended by Peter Boughton.

Regards

rbo

rubber boots
That's very very nice =) I've thought about something similar.
chriszero
Instead of "if the structure of the documents is predictable" that part should read "if the structure of the documents is *guaranteed* and there will be no non-removable text between removed tags" - i.e. if the input was `<div><div>stuff</div></div>don't remove<div>stuff</div>` the whole of that would be removed.
Peter Boughton
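To illustrate the case described in this comment: a single greedy regex removes the text between two separate div blocks, whereas counting the nesting depth by hand keeps it. The following is only a sketch. `stripNested` is a hypothetical helper, not part of either answer, and it assumes that `tag` is a plain tag name, that tags are well-formed, and that there are no self-closing tags, comments, or `>` characters inside attribute values. A real HTML parser remains the safer choice.

```javascript
// Hypothetical helper: removes every top-level <tag>...</tag> block,
// including blocks of the same tag nested inside it, and keeps the
// text that lies outside those blocks.
function stripNested(tag, code) {
  // Matches one opening or closing tag; group 1 is "/" for a closing tag.
  const tagRe = new RegExp('<(/?)' + tag + '\\b[^>]*>', 'gi');
  let result = '';
  let depth = 0;    // number of <tag> elements currently open
  let keepFrom = 0; // start of the next chunk of text to keep
  let m;
  while ((m = tagRe.exec(code)) !== null) {
    if (m[1] === '') {
      // Opening tag: at depth 0, keep the text that preceded it.
      if (depth === 0) result += code.slice(keepFrom, m.index);
      depth++;
    } else if (depth > 0) {
      // Closing tag that pairs with an open one.
      depth--;
      if (depth === 0) keepFrom = tagRe.lastIndex;
    }
  }
  // If a block was never closed, the rest of the input is dropped.
  return depth === 0 ? result + code.slice(keepFrom) : result;
}

console.log(stripNested('div', "<div><div>stuff</div></div>don't remove<div>stuff</div>"));
// prints "don't remove"
```

With the input from the question, `stripNested('div', html)` returns `some non html...`, the same result as the regex loop.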
Passing in `b` will incorrectly remove `blockquote`, so you need a `\s` before the `[^>]*?` to prevent that - oh and you've got an unnecessary non-greedy quantifier - with `[^>]*?>` it'll always keep going until the next char is a `>` anyway, so can just use `[^>]*>` instead (with regex that supports atomic, there might be a speed improvement for non-matches with a possessive quantifier `[^>]*+>`, but if so it'd be minor)
Peter Boughton
One more thought - depending on the source, it might be worth making the regex case-insensitive; potentially there are div/DIV mismatches to deal with.
Peter Boughton
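Putting the three suggestions from these comments together, a revised `stripme` might look like the sketch below. One detail is an interpretation rather than something stated in the comments: the `\s` is placed inside an optional group, `(?:\s[^>]*)?`, so that a bare `<div>` without attributes still matches. The greedy `(.*)` is unchanged, so the caveat about text between separate blocks still applies.

```javascript
// Sketch of stripme with the comment suggestions applied:
//  - (?:\s[^>]*)? : attributes must start with whitespace, so tag "b"
//                   no longer matches "<blockquote>"
//  - [^>]* instead of the non-greedy [^>]*?
//  - the "i" flag, so <DIV> and </div> pair up
function stripme(tag, code) {
  const regexp = new RegExp('<' + tag + '(?:\\s[^>]*)?>(.*)</' + tag + '>', 'i');
  let strp = code;
  while (regexp.test(strp)) {
    strp = strp.replace(regexp, '');
  }
  return strp;
}

console.log(stripme('b', '<blockquote>kept</blockquote><B class="x">gone</b>'));
// prints "<blockquote>kept</blockquote>"
```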
rubber boots