Hi

I have a problem that I need to solve with a regex.

In Firefox or IE8, my JavaScript generates the following markup, which is what I want:

<div style="visibility: hidden;" id="wizardId1">1001</div><div style="visibility: hidden;" id="wizardId2">1002</div>

However, IE7 generates it differently:

<DIV id=wizardId1 style="VISIBILITY: hidden;">1001</DIV><DIV id=wizardId2 style="VISIBILITY: hidden;" >1002</DIV>

Here the id of each div is placed before the style attribute.

In my Java program, I have a regex that only supports the first form (Firefox and IE8):

<(?:DIV|div)\s+style=(?:["\'])*(?:[\w\d:; ]+)*(?:["\'])*\s+id=(?:["\'])*([\w\d]+)(?:["\'])*>([\w\d]+)</(?:DIV|div)>

Because IE7 places the id before the style, this regex does not match its output, and I cannot get the result I want.

Expected result:

Match 1: <div style="visibility: hidden;" id="wizardId1">1001</div>
    Subgroups:
    1: wizardId1
    2: 1001
Match 2: <div style="visibility: hidden;" id="wizardId2">1002</div>
    Subgroups:
    1: wizardId2
    2: 1002

I tried this regex (with the style part taken out), but it only returns the last id:

<(?:DIV|div).*\s+id=(?:["\'])*([\w\d]+)(?:["\'])*>([\w\d]+)</(?:DIV|div)>

Unwanted result:

Match 1: <div style="visibility: hidden;" id="wizardId1">1001</div><div style="visibility: hidden;" id="wizardId2">1002</div>
    Subgroups:
    1: wizardId2
    2: 1002

Question

How can I get the same result as the first one with a regex that does not depend on the ( style="visibility: hidden;" ) part? (Without using .* and without adding an extra group.)

Thanks for helping me.

+1  A: 

Does this work for you?

<(?:DIV|div)(?:(?:\s+style=(?:["\'])*(?:[\w\d:; ]+)*(?:["\'])*)|(?:\s+id=(?:["\'])*([\w\d]+)(?:["\'])*))*>([\w\d]+)</(?:DIV|div)>
Thomas
Wah it is working now.... Thank you sooooooooo much.
Joe Ijam
+1  A: 

Previously, the .* was matching everything from the end of the first <div through and including the second <div.

You can try using a minimal match.

So

<(?:DIV|div).*?\s+id=(?:["\'])*([\w\d]+)(?:["\'])*[^>]*>([\w\d]+)</(?:DIV|div)>

instead of

<(?:DIV|div).*\s+id=(?:["\'])*([\w\d]+)(?:["\'])*[^>]*>([\w\d]+)</(?:DIV|div)>

Note that the ? after the .* makes it match as few characters as possible.
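To make the difference concrete, here is a minimal Java sketch (Java because the question mentions a Java program). The class name LazyVsGreedy and the helper extract are made up for illustration; the two patterns are the ones shown in this answer, with the unnecessary backslash before ' dropped, and the sample strings are copied from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsGreedy {
    // The two patterns from this answer; they differ only in ".*" versus ".*?".
    static final Pattern GREEDY = Pattern.compile(
        "<(?:DIV|div).*\\s+id=(?:[\"'])*([\\w\\d]+)(?:[\"'])*[^>]*>([\\w\\d]+)</(?:DIV|div)>");
    static final Pattern LAZY = Pattern.compile(
        "<(?:DIV|div).*?\\s+id=(?:[\"'])*([\\w\\d]+)(?:[\"'])*[^>]*>([\\w\\d]+)</(?:DIV|div)>");

    // Sample markup copied from the question.
    static final String FIREFOX =
        "<div style=\"visibility: hidden;\" id=\"wizardId1\">1001</div>"
        + "<div style=\"visibility: hidden;\" id=\"wizardId2\">1002</div>";
    static final String IE7 =
        "<DIV id=wizardId1 style=\"VISIBILITY: hidden;\">1001</DIV>"
        + "<DIV id=wizardId2 style=\"VISIBILITY: hidden;\" >1002</DIV>";

    // Hypothetical helper: returns one "id=text" string per match.
    static List<String> extract(Pattern pattern, String html) {
        List<String> pairs = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            pairs.add(m.group(1) + "=" + m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Greedy: one match spanning both divs, so only the last id is captured.
        System.out.println(extract(GREEDY, FIREFOX)); // [wizardId2=1002]
        // Lazy: one match per div, regardless of attribute order.
        System.out.println(extract(LAZY, FIREFOX));   // [wizardId1=1001, wizardId2=1002]
        System.out.println(extract(LAZY, IE7));       // [wizardId1=1001, wizardId2=1002]
    }
}
```

The [^>]* after the id is what lets the IE7 form through, since there the style attribute comes after the id.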

I would recommend against trying to parse HTML with regexes, though. Maybe you could try a SAX-style parser like makeSaxParser in http://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/plugin/html-sanitizer.js

Mike Samuel
This is exactly the perfect answer. Thank you so much bro.
Joe Ijam
+1  A: 

This works OK and is pretty general (I assumed you don't need to check for the style attribute):

<div.+?id="([^"]+).+?>([^<]+)

Don't forget to turn case insensitivity on. In JavaScript it should look like:

/<div.+?id="([^"]+).+?>([^<]+)/i
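Since the question mentions a Java program, here is a rough Java sketch of the same idea, where Pattern.CASE_INSENSITIVE plays the role of the /i flag. One caveat: as written, the pattern expects the id value in double quotes, so it does not match the unquoted IE7 output from the question. ANY_ID below is a hypothetical variant (not part of this answer) that makes the quote optional; the class and helper names are also made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseInsensitiveDivs {
    // The pattern from this answer, compiled case-insensitively.
    static final Pattern QUOTED_ID = Pattern.compile(
        "<div.+?id=\"([^\"]+).+?>([^<]+)", Pattern.CASE_INSENSITIVE);

    // Hypothetical variant: the quote around the id is optional,
    // and [^>]* keeps the scan inside a single tag.
    static final Pattern ANY_ID = Pattern.compile(
        "<div\\b[^>]*?\\bid=[\"']?([^\"'\\s>]+)[^>]*>([^<]+)", Pattern.CASE_INSENSITIVE);

    // Sample markup copied from the question.
    static final String FIREFOX =
        "<div style=\"visibility: hidden;\" id=\"wizardId1\">1001</div>"
        + "<div style=\"visibility: hidden;\" id=\"wizardId2\">1002</div>";
    static final String IE7 =
        "<DIV id=wizardId1 style=\"VISIBILITY: hidden;\">1001</DIV>"
        + "<DIV id=wizardId2 style=\"VISIBILITY: hidden;\" >1002</DIV>";

    // Hypothetical helper: returns one "id=text" string per match.
    static List<String> extract(Pattern pattern, String html) {
        List<String> pairs = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            pairs.add(m.group(1) + "=" + m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(extract(QUOTED_ID, FIREFOX)); // [wizardId1=1001, wizardId2=1002]
        System.out.println(extract(QUOTED_ID, IE7));     // [] (the IE7 ids are unquoted)
        System.out.println(extract(ANY_ID, FIREFOX));    // [wizardId1=1001, wizardId2=1002]
        System.out.println(extract(ANY_ID, IE7));        // [wizardId1=1001, wizardId2=1002]
    }
}
```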

hpuiu
Thanks... So simple.
Joe Ijam