views:

3224

answers:

7

I've been investigating an issue that only seems to get worse the deeper I dig.

I started innocently enough trying to use this expression to split a string on HTML 'br' tags:

T = captions.innerHTML.split(/<br.*?>/g);

This works in every browser (FF, Safari, Chrome), except IE7 and IE8 with example input text like this:

is invariably subjective. <br /> 
The less frequently used warnings (Probably/Possibly) <br />

Please note that the example text contains a space before the '/', and precedes a new line.

Both of the following will match all HTML tags in every browser:

T = captions.innerHTML.split(/<.*?>/g);
T = captions.innerHTML.split(/<.+?>/g);

However, surprisingly (to me at least), this does not work in FF and Chrome:

T = captions.innerHTML.split(/<br.+?>/g);
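One possible explanation, offered as an assumption rather than something verified in this thread: Firefox and Chrome may serialize the tag as a bare <br> when it is read through innerHTML. In that case `.+?` has nothing to consume, because it needs at least one character before the '>' and `.` never matches a newline. The sample string below is hypothetical.

```javascript
// Hypothetical innerHTML as a browser might serialize it: "<br />" becomes "<br>".
var serialized = "is invariably subjective. <br>\nThe less frequently used warnings <br>\n";

// ".+?" requires at least one character before ">", and "." never matches "\n",
// so a bare "<br>" directly followed by a newline cannot match.
var withPlus = serialized.split(/<br.+?>/g);

// ".*?" allows zero characters, so the bare "<br>" matches.
var withStar = serialized.split(/<br.*?>/g);
```

With this input, withPlus stays a single unsplit string while withStar has three pieces.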

Edit:

This (suggested several times in the responses below) does not work on IE 7 or 8:

T = captions.innerHTML.split(/<br[^>]*>/g);

(It did work on Chrome and FF.)

My question is: does anyone know an expression that works in all current browsers to match the 'br' tags above (but not other HTML tags)? And can anyone confirm that the last example above should be a valid match, since two characters are present in the example text before the '>'?

PS - my doctype is HTML transitional.

Edit:

I think I have evidence this is specific to the string.split() behavior on IE, and not regex in general. You have to use split() to see this issue. I have also found a test matrix that shows a failure rate of about 30% for split() test cases when I ran it on IE. The same tests passed 100% on FF and Chrome:

http://stevenlevithan.com/demo/split.cfm

So far, I have still not found a solution for IE, and the library provided by the author of that test matrix did not fix this case.
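For reference, here is a small sketch of two split() behaviors that the ECMAScript spec defines. As far as I understand the test matrix, these are the kind of cases where IE 8 and earlier differ (empty strings dropped, captured groups left out), but that reading is an assumption on my part. The inputs are made up for illustration.

```javascript
// Behavior defined by the ECMAScript spec; engines that follow it give these results.
var emptyKept = "a,,b".split(/,/);        // adjacent delimiters leave an empty string
var groupKept = "a<br>b".split(/(<br>)/); // a captured delimiter is spliced into the result
```

A spec-conforming engine gives three elements in both cases. Neither behavior affects the plain, uncaptured 'br' pattern in the question.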

+4  A: 

Try this one:

/<br[^>]*>/gi
Chad Birch
I'd advise /gi since you never know how someone will case their tags
Dr.Dredel
This works in Chrome and FF, and fails in IE. I'm giving +1 because it *should* work.
Walt Gordon Jones
Btw, I now realize it does NOT fail when used exactly as you provided here. I omitted the 'i' flag because I was working with a known lower-case source. Lesson learned: IE up-cases tags in innerHTML.
Walt Gordon Jones
A: 

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

In particular you may be interested in the JavaScript+DOM answer.

Chas. Owens
Yep, I'm not intending to do a full HTML parser, and this is not a jQuery environment. Please note, there is not a problem with regex handling this, but a browser compat issue in IE 7 and 8. (Although the example that failed in FF also puzzles me.)
Walt Gordon Jones
"Regexes are fundamentally bad at parsing HTML" -- not if you know what the input is going to look like.
nickf
@Walt Gordon Jones It isn't a matter of what you intend to do or not. Regexes can't handle HTML; it isn't what they are good at. At least take a look at doing it with a parser. You can always use the DOM.
Chas. Owens
@nickf And is the input going to stay the same? Using a parser saves you time in the long run as regexes are extremely fragile when parsing HTML (if they even work in the first place).
Chas. Owens
Guys, I totally agree. You are making an excellent point, but in this specific case, I just needed to create an array using 'br' tags as delimiters. I don't think there's a DOM method for that, is there?
Walt Gordon Jones
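There is no single built-in DOM call for this that I know of, but walking childNodes gets close. This is a minimal sketch and makes some assumptions: the 'br' tags are direct children of the container, and any other child elements can be ignored. splitOnBr is a hypothetical helper name, not an existing API.

```javascript
// Hypothetical helper: collect the text found between <br> children of a container.
// Only direct children are examined; elements other than <br> are skipped.
function splitOnBr(container) {
  var parts = [''];
  var nodes = container.childNodes;
  for (var i = 0; i < nodes.length; i++) {
    var node = nodes[i];
    if (node.nodeType === 1 && node.nodeName.toUpperCase() === 'BR') {
      parts.push('');                             // a <br> starts a new piece
    } else if (node.nodeType === 3) {
      parts[parts.length - 1] += node.nodeValue;  // text node: append to the current piece
    }
  }
  return parts;
}

// Usage in a page (the element id 'captions' is assumed):
// var T = splitOnBr(document.getElementById('captions'));
```

Because it compares nodeName after upper-casing it, the sketch does not depend on how a browser cases the tag in innerHTML.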
+1  A: 

Instead of

/<br.*?>/

you could try

/<br[^>]*>/

i.e. matching "<br", followed by any characters other than '>', followed by '>'.

hlovdal
Thanks, still fails in IE only.
Walt Gordon Jones
A: 

Well, unfortunately I don't have a wide variety of browsers at work (just IE - sigh) but right off the bat I can see a way to optimize your regex:

T = captions.innerHTML.split(/<br[^>]*?>/g);

The inline character class definition [^>] instructs the expression to match any character EXCEPT the greater-than sign. You may also want to make it case insensitive (pass gi at the end not just g).

Goyuix
In some regular expression engines, the *? operator indicates non-greedy matching, where /.*?>/ will match any character up to the *first* point where the following text matches. Without the ?, /.*>/ matches up to the *last* point where the following text matches.
Greg Hewgill
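To make the greedy, non-greedy, and negated-class variants concrete, here is a short comparison on a made-up line containing two tags:

```javascript
var line = "one <br /> two <br> three";

var greedy  = line.match(/<br.*>/)[0];    // runs to the LAST ">" on the line
var lazy    = line.match(/<br.*?>/)[0];   // stops at the FIRST ">"
var negated = line.match(/<br[^>]*>/)[0]; // cannot cross a ">", so it also stops at the first
```

Here greedy swallows everything from the first tag through the end of the second, while lazy and negated both return just the first tag.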
Yes, I want the first match (obviously), but the [^>] looks like a clever way to force the first match, since that's the only way to satisfy the condition. Regardless, even the variations that should be greedy do not match at all under IE.
Walt Gordon Jones
A: 

Tested in Firefox 3 & IE7:

/<br.*?>/gi

Try it yourself here: http://jsbin.com/ofoke

var input = "one <br/>\n" 
          + "two <br />\n" 
          + "three <br>\n" 
; 

alert(input.replace(/<br.*?>/gi, ''));
nickf
I believe I have determined the issue is specifically with String.split on IE. (Your example uses String replace.) Look at this test case matrix for split(): http://stevenlevithan.com/demo/split.cfm IE fails about 30% of the cases. FF and Chrome pass this matrix 100%.
Walt Gordon Jones
Could you then try doing something like a replace using a regex, to replace <br> tags with "||BR||", and then use a normal non-regex split? input.replace(/<br.*?>/gi, '||BR||').split("||BR||"); Does that work?
nickf
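Spelled out, the replace-then-split idea might look like the sketch below. The sample string is assumed, and the marker only works if it can never occur in the real text.

```javascript
// Assumed sample: mixed-case tags, with and without the trailing slash.
var html = "one <BR>two <br />three";

// Any token that cannot appear in the text itself.
var marker = "||BR||";

// Replace each tag with the marker via regex, then split on the plain string,
// which avoids passing a regex to split() at all.
var T = html.replace(/<br.*?>/gi, marker).split(marker);
```

This gives three pieces: "one ", "two " and "three".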
+7  A: 

Your code is not working because IE parses the HTML and makes the tags uppercase when you read it through innerHTML. For example, if you have HTML like this:

<div id='box'>
Hello<br>
World
</div>

And then you use this Javascript (in IE):

alert(document.getElementById('box').innerHTML);

You will get an alert box with this:

Hello<BR>World

Notice the <BR> is now uppercase. To fix this, just add the i flag in addition to the g flag to make the regex be case-insensitive and it will work as you expect.
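A quick way to see the difference, using the string from the alert above as a stand-in for what IE returns (the variable names are only for illustration):

```javascript
// Stand-in for the innerHTML value IE reports: the tag name is upper-case.
var ieHtml = "Hello<BR>World";

var caseSensitive   = ieHtml.split(/<br[^>]*>/g);  // "br" does not match "BR": one piece
var caseInsensitive = ieHtml.split(/<br[^>]*>/gi); // "i" flag matches "BR": two pieces
```

Without the i flag the string comes back unsplit; with it you get "Hello" and "World".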

Paolo Bergantino
Yes, you are exactly right. A million thanks, and now I know something new about innerHTML on IE.
Walt Gordon Jones
A: 

<\s*br\s*\/?\s*>

matches

<br>, <br />, < br >,<br / >

I tested this in IE6. If the match is OK, the JS could certainly split on it according to the regexp.
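A small check of the pattern against the variants listed above. The ^ and $ anchors and the i flag are additions for this test only, so that each sample must match as a whole and an upper-case tag is covered too.

```javascript
// Anchors and the "i" flag are added for this check; the core pattern is unchanged.
var brTag = /^<\s*br\s*\/?\s*>$/i;

var samples = ["<br>", "<br />", "< br >", "<br / >", "<BR>"];
var results = [];
for (var i = 0; i < samples.length; i++) {
  results.push(brTag.test(samples[i])); // true when the whole sample is a br tag
}

// Another tag should not match.
var matchesParagraph = brTag.test("<p>");
```

Note that this pattern does not allow attributes, so a tag such as <br clear="all"> would not match; /<br[^>]*>/gi is more permissive in that respect.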

unigogo