ansaurus

Question

Javascript regex - why is it not working as expected on IE?

Answer 1

+3 A:

Seriously this is horrible. A solution based on getElementById / getElementsByTagName will be considerably more reliable and flexible.

As for the actual problem, it's probably because JavaScript multiline regex support is not cross-browser safe, and IE in particular has problems. Removing the table declaration probably makes IE internally format the remaining markup as a single line (= success), whereas adding it back in makes IE add carriage returns etc. (= fail).

I know that you did say you know there are better ways, but you didn't explain why you'd persist with this. Relying on regex and further relying on IE plaintext interpretation of a DOM is going to get you into problems like this. Don't do it.

annakata 2009-05-21 08:42:18

Thanks! This is part of a system that needs to get various data from random pages. It cannot rely on page structure but on heuristics and rules that are part of the data and not the page structure. Is there a way to get the source code in javascript before IE changes it?

Nir 2009-05-21 08:47:14

No, none. JS is only aware of the interpreted DOM; it has no awareness that the source even exists. Regardless of your stated requirements, you will still be better off traversing the DOM with the DOM methods, I guarantee it. You may wish to investigate the possibility of an XPath or CSS selector solution enabled by something like jQuery.

annakata 2009-05-21 08:50:50

I can't use jQuery or any external library. I guess what I should do is use the DOM to get to HTML parts that are small enough to parse individually with regex. The question is: how can I determine what can be parsed and what cannot?

Nir 2009-05-21 08:57:16

Yes, that would be a *much* improved start. Hunt for the smallest possible granularity, using ID where you can to narrow the search. After that it's a matter of if/else and for loops essentially. Invest some time in looking at all the properties and methods on a DOM element. Something like nextSibling might be helpful for you.
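As a rough illustration of that advice, here is a minimal sketch that walks a subtree using only firstChild and nextSibling and collects the text of every <font> element. The function names (collectFontText, textOf) are made up for this example, and it assumes the wanted values sit inside <font> elements, as the regex elsewhere in this thread suggests.

```javascript
// Collect the text of every <font> element found under `root`,
// walking the tree with firstChild / nextSibling only.
// Hypothetical helper, not code from the original question.
function collectFontText(root) {
  var results = [];
  for (var node = root.firstChild; node; node = node.nextSibling) {
    if (node.nodeType !== 1) continue; // 1 = element node; skip text, comments
    if (node.nodeName.toLowerCase() === "font") {
      results.push(textOf(node));
    } else {
      // Not a <font>: descend and keep looking.
      results = results.concat(collectFontText(node));
    }
  }
  return results;
}

// Concatenate the text nodes below `node`, in document order.
function textOf(node) {
  var text = "";
  for (var child = node.firstChild; child; child = child.nextSibling) {
    if (child.nodeType === 3) {
      text += child.nodeValue; // 3 = text node
    } else if (child.nodeType === 1) {
      text += textOf(child);
    }
  }
  return text;
}
```

In a browser this could be called as collectFontText(document.getElementById("results")), where "results" is a placeholder id. Any regex can then be run on the short strings it returns rather than on the whole page's innerHTML.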

annakata 2009-05-21 09:12:02

Nir, does this have to be done on the client? Or can it be done server side (is it an intranet page, for example)?

Chris S 2009-05-21 09:13:51

It has to be done on the client side, cross-browser.

Nir 2009-05-22 16:08:59

Answer 2

A:

Hi,

Try to build your regexp with new RegExp("", "gim"). It's more portable.
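For reference, a small sketch of what building the pattern with the constructor looks like. The pattern and sample markup are illustrative only, not the asker's actual code. The point to watch is that the pattern is now a string, so every backslash has to be doubled, while the forward slash in the closing tag no longer needs escaping.

```javascript
// Regex literal: "/" must be escaped because it delimits the literal.
var literal = /<font[^>]*>\s*([^<]*)<\/font>/gim;

// Constructor form: the pattern is a string, so "\s" is written "\\s",
// and "/" can be written as-is.
var built = new RegExp("<font[^>]*>\\s*([^<]*)</font>", "gim");

// Made-up sample; IE tends to serialise tag names in upper case,
// which the "i" flag takes care of.
var html = "<FONT color=red> Item 1</FONT>";

var fromLiteral = literal.exec(html);
var fromBuilt = built.exec(html);

console.log(fromLiteral[1]); // "Item 1"
console.log(fromBuilt[1]);   // "Item 1"
```

Both forms compile to the same expression, so switching between them should not by itself change what matches.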

ATorras 2009-05-21 08:48:44

Thanks for answering. It didn't work.

Nir 2009-05-21 08:54:52

It's possible that your regexp doesn't match what you want it to. More info: http://msdn.microsoft.com/en-us/library/9dthzd08(VS.85).aspx

ATorras 2009-05-21 11:22:01

Answer 3

A:

The ending td tags have a character that needs to be escaped: the / slash. I don't know if that is why IE7 is tripping. Safari is okay as tested.

You might want to consider adding an id to the table. Then just iterate through the childNodes of the table only. You would go through a whole lot less HTML on a bigger page and probably conserve memory, too.

Fran Corpier 2009-05-21 08:54:49

Being able to put an id on the table implies you have the power to do so. In which case you may as well go the whole hog: replace <font> with <span>, place class="item" on the appropriate <td>, then enumerate the table rows/cells looking for item cells. There would be no need for regex if one has control over the markup.

AnthonyWJones 2009-05-21 09:24:12

I agree, and that's why I suggested iterating through childNodes, in the same sentence-- in other words, use the DOM, if possible. In the original post, Nir did not indicate whether there is the ability to change the page structure. But your regex answer was enlightening. I learned something new today; thanks.

Fran Corpier 2009-05-21 14:06:45

Answer 4

+2 A:

The problem is that JScript's multiline implementation is buggy. It doesn't allow the any-character token . to match a newline character.

Use this regex instead:-

 var _pattern = /trash[\s\S]*?<font[^>]*>([^<]*)<\/font>/gi;

This eliminates . altogether. Note that [\s\S] is equivalent but will also match a newline.
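A quick way to see the difference is the sketch below. The sample markup and the dot-based pattern are made up for illustration (the asker's original pattern is not shown in this thread). In standard JavaScript regexes . does not match line terminators, and the m flag only changes how ^ and $ behave, so the dot version stops at the first line break. The g flag is left off here so that exec always starts from the beginning of the string.

```javascript
// Made-up markup with line breaks, similar to what innerHTML may return
// when a <table> is present.
var html = 'trash</td>\r\n<td>\r\n<font color="red">Item 1</font></td>';

// "." cannot cross \r or \n, so this cannot reach the <font> tag.
var dotPattern = /trash.*?<font[^>]*>([^<]*)<\/font>/i;

// [\s\S] matches any character at all, including \r and \n.
var anyCharPattern = /trash[\s\S]*?<font[^>]*>([^<]*)<\/font>/i;

var dotMatch = dotPattern.exec(html);
var anyCharMatch = anyCharPattern.exec(html);

console.log(dotMatch);        // null
console.log(anyCharMatch[1]); // "Item 1"
```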

The reason removing the table changes things is that IE's .innerHTML implementation doesn't rely on the original markup received. Instead, the markup is created dynamically by examining the DOM. When it sees a table element, it places newlines in the output in different places than when the table is missing.

AnthonyWJones 2009-05-21 09:03:39

It works now. Thanks a lot!

Nir 2009-05-21 09:14:26

IE 6 internally adds a <tbody> tag after <table> if it's not specified.

ATorras 2009-05-21 11:24:02

I have a couple of questions:
1. Why do you use [^>]* for any character and not [\s\S]* (after font)?
2. Isn't a lazy ? required after [^>]* and [^<]*?

Nir 2009-05-21 12:19:40

@Nir: I prefer to avoid lazy *? operations since they can slow some expressions down significantly. A *? first tries the rest of the pattern that follows it; only when that fails does it roll back to where the *? was encountered, consume one more character, and then try it all over again.

AnthonyWJones 2009-05-21 16:27:28

@Nir: The [^>]*> pattern can be used because we know we've finished matching that part of the string when the first > character is found. It doesn't use the lazy ? and hence is more efficient.
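To make the comparison concrete, here is a small sketch with made-up markup showing that the negated character classes capture the same text as the lazy version would. It says nothing about actual timings, which depend on the engine and the input.

```javascript
// Illustrative input only.
var html = '<font size="2" color="red">Item 1</font><font>Item 2</font>';

// Lazy version: each .*? grows one character at a time, retrying the
// rest of the pattern after every step.
var lazy = /<font.*?>(.*?)<\/font>/;

// Negated classes: [^>]* cannot run past the tag's closing ">", and
// [^<]* cannot run past the next "<", so no lazy quantifier is needed.
var negated = /<font[^>]*>([^<]*)<\/font>/;

var lazyMatch = lazy.exec(html);
var negatedMatch = negated.exec(html);

console.log(lazyMatch[1]);    // "Item 1"
console.log(negatedMatch[1]); // "Item 1"
```

One trade-off to keep in mind: [^<]* will not match a <font> whose content contains nested tags, so this form assumes the element holds plain text.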

AnthonyWJones 2009-05-21 16:30:06

thank you very much

Nir 2009-05-22 16:07:10
