After losing much sleep I still cannot figure this out:

The code below (a simplification of larger code that shows only the problem) identifies Item1 and Item2 in FF but not in IE7. I'm clueless.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>

<body>
<table><tr>
<td><img src=imgs/site/trash.jpg border=1></td><td><font style="">Item1</font></td>
<td><img src=imgs/site/trash.jpg border=1></td><td><font style="">Item2</font></td>
</tr></table>

<script type="text/javascript">
    var _pattern =/trash.*?<font.*?>(.*)<\/font>/gim;
    alert (_pattern);

    var thtml = document.documentElement.innerHTML;
    alert (thtml);
    while ( _match =_pattern.exec(thtml)){
     alert (_match[1]);

    }

</script>

</body>
</html>

Notes:
1. I know there are better ways to get Item1 and Item2. This example is meant to show the regex problem I'm facing in the simplest way.
2. When I remove the table and /table tags, it works.

Thanks in advance

+3  A: 

Seriously this is horrible. A solution based on getElementById / getElementsByTagName will be considerably more reliable and flexible.

As for the actual problem, it's probably because JavaScript multiline regex support is not cross-browser safe, and IE in particular has problems. Removing the table declaration will probably force IE to internally format the remaining markup as a single line (= success), whereas adding it back in will make IE add carriage returns etc. (= fail).

I know that you did say you know there are better ways, but you didn't explain why you'd persist with this. Relying on regex and further relying on IE plaintext interpretation of a DOM is going to get you into problems like this. Don't do it.
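A minimal sketch of what the DOM-based approach could look like for the markup in the question. The function name collectItemLabels is hypothetical, and the sketch assumes each label sits in the td that immediately follows the td holding the trash image, as in the sample page:

```javascript
// Hypothetical helper (not from the original post): walks the <td> cells
// under `root` and, for each cell holding an image whose src contains
// "trash", reads the text of the <font> element in the next cell.
function collectItemLabels(root) {
    var labels = [];
    var cells = root.getElementsByTagName("td");
    for (var i = 0; i + 1 < cells.length; i++) {
        var imgs = cells[i].getElementsByTagName("img");
        if (imgs.length === 0 || imgs[0].src.indexOf("trash") === -1) {
            continue; // not an image cell we care about
        }
        var fonts = cells[i + 1].getElementsByTagName("font");
        if (fonts.length > 0) {
            // The fragment is tiny here, so reading innerHTML is low risk.
            labels.push(fonts[0].innerHTML);
        }
    }
    return labels;
}
```

In a browser this would be called as collectItemLabels(document). It only uses getElementsByTagName, src and innerHTML, so no external library is needed.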

annakata
Thanks! This is part of a system that needs to get various data from random pages. It cannot rely on page structure but on heuristics and rules that are part of the data and not the page structure. Is there a way to get the source code in javascript before IE changes it?
Nir
No, none. JS is only aware of the interpreted DOM, it has no awareness that source even exists. Regardless of your stated requirements you will still be better off trying to traverse the DOM with the DOM methods, I guarantee it. You may wish to investigate the possibility of an xpath or css selector solution enabled by something like jQuery.
annakata
I can't use jQuery or any external library. I guess what I should do is use the DOM to get to HTML parts that are small enough to parse individually with regex. The question is: how can I determine what can be parsed and what cannot?
Nir
Yes, that would be a *much* improved start. Hunt for the smallest possible granularity, using ID where you can to narrow the search. After that it's a matter of if/else and for loops essentially. Invest some time in looking at all the properties and methods on a DOM element. Something like nextSibling might be helpful for you.
annakata
Nir, has this got to be done on the client? Or can it be done server side (is it an intranet page, for example)?
Chris S
It has to be done on the client side. cross browser
Nir
A: 

Hi,

Try to build your regexp with new RegExp("", "gim"). It's more portable.
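For reference, a sketch of what the constructor form of the question's pattern might look like. The sample string is a made-up single-line input, not taken from the page. Both forms describe the same pattern, so on their own they would be expected to match the same text:

```javascript
// Literal form, as written in the question.
var literalPattern = /trash.*?<font.*?>(.*)<\/font>/gim;

// Constructor form of the same pattern. Inside a string literal the
// backslash must be doubled so the regex engine still receives "\/".
var constructedPattern = new RegExp("trash.*?<font.*?>(.*)<\\/font>", "gim");

// Hypothetical single-line sample input.
var sample = '<td><img src="trash.jpg"></td><td><font style="">Item1</font></td>';

var literalMatch = literalPattern.exec(sample);
var constructedMatch = constructedPattern.exec(sample);

// Both forms capture the same text from the sample.
console.log(literalMatch[1], constructedMatch[1]);
```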

ATorras
Thanks for answering. It didn't work.
Nir
It's possible that your regexp doesn't match what you want to. More info: http://msdn.microsoft.com/en-us/library/9dthzd08(VS.85).aspx
ATorras
A: 

The ending td tags have a character that needs to be escaped: the / slash. I don't know if that is why IE7 is tripping. Safari is okay as tested.

You might want to consider adding an id to the table. Then just iterate through the childNodes of the table only. You would go through a whole lot less HTML on a bigger page and probably conserve memory, too.

Fran Corpier
Being able to put an id on the table implies you have the power to do so. In which case you may as well go the whole hog: replace <font> with <span>, place class="item" on the appropriate <td>, then enumerate the table rows/cells looking for item cells. There would be no need for regex if one has control over the markup.
AnthonyWJones
I agree, and that's why I suggested iterating through childNodes, in the same sentence-- in other words, use the DOM, if possible. In the original post, Nir did not indicate whether there is the ability to change the page structure. But your regex answer was enlightening. I learned something new today; thanks.
Fran Corpier
+2  A: 

The problem is that in JScript the any-character . does not match a newline character, and the multiline flag does not change that (it only affects ^ and $).

Use this regex instead:-

 var _pattern = /trash[\s\S]*?<font[^>]*>([^<]*)<\/font>/gi;

This eliminates . altogether. Note that [\s\S] matches any character, as . does, but it will also match a newline.

The reason removing the table changes things is that IE's .innerHTML implementation doesn't rely on the original markup received. Instead, the markup is created dynamically by examining the DOM. When it sees a table element, it places newlines in the output in different places than when the table is missing.
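The difference can be reproduced outside IE with a string that contains line breaks. This is a sketch under assumptions: the serialized string below is only a guess at the kind of output IE might generate (uppercase tags, an added TBODY, line breaks between cells), and collectMatches is a hypothetical helper:

```javascript
// Assumed sample: markup with line breaks between cells, roughly as a
// DOM-based serializer might emit it. The real IE7 output is not shown
// in this thread.
var serialized = [
    "<TABLE><TBODY>",
    "<TR>",
    "<TD><IMG src=\"imgs/site/trash.jpg\" border=1></TD>",
    "<TD><FONT>Item1</FONT></TD>",
    "<TD><IMG src=\"imgs/site/trash.jpg\" border=1></TD>",
    "<TD><FONT>Item2</FONT></TD></TR></TBODY></TABLE>"
].join("\r\n");

// Hypothetical helper: runs a global pattern over the text and returns
// the first capture group of every match.
function collectMatches(pattern, text) {
    var found = [];
    var match;
    pattern.lastIndex = 0; // start from the beginning of the text
    while ((match = pattern.exec(text)) !== null) {
        found.push(match[1]);
    }
    return found;
}

// Original pattern: "." cannot cross the line break between the image
// cell and the font cell, so nothing matches.
var dotPattern = /trash.*?<font.*?>(.*)<\/font>/gim;

// Pattern from this answer: [\s\S] also matches line breaks.
var classPattern = /trash[\s\S]*?<font[^>]*>([^<]*)<\/font>/gi;

console.log(collectMatches(dotPattern, serialized));   // no matches
console.log(collectMatches(classPattern, serialized)); // Item1, Item2
```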

AnthonyWJones
It works now. Thanks a lot!
Nir
IE 6 internally adds a <tbody> tag after <table> if it's not specified.
ATorras
I have a couple of questions: 1. Why do you use [^>]* for any character and not [\s\S]* (after font)? 2. Isn't a lazy ? required after [^>]* and [^<]*?
Nir
@Nir: I prefer to avoid lazy *? operations since they can slow some expressions down significantly. With *?, the engine goes on to check the rest of the pattern that follows, and only when that following pattern fails does it roll back to where it was when *? was encountered, move on one character, then do it all over again.
AnthonyWJones
@Nir: The [^>]*> pattern can be used because we know we've finished matching that part of the string when the first > character is found. This doesn't use the lazy ? and hence is more efficient.
AnthonyWJones
thank you very much
Nir