I'm trying to find all occurrences of text in an HTML page that sit between <nobr> and </nobr> tags. EDIT: nobr is just an example. I need to find content between arbitrary strings, not always tags.

I tried this:

var match = /<nobr>(.*?)<\/nobr>/img.exec(document.documentElement.innerHTML);
alert (match);

But it returns only one occurrence. It also shows up twice in the alert: once with the <nobr></nobr> tags and once without them. I need only the version without the tags.

+5  A: 

Use the DOM:

var nobrs = document.getElementsByTagName("nobr");

You can then loop through all the nobrs and read each one's innerHTML, or apply any other action to them.
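
A minimal sketch of that loop. The helper name `collectInnerHTML` is made up for illustration, and it assumes `root` is anything that exposes `getElementsByTagName` (in a browser, `document`).

```javascript
// Hypothetical helper: gather the innerHTML of every element with the given tag name.
// `root` is assumed to be a Document or Element (anything with getElementsByTagName).
function collectInnerHTML(root, tagName) {
  var elements = root.getElementsByTagName(tagName);
  var contents = [];
  // The returned collection is array-like, so walk it with a plain for loop.
  for (var i = 0; i < elements.length; i++) {
    contents.push(elements[i].innerHTML);
  }
  return contents;
}
```

In a page, `collectInnerHTML(document, "nobr")` would give one string per <nobr> element.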

duckyflip
That's a great solution, but I need a general solution for any pattern in the HTML file, not just standard tags.
Nir
getElementsByTagName() will work for any well-formed XML in the document, not just valid XHTML tags.
Bell
Perhaps you should indicate this in your question.
annakata
+2  A: 

You need to do it in a loop:

var match, re = /<nobr>(.*?)<\/nobr>/img;
while((match = re.exec(document.documentElement.innerHTML)) !== null){
   alert(match[1]);
}
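
Since the question asks for content between arbitrary strings, the same loop could be wrapped in a function that builds the regex from two delimiters. This is only a sketch: `escapeRegExp` and `findBetween` are names invented here, both delimiters are assumed to be non-empty, and a plain sample string stands in for the page HTML.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return every piece of `text` found between `start` and `end`.
// Assumes both delimiters are non-empty strings.
function findBetween(text, start, end) {
  // [\s\S]*? is a lazy match that, unlike .*?, also crosses line breaks.
  var re = new RegExp(escapeRegExp(start) + "([\\s\\S]*?)" + escapeRegExp(end), "gi");
  var results = [];
  var match;
  // With the g flag, each exec call resumes from re.lastIndex.
  while ((match = re.exec(text)) !== null) {
    results.push(match[1]); // group 1 is the content without the delimiters
  }
  return results;
}

var sample = "foo <nobr> hello </nobr> bar <nobr> world </nobr> foobar";
console.log(findBetween(sample, "<nobr>", "</nobr>"));
```

Escaping the delimiters matters when they contain characters such as `.` or `(`, which would otherwise be read as regex syntax.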
Rafael
The "!== null" is not needed. exec returns null when there are no more matches, and null is falsy, so the loop ends just fine without it.
J-P
+1  A: 

You can use:

while (match = /<nobr>(.*?)<\/nobr>/img.exec("foo <nobr> hello </nobr> bar <nobr> world </nobr> foobar"))
    alert (match[1]);
動靜能量
Thanks! Now I've got it.
Nir
Turns out there is a bug in IE when doing a while loop like this one. Rafael's way should work.
Nir
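
Whatever the cause was in IE, there is a related pitfall worth knowing: in engines that follow ECMAScript 5 or later, a regex literal produces a fresh RegExp object each time it is evaluated. A literal written inside the loop condition therefore starts at `lastIndex` 0 on every pass and can loop forever. A small sketch of the difference, with a sample string standing in for the page HTML (`makeRegex` is a name invented for the demonstration):

```javascript
var text = "foo <nobr> hello </nobr> bar <nobr> world </nobr> foobar";

// Evaluating the literal twice yields two separate objects in ES5+ engines,
// each with its own lastIndex starting at 0.
function makeRegex() {
  return /<nobr>(.*?)<\/nobr>/gi;
}
var first = makeRegex().exec(text);
var second = makeRegex().exec(text);
console.log(first[1] === second[1]); // both calls find the first occurrence again

// Hoisting the regex into a variable keeps one object, so lastIndex advances.
var re = /<nobr>(.*?)<\/nobr>/gi;
var found = [];
var match;
while ((match = re.exec(text)) !== null) {
  found.push(match[1]);
}
console.log(found);
```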
+2  A: 

(Since I can't comment on Rafael's correct answer...)

exec is doing what it is supposed to do - finding the first match, returning the result in the match object, and setting you up for the next exec call. The match object contains (at index 0) the whole of the string matched by the whole of the regex. In subsequent slots are the bits of the string matched by the parenthesized subgroups. So match[1] contains the bit of the string matched by "(.*?)" in your example.
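
To make the layout of the match object concrete, here is a small sketch on a sample string (the string and variable names are only for illustration):

```javascript
var re = /<nobr>(.*?)<\/nobr>/gi;
var text = "foo <nobr> hello </nobr> bar";

var match = re.exec(text);
var whole = match[0];     // entire matched text, delimiters included
var inner = match[1];     // text captured by the first parenthesized group
var start = match.index;  // position in `text` where the match begins
var next = re.lastIndex;  // where the next exec call on `re` will resume

console.log(whole, "|", inner, "|", start, next);

// Once nothing more matches, exec returns null and lastIndex resets to 0.
var after = re.exec(text);
console.log(after, re.lastIndex);
```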

+1  A: 

If the strings you're using aren't XML elements and you're sticking with regexes, the return value you're getting can be explained by the bracketing: .exec returns the whole matching string followed by the contents of the bracketed expressions.

If your doc contains:

This is out.
Bzz. This is in. unBzz.

then

/Bzz\.\s*(.*?)\s*unBzz\./img.exec(document.documentElement.innerHTML)

will give you 'Bzz. This is in. unBzz.' in element 0 of the returned array and 'This is in.' in element 1. (The dots are escaped so they match literal periods, and \s* keeps the surrounding spaces out of the group.) Trying to display the whole array gives both as a comma-separated list, because that's what JavaScript does when it converts an array to a string.

So if you store the result in match, alert(match[1]); is what you're after.
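
The comma-separated display can be reproduced without a browser. In this sketch a sample string stands in for the page, and the pattern is written with escaped dots and `\s*` around the group so that the captured text has no surrounding spaces:

```javascript
var doc = "This is out.\nBzz. This is in. unBzz.";
var result = /Bzz\.\s*(.*?)\s*unBzz\./i.exec(doc);

// Converting an array to a string joins its elements with commas,
// which is why alerting the whole result shows both entries.
var shown = String(result);
console.log(shown);

// Element 1 alone is the text between the two markers.
console.log(result[1]);
```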

Bell
+1  A: 

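It takes two steps, but you could do it like this:

var html = document.documentElement.innerHTML;
// With the g flag, match() returns only the full matches, so each entry still includes '<nobr>'.
// match() returns null when nothing is found, hence the fallback to an empty array.
var matches = html.match(/<nobr>(.*?)<\/nobr>/img) || [];
alert(matches);

for (var i = 0; i < matches.length; i++)
{
    // Same regex without the g option, so the capture group is returned.
    var inner = matches[i].match(/<nobr>(.*?)<\/nobr>/im);
    alert(inner[1]);
}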
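
If the target environment supports `String.prototype.matchAll` (ES2020, so not older browsers such as IE), the two passes could be collapsed into one. This is a different technique from the one above, sketched here on a sample string rather than the page HTML:

```javascript
var html = "foo <nobr> hello </nobr> bar <nobr> world </nobr> foobar";

// matchAll requires the g flag and yields one full match array per occurrence,
// so the capture group is available without re-running the regex.
var contents = Array.from(html.matchAll(/<nobr>(.*?)<\/nobr>/gi), function (m) {
  return m[1];
});
console.log(contents);
```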
hayato