I'm trying to return the contents of any <script> tags in a body of text. I'm currently using the following expression, but it only captures the contents of the first tag and ignores any others after that.

Here's a sample of the html:

 <script type="text/javascript">
  alert('1');
 </script>

 <div>Test</div>

 <script type="text/javascript">
  alert('2');
 </script>

My regex looks like this:

//scripttext contains the sample
re = /<script\b[^>]*>([\s\S]*?)<\/script>/gm;
var scripts  = re.exec(scripttext);

When I run this on IE6, it returns 2 matches. The first containing the full tag, the 2nd containing alert('1').

When I run it on http://www.pagecolumn.com/tool/regtest.htm it gives me 2 results, each containing the script tags only.

A: 

Try using the global flag:

document.body.innerHTML.match(/<script.*?>([\s\S]*?)<\/script>/gmi)

Edit: added multiple line and case insensitive flags (for obvious reasons).
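
One caveat that may matter here: with the g flag, match() returns only the full matches (tags included) and drops the capture groups. A minimal sketch of the difference, assuming an environment that supports String.prototype.matchAll (ES2020); the `html` string is a hypothetical stand-in for document.body.innerHTML:

```javascript
// Hypothetical sample standing in for document.body.innerHTML.
var html = '<script type="text/javascript">alert(\'1\');</script>' +
           '<div>Test</div>' +
           '<SCRIPT>alert(\'2\');</SCRIPT>';

var re = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;

// With the g flag, match() returns the full matches only (no groups).
var fullMatches = html.match(re);

// matchAll() yields one match object per occurrence, capture groups included.
var contents = Array.from(html.matchAll(re), function (m) { return m[1]; });

console.log(fullMatches.length); // 2
console.log(contents);           // [ "alert('1');", "alert('2');" ]
```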

Justin Johnson
Or, if you are using a regex function, make sure it is configured to catch all matches. Some of them require multiple calls, an extra parameter, or a different function to be called.
TheJacobTaylor
@TheJacobTaylor That seems kind of vague. What regex function are you referring to other than `new RegExp`?
Justin Johnson
@Justin Johnson My comment was partially driven by questions above about what language the regex was in. Since I was not sure, and they were getting one result, I thought they might have been affected by calling the wrong function. In PHP, for example, preg_match returns only the first match, while preg_match_all returns all of them.
TheJacobTaylor
Ah, very well. I assume JavaScript. I think it was tagged as such when I got to the question, not sure though.
Justin Johnson
What's the down vote for?
Justin Johnson
A: 

The first group contains the content of the tags.

Edit: Don't you have to surround the regex statement with quotes? Like:

re = "/<script\b[^>]*>([\s\S]*?)<\/script>/gm";
Phoexo
No, you don't. In JavaScript, /.../ denotes a regular expression literal. You can build it from a string if you want, but then you have to be more explicit in its construction, including doubling every backslash. E.g.: `/<script\b[^>]*>([\s\S]*?)<\/script>/g` is equivalent to `new RegExp("<script\\b[^>]*>([\\s\\S]*?)<\\/script>", "g")`
Justin Johnson
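
A quick sketch of the string form, to show why the backslashes need doubling inside a string literal. The variable names and the `tagSample` string are made up for illustration:

```javascript
var literal = /<script\b[^>]*>([\s\S]*?)<\/script>/g;

// In a string literal, each backslash meant for the regex must be doubled.
var fromString = new RegExp("<script\\b[^>]*>([\\s\\S]*?)<\\/script>", "g");

// Single backslashes are consumed by the string parser first:
// "\b" becomes a backspace character and "\s" becomes a plain "s".
var broken = new RegExp("<script\b[^>]*>([\s\S]*?)<\/script>", "g");

var tagSample = "<script>alert('1');</script>";

console.log(literal.test(tagSample));    // true
console.log(fromString.test(tagSample)); // true
console.log(broken.test(tagSample));     // false
```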
A: 

In .NET there's a submatch method, and in PHP there's preg_match_all, either of which should solve your problem. JavaScript traditionally has no such method (String.prototype.matchAll was only added in ES2020), but you can write one yourself.

Test it at http://www.pagecolumn.com/tool/regtest.htm

Selecting the $1elements method there will return what you want.

unigg
A: 

The "problem" here is in how exec works. It matches only the first occurrence, but when the regex has the g flag it stores the index just past that match in the regex's lastIndex property, and the next call resumes from there. To get all matches, simply apply the regex to the string repeatedly until it fails to match (this is a pretty common way to do it):

var scripttext = ' <script type="text/javascript">\nalert(\'1\');\n</script>\n\n<div>Test</div>\n\n<script type="text/javascript">\nalert(\'2\');\n</script>';

var re = /<script\b[^>]*>([\s\S]*?)<\/script>/gm;

var match;
while (match = re.exec(scripttext)) {
  // full match is in match[0], whereas captured groups are in ...[1], ...[2], etc.
  console.log(match[1]);
}
kangax
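
The same loop can be wrapped in a small reusable function. This is only a sketch: `collectGroups` and `loopSample` are hypothetical names, and the helper assumes you want capture group 1 of every match.

```javascript
// Hypothetical helper: returns capture group 1 of every match of `re` in `text`.
function collectGroups(re, text) {
  if (!re.global) {
    // Without the g flag, exec() always restarts at index 0 and the loop never ends.
    throw new Error('regex needs the g flag');
  }
  re.lastIndex = 0; // start from the beginning even if the regex was used before
  var results = [];
  var match;
  while ((match = re.exec(text)) !== null) {
    // Guard against zero-length matches, which would otherwise loop forever.
    if (match[0] === '') re.lastIndex++;
    results.push(match[1]);
  }
  return results;
}

var loopSample = "<script>alert('1');</script><div>Test</div><script>alert('2');</script>";
var found = collectGroups(/<script\b[^>]*>([\s\S]*?)<\/script>/g, loopSample);
console.log(found); // [ "alert('1');", "alert('2');" ]
```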
A: 

Don't use regular expressions for parsing HTML. HTML is not a regular language. Use the power of the DOM. This is much easier, because it is the right tool.

var scripts = document.getElementsByTagName('script');
Svante
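
To get at the contents rather than just the elements, something along these lines might work. It is a sketch that assumes the markup is already part of a document; `inlineScriptSources` is a hypothetical helper name, and in a browser you would pass `document` as the argument.

```javascript
// Hypothetical helper: collects the source text of every inline <script> element.
// `doc` is expected to be a DOM Document (pass `document` in a browser).
function inlineScriptSources(doc) {
  var scripts = doc.getElementsByTagName('script');
  var sources = [];
  for (var i = 0; i < scripts.length; i++) {
    // Skip external scripts: they have a src attribute and no inline body.
    if (!scripts[i].src) {
      sources.push(scripts[i].text);
    }
  }
  return sources;
}
```

Note that this only sees scripts that are in the document; it does not help if the HTML is just a string that has not been parsed yet.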