I have the following within an XHTML document:

<script type="text/javascript" id="JSBALLOONS">
    function() {
        this.init = function() {
            this.wAPI = new widgetAPI('__BALLOONS__');
            this.getRssFeed();
        };
    }
</script>

I'm trying to select everything in between the two script tags. The id will always be JSBALLOONS if that helps. I know how to select that including the script tags, but I don't know how to select the contents excluding the script tags. The result of the regular expression should be:

    function() {
        this.init = function() {
            this.wAPI = new widgetAPI('__BALLOONS__');
            this.getRssFeed();
        };
    }
+8  A: 

(Updated post specifically for a Javascript solution.)

In Javascript, your code might look like this:

if (data.match(/<script[^>]+id="JSBALLOONS">([\S\s]*?)<\/script>/)) {
    inner_script = RegExp.$1;
}

The part between parentheses, ([\S\s]*?), is saved by the regex engine and is accessible to you after a match is found. In Javascript, you can use RegExp.$1 to refer to the matched part inside the script tags. If you have more than one such group, each surrounded by (), you can refer to them with RegExp.$2, and so on, up to RegExp.$9. (These RegExp.$n properties are a legacy feature; the array returned by match() holds the same groups at index 1, 2, and so on, as long as the regex has no g flag.)
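As a minimal sketch of the same extraction without RegExp.$1, the group can be read from the array that match() returns. The sample string `data` and the helper name `extractInnerScript` are made up for illustration; only the regex comes from the answer above.

```javascript
// Hypothetical sample input standing in for the remotely loaded document.
var data = [
    '<html><head>',
    '<script type="text/javascript" id="JSBALLOONS">',
    '    function() { this.init = function() {}; }',
    '</script>',
    '</head></html>'
].join('\n');

// Hypothetical helper: returns the text between the tags, or null if absent.
function extractInnerScript(text) {
    // No g flag: match() then returns [fullMatch, group1, ...].
    var m = text.match(/<script[^>]+id="JSBALLOONS">([\S\s]*?)<\/script>/i);
    return m ? m[1] : null;
}

var innerScript = extractInnerScript(data);

// With the g flag, match() returns only the full matches (tags included);
// the capture groups are not part of the result.
var withG = data.match(/<script[^>]+id="JSBALLOONS">([\S\s]*?)<\/script>/ig);
```

Note the difference the g flag makes: a global match() hands back whole matches only, which is one way to end up with the script tags still attached.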

In Javascript, the dot (.) does not match newline characters by default, which is why we have to use ([\S\s]*?) rather than (.*?), which may make more sense. Just to be complete, in other languages this is not necessary if you use the s modifier (/.../s); newer Javascript engines (ES2018 and later) support the s flag as well.
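A small sketch of the difference, using a made-up two-line string `sample`. The last line assumes an engine that supports the s (dotAll) flag, which was added in ES2018.

```javascript
// Hypothetical sample: an "a" and a "b" separated by a newline.
var sample = 'a\nb';

// The dot skips line terminators, so this does not match across the newline.
var dotMatches = /a.b/.test(sample);

// [\S\s] means "non-whitespace or whitespace", i.e. any character at all.
var classMatches = /a[\S\s]b/.test(sample);

// With the s (dotAll) flag, the dot matches newlines too (ES2018+ only).
var dotAllMatches = /a.b/s.test(sample);
```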

(I have to add that regexes are typically very fragile when scraping content from HTML pages like this. You may be better off using the jQuery framework to extract the contents.)

molf
Hi, thanks. This is exactly what I have, but it includes the script tags. Can you explain what you mean by $1? I'm unfamiliar. Thanks!
slypete
@slypete, which language or tool are you using to execute the regex?
molf
@molf, I'm using javascript and jQuery. var javascript = this.data.match(/<script[^>]+id="JSBALLOONS">([\S\s]*?)<\/script>/ig); this.javascript = eval('(' + javascript + ')');
slypete
@slypete, updated with an example in Javascript. In Javascript, groups are saved in RegExp.$1, RegExp.$2, etc, up to RegExp.$9.
molf
Thanks, learned something new!
slypete
+2  A: 

What the gentleman means by $1 is "the value of the first capture group". When you enclose part of your regular expression in parentheses, it defines capture groups. You count them from the left to the right. Each opening parenthesis starts a new capture group. They can be nested.

(There are ways to define sub-expressions without defining capture groups: the non-capturing group syntax is (?:...).)

In Perl, $1 is the magic variable holding the string matched by the first capture group, $2 is the string matched by the second, etc. Other languages may require you to call a method on the returned match object to get the Nth capture group.
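To make the numbering concrete, here is a small Javascript sketch. The pattern and the input string are invented for illustration: groups are numbered by the position of their opening parenthesis, nested groups included, and a (?:...) group is skipped in the count.

```javascript
// Hypothetical pattern:
//   group 1: ((\d+)-(\d+))  the whole range (outer parentheses)
//   group 2: (\d+)          the start, nested inside group 1
//   group 3: (\d+)          the end, nested inside group 1
//   (?:to|until)            grouped but not captured, so it gets no number
//   group 4: (\w+)          the trailing word
var rangePattern = /((\d+)-(\d+)) (?:to|until) (\w+)/;

var m = '10-20 to end'.match(rangePattern);

var wholeRange = m[1]; // outer group
var rangeStart = m[2]; // first nested group
var rangeEnd = m[3];   // second nested group
var lastWord = m[4];   // the non-capturing group did not take a number
```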

But back to molf's solution. Suppose he said to use this pattern instead:

/<script[^>]+id="JSBALLOONS">(.*)<\/script>/

In this case, if you have more than one script element, this incorrect pattern will gobble them all up because it is greedy, a point worth explaining. The pattern starts at the first opening tag, matches through its closing tag, keeps going, and finally stops at the last </script>. The magic in molf's solution is the question mark in ([\S\s]*?) (or (.*?)), which makes the quantifier non-greedy. It stops at the first </script> it can reach, so it returns the shortest string that matches and does not gobble up extra script elements.
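A quick way to see the difference is to run both patterns over a string with two script elements. The string `html` below is a made-up single-line example, so the dot is enough here and no [\S\s] is needed.

```javascript
// Hypothetical input: the wanted script followed by an unrelated one.
var html = '<script id="JSBALLOONS">first();</script><script>second();</script>';

// Greedy: (.*) runs to the end of the string and backs up to the LAST </script>.
var greedy = html.match(/<script[^>]+id="JSBALLOONS">(.*)<\/script>/)[1];

// Non-greedy: (.*?) stops at the FIRST </script> that lets the pattern match.
var lazy = html.match(/<script[^>]+id="JSBALLOONS">(.*?)<\/script>/)[1];
```

Here `greedy` also swallows the closing tag of the first element and the whole opening of the second, while `lazy` holds only the contents of the first element.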

Paul Chernoch
Thank you, very helpful as well!
slypete
+2  A: 

Don't try to use regular expressions for non-regular languages. The right way is to use an XML parser or the DOM:

document.getElementById("JSBALLOONS")

edit: Regarding your comment, I have no experience with JavaScript or jQuery, but after some searching, I think that something along these lines should work:

$.ajax({
  type: "GET",
  url: "test.xml",
  dataType: "xml",
  success: function(xml) {
    // A value returned from this callback is discarded, so use the text here.
    var innerScript = $(xml).find("#JSBALLOONS").text();
  }
});

Can someone more qualified correct this?

Svante
This content is not on the DOM, so I'm afraid it won't work.
slypete
The document is remotely loaded into a string that I need to extract select things from. I'm aware regex is not the best solution. Please do let me know if you know of other working solutions. Thanks!
slypete
Again, it will not work. I've tried this. Please see my other more general question for the reason: http://stackoverflow.com/questions/1034881/what-is-the-best-practice-for-parsing-remote-content-with-jquery Hopefully someone will be able to come up with an answer for this question.
slypete
A: 

Let foo be the string containing the code. Then, you can strip the enclosing tags via

foo = foo.substring(foo.indexOf('>') + 1, foo.lastIndexOf('<'))
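As a rough illustration, here is that one-liner applied to a made-up sample string. It assumes foo starts with the opening tag and ends with the closing tag, and that no attribute value contains a '>' character; the variable name `stripped` is only for this sketch.

```javascript
// Hypothetical input: just the script element, from opening to closing tag.
var foo = '<script type="text/javascript" id="JSBALLOONS">\n    function() {}\n</script>';

// Cut from just after the first '>' (end of the opening tag)
// up to the last '<' (start of the closing tag).
var stripped = foo.substring(foo.indexOf('>') + 1, foo.lastIndexOf('<'));
```

If foo holds the whole document rather than the single element, the first '>' and last '<' belong to other tags, so the element would need to be isolated first.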
Christoph