I'm trying to integrate analytics into my GWT application. To do this, I'm calling a service that returns a String of HTML that needs to be parsed and eval'ed.

I need a regex that looks for <script> tags and grabs either 1) the body of the tag or 2) the contents of the "src" attribute. I want to eval both of these with JavaScript. I'm happy to assume that if a "src" attribute exists, the body can be ignored.

Thanks,

Matt

A: 

To match the body of the tag, you can try something like

<script[^>]*?>(.*?)</script>

which you want to match case-insensitively. This works assuming there is no "</script>" appearing in the actual script body and no ">" in the attributes for the tag. You can add whitespace globbers to the regexp to make it more robust. Note the use of .*? to make sure the scanning stops at the first closing tag.

To add the src attribute, you can try

<script[^>]*?(src="([^"]*)")?[^>]*?>(.*?)</script>

and use the second submatch to get 'src', and third to get the body. Again, you might want to add whitespace globbers.
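As a hedged illustration of the two patterns above, here is a minimal Java sketch. It does not use the combined pattern verbatim: in Java's regex engine the lazy `[^>]*?` first tries to match nothing, and the optional `src` group is then allowed to match nothing as well, so the `src` submatch tends to stay empty whenever anything (even a space) sits between `<script` and `src`. The sketch instead captures the whole attribute string and looks for `src` in a second step. The class name `ScriptTagSketch`, the method `scan`, and the `src:` / `body:` result prefixes are assumptions made up for this example, and self-closing `<script ... />` tags are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any answer in this thread.
class ScriptTagSketch {
    // Group 1: everything between "<script" and ">" (the attributes).
    // Group 2: the body, matched lazily so it stops at the first closing tag.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b([^>]*)>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // A src attribute quoted with either double or single quotes; value in group 2.
    private static final Pattern SRC = Pattern.compile(
            "\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    /** Returns one entry per script element: "src:<url>" if src is present, else "body:<text>". */
    public static List<String> scan(String html) {
        List<String> found = new ArrayList<>();
        Matcher script = SCRIPT.matcher(html);
        while (script.find()) {
            Matcher src = SRC.matcher(script.group(1));
            if (src.find()) {
                // As in the question: if src exists, the body is ignored.
                found.add("src:" + src.group(2));
            } else {
                found.add("body:" + script.group(2).trim());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<SCRIPT type=\"text/javascript\" src=\"http://test.com/some.js\"></SCRIPT>\n"
                + "<script>\nalert('hi');\n</script>";
        for (String entry : scan(html)) {
            System.out.println(entry);
        }
    }
}
```

Splitting the work into two patterns keeps each one short and avoids depending on the order of attributes inside the opening tag. The same caveats as above still apply: a ">" inside an attribute value or a "</script>" inside the body will confuse it.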

But you would be best off running the whole thing through a proper HTML/XML/SGML parser, because regexps can blow up in special cases.

antti.huima
+1  A: 
Miky Dinescu
A: 

How about

<script>(.*)</script>|<script src="(.*)">.*</script>

to start with. You may need to customize it a bit to

  1. accept the src attribute with single quotes or without quotes.
  2. ignore whitespace between the '<script' and '>'

You also must use the DOTALL mode to ensure the . captures newlines.
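To make the effect of DOTALL and of greedy versus lazy matching concrete, here is a small Java sketch. It only uses the first alternative of the pattern above (the tag without attributes), and the class name `GreedyVsLazy`, the method `bodies`, and the sample HTML are assumptions invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class GreedyVsLazy {
    /** Collects group 1 of every match of the given regex in html. */
    public static List<String> bodies(String regex, int flags, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two script elements; the first body contains a newline.
        String html = "<script>a();\n</script><p>text</p><script>b();</script>";

        // Greedy with DOTALL: a single match running from the first <script> to the last </script>.
        System.out.println(bodies("<script>(.*)</script>", Pattern.DOTALL, html));

        // Lazy with DOTALL: one match per script element.
        System.out.println(bodies("<script>(.*?)</script>", Pattern.DOTALL, html));

        // Lazy without DOTALL: '.' does not match the newline, so the first script is skipped.
        System.out.println(bodies("<script>(.*?)</script>", 0, html));
    }
}
```

With more than one script on the page, the lazy form `(.*?)` is usually what you want; the greedy `(.*)` swallows everything between the first opening tag and the last closing tag.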

Akbar ibrahim
Your answer won't handle <script src="...." />
Eddie
Agreed. There are many cases it won't handle (like the type attribute of the script tag). I suggested this as a start to build from.
Akbar ibrahim
and it will match everything between first <script> and last </script> which wouldn't work nicely when there are multiple scripts on the page
Slartibartfast
+6  A: 

Must it be a regex? You can use the DOM to obtain such information. Here is a trivial example of getting the contents of the BODY tag; you could apply it to whatever you like:

function test(){
 var body = document.getElementsByTagName("body")[0];
 alert(body.innerHTML);
}
David in Dakota
+1 Yes! Parsing non-regular strings with regular expressions is WRONG!
Welbog
+1. I love regex, but use the right tool for the job. regex is the wrong tool for this job.
Eddie
While I agree in principle, he's trying to do this via GWT, which uses Java to create JavaScript.
Akrikos
A: 

This seems to do what you want:

    final String srcOne = "<html>\r\n<head>\r\n<script src=\"http://test.com/some.js\"/>\r\n</head></html>";
    final String srcTwo = "<html>\r\n<head>\r\n<script src=\"http://test.com/some.js\"></script>\r\n</head></html>";
    final String tag = "<html>\r\n<head>\r\n<script>\r\nfunction() {\r\n\talert('hi');\r\n}\r\n</script>\r\n</head></html>";
    final String tagAndSrc = "<html>\r\n<head>\r\n<script src=\"http://test.com/some.js\">\r\nfunction() {\r\n\talert('hi');\r\n}\r\n</script>\r\n</head></html>";
    final String[] tests = new String[] {srcOne, srcTwo, tag, tagAndSrc, srcOne + srcTwo, tag + srcOne + tagAndSrc};

    final String regex = "<script(?:[^>]*src=['\"]([^'\"]*)['\"][^>]*>|[^>]*>([^<]*)</script>)";
    final Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    for (int testNumber = 0; testNumber < tests.length; ++testNumber) {
        final String test = tests[testNumber];
        final Matcher matcher = pattern.matcher(test);
        System.out.println("--------------------------------");
        System.out.println("TEST " + testNumber + ": " + test);
        while (matcher.find()) {
            System.out.println("GROUP 1: " + matcher.group(1));
            System.out.println("GROUP 2: " + matcher.group(2));
        }
        System.out.println("--------------------------------");
        System.out.println();
    }

That being said, you would probably be better off using something like Tag Soup if it is at all possible.

laz
I'm marking this as the correct answer since it does what I originally wanted. Also, laz provided me with the secondary answer (below) that I needed for the final solution.
Matt Raible
A: 

Thanks for all the great suggestions everyone. I quickly discovered it's not possible to use Java's Regex API in GWT and was able to do what I wanted with JSNI.

public static native String evalJS(Element e) /*-{
    var scripts = e.getElementsByTagName("script");

    for (var i = 0; i < scripts.length; i++) {
        // if src, eval it, otherwise eval the body
        if (scripts[i].hasAttribute("src")) {
            eval(scripts[i].getAttribute("src")); // silently fails here
        } else {
            eval(scripts[i].innerHTML); // this works
        }
    }
}-*/;

Unfortunately, I ran into additional issues as documented in the following thread:

http://groups.google.com/group/Google-Web-Toolkit/browse_thread/thread/ac2589369ddec8a3

Matt Raible
I'm guessing that the call to eval(scripts[i].getAttribute("src")) does not load the URL that src="" points to. It is simply trying to execute the actual URL string as JavaScript. You need to figure out how to load the contents of that URL and eval it.
laz
Thanks for the suggestion. It allowed me to solve my problem. Here's the solution I came up with: http://groups.google.com/group/Google-Web-Toolkit/msg/0d076f647a4472bc
Matt Raible