I'm trying to match chunks of JS code and extract string literals that contain a given keyword using Java.

After failing to come up with my own regexp for this, I ended up modifying this generalized string-literal matching regexp (the patterns are compiled in Java with Pattern.COMMENTS, so whitespace and line breaks inside them are ignored):

(["'])
(?:\\?+.)*?
\1

to the following

(["'])
(?:\\?+.)*?
keyword
(?:\\?+.)*?
\1

The test cases:

var v1 = "test";
var v2 = "testkeyword";
var v3 = "test"; var v4 = "testkeyword";

The regexp correctly doesn't match line 1 and correctly matches line 2.

However, in line 3, instead of just matching "testkeyword", it matches the chunk

"test"; var v4 = "testkeyword"

which is wrong: the regexp matched the first double quote but did not terminate at the second double quote, and instead ran on to the closing quote of the last literal on the line.
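The over-match can be reproduced with a minimal Java sketch. The class and method names (OvermatchDemo, firstMatch) are made up for illustration; only the pattern and the test line come from the question. The likely cause is that `.` also matches quote characters, so the lazy group is free to walk past the closing quote of "test" until it finds the keyword.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OvermatchDemo {
    // The keyword-anchored pattern from the question. Pattern.COMMENTS makes
    // the regex engine ignore the line breaks used to lay the pattern out.
    static final Pattern NAIVE = Pattern.compile(String.join("\n",
        "([\"'])",
        "(?:\\\\?+.)*?",
        "keyword",
        "(?:\\\\?+.)*?",
        "\\1"), Pattern.COMMENTS);

    // Returns the first match on the line, or null if there is none.
    static String firstMatch(String line) {
        Matcher m = NAIVE.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line3 = "var v3 = \"test\"; var v4 = \"testkeyword\";";
        // prints "test"; var v4 = "testkeyword"
        System.out.println(firstMatch(line3));
    }
}
```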

Does anyone have any ideas on how to fix this?

PS: Please keep in mind that the Regexp has to correctly handle escaped single and double quote characters inside of string literals (which the generalized matcher already did).

+1  A: 

After much revision (see edit history, viewers at home :), I believe this is my final answer:

(?:
    "
    (?:\\?+"|[^"])*
    keyword
    (?:\\?+"|[^"])*
    "
|
    '
    (?:\\?+'|[^'])*
    keyword
    (?:\\?+'|[^'])*
    '
)
chaos
It does when I test it... test case 3 matches properly as 2 string literals. Here is where the original regex came from http://blog.stevenlevithan.com/archives/match-quoted-string
niktech
Oops, I see. Yeah, it's relying on the non-greedy behavior for that, which you can't use the same way because you're anchoring to a keyword. Editing...
chaos
That should work but the problem in my case would be false-positives. The probability of a string literal having the special keyword is about 1%. And I have to process a lot of files with hundreds of lines every time. If no one can come up with a way to pre-screen the literals for the special keyword before processing them, I'll go with your solution.
niktech
Maybe further developments will help...
chaos
It seems to match a lot more than needed. In my test case chunk above, it matched: "test";var v2 = "testkeyword";var v3 = "test"; var v4 = "testkeyword";
niktech
Tim's modification of your Regexp above works correctly.
niktech
Cool. Glad to have helped, anyhow.
chaos
+3  A: 

How about this modification:

(?:
    "
    (?:\\"|[^"\r\n])*
    keyword
    (?:\\"|[^"\r\n])*
    "
|
    '
    (?:\\'|[^'\r\n])*
    keyword
    (?:\\'|[^'\r\n])*
    '
)
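For anyone wanting to try this from Java, here is a rough sketch that compiles the pattern above with Pattern.COMMENTS and runs it over the test cases from the question. The class and method names (KeywordLiteralDemo, matches) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordLiteralDemo {
    // The two-alternative pattern, one branch per quote style.
    // Pattern.COMMENTS lets the layout whitespace stay in the pattern.
    static final Pattern KEYWORD_LITERAL = Pattern.compile(String.join("\n",
        "(?:",
        "    \"",
        "    (?:\\\\\"|[^\"\\r\\n])*",
        "    keyword",
        "    (?:\\\\\"|[^\"\\r\\n])*",
        "    \"",
        "|",
        "    '",
        "    (?:\\\\'|[^'\\r\\n])*",
        "    keyword",
        "    (?:\\\\'|[^'\\r\\n])*",
        "    '",
        ")"), Pattern.COMMENTS);

    // Collects every match of the pattern in the given source text.
    static List<String> matches(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = KEYWORD_LITERAL.matcher(source);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matches("var v1 = \"test\";"));          // prints []
        System.out.println(matches("var v2 = \"testkeyword\";"));   // prints ["testkeyword"]
        System.out.println(matches(
            "var v3 = \"test\"; var v4 = \"testkeyword\";"));       // prints ["testkeyword"]
    }
}
```

One caveat worth testing against your own files: the pattern has no notion of which quote opens a literal, so when the keyword appears in code between two literals, as in `var a = "x"; keyword(); var b = "y";`, it can match from the closing quote of one literal to the opening quote of the next.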
Tim Pietzcker
Perfect! Works as needed!
niktech
A: 

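The simplest approach is to write two patterns, one for single-quoted and one for double-quoted strings, and or them together with |. A backreference such as \1 can remember which quote opened the string, but it cannot be used inside a negated character class, so there is no direct way to say "any character except the opening quote" with it.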

gromgull
A: 

Consider using code from Rhino -- JS in Java -- to get the real String literals.

Or, if you want to use regex, consider one find for the whole literal, then a nested test if the literal contains 'keyword'.
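A minimal sketch of that two-step idea in Java might look like the following. The literal pattern here is a common generic one (an opening quote, then escaped characters or non-quote characters, then the same quote) and is my assumption, not something taken from the question; the names TwoStepFinder and literalsContaining are likewise made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoStepFinder {
    // Step 1 pattern: a whole single- or double-quoted literal on one line.
    // "\\\\." consumes any backslash escape, so escaped quotes do not end it.
    static final Pattern LITERAL = Pattern.compile(
        "\"(?:\\\\.|[^\"\\\\\\r\\n])*\"|'(?:\\\\.|[^'\\\\\\r\\n])*'");

    // Step 2: keep only the literals whose text contains the keyword.
    static List<String> literalsContaining(String source, String keyword) {
        List<String> hits = new ArrayList<>();
        Matcher m = LITERAL.matcher(source);
        while (m.find()) {
            String literal = m.group();
            if (literal.contains(keyword)) {
                hits.add(literal);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String line3 = "var v3 = \"test\"; var v4 = \"testkeyword\";";
        System.out.println(literalsContaining(line3, "keyword")); // prints ["testkeyword"]
    }
}
```

Because every literal is consumed whole before the keyword test, text that sits between two literals is never mistaken for literal content. The extra contains check is a plain substring search, so it should be cheap even when most literals do not contain the keyword, though that is worth measuring on real input.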

I think Tim's construction works, but I wouldn't bet on it in all situations, and the regex would have to get insanely unwieldy if it had to deal with literals that don't want to be found (as if trying to sneak by your testing). For example:

    var v5 =  "test\x6b\u0065yword"

Separate from any solution, my secret weapon for interactively working out regexes is a tool I made called Regex Powertoy, which unlike many such utilities runs in any browser with Java applet support.

gojomo
The test case you mentioned does not apply to my situation. I'm guaranteed that 'keyword' will appear just like that, in ASCII. Doing two tests (first test for a string literal, then test for the presence of the keyword) will produce a lot of false positives in my case, because the probability of a literal containing the keyword is about 1%.
niktech
A: 

A grammar to construct a string literal would look roughly like this:

string-literal ::= quote text quote

text ::= character text
       | character

character ::= non-quote
            | backslash quote

with non-quote, backslash, and quote being terminals.

A grammar is regular if it is context free (i.e. the left hand side of all rules is always a single non-terminal) and the right hand side of all rules is always either empty, a terminal, or a terminal followed by a non-terminal.

You may notice that the first rule given above has a terminal followed by a nonterminal followed by a terminal. This is thus not a regular grammar.

A regular expression is an expression that can parse regular languages (languages that can be constructed by a regular grammar). It is not possible to parse non-regular languages with regular expressions.

The difficulty you have in finding a suitable regular expression stems from the fact that this grammar is not in regular form, so a regular expression does not fall out of it directly. You will never arrive at code that is obviously correct this way.

It is much easier to write a simple parser along the lines of the above rules. Since the text contained in your string literals is regular, you can use a simple regular expression to look for your keyword after you have extracted that text from its surroundings.
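As a rough illustration, a hand-written scanner following those rules could look like this in Java. It is only a sketch under simplifying assumptions: it knows nothing about JS comments or regex literals, it treats an unescaped line break as the end of an unterminated literal, and the names LiteralScanner, findLiterals and closingQuote are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class LiteralScanner {
    // Walks the source once, extracts each quoted literal, and keeps the
    // ones that contain the keyword.
    static List<String> findLiterals(String src, String keyword) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c != '"' && c != '\'') {
                i++;
                continue;
            }
            int end = closingQuote(src, i);
            if (end < 0) {
                i++; // unterminated literal: skip this quote and carry on
                continue;
            }
            String literal = src.substring(i, end + 1);
            if (literal.contains(keyword)) {
                hits.add(literal);
            }
            i = end + 1;
        }
        return hits;
    }

    // Returns the index of the quote that closes the literal opened at
    // position open, or -1 if the literal is not closed on the same line.
    static int closingQuote(String src, int open) {
        char quote = src.charAt(open);
        for (int j = open + 1; j < src.length(); j++) {
            char c = src.charAt(j);
            if (c == '\\') {
                j++; // skip the escaped character, whatever it is
            } else if (c == quote) {
                return j;
            } else if (c == '\n' || c == '\r') {
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String line3 = "var v3 = \"test\"; var v4 = \"testkeyword\";";
        System.out.println(findLiterals(line3, "keyword")); // prints ["testkeyword"]
    }
}
```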

Svante
Interesting observation. Do you happen to have a test case that will break Tim's solution? It seems to be holding up to all of my test cases.
niktech