ansaurus

Question

Regexp to match Javascript string literals with a specific keyword using Java

Answer 1

+1 A:

After much revision (see edit history, viewers at home :), I believe this is my final answer:

(?:
    "
    (?:\\?+"|[^"])*
    keyword
    (?:\\?+"|[^"])*
    "
|
    '
    (?:\\?+'|[^'])*
    keyword
    (?:\\?+'|[^'])*
    '
)

chaos 2009-07-10 11:35:28

It does when I test it... test case 3 matches properly as 2 string literals. Here is where the original regex came from: http://blog.stevenlevithan.com/archives/match-quoted-string

niktech 2009-07-10 11:41:09

Oops, I see. Yeah, it's relying on the non-greedy behavior for that, which you can't use the same way because you're anchoring to a keyword. Editing...

chaos 2009-07-10 11:43:57

That should work but the problem in my case would be false-positives. The probability of a string literal having the special keyword is about 1%. And I have to process a lot of files with hundreds of lines every time. If no one can come up with a way to pre-screen the literals for the special keyword before processing them, I'll go with your solution.

niktech 2009-07-10 11:53:29

Maybe further developments will help...

chaos 2009-07-10 11:59:14

It seems to match a lot more than needed. In my test case chunk above, it matched: "test";var v2 = "testkeyword";var v3 = "test"; var v4 = "testkeyword";

niktech 2009-07-10 12:13:39

Tim's modification of your Regexp above works correctly.

niktech 2009-07-10 21:02:13

Cool. Glad to have helped, anyhow.

chaos 2009-07-10 21:24:56

Answer 2

+3 A:

How about this modification:

(?:
    "
    (?:\\"|[^"\r\n])*
    keyword
    (?:\\"|[^"\r\n])*
    "
|
    '
    (?:\\'|[^'\r\n])*
    keyword
    (?:\\'|[^'\r\n])*
    '
)
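For anyone wanting to try this from Java, here is a minimal sketch of how the pattern above might be compiled and applied. It assumes the free-spacing layout is kept via Pattern.COMMENTS; the class name KeywordLiteralFinder and the method find are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class KeywordLiteralFinder {
    // COMMENTS (free-spacing) mode makes the regex engine ignore the
    // whitespace used for layout, so the pattern can be kept as posted.
    private static final Pattern LITERAL_WITH_KEYWORD = Pattern.compile(
          "(?:                          \n"
        + "    \"                       \n"
        + "    (?:\\\\\"|[^\"\\r\\n])*  \n"
        + "    keyword                  \n"
        + "    (?:\\\\\"|[^\"\\r\\n])*  \n"
        + "    \"                       \n"
        + "|                            \n"
        + "    '                        \n"
        + "    (?:\\\\'|[^'\\r\\n])*    \n"
        + "    keyword                  \n"
        + "    (?:\\\\'|[^'\\r\\n])*    \n"
        + "    '                        \n"
        + ")",
        Pattern.COMMENTS);

    // Returns every match, including the surrounding quotes.
    static List<String> find(String source) {
        List<String> hits = new ArrayList<>();
        Matcher m = LITERAL_WITH_KEYWORD.matcher(source);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String js = "var v1 = \"test\";var v2 = \"testkeyword\";"
                  + "var v3 = \"test\"; var v4 = \"testkeyword\";";
        System.out.println(find(js));
    }
}
```

One caveat worth knowing: because the pattern does not consume literals that lack the keyword, a bare identifier sitting between two literals can be picked up. For the input "a" + keyword + "b" it reports the span from the closing quote of the first literal to the opening quote of the second.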

Tim Pietzcker 2009-07-10 11:54:16

Perfect! Works as needed!

niktech 2009-07-10 21:01:27

Answer 3

A:

You need to write two patterns, one for single-quoted and one for double-quoted strings, as there is no way to make the regex remember which opened the string. Then you can OR them together with |.

gromgull 2009-07-10 12:06:41

Answer 4

A:

Consider using code from Rhino -- JS in Java -- to get the real String literals.

Or, if you want to use regex, consider one find for the whole literal, then a nested test if the literal contains 'keyword'.
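A rough sketch of that two-step idea in Java follows. The literal pattern here is my own assumption (single-line literals, a backslash escaping whatever character follows it), and TwoStepLiteralFilter and literalsContaining are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class TwoStepLiteralFilter {
    // Step 1: match any whole single- or double-quoted literal on one line.
    // A backslash followed by any character is consumed as an escape pair.
    private static final Pattern ANY_LITERAL = Pattern.compile(
        "\"(?:\\\\.|[^\"\\\\\\r\\n])*\"|'(?:\\\\.|[^'\\\\\\r\\n])*'");

    // Step 2: keep only the literals whose text contains the keyword.
    static List<String> literalsContaining(String source, String keyword) {
        List<String> hits = new ArrayList<>();
        Matcher m = ANY_LITERAL.matcher(source);
        while (m.find()) {
            String literal = m.group();
            if (literal.contains(keyword)) {
                hits.add(literal);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String js = "var a = \"test\"; var b = \"it's a keyword\"; var d = 'none';";
        System.out.println(literalsContaining(js, "keyword"));
    }
}
```

Because every literal is consumed in step 1, an identifier named keyword that sits between two literals is never reported. The sketch still knows nothing about JavaScript comments or regex literals, so quotes inside those would confuse it.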

I think Tim's construction works, but I wouldn't bet on it in all situations, and the regex would have to get insanely unwieldy if it had to deal with literals that don't want to be found (as if trying to sneak by your testing). For example:

    var v5 =  "test\x6b\u0065yword"

Separate from any solution, my secret weapon for interactively working out regexes is a tool I made called Regex Powertoy, which unlike many such utilities runs in any browser with Java applet support.

gojomo 2009-07-10 13:08:43

The test case you mentioned does not apply to my situation. I'm guaranteed that 'keyword' will appear just like that, in ASCII. Doing two tests (first test for string literal, then test for presence of keyword) will produce a lot of false-positives in my case because the probability of a literal having a keyword is about 1%.

niktech 2009-07-10 21:08:23

Answer 5

A:

A grammar to construct a string literal would look roughly like this:

string-literal ::= quote text quote

text ::= character text
       | character

character ::= non-quote
            | backslash quote

with non-quote, backslash, and quote being terminals.

A grammar is regular if it is context free (i.e. the left hand side of all rules is always a single non-terminal) and the right hand side of all rules is always either empty, a terminal, or a terminal followed by a non-terminal.

You may notice that the first rule given above has a terminal followed by a nonterminal followed by a terminal. This is thus not a regular grammar.

A regular expression is an expression that can parse regular languages (languages that can be constructed by a regular grammar). It is not possible to parse non-regular languages with regular expressions.

The difficulty you have in finding a suitable regular expression stems from the fact that a suitable regular expression doesn't exist. You will never arrive at code that is obviously correct this way.

It is much easier to write a simple parser along the lines of the above rules. Since the text contained by your string literals is regular, you can use a simple regular expression to look for your keyword, after you have extracted that text from its surroundings.
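As a sketch of what such a parser might look like in Java: the scanner below walks the source once, collects the text between matching quotes, and then filters on the keyword. It is an illustration under my own assumptions, not a full JavaScript lexer. It treats a backslash plus any following character as an escape (slightly broader than the "backslash quote" rule above), and the names StringLiteralScanner, literalTexts, and withKeyword are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
class StringLiteralScanner {
    // Returns the text between the quotes of each terminated literal.
    static List<String> literalTexts(String source) {
        List<String> texts = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c != '"' && c != '\'') {
                i++; // outside any literal: keep scanning
                continue;
            }
            char quote = c; // remember which quote opened the literal
            StringBuilder text = new StringBuilder();
            boolean closed = false;
            i++; // skip the opening quote
            while (i < source.length()) {
                char d = source.charAt(i);
                if (d == '\\' && i + 1 < source.length()) {
                    // Escape pair: keep both characters, never end the literal here.
                    text.append(d).append(source.charAt(i + 1));
                    i += 2;
                } else if (d == quote) {
                    closed = true;
                    i++; // skip the closing quote
                    break;
                } else {
                    text.append(d);
                    i++;
                }
            }
            if (closed) {
                texts.add(text.toString());
            }
        }
        return texts;
    }

    // Keeps only the literal texts that contain the keyword.
    static List<String> withKeyword(String source, String keyword) {
        List<String> hits = new ArrayList<>();
        for (String text : literalTexts(source)) {
            if (text.contains(keyword)) {
                hits.add(text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String js = "var a = \"test\"; var b = 'has keyword';";
        System.out.println(withKeyword(js, "keyword"));
    }
}
```

Keeping the extraction and the keyword test separate means each part can be checked on its own, at the cost of visiting every literal rather than only the ones that contain the keyword.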

Svante 2009-07-10 18:33:22

Interesting observation. Do you happen to have a test case that will break Tim's solution? It seems to be holding up to all of my test cases.

niktech 2009-07-10 21:10:59
