I am writing a simple parser for C. I was just running it with some other language files (for fun - to see the extent of C-likeness and laziness - don't wanna really write separate parsers for each language if I can avoid it).

However the parser seems to break down for JavaScript if the code being parsed contains regular expressions...

Case 1: For example, while parsing the JavaScript code snippet,

var phone="(304)434-5454"
phone=phone.replace(/[\(\)-]/g, "") 
//Returns "3044345454" (removes "(", ")", and "-")

The '(', '[' etc get matched as starters of new scopes, which may never be closed.

Case 2: And, for the Perl code snippet,

 # Replace backslashes with two forward slashes
 # Any character can be used to delimit the regex
 $FILE_PATH =~ s@\\@//@g; 

The // gets matched as a comment...

How can I detect a regular expression within the content text of a "C-like" program-file?

+1  A: 
Pointy
I'm not trying to parse the languages with just a regular expression.
sonofdelphi
So I guess I'll have to use separate regex syntax for each language.
sonofdelphi
Yes, generally every language has its own rules for tokens, though sometimes they're boring enough to share.
Pointy
I'll go one simpler: You can't parse Perl. Or, more precisely, you can't *statically* parse Perl. PPI comes pretty close, though. http://search.cpan.org/perldoc?PPI#Background
Michael Carman
@Michael Carman How about JavaScript? Can it be parsed statically?
sonofdelphi
@sonofdelphi: As far as I know JavaScript can be statically parsed. The key problem with Perl is how function prototypes change the way things are parsed.
Michael Carman
+3  A: 

It is impossible.

Take this, for example:

m =~ s/a/b/g;

This could be either C or Perl.

A minute's thought reveals that the number of Perl-style regular expressions that are also syntactically valid C expressions is infinite.

Another example:

m+foo *bar[index]+i
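To see why both snippets pass as C, here is a rough sketch of a C-style tokenizer in Python. The token pattern and the `c_tokens` helper are simplified assumptions made up for illustration, not a complete C lexer. Both lines split into ordinary identifiers and operators without any error, so nothing at the token level hints at a regex.

```python
import re

# Simplified C token pattern (an assumption for illustration): identifiers,
# integers, one- or two-character operators, and brackets/punctuation.
C_TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[-+*/%=~!<>&|^]=?|[\[\](){};,])")

def c_tokens(text):
    """Split text into C-like tokens, raising ValueError on anything unknown."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = C_TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"not a C token at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

# In C the first line reads as: m = ~s / a / b / g;
print(c_tokens("m =~ s/a/b/g;"))
# In C the second line reads as: m + foo * bar[index] + i
print(c_tokens("m+foo *bar[index]+i"))
```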

The best you can get is some extremely vague guesswork. The difficulty stems from the fact that a regular expression is a sequence of characters that can be virtually anything.
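If you restrict yourself to one language, the guesswork can be made less vague. For the JavaScript snippet in the question, a common heuristic is to look at the previous significant token: after an operator, an opening bracket, or a keyword such as `return`, a `/` is assumed to start a regex literal, otherwise it is treated as division. The sketch below is only that heuristic under stated assumptions; the names `REGEX_MAY_FOLLOW`, `regex_end` and `find_regex_literals` are invented here, and the token list is deliberately incomplete.

```python
import re

# Previous tokens after which "/" is assumed to start a regex literal rather
# than a division. The list is an illustrative assumption, not the full grammar.
REGEX_MAY_FOLLOW = set("(,=:[!&|?{};+-*%<>~^") | {"", "return", "typeof", "case"}
WORD = re.compile(r"[A-Za-z_$][\w$]*|\d+")

def regex_end(src, start):
    """Index just past the regex literal opening at src[start], or None."""
    i, in_class = start + 1, False
    while i < len(src) and src[i] != "\n":
        ch = src[i]
        if ch == "\\":
            i += 2                      # skip the escaped character
            continue
        if ch == "[":
            in_class = True             # "/" inside [...] does not terminate
        elif ch == "]":
            in_class = False
        elif ch == "/" and not in_class:
            i += 1
            while i < len(src) and src[i].isalpha():
                i += 1                  # consume flags such as g, i, m
            return i
        i += 1
    return None                         # no closing "/" on this line

def find_regex_literals(src):
    """Return the text of every "/" run that the heuristic takes for a regex."""
    found, prev, i = [], "", 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif src.startswith("//", i):   # line comment: skip to end of line
            end = src.find("\n", i)
            i = len(src) if end == -1 else end
        elif src.startswith("/*", i):   # block comment
            end = src.find("*/", i + 2)
            i = len(src) if end == -1 else end + 2
        elif ch in "\"'":               # string literal, honouring escapes
            i += 1
            while i < len(src) and src[i] != ch:
                i += 2 if src[i] == "\\" else 1
            i, prev = i + 1, "string"
        elif ch == "/" and prev in REGEX_MAY_FOLLOW and regex_end(src, i):
            end = regex_end(src, i)
            found.append(src[i:end])
            i, prev = end, "regex"
        else:                           # identifier, number, or one punctuator
            word = WORD.match(src, i)
            prev = word.group() if word else ch
            i += len(prev)
    return found

snippet = r'''var phone="(304)434-5454"
phone=phone.replace(/[\(\)-]/g, "")
//Returns "3044345454" (removes "(", ")", and "-")'''
print(find_regex_literals(snippet))
```

Once the literal is skipped as a single token, the `(` and `[` inside it never reach the scope matcher. The heuristic still guesses wrong in some places (for example a regex directly after the `)` of an `if` condition), and it does nothing for Perl, where the delimiter can be almost any character.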

You had better clean up your error handling. A parser should not "break down" if some parentheses are missing or superfluous ones are seen.
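As a minimal sketch of what that could look like, the bracket matcher below collects diagnostics and keeps going instead of giving up at the first mismatch. The `check_brackets` name and the recovery policy (report a stray closer and ignore it) are assumptions chosen for illustration; a real parser would pick a policy that suits its grammar.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}

def check_brackets(text):
    """Return a list of diagnostics instead of failing on the first mismatch."""
    stack, problems = [], []
    for offset, ch in enumerate(text):
        if ch in PAIRS.values():
            stack.append((ch, offset))          # remember where the scope opened
        elif ch in PAIRS:
            if stack and stack[-1][0] == PAIRS[ch]:
                stack.pop()                     # matching closer: scope ends
            else:
                # Stray or mismatched closer: report it and carry on.
                problems.append(f"unexpected '{ch}' at offset {offset}")
    # Anything still open at end of input is reported, not fatal.
    problems.extend(f"unclosed '{ch}' at offset {offset}" for ch, offset in stack)
    return problems

print(check_brackets("f(a[0]);"))
print(check_brackets("x = (a + b"))
print(check_brackets("x = a)"))
```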

Ingo
sure. will do that. :)
sonofdelphi