The problem I face:

-I have an input string, a SQL statement, that I need to parse

-I need to extract the values to be inserted, based on the column names specified

-I can extract values that are wrapped in two single quotes, but:

--?what about values that are not wrapped in single quotes? (like an integer or a double)

--?what if the value itself already contains single quotes? (like: 'James''s dictionary')

Below is the sample input string:

INSERT INTO LJS1_DX (base, doc, key1, key2, no, sq, eq, ln, en, date, line) 
VALUES ('GET','','#000210','','   0','   1','5',1,0,'20100706','Street''James''s dictionary')

The Java code I have below only matches values between two single quotes:

 Pattern p = Pattern.compile("'.*?'");
 String columnValues = "'GET0','','#000210','','   0','   1','5',1,0,'20100706','Street''James''s dictionary'";
 Matcher m = p.matcher(columnValues); // get a matcher object
 while (m.find()) {
  logger.trace(m.group()); // each match still includes its surrounding quotes
 }

I would appreciate any guidelines or sample code for this.

Thank you!!

A: 

Regexes are not really suitable for this. You will always find cases that fail.

A CSV parser such as opencsv is probably a better option.

gnibbler
+1  A: 

I agree with gnibbler that this is a job for a CSV parser.

A regex that works on your example would be

'(?:''|[^'])*'|[^',\s]+

which looks challenging to debug and maintain, doesn't it?

Explanation:

'            # First alternative: match an "opening" '
 (?:         # followed by either...
  ''         # two ' in a row (escaped ')
 |           # or...
  [^']       # any character that is not a '
 )*          # zero or more times,
'            # then match a "closing" '
|            # or (second alternative):
[^',\s]+     # match any run of characters except ', comma or whitespace

It also works if there is whitespace around the values/commas (and will leave that out of the match).
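
For anyone who wants to try it from Java, here is a minimal sketch using the pattern as it is broken down in the explanation above. The class and method names (`QuotedValueRegexDemo`, `extractValues`) are made up for illustration, and the unquoting step (stripping the outer quotes and turning `''` back into `'`) is my own addition rather than part of the regex itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedValueRegexDemo {
    // Either a quoted value (with '' as an escaped quote), or a run of
    // characters that are not a quote, a comma or whitespace.
    private static final Pattern VALUE = Pattern.compile("'(?:''|[^'])*'|[^',\\s]+");

    // Hypothetical helper: returns every value found in the list, unquoted.
    static List<String> extractValues(String columnValues) {
        List<String> values = new ArrayList<>();
        Matcher m = VALUE.matcher(columnValues);
        while (m.find()) {
            String token = m.group();
            if (token.startsWith("'")) {
                // Strip the outer quotes and collapse '' to a single '.
                token = token.substring(1, token.length() - 1).replace("''", "'");
            }
            values.add(token);
        }
        return values;
    }

    public static void main(String[] args) {
        String columnValues =
                "'GET','','#000210','','   0','   1','5',1,0,'20100706','Street''James''s dictionary'";
        for (String value : extractValues(columnValues)) {
            System.out.println("[" + value + "]");
        }
    }
}
```

Note that this only finds values that are actually present: an empty unquoted slot (as in `1,,2`) produces no match at all, so the positions can shift if the input contains such gaps.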

Tim Pietzcker
This works too! Except that after running for a while, I got a Java StackOverflowError. I wonder whether this is a limitation in Java or not; I am still checking the root cause. You're right, it does look complex, and I personally wish I didn't need to maintain this part of the code :(
Reusable
I don't see how the regex could be a reason for a stack overflow, since there is nothing in it that could cause catastrophic backtracking, even with malformed input. Perhaps the surrounding code (which looks OK in your sample, though) does something that's causing the exception. I don't know Java, so maybe some expert might have a better idea.
Tim Pietzcker
I am surprised by the error as well; the stack trace is entirely inside Java's own API. This is the stack trace:
2010-07-30 17:22:29,458 TRACE [main] (SQLAction.java:178) - Extracted Value:: ' 5'
Exception in thread "main" java.lang.StackOverflowError
 at java.util.regex.Pattern$Slice.match(Pattern.java:3472)
 at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
 at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
 ...
 at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
 at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
Reusable
I don't really know how to read this error trace, but the extracted value `' 5'` is definitely not the problem. Could you post (perhaps as an addendum to your question) the entire code surrounding the line that triggers the exception?
Tim Pietzcker
A: 

In general, when you need to parse complex languages, regexps are not the best tool: there's too much context to make sense of. So, if reading XML, use an XML parser; if reading C code, use a C language parser; and if reading SQL ...

There's a Java SQL parser here; I would use something like this.

For other languages it may be best to use a "YACC"-like parser, for example JACK.

djna
I agree with you. I would do the same (using Zql) if I needed to handle all of select, insert, update, delete and create. For now, I am limited to insert (I don't need to care about the table), the columns and the columns' data.
Reusable
From bitter experience, I'd recommend taking the extra time to parse the thing properly, even though you won't use all the bits you parse. This approach deals with all the corner cases of badly formed SQL much more cleanly.
djna
I have tried using Zql. Wow... it is really a good tool and saves a lot of hassle! I would really consider using it next time if I face something similar that needs to cater for more. Thanks for the idea.
Reusable
A: 

Instead, you can get all the values by taking the substring after the VALUES keyword. The column names can be obtained the same way. You will then have two comma-separated strings, which can be converted to arrays, giving you one array of names and one of values. You can then check which parameter has which value.
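
A rough sketch of that idea in Java, under some assumptions of mine: the statement has the shape `INSERT INTO t (cols) VALUES (vals)`, column names contain no commas or parentheses, and the values are tokenised with the quote-aware pattern posted in another answer of this thread instead of a plain split, so that commas inside quoted strings survive. `InsertColumnMapper` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InsertColumnMapper {
    private static final Pattern VALUES_KEYWORD = Pattern.compile("(?i)\\bVALUES\\b");
    // Quote-aware token pattern borrowed from another answer in this thread.
    private static final Pattern TOKEN =
            Pattern.compile("'[^']*+(?:''[^']*+)*+'|[^',\\s]++");

    // Maps each column name to its (unquoted) value, in declaration order.
    static Map<String, String> mapColumnsToValues(String sql) {
        int colOpen = sql.indexOf('(');
        int colClose = sql.indexOf(')', colOpen);
        Matcher keyword = VALUES_KEYWORD.matcher(sql);
        if (colOpen < 0 || colClose < 0 || !keyword.find(colClose)) {
            throw new IllegalArgumentException("Expected INSERT ... (cols) VALUES (vals)");
        }
        int valOpen = sql.indexOf('(', keyword.end());
        int valClose = sql.lastIndexOf(')');
        if (valOpen < 0 || valClose < valOpen) {
            throw new IllegalArgumentException("Missing value list");
        }

        // Column names: a plain split is enough under the stated assumptions.
        String[] columns = sql.substring(colOpen + 1, colClose).trim().split("\\s*,\\s*");

        // Values: tokenise with the regex so quoted commas are kept intact.
        List<String> values = new ArrayList<>();
        Matcher m = TOKEN.matcher(sql.substring(valOpen + 1, valClose));
        while (m.find()) {
            values.add(unquote(m.group()));
        }
        if (columns.length != values.size()) {
            throw new IllegalArgumentException(
                    columns.length + " columns but " + values.size() + " values");
        }

        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            row.put(columns[i], values.get(i));
        }
        return row;
    }

    // Strips the outer quotes of a quoted token and collapses '' to '.
    private static String unquote(String token) {
        if (!token.startsWith("'")) {
            return token;
        }
        return token.substring(1, token.length() - 1).replace("''", "'");
    }
}
```

Calling `mapColumnsToValues` on the sample statement from the question and then `get("line")` on the result should give the value for the `line` column.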

Hope this helps.

Paarth
This is true. However, there could be a comma inside those varchar columns. What do you think?
Reusable
But for varchar, the value will be in ' (single quotes), right? So yeah, it's a matter of preference. You can check whichever seems easier and faster.
Paarth
A: 
Pattern p = Pattern.compile("('?)(.*?)\\1(?:,|$)");
Matcher m = p.matcher(columnValues); // get a matcher object
 while (m.find()) {
  System.out.println(m.group(2)); // group 2 holds the value without its quotes
 }
  1. ('?) - Group 1 - Optional quote
  2. (.*?) - Group 2 - Chars within quotes
  3. \1 - To match first quote captured in group 1
  4. (?:,|$) - To match a comma or the end of the string ((?: makes the group non-capturing)
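
A possible variant, offered as a sketch rather than a tested solution: adding `\s*` before the optional quote and before the separator lets the pattern tolerate blanks around the commas, and stopping as soon as a match did not consume a comma avoids picking up an extra empty match at the end of the string. The `\s*` additions, the class name `BackrefSplitDemo` and the `expectValue` flag are my own assumptions, not part of the regex above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefSplitDemo {
    // Same structure as the regex above, plus \s* around the value.
    private static final Pattern VALUE =
            Pattern.compile("\\s*('?)(.*?)\\1\\s*(?:,|$)");

    static List<String> split(String columnValues) {
        List<String> values = new ArrayList<>();
        Matcher m = VALUE.matcher(columnValues);
        boolean expectValue = true;
        while (expectValue && m.find()) {
            // Group 2 is the value without its outer quotes ('' is NOT collapsed).
            values.add(m.group(2));
            // Another value can only follow if this match ended on a comma.
            expectValue = m.group().endsWith(",");
        }
        return values;
    }

    public static void main(String[] args) {
        for (String value : split("' 1' , '5' ,1")) {
            System.out.println("[" + value + "]");
        }
    }
}
```

This still inherits the limits of the original pattern: a quoted value that contains a quote directly followed by a comma is split in the wrong place.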
Marimuthu Madasamy
It works! Any idea how I should improve the regex to support blank space between the comma and the single quote? E.g.: ' 1' , '5' ,1
Reusable
It doesn't match `'James''s dictionary'` correctly.
M42
For this, String columnValues = "'', '1', ,2,'James''s dictionary'", it returns an empty string, '1' (a space and then a 1 wrapped in quotes), a space, 2, and James''s dictionary. What else do you want? (One thing I noted: at the end it also matches the end of the string as a blank string and returns that blank string as a match, which we need to avoid.)
Marimuthu Madasamy
Yeah, I noticed it matches a blank string at the end, though I am not sure what the reason is. The following also returns an additional match, which shouldn't be the case: VALUES ('GET','','#000210','',' 0',' 1','5',1,0,'20100706','James''s Dog'', ''is hiding under the car''')
Reusable
@Marimuthu: You're right, I tested with the wrong string. Sorry.
M42
A: 

Regular expressions are not easy to use with this (but everything is possible).

I would suggest parsing it yourself, or using a library to do the parsing. By writing the parser yourself you are certain that it works exactly as you need it to.
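
To give an idea of what "parsing it yourself" could look like for just the value list, here is a small hand-written scanner. It is only a sketch under my own assumptions: values are separated by commas, strings use single quotes with `''` as the escape, and unquoted values contain no commas. `ValueListScanner` and `parse` are invented names.

```java
import java.util.ArrayList;
import java.util.List;

class ValueListScanner {
    // Splits a SQL value list into unquoted values, one character at a time.
    static List<String> parse(String s) {
        List<String> values = new ArrayList<>();
        int i = 0;
        int n = s.length();
        while (true) {
            while (i < n && Character.isWhitespace(s.charAt(i))) i++; // leading blanks
            StringBuilder value = new StringBuilder();
            if (i < n && s.charAt(i) == '\'') {
                i++; // opening quote
                while (true) {
                    if (i >= n) {
                        throw new IllegalArgumentException("Unterminated quoted value");
                    }
                    char c = s.charAt(i++);
                    if (c != '\'') {
                        value.append(c);
                    } else if (i < n && s.charAt(i) == '\'') {
                        value.append('\''); // '' inside a string is one literal quote
                        i++;
                    } else {
                        break; // closing quote
                    }
                }
                while (i < n && Character.isWhitespace(s.charAt(i))) i++; // trailing blanks
            } else {
                int start = i;
                while (i < n && s.charAt(i) != ',') i++; // unquoted: read up to the comma
                value.append(s.substring(start, i).trim());
            }
            values.add(value.toString());
            if (i >= n) {
                return values;
            }
            if (s.charAt(i) != ',') {
                throw new IllegalArgumentException("Expected ',' at index " + i);
            }
            i++; // consume the separator
        }
    }
}
```

Unlike the regex-only attempts, this reports malformed input (an unterminated string, or junk after a closing quote) instead of silently returning odd matches.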

Thorbjørn Ravn Andersen
No, everything is not possible. You can't use regular expressions outside their domain, which is regular languages. The OP's problem has context, so no RE can solve it, by definition. As you and others have suggested, he must use a parser.
EJP
It is _possible_ to use with _this_, but it is not _easy_. Regular expressions cannot handle everything as they are not Turing complete. That, however, is most likely not a suitable answer for the original poster.
Thorbjørn Ravn Andersen
A: 

I think Tim had the right idea; it just needs to be implemented more efficiently. Here's a much more efficient version:

'[^']*+(?:''[^']*+)*+'|[^',\s]++

It uses Friedl's "unrolled loop" technique to avoid excessive reliance on alternations that match one or two characters at a time (I think that's what did you in, Tim), plus possessive quantifiers throughout.
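
Here is a small Java sketch for trying this pattern on a very long quoted value. As far as I know, `java.util.regex` can recurse once per iteration when a quantified group contains an alternation, which would be consistent with the StackOverflowError reported in the comments on Tim's answer, whereas possessive quantifiers are matched in a loop; treat that explanation as my understanding rather than a documented guarantee. `UnrolledLoopDemo`, `tokens` and `longQuotedValue` are hypothetical names, and the stress input is invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledLoopDemo {
    // The unrolled-loop pattern from this answer, escaped for a Java string.
    private static final Pattern VALUE =
            Pattern.compile("'[^']*+(?:''[^']*+)*+'|[^',\\s]++");

    // Returns the raw tokens (quoted values keep their quotes).
    static List<String> tokens(String columnValues) {
        List<String> result = new ArrayList<>();
        Matcher m = VALUE.matcher(columnValues);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Builds one quoted value containing `pairs` escaped quotes (made-up stress input).
    static String longQuotedValue(int pairs) {
        StringBuilder sb = new StringBuilder("'");
        for (int i = 0; i < pairs; i++) {
            sb.append("x''");
        }
        return sb.append("'").toString();
    }

    public static void main(String[] args) {
        List<String> found = tokens(longQuotedValue(100_000) + ",42");
        System.out.println(found.size() + " tokens, first has "
                + found.get(0).length() + " chars");
    }
}
```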

Alan Moore
This works too. Probably more efficient?
Reusable
According to RegexBuddy, @Tim's regex takes 80 steps to match `'Street''James''s dictionary'`, while mine takes 13.
Alan Moore