views:

79

answers:

3

hi, I am writing (hand written) recursive descent parser for SQL select statement in c++, i need to know whether the parse tree created by me is correct or not. I want to check but i didn't get a good sources for sql parse trees. My way of approach is - writing a function for each production and in that function the result is adding to the root tree. Can any one help me? Thanks in advance.

+1  A: 

I don't know how you'll go about verifying your code is correct, but if you're concerned about your understanding of the SQL grammar, then here is a website that lists BNF grammars for various dialects of SQL. You ought to be able construct your parser in terms of these rules.

Oli Charlesworth
yes.. i am using that grammar..
jony
@jony: In that case, what exactly is your question? How to determine that your implementation is bug-free? That's tough to do, which is why most people just use Yacc or Bison, rather than rolling their own.
Oli Charlesworth
@ Oli Charlesworth : i don't have option buddy.. i need to do it in that way, and is my question confusing?? ok.. my question is same the question that what you asked :- How to determine the implementation is bug-free? and if possible is there any good source for parse tree examples of sql so that i can check my parse trees with them.
jony
@jony: Verification is very difficult (no matter how many test cases you provide, there could always be another that breaks it!). But a quick Google search (http://www.google.co.uk/search?q=SQL+%22syntax+tree%22) finds some software that claims to parse SQL in various ways; I don't know if any of those will of any help to you.
Oli Charlesworth
A: 

This parser, based on pyparsing, might be helpful as a second SELECT parsing resource (although it is in Python, not C++, sorry).

Paul McGuire
+1  A: 

My company builds a lot of parsers, and have your same problem. We recently finished a SQL 2011 parser based on the draft standard.

Pretty much you decide if the parse tree is right by hand-inspecting it for many source code cases. This presumes that you can print the parse tree in a form that you can easily inspect; this is easily accomplished by a recursive tree walk of the parse tree. [You have to already believe that your abstract syntax tree nodes correctly model what you intend to capture!]. You choose the cases carefully to exercise different parts of the grammar (think "unit tests for grammars"). For a langauge as rich as SQL, this is a big job.

You also need to validate that the parser works in general, and you do that by feeding a lot of real code for the particular dialect of SQL you are handling. I typically try to find 100K-1M SLOC, and if the parser can't eat all of that, I have still have work left to do. Once you get to that level, you sort of consider that your parser is OK and treat further errors as "maintenance issues".

While the following may not help you directly, it might hint at a direction in which you could head. I use a somewhat different approach, based on having extremely strong parsing machinery available. Our tool, the DMS Software Reengineering Toolkit, given a grammar, will produce ASTs automatically, and has built-in facilities to print such parse trees (in one form as XML). The AST has sufficient information to regenerate ("prettyprint") the source text, and DMS has a built-in prettyprinter. So after hand inspecting a variety of cases, what I do is to take a large body of code, and for each file, parse it (getting no parse errors by virtue of the work done above), prettyprint the source, and reparse the source (expecting to get no errors). This is strong hint that we haven't lost anything in the round trip.

We have a new tool available, the Smart Differencer that compares the text of two programs to see if they are "the same" ignoring language layout rules. It works in essence by parsing two files and comparting their parse trees, ignoring the formatting (line/column/escapes/radix/comments/whitespace). What we are starting to do is to parse the source code, prettyprint it, and the smart-diffing the prettytprinted result against the original file. SmartDiff should say "no AST differences". This is a much stronger hint that we haven't lost anything. You can do pretty the much the same if you are willing to compare your before-and-after printed parse trees.

Ira Baxter