I'm attempting to write an application to extract properties and code from proprietary IDE design files. The file format looks something like this:

HEADING
{
  SUBHEADING1
  {
    PropName1 = PropVal1;
    PropName2 = PropVal2;
  }

  SUBHEADING2
  {
    { 1 ; PropVal1 ; PropValue2 }
    { 2 ; PropVal1 ; PropValue2 ; OnEvent1=BEGIN
                                             MESSAGE('Hello, World!');
                                             { block comments are between braces }
                                             //inline comments are after double-slashes
                                           END; 
    PropVal3 }
    { 1 ; PropVal1 ; PropVal2; PropVal3 }
  }
}

What I am trying to do is extract the contents of each subheading block. In the case of SUBHEADING2, I would also like to split each entry into the tokens delimited by the semicolons. I had reasonably good success by just counting the braces and keeping track of which subheading I'm currently under. The main issue I encountered is dealing with the code comments.

This language happens to use {} for block comments, which interferes with the braces in the file format. To make it even more interesting, the parser also needs to take double-slash inline comments into account and ignore everything up to the end of the line.

What is the best approach to tackling this? I looked at some of the compiler libraries discussed in another article (ANTLR, Doxygen, etc.) but they seem like overkill for solving this specific parsing issue.

+1  A: 

You should be able to put something together in a few hours, using regular expressions in combination with some code that uses the results.

Something like this should work:

  • Initialize the process by loading the file into a string.
  • Pull each top-level block from the string, using regex capture groups to separately identify the block keyword and contents.
  • If a block is found,
    • Make a decision based on the keyword
    • Pass the content to this process recursively.

Following this, you would process HEADING, then SUBHEADING1, then SUBHEADING2, then each of its sub-blocks. Within a sub-block that has no keyword of its own, you would presumably know that any nested braces are a block comment, so there is no need to process them further.
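A rough sketch of that recursive process in Python, just to illustrate the idea. The names `BLOCK_START`, `find_close` and `extract_blocks` are made up for this example, and keywords are assumed to be upper-case identifiers. Python's standard `re` module has no recursive patterns, so the regex here only finds the keyword and its opening brace, and a plain brace counter finds the matching close. That counter knows nothing about comments or string literals, so braces inside code could still confuse it.

```python
import re

# Assumption: a block keyword is an upper-case identifier followed by '{'.
BLOCK_START = re.compile(r"\b([A-Z][A-Z0-9_]*)\s*\{")

def find_close(text, open_pos):
    """Return the index of the brace matching text[open_pos] (naive count)."""
    depth = 0
    for i in range(open_pos, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces")

def extract_blocks(text):
    """Map each keyword to its nested blocks, or to its raw content."""
    blocks = {}
    pos = 0
    while True:
        m = BLOCK_START.search(text, pos)
        if m is None:
            break
        open_pos = m.end() - 1
        close_pos = find_close(text, open_pos)
        content = text[open_pos + 1:close_pos]
        # Recurse; a block without keyword sub-blocks keeps its raw text.
        children = extract_blocks(content)
        blocks[m.group(1)] = children if children else content.strip()
        pos = close_pos + 1
    return blocks

SAMPLE = """HEADING
{
  SUBHEADING1
  {
    PropName1 = PropVal1;
    PropName2 = PropVal2;
  }

  SUBHEADING2
  {
    { 1 ; PropVal1 ; PropValue2 }
  }
}
"""

result = extract_blocks(SAMPLE)
print(result)
```

The keyword-less rows under SUBHEADING2 come back as raw text, which is where the "decision based on the keyword" step would take over.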

John Fisher
Thanks for the advice. As a result I've taken the initiative to learn more about regular expressions.
polara
+3  A: 

I'd suggest writing a tokenizer and parser; this will give you more flexibility. The tokenizer basically does a simple text-wise breakdown of the source code and puts it into a more usable data structure; the parser figures out what to do with it, often leveraging recursion.
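To make the split concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names (`tokenize`, `parse_block`, `parse_row`, `parse`), the token set, and the shape of the result. It only covers the structural part of the format (keyword blocks, `Name = Value;` properties and `{ a ; b ; c }` rows); embedded code with comments would need extra handling in the tokenizer.

```python
import re

def tokenize(text):
    """Break the text into punctuation tokens and whitespace-free words."""
    return re.findall(r"[{};=]|[^\s{};=]+", text)

def parse_row(tokens, i):
    """Parse the fields of a keyword-less row; i is just past its '{'."""
    fields, current = [], []
    while tokens[i] != "}":
        if tokens[i] == ";":
            fields.append(" ".join(current))
            current = []
        else:
            current.append(tokens[i])
        i += 1
    fields.append(" ".join(current))
    return fields, i + 1

def parse_block(tokens, i):
    """Parse a block body; i is just past its '{'. Returns (node, next index)."""
    node = {"props": {}, "blocks": {}, "rows": []}
    while tokens[i] != "}":
        if tokens[i] == "{":                 # row such as { 1 ; a ; b }
            row, i = parse_row(tokens, i + 1)
            node["rows"].append(row)
        elif tokens[i + 1] == "{":           # nested keyword block
            name = tokens[i]
            node["blocks"][name], i = parse_block(tokens, i + 2)
        elif tokens[i + 1] == "=":           # Name = Value ;
            end = tokens.index(";", i)
            node["props"][tokens[i]] = " ".join(tokens[i + 2:end])
            i = end + 1
        else:
            raise ValueError(f"unexpected token {tokens[i]!r}")
    return node, i + 1

def parse(text):
    """Parse one top-level 'KEYWORD { ... }' block."""
    tokens = tokenize(text)
    node, _ = parse_block(tokens, 2)
    return {tokens[0]: node}

SAMPLE = """HEADING
{
  SUBHEADING1
  {
    PropName1 = PropVal1;
    PropName2 = PropVal2;
  }
  SUBHEADING2
  {
    { 1 ; PropVal1 ; PropValue2 }
    { 2 ; PropVal1 ; PropVal2; PropVal3 }
  }
}
"""

tree = parse(SAMPLE)
sub1 = tree["HEADING"]["blocks"]["SUBHEADING1"]
sub2 = tree["HEADING"]["blocks"]["SUBHEADING2"]
print(sub1["props"])
print(sub2["rows"])
```

The parser never looks at raw characters, only at tokens, so changes to the lexical rules stay inside `tokenize`.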

Terms to google: tokenizer, parser, compiler design, grammars

Math expression evaluator: http://www.codeproject.com/KB/vb/math_expression_evaluator.aspx (you might be able to take an example like this and hack it apart into what you want)

More info about parsing: http://www.codeproject.com/KB/recipes/TinyPG.aspx

You won't have to go nearly as far as those articles go, but you're going to want to study up a bit on this first.

FastAl
A: 

No matter which solution you choose, I'm pretty sure the best way is to have two parsers/tokenizers: one for the main file structure, with {} as grouping characters, and one for the code blocks.
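A small Python sketch of what that could look like for a single row, offered as an illustration rather than a finished solution. The assumptions are mine: code always starts at the word BEGIN and ends at its matching END, string literals use single quotes, block comments do not nest, and the names `scan_row` and `scan_code` are invented. Constructs that close with END without a matching BEGIN, if the language has any, would need extra handling.

```python
import re

WORD = re.compile(r"\w+")

def scan_code(text, pos):
    """Code scanner: text[pos:] starts with BEGIN.
    Return the index just past the matching END."""
    depth = 0
    while pos < len(text):
        if text[pos] == "{":                 # block comment: skip to '}'
            pos = text.index("}", pos) + 1
        elif text.startswith("//", pos):     # line comment: skip to end of line
            nl = text.find("\n", pos)
            pos = len(text) if nl == -1 else nl
        elif text[pos] == "'":               # string literal: skip to closing quote
            pos = text.index("'", pos + 1) + 1
        else:
            m = WORD.match(text, pos)
            if m is None:
                pos += 1
                continue
            pos = m.end()
            word = m.group().upper()
            if word == "BEGIN":
                depth += 1
            elif word == "END":
                depth -= 1
                if depth == 0:
                    return pos
    raise ValueError("unterminated code block")

def scan_row(text, pos):
    """Structure scanner: text[pos] is the '{' that opens a row.
    Return (fields, index just past the closing '}')."""
    fields = []
    pos += 1
    start = pos
    while pos < len(text):
        m = WORD.match(text, pos)
        if m and m.group().upper() == "BEGIN":
            pos = scan_code(text, pos)       # hand off to the code scanner
        elif m:
            pos = m.end()
        elif text[pos] == ";":
            fields.append(text[start:pos].strip())
            pos += 1
            start = pos
        elif text[pos] == "}":
            fields.append(text[start:pos].strip())
            return fields, pos + 1
        else:
            pos += 1
    raise ValueError("unterminated row")

ROW = """{ 2 ; PropVal1 ; PropValue2 ; OnEvent1=BEGIN
    MESSAGE('Hello, World!');
    { block comments are between braces }
    //inline comments are after double-slashes
  END;
  PropVal3 }"""

fields, end = scan_row(ROW, 0)
for field in fields:
    print(repr(field))
```

Because braces, semicolons and slashes inside the code are only ever seen by `scan_code`, the structure scanner can keep treating {} and ; as plain delimiters.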

devio