It's been a few years since I've had to parse any files which were harder than CSV or XML so I am out of practice. I've been given the task of parsing a file format called NeXus in a Delphi application.

The problem is I just don't know where to start. Do I use a tokenizer, regex, etc.? Maybe a tutorial is what I need at this point.

+7  A: 

Have a look at GOLD Parser. It's a meta-parsing system that allows you to define a formal grammar for a language/file format. It creates a parsing rules file which you feed into a tokenizer, together with your input file, and it creates a syntax tree in memory.

There's a Delphi implementation of the tokenizer available on the website. It makes parsing a lot easier since the lexing and tokenizing is already taken care of for you, and all you have to worry about is defining the tokens in a formal grammar and then interpreting them once they've been parsed.

Mason Wheeler
I'm checking out GOLD Parser now.
Daisetsu
Just an FYI: this only supports BNF, not EBNF, which makes it somewhat painful to work with.
Daisetsu
Yeah, I agree, EBNF support would make it simpler. :(
Mason Wheeler
+1, this seems to be in active development (and that says a lot!). The only version I used was version 1 (or 2??) and I didn't like it much; I preferred my hand-written parsers. But I absolutely need to give this another try.
Cosmin Prund
+2  A: 

In addition to Mason's very nice answer, there is a great little class in Delphi that is often underappreciated, and one that you can learn a really nice technique from: the PageProducer class.

Have a look at the way that it parses HTML and surfaces events on things like finding tags, attributes, etc. I'm not saying use the PageProducer (because you won't be able to for NeXus), but it's a very simple, elegant and powerful technique.

Tim Jarvis
+3  A: 

Check this out, it's commercial, but it looks like a fun toy:

http://dpg.zenithlab.com/

But, actually: for NeXus you do not need a complicated parser.

A bit of position checking code, and some string-splitting and parenthesis counting, and you've got it written.

I would parse it using a simple token-at-a-time parser like this:

  1. load the file into a TStringList.
  2. for each line, grab one token at a time to determine the line type.
    have an enumerated type for this line type.
  3. the first non-blank line should be detected to be a valid #nexus tag.
  4. next comes the header area (mostly skipped, it looks like).
  5. begin is the first keyword on the line.
  6. the following lines inside the begin block look almost like a DOS command with its command line parameters: they are separated by spaces and end with semicolons. pretty much like Pascal, but with parentheses.
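The steps above can be sketched roughly as follows. This is only an illustration: the TLineType values, the FirstWord and ClassifyLine helpers, and the sample lines are made up for the example, and it assumes that keywords are case-insensitive and that a block ends with "end" or "endblock".

```pascal
program ClassifyNexusLines;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // Hypothetical line types; extend as the format requires.
  TLineType = (ltBlank, ltNexusTag, ltBegin, ltEnd, ltCommand);

// Returns the first whitespace-delimited word of a line,
// stopping before a ';' if one is attached to the word.
function FirstWord(const Line: string): string;
var
  S: string;
  I: Integer;
begin
  S := Trim(Line);
  I := 1;
  while (I <= Length(S)) and (S[I] > ' ') and (S[I] <> ';') do
    Inc(I);
  Result := Copy(S, 1, I - 1);
end;

// Decides the line type from the first word only.
function ClassifyLine(const Line: string): TLineType;
var
  W: string;
begin
  W := FirstWord(Line);
  if W = '' then
    Result := ltBlank
  else if SameText(W, '#NEXUS') then
    Result := ltNexusTag
  else if SameText(W, 'begin') then
    Result := ltBegin
  else if SameText(W, 'end') or SameText(W, 'endblock') then
    Result := ltEnd
  else
    Result := ltCommand;
end;

var
  Lines: TStringList;
  I: Integer;
begin
  Lines := TStringList.Create;
  try
    // Made-up sample input; a real program would call Lines.LoadFromFile.
    Lines.Add('#NEXUS');
    Lines.Add('');
    Lines.Add('begin taxa;');
    Lines.Add('  dimensions ntax=3;');
    Lines.Add('end;');
    // Print the ordinal of each line's type next to the line itself.
    for I := 0 to Lines.Count - 1 do
      WriteLn(Ord(ClassifyLine(Lines[I])), ': ', Lines[I]);
  finally
    Lines.Free;
  end;
end.
```

A real file can put several commands on one line or spread one command over several lines, so classifying by line is only a starting point.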

For the above I would code for myself a little set of helpers, and eventually one of the things I might need to write is a little token splitting function like this:

function GetToken(var inputString: String; var outputToken: String; const Separators: TStrings; Keywords: TStrings; ParenFlag: Boolean): Boolean;

GetToken would return True when it was able to find and return a token string from inputString. It would skip any leading whitespace and terminate when it finds a separator. Separators are items like space or comma.
ParenFlag = True would mean that the next token I get should be an entire parenthesized list of items. Once I have the whole parenthesized list, such as (((a,b),(c,d),(e,f))), I would call another function that unpacks the content of that list into some data structure for the lists/arrays.
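A minimal sketch of such a token splitter, using a simplified variant of the signature above. The simplifications are assumptions made for this example: Separators is a plain string of single-character separators instead of a TStrings, the Keywords parameter is left out, and instead of a ParenFlag the function treats any token that starts with "(" as a parenthesized list and returns it whole by counting parentheses.

```pascal
program GetTokenSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Removes the next token from the front of Input and returns it in Token.
// Returns False when nothing but whitespace and separators is left.
function GetToken(var Input: string; var Token: string;
  const Separators: string): Boolean;
var
  I, Depth: Integer;
begin
  Token := '';
  // Skip leading whitespace and separator characters.
  I := 1;
  while (I <= Length(Input)) and
        ((Input[I] <= ' ') or (Pos(Input[I], Separators) > 0)) do
    Inc(I);
  Delete(Input, 1, I - 1);
  if Input = '' then
  begin
    Result := False;
    Exit;
  end;

  I := 1;
  if Input[1] = '(' then
  begin
    // Parenthesis counting: stop right after the ')' that closes the
    // first '('. An unbalanced list simply runs to the end of Input.
    Depth := 0;
    while I <= Length(Input) do
    begin
      if Input[I] = '(' then
        Inc(Depth)
      else if Input[I] = ')' then
        Dec(Depth);
      Inc(I);
      if Depth = 0 then
        Break;
    end;
  end
  else
    // Ordinary token: runs until whitespace, a separator, or a '('.
    while (I <= Length(Input)) and (Input[I] > ' ') and
          (Pos(Input[I], Separators) = 0) and (Input[I] <> '(') do
      Inc(I);

  Token := Copy(Input, 1, I - 1);
  Delete(Input, 1, I - 1);
  Result := True;
end;

var
  Line, Tok: string;
begin
  // Made-up sample line; ',' and ';' are treated as separators here.
  Line := 'tree t1 = ((a,b),(c,d));';
  while GetToken(Line, Tok, ',;') do
    WriteLn('[', Tok, ']');
end.
```

On the sample line this yields the tokens tree, t1, = and ((a,b),(c,d)). The parenthesized token can then be handed to a separate unpacking routine, as described above.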

I do not recommend the big parser engine, although writing a BNF grammar first, before you write the parser, will help you write the code. There's nothing so brutal here that you can't parse it by hand.

Are you going to be expected to do queries/transforms on this? Do you think you need to convert it into json or xml in order to work further with it?

Warren P
Fine explanation, but I would stay away from loading with TStringList. It unnecessarily puts a limit on file size (+/- half of the memory available to your app).
Marco van de Voort
DPG seems to be an old project, last release was in 2002. That wouldn't necessarily be a problem, but in the particular case of Delphi, I would not invest money (much less time) in a technology that's not proven to work with Delphi Unicode.
Cosmin Prund
If it came with source I could port it forward in a few hours. But yes, the fact that he only lists Delphi 7 and back from that as compatible tells you, it's dead. Wish he'd open source it and I'll fix it up for him.
Warren P
+2  A: 

I haven't yet found a text format that a state machine won't parse. Add in recursion to run down nests in trees. State machines are an easily written, relatively quick parsing engine that can be built for virtually any patterned text file, and are often easier than using a scripting language to boot. I have custom ones written for HTML, XML, HL7 and a variety of medical EDI formats.
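As a rough illustration of the idea (not the author's own code), here is a small character-level state machine with three states. It splits text into commands terminated by ";" while dropping anything inside square brackets and ignoring semicolons inside single quotes. The state names, the SplitCommands routine, the sample text, and the assumption that bracketed comments do not nest are all made up for this sketch; the recursion for nested trees is left out.

```pascal
program StateMachineSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // Hypothetical scanner states.
  TScanState = (ssText, ssComment, ssQuoted);

// Splits Source into commands terminated by ';'. Text inside [...] is
// dropped, and a ';' inside '...' does not end a command.
procedure SplitCommands(const Source: string; Commands: TStrings);
var
  State: TScanState;
  I: Integer;
  Ch: Char;
  Current: string;
begin
  State := ssText;
  Current := '';
  for I := 1 to Length(Source) do
  begin
    Ch := Source[I];
    case State of
      ssText:
        case Ch of
          '[': State := ssComment;
          '''':
            begin
              State := ssQuoted;
              Current := Current + Ch;
            end;
          ';':
            begin
              // End of command: store it unless it is empty.
              if Trim(Current) <> '' then
                Commands.Add(Trim(Current));
              Current := '';
            end;
        else
          Current := Current + Ch;
        end;
      ssComment:
        if Ch = ']' then
          State := ssText;
      ssQuoted:
        begin
          Current := Current + Ch;
          if Ch = '''' then
            State := ssText;
        end;
    end;
  end;
  // Text without a terminating ';' is ignored in this sketch.
end;

var
  Cmds: TStringList;
  I: Integer;
begin
  Cmds := TStringList.Create;
  try
    SplitCommands(
      'begin taxa; [a comment; ignored] taxlabels ''A;B'' C; end;', Cmds);
    for I := 0 to Cmds.Count - 1 do
      WriteLn(Cmds[I]);
  finally
    Cmds.Free;
  end;
end.
```

The sample text comes out as three commands: "begin taxa", "taxlabels 'A;B' C" and "end". Each new rule of the format becomes either a new state or a new branch inside an existing state.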

Cameron
This is a good way to do it actually. You could have a parsing state object which contains a state machine function and state machine variables, and methods like "Rewind", and "NextSymbol:String", and Eof:Boolean.
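A minimal sketch of what such a parsing state object could look like. The class name TParseState and its fields are invented for the example, symbols are assumed to be separated by whitespace only, and Rewind here only steps back over the most recent NextSymbol call.

```pascal
program ParseStateSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical parsing state object with the methods named above.
  TParseState = class
  private
    FText: string;
    FPos: Integer;      // index of the next unread character
    FPrevPos: Integer;  // value of FPos before the last NextSymbol call
  public
    constructor Create(const AText: string);
    function Eof: Boolean;
    function NextSymbol: string;
    procedure Rewind;
  end;

constructor TParseState.Create(const AText: string);
begin
  inherited Create;
  FText := AText;
  FPos := 1;
  FPrevPos := 1;
end;

// True when only whitespace (or nothing) remains unread.
function TParseState.Eof: Boolean;
var
  I: Integer;
begin
  I := FPos;
  while (I <= Length(FText)) and (FText[I] <= ' ') do
    Inc(I);
  Result := I > Length(FText);
end;

// Returns the next whitespace-delimited symbol, or '' at the end.
function TParseState.NextSymbol: string;
var
  Start: Integer;
begin
  FPrevPos := FPos;
  while (FPos <= Length(FText)) and (FText[FPos] <= ' ') do
    Inc(FPos);
  Start := FPos;
  while (FPos <= Length(FText)) and (FText[FPos] > ' ') do
    Inc(FPos);
  Result := Copy(FText, Start, FPos - Start);
end;

// Undoes the most recent NextSymbol call.
procedure TParseState.Rewind;
begin
  FPos := FPrevPos;
end;

var
  P: TParseState;
begin
  P := TParseState.Create('begin taxa ; end ;');
  try
    WriteLn(P.NextSymbol);   // peek at the first symbol
    P.Rewind;                // and put it back
    while not P.Eof do
      WriteLn(P.NextSymbol);
  finally
    P.Free;
  end;
end.
```

Keeping the position and the one-step undo inside the object lets the state machine code ask for symbols without touching string indexes itself.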
Warren P