  1. Should I be using regular expressions to do this?
  2. Is it possible to structure the results as something queryable (IEnumerable, etc.)?

I have a file, and I cannot change how it is generated. I want to create a parser class to extract all the data. Ideally, I would then use this class to open the file and have it return a queryable, array-like structure I can use.

The data is structured like this:

["Table"] = {
    ["Text"] = { 
        ["Number"] = { 
            "Item", --[1]
            "Item", --[2]
            "Item", --[3]
        },
    --repeat--
Note that the actual file has line breaks, tabs, etc. (\n\t\t). As you will see, the patterns I use take this into account to get the different levels.

I have a regular expression that was written in VB6 for this very file, but 1 of the 7 patterns does not work:

@"^\t\[""([\s\S]*?)""] = {([\s\S]*?)^\t},$"

This is supposed to group each topmost-level ["Table"] into its own match, but it returns 0 matches and it is slow. If I take the $ sign out, it just returns all the sub-nodes too. This is the only thing stopping me from using regular expressions to do this.

Another option is just to parse line by line, I guess. I am sure I can figure this out given time, but I'd like to hear other opinions before I go one way or the other.

Any thoughts?

+1  A: 

Go with your gut. Regular expressions are the correct way to handle this. If you could post a sample, I can help you write a regex to match whatever you want :-)

One easy way to test your regular expressions quickly is to go to http://rubular.com/

It shows you the matches against your sample on the fly, allowing you to fine-tune your expression quickly.

Caladain
Rubular is a nifty site...thanks for pointing it out.
JasCav
+3  A: 

I would stay away from regular expressions. If you want to do any real-world parsing on such a file, you will quickly run into issues that are very hard to debug: for example, getting the nesting right (assuming your file can have multiple levels of nesting) and getting correct results will cause you a lot of headaches. Many patterns can make a regex engine backtrack so heavily that it looks like an infinite loop and never finishes (or at least not in any reasonable time). Writing a simple parser for this should be quick, and it will be easier to debug, faster, and more maintainable.
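To illustrate how small such a parser can be, here is a minimal line-by-line sketch. It is written in Python rather than C#, and it is not taken from any answer in this thread. The three line shapes it recognizes (an opening `["key"] = {`, a quoted item with an optional `--[n]` comment, and a closing `},`) are assumptions inferred from the sample in the question, and the names `parse_rows`, `OPEN_RE`, `ITEM_RE`, `CLOSE_RE` and `SAMPLE` are made up for the example.

```python
import re

# Assumed line shapes, inferred from the sample in the question.
OPEN_RE = re.compile(r'^\s*\["([^"]*)"\]\s*=\s*\{\s*$')   # ["Key"] = {
ITEM_RE = re.compile(r'^\s*"([^"]*)",?\s*(?:--.*)?$')      # "Item", --[1]
CLOSE_RE = re.compile(r'^\s*\},?\s*$')                     # },


def parse_rows(text):
    """Yield (path, item) pairs; path is the tuple of enclosing keys."""
    path = []
    for line in text.splitlines():
        opened = OPEN_RE.match(line)
        if opened:
            # Entering a nested table: remember its key.
            path.append(opened.group(1))
        elif CLOSE_RE.match(line):
            # Leaving the innermost table.
            if path:
                path.pop()
        else:
            item = ITEM_RE.match(line)
            if item:
                yield tuple(path), item.group(1)
            # Any other line (blank, comment-only) is skipped.


# Hypothetical data in the same shape as the question's sample.
SAMPLE = (
    '["Table"] = {\n'
    '\t["Text"] = {\n'
    '\t\t["Number"] = {\n'
    '\t\t\t"Item A", --[1]\n'
    '\t\t\t"Item B", --[2]\n'
    '\t\t},\n'
    '\t},\n'
    '},\n'
)

rows = list(parse_rows(SAMPLE))

# The flat (path, item) rows can be filtered like any other sequence.
items_in_number = [item for path, item in rows if path[-1] == "Number"]
print(items_in_number)
```

Because the nesting is tracked with an explicit stack instead of by counting tabs, this approach does not depend on indentation. A generator of flat rows is also roughly what the question asks for under "queryable": in C# the equivalent would likely be an iterator method returning an IEnumerable.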

CarlosAg
+1 - a PARSER may be more correct. Get one that lets you supply a proper grammar, which is a lot better for complex syntax.
TomTom
A: 

I'm guessing that your structure is Lua-related; by the looks of it, it should be readable by Lua as-is. If I'm right, you might want to check out LuaInterface.

There are also some other questions here with example code: Parse a Lua Datastructure, Read nested Lua table

Don
It is Lua. I ended up customizing this: http://youpvp.com/blog/post/LuaParse-C-parser-for-World-of-Warcraft-saved-variable-files.aspx
Dan
Nice find, haven't seen that one before :)
Don
A: 

Do not use regex - get a proper parser that you can feed a grammar file. That makes much more complex parsing easy compared to regex.

TomTom
A: 

Question #1 practically answers itself. In fact, this is a textbook example of the top two reasons why regexes should be avoided in many cases.

  • You inherited a regex that worked, but now it needs to be tweaked and nobody in your shop has the necessary expertise.

  • The data has a recursive or hierarchical structure, something regexes are particularly ill-suited for.

Your regex gets around the recursion problem by cheating; it uses the length of each line's leading whitespace to infer which delimiter goes with which. You could do it properly using .NET's recursive matching feature, but it would be very, very ugly. So let's see what we can do with what you've got.

@"^\t\[""([\s\S]*?)""] = {([\s\S]*?)^\t},$"

Your performance problem is almost certainly due to that second [\s\S]*?, which, by the way, should be .*? with Singleline mode set; only JavaScript requires that [\s\S] hack. But whichever way you write it, you're asking it to do too much work. Here's how I would do it:

@"^\t\[""([^""]*)""\] *= *{(?>.*\n)*?\t}," // Multiline ON, Singleline OFF

Where you were matching one character at a time with [\s\S]*?, I match a full line at a time with (?>.*\n)*?. Reluctant quantifiers are very handy, but you can get into just as much trouble with them as you can with the greedy ones if you overwork them.

I still use the ^ anchor at the beginning, but I don't have to use anchors anywhere else because I'm matching all of the newlines explicitly. And, although I used \n in this example for clarity, I usually use (?:\r\n|[\r\n]) to match any of the three most common line separators: \r\n (Windows), \r (older Macs) and \n (Unix/Linux/OSX).
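For readers who want to try the line-at-a-time idea outside .NET, here is a rough Python port of the pattern above, offered as a sketch rather than a drop-in replacement. Two things differ from the original and are my assumptions: the atomic group (?>...) is written as a plain non-capturing group (?:...), because Python's re module only accepts atomic groups from version 3.11 on, and an extra capture group is added around the body so it can be inspected. The names `TOP_LEVEL` and `SAMPLE` and the table names "Alpha" and "Beta" are invented for the example.

```python
import re

# Port of: ^\t\["([^"]*)"\] *= *{(?>.*\n)*?\t},   (Multiline ON, Singleline OFF)
# Each pass of (?:.*\n) consumes exactly one full line, so the closing
# "\t}," is only ever tried at the start of a line.
TOP_LEVEL = re.compile(
    r'^\t\["([^"]*)"\] *= *\{((?:.*\n)*?)\t\},',
    re.MULTILINE,
)

# Hypothetical data: two top-level tables, each indented by one tab.
SAMPLE = (
    '\t["Alpha"] = {\n'
    '\t\t["Text"] = {\n'
    '\t\t\t"Item", --[1]\n'
    '\t\t},\n'
    '\t},\n'
    '\t["Beta"] = {\n'
    '\t\t["Text"] = {\n'
    '\t\t\t"Item", --[1]\n'
    '\t\t},\n'
    '\t},\n'
)

matches = list(TOP_LEVEL.finditer(SAMPLE))
top_level_names = [m.group(1) for m in matches]
bodies = [m.group(2) for m in matches]
print(top_level_names)
```

The nested `\t\t},` lines are skipped because the pattern requires exactly one tab before the closing brace, which is the same indentation trick the original regex relies on. If the file might use \r\n line endings, the \n inside the group would need to become (?:\r\n|[\r\n]) as described above.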

Alan Moore