ansaurus

Question

Parsing an unknown data structure in python

Answer 1

+1 A:

That depends on how the data is structured, and what kind of changes you need to do.

One option might be to parse that into a Python data structure; the format looks similar, except that the strings aren't quoted. Once the data is in a Python structure, complex manipulation is easy.

On the other hand, if all you need to do is make changes that modify some entries to other entries, you can do it with search and replace.

So you need to understand the issue better before you can know what the best way is.

Lennart Regebro 2009-06-26 19:33:11

I want to delete an 'Entry' each time I get something like "Data1:0" or "Data:[]". And also sort the entries in each group based on other conditions of the data members. I'll look into parsing as a Python data structure though, thanks!

Lin 2009-06-26 19:44:43

Yeah, if you want to sort them, parsing into a Python structure definitely sounds like the best.

Lennart Regebro 2009-06-26 19:59:10
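
Once the data is in nested dicts, the deletion and sorting described in the comments above might look roughly like the sketch below. Everything specific here is an assumption: the layout (a group is a dict of entries, each entry a dict of data members), the rule that a value of "0" or an empty list marks an entry for deletion, and the sort field. The helper names are hypothetical.

```python
def is_empty_marker(value):
    """True for values that should cause an entry to be dropped (assumed: "0" or [])."""
    return value == "0" or value == []


def prune_entries(group):
    """Return a copy of `group` without entries that contain an empty marker."""
    return {
        name: entry
        for name, entry in group.items()
        if not any(is_empty_marker(v) for v in entry.values())
    }


def sort_entries(group, key_field):
    """Return (name, entry) pairs ordered by one data member (hypothetical rule)."""
    return sorted(group.items(), key=lambda item: item[1].get(key_field, ""))


# Made-up sample data, only to exercise the helpers.
group = {
    "Entry1": {"Data1": "0", "Data2": "x"},
    "Entry2": {"Data1": "b", "Data2": []},
    "Entry3": {"Data1": "z", "Data2": "y"},
    "Entry4": {"Data1": "a", "Data2": "y"},
}
kept = prune_entries(group)
ordered = sort_entries(kept, "Data1")
```

Returning new objects instead of deleting in place avoids modifying a dict while iterating over it.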

Answer 2

+1 A:

This is a pretty similar problem to XML processing, and there's a lot of Python code to do that. So if you could somehow convert the file to XML, you could just run it through a parser from the standard library. An XML version of your example would be something like this:

<group id="Group1">  
    <entry id="Entry1">
        <title id="Title1"><data id="Data1">Member1</data> <data id="Data2">Member2</data></title>
        <title id="Title2"><data id="Data3">Member3</data> <data id="Data4">Member4</data></title>
    </entry>  
    <entry id="Entry2">  
        ...
    </entry>
</group>

Of course, converting to XML probably isn't the most straightforward thing to do. But your job is pretty similar to what the XML parsers already do; you just have a different syntax to deal with. So you could take a look at some XML parsing code and write a little Python parser for your data file based on that. (Depending on how the XML parser is implemented, you might even be able to copy the code, change a few regular expressions, and run it on your file.)

David Zaslavsky 2009-06-26 19:40:14
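
If the file were converted to XML like the example in Answer 2, the standard library's xml.etree.ElementTree could load it. The sketch below assumes exactly that element layout (group, entry, title, data, each with an id attribute). The elided Entry2 is left empty, and element_to_dict is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<group id="Group1">
    <entry id="Entry1">
        <title id="Title1"><data id="Data1">Member1</data> <data id="Data2">Member2</data></title>
        <title id="Title2"><data id="Data3">Member3</data> <data id="Data4">Member4</data></title>
    </entry>
    <entry id="Entry2">
    </entry>
</group>
"""


def element_to_dict(element):
    """Convert an element into nested dicts keyed by each child's id attribute."""
    if element.tag == "data":
        # Leaf element: its text is the value.
        return (element.text or "").strip()
    # Container element: iterating over it yields its direct children.
    return {child.get("id"): element_to_dict(child) for child in element}


root = ET.fromstring(XML_TEXT.strip())
data = {root.get("id"): element_to_dict(root)}
```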

Answer 3

+3 A:

The data structure basically seems to be a dict where the keys are strings and each value is either a string or another dict of the same type, so I'd recommend pulling it into that sort of Python structure, e.g.:

{'group1': {'Entry2': {}, 'Entry1': {'Title1': {'Data4': 'Member4',
'Data1': 'Member1', 'Data3': 'Member3', 'Data2': 'Member2'},
'Title2': {}}}}

At the top level of the file you would create a blank dict, and then for each line you read, you use the identifier as a key, and then when you see a { you create the value for that key as a dict. When you see Key:Value, then instead of creating that key as a dict, you just insert the value normally. When you see a } you have to 'go back up' to the previous dict you were working on and go back to filling that in.

I'd think this whole parser, which puts the file into a Python structure like this, could be done in one fairly short recursive function that calls itself to fill in each sub-dict when it sees a { and then returns to its caller upon seeing a }.

bdk 2009-06-26 19:56:16
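
The recursive function described in Answer 3 could be sketched as follows. The input format is an assumption, since the original file is not shown in this thread: each line is taken to be "Name {", "Key:Value", or a lone "}". parse_block and SAMPLE are illustrative names.

```python
def parse_block(lines):
    """Consume lines from a shared iterator until a closing brace; return a dict."""
    result = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "}":
            # End of this block: hand the finished dict back to the caller.
            return result
        if line.endswith("{"):
            # "Name {" opens a sub-dict; recurse to fill it in.
            name = line[:-1].strip()
            result[name] = parse_block(lines)
        else:
            # "Key:Value" is stored as a plain string entry.
            key, _, value = line.partition(":")
            result[key.strip()] = value.strip()
    return result


# Assumed sample input, not the asker's real file.
SAMPLE = """
Group1 {
    Entry1 {
        Title1 {
            Data1:Member1
            Data2:Member2
        }
    }
    Entry2 {
    }
}
"""

data = parse_block(iter(SAMPLE.splitlines()))
```

Passing one iterator through every recursive call is what lets the caller resume right after the sub-block's closing brace.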

Answer 4

+2 A:

If you have the grammar for the structure of your data file, or you can create it yourself, you could use a parser generator for Python, like YAPPS.

Marco Mustapic 2009-06-26 19:58:29

Answer 5

+1 A:

I have something similar, but written in Java. It parses a file with the same basic structure but a slightly different syntax (no '{' and '}', only indentation as in Python). It is a very simple scripting language.

Basically it works like this: it uses a stack to keep track of the innermost block of instructions (or, in your case, data) and appends every new instruction to the block on top. If it parses an instruction which expects a new block, that instruction is pushed onto the stack. When a block ends, it pops one element from the stack.

I do not want to post the entire source because it is big and it is available on Google Code (lizzard-entertainment, revision 405). There are a few things you need to know:

Instruction is an abstract class with a block_expected method that indicates whether the concrete instruction needs a block (like loops, etc.). In your case this is unnecessary; you only need to check for '{'.
Block extends Instruction. It contains a list of instructions and has an add method to add more.
indent_level returns how many spaces precede the instruction text. This is also unnecessary with '{' and '}' signs.


BufferedReader input = null;
try {
    input = new BufferedReader(new FileReader(inputFileName));
    // Stack of instruction blocks
    Stack<Block> stack = new Stack<Block>();
    // Push the root block
    stack.push(this.topLevelBlock);
    String line = null;
    Instruction prev = new Noop();
    while ((line = input.readLine()) != null) {
        // Difference between the indentation of the previous and this line
        // You do not need this you will be using {} to specify block boundaries
        int level = indent_level(line) - stack.size();
        // Parse the line (returns an instruction object)
        Instruction inst = Instruction.parse(line.trim().split(" +"));
        // If the previous instruction expects a block (for example repeat)
        if (prev.block_expected()) {
            if (level != 1) {
                // TODO handle error
                continue;
            }
            // Push the previous instruction and add the current instruction
            stack.push((Block)(prev));
            stack.peek().add(inst);
        } else {
            if (level > 0) {
                // TODO handle error
                continue;
            } else if (level < 0) {
                // Pop the stack at the end of blocks
                for (int i = 0; i < -level; ++i)
                    stack.pop();
            }
            stack.peek().add(inst);
        }
        prev = inst;
    }
} finally {
    if (input != null)
        input.close();
}

stribika 2009-06-26 21:56:48

Indent your code (there's a button on the toolbar) for it to be formatted properly.

Kiv 2009-06-26 22:10:57

Sorry, it's not working for me. At least the final part is visible now.

stribika 2009-06-26 22:23:01

You have a list before the code, which has its own indent. Either type something on an unindented row in between, or indent the code even more.

MizardX 2009-06-26 22:40:33

Thank you it works :)

stribika 2009-06-26 22:48:15
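
A rough Python analogue of the stack approach from Answer 5, adapted to brace-delimited data instead of indentation. It is not a translation of the Java code above. The line format ("Name {", "Key:Value", a lone "}") is assumed, and parse_with_stack is a hypothetical name.

```python
def parse_with_stack(text):
    """Build nested dicts using an explicit stack instead of recursion."""
    root = {}
    stack = [root]  # the innermost open block is always stack[-1]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("{"):
            # Open a new block inside the current one and make it current.
            block = {}
            stack[-1][line[:-1].strip()] = block
            stack.append(block)
        elif line == "}":
            # Close the current block; the root itself must never be popped.
            if len(stack) == 1:
                raise ValueError("unmatched '}'")
            stack.pop()
        else:
            key, _, value = line.partition(":")
            stack[-1][key.strip()] = value.strip()
    if len(stack) != 1:
        raise ValueError("missing '}'")
    return root


# Assumed sample input, not the asker's real file.
SAMPLE = """
Group1 {
    Entry1 {
        Data1:Member1
    }
    Entry2 {
    }
}
"""

data = parse_with_stack(SAMPLE)
```

Compared with a recursive version, the explicit stack makes unbalanced braces easy to report and avoids recursion-depth limits on deeply nested files.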

Answer 6

+2 A:

Here is a grammar.

dict_content : NAME ':' NAME [ ',' dict_content ]?
             | NAME '{' [ dict_content ]? '}' [ dict_content ]?
             | NAME '[' [ list_content ]? ']' [ dict_content ]?
             ;

list_content : NAME [ ',' list_content ]?
             | '{' [ dict_content ]? '}' [ ',' list_content ]?
             | '[' [ list_content ]? ']' [ ',' list_content ]?
             ;

Top level is dict_content.

I'm a little unsure about the comma after dicts and lists embedded in a list, as you didn't provide any example of that.

MizardX 2009-06-26 22:59:43
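
The grammar in Answer 6 maps fairly directly onto a hand-written recursive-descent parser. The sketch below is one possible reading, not a definitive implementation. NAME is assumed to be any run of characters other than whitespace and the punctuation `{}[]:,`. Commas between items are treated as optional and trailing commas are tolerated, which is more lenient than the grammar. A list is also accepted after a colon (the "Data:[]" form mentioned in the comments on Answer 1), which the grammar does not cover. The class and method names are illustrative.

```python
import re

# Punctuation tokens, or runs of anything that is not whitespace/punctuation.
TOKEN_RE = re.compile(r"[{}\[\]:,]|[^\s{}\[\]:,]+")
PUNCTUATION = set("{}[]:,")


class Parser:
    def __init__(self, text):
        self.tokens = TOKEN_RE.findall(text)
        self.pos = 0

    def peek(self):
        """Return the next token without consuming it, or None at end of input."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        """Consume and return the next token, optionally checking its value."""
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    def name(self):
        token = self.take()
        if token in PUNCTUATION:
            raise ValueError(f"expected a name, got {token!r}")
        return token

    def bracketed(self, opener):
        """Parse '{...}' or '[...]' starting at `opener`; return a dict or list."""
        self.take(opener)
        if opener == "{":
            value = self.dict_content()
            self.take("}")
        else:
            value = self.list_content()
            self.take("]")
        return value

    def dict_content(self):
        result = {}
        while self.peek() not in (None, "}"):
            key = self.name()
            if self.peek() == ":":
                self.take(":")
                # Assumption: a list may follow the colon ("Data:[]"); the
                # grammar itself only allows NAME ':' NAME.
                if self.peek() == "[":
                    result[key] = self.bracketed("[")
                else:
                    result[key] = self.name()
            elif self.peek() in ("{", "["):
                result[key] = self.bracketed(self.peek())
            else:
                raise ValueError(f"unexpected {self.peek()!r} after {key!r}")
            if self.peek() == ",":
                self.take(",")
        return result

    def list_content(self):
        items = []
        while self.peek() not in (None, "]"):
            if self.peek() in ("{", "["):
                items.append(self.bracketed(self.peek()))
            else:
                items.append(self.name())
            if self.peek() != ",":
                break
            self.take(",")
        return items


def parse(text):
    """Parse a whole document; the top level is dict_content."""
    parser = Parser(text)
    result = parser.dict_content()
    if parser.peek() is not None:
        raise ValueError(f"unexpected {parser.peek()!r}")
    return result


# Made-up input that exercises dicts, lists, and the colon-list extension.
data = parse("Group1 { Entry1 { Data1:0, Data2:[] } Entry2 { Data3 [a, b, {k:v}] } }")
```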
