I'm trying to extract specific hard-coded variables from C source code. My remaining problem is parsing array initialisation, for example:

#define SOMEVAR { {T_X, {1, 2}}, {T_Y, {3, 4}} }

It's enough to parse this example into "{T_X, {1, 2}}" and "{T_Y, {3, 4}}", since it's then possible to recurse to get the full structure. However, it needs to be general enough to parse any user-defined types.

Even better would be a list of regular expressions that can be used to extract values from general C code constructs like #define, enums and global variables.

The C code is provided to me, so I have no control over it. I'd rather not write a function that parses it a character at a time. However, it'd be OK to have a sequence of regular expressions.

This is not a problem of getting files into MATLAB or basic regular expressions. I'm after a specific regular expression that preserves groupings by brackets.

EDIT: It looks like regular expressions can't do recursion or arbitrarily deep matches, according to here and here.

+1  A: 

Have you looked at the following site? It provides extensive tutorials and examples on regular expressions:

http://www.regular-expressions.info/

cyberbobcat
A: 

Maybe vim's syntax file would help in this matter. I'm not sure whether it has those elements you seek (I don't do C), but it's got a whole lot of elements, so it's definitely a starting point. Download vim (www.vim.org), and in vim/syntax/c.vim look around a little.

ldigas
A: 

I don't think regexps will work on arbitrary C code. Clang allows you to build a syntax tree from C code and use it programmatically.

That could be readily used for globals, but #defines are handled by the preprocessor so I'm not sure how they would work.

cristi:tmp diciu$ cat test.c
#define t 1
int m=5;


int fun(char * y)
{
    float g;

    return t;
}

int main()
{
    int g=7;
    return t;
}


cristi:tmp diciu$ ~/Downloads/checker-137/clang -ast-dump test.c
(CompoundStmt 0xc01ec0 <test.c:6:1, line:10:1>
  (DeclStmt 0xc01e70 <line:7:2>
    0xc01e30 "float g"
  (ReturnStmt 0xc01eb0 <line:9:2, line:1:11>
        (IntegerLiteral 0xc01e90 <col:11> 'int' 1)))
(CompoundStmt 0xc020a0 <test.c:13:1, line:16:1>
  (DeclStmt 0xc02060 <line:14:2>
    0xc02010 "int g =
      (IntegerLiteral 0xc02040 <col:8> 'int' 7)"
  (ReturnStmt 0xc01b50 <line:15:2, line:1:11>
    (IntegerLiteral 0xc02080 <col:11> 'int' 1)))
typedef char *__builtin_va_list;
Read top-level variable decl: 'm'

int fun(char *y)


int main()
diciu
No external tools, sorry. But I still don't see how that helps me.
Nzbuu
A: 

I assume you have access to the C code in question. If so, then define two macros:

#define BEGIN_MATLAB_DATA
#define END_MATLAB_DATA

Wrap all the data you want to extract between these macros. When the C code is compiled, they expand to nothing, so they do no harm there.

Now you can use a very simple regexp to get the data.
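As a rough illustration of how simple that regexp can be, here is a minimal sketch in Python (not MATLAB, so treat it as a template to port). It assumes each marker sits on a line of its own, which also keeps the two #define lines from being mistaken for markers. The names MARKED_BLOCK and extract_marked are made up for this example.

```python
import re

# Assumption: each marker appears alone on its own line. Anchoring at the
# start of the line means "#define BEGIN_MATLAB_DATA" is not a match.
MARKED_BLOCK = re.compile(
    r"^[ \t]*BEGIN_MATLAB_DATA[ \t]*$(.*?)^[ \t]*END_MATLAB_DATA[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_marked(source):
    """Return the text found between each pair of marker lines."""
    return [body.strip() for body in MARKED_BLOCK.findall(source)]
```

This only isolates the data; splitting the nested braces inside each block is still a separate step.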

Aaron Digulla
A: 

EDIT: Now that the question has been updated, it appears that my previous answer missed the point. I don't know if you've already searched the other regular-expression-related questions on Stack Overflow. On the chance that you haven't, I came across two that may help give you guidance for your problem (which appears to be a problem, at least partially, of trying to match and keep track of opening and closing curly braces): this one and this one. Good luck!

gnovice
It's easy enough to write an expression that matches a specific case, but I'm after something general that preserves groupings while separating the list. Thanks anyway.
Nzbuu
Ah, I understand better now from your new edit of the question. The problem appears quite a bit more difficult than the example you gave. Unfortunately, no immediate solution springs to mind.
gnovice
A: 

This regular expression:

(\{\s*[A-Za-z_]+)\s*,\s*\{\s*\d+\s*,\s*\d+\s*\}\s*\}

seems reasonable, but I don't know if it's enough for you. It's littered with \s* to allow arbitrary whitespace between tokens, which C permits. It will match things that look more or less like your examples: some kind of identifier followed by exactly two digit strings.
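A quick way to try the pattern is sketched below in Python rather than MATLAB, on the assumption that both engines treat these particular constructs the same way. The helper name find_pairs and the deeper-nested test string are invented for illustration.

```python
import re

# The pattern from above, unchanged.
PATTERN = re.compile(r"(\{\s*[A-Za-z_]+)\s*,\s*\{\s*\d+\s*,\s*\d+\s*\}\s*\}")


def find_pairs(text):
    """Return every full match of the pattern in text."""
    # group(0) is the whole match; the capture group alone would
    # only hold the "{identifier" prefix.
    return [m.group(0) for m in PATTERN.finditer(text)]


matches = find_pairs("#define SOMEVAR { {T_X, {1, 2}}, {T_Y, {3, 4}} }")
deeper = find_pairs("{T_Z, {5, {6, 7}}}")  # one extra nesting level
```

Here matches holds the two elements from the question, while deeper comes back empty, which shows the limit: the pattern is tied to one fixed nesting shape.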

unwind
Do you mean this? \{\s*\w+\s*,\s*\{\s*\d+\s*,\s*\d+\s*\}\s*\}. That only matches this specific example. I'm looking for something more general.
Nzbuu
+1  A: 

The formal language that defines brace matching is not a regular language. Therefore, you cannot use a regular expression to solve your problem.

The problem is that you need some way to count the number of opening braces you have already encountered. Some regular expression engines support extended features, such as peeking, which could be used to solve your problem, but these can be tough to deal with. You might be better off writing a simple parser for this task.
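The counting idea can be sketched in a few lines. The following is a minimal illustration in Python (the same loop should port to MATLAB directly); the function name split_top_level is hypothetical. It assumes the input is a single brace-wrapped initialiser and that no braces or commas appear inside string literals, character literals or comments, none of which it tries to handle.

```python
def split_top_level(text):
    """Split a brace-wrapped initialiser into its top-level elements.

    Tracks brace depth and splits only on commas seen at depth zero,
    so nested groups are returned intact.
    """
    text = text.strip()
    if len(text) < 2 or not (text.startswith("{") and text.endswith("}")):
        raise ValueError("expected text wrapped in { ... }")
    inner = text[1:-1]  # drop the outermost pair of braces

    parts = []
    depth = 0   # number of currently open braces inside `inner`
    start = 0   # index where the current element begins
    for i, ch in enumerate(inner):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced closing brace")
        elif ch == "," and depth == 0:
            parts.append(inner[start:i].strip())
            start = i + 1
    if depth != 0:
        raise ValueError("unbalanced opening brace")

    last = inner[start:].strip()
    if last:  # ignore the empty tail left by a trailing comma
        parts.append(last)
    return parts
```

Calling it on "{ {T_X, {1, 2}}, {T_Y, {3, 4}} }" yields the two elements "{T_X, {1, 2}}" and "{T_Y, {3, 4}}", and calling it again on each element recurses into the structure as the question describes.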

Tim