views:

332

answers:

6

Hi,

I am trying to extract the namespaces defined in C++ files.
Basically, if my C++ file contains:

namespace n1 {
  ...
  namespace n2 { ... } // end namespace n2 
  ...
  namespace n3 { ...} //end namespace n3 
  ...
} //end namespace n1

I want to be able to retrieve: n1, n1::n2, n1::n3.

Does someone have any suggestion of how I could do that using python-regex?

Thanks.

+6  A: 

Searching for the namespace names is pretty easy with a regular expression. However, to determine the nesting level you will have to keep track of the curly bracket nesting level in the source file. This is a parsing problem, one that cannot be solved (sanely) with regular expressions. Also, you may have to deal with any C preprocessor directives in the file which can definitely affect parsing.

C++ is a notoriously tricky language to parse completely, but you may be able to get by with a tokeniser and a curly bracket counter.

Greg Hewgill
That confirms what I was worried about. I believe I will follow your suggestion: using regex to extract namespace names and another way to handle the curly bracket level. At this stage I will ignore the C preprocessor directives. Thanks for the answer.
Note that you can run your source through cpp (the preprocessor) and deal with just C++ code as the compiler will see it, but that means you will also see namespace declarations #included through header files (eg. std).
Greg Hewgill
+1  A: 

You cannot completely ignore preprocessor directives, as they may introduce additional namespaces. I have seen a lot of code like:

#define __NAMESPACE_SYSTEM__ namespace system

__NAMESPACE_SYSTEM__ {
   // actual code here...
}

Yet, I don't see any reason for using such directives, other than defeating regular expression parsing strategy...

SadSido
+1  A: 

Most of the time when someone asks how to do something with regex, they're doing something very wrong. I don't think this case is different.

If you want to parse c++, you need to use a c++ parser. There are many things that can be done that will defeat a regex but still be valid c++.

Dustin
+1  A: 

You could write a basic lexer for it. It's not that hard.

Geo
+2  A: 

The need is simple enough that you may not need a complex parser. You need to:

  • extract the namespace names
  • count the open/close braces to keep track of where your namespace is defined.

This simple approach works if the other conditions are met:

  • you don't get spurious namespace like strings inside comments or inside strings
  • you don't get unmatched open/closeing braces inside comments or strings

I don't think this is too much asking from your source.

Bluebird75
A: 

That is what I did earlier today:

  • Extract the comment out of the C++ files
  • Use regex to extract the namespace definition
  • Use a simple string search to get the open & close braces positions

The various sanity check added show that I am successfully processing 99.925% of my files (5 failures ouf of 6678 files). The issues are due to mismatches in numbers of { and } cause by few '{' or '}' in strings, and unclean usage of the preprocessor instruction.

However, I am only dealing with header files, and I own the code. That limits the number of scenari that could cause some issues and I can manually modify the ones I don't cover.

Of course I know there are plenty of cases where it would fail but it is probably enough for what I want to achieve.

Thanks for your answers.