What regex will find triple-quoted comments (possibly multi-line) in Python source code?

+4  A: 
re.findall('(?:\n[\t ]*)\"{3}(.*?)\"{3}', s, re.M | re.S)

This captures only the text within triple double quotes that open at the beginning of a line, optionally preceded by spaces or tabs, as Python docstrings should be.
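To make the behaviour concrete, here is a minimal sketch that runs the pattern above on a small sample. The `source` text is a hypothetical input made up for illustration; only the pattern comes from the answer.

```python
import re

# Hypothetical sample input. The leading empty line matters because the
# pattern requires a "\n" before the opening quotes.
source = "\n".join([
    "",
    "def f():",
    '    """Docstring of f.',
    "",
    "    Second line.",
    '    """',
    "    return 1",
    "",
    "a = '\"\"\" not a real triple quote \"\"\"'",
    "",
])

# The pattern from the answer, unchanged.
pattern = '(?:\n[\t ]*)\"{3}(.*?)\"{3}'
matches = re.findall(pattern, source, re.M | re.S)
print(matches)
```

In this sample the assignment to `a` is skipped only because its quotes do not open the line; a triple quote that starts a line inside some other string literal would still be matched.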

SilentGhost
what about single quotes?
Triptych
and what about this: `a = '""" not a real triple quote """'`
Triptych
why is it not a real triple quote? is there something lost in formatting?
SilentGhost
I suppose a quite similar regex could be used to get single quotes as well (the given example is easy to extend); I just see no point in stuffing a single regex to the point of unintelligibility.
SilentGhost
because it's inside single quotes... so it's part of a string literal.
fortran
Also: `"""foo\"""bar"""`.
bobince
is that a raw string, bobince?
SilentGhost
of course it is. I just typed it into the Python prompt
nosklo
He didn't say docstrings, either.
Glenn Maynard
@Glenn: he didn't. did you downvote bobince's answer too?
SilentGhost
+9  A: 

Python is not a regular language and cannot reliably be parsed using regex.

If you want a proper Python parser, look at the ast module. You may be looking for get_docstring.
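A minimal sketch of that approach, assuming docstrings are what you are after. The `source` text and the `docstrings` dictionary are hypothetical names used for illustration; `ast.parse`, `ast.walk` and `ast.get_docstring` are the standard library calls.

```python
import ast

# Hypothetical sample input, built line by line to keep the quoting readable.
source = "\n".join([
    '"""Module docstring."""',
    "def f():",
    '    """Docstring of f."""',
    "    return 1",
    "class C:",
    "    '''Docstring of C, in single quotes.'''",
    "a = '\"\"\" not a real triple quote \"\"\"'",
    "",
])

tree = ast.parse(source)

# Collect the module docstring, then the docstring of every function and class.
docstrings = {"<module>": ast.get_docstring(tree)}
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        docstrings[node.name] = ast.get_docstring(node)

print(docstrings)
```

Note that this finds docstrings only, not every triple-quoted string, and the string assigned to `a` is correctly ignored.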

bobince
+1: Question has no valid solution using regexes, only half-working hacks.
nosklo
I believe regular expressions are powerful enough to do this right. But constructing a proper regexp for such a task is hard, so using the built-in Python parser is the more reliable solution.
Denis Otkidach
Do you have a link for that? 'Cannot be reliably parsed using regex'. Which languages can?
kaizer.se
Barely-readable summary of theory: http://en.wikipedia.org/wiki/Regular_language. Most programming languages aren't, but then modern regex has extensions that take it well beyond traditional regular language matching. Python's syntax, however, is still too complex to be amenable to regex.
bobince
Also see http://stackoverflow.com/questions/612654/is-regex-in-modern-programming-languages-really-context-sensitive-grammar
bobince
A: 

I've found this one from Tim Peters (I think) :

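# q is a placeholder for the quote character; it is substituted below.
# In this non-raw string, four backslashes yield the regex escape for
# one literal backslash.
pat = """
    qqq
    [^\\\\q]*
    (
        (   \\\\[\000-\377]
        |   q
            (   \\\\[\000-\377]
            |   [^\\\\q]
            |   q
                (   \\\\[\000-\377]
                |   [^\\\\q]
                )
            )
        )
        [^\\\\q]*
    )*
    qqq
"""
# Strip all whitespace; str.join takes a single iterable argument.
pat = ''.join(pat.split())
tripleQuotePat = pat.replace("q", "'") + "|" + pat.replace('q', '"')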

But, as stated by bobince, regex alone doesn't seem to be the right tool for parsing Python code.
So I went with tokenize from the standard library.
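A rough sketch of what that can look like. The helper name `triple_quoted_strings` and the sample `source` are hypothetical; `tokenize.generate_tokens` and `tokenize.STRING` are from the standard library. One caveat: on Python 3.12 and later, f-strings are split into separate f-string tokens rather than a single `STRING` token, so this sketch does not report them there.

```python
import io
import tokenize


def triple_quoted_strings(source):
    """Return the source text of every triple-quoted string token."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        # Skip any string prefix (r, b, u, f) before checking the quotes.
        body = tok.string.lstrip("rRbBuUfF")
        if body.startswith(('"""', "'''")):
            found.append(tok.string)
    return found


# Hypothetical sample input, including two tricky cases from the comments.
source = "\n".join([
    "def f():",
    '    """Docstring of f."""',
    "    a = '\"\"\" not a real triple quote \"\"\"'",
    '    return """foo\\"""bar"""',
    "",
])

print(triple_quoted_strings(source))
```

Because the tokenizer already knows where each string literal starts and ends, the single-quoted string containing `"""` is skipped and the escaped quote in `"""foo\"""bar"""` does not end the token early.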

dugres
And finally, I use the lexer from **pygments** ( http://pygments.org/ )
dugres