What regex will find the triple-quote comments (possibly multi-line) in Python source code?
+4
A:
re.findall('(?:\n[\t ]*)\"{3}(.*?)\"{3}', s, re.M | re.S)
This captures only the text inside triple quotes that begin a line, optionally preceded by spaces or tabs, as Python docstrings should.
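A minimal sketch of how this pattern behaves, assuming a small made-up `sample` string (not from the original post). It shows that only triple-quoted text opening a line is captured, while a triple-quoted string later on a line is skipped:

```python
import re

# Hypothetical sample source, used only for illustration.
sample = '''
def f():
    """Docstring of f."""
    return 1

def g():
    """
    Multi-line
    docstring.
    """
    x = """not at line start"""
'''

# Same pattern as in the answer, written as a raw string.
# re.S lets "." match newlines so multi-line docstrings are captured.
pattern = re.compile(r'(?:\n[\t ]*)"{3}(.*?)"{3}', re.S)
found = pattern.findall(sample)

for text in found:
    print(repr(text))
```

Note that the pattern requires a preceding newline, so a docstring on the very first line of the input is missed, and `'''` strings are not handled at all.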
SilentGhost
2009-09-24 14:29:57
what about single quotes?
Triptych
2009-09-24 14:42:13
and what about this: `a = '""" not a real triple quote """'`
Triptych
2009-09-24 14:45:04
why is it not a real triple quote? is there something lost in formatting?
SilentGhost
2009-09-24 14:46:14
I suppose a quite similar regex could be used to get single quotes as well (it's quite easy to extend the given example); I just see no point in stuffing a single regex to the point of unintelligibility.
SilentGhost
2009-09-24 14:48:49
because it's inside a simple quote... so it's part of a string literal.
fortran
2009-09-24 14:53:09
Also: `"""foo\"""bar"""`.
bobince
2009-09-24 15:19:29
is that a raw string, bobince?
SilentGhost
2009-09-24 15:21:55
of course it is. I just typed it into python prompt
nosklo
2009-09-24 16:29:27
He didn't say docstrings, either.
Glenn Maynard
2009-09-24 20:25:13
@Glenn: he didn't. did you downvote bobince's answer too?
SilentGhost
2009-09-24 20:34:07
+9
A:
Python is not a regular language and cannot reliably be parsed using regex.
If you want a proper Python parser, look at the ast module. You may be looking for get_docstring.
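As a rough sketch of the ast approach (the helper name `collect_docstrings` and the sample `source` are made up for illustration), something like the following collects docstrings from the module, its functions and its classes, and ignores triple quotes that merely appear inside another string literal:

```python
import ast

# Hypothetical sample source, built line by line to avoid quoting clashes.
source = "\n".join([
    '"""Module docstring."""',
    "def f():",
    '    """Docstring of f."""',
    "    s = '\"\"\" not a real triple quote \"\"\"'",
    "    return s",
    "class C:",
    "    '''Docstring of C.'''",
])

def collect_docstrings(code):
    """Return (name, docstring) pairs for the module, functions and classes."""
    tree = ast.parse(code)
    results = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.FunctionDef,
                             ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)
            if doc is not None:
                # Module nodes have no "name" attribute.
                results.append((getattr(node, "name", "<module>"), doc))
    return results

docs = collect_docstrings(source)
for name, doc in docs:
    print(name, "->", doc)
```

This only finds docstrings, not every triple-quoted string, and the source must parse without syntax errors.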
bobince
2009-09-24 15:20:41
+1: Question has no valid solution using regexes, only half-working hacks.
nosklo
2009-09-24 16:30:52
I believe regular expressions are powerful enough to do this right. But constructing a proper regexp for such a task is hard, so using the built-in Python parser is a more reliable solution.
Denis Otkidach
2009-09-25 08:16:06
Do you have a link for that? 'Cannot be reliably parsed using regex'. Which languages can?
kaizer.se
2009-09-25 09:15:37
Barely-readable summary of theory: http://en.wikipedia.org/wiki/Regular_language. Most programming languages aren't, but then modern regex has extensions that take it well beyond traditional regular language matching. Python's syntax, however, is still too complex to be amenable to regex.
bobince
2009-09-25 11:15:44
Also see http://stackoverflow.com/questions/612654/is-regex-in-modern-programming-languages-really-context-sensitive-grammar
bobince
2009-09-25 11:18:09
A:
I've found this one from Tim Peters (I think):
pat = """
qqq
[^\\q]*
(
( \\\\[\000-\377]
| q
( \\\\[\000-\377]
| [^\\q]
| q
( \\\\[\000-\377]
| [^\\q]
)
)
)
[^\\q]*
)*
qqq
"""
pat = ''.join(pat.split())
tripleQuotePat = pat.replace("q", "'") + "|" + pat.replace('q', '"')
But, as stated by bobince, regex alone doesn't seem to be the right tool for parsing Python code.
So I went with tokenize from the standard library.
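A minimal sketch of what the tokenize route might look like (the function name `triple_quoted_strings` and the `sample` input are assumptions for illustration, not the code I actually used):

```python
import io
import tokenize

def triple_quoted_strings(code):
    """Yield the source text of each triple-quoted STRING token in code."""
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.STRING:
            # Drop an optional prefix such as r, b or u before checking quotes.
            # On Python 3.12+ f-strings are not STRING tokens, so they are
            # not reported by this sketch.
            body = tok.string.lstrip("rRbBuUfF")
            if body.startswith(('"""', "'''")):
                yield tok.string

# Hypothetical sample source, built line by line to avoid quoting clashes.
sample = "\n".join([
    '"""Module docstring."""',
    "a = '\"\"\" not a real triple quote \"\"\"'",
    "b = r'''raw",
    "triple'''",
    "",
])

result = list(triple_quoted_strings(sample))
for text in result:
    print(repr(text))
```

Because the tokenizer already knows where each string literal starts and ends, the triple quotes inside the single-quoted string assigned to `a` are not reported.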
dugres
2009-09-29 09:36:13