Is there a simple regular expression to match all unicode quotes? Or does one have to hand-code it like this:
quotes = ur"[\"'\u2018\u2019\u201c\u201d]"
Thank you for reading.
Brian
Is there a simple regular expression to match all unicode quotes? Or does one have to hand-code it like this:
quotes = ur"[\"'\u2018\u2019\u201c\u201d]"
Thank you for reading.
Brian
Python doesn't support Unicode properties, therefore you can't use the Pi
and Pf
properties, so I guess your solution is as good as it gets.
You might also want to consider the "false quotation marks" that are sadly being used - the acute and grave accent (´
and `
): \u0060
and \u00B4
.
Then there are guillemets (« » ‹ ›
), do you want those, too? Use \u00BB\u203A\u00AB\u2039
for those.
Also, your command has a little bug: you're adding the backslash to the quotes
string (because you're using a raw string). Use a triple-quoted string instead.
>>> quotes = ur"[\"'\u2018\u2019\u201c\u201d\u0060\u00b4]"
>>> "\\" in quotes
True
>>> quotes
u'[\\"\'\u2018\u2019\u201c\u201d`\xb4]'
>>> quotes = ur"""["'\u2018\u2019\u201c\u201d\u0060\u00b4]"""
>>> "\\" in quotes
False
>>> quotes
u'["\'\u2018\u2019\u201c\u201d`\xb4]'
Quotation marks will often have the Unicode category Pi
(punctuation, initial quote) or Pf
(Punctuation, final quote). You'll have to handle the "neutral" quotation marks '
and "
manually.