




This should be easy, and this regex works fine for finding words that begin with specific characters, but I can't get it to match hashes and question marks.

This works and matches words beginning with a:

r = re.compile(r"\b([a])(\w+)\b")

But these don't match. I tried:

r = re.compile(r"\b([#?])(\w+)\b")
r = re.compile(r"\b([\#\?])(\w+)\b")
r = re.compile( r"([#\?][\w]+)?")

I even tried just matching hashes:

r = re.compile( r"([#][\w]+)?")
r = re.compile( r"([/#][\w]+)?")

text = "this is one #tag and this is ?another tag"
items = r.findall(text)

expecting to get:

[('#', 'tag'), ('?', 'another')]
+2  A: 

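\b matches the empty position between a \w and a \W (or between a \W and a \w), but there is no \b before a # or ? that follows whitespace or starts the string, because the characters on both sides are then non-word characters.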

In other words: remove the first word boundary.


Instead of:

r = re.compile(r"\b([#?])(\w+)\b")

use:

r = re.compile(r"([#?])(\w+)\b")
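
To see the difference on the text from the question, here is a minimal sketch that runs both patterns side by side. The variable names `with_boundary` and `without_boundary` are only illustrative and do not come from the question.

```python
import re

text = "this is one #tag and this is ?another tag"

# Original pattern: the leading \b can never be satisfied between a space and '#' or '?'.
with_boundary = re.compile(r"\b([#?])(\w+)\b")
# Same pattern with the leading \b removed.
without_boundary = re.compile(r"([#?])(\w+)\b")

print(with_boundary.findall(text))     # []
print(without_boundary.findall(text))  # [('#', 'tag'), ('?', 'another')]
```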
Bart Kiers
Bart K. That works - and also found I had additional bug in my code that added further to my confusion! Many thanks.
No problem phoebebright.
Bart Kiers
The RE you gave will match `something like#this`.
Yes, that is correct. If those strings can occur, the OP can ask about it. Note that I can list a few more corner cases: what about strings that contain a hash followed by `\w` characters, like this: `"this #is a ?string"`? It all depends on what is going to be parsed.
Bart Kiers
@Bart: `this #is a ?string` should get `is` and `string`, what's your point?
No I meant something different. My point is that we don't know exactly what the OP is trying to do based on the single example s/he posted. What if s/he's parsing something that is source code (or looks like such). Looking for a hash if that hash is inside a string literal or a comment will cause both our suggestions to fail. And what about when punctuation marks come into play like `this is one,#tag and`? Then your suggestion fails. In other words: it is for the OP to comment on my (or your) suggestion if something does not work properly. S/He knows the exact requirements after all.
Bart Kiers
@Bart: Notice that there is a leading `\b` in the question, so that's probably what the OP meant. I just pointed this out in case the OP didn't realize, shouldn't I?
Ah yes, I see your point.
Bart Kiers
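
For reference, the corner cases raised in these comments can be checked directly against the pattern without the leading `\b`. This is only a sketch using the sample strings quoted above; whether each result is desirable depends on the OP's requirements.

```python
import re

pattern = re.compile(r"([#?])(\w+)\b")

# A hash in the middle of a word is matched as well.
mid_word = pattern.findall("something like#this")
# Markers that follow whitespace are matched as expected.
after_space = pattern.findall("this #is a ?string")
# A marker directly after punctuation is also matched.
after_comma = pattern.findall("this is one,#tag and")

print(mid_word)     # [('#', 'this')]
print(after_space)  # [('#', 'is'), ('?', 'string')]
print(after_comma)  # [('#', 'tag')]
```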
+1  A: 

The first \b won't match before # or ?, use (?:^|\s) instead.

Also, the \b at the end is unnecessary, because \w+ is a greedy match.

import re

r = re.compile(r"(?:^|\s)([#?])(\w+)")

text = "#head this is one #tag and this is ?another tag, but not this?one"
print(r.findall(text))
# Output: [('#', 'head'), ('#', 'tag'), ('?', 'another')]
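
One trade-off of `(?:^|\s)` is that a marker directly after punctuation, as in `one,#tag`, is skipped. If that case matters, a negative lookbehind `(?<!\w)` might be an option. Note that this variant is an assumption added here for comparison, not something proposed in the thread, and the names `anchored` and `lookbehind` are illustrative.

```python
import re

# Pattern from the answer: marker must be at the start or follow whitespace.
anchored = re.compile(r"(?:^|\s)([#?])(\w+)")
# Assumed alternative: marker must simply not follow a word character.
lookbehind = re.compile(r"(?<!\w)([#?])(\w+)")

samples = ["something like#this", "this is one,#tag and", "#head and ?tail"]
results = {s: (anchored.findall(s), lookbehind.findall(s)) for s in samples}

for sample, (a, b) in results.items():
    print(sample, a, b)
# something like#this [] []
# this is one,#tag and [] [('#', 'tag')]
# #head and ?tail [('#', 'head'), ('?', 'tail')] [('#', 'head'), ('?', 'tail')]
```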
Like learning to touch type, making the effort to become a regex expert would be well worth it in the long run! Thanks for contributing.

You are using Python, so regex should be the last thing that comes to mind:

>>> text = "this is one #tag and this is ?another tag"
>>> for word in text.split():
...     if word.startswith(("#", "?")):
...         print(word)
...
#tag
?another
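
If the tuple shape from the question is wanted, the same split-based idea can be adapted. This is a rough sketch rather than a drop-in replacement for the regex: unlike `\w+`, it keeps any trailing punctuation attached to the word (for example `#tag,` would give `'tag,'`).

```python
text = "this is one #tag and this is ?another tag"

# Split on whitespace, keep words that start with a marker and have something after it.
items = [
    (word[0], word[1:])
    for word in text.split()
    if word.startswith(("#", "?")) and len(word) > 1
]

print(items)  # [('#', 'tag'), ('?', 'another')]
```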