This should be easy. This regex works fine for finding words that begin with specific characters, but I can't get it to match hashes and question marks.

This works and matches words beginning with a:

r = re.compile(r"\b([a])(\w+)\b")

But these don't match. I tried:

r = re.compile(r"\b([#?])(\w+)\b")
r = re.compile(r"\b([\#\?])(\w+)\b")
r = re.compile( r"([#\?][\w]+)?")

I even tried matching just hashes:

r = re.compile(r"([#][\w]+)?")
r = re.compile(r"([/#][\w]+)?")

text = "this is one #tag and this is ?another tag"
items = r.findall(text)

expecting to get:

[('#', 'tag'), ('?', 'another')]
+2  A: 

\b matches the empty position between a \w and a \W (or between a \W and a \w). In your text the # and ? are preceded by a space, and a space, # and ? are all \W, so there is no \b before the # or ?.

In other words: remove the first word boundary.

Not:

r = re.compile(r"\b([#?])(\w+)\b")

but

r = re.compile(r"([#?])(\w+)\b")
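To make the difference concrete, here is a minimal sketch that runs both patterns against the sample text from the question. The variable names `with_boundary` and `without_boundary` are only for illustration and do not come from the original post.

```python
import re

text = "this is one #tag and this is ?another tag"

# Leading \b: the position between " " and "#" has a non-word character
# on both sides, so \b cannot match there and nothing is found.
with_boundary = re.compile(r"\b([#?])(\w+)\b")
print(with_boundary.findall(text))     # []

# No leading \b: the marker character is matched directly.
without_boundary = re.compile(r"([#?])(\w+)\b")
print(without_boundary.findall(text))  # [('#', 'tag'), ('?', 'another')]
```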
Bart Kiers
Bart K. That works, and I also found an additional bug in my code that added to my confusion! Many thanks.
No problem phoebebright.
Bart Kiers
The RE you gave will match `something like#this`.
Iamamac
Yes, that is correct. If those strings can occur, the OP can ask about it. Note that I can list a few more corner cases: what about strings that have a hash in them followed by `\w`'s, like this: `"this #is a ?string"`. It all depends on what is going to be parsed.
Bart Kiers
@Bart: `this #is a ?string` should get `is` and `string`, what's your point?
Iamamac
No I meant something different. My point is that we don't know exactly what the OP is trying to do based on the single example s/he posted. What if s/he's parsing something that is source code (or looks like such). Looking for a hash if that hash is inside a string literal or a comment will cause both our suggestions to fail. And what about when punctuation marks come into play like `this is one,#tag and`? Then your suggestion fails. In other words: it is for the OP to comment on my (or your) suggestion if something does not work properly. S/He knows the exact requirements after all.
Bart Kiers
@Bart: Notice that there is a leading `\b` in the question, so that's probably what the OP meant. I just pointed this out in case the OP didn't realize, shouldn't I?
Iamamac
Ah yes, I see your point.
Bart Kiers
+1  A: 

The first \b won't match before # or ?; use (?:^|\s) instead.

Also, the \b at the end is unnecessary, because \w+ is a greedy match.

r = re.compile(r"(?:^|\s)([#?])(\w+)")

text = "#head this is one #tag and this is ?another tag, but not this?one"
print(r.findall(text))
# Output: [('#', 'head'), ('#', 'tag'), ('?', 'another')]
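For the punctuation case raised in the comments (`this is one,#tag and`), one possible alternative is a negative lookbehind, `(?<!\w)`, which only requires that the marker is not preceded by a word character. This is not from the thread, just a sketch under the assumption that a tag directly after punctuation should count; `whitespace_re` and `lookbehind_re` are illustrative names.

```python
import re

# Pattern from the answer: marker must follow start-of-string or whitespace.
whitespace_re = re.compile(r"(?:^|\s)([#?])(\w+)")

# Assumed alternative: marker must not follow a word character.
lookbehind_re = re.compile(r"(?<!\w)([#?])(\w+)")

samples = [
    "something like#this",   # marker inside a word
    "this is one,#tag and",  # marker right after punctuation
]

for sample in samples:
    print(sample)
    print("  whitespace:", whitespace_re.findall(sample))
    print("  lookbehind:", lookbehind_re.findall(sample))
```

Both patterns skip `like#this`; only the lookbehind version picks up `#tag` after the comma. Which behaviour is right depends on what is being parsed.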
Iamamac
Like learning to touch-type, making the effort to become a regex expert would be well worth it in the long run! Thanks for contributing.
A: 

You are using Python, so regex should be the last thing that comes to mind.

>>> text = "this is one #tag and this is ?another tag"
>>> for word in text.split():
...   if word.startswith("#") or word.startswith("?"):
...     print(word)
...
#tag
?another
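If the (marker, word) tuples from the question are wanted, the same idea can be sketched as a list comprehension. `str.startswith` accepts a tuple of prefixes; the name `items` is borrowed from the question, and the length check is an added assumption to skip a bare `#` or `?`.

```python
text = "this is one #tag and this is ?another tag"

# Split on whitespace, keep words that start with a marker,
# and separate the marker from the rest of the word.
items = [
    (word[0], word[1:])
    for word in text.split()
    if word.startswith(("#", "?")) and len(word) > 1
]
print(items)  # [('#', 'tag'), ('?', 'another')]
```

Unlike `\w+`, this keeps any trailing punctuation attached to the word, so `#tag,` would come back as `('#', 'tag,')`.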
ghostdog74