I am having a heck of a time taking the text of a tweet that includes hashtags and pulling each hashtag into an array using Python. I am embarrassed to even post what I have tried so far.

For example, "I love #stackoverflow because #people are very #helpful!"

This should pull the 3 hashtags into an array.

A: 
hashtags = [word for word in tweet.split() if word[0] == "#"]
smehmood
You mean `==`, not `=`. (Also, `word.startswith("#")` is preferred to `word[0] == "#"`.)
Mike Graham
+7  A: 

A simple regex should do the job:

>>> import re
>>> s = "I love #stackoverflow because #people are very #helpful!"
>>> re.findall(r"#(\w+)", s)
['stackoverflow', 'people', 'helpful']
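One possible refinement, sketched here rather than taken from the answer: the pattern above matches `#` anywhere, including a fragment inside a URL. Assuming a hashtag must begin a whitespace-separated word, a negative lookbehind narrows the match. The names `HASHTAG` and `find_hashtags` are made up for illustration.

```python
import re

# Assumption: a hashtag starts at the beginning of the text or right after whitespace.
HASHTAG = re.compile(r"(?<!\S)#(\w+)")


def find_hashtags(text):
    """Return hashtag names without the leading '#'."""
    return HASHTAG.findall(text)
```

The trade-off is that a tag written directly after punctuation, such as `(#tag)`, is skipped as well.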
AndiDog
Thank you sooooo much!
Scott
Elegant, simple, complete and well-formulated.
Paul Lammertsma
+4  A: 
>>> s = "I love #stackoverflow because #people are very #helpful!"
>>> [i for i in s.split() if i.startswith("#")]
['#stackoverflow', '#people', '#helpful!']
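Note that the split-based approach keeps trailing punctuation (`'#helpful!'`). A minimal sketch of one way to trim it, assuming trailing ASCII punctuation is never part of a tag; the function name `hashtags_from_split` is hypothetical:

```python
import string


def hashtags_from_split(tweet):
    """Collect '#'-prefixed words, trimming trailing ASCII punctuation."""
    tags = []
    for word in tweet.split():
        if word.startswith("#"):
            tag = word.rstrip(string.punctuation)
            # Skip words that were only '#' or punctuation.
            if len(tag) > 1:
                tags.append(tag)
    return tags
```

Because `string.punctuation` includes `_`, a tag that ends in an underscore loses it under this sketch.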
ghostdog74
+1  A: 

AndiDog's answer will break on links and similar content, so you may want to filter those out first. After that, use this code:

import re

# Python 2 syntax (ur'' literals); on Python 3, drop the u and use r'' instead.
UTF_CHARS = ur'a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff'
TAG_EXP = ur'(^|[^0-9A-Z&/]+)(#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*)' % UTF_CHARS
TAG_REGEX = re.compile(TAG_EXP, re.UNICODE | re.IGNORECASE)

It may seem like overkill, but this was converted from twitter-text-java (http://github.com/mzsanford/twitter-text-java). It should handle roughly 99% of hashtags the same way Twitter does.
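The answer gives the compiled pattern but not how to apply it, so here is a rough Python 3 sketch. It assumes the tag text is the third capture group, and it strips links first with a deliberately crude `URL_REGEX`. That pattern and the `extract_hashtags` name are illustrative assumptions, not part of twitter-text.

```python
import re

# Python 3 port of the patterns from the answer (plain r'' instead of ur'').
UTF_CHARS = r'a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff'
TAG_EXP = r'(^|[^0-9A-Z&/]+)(#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*)' % UTF_CHARS
TAG_REGEX = re.compile(TAG_EXP, re.UNICODE | re.IGNORECASE)

# Hypothetical, crude URL pattern used only to filter links out first.
URL_REGEX = re.compile(r'https?://\S+')


def extract_hashtags(tweet):
    """Return hashtag names (without '#') found outside of URLs."""
    text = URL_REGEX.sub(' ', tweet)
    # Group 1 is the preceding text, group 2 the '#', group 3 the tag itself.
    return [m.group(3) for m in TAG_REGEX.finditer(text)]
```

`finditer` is used instead of `findall` because the pattern has three groups, and `findall` would return a tuple for every match. A purely numeric tag such as `#123` is not matched, since the pattern requires at least one letter or underscore.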

For more converted Twitter regexes, see: http://github.com/BonsaiDen/Atarashii/blob/master/atarashii/usr/share/pyshared/atarashii/formatter.py

EDIT:
Check out: http://github.com/BonsaiDen/AtarashiiFormat

Ivo Wetzel