Hi, I am trying to parse Twitter RSS feeds and put the information in a sqlite database, using Python. Here's an example:

u'MiamiPete: today\'s "Last Call" is now up http://bit.ly/MGDzu #stocks #stockmarket #finance #money'

What I want to do is create one column for the main content ("MiamiPete: ... now up"), one column for the URL ("http://bit.ly/MGDzu"), and four separate columns for the hashtags (stocks, stockmarket, finance, money). I've been playing around with how to do this.

Any advice would be greatly appreciated!

Thanks,

Greg

P.S. Some code I've been playing around with is below--you can see I tried initially creating a variable called "tiny_url" and splitting it, which it does seem to do, but this feeble attempt is not anywhere close to solving the problem noted above. :)

def store_feed_items(id, items):
    """ Takes a feed_id and a list of items and stores them in the DB """
    for entry in items:
        c.execute('SELECT entry_id FROM RSSEntries WHERE url=?', (entry.link,))
        tinyurl = entry.summary    ### I added this in
        print tinyurl.split('http') ### I added this in
        if len(c.fetchall()) == 0:
            c.execute('INSERT INTO RSSEntries (id, url, title, content, tinyurl, date) VALUES (?,?,?,?,?,?)',
                      (id, entry.link, entry.title, entry.summary, tinyurl,
                       strftime("%Y-%m-%d %H:%M:%S", entry.updated_parsed)))
A: 

Twitter has an API that may be easier for you to use here: http://apiwiki.twitter.com/Twitter-API-Documentation.

You can get the results as JSON or XML and use one of the many Python libraries to parse the results.

Or if you must use the RSS, there are Python feed parsers like http://www.feedparser.org/.

Jason Christa
I appreciate the response, Jason. I am a newbie at APIs as well, so I'm hoping to learn more about them.
Gregory Saxton
+3  A: 

It seems like your data-driven design is rather flawed. Unless all your entries have a text part, a URL, and up to 4 tags, it's not going to work.

You also need to separate saving to the DB from parsing. Parsing can easily be done with a regexp (or even string methods):

>>> s = your_string
>>> s.split()
['MiamiPete:', "today's", '"Last', 'Call"', 'is', 'now', 'up', 'http://bit.ly/MGDzu', '#stocks', '#stockmarket', '#finance', '#money']
>>> url = [i for i in s.split() if i.startswith('http://')]
>>> url
['http://bit.ly/MGDzu']
>>> tags = [i for i in s.split() if i.startswith('#')]
>>> tags
['#stocks', '#stockmarket', '#finance', '#money']
>>> ' '.join(i for i in s.split() if i not in url+tags)
'MiamiPete: today\'s "Last Call" is now up'

Single-table db design would probably have to go, though.
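To make the "single-table design would have to go" point concrete, here is a minimal sketch of a normalized layout using the standard-library `sqlite3` module: a parent table of entries and a child table of tags, one row per hashtag. Table and column names are illustrative, not prescriptive, and the parsing reuses the string methods shown above.

```python
import sqlite3

# In-memory DB for illustration; a real feed store would use a file.
conn = sqlite3.connect(":memory:")
c = conn.cursor()

# One row per tweet, plus a child table with one row per hashtag.
c.execute("""CREATE TABLE entries (
                 entry_id INTEGER PRIMARY KEY,
                 content  TEXT,
                 url      TEXT)""")
c.execute("""CREATE TABLE tags (
                 entry_id INTEGER REFERENCES entries(entry_id),
                 tag      TEXT)""")

s = 'MiamiPete: today\'s "Last Call" is now up http://bit.ly/MGDzu #stocks #stockmarket'
words = s.split()
url = next((w for w in words if w.startswith("http://")), "")
tags = [w for w in words if w.startswith("#")]
content = " ".join(w for w in words if w != url and w not in tags)

c.execute("INSERT INTO entries (content, url) VALUES (?, ?)", (content, url))
entry_id = c.lastrowid  # id of the row we just inserted
c.executemany("INSERT INTO tags (entry_id, tag) VALUES (?, ?)",
              [(entry_id, t) for t in tags])
conn.commit()
```

A tweet with three hashtags then produces one `entries` row and three `tags` rows, and `SELECT tag FROM tags WHERE entry_id=?` recovers them, which is exactly what the single fixed-column table couldn't express.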

SilentGhost
Thanks, SilentGhost! I really appreciate the detailed response. As you can tell, I'm just starting to learn python. I have a couple of follow-up questions, if you don't mind (see additional post below).
Gregory Saxton
OK, I've read up on the 1st and 2nd normal forms--great stuff! I think the key lesson for this particular project is that I would want a table with columns for "entry_id" (which is the same rss entry_id as in my core data table) and a second column for "hashtags." This way, there would be multiple rows for each entry_id, if there is more than one hashtag. Does that seem right?
Gregory Saxton
That sounds about right. Don't forget about your URLs; they're no different from hashtags, so they require their own table.
SilentGhost
Regarding url, is that true even if there is at most 1 url in each rss entry?
Gregory Saxton
Well, if you can rely on such an assumption and are ready to deal with the consequences, sure; an empty string or any other default value will do.
SilentGhost
OK, point taken. So, having read the material on normalization and deciding to have separate tables for urls and tags, I'm stuck on the next stage of implementation. Specifically, do you have any suggestions on what the code would look like for putting all of the tags in a separate table? I can see that they are all in the "tags" variable you created, but don't have any sample code on the next stage. If you have any links or other material, I'd greatly appreciate it.
Gregory Saxton
Hm, it seems like the code you posted already has all you need. You might want to have a closer look at http://docs.python.org/library/sqlite3.html, but you have a working example already. Just a piece of advice: when you ask a question, try to isolate the problem and be more specific.
SilentGhost
+1  A: 

Also, you can parse your strings using regexps:

>>> s = (u'MiamiPete: today\'s "Last Call" is now up http://bit.ly/MGDzu '
         '#stocks #stockmarket #finance #money')
>>> re.match(r'(.*) (http://[^ ]+)', s).groups()
(u'MiamiPete: today\'s "Last Call" is now up', u'http://bit.ly/MGDzu')
>>> re.findall(r'(#\w+)', s)
[u'#stocks', u'#stockmarket', u'#finance', u'#money']
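The two patterns above can be wrapped into one small helper (the function name is mine, not from the thread) that returns all three pieces at once:

```python
import re

def parse_tweet(s):
    """Split a tweet summary into (content, url, tags) using the two regexes above."""
    m = re.match(r'(.*) (http://[^ ]+)', s)
    # If no URL is present, keep the whole string as content.
    content, url = m.groups() if m else (s, '')
    tags = re.findall(r'(#\w+)', s)
    return content, url, tags

content, url, tags = parse_tweet(
    u'MiamiPete: today\'s "Last Call" is now up http://bit.ly/MGDzu '
    u'#stocks #stockmarket #finance #money')
```

Note the greedy `.*` stops at the last `http://` in the string, so the trailing hashtags end up outside the first group, matching the output shown above.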
J.F. Sebastian
Thanks! I have never used regular expressions but they certainly look promising. Now might be a good time. :)
Gregory Saxton
A: 

I would highly recommend using the Twitter API. There are actually two APIs, one for the main Twitter server and one for the search server. They are used for different things.

You can find sample code, pytwitter, on SVN. Add simplejson and you can be doing very powerful things in a matter of minutes.

Good luck

Trik
Thanks, Trik. I'll check it out!
Gregory Saxton