views:

540

answers:

1

I am learning Python, and as a first project am taking Twitter RSS feeds, parsing the data, and inserting the data into a sqlite database. I have been able to successfully parse each feed entry into a content variable (e.g., "You should buy low..."), a url variable (e.g., [u'http://bit.ly/HbFwL']), and a hashtag list (e.g., #stocks', u'#stockmarket', u'#finance', u'#money', u'#mkt']). I have also been successful at inserting these three pieces of information into three separate columns in a sqlite "RSSEntries" table, where each row is a different rss entry/tweet.

However, I want to set up a database where there is a many-to-many relation between the individual rss feed entries (i.e., individual tweets) and the hashtags that are associated with each entry. So, I set up the following tables using sqlalchemy (the first table just includes the Twitterers' rss feed urls that I want to download and parse):

RSSFeeds = schema.Table('feeds', metadata,
    schema.Column('id', types.Integer, 
        schema.Sequence('feeds_seq_id', optional=True), primary_key=True),
    schema.Column('url', types.VARCHAR(1000), default=u''),
)

RSSEntries = schema.Table('entries', metadata,
    schema.Column('id', types.Integer, 
        schema.Sequence('entries_seq_id', optional=True), primary_key=True),
    schema.Column('feed_id', types.Integer, schema.ForeignKey('feeds.id')),
    schema.Column('short_url', types.VARCHAR(1000), default=u''),
    schema.Column('content', types.Text(), nullable=False),
    schema.Column('hashtags', types.Unicode(255)),
)

tag_table = schema.Table('tag', metadata,
    schema.Column('id', types.Integer,
       schema.Sequence('tag_seq_id', optional=True), primary_key=True),
    schema.Column('tagname', types.Unicode(20), nullable=False, unique=True)
)

entrytag_table = schema.Table('entrytag', metadata,
    schema.Column('id', types.Integer,
        schema.Sequence('entrytag_seq_id', optional=True), primary_key=True),
    schema.Column('entryid', types.Integer, schema.ForeignKey('entries.id')),
    schema.Column('tagid', types.Integer, schema.ForeignKey('tag.id')),
)

So far, I've been able to successfully enter just the three main pieces of information into the RSSEntries table using the following code (abbreviated where...)

engine = create_engine('sqlite:///test.sqlite', echo=True)
conn = engine.connect()
.........
conn.execute('INSERT INTO entries (feed_id, short_url, content, hashtags) VALUES 
    (?,?,?,?)', (id, tinyurl, content, hashtags))

Now, here's the huge question. How do I insert the data into the feedtag and tagname tables? This is a real sticking point for me, since to start the hasthag variable is currently a list, and each feed entry could contain anywhere between 0 and, say, 6 hashtags. I know how to insert the whole list into a single column but not how to insert just the elements of the list into separate columns (or, in this example, rows). A bigger sticking point is the general question of how to insert the individual hashtags into the tagname table when a tagname could be used in numerous different feed entries, and then how to have the "associations" appear properly in the feedtag table.

In brief, I know exactly how each of the tables should look when they're all done, but I have no idea how to write the code to get the data into the tagname and feedtag tables. The whole "many-to-many" set-up is new to me.

I could really use your help on this. Thanks in advance for any suggestions.

-Greg

P.S. - Edit - Thanks to Ants Aasma's excellent suggestions, I've been able to almost get the whole thing to work. Specifically, the 1st and 2nd suggested blocks of code now work fine, but I'm having a problem implementing the 3rd block of code. I am getting the following error:

Traceback (most recent call last):
  File "RSS_sqlalchemy.py", line 242, in <module>
    store_feed_items(id, entries)
  File "RSS_sqlalchemy.py", line 196, in store_feed_items
    [{'feedid': entry_id, 'tagid': tag_ids[tag]} for tag in hashtags2])
NameError: global name 'entry_id' is not defined

Then, because I couldn't tell where Ants Aasma got the "entry_id" part from, I tried replacing it with "entries.id", thinking this might insert the "id" from the "entries" table. However, in that case I get this error:

Traceback (most recent call last):
  File "RSS_sqlalchemy.py", line 242, in <module>
    store_feed_items(id, entries)
  File "RSS_sqlalchemy.py", line 196, in store_feed_items
    [{'feedid': entries.id, 'tagid': tag_ids[tag]} for tag in hashtags2])
AttributeError: 'list' object has no attribute 'id'

I'm not quite sure where the problem is, and I don't really understand where the "entry_id" part fits in, so I've pasted in below all of my relevant "insertion" code. Can somebody help me see what's wrong? Note that I also just noticed that I was incorrectly calling my last table "feedtag_table" instead of "entrytag_table" This didn't match with my initially stated goal of relating individual feed entries to hashtags, rather than feeds to hashtags. I've since corrected the code above.

feeds = conn.execute('SELECT id, url FROM feeds').fetchall()

def store_feed_items(id, items):
    """ Takes a feed_id and a list of items and stored them in the DB """
    for entry in items:
        conn.execute('SELECT id from entries WHERE short_url=?', (entry.link,))
        s = unicode(entry.summary) 
        test = s.split()
        tinyurl2 = [i for i in test if i.startswith('http://')]
        hashtags2 = [i for i in s.split() if i.startswith('#')]
        content2 = ' '.join(i for i in s.split() if i not in tinyurl2+hashtags2)
        content = unicode(content2)
        tinyurl = unicode(tinyurl2)
        hashtags = unicode (hashtags2)
        date = strftime("%Y-%m-%d %H:%M:%S",entry.updated_parsed)

        conn.execute(RSSEntries.insert(), {'feed_id': id, 'short_url': tinyurl,
            'content': content, 'hashtags': hashtags, 'date': date})    

        tags = tag_table
        tag_id_query = select([tags.c.tagname, tags.c.id], tags.c.tagname.in_(hashtags))
        tag_ids = dict(conn.execute(tag_id_query).fetchall())
        for tag in hashtags:
            if tag not in tag_ids:
                result = conn.execute(tags.insert(), {'tagname': tag})
                tag_ids[tag] = result.last_inserted_ids()[0]

        conn.execute(entrytag_table.insert(),
            [{'feedid': id, 'tagid': tag_ids[tag]} for tag in hashtags2])
+1  A: 

First, you should use the SQLAlchemy SQL builder for the inserts to give SQLAlcehemy more insight into what you're doing.

 result = conn.execute(RSSEntries.insert(), {'feed_id': id, 'short_url': tinyurl,
        'content': content, 'hashtags': hashtags, 'date': date})
 entry_id = result.last_insert_ids()[0]

To insert tag associations to your schema you need to fist look up your tag identifiers and create any that don't exist:

tags = tag_table
tag_id_query = select([tags.c.tagname, tags.c.id], tags.c.tagname.in_(hashtags))
tag_ids = dict(conn.execute(tag_id_query).fetchall())
for tag in hashtags:
    if tag not in tag_ids:
        result = conn.execute(tags.insert(), {'tagname': tag})
        tag_ids[tag] = result.last_inserted_ids()[0]

Then just insert the associated id's into the feedtag_table. You can use the executemany support by passing a list of dicts to the execute method.

conn.execute(feedtag_table.insert(),
    [{'feedid': entry_id, 'tagid': tag_ids[tag]} for tag in hashtags])
Ants Aasma
Thanks--that's great! I have successfully added in the first bit, but am having a "global name 'select' is not defined" error when I add in the second bit. I am guessing it's due to where I'm putting in the code. I'll try to edit my question above to show you what I've done so far so maybe you can spot my mistake?
Gregory Saxton
OK, I figured out the problem with the second block of code--I hadn't imported "select" from sqlalchemy! Now, that works great. I am just working on tweaking the third block of code--something is working quite right.
Gregory Saxton
Re: last comment. I meant to say something is NOT working quite right. :) I understand the 'tagid' part but not where the 'entry_id' part comes from? I was hoping you could explain that? I'll paste in the relevant "inserting" section of my code above.
Gregory Saxton
entry_id would be the if of the entry just inserted. You can get it via the last_inserted_ids() method of the result of executing the insert. I updated the answer to show this.
Ants Aasma
Thank you--now everything works! I appreciate you taking the time with this. It has really helped. I do have one last question, if you don't mind. On the entrytag table, if an entry does not have a tag associated with it, a "NULL" row gets inserted (i.e., with entry_id NULL and tagid NULL). It's not a big problem, but I'm guessing there's a way to stop these from getting inserted?
Gregory Saxton
guard it with a check: if hashtags2: conn.execute(..)
Ants Aasma
Thanks once again. Your suggestions have been tremendously helpful. Is there a way for me to vote on your answer or give any other feedback on the helpfulness of your answer?
Gregory Saxton
@Gregory: Welcome to SO. You should 'accept' this answer as the most helpful.
Adam Bernier
Thanks, Adam. I've accepted it.
Gregory Saxton