I am learning Python, and as a first project am taking Twitter RSS feeds, parsing the data, and inserting the data into a sqlite database. I have been able to successfully parse each feed entry into a content variable (e.g., "You should buy low..."), a url variable (e.g., [u'http://bit.ly/HbFwL']), and a hashtag list (e.g., #stocks', u'#stockmarket', u'#finance', u'#money', u'#mkt']). I have also been successful at inserting these three pieces of information into three separate columns in a sqlite "RSSEntries" table, where each row is a different rss entry/tweet.
However, I want to set up a database where there is a many-to-many relation between the individual rss feed entries (i.e., individual tweets) and the hashtags that are associated with each entry. So, I set up the following tables using sqlalchemy (the first table just includes the Twitterers' rss feed urls that I want to download and parse):
RSSFeeds = schema.Table('feeds', metadata,
schema.Column('id', types.Integer,
schema.Sequence('feeds_seq_id', optional=True), primary_key=True),
schema.Column('url', types.VARCHAR(1000), default=u''),
)
RSSEntries = schema.Table('entries', metadata,
schema.Column('id', types.Integer,
schema.Sequence('entries_seq_id', optional=True), primary_key=True),
schema.Column('feed_id', types.Integer, schema.ForeignKey('feeds.id')),
schema.Column('short_url', types.VARCHAR(1000), default=u''),
schema.Column('content', types.Text(), nullable=False),
schema.Column('hashtags', types.Unicode(255)),
)
tag_table = schema.Table('tag', metadata,
schema.Column('id', types.Integer,
schema.Sequence('tag_seq_id', optional=True), primary_key=True),
schema.Column('tagname', types.Unicode(20), nullable=False, unique=True)
)
entrytag_table = schema.Table('entrytag', metadata,
schema.Column('id', types.Integer,
schema.Sequence('entrytag_seq_id', optional=True), primary_key=True),
schema.Column('entryid', types.Integer, schema.ForeignKey('entries.id')),
schema.Column('tagid', types.Integer, schema.ForeignKey('tag.id')),
)
So far, I've been able to successfully enter just the three main pieces of information into the RSSEntries table using the following code (abbreviated where...)
engine = create_engine('sqlite:///test.sqlite', echo=True)
conn = engine.connect()
.........
conn.execute('INSERT INTO entries (feed_id, short_url, content, hashtags) VALUES
(?,?,?,?)', (id, tinyurl, content, hashtags))
Now, here's the huge question. How do I insert the data into the feedtag and tagname tables? This is a real sticking point for me, since to start the hasthag variable is currently a list, and each feed entry could contain anywhere between 0 and, say, 6 hashtags. I know how to insert the whole list into a single column but not how to insert just the elements of the list into separate columns (or, in this example, rows). A bigger sticking point is the general question of how to insert the individual hashtags into the tagname table when a tagname could be used in numerous different feed entries, and then how to have the "associations" appear properly in the feedtag table.
In brief, I know exactly how each of the tables should look when they're all done, but I have no idea how to write the code to get the data into the tagname and feedtag tables. The whole "many-to-many" set-up is new to me.
I could really use your help on this. Thanks in advance for any suggestions.
-Greg
P.S. - Edit - Thanks to Ants Aasma's excellent suggestions, I've been able to almost get the whole thing to work. Specifically, the 1st and 2nd suggested blocks of code now work fine, but I'm having a problem implementing the 3rd block of code. I am getting the following error:
Traceback (most recent call last):
File "RSS_sqlalchemy.py", line 242, in <module>
store_feed_items(id, entries)
File "RSS_sqlalchemy.py", line 196, in store_feed_items
[{'feedid': entry_id, 'tagid': tag_ids[tag]} for tag in hashtags2])
NameError: global name 'entry_id' is not defined
Then, because I couldn't tell where Ants Aasma got the "entry_id" part from, I tried replacing it with "entries.id", thinking this might insert the "id" from the "entries" table. However, in that case I get this error:
Traceback (most recent call last):
File "RSS_sqlalchemy.py", line 242, in <module>
store_feed_items(id, entries)
File "RSS_sqlalchemy.py", line 196, in store_feed_items
[{'feedid': entries.id, 'tagid': tag_ids[tag]} for tag in hashtags2])
AttributeError: 'list' object has no attribute 'id'
I'm not quite sure where the problem is, and I don't really understand where the "entry_id" part fits in, so I've pasted in below all of my relevant "insertion" code. Can somebody help me see what's wrong? Note that I also just noticed that I was incorrectly calling my last table "feedtag_table" instead of "entrytag_table" This didn't match with my initially stated goal of relating individual feed entries to hashtags, rather than feeds to hashtags. I've since corrected the code above.
feeds = conn.execute('SELECT id, url FROM feeds').fetchall()
def store_feed_items(id, items):
""" Takes a feed_id and a list of items and stored them in the DB """
for entry in items:
conn.execute('SELECT id from entries WHERE short_url=?', (entry.link,))
s = unicode(entry.summary)
test = s.split()
tinyurl2 = [i for i in test if i.startswith('http://')]
hashtags2 = [i for i in s.split() if i.startswith('#')]
content2 = ' '.join(i for i in s.split() if i not in tinyurl2+hashtags2)
content = unicode(content2)
tinyurl = unicode(tinyurl2)
hashtags = unicode (hashtags2)
date = strftime("%Y-%m-%d %H:%M:%S",entry.updated_parsed)
conn.execute(RSSEntries.insert(), {'feed_id': id, 'short_url': tinyurl,
'content': content, 'hashtags': hashtags, 'date': date})
tags = tag_table
tag_id_query = select([tags.c.tagname, tags.c.id], tags.c.tagname.in_(hashtags))
tag_ids = dict(conn.execute(tag_id_query).fetchall())
for tag in hashtags:
if tag not in tag_ids:
result = conn.execute(tags.insert(), {'tagname': tag})
tag_ids[tag] = result.last_inserted_ids()[0]
conn.execute(entrytag_table.insert(),
[{'feedid': id, 'tagid': tag_ids[tag]} for tag in hashtags2])