views:

148

answers:

2

I have a data set of books and authors, with a many-to-many relationship.

There are about 10^6 books and 10^5 authors, with an average of 10 authors per book.

I need to perform a series of operations on the data set, such as counting the number of books by each author or deleting all books by a certain author from the set.

What would be a good data structure that will allow fast handling?

I'm hoping for some ready made module that can provide methods along the lines of:

obj.books.add(book1)

# linking
obj.books[n].author = author1
obj.authors[m].author = book1

# deleting
obj.remove(author1) # should automatically remove all links to the books by author1, but not the linked books

I should clarify that I prefer not to use a database for this, but to do it all in memory.

Thanks

+14  A: 

sqlite3 (or any other good relational DB, but sqlite comes with Python and is handier for such a reasonably small set of data) seems the right approach for your task. If you'd rather not learn SQL, SQLAlchemy is a popular "wrapper" over relational DBs, so to speak, that allows you to deal with them at any of several different abstraction levels of your choice.

And "doing it all in memory" is no problem at all (it's silly, mind you, since you'll needlessly pay the overhead of reading in all the data from somewhere more persistent on each and every run of your program, while keeping the DB on a disk file would save you that overhead -- but, that's a different issue;-). Just open your sqlite database as ':memory:' and there you are -- a fresh, new relational DB living entirely in memory (for the duration of your process only), no disk involved in the procedure at all. So, why not?-)

Personally, I'd use SQL directly for this task -- it gives me excellent control of exactly what's going on, and easily lets me add or remove indices to tweak performance, etc. You'd use three tables: a Books table (primary key ID, other fields such as Title &c), an Authors table (primary key ID, other fields such as Name &c), and a "many-to-many relationship table", say BookAuthors, with just two fields, BookID and AuthorID, and one record per author-book connection.

The two fields of the BookAuthors table are what's known as "foreign keys", referring respectively to the ID fields of Books and Authors, and you can define them with an ON DELETE CASCADE so that records referring to a book or author that gets deleted are automatically dropped in turn -- an example of the high semantic level at which even "bare" SQL lets you work, which no other existing data structure can come close to matching.

Alex Martelli
I believe sqlite even has an option to create the database in memory.
Omnifarious
In addition, to use memory as per the comments on the OP: "You can also supply the special name `:memory:` to create a database in RAM."
Matt Joiner
Moreover, sqlite can be used in memory only - see http://www.sqlite.org/inmemorydb.html
Brendan
+2  A: 

I'm hoping for some ready made module that can provide methods along the lines of:

Since that actually works, what more do you need?

You have a Book and an Author class definition. You also have a Book-Author association for the relationships. The methods required to manage add/change/delete are only a few lines of code.

Create big old dictionaries of Authors, Books and Author-Book association objects.

Use shelve to store it all.

Done.

S.Lott