I've been given the task of writing a script (or better yet, a daemon) that has to do several things:

  1. Crawl the most recent data from several input XML feeds. There are 15-20 feeds for the time being, but I believe the number might go up to 50 in the future. Feed size varies between 500 KB and 5 MB (it most likely won't go over 10 MB). Since the feeds are not in a standardized format, there has to be a feed parser for each source, so that the data is unified into a single, common format (roughly sketched after this list).
  2. Store the data in a database, in such a way that every single unit of data extracted from the feeds is still available.
  3. Since the data changes over time (say, information is updated at least once per hour), keep an archive of the changed data.
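
Roughly, what I have in mind for the parsers is a tiny base class that per-feed subclasses plug into, all yielding one common record type. The feed name, element names, and the CommonRecord fields below are made up just to illustrate the shape:

```python
# Sketch only: one parser subclass per feed, all yielding the same CommonRecord.
# "foo", the element/attribute names and the record fields are placeholders.
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class CommonRecord:
    source: str    # which feed the record came from
    item_id: str   # unique id within that feed
    payload: dict  # the unified fields shared by all feeds

class FeedParser:
    """Base class: each feed gets a subclass that maps its raw XML to CommonRecord."""
    source = "base"

    def parse(self, xml_text):
        raise NotImplementedError

class FooFeedParser(FeedParser):
    source = "foo"

    def parse(self, xml_text):
        root = ET.fromstring(xml_text)
        for item in root.iter("item"):  # hypothetical <item id="..."><title>...</title></item>
            yield CommonRecord(
                source=self.source,
                item_id=item.get("id", ""),
                payload={"title": item.findtext("title", "")},
            )
```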

One other thing that has proven difficult to manage (I've already hacked together a partial solution) is that during step 2 the database slows to a crawl because of the volume of SQL queries inserting data into several tables, which affects the rest of the system that relies on the database (it's a dedicated server hosting several sites). And I haven't even gotten to step 3...

Any hints on how I should approach this problem? Caveats to pay attention to? Anything that would help me solve it is more than welcome.

Thanks!

A: 

Some of my ideas:

  1. Wrap your inserts in database transactions, if your database supports them. I've only experimented with transactions, but according to mysql.com they can increase insert speed by up to 40% and don't lock tables (a minimal sketch follows after this list).

  2. Append the data to a temp file in an SQL-friendly format and load it into the database all at once. LOAD DATA INFILE is usually around 20 times faster than individual inserts (per the MySQL docs); I've used it to insert over a million entries and it was pretty quick (see the second sketch below).

  3. Set up some kind of queuing system, so writes hit the database in batches instead of one at a time (see the third sketch below).

  4. Put a sleep or wait between queries to throttle the load (in Python, time.sleep(1) makes the process wait one second).
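
For point 1, here's a minimal sketch of what I mean, assuming MySQLdb and a made-up feed_items table (connection settings and column names are placeholders too); note the table has to use InnoDB, since MyISAM ignores transactions:

```python
# Sketch: batch many inserts into a single transaction instead of one commit per row.
# Host/user/password, the feed_items table and its columns are all placeholders.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="feeds", passwd="secret", db="feeds")
cur = conn.cursor()

rows = [("foo", "1", "some title"), ("foo", "2", "another title")]  # parsed records

try:
    # MySQLdb leaves autocommit off, so everything up to commit() is one transaction.
    cur.executemany(
        "INSERT INTO feed_items (source, item_id, title) VALUES (%s, %s, %s)",
        rows,
    )
    conn.commit()
except MySQLdb.Error:
    conn.rollback()
    raise
finally:
    cur.close()
    conn.close()
```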
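
For point 2, a rough sketch of the temp-file approach (again, the table and column names are made up); LOAD DATA LOCAL INFILE needs local_infile enabled on both the client and the server:

```python
# Sketch: write the parsed rows to a tab-separated temp file, then bulk-load it.
# feed_items and its columns are placeholders; values must not contain tabs/newlines.
import csv
import os
import tempfile
import MySQLdb

rows = [("foo", "1", "some title"), ("foo", "2", "another title")]

with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, newline="") as tmp:
    csv.writer(tmp, delimiter="\t", lineterminator="\n").writerows(rows)
    path = tmp.name

conn = MySQLdb.connect(host="localhost", user="feeds", passwd="secret",
                       db="feeds", local_infile=1)
cur = conn.cursor()
cur.execute(
    "LOAD DATA LOCAL INFILE %s INTO TABLE feed_items "
    "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
    "(source, item_id, title)",
    (path,),
)
conn.commit()
cur.close()
conn.close()
os.unlink(path)  # clean up the temp file
```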
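
And for point 3, the simplest queuing setup I can think of is a single writer thread that batches whatever the crawlers push onto a queue; the batch size and insert_batch() below are placeholders for the real INSERT / LOAD DATA code:

```python
# Sketch: crawlers push rows onto a queue, one writer thread flushes them in batches,
# so only a single connection ever writes to the database.
import queue
import threading

BATCH_SIZE = 500
work_q = queue.Queue()

def insert_batch(rows):
    # placeholder: do the transaction-wrapped INSERTs or LOAD DATA INFILE here
    print("flushing %d rows" % len(rows))

def writer():
    batch = []
    while True:
        row = work_q.get()
        if row is None:          # sentinel: flush the remainder and stop
            break
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            insert_batch(batch)
            batch = []
    if batch:
        insert_batch(batch)

t = threading.Thread(target=writer)
t.start()

# crawler side: push rows as they are parsed, then send the sentinel when done
work_q.put(("foo", "1", "some title"))
work_q.put(None)
t.join()
```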

I'm not exactly sure which DB you're using, but here are some pointers on optimizing inserts:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html

Louis
I'm not following you: how would transactions help me? They still take up database time, so to speak, and lock tables, no?
mr.b
@mr.b I've only experimented with database transactions, but they say they can increase insert speed by up to 40% (mysql.com) and don't lock tables. If you're using Microsoft SQL or something else, I'd search for "[insert db name] optimize inserts".
Louis
I'm using MySQL (see tags).
mr.b
Just a heads up: using transactions solved my problem almost completely, and while doing some benchmarks I discovered that (no surprise here) LOAD DATA INFILE is, depending on the INSERTs involved, up to 50% faster than executing inserts wrapped in a transaction. It appears LOAD DATA INFILE already uses transactions internally, but it can't hurt to wrap it in one anyway.
mr.b