I've been given the task of writing a script (or, better yet, a daemon) that has to do several things:
- Crawl the most recent data from several input XML feeds. There are 15-20 feeds for the time being, but I believe the number might go up to 50 in the future. Feed size varies between 500 KB and 5 MB (it most likely won't go over 10 MB). Since the feeds are not in a standardized format, there has to be a parser for each feed, so that the data is unified into a single, common format (there's a rough sketch of what I mean after this list).
- Store the data in a database, in such a way that every single unit of data extracted from the feeds remains available.
- Since the data changes over time (information is updated at least once per hour), it is necessary to keep an archive of the changed data (see the versioning sketch below).
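
To make the "one parser per feed, unified output" idea concrete, here's a rough sketch of the kind of thing I mean (Python; the class and field names are made up for illustration, this is not my actual code):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from xml.etree import ElementTree

@dataclass
class FeedItem:
    source: str        # which feed the item came from
    external_id: str   # the item's id in the source feed
    title: str
    payload: dict      # everything else, normalized into a flat dict
    fetched_at: datetime

class ExampleFeedParser:
    """One parser per source; each one knows the quirks of its own feed."""
    source = "example-source"

    def parse(self, raw_xml: bytes) -> list[FeedItem]:
        root = ElementTree.fromstring(raw_xml)
        items = []
        for node in root.iter("item"):  # element names differ per feed
            items.append(FeedItem(
                source=self.source,
                external_id=node.findtext("id", default=""),
                title=node.findtext("title", default=""),
                payload={child.tag: (child.text or "") for child in node},
                fetched_at=datetime.now(timezone.utc),
            ))
        return items
```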
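For the archive part, would something like a versions table work, where rows are never updated in place and a new version row is inserted only when an item's content actually changes? Again just a sketch with made-up table and column names (SQLite only so it's self-contained; the idea should carry over to any engine):

```python
import hashlib
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS feed_item_versions (
    source       TEXT NOT NULL,
    external_id  TEXT NOT NULL,
    content_hash TEXT NOT NULL,   -- hash of the normalized payload
    payload      TEXT NOT NULL,   -- normalized item stored as JSON
    fetched_at   TEXT NOT NULL,
    PRIMARY KEY (source, external_id, content_hash)
);
"""

def store_version(conn: sqlite3.Connection, item) -> None:
    """Insert a new version row only if this item's content has changed."""
    payload_json = json.dumps(item.payload, sort_keys=True)
    content_hash = hashlib.sha1(payload_json.encode("utf-8")).hexdigest()
    # INSERT OR IGNORE is SQLite syntax; MySQL/Postgres have their own
    # equivalents (INSERT IGNORE / ON CONFLICT DO NOTHING).
    conn.execute(
        "INSERT OR IGNORE INTO feed_item_versions "
        "(source, external_id, content_hash, payload, fetched_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (item.source, item.external_id, content_hash,
         payload_json, item.fetched_at.isoformat()),
    )
```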
One other thing that has proven difficult to manage (I already hacked together some solution) is that during step 2 the database slows down to a crawl because of the volume of SQL queries inserting data into several tables, which affects the rest of the system that relies on the database (it's a dedicated server with several sites hosted on it). And I couldn't even get to step 3...
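
Would batching the inserts help here? I.e., collect all the rows parsed from one feed and insert them with executemany() inside a single transaction, instead of firing (and autocommitting) one INSERT per row. A rough sketch of what I mean, using the same invented schema as above (the same pattern exists in the MySQLdb/psycopg2 drivers):

```python
import sqlite3

def bulk_insert(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Insert a whole batch in one transaction instead of one query per row."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO feed_item_versions "
            "(source, external_id, content_hash, payload, fetched_at) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )
```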
Any hints on how I should approach this problem? Caveats to pay attention to? Anything that would help me solve this is more than welcome.
Thanks!