I'm a Perl programmer with some nice scripts that go fetch HTTP pages (from a text file-list of URLs) with cURL and save them to a folder.

However, the number of pages to get is in the tens of millions. Sometimes the script fails on number 170,000 and I have to start the script again manually. It automatically reads the URL and sees if there is a page downloaded and skips. But, with a few hundred thousand, it still takes a few hours to skip back up to where it left off. Obviously, this is not going to pan out in the end.

I've been told that instead of saving to a text file, which is hard to search and modify, I need to use a database. I don't know much about databases; I just messed around with MySQL on a school server a year ago. I just need the ability to add millions of rows with a few static columns, search for and modify a single row quickly, and do all of this locally on a LAN (or on a single computer if that's difficult). And of course, I need to access this database from Perl.

Where should I start? What do I need to download to get a server started on Windows? Which Perl modules should I use? (I'm using an ActiveState distro)

+1  A: 

Look into DBI. If you do not like SQL in your programs, try SQL::Abstract.

Alan Haggai Alavi
+5  A: 

There are many kinds of databases, but if you've already decided on an SQL database and want to keep the setup process easy, you might want to have a look at SQLite and the DBI/DBD::SQLite modules, which let you use it from Perl.
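As a rough illustration of how this could fit the URL-fetching job, here is a minimal sketch using DBI with DBD::SQLite. The file name `pages.db`, the table name `pages`, its columns, and the example URLs are all assumptions made up for this example; adjust them to your own data.

```perl
use strict;
use warnings;
use DBI;

# Hypothetical database file; SQLite creates it if it does not exist.
my $dbh = DBI->connect("dbi:SQLite:dbname=pages.db", "", "",
    { RaiseError => 1, AutoCommit => 1 });

# Hypothetical schema: one row per URL plus a couple of static columns.
$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS pages (
    url    TEXT PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'pending',
    file   TEXT
)
SQL
$dbh->do("CREATE INDEX IF NOT EXISTS pages_status ON pages (status)");

# Load URLs inside a single transaction; committing row by row is slow.
my $insert = $dbh->prepare("INSERT OR IGNORE INTO pages (url) VALUES (?)");
$dbh->begin_work;
$insert->execute($_) for qw(http://example.com/a http://example.com/b);
$dbh->commit;

# After a successful download, mark that one row as done.
$dbh->do("UPDATE pages SET status = 'done', file = ? WHERE url = ?",
    undef, "a.html", "http://example.com/a");

# On restart, ask only for the URLs that are still pending.
my $pending = $dbh->selectcol_arrayref(
    "SELECT url FROM pages WHERE status = 'pending' ORDER BY url");
print "$_\n" for @$pending;

$dbh->disconnect;
```

With this layout a restart does not need to walk the whole list again: the `SELECT` returns only the rows that have not been marked done.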

rafl
Yeah, I read over the SQLite site, that sounds like what I need. If you don't mind, could you give me some simple steps to set up an environment on a Windows box? Or at least some up-to-date docs on how to do it. I downloaded the sqlite_amalgamation-3_7_2.zip, but now I'm lost...
Sho Minamimoto
Install Strawberry Perl and tell its CPAN client to install the two modules I mentioned.
rafl
Oh, didn't realize everything was handled by those modules. Thanks, I'll try it out.
Sho Minamimoto
DBD::SQLite is included in both Strawberry and ActivePerl. Strawberry is better, of course.
Alexandr Ciornii
+4  A: 

Since you only need to search on one column, you may wish to consider a key/value store such as Berkeley DB, accessed through either the BerkeleyDB or the DB_File module.

Generally, you can think of these key/value databases as Perl hashes that live on disk rather than in memory. Exact key lookups are very fast. Everything else requires scanning the whole dataset.
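To make the "hash on disk" idea concrete, here is a small sketch using DB_File's tied-hash interface. The file name `seen_urls.db` and the example URLs are assumptions for illustration, and the actual fetch is left as a comment.

```perl
use strict;
use warnings;
use Fcntl;     # provides O_RDWR and O_CREAT
use DB_File;

# Hypothetical file name; $DB_HASH selects the hash file format.
tie my %seen, 'DB_File', 'seen_urls.db', O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot open seen_urls.db: $!";

for my $url (qw(http://example.com/a http://example.com/b http://example.com/a)) {
    if (exists $seen{$url}) {
        # Exact key lookup: no scan of the full URL list is needed.
        print "skip  $url\n";
        next;
    }
    # ... fetch and save the page here ...
    $seen{$url} = 1;
    print "fetch $url\n";
}

untie %seen;
```

Because the hash is stored in the file, a later run of the same script sees the keys written by earlier runs and skips those URLs immediately.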

daotoad
Turns out that I need to store more than two pieces of data for each row, so I don't think this will work out as well as SQLite, but I will bookmark for later, thanks!
Sho Minamimoto
You know what, I might have to use something like this just to check whether I already have a given URL in my DB. Would you suggest the Berkeley Hash or BTree format for faster lookups?
Sho Minamimoto
Hash is good for quick unique lookups and additions. BTree keeps everything sorted. For unique lookups use a hash. To handle multiple data columns you can use Storable or another serializer to flatten the data so that it fits in the data column. MLDBM is one tool to help with this. You could also use DBD::DBM to get a DBI SQL interface to your data.
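A minimal sketch of the flattening approach, assuming Storable's `freeze`/`thaw` on top of a DB_File hash. The file name `pages_kv.db` and the field names `status` and `file` are made up for this example.

```perl
use strict;
use warnings;
use Fcntl;                        # provides O_RDWR and O_CREAT
use DB_File;
use Storable qw(freeze thaw);

# Hypothetical file name for the on-disk hash.
tie my %pages, 'DB_File', 'pages_kv.db', O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot open pages_kv.db: $!";

my $url = 'http://example.com/a';

# Flatten several fields into the single value stored under the URL key.
$pages{$url} = freeze({ status => 'done', file => 'a.html' });

# Reading a record back: thaw the stored string into a hash reference.
my $rec = thaw($pages{$url});
print "$rec->{status} $rec->{file}\n";

# Changing one field means thaw, modify, then freeze and store again.
$rec->{status} = 'stale';
$pages{$url} = freeze($rec);

untie %pages;
```

Only the key can be looked up quickly with this layout; finding all records with a particular `status` would still mean reading every value.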
daotoad