I'm trying to tackle a problem by using a SQLite database and Perl modules. In the end, there will be tens of millions of entries I need to log. The only unique identifier for each item is a text string for the URL. I'm thinking of doing this in two ways:

Way #1: Have a Good table, a Bad table, and an Unsorted table. (I need to check the HTML and decide whether I want the page.) Say we have 1 billion pages total, 333 million URLs in each table. When I get a new URL, I need to check whether it's already in any of the tables and add it to Unsorted if it's unique. Also, I would be moving a lot of rows around with this option.

Way #2: I have 2 tables, Master and Good. Master has all 1 billion page URLs, and Good has the 333 million that I want. For a new URL I need to do the same check, except this time I am only querying one table, and I would never delete a row from Master, only add the data to Good.

So basically, I need to know the best setup to quickly query a huge SQLite database, check whether a ~20-character text string is already present, and add it if it isn't.

Edit: I'm now trying to get Berkeley DB to work using the Perl module, but no dice. Here's what I have:

use BerkeleyDB;

$dbFolder = 'C:\somedirectory';
my $env = BerkeleyDB::Env->new ( -Home => $dbFolder );

my $db  = BerkeleyDB::Hash->new (
    -Filename => "fred.db",
    -Env      => $env );
my $status = $db->db_put("apple", "red");

And when I run this, I get the following:

Can't call method "db_put" on an undefined value at C:\Directory\perlfile.pl line 42, <STDIN> line 1.
+2  A: 

I don't know if this is optimal, but you could set up your SQLite DB so that the "good" table has a unique constraint on the URL column. You probably don't have enough RAM to do the comparisons in Perl (a naive solution would be to build a hash where the URLs are the keys, but with a billion pages you'd need an awful lot of memory).

When it comes time to do an insert, the database will enforce uniqueness and raise an error when you try to insert a duplicate URL. You can catch this error and ignore it, as long as DBI returns different error values for different kinds of errors.
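
A minimal sketch of that approach, assuming DBD::SQLite; the database file, table, and column names here are illustrative, not taken from the question:

use strict;
use warnings;
use DBI;

# Assumed names throughout; connect with errors returned rather than thrown.
my $dbh = DBI->connect("dbi:SQLite:dbname=urls.db", "", "",
                       { RaiseError => 0, PrintError => 0 });

# TEXT PRIMARY KEY gives us the unique constraint on the URL column.
$dbh->do("CREATE TABLE IF NOT EXISTS good (url TEXT PRIMARY KEY)");

my $ins = $dbh->prepare("INSERT INTO good (url) VALUES (?)");

sub add_if_new {
    my ($url) = @_;
    # A duplicate URL violates the primary-key constraint, so execute()
    # returns false and we simply ignore it.
    return $ins->execute($url) ? 1 : 0;
}

add_if_new("http://example.com/a");   # inserted
add_if_new("http://example.com/a");   # duplicate, ignored

SQLite also supports INSERT OR IGNORE, which skips a conflicting row without raising an error at all, if you'd rather not trap anything on the Perl side.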

CanSpice
+4  A: 

I'd be inclined to use a hash instead of SQLite to do what you want to do. A hash is optimized to test for existence without needing to keep the values in any sorted order and without keeping a redundant copy of the datum in an index. The hash algorithm applied to the datum yields the location where it would be stored, if it did exist; you can seek to that location and see if it's there. I don't think you'd need to keep the hash table in RAM.

Here's how you might take a hybrid hash/SQLite approach.

Create a SQLite table:

STORE
    id      INTEGER PRIMARY KEY
    BUCKET  integer, indexed
    URL     text, not indexed
    status

You could have three of these tables, STORE1, STORE2, and STORE3 if you want to keep them separate by status.

Let's assume that there will be 250,000,001 distinct buckets in each store. (You can experiment with this number; make it a prime number).

Find a hashing algorithm that takes two inputs, the URL string and 250,000,001, and returns a number between 1 and 250,000,001.
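
One possible bucket function, sketched in Perl; the use of Digest::MD5 is my assumption, not something the answer prescribes:

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hash the URL, take the first 8 hex digits as an integer, and map it
# into the range 1 .. $num_buckets.
sub bucket_for {
    my ($url, $num_buckets) = @_;
    return (hex(substr(md5_hex($url), 0, 8)) % $num_buckets) + 1;
}

print bucket_for("http://example.com/a", 250_000_001), "\n";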

When you get a URL, feed it to the hashing algorithm and it will tell you which BUCKET to look in:

SELECT * FROM STORE WHERE BUCKET = {the value returned by your hash function};

Your index on the BUCKET field will quickly return the rows, and you can examine the URLs. If the current URL is not one of them, add it:

INSERT INTO STORE (BUCKET, URL) VALUES ({your hash return value}, {the URL});

SQLite will be indexing integer values, which I think will be more efficient than indexing the URL. And the URL will be stored only once.
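
Putting the pieces together, a rough sketch of the check-and-insert flow with DBI and DBD::SQLite; the database file name and the bucket function are assumptions, only the STORE schema comes from the answer above:

use strict;
use warnings;
use DBI;
use Digest::MD5 qw(md5_hex);

my $num_buckets = 250_000_001;

my $dbh = DBI->connect("dbi:SQLite:dbname=store.db", "", "",
                       { RaiseError => 1 });

$dbh->do("CREATE TABLE IF NOT EXISTS STORE (
              id     INTEGER PRIMARY KEY,
              BUCKET INTEGER,
              URL    TEXT,
              status TEXT)");
$dbh->do("CREATE INDEX IF NOT EXISTS store_bucket ON STORE (BUCKET)");

# Assumed bucket function (see the sketch above).
sub bucket_for {
    my ($url) = @_;
    return (hex(substr(md5_hex($url), 0, 8)) % $num_buckets) + 1;
}

# Fetch the (few) rows in the URL's bucket; insert only if it isn't there.
sub add_if_new {
    my ($url) = @_;
    my $bucket = bucket_for($url);
    my $rows = $dbh->selectcol_arrayref(
        "SELECT URL FROM STORE WHERE BUCKET = ?", undef, $bucket);
    return 0 if grep { $_ eq $url } @$rows;
    $dbh->do("INSERT INTO STORE (BUCKET, URL) VALUES (?, ?)",
             undef, $bucket, $url);
    return 1;
}

add_if_new("http://example.com/a");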

Tim
I wanted to elaborate on this answer but not enough space, so I'll add a second answer that is a continuation.
Tim
I think there's no real limit to answer length. It's certainly got to be in a couple KBs...
MPelletier
@Tim, I've merged your continuation answer with this one. Feel free to delete the other answer.
cjm
Err, indexing integer values may be more efficient, but only because indexing by a string presumably does the equivalent of your bucket calculation, only more efficiently. This is a general observation and may not apply to SQLite, but I bet it does.
ysth
Please elaborate: why is SQLite's sorting of the URL for insertion into a btree more efficient than a hashing algorithm that calculates a bucket? Secondly, your criticism doesn't address the duplication of the URL value, which can be avoided by the approach I suggested. I stand by my original recommendation: a hash is better than a btree because a hash won't degrade with insertion -- there's no btree to keep balanced. This hybrid approach won't be as efficient as a pure hash approach; I only suggested it as possibly better than indexing the URL column, assuming the OP had to use a relational database like SQLite.
Tim
On the first point, I'm not being argumentative -- I'm asking why the calculation of the URL's hash bucket would be less efficient than the binary search used to decide where to place the URL in a btree when there are a billion URLs (or 333 million of them if the set is broken into separate tables). My assumption is that at some point the hash calculation would surpass the btree search in efficiency, because the hash calculation wouldn't involve disk reads.
Tim
I don't *know* that it is more efficient, but I'm guessing that you are trading one btree (for the indexed URL) for additional work on the client side plus two btrees in SQLite (one for the primary key and one for the bucket index), and I don't see how that could be an improvement. It seems like you are expecting SQLite to do something significantly different with integer indexes? I am not assuming so.
ysth
Re avoiding duplication of the URL value, I'm not sure what you mean. I'm assuming the url would be the primary key, so there's clearly no duplication.
ysth
@ysth: With SQLite everything is "client-side", but I take your point -- I assume indexing the URLs would require more RAM. I'd expect integers to take 8 bytes and strings of up to 255 characters to take significantly more. But your point that I'd need two btrees is well taken. Still, I'd prefer a real hash solution to a simulated one or a btree.
Tim
I meant the client of the SQL library. The question says these are ~20 characters (!). Somebody is scraping a URL shortener, maybe?
ysth
Well, thanks to you guys for figuring this out; maybe some other soul will be forced to use SQLite to solve this later. I found Berkeley DB to be better suited, because I'm going to be making a lot of queries and I can match exact keys.
Sho Minamimoto
+2  A: 

If $db is undefined, opening the database is failing, and you should inspect $! and $BerkeleyDB::Error to see why.

Have you created the database already? If not, you need -Flags => DB_CREATE.

Working example:

use strict;
use warnings;
use BerkeleyDB;

my $dbFolder = '/home/ysth/bdbtmp/';

# DB_CREATE tells BerkeleyDB to create fred.db if it doesn't exist yet.
my $db  = BerkeleyDB::Hash->new (
    -Filename => "$dbFolder/fred.db", 
    -Flags    => DB_CREATE,
) or die "couldn't create: $!, $BerkeleyDB::Error.\n";

# Store key "apple" with value "red"; db_put returns 0 on success.
my $status = $db->db_put("apple", "red");

I couldn't get BerkeleyDB::Env to do anything useful, though; whatever I tried, the constructor returned undef.

ysth
I added DB_CREATE and checked the $! after creating the environment, but all it says is "No such file or directory." Would you mind just dropping me a working sample for me to dissect? All I need to do is start a hash on disk, add items to it, and check for existing items.
Sho Minamimoto
@Sho Minamimoto: added an example
ysth
Got it working. I think the problem was that I put "fred.db" instead of a full path, but the docs say the DB should be created wherever the Env is anyway. Oh well. Thanks for the help!
Sho Minamimoto
One last thing: with millions of entries, how would I go about selecting by value? Say some URL keys have the value '0' and some have the value '1'; how would I get only the URLs with the '0' value? Would this be fast?
Sho Minamimoto
@Sho Minamimoto: no, having to go through the bad ones would definitely slow down just getting the good ones. If you need that to be as fast as possible, you're better off with good, bad, and unsorted files (or maybe good, all, and unsorted? I'm not sure how you want to use this).
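
To illustrate why: with a single BerkeleyDB::Hash, the only way to pull out just the '0' URLs is to walk every entry with a cursor and filter, roughly as in this sketch (the file name and status values are assumptions):

use strict;
use warnings;
use BerkeleyDB;

my $db = BerkeleyDB::Hash->new(
    -Filename => "urls.db",          # assumed file name
    -Flags    => DB_CREATE,
) or die "couldn't open: $!, $BerkeleyDB::Error\n";

# A full scan: the cursor visits every key/value pair, good and bad alike,
# which is why this gets slow with millions of entries.
my ($url, $status) = ('', '');
my $cursor = $db->db_cursor();
while ($cursor->c_get($url, $status, DB_NEXT) == 0) {
    next unless $status eq '0';      # keep only the '0' URLs
    print "$url\n";
}
$cursor->c_close();
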
ysth
Well, I've got a list of 10 million URLs and code to parse them, but what happens when the code breaks at some point in time? I can't be positive that all pages are in the same state at any given time (some might be unsorted, some good, some bad, and there will always be new URLs), and I don't want to start from #1 again. I think 3 Berkeley DBs will still keep me at constant time per URL, because it's just 3 queries to 3 hashes per URL.
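
That three-database layout might look roughly like this sketch; the file names, status handling, and helper function are assumptions, not something from the thread:

use strict;
use warnings;
use BerkeleyDB;

# One on-disk hash per state; each lookup is a constant-time exact-key match.
my %dbs;
for my $name (qw(good bad unsorted)) {
    $dbs{$name} = BerkeleyDB::Hash->new(
        -Filename => "$name.db",
        -Flags    => DB_CREATE,
    ) or die "couldn't open $name.db: $!, $BerkeleyDB::Error\n";
}

# Queue a URL only if it isn't already known to any of the three DBs,
# so an interrupted run can be restarted without reprocessing old URLs.
sub queue_url {
    my ($url) = @_;
    my $dummy;
    for my $name (qw(good bad unsorted)) {
        return 0 if $dbs{$name}->db_get($url, $dummy) == 0;   # already seen
    }
    $dbs{unsorted}->db_put($url, 1);
    return 1;
}

queue_url("http://example.com/a");
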
Sho Minamimoto