Hi all:

Disclaimer: first time I've used DBI.

I have a MySQL table with a lot of indexed fields (f1, f2, f3, etc) that are used to generate WHERE clauses by long-running processes that iterate over chunks of the database performing various cleaning and testing operations.

The current version of this code works something like this:

sub get_list_of_ids() {
    my ($value1, $value2, $value3...) = @_;

    my $stmt = 'SELECT * FROM files WHERE 1';
    my @args;

    if (defined($value1)) {
        $stmt .= ' AND f1 = ?';
        push(@args, $value1);
    }
    # Repeat for all the different fields and values

    my $select_sth = $dbh->prepare($stmt) or die $dbh->errstr;
    $select_sth->execute(@args) or die $select_sth->errstr;

    my @result;
    while (my $array = $select_sth->fetch) {
        push(@result, $$array[0]);
    }
    return \@result;
}

sub function_A() {
    my ($value1, $value2, $value3...) = @_;

    my $id_aref = get_list_of_ids($value1, $value2, $value3...);
    foreach my $id (@$id_aref) {
        # Do something with $id
        # And something else with $id
    }
}

sub function_B() {
    my ($value1, $value2, $value3...) = @_;

    my $id_aref = get_list_of_ids($value1, $value2, $value3...);
    foreach my $id (@$id_aref) {
        # Do something different with $id
        # Maybe even delete the row
    }
}

Anyway, I'm about to dump an awful lot more rows into the database, and I'm well aware that the code above won't scale up. I can think of several ways to fix it based on how I'd do it in other languages. What is the best way to handle it in Perl?

The key points are that the logic in get_list_of_ids() is too long to duplicate in every function, and that the operations performed on the selected rows vary widely.

Thanks in advance.

+5  A: 

I presume by "scale up" you mean in maintenance terms rather than performance.

The key change to your code is to pass in your arguments as column/value pairs rather than a list of values with an assumed set of columns. This will allow your code to handle any new columns you might add.

DBI's `selectcol_arrayref` method is both convenient and a bit faster, being implemented in C.

If you turn on RaiseError in your connect call, DBI will throw an exception on any error, so you don't have to write `or die ...` after every call. You should do that.
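For example, a connect call with RaiseError turned on might look like this (the DSN, database name, and credentials below are placeholders; adjust for your setup):

```perl
use strict;
use warnings;
use DBI;

# With RaiseError => 1, every failing DBI call dies with the error
# message, so no "or die $dbh->errstr" boilerplate is needed.
# PrintError => 0 stops DBI from also warning about the same error.
my $dbh = DBI->connect(
    'dbi:mysql:database=mydb;host=localhost',   # placeholder DSN
    $user, $password,
    { RaiseError => 1, PrintError => 0 },
);
```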

Finally, since we're building SQL from possibly untrusted user input, I've taken care to quote the column names with `quote_identifier`.

The rest is explained in this Etherpad; you can watch the code being transformed step by step.

sub get_ids {
    my %search = @_;

    my $sql = 'SELECT id FROM files';

    if( keys %search ) {
        $sql .= " WHERE ";
        $sql .= join " AND ", map { "$_ = ?" }
                              map { $dbh->quote_identifier($_) }
                              keys %search;
    }

    return $dbh->selectcol_arrayref($sql, undef, values %search);
}

my $ids = get_ids( foo => 42, bar => 23 );

If you expect get_ids to return a huge list, too much to keep in memory at once, then instead of building the whole array you can return the statement handle and iterate over that.

sub get_ids {
    my %search = @_;

    my $sql = 'SELECT id FROM files';

    if( keys %search ) {
        $sql .= " WHERE ";
        $sql .= join " AND ", map { "$_ = ?" }
                              map { $dbh->quote_identifier($_) }
                              keys %search;
    }

    my $sth = $dbh->prepare($sql);
    $sth->execute(values %search);
    return $sth;
}

my $sth = get_ids( foo => 42, bar => 23 );
while( my ($id) = $sth->fetchrow_array ) {
    ...
}

You can combine both approaches by returning a list of IDs in list context, or a statement handle in scalar context.

sub get_ids {
    my %search = @_;

    my $sql = 'SELECT id FROM files';

    if( keys %search ) {
        $sql .= " WHERE ";
        $sql .= join " AND ", map { "$_ = ?" }
                              map { $dbh->quote_identifier($_) }
                              keys %search;
    }

    # Convenient for small lists.
    if( wantarray ) {
        my $ids = $dbh->selectcol_arrayref($sql, undef, values %search);
        return @$ids;
    }
    # Efficient for large ones.
    else {
        my $sth = $dbh->prepare($sql);
        $sth->execute(values %search);
        return $sth;
    }
}

my $sth = get_ids( foo => 42, bar => 23 );
while( my ($id) = $sth->fetchrow_array ) {
    ...
}

my @ids = get_ids( baz => 99 );

Eventually you will want to stop hand-coding SQL and use an Object-Relational Mapper (ORM) such as DBIx::Class. One of the major advantages of an ORM is flexibility: it can do all of the above for you. DBIx::Class can return a simple list of results or a powerful iterator. The iterator is lazy: it will not run the query until you start fetching rows, which lets you refine the query as needed without complicating your fetch routine.

# Assuming get_ids returned a DBIx::Class resultset rather than a list:
my $ids = get_ids( foo => 23, bar => 42 );
$ids->rows(20)->all;  # equivalent to adding LIMIT 20
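As a rough sketch, here is what get_ids might look like built on DBIx::Class. This assumes you have generated a schema class (called MyApp::Schema here, with a result source named 'File' mapped to the files table; all of those names are illustrative):

```perl
use strict;
use warnings;
use MyApp::Schema;   # hypothetical DBIx::Class schema class

my $schema = MyApp::Schema->connect($dsn, $user, $password);

sub get_ids {
    my %search = @_;

    # search() is lazy: it builds a resultset but runs no SQL yet.
    # The column names are quoted and the values bound for you.
    return $schema->resultset('File')->search(
        \%search,
        { columns => ['id'] },
    );
}

my $rs = get_ids( foo => 42, bar => 23 );
while ( my $file = $rs->next ) {    # the query executes on first fetch
    my $id = $file->id;
    # Do something with $id
}
```

Because the resultset is lazy, callers can keep chaining restrictions (like the `rows(20)` example above) before any SQL is sent to the server.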
Schwern
Where do the placeholder ?'s get mapped to their values in the search hash? I see where the keys are used in the map pipeline but not the values. To use a value you would need $search{$_}, right?
Paul
Must have lost it in the refactoring; I've fixed it. The values are passed into `selectcol_arrayref()`. Normally you can't trust the order of `keys` and `values`, but they are guaranteed to return in the same order for the same unmodified hash.
Schwern
By "scale up" I mean that there will be tens of millions of rows, so I expect that storing all the identifiers as scalars in an array will require more RAM than I have available. However, it is possible that `selectcol_arrayref` will simplify the get_ids() logic enough that I can just reuse it in each of the functions where I need it. I'll give that a try. I'll check out DBIx::Class too, thanks.
Anon Guy
@AnonGuy I've expanded it to address saving memory if you expect queries to be returning millions of rows.
Schwern
Awesome, thanks.
Anon Guy