My web application parses data from an uploaded file and inserts it into a database table. Due to the nature of the input data (bank transaction data), duplicate data can exist from one upload to another. At the moment I'm using hideously inefficient code to check for the existence of duplicates by loading all rows within the date range from the DB into memory, and iterating over them and comparing each with the uploaded file data.

Needless to say, this can become very slow as the data set size increases.

So, I'm looking to replace this with a SQL query (against a MySQL database) which checks for the existence of duplicate data, e.g.

SELECT count(*) FROM transactions WHERE desc = ? AND dated_on = ? AND amount = ?

This works fine, but my real-world case is a little bit more complicated. The description of a transaction in the input data can sometimes contain erroneous punctuation (e.g. "BANK 12323 DESCRIPTION" can often be represented as "BANK.12323.DESCRIPTION") so our existing (in memory) matching logic performs a little cleaning on this description before we do a comparison.

Whilst this works in memory, my question is: can this cleaning be done in a SQL statement, so that I can move this matching logic to the database? Something like:

SELECT count(*) FROM transactions WHERE CLEAN_ME(desc) = ? AND dated_on = ? AND amount = ?

Where CLEAN_ME is a proc which strips the field of the erroneous data.

Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.

Thanks a lot

A: 

The cleanest way is indeed to make sure only correct data is in the database.

In this example, the "BANK.12323.DESCRIPTION" row would be matched by:

SELECT count(*) FROM transactions
WHERE desc LIKE 'BANK%12323%DESCRIPTION' AND dated_on = ? AND amount = ?

But this might cause performance issues when you have a lot of data in the table.

tehvan
I think it should rather be 'BANK[. ]12323[. ]DESCRIPTION' to avoid false positives as much as possible.
Tomalak
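
A pattern with character classes like that needs MySQL's REGEXP operator rather than LIKE (which only understands % and _), so the stricter check might look something like this; desc is backticked here because it is a reserved word in MySQL:

SELECT count(*) FROM transactions
WHERE `desc` REGEXP '^BANK[. ]12323[. ]DESCRIPTION$' AND dated_on = ? AND amount = ?

REGEXP cannot use an index on the description column, so the performance caveat above still applies.
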
+1  A: 

The easiest way to do that is to add a unique index on the appropriate columns and to use ON DUPLICATE KEY UPDATE. I would further recommend transforming the file into a CSV and loading it into a temporary table, so you get the most out of MySQL's built-in functions, which are surely faster than anything you could write yourself - after all, you would otherwise have to pull the data into your own application, while MySQL does everything in place.
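
A rough sketch of that approach, assuming a comma-separated file at /tmp/upload.csv, a description column named desc, and a unique index on (desc, dated_on, amount) already in place:

-- staging table with the same structure (and keys) as transactions
CREATE TEMPORARY TABLE transactions_import LIKE transactions;

-- let MySQL parse the file in place (path and delimiter are just examples)
LOAD DATA LOCAL INFILE '/tmp/upload.csv'
INTO TABLE transactions_import
FIELDS TERMINATED BY ','
(`desc`, dated_on, amount);

-- copy across; rows that hit the unique key just re-apply the same amount
INSERT INTO transactions (`desc`, dated_on, amount)
SELECT `desc`, dated_on, amount FROM transactions_import
ON DUPLICATE KEY UPDATE amount = VALUES(amount);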

soulmerge
+1  A: 

can this cleaning be done in a SQL statement

Yes, you can write a stored procedure to do it in the database layer:

mysql> CREATE FUNCTION clean_me (s VARCHAR(255))
    -> RETURNS VARCHAR(255) DETERMINISTIC
    -> RETURN REPLACE(s, '.', ' ');

mysql> SELECT clean_me('BANK.12323.DESCRIPTION');

BANK 12323 DESCRIPTION
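
Plugged into the duplicate check from the question, that becomes something like:

SELECT count(*) FROM transactions
WHERE clean_me(`desc`) = ? AND dated_on = ? AND amount = ?

Because the function has to be evaluated for every row, MySQL cannot use an index on the description column here.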

This will perform very poorly across a large table though.

Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.

No, as far as databases are concerned the cleanest way is always the cleverest way (as long as performance isn't awful).

Do that, and add indexes to the columns you're doing bulk compares on, to improve performance. If it's actually intrinsic to the type of data that desc/dated-on/amount are always unique, then express that in the schema by making it a UNIQUE index constraint.
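
For example, with a cleaned-description column and using the clean_me function from above (the column and index names here are just examples):

-- add a column for the cleaned description and backfill it
ALTER TABLE transactions ADD COLUMN desc_clean VARCHAR(255);
UPDATE transactions SET desc_clean = clean_me(`desc`);

-- enforce uniqueness on the natural key (fails if duplicates already exist)
ALTER TABLE transactions
    ADD UNIQUE KEY uniq_transaction (desc_clean, dated_on, amount);

The duplicate check then becomes a plain indexed lookup:

SELECT count(*) FROM transactions WHERE desc_clean = ? AND dated_on = ? AND amount = ?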

bobince
A: 

Another way that you could do it is as follows:

  • Clean the description before inserting.

  • Create a primary key for the table that is a combination of the columns that uniquely identify the entry. Sounds like that might be cleaned description, date and amount.

  • Use either the 'replace' or 'on duplicate key' syntax, whichever is more appropriate. 'replace' actually replaces the existing row in the db with the updated one when a unique key conflict occurs, e.g.:

    REPLACE INTO transactions (desc, dated_on, amount) values (?,?,?)

    'on duplicate key' allows you to specify which columns to update on a duplicate key error:

    INSERT INTO transactions (desc, dated_on, amount) VALUES (?,?,?) ON DUPLICATE KEY UPDATE amount = amount

By using the multi-column primary key, you will gain a lot of performance since primary key lookups are usually quite fast.

If you prefer to keep your existing primary key, you could also create a unique index on those three columns.

Whichever way you choose, I would recommend cleaning the description before it goes into the db, even if you also store the original description and just use the cleaned one for indexing.
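
For example, assuming a separate desc_clean column that carries the unique key, the insert could clean on the way in:

INSERT INTO transactions (`desc`, desc_clean, dated_on, amount)
VALUES (?, REPLACE(?, '.', ' '), ?, ?)
ON DUPLICATE KEY UPDATE amount = amount;  -- no-op: keep the existing row

The same description is bound to both placeholders here; the cleaning could equally be done in application code before binding.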

jonstjohn