My web application parses data from an uploaded file and inserts it into a database table. Due to the nature of the input data (bank transaction data), duplicates can exist from one upload to the next. At the moment I'm using hideously inefficient code to check for them: I load all rows within the date range from the DB into memory and compare each one against the uploaded file data.
Needless to say, this can become very slow as the data set size increases.
So, I'm looking to replace this with a SQL query (against a MySQL database) which checks for the existence of duplicate data, e.g.:
SELECT count(*) FROM transactions WHERE `desc` = ? AND dated_on = ? AND amount = ?
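(As an aside, I've assumed a composite index along these lines so the lookup itself doesn't scan the whole table; the column types are guesses on my part, and the prefix length on `desc` is only needed if it's a long string column:

CREATE INDEX idx_tx_dupe_check ON transactions (dated_on, amount, `desc`(50));

)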
This works fine, but my real-world case is a little more complicated. The description of a transaction in the input data sometimes contains erroneous punctuation (e.g. "BANK 12323 DESCRIPTION" can arrive as "BANK.12323.DESCRIPTION"), so our existing in-memory matching logic performs a little cleaning on the description before comparing.
Whilst this works in memory, my question is: can this cleaning be done within the SQL statement itself, so that I can move the matching logic to the database? Something like:
SELECT count(*) FROM transactions WHERE CLEAN_ME(`desc`) = ? AND dated_on = ? AND amount = ?
where CLEAN_ME is a stored function which strips the erroneous punctuation from the field.
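To make the question concrete, here's a rough sketch of what I imagine CLEAN_ME looking like (names are illustrative; REGEXP_REPLACE needs MySQL 8.0+, so on older versions I'd have to nest a REPLACE() call per punctuation character instead):

DELIMITER //

CREATE FUNCTION CLEAN_ME(txt VARCHAR(255))
RETURNS VARCHAR(255)
DETERMINISTIC
BEGIN
    -- Collapse runs of the punctuation we've seen so far into single spaces,
    -- so 'BANK.12323.DESCRIPTION' compares equal to 'BANK 12323 DESCRIPTION'.
    RETURN REGEXP_REPLACE(txt, '[.,:;_-]+', ' ');
END //

DELIMITER ;

The catch I'm already aware of is that wrapping `desc` in a function like this stops MySQL from using any index on the column, so every duplicate check becomes a full scan.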
Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.
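For reference, if I do end up going that route, I assume a generated column (MySQL 5.7+) would keep the cleaned copy in sync without touching the upload code. The cleaning expression has to be inlined because generated columns can't call stored functions (the nested REPLACE() here is just illustrative):

ALTER TABLE transactions
    ADD COLUMN desc_clean VARCHAR(255)
        AS (REPLACE(REPLACE(`desc`, '.', ' '), ',', ' ')) STORED,
    ADD INDEX idx_desc_clean (desc_clean, dated_on, amount);

SELECT count(*) FROM transactions WHERE desc_clean = ? AND dated_on = ? AND amount = ?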
Thanks a lot