My web application parses data from an uploaded file and inserts it into a database table. Due to the nature of the input data (bank transaction data), duplicates can exist from one upload to the next. At the moment I'm using hideously inefficient code to check for them: I load all rows within the date range from the DB into memory and compare each one against the uploaded file data.
Needless to say, this can become very slow as the data set size increases.
So, I'm looking to replace this with a SQL query (against a MySQL database) which checks for the existence of duplicate data, e.g.:
SELECT count(*) FROM transactions WHERE `desc` = ? AND dated_on = ? AND amount = ?
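(As an aside, I've assumed a composite index along these lines so the lookup itself doesn't scan the whole table; the column types are guesses on my part, and the prefix length on `desc` is only needed if it's a long string column:

CREATE INDEX idx_tx_dupe_check ON transactions (dated_on, amount, `desc`(50));

)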
This works fine, but my real-world case is a little more complicated. The description of a transaction in the input data sometimes contains erroneous punctuation (e.g. "BANK 12323 DESCRIPTION" can arrive as "BANK.12323.DESCRIPTION"), so our existing in-memory matching logic performs a little cleaning on the description before comparing.
Whilst this works in memory, my question is: can this cleaning be done within the SQL statement itself, so that I can move the matching logic to the database? Something like:
SELECT count(*) FROM transactions WHERE CLEAN_ME(`desc`) = ? AND dated_on = ? AND amount = ?
where CLEAN_ME is a stored function which strips the erroneous punctuation from the field.
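To make the question concrete, here's a rough sketch of what I imagine CLEAN_ME looking like (names are illustrative; REGEXP_REPLACE needs MySQL 8.0+, so on older versions I'd have to nest a REPLACE() call per punctuation character instead):

DELIMITER //

CREATE FUNCTION CLEAN_ME(txt VARCHAR(255))
RETURNS VARCHAR(255)
DETERMINISTIC
BEGIN
    -- Collapse runs of the punctuation we've seen so far into single spaces,
    -- so 'BANK.12323.DESCRIPTION' compares equal to 'BANK 12323 DESCRIPTION'.
    RETURN REGEXP_REPLACE(txt, '[.,:;_-]+', ' ');
END //

DELIMITER ;

The catch I'm already aware of is that wrapping `desc` in a function like this stops MySQL from using any index on the column, so every duplicate check becomes a full scan.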
Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.
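For reference, if I do end up going that route, I assume a generated column (MySQL 5.7+) would keep the cleaned copy in sync without touching the upload code. The cleaning expression has to be inlined because generated columns can't call stored functions (the nested REPLACE() here is just illustrative):

ALTER TABLE transactions
    ADD COLUMN desc_clean VARCHAR(255)
        AS (REPLACE(REPLACE(`desc`, '.', ' '), ',', ' ')) STORED,
    ADD INDEX idx_desc_clean (desc_clean, dated_on, amount);

SELECT count(*) FROM transactions WHERE desc_clean = ? AND dated_on = ? AND amount = ?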
Thanks a lot