ansaurus

Question

MySQL, select records with at least X characters matching

Answer 1

A:

One obvious option is to use REGEX. I have no prior experience in this but this might be of help to you: http://dev.mysql.com/doc/refman/5.1/en/regexp.html

You'll need to find a suitable expression to match what you need.

o.k.w 2009-07-19 14:26:14

It's not very obvious.. We're talking about running a random regex and match the result against all other records in the table. I can't see any SQL involving regex for that.

PatrikAkerstrand 2009-07-19 16:30:14

As stated above, this is only a small part of the solution. I do not know the characters to look for. With 5 characters, this gives 2^5 regexp queries I should execute if I did it randomly. Unfortunately this is not suited for this problem.

Digits 2009-07-19 17:48:50

@Machine I guess I didn't understand the question fully until re-reading it. Agree, my 'obvious' option wasn't applicable after all. I really do not think this can be achieved by using only sql queries but I would certainly wish to be proven wrong.

o.k.w 2009-07-20 01:48:41

Answer 2

A:

duplicate answer

2009-07-19 21:50:33

Answer 3

+3 A:

Well, this is kind of ugly, but it does work fine. It's generic SQL and will work in any environment. Simply generate a number of selects of a substring that is greater than the maximum length of the field that you're reading. Change the number 50 in the function to a number that exceeds your fieldlength. It may return a realllly long query, but like I said, it'll work fine. Here is an example in Python:

import sqlite3

c = sqlite3.connect('test.db')

c.execute('create table myTable (id integer, content varchar[50])')
for id, content in ((1,'apple'),(2,'pineapple'),(3,'application'),(4,'nation')):
    c.execute('insert into myTable values (?,?)', [id,content])

c.commit();

def GenerateSQL(substrSize):
    subqueries = ["select substr(content,%i,%i) AS substr, count(*) AS myCount from myTable where length(substr(content,%i,%i))=%i group by substr(content,%i,%i) " % (i,substrSize,i,substrSize,substrSize,i,substrSize)  for i in range(50)]
    sql = 'select substr FROM \n\t(' + '\n\tunion all '.join(subqueries) + ') \nGROUP BY substr HAVING sum(myCount) > 1'
    return sql

print GenerateSQL(3)

print c.execute(GenerateSQL(3)).fetchall()

The query generated looks like:

select substr FROM 
    (select substr(content,0,3) AS substr, count(*) AS myCount from myTable where length(substr(content,0,3))=3 group by substr(content,0,3) 
    union all select substr(content,1,3) AS substr, count(*) AS myCount from myTable where length(substr(content,1,3))=3 group by substr(content,1,3) 
    union all select substr(content,2,3) AS substr, count(*) AS myCount from myTable where length(substr(content,2,3))=3 group by substr(content,2,3) 
    union all select substr(content,3,3) AS substr, count(*) AS myCount from myTable where length(substr(content,3,3))=3 group by substr(content,3,3) 
    union all select substr(content,4,3) AS substr, count(*) AS myCount from myTable where length(substr(content,4,3))=3 group by substr(content,4,3) 
    ... ) 
GROUP BY substr HAVING sum(myCount) > 1

And the results it produces are:

[(u'app',), (u'ati',), (u'ion',), (u'nat',), (u'pin',), (u'ple',), (u'ppl',), (u'tio',)]

Greg 2009-07-20 04:32:19

I'll try this and let you know if my server explodes ;) Thx

Digits 2009-07-20 10:28:47

Answer 4

+1 A:

I'm sorry as I haven't been playing with php for a while & I don't have a proper test environment for it, but I quickly devised a way of doing this in c# 3.5

pseudocode: build a table with strings of the specified length & a count of occurences next to it. Select where count > 1:

    static void Main(string[] args)
    {

        string[] data = { "apple", "pinapple", "application", "nation" };
        string[] result = my_func(3,data);

        foreach (string str in result)
        {
            Console.WriteLine(str);
        }
        Console.ReadKey();
    }

    private static string[] my_func(int l, string[] data)
    {
        Dictionary<string,int> dict = new Dictionary<string,int>();
        foreach (string str in data)
        {
            for (int i = 0; i < str.Length - l + 1; i++)
            {
                string part = str.Substring(i, l);
                if (dict.ContainsKey(part))
                {
                    dict[part]++;
                }else {
                    dict.Add(part,1);
                }
            }
        }
        var result = from k in dict.Keys
                where dict[k] > 1
                orderby dict[k] descending
                select k;

        return result.ToArray<string>();
    }

Vincent 2009-07-20 07:54:18

This looks interesting. I'm just a little worried about performance, since all "quick calls" you make to your dictionary, which in my case will be sql queries. And caching a table with 10k records might not be a good idea either, but I'll check it out!

Digits 2009-07-20 10:27:33

You're right, this code should run on the server, then it seems it needs to be written in SQL, but then you would need to iterate in SQL which is practically impossible. I actually like Greg's answer except that the SQL query generated seems insane and it depends on the fieldlength.

Vincent 2009-07-20 11:13:48

ansaurus

tags:

views:

answers:

MySQL, select records with at least X characters matching

related questions