views:

572

answers:

4

Hello,

I am trying to accomplish the following. Let's say we have a table that contains these fields (ID, content)

1 | apple

2 | pineapple

3 | application

4 | nation

now, I am looking for a function that will tell me all possible common matches. For example, if the argument is "3", the function will return all possible strings from 3 characters that appear in more then one record.

In this case, I get "app","ppl","ple","ati","tio","ion"

If the argument is "4", i get: "appl","pple","atio","tion"

If the arugment is "5", i get: "apple","ation"

If the argument is "6", nohting is returned.

Untill now, I did not find a function that accomplishes this.

Thx!

Some extra information: I am using this in a PHP script with a MySQL database. I really just want to give the amount of characters as an argument and of course the table to search in.

A: 

One obvious option is to use REGEX. I have no prior experience in this but this might be of help to you: http://dev.mysql.com/doc/refman/5.1/en/regexp.html

You'll need to find a suitable expression to match what you need.

o.k.w
It's not very obvious.. We're talking about running a random regex and match the result against all other records in the table. I can't see any SQL involving regex for that.
PatrikAkerstrand
As stated above, this is only a small part of the solution. I do not know the characters to look for. With 5 characters, this gives 2^5 regexp queries I should execute if I did it randomly. Unfortunately this is not suited for this problem.
Digits
@Machine I guess I didn't understand the question fully until re-reading it. Agree, my 'obvious' option wasn't applicable after all. I really do not think this can be achieved by using only sql queries but I would certainly wish to be proven wrong.
o.k.w
A: 

duplicate answer

+3  A: 

Well, this is kind of ugly, but it does work fine. It's generic SQL and will work in any environment. Simply generate a number of selects of a substring that is greater than the maximum length of the field that you're reading. Change the number 50 in the function to a number that exceeds your fieldlength. It may return a realllly long query, but like I said, it'll work fine. Here is an example in Python:

import sqlite3

c = sqlite3.connect('test.db')

c.execute('create table myTable (id integer, content varchar[50])')
for id, content in ((1,'apple'),(2,'pineapple'),(3,'application'),(4,'nation')):
    c.execute('insert into myTable values (?,?)', [id,content])

c.commit();

def GenerateSQL(substrSize):
    subqueries = ["select substr(content,%i,%i) AS substr, count(*) AS myCount from myTable where length(substr(content,%i,%i))=%i group by substr(content,%i,%i) " % (i,substrSize,i,substrSize,substrSize,i,substrSize)  for i in range(50)]
    sql = 'select substr FROM \n\t(' + '\n\tunion all '.join(subqueries) + ') \nGROUP BY substr HAVING sum(myCount) > 1'
    return sql

print GenerateSQL(3)

print c.execute(GenerateSQL(3)).fetchall()

The query generated looks like:

select substr FROM 
    (select substr(content,0,3) AS substr, count(*) AS myCount from myTable where length(substr(content,0,3))=3 group by substr(content,0,3) 
    union all select substr(content,1,3) AS substr, count(*) AS myCount from myTable where length(substr(content,1,3))=3 group by substr(content,1,3) 
    union all select substr(content,2,3) AS substr, count(*) AS myCount from myTable where length(substr(content,2,3))=3 group by substr(content,2,3) 
    union all select substr(content,3,3) AS substr, count(*) AS myCount from myTable where length(substr(content,3,3))=3 group by substr(content,3,3) 
    union all select substr(content,4,3) AS substr, count(*) AS myCount from myTable where length(substr(content,4,3))=3 group by substr(content,4,3) 
    ... ) 
GROUP BY substr HAVING sum(myCount) > 1

And the results it produces are:

[(u'app',), (u'ati',), (u'ion',), (u'nat',), (u'pin',), (u'ple',), (u'ppl',), (u'tio',)]
Greg
I'll try this and let you know if my server explodes ;) Thx
Digits
+1  A: 

I'm sorry as I haven't been playing with php for a while & I don't have a proper test environment for it, but I quickly devised a way of doing this in c# 3.5

pseudocode: build a table with strings of the specified length & a count of occurences next to it. Select where count > 1:

    static void Main(string[] args)
    {

        string[] data = { "apple", "pinapple", "application", "nation" };
        string[] result = my_func(3,data);

        foreach (string str in result)
        {
            Console.WriteLine(str);
        }
        Console.ReadKey();
    }

    private static string[] my_func(int l, string[] data)
    {
        Dictionary<string,int> dict = new Dictionary<string,int>();
        foreach (string str in data)
        {
            for (int i = 0; i < str.Length - l + 1; i++)
            {
                string part = str.Substring(i, l);
                if (dict.ContainsKey(part))
                {
                    dict[part]++;
                }else {
                    dict.Add(part,1);
                }
            }
        }
        var result = from k in dict.Keys
                where dict[k] > 1
                orderby dict[k] descending
                select k;

        return result.ToArray<string>();
    }
Vincent
This looks interesting. I'm just a little worried about performance, since all "quick calls" you make to your dictionary, which in my case will be sql queries. And caching a table with 10k records might not be a good idea either, but I'll check it out!
Digits
You're right, this code should run on the server, then it seems it needs to be written in SQL, but then you would need to iterate in SQL which is practically impossible. I actually like Greg's answer except that the SQL query generated seems insane and it depends on the fieldlength.
Vincent