ansaurus

Question

Finding similar number patterns in table

Answer 1

+3 A:

One way to do this is to calculate the Levenshtein distance between your search string and the about_member fields for each member. Here's an implementation of the function as a MySQL stored function.

With that you can do:

SELECT name, LEVENSHTEIN(about_member, '1-1-2-1-2') AS diff 
FROM members 
ORDER BY diff ASC

The % of similarity is related to diff; if diff=0 then it's 100%, if diff is the size of the string (minus the amount of dashes), it's 0%.

NullUserException 2010-08-08 22:30:13

Wow, it is very interesting. And what about performance? Is it slow query? Does it "eat" a lot of resources of hosting?

hey 2010-08-08 22:36:10

@hey Honestly, I have no clue. You'd have to try and find out.

NullUserException 2010-08-08 22:37:55

@hey - If you use the function result in an Order By or Where clause, it will force a table scan. On a large table, this would likely not perform well. If you tell us more about why you are using such a pattern, we might be able to help you find a set based solution that would perform better.

Thomas 2010-08-08 22:42:24

@hey: better take a closer look at how the Levenshtein distance works. I somehow doubt that this is what you had in mind - could be wrong though...

VolkerK 2010-08-08 22:43:18

Let's say members have questions and they can answer to question yes or no, if they answer yes, then it will put value 1, if no - value 2. I really don't know how to explain it better...

hey 2010-08-08 22:45:37

No, VolkerK. I really needed this. And this method is correct, but it is not the best thing to do I guess, if it will force a table scan as Thomas says.

hey 2010-08-08 22:46:49

Answer 2

+3 A:

The obvious solution is to look at the levenstein distance (there isn't an implementation built into mysql but there are other implementations accesible e.g. this one in pl/sql and some extensions), however as usual, the right way to solve the problem would be to have normalised the data properly in the first place.

symcbean 2010-08-08 22:33:00

Answer 3

+11 A:

convert your number sequences to bit masks and use BIT_COUNT(column ^ search) as similarity function, ranged from 0 (= 100% match, strings are equal) to [bit length] (=0%, strings are completely different). To convert this similarity function to the percent value use

100 * (bit_length - similarity) / bit_length

For example, "1-1-2-2-1" becomes "00110" (assuming you have only two states), 2-1-1-2-1 is "10010", bit_count(00110 ^ 10010) = 2, bit-length = 5, and 100 * (5 - 2) / 5 = 60%.

stereofrog 2010-08-08 22:59:39

Do not get it...

hey 2010-08-14 18:15:17

what exactly?...

stereofrog 2010-08-15 10:16:01

@stereofrog, could you please try to explain with queries and bounty is yours, please? Thank you.

hey 2010-08-26 09:31:23

I think what he is saying is this: `$sql = "SELECT BIT_COUNT(about_member ^ '{$search}') AS similarity FROM members";` Note that the data in about_member should be saved like 00101 instead of 1-1-2-1-2.

captaintokyo 2010-08-28 04:00:02

Answer 4

A:

I would go with the Levenshtein distance approach, you can use it within MySQL or PHP.

Alix Axel 2010-08-24 12:22:44

Answer 5

+1 A:

Having read the clarification comments on the original question, the Levenshtein distance is not the answer you are looking for.

You are not trying to compute the smallest number of edits to change one string into another.

You are trying to compare one set of numbers with another set of numbers. What you are looking for is the minimum (weighted) sum of the differences between the two sets of numbers.

Place each answer in a separate column (Ans1, Ans2, Ans3, Ans4, .... )

Assume you are searching for similarities to 1-2-1-2.

SELECT UserName, Abs( Ans1 - 1 ) + Abs( Ans2 - 2 ) + Abs( Ans3 - 1 ) + Abs( Ans4 - 2) as Difference ORDER BY Difference ASC

Will list users by similarity to answers 1-2-1-2, assuming all questions are weighted evenly.

If you want to make certain answers more important, just multiply each of the terms by a weighting factor.

If the questions will always be yes/no and the number of answers is small enough that all the answers can be fitted into a single integer and all answers are equally weighted, then you could encode all the answers in a single column and use BIT_COUNT as @stereofrog suggested. This would be a faster and more space-efficient implementation.

MZB 2010-08-25 23:24:14

Answer 6

A:

If you represent your answer patterns as bit sequences you can use the formula stereofrog gave (100 * (bit_length - similarity) / bit_length).

Following the mentioned example, when we convert "1"s to bit off and "2"s to bit on "1-1-2-2-1" becomes 6 (as base-10, 00110 in binary) and "2-1-1-2-1" becomes 18 (10010b) etc.

Also, I think you should store the answers' bits to the least significant bits, but it doesn't matter as long as you are consistent that the answers of different members align.

Here's a sample script to be run against MySQL.

DROP TABLE IF EXISTS `test`;

CREATE TABLE `members` (
    `id` VARCHAR(16) NOT NULL ,
    `about_member` INT NOT NULL
) ENGINE = InnoDB;

INSERT INTO `members`
    (`id`, `about_member`)
VALUES
    ('member_1', '6'),
    ('member_2', '18');

SELECT 100 * ( 5 - BIT_COUNT( about_member ^ (
    SELECT about_member
    FROM members
    WHERE id = 'member_1' ) ) ) / 5
FROM members;

The magical 5 in the script is the number of answers (bit_length in the formula above). You should change it according to your situation, regardless of how many bits there are in the actual data type used, as BIT_COUNT doesn't know how many bytes you are using.

BIT_COUNT returns the number of bits set and is explained in MySQL manual. ^ is the binary XOR operator in MySQL.

Here the comparison of member_1's answers is compared with everybody's, including their own - which results as 100% match, naturally.

Jawa 2010-08-27 09:27:05

How do you get `"1-1-2-2-1" becomes 6`?

hey 2010-08-27 11:18:53

And what `BIT_COUNT(N ^ N)` means?

hey 2010-08-27 11:24:09

As you replace "1"s with bit off (0) and "2"s with bit on (1), "1-1-2-2-1" -> 00110 (base-2) -> 6 (base-10).

Jawa 2010-08-27 11:35:38

Updated. For further clarification refer to http://en.wikipedia.org/wiki/Bitwise_operation#XOR

Jawa 2010-08-27 12:50:15

Answer 7

A:

If you don't have too many fields, you could create an index on the integer representation of about_member. Then you can find the 100% by an exact match on the about_member field, followed by the 80% matches by changing 1 bit, the 60% matches by changing 2 bits, and so on.

Joshua Martell 2010-08-28 03:47:37

Answer 8

+1 A:

I would go with the similar_text() PHP built-in. It seems to be exactly what you want:

$percent = 0;
similar_text($string1, $string2, $percent);

echo $percent;

It works as the question expects.

passcod 2010-08-28 13:11:49

Answer 9

+1 A:

stereofrog posted this idea originally; Jawa tried to improve it; here is my attempt.

^ is the XOR function. It compares 2 binary numbers bit-by-bit and returns 0 if both bits are the same, and 1 otherwise.

    0 1 0 0 0 1 0 1 0 1 1 1  (number 1)
 ^  0 1 1 1 0 1 0 1 1 0 1 1  (number 2)
 =  0 0 1 1 0 0 0 0 1 1 0 0  (result)

How this applies to your problem:

  // In binary...
  1111 ^ 0111 = 1000 // (1 bit out of 4 didn't match: 75% match)
  1111 ^ 0000 = 1111 // (4 bits out of 4 didn't match: 0% match)

  // The same examples, except now in decimal...
    15 ^    7 = 8  (1000 in binary) // (1 bit out of 4 didn't match: 75% match)
    15 ^    0 = 15 (1111 in binary) // (4 bits out of 4 didn't match: 0% match)

How we can count these bits in MySQL:

BIT_COUNT(b'0111') = 3 // Bit count of binary '0111'
BIT_COUNT(7) = 3       // Bit count of decimal 7 (= 0111 in binary)
BIT_COUNT(b'1111' ^ b'0111') = 1 // (1 bit out of 4 didn't match: 75% match)

So to get the similarity...

// First we focus on calculating mismatch.
(BIT_COUNT(b'1111' ^ b'0111') / YOUR_TOTAL_BITS) = 0.25 (25% mismatch)
(BIT_COUNT(b'1111' ^ b'1111') / YOUR_TOTAL_BITS) = 0 (0% mismatch; 100% match)

// Now, getting the proportion of matched bits is easy
1 - (BIT_COUNT(b'1111' ^ b'0111') / YOUR_TOTAL_BITS) = 0.75 (75% match)
1 - (BIT_COUNT(b'1111' ^ b'1111') / YOUR_TOTAL_BITS) = 1.00 (100% match)

If we could just make your about_member field store data as bits (and be represented by an integer), we could do all of this easily! Instead of 1-2-1-1-1, use 0-1-0-0-0, but without the dashes.

Here's how PHP can help us:

bindec('01000') == 8;
bindec('00001') == 1;
decbin(8) == '01000';
decbin(1) == '00001';

And finally, here's the implementation:

// Setting a member's about_member property...
$about_member = '01100101';
$about_member_int = bindec($about_member);
$query = "INSERT INTO members (name,about_member) VALUES ($name,$about_member_int)";

// Getting matches...
$total_bits = 8; // The maximum length the member_about field can be (8 in this example)
$my_member_about = '00101100';
$my_member_about_int = bindec($my_member_about_int);
$query = "
    SELECT 
        *,
        (1 - (BIT_COUNT(member_about ^ $my_member_about_int) / $total_bits)) match 
    FROM members
    ORDER BY match DESC
    LIMIT 10";

This last query will have selected the 10 members most similar to me!

Now, to recap, in layman's terms,

We use binary because it makes things easier; the binary number is like a long line of light switches. We want to save our "light switch configuration" as well as find members that have the most similar configurations.

The ^ operator, given 2 light switch configurations, does a comparison for us. The result is again a series of switches; a switch will be ON if the 2 original switches were in different positions, and OFF if they were in the same position.

BIT_COUNT tells us how many switches are ON--giving us a count of how many switches were different. YOUR_TOTAL_BITS is the total number of switches.

But binary numbers are still just numbers... and so a string of 1's and 0's really just represents a number like 133 or 94. But it's a lot harder to visualize our "light switch configuration" if we use decimal numbers. That's where PHP's decbin and bindec come in.

Learn more about the binary numeral system.

Hope this helps!

Tim 2010-08-29 08:09:47

ansaurus

tags:

views:

answers:

Finding similar number patterns in table

related questions