ansaurus

Question

Fuzzy grouping in Postgres

Answer 1

+1 A:

For any grouping you should have transitive equality, that is a ~= b, b ~= c => a ~= c.

State the grouping rule precisely in words, and we'll try to express it in SQL.

For instance, which group should foo*bar go to?

Update:

This query strips all non-alphanumeric characters, ignores case, and returns one title from each resulting group:

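SELECT  DISTINCT ON (REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g')) title
FROM    (
        VALUES
        (1, '5. foo'),
        (2, '5.foo'),
        (3, '5. foo*'),
        (4, 'bar'),
        (5, 'bar*'),
        (6, 'baz'),
        (7, 'BAZ')
        ) rows (id, title)
-- ORDER BY must start with the DISTINCT ON expression; id then picks the lowest-id row in each group
ORDER BY
        REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g'), id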

Quassnoi 2009-10-30 17:30:32

To its own group, as it's not sufficiently similar to the other items. That's why the question is about fuzzy grouping: it doesn't matter which of the variations the row ends up grouped with, it just matters that they are grouped at all.

Reinis I. 2009-10-30 17:34:08

`Reinis I.`: *sufficiently similar* is usually not transitive, which means the items are not groupable. If, say, `foo` is sufficiently similar to `for` and `for` is sufficiently similar to `bar`, but `foo` is not sufficiently similar to `bar`, then you cannot build any consistent groups.

Quassnoi 2009-10-30 17:37:01

I'm not saying it can be done, I'm asking how to work around it.

Reinis I. 2009-10-30 17:55:02
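
The non-transitivity point from the comments above can be checked directly. The sketch below is only an illustration under assumptions that do not come from the thread: it takes Levenshtein distance as the similarity measure, uses an arbitrary threshold of 2, and relies on the `fuzzystrmatch` extension being installable (`CREATE EXTENSION` needs PostgreSQL 9.1 or later).

```sql
-- Assumption: fuzzystrmatch is available; it provides levenshtein(text, text).
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Compare every unordered pair of the three example words.
-- The threshold of 2 is an arbitrary choice for illustration.
WITH words (w) AS (
    VALUES ('foo'), ('for'), ('bar')
)
SELECT  a.w AS left_word,
        b.w AS right_word,
        levenshtein(a.w, b.w)      AS distance,
        levenshtein(a.w, b.w) <= 2 AS similar
FROM    words a
JOIN    words b
ON      a.w < b.w
ORDER BY
        a.w, b.w;
```

With this measure, `foo`/`for` differ by one edit and `for`/`bar` by two, so both pairs count as similar, while `foo`/`bar` differ by three and do not. No partition into groups respects all three answers, which is why a strict rule has to be chosen first.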

Answer 2

+2 A:

At some point, you are going to have to define what makes a set of values belong together in a group. If that's too hard, maybe you should prohibit the entry of fuzzy data or, if you must permit it, add a column that contains a sanitized version of the title for use by the grouping operations.

Jonathan Leffler 2009-10-30 17:32:38
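
A minimal sketch of the sanitized-column idea, assuming a hypothetical table `items` with a hypothetical extra column `title_key` (neither name comes from the question). It reuses the normalization from the first answer: upper-case the title and strip everything that is not alphanumeric.

```sql
-- Hypothetical table and column names, for illustration only.
CREATE TABLE items (
    id        serial PRIMARY KEY,
    title     text NOT NULL,
    title_key text            -- sanitized form of title, used only for grouping
);

INSERT INTO items (title)
VALUES ('5. foo'), ('5.foo'), ('5. foo*'), ('bar'), ('bar*'), ('baz'), ('BAZ');

-- Fill the sanitized column once; new rows would need the same treatment
-- (in the application or in a trigger, which is not shown here).
UPDATE  items
SET     title_key = REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g');

-- Grouping now runs on the plain, indexable column.
SELECT  title_key, COUNT(*) AS variants
FROM    items
GROUP BY
        title_key
ORDER BY
        title_key;
```

For the sample rows this yields three groups (`5FOO`, `BAR`, `BAZ`) with 3, 2, and 2 variants. Storing the key makes the grouping rule explicit and lets it change without rewriting every query.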
