



I have a table with contents that look similar to this:

id | title
1  | 5. foo
2  |
3  | 5. foo*
4  | bar
5  | bar*
6  | baz
7  | BAZ

…and so on. I would like to group by the titles and ignore the extra bits. I know Postgres can do this:

SELECT title
FROM (
  SELECT regexp_replace(title, '[*.]+$', '') AS title
  FROM table
) AS a
GROUP BY title

However, that's quite simple and would get very unwieldy if I tried to anticipate all the possible variations. So, the question is, is there a more general way to do fuzzy grouping than using regexp? Is it even possible, at least without breaking one's back doing it?

Edit: To clarify, there is no preference for any of the variations, and this is what the table should look like after grouping:

5. foo

I.e., the variations would be items that are different just by a few characters or capitalization, and it doesn't matter which ones are left as long as they're grouped.

+1  A: 

For any grouping you should have transitive equality, that is a ~= b, b ~= c => a ~= c.
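To make the transitivity requirement concrete, here is a minimal sketch that uses edit distance as the similarity test. It assumes the `fuzzystrmatch` extension can be installed (it provides `levenshtein()`); the threshold of 1 and the sample strings are arbitrary choices for illustration, not something from the question.

```sql
-- Assumption: the fuzzystrmatch extension is available on this server.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Call two strings "close" when their edit distance is at most 1
-- (hypothetical threshold, chosen only for this example).
SELECT a,
       b,
       levenshtein(a, b)      AS distance,
       levenshtein(a, b) <= 1 AS is_close
FROM (VALUES ('foo', 'for'),
             ('for', 'far'),
             ('foo', 'far')) AS pairs (a, b);
```

The first two pairs are at distance 1 and the third is at distance 2, so under this rule foo ~= for and for ~= far hold while foo ~= far does not. A rule like this therefore cannot be handed to GROUP BY directly.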

Formulate it strictly using words and we'll try to formulate it using SQL.

For instance, which group should foo*bar go to?


This query strips all non-alphanumeric characters, ignores case, and returns one title from each group (the one with the lowest id):

SELECT  DISTINCT ON (REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g')) title
FROM    (
        VALUES
        (1, '5. foo'),
        (2, ''),
        (3, '5. foo*'),
        (4, 'bar'),
        (5, 'bar*'),
        (6, 'baz'),
        (7, 'BAZ')
        ) rows (id, title)
ORDER BY
        REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g'), id

To its own group, as it's not sufficiently similar to the other items. That's why the question is about fuzzy grouping: it doesn't matter which of the variations the row ends up grouped with, it just matters that they are grouped at all.
Reinis I.
`Reinis I.`: *sufficiently similar* is usually not transitive, which means it's not groupable. If, say, `foo` is sufficiently similar to `for` and `for` is sufficiently similar to `bar`, but `foo` is not sufficiently similar to `bar`, then you cannot build any groups.
I'm not saying it can be done; I'm asking how to work around it.
Reinis I.
+2  A: 

At some point you are going to have to define what makes a set of values belong together in a group. If that's too hard, maybe you should prohibit the entry of fuzzy data or, if you must permit it, add a column that contains a sanitized version of the title for use by the grouping operations.
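As a rough sketch of the sanitized-column idea, assuming a hypothetical table `items` and a hypothetical column `title_key` (neither name comes from the question), and assuming "sanitized" means uppercased with every non-alphanumeric character removed:

```sql
-- Hypothetical table; the sample titles are the ones from the question.
CREATE TEMP TABLE items (id int, title text);
INSERT INTO items VALUES
    (1, '5. foo'), (2, ''), (3, '5. foo*'),
    (4, 'bar'), (5, 'bar*'), (6, 'baz'), (7, 'BAZ');

-- Store the sanitized key once instead of recomputing it in every query.
ALTER TABLE items ADD COLUMN title_key text;
UPDATE items
SET    title_key = regexp_replace(upper(title), '[^[:alnum:]]', '', 'g');

-- Group on the key; min(title) just picks one representative per group.
SELECT title_key, min(title) AS title, count(*) AS variations
FROM   items
GROUP  BY title_key
ORDER  BY title_key;
```

The key has to be kept in sync when titles change (for example with a trigger), and the sanitizing rule is only as good as its definition of which differences to ignore.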

Jonathan Leffler