I have a table with contents that look similar to this:
id | title
------------
1 | 5. foo
2 | 5.foo
3 | 5. foo*
4 | bar
5 | bar*
6 | baz
6 | BAZ
…and so on. I would like to group by the titles and ignore the extra bits. I know Postgres can do this:
SELECT * FROM (
SELECT regexp_replace(title, '[*.]+$', '') AS title
FROM table
) AS a
GROUP BY title
However, that's quite simple and would get very unwieldy if I tried to anticipate all the possible variations. So, the question is, is there a more general way to do fuzzy grouping than using regexp? Is it even possible, at least without breaking one's back doing it?
Edit: To clarify, there is no preference for any of the variations, and this is what the table should look like after grouping:
title
------
5. foo
bar
baz
I.e., the variations would be items that are different just by a few characters or capitalization, and it doesn't matter which ones are left as long as they're grouped.