views:

120

answers:

1

I have a list of artists, albums and tracks that I want to sort using the first letter of their respective name. The issue arrives when I want to ignore "The ", "A ", "An " and other various non-alphanumeric characters (Talking to you "Weird Al" Yankovic and [dialog]). Django has a nice start '^(An?|The) +' but I want to ignore those and a few others of my choice.

I am doing this in Django, using a MySQL db with utf8_bin collation.

EDIT

Well my fault for not mentioning this but the database I am accessing is pretty much ready only. It's created and maintained by Amarok and I can't alter it without a whole mess of issues. That being said the artist table has The Chemical Brothers listed as The Chemical Brothers so I think I am stuck here. It probably will be slow but that's not so much of a concern for me as it's a personal project.

+4  A: 

What you are asking for probably isn't what you need. You probably don't want to sort by just the first letter. If the first letter is the same then you would normally also want to look at the second letter, etc. This will cause all songs by the same artist to be grouped together when you sort by artist.

Updated answer

You said you weren't allowed to change the database. Then you can use TRIM(LEADING ... FROM ...) to strip off the uninteresting words, but note that this will be slow as the query will not be able to use an index on the column.

SELECT *
FROM song
WHERE SUBSTRING(TRIM(LEADING 'The ' FROM TRIM(LEADING 'A ' FROM title)), 1, 1) = 'B'
ORDER BY TRIM(LEADING 'The ' FROM TRIM(LEADING 'A ' FROM title))

Result:

'The Bar'   -- "The" is ignored when sorting.
'Baz A'    

Test data:

CREATE TABLE song (title NVARCHAR(100) NOT NULL);
INSERT INTO song (title) VALUES
('The Bar'),
('Baz A'),
('Foo'),
('Qux'),
('A Quux');

Original Answer

Also note that if you ORDER BY a function of a column it will be really slow when you have a lots of records as the index on that column can't be used. Instead you should store another column where you remove all uninteresting words (the, an, etc..) and order by that column. You can either insert into that column from your application when you insert the row, or else use a trigger in the database.

Mark Byers
Woo-hoo, at least +1 my earlier comment.
Hamish Grubijan
@Hamish: I didn't see your comment, I'll +1 it. Yes an `artist_prefix` and `artist` column would work and avoid duplicating data in the database. Why didn't you submit it as an answer? Then I would have seen it and +1'ed it. I don't think StackOverflow notifies you if someone post a comment while you are writing an answer, but it does (sometimes) if someone posts an answer.
Mark Byers
My fault my fault, I did not mention and obviously I should of, this database is read-only and I cannot alter it. I have three tables, tracks, artists and albums. I just need "Weird Al" to be returned when I search W and The Chemical Brothers when I search C.
TheLizardKing
I like it but I was hoping to use this sort of filtering in the form of a regex. I would be able to use that both in sorting in MySQL and in Python. For that matter, most any language supporting regex.
TheLizardKing
@TheLizardKing: It can be in the sorting, or the filtering, or both. I'll update my answer to include an example of searching for 'C'.
Mark Byers
@TheLizardKing: And this technique will work in Python too: just use `lstrip` instead of `TRIM`.
Mark Byers
Haha, I usually don't modify a question to fit the answer but this is very useful to a lot of people I would imagine. I am still looking for my regex but I will look else where.
TheLizardKing
@TheLizardKing: If you *really* want a regex, try something like `^(?:The |A |An )?(.)`. This will work in Python, but probably will need some modification to work in MySQL. Python example: `re.match('(?:The |A |An )?(.)', 'The Bar').groups()`
Mark Byers
So I should really be figuring out how to get the regexp to work in MySQL. I'm getting a lot of `Got error 'repetition-operator operand invalid' from regexp` so I suppose this is going to be a new learning experience for me.
TheLizardKing
@TheLizardKing: I think you will soon discover that your previous comment `I was hoping to use this sort of filtering in the form of a regex. I would be able to use that both in sorting in MySQL and in Python.` is too optimistic. The regex engine in Python and MySQL are completely different beasts. You can't just take a regex from one and use it in the other. They have different syntax. See the manuals. http://docs.python.org/library/re.html vs http://dev.mysql.com/doc/refman/5.1/en/regexp.html#operator_regexp
Mark Byers
It's sad, I am so close `SELECT * FROM `artists` WHERE `name` REGEXP '^(The |An? )C';` Matches `The Chemical Brothers but not `Cat Empire`, SIGH. I need a way to negate the (The |An? )
TheLizardKing
@TheLizardKing: Did you try a `?` after the closing parenthesis, like in my Python example?
Mark Byers
I actually went with your first TRIM example. You were absolutely correct about me not seeing my real issue here. Pulling the rows was easy, ordering them is the challenge. I'm thinkin' instead of relying on Amarok to be my mp3 maintainer I should try to find/write a python|perl|php script that would use my db of choice (PostgreSQL) with my own schema with a proper artist_sort field. Thanks for all your help, I should of seen this as much more of a problem than I thought. If you happen to know of a script/daemon/application that does just as you recommend though please tell.
TheLizardKing
@TheLizardKing: You're welcome. I don't happen to know of any standard script etc. that does this, but should you find one you could post it here so others can learn about it. Note that if you choose to use PostgreSQL you can index on a function of your data: http://www.postgresql.org/docs/7.3/static/indexes-functional.html - You could use this to create a sorted index of your column without having two columns, and with no need for triggers.
Mark Byers