views:

68

answers:

3

Hi all,

I am doing a "ALPHABETICAL ORDER SEARCH" module for a project.

that is it will look like

A B C D E F . . . . . . . . . .. . . . . . . .. . . . Z

When i click on "A" the results should be sort by "A". Which is same for all the alphabets.

Now my prob is as follows:

  • For example there is a film named "The Mummy".

  • What i do is when click on the "ALPHABET T" this corresponding film will be sort.

  • But my client requirement is that "The Mummy" movie must sort when the user clicks on "M" and not "T"

  • Because "a, an, the" are "ARTICLES" and it does not have any meanings.

I hope now that everybody can understood what my problem is....

Any help will be appreciable and thankful.

thanks in advance

+2  A: 

Assuming that you do not wish to modify the content of the table (and hence get slightly less efficient queries), the following should do the trick.
(If you do have the leisure of modifying the table, see the suggestions at the end of this answer)

SELECT Title
FROM myTable
WHERE (Title LIKE 'x%' OR Title LIKE 'THE x%')
  -- AND Title NOT LIKE 'THE [^T]%'   
ORDER BY Title

Notes:
- x designate the desired letter (example: LIKE 'A%' etc.)
- The "AND TITLE NOT LIKE" extra condition is only needed when "X" is the letter "T" (it is otherwise functionally redundant, but doesn't change the result)
- I'm unsure of the support of the the [^xyz] (i.e. NOT characters x, y or z), so the [^T] could be replaced by its positive equivalent say [A-RS-Z0-9].

There are a few other stop words to consider ("A", "AN", "OF"...) but for book or film titles, it is a common practice to only consider "THE". If you must deal with the other articles, the logic can be extended as in:

SELECT Title
FROM myTable
WHERE (Title LIKE 'x%' 
    OR Title LIKE 'THE x%' 
    OR Title LIKE 'A x%' 
    OR Title LIKE 'AN x%') 
 -- the following is only needed when "x" is either the letter T or A.
 -- AND (Title NOT LIKE 'THE [^T]%' 
 --      AND Title NOT LIKE 'A [^A]%' 
 --      AND Title NOT LIKE 'AN [^A]%'
 --  )
ORDER BY Title



There are better solutions, if you can modify the table's contents. Some of these imply pre-computing one or several extra columns (and maintaining it/these, when new records are added etc.).

  • See for example Cletus' answer in this post for the "sort_column" approach, where the extra column contains the title stripped of any undesirable leading noise-word. In addition to its purpose as the filtering field in the initial letter search problem of the OP, this column can also be used to sort, in a more friendly/sensible way, lists of titles which were produced by a filter unrelated to the initial letter and/or the begining of the title (say a search by year).
  • A variation on the above is to only store the "effective" initial letter (the one past the undesired noise), making for a smaller column, but a less versatile one too.
  • The title column itself can be updated, storing a modified form of the title, whereby the extraneous leading noise-word(s) is moved to the end of the string, between parenthesis. This practice is quite common with bibliographic-type catalogs.
mjv
hi.. many thanks for your effort. And i got the result for "LIKE" AN didn't get the result for "NOT LIKE"... any changes to be done.. My query for "T" is Select * from films Where film_status='Active' AND (user_id='3' OR distributor='3') AND (film_name LIKE 'T%' OR film_name LIKE 'THE T%' OR film_name LIKE 'A T%' OR film_name LIKE 'AN T%' OR film_name LIKE 'The T%' OR film_name LIKE 'An T%') AND (film_name NOT LIKE 'THE [^T]%' AND film_name NOT LIKE 'The [^T]%' AND film_name NOT LIKE 'A [^T]%' AND film_name NOT LIKE 'AN [^T]%' ) ORDER BY film_name
Fero
is the above query right?
Fero
@Fero: yes, the query in your comment looks right. Note that there are ways to make mySQL handle text in a case-insensitive fashion. This should spare you all these extra condition with "The" and such.
mjv
thanks mjv.. got it
Fero
+2  A: 

What you really are asking here is how to remove "stop words" ("the" is just one example; you will want to remove "of", "a", etc). Trying to hard-code the set of stop words is a HUGE pain in the butt, and as your corpus changes, you'll have to change the code.

Instead, you should try to use an algorithm that will deduce what the stop words are based on your corpus. Algorithms to do this kind of a thing are well known and are employed by search engines. One that works very well is called TF/IDF

Alex
+1 for mentioning the phrase 'stop words', as that is a key to unlock this problem (and have fruitful SO/google fu)
instanceofTom
+1 This answers my question of "What about German/French/etc articles?"
Frank Shearar
+1  A: 

Basically how you do this is you have an extra column for sorting. If you have a movie table with a name column, add another column called sort_name. That should contain the movie title in lowercase with any words that you want to ignore from the front removed (eg "the", "a").

Don't try and do this dynamically.

When the field is updated you'll also have to update the sort_name column. You can rebuild it at any time and you will of course have to index it. Then just do:

SELECT *
FROM movies
WHERE sort_name LIKE 'a%'
cletus