views:

110

answers:

4

Hello, I am writing a program to sort Spanish words.The letters are almost the same as the English alphabet, only with a few exceptions.

a,b,c,ch,d,e,f,g,h,i,j,k,l,ll,m,n,ñ,o,p,q,r,rr,s,t,u,v,w,x,y,z

Further, for this problem, assume that any pair of characters which can represent a letter does; for example, the combination ch would always mean the letter ch, not the letter c followed by the letter h.

Now if there wasn't the extra letters, I would be able to sort this easy, but can you guys help me think of an algorithm to help me sort it?

+8  A: 

Typically, language-specific sorting should be done using a Collator for string comparison. For Spanish, you might use:

Collator collator = Collator.getInstance(new Locale("es", "ES"));

If this is homework, though, I imagine you'd need to come up with something yourself.

ColinD
Thanks for teaching me something.
San Jacinto
A: 

You're going to need to account for the specifics of the spellings in the language to decide whether or not "ll", for instance, is "l" "l" or indeed is "ll". This much is obvious to you.

My point in bringing it up is that this is the crux of the problem. You're going to need to pre-process your input so that these ambigous double-letters are encoded as one letter. With a store medium like plain old ASCII text files, this will be impossible.

Your other option is to rely on some statistical liklihood in determining whether a letter pairing is 1 spanish letter or two. By taking into context the contents of the entire word, you can use prior probabilities to determine this. A Bayesian tecnique may work well.

San Jacinto
It would be assumed that a combination of double letters such as "ll" "ch", or "rr", would always mean the Spanish letter of the alphabet, not two different letters. This is one of the requirements.
Zerobu
..then I'm not understanding the problem... can't you just look 1 letter ahead until the end of the word/punctuation to see if this is a letter pair or an ordinary letter?
San Jacinto
OH!! I get what you're saying. You don't know what to do with the fact that there are EXTRA letters because you only know how to do an ASCII comparison. Just add the strings to a hash table or other associative container like a simple array, and the values would be the place in the alphabet.
San Jacinto
The "rr" is a letter in Spanish?
belisarius
+1  A: 

I would simply map each letter (starting with the combinations) as a 2 digit number (Starting from 10).

a - 10 b - 11 c - 12 ch - 13 d - 14 etc

The trick is to search for the paired letters (ch, ll, rr) first before you search for the single letter ones.

So - taking a word such as llave the steps would be

23ave 2310ve 231035e 23103515

If you sort the numbers as Strings (so that 1111 comes before 90) then that should produce the correct order.

If you can do a 'sort on' then just pair the number with the original word. Use the number you created as your sort key.

If you can't do a 'sort on' then you'll need to break the number back to 2 digit codes and convert those back to letters after you've sorted.

Stray
how do you prove that the first step is 23ave and not 21lave?
San Jacinto
Because of the find/replace order.I could be wrong - my spanish is pretty rough and only extends to a basic use of the language - but I understand that the ll will always mean ʎ, and never just l+l. Similarly, rr is always the double rr, not just two r characters that happen to be together. l + l would be invalid, just as qq wouldn't be a valid combination in English. This may be an over simplification (it's certainly true for all the spanish words I know) but he also stated that the assumption was ok for his problem.
Stray
A: 

try first parsing each word into an array or list of letter groups, then sorting by comparing the parsed letter groups.

Jason S