views:

245

answers:

2

Hi,

The question is about Armenian. I'm using sql server 2005, collation SQL_Latin1_General_CP1_CI_AS, data mostly is in Armenian and we can't use unicode.

I tested on ms sql 2008 with a windows collation for armenian language ( Cyrillic_General_100_ ), I have found here, ( http://msdn.microsoft.com/en-us/library/ms188046.aspx ) but it didn't help.

I have a function, that orders hex values and a lower function, which takes each char in each string and converts it to it's lower form, but it's not acceptable solution, it works really slow, calling that functions on every column of a huge table.

Is there any solution for this issue not using unicode and not working with hex values manually?

UPDATE:

On the left side are mixed case words, sorted in the right order and with lower case representations on the right side. Hope this will help. Thank You. Words are written in unicode.

  1. ԱբԳդԵզ -> աբգդեզ

  2. ԱգԳսԴԼ -> ագգսդլ

  3. ԲաԴֆդԴ -> բադֆդդ

  4. ԳԳԼասա -> գգլասա

  5. ԴմմլօՏ -> դմմլօտ

  6. ԵլԲնՆն -> ելբննն

  7. ԶՎլուտ -> զվլուտ

  8. էԹփձջՐ -> էթփձջր

  9. ԸխԾդսՂ -> ըխծդսղ

  10. ԹԶէըԿր -> թզէըկր

+2  A: 

One solution would be to create a computed column for each text column which converts the value into Armenian collation and sets it to lower case like so:

Alter Table TableName
    Add TextValueArmenian As ( LOWER(TextColumn COLLATE Latin1_General_CI_AS) ) PERSISTED

Once you do this, you can put indexes on these columns and query for them.

If that isn't your flavor of tea, then another solution would be an indexed view where you create a view with SCHEMABINDING that casts each of the various columns to lower case and to the right collation and then put indexes on that view.

EDIT I notice in your examples, that your are using a Case-insensitive, Accent-sensitive. Perhaps the simple solution to your ordering issues would be to use Latin1_General_CS_AS or Cyrillic_General_100_CS_AS if available.

EDIT

Whew. After quite a bit of research, I think I have an answer which unfortunately may not be you will want. First, yes I can copy the text you provided into code or something like Notepad++ because StackOverflow is encoded using UTF-8 and Armenian will fit into UTF-8. Second, this hints at what you are trying to achieve: storing UTF-8 in SQL Server. Unfortunately, SQL Server 2008 (or any prior version) does not natively support UTF-8. In order to store data in UTF-8, you have a handful of choices:

  1. Store it in binary and convert it to UTF-8 on the client (which pretty much eliminates any sorting done on the server)
  2. Store it in Unicode and convert it to UTF-8 on the client. It should be noted that the SQL Server driver will already convert most strings to Unicode and your example does work fine in Unicode. The obvious downside is that it eats up twice the space.
  3. Create a CLR user defined type in SQL Server to store UTF-8 values. Microsoft provides a sample that comes with SQL Server to do just this. You can download the samples from CodePlex from here. You can also find more information on the sample in this article in the Books Online. The downside is that you have to have the CLR enabled in SQL Server and I'm not sure how well it will perform.

Now, that said, I was able to get you sample working with no problem using Unicode in SQL Server.

If object_id('tempdb..#Test') Is Not Null
    Drop Table #Test
GO
Create Table #Test
(
    EntrySort int identity(1,1) not null
    , ProperSort int 
    , MixedCase nvarchar(50)
    , Lowercase nvarchar(50)
)
GO
Insert #Test(ProperSort, MixedCase, Lowercase)
Select 1, N'ԱբԳդԵզ',N'աբգդեզ'
Union All Select 6, N'ԵլԲնՆն',N'ելբննն'
Union All Select 2, N'ԱգԳսԴԼ',N'ագգսդլ'
Union All Select 3, N'ԲաԴֆդԴ',N'բադֆդդ'
Union All Select 4, N'ԳԳԼասա',N'գգլասա'
Union All Select 5, N'ԴմմլօՏ',N'դմմլօտ'
Union All Select 9, N'ԸխԾդսՂ',N'ըխծդսղ'
Union All Select 7, N'ԶՎլուտ',N'զվլուտ'
Union All Select 10, N'ԹԶէըԿր',N'թզէըկր'
Union All Select 8,N'էԹփձջՐ',N'էթփձջր'

Select * From #Test Order by ProperSort
Select * From #Test Order by Lowercase
Select * From #Test Order by Lower(MixedCase)

All three of these queries return the same result.

Thomas
@Thomas, yes, it's also a solution, if we talk only about lower, but the first way isn't easy, there're too many tables with multiple columns with this kind of data, and even there're up to 10 db's, but writing a view is a better way. It doesn't change Tables structure. Two more things.1) I can't use sql lower, cause it doesn't work right, I wrote my own armLower. 2) What about order by? It doesn't work. Solution could be storing not the lower values, but columns' hex values and using my own function, that would convert that hex values, compare them..
hgulyan
@Thomas, but I need a way to use lower and order by with my data.
hgulyan
The collation should properly handle calling ORDER BY on the column. Remember that there are multiple variants of collations: Accent Sensitive, Accent Insensitive, Case-sensitive, Case-insensitive etc. The collation you mentioned is Case-insensitive and Accent sensitive. Perhaps switching to Cyrillic_General_100_CS_AS would solve the problem?
Thomas
If you cannot use SQL Server, what database are you using?
Thomas
I've tried CS_AS on SQL Server 2008, but it doesn't work. It can't figure out, that upper of a is A, b is B, it doesn't know that b goes after a (for Armenian alphabet). p.s. where did I write, that I can't use Sql Server? I'm using MS SQL 2005 and 2008.
hgulyan
No, nor the Cyrillic_General_100 neither Latin1_General don't work. Are you sure, that it should work?
hgulyan
RE: Use of SQL, I must have misread your first post. RE: doesn't b go after a in all alphabets? RE: sorting, if you add a column and manually enter lowercase words, does SQL sort that properly?
Thomas
One thing you could do that would help, can you post 10 mixed case words, their lowercase equivalent and a number the represents the order in which you expect to sort them?
Thomas
No, the sort isn't working no matter I enter words manually or not.But how would 10 words help? Can you test it?:)
hgulyan
Yeah. It would give me something to test against.
Thomas
But you can't just copy-paste the unicode text. Isn't that so?
hgulyan
Expanded my answer. I can copy-paste the text because the site is coded as UTF-8 and Armenian fits into that.
Thomas
Great job, Thomas, I didn't know about clr types. Thank you!I'll try suggested choices. p.s. storing data in utf-8 format won't help me, yeah?:)
hgulyan
@hgulyan - Will a CLR type be better than storing in binary? Absolutely. Will it be better than storing in Unicode? Hard to know. My bet would be that using the native Unicode type (nvarchar) would be faster even though it is using twice the space.
Thomas
Storing data in binary or unicode isn't accessible. I'll read more about CLR types, maybe it will be a good way.I've read that for using windows collation you need to change regional settings to Armenia on the server computer. Do you think it could solve the problem?
hgulyan
@hgulyan - I don't think that switching to a Windows collation will make any difference. I'm a bit surprised that switching Unicode is not an option. Remember that when you pass the data to SQL Server via whatever driver you are using, it is cast to Unicode. Even if you use a UTF-8 CLR type, it will go from your presentation layer, to Unicode to SQL Server where the CLR type will convert it from Unicode back to UTF-8. Either way, it sounds like UTF-8 should solve your problem.
Thomas
@Thomas, damn, I wrote a comment and it's just gone. I wrote, that I can't use unicode not because it's not a good solution, but that it requires huge changes in db structure and in all programs, besides that some programs send some data to another server, which doesn't accept unicode, what means that we need to convert it back to utf-8, so I was thinking of an easier way, smth like using view or a script or smth other way that doesn't require changes in tables structure. I didn't know that sql server converts data to unicode. That's interesting. Thank you:)
hgulyan
A: 

Did you get an error like this?

Msg 448, Level 16, State 1, Line 1
Invalid collation 'Cyrillic_General_100_'.

Try:

ORDER BY Name COLLATE Cyrillic_General_100_CI_AS

Or pick one you prefer from this list:

select * from fn_helpcollations()
where name like 'Cyrillic_General_100_%'
Peter
@peter, I've tried it, but it's not working. LOWER and ORDER BY doesn't work right for armenian language.
hgulyan
I've tried to configure db collation, table collation, added collate in a query, as you need and on LOWER, but I get wrong results.Is it possible, that sql server doesn't have a solution for armenian?
hgulyan
If you post some Armenian (create table, insert values etc), perhaps someone could take a stab at it. Include the order the results should appear in and the lower forms.
Peter
Ok, but how to place it here? If I write with unicode, I can see letters, but if it's ascii, it views smth like this ²ñÙ»Ý Thank you for your suggestion :)
hgulyan
You would have to post it in Unicode; but then the conversion to ASCII in the script would probably be wrong. Good luck!
Peter
It would be wrong, indeed, and you need an armenian font to see the query results.
hgulyan