diacritics

Unicode character categories in Ruby

Is there anything in Ruby that will return me an array of characters belonging to a certain Unicode category? In particular, I'd like to have the Mn category so that I can follow the advice on this answer. ...
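
The question asks about Ruby. As an illustrative sketch in Python instead, the standard `unicodedata` module can enumerate a general category by scanning every code point. The helper name `chars_in_category` is hypothetical, and the exact result depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata

def chars_in_category(category):
    """Return every character whose Unicode general category equals `category`.

    `chars_in_category` is a hypothetical helper name used for illustration.
    """
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    ]

# Mn = "Mark, nonspacing", the category that holds combining accents.
mn_chars = chars_in_category("Mn")
```

A scan like this walks over a million code points, so it is probably worth computing once and caching rather than calling per string.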

Accents on numbers in HTML (like a ^ over 1)

I'm trying to find the best way to put circumflex accents (ˆ = &circ;) on top of numbers (a musical notation) without resorting to images. Certain letters have equivalent HTML entities: ê = &ecirc;, Ô = &Ocirc;, etc., but numbers don't. Here is what I'm currently using on my website: <span style="position:relative;">1 <span style="p...

How to remove diacritics from text?

I am making a Swedish website, and the Swedish letters are å, ä, and ö. I need to make a string entered by a user URL-safe with PHP. Basically, I need to convert all characters to underscores, EXCEPT these: A-Z, a-z, 1-9; and all Swedish letters should be converted like this: 'å' to 'a', 'ä' to 'a', and 'ö' to 'o' (just remove the...
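
The question is about PHP. The same idea can be sketched in Python under a few assumptions: the function name `url_safe` is hypothetical, digits are taken to mean 0-9 rather than the 1-9 written in the question, and the accents are removed by decomposing to NFD and dropping nonspacing marks rather than through a hand-written table.

```python
import re
import unicodedata

def url_safe(text):
    """Drop combining marks, then replace anything outside A-Z, a-z, 0-9 with '_'.

    Hypothetical helper; 0-9 (not 1-9) is an assumption about the intent.
    """
    # NFD splits 'å' into 'a' + U+030A, 'ä' into 'a' + U+0308, and so on.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return re.sub(r"[^A-Za-z0-9]", "_", stripped)
```

For example, `url_safe("Blåbär och ö")` gives `"Blabar_och_o"`.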

iTextSharp diacritics

Hi, I am trying to typeset a PDF with the iTextSharp library, but I cannot find anywhere how to handle diacritics. Since I found the tables of contents of two books about iTextSharp in which diacritics have a section, I suppose it is doable. So the question is: how do I typeset "ř"? In addition, is there some guide or link about this problem? Than...

ToAscii/ToUnicode in a keyboard hook destroys dead keys.

It seems that if you call ToAscii() or ToUnicode() while in a global WH_KEYBOARD_LL hook, and a dead-key is pressed, it will be 'destroyed'. For example, say you've configured your input language in Windows as Spanish, and you want to type an accented letter á in a program. Normally, you'd press the single-quote key (the dead key), then...

Test if string contains only letters (a-z + é ü ö ê å ø etc..)

Hey, I want to match a string to make sure it contains only letters. I've got this and it works just fine: var onlyLetters = /^[a-zA-Z]$/.test(myString); BUT Since I speak another language too, I need to allow all letters, not just A-Z. Also for eg é ü ö ê å ø does anyone know if there is a global 'alpha' term that includes all ...
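
The question is about JavaScript. For comparison, here is a rough Python sketch of the same check: `str.isalpha()` accepts letters from any script, and normalizing to NFC first lets a base letter followed by a combining accent count as a letter when a precomposed form exists. The name `only_letters` is hypothetical.

```python
import unicodedata

def only_letters(text):
    """Return True if `text` is non-empty and consists only of letters.

    Hypothetical helper. NFC composes sequences such as 'e' + U+0301 into
    'é' so that the combining mark does not make isalpha() fail.
    """
    composed = unicodedata.normalize("NFC", text)
    return composed.isalpha()
```

Sequences with no precomposed equivalent would still be rejected by this sketch, since the leftover combining mark is not a letter.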

How to perform a case and diacritic insensitive filter of an array of NSDictionaries (not NSString)?

I have an array of dictionaries. I would like to filter that array by seeing if the @"name" field of each dictionary contains a given string. The catch is that I would like to make my filtering insensitive to case and diacritics. If the array contained only strings I could easily use an NSPredicate. However, it doesn't, and I don't s...

How to write a php search script in which words with diacritics match search terms without diacritics, and the results are underlined?

Hi all! I've got this site where there are lots of texts with diacritics in them (ancillary glyphs added to letters, according to wikipedia) and most people search these texts using words without the glyphs. Now it shouldn't be challenging to do this by having a copy of the texts without diacritics. However, I want to highlight the matc...

Why doesn't Đ get flattened to D when Removing Accents/Diacritics

I'm using this method to remove accents from my strings: static string RemoveAccents(string input) { string normalized = input.Normalize(NormalizationForm.FormKD); StringBuilder builder = new StringBuilder(); foreach (char c in normalized) { if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark...
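
The method above is C#. A Python sketch of the same decompose-and-filter approach makes the behavior easy to inspect: Đ (U+0110) has no decomposition in the Unicode character database, so normalization leaves it as a single letter and there is no nonspacing mark to drop. The names `remove_accents` and `has_decomposition` are hypothetical.

```python
import unicodedata

def remove_accents(text):
    """Decompose to NFKD and drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )

def has_decomposition(ch):
    """Return True if the character database lists a decomposition for `ch`."""
    return unicodedata.decomposition(ch) != ""

# 'é' decomposes into 'e' + U+0301, so the accent can be filtered out.
# 'Đ' has no decomposition: the stroke is part of the letter itself.
```

Here `remove_accents("é")` yields `"e"`, while `remove_accents("Đ")` returns `"Đ"` unchanged.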

iphone's nsxmlparser parsing RSS causes encoding problems

Hi, I'm working on a simple RSS reader. This reader loads data from the internet via this code: NSXMLParser *rss = [[NSXMLParser alloc] initWithURL:[NSURL URLWithString:@"http://twitter.com/statuses/user_timeline/50405236.rss"]]; My problem is with encoding. The RSS 2.0 file is supposed to be UTF-8 encoded according to the encoding attribute in the XML fi...

Replace diacritic characters with "equivalent" ASCII in PHP?

Related questions: http://stackoverflow.com/questions/2653739/how-to-replace-characters-in-a-java-string http://stackoverflow.com/questions/2393887/how-to-replace-special-characters-with-their-equivalent-such-as-a-for-a As in the questions above, I'm looking for a reliable, robust way to reduce any unicode character to near-equivalen...

Code to strip diacritical marks using ICU

Can somebody please provide some sample code to strip diacritical marks (i.e., replace characters having accents, umlauts, etc., with their unaccented, unumlauted, etc., character equivalents, e.g., every accented é would become a plain ASCII e) from a UnicodeString using the ICU library in C++? E.g.: UnicodeString strip_diacritics( Un...

Indexing and searching French text with diacritics in Lucene

I am using Lucene Search. I have uploaded a French file with the following content. french.txt multimédia francophone pour l'enseignement du français langue étrangère If I search for francophone then the file appears in the search results. Now when I search for the word multimédia or français or étrangère it does not show any result. I have tried to ...

Python regex \w doesn't match combining diacritics?

I have a UTF8 string with combining diacritics. I want to match it with the \w regex sequence. It matches characters that have accents, but not if there is a latin character with combining diacritics. >>> re.match("a\w\w\wz", u"aoooz", re.UNICODE) <_sre.SRE_Match object at 0xb7788f38> >>> print u"ao\u00F3oz" aoóoz >>> re.match("a\w\w\wz...
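
One possible workaround, sketched here in Python 3 syntax (the excerpt uses Python 2), is to normalize the text to NFC before matching so that a base letter plus combining mark becomes a single precomposed letter that `\w` accepts. The helper name `match_composed` is hypothetical, and this only helps for sequences that have a precomposed form.

```python
import re
import unicodedata

def match_composed(pattern, text):
    """Match `pattern` against the NFC-normalized form of `text`.

    Hypothetical helper: NFC turns 'o' + U+0301 into the single letter 'ó'.
    """
    return re.match(pattern, unicodedata.normalize("NFC", text))

# 'o' followed by COMBINING ACUTE ACCENT (U+0301), a decomposed 'ó'.
decomposed = "aoo\u0301oz"

direct = re.match(r"a\w\w\wz", decomposed)            # combining mark is not \w
composed = match_composed(r"a\w\w\wz", decomposed)    # matches after NFC
```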

How to normalize CodePage to Unicode Form C when diacritic precedes and accent not combining form

I would like to be able to say "Normalize this string by forcing diacritic accents into their combining form". Details: My code is being developed in C# but I don't believe the issue to be language specific. There are two problems with my data (1) the diacritic is preceding the base character in this data (it needs to follow the base ...

Foreign characters lose their diacritics

I'm trying to internationalize the questions in our survey tool, but when I insert some translated strings, SQL Server seems to strip off some, but not all, diacritics... Example: (Lithuanian) Ar jūsų darbas reikalauja, kad jūs įgytumėte naujų žinių ir įgūdžių? Becomes Ar jusu darbas reikalauja, kad jus igytumete nauju žiniu ir igudž...

Removing diacritics in Polish

Hi. I'm trying to remove diacritic characters from a pangram in Polish. I'm using code from Michael Kaplan's blog http://blogs.msdn.com/b/michkap/archive/2007/05/14/2629747.aspx, but with no success. Consider the following pangram: "Pchnąć w tę łódź jeża lub ośm skrzyń fig.". Everything works fine but for letter "ł", I still get "ł". ...
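
The letter ł (U+0142) has no Unicode decomposition, so stripping combining marks cannot turn it into "l". One way to handle such letters, sketched in Python with a hypothetical and deliberately incomplete `FALLBACK` table, is to apply an explicit mapping after the marks have been removed.

```python
import unicodedata

# Hypothetical fallback table for letters that have no decomposition.
# It is illustrative only and far from complete.
FALLBACK = str.maketrans({
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
    "ø": "o", "Ø": "O",
})

def strip_diacritics(text):
    """Drop combining marks, then map the remaining special letters."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_marks.translate(FALLBACK)
```

With this sketch, the pangram from the question becomes "Pchnac w te lodz jeza lub osm skrzyn fig.".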

Ignoring diacritics when ordering alphabetically

Hello. I'm making a Java app that receives some names from SQLite and puts them on a listbox. The thing is I want it to be accurately ordered in an ascending alphabetical way (In Portuguese to be specific). These entries, for example: Beta Árida Ana Should be ordered as: Ana Árida Beta But since it orders in some ASCII order, the "a...
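
The question concerns Java and SQLite. As a rough illustration in Python, sorting on a key with the accents removed and the case folded produces the order asked for. The names `sort_key` and `sort_names` are hypothetical, and this is only an approximation of real locale-aware collation, which a proper collator would handle more accurately.

```python
import unicodedata

def sort_key(name):
    """Build a comparison key: accents removed, case folded."""
    decomposed = unicodedata.normalize("NFD", name)
    base = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return base.casefold()

def sort_names(names):
    """Return `names` sorted as if accented letters were their base letters."""
    return sorted(names, key=sort_key)

names = ["Beta", "Árida", "Ana"]
by_code_point = sorted(names)     # 'Á' sorts after 'B' by code point
accent_blind = sort_names(names)  # 'Á' is compared as 'a'
```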

Diacritics alphabetical ordering in C#

Hello! I want to know how do you perform a reliable alphabetical ordering (for a listbox) of people's full names with the diacritics of the language in C sharp? Thanks in advance. Q: So you just want to treat diacritics as the "original" letter? (eg: João is the same as Joao)? – NullUserException A: I want to treat them as they should...

Unicode characters show differently in different browsers

So... I'm still in Unicode hell... New problem... On my computer, everything shows perfectly. In all browsers. On a co-worker's computer, same story. Everything is good. Even in elinks and w3m on one of my Linux VPSes all the exotic diacritics of Lithuanian and Latvian, and Nordic letters, show perfectly. However, I have had a few ca...