Hi.

I'm trying to remove diacritic characters from a pangram in Polish. I'm using code from Michael Kaplan's blog (http://blogs.msdn.com/b/michkap/archive/2007/05/14/2629747.aspx), but with no success.

Consider the following pangram: "Pchnąć w tę łódź jeża lub ośm skrzyń fig.". Everything works fine except for the letter "ł": I still get "ł". I guess the problem is that "ł" is represented as a single Unicode character and there is no following NonSpacingMark.

Do you have any idea how I can fix it (without relying on a custom mapping in some dictionary - I'm looking for some kind of Unicode conversion)?

+2  A: 

The approach taken in the article is to remove Mark, Nonspacing characters. Since, as you correctly point out, "ł" is not composed of two characters (one of which is a Mark, Nonspacing), the behavior you see is expected.
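For illustration, here is a minimal Java sketch of that decompose-then-strip approach (the article itself uses .NET; `MarkStripper` and `stripMarks` are hypothetical names chosen for this example). It relies on `java.text.Normalizer` and the regex category `\p{Mn}`, and uses Unicode escapes for the Polish letters to avoid source-encoding problems.

```java
import java.text.Normalizer;

class MarkStripper {
    // Decompose to NFD, then drop every nonspacing mark (Unicode category Mn).
    static String stripMarks(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}", "");
    }

    public static void main(String[] args) {
        // The pangram from the question, written with escapes.
        // U+0142 is the Polish l with stroke.
        String pangram =
            "Pchn\u0105\u0107 w t\u0119 \u0142\u00f3d\u017a je\u017ca lub o\u015bm skrzy\u0144 fig.";
        System.out.println(stripMarks(pangram));
    }
}
```

Every accented letter loses its mark, but U+0142 passes through unchanged because normalization never splits it into a base letter plus a mark.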

I don't think that the structure of Unicode allows you to accomplish a fully automated remapping (the author of the article you reference reaches the same conclusion).

If you're just interested in Polish characters, at least the mapping is small and well-defined (see e.g. the bottom of http://www.biega.com/special-char.html). For the general case, I do not think an automated solution exists for characters that are not composed of a standard character plus a Mark, Nonspacing character.

Eric J.
+2  A: 

It is in the Unicode chart, code point \u0142. Scroll down to the description, "Latin small letter l with stroke": it has no decomposition listed. I don't know anything about Polish, but it is common for a letter to have a distinguishing mark that makes it its own letter instead of a base letter with a diacritic.
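One way to check this without opening the chart is to ask the runtime directly. The following is a rough Java sketch (`DecompositionCheck` and `hasNoDecomposition` are made-up names for this example) that compares a code point with its NFD form and prints its Unicode name via `Character.getName`.

```java
import java.text.Normalizer;

class DecompositionCheck {
    // True when NFD leaves the single code point unchanged,
    // i.e. no canonical decomposition applies to it.
    static boolean hasNoDecomposition(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    public static void main(String[] args) {
        // U+0142 (l with stroke) versus U+0105 (a with ogonek).
        for (int cp : new int[] {0x0142, 0x0105}) {
            System.out.println(Character.getName(cp) + ": " + hasNoDecomposition(cp));
        }
    }
}
```

Note that plain ASCII letters also report no decomposition, so the check only tells you that stripping marks after normalization cannot help for that character.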

Hans Passant
+1  A: 

There are quite a few precomposed characters that have no meaningful decompositions.

(There are also a handful that could have reasonable decompositions but are prohibited from such decomposition in most normalisation forms, as it would lead to differences between versions, which would make them not really normalisation any more.)

ł is one of these. IIRC it's also not possible to give a culture-neutral transcription for alphabets that don't use ł. I think Germans tend to transcribe it as w rather than l (or maybe it's someone else who does), which makes sense (it's not quite the right sound either, but it's closer than l).

Jon Hanna
+1  A: 

You'll have to replace these manually (just like with ÆÐØÞßæðøþ in Latin-1).

Other people have had the same problem, so the Unicode Common Locale Data Repository has "Agreed to add a transliterator that does accent removal, even for overlaid accents." (Ticket #2884)
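Until such a transliterator is available to you, manual replacement can be layered on top of normalization. The sketch below is one possible way to do it in Java; the class name `AsciiFolder`, the method `fold`, and the particular replacements in the table (for example "ss" for ß or "th" for þ) are assumptions made for this example, not a standard mapping.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

class AsciiFolder {
    // Hand-picked replacements for letters that NFD does not split.
    // These choices are illustrative assumptions, not a standard.
    private static final Map<Character, String> MANUAL = new HashMap<>();
    static {
        MANUAL.put('\u0142', "l");   // l with stroke
        MANUAL.put('\u0141', "L");   // L with stroke
        MANUAL.put('\u00f8', "o");   // o with stroke
        MANUAL.put('\u00d8', "O");   // O with stroke
        MANUAL.put('\u00e6', "ae");  // ae ligature
        MANUAL.put('\u00c6', "AE");  // AE ligature
        MANUAL.put('\u00df', "ss");  // sharp s
        MANUAL.put('\u00f0', "d");   // eth
        MANUAL.put('\u00fe', "th");  // thorn
    }

    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            // Step 1: drop nonspacing marks produced by the decomposition.
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue;
            }
            // Step 2: apply the manual table to whatever is left.
            String replacement = MANUAL.get(c);
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }
}
```

The loop works on `char` values, so it only covers the Basic Multilingual Plane, and the table has to be extended by hand for each script you care about.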

dan04
A: 

Here is my quick implementation of a Polish stop list with normalization of Polish diacritics.

import java.util.HashSet;

class StopList
{
    private HashSet<String> set = new HashSet<String>();

    public void add(String word)
    {
        word = word.trim().toLowerCase();
        word = normalize(word);
        set.add(word);

    }

    public boolean contains(final String string)
    {
        return set.contains(string) || set.contains(normalize(string));
    }

    private char normalizeChar(final char c)
    {
        switch ( c)
        {
            case 'ą':
                return 'a';
            case 'ć':
                return 'c';
            case 'ę':
                return 'e';
            case 'ł':
                return 'l';
            case 'ń':
                return 'n';
            case 'ó':
                return 'o';
            case 'ś':
                return 's';
            case 'ż':
            case 'ź':
                return 'z';
        }
        return c;
    }

    private String normalize(final String word)
    {
        if (word == null || "".equals(word))
        {
            return word;
        }
        char[] charArray = word.toCharArray();
        char[] normalizedArray = new char[charArray.length];
        for (int i = 0; i < normalizedArray.length; i++)
        {
            normalizedArray[i] = normalizeChar(charArray[i]);
        }
        return new String(normalizedArray);
    }
}

I couldn't find any other solution on the Net. So maybe it will be helpful for someone (?)

Michal_R
Except for `ł`, all of those characters just have [diacritics](http://en.wikipedia.org/wiki/Diacritic) (I see at least ogonek, acute and dot) and could easily be normalized using `Normalize`. I'd suggest combining the two methods.
BalusC
Is Normalize a .NET library? Sorry ... that's a snippet from my Java code :) And by "Net" I meant the Internet, not ".NET".
Michal_R