I have a UTF-8 encoded string that I get from reading a PDF, and I am trying to strip out some characters that represent spaces but are not encoded as the standard 0x20 space. My problem is that each of these characters is represented by 3 bytes in UTF-8, and I can't figure out how to get that into a string or character so I can do a replace. The two UTF-8 characters I am trying to replace are 0xE28087 and 0xE28088.

I have tried Chr and ChrW, which only take integer parameters up to about 65,000 (presumably items that can be represented in a single byte in UTF-8).

I also tried using System.Text.Encoding.UTF8.GetChars() with the byte representation of my characters, but the result seems to be 4 chars instead of just one, i.e. it is interpreting my 3-byte character as separate one-byte characters.

    Dim ResultChars() As Char
    Dim bytes() As Byte
    Dim SpaceChar As Int32

    SpaceChar = Integer.Parse("E28087", Globalization.NumberStyles.HexNumber)
    bytes = BitConverter.GetBytes(SpaceChar)
    ResultChars = System.Text.Encoding.UTF8.GetChars(bytes)
    For Each ResultChar In ResultChars
        Debug.WriteLine(ResultChar)
    Next

What I am trying to do in pseudocode is simply: ConvertedText = ConvertedText.Replace(StringOrCharofThisUnicodeCharacter("0xE28087"), " ")

A:

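You're confusing code points with their UTF-8 encoding. The bytes 0xE2 0x80 0x87 are the UTF-8 encoding of the code point U+2007 (figure space), and 0xE2 0x80 0x88 encodes U+2008 (punctuation space). Internally, all .NET strings use UTF-16, so you just need to specify the Unicode code point, not the UTF-8 byte data: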

    Const FigureSpaceChar As Char = ChrW(&H2007)

Codepoint from www.fileformats.info.
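To make the difference concrete, here is a minimal sketch (not part of the original answer) that decodes the three bytes in stream order and then does the replacement by code point. The module name, variable names and sample text are made up for illustration, and it assumes the second character, 0xE28088, is U+2008 (punctuation space).

```vb
Imports System
Imports System.Text

Module FigureSpaceDemo
    Sub Main()
        ' The three UTF-8 bytes from the question, in the order they appear in the data.
        ' (BitConverter.GetBytes on an Int32 yields four bytes, and on a
        ' little-endian machine they come out reversed, so it does not give this.)
        Dim utf8Bytes() As Byte = {&HE2, &H80, &H87}

        ' Decoding the bytes yields a single Char: U+2007.
        Dim decoded As String = Encoding.UTF8.GetString(utf8Bytes)
        Console.WriteLine(decoded.Length)                    ' 1
        Console.WriteLine(AscW(decoded(0)).ToString("X4"))   ' 2007

        ' Hypothetical sample text containing both special spaces.
        Dim convertedText As String = "12" & ChrW(&H2007) & "34" & ChrW(&H2008) & "56"

        ' Replace each special space with an ordinary 0x20 space.
        convertedText = convertedText.Replace(ChrW(&H2007), " "c).Replace(ChrW(&H2008), " "c)
        Console.WriteLine(convertedText)                     ' 12 34 56
    End Sub
End Module
```

The decode step is only there to show the mapping; for the replacement itself, the ChrW constants are all you need.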

Konrad Rudolph
.NET uses UTF-16, not UTF-32. (Each Char is a UTF-16 code unit.)
Jon Skeet
Jon: of course. Typo. Thanks for spotting it.
Konrad Rudolph
TJ