Hello world!

I am developing, in VB.Net, an application that reads from text files using a FileStream Object. I do not use a StreamReader, since the buffering it does makes it impossible to use Seek.

Those text files form a database, with both index and data files. In index files, all fields are fixed-length, which is not the case in data files.

I've recently run into a problem. Since some of my files contain accents, the corresponding characters take more than 1 byte. Therefore, when I seek in the index file, an offset appears and the rest of my index file is not read correctly.
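To make the offset problem concrete, here is a minimal sketch in Python (used purely for illustration; the application in question is VB.Net). It assumes UTF-8 storage and a hypothetical field width of 6 characters; neither detail is stated in the question.

```python
# Hypothetical layout: every index field holds exactly 6 characters.
FIELD_CHARS = 6

def pad(text: str) -> str:
    """Pad a key with spaces to the fixed character width."""
    return text.ljust(FIELD_CHARS)

# "caf\u00e9" is "cafe" with a precomposed e-acute (U+00E9).
records = [pad("cafe"), pad("caf\u00e9"), pad("zoo")]

# Same number of characters per record, but not the same number of bytes.
utf8_sizes = [len(r.encode("utf-8")) for r in records]

# Where a seek would land if each character were 1 byte, versus
# where the third record really starts in the UTF-8 file.
expected_offset = 2 * FIELD_CHARS
actual_offset = sum(utf8_sizes[:2])

print(utf8_sizes)                      # [6, 7, 6]
print(expected_offset, actual_offset)  # 12 13
```

A seek computed from the character count lands one byte early for every two-byte character that precedes it, which matches the drift described above.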

I'm looking for an encoding that supports accents, special characters and so on, and in which every character is stored using the same number of bytes. That way, I could still seek in my files. Does such an encoding exist?

Thank you,

CFP.

A: 

I believe UTF-16 covers all the accented characters, and each character takes the same number of bytes.

If you know the text is in a specific language, you may be able to use an encoding specific to that language.

Oded
Minor problem: characters above U+FFFF (code points U+10000 and up) take 4 bytes, not 2. Major problem: in decomposed form, one accented character takes two code points, a base letter plus a combining mark. For instance, U+00E9 (LATIN SMALL LETTER E WITH ACUTE) can be replaced by U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT).
Jerome
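Both points in the comment above can be checked with a short Python sketch (Python is used only for illustration). The helper name `utf16_len` and the sample character U+1D11E are choices made for this example, not something from the thread.

```python
import unicodedata

def utf16_len(text: str) -> int:
    """Number of bytes the text occupies in UTF-16 (little-endian, no BOM)."""
    return len(text.encode("utf-16-le"))

precomposed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
astral = "\U0001D11E"        # a code point above U+FFFF (example choice)

# Decomposed form: U+0065 followed by U+0301.
decomposed = unicodedata.normalize("NFD", precomposed)
# Normalizing back to NFC restores the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)

print(utf16_len(precomposed))                  # 2
print(utf16_len(astral))                       # 4 (surrogate pair)
print(len(decomposed), utf16_len(decomposed))  # 2 4
print(recomposed == precomposed)               # True
```

If the data is guaranteed to stay below U+10000 and is normalized to NFC before writing, many accented Latin letters fit in one 2-byte unit, but that is a property of the data, not a guarantee of UTF-16.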
A: 

UTF-32 is the only (non-lossy) encoding that is guaranteed to be fixed length: every code point takes exactly 4 bytes. This causes a lot of overhead, though.

What I don't understand is that you state that your index file contains fixed length fields. This means that you shouldn't have a problem. You can seek in the index file using these specific fixed lengths. And then seek in the data file using the given address in the index file. You will always end up at the start of text. What am I missing?
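The seek pattern described here could look roughly like the following Python sketch (Python only for illustration). It assumes an index made of space-padded, 8-character keys stored as UTF-32; the width, the padding and the function names `write_index` and `read_record` are hypothetical.

```python
import io

FIELD_CHARS = 8                              # hypothetical field width, in characters
BYTES_PER_CHAR = 4                           # UTF-32: 4 bytes per code point
RECORD_BYTES = FIELD_CHARS * BYTES_PER_CHAR  # every record is 32 bytes

def write_index(stream, keys):
    """Write each key as one fixed-width UTF-32 record, padded with spaces."""
    for key in keys:
        if len(key) > FIELD_CHARS:
            raise ValueError(f"key too long: {key!r}")
        stream.write(key.ljust(FIELD_CHARS).encode("utf-32-le"))

def read_record(stream, n):
    """Seek straight to record n and decode it, dropping the padding."""
    stream.seek(n * RECORD_BYTES)
    raw = stream.read(RECORD_BYTES)
    return raw.decode("utf-32-le").rstrip()

# An in-memory stream stands in for the index file.
index = io.BytesIO()
write_index(index, ["\u00e9l\u00e8ve", "na\u00efve", "zoo"])

print(read_record(index, 1))  # naïve
print(read_record(index, 2))  # zoo
```

The width is counted in code points, so a key stored in decomposed form would use two slots per accented letter; normalizing keys before writing avoids that surprise.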

Tomas
In fact, I meant fixed-length in terms of number of characters. And I fear that UTF-32 is not supported on mobile devices. Unicode works great, though.
CFP