OK, so I'm basically trying to load the contents of a .txt file that contains one word per line into a dictionary.

I had no problems doing so when the words in that file were in English, but after changing the file to a language with accents, I started having problems.

I had to change the encoding when creating the stream reader, and also pass the culture to the ToLower method when adding the word to the dictionary.

Basically I now have something similar to this:

if (!dict.ContainsKey(word.ToLower(culture)))
    dict.Add(word.ToLower(culture), true);

The problem is that words like "esta" and "está" are being considered the same. So, is there any way to set the ContainsKey method to a specific language, or do we need to implement something along the lines of a comparable? Either way, I'm kinda new to C#, so I would appreciate an example, please.

Another issue emerged with the new file: after about a hundred words it stops adding the rest of the file, leaving a word incomplete. I can't see any special characters in that word that would end the execution of the method. Any ideas about this problem?

Many thanks.

EDIT: 1st problem solved using Jon Skeet's suggestion.

Regarding the 2nd problem: OK, I changed the file format to UTF-8 and removed the encoding from the stream reader, since it now recognizes the accents just fine. Testing some stuff regarding the 2nd issue now.

2nd problem also solved, it was a bug on my part... the shame...

Thanks for the quick answers everyone, and especially Jon Skeet.

+1  A: 

The problem is with the encoding you are using when opening the file to read. Looks like you may be using ASCIIEncoding.

.NET handles strings internally as UTF-8, so this kind of issue would not happen internally.

Oded
I wonder if encoding comes into it at all until you try to serialize/deserialize string/char data. How .NET handles strings internally should be free of such encoding quandaries and of no concern to the developer.
spender
@spender: Reading a text file *is* deserializing character data. The encoding used for this has to be right, or the data will be corrupt.
Jon Skeet
@Jon: I didn't make it clear that it's the second para of this answer I was commenting on.
spender
@Oded: I'm pretty sure that `string` uses UTF-16 encoding internally.
LukeH
+6  A: 

I assume you're trying to get case insensitivity for the dictionary. Instead of calling ToLower, use the constructor of Dictionary which takes an equality comparer - and use StringComparer.Create(culture, true) to construct a suitable comparer.
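As a rough illustration of the suggestion above, here is a minimal sketch. The `WordLoader` class and `Load` method are hypothetical names made up for this example, and it assumes the lines have already been read from the file with the right encoding. Only the `Dictionary` constructor overload and `StringComparer.Create(culture, true)` come from the answer itself.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not from the original question.
public static class WordLoader
{
    // Builds a dictionary whose keys are compared case-insensitively
    // using the rules of the given culture, so no ToLower call is needed.
    public static Dictionary<string, bool> Load(IEnumerable<string> lines, CultureInfo culture)
    {
        // The second argument (true) asks for a case-insensitive comparer.
        StringComparer comparer = StringComparer.Create(culture, true);
        var dict = new Dictionary<string, bool>(comparer);

        foreach (string line in lines)
        {
            string word = line.Trim();
            if (word.Length == 0)
                continue; // skip blank lines

            // ContainsKey uses the comparer passed to the constructor.
            if (!dict.ContainsKey(word))
                dict.Add(word, true);
        }
        return dict;
    }
}
```

With a comparer like this, "Esta" and "esta" should end up as one key, while "esta" and "está" should stay distinct, because only case is ignored, not accents. The caller would pass whatever culture matches the word list, for instance one created with `new CultureInfo(...)` for the file's language.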

I don't know what your second problem is about - we'd need more detail to diagnose it, including the code you're using, ideally.

EDIT: UTF-7 is almost certainly not the correct encoding. Don't just guess at the encoding; find out what it's really meant to be. Where did this text file come from? What can you open it successfully in?

I suspect that at least some of your problems are due to using UTF-7.
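One way to check a guess instead of trusting it is to decode the raw bytes strictly and see whether that fails. The sketch below assumes the candidate encoding is UTF-8. `EncodingCheck` and `IsValidUtf8` are hypothetical names for illustration, and the bytes would come from something like `File.ReadAllBytes` on the word list.

```csharp
using System;
using System.Text;

// Hypothetical helper, not from the original thread.
public static class EncodingCheck
{
    // Returns true if the bytes decode cleanly as UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // Second argument (true) makes the decoder throw on invalid bytes
        // instead of silently substituting replacement characters.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

A file that passes this check is not guaranteed to be UTF-8, but a file that fails it definitely is not, which at least rules out one wrong guess.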

Jon Skeet
Many thanks, adding the StringComparer.Create(culture, true) solved my first problem. The second one still remains; I'm using UTF-7 since neither the UTF-8 nor the ASCII encoding recognized the accents.
brokencoding