I'm trying to read a UTF-8 encoded file (.torrent). In the file there is a 'pieces' section. Directly following that is the length of the text that contains a sequence of SHA1 hashes. The file reports a length (say 130100) to read, but when reading I end up going past EOF.

I'm not sure why this is happening. The files are good (I've tested them with existing torrent clients and I've tried a number of them with consistent results) and I'm reading them with this:

string contents = string.Empty;
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
{
    using (StreamReader reader = new StreamReader(fs, Encoding.UTF8))
    {
        contents = reader.ReadToEnd();
    }
}

parse(contents);

However, this obviously isn't working. Am I reading the file wrong, or am I storing it in a string incorrectly before trying to parse it? It seems to only fault when it reads characters outside of the normal range of readable strings.

+3  A: 

BitTorrent files aren't UTF-8-encoded. Some or all of the filenames in the files->path/name property may be UTF-8 encoded strings, but the file as a whole is purely binary, and the contents of the pieces property is a binary string containing the hashes. It makes no sense to try to read a .torrent with a TextReader.
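
To illustrate the point above, here is a minimal sketch that reads the file as raw bytes and checks whether a buffer survives a UTF-8 decode and re-encode. The class and method names (`TorrentBytes`, `ReadRaw`, `SurvivesUtf8RoundTrip`) are made up for this example. Binary hash data will generally fail the round trip, because invalid sequences are replaced during decoding and valid multi-byte sequences collapse into single characters, so a string's character count no longer matches the byte length declared in the file. That is a plausible reason for running past EOF.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

static class TorrentBytes
{
    // Read the whole file as raw bytes; no text decoding is applied.
    public static byte[] ReadRaw(string path)
    {
        return File.ReadAllBytes(path);
    }

    // Returns true only if decoding the bytes as UTF-8 and encoding them
    // again yields exactly the same bytes. Arbitrary binary data (such as
    // SHA1 hashes) usually does not survive this.
    public static bool SurvivesUtf8RoundTrip(byte[] data)
    {
        string decoded = Encoding.UTF8.GetString(data);
        byte[] reencoded = Encoding.UTF8.GetBytes(decoded);
        return data.SequenceEqual(reencoded);
    }
}
```

Calling `SurvivesUtf8RoundTrip(TorrentBytes.ReadRaw(path))` on a real .torrent is a quick way to see whether the `StreamReader` approach is losing information.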

The format under which BitTorrent files are stored is a simple structured-value serialisation known as bencode. You will want to use a proper bencode parser to extract information from a .torrent file. It's not difficult to write one (after all, you only get four datatypes), or see theory's libraries list for a couple of existing .NET libraries.
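
As one piece of such a parser, here is a hedged sketch of reading a single bencode string (`<length>:<bytes>`) directly from a `byte[]`. The names `BencodeSketch` and `ReadByteString` are hypothetical and not taken from any existing library. The declared length is counted in bytes, and the payload is returned as a `byte[]` without being decoded.

```csharp
using System;

static class BencodeSketch
{
    // Parses "<length>:<bytes>" starting at 'pos' and advances 'pos'
    // to the first byte after the string.
    public static byte[] ReadByteString(byte[] data, ref int pos)
    {
        int start = pos;
        int length = 0;

        // The length prefix is ASCII decimal digits.
        while (pos < data.Length && data[pos] >= (byte)'0' && data[pos] <= (byte)'9')
        {
            length = checked(length * 10 + (data[pos] - (byte)'0'));
            pos++;
        }

        if (pos == start || pos >= data.Length || data[pos] != (byte)':')
            throw new FormatException("Expected '<length>:' at offset " + start);
        pos++; // skip the ':' separator

        if (length > data.Length - pos)
            throw new FormatException("Declared length runs past the end of the data");

        // Copy the payload untouched; it may be text or binary.
        byte[] result = new byte[length];
        Buffer.BlockCopy(data, pos, result, 0, length);
        pos += length;
        return result;
    }
}
```

A caller that knows a particular key holds text can decode the result afterwards, for example with `Encoding.UTF8.GetString(bytes)`, while `pieces` stays as bytes and can be split into 20-byte hashes.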

bobince
I've written an implementation that can encode and decode strings to/from bencoding. This is what I'm testing. I can successfully decode an entire .torrent file, with this 1 exception. I'm not looking to use an existing library (and the ones linked are either n/a for .NET or require me to dl/install git to get). Thank you for your answer, but it doesn't help me.
SnOrfus
Additionally, maybe you can clarify: the rest of the file is not a problem, only this one part of it. The spec says "pieces maps to a string whose length is a multiple of 20" and "All strings in a .torrent file that contains text must be UTF-8 encoded", so why would using a StreamReader be ill-advised?
SnOrfus
But not all strings do ‘contain text’. `pieces`, in particular, won't. This is a byte string representing binary hashes, not text characters, and will almost never form valid UTF-8 sequences. You'll have to parse it into a `byte[]` structure and not a `String`. Unfortunately the bencode format does not tell you which strings are binary and which ‘contain text’ (mainly because the format was originally devised with no conception of Unicode), which leaves you in the position of having to return `byte[]` for everything.
bobince
(Not a bad idea anyway given how often the ostensibly-text string fields in .torrent files in the wild actually still contain characters in encodings other than UTF-8.)
bobince
+1 bobince. Thank you kindly. You've been of great help.
SnOrfus