I'm playing with bencoding and I would like to keep bencoded strings as Java strings, but they contain binary data, so blindly converting them to String will corrupt the data. What I am trying to accomplish is a conversion function that keeps the ASCII bytes as ASCII and encodes non-ASCII bytes in a reversible way.

I have found some examples of what I am trying to accomplish in Python, but I don't know enough Python to dig through them. This decoder does exactly what I would like to do: the ASCII parts of the torrent stay as ASCII, but SHA-1 hashes are printed as "\xd8r\xe7". With my very limited Python knowledge, it doesn't seem to be doing anything special to the string. Is this handled by the Python interpreter? Can I accomplish the same in Java?

I have played with encodings such as Base64 and with Integer.toHexString, but the ASCII strings I get in the end are not human-readable.

I have also found a Scheme example that prints everything but the SHA-1 hashes.

A: 

Bencoded strings are byte strings. You can attempt to decode a byte string to Unicode code points in Java with new String(byte[] bytes, Charset charset). Decoding with certain encodings such as ISO-8859-1 will always succeed, since every byte maps directly to a code point. With many of these encodings (including ISO-8859-1) the process is also reversible.
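
As a minimal sketch of that round trip (the class and method names are made up for illustration), the following decodes all 256 byte values with ISO-8859-1 and encodes them back:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1RoundTrip {
    // Each byte becomes the char with the same numeric value (U+0000..U+00FF).
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Every char produced by decode() fits back into a single byte.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // All 256 possible byte values, including the 0x80-0x9F range.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        String text = decode(all);
        System.out.println(text.length());                    // prints 256
        System.out.println(Arrays.equals(all, encode(text))); // prints true
    }
}
```

The guarantee only holds in this direction. A String that contains characters above U+00FF cannot be encoded with ISO-8859-1, and getBytes will substitute '?' for them, so this is safe only for strings that were produced from bytes in the first place.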

Tuure Laurinolli
Yes, that's what I'm doing right now, but bencoded strings contain binary data, not just text, at least in torrents. Building a regular string will corrupt the SHA-1 hashes.
Hamza Yerlikaya
Er, it shouldn't... as long as the codepoints cover the entire 0-255 byte range, nothing should change in the process.
Amber
It's a common mistake to believe that ISO-8859-1 does a 1:1-mapping for bytes in the range 0-255. ISO-8859-1 is undefined in the range 128-159, so trying to convert a byte in that range to a character will result in '?' as a best-fit representation of an unknown character.
jarnbjo
@jarnbjo, ISO 8859-1 (the ISO standard) is the one that leaves some code points undefined; ISO-8859-1 (the IANA charset, with the extra hyphen) defines all of them. The Wikipedia article has more details.
Tuure Laurinolli
@Hamza Yerlikaya, no it won't. If you encode the String again with ISO-8859-1, the resulting bytes are the same. Or in code: Arrays.equals(bytes, new String(bytes, Charset.forName("ISO-8859-1")).getBytes("ISO-8859-1")) == true for any byte[] bytes.
Tuure Laurinolli
A: 

If Wikipedia is accurate on Bencode, the format seems straightforward enough. Parse the byte data directly:

while (true) {
  // peek at the next byte without consuming it; needs in.markSupported()
  in.mark(1);
  int n = in.read();
  if (n < 0) {
    // end of input
    break;
  }
  in.reset();
  // char literals work here because ASCII values equal their UTF-16 values
  if (n == 'd') {
    // parse dictionary
  } else if (n == 'i') {
    // parse int
  } else if (n >= '0' && n <= '9') {
    // parse binary string
  } else if (n == 'l') {
    // parse list
  } else {
    throw new IOException("Invalid input");
  }
}
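
For the binary-string branch, a sketch might look like the following. It assumes the format as Wikipedia describes it (a decimal length, a colon, then that many raw bytes); BencodeStrings and readByteString are hypothetical names, not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class BencodeStrings {
    // Reads one bencoded byte string of the form <length>:<bytes>.
    static byte[] readByteString(InputStream in) throws IOException {
        int length = 0;
        int digits = 0;
        int c;
        while ((c = in.read()) != ':') {
            // also rejects end of input, because read() returns -1 there
            if (c < '0' || c > '9') {
                throw new IOException("Invalid length prefix");
            }
            if (length > (Integer.MAX_VALUE - 9) / 10) {
                throw new IOException("Length too large");
            }
            length = length * 10 + (c - '0');
            digits++;
        }
        if (digits == 0) {
            throw new IOException("Missing length prefix");
        }
        // read() may return fewer bytes than requested, so loop until done
        byte[] data = new byte[length];
        int off = 0;
        while (off < length) {
            int n = in.read(data, off, length - off);
            if (n < 0) {
                throw new EOFException("Truncated string");
            }
            off += n;
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        byte[] input = {'3', ':', (byte) 0xd8, 'r', (byte) 0xe7};
        byte[] s = readByteString(new ByteArrayInputStream(input));
        System.out.println(s.length); // prints 3
    }
}
```

The bytes never pass through a String here, so nothing can be corrupted. For untrusted input you would probably also want an upper bound on the length before allocating the array.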

Store the binary strings in a type that only converts them to ASCII when you do it explicitly, as in this toString call:

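import java.nio.charset.StandardCharsets;

public class ByteString {
  private final byte[] data;

  // defensive copies keep the stored bytes from being modified from outside
  public ByteString(byte[] data) { this.data = data.clone(); }
  public byte[] getData() { return data.clone(); }

  // for display only: bytes outside US-ASCII decode to U+FFFD, so this is lossy
  @Override public String toString() {
    return new String(data, StandardCharsets.US_ASCII);
  }
}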
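
If you want output like the Python repr quoted in the question ("\xd8r\xe7"), one option is to escape the bytes yourself. This is only a sketch under my own assumptions: ByteEscaper is a hypothetical helper, it escapes the backslash as \\ and every non-printable or non-ASCII byte as \xNN, and it does not try to match Python's repr in every detail (Python also uses \n, \t and so on).

```java
import java.io.ByteArrayOutputStream;

final class ByteEscaper {
    // Printable ASCII stays as-is; everything else becomes \xNN, backslash becomes \\.
    static String escape(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v == '\\') {
                sb.append("\\\\");
            } else if (v >= 0x20 && v <= 0x7E) {
                sb.append((char) v);
            } else {
                sb.append("\\x")
                  .append(Character.forDigit(v >> 4, 16))
                  .append(Character.forDigit(v & 0xF, 16));
            }
        }
        return sb.toString();
    }

    // Inverse of escape(); rejects anything escape() could not have produced.
    static byte[] unescape(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c != '\\') {
                if (c < 0x20 || c > 0x7E) {
                    throw new IllegalArgumentException("Unescaped character at index " + i);
                }
                out.write(c);
                i++;
            } else if (i + 1 < text.length() && text.charAt(i + 1) == '\\') {
                out.write('\\');
                i += 2;
            } else if (i + 3 < text.length() && text.charAt(i + 1) == 'x') {
                int hi = Character.digit(text.charAt(i + 2), 16);
                int lo = Character.digit(text.charAt(i + 3), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad hex escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 4;
            } else {
                throw new IllegalArgumentException("Bad escape at index " + i);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xd8, 'r', (byte) 0xe7};
        System.out.println(escape(raw)); // prints \xd8r\xe7
    }
}
```

ByteString.toString() could return ByteEscaper.escape(data) instead of decoding as US-ASCII, which would make the printed form reversible while keeping the ASCII parts readable.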
McDowell