ansaurus

Question

encoding/decoding string and a special character to byte array

Answer 1

A:

You would find this a lot easier if you just used the java.nio.charset.CharsetEncoder class to convert your characters to bytes. It would even work for characters other than ASCII. Even String.getBytes would be a lot less code to the same basic effect.

bmargulies 2009-12-09 13:44:47

I'm doubtful about this. CharsetEncoder is generally used to pack a single character into one or more bytes, hence the "averageBytesPerChar" field. Can it really be subverted to pack one-and-a-half characters into a single byte?

Carl Smotricz 2009-12-09 13:51:18

You have to implement your own encoding class, but, yes, you can.

bmargulies 2009-12-09 15:21:54

Answer 2

A:

If the "special char" is fixed and you're always aware that a 4 character String begins with this special char, then the char itself provides no useful information.

If the String is 3 characters in length, then do what you did before; if it's 4 characters, run the old algorithm on the String's substring starting with the 2nd character.

Am I thinking too simply or are you thinking too hard?

Carl Smotricz 2009-12-09 13:54:01

Thanks Carl. The special character is used to specify different processing path for the decoded string. if(decodedString.startsWith('@')) { // do this } else { // do that }Hence the additional character.

Tequila Guy 2009-12-09 14:16:57

I think I've given you the solution. There's no need to decode your special character.

Carl Smotricz 2009-12-09 14:22:10

but, he may need to keep the second character, in which case, use one of your leftover bits in your encoded bytes to represent the existence of the extra char.

james 2009-12-09 14:46:24

Ah, i see Aaron Digulla already mentioned this solution below...

james 2009-12-09 14:47:53

OK, I finally got the point and provided a coded solution (in my other answer). Consider this answer "for documentary purposes only" :)

Carl Smotricz 2009-12-09 16:50:34

Answer 3

+1 A:

In your alphabet, you use only 15 of the 16 available bits of the output. So you could just set the MSB (most significant bit) if the string is of length 4 since the special char is fixed.

The other option is to use a translation table. Just create a String with all valid characters:

String valid = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ";

The index of a character in this string is the encoding in the output. Now create two arrays:

byte encode[] = new byte[256];
char decode[] = new char[valid.length ()];
for (int i=0; i<valid.length(); i++) {
    char c = valid.charAt(i);
    encode[c] = i;
    decode[i] = c;
}

Now you can lookup the values for each direction in the arrays and add any character you like in any order.

Aaron Digulla 2009-12-09 14:03:00

You didn't understand my approach: After finding the encoded value, you must compress the bits as before. My approach just allows you to squeeze more different characters into the output.

Aaron Digulla 2009-12-09 15:54:19

oh ok :) Thanks. Actually that compression of bits is the part I am struggling with. Am I doing it correctly? are there better ways of doing it.Thanks for the simpler approach.

Tequila Guy 2009-12-09 16:08:23

Aaron Digulla 2009-12-10 08:13:24

How do I set the free bit to signify presence/absence of special character in the code I put in the question? I am not able to figure which bit is free :(

Tequila Guy 2009-12-22 13:27:36

Aaron Digulla 2009-12-22 16:22:34

@Aaron Digulla: Alas, what you say is untrue. He's encoding base 26 numbers in base 32, and this is leaving gaps every time there's a carry. There's a contiguous batch of available values just below FFFF, but it's only about 11,000 values and thus not enough. So using the sign bit doesn't work.

Carl Smotricz 2009-12-23 12:19:00

@Carl: 32 values = 5 bits * 3 = 15 bits out of 16. That leaves one bit, right?

Aaron Digulla 2009-12-23 15:15:19

Answer 4

+2 A:

~~Not directly an answer, but~~ here's how I would do the encoding:

   public static byte[] encode(String s) {
      int code = s.charAt(0) - 'A' + (32 * (s.charAt(1) - 'A' + 32 * (s.charAt(2) - 'A')));
      byte[] encoded = { (byte) ((code >>> 8) & 255), (byte) (code & 255) };
      return encoded;
   }

The first line uses Horner's Schema to arithmetically assemble 5 bits of each character into an integer. It will fail horribly if any of your input chars fall outside the range [A-`].

The second line assembles a 2 byte array from the leading and trailing byte of the integer.

Decoding could be done in a similar manner, with the steps reversed.

UPDATE with the code (putting my foot where my mouth is, or something like that):

public class TequilaGuy {

   public static final char SPECIAL_CHAR = '@';

   public static byte[] encode(String s) {
      int special = (s.length() == 4) ? 1 : 0;
      int code = s.charAt(2 + special) - 'A' + (32 * (s.charAt(1 + special) - 'A' + 32 * (s.charAt(0 + special) - 'A' + 32 * special)));
      byte[] encoded = { (byte) ((code >>> 8) & 255), (byte) (code & 255) };
      return encoded;
   }

   public static String decode(byte[] b) {
      int code = 256 * ((b[0] < 0) ? (b[0] + 256) : b[0]) + ((b[1] < 0) ? (b[1] + 256) : b[1]);
      int special = (code >= 0x8000) ? 1 : 0;
      char[] chrs = { SPECIAL_CHAR, '\0', '\0', '\0' };
      for (int ptr=3; ptr>0; ptr--) {
         chrs[ptr] = (char) ('A' + (code & 31));
         code >>>= 5;
      }
      return (special == 1) ? String.valueOf(chrs) : String.valueOf(chrs, 1, 3);
   }

   public static void testEncode() {
      for (int spcl=0; spcl<2; spcl++) {
         for (char c1='A'; c1<='Z'; c1++) {
            for (char c2='A'; c2<='Z'; c2++) {
               for (char c3='A'; c3<='Z'; c3++) {
                  String s = ((spcl == 0) ? "" : String.valueOf(SPECIAL_CHAR)) + c1 + c2 + c3;
                  byte[] cod = encode(s);
                  String dec = decode(cod);
                  System.out.format("%4s : %02X%02X : %s\n", s, cod[0], cod[1], dec);
               }
            }
         }
      }
   }

   public static void main(String[] args) {
      testEncode();
   }

}

Carl Smotricz 2009-12-09 14:27:31

Thanks. Can this be extended to 4 characters too? Maybe I am not understanding your point about not requiring the special character to be encoded. I need it because when I receive an encoded 2 byte[] array, and decode it, my processing logic will change if the decoded string has the special character.

Tequila Guy 2009-12-09 16:04:00

Sure, can do. I didn't realize that the information needed to be transported around, but it's easy as that whole 4th character can be represented by a single bit and you have a 1.9 bits to spare :) Answer coming up...

Carl Smotricz 2009-12-09 16:10:22

Thanks for your help Carl :)

Tequila Guy 2009-12-10 09:14:17

just changed one thing here.instead of using 32* while encoding i am doing << 5 as its much faster.Thanks once again :)

Tequila Guy 2009-12-10 13:27:12

Color me surprised. How much faster? And how can you tell?

Carl Smotricz 2009-12-10 13:34:54

You are right! I thought it would. There is hardly a 2-3 nano second improvement with << 5 but I cant be sure of that too. I thought shifting should be quicker than multiplying (it is for me!) Am trying to optimize it (removing extra subtractions, making charAt() - 'A' etc etc) and will let you know in case I do manage to improve it.

Tequila Guy 2009-12-11 10:23:16

Thing is, the compiler is able to recognize the equivalency of `* 32` and `<< 5` and it knows better than us which is faster. This may even vary from computer to computer. So it's highly likely to be changing one into the other behind your back. However, there IS still some potential for optimization. I can look again and give you some hints. HOWEVER, if you're hard up for speed your best move might be to create a pair of hashmaps for all values, which is less than 60000. Then you can convert simply by looking up.

Carl Smotricz 2009-12-11 10:56:19

This is more than just a hunch: Both functions create a new object (String/byte[]) *every time*, and object creation is time-expensive. Hash lookup, on the other hand, is well optimized. Try that!

Carl Smotricz 2009-12-11 10:59:16

You learn something new everyday :) Was doing the 2 hashmap thing to check the performance. Will let you know the results. I'll need a wrapper on byte[] for that and would still end up creating that object for each decode but encoding should be faster with hash lookup.

Tequila Guy 2009-12-11 13:32:15

Why the wrapper? Can't you just consider the returned byte arrays as immutable? Or if you're gonna wrap, why not put the wrapped (and hopefully immutable) byte arrays into the hashmap? Immutable is good. Immutable is fast. Immutable gives you strength to crush your enemies [foaming at the mouth, just a little]

Carl Smotricz 2009-12-11 13:38:17

:) if I put byte[] byte1 = {1,2} as key in a HashMap, and try looking up using byte[] byte2 = {1,2}, it will return null as byte1 !=byte2. So I'll need a wrapper on byte[] (ByteArrayWrapper or something) with overridden hashCode and equals for hashmap to work properly.Regarding immutability, if I implement the wrapper to be immutable, it would have an extra overhead of copying byte array while constructing the object and returning the array.

Tequila Guy 2009-12-11 14:00:33

What you say is true, alas, and annoying. But a real superhero never gives up! For decoding, toss out the hashmap and replace it with an array of arrays (i.e. a 2 dimensional array) of Strings. Your byte array input is just 2 bytes, so you can use those as "row" and "column" to directly index to the corresponding String. Since bytes are signed and values from 0x80 up are negative, you may need to do some minor bit twiddling to turn them into positive subscripts. Ternary exps are one solution, ORing with an integer might be another.

Carl Smotricz 2009-12-11 14:10:21

As it turns out, the new encoding (i.e. hash lookup for a String to get a byte[]) is slower than the original encode method. only marginally though. :) crappy 13 nano seconds lol.

Tequila Guy 2009-12-11 14:37:11

Bummer. Well, you're well on your way.

Carl Smotricz 2009-12-11 15:15:43

I had coded everything and now looks like that I have to use the original logic for encoding/decoding, (the one I posted in the question) due to backward compatibility issues. Is there a way of squeezing in a bit for presence/absence of '-' in the original code so that it gives exactly same results as before when '-' is not present? Am not able to see a free bit :(

Tequila Guy 2009-12-22 13:01:08

When I said you had 1.9 bits to spare I wasn't kidding - there was an exact calculation behind that statement. We used up one bit for your '@', and 0.9 bits won't hold any information. So: If '-' takes the place of '@' then your original calc can be adapted. If it's extra information, then no.

Carl Smotricz 2009-12-22 13:41:25

'-' or '@' or whatever special char we use is same. I mean its upto us to decide if we want to use '-' or '@'. So to answer your question, '-' does takes place of '@'

Tequila Guy 2009-12-22 13:45:30

OK, so the objective is to create an identical mapping to your original function with added capability for including the '-' (in so far unmapped combinations) and optimizing for speed as well as possible?

Carl Smotricz 2009-12-22 13:48:54

yes. And one of the many learnings I had from your original code was that you used string.CharAt() instead of creating an array. That is better performance wise.

Tequila Guy 2009-12-22 14:06:26

If it seems like I'm asking a lot of questions to put off getting to work on the core problem, that's exactly what I'm doing. Will the conversions in either directions be happening unexpectedly in/from various threads, or can I set up thread-unsafe static data structures to simplify and speed things?

Carl Smotricz 2009-12-22 14:10:16

You have been really helpful and its me who has been asking lots of questions! :) This is currently exposed as a static method in a utility class. Many threads can call encode and decode methods when they need to.(unexpectedly)

Tequila Guy 2009-12-22 14:23:29

OK, so I'm not allowed to cheat on thread safety. I'm being called away for about an hour, then I will set to work.

Carl Smotricz 2009-12-22 14:35:09

No Problem at all :) I have 2 more days to test all functionality and write Unit Tests

Tequila Guy 2009-12-22 15:27:50

Aaargh! Your original function hides those extra 1.9 bits in the cracks (= small gaps) between the encodings for (e.g.) AAZ = 019h = 25 and ABA = 020h = 32. There are of course bigger (though rarer) gaps between e.g. AZZ = 639h = 1593 and BAA = 800h = 2048 and a single big one between ZZZ = CE39h = 52793 and FFFFh = 65535, but the '-' values can't be represented in a neat 1:1 way. I may need to go for a lookup table based solution. I'll be back to let you know.

Carl Smotricz 2009-12-22 21:19:19

Does the free bit always has a constant position? i.e. in the second byte (3rd bit?):strEncoded[1] = (byte) ((((charsAsNumbers[1] << 5)) + ((charsAsNumbers[2]))));would x = x+x|24 help?

Tequila Guy 2009-12-23 01:06:17

No, it doesn't, unfortunately; and that's why you were unable to find a solution. Please look at my latest, meanwhile 3rd answer to this question, though.

Carl Smotricz 2009-12-23 02:35:36

Just in case I'm confusing you: My new answer is a separate answer, with 0 points so far; it includes source code for what you want to do. And it even seems to work :) It's around here somewhere ;)

Carl Smotricz 2009-12-23 12:27:26

:) yeah! its working :) i created some junits to test it by comparing it with old encoding. Also did some performance tests and its a bit faster than one you gave me earlier :) Its a LOT faster than the code i posted in the question. Just integrating and making appropriate changes else where.. Lot of things to learn from ur code and of course u!Thanks a TON for your help!

Tequila Guy 2009-12-23 14:27:04

Answer 5

+1 A:

Yes, it is possible to encode an extra bit of information while maintaining the previous encoding for 3 character values. But since your original encoding doesn't leave nice clean swaths of free numbers in the output set, mapping of the additional set of Strings introduced by adding that extra character cannot help but be a little discontinuous.

Accordingly, I think it would be hard to come up with mapping functions that handle these discontinuities without being both awkward and slow. I conclude that a table-based mapping is the only sane solution.

I was too lazy to re-engineer your mapping code, so I incorporated it into the table initialization code of mine; this also eliminates many opportunities for translation errors :) Your encode() method is what I call OldEncoder.encode().

I've run a small test program to verify that NewEncoder.encode() comes up with the same values as OldEncoder.encode(), and is in addition able to encode Strings with a leading 4th character. NewEncoder.encode() doesn't care what the character is, it goes by String length; for decode(), the character used can be defined using PREFIX_CHAR . I've also eyeball checked that the byte array values for prefixed Strings don't duplicate any of those for non-prefixed Strings; and finally, that encoded prefixed Strings can indeed be converted back to the same prefixed Strings.

package tequilaguy;


public class NewConverter {

   private static final String[] b2s = new String[0x10000];
   private static final int[] s2b = new int[0x10000];
   static { 
      createb2s();
      creates2b();
   }

   /**
    * Create the "byte to string" conversion table.
    */
   private static void createb2s() {
      // Fill 17576 elements of the array with b -> s equivalents.
      // index is the combined byte value of the old encode fn; 
      // value is the String (3 chars). 
      for (char a='A'; a<='Z'; a++) {
         for (char b='A'; b<='Z'; b++) {
            for (char c='A'; c<='Z'; c++) {
               String str = new String(new char[] { a, b, c});
               byte[] enc = OldConverter.encode(str);
               int index = ((enc[0] & 0xFF) << 8) | (enc[1] & 0xFF);
               b2s[index] = str;
               // int value = 676 * a + 26 * b + c - ((676 + 26 + 1) * 'A'); // 45695;
               // System.out.format("%s : %02X%02X = %04x / %04x %n", str, enc[0], enc[1], index, value);
            }
         }
      }
      // Fill 17576 elements of the array with b -> @s equivalents.
      // index is the next free (= not null) array index;
      // value = the String (@ + 3 chars)
      int freep = 0;
      for (char a='A'; a<='Z'; a++) {
         for (char b='A'; b<='Z'; b++) {
            for (char c='A'; c<='Z'; c++) {
               String str = "@" + new String(new char[] { a, b, c});
               while (b2s[freep] != null) freep++;
               b2s[freep] = str;
               // int value = 676 * a + 26 * b + c - ((676 + 26 + 1) * 'A') + (26 * 26 * 26);
               // System.out.format("%s : %02X%02X = %04x / %04x %n", str, 0, 0, freep, value);
            }
         }
      }
   }

   /**
    * Create the "string to byte" conversion table.
    * Done by inverting the "byte to string" table.
    */
   private static void creates2b() {
      for (int b=0; b<0x10000; b++) {
         String s = b2s[b];
         if (s != null) {
            int sval;
            if (s.length() == 3) {
               sval = 676 * s.charAt(0) + 26 * s.charAt(1) + s.charAt(2) - ((676 + 26 + 1) * 'A');
            } else {
               sval = 676 * s.charAt(1) + 26 * s.charAt(2) + s.charAt(3) - ((676 + 26 + 1) * 'A') + (26 * 26 * 26);
            }
            s2b[sval] = b;
         }
      }
   }

   public static byte[] encode(String str) {
      int sval;
      if (str.length() == 3) {
         sval = 676 * str.charAt(0) + 26 * str.charAt(1) + str.charAt(2) - ((676 + 26 + 1) * 'A');
      } else {
         sval = 676 * str.charAt(1) + 26 * str.charAt(2) + str.charAt(3) - ((676 + 26 + 1) * 'A') + (26 * 26 * 26);
      }
      int bval = s2b[sval];
      return new byte[] { (byte) (bval >> 8), (byte) (bval & 0xFF) };
   }

   public static String decode(byte[] b) {
      int bval = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
      return b2s[bval];
   }

}

I've left a few intricate constant expressions in the code, especially the powers-of-26 stuff. The code looks horribly mysterious otherwise. You can leave those as they are without losing performance, as the compiler folds them up like Kleenexes.

Update:

As the horror of X-mas approaches, I'll be on the road for a while. I hope you'll find this answer and code in time to make good use of it. In support of which effort I'll throw in my little test program. It doesn't directly check stuff but prints out the results of conversions in all significant ways and allows you to check them by eye and hand. I fiddled with my code (small tweaks once I got the basic idea down) until everything looked OK there. You may want to test more mechanically and exhaustively.

package tequilaguy;

public class ConverterHarness {

//   private static void runOldEncoder() {
//      for (char a='A'; a<='Z'; a++) {
//         for (char b='A'; b<='Z'; b++) {
//            for (char c='A'; c<='Z'; c++) {
//               String str = new String(new char[] { a, b, c});
//               byte[] enc = OldConverter.encode(str);
//               System.out.format("%s : %02X%02X%n", str, enc[0], enc[1]);
//            }
//         }
//      }
//   }

   private static void testNewConverter() {
      for (char a='A'; a<='Z'; a++) {
         for (char b='A'; b<='Z'; b++) {
            for (char c='A'; c<='Z'; c++) {
               String str = new String(new char[] { a, b, c});
               byte[] oldEnc = OldConverter.encode(str);
               byte[] newEnc = NewConverter.encode(str);
               byte[] newEnc2 = NewConverter.encode("@" + str);
               System.out.format("%s : %02X%02X %02X%02X %02X%02X %s %s %n", 
                     str, oldEnc[0], oldEnc[1], newEnc[0], newEnc[1], newEnc2[0], newEnc2[1],
                     NewConverter.decode(newEnc), NewConverter.decode(newEnc2));
            }
         }
      }
   }
   public static void main(String[] args) {
      testNewConverter();
   }

}

Carl Smotricz 2009-12-22 23:25:58

Thanks :) I created a similar method in the morning and was printing results when Arrays.equals(oldEnc, newEnc) returned false. The new encoding works perfectly :) ....Actions Items for me: 1) Learn bit manipulations and encoding algorithms (did a bit in college but not after that)2) Spend more time on Stackoverflow!

Tequila Guy 2009-12-23 14:35:22

Yer welcome, bro. Glad I was able to help after all. If you have more interesting puzzles like this, figure out my email address from my profile and give me a shout.

Carl Smotricz 2009-12-23 16:42:26

ansaurus

tags:

views:

answers:

encoding/decoding string and a special character to byte array

related questions