views:

435

answers:

4

The Challenge

  • Encode ANY series of integers, floats, strings, nulls, booleans and arrays/objects of them, into a binary format and decode it again.
  • The term object refers to a HashMap/Dictionary
  • At the top level there's always a surrounding array or object, depending on the input

The Test Input

[-1, 0.1, 0, 'Foo', {'b': 8388608.25, 'e': -0.99, 'c': 14}, [1, 256, false, 32768, 0.00, 128, -127, -65535, 'Use jQuery for this one!'], 127, null, 2147483648, -32769.4, ['Hello World', -2147483647, [-65535.22, 42]], 2147483648.12, -19, {'l': -2147483647.99, 'o': {'p': [-128, 'Blub']}}, [['Bar'],[false, ['Test', {}, [null],[[1, 2], 1337], true], 4]], 'The cake is a lie...', -8388607.12]

Expected Output

  • The same as above(except for the usual floating point errors)

The Rules

  • Your floats need at least 2 decimal places
  • It's OK to encode floats as integers when possible like with 2.0(becoming 2) and 0.0(becoming 0)
  • Expect integers to be in the range from -2147483647 to 2147483648
  • Expect floats to be in the range from -2147483647.99 to 2147483648.99
  • There's no NaN or Infinite nor are there any other special number constants
  • Expect keys and strings to not contain any 0x00 bytes

Not allowed

  • Your self invented one character language that solves all the problems in the universe(including this golf)

The Winner

  • The smallest (EncodedOutputSize of the Test Input above in bytes * 100) + (Code size in characters) This way you cannot just dump everything with chr().

EDIT
Simplified the rules, now just the smallest output + code size wins.

+1  A: 

Python, penalty score=24179

Here is a baseline: zlib compress the human-readable output (guess that's what you meant with "chr()"):

#!/usr/bin/env python
input = [-1, 0.1, 0, 'Foo', 8388608.25, -0.99, 14, [1, 256, False,
    32768, 0.00, 128, -127, -65535, 'Use jQuery for this one!'],
    127, None, 2147483648, -32769.4, ['Hello World', -2147483647,
    [-65535.22, 42]], 2147483648.12, -19, -2147483647.99, -128,
    [['Bar'],[False, ['Test', [None],[[1, 2], 1337], True], 4]],
    'The cake is a lie...', -8388607.12
]

import zlib
e = zlib.compress(repr(input))
print len(e)
print eval(zlib.decompress(e))

This prints 241 on my machine, i.e. the length score is 241 * 100 + 480 = 24580, or just 24179 if we do not count the input and use the short-hand backquote "i" backquote instead of repr(input).

Using Python's marshal module instead of repr+eval is not better: 262 bytes. With the pickle module this increases to 279 bytes.

Note that using shorter decimal representations (getting rid of the long runs of 9s and 0s) would reduce the zlib output to 227 bytes. Of course, this would require some coding, for example, a post-processor could edit all fractions. Care would have to be taken to not edit decimal fractions appearing inside strings.

Jojo
Sorry should have excluded the use of zlib and such stuff I'm talking about actually encoding the numbers and all the stuff to bytes themselfs, sorry.
Ivo Wetzel
So where do you draw the border? What codes may I use to encode an integer? How much may the code be optimised to the numbers appearing in the test data (numbers close to powers of powers of two)? May I Huffman code individual strings appearing in the input? If yes, why then not do the same with sequences of integers? Delta coding maybe allowed? Do you mean with "bytes" that every array element must be byte-aligned in the binary output, or may I use bits, for example 3 bits instead of a full byte for the data type? Anyway, this code was just provided for a comparison with a pragmatic solution.
Jojo
Well I'm talking about self written encoding here, do whatever you want but library calls like zlib somewhat defeat the purpose of this code golf. For example the thing I've linked in the comments of the question already beats zlib with 179 bytes.
Ivo Wetzel
OK, I see your implementation. This can only be improved by tailoring it even more to the test input, for example, numbers between 43 and 126 do not occur so the respective one byte codes could be used to shorten other things: .99 occurs twice, many integers result in one or more zero bytes (16 possible patterns), strings use a limited character set etc. In the end, we get an encoder that can encode anything roughly as well as your implementation but has an extreme peak of coding-efficiency for the test input. Lesson learned: test input has to be large and less artificial.
Jojo
No it can still be improved, using up 1 byte for every value just as the information is a big waste. I have a version lying around which only has 16 types of "values" therefore it only needs half of the type indications bytes, it still handles any possible values. Using a bitstream one could compress it even further I guess, but that would be slow in JavaScript since every number is a float.
Ivo Wetzel
A: 

Before I implement the following as a transducer, I'd like to know whether the solution would be accepted: 1. take the human-readable, programming-language specific representation, e.g. repr() in Python 2. remove unneeded precision (decimal digits) 3. replace any pair of decimal digits appearing outside strings with a single byte chosen from a set not colliding with the characters that can appear outside strings (much like BCD but single digits stay intact) 4. replace comma followed by space (", ") outside strings by a special character.

This would reduce the test string to 236 bytes which is better than the zlib solution and the transducer should take less than 1 KiB, i.e. the length score would be around 24600, very close to the zlib solution's score. Merging the solutions, one would get down to approx. 210 characters (score around 22020).

Jojo
OK, I see the rule update "no eval and its friends". This adds another KiB to iterate the list, check for types and rebuild it. More importantly, it's more time-consuming to implement. I'm out.
Jojo
delete this "answer" please? any notes, use comment feature.
Nas Banov
+1  A: 

c99, penalty score=29587 (277*100+1887)

1887 character with the addition of code to reduce the representation of small integers, making the compressed form 277 1785 characters in this version. Still many to lose. 1822 characters in the version (not yet very golfed, there are many that can be removed, but I want to preserve the ungolfed source in a readable form for editing...golfing comes later), and it actually makes the test input 3 characters longer in the coded form.

Moderately golfed version

#include <stdio.h>
#include <ctype.h>
#include <string.h>
#define E case
#define G default:
#define H switch
#define I stdin
#define J break;
#define P printf
#define R return
#define W while
#define F(a) a=getchar()
#define U(a) ungetc(a,I)
#define Q(a) putc(a,stdout)
#define L(a) do{F(a);}W(isspace(a)&&(a!=EOF))
long long int N,M;
int A,B,C,D,S[99],s;
int eat(int a){
W(isalpha(F(B)));
H(B){
E EOF: a+=4;
E ':': a+=2;
E ',':
J
G
U(B);
a+=4;
}
R a;
}
int str(int a){
W((F(B))!=a && B<0x80 && B>0)Q(B);
R B;
}
int sep(){
L(B);
H (B){
E ',':
R 0;
E ':':
R 2;
E EOF:
R 6;
G
U(B);
R 4;
}
}
int num(){
double e=1.0, d;
A = (C-'-')?1:-1;
N=isdigit(C)?(C-'0'):0;
W((F(B))!=EOF){
if (isdigit(B)){
N=N*10+(B-'0');
continue;
}
if (B=='.')goto Z;
J
}
U(B);
M=127LL;
B = 1;
W ( (N>M) ) {
for (C=B;C>0;--C) {
M<<=8;
M+=0xFFLL;
};
B = B*2;
}
N*=A;
R 0x80 + (((B-1)&7)<<3);
Z:
d=N;
W((F(B))!=EOF){
if (!isdigit(B)) J
e/=10;
d+=e*(B-'0');
}
d*=1.0*A;
U(B);
bcopy(&d,&N,8);
R 0x78;
}
int e(){
L(C);
W (C != EOF){
H (C){
E '[':
E '{':
A=( C & 0x70 ) | 0x1 ;
J
E ']':
E '}':
A=0xC0 + sep();
J
E 'f':
A=eat(0x00);
J
E 't':
A=eat(0x08);
J
E 'n':
A=eat(0x10);
J
E '\'':
Q(0x20);
str(C);
A=0x80 + sep();
J
G
A=num()+sep();
Q(A);
A=((A&0x38)>>3)+1;
for(B=A-1;B>0;B--){
C = (N>>(B*8))&0xFFLL;
Q(C);
}
A=(N & 0xFFLL);
J
}
Q(A);
L(C);
}
}
int f(){
F(C);
O:
B=(C&0x70);
Q((S[s++]=B)+0xB);
T:
F(C);
if (C&1)goto O;
B=C&0xC0;
if (B==0xC0)goto C;
H(C&0xF0){
E 0x00:
P(C&8?"true":"false");
J
E 0x10:
P("null");
J
E 0x20:
Q('\'');
str(0x80);
Q('\'');
C=B;
J
G
N=0LL;
B=((C&0x38)>>3)+1;
for(A=7;A>=(8-B);--A){
F(D);
N|=((D&0xFFLL)<<(A*8));
}
N>>=((8-B)*8);
if (C&0x80) {
P("%lld",N);
} else {
double d;
bcopy(&N,&d,8);
P("%.2f",d);
}
J
}
Z:
H(C&6){
E 0:
Q(',');
Q(' ');
J
E 2:
Q(':');
Q(' ');
J
E 6:
R EOF;
G
L(C);
goto C;
}
goto T;
C:
B=S[--s];
Q(B+0xD);
goto Z;
}
int main(int c, char** v){
R (*v[1]-'i')?f():e();
}

Validation:

$ gcc bincode.define.c
$ ./a.out i <input.test>output.test 
$ ls -l output.test 
-rw-r--r-- 1 dmckee staff 277 Aug 31 23:20 output.test
$ ./a.out d <output.test 
[-1, 0.10, 0, 'Foo', {'b': 8388608.25, 'e': -0.99, 'c': 14}, [1, 256, false, 32768, 0.00, 128, -127, -65535, 'Use jQuery for this one!'], 127, null, 2147483648, -32769.40, ['Hello World', -2147483647, [-65535.22, 42]], 2147483648.12, -19, {'l': -2147483647.99, 'o': {'p': [-128, 'Blub']}}, [['Bar'], [false, ['Test', {}, [null], [[1, 2], 1337], true], 4]], 'The cake is a lie...', -8388607.12]

and regression test

$ ./a.out i < input.test | ./a.out d > recon.test 
$ ./a.out i < recon.test | ./a.out d > recon2.test 
$ cmp recon.test recon2.test

Rather more readable version

/* bincode.c                                                                 */
/*                                                                           */
/* A minimal binary en-/de-coder.                                            */
/*                                                                           */
/* 29 Aug. 2010 dmckee                                                       */
/*                                                                           */
/* Defects and Assumptions                                                   */
/*                                                                           */
/* ** Strings are assumed to consist of characters 0-127 only.               */
/* ** The char type is assumed to be signed                                  */
/* ** ':' is treated like any other separator, so the code does not consider */
/*    "[1:2:true:false]" to be a problem and does not insist that {} type    */
/*    entities contain only X:Y entries.                                     */
/* ** long long int is assumed to be 8 bytes as is double                    */
/* ** Assumes that encode and decode will be done on machines with the same  */
/*    endianess                                                              */
/* ** Currently all number are coded in 8 bytes                              */
/* ** No compression on strings                                              */
/* ** Absolutely no input validation                                         */
/*                                                                           */
/*                                                                           */
/* Standard:                                                                 */
/*                                                                           */
/* Each data object is coded with one or more bytes.                         */
/*                                                                           */
/* A compound object [] or {} is indicated with the LSB (bit 1) set, and has */
/* the high nibble of the character ('[' or '{' that indicates the start of  */
/* the object. Bit three is ignores (set to zero)                            */
/*                                                                           */
/* Each object has a separator field which indicated what separator          */
/* (',', ':', a compound object closes after, EOF) follows. Separators are   */
/* coded into bits 2 and 3 as (00:',' 01:':' 10:close 11:EOF).               */ 
/*                                                                           */
/* The beginning of a string is indicated with a byte 0x20 (the separator    */
/* is coded in the close byte), followed by the literal byte of the string,  */
/* which must be 7-bit ASCII, and a close byte in the form b10000ss0, where  */
/* the s's are the separator coding.                                         */
/*                                                                           */
/* Booleans are coded b0000vss0 where v indicates the value.                 */
/*                                                                           */
/* Nulls are coded b00010ss0.                                                 */
/*                                                                           */
/* Integers are coded b10lllss0 and floats are b01lllss0. In both cases the  */
/* l-bits are reserved for indicating the length, a feature currently used   */
/* only for integers. For the numeric types the next L+1 bytes hold the in   */
/* memory representation of the value in MSB first order.                    */
/*                                                                           */
/* A closing bracket for a compound entity is coded b11000ss0. The type of   */
/* bracket is not indicated and is handled by stacking the open brackets.    */
/*                                                                           */

#include <stdio.h>
#include <ctype.h>
#include <string.h>
/* debugging only libraries */
/* #include <stdlib.h> */
#define fetch(a) a=getchar()
#define unfetch(a) ungetc(a,stdin)
#define say(a) putc(a,stdout)
#define lex(a) do{fetch(a);}while(isspace(a)&&(a!=EOF))
long long int N,M; /* buffer for int and float coding */
int A,B,C,D,S[99],s;/* object symbol stack and index*/

/* Consume non-spearating bytes, leaving the separator on the input, */
/* and returning EOF in the event that the end of nput is reached */
int eat(int a){
/*     fprintf(stderr,"eat:In: %02x (%c)\n",a,a); */
  while(isalpha(fetch(B)))/*{ fprintf(stderr,"eat: %02x (%c)\n",B,B); }*/;
/*   fprintf(stderr,"eat: %02x (%c)\n",B,B); */
  switch(B){
  case EOF: a+=4;
  case ':': a+=2;
  case ',':
    break;
  default:
    unfetch(B);
    a+=4;
  }
/*   fprintf(stderr,"eat:Out: %02x (%c)\n",a,a); */
  return a;
}

/* copy input bytes to the output until the argument or EOF is */
/* reached. Return EOF if the end of the input occurs, otherwise stick */
/* the argument back on the input */
int str(int a){
/*   fprintf(stderr,"str:In: %02x (%c)\n",a,a); */
  while((fetch(B))!=a && B<0x80 && B>0)say(B);
/*   fprintf(stderr,"str:Out: %02x (%c)\n",B,B); */
  return B;
}
/* Fetch one input byte and code the separator type from it. If it is */
/* non-separator, non-EOF, (i.e. a close bracket) put it back on the  */
/* input */
int sep(){
  lex(B);
  switch (B){
  case ',':
    return 0;
  case ':':
    return 2;
  case EOF:
    return 6;
  default:
    unfetch(B);
    return 4;
  }
}

/* attempt to parse numeric input */
int num(){
  double e=1.0, d;
  /* C holds the first charater */
  A = (C-'-')?1:-1; /* pick off a sign if needed */
  N=isdigit(C)?(C-'0'):0; /* C might be '-' or '+' */
  while((fetch(B))!=EOF){
    if (isdigit(B)){
      N=N*10+(B-'0');
      continue;
    }
    if (B=='.')goto Z;
    break;
  }
  /* push back the first non digit */
  unfetch(B);
  /* determine how many bytes are need to represent N */
  M=127LL;
  B = 1;
  while ( (N>M) ) {
    for (C=B;C>0;--C) {
      M<<=8;
      M+=0xFFLL;
    };
    B = B*2;
  }
  /* apply the sign */
  N*=A;
  return 0x80 + (((B-1)&7)<<3); 
 Z:
  d=N;
  /* going to leave floats at 8 bytes for now */
  while((fetch(B))!=EOF){
    if (!isdigit(B)) break;
    e/=10;
    d+=e*(B-'0');
  }
  d*=1.0*A;
  unfetch(B);
  bcopy(&d,&N,8);
  return 0x78; 
}
int encode(){
/*   fprintf(stderr,"Encoding...\n"); */
  lex(C);
  while (C != EOF){
    switch (C){
    case '[':
    case '{':
      /* emit and object byte coding for this bracket */
      A=( C & 0x70 ) | 0x1 ;
      break;
    case ']':
    case '}':
      /* emit a close byte; no matching enforced! */
      A=0xC0 + sep();
      break;
    case 'f': /* false boolean */
      A=eat(0x00);
      break;
    case 't': /* true boolean */
      A=eat(0x08);
      break;
    case 'n': /* null */
      A=eat(0x10);
      break;
    case '\'': /* string case */
      say(0x20);
      str(C);
      A=0x80 + sep(); /* string close is signaled by a negative value */
      break;
    default:
      /* a numeric value */
      A=num()+sep();
      say(A);
      A=((A&0x38)>>3)+1; /* decode the length to use */
      for(B=A-1;B>0;B--){
    C = (N>>(B*8))&0xFFLL;
    say(C);
      }
      A=(N & 0xFFLL);
      break;
    }
    say(A);
    lex(C);
  }
}
int decode(){
/*   fprintf(stderr,"Decoding...\n"); */
  fetch(C);
 O:
  B=(C&0x70);
  say((S[s++]=B)/* Push to the symbol stack */+0xB);
 T:   
  fetch(C);
  if (C&1)goto O;
  B=C&0xC0;
  if (B==0xC0)goto C;
  switch(C&0xF0){
  case 0x00:
    printf(C&8?"true":"false");
    break;
  case 0x10:
    printf("null");
    break;
  case 0x20: /* string case */
    say('\'');
    str(0x80);
    say('\'');
    C=B;
    break;
  default: /* int or float case */
    /* C holds the number indicator with int/float and sep info */
    N=0LL;
    B=((C&0x38)>>3)+1; /* decode the length to use */
    /* we need to force sign extension on small values, */
    /* so we put the bytes in high  */
    for(A=7;A>=(8-B);--A){      
      fetch(D);
      N|=((D&0xFFLL)<<(A*8));
    }
    /* then shift down wholesale */
    N>>=((8-B)*8);
    if (C&0x80) {
      printf("%lld",N);
    } else {
      double d;
      bcopy(&N,&d,8);
      printf("%.2f",d);
    }
    break;
  }
 Z:
  switch(C&6){
  case 0:
    say(',');
    say(' ');
    break;
  case 2:
    say(':');
    say(' ');
    break;
  case 6:
    return EOF;
  default:
    lex(C);
    goto C;
  }
  goto T;
  /* Close an object bracket */
 C:
  /* C currently holds the close record, with the necessary separator, */
  /* so don't overwrite it */
  B=S[--s];/* Pop from symbol stack */
  say(B+0xD);
  goto Z;
}
int main(int c, char** v){
  return (*v[1]-'i')?decode():encode();
}

The difficulty with the coding arises because this versions codes are floating point values into 8 bytes and does not compress strings at all.

Philosophical discussion: In detail c is ill suited to the problem, because lacking reflection makes general serialization...uhm...tricky. This code actually woks on the test input as a string. In that sense a plain string compression is almost certainly better. But I've gone ahead ad treated this as a structured parsing problem, and the resulting compresssed format has substrings that are fully valid inputs to the decoder.

dmckee
so umm, wouldn't it be easier and shorter NOT to do any encoding, if said encoding increases the size of the test input? me, i'd expect the test input to be decreased by half...
Nas Banov
@Nas: Of course it would be easier, but this method compresses some strings efficiently, and will do better if I get around to dealing with the numeric types properly. Being smart about the strings is harder still. ::shrug:: The problem is not natural in c, but a thrown gauntlet must be picked up.
dmckee
well consider that by the proposed rating score, 1 byte less in encoding equals 100 bytes of code, i.e. you are better off generating small encoding than writing short program
Nas Banov
dmckee
@dmckee: no, that's not the idea... i would like to compress the data structure for real at least half its length and thus get down to 17000 "points" or under. PS. and i did - see my post
Nas Banov
+1  A: 

Python, penalty points=16568 (best solution so far)

Code is lightly optimized for golfing, main focus was on producing smallest output:

inp = [-1, 0.1, 0, 'Foo', 8388608.25, -0.99, 14, [1, 256, False,
    32768, 0.00, 128, -127, -65535, 'Use jQuery for this one!'],
    127, None, 2147483648, -32769.4, ['Hello World', -2147483647,
    [-65535.22, 42]], 2147483648.12, -19, -2147483647.99, -128,
    [['Bar'],[False, ['Test', [None],[[1, 2], 1337], True], 4]],
    'The cake is a lie...', -8388607.12
]
#################################################################
num={'0':'00','1':'0110','2':'0111','3':'1000','4':'1001','5':'1010','6':'1011','7':'1100','8':'1101','9':'1110','-':'11110','.':'11111'}
bin=lambda i,n:''.join(str(i>>n+~y&1)for y in range(n))
enc=lambda o:list==type(o)and'01'+''.join(enc(i)for i in o)+'10'or type(o)in(int,long,float)and'00'+''.join(num[i]for i in str(o))+'010'or str==type(o)and'110'+''.join(c==' 'and'0111'or'`'<c<'{'and bin(ord(c)-97,6)or'1'+bin(ord(c),7)for c in o)+'01101'or o==True and'11110'or o==False and'11101'or o==None and'11100'
def dec(i):
    if'0'==i():
        if'0'==i():
            r=''
            while 1:
                j=i()+i()
                if'00'==j:r+='0'
                else:
                    j+=i()
                    if'010'==j:break
                    j+=i()
                    r+=j<'1111'and`int(j,2)-5`or'-.'[int(i())]
            r=eval(r)
        else:
            r=[]
            while()not in r:r.append(dec(i))
            r=r[:-1]
    else:
        if'0'==i():r=()
        elif'0'==i():
            r=''
            while 1:
                j=i()+i()+i()+i()
                if'0111'==j:r+=' '
                else:
                    j+=i()
                    if'01101'==j:break
                    j+=i()
                    r+=chr(j<'100000'and 97+int(j,2)or int(j[1:]+i()+i(),2))
        else:r=(None,False,True)[int(i()+i(),2)]
    return r
e=hex(int(enc(inp)[::-1],2))[2:-1].decode('hex')
d=dec(list(bin(int(e.encode('hex'),16),8*len(e))).pop)
######################################################

Length (proper code, measured between ####### lines) 1168 chars. Results:

>>> e
"+\x9b\xe6m\xd8\xdb\xcb:\xba\xba\x90&\x9c\x07$'\x10\xa0\x08q\x1c\x15\xb5R/Q\x88\xb0\xae\x13\x14\x9e\xb6I\x10V\xeb\xcd\xa2\x01\r\xd2\xbeg\x89\xdf\xe7:7'-\xcf\x13\xb3\xc5\xcd\xfb\x9d\x1b\x93\x96\xe1WH\xbb\xbe\xa2\xab\xaf$s\xa3rr\xdc\xf2\xd8i\x13\x9d|\xe6\x9a\x10&\xe5?}>\x17\x8a\xe7F\xe4\xe5\xb8:>a\xb4)\x169\xc9\t\xc6]\x13\x94p\xd1\x10T\\\x9cBUm(\xaa\xeb\xc4|\xcf\x15\xf3\x08\xf8+\xd3\xe1.\xb5xLJXN\xff\x1e%\xef\xd9\xb7ce\x9cq\x8d\xa0M\xf0L\xf2"
>>> len(e)
154
>>> e.encode('hex')
'2b9be66dd8dbcb3ababa90269c07242710a008711c15b5522f5188b0ae13149eb6491056ebcda2010dd2be6789dfe73a37272dcf13b3c5cdfb9d1b9396e15748bbbea2abaf2473a37272dcf2d869139d7ce69a1026e53f7d3e178ae746e4e5b83a3e61b4291639c909c65d139470d110545c9c42556d28aaebc47ccf15f308f82bd3e12eb5784c4a584eff1e25efd9b763659c718da04df04cf2'
>>> d
[-1, 0.10000000000000001, 0, 'Foo', 8388608.25, -0.98999999999999999, 14, [1, 256, False, 32768, 0.0, 128, -127, -65535, 'Use jQuery for this one!'], 127, None, 2147483648L, -32769.400000000001, ['Hello World', -2147483647, [-65535.220000000001, 42]], 2147483648.1199999, -19, -2147483647.99, -128, [['Bar'], [False, ['Test', [None], [[1, 2], 1337], True], 4]], 'The cake is a lie...', -8388607.1200000001]
>>> d==inp
True

The test input encodes and decodes correctly. The encoded length is 154 chars. Thus the score is only 154*100+1168=16568 !

Nas Banov