views:

231

answers:

2

Is there a way in constant working space to do arbitrary size and arbitrary base conversions. That is, to convert a sequence of n numbers in the range [1,m] to a sequence of ceiling(n*log(m)/log(p)) numbers in the range [1,p] using a 1-to-1 mapping that (preferably but not necessarily) preservers lexigraphical order?

I'm particularly interested in solutions that are viable as a pipe function, e.i. are able to handle larger dataset than can be stored in RAM.

I have found a number of solutions that require "working space" proportional to the size of the input but none yet that can get away with constant "working space".


some thoughts:

might this work?

streamBasen -> convert(n, lcm(n,p)) -> convert(lcm(n,p), p) -> streamBasep

(where lcm is least common multiple)

+3  A: 

I don't think it's possible in the general case. If m is a power of p (or vice-versa), or if they're both powers of a common base, you can do it, since each group of logm(p) is then independent. However, in the general case, suppose you're converting the number a1 a2 a3 ... an. The equivalent number in base p is

sum(ai * mi-1 for i in 1..n)

If we've processed the first i digits, then we have the ith partial sum. To compute the i+1'th partial sum, we need to add ai+1 * mi. In the general case, this number is going have non-zero digits in most places, so we'll need to modify all of the digits we've processed so far. In other words, we'll have to process all of the input digits before we'll know what the final output digits will be.

In the special case where m are both powers of a common base, or equivalently if logm(p) is a rational number, then mi will only have a few non-zero digits in base p near the front, so we can safely output most of the digits we've computed so far.

Adam Rosenfield
Thanks .
BCS
+2  A: 

I think there is a way of doing radix conversion in a stream-oriented fashion in lexicographic order. However, what I've come up with isn't sufficient for actually doing it, and it has a couple of assumptions:

  1. The length of the positional numbers are already known.
  2. The numbers described are integers. I've not considered what happens with the maths and -ive indices.

We have a sequence of values a of length p, where each value is in the range [0,m-1]. We want a sequence of values b of length q in the range [0,n-1]. We can work out the kth digit of our output sequence b from a as follows:

bk = floor[ sum(ai * mi for i in 0 to p-1) / nk ] mod n

Lets rearrange that sum into two parts, splitting it at an arbitrary point z

bk = floor[ ( sum(ai * mi for i in z to p-1) + sum(ai * mi for i in 0 to z-1) ) / nk ] mod n

Suppose that we don't yet know the values of a between [0,z-1] and can't compute the second sum term. We're left with having to deal with ranges. But that still gives us information about bk.

The minimum value bk can be is:

bk >= floor[ sum(ai * mi for i in z to p-1) / nk ] mod n

and the maximum value bk can be is:

bk <= floor[ ( sum(ai * mi for i in z to p-1) + mz - 1 ) / nk ] mod n

We should be able to do a process like this:

  1. Initialise z to be p. We will count down from p as we receive each character of a.
  2. Initialise k to the index of the most significant value in b. If my brain is still working, ceil[ logn(mp) ].
  3. Read a value of a. Decrement z.
  4. Compute the min and max value for bk.
  5. If the min and max are the same, output bk, and decrement k. Goto 4. (It may be possible that we already have enough values for several consecutive values of bk)
  6. If z!=0 then we expect more values of a. Goto 3.
  7. Hopefully, at this point we're done.

I've not considered how to efficiently compute the range values as yet, but I'm reasonably confident that computing the sum from the incoming characters of a can be done much more reasonably than storing all of a. Without doing the maths though, I won't make any hard claims about it though!

Wuggy
I don't think you can throw away the second term in the sum, because floor(a+b) ≠ floor(a) + floor(b) in general. If the first term is just a tiny fraction less than an integer, the value of the second term is very significant.
Adam Rosenfield
I think I've (erroneously) stated the conditions for selecting z. Conceptually what I'm trying to do is to use a range for the incoming number 'a'. As we get each digit, the range that 'a' can be narrows. Once it is narrow enough, we can know with certainty (in any base we like) what the initial sequence is (i.e. if we're converting 0x4b?? to decimal, even missing the last 2 digits we know that the first decimal digits are 19, since the number can only be between 19200 and 19455. I'll re-examine what I've done and see if I can express it correctly in an edit.
Wuggy
This process reminds me of arithmetic decompress (range decoding?) - which might be enough to show that any stream can be split into digits uniquely.I think there is a pathalogical case where you can't output the next digit until you've seen all the incoming stream.
caffiend
@caffiend: Intuition tells me that the pathological cases occur when converting to a base n which is the same magnitude as all of *a*. I'm basing this on looking at radixes which require exact numbers of digits to convert between (i.e. base 2 to base 2<sup>n</sup>). These seem to be flag-posts for a simple function indicating the number of values required. As the to-radix gets proportionally bigger, more and more values have an effect on each output character, until you get to the pathological case when all of them are needed. I don't think that occurs until the to-radix is very very large!!
Wuggy
@wuggy: you'll always get the pathological case of 99999999... where a single carry in the least significant digit can ripple all the way to the top
Martin DeMello
@Martin DeMello: Ugh. Of course there is. Sorry - I've been a bit dim recently. Maybe I need more caffeine or something. :(
Wuggy