ansaurus

Question

Sorting numbers from 1 to 999,999,999 in words as strings

Answer 1

+9 A:

Those strings are going to have lots and lots of common prefixes - perfect use case for a trie, which would drastically reduce memory usage and probably also running time.

Michael Borgwardt 2009-09-29 21:52:21

So that'd work - at the cost of greater time-complexity you could spin through the trie after each insertion to discard any elements after the 51 billionth character too, so 51 billion characters + a bit would be the upper bound on the data needed to be stored in memory.

Dominic Rodger 2009-09-29 22:02:32

@Dominic: And how much time would generating all 999,999,999 strings and inserting them into a trie take?

MAK 2009-09-30 14:29:49

@MAK - unless I've misunderstood tries, I believe it'd be O(nlogn). As to absolute time, I've no idea. An awfully long time I'd expect.

Dominic Rodger 2009-10-01 11:55:11

Answer 2

A:

Do you need to save the entire string in memory?

If not, just save how many characters you've appended so far. For each iteration, you check the length the next number's textual representation. If it exceeds the nth letter you are looking for, the letter must be in that string, so extract it by it's index, print it, and stop execution. Otherwise, add the string length to the character count and move to the next number.

Phil M 2009-09-29 21:53:32

How would that work? You'd have to generate all the strings before you sort them, right? So you'd still have significantly greater than 51 billion characters in memory, which is an awful lot of data.

Dominic Rodger 2009-09-29 21:55:39

The problem is coming up with the sorted list in the first place...

DJ 2009-09-29 21:57:50

This is what I was thinking, until I saw the "sorted alphabetically" part

Neil N 2009-09-29 21:59:39

You don't need to sort the list. I think the key to this is thinking how you would programmatically work out just the first or last string. Then how to work out the second, etc etc, and keep going until you get to 51 billion or whatever.

DisgruntledGoat 2009-09-29 23:14:45

I totally missed the sorting part of the question! :(

Phil M 2009-09-30 00:56:57

No, any scheme that involves saving the whole string *anywhere* (including tries) is not the right approach.

RBarryYoung 2009-09-30 17:03:24

Answer 3

A:

All the strings are going to start with either one, ten, two, twenty, three, thirty, four, etc so I'd start with figuring out how many are in each of the buckets. Then you should at least know which bucket you need to look closer at.

Then I'd look at subdividing the buckets further based on the possible prefixes. For example, within ninehundred, you are going to have all the same buckets that you had to start off with, just for numbers starting with 900.

Alan Jackson 2009-09-29 22:03:15

Answer 4

+14 A:

I'd sort the names of the first 20 integers and the names of the tens, hundreds and thousands, work out how many numbers start with each of those, and go from there.

For example, the first few are [ eight, eighteen, eighthundred, eightmillion, eightthousand, eighty, eleven, ....

The numbers starting with "eight" are 8. With "eighthundred", 800-899, 800,000-899,999, 800,000,000-899,999,999. And so on.

The number of letters in the concatenation of words for 0 ( represented by the empty string ) to 99 can be found and totalled; this can be multiplied with "thousand"=8 or "million"=7 added for higher ranges. The value for 800-899 will be 100 times the length of "eighthundred" plus the length of 0-99. And so on.

Pete Kirkham 2009-09-29 22:03:34

+1. This is the only sort of approach that's going to work as the numbers get large; everyone's idea of actually storing a billion strings is crazy!

ShreevatsaR 2009-09-29 23:16:51

Agreed, this is one of only two correct approaches that also have a chance of ever finishing (the other is elaboration streaming, which is both very difficult to get right and much slower). Still, the answer given is very abstract and the real devil is in the details.

RBarryYoung 2009-09-30 16:59:38

Given the question appears to be still live for selecting candidates for interview, I didn't think it sporting to post a fully worked out answer.

Pete Kirkham 2009-09-30 17:16:10

This is part of a job application?!? Heh, I guess the test is to see who's smart enough to realize that it's actually harder to solve it with a program than it is with Excel of a paper and pencil. fair enough...(+1)

RBarryYoung 2009-10-01 04:11:49

@RBarryYoung - if you've got enough time to solve it with a pen and paper, you either (a) have far too much time, (b) can write so quick that you are probably not human.

Dominic Rodger 2009-10-01 07:58:47

Dominic: nope, this really is easier for a person to solve than most folks here realize. What Pete describes above is a 5000-foot description of the best and fastest way to solve. Using a program for this is harder because there is a significant amount of one-time special casing involved. By the time you figure all of that out, code it up, test it, debug it, and then finally run it, then yes you could have done it by hand.

RBarryYoung 2009-10-01 14:35:24

Answer 5

A:

The question is about efficient data storage not string manipulation. Create an enum to represent the words. the words should appear in sorted order so that when it comes time to sort it is a simplish compare. Now generate the list and sort. use the fact that you know how long each word is in conjunction with the enum to add up to the character you need.

stonemetal 2009-09-29 22:06:05

Answer 6

A:

Honestly I would let an RDBMS like SQL Server or Oracle do the work for me.

Insert the billion strings into an indexed table.
Compute a string length column.
Start pulling off the top X records at a time with a SUM, until I get to 51 billion.

Might beat up the server for a while as it would need to do a lot of Disk IO, but overall I think I could find an answer faster than someone who would write a program to do it.

Sometimes just getting it done is what the client really wants, and could care less what fancy design pattern or data structure you used.

Neil N 2009-09-29 22:10:17

Where can I sign up to work for the client that asks for problems like this to be solved?

Fragsworth 2009-09-29 22:38:57

Well, it's better than the orginal approach, but that's still an awful lot of diskspace, even by today's standards and it'll take forever to run.

RBarryYoung 2009-09-30 16:56:38

Answer 7

A:

You have one billion numbers and 51 billion characters - there's a good chance that this is a trick question, as there are an average of 51 characters per number. Sum up the conversions of all the numbers and see if it adds up to 51 billion.

Edit: It adds up to 70,305,000,000 characters, so this is the wrong answer.

Mark Ransom 2009-09-29 22:10:50

the problem doesnt state that you have 51 billion characters, it just asks what the 51 billionth character is.

aepurniet 2009-09-30 03:08:54

And I'm stating that there's a chance the 51 billionth character is also the last character. Testing that possibility is a simpler problem than an all-out solution.

Mark Ransom 2009-09-30 06:45:37

Huh - that was a closer call than I thought it might be - good effort! Out of interest, how did you compute the character counts? Using some kind of common-subsequence counting?

Dominic Rodger 2009-09-30 13:37:55

I created the names with the code I posted to another answer. I tried two different computation methods to make sure they matched, one with common subsequences and one with brute force.

Mark Ransom 2009-09-30 16:06:13

Answer 8

A:

Code wins...

#!/bin/bash
n=0
while [ $n -lt 1000000000 ]; do
   number -l $n | sed -e 's/[^a-z]//g'
   let n=n+1
done | sort > /tmp/bignumbers
awk '
BEGIN {
    total = 0;
}
{
    l = length($0); 
    offset = total + l - 51000000000;
    print total " " offset
    if (offset >= 0) {
        c = substr($0, l - offset, 1);
        print c;
        exit;
    }
    total = total + l;
}' /tmp/bignumbers

Tested for a much smaller range ;-). Requires a LOT of diskspace, a compressed filesystem would be, umm, valuable, but not so much memory.

Sort has options to compress work files as well, and you could toss in gzip to directly compress data.

Not the zippiest solution.

But it does work.

Will Hartung 2009-09-29 22:24:54

It would be interesting to see if someone could come up with a better algorithm while waiting for this to finish.

Mark Ransom 2009-09-29 22:30:18

Mine is still running...sort is 25MB heap so far, and has two, 16ishMB work files, but it will destroy my drive if it actually ever finishes, I can't hold the sorted file on this drive....I have no hope that it will get that for however. :-)

Will Hartung 2009-09-29 23:04:15

Mine is finished. How about yours?

Mark Ransom 2009-10-01 14:44:22

Answer 9

A:

weird but fun idea.

build a sparse list of the lengths of the number from 0 to 9, then 10-90 by tens, then 100, 1000, etc etc, to billion, indexes are the value of the integer part who's lenght is stored.

write a function to calculate the number as a string length using the table. (breaking the number into it's parts, and looking up the length of the aprts, never actally creating a string.)

then you're only doing math as you traverse the numbers, calculating the length from the table afterward summing for your sum.

with the sum, and the value of the final integer, figure out the integer that's being spelled, and volia, you're done.

Eric Brown - Cal 2009-09-29 22:25:42

Answer 10

+8 A:

This guy has a solution to the puzzle written in Haskell. Apparently Michael Borgwardt was right about using a Trie for finding the solution.

Anax 2009-09-29 22:28:02

Yes, it "uses" a trie, but that solution does not actually store all billion strings, or anywhere close to that. Instead, it generates only the relevant parts of the trie after the question is asked, cleverly skipping over *most* of the billion numbers using a bit of **automatic differentiation** and other ideas. Your link is the fourth (and presumably final) part in the series; look at Part 2 for the crucial ideas: http://conway.rutgers.edu/~ccshan/wiki/blog/posts/WordNumbers2/ As such, despite sharing a keyword ("trie"), it is closer to Pete Kirkham's answer than Michael Borgwardt's. :-)

ShreevatsaR 2009-09-30 13:58:49

Thanks for taking the trouble and sharing this information.

Anax 2009-09-30 14:19:18

Answer 11

+1 A:

Yes, me again, but a completely different approach.

Simply, rather than storing the "onethousandeleventyseven" words, you write the sort to use that when comparing.

Crude java POC:

public class BillionsBillions implements Comparator {
    public int compare(Object a, Object b) {
        String s1 = (String)a; // "1234";
        String s2 = (String)b; // "1235";

        int n1 = Integer.valueOf(s1);
        int n2 = Integer.valueOf(s2);

        String w1 = numberToWords(n1);
        String w2 = numberToWords(n2);

        return w1.compare(w2);
    }

    public static void main(String args[]) {
        long numbers[] = new long[1000000000]; // Bring your 64 bit JVMs

        for(int i = 0; i < 1000000000; i++) {
            numbers[i] = i;
        }

        Arrays.sort(numbers, 0, numbers.length, new BillionsBillions());

        long total = 0;

        for(int i : numbers) {
            String l = numberToWords(i);
            long offset = total + l - 51000000000;

            if (offset >= 0) {
                String c = l.substring(l - offset, l - offset + 1);
                System.out.println(c);
                break;
            }
         }
    }
}

"numberToWords" is left as an exercise for the reader.

Will Hartung 2009-09-29 22:39:58

Answer 12

A:

figure out lengths for 1-999 and include length for 0 as 0.

so now you have an array for 0-999 namely uint32 sizes999[1000]; (not going to get into the details of generating this) also need an array of thousand last letters last_letters[1000] (again not going to get into the details of generating this as it is even easier even hundreds d even tens y except 10 which is n others cycle though last of on e through nin e zero is irrelavant)

uint32 sizes999[1000];

uint64 totallen = 0;
strlen_million = strlen("million");
strlen_thousand = strlen("thousand");
for (uint32 i = 0; i<1000;++i){
    for (uint32 j = 0; j<1000;++j){
        for (uint32 j = 0; j<1000;++j){ 
           total_len += sizes999[i]+strlen_million + 
                        sizes999[j]+strlen_thousand +
                        sizes999[k];
           if totallen == 51000000000 goto done;
           ASSERT(totallen <51000000000);//he claimed  51000000000 was not intermediate
        }
    }
}
done:

//now use i j k to get last letter by using last_letters999

//think of i,j,k as digits base 1000

//if k = 0 & j ==0 then the letter is n millio**n**

//if only k = 0 then the letter is d thousan**d**

//other wise use the array of last_letters since

//the units digit base 1000, that is k, is not zero

//for the sum of the numbers i,j,k are the digits of the number base 1000 so

n = i*1000000 + j*1000 + k;

//represent the number and use

sum = n*(n+1)/2;

if you need to do it for number other than 51000000000 then also calculate sums_sizes999 and use that in the natural way.

total memory: 0(1000);

total time: 0(n) where n is the number

pgast 2009-09-29 22:57:40

This doesn't sort the number names in alphabetical order.

Will Hartung 2009-09-29 23:04:59

your are absolutely correct

pgast 2009-09-30 03:12:16

Answer 13

A:

This is what I'd do:

Create an array of 2,997 strings: "one" through "ninehundredninetynine", "onethousand" through "ninehundredninetyninethousand", and "onemillion" through "ninehundredninetyninemillion".
Store the following about each string: length (this can be calculated of course), the integer value represented by the string, and some enum to signify whether it's "ones", "thousands", or "millions".
Sort the 2,997 strings alphabetically.

With this array created, it's straightforward to find all 999,999,999 strings in order alphabetically based on the following observations:

Nothing can follow a "ones" string
Either nothing, or a "ones" string, can follow a "thousands" string
Either nothing, a "ones" string, a "thousands" string, or a "thousands" string then a "ones" string, can follow a "millions" string.

Constructing the words basically involves creating one- to three-letter "words" based on these 2,997 tokens, making sure that the order of the tokens makes a valid number according to the rules above. Given a particular "word", the next "word" is found like this:

Lengthen the "word" by adding the token first alphabetically, if possible.
If this can't be done, advance the rightmost token to the next one alphabetically, if possible.
If this too is not possible, then remove the rightmost token, and advance the second-rightmost token to the next one alphabetically, if possible.
If this too is not possible, you're done.

At each step you can calculate the total length of the string and the sum of the numbers by just keeping two running totals.

John at CashCommons 2009-09-29 23:06:28

Answer 14

A:

It's important to note that there is a lot of overlapping and double counting if you iterate over all 100 billion possible numbers. It's important to realize that the number of strings that start with "eight" is the same number of numbers that start with "nin" or "seven" or "six" etc...

To me, this begs for a dynamic programming solution where the number of strings for tens, hundreds, thousands, etc are calculated and stored in some type of look up table. Ofcourse, there will be special cases for one vs eleven, two vs twelve, etc

I'll update this if I can get a quick running solution.

llamaoo7 2009-09-29 23:24:08

Answer 15

A:

WRONG!!!!!!!!! I READ THE PROBLEM WRONG. I thought it meant "what's the last letter of the alphabetically last number"

what's wrong with:

public class Nums {

// if overflows happen, switch to an BigDecimal or something
// with arbitrary precision
public static void main(String[] args) {
    System.out.println("last letter: " + lastLetter(1L, 51000000L);
    System.out.println("sum: " + sum(1L, 51000000L);
}    

static char lastLetter(long start, long end) {
   String last = toWord(start);
   for(long i = start; i < end; i++)
      String current = toWord(i);
      if(current.compareTo(last) > 1)
         last = current;

   return last.charAt(last.length()-1);
}

static String toWord(long num) {
   // should be relatively easy, but for now ...
   return "one";
}

static long sum(long first, long n) {
   return (n * first + n*n) / 2;
}
}

haven't actually tried this :/ LOL

LES2 2009-09-29 23:50:41

Answer 16

+2 A:

(The first attempt at this is wrong, but I will leave it up since it's more useful to see mistakes on the way to solving something rather than just the final answer.)

I would first generate the strings from 0 to 999 and store them into an array called thousandsStrings. The 0 element is "", and "" represents a blank in the lists below.

The thousandsString setup uses the following:

Units: "" one two three ... nine

Teens: ten eleven twelve ... nineteen

Tens: "" "" twenty thirty forty ... ninety

The thousandsString setup is something like this:

thousandsString[0] = ""

for (i in 1..10)
   thousandsString[i] = Units[i]
end

for (i in 10..19)
   thousandsString[i] = Teens[i]
end

for (i in 20..99)
   thousandsString[i] = Tens[i/10] + Units[i%10]
end

for (i in 100..999)
   thousandsString[i] = Units[i/100] + "hundred" + thousandsString[i%100]
end

Then, I would sort that array alphabetically.

Then, assuming t1 t2 t3 are strings taken from thousandsString, all of the strings have the form

t1

OR

t1 + million + t2 + thousand + t3

OR

t1 + thousand + t2

To output them in the proper order, I would process the individual strings, followed by the millions strings followed by the string + thousands strings.

foreach (t1 in thousandsStrings)

   if (t1 == "")
     continue;

   process(t1)

   foreach (t2 in thousandsStrings)
      foreach (t3 in thousandsStrings)
         process (t1 + "million" + t2 + "thousand" + t3)
      end
   end

   foreach (t2 in thousandsStrings)
       process (t1 + "thousand" + t2)
   end
end

where process means store the previous sum length and then add the new string length to the sum and if the new sum is >= your target sum, you spit out the results, and maybe return or break out of the loops, whatever makes you happy.

=====================================================================

Second attempt, the other answers were right that you need to use 3k strings instead of 1k strings as a base.

Start with the thousandsString from above, but drop the blank "" for zero. That leaves 999 elements and call this uStr (units string).

Create two more sets:

tStr = the set of all uStr + "thousand"

mStr = the set of all uStr + "million"

Now create two more set unions:

mtuStr = mStr union tStr union uStr

tuStr = tStr union uStr

Order uStr, tuStr, mtuStr

Now the looping and logic here are a bit different than before.

foreach (s1 in mtuStr)
   process(s1)

   // If this is a millions or thousands string, add the extra strings that can
   // go after the millions or thousands parts.

   if (s1.contains("million"))
      foreach (s2 in tuStr)
         process (s1+s2)          

         if (s2.contains("thousand"))
            foreach (s3 in uStr)
               process (s1+s2+s3)
            end
         end
      end
   end

   if (s1.contains("thousand"))
      foreach (s2 in uStr)
         process (s1+s2)
      end
   end
end

John 2009-09-30 01:36:28

Almost. First problem, 8000000 becomes "eightmillionthousand", but this is easily fixed. Second problem, you produce an incorrect order: "eight", "eightmillion", "eightthousand", "eighthundred".

Mark Ransom 2009-09-30 05:57:20

Ah, you're right about needing to ignore the blanks, and I see you're right about the eighthundred vs eightmillion ordering. So it looks like the answer is the one given by someone else using 3k strings instead of 1k strings so you use the answers given by Pete Kirkham or John W

John 2009-09-30 12:50:05

+1: not 1005, but still far better than the two Trie answers that everyone is upvoting for some reason (tries are NOT the way to solve this problem).

RBarryYoung 2009-09-30 17:00:45

*sigh* "Not 100%", stupid fingers...

RBarryYoung 2009-09-30 17:01:21

Answer 17

+14 A:

Edit: Solved!

You can create a generator that outputs the numbers in sorted order. There are a few rules for comparing concatenated strings that I think most of us know implicitly:

a < a+b, where b is non-null.
a+b < a+c, where b < c.
a+b < c+d, where a < c, and a is not a subset of c.

If you start with a sorted list of the first 1000 numbers, you can easily generate the rest by appending "thousand" or "million" and concatenating another group of 1000.

Here's the full code, in Python:

import heapq

first_thousand=[('', 0), ('one', 1), ('two', 2), ('three', 3), ('four', 4),
                ('five', 5), ('six', 6), ('seven', 7), ('eight', 8),
                ('nine', 9), ('ten', 10), ('eleven', 11), ('twelve', 12),
                ('thirteen', 13), ('fourteen', 14), ('fifteen', 15),
                ('sixteen', 16), ('seventeen', 17), ('eighteen', 18),
                ('nineteen', 19)]
tens_name = (None, 'ten', 'twenty', 'thirty', 'forty', 'fifty', 'sixty',
             'seventy','eighty','ninety')
for number in range(20, 100):
    name = tens_name[number/10] + first_thousand[number%10][0]
    first_thousand.append((name, number))
for number in range(100, 1000):
    name = first_thousand[number/100][0] + 'hundred' + first_thousand[number%100][0]
    first_thousand.append((name, number))

first_thousand.sort()

def make_sequence(base_generator, suffix, multiplier):
    prefix_list = [(name+suffix, number*multiplier)
                   for name, number in first_thousand[1:]]
    prefix_list.sort()
    for prefix_name, base_number in prefix_list:
        for name, number in base_generator():
            yield prefix_name + name, base_number + number
    return

def thousand_sequence():
    for name, number in first_thousand:
        yield name, number
    return

def million_sequence():
    return heapq.merge(first_thousand,
                       make_sequence(thousand_sequence, 'thousand', 1000))

def billion_sequence():
    return heapq.merge(million_sequence(),
                       make_sequence(million_sequence, 'million', 1000000))

def solve(stopping_size = 51000000000):
    total_chars = 0
    total_sum = 0
    for name, number in billion_sequence():
        total_chars += len(name)
        total_sum += number
        if total_chars >= stopping_size:
            break
    return total_chars, total_sum, name, number

It took a while to run, about an hour. The 51 billionth character is the last character of sixhundredseventysixmillionsevenhundredfortysixthousandfivehundredseventyfive, and the sum of the integers to that point is 413,540,008,163,475,743.

Mark Ransom 2009-09-30 13:13:53

ansaurus

tags:

views:

answers:

Sorting numbers from 1 to 999,999,999 in words as strings

related questions