views:

108

answers:

4

Hi,

I am having a very large string, and when I read it in Java, I am getting out of memory error. Actually, I need to read all this string into memory and then split into individual strings and sort them based on value. What is the best way do this?

Thanks

+2  A: 

If you are limited by memory then you could try applying merge sort else increase the heap size using virtual machine parameters -Xmx and -Xms

keshav.veerapaneni
+4  A: 

Where does your large String come from? As you say you read it, I assume it comes from a file. Do you have to know the whole String to know where to split it? If not, you could just read the file char by char until you hit a split marker, put all the chars read so far in a String and begin reading the next String. Would you roughly know where to sort a single String you just read? If so, you could write the partial Strings to separate files (e.g. all Strings starting with A go to A.tmp when you sort your Strings alphabetically) in the first run. After that, you can sort the (hopefully now small enough to fit in your memory) created files' contents and finally append the contents to a new outputfile.

hd42
+1  A: 

If you want Hadoop to process a 100 GiB apache logfile "line by line" you are essentially doing the same as what you want: A large body of text split into pieces.

The normal way for doing that in Hadoop (as you tagged the question with this) is using the TextInputFormat which uses LineRecordReader which uses LineReader to split the Text file on the "end-of-line" separator. What you want is essentially the same with one difference: split on something different.

Sorting the resulting values (in Hadoop) is essentially done by employing what is called "Secondary Sort" (See the Hadoop example and the explanation in Tom's book).

So what I would recommend doing is

  1. Make your own variation on TextInputFormat/LineRecordReader/LineReader that reads and extracts the individual parts of your String based on you separator.
  2. Create a map that rewrites the information to do Secondary Sort.
  3. Create the correct partition, group and key comparator classes/methods to do the sorting.
  4. Create a reduce where you receive the sorted information which you can the process further.

HTH

Niels Basjes
A: 

You can look at External sorting algoritmhs

Qubeuc
Yes, but going into this kind of algorithms at that detail level is something I would rather leave to the implementors of frameworks like Hadoop.
Niels Basjes