tags:

views:

572

answers:

10

I need to include about 1 MByte of data in a Java application, for very fast and easy access in the rest of the source code. My main background is not Java, so my initial idea was to convert the data directly to Java source code, defining 1MByte of constant arrays, classes (instead of C++ struct) etc., something like this:

public final/immutable/const MyClass MyList[] = { 
  { 23012, 22, "Hamburger"} , 
  { 28375, 123, "Kieler"}
};

However, it seems that Java does not support such constructs. Is this correct? If yes, what is the best solution to this problem?

NOTE: The data consists of 2 tables with each about 50000 records of data, which is to be searched in various ways. This may require some indexes later, with significant more records, maybe 1 million records, saved this way. I expect the application to start up very fast, without iterating through these records.

+3  A: 

An idea is that you use enumerators, but I'm not sure if this suits to your implementation, and it also depends on how you are planning to use the data.

public enum Stuff {

 HAMBURGER (23012, 22),
 KIELER    (28375, 123);

 private int a;
 private int b;

 //private instantiation, does not need to be called explicitly.
 private Stuff(int a, int b) {
    this.a = a;
    this.b = b;
  }

 public int getAvalue() {
   return this.a;
 }

 public int getBvalue() {
   return this.b;
 }

}

These can then be accessed like:

Stuff someThing = Stuff.HAMBURGER;
int hamburgerA = Stuff.HAMBURGER.getA() // = 23012

Another idea is using a static initializer to set private fields of a class.

Lars Andren
@Lars, remove the " in the arguments of the enum-constructor call.
aioobe
@aioobe, thanks a lot! Unless you want to store those numbers as Strings (then the private constructor needs to change accordingly), no " are needed.
Lars Andren
@Lars D (other Lars), like this you never need to explicitly call the constructor, you only need to define each element like in the example above .
Lars Andren
A: 

You could also declare a static class (or a set of static classes) exposing the desidered values as methods. After all, you want your code to be able to find the value for a given name, and don't want the value to change.

So: location=MyLibOfConstants.returnHamburgerLocation().zipcode

And you can store this stuff in a hashtable with lazyinitialization, if you thing that calculating it on the fly would be a waste of time.

p.marino
+21  A: 

I personally wouldn't put it in source form.

Instead, include the data in some appropriate raw format in your jar file (I'm assuming you'll be packaging the application or library up) and use Class.getResourceAsStream or ClassLoader.getResourceAsStream to load it.

You may very well want a class to encapsulate loading, caching and providing this data - but I don't see much benefit from converting it into source code.

Jon Skeet
What is the standard data format in Java for this?
Lars D
@Lars: For K/V pairs, *key=value* in `.properties` file (check javadoc for `Properties` class), for lists only the sky is the limit although I'd suggest you use something simple that fits your needs. Maybe XML if you want to, but usually that's not really necessary.
Esko
@Lars D: It really depends on what your data needs are. Properties files are fine for key/value pairs, but may not be terribly efficient if you have a lot of numeric data. (The same is true for XML.) There are a number of serialization libraries which may make your life easier, or just use your own custom data format.
Jon Skeet
@Lars D - you could use a `DataInputStream` to read a raw data format or save the data using Java serialization if you wanted to avoid parsing XML or JSON structured data.
McDowell
I definitely want to avoid parsing the data at application start... so XML and JSON are definitely not useful.
Lars D
There is a LOT of numeric data, so key/value pairs sound like a no-go.Java serialization sounds like something that needs to be parsed... not good. Isn't there some kind of standard format for storing data in a stream, that can be read/looked up using a standard API without parsing it?
Lars D
@Lars D Java serialisation is the Java standard. Whatever you do in Java needs to be parsed - whether it is byte code which is read from a file, parsed, verified then (probably) interpreted as a one-off initialisation method, or data which is read from a file, parsed, and run in a loop which may well get compiled for a million entries. For a simple format, you could also use DataIn/OutputStream, which might save one reflective call in the serialisation.
Pete Kirkham
@Pete: I don't think I'd particularly recommend the default Java serialization - it's quite brittle. Other alternatives like Protocol Buffers or Thrift have a better versioning story IMO.
Jon Skeet
I suppose that much data is generated anyway, so I don't see any problem if that's generated directly to java code. It might make the build simpler actually as you don't have to handle external files. Of course, if the data is not generated, the story is different.
fish
@Jon He asked for a standard, not a recommendation! I've tended to use ASN.1 for versioned data, but people want free stuff instead.
Pete Kirkham
@Pete: While I take your point about Lars asking for the "standard" I think it's only reasonable to give caveats if that "standard" has significant shortcomings.
Jon Skeet
If the data is as simple as the example indicates, why not just use a CSV?
ILMTitan
@ILMTitan: CSV will still be storing numbers as text, which is inefficient in terms of time to parse *and* space. I'm under the impression that the OP wants it to be reasonably efficient.
Jon Skeet
I chose to write an app that converts the data to a binary format which is then added as a resource to the Java app.
Lars D
Note to anyone reading this: raw files, which are attached to resources, are compressed and can only be read sequentially, there is no random access.
Lars D
+7  A: 

Due to limitations of the java bytecode files, class-files can not be larger than 64k iirc. (They are simply not intended for this type of data.)

I would load the data upon starting the program, using something like the following lines of code:

import java.io.*;
import java.util.*;

public class Test {
    public static void main(String... args) throws IOException {
        List<DataRecord> records = new ArrayList<DataRecord>();
        BufferedReader br = new BufferedReader(new FileReader("data.txt"));
        String s;
        while ((s = br.readLine()) != null) {
            String[] arr = s.split(" ");
            int i = Integer.parseInt(arr[0]);
            int j = Integer.parseInt(arr[1]);
            records.add(new DataRecord(i, j, arr[0]));
        }
    }
}


class DataRecord {
    public final int i, j;
    public final String s;
    public DataRecord(int i, int j, String s) {
        this.i = i;
        this.j = j;
        this.s = s;
    }
}

(NB: The Scanner is quite slow, so don't be tempted to use it just because it has a simple interface. Stick with some form of BufferedReader and split, or StringTokenizer.)

Efficiency can of course be improved if you transform the data into a binary format. In that case, you can make use of the DataInputStream (but don't forget to go through some BufferedInputStream or BufferedReader)

Depending on how you wish to access the data, you might be better off storing the records in a hash-map (HashMap<Integer, DataRecord>) (having i or j as the key).

If you wish to load the data at the same time as the JVM loads the class file itself (roughly!) you could do the read / initialization, not within a method, but ecapsulated in static { ... }.


For a memory-mapped approach, have a look at the java.nio.channels-package in java. Especially the method

public abstract MappedByteBuffer map(FileChannel.MapMode mode, long position,long size) throws IOException

Complete code examples can be found here.


Dan Bornstein (the lead developer of DalvikVM) explains a solution to your problem in this talk (Look around 0:30:00). However I doubt the solution applies to as much data as a megabyte.

aioobe
Loading data is not an option, that would be too slow. Memory-mapping would make more sense, but I don't think that memory-mapping is available on my target platform (Android)
Lars D
Have a look at http://www.developer.com/java/other/article.php/1548681/Introduction-to-Memory-Mapped-IO-in-Java.htm
aioobe
Class files *can* be larger than 64k - it's individual methods (and initalization blocks) that can't be.
Michael Borgwardt
Oh thanks, that's nice to know! :)
aioobe
+1  A: 

convert the data directly to Java source code, defining 1MByte of constant arrays, classes

Be aware that there are strict constraints on the size of classes and their structures [ref JVM Spec.

McDowell
A: 

Isn't a cache what you need? As classes it is loaded in the memory, not really limited to a defined size, should be as fast as using constants... Actually it can even search data with some kind of indexes (exemple with the object hashcode...) You can for exemple create all your data arrays (ex { 23012, 22, "Hamburger"}) and then create 3 hashmap: map1.put(23012,hamburgerItem); map2.put(22,hamburgerItem); map3.put("Hamburger",hamburgerItem); This way you can search very fast in one of the map according to the parameter you have... (but this works only if your keys are unique in the map... this is just an exemple that could inspire you)

At work we have a very big webapp (80 weblogic instances) and it's almost what we do: caching everywhere. From a countrylist in database, create a cache...

There are many different kind of caches, you should check the link and choose what you need... http://en.wikipedia.org/wiki/Cache_algorithms

Sebastien Lorber
The main criteria is, that the data is present at program startup, so that I don't need to iterate through it or parse it. I'm not sure how a cache can help doing that?
Lars D
@Lars: what you want makes no sense. Loading a Java class involves iterating through the byte code and parsing it as well. It's flat-out impossible to load any kind of data without iterating through it and parsing it in some form. It's just a question of how expensive those steps are.
Michael Borgwardt
@Michael: Good point.
Lars D
+1  A: 

This is how you define it in Java, if I understood what you are after:

public final Object[][] myList = { 
          { 23012, 22, "Hamburger"} , 
          { 28375, 123, "Kieler"}
        };
fish
A Columbus egg... I wonder how this looks in bytecode.
Lars D
aioobe
The assignment is executed as part of the constructor, which has a size limit of 64 kbytes, which is violated. Other methods are also limited to 64kbytes, so it would require at least 16 methods to implement this.
Lars D
In that case I guess you must store the data to a file... unless you want to create those 16 methods :) Of course, if you generate the class, it doesn't really matter that much how it is formatted.
fish
+3  A: 

Putting the data into source could would actually not be the fastest solution, not by a long shot. Loading a Java class is quite complex and slow (at least on a platform that does bytecode verification, not sure about Android).

The fastest possible way to do this would be to define your own binary index format. You could then read that as a byte[] (possibly using memory mapping) or even a RandomAccessFile without interpreting it in any way until you start accessing it. The cost of this would be the complexity of the code that accesses it. With fixed-size records, a sorted list of records that's accessed via binary search would still be pretty simple, but anything else is going to get ugly.

Though before doing that, are you sure this isn't premature optimization? The easiest (and probably still quite fast) solution would be to jsut serialize a Map, List or array - have you tried this and determined that it is, in fact, too slow?

Michael Borgwardt
A: 

Java serialization sounds like something that needs to be parsed... not good. Isn't there some kind of standard format for storing data in a stream, that can be read/looked up using a standard API without parsing it?

If you were to create the data in code, then it would all be loaded on first use. This is unlikely to be much more efficient than loading from a separate file - as well as parsing the data in the class file, the JVM has to verify and compile the bytecodes to create each object a million times, rather than just the once if you load it from a loop.

If you want random access and can't use a memory mapped file, then there is a RandomAccessFile which might work. You need either to load a index on start, or you need to make the entries a fixed length.

You might want to check whether the HDF5 libraries run on your platform; it may be overkill for such a simple and small dataset though.

Pete Kirkham
+1  A: 

It looks like you plan to write your own lightweight database.
If you can limit the length of the String to a realistic max size the following might work:

  • write each entry into a binary file, the entries have the same size, so you waste some bytes with each entry(int a, int b,int stringsize, string, padding)
  • To read an entry open the file as a random access file, multiply the index with the length of an entry to get the offset and seek the position.
  • Put the bytes into a bytebuffer and read the values, the String has to be converted with the String(byte[] ,int start, int length,Charset) ctor.

If you can't limit the length of a block dump the strings in an additional file and only store the offsets in your table. This requires an additional file access and makes modifiying the data hard.
Some informationa about random file-access in java can be found here http://java.sun.com/docs/books/tutorial/essential/io/rafs.html.

For faster access you can cache some of your read entries in a Hashmap and always remove the oldest from the map when reading a new one.
Pseudo code (wont compile):

class MyDataStore
{
   FileChannel fc = null;
   Map<Integer,Entry> mychace = new HashMap<Integer, Entry>();
   int chaceSize = 50000;
   ArrayList<Integer> queue = new ArrayList();
   static final int entryLength = 100;//byte
   void open(File f)throws Exception{fc = f.newByteChannel()}
   void close()throws Exception{fc.close();fc = null;}
   Entry getEntryAt(int index)
   {
       if(mychace.contains(index))return mychace.get(index);

       long pos = index * entryLength; fc.seek(pos);ByteBuffer 
       b = new ByteBuffer(100);
       fc.read(b);
       Entry a = new Entry(b);
       queue.add(index);
       mychace.put(index,a);
       if(queue.size()>chacesize)mychace.remove(queue.remove(0));
       return a;
   }

}
class Entry{
   int a; int b; String s;
   public Entry(Bytebuffer bb)
   {
     a = bb.getInt(); 
     b = bb.getInt(); 
     int size = bb.getInt();
     byte[] bin = new byte[size];
     bb.get(bin);
     s = new String(bin);
   }
}

Missing from the pseudocode:

  • writing, since you need it for constant data
  • total number of entries/sizeof file, only needs an additional integer at the beginning of the file and an additional 4 byte offset for each access operation.
josefx