ansaurus

Question

The best way to store and access 120,000 words in java

Answer 1

A:

Use a Hashtable? This will give you your best lookup speed.

Kevin Babcock 2009-02-06 03:05:44

If the words only need to be accessed by index, an array will give the best lookup speed.

benjismith 2009-02-06 03:08:40

That is true. But if they need to be access in non-sequential order based on an arbitrary set of keys, a Hashtable (or even better, a HashMap) would be more efficient. I guess the answer depends on which of those is the case for his app.

Kevin Babcock 2009-02-06 03:12:00

Hashes will only be useful if the OP intends to lookup via any key BUT the index. OP suggests that index is all that's needed

basszero 2009-02-06 03:48:49

Answer 2

+14 A:

I would store them in one large ArrayList and worry about (possibly unnecessary) optimisations later on.

Being inherently lazy, I don't think it's a good idea to optimise unless there's a demonstrated need. Otherwise, you're just wasting effort that could be better spent elsewhere.

In fact, if you can set an upper bound to your word count and you don't need any of the fancy List operations, I'd opt for a normal (native) array of string objects with an integer holding the actual number. This is likely to be faster than a class-based approach.

This gives you the greatest speed in accessing the individual elements whilst still retaining the ability to do all that wonderful string manipulation.

Note I haven't benchmarked native arrays against ArrayLists. They may be just as fast as native arrays, so you should check this yourself if you have less blind faith in my abilities than I do :-).

If they do turn out to be just as fast (or even close), the added benefits (expandability, for one) may be enough to justify their use.

paxdiablo 2009-02-06 03:07:00

I wouldn't be so sure plain vanilla arrays are, across the board, better than ArrayLists. Object creation in Java is ridiculously cheap, and you get a very nice set of abstraction above the list to help you out. Configured correctly, I see no reason to use a plain array. Only reason you should is if profiling shows with absolute accuracy that you get much better performance. And I doubt you will get these results from profiling.

Yuval A 2009-04-18 15:30:35

Yuval, I agree in general with all your points but it appears to me in this case that the ArrayList is being used to simply store a fixed number of String objects, so none of the fancy *ArrayList* functionality would be of use. It's the String functions that are important. And I'm serious about testing it, my favorite mantra is "measure, don't guess".

paxdiablo 2009-04-18 15:47:34

I think your points on premature optimization are spot on and I second those comments.

digitaljoel 2009-04-18 15:58:21

Answer 3

+2 A:

If you will access these Strings sequentially, the LinkedList would be the best choice.

For random access, ArrayLists have a nice memory usage/access speed tradeof.

fsanches 2009-02-06 03:15:45

*Accessing* sequentially is still faster with arrays (and possibly ArrayLists). What linked lists give you is fast insertion and deletion (which may or may not be needed here).

paxdiablo 2009-02-06 03:26:10

@Pax: sometimes it isn't possible to allocate 120.000 Strings on a contiguous block of memory (although I don't know if the JVM will handle this, I'm no Java expert). In that sense, linked lists are better - the memory doesn't need to be contiguous, so it'll be easier to allocate.

fsanches 2009-02-06 04:27:27

@fs, an array of 200 30-character string doesn't require 6000-ish contiguous characters, just 200 contiguous pointers. The strings are new'ed elsewhere.

paxdiablo 2009-02-06 04:38:20

@fsanches: Completely wrong. The JVM does not use malloc, and it doesn't keep free-lists. The JVM always allocates from the top of the heap. 120,000 allocations all in a row (with no intervening allocations) will *always* be adjacent in heap memory. But, more importantly, why does that matter?

benjismith 2009-02-06 05:36:05

@benjismith: I'm talking about allocating an simple array of 120000 elements. If the OS can't allocate this space contiguously, the allocation will fail. Linked lists won't suffer from this problem.

fsanches 2009-02-06 06:50:40

The java heap is *always* contiguous. Please stop giving people advice about Java programming. You may be a really well-informed developer on other platforms, but your Java misinformation is somewhat alarming.

benjismith 2009-02-06 16:06:25

I'm not arguing if the heap is contiguous or not. I'm arguing if the contiguous space *will be available* at runtime. We're talking about a 120k positions array here.

fsanches 2009-02-06 20:01:46

Off the top of my head I don't know what Java does it if tries to increase its heap size and can't get all it asks for in one contiguous block. Do you know that it fails, or are you just guessing? But more important, the question is irrelevant in this context. If he's building an array of 120,000 words, then he needs 120,000 * the pointer size, which I think is 8, or about 1 MB. I wouldn't think that getting 1 MB of memory on a modern computer is an extravagant request. The strings themselves are all separate allocations, each relatively small.

Jay 2009-08-11 17:19:23

Wait, I think an Object ref is only 4 bytes, which would bean 120,000 of them would be 500kB. A pittance.

Jay 2009-08-11 17:25:56

Answer 4

+3 A:

Just confirming pax assumptions, with a very naive benchmark

public static void main(String[] args)
{
 int size = 120000;
 String[] arr = new String[size];
 ArrayList al = new ArrayList(size);
 for (int i = 0; i < size; i++)
 {
  String put = Integer.toHexString(i).toString();
  // System.out.print(put + " ");
  al.add(put);
  arr[i] = put;
 }

 Random rand = new Random();
 Date start = new Date();
 for (int i = 0; i < 10000000; i++)
 {
  int get = rand.nextInt(size);
  String fetch = arr[get];

 }
 Date end = new Date();
 long diff = end.getTime() - start.getTime();
 System.out.println("array access took " + diff + " ms");

 start = new Date();
 for (int i = 0; i < 10000000; i++)
 {
  int get = rand.nextInt(size);
  String fetch = (String) al.get(get);

 }
 end = new Date();
 diff = end.getTime() - start.getTime();
 System.out.println("array list access took " + diff + " ms");
}

and the output:
array access took 578 ms
array list access took 907 ms

running it a few times the actual times seem to vary some, but generally array access is between 200 and 400 ms faster, over 10,000,000 iterations.

shsteimer 2009-02-06 04:14:59

Actually, that's less of a difference than I thought (twice as long but only 1/2 a second hit for 10m iterations). It's probably worth getting the benefits of ArrayList for that minimal cost.

paxdiablo 2009-02-06 04:41:15

milliseconds or microseconds? if those are milli that is an eternity!

windfinder 2009-02-06 06:02:10

@wf, it's ms (micro would be written 200us for those Unicode-deficient soles). And a fifth of a second difference isn't too bad for 10 million iterations

paxdiablo 2009-02-06 06:21:49

Interesting benchmark. Any reason you used Date objects rather than System.nanoTime()?

benjismith 2009-02-06 16:08:32

and now do the benchmark using a Object[] and a cast inside the loop like done with the ArrayList!

Carlos Heuberger 2009-04-18 20:30:14

Answer 5

A:

ArrayList/Vector if order matters (it appears to, since you are calling the words "word_xxx"), or HashTable/HashMap if it doesn't.

I'll leave the exercise of figuring out why you would want to use an ArrayList vs. a Vector or a HashTable vs. a HashMap up to you since I have a sneaking suspicion this is your homework. Check the Javadocs.

You're not going to get any methods that help you as you've asked for in the examples above from your Collections Framework class, since none of them do String comparison operations. Unless you just want to order them alphabetically or something, in which case you'd use one of the Tree implementations in the Collections framework.

bpapa 2009-02-06 04:39:01

Answer 6

A:

I don't understand why so many people are suggesting Arraylist, or the like, since you don't mention ever having to iterate over the whole list. Further, it seems you want to access them as key/value pairs ("word_348"="pedantic").

For the fastest access, I would use a TreeMap, which will do binary searches to find your keys. Its only downside is that it's unsynchronized, but that's not a problem for your application.

http://java.sun.com/javase/6/docs/api/java/util/TreeMap.html

dj_segfault 2009-02-06 05:05:37

You'd want to use an ArrayList or Array to take advantage of random access. If you're iterating, it might make more sense to use a LinkedList.

bpapa 2009-02-06 05:07:29

TreeMap is going to be considerably slower than an array or an ArrayList. Remember, TreeMap provides O(log n) access time, while array and ArrayList provide O(1) access time.

benjismith 2009-02-06 05:42:17

Answer 7

A:

Depends on what the problem is - speed or memory.

If it's memory, the minimum solution is to write a function getWord(n) which scans the whole file each time it runs, and extracts word n.

Now - that's not a very good solution. A better solution is to decide how much memory you want to use: lets say 1000 items. Scan the file for words once when the app starts, and store a series of bookmarks containing the word number and the position in the file where it is located - do this in such a way that the bookmarks are more-or-less evenly spaced through the file.

Then, open the file for random access. The function getWord(n) now looks at the bookmarks to find the biggest word # <= n (please use a binary search), does a seek to get to the indicated location, and scans the file, counting the words, to find the requested word.

An even quicker solution, using rather more memnory, is to build some sort of cache for the blocks - on the basis that getWord() requests usually come through in clusters. You can rig things up so that if someone asks for word # X, and its not in the bookmarks, then you seek for it and put it in the bookmarks, saving memory by consolidating whichever bookmark was least recently used.

And so on. It depends, really, on what the problem is - on what kind of patterns of retreival are likely.

paulmurray 2009-02-06 05:13:14

Answer 8

A:

How about a radix tree or Patricia trie?

http://en.wikipedia.org/wiki/Radix_tree

Chris KL 2009-02-06 06:40:55

Answer 9

+1 A:

My take:

For a non-threaded program, an Arraylist is always fastest and simplest.

For a threaded program, a java.util.concurrent.ConcurrentHashMap<Integer,String> or java.util.concurrent.ConcurrentSkipListMap<Integer,String> is awesome. Perhaps you would later like to allow threads so as to make multiple queries against this huge thing simultaneously.

Overflown 2009-02-06 06:59:38

As long as you initialize the list at startup, finalize it and dont change it afterward, an ArrayList is fine for multiple threads.

Guillaume 2009-02-06 08:06:46

Answer 10

A:

The only advantage of a linked list over an array or array list would be if there are insertions and deletions at arbitrary places. I don't think this is the case here: You read in the document and build the list in order.

I THINK that when the original poster talked about finding "word_2200", he meant simply the 2200th word in the document, and not that there are arbitrary labels associated with each word. If so, then all he needs is indexed access to all the words. Hence, an array or array list. If there really is something more complex, if one word might be labeled "word_2200" and the next word is labeled "foobar_42" or some such, then yes, he'd need a more complex structure.

Hey, do you want to give us a clue WHY you want to do any of this? I'm hard pressed to remember the last time I said to myself, "Hey, I wonder if the 1,237th word in this document I'm reading is longer or shorter than the 842nd word?"

Jay 2009-08-11 17:31:33

ansaurus

tags:

views:

answers:

The best way to store and access 120,000 words in java

related questions