views:

6517

answers:

18

I have to keep thousands of strings in memory to be accessed serially in Java. Should I store them in an array or should I use some kind of List ?

Since arrays keep all the data in a contiguous chunk of memory (unlike Lists), would the use of an array to store thousands of strings cause problems ?

Answer: The common consensus is that the performance difference is minor. List interface provides more flexibility.

+1  A: 

If you know in advance how large the data is then an array will be faster.

A List is more flexible. You can use an ArrayList which is backed by an array.

TofuBeer
The ArrayList has an ensureCapacity() method which preallocates the backing array to the specified size.
JesperE
Or you can specify the size at construction time. Also "faster" here means "a few microseconds to allocate two memory areas instead of one"
Aaron Digulla
+3  A: 

No, because technically, the array only stores the reference to the strings. The strings themselves are allocated in a different location. For a thousand items, I would say a list would be better, it is slower, but it offers more flexibility and it's easier to use, especially if you are going to resize them.

CookieOfFortune
List also stores only reference to strings.
Peter Štibraný
+1  A: 

list is slower than arrays.If you need efficiency use arrays.If you need flexibility use list.

Warrior
+26  A: 

I suggest that you use a profiler to test which is faster.

My personal opinion is that you should use Lists.

I work on a large codebase and a previous group of developers used arrays everywhere. It made the code very inflexible. After changing large chunks of it to Lists we noticed no difference in speed.

Fortyrunner
+1 for maintainability!
matt b
@Fortyrunner - From your experience, are there any such choices in Java between abstraction and raw data forms that do make a significant difference in performance ?
euphoria83
One of the issues with performance measurement is that you constantly have to retest against new versions of Java. I am working on a problem at the moment where someone used an int throughout for a key in a map (to save space/time). We now need to change all lines to a new object - its painful.
Fortyrunner
So.. I now try and stay away from raw data. It rarely makes a noticeable difference. Hotspot is an amazing piece of technology and you should never try and second guess. Just try to write simple, maintainable code and Hotspot will do the rest.
Fortyrunner
A: 

A List is more flexible.... so better to List than array

RV
+20  A: 

The Java way is that you should consider what data abstraction most suits your needs. Remember that in Java a List is an abstract, not a concrete data type. You should declare the strings as a List, and then initialize it using the ArrayList implementation.

List<String> strings = new ArrayList<String>();

This separation of Abstract Data Type and specific implementation is one the key aspects of object oriented programming.

An ArrayList implements the List Abstract Data Type using an array as its underlying implementation. Access speed is virtually identical to an array, with the additional advantages of being able to add and subtract elements to a List (although this is an O(n) operation with an ArrayList) and that if you decide to change the underlying implementation later on you can. For example, if you realize you need synchronized access, you can change the implementation to a Vector without rewriting all your code.

In fact, the ArrayList was specifically designed to replace the low-level array construct in most contexts. If Java was being designed today, it's entirely possible that arrays would have been left out altogether in favor of the ArrayList construct.

Since arrays keep all the data in a contiguous chunk of memory (unlike Lists), would the use of an array to store thousands of strings cause problems ?

In Java, all collections store only references to objects, not the objects themselves. Both arrays and ArrayList will store a few thousand references in a contiguous array, so they are essentially identical. You can consider that a contiguous block of a few thousand 32-bit references will always be readily available on modern hardware. This does not guarantee that you will not run out of memory altogether, of course, just that the contiguous block of memory requirement is not difficult to fufil.

Adding may of course involve reallocating the backing array, so if performance is important and the size of the array is known in advance, one should consider using ArrayList#ensureCapacity.
JesperE
Don't you pay the cost of dynamic binding here?
Uri
@Uri : I want to know that too. Good question
euphoria83
+7  A: 

You should prefer generic types over arrays. As mentioned by others, arrays are inflexible and do not have the expressive power of generic types. (They do however support runtime typechecking, but that mixes badly with generic types.)

But, as always, when optimizing you should always follow these steps:

  • Don't optimize until you have a nice, clean, and working version of your code. Changing to generic types could very well be motivated at this step already.
  • When you have a version which is nice and clean, decide if it is fast enough.
  • If it isn't fast enough, measure its performance. This step is important for two reasons. If you don't measure you won't (1) know the impact of any optimizations you make and (2) know where to optimize.
  • Optimize the hottest part of your code.
  • Measure again. This is just as important as measuring before. If the optimization didn't not improve things, revert it. Remember, the code without the optimization was clean, nice, and working.
JesperE
+1  A: 

Don't get into the trap of optimizing without proper benchmarking. As others have suggested use a profiler before making any assumption.

The different data structures that you have enumerated have different purposes. A list is very efficient at inserting elements in the beginning and at the end but suffers a lot when accessing random elements. An array has fixed storage but provides fast random access. Finally an ArrayList improves the interface to an array by allowing it to grow. Normally the data structure to be used should be dictated by how the data stored will be access or added.

About memory consumption. You seem to be mixing some things. An array will only give you a continuous chunk of memory for the type of data that you have. Don't forget that java has a fixed data types: boolean, char, int, long, float and Object (this include all objects, even an array is an Object). It means that if you declare an array of String strings [1000] or MyObject myObjects [1000] you only get a 1000 memory boxes big enough to store the location (references or pointers) of the objects. You don't get a 1000 memory boxes big enough to fit the size of the objects. Don't forget that your objects are first created with "new". This is when the memory allocation is done and later a reference (their memory address) is stored in the array. The object doesn't get copied into the array only it's reference.

potyl
A: 

"Thousands" is not a large number. If all you want to do is access these serially, use an in-memory, immutable, singly-linked List.

Apocalisp
8 bytes on most 64-bit implementations.
Tom Hawtin - tackline
+1  A: 

I don't think it makes a real difference for Strings. What is contiguous in an array of strings is the references to the strings, the strings themselves are stored at random places in memory.

Arrays vs. Lists can make a difference for primitive types, not for objects. IF you know in advance the number of elements, and don't need flexibility, an array of millions of integers or doubles will be more efficient in memory and marginally in speed than a list, because indeed they will be stored contiguously and accessed instantly. That's why Java still uses arrays of chars for strings, arrays of ints for image data, etc.

PhiLho
+1  A: 

Remember that an ArrayList encapsulates an array, so there is little difference compared to using a primitive array (except for the fact that a List is much easier to work with in java).

The pretty much the only time it makes sense to prefer an array to an ArrayList is when you are storing primitives, i.e. byte, int, etc and you need the particular space-efficiency you get by using primitive arrays.

Nuoji
+3  A: 

I'm guessing the original poster is coming from a C++/STL background which is causing some confusion. In C++ std::list is a doubly linked list.

In Java [java.util.]List is an implementation-free interface (pure abstract class in C++ terms). List can be a doubly linked list - java.util.LinkedList is provided. However, 99 times out of 100 when you want a new List, you want to use java.util.ArrayList instead, which is the rough equivalent of C++ std::vector. There are other standard implementations, such as those returns by java.util.Collections.emptyList and java.util.Arrays.asList.

From a performance stand point there is a very small hit from having to go through an interface and an extra object, however runtime inlining means this rarely has an significance. Also remember that String are (typically) an object plus array. So for each entry, you probably have two other objects. In C++ std::vector<std::string>, although copying by value without a pointer as such, the character arrays will form an object for string (and these will not usually be shared).

If this particular code is really performance sensitive, you could create a single char[] array (or even byte[]) for all the characters of all the strings, and then an array of offsets. IIRC, this is how javac is implemented.

Tom Hawtin - tackline
Thanx for the answer. But no, I am not confusing the C++ list with Java's interface List. I asked the question in such a way because I wanted to compare the performance of List implementations like ArrayList and Vector with raw arrays.
euphoria83
Both ArrayList and Vector "keep all the data in a contiguous chunk of memory".
Tom Hawtin - tackline
+1  A: 

Well firstly it's worth clarifying do you mean "list" in the classical comp sci data structures sense (ie a linked list) or do you mean java.util.List? If you mean a java.util.List, it's an interface. If you want to use an array just use the ArrayList implementation and you'll get array-like behaviour and semantics. Problem solved.

If you mean an array vs a linked list, it's a slightly different argument for which we go back to Big O (here is a plain English explanation if this is an unfamiliar term.

Array;

  • Random Access: O(1);
  • Insert: O(n);
  • Delete: O(n).

Linked List:

  • Random Access: O(n);
  • Insert: O(1);
  • Delete: O(1).

So you choose whichever one best suits how you resize your array. If you resize, insert and delete a lot then maybe a linked list is a better choice. Same goes for if random access is rare. You mention serial access. If you're mainly doing serial access with very little modification then it probably doesn't matter which you choose.

Linked lists have a slightly higher overhead since, like you say, you're dealing with potentially non-contiguous blocks of memory and (effectively) pointers to the next element. That's probably not an important factor unless you're dealing with millions of entries however.

cletus
i mean java.util.List interface
euphoria83
+1  A: 

Array vs. List choice is not so important (considering performance) in the case of storing string objects. Because both array and list will store string object references, not the actual objects.

  1. If the number of strings is almost constant then use an array (or ArrayList). But if the number varies too much then you'd better use LinkedList.
  2. If there is (or will be) a need for adding or deleting elements in the middle, then you certainly have to use LinkedList.
Emre
+1  A: 

Array is faster - all memory is pre-allocated in advance.

Yakov Fain
+1  A: 

I wrote a little benchmark to compare ArrayLists with Arrays. On my old-ish laptop, the time to traverse through a 5000-element arraylist, 1000 times, was about 10 milliseconds slower than the equivalent array code.

So, if you're doing nothing but iterating the list, and you're doing it a lot, then maybe it's worth the optimisation. Otherwise I'd use the List, because it'll make it easier when you do need to optimise the code.

n.b. I did notice that using for String s: stringsList was about 50% slower than using an old-style for-loop to access the list. Go figure... Here's the two functions I timed; the array and list were filled with 5000 random (different) strings.

private static void readArray(String[] strings) {
 long totalchars = 0;
 for (int j = 0; j < ITERATIONS; j++) {
  totalchars = 0;
  for (int i = 0; i < strings.length; i++) {
   totalchars += strings[i].length();

  }
 }
}

private static void readArrayList(List<String> stringsList) {
 long totalchars = 0;
 for (int j = 0; j < ITERATIONS; j++) {
  totalchars = 0;
  for (int i = 0; i < stringsList.size(); i++) {
   totalchars += stringsList.get(i).length();
  }
 }
}
Chris May
@ Chris May : Great work ! What are the actual running times for both ? Can you tell me the size of the strings you were using ? Also, as the use of 'String s : stringsList' made it take longer, this is my primary fear in using the higher abstractions in Java in general.
euphoria83
It doesn't really matter how long the strings are for this mcirobenchmark. There is no gc, and the `char[]` is not touched (this is not C).
Tom Hawtin - tackline
Typical times for me were ~25ms for the array version, ~35ms for the ArrayList version. The strings were 15-20 chars long. As Tom says, the string size doesn't make much of a difference, with a ~100-char string the timings were about same.
Chris May
A: 

If you have thousands, consider using a trie. A trie is a tree-like structure that merges the common prefixes of the stored string.

For example, if the strings were

intern
international
internationalize
internet
internets

The trie would store:

intern
 -> \0
 international
 -> \0
 -> ize\0
 net
 ->\0
 ->s\0

The strings requires 57 characters (including the null terminator, '\0') for storage, plus whatever the size of the String object that holds them. (In truth, we should probably round all sizes up to multiples of 16, but...) Call it 57 + 5 = 62 bytes, roughly.

The trie requires 29 (including the null terminator, '\0') for storage, plus sizeof the trie nodes, which are a ref to an array and a list of child trie nodes.

For this example, that probably comes out about the same; for thousands, it probably comes out less as long as you do have common prefixes.

Now, when using the trie in other code, you'll have to convert to String, probably using a StringBuffer as an intermediary. If many of the strings are in use at once as Strings, outside the trie, it's a loss.

But if you're only using a few at the time -- say, to look up things in a dictionary -- the trie can save you a lot of space. Definitely less space than storing them in a HashSet.

You say you're accessing them "serially" -- if that means sequentially an alphabetically, the trie also obviously gives you alphabetical order for free, if you iterate it depth-first.

tpdi
is trie like a library or how do i create it ?
euphoria83
A: 

Depending on the implementation. it's possible that an array of primitive types will be smaller and more efficient than ArrayList. This is because the array will store the values directly in a contiguous block of memory, while the simplest ArrayList implementation will store pointers to each value. On a 64-bit platform especially, this can make a huge difference.

Of course, it's possible for the jvm implementation to have a special case for this situation, in which case the performance will be the same.