Hello,
I am currently having serious performance problems with a natural language processing application I'm developing. Basically, given some texts, it gathers various statistics and does a bit of number crunching.
And for every sentence, it does EXACTLY the same thing. The algorithms that gather the statistics do not depend on previously read data, so the work per sentence stays the same.
The issue is that the processing time does not grow linearly at all: 1 min for 10k sentences, 1 hour for 100k and days for 1M...
I tried everything I could, from re-implementing basic data structures to object pooling to recycling instances. The behavior doesn't change. I get a non-linear increase in time that seems impossible to justify by a few more hashmap collisions, or by IO waits, or by anything else! Java becomes sluggish as the data grows and I feel totally helpless.
If you want an example, just try the following: count the number of occurrences of each word in a big file. Some code is shown below. Doing this takes me 3 seconds for 100k sentences and 326 seconds for 1.6M... so a factor of about 110 instead of 16. As the data grows, it just gets worse...
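To put a number on "non-linear", here is a small sketch that estimates the exponent k in time ~ size^k from the two measurements quoted above. The class and method names (ScalingCheck, exponent) are made up for this illustration, and two data points only give a rough estimate.

```java
// Sketch only: ScalingCheck and exponent() are hypothetical names, not part of my program.
class ScalingCheck
{
    /** Exponent k in time ~ size^k, estimated from two (size, time) measurements. */
    static double exponent(double size1, double time1, double size2, double time2)
    {
        return Math.log(time2 / time1) / Math.log(size2 / size1);
    }

    public static void main(String[] args)
    {
        // Measurements from the post: 3 s for 100k sentences, 326 s for 1.6M sentences.
        double k = exponent(100000, 3, 1600000, 326);
        // What 1.6M sentences would cost if the time grew linearly with the input.
        double linearEstimate = 3 * (1600000 / 100000.0);
        System.out.printf("linear estimate: %.0f s, observed: 326 s, exponent: %.2f%n",
                linearEstimate, k);
    }
}
```

A linear process would give an exponent close to 1; anything clearly above 1 means the cost per sentence grows with the amount of data already processed.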
Here is a code sample. Note that I compare strings by reference (for efficiency reasons); this works thanks to the 'String.intern()' method, which returns a unique reference per distinct string value. Also, the map is never re-hashed during the whole process for the numbers given above.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DataGathering
{
    SimpleRefCounter<String> counts = new SimpleRefCounter<String>(1000000);

    private void makeCounts(String path) throws IOException
    {
        BufferedReader file_src = new BufferedReader(new FileReader(path));
        String line_src;
        int n = 0;
        // readLine() returns null at end of file; ready() only tells whether a read would block
        while ((line_src = file_src.readLine()) != null)
        {
            n++;
            // progress output: one dot per 10k lines, a line break per 100k lines
            if (n % 10000 == 0)
                System.out.print(".");
            if (n % 100000 == 0)
                System.out.println("");
            String[] src_tokens = line_src.split("[ ,.;:?!'\"]");
            for (int i = 0; i < src_tokens.length; i++)
            {
                // intern() yields one canonical instance per distinct string,
                // which is what lets the counter compare keys by reference
                String src = src_tokens[i].intern();
                counts.bump(src);
            }
        }
        file_src.close();
    }

    public static void main(String[] args) throws IOException
    {
        String path = "some_big_file.txt";
        long timestamp = System.currentTimeMillis();
        DataGathering dg = new DataGathering();
        dg.makeCounts(path);
        long time = (System.currentTimeMillis() - timestamp) / 1000;
        System.out.println("\nElapsed time: " + time + "s.");
    }
}
public class SimpleRefCounter<K>
{
    static final double GROW_FACTOR = 2;
    static final double LOAD_FACTOR = 0.5;

    private int capacity;
    private Object[] keys;
    private int[] counts;
    private int key_count = 0; // number of distinct keys stored
    private long total = 0;    // sum of all increments

    public SimpleRefCounter()
    {
        this(1000);
    }

    public SimpleRefCounter(int capacity)
    {
        this.capacity = capacity;
        keys = new Object[capacity];
        counts = new int[capacity];
    }

    // Linear probing: returns the slot holding 'key', or the first free slot after its home position.
    private static int findSlot(Object[] table, Object key)
    {
        // mask the sign bit: identityHashCode() is not documented to be non-negative
        int id = (System.identityHashCode(key) & 0x7fffffff) % table.length;
        while (table[id] != null && table[id] != key) // if it's occupied, let's move to the next one!
            id = (id + 1) % table.length;
        return id;
    }

    public synchronized int increase(K key, int n)
    {
        int id = findSlot(keys, key);
        if (keys[id] == null)
        {
            key_count++;
            keys[id] = key;
            if (key_count > LOAD_FACTOR * capacity)
            {
                resize((int) (GROW_FACTOR * capacity));
                id = findSlot(keys, key); // resize() moved every key, so look the slot up again
            }
        }
        counts[id] += n;
        total += n;
        return counts[id];
    }

    public synchronized void resize(int capacity)
    {
        System.out.println("Resizing counters: " + this);
        this.capacity = capacity;
        Object[] new_keys = new Object[capacity];
        int[] new_counts = new int[capacity];
        for (int i = 0; i < keys.length; i++)
        {
            Object key = keys[i];
            if (key == null)
                continue; // empty slot, nothing to move
            int id = findSlot(new_keys, key);
            new_keys[id] = key;
            new_counts[id] = counts[i];
        }
        this.keys = new_keys;
        this.counts = new_counts;
    }

    public int bump(K key)
    {
        return increase(key, 1);
    }

    public int get(K key)
    {
        int id = findSlot(keys, key);
        if (keys[id] == null)
            return 0;
        else
            return counts[id];
    }
}
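To make the reference comparison used by the counter above concrete, here is a minimal sketch of what 'String.intern()' guarantees. The class name InternDemo and the helper sameReference() are made up for this illustration.

```java
// Sketch only: InternDemo and sameReference() are hypothetical names used for illustration.
class InternDemo
{
    /** True only if both variables point to the very same object. */
    static boolean sameReference(String a, String b)
    {
        return a == b;
    }

    public static void main(String[] args)
    {
        // new String(...) forces two distinct objects with equal contents
        String a = new String("word");
        String b = new String("word");

        System.out.println(sameReference(a, b));                   // false: different objects
        System.out.println(a.equals(b));                           // true: same characters
        System.out.println(sameReference(a.intern(), b.intern())); // true: one canonical instance
    }
}
```

This is why the counter can use '==' and System.identityHashCode() on its keys, provided every key has gone through intern() first.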
Any explanations? Ideas? Suggestions?
...and, as said at the beginning, this is not about the toy example in particular but about the more general case. The same exploding behavior occurs, for no apparent reason, in the larger and more complex program.
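For anyone who wants to compare against the standard library, a baseline along the following lines could be swapped in for SimpleRefCounter in the word-count example. It is only a sketch: the class name BaselineCounter is made up, it relies on equals()/hashCode() instead of reference comparison, and I make no claim here about how it scales.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: BaselineCounter is a hypothetical stand-in for SimpleRefCounter<String>.
class BaselineCounter
{
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    /** Adds one to the count of 'key' and returns the new count. */
    int bump(String key)
    {
        Integer old = counts.get(key);
        int updated = (old == null) ? 1 : old + 1;
        counts.put(key, updated);
        return updated;
    }

    /** Returns the count of 'key', or 0 if it was never seen. */
    int get(String key)
    {
        Integer c = counts.get(key);
        return (c == null) ? 0 : c;
    }

    public static void main(String[] args)
    {
        BaselineCounter counter = new BaselineCounter();
        for (String token : "to be or not to be".split(" "))
            counter.bump(token);
        System.out.println(counter.get("to"));    // 2
        System.out.println(counter.get("maybe")); // 0
    }
}
```

With this version the tokens do not need to be interned, so it also shows whether the slowdown follows the counter or the rest of the loop.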