ansaurus

Question

Optimizing a lot of Scanner.findWithinHorizon(pattern, 0) calls

Answer 1

+2 A:

Something to start with: every single time you run id.next().matches(tokens.get(i)), String.matches recompiles the regex, so the following code is effectively executed:

Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
return m.matches();

Compiling a regular expression is comparatively expensive, so you should consider compiling each pattern once, up front, and reusing it for the rest of your program:

pattern[i] = Pattern.compile(tokens.get(i));

And then simply invoke something like

pattern[i].matcher(str).matches()
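
Put together, the idea might look like the following minimal sketch. The class name PrecompiledTokens and the sample regexes are made up for illustration and are not taken from the question's code; it only assumes that the tokens are available as a list of regex strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not from the question): compiles every token regex
// once in the constructor and reuses the resulting Pattern objects.
class PrecompiledTokens {
    private final List<Pattern> patterns = new ArrayList<>();

    PrecompiledTokens(List<String> tokens) {
        for (String token : tokens) {
            // The compile cost is paid once per token, not once per match.
            patterns.add(Pattern.compile(token));
        }
    }

    // Same result as input.matches(tokens.get(i)), without recompiling.
    boolean matches(int i, String input) {
        return patterns.get(i).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Sample regexes chosen only for the demonstration.
        PrecompiledTokens tokens =
                new PrecompiledTokens(Arrays.asList("\\s+", "[A-Z][a-z]+"));
        System.out.println(tokens.matches(0, " \t "));
        System.out.println(tokens.matches(1, "Smith"));
        System.out.println(tokens.matches(1, "smith"));
    }
}
```

A Pattern is immutable and safe to share, whereas each Matcher is a cheap, single-use object, so creating a new Matcher per input is fine.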

aioobe 2010-06-04 07:44:09

Good idea. I suppose it would help for the "any amount of whitespace" regex, but since my process is largely about extracting data from these files, every call to lookup() has 2-3 tokens, all of which are generally completely different (people's full names, one name per token). I will pre-compile any whitespace regex and re-run the process to check rough performance increase. Thanks :)

darvids0n 2010-06-05 10:43:22

That's done it I would think. Running that now with a precompiled "\r\n" pattern gives me a processing speed of about 20 records a second as opposed to 1 record every second or two. Still looking for other potential optimisations, but that was a huge help, thank you!

darvids0n 2010-06-05 13:41:00

That's sped up one of my use cases from about 5 hours to about 15 minutes, but the other one (mentioned above) doesn't show any change at all. A possible reason is that I split up the regex for the name when it isn't found in file B, and each of those pieces still gets compiled. See the question for more info.

darvids0n 2010-06-05 14:20:56
