I'm building a process which extracts data from 6 csv-style files and two poorly laid out .txt reports and builds output CSVs, and I'm fully aware that there's going to be some overhead searching through all that whitespace thousands of times, but I never anticipated converting about 50,000 records would take 12 hours.
Excerpt of my manual matching code (I know it's horrible that I use lists of tokens like that, but it was the best thing I could think of):
public static String lookup(Pattern tokenBefore,
List<String> tokensAfter)
{
String result = null;
while(_match(tokenBefore)) { // block until all input is read
if(id.hasNext())
{
result = id.next(); // capture the next token that matches
if(_matchImmediate(tokensAfter)) // try to match tokensAfter to this result
return result;
} else
return null; // end of file; no match
}
return null; // no matches
}
private static boolean _match(List<String> tokens)
{
return _match(tokens, true);
}
private static boolean _match(Pattern token)
{
if(token != null)
{
return (id.findWithinHorizon(token, 0) != null);
} else {
return false;
}
}
private static boolean _match(List<String> tokens, boolean block)
{
if(tokens != null && !tokens.isEmpty()) {
if(id.findWithinHorizon(tokens.get(0), 0) == null)
return false;
for(int i = 1; i <= tokens.size(); i++)
{
if (i == tokens.size()) { // matches all tokens
return true;
} else if(id.hasNext() && !id.next().matches(tokens.get(i))) {
break; // break to blocking behaviour
}
}
} else {
return true; // empty list always matches
}
if(block)
return _match(tokens); // loop until we find something or nothing
else
return false; // return after just one attempted match
}
private static boolean _matchImmediate(List<String> tokens)
{
if(tokens != null) {
for(int i = 0; i <= tokens.size(); i++)
{
if (i == tokens.size()) { // matches all tokens
return true;
} else if(!id.hasNext() || !id.next().matches(tokens.get(i))) {
return false; // doesn't match, or end of file
}
}
return false; // we have some serious problems if this ever gets called
} else {
return true; // empty list always matches
}
}
Basically wondering how I would work in an efficient string search (Boyer-Moore or similar). My Scanner id
is scanning a java.util.String
, figured buffering it to memory would reduce I/O since the search here is being performed thousands of times on a relatively small file. The performance increase compared to scanning a BufferedReader(FileReader(File)) was probably less than 1%, the process still looks to be taking a LONG time.
I've also traced execution and the slowness of my overall conversion process is definitely between the first and last like of the lookup method. In fact, so much so that I ran a shortcut process to count the number of occurrences of various identifiers in the .csv-style files (I use 2 lookup methods, this is just one of them) and the process completed indexing approx 4 different identifiers for 50,000 records in less than a minute. Compared to 12 hours, that's instant.
Some notes (updated 6/6/2010):
- I still need the pattern-matching behaviour for tokensBefore.
- All ID numbers I need don't necessarily start at a fixed position in a line, but it's guaranteed that after the ID token is the name of the corresponding object.
- I would ideally want to return a String, not the start position of the result as an int or something.
Anything to help me out, even if it saves 1ms per search, will help, so all input is appreciated. Thankyou!
Usage scenario 1: I have a list of objects in file A, who in the old-style system have an id number which is not in file A. It is, however, POSSIBLY in another csv-style file (file B) or possibly still in a .txt report (file C) which each also contain a bunch of other information which is not useful here, and so file B needs to be searched through for the object's full name (1 token since it would reside within the second column of any given line), and then the first column should be the ID number. If that doesn't work, we then have to split the search token by whitespace into separate tokens before doing a search of file C for those tokens as well.
Generalised code:
String field;
for (/* each record in file A */)
{
/* construct the rest of this object from file A info */
// now to find the ID, if we can
List<String> objectName = new ArrayList<String>(1);
objectName.add(Pattern.quote(thisObject.fullName));
field = lookup(objectSearchToken, objectName); // search file B
if(field == null) // not found in file B
{
lookupReset(false); // initialise scanner to check file C
objectName.clear(); // not using the full name
String[] tokens = thisObject.fullName.split(id.delimiter().pattern());
for(String s : tokens)
objectName.add(Pattern.quote(s));
field = lookup(objectSearchToken, objectName); // search file C
lookupReset(true); // back to file B
} else {
/* found it, file B specific processing here */
}
if(field != null) // found it in B or C
thisObject.ID = field;
}
The objectName tokens are all uppercase words with possible hyphens or apostrophes in them, separated by spaces (a person's name).
As per aioobe's answer, I have pre-compiled the regex for my constant search tokens, which in this case is just \r\n
. The speedup noticed was about 20x in another one of the processes, where I compiled [0-9]{1,3}\\.[0-9]%|\r\n|0|[A-Z'-]+
, although it was not noticed in the above code with \r\n
. Working along these lines, it has me wondering:
*Would it be better for me to match \r\n[^ ]
if the only usable matches will be on lines beginning with a non-space character anyway? It may reduce the number of _match executions.*
Another possible optimisation is this: concatenate all tokensAfter, and put a (.*)
beforehand. It would reduce the number of regexes (all of which are literal anyway) that would be compiled by about 2/3, and also hopefully allow me to pull out the text from that grouping instead of keeping a "potential token" from every line with an ID on it. Is that also worth doing?
The above situation could be resolved if I could get java.util.Scanner to return the token previous to the current one after a call to findWithinHorizon.