views:

102

answers:

1

I'm building a process which extracts data from 6 csv-style files and two poorly laid out .txt reports and builds output CSVs, and I'm fully aware that there's going to be some overhead searching through all that whitespace thousands of times, but I never anticipated converting about 50,000 records would take 12 hours.

Excerpt of my manual matching code (I know it's horrible that I use lists of tokens like that, but it was the best thing I could think of):

public static String lookup(Pattern tokenBefore,
                             List<String> tokensAfter)
{
    String result = null;

    while(_match(tokenBefore)) { // block until all input is read
        if(id.hasNext())
        {
            result = id.next(); // capture the  next token that matches

            if(_matchImmediate(tokensAfter)) // try to match tokensAfter to this result
                return result;
        } else
            return null; // end of file; no match
    }

    return null; // no matches
}

private static boolean _match(List<String> tokens)
{
    return _match(tokens, true);
}

private static boolean _match(Pattern token)
{
    if(token != null)
    {
        return (id.findWithinHorizon(token, 0) != null);
    } else {
        return false;
    }
}

private static boolean _match(List<String> tokens, boolean block)
{
    if(tokens != null && !tokens.isEmpty()) {
        if(id.findWithinHorizon(tokens.get(0), 0) == null)
            return false;

        for(int i = 1; i <= tokens.size(); i++)
        {
            if (i == tokens.size()) { // matches all tokens
                return true;
            } else if(id.hasNext() && !id.next().matches(tokens.get(i))) {
                break; // break to blocking behaviour
            }
        }
    } else {
        return true; // empty list always matches
    }

    if(block)
        return _match(tokens); // loop until we find something or nothing
    else
        return false; // return after just one attempted match
}

private static boolean _matchImmediate(List<String> tokens)
{
    if(tokens != null) {

        for(int i = 0; i <= tokens.size(); i++)
        {
            if (i == tokens.size()) { // matches all tokens
                return true;
            } else if(!id.hasNext() || !id.next().matches(tokens.get(i))) {
                return false; // doesn't match, or end of file
            }
        }

        return false; // we have some serious problems if this ever gets called
    } else {
        return true; // empty list always matches
    }
}

Basically wondering how I would work in an efficient string search (Boyer-Moore or similar). My Scanner id is scanning a java.util.String, figured buffering it to memory would reduce I/O since the search here is being performed thousands of times on a relatively small file. The performance increase compared to scanning a BufferedReader(FileReader(File)) was probably less than 1%, the process still looks to be taking a LONG time.

I've also traced execution and the slowness of my overall conversion process is definitely between the first and last like of the lookup method. In fact, so much so that I ran a shortcut process to count the number of occurrences of various identifiers in the .csv-style files (I use 2 lookup methods, this is just one of them) and the process completed indexing approx 4 different identifiers for 50,000 records in less than a minute. Compared to 12 hours, that's instant.

Some notes (updated 6/6/2010):

  1. I still need the pattern-matching behaviour for tokensBefore.
  2. All ID numbers I need don't necessarily start at a fixed position in a line, but it's guaranteed that after the ID token is the name of the corresponding object.
  3. I would ideally want to return a String, not the start position of the result as an int or something.

Anything to help me out, even if it saves 1ms per search, will help, so all input is appreciated. Thankyou!


Usage scenario 1: I have a list of objects in file A, who in the old-style system have an id number which is not in file A. It is, however, POSSIBLY in another csv-style file (file B) or possibly still in a .txt report (file C) which each also contain a bunch of other information which is not useful here, and so file B needs to be searched through for the object's full name (1 token since it would reside within the second column of any given line), and then the first column should be the ID number. If that doesn't work, we then have to split the search token by whitespace into separate tokens before doing a search of file C for those tokens as well.

Generalised code:

String field;
for (/* each record in file A */)
{
    /* construct the rest of this object from file A info */
    // now to find the ID, if we can
    List<String> objectName = new ArrayList<String>(1);
    objectName.add(Pattern.quote(thisObject.fullName));
    field = lookup(objectSearchToken, objectName); // search file B
    if(field == null) // not found in file B
    {
        lookupReset(false); // initialise scanner to check file C
        objectName.clear(); // not using the full name

        String[] tokens = thisObject.fullName.split(id.delimiter().pattern());
        for(String s : tokens)
            objectName.add(Pattern.quote(s));

        field = lookup(objectSearchToken, objectName); // search file C
        lookupReset(true); // back to file B
    } else {
        /* found it, file B specific processing here */
    }

    if(field != null) // found it in B or C
        thisObject.ID = field;
}

The objectName tokens are all uppercase words with possible hyphens or apostrophes in them, separated by spaces (a person's name).

As per aioobe's answer, I have pre-compiled the regex for my constant search tokens, which in this case is just \r\n. The speedup noticed was about 20x in another one of the processes, where I compiled [0-9]{1,3}\\.[0-9]%|\r\n|0|[A-Z'-]+, although it was not noticed in the above code with \r\n. Working along these lines, it has me wondering:

*Would it be better for me to match \r\n[^ ] if the only usable matches will be on lines beginning with a non-space character anyway? It may reduce the number of _match executions.*

Another possible optimisation is this: concatenate all tokensAfter, and put a (.*) beforehand. It would reduce the number of regexes (all of which are literal anyway) that would be compiled by about 2/3, and also hopefully allow me to pull out the text from that grouping instead of keeping a "potential token" from every line with an ID on it. Is that also worth doing?

The above situation could be resolved if I could get java.util.Scanner to return the token previous to the current one after a call to findWithinHorizon.

+2  A: 

Something to start with: Every single time you run id.next().matches(tokens.get(i)) the following code is executed:

Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
return m.matches();

Compiling a regular expression is non-trivial and you should consider compiling the patterns once and for all in your program:

pattern[i] = Pattern.compile(tokens.get(i));

And then simply invoke something like

pattern[i].matcher(str).matches()
aioobe
Good idea. I suppose it would help for the "any amount of whitespace" regex, but since my process is largely about extracting data from these files, every call to lookup() has 2-3 tokens, all of which are generally completely different (people's full names, one name per token). I will pre-compile any whitespace regex and re-run the process to check rough performance increase. Thanks :)
darvids0n
That's done it I would think. Running that now with a precompiled "\r\n" pattern gives me a processing speed of about 20 records a second as opposed to 1 record every second or two. Still looking for other potential optimisations, but that was a huge help, thank you!
darvids0n
That's sped up one of my use cases from taking 5 hours to about 15 minutes, but the other one (mentioned above) doesn't seem to show one bit of change. The possible reason is me splitting up the regex for the name if it's not found in file B, and all those being compiled. See the question for more info.
darvids0n