Note: This is a follow-up to this question.

I have a "legacy" program which does hundreds of string matches against big chunks of HTML. For example if the HTML matches 1 of 20+ strings, do something. If it matches 1 of 4 other strings, do something else. There are 50-100 groups of these strings to match against these chunks of HTML (usually whole pages).

I'm taking a whack at refactoring this mess of code and trying to come up with a good approach to do all these matches.

The performance requirements of this code are rather strict. It must not wait on I/O when doing these matches, so the strings need to be in memory. Also, there can be 100+ copies of this process running at the same time, so heavy I/O on startup for one copy could slow I/O for the others.

With these requirements in mind, it would be most efficient if only one copy of these strings is stored in RAM (see my previous question linked above).

This program currently runs on Windows with the Microsoft compiler, but I'd like to keep the solution as cross-platform as possible, so I don't think I want to use PE resource files or anything like that.

Mmapping an external file might work, but then I have the issue of keeping the program version and the data version in sync; one does not normally change without the other. Also, this requires some file "format", which adds a layer of complexity I'd rather not have.

So after all of this preamble, it seems like the best solution is to have a bunch of arrays of strings which I can then iterate over. This seems kind of messy since I'm mixing code and data heavily, but given the above requirements, is there any better way to handle this sort of situation?
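
For concreteness, something roughly like the sketch below is what I have in mind (the group names and literals are just placeholders). Since the literals are compiled into the binary's read-only data segment, the OS should be able to share one physical copy of those pages across all running instances.

```cpp
#include <cstddef>
#include <string>

// One group of literals per "do something" case; names and contents are
// placeholders. Literals compiled into the binary live in read-only data,
// which the OS shares across all processes running the same executable.
static const char* const kGroupA[] = { "first string to match", "second string" };
static const char* const kGroupB[] = { "some other string" };

template <std::size_t N>
bool matchesAny(const std::string& html, const char* const (&group)[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        if (html.find(group[i]) != std::string::npos)
            return true;
    }
    return false;
}

// Usage:
//   if (matchesAny(htmlPage, kGroupA)) { /* do something */ }
//   else if (matchesAny(htmlPage, kGroupB)) { /* do something else */ }
```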

A: 

Are these literal strings stored in a file? If so, as you suggested, your best option might be to use memory-mapped files to share one copy of the file across the hundreds of instances of the program. Also, you may want to adjust the working set size to see if you can reduce the number of page faults, but given that you have so many instances, it might prove to be counterproductive (and besides, your program needs quota privileges to adjust the working set size).
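
As a rough sketch of the memory-mapped idea (POSIX mmap shown purely for illustration; on Windows the analogous calls are CreateFileMapping/MapViewOfFile, and the function name below is made up):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map the pattern file read-only. Every process that maps the same file
// shares the same physical pages, so only one copy of the data sits in RAM.
const char* mapPatternFile(const char* path, std::size_t* lengthOut) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

    void* mem = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (mem == MAP_FAILED) return nullptr;

    *lengthOut = static_cast<std::size_t>(st.st_size);
    return static_cast<const char*>(mem);
}
```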

There are other tricks you can try to optimize I/O performance, like allocating large pages, but it depends on your file size and the privileges granted to your program.

The bottom line is that you need to experiment to see what works best, and remember to measure after each change :)

Atul
Currently they are all inline in the code, like: `if (htmlPage.find("string to match") != htmlPage.npos) ...`, which is terrible. I'd rather not deal with some external file format, so that's why I'm thinking arrays of strings would be a good first step toward cleaning this up.
thelsdj
@thelsdj: That code looks like C++ rather than C
nategoose
@nategoose You're right, we use a C++ compiler and std::string for a lot of stuff, so it was disingenuous of me to specify C, and I've changed it. This is a bad/good example of an ugly "legacy" app. I think it was originally written as C++ to take advantage of C++ libraries and very basic STL usage (string), but the organization of the app is just lots of functions thrown together, with no class design at all.
thelsdj
@nategoose But I should point out that the question and the answer I'm looking for use nothing that is specific to C++ and should apply to C as well.
thelsdj
+1  A: 

If the strings that need to be matched can be locked down at compile time, you should consider using a tokenizer generator like lex to scan your input for matches. If you aren't familiar with it, lex takes a source file containing regular expressions (including the simplest regular expressions -- string literals) and C action code to be executed when a match is found. It is often used in building compilers and similar programs, and there are several other similar tools you could use (flex and antlr come to mind).

lex builds state machine tables and then generates efficient C code for matching input against the regular expressions those state tables represent (input is standard input by default, but you can change this). Using this method would probably not result in the duplication of strings (or other data) in memory among the different instances of your program that you fear. You could probably generate the regular expressions from the string literals in your existing code fairly easily, but it may take a good bit of work to rework your program to use the code that lex generates.
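
As a rough illustration only (the patterns and actions below are made up), a lex/flex source file for this kind of literal matching might look something like this:

```lex
%{
/* C declarations available to the action code */
#include <stdio.h>
%}

%%
"string to match"      { printf("matched group A\n"); /* do something      */ }
"another string"       { printf("matched group B\n"); /* do something else */ }
.|\n                   { /* ignore everything else */ }
%%

int main(void) {
    yylex();            /* reads standard input by default */
    return 0;
}

int yywrap(void) { return 1; }
```

lex compiles all of those patterns into a single DFA, so the input is scanned once no matter how many literals are in the list.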

If the strings you have to match change over time, there are regular expression libraries that can compile regular expressions at run time, but these use lots of RAM, and depending on your program's architecture that data might be duplicated across the different instances of the program.

The great thing about using a regular expression approach rather than lots of strcmp calls is that if you had the patterns:

"string1"
"string2"
"string3"

and the input:

"string2"

The partial match for "string" would be done just once by a DFA (deterministic finite automaton) regular expression system (like lex), which would probably speed up your system. Building these tables does require a lot of work on lex's part, but all of the hard work is done up front.

nategoose
lex is an interesting idea, I hadn't thought of that yet. Though I don't think I would benefit from the DFA regex system much, as the strings generally aren't similar enough to share part of the match. (Most are probably 90% unique, beginning included.) This would probably be a second step, once I've refactored and organized the matching a bit better, to the point where the `do` parts of `if (match) do` aren't all unique code paths that require special handling.
thelsdj
Even demultiplexing on the first character can save you a great deal of time. But some versions of lex may still add new table states for each byte in every substring, even when that character could only ever lead to one (or a few) next states. That could mean very big tables.
nategoose
+2  A: 

I'm not sure just how slow the current implementation is, so it's hard to recommend optimizations without knowing what level of optimization is needed.

Given that, however, I might suggest a two-stage approach. Take your string list and compile it into a radix tree, and then save this tree to some custom format (XML might be good enough for your purposes).

Then your process startup would consist of reading in the radix tree and matching against it. If you want/need to optimize the memory storage of the tree, that can be done as a separate project, but it sounds to me like improving the matching algorithm would be a more efficient use of time. In some ways this is a 'roll your own regex system' idea, rather similar to the suggestion to use a parser generator.
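
A minimal sketch of the matching side, assuming a plain (uncompressed) trie rather than a true radix tree, with invented class and member names (a radix tree would collapse single-child chains into one edge, which saves nodes but matches the same way):

```cpp
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    int patternId = -1;   // >= 0 if a pattern ends at this node
};

class PatternTrie {
public:
    void add(const std::string& pattern, int id) {
        TrieNode* node = &root_;
        for (char c : pattern) {
            auto& child = node->children[c];
            if (!child) child.reset(new TrieNode);
            node = child.get();
        }
        node->patternId = id;
    }

    // Return the id of the first pattern found as a substring of html, or -1.
    int findFirst(const std::string& html) const {
        for (std::size_t start = 0; start < html.size(); ++start) {
            const TrieNode* node = &root_;
            for (std::size_t i = start; i < html.size(); ++i) {
                auto it = node->children.find(html[i]);
                if (it == node->children.end()) break;
                node = it->second.get();
                if (node->patternId >= 0) return node->patternId;
            }
        }
        return -1;
    }

private:
    TrieNode root_;
};

// Usage:
//   PatternTrie trie;
//   trie.add("some string", 0);
//   trie.add("some other string", 1);
//   int hit = trie.findFirst(htmlPage);  // -1 if nothing matched
```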

Edit: I've used something similar to this, where, as a precompile step, a custom script generates a somewhat optimized structure and saves it as a large char* array. (Obviously it can't be too big, but it's another option.)

The idea is to keep the list there (making maintenance reasonably easy), but have the pre-compilation step speed up access at runtime.
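
A much-simplified sketch of what that generated output could look like (the file name and layout are entirely hypothetical, and this packs the strings into a NUL-separated blob rather than a serialized tree):

```cpp
// generated_patterns.h -- emitted by a (hypothetical) precompile script.
// The patterns are packed back-to-back, NUL-separated, in one read-only blob.
static const char kPatternBlob[] =
    "string one\0"
    "string two\0"
    "some other string\0";

static const unsigned kPatternOffsets[] = { 0, 11, 22 };  // start of each pattern
static const unsigned kPatternCount = 3;
```

The maintained list stays in a plain text file that the script reads; only the generated header is compiled in, so the data still lives in the binary's shared read-only pages.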

jkerian
Basically the problem I'm trying to solve is that the current code is full of: `if (html.contains("some string")) ... if (html.contains("some other string"))` with a lot of logic wrapped around all these calls; they aren't all sequential or anything simple like that. I want to improve the maintainability of the code but not lose the performance that the naive approach already provides (due to no loading and likely shared storage of all these literals).
thelsdj
If you're not having issues with performance, honestly I'm not sure I'd worry about it. In terms of maintainability, it seems like keeping "some other string" close to the code that will be executed if it's found is quite valuable. The only real exception I can see is if the entire HTML document will be searched for a few hundred strings (i.e., none of the html.contains() calls are inside conditional blocks), in which case you'd essentially optimize for the searches: fill some sort of "page status" variable that indicates which ones you've found, and loop through them.
jkerian
My first goal in refactoring is to do away with `if (x.contains(y)) do q; if (x.contains(z)) do q;` where q is identical in both cases and there may be 10 of these checks. It seems like the cleanest way is to store y, z, ... in an array and loop; that way I end up with one array of strings for each `do` they are attached to. Second, some of these groups of `do`s are identical except for what they set a variable to, which should also be refactored into a parameter somewhere. Generally it is the first match that matters, though there are some nested matches, so it can't be simplified as much as you suggest.
thelsdj
Edit your answer again so I can vote it up :)
thelsdj