ansaurus

Question

Efficient memory storage and retrieval of categorized string literals in C++

Answer 1

A:

Are these literal strings stored in a file? If so, as you suggested, your best option might be to use memory mapped files to share copies of the file across the hundreds of instances of the program. Also, you may want to try and adjust the working set size to try and see if you can reduce the number of page faults, but given that you have so many instances, it might prove to be counterproductive (and besides your program needs to have quota privileges to adjust the working set size).

There are other tricks you can try to optimize IO performance like allocating large pages, but it depends on your file size and the privileges granted to your program.

The bottomline is that you need to experiment to see what works best and remember to measure after each change :)...

Atul 2010-09-17 21:27:27

Currently they are all inline in the code like: `if (htmlPage.find("string to match") != htmlPage.npos)....` which is terrible. I'd rather not deal with some external file format so thats why I'm thinking arrays of strings would be a good first step to cleaning this up.

thelsdj 2010-09-17 21:29:55

@thelsdj: That code looks like C++ rather than C

nategoose 2010-09-17 21:40:10

@nategoose You're right, we use C++ compiler and std::string for a lot of stuff so it's disingenuous of me to specify C so I changed it. This is a bad/good example of an ugly "legacy" app. I think it was originally written as C++ to take advantage of C++ libraries and very basic STL usage (string) but the organization of the app is just lots of functions thrown together, no class design at all.

thelsdj 2010-09-17 21:50:03

@nategoose But I should point out that the question and the answer I'm looking for uses nothing that is specific to C++ and should apply to C as well.

thelsdj 2010-09-17 21:55:46

Answer 2

+1 A:

If the strings that need to be matched can be locked down at compile time you should consider using a tokenizer generator like lex to scan your input for matches. If you aren't familiar with it lex takes a source file which has some regular expressions (including the simplest regular expressions -- string literals) and C action code to be executed when a match is found. It is used often in building compilers and similar programs, and there are several other similar programs that you could also use (flex and antlr come to mind). lex builds state machine tables and then generates efficient C code for matching input against the regular expressions those state tables represent (input is standard input by default, but you can change this). Using this method would probably not result in the duplication of strings (or other data) in memory among the different instances of your program that you fear. You could probably easily generate the regular expressions from the string literals in your existing code, but it may take a good bit of work to rework your program to use the code that lex generated.

If the strings you have to match change over time there are some regular expressions libraries that can compile regular expressions at run time, but these do use lots of RAM and depending on your program's architecture these might be duplicated across different instances of the program.

The great thing about using a regular expression approach rather than lots of strcmp calls is that if you had the patterns:

"string1"
"string2"
"string3"

and the input:

"string2"

The partial match for "string" would be done just once for a DFA (Deterministic Finite-state Automaton) regular expression system (like lex) which would probably speed up your system. Building these things does require a lot of work on lex 's behalf, but all of the hard work is done up front.

nategoose 2010-09-17 21:30:29

lex is an interesting idea, I hadn't thought of that yet. Though I don't think I would benefit from the DFA regex system much as the strings generally aren't similar enough to share part of the match. (Most probably are 90% unique, beginning included)). This would probably be a second step once I've refactored and organized the matching a bit better to the point where `do` part of `if (match) do` are not all unique code paths that require special handling.

thelsdj 2010-09-17 21:44:56

Even demultiplexing on the first character can save you a great deal of time. But some versions of _lex_ may still try to add new table states for each byte in every substring even if there is no possibility that that character could result in but two (or a few) next states. That could mean very big tables.

nategoose 2010-09-17 21:53:58

Answer 3

+2 A:

I'm not sure just how slow the current implementation is. So it's hard to recommend optimizations without knowing what level of optimization is needed.

Given that, however, I might suggest a two-stage approach. Take your string list and compile it into a radix tree, and then save this tree to some custom format (XML might be good enough for your purposes).

Then your process startup should consist of reading in the radix tree, and matching. If you want/need to optimize the memory storage of the tree, that can be done as a separate project, but it sounds to me like improving the matching algorithm would be a more efficient use of time. In some ways this is a 'roll your own regex system' idea. Rather similar to the suggestion to use a parser generator.

Edit: I've used something similar to this where, as a precompile step, a custom script generates a somewhat optimized structure and saves it to a large char* array. (obviously it can't be too big, but it's another option)

The idea is to keep the list there (making maintenance reasonably easy), but having the pre-compilation step speed up the access during runtime.

jkerian 2010-09-17 21:31:23

Basically the problem I'm trying to solve is that the current code is full of: `if (html.contains("some string")).... if (html.contains("some other string"))` with a lot of logic wrapped around all these calls, they aren't all sequential or anything simple like that. I want to improve the maintainability of the code but not lose the performance that the naive approach already provides (due to no loading and likely shared storage of all these literals).

thelsdj 2010-09-17 21:38:17

If you're not having issues with performance, honestly I'm not sure I'd worry about it. In terms of maintainability, it seems like the close location of the "some other string" with the code that will be executed if it's found is quite valuable. The only real exception I can see to this is if the entire html document will be searched for a few hundred strings (ie, none of the html.contains() calls are inside conditional blocks), in which case you'd essentially optimize for the searches, and fill some sort of "page status" variable that indicates which ones you've found, and loop through them.

jkerian 2010-09-17 21:57:07

My first goal in refactoring is to do away with `if (x.contains(y)) do q; if (x.contains(z)) do q;` where q is identical in both cases and there may be 10 of these checks. Seems like the cleanest way is to store y,z,... in an array and loop that way I end up with one array of strings for each `do` they are attached to. Second, some of these groups of `do`s are identical except for what they set a variable to which should also be refactored into a parameter somewhere. Generally it is the first match that matters, though there are some nested matches so can't be simplified as much as you suggest

thelsdj 2010-09-17 22:11:14

Edit your answer again so I can vote it up :)

thelsdj 2010-09-17 22:12:41

ansaurus

tags:

views:

answers:

Efficient memory storage and retrieval of categorized string literals in C++

related questions