I'm looking for a way to search through terabytes of data for patterns matching regexes. The implementation needs to support many of the finer capabilities of regexes, such as beginning- and end-of-line anchors, full TR1 support (preferably with POSIX and/or PCRE support), and the like. We're effectively using this application to test compliance with policy on the storage of potentially sensitive information.
I've looked into indexing solutions, but the majority of the commercial suites don't seem to have the finer regex capabilities we'd like (to date, they've all utterly failed at parsing the complex regexes we're using).
This is a complicated problem because of the sheer volume of data we have and the limited system resources we can dedicate to scanning (not much; it's just a check on policy compliance, so there isn't much of a hardware budget).
I looked into Lucene, but I'm a little hesitant about using index systems that can't fully handle our regex battery. Searching through the entire dataset would remedy this problem, but we'd have to let the servers chug along for at least a couple of weeks.
Any suggestions?