views: 83
answers: 5
I'm looking for a way to search through terabytes of data for patterns matching regexes. The implementation does need to support a lot of the finer capabilities of regexes, such as beginning- and end-of-line anchors, full TR1 support (preferably with POSIX and/or PCRE support as well), and the like. We're effectively using this application to test policy regarding storage of potentially sensitive information.

I've looked into indexing solutions, but the majority of the commercial suites don't seem to have the finer regex capabilities we'd like (to date, they've all utterly failed at parsing the complex regexes we're using).

This is a complicated problem because of the sheer amount of data we have and the limited system resources we can dedicate to the task of scanning (not much; it's just a check on policy compliance, so there isn't much of a budget for hardware).

I looked into Lucene, but I'm a little hesitant about using index systems that aren't fully capable of dealing with our regex battery. Searching through the entire dataset would remedy that problem, but we'd have to let the servers chug along at it for at least a couple of weeks.

Any suggestions?

A: 

The grep program is highly optimized for regex searching in files, to the point where I would say you could not beat it with any general-purpose regex library. Even grep, however, would be impractically slow for searching terabytes, so I think you're out of luck on doing full regex searches.

One option might be to use an indexer as a first-pass to find likely matches, then extract some bytes on either side of each match and run a full regex match on it.
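
A rough sketch of that second pass, assuming Python and a hypothetical candidates() source that yields (path, byte offset) pairs from whatever indexer you choose:

    import re

    WINDOW = 4096  # bytes of context pulled from either side of a candidate hit

    def confirm(candidates, pattern):
        """Run the full regex only on a small window around each indexed hit."""
        rx = re.compile(pattern, re.MULTILINE)
        for path, offset in candidates:
            with open(path, 'rb') as f:
                f.seek(max(0, offset - WINDOW))
                chunk = f.read(2 * WINDOW).decode('latin-1', 'replace')
            if rx.search(chunk):
                yield path, offset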

Tim Sylvester
Trying to sort for likely matches may prove to be a challenge given the set of data we're looking for, but the idea is solid for sure.
tearman
+1  A: 

You might want to take a look at Apache Hadoop. Enormous sites like Yahoo and Facebook use Hadoop for a variety of things, one of them being processing many terabytes of text logs.

In the Hadoop documentation there is an example of a distributed Grep that could be scaled to handle any conceivable data set size.

There is also a SequenceFileInputFilter.RegexFilter in the Hadoop API if you wanted to roll your own solution.
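
If Java isn't convenient, Hadoop Streaming lets you write the mapper in another language. A minimal sketch in Python (this is not the documented Grep example itself, and the pattern below is just a placeholder):

    #!/usr/bin/env python
    # Streaming "grep" mapper: reads input lines on stdin and emits only
    # the ones that match; run it via the hadoop-streaming jar with no
    # reducer (or an identity reducer).
    import re
    import sys

    PATTERN = re.compile(r'placeholder-expression')  # substitute your real regex

    for line in sys.stdin:
        if PATTERN.search(line):
            sys.stdout.write(line)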

shadit
This is an interesting option, though our system is (unfortunately) Win32 and all rolled into a comparatively small cluster for the size of the dataset, so I'm wondering how well Hadoop would handle such a small resource set against a large data set on a development platform.
tearman
+1  A: 

PowerGREP can handle any regular expression and has been designed for exactly this purpose. I've found it to be extremely fast when searching through large amounts of data, but I haven't tried it on the order of terabytes yet. Since there's a 30-day trial, though, it's worth a shot, I'd say.

It's especially powerful when it comes to searching specific parts of files. You can section a file according to your own criteria and then apply another search only to those sections. It also has very good reporting capabilities.

Tim Pietzcker
This is probably the most practical solution within our given parameters, and while I'm not too thrilled about running with so much user intervention required, the application does seem to fit our needs.
tearman
Maybe if you contact the company that makes PowerGREP you can get them to license you a component that you can call from your application. Small- to medium-sized software companies are often receptive to outside-the-box opportunities like this.
shadit
+1  A: 

I can only offer a high-level answer. Building on Tim's and shadit's answers, use a two-pass approach implemented as a MapReduce algorithm on EC2 or Azure Compute. In each pass, the map could take a chunk of data with an identifier and return the identifier to the reduce step if a match is found, or a null value otherwise. Scale it as wide as you need to shrink the processing time.
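
Roughly, in Python, the map/reduce pair I have in mind (just a sketch; chunk_id and chunk_text are whatever your splitter produces):

    import re

    def map_chunk(chunk_id, chunk_text, pattern):
        """Map: return the chunk's identifier if the regex matches, else None."""
        return chunk_id if re.search(pattern, chunk_text, re.MULTILINE) else None

    def reduce_ids(mapped_values):
        """Reduce: keep only the identifiers of chunks that actually matched."""
        return [cid for cid in mapped_values if cid is not None]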

gWiz
Great option, just seems like overkill for what we're attempting to accomplish in the end and budget constraints may dictate against this. Like the idea though.
tearman
Yeah this definitely wouldn't be a trivial undertaking. It really only makes sense if your business can benefit from having a solution in place (for future clients/needs).
gWiz
It's an in-house-only solution for this one problem, unfortunately; I've already tried pitching some fancy stuff and it got shot down. Could be a great idea for a commercial app, though.
tearman
A: 

Disclaimer: I am not a search expert.

If you really need all the generality of regexes, then there's going to be nothing better than trawling through all the data (but see the notes below on speeding that up).

However, I would guess that is not really the case, so the first thing to do is to see if you can use an index to identify possible documents. If, for example, you know that all your matches will include a word (any word), then you can index the words, use that to find the (hopefully small) set of documents that include that word, and then use grep or an equivalent only on those files.

So, for example, maybe you need to find documents that have "FoObAr" at the start of a line. You would start with a caseless index to identify files that have "foobar" anywhere, and then grep (only) those for "^FoObAr".
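
In Python, that two-step idea might look like this (a sketch; index.files_containing is a stand-in for whatever word index you build):

    import re

    def find_matches(index, regex=r'^FoObAr'):
        """Cheap caseless word lookup first, expensive anchored regex second."""
        rx = re.compile(regex, re.MULTILINE)
        for path in index.files_containing('foobar', caseless=True):  # hypothetical index API
            with open(path, 'r', errors='replace') as f:
                if rx.search(f.read()):
                    yield path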

Next, how to grep as quickly as possible. You're likely going to be limited by IO speed, so look at using several disks (there may be no need to use RAID; you could just have one thread per disk). Also, consider compression: you don't need random access to these files, and if they are text (I assume they are if you are grepping them) then they will compress nicely. That will reduce the amount of data you need to read (and store).
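
Something along these lines, assuming Python, gzip-compressed files, and one worker process per physical disk (how you build the per-disk file lists is up to you):

    import gzip
    import re
    from multiprocessing import Pool

    PATTERN = re.compile(rb'^FoObAr')  # placeholder; ^ matches the start of each line here

    def scan_disk(paths):
        """Scan one disk's worth of compressed files sequentially."""
        hits = []
        for path in paths:
            with gzip.open(path, 'rb') as f:
                for lineno, line in enumerate(f, 1):
                    if PATTERN.search(line):
                        hits.append((path, lineno))
        return hits

    def scan_all(disk_file_lists):
        # One worker per disk so each spindle streams sequentially.
        # On Windows, call this from under an `if __name__ == '__main__':` guard.
        with Pool(len(disk_file_lists)) as pool:
            return pool.map(scan_disk, disk_file_lists)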

Finally, note that if your index doesn't work for ALL queries, then it's probably not worth using. You can "grep" for all expressions in a single pass, and the expensive part is reading the data, not the details of the grep, so even if there is "just one" query that cannot be indexed, and you therefore need to scan everything, building and using an index is probably not a good use of your time.
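
Concretely, compiling the whole battery once and testing every line against all of the expressions in one read keeps the expensive part (the IO) down to a single pass (again, just a Python sketch):

    import re

    def grep_all(paths, patterns):
        """One pass over the data, testing every compiled expression per line."""
        compiled = [(p, re.compile(p)) for p in patterns]
        for path in paths:
            with open(path, 'r', errors='replace') as f:
                for lineno, line in enumerate(f, 1):
                    for raw, rx in compiled:
                        if rx.search(line):
                            yield path, lineno, raw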

andrew cooke