I was wondering how I could quickly search a data string of up to 1 billion bytes. The data is all numeric. Currently we have the data split into 250 KB files, and we run strpos (the fastest built-in function) on each file in turn until it finds something. Is there a way I can index the data to make the search faster? Any suggestions?

Eventually I would like to find multiple occurrences, which, as of now, would be done with the offset parameter on strpos.
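For illustration, that strpos-with-offset loop looks like this (sketched in Python, with `str.find` standing in for PHP's `strpos` and its offset parameter; this is not our actual code):

```python
def find_occurrences(haystack, needle):
    """Collect every match position, restarting the search one byte past
    each hit so overlapping matches are reported too. PHP's strpos()
    offset parameter is used the same way."""
    positions = []
    offset = haystack.find(needle)
    while offset != -1:
        positions.append(offset)
        offset = haystack.find(needle, offset + 1)
    return positions

print(find_occurrences("12341234", "123"))  # -> [0, 4]
```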

Any help would surely lead to recognition where needed.

Thanks! - James Hartig

+1  A: 

Well, your tags indicate what you should do (the tag I am referring to is "indexing").

Basically, you should keep separate index files for the data. Each index entry would hold a data string you might look for, along with the file and byte position(s) where it occurs.

You would then access the index, look up your value and then find the location(s) in the original file(s) for the data string, and process from there.
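A minimal sketch of that idea (Python for illustration; `build_index` and the fixed 3-digit key width are my assumptions, not code from this thread):

```python
# Hypothetical sketch: index every fixed-width k-gram of the numeric data,
# mapping it to the byte offsets where it starts. k = 3 assumes no search
# query is shorter than 3 digits.
from collections import defaultdict

def build_index(data, k=3):
    """Map each k-digit substring to every offset where it begins."""
    index = defaultdict(list)
    for i in range(len(data) - k + 1):
        index[data[i:i + k]].append(i)
    return index

def lookup(data, index, query, k=3):
    """Find candidates via the first k digits, then verify the full query."""
    return [i for i in index[query[:k]] if data.startswith(query, i)]

data = "12341234"
idx = build_index(data)
print(lookup(data, idx, "123"))  # -> [0, 4]
```

On the real data set the index would be written to disk, split by key, so a query only has to open the index file for its first k digits.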

casperOne
However, I don't know exactly how I should index. The search string could be anything from 3 digits (000 to 999) up to 100 digits, so anything could be searched for.
James Hartig
Well, is there any structure to this file? Or are you searching for arbitrary data strings?
casperOne
The files are straight text files with 100% numeric data. As stated, each file contains 250k bytes of data: the first file (1.txt) contains bytes 1-250,000, the next (2.txt) bytes 250,001-500,000, and so on.
James Hartig
And yes, arbitrary data strings are searched for, and the results are cached.
James Hartig
A: 

A good answer may require that you get a little more specific.

  1. How long is the search query? 1 digit? 10 digits? Arbitrary length?

  2. How "fast" is fast enough? 1 second? 10 seconds? 1 minute?

  3. How many total queries per second / minute / hour do you expect?

  4. How frequently does the data change? Every day? Hour? Continuously?

  5. When you say "multiple occurrences", do you mean overlapping matches?

  6. What is the "value" of the answer and to how many people?

A billion ain't what it used to be, so you could just index the crap out of the whole thing and keep an index that is 10 or even 100 times the size of the original data. But if the data were changing by the minute, you would burn more cycles creating the index than searching it.

The amount of time and money you put into a solution is a function of the value of that solution.

Peter Rowell
The data will only change once a month. 1) 3-100 digits allowed. 2) Less than a second preferred. 3) 100 queries a minute (results are cached). 4) Monthly. 5) Yes (in "12341234", if I searched "123" I want two locations: 0 and 4). 6) Unknown at the moment.
James Hartig
As for an index 10 to 100 times the original data: I would prefer to stay under 50x the original size, but I can go as high as 125x.
James Hartig
A: 

You should definitely get a girlfriend. Besides helping you spend your time better, it can grow fat without bursting. Oh, and the same goes for databases.

Cristian Libardo
Already have one.
James Hartig
Database or girlfriend? ;)
Cristian Libardo
As far as databases go, I doubt that's the best option. Would something like suffix indexing work? And both :P
James Hartig
I might have misinterpreted your question. Do you have a long sequence of random-looking data? To apply any form of indexing, there must be some underlying order to tell you where to start looking. A telephone catalogue uses the letters of the name.
Cristian Libardo
If this isn't possible, you should consider a large in-memory cache.
Cristian Libardo
A: 

All of Peter Rowell's questions pertain. If you absolutely must have an out-of-the-box answer, then try grep. You can even exec it from PHP if you like. It is orders of magnitude faster than strpos. We've actually used it quite successfully as a solution for something that couldn't deal with indexing.

But again, Peter's questions still all apply. I'd answer them before diving into a solution.
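For instance, GNU grep's `-b` (print the 0-based byte offset of each match) and `-o` (print each match on its own line) flags give you positions directly. A quick sketch calling it from Python rather than PHP's exec (illustrative only; note that grep will not report overlapping matches):

```python
# Illustrative only: shelling out to GNU grep, as the answer suggests
# (done from Python here; the thread would use PHP's exec instead).
import subprocess

def grep_offsets(pattern, path):
    """Return the byte offset of every non-overlapping match in `path`."""
    proc = subprocess.run(
        ["grep", "-b", "-o", "--", pattern, path],
        capture_output=True, text=True,
    )
    # each output line looks like "4:1234"; grep exits 1 when nothing matched
    return [int(line.split(":", 1)[0]) for line in proc.stdout.splitlines()]
```

For a file containing "12341234", `grep_offsets("123", path)` returns the two offsets 0 and 4, matching the example discussed in the comments.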

conceptDawg
I already answered those in the comments. But thanks, I'll look into grep.
James Hartig
A: 

Would a hash function/table work? Or a Suffix Array/Tree?
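A suffix array would support this kind of arbitrary-substring lookup. A minimal sketch of the idea (Python for illustration only; a real billion-byte index would need a linear-time construction and on-disk storage, and these function names are mine, not from the thread):

```python
# Hypothetical sketch: sort every suffix start position by its text,
# then binary-search for the block of suffixes beginning with the query.
import bisect

def build_suffix_array(data):
    """Every suffix start position, sorted by the suffix text.
    Naive O(n^2 log n) construction -- fine only for a sketch."""
    return sorted(range(len(data)), key=lambda i: data[i:])

def sa_find(data, sa, query):
    """Binary-search for the suffixes that begin with `query`."""
    keys = [data[i:i + len(query)] for i in sa]  # built per call only for clarity
    lo = bisect.bisect_left(keys, query)
    hi = bisect.bisect_right(keys, query)
    return sorted(sa[lo:hi])

data = "12341234"
sa = build_suffix_array(data)
print(sa_find(data, sa, "123"))  # -> [0, 4]
```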

James Hartig