views:

260

answers:

4

I have a file (fasta file to be specific) that I would like to index, so that I can quickly locate any substring within the file and then find the location within the original fasta file.

This would be easy to do in many cases, using a Trie or substring array, unfortunately the strings I need to index are 800+ MBs which means that doing them in memory in unacceptable, so I'm looking for a reasonable way to create this index on disk, with minimal memory usage.

(edit for clarification)

I am only interested in the headers of proteins, so for the largest database I'm interested in, this is about 800 MBs of text.

I would like to be able to find an exact substring within O(N) time based on the input string. This must be useable on 32 bit machines as it will be shipped to random people, who are not expected to have 64 bit machines.

I want to be able to index against any word break within a line, to the end of the line (though lines can be several MBs long).

Hopefully this clarifies what is needed and why the current solutions given are not illuminating.

I should also add that this needs to be done from within java, and must be done on client computers on various operating systems, so I can't use any OS Specific solution, and it must be a programatic solution.

A: 

I talked to a few co-workers and they just use VIM/Grep to search when they need to. Most of the time I wouldn't expect someone to search for a substring like this though.

But I don't see why MS Desktop search or spotlight or google's equivalent can't help you here.

My recommendation is splitting the file up --by gene or species, hopefully the input sequences aren't interleaved.

nlucaroni
+1  A: 

In some languages programmers have access to "direct byte arrays" or "memory maps", which are provided by the OS. In java we have java.nio.MappedByteBuffer. This allows one to work with the data as if it were a byte array in memory, when in fact it is on the disk. The size of the file one can work with is only limited by the OS's virtual memory capabilities, and is typically ~<4GB for 32-bit computers. 64-bit? In theory 16 exabytes (17.2 billion GBs), but I think modern CPUs are limited to a 40-bit (1TB) or 48-bit (128TB) address space.

This would let you easily work with the one big file.

Stu Thompson
So, the issue with this idea is that with a 7 MB header file, the substring Trie is about 600 MBs.
emeryc
The point to my post is that, when working with direct byte buffers, one can literally forget the difference between what is on disc and what is in memory and just concentrate on the algorithm.
Stu Thompson
except that you can't when your dealing with more then 4 gigs of data, which is the case.
emeryc
Your OP says 800MB, not 4GB. :S Is moving up to a 64-bit OS an option?
Stu Thompson
i've updated the answer with modern 64-bit general purpose CPUs addressable memory info.
Stu Thompson
+1  A: 

The FASTA file format is very sparse. The first thing I would do is generate a compact binary format, and index that - it should be maybe 20-30% the size of your current file, and the process for coding/decoding the data should be fast enough (even with 4GB) that it won't be an issue.

At that point, your file should fit within memory, even on a 32 bit machine. Let the OS page it, or make a ramdisk if you want to be certain it's all in memory.

Keep in mind that memory is only around $30 a GB (and getting cheaper) so if you have a 64 bit OS then you can even deal with the complete file in memory without encoding it into a more compact format.

Good luck!

Adam Davis
$30 a GB but alas, only 4 slots on my motherboard...
gbjbaanb
A: 

I don't imagine that the original poster still has this problem, but anyone needing FASTA file indexing and subsequence extraction should check out fastahack: http://github.com/ekg/fastahack

It uses an index file to count newlines and sequence start offsets. Once the index is generated you can rapidly extract subsequences; the extraction is driven by fseek64.

It will work very, very well in the case that your sequences are as long as the poster's. However, if you have many thousands or millions of sequences in your FASTA file (as is the case with the outputs from short-read sequencing or some de novo assemblies) you will want to use another solution, such as a disk-backed key-value store.

Erik Garrison