Fast algorithms for finding unique sets in two very long sequences of text

views:

139

answers:

+14 Q:

Fast algorithms for finding unique sets in two very long sequences of text

I need to compare the DNA sequences of X and Y chromosomes, and find patterns (composed of around 50-75 base pairs) that are unique to the Y chromosome. Note that these sequence parts can repeat in the chromosome. This needs to be done quickly (BLAST takes 47 days, need a few hours or less). Are there any algorithms or programs in particular suited to this kind of comparison? Again, speed is the key here.

One of the reasons I put this on SO was to get perspective from people outside the specific application domain, who can put forth algorithms they use in string comparison in their daily use, that may apply to our use. So don't be shy!

+1 A:

This paper might have some alternatives for adapting BLAST to improve its performance (by sub-dividing the problem space AFAIKS).

Andrew Matthews 2010-08-27 02:53:53

+2 A:

Build a suffix tree S on sequence X.
For each starting position i in sequence Y, look for the string Y[i..i+75] in S. If no match can be found starting at position i (i.e. if lookup fails after j < 75 nucleotides matched) then you have found a length-j string unique to Y.
The smallest such string over all starting positions i is the shortest unique string (or just halt after you find any such string if you aren't interested in minimising the length).

Total time: O(|X| + m|Y|) where m is the maximum string length (e.g. m = 75).

There are probably even more efficient algorithms based on generalised suffix trees.

j_random_hacker 2010-08-27 03:51:21

there probably needs to be a minumum length string, as length 1 strings will invalidate (make not unique) every starting position in Y. this will give X=ACGT and Y=TGCA to be both not unique since for every length 1 string in Y there exists the equivalent string in X.

aepurniet 2010-08-27 16:30:08

Not sure what you mean -- yes, there must exist a minimum length string (or strings) that exist in X but not Y. If that minimum length is > m (say 75) then the above algorithm won't find it -- is that what you mean?

j_random_hacker 2010-08-28 01:12:29

+2 A:

I am assuming that you have a single X and a single Y to compare. Concatenate them, separated by a marker character that does not appear in either, to form e.g. XoY. Now form the http://en.wikipedia.org/wiki/Suffix_array in linear time.

What you get is an array of pointers to positions in the concatenated string, where the pointers are arranged so that the substrings they point to appear in alphabetical order in the array. You also get an LCP array, giving the length of the longest common prefix shared between a suffix and the suffix directly before it in the array, which is the suffix that sorts just less than it. This is in fact the longest common prefix shared between that position and ANY substring less than it, because anything with a longer common prefix and less than string[i] would sort between it and the current string[i - 1].

You can tell which original string a pointer points into from its position, because X comes before Y. You can cut the array up into alternating sub-sections of X and Y pointers. The length of the common prefix shared between pos[i] and pos[i - 1] is lcp[i]. The length of the prefix shared between pos[i] and pos[i-2] is min(lcp[i], lcp[i-1]). So if you start at the Y value just before a range of Xs you can work out the number of characters of prefix between that Y and all of the Xs in turn by stepping down the section, doing a min operation at each step. Similarly you can work out the number of characters of prefix shared between all of those Xs and the Y that appears next in the suffix array at the cost of one min per X. Ditto, of course for the Y ranges. Now do a max per entry to work out the longest prefix shared between each position in X (or Y) and any position in Y (or X).

I think you want the substrings within either X or Y which have small prefixes shared between it and any other substring of the other sex, because the strings one character longer than this starting from the same position do not appear in the other sex at all. I think once you have done the min() calculations above you can extract the N smallest prefix substrings using a heap to keep track of the N smallest entries. I think everything here takes time linear in |X| + |Y| (unless N is comparable to |X| or |Y|).

mcdowella 2010-08-27 05:19:25

+1 for the general idea. But I would do it slightly differently: make 2 passes (1 forward, 1 backward) through the LCP array, each of which stores the maximum match length in X for each Y offset in one lexicographical direction. The forward pass compares the last X in a block of Xs with each Y in the immediately following block of Ys; the reverse pass compares the first X in a block of Xs with each Y in the immediately preceding block of Ys. Then, for each Y offset take the maximum of these 2 match lengths -- that's the best match length for that Y position to any X position.

j_random_hacker 2010-08-27 08:02:41

Finally, take the minimum over all Y positions of that maximum, and add 1 to get the minimum unique length. Definitely linear time -- we don't need to worry about any X substrings except those at the beginning or end of a block of X substrings.

j_random_hacker 2010-08-27 08:05:09

Yes - my idea is basically longest common substring with the serial numbers filed off. Your improvements fit it a lot better to what the OP was actually asking for.

mcdowella 2010-08-27 19:32:01

I have an interesting answer, it will be technological one. Main idea is that comparisons of sub-sequences should be done on GPU, because GPU of modern video cards is highly parallel processing environment (like small supercomputer). So we can encode base pair as one pixel, given that X chromosome is 154 million pairs- we get an image for X chromosome that consists of 154 million pixels - this image size will be about 500 MB. For Y chromosome we get an image of 160 MB in size. So these two (500+160) MB images could be handled by descent video card very effectively. (Just get a video card with >= 1 GB of video ram).

Next step is to write GPU program, maybe with the help of Pixel Shader or Cuda or OpenCL

GPU program would be simple - basically it will compare 50-75 neighboring pixels in Y chromosome image to all pixels of X chromosome image. So this GPU program should have maximum of 75*154 million operations, which will be computed on modern GPU in an HOUR or so. Because all sub-sequences of Y will be tested in parallel.

hope that helps

0x69 2010-08-27 09:40:49

(s)hes asking for what you called the 'simple' part. also its 75*154M operations for each data point (pixel) in Y.

aepurniet 2010-08-27 16:37:03

@aepurnietEach pixel would be processed in parallel by GPU, so total amount of operations DOESN'T sum here. That is why such comparison would last on GPU in about an hour (ok, to be very safe we could say about several hours).

0x69 2010-08-28 14:01:33

ansaurus

tags:

views:

answers:

Fast algorithms for finding unique sets in two very long sequences of text

related questions