Sorry about the bad heading, but the question was not easy to compress into one sentence...

I have two lists of contigs (list1 and list2). They contain mostly unique contigs, but with some overlap. I want to compare list1 and list2 and then create a list3 that contains all contigs in list1 minus those also present in list2. Is this possible with a simple cat/paste/grep/sort/uniq kind of batch command?

Thanks!

A: 

Take a look at the Iesi.Collections library; see also the CodeProject article at http://www.codeproject.com/KB/recipes/sets.aspx#xx703510xx

Bahadir Cambel
+1  A: 

You can do it with sort and uniq:

sort list1 list2 list2 | uniq -u 

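Any line in list2 appears at least twice in the sorted output, so uniq -u (which prints only lines that occur exactly once) filters it out. Note that this relies on list1 having no internal duplicates: a line repeated within list1 would be dropped as well.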
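A small sketch of the same pipeline on made-up data may help show what ends up in list3. The contig names and the temporary directory are assumptions for illustration only, not taken from the question:

```shell
# Hypothetical sample lists; contig names are invented for this example.
tmpdir=$(mktemp -d)
printf '%s\n' contig_1 contig_2 contig_3 contig_4 > "$tmpdir/list1"
printf '%s\n' contig_2 contig_4 contig_9 > "$tmpdir/list2"

# list2 is passed twice, so each of its lines occurs at least twice
# in the sorted stream and is discarded by uniq -u.
list3=$(sort "$tmpdir/list1" "$tmpdir/list2" "$tmpdir/list2" | uniq -u)

printf '%s\n' "$list3"   # prints contig_1 and contig_3
rm -rf "$tmpdir"
```

contig_9 is only in list2 and still disappears, because it also occurs twice in the combined input.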

Alon
Thanks! The others would probably work too, but this did the trick for my dataset. :)
Martin Malmstrøm
A: 

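Try comm -23.

Example (the first list contains the numbers 1-10, the second the numbers 5-8):

comm -23 <(seq 1 10 | sort) <(seq 5 8 | sort)

This assumes that your list1 and list2 are sorted in the lexical order that sort produces by default, not in numeric order; otherwise sort them first, as done here for the seq output.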
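If the lists are not sorted yet, one possible approach is to sort them into temporary files first and then run comm on those. This is only a sketch: the file names and sample numbers below are assumptions chosen for illustration.

```shell
# Hypothetical unsorted sample lists.
tmpdir=$(mktemp -d)
printf '%s\n' 11 12 5 13 7 14 15 > "$tmpdir/list1"
printf '%s\n' 6 7 8 5 4 1 > "$tmpdir/list2"

# comm needs both inputs in the lexical order that plain sort produces.
sort "$tmpdir/list1" > "$tmpdir/list1.sorted"
sort "$tmpdir/list2" > "$tmpdir/list2.sorted"

# -2 hides lines found only in the second file, -3 hides lines common to both,
# leaving only the lines unique to the first file.
list3=$(comm -23 "$tmpdir/list1.sorted" "$tmpdir/list2.sorted")

printf '%s\n' "$list3"   # prints 11, 12, 13, 14, 15 (one per line)
rm -rf "$tmpdir"
```

The output comes back in sorted order, so the original line order of list1 is not preserved.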

catwalk
A: 

You did not show any sample data for your lists, so I made some up. Assume:

$ cat file1
11
12
5
13
7
14
15

$ cat file2
6
7
8
5
4
1

$ awk 'FNR==NR { a[$0]; next } (! ($0 in a) ) ' file2 file1
11
12
13
14
15

If it's not what you want, describe your lists and your desired output more clearly, with examples.
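For readers unfamiliar with the FNR==NR idiom, here is the same awk one-liner spelled out with comments. It is a sketch that recreates the sample files above in a temporary directory; the array name seen is just an illustrative choice.

```shell
# Recreate the sample files from above in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 11 12 5 13 7 14 15 > "$tmpdir/file1"
printf '%s\n' 6 7 8 5 4 1 > "$tmpdir/file2"

list3=$(awk '
  # FNR == NR holds only while the first file argument (file2) is read:
  # store each line as an array key, then move on to the next line.
  FNR == NR { seen[$0]; next }
  # Remaining lines come from file1: print those not stored above.
  !($0 in seen)
' "$tmpdir/file2" "$tmpdir/file1")

printf '%s\n' "$list3"   # prints 11, 12, 13, 14, 15 (one per line)
rm -rf "$tmpdir"
```

Unlike the sort-based approaches, this needs no sorted input and keeps the lines in their original file1 order.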

ghostdog74
A: 

I also work in Bioinformatics and Genomics.

You'd be better off using Python or Perl to solve this problem with an overlap/mismatch threshold if you really want to get the unique contig sequences!

BY GentleYang from BGI Shenzhen in China :)

GentleYang