Hello, this is probably an easy task, but I am completely new to programming and I am stuck on the following: I have a huge file in the following format:

track type= wiggle name09
variableStep chrom=chr1
34 5 
36 7 
54 8 
variableStep chrom=chr2 
33 4 
35 2 
78 7 
this is text with the word random in it# this we need to remove
82 4 
88 6 
variableStep chrom=chr3 
78 5 
89 4 
56 7

Now what I would like as output is just

one file called 1, containing only

34 5
36 7
54 8

a second file called 2

33 4
35 2
78 7
82 4 
88 6

a third file

78 5
89 4
56 7

It would be great to get some help on this... If anyone knows how to do it in R, that would be even better.

+5  A: 

Does the following help?

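#!/usr/bin/env perl

use strict;
use warnings;

my $filename = 1;    # output files are named 1, 2, 3, ...
my $flag     = 1;    # true when the next data line starts a new file
my $fh;

while (<>) {
    if (/^\d+\s+\d+\s*$/) {    # data line: number, whitespace, number
        if ($flag) {
            $flag = 0;
            open $fh, '>', $filename or die "Cannot open $filename: $!";
            $filename++;
        }
        print $fh $_;
    }
    elsif (/random/) {    # stray text line inside a block: skip it
        next;
    }
    else {    # header line: the next data line starts a new file
        $flag = 1;
    }
}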

Usage:

Save the above as extract (any other name works too).

Assuming the data file is named file, run:

perl extract /path/to/file
Alan Haggai Alavi
A programming newbie is probably going to need some instructions on how to use that Perl script...
las3rjock
maybe... I don't understand the regular expressions... but let me try... thanks
Tamir
@las3rjock: Updated the answer.
Alan Haggai Alavi
@Tamir: The regular expression checks whether the line starts with digits, followed by whitespace, then digits again, with only optional whitespace up to the end of the line. If so, we write the line to the file.
Alan Haggai Alavi
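For readers who want to try the pattern before running the script, here is a small check in R (the language the question asks about). It is only a sketch: sample_lines and is_data are made-up names, and the lines are taken from the sample in the question.

```r
# A few lines taken from the sample data in the question.
sample_lines <- c("34 5 ",
                  "variableStep chrom=chr1",
                  "this is text with the word random in it",
                  "56 7")

# Same pattern as in the Perl script; perl = TRUE selects
# Perl-compatible regular expressions in R.
is_data <- grepl("^\\d+\\s+\\d+\\s*$", sample_lines, perl = TRUE)

# TRUE marks the rows that would be written to an output file.
print(is_data)
```

Only the first and the last line match, so the header row and the "random" text row are the ones that get left out.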
Cool!!! It works! Some minor things: it gets the naming wrong; basically it calls file 1 file no. 2 and so on. Secondly, within the data blocks there are single rows of text that I need to get rid of before I can use the script above. They are all different, but each of them contains the word "random".
Tamir
@Alan I see I think I can make sense of the regular expressions now
Tamir
How should the naming be? Sorry, I did not understand your second point.
Alan Haggai Alavi
I don't see why, but it produced an empty 1 file and then went on and put the first chunk of data into 2. There are rows which contain text instead of data. Usually they are OK, but the code above starts a new file whenever it comes across them. All of them can be identified because they have the word "random" in their text. How could we get rid of them before running the data extraction?
Tamir
Can you please provide some parts of the original data file so that I can understand the format better?
Alan Haggai Alavi
track type= wiggle name09variableStep chrom=chr134 536 754 8variableStep chrom=chr233 435 278 7this is text with the word random in it82 488 6variableStep chrom=chr378 589 456 7
Tamir
track type= wiggle name09 \n variableStep chrom=chr1 \n 34 5 36 7 54 8 variableStep chrom=chr2 33 4 35 2 78 7 this is text with the word random in it 82 4 88 6 variableStep chrom=chr3 78 5 89 4 56 7
Tamir
sorry I don't know how to make it look nicer... it does not show line breaks
Tamir
Can you please update your question with the above data? You can `code` it there.
Alan Haggai Alavi
Thank you so much Alan, the script works perfectly!
Tamir
You are welcome. :-)
Alan Haggai Alavi
+2  A: 

Here's a solution in R.

Load your data:

a <- readLines(textConnection("track type= wiggle name09
variableStep chrom=chr1
34 5 
36 7 
54 8 
variableStep chrom=chr2 
33 4 
35 2 
78 7 
this is text with the word random in it# this we need to remove
82 4 
88 6 
variableStep chrom=chr3 
78 5 
89 4 
56 7"))

Process it by finding the break points and keeping only the rows in "number, whitespace, number" format:

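idx <- grep("=", a)                                  # lines containing "=" are header lines
# When headers are consecutive (track line, then variableStep), keep only the last of each run
idx <- idx[c(which((idx[-1] - idx[-length(idx)]) > 1), length(idx))]
idx <- cbind(idx + 1, c(idx[-1] - 1, length(a)))     # first and last row of each block
invisible(sapply(seq_len(nrow(idx)), function(i) {
  x <- a[idx[i, 1]:idx[i, 2]]
  # Keep only "number whitespace number" rows; the trailing $ rejects rows with extra text
  write.table(x[grep("^\\d+\\s+\\d+\\s*$", x, perl = TRUE)], file = as.character(i),
              row.names = FALSE, col.names = FALSE, quote = FALSE)
}))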
Shane
thanks shane... trying it right now
Tamir
Hi Shane, where would it write it to? I usually specify a directory when I use write.table
Tamir
found it... working directory...
Tamir
Except for the first chunk of data (1), the code gives funny results, mixing data and leaving the text inside
Tamir
I updated my answer.
Shane
Hi Shane, it is not quite working, now it returns only the first lines without the data? It does something though... Thank you for taking the time
Tamir
Well, it works with your sample data. Not sure what to tell you without more information.
Shane
Updated it again using the same regex as Alan's solution.
Shane
Hi Shane... thank you! It nearly works, however it mixes the data a little bit. For example: reads after the random line do not end up in the file they belong to but are mixed into the next one... I don't know how to show a better snippet of the data... The real data contains between 10 and 80 million lines. However, Alan's code has helped me a lot. As has yours, since I had never used readLines before...
Tamir
Try changing the part where it says "=" to "variableStep chrom=chr".
Shane
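Given the file size mentioned above (10 to 80 million lines), one way to combine that last suggestion with chunked reading is sketched below. This is not Shane's code but a rough illustration under stated assumptions: every block starts with a line beginning with "variableStep chrom=chr", the output directory is empty beforehand (files are appended to), and the names split_wiggle_stream, out_dir and chunk_size are made up for the example.

```r
# Hypothetical helper: stream a wiggle-style file and write one file per block.
split_wiggle_stream <- function(path, out_dir, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  block <- 0L  # number of the current block; 0 means no header seen yet
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    # Every header line starts the next block; numbering continues across chunks.
    ids <- block + cumsum(grepl("^variableStep chrom=chr", lines))
    block <- ids[length(ids)]
    # Keep "number whitespace number" rows only; this drops headers,
    # the track line and the "random" text rows.
    keep <- grepl("^\\d+\\s+\\d+\\s*$", lines, perl = TRUE) & ids > 0L
    pieces <- split(lines[keep], ids[keep])
    for (id in names(pieces)) {
      # Append, because a block may span several chunks.
      cat(pieces[[id]], file = file.path(out_dir, id), sep = "\n", append = TRUE)
    }
  }
  invisible(block)
}

# Small demonstration on the sample from the question, using temporary paths.
demo_lines <- c("track type= wiggle name09",
                "variableStep chrom=chr1", "34 5", "36 7", "54 8",
                "variableStep chrom=chr2", "33 4", "35 2", "78 7",
                "this is text with the word random in it",
                "82 4", "88 6",
                "variableStep chrom=chr3", "78 5", "89 4", "56 7")
in_file <- tempfile()
writeLines(demo_lines, in_file)
out_dir <- tempfile()
dir.create(out_dir)
# A tiny chunk size forces blocks to span chunk boundaries.
split_wiggle_stream(in_file, out_dir, chunk_size = 4L)
print(list.files(out_dir))
```

Because the block number only changes on a header line, the rows after a "random" line stay in the file of the block they belong to. Reading in chunks keeps memory use bounded by chunk_size rather than by the size of the whole file.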