ansaurus

Question

Answer 1

+5 A:

Does the following help?

#!/usr/bin/env perl

use strict;
use warnings;

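my $filename = 1;
my $flag     = 1;    # 1 = the next data line starts a new output file
my $fh;

while (<>) {
    if (/^\d+\s+\d+\s*$/) {
        if ($flag) {
            $flag = 0;
            close $fh if $fh;    # finish the previous block's file
            open $fh, '>', $filename or die "Cannot open $filename: $!";
            $filename++;
        }
        print $fh $_;
    }
    elsif (/random/) {
        next;    # skip text lines inside a data block
    }
    else {
        $flag = 1;    # any other line (e.g. a header) ends the current block
    }
}

close $fh if $fh;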

Usage:

Save the script above as extract (any other name works too).

Assuming the data file is named file, run the command below. Each block of data is written to a numbered file (1, 2, ...) in the current directory.

perl extract /path/to/file

Alan Haggai Alavi 2009-12-19 02:41:13

A programming newbie is probably going to need some instructions on how to use that Perl script...

las3rjock 2009-12-19 02:47:54

maybe... I don't understand the regular expressions... but let me try... thanks

Tamir 2009-12-19 02:52:12

@las3rjock: Updated the answer.

Alan Haggai Alavi 2009-12-19 02:56:41

@Tamir: The regular expression checks whether the line consists of digits, then whitespace, then digits again, optionally followed by trailing whitespace up to the end of the line. If so, we write the line to the current output file.

Alan Haggai Alavi 2009-12-19 02:58:17
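
For anyone who wants to see what that pattern accepts and rejects, here is a minimal sketch in Python rather than Perl (a swapped-in language, not part of the answer's script). The pattern is the one from the script; the helper name is_data_line is made up for this illustration, and the sample lines come from the data posted further down the thread.

```python
import re

# Same pattern as in the Perl script: digits, whitespace, digits,
# optional trailing whitespace, then end of line.
DATA_LINE = re.compile(r"^\d+\s+\d+\s*$")

# Sample lines taken from the data posted in this thread.
samples = [
    "track type= wiggle name09",
    "variableStep chrom=chr1",
    "34 5 ",
    "this is text with the word random in it",
    "82 4",
]


def is_data_line(line: str) -> bool:
    """Return True if the line looks like 'number number' (hypothetical helper)."""
    return DATA_LINE.match(line) is not None


for line in samples:
    # Print each line next to the verdict of the pattern.
    print(f"{line!r:45} -> {is_data_line(line)}")
```

Only the two "number number" rows should be reported as data lines; the header rows and the free-text row are rejected.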

Cool!!! It works! Some minor things... it gets the naming wrong: basically it calls file 1 file no. 2 and so on. Secondly, within the data blocks there are single rows of text that I need to get rid of before I can use the script above... they are all different, but they all contain the word "random".

Tamir 2009-12-19 03:05:02

@Alan: I see. I think I can make sense of the regular expressions now.

Tamir 2009-12-19 03:07:04

How should the naming be? Sorry, I did not understand your second point.

Alan Haggai Alavi 2009-12-19 03:07:21

I don't see why, but it produced an empty file 1 and then went on and put the first chunk of data into 2. There are rows which contain text rather than data. Usually they are OK, but the code above starts a new file whenever it comes across them. All of them can be identified since they have the word "random" within their text. How could we get rid of them before running the data extraction?

Tamir 2009-12-19 03:16:29

Can you please provide some parts of the original data file so that I can understand the format better?

Alan Haggai Alavi 2009-12-19 03:38:49

track type= wiggle name09variableStep chrom=chr134 536 754 8variableStep chrom=chr233 435 278 7this is text with the word random in it82 488 6variableStep chrom=chr378 589 456 7

Tamir 2009-12-19 03:44:55

track type= wiggle name09 \n variableStep chrom=chr1 \n 34 5 36 7 54 8 variableStep chrom=chr2 33 4 35 2 78 7 this is text with the word random in it 82 4 88 6 variableStep chrom=chr3 78 5 89 4 56 7

Tamir 2009-12-19 03:46:10

sorry I don't know how to make it look nicer... it does not show line breaks

Tamir 2009-12-19 03:46:48

Can you please update your question with the above data? You can `code` it there.

Alan Haggai Alavi 2009-12-19 03:54:07

Thank you so much Alan, the script works perfectly!

Tamir 2009-12-19 13:56:19

You are welcome. :-)

Alan Haggai Alavi 2009-12-19 14:34:22

Answer 2

+2 A:

Here's a solution in R.

Load your data:

a <- readLines(textConnection("track type= wiggle name09
variableStep chrom=chr1
34 5 
36 7 
54 8 
variableStep chrom=chr2 
33 4 
35 2 
78 7 
this is text with the word random in it# this we need to remove
82 4 
88 6 
variableStep chrom=chr3 
78 5 
89 4 
56 7"))

Process it by finding the break points and only keeping rows with number space number format:

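idx <- grep("=", a)                                # header lines (they contain "=")
idx <- idx[c(which(diff(idx) > 1), length(idx))]   # keep headers followed by other lines, plus the last header
idx <- cbind(idx + 1, c(idx[-1] - 1, length(a)))   # first and last row of each block
for (i in seq_len(nrow(idx))) {
  x <- a[idx[i, 1]:idx[i, 2]]
  # keep only "number space number" rows; block i is written to file "i" in the working directory
  write.table(x[grep("^\\d+\\s+\\d+\\s*$", x, perl = TRUE)], file = as.character(i), row.names = FALSE, col.names = FALSE, quote = FALSE)
}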

Shane 2009-12-19 03:23:00

thanks shane... trying it right now

Tamir 2009-12-19 03:28:10

Hi Shane, where would it write it to? I usually specify a directory when I use write.table

Tamir 2009-12-19 03:49:42

found it... working directory...

Tamir 2009-12-19 03:50:22

Except for the first chunk of data (1), the code gives funny results, mixing data and leaving the text inside

Tamir 2009-12-19 03:54:57

I updated my answer.

Shane 2009-12-19 04:36:01

Hi Shane, it is not quite working, now it returns only the first lines without the data? It does something though... Thank you for taking the time

Tamir 2009-12-19 13:58:31

Well, it works with your sample data. Not sure what to tell you without more information.

Shane 2009-12-19 16:16:37

Updated it again using the same regex as Alan's solution.

Shane 2009-12-19 17:05:38

Hi Shane... thank you! It nearly works, however it mixes the data a little bit. For example: reads after the random line are not found in the file they should be in, but are mixed with the next one... I don't know how to show a better snippet of the data... The real data contains between 10 and 80 million lines. However, Alan's code has helped me a lot. As well as yours, since I had never used readLines before...

Tamir 2009-12-19 23:38:04

Try changing the part where it says "=" to "variableStep chrom=chr".

Shane 2009-12-20 15:04:34
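
The idea behind that suggestion, treating only the literal "variableStep chrom=" lines as block boundaries, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the R code from the answer: the names split_blocks and HEADER_PREFIX are hypothetical, the sample is the data posted in this thread, and the blocks are only grouped in memory instead of being written to numbered files. For a file with tens of millions of lines, a streaming approach like the Perl script is probably the better fit.

```python
import re

# "number whitespace number" rows, same pattern as in the Perl answer.
DATA_LINE = re.compile(r"^\d+\s+\d+\s*$")

# Assumed block marker, taken from the sample data in this thread.
HEADER_PREFIX = "variableStep chrom="


def split_blocks(lines):
    """Group data rows into one list per header line (hypothetical helper)."""
    blocks = []
    current = None
    for line in lines:
        if line.startswith(HEADER_PREFIX):
            # A header opens a new block.
            current = []
            blocks.append(current)
        elif current is not None and DATA_LINE.match(line):
            current.append(line.strip())
        # Anything else (the track line, free text such as the "random"
        # rows) is dropped and does not start a new block.
    return blocks


sample = """track type= wiggle name09
variableStep chrom=chr1
34 5
36 7
54 8
variableStep chrom=chr2
33 4
35 2
78 7
this is text with the word random in it
82 4
88 6
variableStep chrom=chr3
78 5
89 4
56 7""".splitlines()

for number, rows in enumerate(split_blocks(sample), start=1):
    print(number, rows)
```

With this grouping, the rows after the "random" line (82 4 and 88 6) stay in the chr2 block rather than spilling into the next one.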

BEGINNER: extracting subsets of data