I am using Perl to search for a specific string in a file that lists different sequences under different headings. I can write the script when there is only one sequence, i.e. one heading, but I am not able to extend it to several. Suppose I need to search for some string, e.g. "FSFSD", in a given file. I can't search if the file has the following content:

Polons
CACAGTGCTACGATCGATCGATDDASD
HCAYCHAYCHAYCAYCSDHADASDSADASD
Seliems
FJDSKLFJSLKFJKASFJLAKJDSADAK
DASDNJASDKJASDJDSDJHAJDASDASDASDSAD
Teerag
DFAKJASKDJASKDJADJLLKJ
SADSKADJALKDJSKJDLJKLK

I can search when the file has only one heading, e.g.:

Terrans
FDKFJSKFJKSAFJALKFJLLJ
DKDJKASJDKSADJALKJLJKL
DJKSAFDHAKJFHAFHFJHAJJ

I need to output the result as "String xyz found under Heading abc"

The code I am using is:

print "Input the file name \n";
$protein= <STDIN>;
chomp $protein;
unless (open (protein, $protein))
{
print "cant open file \n\n";
exit;
}
@prot= <protein>;
close protein;
$newprotein=join("",@prot);
$protein=~s/\s//g;
do{
print "enter the motif to be searched \n";
$motif= <STDIN>;
chomp $motif;
if ($protein =~ /motif/)
{
print "found motif \n\n";
}
else{
print "not found \n\n";
}
}
until ($motif=~/^\s*$/);
exit;
+1  A: 

So you are saying you are able to read one line and achieve this task. But when you have more than one line in the file you are not able to do the same thing?

Just have a loop and read the file line by line.

my $data_file = "yourfilename.txt";
open(DAT, '<', $data_file) || die("Could not open file!");
while (my $line = <DAT>)
{
    # The same commands you use for one 'heading' go here; $line holds one line of the file.
}
close(DAT);
Omnipresent
A: 

EDIT: Your posted example has no clear delimiter; you need to find a clear division between your headings and your sequences. You could use multiple line breaks or a non-alphanumeric character such as ','. Whatever you choose, let WHITESPACE in the following code be equal to your chosen delimiter. If you are stuck with the format you have, you will have to change the following grammar to disregard whitespace and delimit through capitalization (which makes it slightly more complex).

A simple way (O(n^2)?) is to split the file on a whitespace delimiter, giving you an array of alternating headings and sequences (heading[i] = split_array[i*2], sequence[i] = split_array[i*2+1]). Then run your regex on each sequence.
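The split approach could be sketched in Perl roughly as follows. This assumes every heading and every sequence is a single whitespace-free token and that they alternate strictly, which the sample file in the question does not satisfy because its sequences span two lines. The sub name `find_by_split` and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the headings whose sequence contains $motif.
# Assumes tokens alternate strictly: heading, sequence, heading, sequence, ...
sub find_by_split {
    my ($text, $motif) = @_;
    my @tokens = split ' ', $text;    # split on any run of whitespace
    my @found;
    for (my $i = 0; $i + 1 < @tokens; $i += 2) {
        my ($heading, $sequence) = @tokens[$i, $i + 1];
        # index() does a literal substring search, so no regex escaping is needed
        push @found, $heading if index($sequence, $motif) >= 0;
    }
    return @found;
}

# Made-up data: one sequence token per heading
my $text = "Polons CACAGTFSFSDGATC Seliems FJDSKLFJSL Teerag DFAKFSFSDJ";
print "String FSFSD found under Heading $_\n" for find_by_split($text, "FSFSD");
```

With this made-up input the motif is reported under Polons and Teerag but not under Seliems.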

Slightly more difficult way ( O(n) ), given a BNF grammar such as:

file: block
    | file block
    ;

block: heading sequence

heading: [A-Z][a-z]+

sequence: [A-Z]+

Try recursive descent parsing (pseudo-code, I don't know Perl):

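GLOBAL sequenceHeading, sequenceCount
GLOBAL substringLength = 5
GLOBAL substring = "FSFSD"

// nextChar() peeks at the next character; popChar() consumes it
FUNC file ()
    WHILE nextChar() != EOF
        block()
        printf ( "%d substrings in %s\n", sequenceCount, sequenceHeading )
    END WHILE
END FUNC

FUNC block ()
    heading()
    sequence()
END FUNC

// read characters up to the delimiter into sequenceHeading
FUNC heading ()
    sequenceHeading = ""
    WHILE nextChar() != EOF
        in = popChar()
        IF in == WHITESPACE
            RETURN
        END IF
        sequenceHeading &= in
    END WHILE
END FUNC

// count occurrences of substring by comparing the last substringLength characters
FUNC sequence ()
    sequenceCount = 0
    window = ""
    WHILE nextChar() != EOF
        in = popChar()
        IF in == WHITESPACE
            RETURN
        END IF
        window &= in
        IF length(window) > substringLength
            window = window without its first character
        END IF
        IF window == substring
            sequenceCount++
        END IF
    END WHILE
END FUNC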

For detailed information on recursive descent parsing, check out Let's Build a Compiler or Wikipedia.
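For the format actually posted in the question (no delimiter other than capitalization), a rough Perl sketch follows. It is not a recursive descent parser: it swaps in a single `\G`-anchored regex that walks the text one heading-plus-sequence block at a time. It assumes a heading is one capital followed by lowercase letters and a sequence is one or more lines of capitals only; the sub names `parse_blocks` and `headings_with_motif` and the sample data are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into [heading, sequence] pairs.
# Assumed grammar: a heading is one capital followed by lowercase letters;
# a sequence is one or more whitespace-separated runs of capitals.
sub parse_blocks {
    my ($text) = @_;
    my @blocks;
    # \G resumes where the previous match ended; \b stops a run of capitals
    # from swallowing the first letter of the next heading.
    while ($text =~ /\G\s*([A-Z][a-z]+)\s+((?:[A-Z]+\b\s*)+)/gc) {
        my ($heading, $sequence) = ($1, $2);
        $sequence =~ s/\s+//g;    # join the lines so a motif may span a line break
        push @blocks, [$heading, $sequence];
    }
    return @blocks;
}

# Hypothetical helper: headings (in file order) whose sequence contains $motif.
sub headings_with_motif {
    my ($text, $motif) = @_;
    return map  { $_->[0] }
           grep { index($_->[1], $motif) >= 0 }
           parse_blocks($text);
}

# Made-up data: three headings, each followed by two sequence lines
my $sample = join "\n", qw(
    Polons  CACAGTGCTACG ATCGFSFSDAAD
    Seliems FJDSKLFJSLKF DASDNJASDKJA
    Teerag  DFAKJASKDFSF SDJASKDJADJL
);
print "String FSFSD found under Heading $_\n"
    for headings_with_motif($sample, "FSFSD");
```

In the made-up sample the motif sits inside one line under Polons and straddles the line break under Teerag, so both headings are reported.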

Kelden Cowan
+4  A: 

Seeing your code, I want to make a few suggestions without answering your question:

  1. Always, always, always use strict;. For the love of whatever higher power you may (or may not) believe in, use strict;.
  2. Every time you use strict;, you should use warnings; along with it.
  3. Also, seriously consider using some indentation.
  4. Also, consider using obviously different names for different variables.
  5. Lastly, your style is really inconsistent. Is this all your code or did you patch it together? Not trying to insult you or anything, but I recommend against copying code you don't understand - at least try before you just copy it.
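As a small illustration of points 1 and 4, the sketch below feeds a snippet with a misspelled variable name to a string `eval` and shows that `strict` refuses to compile it. The snippet and its variable names are made up. Note that `strict` only catches undeclared names; it would not by itself flag using the wrong one of two declared variables.

```perl
use strict;
use warnings;

# A snippet with a misspelled variable ($newprotien instead of $newprotein).
# Single quotes keep the code as literal text until eval compiles it.
my $snippet = 'use strict; my $newprotein = "AB CD"; $newprotien =~ s/\s//g; 1;';

# String eval compiles the snippet; under strict the typo is a compile-time error,
# so eval returns undef and the error text lands in $@.
my $ok    = eval $snippet;
my $error = $@;

my $message = $ok ? "compiled cleanly\n" : "strict rejected it: $error";
print $message;
```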

Now, a much more readable version of your code, including a few fixes and a few guesses at what you may have meant to do, follows:

use strict;
use warnings;

print "Input the file name:\n";
my $filename = <STDIN>;
chomp $filename;
open FILE, "<", $filename or die "Can't open file\n\n";
my $newprotein = join "", <FILE>;
close FILE;
$newprotein =~ s/\s//g;
while(1) {
  print "enter the motif to be searched:\n";
  my $motif = <STDIN>;
  last if $motif =~ /^\s*$/;
  chomp $motif;
  # here I might even use the ternary ?: operator, but whatever
  if ($newprotein =~ /$motif/) {
    print "found motif\n\n";
  }
  else {
    print "not found\n\n";
  }
}
Chris Lutz
You tell him to always use strict; use warnings; - good. But then you don't use them in your improved version? :-)
asjo
@asjo - Damn, that's what I get. Fixed, thank you.
Chris Lutz
:) I was going to comment pwnt on that
Omnipresent
+3  A: 

The main issue is how to distinguish a header from the data. From your examples, I assume that a line is a header if and only if it contains a lowercase letter.

use strict;
use warnings;
print "Enter the motif to be searched \n";
my $motif = <STDIN>;
chomp($motif);
my $header;
while (<>) {
    if(/[a-z]/) {
        chomp($header = $_);    # drop the newline so the message prints on one line
        next;
    }
    if (/$motif/o) {
        print "Found $motif under header $header\n";
        exit;
    }
}
print "$motif not found\n";
Motti
that is spot on!
Omnipresent
A: 

use strict; use warnings; use autodie qw'open';

my($filename,$motif) = @ARGV;

if( @ARGV < 1 ){
  print "Please enter file name:\n";
  $filename = <STDIN>;
  chomp $filename;
}

if( @ARGV < 2 ){
  print "Please enter motif:\n";
  $motif = <STDIN>;
  chomp $motif;
}

my %data;

# fill in %data;
{
  open my $file, '<', $filename;

  my $heading;
  while( my $line = <$file> ){
    chomp $line;
    if( $line ne uc $line ){
      $heading = $line;
      next;
    }
    if( $data{$heading} ){
      $data{$heading} .= $line;
    } else {
      $data{$heading}  = $line;
    }
  }
}

{
  # protect against malicious users
  my $motif_cmp = quotemeta $motif;

  for my $heading ( keys %data ){
    my $data = $data{$heading};

    if( $data =~ /$motif_cmp/ ){
      print "String $motif found under Heading $heading\n";
      exit 0;
    }
  }

  die "String $motif not found anywhere in file $filename\n";
}
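To illustrate what the `quotemeta` call guards against, here is a minimal sketch with a made-up sequence and a made-up motif that contains a regex metacharacter. Without escaping, the dot is treated as a wildcard.

```perl
use strict;
use warnings;

my $sequence = "GGABCGG";    # made-up sequence
my $motif    = "A.C";        # made-up user input containing a regex metacharacter

# Interpolated as-is, '.' acts as a wildcard, so "ABC" matches.
my $raw_match = $sequence =~ /$motif/ ? 1 : 0;

# quotemeta escapes the dot, so only the literal text "A.C" can match.
my $escaped       = quotemeta $motif;
my $literal_match = $sequence =~ /$escaped/ ? 1 : 0;

print "raw: $raw_match, literal: $literal_match\n";    # prints "raw: 1, literal: 0"
```

Writing the pattern as `/\Q$motif\E/` has the same effect as calling `quotemeta` first.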
Brad Gilbert