sample data:

DNA : 
This is a string

BaseQuality :
4 4 4 4 4 4 6 7 7 7 

Metadata : 
Is_read

DNA : 
yet another string

BaseQuality : 
4 4 4 4 7 7 4 8 4 4 4 4 4

Metadata :
Is_read
SCF_File 
.
.
.

I have a method that uses a case statement, as follows, to split a longer text file into records using the delimiter "\n\n", and a class that models a data object:

def parse_file(myfile)
  $/ = "\n\n"
  records = []
  File.open(myfile) do |f|
    f.each_line do |line|
      read = Read.new
      case line
      when /^DNA/
        read.dna_data = line.strip
      when /^BaseQuality/
        read.quality_data = line.strip
      when /^Metadata/
        read.metadata = line.strip
      else
        puts "Unrecognized line: #{line}"
      end
      records.push read
    end
  end
  records
end

class Read
  attr_accessor :dna_data, :quality_data, :metadata
end

records.each do |r|
 puts r.dna_data
end

The dna_data output contains the 'rightful' string as well as two irritating nils:

"This is a string"
nil
nil

My problem is the nils shown above, which end up in dna_data when using read.dna_data = line.

How do you get rid of them, please? How do you avoid them in the first place? What am I missing? Is my approach 'smelly'? Thank you.

A: 

First off, I would avoid using Ruby for bioinformatics; it's not fast enough for a certain set of problems. Sooner or later you will hit issues and your program will crawl to a halt.

From what I gathered, you are trying to remove nils from an array. Here are two ways of doing so:

  1. use the compact method.

    [nil, nil, 'asdfa'].compact # >> ['asdfa']

  2. don't add nil when you are adding elements.

    records.push read unless read.nil?

    records.push read if read # nil gets evaluated to false.
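
As a rough illustration of the first option applied to attribute values rather than to the records themselves, here is a minimal sketch. The Struct is only a hypothetical stand-in for the Read class in the question, and the three records are hand-built to mimic what the question's parser produces (one Read per chunk, with only one field set each).

```ruby
# Hypothetical stand-in for the asker's Read class (assumption, not the original).
Read = Struct.new(:dna_data, :quality_data, :metadata)

# Hand-built records mimicking the question's output: one field set per record.
records = [
  Read.new("DNA : \nThis is a string", nil, nil),
  Read.new(nil, "BaseQuality :\n4 4 4 4 4 4 6 7 7 7", nil),
  Read.new(nil, nil, "Metadata : \nIs_read")
]

dna_values = records.map(&:dna_data)          # one string followed by two nils
dna_only   = records.map(&:dna_data).compact  # nils dropped, string kept
```

Note that this only hides the nils at the point of use; it does not explain why each record has just one field filled in.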

Pran
eastafri
Your interpretation of the problem could be wrong. The read object cannot be nil in this case. The problem occurs at the point where line is assigned to the appropriate property; what I need is to get rid of the nil there. Any syntax ideas?
eastafri
+2  A: 

The problem is that the code creates a new instance of Read for each line. Instead, it should create an instance for each section. It appears that a section starts with the DNA header, so:

def parse_file(myfile)
  $/ = "\n\n"
  records = []
  File.open(myfile) do |f|
    read = nil                              # <- NEW
    f.each_line do |line|
      #read = Read.new                      # <- DELETED
      case line
      when /^DNA/
        read = Read.new                     # <- NEW
        read.dna_data = line.strip
      when /^BaseQuality/
        read.quality_data = line.strip
      when /^Metadata/
        read.metadata = line.strip
        records.push read                   # <= ADDED
      else
        puts "Unrecognized line: #{line}"
      end
      #records.push read                    # <= DELETED
    end
  end
  records
end

Having the parsed record pushed onto the records array after reading metadata works, but only if each record always contains metadata and the metadata is always last. We can make the program more forgiving of changes in the data layout by pushing the read onto records when it is first created:

def parse_file(myfile)
  $/ = "\n\n"
  records = []
  File.open(myfile) do |f|
    f.each_line do |line|
      case line
      when /^DNA/
        records << Read.new
        records.last.dna_data = line.strip
      when /^BaseQuality/
        records.last.quality_data = line.strip
      when /^Metadata/
        records.last.metadata = line.strip
      else
        puts "Unrecognized line: #{line}"
      end
    end
  end
  records
end
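
For trying this out without a data file, here is a self-contained sketch of the same idea. It is a variation, not the code above: parse_io is a hypothetical name, it reads from any IO-like object (a StringIO here), it passes the separator to each_line instead of assigning the global $/, and it quietly skips a BaseQuality or Metadata chunk that arrives before any DNA chunk. The sample text is adapted from the question.

```ruby
require "stringio"

class Read
  attr_accessor :dna_data, :quality_data, :metadata
end

# Hypothetical variant of parse_file: takes an IO-like object and gives the
# record separator to each_line, so the global $/ is left alone.
def parse_io(io)
  records = []
  io.each_line("\n\n") do |chunk|
    case chunk
    when /^DNA/
      # A DNA chunk starts a new record.
      records << Read.new
      records.last.dna_data = chunk.strip
    when /^BaseQuality/
      # Assumption: ignore quality data that appears before any DNA chunk.
      records.last.quality_data = chunk.strip if records.last
    when /^Metadata/
      records.last.metadata = chunk.strip if records.last
    else
      warn "Unrecognized chunk: #{chunk}"
    end
  end
  records
end

sample = "DNA : \nThis is a string\n\n" \
         "BaseQuality :\n4 4 4 4 4 4 6 7 7 7 \n\n" \
         "Metadata : \nIs_read\n\n" \
         "DNA : \nyet another string\n\n" \
         "BaseQuality : \n4 4 4 4 7 7 4 8 4 4 4 4 4\n\n" \
         "Metadata :\nIs_read\nSCF_File \n"

records = parse_io(StringIO.new(sample))
records.each { |r| puts r.dna_data }
```

With this layout every Read gets all three fields, so printing dna_data no longer yields nils. The same method should work on a real file via File.open(myfile) { |f| parse_io(f) }.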
Wayne Conrad
I think there's a bug in this code: It's pushing partial records onto the records array. It should only push a record when starting a new section, or when the file is completely read. If you can trust that metadata is always present and always comes last, do the push after read.metadata =...
Wayne Conrad
Yes, there is a bug. Metadata is always present and comes last, so it makes sense to do the push after read.metadata = line.strip. Please let me know if you find a workaround that is more generic. Thank you very much!
eastafri
@eastafri, bug fixed, and new version introduced which is less finicky about the layout of the data. I apologize for the poor quality in my initial answer.
Wayne Conrad
thanks @Wayne. I am really grateful for your time and assistance!
eastafri
A: 

You may wish to see if BioRuby is appropriate to your needs. I use it to handle quality sequences as well as nucleotide sequences.

Andrew Grimm
Thanks, I do use it too. I'm using it to post-process the records... :)
eastafri