There are two large text files (millions of lines each) that my program uses. These files are parsed and loaded into hashes so that the data can be accessed quickly. The problem I face is that, currently, the parsing and loading is the slowest part of the program. Below is the code where this is done.

database = extractDatabase(@type).chomp("fasta") + "yml"
revDatabase = extractDatabase(@type + "-r").chomp("fasta.reverse") + "yml"
@proteins = Hash.new
@decoyProteins = Hash.new

# Each line has the form "KEY: value"; split once and keep the pair.
# File.foreach closes the handle when done, unlike a block-less File.open.
File.foreach(database) do |line|
  parts = line.split(": ")
  @proteins[parts[0]] = parts[1]
end

File.foreach(revDatabase) do |line|
  parts = line.split(": ")
  @decoyProteins[parts[0]] = parts[1]
end

The files look like the example below. They started off as YAML, but the format was modified to increase parsing speed.

MTMDK: P31946   Q14624  Q14624-2    B5BU24  B7ZKJ8  B7Z545  Q4VY19  B2RMS9  B7Z544  Q4VY20
MTMDKSELVQK: P31946 B5BU24  Q4VY19  Q4VY20
....

I've messed around with different ways of setting up the files and parsing them, and so far this is the fastest way, but it's still awfully slow.

Is there a way to improve the speed of this, or is there a whole other approach I can take?

List of things that don't work:

  • YAML.
  • Standard Ruby threads.
  • Forking off processes and then retrieving the hash through a pipe.
A: 

I don't know too much about Ruby, but I have had to deal with this problem before. I found the best way was to split the file up into chunks or separate files, then spawn threads to read the chunks in at the same time. Once the partitioned files are in memory, combining the results should be fast. Here is some information on threads in Ruby:

http://rubylearning.com/satishtalim/ruby_threads.html
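
As a rough sketch of the idea in Ruby (the database.part* chunk names are hypothetical placeholders, assuming the big file has already been split): one thread per chunk, each building its own hash, merged at the end.

# Hypothetical pre-split chunk files, e.g. database.part0, database.part1, ...
chunk_files = Dir.glob("database.part*")

# One thread per chunk; each builds a private hash so no locking is needed.
threads = chunk_files.map do |path|
  Thread.new do
    local = {}
    File.foreach(path) do |line|
      key, value = line.chomp.split(": ", 2)
      local[key] = value
    end
    local
  end
end

# Thread#value waits for each thread and returns its result; merge them all.
@proteins = threads.map(&:value).reduce({}) { |acc, h| acc.merge!(h) }

(Note that on MRI the interpreter lock means the parsing itself won't run in parallel, which may be why threading hasn't helped here.)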

Hope that helps.

Michael Bazos
Would splitting it up really help? Because, as I kind of mentioned, when I used a thread for each file it only went slower.
Jesse J
+2  A: 

Why not use the solution devised through decades of experience: a database, say SQLite3?
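
A minimal sketch of that route with the sqlite3 gem (the proteins.db file name and the table/column names are hypothetical): import the file once inside a transaction, then query on demand instead of holding everything in a hash.

require "sqlite3"

# One-time import; file, table, and column names here are hypothetical.
db = SQLite3::Database.new("proteins.db")
db.execute("CREATE TABLE IF NOT EXISTS proteins (peptide TEXT PRIMARY KEY, ids TEXT)")

# Wrapping the bulk insert in a single transaction makes it far faster.
db.transaction do
  stmt = db.prepare("INSERT OR REPLACE INTO proteins VALUES (?, ?)")
  File.foreach(database) do |line|
    key, value = line.chomp.split(": ", 2)
    stmt.execute(key, value)
  end
  stmt.close
end

# Afterwards, a lookup replaces @proteins[key]:
ids = db.get_first_value("SELECT ids FROM proteins WHERE peptide = ?", "MTMDK")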

Marc-André Lafortune
+1, although this might not fare better after the "once-loaded" phase for simple key/values. Another option is a Berkeley DB (BDB)-style back-end, if it's just a simple key/value store that doesn't need SQL relationships and joins.
pst
+1  A: 

(To be different, although I'd first recommend looking at (Ruby) BDB and other "NoSQL" back-end engines, if they fit your needs.)

If fixed-size records with a deterministic index are used, then you can perform a lazy load of each item through a proxy object; this would be a suitable candidate for mmap. However, this will not speed up the total access time; it merely amortizes the loading throughout the life-cycle of the program (at least until first use, and if some data is never used, you get the benefit of never loading it). Without fixed-size records or deterministic index values, this problem is more complex and starts to look more like a traditional "index" store (e.g. a B-tree in an SQL back-end, or whatever BDB uses :-).
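
A rough sketch of that lazy-load idea, assuming hypothetical fixed-width records so that record i starts at byte i * RECORD_SIZE:

# Sketch only: assumes records are padded to a fixed width, so record i
# lives at byte offset i * RECORD_SIZE. The width of 128 is a placeholder.
class LazyRecordStore
  RECORD_SIZE = 128

  def initialize(path)
    @file = File.open(path, "rb")
    @cache = {}
  end

  # Seek to and read a record only on first access, then memoize it.
  def [](i)
    @cache[i] ||= begin
      @file.seek(i * RECORD_SIZE)
      @file.read(RECORD_SIZE).rstrip
    end
  end
end

Nothing is read up front, so startup is instant; each record costs one seek-and-read the first time it's touched.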

The general problems with threading here are:

  1. The IO will likely be your bottleneck, given Ruby's "green" threads
  2. You still need all the data before use

You may be interested in the Wide Finder Project, which is, in general, about "trying to get faster IO processing".

pst
The time it took to create a database was unbearable.
Jesse J
+1  A: 

In my usage, reading all or part of the file into memory before parsing usually goes faster. If the database files are small enough, this could be as simple as

buffer = File.readlines(database)
buffer.each do |line|
  ...
end

If they're too big to fit into memory, it gets more complicated: you have to set up block reads of the data followed by parsing, or use threading with separate read and parse threads. A sketch of the block-read approach is below.
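
A rough sketch of the block-read variant, assuming the same "KEY: value" line format (the 16 MB chunk size is an arbitrary placeholder):

CHUNK = 16 * 1024 * 1024  # arbitrary chunk size; tune to taste

File.open(database, "r") do |f|
  leftover = ""
  while (chunk = f.read(CHUNK))
    lines = (leftover + chunk).split("\n", -1)
    leftover = lines.pop  # the last piece may be a partial line
    lines.each do |line|
      next if line.empty?
      key, value = line.split(": ", 2)
      @proteins[key] = value
    end
  end
  # handle the final line if the file doesn't end with a newline
  unless leftover.empty?
    key, value = leftover.split(": ", 2)
    @proteins[key] = value
  end
end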

Digikata
This did shave roughly 30 seconds off, but it still takes over 2 minutes.
Jesse J
I found that after doing this, making other alterations to the method decreased the time further. With this and the other improvements, it's now down to an acceptable time.
Jesse J