ansaurus

Question

"average length of the sequences in a fasta file": Can you improve this Erlang code ?

Answer 1

+3 A:

I too am learning Erlang, thanks for the fun question.

I understand working with Erlang strings as lists of characters can be very slow; if you can work with binaries instead you should see some performance gains. I don't know how you would use arbitrary-length strings with binaries, but if you can sort it out, it should help.

Also, if you don't mind working with a file directly rather than standard_io, perhaps you could speed things along by using file:open(..., [raw, read_ahead]). raw means the file must be on the local node's filesystem, and read_ahead specifies that Erlang should perform file IO with a buffer. (Think of using C's stdio facilities with and without buffering.)

I'd expect the read_ahead to make the most difference, but everything with Erlang includes the phrase "benchmark before guessing".

EDIT

Using file:open("uniprot_sprot.fasta", [read, read_ahead]) gets 1m31s on the full uniprot_sprot.fasta dataset. (Average 359.04679841439776.)

Using file:open(.., [read, read_ahead]) and file:read_line(S), I get 0m34s.

Using file:open(.., [read, read_ahead, raw]) and file:read_line(S), I get 0m9s. Yes, nine seconds.

Here's where I stand now; if you can figure out how to use binaries instead of lists, it might see still more improvement:

-module(golf).
-export([test/0]).

line([],{Sequences,Total}) ->  {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.

scanLines(S,Sequences,Total)->
        case file:read_line(S) of
            eof -> {Sequences,Total};
            {error,_} ->{Sequences,Total};
            {ok, Line} -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
        end  .

test()->
    F = file:open("/home/sarnold/tmp/uniprot_sprot.fasta", [read, read_ahead, raw]),
    case F of
    { ok, File } -> 
        {Sequences,Total}=scanLines(File,0,0),
        io:format("~p\n",[Total/(1.0*Sequences)]);
    { error, Reason } ->
            io:format("~s", Reason)
    end,
    halt().

sarnold 2010-07-21 07:04:48

Thanks Arnold, I'm currently testing you solution. erl (version= R13B01) raised an error :" {"init terminating in do_boot",{undef,[{file,read_line,[{file_descriptor,prim_file,{#Port<0.286>,7}}]},{golf,scanLines,3},{golf,test,0},{init,start_it,1},{init,start_em,1}]}}" . Any idea ?

Pierre 2010-07-21 17:38:26

Ok, I've told that my version doesn't support R13B01, I will test this tomorrow on another computer.

Pierre 2010-07-21 17:58:27

Answer 2

+2 A:

It looks like your big performance problems have been solved by opening the file in raw mode, but here's some more thoughts if you need to optimise that code further.

Learn and use fprof.

You're using string:strip/1 primarily to remove the trailing newline. As erlang values are immutable you have to make a complete copy of the list (with all the associated memory allocation) just to remove the last character. If you know the file is well formed, just subtract one from your count, otherwise I'd try writing a length function the counts the number of relevant characters and ignores irrelevant ones.

I'm wary of advice that says binaries are better than lists, but given how little processing you it's probably the case here. The first steps are to open the file in binary mode and using erlang:size/1 to find the length.

It won't affect performance (significantly), but the multiplication by 1.0 in Total/(1.0*Sequences) is only necessary in languages with broken division. Erlang division works correctly.

cthulahoops 2010-07-21 10:47:24

Answer 3

+1 A:

The call string:len(string:strip(L)) traverses the list at least twice (I'm unaware of the string:strip implementation). Instead you could write a simple function to count the line length w/0 the spaces:

stripped_len(L) ->
  stripped_len(L, 0).

stripped_len([$ |L], Len) ->
  stripped_len(L, Len);

stripped_len([_C|L], Len) ->
  stripped_len(L, Len + 1);

stripped_len([], Len) ->
  Len.

The same method can be applied to binaries as well.

Zed 2010-07-21 13:58:47

Answer 4

+5 A:

If you need really fast IO then you have to do little bit more trickery than usual.

-module(g).
-export([s/0]).
s()->
  P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
  r(P, 0, 0),
  halt().
r(P, C, L) ->
  receive
    {P, {data, {eol, <<$>:8, _/binary>>}}} ->
      r(P, C+1, L);
    {P, {data, {eol, Line}}} ->
      r(P, C, L + size(Line));
    {'EXIT', P, normal} ->
      io:format("~p~n",[L/C])
  end.

It is fastest IO as I know but note -noshell -noinput. Compile just like erlc +native +"{hipe, [o3]}" g.erl but with -smp disable

erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl

and run:

time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s < uniprot_sprot.fasta
352.6697028442464

real    0m3.241s
user    0m3.060s
sys     0m0.124s

With -smp enable but native it takes:

$ erlc +native +"{hipe, [o3]}" g.erl
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m5.103s
user    0m4.944s
sys     0m0.112s

Byte code but with -smp disable (almost in par with native because most of work is done in port!):

$ erlc g.erl
$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m3.565s
user    0m3.436s
sys     0m0.104s

Just for completeness byte code with smp:

$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta 
352.6697028442464

real    0m5.433s
user    0m5.236s
sys     0m0.128s

For comparison sarnold version gives me wrong answer and takes more on same HW:

$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files golf.erl
./golf.erl:5: Warning: variable 'Rest' is unused
$ time erl -smp disable -noshell -mode minimal -s golf test
359.04679841439776

real    0m17.569s
user    0m16.749s
sys     0m0.664s

EDIT: I have looked at characteristics of uniprot_sprot.fasta and I'm little bit surprised. It is 3824397 rows and 232MB. It means that -smp disabled version can handle 1.18 million text lines per second (71MB/s in line oriented IO).

Hynek -Pichi- Vychodil 2010-07-21 22:08:07

very interesting, thanks.

Pierre 2010-07-22 00:48:21

Excellent! Thanks for the example; can you point out why our versions are getting different answers? Thanks!

sarnold 2010-07-22 09:48:46

@sarnold: I have not have enough time to look to your version where is problem. I guess it is trailing "\n" which may be is not stripped out by `string:strip/1` but I'm not sure. I checked by this code `perl -nle'/^>/?$c++:($b+=length((/(\S*)/)[0]))}{print$b/$c'` to be sure that mine version has not same bug as all others on http://biostar.stackexchange.com/questions/1759 but all seems ok and 352.6697028442464 should be right answer.

Hynek -Pichi- Vychodil 2010-07-22 11:37:48

@sarnold: Very simple check tells me that trailing "\n" is it. When you subtract one from each line length you get right answer so it means `file:read_line/1` returns line with new line char and `string:strip/1` doesn't remove it.

Hynek -Pichi- Vychodil 2010-07-22 11:44:22

@hynek, thanks! I couldn't figure out where our versions were making different calculations...

sarnold 2010-07-22 23:57:44

ansaurus

tags:

views:

answers:

"average length of the sequences in a fasta file": Can you improve this Erlang code ?

related questions