tags:

views:

347

answers:

4

So following on from this question:

http://stackoverflow.com/questions/1459152/erlang-listsindexof-function

I have the following code which works just fine:

-module(test_index_of).
-compile(export_all).

index_of(Q)->
    N=length(Q),
    Qs=lists:zip(lists:sort(Q), lists:seq(1, N)),
    IndexFn=fun(X)->
             {_, {_, I}}=lists:keysearch(X, 1, Qs),
         I
        end,     
    [IndexFn(X) || X <- Q].

test()->
    Q=[random:uniform() || _X <- lists:seq(1, 20)],
    {T1, _}=timer:tc(test_index_of, index_of, [Q]),
    io:format("~p~n", [T1]).

Problem is, I need to run the index_of function a very large number of times [10,000] on lists of length 20-30 characters; the index_of function is the performance bottleneck in my code. So although it looks to be implemented reasonably efficiently to me, I'm not convinced it's the fastest solution.

Can anyone out there improve [performance-wise] on the current implementation of index_of ? [Zed mentioned gb_trees]

Thanks!

+3  A: 

You are optimizing an operation on the wrong data type.

If you are going to make 10 000 lookups on the same list of 20-30 items, then it really pays off to do pre-computation to speed up those lookups. For example, lets make a tuple sorted on the key in a tuples of {key, index}.

1> Ls = [x,y,z,f,o,o].
[x,y,z,f,o,o]
2> Ls2 = lists:zip(Ls, lists:seq(1, length(Ls))).
[{x,1},{y,2},{z,3},{f,4},{o,5},{o,6}]
3> Ts = list_to_tuple(lists:keysort(1, Ls2)).         
{{f,4},{o,5},{o,6},{x,1},{y,2},{z,3}}

A recursive binary search for a key on this tuple will very quickly home in on the right index.

  • Use proplists:normalize to remove duplicates, that is, if it is wrong to return 6 when looking up 'o' instead of 5. Or use folding and sets to implement your own filter that removes duplicates.
  • Try building a dict with dict:from_list/1 and make lookups on that dict instead.

But this still begs the question: Why do you want the index into a list of something? Lookups with lists:nth/2 has O(n) complexity.

Christian
For example, if you want to get the first occurrence of an event generated by a random variable, and do this for multiple occurrences, this seems a reasonable approach.
Zed
>> If you are going to make 10 000 lookups on the same list of 20-30 items I should probably have been clearer; I want to generate 10000 random lists of 20 items, then perform index_of on each of the 10000 lists >> Why do you want the index into a list of something? Lookups with lists:nth/2 has O(n) complexity.Yep, I understand it's inefficient. It's part of the problem domain, a monte carlo simulation. I suspect Erlang isn't the best solution for this type of problem but I want to push it as far as I can
Justin
+1  A: 

Not sure if I understand this completely, but if the above is your actual usecase, then...

First of all, you could generate Q as the following, and you already save the zipping part.

Q=[{N,random:uniform()} || N <- lists:seq(1, 20)]

Taking this further on, you could generate a tree indexed by the values from the beginning:

Tree = lists:foldl(
              fun(T, N) -> gb_trees:enter(uniform:random(), N, T) end,
              gb_trees:empty(),
              lists:seq(1, 20)
       ).

Then looking up your index becomes:

index_of(Item, Tree) ->
  case gb_trees:lookup(Item, Tree) of
    {value, Index} -> Index;
    _ -> not_found
  end.
Zed
Thanks, the gb_tree example is very useful
Justin
A: 

Just one question: WTF are you trying do?

I just can't found what is practical purpose of this function. I think you do something odd. It seems that you just improved from O(N*M^2) to O(N*M*logM) but it is still very bad.

EDIT:

When I synthesize what is goal, It seems that you are trying use Monte Carlo method to determine probabilities of team's 'finishing positions' in English Premiere League. But I'm still not sure. You can determine most probable position [1,1,2] -> 1 or as fractional number as some sort of average 1.33 - for example this last one can be achieve with less effort than others.

In functional programing languages data structures are more important that in procedural or OO ones. They are more about work-flow. You will do this and than this and than ... In functional language as Erlang you should think in manner, I have this input and I want that output. Required output I can determine from this and this and so. There may be not necessary have list of things as you used to be in procedural approaches.

In procedural approaches you are used to use arrays for storage with constant random access. List is not that such thing. There are not arrays in Erlang where you can write (even array module which is balanced tree in reality). You can use tuple or binary for read only array but no one read write. I can write a lot about that there doesn't exist data structure with constant access time at all (from RAM, through arrays in procedural languages to HASH maps) but there is not enough space to explain it in detail here (from RAM technology, through L{1,2,3} CPU caches to necessity increase HASH length when number of keys increase and key HASH computation dependency of key length).

List is data structure which have O(N) random access time. It is best structure for store data which you want take one by one in same order as stored in list. For small N it can be capable structure for random access for small N when corresponding constant is small. For example when N is number of teams (20) in your problem it can be faster than O(logN) access to some sort of tree. But you must take care how big your constant is.

One of common component of algorithms are Key-Value lookups. There can be used arrays as supporting data structure in procedural world in some circumstances. Key must be integer number, space of possible key must not be to sparse and so. List doesn't serve as its substitution well for this purpose except for very small N here. I learn that best way how write functional code is avoid Key-Value lookups where is unnecessary. It often needs rearrange work-flow or refactoring data structures and so. Sometimes it looks like flip over problem solution like glove.

If I ignore that your probability model is wrong. From information you provide it seems that in your model team's season points are independent random events which is not true of course. There is impossible that all teams have some high amount of point, 82 for example just because there is some limit of points taken by all teams in one season. So forgot for this for now. Then I will simulate one 'path' - season and take result in form [{78,'Liverpool'}, {81,'Man Utd'}, ...], then I can sort it using lists:sort without loosing information which team is where. Results I would collect using iteration by path. For each path I would iterate over sorted simulation result and collect it in dictionary where team is key (constant and very cheap hash computation from atom and constant storage because key set is fixed, there is possibility to use tuples/records but seems like premature optimization). Value can be tuple of size 20 and position in tuple is final position and value is count of it.

Something like:

% Process simulation results.
% Results = [[{Points, Team}]]
process(Results) ->
  lists:foldl(fun process_path/2,
    dict:from_list([{Team, {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}} ||
       Team <- ['Liverpool', 'Man Utd', ...]]),
    Results).

% process simulation path result
process_path(R, D) ->
  process_path(lists:reverse(lists:sort(R)), D, 1).

process_path([], _, D) -> D;
process_path([{_, Team}|R], D, Pos) ->
  process_path(R, update_team(Team, Pos, D), Pos + 1).

% update team position count in dictionary
update_team(Team, Pos, D) ->
  dict:update(Team, fun(T) -> add_count(T, Pos) end, D).

% Add final position Pos to tuple T of counts
add_count(T, P) -> setelement(P, T, element(P, T) + 1).

Notice that there is nothing like lists:index_of or lists:nth function. Resulting complexity will look like O(N*M) or O(N*M*logM) for small number M of Teams, but real complexity is O(N*M^2) for O(M) setelement/3 in add_count/2. For bigger M you should change add_count/2 to some more reasonable.

Hynek -Pichi- Vychodil
If you can't say anything constructive, don't say anything at all
Justin
Justin, I have read this question and the previous one. Hynek has asked you what you need this function for, and you haven't told him. The idiom you are expressing (the location of an object in a list has 'meaning') is very unnatural for functional programming - it gives off super-bad code smells. Neither Hynek (nor myself) can saying anything other than 'your code smells' until you say why you need this function. Asking 'what for' is a constructive statement - and if you answer it I would be happy to contribute. Until you do, I can only sit on my hands.
Gordon Guthrie
I'm trying to simulate the expected finishing positions of each team in the [football] English Premiership, based on distributions for each team's season points. So assuming 5 simulation paths my points distribution for Liverpool might be [78, 80, 81, 76, 82] and Man Utd might be [81, 78, 82, 82, 76]. I need to convert these points distributions to 'finishing positions'; so Liverpool would be [2, 1, 2, 2, 1] and Man Utd [1, 2, 1, 1, 2]. in practice I have 20 teams and 1000 simulation paths. The index\_of function is used to rank the simulated points and calculate the finishing position.
Justin
@Hynek I appreciate you taking the time to consider the problem, and the thoughtful response. You are correct about this being a Monte Carlo problem; you're also correct when you say the probability model is wrong; it's an (acceptable) hack. I think however that your solution is calculating the expected position for each team, ie a single number. I need to calculate the probability of each team finishing in each position, ie a vector; so Liverpool might be [{1, 0.1}, {2, 0.4}, {3, 0.3}, {4, 0.1} ... {20, 0.0}]
Justin
You can obtain exactly this result using: [{Team, lists:zip(lists:seq(1,20), [X/N||X<-tuple_to_list(Prob)])}||{Team, Prob}<-dict:to_list(process([simulation()||X<-lists:seq(1,N)]))] where N is number of simulations and simulation is function which returns [{Team, Points}]
Hynek -Pichi- Vychodil
A: 

I think you need custom sort function which record permutations it makes to input list. For example you can use lists:sort source. This should give you O(N*log N) performance.

Alexey Timanovsky