views:

291

answers:

3

Hello all

The following procedure (explanation follows) works fine for really small lists, but when the list contains a larger number of items (1/2 million) the application enters "not responding" state,and it takes about 2.5 minutes to finish (very bad time). I might add the application needs to process lists of 100 million items at least (eventually).

here is the code for the problematic procedure:

    public void removeItems(List<long> L, SortedList<long, List<long>> _subLists)
    {
        foreach (KeyValuePair<long, List<long>> kvp in _subLists)
        {
            foreach (long duplicate in kvp.Value)
            {
                int j = L.IndexOf(duplicate);
                L.RemoveRange(j,(int)kvp.Key); 

            }
        }
    }

L is a list of long values. _subLists is a sorted list where each value is a list of values from L,starting an arithmetic progression series of some difference (not relevant). the key associated with that value is the length of the series the values contain.

Example:

L = {1,2,3,5,6,7,18,20,21} _subLists = {2,<20>} {3,<1,5>}

The procedure simply removes the arithmetic progression series from L.

+10  A: 

The run time of this procedure in big O notation would be n^2, which is fairly slow and you can expect a slow run time if one of the lists has 100 million entries. There is no stack overflow problem here, it's simply slow to iterate through this much data. I don't really see a question here, are you looking to make this faster? If so, the nested for loop is definitely the problem.

AlbertoPL
yes, the objective is to make it faster, much faster.when i say a list will contain millions of entries' i am of course refering to L,_subLists is just a list of (surprise) sub lists of L.how can i iterate thru all the items in a sublists value without the inner loop? the way i see it, its a must, but that's why i came here... any suggestions?
The only way I can see to do this would be not to have a list as a value in your KeyValuePair. Do you have any means of putting the sublists in the main list so you're only ever iterating over one set of data?
AlbertoPL
"in the main List" you mean instead of having a sorted list, having just a list?if so, its a problem, because as i explained, the value list holds index values of starting arithmetic progression series in L.i dont realy see a simpler way of doing it...
+7  A: 

Your problem is that you are removing a lot of items from L which is a very costly operation. Every time an item is removed, memory is copied to move all the items above the deleted items down. The more items that are removed and the more items to shuffle down, the longer it takes. Memory is a bottleneck to performance, RAM runs slower than the CPU, and if you're paging to disk than it's really slow.

How can you improve this.

The easiest option is to use a container for L that has better performance when removing items - a LinkedList for example. LinkedLists do not need to move items around in memory when elements are removed but they do require more memory to store the data (two pointers per value). If this is too much overhead, then perhaps a LinkedList <List <long>> instead where each List holds a maximum number of values.

Alternatively, change the removal algorithm so that you iterate over the list L and create a new list containing the values not found in the _subLists. You can change the way _subLists stores data to make finding items in ranges quicker.

Skizz

Skizz
The alternatively part is very interesting and certainlly sounds like its worth a try.about the linked list part. i failed to mention that i'm using c#, and was under the impression that a List<T> container IS a linked list,no?
System.Collections.Generic.LinkedList<> is a linked lis. don't know List<> implemention off the top of my head but its probablly an array buffered with extra space.
Zack
@ndgani: No. List<T> is an array, more like std::vector<T> in C++. LinkedList<T> is a linked list.
Reed Copsey
No, a List<> is a dynamic array (resizes as necessary). It has O(1) access. C# provides a LinkedList<> type which has O(n) access time but better insertion and removal.
Skizz
i'm currently trying the linked list approach.its going to take a while (other procedures use List<>) but i'll let u know what's the outcome.
hi skizz, and sorry for the very long pause.i've performed the conversion of the List<> to a LinkedList and changed the process a bit.now i dont have a seperate "removeItems" method, instead, i have this:IEnumerable<long> enumer = node.indicesList.Except<long>(tempL.AsEnumerable<long>());node.indicesList = new LinkedList<long>(enumer);tempL is the list of all the items to be removed from L for a certain d(current mininum arithmetic progression difference in main List).this saves me a pass over the main List,can now work on input about 10 times larger (good,not good enough).
A: 

If possible:

A) Convert L into a sorted linked list. O: n * log(n)

B) Convert sublists into a sorted list pairs where the first item is the # in the sequence in L (duplicate in the posted code snippet) and the second item is the length of the sequence. O: n * log (n)

C) Perform a single pass through L using sublists to determine how many elements to remove at a given spot in L. Take advantage of the fact that both lists are sorted to not backtrack in either list. O: n

Should be able to get O: n * log(n) complexity out of this if it is possible to use. Of course, I'm not 100% sure about all of the details of the problem. For instance - can L have duplicates? If so, does the order of the sublists matter? You may be forced to ditch or modify such an algorithm depending on the answers to those ?s. Also, this will obviously use more memory.

Krazzy
thank u for your reply.the lists i am using are not declared as sorted ones, but due to my specific problem's conditions, by the time we reach the method we are discussing, they are unique and sorted, and still i get lousy performance.