views: 809

answers: 6
I've got two collections, each containing about 40,000 items.

The elements in list2 are linked to the elements of list1 via a foreign key.

For each element of list1, I want to find the corresponding element in list2.

Something like this:

foreach(var item in list1)
{
  var match = list2.Where(child => child.ID == item.ChildID).FirstOrDefault();
  item.Child = match;
}

This works but it's slow as hell.

Now, both list1 and list2 come out of the database sorted by these keys: list1 is ordered by ChildID and list2 is ordered by ID (the same value).

I think a binary search would dramatically speed this up, but I read somewhere that LINQ chooses the most appropriate strategy for the list in the Where clause. Maybe I need to explicitly cast to a sorted list? Or maybe I need to implement a custom binary search with a comparer (roughly like the sketch at the end of this question)?

Any insights are appreciated.

Thanks.
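
For reference, a minimal sketch of that binary-search idea, assuming list2 is sorted ascending by ID and calling the element type Child (the question doesn't name it):

// Sketch only: list2 must be sorted ascending by ID; "Child" stands in
// for the actual element type of list2.
static Child FindByID(List<Child> sorted, int id)
{
    int lo = 0, hi = sorted.Count - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (sorted[mid].ID == id) return sorted[mid];
        if (sorted[mid].ID < id) lo = mid + 1;
        else hi = mid - 1;
    }
    return null; // no matching ID
}

// usage: O(log n) per lookup instead of a full scan of list2
foreach (var item in list1)
    item.Child = FindByID(list2, item.ChildID);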

+6  A: 

Why not use a join?

var query = 
   from a in list1
   join b in list2 on a.ChildID equals b.ID
   select new {Item1 = a, Item2 = b};

foreach(var item in query)
{
   item.Item1.Child = item.Item2;
}
ph0enix
awesome...elegant and fast.
Scott
ph0enix, would you care to explain why this is faster? Out of curiosity. TIA.
Serge - appTranslator
Well, I don't consider myself an expert on the topic, but IEnumerable<>.Where is a linear scan, so the original loop is an O(n^2) algorithm, whereas Join builds a hash-based lookup on the keys first and then matches in roughly linear time.
ph0enix
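
Roughly speaking, LINQ to Objects' Join hashes the inner sequence's keys before matching. A simplified in-memory equivalent (a sketch, not the actual implementation) using ToLookup:

// Build a hash-based lookup of list2 keyed by ID (one pass over list2),
// then each item does a cheap hash probe instead of scanning list2.
var childrenById = list2.ToLookup(child => child.ID);

foreach (var item in list1)
{
    item.Child = childrenById[item.ChildID].FirstOrDefault();
}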
A: 

Not sure if this would actually speed it up any, but you can move the predicate into the FirstOrDefault() call and get rid of Where entirely.

item.Child = list2.FirstOrDefault(child => child.ID == item.ChildID);

It might not actually help since this might implicitly just call Where().

Here's a question, though: is the method actually slow, or is it only slow the first time it runs after a compilation? Check out the discussion on this post.
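
One quick way to separate first-run (JIT) cost from steady-state cost is to time the loop twice; MatchChildren below is a hypothetical method wrapping the matching loop from the question:

// Sketch: if only the first pass is slow, the cost is mostly JIT/warm-up.
var sw = System.Diagnostics.Stopwatch.StartNew();
MatchChildren(list1, list2);   // hypothetical wrapper around the foreach above
Console.WriteLine("First run:  " + sw.ElapsedMilliseconds + "ms");

sw.Reset();
sw.Start();
MatchChildren(list1, list2);
Console.WriteLine("Second run: " + sw.ElapsedMilliseconds + "ms");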

George Mauer
+1  A: 

I had this problem before. LINQ-based searching is extremely slow compared to DB-based searching because it doesn't use any index.

Have you considered using a Dictionary instead of List?

You can build a Dictionary and then, instead of using Where, use ContainsKey and, if the key exists, get the value via the indexer.

Sample code:

Dictionary<int, Child> list2 = ...;

...

foreach(var item in list1)
{
  if (list2.ContainsKey(item.ChildID))
    item.Child = list2[item.ChildID];
}

Access via the index is significantly faster than searching a list, at the cost of the extra memory the index requires.
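
As a sketch of how that dictionary might be built from the original list (assuming the IDs are unique), ToDictionary plus TryGetValue also avoids hashing each key twice; "Child" again stands in for the element type:

// Build the dictionary from the original list (IDs assumed unique),
// then look each key up only once per item.
Dictionary<int, Child> childById = list2.ToDictionary(child => child.ID);

foreach (var item in list1)
{
    Child match;
    if (childById.TryGetValue(item.ChildID, out match))
        item.Child = match;
}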

Adrian Godong
Thanks Adrian, the Dictionary solves the performance problem, but I prefer the elegance of the join above... either way, it's a huge improvement over what I had.
Scott
+1  A: 

What about this:

        var joined = list1.Join(list2, x => x.ChildID, x => x.ID, (x, y) => new { x, y });

        foreach (var j in joined)
        {
            j.x.Child = j.y;
        }

?

Jay Bazuzi
+1  A: 

As both lists are sorted on the same value, you can just loop through them in parallel:

int index1 = 0, index2 = 0;
while (index1 < list1.Count && index2 < list2.Count) {
   // skip parents whose key is smaller than the current child key
   while (index1 < list1.Count && list1[index1].ChildId < list2[index2].Id) index1++;
   if (index1 < list1.Count) {
      // skip children whose key is smaller than the current parent key
      while (index2 < list2.Count && list2[index2].Id < list1[index1].ChildId) index2++;
      if (index2 < list2.Count && list1[index1].ChildId == list2[index2].Id) {
         list1[index1].Child = list2[index2];
         index1++;
         index2++;
      }
   }
}

or:

int index1 = 0, index2 = 0;
while (index1 < list1.Count && index2 < list2.Count) {
   if (list1[index1].ChildId == list2[index2].Id) {
      // keys match; link the child and advance both lists
      list1[index1].Child = list2[index2];
      index1++;
      index2++;
   } else {
      // advance whichever list has the smaller key
      if (list1[index1].ChildId < list2[index2].Id) {
         index1++;
      } else {
         index2++;
      }
   }
}

Another efficient alternative, but which doesn't take advantage of the order of the lists, is to create an index by putting one of the lists in a dictionary:

Dictionary<int, TypeOfChild> index = new Dictionary<int, TypeOfChild>();
foreach (TypeOfChild child in list2) {
   index.Add(child.Id, child);
}
foreach (TypeOfParent parent in list1) {
   TypeOfChild child;
   if (index.TryGetValue(parent.ChildId, out child)) {
      parent.Child = child;
   }
}
Guffa
A: 

Hi,

I couldn't resist answering this :-)

The main reason your code is slow is that your items get read many, many times. The art of speed is: read memory only if you need to, and if you need to read it, read it as little as possible.

Here is an example:

Code:

using System.Collections.Generic;
using System.Collections.ObjectModel;

public class Item
{    
   private int _id;
   private List<ItemDetails> _detailItems = new List<ItemDetails>();

   public Item(int id)
   {
       _id = id;
   }

   public void AddItemDetail(ItemDetails itemDetail)
   {
       _detailItems.Add(itemDetail);
   }

   public int Id
   {
       get { return _id; }
   }
   public ReadOnlyCollection<ItemDetails> DetailItems
   {
       get { return _detailItems.AsReadOnly(); }
   }
}


public class ItemDetails
{
    private int _parentId;

    public ItemDetails(int parentId)
    {
        _parentId = parentId;
    }

    public int ParentId
    {
        get { return _parentId; }
    }
}

Example code:

The main goal is to scan both lists and compare the item and the item detail at the current indexes. When the detail's ParentId is equal to its parent item's Id, add it to that item and move on to the next detail. If it is different, move on to the next parent.

// for performance tests..
DateTime startDateTime;

// create 2 lists  (master/child list)
List<Item> itemList = new List<Item>();
List<ItemDetails> itemDetailList = new List<ItemDetails>();

Debug.WriteLine("# Adding items");
startDateTime = DateTime.Now;

// add items (sorted)
for (int i = 0; i < 400000; i++)
    itemList.Add(new Item(i));

// show how long it took
Debug.WriteLine("Total milliseconds: " + (DateTime.Now - startDateTime).TotalMilliseconds.ToString("0") + "ms" );

// adding some random details (also sorted)
Debug.WriteLine("# Adding itemdetails");
Random rnd = new Random(DateTime.Now.Millisecond);

startDateTime = DateTime.Now;

int index = 0;
for (int i = 0; i < 800000; i++)
{
    // when the random number is bigger than 2, index will be increased by 1
    index += rnd.Next(5) > 2 ? 1 : 0;
    itemDetailList.Add(new ItemDetails(index));
}
Debug.WriteLine("Total milliseconds: " + (DateTime.Now - startDateTime).TotalMilliseconds.ToString("0") + "ms");

// show how many items the lists contains
Debug.WriteLine("ItemList Count: " + itemList.Count());
Debug.WriteLine("ItemDetailList Count: " + itemDetailList.Count());

// matching items
Debug.WriteLine("# Matching items");
startDateTime = DateTime.Now;

int itemIndex = 0;
int itemDetailIndex = 0;

int itemMaxIndex = itemList.Count;
int itemDetailMaxIndex = itemDetailList.Count;

// while we didn't reach any end of the lists, continue...
while ((itemIndex < itemMaxIndex) && (itemDetailIndex < itemDetailMaxIndex))
{
    // if the detail.parentid matches the item.id. add it to the list.
    if (itemList[itemIndex].Id == itemDetailList[itemDetailIndex].ParentId)
    {
        itemList[itemIndex].AddItemDetail(itemDetailList[itemDetailIndex]);
        // increase the detail index.
        itemDetailIndex++;
    }
    else
        // the detail.parentid didn't matches the item.id so check the next 1
        itemIndex++;
}

Debug.WriteLine("Total milliseconds: " + (DateTime.Now - startDateTime).TotalMilliseconds.ToString("0") + "ms");

Results:

I used 10 times as many items to get a more meaningful result:

Adding items:
Total milliseconds: 140ms
Adding itemdetails:
Total milliseconds: 203ms
ItemList Count: 400000
ItemDetailList Count: 800000
Matching items:
Total milliseconds: 265ms

This was typed quickly and could be cleaner, so I hope you can read it. Play with it.

Greetz, Jeroen.

Jeroen