views:

473

answers:

3

I want to extract large amounts of data from Excel, manipulate it and put it back. I have found the best way to do this is to extract the data from an Excel Range in to a large array, change the contents on the array and write it back to the Excel Range.

I am now rewriting the application using VB.NET 2008/2010 and wish to take advantage of any new features.

Currently I have to loop through the contents of the array to find elements with certain values; also sorting large arrays is cumbersome. I am looking to use the new features, including LINQ to manipulate the data in my array.

Does anybody have any advice on the easiest ways to filter / query, sort etc. data in a large array. Also what are the reasonable limits to the size of the array?

~Many Thanks

A: 

I checked this out of curiosity, and I don't think LINQ is the way to go. I assume that you are reading named ranges into "rectangular" arrays of objects. If that's the case, the bad news is, a rectangular array is non-enumerable and non-queryable, which means you can't execute a LINQ query against it. See the following discussion for details:
querying 2-dimension array with LINQ
I stand corrected. Thx to jdk and AMissico for taking the time to explain!

Mathias
The info you're referencing might be old or wrong because it's possible to use LINQ on rectangular arrays in VB.NET. See my answer.
John K
IEnumerable<string> query = arrValues2D.Cast<string>().Where<string>(v => int.Parse(v) > 50
AMissico
The issue is jdk has Option Infer On and Option Strict Off which allows his syntax by VB.NET compiler. If you turn infer off and strict on, his syntax would suffer from the same compiler errors.
AMissico
+3  A: 

Although you have a multi-pronged question, I've chosen to answer the part about manipulating arrays using LINQ...

Edit - LINQ against 2D array, VB.NET

System.Array (from where all arrays are derived) supports an extension method for .AsQueryable() which means you can process a 2D array with LINQ directly like so:

Option Infer On
Option Strict Off

Public Class DataProcessor

    'something like this came from Excel, maybe a Range
    Dim arrValues2D(,) As String = { _
        {"1", "2", "3"}, _
        {"100", "99", "98"}, _
        {"2", "3", "4"} _
    }

    Private Sub FindLargeNumbers()

        'LINQ against 2D array elements
        Dim query = From v In arrValues2D Where (Int32.Parse(v) > 50 And Int32.Parse(v) < 101) Select CStr(v)

        For Each thing As String In query
            'do stuff
            Console.Write(thing + ",")
        Next

    End Sub
End Class

Original thoughts

(pretty much deprecated since the above solution)

You could Implement IEnumerable(Of T) and IEnumerator(Of T) in Visual Basic to "flatten" your 2D array (effectively return its values one at a time as if it were a 1D array) and perform operations with the values using LINQ.

Depending on the data type of array elements, in order to write back to them, you might need to wrap each value in something so you can reference the original element instead of a copy of its value.

This is just a general plan to allow you to use LINQ and "new features" against a large efficient 2D array that you already have from Excel.

John K
jdk, I am puzzled, because this code, the C# equivalent of yours, doesn't even compile on my machine.int[,] array = new int[,]{{1,1},{2,3}};var query = from c in array where c==1 select c;Specifically, the where statement fails (and if I remove it, the select fails). So either VB behaves differently, or I am missing something! I tried it with the array of strings converted to int to, same problem.
Mathias
John K
I think I've heard VB.NET and C# vary independently (two different teams) although they try to port significant features to both.
John K
The issue is jdk has Option Infer On and Option Strict Off which allows his syntax by VB.NET compiler. If you turn infer off and strict on, his syntax would suffer from the same compiler errors.
AMissico
+2  A: 

If you want any performance, you have to stick with extracting the data into arrays, manipulating the arrays, then writing back the data into the range. The performance benefit is significant. I have tested and it is hundreds of times faster.

I do all my array manipulation with custom functions. I do not believe LINQ can compare to my custom functions in speed. What LINQ does bring to the table is flexibility and readabilty and maintainability. If you design custom array manipulation funtions with LINQ principles in mind, then you should be able to take advantage of LINQ without giving up performance. For instance, I had a Median array function. I converted it to a Median extension function that accepted a sequence of doubles then used LINQ over the array to call the Median function.

<Extension()> Public Function Median(ByVal source As IEnumerable(Of Double)) As Double

As in;

Dim m = Aggregate v In arrValues2D Into Median(CDbl(v))
AMissico
thanks for the reply, have you had issues manipulating large arrays, e.g 100,000 rows X 200 columns
Tim
No issues at all. It worked brilliantly and as advertised. Think your data times five with two pivot tables to do "complex" array manipulations. I would programmatically manipulate a pivot table then extract arrays for the next step in the calculation. If you can do it in a pivot table, do it. It is faster than arrays.
AMissico
Oh, I did break large arrays into chunks so I could maintain a progress form with a Cancel button. I also, split the calculation into two parts; a parameter phase (three minutes) and the calculation phase (five minutes). This allowed the user to review preliminary data with their specified parameters before display a report, instead of forcing them to wait.
AMissico