views:

317

answers:

6

Hi, I want to know what the fastest way is of reading and writing data to and from an open Excel workbook to c# objects. The background is that I want to develop a c# application that is used from Excel and uses data held in excel.

The business logic will reside in the c# application but the data will reside in an Excel workbook. The user will be using Excel and will click a button (or do something similar) on the excel workbook to initiate the c# application. The c# application will then read data off the excel workbook, process the data, and then write data back to the excel workbook.
There may be numerous blocks of data that are required to be read off and written back to the excel workbook but they will normally be of a relatively small size, say 10 rows and 20 columns. Occasionally a large list of data may need to be processed, of the order of 50,000 rows and 40 columns.

I know that this is relatively easy to do say using VSTO but I want to know what the fastest (but still robust and elegant) solution is and get an idea of the speed. I don't mind if the solution recommends using third party products or uses C++.

The obvious solution is using VSTO or interop but I don't know what the performance is like versus VBA which I'm currently using to read in the data, or if there are any other solutions.

This was posted on experts exchange saying that VSTO was dramatically slower than VBA but that was a couple of years ago and I don't know if the performance has improved.

http://www.experts-exchange.com/Microsoft/Development/VSTO/Q_23635459.html

Thanks.

A: 

I've used VBA code (macro) to gather & compact the data, and get this data in one call to C#, and vice versa. This will probably be the most performant approach.

Using C#, you will always need to use some marshalling. Using VSTO or COM Interop, the underlaying communication layer (marshalling overhead) is the same.

In VBA (Visual Basic For Application) you work directly on the objects in Excel. So the access to this data will always be faster.

But.... Once you have the data in C#, the manipulation of this data can be a lot faster.

If you are using VB6 or C++, you also go through a COM interface, and you will also be facing cross process marshalling.

So you are looking for a method to minimize cross process calls and marshalling.

GvS
Thanks, that makes sense. The problem is that the c# application will be required to load live data while it is executing from the open Excel worksheet and I don't know what data it will need at the time the process is initiated. Therefore VSTO or COM seem the only options, but from your post it looks like execution will be a lot slower than VBA.
jw_pr
The communication will be slower, but once in C# handling the data will probably be a lot faster.
GvS
+1  A: 

The fastest interface to Excel data is the C API. There are a number of products out there that link .NET to Excel using this interface.

2 products I like that do this are Excel DNA (which is free and open source) and Addin Express (which is a commercial product and has both the C API and COM interface available).

Charles Williams
jw_pr
Addin Express can definitely read or write to an unsaved Excel workbook at will, without any question. The bigger issue is how fast it executes, but I *believe* that it is both a shimmed solution and runs shockingly fast considering that it's shimmed. I could be wrong about this, so you need to do your own research and time testing. See my full answer on this, but I think the top contenders for you would either be ExcelDna or Addin Express.
Mike Rosenblum
Wow, thanks for the comprehensive answer Mike! I really appreciate it. I will investigate both contenders as well as look into the other information you gave me.
jw_pr
Ok, sounds good, post back when you're done... I'd be curious what you learn. I bet others would be too.
Mike Rosenblum
+1  A: 

If the C# application is a stand-alone application, then you will always have cross-process marshaling involved that will overwhelm any optimizations you can do by switching languages from, say, C# to C++. Stick to your most preferred language in this situation, which sounds like is C#.

If you are willing to make an add-in that runs within Excel, however, then your operations will avoid cross-process calls and run about 50x faster.

If you run within Excel as an add-in, then VBA is among the fastest options, but it does still involve COM and so C++ calls using an XLL add-in would be fastest. But VBA is still quite fast in terms of calls to the Excel object model. As for actual calculation speed, however, VBA runs as pcode, not as fully compiled code, and so executes about 2-3x slower than native code. This sounds very bad, but it isn't because the vast majority of the execution time taken with a typical Excel add-in or application involves calls to the Excel object model, so VBA vs. a fully compiled COM add-in, say using natively compiled VB 6.0, would only be about 5-15% slower, which is not noticeable.

VB 6.0 is a compiled COM approach, and runs 2-3x faster than VBA for non-Excel related calls, but VB 6.0 is about 12 years old at this point and won't run in 64 bit mode, say if installing Office 2010, which can be installed to run 32 bit or 64 bit. Usage of 64 bit Excel is tiny at the moment, but will grow in usage, and so I would avoid VB 6.0 for this reason.

C#, if running in-process as an Excel add-in would execute calls to the Excel object model as fast as VBA, and execute non-Excel calls 2-3x faster than VBA -- if running unshimmed. The approach recommended by Microsoft, however, is to run fully shimmed, for example, by making use of the COM Shim Wizard. By being shimmed, Excel is protected from your code (if it's faulty) and your code is fully protected from other 3rd party add-ins that could otherwise potentially cause problems. The down-side to this, however, is that a shimmed solution runs within a separate AppDomain, which requires cross-AppDomain marshaling that incurrs an execution speed penalty of about 40x -- which is very noticeable in many contexts.

Add-ins using Visual Studio Tools for Office (VSTO) are automatically loaded within a shim and executes within a separate AppDomain. There is no avoiding this if using VSTO. Therefore, calls to the Excel object model would also incur an approximately 40x execution speed degradation. VSTO is a gorgeous system for making very rich Excel add-ins, but execution speed is its weakness for applications such as yours.

ExcelDna is a free, open source project that allows you to use C# code, which is then converted for you to an XLL add-in that uses C++ code. That is, ExcelDna parses your C# code and creates the required C++ code for you. I've not used it myself, but I am familiar with the process and it's very impressive. ExcelDna gets very good reviews from those that use it. [Edit: Note the following correction as per Govert's comments below: "Hi Mike - I want add a small correction to clarify the Excel-Dna implementation: all the managed-to-Excel glue works at runtime from your managed assembly using reflection - there is no extra pre-compilation step or C++ code generation. Also, even though Excel-Dna uses .NET, there need not be any COM interop involved when talking to Excel - as an .xll the native interface can be used directly from .NET (though you can also use COM if you want). This makes high-performance UDFs and macros possible." – Govert]

You also might want to look at Add-in Express. It's not free, but it would allow you to code in C# and although it shims your solution into a separate AppDomain, I believe that it's execution speed is outstanding. If I am understanding its execution speed correctly, then I'm not sure how Add-in Express doing this, but it might be taking advantage of something called FastPath AppDomain marshaling. Don't quote me on any of this, however, as I'm not very familiar with Add-in Express. You should check it out though and do your own research. [Edit: Reading Charles Williams' answer, it looks like Add-in Express enables both COM and C API access. And Govert states that Excel DNA also enables both COM and the fastrer C API access. So you'd probably want to check out both and compare them to ExcelDna.]

My advice would be to research Add-in Express and ExcelDna. Both approaches would allow you to code using C#, which you seem most familiar with.

The other main issue is how you make your calls. For example, Excel is very fast when handling an entire range of data passed back-and-forth as an array. This is vastly more efficient than looping through the cells individually. For example, the following code makes use of the Excel.Range.set_Value accessor method to assign a 10 x 10 array of values to a 10 x 10 range of cells in one shot:

void AssignArrayToRange()
{
    // Create the array.
    object[,] myArray = new object[10, 10];

    // Initialize the array.
    for (int i = 0; i < myArray.GetLength(0); i++)
    {
        for (int j = 0; j < myArray.GetLength(1); j++)
        {
            myArray[i, j] = i + j;
        }
    }

    // Create a Range of the correct size:
    int rows = myArray.GetLength(0);
    int columns = myArray.GetLength(1);
    Excel.Range range = myWorksheet.get_Range("A1", Type.Missing);
    range = range.get_Resize(rows, columns);

    // Assign the Array to the Range in one shot:
    range.set_Value(Type.Missing, myArray);
}

One can similarly make use of the Excel.Range.get_Value accessor method to read an array of values from a range in one step. Doing this and then looping through the values within the array is vastly faster than looping trough the values within the cells of the range individually.

Mike Rosenblum
Regarding speed: this is also my experience. VSTO can be very fast /if/ you use it the right way, i.e. writing to entire ranges at once.
Francesco De Vittori
Hi Francesco, yes, VSTO operates from within a separate AppDomain from Excel for Excel's protection and its own. But cross-AppDomain calls are about 40x slower than a standard call, which is very noticeable. For example, a routine that would previously take 0.1 seconds would now take 4.0 full seconds to execute. So I totally agree, in this scenario one must be even more careful than ever to move all your data in one shot, to minimize the calls to the Excel object model.
Mike Rosenblum
Hi Mike - I want add a small correction to clarify the Excel-Dna implementation: all the managed-to-Excel glue works at runtime from your managed assembly using reflection - there is no extra pre-compilation step or C++ code generation. Also, even though Excel-Dna uses .NET, there need not be any COM interop involved when talking to Excel - as an .xll the native interface can be used directly from .NET (though you can also use COM if you want). This makes high-performance UDFs and macros possible.
Govert
Hi, thanks Govert for that excellent explanation -- I assumed the implementation and got it wrong. Thanks for clearing that up. I was aware of the COM vs. C API capabilities, but neglected that in my writeup above, which I'll amend now. Thanks Govert!
Mike Rosenblum
+1  A: 

Further to Mike Rosenblum's comments on the use of arrays, I'd like to add that I've been using the very approach (VSTO + arrays) and when I measured it, the actual read speed itself was within milliseconds. Just remember to disable event handling and screen updating prior to the read/write, and remember to re-enable after the operation is complete.

Using C#, you can create 1-based arrays exactly the same as Excel VBA itself does. This is pretty useful, especially because even in VSTO, when you extract the array from an Excel.Range object, the array is 1-based, so keeping the Excel-oriented arrays 1-based helps you avoid needing to always check for whether the array is one-based or zero-based. (If the column position in the array has significance to you, having to deal with 0-based and 1-based arrays can be a real pain).

Generally reading the Excel.Range into an array would look something like this:

var myArray = (object[,])range.Value2;


My variation of Mike Rosenblum's array-write uses a 1-based array like this:

int[] lowerBounds = new int[]{ 1, 1 };
int[] lengths = new int[] { rowCount, columnCount };  
var myArray = 
    (object[,])Array.CreateInstance(typeof(object), lengths, lowerBounds);

var dataRange = GetRangeFromMySources();

// this example is a bit too atomic; you probably want to disable 
// screen updates and events a bit higher up in the call stack...
dataRange.Application.ScreenUpdating = false;
dataRange.Application.EnableEvents = false;

dataRange = dataRange.get_Resize(rowCount, columnCount);
dataRange.set_Value(Excel.XlRangeValueDataType.xlRangeValueDefault, myArray);

dataRange.Application.ScreenUpdating = true;
dataRange.Application.EnableEvents = true;
code4life
Thanks very much code4life, that is very useful! When you said the difference in read speed was minimal, within milliseconds, do you mean compared to VBA (because Mike notes that calls across app domains are around 40X slower so it sounded like VSTO would be much slower than VBA when interacting with excel)? By the way I'm still trying to get a response from Add-In Express, will let you all know when I get a reply. Thanks.
jw_pr
@jw_pr, I think Mike is quite correct in what he says. I think you need to reduce the VSTO calls to the bare minimum. That being said, the actual array-content manipulation is much faster on the VSTO side, I think (but that's anecdotal).
code4life
I will guess that it's in the range of tens of miliseconds when crossing the AppDomain using VSTO. But if you have a 10x10 grid of cells, 10 ms to process each cell comes to approximately 1 full second for the entire grid of 100 cells. This is way too slow, and you can watch this progress by eye. You should test this on a variety of calls as well: setting values, changing formatting, etc.
Mike Rosenblum
@Mike, I agree, the more calls to Excel the slower it gets. I try to keep my (object[,])Range.Value2 and Range.set_Value() calls to the minimum possible. Also iterating range contents without using the array is brutal.
code4life
+1  A: 

Hello Justin,

First off, your solution cannot be an Excel UDF (user-defined function). In our manuals, we give the following definition: "Excel UDFs are used to build custom functions in Excel for the end user to use them in formulas." I wouldn't mind if you suggest a better definition :)

That definition shows that a UDF cannot add a button to the UI (I know that XLLs can modify the CommandBar UI) or intercept keyboard shortcuts as well as Excel events.

That is, ExcelDNA is out of scope because it is purposed for developing XLL add-ins. The same applies to Excel-targeted functionality of Add-in Express since it allows developing XLL add-ins and Excel Automation add-ins.

Because you need to handle Excel events, your solution can be a standalone application but there are obvious limitations of such approach. The only real way is to create a COM add-in; it allows handling Excel events and adding custom things to the Excel UI. You have three possibilities:

  • VSTO
  • Add-in Express (COM add-in functionality)
  • Shared Add-in (see the corresponding item in the New Project dialog in VS)

If talking about developing an Excel COM add-in, the 3 tools above provide different features: visual designers, shimming, etc. But I don't think they differ in the speed of accessing the Excel Object Model. Say, I don't know (and cannot imagine) why getting a COM object from the Default AppDomain should differ from getting the same COM object from another AppDomain. BTW, you can check if shimming influences the speed of operation by creating a shared add-in and then using the COM Shim Wizard to shim it.

Speed II. As I wrote to you yesterday: "The best way to speed up reading and writing to a range of cells is to create a variable of the Excel.Range type referring to that range and then read/write an array from/to the Value property of the variable." But contrary to what Francesco says, I don't attribute this to VSTO; this is a feature of the Excel object model.

Speed III. The fastest Excel UDFs are written in native C++, not in any .NET language. I haven't compared the speed of an XLL add-in produced by ExcelDNA and Add-in Express; I don't think you'll find any substantial difference here.

To sum up. I am convinced you are on a wrong way: COM add-ins based on Add-in Express, VSTO or Shared Add-in should read and write Excel cells at the same speed. I will be glad (sincerely) if someone disproves this statement.

Now on your other questions. VSTO doesn't allow developing a COM add-in supporting Office 2000-2010. It requires three different codebases and at least two versions of Visual Studio to completely support Office 2003-2010; you need to have strong nerves and a portion of good luck to deploy a VSTO-based add-in for Excel 2003. With Add-in Express, you create a COM add-in for all Office versions with a single codebase; Add-in Express provides you with a setup project, which is ready to install your add-in in Excel 2000-2010 (32-bit and 64-bit); ClickOnce deployment is on board too.

VSTO beats Add-in Express in one area: it allows creating so-called document-level add-ins. Imagine a workbook or template with some .NET code behind it; I wouldn't be surprised, however, if deployment of such things is a nightmare.

On Excel events. All Excel events are listed in MSDN, for instance, see Excel 2007 events

Regards from Belarus (GMT+2),

Andrei Smolin Add-in Express Team Leader

Andrei Smolin - Add-in Express
Thanks very much Andrei for posting your answer here. This clarifies the issue for me. You are right that I don't want to use a UDF and now I understand the differences between VSTO and Add-In Express. I also understand that the choice is not going to be determined by speed. It is clear that the solution will need to read ranges instead of cells and the algorithm will need to optimize (minimize) interaction with Excel. Thanks again to you and everyone who has contributed to this answer! Justin
jw_pr
Hi Andrei, excellent post, but a few clarifications to follow...
Mike Rosenblum
"That is, ExcelDNA is out of scope because it is purposed for developing XLL add-ins." <-- As Govert shows in his post above, Excel-Dna most definitely can handle menu commands, not just UDFs.
Mike Rosenblum
"Say, I don't know (and cannot imagine) why getting a COM object from the Default AppDomain should differ from getting the same COM object from another AppDomain." <-- Cross AppDomain marshalling is about 80% as expensive as a cross-process calls. I've tested in the past and timed Excel object model calls at about 40x slower going cross-AppDomain and about 50x slower going cross process.
Mike Rosenblum
Note that Excel object model calls are realtively rich in functionality and so are realtively expensive. This means that the cost of cross-AppDomain marshaling is relatively *low* when compared to the full cost of the call -- so even at a 40x reduction in speed, this is not that bad relative to making lightweight calls across the AppDomain. Witness this article that quotes a 1000x slowdown: Fast Marshaling for Cross-AppDomain Proxies (http://www.codeproject.com/KB/dotnet/FastAppDomainMarshal.aspx?msg=3025184).
Mike Rosenblum
"COM add-ins based on Add-in Express, VSTO or Shared Add-in should read and write Excel cells at the same speed. I will be glad (sincerely) if someone disproves this statement." <-- VSTO is shimmed into a separate AppDomain and there is no way around this. A managed COM add-in does not have to be shimmed. Since you are Team Leader for Add-in Express, maybe you can explain how Add-in Express works? I guess it runs within the default AppDomain when running with the C API, and is shimmed into a separate AppDomain when running with the COM API?
Mike Rosenblum
Mike, sorry, I've just found your questions. Add-in Express loader handles IDTExtensibility2; it also creates a proxy class to pass OnConnection, OnDisconnection etc calls to the add-in module. Yes, due to shimming Add-in Express add-ins process those events (there are 5 of them) slower. Nevertheless when the loader passes the control to the add-in module and the module handles all events of the application object model, no cross-domain marshalling occurs. That's because you connect to those events in the AppDomain of the add-in itself, not in the DefaultAppdomain.
Andrei Smolin - Add-in Express
+3  A: 

I'll take this as a challenge, and will bet the fastest way to shuffle your data between Excel and C# is to use Excel-Dna - http://exceldna.codeplex.com. (Disclaimer: I develop Excel-Dna. But it's still true...)

Because it uses the native .xll interface it skips all the COM integration overhead that you'd have with VSTO or another COM-based add-in approach. With Excel-Dna you could make a macro that is hooked up to a menu or ribbon button which reads a range, processes it, and writes it back to a range in Excel. All using the native Excel interface from C# - not a COM object in sight.

I've made a small test function that takes the current selection into an array, squares every number in the array, and writes the result into Sheet 2 starting from cell A1. You just need to add the (free) Excel-Dna runtime which you can download from http://exceldna.codeplex.com.

I read into C#, process and write back to Excel a million-cell range in under a second. Is this fast enough for you?

My function looks like this:

using ExcelDna.Integration;
public static class RangeTools {

[ExcelCommand(MenuName="Range Tools", MenuText="Square Selection")]
public static void SquareRange()
{
    object[,] result;

    // Get a reference to the current selection
    ExcelReference selection = (ExcelReference)XlCall.Excel(XlCall.xlfSelection);
    // Get the value of the selection
    object selectionContent = selection.GetValue();
    if (selectionContent is object[,])
    {
        object[,] values = (object[,])selectionContent;
        int rows = values.GetLength(0);
        int cols = values.GetLength(1);
        result = new object[rows,cols];

        // Process the values
        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                if (values[i,j] is double)
                {
                    double val = (double)values[i,j];
                    result[i,j] = val * val;
                }
                else
                {
                    result[i,j] = values[i,j];
                }
            }
        }
    }
    else if (selectionContent is double)
    {
        double value = (double)selectionContent;
        result = new object[,] {{value * value}}; 
    }
    else
    {
        result = new object[,] {{"Selection was not a range or a number, but " + selectionContent.ToString()}};
    }

    // Now create the target reference that will refer to Sheet 2, getting a reference that contains the SheetId first
    ExcelReference sheet2 = (ExcelReference)XlCall.Excel(XlCall.xlSheetId, "Sheet2"); // Throws exception if no Sheet2 exists
    // ... then creating the reference with the right size as new ExcelReference(RowFirst, RowLast, ColFirst, ColLast, SheetId)
    int resultRows = result.GetLength(0);
    int resultCols = result.GetLength(1);
    ExcelReference target = new ExcelReference(0, resultRows-1, 0, resultCols-1, sheet2.SheetId);
    // Finally setting the result into the target range.
    target.SetValue(result);
}
}
Govert
Thanks very much Govert! That is very interesting and must be the fastest approach! Out of interest, are there any limitations to this approach versus the other approaches that we have been discussing that you can think of? Thanks again for your very enlightening answer.
jw_pr
It's brilliant.
code4life
I guess using the native API takes some getting used to if you only have VBA or VSTO experience, although Excel-Dna makes it much easier than it would be in C/C++. As with any .NET-based plan, you need to consider obfuscation if you want to secure your source. Deployment is a pleasure though - you can pack everything (including your other C# libraries) into a single .xll file for users to open.
Govert
Thanks very much.
jw_pr