views:

425

answers:

8

I'm tasked with writing a solution for fixing a poorly performing legacy excel file generator.

The files I need to generate could get very large. Maybe up to a million rows with 40-50 columns. I guess I will stream directly to the user if possible, but I might just have to save the file to disk first and then create a link for the user.

I aim to make a performance test, testing if I can generate an xslx file with 1.500.000 rows and 50 columns, each cell containing a random 10 letter string... Will excel even handle files this big?

Note: In reality most files generated will never be larger than 300.000 rows and the absolute maximum is about 950.000 rows, but I like to play it safe when stress testing thus the 1.5M rows.

Do you have any suggestions as to how I should go about solving this task. Are there any components I should be aware of? Limitations in excel?

PS: I would appreciate if I didn't have to install Excel on the server.

+2  A: 

While I can't answer the maximum amount of data that Excel can handle, if you use the new .xlsx format, you are using MS's OpenXML format. The .xlsx file is actually a zip compressed file with all of the document data stored internally. The XML can be written just as any other XML, but you'll have to review the standards. There are a few commercial component implementations out there for doing this. You don't need Excel to write the format.

Here are some helpful links:

  • Office OpenXML - Wikipedia
  • Office Open XML C# Library - This looks like an open source library for reading/writing OpenXML
  • Read and write Open XML files - CodeProject - Another implementation of a R/W library
  • GemBox.Spreadsheet - A commercial .NET component for reading/writing office spreadsheets. Has a free version with limitations on number of rows you can read and write if you want to try it out.
  • NPOI Library - an implementation of Java's POI library for reading and writing office docs.
  • Simple OOXML - "A set of helper classes to make the creation of Open Office XML documents easier. Uses the Open Office SDK v 2.0. Modify or create any .docx or .xlsx document without Microsoft Word or Microsoft Excel."
snicker
+4  A: 

There is a limit to the amount of rows the you can have in a spreadsheet (1M for Office 2007). I would generate a CSV file instead, which is really just a formatted text file that can be opened in Excel.

jrummell
I did some preliminary tests using the libraries suggested in the other answers but performed really-really badly. I think I will follow your suggestion and go with a CSV solution do you have any recommendations on a CSV library? I know CSV is pretty simple but why do the work if somebody already did it, right?
JohannesH
I haven't used a library for it. I've just iterated over a DataSet/DataTable/Collection and put a comma between each column and a newline between each row.
jrummell
Just make sure that you escape/replace any commas in your field values. That tripped me up for a few minutes until I realized what was happening.
jrummell
What about linebreaks and (") chars?
JohannesH
Line breaks within a field will also need to be escaped/replaced. Quotes should be OK.
jrummell
+1  A: 

You might want to check out the NPOI library for reading and writing excel files at http://npoi.codeplex.com/. As for storing on the server that is an option but remember you will have to clean up the files after they have been download.

Michael Ciba
+1  A: 

Take a look at the Simple OOXML project on Codeplex.

It might be what you are looking for.

PS. Excel is mainly spreadsheet software, no a database substitute. Are you sure you want to dump a million rows to the end user?

Magnus Johansson
No I'm positive I don't think this is a good idea. ;) Unfortunately, legacy systems and the workflows around those systems are extremely hard to change. This is already how it works the problem I'm solving is timeouts and/or out-of-memory exceptions on the server with the current solution.
JohannesH
Maybe you can look into storing this data in a SQL Server Express/Compact on the server instead, expose a data connection/Web Service and let the users connect from within Excel through data connection instead, and filter the data retrieval based on the users input. That would surely be more efficient.
Magnus Johansson
+1  A: 

Excel can not handle millions of rows, try creating a CSV output file in stead, this can be read by Excel.

And it is not recommended to add a huge amount of data in an excel on user request He will need to wait to long before downloading the file.

Kennethvr
Excel 2007 removes the row limitations of previous versions, and CSV has the same limitations in Excel 97-2003 as the XLS format when being read by Excel. The OP obviously has a very special use case that would make the wait worthwhile for the user.
richardtallent
+1  A: 

Excel 2007 supports a maximum worksheet size of 1,048,576 rows by 16,384 columns so your test with 1.5 million rows may not be feasible. Source

Edit: Excel 2003 supports even fewer rows: 65,536 rows by 256 columns. Source

If you are able to require that your users have the ability to open documents in the Excel 2007 (xlsx) format then that may be your best bet as it is just a XML document and can be generated without any requirement for Excel on the server.

If you need to support "all" version of excel/other office suite programs you should probably use CSV or another character delimited format.

The Open Document Format may also be of interest but excel users would need the ODF add-in to use the documents.

Edit 2: If you are looking at using CSV you may want to look at the FileHelpers library.

Venr
+1  A: 

Assuming you can avoid exceeding the new row limitations in Excel 2007 (by splitting to other worksheets or files), Excel 2007's xlsx format should work just fine.

Since XLSX is a zip format, rather than creating in memory or writing to a disk, you should consider writing directly to an in-memory zip stream. The compression will keep memory usage low, and not writing to the file system will help with performance.

Another potential solution, depending on your circumstances: create a blank Access template, copy and write to it, and send it instead of an Excel file. Of course, this would be a shift for your application, but Access won't have the same row limit.

richardtallent
I thought about switching to access as well but the client won't accept formats which don't open directly in excel. For this reason as well as backwards compatibility with existing users I've chosen to go with a CSV solution. This also have the benefit of being able to stream the document directly to the user instead of saving to mem or disk.
JohannesH
+1  A: 

Make sure that your tests are representative of the actual data. Excel handles simple numeric cells much more efficiently than simple text cells - especially when all of the text cells are unique. So, if your data is really going to be made up of 10 character unique strings, by all means use that as your test case. If it's actually going to be mostly numbers, make sure your tests reflect that fact.

For example. I built a simple test using SpreadsheetGear for .NET to generate a 300,000 row by 50 column Open XML (.xlsx) workbook. To create and save to disk with unique numbers took 13.62 seconds on my almost two year old overclocked QX6850 CPU, while creating and saving a 300,000 row by 50 column .xlsx workbook with 10 character unique strings took 78 seconds - 6 times longer for text than for numbers. I will paste the code below and you can run it with the free SpreadsheetGear trial which you can download here.

It is important to note that Open XML (.xlsx) is compressed, so if your data has much redundancy, you are likely to get smaller files with .xlsx than with .csv. This could have a big impact on performance if you are generating workbooks on a web server for consumption over a network.

SpreadsheetGear with the IWorkbook.SaveToStream method, and most other 3rd party Excel compatible libraries will let you save directly to the response stream in an ASP.NET application, so you will be able to avoid saving to disk on the server.

Disclaimer: I own SpreadsheetGear LLC

Here is the test code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using SpreadsheetGear;

namespace ConsoleApplication11
{
    class Program
    {
        static void Main(string[] args)
        {
            var timer = System.Diagnostics.Stopwatch.StartNew();
            int rows = 300000;
            int sheets = 1;
            var workbook = Factory.GetWorkbook();
            var sb = new System.Text.StringBuilder();
            int counter = 0;
            bool numeric = true;
            for (int sheet = 0; sheet < sheets; sheet++)
            {
                // Use the SpreadsheetGear Advanced API which is faster than the IRange API.
                var worksheet = (sheet == 0) ? workbook.Worksheets[0] : workbook.Worksheets.Add();
                var values = (SpreadsheetGear.Advanced.Cells.IValues)worksheet;
                for (int row = 0; row < rows; row++)
                {
                    for (int col = 0; col < 50; col++)
                    {
                        if (numeric)
                            values.SetNumber(row, col, ++counter);
                        else
                        {
                            sb.Length = 0;
                            // Make a 10 character unique string.
                            sb.Append(++counter);
                            System.Diagnostics.Debug.Assert(sb.Length <= 10);
                            // Make it 10 characters long.
                            while (sb.Length < 10)
                                sb.Append((char)('A' + (char)sb.Length));
                            values.SetText(row, col, sb);
                        }
                    }
                }
            }
            Console.WriteLine("Created {0} cells in {1} seconds.", counter, timer.Elapsed.TotalSeconds);
            workbook.SaveAs(@"C:\tmp\BigWorkbook.xlsx", FileFormat.OpenXMLWorkbook);
            Console.WriteLine("Created and saved {0} cells in {1} seconds.", counter, timer.Elapsed.TotalSeconds);
        }
    }
}
Joe Erickson
That's very helpful. Thanks! :)
JohannesH