ansaurus

Question

The right data structure to use for an Excel clone

Answer 1

A:

Given that the data is 2-dimensional, I would have a 2D array to hold it in.

ck 2009-03-17 10:59:48

No doubt that's the fastest, just a little expensive in terms of storage. Still, most spreadsheet data tend to be localized to, say, A1:F2 so that may be the best option.

paxdiablo 2009-03-17 11:23:28

Exactly. You can have 10000 rows of 1000 columns for just 40MB + actual data if you use a reference type (10,000,000 references at 4 bytes each, as on a 32-bit runtime). Hardly a lot :)

Jon Skeet 2009-03-17 11:32:02

Answer 2

A:

Well, you could store them in three Dictionaries: two Dictionary<int,CellValue> objects for rows and columns, and one Dictionary<string,CellValue> for text. You'd have to keep all three carefully in sync though.

I'm not sure that I wouldn't just go with a big two-dimensional array though...

Jon Skeet 2009-03-17 11:00:53

Answer 3

+1 A:

I think you should use one of the indexed collections to make it work reasonably fast; the best fit is KeyedCollection.

You need to create your own collection by extending this class. This way your object will still contain its row and column (so you will not lose anything), but you will be able to search by them. You will probably have to create a class encapsulating (row, column) and make it the key, so make it immutable and override Equals and GetHashCode.
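A minimal sketch of that idea, assuming a hypothetical CellKey class for the (row, column) pair and a hypothetical CellValue that only carries text. Only KeyedCollection and its GetKeyForItem override come from the framework; the rest of the names are made up for illustration.

```csharp
using System;
using System.Collections.ObjectModel;

// Hypothetical immutable key wrapping (row, column).
public sealed class CellKey : IEquatable<CellKey>
{
    public int Row { get; }
    public int Column { get; }

    public CellKey(int row, int column)
    {
        Row = row;
        Column = column;
    }

    public bool Equals(CellKey other) =>
        other != null && Row == other.Row && Column == other.Column;

    public override bool Equals(object obj) => Equals(obj as CellKey);

    // Mix both coordinates so cells in the same row do not share a hash.
    public override int GetHashCode() => unchecked(Row * 397 ^ Column);
}

// Hypothetical cell type: it still carries its own coordinates via Key.
public sealed class CellValue
{
    public CellKey Key { get; }
    public string Text { get; set; }

    public CellValue(int row, int column, string text)
    {
        Key = new CellKey(row, column);
        Text = text;
    }
}

// KeyedCollection asks each item for its key through GetKeyForItem.
public sealed class CellCollection : KeyedCollection<CellKey, CellValue>
{
    protected override CellKey GetKeyForItem(CellValue item) => item.Key;
}
```

With this in place, cells.Contains(new CellKey(2, 3)) and cells[new CellKey(2, 3)] look a cell up by position through the collection's internal dictionary, while the collection can still be enumerated in insertion order.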

Grzenio 2009-03-17 11:01:06

Answer 4

A:

If it's an exact clone, then an array-backed list of CellValue[256] arrays. Excel has 256 columns, but a growable number of rows.

Pete Kirkham 2009-03-17 11:03:38

"Excel has 256 columns" - 16384 in Excel 2007

Joe 2009-06-15 20:29:14

hmmmm, upgrades: http://blogs.msdn.com/excel/archive/2005/09/26/474258.aspx

Pete Kirkham 2009-06-18 09:34:59

Answer 5

+1 A:

I'd create

 Collection<Collection<CellValue>> rowCellValues = new Collection<Collection<CellValue>>();

and

Collection<Collection<CellValue>> columnCellValues = new Collection<Collection<CellValue>>();

The outer collection has one entry for each row or column, indexed by the row or column number, the inner collection has all the cells in that row or column. These collections should be populated as part of the process that creates new CellValue objects.

rowCellValues[newCellValue.Row].Add(newCellValue);
columnCellValues[newCellValue.Column].Add(newCellValue);

MrTelly 2009-03-17 11:06:27

Answer 6

+4 A:

I would opt for a sparse array (a linked list of linked lists) to give maximum flexibility with minimum storage.

In this example, you have a linked list of rows with each element pointing to a linked list of cells in that row (you could reverse the cells and rows depending on your needs).

 |
 V
+-+    +---+             +---+
|1| -> |1.1| ----------> |1.3| -:
+-+    +---+             +---+
 |
 V
+-+             +---+
|7| ----------> |7.2| -:
+-+             +---+
 |
 =

Each row element has the row number in it and each cell element has a pointer to its row element, so that getting the row number from a cell is O(1).

Similarly, each cell element has its column number, making that O(1) as well.

There's no easy way to get O(1) for finding immediately the cell at a given row/column but a sparse array is as fast as it's going to get unless you pre-allocate information for every possible cell so that you can do index lookups on an array - this would be very wasteful in terms of storage.

One thing you could do is make one dimension non-sparse, such as making the columns the primary array (rather than linked list) and limiting them to 1,000 - this would make the column lookup indexed (fast), then a search on the sparse rows.

I don't think you can ever get O(1) for a text lookup simply because text can be duplicated in multiple cells (unlike row/column). I still believe the sparse array will be the fastest way to search for text, unless you maintain a sorted index of all text values in another array (again, that can make it faster but at the expense of copious amounts of memory).
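Roughly, the linked list of linked lists described above could look like the sketch below. The type names (SparseCell, SparseRow, SparseSheet) are hypothetical, and keeping both lists sorted is an assumption that lets a lookup stop early once it has passed the requested position.

```csharp
using System.Collections.Generic;

// Hypothetical node types for the row list and the per-row cell lists.
public sealed class SparseCell
{
    public int Column;
    public string Text;
    public SparseRow Row;   // back-pointer: row number from a cell is O(1)
}

public sealed class SparseRow
{
    public int Number;
    public readonly LinkedList<SparseCell> Cells = new LinkedList<SparseCell>();
}

public sealed class SparseSheet
{
    // Rows are kept sorted by number, cells within a row by column.
    private readonly LinkedList<SparseRow> rows = new LinkedList<SparseRow>();

    public SparseCell Find(int row, int column)
    {
        foreach (SparseRow r in rows)
        {
            if (r.Number > row) return null;      // passed it: row is empty
            if (r.Number < row) continue;
            foreach (SparseCell c in r.Cells)
            {
                if (c.Column == column) return c;
                if (c.Column > column) break;     // passed it: cell is empty
            }
            return null;
        }
        return null;
    }

    public void Set(int row, int column, string text)
    {
        // Find the row node, or the first node with a larger number.
        LinkedListNode<SparseRow> rn = rows.First;
        while (rn != null && rn.Value.Number < row) rn = rn.Next;
        if (rn == null || rn.Value.Number != row)
        {
            var created = new SparseRow { Number = row };
            rn = rn == null ? rows.AddLast(created) : rows.AddBefore(rn, created);
        }

        // Same walk within the row's cell list.
        LinkedList<SparseCell> cells = rn.Value.Cells;
        LinkedListNode<SparseCell> cn = cells.First;
        while (cn != null && cn.Value.Column < column) cn = cn.Next;
        if (cn != null && cn.Value.Column == column)
        {
            cn.Value.Text = text;                 // overwrite the existing cell
            return;
        }
        var cell = new SparseCell { Column = column, Text = text, Row = rn.Value };
        if (cn == null) cells.AddLast(cell); else cells.AddBefore(cn, cell);
    }
}
```

Storage grows only with the number of populated cells, and Find walks at most the populated rows plus the populated cells of one row. Replacing the outer list with an array indexed by row or column gives the "one dimension non-sparse" variant.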

paxdiablo 2009-03-17 11:22:05

+1, would be nice to make the LL a SkipList too.

sixlettervariables 2009-06-15 20:42:08

+1, that seems to be the most reasonable way to do it, and using a skip list is a good idea too

David Johnstone 2009-06-16 00:25:41

Answer 7

A:

If rows and columns can be added "dynamically", then you shouldn't store the row/column as a numeric attribute of the cell, but rather as a reference to a row or column object.

Example:

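// Requires: using System.Collections.Generic;
// A class rather than a struct: the row and column lists must refer to the
// same cell object, and a struct would be copied on every Add/Remove.
private class CellValue
{
  private List<CellValue> _column;
  private List<CellValue> _row;
  private string text;

  public List<CellValue> column {
     get { return _column; }
     set {
         // Detach from the old column before attaching to the new one.
         if(_column!=null) { _column.Remove(this); }
         _column = value;
         _column.Add(this);
     }
  }

  public List<CellValue> row {
     get { return _row; }
     set {
         // Detach from the old row before attaching to the new one.
         if(_row!=null) { _row.Remove(this); }
         _row = value;
         _row.Add(this);
     }
  }
}

// One list per row and per column, kept in sheet order.
private List<List<CellValue>> MyRows    = new List<List<CellValue>>();
private List<List<CellValue>> MyColumns = new List<List<CellValue>>();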

Each Row and Column object is implemented as a List of the CellValue objects. These are unordered--the order of the cells in a particular Row does not correspond to the Column index, and vice-versa.

Each sheet has a List of Rows and a list of Columns, in order of the sheet (shown above as MyRows and MyColumns).

This will allow you to rearrange and insert new rows and columns without looping through and updating any cells.

Deleting a row should loop through the cells on the row and delete them from their respective columns before deleting the row itself. And vice-versa for columns.

To find a particular Row and Column, find the appropriate Row and Column objects, then find the CellValue that they contain in common.

Example:

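// Requires: using System.Linq;
public CellValue GetCell(int rowIndex, int colIndex) {
  List<CellValue> row = MyRows[rowIndex];
  List<CellValue> col = MyColumns[colIndex];
  // Intersect returns an IEnumerable, which has no indexer; First() throws
  // if the row and column share no cell (use FirstOrDefault() to get null).
  return row.Intersect(col).First();
}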

(I'm a little fuzzy on these Extension methods in .NET 3.5, but this should be in the ballpark.)

richardtallent 2009-06-16 00:32:58

Answer 8

A:

If I recall correctly, there was an article about how VisiCalc did it, maybe in Byte Magazine in the early 80s. I believe it was a sparse array of some sort. But I think there were links both up-and-down and left-and-right, so that any given cell had a pointer to the cell above it (however many cells away that may be), below it, to the left of it, and to the right of it.

Nosredna 2009-06-16 01:03:54

Answer 9

A:

This smells of premature optimization.

That said, there are a few features of Excel that are important in choosing a good structure.

First is that Excel uses the cells in a moderately non-linear fashion. The process of resolving formulas involves traversing the spreadsheet in effectively random order. The structure will need a mechanism for cheaply looking up values by random keys and marking them dirty, resolved, or unresolvable due to circular reference. It will also need some way to know when there are no more unresolved cells left, so that it can stop working. Any solution that involves a linked list is probably sub-optimal for this, since it would require a linear scan to get those cells.
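As a rough sketch of that bookkeeping (not how Excel itself does it), the code below keeps cells in a Dictionary for cheap random lookup and tags each one with a state. Everything here is hypothetical: the FormulaCell and Recalculator names, the toy "formula" (a constant plus the sum of the referenced cells), and the shortcut of marking every cell dirty on each change instead of tracking dependents.

```csharp
using System.Collections.Generic;

public enum CellState { Dirty, Resolving, Resolved, Circular }

// Hypothetical cell: a constant plus the sum of the cells it references.
public sealed class FormulaCell
{
    public double Constant;
    public List<string> References = new List<string>();
    public double Value;
    public CellState State = CellState.Dirty;
}

public sealed class Recalculator
{
    private readonly Dictionary<string, FormulaCell> cells =
        new Dictionary<string, FormulaCell>();

    public void Set(string name, double constant, params string[] references)
    {
        cells[name] = new FormulaCell
        {
            Constant = constant,
            References = new List<string>(references)
        };
        // Simplification: mark everything dirty instead of tracking dependents.
        foreach (FormulaCell c in cells.Values) c.State = CellState.Dirty;
    }

    public FormulaCell Get(string name) => cells[name];

    // Each cell is resolved at most once; when the loop ends nothing is dirty.
    public void Recalculate()
    {
        foreach (string name in cells.Keys) Resolve(name);
    }

    private CellState Resolve(string name)
    {
        // A reference to a cell that was never set counts as an empty cell (0).
        if (!cells.TryGetValue(name, out FormulaCell cell)) return CellState.Resolved;
        if (cell.State == CellState.Resolving)
        {
            cell.State = CellState.Circular;   // reached again while in progress
            return cell.State;
        }
        if (cell.State != CellState.Dirty) return cell.State;

        cell.State = CellState.Resolving;
        double total = cell.Constant;
        bool circular = false;
        foreach (string reference in cell.References)
        {
            if (Resolve(reference) == CellState.Circular) circular = true;
            else if (cells.TryGetValue(reference, out FormulaCell dep)) total += dep.Value;
        }
        // A cell that depends on a circular cell is itself unresolvable.
        cell.State = circular ? CellState.Circular : CellState.Resolved;
        if (!circular) cell.Value = total;
        return cell.State;
    }
}
```

The traversal follows references in whatever order the formulas dictate, which is why a keyed lookup matters here. The recursion is only for brevity; a long dependency chain would need an explicit stack.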

Another issue is that Excel displays a range of cells at one time. This may seem trivial, and to a large extent it is, but it would be ideal if the app could pull all of the data needed to draw a range of cells in one shot. Part of this may be keeping track of the display height and width of the rows and columns, so that the display system can iterate over the range until the desired width and height of cells has been collected. The need to iterate in this manner may preclude the use of a hashing strategy for sparse storage of cells.

On top of that, there are some weaknesses of the representational model of spreadsheets that could be addressed much more effectively by taking a slightly different approach.

For example, column aggregates are sort of clunky. A column total is easy enough to implement in Excel, but it has a sort of magic behavior that works most of the time but not all of the time. For instance, if you add a row into the aggregated area, further calculations on that aggregate may or may not continue to work, depending on how you added it. If you copy and insert a row (and replace the values) everything works fine, but if you cut and paste the cells one row down, things don't work out so well.

TokenMacGuy 2009-06-16 01:11:28
