views:

342

answers:

9

Hello,

I have requirements from our client where we basically have to 'parse' PDF files from various different sources.

The solution we have come with, as 'phase 1' (as we have short time to market and will save them a huge amount of time) is

1) manually use Able2Extract application to pull out the columns you need from the PDF file, and spit out an Excel file. This excel file is still very 'dirty' as it contains tons of header information, extra fields that we don't need, etc..

2) run our application, feeding it this excel file, which will do the remaining of the cleanup. It takes this 'dirty' Excel file and then gives them a very clean excel file which just has the 3 or 4 columns they need all lines up very neatly.

The first solution we are exploring is using VBA/Excel for step 2). They take their dirty output, paste it in Excel, then run our cleanup macro. Excel is great for this sort of stuff - shifting around and scrubbing data that's already in an Excel spreadsheet. We did a proof of concept with one specific 'source' file, and it came out great. Tooks around half a day to develop this one 'scrubbing script'...

Simple enough huh? Not really. This script only works for one specific file type from one specific source. We will have 10 different sources each with possible 3-10 different file types. That means in the end, we may wind up with a huge Excel macro that has 120 of these very specific 'scrubbing scripts'. So my worry is about long term maintainability here. We might also bump into files that we had never seen before that might 'break' our scrubbing script and have to do a quick re-deply / change to a scrubbing script... I've never used Visual Studio Tools for Office and have minimal experience with VBA Excel Macros - but it seems like this might be a good case here.

Any words of wisdom from someone who might have done something similar to this before? Are huge VBA Macros like what could result here nightmares to maintain? Is VSTFO a good alternative that will give me that 'easy to shift/scrub data' functionality, but with scalability and robustness? To be honest; my first instinct was a pure .NET solution with dynamically compiled scripts pulled from the database, using our Syncfusion Excel API to do the cleaning/scrubbing... but perhaps this is overkill..

Thanks for any advice...

+3  A: 

VBA is much much easier to deal with than VSTO. OK, VBA may not be such a nice language to work with, but at least it gives on-the-metal access to the Excel object model. And a solution based on VBA is likely to be much more stable than one built on VSTO.

I'd say go with VBA, and if you're concerned about the maintainability, think about storing the "scrubbing scripts" in separate files. You can either

(a) have one Excel file per scrubbing script, each with a single macro with the same name; your add-in can load (and execute code in) the appropriate Excel file for any given input file

(b) have one text file per scrubbing script, each with the text of the same macro as above; your add-in can create import this as a new module at run-time - either into itself or into a temporary workbook. This is less efficient, but plays better with version control systems, since you can diff between versions of text files but it's not so easy to diff the modules in two Excel workbooks.

In both of these cases, you can store the scrubbing scripts in a shared folder so that you have centralized update if you need to modify a script.

Gary McGill
+2  A: 

I have written many VBA functions within Excel, some of which have become very large and complex. I do not think maintaining them is that much harder then handling any other large project, except in the cases where people do not understand VBA very well. VBA gives you many ways to do things, most of which are not optimal. For example, if you are not very careful you will have a lot of code that looks like

Workbook("name").Sheets("name).Range(Cells(1,1),Cells(3,1)).Value = "Test"
Workbook("name").Sheets("name).Range(Cells(1,1),Cells(3,1)).Font.Bold = True
Workbook("name").Sheets("name).Range(Cells(1,1),Cells(3,1)).Font.Italics = True

Where it should be something like

With Workbook("name").Sheets("name).Range(Cells(1,1),Cells(3,1))
  .Value = "Test"
   With .Font
      .Bold = True
      .italics = True
   End With
End With

Both will do the same thing, however the second one should be a slightly better performer (there are probably better examples of this) and is atleast in my opinion easier to maintain.

With that said, if you and your team have the experience to write good VBA code, then I think that it is the way to go here. Otherwise, for long term maintainability, I would consider a solution that you have more experience with.

Irwin M. Fletcher
Thanks lansinwd, Gary, and Joel. As I said - I have very minimal, if any VBA and VSTFO experience, however the rest of the team does have VBA experience. It seems you guys cleared up my worries of long term maintainability... But what about when we get to phase 2 of this project and want to automate it fully end-to-end? Can I have a Windows Service take that dirty excel output, feed it to my macro, and get the 'clean' file? Easily? and is it scalable for a nenterprise application, or is it a hack? E.G.- will I be able to re-use these 'scrubbing scripts' easily, when the app is fully automated?
dferraro
.. and users are no longer even using Excel and instead are pushing a button on an ASP.NET web app, which is talking to lets say a Windows Service which does the work + more logic?
dferraro
Can the pdf product pull this data into something like a CSV file, instead of an Excel file. If so, I would consider a scripting language (I am a Perl fan but any would do) as this is the work that they are really useful for. It would be easy to fully automate this type of solution and would be about the same in terms of maintainability as using VBA.
Irwin M. Fletcher
What benefit though is it go move to a scripting language?Everyone on the team is .NET developers with minimal exposure to other languages. If we wanted to go the 'scripting' route, couldn't we just use C# or VB.NET as our scripting language and do dynamic compilation? This was what my first thoughts were...
dferraro
I agree, I was just throwing that out since you had noted that you thought that was overkill. Of all the solutions, I believe your best long term strategy would be the .Net route, especially given your teams skill set.
Irwin M. Fletcher
+3  A: 

I love programming in C# but I hate VSTO.

The two main problems I have:.

  • you've got no live access to the code anymore, it's all compile to a DLL which is attached to the workbook, with no debugging-on-the-go (which can be very useful for little RAD pieces). Debugging through Visual Studio is not an alternative to the ability to debug anywhere when using Excel VBA.

  • you're using the Excel VBA interface wrapped up for .NET usage rather than something that feels native. You have horrible function calls like sheet.get_Range("A1:B1", System.Type.Missing); with Missing being in the place of optional parameters.

There are a lot of people who use VSTO but having spent many years on the Excel VBA platform, I found few reasons to migrate at this point. But consider if you need to do some pretty cool stuff in C#/.NET that you can't pull off in VBA (e.g. reflection).

You can write very good code in VBA; it gets a lot of bad press as it's an environment that doesn't penalize you for writing bad code and absolutely anyone can dabble with VBA.

These may be just gripes of a grumpy developer who is experienced with VBA and not VSTO. So having said all that - if you are unfamiliar with VBA, you might just be better off going straight to VSTO. I'm not sure what Microsoft intends to do with VBA down the line; VSTO is supposed to be the future.

Joel Goodwin
I don't think I get your point about debugging; the quality of the debugging tools in VS vs office is just so much better, I find it very painful to work with the VBA editor at that point. The point about syntax is valid, but there are some ways around it, like using VB.NET, or the VSTO power tools.
Mathias
It's not about the quality per se, it's the fact I can immediately debug a problem as it happens on a user's desktop. Giving that up and telling users I can no longer fix things immediately is quite some sell.
Joel Goodwin
+2  A: 

I wouldn't be writing anything that needs long term maintainability in VBA, but if its short term VBA would be fine.

In terms of performance VBA is slightly faster than .NET, but you lose so many nice features, and with the new versions of VSTO gripes such as with debugging and full OM access are things of the past.

If all of the code is purely for Excel OM manipulation I would still consider VBA as it will be slightly faster and no clear advantage to using .NET (other than the mixture of familiarity within the team that was mentioned above).

If you are using other libraries then use .NET - the main reason is that you get rid of 1/2 dozen library dependancies that you'll need to add in VBA such as FSO, ADO, CDO, etc.

Another common complaint you hear is that you have to use get accessor from C# and that you have to use Type.Missing alot.

With the newer version of .NET the type.missing is a thing of the past. The get accessor problem was only with the early version of the interop library, and I think a common misunderstanding of the usage of the range object and the range property in C#.

I've never had to use accessor methods at all, and once you write some wrapper methods for common Excel OM methods, you won't have to write missing parameters at all either. Apparently .NET 4.0 has an even better way to solve this problem.

Anonymous Type
AT, how recent are these changes? I'm using VS2005 against Excel 2003 here. What level of upgrade would be required to see these problems disappear?
Joel Goodwin
right, i looked again and it seems you need VS 2008 .net 3.5however the good news is you can use VSTO 3.0 with Office 2003, so you just need to upgrade Visual Studio.The other piece you need is the VSTO power tools 1.0 - install the Office interop extensions library PT_ExtLibs and you get extension methods to fix alot of the above.
Anonymous Type
+1  A: 

If step 2 eventually needs to be a service, and you're willing to invest more time upfront (depends on your deliverable schedule) and you're dealing with excel in Open XML (although possible with the older binary formats as well) - check out the Open XML SDK and have a look at Microsoft's recommended server side automation of Office documents.

If you need to deliver it fast, VBA will get you there. If you want something that is easy to package and distribute, VSTO will get you there with a little more effort. If you need a service, go for something else entirely.

Mike Regan
+2  A: 

First, you are going to need 'n' scrubbing programs no matter what. The fact is, that Excel/VBA is not much worse for maintaining this functionality than many other platforms.

You could add an interface using Userform, or play the auto-detect game, spitting out any 'new' file formats that it doesn't understand. There are several robust error handling schemes available as well, so there's no need to worry that things will get broken.

One Oil company paid me to write an Excel application using 4 Userforms and over 5000 lines of VBA as a tool to assist its accountants in doing monthly joint venture reporting. The application was used for 4 years past its End-of-life because the interface was so familiar and easy to use.

...sorry for rambling on about this, but there is a tendency to 'look down' on VBA because so few 'real programmers' use it...

caving
Thanks for the reply. What realistic limitations am I looking at by going the VBA route, long term? I may need in the future to fully streamline this app, which would have been I imagined an ASP.NET site that talked to a Windows Service for the proccessing logic, likely using WCF. I don't see why we couldn't do a Desktop app for this though, if it made sense.. in this case we could possibly stick with VBA. Throughout my career I have been for the most part the bounty-hunter of XL Macros.
dferraro
So many of my apps automated existing business proccess that were all being done manually in Excel... so perhaps my hesitence in using them stems from that =)
dferraro
I have built 10K lines in one VBA app, but I can't speak to larger than that. I have found using services tends to be a bit slow, due to interprocess communication, but for smaller data sets this will not be a problem. Good luck on your project!
caving
Sorry for the delay, I have been away. I have never had any serious limitations, providing each module is less than 65K characters long. If a module gets beyond 16K characters, you may need to further modularize it anyway... I have experienced serious bottlenecks using WCF and using interprocess communication. If you need the Excel format, the best way is to use Excel itself to write the file using VBA to process it.
caving
dont use WCF directly from VBA, write a clientside .net or java component to take care of serialisation and async callbacks.
Anonymous Type
+2  A: 

I reckon you should go with your first instict.

Although pulling dynamically compiled scripts from the DB sure does sound like overkill to me. I probably don't fully understand your problem because I'm not sure what issue pulling dynamically compiled scripts from a DB solves.

You've got Syncfusion Excel API, for step #2 why not just write a pure .net application using Syncfusion to load and manipulate the excel files and re-save them. When you encounter new file types to support you update the application and re-distribute it.

This solution might possibly take a little longer to develop but:

  1. Will entirely be in .NET (I loathe VBA).
  2. Will not be using Excel as a server app (which another poster already pointed out is not something that Excel was built to do and MS advise against it for the reasons the other poster mentioned).
  3. Will (based on my experience) perform an order of magnitude faster than VSTO (interop) and probably VBA too.
Ben Robbins
A: 

Maybe Excel Services for Microsoft Office SharePoint Server 2007/2010 could be something? It seems that Excel Services can't be used without SharePoint though [look here].

Excel Services 2007 - Overview

Excel Services 2007 - Architecture

Excel Services 2010 - Overview

Excel Services 2010 - Architecture

What Is Excel Services 2007?

Octadrone
A: 

Referring to the wider question, the things to take into account:

  1. VBA IDE ships with Excel. Not so easy with VSTO if you want a wider group to edit code.
  2. More people know how to write VBA than VSTO at this stage.
  3. More online support for VBA at this stage.
  4. VBA is not designed to be anything more than automation langauge for Office products. It is perfectly adequate for that and won't be going away anytime soon. MS realise that it is one of the things that Office has over OpenOffice - Ken from Accounts is not about to sit down with Eclipse and start typing Public Static Void Main
  5. There are considerable limitations to VBA once you want to start using it like app code. Just including class libraries is a pain. If this will be widely distributed, I would go with VSTO.

As a poster above stated: 5,000 lines of code is 5,000 lines of code, give or take.

I am not a big fan of VSTO. VBA works for what it is intended to do. No need to rewrite it. If you need to get hard code, use C#.

Paddy