Hello,
I have requirements from our client where we basically have to 'parse' PDF files from various different sources.
The solution we have come with, as 'phase 1' (as we have short time to market and will save them a huge amount of time) is
1) manually use Able2Extract application to pull out the columns you need from the PDF file, and spit out an Excel file. This excel file is still very 'dirty' as it contains tons of header information, extra fields that we don't need, etc..
2) run our application, feeding it this excel file, which will do the remaining of the cleanup. It takes this 'dirty' Excel file and then gives them a very clean excel file which just has the 3 or 4 columns they need all lines up very neatly.
The first solution we are exploring is using VBA/Excel for step 2). They take their dirty output, paste it in Excel, then run our cleanup macro. Excel is great for this sort of stuff - shifting around and scrubbing data that's already in an Excel spreadsheet. We did a proof of concept with one specific 'source' file, and it came out great. Tooks around half a day to develop this one 'scrubbing script'...
Simple enough huh? Not really. This script only works for one specific file type from one specific source. We will have 10 different sources each with possible 3-10 different file types. That means in the end, we may wind up with a huge Excel macro that has 120 of these very specific 'scrubbing scripts'. So my worry is about long term maintainability here. We might also bump into files that we had never seen before that might 'break' our scrubbing script and have to do a quick re-deply / change to a scrubbing script... I've never used Visual Studio Tools for Office and have minimal experience with VBA Excel Macros - but it seems like this might be a good case here.
Any words of wisdom from someone who might have done something similar to this before? Are huge VBA Macros like what could result here nightmares to maintain? Is VSTFO a good alternative that will give me that 'easy to shift/scrub data' functionality, but with scalability and robustness? To be honest; my first instinct was a pure .NET solution with dynamically compiled scripts pulled from the database, using our Syncfusion Excel API to do the cleaning/scrubbing... but perhaps this is overkill..
Thanks for any advice...