Our customers use 500+ applications and we would like to integrate these applications with ours. What is the best way to do that? These are time registration applications, and what most of them have in common is that they can export to CSV or similar. Some of them are actually home-brewed Excel sheets where time is registered.

The best idea so far is to create our own Excel sheet, which can be used to integrate with all these applications. The integrations could be in the form of cells containing something like ='[c:\export.csv]rawdata'!$A$3, where export.csv is the CSV file exported from the time registration application. Can you see a better way to integrate against all these applications? It should be mentioned that almost all our customers have Microsoft Office.


Edit: Answers to the excellent questions from Pontus Gagge:

How similar are the data in the different applications? I assume that since they are all time registration applications, they will have some similarities, but some will register the total time worked for a whole month, while others will specify it for each day. If Excel is chosen, I believe that many of the differences could be ironed out using basic formulas.

What quality is the data? The quality of the data can vary, so basic validation must be undertaken. It is also a good idea to make it transparent to the customers how our application understands their input, so that they are responsible.

How large amounts of data are you talking about? There will be information about the time worked for up to 50 employees.

Is the integration one-way only? Yes

With what frequency should information be transferred? Once per month (when they need to pay salaries).

How often do the applications themselves change, and how often does your product change? If their application is a home-brewed Excel sheet, then I assume it will change about once a year (due, for example, to a mistake by someone). If it is a proper standard time registration application, then I do not believe it is updated more often than every fifth year or so, as it is a very stable concept.

Should the integration be fully automatic or can your end users trigger a data transfer? They can surely trigger the data transfer. The users are often dedicated to the process, so they can be trained to do it, which means that they could make up to, say, 30 mouse clicks in order to integrate each month.

Will the customers have somebody to monitor the integrations? As we have many customers, many of them should be able to undertake the integration themselves. We will, though, be able to assist them over the telephone. We cannot, however, undertake the integration ourselves, because we would then be responsible for any errors due to user mistakes, etc.

Does the phrase 'integration spaghetti' mean anything to you...? I am looking for ideas from the best chefs to cook a nice large portion of that.

+2  A: 

I would also look at CSV and then use an OLEDB connection against the CSV file for importing.

Johann Strydom
It is a good idea. I will consider it.
David
Please remember to accept my answer if you feel it helped.
Johann Strydom
I do have a habit of doing that. Thanks.
David
+5  A: 

You need to come up with a common data format, and a way to translate the individual data formats to the common format. There's really no way around this - any solution you come up with will have to do this in one way or the other. It's the essential complexity of what you're doing.

The bigger issue is actually variances within the source data, in terms of how things like dates are stored, missing columns, etc. Doing a generic conversion for CSV to move columns around is comparatively easy.
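
A minimal sketch of that idea in Python, assuming a hypothetical common format of one row per employee per day. The names `TimeEntry`, `SourceProfile` and `to_common` are made up for illustration and every column name is an assumption. Each customer gets a small profile describing their export, and one shared function does the translation:

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class TimeEntry:
    """Hypothetical common format: one row per employee per day."""
    employee: str
    day: date
    hours: float


@dataclass
class SourceProfile:
    """Describes one customer's export (all field names are assumptions)."""
    employee_col: str
    date_col: str
    hours_col: str
    date_format: str            # e.g. "%d.%m.%Y" or "%Y-%m-%d"
    delimiter: str = ","
    decimal_comma: bool = False


def to_common(csv_text: str, profile: SourceProfile) -> list[TimeEntry]:
    """Translate one export into the common format, failing loudly on gaps."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=profile.delimiter)
    required = {profile.employee_col, profile.date_col, profile.hours_col}
    missing = required - set(reader.fieldnames or [])
    if missing:
        # A missing column is reported instead of being silently guessed.
        raise ValueError(f"missing columns: {sorted(missing)}")
    entries = []
    for row in reader:
        raw_hours = row[profile.hours_col]
        if profile.decimal_comma:
            raw_hours = raw_hours.replace(",", ".")
        day = datetime.strptime(row[profile.date_col].strip(), profile.date_format)
        entries.append(TimeEntry(
            employee=row[profile.employee_col].strip(),
            day=day.date(),
            hours=float(raw_hours),
        ))
    return entries
```

Moving columns around is the easy part here (three names in the profile). The date format, the decimal separator and the missing-column check are where the per-source variance shows up, and a real set of 500 sources would likely need more knobs than these.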

kyoryu
That is a good point. I am looking for the easiest way to solve these issues. Do you believe in also using spreadsheets for the parsing?
David
There is no real easy way to solve these issues. The essential complexity of translating 500 different data formats into a single common one is incredible. The best solution is to do some standardization of the applications used, if possible, rather than collating data from 500 different formats.
kyoryu
+1  A: 

With a multitude of data sources, mapping each one correctly to an intermediate format is not trivial. Regular expressions work well for a finite set of known data formats. Multiple passes can help when data is ambiguous without context (for example, telling month and day fields apart once several days of data are available), and can also help catch data entry errors. But since this data is connected to salaries, the transfer needs to be reliable.

An import configuring trick

Get the customer to create a set of training data in their application. It should contain a "predefined unique date", and each other data field should hold the number of the corresponding target field in your application. On import, your application needs to recognise the predefined date, work out the translation required, display or save this "mapping key", and stop the import. E.g. if you expect "Duration hours" in field two, get the user to enter 2 in the relevant field, which might be called "Attendance hours".

On subsequent runs, and with the mapping definition key, import becomes a fairly easy process of translation.
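
The trick could look roughly like this in Python. It is a sketch only: the sentinel date, the list of candidate date formats and the target field numbering are all assumptions, and the mapping key is kept as a plain dict rather than an encoded string.

```python
from datetime import date, datetime

SENTINEL = date(1999, 1, 4)   # hypothetical "predefined unique date"
# Assumed candidate formats; a sentinel with day != month settles the order.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%d/%m/%Y")
TARGET_FIELDS = {1: "employee", 2: "duration_hours"}   # assumed numbering


def find_sentinel(row):
    """Return (column index, date format) of the cell holding SENTINEL, else None."""
    for idx, cell in enumerate(row):
        for fmt in DATE_FORMATS:
            try:
                if datetime.strptime(cell.strip(), fmt).date() == SENTINEL:
                    return idx, fmt
            except ValueError:
                continue
    return None


def learn_mapping(rows):
    """Find the training row and derive the mapping key from it."""
    for row in rows:
        hit = find_sentinel(row)
        if hit is None:
            continue
        date_idx, fmt = hit
        mapping = {"date": date_idx, "date_format": fmt}
        for idx, cell in enumerate(row):
            value = cell.strip()
            if idx != date_idx and value.isdigit() and int(value) in TARGET_FIELDS:
                mapping[TARGET_FIELDS[int(value)]] = idx
        return mapping          # stop here: the training run imports nothing
    raise ValueError("no training row with the predefined date found")


def apply_mapping(row, mapping):
    """Translate one ordinary row on a later run, using the saved mapping."""
    when = datetime.strptime(row[mapping["date"]].strip(), mapping["date_format"])
    return {
        "employee": row[mapping["employee"]].strip(),
        "date": when.date(),
        "duration_hours": float(row[mapping["duration_hours"]]),
    }
```

For example, a training export with the row `["2", "1", "04/01/1999"]` tells the importer that hours are in the first column, the employee in the second, and that dates are written day first.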

Note on terms

  • "predefined date" - must be historical (say, the founding date of your company) and might need to be within the range the PC clock can be set to.
  • "mapping key" - could be a string of hex digits and nybble based, so it is tractable to work out. The entered code can be extended to signify required conversions, e.g. the customer's application has durations in days and your application expects hours.

Interfacing with Windows programs (in order of increasing fragility)

  • Ye Olde saving as CSV file
  • Print to operating system printer that is setup as a text file/pdf, then scavenge the data out of that
  • Extract data via the application's interface control, typically ActiveX for several Windows programs, e.g. like Matlab's Spreadsheet Link
  • Read the native file format (xls), e.g. like Matlab's xlsread
  • Add an additional intermediate spreadsheet sheet that has extended cell references ie ='[filename]rawdata'!$A$3
Roaker
+1  A: 

If you try to make something that can interface to any data structure in the universe (and 500 is plenty close enough), it is guaranteed to be a maintenance nightmare. Instead I would approach this from multiple angles:

  1. Devise an interface into which a human can enter this data already in the proper format. With 500+ clients, I'd make this a small, raw but functional browser-based site that users can use to enter this information manually. This is the fall-back. At the end of the day, a human can re-key the information into the site and solve the import issue. Ideally, everyone would use this instead of their own format. Data entry people are cheap.

  2. Similar to above, but expanded, I would develop a standard application or standardize on an off-the-shelf application that can be used to replace their existing format. This might take more time than #1. The goal would be to only do one-time imports of these varying data schemas into the application and be done with them for good.

  3. The nice thing about spreadsheets is that you can do anything anywhere. The bad thing about spreadsheets is that you can do anything anywhere. With CSV or a spreadsheet there is simply no way to enforce data integrity and thus consistency (which is the primary goal) on the data. If the source data is already in a database, then that is obviously simpler.

I would be inclined to use a database format into which each of these files needs to be converted, rather than a spreadsheet (e.g. something like Jet (MDB)). If you have non-Windows users then that will make it harder and you might have to use a spreadsheet. The problem is that it is too easy for the user to change their source structure, break their upload and come crying to you. If a given end user has a resident expert, they can find a way of importing the data into that database format. If you are that expert, then I would, on a case-by-case basis, write something that imports into that database format. XML would be the other choice, but that will likely take more coding than an import/export into a database format.
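
To illustrate what a database format buys you over a spreadsheet, here is a rough sketch using SQLite from Python in place of Jet (a substitution made purely so the example is self-contained). The table and column names are assumptions. The schema itself rejects rows that a spreadsheet would happily accept:

```python
import sqlite3

# Assumed schema: constraints encode the rules every import must satisfy.
SCHEMA = """
CREATE TABLE time_entry (
    employee TEXT NOT NULL CHECK (length(trim(employee)) > 0),
    day      TEXT NOT NULL
             CHECK (day GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'),
    hours    REAL NOT NULL
             CHECK (typeof(hours) IN ('real', 'integer')
                    AND hours BETWEEN 0 AND 24),
    UNIQUE (employee, day)
)
"""


def load(rows):
    """Insert rows one by one; return the connection and the rejected rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rejected = []
    for row in rows:
        try:
            conn.execute("INSERT INTO time_entry VALUES (?, ?, ?)", row)
        except sqlite3.IntegrityError as exc:
            # The database, not the import code, decides what is valid.
            rejected.append((row, str(exc)))
    return conn, rejected
```

A duplicate day for the same employee, a date in the wrong format, non-numeric hours, or 30 hours in one day all end up in `rejected` with the reason attached, which can be shown back to the customer.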

Standardization of the apps (even having all the sources in a database format instead of a spreadsheet would help) and control over the data schema is the ultimate goal rather than permitting a gazillion formats. There really is no nice answer other than standardization. Otherwise, you are having to write a converter for every Tom-Dick-and-Harry format and again when someone changes the source format.

Thomas
A: 

Use a simple XML format. A non-technical person can easily understand a simple XML format (and could even identify basic problems with XML documents that are not well-formed).

Maybe use a DTD (or even better an XML schema) to do very basic validation, and then supplement this with an XSL stylesheet to do more validation with better error reporting. (An XSL stylesheet simply converts from XML to something else and so can generate readable error messages.)

The advantage of this approach is that web browsers such as Internet Explorer can apply the XSL stylesheets. A customer need only spend at most a day enhancing their applications or writing Excel macros to generate the XML data in the format that you specify.

Recent versions of Excel have support for converting spreadsheet data to XML, and can even validate against schemas.

Once the data passes the XSL validation checks, you have validated XML data.
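
As a rough illustration of the kind of checks meant here, the sketch below uses only Python's standard library rather than a DTD, XML schema or XSL stylesheet, so it is a stand-in for the idea and not the approach itself. The `<timesheet>`/`<entry>` format and its attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def validate_timesheet(xml_text):
    """Return a list of readable problems; an empty list means the data passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    if root.tag != "timesheet":
        return [f"root element must be <timesheet>, found <{root.tag}>"]
    problems = []
    for n, entry in enumerate(root.findall("entry"), start=1):
        if not entry.get("employee"):
            problems.append(f"entry {n}: missing employee")
        try:
            datetime.strptime(entry.get("date", ""), "%Y-%m-%d")
        except ValueError:
            problems.append(f"entry {n}: date must be YYYY-MM-DD")
        try:
            hours = float(entry.get("hours", ""))
            if not 0 <= hours <= 24:
                problems.append(f"entry {n}: hours must be between 0 and 24")
        except ValueError:
            problems.append(f"entry {n}: hours must be a number")
    return problems
```

The point is the shape of the output: a list of plain-language messages the customer can act on, which is what the XSL step would produce in the approach described above.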

Nelson
A: 

If you have heaps of data and heaps of money, you could look at existing data management and cleansing tools:

http://www-01.ibm.com/software/data/infosphere/datastage

http://www-01.ibm.com/software/data/infosphere/qualitystage

But even then, you'll likely need to follow kyoryu's suggestion, assuming you have 500+ data formats. The problem isn't on your side. You need them to standardize their output formats if you have no control over their apps. CSV is likely the easiest. You could even send them an Excel template to help them along.

Glenn
A: 

Have a look at Teiid by JBoss: http://jboss.org/teiid

Also consider using SOA - e.g., if you're on Java, try JBoss SOA platform: http://www.jboss.com/resources/soa/?intcmp=1004

Ondra Žižka
A: 

My company Temboo works specifically on these types of problems. Our SaaS solution gives the user full control over the process by allowing them to edit and create their own data integrations via our visual designer. Check us out; we would be happy to discuss your problem and I am sure we could help come up with a successful solution.

Luke