Basically, the situation is that we need to take data from an arbitrary data source (database, flat file, it could be anything really), morph it so it maps to our schema, and then send it to us so we can store it. The data will be coming from ~200 different physical locations around the country, and we'd like it to run on a schedule (once a night or whatever). Also, the people at those ~200 locations are not technical, so we want to make it as easy and hassle-free as possible for them.

Nothing is implemented yet; it is still in the design stage. This is a preliminary design I came up with, and I just wanted SO's opinion on it, any problems you can foresee, and any suggestions for a better way to do it.

What I came up with was a standard standalone application that accepts different plugins for reading data (from a database, flat file, etc.). The plugin reads the data and hands whatever's there to the main application, which serializes it to XML and sends it to us, either via a SOAP web service or some kind of REST API (haven't decided on this yet; SOAP seems like such a pain in the ass). The app can be scheduled via the Windows Task Scheduler or run as a cron job, so that part is easy enough to do.
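
To make the plugin idea concrete, here is a minimal sketch of what the reader contract might look like. The interface name and its methods are hypothetical, not an existing API:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    /**
     * Hypothetical contract a reader plugin might implement. The host app
     * would discover implementations (e.g. via java.util.ServiceLoader),
     * open them with the user-supplied config, and pull rows in batches.
     */
    public interface DataSourcePlugin {

        /** Connect using whatever the user configured (path, host, credentials...). */
        void open(Properties config) throws Exception;

        /** Next batch of rows as column-name -> value maps; empty list when done. */
        List<Map<String, Object>> readBatch(int maxRows) throws Exception;

        void close() throws Exception;
    }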

This way the user only has to enter the location of the data and possibly how to get to it (username/password, host, etc., whatever configuration is needed). The catch is, we have no idea what their data is going to look like, since there is no standard and every location does it their own way. What I was thinking is: obviously, if it's a flat file, the only way to do it would be to send the whole file. But if it's a database and no config file is present, the app sends us all the metadata (tables, column names, etc.); we then build a config file telling the application what data to select, e-mail it to the user, and tell them where to put it.
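
For the database case, the standard JDBC DatabaseMetaData API can produce exactly that table/column listing. A minimal sketch (the class name is invented):

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Walk the catalog with DatabaseMetaData and collect table/column names
    // so the server side can build a mapping config from them.
    public class SchemaDump {
        public static Map<String, List<String>> dumpSchema(Connection conn) throws SQLException {
            Map<String, List<String>> schema = new LinkedHashMap<>();
            DatabaseMetaData md = conn.getMetaData();
            try (ResultSet tables = md.getTables(null, null, "%", new String[] {"TABLE"})) {
                while (tables.next()) {
                    String table = tables.getString("TABLE_NAME");
                    List<String> columns = new ArrayList<>();
                    try (ResultSet cols = md.getColumns(null, null, table, "%")) {
                        while (cols.next()) {
                            columns.add(cols.getString("COLUMN_NAME") + " " + cols.getString("TYPE_NAME"));
                        }
                    }
                    schema.put(table, columns);
                }
            }
            return schema;
        }
    }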

I was thinking it would be easiest, and probably best, to do the actual morphing of the data on our side, so if the mapping changes we don't have to push anything out to the locations.

This will all be done in Java, so if there's already some obscure Apache project to do all this for me, please do let me know.

Also, what other storage solutions do you foresee non-technical people using besides your standard SQL database or a flat file?

Thanks in advance!

A: 

If possible, I think it would be best to try and get all the sites to agree to a limited number of formats and methods. Perhaps you could provide tools that would allow this to happen. If not, I think you're stuck with a maintenance nightmare -- people changing file names, formats, locations, perhaps on a daily basis depending on who happens to be responsible that day. You have my sympathies.

tvanfosson
+2  A: 

What you are essentially building here is a data warehouse with multiple incoming and outgoing data profiles.

If you study any EDI implementations/whitepapers, it might help you with your overall approach and mindset.

That being said, I have had success doing this with the following approach.

I implemented it in an environment that used too many different data inputs (Lotus Notes, Pervasive SQL, Excel/CSV, FoxPro, and a few others that make me cringe). I imported all the data into a standard MySQL DB that I could then do anything with.

While pushing for standardized formats, you will have to deal with reality. I would love to hear what others have done too.

From a high level, you will have a few types of data to manage.

1) Incoming/import data profiles -- these are the mappings that take data from a source's column format and translate it into your standardized storage format. Ideally you can set up one way that you like; inevitably someone will do something different, or you might need to accept data from other vendors/partners in a structure that you need to translate to yours. You could build an incoming data profile for each location (a small sketch of one follows this list).

2) Stored data mappings -- this is the centralized table (or set of tables) where every record lives in your single standard format.

3) Outgoing/export data profiles -- everyone needs data in different structures and formats. This is probably easiest done by storing a query in a database field that you run, plus an export format (XML, CSV, Excel) for each consumer. Ideally you want to have a "general" export; inevitably you may need to limit or customize who sees what.
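
Here is the promised sketch of how an incoming profile could be modeled. The class and the column names in the usage comment are invented for illustration:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // One incoming data profile per location: maps that location's source
    // column names onto the standardized names used in the central store.
    // Usage (invented columns):
    //   new IncomingProfile("site-042").map("CUST_NM", "customer_name").translate(row);
    public class IncomingProfile {
        private final String locationId;
        private final Map<String, String> columnMap = new LinkedHashMap<>();

        public IncomingProfile(String locationId) { this.locationId = locationId; }

        public String getLocationId() { return locationId; }

        public IncomingProfile map(String sourceColumn, String standardColumn) {
            columnMap.put(sourceColumn, standardColumn);
            return this;
        }

        /** Translate one source row into the standardized shape, dropping unmapped columns. */
        public Map<String, Object> translate(Map<String, Object> sourceRow) {
            Map<String, Object> out = new LinkedHashMap<>();
            columnMap.forEach((src, std) -> {
                if (sourceRow.containsKey(src)) out.put(std, sourceRow.get(src));
            });
            return out;
        }
    }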

Functionality:

  • If you can build a web-based upload, that would be ideal; it could then trigger the import process. If you need to automate the upload from the client site, a Windows scheduled task that runs the export and transmits it would probably be doable as well.

  • Multiple datasets: a powerful side of this is keeping snapshots of every data set. Knowing exactly what was uploaded, and when, turns out to be very handy when you need to see what was received, sent, or processed on a given day.

With these types of data you might want to store a few different datasets, so you might want to consider creating as much standardization as possible. Without knowing how many locations use flat files vs. databases, it's hard to say more, but you might be able to ship a small applet or Access database that can crunch Excel files and upload them (a rough sketch of the Excel part follows). There are a lot of different ways of doing this that would keep it easy on the non-technical users. The holy grail, of course, is being able to go in and get the data yourself, or pre-programming each location's system to send out its data in the format you want on a scheduled basis so they don't have to touch it.
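
For the Excel-crunching idea, a hedged sketch assuming the Apache POI library is available; the file name is made up:

    import java.io.File;
    import org.apache.poi.ss.usermodel.Cell;
    import org.apache.poi.ss.usermodel.DataFormatter;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.ss.usermodel.WorkbookFactory;

    // Read the first sheet of a workbook into comma-separated lines
    // before upload. "site-data.xls" is a placeholder file name.
    public class ExcelReader {
        public static void main(String[] args) throws Exception {
            try (Workbook wb = WorkbookFactory.create(new File("site-data.xls"))) {
                Sheet sheet = wb.getSheetAt(0);
                DataFormatter fmt = new DataFormatter(); // renders cells as displayed text
                for (Row row : sheet) {
                    StringBuilder line = new StringBuilder();
                    for (Cell cell : row) {
                        if (line.length() > 0) line.append(',');
                        line.append(fmt.formatCellValue(cell));
                    }
                    System.out.println(line);
                }
            }
        }
    }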

The above might be simple -- but it might be what works best.

Jas Panesar
"With these types of data you might want to store a few different datasets, so you might want to consider creating"creating what? the sentence just ended there :(
Alex Beardsley
Sorry, finished the sentence! It sounds like a fun project, assuming you can move towards standardizing as much as possible without making it harder to use.
Jas Panesar
A: 

Using XML is relevant since you are dealing with interoperability across external organizations, but it is also the heaviest alternative. I suggest considering JSON for the transport: it is also standardized and widely supported, but much more lightweight.
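
For instance, a minimal sketch of sending a batch of rows as JSON, assuming the Jackson library is on the classpath (the row contents are invented):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // The same rows you would serialize to XML can go out as a much
    // smaller JSON payload. Field names here are invented.
    public class JsonTransport {
        public static void main(String[] args) throws Exception {
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("location_id", "site-042");
            row.put("reading", 17.5);

            ObjectMapper mapper = new ObjectMapper();
            String payload = mapper.writeValueAsString(Collections.singletonList(row));
            System.out.println(payload); // [{"location_id":"site-042","reading":17.5}]
        }
    }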

As for the overall project, I must agree with the others that you deserve our sympathies. This would be a difficult project to coordinate even with technical personnel at each location. Without technical people for coordination, I anticipate failure, or at best a very expensive but minimally satisfactory success.

A crucial issue will be the quantity and complexity of the data at each location. It is difficult to make recommendations without knowing that, particularly whether this project crosses the fuzzy threshold between programming in-the-small or in-the-large.

However, what can definitely be said is that this project will be MUCH more difficult if you do not control both sides of the communication. I would pursue placing an application at each location that communicates with your central application at your location, so that you can control both sides of the communication.

Then I would work towards configuring and extending the applications at those remote locations from your location, dynamically. Ideally, design the remote applications to simply be slaves that respond to your master--they should run continuously and check in with the master for configuration, updates, and commands to send data. Once you have that deployed with minimal functionality, you can focus on connecting to the data sources and explore your options from there. Meanwhile, you will not have to bother the people at the remote location unless your slave "disappears".
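
A bare-bones sketch of that check-in loop; the master URL, agent id, and command vocabulary are all invented for illustration:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Scanner;

    public class SlaveAgent {
        private static final String MASTER = "https://example.com/master/checkin"; // hypothetical endpoint

        public static void main(String[] args) throws Exception {
            while (true) {
                try {
                    HttpURLConnection conn =
                            (HttpURLConnection) new URL(MASTER + "?agent=site-042").openConnection();
                    try (InputStream in = conn.getInputStream();
                         Scanner s = new Scanner(in).useDelimiter("\\A")) {
                        String command = s.hasNext() ? s.next() : "";
                        // Invented vocabulary: "NOOP", "SEND_DATA", "UPDATE_CONFIG <url>", ...
                        if (command.startsWith("SEND_DATA")) {
                            // run the configured export and POST the result back to the master
                        }
                    }
                } catch (Exception e) {
                    // Master unreachable: log locally and keep polling, so the
                    // remote site never needs a human to restart anything.
                }
                Thread.sleep(15 * 60 * 1000L); // check in every 15 minutes
            }
        }
    }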

When you control both sides, then you can choose whatever tools, technologies, and components seem to fit, and change your mind at will.

Rob Williams
A: 

Isn't this the problem that ETL tools like Informatica and Microsoft DTS were invented to solve? Or did I miss something important?

duffymo