I have a large collection of data in an Excel file (and CSV files). The data needs to be placed into a database (MySQL). However, before it goes into the database it needs to be processed. For example, if column 1 is less than column 3, add 4 to column 2. There are quite a few rules that must be followed before the information is persisted.

What would be a good design to follow to accomplish this task? (using Java)

Additional notes

The process needs to be automated, in the sense that I don't have to manually go in and alter the data. We're talking about thousands of lines of data with 15 columns of information per line.

Currently, I have a sort of chain-of-responsibility design set up: one class (Java) for each rule. When one rule is done, it calls the following rule.
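
Roughly, my current setup looks something like this (heavily simplified, and the class and method names here are changed for illustration):

import java.util.List;

// each rule does its work on a row, then hands the row to the next rule in the chain
public abstract class Rule {
    private Rule next;

    public Rule setNext(Rule next) {
        this.next = next;
        return next;
    }

    public void process(String[] row, List<String> errors) {
        apply(row, errors);
        if (next != null) {
            next.process(row, errors);
        }
    }

    protected abstract void apply(String[] row, List<String> errors);
}

// example concrete rule: entry dates cannot be in the future
class NoFutureDateRule extends Rule {
    protected void apply(String[] row, List<String> errors) {
        // parse the date column and add a message to errors if it is in the future
    }
}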

More Info

Typically there are about 5000 rows per data sheet. Speed isn't a huge concern because this large input doesn't happen often.

I've considered Drools, but I wasn't sure the task was complicated enough for it.

Example rules:

  1. All currency (data in specific columns) must not contain currency symbols.

  2. Category names must be uniform (e.g. book case = bookcase)

  3. Entry dates cannot be future dates

  4. Text input can only contain [A-Z 0-9 \s]

etc..
Additionally, if any column of information is invalid it needs to be reported when processing is complete (or maybe processing should stop).

My current solution works. However, I think there is room for improvement, so I'm looking for ideas on how it can be improved and/or how other people have handled similar situations.


A: 

A class for each rule? Really? Perhaps I'm not understanding the quantity or complexity of these rules, but I would (semi-pseudo-code):

public class ALine {
    private int col1;
    private int col2;
    private int colN;
    // ...

    public ALine(String line) {
        // parse the row into the private fields
        // ...

        this.process();
        this.insert();
    }

    public void process() {
        // apply all your rules here, working with the local variables
    }

    public void insert() {
        // write to the DB
    }
}

// then, for each line of the CSV:
for (String line : csvLines) {
    new ALine(line);
}
Oli
A: 

Your methodology of using a class for each rule does sound a bit heavyweight, but it has the advantage of being easy to modify and extend should new rules come along.

As for loading the data, bulk loading is the way to go. I have read some information which suggests it may be as much as three orders of magnitude faster than loading using insert statements. You can find some information on it here.
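
For MySQL specifically, the usual bulk-load route is LOAD DATA INFILE. A minimal JDBC sketch (the table name, file name and connection details are made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        // Class.forName("com.mysql.jdbc.Driver"); // may be needed with older drivers
        // allowLoadLocalInfile must be enabled for LOCAL loads with Connector/J
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb?allowLoadLocalInfile=true", "user", "password");
        try {
            Statement stmt = conn.createStatement();
            stmt.execute(
                "LOAD DATA LOCAL INFILE 'cleaned.csv' " +
                "INTO TABLE inventory " +
                "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' " +
                "LINES TERMINATED BY '\\n' " +
                "IGNORE 1 LINES");
            stmt.close();
        } finally {
            conn.close();
        }
    }
}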

stimms
A: 

Bulk load the data into a temp table, then use SQL to apply your rules. Use the temp table as the basis for the insert into the real table, then drop the temp table.
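
In MySQL terms the flow might look roughly like this (a sketch only; the table and column names are invented to match the example rules):

import java.sql.Connection;
import java.sql.Statement;

public class TempTableLoad {
    // conn is an open JDBC connection to the MySQL database
    static void loadViaTempTable(Connection conn) throws Exception {
        Statement stmt = conn.createStatement();
        stmt.execute("CREATE TEMPORARY TABLE staging LIKE inventory");
        stmt.execute("LOAD DATA LOCAL INFILE 'raw.csv' INTO TABLE staging "
                + "FIELDS TERMINATED BY ',' IGNORE 1 LINES");
        // apply the rules in SQL, e.g. strip currency symbols and normalise categories
        stmt.execute("UPDATE staging SET price = REPLACE(price, '$', '')");
        stmt.execute("UPDATE staging SET category = 'bookcase' WHERE category = 'book case'");
        // the rule from the question: if col1 < col3, add 4 to col2
        stmt.execute("UPDATE staging SET col2 = col2 + 4 WHERE col1 < col3");
        // copy the cleaned rows into the real table, skipping future-dated entries
        stmt.execute("INSERT INTO inventory SELECT * FROM staging WHERE entry_date <= CURDATE()");
        stmt.execute("DROP TEMPORARY TABLE staging");
        stmt.close();
    }
}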

EvilTeach
+1  A: 

I think your method is OK, especially if you use the same interface for every processor.

You could also look at something called Drools, currently JBoss Rules. I used it some time ago for a rule-heavy part of my app, and what I liked about it is that the business logic can be expressed in, for instance, a spreadsheet or a DSL, which then gets compiled to Java (at run time, and I think there's also a compile-time option). It makes rules a bit more succinct and thus more readable. It's also very easy to learn (two days or so).

Here's a link to the open-source JBoss Rules. At jboss.com you can undoubtedly purchase an officially maintained version if that's more to your company's taste.

extraneon
A: 

Hi oneBelizean,

As you can see, the different answers are all coming from people's own experience and perspective.

Since we don't know much about the complexity and number of rows in your system, we tend to give advice based on what we have done before.

If you want to narrow it down to one or two solutions for your implementation, try giving more details.

Good luck

anjanb
+1  A: 

If I didn't care to do this in one step (as Oli suggests), I'd probably use a pipes-and-filters design. Since your rules are relatively simple, I'd probably do a couple of delegate-based classes. For instance (C# code, but Java should be pretty similar... perhaps someone could translate?):

using System;
using System.Collections.Generic;

interface IFilter {
   IEnumerable<string> Filter(IEnumerable<string> file);
}

class PredicateFilter : IFilter {
   private readonly Predicate<string> predicate;

   public PredicateFilter(Predicate<string> predicate) {
      this.predicate = predicate;
   }

   public IEnumerable<string> Filter(IEnumerable<string> file) {
      foreach (string s in file) {
         if (predicate(s)) {
            yield return s;          // keep only lines that pass the predicate
         }
      }
   }
}

class ActionFilter : IFilter {
   private readonly Action<string> action;

   public ActionFilter(Action<string> action) {
      this.action = action;
   }

   public IEnumerable<string> Filter(IEnumerable<string> file) {
      foreach (string s in file) {
         action(s);                  // side effect (e.g. reporting), then pass the line through
         yield return s;
      }
   }
}

class ReplaceFilter : IFilter {
   private readonly Func<string, string> replace;

   public ReplaceFilter(Func<string, string> replace) {
      this.replace = replace;
   }

   public IEnumerable<string> Filter(IEnumerable<string> file) {
      foreach (string s in file) {
         yield return replace(s);    // transform each line
      }
   }
}

From there, you could either use the delegate-based filters directly, or subclass them for the specifics. Then register them with a Pipeline that passes the lines through each filter in turn.
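
A rough Java equivalent of the same idea (this is my sketch, not Mark's code; it assumes Java 8+ for java.util.function, and I've left out the ActionFilter for brevity):

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

interface Filter {
    List<String> filter(List<String> lines);
}

class PredicateFilter implements Filter {
    private final Predicate<String> predicate;

    PredicateFilter(Predicate<String> predicate) {
        this.predicate = predicate;
    }

    public List<String> filter(List<String> lines) {
        List<String> out = new ArrayList<String>();
        for (String s : lines) {
            if (predicate.test(s)) {
                out.add(s);            // keep only lines that pass the rule
            }
        }
        return out;
    }
}

class ReplaceFilter implements Filter {
    private final Function<String, String> replace;

    ReplaceFilter(Function<String, String> replace) {
        this.replace = replace;
    }

    public List<String> filter(List<String> lines) {
        List<String> out = new ArrayList<String>();
        for (String s : lines) {
            out.add(replace.apply(s)); // transform each line
        }
        return out;
    }
}

class Pipeline {
    private final List<Filter> filters = new ArrayList<Filter>();

    Pipeline add(Filter f) {
        filters.add(f);
        return this;
    }

    List<String> run(List<String> lines) {
        for (Filter f : filters) {
            lines = f.filter(lines);   // each filter's output feeds the next
        }
        return lines;
    }
}

Usage would then be something like new Pipeline().add(new ReplaceFilter(s -> s.replace("$", ""))).add(new PredicateFilter(s -> s.matches("[A-Z0-9 ]*"))).run(rawLines).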

Mark Brackett
+1  A: 

Just create a function to enforce each rule, and call every applicable function for each value. I don't see how this requires any exotic architecture.
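
In Java that could be as simple as a class of static methods, one per rule (a sketch; the column indexes and method names are invented):

class RowCleaner {
    // each method enforces one rule on a row of 15 columns
    static void stripCurrencySymbols(String[] row) {
        row[4] = row[4].replace("$", "");   // column 4 as an example currency column
    }

    static void normaliseCategory(String[] row) {
        if (row[6].equals("book case")) {
            row[6] = "bookcase";
        }
    }

    static void clean(String[] row) {
        stripCurrencySymbols(row);
        normaliseCategory(row);
        // ... and so on for the remaining rules
    }
}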

Seun Osewa
A: 

It may not be what you want to hear, and it isn't the "fun way" by any means, but there is a much easier way to do this.

So long as your data is evaluated line by line, you can set up another worksheet in your Excel file and use spreadsheet-style functions to do the necessary transforms, referencing the data from the raw data sheet. For more complex functions you can use the VBA embedded in Excel to write custom operations.

I've used this approach many times and it works really well; it's just not very sexy.