
I'm currently in charge of a process that seems to be very intimate with the database. My program/script/framework's goal is to make uniformity out of disparate data sources. Using a form of dependency injection, my process at a very high level works fine. The implementation of each data source type is hidden from the highest-level business abstraction of what's going on. Great. I have two questions.

1) I have a long paragraph of code (and it's the length that's bothering me) that assembles, in Perl-space, an SQL statement describing how to translate these different data sources into one homogeneous end format. So the SQL string always depends on the type of data I'm working with. The WHERE clause depends, the FROM clause depends, the INSERT clause depends, it all depends. It's the high level of depending-ness that's confusing me. How do I model this process in an object-oriented way? MagicObject->buildSQL? That's essentially what I have now, but it feels like all of the parts of the code know too much, hence its length.
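For illustration, here is a condensed sketch of the kind of conditional assembly I mean (the table names and the $source->type accessor are made up for this example):

# Hypothetical sketch; every clause branches on the source type.
my ( $from, $where );
if ( $source->type eq 'normalized_csv' ) {
    $from  = 'staging_csv s JOIN lookup l ON s.code = l.code';
    $where = 's.batch_id = ?';
}
elsif ( $source->type eq 'flat_file' ) {
    $from  = 'staging_flat';
    $where = 'import_date = ?';
}
my $sql = "INSERT INTO unified (name, dob) SELECT name, dob FROM $from WHERE $where";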

2) If I have a function that does something (builds SQL?), do I pass in the business objects whole and then stringify them at the last minute? Or do I stringify them early and only let my function handle what it needs, as opposed to rendering the objects itself?

Edit: While I don't doubt the importance of ORMs, I do not believe we are yet in the ORM space. Imagine baseball data for the American, National, and Fictional leagues were all stored in wildly different formats with varying levels of normalization. It is the job of my process to read these data sources and put them into one unified, normalized pool. I feel the ORM space of acting on these objects begins after my process ends. I'm a sort of data janitor, if you will. There are essentially no business objects to act on yet, because the unified pool they would live in is exactly what I'm creating.

Edit^2: It's been brought to my attention that maybe I haven't described the problem space in enough detail. Here's an example.

Imagine you had to make a master database of all the criminals in the United States. Your company's service is selling a product which sits atop and provides access to this data in a clean, unified format.

This data is provided publicly by the 50 states, but in wildly different formats. Some are a single, unnormalized file of data. Others are normalized tables in CSV format. Some are Excel documents. Some are TSVs. Some records are not even complete without manual intervention (other, manually created data sources).

The purpose of my project is to write a "driver" for each of the 50 states and make sure the end product of the process is a master database of criminals in a perfect, relational model. Everything keyed correctly, the schema in perfect shape, etc.
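To sketch the driver idea (all names here are invented for illustration), each state gets a package that knows only how to turn its own format into the common record shape:

package Driver::NewYork;    # hypothetical driver
use parent 'Driver::Base';

# Every driver implements extract(), returning records in the
# common shape; callers never see the source format.
sub extract {
    my ($self) = @_;
    my @records;
    open my $fh, '<', $self->source_file or die $!;
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $name, $dob, $offense ) = split /\t/, $line;   # say NY ships TSVs
        push @records, { name => $name, dob => $dob, offense => $offense };
    }
    return @records;
}

1;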

A: 

I think you're describing dynamic SQL: building the query programmatically at runtime. This is a common feature of Object Relational Mappers such as LINQ to SQL and LLBLGenPro. Building one yourself is no small task.

Generally, ORMs objectify the SQL language. You write a sort of "SQL Document Object Model (DOM)" that allows you to build SQL queries programmatically by representing them (for example) as a "Request" object. You then set properties on the Request object such as a Column collection, Table collection, and Join collection (these are just examples of one approach). The result would be an SQL request string, exposed as a property of the Request object.
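A minimal sketch of that idea in Perl (the class and method names are invented for illustration, not any particular ORM's API):

package SQL::Request;    # invented name, illustration only
use Moose;

has tables  => ( is => 'rw', isa => 'ArrayRef[Str]', default => sub { [] } );
has columns => ( is => 'rw', isa => 'ArrayRef[Str]', default => sub { [] } );
has wheres  => ( is => 'rw', isa => 'ArrayRef[Str]', default => sub { [] } );

# The SQL string is exposed as a derived property of the object.
sub sql {
    my ($self) = @_;
    return 'SELECT ' . join( ', ', @{ $self->columns } )
         . ' FROM '  . join( ', ', @{ $self->tables } )
         . ( @{ $self->wheres }
               ? ' WHERE ' . join( ' AND ', @{ $self->wheres } )
               : '' );
}

1;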

You must also make it possible for the Request object to read the schema definition of your data sources. You mention that your WHERE clause is type-dependent. Your SQL assembler must therefore be able to read the schema and build the clause appropriately.

This may be overkill for your case. I think the fundamental question is, do you absolutely require dynamic SQL queries, or is there a less complex option that will satisfy your requirements?

Dave Swersky
+8  A: 

Please do not write your own ORM. Use something like DBIx::Class.

All of these problems that you mention have been solved, and the implementation tested in thousands of other applications. Stick to writing your app, not reimplementing libraries. You might not actually use DBIC in your app, but you should look at its implementation approach; especially how it incrementally builds ResultSets (which aren't sets of results, but are rather deferred queries).
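For a taste of that approach (the schema and column names here are hypothetical), a ResultSet can be refined step by step, and no SQL runs until you actually fetch:

my $rs = $schema->resultset('Criminal');           # no query executed yet
$rs = $rs->search( { state => 'NY' } );            # still deferred
$rs = $rs->search( { offense => 'arson' },
                   { order_by => 'surname' } );    # still deferred

my @rows = $rs->all;                               # SQL is built and run here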

jrockway
I updated the OP given your response, as I feel this arena isn't appropriate for an ORM. That, or I haven't seen this type of functionality expressed in ORMs.
Mark Canlas
this IS what ORMs do.
Javier
People like reinventing ORMs because it's more fun than actual work, and "learning is hard". Not my problem, though... I let mst write my ORM :)
jrockway
+1  A: 

From a purely coding point of view: you have a long and complex piece of code on your hands. You don't like it. Why? I can only assume that there is some code duplication in there. Otherwise, what's not to like? So, refactor it to eliminate the duplication... I know it sounds trite, but since you don't post the code, it's hard to be more specific. Maybe have an object that has methods for the FROM, WHERE, and INSERT clauses, so that the SQL's infrastructure is not duplicated? I just don't know what to do, exactly, but eliminating the duplication is key.
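For instance, one rough sketch of such an object (names invented): a base class owns the assembly, and each data source's subclass supplies only the clauses that differ:

package Translator::Base;    # hypothetical sketch

# Subclasses override the clause methods; the assembly lives here, once.
sub insert_clause { die 'override in subclass' }
sub from_clause   { die 'override in subclass' }
sub where_clause  { die 'override in subclass' }

sub sql {
    my ($self) = @_;
    return join ' ',
        $self->insert_clause, $self->from_clause, 'WHERE', $self->where_clause;
}

1;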

Arkadiy
+6  A: 

If you don't want an ORM, but you want to assemble SQL from bits without direct string manipulation/concatenation, take a look at Fey, which may do what you want.

Update: Aristotle Pagaltzis's answer is much better. He actually gave examples of what Fey looks like and how it can help.

Dave Rolsky
hey buddy, you're famous =). currently looking at fey
Mark Canlas
+8  A: 

You want to look at Fey. I started using it a few months ago on the job, and while the implementation still has rough corners due to its young age, the idea behind it is solid. F.ex., take a query lightly adapted from the manual:

my $user = $schema->table( 'user' );
my $q = Fey::SQL
    ->new_select
    ->select( $user->columns( 'user_id', 'username' ) )
    ->from( $user );

Now you could write a function like this:

sub restrict_with_group {
    my ( $q, $table, @group_id ) = @_;
    my $group = $schema->table( 'group' )->alias;
    $q
     ->from( $table, $group )
     ->where( $group->column( 'group_id' ), 'IN', @group_id );
}

This will add an inner join from user to group as well as a WHERE condition. And voila, you can write the following in the main program:

restrict_with_group( $q, $user, qw( 1 2 3 ) );

But this restrict_with_group function will work for any query that has a foreign key to the group table! To use it, you pass the query you want to restrict and the table to which you want to apply the restriction, as well as the group IDs to which you want to restrict it.

In the end you say $q->sql( $dbh ) and you get back a string of SQL representing the query that you have built up in the $q object.

So basically Fey gives you the abstractive capabilities that native SQL is missing. You can extract reusable aspects from your queries and package them as separate functions.

Aristotle Pagaltzis
I have to chew on it for a while longer for it to really sink in. And, I don't think I'll be using Fey per se, but the approach certainly is pretty clear. Thank you for the great example!
Mark Canlas
I ended up going with a Fey-esque solution. It was based on Perl objects that represent queries, whose parts could be freely modified as it was passed around.
Mark Canlas
A: 

If you don't want to deal with an ORM, I often have code like this:

my (@columns,@tables,@wheres,@order_bys,@values);

... # Push values onto these arrays as needed.
... # Use ? placeholders for values that need quoting.

# Build SQL statement
my $sql = "select ".join(",",@columns).
    " from ".join(",",@tables).
    " where ".join(" and ",@wheres).
    " order by ".join(",",@order_bys);

my $sth = $dbh->prepare($sql);
$sth->execute(@values);

Simple, no need for an ORM, very customisable. Plus, I always find ORMs too heavy for the volume of data I'm dealing with, but that's another subject.

Mathieu Longtin
Down that path lies madness. I did that for a long time. It works as long as your queries need not be flexible in structure, but it crumbles rapidly if you try to customise them dynamically.
Aristotle Pagaltzis
You need to look at Fey on CPAN ;)
Dave Rolsky
Fey looks good. It just wasn't around when I started using the above, five years ago :)
Mathieu Longtin
+1  A: 

Unless I'm misunderstanding, this seems like an ETL (Extract/Transform/Load) application that hasn't figured out how to keep the three stages separate.

If the output model is only a table or two then you're probably just as well off using SQL. Otherwise, and especially if there are relationships between the tables you're inserting to, a decent ORM should simplify things.

Taking the 50-state idea, you can't really get away from having 50 "extract" processes, hopefully with a library of shared routines. I'd attack the problem one input source at a time, refactoring as I added new ones but being careful to encapsulate the variable parts so that I know exactly where changes will need to be made when a supplier changes their format.

The "transform" part shouldn't be too onerous: just take what you got and prepare it for output.

Mike Woodhouse
A: 

It seems to me that the approach you are taking to solve this problem may need a second look. You currently have multiple data sources which you need to treat as if they were a single data source. So why keep them as separate data sources?

Depending on how frequently the data is updated (or, looking at performance, how often it is accessed), you could combine the data into an intermediate data store such as SQLite. If the data from each state has a translator that takes it from its native format into your common format in an SQLite table, then you can use your choice of methods to access it.

This method also allows for flexibility as your data access needs change. For example, if you are asked, "How many blond drivers had speeding tickets in each state?", the SQLite database can answer that with a single query, while other solutions would likely require pulling back a set of data that then needs to be parsed, grouped, and formatted for output.
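A sketch of that staging approach with plain DBI (the schema is invented for the example, and the tickets table is assumed to exist already):

use DBI;

# Stage everything into one SQLite file, then query it directly.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=staging.db', '', '',
                        { RaiseError => 1 } );

# Each state's translator pushes rows into the common table.
my $ins = $dbh->prepare(
    'INSERT INTO tickets (state, hair_color, offense) VALUES (?, ?, ?)'
);
$ins->execute(@$_) for @translated_rows;    # rows produced by the drivers

# The cross-state question becomes a single query.
my $counts = $dbh->selectall_arrayref(q{
    SELECT state, COUNT(*) AS n
    FROM tickets
    WHERE hair_color = 'blond' AND offense = 'speeding'
    GROUP BY state
});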

Jack M.