views:

202

answers:

1

I'm building an expression tree dependency analyzer for a cross data source IQueryProvider.

That is, I have an IQueryable with some elements that can be executed locally in memory against some arbitrary provider (say Entity Framework). Some other elements in the IQueryable go against an entity that I need to make a remote WCF call. The WCF operation takes a serialized expression tree, will deserialize it, execute the LINQ query against its own local data store (lets also say Entity Framework), then send me back the results (though this mechanism could just as easily be a WCF Data Services DataServiceQuery...but I'm not using it because it's level of functional support is limited...at best). Once I get the results back from the WCF service, I will perform the result of the LINQ query in memory against the locally executed LINQ query.

So, what's so hard about that? Well, I need to determine the dependencies of the expression tree, so that my local underlying Query Provider won't explode trying to execute my LINQ query which has components that can only be executed on the remote WCF service...and vice versa.

Let's take the simple scenario:

  var result = 
   (from entityX in new Query<MyEntityX>()
   from entityY in new Query<MyEntityY>()
   where entityX.SomeProperty == "Hello" &&
   entityY.SomeOtherProperty == "Hello 2" && entityX.Id == entityY.XId).ToList();

Query<T> is a simple queryable wrapper with my own provider which has the chance to parse the tree an figure out what to do before swapping out roots with a different query provider. So, in the above case I need to:

  1. Execute the query against MyEntityA using a local object context and apply only the myEntityX.SomeProperty == "Hello" criteria. That is, run the following locally:

    // assume the functionality for replacing new Query<MyEntityA> with new
    // ObjectContext<MyEntityA>() is already there...
    var resultX = (from entityX in new Query<MyEntityX>()
    where entityX.SomeProperty == "Hello").ToList().AsQueryable();

  2. Send over the following serialized and have it execute on my remote WCF service, then get the results back.

    // Send the preceeding expression over the over the wire
    // and get the results back (just take my word this already works)
    var resultY = (from entityY in new Query<MyEntityY>()
    where entityY.SomeOtherProperty == "Hello 2").ToList().AsQueryable();

  3. Execute the following in memory:

    var finalResult = (from entityX in resultX
    from entityY in resultY
    where entityX.SomeProperty == "Hello" &&
    entityY.SomeOtherProperty == "Hello 2" &&
    entityX.Id == entityY.XId).ToList();

Note that the solution must incorporate a way of accumulating criteria that is specified off of projections too...like

var result = 
(from i in  
  (from entityX in new Query<MyEntityX>()  
   from entityY in new Query<MyEntityY>()  
   select new { PropX = entityX, PropY = entityY })  
where  
   i.PropX.SomeProperty == "Hello" && i.PropY.SomeOtherProperty == "Hello 2"  
   && i.PropX.Id == i.PropY.XId  
select i)  
.ToList();

This should result in the same two individual queries above being actually issued before the rest is evaluated in memory. On an unrelated note, I think I will probably used PLINQ and or DRYAD for running the in memory operations with improved performance.

So, I have some ideas (like doing some passes over the tree with a visitor and accumulating candidates for a given entity type), but I'm looking for some other peoples' suggestions about how to accumulate the parts my expression tree that can be executed against a given context...that is, knowing that one where criteria applies to one underlying new Query<T> and another criteria applies to a different one...so that I can figure out what I can do against data store 1, what I can do against data store 2 and what I need to do in memory, and execute the different parts of the tree accordingly. It's sort of like a Funcletizer, but a bit more complex...

Thanks for any assistance.

+2  A: 

This is an interesting problem. I would consider an approach consisting of several phases:

  1. expression analysis (probably bottom-up) of the expression tree and tagging of nodes as “remote”, “local”, and “neutral”
  2. top-down identification of “remote” subexpressions
  3. remote query execution (subexpression elimination)
  4. local query execution

The following gives more details for each phase. The Remarks section at the end of my answer provides some important notes to consider.

Disclaimer: My answer is rather schematic and I'm sure it doesn't cover a lot of aspects and cases that may happen in respect to the semantics of individual operations allowed in an expression tree. I think certain compromises will have to be made to make the implementation reasonably simple.

Phase 1: Expression analysis and tagging

Each node in the expression tree can be considered to fall within the following categories:

  • “remote” nodes correspond to operations that must be executed remotely;
  • “local” nodes correspond to operations that must be executed locally;
  • “neutral” nodes correspond to operations that can be executed by any query processor.

A bottom-up approach for traversing and processing the expression tree seems as appropriate for this case. The reason is when processing a given node X, having subnodes Y_1 to Y_n the category of the node X heavily depends on the categories of its subnodes Y_1 to T_n.

Let's rewrite the sample you provided:

entityX.SomeProperty == "Hello" &&
entityY.SomeOtherProperty == "Hello 2" && 
entityX.Id == entityY.Id

into an outline of the corresponding expression tree:

&&(&&(==(Member(SomeProperty, Var(entityX)), "Hello"), 
      ==(Member(SomeOtherProperty, Var(entityY)), "Hello 2")),
   ==(Member(Id, Var(entityX)), Member(Id, Var(entityY)))

This expression tree will then be tagged bottom-up. R for “remote”, L for “local”, N for “neutral”. Providing entityX is remote and entityY is local the result will look like this:

L:&&(L:&&(R:==(R:Member(SomeProperty, R:Var(entityX)), N:"Hello"), 
          L:==(L:Member(SomeOtherProperty, L:Var(entityY)), N:"Hello 2")),
     L:==(R:Member(Id, R:Var(entityX)), L:Member(Id, L:Var(entityY)))

As you can see, for each operator your analyzer will have to decide the category based on the categories of its subnodes. In the example above:

  • doing a property access on an object will yield the same category as the object has;
  • a string literal will be neutral;
  • an equality comparison of a local and a remote subexpression will have the local category;
  • the and operator will again favor local over remote.

However, you might consider combining the bottom-up approach with a top-down optimization pass to get better results. Consider this (symbolic): R == R + L. How do you want to execute the equality comparison? With a pure bottom-up approach you'd execute it locally. However, in some situations it might be better to precalculate L locally, replace the subexpression with an actual value (i.e. neutral) and execute the equality comparison remotely. In other words, you can end-up implementing a query plan optimizer.

Phase 2: Identification of remote subexpressions

In the next phase, the tagged expression tree will be processed top-down and each subexpression marked as remote taken out of the tree, and enlisted among the set of expressions evaluated remotely for each item in the remote data set.

From the above it's clear that certain remote subexpressions will encapsulate local subexpression. And, consequently, local subexpressions may contain remote subexpressions. Only neutral nodes shall represent subexpressions that are homogeneous it terms of category.

Hence it may be necessary to execute a given query with several round-trips to the remote query processor. An alternate approach would be to allow bi-directional communications between the query processors, so that the “remote” processor can identify a “local” (actually “remote” in from its point of view) subexpression and call back the “local” processor to execute it.

Phase 3: Remote query execution

In the third phase the list of remote subexpressions will be sent to the “remote” query processor for execution. (See also discussion in the previous phase.)

The question also is, how to identify subexpressions that can be used to effectively limit the resulting data set returned by the remote query processor. To do this, the semantics of top-level operators in the expression tree (usually && and ||) have to be taken into account. Short-circuit evaluation of && and || complicates the things a bit because the query preprocessor may not reorder operands.

Phase 4: Local query execution

When all remote subexpression are executed, their occurrences in the original expression tree are replaced with gathered results.

Remarks

  • You may end up with the necessity to limit only certain operations to be allowed in “remote” subtrees to reduce processing complexity — it will be a trade-off between capabilities and time spent on implementing the query pre-processor.

  • To handle data aliasing (like in the PropX = entityX … i.PropX.SomeProperty == "Hello" example you provided) you will have to perform data flow analysis. Here you will most likely end-up with a set of cases that will be to complicated to be worth handling.

Ondrej Tucny
Thank you! This is exactly what I was asking for
JeffN825
Great to hear that! Are you planning to open source the solution? I'd be happy to follow your progress and probably contribute.
Ondrej Tucny
Thanks again, more than anything else, I just wanted to hear how someone else would approach the problem so as to see if it was radically different than my own ideas. I hadn't considered open sourcing it, but given the complexity and potential benefit of having people smarter than I am contribute to it, I will put some serious thought into it.
JeffN825
Alright, I'll be happy to keep discussing about it via email (tucny(at)boldbrick(dot)com). However, enjoy coding it, it'll be fun for sure.
Ondrej Tucny