ansaurus

Question

Answer 1

+2 A:

This is an interesting problem. I would consider an approach consisting of several phases:

expression analysis (probably bottom-up) of the expression tree and tagging of nodes as “remote”, “local”, and “neutral”
top-down identification of “remote” subexpressions
remote query execution (subexpression elimination)
local query execution

The following gives more details for each phase. The Remarks section at the end of my answer provides some important notes to consider.

Disclaimer: My answer is rather schematic, and I'm sure it doesn't cover many of the aspects and cases that may arise with respect to the semantics of the individual operations allowed in an expression tree. I think certain compromises will have to be made to keep the implementation reasonably simple.

Phase 1: Expression analysis and tagging

Each node in the expression tree can be considered to fall within the following categories:

“remote” nodes correspond to operations that must be executed remotely;
“local” nodes correspond to operations that must be executed locally;
“neutral” nodes correspond to operations that can be executed by any query processor.

A bottom-up approach to traversing and processing the expression tree seems appropriate in this case. The reason is that, when processing a given node X with subnodes Y_1 to Y_n, the category of X depends heavily on the categories of its subnodes Y_1 to Y_n.

Let's rewrite the sample you provided:

entityX.SomeProperty == "Hello" &&
entityY.SomeOtherProperty == "Hello 2" && 
entityX.Id == entityY.Id

into an outline of the corresponding expression tree:

&&(&&(==(Member(SomeProperty, Var(entityX)), "Hello"), 
      ==(Member(SomeOtherProperty, Var(entityY)), "Hello 2")),
   ==(Member(Id, Var(entityX)), Member(Id, Var(entityY))))

This expression tree will then be tagged bottom-up, using R for “remote”, L for “local”, and N for “neutral”. Provided that entityX is remote and entityY is local, the result will look like this:

L:&&(L:&&(R:==(R:Member(SomeProperty, R:Var(entityX)), N:"Hello"), 
          L:==(L:Member(SomeOtherProperty, L:Var(entityY)), N:"Hello 2")),
     L:==(R:Member(Id, R:Var(entityX)), L:Member(Id, L:Var(entityY))))

As you can see, for each operator your analyzer will have to decide the category based on the categories of its subnodes. In the example above:

doing a property access on an object will yield the same category as the object has;
a string literal will be neutral;
an equality comparison of a local and a remote subexpression will have the local category;
the && operator will again favor local over remote.
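The tagging rules above can be sketched in a few lines. This is only an illustration in Python, not tied to any particular expression tree API: the Node class, the tag and render functions, and the rule that remote wins over neutral are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, List

REMOTE, LOCAL, NEUTRAL = "R", "L", "N"

@dataclass
class Node:                      # hypothetical minimal expression node
    kind: str                    # "var", "const", "member", "==", "&&"
    value: Any = None            # variable name, literal, or member name
    children: List["Node"] = field(default_factory=list)
    tag: str = ""                # filled in by tag()

def combine(tags):
    """Local wins over remote; remote wins over neutral (assumed rule)."""
    if LOCAL in tags:
        return LOCAL
    return REMOTE if REMOTE in tags else NEUTRAL

def tag(node, sources):
    """Tag the tree bottom-up; `sources` maps variable names to R or L."""
    for child in node.children:
        tag(child, sources)
    if node.kind == "var":
        node.tag = sources[node.value]
    elif node.kind == "const":
        node.tag = NEUTRAL
    else:                        # member access and operators inherit from operands
        node.tag = combine([c.tag for c in node.children])
    return node.tag

def render(node):
    """Print the tree in the R:/L:/N: notation used in this answer."""
    if node.kind == "var":
        body = f"Var({node.value})"
    elif node.kind == "const":
        body = repr(node.value)
    elif node.kind == "member":
        body = f"Member({node.value}, {render(node.children[0])})"
    else:
        body = f"{node.kind}({', '.join(render(c) for c in node.children)})"
    return f"{node.tag}:{body}"

def member(name, obj):
    return Node("member", name, [Node("var", obj)])

def op(kind, *args):
    return Node(kind, None, list(args))

# The sample expression from the question.
tree = op("&&",
          op("&&",
             op("==", member("SomeProperty", "entityX"), Node("const", "Hello")),
             op("==", member("SomeOtherProperty", "entityY"), Node("const", "Hello 2"))),
          op("==", member("Id", "entityX"), member("Id", "entityY")))

tag(tree, {"entityX": REMOTE, "entityY": LOCAL})
print(render(tree))
```

A real analyzer would need one rule per operator kind instead of the single combine function used here.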

However, you might consider combining the bottom-up approach with a top-down optimization pass to get better results. Consider this (symbolic) expression: R == R + L. How should the equality comparison be executed? With a pure bottom-up approach you'd execute it locally. However, in some situations it might be better to precalculate L locally, replace the subexpression with its actual value (i.e. a neutral node) and execute the equality comparison remotely. In other words, you can end up implementing a query plan optimizer.
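A rough sketch of that optimization for the R == R + L case, again in Python with a hypothetical Node type. The fold_local function, the two supported operators, and the env dictionary of local values are all assumptions of the example.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Node:                      # hypothetical minimal expression node
    kind: str                    # "var", "const", "+", "=="
    value: Any = None
    children: List["Node"] = field(default_factory=list)

def tag(node, sources):
    """Bottom-up category: 'L' beats 'R', which beats 'N'."""
    if node.kind == "var":
        return sources[node.value]
    if node.kind == "const":
        return "N"
    tags = [tag(c, sources) for c in node.children]
    return "L" if "L" in tags else "R" if "R" in tags else "N"

def var_sources(node, sources):
    """Set of categories of all variables referenced in the subtree."""
    if node.kind == "var":
        return {sources[node.value]}
    found = set()
    for child in node.children:
        found |= var_sources(child, sources)
    return found

def evaluate(node, env):
    """Evaluate a subtree that only refers to local variables."""
    if node.kind == "var":
        return env[node.value]
    if node.kind == "const":
        return node.value
    left, right = (evaluate(c, env) for c in node.children)
    if node.kind == "+":
        return left + right
    if node.kind == "==":
        return left == right
    raise ValueError(f"unsupported operator: {node.kind}")

def fold_local(node, sources, env):
    """Top-down pass: replace each maximal local-only subtree by a constant."""
    if var_sources(node, sources) == {"L"}:
        return Node("const", evaluate(node, env))
    return Node(node.kind, node.value,
                [fold_local(c, sources, env) for c in node.children])

# r1 == r2 + l, where r1 and r2 are remote and l is a local value.
sources = {"r1": "R", "r2": "R", "l": "L"}
expr = Node("==", None, [Node("var", "r1"),
                         Node("+", None, [Node("var", "r2"), Node("var", "l")])])

before = tag(expr, sources)                  # category without the optimization
folded = fold_local(expr, sources, {"l": 5})
after = tag(folded, sources)                 # category once l is a constant
```

Whether folding pays off depends on the data: it only works when the local part does not vary per remote item, which this sketch simply assumes.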

Phase 2: Identification of remote subexpressions

In the next phase, the tagged expression tree is processed top-down. Each subexpression marked as remote is taken out of the tree and added to the set of expressions to be evaluated remotely for each item in the remote data set.

From the above it's clear that certain remote subexpressions will encapsulate local subexpressions. Conversely, local subexpressions may contain remote subexpressions. Only neutral nodes represent subexpressions that are homogeneous in terms of category.

Hence it may be necessary to execute a given query with several round trips to the remote query processor. An alternative approach would be to allow bidirectional communication between the query processors, so that the “remote” processor can identify a “local” (actually “remote” from its point of view) subexpression and call back to the “local” processor to execute it.
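The top-down extraction step of this phase might look roughly like the following Python sketch. It assumes the tags were already assigned in Phase 1 and that only whole, topmost remote subtrees are cut out. The Node type, the "slot" placeholder kind, and extract_remote are names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Node:                      # hypothetical node, already tagged in Phase 1
    kind: str
    value: Any = None
    children: List["Node"] = field(default_factory=list)
    tag: str = "N"               # "R", "L" or "N"

def extract_remote(node, remote):
    """Copy the tree, replacing each topmost remote subtree by a numbered slot.

    The removed subtrees are appended to `remote` in left-to-right order.
    """
    if node.tag == "R":
        remote.append(node)
        return Node("slot", len(remote) - 1)
    kids = [extract_remote(child, remote) for child in node.children]
    return Node(node.kind, node.value, kids, node.tag)

# The tagged sample tree from Phase 1, built by hand.
x = Node("var", "entityX", [], "R")
y = Node("var", "entityY", [], "L")
tree = Node("&&", None, [
    Node("&&", None, [
        Node("==", None, [Node("member", "SomeProperty", [x], "R"),
                          Node("const", "Hello")], "R"),
        Node("==", None, [Node("member", "SomeOtherProperty", [y], "L"),
                          Node("const", "Hello 2")], "L"),
    ], "L"),
    Node("==", None, [Node("member", "Id", [x], "R"),
                      Node("member", "Id", [y], "L")], "L"),
], "L")

remote = []
local_tree = extract_remote(tree, remote)
```

For the sample, two subexpressions end up in the remote list: the comparison on SomeProperty and the access to entityX.Id. The local tree keeps a slot where each of them used to be.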

Phase 3: Remote query execution

In the third phase the list of remote subexpressions will be sent to the “remote” query processor for execution. (See also discussion in the previous phase.)

Another question is how to identify subexpressions that can be used to effectively limit the data set returned by the remote query processor. To do this, the semantics of the top-level operators in the expression tree (usually && and ||) have to be taken into account. Short-circuit evaluation of && and || complicates things a bit because the query preprocessor may not reorder operands.
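One conservative way to respect the no-reordering constraint for a top-level && chain is to push only the leading run of remote operands to the remote side as a filter. The Python sketch below shows that idea. The Node type and the conjuncts and split_filter functions are hypothetical, and other strategies are certainly possible.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Node:                      # hypothetical node, already tagged in Phase 1
    kind: str
    value: Any = None
    children: List["Node"] = field(default_factory=list)
    tag: str = "N"               # "R", "L" or "N"

def conjuncts(node):
    """Flatten nested && nodes into a left-to-right list of operands."""
    if node.kind != "&&":
        return [node]
    parts = []
    for child in node.children:
        parts += conjuncts(child)
    return parts

def split_filter(root):
    """Split a && chain into (remote prefix, remaining operands).

    Only the leading remote or neutral operands are taken, so the
    evaluation order of the original expression is preserved.
    """
    parts = conjuncts(root)
    cut = 0
    while cut < len(parts) and parts[cut].tag in ("R", "N"):
        cut += 1
    return parts[:cut], parts[cut:]

# Leaves stand in for the three comparisons of the sample expression.
a = Node("==", "entityX.SomeProperty == 'Hello'", [], "R")
b = Node("==", "entityY.SomeOtherProperty == 'Hello 2'", [], "L")
c = Node("==", "entityX.Id == entityY.Id", [], "L")
root = Node("&&", None, [Node("&&", None, [a, b], "L"), c], "L")

remote_filter, rest = split_filter(root)
```

In the sample, only the first comparison can be used to limit the remote result set this way. The || operator would need its own handling, which this sketch does not attempt.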

Phase 4: Local query execution

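When all remote subexpressions have been executed, their occurrences in the original expression tree are replaced with the gathered results. The resulting tree, which now contains only local and neutral nodes, is then evaluated locally.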
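As a sketch of this last step, assume the remote processor returned one tuple of slot values per remote item, and the local tree contains numbered slot nodes where the remote subexpressions used to be. The Python below evaluates that tree against local data. The Node type, the slot convention, the evaluate function, and the sample data are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Node:                      # hypothetical minimal expression node
    kind: str                    # "slot", "const", "var", "member", "==", "&&"
    value: Any = None
    children: List["Node"] = field(default_factory=list)

def evaluate(node, slots, env):
    """Evaluate the local tree; `slots` holds the remote results for one item."""
    if node.kind == "slot":
        return slots[node.value]
    if node.kind == "const":
        return node.value
    if node.kind == "var":
        return env[node.value]
    if node.kind == "member":
        return evaluate(node.children[0], slots, env)[node.value]
    if node.kind == "&&":        # all() over a generator stops at the first False
        return all(evaluate(c, slots, env) for c in node.children)
    if node.kind == "==":
        left, right = (evaluate(c, slots, env) for c in node.children)
        return left == right
    raise ValueError(f"unsupported node kind: {node.kind}")

y = Node("var", "entityY")
local_tree = Node("&&", None, [
    Node("&&", None, [
        Node("slot", 0),         # was: entityX.SomeProperty == "Hello"
        Node("==", None, [Node("member", "SomeOtherProperty", [y]),
                          Node("const", "Hello 2")]),
    ]),
    Node("==", None, [Node("slot", 1),           # was: entityX.Id
                      Node("member", "Id", [y])]),
])

# Made-up remote results: (value of slot 0, value of slot 1) per remote item.
remote_rows = [(True, 1), (True, 2), (False, 3)]
# Made-up local entities, represented as dictionaries.
local_items = [{"Id": 2, "SomeOtherProperty": "Hello 2"},
               {"Id": 1, "SomeOtherProperty": "nope"}]

matches = [(row, item)
           for row in remote_rows
           for item in local_items
           if evaluate(local_tree, row, {"entityY": item})]
```

The nested loop is the naive way to combine remote and local items. A real implementation would probably want something smarter for large data sets.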

Remarks

You may need to restrict the operations allowed in “remote” subtrees to reduce processing complexity; it will be a trade-off between capabilities and the time spent implementing the query preprocessor.
To handle data aliasing (like in the PropX = entityX … i.PropX.SomeProperty == "Hello" example you provided) you will have to perform data flow analysis. Here you will most likely end up with a set of cases that are too complicated to be worth handling.

Ondrej Tucny 2010-10-11 22:07:03

Thank you! This is exactly what I was asking for

JeffN825 2010-10-12 14:59:40

Great to hear that! Are you planning to open source the solution? I'd be happy to follow your progress and probably contribute.

Ondrej Tucny 2010-10-12 15:01:33

Thanks again, more than anything else, I just wanted to hear how someone else would approach the problem so as to see if it was radically different than my own ideas. I hadn't considered open sourcing it, but given the complexity and potential benefit of having people smarter than I am contribute to it, I will put some serious thought into it.

JeffN825 2010-10-13 17:14:19

Alright, I'll be happy to keep discussing it via email (tucny(at)boldbrick(dot)com). In any case, enjoy coding it, it'll be fun for sure.

Ondrej Tucny 2010-10-13 17:18:32

Expression Tree Dependency Analyzer
