What is the simplest way to identify whether a given method reads or writes a member variable or property? I am writing a tool to assist in an RPC system, in which access to remote objects is expensive. Being able to detect that a given object is not used in a method could allow us to avoid serializing its state. Doing it on source code is perfectly reasonable (but being able to do it on compiled code would be amazing).

I think I can either write my own simple parser or use one of the existing C# parsers and work with the AST. I am not sure whether it is possible to do this with Assemblies using Reflection. Are there any other ways? What would be the simplest?

EDIT: Thanks for all the quick replies. Let me give some more information to make the question clearer. I definitely prefer correct, but it definitely shouldn't be extremely complex. What I mean is that we can't go too far checking for extreme or near-impossible cases (such as the passed-in delegates that were mentioned, which is a great point). It would be enough to detect those cases, assume everything could be used, and not optimize there. I would assume that those cases are relatively uncommon. The idea is for this tool to be handed to developers outside of our team, who should not have to be concerned about this optimization. The tool takes their code and generates proxies for our own RPC protocol. (We are using protobuf-net for serialization only, with neither WCF nor .NET Remoting.) For this reason, anything we use has to be free, or we wouldn't be able to deploy the tool because of licensing issues.

+1  A: 

My intuition is that detecting which member variables will be accessed is the wrong approach. My first guess at a way to do this would be to just request serialized objects on an as-needed basis (preferably at the beginning of whatever function needs them, not piecemeal). Note that TCP/IP (i.e. Nagle's algorithm) should batch these requests together if they are small and made in rapid succession.

Brian
That makes a lot of sense. However, I think it comes down to the same problem. I would still need to somehow know at the beginning of the function what is needed. If I were to request entities as soon as they are needed (likely some time after the beginning of the function) I would expect this to add a lot of latency when fetching objects. Maybe setting a threshold on the size of an object to qualify as as-needed or always send would be useful. Thanks for pointing this out.
cloudraven
@cloudraven: After the first call to the function, this question will have already been answered. So maybe the solution is to make calls as needed but define "needed" as needing it right now or having needed it the last time you called the function.
Brian
A: 

I think the best you can do is explicitly maintain a dirty flag.

Steven Sudit
Nobody's offered anything better than a dirty flag, so this downvote (like so many of them) makes no sense.
Steven Sudit
I didn't downvote, but I guess the problem is that he needs to know what members are accessed by the method *before* it's invoked, so he can serialize the data for a remote call. That can't be done with a dirty flag.
nikie
@User143605: If so, I didn't find that in the question. That requirement is not going to be satisfied in a way that'll make anyone satisfied.
Steven Sudit
method A() { if (rare) read(X); } .... so he runs an application for a long time and it never reads X. Does that mean it won't read X the next time it runs? A flag can't be clairvoyant.
Ira Baxter
A: 

By RPC do you mean .NET Remoting? Or DCOM? Or WCF?

All of these offer the opportunity to monitor cross process communication and serialization via sinks and other constructs, but they are all platform specific, so you'll need to specify the platform...

JeffN825
A: 

You could listen for the event that a property is being read/written to with an interface similar to INotifyPropertyChanged (although you obviously won't know which method effected the read/write.)
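A minimal sketch of what such an interface could look like, assuming a hand-written property. The names `TrackedEntity`, `PropertyRead` and `PropertyWritten` are hypothetical and are not part of INotifyPropertyChanged or any framework type. Note that this only reports accesses as they happen at run time; it cannot predict them before the method is invoked.

```csharp
using System;

// Hypothetical entity that reports every property read and write by name.
class TrackedEntity
{
    public event Action<string> PropertyRead;     // raised before a getter returns
    public event Action<string> PropertyWritten;  // raised after a setter stores

    private int _length;

    public int Length
    {
        get
        {
            PropertyRead?.Invoke(nameof(Length));
            return _length;
        }
        set
        {
            _length = value;
            PropertyWritten?.Invoke(nameof(Length));
        }
    }
}
```

A proxy could subscribe to both events around a call and record which names were touched.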

Pierreten
+6  A: 

You can have simple or you can have correct - which do you prefer?

The simplest way would be to parse the class and the method body. Then identify the set of tokens which are properties and field names of the class. The subset of those tokens which appears in the method body are the properties and field names you care about.

This trivial analysis of course is not correct. If you had

class C
{
    int Length;
    void M() { int x = "".Length; }
}

Then you would incorrectly conclude that M references C.Length. That's a false positive.
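The trivial token analysis described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the method body is already available as a string, identifiers are matched with a regular expression rather than a real lexer, and the names `NaiveMemberScan` and `MentionedMembers` are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class NaiveMemberScan
{
    // Returns the member names that occur as identifier-like tokens in the body.
    // Purely lexical: it cannot tell C.Length apart from string.Length.
    public static ISet<string> MentionedMembers(IEnumerable<string> memberNames, string methodBody)
    {
        IEnumerable<string> tokens = Regex.Matches(methodBody, @"[A-Za-z_][A-Za-z0-9_]*")
                                          .Cast<Match>()
                                          .Select(m => m.Value);

        var result = new HashSet<string>(memberNames);
        result.IntersectWith(tokens); // keep only member names that appear in the body
        return result;
    }
}
```

Called with the member name `Length` and the body of `M` above, it reports `Length`, which is exactly the false positive in question.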

The correct way to do it is to write a full C# compiler, and use the output of its semantic analyzer to answer your question. That's how the IDE implements features like "go to definition".

Eric Lippert
For cloudraven's question, it is also acceptable to interpret the already compiled IL. That would be easier than writing a C# compiler? (I don't have first hand experience so I would not know for sure :) ).
jdv
@jdv: That is probably easier, but that doesn't eliminate the simple vs correct tradeoff. For example, suppose you have void M() { var count = (new[] { this }).Select(x=>x.Foo).Count(); } Does M access Foo? No. The code generator spits a *helper method* that accesses Foo, and then M passes a delegate to that helper method to Select. Does this method show up as one that accesses Foo or not? If it doesn't, then that seems incorrect. If it does, then in order to determine that you have to do flow analysis across bodies, which seems not "simple".
Eric Lippert
I'm pretty sure that "correct" is impossible -- it's at least as hard as solving the halting problem. A method could access a field from within another method selected at runtime via a passed-in delegate.
Gabe
Thanks! I thought it was going to be something like that. I wonder whether one of the existing C# parsers like NRefactory (used in the SharpDevelop IDE) or CSParser would work for this, or whether I would be better off doing it on my own. As for correctness, as I noted in my edit to the question, I just need to cover the common cases and do no optimization in the ambiguous ones. Anything with late binding involved would have to be left out.
cloudraven
To do this as well as practical, you'd need some pretty heavy interprocedural data flow analysis. Regardless of how it is done, there are Turing tarpit troubles, which are usually resolved by having the analyzer make conservative assumptions, e.g., if it can't find proof that some fact is true, it has to assume the safe answer (can't find evidence that A doesn't read X? then assume it does). This allows a tool to get pretty good answers most of the time. Where it is too conservative, you can have users annotate the code with their assertions ("I swear method A doesn't read X").
Ira Baxter
@Gabe: No, it's pretty clear that correct is possible, just very difficult and expensive.
Steven Sudit
A: 

Before attempting to write this kind of logic yourself, I would check to see if you can leverage NDepend to meet your needs.

NDepend is a code dependency analysis tool ... and much more. It implements a sophisticated analyzer for examining relationships between code constructs and should be able to answer that question. It also operates on both source and IL, if I'm not mistaken.

NDepend exposes CQL - Code Query Language - which allows you to write SQL-like queries against the relationships between structures in your code. NDepend has some support for scripting and is capable of being integrated with your build process.

LBushkin
That tool is impressive! I guess if we could bundle it with our tool, it would definitely do the trick. The problem is that it is meant to be used by developers outside the team, or even the company.
cloudraven
A: 

Eric has it right: to do this well, you need what amounts to a compiler front end. What he didn't emphasize enough is the need for strong flow analysis capabilities (or a willingness to accept very conservative answers possibly alleviated by user annotations). Maybe he meant that in the phrase "semantic analysis" although his example of "goto definition" just needs a symbol table, not flow analysis.

A plain C# parser could only be used to get very conservative answers (e.g., if method A in class C contains identifier X, assume it reads class member X; if A contains no calls and does not mention X, then you know it can't read member X).

The first step beyond this is having a compiler's symbol table and type information (if method A refers to class member X directly, then assume it reads member X; if A contains no calls and mentions identifier X only in the context of accesses to objects which are not of this class type, then you know it can't read member X). You have to worry about qualified references, too; Q.X may read member X if Q is compatible with C.

The sticking point is calls, which can hide arbitrary actions. An analysis based on just parsing and symbol tables could determine that, if there are calls, the arguments refer only to constants or to objects which are not of the class which A might represent (possibly inherited).

If you find an argument that has a C-compatible class type, now you have to determine whether that argument can be bound to this, requiring control and data flow analysis:

    void A()
    {
        object q = this;
        // ...
        // ... q = that; ...
        // ...
        foo(q);
    }

foo might hide an access to X. So you need two things: flow analysis to determine whether the initial assignment to q can reach the call foo (it might not; q=that may dominate all calls to foo), and call graph analysis to determine what methods foo might actually invoke, so that you can analyze those for accesses to member X.

You can decide how far you want to go with this, simply making the conservative assumption "A reads X" any time you don't have enough information to prove otherwise. This will give you a "safe" answer (if not "correct" or what I'd prefer to call "precise").
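As a rough sketch of that policy, the result of whatever analysis you run can be treated as three-valued, with everything unproven falling back to "serialize it" unless a developer annotation says otherwise. All names here (`Access`, `ConservativePolicy`, `NoAccessAssertions`, `MustSerialize`) are hypothetical, and how the analysis itself produces its result is deliberately left out.

```csharp
using System;
using System.Collections.Generic;

// What the analysis was able to establish about "method A touches member X".
enum Access { ProvenUnused, ProvenUsed, Unknown }

static class ConservativePolicy
{
    // Hypothetical store of developer assertions of the form
    // "I swear method A doesn't touch member X", keyed by (method, member).
    public static readonly HashSet<Tuple<string, string>> NoAccessAssertions =
        new HashSet<Tuple<string, string>>();

    // The member may be skipped only with proof, or with an explicit assertion
    // when the analysis could not decide; everything else is serialized.
    public static bool MustSerialize(string method, string member, Access result)
    {
        if (result == Access.ProvenUnused) return false;
        if (result == Access.Unknown &&
            NoAccessAssertions.Contains(Tuple.Create(method, member))) return false;
        return true;
    }
}
```

An assertion deliberately does not override `ProvenUsed` in this sketch; whether it should is a design choice for the tool.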

Of frameworks that might be helpful, you might consider Mono, which surely parses and builds symbol tables. I don't know what support it provides for flow analysis or call graph extraction; I would not expect the Mono-to-IL front-end compiler to do a lot of that, as people usually hide that machinery in the JIT part of JIT-based systems. A downside is that Mono may be behind the "modern C#" curve; last time I heard, it handled only C# 2.0 but my information may be stale.

An alternative is our DMS Software Reengineering Toolkit and its C# Front End. (Not an open source product).

DMS provides general source code parsing, tree building/inspection/analysis, general symbol table support and built-in machinery for implementing control-flow analysis, data flow analysis, points-to analysis (needed for "What does object O point to?"), and call graph construction. This machinery has all been tested by fire with DMS's Java and C front ends, and the symbol table support has been used to implement full C++ name and type resolution, so it's pretty effective. (You don't want to underestimate the work it takes to build all that machinery; we've been working on DMS since 1995.)

The C# Front End provides for full C# 4.0 parsing and full tree building. It presently does not build symbol tables for C# (we're working on this) and that's a shortcoming compared to Mono. With such a symbol table, however, you would have access to all that flow analysis machinery (which has been tested with DMS's Java and C front ends) and that might be a big step up from Mono if it doesn't provide that.

If you want to do this well, you have a considerable amount of work in front of you. If you want to stick with "simple", you'll have to make do with just parsing the tree and be OK with being very conservative.

You didn't say much about knowing if a method wrote to a member. If you are going to minimize traffic the way you describe, you want to distinguish "read", "write" and "update" cases and optimize messages in both directions. The analysis is obviously pretty similar for the various cases.

Finally, you might consider processing MSIL directly to get the information you need; you'll still have the flow analysis and conservative analysis issues. You might find the following technical paper interesting; it describes a fully-distributed Java object system that has to do the same basic analysis you want to do, and does so, IIRC, by analyzing class files and doing massive byte code rewriting. Java Orchestra System
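For the MSIL route, a first approximation is possible with plain Reflection: read the method body bytes and look for the field load and store instructions. The sketch below is only that, a sketch. `Sample` and `IlFieldScan` are hypothetical names, it looks at a single method body and does not follow calls (so compiler-generated helper methods, property accessors and delegates are all missed), and it treats taking a field's address as both a read and a write to stay on the safe side.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical demo type, not from the question.
class Sample
{
    public int X;
    public int Y;
    public int ReadX() { return X; }
    public void WriteY() { Y = 1; }
}

static class IlFieldScan
{
    // Numeric opcode value -> OpCode, built from the public fields of OpCodes.
    static readonly Dictionary<short, OpCode> ByValue = BuildTable();

    static Dictionary<short, OpCode> BuildTable()
    {
        var table = new Dictionary<short, OpCode>();
        foreach (FieldInfo f in typeof(OpCodes).GetFields(BindingFlags.Public | BindingFlags.Static))
        {
            if (f.FieldType != typeof(OpCode)) continue;
            var op = (OpCode)f.GetValue(null);
            table[op.Value] = op;
        }
        return table;
    }

    // Adds the fields directly loaded or stored by this one method body.
    public static void Scan(MethodBase method, ISet<FieldInfo> reads, ISet<FieldInfo> writes)
    {
        MethodBody body = method.GetMethodBody();
        if (body == null) return; // abstract or extern: nothing to inspect
        byte[] il = body.GetILAsByteArray();
        Type owner = method.DeclaringType;
        Type[] typeArgs = owner != null && owner.IsGenericType ? owner.GetGenericArguments() : null;
        Type[] methodArgs = method.IsGenericMethod ? method.GetGenericArguments() : null;

        int pos = 0;
        while (pos < il.Length)
        {
            short value = il[pos++];
            if (value == 0xFE) value = unchecked((short)(0xFE00 | il[pos++])); // two-byte opcode
            OpCode op;
            if (!ByValue.TryGetValue(value, out op))
                throw new InvalidOperationException("Unknown opcode; assume everything is used.");

            if (op.OperandType == OperandType.InlineField)
            {
                // Metadata tokens are little-endian; BitConverter matches on little-endian hosts.
                FieldInfo field = method.Module.ResolveField(BitConverter.ToInt32(il, pos), typeArgs, methodArgs);
                if (op == OpCodes.Ldfld || op == OpCodes.Ldsfld) reads.Add(field);
                else if (op == OpCodes.Stfld || op == OpCodes.Stsfld) writes.Add(field);
                else { reads.Add(field); writes.Add(field); } // ldflda/ldsflda: address escapes
            }
            pos += OperandSize(op.OperandType, il, pos);
        }
    }

    static int OperandSize(OperandType type, byte[] il, int pos)
    {
        switch (type)
        {
            case OperandType.InlineNone: return 0;
            case OperandType.ShortInlineBrTarget:
            case OperandType.ShortInlineI:
            case OperandType.ShortInlineVar: return 1;
            case OperandType.InlineVar: return 2;
            case OperandType.InlineI8:
            case OperandType.InlineR: return 8;
            case OperandType.InlineSwitch: return 4 + 4 * BitConverter.ToInt32(il, pos);
            default: return 4; // tokens, 32-bit branch targets, int32 and float32 operands
        }
    }
}
```

Scanning `Sample.ReadX` this way should put `X` in the read set, and `Sample.WriteY` should put `Y` in the write set. Anything reached through a call still needs the call graph and flow analysis discussed above.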

Ira Baxter
Thanks! Too bad I can't mark both answers right.
cloudraven
Both the paper and the extra examples helped me get more of an idea of what the problematic cases will be. I will pursue the compiler path then. Thanks
cloudraven