I'm reading data from a custom data format that conceptually stores data in a table. Each column can have a distinct type. The types are specific to the file format and map to C# types.

I have a Column type that encapsulates the idea of a column, with generic parameter T indicating the C# type that is in the column. The Column.FormatType indicates the type in terms of the format types. So to read a value for a column, I have a simple method:

protected T readColumnValue<T>(Column<T> column)
{
  switch (column.FormatType)
  {
    case FormatType.Int:
      return (T)readInt();
  }
}

How simple and elegant! Now all I have to do is:

Column<int> column=new Column<int>(...);
int value=readColumnValue(column);

The above cast to type T would work in Java (albeit with a warning), and because of erasure the cast would not be evaluated until the value was actually used by the caller---at which point a ClassCastException would be thrown if the cast wasn't correct.

This doesn't work in C#. However, because C# doesn't throw away the generic types, it should be possible to make it even better! It appears that I can ask for the type of T at runtime:

Type valueType=typeof(T);

Great---so I have the type of value that I'll be returning. What can I do with it? If this were Java, I would be home free, because there exists a Class.cast method which performs a runtime cast! (Because each Java Class instance has a generic type parameter indicating which class it represents, it would also provide compile-time type safety.) The following is from my dream world, where the C# Type class works like the Java Class class:

protected T readColumnValue<T>(Column<T> column)
{
  Type<T> valueType=typeof(T);
  switch (column.FormatType)
  {
    case FormatType.Int:
      return valueType.Cast(readInt());
  }
}

Obviously there is no Type.Cast()---so what do I do?

(Yes, I know there is a Convert.ChangeType() method, but that seems to perform conversions, not make a simple cast.)
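To illustrate the distinction drawn above, here is a minimal sketch (the class and method names are made up for this example). Convert.ChangeType returns object, so the result is boxed anyway, and it converts between types rather than merely asserting a type the way a cast would:

```csharp
using System;

public static class ChangeTypeDemo
{
    // Hypothetical helper: converts an int to T via Convert.ChangeType.
    // The return type of ChangeType is object, so the value is boxed and then unboxed as T.
    public static T ConvertInt<T>(int value)
    {
        return (T)Convert.ChangeType(value, typeof(T));
    }

    public static void Main()
    {
        // A real conversion takes place here: the int 42 becomes the long 42.
        long asLong = ConvertInt<long>(42);
        Console.WriteLine(asLong);
    }
}
```

So it is not a drop-in replacement for a plain cast, and it does not avoid boxing either.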

Update: So it seems that this is simply not possible without boxing/unboxing using (T)(object)readInt(). But this is not acceptable. These files are really big---80MB, for example. Let's say I want to read an entire column of values. I'd have an elegant little method that uses generics and calls the method above like this:

public T[] readColumn<T>(Column<T> column, int rowStart, int rowEnd, T[] values)
{
  ...  //seek to column start
  for (int row = rowStart; row < rowEnd; ++row)
  {
    values[row - rowStart] = readColumnValue(column);
    ... //seek to next row
  }
  return values;
}

Boxing/unboxing for millions of values? That doesn't sound good. I find it absurd that I'm going to have to throw away generics and resort to readColumnInt(), readColumnFloat(), etc. and reproduce all this code just to prevent boxing/unboxing!

public int[] readColumnInt(Column<int> column, int rowStart, int rowEnd, int[] values)
{
  ...  //seek to column start
  for (int row = rowStart; row < rowEnd; ++row)
  {
    values[row - rowStart] = readInt();
    ... //seek to next row
  }
  return values;
}

public float[] readColumnFloat(Column<float> column, int rowStart, int rowEnd, float[] values)
{
  ...  //seek to column start
  for (int row = rowStart; row < rowEnd; ++row)
  {
    values[row - rowStart] = readFloat();
    ... //seek to next row
  }
  return values;
}

This is pitiful. :(

+2  A: 
return (T)(object)readInt();
VirtualBlackFox
Wouldn't this produce unnecessary boxing/unboxing which wouldn't occur with a normal cast?
Garret Wilson
Yes, it will box and then unbox the int. In theory the implementation could optimize this (the JIT emits one generic implementation per value type, so the boxing could be removed where possible), but I don't think current implementations do so. If you want speed, the only way seems to be to generate code.
VirtualBlackFox
I want the cast to work no differently than (int)readInt(), which I understand is how Java's type.cast(readInt()) works (while providing compile-time type safety). I see no point in introducing unnecessary boxing/unboxing when all I'm trying to do is get around the compiler restriction and I know 100% what the type would be (which is what a cast is for).
Garret Wilson
C# generics can't do this check at compile time; they are true generics that exist at runtime, not compile-time syntactic sugar as in Java or C++. This has some upsides, but this is one of the downsides (there are others, especially regarding the ? generic parameter that Java allows).
VirtualBlackFox
By the way, it seems to me that the cast method in Java takes an Object parameter and returns an instance of the class, so Integer in this case. The net result is one boxing and one unboxing, exactly as in C#, if you use this method. The cleanest and best-performing solution in this specific case is C++ templates, as they can be specialized by parameter type.
VirtualBlackFox
But I'm not asking for the check at compile time---I'm asking for the check at runtime. If I say (MyType)createSomething(), C# will say, "OK, I can't check this at compile time so I'll let it go until runtime and complain about it then." I just want something analogous to the (MyType) cast using (T). Java's erasure makes this *less* possible, not more possible. That's why in Java a cast to (T) produces a warning, when in C# it could actually be checked if they would have added this feature. But so be it---I guess I'll have to accept that it's not possible.
Garret Wilson
@VirtualBlackFox: regarding your new comment about Java integers, yes, in Java boxing would occur whatever the case for primitives. I guess my hopes were doubly dashed because it seems like in C# boxing would not have to occur if the language would have allowed generic type casting, and combined with no erasure C# could have made this twice as good as Java generics. :(
Garret Wilson
The problem with "let it go until runtime" is that the IL instruction for a cast is OpCodes.Castclass, which expects a reference to a class instance, so that is what is emitted and some boxing is required (the same IL is used for all T types). Now, as previously said, it is theoretically possible for the VM to detect the int, box, cast to int, unbox sequence and remove it since, if I remember correctly, the Microsoft VM already generates x86 code for each value-type parameter. I have not checked, but I guess this optimization isn't in the current VM.
VirtualBlackFox
Another case that could be optimized in the same way is where T is int and you do (T)(object)long_value, as it could in theory use OpCodes.conv instead of going through box, castclass, unbox; but I also have some doubts that this one is implemented.
VirtualBlackFox
Thanks for the discussion! It's been informative.
Garret Wilson
It _cannot_ just do `conv` because that would have different semantics. E.g. you can `conv` from long to int, but you'd get a cast exception if you box a long and then try to unbox it to int. It could still optimize that of course, it'd just have to insert the checks to exactly match the behavior of boxing/unboxing wrt type compatibility. And yes - last I checked JIT output for this kind of thing, it doesn't actually try to optimize it, so actual boxing happens there in practice.
Pavel Minaev
Exactly, I wrote too fast without thinking. If it worked, `conv` would be inconsistent with what happens with user-defined implicit casts in this case. It's sad that this case isn't optimized, but with manual code generation, Reflection.Emit, Expression<T> and the DLR, it's not as if there are no ways to optimize manually if needed.
VirtualBlackFox
+1  A: 

I think the closest way to make this work is to overload readColumnValue and not make it generic, like so:

    protected Int32 readColumnValue(Column<Int32> column) {
        return readInt();
    }
    protected Int64 readColumnValue(Column<Int64> column) {
        return readLong();
    }
    protected String readColumnValue(Column<String> column){
        return String.Empty;
    }
Steve Ellinger
struct is the C# way of declaring a value type. The .NET library defines a ValueType type, but C# will not allow you to use it directly; you must use the reserved word struct to create a value type. Boxing/unboxing occurs when you cast a value type to Object (which is a reference type) and back again; since we are telling it to cast to ValueType (not a reference type), no boxing/unboxing will occur. The where T : struct constraint tells the compiler that the type must be a value type, and therefore it allows the first cast (ValueType) on the return.
Steve Ellinger
Upon further research it would appear that I am wrong: looking at the MSIL generated by the compiler, the code is boxing and unboxing the silly thing. The only way I see it working without the silly boxing/unboxing is to overload readColumnValue, like I did with the String type, for all the value types needed. And yes, I have verified by looking in Reflector that no boxing/unboxing takes place in that scenario. I'm kind of mad at myself right now.
Steve Ellinger
Steve, thanks for investigating this, but I wish you hadn't completely replaced your first answer, because now my first comment above doesn't seem to make sense. :( Just for the record, Steve's first proposal was that restricting the method to struct types ("where T : struct") would prevent boxing by casting to (T)(ValueType)readInt(). I'll also reiterate that replacing all the generic methods with individual type methods, as Steve indicates in his new answer, will also require that all the generic methods that call these methods be replicated, and so on. :(
Garret Wilson
Garret, I wasn't entirely sure how to handle the situation as I could edit this entry or add a new entry / vote to delete this entry and add a new entry. I'm not sure the new approach helps you because I am not sure of the context of the call to the readColumnValue method. It is clear that you have an instance of Column<>, the compiler knows the type and can call the proper overload thus eliminating the switch that would have to exist in the generic version of readColumnValue, this seems to me to be a good thing
Steve Ellinger
A: 

The short answer to all of this (see the question details) is that C# does not allow explicit casting to generic type T even if you know the type of T and you know the value that you have is T---unless you want to live with boxing/unboxing:

return (T)(object)myvalue;

Personally, this seems like a major deficiency in the language---there is nothing about the situation that says that boxing/unboxing would need to occur.
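As a small sketch of what the boxed cast actually does at runtime (ReadInt here is a hypothetical stand-in for the parser's readInt(), and the class name is invented for the example): the cast compiles for any T, succeeds when T really is int, and fails with an InvalidCastException otherwise, because a boxed int can only be unboxed as int.

```csharp
using System;

public static class BoxedCastDemo
{
    // Hypothetical stand-in for the parser's readInt().
    public static int ReadInt() { return 42; }

    // Compiles for any T; the int is boxed, then unboxed as T at runtime.
    public static T ReadAs<T>()
    {
        return (T)(object)ReadInt();
    }

    public static void Main()
    {
        int ok = ReadAs<int>();            // T is int: unboxing succeeds
        Console.WriteLine(ok);

        try
        {
            long bad = ReadAs<long>();     // a boxed int cannot be unboxed as long
            Console.WriteLine(bad);
        }
        catch (InvalidCastException)
        {
            Console.WriteLine("InvalidCastException");
        }
    }
}
```

Note that this differs from a direct (long)intValue cast, which would perform a numeric conversion instead of throwing.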

There is, however, a workaround, if you know ahead of time all the different types of T that are possible. Continuing the example in the question, we have a Column of generic type T representing tabular data in a file, and a parser that reads values from a column based upon the type of the column. I wanted the following in the parser:

protected T readColumnValue<T>(Column<T> column)
{
  switch (column.FormatType)
  {
    case FormatType.Int:
      return (T)readInt();
  }
}

As discussed, that doesn't work. But (assuming for this example that the parser is of type MyParser) you can actually create a different Column subclass for each T, like this:

public abstract class Column<T>
{
  public abstract T readValue(MyParser myParser);
}

public class IntColumn : Column<int>
{
  public override int readValue(MyParser myParser)
  {
    return myParser.readInt();
  }
}

Now I can update my parsing method to delegate to the column:

protected T readColumnValue<T>(Column<T> column)
{
  return column.readValue(this);
}

Note that the same program logic is occurring---it's just that by subclassing the generic column type, we've allowed specialization of a method to do the casting to T for us. In other words, we still have (T)readInt(), it's just that the (T) cast is happening, not within a single line, but in the override of the method that changes from:

  public abstract T readValue(MyParser myParser);

to

  public override int readValue(MyParser myParser)

So if the compiler can figure out how to cast to T in a method specialization, it should be able to figure it out on a single line cast. Put another way, nothing prevents C# from having a typeof(T).cast() method that would do exactly the same thing being done in method specialization above.
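For reference, here is a self-contained sketch of the subclass approach, combined with the column-reading loop from the question. The MyParser below is only a stub I made up for illustration: it serves ints from an in-memory array instead of seeking through a file, and the methods are public rather than protected so that the example can be called from Main.

```csharp
using System;

// Hypothetical stub parser: reads ints from an array rather than a real file.
public class MyParser
{
    private readonly int[] data;
    private int position;

    public MyParser(int[] data) { this.data = data; }

    public int readInt() { return data[position++]; }

    // Generic entry point: no switch and no cast; the column performs the typed read.
    public T readColumnValue<T>(Column<T> column)
    {
        return column.readValue(this);
    }

    public T[] readColumn<T>(Column<T> column, int rowStart, int rowEnd, T[] values)
    {
        for (int row = rowStart; row < rowEnd; ++row)
        {
            values[row - rowStart] = readColumnValue(column);
        }
        return values;
    }
}

public abstract class Column<T>
{
    public abstract T readValue(MyParser myParser);
}

public class IntColumn : Column<int>
{
    public override int readValue(MyParser myParser) { return myParser.readInt(); }
}

public static class Program
{
    public static void Main()
    {
        MyParser parser = new MyParser(new int[] { 10, 20, 30 });
        int[] values = parser.readColumn(new IntColumn(), 0, 3, new int[3]);
        Console.WriteLine(string.Join(", ", values));
    }
}
```

As far as I can tell, the source contains no conversion through object anywhere, since readValue on Column<int> is already typed as returning int, but checking the generated CIL is the only way to be sure.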

(What's even more frustrating about this whole exercise is that this solution has forced me to mix parsing code into the data object model, after trying so hard to keep it separate.)

Now, if somebody compiles this, looks at the generated CIL, and finds out that .NET is boxing/unboxing the return value just so that the specialized readValue() method can satisfy the generic return type T, I will cry.

Garret Wilson
Two important observations here. First of all, the obvious reason why you can't just write `(T)x` there is that there is no guarantee that the cast is even applicable for all possible `T` (note: I'm not talking about failure to downcast here, but about cases which would be compile-time errors in the absence of generics, such as `(long)"foo"`). When you're upcasting to `(object)` first, it looks like a "workaround" at first glance, but in practice it means something different semantically - just as `(long)(object)123` is very different from `(long)123`.
Pavel Minaev
As for performance, there's actually no requirement for a CLR implementation to physically box or unbox anything for IL generated from `(T)(object)x`. Clearly, according to the "as if" rule, it could just do the conversion directly (though it has to be careful to capture the semantics of boxing/unboxing when it comes to type compatibility, as opposed to direct conversions!). So this is strictly a quality-of-implementation issue, not a language design issue. Unfortunately the existing JIT in .NET does not optimize this case so physical boxing/unboxing happens.
Pavel Minaev
Thanks for your general comments, Pavel---they echo what others have noted above relating to compiler optimization. But more directly to the solution outlined in this answer, I'm curious to know if you think A) it addresses the problem, B) whether it would introduce boxing, and C) whether .NET would require (or forbid) boxing in this case. P.S. The JIT isn't relevant here, as boxing is part of the CIL---I'm sure you simply meant to say "the compiler".
Garret Wilson
Of course the JIT is relevant. CIL never gets executed, it gets transformed into machine code by the JIT, and that only happens once per type (your concerns about the number of loop iterations do not apply to optimizations made during JIT compilation).
Ben Voigt
Ben, are you saying (and I'm no expert yet in this area) that if the CIL contains a box instruction, that the JIT compiler could choose to ignore it if it thought boxing wasn't needed? And I guess the other larger issue that I was trying to get at (as others here have mentioned) is that if the C# compiler were to optimize away the boxing, we wouldn't even have to worry about the JIT compiler. Thanks.
Garret Wilson
+1  A: 

Why don't you implement your own casting operator from Column<T> to T?

public class Column<T>
{
    public static explicit operator T(Column<T> column)
    {
        // Return the wrapped value; returning the column itself would not compile.
        return column.value;
    }

    private T value;
}

Then you can easily convert whenever you need to:

Column<int> column = new Column<int>(...);
int value = (int)column;
Bevan
If there's already a `private T value`, why in the world wouldn't you just make it a public property of the `Column<T>` type? Isn't `int value = column.Value` even *more* straightforward than `int value = (int)column;`?
Dan Tao
If there *is* already a `private T value`, then yes you could do that. We don't know the implementation of `Column<T>` (the OP didn't show us), so we don't know if it is there or not. I included it in my example purely to make it easier to read.
Bevan
The Column<T> doesn't contain the value---it merely encapsulates the concept of a column and its type. If the column contained the value, I wouldn't need an operator---I would just use T getValue(). Rather, as shown in the example, I am simply giving the Column<T> to the parser to tell it what type to parse and return to me.
Garret Wilson
Ahh. Perhaps you could turn the problem around (or inside out). `Column<T>` has the execution context where `T` is bound to a type; your parser does not. The code you have inside the parser isn't generic, and is forced to deal with type variations by using reflection and switch statements. Try moving the key parts inside `Column<T>`, where T is just another type. Note also that the .NET Jitter is smart enough to avoid redundant boxing/unboxing when working in a generic method.
Bevan
A: 

Is the data stored in row-major or column-major order? If it's in row-major order, then having to scan the entire data set (millions of values you said) multiple times to pick out each column will dwarf the cost of boxing.

I really would suggest doing everything in one pass through the data, probably by building a vector of Action<string> (or Predicate<string> to report errors) delegates that process a single cell each into a List<T> associated with the column. Closed delegates could help a whole lot. Something like:

using System;
using System.Collections.Generic;
using System.Reflection;

public class TableParser
{
    // One Store overload per supported column type; each parses a cell and appends it to the list.
    private static bool Store(List<string> lst, string cell) { lst.Add(cell); return true; }
    private static bool Store(List<int> lst, string cell) { int val; if (!int.TryParse(cell, out val)) return false; lst.Add(val); return true; }
    private static bool Store(List<double> lst, string cell) { double val; if (!double.TryParse(cell, out val)) return false; lst.Add(val); return true; }
    private static readonly Dictionary<Type, MethodInfo> storeMap = new Dictionary<Type, MethodInfo>();

    static TableParser()
    {
        // Map each element type (string, int, double) to its Store overload.
        MethodInfo[] methods = typeof(TableParser).GetMethods(BindingFlags.NonPublic | BindingFlags.Static);
        foreach (MethodInfo mi in methods)
            if (mi.Name == "Store")
                storeMap[mi.GetParameters()[0].ParameterType.GetGenericArguments()[0]] = mi;
    }

    private readonly List<Predicate<string>> columnHandlers = new List<Predicate<string>>();

    public bool TryBindColumn<T>(List<T> lst)
    {
        MethodInfo storeImpl;
        if (!storeMap.TryGetValue(typeof(T), out storeImpl)) return false;
        // Closed delegate: lst is bound as the first argument of the static Store method.
        columnHandlers.Add((Predicate<string>)Delegate.CreateDelegate(typeof(Predicate<string>), lst, storeImpl));
        return true;
    }

    // adapt your existing logic to grab a row, pull it apart with string.Split or whatever, and walk through columnHandlers passing in each of the pieces
}

Of course you could separate the element parsing logic from the dataset walking logic, by choosing between alternate storeMap dictionaries for each format. And if you don't store things as strings, you could just as well use Predicate<byte[]> or similar.
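In case the closed-delegate idea is unfamiliar, here is a minimal standalone sketch of it (the class name is invented for the example, and Store is made public only so that GetMethod can find it without binding flags). Delegate.CreateDelegate binds the list as the first argument of a static method, so the resulting Predicate<string> only needs the cell text:

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

public static class ClosedDelegateDemo
{
    // Static method whose first parameter is bound when the delegate is created.
    public static bool Store(List<int> lst, string cell)
    {
        int val;
        if (!int.TryParse(cell, out val)) return false;
        lst.Add(val);
        return true;
    }

    // Builds a handler that parses a cell and appends it to the given column list.
    public static Predicate<string> Bind(List<int> column)
    {
        MethodInfo store = typeof(ClosedDelegateDemo).GetMethod("Store");
        return (Predicate<string>)Delegate.CreateDelegate(
            typeof(Predicate<string>), column, store);
    }

    public static void Main()
    {
        List<int> column = new List<int>();
        Predicate<string> handler = Bind(column);

        Console.WriteLine(handler("7"));      // parsed and stored
        Console.WriteLine(handler("oops"));   // rejected, nothing stored
        Console.WriteLine(column.Count);
    }
}
```

Each call through the handler works directly on a List<int>, so the parsed value is never converted through object.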

Ben Voigt