views:

261

answers:

3

OLE Variants, as used by older versions of Visual Basic and pervasively in COM Automation, can store lots of different types: basic types like integers and floats, more complicated types like strings and arrays, and all the way up to IDispatch implementations and pointers in the form of ByRef variants.

Variants are also weakly typed: they convert the value to another type without warning depending on which operator you apply and what the current types are of the values passed to the operator. For example, comparing two variants, one containing the integer 1 and another containing the string "1", for equality will return True.

So assuming that I'm working with variants at the underlying data level (e.g. VARIANT in C++ or TVarData in Delphi - i.e. the big union of different possible values), how should I hash variants consistently so that they obey the right rules?

Rules:

  • Variants that hash unequally should compare as unequal, both in sorting and direct equality
  • Variants that compare as equal for both sorting and direct equality should hash as equal

It's OK if I have to use different sorting and direct comparison rules in order to make the hashing fit.

The way I'm currently working is I'm normalizing the variants to strings (if they fit), and treating them as strings, otherwise I'm working with the variant data as if it was an opaque blob, and hashing and comparing its raw bytes. That has some limitations, of course: numbers 1..10 sort as [1, 10, 2, ... 9] etc. This is mildly annoying, but it is consistent and it is very little work. However, I do wonder if there is an accepted practice for this problem.

A: 

So in summary, to make stuff comparable you first stream to a common format, string or blob.

How do you handle e.g. localisation, e.g. formating of reals? A real compared to a string containing the same real created in another locale will fail. Or a real written to string with a different precision setting.

It sounds to me the definition of equal() is the problem, not the hashing. If "equal" values can be serialized to string (or blob) differently, hashing will fail.

Marco van de Voort
That's a good point. There are two potential answers: (a) use invariant settings so that hash codes will be reliable across multiple instances and locales etc., or (b) don't care so long as the results are consistent within any given run (though settings may change and break things in edge cases).Given all that's been said in the comments to my question - I wish more of those comments were actual answers - I may review my approach and handle types individually, and not try to preserve Delphi-like equality semantics when considering comparers etc. needed for algorithms.
Barry Kelly
A: 

Hash codes of VARIANTS that are equal should be equal.

Without knowing the equality and coercion rules that are used for testing equality, it is hard to come up with a proper implementation.

andras
I'm very familiar with how hash codes work in .NET and Java (I've written compilers in both targeting both CLR and JVM), but the problem is that Variants as used in VB and Delphi are not type-safe in the same way as polymorphic objects stored in a location of type Object in .NET or Java, or the way values are type-safe in Ruby, Python or Javascript. That is, `1 == "1"`, or `1.Equals("1") == true`, for Variant values of `1` and `"1"`. I guess the answer to my question is "it depends" - depending on the language semantics.
Barry Kelly
I'm marking this the answer as it's quite true, in order to write the hash function that is guaranteed to match the equality function, the equality function must be known and well defined.
Barry Kelly
+2  A: 

There's a built in tension in your question between the use of a hash function and the stated requirements, which are to validated against the input of the hash. I'd suggest we keep in mind a few properties of hashes in general: information is lost during the hashing process, and hash collisions are to be expected. It is possible to construct a perfect hash without collisions, but it would be problematic (or impossible?) to construct a perfect hash function if the domain of the function is any possible OLE Variant. On the other hand, if we're not talking about a perfect hash, then your first rule is violated.

I don't know the larger context of what you're trying to accomplish, but I must push back on one of your assumptions: is a hash function really what you want? Your requirements could be met in a fairly straightforward way if you develop a system that encodes, not hashes, all of the possible OLE Variant attributes so that they can be recalled later and compared against other Variant images.

Your baseline implementation of converting the Variant to a string representation is moving in this direction. As you are no doubt aware, a Variant can contain pointers, double pointers, and arrays, so you'll have to develop consistent string representation of these data types. I question whether this approach could really be classified as a hash. Aren't you just persisting data attributes?

Paul Keister
I'm writing a generic collection class for a runtime library. The generic parameter might be a variant. Perfect hashing isn't relevant. (Actually, perfect hashing can be counterproductive in small hash tables by increasing the cost of a failed hash lookup.)
Barry Kelly