Short Version

I'm currently looking into an issue with MySQL collations and how they affect a Set of values (which is mapped using Hibernate, but that shouldn't matter for now). I want a Set of Strings that uses the same collation as MySQL. E.g. I want "foobar" and "fööbar" to be considered equal, but "foo bar" and "foobar" to be considered different. Using the default Collator.getInstance() (with Collator.PRIMARY strength) doesn't work reliably, as there are still differences (most notably in how whitespace is handled). So how do I get a Collator that behaves the same as MySQL for each and every possible String?

Long Version

I want to have a unique index on the table where I store the Set's values and make sure the Set only holds values that are allowed in the DB and vice-versa.

Table looks like this:

CREATE TABLE `MY_SET` (
  `entity_id` int  NOT NULL,
  `value` varchar(255)  NOT NULL,
  UNIQUE `entity-value`(`entity_id`, `value`)
) ENGINE = InnoDB DEFAULT CHARSET=latin1 DEFAULT COLLATION=;

Now, if I use plain Strings and a HashSet to hold my values, e.g. as in

public class MyValues {
  private MyEntity _myEntity;
  private final HashSet<String> _values = new HashSet<String>();
}

it would be possible to add both "foobar" and "fööbar" to the set of values. Now if Hibernate flushes the Set to the DB, MySQL will complain about "foobar" and "fööbar" being duplicates for the defined 'entity-value' key. Therefore, I thought I would wrap the Strings and use a Collator to check them for equality:

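public class MyValues {
  private MyEntity _entity;
  private final HashSet<CollatedString> _values = new HashSet<CollatedString>();

  // Accessors used by the examples below
  public void setEntity(final MyEntity entity) { _entity = entity; }
  public HashSet<CollatedString> getValues() { return _values; }
}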

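// Needs java.text.CollationKey and java.text.Collator; declared as a nested class.
public static class CollatedString {
  private String _string;
  private CollationKey _key;

  // Constructor used by the examples below; computes the key right away.
  public CollatedString(final String string) {
   setString(string);
  }

  public String getString() {
   return _string;
  }

  public void setString(final String string) {
   _string = string;
   _key = getCollator().getCollationKey(_string);
  }

  // Hash and compare by collation key, so strings that collate equally collapse in a HashSet.
  @Override
  public int hashCode() {
   return _key.hashCode();
  }

  @Override
  public boolean equals(final Object obj) {
   if (!(obj instanceof CollatedString)) {
    return false;
   }
   return _key.equals(((CollatedString) obj)._key);
  }

}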

This works well for "foobar" and "fööbar":

final MyEntity e = new MyEntity();
final MyValues v = new MyValues();
v.setEntity(e);
v.getValues().add(new CollatedString("foobar"));
v.getValues().add(new CollatedString("fööbar"));
System.out.println("1 == " + v.getValues().size()); // prints 1 == 1

But doesn't work for "foo bar" and "foobar" which MySQL considers different:

v.getValues().add(new CollatedString("foobar"));
v.getValues().add(new CollatedString("foo bar"));
System.out.println("2 == " + v.getValues().size()); // prints 2 == 1 (which is wrong)

So basically, what's left to do is to implement the getCollator() method:

public static final Collator getCollator() {
  // FIXME please help!
}

Full code for the sample is available: Download

A: 

I've managed to get something working myself. As I wasn't able to get the built-in Collators to do what I want, I decided to implement something myself. I've posted my findings on my blog. The basic idea was to read MySQL's character set/collation definition files (/usr/share/mysql/charsets on my Ubuntu systems). An alternative approach would have been to build rules for a RuleBasedCollator from those definitions, but I decided to build my own collator, which has fewer features (MySQL collations can't be configured to be case-sensitive or not; they either are or are not) and should perform quite well.
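
To illustrate the table-driven idea, here is a minimal sketch in Java. It is not the code from the blog post and it does not read MySQL's definition files. The class name TableCollationSketch, the WEIGHTS table and the sortKey helper are all hypothetical, and the table holds only a handful of made-up entries, chosen so that "ö" gets the same weight as "o" (as the question expects) while a space keeps its own weight.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TableCollationSketch {

    // Hypothetical weight table. These entries are NOT MySQL's real data;
    // a real version would fill the table from the collation definition
    // that matches the column's collation.
    private static final Map<Character, Character> WEIGHTS = new HashMap<>();
    static {
        for (char c = 'a'; c <= 'z'; c++) {
            WEIGHTS.put(c, Character.toUpperCase(c)); // case-insensitive
        }
        WEIGHTS.put('\u00f6', 'O'); // o with diaeresis, lower case
        WEIGHTS.put('\u00d6', 'O'); // o with diaeresis, upper case
    }

    // Maps every character to its weight; characters without an entry
    // (including the space) keep their own value and so stay distinct.
    public static String sortKey(final String s) {
        final StringBuilder key = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            final char c = s.charAt(i);
            final Character w = WEIGHTS.get(c);
            key.append(w != null ? w.charValue() : c);
        }
        return key.toString();
    }

    // Wrapper that hashes and compares by sort key instead of by raw string.
    public static final class CollatedValue {
        private final String value;
        private final String key;

        public CollatedValue(final String value) {
            this.value = value;
            this.key = sortKey(value);
        }

        public String getValue() {
            return value;
        }

        @Override
        public int hashCode() {
            return key.hashCode();
        }

        @Override
        public boolean equals(final Object obj) {
            return obj instanceof CollatedValue
                && key.equals(((CollatedValue) obj).key);
        }
    }

    public static void main(final String[] args) {
        final Set<CollatedValue> values = new HashSet<>();
        values.add(new CollatedValue("foobar"));
        values.add(new CollatedValue("f\u00f6\u00f6bar")); // same key as "foobar"
        values.add(new CollatedValue("foo bar"));          // space keeps its own weight
        System.out.println(values.size()); // prints 2
    }
}
```

Because the key is computed once per value and compared as a plain String, lookups in the HashSet stay cheap. Whether the Set really agrees with the unique index depends entirely on how faithfully the table reproduces the collation the column uses.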

sfussenegger