Hello everyone,

I am using VSTS 2008 + C# + .NET 3.0. I have two input strings that I believe are different, but the following C# code treats them as the same and throws System.Data.ConstraintException, saying that column Name is constrained to be unique but the value already exists. Any ideas what is wrong?

Here is my code and my input strings,

Hex view of my input strings,

http://i30.tinypic.com/2anx2b.jpg

Notepad view of my input strings,

http://i30.tinypic.com/2q03hn4.jpg

My code,

    static void Main(string[] args)
    {
        // The second string is the full-width form of "2ch" (see the hex view above).
        string[] buf = new string[] { "2ch", "\uff12\uff43\uff48" };

        DataTable bulkInserTable = new DataTable("BulkTable");
        DataColumn column = null;
        DataRow row = null;

        column = new DataColumn();
        column.DataType = System.Type.GetType("System.String");
        column.ColumnName = "Name";
        column.ReadOnly = true;
        column.Unique = true;
        bulkInserTable.Columns.Add(column);

        foreach (string item in buf)
        {
            row = bulkInserTable.NewRow();
            row["Name"] = item;
            bulkInserTable.Rows.Add(row);
        }
    }

EDIT 1:

My confusion is why a C# Dictionary treats them as different while a DataSet treats them as the same. Is there any way to make the behavior consistent? Here is my code showing that Dictionary treats them as different: the resulting buf array has two elements.

            Dictionary<string, bool> dic = new Dictionary<string, bool>();
            foreach (string s in buf)
            {
                dic[s] = true;
            }
            // We get two strings here rather than one, which shows that
            // Dictionary treats the two strings as different.
            buf = new List<string>(dic.Keys).ToArray();

thanks in advance, George

+5  A: 

Where are you putting the string into the row? It looks to me like you are creating blank rows and inserting 2 of them?

Something like this?

        foreach (string item in buf)
        {
            row = bulkInserTable.NewRow();
            row["Name"] = item;//Set the data<------------
            bulkInserTable.Rows.Add(row);
        }
Russell Troywest
I am creating an in-memory table and then doing a bulk insert into the backend database. I have posted a follow-up about the inconsistent behavior between C# Dictionary and C# DataSet when checking string uniqueness. Is there any way to make the behavior consistent?
George2
Good catch, I have corrected my code.
George2
+3  A: 

Well, for a start, your sample code needs to be:

foreach (string item in buf)
{
    row = bulkInserTable.NewRow();
    row["Name"] = item;
    bulkInserTable.Rows.Add(row);
}

That still exhibits the issue, but at least now it does so for the real reason.

The reason is that, when a DataTable is created, the default compare options in effect are:

this._compareFlags = CompareOptions.IgnoreWidth |
                     CompareOptions.IgnoreKanaType |
                     CompareOptions.IgnoreCase;

From the docs for IgnoreWidth:

Indicates that the string comparison must ignore the character width. For example, Japanese katakana characters can be written as full-width or half-width. If this value is selected, the katakana characters written as full-width are considered equal to the same characters written as half-width.

System.Globalization.CultureInfo.CurrentCulture.CompareInfo.Compare(
    "2ch", "\uff12\uff43\uff48", System.Globalization.CompareOptions.IgnoreWidth);

returns 0, i.e. the strings are considered identical.

I strongly suggest you do treat such values as identical, otherwise you risk further confusion down the line. However, if you really want to change it:

//CaseSensitive property uses this under the hood
internal bool SetCaseSensitiveValue(
    bool isCaseSensitive, bool userSet, bool resetIndexes)
{
    if (!userSet && (
        this._caseSensitiveUserSet || (this._caseSensitive == isCaseSensitive)))
    {
        return false;
    }
    this._caseSensitive = isCaseSensitive;
    if (isCaseSensitive)
    {
        this._compareFlags = CompareOptions.None;
    }
    else
    {
        this._compareFlags = CompareOptions.IgnoreWidth | 
                             CompareOptions.IgnoreKanaType | 
                             CompareOptions.IgnoreCase;
    }
    if (resetIndexes)
    {
        this.ResetIndexes();
        foreach (Constraint constraint in this.Constraints)
        {
            constraint.CheckConstraint();
        }
    }
    return true;
}

Thus setting CaseSensitive to true stops ignoring case and totally disables the complex comparison options (width and kana type) at the same time.

If you want to make a Dictionary with the same behaviour use the following comparer:

using System.Collections.Generic;
using System.Globalization;

public class DataTableIgnoreCaseComparer : IEqualityComparer<string>
{
    // Same culture-sensitive comparison a DataTable uses when CaseSensitive is false.
    private readonly CompareInfo ci = CultureInfo.CurrentCulture.CompareInfo;
    private const CompareOptions options =
        CompareOptions.IgnoreCase |
        CompareOptions.IgnoreKanaType |
        CompareOptions.IgnoreWidth;

    public DataTableIgnoreCaseComparer() {}

    public bool Equals(string a, string b)
    {
        return ci.Compare(a, b, options) == 0;
    }

    public int GetHashCode(string s)
    {
        return ci.GetSortKey(s, options).GetHashCode();
    }
}
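
As a rough usage sketch (not part of the original answer): the comparer is passed to the Dictionary constructor. The class is repeated in condensed form so the snippet compiles on its own; `ComparerDemo` and `CountDistinct` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public class DataTableIgnoreCaseComparer : IEqualityComparer<string>
{
    private readonly CompareInfo ci = CultureInfo.CurrentCulture.CompareInfo;
    private const CompareOptions options =
        CompareOptions.IgnoreCase |
        CompareOptions.IgnoreKanaType |
        CompareOptions.IgnoreWidth;

    public bool Equals(string a, string b)
    {
        return ci.Compare(a, b, options) == 0;
    }

    public int GetHashCode(string s)
    {
        // Equal strings (under 'options') produce equal sort keys, hence equal hashes.
        return ci.GetSortKey(s, options).GetHashCode();
    }
}

public static class ComparerDemo
{
    // Hypothetical helper: number of keys that stay distinct under 'comparer'.
    public static int CountDistinct(string[] items, IEqualityComparer<string> comparer)
    {
        Dictionary<string, bool> dic = new Dictionary<string, bool>(comparer);
        foreach (string s in items)
        {
            dic[s] = true;
        }
        return dic.Count;
    }

    public static void Main()
    {
        string[] buf = new string[] { "2ch", "\uff12\uff43\uff48" };

        // DataTable-style rules: width, kana type and case are ignored.
        Console.WriteLine(CountDistinct(buf, new DataTableIgnoreCaseComparer()));

        // Ordinal rules, which is what a plain Dictionary<string, T> uses.
        Console.WriteLine(CountDistinct(buf, StringComparer.Ordinal));
    }
}
```

With the custom comparer the two spellings are expected to collapse into a single key, while the ordinal comparer keeps them apart.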
ShuggyCoUk
Good catch, I have corrected my code.
George2
I have posted my further confusion about inconsistent behavior between C# Dictionary and C# DataSet to check about string uniqueness, any solutions to make the behavior consistent?
George2
Thanks ShuggyCoUk. I think treating them as the same makes more sense. How can I use the System.Globalization.CultureInfo.CurrentCulture.CompareInfo.Compare function above to check the uniqueness of strings? That is, for an array of input strings, I want to output the strings that are unique when width is ignored.
George2
If you want a dictionary which has the same 'rules', then supply an EqualityComparer<string> whose Equals method uses CompareInfo.Compare() == 0 and which also creates hash codes by first converting all characters to uppercase and short width...
ShuggyCoUk
Thanks ShuggyCoUk, your reply is really great, for your idea -- "then supplying an EqualityComparer<string> whose equals method uses the CompareInfo.Compare() == 0", could you show me some code sample please? I never did this before. :-(
George2
added the example
ShuggyCoUk
Thanks ShuggyCoUk, but how do I apply the DataTableIgnoreCaseComparer to an instance of Dictionary?
George2
var dic = new Dictionary<string, object>(new DataTableIgnoreCaseComparer()); I suggest taking a look at the dictionary docs on msdn
ShuggyCoUk
Another point of confusion is the theory behind full-width and half-width Unicode characters: does it mean every Unicode character, including ASCII, has two different Unicode values (one full-width and one half-width)?
George2
Jon answered that. You don't have to care, since ci.GetSortKey deals with all that for you (at some cost; you may want to cache the keys in some way if you find this is *actually* a problem). If you're interested for informational purposes, do read the link Jon provided.
ShuggyCoUk
Your solution is really cool, thanks!
George2
+1  A: 

It looks like the encoding is different on the second string. When debugging, the second string comes back as garbage. If I delete the second string and enter "2 c h" in Visual Studio, it works correctly.

Jonathan S.
Thanks Jonathan, but I cannot delete the 2nd string to make a solution. I need to deal with both forms. I have posted my further confusion about inconsistent behavior between C# Dictionary and C# DataSet to check about string uniqueness, any solutions to make the behavior consistent?
George2
+6  A: 

It depends on what you mean by "the same".

The two strings have different Unicode values, but I suspect under some normalization rules they would be the same. Just so that others can reproduce it easily without cut and paste issues, the second string is:

"\uff12\uff43\uff48"

These are the "full width" versions of "2ch".

EDIT: To respond to your edit, clearly the DataSet uses a different idea of equality, whereas unless you provide anything specific, Dictionary will use ordinal comparisons (as provided by string itself).

EDIT: I'm pretty sure the problem is that the DataTable is using CompareOptions.IgnoreWidth:

using System;
using System.Data;
using System.Globalization;

class Test
{
    static void Main()
    { 
        string a = "2ch";
        string b = "\uff12\uff43\uff48";

        DataTable table = new DataTable();            
        CompareInfo ci = table.Locale.CompareInfo;

        // Prints 0, i.e. equal
        Console.WriteLine(ci.Compare(a, b, CompareOptions.IgnoreWidth));
    }
}

EDIT: If you set the DataTable's CaseSensitive property to true, I suspect it will behave the same as Dictionary.
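
A minimal sketch of that suggestion, under the assumption that the unique column is set up as in the question. `CaseSensitiveDemo` and `AddUnique` are hypothetical names used only for this illustration; the helper swallows ConstraintException so the two settings can be compared side by side.

```csharp
using System;
using System.Data;

public static class CaseSensitiveDemo
{
    // Hypothetical helper: adds each value to a unique "Name" column and
    // returns how many rows the table accepted.
    public static int AddUnique(string[] values, bool caseSensitive)
    {
        DataTable table = new DataTable("BulkTable");
        table.CaseSensitive = caseSensitive;

        DataColumn column = new DataColumn("Name", typeof(string));
        column.Unique = true;
        table.Columns.Add(column);

        foreach (string value in values)
        {
            try
            {
                DataRow row = table.NewRow();
                row["Name"] = value;
                table.Rows.Add(row);
            }
            catch (ConstraintException)
            {
                // The value collides with an existing row under the
                // table's current comparison rules; skip it.
            }
        }
        return table.Rows.Count;
    }

    public static void Main()
    {
        string[] buf = new string[] { "2ch", "\uff12\uff43\uff48" };

        // Default behaviour (CaseSensitive = false).
        Console.WriteLine(AddUnique(buf, false));

        // CaseSensitive = true, as suggested above.
        Console.WriteLine(AddUnique(buf, true));
    }
}
```

Note that this switch is all-or-nothing: it also makes "abc" and "ABC" distinct values in the unique column.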

Jon Skeet
Thanks Jon, I have posted my further confusion about inconsistent behavior between C# Dictionary and C# DataSet to check about string uniqueness, any solutions to make the behavior consistent?
George2
How can I make the behavior consistent? I am using both Dictionary and DataSet in my program and they are producing different results, which is very weird. Whether the strings count as the same or as different, I want the same behavior in both. Any solutions?
George2
Thanks Jon, setting the DataTable's CaseSensitive property to true solved my issue. A further question: the two strings are full-width and half-width versions of the same text, not upper-case and lower-case variants, so why does the CaseSensitive property, which controls upper and lower case, matter here?
George2
I think IgnoreWidth is closer to the root cause of my issue. But how do I use this constant in my code? I did not find a place to set this property. :-(
George2
found it - the DataTable uses the CompareInfo with a default of IgnoreWidth (see example in my answer)
ShuggyCoUk
CaseSensitivity appears to control more than just case sensitivity, in practice - it controls whether width is ignored, *and* whether the "Kana" type is ignored. Not ideal, I realise... I don't know of a way of creating an `IEqualityComparer<string>` which ignores width and kana type, unfortunately.
Jon Skeet
One option would be to create a `Dictionary<SortKey, bool>` instead of a `Dictionary<string, bool>` and create a sort key for each item with the appropriate CompareOptions (e.g. IgnoreCase|IgnoreWidth|IgnoreKanaType.) It's pretty ugly though...
Jon Skeet
I think treating them as the same makes more sense. How can I ignore width when checking the uniqueness of strings? That is, for an array of input strings, I want to output the strings that are unique when width is ignored.
George2
"create a sort key for each item with the appropriate CompareOptions" -- could you show me some more code please? I am not sure what you mean by "create a sort key".
George2
Use CompareInfo.GetSortKey and give it the same CompareOptions as your DataTable will use. Unfortunately you'll have to basically guess at that, I think - but it'll be CompareOptions.None for CaseSensitive=true, or CompareOptions.IgnoreCase|CompareOptions.IgnoreKanaType|CompareOptions.IgnoreWidth if CaseSensitive is set to false.
Jon Skeet
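
A hedged sketch of the sort-key idea from the comments above, not code from the thread: each string is mapped to a SortKey built with the chosen CompareOptions, and a `Dictionary<SortKey, bool>` tracks which keys have been seen. `SortKeyDemo` and `Distinct` are hypothetical names, and the use of CultureInfo.CurrentCulture is an assumption (a DataTable uses its own Locale).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SortKeyDemo
{
    // Hypothetical helper: returns the first occurrence of each string that
    // is unique under the given CompareOptions.
    public static List<string> Distinct(string[] items, CompareOptions options)
    {
        CompareInfo ci = CultureInfo.CurrentCulture.CompareInfo;
        Dictionary<SortKey, bool> seen = new Dictionary<SortKey, bool>();
        List<string> result = new List<string>();

        foreach (string s in items)
        {
            // Strings that compare equal under 'options' yield equal sort keys.
            SortKey key = ci.GetSortKey(s, options);
            if (!seen.ContainsKey(key))
            {
                seen.Add(key, true);
                result.Add(s);
            }
        }
        return result;
    }

    public static void Main()
    {
        string[] buf = new string[] { "2ch", "\uff12\uff43\uff48" };

        // Options matching a DataTable with CaseSensitive = false.
        CompareOptions loose = CompareOptions.IgnoreCase
                             | CompareOptions.IgnoreKanaType
                             | CompareOptions.IgnoreWidth;

        Console.WriteLine(Distinct(buf, loose).Count);
        Console.WriteLine(Distinct(buf, CompareOptions.None).Count);
    }
}
```

This relies on SortKey overriding Equals and GetHashCode, which is what lets it serve as a dictionary key.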
Another confusion is about the theory of full-width and half-width unicode character, does it mean all unicode characters including ASCII has two different unicode values (one of full-width value and the other half-width value)?
George2
Not everything has a fullwidth/halfwidth form, as far as I'm aware. Click on the link in my answer for the relevant code chart, and you might also want to read http://unicode.org/reports/tr11/
Jon Skeet