tags:

views:

828

answers:

3

A recent project called for importing data into an Oracle database. The program that will do this is a C# .Net 3.5 app and I'm using the Oracle.DataAccess connection library to handle the actual inserting.

I ran into a problem where I'd receive this error message when inserting a particular field:

ORA-12899 Value too large for column X

I used Field.Substring(0, MaxLength); but still got the error (though not for every record).

Finally I saw what should have been obvious, my string was in ANSI and the field was UTF8. Its length is defined in bytes, not characters.

This gets me to my question. What is the best way to trim my string to fix the MaxLength?

My substring code works by character length. Is there simple C# function that can trim a UT8 string intelligently by byte length (ie not hack off half a character) ?

+2  A: 

If a UTF-8 byte has a zero-valued high order bit, it's the beginning of a character. If its high order bit is 1, it's in the 'middle' of a character. The ability to detect the beginning of a character was an explicit design goal of UTF-8.

Check out the Description section of the wikipedia article for more detail.

Oren Trutner
Thanks for the tip. Can you show me a C# example? Does this mean that there isn't any built in functionality to handle this need? It seems like a common problem.
Michael La Voie
If you have a C# string, you could use Encoding.UTF8.GetByteCount(string) to get the precise byte count. You can trim characters from the end of the string, if you need to, until the byte count fits the limit.
Oren Trutner
+4  A: 

Here are two possible solution - a LINQ one-liner processing the input left to right and a traditional for-loop processing the input from right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters and is hard to give a general suggestion. The decision between LINQ and traditional code I probably a matter of taste (or maybe speed).

If speed matters, one could think about just accumulating the byte length of each character until reaching the maximum length instead of calculating the byte length of the whole string in each iteration. But I am not sure if this will work because I don't know UTF-8 encoding well enough. I could theoreticaly imagine that the byte length of a string does not equal the sum of the byte lengths of all characters.

public static String LimitByteLength(String input, Int32 maxLength)
{
    return new String(input
        .TakeWhile((c, i) =>
            Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        .ToArray());
}

public static String LimitByteLength2(String input, Int32 maxLength)
{
    for (Int32 i = input.Length - 1; i >= 0; i--)
    {
        if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        {
            return input.Substring(0, i + 1);
        }
    }

    return String.Empty;
}
Daniel Brückner
I love the LINQ example. That is an elegant solution!
Michael La Voie
+1 like both solution
Feryt
+1  A: 

Is there a reason that you need the database column to be declared in terms of bytes? That's the default, but it's not a particularly useful default if the database character set is variable width. I'd strongly prefer declaring the column in terms of characters.

CREATE TABLE length_example (
  col1 VARCHAR2( 10 BYTE ),
  col2 VARCHAR2( 10 CHAR )
);

This will create a table where COL1 will store 10 bytes of data and col2 will store 10 characters worth of data. Character length semantics make far more sense in a UTF8 database.

Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR. At that point, any tables you create will default to using character length semantics rather than byte length semantics if you don't specify CHAR or BYTE in the field length.

Justin Cave
Great idea. I don't control the Oracle DB, but I will suggest that. Maybe it was just an oversight. This would be the best solution to my problem if they're willing to make the change.
Michael La Voie