views:

728

answers:

3

Hi

I am inserting jpeg images into my UTF-8 encoded Postgres database into bytea column/s. I'm using a prepared sql statement to insert the images. In the statement I create a file object, within Java, of the jpeg image and then pass it as a FileInputStream into the setBinaryStream method. However every now and again my Java app will throw an exception once the statement is executed, stating that:

" ERROR: invalid byte sequence for encoding "UTF8": 0x84 "

This happens for a select few images which is odd. These images are extracted from a previous set of images, all the previous images insert fine only a few extracted images seem to cause the error. So how do I solve such a problem? Can encode the byte steam to UTF-8 somehow? Or is it a problem with the database?

Btw, if I replace the extracted images with new ones and save them as jpegs the same error occurs. Thanks for your help!

The code is below as requested...

There is some code missing otherwise this would be very long but, basically I do a few checks on the path and directory names to make sure they adhere to the file system rules. This is in a loop which goes through all the sub-directories and adds all the jpeg files in those sub-directories. I then go to the next directory with sub-directories with images till there are none there. I have yet to add try-catches and logging sections yet.

String imgStr = image.toString();
int age = getAgeFromDir(imgStr);
String gender = getSexFromDir(imgStr);
String table = "";
switch(validIdx){
    case 0: table = "carpals";
        break;
    case 1: table = "d_phalanges";
        break;
    case 2: table = "p_phalanges";
        break;
    case 3: table = "i_phalanges";
        break;
    case 4: table = "epiphyses";
        break;
    case 5: table = "sesamoids";
        break;
    case 6: table = "metacarpals ";
        break;
}

    PreparedStatement ps = con.prepareCall("INSERT INTO " + table +
            " VALUES( (SELECT hands.hand_id FROM hands WHERE hands.age = " + age + " AND hands.gender = '" + gender + "' AND hands.location = '" + path + directory + imageNames[i] + "' )," +
            " (SELECT COUNT(" + table + ".location) FROM " + table + " ), " +
            " ?, ? )"   );

        //go through each sub-directory which contains jpeg images and add them to
        //the database
        File sublist = new File(image + "\\" + subdir[j]);
        String[] files = sublist.list();
        String[] pics = sublist.list(new JpegFilter());

        if(files.length > pics.length){
            //WRITE TO LOG
            //WARNING UNEXPECTED FILES OR DIRECTORIES FOUND IN....
        }

            for(int r = 0; r < pics.length; r++ ){

                    String location = image + "\\" + subdir[j] + "\\" + pics[r];
                    System.out.println(i + "\t" + r + " location : " + location);

                    File f = new File(location);
                    FileInputStream pic = new FileInputStream(f);


                    if(f.isFile()){
                    ps.setString(2, location);
                    ps.setBinaryStream(1, pic, (int)f.length());
                    ps.execute();
                    pic.close();
                    }
            }
    ps.close();

}

The SQLException thrown is below, it is thrown at ps.execute():

Exception in thread "main" org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x84 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1608) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1343) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:194) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:451) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:350) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:343) at nuffielddb.HandDB.addExtractedImages(HandDB.java:406) at nuffielddb.Main.main(Main.java:37) Java Result: 1

+1  A: 

Well, 0x84 is not a valid utf8 character:

=> perl -e 'print "\x84"' | iconv -f utf8 -t utf8
iconv: illegal input sequence at position 0

Generally - bytea will work with any bytes, but INSERT statement is a text string, and as such has to conform to encoding of client!

Simple way to insert the data:

  1. Encode the data in your application to be in Base64 format (there are also other options, but this one is easiest for me to show)
  2. insert it with: INSERT INTO q (x) VALUES (decode(?, 'base64'))

Example in Perl (sorry, I don't write Java):

#!/usr/bin/perl
use MIME::Base64;
use DBI;

my $dbh = DBI->connect( "dbi:Pg:dbname=depesz;port=5840", "depesz" );
my $blob = "\x84";
my $encoded = encode_base64( $blob );
$dbh->do("INSERT INTO q (x) VALUES (decode(?, 'base64'))", undef, $encoded );

q table is:

      Table "public.q"
 Column | Type  | Modifiers
--------+-------+-----------
 x      | bytea |

Data (after insert) looks like this:

# select x, octet_length(x) from q;
  x   | octet_length
------+--------------
 \x84 |            1
(1 row)
depesz
Ah ok, I don't know Perl at all but I think I get the general idea. I am right in assuming I need to undertake the following steps:1. Get the file and create a byte array2. Store the files contains as bytes within the byte array3. Encode the byte array to 'base64'4. Then try executing the SQL statementQuestion, what does 'undef , $encoded' do within your code?And does anyone know how to turn a File object into a byte array containing the corresponding byte's of the File object respectively?
Graham
undef is not important - it's a DBI (database interface in Perl) thing.$encoded is variable containing encoded (to base64) content of $blob.and the ->do( sql, undef, $some_variable ), things just runs sql with substituting $some_variable in place of ? in sql.
depesz
+1  A: 

Well that's that idea out the window (relating to my comment of the original question) - there's obviously some encoding taking place and certain images contain invalid byte sequences and hence can't be encoded, but my reasoning for this was going to be that a clob was being used (must learn to read the question more carefully).

I'd be tempted to BASE64 encode the stream if possible.

A quick Google turned up this - http://commons.apache.org/codec/api-release/org/apache/commons/codec/binary/Base64InputStream.html - that I suspect might be of use (even if it's just for inspiration).

Nick Holt
Out of interest, what is causing this invalid byte encoding. I have one big image, cut and crop a small section of it and then I add both to the database. The larger image inserts fine how come the cropped one. Am I being stupid in assuming that the smaller image does not contain subset byte representation of the larger image?
Graham
@Graham: I'm not that familiar with image formats but I wouldn't assume a cropped section of a large image would translate to the same sequence of bytes - I think I'm right in saying that most formats employ some sort of compression that will be dependent on the frequency of the each pixel value. This could occur for example if the image as a whole is mostly light colors and you crop a darker area.
Nick Holt
Oh, so the problem could be due to the compression used in the jpeg format creates a byte which is 0x84. So perhaps changing how much it is compressed may solve this problem as well or am I barking up the wrong tree?
Graham
I'm guessing it may fix a particular case but the underlying problem would remain that there's no guaranteeing that a particular image won't have the invalid byte sequence.
Nick Holt
I was wondering if you found a class that takes a byte array and encodes it, returning a new byte array in base64? I can only find methods or classes which allow me to encode the byte array into a String which is base64. Or can I encode the file into a String/s which is base64 and then when I retrieve the bytea column covert them back into a byte array with the original values? Will this keep the underlying image data intact?
Graham
When I've had to perform Base64 encoding in the past I've used the Commons API (link included in my answer, after editing). You simply need to wrap you java.io.FileInputStream in a org.apache.commons.codec.binary.Base64InputStream and do the reverse with an org.apache.commons.codec.binary.Base64OutputStream to load the image back from the database.
Nick Holt
A: 

Problem solved :-) After encoding and decoding the different files I found that the same SQL error occurred. I believe the problem was occurred due to one of the FK attributes storing a blank value within the Postgres database after inserting some values with the Java application I had created. When referencing value hand_id within the sub-query (below):

(SELECT hands.hand_id FROM hands WHERE hands.age = " + age + " AND hands.gender = '" + gender + "' AND hands.location = '" + path + directory + imageNames[i] + "' )

The result returned in postgres when the variables were replaced in Java was some sort of empty non-returnable character, I believe like an escape character or carriage-return in Java (e.g. "\" for backslash). After looking up the character within the UTF-8 table of values and character representations the table shows a blank space.

On looking up the value within the on the net I found http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=0xthis, it says the character is :

Unicode value, Character, UTF-Hex, Name

----------------------------------------------

U+0084,          ,0xc2 0x84,<control>

Notice that the character is column in the table is empty.

The problem was caused by the sub-query not including the necessary escape characters. To fix the problem the necessary escape characters were added to the SQL sub-query. In my code, it meant the following changes occurred regarding the 'hands.location' section in the final SQL statement being sent of:

BEFORE

... hands.location = 'C:\directory\anotherdir\picture.jpg'

AFTER

... hands.location = E'C:\\directory\\anotherdir\\picture.jpg'

SO, what were the lessons I learn?

  1. ALWAYS check your SQL statement, even if you believe it to be correct

  2. When inserting Strings into a VARCHAR column remember the different escape characters needed and to place a E before the first open quotation mark (as so, E'). Remember that a backslash requires two backslashes to be added (as so, E'\')

  3. IF you do have a problem with the encoding of a database you can always try and redefine if the databases encoding within the configuration OR convert and encode your data to the required encoding for it to be acceptable for the database.

  4. APACHE COMMONS as a useful base64 encoding codec for Java. Very useful and a must to remember for a later date.

  5. ERRORS can really be deceiving at the best of times. If you get this error you mite want to check all the things I did first.

Btw, thank you for all those who posted answers. I am always amazed by people generosity to give up their time to help others! It truly was useful and proves why StackOverflow works so well! :-)

Graham