I receive some xml-files with embedded base64-encoded images, that I need to decode and save as files.

An unmodified (other than zipped) example of such a file can be downloaded below:

20091123-125320.zip (60KB)

However, I get errors like "Invalid length for a Base-64 char array" and "Invalid character in a Base-64 string". I have marked the line in the code where the error occurs.

A file could look like this:

<?xml version="1.0" encoding="windows-1252"?>
<mediafiles>
    <media media-type="image">
      <media-reference mime-type="image/jpeg"/>
      <media-object encoding="base64"><![CDATA[/9j/4AAQ[...snip...]P4Vm9zOR//Z=]]></media-object>
      <media.caption>What up</media.caption>
    </media>
</mediafiles>

And the code to process like this:

var xd = new XmlDocument();
xd.Load(filename);
var nodes = xd.GetElementsByTagName("media");

foreach (XmlNode node in nodes)
{
    var mediaObjectNode = node.SelectSingleNode("media-object");
    // The line below is where the errors occur
    byte[] imageBytes = Convert.FromBase64String(mediaObjectNode.InnerText);
    // Do stuff with the byte array to save the image
}

The XML data comes from an enterprise newspaper system, so I am pretty sure the files are OK and that something in the way I process them is wrong. Maybe a problem with the encoding?

I have tried writing out the contents of mediaObjectNode.InnerText, and it is the base64-encoded data, so navigating the XML document is not the issue.

I have been googling, binging, stackoverflowing and crying - and found no solution... Help!

Edit:

Added an actual example file (and a bounty). Please note that the downloadable file uses a slightly different schema, since I simplified the example above by removing irrelevant parts.

A: 

Is the character encoding correct? The error sounds like there's a problem that causes invalid characters to appear in the array. Try copying out the text and decoding manually to see if the data is indeed valid.

(For the record, windows-1252 is not exactly the same as iso-8859-1, so that may be the cause of a problem, barring other sources of corruption.)
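One way to do the manual check suggested above is to inspect the text before handing it to Convert.FromBase64String. The following is only a minimal sketch: the class name Base64Check and the method Describe are hypothetical helpers, not part of .NET, and the checks cover just the common causes (a character outside the standard Base64 alphabet, too much padding, a length that is not a multiple of 4).

```csharp
using System;
using System.Linq;

static class Base64Check
{
    // The 64 characters of the standard Base64 alphabet.
    const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Hypothetical helper: describes what looks wrong with the text, or returns "ok".
    public static string Describe(string text)
    {
        // Ignore whitespace such as line breaks, as Convert.FromBase64String does.
        string compact = new string(text.Where(c => !char.IsWhiteSpace(c)).ToArray());
        string body = compact.TrimEnd('=');
        int padding = compact.Length - body.Length;

        // Look for the first character that is not in the alphabet.
        for (int i = 0; i < body.Length; i++)
        {
            if (Alphabet.IndexOf(body[i]) < 0)
                return "invalid character '" + body[i] + "' at position " + i;
        }
        if (padding > 2)
            return "too many '=' padding characters: " + padding;
        if (compact.Length % 4 != 0)
            return "length " + compact.Length + " is not a multiple of 4";
        return "ok";
    }
}
```

Calling Base64Check.Describe(mediaObjectNode.InnerText) right before the failing line would show which of the two error messages applies and where.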

futureelite7
Well - maybe there is an error there, but this is how I get the file (with this encoding). How can I check whether it is the correct one?
Kjensen
A: 

Well, it's all very simple. CDATA is a node itself, so mediaObjectNode.InnerText actually produces <![CDATA[/9j/4AAQ[...snip...]P4Vm9zOR//Z=]]>, which is obviously not valid Base64-encoded data.

To make things work, use mediaObjectNode.ChildNodes[0].Value and pass that value to Convert.FromBase64String.

Anton Gogolev
I tried saving the contents of mediaObjectNode.InnerText to a text file (after outputting it to a console), and no CDATA markup is included. I tried your suggestion anyway, but it makes no difference.
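The observation in this comment can be reproduced in isolation. A minimal sketch, assuming a tiny inline document instead of the real file (the class CDataCheck and the method ReadInnerText are hypothetical names used only for illustration):

```csharp
using System.Xml;

static class CDataCheck
{
    // Hypothetical helper: returns the text XmlDocument reports for the root element.
    public static string ReadInnerText(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        // InnerText concatenates the values of the child nodes. For a CDATA
        // section that value is the content only, without the CDATA markers.
        return doc.DocumentElement.InnerText;
    }
}
```

For an element such as `<media-object><![CDATA[/9j/4AAQ]]></media-object>`, this returns just /9j/4AAQ, which matches what was written to the text file.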
Kjensen
A: 

Try using Linq to XML:

using System;
using System.Xml.Linq;
using System.Xml.XPath;

class Program
{
    static void Main(string[] args)
    {
        var elements = XElement
            .Load("test.xml")
            .XPathSelectElements("//media/media-object[@encoding='base64']");
        foreach (var element in elements)
        {
            byte[] image = Convert.FromBase64String(element.Value);
        }
    }
}


UPDATE:

After downloading the XML file and analyzing the value of the media-object node it is clear that it is not a valid base64 string:

string value = "PUT HERE THE BASE64 STRING FROM THE XML WITHOUT THE NEW LINES";
byte[] image = Convert.FromBase64String(value);

throws a System.FormatException saying that the length is not valid for a base64 string. Even when I remove the \n from the string, it doesn't work:

var elements = XElement
    .Load("20091123-125320.xml")
    .XPathSelectElements("//media/media-object[@encoding='base64']");
foreach (var element in elements)
{
    string value = element.Value.Replace("\n", "");
    byte[] image = Convert.FromBase64String(value);
}

also throws System.FormatException.

Darin Dimitrov
I get the same error, no luck.
Kjensen
+4  A: 

For a first shot I didn't use any programming language, just Notepad++.

I opened the XML file in it and copied and pasted the raw base64 content into a new file (without the square brackets).

Afterwards I selected everything (Ctrl+A) and used the option Plugins - MIME Tools - Base64 Decode. This threw an error about the wrong text length (it must be a multiple of 4). So I just added two equal signs ('=') as placeholders at the end to get the correct length.

I tried again and it decoded successfully into 'something'. Just save the file as .jpg and it opens like a charm in any picture viewer.
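Whether the decoded 'something' really is a JPEG can also be checked in code before saving it. JPEG data starts with the bytes FF D8, and the /9j/ prefix seen in the question decodes to FF D8 FF. A minimal sketch, where the class JpegCheck and both method names are hypothetical and the output path is an assumption of the caller:

```csharp
using System;
using System.IO;

static class JpegCheck
{
    // True when the data starts with the JPEG start-of-image marker FF D8.
    public static bool LooksLikeJpeg(byte[] data)
    {
        return data != null && data.Length >= 2
            && data[0] == 0xFF && data[1] == 0xD8;
    }

    // Writes the bytes to the given path only when they look like a JPEG.
    // Returns false (and writes nothing) otherwise.
    public static bool SaveIfJpeg(byte[] data, string path)
    {
        if (!LooksLikeJpeg(data))
            return false;
        File.WriteAllBytes(path, data);
        return true;
    }
}
```

This only inspects the first two bytes, so it catches a completely wrong decode but not a truncated image.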

So I would say there IS something wrong with the data you get. It just doesn't have the right number of equal signs at the end to fill it up to a length that can be broken into groups of 4.

The 'easy' way would be to add equal signs until the decoding no longer throws an error. The better way would be to count the number of characters (minus CR/LFs!) and add the needed ones in one step.

Further investigations

After some coding and reading up on the Convert function, it turns out the problem is an incorrectly attached equal sign from the producer. Notepad++ has no problem with tons of equal signs, but the Convert function from MS only works with zero, one or two of them. So if you fill up the already existing one with additional equal signs, you get an error too! To get this damn thing to work, you have to cut off all existing equal signs, calculate how many are needed and add them again.

Just for the bounty, here is my code (not absolutely perfect, but enough for a good starting point): ;-)
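The strictness of Convert.FromBase64String about padding can be seen with a one-byte example. This is only a small sketch; the class PaddingDemo and the method Decodes are hypothetical names:

```csharp
using System;

static class PaddingDemo
{
    // Returns true when Convert.FromBase64String accepts the text,
    // false when it rejects it with a FormatException.
    public static bool Decodes(string text)
    {
        try
        {
            Convert.FromBase64String(text);
            return true;
        }
        catch (FormatException)
        {
            return false;
        }
    }
}
```

With this helper, Decodes("QQ==") returns true, while Decodes("QQ=") and Decodes("QQ===") both return false: one equal sign too few or too many is enough to make the call fail.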

using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

class Program
{
    static void Main(string[] args)
    {
        var elements = XElement
            .Load("test.xml")
            .XPathSelectElements("//media/media-object[@encoding='base64']");
        foreach (XElement element in elements)
        {
            var image = AnotherDecode64(element.Value);
        }
    }

    static byte[] AnotherDecode64(string base64Encoded)
    {
        // Drop all whitespace first, then cut off whatever padding the producer attached.
        string temp = new string(base64Encoded.Where(c => !Char.IsWhiteSpace(c)).ToArray())
            .TrimEnd('=');
        int padding;
        switch (temp.Length % 4)
        {
            case 1:
                // This would always produce an exception,
                // regardless of what (or what not) you attach to the string!
                // Better would be some kind of throw new Exception()
                return new byte[0];
            case 2:
                padding = 2;
                break;
            case 3:
                padding = 1;
                break;
            default:
                padding = 0;
                break;
        }
        // Re-attach exactly as many equal signs as the length requires.
        temp += new String('=', padding);

        return Convert.FromBase64String(temp);
    }
}
Oliver
Brilliant! Thanks! :)
Kjensen
+1  A: 

The base64 string is not valid, as Oliver has already said: the string length must be a multiple of 4 after removing whitespace characters. If you look at the end of the base64 string (see below), you will see that the last line is shorter than the rest.

RRRRRRRRRRRRRRRRRRRRRRRRRRRRX//Z=

If you remove this line, your program will work, but the resulting image will have a missing section in the bottom right-hand corner. You need to pad this line so the overall string length is correct. From my calculations, if you add 3 characters it should work.

RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRX//Z=
Andrew