views:

124

answers:

3

I'm trying to figure out a way to parse out a base64 string from with a larger string.

I have the string "Hello <base64 content> World" and I want to be able to parse out the base64 content and convert it back to a string. "Hello Awesome World"

Answers in C# preferred.

Edit: Updated with a more real example.

--abcdef
\n
Content-Type: Text/Plain;
Content-Transfer-Encoding: base64
\n
<base64 content>
\n
--abcdef--

This is taken from 1 sample. The problem is that the Content.... vary quite a bit from one record to the next.

+2  A: 

In short form you could:

  • split the string on any chars that are not valid base64 data or padding
  • try to convert each token
  • if the conversion succeeds, call replace on the original string to switch the token with the converted value

In code:

var delimiters = new char[] { /* non-base64 ASCII chars */ };
var possibles = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
//need to tweak to include padding chars in matches, but still split on padding?
//maybe better off creating a regex to match base64 + padding
//and using Regex.Split?

foreach(var match in possibles)
{
    try
    {
        var converted = Convert.FromBase64String(match);
        var text = System.Text.Encoding.UTF8.GetString(converted);
        if(!string.IsNullOrEmpty(text))
        {
            value = value.Replace(match, text);
        }
    } 
    catch (System.ArgumentNullException) 
    {
        //handle it
    }
    catch (System.FormatException) 
    {
        //handle it
    }
}

Without a delimiter though, you can end up converting non-base64 text that happens to be also be valid as base64 encoded text.

Looking at your example of trying to convert "Hello QXdlc29tZQ== World" to "Hello Awesome World" the above algorithm could easily generate something like "ée¡Ý•Í½µ”¢¹]" by trying to convert the whole string from base64 since there is no delimiter between plain and encoded text.

Update (based on comments):

If there are no '\n's in the base64 content and it is always preceded by "Content-Transfer-Encoding: base64\n", then there is a way:

  • split the string on '\n'
  • iterate over all the tokens until a token ends in "Content-Transfer-Encoding: base64"
  • the next token (if there are any) should be decoded (if possible) and then the replacement should be made in the original string
  • return to iterating until out of tokens

In code:

private string ConvertMixedUpTextAndBase64(string value)
{
    var delimiters = new char[] { '\n' };
    var possibles = value.Split(delimiters, 
                                StringSplitOptions.RemoveEmptyEntries);

    for (int i = 0; i < possibles.Length - 1; i++)
    {
        if (possibles[i].EndsWith("Content-Transfer-Encoding: base64"))
        {
            var nextTokenPlain = DecodeBase64(possibles[i + 1]);
            if (!string.IsNullOrEmpty(nextTokenPlain))
            {
                value = value.Replace(possibles[i + 1], nextTokenPlain);
                i++;
            }
        }                
    }
    return value;
}

private string DecodeBase64(string text)
{
    string result = null;
    try
    {
        var converted = Convert.FromBase64String(text);
        result = System.Text.Encoding.UTF8.GetString(converted);
    }
    catch (System.ArgumentNullException)
    {
        //handle it
    }
    catch (System.FormatException)
    {
        //handle it
    }
    return result;
}
jball
The last part is the tricky part. For instance, if you split and obtain "aaBG" as your string, what do you do? This is the base64 representation of "i F". You'd need some heuristic to decide which is the one you actually want.
Yuliy
+1  A: 

There is no reliable way to do it. How would you know that, for instance, "Hello" is not a base64 string ? OK, it's a bad example because base64 is supposed to be padded so that the length is a multiple of 4, but what about "overflow" ? It's 8-character long, it is a valid base64 string (it would decode to "¢÷«~Z0"), even though it's obviously a normal word to a human reader. There's just no way you can tell for sure whether a word is a normal word or base64 encoded text.

The fact that you have base64 encoded text embedded in normal text is clearly a design mistake, I suggest you do something about it rather that trying to do something impossible...

Thomas Levesque