ansaurus

Question

Answer 1

+5 A:

Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII data.

So whether the data should be considered dangerous depends on what the encoded data is going to be used for.

But if you’re just looking for a regular expression to match Base64 encoded words, you can use the following:

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
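As a rough illustration of how the pattern above behaves, here is a minimal Python sketch. Python is an arbitrary choice here, and `looks_like_base64` is a hypothetical helper name, not something from the answer. Note that the pattern also accepts the empty string, and that it only checks the shape of the text, not whether the decoded bytes are meaningful.

```python
import re

# The pattern from the answer, unchanged: groups of four alphabet characters,
# optionally followed by one padded final group.
BASE64_RE = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$"
)


def looks_like_base64(text: str) -> bool:
    """Return True if text is shaped like padded, standard-alphabet Base64."""
    # fullmatch also rejects a trailing newline, which a bare `$` would allow.
    return BASE64_RE.fullmatch(text) is not None


if __name__ == "__main__":
    print(looks_like_base64("aGVsbG8="))    # well-formed input
    print(looks_like_base64("aGVsbG8"))     # padding missing
    print(looks_like_base64("aGVs\nbG8="))  # newline in the middle
```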

Gumbo 2009-01-24 00:40:53

That doesn't deal with the 'white space' at line ends. I'm not sure whether programs always put the newline at a 4-byte boundary - it would be reasonable, but then following the standard would be reasonable too, and people don't do that.

Jonathan Leffler 2009-01-24 01:09:12

The simplest solution would be to strip out all whitespace (which is ignored as per the RFC) before validation.
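A minimal Python sketch of that idea, assuming the standard alphabet with `=` padding; `is_base64_ignoring_whitespace` is a made-up name for illustration. It removes every whitespace character first, so line breaks no longer need to fall on a 4-character boundary.

```python
import re

# Shape check for padded, standard-alphabet Base64 (no whitespace allowed).
BASE64_RE = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)


def is_base64_ignoring_whitespace(text: str) -> bool:
    """Strip all whitespace, then validate what is left."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes spaces, tabs, CR and LF alike.
    compact = "".join(text.split())
    return BASE64_RE.fullmatch(compact) is not None


if __name__ == "__main__":
    wrapped = "aGVs\r\nbG8=\r\n"  # line break after 4 characters
    odd_wrap = "aGV\nsbG8=\n"     # line break off the 4-character boundary
    print(is_base64_ignoring_whitespace(wrapped))
    print(is_base64_ignoring_whitespace(odd_wrap))
```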

Ben Blank 2009-01-24 01:35:20

Answer 2

+1 A:

Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like

use feature 'say';
use MIME::Base64;

my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;

say decode_base64($sanitized_str);

might be what you want. It produces

This is simple ASCII Base64 for StackOverflow exmaple.
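For comparison, roughly the same line-based filter can be sketched in Python. This is an illustration rather than a translation supplied by the answer's author, and `decode_clean_lines` is a hypothetical name. It drops every line that contains a character outside the standard alphabet and decodes whatever lines remain.

```python
import base64
import re

# A line is kept only if it consists entirely of Base64 alphabet characters.
CLEAN_LINE_RE = re.compile(r"[A-Za-z0-9+/=]*")


def decode_clean_lines(text: str) -> bytes:
    """Discard lines containing non-Base64 characters, then decode the rest."""
    # splitlines() handles both "\n" and "\r\n" line endings.
    kept = [line for line in text.splitlines() if CLEAN_LINE_RE.fullmatch(line)]
    return base64.b64decode("".join(kept))


if __name__ == "__main__":
    sample = "aGVs\nhttp://www.stackoverflow.com\nbG8=\n"
    print(decode_clean_lines(sample))
```

The weakness is the same as in the Perl version: a stray line that happens to contain only alphabet characters would be kept and would corrupt the result.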

oylenshpeegul 2009-01-24 01:01:49

I can agree there, but all the OTHER letters in the URL do happen to be valid base64... So, where do you draw the line? Just at line breaks? (I have seen ones where there are just a couple of random chars in the middle of the line. Can't toss the rest of the line just because of that, IMHO)...

LarryF 2009-01-24 02:05:47

@LarryF: unless there's integrity checking on the base-64 encoded data, you can't tell what to do with any base-64 block of data containing incorrect characters. Which is the best heuristic: ignore the incorrect characters (allowing any and all correct ones) or reject the lines, or reject the lot?

Jonathan Leffler 2009-01-24 04:08:49

(continued): the short answer is "it depends" - on where the data comes from and the sorts of mess you find in it.

Jonathan Leffler 2009-01-24 04:09:36

(resumed): I see from comments to the question that you want to accept anything that might be base-64. So simply strip out each and every character that's not in your base-64 alphabet (note that there are URL-safe and other such variant encodings), including the newlines and colons, and take what's left.
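A rough Python sketch of that character-level approach, assuming the standard `+` and `/` alphabet (a URL-safe variant would need `-` and `_` instead). The name `salvage_base64` is invented for illustration. Padding is thrown away along with everything else and recomputed, since stray `=` characters in the middle of the input would otherwise get in the way.

```python
import base64
import re


def salvage_base64(text: str) -> bytes:
    """Drop everything outside the Base64 alphabet and decode what is left."""
    # Remove newlines, colons, dots, '=' and any other non-alphabet character.
    kept = re.sub(r"[^A-Za-z0-9+/]", "", text)
    # A remainder of one character cannot encode a whole byte.
    if len(kept) % 4 == 1:
        raise ValueError("leftover characters cannot form valid Base64")
    # Re-pad to a multiple of four before decoding.
    padded = kept + "=" * (-len(kept) % 4)
    return base64.b64decode(padded)


if __name__ == "__main__":
    print(salvage_base64("aGVs: bG8=\n"))
```

As noted in the comments above, this accepts a lot: any letters from a stray URL would be decoded as if they were data, so it only makes sense when the junk is known to consist of characters outside the alphabet.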

Jonathan Leffler 2009-01-24 04:11:57


RegEx to parse or validate Base64 data
