Hello, I am currently looking to detect whether a URL is encoded or not. Here are some specific examples:

  1. http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL21lZGlhLXBsYXllci8%3D&b=13
  2. http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL290aGVyX2ZpbGVzL2VzcG5zdGFyL25hdl9iZy1vZmYucG5n&b=13

Can you please give me a regular expression for this? Is there a self-learning regular expression generator out there that can refine the regex as the number of inputs increases?

A: 

Well, depending on what is in that encoded text, you might not even need a regular expression. If several query string parameters are packed into that one "u" key, you could just check the length of each query string value and, if it is over (say) 50 characters, assume it's probably encoded. I doubt any unencoded single parameter would be as long as these, since it would have to be string data, and string data would probably need to be encoded anyway!
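A rough sketch of that length check in Python, using only the standard library. The function name long_query_values is made up for illustration, and the default threshold of 50 is just the "(say) 50" from above, not a tested value.

```python
from urllib.parse import urlsplit, parse_qsl

def long_query_values(url, threshold=50):
    """Return the names of query parameters whose decoded value is longer
    than `threshold` characters (hypothetical helper, arbitrary threshold)."""
    query = urlsplit(url).query
    # parse_qsl percent-decodes each value, so "%3D" counts as one character
    return [name for name, value in parse_qsl(query) if len(value) > threshold]

url1 = "http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL21lZGlhLXBsYXllci8%3D&b=13"
url2 = ("http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL290aGVyX2ZpbGVz"
        "L2VzcG5zdGFyL25hdl9iZy1vZmYucG5n&b=13")

print(long_query_values(url2))  # ['u']  (68 characters)
print(long_query_values(url1))  # []     (40 characters, under the threshold)
```

Note that the first example from the question falls under 50 characters, so the threshold would need tuning against real data.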

x4000
A: 

This question may be harder than you realize. For example:

I could say that if a URL includes a question mark character, then what follows it is encoded.

Now, it may be simple encoding like "?year=2009" or complicated like in your examples.

Or

The site URLs could use URL rewriting (like this site does). Look at the URL of this question. The "615958" is encoded and... no question marks were used!

In fact, you could say that the entire URL is encoded!

Perhaps you need to better define what you mean by "encoded".

BoltBait
A: 

You can't reliably parse a URL using regex. (Is this an SO mantra yet?)

“Here are some specific examples:”

It's not clear what ‘encoded’ means — can you give some counter-examples of URLs you consider “not encoded”?

Are you talking about the Base64 encoding in the ‘u’ parameter? Whilst it is possible to say whether a string is a valid Base64 string, it's not possible to detect Base64 and distinguish it from anything else; for example the word “sausages” also happens to be valid Base64 (it decodes to '\xb1\xab\xacj\x07\xac').
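To make that concrete, here is a small Python sketch using the standard base64 module. The helper name is_valid_base64 is invented for this example; it only answers "can this be decoded?", which is exactly why it cannot tell real Base64 from an ordinary word.

```python
import base64
import binascii

def is_valid_base64(text):
    """Return True if `text` decodes as strict Base64 (hypothetical helper)."""
    try:
        # validate=True rejects characters outside the Base64 alphabet
        base64.b64decode(text, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False

print(is_valid_base64("sausages"))      # True, although it is just a word
print(base64.b64decode("sausages"))     # b'\xb1\xab\xacj\x07\xac'
print(is_valid_base64("not base64!"))   # False: space and "!" are not in the alphabet
```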

bobince
+2  A: 

If you are interested in the base64-encoded URLs, you can do it.

A little theory. If L and R are regular languages and T is a regular transducer, then LR (concatenation), L & R (intersection), L | R (union), TR(L) (image) and TR^-1(L) (inverse image) are all regular languages. Every regular language has a regular expression that generates it, and every regexp generates a regular language. URLs can be described by a regular language (unless you need a subset of them that is not regular), and almost every escaping scheme (base64 included) is a regular transducer. Therefore, in theory, it's possible.

In practice, it gets rather messy.

A regex for valid base64 strings is ([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(==|[A-Za-z0-9+/]=))
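As a quick sanity check, the pattern can be tried in Python (written here with its parentheses balanced). The names BASE64_RE and looks_like_base64 are illustrative only, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# Groups of four alphabet characters, optionally followed by a padded tail
BASE64_RE = re.compile(
    r"([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(==|[A-Za-z0-9+/]=))"
)

def looks_like_base64(text):
    """True if the entire string has the shape of a Base64 string."""
    return BASE64_RE.fullmatch(text) is not None

print(looks_like_base64("Oi8v"))   # True: one full group of four
print(looks_like_base64("Oi8="))   # True: three characters plus one "="
print(looks_like_base64("Oi=="))   # True: two characters plus "=="
print(looks_like_base64("Oi8"))    # False: padding is missing
print(looks_like_base64(""))       # True: the empty string also matches
```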

If it is embedded in a query parameter of a URL, it will probably be URL-encoded. Let's assume only the = will be URL-encoded (other characters can be too, but don't need to be).

This gets us to something like [?&][^?&#=;]+=([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D))

Another possibility is to consider only those base64-encoded URLs that have some property. In your case, the decoded values all begin with "://", which is fortunate, because those 3 bytes translate exactly to the 4 characters "Oi8v". Otherwise, it would be more complex.

This gets [?&][^?&#=;]+=Oi8v([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D))
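A hedged sketch of how this last pattern might be applied in Python. Two things here are additions for the example and not part of the pattern above: the names PARAM_RE and has_encoded_url_param, and the trailing lookahead (?=[&#;]|$). Without some end anchor, search() would accept any value that merely starts with "Oi8v", because the tail of the pattern can match the empty string.

```python
import re

B64_TAIL = r"([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D))"
# The lookahead (an assumption of this sketch) forces the value to end at
# the next parameter separator, the fragment, or the end of the URL.
PARAM_RE = re.compile(r"[?&][^?&#=;]+=Oi8v" + B64_TAIL + r"(?=[&#;]|$)")

def has_encoded_url_param(url):
    """True if some query parameter looks like Base64 text starting with '://'."""
    return PARAM_RE.search(url) is not None

url1 = "http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL21lZGlhLXBsYXllci8%3D&b=13"
url2 = ("http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL290aGVyX2ZpbGVz"
        "L2VzcG5zdGFyL25hdl9iZy1vZmYucG5n&b=13")

print(has_encoded_url_param(url1))                             # True
print(has_encoded_url_param(url2))                             # True
print(has_encoded_url_param("http://example.com/?year=2009"))  # False
```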

As you can see, it gets messier and messier. Therefore, I'd rather recommend that you:

  1. break the URL in its parts (eg. protocol, host, query string)
  2. get the parameters from the query string, and urldecode them
  3. try base64 decode on the values of the parameters
  4. apply your criterion for "good encoded URLs"
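
The four steps above could be sketched in Python roughly as follows, using only the standard library. The function name base64_url_params and the "://" prefix test are assumptions made for this example; substitute whatever criterion you actually need in step 4.

```python
import base64
import binascii
from urllib.parse import urlsplit, parse_qsl

def base64_url_params(url, prefix="://"):
    """Return {name: decoded_text} for query parameters whose value is
    Base64 that decodes to text starting with `prefix` (hypothetical helper)."""
    found = {}
    query = urlsplit(url).query                           # step 1: split the URL
    for name, value in parse_qsl(query, keep_blank_values=True):  # step 2: urldecode
        # parse_qsl turns an unescaped "+" into a space; a space is never
        # valid Base64, so map it back before decoding.
        candidate = value.replace(" ", "+")
        try:
            raw = base64.b64decode(candidate, validate=True)      # step 3
            text = raw.decode("utf-8")
        except (binascii.Error, ValueError):
            continue                                      # not Base64, or not text
        if text.startswith(prefix):                       # step 4: your criterion
            found[name] = text
    return found

url1 = "http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL21lZGlhLXBsYXllci8%3D&b=13"
print(base64_url_params(url1))  # {'u': '://espnstar.com/media-player/'}
```

Letting the URL parser and the Base64 decoder do the work keeps the "is it encoded?" decision in one readable place (step 4) instead of inside a regex.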
jpalecek
Thanks jpalecek, I will take your advice.
Pushkar