views:

251

answers:

5

What is the meaning of a ^ sign in a URL?

I needed to crawl some link data from a webpage and I was using a simple handwritten PHP crawler for it. The crawler usually works fine; then I came to a URL like this:

http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2%20from%20table%20where%20recordid%3E=20^^^^^

This URL works fine when typed into a browser, but my crawler is not able to retrieve the page. I am getting an "HTTP request failed" error.

+6  A: 

Based on the context, I'd guess they're a homespun attempt to URL-encode quote-marks.

Bob Kaufman
A: 

The crawler may be using regular expressions to parse the URL and is therefore falling over because the caret (^) means beginning of line. I think these URLs are really bad practice since they expose the underlying database structure; whoever wrote this might want to consider serious refactoring!

HTH!

Gav
No, the crawler is 4 lines of code; no parsing, nothing.
Crimson
+2  A: 

Caret (^) is not a reserved character in URLs, so it should be acceptable to use as-is. However, if you're having problems, just replace it with its percent-encoded form, %5E.
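
As a rough illustration of that replacement, here is a minimal Python sketch (the question's crawler is PHP, where the same idea is a plain string replacement before the request is made). The variable names are made up for this example; the URL is the one from the question.

```python
# The URL from the question, with the carets left raw.
raw_url = (
    "http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2%20from"
    "%20table%20where%20recordid%3E=20^^^^^"
)

# Replace only the caret; everything else, including the existing
# %20 and %3E escapes, is left exactly as it was.
encoded_url = raw_url.replace("^", "%5E")

print(encoded_url)
```

Replacing just the one character avoids double-encoding the escapes that are already present in the URL.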

And yeah, putting raw SQL in the URL is like a big flashing neon sign reading "EXPLOIT ME PLEASE!". Whoever designed the site in question is brain-dead.

Tyler McHenry
It isn't reserved, but it also isn't "unreserved", meaning it "must be escaped" according to RFC2396, section 2.4.
Laurence Gonsalves
RFC 2396 is obsoleted by 3986, though the point about it not being "unreserved" still applies.
Anon.
@Anon true, but 3986 spreads the relevant information around so much that I couldn't find a good single place to cite. RFC 2396 is still mostly correct for all practical purposes.
Laurence Gonsalves
+5  A: 

^ characters should be encoded, see RFC 1738 Uniform Resource Locators (URL):

Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".

All unsafe characters must always be encoded within a URL

You could try URL encoding the ^ character.
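
A small Python sketch of what that could look like, assuming you only want to escape the characters named in the RFC 1738 sentence quoted above. The names UNSAFE and encode_unsafe are hypothetical, chosen for this example. Note that RFC 1738 lists further unsafe characters elsewhere (including "%" itself), which this sketch deliberately leaves alone so that existing escapes such as %20 are not encoded twice.

```python
# Characters from the RFC 1738 sentence quoted above.
UNSAFE = '{}|\\^~[]`'


def encode_unsafe(url: str) -> str:
    """Percent-encode only the characters in UNSAFE; leave the rest untouched."""
    return "".join(
        f"%{ord(ch):02X}" if ch in UNSAFE else ch
        for ch in url
    )


print(encode_unsafe("http://www.example.com/example.asp?x7=3^^select%20col1"))
```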

Brian R. Bondy
Don't ever make a recommendation that should not be carried out. The questioner may take you seriously. He should contact the site owners as to their stupidity, but not drop tables.
SamGoody
Jokes aside, +1 because the RFC really tells us to encode our ^.
Bruno Brant
@samgoody: Took out what was meant to be the funny part of my answer.
Brian R. Bondy
+1  A: 

Caret is neither reserved nor "unreserved", which makes it an "unsafe character" in URLs. It should never appear in a URL unencoded. From RFC2396:

2.2. Reserved Characters

   Many URI include components consisting of or delimited by, certain
   special characters.  These characters are called "reserved", since
   their usage within the URI component is limited to their reserved
   purpose.  If the data for a URI component would conflict with the
   reserved purpose, then the conflicting data must be escaped before
   forming the URI.

      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                    "$" | ","

   The "reserved" syntax class above refers to those characters that are
   allowed within a URI, but which may not be allowed within a
   particular component of the generic URI syntax; they are used as
   delimiters of the components described in Section 3.

   Characters in the "reserved" set are not reserved in all contexts.
   The set of characters actually reserved within any given URI
   component is defined by that component. In general, a character is
   reserved if the semantics of the URI changes if the character is
   replaced with its escaped US-ASCII encoding.

2.3. Unreserved Characters

   Data characters that are allowed in a URI but do not have a reserved
   purpose are called unreserved.  These include upper and lower case
   letters, decimal digits, and a limited set of punctuation marks and
   symbols.

      unreserved  = alphanum | mark

      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

   Unreserved characters can be escaped without changing the semantics
   of the URI, but this should not be done unless the URI is being used
   in a context that does not allow the unescaped character to appear.

2.4. Escape Sequences

   Data must be escaped if it does not have a representation using an
   unreserved character; this includes data that does not correspond to
   a printable character of the US-ASCII coded character set, or that
   corresponds to any US-ASCII character that is disallowed, as
   explained below.
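
To make the classification concrete, here is a minimal Python sketch that checks a character against the "reserved" and "unreserved" sets exactly as quoted above from RFC 2396. The names RESERVED, UNRESERVED and classify are hypothetical, and the sets reflect RFC 2396 only, not the revised sets in RFC 3986.

```python
import string

# Sets copied from the RFC 2396 grammar quoted above.
RESERVED = set(";/?:@&=+$,")
MARK = set("-_.!~*'()")
UNRESERVED = set(string.ascii_letters + string.digits) | MARK


def classify(ch: str) -> str:
    """Return which RFC 2396 class a single character falls into."""
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    # Anything else has to be escaped when used as data. Note that "%",
    # which introduces an escape sequence, also lands here in this sketch.
    return "neither"


for ch in "?a~^":
    print(repr(ch), classify(ch))
```

The caret comes out as "neither", which is the case section 2.4 says must be escaped.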
Laurence Gonsalves
I should add that RFC 3986 supersedes 2396, but 3986 spreads the relevant information around so much that I couldn't find a good single place to cite. RFC 2396 is still mostly correct for all practical purposes, and is generally easier to understand.
Laurence Gonsalves