I am writing BBCode for my own forum (based on PHP). How can I find out whether an invalid URL has been provided in the [url] tag? Which characters make a URL invalid?
Not really an answer to your question, but validating URLs is a serious pain. You're probably better off validating just the domain name and leaving the query part of the URL alone. That is my experience. You could also request the URL and see whether it returns a valid response, but that might be too much for such a simple task.
Regular expressions for detecting URLs are abundant; google it :)
It is not just a matter of which characters. Different characters are legal at different points. For example, according to RFC 2396, an unescaped '?' is legal in the fragment part but not the path part.
You need to read RFC 2396 to understand the details, or ask a more specific question. Or, if you really mean URI rather than URL, then RFC 3986 is what you should be reading.
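To illustrate the point about '?' meaning different things in different parts, here is a small sketch in Python (chosen only for illustration; the forum in the question is PHP). It uses the standard library's urlsplit, and the example URL is made up. The first '?' ends the path and starts the query, while a '?' after the '#' is just an ordinary character in the fragment.

```python
from urllib.parse import urlsplit

# Hypothetical URL with one '?' before the '#' and one after it.
parts = urlsplit("http://example.com/a?b#c?d")

path = parts.path          # everything up to the first '?'
query = parts.query        # between the first '?' and the '#'
fragment = parts.fragment  # after the '#'; the second '?' stays literal

print(path, query, fragment)
```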
All valid characters that can be used in a URI (a URL is a type of URI) are defined in RFC 3986.
All other characters can be used in a URL provided that they are "URL encoded" first. This involves replacing the invalid character with a specific code, usually the percent symbol (%) followed by two hexadecimal digits.
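As a rough illustration of that encoding step, here is a sketch using Python's standard urllib.parse module (PHP has comparable built-ins such as rawurlencode). The file name is a made-up example.

```python
from urllib.parse import quote, unquote

# Hypothetical file name containing a space and square brackets.
raw = "my file[1].html"

# quote() percent-encodes everything except unreserved characters and '/'.
encoded = quote(raw)

# unquote() reverses the encoding.
decoded = unquote(encoded)

print(encoded)  # my%20file%5B1%5D.html
print(decoded)  # my file[1].html
```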
This link, HTML URL Encoding Reference, contains a list of the encodings for invalid characters.
In general, URIs as defined by RFC 3986 may contain any of the following characters: A-Z a-z 0-9 - . _ ~ : / ? # [ ] @ ! $ & ' ( ) * + , ; =. Any other character needs to be encoded with percent-encoding (%hh). Each part of the URI has further restrictions about which characters need to be represented by a percent-encoded word.
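A first-pass check against that character list might look like the sketch below. The function name has_only_uri_characters is made up for this example, and the check is only a necessary condition: it does not apply the per-part restrictions, so a string can pass here and still be an invalid URI.

```python
import re

# The characters listed above, plus '%' so percent-encoded bytes are accepted.
_ALLOWED = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")

# A '%' that is not followed by exactly two hex digits is malformed.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_only_uri_characters(text: str) -> bool:
    """Return True if every character may appear somewhere in a URI."""
    if _ALLOWED.fullmatch(text) is None:
        return False
    return _BAD_PERCENT.search(text) is None


print(has_only_uri_characters("http://example.com/a%20b"))  # True
print(has_only_uri_characters("http://example.com/a b"))    # False
```

In a [url] tag this would reject things like raw spaces or angle brackets early, before any stricter per-part validation.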
See 'regex for url validation' in a previous Stack Overflow question.
In your supplementary question you asked if www.example.com/file[/].html is a valid URL.
That URL isn't valid because a URL is a type of URI and a valid URI must have a scheme like http: (see RFC 3986).
If you meant to ask if http://www.example.com/file[/].html is a valid URL then the answer is still no because the square bracket characters aren't valid there.
The square bracket characters are reserved for URLs in this format: http://[2001:db8:85a3::8a2e:370:7334]/foo/bar (i.e. an IPv6 literal instead of a host name).
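Those two checks (a scheme must be present, and square brackets may appear only around the host) can be sketched in Python with the standard urlsplit function. The helper name url_problems is hypothetical, and this is a partial check rather than a full RFC 3986 validator.

```python
from urllib.parse import urlsplit


def url_problems(url: str) -> list:
    """Return a list of problems found; an empty list means none were found."""
    try:
        parts = urlsplit(url)
    except ValueError as exc:
        # urlsplit raises ValueError for a malformed bracketed host.
        return [f"unparseable: {exc}"]

    problems = []
    if not parts.scheme:
        problems.append("missing scheme")
    # Brackets are only legal around an IP literal in the host part.
    for name in ("path", "query", "fragment"):
        value = getattr(parts, name)
        if "[" in value or "]" in value:
            problems.append(f"square bracket in {name}")
    return problems


print(url_problems("www.example.com/file[/].html"))
print(url_problems("http://www.example.com/file[/].html"))
print(url_problems("http://[2001:db8:85a3::8a2e:370:7334]/foo/bar"))
```

The first two URLs from the question are reported as problematic, while the IPv6 form comes back with no problems.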
It's worth reading RFC 3986 carefully if you want to understand the issue fully.