I'm writing a sitemap generator in Object Pascal and need a good function or library that emulates PHP's parse_url function.
Does anyone know of any good ones?
The URI RFC lists this regular expression for URI parsing:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers under the expression label its capture groups, which match as follows:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
For this URI:
http://www.ics.uci.edu/pub/ietf/uri/#Related
The regular expression is simple and relies on no special features of the regular expression library, so pick any library that is compatible with your Pascal implementation and you are set.
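As a minimal sketch of that approach, assuming Free Pascal and the TRegExpr class from its RegExpr unit (any other engine should work the same way), the program below applies the expression to the example URI and prints the groups that carry the components. The program name and the output labels are made up for illustration.

```pascal
program UriRegexDemo;

{$mode objfpc}{$H+}

uses
  RegExpr;

const
  // The URI-splitting expression quoted above.
  UriPattern = '^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?';

var
  Re: TRegExpr;
begin
  Re := TRegExpr.Create;
  try
    Re.Expression := UriPattern;
    if Re.Exec('http://www.ics.uci.edu/pub/ietf/uri/#Related') then
    begin
      // Groups 2, 4, 5, 7 and 9 hold the components without their delimiters.
      // A group that did not take part in the match comes back as an empty string.
      WriteLn('scheme:    ', Re.Match[2]);
      WriteLn('authority: ', Re.Match[4]);
      WriteLn('path:      ', Re.Match[5]);
      WriteLn('query:     ', Re.Match[7]);
      WriteLn('fragment:  ', Re.Match[9]);
    end
    else
      WriteLn('No match');
  finally
    Re.Free;
  end;
end.
```

Groups 2, 5, 7 and 9 correspond roughly to the scheme, path, query and fragment keys of parse_url. Group 4 is the whole authority, so any user info or port would still have to be split off from the host separately.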
I am not familiar with PHP's parse_url function, but you might try the TIdURI class that is included with Indy (which in turn is included with most recent Delphi releases). I think Indy has been ported to Free Pascal as well.
TIdURI is a TObject descendant that encapsulates a Universal Resource Identifier, as described in the Internet Standards document:
TIdURI provides methods and properties for assembly and disassembly of URIs using the component parts that make up the URI, including: Protocol, Host, Port, Path, Document, and Bookmark.
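A minimal sketch of reading those parts, assuming Indy's IdURI unit is on the unit search path. The example URL and the program name are placeholders chosen for illustration, not something from the Indy documentation.

```pascal
program IdUriDemo;

{$APPTYPE CONSOLE}

uses
  IdURI;

var
  URI: TIdURI;
begin
  // The constructor parses the URL; the parts are then exposed as string properties.
  URI := TIdURI.Create('http://www.example.com:8080/docs/index.html?lang=en#top');
  try
    WriteLn('Protocol: ', URI.Protocol);
    WriteLn('Host:     ', URI.Host);
    WriteLn('Port:     ', URI.Port);
    WriteLn('Path:     ', URI.Path);
    WriteLn('Document: ', URI.Document);
    WriteLn('Params:   ', URI.Params);
    WriteLn('Bookmark: ', URI.Bookmark);
  finally
    URI.Free;
  end;
end.
```

As far as I know, TIdURI splits the directory part (Path) from the last segment (Document), so the two have to be concatenated to get something comparable to the path that parse_url returns.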
If that does not work, please give a specific example of what you are trying to accomplish: what are you trying to parse out of a URL?
If you're using wininet.dll, you can also use its InternetCrackUrl API.
Free Pascal has the unit URIParser with the ParseURI function. An example of how to use it can be found in one of the examples in Free Pascal's source, or in an older example that is somewhat easier to understand.
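For a quick idea of what that looks like, here is a minimal sketch, assuming a Free Pascal installation where the URIParser unit is available. The program name is made up, and the URI is simply reused from the regular expression example in this thread.

```pascal
program ParseUriDemo;

{$mode objfpc}{$H+}

uses
  URIParser;

var
  U: TURI;
begin
  // ParseURI returns a TURI record with one field per component.
  U := ParseURI('http://www.ics.uci.edu/pub/ietf/uri/#Related');
  WriteLn('Protocol: ', U.Protocol);
  WriteLn('Username: ', U.Username);
  WriteLn('Host:     ', U.Host);
  WriteLn('Port:     ', U.Port);
  WriteLn('Path:     ', U.Path);
  WriteLn('Document: ', U.Document);
  WriteLn('Params:   ', U.Params);
  WriteLn('Bookmark: ', U.Bookmark);
end.
```

Port is a numeric field in the record, while the other fields printed here are strings. The same unit also offers EncodeURI to turn a TURI record back into a string.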
Be careful with Indy's TIdURI class. It was supposed to be a general-purpose parser, but it has a few bugs and design flaws that prevent it from being fully compliant. I'm currently writing a new class from scratch for Indy 11 to replace TIdURI. It will be a fully compliant URI parser, and it will also support IRI (RFC 3987) parsing.