I need to parse a URL to get the protocol, host, path, and query in an application I am writing in C++. The application is intended to be cross-platform. I'm surprised I can't find anything that does this in the Boost or POCO libraries. Is it somewhere obvious I'm not looking? Any suggestions on appropriate open source libs? Or is this something I just have to do myself? It's not super complicated, but it seems such a common task that I am surprised there isn't a common solution.
+2
A:
While I don't think there's going to be an explicit library for this, the URL syntax is pretty easy to parse.
#include <algorithm> //std::find
#include <string>
typedef std::string::const_iterator iterator_t;
std::string someQuery(
"http://SomeHost.Somewhere.com/Path/To/File.php?query=hello");
//The first ? is the start of the query
iterator_t queryStart = std::find(someQuery.begin(), someQuery.end(), '?');
iterator_t pathEnd = queryStart; //We'll need this later
//Skip past the ? itself (if there is one)
if (queryStart != someQuery.end()) queryStart++;
std::string queryString(queryStart, someQuery.end()); //Done.
iterator_t pathStart = someQuery.begin();
//Skip past the two slashes in the protocol, to the slash just before the path
for(size_t idx = 0; pathStart != pathEnd && idx < 3; idx++) {
pathStart = std::find(pathStart, pathEnd, '/');
if (pathStart != pathEnd) pathStart++;
}
std::string pathString(pathStart, pathEnd); //Done.
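The same steps can be wrapped in a function so they are easier to reuse and test. This is only a sketch under the same assumption as above (a URL of the form scheme://host/path?query); `UrlParts` and `split_path_and_query` are made-up names, not part of any library.

```cpp
#include <algorithm>
#include <string>

// Hypothetical result type; the names are illustrative only.
struct UrlParts {
    std::string path;
    std::string query;
};

// Splits "scheme://host/path?query" with the same iterator steps as above.
UrlParts split_path_and_query(const std::string& url) {
    typedef std::string::const_iterator iterator_t;

    // The first '?' marks the start of the query.
    iterator_t queryStart = std::find(url.begin(), url.end(), '?');
    iterator_t pathEnd = queryStart;
    if (queryStart != url.end()) ++queryStart;

    // Skip three slashes: two from "://" and the one just before the path.
    iterator_t pathStart = url.begin();
    for (int i = 0; pathStart != pathEnd && i < 3; ++i) {
        pathStart = std::find(pathStart, pathEnd, '/');
        if (pathStart != pathEnd) ++pathStart;
    }

    UrlParts parts;
    parts.path.assign(pathStart, pathEnd);
    parts.query.assign(queryStart, url.end());
    return parts;
}
```

Called on the sample URL, this should give the path `Path/To/File.php` and the query `query=hello`. A URL with no '?' yields an empty query, and one with no slash after the host yields an empty path.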
Billy ONeal
2010-04-11 04:13:12
Edited to fix a bug where sometimes `pathString` would be constructed with invalid iterators.
Billy ONeal
2010-04-11 04:20:01
+1 shorter than mine :p
wilhelmtell
2010-04-11 06:39:31
the fact that he had to edit to fix a bug is why you don't want to write this yourself.
Dustin Getz
2010-06-04 20:43:38
@Dustin true. in fact, he should consider stopping writing software. i too should stop with this futile habit of self-delusion.
wilhelmtell
2010-10-30 15:52:08
A:
Qt has QUrl for this. GNOME has SoupURI in libsoup, which you'll probably find a little more lightweight.
Matthew Flaschen
2010-04-11 04:23:03
+4
A:
Terribly sorry, couldn't help it. :s
url.hh
#ifndef URL_HH_
#define URL_HH_
#include <string>
struct url {
url(const std::string& url_s); // omitted copy, ==, accessors, ...
private:
void parse(const std::string& url_s);
private:
std::string protocol_, host_, path_, query_;
};
#endif /* URL_HH_ */
url.cc
#include "url.hh"
#include <string>
#include <algorithm>
#include <cctype>
#include <functional>
#include <iterator> // back_inserter, distance, advance
using namespace std;
// ctors, copy, equality, ...
void url::parse(const string& url_s)
{
const string prot_end("://");
string::const_iterator prot_i = search(url_s.begin(), url_s.end(),
prot_end.begin(), prot_end.end());
protocol_.reserve(distance(url_s.begin(), prot_i));
transform(url_s.begin(), prot_i,
back_inserter(protocol_),
ptr_fun<int,int>(tolower)); // protocol is icase
if( prot_i == url_s.end() )
return;
advance(prot_i, prot_end.length());
string::const_iterator path_i = find(prot_i, url_s.end(), '/');
host_.reserve(distance(prot_i, path_i));
transform(prot_i, path_i,
back_inserter(host_),
ptr_fun<int,int>(tolower)); // host is icase
string::const_iterator query_i = find(path_i, url_s.end(), '?');
path_.assign(path_i, query_i);
if( query_i != url_s.end() )
++query_i;
query_.assign(query_i, url_s.end());
}
main.cc
// ...
url u("HTTP://stackoverflow.com/questions/2616011/parse-a.py?url=1");
cout << u.protocol() << '\t' << u.host() << ...
wilhelmtell
2010-04-11 06:17:28
Minor nitpick: You don't need to use ptr_fun here, and if you do, you need to `#include <functional>`. (you probably shouldn't `using namespace std` either but I'm assuming this isn't for production code)
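One way to lower-case a range without ptr_fun (which was removed from the standard library in C++17) is a small lambda. This is a sketch that assumes a C++11 compiler; `to_lower_copy` is a hypothetical helper name, not something from the answer.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>

// Hypothetical helper: returns a lower-cased copy of [first, last).
std::string to_lower_copy(std::string::const_iterator first,
                          std::string::const_iterator last) {
    std::string out;
    out.reserve(static_cast<std::string::size_type>(std::distance(first, last)));
    // Taking the character as unsigned char avoids undefined behaviour in
    // std::tolower when plain char is signed and the value is negative.
    std::transform(first, last, std::back_inserter(out),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return out;
}
```

Inside `parse`, each `transform` call could then become something like `protocol_ = to_lower_copy(url_s.begin(), prot_i);`.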
Billy ONeal
2010-04-11 06:27:19
I omitted some trivial functionality, like the assignment operator, constructors, accessors and so on. The `url` class shouldn't have mutators. For the equality operator, you might add a hash member that you fill in while parsing the original string. Then, comparing two urls for equality should be very fast. It also means some extra complexity; it's your call.
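A rough sketch of the hash idea, assuming C++11's std::hash. `hashed_url` is a hypothetical stand-in that stores one normalized string instead of the four members above. Equal hashes can still collide, so the text is compared as well; the hash only makes the unequal case cheap.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for the url class: only what equality needs.
class hashed_url {
public:
    // Assumes the caller passes an already-normalized URL string.
    explicit hashed_url(const std::string& normalized)
        : text_(normalized), hash_(std::hash<std::string>()(normalized)) {}

    friend bool operator==(const hashed_url& a, const hashed_url& b) {
        // Different hashes prove inequality without touching the strings;
        // on a hash match the strings are still compared.
        return a.hash_ == b.hash_ && a.text_ == b.text_;
    }
    friend bool operator!=(const hashed_url& a, const hashed_url& b) {
        return !(a == b);
    }

private:
    std::string text_;
    std::size_t hash_;
};
```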
wilhelmtell
2010-04-11 07:07:48
@Billy I always bring namespace `std` into my compilation units (not the headers!). I think it's perfectly fine, and I think that having `std::` all over the place poses more pollution and eye-fatigue than bringing in the namespace.
wilhelmtell
2010-04-11 07:12:48
Funny how things are: on the contrary, I agree with Billy ONeal and remove every `using namespace` I come across. If you really repeat a symbol, you can always write `using std::string;`, but I prefer namespace qualification; it makes it easier for poor old me to understand where a symbol came from.
Matthieu M.
2010-04-11 11:45:43
There are a lot of URI/URL forms not supported besides example.com:port/pathname. For instance http:/pathname and, more importantly, http://username:password@example.com/pathname#section. All the combinations are listed in http://www.ietf.org/rfc/rfc2396.txt, which gives the following regex: ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
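A minimal sketch of applying that regex with C++11's std::regex. `UriParts` and `split_uri` are made-up names; the group numbers (2, 4, 5, 7, 9) follow Appendix B of the RFC. The authority is left unsplit, so user info and port would still need separating by hand.

```cpp
#include <regex>
#include <string>

// Hypothetical result type; field names are illustrative only.
struct UriParts {
    std::string scheme, authority, path, query, fragment;
};

// Applies the RFC 2396 Appendix B regex; returns false if it does not match.
bool split_uri(const std::string& uri, UriParts& out) {
    static const std::regex re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
    std::smatch m;
    if (!std::regex_match(uri, m, re)) return false;
    // Groups that did not participate in the match yield empty strings.
    out.scheme    = m[2].str();
    out.authority = m[4].str();
    out.path      = m[5].str();
    out.query     = m[7].str();
    out.fragment  = m[9].str();
    return true;
}
```

For `http:/pathname` this gives the scheme `http`, an empty authority, and the path `/pathname`.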
jdkoftinoff
2010-10-18 02:52:11
+3
A:
There is a library that's proposed for Boost inclusion and allows you to parse HTTP URIs easily. It uses Boost.Spirit and is also released under the Boost Software License. The library is cpp-netlib, which you can find the documentation for at http://cpp-netlib.github.com/ -- you can download the latest release from http://github.com/cpp-netlib/cpp-netlib/downloads .
The relevant type you'll want to use is boost::network::http::uri, which is covered in the documentation linked above.
Dean Michael
2010-04-11 09:56:10