ansaurus

Question

What is the regular expression to get a token of a URL?

Answer 1

A:

First, use an HTML parser and get a DOM. Then get the anchor elements and loop over them looking for the hrefs. Don't try to grab the token straight out of a string.

Then:

The glib answer would be:

/(The_Token_I_Want\.zip)/

You might want to be a little more precise than a single example.

I'm guessing you are actually looking for:

/([^/]+)$/
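For what it's worth, here is a rough sketch of that last pattern in C++. It uses std::regex rather than any particular library from the question (an assumption), and the helper name last_segment is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns everything after the last '/' in `url`,
// or an empty string when the URL ends in '/'.
std::string last_segment(const std::string& url) {
    // Same idea as /([^/]+)$/ : one or more non-slash characters,
    // anchored to the end of the input.
    static const std::regex tail("([^/]+)$");
    std::smatch m;
    if (std::regex_search(url, m, tail)) {
        return m[1].str();
    }
    return "";
}
```

Note that this keeps the file extension, so for the example URL you would get The_Token_I_Want.zip rather than the bare token.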

David Dorward 2010-08-15 20:33:58

Answer 2

A:

m/The_Token_I_Want/

You'll have to be more specific about what kind of token it is. A number? A string? Does it repeat? Does it have a form or pattern to it?

Shaggy Frog 2010-08-15 20:34:03

Answer 3

A:

It's probably best to use something smarter than a RegEx. For example, if you're using C# you could use the System.Uri class to parse it for you.
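If you are in C++ instead, the standard library has no URI class, so a regex-free version has to be done by hand. A minimal sketch, assuming the token is the last path segment up to its first dot (token_from_url is a hypothetical name, and this is not a full URI parser):

```cpp
#include <string>

// Hypothetical helper: returns the last path segment of `url`
// with everything from its first '.' onward removed.
std::string token_from_url(const std::string& url) {
    // Drop any query string or fragment; substr(0, npos) keeps the whole string.
    std::string path = url.substr(0, url.find_first_of("?#"));

    // Keep what follows the last '/', or the whole string if there is none.
    std::string::size_type slash = path.rfind('/');
    std::string name =
        (slash == std::string::npos) ? path : path.substr(slash + 1);

    // Cut at the first '.', which removes the extension.
    return name.substr(0, name.find('.'));
}
```

One limitation to be aware of: a URL with no path at all, such as http://domain.com, would yield "domain", because the last '/' is then the one in "//". A real URI parser would not have that problem.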

Jesse Collins 2010-08-15 20:36:08

Answer 4

+2 A:

Appendix B of RFC 2396 gives a doozy of a regular expression for splitting a URI into its components, and we can adapt it for your case:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?
                                     #######

This leaves The_Token_I_Want in $6, which is the “hashderlined” subexpression above. (Note that the hashes are not part of the pattern.) See it live:

#! /usr/bin/perl

$_ = "http://domain.com/133742/The_Token_I_Want.zip";    
if (m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?!) {
  print "$6\n";
}
else {
  print "no match\n";
}

Output:

$ ./prog.pl
The_Token_I_Want

UPDATE: I see in a comment that you're using boost::regex, so remember to escape the backslash in your C++ program.

#include <boost/foreach.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
  boost::regex token("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*"
                     "/([^.]+)"
                   //  ####### I CAN HAZ HASHDERLINE PLZ
                     "[^?#]*)(\\?([^#]*))?(#(.*))?");

  const char * const urls[] = {
    "http://domain.com/133742/The_Token_I_Want.zip",
    "http://domain.com/12345/another_token.zip",
    "http://domain.com/0981723/YET_ANOTHER_TOKEN.zip",
  };

  BOOST_FOREACH(const char *url, urls) {
    std::cout << url << ":\n";

    std::string t;
    boost::cmatch m;
    if (boost::regex_match(url, m, token))
      t = m[6];
    else
      t = "<no match>";

    std::cout << "  - " << t << '\n';
  }

  return 0;
}

Output:

http://domain.com/133742/The_Token_I_Want.zip:
  - The_Token_I_Want
http://domain.com/12345/another_token.zip:
  - another_token
http://domain.com/0981723/YET_ANOTHER_TOKEN.zip:
  - YET_ANOTHER_TOKEN

Greg Bacon 2010-08-15 20:41:13

Wouldn't that be a little overkill just to get one component?

Thomas 2010-08-15 20:48:23

Overkill or not, I vote for "hashderlined" to be added to the dictionary.

Tim Stone 2010-08-15 20:50:58

Answer 5

+1 A:

Try this:

/(?:f|ht)tps?:\/{2}(?:www\.)?domain[^\/]+\/([^\/]+)\/([^\/]+)/i

or

/\w{3,5}:\/{2}(?:w{3}\.)?domain[^\/]+\/([^\/]+)\/([^\/]+)/i

Jet 2010-08-15 20:45:33

Answer 6

+1 A:

/a href="http:\/\/domain\.com\/[0-9]+\/([a-zA-Z_]+)\.zip"/

You might want to add more characters to the [a-zA-Z_]+ class, such as digits or hyphens.

Thomas 2010-08-15 20:46:03

Answer 7

+1 A:

You can use:

(http|ftp)://[[:alnum:]./_]+/([[:alnum:]._-]+)\.[[:alnum:]_-]+

([[:alnum:]._-]+) is a group for the matched pattern, and in your example its value will be The_Token_I_Want. To access this group, use \2 or $2, because (http|ftp) is the first group and ([[:alnum:]._-]+) is the second group of the matched pattern.
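A quick way to check the group numbering is a small C++ sketch. It uses std::regex as a stand-in for whatever engine you have (an assumption, since no engine is named here), writes the dot before the extension as \. so that it only matches a literal dot, and omits the + after the scheme group. The function name second_group is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns capture group 2 (the token) when the
// whole URL matches the pattern, or an empty string otherwise.
std::string second_group(const std::string& url) {
    // Group 1 is the scheme, group 2 is the file name without extension.
    static const std::regex pattern(
        R"re((http|ftp)://[[:alnum:]./_]+/([[:alnum:]._-]+)\.[[:alnum:]_-]+)re");
    std::smatch m;
    if (std::regex_match(url, m, pattern)) {
        return m[2].str();
    }
    return "";
}
```

Calling second_group on the example URL should give The_Token_I_Want, while m[1] inside the function would hold the scheme, http.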

PC2st 2010-08-15 20:49:13
