Say I have strings like these:

bunch of other html<a href="http://domain.com/133742/The_Token_I_Want.zip" more html and stuff
bunch of other html<a href="http://domain.com/12345/another_token.zip" more html and stuff
bunch of other html<a href="http://domain.com/0981723/YET_ANOTHER_TOKEN.zip" more html and stuff

What is the regular expression to match The_Token_I_Want, another_token, YET_ANOTHER_TOKEN?

A: 

First, use an HTML parser and get a DOM. Then get the anchor elements and loop over them looking for the hrefs. Don't try to grab the token straight out of a string.

Then:

The glib answer would be:

/(The_Token_I_Want.zip)/

You might want to be a little more precise than a single example.

I'm guessing you are actually looking for:

/([^/]+)$/
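
As a rough sketch of how that last pattern behaves, here it is applied to a single href value with std::regex (an assumption on my part; boost::regex has a very similar interface). The helper name last_segment is made up for illustration. Note that the capture is the whole file name, extension included, so the ".zip" still has to be stripped afterwards.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the last path segment of `href`
// (file name including its extension), or an empty string when the
// value ends in '/' or is empty.
std::string last_segment(const std::string& href)
{
  // Same pattern as above: one or more non-slash characters
  // anchored at the end of the input.
  static const std::regex tail("([^/]+)$");

  std::smatch m;
  if (std::regex_search(href, m, tail))
    return m[1].str();
  return std::string();
}
```

For example, last_segment("http://domain.com/133742/The_Token_I_Want.zip") yields The_Token_I_Want.zip.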
David Dorward
A: 
m/The_Token_I_Want/

You'll have to be more specific about what kind of token it is. A number? A string? Does it repeat? Does it have a form or pattern to it?

Shaggy Frog
A: 

It's probably best to use something smarter than a RegEx. For example, if you're using C# you could use the System.Uri class to parse it for you.
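
If you are in C++ rather than C#, the same idea can be sketched with plain std::string operations instead of a regular expression. This is only an illustration, not a full URI parser: the helper name token_from_url is hypothetical, and it assumes the URL has a path whose last segment is the file name.

```cpp
#include <string>

// Hypothetical helper: returns the last path segment of `url` with
// everything from its final '.' onwards removed.
// Assumes the URL actually contains a path ending in a file name.
std::string token_from_url(const std::string& url)
{
  // Drop any query string or fragment first.
  const std::string path = url.substr(0, url.find_first_of("?#"));

  // Keep only what follows the last '/'.
  const std::string::size_type slash = path.find_last_of('/');
  const std::string name =
      (slash == std::string::npos) ? path : path.substr(slash + 1);

  // Cut at the last '.'; a name without a dot is returned unchanged.
  return name.substr(0, name.find_last_of('.'));
}
```

With the URLs from the question this gives The_Token_I_Want, another_token and YET_ANOTHER_TOKEN.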

Jesse Collins
+2  A: 

Appendix B of RFC 2396 gives a doozy of a regular expression for splitting a URI into its components, and we can adapt it for your case:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?
                                     #######

This leaves The_Token_I_Want in $6, which is the “hashderlined” subexpression above. (Note that the hashes are not part of the pattern.) See it live:

#! /usr/bin/perl

$_ = "http://domain.com/133742/The_Token_I_Want.zip";    
if (m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?!) {
  print "$6\n";
}
else {
  print "no match\n";
}

Output:

$ ./prog.pl
The_Token_I_Want

UPDATE: I see in a comment that you're using boost::regex, so remember to escape the backslash in your C++ program.

#include <boost/foreach.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
  boost::regex token("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*"
                     "/([^.]+)"
                   //  ####### I CAN HAZ HASHDERLINE PLZ
                     "[^?#]*)(\\?([^#]*))?(#(.*))?");

  const char * const urls[] = {
    "http://domain.com/133742/The_Token_I_Want.zip",
    "http://domain.com/12345/another_token.zip",
    "http://domain.com/0981723/YET_ANOTHER_TOKEN.zip",
  };

  BOOST_FOREACH(const char *url, urls) {
    std::cout << url << ":\n";

    std::string t;
    boost::cmatch m;
    if (boost::regex_match(url, m, token))
      t = m[6];
    else
      t = "<no match>";

    // Print t, not m[6], so a failed match shows "<no match>".
    std::cout << "  - " << t << '\n';
  }

  return 0;
}

Output:

http://domain.com/133742/The_Token_I_Want.zip:
  - The_Token_I_Want
http://domain.com/12345/another_token.zip:
  - another_token
http://domain.com/0981723/YET_ANOTHER_TOKEN.zip:
  - YET_ANOTHER_TOKEN
Greg Bacon
Wouldn't that be a little overkill just to get one component?
Thomas
Overkill or not, I vote for "hashderlined" to be added to the dictionary.
Tim Stone
+1  A: 

Try this:

/(?:f|ht)tps?:\/{2}(?:www\.)?domain[^\/]+\/([^\/]+)\/([^\/]+)/i

or

/\w{3,5}:\/{2}(?:w{3}\.)?domain[^\/]+\/([^\/]+)\/([^\/]+)/i

Jet
+1  A: 
/a href="http:\/\/domain\.com\/[0-9]+\/([a-zA-Z_]+)\.zip"/

You might want to add more characters to [a-zA-Z_]+, such as digits or hyphens, if your tokens can contain them.
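
Here is a minimal sketch of running that kind of pattern over a whole chunk of HTML text and collecting every token, with the dots escaped so that they match only literal periods. It uses std::regex rather than boost::regex (an assumption; the two are close), and find_tokens is a made-up helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the token from every
// a href="http://domain.com/<digits>/<token>.zip" link found in `html`.
std::vector<std::string> find_tokens(const std::string& html)
{
  // No delimiters are needed in a C++ string literal, so the
  // slashes stay as they are; only the dots and quotes are escaped.
  static const std::regex link(
      "a href=\"http://domain\\.com/[0-9]+/([a-zA-Z_]+)\\.zip\"");

  std::vector<std::string> tokens;
  for (std::sregex_iterator it(html.begin(), html.end(), link), end;
       it != end; ++it)
    tokens.push_back((*it)[1].str());  // group 1 is the token
  return tokens;
}
```

Tokens come back in the order in which the links appear in the input.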

Thomas
+1  A: 

You can use:

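(http|ftp)://[[:alnum:]./_]+/([[:alnum:]._-]+)\.[[:alnum:]_-]+

([[:alnum:]._-]+) is a group for the matched pattern, and in your example its value will be The_Token_I_Want. The dot before the extension is escaped so that it matches only a literal period; left unescaped, it can match any character and the group ends up holding part of the extension as well. To access this group, use \2 or $2, because (http|ftp) is the first group and ([[:alnum:]._-]+) is the second group of the matched pattern.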

PC2st