ansaurus

Question

Answer 1

+19 A:

from http://codesnippets.joyent.com/posts/show/523

A single regex to parse and break up a full URL, including query parameters and anchors, e.g.

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

RegEx positions:

url: RegExp['$&'],

protocol:RegExp.$2,

host:RegExp.$3,

path:RegExp.$4,

file:RegExp.$6,

query:RegExp.$7,

hash:RegExp.$8

You could then further parse the host ('.' delimited) quite easily.
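As a minimal sketch of how the pattern and the group positions above fit together, the following JavaScript applies the regex with `String.prototype.match` rather than the legacy `RegExp.$n` properties. The names `urlPattern` and `parseUrl` are made up for this illustration and do not come from the original snippet.

```javascript
// The pattern quoted above, unchanged.
const urlPattern = /^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$/;

// Hypothetical helper: maps the numbered groups to named fields.
function parseUrl(url) {
  const m = url.match(urlPattern);
  if (m === null) return null; // the URL did not fit the pattern
  return {
    url: m[0],
    protocol: m[2],
    host: m[3],
    path: m[4],
    file: m[6],
    query: m[7],
    hash: m[8],
  };
}

const parts = parseUrl('https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash');
console.log(parts.protocol, parts.host, parts.path, parts.file);
console.log(parts.query, parts.hash);
```

With this exact pattern the greedy `(.*)` group also consumes the `#hash` part, so `hash` comes back undefined and the fragment ends up at the end of `query`.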

hometoast 2008-08-26 11:06:09

The link http://codesnippets.joyent.com/posts/show/523 does not work as of Oct 20 '10

W3Max 2010-10-20 14:26:05

The problem is this part: `(.*)?` Since the Kleene star already accepts 0 or more, the `?` part (0 or 1) is confusing it. I fixed it by changing `(.*)?` to `(.+)?`. You could also just remove the `?`

Bryan Ross 2010-10-25 22:23:54

Good catch Bryan. I'm not going to edit the response, since I quoted it from the (now gone) link, but upvoted your comment so that the clarification is more visible.

hometoast 2010-10-28 11:49:46

Answer 2

+2 A:

This is not a direct answer, but most web libraries have a function that accomplishes this task. The function is often called something similar to CrackUrl. If such a function exists, use it; it is almost guaranteed to be more reliable and more efficient than any hand-crafted code.
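In JavaScript, for instance, the built-in `URL` class plays this role. The sketch below only shows which parts it exposes for the example URL used elsewhere in this thread; the variable name `u` is arbitrary.

```javascript
// URL is available as a global in modern browsers and in Node.js.
const u = new URL('https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash');

console.log(u.protocol); // scheme, including the trailing ':'
console.log(u.hostname); // host without the port
console.log(u.port);     // empty string when no explicit port is given
console.log(u.pathname); // path including the file name
console.log(u.search);   // query string, including the leading '?'
console.log(u.hash);     // fragment, including the leading '#'

// Individual query parameters are available without further regex work.
console.log(u.searchParams.get('arg'));
```

Note that this does not split the host into subdomain, domain and TLD; that still needs separate handling.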

Konrad Rudolph 2008-08-26 11:06:43

Answer 3

+2 A:

Try the following:

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w-]+.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

It supports HTTP / FTP, subdomains, folders, files etc.

I found it from a quick google search:

http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

Mark Ingram 2008-08-26 11:10:16

Answer 4

+1 A:

Subdomain and domain are difficult because the subdomain can have several parts, as can the top-level domain, e.g. http://sub1.sub2.domain.co.uk/

The path without the file: http://[^/]+/((?:[^/]+/)*)(?:[^/]+$)?
The file: http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$
The path with the file: http://[^/]+/(.*)
The URL without the path: (http://[^/]+/)

(Markdown isn't very friendly to regexes)

tgmdbm 2008-08-26 11:17:28

Answer 5

A:

Using http://www.fileformat.info/tool/regex.htm hometoast's regex works great.

But here is the deal, I want to use different regex patterns in different situations in my program.

For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a method getRegexPattern that returns the regex pattern which will then be used to compare with a URL. If the particular regex pattern returns true, then I know that this URL is supported by my program. So, each enumeration has its own regex depending on where it should look inside the URL.

Hometoast's suggestion is great, but in my case, I think it wouldn't help (unless I copy paste the same regex in all enumerations).

That is why I wanted the answer to give the regex for each situation separately. Although +1 for hometoast. ;)

pek 2008-08-26 11:23:45

Answer 6

A:

I know you're claiming language-agnostic on this, but can you tell us what you're using just so we know what regex capabilities you have?

If you have the capabilities for non-capturing matches, you can modify hometoast's expression so that subexpressions that you aren't interested in capturing are set up like this:

(?:SOMESTUFF)

You'd still have to copy and paste (and slightly modify) the Regex into multiple places, but this makes sense--you're not just checking to see if the subexpression exists, but rather if it exists as part of a URL. Using the non-capturing modifier for subexpressions can give you what you need and nothing more, which, if I'm reading you correctly, is what you want.

Just as a small, small note, hometoast's expression doesn't need to put brackets around the 's' for 'https', since he only has one character in there. Quantifiers quantify the one character (or character class or subexpression) directly preceding them. So:

https?

would match 'http' or 'https' just fine.
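To make the two points above concrete, here is a small JavaScript sketch. It assumes the goal is to capture only the host, so every other subexpression of hometoast's pattern is turned into a non-capturing `(?:...)` group; the names `hostOnly` and `schemeOnly` are invented for the example.

```javascript
// hometoast's pattern with all groups except the host made non-capturing,
// and http[s]? shortened to https?.
const hostOnly = /^(?:(?:https?|ftp):\/)?\/?([^:\/\s]+)(?:(?:\/\w+)*\/)(?:[\w\-\.]+[^#?\s]+)(?:.*)?$/;

const hostMatch = 'https://www.google.com/dir/1/2/search.html?arg=0-a#hash'.match(hostOnly);
console.log(hostMatch.length); // whole match plus a single captured group
console.log(hostMatch[1]);     // the host

// The quantifier applies only to the character directly before it.
const schemeOnly = /^https?$/;
console.log(schemeOnly.test('http'), schemeOnly.test('https'), schemeOnly.test('httpss'));
```

The URL still has to match the whole pattern, so the check remains "is this part present as part of a URL", but the result array contains only the piece of interest.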

Brian Warshaw 2008-08-26 11:34:49

Answer 7

+2 A:

Java offers a URL class that will do this; see "Query URL Objects".

On a side note, PHP offers parse_url().

Chris Bartow 2008-08-26 11:55:04

It looks like this doesn't parse out the subdomain though?

DutrowLLC 2010-03-05 04:11:50

Answer 8

+6 A:

I found the highest voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:

It can't handle a port number.
The hash part is broken.

The following is a modified version:

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$

The positions of the parts are as follows:

int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12
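A quick way to check these positions is to run the pattern against a URL that has a port, a query and a fragment. The JavaScript below is only a sketch; the sample URL and the names `portAwarePattern` and `groups` are assumptions made for the demonstration.

```javascript
// The modified pattern from above, unchanged.
const portAwarePattern = /^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/;

// Group positions as listed above.
const SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12;

const groups = 'http://example.com:8080/a/b/file.txt?x=1#top'.match(portAwarePattern);

console.log(groups[SCHEMA]);      // scheme
console.log(groups[DOMAIN]);      // host
console.log(groups[PORT]);        // port without the ':'
console.log(groups[PATH]);        // directory part, with trailing '/'
console.log(groups[FILE]);        // file name
console.log(groups[QUERYSTRING]); // query, including the leading '?'
console.log(groups[HASH]);        // fragment without the '#'
```

Group 9 keeps the leading `?`; group 10 holds the same query string without it.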

mingfai 2008-11-21 16:28:57

Answer 9

A:

/^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/

From my answer on a similar question. Works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, fragment identifiers being broken).
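The `(?P<name>...)` syntax above is the PCRE/Python spelling of named groups. As a sketch of the same pattern in JavaScript, where named groups are written `(?<name>...)`, the following reads the parts by name. The sample URL and the names `namedPattern` and `named` are assumptions for the example.

```javascript
// Same pattern as above, with (?P<name>...) rewritten as (?<name>...).
const namedPattern = /^((?<scheme>https?|ftp):\/)?\/?((?<username>.*?)(:(?<password>.*?)|)@)?(?<hostname>[^:\/\s]+)(?<port>:([^\/]*))?(?<path>(\/\w+)*\/)(?<filename>[-\w.]+[^#?\s]*)?(?<query>\?([^#]*))?(?<fragment>#(.*))?$/;

const named = 'https://user:secret@example.com:8080/a/b/file.txt?x=1#top'.match(namedPattern).groups;

console.log(named.scheme, named.username, named.password);
console.log(named.hostname, named.port); // port includes the leading ':'
console.log(named.path, named.filename);
console.log(named.query, named.fragment); // both include their delimiter
```

Named groups remove the need to count parentheses when the pattern changes.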

strager 2009-01-14 04:13:34

Answer 10

A:

regexp to get the URL path without the file.

url = 'http://domain/dir1/dir2/somefile'
url.scan(/^(http:\/\/[^\/]+)((?:\/[^\/]+)+(?=\/))?\/?(?:[^\/]+)?$/i).join

It can be useful for adding a relative path to this url.
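For comparison, here is a sketch of the same regex used from JavaScript to build a sibling URL. The helper name `directoryOf` and the file name `other.html` are hypothetical.

```javascript
// Same pattern as in the snippet above: group 1 is the scheme and host,
// group 2 is the directory part without the trailing file name.
const dirPattern = /^(http:\/\/[^\/]+)((?:\/[^\/]+)+(?=\/))?\/?(?:[^\/]+)?$/i;

// Hypothetical helper: returns the URL with the last path segment removed.
function directoryOf(url) {
  const m = url.match(dirPattern);
  if (m === null) return null;
  return m[1] + (m[2] || ''); // group 2 is undefined when there is no directory
}

const base = directoryOf('http://domain/dir1/dir2/somefile');
console.log(base);
console.log(base + '/' + 'other.html'); // append a relative path
```

Like the original, the pattern only accepts the `http` scheme.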

2009-07-16 22:22:56

Answer 11

A:

Hi,

You can get the scheme (http/https), host, port, path and query by using the Uri object in .NET. The difficult part is breaking the host into subdomain, domain name and TLD.

There is no standard way to do this, and simple string parsing or a regex cannot produce the correct result in every case. At first I used a regex, but it could not parse the subdomain correctly for every URL. The practical way is to use a list of TLDs: once the TLD of a URL is identified, the label to its left is the domain and whatever remains is the subdomain.

However, the list needs to be maintained, since new TLDs can appear. As far as I know, publicsuffix.org currently maintains the latest list, and you can use the domainname-parser tool from Google Code to parse the public suffix list and get the subdomain, domain and TLD easily through the DomainName object: domainName.SubDomain, domainName.Domain and domainName.TLD.

This answer is also helpful: http://stackoverflow.com/questions/288810/get-the-subdomain-from-a-url
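The list-based approach can be sketched in a few lines of JavaScript. This is an illustration only: `SAMPLE_SUFFIXES` is a tiny, hard-coded stand-in for the real public suffix list, `splitHost` is a made-up name, and the wildcard and exception rules of the real list are ignored.

```javascript
// Hypothetical sample; a real program would load the full list from publicsuffix.org.
const SAMPLE_SUFFIXES = new Set(['com', 'org', 'uk', 'co.uk']);

// Splits a host name into subdomain, domain and TLD using the suffix set.
function splitHost(host) {
  const labels = host.toLowerCase().split('.');
  // Start at index 1 so at least one label is left for the domain, and try the
  // longest candidate suffix first so that 'co.uk' wins over 'uk'.
  for (let i = 1; i < labels.length; i++) {
    const suffix = labels.slice(i).join('.');
    if (SAMPLE_SUFFIXES.has(suffix)) {
      return {
        subdomain: labels.slice(0, i - 1).join('.'),
        domain: labels[i - 1],
        tld: suffix,
      };
    }
  }
  return null; // no known suffix, e.g. 'localhost'
}

console.log(splitHost('sub1.sub2.domain.co.uk'));
console.log(splitHost('www.google.com'));
```

A plain regex cannot make this decision, because nothing in the text of `domain.co.uk` says that `co.uk` is a suffix while `google.com` ends in a one-label one.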

CaLLMeLaNN

CallMeLaNN 2009-10-09 04:39:51

Answer 12

A:

I would recommend not using regex. An API call like WinHttpCrackUrl() is less error prone.

http://msdn.microsoft.com/en-us/library/aa384092%28VS.85%29.aspx

Jason 2009-11-30 19:35:38

And also very platform specific.

Andir 2010-07-12 21:02:23

Answer 13

+1 A:

This improved version should work as reliably as a parser.

   // Applies to URI, not just URL or URN:
   //    http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
   //
   // http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
   //
   // (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
   //
   // http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
   //
   // $& matches the entire uri
   // $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
   // $2 matches authority (host, user:pwd@host, etc)
   // $3 matches path
   // $4 matches query (http GET REST api, etc)
   // $5 matches fragment (html anchor, etc)
   //
   // Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
   // Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
   //
   // (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(?:#(\S*))?
   //
   // Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
   function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
   {
      if( !schemes )
         schemes = '[^\\s:\/?#]+'
      else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
         throw TypeError( 'expected URI schemes' )
      return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
         new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
   }

   // http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
   function uriSchemesRegExp()
   {
      return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
   }
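
   // The generic expression quoted in the comments above can be tried on its own.
   // The sketch below anchors it with ^ and $ and maps $1..$5 to named fields;
   // the names rfc3986 and splitUri are made up for this illustration.

```javascript
// Generic URI pattern from the comments above, anchored to the whole string.
const rfc3986 = /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;

// Hypothetical helper: returns the five generic URI components.
function splitUri(uri) {
  const m = uri.match(rfc3986);
  if (m === null) return null; // e.g. the input contains a line break
  return { scheme: m[1], authority: m[2], path: m[3], query: m[4], fragment: m[5] };
}

console.log(splitUri('https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash'));
console.log(splitUri('mailto:user@example.com')); // no authority, everything is path
```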

Shelby Moore 2010-09-16 07:21:21


Getting parts of a URL (Regex)
