I have a very simple regex question. Suppose I have 2 conditions:

  1. url =http://www.abc.com/cde/def
  2. url =https://www.abc.com/sadfl/dsaf

How can I extract the baseUrl using regex?

Sample output:

  1. http://www.abc.com
  2. https://www.abc.com
+1  A: 

/^(https?\:\/\/[^\/]+).*/$1/

This will capture ANYTHING that starts with http and $1 will contain everything from the beginning to the first / after the //
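
In Java, the same idea might look roughly like the sketch below. It assumes the input is a single URL string; the class name `BaseUrlRegex` and the method name `baseUrl` are made up for illustration and do not come from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BaseUrlRegex {
    // Group 1: the scheme, "://", and everything up to (not including) the next "/".
    private static final Pattern BASE = Pattern.compile("^(https?://[^/]+).*");

    /** Returns the base URL, or null if the input does not start with http:// or https://. */
    static String baseUrl(String url) {
        Matcher m = BASE.matcher(url);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(baseUrl("http://www.abc.com/cde/def"));     // prints http://www.abc.com
        System.out.println(baseUrl("https://www.abc.com/sadfl/dsaf")); // prints https://www.abc.com
    }
}
```

Returning null for non-matching input is one possible choice; throwing an exception would work just as well.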

Matt S
Thanks for your quick response
Sunil
+4  A: 

Like this:

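import java.util.regex.Matcher;
import java.util.regex.Pattern;

String baseUrl = null;
// Group 1 is the optional scheme, the host, and an optional port; the "/" after it stays outside the group.
Pattern p = Pattern.compile("^(([a-zA-Z]+://)?[a-zA-Z0-9.-]+\\.[a-zA-Z]+(:\\d+)?)/");
Matcher m = p.matcher(str);
// find() rather than matches(): the pattern covers only the start of the URL, not the path.
if (m.find())
    baseUrl = m.group(1);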

However, you should use the URI class instead, like this:

URI uri = new URI(str);
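
One way to get from the `URI` object to the base URL is to join its scheme and authority. This is only a sketch, assuming the input is an absolute URL; the class name `BaseUrlFromUri` and the method name `baseUrl` are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

class BaseUrlFromUri {
    /** Returns scheme + "://" + authority, or null if either part is missing. */
    static String baseUrl(String str) throws URISyntaxException {
        URI uri = new URI(str);
        // getAuthority() is the host plus any user info and port, e.g. "www.abc.com".
        if (uri.getScheme() == null || uri.getAuthority() == null) {
            return null; // relative or otherwise incomplete URI
        }
        return uri.getScheme() + "://" + uri.getAuthority();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(baseUrl("https://www.abc.com/sadfl/dsaf")); // prints https://www.abc.com
    }
}
```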
SLaks
Thanks for your quick response, but it gives https://www.abc.comsadfl. Is it possible for it to give only the first part, https://www.abc.com?
Sunil
Thank you very much, sir. This is working.
Sunil
+1 for the URI class.
Aistina
This is not working with page source. Suppose I search for the string 'sun' on Google: it returns 20 links, and I want to fetch the base URL of every site, but this does not work on the HTML page source. Is there any change that would make it work for page source? Thanks, waiting for your response.
Sunil
Remove the `^`, which anchors the regex to the beginning of the string. You can then loop through each match.
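
A rough sketch of such a loop follows. Note that it swaps in a simpler, unanchored pattern (http or https only, stopping at a slash, whitespace, quote, or angle bracket) rather than the exact regex from the answer; the class name `BaseUrlScanner`, the method name `baseUrls`, and the sample HTML are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BaseUrlScanner {
    // No "^" anchor, so the pattern can match anywhere in the page source.
    private static final Pattern BASE = Pattern.compile("https?://[^/\\s\"'<>]+");

    /** Collects every distinct base URL found in the text, in order of first appearance. */
    static Set<String> baseUrls(String pageSource) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = BASE.matcher(pageSource);
        while (m.find()) {       // each call to find() advances to the next match
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.abc.com/cde/def\">one</a> "
                    + "<a href=\"https://www.abc.com/sadfl/dsaf\">two</a>";
        System.out.println(baseUrls(html)); // prints [http://www.abc.com, https://www.abc.com]
    }
}
```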
SLaks
Thanks a lot sir
Sunil
+1  A: 

Except for write-and-throw-away scripts, you should always refrain from parsing complex syntaxes (e-mail addresses, URLs, HTML pages, etc.) using regexes.

Believe me, you will get bitten eventually.

Gyom
Thanks for the comment
Sunil
A: 

I'm pretty sure that there is a Java class that will allow path manipulations, but if it has to be a regex,

https?://[^/]+

would work. (s? included to also handle https:)

Tim Pietzcker
Thanks for your response sir
Sunil
A: 

Looks like the simplest solution to your two specific examples would be the pattern:

[^/]*//[^/]+

i.e.: non-slash (0 or more times), two slashes, non-slash (1 or more times). You can be stricter than that if you wish, as the two existing answers are doing in different ways -- one will reject e.g. URLs starting with ftp:, the other will reject domains with underscores (but accept URLs without a leading protocol://, thereby being even broader than mine in that respect). This variety of answers (all correct wrt your scant specs;-) should suggest to you that your specs are too vague and should be tightened.

Alex Martelli
Thank you for your quick response sir
Sunil
A: 

Here's a regex that should satisfy the problem as given.

https?://[^/]*

I'm assuming you're asking this partly to gain more knowledge of regexes. If, however, you're trying to pull the host from a URL, it's arguably much more correct to use Java's more robust parsing methods:

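import java.net.URL;

String urlStr = "https://www.abc.com/stuff";
URL url = new URL(urlStr);  // throws MalformedURLException on bad input
String host = url.getHost();
String protocol = url.getProtocol();
// URL has no (protocol, host) constructor; pass the port (-1 if absent) and an empty file part.
URL baseUrl = new URL(protocol, host, url.getPort(), "");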

This is better, as it should catch more cases if your input URL isn't as strict as described above.

Paul Brinkley
Thanks for your quick answer sir
Sunil
A: 

A one liner without regexp:

String baseUrl = url.substring(0, url.indexOf('/', url.indexOf("//")+2));
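
The one-liner assumes the URL has a slash after the host; without one, `indexOf` returns -1 and `substring` throws. A slightly more defensive sketch, with the class name `BaseUrlSubstring` and the method name `baseUrl` invented for illustration, could be:

```java
class BaseUrlSubstring {
    /** Returns everything before the first "/" that follows "//", or null if there is no "//". */
    static String baseUrl(String url) {
        int schemeEnd = url.indexOf("//");
        if (schemeEnd < 0) {
            return null; // no "//", so not the kind of URL the question describes
        }
        int pathStart = url.indexOf('/', schemeEnd + 2);
        // No slash after the host means the whole string is already the base URL.
        return pathStart < 0 ? url : url.substring(0, pathStart);
    }

    public static void main(String[] args) {
        System.out.println(baseUrl("http://www.abc.com/cde/def")); // prints http://www.abc.com
        System.out.println(baseUrl("https://www.abc.com"));        // prints https://www.abc.com
    }
}
```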
Andreas_D
:) Thanks for your answer, but I want to do this using regex.
Sunil