Hi,
I want a regex that finds website links in forms like these:
www.yahoo.com
yahoo.com
http://www.yahoo.com
http://yahoo.com
yahoo.jp (or any other domain)
http://yahoo.fr
Is there any way to match them all with a regex?
This regex from daringfireball.net should be able to do most of what you want. I'm unsure about bare domain.tld
forms (such as yahoo.com with no scheme, no www prefix, and no path), since those are very ambiguous.
(?xi)
\b
( # Capture 1: entire matched URL
(?:
[a-z][\w-]+: # URL protocol and colon
(?:
/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
# (Trying not to match e.g. "URI::Escape")
)
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)
For more specifics about what it does, check out http://daringfireball.net/2010/07/improved_regex_for_matching_urls
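As a rough illustration of how the pattern above could be applied, here is a minimal C# sketch using System.Text.RegularExpressions. The class name, the sample text, and the second pattern (BareDomainPattern, with its short TLD list) are assumptions made up for this example; they are not part of the daringfireball.net regex. The second pattern is only one possible way to pick up bare domains such as yahoo.jp, by restricting matches to TLDs you explicitly list.

```csharp
using System;
using System.Text.RegularExpressions;

class UrlFinderSketch
{
    // The pattern quoted above with its comments stripped, in a verbatim string
    // (so the one double quote in the last character class is doubled).
    // The leading (?xi) turns on ignore-whitespace and ignore-case modes.
    const string UrlPattern = @"(?xi)
\b
(
  (?: [a-z][\w-]+: (?: /{1,3} | [a-z0-9%] )
    | www\d{0,3}[.]
    | [a-z0-9.\-]+[.][a-z]{2,4}/
  )
  (?: [^\s()<>]+
    | \(([^\s()<>]+|(\([^\s()<>]+\)))*\)
  )+
  (?: \(([^\s()<>]+|(\([^\s()<>]+\)))*\)
    | [^\s`!()\[\]{};:'"".,<>?«»“”‘’]
  )
)";

    // Hypothetical fallback for bare domains: accept only a short, explicit
    // list of TLDs (an assumption for this example) to limit false positives.
    const string BareDomainPattern = @"\b(?:[a-z0-9-]+\.)+(?:com|fr|jp)\b";

    static void Main()
    {
        string text = "See www.yahoo.com, http://yahoo.fr, yahoo.jp and notes.txt for details.";

        // Links with a scheme, a www prefix, or a path after the domain.
        foreach (Match m in Regex.Matches(text, UrlPattern))
            Console.WriteLine("url:    " + m.Value);

        // Bare domains limited to the listed TLDs.
        foreach (Match m in Regex.Matches(text, BareDomainPattern, RegexOptions.IgnoreCase))
            Console.WriteLine("domain: " + m.Value);
    }
}
```

With this sample text, the first loop should report www.yahoo.com and http://yahoo.fr but not the bare yahoo.jp, because the pattern needs a scheme, a www prefix, or a slash after the domain. The second loop picks up yahoo.jp and skips notes.txt, which shows the ambiguity: without a TLD list, anything shaped like name.ext would look like a domain.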
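I'm going to throw out an alternative here that is not regex at all. Take a look at the HTML Agility Pack; your case would look something like this:
// Requires the HtmlAgilityPack package (using HtmlAgilityPack;)
var doc = new HtmlDocument();
doc.Load("file.htm");
// Current releases expose the root as DocumentNode; SelectNodes returns null when nothing matches
var links = doc.DocumentNode.SelectNodes("//a[contains(@href, 'yahoo')]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        var href = link.GetAttributeValue("href", "");
        // href is a URL that contains the word `yahoo`; do something with it
    }
}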
It's not really answering the question as you've written it, just something to keep your options open, as regex can run into many other problems when applied to HTML.