<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>

What regular expression will extract http://anirudhagupta.blogspot.com/ from

<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>

A C# solution would be best, but I would also be happy to do this with jQuery.

The HTML may also be a whole page that contains markup like this.
A: 

Quick and dirty:

href="(.*?)"
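As a minimal sketch of how that pattern could be applied (in Python here, since it is already used elsewhere in this thread; the same pattern should also work with C#'s Regex class), assuming the href values are always double-quoted. The names `HREF_RE` and `find_hrefs` are made up for illustration:

```python
import re

html = '<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>'

# Non-greedy capture: stop at the first double quote after href=".
HREF_RE = re.compile(r'href="(.*?)"')


def find_hrefs(text):
    """Return every double-quoted href value found in text."""
    return HREF_RE.findall(text)


print(find_hrefs(html))
```

This only sees double-quoted attributes; single-quoted or unquoted values are silently skipped.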

OK, here is another regex for parsing URLs. It comes from RFC 2396 (URI Generic Syntax), in the appendix "Parsing a URI Reference with a Regular Expression":

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
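Note that this expression splits a URI you already have into its components; it does not locate URLs inside HTML. A small sketch of the split, using the group numbering given in the RFC (2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment). The function name `split_uri` and the sample query string are assumptions for illustration:

```python
import re

# The RFC 2396 expression, unchanged.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')


def split_uri(uri):
    """Split a URI reference into its five components."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


parts = split_uri("http://anirudhagupta.blogspot.com/search?q=regex#top")
for name, value in parts.items():
    print(name, "=", value)
```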

Of course, your HTML may also contain relative URLs, and those need to be handled differently; for that I recommend the C# Uri constructor Uri(Uri, String).
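To illustrate what resolving a relative address against a base involves, here is a rough Python analogue using `urllib.parse.urljoin` rather than the C# constructor. The base URL and the relative paths below are hypothetical, picked only for the example:

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the links were extracted from.
base = "http://anirudhagupta.blogspot.com/2009/12/index.html"


def resolve(base_url, href):
    """Turn an href as found in the page into an absolute URL."""
    return urljoin(base_url, href)


print(resolve(base, "about.html"))                 # relative to the page's folder
print(resolve(base, "/feeds/posts"))               # relative to the site root
print(resolve(base, "http://mike.blogspot.com/"))  # already absolute
```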

Rubens Farias
No thanks, but what I mean is getting the URL programmatically, that is, a regex to get the URL.
@Gupta, I didn't understand your comment; both are regular expressions.
Rubens Farias
Nice try, but (a) *? for minimal munch is, FWIS, rare among the world's regex flavours, and (b) there are too many mistakes in your second regex to begin listing them.
Stewart
@Stewart, how about this one?
Rubens Farias
@Rubens Farias don't worry, I am just asking how I can do it with a regex.
A: 

The simplest way to do this is using the following regex.

/href="([^"]+)"/

This captures every character after the opening quote up to the next quote character. In most languages this is the fastest way to match a quoted string that cannot itself contain quotes; quotes inside attribute values should be encoded anyway.

UPDATE: A complete Perl program for parsing urls would look like this:

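use 5.010;
use strict;
use warnings;

my @matches;
while (<>) {
    # Collect href values in each quoting style: double, single, unquoted.
    push @matches, m/href="([^"]+)"/gi;
    push @matches, m/href='([^']+)'/gi;
    push @matches, m/href=([^"'][^>\s]*)[>\s]+/gi;
}
# Print once, after all input is read, so earlier matches are not repeated.
say for @matches;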

It reads from stdin and prints all URLs. It handles the three possible quoting styles (double quotes, single quotes, and no quotes). Use it with curl to find all the URLs in a web page:

curl url | perl urls.pl
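For readers who prefer Python, the same three quoting styles could be covered with a single alternation. This is only a sketch, not a translation of the Perl: the names `HREF_RE` and `extract_hrefs` are made up, and allowing whitespace around `=` is an extra assumption of mine.

```python
import re

# One alternative per quoting style: "double", 'single', or unquoted.
HREF_RE = re.compile(
    r'''href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))''',
    re.IGNORECASE,
)


def extract_hrefs(text):
    """Return href values in document order, whatever the quoting style."""
    urls = []
    for m in HREF_RE.finditer(text):
        # Exactly one of the three groups takes part in each match.
        urls.append(next(g for g in m.groups() if g is not None))
    return urls


sample = '''<a href="a.html">x</a> <a HREF='b.html'>y</a> <a href=c.html>z</a>'''
print(extract_hrefs(sample))
```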
Peter Stuifzand
In the wild, HTML can be a deadly thing. href=a.html is "valid," or at least should work just as well as href="a.html", and in most instances special characters that should be escaped aren't. *cough* google *cough*
Gary Green
Correct, there are many pitfalls when using information from the web. On the other hand if I need to find the urls from one webpage on which I can see all possible problems (or find out by testing) I will use this regex (or variant) before using heavier tools. Still, this all depends on the situation and this looks like a Get it Done situation.
Peter Stuifzand
blah... this won't work at all. Attribute values can be delimited by ", by ', or by nothing at all.
Hogan
A: 

If you want to use jQuery, you can do the following. Note that attr('href') returns the value for the first matched element only.

$('a').attr('href')
Peter Stuifzand
A: 

You don't need a complicated regex or an HTML parser, since you only want to extract links. Here's a generic way to do it.

data = """
<html>
abcd ef ....
blah blah <div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
blah  ...
<div><a href="http://mike.blogspot.com/">Mike's Web blog
</a></div>
end...
</html>
"""
marker = '<a href="'
for item in data.split("</a>"):
    if marker in item:
        start = item.index(marker) + len(marker)  # first character of the URL
        end = item.index('"', start)  # closing quote of the href value
        print(item[start:end])  # print only the URL

The above is Python code, but you can adapt the idea to C#. Split your HTML string using "</a>" as the delimiter. Go through each resulting field, check for "href", then take the substring after "href". Those will be your links.

This seems more complex than a regex!
Gary Green
Complex because it has more words? Would you rather read an essay written in English or one encoded with numbers, each number representing a letter? It's the same analogy. What a regex does behind the scenes is roughly the same as what I posted: string manipulation, except that here it is presented more clearly to the reader, who does not have to guess what the code means.
Take for example the regex posted by Rubens. Seriously, if you can decipher what it means at first glance, I take my hat off to you.
Also, here you get to make sure the code is optimized. Who knows what the regex will do... you know </a> is the best place to split the code, which is going to be much better than the regex.
Hogan
See my comment below: that is the "non-complex" way to do it.
Hogan
A: 

The right way to do this is to load the HTML into the C# XML parser and then use XPath to query the URLs. This way you don't have to worry about parsing at all.
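A rough sketch of the same idea in Python rather than C#, using the standard library's `xml.etree.ElementTree` and its limited XPath support. It assumes the markup is well-formed XML (XHTML); real-world HTML often is not, in which case the parser raises an error. The sample markup and the name `hrefs_via_xpath` are illustrative only:

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup; the third anchor has no href on purpose.
xhtml = """<html><body>
<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
<div><a href="http://mike.blogspot.com/">Mike's Web blog</a></div>
<div><a name="top">no link here</a></div>
</body></html>"""


def hrefs_via_xpath(markup):
    """Parse markup as XML and return the href of every <a> that has one."""
    root = ET.fromstring(markup)
    # ".//a[@href]" selects all descendant <a> elements with an href attribute.
    return [a.get("href") for a in root.findall(".//a[@href]")]


print(hrefs_via_xpath(xhtml))
```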

Hogan
Can you show an example of how this works?
Peter Stuifzand
Why write when you can link: http://www.c-sharpcorner.com/UploadFile/shehperu/SimpleXMLParser11292005004801AM/SimpleXMLParser.aspx is a nice simple example. http://developer.yahoo.com/dotnet/howto-xml_cs.html is a more complex one. But as you can see, all you do is read it into the XML object and then query it with XPath. You will then have a list of href attributes. Simple. Done.
Hogan