ansaurus

Question

Answer 1

A:

Quick and dirty:

href="(.*?)"

OK, here is another regex, this one for parsing URLs. It comes from RFC 2396 (URI Generic Syntax), Appendix B: Parsing a URI Reference with a Regular Expression.

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
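To make the capture groups concrete, here is a minimal Python sketch. Note that this regex splits a URI you already have into its parts; it does not find URLs inside HTML. The group numbers follow the RFC appendix; the helper name split_uri and the sample URI are only illustrative.

```python
import re

# Regex from RFC 2396, Appendix B, copied unchanged.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri):
    """Split a URI reference into its components (None when a part is absent)."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

# Sample URI chosen for illustration only.
parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/?q=1#Related")
print(parts)
```

A relative reference such as a.html comes back with only the path filled in, which is why relative links need a separate resolution step.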

Of course, your HTML can also contain relative URLs, and those need to be handled differently; for them I recommend the C# Uri constructor Uri(Uri, String), which resolves a relative URL against a base URI.
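For comparison, the same idea of resolving a relative link against the page's address can be sketched in Python with urllib.parse.urljoin. This is a stand-in for the C# constructor, not the C# API itself, and the base address and helper name below are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page address, used only for illustration.
base = "http://example.com/blog/index.html"

def resolve_all(base_url, hrefs):
    """Resolve each href (relative or absolute) against base_url."""
    return [urljoin(base_url, href) for href in hrefs]

for url in resolve_all(base, ["a.html", "../about.html", "/feed.xml", "http://other.example/x"]):
    print(url)
```

Absolute URLs pass through unchanged, so the extracted href values can all go through the same call.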

Rubens Farias 2009-11-30 12:34:25

No thanks, but by "get programmatically" I mean using a regex to get the URL.

2009-11-30 12:36:02

@Gupta, I didn't understand your comment; both are regular expressions.

Rubens Farias 2009-11-30 12:46:46

Nice try, but (a) *? for minimal munch is FWIS rare among the world's regex flavours (b) too many mistakes in your second regex to begin listing them

Stewart 2009-11-30 13:43:24

@Stewart, how about this one?

Rubens Farias 2009-11-30 13:50:30

@Rubens Farias, don't worry; I am asking how I can do it with a regex.

2009-11-30 14:07:42

Answer 2

A:

The simplest way to do this is using the following regex.

/href="([^"]+)"/

This captures every character after the opening quote up to, but not including, the next quote. In most languages this is the fastest way to get a quoted string that cannot itself contain quotes. Quotes inside an attribute value should be encoded.
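A small Python sketch of why the negated character class matters: a greedy dot runs on to the last quote on the line, while [^"]+ stops at the first closing quote. The sample line is made up for illustration.

```python
import re

# Hypothetical sample line with two links on it.
line = '<a href="a.html">A</a> <a href="b.html">B</a>'

# Greedy dot: keeps going until the last quote on the line.
greedy = re.findall(r'href="(.*)"', line)

# Negated class: cannot cross a quote, so each link is matched separately.
negated = re.findall(r'href="([^"]+)"', line)

print(greedy)
print(negated)
```

The negated class also avoids relying on the lazy *? quantifier, so it should port to regex flavours that lack it.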

UPDATE: A complete Perl program for parsing urls would look like this:

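use 5.010;

while (<>) {
    my @matches;    # reset for each line so earlier URLs are not printed again
    push @matches, m/href="([^"]+)"/gi;            # double-quoted value
    push @matches, m/href='([^']+)'/gi;            # single-quoted value
    push @matches, m/href=([^"'][^>\s]*)[>\s]+/gi; # unquoted value
    say for @matches;
}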

It reads from stdin and prints all URLs. It handles the three quoting styles: double quotes, single quotes, and no quotes. Use it with curl to find all the URLs in a web page:

curl url | perl urls.pl
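The same three quoting styles can also be folded into one pattern. The following is a rough Python sketch, not a translation of the Perl program: it allows whitespace around the equals sign and returns values in document order, both of which are my assumptions, and the names HREF_RE and find_hrefs are made up.

```python
import re

# One alternative per quoting style: double-quoted, single-quoted, unquoted.
HREF_RE = re.compile(
    r'''href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))''',
    re.IGNORECASE,
)

def find_hrefs(html):
    """Return href values in document order, whichever quoting style is used."""
    # findall yields one 3-tuple per match; only the group that matched is non-empty.
    return [dq or sq or bare for dq, sq, bare in HREF_RE.findall(html)]

# Hypothetical sample covering all three styles.
sample = '<a href="a.html">A</a><A HREF=\'b.html\'>B</a><a href=c.html>C</a>'
print(find_hrefs(sample))
```

Like any regex approach, this will still be fooled by href text inside comments or scripts, so treat it as a quick tool for pages you can inspect.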

Peter Stuifzand 2009-11-30 12:44:25

In the wild, HTML can be a deadly thing. href=a.html is "valid," or at least should work just as well as href="a.html" and in most instances special characters that should be escaped aren't *cough* google *cough*

Gary Green 2009-11-30 14:11:03

Correct, there are many pitfalls when using information from the web. On the other hand if I need to find the urls from one webpage on which I can see all possible problems (or find out by testing) I will use this regex (or variant) before using heavier tools. Still, this all depends on the situation and this looks like a Get it Done situation.

Peter Stuifzand 2009-11-30 14:59:47

blah... this won't work at all. Attribute values can be delimited by ", by ', or by nothing at all.

Hogan 2009-11-30 17:49:30

Answer 3

+1 A:

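If you want to use jQuery, you can do the following. Calling .attr('href') directly on $('a') returns only the first link's href, so map over the set to collect them all.

$('a').map(function () { return $(this).attr('href'); }).get()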

Peter Stuifzand 2009-11-30 12:49:55

Answer 4

A:

You don't need a complicated regex or an HTML parser, since you only want to extract links. Here's a generic way to do it.

data = """
<html>
abcd ef ....
blah blah <div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
blah  ...
<div><a href="http://mike.blogspot.com/">Mike's Web blog
</a></div>
end...
</html>
"""
marker = '<a href="'
for item in data.split("</a>"):
    if marker in item:
        start = item.index(marker) + len(marker)  # first character of the URL
        end = item.index('"', start)              # closing quote of the attribute
        print(item[start:end])                    # print only the URL

The above is Python code, but you can adapt the idea to C#. Split your HTML string using "</a>" as the delimiter. Go through each resulting piece, check for "href", then take the substring after "href" up to the closing quote. Those will be your links.

2009-11-30 13:33:31

This seems more complex than a regex!

Gary Green 2009-11-30 14:08:38

Complex because it has more words? Would you rather read an essay written in English or one encoded with numbers, each number representing a letter? It's the same idea. What a regex does behind the scenes is roughly the same as what I posted: string manipulation, except that here it is presented more clearly, and the reader does not have to guess what the code means.

2009-11-30 15:00:07

Take, for example, the regex posted by Rubens. Seriously, if you can decipher what it means at first glance, I take my hat off to you.

2009-11-30 15:01:30

also here you get to make sure the code is optimized. Who knows what the regex will do... you know </a> is the best place to split the code -- going to be much better than the regex.

Hogan 2009-11-30 17:48:16

See my comment below -- that is the "non complex" way to do it.

Hogan 2009-11-30 17:56:59

Answer 5

A:

The right way to do this is to load the html into the C# xml parser and then use xpath to query the URLs. This way you don't have to worry about parsing at all.

Hogan 2009-11-30 17:56:16

Can you show an example of how this works?

Peter Stuifzand 2009-12-01 12:35:49

Why write when you can link: http://www.c-sharpcorner.com/UploadFile/shehperu/SimpleXMLParser11292005004801AM/SimpleXMLParser.aspx is a nice simple example. http://developer.yahoo.com/dotnet/howto-xml_cs.html is a more complex one. But as you can see... all you do is read it into the XML object and then query it with XPath. You will then have a list of href attributes. Simple. Done.
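For anyone who wants to see the shape of it without following the links, here is a minimal sketch in Python rather than C#, using the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. It assumes the markup is well-formed XML; real-world HTML often is not, in which case the parser raises an error. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
doc = """<html><body>
<div><a href="http://example.com/one">One</a></div>
<div><a name="anchor-only">no href</a> <a href='two.html'>Two</a></div>
</body></html>"""

root = ET.fromstring(doc)

# ".//a[@href]" selects every <a> element at any depth that has an href attribute.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]
print(hrefs)
```

The parser deals with quoting styles and attribute order, so none of the regex edge cases discussed above apply here.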

Hogan 2009-12-01 20:40:57

Get url from html code using Regex