ansaurus

Question

Answer 1

A:

Use a DOM parser to extract all <a href> tags, and, if desired, additionally scan the source for http:// outside of those tags.
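As a rough illustration of the second half of that suggestion (scanning the raw source for `http://`), here is a minimal sketch using only `java.util.regex`. The class name `UrlScanner`, the method `findUrls`, and the deliberately loose pattern are assumptions made for this example; they are not from the answer, and the pattern is not a complete URL grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects every http:// or https:// run in a string.
class UrlScanner {
    // Assumption: a URL starts with a scheme and continues until whitespace,
    // a quote, or an angle bracket. This is loose on purpose.
    private static final Pattern URL =
            Pattern.compile("https?://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    static List<String> findUrls(String source) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(source);
        // find() walks the input and reports each non-overlapping match.
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

Calling `UrlScanner.findUrls(html)` on a page source returns URLs from plain text and from attribute values alike, which also means it picks up script and stylesheet URLs. Trailing punctuation such as a sentence-ending period is kept by this pattern and would need trimming separately.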

Borealid 2010-07-30 04:01:26

Not all URLs are in tags; some are in plain text, and some are in links or other tags :S

Saikios 2010-07-30 04:03:59

@Saikios : That's what I said. The part about scanning for other links outside the tags. The second half of my single-sentence answer. Was it too long?

Borealid 2010-07-30 04:07:01

Hey, no, it wasn't. But if I do that, what benefit do I get... just a few hrefs? :( My idea was to do something like str_replace using indexes, or a regexp, but it's going over my head.

Saikios 2010-07-30 04:11:13

Take a look at [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)

Borealid 2010-07-30 04:15:11

I can't parse HTML with regex, but I can discard everything that doesn't sit next to a .edu, .com, .gov, etc. It is a really complex regex, though :S

Saikios 2010-07-30 04:19:59
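The idea in the comment above (keying on endings such as .edu, .com and .gov) could be sketched roughly as follows. The class name `TldScanner`, the method `findByTld`, and the short suffix list are assumptions for illustration only; a real list of top-level domains is far longer, and this pattern will also match the domain part of an e-mail address.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds host-like strings by their ending, with or without a scheme.
class TldScanner {
    // Assumption: only these five suffixes count. Extend the alternation as needed.
    private static final Pattern BY_TLD = Pattern.compile(
            "(?:https?://)?"                  // optional scheme
            + "(?:[a-z0-9-]+\\.)+"            // one or more dot-terminated labels
            + "(?:com|edu|gov|org|net)\\b"    // illustrative suffix list
            + "(?:/[^\\s\"'<>]*)?",           // optional path
            Pattern.CASE_INSENSITIVE);

    static List<String> findByTld(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = BY_TLD.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}
```

For the input `"mail me at www.example.edu/page or see http://foo.gov, not file.txt"`, this sketch picks out `www.example.edu/page` and `http://foo.gov` and skips `file.txt`, since `txt` is not in the suffix list.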

Answer 2

+4 A:

Try using an HTML parsing library, then search for <a> tags in the HTML document.

    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
    Elements links = doc.select("a[href]"); // a with href

> Not all URLs are in tags; some are in plain text, and some are in links or other tags.

You shouldn't scan the HTML source to achieve this.

You will end up with links that are not necessarily in the 'text' of the page, i.e. you could end up with 'links' to JS scripts in the page, for example.

The best way is still to use a tool made for the job.

You should grab the HTML tags most likely to have 'links' inside them (say <h1>, <p>, <div>, etc.). HTML parsers provide regex-like functionality for filtering tag content, similar to your "starts with HTTP" logic.

[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. select("[href*=/path/]")

See: jSoup.

Bakkal 2010-07-30 04:01:57

Not all URLs are in tags; some are in plain text, and some are in links or other tags :S By the way, I added the jSoup page to my bookmarks for future reference, but I can't use it in this project :( I need all the URLs.

Saikios 2010-07-30 04:05:09

I will give this a try. Thanks, @Bakkal.

Saikios 2010-07-30 04:17:37

Answer 3

+1 A:

You may want to have a look at XPath or Regular Expressions.

Hamid Nazari 2010-07-30 04:02:30

Hi, I'm using Java, but anyway, as I told the others, I'm trying to get all the URLs in the string: text, a, link, etc. Thanks to everybody who answered ;)

Saikios 2010-07-30 04:07:02

Answer 4

A:

The best way should be to google for regexes. One example is this one:

    /^(https?):\/\/((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?((?:\/(?:[a-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)(?:\?((?:[a-z0-9\-._~!$&'()*+,;=:\/?@]|%[0-9A-F]{2})*))?(?:#((?:[a-z0-9\-._~!$&'()*+,;=:\/?@]|%[0-9A-F]{2})*))?$/i

found in a Hacker News article. As far as I can follow it, it looks good. But as far as I know there is no formal regex for this problem, so the best solution is to google for some and try which one matches most of what you want.

erikb 2010-07-30 04:31:35

Cool regex, but it's for something else: it checks whether a URL is well formed, not for extracting URLs from a big string. The one I was thinking of is probably bigger :P

Saikios 2010-07-30 04:42:58
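The distinction raised in that last comment comes from the `^` and `$` anchors in the quoted regex: an anchored pattern can only match when the whole input is a URL. The sketch below contrasts the two uses with a much simpler stand-in pattern. `AnchorDemo`, `CORE`, `isWholeUrl` and `containsUrl` are hypothetical names, and `CORE` is an assumption for illustration, not the regex from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo: the same core pattern, with and without anchors.
class AnchorDemo {
    // Assumption: a simplified URL shape (scheme, host, optional path).
    static final String CORE = "https?://[a-z0-9.\\-]+(?:/[^\\s\"'<>]*)?";

    // Anchored: succeeds only if the entire input is one URL (validation).
    static final Pattern VALIDATE =
            Pattern.compile("^" + CORE + "$", Pattern.CASE_INSENSITIVE);

    // Unanchored: can match a URL anywhere inside a larger string (extraction).
    static final Pattern EXTRACT =
            Pattern.compile(CORE, Pattern.CASE_INSENSITIVE);

    static boolean isWholeUrl(String s) {
        return VALIDATE.matcher(s).find();
    }

    static boolean containsUrl(String s) {
        return EXTRACT.matcher(s).find();
    }
}
```

With the input `"visit http://example.com/x now"`, `isWholeUrl` returns false while `containsUrl` returns true. Dropping the anchors is what turns a validation regex into one that can be used for extraction.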

How to find URLs in HTML using Java
