ansaurus

Question

Answer 1

+2 A:

If I understand you correctly, you are trying to heuristically extract URLs from HTML files, from both attributes (e.g. "href") and text.

You want it to work with malformed HTML
You want it to work with malformed URLs; e.g. URLs containing spaces
You don't want it to make any mistakes; e.g. your example.

I put it to you that your requirements are impossible. For example, what should be extracted from the following text:

Go to the URL http://example.com/ this and that.  And if that doesn't work, 
I recommend that you go read the http specification.

Is "this and that" supposed to be part of the URL, or not? And how is your software supposed to figure this out? And what if the author of the document meant the opposite to what your heuristic says? And what about "http specification" ... which is clearly NOT a URL.

And here's another, slightly more subtle example:

First, go to the URL http://example.com/index.html.
Then click on the "login" link.

Should your software extract "http://example.com/index.html." or "http://example.com/index.html"? According to the URL specification, both are valid URLs. Your software would probably strip off the final "." because it is most likely to be punctuation, but it might be wrong.

My advice:

Don't think you can do a better job than an existing permissive HTML parser. Where you are coming from, the chances that you can are close to zero.
Don't think that your software won't make mistakes. 100% accuracy requires that your software can read the mind of the person who created the file. (And arguably, even that is not sufficient.)
Pay attention to the context in which URLs appear. You need to use different heuristics to extract URLs from HTML attributes and text.
Pay attention to exactly what is, and what is not a legal URL.
Fully read and understand all relevant parts of the HTML and URL/URI specifications. While it is kind of OK to make mistakes with malformed documents, it would be unforgivable to fail to extract well-formed URLs from attributes of well-formed HTML documents.

Stephen C 2010-08-01 02:03:24

lots of years developing software make me understand that nothing is impossible, the problem is that some things are mor complicated than others, but not impossibles ;)thanks for the advices, my final solution is probably going to be an intelligent agent, but for now I was trying to optimize this little code to gain time :D

Saikios 2010-08-01 02:19:06

seriously I appreciate the advices ;) you are the only one who answer =)

Saikios 2010-08-01 02:23:18

@Saikios - I've been developing software for 30+ years, and I can tell you that some problems are **provably** impossible. Extracting the **intended** meaning from an ambiguous text is one such problem ... unless your software can read minds.

Stephen C 2010-08-01 02:29:55

hahaha, Stephen, I'm younger than you, but I had been in a conference where Uma Ramamurthy share his work at Menphis U and if what she is doing is possible I believe mind-reading should be piece of cake for a simple ai soft

Saikios 2010-08-01 02:35:13

@Saikios - well if you believe is is possible ... good luck. But don't say you weren't warned.

Stephen C 2010-08-01 02:55:22

Thanks for everything =D, I will keep working on this, but I guess that the only solution would be with ai

Saikios 2010-08-01 11:41:55

Answer 2

A:

John Gruber has a great regular expression for finding URLs in plain text: see An Improved Liberal, Accurate Regex Pattern for Matching URLs

There will always be ambiguities, but John's regex does an excellent job of working in Real Life.

Jonathan Hedley 2010-08-04 10:28:23

reading :) thanks, long time not reading that blog :-D

Saikios 2010-08-05 02:09:44

ansaurus

tags:

views:

answers:

How to optimize this ugly code?

related questions