views:

215

answers:

2

I asked a question here the other day, but in the end I decided to do it myself due to time constraints; now I have a little more time to fix it :D I liked jSoup, but I'm kind of old school and prefer doing it myself (thanks to @Bakkal anyway).

I managed to write this code, and it works fine for now, but if a webpage is not well constructed it will break: if a URL doesn't start with http the code won't find it, and if the URL doesn't end with one of the delimiters I check in the while loop it returns a really ugly address. For example:

http://www.google.com/ hey dude how are you? great, eating at jack's

My result would be:

http://www.google.com/ hey dude how are you? great, eating at jack

I'm open to suggestions, any of them. I will summarize my questions, and after that I will post the code:

  1. The code breaks if the URL doesn't end with one of the exact delimiters I check for
  2. If I use the space " " as a delimiter I'm going to lose all the pages that have a space in the address
  3. I would like to capture all addresses, not only the ones starting with http; for example www.google.com is a valid address, and so is contacts.google.com

Thanks for everything :D

File txtUrlSpecialFile = new File("pepe.txt");
FileWriter txtUrlSpecial = new FileWriter(txtUrlSpecialFile);
// note: Writer.write(int) writes a single character code, not the number as text
txtUrlSpecial.write(profundidad - 1);

for (int j = 0; j + 4 <= bigString.length(); j++) {
    if (bigString.substring(j, j + 4).equals("http")) {
        // copy characters until a delimiter or the end of the string
        while (j < bigString.length()
                && bigString.charAt(j) != '"'
                && bigString.charAt(j) != '<'
                && bigString.charAt(j) != '\'') { // was substring(j, j) != "'", a reference comparison of an empty string that was always true
            txtUrlSpecial.write(bigString.charAt(j));
            j++;
        }
        txtUrlSpecial.write(SingletonFunction.getNewLine());
    }
}
txtUrlSpecial.close();
+2  A: 

If I understand you correctly, you are trying to heuristically extract URLs from HTML files, from both attributes (e.g. "href") and text.

  • You want it to work with malformed HTML
  • You want it to work with malformed URLs; e.g. URLs containing spaces
  • You don't want it to make any mistakes; e.g. your example.

I put it to you that your requirements are impossible. For example, what should be extracted from the following text:

Go to the URL http://example.com/ this and that.  And if that doesn't work, 
I recommend that you go read the http specification.

Is "this and that" supposed to be part of the URL, or not? And how is your software supposed to figure this out? And what if the author of the document meant the opposite to what your heuristic says? And what about "http specification" ... which is clearly NOT a URL.

And here's another, slightly more subtle example:

First, go to the URL http://example.com/index.html.
Then click on the "login" link.

Should your software extract "http://example.com/index.html." or "http://example.com/index.html"? According to the URL specification, both are valid URLs. Your software would probably strip off the final "." because it is most likely to be punctuation, but it might be wrong.

My advice:

  1. Don't think you can do a better job than an existing permissive HTML parser. Given where you are starting from, the chances that you can are close to zero.
  2. Don't think that your software won't make mistakes. 100% accuracy requires that your software can read the mind of the person who created the file. (And arguably, even that is not sufficient.)
  3. Pay attention to the context in which URLs appear. You need to use different heuristics to extract URLs from HTML attributes and text.
  4. Pay attention to exactly what is, and what is not a legal URL.
  5. Fully read and understand all relevant parts of the HTML and URL/URI specifications. While it is kind of OK to make mistakes with malformed documents, it would be unforgivable to fail to extract well-formed URLs from attributes of well-formed HTML documents.
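To illustrate point 3, here is a minimal sketch (my own code, not from the question) of using different heuristics for attribute values versus plain text: inside a quoted attribute the URL ends at the closing quote, so spaces can be preserved, while in text the only reasonable assumption is that whitespace ends the URL. The class and pattern names are mine:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContextAwareExtractor {
    // Attribute context: the URL runs to the closing quote, so embedded
    // spaces survive.
    private static final Pattern HREF =
        Pattern.compile("(?i)(?:href|src)\\s*=\\s*[\"']([^\"']+)[\"']");

    // Text context: we cannot keep spaces, so assume whitespace ends the URL.
    private static final Pattern TEXT_URL =
        Pattern.compile("(?i)\\bhttps?://\\S+");

    public static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher attr = HREF.matcher(html);
        while (attr.find()) {
            urls.add(attr.group(1));
        }
        // Crude heuristic: strip tags, then scan the remaining plain text.
        String text = html.replaceAll("<[^>]*>", " ");
        Matcher m = TEXT_URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

Note how the attribute heuristic correctly recovers a URL with a space, which the text heuristic by design cannot.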
Stephen C
Lots of years developing software have taught me that nothing is impossible; the problem is that some things are more complicated than others, but not impossible ;) Thanks for the advice. My final solution will probably be an intelligent agent, but for now I was trying to optimize this little code to save time :D
Saikios
Seriously, I appreciate the advice ;) you are the only one who answered =)
Saikios
@Saikios - I've been developing software for 30+ years, and I can tell you that some problems are **provably** impossible. Extracting the **intended** meaning from an ambiguous text is one such problem ... unless your software can read minds.
Stephen C
hahaha, Stephen, I'm younger than you, but I was at a conference where Uma Ramamurthy shared her work at Memphis U, and if what she is doing is possible, I believe mind-reading should be a piece of cake for a simple AI program
Saikios
@Saikios - well, if you believe it is possible ... good luck. But don't say you weren't warned.
Stephen C
Thanks for everything =D, I will keep working on this, but I guess that the only solution would be with ai
Saikios
A: 

John Gruber has a great regular expression for finding URLs in plain text: see An Improved Liberal, Accurate Regex Pattern for Matching URLs

There will always be ambiguities, but John's regex does an excellent job of working in Real Life.
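A much-simplified sketch of that idea in Java (this is my own cut-down pattern, not John's full regex, which also handles things like balanced parentheses): match http(s) URLs and bare "www." hosts, then trim trailing punctuation that is usually sentence structure rather than part of the URL:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractor {
    // Simplified liberal pattern: a scheme or a bare "www." host, followed
    // by any run of non-whitespace, non-quote, non-angle-bracket characters.
    private static final Pattern URL = Pattern.compile(
        "(?i)\\b((?:https?://|www\\.)[^\\s<>\"']+)");

    public static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            String url = m.group(1);
            // Trim trailing punctuation that is most likely not part of
            // the URL (e.g. the final "." of a sentence).
            url = url.replaceAll("[.,;:!?)\\]]+$", "");
            urls.add(url);
        }
        return urls;
    }
}
```

This handles the asker's "www.google.com" case and the trailing-period ambiguity from the earlier answer, at the cost of occasionally trimming a dot that really was part of the URL.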

Jonathan Hedley
reading :) thanks, it's been a long time since I read that blog :-D
Saikios