I am writing a Java program that connects to a website and returns its HTML, but I am having problems with it. Right now I am only able to access the website if I do

 String host = "www.google.com"; // example

but if I want to access a URL that is any more complicated, I get an UnknownHostException. At first I thought it might have something to do with certain characters in the URL not being recognized, but I'm not sure. For example, here is one of the URLs I'm trying to access.

host ="http://www.cyberspacei.com/englishwiz/library/name/etymology_of_first_names.htm";
int port = 80;
Socket s = new Socket(host, port);

....etc

and it won't return anything but an UnknownHostException.

Somebody please help me!!!

+5  A: 

It is failing because Socket expects a hostname, not a URL like the one you are passing. If you want the document at that URL, you need to use the URL class.

URL url = new URL("http://www.thesite.com/thefile.html");
Object doc = url.getContent();

Of course, you need to replace that "Object doc" with something that is prepared to hold that content, since getContent() only returns a plain Object.
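For the common case of just wanting the HTML as text, a minimal sketch along these lines might look like the following. It reads the stream from the URL instead of calling getContent(). The class name PageFetcher and the helper names readAll and fetch are made up for illustration, and decoding as UTF-8 is an assumption (the real page may use another charset).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
public class PageFetcher {

    // Drain a stream completely and decode it as UTF-8 (assumed charset).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Let the URL class handle the HTTP request and hand back the body.
    static String fetch(String address) throws IOException {
        URL url = URI.create(address).toURL();
        try (InputStream in = url.openStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the page address on the command line.
        if (args.length > 0) {
            System.out.println(fetch(args[0]));
        }
    }
}
```

Here the full address, including "http://", is exactly what you pass in; the URL class works out the host and the path itself.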

ONi
+3  A: 

The "host" parameter for the Socket object specifies which machine to connect to on the network (internet). This is different from a URI used in a web browser which includes the protocol, server, and the directory structure of the file or object being requested.

Socket s = new Socket("www.cyberspacei.com", 80); will open a new raw socket to the web server running on that machine, but it will then be up to you to negotiate the HTTP protocol over that socket and request "/englishwiz/library/name/etymology_of_first_names.htm"
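If you do stay with Socket, one way to get the machine name and the file path out of the full address is to let java.net.URI take it apart. This is only a sketch: the class name UrlParts and its three helper methods are invented for illustration, and falling back to port 80 assumes a plain http address.

```java
import java.net.URI;

// Hypothetical helper class for splitting an address into Socket-friendly parts.
public class UrlParts {

    // The machine name only, which is what Socket wants as "host".
    static String hostOf(String address) {
        return URI.create(address).getHost();
    }

    // URI.getPort() returns -1 when no port is written; assume HTTP's default of 80.
    static int portOf(String address) {
        int port = URI.create(address).getPort();
        return port == -1 ? 80 : port;
    }

    // The part that goes into the HTTP request line; "/" when the address has no path.
    static String pathOf(String address) {
        String path = URI.create(address).getRawPath();
        return (path == null || path.isEmpty()) ? "/" : path;
    }

    public static void main(String[] args) {
        String address = "http://www.cyberspacei.com/englishwiz/library/name/etymology_of_first_names.htm";
        System.out.println(hostOf(address));
        System.out.println(portOf(address));
        System.out.println(pathOf(address));
    }
}
```

The values from hostOf and portOf are what you would hand to new Socket(host, port); the value from pathOf is what you then ask for over HTTP.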

You might save yourself some headaches by using a library such as HttpClient, which takes a lot of the legwork out of the HTTP negotiation, as long as you don't need raw access to the HTTP stream.

http://hc.apache.org/httpclient-3.x/index.html

emills
+2  A: 

Hi there

I'm not an expert in the field of Java, but I know what went wrong.

Firstly, the host variable should only contain the host of the URL.

The host of the URL http://www.cyberspacei.com/englishwiz/library/name/etymology_of_first_names.htm is actually 'www.cyberspacei.com'

So you connect to the host, then send an HTTP request line and headers to ask for the page you are looking for, followed by a blank line.

GET /englishwiz/library/name/etymology_of_first_names.htm HTTP/1.0
Host: www.cyberspacei.com
Accept: */*
Connection: close

Some web pages may need User-Agent or Referer headers to work, so add those fields as appropriate.
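In Java, sending such a request over a raw socket could be sketched roughly as below. The class name RawHttpGet and its methods are invented for illustration. The sketch assumes plain HTTP with no redirects or TLS, and what it returns is the whole response, status line and headers included, not just the HTML.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
public class RawHttpGet {

    // Build the request text; every line ends in CRLF and a blank line ends the headers.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "Accept: */*\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    // Connect to the host only, send the request, and read until the server closes.
    static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream response = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                response.write(chunk, 0, n);
            }
            // ISO-8859-1 maps every byte to a character; the page's real charset is unknown here.
            return new String(response.toByteArray(), StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        // Expects: host port path
        if (args.length == 3) {
            System.out.println(fetch(args[0], Integer.parseInt(args[1]), args[2]));
        }
    }
}
```

Extra headers such as User-Agent would be added as further "Name: value" lines before the final blank line.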

thephpdeveloper
Thanks, this answer was very helpful, I was able to fix my problem in a matter of minutes...I appreciate your help
CitadelCSAlum
no problem at all ^^
thephpdeveloper
+1  A: 

@ONi is right here. You're using the Socket class, which means you're working with raw sockets and have to write the HTTP requests yourself. You want something more like the URL class, because that class 'understands' HTTP requests and just gives you the content of a website.

It's like the difference between printing out & reading an email from your computer (URL class) vs. sticking the ethernet cord in your mouth and trying to decipher the signals with your tongue. The Socket() class is too low-level for what you're doing.

Rocketmonkeys