I want to open a webpage (whose URL is given as a command-line argument) and then save the content of that webpage as a .txt file.

Remember, I need the .txt file and not the source of the webpage.

I tried my hand with Selenium and it works fine. But now I want something that doesn't open a real browser, because launching the browser and loading a page in it is time-consuming.

I want to do it in Java.

By content, I mean the text (without markup) that we get when we save a webpage in IE by going to "Save As" and then selecting ".txt" as the output format of the file.

+3  A: 

If I understand your question correctly, you want to render the page and copy the rendered text without using a visible browser.

For this, you'll need a headless browser. HtmlUnit would be a good choice.

To get the text content, you could do something like this (not tested):

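// An HTML response comes back as an HtmlPage; TextPage is only returned for text/plain content.
WebClient c = new WebClient(BrowserVersion.INTERNET_EXPLORER_6);
HtmlPage page = c.getPage("yoururl");
// asText() gives the rendered text without markup (newer HtmlUnit releases call it asNormalizedText()).
String content = page.asText();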

(see Javadoc)
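
Once the text is in a String, saving it needs only the standard library. The following is a minimal sketch, not part of HtmlUnit: the class name TextSaver, the method save, and the default file name page.txt are made up for illustration, and UTF-8 is assumed as the output encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: the class and method names are illustrative only.
class TextSaver {
    /** Writes the text to fileName as UTF-8, replacing any existing file. */
    static Path save(String content, String fileName) throws IOException {
        Path target = Paths.get(fileName);
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the string obtained from the headless browser.
        String content = "Example page text";
        // Output file name: first argument if given, otherwise an assumed default.
        String fileName = args.length > 0 ? args[0] : "page.txt";
        System.out.println("Wrote " + save(content, fileName).toAbsolutePath());
    }
}
```

In a real program, the placeholder string would be replaced by the text extracted from the page.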

Valentin Rocher
Yes, you have understood my question correctly. I have opened the webpage in the headless browser provided by HtmlUnit. But now I don't know how to save the HtmlPage so as to output the desired file.
Yatendra Goel
I added some example code.
Valentin Rocher
Yes, I have seen it and am trying it. It is throwing some exceptions and I am trying to find the cause... Thanks for that.
Yatendra Goel
A: 

Hmm, I'd even code that from scratch. It does not seem like a complex thing, and it might not even be worth adding a dependency on another library to your project:

  • Open a URLConnection to that URL
  • Get a stream from the connection and apply a regex to strip all the HTML from the data. If the page is not expected to be too large for your memory requirements :), read the page into a String and then apply the regex. Alternatively, give a shot to what's described here (I have no experience with the approach described there, though).
  • Save the output to a .txt file.
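
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the names PageToText, fetch and stripHtml are made up, the page is assumed to be UTF-8 and small enough to hold in memory, and the regexes are deliberately naive (they do not run JavaScript, decode only a few entities, and will trip over unusual markup).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical class: all names here are illustrative only.
class PageToText {
    /** Reads the whole response body into a String (assumes UTF-8). */
    static String fetch(String address) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Naive regex-based tag stripping; this is not a real HTML parser. */
    static String stripHtml(String html) {
        return html
                // Drop script and style elements together with their bodies.
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1\\s*>", " ")
                // Drop comments, then any remaining tags.
                .replaceAll("(?s)<!--.*?-->", " ")
                .replaceAll("(?s)<[^>]*>", " ")
                // Decode a handful of common entities; &amp; goes last.
                .replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&")
                // Collapse runs of whitespace into single spaces.
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java PageToText <url> [output.txt]");
            return;
        }
        String text = stripHtml(fetch(args[0]));
        // Output file name: second argument if given, otherwise an assumed default.
        String out = args.length > 1 ? args[1] : "page.txt";
        Files.write(Paths.get(out), text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Because this never renders the page, the result will differ from a browser's "Save As" text wherever content is generated by scripts or laid out by CSS.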
david a.