views: 210
answers: 4
I am retrieving HTML from the web, and I get "java.lang.OutOfMemoryError: Java heap space (repl-1:3)".

;; fetch : URL -> String
;; fetch returns the HTML at url as a single string
(import '(java.io BufferedReader InputStreamReader)
        '(java.net URL))

(defn fetch [url]
  (with-open [stream (.openStream url)]
    (let [buffer (BufferedReader. (InputStreamReader. stream))]
      (apply str (line-seq buffer)))))

I think the problem is the "apply str". Is there an easier way to

  • convert the BufferedReader to a string, or
  • retrieve the web page some other way?

Edit: I need to retrieve

http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?g=p&v=or&v=measures&v=being&v=taken&v=against,&v=corrupt&v=public&v=officials&v=of&v=any&v=governmental&v=jurisdiction&v=worldwide.

+1  A: 

What do you mean by "being too slow"? I can't imagine the language matters much, since the bottleneck here would be the network.

tomjen
Sorry, I meant that I get java.lang.OutOfMemoryError: Java heap space (repl-1:3).
kunjaan
@tomjen: niavely concatenating a list of N strings of average length M copies O(N*N*M) bytes. By contrast, downloading involves copying only O(N*M) bytes. The constants of proportionality matter, but for large enough N the string concatenation WILL take longer than the download. This is an algorithm issue, not a language issue.
Stephen C
PS, I can spell "naive" ... I just can't type it :-)
Stephen C
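Stephen C's back-of-the-envelope argument can be checked with a small sketch (an editorial addition, not from the thread; the loop just tallies bytes copied under each strategy, with n and m as assumed values):

```java
// Editorial sketch: tally the bytes copied when concatenating n strings of
// length m either naively (s = s + line) or with a StringBuilder.
public class ConcatCost {
    static long[] copies(int n, int m) {
        long naive = 0;   // repeated String concatenation
        long builder = 0; // StringBuilder.append
        long len = 0;     // length of the accumulated string so far
        for (int i = 0; i < n; i++) {
            naive += len + m; // copies the whole prefix plus the new line
            builder += m;     // appends only the new characters
            len += m;
        }
        return new long[]{naive, builder};
    }

    public static void main(String[] args) {
        long[] c = copies(1000, 10);
        System.out.println(c[0]); // ~ n*n*m/2 bytes: 5005000
        System.out.println(c[1]); // n*m bytes: 10000
    }
}
```

Even at a modest 1000 lines of 10 characters, the naive strategy copies about 500 times as many bytes as the StringBuilder strategy, and the gap widens quadratically with the page size.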
+1  A: 

What is the current size of the heap? You can give the JVM more heap space with the -Xmx argument.

See JVM Tuning for more information. If you have more time, try using a Java profiler to see why your application is running out of memory. Although you can resize the heap, that is only a temporary workaround.
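A hedged example of such an invocation (the jar and script names are placeholders; -Xms and -Xmx are the standard flags for the initial and maximum heap sizes):

```shell
# Raise the maximum heap to 512 MB when starting a Clojure script.
# clojure.jar and fetch.clj are placeholder names for this sketch.
java -Xms64m -Xmx512m -cp clojure.jar clojure.main fetch.clj
```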

Naqeeb
+6  A: 

Yikes. line-seq is going to create one String object per line, which you then eventually concatenate and discard, which is going to be slow and painful. Using apply like that is going to put all of those Strings into an enormous list and call str on that, which is also going to be painful.

Try this instead:

(use 'clojure.contrib.duck-streams)  ;SO's syntax highlighting sucks
(slurp* (reader url))

slurp* uses a StringBuilder, which is a better way to build up a large String in Java.

Brian Carper
This code also chokes when the URL returns a lot of data.
kunjaan
+1  A: 

There are two possibilities:

  1. The size of the content that you are fetching is a significant proportion of the available heap space, and your algorithm requires 2 or 3 times the size in working storage during the reading / concatenation process. In this case, increasing the heap space is a reasonable workaround.

  2. The algorithm is actually using O(N^2) space to do the concatenation with apply. It is not inconceivable that apply is implemented recursively and that the Clojure compiler / JIT compiler produce recursive code holding references to many intermediate strings. In this case, increasing the heap space is a poor workaround.

Either way, I'd start by replacing (apply str (line-seq buffer)) with a more efficient alternative (see @Brian's answer, and my comment on @tomjen's answer) ... and only worry about the heap usage if it is still an issue. (I suspect that it won't be.)
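For reference, the StringBuilder approach that slurp* takes looks roughly like this in plain Java (a sketch of the technique, not slurp*'s actual source):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class Slurp {
    // Read an entire Reader into one String via a StringBuilder:
    // each chunk is appended exactly once, so the total work is
    // linear in the input size rather than quadratic.
    static String slurp(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the network stream in this sketch.
        System.out.println(slurp(new StringReader("hello\nworld")));
    }
}
```

Note that, unlike the line-seq version in the question, reading in raw chunks also preserves the newlines instead of silently dropping them.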

Stephen C