How to go about reading a web page lazily in Clojure

+3 Q:

A friend and I recently implemented link grabbing in my Clojure IRC bot. When it sees a link, it slurp*s the page and grabs the title from it. The problem is that it has to slurp* the ENTIRE page just to grab the title.

How does one go about reading a page lazily until the first </title>?

+5 A:

Use line-seq but don't forget to close the underlying stream when done.

cgrand 2010-04-13 12:26:12
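
For illustration, the `line-seq` suggestion could look roughly like the sketch below. It is not cgrand's code: the function name `title-from-lines` is made up, and it assumes a current Clojure where `clojure.java.io/reader` replaces the old contrib readers. Lines are consumed only up to the first one containing `</title>`, and `with-open` closes the stream afterwards.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn title-from-lines
  "Hypothetical helper: reads lines from `source` (anything accepted by
  clojure.java.io/reader, e.g. a URL or a Reader) until one contains
  </title>, then returns the trimmed title text, or nil if none is found."
  [source]
  (with-open [rdr (io/reader source)]
    (let [lines (line-seq rdr)
          ;; `before` holds the lines preceding the closing tag,
          ;; `closing` is the first line that contains </title> (or nil)
          [before [closing]] (split-with #(not (re-find #"(?i)</title>" %)) lines)
          ;; re-join only the lines read so far, still inside with-open
          head (str/join "\n" (concat before (when closing [closing])))]
      (some-> (re-find #"(?is)<title[^>]*>(.*?)</title>" head)
              second
              str/trim))))
```

Something like `(title-from-lines (java.net.URL. "http://example.com/"))` would be the intended call. Note that the underlying `BufferedReader` still fetches data in buffer-sized chunks, and a page without a `</title>` is read to the end.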

I forgot about line-seq. Thanks. :)

Rayne 2010-04-13 12:49:17

+4 A:

I wouldn't count on the HTML being split into lines in any sensible way. Even in our own backyard, Compojure (or rather Hiccup these days, I guess) doesn't bother inserting line breaks (update: just checked Hiccup, no line breaks).

What I'd suggest instead is lazy XML parsing (with clojure.contrib.lazy-xml) on top of a java.io.BufferedInputStream.

Michał Marczyk 2010-04-13 14:40:46
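
A rough sketch of the lazy-parsing idea follows. It deliberately swaps `clojure.contrib.lazy-xml` for the JDK's StAX pull parser (`javax.xml.stream`), only so that the example needs no extra library; the function name `title-via-pull-parser` is hypothetical. The parser is asked for one event at a time and is closed as soon as the `title` element has been read, so nothing after it gets parsed. It assumes the start of the page is well-formed enough for an XML parser, which real-world HTML often is not.

```clojure
(import '(javax.xml.stream XMLInputFactory XMLStreamConstants))

(defn title-via-pull-parser
  "Hypothetical helper: pulls XML events from `rdr` one at a time and
  returns the text of the first <title> element, or nil if there is none."
  [^java.io.Reader rdr]
  (let [factory (doto (XMLInputFactory/newInstance)
                  ;; do not process DTDs referenced by the page
                  (.setProperty XMLInputFactory/SUPPORT_DTD false))
        xml     (.createXMLStreamReader factory rdr)]
    (try
      (loop []
        (when (.hasNext xml)
          (if (and (= XMLStreamConstants/START_ELEMENT (.next xml))
                   (.equalsIgnoreCase "title" (.getLocalName xml)))
            ;; reads the element's text up to its end tag and stops there
            (.getElementText xml)
            (recur))))
      (finally
        ;; closes the parser only; the caller still owns `rdr`
        (.close xml)))))
```

The reader would typically come from something like `(clojure.java.io/reader url)` inside a `with-open`, so that the connection is released once the title has been found.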

BTW, I'm hardly an expert in this area, but I don't believe that you can request something less than the full source of the page over HTTP; that's not really a problem though, as you won't be getting the big stuff like images anyway. `c.c.io/reader` will get you a `java.io.BufferedReader` on the source (`slurp*` itself reads everything into a string), and `c.c.lazy-xml` will do the one thing which might be important for performance, which is *not* to parse anything beyond the `<title>...</title>` bit that you're looking for. (Hm, should I include this in the answer proper...? Maybe I will next time, got to take a break now.)

Michał Marczyk 2010-04-13 14:52:12

There's the Range header, but not all servers support it.

Jouni K. Seppänen 2010-04-13 16:32:26
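
As a sketch of the Range idea: the header can be set on a plain `HttpURLConnection` before the request is sent. The helper name `ranged-connection` and the byte count in the usage note are made up for illustration. A server that honours the header answers with status 206 and only the requested bytes; one that ignores it answers 200 with the full body, so the response should still be read lazily.

```clojure
(import '(java.net URI HttpURLConnection))

(defn ranged-connection
  "Hypothetical helper: returns a not-yet-connected HttpURLConnection for
  `url` that asks the server for only the first `n` bytes of the page."
  [url n]
  (doto ^HttpURLConnection (.openConnection (.toURL (URI. url)))
    ;; byte ranges are inclusive, so the first n bytes are 0..n-1
    (.setRequestProperty "Range" (str "bytes=0-" (dec n)))))
```

For example, `(ranged-connection "http://example.com/" 4096)` prepares a request for the first 4 KB; calling `.getInputStream` on the result sends it, and `.getResponseCode` tells whether the range was honoured.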

The way I'm doing it, it doesn't really matter. `line-seq` does what I need.

Rayne 2010-04-13 20:06:45

@Jouni K. Seppänen: Ah, right. Thanks. @Rayne: Well, it's unlikely to break anything for you (except that you might have to re-concatenate some initial lines, which seems like unnecessary effort, and it might be a bit fragile in the face of some particularly ugly HTML, depending on how you go about finding the last interesting line), but I'm not sure what you gain by using it over what `c.c.lazy-xml`, perhaps together with `c.c.io/reader` (in a `with-open`), gives you. Anyway, happy hacking. :-)

Michał Marczyk 2010-04-14 06:05:52

Another thing you can't be sure of is receiving well-formed XML. I usually combat this problem by running my "sloppy HTML" input text through JTidy before trying to process it as XML. Of course, doing this kinda collides with the aim of not reading the whole page; so I mention this only for general interest.

Carl Smotricz 2010-04-23 18:23:32
