tags:

views:

259

answers:

3

I have a URL checker that I use in Perl. I was wondering how something like this would be done in Clojure. I have a file with thousands of URLs and I'd like the output file to contain the URL (minus http://, https://) and a simple :1 for valid and :0 for false. Ideally, I could check each site concurrently, considering that this is one of Clojure's strengths.

Input

http://www.google.com
http://www.cnn.com
http://www.msnbc.com
http://www.abadurlisnotgood.com

Output

www.google.com:1
www.cnn.com:1
www.msnbc.com:1
www.abadurlisnotgood.com:0

+3  A: 

Write a small function that appends a ":1" or ":0" to a url and then use pmap to apply it in parallel to all the urls.

(defn check-a-url [url] .... )
(pmap #(if (check-a-url %) (str url ":1") (str url ":0")))
Arthur Ulfeldt
+4  A: 

I assume by "valid URL" you mean HTTP response 200. This might work. It requires clojure-contrib. Change map to pmap to attempt to make it parallel, like Arthur Ulfeldt mentioned.

(use '(clojure.contrib duck-streams
                       java-utils
                       str-utils))

(import '(java.net URL
                   URLConnection
                   HttpURLConnection
                   UnknownHostException))

(defn check-url [url]
  (str (re-sub #"^(?i)http:/+" "" url)
       ":"
       (try
        (let [c (cast HttpURLConnection
                      (.openConnection (URL. url)))]
          (if (= 200 (.getResponseCode c))
            1
            0))
        (catch UnknownHostException _
          0))))

(defn check-urls-from-file [filename]
  (doseq [line (map check-url
                    (read-lines (as-file filename)))]
    (println line)))

Given your example as input:

user> (check-urls-from-file "urls.txt")
www.google.com:1
www.cnn.com:1
www.msnbc.com:1
www.abadurlisnotgood.com:0
Brian Carper
A: 

Instead of pmap, I used agents with send-off in conjunction with the above solution. I think this is better when there is blocking I/O. I believe pmap has limited concurrency too. Here's what I have so far. I wonder how this will scale with thousands of URLs.


(use '(clojure.contrib duck-streams
                       java-utils
                       str-utils))

(import '(java.net URL
                   URLConnection
                   HttpURLConnection
                   UnknownHostException))

(defn check-url [url]
  (str (re-sub #"^(?i)http:/+" "" url)
       ":"
       (try
        (let [c (cast HttpURLConnection
                      (.openConnection (URL. url)))]
          (if (= 200 (.getResponseCode c))
            1
            0))
        (catch UnknownHostException _
          0))))

(def urls (read-lines "urls.txt"))

(def agents (for [url urls] (agent url)))

(doseq [agent agents]
  (send-off agent check-url))

(apply await agents)

(def x '())
(doseq [url (filter deref agents)]
    (def x (cons @url x)))
(prn x)

(shutdown-agents)