tags:
views: 110
answers: 1

When crawling web pages, I need to be careful not to make too many requests to the same domain; for example, I want to put 1 s between requests. From what I understand, it is the time between requests that matters. So, to speed things up, I want to use async workflows in F#: the idea is to make the requests at 1-second intervals, but to avoid blocking while waiting for a response.

let getHtmlPrimitiveAsyncTimer (uri : System.Uri) (timer : int) =
    async {
        let req = WebRequest.Create(uri) :?> HttpWebRequest
        req.UserAgent <- "Mozilla"
        try
            Thread.Sleep(timer)
            let! resp = req.AsyncGetResponse()
            Console.WriteLine(uri.AbsoluteUri + " got response")
            use stream = resp.GetResponseStream()
            use reader = new StreamReader(stream)
            let html = reader.ReadToEnd()
            return html
        with
        | _ as ex -> return "Bad Link"
    }

Then I do something like:

let uri1 = System.Uri "http://rue89.com"
let timer = 1000
let jobs = [|for i in 1..10 -> getHtmlPrimitiveAsyncTimer uri1 timer|]

jobs
|> Array.mapi (fun i job ->
    Console.WriteLine("Starting job " + string i)
    Async.StartAsTask(job).Result)

Is this alright? I am very unsure about two things:

- Does the Thread.Sleep thing work for delaying the request?
- Is using StartAsTask a problem?

I am a beginner (as you may have noticed) in F# (and in coding in general, actually), and everything involving threads scares me :)

Thanks !!

+3  A: 

I think what you want to do is:

- create 10 jobs, numbered 'n', each starting 'n' seconds from now
- run those all in parallel

Approximately like

let makeAsync uri n = async {
    // create the request
    do! Async.Sleep(n * 1000)
    // AsyncGetResponse etc
    }

let a = [| for i in 1..10 -> makeAsync uri i |]
let results = a |> Async.Parallel |> Async.RunSynchronously

Note that of course they won't all start exactly now; if e.g. you have a 4-core machine, 4 will start running very soon, but then quickly execute up to the Async.Sleep, at which point the next 4 will run up until their sleeps, and so forth. Then, one second from now, the first async wakes up and posts a request; another second later, the 2nd async wakes up, and so on, so this should work. The 1 s is only approximate, since their timers each start a tiny bit staggered from one another. You may want to buffer it a little, e.g. 1100 ms or so, if the cut-off you need is really exactly a second (network latency and whatnot probably leave a bit of this outside your program's control anyway).
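Filling in the commented placeholders, the sketch might be fleshed out like this (a hedged sketch: it reuses the WebRequest/AsyncGetResponse calls and the "Bad Link" error handling from the question's own code, and the target URI is just an example):

```fsharp
open System
open System.IO
open System.Net

// One async per URI, delayed by n seconds before issuing its request.
let makeAsync (uri : Uri) n = async {
    let req = WebRequest.Create(uri) :?> HttpWebRequest
    req.UserAgent <- "Mozilla"
    do! Async.Sleep(n * 1000)   // non-blocking delay; no thread is held while waiting
    try
        use! resp = req.AsyncGetResponse()
        use stream = resp.GetResponseStream()
        use reader = new StreamReader(stream)
        return reader.ReadToEnd()
    with _ -> return "Bad Link"
}

let uri = Uri "http://rue89.com"

// Kick off all ten; Async.Parallel does the fork-join for you.
let results =
    [| for i in 1..10 -> makeAsync uri i |]
    |> Async.Parallel
    |> Async.RunSynchronously
```

The key difference from the question's version is `do! Async.Sleep` instead of `Thread.Sleep`: the delay is registered with a timer and the thread is released back to the pool while waiting.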

Thread.Sleep is suboptimal; it will work ok for a small number of requests, but you're burning a thread, and threads are expensive, so it won't scale to a large number.

You don't need StartAsTask unless you want to interoperate with .NET Tasks or later do a blocking rendezvous with the result via .Result. If you just want these to all run and then block to collect all the results in an array, Async.Parallel will do that fork-join parallelism for you just fine. If they're just going to print results, you can fire-and-forget via Async.Start which will drop the results on the floor.

(An alternative strategy is to use an agent as a throttle. Post all the http requests to a single agent, where the agent is logically single-threaded and sits in a loop, doing Async.Sleep for 1s, and then handling the next request. That's a nice way to make a general-purpose throttle... may be blog-worthy for me, come to think of it.)
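A minimal sketch of that agent idea might look as follows (a sketch, not a definitive implementation: the message type, the `throttle` name, and the inlined download code are all illustrative assumptions):

```fsharp
open System
open System.Net

// A logically single-threaded agent: each message is a URI plus a reply
// channel; the agent sleeps 1 s between launching successive requests.
let throttle =
    MailboxProcessor<Uri * AsyncReplyChannel<string>>.Start(fun inbox ->
        let rec loop () = async {
            let! (uri, reply) = inbox.Receive()
            do! Async.Sleep 1000          // the throttle: 1 s between launches
            // Fire the actual download without blocking the agent loop.
            async {
                try
                    let req = WebRequest.Create uri :?> HttpWebRequest
                    use! resp = req.AsyncGetResponse()
                    use reader = new IO.StreamReader(resp.GetResponseStream())
                    reply.Reply(reader.ReadToEnd())
                with _ -> reply.Reply "Bad Link"
            } |> Async.Start
            return! loop () }
        loop ())

// Usage: callers post from anywhere; the agent serializes the pacing.
// let html = throttle.PostAndReply(fun ch -> Uri "http://rue89.com", ch)
```

Because the agent body is the only place that decides when a request launches, callers can post from many threads at once and the 1 s spacing still holds.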

Brian
Hehe! Very, very nice, thank you. F# is truly amazing; I love that "Async.Parallel will do that for you" :) For beginners like me, it lets you worry about getting the code right. Thank you!
jlezard
Progress has been made: http://stackoverflow.com/questions/3023153/asynchronous-webcrawling-f-something-wrong
jlezard
Actually Brian, I am not sure this solution works for a large number of URIs, or does it?
jlezard
See my response there; the problem there is different from the problem here, since here you know all the URIs a priori and kick off everything at once, whereas there you discover new URIs during the program run.
Brian
Yes, thank you, I saw it; working on it :) For this post, is it no problem to launch |> Async.Parallel |> Async.RunSynchronously on a very large seq of Async elements?
jlezard
Right; I think this would be fine for tens of thousands of elements (though I haven't actually tried it); my hunch is the first limit would be the array of results (e.g. if you save the html string of each page in the results array, ensure you don't use up all the process memory storing all those big strings).
Brian