views: 132
answers: 4

I am getting a "DOS" instead of the html string ....

open System.Net

let getHtmlBasic (uri : System.Uri) =
    use client = new WebClient()   // disposed when the function returns
    client.DownloadString(uri)


let uri = System.Uri("http://www.b-a-r-f.com/")
getHtmlBasic uri

This gives the string "DOS".

Lol what the ?

All other websites seem to work ...

A: 

It's based on the user agent. If you use a browser user agent, the request will work. That said, it seems to be a pretty clear Keep Out message.
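For example, a minimal sketch of that idea; the User-Agent value here is only an arbitrary browser-like example, not anything special the site expects:

open System.Net

// Sketch only: same download as in the question, but with a browser-like
// User-Agent header so the site doesn't answer with "DOS".
let getHtmlAsBrowser (uri : System.Uri) =
    use client = new WebClient()
    client.Headers.[HttpRequestHeader.UserAgent] <- "Mozilla/5.0"
    client.DownloadString(uri)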

Matthew Flaschen
I'm sorry you told him that.
Developer Art
@Developer Art: Why are you sorry that he tried to help a developer?
Ian Boyd
@Ian, especially considering he just said jlezard is trying to spam us (which implies it's his site).
Matthew Flaschen
Because he told him a way to circumvent site protection.
Developer Art
@Matt I thought he was pointing out how silly eugeneK's assertion was: why would someone spam a four-year-old French website devoted to pets, on an English programming site, using the F# language.
Ian Boyd
I am not trying to spam anybody lol, this is not my website. I am just trying to build a crawler and came across this site where things don't seem to work.
jlezard
@Developer He's not allowed to write his own search engine? Writing a crawler is limited to special people?
Ian Boyd
Thanks for all the replies except Developer Art, who seems bitter about life.
jlezard
A: 

They probably detect automated crawling and send you that response.

Ian Boyd
-1 for copying my deleted answer precisely.
Developer Art
Well it was the correct answer; someone had to put it there. But now that others have given the same answer, I guess they can get credit, and not you - and not me.
Ian Boyd
+1  A: 

I've been testing it myself (not in F#, but that doesn't really matter) and I can confirm that the site reads the User Agent string and, depending on its value, either returns the site contents or the "DOS" text.

Curiously, they provide a feed service (both RSS and Atom) and they also filter out requests to it.

Although the User Agent information can be easily faked, my advice is that you try to get their permission to grab contents; at least from the XML feed!
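As a rough illustration of that test (a sketch only, reusing the uri value from the question; the browser-like User-Agent value is an arbitrary example):

open System.Net

// Sketch: fetch the same page with the default (empty) User-Agent and with a
// browser-like one, then compare what comes back.
let fetchWith (userAgent : string option) (uri : System.Uri) =
    use client = new WebClient()
    userAgent |> Option.iter (fun ua -> client.Headers.[HttpRequestHeader.UserAgent] <- ua)
    client.DownloadString(uri)

let plain   = fetchWith None uri
let browser = fetchWith (Some "Mozilla/5.0") uri
printfn "no UA: %d chars, browser UA: %d chars" plain.Length browser.Length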

Álvaro G. Vicario
I am intending to read a few thousand pages with my little "crawler"; do you think I will encounter a lot of websites like this? Thanks
jlezard
Make sure you know what you are doing before taking third-party sites down or exhausting their bandwidth. Writing a smart crawler is hard. Reading all the terms of use is impossible.
Álvaro G. Vicario
I'll put a little timer in so I don't take down third-party sites. It definitely is much harder than I thought to build a smart crawler, but it's a lot of fun, especially in F# with the asynchronous computations. Thanks for the warnings :)
jlezard
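A rough sketch of that kind of throttled fetching with F# async workflows (not code from the thread; the one-second delay and the page list are arbitrary examples):

open System.Net
open Microsoft.FSharp.Control.WebExtensions   // provides WebClient.AsyncDownloadString

// Sketch: fetch a list of pages one at a time, pausing between requests
// so the target site isn't hammered.
let crawlPolitely (uris : System.Uri list) =
    async {
        use client = new WebClient()
        for uri in uris do
            let! html = client.AsyncDownloadString(uri)
            printfn "%O: %d chars" uri html.Length
            do! Async.Sleep 1000   // the "little timer" between requests
    }
    |> Async.RunSynchronously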
If there's a robots.txt file on the website, you need to respect their wishes. You might want to check for the presence of that file before you crawl the site.
Onorio Catenacci
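A minimal sketch of that check (assumption: it only looks for the file; a real crawler should also parse and honour the Disallow rules inside it):

open System.Net

// Sketch: see whether the site publishes a robots.txt at all before crawling it.
let hasRobotsTxt (siteRoot : System.Uri) =
    try
        use client = new WebClient()
        client.DownloadString(System.Uri(siteRoot, "/robots.txt")) |> ignore
        true
    with :? WebException -> false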
+1  A: 
open System.Net

let req = WebRequest.Create(uri) :?> HttpWebRequest
req.UserAgent <- "Mozilla"
// 'use' is equivalent to 'using' in C# for an IDisposable
use resp = req.GetResponse()
use reader = new System.IO.StreamReader(resp.GetResponseStream())
let html = reader.ReadToEnd()

Thanks to all !!!

jlezard