views: 132
answers: 4

I am getting a "DOS" instead of the html string ....

open System.Net

let getHtmlBasic (uri : System.Uri) =
    use client = new WebClient()   // disposed when the function returns
    client.DownloadString(uri)


let uri = System.Uri("http://www.b-a-r-f.com/")
getHtmlBasic uri

This gives the string "DOS".

Lol what the ?

All other websites seem to work ...

A: 

It's based on the user agent. If you use a browser user agent, the request will work. That said, it seems to be a pretty clear Keep Out message.
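For example, a minimal sketch of that idea; the User-Agent value here is only an arbitrary browser-like example, not anything special the site expects:

open System.Net

// Sketch only: same download as in the question, but with a browser-like
// User-Agent header so the site doesn't answer with "DOS".
let getHtmlAsBrowser (uri : System.Uri) =
    use client = new WebClient()
    client.Headers.[HttpRequestHeader.UserAgent] <- "Mozilla/5.0"
    client.DownloadString(uri)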

Matthew Flaschen
I'm sorry you told him that.
Developer Art
@Developer Art: Why are you sorry that he tried to help a developer?
Ian Boyd
@Ian, especially considering he just said jlezard is trying to spam us (which implies it's his site).
Matthew Flaschen
Because he told him a way to circumvent site protection.
Developer Art
@Matt I thought he was pointing out how silly eugeneK's assertion was: why would someone spam a four-year-old French website devoted to pets, on an English programming site, using the F# language.
Ian Boyd
I am not trying to spam anybody lol, this is not my website. I am just trying to build a crawler and came across this site where things don't seem to work.
jlezard
@Developer He's not allowed to write his own search engine? Writing a crawler is limited to special people?
Ian Boyd
Thanks for all the replies except Developer Art, who seems bitter about life.
jlezard
A: 

They probably detect automated crawling and send you that response.

Ian Boyd
-1 for copying my deleted answer precisely.
Developer Art
Well it was the correct answer; someone had to put it there. But now that others have given the same answer, I guess they can get credit, and not you - and not me.
Ian Boyd
+1  A: 

I've been testing it myself (not in F#, but that doesn't really matter) and I can confirm that the site reads the User Agent string and, depending on its value, either returns the site contents or the "DOS" text.

Curiously, they provide a feed service (both RSS and Atom) and they also filter out requests to it.

Although the User Agent information can be easily faked, my advice is that you try to get their permission to grab contents; at least from the XML feed!
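As a rough illustration of that test (a sketch only, reusing the uri value from the question; the browser-like User-Agent value is an arbitrary example):

open System.Net

// Sketch: fetch the same page with the default (empty) User-Agent and with a
// browser-like one, then compare what comes back.
let fetchWith (userAgent : string option) (uri : System.Uri) =
    use client = new WebClient()
    userAgent |> Option.iter (fun ua -> client.Headers.[HttpRequestHeader.UserAgent] <- ua)
    client.DownloadString(uri)

let plain   = fetchWith None uri
let browser = fetchWith (Some "Mozilla/5.0") uri
printfn "no UA: %d chars, browser UA: %d chars" plain.Length browser.Length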

Álvaro G. Vicario
I am intending to read a few thousand pages with my little "crawler"; do you think I will encounter a lot of websites like this? Thanks
jlezard
Make sure you know what you are doing before taking third-party sites down or exhausting their bandwidth. Writing a smart crawler is hard. Reading all the terms of use is impossible.
Álvaro G. Vicario
I'll put a little timer in so I don't take down third-party sites. It definitely is much harder than I thought to build a smart crawler, but it's a lot of fun, especially in F# with the asynchronous computations. Thanks for the warnings :)
jlezard
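A rough sketch of that kind of throttled fetching with F# async workflows (not code from the thread; the one-second delay and the page list are arbitrary examples):

open System.Net
open Microsoft.FSharp.Control.WebExtensions   // provides WebClient.AsyncDownloadString

// Sketch: fetch a list of pages one at a time, pausing between requests
// so the target site isn't hammered.
let crawlPolitely (uris : System.Uri list) =
    async {
        use client = new WebClient()
        for uri in uris do
            let! html = client.AsyncDownloadString(uri)
            printfn "%O: %d chars" uri html.Length
            do! Async.Sleep 1000   // the "little timer" between requests
    }
    |> Async.RunSynchronously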
If there's a robots.txt file on the website, you need to respect their wishes. You might want to check for the presence of that file before you crawl the site.
Onorio Catenacci
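A minimal sketch of that check (assumption: it only looks for the file; a real crawler should also parse and honour the Disallow rules inside it):

open System.Net

// Sketch: see whether the site publishes a robots.txt at all before crawling it.
let hasRobotsTxt (siteRoot : System.Uri) =
    try
        use client = new WebClient()
        client.DownloadString(System.Uri(siteRoot, "/robots.txt")) |> ignore
        true
    with :? WebException -> false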
+1  A: 
open System.Net

let req = WebRequest.Create(uri) :?> HttpWebRequest
req.UserAgent <- "Mozilla"
// 'use' is equivalent to 'using' in C# for an IDisposable
use resp = req.GetResponse()
use reader = new System.IO.StreamReader(resp.GetResponseStream())
let html = reader.ReadToEnd()

Thanks to all !!!

jlezard