Hey,

I don't know much about SEO or how web spiders work, so forgive my ignorance here. I'm creating a site (using ASP.NET MVC) with areas that display information retrieved from the database. The data is unique to each user, so there's no real server-side output caching going on. However, since the data can contain things a user may not wish to have shown in search engine results, I'd like to prevent spiders from accessing the search results page. Are there any special actions I should take to ensure the search result directory isn't crawled? Also, would a spider even crawl a dynamically generated page, and would blocking certain directories from being crawled mess up my search engine rankings?

edit: I should add that I'm reading up on the robots.txt protocol, but it relies on co-operation from the web crawler. I'd also like to stop data-mining users who will simply ignore the robots.txt file.

I appreciate any help!

+1  A: 

Check out the Robots exclusion standard. It works through a text file (robots.txt) that you put at the root of your site, telling bots what they can and can't index. You will also want to address what happens if a bot doesn't honour the robots.txt file.
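As a concrete illustration, a minimal robots.txt placed at the root of the site might look like the following (the /Search/ path is an assumed example, not something from the question):

    User-agent: *
    Disallow: /Search/

Compliant crawlers such as Googlebot will skip anything under /Search/, but nothing enforces it; a scraper is free to ignore the file entirely.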

Matthew Lock
I literally just updated my post to include the robots.txt info at the same time as your post :)
Skoder
+2  A: 

You can prevent some malicious clients from hitting your server too heavily by implementing throttling on the server. "Sorry, your IP has made too many requests to this server in the past few minutes. Please try again later." In practice, though, assume that you can't stop a truly malicious user from bypassing any throttling mechanisms that you put in place.
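As a rough sketch of what such throttling could look like in ASP.NET MVC (the attribute name, limit, and window here are assumptions for illustration, not a prescribed implementation):

    using System;
    using System.Runtime.Caching;
    using System.Threading;
    using System.Web.Mvc;

    // Per-IP throttle: allow at most RequestLimit requests per client IP
    // within a fixed window, then answer 429 until the window expires.
    public class ThrottleAttribute : ActionFilterAttribute
    {
        private const int RequestLimit = 30;                        // assumed limit
        private static readonly TimeSpan Window = TimeSpan.FromMinutes(1);

        private class Counter { public int Hits; }

        public override void OnActionExecuting(ActionExecutingContext filterContext)
        {
            string key = "throttle:" + filterContext.HttpContext.Request.UserHostAddress;

            // AddOrGetExisting returns null when it inserts the fresh entry,
            // otherwise the counter already cached for this IP.
            var fresh = new Counter();
            var counter = (Counter)MemoryCache.Default.AddOrGetExisting(
                key, fresh, DateTimeOffset.UtcNow.Add(Window)) ?? fresh;

            if (Interlocked.Increment(ref counter.Hits) > RequestLimit)
            {
                // The two-argument constructor needs MVC 4+.
                filterContext.Result = new HttpStatusCodeResult(
                    429, "Too many requests; please try again later.");
            }
        }
    }

Decorating an action with [Throttle] then applies the limit to that endpoint; as noted above, treat this as a speed bump rather than a guarantee.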

Given that, here's the more important question:

Are you comfortable with the information that you're making available for all the world to see? Are your users comfortable with this?

If the answer to those questions is no, then you should be ensuring that only authorized users are able to see the sensitive information. If the information isn't particularly sensitive but you don't want clients crawling it, throttling is probably a good alternative. Is it even likely that you're going to be crawled anyway? If not, robots.txt should be just fine.

Levi
Thanks for the reply. It's up to the user what information they decide to reveal (knowing full well it's public), and it's also up to the user to add a password if they want to keep it hidden. I would like the site to be crawled in order to advertise the service; I just don't want the user data to be indexed.
Skoder
+2  A: 

It seems like you have two issues.

The first is a concern about certain data appearing in search results; the second is about malicious or unscrupulous users harvesting user-related data.

The first issue will be covered by appropriate use of a robots.txt file, as all the big search engines honour it.
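Since the comments mention wanting the site itself crawled while keeping the user data out of the index, the standard X-Robots-Tag response header is worth knowing about alongside robots.txt. A hedged sketch in ASP.NET MVC; the action and helper names are assumed, not from this thread:

    // Mark a response as non-indexable via the X-Robots-Tag header.
    // Major engines honour it, but like robots.txt it is advisory only.
    public ActionResult UserResults(int id)           // hypothetical action
    {
        Response.AddHeader("X-Robots-Tag", "noindex, nofollow");
        return View(LoadResultsFor(id));              // LoadResultsFor is hypothetical
    }

One subtlety: a page disallowed in robots.txt is never fetched at all, so crawlers only see this header on pages they are allowed to crawl.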

The second issue seems more to do with data privacy. The first question which immediately springs to mind is: if there is user information which people may not want displayed, why are you making it available at all?

  • What is the privacy policy for such data?
  • Do users have the ability to control what information is made available?
  • If the information is potentially sensitive but important to the system, could it be restricted so it is only available to logged-in users?

Matt Lacey
The user can choose what information is shown, and they can password-protect their results if they choose to (similar to Twitter). Even in protected areas, couldn't a bot register an account, sign in, perform the search, and cache the result? Naturally, no big-name search engine would, but maybe malicious ones? I'll protect against that with IP scanning, but I'm curious whether that's even possible.
Skoder
@Skoder If any logged in user could be a bot, you need to look at behaviour patterns and probably implement some form of CAPTCHA to prevent further "browsing" until you can confirm what they're doing. In reality if someone can sign up to the site and browse the data, there's no way to guarantee that it can't ever be saved/cached/etc.
Matt Lacey
@Matt: That's a fair point. The data isn't super-sensitive or non-public knowledge, but I still like to hold privacy to a high standard (if just as good practice). Thanks for the help.
Skoder
+1  A: 

Use a robots.txt file, as mentioned. If that is not enough, then you can:

  • Block unknown user agents (see the sketch below) - hard to maintain, and easy for a bot to forge a browser's user agent (although most legitimate bots won't)
  • Block unknown IP addresses - not useful for a public site
  • Require logins
  • Throttle user connections - tricky to tune, and you will still be disclosing information

Perhaps use a combination. Either way it's a trade-off: if the public can browse to it, so can a bot. Be sure you don't block and alienate real users in your attempts to block bots.
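As a rough sketch of the user-agent idea from the first bullet (ASP.NET MVC; the substring list is an assumed example, and as noted a bot can trivially forge its agent string):

    using System;
    using System.Linq;
    using System.Web.Mvc;

    // Rejects requests whose self-reported User-Agent contains a known bot
    // fragment. Easy to bypass and hard to maintain; defence in depth only.
    public class BlockBotsAttribute : ActionFilterAttribute
    {
        private static readonly string[] BotFragments = { "bot", "crawler", "spider" };

        public override void OnActionExecuting(ActionExecutingContext filterContext)
        {
            string agent = filterContext.HttpContext.Request.UserAgent ?? "";
            if (BotFragments.Any(f =>
                    agent.IndexOf(f, StringComparison.OrdinalIgnoreCase) >= 0))
            {
                filterContext.Result = new HttpStatusCodeResult(403);  // Forbidden
            }
        }
    }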

Mobs
Matt Lacey
which basically sums up all of my points...
Mobs
A: 

A few options:

  • Force the user to log in to view the content (see the sketch below)
  • Add a CAPTCHA page before the content
  • Embed the content in Flash
  • Load it dynamically with JavaScript
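For the first option, ASP.NET MVC's built-in [Authorize] attribute is the standard mechanism; a minimal sketch with assumed controller and action names:

    using System.Web.Mvc;

    public class ResultsController : Controller       // hypothetical controller
    {
        // With forms authentication configured, anonymous requests are
        // redirected to the login page instead of seeing the content.
        [Authorize]
        public ActionResult Index(string query)
        {
            // Only authenticated users (including any bot that registers an
            // account, as discussed above) ever reach this point.
            return View();
        }
    }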
Plumo