tags:

views: 208

answers: 8

Let's say that we place a file on the web that is publicly accessible if you know the direct URL. There are no links pointing to the file, and directory listings have been disabled on the server as well. So while it is publicly accessible, there is no way to reach the page except by typing the exact URL to this file. What are the chances that a web crawler of any sort (nice or malicious) will be able to locate this file by crawling and then index it?

To me, even though it is publicly accessible, finding the file is going to require luck or specific knowledge. Much like burying gold in my back yard and having someone find it without a map or without knowing something is buried there.

I just can't see any other way it would be discovered, but that's why I'm asking the stackoverflow community.

Thanks.

+1  A: 

Links can occur anywhere - someone could Twitter a link to it, or post it on Facebook, or in a comment on a blog. It only takes one.

If it's vital that it not show up anywhere, put it behind a password.

If it's not vital but you'd still prefer it not be easily accessible via a search engine, use a robots.txt file to block well-behaved crawlers.
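For instance, a minimal robots.txt at the site root could ask compliant crawlers to skip a whole directory rather than naming the exact file (the path here is a placeholder, not from the question):

```
User-agent: *
Disallow: /private/
```

Disallowing the enclosing directory avoids advertising the exact filename to anyone who reads the robots.txt.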

ceejayoz
Wouldn't a robots.txt indicate the URL to badly-behaved crawlers, who otherwise wouldn't ever have found it?
MarkJ
Yes, which is why I said "if it's not vital". Bad crawlers aren't (usually) feeding public-facing search engines, so if search engine indexing is the main concern robots.txt is an acceptable approach.
ceejayoz
+2  A: 

In the past, such hidden locations have been allegedly "found" using the Google Toolbar (and probably other such browser plugins), used by the owner/uploader.

mjy
Very interesting. Can you find a link to more information on this? It is not jumping out at me from a Google search. +1
Copas
http://blog.tmcnet.com/blog/robert-hashemian/google-toolbar-exposing-hidden-web-pages.html
mjy
A: 

You could use the Google Search API to check whether the page has been indexed. For a web page not linked from any other web page, there is no way for a crawler to know about it.

ariso
Uh....... what?
ceejayoz
A: 

Assuming this:

  • Directory listing is disabled.
  • No one knows of the page's existence.
  • Your file doesn't contain any links (otherwise a browser could send the Referer header to the linked site).
  • You have set up robots.txt properly.
  • You trust that no one will spread your link to anyone else.
  • You are lucky.

Well, your page probably won't be found or discovered.

Conclusion?

Use an .htaccess file to protect your data.
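As a sketch, a hypothetical .htaccess for Apache basic authentication might look like this (the password-file path is an assumption; create it with the `htpasswd` utility):

```
AuthType Basic
AuthName "Restricted area"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```

With this in place, even a visitor who knows the exact URL is prompted for credentials before the file is served.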

Boris Guéry
Even if the users don't intend to, there's a very good chance they'll spread the link accidentally.
Matthew Flaschen
Thank you, yes, a good point about .htaccess. No one knows about the file except those who have admin rights to the server, so the address of the page is privileged and confidential info.
+1  A: 

Security through obscurity never works. You say you're not going to link to it, and I believe you. But nothing stops your user from linking to it, intentionally or unintentionally. As ceejayoz indicated, there are so many different places to post links now. And there are even "bookmark synchronizers" that people may think are private but are actually open to the world.

So use real authentication. If you don't you'll regret it later.

Matthew Flaschen
Can't disagree with you here and no one except for those with admin rights to the servers know about the location of this file. Someone is just freaking out about the file being publicly accessible, and I understand that there is concern here, but the person is also being unreasonable and not very rational about the severity of this and the actual likelihood that someone will discover the file.
The presence of the Google Toolbar and similar tools makes it almost certain that someone will take notice of your 'obscure' URL.
Javier
If only admins have access, can't you just put it on a localhost only HTTP virtual host and make them ssh in then use the local browser?
Matthew Flaschen
A: 

You are correct. Web crawlers are, metaphorically, spiders - they need to have a way to traverse the web (hyperlinks) and arrive at your page.
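The traversal model above can be sketched as a minimal breadth-first crawler (the tiny simulated site and the regex-based link extraction are illustrative assumptions, not how any particular search engine works):

```python
import re
from collections import deque

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: only pages reachable via links from the
    seeds are ever visited, so an unlinked page is never found."""
    frontier = deque(seed_urls)   # the "crawl frontier"
    seen = set(seed_urls)
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        html = fetch(url)         # fetch() is supplied by the caller
        # naive href extraction; real crawlers use an HTML parser
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen

# Tiny simulated web: "/secret" exists but nothing links to it.
pages = {
    "/": '<a href="/a">a</a>',
    "/a": '<a href="/">home</a>',
    "/secret": "hidden content",
}
found = crawl(["/"], lambda u: pages.get(u, ""))
# "/secret" is never visited because no crawled page links to it
```

The sketch makes the asker's intuition concrete: the crawler's reach is exactly the transitive closure of the link graph from its seeds, and an unlinked URL sits outside it.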

To get your hypothetical page into a search engine's results, you must manually submit its URL to the search engine. There are multiple services for submitting your page to these search engines. See "submitting URLs to search engines"
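One common submission mechanism is a sitemap file; a minimal sitemap.xml (the URL is a placeholder) looks like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/hidden-page.html</loc>
  </url>
</urlset>
```

Of course, submitting a sitemap is exactly what you would *not* do for a page you want kept out of search indexes.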

Also, your page will only appear if the search engine determines that your page has enough metadata/karma within the search engine's proprietary ranking system. See "SEO" and "meta keywords".

Jeff Meatball Yang
You don't have to manually submit the URL for it to show up in results. If you click a link on the page to another server that displays recent referrers, Google could pick that up. If a friend posts the link to Twitter, Google could pick that up.
ceejayoz
A: 

Yes, you are right. A web crawler visits URLs, identifies all the hyperlinks in each page, and adds them to the list of URLs to visit, called the crawl frontier. But some of those hyperlinks are bad links. Once users click a bad link and land on a malware site, they're often prompted with a fake codec installation dialog. If that doesn't get them, the site is still loaded with dozens of other tactics to infect their computer: fake toolbars, scareware, rogue software, and more. One site that researchers came across even tried to install 25 different pieces of malware. Such sites leave people vulnerable to installations of spam bots, rootkits, password stealers, and an assortment of Trojan horses, amongst other things.

A: 

Purchased/sold clickstream data may result in otherwise unlinked content discovery: http://en.wikipedia.org/wiki/Clickstream

arachnode.net