tags:

views: 208

answers: 8

Let's say that we place a file on the web that is publicly accessible if you know the direct URL. There are no links pointing to the file, and directory listings have been disabled on the server as well. So while it is publicly accessible, there is no way to reach the page except by typing the exact URL to this file. What are the chances that a web crawler of any sort (nice or malicious) will be able to locate this file by crawling and then index it?

To me, even though it is publicly accessible, finding the file is going to require luck or specific knowledge. Much like burying gold in my back yard and having someone find it without a map or without knowing something is buried there.

I just can't see any other way it would be discovered, but that's why I'm asking the stackoverflow community.

Thanks.

+1  A: 

Links can occur anywhere - someone could Twitter a link to it, or post it on Facebook, or in a comment on a blog. It only takes one.

If it's vital that it not show up anywhere, put it behind a password.

If it's not vital but you'd still prefer it not be easily accessible via a search engine, use a robots.txt file to block well-behaved crawlers.
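For instance, a minimal robots.txt at the site root could ask compliant crawlers to skip a whole directory rather than naming the exact file (the path here is a placeholder, not from the question):

```
User-agent: *
Disallow: /private/
```

Disallowing the enclosing directory avoids advertising the exact filename to anyone who reads the robots.txt.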

ceejayoz
Wouldn't a robots.txt indicate the URL to badly-behaved crawlers, who otherwise wouldn't ever have found it?
MarkJ
Yes, which is why I said "if it's not vital". Bad crawlers aren't (usually) feeding public-facing search engines, so if search engine indexing is the main concern robots.txt is an acceptable approach.
ceejayoz
+2  A: 

In the past, such hidden locations have been allegedly "found" using the Google Toolbar (and probably other such browser plugins), used by the owner/uploader.

mjy
Very interesting. Can you find a link to more information on this? It is not jumping out at me from a Google search. +1
Copas
http://blog.tmcnet.com/blog/robert-hashemian/google-toolbar-exposing-hidden-web-pages.html
mjy
A: 

You could use the Google Search API to check whether the page has been indexed. For a web page not linked from any other web page, there is no way for a crawler to know about it.

ariso
Uh....... what?
ceejayoz
A: 

Assuming this:

  • Directory listing is disabled.
  • No one knows of the page's existence.
  • Your file doesn't contain any links (otherwise a browser could send the Referer header to the linked site).
  • You have set up robots.txt properly.
  • You trust that no one will spread your link to anyone else.
  • You are lucky.

Well, your page probably won't be found or discovered.

Conclusion?

Use an .htaccess file to protect your data.
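As a sketch, a hypothetical .htaccess for Apache basic authentication might look like this (the password-file path is an assumption; create it with the `htpasswd` utility):

```
AuthType Basic
AuthName "Restricted area"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```

With this in place, even a visitor who knows the exact URL is prompted for credentials before the file is served.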

Boris Guéry
Even if the users don't intend to, there's a very good chance they'll spread the link accidentally.
Matthew Flaschen
Thank you, yes, a good point about .htaccess. No one knows about the file except those who have admin rights to the server, so the address of the page is privileged and confidential info.
+1  A: 

Security through obscurity never works. You say you're not going to link to it, and I believe you. But nothing stops your user from linking to it, intentionally or unintentionally. As ceejayoz indicated, there are so many different places to post links now. And there are even "bookmark synchronizers" that people may think are private but are actually open to the world.

So use real authentication. If you don't you'll regret it later.

Matthew Flaschen
Can't disagree with you here and no one except for those with admin rights to the servers know about the location of this file. Someone is just freaking out about the file being publicly accessible, and I understand that there is concern here, but the person is also being unreasonable and not very rational about the severity of this and the actual likelihood that someone will discover the file.
The presence of the Google Toolbar and similar tools makes it almost certain that someone will take notice of your 'obscure' URL.
Javier
If only admins have access, can't you just put it on a localhost only HTTP virtual host and make them ssh in then use the local browser?
Matthew Flaschen
A: 

You are correct. Web crawlers are, metaphorically, spiders - they need to have a way to traverse the web (hyperlinks) and arrive at your page.
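The traversal model above can be sketched as a minimal breadth-first crawler (the tiny simulated site and the regex-based link extraction are illustrative assumptions, not how any particular search engine works):

```python
import re
from collections import deque

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: only pages reachable via links from the
    seeds are ever visited, so an unlinked page is never found."""
    frontier = deque(seed_urls)   # the "crawl frontier"
    seen = set(seed_urls)
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        html = fetch(url)         # fetch() is supplied by the caller
        # naive href extraction; real crawlers use an HTML parser
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen

# Tiny simulated web: "/secret" exists but nothing links to it.
pages = {
    "/": '<a href="/a">a</a>',
    "/a": '<a href="/">home</a>',
    "/secret": "hidden content",
}
found = crawl(["/"], lambda u: pages.get(u, ""))
# "/secret" is never visited because no crawled page links to it
```

The sketch makes the asker's intuition concrete: the crawler's reach is exactly the transitive closure of the link graph from its seeds, and an unlinked URL sits outside it.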

To get your hypothetical page into a search engine's results, you must manually submit its URL to the search engine. There are multiple services for submitting your page to these search engines. See "submitting URLs to search engines"
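One common submission mechanism is a sitemap file; a minimal sitemap.xml (the URL is a placeholder) looks like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/hidden-page.html</loc>
  </url>
</urlset>
```

Of course, submitting a sitemap is exactly what you would *not* do for a page you want kept out of search indexes.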

Also, your page will only appear if the search engine determines that your page has enough metadata/karma within the search engine's proprietary ranking system. See "SEO" and "meta keywords".

Jeff Meatball Yang
You don't have to manually submit the URL for it to show up in results. If you click a link on the page to another server that displays recent referrers, Google could pick that up. If a friend posts the link to Twitter, Google could pick that up.
ceejayoz
A: 

Yes, you are right. A web crawler visits URLs, identifies all the hyperlinks in each page, and adds them to the list of URLs to visit, called the crawl frontier. But some of those hyperlinks are bad links. Once users click a bad link and land on a malware site, they're often prompted with a fake codec installation dialog. If that doesn't get them, the site is still loaded with dozens of other tactics to infect their computer: fake toolbars, scareware, rogue software, and more. One site that researchers came across even tried to install 25 different pieces of malware. Such sites leave people vulnerable to installations of spam bots, rootkits, password stealers, and an assortment of Trojan horses, amongst other things.

A: 

Purchased/sold clickstream data may result in otherwise unlinked content discovery: http://en.wikipedia.org/wiki/Clickstream

arachnode.net