views: 103
answers: 4
I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet, and the bandwidth required would be huge. The other thought would be to use Google's Search API or Yahoo's Search API, but that's not really a great solution, as it requires a search to be performed before I get results.

Other thoughts include querying DNS servers and requesting a list of URLs, but DNS servers can limit/throttle my requests or even ban me altogether. My knowledge of querying DNS servers is quite limited at the moment, so I don't know if this is the best method or not.

I just want a massive list of URLs, but I want to build this list without running into brick walls in the future. Any thoughts?

I'm starting this project to learn Python, but that really has nothing to do with the question.

+2  A: 
$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
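A rough Python equivalent, in case you'd rather fetch it from the script you're writing - this is only a sketch, and it assumes the archive is still hosted at that address and contains a single file named top-1m.csv:

import io
import urllib.request
import zipfile

# Fetch the Alexa top-1m archive into memory and read the CSV inside it.
ALEXA_ZIP = "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip"
data = urllib.request.urlopen(ALEXA_ZIP).read()
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    with archive.open("top-1m.csv") as csv_file:
        for line in csv_file:
            rank, domain = line.decode("utf-8").strip().split(",", 1)
            print(rank, domain)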
The MYYN
While this is a substantial amount, it doesn't give me the potential to reach my end goal of 99% of the Internet's URLs. But thanks a lot!
Dallas Clark
A: 

How big is massive? A good place to start is http://www.alexa.com/topsites. They offer a download of the top 1,000,000 sites (by their ranking mechanism). You could then expand this list by going to Google and scraping the results of the query link:url for each URL in the list.
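A minimal Python sketch of that expansion step, assuming the top-1m.csv file from the archive above is already unzipped on disk; it only builds the Google link: query URLs - actually scraping the result pages is left out, and Google may throttle or block automated queries:

import csv
import urllib.parse

# For every seed domain in the Alexa list, build the Google "link:" query
# you would scrape next to discover pages that link to it.
with open("top-1m.csv", newline="") as f:
    for rank, domain in csv.reader(f):
        query = urllib.parse.urlencode({"q": "link:" + domain})
        print("http://www.google.com/search?" + query)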

Dathan
I like your idea, but this doesn't promise me every URL available.
Dallas Clark
Greedy, much? Nothing promises every URL available. I have private URLs on my personal website that aren't linked to by any site on the 'Net; how would you discover those URLs? This is just a good starting point - you'll need to employ some ingenuity and elbow grease to build your collection from there.
Dathan
@Dathan: true, but a lot of people have already done the work (like Google), so why reinvent the wheel? If there isn't an appropriate solution, then I might have to crawl the Internet with my own bot.
Dallas Clark
I've had to create a bot to crawl the Internet, and I'm also asking every DNS server I find for a list of sites (if allowed). My URL collection is slowly building, and I might have to create a cluster of database servers to get a good sample.
Dallas Clark
A: 

The modern terms now are URI and URN; URL is the older, narrower term. I'd scan for sitemap files, which list many addresses in a single file, and study the classic texts on spiders, wanderers, brokers and bots, as well as RFC 3986 (Appendix B, p. 50), which defines a regex for parsing URIs.
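To illustrate the sitemap idea, here is a minimal Python sketch; it assumes the site publishes a plain, uncompressed sitemap at the conventional /sitemap.xml path (the site used here is just a placeholder), and it doesn't handle sitemap index files:

import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol for <urlset>/<url>/<loc> elements.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(site):
    # Fetch /sitemap.xml for the site and return every <loc> entry it lists.
    xml = urllib.request.urlopen(site.rstrip("/") + "/sitemap.xml").read()
    root = ET.fromstring(xml)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

for url in sitemap_urls("http://www.example.com"):  # placeholder site
    print(url)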

LarsOn
Check your definitions; I'm after web sites, after all. ;) I've looked into crawling, but I don't have the resources or bandwidth available to do a serious job.
Dallas Clark
define "definition"
LarsOn
From IBM: Uniform Resource Identifier (URI): A unique address that is used to identify content on the Web, such as a page of text, a video or sound clip, a still or animated image, or a program. The most common form of URI is the Web page address, which is a particular form or subset of URI called a Uniform Resource Locator (URL). A URI typically describes how to access the resource, the computer that contains the resource, and the name of the resource (a file name) on the computer. ... I don't want FTP, SMTP and so on.
Dallas Clark
A: 

You can register to get access to the entire .com and .net zone files at Verisign.

I haven't read the fine print for terms of use, nor do I know how much (if anything) it costs. However, that would give you a huge list of active domains to use as URLs.
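If you go that route, a rough Python sketch for turning a downloaded zone file into URLs might look like this; the file name com.zone and the exact record layout are assumptions, and the real files need more careful parsing:

def domains_from_zone(path, tld="com"):
    # Pull unique second-level domains out of a BIND-style TLD zone file,
    # where delegation lines look roughly like:
    #   EXAMPLE NS NS1.EXAMPLE-DNS.COM.
    # Real zone files vary (TTLs, classes, continuation lines), so treat
    # this as a starting point rather than a robust parser.
    seen = set()
    with open(path) as zone:
        for line in zone:
            fields = line.split()
            if len(fields) >= 3 and fields[1].upper() == "NS":
                seen.add(fields[0].lower().rstrip(".") + "." + tld)
    return sorted(seen)

for domain in domains_from_zone("com.zone"):  # hypothetical downloaded file
    print("http://" + domain + "/")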

Geoff Fritz
Awesome, this helps me with .com, .net, and .name. I will have to look into other countries' TLDs.
Dallas Clark