views: 67
answers: 4

I need to create a content inventory for a website that doesn't have a sitemap. I don't have access to modify the website, and the site is very large. How can I build a sitemap of that website without having to browse it entirely? I tried Visio's sitemap builder, but it fails badly. Let's say, for example, I want to create a sitemap of Stack Overflow. Do you guys know of software that can build it?

A: 

I would register all your pages in a database and then just output them all on a page (PHP + SQL). Maybe indexing software could help you too! Either way, make sure all your pages are linked up and still submit the site to Google!
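
Roughly what that could look like, shown here as a Python + SQLite sketch instead of PHP + SQL (the "pages" table and its columns are made up for illustration):

    import sqlite3

    # Pull every registered page out of the database; "pages" is an assumed table.
    conn = sqlite3.connect("site.db")
    rows = conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
    conn.close()

    # Output them all on one page: a plain HTML list of links (an HTML sitemap).
    with open("sitemap.html", "w", encoding="utf-8") as f:
        f.write("<ul>\n")
        for url, title in rows:
            f.write('  <li><a href="{0}">{1}</a></li>\n'.format(url, title or url))
        f.write("</ul>\n")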

Kyle
A: 

Just googled and found this one. http://www.xml-sitemaps.com/ Looks pretty interesting!

Sands
This is the de facto site that I use for indexing; however, it is limited to 500 pages per site. You can get around this limit by indexing specific subdirectories and then manually combining the results...
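
The combining step could also be scripted rather than done by hand; here is a rough Python sketch, assuming each subdirectory crawl produced its own sitemap file (the file names are just examples):

    # Merge several per-subdirectory sitemaps into one file, dropping duplicates.
    import xml.etree.ElementTree as ET

    NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
    ET.register_namespace("", NS)

    combined = ET.Element("{%s}urlset" % NS)
    seen = set()
    for path in ["sitemap-blog.xml", "sitemap-docs.xml", "sitemap-shop.xml"]:
        for url in ET.parse(path).getroot().findall("{%s}url" % NS):
            loc = url.findtext("{%s}loc" % NS)
            if loc not in seen:           # skip URLs that appear in more than one partial map
                seen.add(loc)
                combined.append(url)

    ET.ElementTree(combined).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
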
rtpHarry
+1  A: 

You would have to browse it entirely: search every page for unique links within the site and then put them in an index.

Also, for each unique link you find within the site, you then need to visit that page and search for more unique links.

You could use a tool such as HtmlAgilityPack to easily grab the pages and extract the links from them.
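
HtmlAgilityPack is a .NET library; the same crawl-and-extract loop, sketched here in Python with requests and BeautifulSoup standing in for it (the start URL is just a placeholder):

    # Breadth-first crawl: fetch a page, pull out same-site links, queue the new ones.
    import requests
    from bs4 import BeautifulSoup
    from collections import deque
    from urllib.parse import urljoin, urldefrag, urlparse

    start = "https://example.com/"            # placeholder start page
    site = urlparse(start).netloc
    seen, queue = {start}, deque([start])

    while queue:
        page = queue.popleft()
        try:
            html = requests.get(page, timeout=10).text
        except requests.RequestException:
            continue                           # skip pages that fail to load
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            url = urldefrag(urljoin(page, a["href"]))[0]   # absolute URL, no #fragment
            if urlparse(url).netloc == site and url not in seen:
                seen.add(url)
                queue.append(url)

    print("\n".join(sorted(seen)))             # the resulting index of unique URLs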

I have written an article that touches on the link-extraction part of the problem:

rtpHarry
A: 

There is a pretty big collection of XML Sitemap generators (assuming that's what you want to generate, and not an HTML sitemap page or something else?) at http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators

In general, for any larger site, the best solution is really to grab the information directly from the source, for example from the database that powers the site. That way you get the most accurate and up-to-date Sitemap file. If you have to crawl the site to get the URLs for a Sitemap file, it will take quite some time for a larger site, and it will load the server during that time (it's like someone visiting every page on your site). Crawling the site from time to time to determine if there are crawlability issues (such as endless calendars, content hidden behind forms, etc.) is a good idea, but if you can, it's generally better to get the URLs for the Sitemap file directly.
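
If you do have access to the database, generating the Sitemap file directly is only a short script. A rough sketch in Python (the database file, query, and column names are assumptions for illustration):

    # Build an XML Sitemap straight from the database that powers the site,
    # so the file is as accurate and up to date as the data itself.
    import sqlite3
    import xml.etree.ElementTree as ET

    NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
    ET.register_namespace("", NS)

    urlset = ET.Element("{%s}urlset" % NS)
    conn = sqlite3.connect("site.db")          # stand-in for the real site database
    for loc, lastmod in conn.execute("SELECT url, updated_at FROM pages"):
        url = ET.SubElement(urlset, "{%s}url" % NS)
        ET.SubElement(url, "{%s}loc" % NS).text = loc
        if lastmod:
            ET.SubElement(url, "{%s}lastmod" % NS).text = str(lastmod)
    conn.close()

    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)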

John Mueller
I'm not the web designer for this website, and I do not have access to the website itself, so I can't try to generate an XML map. I'm just doing a content inventory for a proposal on information architecture, which is why I'm starting with the content inventory. If I do this manually I'll never finish. So I'd like to know if there's software available out there that can crawl an entire website and give me a report of all the links it followed (with some good sense of hierarchy). Let's say I go to Microsoft.com and I want a sitemap, but on my computer, so I can see how many pages they have.
UXdesigner
There are some tools that generate graphic maps of websites based on crawling, but I do not have any experience with them, so I can't recommend any. For a lower-level view, you could use Xenu's Link Sleuth or even "wget" to crawl the whole website and then categorize the content based on the URL. These tools will, however, discover all content on the website (including images, scripts, etc.), so you may want to try them out on a smaller website first to see if you can work with their output.
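
The "categorize the content based on the URL" step can be scripted once the crawl has produced a plain list of URLs (the input file name below is hypothetical); grouping by the first path segment gives a rough hierarchy:

    # Group a crawled URL list (one URL per line, e.g. exported from Xenu or
    # collected from a wget log) by its first path segment.
    from collections import defaultdict
    from urllib.parse import urlparse

    sections = defaultdict(list)
    with open("crawled-urls.txt", encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            top = urlparse(url).path.strip("/").split("/")[0] or "(home)"
            sections[top].append(url)

    # Print the biggest sections first: a first pass at the content inventory.
    for section, urls in sorted(sections.items(), key=lambda kv: -len(kv[1])):
        print("%-20s %d pages" % (section, len(urls)))
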
John Mueller