I essentially want to spider my local site and create a list of all the titles and URLs as in:

http://localhost/mySite/Default.aspx      My Home Page
http://localhost/mySite/Preferences.aspx  My Preferences
http://localhost/mySite/Messages.aspx     Messages

I'm running Windows. I'm open to anything that works: a C# console app, PowerShell, some existing tool, etc. We can assume that the <title> tag does exist in the document.

Note: I need to actually spider the files since the title may be set in code rather than markup.

A: 

Ok, I'm not familiar with Windows, but to get you in the right direction: use an XSLT transformation with

<xsl:value-of select="/html/head/title" />

in there to get the title back, or, if you can, evaluate the XPath expression /html/head/title directly.
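
For example, newer builds of xmllint (part of libxml2, installable through Cygwin) can evaluate that XPath straight from the shell. A minimal sketch, assuming the pages parse cleanly under its HTML mode:

# --html enables the tolerant HTML parser; --xpath evaluates the expression.
xmllint --html --xpath '//head/title/text()' Default.aspx

Given the asker's note, this should be run against pages fetched from the server rather than the .aspx source files, since titles set in code only show up in the rendered output.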

Roalt
+3  A: 

A quick and dirty Cygwin Bash script which does the job:

#!/bin/bash
# For each .aspx file, print "path<TAB>title" on one line.
for file in $(find "$WWWROOT" -iname '*.aspx'); do
  echo -en "$file\t"
  # Join all lines, then capture the text between <title> and </title>.
  # (Plain sed here: -i is for in-place file edits and breaks on a pipe.)
  tr '\n' ' ' < "$file" | sed 's/.*<title>\([^<]*\)<\/title>.*/\1/'
  echo  # sed's output lost its trailing newline to tr, so add one back
done

Explanation: this finds every .aspx file under the root directory $WWWROOT, replaces all newlines with spaces so that there are no newlines between the <title> and </title>, and then grabs out the text between those tags.
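
A usage sketch, assuming the script is saved as titles.sh and the site lives under the default IIS web root (adjust the path for your setup):

WWWROOT=/cygdrive/c/inetpub/wwwroot ./titles.sh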

Adam Rosenfield
This doesn't seem to quite work. What am I doing wrong?
Larsenal
+3  A: 

I think a script similar to what Adam Rosenfield suggested is what you want, but if you want the actual URLs, try using wget. With some appropriate options, it will print out a list of all the pages on your site (and download them, unless you suppress that with --spider). The wget program is available through the normal Cygwin installer.
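
A rough sketch of that combined approach (the URL and file pattern follow the asker's example; wget saves pages under a directory named after the host, which is what the find below relies on):

# Crawl the running site so that titles set in code get rendered into the pages.
wget --recursive --accept '*.aspx' http://localhost/mySite/
# Then extract each downloaded page's title, as in the script above.
for file in $(find localhost -iname '*.aspx'); do
  echo -en "$file\t"
  tr '\n' ' ' < "$file" | sed 's/.*<title>\([^<]*\)<\/title>.*/\1/'
  echo
done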

rmeador
Yeah, that is what I was trying to get working to post here! Here's a snippet: site=mysite.com; wget --recursive --accept '*.html' http://$site; for file in $(find $site -name '*.html'); do # Adam's for-body
Dustin
A: 

I would use wget as detailed above. Be sure you don't have any spider traps on your site (pages that generate an endless series of links, like per-day calendar views, which can keep a recursive crawl running forever).

Chris Nava
A: 

You should consider using the Scrapy shell.

Check out

http://doc.scrapy.org/intro/tutorial.html

In the console, put something like this:

hxs.x('/html/head/title/text()').extract()
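
To get that console pointed at one of your pages in the first place, something like the following should work (a sketch: "scrapy shell" is the command in recent Scrapy releases; older ones shipped a scrapy-ctl.py wrapper instead):

scrapy shell http://localhost/mySite/Default.aspx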

If you want all the titles, you should write a spider... it's really easy.

Also, consider moving to Linux :P

llazzaro