
Hi,

I have a big list of websites and I need to know if they have areas that are password protected.

I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Unauthorized". The problem is that these websites are all different, some static and some dynamic (HTML, CGI, PHP, Java applets...), and most of them won't use the same keywords...

Do you have any better ideas?

Thanks a lot!

+1  A: 

Look for forms with password fields.

You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
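
For instance, a minimal sketch of both checks using only Python's standard library ("page.html" is a placeholder for any page you've already downloaded; real pages may need fuzzier matching):

# flags a page that contains a password input, and collects hrefs of links
# whose visible text looks like a login link
from html.parser import HTMLParser

LOGIN_WORDS = ("log in", "login", "sign in", "signin")

class LoginFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.has_password_field = False
        self.login_links = []
        self._current_href = None
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type", "").lower() == "password":
            self.has_password_field = True
        elif tag == "a":
            self._in_anchor = True
            self._current_href = attrs.get("href")

    def handle_data(self, data):
        if self._in_anchor and any(w in data.lower() for w in LOGIN_WORDS):
            self.login_links.append(self._current_href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

finder = LoginFinder()
finder.feed(open("page.html", errors="replace").read())   # any downloaded page
print(finder.has_password_field, finder.login_links)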

Konrad Garus
Thanks for your answer, Konrad. Looking for keywords is what I have been doing, but I wanted a better solution, such as checking the HTTP response mentioned by tux21b... but what do you mean by "scrape the whole website"?
Samantha
+1  A: 

I would use httrack with several limits and then search the downloaded files for password fields.

Typically, a login form can be found within two links of the home page. Almost all e-commerce sites, web apps, etc. have login forms reachable with a single click from the home page, but crawling another layer or even two of depth would almost guarantee that you don't miss any.

I would also limit the speed at which httrack downloads, tell it not to download any non-HTML files, and prevent it from following external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
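
Once the mirror is on disk, a rough Python sketch of the search step could look like this ("mirror/www.example.com" is just a placeholder for wherever httrack put the files):

# walk an httrack mirror directory and report HTML files containing a
# password input field
import os
import re

PASSWORD_FIELD = re.compile(r'<input[^>]+type\s*=\s*["\']?password', re.IGNORECASE)

for root, dirs, files in os.walk("mirror/www.example.com"):
    for name in files:
        if not name.lower().endswith((".html", ".htm")):
            continue
        path = os.path.join(root, name)
        with open(path, encoding="utf-8", errors="replace") as f:
            if PASSWORD_FIELD.search(f.read()):
                print("password form found in", path)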

827
+1  A: 

You could just use wget and do something like:

wget -A html,php,jsp,htm -S -r http://www.yoursite.com 2> output_yoursite.txt

This will cause wget to download the entire site recursively, but only save files whose extensions are listed with the -A option, which in this case avoids heavy files.

The headers will be written to output_yoursite.txt, which you can then parse for 401 status codes, meaning that part of the site requires authentication. You can also search the downloaded files themselves, following Konrad's recommendation.
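
A quick sketch of that parsing step in Python (the file name matches the command above; the exact header layout can vary between wget versions):

# scan the saved wget output for 401 responses; with -S the server headers go
# to stderr, which is why the command above redirects with 2>
for line in open("output_yoursite.txt", errors="replace"):
    line = line.strip()
    if line.startswith("HTTP/") and " 401" in line:
        print("authentication required:", line)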

Anders
+1  A: 

Looking for 401 codes won't reliably catch them, as sites might not link to anything you don't have privileges for; that is, until you are logged in, a site won't show you anything you need to log in for. On the other hand, some sites (ones with all static content, for example) pop up a login dialog box for some pages via HTTP authentication, so looking only for password input tags would also miss things.

My advice: find a spider program that you can get the source for, add in whatever tests (plural) you plan on using, and make it stop at the first positive result. Look for a spider that can be throttled way back, can ignore non-HTML files (maybe by making HEAD requests and looking at the MIME type), and can work on more than one site independently and simultaneously.
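
As a rough illustration, the per-page tests such a spider might run could look like the Python below ("urls.txt" is a hypothetical list of start pages; a real spider would also queue and follow links and run more tests):

# per-page tests only, using the standard library
import time
import urllib.request
import urllib.error

def check(url):
    head = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(head, timeout=10) as resp:
            if "text/html" not in resp.headers.get("Content-Type", ""):
                return False                      # skip non-HTML resources
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        return 'type="password"' in html.lower()  # test 2: password field
    except urllib.error.HTTPError as e:
        return e.code == 401                      # test 1: HTTP authentication
    except Exception:
        return False

for line in open("urls.txt"):
    url = line.strip()
    if url:
        print(url, "protected" if check(url) else "no positive result")
    time.sleep(2)                                 # throttle way back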

BCS
+1  A: 

You might try using cURL and just attempting to connect to each site in turn (possibly put them in a text file and read each line, try to connect, repeat).

You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.
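
For example, a rough pycurl sketch along those lines ("sites.txt" is a hypothetical file of URLs; here the response code is read with getinfo() after the transfer rather than in a header callback):

# check each URL's HTTP status with libcurl via pycurl
import pycurl

def status_of(url):
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.NOBODY, True)            # headers only, no body
    c.setopt(pycurl.FOLLOWLOCATION, True)    # follow redirects to the final page
    c.setopt(pycurl.TIMEOUT, 15)
    try:
        c.perform()
        return c.getinfo(pycurl.RESPONSE_CODE)
    except pycurl.error:
        return None                          # unreachable or timed out
    finally:
        c.close()

for line in open("sites.txt"):
    url = line.strip()
    if url and status_of(url) == 401:
        print(url, "requires authentication")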

peachykeen
+1  A: 

Looking for password fields will only get you so far, and won't help with sites that use HTTP authentication. Looking for 401s will catch HTTP authentication, but not sites that don't use it, or ones that don't return a 401. Looking for links like "log in" or for "username" fields will get you some more.

I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.

You'll probably want to take a library that is good at web automation and write a little program yourself that reads the list of target sites from a file, checks each one, and writes one file of "these are definitely passworded" and another of "these are not". You might then go manually check the ones that are not and modify your program to accommodate what you find. Using httrack is great for grabbing data, but it's not going to help with detection -- if you write your own "check for password-protected area" program in a general-purpose high-level language, you can do more checks, and you can avoid generating more requests per site than are necessary to determine that a password-protected area exists.

You may need to ignore robots.txt.

I recommend using mechanize, the Python port of Perl's WWW::Mechanize, or whatever nice web automation library your preferred language has. Almost all modern languages have a nice library for opening and searching through web pages and looking at HTTP headers.
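
A rough mechanize sketch of that program ("sites.txt", "protected.txt" and "open.txt" are hypothetical file names; the two tests here are a 401 response and a password field in any form):

# sort a list of sites into "definitely passworded" and "not detected" files
import mechanize

def looks_protected(url):
    br = mechanize.Browser()
    br.set_handle_robots(False)              # ignore robots.txt, as noted above
    try:
        br.open(url)
        for form in br.forms():
            if any(getattr(c, "type", "") == "password" for c in form.controls):
                return True                  # login form found
    except mechanize.HTTPError as e:
        return e.code == 401                 # HTTP authentication
    except Exception:
        pass                                 # unreachable or not HTML
    return False

with open("protected.txt", "w") as yes, open("open.txt", "w") as no:
    for line in open("sites.txt"):
        url = line.strip()
        if url:
            (yes if looks_protected(url) else no).write(url + "\n")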

If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.

jeremiahd