views: 85

answers: 4
How would I start on a single web page, let's say at the root of DMOZ.org, and index every single URL attached to it, then store those links inside a text file? I don't want the content, just the links themselves. An example would be awesome.

A: 

Use urllib:

import urllib
html_source = urllib.urlopen('my_url').read()  # 'my_url' is the page to fetch

Then search for patterns such as

href="http://anychars"

With a regex (the re module) you can do it for the full document or line by line:

import re

for html_line in urllib.urlopen('my_url'):
    links = re.findall(r'href="(http://[^"]+)"', html_line)

As suggested in the comments, you can use more appropriate libraries to search for your patterns, such as an HTML parser (e.g. BeautifulSoup).
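
To cover the "store those links inside a text file" part of the question, here is a minimal end-to-end sketch of this approach (the start URL and the filename links.txt are just example values, not from the original answer):

import re
import urllib

html_source = urllib.urlopen('http://www.dmoz.org/').read()

# write one link per line to a plain text file
with open('links.txt', 'w') as out:
    for link in re.findall(r'href="(http://[^"]+)"', html_source):
        out.write(link + '\n')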

joaquin
-1 for recommending a regex to parse HTML.
Daenyth
@Daenyth: OP says “index every single url attached to it”, so it is fine to just look for valid URLs within the source.
poke
@poke: There are better tools to do this that require less work, are more robust, and are simpler to understand.
Daenyth
@Daenyth: I didn't say there were no better tools, of course there are, but using a regex to simply look through the source code is easy enough to get the links out, and most probably much faster than parsing the page.
poke
@Daenyth: For most easy searches, I find it much more straightforward to use simple string functions or simple regexes. On top of that, in many cases you get shorter code and higher speed.
joaquin
@joaquin: If you look at the answer I linked to in my post, I think you'll find it hard to argue that a regex will be more readable.
Daenyth
A: 

If you insist on reinventing the wheel, use an HTML parser like BeautifulSoup to grab all the tags out. This answer to a similar question is relevant.

Daenyth
+2  A: 

This, for instance, would print out links on this very related (but poorly named) question:

import urllib2
from BeautifulSoup import BeautifulSoup

# fetch the page and parse it into a tree of tags
q = urllib2.urlopen('http://stackoverflow.com/questions/3884419/')
soup = BeautifulSoup(q.read())

# walk every anchor tag; not all of them carry an href attribute
for link in soup.findAll('a'):
    if link.has_key('href'):
        print str(link.string) + " -> " + link['href']
    elif link.has_key('id'):
        print "ID: " + link['id']
    else:
        print "???"

Output:

Stack Exchange -> http://stackexchange.com
log in -> /users/login?returnurl=%2fquestions%2f3884419%2f
careers -> http://careers.stackoverflow.com
meta -> http://meta.stackoverflow.com
...
ID: flag-post-3884419
None -> /posts/3884419/revisions
...
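
To get just a text file of the links, as the question asks, the same loop can write the href values instead of printing them (a sketch; links.txt is an example filename):

with open('links.txt', 'w') as out:
    for link in soup.findAll('a'):
        if link.has_key('href'):
            out.write(link['href'] + '\n')
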
Nick T
You should use `if 'href' in link:` rather than `link.has_key`. `has_key` is deprecated and was removed in Python 3.
Daenyth
For me (Py 2.6.5, BS 3.0.8) `'href' in link` returns `False`, even though `link['href']` will give me a URL. I don't know that much about the workings of dictionaries though. `'href' in zip(*link.attrs)[0]` does seem to work, but is ugly.
Nick T
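
Under BeautifulSoup 3, `in` on a Tag appears to search the tag's children rather than its attributes, which would explain the `False` above. A less ugly workaround is the Tag's `get` method (a sketch, reusing the `soup` from the answer):

for link in soup.findAll('a'):
    href = link.get('href')  # returns None when the attribute is absent
    if href is not None:
        print href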
A: 

Scrapy is a Python framework for web crawling. Plenty of examples here: http://snippets.scrapy.org/popular/bookmarked/
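
As a rough illustration (not from the original answer, and the exact API varies by Scrapy version), a minimal spider that yields every link on a start page could look like this:

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'
    # example start page taken from the question
    start_urls = ['http://www.dmoz.org/']

    def parse(self, response):
        # emit each href found on the page as an item
        for href in response.css('a::attr(href)').extract():
            yield {'link': href}

Running it with `scrapy runspider` and the `-o` option would dump the collected links to a file.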

ScraperWiki