How would I start on a single web page, let's say at the root of DMOZ.org, and index every single URL attached to it? Then store those links inside a text file. I don't want the content, just the links themselves. An example would be awesome.
A:
Use urllib:
import urllib
html_source = urllib.urlopen('my_url').read()
Then search for patterns such as:
href="http://anychars"
You can do it with a regex, either over the full document or line by line:
import re
for html_line in urllib.urlopen('my_url'):
    print re.findall(r'href="(http[^"]+)"', html_line)
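Putting it together, a minimal sketch (assuming Python 2, to match urllib.urlopen above, and that a crude double-quoted href pattern is good enough) that writes the links found at the DMOZ root to a text file, as the question asks:

import re
import urllib

# crude pattern: absolute, double-quoted URLs only; it misses relative
# links and single-quoted or unquoted attribute values
link_re = re.compile(r'href="(http[^"]+)"')

html_source = urllib.urlopen('http://www.dmoz.org/').read()

with open('links.txt', 'w') as f:
    for link in link_re.findall(html_source):
        f.write(link + '\n')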
As suggested, you can use more appropriate libraries to search for your patterns, namely an HTML parser (e.g. BeautifulSoup).
joaquin
2010-10-13 16:38:01
-1 for recommending a regex to parse html
Daenyth
2010-10-13 16:42:16
@Daenyth: OP says “index every single url attached to it”, so it is fine to just look for valid URLs within the source.
poke
2010-10-13 16:44:53
@poke: There are better tools to do this that require less work, are more robust, and are simpler to understand
Daenyth
2010-10-13 16:45:53
@Daenyth: I didn't say there were no better tools, of course there are, but using a regex to simply look in the source code is easy enough to get the links out... and most probably much faster than parsing the page.
poke
2010-10-13 16:47:39
@Daenyth, for most simple searches I find it much more straightforward to use plain string functions or simple regexes. On top of that, in many cases you get shorter code and higher speed.
joaquin
2010-10-13 16:53:53
@joaquin: If you look at the answer I linked to in my post, I think you'll find it hard to argue that a regex will be more readable.
Daenyth
2010-10-13 16:55:25
A:
If you insist on reinventing the wheel, use an HTML parser like BeautifulSoup to grab all the tags out. This answer to a similar question is relevant.
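For the link-collection part, a minimal sketch (assuming BeautifulSoup 3, where findAll('a', href=True) keeps only anchors that actually carry an href attribute):

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.dmoz.org/').read())

# write one link per line to a text file, as the question asks
with open('links.txt', 'w') as f:
    for a in soup.findAll('a', href=True):
        f.write(a['href'] + '\n')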
Daenyth
2010-10-13 16:42:49
A:
This, for instance, would print out links on this very related (but poorly named) question:
import urllib2
from BeautifulSoup import BeautifulSoup

q = urllib2.urlopen('http://stackoverflow.com/questions/3884419/')
soup = BeautifulSoup(q.read())
for link in soup.findAll('a'):
    if link.has_key('href'):
        # a normal link: print its text and its target
        print str(link.string) + " -> " + link['href']
    elif link.has_key('id'):
        # anchors with no href often carry only an id
        print "ID: " + link['id']
    else:
        print "???"
Output:
Stack Exchange -> http://stackexchange.com
log in -> /users/login?returnurl=%2fquestions%2f3884419%2f
careers -> http://careers.stackoverflow.com
meta -> http://meta.stackoverflow.com
...
ID: flag-post-3884419
None -> /posts/3884419/revisions
...
Nick T
2010-10-13 17:06:46
You should use `if 'href' in link:` rather than `link.has_key`. `has_key` is deprecated and has been removed in Python 3.
Daenyth
2010-10-13 17:41:53
For me (Py 2.6.5, BS 3.0.8) `'href' in link` returns `False`, even though `link['href']` will give me a URL. I don't know that much about the workings of dictionaries though. `'href' in zip(*link.attrs)[0]` does seem to work, but is ugly.
Nick T
2010-10-13 18:38:32
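The mismatch is presumably because `in` on a BS3 Tag tests membership in the tag's contents rather than its attributes. A sketch of a cleaner alternative, assuming BeautifulSoup 3's `Tag.get` mirrors `dict.get` (which it does in BS 3.x):

for link in soup.findAll('a'):
    href = link.get('href')  # None when the attribute is absent
    if href is not None:
        print str(link.string) + " -> " + href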
A:
Scrapy is a Python framework for web crawling. Plenty of examples here: http://snippets.scrapy.org/popular/bookmarked/
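A minimal link-dumping spider, sketched against the Scrapy 0.x API of the time (BaseSpider and HtmlXPathSelector, as used in Scrapy's own DMOZ tutorial); treat the names as illustrative:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class LinkSpider(BaseSpider):
    name = 'links'
    start_urls = ['http://www.dmoz.org/']

    def parse(self, response):
        # the XPath pulls the href attribute of every anchor on the page
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            print href

Redirect the spider's output to a file to get the plain list of links the question asks for.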
ScraperWiki
2010-10-14 08:47:47