views: 85

answers: 4
How would I start on a single web page, let's say at the root of DMOZ.org, and index every single URL attached to it, then store those links inside a text file? I don't want the content, just the links themselves. An example would be awesome.

A: 

Use urllib:

import urllib
html_source = urllib.urlopen('my_url').read()  # 'my_url' is the page to fetch

Then search for patterns such as

href="http://anychars"

With a regex (the re module) you can do it for the full document or line by line:

import re

for html_line in urllib.urlopen('my_url'):
    links = re.findall(r'href="(http://[^"]+)"', html_line)

As suggested in the comments, you can use more appropriate libraries to search for your patterns, such as an HTML parser (e.g. BeautifulSoup).
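
To cover the "store those links inside a text file" part of the question, here is a minimal end-to-end sketch of this approach (the start URL and the filename links.txt are just example values, not from the original answer):

import re
import urllib

html_source = urllib.urlopen('http://www.dmoz.org/').read()

# write one link per line to a plain text file
with open('links.txt', 'w') as out:
    for link in re.findall(r'href="(http://[^"]+)"', html_source):
        out.write(link + '\n')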

joaquin
-1 for recommending a regex to parse HTML.
Daenyth
@Daenyth: OP says “index every single url attached to it”, so it is fine to just look for valid URLs within the source.
poke
@poke: There are better tools to do this that require less work, are more robust, and are simpler to understand.
Daenyth
@Daenyth: I didn't say there were no better tools, of course there are, but using a regex to simply look through the source code is easy enough to get the links out, and most probably much faster than parsing the page.
poke
@Daenyth: For most easy searches, I find it much more straightforward to use simple string functions or simple regexes. On top of that, in many cases you get shorter code and higher speed.
joaquin
@joaquin: If you look at the answer I linked to in my post, I think you'll find it hard to argue that a regex will be more readable.
Daenyth
A: 

If you insist on reinventing the wheel, use an HTML parser like BeautifulSoup to grab all the tags out. This answer to a similar question is relevant.

Daenyth
+2  A: 

This, for instance, would print out links on this very related (but poorly named) question:

import urllib2
from BeautifulSoup import BeautifulSoup

# fetch the page and parse it into a tree of tags
q = urllib2.urlopen('http://stackoverflow.com/questions/3884419/')
soup = BeautifulSoup(q.read())

# walk every anchor tag; not all of them carry an href attribute
for link in soup.findAll('a'):
    if link.has_key('href'):
        print str(link.string) + " -> " + link['href']
    elif link.has_key('id'):
        print "ID: " + link['id']
    else:
        print "???"

Output:

Stack Exchange -> http://stackexchange.com
log in -> /users/login?returnurl=%2fquestions%2f3884419%2f
careers -> http://careers.stackoverflow.com
meta -> http://meta.stackoverflow.com
...
ID: flag-post-3884419
None -> /posts/3884419/revisions
...
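
To get just a text file of the links, as the question asks, the same loop can write the href values instead of printing them (a sketch; links.txt is an example filename):

with open('links.txt', 'w') as out:
    for link in soup.findAll('a'):
        if link.has_key('href'):
            out.write(link['href'] + '\n')
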
Nick T
You should use `if 'href' in link:` rather than `link.has_key`. `has_key` is deprecated and was removed in Python 3.
Daenyth
For me (Py 2.6.5, BS 3.0.8) `'href' in link` returns `False`, even though `link['href']` will give me a URL. I don't know that much about the workings of dictionaries though. `'href' in zip(*link.attrs)[0]` does seem to work, but is ugly.
Nick T
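
Under BeautifulSoup 3, `in` on a Tag appears to search the tag's children rather than its attributes, which would explain the `False` above. A less ugly workaround is the Tag's `get` method (a sketch, reusing the `soup` from the answer):

for link in soup.findAll('a'):
    href = link.get('href')  # returns None when the attribute is absent
    if href is not None:
        print href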
A: 

Scrapy is a Python framework for web crawling. Plenty of examples here: http://snippets.scrapy.org/popular/bookmarked/
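
As a rough illustration (not from the original answer, and the exact API varies by Scrapy version), a minimal spider that yields every link on a start page could look like this:

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'
    # example start page taken from the question
    start_urls = ['http://www.dmoz.org/']

    def parse(self, response):
        # emit each href found on the page as an item
        for href in response.css('a::attr(href)').extract():
            yield {'link': href}

Running it with `scrapy runspider` and the `-o` option would dump the collected links to a file.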

ScraperWiki