Here is a related question, but I could not figure out how to apply its answer to mechanize/urllib2: http://stackoverflow.com/questions/1540749/how-to-force-python-httplib-library-to-use-only-a-requests

Basically, given this simple code:

#!/usr/bin/python
import urllib2
print urllib2.urlopen('http://python.org/').read(100)

Running it produces the following Wireshark trace:

  0.000000  10.102.0.79 -> 8.8.8.8      DNS Standard query A python.org
  0.000023  10.102.0.79 -> 8.8.8.8      DNS Standard query AAAA python.org
  0.005369      8.8.8.8 -> 10.102.0.79  DNS Standard query response A 82.94.164.162
  5.004494  10.102.0.79 -> 8.8.8.8      DNS Standard query A python.org
  5.010540      8.8.8.8 -> 10.102.0.79  DNS Standard query response A 82.94.164.162
  5.010599  10.102.0.79 -> 8.8.8.8      DNS Standard query AAAA python.org
  5.015832      8.8.8.8 -> 10.102.0.79  DNS Standard query response AAAA 2001:888:2000:d::a2

That's a 5 second delay!

I don't have IPv6 enabled anywhere on my system (Gentoo compiled with USE=-ipv6), so I don't think Python has any reason to even try an IPv6 lookup.

The question referenced above suggested explicitly setting the address family to AF_INET, which sounds great. I have no idea how to force urllib or mechanize to use sockets that I create, though.

EDIT: I know that the AAAA queries are the issue because other apps had the delay as well, and as soon as I recompiled with IPv6 disabled, the problem went away... except in Python, which still performs the AAAA requests.

+1  A: 

The DNS server 8.8.8.8 (Google DNS) replies immediately when asked about the AAAA record of python.org. Therefore, the fact that we do not see this reply in the trace you posted probably indicates that the packet did not come back (which happens with UDP). If this loss is random, it is normal. If it is systematic, it means there is a problem in your network setup, maybe a broken firewall that prevents the first AAAA reply from coming back.

The 5-second delay comes from your stub resolver, which waits that long before retrying. If the loss is random, it is probably bad luck and not related to IPv6: the reply for the A record could have been lost just as well.
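
To tell a systematic delay from a random one, it can help to time the lookup from Python itself. The following is only a minimal sketch: `time_lookup` is a hypothetical helper name, not part of any library, and it assumes that the resolver sends AAAA queries only when the requested family allows IPv6, which is typical but depends on the platform.

```python
import socket
import time


def time_lookup(host, port=80, family=socket.AF_UNSPEC):
    """Return (elapsed_seconds, results) for a single getaddrinfo() call.

    family=socket.AF_UNSPEC (0) asks for every address family;
    family=socket.AF_INET asks for IPv4 addresses only.
    """
    start = time.time()
    results = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    return time.time() - start, results
```

Calling `time_lookup("python.org")` and `time_lookup("python.org", 80, socket.AF_INET)` several times and comparing the elapsed values should show whether the 5 seconds appear only when all families are requested, and whether they appear on every call.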

Disabling IPv6 seems a very strange move, only two years before the last IPv4 address is distributed!

% dig @8.8.8.8  AAAA python.org

; <<>> DiG 9.5.1-P3 <<>> @8.8.8.8 AAAA python.org
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50323
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;python.org.                    IN      AAAA

;; ANSWER SECTION:
python.org.             69917   IN      AAAA    2001:888:2000:d::a2

;; Query time: 36 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sat Jan  9 21:51:14 2010
;; MSG SIZE  rcvd: 67
bortzmeyer
Well, I'd be happy to use IPv6... once it stops adding a 5 second delay to my DNS queries :-P. And unfortunately, it isn't "bad luck"; it happens on every single query.
Evan Teran
+1  A: 

No answer, but a few data points. The DNS resolution appears to originate from httplib.py, in HTTPConnection.connect() (line 670 in my Python 2.5.4 stdlib).

The code flow is roughly:

for res in socket.getaddrinfo(self.host, self.port, 0, socket.SOCK_STREAM):
    af, socktype, proto, canonname, sa = res
    self.sock = socket.socket(af, socktype, proto)
    try:
        self.sock.connect(sa)
    except socket.error, msg: 
        continue
    break

A few comments on what's going on:

  • the third argument to socket.getaddrinfo() limits the socket families -- i.e., IPv4 vs. IPv6. Passing zero returns all families. Zero is hardcoded into the stdlib.

  • passing a hostname into getaddrinfo() will cause name resolution -- on my OS X box with IPv6 enabled, both A and AAAA records go out, both answers come right back and both are returned.

  • the rest of the connect loop tries each returned address until one succeeds

For example:

>>> socket.getaddrinfo("python.org", 80, 0, socket.SOCK_STREAM)
[
 (30, 1, 6, '', ('2001:888:2000:d::a2', 80, 0, 0)), 
 ( 2, 1, 6, '', ('82.94.164.162', 80))
]
>>> help(socket.getaddrinfo)
getaddrinfo(...)
    getaddrinfo(host, port [, family, socktype, proto, flags])
        -> list of (family, socktype, proto, canonname, sockaddr)
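
For comparison, the same call can be restricted to one family through that third argument. This is a small sketch, not code from the stdlib: `resolve` is a hypothetical helper name, and it only shows what a caller gets back when it passes `socket.AF_INET` instead of zero.

```python
import socket


def resolve(host, port, family=socket.AF_UNSPEC):
    """Return (family, sockaddr) pairs from getaddrinfo() for one family.

    With family=socket.AF_INET only IPv4 results are returned; the default
    of AF_UNSPEC (0) matches what httplib passes.
    """
    infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    return [(af, sockaddr) for af, _socktype, _proto, _canonname, sockaddr in infos]
```

With `resolve("python.org", 80, socket.AF_INET)`, every returned pair carries the IPv4 family and an IPv4 `(address, port)` tuple.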

Some guesses:

  • Since the socket family in getaddrinfo() is hardcoded to zero, you won't be able to choose between A and AAAA records through any supported API in urllib. Unless mechanize does its own name resolution for some other reason, mechanize can't either. Judging from the structure of the connect loop, this is by design.

  • Python's socket module is a thin wrapper around the POSIX socket APIs; I expect they're resolving every family available and configured on the system. Double-check Gentoo's IPv6 configuration.
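
An unsupported workaround is to replace `socket.getaddrinfo` with a wrapper that always asks for IPv4. This is a sketch under assumptions, not an API of urllib2 or mechanize: the names `getaddrinfo_ipv4_only`, `force_ipv4` and `restore_getaddrinfo` are made up for illustration, and it relies on the library looking up `getaddrinfo` on the `socket` module at call time, as the connect loop quoted above does.

```python
import socket

# Keep a reference to the real function so it can be called and restored.
_original_getaddrinfo = socket.getaddrinfo


def getaddrinfo_ipv4_only(host, port, family=0, *args, **kwargs):
    """Ignore the requested family and always resolve IPv4 addresses only."""
    return _original_getaddrinfo(host, port, socket.AF_INET, *args, **kwargs)


def force_ipv4():
    """Process-wide monkeypatch: every later lookup is limited to AF_INET."""
    socket.getaddrinfo = getaddrinfo_ipv4_only


def restore_getaddrinfo():
    """Undo force_ipv4()."""
    socket.getaddrinfo = _original_getaddrinfo
```

Calling `force_ipv4()` once before the first `urllib2.urlopen()` should be enough. The patch affects every library in the process, so it will also break connections to IPv6-only hosts.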

J.J.
Seems to me that Python shouldn't pass `0` to `socket.getaddrinfo` if it is built without IPv6 support. Perhaps this could be considered a minor bug in some ways.
Evan Teran