I don't know if I'm doing something wrong, but I'm 100% sure that my Python script is what brings down my Internet connection.

I wrote a Python script that fetches header info for thousands of files using HEAD requests, mainly to read Content-Length and get the exact size of each file.

Sample code:

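import urllib2

# Python 2: subclass Request so that urlopen sends HEAD instead of GET.
class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

response = urllib2.urlopen(HeadRequest("http://www.google.com"))
print response.info()  # response headers, including Content-Length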

The thing is that after several hours of running, the script starts to throw "urlopen error timed out", and my Internet connection is down from then on. The connection always comes back immediately after I close the script. At first I thought the connection might just be unstable, but after several runs it turned out to be the script's fault.

I don't know why. Should this be considered a bug? Or has my ISP banned me for doing this? (I already set the program to wait 10s between requests.)

BTW, I'm using a VPN. Does that have something to do with this?

A: 

I'd guess that either your ISP or VPN provider is limiting you because of high-volume suspicious traffic, or your router or VPN tunnel is getting clogged up with half-open connections. Consumer internet is REALLY not intended for spider-type activities.

Paul McMillan
Oh, so I guess we are not allowed to do what Google does... How come they can do this while we cannot? Do you have to pay a lot to do such things?
Shane
So it's definitely not a bug, right?
Shane
What do you mean by half-open connections? I only use one thread to do this.
Shane
@Shane - Depending on the NAT implementation it may keep a bunch of information about connections lying around even for a short time after they're closed. The fact the problem clears up so quickly strongly indicates this kind of problem to me rather than any sort of ISP throttling. If you simply pause a little between each connection it should work.
Omnifarious
@Omnifarious: Well, is there anything I can do to configure NAT settings to make it work?
Shane
That looks like pretty standard code. Half open connections would be caused if you had a shell script to run this code repeatedly, for example, or your network hardware was hanging onto the connections. If you really are only executing one such connection at a time, they're probably not a problem. If you have access to a different operating system, I'd try it there and see if your results changed.
Paul McMillan
I set the script to wait 10s between requests. Is that still too short?
Shane
@Paul McMillan: Yes, I ran this script in the Python shell. Does that mean the script will run smoothly if I simply double-click the Python script to run it?
Shane
If you have a router between your computer and your modem, try taking it out. Sometimes cheap routing hardware can get confused by a multitude of very short connections like this. Specifically consider the hardware on the other end of your VPN. If you told us more about the VPN, we might be more helpful. Some network hardware might keep the connections in memory for as much as several minutes. Does it work without the VPN?
Paul McMillan
@Shane double clicking vs. running in the python shell won't make any difference unless you're executing more than one instance at once.
Paul McMillan
@PaulMcMillan: I will try connecting directly to my modem. My router is indeed a cheap one, so it might be what is causing all the problems. My government won't even let me connect to those sites, thanks to the GFW, so working without a VPN is not an option...
Shane
Gotcha. I'm not sure what you're using for the VPN, but it's likely that the problem is on the outlet end. Spidering is a pretty stressful activity for connections, which is why this sort of thing is usually run from dedicated servers.
Paul McMillan
@Paul McMillan: Thanks a lot, man!
Shane
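
For reference, a minimal Python 3 sketch of the advice in the comments above: one connection at a time, an explicit timeout, each response closed as soon as its headers are read, and a pause between requests. It uses urllib.request, the Python 3 successor to urllib2. The function names, the opener and sleep parameters, and the 10-second defaults are assumptions for illustration, not the asker's actual script, and this will not help if the limit is in the router or at the far end of the VPN.

```python
import time
import urllib.request


def fetch_content_length(url, timeout=10, opener=urllib.request.urlopen):
    """Send one HEAD request and return the Content-Length header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    response = opener(request, timeout=timeout)
    try:
        return response.headers.get("Content-Length")
    finally:
        # Close explicitly so the socket is released right away.
        response.close()


def fetch_all(urls, delay=10, opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch sizes one at a time, pausing `delay` seconds between requests."""
    sizes = {}
    for index, url in enumerate(urls):
        if index:
            sleep(delay)  # pause before every request except the first
        try:
            sizes[url] = fetch_content_length(url, opener=opener)
        except OSError as error:
            # URLError and socket timeouts are both OSError subclasses.
            sizes[url] = None
            print("failed:", url, error)
    return sizes
```

Calling fetch_all(list_of_urls) returns a dict mapping each URL to its Content-Length string, or None where the request failed or the header was absent.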
A: 

"the script starts to throw out urlopen error timed out"

We can't even begin to guess.

You need to gather data on your computer and include that data in your question.

Get another computer. Run your script. Is the other computer's internet access blocked also? Or does it still work?

  • If both computers are blocked, it's not your software, it's your provider. Update your question with this information, and how you got it.

  • If only the computer running the script is stopped, it's not your provider, it's your OS resources being exhausted. This is harder to diagnose because it could be memory, sockets or file descriptors. Usually it's sockets.

You need to find some network diagnostic software for your operating system (ifconfig/ipconfig show interface state; netstat lists open sockets). You need to update your question to state exactly what operating system you're using. You need to use this diagnostic software to see how many open sockets are cluttering up your system.
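
One rough way to count open sockets is to summarise the output of `netstat -an` by TCP state. The sketch below is only an illustration: the function name count_tcp_states and the set of state names are assumptions, and column layout differs between operating systems, so it simply looks for a known state word on each line that starts with "tcp".

```python
from collections import Counter

# State names as printed by common netstat implementations (assumed list).
TCP_STATES = {
    "ESTABLISHED", "TIME_WAIT", "CLOSE_WAIT", "SYN_SENT", "SYN_RECV",
    "SYN_RECEIVED", "FIN_WAIT1", "FIN_WAIT2", "FIN_WAIT_1", "FIN_WAIT_2",
    "LAST_ACK", "CLOSING", "CLOSED", "LISTEN", "LISTENING",
}


def count_tcp_states(netstat_output):
    """Count TCP sockets per state in the text printed by `netstat -an`."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Only TCP rows carry a state; skip headers, UDP rows and blank lines.
        if not fields or not fields[0].lower().startswith("tcp"):
            continue
        for field in fields:
            if field.upper() in TCP_STATES:
                counts[field.upper()] += 1
                break
    return counts


# Possible usage while the script is running:
#   import subprocess
#   text = subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout
#   print(count_tcp_states(text))
```

A count of TIME_WAIT or CLOSE_WAIT entries that keeps growing while the script runs would point at sockets piling up on the local machine.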

S.Lott