Background: I am using urllib.urlretrieve, as opposed to any other function in the urllib* modules, because of its hook function support (see reporthook below), which I use to display a textual progress bar. This is Python >=2.6.
>>> urllib.urlretrieve(url[, filename[, reporthook[, data]]])
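For reference, a reporthook of the kind described here might look like the minimal sketch below. urlretrieve calls the hook with the number of blocks transferred so far, the block size in bytes, and the total size of the file (which can be -1 when the server sends no Content-Length). The names format_progress and my_report_hook and the 40-character bar width are illustrative assumptions, not part of urllib.

```python
import sys


def format_progress(block_count, block_size, total_size, width=40):
    """Build one line of text describing download progress.

    The first three arguments mirror what urlretrieve passes to its
    reporthook. format_progress itself is a hypothetical helper.
    """
    done = block_count * block_size
    if total_size <= 0:
        # Size unknown (no Content-Length): report bytes only.
        return "%d bytes" % done
    # The last block is usually partial, so cap at the total size.
    done = min(done, total_size)
    filled = width * done // total_size
    percent = 100 * done // total_size
    return "[%s%s] %3d%%" % ("#" * filled, "-" * (width - filled), percent)


def my_report_hook(block_count, block_size, total_size):
    """Reporthook that redraws the bar in place on one terminal line."""
    sys.stdout.write("\r" + format_progress(block_count, block_size, total_size))
    sys.stdout.flush()
```

Passing my_report_hook as the third argument to urllib.urlretrieve would redraw the bar after every block. Keeping the formatting in a separate function makes it easy to check without touching the network.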
urlretrieve is so dumb that it leaves no way to detect the status of the HTTP request (e.g., was it 404 or 200?).
>>> fn, h = urllib.urlretrieve('http://google.com/foo/bar')
>>> h.items()
[('date', 'Thu, 20 Aug 2009 20:07:40 GMT'),
 ('expires', '-1'),
 ('content-type', 'text/html; charset=ISO-8859-1'),
 ('server', 'gws'),
 ('cache-control', 'private, max-age=0')]
>>> h.status
''
>>>
What is the best known way to download a remote HTTP file with hook-like support (to show a progress bar) and decent HTTP error handling?
urllib.urlretrieve's complete code:
def urlretrieve(url, filename=None, reporthook=None, data=None):
    global _urlopener
    if not _urlopener:
        _urlopener = FancyURLopener()
    return _urlopener.retrieve(url, filename, reporthook, data)
In other words, you can use urllib.FancyURLopener (it’s part of the public urllib API). You can override
http_error_default to detect 404s:
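import urllib

class MyURLopener(urllib.FancyURLopener):
    def http_error_default(self, url, fp, errcode, errmsg, headers):
        # handle errors the way you'd like to, e.g. raise instead of saving the error page
        raise IOError('http error', errcode, errmsg, headers)

fn, h = MyURLopener().retrieve(url, reporthook=my_report_hook)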
You should use:
import urllib2
try:
    resp = urllib2.urlopen("http://www.google.com/this-gives-a-404/")
except urllib2.URLError, e:
    if not hasattr(e, "code"):
        raise
    resp = e

print "Gave", resp.code, resp.msg
print "=" * 80
print resp.read(80)
Edit: The rationale here is that unless you expect the exceptional state, it is an exception for it to happen, and you probably didn't even think about it. So instead of letting your code continue to run after an unsuccessful request, the default behavior is, quite sensibly, to inhibit its execution.
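To get the progress hook back while keeping this exception-based error handling, one option is to read the response in blocks yourself. The following is only a sketch under stated assumptions: copy_with_hook and download are hypothetical names, the 8192-byte block size is arbitrary, and the hook is called with the same three arguments urlretrieve uses. The try/except import is there only so the same code also runs on Python 3.

```python
try:  # Python 2, as used in this thread
    from urllib2 import urlopen, HTTPError
except ImportError:  # Python 3 moved these names
    from urllib.request import urlopen
    from urllib.error import HTTPError


def copy_with_hook(src, dst, reporthook=None, total_size=-1, block_size=8192):
    """Copy src to dst in blocks, calling reporthook the way urlretrieve does."""
    block_count = 0
    if reporthook:
        reporthook(block_count, block_size, total_size)
    while True:
        block = src.read(block_size)
        if not block:
            break
        dst.write(block)
        block_count += 1
        if reporthook:
            reporthook(block_count, block_size, total_size)
    return block_count


def download(url, filename, reporthook=None):
    """Save url to filename; an HTTP error such as 404 raises HTTPError."""
    resp = urlopen(url)  # raises HTTPError before any file is created
    try:
        headers = resp.info()
        length = headers.get("Content-Length")
        total_size = int(length) if length else -1  # -1 means size unknown
        with open(filename, "wb") as out:
            copy_with_hook(resp, out, reporthook, total_size)
    finally:
        resp.close()
    return filename, headers
```

Called as download(url, filename, my_report_hook), this would behave much like urlretrieve for successful requests, while a 404 surfaces as an HTTPError whose code attribute holds the status.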
The urllib.URLopener object's retrieve method supports the reporthook and raises an exception (IOError) on 404.