404 - actual url

Hi, really hoping someone can help me out.

 

I'm writing some Python that fetches a web page's content with urllib2 and then uses XPath to navigate it, which is simple enough. The issue is that the page I'm fetching redirects to another page, which sometimes throws a 404, and all that is wrong is that a small part of the redirected URL is incorrect.

 

What I'm trying to do (but failing) is attempt to load the page and, if a 404 occurs on redirect, get the URL that caused the 404 (it will not be the one I called), do some URL modification, and then retry.

 

Can anyone help?

@yoadster,

If I understand you correctly, you might want to take a look here for info on urllib2 redirect handling.

Example 11.11 shows you how to handle redirects with custom handlers. I would suggest parsing the Location header from the 301/302 responses to get the redirect URL(s). Once you reach the 404, the last redirect URL should be the one that sent you there.
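
To make that a bit more concrete, here is a rough sketch of the idea rather than a tested solution for your site. It subclasses urllib2's HTTPRedirectHandler and records each redirect target (the newurl argument, which urllib2 builds from the Location header), then retries once if the chain ends in a 404. The names RecordingRedirectHandler, fetch_with_fix and fix_url are made up for illustration, and fix_url stands in for whatever URL modification you need. The import falls back to urllib.request, which is what urllib2 became in Python 3.

```python
try:  # Python 2
    from urllib2 import HTTPRedirectHandler, HTTPError, build_opener
except ImportError:  # Python 3: urllib2 was split into urllib.request/urllib.error
    from urllib.request import HTTPRedirectHandler, build_opener
    from urllib.error import HTTPError


class RecordingRedirectHandler(HTTPRedirectHandler):
    """Hypothetical handler that remembers every URL it is redirected to."""

    def __init__(self):
        self.redirect_urls = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # newurl is the redirect target taken from the Location header.
        self.redirect_urls.append(newurl)
        # Let the stock handler decide whether and how to follow it.
        return HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, newurl)


def fetch_with_fix(url, fix_url):
    """Fetch url; if a redirect ends in a 404, repair that URL and retry once.

    fix_url is a caller-supplied function (an assumption of this sketch)
    that takes the failing URL and returns the corrected one.
    """
    recorder = RecordingRedirectHandler()
    opener = build_opener(recorder)
    try:
        return opener.open(url).read()
    except HTTPError as err:
        # Only handle a 404 that happened after at least one redirect.
        if err.code != 404 or not recorder.redirect_urls:
            raise
        failed_url = recorder.redirect_urls[-1]  # last redirect target
        return opener.open(fix_url(failed_url)).read()


# Example call with made-up URLs and a made-up fix:
# html = fetch_with_fix("http://example.com/start",
#                       lambda u: u.replace("/old/", "/new/"))
```

Because the handler stores every hop, recorder.redirect_urls also gives you the whole redirect chain if you need more than the last URL.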

Hope this helps!
