Scraping problems

Hi,

I'm revisiting a project I started about a year ago. I've got it "mostly" working. However, the design of the site that I'm scraping is giving me some problems.

First, you can see the project code here: https://github.com/lexy0/TVO.bundle

My code is supposed to go through each program page (http://tvo.org/programs-a-z/A for example) and grab the relevant info. It works fine for programs starting with A.

Unfortunately, the code will not work for any other letter.

If I use http://tvo.org/programs-a-z#/B the code only grabs program info for programs starting with A.

If I use http://tvo.org/programs-a-z/C the code returns the "channel is not responding" error.

If I paste either link into a browser, the correct data is displayed.

Does anyone have suggestions on how to get around this?

Thanks!
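
A possible explanation for the #/B behaviour, offered as a guess rather than a diagnosis: everything after # in a URL is a fragment, which the browser keeps to itself and does not send to the server in the HTTP request. A scraper asking for http://tvo.org/programs-a-z#/B would therefore receive the same page as http://tvo.org/programs-a-z, and the browser presumably fills in the B listing with JavaScript afterwards. The sketch below uses only the Python standard library (it is not Plex framework code, and the helper name url_sent_to_server is made up for illustration) to show the split:

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2

def url_sent_to_server(url):
    """Return the URL without its fragment (hypothetical helper).

    This is the part an HTTP client actually requests from the server.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ''))

hash_url = 'http://tvo.org/programs-a-z#/B'
path_url = 'http://tvo.org/programs-a-z/C'

print(urlsplit(hash_url).fragment)    # the piece the server never sees
print(url_sent_to_server(hash_url))   # what is really requested
print(url_sent_to_server(path_url))   # no fragment, so nothing is dropped
```

If that is what is happening, the /C style of URL is the one worth debugging, since the letter is part of the path and does reach the server.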

My original code used HTML.ElementFromURL() to scrape the program page(s). Once I switched to the following to capture the webpage data, the code started working a bit better:

pg_content = HTTP.Request(pass_url)
pg_page = HTML.ElementFromString(pg_content)

But now I have a new problem.

My code checks to see if a program is either a series or a single documentary. To accomplish this, I use the following code:

if nodeExists == 1:
    vidURL = item.xpath('.//span[@class="field-content ms-detail-links-video"]/a')[0].get('href')
    # str.find returns the index of the match, or -1 if it is not found
    isSeries = vidURL.find('video-landing')

# Series: 'video-landing' found at index 15
if isSeries == 15:
    if not showTitle.startswith(' '):
        oc.add(DirectoryObject(key=Callback(ShowEpisodes, title=showTitle, pass_url=showURL, pass_thumb=showThumb), title=showTitle, summary=showSummary, thumb=showThumb))

# Documentary: 'video-landing' not found
if isSeries == -1:
    if not showTitle.startswith(' '):
        if 'bcid' in vidURL:
            oc.add(DirectoryObject(key=Callback(PlayEpisodes, title=showTitle, pass_url=vidURL), title=showTitle, summary=showSummary, thumb=showThumb))

I expect isSeries to equal 15 (series) or -1 (doc). However, when it equals -1, I get the "channel is not responding" error for any program page OTHER than http://tvo.org/programs-a-z/A (this page scrapes perfectly).

When I switch isSeries == -1: to isSeries == '-1':, I don't get the error message, but I still don't get the list of documentaries. The series list shows up just fine.

So I'm assuming that now my code can't handle negative numbers. Does anyone have suggestions on how to fix this? I have uploaded my update to GitHub: https://github.com/lexy0/TVO.bundle

Thanks!

In this line you're setting isSeries to be an int (but later you're checking it as a string).  https://github.com/lexy0/TVO.bundle/blob/master/Contents/Code/__init__.py#L68

If you're just using it the way it looks like you're using it, it's easiest to make them both strings (put quotes around the first one as well).
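
To illustrate the type point in plain Python (the URLs below are made up for the example, not taken from the site): str.find returns an int, either the index of the match or -1, so a comparison against a quoted value such as '-1' is always False and that branch is silently skipped. That would be consistent with the error going away while the documentary list stayed empty.

```python
# Hypothetical URLs, shaped like the two formats discussed in this thread.
series_url = 'http://tvo.org/video-landing/example-series'
doc_url = 'http://tvo.org/bcid/1234567890'

series_pos = series_url.find('video-landing')  # index where the match starts
doc_pos = doc_url.find('video-landing')        # -1 means "not found"

print(type(series_pos).__name__)  # find gives back an int
print(type(doc_pos).__name__)     # also an int, even for "not found"

int_check = (doc_pos == -1)    # int compared with int
str_check = (doc_pos == '-1')  # int compared with str: never equal
print(int_check)
print(str_check)
```

Keeping both comparisons as ints (15 and -1), or avoiding find altogether, sidesteps the mismatch.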

Also, just a comment: you can save yourself a bit of hoop jumping by using HTML.ElementFromURL(pass_url) instead of doing it manually in two steps (pg_content = and pg_page =).   EDIT:  I didn't read your other post closely enough.  It's very odd that one works and the other doesn't, but if it works, it works :)

If I might make a suggestion ... I took a quick look at the web page here (just for the A listing).  It looks like the vidURL you're pulling has one of two formats.

Personally I would try and simplify things to something like this:

if "/video-landing/" in vidURL:
    # do whatever you need here
elif "/bcid/" in vidURL:
    # do whatever you need here
else:
    # catch-all if you need it (in case there's some weird third format you haven't found yet)
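
Filled in as a small runnable sketch, the same idea might look like the following. The function name classify_video_url and the sample URLs are invented for illustration; in the channel itself, the branches would add the DirectoryObject entries rather than return a label. Testing membership with in also avoids depending on the match sitting at exactly index 15.

```python
def classify_video_url(vid_url):
    """Label a video URL as 'series', 'documentary', or 'unknown'.

    Hypothetical helper; the substrings come from the two URL
    formats mentioned in this thread.
    """
    if '/video-landing/' in vid_url:
        return 'series'
    elif '/bcid/' in vid_url:
        return 'documentary'
    else:
        # Catch-all for any third format that has not been spotted yet.
        return 'unknown'

# Made-up sample URLs, one per branch.
samples = [
    'http://tvo.org/video-landing/example-series',
    'http://tvo.org/bcid/1234567890',
    'http://example.org/some-other-site',
]

for url in samples:
    print('%s -> %s' % (url, classify_video_url(url)))
```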

Gunk,

Thanks for your advice!

I was able to get the docs portion working using the if ... in check within the same structure as my old code. Then I tried your simplified method, and it returned the "channel is not responding" error for all letters except the default, A. So I've stuck with what I have.

There isn't a weird 3rd format that I can tell other than some programs take users to separate sub-sites, but there is nothing that IDs those special programs from any other. There are only a handful of those oddities, so for now I'll leave it be.

Gunk, the channel is now pretty much working. When is it safe to release a channel? Should I ask for further feedback on this forum, or just post in the channels forum and wait for feedback there?

The GitHub project has been updated: https://github.com/lexy0/TVO.bundle

Thanks again!

I would go ahead and post it as a release in the channel forum, and just let people know that it's new and that they should report any problems.

One question though, is this stuff geo-blocked to Canada only?  (I'm in Canada and it seems to work for me) ... but given that it's TV Ontario it might be restricted to Canada only for viewing videos.  I guess you will find out soon enough after releasing ;)

I'll post it shortly. Not sure if geo-blocking will be an issue. We use Unblock-us here and TVO works fine while the Food Network Canada channel keeps giving us errors, which I have assumed were geo-blocking related.

I guess we'll see!
