@parallelize and @task

How do I use them?

 

I have a site for which I've grabbed the URLs of all the pages that contain videos (there's no metadata on the pages, hence I'm not using a URL Service). There are 18 in total, though there could be more or fewer.

 

At present I'm doing the below, which handles each channel in turn: it gets the title and the URL of the page the channel's video is on, then requests that URL and finds the HLS file. The channel's details are then appended to a list called CHANNEL_LIST.

 

This is really slow, and I'd like to do some of the requests in parallel.

################################################################################
# Gets a list of channels to iterate over
################################################################################
def GetChannelList():
    # Check to see if CHANNEL_LIST is already populated, if yes return it, if
    # no construct it.
    if CHANNEL_LIST:
        return CHANNEL_LIST
    else:
        # Construct CHANNEL_LIST_URL and grab HTML
        CHANNEL_LIST_URL        = URL_BASE + URL_MEMBERS + URL_CHANNELMENU        
        CHANNEL_LIST_SOURCE     = HTML.ElementFromURL(CHANNEL_LIST_URL)
    # Find the channel links in the HTML source with xPath
    CHANNELS                = CHANNEL_LIST_SOURCE.xpath("//p/a")

    # Remove the last link from the CHANNELS list (the 'Return
    # to desktop version' links)
    CHANNELS.pop()

    # Add each channel to CHANNEL_LIST
    for CHANNEL in CHANNELS:
        # Grab the link text and convert from list to string
        # N.B. xpath ALWAYS returns a list
        CHANNEL_TITLE       = "".join(CHANNEL.xpath(".//text()"))
        CHANNEL_URL         = URL_BASE + URL_MEMBERS + "".join(CHANNEL.xpath(".//@href"))
        
        # Extracts the actual video URL for a channel. We do it inside 
        # this function so we can store it as part of CHANNEL_LIST and 
        # only do it once, not every time we hit the main menu
        CHANNEL_VIDEO       = GetChannelVideoStreamURL(CHANNEL_URL)

        # Gets the correct channel thumbnail
        CHANNEL_THUMB       = GetChannelThumb(CHANNEL_TITLE)

        # Appends the channel details to the CHANNEL_LIST
        CHANNEL_LIST.append([CHANNEL_TITLE,CHANNEL_VIDEO,CHANNEL_THUMB])
    
    CHANNEL_LIST.sort()
    
    return CHANNEL_LIST

################################################################################
# Extracts the actual video URL for a channel
################################################################################
def GetChannelVideoStreamURL(URL):
    # Grab the source from the channel's URL. Done inside here so we
    # only do it once, not every time we hit the main menu
    CHANNEL_SOURCE          = HTML.ElementFromURL(URL)

    # Gets the relevant script that has the mediaplayer info in it, by using
    # xPath to search for a script containing the string 'mediaplayer'
    CHANNEL_SCRIPT          = CHANNEL_SOURCE.xpath("//script[contains(., 'mediaplayer')]//text()")[0]

    # Grabs the video URL via regex
    CHANNEL_VIDEO           = re.findall(r'(http:\/\/[\d].*)\'',CHANNEL_SCRIPT)[0]

    return CHANNEL_VIDEO

However, I'm not really sure how to @parallelize/@task this. When I try, CHANNEL_LIST comes back empty, as if the function is returning before the parallel tasks have finished, or I'm not getting the information back from @task.

 

Any help? I've looked at the devour plugin mentioned in this thread, but I'm none the wiser.

Is there a reason that you need to store the actual HLS URL? It would be much more efficient to just parse the page when a play request is made (whether you're using a URL Service or not). If the URLs don't change, you could always include code to check whether the stored value exists and, if not, parse the HTML and store the value for the selected channel.
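A minimal sketch of that check-then-parse idea, in plain Python rather than the Plex framework. `STREAM_CACHE`, `get_stream_url` and the `fetch` callable are hypothetical names for illustration; in the channel, `fetch` would be something like the existing `GetChannelVideoStreamURL`.

```python
# Hypothetical cache: channel page URL -> stream URL found on that page.
STREAM_CACHE = {}


def get_stream_url(page_url, fetch):
    """Return the stream URL for page_url, parsing the page only on first use.

    `fetch` is any callable that takes a page URL and returns the stream URL.
    """
    if page_url not in STREAM_CACHE:
        # First request for this channel: do the slow parse and remember it.
        STREAM_CACHE[page_url] = fetch(page_url)
    return STREAM_CACHE[page_url]
```

This only helps if the stored URLs stay valid; if they expire, the cached entry would need to be dropped and fetched again.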

That being said, I use parallelization in the UnsupportedAppstore. See here for an example.
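For illustration only, here is the general shape of the fix in plain Python, using a standard-library thread pool instead of the framework's `@parallelize`/`@task` decorators. The point is that the caller blocks until every worker has returned, so the list is never read while it is still empty. `build_channel_list` and `fetch_stream_url` are hypothetical names, and the `(title, page_url)` pairs stand in for the links found by the XPath query.

```python
from multiprocessing.pool import ThreadPool


def build_channel_list(channels, fetch_stream_url, workers=6):
    """Resolve every channel's stream URL concurrently.

    channels: list of (title, page_url) tuples.
    fetch_stream_url: callable taking a page URL, returning the stream URL.
    Returns a list of [title, stream_url] pairs sorted by title.
    """
    page_urls = [page_url for _, page_url in channels]
    pool = ThreadPool(workers)
    try:
        # map() blocks until all requests finish and preserves input order.
        stream_urls = pool.map(fetch_stream_url, page_urls)
    finally:
        pool.close()
        pool.join()
    return sorted([title, url] for (title, _), url in zip(channels, stream_urls))
```

The worker count of 6 is an arbitrary assumption; a site that throttles logged-in sessions may need fewer.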

I could do it that way, though I don't really know how. Would I have to write a PlayVideo function and mark it as indirect?

I've managed to do it via indirect, and according to the logs the URL is only requested when playback is initiated.

Thanks for pointing me in that direction.

Scratch the above.

The channel now only works when the PMS and client are on the same machine. Every time I request the URL for the video file, it basically asks for login again (I seem to remember running into this problem when I originally tried to use a URL Service when I started).

Is there a way to get round that? I've tried sending the user agent header with the request for the page containing the video, but get the same result.

Is there a cookie or session header which needs to be passed with the requests? That's usually the sort of thing that causes a recurring request for login. Do you have your code on github?

If I make the request for the pages containing the video streams immediately after logging in and getting the channel list page, it works fine (i.e. the original, slower way).


If I delay and make the request when a user wishes to play the video, that's when it tries to log in again.


Code is here:


https://github.com/papalozarou/SportseBooks.bundle/blob/master/Contents/Code/init.py

So it seems the content server uses a time-based session. I would expect the HTTP traffic to include a session header of some type. If you retrieve that with the login, then you should be able to pass the header back with each subsequent request.
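As a rough sketch of passing the session back, assuming the session is carried in a cookie (the thread does not confirm this, and the site may use a different header): the function below turns the `Set-Cookie` values from a login response into a `Cookie` header for later requests. It uses Python 3's standard library rather than the plugin framework's own HTTP helpers, and `cookie_header_from_login` is a hypothetical name.

```python
from http.cookies import SimpleCookie


def cookie_header_from_login(set_cookie_values):
    """Build a Cookie request header from a login response's Set-Cookie values.

    set_cookie_values: list of raw Set-Cookie header strings.
    """
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)
    # Send back only name=value pairs; attributes such as Path or Expires are
    # instructions to the client, not part of the request header.
    return "; ".join("%s=%s" % (name, morsel.value) for name, morsel in jar.items())
```

The resulting string would be sent as the `Cookie` header on the request for the page containing the video.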

Alternatively, it's not unreasonable to force the channel to execute the login logic following a play request, before retrieving the video URL.
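In outline, that could look like the sketch below. It is plain Python with the login and page parsing passed in as callables, so `resolve_stream_on_play`, `login` and `fetch_stream_url` are hypothetical names rather than anything from the channel's code or the framework.

```python
def resolve_stream_on_play(page_url, login, fetch_stream_url):
    """Log in immediately before resolving the stream URL for a play request.

    login: callable that (re)establishes the session; returns True on success.
    fetch_stream_url: callable taking a page URL, returning the stream URL.
    """
    if not login():
        raise RuntimeError("login failed; cannot resolve stream URL")
    # The session is fresh at this point, so the page request should not be
    # bounced back to the login form.
    return fetch_stream_url(page_url)
```

The cost is one extra login request per play, which is small next to fetching every channel page up front.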


Alternatively, you could proceed with your original plan to retrieve all the video URLs up front. You could do it in a background thread to avoid timing out.
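A background fetch might be sketched like this, using the standard `threading` module rather than any framework threading helper. `start_background_prefetch` and the shared `results` dict are hypothetical; the menu code would read whatever entries are already present and fall back to a direct fetch for the rest.

```python
import threading


def start_background_prefetch(page_urls, fetch_stream_url, results):
    """Fill `results` (page URL -> stream URL) on a background thread.

    Returns the started Thread so the caller can join() it if it needs to wait.
    """
    def worker():
        for page_url in page_urls:
            results[page_url] = fetch_stream_url(page_url)

    thread = threading.Thread(target=worker)
    thread.daemon = True  # do not keep the process alive just for prefetching
    thread.start()
    return thread
```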

Thanks for this – I went with the second of the three options you've outlined above and it seems to work now.