How do I use them?
I have a site that I've grabbed all the URLs for pages that contain videos (but there's no meta data on the pages, hence not using a URL service). There's 18 in total (and there could be more or less).
At present, I'm doing the below, which basically does each channel in turn, getting the title, url for the page the channel video is on, then requests that URL and finds the HLS file. These three values are then stored in an array called CHANNEL_LIST.
This is really really slow, and I'd like to do some of the requests in parallel.
################################################################################
# Gets a list of channels to iterate over
################################################################################
def GetChannelList():
# Check to see if CHANNEL_LIST is already populated, if yes return it, if
# no construct it.
if CHANNEL_LIST:
return CHANNEL_LIST
else:
# Construct CHANNEL_LIST_URL and grab HTML
CHANNEL_LIST_URL = URL_BASE + URL_MEMBERS + URL_CHANNELMENU
CHANNEL_LIST_SOURCE = HTML.ElementFromURL(CHANNEL_LIST_URL)
# Find the channel links in the HTML source with xPath
CHANNELS = CHANNEL_LIST_SOURCE.xpath("//p/a")
# Remove the last link from the CHANNELS list (the 'Return
# to desktop version' links)
CHANNELS.pop()
# Add each channel to CHANNEL_LIST
for CHANNEL in CHANNELS:
# Grab the link text and convert from list to string
# N.B. xpath ALWAYS returns a list
CHANNEL_TITLE = "".join(CHANNEL.xpath(".//text()"))
CHANNEL_URL = URL_BASE + URL_MEMBERS + "".join(CHANNEL.xpath(".//@href"))
# Extracts the actual video URL for a channel. We do it inside
# this function so we can store it as part of CHANNEL_LIST and
# only do it once, not every time we hit the main menu
CHANNEL_VIDEO = GetChannelVideoStreamURL(CHANNEL_URL)
# Gets the correct channel thumbnail
CHANNEL_THUMB = GetChannelThumb(CHANNEL_TITLE)
# Appends the channel details to the CHANNEL_LIST
CHANNEL_LIST.append([CHANNEL_TITLE,CHANNEL_VIDEO,CHANNEL_THUMB])
CHANNEL_LIST.sort()
return CHANNEL_LIST
################################################################################
Extracts the actual video URL for a channel
################################################################################
def GetChannelVideoStreamURL(URL):
# Grab the source from the Channel’s URL – done inside here so we
# only do it once, not every time we hit the main menu
CHANNEL_SOURCE = HTML.ElementFromURL(URL)
# Gets the relevant script that has the mediaplayer info in it, by using
# xPath to search for a script containing the string 'mediaplayer'
CHANNEL_SCRIPT = CHANNEL_SOURCE.xpath("//script[contains(., 'mediaplayer')]//text()")[0]
# Grabs the video URL via regex
CHANNEL_VIDEO = re.findall(r'(http:\/\/[\d].*)\'',CHANNEL_SCRIPT)[0]
return CHANNEL_VIDEO</pre>
However, I'm not really sure how to @parallellize/@task this. When i try, CHANNEL_LIST returns empty, as if it's returning before the parallel tasks have finished, or that I'm not getting the information back from @task.
Any help? I've looked at the devour plugin mentioned in this thread but I'm none the wiser.