Not finding src attribute in embedded iframe

Hey all,

Attempting to create my first bundle for Naruto Shippuden over at NarutoGet.com; this is the first time I've tried my hand at programming anything web-related, and I'm running into an issue with creating the VideoObject / VideoMenu. When parsing the page that has the episode embedded in it, my HTML request can't seem to find the src attribute that resides in the iframe:

<td id="embedcode" style="padding: 0px;">
  <script> … </script>
  <iframe width="670" scrolling="no" height="400" frameborder="0" src="http://online.narutoget.com/s/googplayer.php?skintype=nget&to=1002kEhprnVM&autostart=false&id=112965806382805543465/Nshop3ss41Hd#5953866251562821890">
    #document
  </iframe>
</td>

I've tried a couple of variants with XPath, but every combination of //td[@id="embedcode"] and //iframe[@width="670"] gives me an empty list.

Here's the def:

@route('/video/naruto/videos')
def VideoMenu(videoURL, title):

    oc = ObjectContainer(title2 = title)
    data = HTML.ElementFromURL(videoURL)

    for video in data.xpath('//div[@id="side-a"]'):
        try:
            title = video.xpath('//td[@class="style100"]//text()')[1]
            Log(title)
            sources = video.xpath('//td[@id="embedcode"]//iframe')
            Log(sources)
        except:
            continue

        oc.add(CreateVideoClipObject(
            title = title,
            string = sources
            ))

    return oc

Any ideas?

Thanks!

It's very possible that those iframes don't get added or populated until after some JavaScript runs, and the framework doesn't run any JavaScript when it loads URLs (i.e. it doesn't behave like a browser). In fact, you may not even have a td with the id of embedcode at all; it depends on how the site works.

To see exactly what the framework gets, try using the command line utility **curl** to load the page, i.e.:

curl http://www.somedomain.com/videopage/mypage.html

(or whatever URL you're trying to scrape). What you get back there is exactly what the framework will have to work with.

Hope this helps.

Most likely, the iframe is loaded via JavaScript after the page loads and is therefore not included in the actual page source. Try requesting the HTML page using curl in the Terminal (on OS X) to view the same code that the plugin will be attempting to parse. Another option is to have the plugin log the page source for you:

source = HTTP.Request(page_url).content
Log(source)

Ha! Gerk beat me to the punch :slight_smile:

LOL yep, or log it like Mike says above ^^^  :D

Yeah, I checked the logs last night and that would make sense: the iframe in question was empty. It looks like the script field might have the necessary code in it, with this nasty call (you can inspect the site yourself if it helps):



That being said, how would I go about retrieving the src URL that is formed after the JS makes the call? I assume the Plex framework can handle this in some way?

A second question, now that I've been looking into it: it seems that, given only a URL, I would probably need to create a Service to route the URL through. Can I instead, for this use case, just create the Media and VideoClip objects directly?

Thanks again.

EDIT:

After doing some manual scraping of the site, I found that the url:

video.ak.directvid.com/googDev.php?url=/112965806382805543465/Nshoput will be the associated link to the given episode number, s.t. = the actual episode number.

It's an ugly hack for the time being, but I've got the episodes streaming to all of my devices. I also noticed that PartObject will take the URL for the key attribute, and any unique string will do for the rating_key (I used the episode number), so no Service was required.

If anyone knows of a clean and Pythonic way to retrieve the URL from that nasty js script, please let me know :)

That's pretty much how you have to do things: anything you can use to find the episode ID, or whatever else you need to be able to play it, is how it's done. There's no way (or at least none that I know of) for the framework to delve into the JS-built side of things, so the way you've done it is probably the "right" way.

And yes, with the url and rating_key you've done it correctly (for doing it without a server).  I typically would use the actual video URL as the rating_key, but as long as it's unique you should be fine.

That "nasty JS" has all the info you need; it's just percent-encoded. Using a regex you can grab everything contained within the unescape("...") call, then turn it into legible HTML by passing it through String.Unquote() - http://dev.plexapp.com/docs/api/utilkit.html?highlight=string.unquote#String.Unquote
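Under the hood this is plain percent-decoding. Outside the plugin framework, Python's standard library does the same job (the snippet below uses a made-up encoded fragment purely for illustration; note that plugins of this era ran Python 2, where the equivalent call is urllib.unquote):

```python
from urllib.parse import unquote

# a percent-encoded fragment like the one the site's escape() call produces
encoded = "%3Ciframe%20src%3D%22http%3A//example.com/player%22%3E%3C/iframe%3E"

# decode it back into legible HTML, the same way String.Unquote() does
html = unquote(encoded)
print(html)  # <iframe src="http://example.com/player"></iframe>
```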

Was worried that was going to be the answer; I've always stayed away from regex while I could… Time to learn some shorthand.


Thanks for all the help.

The plugin framework has built-in regex handling, so you don't need to "import re". It's also recommended that you define your regex pattern at the top of your code and just reference it later using .search() or .match(), etc. Something like this:

RE_JS_MESS = Regex('unescape\("(%.+)"\);')
...
'''plugin code'''
...
# grab the page source as a string
source = HTTP.Request(url).content
# parse it for the %-encoded JS/HTML
iframe_source = RE_JS_MESS.search(source).group(1)
# convert it to legible HTML
iframe_html = String.Unquote(iframe_source)
# parse it as html
iframe = HTML.ElementFromString(iframe_html)
# grab the url
video_url = iframe.get('src')

I would write the regex as:

RE_JS_MESS = Regex("innerHTML=unescape\('(.+?)'\)")

The site uses single quotes, and by making the pattern a little more specific you can be sure you're grabbing the right unescape call.
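For anyone testing this outside the framework, the same pattern works with plain re plus stdlib unquote. The page source below is an invented stand-in mimicking the site's structure (the element id and URL are made up):

```python
import re
from urllib.parse import unquote

# invented stand-in for the page source; the real site writes the iframe via JS
source = ("document.getElementById('embedcode').innerHTML="
          "unescape('%3Ciframe%20src%3D%22http%3A//example.com/play%22%3E%3C/iframe%3E');")

# single quotes, non-greedy capture of the percent-encoded blob
match = re.search(r"innerHTML=unescape\('(.+?)'\)", source)

# decode the captured group into legible HTML
iframe_html = unquote(match.group(1))
print(iframe_html)  # <iframe src="http://example.com/play"></iframe>
```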

Nice; love examples. That did the trick.

With the functionality in place, I've gone back to adding more metadata and am fighting with the DirectoryObjects that make up MainMenu(). Here it lists all of the episodes, each with a callback to VideoMenu:

@handler("/video/naruto", "Naruto", allow_sync=True)
def MainMenu():

    oc = ObjectContainer(title1 = "Episodes")

    # Scraping metadata from CrunchyRoll
    img = []
    metaData = HTML.ElementFromURL('http://www.crunchyroll.com/naruto-shippuden')
    for entry in metaData.xpath('//ul[@class="portrait-grid cf"]//li//div[@class="wrapper container-shadow hover-classes"]//a//img[contains(@src, "spire")]//@src'):
        img.append(entry)
    desc = metaData.xpath('//li[@class="large-margin-bottom"]//p[@class="description"]//span[@class="more"]//text()')[0]
    Log(desc)

    # Episode info
    data = HTML.ElementFromURL('http://www.narutoget.com/')

    count = 0
    for entry in data.xpath('//table[@width="400"]//table[@width="340"]//td[@valign="top"]'):
        try:
            title = entry.xpath('./a[@class="movie"]//text()')[0]
            epNumber = title.split(" ")[-1]
            name = entry.xpath('./text()')[2]
            videoURL = entry.xpath('./a[@class="movie"]//@href')[0]
        except:
            continue

        if count < (len(img) - 1):
            oc.add(DirectoryObject(
                key = Callback(VideoMenu, videoURL=videoURL, episode=epNumber, title=title, thumb=img[count]),
                title = title + " - " + name,
                summary = desc,
                thumb = Resource.ContentsOfURLWithFallback(img[count]),
                ))
            count += 1
        else:
            oc.add(DirectoryObject(
                key = Callback(VideoMenu, videoURL=videoURL, episode=epNumber, title=title, thumb=img[len(img)-1]),
                title = title + " - " + name,
                summary = desc,
                thumb = Resource.ContentsOfURLWithFallback(img[len(img)-1]),
                ))

    return oc

I can't seem to get these DirectoryObjects to populate their summary or thumb properties. I noticed the episode images I scrape from CrunchyRoll (don't mind the non-Pythonic count vars; they aren't scraping every episode yet for some reason) percolate down to the respective VideoClipObject, but that's it. I took a look at another plug-in that was able to do this (Escapist) and the format looks the same. One thing I noticed when inspecting the elements in Plex/Web was that each DirectoryObject in the Escapist plug-in has div classes for the respective properties (below), whereas mine only has one for the media title:

<div>
    <h3>Zero Punctuation</h3>
    <div class="summary">Zero Punctuation is The Escapist's groundbreaking video review series starring Ben "Yahtzee" Croshaw. Every Wednesday Zero Punctuation picks apart the games so you don't have to...</div>
</div>

Anyone know what would cause this irregularity?

Was worried that was going to be the answer; always stayed away from regex while I could... Time to learn some shorthand.

Thanks for all the help.

I am very wary of regex as well, but I find that as I go along I keep learning a new trick here and there that makes it a little easier each time. Sander just gave you the first and biggest trick in his example above: all you have to remember is that (.+?) represents the data you want to extract from a string. The rest is just what comes before and after the data you want to extract.

The second trick he shows is the use of a backslash. There are certain characters regex reserves for other uses, but if you put a backslash in front of one, the regex will just treat it as a literal character in your string.

The best way to figure it out is to use a regex tester (I use this one - http://www.pythonregex.com/). Then you can see exactly what will be returned by your regex pull.

I think I remember hearing something about the Web client having issues with summaries, but one of the other guys would need to chime in to tell you for sure whether that could be an issue.

One thing I see with your XPath is that you are using a lot of double slashes, and that may not be necessary, though I am not sure if they would cause any issues; it depends on your code. When to use single or double slashes was one thing about XPath that took me a while to figure out.

In XPath, a double slash between two tags means the tag that follows might appear anywhere within the parent tag before it (it could be a child of a child of a child...). A single slash between two tags means the tag that follows must be a direct child of the parent tag right before it.
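The child-versus-descendant distinction is easy to demonstrate with the standard library's ElementTree, whose path syntax supports this part of XPath (the table markup below is just a toy example):

```python
import xml.etree.ElementTree as ET

# a tiny tree: <table> has one direct <td> child and another <td> nested inside a <tr>
root = ET.fromstring("<table><td>direct</td><tr><td>nested</td></tr></table>")

# '//' (written './/' here) matches <td> at any depth below the current node...
print(len(root.findall(".//td")))  # 2

# ...while '/' (written './' here) matches direct children only
print(len(root.findall("./td")))   # 1
```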

Also, again I am not sure if it makes a difference, but when using Fallback on images, I have always seen it used with "url=" and then the value of the image like this:

Resource.ContentsOfURLWithFallback(url=value_of_image)

I would suggest that, rather than inspecting the element source via Plex/Web, you check the XML generated by your plugin directly by querying PMS with curl or your web browser:

http://localhost:32400/video/naruto

You can then walk the XML tree by using the "key" attribute of the object you wish to test as the URL suffix.

As shopgirl mentions, Plex/Web has some interesting behaviours with regard to thumbs and summaries. In my experience, if the objects in a directory share a common thumb, Plex/Web collapses the list and does not show any thumbs or summaries for the items in it. I find that viewing the XML directly, as above, or testing with another client can be very helpful in determining whether your code is returning the expected results.

That looks to be the case; running the Plex Android app renders all the thumb images clearly, as does Skifta. I'll look into curl and the other methods you mentioned.


Thanks again for all the help, and quick responses.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.