xpath coding

I've run into similar situations before. Changing the xpath by adding an extra '/' often clears it up, i.e.

from

videoHi = page.xpath("//video/videoHi/text()")[0]

to

videoHi = page.xpath("//video/videoHi//text()")[0]

Essentially it changes the meaning of the expression from "the first text node that is a direct child of a 'videoHi' element whose parent is 'video'" to "the first text node that is any descendant of a 'videoHi' element whose parent is 'video'". (Note that text() matches text nodes, not an element named 'text'.) By not enforcing the "direct child" relationship, the xpath is more flexible. Flexibility can be helpful, but it can also end up giving you unintended results. Experimentation is often required to fine-tune your xpath expressions to make sure you're getting the results you want.
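
To make the difference concrete, here is a minimal sketch using Python's standard-library ElementTree instead of the Plex framework (which is not importable outside a channel). ElementTree's path language has no text(), so el.text is used as a rough stand-in for /text() and itertext() for //text(). The XML snippet and the nested 'url' element are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: the text sits one level deeper than expected.
doc = ET.fromstring(
    "<video><videoHi><url>http://example.com/hi.mp4</url></videoHi></video>"
)
video_hi = doc.find("videoHi")

# Rough analogue of videoHi/text(): only text directly inside videoHi.
direct_text = video_hi.text

# Rough analogue of videoHi//text(): text from videoHi and all descendants.
all_text = list(video_hi.itertext())

print(direct_text)  # nothing directly inside videoHi
print(all_text)     # the nested text is found
```

With a structure like this, indexing the "direct" result with [0] would fail because there is nothing there, while the "descendant" form still finds the value.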

That works.  With the extra /, the channel folder shows up.  And it makes sense that you have to do something to specify that it is not just the child element of the element it follows with a name of "text".

This does bring up a question about the bigger picture of the way Plex Media Server reads and processes your channel code. The fact that the channel does not even show up in Plex with the incorrect coding means Plex is doing something with the channel code at start up versus waiting until you choose or click on the channel icon.

What exactly is Plex doing at startup? Is Plex already running the __init__.py files and therefore those xpath commands even before you choose to open the channel? Or is Plex just running some type of code check of all of the channels to ensure proper syntax?  So for that particular channel, when it sees those lines of code, it expects something at the end of those lines telling it how to return that xpath data like a .text or .get(), and when it doesn't see that, it thinks those lines of code are incomplete and therefore thinks the __init__.py file is improperly coded.

The channel framework doesn't make any http requests or evaluate any of the xpath expressions until the channel is accessed by the user. It does, however, load the code. That step can be derailed by a syntax error in the code. Adding or subtracting a '/' in an xpath expression should not impact that, but having mismatched quotes or brackets certainly will.
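
As a rough illustration of that distinction, the sketch below uses Python's built-in compile() to mimic "loading the code without running it". This is not how the framework itself loads a channel, just an analogy; the loads_cleanly helper and the sample lines are made up.

```python
def loads_cleanly(source):
    """Return True if the source parses as Python, without executing it."""
    try:
        compile(source, "<channel>", "exec")
        return True
    except SyntaxError:
        return False

# A normal line: parses fine.
good = 'thumb = page.xpath("//video/videoHi//text()")[0]\n'

# A malformed xpath is still just a string to Python, so it parses fine;
# it would only fail later, when the expression is actually evaluated.
odd_xpath = 'thumb = page.xpath("//video/videoHi/text(")[0]\n'

# A missing closing bracket is a Python syntax error, so loading fails.
broken = 'thumb = page.xpath("//video/videoHi/text()")[0\n'

print(loads_cleanly(good), loads_cleanly(odd_xpath), loads_cleanly(broken))
```

So a bad xpath string alone should not stop the code from loading, whereas an unbalanced quote or bracket will.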

Good to know mike.  Thanks again.  Actually, thanks to your help, xpath commands, even in html documents, are now the one area of Plex channel coding that seems easiest for me to understand. So thanks again.

I seem to still be having trouble with xpath commands with cdata tags in them.  I am getting a "list index out of range" error.

To make sure, Plex ignores the CDATA as well as all of its surrounding tag code?

For example:

<![CDATA[

12.21.2011

Alejandro Ingelmo
Fashion designer Alejandro Ingelmo is challenged to recycle the Lexus CT Hybrid into one of his fabulous pieces of footwear.

]]>

Plex would ignore "" after the paragraph tag ends?

And so in the above code, I should be able to call the image info in this element with an xml document with the xpath ./description/p/a/img//@src.  If I remove the CDATA code completely as mentioned above, the XML utility web page shows that this xpath is returning  the correct data, but Plex is giving me an error. So I am a bit confused.

EDIT: I even just tried changing the code to ./description/p/a/img with .get(src) at the end, but got the same error

And one more thing.  What should your date format be?  I have a few times gotten errors that my date was not in the right format, but nowhere in the Framework Documentation can I find what that "date format" should be.

The best practice for dates is to use the framework's function to parse the date and return a "properly formatted" date string.

For example:

date_variable = '''some string parsed from the HTML/XML/etc'''
originally_available_at = Datetime.ParseDate(date_variable).date()

If you browse through the available services in the Services.bundle, you will often see it condensed to one line where the "date_variable" isn't explicitly defined. The (usually xpath) argument is passed directly to the ParseDate function.
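
For anyone who wants to try date parsing outside of Plex, here is a small standard-library sketch (Python 3). It is a stand-in, not the framework's Datetime.ParseDate; the two helper names are made up, and the month.day.year reading of strings like "12.21.2011" is an assumption based on the sample data in this thread.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

def parse_dotted_date(text):
    """Parse a month.day.year string such as '12.21.2011'."""
    return datetime.strptime(text.strip(), "%m.%d.%Y").date()

def parse_rss_date(text):
    """Parse an RFC 822 style date, the usual format of an RSS pubDate."""
    return parsedate_to_datetime(text).date()

first = parse_dotted_date("12.21.2011")
second = parse_rss_date("Wed, 13 Apr 2011 10:00:00 GMT")
print(first.isoformat(), second.isoformat())
```

The point is the same as with the framework function: hand the raw string to a parser and work with the resulting date object, rather than trying to guess the "right" string format.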

You have to do this in 2 steps:

  1. Get the value of the description element:

    xml = XML.ElementFromURL()
    # Some more code here to loop over all the items
    # Get the description string
    description = xml.xpath('//description')

  2. Then treat the description string as a piece of HTML and use another xpath query to get the data you want:

    html = HTML.ElementFromString(description)
    image = html.xpath('//img/@src')[0]

Thanks Mike, I will try the date parser.

Thanks Sander1, that helps a lot.  I had seen pages where too much info was contained between tags like paragraph to separate it and I have also seen where people used HTML.ElementFromString() to pull a particular string that they had extracted through xpath to manipulate that data further, but I never put the two concepts together.

Definitely something good to play with.

OK I finally got a channel to work. YEAH!!!

But I cannot get into the description element to get the icon and description for each episode using the two-step approach you suggested earlier.

I tried the code you suggested and I still get errors.  I even pulled all the data that is available in one of the description fields, created an html document with html and body tags around it, and used xpath checker to see what the xpath commands would pull from the data in the description line.  First, the text is on nine different lines, so I am not sure how I can pull out the description.  Secondly, and more importantly, the command you gave for pulling the image address showed as the proper format in xpath checker, but when I put that line in my code, I get an error: "IndexError: list index out of range."

An example of one of the RSS feed pages I am trying to pull up is:

http://www.lstudio.com/rss/lstudiorss-3-minute-talk-show-with-barry-sobel.xml

The data in one of the description fields is:

<![CDATA[

04.13.2011

Barry Sobel, Fred Willard, Stephen Moyer
Featuring host Barry Sobel and co-host Fred Willard, actor Stephen Moyer of ''True Blood,'' and a musical performance by Geoffro.

]]>

So I included the code you showed above, except that since these are coming from an item parent in an xml document, I used ./description instead of //description.  Here is my code:

   

    xml = XML.ElementFromURL(SHOW_RSS)
    for video in xml.xpath('//item'):
        epUrl = video.xpath('./link//text()')[0]
        epTitle = video.xpath('./title//text()')[0]
        epDate = video.xpath('./pubDate//text()')[0]
        epDate = Datetime.ParseDate(epDate)
        description = video.xpath('./description')[0]
        html = HTML.ElementFromString(description)
        #epSummary = html.xpath('//text()')[0]
        epThumb = html.xpath('//img//@src')[0]

I get the error on this last line, where I am trying to pull the thumb.  (I just stopped the summary line from running for now with the comment code, since it returns nine lines of data and I need line 8.)


 

And BTW, I have also tried leaving the "[0]" off the end of the description line.  The difference is that without the zero, you are pulling the paragraph element into the value of html vs. the element element with the zero on the end of the first line.  And since I am not exactly sure what the "element" element is and I do know what is contained within the paragraph elements of the code, I changed that line to:

description = video.xpath('./description')

And I have tried every variation possible for pulling the image address through xpath including:

epThumb = html.xpath('//img//@src')[0]
epThumb = html.xpath('//img/@src')[0]
epThumb = html.xpath('//img')[0].get(src)
epThumb = html.xpath('//a/img//@src')[0]
epThumb = html.xpath('//a/img/@src')[0]
AND
epThumb = html.xpath('//a/img')[0].get(src)

But I still get the same error

I would start by Log()-ing what you get from your description to make sure you're getting what you want. I think that you likely need to add '.text' to your xpath statement to grab the contents of the element rather than just the xpath element, e.g.

description = video.xpath('./description')[0].text
Log(description)

If you get your HTML showing up in the log, then your HTML.ElementFromString() should allow you to carry on in the right direction.
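
On the "list index out of range" error itself: it is what Python raises when [0] is applied to an empty list, which is what a query returns when nothing matches. A small guard like the one below can make that case easier to spot while debugging. The first_or_default helper is made up for illustration and is not part of the framework.

```python
def first_or_default(results, default=None):
    """Return the first item of a result list, or a default if it is empty."""
    return results[0] if results else default

no_matches = []  # what a query gives back when nothing matches
one_match = ["http://example.com/thumb.jpg"]  # hypothetical result

missing = first_or_default(no_matches)
found = first_or_default(one_match)
print(missing, found)
```

Logging the value returned by the helper (or the raw list) shows right away whether the query matched anything, instead of the code stopping on the IndexError.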

My example code is flawed, it's missing the /text() in the xpath. It should be:

description = video.xpath('./description/text()')[0]
html = HTML.ElementFromString(description)
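
For reference, the same two steps can be sketched outside Plex with the Python 3 standard library. This is only an illustration under assumptions: the feed snippet, the example.com URLs and the ImgSrcCollector helper are invented, ElementTree stands in for XML.ElementFromURL, and html.parser stands in for HTML.ElementFromString.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical feed item whose description is HTML wrapped in CDATA.
FEED = """<rss><channel><item>
<title>Example episode</title>
<description><![CDATA[<p>04.13.2011<br><a href="http://example.com/ep"><img src="http://example.com/thumb.jpg"></a><br>Example summary.</p>]]></description>
</item></channel></rss>"""

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

item = ET.fromstring(FEED).find("./channel/item")

# Step 1: get the text of the description, not the element itself.
# The CDATA wrapper is gone at this point; only the HTML string remains.
description = item.findtext("description")

# Step 2: treat that string as HTML and pull out the image address.
collector = ImgSrcCollector()
collector.feed(description)
thumbs = collector.sources
print(thumbs)
```

The key point is step 1: the second parser needs the string contents of the description, which is why the missing text() made the original version fail.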

I had a log statement in there and it was just returning an element and not the data in the element, so what it needed was the text() at the end to pull out the string data. I knew it didn't look right that it showed as an element and not data.  Just too new to know.  Thanks again guys.

url     = 'http://www.lstudio.com/rss/lstudiorss-web-therapy.xml'
channel = RSS.FeedFromURL(url)

for item in channel.entries:
    # regular RSS stuff
    title    = item.title
    category = item.category
    pubDate  = Datetime.ParseDate(item.date)
    link     = item.link

    # The description is HTML with a p containing:
    # - the pubDate
    # - the permalink, with thumbnail
    # - the actual description
    _desc     = item.description
    _html     = HTML.ElementFromString(_desc)
    _els      = list(_html)
    thumbnail = _html.cssselect('img')[0].get('src')
    description = []
    for el in _els:
        if el.tail: description.append(el.tail)

    description = '. '.join(description)

    video = VideoClipObject(...)

As Mike is correct to point out, the XML you're trying to parse is more accurately an RSS feed. There is specific code built in to the Channel Framework for parsing RSS feeds. It's based on Python's FeedParser which you can read more about here.

Thanks Mikewhy, that looks like good information to know.  I will have to look at it a little while to make sure I understand all that is going on here. I will throw in some debug codes and look at it a little closer.

I found the mention of the RSS feed parsing in the Plex Framework Documentation and I am taking for granted that the two broken links there are what Mikedm139 gave me an updated link to.

Looking at the code, most of it looks pretty straightforward, but there are a few I will need to look up and understand better, like the commands with cssselect and tail.

cssselect is an alternative to xpath that I find cleaner and easier on the brain. plus, you can test them in your browser’s javascript console.

example:

html = HTML.ElementFromString('''
...
...
...
''')

videos = html.cssselect('#content .video-with-thumbnail')

vs. html.xpath('//div[@id="content"]/div[@class="video-with-thumbnail"]')

the tail bit is strange, normally you would use el.text but the '<br>' tags mean you need el.tail. you could scrap the whole for el in _els loop and simply use 'description = _html.text_content()', but it doesn't look as nice.
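
Here is a small sketch of what .text and .tail hold, using standard-library ElementTree, whose element API lxml follows. The snippet is invented and only loosely modelled on the feed's description; itertext() is used as a rough analogue of lxml's text_content().

```python
import xml.etree.ElementTree as ET

# Hypothetical description markup: text separated by empty <br/> tags.
SNIPPET = "<p><b>04.13.2011</b><br/>Guest names<br/>Example summary.</p>"
p = ET.fromstring(SNIPPET)

# .text is the text inside an element; an empty <br/> has none.
texts = [el.text for el in p]

# .tail is the text that follows an element's closing tag,
# which is where the text after each <br/> ends up.
tails = [el.tail for el in p if el.tail]

description = ". ".join(tails)

# Everything at once, similar in spirit to text_content().
everything = "".join(p.itertext())

print(texts)
print(tails)
print(description)
print(everything)
```

This is why looping over the children and collecting el.tail picks up the pieces of text that sit between the line-break tags.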

Thanks Mikewhy, I used the code you suggested and it worked great.  I was able to figure out what the .tail was doing with some debug commands in my code.

I had found a little info on cssselect in some lxml documentation, but I will definitely play around with the cssselect some more now and get a better grasp on it based on your explanation above.

Thanks again.

I have come across an xpath question:

Is there a way to add a variable to your xpath command? For example, I want to pull all occurrences of a list item with a particular id related to a particular show. I do not know the id in advance, so I cannot hard-code it, but I will have a variable that holds its value.  How would I write, for example: //div/ul/li[@id="variable"], where variable is a variable in my Python code?
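
One common way to do this is to build the xpath string with ordinary Python string formatting before running the query. The sketch below uses standard-library ElementTree rather than the framework's own helpers; the page snippet, the id values and the show_id name are made up, and the assumption (not tested in a channel) is that the same formatted string can be passed to the framework's .xpath() calls.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment with list items tagged by show id.
page = ET.fromstring(
    "<html><body><div><ul>"
    '<li id="show-1">Episode A</li>'
    '<li id="show-2">Episode B</li>'
    '<li id="show-1">Episode C</li>'
    "</ul></div></body></html>"
)

show_id = "show-1"  # value only known at run time

# Substitute the variable into the expression as a quoted string.
query = ".//div/ul/li[@id='%s']" % show_id

matches = [li.text for li in page.findall(query)]
print(query)
print(matches)
```

Note that this simple substitution would break if the variable itself contained a quote character, so it is best suited to ids you know are plain text.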